Deleting Duplicate Lines From a File

If you have a file in which all lines are sorted (alphabetically or otherwise), you can easily delete consecutive duplicate lines. Simply open the file in your favorite text editor, and do a search-and-replace, searching for ^(.*)(\r?\n\1)+$ and replacing with \1. For this to work, the anchors need to match before and after line breaks (and not just at the start and the end of the file or string), and the dot must not match newlines.
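Outside a text editor, the same search-and-replace can be sketched in a few lines of Python. This is only an illustration, assuming Python's standard re module: re.MULTILINE makes the anchors match at line breaks, and the dot excludes newlines as long as re.DOTALL is not set. The name delete_duplicate_lines is made up for this example.

```python
import re

# ^ and $ only match at line breaks when MULTILINE is set. DOTALL is
# deliberately left off so the dot stops at the end of each line.
DUPLICATE_LINES = re.compile(r'^(.*)(\r?\n\1)+$', re.MULTILINE)


def delete_duplicate_lines(text):
    """Collapse each run of identical consecutive lines into one line."""
    return DUPLICATE_LINES.sub(r'\1', text)


if __name__ == '__main__':
    sample = 'apple\napple\nbanana\ncherry\ncherry\ncherry\n'
    print(delete_duplicate_lines(sample), end='')
```

Only adjacent duplicates are removed, which is why the lines need to be sorted first.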

Here is how this works. The caret will match only at the start of a line. So the regex engine will only attempt to match the remainder of the regex there. The dot and star combination simply matches an entire line, whatever its contents, if any. The parentheses capture the matched line into the first capturing group, which the backreference \1 refers to.

Next we will match the line separator. I put the question mark into \r?\n to make this regex work with both Windows (\r\n) and UNIX (\n) text files. So up to this point we matched a line and the following line break.

Now we need to check if this combination is followed by a duplicate of that same line. We do this simply with \1. This is the first backreference which holds the line we matched. The backreference will match that very same text.

If the backreference fails to match, the regex match and the backreference are discarded, and the regex engine tries again at the start of the next line. If the backreference succeeds, the plus symbol in the regular expression will try to match additional copies of the line. Finally, the dollar symbol forces the regex engine to check if the text matched by the backreference is a complete line. We already know the text matched by the backreference is preceded by a line break (matched by \r?\n). Therefore, we now check if it is also followed by a line break or if it is at the end of the file using the dollar sign.
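To see why the dollar sign matters, compare the regex with and without it on a file where the second line merely begins with the text of the first. The snippet below is a small sketch assuming Python's re module; the variable names and the sample text are invented for the illustration.

```python
import re

WITH_DOLLAR = re.compile(r'^(.*)(\r?\n\1)+$', re.MULTILINE)
WITHOUT_DOLLAR = re.compile(r'^(.*)(\r?\n\1)+', re.MULTILINE)

# The second line only starts with the text of the first line,
# so it is not a duplicate.
text = 'abc\nabcdef\n'

# With the dollar sign, the backreference must end at a line break,
# so nothing matches and the text is left alone.
safe = WITH_DOLLAR.sub(r'\1', text)

# Without it, 'abc\nabc' matches and is replaced by 'abc',
# which merges the two lines and loses data.
unsafe = WITHOUT_DOLLAR.sub(r'\1', text)

print(repr(safe))
print(repr(unsafe))
```

Without the final anchor, the first line is swallowed into the second, so the regex would silently damage the file.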

The entire match becomes line\nline (or line\nline\nline etc.). Because we are doing a search and replace, the line, its duplicates, and the line breaks in between them, are all deleted from the file. Since we want to keep the original line, but not the duplicates, we use \1 as the replacement text to put the original line back in.
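The split between the overall match and the first backreference can be inspected directly. The following sketch again assumes Python's re module, with a made-up sample string; it shows that the match spans the line and all of its duplicates, while the first group holds only the single line that the replacement puts back.

```python
import re

pattern = re.compile(r'^(.*)(\r?\n\1)+$', re.MULTILINE)
match = pattern.search('first\nline\nline\nline\nlast\n')

# The overall match: the line, its duplicates, and the line breaks
# in between them. All of this is removed by the replacement.
whole_match = match.group(0)

# The first backreference: the single line that \1 puts back in.
kept_line = match.group(1)

print(repr(whole_match))
print(repr(kept_line))
```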

Removing Duplicate Items From a String

We can generalize the above example to afterseparator(item)(separator\1)+beforeseparator, where afterseparator and beforeseparator are zero-width assertions. So if you want to remove consecutive duplicates from a comma-delimited list, you could use (?<=,|^)([^,]*)(,\1)+(?=,|$).

The positive lookbehind (?<=,|^) forces the regex engine to start matching at the start of the string or after a comma. ([^,]*) captures the item. (,\1)+ matches subsequent duplicate items. Finally, the positive lookahead (?=,|$) checks if the duplicate items are complete items by checking for a comma or the end of the string.
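Not every regex flavor accepts alternatives of different lengths inside lookbehind. Python's standard re module is one that rejects (?<=,|^) as written, so the sketch below moves the start-of-string test outside the lookbehind, giving (?:(?<=,)|^), which should behave the same way. The function name delete_duplicate_items is hypothetical.

```python
import re

# Python's re module requires fixed-width lookbehind, and "," and "^"
# differ in width. Writing (?:(?<=,)|^) keeps the same meaning:
# match after a comma or at the start of the string.
DUPLICATE_ITEMS = re.compile(r'(?:(?<=,)|^)([^,]*)(,\1)+(?=,|$)')


def delete_duplicate_items(items):
    """Collapse runs of identical consecutive items in a comma-delimited list."""
    return DUPLICATE_ITEMS.sub(r'\1', items)


if __name__ == '__main__':
    print(delete_duplicate_items('a,a,b,b,b,c,a'))
```

The lookbehind and lookahead are what keep partial items safe: in xa,a,b the second item is not merged into the first, and in ab,abc nothing is removed.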

 
