Atomic Grouping and Possessive Quantifiers When discussing the repetition operators or quantifiers, I explained the difference between greedy and lazy repetition. Greediness and laziness determine the order in which the regex engine tries the possible permutations of the regex pattern. A greedy quantifier will first try to repeat the token as many times as possible, and gradually give up matches as the engine backtracks to find an overall match. A lazy quantifier will first repeat the token as few times as required, and gradually expand the match as the engine backtracks through the regex to find an overall match. Because greediness and laziness change the order in which permutations are tried, they can change the overall regex match. However, they do not change the fact that the regex engine will backtrack to try all possible permutations of the regular expression in case no match can be found. First, let's see why backtracking can lead to problems. Catastrophic Backtracking Recently I got a complaint from a customer that EditPad Pro hung (i.e. it stopped responding) when trying to find lines in a comma-delimited text file where the 12th item on a line started with a P. The customer was using the regexp ^(.*?,){11}P . At first sight, this regex looks like it should do the job just fine. The lazy dot and comma match a single comma-delimited field, and the {11} skips the first 11 fields. Finally, the P checks if the 12th field indeed starts with P. In fact, this is exactly what will happen when the 12th field indeed starts with a P. The problem rears its ugly head when the 12th field does not start with a P. Let's say the string is 1,2,3,4,5,6,7,8,9,10,11,12,13. At that point, the regex engine will backtrack. It will backtrack to the point where ^(.*?,){11} had consumed 1,2,3,4,5,6,7,8,9,10,11, giving up the last match of the comma. The next token is again the dot. The dot matches a comma. The dot matches the comma! However, the comma does not match the 1 in the 12th field, so the dot continues until the 11th iteration of .*?, has consumed 11,12,. You can already see the root of the problem: the part of the regex (the dot) matching the contents of the field also matches the delimiter (the comma). Because of the double repetition (star inside {11}), this leads to a catastrophic amount of backtracking. The regex engine now checks whether the 13th field starts with a P. It does not. Since there is no comma after the 13th field, the regex engine can no longer match the 11th iteration of .*?,. But it does not give up there. It backtracks to the 10th iteration, expanding the match of the 10th iteration to 10,11,. Since there is still no P, the 10th iteration is expanded to 10,11,12,. Reaching the end of the string again, the same story starts with the 9th iteration, subsequently expanding it to 9,10,, 9,10,11,, 9,10,11,12,. But between each expansion, there are more possiblities to be tried. When the 9th iteration consumes 9,10,, the 10th could match just 11, as well as 11,12,. Continuously failing, the engine backtracks to the 8th iteration, again trying all possible combinations for the 9th, 10th, and 11th iterations. You get the idea: the possible number of combinations that the regex engine will try for each line where the 12th field does not start with a P is huge. This causes software like EditPad Pro to stop responding longer than your patience lasts. Other applications may even crash as the regex engine runs out of memory trying to remember all backtracking positions. If you try this example with RegexBuddy's debugger, you will see that it needs 48,096 steps to conclude there regex cannot match. If the string is 1,2,3,4,5,6,7,8,9,10,11,12,13,14, just 3 characters more, the number of steps jumps to 96,857. It's not too hard to imagine that at this kind of exponential rate, attempting this regex on a large file with long lines could easily take forever. RegexBuddy's debugger will abort the attempt after 100,000 steps, to prevent it from running out of memory. Preventing Catastrophic Backtracking The solution is simple. When nesting repetition operators, make absolutely sure that there is only one way to match the same match. If repeating the inner loop 4 times and the outer loop 7 times results in the same overall match as repeating the inner loop 6 times and the outer loop 2 times, you can be sure that the regex engine will try all those combinations. In our example, the solution is to be more exact about what we want to match. We want to match 11 comma-delimited fields. The fields must not contain comma's. So the regex becomes: ^([^,\r\n]*,){11}P . If the P cannot be found, the engine will still backtrack. But it will backtrack only 11 times, and each time the [^,\r\n] is not able to expand beyond the comma, forcing the regex engine to the previous one of the 11 iterations immediately, without trying further options. Atomic Grouping and Possessive Quantifiers Recent regex flavors have introduced two additional solutions to this problem: atomic grouping and possessive quantifiers. Their purpose is to prevent backtracking, allowing the regex engine to fail faster. In the above example, we could easily reduce the amount of backtracking to a very low level by better specifying what we wanted. But that is not always possible in such a straightforward manner. In that case, you should use atomic grouping to prevent the regex engine from backtracking. Using atomic grouping, the above regex becomes ^(?>(.*?,){11})P. Everything between (?>) is treated as one single token by the regex engine, once the regex engine leaves the group. Because the entire group is one token, no backtracking can take place once the regex engine has found a match for the group. If backtracking is required, the engine has to backtrack to the regex token before the group (the caret in our example). If there is no token before the group, the regex must retry the entire regex at the next position in the string. Possessive quantifiers are a limited form of atomic grouping with a cleaner notation. To make a quantifier possessive, place a plus after it. x++ is the same as (?>x+). Similarly, you can use x*+, x?+ and x{m,n}+. Note that you cannot make a lazy quantifier possessive. It would match the minimum number of matches and never expand the match because backtracking is not allowed. Tool and Language Support for Atomic Grouping and Possessive Quantifiers Atomic grouping is a recent addition to the regex scene, and only supported by the latest versions of most regex flavors. Perl supports it starting with version 5.6. The Java supports it starting with JDK version 1.4.2, though the JDK documentation uses the term "independent group" rather than "atomic group". All versions of .NET support atomic grouping, as do recent versions of PCRE, PHP's pgreg functions and Ruby. Python does not support atomic grouping. At this time, possessive quantifiers are only supported by the Java JDK 1.4.0 and later, and PCRE version 4 and later. The latest versions of EditPad Pro and PowerGREP support both atomic grouping and possessive quantifiers, as do all versions of RegexBuddy. Atomic Grouping Inside The Regex Engine Let's see how ^(?>(.*?,){11})P is applied to 1,2,3,4,5,6,7,8,9,10,11,12,13. The caret matches at the start of the string and the engine enters the atomic group. The star is lazy, so the dot is initially skipped. But the comma does not match 1, so the engine backtracks to the dot. That's right: backtracking is allowed here. The star is not possessive, and is not immediately enclosed by an atomic group. That is, the regex engine did not cross the closing round bracket of the atomic group. The dot matches 1, and the comma matches too. {11} causes further repetition until the atomic group has matched 1,2,3,4,5,6,7,8,9,10,11,. Now, the engine leaves the atomic group. Because the group is atomic, all backtracking information is discarded and the group is now considered a single token. The engine now tries to match P to the 1 in the 12th field. This fails. So far, everything happened just like in the original, troublesome regular expression. Now comes the difference. P failed to match, so the engine backtracks. The previous token is an atomic group, so the group's entire match is discarded and the engine backtracks further to the caret. The engine now tries to match the caret at the next position in the string, which fails. The engine walks through the string until the end, and declares failure. Failure is declared after 30 attempts to match the caret, and just one attempt to match the atomic group, rather than after 30 attempts to match the caret and a huge number of attempts to try all combinations of both quantifiers in the regex. That is what atomic grouping and possessive quantifiers are for: efficiency by disallowing backtracking. The most efficient regex for our problem at hand would be ^(?>((?>[^,\r\n]*),){11})P , since possessive, greedy repetition of the star is faster than a backtracking lazy dot. If possessive quantifiers are available, you can reduce clutter by writing ^(?>([^,\r\n]*+,){11})P . When To Use Atomic Grouping or Possessive Quantifiers Atomic grouping and possessive quantifiers speed up failure by eliminating backtracking. They do not speed up success, only failure. When nesting quantifiers like in the above example, you really should use atomic grouping and/or possessive quantifiers whenever possible. While x[^x]*+x and x(?>[^x]*)x fail faster than x[^x]*x, the increase in speed is minimal. If the final x in the regex cannot be matched, the regex engine backtracks once for each character matched by the star. With simple repetition, the amount of time wasted with pointless backtracking increases in a linear fashion to the length of the string. With combined repetition, the amount of time wasted increases exponentially and will very quickly exhaust the capabilities of your computer. Still, if you are smart about combined repetition, you often can avoid the problem without atomic grouping as in the example above. If you are simply doing a search in a text editor, using simple repetition, you will not earn back the extra time to type in the characters for the atomic grouping. If the regex will be used in a tight loop in an application, or process huge amounts of data, then atomic grouping may make a difference. Note that atomic grouping and possessive quantifiers can alter the outcome of the regular expression match. \d+6 will match 123456 in 123456789. \d++6 will not match at all. \d+ will match the entire string. With the former regex, the engine backtracks until the 6 can be matched. In the latter case, no backtracking is allowed, and the match fails. Again, the cause of this is that the token \d that is repeated can also match the delimiter 6. Sometimes this is desirable, often it is not. This shows again that understanding how the regex engine works on the inside will enable you to avoid many pitfalls and craft efficient regular expressions that match exactly what you want.
|