Lookie Here, Lookie There
Using Regular Expression Lookarounds
Issue: 12.4 (July/August 2014)
Author: Kem Tekinay
Author Bio: Kem Tekinay is a Macintosh consultant and programmer who started with Xojo when it was still REALbasic. He is the author of RegExRX (http://www.mactechnologies.com/index.php:i?page=downloads#regexrx), the popular regular expression editor for Mac and Windows.
Article Length (in bytes): 10,377
Starting Page Number: 79
Article Number: 12415
Excerpt of article text...
The concept behind regular expressions is actually pretty simple, even if the language itself can be a bit dense. A series of tokens represents one or more characters in your text, and if those tokens match something, you get a result that includes everything that was matched. Easy, right?
If that's all there was to it, it also would be easy to use and easy to explain (well, easier, at least), but limited in usefulness. See, there are times when the same text will or won't match depending on what's around it. For example, suppose you wanted to match cat, but only if it was directly after the word female? Using subgroups (covered last time) can help, but there is another way: Lookarounds.
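To make the cat/female example concrete, here is a minimal sketch in Python's re module rather than Xojo. The sample text and variable names are invented for illustration, and the excerpt does not show the author's own pattern. It contrasts a subgroup, where "female " is part of the overall match, with a lookbehind, where it is only checked.

```python
import re

# Hypothetical sample text, not taken from the article.
text = "a male cat, a female cat, and a female dog"

# Subgroup approach: "female " is consumed by the match,
# and group 1 isolates the word we actually care about.
with_group = re.search(r"female (cat)", text)

# Lookbehind approach: "female " must come directly before "cat",
# but it is not included in the matched text.
with_lookbehind = re.search(r"(?<=female )cat", text)

print(with_group.group(0))       # whole match, including "female "
print(with_group.group(1))       # just the subgroup
print(with_lookbehind.group(0))  # the match itself is only "cat"
```

Both patterns find the same cat, the second one in the text. The difference is only in what ends up inside the match.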
Pointing The Way
Lookarounds let the regex engine examine surrounding text without including it in the match, but to understand them, you first have to know what's going on internally.
When you create a pattern, you're telling the engine to use each token to examine your text one character at a time. If there is no match, the engine moves on to the next character, but if there is a match, it takes note and advances an internal pointer. For every subsequent character that matches your pattern, the pointer is advanced again and again until the engine either runs out of tokens, meaning the complete match has been found, or the match fails. In the latter case, it backtracks the pointer as far as it can (as defined by your pattern) and tries again. At each step, that pointer is advanced or rewound so the engine can keep track of the start of the match and all the text that should be included.
Imagine you were doing this manually. You would open your text in a word processor and position your cursor at the beginning of the document. If the first character doesn't fit your criteria, you'd press the right arrow key to advance the cursor until you got to a character that does fit. You'd note that starting position somewhere, then press the right arrow again, examining each character in turn. Eventually you would find the text you were looking for, or you'd start pressing the left arrow until you got back to a point where you could start again. The regex engine is doing essentially the same thing, keeping track of its internal pointer and start position.
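The cursor walk described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how any real regex engine is implemented. It assumes every token is a single literal character, so the only "backtracking" is rewinding to one position past the previous start. The function name find_literal is hypothetical.

```python
def find_literal(tokens: str, text: str) -> int:
    """Return the start index of the first match of tokens in text, or -1.

    Simplifying assumption: each token is one literal character,
    with no quantifiers, classes, or alternation.
    """
    start = 0  # the noted starting position
    while start + len(tokens) <= len(text):
        pointer = start  # internal pointer begins at the candidate start
        consumed = 0     # number of tokens matched so far
        # Advance the pointer while each character matches its token.
        while consumed < len(tokens) and text[pointer] == tokens[consumed]:
            pointer += 1
            consumed += 1
        if consumed == len(tokens):
            return start  # ran out of tokens: the complete match was found
        # The match failed: rewind to just past the old start and try again.
        start += 1
    return -1


print(find_literal("cat", "concatenate"))
print(find_literal("cat", "dog"))
```

The start variable plays the role of the noted starting position, and pointer is the cursor moved with the arrow keys.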
...End of Excerpt. Please purchase the magazine to read the full article.