Here’s a sneak peek of some of the content I’m covering at the STC Summit this year, plus a few things I had to cut. If you’re interested in attending my presentation, visit the STC Summit home page or the page for my class.
Groups are a big part of regular expression syntax. Groups:
- Support OR logic
- Account for optional or repeated content by letting you specify how many times to match a series of characters
- Let you qualify a match based on nearby content
- Let you reference matched content in replacement strings
Feel free to try the example regular expressions at regex101.com.
OR logic and variance
If we want to match variations of URLs, for example, we might use a regular expression like this:
(https?:\/\/)?(www\.)?domain\.(com|org|net)
This matches URL variations such as http://domain.com, https://www.domain.org, and www.domain.net. The dots are escaped (\.) so that they match a literal period instead of any character.
Using the ? token after a group, (www\.)?, matches the group zero or one time. That’s useful if there are optional parts in the text we’re trying to match.
Including values on either side of a pipe character in a group, (com|org|net), specifies OR logic. With an OR group, the regex can match any of the values in the group.
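If you’d rather test this in a script than at regex101.com, here’s a minimal Python sketch of the same idea. The candidate URLs are made up for illustration, and the forward slashes don’t need escaping because Python patterns aren’t wrapped in / delimiters.

```python
import re

# Optional protocol, optional "www.", then an OR group for the top-level domain.
# The dots are escaped so they match a literal period, not any character.
URL_PATTERN = re.compile(r"(https?://)?(www\.)?domain\.(com|org|net)")

# Hypothetical sample URLs, not taken from the original post.
candidates = [
    "http://domain.com",
    "https://www.domain.org",
    "www.domain.net",
    "domain.com",
    "https://domain.io",  # "io" is not in the OR group
]

# fullmatch() requires the whole string to fit the pattern.
matches = [url for url in candidates if URL_PATTERN.fullmatch(url)]
print(matches)
```

Everything except https://domain.io should end up in the list, because io isn’t one of the values in the OR group.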
Qualifying matches based on nearby content
Lookbehinds and lookaheads are special groups. They do not contribute to a match, but help qualify the match based on nearby content. Positive lookbehinds use this pattern:
(?<=previous content)content to match
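Positive lookbehinds return matched content only if it comes after some other content. In many regex flavors, lookbehinds must have a fixed length, so you need to know exactly what content comes before your match. There are negative lookbehinds, too. They work the way you think they do: they return matched content only if it does not come after the content in the lookbehind.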
The negative lookbehind pattern is: (?<!not this content)content to match
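Here’s a small Python sketch of both lookbehind patterns. The sample sentence and variable names are my own, chosen only to show the difference.

```python
import re

# Hypothetical sample text for illustration.
text = "Price: $40, discount: 15, total: $25"

# Positive lookbehind: numbers that come right after a dollar sign.
with_dollar = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: numbers NOT preceded by a dollar sign.
# The \d inside the lookbehind stops the regex from starting mid-number.
without_dollar = re.findall(r"(?<![\$\d])\d+", text)

print(with_dollar)
print(without_dollar)
```

In both cases the dollar sign only qualifies the match. It never shows up in the results.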
Positive lookaheads follow this pattern: content to match(?=following content)
Positive lookaheads return matched content only when the content after it matches something. Unlike lookbehinds, lookaheads are variable-length and support greedy matches. Negative lookaheads do the opposite.
Negative lookaheads follow this pattern: content to match(?!not this following content)
In both cases, the content inside the lookahead or lookbehind group is not part of the match. Lookahead and lookbehind content serves only to qualify matches based on nearby content.
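A quick Python sketch of lookaheads, using a made-up status line. The last pattern shows a lookahead with variable-length content inside it.

```python
import re

# Hypothetical sample text for illustration.
text = "cpu 85% mem 4096 disk 70%"

# Positive lookahead: numbers followed by a percent sign.
percentages = re.findall(r"\d+(?=%)", text)

# Negative lookahead: whole numbers NOT followed by a percent sign.
plain_numbers = re.findall(r"\b\d+\b(?!%)", text)

# Variable-length lookahead: labels followed by a space, any number of digits, and "%".
percent_labels = re.findall(r"\w+(?= \d+%)", text)

print(percentages)
print(plain_numbers)
print(percent_labels)
```

The percent signs and the numbers after the labels qualify the matches but aren’t returned with them.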
Here’s an example using both patterns to get content from a paragraph tag.
(?<=p class="test">).*?(?=<\/p>)
Given <p class="test">Hello this is a test</p>, the regex matches Hello this is a test.
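To check that the tags really stay out of the match, here’s a short Python sketch of the same pattern. Python doesn’t need the forward slash escaped, so I’ve dropped the backslash.

```python
import re

html = '<p class="test">Hello this is a test</p>'

# Lookbehind for the opening tag, lazy match, lookahead for the closing tag.
pattern = r'(?<=p class="test">).*?(?=</p>)'

match = re.search(pattern, html)
matched_text = match.group(0)  # the whole match, without the surrounding tags
print(matched_text)
```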
That’s good enough for many tasks, but what if you want to use that content as part of a replacement string? You can do that, too.
Referencing matched content in a replacement string
Groups allow you to mark sub-patterns in a regular expression. Sub-patterns are useful because you can reference them in replacement strings. To make a sub-pattern, just put that part of the regular expression in parentheses. Consider the previous example: (?<=p class="test">).*?(?=<\/p>)
This matches “Hello this is a test.” But because the lazy match isn’t in a group, you can’t reference it by group number in your replacement string. Putting it in a group does the trick.
(?<=p class="test">)(.*?)(?=<\/p>)
Groups are also called “capturing groups.” In this example, there’s one capture group. (Remember, lookbehinds and lookaheads don’t contribute to matches.) To reference a capture group, use this pattern in your replacement string: $n, where n is the capturing group’s number.
Capture groups are numbered in the order they occur in the expression. In our example, we’d use $1. Let’s put it all together:
Original text: <p class="test">Hello this is a test</p>
Regex: (?<=p class="test">)(.*?)(?=<\/p>)
Replacement string: $1 of the emergency broadcast system.
Updated text: <p class="test">Hello this is a test of the emergency broadcast system.</p>
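Here’s the same replacement as a Python sketch. One difference to watch for: Python’s re module writes group references as \1 (or \g<1>) instead of $1, so the replacement string looks a little different from the one above.

```python
import re

original = '<p class="test">Hello this is a test</p>'

# Same regex as above, with one capture group around the lazy match.
pattern = r'(?<=p class="test">)(.*?)(?=</p>)'

# \1 is Python's spelling of $1: the content of the first capture group.
updated = re.sub(pattern, r"\1 of the emergency broadcast system.", original)
print(updated)
```

The tags survive untouched because the lookbehind and lookahead never become part of the match, so only the text between them is replaced.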
That’s pretty cool. Not only can we match unknown content, we can use that content in our replacement strings. There are a lot of ways this is useful, but here’s an example:
You’re working with some legacy HTML pages produced by an older version of your CMS. These pages contain markup you want to replace, like span tags with “bold” as the class.
Capturing groups allow us to move the content from one tag to another.
Original text: <p class="test">Hello this is a <span class="bold">test</span></p>
Regex: <span.*?class="bold".*?>(.*?)<\/span>
Replacement string: <strong>$1</strong>
Updated text: <p class="test">Hello this is a <strong>test</strong></p>
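And here’s a Python sketch of the tag swap, again using \1 where the replacement string above uses $1. The second pattern is my own variation, not part of the original example: it swaps .*? for [^>]* so the match can’t run past the end of one opening tag into the next, which matters if a page mixes bold spans with other spans.

```python
import re

original = '<p class="test">Hello this is a <span class="bold">test</span></p>'

# The regex from the example: capture whatever sits inside the bold span.
pattern = r'<span.*?class="bold".*?>(.*?)</span>'
updated = re.sub(pattern, r"<strong>\1</strong>", original)
print(updated)

# Hypothetical page with a non-bold span before the bold one.
mixed = '<span class="note">a</span> <span class="bold">b</span>'

# Assumed tighter variant: [^>]* cannot cross a ">" character.
strict_pattern = r'<span[^>]*class="bold"[^>]*>(.*?)</span>'
updated_mixed = re.sub(strict_pattern, r"<strong>\1</strong>", mixed)
print(updated_mixed)
```

With the tighter pattern, the first span is left alone and only the bold span is converted.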
Conclusion
I’ve covered a few of the most common ways that technical writers can use groups to make their tasks easier. There are other kinds of groups and ways to use them. More regex posts to come!
If there are topics you’d like me to write about, leave them in the comments!