Advanced Matching

00:00 In the previous lesson, I showed you how to use flags to modify your regular expression behavior. In this lesson, I’m going to show you some even more advanced regular expression patterns.

00:11 I’m going to cover four concepts using pythex: conditional matches, lookahead matches, lookbehind matches, and comments. Conditional matches change the matching behavior based on whether or not something is present.

00:25 Lookahead and lookbehind matches change the behavior of the regex based on content ahead or behind what you’re looking for, but don’t include that part in the match.

00:37 Comments are, well, comments. We’ll start with a familiar plain text match inside of a group, looking for the word ACME. I can add a quantifier to say this can occur zero or more times. For the purposes of this expression, nothing’s changed.

00:59 I’ve added a little more to it. This time it’s looking for ACME zero or more times, a whitespace (\s), the word Super, and then the period meta-character (.) zero or more times.

01:12 This essentially captures everything in a paragraph after the word ACME. Now I’m going to introduce the concept of a conditional.
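
As a rough sketch, the pattern described so far can be tried in Python's re module. The sample sentence is an assumption made for illustration, not the text used in pythex.

```python
import re

# Hypothetical sample text; the lesson's actual pythex input is not shown.
text = "The ACME Super Outfit comes with a cape."

# ACME zero or more times, one whitespace character, the word Super,
# then any character (except a newline) zero or more times.
pattern = re.compile(r"(ACME)*\sSuper.*")

match = pattern.search(text)
print(match.group(0))  # the full matched text
print(match.group(1))  # the content of the ACME group
```

Note that the period does not match newlines by default, so this only runs to the end of the current line.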

01:22 The part that’s been added is inside of this group. The question mark with brackets says that this is a conditional group. The 1 is a backreference, so this conditional is conditional on the presence of backreference 1.

01:38 So, if ACME, the first group, is present, then this evaluates to True. If ACME is not present, then this evaluates to False.

01:51 The regular expression that gets run in the case of True is everything to the right of the backreference group and to the left of the OR symbol, the pipe.

02:01 This is the word Out in a group. In the case where ACME isn’t found, the other pattern is run. And in this case, it’s \w* and fit in a group.

02:14 You can see the two different places where this is happening. The first match has 'ACME', then 'Super', and because 'ACME' is present, it’s including the 'Out' part of 'Outfit', but only the 'Out' part of 'Outfit'.

02:30 In the second case, where 'ACME' isn’t present, it’s including \w*, which in this case is also 'Out', and the 'fit' part.

02:40 You can also see how this works out inside of the groups. For the first match, 'ACME' is found and is set for the first group. The conditional evaluates to True, so the first part of the conditional runs, which is the second group, in this case 'Out'. The third group, 'fit', does not get evaluated because the conditional passed. Match 2 is the opposite. The first group is empty, the second group is empty because it’s part of the True portion of the conditional, and the third group is run because it’s part of the False portion of the conditional. Because group 1, ACME, wasn’t found, the second part of the conditional is run and 'fit' is matched.
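
The conditional can be sketched in Python roughly as follows. The exact pattern and test text from pythex are not reproduced in this transcript, so the pattern below is an assumed reconstruction: the extra \s before the conditional and both sample strings are additions for illustration.

```python
import re

# Assumed reconstruction of the conditional pattern.
# (?(1)yes|no) runs "yes" if group 1 matched, otherwise "no".
pattern = re.compile(r"(ACME)*\sSuper\s(?(1)(Out)|\w*(fit))")

# Hypothetical sample strings, one with ACME and one without.
with_acme = pattern.search("ACME Super Outfit")
without_acme = pattern.search("The Super Outfit")

# ACME present: the True branch runs, so only (Out) is filled in.
print(repr(with_acme.group(0)), with_acme.groups())

# ACME absent: the False branch runs, so only (fit) is filled in.
print(repr(without_acme.group(0)), without_acme.groups())
```

Groups that did not take part in the match come back as None in Python, which corresponds to the empty groups shown in pythex.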

03:27 Now I’m going to show you a lookahead.

03:31 I’m going to start out with a regular expression without a lookahead in it. I’m looking for the word writing, some whitespace, and then the literal letter t inside of a group.

03:44 I change the group to be lookahead behavior using the ?= operator. What happens here is the regular expression is still looking for writing \s(t), but the ?=t portion is not consumed.

04:01 No group is created, and more importantly, the t isn’t considered part of the match. If you had more matching criteria later on, the t would participate in that matching criteria.

04:15 The reason this is called a lookahead is that the regex matches writing and the whitespace, then looks ahead to see whether a t comes next. If it finds one, it matches, but it doesn’t include the t in the match.
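
A small Python sketch of the difference between the capturing group and the lookahead. The sample sentence is an assumption, as is the third pattern, which adds \w+ to show the t taking part in later matching.

```python
import re

text = "I enjoy writing tests"  # hypothetical sample sentence

# Capturing group: the t is consumed and stored in group 1.
capturing = re.search(r"writing\s(t)", text)

# Lookahead: the t must be there, but it is not consumed and no group is made.
lookahead = re.search(r"writing\s(?=t)", text)

# Because the t was not consumed, later parts of the pattern can still use it.
followed = re.search(r"writing\s(?=t)\w+", text)

print(repr(capturing.group(0)), capturing.groups())
print(repr(lookahead.group(0)), lookahead.groups())
print(repr(followed.group(0)))
```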

04:32 Here’s another example. This is looking for 4 digits followed, in a lookahead group, by the literal [, a \w, and the literal ].

04:46 This matches the model numbers below. Because it’s a lookahead, only the '3990' actually participates in the match. But you’ll notice, because the lookahead is there, none of the other four-digit numbers is matched. The lookahead looks for the [<letter>] form, limiting this to just the digits inside of the model number.
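
A minimal sketch of this lookahead in Python. The sample text, including the years, is made up for illustration; only the '3990' model number comes from the lesson.

```python
import re

# Hypothetical text with one model number and two unrelated four-digit numbers.
text = "Model 3990[B] shipped in 2019 and was retired in 2021."

# Without the lookahead, every four-digit number matches.
all_numbers = re.findall(r"\d{4}", text)

# With the lookahead, only digits followed by [<letter>] match,
# and the bracketed letter is not part of the result.
model_numbers = re.findall(r"\d{4}(?=\[\w\])", text)

print(all_numbers)
print(model_numbers)
```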

05:11 Changing the equal sign (=) to an exclamation mark (!) changes it to a negative lookahead, meaning “Only match situations where the digits are not followed by [<letter>].” This matches all the four-digit numbers that aren’t associated with the model number.
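
The negative form can be sketched the same way in Python. The sample text is again an assumption, not the text from pythex.

```python
import re

# Hypothetical text with one model number and two unrelated four-digit numbers.
text = "Model 3990[B] shipped in 2019 and was retired in 2021."

# Negative lookahead: four digits NOT followed by [<letter>].
other_numbers = re.findall(r"\d{4}(?!\[\w\])", text)

print(other_numbers)
```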

05:32 ?<= is lookbehind. This is a similar concept to lookahead, but it happens before the pattern that you’re looking for. So in this situation, I’m looking to match the literal [, \w, literal ], preceded by 4 digits.
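
Here is a short Python sketch of that lookbehind, using an invented sample string.

```python
import re

# Hypothetical text: one bracketed letter after four digits, one after a word.
text = "Model 3990[B] and note [C]"

# Lookbehind: match [<letter>] only when the four characters before it are digits.
# The digits themselves are not included in the result.
suffixes = re.findall(r"(?<=\d{4})\[\w\]", text)

print(suffixes)
```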

05:52 Notice that I was explicit about how many digits were here. Due to the way regular expressions are implemented, lookbehinds have to be of a fixed length.

06:01 If I change the \d{4} to be \d+ it will fail.

06:10 Regular expressions are built on something called finite-state machines. Finite-state machines only allow certain kinds of computing patterns, and this is not one of them. In the reference material in a later lesson, I’ll show you where you can dig out more information on how this works.
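
The fixed-length restriction can be observed in Python's re module, which rejects a variable-width lookbehind when the pattern is compiled. This sketch only captures the error; the exact wording of the message may vary between Python versions.

```python
import re

try:
    # \d+ has no fixed width, so the re module refuses to compile this.
    re.compile(r"(?<=\d+)\[\w\]")
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)
```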

06:29 You can also negate lookbehind. ?<! is a negative lookbehind. This is looking for two digits, not preceded by two digits. And finally, every good programmer knows to put comments in their code.
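
A minimal Python sketch of the negative lookbehind described here, run against an invented string.

```python
import re

text = "1234 56"  # hypothetical sample

# Two digits that are NOT immediately preceded by two digits.
# "12" has nothing before it and "56" is preceded by "4 ", so both match.
# "34" is preceded by "12", so it is rejected.
pairs = re.findall(r"(?<!\d{2})\d{2}", text)

print(pairs)
```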

06:48 You can put comments inside of your regular expression. ?# is the comment symbol. Anything inside of the group is ignored. This is part of the regular expression standard. In Python, where you have the VERBOSE flag, I would much rather use that.

07:04 It’s far clearer than trying to insert this inside of your regex.
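
To compare the two styles, here is a small Python sketch that writes the same pattern with an inline (?#...) comment and with the VERBOSE flag. The pattern and sample text are borrowed from the model-number example as an illustration and are assumptions, not what is shown on screen.

```python
import re

text = "Model 3990[B] shipped in 2019"  # hypothetical sample

# Inline comment group: everything inside (?#...) is ignored.
inline = re.compile(r"\d{4}(?#four-digit model number)(?=\[\w\])")

# VERBOSE flag: whitespace is ignored and # starts a comment.
verbose = re.compile(
    r"""
    \d{4}        # four digits
    (?=\[\w\])   # followed by [<letter>], which is not consumed
    """,
    re.VERBOSE,
)

inline_result = inline.findall(text)
verbose_result = verbose.findall(text)
print(inline_result, verbose_result)
```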

07:10 Next up, I’ll show you some fun regular expressions. And by fun, I mean horrific.
