Advanced Matching
00:00 In the previous lesson, I showed you how to use flags to modify your regular expression behavior. In this lesson, I’m going to show you some even more advanced regular expression patterns.
00:11 I’m going to cover four concepts using pythex: conditional matches, lookahead matches, lookbehind matches, and comments. Conditional matches change the matching behavior based on whether or not something is present.
00:25 Lookahead and lookbehind matches change the behavior of the regex based on content ahead of or behind what you’re looking for, but don’t include that content in the match.
00:37 Comments are, well, comments. We’ll start with a familiar plain text match inside of a group, looking for the word ACME. I can add a quantifier to say this can occur once or not at all. For the purposes of this expression, nothing’s changed.
00:59 I’ve added a little more to it. This time it’s looking for ACME once or not at all, a whitespace (\s), the word Super, and then the period meta-character (.) zero or more times.
01:12 This essentially captures everything in a paragraph after the word ACME. Now I’m going to introduce the concept of a conditional.
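As a rough sketch, the expression described so far can be tried with Python's re module. The two sample lines are made up for illustration, since the text shown on screen isn't part of this transcript.

```python
import re

# Hypothetical sample lines; the text used on screen is not in the transcript.
lines = ["ACME Super Outfit", "Plain Super Outfit"]

# ACME once or not at all, one whitespace character, Super, then any characters.
pattern = re.compile(r"(ACME)?\sSuper.*")

results = []
for line in lines:
    match = pattern.search(line)
    # group(0) is the whole match; group(1) is None when ACME is absent.
    results.append((match.group(0), match.group(1)))

print(results)
```

The second line still matches because the group is optional, but its first group comes back as None.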
01:22 The part that’s been added is inside of this group. The question mark followed by brackets says that this is a conditional group. The 1 is a backreference, so this conditional depends on the presence of backreference 1.
01:38 So, if ACME, the first group, is present, then this evaluates to True. If ACME is not present, then this evaluates to False.
01:51 The regular expression that gets run in the case of True is everything to the right of the backreference group and to the left of the OR symbol, the pipe.
02:01 This is the word Out in a group. In the case where ACME isn’t found, the other pattern is run. And in this case, it’s \w* and fit in a group.
02:14 You can see the two different places where this is happening. The first match has 'ACME', then 'Super', and because 'ACME' is present, it’s including the 'Out' part of 'Outfit', but only the 'Out' part of 'Outfit'.
02:30 In the second case, where 'ACME' isn’t present, it’s including \w*, which in this case is also 'Out', and the 'fit' part.
02:40 You can also see how this works out inside of the groups. For the first match, 'ACME' is found and is set for the first group. The conditional is run, so the first part of the conditional is evaluated, which is the second group, which in this case is 'Out'. And the third group, which is the 'fit', does not get evaluated because the conditional passed. Match 2 has the opposite. The first group is empty, the second group is empty because it’s part of the True portion of the conditional, and the third group is run because it’s part of the False portion of the conditional. Because group 1, ACME, wasn’t found, the second part of the conditional is run and 'fit' is matched.
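The conditional walked through above might look like the following in Python. This is a sketch under assumptions: the pattern is reconstructed from the spoken description, with fit as the third group, and the sample lines are hypothetical.

```python
import re

# Hypothetical sample lines; the on-screen test text is not in the transcript.
lines = ["ACME Super Outfit", "Plain Super Outfit"]

# (?(1)A|B): use A if group 1 took part in the match, otherwise use B.
pattern = re.compile(r"(ACME)?\sSuper\s(?(1)(Out)|\w*(fit))")

conditional_results = []
for line in lines:
    match = pattern.search(line)
    # groups() lists groups 1 to 3; unused groups show up as None.
    conditional_results.append((match.group(0), match.groups()))

print(conditional_results)
```

With ACME present, only the Out branch runs and group 3 stays None. Without it, groups 1 and 2 are None and group 3 holds 'fit'.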
03:27 Now I’m going to show you a lookahead.
03:31 I’m going to start out with a regular expression without a lookahead in it. I’m looking for the word writing, some whitespace, and then the literal letter t inside of a group.
03:44 I change the group to use lookahead behavior with the ?= operator. What happens here is the regular expression is still looking for writing, \s, and t, but the ?=t portion is not consumed.
04:01 No group is created, and more importantly, the t isn’t considered part of the match. If you had more matching criteria later on, the t would participate in that matching criteria.
04:15 The reason this is called a lookahead is that, after finding writing and the whitespace, the match looks ahead to see if the next character is a t. If it does find it, then it matches, but it doesn’t include the t in the result.
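To make the difference concrete, here is a small sketch comparing the capturing group with the lookahead. The sample sentence is hypothetical, not the text from the video.

```python
import re

# Hypothetical sample text.
text = "writing tests, writing code"

# Capturing group: the t is consumed and stored as group 1.
consuming = re.search(r"writing\s(t)", text)

# Lookahead: the t must be there, but it is not part of the match.
lookahead = re.search(r"writing\s(?=t)", text)

# Because the t was not consumed, a later part of the pattern can still use it.
following = re.search(r"writing\s(?=t)(\w+)", text)

print(repr(consuming.group(0)))   # includes the t
print(repr(lookahead.group(0)))   # stops before the t
print(lookahead.groups())         # no groups created
print(following.group(1))
```

The last search shows the "participates in later matching" point: the word after the lookahead still starts with the t.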
04:32 Here’s another example. This is looking for 4 digits, followed in a lookahead group by the literal [, a \w, and the literal ].
04:46 This matches the model numbers below. Because it’s a lookahead, only the '3990' actually participates in the match. But you’ll notice, because the lookahead is there, none of the other four-digit numbers is matched. The lookahead looks for the [<letter>] form, limiting this to just the digits inside of the model number.
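A minimal sketch of this model-number lookahead, using an invented sentence in place of the on-screen text:

```python
import re

# Hypothetical sample text containing one model number and two other numbers.
text = "Model 3990[B] shipped in 1984 with 1205 units"

# Four digits, only when immediately followed by [<one word character>].
model_digits = re.findall(r"\d{4}(?=\[\w\])", text)

print(model_digits)
```

Only the digits are returned. The [B] is checked but left out of the match.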
05:11 Changing the equal sign (=) to an exclamation mark (!) changes it to a negative lookahead, meaning “Only match situations where the digits are not followed by [<letter>].” This matches all the four-digit numbers that aren’t associated with the model number.
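The negative form can be sketched the same way. The sentence is again an invented stand-in for the video's text.

```python
import re

# Hypothetical sample text containing one model number and two other numbers.
text = "Model 3990[B] shipped in 1984 with 1205 units"

# Four digits, only when NOT immediately followed by [<one word character>].
other_digits = re.findall(r"\d{4}(?!\[\w\])", text)

print(other_digits)
```

Because \d{4} needs exactly four digits, the regex can't fall back to a shorter piece of 3990, so the model number is skipped entirely.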
05:32 ?<= is lookbehind. This is a similar concept to lookahead, but it happens before the pattern that you’re looking for. So in this situation, I’m looking to match the literal [, \w, literal ], preceded by 4 digits.
05:52 Notice that I was explicit about how many digits were here. Due to the way regular expressions are implemented, lookbehinds have to be of a fixed length.
06:01 If I change the \d{4} to be \d+ it will fail.
06:10 Regular expressions are built on something called finite-state machines. Finite-state machines only allow certain kinds of computing patterns, and this is not one of them. In the reference material in a later lesson, I’ll show you where you can dig out more information on how this works.
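In Python's re module, the fixed-length rule shows up as an error when the pattern is compiled. The sketch below uses a hypothetical sample string and simply records whatever error message the module reports.

```python
import re

# Hypothetical sample text.
text = "Model 3990[B] replaces part [C]"

# Fixed-width lookbehind: [<letter>] only when preceded by exactly four digits.
suffixes = re.findall(r"(?<=\d{4})\[\w\]", text)
print(suffixes)

# Variable-width lookbehind: re refuses to compile the pattern.
try:
    re.compile(r"(?<=\d+)\[\w\]")
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)
```

The [C] is skipped because it isn't preceded by four digits, and the \d+ version never gets as far as matching.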
06:29 You can also negate lookbehind. ?<! is a negative lookbehind. This is looking for two digits, not preceded by two digits. And finally, every good programmer knows to put comments in their code.
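Before moving on to comments, the negative lookbehind just described (two digits, not preceded by two digits) might be sketched like this, with made-up sample text:

```python
import re

# Hypothetical sample text.
text = "42 3990 7"

# Two digits, as long as the two characters before them are not both digits.
pairs = re.findall(r"(?<!\d{2})\d{2}", text)

print(pairs)
```

The 90 in 3990 is rejected because 39 sits directly before it, and the lone 7 is too short to match.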
06:48 You can put comments inside of your regular expression. ?# is the comment symbol. Anything inside of the group is ignored. This is part of the regular expression standard. In Python, where you have the VERBOSE flag, I would much rather use that.
07:04 It’s far clearer than trying to insert this inside of your regex.
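For comparison, here is a sketch of the same lookahead pattern written both ways: once with inline (?#...) comments and once with the VERBOSE flag. The sample text is hypothetical.

```python
import re

# Hypothetical sample text.
text = "Model 3990[B] shipped in 1984"

# Inline comment groups: everything inside (?#...) is ignored.
inline = re.compile(r"\d{4}(?#four digits)(?=\[\w\])(?#then [letter])")

# VERBOSE: whitespace is ignored and # starts a comment that runs to end of line.
verbose = re.compile(
    r"""
    \d{4}        # four digits
    (?=\[\w\])   # followed by [letter], which is not consumed
    """,
    re.VERBOSE,
)

inline_matches = inline.findall(text)
verbose_matches = verbose.findall(text)
print(inline_matches, verbose_matches)
```

Both versions match the same thing. The VERBOSE form just leaves room to lay the pattern out one piece per line.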
07:10 Next up, I’ll show you some fun regular expressions. And by fun, I mean horrific.
Christopher Trudeau RP Team on March 17, 2021
Thanks Roy. You’re right. We’ll get a patch in on that.
Rahul Pandey on Aug. 23, 2023
Hey Christopher, what if I want to use a variable to check for a match like for this string:
re.search(r"var1(?:¦|:|;|-)[ ]{0,1}(.*?)\\n\\x0c", texts).group(1)
Christopher Trudeau RP Team on Aug. 24, 2023
Hi Rahul,
The regex itself being passed into re.search() or any other regex method is also text. You can build the text the way you would with any other variables, either adding strings or using an f-string.
One caution though: remember that I’m using raw strings r"..." to avoid all the escaping, as backslashes are important to a regex. Depending on how you build your string, it won’t be raw. You may also have to escape the contents of the variable itself.
So, something like this:
>>> import re
>>> text = "cat dog cat3 dog2"
>>> look = "cat"
>>> my_regex = re.escape(look) + r'([0-9])'
>>> re.search(my_regex, text)
<re.Match object; span=(8, 12), match='cat3'>
>>> re.search(my_regex, text).group(1)
'3'
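The f-string route mentioned above might look like the following sketch. The variable names and sample text are illustrative assumptions, not taken from the question. Combining the r and f prefixes keeps backslashes raw, and any literal braces in the regex, such as {0,1}, have to be doubled so the f-string doesn't treat them as placeholders.

```python
import re

# Hypothetical variable name and text to search.
name = "var1"
text = "var1: hello"

# rf"..." is both raw and formatted; {{0,1}} becomes the quantifier {0,1}.
my_regex = rf"{re.escape(name)}[:;-][ ]{{0,1}}(.*)"

captured = re.search(my_regex, text).group(1)
print(my_regex)
print(captured)
```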
Roy Telles on March 17, 2021
The video uses a ? quantifier but says “searches for ACME zero or more times” but this quantifier is used for zero or one time, so I think there may need to be an edit.