00:00 In the previous lesson, I showed you how to use a regular expression in Python with groups and getting at the group data. In this lesson, I’m going to show you an advanced version of grouping where you can name the groups. First, a little tangent.
00:40 This escape mechanism is the same escape mechanism that regular expressions use. This can be messy. You end up having to escape your escape sequences in order for them to work properly inside of the regex. Let’s go into the REPL and I’ll show you what this means.
Here’s the first regular expression, looking for the meta-character
\w one or more times, and a comma. This matches
'one,'. So, why haven’t I brought up this backslash problem before? Well, Python tries to be smart about this.
Both of the matches are the same in this case. This gets complicated when you’re looking for backslashes. The string being searched, the second parameter, is the word
"one" followed by a literal backslash (
"\"), denoted by two backslashes to escape it in the Python string, followed by the word
"two". The regular expression, made up of
"\w+\\" followed by the word
"two", fails to find anything.
This is because Python turns the two backslashes in the regular expression into a single backslash before the pattern reaches the
re module. It sees
\t and interprets that as a tab, which doesn’t match the literal backslash in the string on the right-hand side.
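Here is a minimal sketch of that failed match, in case you want to try it yourself. The variable names are my own, and I have doubled the backslash in front of the w so the snippet doesn't lean on Python's automatic conversion; the pattern that reaches the re module is the same either way.

```python
import re

# The text is the seven characters: o n e \ t w o
text = "one\\two"

# Python collapses each doubled backslash, so re receives: \w+\two
pattern = "\\w+\\two"

# re reads \t as a tab, and there is no tab in the text, so nothing matches.
print(re.search(pattern, text))                    # None

# The same pattern does match a word, a real tab, and the letters "wo".
print(re.search(pattern, "one\two") is not None)   # True
```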
In order to find the actual backslash, you have to escape the backslash in the Python string and escape the backslash in the regex. This results in four backslashes. The first two backslashes escape a single backslash in the Python string and pass a backslash to the
This will actually find the match in the string on the right-hand side. Python automatically converting a non-escape sequence backslash into a real backslash is neat to have but can cause all sorts of problems and ambiguity, so you have to be careful with it. If I hadn’t been looking for the word
"two" but I’d been looking for the word
"eight", I would have had a much bigger problem.
Oh, and there it is. Let me scroll up so you can see that again. Take a look at this regular expression. The first backslash gets automatically converted by Python into a backslash character and that gets passed into the regex as the meta-character
\w. The second two backslashes get escaped, turning into a single backslash character.
04:17 And that’s the match you’re looking for. I generally don’t recommend taking advantage of Python’s automatic conversion of those backslashes. You should match for what is there. A better choice is to use a raw string.
04:59 This makes it far easier to understand as you’re coding along. To summarize, both regexes and Python strings use backslashes for special characters. That becomes complicated when you’re using a string to describe a regex. In a regular Python string, you should always write the double backslash for a meta-character to be safe. If you’re searching for an actual backslash, it needs to be escaped twice: once for Python and once for the regex. Alternatively, just use a raw string. It’s far easier to read and understand.
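To make that summary concrete, here is a small sketch comparing the fully escaped string with the raw string. The names escaped and raw are just labels I picked for illustration.

```python
import re

# The text being searched: "one", a literal backslash, "two".
text = "one\\two"

# Regular string: every backslash is doubled, so a literal backslash needs four.
escaped = "\\w+\\\\two"

# Raw string: Python leaves backslashes alone, so you write the regex as-is.
raw = r"\w+\\two"

# Both spellings hand the same pattern to the re module.
print(escaped == raw)                   # True

# The pattern finds the word, the literal backslash, and "two".
match = re.search(raw, text)
print(match.group())                    # one\two
```

The raw version is the one I'd reach for, because what you type is what the regex engine sees.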
Here’s your first named group. Let’s break it down together. On the outside of the regex is meta-character
\s, so you’re looking for something that is separated by whitespace. Between the whitespace characters is a grouping. The part you haven’t seen before is
06:36 The remainder of the group is the regular expression, as before. This particular expression has a character class of non-vowels followed by a character class of lowercase letters, and that second class is repeated one or more times. On the right-hand side, just like before, you can see the match groups, but now they’re named instead of numbered.
match.groups() has the same content. There’s another method that I haven’t shown you before, which is
.groupdict(). It returns a dictionary keyed by group name, so now I’ve got a mapping from
'third', a group name in the regular expression, to the value
'three', which is the content of the matched group.
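If you'd like to see named groups and .groupdict() side by side, here is a short sketch. The pattern and the text are hypothetical stand-ins of mine, not the ones from the lesson, chosen so that the group named third captures 'three'.

```python
import re

# Hypothetical pattern: three comma-separated words, each in a named group.
pattern = r"(?P<first>\w+),(?P<second>\w+),(?P<third>\w+)"
match = re.search(pattern, "one,two,three")

# A named group can be fetched by name or by its position.
print(match.group("third"))   # three
print(match.group(3))         # three

# .groups() still returns the captured values as a tuple.
print(match.groups())         # ('one', 'two', 'three')

# .groupdict() maps each group name to what it captured.
print(match.groupdict())      # {'first': 'one', 'second': 'two', 'third': 'three'}
```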
This is a bit of a trade-off. It makes your code easier to read because you’re actually naming what you’re looking for—rather than
'three'—but it tends to make the regular expression harder to read because your group has the extra content in it to specify the group name.
You can also name a backreference.
?P with an equals (
=) says to use the named backreference. Everything between the parentheses here is the equivalent of
\1, the backreference from before, but this time it uses the named reference
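As a rough sketch of a named backreference, the snippet below looks for a repeated word. The group name word and the sample text are assumptions of mine, not taken from the lesson.

```python
import re

text = "it was the the best"

# (?P<word>...) names the group; (?P=word) must match the same text again.
named = r"(?P<word>\w+) (?P=word)"

# The numbered equivalent uses \1 to refer back to the first group.
numbered = r"(\w+) \1"

m = re.search(named, text)
print(m.group())          # the the
print(m.group("word"))    # the

# Both spellings find the same repeated word.
print(re.search(numbered, text).group() == m.group())   # True
```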
Here’s another set of groups. There are three groups here. The first is a non-vowel, the second is a character class with one or more vowels, and the third group is a non-vowel. If you look at the matches on the right-hand side, this matches things like
'suit'. So far, so good. What if you’re not interested in the piece in the middle, but still need the grouping to be able to match the expression?
11:10 So notice, before, there were three groups for each match. Now there are only two. This middle group is not capturing. This can be useful for a couple of reasons. One, if you’re doing a lot of regex work, each group takes up some memory, so a non-capturing group tends to be less expensive.
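Here is a small sketch of the difference between the capturing and non-capturing versions. The sample text is my own, and I'm assuming re.findall() as a convenient way to list the groups for each match.

```python
import re

text = "suit boat"

# Three capturing groups: non-vowel, one or more vowels, non-vowel.
capturing = r"([^aeiou])([aeiou]+)([^aeiou])"
print(re.findall(capturing, text))
# [('s', 'ui', 't'), ('b', 'oa', 't')]

# (?:...) still groups the vowels for matching, but doesn't capture them.
non_capturing = r"([^aeiou])(?:[aeiou]+)([^aeiou])"
print(re.findall(non_capturing, text))
# [('s', 't'), ('b', 't')]
```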
If you’re not actually going to use the sub-component group from the expression, then making it non-capturing will be more efficient. In the next lesson, I’ll introduce you to even more of the methods inside of the