Naming Groups
00:00 In the previous lesson, I showed you how to use groups in a Python regular expression and how to get at the group data. In this lesson, I’m going to show you an advanced version of grouping where you can name the groups. First, a little tangent.
00:16 You may recall that Python uses backslashes to escape special characters inside of strings. For example, \t and \n are the tab and newline characters.
00:29 If you want a backslash inside of a string, you have to escape it with another backslash. This string would only show a single backslash if it were printed out.
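As a small sketch of the escaping rule just described (the variable names are illustrative and do not come from the lesson):

```python
# "\\" in source code produces a single backslash character in the string.
escaped = "one\\two"

print(escaped)       # displays one\two
print(len(escaped))  # 7: three letters, one backslash, three letters

# "\t" is a single tab character, not a backslash followed by "t".
tabbed = "a\tb"
print(len(tabbed))   # 3
```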
00:40 This escape mechanism is the same escape mechanism that regular expressions use. This can be messy. You end up having to escape your escape sequences in order for them to work properly inside of the regex. Let’s go into the REPL and I’ll show you what this means.
00:59 Here’s the first regular expression, looking for the meta-character \w one or more times, followed by a comma. This matches 'one,'. So, why haven’t I brought up this backslash problem before? Well, Python tries to be smart about this.
01:14 If you put a backslash into a Python string and have it escape a character that is not a valid escape character, Python automatically turns that into a literal backslash.
01:27 This only gets complicated when you overlap between the escape sequences in a string and the escape sequences in a regex. To be proper, I should have been using two backslashes.
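A sketch of what this comparison might look like. The sample text is an assumption, since the transcript only says that the match is 'one,'. Recent Python versions emit a warning when a string literal contains an invalid escape such as \w, so this example uses the explicitly escaped form and a raw string rather than relying on the automatic conversion.

```python
import re

text = "one,two,three"  # hypothetical sample text

# Explicitly escaped backslash: Python passes \w+, to the re module.
escaped = re.search("\\w+,", text)

# Raw string: the backslash is taken literally, giving the same pattern.
raw = re.search(r"\w+,", text)

print("\\w+," == r"\w+,")  # True: both literals hold the same four characters
print(escaped.group())     # one,
print(raw.group())         # one,
```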
01:42 Both of the matches are the same in this case. This gets complicated when you’re looking for backslashes. The string being searched, the second parameter, is the word "one" followed by a literal backslash (\), which is written as two backslashes to escape it in the Python string, followed by the word "two". The regular expression, "\w+\\" followed by the word "two", fails to find anything.
02:08 This is because the two backslashes in the regular expression get turned into a single backslash before they reach the re module. It sees \t and interprets that as a tab, which doesn’t match the string on the right-hand side with the literal backslash.
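The failing search can be reproduced with a sketch like the following. The variable names are assumptions, and the \w backslash is doubled here so the example does not depend on Python's automatic conversion; the pattern text that reaches the re module is the same either way.

```python
import re

content = "one\\two"  # the text one\two, with one literal backslash

# Python collapses each "\\" to "\", so re receives: \w+\two
# The regex engine reads \t as a tab character, which the text lacks.
pattern = "\\w+\\two"

result = re.search(pattern, content)
print(pattern)  # \w+\two
print(result)   # None
```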
02:27 In order to find the actual backslash, you have to escape the backslash in the Python string and escape the backslash in the regex. This results in four backslashes. The first two backslashes escape a single backslash in the Python string and pass a backslash to the re module.
02:47 The second two backslashes do the same, resulting in the re module receiving "\\", which is the regex escape for a backslash. This is an escape of an escape.
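A minimal sketch of the four-backslash version, using assumed variable names:

```python
import re

content = "one\\two"  # the text one\two

# Python turns "\\\\" into two backslashes, so re receives: \w+\\two
# In the regex, \\ means "match one literal backslash".
pattern = "\\w+\\\\two"

match = re.search(pattern, content)
print(pattern)        # \w+\\two
print(match.group())  # one\two
```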
03:01 This will actually find the match in the string on the right-hand side. Python automatically converting a non-escape sequence backslash into a real backslash is neat to have but can cause all sorts of problems and ambiguity, so you have to be careful with it. If I hadn’t been looking for the word "two" but I’d been looking for the word "eight", I would have had a much bigger problem.
03:31 Oh, and there it is. Let me scroll up so you can see that again. Take a look at this regular expression. The first backslash gets automatically converted by Python into a backslash character and that gets passed into the regex as the meta-character \w. The second two backslashes get escaped, turning into a single backslash character.
03:55 And there is no \e: no meta-character uses the e. So the re module throws an exception. In order to match the backslash in front of "eight" on the right-hand side, you will need four backslashes.
04:12 Here’s how to do it right.
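Both the failure and the fix can be sketched as follows. The content string is an assumption based on the description, and the \w backslash is doubled so the literals are explicit.

```python
import re

content = "one\\eight"  # the text one\eight

error_message = None
try:
    # re receives \w+\eight, and \e is not a valid regex escape.
    re.search("\\w+\\eight", content)
except re.error as exc:
    error_message = str(exc)

print(error_message)  # a "bad escape" message from the re module

# Four backslashes in the literal give re the pattern \w+\\eight.
match = re.search("\\w+\\\\eight", content)
print(match.group())  # one\eight
```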
04:17 And that’s the match you’re looking for. I generally don’t recommend taking advantage of Python’s automatic conversion of those backslashes. You should match for what is there. A better choice is to use a raw string.
04:32 This is a feature built into Python that says “This string doesn’t have any escape sequences in it.” Here’s an example.
04:44 When you look at the value of the content variable, you’ll see that Python has turned this into the escaped backslash. Your best practice is to use raw strings everywhere with your regexes.
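A short sketch of the raw-string approach. The variable name content comes from the lesson; the pattern is an assumption chosen to match the earlier example.

```python
import re

content = r"one\two"  # raw string: the backslash is kept as-is
print(repr(content))  # 'one\\two', the repr shows the backslash escaped

# In a raw string, \\ reaches re unchanged and matches one literal backslash.
pattern = r"\w+\\two"

match = re.search(pattern, content)
print(match.group())  # one\two
```

With raw strings, the pattern you type is the pattern the re module receives, so only the regex layer of escaping remains.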
04:59 This makes it far easier to understand as you’re coding along. To summarize, both regexes and Python strings use backslashes for special characters. That becomes complicated when you’re using a string to describe a regex. Inside of a regex you should always specify the double backslash for the meta-character to be safe. If you’re searching for an actual backslash, it needs to be escaped twice—once for Python and once for the regex. Or alternatively, just use a raw string. It’s far easier to read and understand.
05:37 It’s time to go back into regex groups. So far, you’ve seen groups with numbers. You can also use named groups and non-capturing groups. I’m back inside pythex. Time to look at named groups.
05:52 I hope you’re comfortably seated. The regexes are only going to get longer and longer and more and more complicated.
06:01 Here’s your first named group. Let’s break it down together. On the outside of the regex is the meta-character \s, so you’re looking for something that is separated by whitespace. Between the whitespace characters is a grouping. The part you haven’t seen before is ?P<>.
06:22 This is the named group. ?P indicates that it’s a named group. What’s in between the angle brackets is what the name is. So in this case, the name is c1.
06:36 The remainder of the group is the regular expression, as before. This particular expression has a character class of non-vowels, a character class of small alphabetic letters, and that second class is repeated one or more times. On the right-hand side, just like before you can see the match groups, but now—instead of them being numbered—they’re named.
07:02 There’s only one group here, named c1, so every match has c1 in front of it instead of 1, like before.
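The same idea in Python might look like the sketch below. The exact pattern shown on screen is not in the transcript, so this one is reconstructed from the description, and the sample text is hypothetical.

```python
import re

# Assumed pattern, reconstructed from the description: whitespace, a named
# group c1 holding a non-vowel followed by lowercase letters, whitespace.
pattern = r"\s(?P<c1>[^aeiou][a-z]+)\s"

text = " suit failure today "  # hypothetical sample text

# Each match exposes its capture by name instead of by number.
names = [m.group("c1") for m in re.finditer(pattern, text)]
print(names)  # ['suit', 'today']
```

In this sample, 'failure' is skipped because the trailing \s of the first match consumes the space that would have to precede it.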
07:13 Here’s another one. This is actually the same regular expression twice, in two different named groups. The first group has ?P, which says it’s a named group, named c1. The second group has ?P as well and is named c2.
07:32 This regex is looking for the same pattern twice in a row, separated by a whitespace character. Both of these groups are looking for text that doesn’t start with a lowercase vowel.
07:44 Match captures shows the two named groups showing up in the result. The first result is in the subject line, ' suit failure', with c1 being 'suit' and c2 being 'failure'.
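A rough Python equivalent of the two-group version. Both the pattern and the sample line are assumptions reconstructed from the description, not copied from the screen.

```python
import re

# Assumed pattern: the same sub-expression in two named groups,
# separated by a whitespace character.
pattern = r"\s(?P<c1>[^aeiou][a-z]+)\s(?P<c2>[^aeiou][a-z]+)"

text = "Subject: suit failure"  # hypothetical line of text

match = re.search(pattern, text)
print(repr(match.group()))  # ' suit failure'
print(match.group("c1"))    # suit
print(match.group("c2"))    # failure
```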
08:00 Here that is inside of Python. content is the text to be searched.
08:11 Here’s the regular expression.
08:16 I’m going to pause for a moment here. See if you can figure out what will be in the groups.
08:23 The Match object shows you the portion of the string that matches, which is 'one, two, three',
08:31 and calling .groups() returns the values of 'one', 'two', and 'three'. I’m going to rewrite the exact same regular expression, this time using named groups.
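The numbered-group version described so far might look like this sketch. The pattern is an assumption that produces the results mentioned in the lesson.

```python
import re

content = "one, two, three"

# Assumed pattern: three numbered groups separated by a comma and a space.
match = re.search(r"(\w+), (\w+), (\w+)", content)

print(match.group())   # one, two, three
print(match.groups())  # ('one', 'two', 'three')
```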
08:41 So you see the capital P inside of the group to indicate the name, and I’m naming each of these groups first, second, and third, respectively.
08:53 match.groups() has the same content. There’s another method that I haven’t shown you before, which is .groupdict(). It returns a dictionary with the named contents, so now I’ve got a mapping between 'first', 'second', and 'third', the named groups in the regular expression, and the values 'one', 'two', and 'three', which are the contents of the matched groups.
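A sketch of the named-group rewrite, using the group names from the lesson and an assumed pattern for the rest:

```python
import re

content = "one, two, three"

# Group names first, second, and third come from the lesson; the
# sub-patterns and separators are assumptions.
pattern = r"(?P<first>\w+), (?P<second>\w+), (?P<third>\w+)"
match = re.search(pattern, content)

print(match.groups())         # ('one', 'two', 'three')
print(match.groupdict())      # {'first': 'one', 'second': 'two', 'third': 'three'}
print(match.group("second"))  # two
```

Named groups are still numbered, which is why .groups() returns the same tuple as before.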
09:18 This is a bit of a trade-off. It makes your code easier to read because you’re actually naming what you’re looking for, rather than referring to groups 1, 2, and 3, but it tends to make the regular expression harder to read because your group has the extra content in it to specify the group name.
09:39 Here’s a quick review of how backreferences work. The group here contains the meta-character \w, and the \1 is a backreference which refers to the first group.
09:54 Same thing goes with a named group. This group is named twice and \1 is a backreference that works the same as before.
10:05 You can also name a backreference. ?P with an equals (=) says to use the named backreference. Everything between the parentheses here is the equivalent of \1, the backreference from before, but this time it uses the named reference twice.
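The three backreference forms can be compared side by side in a sketch like this. The group name twice comes from the lesson; the sample text and the exact patterns are assumptions.

```python
import re

text = "hello"  # hypothetical sample text with a doubled letter

# Numbered group with a numbered backreference.
numbered = re.search(r"(\w)\1", text)

# Named group, still referenced by its number.
named_group = re.search(r"(?P<twice>\w)\1", text)

# Named group referenced by name with (?P=twice).
named_ref = re.search(r"(?P<twice>\w)(?P=twice)", text)

for m in (numbered, named_group, named_ref):
    print(m.group())  # ll

print(named_ref.group("twice"))  # l
```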
10:28 Here’s another set of groups. There are three groups here. The first is a non-vowel, the second is a character class with one or more vowels, and the third group is a non-vowel. If you look at the matches on the right-hand side, this matches things like 'S', 'e', 'p' for 'September'; 'T', 'o', ':'; and 's', 'ui', and 't' for 'suit'. So far, so good. What if you’re not interested in the piece in the middle, but still need the grouping to be able to match the expression?
10:58 ?: says “This is a non-capturing group.” The matches for the regex are the same as before but the groups in the Match captures on the right-hand side are different.
11:10 So notice, before, there were three groups for each match. Now there are only two. This middle group is not capturing. This can be useful for a couple of reasons. One, if you’re doing a lot of regex work, each group takes up some memory, so a non-capturing group tends to be less expensive.
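The difference between the capturing and non-capturing versions can be sketched in Python as follows. Both patterns are assumptions reconstructed from the description of the three groups.

```python
import re

text = "suit"

# Three capturing groups: non-vowel, one or more vowels, non-vowel.
capturing = re.search(r"([^aeiou])([aeiou]+)([^aeiou])", text)

# Same expression, but (?:...) keeps the middle group out of the captures.
non_capturing = re.search(r"([^aeiou])(?:[aeiou]+)([^aeiou])", text)

print(capturing.group(), capturing.groups())          # suit ('s', 'ui', 't')
print(non_capturing.group(), non_capturing.groups())  # suit ('s', 't')
```

The overall match is identical in both cases; only the list of captured groups changes.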
11:30 It’s also useful to use groups for repetitions or keeping the chunk of characters together. Essentially, it might be easier to write the regular expression using a group.
11:40
If you’re not actually going to use the sub-component group from the expression, then making it non-capturing will be more efficient. In the next lesson, I’ll introduce you to even more of the methods inside of the re
module.
Christopher Trudeau RP Team on March 27, 2022
Hi @Cindy,
Yeah, this can be a bit confusing. Python is trying to be helpful here and in doing so is kind of inconsistent. There is a family of escape sequences that are valid in a Python string. In the course so far, you’ve seen \n, \t, \', maybe a couple of others.
\w is not a valid string escape sequence. As such, if you use it, Python will assume you mean “backslash” and “w”. It can’t assume that when you put “\n” because it thinks you mean “newline”.
Because it is being helpful, you can get away with an un-escaped backslash. I find, personally, that that is more confusing than helpful, so I tend to either escape them or use a raw string where it doesn’t matter.
Cindy on March 30, 2022
Thank you Christopher, it looks like Python is tolerant with some regex : )
Cindy on March 27, 2022
Hi Christopher,
Could you please explain the reason '\w' was automatically converted as the same by Python and passed to the regex? Thank you.