Regex Grouping

Christopher Trudeau

Regular Expressions and Building Regexes in Python Christopher Trudeau 08:01

Transcript
Discussion (4)

00:00 In the previous lesson, I showed you how to multiply your regular expression using quantifiers. In this lesson, I’ll show you how to group parts of your regular expression together in subsets.

00:12 First, a little review of quantifiers. The asterisk (*) quantifier means zero or more, so 67*3 matches any number of '7's: three '7's, two '7's, one '7', and zero '7's on the right-hand side.

00:30 The plus (+) quantifier means one or more, so the match on the right changes. The first three are the same because they have one or more '7's, but the last one no longer matches. Zero '7's is not one or more.

00:45 Question mark (?) means zero or one matches. This affects the right-hand side with the third grouping of numbers having a match and the fourth grouping having a match with one '7' and zero '7's, respectively.
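
The three quantifiers reviewed above can be tried directly in Python. This is a minimal sketch: the sample strings are assumptions chosen to mirror the three, two, one, and zero '7's described here, and `matching` is a hypothetical helper, not something from the lesson.

```python
import re

# Assumed sample strings: three, two, one, and zero '7's between 6 and 3.
samples = ["67773", "6773", "673", "63"]


def matching(pattern):
    """Return the samples that the pattern matches in full (hypothetical helper)."""
    return [s for s in samples if re.fullmatch(pattern, s)]


star = matching(r"67*3")      # zero or more '7's
plus = matching(r"67+3")      # one or more '7's
optional = matching(r"67?3")  # zero or one '7'

print(star)      # ['67773', '6773', '673', '63']
print(plus)      # ['67773', '6773', '673']
print(optional)  # ['673', '63']
```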

01:01 By default, quantifiers are greedy. Here, a pair of angle brackets (<>), similar to what you would find in XML, with a .* inside consumes the entire '<two> three <four>' on the right-hand side.

01:17 The right angle bracket after the 'two' gets consumed because this is a greedy match. You can change the match to be not-greedy by adding a question mark (?).

01:28 This splits it up into matching the two pieces of XML, probably closer to what you were intending in the first place. Greediness also applies to the plus (+) quantifier.

01:39 Changing it to not-greedy consumes fewer digits. The '77' turns into a single '7' and the '48348' turns into a single '4'. The not-greedy version of one or more is one.

01:55 A question mark (?) means zero or one matches. The not-greedy version of this means zero matches. Not-greedy of zero and one is zero.
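
A short Python sketch of greedy versus not-greedy matching follows. The '<two> three <four>' string and the digits come from the description above; the exact patterns used for the digit and question-mark cases are assumptions, since the on-screen patterns are not in the transcript.

```python
import re

text = "<two> three <four>"

# Greedy .* runs to the last '>' on the line; lazy .*? stops at the first one.
greedy_tags = re.findall(r"<.*>", text)
lazy_tags = re.findall(r"<.*?>", text)

# One or more digits: greedy takes them all, lazy takes just one.
greedy_digits = re.search(r"\d+", "48348").group()
lazy_digits = re.search(r"\d+?", "48348").group()

# Zero or one '7': greedy takes one, lazy takes none (an empty match).
greedy_optional = re.search(r"7?", "77").group()
lazy_optional = re.search(r"7??", "77").group()

print(greedy_tags, lazy_tags)              # ['<two> three <four>'] ['<two>', '<four>']
print(greedy_digits, lazy_digits)          # 48348 4
print(repr(greedy_optional), repr(lazy_optional))  # '7' ''
```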

02:07 Curly braces indicate a number of matches. {3} here says “Look for 3 matching literal Bs.” You can put a range inside of the braces. This is 2, 3, or 4 matches. It is inclusive. If there are no numbers, the braces are taken literally, matching 'C{A}A' on the right-hand side.

02:33 The not-greedy version of this consumes the least number in the match—in this case, 3.
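
The curly-brace forms can be sketched in Python as well. The test string of five B's and the {2,4} range are assumptions for illustration, so the lazy result here is 2 rather than the 3 seen on screen.

```python
import re

text = "ABBBBBC"  # assumed sample: five B's in a row

exact = re.search(r"B{3}", text).group()           # exactly three B's
greedy_range = re.search(r"B{2,4}", text).group()  # as many as allowed, up to four
lazy_range = re.search(r"B{2,4}?", text).group()   # as few as allowed, here two

# No numbers inside the braces, so they are treated as literal characters.
literal = re.search(r"C{A}A", "xC{A}Ay").group()

print(exact)         # BBB
print(greedy_range)  # BBBB
print(lazy_range)    # BB
print(literal)       # C{A}A
```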

02:40 Regular expressions allow you to group parts of a pattern together. This grouping gives you access to portions of a match. By grouping things together and combining them with a quantifier, you can create strings of repeated sequences.

02:55 You can also use a mechanism called backreferences. A backreference is a reference to a previous grouped match inside of the expression.

03:05 I’m back inside of pythex, once again using the Wile E. Coyote email message for testing.

03:12 Here’s a regular expression of the literal 1 and any digit. As expected, this matches a whole bunch of parts of the text. You can create groups of expressions by using parentheses.

03:27 Here’s the same expression inside of a group. The matches in the text are the same—nothing’s changed. But now, on the right-hand side, you have access to the groupings of matches.

03:39 pythex shows these and allows you to get at each one of them. When doing this in Python, you can pull out these multiple parts from a single match result.
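
As a rough sketch of pulling group contents out of a match result in Python: the text is a fragment of the test message quoted in the discussion, and splitting the pattern into two groups, (1)(\d), is a variation assumed here to show more than one part.

```python
import re

text = "On Saturday, September 17, I endeavored"

# Two groups: the literal 1, then any digit.
match = re.search(r"(1)(\d)", text)

whole = match.group(0)   # the entire match
first = match.group(1)   # contents of the first group
second = match.group(2)  # contents of the second group
parts = match.groups()   # all groups as a tuple

print(whole)   # 17
print(first)   # 1
print(second)  # 7
print(parts)   # ('1', '7')
```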

03:52 Here, I’ve combined a group and a class. This is the literal letter b and then a grouping of any of the vowels repeated one or more times. This matches the 'be' in 'September' and 'beyond' and the 'boo' in 'boomerang'. Notice that the plus (+) is inside of the group.

04:12 This means the repetition is on the vowels. On the right-hand side in the Match captures area, you can see the vowel portions of the match. Though the 'b' is highlighted in the text, only the group portion is shown in the match on the right-hand side.

04:32 Changing the position of the + changes how this expression works. Now you have the b as before, the class with a vowel in it, and it’s the group that repeats. The matches in the text on the left look the same, but the groups on the right-hand side are different.

04:49 Particularly, Match 3 used to be 'oo' because the + was on the inside. Now with the + on the outside, the group is being repeated, so the group itself matches only one vowel at a time and the capture holds just a single 'o'.
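
The difference between the two placements of the + can be checked in Python. The sample text below is an assumption that strings together the three words mentioned above. Python's `re` keeps only the last repetition of a repeated group.

```python
import re

text = "September beyond boomerang"  # assumed sample built from the words above

# + inside the group: the group captures the whole run of vowels.
inside = [(m.group(0), m.group(1)) for m in re.finditer(r"b([aeiou]+)", text)]

# + outside the group: the group repeats, and only the last repetition is kept.
outside = [(m.group(0), m.group(1)) for m in re.finditer(r"b([aeiou])+", text)]

print(inside)   # [('be', 'e'), ('be', 'e'), ('boo', 'oo')]
print(outside)  # [('be', 'e'), ('be', 'e'), ('boo', 'o')]
```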

05:03 Backslash with a number is a backreference. This is looking for a match the same as a preceding group. \w inside of parentheses matches a word character.

05:17 \1 looks for whatever was matched in that group. So this regular expression is looking for the same two letters in a row. You can see this in the text, in the 'aa' and the 'bb', as well as the '99' in the model number.

05:34 Remember that \w matches alphabetic characters, digits, and the underscore. The grouping is around only the single letter, so the Match captures section on the right-hand side only has the initial match.

05:50 The backreference provides the other portion of the regular expression.
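
Here is a minimal Python sketch of the doubled-character backreference. The sample string is an assumption, made up to contain an 'aa', a 'bb', and a '99' like the ones described above.

```python
import re

text = "Baa, rubber, model X-99"  # assumed sample string

matches = list(re.finditer(r"(\w)\1", text))

full = [m.group(0) for m in matches]      # the whole doubled pair
captured = [m.group(1) for m in matches]  # only what the group itself matched

print(full)      # ['aa', 'bb', '99']
print(captured)  # ['a', 'b', '9']
```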

05:56 Well, that was fun. Here’s a little more of a challenge. Break the regular expression down to parts. Look at the first \b and then another \b.

06:07 Those are anchors looking for word boundaries, so the group between them has to be a whole word. The \w meta-character means a letter, a number, or an underscore. The curly braces indicate that this will be repeated 3 times.

06:25 So the first part of this regular expression says “A three-letter word, separated by word boundaries.” .* consumes zero or more characters, and then \1 is a backreference to whatever was in the group.

06:43 So the expression matches whatever was captured as this three-letter word, followed by some amount of text, and then that same text appearing again. This happens three times. The first match starts with the 'you' and ends in the 'you' portion of 'your' in the first sentence.

07:01 The second match is 'the' ending with 'the' a little later. And the third match, again, is a pair of 'the'.

07:10 Inside of the Match captures, you can see the portions of the group being pulled out, so you can see exactly what word was repeated. Pay particular attention to the first sentence.

07:21 There’s something a little subtle going on here. Although the first part of the match is looking for a word surrounded by a boundary, what is inside of the group is just the letters, so the group contains 'you'.

07:36 The bigger part of the expression outside of the group includes the word boundaries, but the backreference of Match 1 is only on the 'you'. It will match 'you' at the beginning of the word 'your'.
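
The subtlety with 'you' and 'your' can be reproduced in Python. The text is a fragment of the test message as quoted in the discussion. The stricter pattern with word boundaries around the backreference is an addition for contrast, not something shown in the lesson.

```python
import re

text = "you of a drastic failure of your ACME Super Outfit."

# The word boundaries sit outside the group, so \1 is just the letters 'you'.
match = re.search(r"\b(\w{3})\b.*\1", text)
whole = match.group(0)
word = match.group(1)

# Assumed variation: also requiring boundaries around \1 rejects 'your'.
strict = re.search(r"\b(\w{3})\b.*\b\1\b", text)

print(whole)   # you of a drastic failure of you
print(word)    # you
print(strict)  # None
```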

07:52 You’ve seen enough of regular expressions, the language inside the language. Now it’s time to actually see how to use them in Python.

teiturhelgason on Nov. 8, 2021

At around 1:30 into the video, I don’t fully understand how <.*?> leads to the non-greedy outcome that is displayed. What exactly is it that causes the regex to stop matching at the first “>”?

Bartosz Zaczyński RP Team on Nov. 8, 2021

@teiturhelgason It’s the question mark (?) appended to a quantifier that makes it lazy. Examples:

Greedy: .*, (foo)+
Lazy: .*?, (foo)+?

Alternatively, you can use a trick, which takes advantage of a negated character class to match a sequence until some terminating character. For instance, this will match everything between the angle brackets:

>>> import re

>>> re.findall(r"<.*>", "<h1>Hello <i>world</i></h1>")
['<h1>Hello <i>world</i></h1>']

>>> re.findall(r"<([^>]+)>", "<h1>Hello <i>world</i></h1>")
['h1', 'i', '/i', '/h1']

Dawn0fTime on Sept. 12, 2022

Is there a way to nest groups so that instead of having only ‘you’, ‘the’, and ‘the’ as groups you also get ‘you of a drastic failure of your ACME Super Outfit. On Saturday, September 17, I endeavored to use you’, etc. returned as groups? I tried simply adding parentheses around ‘\b(\w{3})\b.*\1’, but it didn’t work.

Christopher Trudeau RP Team on Sept. 12, 2022

Hi @DawnOfTime,

You can nest groups. They still show up as ordered groups, numbered in the order their opening parentheses appear. For example:

(d(\w{5}))

This will give you six-character segments of text beginning with “d”, taken from words at least six letters long. The first group is all six letters, and the second group is the five letters following the “d”. Using that in the same ACME text above gives you matches for “desert” & “esert” in the “From:” line, “drasti” and “rasti” from “drastic” in the body, and more.
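
A quick Python check of the nested pattern is sketched below. The sample string is an assumption containing just the two words mentioned above.

```python
import re

text = "desert drastic"  # assumed sample with the two words mentioned above

# findall returns one tuple per match: (outer group, inner group).
nested = re.findall(r"(d(\w{5}))", text)

print(nested)  # [('desert', 'esert'), ('drasti', 'rasti')]
```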

You don’t need to use nested groups to do what you described though. If I’m understanding your question correctly, you’ll get the full match through the regex and then the groups inside. Note the green text in the example in the video: it shows the full match, whereas the groups are inside the match.

I haven’t played with it too much, but I suspect that the reason you can’t just put parentheses around the regex you asked about is that it contains a greedy “*”.
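
One thing that may be worth checking, offered as an assumption rather than something established in this thread: wrapping the whole pattern in parentheses shifts the group numbers, so the three-letter word becomes group 2 and the backreference would need to be \2. A minimal sketch on a fragment of the same text:

```python
import re

text = "you of a drastic failure of your ACME Super Outfit."

# The outer parentheses become group 1; the three-letter word is now group 2,
# so the backreference is written as \2 instead of \1.
match = re.search(r"(\b(\w{3})\b.*\2)", text)
outer, word = match.groups()

print(outer)  # you of a drastic failure of you
print(word)   # you
```

In Python the full match is also available as `match.group(0)` without any outer parentheses.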
