Regex Grouping

00:00 In the previous lesson, I showed you how to multiply your regular expression using quantifiers. In this lesson, I’ll show you how to group parts of your regular expression together in subsets.

00:12 First, a little review of quantifiers. The asterisk (*) quantifier means zero or more, so 67*3 matches a 6, then any number of '7's, then a 3: three '7's, two '7's, one '7', and zero '7's on the right-hand side.

00:30 The plus (+) quantifier means one or more, so the match on the right changes. The first three are the same because they have one or more '7's, but the last one no longer matches. Zero '7's is not one or more.

00:45 The question mark (?) means zero or one matches. On the right-hand side, the third and fourth groupings of numbers match, with one '7' and zero '7's, respectively.
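The three quantifiers reviewed so far can be tried out in Python as a minimal sketch. The test string below is an assumption that mimics the groups of digits described above; it is not the exact text from the video.

```python
import re

# Hypothetical test string: a 6 and a 3 with three, two, one, and zero 7s between them.
text = "67773 6773 673 63"

star = re.findall(r"67*3", text)      # * means zero or more 7s
plus = re.findall(r"67+3", text)      # + means one or more 7s
optional = re.findall(r"67?3", text)  # ? means zero or one 7

print(star)      # ['67773', '6773', '673', '63']
print(plus)      # ['67773', '6773', '673']
print(optional)  # ['673', '63']
```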

01:01 By default, quantifiers are greedy. The pattern here, a pair of angle brackets (<>) like you would find in XML with a .* inside, consumes the entire '<two> three <four>' on the right-hand side.

01:17 The right angle bracket after the 'two' gets consumed because this is a greedy match. You can change the match to be not-greedy by adding a question mark (?) after the quantifier.

01:28 This splits it up into matching the two pieces of XML, probably closer to what you were intending in the first place. Greediness also applies to the plus (+) quantifier.
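Here is a small sketch of the same greedy versus not-greedy comparison in Python, using only the '<two> three <four>' string quoted above. The variable names are just for illustration.

```python
import re

text = "<two> three <four>"

# Greedy: .* runs to the last > it can find.
greedy = re.findall(r"<.*>", text)

# Not-greedy: .*? stops at the first > it can find.
lazy = re.findall(r"<.*?>", text)

print(greedy)  # ['<two> three <four>']
print(lazy)    # ['<two>', '<four>']
```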

01:39 Changing it to not-greedy consumes fewer digits. The '77' turns into a single '7' and the '48348' turns into a single '4'. The not-greedy version of one or more is one.

01:55 A question mark (?) means zero or one matches. The not-greedy version of this means zero matches. The not-greedy choice between zero and one is zero.
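The not-greedy forms of + and ? can be sketched in Python like this. The patterns and the short test strings are assumptions chosen to echo the digits mentioned above, not the exact expressions from the video.

```python
import re

# Greedy + takes every digit it can; not-greedy +? settles for one.
greedy_plus = re.search(r"\d+", "48348").group()
lazy_plus = re.search(r"\d+?", "48348").group()

# Greedy ? takes the optional 7 when it is there; not-greedy ?? skips it.
greedy_optional = re.search(r"67?", "677").group()
lazy_optional = re.search(r"67??", "677").group()

print(greedy_plus, lazy_plus)          # 48348 4
print(greedy_optional, lazy_optional)  # 67 6
```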

02:07 Curly braces indicate a number of matches. {3} here says “Look for 3 matching literal Bs.” You can put a range inside of the braces. This is 2, 3, or 4 matches. It is inclusive. If the braces do not contain numbers, they are taken literally, matching 'C{A}A' on the right-hand side.

02:33 The not-greedy version of this consumes the least number in the match—in this case, 3.
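As a rough sketch, the curly-brace forms behave like this in Python. The test strings are made up for illustration, and the 2-to-4 range is the one described above.

```python
import re

# Exactly three Bs.
exact = re.findall(r"B{3}", "ABBBC")

# Between two and four Bs, inclusive; a lone B is too short to match.
ranged = re.findall(r"B{2,4}", "B BB BBB BBBB")

# The not-greedy range takes the fewest Bs it can, so four Bs split into two matches.
lazy_ranged = re.findall(r"B{2,4}?", "BBBB")

# No numbers inside the braces, so the braces are matched literally.
literal = re.findall(r"C{A}A", "C{A}A CA")

print(exact)        # ['BBB']
print(ranged)       # ['BB', 'BBB', 'BBBB']
print(lazy_ranged)  # ['BB', 'BB']
print(literal)      # ['C{A}A']
```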

02:40 Regular expressions allow you to group parts of a pattern together. This grouping gives you access to portions of a match. By grouping things together and combining them with a quantifier, you can create strings of repeated sequences.

02:55 You can also use a mechanism called backreferences. A backreference is a reference to a previous grouped match inside of the expression.

03:05 I’m back inside of pythex, once again using the Wile E. Coyote email message for testing.

03:12 Here’s a regular expression of the literal 1 and any digit. As expected, this matches a whole bunch of parts of the text. You can create groups of expressions by using parentheses.

03:27 Here’s the same expression inside of a group. The matches in the text are the same—nothing’s changed. But now, on the right-hand side, you have access to the groupings of matches.

03:39 pythex shows these and allows you to get at each one of them. When doing this in Python, you can pull out these multiple parts from a single match result.
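In Python, pulling the groups out of a match might look like the sketch below. The sample sentence is hypothetical and stands in for the email text used in pythex. The two-group pattern is an extra illustration that is not shown in the video.

```python
import re

# Hypothetical sample text, not the email from the video.
text = "Order 15 units of model 1299 by 2018."

# One group around the whole expression: the literal 1 and any digit.
match = re.search(r"(1\d)", text)
whole = match.group(0)   # the entire match
first = match.group(1)   # the contents of the first group

# With a group in the pattern, findall returns the group contents.
captures = re.findall(r"(1\d)", text)

# Two groups give two parts from a single match result.
parts = re.search(r"(1)(\d)", text).groups()

print(whole, first)  # 15 15
print(captures)      # ['15', '12', '18']
print(parts)         # ('1', '5')
```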

03:52 Here, I’ve combined a group and a class. This is the literal letter b and then a grouping of any of the vowels repeated one or more times. This matches the 'be' in 'September' and 'beyond' and the 'boo' in 'boomerang'. Notice that the plus (+) is inside of the group.

04:12 This means the repetition is on the vowels. On the right-hand side in the Match captures area, you can see the vowel portions of the match. Though the 'b' is highlighted in the text, only the group portion is shown in the match on the right-hand side.

04:32 Changing the position of the + changes how this expression works. Now you have the b as before, the class with a vowel in it, and it’s the group that repeats. The matches in the text on the left look the same, but the groups on the right-hand side are different.

04:49 Particularly, Match 3 used to be 'oo' because the + was on the inside. Now with the + on the outside, the group itself is being repeated, and each repetition matches only a single vowel, so the capture holds just one 'o'.
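The effect of moving the + can be sketched in Python. The test string below is an assumption built from the three words mentioned above. Note that when a group repeats, Python keeps only the last repetition in the capture.

```python
import re

# Hypothetical test string made from the words mentioned in the lesson.
text = "September beyond boomerang"

# + inside the group: the group captures the whole run of vowels.
inside = re.findall(r"b([aeiou]+)", text)

# + outside the group: the group repeats and keeps only its last single vowel.
outside = re.findall(r"b([aeiou])+", text)

# The full matched text is the same for both patterns.
full_inside = [m.group(0) for m in re.finditer(r"b([aeiou]+)", text)]
full_outside = [m.group(0) for m in re.finditer(r"b([aeiou])+", text)]

print(inside)        # ['e', 'e', 'oo']
print(outside)       # ['e', 'e', 'o']
print(full_inside)   # ['be', 'be', 'boo']
print(full_outside)  # ['be', 'be', 'boo']
```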

05:03 Backslash with a number is a backreference. This is looking for a match the same as a preceding group. \w inside of parentheses matches a word character.

05:17 \1 looks for whatever was matched in that group. So this regular expression is looking for the same two letters in a row. You can see this in the text, in the 'aa' and the 'bb', as well as the '99' in the model number.

05:34 Remember that \w matches alphabetic characters, digits, and the underscore. The grouping is around only the single letter, so the Match captures section on the right-hand side only has the initial match.

05:50 The backreference provides the other portion of the regular expression.
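A minimal Python sketch of this backreference follows. The test string is hypothetical and was picked so that it contains a doubled 'a', a doubled 'b', and a '99', like the matches described above.

```python
import re

# Hypothetical test string with doubled characters in it.
text = "aardvark rabbit model X-99"

pattern = r"(\w)\1"  # one word character, then the same character again

# group(0) is the full match, including the part supplied by the backreference.
doubles = [m.group(0) for m in re.finditer(pattern, text)]

# group(1) is only the first character, which is all the group captured.
captured = [m.group(1) for m in re.finditer(pattern, text)]

print(doubles)   # ['aa', 'bb', '99']
print(captured)  # ['a', 'b', '9']
```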

05:56 Well, that was fun. Here’s a little more of a challenge. Break the regular expression down into parts. Look at the first \b and then another \b.

06:07 Those are anchors looking for word breaks, so the group inside is a whole word set off by word boundaries. The \w meta-character means a letter, a number, or an underscore. The curly braces indicate that this will be repeated 3 times.

06:25 So the first part of this regular expression says “A three-letter word, separated by word boundaries.” .* consumes zero or more characters, and then \1 is a backreference to whatever was in the group.

06:43 So the expression matches a three-letter word, then some amount of text, and then those same three letters appearing again. This happens three times in the text. The first match starts with the 'you' and ends in the 'you' portion of 'your' in the first sentence.

07:01 The second match is 'the' ending with 'the' a little later. And the third match, again, is a pair of 'the'.

07:10 Inside of the Match captures, you can see the portions of the group being pulled out, so you can see exactly what word was repeated. Pay particular attention to the first sentence.

07:21 There’s something a little subtle going on here. Although the first part of the match is looking for a word surrounded by a boundary, what is inside of the group is just the letters, so the group contains 'you'.

07:36 The bigger part of the expression outside of the group includes the word boundaries, but the backreference of Match 1 is only on the 'you'. It will match 'you' at the beginning of the word 'your'.
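This subtlety can be reproduced in Python with a short sketch. The sentence below is made up to contain 'you' followed later by 'your'; it is not the actual first sentence of the email in the video.

```python
import re

# Hypothetical sentence with 'you' followed later by 'your'.
text = "Did you get your package?"

# A three-letter word between word boundaries, any text, then the same letters again.
pattern = r"\b(\w{3})\b.*\1"

match = re.search(pattern, text)
full_match = match.group(0)     # runs from 'you' through the 'you' inside 'your'
repeated_word = match.group(1)  # only the letters captured by the group

print(full_match)     # you get you
print(repeated_word)  # you
```

'Did' is also a three-letter word, but it never appears again in the sentence, so the search moves on to 'you'. The backreference carries no word boundaries of its own, which is why it can match inside 'your'.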

07:52 You’ve seen enough of regular expressions, the language inside the language. Now it’s time to actually see how to use them in Python.
