The Python re Module

00:00 In the previous lesson, I finished up the language of regular expressions and introduced the concept of grouping. In this lesson, I’ll be showing you how to use regular expressions inside of Python.

00:13 First, a little review. The regex on the left has some grouping in it. The square brackets indicate a character class of the vowels, the parentheses around it are a group of that choice of vowel, and the r is literal.

00:28 This highlights the 'ar', 'or', and 'er' in the string, creating three groups matching an 'a', an 'o', and an 'e'. Later when I show you how to do this in Python, you’ll actually be able to access this content. Here’s another one, this time a character class of the digits 4 to 9, and that is grouped, and then one or more of those can happen using the plus (+) quantifier. Notice that this matches the '45', the '9', and the '888', but the grouping is the '5', the '9', and the '8'.

01:05 When you apply a quantifier to a group, only the last repetition of the group is captured. So the '4' and the '5' match the regex, but only the '5' ends up in the group.

01:15 The '9' matches the regex and ends up in the group, and the '888' matches the regex and the last '8' ends up in the group.
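As a rough sketch of both review patterns in Python: the patterns `([aeiou])r` and `([4-9])+` are reconstructed from the description above, and the sample strings are made up for illustration, since the slide text is not part of the transcript.

```python
import re

# Hypothetical sample strings; the slide's originals are not in the transcript.
vowel_text = "bar for her"
digit_text = "a45b9c888"

# A vowel captured in a group, followed by a literal "r".
vowel_hits = [(m.group(0), m.group(1))
              for m in re.finditer(r"([aeiou])r", vowel_text)]
print(vowel_hits)  # [('ar', 'a'), ('or', 'o'), ('er', 'e')]

# A quantified group: the whole run matches, but the group keeps
# only the last repetition.
digit_hits = [(m.group(0), m.group(1))
              for m in re.finditer(r"([4-9])+", digit_text)]
print(digit_hits)  # [('45', '5'), ('9', '9'), ('888', '8')]
```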

01:25 This regex looks for the literal letters 'the' grouped together, followed by zero or more repetitions of any character, and then uses a backreference.

01:36 So you’re looking for whatever matched the first group happening again inside of the string. What you get is a match that stretches from one 'the' to a later 'the'.

01:47 All of the red text matches, but only the 'the' ends up in the group.
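A small sketch of that backreference, assuming the pattern is `(the).*\1` (inferred from the description) and using a made-up sentence rather than the one on the slide:

```python
import re

# Hypothetical sentence chosen for illustration.
text = "the cat sat on the mat"

# (the) captures the literal, .* consumes anything, \1 requires the
# captured text to appear again.
backref = re.search(r"(the).*\1", text)

print(backref.group(0))  # the cat sat on the
print(backref.group(1))  # the
```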

01:53 You’ve been pretty patient with me up until now. Everything’s been about regexes without really talking about Python, so now I’m going to show you how to use this inside of Python.

02:04 The re—short for regular expression—library is a standard part of Python. Most of the methods inside of the re module take a string pattern—which is the regex—and a string to search against, and return a result.

02:19 The result is usually a Match object. This Match object gives information about the match—whether or not a match happened and what portions of the string matched the result. Match objects are truthy.

02:33 That means they can be evaluated as Booleans, so you can use a re function that returns one of these objects and then test the object in an if statement to see whether or not a match happened. We’ll start out by importing the module.

02:52 This question will be my test string.

02:57 Using the search() method inside of the re module returns a Match object. In this line, I’ve searched for the literal expression "spam" inside of the question. As a Match object was returned, that tells you that 'spam' was successfully found within the question.

03:16 The span value shown in the Match object tells you where the match happened. Here, it is between indices 7 and 11.

03:25 The numbers in the span work like a slice of a list or a string: the match starts at index 7 and finishes at index 10. The 11 is the upper limit and is not included.

03:39 This is the opposite of the curly-bracket quantifier inside of the regular expressions themselves, where the upper bound is inclusive. It can be a little confusing as you switch back and forth between the two mechanisms, but the span of the Match follows the Pythonic slicing convention.

03:57 Here, you can see I’ve sliced question using the 7 and 11 from the span, and I get back 'spam', the match from the string.
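The search and slice steps might look roughly like this. The content of `question` is an assumption: the transcript never spells it out, but "Lovely spam! Wonderful spam!" is quoted later in the lesson and is consistent with the span of 7 to 11.

```python
import re

# Assumed test string; consistent with the span reported in the lesson.
question = "Lovely spam! Wonderful spam!"

# search() scans the whole string for the first occurrence of the pattern.
# Printing the result shows a Match object with span=(7, 11).
print(re.search("spam", question))

# The span behaves like slice bounds: start included, end excluded.
sliced = question[7:11]
print(sliced)  # spam
```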

04:06 I’ll do that again, this time storing it in a variable.

04:13 Evaluating this variable as a Boolean returns True, indicating that a match was found. I can call a method named .span() on the Match object, which returns the lower and upper boundaries. Another method, .start(), gives the lower boundary, and finally .end() gives the upper boundary.
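Those calls, sketched against the same assumed test string (the variable name `found` is a placeholder, not the name used in the video):

```python
import re

question = "Lovely spam! Wonderful spam!"  # assumed test string

found = re.search("spam", question)

print(bool(found))    # True
print(found.span())   # (7, 11)
print(found.start())  # 7
print(found.end())    # 11
```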

04:36 The .string attribute shows you what was being matched against. Somewhat confusingly, the re module also has a function called match().

04:46 To be clear as I’m moving forward, if I’m talking about the function, I will be explicit and say the match() function. Otherwise, I’m talking about a resulting Match object.

04:58 The match() function matches the beginning of the string. This is the equivalent of using a caret anchor (^) inside of your regex.

05:07 This did not return anything, and that’s because no match was found. Let me do that again, this time storing it in a variable.

05:18 The variable doesn’t contain anything.

05:22 Comparing it to None shows that it’s True. Or, converting it to a Boolean means it’s False. This shows you how you could test the results of your regular expression functions inside of an if statement.
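A sketch of the failed match() call and the kind of if statement test mentioned above, again assuming the test string shown; `status` is a placeholder name used only for this illustration.

```python
import re

question = "Lovely spam! Wonderful spam!"  # assumed test string

# match() only looks at the start of the string, and the string
# starts with "Lovely", not "spam".
at_start = re.match("spam", question)

print(at_start is None)  # True
print(bool(at_start))    # False

# None is falsy and a Match object is truthy, so the result
# can drive an if statement directly.
if at_start:
    status = "found at the start"
else:
    status = "not found at the start"
print(status)  # not found at the start
```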

05:42 This regex was successful. It’s looking for five repetitions of the word meta-character. As the string starts with 'Lovel', which are all word characters, the match result shows a span of 0 to 5. Python 3.4 added a function called fullmatch().
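The successful call probably resembles the following. The pattern `\w{5}` is inferred from "five word characters" and the test string is assumed, so treat both as illustrative.

```python
import re

question = "Lovely spam! Wonderful spam!"  # assumed test string

# \w{5} asks for exactly five word characters; match() anchors
# the attempt at index 0.
five = re.match(r"\w{5}", question)

print(five.span())   # (0, 5)
print(five.group())  # Lovel
```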

06:06 As you might guess from its name, it’s looking for a regular expression that matches the entire string. Of course, looking for "spam", that’s not going to match the whole string,

06:20 so once again, you’re getting back a None object.
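For comparison, a short sketch of fullmatch() failing on the assumed test string and succeeding when the pattern covers an entire string:

```python
import re

question = "Lovely spam! Wonderful spam!"  # assumed test string

# "spam" occurs inside the string but does not cover all of it.
partial = re.fullmatch("spam", question)
print(partial)  # None

# The same pattern does cover the whole of the string "spam".
whole = re.fullmatch("spam", "spam")
print(whole.span())  # (0, 4)
```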

06:29 Let’s break this regular expression down. Looking at the inner group first, there’s the word meta-character with zero or more instances, followed by the whitespace character with zero or more instances.

06:41 So I’m looking for something that looks like an actual word. That is inside of a group. That group repeats itself zero or more times, and then is followed by an exclamation mark. All of that is grouped.

06:57 The outside group can be repeated zero or more times. This is successful because it matches the two sub-parts of this string. "Lovely spam!" and "Wonderful spam!" each match the outer group. And because the outer group is repeated, this regular expression matches the entire string, so fullmatch() returns a truthy Match object.
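One pattern that fits this description is `((\w*\s*)*!)*`. That exact pattern is a reconstruction from the spoken breakdown and may differ from what was typed on screen; the test string is assumed as well.

```python
import re

question = "Lovely spam! Wonderful spam!"  # assumed test string

# Inner group: optional word characters, then optional whitespace.
# The inner group repeats, is followed by "!", and the whole thing
# is an outer group that repeats as well.
pattern = r"((\w*\s*)*!)*"

entire = re.fullmatch(pattern, question)

print(bool(entire))    # True
print(entire.span())   # (0, 28)
```

Nested quantifiers like these are fine for a short demonstration, but they can backtrack heavily on long inputs that fail to match.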

07:22 Another function the library has is findall().

07:27 Unlike the other functions I’ve shown you so far, findall() doesn’t return a Match object—it returns a list. It applies the regular expression and finds each match inside of the string, returning the matching characters in a list.

07:43 This regular expression is looking for a vowel, followed by not a vowel. Inside of "Lovely spam! Wonderful spam!" you have 'ov' from 'Lovely', 'el' from 'Lovely', 'am' from 'spam', et cetera.
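A sketch of that call, assuming the pattern is written as `[aeiou][^aeiou]` (lowercase vowels only) and run against the string quoted above:

```python
import re

question = "Lovely spam! Wonderful spam!"

# A lowercase vowel followed by any character that is not a lowercase vowel.
pairs = re.findall(r"[aeiou][^aeiou]", question)

print(pairs)  # ['ov', 'el', 'am', 'on', 'er', 'ul', 'am']
```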

07:58 findall() returns a list. Sometimes instead of wanting a list, you want an iterator. Enter finditer(). It essentially does the same thing as findall() but returns an iterator instead of a list. This is more efficient in memory if you’re doing a large number of matches.
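A comparable finditer() sketch with the same assumed pattern. Each item it yields is a Match object, so the position of every hit is available as well as the text.

```python
import re

question = "Lovely spam! Wonderful spam!"

# finditer() produces Match objects lazily instead of building a list.
matches = re.finditer(r"[aeiou][^aeiou]", question)

first = next(matches)
print(first.group(), first.span())  # ov (1, 3)

# The remaining matches are consumed one at a time.
rest = [m.group() for m in matches]
print(rest)  # ['el', 'am', 'on', 'er', 'ul', 'am']
```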

08:17 Well, that was your first exposure to using regular expressions inside of Python. Next up, I’ll show you how to take advantage of grouped results.
