The Python re Module
00:00 In the previous lesson, I finished up the language of regular expressions and introduced the concept of grouping. In this lesson, I’ll be showing you how to use regular expressions inside of Python.
00:13 First, a little review. The regex on the left has some grouping in it. The square brackets indicate a character class of the vowels, the parentheses around it are a group of that choice of vowel, and the r is literal.
00:28 This highlights the 'ar', 'or', and 'er' in the string, creating three groups matching an 'a', an 'o', and an 'e'. Later when I show you how to do this in Python, you’ll actually be able to access this content. Here’s another one, this time a character class of the digits 4 to 9, and that is grouped, and then one or more of those can happen using the plus (+) quantifier. Notice that this matches the '45', the '9', and the '888', but the grouping is the '5', the '9', and the '8'.
01:05 When you apply a quantifier to a group, only the last repetition of the group gets captured. So the '4' and the '5' both match the regex, but only the '5' ends up in the group.
01:15 The '9' matches the regex and ends up in the group, and the '888' matches the regex and the last '8' ends up in the group.
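As a rough Python sketch of the quantified group described above: the sample text here is hypothetical, since the string on the slide is not part of this transcript. It also leans on re.finditer() and the .group() method of Match objects, which are real parts of the re module but are only covered later in the course.

```python
import re

# Hypothetical sample text; the string shown on the slide is not in the transcript.
text = "345 192 888"

# ([4-9])+ is a group holding one digit from 4 to 9, repeated one or more times.
pattern = r"([4-9])+"

# group(0) is the whole match; group(1) keeps only the last repetition of the group.
whole_matches = [m.group(0) for m in re.finditer(pattern, text)]
captured = [m.group(1) for m in re.finditer(pattern, text)]

print(whole_matches)  # ['45', '9', '888']
print(captured)       # ['5', '9', '8']
```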
01:25 This regex looks for the literal letters the grouped together, followed by zero or more repetitions of any character, and then uses a backreference.
01:36 So you’re looking for whatever matched the first group happening again inside of the string. So what you’re getting is a match that runs from one 'the' to a later 'the'.
01:47 All of the red text matches, but only the 'the' ends up in the group.
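The backreference can be tried out along these lines. This is only a sketch: the sample sentence is made up for illustration, and the pattern (the).*\1 is one plausible spelling of the regex described above, not necessarily the one on the slide.

```python
import re

# Hypothetical sample text; the slide's string is not in the transcript.
text = "over the hill and past the river"

# (the) captures the literal letters, .* allows any characters in between,
# and \1 requires whatever group 1 captured to appear again.
match = re.search(r"(the).*\1", text)

whole_match = match.group(0)  # everything from the first 'the' to the last 'the'
captured = match.group(1)     # only the grouped 'the'

print(whole_match)  # the hill and past the
print(captured)     # the
```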
01:53 You’ve been pretty patient with me up until now. Everything’s been about regexes without really talking about Python, so now I’m going to show you how to use this inside of Python.
02:04 The re library (short for regular expression) is a standard part of Python. Most of the functions inside of the re module take a string pattern, which is the regex, and a string to search against, and return a result.
02:19 The result is usually a Match object. This Match object gives information about the match: whether or not a match happened and what portion of the string matched the regex. Match objects are truthy.
02:33 That means they can be evaluated as Booleans, so you can call a re function that returns one of these objects and then test the object in an if statement to see whether or not a match happened. We’ll start out by importing the module.
02:52 This question will be my test string.
02:57 Using the search() function inside of the re module returns a Match object. In this line, I’ve searched for the literal expression "spam" inside of the question. As a Match object was returned, that tells you that 'spam' was successfully found within the question.
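For readers following along without the video, the steps so far might look like the sketch below. The value of question is reconstructed from the phrases quoted later in this lesson, so treat it as an assumption about what was on screen.

```python
import re

# Test string reconstructed from the phrases quoted later in the lesson.
question = "Lovely spam! Wonderful spam!"

# search() scans the string and returns a Match object, or None if nothing matches.
result = re.search("spam", question)
print(result)  # the repr of the Match object includes the span and the matched text

# A Match object is truthy and None is falsy, so the result works directly in an if.
if result:
    found = True
else:
    found = False

print(found)  # True
```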
03:16 The span value shown in the Match object tells you where the match happened. This is between indexes 7 and 11.
03:25 The numbers in the span are equivalent to a slice of a list or a slice of a string, so this indicates that the match starts at index 7 and finishes at index 10. The 11 is the upper limit, not included.
03:39 This is the opposite of the curly brackets inside of the regular expressions themselves, where the upper bound is included. It can be a little confusing as you switch back and forth between the two mechanisms, but the span of the Match follows the Pythonic slicing convention.
03:57 Here, you can see I’ve sliced question using the 7 and 11 from the span, and I get back 'spam', the match from the string.
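The slicing step can be checked with a few lines like these. The question string is again the one reconstructed from quotes elsewhere in the transcript.

```python
import re

question = "Lovely spam! Wonderful spam!"
result = re.search("spam", question)

# The span for this search is (7, 11): start included, end excluded,
# which is the same convention Python uses for slices.
sliced = question[7:11]

print(sliced)  # spam
```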
04:06 I’ll do that again, this time storing it in a variable.
04:13 Evaluating this variable as a Boolean returns True, indicating that a match was found. I can call a method named .span() on the Match object that returns the lower and upper boundaries, another method named .start() that returns the lower boundary, and finally .end() to give you the upper boundary.
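A short sketch of those calls, assuming the same reconstructed question string and a variable name (result) chosen here for illustration:

```python
import re

question = "Lovely spam! Wonderful spam!"
result = re.search("spam", question)

is_match = bool(result)  # True, because a Match object is truthy
bounds = result.span()   # (7, 11), the lower and upper boundaries
lower = result.start()   # 7
upper = result.end()     # 11

print(is_match, bounds, lower, upper)
```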
04:36 The .string attribute shows you what was being matched against. Somewhat confusingly, the re module also has a function called match().
04:46 To be clear as I’m moving forward, if I’m talking about the function, I will be explicit and say the match() function. Otherwise, I’m talking about a resulting Match object.
04:58 The match() function only matches at the beginning of the string. This is the equivalent of using a caret anchor (^) at the start of your regex.
05:07 This did not return anything, and that’s because no match was found. Let me do that again, this time storing it in a variable.
05:18 The variable doesn’t contain anything.
05:22 Comparing it to None gives True. Or, converting it to a Boolean gives False. This shows you how you could test the results of your regular expression functions inside of an if statement.
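Here is a minimal sketch of that failed match and the tests on it. The transcript does not say which pattern was on screen, so "spam" is an assumption; any pattern that does not match at the start of the string behaves the same way.

```python
import re

question = "Lovely spam! Wonderful spam!"

# Assumed pattern: 'spam' is in the string, but not at the beginning,
# so match() finds nothing and returns None.
result = re.match("spam", question)

is_none = result is None  # True
as_bool = bool(result)    # False

# The usual way to branch on the outcome of a regex call.
if result:
    outcome = "matched at the start"
else:
    outcome = "no match at the start"

print(is_none, as_bool, outcome)
```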
05:42 This regex was successful. It’s looking for 5 repetitions of the word meta-character. As the string starts with 'Lovel', which are all word characters, the match succeeds, showing a span of 0 to 5. Python 3.4 added a function called fullmatch().
06:06 As you might guess from its name, it’s looking for a regular expression that matches the entire string. Of course, looking for "spam", that’s not going to match the whole string,
06:20 so once again, you’re getting back a None object.
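The last two calls might look like the following sketch. The pattern \w{5} is an assumed spelling of "five word characters"; the regex on screen may have been written differently.

```python
import re

question = "Lovely spam! Wonderful spam!"

# Assumed pattern: five word characters, anchored at the start by match().
start_match = re.match(r"\w{5}", question)
print(start_match.span())   # (0, 5)
print(start_match.group())  # Lovel

# fullmatch() (Python 3.4+) succeeds only if the pattern covers the whole string.
whole = re.fullmatch("spam", question)
print(whole)  # None
```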
06:29 Let’s break this regular expression down. Looking at the inner group first, there’s the word meta-character with zero or more instances, there’s the whitespace character with zero or more instances.
06:41 So I’m looking for something that looks like an actual word. That is inside of a group. That group repeats itself zero or more times, and then is followed by an exclamation mark. All of that is grouped.
06:57 The outside group can be repeated zero or more times. This is successful because it matches the two sub-parts of this string. "Lovely spam!" and "Wonderful spam!" each match the outer group. And because the outer group is repeated, this regular expression matches the entire string, giving a truthy result for fullmatch().
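One way to write the regex described above is sketched here. The exact pattern is not in the transcript, so ((\w*\s*)*!)* is a reconstruction from the spoken description and may differ from what was on screen.

```python
import re

question = "Lovely spam! Wonderful spam!"

# Reconstructed pattern (an assumption):
#   (\w*\s*)*   inner group: word characters then whitespace, repeated
#   !           a literal exclamation mark after the words
#   ( ... )*    outer group: the whole "words plus !" unit, repeated
pattern = r"((\w*\s*)*!)*"

result = re.fullmatch(pattern, question)

print(bool(result))   # True
print(result.span())  # (0, 28), the whole string
```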
07:22 Another function the library has is findall().
07:27 Unlike the other functions I’ve shown you so far, findall() doesn’t return a Match object. It returns a list. It applies the regular expression and finds each match inside of the string, returning the matching characters in a list.
07:43 This regular expression is looking for a vowel followed by a character that is not a vowel. Inside of "Lovely spam! Wonderful spam!" you have 'ov' from 'Lovely', 'el' from 'Lovely', 'am' from 'spam', et cetera.
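A sketch of that call, assuming the pattern was written with two character classes, [aeiou] followed by the negated class [^aeiou]:

```python
import re

question = "Lovely spam! Wonderful spam!"

# Assumed pattern: one lowercase vowel, then any single character that is not one.
pairs = re.findall(r"[aeiou][^aeiou]", question)

print(pairs)  # ['ov', 'el', 'am', 'on', 'er', 'ul', 'am']
```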
07:58 findall() returns a list. Sometimes instead of wanting a list, you want an iterator. Enter finditer(). It essentially does the same thing as findall() but returns an iterator of Match objects instead of a list of strings. This is more memory efficient if you’re dealing with a large number of matches.
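The same search with finditer() could be sketched like this, again assuming the vowel pattern [aeiou][^aeiou]. Each item the iterator yields is a Match object, so the span is available as well as the text.

```python
import re

question = "Lovely spam! Wonderful spam!"

# finditer() yields Match objects one at a time instead of building a list up front.
spans = []
for match in re.finditer(r"[aeiou][^aeiou]", question):
    print(match.group(), match.span())
    spans.append(match.span())
```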
08:17 Well, that was your first exposure to using regular expressions inside of Python. Next up, I’ll show you how to take advantage of grouped results.
Christopher Trudeau RP Team on Jan. 19, 2021
Hi @Walid,
Yep, I find I use .search() and .findall() the most myself, but there are cases where you’re looking for the beginning to match. Think of it like using .startswith in strings instead of in.
If you are only looking at the beginning, .match() will definitely be faster, especially for longer chunks of data to be matched against.
Bartosz Zaczyński RP Team on Jan. 20, 2021
@Walid A common use case for re.match() is password or input validation, although re.fullmatch() is usually a better choice in such a case.
Walid on Jan. 20, 2021
Thank you @Bartosz. I wish there was a like button ;-)
nullrealpython on April 17, 2021
Christopher, your clarification of what multiline mode means may well have launched my career. Thanks! Whatever we are paying you… it ain’t enough.
Christopher Trudeau RP Team on April 17, 2021
Glad you’re finding the course helpful.
VitaminC on Dec. 12, 2023
Great course, Christopher.
How would I search for “nine”, “eight” and “ten” in the string “nineighten”?
Bartosz Zaczyński RP Team on Jan. 2, 2024
@VitaminC Using regular expressions to find sequences of characters that overlap can be tricky. However, it’s doable with the help of a so-called lookahead, which allows the regex engine to match a certain pattern without actually consuming the characters it matched:
>>> import re
>>> text = "nineighten"
>>> re.findall(r"(?=(nine|eight|ten))", text)
['nine', 'eight', 'ten']
Walid on Jan. 19, 2021
re.match() seems like a very restrictive use case! Not sure why one would use it?