Substituting, Splitting, and Escaping
In the previous lesson, I showed you how to name your groups. In this lesson, I’ll show you even more methods of the
re module, including substitution, splitting, and escaping. First, a quick review of named groups and non-capturing groups.
The regular expression on the left is a named group. ?P indicates a naming, with the angle brackets (<>) surrounding what you’re going to name it. So the expression on the left is called digits and is looking for \d+ as its matching criteria. In the sentence, that matches the digits '123'. Those digits are inside of the named group, and so the match result has a group named digits containing '123'.
Inside of the regular expression, instead of using a numeric backreference, you can use a named backreference. ?P= is the named backreference. The first part of the regex is the same as before. The second part is a backreference using the named group digits. From the number '13' through to the number '13' in the string is the entire match, but the group named digits only contains '13'.
?: is a non-capturing group. The regular expression on the left-hand side has three groups in it: a literal a, a non-capturing r, and a literal e. This matches inside the 'care' of the string. But only two groups are found: the 'a' and the 'e'. The 'r' in between doesn’t capture.
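As a rough sketch of the three ideas in this review, the snippet below uses made-up sample strings and patterns (they are assumptions for illustration, not the ones shown on screen in the lesson):

```python
import re

# Hypothetical sample strings; the lesson's own examples are not reproduced here.

# Named group: (?P<digits>...) captures under the name "digits".
named = re.search(r"(?P<digits>\d+)", "abc 123 def")
print(named.group("digits"))    # 123
print(named.groupdict())        # {'digits': '123'}

# Named backreference: (?P=digits) must repeat what "digits" captured.
backref = re.search(r"(?P<digits>\d+) and (?P=digits)", "see 13 and 13 again")
print(backref.group(0))         # 13 and 13
print(backref.group("digits"))  # 13

# Non-capturing group: (?:r) takes part in the match but is not reported.
non_capturing = re.search(r"(a)(?:r)(e)", "I care")
print(non_capturing.group(0))   # are
print(non_capturing.groups())   # ('a', 'e')
```

The full match and the named group differ in the second case: the match spans both numbers, while the group holds only the first one.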
Here are the matching functions you’ve seen so far.
re.search() looks for a regular expression inside of a string.
re.match() looks for a regular expression matching at the beginning of a string.
fullmatch() evaluates whether or not the string fully matches the regex.
findall() looks for all the matches for a regular expression inside of the string and returns a list of those matches. And
finditer() does the same thing, except instead of returning a list, it returns an iterator.
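To make the differences concrete, here is a small sketch of the five functions side by side. The sample string is an assumption chosen for illustration. The last two lines show one difference worth knowing: when the pattern contains a capturing group, findall() returns the group contents, while finditer() yields match objects from which you can still read the whole match.

```python
import re

text = "abc 123 def 45"  # hypothetical sample string

found = re.search(r"\d+", text)            # first match anywhere in the string
at_start = re.match(r"\d+", text)          # None: the string does not start with digits
letters = re.match(r"[a-z]+", text)        # matches 'abc' at the beginning
whole = re.fullmatch(r"\d+", "123")        # the entire string must match
not_whole = re.fullmatch(r"\d+", text)     # None: text is more than just digits
as_list = re.findall(r"\d+", text)         # list of matched strings
as_iter = [m.group() for m in re.finditer(r"\d+", text)]  # same matches via an iterator

print(found.group(), at_start, letters.group())  # 123 None abc
print(whole.group(), not_whole)                  # 123 None
print(as_list, as_iter)                          # ['123', '45'] ['123', '45']

# With a capturing group, findall() returns what the group captured.
grouped_list = re.findall(r"(\d)\d*", text)
grouped_iter = [m.group(0) for m in re.finditer(r"(\d)\d*", text)]
print(grouped_list)  # ['1', '4']
print(grouped_iter)  # ['123', '45']
```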
The module has some other functions that may be of use to you. You can do substitution using sub() and subn(). And there are utility functions like split(), which is similar to the str.split() function, but operates using regular expressions, and the escape() function, which is useful when a string you want to match literally contains characters that are special in regular expressions.
content is a string to search within. I’ll start with the sub() function. sub() takes a regular expression, something to replace it with, and the thing being searched. In this case, the regular expression is “One or more digits,” the replacement is the literal number sign (#), and content is the string to be substituted. It returns a new string with, in this case, the numbers replaced with number signs.
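A minimal sketch of that call, assuming a made-up value for content (the lesson's actual string is not reproduced here):

```python
import re

content = "order 13 shipped with 42 items"  # hypothetical sample string

# Replace every run of one or more digits with a number sign.
result = re.sub(r"\d+", "#", content)

print(result)   # order # shipped with # items
print(content)  # order 13 shipped with 42 items (the original is unchanged)
```

Because strings are immutable, sub() always hands back a new string and leaves content alone.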
You can limit how many substitutions happen with the count argument. In this case, only the first substitution was performed, and then it stopped. Instead of providing a literal string for substitution, you can also provide a function. The function always takes a match object, and then you return whatever you want to have substituted. You’ll recall that .group(0) returns the entire string match, and then I’m slicing it, [::-1], which is a Python trick for reversing the string. Using this function inside of sub(), the two numbers that are found are replaced with their digits reversed.
Backreferences are also valid inside of substitution. The string is replaced, with the first group second and the second group first, resulting in the string 'two one'. There’s a little bit of ambiguity in this definition.
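Before turning to that ambiguity, here is a sketch of the three variations just described: a limited number of substitutions, a function as the replacement, and backreferences in the replacement. The sample strings and the function name reverse_match are assumptions made for illustration.

```python
import re

content = "order 13 shipped with 42 items"  # hypothetical sample string

# Limit the number of substitutions with the count argument.
first_only = re.sub(r"\d+", "#", content, count=1)
print(first_only)  # order # shipped with 42 items


def reverse_match(match):
    """Return the matched text reversed (hypothetical helper)."""
    return match.group(0)[::-1]


# A function replacement is called once per match with the match object.
reversed_digits = re.sub(r"\d+", reverse_match, content)
print(reversed_digits)  # order 31 shipped with 24 items

# Backreferences in the replacement: swap the two captured words.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "one two")
print(swapped)  # two one
```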
Let’s say I wanted to replace the string with a backreference and then the number 0. There’s the backref, the 0, and the string, and a problem.
If you read the error, it says invalid group reference 10 at position 1. The re library has no way of knowing that you intended this as \1 and then a 0. It sees \10. There is no group 10; there’s only one group. You can get around this by using the group meta-character \g<1>. The angle brackets say which group is meant. In this case, it’s saying “Replace with group 1 and then put a "0".”
Another nifty trick with re.sub() is what happens when a zero-length match is used.
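Before looking at zero-length matches, here is a sketch of the group-reference ambiguity and its fix described above. The pattern and sample string are assumptions for illustration.

```python
import re

text = "13 apples"  # hypothetical sample string

# r"\10" is read as a reference to group 10, which does not exist here.
error_message = None
try:
    re.sub(r"(\d+)", r"\10", text)
except re.error as exc:
    error_message = str(exc)
print(error_message)

# \g<1> names the group explicitly, so the 0 that follows is a literal.
fixed = re.sub(r"(\d+)", r"\g<1>0", text)
print(fixed)  # 130 apples
```

The \g<...> form also accepts group names, so it works with named groups as well as numbered ones.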
Looking for zero or more "x"s in the word "spam" and replacing them with a hyphen ("-") results in a hyphen before and after every single letter. What’s happening here is that every position in the string is a zero-length match, so a "-" is inserted at each one.
The subn() function is the same as
sub() except it returns a tuple.
The first member of the tuple is the result string from
sub(), and the second member of the tuple is the number of substitutions that happened.
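A short sketch of subn(), reusing the zero-length "spam" pattern from above. The second call uses a made-up string purely for comparison.

```python
import re

# Every position in "spam" is a zero-length match for x*.
result, count = re.subn(r"x*", "-", "spam")
print(result)  # -s-p-a-m-
print(count)   # 5

# Hypothetical second example: two runs of digits, two substitutions.
digits_result = re.subn(r"\d+", "#", "13 and 42")
print(digits_result)  # ('# and #', 2)
```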
In this case, there were
5 dashes inserted.
This is the split() method of the str type. Passing split() a comma (,) splits the string up based on those commas. It returns a list of the components of the string. The re library also has a split() function, allowing you to split on things far more complicated than str.split() supports.
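A sketch of the two side by side, using made-up strings (assumptions for illustration):

```python
import re

# str.split() splits on a fixed separator.
simple = "a,b,c".split(",")
print(simple)  # ['a', 'b', 'c']

# re.split() splits wherever the pattern matches.
content = "apples13bananas42cherries"  # hypothetical sample string
by_digits = re.split(r"\d+", content)
print(by_digits)  # ['apples', 'bananas', 'cherries']

# A pattern can describe separators that str.split() cannot, such as
# a comma or semicolon with optional surrounding whitespace.
mixed = re.split(r"\s*[,;]\s*", "a , b;c")
print(mixed)  # ['a', 'b', 'c']
```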
Here, I’m splitting on one or more digits. Like str.split(), re.split() returns a list. Because the
"13" and the
"42" are the split points, the only contents of the list are the parts outside of the matching patterns.
The split() function also supports a maxsplit argument. With maxsplit set to 1, the first regex match is used as a splitting point, and then it returns the rest of the string. If you want to include the split points in the result, use groups.
Now, the returned list contains the text before the first match, the first matched group, the text between the matches, the second matched group, and the text after the last match.
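The maxsplit and group behaviors can be sketched as follows, with an assumed sample string. The last call anticipates the non-capturing variant discussed next.

```python
import re

content = "apples13bananas42cherries"  # hypothetical sample string

# maxsplit=1: split at the first match only, keep the rest intact.
once = re.split(r"\d+", content, maxsplit=1)
print(once)  # ['apples', 'bananas42cherries']

# A capturing group puts the split points into the result.
with_points = re.split(r"(\d+)", content)
print(with_points)  # ['apples', '13', 'bananas', '42', 'cherries']

# A non-capturing group behaves as if there were no group at all.
without_points = re.split(r"(?:\d+)", content)
print(without_points)  # ['apples', 'bananas', 'cherries']
```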
This is a neat way of demonstrating non-capturing groups. By changing the expression to a non-capturing group,
the end result is the same as without the groups themselves. Next up, I’ll demonstrate the
re.escape() function. Here’s an example of searching for "2^4" in the string. findall() is returning no matches. That’s because caret (^) is an anchor. An anchor that means the beginning of the string can’t match when something has to come before it, so "2^4" isn’t being found.
Knowing this might be a problem, you can use the re.escape() function. The regex variable now contains an appropriately escaped string. re.escape() knows that ^ is a special character and creates the correct escape values. Just to show that it works, I’m passing it into findall(), and there it is. It found it.
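A sketch of the problem and the fix, assuming a made-up string to search in:

```python
import re

content = "compute 2^4 here"  # hypothetical sample string

# Unescaped, ^ is an anchor, so this pattern can never match.
unescaped = re.findall(r"2^4", content)
print(unescaped)  # []

# re.escape() backslash-escapes the characters that are special in a regex.
regex = re.escape("2^4")
print(regex)  # 2\^4

escaped = re.findall(regex, content)
print(escaped)  # ['2^4']
```

This is most useful when the text to search for comes from somewhere you don't control, such as user input.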
One last thing that I want to show you is the compile() function. Under the hood in Python, when you use a regular expression, the re module compiles it into an internal representation of what you’re searching for.
You can ask the Python library to do that ahead of time with the compile() function. digits_re now contains a compiled version of the regular expression inside of the quotes. Inspecting the variable just tells you that it’s a compiled version of that regex.
You can now run all of the functions that you normally run directly in the
re module on this compiled regular expression.
This is the search from before, but instead of passing in the regex, you’re using the compiled regex and the
.search() method on it. The result is the same as before.
compile() is often used if you’re going to reuse your regular expression. If digits_re needs to be used over and over again, you can compile it into a variable in one place, and then keep using that variable. Technically, you could do the same thing by storing the regex string itself in a variable.
This works equally well.
You might think that compiling it makes it more efficient, but it turns out that when you run the
re methods, Python is compiling it underneath anyways and caching the result. So if you reuse it, it will still use the compiled expression. The efficiencies are about the same.
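Both approaches can be sketched as follows. The name digits_re comes from the lesson; the pattern, the sample string, and the name digits_pattern are assumptions for illustration.

```python
import re

content = "order 13 shipped with 42 items"  # hypothetical sample string

# Compile once, then call methods on the compiled pattern object.
digits_re = re.compile(r"\d+")
first = digits_re.search(content).group()
every = digits_re.findall(content)
masked = digits_re.sub("#", content)
print(first, every)  # 13 ['13', '42']
print(masked)        # order # shipped with # items

# Alternative: keep the pattern string in a variable and use the module
# functions, which compile and cache the pattern internally.
digits_pattern = r"\d+"
first_again = re.search(digits_pattern, content).group()
print(first_again)   # 13
```

Either way the pattern is defined in one place, which is usually the main benefit.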
In the next lesson, I’m going to show you how to use flags to modify the behavior of your regular expressions. Insert your own Sheldon Cooper joke here.
raulfz on Jan. 29, 2021
Hi, thank you so much for this tutorial. I just want to point out that I’m not getting the same result with .findall and .finditer.
Take for example the following sentence:
The .finditer method returns the whole match (both letters) whereas the .findall method returns the last captured letter.
Am I doing something wrong? Are the two methods equivalent as you claimed throughout this tutorial?