Substituting, Splitting, and Escaping
In the previous lesson, I showed you how to name your groups. In this lesson, I’ll show you even more methods of the
re module, including substitution, splitting, and escaping. First, a quick review of named groups and non-capturing groups.
The group on the left is named digits and uses \d+ as its matching criteria. In the sentence, that matches the digits '123'. Those digits are inside of the named group, so the match result has a group named digits containing '123'.
Inside of the regular expression, instead of using a numeric backreference, you can use a named backreference. (?P=digits) is the named backreference. The first part of the regex is the same as before. The second part is a backreference to the named group digits. The entire match runs from the first '13' through to the second '13' in the string, but the group named digits only contains '13'.
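The review above can be sketched roughly as follows. The sample strings are assumptions chosen for illustration, since the on-screen code is not part of this transcript.

```python
import re

# Hypothetical sample strings; the lesson's on-screen text may differ.
named = re.search(r"(?P<digits>\d+)", "foo 123 bar")
print(named.group("digits"))    # 123

# (?P=digits) must match the same text that the digits group captured.
backref = re.search(r"(?P<digits>\d+).*(?P=digits)", "room 13 is next to room 13")
print(backref.group(0))         # 13 is next to room 13
print(backref.group("digits"))  # 13
```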
Here are the matching functions you’ve seen so far.
re.search() looks for a regular expression inside of a string.
re.match() looks for a regular expression matching at the beginning of a string.
fullmatch() evaluates whether or not the string fully matches the regex.
findall() looks for all the matches for a regular expression inside of the string and returns a list of those matches. And
finditer() does the same thing, except instead of returning a list, it returns an iterator.
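As a quick side-by-side, here is a small sketch of those five functions run against the same pattern. The text variable is a made-up example, not the string used in the lesson.

```python
import re

text = "spam 13 eggs 42"  # hypothetical sample text

found = re.search(r"\d+", text)          # first match anywhere in the string
at_start = re.match(r"\d+", text)        # None: the string starts with "spam"
whole = re.fullmatch(r"\d+", "1342")     # matches: the whole string is digits
as_list = re.findall(r"\d+", text)       # list of matched strings
as_iter = [m.group(0) for m in re.finditer(r"\d+", text)]  # iterator of match objects

print(found.group(0))  # 13
print(at_start)        # None
print(whole.group(0))  # 1342
print(as_list)         # ['13', '42']
print(as_iter)         # ['13', '42']
```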
The module has some other functions that may be of use to you. You can do substitution using sub() and subn(). And there are utility functions like split(), which is similar to the str.split() method but operates using regular expressions, and the escape() function, which is useful when you need to match a string literally even though it contains characters that are special in regular expressions.
I’ll start with the sub() function. sub() takes a regular expression, something to replace it with, and the string being searched. In this case, the regular expression is “one or more digits,” the replacement is the literal number sign (#), and content is the string to be searched. It returns a new string with, in this case, the numbers replaced with number signs.
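A minimal sketch of that call, assuming a made-up value for content. The subn() call is included only because it was mentioned earlier: it performs the same substitution but also reports how many replacements were made.

```python
import re

content = "spam 13 eggs 42"  # hypothetical sample text

result = re.sub(r"\d+", "#", content)
print(result)  # spam # eggs #

# subn() returns a (new_string, number_of_substitutions) tuple.
new_text, how_many = re.subn(r"\d+", "#", content)
print(new_text, how_many)  # spam # eggs # 2
```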
You can limit how many substitutions happen with the count argument. In this case, only the first substitution was performed, and then it stopped.
Instead of providing a literal string for substitution, you can also provide a function. The function always takes a match object, and you return whatever you want to have substituted. You’ll recall that .group(0) returns the entire matched string, and then I’m slicing it with [::-1], which is a Python trick for reversing the string. Using this function inside of re.sub(), the two numbers that are found are replaced with their digits reversed.
Backreferences are also valid inside of the replacement. The string is replaced with the first group second and the second group first, resulting in the string 'two one'.
There’s a little bit of ambiguity in numeric backreferences: \10 is read as group 10, not as group 1 followed by a "0". The \g<...> form removes that ambiguity. The angle brackets say which group is meant. In this case, \g<1>0 is saying “Replace with group 1 and then put a "0".”
Another nifty trick with re.sub() is what happens when a zero-length match is used. Looking for zero or more "x"s in the word "spam" and replacing them with a hyphen ("-") results in a hyphen before every letter and one more at the end of the string. What’s happening here is that every position in the string is a zero-length match, so a "-" is inserted at each one.
Here, I’m splitting on one or more digits. Like str.split(), the re.split() function returns a list. Because the "13" and the "42" are the split points, the list only contains the parts of the string outside of the matches.
findall() is returning no matches. That’s because the caret (^) is an anchor that means the beginning of the string. Having something before that anchor doesn’t make a lot of sense, so "2^4" isn’t being found.
The regex variable now contains an appropriately escaped string. re.escape() knows that ^ is a special character and adds the correct escape. Just to show that it works, I’m passing it into findall(), and there it is. It found it.
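Here is a sketch of both attempts, the failing unescaped search and the escaped one. The text variable is a made-up sentence. The regex name follows the transcript.

```python
import re

text = "What is 2^4?"  # hypothetical sample text

# Unescaped, ^ is an anchor, so "2" followed by start-of-string never matches.
no_match = re.findall("2^4", text)
print(no_match)  # []

# re.escape() backslash-escapes the caret so it is matched literally.
regex = re.escape("2^4")
print(regex)     # 2\^4

found = re.findall(regex, text)
print(found)     # ['2^4']
```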
One last thing that I want to show you is the compile() function. Under the hood in Python, when you use a regular expression, it gets compiled into an internal representation of what you’re searching for.
This is the search from before, but instead of passing in the regex, you’re using the compiled regex and calling the .search() method on it. The result is the same as before.
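As a rough illustration, the module-level function and the compiled pattern's method can be compared like this. The content string is an assumption. The digits_re name is the one used later in the transcript.

```python
import re

content = "spam 13 eggs 42"  # hypothetical sample text

# Module-level function: the pattern is passed in on every call.
from_module = re.search(r"\d+", content)

# Compiled pattern: the same search is a method on the pattern object.
digits_re = re.compile(r"\d+")
from_compiled = digits_re.search(content)

print(from_module.group(0), from_compiled.group(0))  # 13 13
```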
compile() is often used if you’re going to reuse your regular expression. If digits_re needs to be used over and over again, you can compile it into a variable in one place, and then keep using that variable. Technically, you could do the same thing by storing the regex string itself in a variable.
You might think that compiling it makes it more efficient, but it turns out that when you call the re functions, Python compiles the regex underneath anyway and caches the result. So if you reuse it, it will still use the compiled expression. The efficiencies are about the same.
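The two reuse styles described here can be sketched side by side. DIGITS_PATTERN and the lines list are hypothetical names and data introduced for this example.

```python
import re

DIGITS_PATTERN = r"\d+"                 # plain pattern string kept in a variable
digits_re = re.compile(DIGITS_PATTERN)  # compiled once, reused below

lines = ["spam 13", "eggs 42 and 7", "no numbers here"]  # hypothetical data

via_compiled = [digits_re.findall(line) for line in lines]
via_string = [re.findall(DIGITS_PATTERN, line) for line in lines]

print(via_compiled)                # [['13'], ['42', '7'], []]
print(via_compiled == via_string)  # True
```

Either style gives the same results, so the choice is mostly about readability.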