Substituting, Splitting, and Escaping

Christopher Trudeau

Regular Expressions and Building Regexes in Python Christopher Trudeau 10:21

00:00 In the previous lesson, I showed you how to name your groups. In this lesson, I’ll show you even more methods of the re module, including substitution, splitting, and escaping. First, a quick review of named groups and non-capturing groups.

00:15 The regular expression on the left is a named group. ?P<> indicates naming, with the angle brackets (<>) surrounding what you’re going to name it.

00:25 So the group on the left is called digits and is looking for \d+ as its matching criteria. In the sentence, that matches the digits '123'. Those digits are inside of the named group, and so the match results in a group named digits containing '123'. Inside of the regular expression, instead of using a numeric backreference, you can use a named backreference.

00:51 ?P= is the named backreference. The first part of the regex is the same as before. The second part is a backreference using the named group digits. Everything from the first '13' through to the second '13' in the string is the entire match, but the group named digits only contains '13'.

01:15 ?: is a non-capturing group. The regular expression on the left-hand side has three groups in it: literal a, non-capturing r, and literal e.

01:27 This matches the 'are' in 'care' of the string. But only two groups are found: the 'a' and the 'e'. The 'r' in between isn’t captured.
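The three review items can be sketched as follows. The sample strings are hypothetical, since the on-screen examples are not part of the transcript, and the variable names are only for illustration.

```python
import re

# Hypothetical sample strings; the lesson's on-screen text is not in the transcript.

# Named group: (?P<digits>...) stores the match under the name "digits".
named = re.search(r"(?P<digits>\d+)", "abc 123 def")
print(named.group("digits"))   # 123
print(named.groupdict())       # {'digits': '123'}

# Named backreference: (?P=digits) must repeat whatever "digits" captured.
repeated = re.search(r"(?P<digits>\d+).*(?P=digits)", "I have 13 cats and 13 dogs")
print(repeated.group(0))         # 13 cats and 13
print(repeated.group("digits"))  # 13

# Non-capturing group: (?:r) takes part in the match but is not stored.
partial = re.search(r"(a)(?:r)(e)", "I care")
print(partial.group(0))   # are
print(partial.groups())   # ('a', 'e')
```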

01:39 Here are the matching functions you’ve seen so far. re.search() looks for a regular expression inside of a string. re.match() looks for a regular expression matching at the beginning of a string. fullmatch() evaluates whether or not the string fully matches the regex. findall() looks for all the matches for a regular expression inside of the string and returns a list of those matches. And finditer() does the same thing, except instead of returning a list, it returns an iterator.
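A minimal side-by-side sketch of those five functions, using a hypothetical sample string:

```python
import re

text = "abc 123 def 456"  # hypothetical sample string

first_number = re.search(r"\d+", text)     # first match anywhere in the string
at_start = re.match(r"\d+", text)          # None: the string does not start with digits
whole = re.fullmatch(r"\d+", text)         # None: the whole string is not digits
all_numbers = re.findall(r"\d+", text)     # list of matched strings
lazy_numbers = re.finditer(r"\d+", text)   # iterator of Match objects

print(first_number.group())               # 123
print(at_start, whole)                    # None None
print(all_numbers)                        # ['123', '456']
print([m.group() for m in lazy_numbers])  # ['123', '456']
```

One difference worth keeping in mind: findall() returns strings, and when the pattern contains a single capturing group it returns that group's text rather than the whole match, while finditer() always yields full Match objects.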

02:14 The module has some other functions that may be of use to you. You can do substitution using sub() and subn(). And there are utility functions like split(), which is similar to the str.split() function, but operates using regular expressions, and the escape() function, which is useful when you’re trying to match strings that have regular expression special characters in them.

02:37 content is a string to search within.

02:43 I’ll start with the sub() command. sub() takes a regular expression, something to replace it with, and the thing being searched. In this case, the regular expression is “One or more digits,” the replacement is the literal number sign ("#"), and content is the string to be substituted. It returns a new string with—in this case—the numbers replaced with number signs.
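As a rough sketch of that call, assuming a stand-in value for content (the lesson's actual string is not in the transcript):

```python
import re

content = "abc 123 def 456"  # hypothetical stand-in for the lesson's content string

# Replace every run of one or more digits with a number sign.
masked = re.sub(r"\d+", "#", content)
print(masked)    # abc # def #
print(content)   # abc 123 def 456 (the original string is unchanged)
```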

03:10 You can limit how many times the match happens. In this case, only the first substitution was performed, and then it stopped. Instead of providing a literal string for substitution, you can also provide a function. The function always takes a Match object,

03:34 and then you return whatever you want to have substituted. You’ll recall that .group(0) returns the entire string match, and then I’m slicing it, [::-1], which is a Python trick for reversing the string. Using this function inside of sub(),

03:55 the two numbers that are found are replaced with their digits reversed. Backreferences are also valid inside of substitution. The string is replaced, with the first group second and the second group first, resulting in the string 'two one'. There’s a little bit of ambiguity in this definition.
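The three variations just described (a limited count, a replacement function, and backreferences in the replacement) might look like this. The content value and the name reverse_match are assumptions for illustration; only the 'two one' result comes from the lesson.

```python
import re

content = "abc 123 def 456"  # hypothetical sample string

# count=1 stops after the first substitution.
first_only = re.sub(r"\d+", "#", content, count=1)
print(first_only)   # abc # def 456

def reverse_match(match):
    """Return the matched text reversed; sub() passes in a Match object."""
    return match.group(0)[::-1]

reversed_digits = re.sub(r"\d+", reverse_match, content)
print(reversed_digits)   # abc 321 def 654

# Numeric backreferences in the replacement swap the two captured words.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "one two")
print(swapped)   # two one
```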

04:18 Let’s say I wanted to replace the string with a backreference and then the number 0. There’s the backref, the 0, and the string—and a problem.

04:32 If you read the error, it says invalid group reference 10 at position 1. The re library has no way of knowing that you intended this as \1 and then a 0. It sees \10.

04:45 There is no group 10—there’s only one group. You can get around this by using the group meta-character \g.

05:00 The angle brackets say which group you mean. In this case, \g<1>0 is saying “Replace with group 1 and then put a "0".” Another nifty trick with re.sub() is what happens when a zero-length match is used.
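A small sketch of the ambiguity and the \g<1> fix, with a hypothetical input string:

```python
import re

text = "foo 123 bar"  # hypothetical sample string

try:
    # "\10" is read as a reference to group 10, which does not exist.
    re.sub(r"(\d+)", r"\10", text)
except re.error as exc:
    print("error:", exc)

# \g<1> names group 1 explicitly, so the 0 that follows is a literal character.
fixed = re.sub(r"(\d+)", r"\g<1>0", text)
print(fixed)   # foo 1230 bar
```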

05:15 Looking for zero or more "x"s in the word "spam" and replacing them with a hyphen ("-") results in hyphens inserted around every single letter. What’s happening here is that every single spot in the string is a zero-length match, so a "-" is being inserted at each zero-length match.
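The zero-length case is short enough to try directly:

```python
import re

# x* matches the empty string at every position in "spam",
# so a hyphen is inserted at each of the five positions.
hyphenated = re.sub(r"x*", "-", "spam")
print(hyphenated)   # -s-p-a-m-
```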

05:37 The subn() function is the same as sub() except it returns a tuple.

05:45 The first member of the tuple is the result string from sub(), and the second member of the tuple is the number of substitutions that happened.
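Applied to the same zero-length pattern, subn() would return both pieces. The names result and count are just illustrative:

```python
import re

# subn() returns (new_string, number_of_substitutions).
result, count = re.subn(r"x*", "-", "spam")
print(result)   # -s-p-a-m-
print(count)    # 5
```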

05:53 In this case, there were 5 dashes inserted. This is the split() function from the str library. Passing split() a comma (,) splits the string up based on those commas.

06:06 It returns a list of the components of the string. The re library also has a split() method, allowing you to split on things far more complicated than the str library supports.

06:22 Here, I’m splitting on one or more digits. Like the str library, the re.split() method returns a list. Because the "13" and the "42" are the split points, the only contents of the list are the parts outside of the matching patterns.
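For comparison, here is a sketch of both kinds of split. The input strings are hypothetical; only the "13" and "42" split points come from the lesson.

```python
import re

# str.split() splits on a fixed separator.
letters = "a,b,c".split(",")
print(letters)   # ['a', 'b', 'c']

# re.split() splits wherever the pattern matches; here, on runs of digits.
text = "red13green42blue"   # hypothetical sample string
parts = re.split(r"\d+", text)
print(parts)   # ['red', 'green', 'blue']
```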

06:39 The split() method also supports a maxsplit parameter.

06:47 With maxsplit set to 1, the first regex match is used as a splitting point, and then it returns the rest of the string. If you want to include the split points in the result, use groups.

07:07 Now, the returned list contains the text before the first match, the first matched group, the text between the matches, the second matched group, and the text after the last match.
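A sketch of maxsplit and of capturing groups in re.split(), again on a hypothetical string. The last call shows, for comparison, that a non-capturing group leaves the split points out of the result.

```python
import re

text = "red13green42blue"   # hypothetical sample string

# maxsplit=1: split at the first match only and keep the rest intact.
one_split = re.split(r"\d+", text, maxsplit=1)
print(one_split)   # ['red', 'green42blue']

# A capturing group puts the split points into the result as well.
with_points = re.split(r"(\d+)", text)
print(with_points)   # ['red', '13', 'green', '42', 'blue']

# A non-capturing group behaves like no group at all.
without_points = re.split(r"(?:\d+)", text)
print(without_points)   # ['red', 'green', 'blue']
```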

07:18 This is a neat way of demonstrating non-capturing groups. By changing the expression to a non-capturing group,

07:30 the end result is the same as without the groups themselves. Next up, I’ll demonstrate the re.escape() method. Here’s an example of searching for "2^4" in the string.

07:45 findall() is returning no matches. That’s because caret (^) is an anchor that means the beginning of the string, so having something before the anchor doesn’t make a lot of sense. So "2^4" isn’t being found.

07:59 Knowing this might be a problem, you can use the escape() method.

08:07 The regex variable now contains an appropriately-escaped string. re.escape() knows that ^ is a special character and creates the correct escape values. Just to show that it works, passing it into findall(), and there it is. It found it.
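The escape() steps can be reproduced roughly as follows. The searched text is a hypothetical stand-in, while the regex variable name follows the lesson.

```python
import re

text = "What is 2^4?"   # hypothetical sample string

# Unescaped, ^ is an anchor, so this pattern can never match.
print(re.findall("2^4", text))   # []

# re.escape() backslash-escapes the special character.
regex = re.escape("2^4")
print(regex)   # 2\^4

found = re.findall(regex, text)
print(found)   # ['2^4']
```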

08:28 One last thing that I want to show you is the compile() method. Under the hood in Python, when you use a regular expression, it gets compiled into an internal representation of what you’re searching for.

08:40 You can ask the Python library to do that ahead of time with the compile() function.

08:49 digits_re now contains a compiled version of the regular expression inside of the quotes. Inspecting the variable just tells you that it’s a compiled version of that regex.

09:01 You can now run all of the functions that you normally run directly in the re module on this compiled regular expression.

09:16 This is the search from before, but instead of passing in the regex, you’re using the compiled regex and the .search() method on it. The result is the same as before. compile() is often used if you’re going to reuse your regular expression.

09:31 If that digits_re needs to be used over and over again, you can compile it into a variable in one place, and then keep using that variable. Technically, you could do the same thing by storing the regex itself in a variable.

09:48 This works equally well.

09:52 You might think that compiling it makes it more efficient, but it turns out that when you run the re methods, Python is compiling it underneath anyways and caching the result. So if you reuse it, it will still use the compiled expression. The efficiencies are about the same.
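Both approaches to reuse, sketched with a hypothetical content string. digits_re follows the lesson, and DIGITS is an illustrative name for the stored pattern string.

```python
import re

content = "abc 123 def 456"   # hypothetical sample string

# Option 1: compile once, then call methods on the compiled pattern.
digits_re = re.compile(r"\d+")
print(digits_re)   # re.compile('\\d+')
compiled_result = digits_re.search(content)

# Option 2: keep the pattern string in a variable and pass it to re.search().
DIGITS = r"\d+"   # hypothetical name
plain_result = re.search(DIGITS, content)

print(compiled_result.group(), plain_result.group())   # 123 123
```

Since the re module caches compiled patterns, the choice between the two is mostly about readability, not speed.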

10:10 In the next lesson, I’m going to show you how to use flags to modify the behavior of your regular expressions. Insert your own Sheldon Cooper joke here.

raulfz on Jan. 29, 2021

Hi, thank you so much for this tutorial, I just want to point out that I’m not getting the same result with .findall and .finditer methods.

Take for example the following sentence:

import re

sentence = 'According to this runner disappointed finally the run'
find_all = re.findall(r'(?P<twice>\w+)(?P=twice)', sentence)
print(find_all)
OUT>>> ['c', 'n', 'p', 'l']

# NOTE THE DIFFERENCE BETWEEN FINDALL AND FINDITER

find_iter = re.finditer(r'(?P<twice>\w+)(?P=twice)', sentence)
for item in find_iter:
    print(item)

OUT>>>

<re.Match object; span=(1, 3), match='cc'>
<re.Match object; span=(20, 22), match='nn'>
<re.Match object; span=(29, 31), match='pp'>
<re.Match object; span=(42, 44), match='ll'>

So .finditer method returns the whole match (both letters) whereas the .findall method returns the last captured letter.

Am I doing something wrong? Are the two methods equivalent as you claimed throughout this tutorial?

Thank you.
