Regular Expressions and Building Regexes in Python (Summary)

Congratulations! You’ve mastered a tremendous amount of material. Regular expressions are extremely versatile and powerful—literally a language in their own right. You’ll find them invaluable in your Python coding.

You now know how to:

  • Use re.search() to perform regex matching in Python
  • Create complex pattern matching searches with regex metacharacters
  • Tweak regex parsing behavior with flags
  • Make full use of all the functions that the re module provides
  • Precompile a regex in Python
  • Extract information from match objects

00:00 Thanks for sticking with me so far. I’m just going to wrap things up with a quick review. In the first lesson, I covered what a regular expression was, where they came from, and why to use them.

00:11 The first regex you were taught was plain matching—just looking for straight characters like "spam" inside of a sentence. Class matching gives you the ability to match a range of values. In this case, the numbers 0 through 9 are in a class.

00:26 This matches the '2' in "2nd" in the sentence. Meta-characters are often used as shortcuts for certain kinds of classes of characters.

00:35 \s is whitespace, \d is a digit, and \w is a word character—a letter, a number, or an underscore. This comes from what is a word or a valid variable name in programming languages. This is matching the space, the '2', the 'n', and the 'd' inside of this sentence.
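As a rough illustration of plain matching, class matching, and the shorthand metacharacters described above, here is a minimal sketch. The sample sentence is hypothetical, since the course's own example text is not reproduced in this transcript.

```python
import re

# Hypothetical sample sentence (not the one shown in the course).
sentence = "Take the 2nd exit"

# Plain matching: look for literal characters.
plain = re.search(r"exit", sentence)

# Class matching: any single character in the range 0 through 9.
digit_class = re.search(r"[0-9]", sentence)

# Metacharacters: whitespace, a digit, then two word characters.
shorthand = re.search(r"\s\d\w\w", sentence)

print(plain.group())            # exit
print(digit_class.group())      # 2
print(repr(shorthand.group()))  # ' 2nd'
```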

00:57 Anchors give you the ability to control where in the text a match happens. The caret anchor (^) on its own looks for things at the beginning of the string.

01:07 Applying the MULTILINE flag changes this to be not just the beginning of the string, but also after newlines (\n). This example is finding the 'My' after the newline in the middle part of this string. Quantifiers give you the ability to repeat things.
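The effect of the caret anchor with and without the MULTILINE flag can be sketched as follows. The two-line string is an assumption made up for this illustration.

```python
import re

# Hypothetical two-line text; "My" only appears after the newline.
text = "Hello there\nMy name is Sam"

# Without the flag, ^ only matches at the very start of the string.
no_flag = re.search(r"^My", text)

# With re.MULTILINE, ^ also matches right after each newline.
with_flag = re.search(r"^My", text, re.MULTILINE)

print(no_flag)           # None
print(with_flag.span())  # (12, 14)
```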

01:23 \d means a digit, and plus (+) means one or more matches. search() returns the first match for this, which is the '555' portion of the phone number.

01:34 It’s a little suspicious. Using parentheses, you can group values together. Things that have been grouped can also be pulled out using the .groups() method.
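A small sketch of quantifiers and grouping, assuming a made-up phone number in place of the one used on the course slides:

```python
import re

# Hypothetical text containing a phone number.
phone = "Call 555-1234 today"

# \d+ means one or more digits; search() returns the first match only.
first_digits = re.search(r"\d+", phone)
print(first_digits.group())  # 555

# Parentheses create groups that .groups() returns as a tuple.
grouped = re.search(r"(\d+)-(\d+)", phone)
print(grouped.groups())  # ('555', '1234')
```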

01:46 Python provides five mechanisms for straight pattern-matching. search() looks for the regular expression inside of a string. match() looks for the regular expression at the beginning of the string. fullmatch() looks for the regular expression matching the entire string. findall() returns a list of matches. And finditer() returns an iterator of matches.

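02:09 The first three (search(), match(), and fullmatch()) each return a single Match object, or None when nothing matches. findall() and finditer() do not: findall() returns a list of strings, and finditer() returns an iterator that yields Match objects.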
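To compare the five functions side by side, here is a minimal sketch run against one hypothetical string:

```python
import re

# Hypothetical sample text.
text = "spam, eggs, spam"

found = re.search(r"eggs", text)            # anywhere in the string
at_start = re.match(r"eggs", text)          # only at the beginning
whole = re.fullmatch(r"spam", text)         # must cover the entire string
whole_ok = re.fullmatch(r"[a-z, ]+", text)  # this pattern does cover it
all_strings = re.findall(r"spam", text)     # list of matching strings
all_spans = [m.span() for m in re.finditer(r"spam", text)]  # Match objects

print(found.span())   # (6, 10)
print(at_start)       # None
print(whole)          # None
print(whole_ok.group() == text)  # True
print(all_strings)    # ['spam', 'spam']
print(all_spans)      # [(0, 4), (12, 16)]
```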

02:17 The Match object returned from one of those functions has methods on it. Three of those methods allow you to get at grouped patterns. .group() takes an argument and can return one or more of the grouped patterns.

02:29 .groups() returns a tuple of all of the grouped patterns. .groupdict() returns a dictionary for any of the named groups. .expand() can be used to turn backreferences into their actual values inside of a template. And then .start(), .end(), and .span() give information about where the match happened inside of the string—the .start() and .end() being the beginning and end of the matches and the .span() being a tuple containing the start and end value.

02:59 You can also use the index of a Match to access the group of that number. In the example below, the first group is found using match[1].

03:09 Remember that group 0—and therefore index 0—is the entire match, whereas 1 and higher are the numbered backreferences.
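The Match methods and index access described above might look like this in practice. The input string and pattern are assumptions chosen for illustration, not the slide's original example.

```python
import re

# Hypothetical input with three comma-separated words.
m = re.search(r"(\w+),(\w+),(\w+)", "items: foo,bar,baz")

print(m.group(1))        # foo
print(m.group(1, 3))     # ('foo', 'baz')
print(m.groups())        # ('foo', 'bar', 'baz')

# Index access: 0 is the entire match, 1 and up are the groups.
print(m[0])              # foo,bar,baz
print(m[1])              # foo

# Position information for the whole match and for group 2.
print(m.start(), m.end(), m.span())  # 7 18 (7, 18)
print(m.start(2))        # 11

# .expand() fills backreferences in a template string.
print(m.expand(r"\3-\1"))  # baz-foo
```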

03:18 If you don’t like using a number for the group, the (?P<name>...) syntax allows you to name it. What appears inside of the angle brackets is the name you give it, and after the angle brackets comes the regular expression for the group.

03:31 In this example, there are two named groups, one called prefix and the other called lineno. The .groupdict() method on the Match object returns these two groups with their names. Backreferences refer to a group earlier in the regex.
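A sketch of named groups using the prefix and lineno names mentioned above. The pattern and the input line are hypothetical, since the original example is not included in the transcript.

```python
import re

# Hypothetical log-style line; only the group names come from the lesson.
line = "ERR-0042: disk full"

m = re.search(r"(?P<prefix>[A-Z]+)-(?P<lineno>\d+)", line)

print(m.groupdict())      # {'prefix': 'ERR', 'lineno': '0042'}
print(m.group("prefix"))  # ERR
print(m["lineno"])        # 0042

# Named groups are still numbered, so index access keeps working.
print(m[1], m[2])         # ERR 0042
```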

03:48 You can specify a backreference with a number or, if the group was named, with its name. \1 is the first group. (?P=name) refers back to the group with that name. In the example here, the group is named twice, and the backreference to twice appears right after it.

04:08 Backreferences are often used for looking for repetition, as in both of the examples here, looking for the numbers '44' repeating themselves.
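Both backreference styles can be sketched like this. The group name twice comes from the lesson, while the sample strings are assumptions.

```python
import re

# Hypothetical text containing the repeated pair '44'.
text = "id 4444 ok"

# Numbered backreference: \1 must repeat whatever group 1 matched.
numbered = re.search(r"(\d\d)\1", text)

# Named backreference: (?P=twice) repeats the group named "twice".
named = re.search(r"(?P<twice>\d\d)(?P=twice)", text)

print(numbered.group(), numbered.group(1))  # 4444 44
print(named.group(), named["twice"])        # 4444 44

# No two-digit pair repeats itself here, so there is no match.
no_repeat = re.search(r"(\d\d)\1", "id 4455 ok")
print(no_repeat)  # None
```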

04:17 The re module provides functions for doing substitution. sub() and subn() substitute based on regular expressions, replacing each match with either a string or the result of a function that you pass in. subn() does exactly what sub() does, but instead of returning just the substituted string, it returns a tuple containing the substituted string and the number of substitutions that were made. split() is a more advanced version of str.split(), allowing you to split based on regular expression pattern-matching. escape() is useful when you want to find a string literally and it possibly contains special regex characters. Running escape() on the string returns an escaped string, which will safely look for a literal version of what you passed in.
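Here is a minimal sketch of sub(), subn(), split(), and escape(). All sample strings and the helper function double() are hypothetical.

```python
import re

text = "one1two22three333"  # hypothetical sample

# sub() replaces every match with a string...
replaced = re.sub(r"\d+", "-", text)
print(replaced)  # one-two-three-

# ...or with the return value of a function that receives the Match.
def double(match):
    """Hypothetical helper: double the matched number."""
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, text)
print(doubled)  # one2two44three666

# subn() also reports how many substitutions were made.
result, count = re.subn(r"\d+", "-", text)
print(result, count)  # one-two-three- 3

# split() accepts a pattern as the separator.
parts = re.split(r"\s*[,;]\s*", "a , b;c")
print(parts)  # ['a', 'b', 'c']

# escape() makes special characters such as + match literally.
literal = "1+1=2"
unescaped = re.search(literal, "so 1+1=2 holds")
escaped = re.search(re.escape(literal), "so 1+1=2 holds")
print(unescaped)        # None
print(escaped.group())  # 1+1=2
```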

05:05 Most of the regular expression functions support flags to modify their behavior. The example here changes the search to be case insensitive. Small 'a' is found even though capital 'A' was in the regex.

05:19 There are flags for ignoring case, changing the behavior of anchors in multiline text, changing the behavior of a period to include newlines, VERBOSE for making your regular expressions easier to read, DEBUG for showing information about how the regular expression is operating, and the ASCII, UNICODE, and LOCALE flags for changing the style of character encoding used inside of the regex. In Python 3, the default for string patterns is UNICODE.
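A few of these flags in action, as a sketch with made-up inputs. It also precompiles one pattern with re.compile(), which accepts the same flags.

```python
import re

# IGNORECASE: capital 'A' in the regex still finds lowercase 'a'.
ignore_case = re.search(r"A+", "baaad", re.IGNORECASE)
print(ignore_case.group())  # aaa

# DOTALL: the period normally refuses to match a newline.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)
print(dot_default)            # None
print(repr(dot_all.group()))  # 'a\nb'

# VERBOSE: whitespace and comments inside the pattern are ignored.
phone = re.compile(
    r"""
    (\d{3})   # first three digits
    -         # separator
    (\d{4})   # last four digits
    """,
    re.VERBOSE,
)
print(phone.search("555-1234").groups())  # ('555', '1234')

# ASCII: \w stops treating non-ASCII letters as word characters.
word_unicode = re.search(r"\w+", "caf\u00e9")
word_ascii = re.search(r"\w+", "caf\u00e9", re.ASCII)
print(len(word_unicode.group()), len(word_ascii.group()))  # 4 3
```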

05:48 I also showed you how to do conditional matching, changing the criteria based on the presence or absence of a group.

05:54 I followed that up with lookahead and lookbehind matches, which allow you to look for things that are preceded or followed by something, but that something isn’t consumed.
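Conditional matching and the lookaround assertions could be sketched as below. The patterns and inputs are assumptions for illustration rather than the course's own examples.

```python
import re

# Conditional: require a closing '>' only if an opening '<' was matched.
tag = re.compile(r"^(<)?(\w+)(?(1)>)$")
print(bool(tag.search("<tag>")))  # True
print(bool(tag.search("tag")))    # True
print(bool(tag.search("<tag")))   # False
print(bool(tag.search("tag>")))   # False

# Lookahead: a word followed by '!', without consuming the '!'.
ahead = re.search(r"\w+(?=!)", "hello world!")
print(ahead.group(), ahead.span())  # world (6, 11)

# Lookbehind: digits preceded by '$', without consuming the '$'.
behind = re.search(r"(?<=\$)\d+", "cost: $42")
print(behind.group())  # 42
```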

06:06 Regular expressions are a powerful tool in your tool belt. I hope you’ve enjoyed the course. I know I’ve enjoyed giving it. Thanks for your attention.

Ghani on Nov. 22, 2020

Excellent course! I like very much the review at the start of each new lesson. Well-done and thanks.

zulfiiaditto on Nov. 29, 2020

Thank you so much for very detailed course!

raulfz on Feb. 2, 2021

Excellent course, it was really helpful to see how regex are used in the python standard library.

Cici Du on Feb. 9, 2021

Great course! I have struggled to apply regex after my first 2 courses in Python, so I took this course specifically for regex. This course explains things thoroughly!

Dirk on March 22, 2021

Thank you for the good course. Everything well to the point. Will help me in my work.

vuqpham on March 29, 2021

The instructor definitely has a great pedagogical skill! Thank you.

datagirl-89 on May 10, 2021

This was a great course and I learned a lot. I used it to complete some assignments in another regex module on Coursera that didn’t provide the instruction I needed to fully understand the concepts. I would have liked a few examples of re.finditer() and when to use that. Otherwise, it was outstanding. Thanks!!

Christopher Trudeau RP Team on May 11, 2021

Hi datagirl-89,

I’m glad you found the course useful.

Both re.findall() and re.finditer() are for finding multiple matches inside of some text. The re.findall() method returns just the matching strings, whereas re.finditer() returns match objects (this is actually a bit inconsistent and confusing).

The other difference is re.findall() returns a list, whereas re.finditer() returns an iterator. If you haven’t come across iterators before, they’re used similarly to lists, but they calculate results on the fly. Lists are precalculated and take up the amount of memory of everything in the list. Iterators can be used the same way you use lists, but they don’t precalculate everything; they calculate each item when you ask for it.

More on iterators here:

realpython.com/python-for-loop/#iterators

As for a quick example:

>>> import re
>>> text = "She sells sea shells by the sea shore"
>>> re.findall(r'se', text)
['se', 'se', 'se']
>>> re.finditer(r'se', text)
<callable_iterator object at 0x7f944ff29880>
>>> for match in re.finditer(r'se', text):
...     print(match)
...
<re.Match object; span=(4, 6), match='se'>
<re.Match object; span=(10, 12), match='se'>
<re.Match object; span=(28, 30), match='se'>

Generally you want to prefer iterators over lists whenever dealing with large amounts of data to avoid using up too much memory. In this particular case it isn’t just about the iterator, but also what you want to get back. If you’re just after some matching strings, re.findall() would be easier to use as you get the strings themselves back. If you want more info about the match, you may want to use re.finditer() even if it isn’t a lot of data, because it gives you the match objects.
