Regular Expressions and Building Regexes in Python (Summary)

Congratulations! You’ve mastered a tremendous amount of material. Regular expressions are extremely versatile and powerful—literally a language in their own right. You’ll find them invaluable in your Python coding.

You now know how to:

  • Use re.search() to perform regex matching in Python
  • Create complex pattern matching searches with regex metacharacters
  • Tweak regex parsing behavior with flags
  • Make full use of all the functions that the re module provides
  • Precompile a regex in Python
  • Extract information from match objects
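
As a brief illustration of the precompiling item in the list above, here is a minimal sketch. The pattern and the sample strings are made up for illustration and are not taken from the course.

```python
import re

# Compile the pattern once, then reuse the pattern object's methods.
number = re.compile(r"\d+")

found = number.findall("10 apples, 20 pears")
missing = number.search("no digits here")

print(found)    # ['10', '20']
print(missing)  # None
```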

00:00 Thanks for sticking with me so far. I’m just going to wrap things up with a quick review. In the first lesson, I covered what a regular expression was, where they came from, and why to use them.

00:11 The first regex you were taught was plain matching—just looking for straight characters like "spam" inside of a sentence. Class matching gives you the ability to match a range of values. In this case, the numbers 0 through 9 are in a class.

00:26 This matches the '2' in "2nd" in the sentence. Meta-characters are often used as shortcuts for certain kinds of classes of characters.
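
The slide examples are not part of this transcript, so the following is only a sketch of plain matching and class matching. The sentence is a hypothetical stand-in for the one used in the video.

```python
import re

# Hypothetical sentence; the video's own example text is not in the transcript.
sentence = "I ordered spam for the 2nd time today"

# Plain matching: look for the literal characters "spam".
plain = re.search(r"spam", sentence)

# Class matching: [0-9] matches any single character from 0 through 9.
digit = re.search(r"[0-9]", sentence)

print(plain.group())  # spam
print(digit.group())  # 2
```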

00:35 \s is whitespace, \d is a digit, and \w is a word character—a letter, a number, or an underscore. This comes from what is a word or a valid variable name in programming languages. This is matching the space, the '2', the 'n', and the 'd' inside of this sentence.
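
A small sketch of these shortcuts, assuming a made-up sentence that contains "2nd". The second call shows that \w treats an underscore as a word character.

```python
import re

sentence = "This is the 2nd attempt"  # hypothetical text, not from the video

# \s\d\w\w: one whitespace character, one digit, then two word characters.
shortcut = re.search(r"\s\d\w\w", sentence)
print(repr(shortcut.group()))  # ' 2nd'

# \w also accepts underscores, so identifiers stay in one piece.
words = re.findall(r"\w+", "user_name = 42")
print(words)  # ['user_name', '42']
```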

00:57 Anchors give you the ability to control where in the text a match happens. The caret anchor (^) on its own looks for things at the beginning of the string.

01:07 Applying the MULTILINE flag changes this to be not just the beginning of the string, but also after newlines (\n). This example is finding the 'My' after the newline in the middle part of this string. Quantifiers give you the ability to repeat things.
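
The caret behavior described above can be sketched like this. The two-line string is an assumption chosen so that 'My' appears only after a newline.

```python
import re

text = "The parrot\nMy hovercraft"  # hypothetical two-line string

# Without a flag, ^ only anchors at the very start of the string.
at_start_only = re.search(r"^My", text)

# With re.MULTILINE, ^ also anchors right after each newline.
after_newline = re.search(r"^My", text, re.MULTILINE)

print(at_start_only)         # None
print(after_newline.span())  # (11, 13)
```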

01:23 \d means a digit, and plus (+) means one or more matches. search() returns the first match for this, which is the '555' portion of the phone number.

01:34 It’s a little suspicious. Using parentheses, you can group values together. Things that have been grouped can also be pulled out using the .groups() function.
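
Here is a short sketch of the quantifier and grouping behavior just described. The phone number string is a hypothetical example.

```python
import re

phone = "Call 555-1234 today"  # hypothetical text

# \d+ is one or more digits; search() stops at the first run it finds.
first_digits = re.search(r"\d+", phone)
print(first_digits.group())  # 555

# Parentheses capture each run of digits as its own group.
grouped = re.search(r"(\d+)-(\d+)", phone)
print(grouped.groups())  # ('555', '1234')
```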

01:46 Python provides five mechanisms for straight pattern-matching. search() looks for the regular expression inside of a string. match() looks for the regular expression at the beginning of the string. fullmatch() looks for the regular expression matching the entire string. findall() returns a list of matches. And finditer() returns an iterator of matches.

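02:09 The first three (search(), match(), and fullmatch()) return a Match object, or None when nothing matches. findall() returns a list of strings rather than Match objects, and finditer() returns an iterator that yields Match objects.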
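
The five functions can be compared side by side on one string. This is a sketch with a made-up input, not the example from the slides.

```python
import re

text = "spam and eggs and spam"  # hypothetical text

anywhere = re.search(r"eggs", text)          # found in the middle
at_start = re.match(r"eggs", text)           # None: text starts with "spam"
whole_fail = re.fullmatch(r"spam", text)     # None: does not cover everything
whole_ok = re.fullmatch(r"spam.*", text)     # covers the entire string
all_strings = re.findall(r"spam", text)      # list of matching strings
all_spans = [m.span() for m in re.finditer(r"spam", text)]  # via Match objects

print(anywhere.span())  # (9, 13)
print(at_start)         # None
print(whole_fail)       # None
print(all_strings)      # ['spam', 'spam']
print(all_spans)        # [(0, 4), (18, 22)]
```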

02:17 The Match object returned from one of those functions has methods on it. Three of those methods allow you to get at grouped patterns. .group() takes an argument and can return one or more of the grouped patterns.

02:29 .groups() returns a tuple of all of the grouped patterns. .groupdict() returns a dictionary for any of the named groups. .expand() can be used to turn backreferences into their actual values inside of a template. And then .start(), .end(), and .span() give information about where the match happened inside of the string—the .start() and .end() being the beginning and end of the matches and the .span() being a tuple containing the start and end value.
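
A compact sketch of these Match methods follows. The group names area and line, and the input string, are assumptions made for the illustration.

```python
import re

# Hypothetical pattern with two named groups.
m = re.search(r"(?P<area>\d{3})-(?P<line>\d{4})", "Call 555-1234 today")

print(m.group(0))       # 555-1234 (the entire match)
print(m.group(1, 2))    # ('555', '1234')
print(m.groups())       # ('555', '1234')
print(m.groupdict())    # {'area': '555', 'line': '1234'}

# expand() fills backreferences in a template with the matched values.
print(m.expand(r"\g<line> / \1"))  # 1234 / 555

print(m.start(), m.end(), m.span())  # 5 13 (5, 13)
```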

02:59 You can also use the index of a Match to access the group of that number. In the example below, the first group is found using match[1].

03:09 Remember that group 0—and therefore index 0—is the entire match, whereas 1 and higher are the numbered backreferences.
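
Indexing a Match might look like the following sketch. The address and pattern are hypothetical and deliberately simplistic.

```python
import re

m = re.search(r"(\w+)@(\w+)\.com", "mail bob@example.com")  # hypothetical input

print(m[0])  # bob@example.com (index 0 is the entire match)
print(m[1])  # bob
print(m[2])  # example

# Indexing gives the same result as calling .group() with that number.
same = m[1] == m.group(1)
```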

03:18 If you don’t like using a number for the group, the ?P operator allows you to name it. What appears inside of the angle brackets is the name you give it, and after the angle brackets, the regular expression for the group.

03:31 In this example, there are two named groups—one called prefix and the other called lineno. The .groupdict() function on the Match object returns these two groups with their names. Backreferences refer to a group earlier in the regex.
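
The exact regex from the slide is not in the transcript. The sketch below reuses the group names prefix and lineno, but the pattern bodies and the input string are assumptions.

```python
import re

# (?P<name>...) names a group; the pattern bodies here are hypothetical.
m = re.search(r"(?P<prefix>\w+):(?P<lineno>\d+)", "error:42")

print(m.groupdict())      # {'prefix': 'error', 'lineno': '42'}
print(m.group("lineno"))  # 42
print(m["prefix"])        # error
```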

03:48 You can specify a backreference with a number or, if the group was named, with its name. \1 is the first group. (?P=name) says to use the group with that name. In the example here, the group is named twice, and the backreference to twice comes afterwards.

04:08 Backreferences are often used for looking for repetition, as in both of the examples here, looking for the numbers '44' repeating themselves.
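
Both backreference styles can be sketched as follows. The group name twice comes from the transcript, while the input string and the two-digit pattern are assumptions.

```python
import re

text = "1234 4444"  # hypothetical input

# Numbered backreference: \1 must repeat whatever group 1 captured.
numbered = re.search(r"(\d\d)\1", text)

# Named backreference: (?P=twice) must repeat the group named "twice".
named = re.search(r"(?P<twice>\d\d)(?P=twice)", text)

print(numbered.group(), numbered.group(1))  # 4444 44
print(named.group(), named.group("twice"))  # 4444 44
```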

04:17 The re module provides functions for doing substitution. sub() and subn() substitute based on regular expressions, using either a replacement string or a function that you pass in. subn() does exactly what sub() does, but instead of returning the substituted string, it returns a tuple containing the substituted string and the number of substitutions that were made. split() is a more advanced version of str.split(), allowing you to split based on regular expression pattern-matching. escape() is useful when the string that you want to find literally might contain special regex characters. Running escape() on the string returns an escaped string, which will safely look for a literal version of what you passed in.
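
A sketch of these four functions, using made-up inputs:

```python
import re

text = "a1b22c333"  # hypothetical input

replaced = re.sub(r"\d+", "#", text)
replaced_n = re.subn(r"\d+", "#", text)  # (new string, number of substitutions)

# The replacement can be a function that receives each Match object.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), text)

# split() accepts a regex, here a comma or semicolon with optional spaces.
parts = re.split(r"\s*[,;]\s*", "eggs, spam;ham ,toast")

# Unescaped, "1+1" means "one or more 1s, then a 1", so it misses the text.
literal = "1+1"
unescaped = re.search(literal, "what is 1+1?")
escaped = re.search(re.escape(literal), "what is 1+1?")

print(replaced)         # a#b#c#
print(replaced_n)       # ('a#b#c#', 3)
print(doubled)          # a2b44c666
print(parts)            # ['eggs', 'spam', 'ham', 'toast']
print(unescaped)        # None
print(escaped.group())  # 1+1
```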

05:05 Most of the regular expression functions support flags to modify their behavior. The example here changes the search to be case insensitive. Small 'a' is found even though capital 'A' was in the regex.

05:19 There are flags for ignoring case, changing the behavior of anchors in multiline, changing the behavior of a period to include newlines, VERBOSE for making your regular expressions easier to read, DEBUG for showing information about how the regular expression is operating, and the ASCII, UNICODE, and LOCALE flags for changing the style of character encoding used inside of the regex. In Python 3, the default is UNICODE.
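
A few of these flags in action, as a sketch with hypothetical inputs. DEBUG and LOCALE are left out to keep it short.

```python
import re

# IGNORECASE: a capital A in the regex still finds lowercase text.
ci = re.search(r"A+", "aaa", re.IGNORECASE)

# DOTALL: the period also matches a newline.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)

# VERBOSE: whitespace and # comments inside the pattern are ignored.
phone = re.compile(
    r"""
    (\d{3})   # first three digits
    -         # separator
    (\d{4})   # last four digits
    """,
    re.VERBOSE,
)
phone_groups = phone.search("555-1234").groups()

# ASCII: \w stops at the accented letter; the Unicode default does not.
word = "caf\u00e9"
ascii_word = re.search(r"\w+", word, re.ASCII).group()
unicode_word = re.search(r"\w+", word).group()

print(ci.group())            # aaa
print(dot_default, dot_all is not None)  # None True
print(phone_groups)          # ('555', '1234')
print(len(ascii_word), len(unicode_word))  # 3 4
```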

05:48 I also showed you how to do conditional matching, changing the criteria based on the presence or absence of a group.
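
Conditional matching uses the (?(group)yes|no) syntax. In this sketch, which uses a made-up pattern, a closing angle bracket is required only when an opening one was captured.

```python
import re

# Group 1 optionally captures "<"; (?(1)>) demands ">" only if group 1 matched.
pattern = re.compile(r"^(<)?\w+(?(1)>)$")

candidates = ["<user>", "user", "<user", "user>"]  # hypothetical inputs
results = {s: bool(pattern.search(s)) for s in candidates}

print(results)
# {'<user>': True, 'user': True, '<user': False, 'user>': False}
```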

05:54 I followed that up with lookahead and lookbehind matches, which allow you to look for things that are preceded or followed by something, but that something isn’t consumed.
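
A sketch of a lookahead and a lookbehind, with hypothetical inputs. In both cases the surrounding text is checked but is not part of the match.

```python
import re

# Lookahead: names that are followed by ".py", without consuming ".py".
modules = re.findall(r"\w+(?=\.py)", "main.py utils.py notes.txt")

# Lookbehind: digits that are preceded by "$", without consuming "$".
prices = re.findall(r"(?<=\$)\d+", "cost $40, count 12")

# The span confirms that the lookahead text was not consumed.
m = re.search(r"\w+(?=\.py)", "main.py")

print(modules)   # ['main', 'utils']
print(prices)    # ['40']
print(m.span())  # (0, 4)
```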

06:06 Regular expressions are a powerful tool in your tool belt. I hope you’ve enjoyed the course. I know I’ve enjoyed giving it. Thanks for your attention.

Ghani on Nov. 22, 2020

Excellent course! I like very much the review at the start of each new lesson. Well-done and thanks.

zulfiiaditto on Nov. 29, 2020

Thank you so much for very detailed course!

raulfz on Feb. 2, 2021

Excellent course, it was really helpful to see how regex are used in the python standard library.

Cici Du on Feb. 9, 2021

Great course! I have struggled to apply regex after my first 2 courses in Python, so I took this course specifically for regex. This course explains things thoroughly!

Dirk on March 22, 2021

Thank you for the good course. Everything well to the point. Will help me in my work.

vuqpham on March 29, 2021

The instructor definitely has a great pedagogical skill! Thank you.

datagirl-89 on May 10, 2021

This was a great course and I learned a lot. I used it to complete some assignments in another regex module on Coursera that didn’t provide the instruction I needed to fully understand the concepts. I would have liked a few examples of re.finditer() and when to use that. Otherwise, it was outstanding. Thanks!!

Christopher Trudeau RP Team on May 11, 2021

Hi datagirl-89,

I’m glad you found the course useful.

Both re.findall() and re.finditer() are for finding multiple matches inside of some text. The re.findall() method returns just the matching strings, whereas re.finditer() returns match objects (this is actually a bit inconsistent and confusing).

The other difference is that re.findall() returns a list, whereas re.finditer() returns an iterator. If you haven’t come across iterators before, they’re used similarly to lists, but they calculate results on the fly. Lists are precalculated and take up the amount of memory of everything in the list. Iterators can be used the same way you use lists, but they don’t precalculate everything; they calculate each item when you ask for it.

More on iterators here:

realpython.com/python-for-loop/#iterators

As for a quick example:

>>> import re
>>> text = "She sells sea shells by the sea shore"
>>> re.findall(r'se', text)
['se', 'se', 'se']
>>> re.finditer(r'se', text)
<callable_iterator object at 0x7f944ff29880>
>>> for match in re.finditer(r'se', text):
...     print(match)
...
<re.Match object; span=(4, 6), match='se'>
<re.Match object; span=(10, 12), match='se'>
<re.Match object; span=(28, 30), match='se'>

Generally you want to prefer iterators over lists whenever dealing with large amounts of data to avoid using up too much memory. In this particular case it isn’t just about the iterator, but also what you want to get back. If you’re just after some matching strings, re.findall() would be easier to use as you get the strings themselves back. If you want more info about the match, you may want to use re.finditer() even if it isn’t a lot of data, because it gives you the match objects.

Shubha on Oct. 17, 2021

Great Course! Enjoyed learning lots of new stuff using regex. Thank you.

aniketbarphe on Nov. 25, 2021

Great Course! Thank You!

andresfmesad on Aug. 5, 2022

What a complex but great course. Are there any books you recommend to learn more about this?

Christopher Trudeau RP Team on Aug. 7, 2022

Hi Andre,

I asked my colleagues and there were two suggestions:

“Mastering Regular Expressions” by Jeffrey Friedl – it is a bit older, but the tech hasn’t changed

and as strange as this might sound on a Python site:

“Teach Yourself Perl in 21 Days” by Laura Lemay

Perl uses regexes heavily, so you may find useful stuff in that book.

I haven’t read either of these, but trust the folks making the recommendations.

Christopher Trudeau RP Team on Aug. 18, 2022

Late to the party, but in case anyone else reading the comments is interested, one more recommendation from another RP guy:

learnbyexample.github.io/books/#python-re-gex
