Regular Expressions and Building Regexes in Python (Summary)

Christopher Trudeau 06:17

Congratulations! You’ve mastered a tremendous amount of material. Regular expressions are extremely versatile and powerful—literally a language in their own right. You’ll find them invaluable in your Python coding.

You now know how to:

Use re.search() to perform regex matching in Python
Create complex pattern matching searches with regex metacharacters
Tweak regex parsing behavior with flags
Make full use of all the functions that the re module provides
Precompile a regex in Python
Extract information from match objects

00:00 Thanks for sticking with me so far. I’m just going to wrap things up with a quick review. In the first lesson, I covered what a regular expression was, where they came from, and why to use them.

00:11 The first regex you were taught was plain matching—just looking for straight characters like "spam" inside of a sentence. Class matching gives you the ability to match a range of values. In this case, the numbers 0 through 9 are in a class.

00:26 This matches the '2' in "2nd" in the sentence. Meta-characters are often used as shortcuts for certain kinds of classes of characters.
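A minimal sketch of plain matching and class matching. The sample sentence is an assumption made for illustration and may differ from the one shown on the course slides.

```python
import re

# Hypothetical sample sentence, not taken from the course slides.
text = "I ordered spam on my 2nd visit"

# Plain matching: look for the literal characters "spam".
plain = re.search(r"spam", text)

# Class matching: [0-9] matches any single digit from 0 through 9.
digit = re.search(r"[0-9]", text)

print(plain.group())  # spam
print(digit.group())  # 2
```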

00:35 \s is whitespace, \d is a digit, and \w is a word character—a letter, a number, or an underscore. This comes from what is a word or a valid variable name in programming languages. This is matching the space, the '2', the 'n', and the 'd' inside of this sentence.
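The three shortcuts can be combined as in the sketch below. The input strings are hypothetical examples.

```python
import re

# Hypothetical sample text.
text = "my 2nd try"

# \s = whitespace, \d = digit, \w = word character (letter, digit, underscore).
shortcut_match = re.findall(r"\s\d\w\w", text)
print(shortcut_match)  # [' 2nd']

# \w includes the underscore, which is why it fits variable-style names.
words = re.findall(r"\w+", "snake_case 42!")
print(words)  # ['snake_case', '42']
```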

00:57 Anchors give you the ability to control where in the text a match happens. The caret anchor (^) on its own looks for things at the beginning of the string.

01:07 Applying the MULTILINE flag changes this to be not just the beginning of the string, but also after newlines (\n). This example is finding the 'My' after the newline in the middle part of this string. Quantifiers give you the ability to repeat things.
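A small sketch of the caret anchor with and without the MULTILINE flag, using a made-up two-line string:

```python
import re

# Hypothetical two-line string.
text = "My spam\nMy eggs"

# Without the flag, ^ only matches at the very start of the string.
start_only = re.findall(r"^My", text)

# With re.MULTILINE, ^ also matches right after each newline.
every_line = re.findall(r"^My", text, re.MULTILINE)

print(start_only)  # ['My']
print(every_line)  # ['My', 'My']
```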

01:23 \d means a digit, and plus (+) means one or more matches. search() returns the first match for this, which is the '555' portion of the phone number.

01:34 It’s a little suspicious. Using parentheses, you can group values together. Things that have been grouped can also be pulled out using the .groups() function.
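Quantifiers and grouping might look like this in practice. The phone number and surrounding text are invented for the example.

```python
import re

# Hypothetical sample text with a made-up phone number.
text = "Call 555-1234 today"

# \d+ means one or more digits; search() returns only the first match.
first_digits = re.search(r"\d+", text)
print(first_digits.group())  # 555

# Parentheses create groups that can be pulled out with .groups().
grouped = re.search(r"(\d+)-(\d+)", text)
print(grouped.groups())  # ('555', '1234')
```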

01:46 Python provides five mechanisms for straight pattern-matching. search() looks for the regular expression inside of a string. match() looks for the regular expression at the beginning of the string. fullmatch() looks for the regular expression matching the entire string. findall() returns a list of matches. And finditer() returns an iterator of matches.

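02:09 The first three, search(), match(), and fullmatch(), return a single Match object, or None when nothing matches. findall() returns a list of strings, and finditer() returns an iterator that yields Match objects.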
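The following sketch contrasts the five functions on one hypothetical string, so the differences in where they look and what they return are visible side by side.

```python
import re

# Hypothetical sample text.
text = "spam and eggs"

found_anywhere = re.search(r"eggs", text)          # Match: found inside the string
at_start = re.match(r"eggs", text)                 # None: "eggs" is not at the start
whole_partial = re.fullmatch(r"spam", text)        # None: pattern must cover everything
whole_full = re.fullmatch(r"spam and eggs", text)  # Match: entire string matches

all_words = re.findall(r"\w+", text)               # list of strings
iter_words = [m.group() for m in re.finditer(r"\w+", text)]  # built from Match objects

print(found_anywhere.group())  # eggs
print(at_start)                # None
print(all_words)               # ['spam', 'and', 'eggs']
```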

02:17 The Match object returned from one of those functions has methods on it. Three of those methods allow you to get at grouped patterns. .group() takes an argument and can return one or more of the grouped patterns.

02:29 .groups() returns a tuple of all of the grouped patterns. .groupdict() returns a dictionary for any of the named groups. .expand() can be used to turn backreferences into their actual values inside of a template. And then .start(), .end(), and .span() give information about where the match happened inside of the string—the .start() and .end() being the beginning and end of the matches and the .span() being a tuple containing the start and end value.
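As a rough illustration of these Match methods, the sketch below uses a made-up pattern with one numbered group and one named group (the name domain is an assumption for this example).

```python
import re

# Hypothetical text and pattern; "domain" is an illustrative group name.
text = "mail bob@example now"
m = re.search(r"(\w+)@(?P<domain>\w+)", text)

whole = m.group(0)              # the entire match
pair = m.group(1, "domain")     # several groups at once, as a tuple
all_groups = m.groups()         # tuple of every group
named = m.groupdict()           # dict of named groups only
templated = m.expand(r"\g<domain>:\1")  # fill a template from the groups
position = (m.start(), m.end(), m.span())

print(whole)       # bob@example
print(all_groups)  # ('bob', 'example')
print(named)       # {'domain': 'example'}
print(templated)   # example:bob
print(position)    # (5, 16, (5, 16))
```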

02:59 You can also use the index of a Match to access the group of that number. In the example below, the first group is found using match[1].

03:09 Remember that group 0—and therefore index 0—is the entire match, whereas 1 and higher are the numbered backreferences.
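Index access on a Match object can be sketched as follows, assuming Python 3.6 or later and a made-up input string:

```python
import re

# Hypothetical input; indexing a Match requires Python 3.6+.
m = re.search(r"(\d+)-(\d+)", "555-1234")

print(m[0])  # 555-1234  (index 0 is the entire match)
print(m[1])  # 555       (first group)
print(m[2])  # 1234      (second group)

# Indexing is equivalent to calling .group() with the same number.
same_as_group = m[1] == m.group(1)
```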

03:18 If you don’t like using a number for the group, the (?P<name>...) syntax allows you to name it. What appears inside of the angle brackets is the name you give it, and after the angle brackets comes the regular expression for the group.

03:31 In this example, there are two named groups—one called prefix and the other called lineno. The .groupdict() function on the Match object returns these two groups with their names. Backreferences refer to a group earlier in the regex.
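A sketch of two named groups called prefix and lineno, as in the transcript. The input string and the exact sub-patterns are assumptions, since the original slide is not reproduced here.

```python
import re

# Hypothetical input and sub-patterns; only the group names come from the transcript.
m = re.search(r"(?P<prefix>\w+):(?P<lineno>\d+)", "error:42")

named_groups = m.groupdict()
print(named_groups)   # {'prefix': 'error', 'lineno': '42'}
print(m["lineno"])    # 42  (named groups can also be accessed by name)
```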

03:48 You can specify a backreference with a number or, if the group was named, with its name. \1 is the first group. (?P=name) says to use the group with that name. In the example here, the group is named twice, and the backreference (?P=twice) refers to it afterwards.

04:08 Backreferences are often used for looking for repetition, as in both of the examples here, looking for the numbers '44' repeating themselves.
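Both backreference styles can be sketched on a made-up digit string that contains a repeated 4. The group name twice follows the transcript; the input is an assumption.

```python
import re

# Hypothetical input containing a doubled digit.
text = "1234456"

# Numbered backreference: \1 must repeat whatever group 1 matched.
numbered = re.search(r"(\d)\1", text)

# Named backreference: (?P=twice) must repeat the group named "twice".
by_name = re.search(r"(?P<twice>\d)(?P=twice)", text)

print(numbered.group())  # 44
print(by_name.group())   # 44
```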

04:17 The re module provides functions for doing substitution. sub() and subn() substitute based on regular expressions, using either a replacement string or a function that you pass in. subn() does exactly what sub() does, but instead of returning the substituted string, it returns a tuple containing the substituted string and the number of substitutions that were made. split() is a more advanced version of str.split(), allowing you to split based on regular expression pattern-matching. escape() is useful when the string you want to find literally might contain special regex characters. Running escape() on the string returns an escaped string, which will safely look for a literal version of what you passed in.
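The four functions might be used roughly as follows. All input strings are hypothetical, and the doubling function is just one illustration of passing a callable to sub().

```python
import re

# Hypothetical input string.
text = "a1b22c333"

replaced = re.sub(r"\d+", "#", text)             # replacement string
replaced_n = re.subn(r"\d+", "#", text)          # (new string, substitution count)
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), text)  # function

parts = re.split(r"\s*,\s*", "spam , eggs,ham")  # split on a regex

# "+" is a quantifier, so the raw pattern "1+1" does not find the literal text.
raw_hit = re.search("1+1", "so 1+1=2")
escaped_hit = re.search(re.escape("1+1"), "so 1+1=2")

print(replaced)             # a#b#c#
print(replaced_n)           # ('a#b#c#', 3)
print(doubled)              # a2b44c666
print(parts)                # ['spam', 'eggs', 'ham']
print(raw_hit)              # None
print(escaped_hit.group())  # 1+1
```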

05:05 Most of the regular expression functions support flags to modify their behavior. The example here changes the search to be case insensitive. Small 'a' is found even though capital 'A' was in the regex.

05:19 There are flags for ignoring case, changing the behavior of anchors in multiline text, changing the behavior of a period to include newlines, VERBOSE for making your regular expressions easier to read, DEBUG for showing information about how the regular expression is operating, and the ASCII, UNICODE, and LOCALE flags for changing the style of character encoding used inside of the regex. In Python 3, the default for string patterns is UNICODE.
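A few of these flags in action, as a sketch with made-up inputs. The VERBOSE example also precompiles the pattern with re.compile(), which the course lists among its topics.

```python
import re

# IGNORECASE: capital "A" in the regex finds small "a" in the text.
no_flag = re.search(r"A", "spam")
ignore_case = re.search(r"A", "spam", re.IGNORECASE)

# DOTALL: the period also matches a newline.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)

# VERBOSE: whitespace and comments inside the pattern are ignored.
phone = re.compile(
    r"""
    (\d{3})   # first three digits
    -         # literal separator
    (\d{4})   # last four digits
    """,
    re.VERBOSE,
)
phone_groups = phone.search("555-1234").groups()

# ASCII: \w stops matching non-ASCII letters such as "\u00e9".
unicode_word = re.search(r"\w+", "caf\u00e9").group()
ascii_word = re.search(r"\w+", "caf\u00e9", re.ASCII).group()

print(ignore_case.group())  # a
print(phone_groups)         # ('555', '1234')
print(ascii_word)           # caf
```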

05:48 I also showed you how to do conditional matching, changing the criteria based on the presence or absence of a group.
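One way to sketch conditional matching is a pattern that requires a closing angle bracket only when an opening one was matched. The pattern and inputs are illustrative assumptions, not the course's own example.

```python
import re

# (?(1)>) means: if group 1 matched, require ">", otherwise require nothing.
pattern = re.compile(r"^(<)?\w+(?(1)>)$")

bracketed = pattern.search("<spam>")   # opening "<" present, so ">" is required
bare = pattern.search("spam")          # no "<", so no ">" is expected
unclosed = pattern.search("<spam")     # "<" without ">" fails
stray_close = pattern.search("spam>")  # ">" without "<" fails

print(bool(bracketed), bool(bare), bool(unclosed), bool(stray_close))
# True True False False
```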

05:54 I followed that up with lookahead and lookbehind matches, which allow you to look for things that are preceded or followed by something, but that something isn’t consumed.
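Lookahead and lookbehind can be sketched as below. In each case the surrounding text is checked but not included in the match. The inputs are hypothetical.

```python
import re

# Lookahead (?=...): a word that is followed by ".py"; ".py" is not consumed.
stem = re.search(r"\w+(?=\.py)", "run script.py now")

# Lookbehind (?<=...): digits that are preceded by "$"; "$" is not consumed.
price = re.search(r"(?<=\$)\d+", "cost $42 or 17")

# Negative lookahead (?!...): whole numbers that are not followed by "%".
not_percent = re.findall(r"\b\d+\b(?!%)", "10% of 250")

print(stem.group())   # script
print(price.group())  # 42
print(not_percent)    # ['250']
```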

06:06 Regular expressions are a powerful tool in your tool belt. I hope you’ve enjoyed the course. I know I’ve enjoyed giving it. Thanks for your attention.

Ghani on Nov. 22, 2020

Excellent course! I like very much the review at the start of each new lesson. Well-done and thanks.

zulfiiaditto on Nov. 29, 2020

Thank you so much for very detailed course!

raulfz on Feb. 2, 2021

Excellent course, it was really helpful to see how regex are used in the python standard library.

Cici Du on Feb. 9, 2021

Great course! I have struggled to apply regex after my first 2 courses in Python, so I took this course specifically for regex. This course explains things thoroughly!

Dirk on March 22, 2021

Thank you for the good course. Everything well to the point. Will help me in my work.

vuqpham on March 29, 2021

The instructor definitely has a great pedagogical skill! Thank you.

datagirl-89 on May 10, 2021

This was a great course and I learned a lot. I used it to complete some assignments in another regex module on Coursera that didn’t provide the instruction I needed to fully understand the concepts. I would have liked a few examples of re.finditer() and when to use that. Otherwise, it was outstanding. Thanks!!

Christopher Trudeau RP Team on May 11, 2021

Hi datagirl-89,

I’m glad you found the course useful.

Both re.findall() and re.finditer() are for finding multiple matches inside of some text. The re.findall() function returns just the matching strings, whereas re.finditer() yields match objects (this is actually a bit inconsistent and confusing).

The other difference is re.findall() returns a list, whereas re.finditer() returns an iterator. If you haven’t come across iterators before, they’re used similarly to lists, but they calculate results on the fly. Lists are precalculated and take up the amount of memory of everything in the list. Iterators can be used the same way you use lists, but they don’t precalculate everything. They calculate each item when you ask for it.
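A short sketch of the difference, using a made-up string. With re.finditer() each item is a match object, so position information is available, and items are produced one at a time.

```python
import re

# Hypothetical sample text.
text = "a1 b22 c333"

# findall(): a fully built list of matching strings.
as_list = re.findall(r"\d+", text)

# finditer(): an iterator that produces match objects on demand.
matches = re.finditer(r"\d+", text)
first = next(matches)                  # only the first match is computed here
rest = [m.group() for m in matches]    # the remaining matches, consumed lazily

print(as_list)       # ['1', '22', '333']
print(first.span())  # (1, 2)
print(rest)          # ['22', '333']
```

re.finditer() tends to be the better fit when you need positions or groups for each match, or when the text is large enough that building the full list up front is wasteful.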