Regular Expressions and Building Regexes in Python (Summary)

Congratulations! You’ve mastered a tremendous amount of material. Regular expressions are extremely versatile and powerful—literally a language in their own right. You’ll find them invaluable in your Python coding.

You now know how to:

  • Use re.search() to perform regex matching in Python
  • Create complex pattern matching searches with regex metacharacters
  • Tweak regex parsing behavior with flags
  • Make full use of all the functions that the re module provides
  • Precompile a regex in Python
  • Extract information from match objects

00:00 Thanks for sticking with me so far. I’m just going to wrap things up with a quick review. In the first lesson, I covered what a regular expression was, where they came from, and why to use them.

00:11 The first regex you were taught was plain matching—just looking for straight characters like "spam" inside of a sentence. Class matching gives you the ability to match a range of values. In this case, the numbers 0 through 9 are in a class.

00:26 This matches the '2' in "2nd" in the sentence. Meta-characters are often used as shortcuts for certain kinds of classes of characters.

00:35 \s is whitespace, \d is a digit, and \w is a word character—a letter, a number, or an underscore. This comes from what is a word or a valid variable name in programming languages. This is matching the space, the '2', the 'n', and the 'd' inside of this sentence.
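
The slide behind these three kinds of matching is not reproduced in the transcript. A minimal sketch, assuming a hypothetical sample sentence that contains "spam" and "2nd", might look like this:

```python
import re

# Hypothetical sample sentence; the slide's actual text is not in the transcript.
sentence = "I ate spam on the 2nd day"

# Plain matching: look for the literal characters "spam".
plain = re.search(r"spam", sentence)

# Class matching: [0-9] matches any single digit.
digit = re.search(r"[0-9]", sentence)

# Metacharacter shortcuts: one whitespace, one digit, then two word characters.
shortcut = re.search(r"\s\d\w\w", sentence)

print(plain.group())           # spam
print(digit.group())           # 2
print(repr(shortcut.group()))  # ' 2nd'
```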

00:57 Anchors give you the ability to control where in the text a match happens. The caret anchor (^) on its own looks for things at the beginning of the string.

01:07 Applying the MULTILINE flag changes this to be not just the beginning of the string, but also after newlines (\n). This example is finding the 'My' after the newline in the middle part of this string. Quantifiers give you the ability to repeat things.
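
As a rough illustration of the caret anchor with and without MULTILINE, here is a sketch that assumes a hypothetical two-line string in which both lines start with "My":

```python
import re

# Hypothetical two-line string; the original example text is not in the transcript.
text = "My first line\nMy second line"

# Without a flag, ^ anchors only at the very beginning of the string.
at_string_start = re.findall(r"^My", text)

# With re.MULTILINE, ^ also anchors right after each newline.
at_line_starts = re.findall(r"^My", text, re.MULTILINE)

# Position of the second "My", the one that follows the newline.
second = list(re.finditer(r"^My", text, re.MULTILINE))[1]

print(at_string_start)  # ['My']
print(at_line_starts)   # ['My', 'My']
print(second.start())   # 14
```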

01:23 \d means a digit, and plus (+) means one or more repetitions. search() returns the first match for this, which is the '555' portion of the phone number.

01:34 It’s a little suspicious. Using parentheses, you can group values together. Things that have been grouped can also be pulled out using the .groups() function.
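
A short sketch of quantifiers and grouping, assuming a hypothetical phone number string that begins with 555:

```python
import re

# Hypothetical text; only the '555' portion is mentioned in the transcript.
phone = "Call 555-1234 today"

# \d+ matches one or more digits; search() stops at the first match.
first = re.search(r"\d+", phone)

# Parentheses capture each run of digits as its own group.
grouped = re.search(r"(\d+)-(\d+)", phone)

print(first.group())     # 555
print(grouped.groups())  # ('555', '1234')
```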

01:46 Python provides five mechanisms for straight pattern-matching. search() looks for the regular expression inside of a string. match() looks for the regular expression at the beginning of the string. fullmatch() looks for the regular expression matching the entire string. findall() returns a list of matches. And finditer() returns an iterator of matches.

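02:09 The first three, search(), match(), and fullmatch(), return a Match object, or None if nothing matches. findall() returns a list of strings (or tuples when the pattern has more than one group), and finditer() returns an iterator that yields Match objects.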
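
The five functions can be compared side by side. This is only a sketch on a hypothetical string, not the example used in the course:

```python
import re

text = "spam and eggs and spam"  # hypothetical sample text

found = re.search(r"eggs", text)            # anywhere in the string
at_start = re.match(r"eggs", text)          # only at the beginning, so None here
whole = re.fullmatch(r"spam.*spam", text)   # must cover the entire string
all_hits = re.findall(r"spam", text)        # list of matched strings
spans = [m.span() for m in re.finditer(r"spam", text)]  # built from Match objects

print(found.group())   # eggs
print(at_start)        # None
print(whole.group())   # spam and eggs and spam
print(all_hits)        # ['spam', 'spam']
print(spans)           # [(0, 4), (18, 22)]
```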

02:17 The Match object returned from one of those functions has methods on it. Three of those methods allow you to get at grouped patterns. .group() takes an argument and can return one or more of the grouped patterns.

02:29 .groups() returns a tuple of all of the grouped patterns. .groupdict() returns a dictionary for any of the named groups. .expand() can be used to turn backreferences into their actual values inside of a template. And then .start(), .end(), and .span() give information about where the match happened inside of the string—the .start() and .end() being the beginning and end of the matches and the .span() being a tuple containing the start and end value.

02:59 You can also use the index of a Match to access the group of that number. In the example below, the first group is found using match[1].

03:09 Remember that group 0—and therefore index 0—is the entire match, whereas 1 and higher are the numbered backreferences.
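
The Match methods described above can be tried on one match. The pattern and text in this sketch are hypothetical stand-ins for the slide:

```python
import re

# Hypothetical pattern and text with two numbered groups.
m = re.search(r"(\w+)@(\w+)\.com", "mail bob@example.com now")

print(m.group(0))            # bob@example.com  (group 0 is the entire match)
print(m.group(1, 2))         # ('bob', 'example')
print(m.groups())            # ('bob', 'example')
print(m[1])                  # bob  (index access, same as m.group(1))
print(m.expand(r"\2: \1"))   # example: bob
print(m.start(), m.end())    # 5 20
print(m.span())              # (5, 20)
```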

03:18 If you don’t like using a number for the group, the (?P<name>...) syntax allows you to name it. What appears inside of the angle brackets is the name you give it, and after the angle brackets comes the regular expression for the group.

03:31 In this example, there are two named groups—one called prefix and the other called lineno. The .groupdict() function on the Match object returns these two groups with their names. Backreferences refer to a group earlier in the regex.
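
A sketch of two named groups follows. The names prefix and lineno come from the lesson, but the pattern bodies and the input string are assumptions:

```python
import re

# Group names are from the lesson; the pattern bodies and input are hypothetical.
m = re.search(r"(?P<prefix>[a-z]+):(?P<lineno>\d+)", "error:42")

print(m.groupdict())   # {'prefix': 'error', 'lineno': '42'}
print(m["lineno"])     # 42  (named groups also work with index access)
print(m.group(2))      # 42  (named groups still get a number)
```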

03:48 You can specify a backreference with a number or, if the group was named, with its name. \1 is the first group. (?P=name) says to use the group with that name. In the example here, the group is named twice, and the backreference to twice appears afterwards.

04:08 Backreferences are often used for looking for repetition, as in both of the examples here, looking for the numbers '44' repeating themselves.
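
Both backreference styles can be sketched as below. The group name twice and the repeated '44' come from the lesson, while the surrounding text is hypothetical:

```python
import re

text = "id 4444 here"  # hypothetical text containing '44' twice in a row

# Numbered backreference: \1 must repeat whatever group 1 captured.
numbered = re.search(r"(\d\d)\1", text)

# Named backreference: (?P=twice) must repeat the group named "twice".
named = re.search(r"(?P<twice>\d\d)(?P=twice)", text)

print(numbered.group(), numbered.group(1))  # 4444 44
print(named.group(), named["twice"])        # 4444 44
```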

04:17 The re module provides functions for doing substitution. sub() and subn() substitute based on regular expressions, using either a replacement value or a function that you pass in. subn() does exactly what sub() does, but instead of returning just the substituted string, it returns a tuple containing the substituted string and the number of substitutions that were made. split() is a more advanced version of str.split(), allowing you to split based on regular expression pattern-matching. escape() is useful when you want to search literally for a string that possibly contains special regex characters. Running escape() on the string returns an escaped string, which will safely look for a literal version of what you passed in.
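
The four functions can be sketched on small, hypothetical inputs. The exact string returned by escape() is left unprinted here because it has varied between Python versions:

```python
import re

text = "a1b22c333"  # hypothetical input

replaced = re.sub(r"\d+", "#", text)      # replace each run of digits
counted = re.subn(r"\d+", "#", text)      # same, plus the number of substitutions

# A function can compute each replacement from its Match object.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), text)

# split() on a pattern: commas with optional surrounding whitespace.
parts = re.split(r"\s*,\s*", "a , b,c ,d")

# Unescaped, "+" is a quantifier, so "1+1" does not match the literal text.
unescaped = re.search("1+1", "so 1+1=2")
escaped = re.search(re.escape("1+1"), "so 1+1=2")

print(replaced)         # a#b#c#
print(counted)          # ('a#b#c#', 3)
print(doubled)          # a2b44c666
print(parts)            # ['a', 'b', 'c', 'd']
print(unescaped)        # None
print(escaped.group())  # 1+1
```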

05:05 Most of the regular expression functions support flags to modify their behavior. The example here changes the search to be case insensitive. Small 'a' is found even though capital 'A' was in the regex.

05:19 There are flags for ignoring case, changing the behavior of anchors in multiline, changing the behavior of a period to include newlines, VERBOSE for making your regular expressions easier to read, DEBUG for showing information about how the regular expression is operating, and the ASCII, UNICODE, and LOCALE flags for changing the style of character encoding used inside of the regex. In Python 3, the default for string patterns is UNICODE.
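
A few of these flags in action, as a sketch on hypothetical inputs. The VERBOSE pattern is also precompiled with re.compile() to show that flags can be attached once at compile time:

```python
import re

# IGNORECASE: capital 'A' in the pattern still finds lowercase 'a'.
ignore_case = re.search(r"A+", "baaad", re.IGNORECASE)

# DOTALL: a period normally skips newlines; with the flag it matches them.
no_flag = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)

# VERBOSE: whitespace and comments inside the pattern are ignored.
phone = re.compile(
    r"""
    (\d{3})   # first three digits
    -         # literal separator
    (\d{4})   # last four digits
    """,
    re.VERBOSE,
)

# ASCII: \w stops matching non-ASCII letters such as 'é'.
unicode_words = re.findall(r"\w+", "café")
ascii_words = re.findall(r"\w+", "café", re.ASCII)

print(ignore_case.group())               # aaa
print(no_flag)                           # None
print(repr(dot_all.group()))             # 'a\nb'
print(phone.search("555-1234").groups()) # ('555', '1234')
print(unicode_words, ascii_words)        # ['café'] ['caf']
```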

05:48 I also showed you how to do conditional matching, changing the criteria based on the presence or absence of a group.

05:54 I followed that up with lookahead and lookbehind matches, which allow you to look for things that are preceded or followed by something, but that something isn’t consumed.
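
Conditional matching and lookarounds can be sketched as follows. The patterns and inputs are hypothetical, chosen only to show each construct:

```python
import re

# Conditional: require a closing ">" only if group 1 matched an opening "<".
conditional = r"^(<)?\w+(?(1)>)$"
balanced = re.search(conditional, "<tag>")
bare = re.search(conditional, "tag")
unclosed = re.search(conditional, "<tag")

# Lookahead: match a word only if ".py" follows; ".py" is not consumed.
ahead = re.search(r"\w+(?=\.py)", "run script.py now")

# Lookbehind: match digits only if "$" precedes them; "$" is not consumed.
behind = re.search(r"(?<=\$)\d+", "cost $42")

print(balanced.group())  # <tag>
print(bare.group())      # tag
print(unclosed)          # None
print(ahead.group())     # script
print(behind.group())    # 42
```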

06:06 Regular expressions are a powerful tool in your tool belt. I hope you’ve enjoyed the course. I know I’ve enjoyed giving it. Thanks for your attention.

Ghani on Nov. 22, 2020

Excellent course! I like very much the review at the start of each new lesson. Well-done and thanks.

zulfiiaditto on Nov. 29, 2020

Thank you so much for very detailed course!
