Creating Flags

00:00 In the previous lesson, I introduced you to some other methods inside of the re module. In this lesson, I’ll show you how to use flags to modify your regular expression behavior.

00:11 Most of Python’s regex methods support additional flags to control how they work. An example of this would be changing a regular expression from being case sensitive to case insensitive. You do that by setting a flag.

00:25 All these methods that you’ve seen before also support the addition of modifying flags. Most of the flags have both a short form and a long form, so you can pass it in as either re.I or re.IGNORECASE.
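As a minimal sketch of the short and long forms, the snippet below checks that each alias refers to the same flag object. The `aliases` dictionary is only an illustration and is not part of the lesson.

```python
import re

# Short alias -> long name for the flags discussed in this lesson.
# (DEBUG has no single-letter alias, so it is not listed here.)
aliases = {
    "I": "IGNORECASE",
    "M": "MULTILINE",
    "S": "DOTALL",
    "X": "VERBOSE",
    "A": "ASCII",
    "L": "LOCALE",
}

for short_name, long_name in aliases.items():
    same = getattr(re, short_name) is getattr(re, long_name)
    print(f"re.{short_name} is re.{long_name}: {same}")
```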

00:40 As you can tell from the naming of this one, when present, it makes the regular expression ignore case. You’ve seen me use the MULTILINE button inside of pythex.

00:50 What that’s actually doing is setting the MULTILINE flag inside of Python. As you may recall, this changes the behaviors of the caret (^) and dollar sign ($) anchors, allowing them to match on embedded newlines (\n), as well as the beginning and ending of the string. In normal practice, the period (.) does not match the newline character.

01:09 Adding the DOTALL flag allows it to match the newline character. The VERBOSE flag is one of my favorite features of Python regular expressions. It changes how whitespace works inside of the regex, allowing you to use multiple lines. I’ll show you an example of it later, but it makes it far easier to comment and display what the regex is doing inside of your code. The DEBUG flag is barely documented.

01:33 The documentation says it exists, but there aren’t a lot of details on how it works unless you go into the compiler. It shows debug information about how the regular expression is working.

01:47 The default character encoding in Python 2 was ASCII. That changed to Unicode in Python 3. The re module changed with it. If you don’t specify a flag, Unicode is the default encoding for regexes and the strings being searched.

02:03 You can control that by setting the ASCII flag or setting the LOCALE flag. The LOCALE flag sets it to whatever the default encoding is for the user’s current locale.
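To make the Unicode versus ASCII difference concrete, here is a small sketch using `\w`. The sample word is my own choice and does not come from the lesson.

```python
import re

# "café", written with an escape so the accented letter is unambiguous.
text = "caf\u00e9"

# By default, str patterns are Unicode-aware, so \w accepts the accented letter.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w only accepts [a-zA-Z0-9_], so the match stops before it.
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)
print(ascii_words)
```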

02:17 Start with a simple search looking for one or more instances of the small letter "a". Nothing is found. Add the IGNORECASE flag at the end and the behavior changes. The match is now case insensitive.

02:38 If you don’t feel like typing IGNORECASE, I is the short form.
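The search just described might look something like the following. The sample string is an assumption, since the transcript doesn't show what was typed on screen.

```python
import re

text = "AAA bbb"  # hypothetical sample string with only capital A's

no_flag = re.search(r"a+", text)                    # case sensitive: no match
long_form = re.search(r"a+", text, re.IGNORECASE)   # case insensitive
short_form = re.search(r"a+", text, re.I)           # same flag, short form

print(no_flag)
print(long_form.group())
print(short_form.group())
```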

02:48 You’ve seen the multiline behavior inside of pythex. Here it is in Python.

02:58 And of course, the \A variant doesn’t match in multiline or non-multiline mode. This is the same as what I showed you in pythex. Without modification, the period meta-character (.) doesn’t match newlines.
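Here is a rough sketch of the anchor behavior with and without MULTILINE. The two-line string is made up for illustration.

```python
import re

text = "first line\nsecond line"  # hypothetical sample string

# Without MULTILINE, ^ only matches at the very start of the string.
caret_plain = re.search(r"^second", text)

# With MULTILINE, ^ also matches right after an embedded newline.
caret_multi = re.search(r"^second", text, re.MULTILINE)

# \A always means "start of the string", regardless of MULTILINE.
start_plain = re.search(r"\Asecond", text)
start_multi = re.search(r"\Asecond", text, re.MULTILINE)

print(caret_plain, caret_multi, start_plain, start_multi, sep="\n")
```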

03:14 Adding the DOTALL flag changes the behavior of that. I was really happy when they added the VERBOSE feature to regular expressions in Python. It makes the code far more readable. Here’s an example.
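As a brief aside before the VERBOSE example, this is a minimal sketch of the DOTALL behavior mentioned above. The sample string is made up for illustration.

```python
import re

text = "spam\neggs"  # hypothetical string with an embedded newline

# By default, the period matches any character except a newline.
dot_plain = re.search(r"spam.eggs", text)

# With DOTALL, the period matches the newline as well.
dot_all = re.search(r"spam.eggs", text, re.DOTALL)

print(dot_plain)
print(repr(dot_all.group()))
```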

03:29 I’m going to start by writing a regex on multiple lines that shows comments about the phone number it’s searching for.

03:58 Pardon my North American bias with the North American style phone number, but this regex matches a phone number with a '1', an area code, and then three digits, hyphens, and four digits separated by a whitespace.

04:10 Without the VERBOSE flag, this wouldn’t work. The additional whitespace would be considered literal characters and would be searched for. Likewise, this isn’t the format for a comment inside of a regular expression.
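The transcript doesn't include the regex typed on screen, so the pattern below is only my approximation of the one being described, and the phone number is a made-up placeholder. It shows the same text compiled with and without `re.VERBOSE`.

```python
import re

# Hypothetical pattern: an approximation of the regex described in the lesson.
PATTERN = r"""
    1        # leading '1'
    \s       # whitespace separator
    \d{3}    # area code
    \s       # whitespace separator
    \d{3}    # three digits
    -        # hyphen
    \d{4}    # four digits
"""

text = "Call 1 555 555-0123 today"  # made-up phone number

# With VERBOSE, the layout whitespace and the # comments are ignored.
verbose_match = re.compile(PATTERN, re.VERBOSE).search(text)

# Without VERBOSE, the newlines, spaces, and comments are literal characters.
plain_match = re.compile(PATTERN).search(text)

print(verbose_match.group())
print(plain_match)
```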

04:22 I do find this easier to read, though. And to use it, all you have to do is pass in the VERBOSE flag. Here I’m going to compile the expression,

04:36 and now I can apply that regular expression to a phone number. My friends in Canada will understand why I’m suddenly hungry for pizza. If pizza’s not good enough, you could go with breakfast food.

04:53 Nothing unusual about this, but watch how it breaks if I apply the VERBOSE flag.

05:06 It no longer matches. The effect of the VERBOSE flag is to make whitespace non-meaningful. That means the regex on the left-hand side, "bacon, eggs" is now equivalent to "bacon,eggs" without the space.
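The breakfast example can be reproduced roughly like this. The third search is my own addition, showing one way to ask for whitespace explicitly when VERBOSE is on.

```python
import re

text = "bacon, eggs"

# Without VERBOSE, the space in the pattern is a literal space.
plain = re.search(r"bacon, eggs", text)

# With VERBOSE, the space is ignored, so the pattern is effectively "bacon,eggs".
verbose = re.search(r"bacon, eggs", text, re.VERBOSE)

# Asking for whitespace with \s still works under VERBOSE.
explicit = re.search(r"bacon,\s eggs", text, re.VERBOSE)

print(plain)
print(verbose)
print(explicit)
```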

05:21 When that’s applied to the string on the right-hand side where there is a space, a match is no longer found. This means if you are looking for whitespace when using the VERBOSE flag, make sure to use \s and be specific about it. Now for some fun with the debug information. I’ll start with something that works.

05:47 Here’s the string I’ll be searching. Here’s a valid regular expression. I’m looking for the literal word "spam" with any character after it and repeating three times. It finds it, "spam spam spam." So far, so good.

06:04 Now let’s assume for a moment that you’ve forgotten that you’re supposed to use the number 3, but instead you just typed the word.

06:16 Of course that doesn’t work, but if you’re trying to figure out what went wrong, you can add the DEBUG flag.
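The exact string and pattern aren't in the transcript, so the ones below are assumptions based on the description: "spam" followed by any character, repeated three times. The last line compiles the mistyped pattern with `re.DEBUG`, which prints its dump to standard output.

```python
import re

text = "spam spam spam spam"  # hypothetical sample string

# The intended pattern: a group repeated exactly three times.
good = re.search(r"(spam.){3}", text)

# The mistake: {three} is not a valid repeat, so it is treated as literal text.
bad = re.search(r"(spam.){three}", text)

print(good.group() if good else None)
print(bad)

# Compiling with DEBUG prints the parsed pattern and compiled code.
re.compile(r"(spam.){three}", re.DEBUG)
```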

06:28 Wasn’t that helpful? Ha. As I mentioned before, Python doesn’t document this flag very well. It says that the flag is there, but it doesn’t describe what the actual output is.

06:40 I looked for a while and I’ve yet to find a good article on how all these pieces fit together. But if you read it, you can get a sense of what’s going on.

06:49 The SUBPATTERN match there tells you that it’s looking for four literals: 115, 112, 97, and 109.

06:57 Those are the ASCII codes for 's', 'p', 'a', 'm'. The fact that it then looks for the next set of literals tells you it isn’t actually looking for {3}, which is the mistake that I made.

07:09 It’s actually looking for those literal characters. This gives you a little bit of a hint of what’s going on. As I scroll down, you can see this information again.

07:21 What’s marked as line 17 is the hex version of 115, which is hex 73, and it actually shows you that it’s ASCII 's'. You can also see from lines 28 through 40 that it’s looking for the literals '{three}'. To see the difference, let’s debug the correct statement.

07:47 You’ll notice here that the {3} now becomes MAX_REPEAT, and it’s MAX_REPEAT with a range of 3 to 3, so it repeats exactly three times. You’ll remember that inside of curly brackets, you typically have two numbers, {m,n}.

08:02 This is essentially the equivalent of m, n with m and n equal to 3. Once you become familiar with this kind of debug output and you’re expecting to see the repeat sequence show up, when it doesn’t, it helps you debug your statement.
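If you'd rather check for the repeat programmatically than read the dump by eye, one possible approach is to capture what `re.DEBUG` prints and search it. The `debug_dump` helper is hypothetical, and the patterns are the same assumed ones as before. Since the dump format is not documented, treat this as a sketch rather than something to depend on.

```python
import contextlib
import io
import re


def debug_dump(pattern):
    """Return the text that re.DEBUG prints while compiling `pattern`."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()


good_dump = debug_dump(r"(spam.){3}")      # valid repeat
bad_dump = debug_dump(r"(spam.){three}")   # mistyped repeat

# The valid pattern's dump mentions MAX_REPEAT; the mistyped one only has literals.
print("MAX_REPEAT" in good_dump)
print("MAX_REPEAT" in bad_dump)
```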

08:19 I don’t pretend to fully understand exactly how all of this debug output works, but there are enough little hints in it that it can give you an idea of what’s going on. In the further reading section, I’ll show you a website that actually uses some of this information to help you understand how a regex is working.

08:37 That’s enough esoteric debugging for now. What if you want to combine flags?

08:49 Here, I’ve used MULTILINE, but I’m looking for "^i". I want MULTILINE and IGNORECASE. Each of the flags is a bit in an integer.

08:59 This means you can combine them using a bitwise OR. If you haven’t seen this before, essentially it’s the pipe symbol. MULTILINE | IGNORECASE, and you get both MULTILINE and IGNORECASE. With both flags set, "^i" finds the 'I' in "I didn't" after the newline.
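A small sketch of combining flags with the pipe symbol follows. The sample string is an assumption, since the transcript only mentions that it contains "I didn't" after a newline.

```python
import re

text = "You said yes.\nI didn't."  # hypothetical sample string

# MULTILINE alone: ^ matches after the newline, but "i" is still case sensitive.
multiline_only = re.findall(r"^i", text, re.MULTILINE)

# Bitwise OR sets both bits, so both behaviors apply.
combined = re.MULTILINE | re.IGNORECASE
both = re.findall(r"^i", text, combined)

print(multiline_only)
print(both)
print(bool(combined & re.IGNORECASE), bool(combined & re.MULTILINE))
```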

09:24 There’s also a mechanism for triggering flags inside of the regular expression itself.

09:36 The (?) means to turn the flags on. The i and m here are the single letter representation of the IGNORECASE and MULTILINE flags.
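Here is roughly what the inline version looks like, with the flags at the start of the pattern. The sample string is again made up.

```python
import re

text = "You said yes.\nI didn't."  # hypothetical sample string

# No flags at all: ^ only matches at the start, and "i" is case sensitive.
no_flags = re.findall(r"^i", text)

# (?im) at the start of the pattern turns on IGNORECASE and MULTILINE.
inline = re.findall(r"(?im)^i", text)

# The equivalent call using the flags parameter.
as_parameter = re.findall(r"^i", text, re.I | re.M)

print(no_flags, inline, as_parameter)
```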

09:54 You can also turn these flags on later on in the regular expression, but notice it applies to the entire regular expression. Even though the flags come after the i, it is still doing a case insensitive multiline match of 'I'. Later versions of Python deprecate this and warn you that you should only put them at the beginning, and from Python 3.11 onward a global inline flag anywhere else is an error. I would go one step further than that: except for a couple of very specific use cases, using the flags as a parameter is much better from a readability standpoint.

10:23 You can also set a flag just for the duration of a group. You do that with the colon (:).

10:33 This says that the group is case insensitive. The first three 'spam' match, but the fourth one doesn’t: inside the group, the 's' is case insensitive, so it can be capital or small, but the 'pam' that follows must be literal small letters.
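A sketch of a group-scoped flag follows. Both the pattern and the string are my guesses at what was on screen, chosen so that the first three words match and the fourth doesn't.

```python
import re

text = "spam Spam spam SPAM"  # hypothetical sample string

# (?i:...) applies IGNORECASE only inside the group, so only the "s" is
# case insensitive. The "pam" outside the group must be lowercase.
scoped = re.findall(r"(?i:s)pam", text)

print(scoped)
```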

10:49 You can also inverse this.

10:59 Here, the flag on the right-hand side, re.IGNORECASE, sets the entire regex to be IGNORECASE. The -i inside of the group turns the IGNORECASE flag off for the duration of that group.
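The inverse case can be sketched like this, with an assumed pattern and sample string. The flag is passed as a parameter and switched off inside the group.

```python
import re

text = "Spam spam SPAM"  # hypothetical sample string

# IGNORECASE applies to the whole pattern...
everything = re.findall(r"spam", text, re.IGNORECASE)

# ...but (?-i:...) turns it off for the duration of the group.
lowercase_only = re.findall(r"(?-i:spam)", text, re.IGNORECASE)

print(everything)
print(lowercase_only)
```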

11:15 This means it will only match 'spam', overriding the IGNORECASE flag. Hence, the first and last spams don’t match. Put your thinking caps on. Next lesson is about some even trickier matching tools.
