Creating Flags

00:00 In the previous lesson, I introduced you to some other methods inside of the re module. In this lesson, I’ll show you how to use flags to modify your regular expression behavior.

00:11 Most of Python’s regex methods support additional flags to control how they work. An example of this would be changing a regular expression from being case sensitive to case insensitive. You do that by setting a flag.

00:25 All these methods that you’ve seen before also support the addition of modifying flags. Most of the flags have both a short form and a long form, so you can pass it in as either re.I or re.IGNORECASE.

00:40 As you can tell from the naming of this one, when present, it changes the regular expression to IGNORECASE. You’ve seen me use the MULTILINE button inside of pythex.

00:50 What that’s actually doing is setting the MULTILINE flag inside of Python. As you may recall, this changes the behaviors of the caret (^) and dollar sign ($) anchors, allowing them to match on embedded newlines (\n), as well as the beginning and ending of the string. In normal practice, the period (.) does not match the newline character.

01:09 Adding the DOTALL flag allows it to match the newline character. The VERBOSE flag is one of my favorite features of Python regular expressions. It changes how whitespace works inside of the regex, allowing you to use multiple lines. I’ll show you an example of it later, but it makes it far easier to comment and display what the regex is doing inside of your code. The DEBUG flag is barely documented.

01:33 It’s documented as being there, but there aren’t a lot of details unless you go into the compiler about how it works. It shows debug information about how the regular expression is working.

01:47 The default character encoding in Python 2 was ASCII. That changed to Unicode in Python 3. The re module changed with it. If you don’t specify a flag, Unicode is the default encoding for regexes and the strings being searched.

02:03 You can control that by setting the ASCII flag or setting the LOCALE flag. The LOCALE flag sets it to whatever the default encoding is for the user’s current locale.
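A minimal sketch of the difference between the default Unicode behavior and the ASCII flag. The sample string and variable names are illustrative assumptions, not taken from the lesson:

```python
import re

# Hypothetical sample text with accented letters; not from the lesson.
text = "naïve café"

# Default for str patterns: \w follows Unicode rules, so accented letters count.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w only matches [a-zA-Z0-9_], so the words split at ï and é.
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)
print(ascii_words)
```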

02:17 Start with a simple search looking for the small letter "a" with one or more instances. Nothing is found. Adding the flag at the end and the behavior changes. The match is now case insensitive.

02:38 If you don’t feel like typing IGNORECASE, I is the short form.
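The search described here might look roughly like the following. The input string is an assumption, since the transcript doesn't record what was typed on screen:

```python
import re

# Hypothetical input; the lesson's actual string isn't in the transcript.
text = "AAAA"

# Lowercase "a+" against uppercase text: nothing is found.
no_flag = re.search(r"a+", text)

# The same search with the flag is case insensitive.
with_flag = re.search(r"a+", text, re.IGNORECASE)

# re.I is the short form of the same flag.
short_form = re.search(r"a+", text, re.I)

print(no_flag)
print(with_flag.group())
print(short_form.group())
```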

02:48 You’ve seen the multiline behavior inside of pythex. Here it is in Python.

02:58 And of course, the \A variant doesn’t match in multiline or non-multiline mode. This is the same as what I showed you in pythex. Without modification, the period meta-character (.) doesn’t match newlines.
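As a sketch of the anchor behavior just described, assuming a two-line sample string that is not from the lesson:

```python
import re

# Hypothetical two-line string; not from the lesson.
text = "spam\neggs"

# Without MULTILINE, ^ only matches at the very start of the string.
start_default = re.findall(r"^\w+", text)

# With MULTILINE, ^ also matches right after each embedded newline.
start_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# \A always means "start of the string", with or without the flag.
a_multiline = re.findall(r"\A\w+", text, re.MULTILINE)

print(start_default)
print(start_multiline)
print(a_multiline)
```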

03:14 Adding the DOTALL flag changes the behavior of that. I was really happy when they added the VERBOSE feature to regular expressions in Python. It makes the code far more readable. Here’s an example.
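Before the VERBOSE example, the DOTALL change mentioned at the start of this segment can be sketched as follows. The sample string is an assumption:

```python
import re

# Hypothetical string with an embedded newline; not from the lesson.
text = "spam\neggs"

# By default the period does not match "\n", so this finds nothing.
dot_default = re.search(r"spam.eggs", text)

# With DOTALL (short form re.S) the period matches the newline too.
dot_all = re.search(r"spam.eggs", text, re.DOTALL)

print(dot_default)
print(repr(dot_all.group()))
```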

03:29 I’m going to start by writing a regex on multiple lines that shows comments about the phone number it’s searching for.

03:58 Pardon my North American bias with the North American style phone number, but this regex matches a phone number with a '1', an area code, and then three digits, a hyphen, and four digits, with the parts separated by whitespace.

04:10 Without the VERBOSE flag, this wouldn’t work. The additional whitespace would be considered literal characters and would be searched for. Likewise, this isn’t the format for a comment inside of a regular expression.

04:22 I do find this easier to read, though. And to use it, all you have to do is pass in the VERBOSE flag. Here I’m going to compile the expression,

04:36 and now I can apply that regular expression to a phone number. My friends in Canada will understand why I’m suddenly hungry for pizza. If pizza’s not good enough, you could go with breakfast food.
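The on-screen regex isn't captured in the transcript, so the following is only a sketch of what a commented, multi-line phone number pattern could look like. The exact pattern, the name `phone_pattern`, and the phone number are all assumptions:

```python
import re

# Assumed reconstruction of a VERBOSE pattern; the lesson's exact regex
# is not in the transcript.
phone_pattern = re.compile(
    r"""
    1        # leading '1'
    \s       # whitespace must be spelled out in VERBOSE mode
    (\d{3})  # area code
    \s
    (\d{3})  # first three digits
    -        # literal hyphen
    (\d{4})  # last four digits
    """,
    re.VERBOSE,
)

# Hypothetical phone number for illustration only.
match = phone_pattern.search("Call 1 416 555-0123 today")
print(match.groups())
```

Without re.VERBOSE, the newlines, indentation, and "# ..." text would all be treated as literal characters to search for.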

04:53 Nothing unusual about this, but watch how it breaks if I apply the VERBOSE flag.

05:06 It no longer matches. The effect of the VERBOSE flag is to make whitespace non-meaningful. That means the regex on the left-hand side, "bacon, eggs" is now equivalent to "bacon,eggs" without the space.

05:21 When that’s applied to the string on the right-hand side where there is a space, a match is no longer found. This means if you are looking for whitespace when using the VERBOSE flag, make sure to use \s and be specific about it. Now for some fun with the debug information. I’ll start with something that works.
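Before the debug output, the VERBOSE whitespace pitfall above can be sketched like this. The variable names are illustrative:

```python
import re

text = "bacon, eggs"

# Plain regex: the space in the pattern is a literal space, so it matches.
plain = re.search(r"bacon, eggs", text)

# With VERBOSE, the unescaped space is ignored, so the pattern is
# effectively "bacon,eggs" and no longer matches the text.
verbose_broken = re.search(r"bacon, eggs", text, re.VERBOSE)

# Being explicit with \s restores the match under VERBOSE.
verbose_fixed = re.search(r"bacon, \s eggs", text, re.VERBOSE)

print(plain)
print(verbose_broken)
print(verbose_fixed)
```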

05:47 Here’s the string I’ll be searching. Here’s a valid regular expression. I’m looking for the literal word "spam" with any character after it and repeating three times. It finds it, "spam spam spam." So far, so good.

06:04 Now let’s assume for a moment that you’ve forgotten that you’re supposed to use the number 3, but instead you just typed the word.

06:16 Of course that doesn’t work, but if you’re trying to figure out what went wrong, you can add the DEBUG flag.

06:28 Wasn’t that helpful? Ha. As I mentioned before, Python doesn’t document this flag very well. It says that the flag is there, but it doesn’t describe what the actual output is.

06:40 I looked for a while and I’ve yet to find a good article on how all these pieces fit together. But if you read it, you can get a sense of what’s going on.

06:49 The SUBPATTERN match there tells you that it’s looking for four literals: 115, 112, 97, and 109.

06:57 Those are the ASCII codes for 's', 'p', 'a', 'm'. The fact that it then looks for the next set of literals tells you it isn’t actually looking for {3}, which is the mistake that I made.

07:09 It’s actually looking for those literal characters. This gives you a little bit of a hint of what’s going on. As I scroll down, you can see this information again.

07:21 What’s marked as line 17 is the hex version of 115, which is hex 73, and it actually shows you that it’s ASCII 's'. You can also see from lines 28 through 40 that it’s looking for the literals '{three}'. To see the difference, let’s debug the correct statement.

07:47 You’ll notice here that the {3} now becomes MAX_REPEAT with a range of 3 to 3, so it’s only looking for three repetitions. You’ll remember that inside of curly brackets, you typically have two numbers, m and n.

08:02 This is essentially the equivalent of m, n with m and n equal to 3. Once you become familiar with this kind of debug output and you’re expecting to see the repeat sequence show up, when it doesn’t, it helps you debug your statement.
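To reproduce this comparison, something like the following should work. The patterns and input are reconstructed from the description, so treat them as assumptions. The debug listing is printed when the pattern is compiled, and its exact layout varies between Python versions, so no expected output is shown:

```python
import re

# Hypothetical input reconstructed from the lesson's description.
text = "spam spam spam."

# Intended pattern: "spam" plus any character, repeated three times.
# In the debug listing, look for a MAX_REPEAT entry with 3 and 3.
good = re.compile(r"(spam.){3}", re.DEBUG)

# The typo: "{three}" is not a valid repeat count, so it is treated as
# the literal characters "{three}". The debug listing shows LITERAL
# entries after the group instead of MAX_REPEAT.
bad = re.compile(r"(spam.){three}", re.DEBUG)

good_match = good.search(text)
bad_match = bad.search(text)

print(good_match.group())
print(bad_match)
```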

08:19 I don’t pretend to fully understand exactly how all of this debug works, but there’s enough little hints in it that it can give you an idea of what’s going on. In the further reading section, I’ll show you a website that actually uses some of this information to help you understand how a regex is working.

08:37 That’s enough esoteric debugging for now. What if you want to combine flags?

08:49 Here, I’ve used MULTILINE, but I’m looking for "^i". I want MULTILINE and IGNORECASE. Each of the flags is a bit in an integer.

08:59 This means you can combine them using a bitwise OR. If you haven’t seen this before, essentially it’s the pipe symbol. MULTILINE | IGNORECASE, and you get both MULTILINE and IGNORECASE. The match of IGNORECASE "^i" and MULTILINE finds the 'I' in "I didn't" after the newline.
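A sketch of combining the two flags. The sample text is an assumption that loosely follows the "I didn't" line mentioned above:

```python
import re

# Hypothetical text; only the second line starts with the letter "I".
text = "Who ordered spam?\nI didn't"

# MULTILINE alone: "^i" is still case sensitive, so nothing matches.
multiline_only = re.findall(r"^i", text, re.MULTILINE)

# Bitwise OR sets both bits, so both behaviors apply.
both_flags = re.MULTILINE | re.IGNORECASE
combined = re.findall(r"^i", text, both_flags)

print(multiline_only)
print(combined)
print(int(both_flags))  # the combined flags are just an integer
```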

09:24 There’s also a mechanism for triggering flags inside of the regular expression itself.

09:36 The (?...) syntax means to turn the flags on. The i and m here are the single letter representation of the IGNORECASE and MULTILINE flags.

09:54 You can also turn these flags on later on in the regular expression, but notice it applies to the entire regular expression. Even though the flags are on after the i, it is still doing a case insensitive multiline version of 'I'. Later versions of Python deprecate this and warn you that you should only put them at the beginning. I would go one step further than that, and except for a couple of very specific use cases, from a readability standpoint using the flags as a parameter is much better.
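A minimal sketch of the inline form, with the flags placed at the very start of the pattern. The sample text is an assumption:

```python
import re

# Hypothetical text; only the second line starts with the letter "I".
text = "Who ordered spam?\nI didn't"

# (?im) at the start turns on IGNORECASE and MULTILINE for the whole regex.
inline = re.findall(r"(?im)^i", text)

# Equivalent call with the flags passed as a parameter instead.
as_parameter = re.findall(r"^i", text, re.IGNORECASE | re.MULTILINE)

print(inline)
print(as_parameter)
```

Keeping global inline flags at the start matters in practice: recent Python versions (3.11 and later) raise an error if they appear anywhere else in the pattern.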

10:23 You can also set a flag just for the duration of a group. You do that with the colon (:).

10:33 This says that the group is case insensitive. The first three 'spam' match, but the fourth one doesn’t because it can be capital or small 's'—case insensitive—inside the group, but then literal small 'pam'.

10:49 You can also inverse this.

10:59 Here, the flag on the right-hand side, re.IGNORECASE, sets the entire regex to be IGNORECASE. The -i inside of the group turns the IGNORECASE flag off for the duration of that group.

11:15 This means it will only match 'spam', overriding the IGNORECASE flag. Hence, the first and last spams don’t match. Put your thinking caps on. Next lesson is about some even trickier matching tools.
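The two group-scoped forms from the end of this lesson can be sketched as follows, using the same search string that appears in the discussion below. The variable names are illustrative:

```python
import re

text = "Spam, spam, spam, SPAM"

# (?i:s) makes only the "s" case insensitive; "pam" must be lowercase,
# so the final "SPAM" is not matched.
scoped_on = re.findall(r"(?i:s)pam", text)

# IGNORECASE applies to the whole regex, but (?-i:s) switches it off
# inside the group: "s" must be lowercase, "pam" can be any case.
scoped_off = re.findall(r"(?-i:s)pam", text, re.IGNORECASE)

print(scoped_on)
print(scoped_off)
```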

raulfz on Jan. 30, 2021

Hi, again thank you for your awesome tutorial. Could you please help me with this? I don’t know what I’m doing wrong. I tried to capture ‘this’ in the multiline string below with the re.finditer and the re.findall methods. Even though both methods find the word ‘this’, they gave different results. Why is this? I find the re.finditer method more informative because it gives me the span, but it is not getting all of the matches in this case.

Also, I would like to know if it is possible to pass more than one flag as a parameter to the re methods. I mean, if the first word in the multiline string changes to “This”, I would like to pass re.I and re.M (both flags) to the re methods.

multiline = '''
this is a really long string
the long string is long enough to prove that
this string in
this paper, work as expected'''

find = re.compile('this')
matches = find.finditer(multiline, re.M)
re.finditer(multiline, r'this', re.M)
for match in matches:
    print(match)

OUT>>> <re.Match object; span=(75,79), match='this'>
       <re.Match object; span=(75,79), match='this'>

re.findall('this', multiline, re.M)
OUT>>> ['this', 'this', 'this']

Thank you in advance.

raulfz on Jan. 30, 2021

Sorry, when I posted the previous message I hadn’t finished the video, so now I know how to pass multiple flags, and even better, how to apply them to specific parts of my search.

Thank you!

raulfz on Feb. 1, 2021

Just in case someone runs into the same issue I reported before, this is the problem. When I used the .finditer() method, I first used the re.compile() method. It’s in this re.compile() call that I should have included the flags in order for it to work the same as with the findall() method.
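One likely explanation for the missing match, sketched below: on a compiled pattern, the second positional argument of finditer() is a start position (pos), not a flags value. Passing re.M there is read as an integer offset, so the search begins partway into the string and skips the first ‘this’. The variable names other than multiline and find are illustrative:

```python
import re

multiline = '''
this is a really long string
the long string is long enough to prove that
this string in
this paper, work as expected'''

find = re.compile('this')

# Here re.M is interpreted as `pos`, so the search starts at that integer
# offset and the first 'this' (at index 1) is never examined.
skipped = [m.span() for m in find.finditer(multiline, re.M)]

# Without the stray argument, all three occurrences are found.
all_spans = [m.span() for m in find.finditer(multiline)]

# Flags for a compiled pattern belong in re.compile().
find_ci = re.compile('this', re.IGNORECASE | re.MULTILINE)
ci_matches = find_ci.findall(multiline.replace('this', 'This', 1))

print(skipped)
print(all_spans)
print(ci_matches)
```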

Roy Telles on March 16, 2021

I still don’t quite understand why the fourth “SPAM” at 10:49 didn’t match. The video and transcript say it is because the “s” can be lowercase or uppercase, but then it goes on to say “literal small ‘p-a-m’” but I think it meant to say “capital ‘P-A-M’”. Considering the group contains only s’s it makes sense that the fourth doesn’t match because it would still be “sPAM” which doesn’t match ‘(s)pam’ in the regex. Thank you!

Christopher Trudeau RP Team on March 16, 2021

Hi Roy,

I believe you are talking about this snippet:

>>> re.findall("(?i:s)pam", "Spam, spam, spam, SPAM")
['Spam', 'spam', 'spam']

When I said “literal p-a-m”, I meant the lowercase literal characters in the regex, not in the result. The “(?i:s)” portion of the regex makes the “s” case insensitive, but the flag only applies inside the parentheses. This means “Spam” and “spam” match because the “pam” is lowercase in those cases. The fourth “SPAM” doesn’t match because the “PAM” portion of the string doesn’t match the “pam” literal in the regex.

If you want to ignore case in the whole regex you are better off using “re.IGNORECASE” on the whole thing rather than the “(?i:)” group flag. If you did that then it would match the fourth “SPAM”.

>>> re.findall("spam", "Spam, spam, spam, SPAM", re.IGNORECASE)
['Spam', 'spam', 'spam', 'SPAM']

Note the difference between that and the piece of code that follows immediately after in the video where I’m using both the “(?-i:)” flag and the “re.IGNORECASE” global flag, where you get a different result.

>>> re.findall("(?-i:s)pam", "Spam, spam, spam, SPAM", re.IGNORECASE)
['spam', 'spam']

Hope that clears things up for you.
