Regex Quantifiers

Christopher Trudeau

Regular Expressions and Building Regexes in Python Christopher Trudeau 08:50


00:00 In the previous lesson, I showed you how to use anchors to change where the matches happen inside of a string. In this lesson, I’m going to be talking about quantifiers: how to do repetition inside of your pattern matches.

00:13 First, a little review. The first anchor I showed you was the caret symbol (^). ^The matches 'The' at the beginning of this string.

00:23 Turning on multiline mode, and 'The' gets matched at the beginning of each of the sentences in the string. Replacing ^ with \A, even in multiline mode, only matches the beginning of the string.

00:39 Now to the other end—dollar sign ($) matches the end. Like the ^ in multiline mode, $ matches the end of a line. \Z is the equivalent of \A, matching the end of the string even in multiline mode.

01:00 In addition to the beginning and end of strings, you can match word boundaries. \b is the word boundary anchor. The pattern car\b, which is the literal car followed by \b, matches 'car' but not 'carpet', because 'car' ends on a word boundary.

01:17 You can invert that behavior with \B. Now car\B matches 'carpet' because it doesn’t end on a word boundary, and it no longer matches 'car', which does.
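The anchor review above can be sketched with Python's re module. The sample string here is made up for illustration, since the lesson's on-screen text isn't part of the transcript.

```python
import re

# Hypothetical sample text; not the text shown in the lesson.
text = "The car stops\nThe carpet stays"

# ^ matches only at the start of the string unless re.MULTILINE is set.
caret_default = re.findall(r"^The", text)                  # ['The']
caret_multiline = re.findall(r"^The", text, re.MULTILINE)  # ['The', 'The']
# \A ignores multiline mode and only matches at the very start.
start_only = re.findall(r"\AThe", text, re.MULTILINE)      # ['The']

# $ and \Z mirror ^ and \A at the other end.
dollar_multiline = re.findall(r"\w+$", text, re.MULTILINE)  # ['stops', 'stays']
end_only = re.findall(r"\w+\Z", text, re.MULTILINE)         # ['stays']

# \b requires a word boundary after 'car'; \B requires that there is none.
boundary_starts = [m.start() for m in re.finditer(r"car\b", text)]     # [4]
no_boundary_starts = [m.start() for m in re.finditer(r"car\B", text)]  # [18]
```

Index 4 is the standalone 'car' and index 18 is the 'car' inside 'carpet'.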

01:32 You can look for repetitions of patterns inside of your matches using a quantifier. There are four kinds of quantifiers. Star (*) means zero or more matches. Plus (+) means one or more matches.

01:47 Question mark (?) means zero or one matches. And curly braces ({}) indicate some number of matches between m and n, including m and n.

02:02 To start out, let’s look at the plus symbol (+). 39+0 looks for repetitions of the number 9. So literal 3, 9 one or more times, and then 0.

02:17 This matches '3990'. Changing the + to a *, and the number of matches changes. This says 3, 9 zero or more times, and then the number 0. Changing the * to a ?, now you’re looking for zero or one matches.

02:42 So only the serial number matches. The model number has two '9's in it, so it no longer qualifies. The ? quantifier allows at most one 9, so two 9s in a row don't match.
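Here is a small sketch of the three quantifiers side by side in Python. The sample string is an assumption made for illustration; only the patterns 39+0, 39*0, and 39?0 come from the lesson.

```python
import re

# Hypothetical sample text; not the text shown in the lesson.
text = "model 3990, serial 390, code 30"

# + : a 3, then one or more 9s, then a 0.
one_or_more = re.findall(r"39+0", text)   # ['3990', '390']
# * : a 3, then zero or more 9s, then a 0.
zero_or_more = re.findall(r"39*0", text)  # ['3990', '390', '30']
# ? : a 3, then at most one 9, then a 0.
zero_or_one = re.findall(r"39?0", text)   # ['390', '30']

print(one_or_more, zero_or_more, zero_or_one)
```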

02:55 Quantifiers can also apply to meta-characters. This regular expression is the literal S/N: followed by the digit meta-character (\d), one or more times, then a hyphen (-), and then a word character (\w).

03:11 This can be used to match the serial number inside of the text.
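A minimal sketch of that serial number pattern in Python follows. The sample string is hypothetical, and the space after the colon is an assumption, since the exact pattern on screen isn't in the transcript.

```python
import re

# Hypothetical sample text; the serial number format is assumed.
text = "Order 1095 shipped. S/N: 8675309-B, model 3990."

# Literal 'S/N: ', one or more digits, a hyphen, then one word character.
match = re.search(r"S/N: \d+-\w", text)
serial = match.group() if match else None

print(serial)  # S/N: 8675309-B
```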

03:17 You can also apply quantifiers to character classes. \b is the word boundary anchor, so this regex is looking for something separated by word boundaries—what we would consider a word.

03:31 The first character class is just the vowels, indicating that this word begins with a vowel, and then zero or more small alphabetic characters.

03:42 This matches all of the words in the text that begin with a vowel.
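The vowel-word pattern could be tried in Python like this. The pattern \b[aeiou][a-z]*\b is the one quoted in the discussion below the transcript; the sample sentence is made up.

```python
import re

# Hypothetical sample text; not the text shown in the lesson.
text = "an old car is under the elm tree"

# Word boundary, one lowercase vowel, zero or more lowercase letters, word boundary.
vowel_words = re.findall(r"\b[aeiou][a-z]*\b", text)

print(vowel_words)  # ['an', 'old', 'is', 'under', 'elm']
```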

03:50 When dealing with quantifiers, you need to understand the concept of greediness. This is <, some character zero or more times, and then >. This is creating one match, starting with <, some number of characters, and then >. The two > characters buried inside of it are eaten by the quantifier because it’s in greedy mode.

04:17 It will consume as many characters as it can inside of the regular expression. You can modify the quantifier by changing it to non-greedy mode. Adding a ? after the quantifier changes it to be non-greedy.

04:35 Now you’re seeing three different matches, each starting and ending with angle brackets. It’s unfortunate that they chose ? as the way of turning something from greedy to non-greedy. ? on its own is also a quantifier.
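To see the difference in Python, here is a short sketch. It assumes the patterns are <.*> and <.*?>, and it uses a made-up string with three bracketed items.

```python
import re

# Hypothetical sample text with three angle-bracket groups.
text = "<one> <two> <three>"

# Greedy: .* runs to the last '>' it can reach, so there is one long match.
greedy = re.findall(r"<.*>", text)
# Non-greedy: .*? stops at the first '>' it can reach, so there are three matches.
non_greedy = re.findall(r"<.*?>", text)

print(greedy)      # ['<one> <two> <three>']
print(non_greedy)  # ['<one>', '<two>', '<three>']
```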

04:51 This makes regular expressions a little difficult to read, because the context of the character changes what it means. Here’s another example, this time using the + quantifier.

05:02 Remember, that means one or more. This is looking for a capital A and a small a repeated one or more times. This matches the 'Aaaaaaaaaa' in 'Aaaaaaaaaah'.

05:16 Because it’s in greedy mode, it absorbs all of the letters. Adding the ?, and it takes the minimum number to make this expression valid. Because the + means one or more, the minimum is one, so now just capital 'A' and small 'a' matches.
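The same comparison for + can be sketched in Python, assuming the patterns are Aa+ and Aa+?. The string is built programmatically so the count of nine small a's is explicit.

```python
import re

# 'A', nine 'a's, then 'h', as in 'Aaaaaaaaaah'.
text = "A" + "a" * 9 + "h"

# Greedy +: takes every 'a' available.
greedy = re.search(r"Aa+", text).group()
# Non-greedy +?: takes the minimum, which for + is a single 'a'.
non_greedy = re.search(r"Aa+?", text).group()

print(len(greedy), non_greedy)  # 10 Aa
```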

05:39 This concept can be a little confusing when dealing with the ? quantifier.

05:46 This is the capital A followed by the small a zero or one time. This is matching the 'A' in 'ASCII' and 'ACME' and 'After', but it’s still capturing the first two letters of 'Aaaaaaaaaah'.

05:59 Changing this ? to non-greedy mode, and notice the difference in 'Aaaaaaaaaah'. Now only the capital 'A' is matching. The least greedy version of zero or one characters is zero characters, so only the capital 'A' is highlighted.

06:17 This is referred to as a zero-length match.
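A quick Python sketch of ? next to its non-greedy form ??, assuming the patterns are Aa? and Aa??. The sample words come from the lesson; how they are joined into one string is an assumption.

```python
import re

# Words mentioned in the lesson, joined with spaces for illustration.
text = "ASCII ACME After " + "A" + "a" * 9 + "h"

# Greedy ?: takes the optional 'a' when one is there.
greedy = re.findall(r"Aa?", text)
# Non-greedy ??: prefers zero repetitions, so the 'a' is never taken here.
non_greedy = re.findall(r"Aa??", text)

print(greedy)      # ['A', 'A', 'A', 'Aa']
print(non_greedy)  # ['A', 'A', 'A', 'A']
```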

06:22 The other kind of quantifier indicates the number of repetitions. This is done with curly brackets. This is looking for whitespace meta-character (\s), a digit (\d), and the digit being repeated 7 times.

06:35 This matches the beginning of the serial number. You can also specify a range of matches.

06:45 This regular expression now looks for a space followed by anywhere from 2 to 7 digits. This matches ' 1095', ' 18', ' 1949', as well as the serial number before. A range like this is inclusive: the 7 isn’t an exclusive upper limit, it’s the largest number of repetitions allowed.
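The exact count and the inclusive range can be checked in Python with a sketch like the one below. The sample string is hypothetical; it reuses the numbers mentioned in the lesson plus an assumed serial number.

```python
import re

# Hypothetical sample text; the serial number value is made up.
text = "Order 1095 of 18 since 1949, row 7. S/N: 8675309-B"

# Exactly seven digits after a whitespace character.
exactly_seven = re.findall(r"\s\d{7}", text)
# Two to seven digits, both ends included. ' 7' has one digit, so it is skipped.
two_to_seven = re.findall(r"\s\d{2,7}", text)

print(exactly_seven)  # [' 8675309']
print(two_to_seven)   # [' 1095', ' 18', ' 1949', ' 8675309']
```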

07:07 This is a little counter-intuitive for people who’ve spent a lot of time programming, where normally these kinds of comparisons are less than, rather than less than or equal to.

07:16 It takes a little getting used to. The second value doesn’t have to be specified. This version says “Look for 4 or more digits.”

07:29 The first value also doesn’t have to be specified. This says “From zero up to and including 5 matches.” The whitespace all the way through the text is being highlighted because whitespace followed by zero digits is matching this expression.

07:45 Not only can you remove the bottom end of the range and the top end of the range—you can remove the entire range. This is equivalent to the asterisk quantifier (*).

07:59 If you’re looking to match actual curly brackets, don’t put in a number. This will match literal curly brackets. Because there’s no number and no comma inside of it, it treats them as if they are normal characters.
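The open-ended forms of the curly bracket quantifier can be sketched as follows in Python's re module. The sample strings are made up, and the behavior shown is that of Python specifically; other regex engines may treat {,n}, {,}, and {} differently.

```python
import re

# Hypothetical sample text.
text = "id 42 and 123456 {}"

# {4,} : four or more digits.
four_or_more = re.findall(r"\d{4,}", text)
# {,5} : zero up to and including five digits, so bare whitespace matches too.
up_to_five = re.findall(r"\s\d{,5}", text)
# {,} : no bounds at all, which behaves like *.
no_bounds = re.findall(r"x\d{,}", "x x1 x123")
star = re.findall(r"x\d*", "x x1 x123")
# {} : with no number and no comma, the brackets are literal characters.
literal_braces = re.findall(r"a{}", "a{} aaa")

print(four_or_more)    # ['123456']
print(up_to_five)      # [' 42', ' ', ' 12345', ' ']
print(no_bounds)       # ['x', 'x1', 'x123']
print(literal_braces)  # ['a{}']
```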

08:14 Like other quantifiers, the range match is also a greedy match. The 'aaaaaaaaa' in 'Aaaaaaaaaah' is matching all nine of the a’s.

08:26 You can modify the match by applying the question mark (?). Now it’s only matching the first six a’s—the least greedy version of this expression is matching 6 characters.
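Here is a sketch of the greedy and non-greedy range match in Python. The transcript doesn't state the pattern, so a{6,} and a{6,}? are assumptions chosen to reproduce the nine-versus-six result described above.

```python
import re

# 'A', nine 'a's, then 'h', as in 'Aaaaaaaaaah'.
text = "A" + "a" * 9 + "h"

# Greedy: six or more 'a's, taking as many as possible.
greedy = re.search(r"a{6,}", text).group()
# Non-greedy: six or more 'a's, taking as few as possible.
non_greedy = re.search(r"a{6,}?", text).group()

print(len(greedy), len(non_greedy))  # 9 6
```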

08:42 Next up, I’ll show you how to group parts of expressions and examine matches on subsections of a string.

shoebptl on Aug. 31, 2023

Does \b include only spaces? Because \b[aeiou][a-z]*\b highlights acme and example in support@acme.example.com. I thought \b meant words with a space in front of or after them. It also highlights ar in {ar}, and there we have {}, not a space.

Martin Breuss RP Team on Aug. 31, 2023

@shoebptl \b matches at the boundary between a word character (alphanumeric or underscore) and any character that isn't one.

That means whitespace characters (which you expected) but also all the others that you mentioned (@, ., }) and many more!

Try to play around with it in an online regex playground, e.g. Regex 101.

Which ones get matched when you put car\b as the pattern and then enter the following as your text:

car carpet {car} car@car.com carriage car_door
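If you'd rather check the result in Python than in an online playground, a minimal sketch using the pattern and text above could look like this:

```python
import re

text = "car carpet {car} car@car.com carriage car_door"

# Start index of every place where 'car' is followed by a word boundary.
starts = [m.start() for m in re.finditer(r"car\b", text)]

print(starts)  # [0, 12, 17, 21]
```

The four matches are the standalone 'car', the one inside the curly brackets, and the two around the @ sign. 'car_door' doesn't match because the underscore counts as a word character.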
