Regex Quantifiers
00:00 In the previous lesson, I showed you how to use anchors to change where the matches happen inside of a string. In this lesson, I’m going to be talking about quantifiers: how to do repetition inside of your pattern matches.
00:13 First, a little review. The first anchor I showed you was the caret symbol (^). ^The matches 'The' at the beginning of this string.
00:23 With multiline mode turned on, 'The' gets matched at the beginning of each line in the string. Replacing ^ with \A, even in multiline mode, only matches the beginning of the string.
00:39 Now to the other end: the dollar sign ($) matches the end. Like the ^ in multiline mode, $ matches the end of a line. \Z is the equivalent of \A, matching the end of the string even in multiline mode.
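As a quick recap in code, here is a minimal sketch using Python's re module. The sample text is an assumption, since the string shown on screen isn't part of this transcript.

```python
import re

# Hypothetical three-line sample; not the text used in the lesson.
text = "The cat sat\nThe dog ran\nThe end"

caret_default = re.findall(r"^The", text)          # start of the string only
caret_multiline = re.findall(r"^The", text, re.M)  # start of every line
string_start = re.findall(r"\AThe", text, re.M)    # \A ignores multiline mode

dollar_multiline = re.findall(r"\w+$", text, re.M)  # last word of every line
string_end = re.findall(r"\w+\Z", text, re.M)       # last word of the string

print(len(caret_default), len(caret_multiline), len(string_start))
print(dollar_multiline, string_end)
```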
01:00 In addition to the beginning and end of strings, you can match word boundaries. \b is the word boundary anchor. The literal car followed by \b matches 'car' but not 'carpet', because 'car' ends on a word boundary.
01:17 You can invert that behavior with \B. Now car\B matches the 'car' in 'carpet' because it doesn’t end on a word boundary, and it no longer matches the standalone 'car', which does.
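Here is a small sketch of the two anchors side by side. The sentence is made up for illustration; the match positions show which word each pattern hits.

```python
import re

# Hypothetical sentence containing both 'car' and 'carpet'.
text = "The car is parked on the carpet"

# car\b: 'car' must be followed by a word boundary (the standalone word).
boundary_starts = [m.start() for m in re.finditer(r"car\b", text)]

# car\B: 'car' must NOT be followed by a word boundary (inside 'carpet').
non_boundary_starts = [m.start() for m in re.finditer(r"car\B", text)]

print(boundary_starts, text.index("car "))
print(non_boundary_starts, text.index("carpet"))
```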
01:32 You can look for repetitions of patterns inside of your matches using a quantifier. There are four kinds of quantifiers. Star (*) means zero or more matches. Plus (+) means one or more matches.
01:47 Question mark (?) means zero or one matches. And curly braces ({}) indicate some number of matches between m and n, including m and n.
02:02 To start out, let’s look at the plus symbol (+). 39+0 looks for repetitions of the number 9. So literal 3, 9 one or more times, and then 0.
02:17 This matches '3990'. Change the + to a *, and the number of matches changes. This says 3, 9 zero or more times, and then the number 0. Change the * to a ?, and now you’re looking for zero or one matches.
02:42 So only the serial number matches. The model number has two '9's in it, so it no longer qualifies. This quantifier doesn’t include two 9s in a row.
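To compare the three quantifiers directly, here is a sketch that runs each pattern against the same string. The model, serial, and part numbers are invented for this example and are not the ones shown on screen.

```python
import re

# Hypothetical text with zero, one, and two 9s between a 3 and a 0.
text = "Model 3990, S/N 30-X, Part 390"

one_or_more = re.findall(r"39+0", text)   # at least one 9
zero_or_more = re.findall(r"39*0", text)  # any number of 9s, including none
zero_or_one = re.findall(r"39?0", text)   # at most one 9

print(one_or_more)
print(zero_or_more)
print(zero_or_one)
```

Note that 39?0 skips '3990': after the optional 9 the pattern needs a 0 right away, and the second 9 gets in the way.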
02:55 Quantifiers can also apply to meta-characters. This regular expression is the literal S/N: followed by the digit meta-character (\d), one or more times, then a hyphen (-), and then a word character (\w).
03:11 This can be used to match the serial number inside of the text.
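As a sketch, the pattern described above can be written as S/N:\d+-\w. The sample text and serial number below are assumptions made for illustration.

```python
import re

# Hypothetical text; the serial number format is made up.
text = "Model:3990 S/N:1234567-A shipped"

# Literal 'S/N:', one or more digits, a hyphen, then one word character.
match = re.search(r"S/N:\d+-\w", text)
serial = match.group() if match else None

print(serial)
```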
03:17 You can also apply quantifiers to character classes. \b is the word boundary anchor, so this regex is looking for something separated by word boundaries, which is what we would consider a word.
03:31 The first character class is just the vowels, indicating that this word begins with a vowel, and then zero or more small alphabetic characters.
03:42 This matches all of the words in the text that begin with a vowel.
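A pattern along those lines might look like \b[aeiou][a-z]*\b. This is a reconstruction from the description, and the sentence below is a made-up sample.

```python
import re

# Hypothetical all-lowercase sentence.
text = "an apple is under the old oak tree"

# Word boundary, one vowel, zero or more lowercase letters, word boundary.
vowel_words = re.findall(r"\b[aeiou][a-z]*\b", text)

print(vowel_words)
```

Because the class only lists lowercase vowels, a capitalized word such as 'Apple' would need [AEIOUaeiou] or the re.IGNORECASE flag.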
03:50 When dealing with quantifiers, you need to understand the concept of greediness. This is <, some character zero or more times, and then >. This is creating one match, starting with <, some number of characters, and then >. The two > characters buried inside of it are eaten by the quantifier because it’s in greedy mode.
04:17 It will consume as many characters as it can inside of the regular expression. You can modify the quantifier by changing it to non-greedy mode. Adding a ? after the quantifier changes it to be non-greedy.
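Here is a minimal sketch of the two modes, assuming the patterns are <.*> and <.*?> and using a made-up string with three bracketed chunks.

```python
import re

# Hypothetical text with three angle-bracket pairs.
text = "<one> and <two> and <three>"

greedy = re.findall(r"<.*>", text)  # .* runs to the last '>' it can reach
lazy = re.findall(r"<.*?>", text)   # .*? stops at the first '>' it finds

print(greedy)
print(lazy)
```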
04:35 Now you’re seeing three different matches, each starting and ending with angle brackets. It’s unfortunate that they chose ? as the way of turning something from greedy to non-greedy. ? on its own is also a quantifier.
04:51 This makes regular expressions a little difficult to read, because the context of the character changes what it means. Here’s another example, this time using the + quantifier.
05:02 Remember, that means one or more. This is looking for a capital A and a small a repeated one or more times. This matches the 'Aaaaaaaaaa' in 'Aaaaaaaaaah'.
05:16 Because it’s in greedy mode, it absorbs all of the letters. Add the ?, and it takes the minimum number needed to make this expression valid. Because the + means one or more, the minimum is one, so now just capital 'A' and one small 'a' match.
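In code, the difference looks roughly like this. The word is built programmatically, assuming one capital A, nine small a's, and an h.

```python
import re

# Assumed shape of the word from the lesson: 'A', nine 'a's, then 'h'.
word = "A" + "a" * 9 + "h"

greedy = re.search(r"Aa+", word).group()  # takes every 'a' available
lazy = re.search(r"Aa+?", word).group()   # takes the minimum: one 'a'

print(greedy, len(greedy))
print(lazy, len(lazy))
```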
05:39 This concept can be a little confusing when dealing with the ? quantifier.
05:46 This is the capital A followed by the small a zero or one times. This is matching the 'A' in 'ASCII' and 'ACME' and 'After', but it’s still capturing the first two letters of 'Aaaaaaaaaah'.
05:59 Change this ? to non-greedy mode, and notice the difference in 'Aaaaaaaaaah'. Now only the capital 'A' is matching. The least greedy version of zero or one characters is zero characters, so only the capital 'A' is highlighted.
06:17 This is referred to as a zero-length match.
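Here is a sketch of Aa? next to Aa??. The word list is an assumption based on the words mentioned above.

```python
import re

# Hypothetical text using the words named in the lesson.
text = "ASCII ACME After " + "A" + "a" * 9 + "h"

greedy = re.findall(r"Aa?", text)  # optional 'a' is taken when present
lazy = re.findall(r"Aa??", text)   # optional 'a' is skipped when possible

print(greedy)
print(lazy)
```

In the lazy version the a?? part matches zero characters every time, so each result is just the capital 'A'.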
06:22 The other kind of quantifier indicates the number of repetitions. This is done with curly brackets. This is looking for the whitespace meta-character (\s), then a digit (\d), with the digit being repeated 7 times.
06:35 This matches the beginning of the serial number. You can also specify a range of matches.
06:45 This regular expression now looks for a space followed by anywhere from 2 to 7 digits. This matches ' 1095', ' 18', ' 1949', as well as the serial number before. A range like this is inclusive: the 7 isn’t an exclusive upper limit, but the largest number of repetitions allowed.
07:07 This is a little counter-intuitive for people who’ve spent a lot of time programming, where normally these kinds of comparisons are less than, rather than less than or equal to.
07:16 It takes a little getting used to. The second value doesn’t have to be specified. This version says “Look for 4 or more digits.”
07:29 The first value also doesn’t have to be specified. This says “From zero up to and including 5 matches.” The whitespace all the way through the text is being highlighted because whitespace followed by zero digits is matching this expression.
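The four curly-bracket forms can be compared on one string. The text below is invented, loosely following the numbers mentioned above, and the patterns are reconstructions of what is described.

```python
import re

# Hypothetical text; the numbers and serial are made up for illustration.
text = "built 1095 sold 18 in 1949 S/N: 1234567-A"

exactly_seven = re.findall(r"\s\d{7}", text)   # whitespace + exactly 7 digits
two_to_seven = re.findall(r"\s\d{2,7}", text)  # 2 through 7 digits, inclusive
four_or_more = re.findall(r"\s\d{4,}", text)   # no upper bound
up_to_five = re.findall(r"\s\d{,5}", text)     # lower bound defaults to zero

print(exactly_seven)
print(two_to_seven)
print(four_or_more)
print(up_to_five)
```

With {,5}, every whitespace character matches, even when no digit follows, and the 7-digit serial is cut off after 5 digits.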
07:45 Not only can you remove the bottom end of the range and the top end of the range, you can also remove the entire range. This is equivalent to the asterisk quantifier (*).
07:59 If you’re looking to match actual curly brackets, don’t put in a number. This will match literal curly brackets. Because there’s no number and no comma inside of it, it treats them as if they are normal characters.
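Both behaviors can be checked in Python's re module with a short sketch. The sample strings are assumptions chosen for illustration.

```python
import re

# Hypothetical text for comparing {,} with *.
text = "code 42 and 1234567"

open_range = re.findall(r"\s\d{,}", text)  # no bounds given: zero or more
star = re.findall(r"\s\d*", text)          # the asterisk equivalent

# Empty braces carry no number or comma, so they match literally.
literal_braces = re.findall(r"set{}", "set{} versus sett")

print(open_range == star, open_range)
print(literal_braces)
```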
08:14 Like other quantifiers, the range match is also a greedy match. The 'aaaaaaaaa' in 'Aaaaaaaaaah' is matching all nine of the a’s.
08:26 You can modify the match by applying the question mark (?). Now it’s only matching the first six a’s, because the least greedy version of this expression is matching 6 characters.
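The exact pattern isn't in the transcript, so this sketch assumes a{6,} as the range with a minimum of six, applied to a word with nine small a's.

```python
import re

# Assumed shape of the word: 'A', nine 'a's, then 'h'.
word = "A" + "a" * 9 + "h"

# a{6,} is an assumed pattern; the lesson's actual range may differ.
greedy = re.search(r"a{6,}", word).group()  # takes all nine
lazy = re.search(r"a{6,}?", word).group()   # stops at the minimum of six

print(len(greedy), len(lazy))
```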
08:42 Next up, I’ll show you how to group parts of expressions and examine matches on subsections of a string.