Meta-Characters

00:00 In the previous lesson, I introduced you to the simple, plain matching regular expression and then added some complexity with ranges of characters and class matching.

00:10 In this lesson, I’ll be talking about meta-characters, special characters that represent things like whitespace in a regular expression. First off, a little bit of review.

00:20 A plain string match is the text you’re looking for. In the sentence 'I put the thing in the place', 'thing' is highlighted because the regex is thing. Do note that it’s only those letters—not the surrounding whitespace.

00:34 Similarly, for the pattern spam, the sentence "Well, there's spam egg sausage and spam, that's not got much spam in it" has three matches: each of the instances of 'spam'. Character ranges, or classes, allow you to specify a match over a range of characters. A common use for this is to look for numbers. The square brackets indicate a range. In this case there are two ranges in a row, both 0 to 9, which match the number '42' in the text on the right-hand side. You can match with letters as well.

01:06 Inside of the square brackets, I’m looking for lowercase [a-z] or uppercase [A-Z], so essentially any letter, followed by the plain text ar. So 'far', 'bar', and the 'par' part of 'subpar' all match.
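To make the review concrete, here is a minimal sketch of the same patterns using Python's re module. The first two sentences come from the review above; the sentences used for the digit and letter patterns are made up for illustration, since the slides aren't reproduced in this transcript.

```python
import re

# Sentences quoted in the review.
sentence = "I put the thing in the place"
spam_text = "Well, there's spam egg sausage and spam, that's not got much spam in it"

# A plain pattern matches only those exact letters, not the surrounding spaces.
thing_matches = re.findall(r"thing", sentence)
spam_matches = re.findall(r"spam", spam_text)

# Two digit ranges in a row match a two-digit number.
# (Illustrative sentence, not from the lesson.)
digit_matches = re.findall(r"[0-9][0-9]", "The answer is 42")

# Any letter, lowercase or uppercase, followed by the plain text "ar".
# (Illustrative sentence, not from the lesson.)
ar_matches = re.findall(r"[a-zA-Z]ar", "So far the bar is subpar")

print(thing_matches)
print(spam_matches)
print(digit_matches)
print(ar_matches)
```

re.findall() returns every non-overlapping match as a list of strings, which is roughly what pythex highlights.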

01:23 Regexes support something called meta-characters, which stand for whole groups of characters such as digits, whitespace, or word characters. Oftentimes, these are shortcuts for something you would otherwise have to specify with a lengthy square bracket class.

01:38 Here I am back inside of pythex using the same text to search through as before—the Wile E. Coyote email message. I’ll start with a character class. This is any vowel.

01:51 And now I’m going to introduce the first meta-character: period (.). Period means any character, except for a newline (\n).

02:01 So now, with the character class vowel, period, character class, you will see matches where it is a vowel, some other character, and then another vowel. If you look at the matches, this includes things like 'a<a', 'exa', 'ilu', 'ase', et cetera. Because period (.) is a special character, if you actually want to match one, you have to escape it.
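A rough sketch of the vowel, period, vowel pattern in Python follows. The sample text is an assumption made up for this example, so the matches differ from the ones in the Wile E. Coyote email.

```python
import re

# Hypothetical sample text, not the email from the video.
text = "please examine the value"

# A vowel, then any single character (except newline), then another vowel.
matches = re.findall(r"[aeiou].[aeiou]", text)

print(matches)  # ['ase', 'exa', 'ine', 'alu']
```

Note that the matches do not overlap: once a character is part of one match, the search resumes after it.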

02:29 This is similar to the square brackets that I showed you in the last lesson.

02:34 Here’s a letter followed by a period followed by another letter. This matches parts of the email addresses in the To and From headers.
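The difference between a bare period and an escaped one can be sketched like this. The strings and the email address are invented for illustration.

```python
import re

# Unescaped: the period matches any character, so both "a-b" and "a.b" match.
dot_any = re.findall(r"a.b", "a-b a.b")

# Escaped: only a literal period matches.
dot_literal = re.findall(r"a\.b", "a-b a.b")

# A lowercase letter, a literal period, and another lowercase letter,
# applied to a hypothetical email header.
email_matches = re.findall(r"[a-z]\.[a-z]", "From: orders@acme.com")

print(dot_any)        # ['a-b', 'a.b']
print(dot_literal)    # ['a.b']
print(email_matches)  # ['e.c']
```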

02:45 Here I’m matching quote (") followed by any character. If you look carefully, one of them is missing. This quote isn’t highlighted, and that’s because the . does not actually match newline, and as the only character after this quote is a newline, it won’t match.
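The missing quote can be reproduced with a small sketch. The text below is an assumption, written so that its last quote is followed directly by a newline.

```python
import re

# Hypothetical text with three quote characters; the last one is
# immediately followed by a newline.
text = 'He said "go" and then "\nnothing'

# A quote followed by any character except newline.
matches = re.findall(r'".', text)

print(text.count('"'))  # 3 quotes in the text
print(matches)          # only 2 of them match
```

The third quote is skipped because the only character after it is the newline, which the period does not match.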

03:06 A common pattern that you would need to match in programming text is the small letters, capital letters, digits, and an underscore (_). In most programming languages, these are the valid characters that can be inside of a variable name. Here, you can see the plain text match for the capital letter E followed by that range. In the case of this email message, that matches the 'En' in 'Enclosed'.

03:30 Because this is so common, there’s a shortcut for it.

03:37 \w is a meta-character, short for word. In this context, word means the characters that can appear in a programming variable name. This meta-character does the exact same thing as the previous expression, still matching the 'En' in 'Enclosed', but is far less to type. A common pattern with most of the meta-characters is for the capital version to be the inverse. Changing small w to capital W matches everything that isn’t a word character. In this case, the matches are capital 'E' followed by whitespace, '>', or '-', because '>', '-', and whitespace are not alphabetic letters, digits, or an underscore. Speaking of digits, you’ll remember this pattern from the previous lesson.
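Here is a minimal sketch comparing the explicit class, \w, and \W. The sample text is made up, so it only approximates what the email in the video contains.

```python
import re

# Hypothetical text containing "E" followed by a letter, a hyphen, and a space.
text = "Enclosed E-mail E >"

# The long form: E followed by a letter, digit, or underscore.
explicit = re.findall(r"E[a-zA-Z0-9_]", text)

# The shortcut: \w stands for the same set of characters here.
shortcut = re.findall(r"E\w", text)

# The inverse: E followed by anything that is NOT a word character.
inverse = re.findall(r"E\W", text)

print(explicit)  # ['En']
print(shortcut)  # ['En']
print(inverse)   # ['E-', 'E ']
```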

04:24 It’s looking for four digits in a row. Again, because this range is common, there’s a shortcut for it as well: the meta-character \d. Four \d in a row is the same as the previous expression.

04:40 And like the \W, \D is the inverse. This pattern is looking for a digit followed by a non-digit followed by another digit. This matches the '3a3' and the '1.0'.

04:54 Keep in mind that non-digit means every character except for 0 to 9, so periods and hyphens are included.
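The digit shortcuts can be sketched the same way. The model number and year below are invented sample data, chosen so that the '3a3' and '1.0' matches mentioned above show up.

```python
import re

# Hypothetical sample text.
text = "Model 3a3-1.0 shipped in 2023"

# Four digits in a row, written as ranges and as the \d shortcut.
four_ranges = re.findall(r"[0-9][0-9][0-9][0-9]", text)
four_digits = re.findall(r"\d\d\d\d", text)

# A digit, then a non-digit, then another digit.
digit_nondigit_digit = re.findall(r"\d\D\d", text)

print(four_ranges)           # ['2023']
print(four_digits)           # ['2023']
print(digit_nondigit_digit)  # ['3a3', '1.0']
```

The '3-1' in the middle is not reported because its first '3' was already consumed by the '3a3' match.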

05:06 \s, short for space, matches space characters or whitespace inside of your text. These are things like space, tab, and newline. This particular tool doesn’t actually highlight the newline characters, which makes it a little hard to see, but if I look for a colon followed by whitespace, you’ll see what I mean.

05:30 The email headers all have colons and spaces, so those matched just like before, but the 'Product information was as follows:' has nothing after this—it’s just a newline character.

05:40 So you can see that this is matching the whitespace of the newline.
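Since pythex doesn't highlight newlines, printing the matches in Python can make them easier to see. This is a small sketch with made-up header text.

```python
import re

# Hypothetical text: one header line with a colon and a space, and one
# line that ends in a colon directly followed by a newline.
text = "To: Acme\nProduct information was as follows:\nitem"

# A colon followed by any single whitespace character.
matches = re.findall(r":\s", text)

print(matches)  # [': ', ':\n']
```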

05:48 Like before, capital is the inverse: everything but whitespace. So one way of finding four-character words is whitespace (\s), followed by four instances of non-whitespace (\S), followed by whitespace (\s).

06:02 This of course will match letters from the alphabet, punctuation, and digits.
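A quick sketch shows that this pattern picks up more than letters. The sentence is an assumption made up for this example.

```python
import re

# Hypothetical text with a four-letter word, a four-digit number, and a
# four-character token that contains punctuation.
text = "the code is 1234 or ab-c !"

# Whitespace, four non-whitespace characters, whitespace.
matches = re.findall(r"\s\S\S\S\S\s", text)

print(matches)  # [' code ', ' 1234 ', ' ab-c ']
```

Each match includes the whitespace on both sides, so two four-character words separated by a single space would not both match, because they would need to share that space.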

06:10 You can use meta-characters inside of a character class. \d matches a digit, and the square brackets give the option of matching either a hyphen or a \s. Remember, if the hyphen is first inside of the character class, it’s a literal match, not a range.

06:30 So this is looking for a digit followed by either a hyphen or a whitespace. This matches things like the '0' and the '9' at the end of this line, because the newline after them counts as whitespace, as well as the '3' followed by the hyphen, and the '3' followed by whitespace, which in this case is a newline.
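Here is a minimal sketch of a meta-character inside a character class. The text is invented so that a digit is followed by a hyphen, a newline, and a space in turn.

```python
import re

# Hypothetical sample text.
text = "Part 3-A costs 10\nand 3 more"

# A digit followed by either a literal hyphen or any whitespace character.
# The hyphen comes first in the class, so it is literal, not a range.
matches = re.findall(r"\d[-\s]", text)

print(matches)  # ['3-', '0\n', '3 ']
```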

06:51 Because meta-characters begin with a backslash, to look for a backslash you must escape it with another backslash. This is looking for a literal backslash followed by a digit between 0 and 9. It matches the '\5' at the end of the model number.

07:10 This can get rather confusing when you start combining things. This is the same pattern as before, but instead of using the character class, I’m using the meta-character for digit.
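Both forms of the backslash pattern can be sketched in Python. The model number below is a made-up stand-in for the one in the email, and the raw string prefix (r"...") is used so that Python itself leaves the backslashes alone and passes them through to the regex engine.

```python
import re

# Hypothetical model number ending in a literal backslash and a 5.
text = r"Model: ACME-X\5"

# Escaped backslash followed by a digit character class.
class_matches = re.findall(r"\\[0-9]", text)

# Escaped backslash followed by the \d meta-character: three backslashes.
meta_matches = re.findall(r"\\\d", text)

# Python displays the single backslash in each match as '\\'.
print(class_matches)
print(meta_matches)
print(len(class_matches[0]))  # 2 characters: the backslash and the 5
```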

07:19 So the first two backslashes are an escaped backslash, the third backslash is part of the meta-character \d. This is looking for a literal backslash followed by a digit, and this once again matches the '\5' at the end of the model number in the text. Next up, I’ll be talking about anchoring expressions: ways of making sure that the thing you’re finding is at the beginning or the end of a string.
