
Meta-Characters

00:00 In the previous lesson, I introduced you to the simple, plain matching regular expression and then added some complexity with ranges of characters and class matching.

00:10 In this lesson, I’ll be talking about meta-characters, special characters that represent things like whitespace in a regular expression. First off, a little bit of review.

00:20 A plain string match is the text you’re looking for. In the sentence 'I put the thing in the place', 'thing' is highlighted because the regex is thing. Do note that it’s only those letters—not the surrounding whitespace.

00:34 Similarly for the pattern spam, "Well, there's spam egg sausage and spam," "that's not got much spam in it" has three matches—each of the instances of 'spam'. Character ranges, or classes, allow you to specify a match over a range of characters. A common use for this is to look for numbers. The square brackets indicate a range—in this case, two ranges, both 0 to 9. That matches the number '42' in the text on the right-hand side. You can match with letters as well.

01:06 Inside of the square brackets, I’m looking for small [a-z] or large [A-Z], so essentially any letter followed by the plain text ar. So 'far', 'bar', and the 'par' part of 'subpar' matches.

01:23 Regexes support something called a meta-character—for example, digits, whitespace, words, et cetera. Oftentimes, these are shortcuts which you would otherwise have to specify with a lengthy square bracket class.

01:38 Here I am back inside of pythex using the same text to search through as before—the Wile E. Coyote email message. I’ll start with a character class. This is any vowel.

01:51 And now I’m going to introduce the first meta-character: period (.). Period means any character, except for a newline (\n).

02:01 So now, with the character class vowel, period, character class, you will see matches where it is a vowel, some other character, and then another vowel. If you look at the matches, this includes things like 'a<a', 'exa', 'ilu', 'ase', et cetera. Because period (.) is a special character, if you actually want to match one, you have to escape it.

02:29 This is similar to the square brackets that I showed you in the last lesson.

02:34 Here’s a letter followed by a period followed by another letter. This matches parts of the email addresses in the To and From headers.

02:45 Here I’m matching quote (") followed by any character. If you look carefully, one of them is missing. This quote isn’t highlighted, and that’s because the . does not actually match newline, and as the only character after this quote is a newline, it won’t match.
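
As a rough sketch of the period examples in Python's re module (the sample strings here are made up for illustration and are not the email message shown in the video):

```python
import re

# Vowel, any single character except newline, vowel.
vowel_matches = re.findall(r"[aeiou].[aeiou]", "please examine")
print(vowel_matches)  # ['ase', 'exa', 'ine']

# A letter, an escaped (literal) period, and another letter.
dot_matches = re.findall(r"[a-z]\.[a-z]", "wile.coyote@acme.com")
print(dot_matches)  # ['e.c', 'e.c']

# A quote followed by any character. The second quote is followed
# only by a newline, so it is not matched.
quote_matches = re.findall(r'".', 'say "hi"\n')
print(quote_matches)  # ['"h']

# With the DOTALL flag, the period matches a newline as well.
quote_matches_dotall = re.findall(r'".', 'say "hi"\n', flags=re.DOTALL)
print(quote_matches_dotall)  # ['"h', '"\n']
```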

03:06 A common pattern that you would need to match in programming text is the small letters, capital letters, digits, and an underscore (_). In most programming languages, these are the valid characters that can be inside of a variable name. Here, you can see the plain text match for the capital letter E followed by that range. In the case of this email message, that matches the 'En' in 'Enclosed'.

03:30 Because this is so common, there’s a shortcut for it.

03:37 \w is a meta-character, short for word. In this context, word means the characters allowed in a programming variable name. This meta-character does the exact same thing as the previous expression, still matching the 'En' in 'Enclosed', but is far less to type. A common pattern with most of the meta-characters is for the capital version of them to be the inverse. Changing small w to capital W matches everything that isn’t a word character. In this case, the matches are capital 'E' followed by whitespace, '>', or '-', because '>', '-', and whitespace are not alphabetic letters or digits or an underscore. Speaking of digits, you’ll remember this pattern from the previous lesson.
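
A minimal Python sketch of \w and \W, using a made-up sample string rather than the email from the video. One caveat that applies to Python specifically: for str patterns, \w also matches non-ASCII letters and digits by default, so the re.ASCII flag is used here to make it line up exactly with the explicit character class.

```python
import re

# Hypothetical sample text for illustration only.
text = "Enclosed E> E- E x"

# Capital E followed by the explicit "variable name" character class.
explicit = re.findall(r"E[a-zA-Z0-9_]", text)

# The same match using the \w shortcut (ASCII-only to mirror the class above).
shortcut = re.findall(r"E\w", text, flags=re.ASCII)

# Capital \W is the inverse: E followed by anything that is not a word character.
inverse = re.findall(r"E\W", text)

print(explicit)  # ['En']
print(shortcut)  # ['En']
print(inverse)   # ['E>', 'E-', 'E ']
```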

04:24 It’s looking for four digits in a row. Again, because this range is common, there’s a shortcut for it as well: the meta-character \d. Four \d in a row is the same as the previous expression.

04:40 And like the \W, \D is the inverse. This pattern is looking for a digit followed by a non-digit followed by another digit. This matches the '3a3' and the '1.0'.

04:54 Keep in mind that non-digit means every character except for 0 to 9, so periods and hyphens are included.
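
The digit shortcuts can be tried out with a short Python sketch like the one below. The sample text is invented for illustration; it only borrows the '3a3' and '1.0' fragments mentioned above.

```python
import re

# Hypothetical sample text for illustration only.
text = "Model 3a3, version 1.0, built 2021"

# Four digits in a row, written as character classes and as \d.
four_class = re.findall(r"[0-9][0-9][0-9][0-9]", text)
four_meta = re.findall(r"\d\d\d\d", text)

# Digit, non-digit, digit. The non-digit can be a letter or punctuation.
digit_nondigit_digit = re.findall(r"\d\D\d", text)

print(four_class)            # ['2021']
print(four_meta)             # ['2021']
print(digit_nondigit_digit)  # ['3a3', '1.0']
```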

05:06 \s, short for space, matches space characters or whitespace inside of your text. These are things like space, tab, and newline. This particular tool doesn’t actually highlight the newline characters, which makes it a little hard to see, but if I look for a colon followed by whitespace, you’ll see what I mean.

05:30 The email headers all have colons and spaces, so those matched just like before, but the 'Product information was as follows:' has nothing after this—it’s just a newline character.

05:40 So you can see that this is matching the whitespace of the newline.

05:48 Like before, capital is the inverse—everything but a whitespace. So one way of finding the four-letter words is whitespace (\s), followed by four instances of non whitespace (\S), followed by whitespace (\s).

06:02 This of course will match letters from the alphabet, punctuation, and digits.
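
Here is a small Python sketch of \s and \S, assuming a made-up block of text in place of the email message. Printing the matches with findall makes the newline visible, which the highlighting tool does not.

```python
import re

# Hypothetical sample text for illustration only.
text = "To: coyote\nAs follows:\nsend the 4x4! kit now\n"

# A colon followed by any whitespace character, including a newline.
colon_space = re.findall(r":\s", text)
print(colon_space)  # [': ', ':\n']

# Whitespace, four non-whitespace characters, whitespace.
# \S accepts digits and punctuation too, so '4x4!' counts as a "word".
four_chars = re.findall(r"\s\S\S\S\S\s", text)
print(four_chars)  # ['\nsend ', ' 4x4! ']
```

Note that each match consumes its surrounding whitespace, so two four-character words separated by a single space would not both be found by this pattern.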

06:10 You can use meta-characters inside of a character class. \d matches a digit, the square brackets give the option of matching either a hyphen or a \s. Remember, if the hyphen is first inside of the character class, it’s a literal match—not a range.

06:30 So this is looking for a digit followed by either a hyphen or a whitespace. This matches things like the '0' and the '9' at the end of this line, because the newline after them counts as whitespace, as well as the '3' followed by the hyphen, and the '3' followed by the whitespace, the newline.
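
A quick Python sketch of a meta-character inside a character class, again with invented sample text:

```python
import re

# Hypothetical sample text for illustration only.
text = "Call 555-0199\nroom 3 ok 10\n"

# A digit followed by either a literal hyphen or any whitespace.
# The hyphen comes first inside the brackets, so it is not a range.
digit_then_sep = re.findall(r"\d[-\s]", text)
print(digit_then_sep)  # ['5-', '9\n', '3 ', '0\n']
```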

06:51 Because meta-characters begin with a backslash, to look for a backslash you must escape it with another backslash. This is looking for a literal backslash followed by a digit between 0 and 9. It matches the '\5' at the end of the model number.

07:10 This can get rather confusing when you start combining things. This is the same pattern as before, but instead of using the character class, I’m using the meta-character for digit.

07:19 So the first two backslashes are an escaped backslash, the third backslash is part of the meta-character \d. This is looking for a literal backslash followed by a digit, and this once again matches the '\5' at the end of the model number in the text. Next up, I’ll be talking about anchoring expressions: ways of making sure that the thing you’re finding is at the beginning or the end of a string.
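
In Python code, these backslash patterns are usually written as raw strings so that Python itself does not also interpret the backslashes. A minimal sketch, assuming a made-up model number that ends in '\5':

```python
import re

# Hypothetical model number; the raw string keeps the backslash literal.
text = r"Model: RX-900\5"

# Escaped backslash followed by a digit character class.
with_class = re.findall(r"\\[0-9]", text)

# Escaped backslash followed by the \d meta-character (three backslashes).
with_meta = re.findall(r"\\\d", text)

print(with_class == with_meta)  # True
print(with_class[0])            # \5
```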

DoubleA on Feb. 6, 2021

Dear Christopher, thank you so much for creating yet another extremely useful high-quality tutorial. Learning regex with a course like this one is so much easier when compared to the official Python documentation! I am playing with regex now and I see that the regex character class syntax [0-9] works per digit and within the 0-9 range only. For example, I am trying to write a piece of code which would parse a text in a file, and before tokenizing, tagging, and POS-ing it I want to make sure that I am actually working with the relevant corpus only, i.e. the text files which contain the text data meeting certain requirements as to the content.

The following regex:

IPC\sclass:\t[A-H][0-99][A-Z][0-99]/

Does not want to match the following two strings:

IPC class:    B65D85/804

However, when I modify the character class ranges above it works, but looks bulky:

IPC\sclass:\t[A-H][0-9][0-9][A-Z][0-9][0-9]/

Is there a more concise way of construing character ranges matching two-digit numbers in order to avoid typing [0-9] multiple times? If the answer is yes, is it possible to use subranges with such two-digit character ranges, e.g. for numbers from 10 to 70?

Thank you for your comment.

Christopher Trudeau RP Team on Feb. 7, 2021

Hi @DoubleA,

Glad you’re enjoying the course. RegEx’s are all string based, so the concept of a number that you’re after doesn’t work – the string just sees the characters, 0, 1, 2, etc. It doesn’t understand number ranges like “10-70”.

That being said, there are some shortcuts that could make things a bit less cumbersome. The \d character is a short form for “digits” and there are ways of saying “one or more”.

This would match your example string:

/IPC class:\s[A-H]\d+\w\d+\/\d+/

The \s is for whitespace and the \w for word characters (letters, digits, and the underscore). The “+” says “one or more”.

The problem with this particular match is it would also apply to “B678D678/804548393” because the “+” is greedy and will match more than just two digits.

You can get more granular and be specific with the number of digits matched:

/IPC class:\s[A-H]\d{1,2}\w\d{1,2}\/\d{3}/

In this case the {m,n} notation means from m to n matches, with the n being optional. The first two uses in the example above match 1 or 2 digits only, the third only matches 3 digits. The {} notation is covered in the lesson on Quantifiers.

Of course, that isn’t any easier to read than the original, but it is very precise.

If you’re writing code with regexes like this which are complicated I would definitely encourage you to look into the VERBOSE flag which allows you to embed comments into the regex – makes it far easier to remember what the regex does when you come back to it later. VERBOSE is covered in the lesson on Flags.
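
As an illustration only (this is a sketch, not code from the course), the bounded pattern above could be written with the VERBOSE flag in Python roughly like this. The name ipc_pattern is made up, and the pattern drops the surrounding slashes because Python's re module does not use them as delimiters:

```python
import re

# Hypothetical name; whitespace in the pattern is ignored under VERBOSE,
# so the space in "IPC class" is written as \s.
ipc_pattern = re.compile(
    r"""
    IPC\sclass:\s   # literal label, then one whitespace character
    [A-H]           # one letter from A to H
    \d{1,2}         # one or two digits
    \w              # one word character
    \d{1,2}         # one or two digits
    /               # literal slash
    \d{3}           # exactly three digits
    """,
    re.VERBOSE,
)

good = ipc_pattern.search("IPC class:\tB65D85/804")
print(good.group() if good else None)

# The longer string that the "+" version would accept is rejected here.
too_long = ipc_pattern.search("IPC class:\tB678D678/804548393")
print(too_long)  # None
```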

Happy pythoning!

DoubleA on Feb. 7, 2021

Hi Christopher,

Thank you so much for the detailed explanation. Indeed, I’ve completely forgotten about the {m,n} notation, which in the present case, appears to be the best candidate to use. And yes, the + operator can sometimes bring in very unexpected hits, so I try to avoid using it wherever possible.

The present course is by far the best course on Python regex I’ve come across. Thank you for the great work!
