Regex Anchors

00:00 In the previous lesson, I introduced the concept of a meta-character, a way of describing groups of characters through shortcuts like whitespace. In this lesson, I’ll be talking about anchors.

00:11 Anchors let you restrict where a match can happen, such as the beginning or end of a string. The first meta-character that I introduced you to was the period (.).

00:20 . means any character that isn’t a newline (\n). In the text "The core of it is, a cur can't drive a car", c.r matches the 'cor' of 'core', the 'cur' of 'cur', and the 'car' of 'car'. \w is the word character, which is any letter, digit, or the underscore (_).
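As a rough sketch of the period meta-character in Python’s re module (the lesson itself works in pythex; the variable names here are illustrative, and the sample text is joined into a single string for convenience):

```python
import re

# Sample text from the lesson, joined into one string.
text = "The core of it is, a cur can't drive a car"

# "." stands for any single character except a newline.
dot_matches = re.findall(r"c.r", text)
print(dot_matches)  # ['cor', 'cur', 'car']
```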

00:45 The regex on the left is looking for a b, then some word character, some other word character, and then an r—matching 'bear' and 'beer' in the sentence on the right.
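A minimal sketch of that pattern in Python. The sentence below is a hypothetical stand-in, since the one shown on screen isn’t in the transcript:

```python
import re

# Hypothetical sentence; not the one used in the video.
sentence = "The bear spilled my beer on the bar."

# A literal b, two word characters, then a literal r.
word_matches = re.findall(r"b\w\wr", sentence)
print(word_matches)  # ['bear', 'beer']
```

Note that 'bar' is not matched: the pattern needs exactly two word characters between the b and the r.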

00:58 The backslash meta-characters shown so far each have a capitalized form that means the inverse: \W, for example, matches any character that is not a word character. The regex on the left has three word characters, followed by a non-word character, followed by a digit. This matches the model numbers on the right-hand side.

01:15 In 'A3F:3', the characters 'A3F' each match a \w, the colon (:) matches the \W, and the '3' matches the \d. The same goes for 'JHK\2' and '23M/1'.
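Here is one way to check this in Python, assuming the three model numbers sit in a single space-separated string (that layout is my assumption):

```python
import re

# Model numbers named in the lesson; a raw string keeps the backslash literal.
models = r"A3F:3 JHK\2 23M/1"

# Three word characters, one non-word character, one digit.
model_matches = re.findall(r"\w\w\w\W\d", models)
print(model_matches)  # ['A3F:3', 'JHK\\2', '23M/1']
```

The doubled backslash in the printed list is just how Python displays a single backslash inside a string.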

01:34 \s is the meta-character for whitespace. The literal ark followed by \s\w\w matches 'ark wa' and 'ark an' in 'The ark was large.' and 'Inside it was dark and scary.' To match a backslash character, you need to escape it with another backslash character.
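A short sketch of the whitespace example, with the two sentences joined into one string (an assumption made for brevity):

```python
import re

text = "The ark was large. Inside it was dark and scary."

# Literal "ark", one whitespace character, then two word characters.
ark_matches = re.findall(r"ark\s\w\w", text)
print(ark_matches)  # ['ark wa', 'ark an']
```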

02:00 This can get a little messy. Looking at this code, you see the meta-character \w, then a literal colon (:). The first two backslashes that follow are an escaped backslash.

02:12 The third one is for the meta-character \w, followed by another meta-character \w, followed by another, followed by an escaped backslash, and then three digit meta-characters.

02:24 This matches the portion of a Windows path on the right-hand side.
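To see the pattern described above in action, here is a sketch against a hypothetical path (the actual path on screen isn’t in the transcript). It also shows how much messier the same pattern gets without a raw string:

```python
import re

# Hypothetical Windows path; raw strings keep the backslashes literal.
path = r"C:\dev\404\notes.txt"

# \w, a colon, an escaped backslash, three \w, an escaped backslash, three \d.
pattern = r"\w:\\\w\w\w\\\d\d\d"
path_match = re.search(pattern, path)
print(path_match.group())  # C:\dev\404

# The same pattern as a regular string needs every backslash doubled.
non_raw = "\\w:\\\\\\w\\w\\w\\\\\\d\\d\\d"
print(non_raw == pattern)  # True
```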

02:30 Up until now, all the matches have been anywhere in the string. An anchor allows you to require that the match happens at the beginning or end of the string.

02:41 You can also change the behavior of the regex by putting it in multiline mode. Multiline mode changes whether a newline counts as a place where an anchor can match. By default, anchors match only at the beginning and end of the string. In multiline mode, they also match at the start and end of every line, allowing you to treat each line as if it were its own string.

03:06 I’m back inside of pythex using the same Wile E. Coyote email message. The first anchor I’m going to introduce you to is the caret symbol (^).

03:15 You may remember that the caret symbol inside of square brackets negates the square brackets. Unfortunately, although it’s a little confusing, a caret also means something completely different outside of the square brackets—it means anchoring to the beginning of the string.

03:31 ^X says look for a capital X at the beginning of the string, the caret meaning the beginning.

03:42 If I change this to ^P, no match is found. This is expected—the string begins with capital 'X'. If I turn on multiline mode, the behavior changes.

03:56 Now, the 'P' in 'Product' is matched. This is because in multiline mode, the caret also matches at the beginning of each line, so the position right after a newline is treated like the beginning of the string. This is pretty useful if you’re working with multiline text like this.
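In Python, multiline mode corresponds to the re.MULTILINE flag. A minimal sketch, using a made-up message in place of the email from the video:

```python
import re

# Hypothetical stand-in for the email message used in the lesson.
message = "Xylophone order\nProduct: rocket skates\nShip soon."

start_x = re.search(r"^X", message)               # match at the start of the string
start_p = re.search(r"^P", message)               # None: the string starts with 'X'
line_p = re.search(r"^P", message, re.MULTILINE)  # match at the start of line two

print(start_x is not None, start_p, line_p.group())  # True None P
```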

04:15 Essentially, it becomes “a line that begins with” rather than “the entire string begins with.” There are two meta-characters for doing this kind of match.

04:24 Caret (^) is one of them. The other is \A.

04:30 \AX matches the 'X' at the beginning of the string, but \AP does not match even though you’re in multiline mode. That’s because multiline mode affects the caret but not \A.

04:46 ^ and \A work the same way if multiline mode is off. When multiline mode is on, \A still only matches the beginning of the string.
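The difference can be sketched like this, again with a hypothetical two-line message:

```python
import re

# Hypothetical two-line message.
message = "Xylophone order\nProduct: rocket skates"

a_x = re.search(r"\AX", message, re.MULTILINE)     # match: 'X' starts the string
a_p = re.search(r"\AP", message, re.MULTILINE)     # None: \A ignores multiline mode
caret_p = re.search(r"^P", message, re.MULTILINE)  # match: '^' honors multiline mode

print(a_x is not None, a_p, caret_p is not None)  # True None True
```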

05:00 The next anchor that I want to show you is dollar sign ($). $ means the end of the string, or—if multiline is on—the end of the line as well.

05:09 Remember that the period (.) is itself a meta-character, so to match a literal period I have to escape it. And the $ says “at the end of a line.” Because I’m in multiline mode, this is matching the end of the sentences.

05:24 Turning multiline mode off, the match is only the one at the end of the string.
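A small sketch of the escaped period followed by the dollar sign, with and without multiline mode. The message text is invented for illustration:

```python
import re

# Hypothetical message with one sentence per line.
message = "The trap failed.\nI want a refund.\nWile E. Coyote."

# With multiline mode, "$" matches at the end of every line.
line_ends = re.findall(r"\.$", message, re.MULTILINE)
# Without it, only the period at the very end of the string matches.
string_end = re.findall(r"\.$", message)

print(len(line_ends), len(string_end))  # 3 1
```

The period after 'E' never matches, because it is followed by a space rather than the end of a line.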

05:34 Similar to \A is \Z. It anchors to the end of the string, whether or not you’re in multiline mode.
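Here is a sketch comparing the two in Python’s re module. One extra detail, specific to Python and not covered in the video: $ also matches just before a newline that ends the string, while \Z does not.

```python
import re

text = "First line.\nLast line."

# "$" in multiline mode matches at the end of each line.
dollar_ends = re.findall(r"\.$", text, re.MULTILINE)
# "\Z" matches only at the very end of the string, even in multiline mode.
z_ends = re.findall(r"\.\Z", text, re.MULTILINE)
print(len(dollar_ends), len(z_ends))  # 2 1

# With a trailing newline, "$" still matches before it, but "\Z" does not.
with_newline = "Last line.\n"
dollar_hit = re.search(r"\.$", with_newline)
z_hit = re.search(r"\.\Z", with_newline)
print(dollar_hit is not None, z_hit)  # True None
```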

05:45 The last anchor I’m going to introduce you to is \b. This is the word boundary. \b followed by a literal f looks for words that begin with the letter f.

06:00 You can also match things at the end of a word by having the literal er followed by the word boundary (\b).

06:11 Finally, you can invert it by capitalizing the B, looking for non-word boundaries. So \Ber\B (non-word boundary, er, non-word boundary) matches an 'er' inside of a word.
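The three boundary patterns can be tried out as follows. The sentence is hypothetical, and I’ve added \w* to two of the patterns so that the whole word shows up in the results:

```python
import re

# Hypothetical sentence chosen to exercise each pattern.
sentence = "a fine waffle for the buffer merged early"

# Words that begin with f: boundary, then f, then the rest of the word.
starts_with_f = re.findall(r"\bf\w*", sentence)
# Words that end in er: any word characters, then er, then a boundary.
ends_with_er = re.findall(r"\w*er\b", sentence)
# An 'er' with word characters on both sides, as in 'merged'.
inner_er = re.findall(r"\Ber\B", sentence)

print(starts_with_f)  # ['fine', 'for']
print(ends_with_er)   # ['buffer']
print(inner_er)       # ['er']
```

The f characters inside 'waffle' and 'buffer' are skipped by the first pattern because there is no word boundary in front of them.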

06:27 That’s regex anchors. Next up, I’ll be talking about how you can repeat expressions with quantifiers.
