Fun and Further Reading
Here are resources for a few fun regex examples:
- RFC822 in Perl: regexp-based address validation
- Divide By 7 - Example
- Hard Code Golf: Regex for divisibility by 7
- Demystifying The Regular Expression That Checks If A Number Is Prime
- Xavier Noria - repo on math by regex
Here are resources for additional regex tools:
- pythex
- Regular Expressions 101
- Matthew Barnett’s regex library
- Parse: Parse strings using a specification based on the Python format() syntax
Here are resources for further reading about regex:
00:00 In the previous lesson, I showed you how to up your pattern-matching game with conditionals and lookaheads and lookbehinds. In this lesson, I’ll show you some fun regular expressions and point you at some further reading.
00:13 There’s a famous quote that’s been kicking around the internet for quite some time. “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.” This was originally attributed to Jamie Zawinski, an early Netscape engineer.
00:30 The example I’m going to show you now is probably why he said it. In the first lesson, I showed you this horror show. This is a regular expression in a Perl module that validates the headers like To and From inside of a mail message. Although it’s long and complicated, you do have the capability now of understanding what’s in it—if you have a few hours to spare and want to break it down by pieces.
00:56 Generally speaking, people don’t write regexes like this. They’re very hard to debug. So, where did this come from? Here’s the Perl code that actually generated it. Still not easy to understand, but a little easier. If you want to dig through it, start at the bottom.
01:12 Look at what’s being returned. The address variable consists of the $mailbox or the $group variables. The $mailbox and $group variables are based on $addr_spec and $phrase, et cetera, working its way up to the top until you have the entire regex.
01:28 Regexes being what they are, there’s an entire sport on the internet of coming up with obscure ones. Here’s the first portion of one that determines whether or not a string of numbers is divisible by 7.
01:44 Code golf is the practice of finding the shortest possible snippet to solve a problem. Regex code golf is a subset of this based on regular expressions. The first part of that divide by 7 regex that I showed you is actually over 25,000 characters long. Stack Exchange hosts a Code Golf section.
02:05 This question was posted and people tried to top it. One enterprising individual actually got it down to only 103 bytes. Now, they had to use a .NET-specific version of the library, because the .NET library supports recursion inside of regular expressions. By creating a recursive regex, the number of bytes in the regex was drastically reduced.
02:28 Of course, I’m not suggesting you use a regex to decide whether or not something’s divisible by 7. There are far easier ways of doing it.
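As a point of comparison, here is a minimal sketch of the easier route in Python. The function names are made up for illustration and are not from the lesson. The second function reads the digits one at a time and keeps only the running remainder, which is the same idea as a finite-state machine with seven states.

```python
def divisible_by_7(digits):
    """The direct route: convert to an integer and use the modulo operator."""
    return int(digits) % 7 == 0


def divisible_by_7_state_machine(digits):
    """Track only the running remainder (one of seven states), digit by digit."""
    remainder = 0
    for ch in digits:
        remainder = (remainder * 10 + int(ch)) % 7
    return remainder == 0


print(divisible_by_7("343"))                # True, since 343 = 7 * 49
print(divisible_by_7_state_machine("344"))  # False
```

Both functions agree on every string of digits; the second one just never needs to hold the whole number at once.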
02:36 Speaking of taking the hard route, here’s a function that uses a regex to show whether or not a number in a string is prime. It’s not particularly efficient, but it is clever.
02:48 This regular expression requires the number to be expressed in unary format. Unary format is just the number 1 repeated for the length of the value you’re looking for, so 4 is 1111: the length of the string is equal to the value of the number. is_prime() takes n, and the right-hand side of the re.match() function call multiplies "1" by n, i.e. repeats it n times, converting an integer into unary format. Left of the pipe is ^.?$.
03:26 This matches a single character zero or one times. So if the string consists of a single "1", it will return True. Because of the ^ and the $, it can only consist of that single "1". If that evaluates to False, then the right-hand side of the OR gets evaluated. The right-hand side of the OR consists of an anchored expression.
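To see the left-hand side on its own, here is a small sketch. The helper name matches_left_side() is hypothetical and only exists for this illustration. It shows that ^.?$ matches the unary forms of 0 (the empty string) and 1, and nothing longer.

```python
import re

LEFT_SIDE = r"^.?$"  # zero or one of any character, and nothing else


def matches_left_side(n):
    """Return True if the unary form of n matches the left-hand pattern."""
    return re.match(LEFT_SIDE, "1" * n) is not None


print([n for n in range(6) if matches_left_side(n)])  # [0, 1]
```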
03:48 This is the full expression using the ^ and the $. If you are using Python 3.5 or greater, you could use the fullmatch() function instead. Inside of the group, it’s looking for a character, and then a character repeating one or more times in the non-greedy format. Outside of the group is a backreference.
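A rough sketch of the right-hand side by itself, written with re.fullmatch() so the ^ and $ anchors can be dropped. The pattern is reconstructed from the description in this lesson, and the name is_composite_unary() is invented for the example.

```python
import re

# A group of two or more characters, followed by one or more copies of that group.
COMPOSITE = r"(..+?)\1+"


def is_composite_unary(n):
    """Return True if "1" * n splits into two or more equal chunks of length >= 2."""
    return re.fullmatch(COMPOSITE, "1" * n) is not None


print([n for n in range(2, 20) if is_composite_unary(n)])
# [4, 6, 8, 9, 10, 12, 14, 15, 16, 18]
```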
04:09 This first tries to match two characters, then three characters, et cetera. The backreference will look for duplicates, so if you have two characters and then two characters, that means the number isn’t prime, because two and then two makes four, and four is divisible by two. It does the same for three and three, four and four, et cetera, running all the way up.
04:33 This is essentially the equivalent of taking your number, dividing it by 2—if you get a whole number, it wasn’t prime. Taking your number, dividing by 3, and doing that all the way up to the length of the number.
04:46 As I said, not particularly efficient. This clever little trick is from a Perl hacker known as Abigail. If you want to see more details about how this works and particularly how it works in different languages, Illya has a great article available here that tears it down.
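Putting the pieces together, the function described above might look like the sketch below. It is reconstructed from the walkthrough, so the exact code shown on screen may differ. The is_prime_by_division() helper is not part of the lesson; it is added here only to compare the regex against plain trial division.

```python
import re

PRIME_PATTERN = re.compile(r"^.?$|^(..+?)\1+$")


def is_prime(n):
    # "1" * n is the unary form of n. A match means n is 0, 1, or composite,
    # so n is prime only when there is no match at all.
    return PRIME_PATTERN.match("1" * n) is None


def is_prime_by_division(n):
    # Trial division: try every divisor from 2 up to n - 1.
    if n < 2:
        return False
    return all(n % d for d in range(2, n))


print([n for n in range(30) if is_prime(n)])
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

The regex version builds a string as long as the number itself and backtracks through candidate chunk sizes, which is why it slows down quickly as n grows.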
05:05 Xavier Noria has made a whole repo available with regex patterns for math.
05:12 You’ve seen me using pythex to demonstrate and debug my regular expressions. Another useful site is Regex101. I find pythex a little easier to use—it isn’t as cluttered—but Regex101 has an interesting feature that it will break down exactly how the regular expression works and explain it to you.
05:30 If you’re struggling to debug a pattern, watching how it’s broken down may help you find your problem.
05:36 There’s a third-party regex library available on PyPI. This one includes nested sets, set operations, and infinite lookbehind. It’s far more powerful than what’s built into Python’s standard library.
05:50 Another alternative for you to look at is the parse library. parse doesn’t actually use regular expressions, but solves many of the same problems and tends to be easier to read than regexes are. If you’re doing a lot of text matching, you may find this library helpful.
06:06 As always, Wikipedia is an invaluable resource. Here’s an article on regular expressions themselves, as well as one specific to Perl Compatible Regular Expressions.
06:17 As I mentioned in the previous lesson, regexes are built on finite-state machines. There’s a whole school of computer science based on these ideas. You can read more about them here. And finally, you need to be careful with how you use your regular expressions when they’re publicly available.
06:33 There are ways of creating text that will cause your regular expressions either to hang or become computationally expensive. This is called a regular expression denial of service attack. This article here will give you more information on how to protect yourself from it and what kind of regexes are vulnerable.
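For a feel of what a vulnerable pattern looks like, here is a small sketch using Python's built-in re module. The pattern ^(a+)+$ is a commonly cited example of nested quantifiers and is not taken from the article mentioned above. The helper seconds_to_match() is made up for this illustration, and the timings will vary by machine.

```python
import re
import time

VULNERABLE = re.compile(r"^(a+)+$")  # nested quantifiers: many ways to split a run of a's
SAFER = re.compile(r"^a+$")          # accepts the same strings without the nesting


def seconds_to_match(pattern, text):
    """Return how long a single match attempt takes, in seconds."""
    start = time.perf_counter()
    pattern.match(text)
    return time.perf_counter() - start


for length in (10, 14, 18):
    text = "a" * length + "b"  # the trailing "b" forces the match to fail
    slow = seconds_to_match(VULNERABLE, text)
    fast = seconds_to_match(SAFER, text)
    print(f"{length} a's: vulnerable {slow:.4f}s, safer {fast:.6f}s")
```

The time for the vulnerable pattern grows much faster than the input does, so keep the lengths small when experimenting. Rewriting a pattern to remove the nested repetition, as in SAFER, is one way to avoid the problem.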
06:53 I’m sure you’re ready to do mathematical proofs in regex now, but before you do, continue on to the summary and I’ll wrap things up.