Regular Expressions to Parse Input
00:00 In this lesson, I’ll show you how you can use regular expressions, or regexes, to parse command line input with a little bit more specificity and nuance than you can with simple Python list comprehensions, as I showed in the last lesson.
00:16 Before I jump into the Python code, I want to show you an example from the Linux utility that the Python code I’m going to show you is going to try to duplicate, and that Linux utility is called seq.
00:28 It works a lot like Python’s range() function. I can say, for example, seq 10 and get a sequence of numbers starting at 1 and going up by 1 all the way up to 10. If you use the -h option, you get this usage message, which lists a couple of formatting options, and then it shows you that seq takes in two optional parameters, the first and the increment, and then the last.
00:54 So, I just passed in the last up there, but I can also say start at 2, go by 2, and end at 10. This will count up starting from 2, by 2’s, all the way to 10. So, this is a quite simple function, but it has a couple of other little things here like the option -s, which gives you the separator.
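Since seq is being compared to range(), the two calls above can be sketched in Python. This is only an illustration of the counting behavior, not part of seq_regex.py. The helper name seq_like is hypothetical, and the sketch assumes a positive increment. The one difference to notice is that seq includes the last value, while range() stops just before its end.

```python
def seq_like(last, first=1, increment=1):
    """Hypothetical helper: count from first up to and including last."""
    # range() excludes its stop value, so go one past last to include it.
    return list(range(first, last + 1, increment))


print(seq_like(10))        # counterpart of: seq 10
print(seq_like(10, 2, 2))  # counterpart of: seq 2 2 10
```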
01:13 I can say here that the separator is going to be an arrow instead of a newline, which is the default, and then I can say, for example, I’ll just go 2 to 10 here. And as you can see, I get 2 through 10, and it actually appends the separator to the end as well, which is kind of interesting. And as you can see, there’s no newline here because the newline isn’t the separator.
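The trailing separator is worth a second look, because Python’s str.join() behaves differently: it only puts the separator between items. Here is a small sketch of the difference. The arrow string "->" is an assumption chosen for illustration.

```python
numbers = [str(n) for n in range(2, 11)]
sep = "->"  # assumed separator, for illustration only

# str.join() places the separator between items only.
joined = sep.join(numbers)

# Adding the separator after every item mimics the output described above.
trailing = "".join(n + sep for n in numbers)

print(joined)
print(trailing)
```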
01:34 So, those are the general features of seq, and it has some features that I won’t get into in this tutorial, but you’re free to keep looking at that on your own.
01:42 So this is what I’m going to try to actually make a duplicate of using regular expressions in Python.
01:49 What I have here is a mostly complete version of seq_regex.py,
01:55 which uses regex to actually create this seq utility. As you can see from the USAGE string, the usage is the Python file and then -s followed by a separator with first,
02:08 increment, and last, and first and increment are actually optional. What this usage doesn’t quite tell you is that there’s also a --help option, which will just print the usage and exit.
02:20 Next, I have the actual pattern here, which is a regular expression spread over multiple lines. If you’re not very familiar with regex, I would encourage you to check out the recent Real Python tutorials on regex, because they’re great.
02:33 There are also a lot of amazing online resources you can use to learn regex in detail. The idea is that this regex will enforce many things that the list comprehension and conditional logic approach actually didn’t. So, for example, it first can take in a --help option, which is captured as a named group called HELP.
02:54 This HELP group is mutually exclusive with the rest of the content, thanks to this or operator (|) here, so if HELP is there, then none of this other stuff will even really be dealt with.
03:06 Then, you can take in this -s, or long option --separator, with a separator
03:13 option-argument, which is captured in this SEP group. So you now have support for option-arguments, which you didn’t have with the list comprehension approach that I showed you.
03:22 Then, it takes in up to three numeric arguments, which is specified with this \d+ specifier, and those are all captured in the OP1 through OP3 groups.
03:33 As you can tell, these are generally optional, but you need at least one of them. That’s also enforced by the regex there. So, this does a lot of things that we could never have done with just this list comprehension approach. We’ve enforced the numeric values that these operands need to have, we’ve enforced this option-argument logic, we have mutual exclusivity. It’s really going to work quite well.
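To make the description concrete, here is a sketch of what such a multiline pattern could look like. The group names HELP, SEP, and OP1 through OP3 come from the lesson, but the exact layout, and the choice of \S+ for the separator, are assumptions rather than a copy of seq_regex.py.

```python
import re

# Sketch only: re.VERBOSE lets the pattern span several lines with comments.
args_pattern = re.compile(
    r"""
    ^
    (
        (--(?P<HELP>help).*)                    # --help, anything after it is ignored
        |                                       # or
        ((?:-s|--separator)\s(?P<SEP>\S+)\s)?   # optional separator option-argument
        (?P<OP1>\d+)                            # first numeric operand, required
        (\s(?P<OP2>\d+))?                       # optional second operand
        (\s(?P<OP3>\d+))?                       # optional third operand
    )
    $
    """,
    re.VERBOSE,
)

# Try a few argument strings; the last one should not match at all.
for line in ("--help", "-s , 2 10", "2 2 10", "abc"):
    match = args_pattern.match(line)
    print(repr(line), match.groupdict() if match else None)
```

Because the --help branch and the operand branch sit on either side of the | operator, only one of them can match, which is where the mutual exclusivity comes from.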
03:56 This parse() is what I’m going to write out in just a second, but it’s how you actually get the arguments from this match object that will be generated when you run this regex pattern against the actual argument string.
04:10 Then here, I have a function called seq() that actually does the logic. Remember what I was telling you earlier about how you first do the parsing and validation, and then you actually do the logic of your program.
04:21 What this does is it takes in this list of operands, and this does all of the logic of what to do when there is one argument, what to do if there’s two arguments, and what to do if there’s three arguments, and so on.
04:32 Then, it actually returns the sep.join() of all of these string versions of these actual integers, so this does the real work of the program.
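A minimal sketch of a function along these lines is shown here. The name seq() and the operands and sep parameters follow the description, but the body is an assumption about how the one, two, and three operand cases might be handled. It assumes a positive increment.

```python
from typing import List


def seq(operands: List[int], sep: str = "\n") -> str:
    """Sketch: build the sequence string for one, two, or three operands."""
    first, increment, last = 1, 1, 1
    if len(operands) == 1:
        (last,) = operands
    elif len(operands) == 2:
        first, last = operands
    elif len(operands) == 3:
        first, increment, last = operands
    # range() excludes its stop value, so go one past last to include it.
    return sep.join(str(i) for i in range(first, last + 1, increment))


print(seq([2, 2, 10], sep=" "))
```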
04:42 And then main() is actually just calling parse() to get the args, and it has to join all of the system arguments into one string so that regex works well. Then, it has a couple of little extra things, here.
04:55 If there are no arguments at all, it raises SystemExit with the USAGE. If the --help option has been included in the args, then it prints the USAGE and returns.
05:05 Then finally, it just uses those args and it actually gets the operands from them and it gets the separator and then it prints the actual seq output.
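The flow of main() can be sketched as follows. To keep the example self-contained, parse() and seq() here are deliberately simplified stand-ins, not the regex-based versions from the lesson, and the wording of USAGE is an assumption. The sketch also takes argv as a parameter so it can be called directly.

```python
USAGE = "Usage: seq_regex.py [-s <separator>] [first [increment]] last"  # assumed wording


def parse(arg_line):
    # Stand-in for the regex-based parse(): splits on spaces, no validation.
    words = arg_line.split()
    if words and words[0] == "--help":
        return {"HELP": "help"}
    return {f"OP{n}": word for n, word in enumerate(words[:3], start=1)}


def seq(operands, sep="\n"):
    # Stand-in for seq(): counts from 1 up to the final operand.
    return sep.join(str(i) for i in range(1, operands[-1] + 1))


def main(argv):
    # Join the arguments into one string so a pattern can match the whole line.
    args = parse(" ".join(argv[1:]))
    if not args:
        raise SystemExit(USAGE)
    if args.get("HELP"):
        print(USAGE)
        return
    operands = [int(v) for k, v in args.items() if k.startswith("OP")]
    sep = args.get("SEP", "\n")
    print(seq(operands, sep))


main(["seq_regex.py", "3"])
```

In a real script, the call at the bottom would pass sys.argv instead of a hard-coded list.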
05:13 So, that’s how all of this works. Now, let me go back and just show you what I think is a really important feature of regexes, which is how you can actually interact with the match object that you get from a regular expression.
05:27 What I can do here is I can say if this actually returned a match,
05:34 so, if this matches this pattern (because that’s what regexes are all about, pattern matching), so if match_object, and I’m going to use the convenient new walrus operator (:=) from Python 3.8 and above, which lets you assign to this named match object and test the result in the same expression.
05:53 You’ll see why that’s convenient in a second. So I can say := args_pattern, which is this thing above here. So, if args_pattern.match(arg_line), which is what I’m taking in, then what I’ll do is I’ll say args = {k, v for k, v
06:12 (and this, of course, stands for key, value) in match_object.groupdict()}, and this just gets a dictionary of all of those matches with these capture groups. And then I’m saying .items(), and I’m going to use a little bit more space just for clarity, if v is not None.
06:34 And so, really, all I’m doing here is just checking all of these capture groups and then leaving out any group that didn’t match anything. So if they didn’t use the --help option, I don’t actually include it in the dictionary, and so on.
06:46 So then, I’ll just return args. It really is that simple. But this is a really important thing that you need to know how to do with regexes: once you get this match, you can actually query the match object for the actual named capture groups, and that’s really helpful and that actually allows this to work.
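Written out, the function described above might look like the sketch below. It uses key: value pairs, which is the form a dictionary comprehension needs. The pattern here is a shortened stand-in (an assumption) so the example runs on its own, and it needs Python 3.8 or later for the walrus operator.

```python
import re

# Shortened stand-in pattern (an assumption); the lesson's pattern has more groups.
args_pattern = re.compile(r"^(--(?P<HELP>help).*|(?P<OP1>\d+)(\s(?P<OP2>\d+))?)$")


def parse(arg_line):
    args = {}
    # := stores the result of match(), and the if tests whether it is truthy.
    if match_object := args_pattern.match(arg_line):
        # Keep only the named groups that actually took part in the match.
        args = {k: v for k, v in match_object.groupdict().items() if v is not None}
    return args


print(parse("--help 5"))
print(parse("2 10"))
print(parse("not numbers"))
```

Returning an empty dictionary when nothing matches gives the caller a simple way to detect bad input.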
07:03 So, let’s see it in action.
07:05 And one thing I just realized I needed to fix real quick before I actually test this in the terminal was I need to say k: v, so that k maps to v, whereas previously I actually had k, v, and that’s not how that works in dictionary comprehensions. So now, I can test it and I can say python seq_regex.py, and I’ll just pass in this --help flag. As you can see, it gets the USAGE, just like you might expect.
07:27 And if I pass in --help and then I pass in some other arguments, then still, nothing else happens. So if --help is there, then it’s just going to print out the help, regardless.
07:36 Now, let’s see it work with seq_regex.py and I’ll just pass in a number. So, as you can see, that works just like it should work there. Now, I’ll pass in 2, 2, and 10 and see if it works this way, and it works great.
07:50 I’ll pass in 2 to 10, start and finish, and it works just like you might hope. So, this really works quite well. And for this seq_regex.py, there’s not really much more that you can ask from this.
08:03 This is a perfectly acceptable approach here. I will pass in this separator argument, just to see that that works—and that actually does well too. So, this is all working just fine.
08:14 Now, let’s talk a little bit about the problems you might run into with regexes. Well, as you’ll notice from this whole pattern here, this regex is really complicated already, and this is just for what, in comparison to some command line interfaces, is an incredibly simple one. So, if you need to manually maintain these regexes as you build a command line interface and make it much more complex, then you’re going to run into some serious problems. So what the applications that build command line interfaces do, of course, is automate that process, so that you can generate a regex from what’s called a grammar.
08:52 I’ll talk more about that when I get into some of the libraries that you can use in Python to build command line interfaces. But otherwise, this regex gives you a lot of advantages that you didn’t have previously, right? With the mutual exclusivity, the option-arguments—all this sort of thing.
09:05 And so this is a really good way to start out building a command line interface. In the next lesson, I’ll show you how you can build a custom parser that relies on some more sophisticated Python logic but doesn’t actually use regexes to accomplish pretty much the same goal.