A Custom Parser

Command Line Interfaces in Python Liam Pulsifer 07:53


00:00 An even more flexible option than a regular expression is to write a custom parser that uses more advanced Python logic to parse input in pretty much any format you can imagine.

00:13 Let’s take a look at that in Python code with another implementation of the seq utility. Here, I have a new version of the seq utility which is going to be implemented with a custom parser.

00:26 Let me just add that in up here just so that it’s clear to everyone what’s going on here. Okay. So that’s there. And as you can see, the usage here is exactly the same as it was for the regex implementation.

00:40 It takes in the Python file, and then a possible --help option, a separator, and then optional first and incr (increment), but it must include a last operand.

00:50 The seq() function is actually exactly the same as well. It takes in two things: operands, which is a list of integers, and a separator, which by default is a newline ("\n").

01:02 And then based on whether there’s one, two, or three operands, it constructs the actual logic of the seq program. The parse() function is what I’m going to end up writing here, and it’s going to be much more complex than it was in the regex version.
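As a rough sketch of the seq() function just described: the transcript doesn't show its body, so the details here are assumptions, in particular that the last value is included (as in the Unix seq utility) and that a negative increment counts down.

```python
from typing import List


def seq(operands: List[int], sep: str = "\n") -> str:
    """Build the sequence string from one, two, or three integer operands."""
    first, increment = 1, 1  # defaults when only LAST is given
    if len(operands) == 1:
        (last,) = operands
    elif len(operands) == 2:
        first, last = operands
    elif len(operands) == 3:
        first, increment, last = operands
    else:
        raise ValueError("expected one, two, or three operands")
    # range() excludes its stop value, so step one past LAST to include it.
    # Assumption: a negative increment counts down toward LAST.
    stop = last + 1 if increment > 0 else last - 1
    return sep.join(str(i) for i in range(first, stop, increment))
```

With this sketch, seq([1, 2, 9], ", ") would produce the string "1, 3, 5, 7, 9".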

01:16 But the complexity that’s gained in parse() is offset by the fact that there’s no actual regex to deal with. The main() function also works much the same way: it gets the separator and the operands from the parsing process, and then simply says if there are no operands, raise SystemExit with the USAGE, otherwise, use seq() to actually construct the string that needs to be returned.
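The flow of main() described here might be sketched as follows. The USAGE wording is an assumption, and parse() and seq() are deliberately simplified placeholders so that the example runs on its own; they are not the lesson's implementations.

```python
from typing import List, Tuple

# Assumed wording; the lesson's actual USAGE string isn't shown in the transcript.
USAGE = "Usage: python seq_parse.py [--help] [-s SEP] [FIRST [INCR]] LAST"


def parse(args: List[str]) -> Tuple[str, List[int]]:
    """Placeholder parser: treats every argument as an integer operand."""
    return "\n", [int(arg) for arg in args]


def seq(operands: List[int], sep: str = "\n") -> str:
    """Placeholder seq: only counts from 1 up to the final operand."""
    return sep.join(str(i) for i in range(1, operands[-1] + 1))


def main(argv: List[str]) -> None:
    sep, operands = parse(argv[1:])  # argv[0] is the script name
    if not operands:
        raise SystemExit(USAGE)  # no LAST operand was given
    print(seq(operands, sep))
```

In a real script, you would typically call main(sys.argv) under an if __name__ == "__main__": guard.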

01:40 So, how does this actual parsing logic work? How is it possible to just parse through this without using something like a regex? Well, the idea is that you want to parse from left to right, one argument at a time, and follow certain rules based on what should happen with each argument.

01:57 So, that kind of structure suggests that it might be nice to use what’s called a double-ended queue, which is essentially a queue, or a list, that lets you pop really quickly and easily from both sides—from the left and the right.

02:11 We want to pop from the left, generally. So, arguments is going to be a double-ended queue of the arguments, and then I’ll just say for now that the sep is equal to newline, even though that would actually be taken care of by the default argument to seq(), but I just want to have this here for clarity’s sake.

02:27 And then, at the moment, there are no operands so far, so operands is just an empty list even though, eventually, it will return a list of the operands. Now, the next thing I want to say is while arguments—so, while arguments has any members at all—I want to get the current argument, which is equal to arguments.popleft().

02:47 So, I’m popping—or getting access to—the left-most member of arguments, and then I’m taking it away from the arguments list as I go along.
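A minimal sketch of that left-to-right loop, using collections.deque from the standard library. The consume() helper and its visited list are hypothetical names used only for illustration.

```python
from collections import deque
from typing import List


def consume(args: List[str]) -> List[str]:
    """Hypothetical helper: visit arguments left to right, one at a time."""
    arguments = deque(args)
    visited = []
    while arguments:  # an empty deque is falsy, so the loop stops when drained
        current = arguments.popleft()  # removes and returns the left-most item
        visited.append(current)
    return visited
```

A deque is a natural fit here because popleft() runs in constant time, whereas list.pop(0) has to shift every remaining element.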

02:56 Now, the first case is if there are no operands whatsoever so far, so if len(operands) == 0. That’s the only time when you want to be checking for the --help or the sep options. So, if current == "--help",

03:15 then what you want to do is you want to print out the USAGE, and then I actually just want to exit from the system, but I want to exit() with the status code 0 to make sure that anyone looking at this knows that this wasn’t a failure—this was a designed exit.
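To illustrate the difference between a deliberate exit and a failure, here is a small sketch. The function names show_help() and fail() and the USAGE wording are assumptions made for the example.

```python
import sys

# Assumed wording; the lesson's actual USAGE string isn't shown in the transcript.
USAGE = "Usage: python seq_parse.py [--help] [-s SEP] [FIRST [INCR]] LAST"


def show_help() -> None:
    """Deliberate exit: print the usage and report success (status 0)."""
    print(USAGE)
    sys.exit(0)


def fail() -> None:
    """Failure exit: a string argument to SystemExit is printed to stderr
    by the interpreter, and the process exits with status 1."""
    raise SystemExit(USAGE)
```

Both paths raise SystemExit under the hood; only the exit code differs, which is what lets a calling shell script tell a help request apart from a bad invocation.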

03:31 The next thing that could happen is current could be in either the short form of

03:38 --separator or the long form, here. Right? So, if it’s either one of those things, what I want to do is I want to say sep = current, which should work just fine because current is an argument, so it’s a string.

03:52 Then, I want to continue because I don’t want to do any more logic after this, I just want to get the separator. And I said sep = current but of course that doesn’t make sense because then it would just make sep to be "-s" or "--separator".

04:03 So what I actually want is to say sep = arguments.popleft(), so actually get the next argument from arguments instead of the current argument, which is one of these two.
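The difference between the two assignments can be seen in a short sketch. The sample argument list and the names wrong_sep and right_sep are made up for illustration.

```python
from collections import deque

# Sample input, as if the user had typed: -s ", " 1 5
arguments = deque(["-s", ", ", "1", "5"])

current = arguments.popleft()    # the flag itself: "-s"
wrong_sep = current              # the mistake: the separator becomes "-s"
right_sep = arguments.popleft()  # the fix: take the value that follows the flag
```

After the second popleft(), only the operands "1" and "5" are left in the queue, so the loop can carry on parsing them as integers.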

04:16 So, the next thing I want to say is, well, “What happens if I’m not looking to parse "--help" or "--separator"?” Right? The next thing that I want to do is I want to actually try to get access to an operand, right?

04:29 Because anything other than "--help" or the separator option should be an integer operand. So, the first thing I can say here is try, and I’ll tell you why I’m using a try and except block in a second, but I’ll say try: operands.append() the integer form of current.

04:46 And it may just have become clear why this try and except block was necessary, because it could be the case that current is not actually a parseable integer.

04:54 So in that case, I’ll except a ValueError and I’ll raise SystemExit with the USAGE message, so

05:01 that they know that they’ve made some mistake here. And then the last thing I want to do is say if operands, or, to be more precise, if len() of operands is more than 3 (so, if this is the fourth or later operand), then I also want to raise SystemExit, because someone has passed in too many operands. And then with all of that done, I can simply return, in the correct order, the separator and then the operands.
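Putting the steps narrated so far together, parse() might look roughly like this. It is a reconstruction from the narration rather than the lesson's exact file, and the USAGE wording is an assumption.

```python
import sys
from collections import deque
from typing import List, Tuple

# Assumed wording; the lesson's actual USAGE string isn't shown in the transcript.
USAGE = "Usage: python seq_parse.py [--help] [-s SEP] [FIRST [INCR]] LAST"


def parse(args: List[str]) -> Tuple[str, List[int]]:
    arguments = deque(args)
    sep = "\n"
    operands: List[int] = []
    while arguments:
        current = arguments.popleft()
        if len(operands) == 0:
            # Options are only recognized before the first operand.
            if current == "--help":
                print(USAGE)
                sys.exit(0)  # deliberate exit, not a failure
            if current in ("-s", "--separator"):
                sep = arguments.popleft()  # the value that follows the flag
                continue
        try:
            operands.append(int(current))
        except ValueError:
            raise SystemExit(USAGE)  # not a parseable integer
        if len(operands) > 3:
            raise SystemExit(USAGE)  # a fourth (or later) operand
    return sep, operands
```

One edge case to be aware of in this sketch: if -s is the very last argument, the inner popleft() is called on an empty deque and raises IndexError rather than showing the usage message.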

05:32 So, that’s how you can use this parsing logic along with a double-ended queue to go through and follow simple rules, and by following those simple rules, you get some great properties. So for example, --help and --separator have to come before any operands in this case, right? Otherwise, if they show up in any other place, they’ll trigger this ValueError because they aren’t integers, and you’ll get the usage message, right?

05:54 So the number of operands needs to be between one and three. If there are more than three, this while loop raises an error, and if there are none at all, the check in main() does.

06:07 So, this is really convenient and really cool. Now let’s watch it work in the terminal. So, I can say here python seq_parse.py and let’s first try it with this --help flag.

06:18 And there you get the usage, so it works just fine. And then a simple test case, here, just passing in 10—works great. And then my classic example—going up by 2, and then going up by 1—works perfectly fine.

06:32 And let’s try with the --help flag after some operands. Well, that prints the usage message as an error, just as it would if I passed in some kind of malformed argument, like this --s.

06:43 Now, let’s use the separator real quick, and see if this works. So, I’ll say here—maybe my separator would be the three letters "AAA". I don’t know why you would use that, but there—you actually get that just as desired, so this works just fine.

06:57 Writing a custom parser can be a really flexible and really powerful approach, but the problem, of course, is that this logic here that I showed you in parse()—it also requires a lot of maintenance just like a complex regular expression.

07:12 So really, you haven’t lost any complexity by moving to this custom parsing version, rather than regex. You’ve just moved that complexity to your code logic rather than your regex logic.

07:24 So really, both of these approaches, as awesome as they are, are things that really need to be automated, and so that’s what I’ll talk about in the next section of this series, where I’ll cover the tools that exist in Python to automate this stuff for you. But as of now, you have a great understanding of how all of this works under the hood, and you should be really well prepared to understand any command line parsing system that you encounter, at least on a basic level. In the next lesson, I’ll talk a bit about type validation.
