PyYAML for Loading and Writing Documents
00:00 In the previous lesson, I showed you some of the more advanced features of the YAML format. In this lesson, I’m going to dive deeper into serializing and deserializing YAML using the PyYAML library.
PyYAML provides multiple loaders, implemented as classes with convenience functions that wrap those classes. Up until now, you’ve seen me use the safe_load() function, which is actually a shortcut for load(), passing in the
It’s called the UnsafeLoader for a reason. Any time a parser or a data spec supports this kind of stuff, somebody turns it into an attack vector. Be very, very careful with these features. All right, you’ve been warned. Now I’ll show you how to do those things you probably shouldn’t. It’ll be fun.
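As a minimal sketch of how the convenience function relates to the loader classes: safe_load() gives the same result as load() with the SafeLoader class, and the safe loader refuses Python-specific tags instead of acting on them. The sample data here is made up for illustration.

```python
import yaml

text = "spuds: [russet, yukon gold]"  # hypothetical sample data

# safe_load() is shorthand for load() with the SafeLoader class.
via_shortcut = yaml.safe_load(text)
via_loader = yaml.load(text, Loader=yaml.SafeLoader)

# A Python-specific tag is refused by the safe loader rather than executed.
tagged = "!!python/object/apply:os.getcwd []"
try:
    yaml.safe_load(tagged)
    rejected = False
except yaml.constructor.ConstructorError:
    rejected = True

print(via_shortcut, via_loader, rejected)
```

Passing Loader=yaml.UnsafeLoader instead would run the tagged call, which is exactly why that loader should only ever see trusted input.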
PyYAML has implemented several Python-specific tags for your convenience. Using !!python, you can directly invoke Python types, for example, turning a YAML sequence into a tuple instead of a list. And as everything in Python is an object, the !!python/object tag gives you lots of flexibility.
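A small sketch of the tuple case, assuming the UnsafeLoader is used on trusted input. The key names plain and tagged are invented for the example.

```python
import yaml

doc = """
plain: [1, 2, 3]
tagged: !!python/tuple [1, 2, 3]
"""

# Only load trusted input this way: UnsafeLoader honors Python-specific tags.
data = yaml.load(doc, Loader=yaml.UnsafeLoader)

# The untagged sequence becomes a list, the tagged one a tuple.
print(type(data["plain"]).__name__, type(data["tagged"]).__name__)
```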
And this is my YAML file. The !!python/object tag takes a reference to the class to use to create an object, including the module name. You can parametrize the object that gets created either with an inline hash specifying its attributes, or by putting the attributes in a block below the tag.
Let’s parse this file. Because this feature isn’t supported by safe_load(), I can’t use my show_spud() function. So instead, I’ll instantiate a loader directly. First, let me import the module, then open the file,
I’m too lazy to pretty-print it. You’ll have to dig through this with your eyeballs. Ned, Seymore, and Bart all got created. The !!python/object tag has instantiated the Person object and set the attributes.
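Here is a hedged sketch of that behavior. The Person class, its attribute names, and the surnames are assumptions made up for illustration, not the lesson's actual file. The tag is built from the class so the example works from whatever module it runs in. The second entry deliberately misspells an attribute to show that the object tag sets whatever it is given and does not call .__init__().

```python
import yaml

class Person:
    init_calls = 0  # counts how often __init__ actually runs

    def __init__(self, first_name, last_name):
        Person.init_calls += 1
        self.first_name = first_name
        self.last_name = last_name

# Build the tag from the class itself (module name plus class name).
TAG = f"!!python/object:{Person.__module__}.{Person.__qualname__}"

doc = f"""
- {TAG}
  first_name: Ned
  last_name: Flanders
- {TAG} {{first_name: Bart, lastname: Simpson}}
"""

# Trusted input only: this loader will create arbitrary objects.
people = yaml.load(doc, Loader=yaml.UnsafeLoader)

print(vars(people[0]))
print(vars(people[1]))    # contains the misspelled "lastname" key
print(Person.init_calls)  # __init__ was never invoked
```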
It is just setting the attributes on the object that you give it. This means you can set whatever you want, but it also means if you mistype an attribute name, you’re not going to get an error. If you really want to, there are magic methods for changing the behavior of a Python object so it won’t do this. It involves the special attribute called
And here’s the corresponding YAML. Instead of using !!python/object, I’m using
apply tag actually instantiates the object, passing the arguments to .__init__(). Like before, the arguments can be an inline sequence, a sequence in the YAML hash, or there’s a third way here using the
07:02 and I’ll close the block here … and note the printout. This isn’t me showing you the result object. It’s a side effect of the file being loaded. As the print statement says, this is arbitrary code execution. Remember that big yellow warning a few slides back?
07:20 Yeah, I probably should have put in the siren sound effect. This goes beyond being a foot-gun. It’s more like foot-artillery. For example, Python’s subprocess module allows you to execute arbitrary shell programs. I can use subprocess to delete your hard drive.
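A harmless sketch of the apply tag, under stated assumptions: the Person class is hypothetical, and the args and kwds keys are the mapping form that PyYAML's apply constructor accepts, which may or may not match the forms used in the lesson's file. The last entry calls print() purely to show that loading a document can run code as a side effect.

```python
import yaml

class Person:
    def __init__(self, first_name, last_name="Unknown"):
        # Proof that __init__ ran: it derives a new attribute.
        self.full_name = f"{first_name} {last_name}"

TAG = f"!!python/object/apply:{Person.__module__}.{Person.__qualname__}"

doc = f"""
- {TAG} [Ned, Flanders]
- {TAG}
  args: [Bart]
  kwds: {{last_name: Simpson}}
- !!python/object/apply:builtins.print ["side effect while loading"]
"""

# Trusted input only: every apply tag is a function call.
result = yaml.load(doc, Loader=yaml.UnsafeLoader)

print(result[0].full_name, "|", result[1].full_name, "|", result[2])
```

Swap builtins.print for something in subprocess and the same mechanism runs shell commands, which is why safe_load() rejects these tags outright.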
I can count on one hand after a severe construction accident how many times I’ve used .__new__() on its own in my fifteen-plus years of coding Python. I’m not sure why I’d want to do this in YAML, but you can.
08:32 So far, you’ve seen me call the various load methods using a file handle. The same methods are also able to handle strings directly or stream objects. This allows you to parse hard-coded YAML in your code or manipulate content before having it parsed. I’ve kind of glossed over something important.
UTF-32 is not very common, so you’re not likely to run into it. All this encoding stuff can be problematic. Python wants UTF-8. You can transcode, though: for strings, through the string.decode() methods, or by dealing with the encoding directly when you’re opening files.
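Before any files are involved, the same ideas can be sketched in memory. This assumes Python 3, where text is encoded with str.encode() and bytes are decoded with bytes.decode(). The sample document is invented. PyYAML's load functions accept a str, a stream object, or bytes, and for bytes they detect UTF-16 from the byte order mark.

```python
import io
import yaml

text = "name: Zoë\n"  # hypothetical document with a non-ASCII character

from_str = yaml.safe_load(text)                    # plain string
from_stream = yaml.safe_load(io.StringIO(text))    # stream object
from_utf8 = yaml.safe_load(text.encode("utf-8"))   # UTF-8 bytes

# Python's "utf-16" codec prepends a byte order mark, which PyYAML
# uses to work out the encoding of the bytes it is given.
utf16_bytes = text.encode("utf-16")
from_utf16 = yaml.safe_load(utf16_bytes)

# Transcoding by hand: decode from one encoding, encode to another.
utf8_bytes = utf16_bytes.decode("utf-16").encode("utf-8")

print(from_str, from_stream, from_utf8, from_utf16)
```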
You may have noticed I’ve been opening the files using the "rb" mode, meaning read-binary. PyYAML can read the different UTFs, and so by opening the files in binary, the library takes care of the conversion.
If I try to open a UTF-16 file as UTF-8 text, problems will occur. If you know your YAML file is UTF-8, you can use "rt" mode or the encoding parameter to the open() function, or you can just do what I’ve done, which is easier, and use binary all the time.
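The three ways of opening a file can be compared in a short sketch. The file name and contents are made up, and a temporary directory keeps the example self-contained.

```python
import os
import tempfile
import yaml

text = "name: Zoë\n"  # hypothetical document

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "utf16.yaml")  # hypothetical file name
    with open(path, "wb") as fh:
        fh.write(text.encode("utf-16"))  # includes a byte order mark

    # Binary mode: PyYAML detects the encoding itself.
    with open(path, "rb") as fh:
        from_binary = yaml.safe_load(fh)

    # Text mode with the right encoding: Python does the decoding.
    with open(path, "rt", encoding="utf-16") as fh:
        from_text = yaml.safe_load(fh)

    # Text mode with the wrong encoding: reading the file fails.
    try:
        with open(path, "rt", encoding="utf-8") as fh:
            yaml.safe_load(fh)
        wrong_encoding_failed = False
    except UnicodeDecodeError:
        wrong_encoding_failed = True

print(from_binary, from_text, wrong_encoding_failed)
```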
I’m using write-text ("wt") mode here … calling dump(), passing in the file handle, and now in the top window … you can see the resulting YAML file. Let me do that again, this time using UTF-16. I’ll open a new file …
11:52 Note I used binary this time. And then I dump the file, specifying the encoding to be UTF-16. Now I’m going to have to leave the REPL because my display tool here only works with UTF-8, so I can’t output the resulting file in the top window.
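A sketch of both writes, with assumptions called out: the data and file names are invented, and the encoding name "utf-16-le" is my choice because it is one of the names PyYAML's documentation lists for output. The exact value used in the lesson isn't shown in the transcript. Reading the files back in binary mode confirms the round trip.

```python
import os
import tempfile
import yaml

data = {"name": "Zoë", "spuds": ["russet", "yukon gold"]}  # hypothetical data

with tempfile.TemporaryDirectory() as tmp:
    # Text mode: PyYAML writes strings and Python encodes them.
    text_path = os.path.join(tmp, "out_utf8.yaml")
    with open(text_path, "wt", encoding="utf-8") as fh:
        yaml.dump(data, fh)

    # Binary mode: PyYAML does the encoding and writes a byte order mark.
    utf16_path = os.path.join(tmp, "out_utf16.yaml")
    with open(utf16_path, "wb") as fh:
        yaml.dump(data, fh, encoding="utf-16-le", allow_unicode=True)

    with open(text_path, "rb") as fh:
        from_text_file = yaml.safe_load(fh)
    with open(utf16_path, "rb") as fh:
        raw_utf16 = fh.read()

from_utf16_file = yaml.safe_load(raw_utf16)
print(from_text_file == data, from_utf16_file == data)
```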
12:24 The serialization functions take some optional flags that change how Unicode escaping works and whether to use the canonical style. Look, I’d love to tell you exactly what that is, but the documentation is rather vague, and experimenting with it didn’t fully enlighten me. I know what the word means, but I’m not clear what the PyYAML people think it means from a format standpoint.
12:46 Inhale. What was I saying? Right. You can also change what flow style is used based on the input and what multi-doc markers are used. In addition to the Boolean flags, there are some more things you can do to modify the output.
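The Boolean flags mentioned so far can be tried out with a short sketch. The data is invented. The keyword names allow_unicode, canonical, default_flow_style, explicit_start, and explicit_end are PyYAML's own. Printing each result is the easiest way to see what a flag does.

```python
import yaml

data = {"name": "Zoë", "tags": ["a", "b"]}  # hypothetical data

escaped = yaml.dump(data)                         # non-ASCII is escaped by default
readable = yaml.dump(data, allow_unicode=True)    # keep "ë" as-is

flow = yaml.dump(data, default_flow_style=True)   # inline {...} and [...]
block = yaml.dump(data, default_flow_style=False) # indented block style

# Explicit document start (---) and end (...) markers.
marked = yaml.dump(data, explicit_start=True, explicit_end=True)

# Canonical style: every node carries an explicit tag.
canonical = yaml.dump(data, canonical=True)

for label, output in [("escaped", escaped), ("readable", readable),
                      ("flow", flow), ("block", block),
                      ("marked", marked), ("canonical", canonical)]:
    print(f"--- {label} ---")
    print(output)
```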
This can be important, as different operating systems use different characters for newline. There’s a parameter to specify what YAML version to use, which I think is there for future-proofing. And there’s a mechanism for defining your own shortcut tags. This allows you to create aliases of tags, so instead of writing !!python/object/apply, I could write
13:36 If all of that’s not enough choice for what you’re doing, there is also a low-level API. You can parse files, getting tokens or events back, similar to how a lot of XML parsers work. And for serialization, you can dynamically compose YAML using objects. If you’ve ever worked with Beautiful Soup, it’s a bit like that. If you’re interested in this deeper stuff, the Real Python article that this course is based on has a section that does a deep dive. Next up: a bit about other YAML tools, some more about the consequences of the type casting ambiguity, and I’ll bore you with my opinion, as if I hadn’t been doing that up until now.
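For a taste of the low-level API mentioned above, here is a minimal sketch using PyYAML's scan(), parse(), compose(), and serialize() functions on a made-up one-line document. It is only an illustration of the idea, not a tour of the whole API.

```python
import yaml

text = "name: Ned\n"  # hypothetical document

# Tokens and events, in the spirit of a streaming XML parser.
token_names = [type(tok).__name__ for tok in yaml.scan(text, Loader=yaml.SafeLoader)]
event_names = [type(ev).__name__ for ev in yaml.parse(text, Loader=yaml.SafeLoader)]

# compose() stops one step short of building Python objects: it returns nodes.
node = yaml.compose(text, Loader=yaml.SafeLoader)

# Going the other way: build a node tree by hand and serialize it.
built = yaml.MappingNode(
    tag="tag:yaml.org,2002:map",
    value=[(
        yaml.ScalarNode(tag="tag:yaml.org,2002:str", value="name"),
        yaml.ScalarNode(tag="tag:yaml.org,2002:str", value="Ned"),
    )],
)
serialized = yaml.serialize(built)

print(token_names)
print(event_names)
print(serialized)
```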