PyYAML for Loading and Writing Documents
00:00 In the previous lesson, I showed you some of the more advanced features of the YAML format. In this lesson, I’m going to dive deeper into serializing and deserializing YAML using the PyYAML library.
00:12 PyYAML provides multiple loaders, exposed as classes along with convenience functions that wrap those classes. Up until now, you’ve seen me use the safe_load() function, which is actually a shortcut for load(), passing in the SafeLoader class.
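As a minimal sketch of that equivalence (the sample document and variable names are made up for illustration), both calls below should produce the same Python object:

```python
import yaml

# Hypothetical sample document, not taken from the lesson.
document = """
name: Bart
age: 10
"""

# safe_load() is shorthand for load() with the SafeLoader class.
via_shortcut = yaml.safe_load(document)
via_class = yaml.load(document, Loader=yaml.SafeLoader)

print(via_shortcut)
print(via_shortcut == via_class)
```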
00:31 There are four loading classes, one of which is a base class that you shouldn’t use directly. This table compares the capabilities of each of the classes.
00:40 All four support anchors and aliases. All but the base class support tags like !!float. The full and unsafe loaders also support additional PyYAML-specific tags.
00:53 I’ll show you how to use these later in this lesson. Everything but the BaseLoader supports the data types discussed a couple of lessons back. The BaseLoader just implements a subset.
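To make the difference concrete, here is a small sketch comparing the BaseLoader with the SafeLoader on the same input. The document contents are invented for illustration. With the BaseLoader, scalars come back as plain strings rather than being converted to numbers or Booleans:

```python
import yaml

# Hypothetical sample document, not taken from the lesson.
document = """
count: 42
ratio: 3.14
enabled: true
"""

base = yaml.load(document, Loader=yaml.BaseLoader)
safe = yaml.load(document, Loader=yaml.SafeLoader)

print(base)  # every scalar stays a string
print(safe)  # scalars are converted to int, float, and bool
```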
01:04 The UnsafeLoader even supports custom types and can cause code to execute as a side effect of parsing the file. Yeah, take a moment and think about that.
01:14 You can probably guess what I’m about to say. I’ll save your ears and not put in the klaxon sort of sound effect here. Just pretend it’s going off in the background.
01:24 It’s called the UnsafeLoader for a reason. Any time a parser or a data spec supports this kind of stuff, somebody turns it into an attack vector. Be very, very careful with these features. All right, you’ve been warned. Now I’ll show you how to do those things you probably shouldn’t. It’ll be fun.
01:44 Earlier I introduced you to the idea of a tag denoted using double exclamation marks, or the more fun way of saying it: bang bang. You’re just a kiss-kiss short of a quirky action movie.
01:56 PyYAML has implemented several Python-specific tags for your convenience. Using !!python, you can directly invoke Python types, for example, turning a YAML sequence into a tuple instead of a list. And as everything in Python is an object, the !!python/object tag gives you lots of flexibility.
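A quick sketch of the tuple case, assuming the !!python/tuple tag and the unsafe_load() shortcut that comes up a little later in this lesson. The document is a hard-coded, trusted string, which is the only reason an unsafe loader is tolerable here:

```python
import yaml

# Hard-coded, trusted input. Never do this with a file you didn't write.
document = "!!python/tuple [1, 2, 3]"

result = yaml.unsafe_load(document)
print(type(result).__name__, result)

# The safe loader refuses Python-specific tags.
try:
    yaml.safe_load(document)
    safe_accepted = True
except yaml.YAMLError as error:
    safe_accepted = False
    print("safe_load() rejected it:", type(error).__name__)
```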
02:16 Flexibility so you can bend and twist in ways your body shouldn’t, but flexibility nonetheless. Let me show you the !!python/object tag with some code.
02:28 This is the Person data class inside of person.py. Nothing too fancy here, just a class for a person with a first name and last name.
02:37 I’m going to reference this using a PyYAML tag now.
02:43 And this is my YAML file. The !!python/object tag takes a reference to the class to use to create an object, including the module name. You can parametrize the object that gets created either with an inline hash specifying its attributes, or putting the attributes in the block below.
03:03 Remember that the Person class only defines first and last name. Note how in the Bart instance, I’m also populating an .age attribute.
03:11 Let’s parse this file. Because this feature isn’t supported by safe_load(), I can’t use my show_spud() function. So instead, I’ll use a loader directly. First, let me import the module, then open the file,
03:31 and now I’ll use the unsafe_load() shortcut.
03:39 This is the same as calling load(), passing in the UnsafeLoader class, but with less typing. Like safe_load(), unsafe_load() returns a Python object. Let me show it to you.
03:52 I’m too lazy to pretty-print it. You’ll have to dig through this with your eyeballs. Ned, Seymore, and Bart all got created. The !!python/object tag has instantiated the Person object and set the attributes.
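Here is a single-file sketch of that walkthrough. The field names first_name and last_name, the sample values, and the inline YAML string are my assumptions, since the real person.py and YAML file only appear on screen. To keep it self-contained, the sketch registers a stand-in module called person instead of relying on a person.py file on disk:

```python
import sys
import types
from dataclasses import dataclass

import yaml


@dataclass
class Person:
    # Assumed field names; the lesson only says "first name and last name".
    first_name: str
    last_name: str


# Stand-in for a real person.py: register a module named "person" so the
# tag below can resolve "person.Person" from this one file.
person_module = types.ModuleType("person")
person_module.Person = Person
sys.modules["person"] = person_module

# Attributes given as an inline hash (Ned) and as a block below the tag (Bart).
document = """
- !!python/object:person.Person {first_name: Ned, last_name: Flanders}
- !!python/object:person.Person
  first_name: Bart
  last_name: Simpson
  age: 10
"""

people = yaml.unsafe_load(document)
ned, bart = people
print(ned)
print(bart.first_name, bart.age)
```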
04:05 Let’s look more closely at the Bart object, particularly the extra attribute.
04:13 By default, Python objects allow you to set arbitrary attributes. The !!python/object tag isn’t actually calling Person’s .__init__().
04:22 It is just setting the attributes on the object that you give it. This means you can set whatever you want, but it also means if you mistype an attribute name, you’re not going to get an error. If you really want to, there are magic methods for changing the behavior of a Python object so it won’t do this. It involves the special attribute called .__slots__.
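The following sketch illustrates both points under a few assumptions of mine: a plain class with an .__init__() that records each call, a deliberately misspelled attribute name, and a stand-in person module registered in sys.modules so the example fits in one file:

```python
import sys
import types

import yaml

init_calls = []


class Person:
    def __init__(self, first_name, last_name):
        # Record every call so we can check whether the loader used it.
        init_calls.append((first_name, last_name))
        self.first_name = first_name
        self.last_name = last_name


# Hypothetical stand-in for a person.py module.
person_module = types.ModuleType("person")
person_module.Person = Person
sys.modules["person"] = person_module

# "frist_name" is a deliberate typo.
document = "!!python/object:person.Person {frist_name: Bart, last_name: Simpson}"

bart = yaml.unsafe_load(document)
print(init_calls)                   # stays empty: .__init__() never ran
print(vars(bart))                   # the typo became a real attribute
print(hasattr(bart, "first_name"))  # the intended attribute is missing
```

No error is raised anywhere, which is exactly why a typo in the YAML file can go unnoticed.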
04:42 I don’t recommend playing with that just to get around how YAML deserializes things, but if you Google “python class slots,” you can learn how that works.
04:50 PyYAML offers other object tags that actually invoke .__init__().
04:58 The top window has a new class. This one is a Car. I’m no longer using the data class here, to highlight the fact that I’m going to be actually calling the .__init__() method.
05:12 And here’s the corresponding YAML. Instead of using !!python/object, I’m using !!python/object/apply. The apply tag actually instantiates the object passing in the arguments to .__init__(). Like before, the arguments can be an inline sequence, a sequence in the YAML hash, or there’s a third way here using the args or kwds attributes.
05:37 Be careful with that last one. It isn’t kwargs, but kwds. All three of these invoke the initializer. Now to write the corresponding Python … opening the file …
05:56 using the unsafe_load() …
06:04 and there’s the object. dataclass objects like Person get a nice, readable .__repr__() method generated for them that plain classes don’t, so this is a little harder to read.
06:14 I’ll de-reference a couple of those objects directly.
06:21 There’s the Mustang’s make
06:25 and model, to show you that it worked. The apply tag is invoking the class as a callable, which is how you construct an object in Python, but that means it works for any callable.
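A rough reconstruction of the Car walkthrough might look like the following. The class body, the sample cars, and the stand-in car module are assumptions of mine, and the sketch shows only two of the argument styles: an inline sequence, and the args and kwds keys:

```python
import sys
import types

import yaml


class Car:
    def __init__(self, make, model):
        self.make = make
        self.model = model


# Hypothetical stand-in for a car.py module so "car.Car" resolves here.
car_module = types.ModuleType("car")
car_module.Car = Car
sys.modules["car"] = car_module

document = """
- !!python/object/apply:car.Car [Ford, Mustang]
- !!python/object/apply:car.Car
  args: [Dodge]
  kwds: {model: Charger}
"""

mustang, charger = yaml.unsafe_load(document)
print(mustang.make, mustang.model)
print(charger.make, charger.model)
```

Because make and model are only ever set inside .__init__(), seeing them on the loaded objects confirms that the initializer ran.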
06:41 In this new YAML file at the top here, I’m calling the print() function, passing in two string arguments. Let me open this file …
07:02 and I’ll close the block here … and note the printout. This isn’t me showing you the result object. It’s a side effect of the file being loaded. As the print statement says, this is arbitrary code execution. Remember that big yellow warning a few slides back?
07:20 Yeah, I probably should have put in the siren sound effect. This goes beyond being a foot-gun. It’s more of a foot-artillery. For example, Python’s subprocess module allows you to execute arbitrary shell programs. I can use subprocess to delete your hard drive.
07:39 This would be why I’d suggest avoiding unsafe_load() at all costs.
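To see the side effect in isolation, here is a harmless sketch. The two strings passed to print() are my own placeholders. The output is captured so you can tell it apart from the returned object, and the same document is then handed to safe_load() to show that the safe loader refuses it:

```python
import contextlib
import io

import yaml

# Placeholder arguments; any callable and arguments could appear here.
document = '!!python/object/apply:print ["This is", "arbitrary code execution"]'

captured = io.StringIO()
with contextlib.redirect_stdout(captured):
    result = yaml.unsafe_load(document)

print(repr(captured.getvalue()))  # text printed while the document loaded
print(result)                     # print() returns None

try:
    yaml.safe_load(document)
    rejected = False
except yaml.YAMLError:
    rejected = True
print("safe_load() rejected the document:", rejected)
```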
07:48 There are a few more PyYAML tags in case apply just wasn’t enough fun for you. !!python/object/new calls the .__new__() method without calling .__init__().
07:59 I can count on one hand after a severe construction accident how many times I’ve used .__new__() on its own in my fifteen-plus years of coding Python. I’m not sure why I’d want to do this in YAML, but you can.
08:11 !!python/name gives you a reference to an object in the Python space. This means loading the YAML file allows you to get at values in Python’s scope.
08:21 And !!python/module gives you a reference to a module. These could be fun to share with your friends, especially if they’re friends you don’t like all that much.
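A small sketch of these two tags, using the standard library’s math module as a harmless target. The keys in the mapping are arbitrary names I picked. Each tag is followed by an empty string because these tags carry no value of their own:

```python
import math

import yaml

document = """
pi: !!python/name:math.pi ''
sqrt: !!python/name:math.sqrt ''
module: !!python/module:math ''
"""

data = yaml.unsafe_load(document)
print(data["pi"])              # the value of math.pi
print(data["sqrt"](16))        # a live reference to math.sqrt()
print(data["module"] is math)  # the module object itself
```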
08:32 So far, you’ve seen me call the various load methods using a file handle. The same methods are also able to handle strings directly or stream objects. This allows you to parse hard-coded YAML in your code or manipulate content before having it parsed. I’ve kind of glossed over something important.
08:49 That’s how all this text is encoded. YAML 1.1 supports UTF-8 and UTF-16. Remember, YAML 1.1 is what PyYAML implements. YAML 1.2 additionally supports UTF-32.
09:06 The YAML 1.2 spec is a superset of JSON, meaning if you have a valid JSON doc, it can be parsed as YAML. Since JSON supports UTF-32, they did this for compatibility reasons.
09:19 UTF-32 is not very common, so you’re not likely to run into it. All this encoding stuff can be problematic. Python wants UTF-8. You can transcode, though: in the case of strings, through the str.encode() and bytes.decode() methods, or you can deal with encoding directly when you’re opening files.
09:39 You may have noticed I’ve been opening the files using the "rb" mode, meaning read-binary. PyYAML can read the different UTFs, and so by opening the files in binary, the library takes care of the conversion.
09:52 If I try to open a UTF-16 file as UTF-8 text, problems will occur. If you know your YAML file is UTF-8, you can use "rt" mode or the encoding parameter to the open() function, or you can just do what I’ve done, which is easier, and use binary all the time.
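The sketch below pulls those loading options together: a plain string, a text stream, and UTF-16 bytes. An in-memory BytesIO object stands in for a file opened in "rb" mode, and the sample text is invented for illustration:

```python
import io

import yaml

text = "greeting: héllo\n"

from_string = yaml.safe_load(text)
from_stream = yaml.safe_load(io.StringIO(text))

# encode("utf-16") prepends a byte-order mark, which lets PyYAML
# detect the encoding when it is handed raw bytes.
utf16_bytes = text.encode("utf-16")
from_utf16 = yaml.safe_load(io.BytesIO(utf16_bytes))

print(from_string == from_stream == from_utf16)

# Treating the same bytes as UTF-8 text is where the problems start.
try:
    utf16_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError as error:
    decoded_as_utf8 = False
    print("UTF-8 decoding failed:", error.reason)
```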
10:13 Like the parsers, PyYAML has serializer classes. There are only three of them, though: base, safe, and dumper. There are also safe_dump() and dump() shortcuts, as you might expect.
10:25 Like with parsing, you can dump to a string, file, or a stream. And also like with parsing, there are safe_dump_all() and dump_all() functions for multi-doc formats.
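As a minimal sketch of the multi-document case, with made-up data, safe_dump_all() takes an iterable of documents and safe_load_all() reads them back one at a time:

```python
import yaml

# Hypothetical documents, not taken from the lesson.
documents = [{"name": "Ned"}, {"name": "Bart"}]

text = yaml.safe_dump_all(documents)
print(text)

# safe_load_all() returns a generator, so wrap it in list() to compare.
round_trip = list(yaml.safe_load_all(text))
print(round_trip == documents)
```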
10:35 Time to serialize some YAML. Just import the library … declare a dict in Python …
10:52 As I didn’t give safe_dump() a file or stream handle, it assumes I want a string and returns that. Let me try the same thing with a stream. I’ll initialize a string stream …
11:08 call dump() using the stream …
11:13 and now I can ask the stream for its contents. You end up in the same place. Now let’s try a file. Opening the file …
11:27 I’m using write-text ("wt") mode here … calling dump(), passing in the file handle, and now in the top window … you can see the resulting YAML file. Let me do that again, this time using UTF-16. I’ll open a new file …
11:52 Note I used binary this time. And then I dump the file, specifying the encoding to be UTF-16. Now I’m going to have to leave the REPL because my display tool here only works with UTF-8, so I can’t output the resulting file in the top window.
12:10 Cat the file … and you can take my word that those funky little question marks are the right binary values for UTF-16.
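Here is a sketch of the same four destinations outside the REPL: a string, a string stream, a text file, and a binary file. The dictionary contents and file names are my own, I use safe_dump() throughout rather than dump(), and the files go in a temporary directory. For the binary file I pass the explicit "utf-16-le" codec name, which is my choice for making the byte order unambiguous rather than something shown in the lesson:

```python
import io
import tempfile
from pathlib import Path

import yaml

data = {"name": "Bart", "age": 10}

# No stream given: the YAML comes back as a string.
as_string = yaml.safe_dump(data)

# Stream given: the call returns None and the text lands in the stream.
stream = io.StringIO()
yaml.safe_dump(data, stream)
from_stream = stream.getvalue()

with tempfile.TemporaryDirectory() as folder:
    # Text mode: the file object handles the encoding.
    text_path = Path(folder) / "output.yaml"
    with open(text_path, mode="wt", encoding="utf-8") as handle:
        yaml.safe_dump(data, handle)
    from_file = text_path.read_text(encoding="utf-8")

    # Binary mode: PyYAML encodes the output itself.
    utf16_path = Path(folder) / "output_utf16.yaml"
    with open(utf16_path, mode="wb") as handle:
        yaml.safe_dump(data, handle, encoding="utf-16-le")
    reloaded = yaml.safe_load(utf16_path.read_bytes())

print(as_string == from_stream == from_file)
print(reloaded == data)
```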
12:24 The serialization functions take some optional flags, changing how Unicode escaping works and whether to use the canonical style. Look, I’d love to tell you exactly what that is, but the documentation is rather vague, and experimenting with it didn’t fully enlighten me. I know what the word means, but I’m not clear what the PyYAML people think it means from a format standpoint.
12:46 Inhale. What was I saying? Right. You can also change what flow style is used based on the input and what multi-doc markers are used. In addition to the Boolean flags, there are some more things you can do to modify the output.
13:02 indent and width change how much indentation and how wide a line is. You can change how text is quoted and encoded and what line-break characters to use.
13:12 This can be important, as different operating systems use different characters for newline. There’s a parameter to specify what YAML version to use, which I think is there for future-proofing. And a mechanism for defining your own shortcut tags. This allows you to create aliases of tags, so instead of writing !!python/object/apply, I could write !!footgun.
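A short sketch of a few of these options in action. I’m assuming the keyword names allow_unicode, default_flow_style, explicit_start, and indent, which are how I understand PyYAML to spell the flags described above, and the data is invented:

```python
import yaml

data = {"drink": "café", "sizes": [1, 2, 3]}

default_output = yaml.safe_dump(data)

tuned_output = yaml.safe_dump(
    data,
    allow_unicode=True,        # write é as-is instead of escaping it
    default_flow_style=False,  # always use block style for collections
    explicit_start=True,       # emit the leading --- document marker
    indent=4,
)

print(default_output)
print(tuned_output)
```

Both strings load back to the same dictionary. Only the presentation differs.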
13:36 If all of that’s not enough choice for what you’re doing, there is also a low-level API. You can parse files, getting tokens or events back, similar to how a lot of XML parsers work. And for serialization, you can dynamically compose YAML using objects. If you’ve ever worked with Beautiful Soup, it’s a bit like that. If you’re interested in this deeper stuff, the Real Python article that this course is based on has a section that does a deep dive. Next up: a bit about other YAML tools, some more about the consequences of the type casting ambiguity, and I’ll bore you with my opinion, as if I hadn’t been doing that up until now.
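For a taste of the low-level API mentioned above, this sketch uses the scan() and parse() functions to list the tokens and events for a one-line document. The document is my own example, and I only print the class names to keep the output short:

```python
import yaml

document = "name: Bart"

# scan() yields low-level tokens; parse() yields higher-level events.
token_names = [type(token).__name__ for token in yaml.scan(document)]
event_names = [type(event).__name__ for event in yaml.parse(document)]

print(token_names)
print(event_names)
```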