Data Structures in YAML
00:00 In the previous lesson, I showed you what a YAML document looks like and how to read it into your program using PyYAML. In this chapter, I’ll go into more detail about YAML docs and the data structures found within them.
The example I parsed in the previous lesson is in the block on top here. YAML is rather flexible when it comes to structure. It also allows inlining. As the structure maps to hashes (known as dicts in Python), it kind of makes sense that you can inline the structure using curly brackets ({}).
00:30 The block on the bottom here is equivalent to the one on top, just using fewer lines. It even uses the same brace brackets, or curly brackets, as Python, which is convenient for remembering what they mean.
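Here’s a minimal sketch of that equivalence, assuming PyYAML is installed. The person, name, and age keys are made up for illustration and aren’t the ones from the slide.

```python
import yaml  # PyYAML, assumed to be installed

# Block style: nesting is expressed with indentation.
block_doc = """
person:
  name: Alice
  age: 30
"""

# Inline (flow) style: the same mapping written with curly brackets.
inline_doc = "person: {name: Alice, age: 30}"

block_data = yaml.safe_load(block_doc)
inline_data = yaml.safe_load(inline_doc)

print(block_data)                 # {'person': {'name': 'Alice', 'age': 30}}
print(block_data == inline_data)  # True
```

Both documents load into the same nested dict, so the choice between them is about readability rather than meaning.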
00:46 Strings in YAML are, well, to be blunt, confusing. I get what they were attempting, trying to keep it simple and not require quotes, but the result has a bunch of edge cases that are hard to remember.
Strings can be unquoted, quoted using single quotes, or quoted using double quotes, but each of these behaves subtly differently. An unquoted string is considered a literal. If you put a \n in it, it will be escaped in Python. You won’t get a newline, you’ll get \\n, a literal backslash followed by the letter n. A single-quoted string is almost literal. It also escapes things like \n, but of course you can’t put a bare single quote or apostrophe inside of it.
01:29 Rather than using the backslash as an escape character as every language on the planet since C has, YAML decided to be different and use two single quotes to indicate a quote. Yeah, you heard me right.
If there are two of them in a row, that isn’t the end of the string, but the single-quote character. This decision baffles me. Double-quoted strings are more C-like, or Python-like, if you want to talk that way. Inside double quotes, a single quote is just a single quote. You don’t have to double them up, and a \n actually means newline.
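To make the three quoting styles concrete, here’s a small sketch, assuming PyYAML is installed. The keys plain, single, and double and the sample text are made up for illustration.

```python
import yaml  # PyYAML, assumed to be installed

# A raw Python string, so each \n below reaches the YAML parser
# as two characters: a backslash and the letter n.
doc = r"""
plain: Don't panic\n
single: 'Don''t panic\n'
double: "Don't panic\n"
"""

data = yaml.safe_load(doc)

print(repr(data["plain"]))   # "Don't panic\\n" -> backslash kept literally
print(repr(data["single"]))  # "Don't panic\\n" -> '' collapsed to one quote
print(repr(data["double"]))  # "Don't panic\n"  -> a real newline
```

Only the double-quoted value ends in an actual newline character. The other two keep the backslash and the n as separate characters.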
02:06 All these choices for strings get complicated. If you’re coming from almost any programming language, it’s going to seem messy. In a moment, I’ll show you some examples, but this is one of those areas where YAML makes me a bit uncomfortable.
02:18 Its decision to allow unquoted strings has some deep ramifications on data-type interpretation, which I’ll get to more in a bit.
There are also some special keyword values in YAML, like null. Because these have special meanings and because YAML normally allows unquoted strings, you get weirdness here. True on its own is not a string. The phrase True that, on the other hand, is. Like I said, ramifications. Let’s go to the REPL and see this mess in some code.
02:54 In the top window, I have a YAML document with five lines and five different data cases. The first is an unquoted string. The second uses single quotes. The third uses double quotes. The special key has the Boolean True as its value, and the morespecial key has this uncool cat’s fully written version of a slang phrase that starts with the word True. I’ll use show_spud() to see what happens with this doc.
For the unquoted string, the single quote is a single quote, and the \n shows up as \\n. In Python, the backslash is escaped rather than becoming part of a newline. For the single-quoted string, you see the use of the single quote to escape itself. There are two in the YAML, but only one in the Python, and like the unquoted version, the backslash in \n is escaped.
03:54 The double-quoted string is the most “normal”. Did you hear the air quotes in my voice? Let’s not speak of air quotes. Somebody on the YAML standards committee might get inspired. \n in this case is a newline, and two single quotes in a row are what they’re supposed to be: two single quotes. So, regular double quotes are the way to go.
As I mentioned in the slide, true is a keyword. Although it looks like an unquoted string, it isn’t. It becomes a Boolean, but if you stick something after it, it becomes a string again.
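A quick sketch of that keyword behavior, assuming PyYAML is installed. The special and morespecial keys mirror the ones mentioned above, while the quoted key is an extra I’ve added for contrast.

```python
import yaml  # PyYAML, assumed to be installed

doc = """
special: True
morespecial: True that
quoted: "True"
"""

data = yaml.safe_load(doc)

print(type(data["special"]).__name__, data["special"])          # bool True
print(type(data["morespecial"]).__name__, data["morespecial"])  # str True that
print(type(data["quoted"]).__name__, data["quoted"])            # str True
```

Quoting the value is the simplest way to keep a keyword-looking word as a string.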
04:37 There’s those air quotes again.
Let’s go through all the data types that YAML supports. The null keyword means empty. You denote it with the word null, a tilde (~), or just by leaving something empty.
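As a rough illustration of those spellings, assuming PyYAML is installed and using made-up key names:

```python
import yaml  # PyYAML, assumed to be installed

doc = """
word: null
upper: NULL
tilde: ~
empty:
"""

data = yaml.safe_load(doc)

# Every spelling loads as Python's None.
print(data)  # {'word': None, 'upper': None, 'tilde': None, 'empty': None}
```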
true and false are acceptable for Boolean in YAML 1.2. In YAML 1.1 (remember, that’s the one PyYAML uses) you also have yes, no, on, and off. I get why these are here.
05:07 They’re meant to make the file more readable. But remember when I spoke of ramifications? Well, those got removed from the spec for just that reason. I’ll dive deeper into this later.
YAML supports integers in decimal, binary, hex, and octal. YAML 1.2 uses the 0o notation for octal numbers, while YAML 1.1 uses a leading zero. In addition to integers, you can also get floats, including markers for infinity and Not a Number.
05:36 I worked with a programmer once whose name was Nan. Man, did he hate floating-point jokes. I already showed you the weird, wonderful, wooly world of YAML strings. And finally, there are dates.
05:48 Dates can also be a little tricky. The year-month-day format is handled nicely, but adding the time can be problematic on occasion. YAML handles a couple more variations on date and timestamps than I have here.
If you’re doing a lot of timestamp work, you’ll want to look the details up. Note that null, true, and false are all keywords in YAML, and all of these can be written in lowercase, uppercase, or capitalized form. YAML isn’t picky.
06:14 Let’s go play with a couple of these data types in the REPL.
06:20 In the top window here, I have a larger YAML file. Let me load it into Python, then I’ll go over it a few data types at a time. Import …
and let me just scroll up. The first keys here are variations on null. Note the two different letter cases for the word, as well as the use of the tilde.
The next three are Booleans. Because PyYAML is YAML 1.1-based, the keyword yes is a Boolean. Like with null, different letter cases are supported.
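Here’s a small sketch of the YAML 1.1 Boolean keywords as PyYAML handles them, assuming PyYAML is installed. The key names are made up for illustration.

```python
import yaml  # PyYAML, assumed to be installed

doc = """
lower: true
title: Yes
switch: off
quoted: "yes"
"""

data = yaml.safe_load(doc)

print(data["lower"])   # True
print(data["title"])   # True  (YAML 1.1 keyword, so a Boolean in PyYAML)
print(data["switch"])  # False (off is also a YAML 1.1 Boolean)
print(data["quoted"])  # yes   (quoting keeps it a string)
```

A YAML 1.2 parser would be expected to treat Yes and off as plain strings, which is one reason to know which version your parser follows.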
06:58 Let me scroll down here a bit, and I’ll talk about some numbers.
10 is decimal ten. 0b10 is binary, giving you 2 in decimal. 0x10 is hex, giving you 16 decimal, and 010 in YAML 1.1 is octal, giving you decimal 8. 0o10 is YAML 1.2, so PyYAML sees this as a string.
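The same five literals can be checked with a short sketch, assuming PyYAML is installed. The key names are made up for illustration.

```python
import yaml  # PyYAML, assumed to be installed

doc = """
decimal: 10
binary: 0b10
hexadecimal: 0x10
octal_yaml_11: 010
octal_yaml_12: 0o10
"""

data = yaml.safe_load(doc)

print(data["decimal"])              # 10
print(data["binary"])               # 2
print(data["hexadecimal"])          # 16
print(data["octal_yaml_11"])        # 8
print(repr(data["octal_yaml_12"]))  # '0o10' (a string under PyYAML)
```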
You need to be very aware of what version your parser is using and make sure your file is using the same thing. Okay, onto some floats, using both numbers with decimal points and exponents, as well as infinity and good old nan. I wonder how he’s doing.
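A minimal sketch of the float forms, assuming PyYAML is installed. The key names and values are illustrative, and I’ve written the exponent with an explicit sign because that’s the form PyYAML’s YAML 1.1 float pattern expects.

```python
import math

import yaml  # PyYAML, assumed to be installed

doc = """
plain: 3.14
exponent: 1.5e+3
infinity: .inf
negative_infinity: -.inf
not_a_number: .nan
"""

data = yaml.safe_load(doc)

print(data["plain"])                     # 3.14
print(data["exponent"])                  # 1500.0
print(data["infinity"])                  # inf
print(data["negative_infinity"])         # -inf
print(math.isnan(data["not_a_number"]))  # True
```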
Scrolling down a little more … I must have been in a morbid mood when I wrote this example. trinity is the first test of the atomic bomb. Notice the subtle difference between the two timestamps. The first has seconds specified in the timestamp, and the second does not. The first becomes a Python datetime object, while the second becomes a string.
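Here’s a sketch of that difference, assuming PyYAML is installed. The key names and the exact values are my own stand-ins rather than the ones in the file on screen.

```python
import datetime

import yaml  # PyYAML, assumed to be installed

doc = """
day_only: 1945-07-16
with_seconds: 1945-07-16 05:29:45
without_seconds: 1945-07-16 05:29
"""

data = yaml.safe_load(doc)

print(repr(data["day_only"]))         # datetime.date(1945, 7, 16)
print(repr(data["with_seconds"]))     # datetime.datetime(1945, 7, 16, 5, 29, 45)
print(repr(data["without_seconds"]))  # '1945-07-16 05:29'
```

Dropping the seconds is enough to stop PyYAML from recognizing the value as a timestamp, so it quietly falls back to a string.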
When I showed you this sample YAML document in the previous lesson, I mentioned that YAML supports sequences, also known as arrays. There are two different ways of writing these, either using the Python-friendly square brackets ([]) inline or by using dashes (-), kind of like a bullet list in a document.
08:25 Both of these result in the same situation. Note that you can either put leading spaces or not in front of those dashes. The YAML documents I normally use tend to put the spaces here, and I think it’s clearer, as that list does belong to the hash being created by the key, but it does work without them.
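The three spellings can be compared with a short sketch, assuming PyYAML is installed. The colors key and its items are made up for illustration.

```python
import yaml  # PyYAML, assumed to be installed

inline_doc = "colors: [red, green, blue]"

indented_doc = """
colors:
  - red
  - green
  - blue
"""

# Same list, but the dashes line up with the key instead of being indented.
unindented_doc = """
colors:
- red
- green
- blue
"""

inline_data = yaml.safe_load(inline_doc)
indented_data = yaml.safe_load(indented_doc)
unindented_data = yaml.safe_load(unindented_doc)

print(inline_data)                                       # {'colors': ['red', 'green', 'blue']}
print(inline_data == indented_data == unindented_data)   # True
```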
08:45 And speaking of hashes, I’ve kind of touched on most of this when introducing the basic YAML structure, but just to be a completionist, YAML supports dictionaries.
You’ve seen the nesting and the inline feature, but there’s one more variation. You can have an anonymous hash inside of a list segment. The children dict here has a list of dicts. The list has two anonymous dicts in it, each with key-value pairs such as dateOfBirth inside.
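A sketch of a list of anonymous dicts, assuming PyYAML is installed. The dateOfBirth key comes from the example above, while the name key and all the values are hypothetical.

```python
import datetime

import yaml  # PyYAML, assumed to be installed

doc = """
children:
  - name: Jimmy
    dateOfBirth: 2001-01-01
  - name: Jenny
    dateOfBirth: 2004-02-03
"""

data = yaml.safe_load(doc)

# Each dash starts a new dict; the indented keys under it belong to that dict.
for child in data["children"]:
    print(child["name"], child["dateOfBirth"].year)
```

Each list item is a dict with no name of its own, which is why it’s described as anonymous.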
09:16 YAML is a text-based format with automatic casting of content which is technically text into the supported data types. As I’ve pointed out, this can cause some weirdnesses.
YAML 1.1 has the additional bit of fun of supporting base-60 values. The original committee must have had some ancient Mayan members. 2012 forever! Anyhow, base-60 is denoted using a colon (:), which can create some surprises. 22:22 is base-60, turning into 1342 in Python. Putting a leading zero on it, like I’ve done here, which to me looks like the way military time writes 24-hour time, makes it a string. Without the leading zero, it’s base-60. Without the leading zero but using hours, minutes, and seconds, it looks like a timestamp, but in Python it turns into (wait for it) not a datetime object, but an integer counting the number of seconds since midnight.
10:18 And finally, take the same thing and put a leading zero on it, and you’re back to a string. You having fun yet?
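The arithmetic behind those numbers can be sketched in plain Python without PyYAML. sexagesimal_to_int is a hypothetical helper written for this illustration, not something PyYAML provides.

```python
def sexagesimal_to_int(text: str) -> int:
    """Read colon-separated digit groups as base-60, most significant first."""
    total = 0
    for part in text.split(":"):
        total = total * 60 + int(part)
    return total


print(sexagesimal_to_int("22:22"))    # 22 * 60 + 22 = 1342
print(sexagesimal_to_int("1:22:22"))  # 1 * 3600 + 22 * 60 + 22 = 4942
```

With three groups, the result is the number of seconds since midnight, which is where that surprising integer comes from.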
You can get around some of this by using YAML tags, which can be used to specify what a chunk of text should be interpreted as. A tag is denoted by !!.
10:39 That’s exclamation mark, exclamation mark for those of you who haven’t spent time playing with mainframes. Yes, I am that old. Some tags are built into YAML, and others are parser-specific.
I’ll talk about some of the PyYAML-specific ones in a later lesson. Let’s look at a couple of YAML-specific tags.
!!float forces the number to be a float. Even though I didn’t put the zero here, Python will see it as a float. !!str forces a string. If I want 22:22 to be a string rather than base-60, this is how I do it. There’s even !!binary.
11:17 This takes base-64 encoded text, and in Python, that becomes a binary value. Yep, you can use YAML to write out GIF data.
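Here’s a minimal sketch of those three tags, assuming PyYAML is installed. The key names and values are made up for illustration, and aGVsbG8= is simply the base-64 encoding of the text hello.

```python
import yaml  # PyYAML, assumed to be installed

doc = """
forced_float: !!float 3
forced_str: !!str 22:22
blob: !!binary aGVsbG8=
"""

data = yaml.safe_load(doc)

print(repr(data["forced_float"]))  # 3.0
print(repr(data["forced_str"]))    # '22:22' (not the base-60 integer 1342)
print(repr(data["blob"]))          # b'hello'
```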
11:26 The implementation of tags is a little creaky. In writing this course, I played with some of the YAML 1.1 tags and found some that I couldn’t get to work in PyYAML.
I’m not sure if I was doing something wrong with !!timestamp, but I couldn’t get a timestamp out of any of the examples that I previously showed you.
11:43 I don’t know whether this is the spec or PyYAML’s implementation or me just not doing it right, but it was less than fun.
11:53 All right, if that last little bit didn’t scare you off, the next little bit might. In the next lesson, I’ll show you why I was cautious when I said that YAML was readable.