In this lesson, you’ll learn about a crucial feature of UTF-8 called variable-length encoding. The UTF-8 encoding of Unicode doesn’t just store the code point. You’ll write a hex-based code point viewer to help visualize this. Using this function, you’ll try different characters and see the 1-, 2-, 3- and 4-byte encodings that UTF-8 uses.
In the previous lesson, I showed you how .encode() and .decode() work in Python to move from strings to bytes, and back. In this lesson, I’m going to drill down on UTF-8 and how it actually stores the content.
Remember that Unicode specifies the code point, whereas UTF-8 is an encoding storing those values. Python has two escape sequences you can use to get at the Unicode code points: "\u" and capital "\U". "\u" is used for 4-digit hexadecimal code points, capital "\U" is for 8-digit hexadecimal code points.
The purpose of this lesson is to fulfill your curiosity about UTF-8. Generally, you don’t need to understand the inner workings of this to be able to successfully use UTF-8 and Unicode in Python.
02:34 Encoding that turns it into 4 bytes of UTF-8. You’ve gone from single letters in ASCII that are stored in a single byte, upper-level extended ASCII characters that are stored in 2 bytes, higher-level characters in 3, and then things like the snake symbol way up at the top of the table, requiring a full 4 bytes of UTF-8.
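To see the 1-, 2-, 3- and 4-byte progression for yourself, here is a minimal sketch of a hex-based viewer. The name show_utf8 and the sample characters are assumptions made for illustration (the euro sign is simply a convenient 3-byte example), and the lesson's own viewer may look different. The samples are written with the "\u" and "\U" escapes described earlier.

```python
def show_utf8(char):
    """Return (code point, UTF-8 bytes in hex, byte count) for one character.

    Hypothetical helper for illustration, not the lesson's own function.
    """
    code_point = f"U+{ord(char):04X}"                  # ord() gives the code point
    encoded = char.encode("utf-8")                     # the bytes UTF-8 stores
    hex_bytes = " ".join(f"{b:02X}" for b in encoded)  # one hex pair per byte
    return code_point, hex_bytes, len(encoded)


# "\u" takes 4 hex digits; "\U" takes 8, which the snake needs.
samples = ["a", "\u00e9", "\u20ac", "\U0001F40D"]

for ch in samples:
    code_point, hex_bytes, length = show_utf8(ch)
    print(f"{ch}  {code_point:<8} {hex_bytes:<12} {length} byte(s)")
```

The plain letter takes a single byte, é comes out as C3 A9, the euro sign needs 3 bytes, and the snake needs all 4.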
The remaining bits are the actual encoding. This corresponds perfectly to the 7-bit ASCII. For 2-byte encodings, the leading bits are 110. The remaining chunk, then, is part of the encoding. Back to our example, C3 starts with 110, so you can see by looking at the first byte that this is going to be a 2-byte encoding. A 3-byte encoding starts with 1110.
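The idea of reading the length off the first byte can be sketched in a few lines. The function name sequence_length is an assumption for illustration. The 0 and 110 markers come from the discussion above, while the 1110 and 11110 markers for 3 and 4 bytes come from the UTF-8 definition itself. This only inspects the bit pattern and does not fully validate the byte.

```python
def sequence_length(first_byte):
    """Infer how many bytes a UTF-8 sequence has from its first byte.

    Hypothetical helper: checks the leading bits only, not full validity.
    """
    if first_byte >> 7 == 0b0:        # 0xxxxxxx: plain 7-bit ASCII
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx: 2-byte encoding
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx: 3-byte encoding
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx: 4-byte encoding
        return 4
    # Anything else (for example 10xxxxxx) cannot start a sequence.
    raise ValueError("not a UTF-8 leading byte")


print(format(0xC3, "08b"), sequence_length(0xC3))
```

C3 prints as 11000011, and the leading 110 marks it as the start of a 2-byte encoding.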
Remember é? That’s code point E9. E9 is greater than 7F, and 7F is 127 in decimal. This means it’s going to have to be a multi-byte encoding, so it won’t start with 0, like an ASCII one. To start to break this down, let’s look at E9 as a binary number.
Using our digits of hex trick from before, translate the E into 1110 and the 9 into 1001. Because the number in the code point is bigger than 127, you know that it’s going to be a multi-byte encoding.
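If you want to check the hex-to-binary step, a short sketch like this one does it digit by digit. The variable names are illustrative only.

```python
code_point = 0xE9

# Each hex digit maps to exactly 4 bits.
for digit in "E9":
    print(digit, format(int(digit, 16), "04b"))

# The two 4-bit groups side by side give the full 8-bit value.
print(format(code_point, "08b"))

# Anything above 7F (127) cannot fit in a single-byte encoding.
print(code_point > 0x7F)
```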
05:34 Next, take the next chunk of bits. Well, there’s only 2 bits left and because all the bits have been used up, you know you’re done, which means it’s going to be 2 bytes, so use the 2-byte marker.
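The bit shuffling for code point E9 can be sketched by hand and compared with what Python produces. This is an illustration under the assumption that the second byte carries the last 6 bits behind the 10 continuation marker, as the UTF-8 definition specifies. It handles only the 2-byte case.

```python
code_point = 0xE9                       # 1110 1001 in binary

low_six = code_point & 0b111111         # the last 6 bits: 101001
remaining = code_point >> 6             # the 2 bits left over: 11

# Second byte: continuation marker 10 followed by the 6 low bits.
second_byte = 0b10000000 | low_six
# First byte: 2-byte marker 110 followed by the leftover bits, zero-padded.
first_byte = 0b11000000 | remaining

manual = bytes([first_byte, second_byte])
print(" ".join(f"{b:02X}" for b in manual))
print(manual == "\u00e9".encode("utf-8"))
```

The hand-built bytes are C3 A9, the same pair that .encode() gives for é.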