Encoding UTF-8
In this lesson, you’ll learn about a crucial feature of UTF-8 called variable-length encoding. The UTF-8 encoding of Unicode doesn’t just store the code point. You’ll write a hex-based code point viewer to help visualize this. Using this function, you’ll try different characters and see the 1-, 2-, 3- and 4-byte encodings that UTF-8 uses.
00:00
In the previous lesson, I showed you how .encode() and .decode() work in Python to move from strings to bytes, and back. In this lesson, I’m going to drill down on UTF-8 and how it actually stores the content.
00:14
Remember that Unicode specifies the code point, whereas UTF-8 is an encoding that stores those values. Python has two escape sequences you can use to get at the Unicode code points: "\u" and capital "\U".
00:27
Small "\u" is used for 4-digit hexadecimal code points, and capital "\U" is for 8-digit hexadecimal code points. The purpose of this lesson is to satisfy your curiosity about UTF-8. Generally, you don’t need to understand the inner workings of this to be able to successfully use UTF-8 and Unicode in Python.
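Here’s a small sketch of the two escapes side by side. The snake code point used here, U+1F40D, is an assumption on my part about which snake symbol is meant; any code point above FFFF would need the capital "\U" form.

```python
# "\u" takes exactly 4 hex digits, "\U" takes exactly 8.
e_acute = "\u00e9"       # é, code point E9
snake = "\U0001F40D"     # assumed snake emoji, code point 1F40D

# ord() gives the code point back as an integer.
print(e_acute, hex(ord(e_acute)))
print(snake, hex(ord(snake)))
```

The 4-digit form can’t reach the snake, because its code point needs 5 hex digits.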
00:48
Now that you’re familiar with hex, I’ve rewritten the function that shows the code points, this time showing them in hex. I’ve put this inside of a file called points.py.
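The transcript doesn’t include the contents of points.py, so here is a minimal sketch of what such a viewer could look like. The function name hex_points is hypothetical, not the one from the lesson.

```python
def hex_points(text):
    """Return the Unicode code point of each character as a hex string."""
    # ord() maps a character to its code point; hex() formats it as 0x...
    return [hex(ord(ch)) for ch in text]


print(hex_points("café"))  # ['0x63', '0x61', '0x66', '0xe9']
```

Note that this only looks at code points. Nothing has been encoded to bytes yet.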
01:01
I can import this function, and then write a string… and look at its code points. The 'c' in 'café' is hex 63, the 'a' is hex 61, the 'f' is 66, and 'é' (e with an acute accent) is e9.
01:20 Notice that these are the code points, not how UTF-8 stores them.
01:27
You can use "\u" to get those letters back out. So, I can replace place with 'caf\u00E9' and get back the exact same string.
01:49 Letter by letter, same thing.
01:55
Now, when you encode this, notice that the code point E9 turns into 0xc3 0xa9 (c3 a9), which is 2 bytes.
02:07 The double dagger symbol is a much larger code point number in Unicode.
02:15 Encoding it turns it into 3 bytes worth of information.
02:23 The snake is close to the upper end of the table—
02:28 you need a full 8 digits to describe the code point.
02:34 Encoding that turns it into 4 bytes of UTF-8. You’ve gone from single letters in ASCII that are stored in a single byte, to accented characters like 'é' just above the ASCII range that are stored in 2 bytes, to higher code points like the double dagger in 3, and then to things like the snake symbol way up at the top of the table, requiring a full 4 bytes of UTF-8.
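To see all four lengths at once, you could run something like the sketch below. The helper name utf8_hex is made up for illustration, and the snake is assumed to be U+1F40D.

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of ch as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


samples = ["c", "\u00e9", "\u2021", "\U0001F40D"]  # c, é, double dagger, snake
for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)} ({len(encoded)} bytes)")
```

The same code point viewer idea, applied to bytes instead of code points, shows the encoding growing from 1 byte to 4.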
02:58 So, I think I’ve established that UTF-8 is an encoding and not just the Unicode code point number. It’s variable-length and can be 1, 2, 3, or 4 bytes long.
03:08 How does the system understand what a character is comprised of? How does it know how many bytes are in this UTF-8 character? Well, the secret is in the encoding.
03:19
The first few bits of the first byte of each encoded character indicate how long the sequence is. If it’s 1 byte, the leading bit is 0.
03:30
The remaining bits are the actual encoding. This corresponds perfectly to 7-bit ASCII. For 2-byte encodings, the leading bits are 110. The remaining chunk, then, is part of the encoding. Back to our 'é' from 'café': C3 starts with 110, so you can see by looking at the first byte that this is going to be a 2-byte encoding. 3-byte encodings start with 1110.
03:58
Similarly for the double dagger: its first encoded byte is E2, and E2 starts with 1110. And finally, for 4-byte encodings, the first byte starts with four 1’s and a 0. The pattern holds.
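The leading-bit pattern can be checked with a few bit shifts. This is only a sketch, and sequence_length is a name I’ve made up for it; it looks at the prefix of a single byte and doesn’t validate the rest of the sequence.

```python
def sequence_length(first_byte):
    """Return how many bytes a UTF-8 character occupies, from its first byte."""
    if first_byte >> 7 == 0b0:        # 0xxxxxxx: 1 byte, plain ASCII
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx: 2 bytes
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx: 3 bytes
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx: 4 bytes
        return 4
    raise ValueError(f"{first_byte:#04x} is not a leading byte")


for first in (0x63, 0xC3, 0xE2, 0xF0):
    print(f"{first:02x} = {first:08b} -> {sequence_length(first)} byte(s)")
```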
04:12
So, what about the rest of the bytes? Well, if you’re in a 2-, 3- or 4-byte encoding, the 2nd, 3rd, and 4th bytes all start with 10. This is important.
04:23
This feature is called self-synchronization. This means you can look at any byte in UTF-8 and know whether it’s a leading byte or a subsequent byte. No leading byte starts with 10.
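As a rough sketch of that check, the code below tests the top two bits of a byte and skips forward past any subsequent bytes. The names is_continuation and next_char_start are hypothetical helpers, not part of Python or the lesson.

```python
def is_continuation(byte):
    """True if the byte has the 10xxxxxx prefix of a subsequent byte."""
    return (byte & 0b1100_0000) == 0b1000_0000


def next_char_start(data, pos):
    """Move pos forward until it points at a leading byte (or the end)."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


data = "café‡".encode("utf-8")     # 63 61 66 c3 a9 e2 80 a1
start = next_char_start(data, 4)   # index 4 is a9, the middle of 'é'
print(start, data[start:].decode("utf-8"))
```

Starting in the middle of 'é', the sketch lands on the first byte of the double dagger, and decoding from there works cleanly.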
04:35
This allows you to pick up partway through a stream and know when the next character starts. To see this in action, let’s look back at the 'é'
from 'café'
.
04:46
Remember? That’s code point E9. E9 is greater than 7F, and 7F is 127 in decimal. This means it’s going to have to be a multi-byte encoding, so it won’t start with 0, like an ASCII one. To start to break this down, let’s look at E9 as a binary number.
05:06
Using our digits of hex trick from before, translate the E into 1110 and the 9 into 1001. Because the number in the code point is bigger than 127, you know that it’s going to be a multi-byte encoding.
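If you’d rather let Python do the hex-to-binary translation, a quick sketch like this shows the same two groups of 4 bits:

```python
code_point = 0xE9

# Format as 8 binary digits, then split into the two hex digits' worth of bits.
bits = format(code_point, "08b")
print(bits[:4], bits[4:])      # 1110 1001

# Anything above 0x7F (127) can't fit in a single UTF-8 byte.
needs_multibyte = code_point > 0x7F
print(needs_multibyte)         # True
```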
05:21
Start on the right-hand side and peel off the last 6 bits, 101001. Because these go into a subsequent encoded byte, lead them with the 10 subsequent-byte marker.
05:34 Next, take the next chunk of bits. Well, there are only 2 bits left, 11, and once those are used up you know you’re done, which means it’s going to be 2 bytes, so use the 2-byte marker, 110.
05:45
Finally, fill in the middle with zeros as padding, so the first byte reads 110, then 000, then 11. This is the end result of the encoding. The left-hand side, 11000011, turns into C3, and the right-hand side, 10101001, into A9.
05:57
If you remember from the session in the REPL, letter, having the code point E9, encoded into \xc3\xa9 (c3 a9).
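The walkthrough above can be sketched with bit operations and checked against Python’s own encoder. The function encode_two_byte is a made-up name, and it only handles the 2-byte range (code points 80 through 7FF); it is not a general UTF-8 encoder.

```python
def encode_two_byte(code_point):
    """Hand-encode a code point in the 2-byte UTF-8 range (0x80 to 0x7FF)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the 2-byte range")
    low_six = code_point & 0b11_1111      # peel off the last 6 bits
    high_bits = code_point >> 6           # whatever is left, at most 5 bits
    first = 0b1100_0000 | high_bits       # 110 marker, zero padding, high bits
    second = 0b1000_0000 | low_six        # 10 marker, then the low 6 bits
    return bytes([first, second])


encoded = encode_two_byte(0xE9)
print(encoded)                            # b'\xc3\xa9'
print(encoded == "é".encode("utf-8"))     # True
```

The OR with the marker leaves the padding bits as zeros, which is the "fill in the middle" step.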
06:06 So, this is how UTF-8 represents its information. Believe it or not, that was the easy part. It gets worse from here on in. Next up: digraphs and dirty tricks.