Using Other Encodings
In this lesson, you’ll go beyond UTF-8 and learn about other encodings in Python. There are multiple ways of specifying Unicode in a Python string. You’ll learn that not all characters can be represented in all of these formats. The complete list of accepted encodings is buried way down in the documentation for the codecs module, which is part of Python’s Standard Library.
00:00 In the previous lesson, I gave you a tour of useful built-in Python functions for manipulating text and code points. In this lesson, I’m going to talk about encodings other than UTF-8.
00:11 There are numerous ways of specifying Unicode inside of a Python string. You can put it in from your keyboard or paste it from a clipboard. Any string can contain Unicode. You can use an octal escape, which is a backslash followed by up to 3 octal digits, or a hex escape, which is "\x" followed by exactly 2 hex digits.
00:32 You can use the full Unicode database name, or you can use the small "\u" escape, which takes 4 hex digits (2 bytes), or the full-size capital "\U" escape, which takes 8 hex digits (4 bytes).
00:45 And now inside the REPL, I’ll prove that all those things are the same. The typed 'a', the octal, the hex, the database,
01:00 small "\u" escape, capital "\U" escape,
01:06 and look at that. They’re all equal. It’s not possible to represent all Unicode characters using all of these escape sequences. An octal escape is at most 3 digits long. That gives it a maximum value of 511 in decimal, or code point 1FF.
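The REPL session itself is not part of the transcript. A minimal sketch of the comparison, assuming the character being compared is the lowercase 'a' mentioned above, might look like this. The variable names are illustrative. The octal limit is computed numerically rather than written as an escape, because recent Python versions warn about octal escapes above \377 in literals.

```python
# Six spellings of the same one-character string, "a" (code point U+0061).
typed = "a"
octal = "\141"                      # octal escape: up to 3 octal digits
hexed = "\x61"                      # hex escape: exactly 2 hex digits
named = "\N{LATIN SMALL LETTER A}"  # Unicode database name
small_u = "\u0061"                  # small u escape: exactly 4 hex digits
big_u = "\U00000061"                # capital U escape: exactly 8 hex digits

forms = [typed, octal, hexed, named, small_u, big_u]
all_equal = all(form == typed for form in forms)
print(all_equal)  # True

# Largest value that three octal digits can hold.
max_octal = 0o777
print(max_octal, hex(max_octal))  # 511 0x1ff
```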
01:23 A hex escape is always 2 digits long. That gives it a maximum decimal value of 255, or code point FF. The small "\u" escape is 4 digits of hex.
01:36 This allows you to get up to decimal 65535, which actually isn’t a character. This is a reserved spot in Unicode for the symbol <not a character>.
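As a rough check of those limits, the sketch below builds the largest hex escape and the largest small "\u" escape, then asks the unicodedata module how the Unicode database classifies code point FFFF. The variable names are made up for illustration.

```python
import unicodedata

largest_hex = "\xff"        # hex escape: 2 hex digits
largest_small_u = "\uffff"  # small u escape: 4 hex digits

print(ord(largest_hex))      # 255
print(ord(largest_small_u))  # 65535

# U+FFFF is a noncharacter, so the database reports the category
# "Cn" (unassigned) and has no name for it.
category = unicodedata.category(largest_small_u)
name = unicodedata.name(largest_small_u, None)
print(category, name)  # Cn None
```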
01:46 This means the capital "\U" escape is the only escape format that can specify all possible code points. In addition to UTF-8, Unicode supports UTF-16 and UTF-32.
01:57 Like UTF-8, UTF-16 is variable-length, but it’s either 2 or 4 bytes. UTF-32 is always 4 bytes long. It’s important to note that these encodings are not compatible with each other.
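To make the byte widths concrete, here is a small sketch that measures a few sample characters in each encoding. The helper name encoded_lengths and the sample characters are illustrative choices. The "-le" codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_lengths(char):
    """Return the byte lengths of char in UTF-8, UTF-16, and UTF-32 (no BOM)."""
    codecs_to_try = ("utf-8", "utf-16-le", "utf-32-le")
    return tuple(len(char.encode(name)) for name in codecs_to_try)

# An ASCII letter, an accented letter, the euro sign, and an emoji.
for char in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(char):04X}", encoded_lengths(char))
# U+0061 (1, 2, 4)
# U+00E9 (2, 2, 4)
# U+20AC (3, 2, 4)
# U+1F600 (4, 4, 4)
```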
02:12 Consider the following code.
02:16 Encoding the raw data in "utf-8" and then decoding it as "utf-16" does not give you the same result. Not all UTF-8 byte sequences are even valid UTF-16, so not only is it possible to get the wrong result, you may also get an exception.
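The code shown on screen is not included in the transcript. The following is a sketch of the kind of mismatch being described, with sample strings chosen here for illustration. It decodes with "utf-16-le" rather than plain "utf-16" so the byte order is fixed and the result is the same on every machine.

```python
letters = "αβγδ"
raw = letters.encode("utf-8")  # 8 bytes: ce b1 ce b2 ce b3 ce b4

round_trip = raw.decode("utf-8")
mismatched = raw.decode("utf-16-le")  # each byte pair read as one code unit

print(round_trip == letters)  # True
print(mismatched == letters)  # False
print(len(mismatched))        # 4, but they are four unrelated characters

# Three bytes cannot be split into 2-byte code units, so decoding fails.
odd_raw = "abc".encode("utf-8")
try:
    odd_raw.decode("utf-16-le")
    decode_failed = False
except UnicodeDecodeError as error:
    decode_failed = True
    print("decode failed:", error.reason)
```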
02:35 Both UTF-8 and UTF-16 are variable-length. UTF-8 is generally shorter because it can go down to a single-byte encoding, whereas UTF-16 always takes at least 2 bytes. That being said, there are some corner cases where UTF-8 can actually be larger.
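A quick way to see both sides of that trade-off is to compare encoded sizes for two short strings. This is only a sketch: the sizes helper and the sample strings are assumptions made for illustration, and "utf-16-le" is used to leave the byte order mark out of the count.

```python
def sizes(text):
    """Return (UTF-8 size, UTF-16 size) of text in bytes, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

english = "hello"
japanese = "日本語"

print(sizes(english))   # (5, 10): UTF-8 is half the size
print(sizes(japanese))  # (9, 6): UTF-16 is smaller here
```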
02:54 There are thousands of characters, those from code point 0800 through FFFF, that take 3 bytes in UTF-8 but only 2 bytes in UTF-16. Outside of Unicode, two common encodings that you’ll run into are Latin-1 and CP1252. CP1252 is a Windows-based variant of Latin-1 and is very, very similar to it. Both of these are common, particularly because the HTTP standard historically specified Latin-1 as the default encoding.
03:23 This means many web servers spit out Latin-1 or CP1252 unless they’re configured to do otherwise. Python also provides a series of encodings that are specific to the language and used as utilities.
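To illustrate how close Latin-1 and CP1252 are, and where they part ways, the sketch below decodes the same bytes with both codecs. The byte values are picked for illustration: 0xe9 sits in the range the two encodings share, while 0x80 falls in the range where CP1252 substitutes printable characters for Latin-1's control codes.

```python
shared = b"caf\xe9"
same_text = shared.decode("latin-1") == shared.decode("cp1252")
print(same_text)  # True

# Byte 0x80 is a control code in Latin-1 but the euro sign in CP1252.
differing = b"\x80"
latin1_char = differing.decode("latin-1")
cp1252_char = differing.decode("cp1252")
print(hex(ord(latin1_char)))  # 0x80
print(hex(ord(cp1252_char)))  # 0x20ac
```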
03:38 The 'unicode-escape' encoding is a useful one. This encoding returns the code points of the string.
03:50 Notice that it uses the smallest possible representation of the code point, changing from small "\u" to capital "\U" as necessary.
04:02 Closer to the bottom of the code point table, it switches to hex.
04:10 Just remember that these are the code point numbers, not the UTF-8 or UTF-16 encodings. You may remember the 'é' in 'café' encodes in UTF-8 to 0xc3 0xa9.
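The on-screen examples are not in the transcript, so the sample characters below are stand-ins. The sketch encodes each one with 'unicode-escape' and with UTF-8 to show that the first gives an escaped code point while the second gives the encoded bytes.

```python
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

escaped = [char.encode("unicode-escape") for char in samples]
utf8 = [char.encode("utf-8") for char in samples]

for char, esc, raw in zip(samples, escaped, utf8):
    print(f"U+{ord(char):04X}", esc, raw.hex(" "))
# U+0061 b'a' 61
# U+00E9 b'\\xe9' c3 a9
# U+20AC b'\\u20ac' e2 82 ac
# U+1F600 b'\\U0001f600' f0 9f 98 80
```

The escaped form of 'é' is the hex escape for code point E9, while its UTF-8 encoding is the two bytes c3 a9.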
04:23 Python supports a long list of encodings: standard ones as well as some built into the language specifically. The whole list is available at this URL.
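One way to check whether a particular encoding name is supported, without consulting the list, is to ask the codecs module directly. This is a sketch only: canonical_name is a hypothetical helper and the labels are arbitrary examples.

```python
import codecs

def canonical_name(label):
    """Return Python's canonical codec name for label, or None if unknown."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None

for label in ("utf8", "latin-1", "cp1252", "no-such-codec"):
    print(label, "->", canonical_name(label))
# utf8 -> utf-8
# latin-1 -> iso8859-1
# cp1252 -> cp1252
# no-such-codec -> None
```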
04:35 You’ve made it this far! One lesson left. The final lesson presents a few caveats and corner cases, as well as showing you references and further reading.