Using Other Encodings
In this lesson, you’ll go beyond UTF-8 and learn about other encodings in Python. There are multiple ways of specifying Unicode in a Python string. You’ll learn that not all characters can be represented in all of these formats. The complete list of accepted encodings is buried way down in the documentation for the codecs module, which is part of Python’s Standard Library.
00:11 There are numerous ways of specifying Unicode inside of a Python string. You can type it in from your keyboard or paste it from a clipboard. Any string can contain Unicode. You can use a raw octal escape specifying a 3-digit octal number, or a raw hex escape specifying a 2-digit hex number.
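As a rough sketch of these forms, the snippet below writes the letter "a" several ways and compares the results. The octal and hex escapes are the ones named above. The `\u`, `\U`, and `\N{...}` forms are further standard Python escapes that this excerpt does not name, and the variable names are only illustrative.

```python
# The same character, "a" (code point 97), written several ways.
literal = "a"                       # typed or pasted directly
octal = "\141"                      # octal escape: 0o141 == 97
hexed = "\x61"                      # 2-digit hex escape: 0x61 == 97
u16 = "\u0061"                      # 4-digit hex escape
u32 = "\U00000061"                  # 8-digit hex escape
named = "\N{LATIN SMALL LETTER A}"  # escape by Unicode character name

all_equal = literal == octal == hexed == u16 == u32 == named
print(all_equal)
```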
and look at that. They’re all equal. It’s not possible to represent all Unicode characters using all of these escape sequences. An octal escape is at most 3 digits long. That gives it a maximum value of 511 in decimal, or code point
Encoding the raw data inside of "utf-8" then decoding it in "utf-16" does not give you the same result. Not all UTF-8 encodings are even compatible with UTF-16 encodings, so not only is it possible to get the wrong result, you may also get an exception.
02:35 Both UTF-8 and UTF-16 are variable-length. UTF-8 is generally shorter because it can go down to a single-byte encoding, whereas UTF-16 always takes at least 2 bytes. That being said, there are some corner cases where UTF-8 can actually be larger.
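The size trade-off can be checked directly. This is a minimal sketch with a hypothetical helper, `sizes`, and sample characters picked for illustration. It uses "utf-16-le" rather than "utf-16" so that the 2-byte byte order mark Python prepends for "utf-16" does not inflate the count.

```python
def sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for a string."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

ascii_sizes = sizes("a")   # (1, 2): UTF-8 is shorter
accent_sizes = sizes("é")  # (2, 2): same size
euro_sizes = sizes("€")    # (3, 2): UTF-8 is larger
emoji_sizes = sizes("😀")  # (4, 4): UTF-16 needs two 2-byte units

for label, pair in [("a", ascii_sizes), ("é", accent_sizes),
                    ("€", euro_sizes), ("😀", emoji_sizes)]:
    print(label, pair)
```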
02:54 There are thousands of characters in the Unicode blocks that encode to 3 bytes in UTF-8 but only 2 bytes in UTF-16. Outside of Unicode, two common encodings that you’ll run into are Latin-1 and CP1252. CP1252 is a Windows-based variant of Latin-1 and is very similar to it. Both are common, particularly because the HTTP standard historically specified Latin-1 as the default encoding for text.
03:23 This means many web servers spit out Latin-1 or CP1252 unless they’re configured to do otherwise. Python also provides a series of encodings that are specific to the language and used as utilities.
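To make the Latin-1 versus CP1252 difference concrete, and to show what a Python-specific utility encoding looks like, here is a short sketch. The sample characters and the choice of "unicode_escape" and "rot13" as utility codecs are illustrative picks, not ones this excerpt names.

```python
import codecs

# Accented Latin letters encode to the same byte in both encodings.
same_byte = "é".encode("latin-1") == "é".encode("cp1252") == b"\xe9"

# The euro sign exists in CP1252 (byte 0x80) but has no Latin-1 encoding.
euro_cp1252 = "€".encode("cp1252")
try:
    "€".encode("latin-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False

# The same byte decodes differently depending on which encoding is assumed.
as_cp1252 = b"\x80".decode("cp1252")   # the euro sign
as_latin1 = b"\x80".decode("latin-1")  # an invisible control character

# Python-specific utility codecs.
escaped = "é".encode("unicode_escape")  # bytes holding the escape text \xe9
rotated = codecs.encode("hello", "rot13")

print(same_byte, euro_fits_latin1, as_cp1252 == as_latin1, escaped, rotated)
```

The practical upshot is that bytes from a server declared as Latin-1 may really be CP1252, so a stray control character where a euro sign or curly quote was expected is a hint to try the other decoder.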