Specifying the Character Encoding
You can then encode this string into the corresponding bytes. What comes back is a
bytes() object literal, which looks quite like a regular string, except that it starts with a lowercase letter
"b". However, it’s actually a concealed sequence of numeric bytes that you can reveal by turning them into a list, for example.
If you look closely, these are exactly the same numeric ASCII codes that you saw earlier. Note that you can reverse the process by creating a new instance of the
bytes() object, passing the list of integers, and calling
.decode() on it.
01:20 Python does this automatically for you whenever you open a file in text mode, so this happens seamlessly in the background. Unfortunately, things can get more complicated when you stumble on some funky characters that aren’t defined in the original ASCII encoding table.
01:37 These could be letters with diacritic marks or symbols from non-Latin alphabets. ASCII was designed for the English language, after all. Let’s say you wanted to decode the following sequence of bytes.
This produces the word
"café" with an accent. Notice that although the word only has four characters, it was encoded using five bytes, and that’s because of the last character, which doesn’t have a corresponding ASCII code.
02:15 How was it then possible for Python to decode it, you may ask? Well, when you don’t request any particular character encoding yourself, then Python silently falls back to your operating systems’s default character encoding. In my case, that default encoding happens to be UTF-8, which is a superset of ASCII, so it’s fully backward compatible, but at the same time, it extends ASCII with a much wider range of characters.
02:44 Note that this doesn’t mean it’ll be the same for you. Your operating system may be using a completely different character encoding. This is a problem because if you test your code on, say, macOS and it works, then it doesn’t necessarily mean it’ll work elsewhere.
03:13 You can do this by passing a string with the encoding’s name to the relevant method. When you try something else, like ASCII, then you’re going to have a problem because one of the bytes doesn’t correspond to any known ASCII code. Similarly, when you specify a character encoding that can’t represent one of the letters from your text, Python won’t be able to encode a string into bytes.
In general, you have to know the encoding of a text file that you’re about to open for reading. If you’re unsure, then there are libraries like
chardet that can help you with that by trying to guess the encoding. However, there’s no guarantee they’ll succeed at all.
05:12 In early computing, people adopted dozens of character encodings to encompass the unique needs of different spoken languages. Because of the limited disk space at the time, each encoding assigned different characters to the same byte value, making those encodings mostly incompatible with each other. For example, the byte value 225 could represent any of the letters depicted in the first row of the table on the slide, and even more. Apart from that, once you had chosen a given character encoding for your text, you could only represent characters belonging to a few similar alphabets.
05:50 So if you wanted to write a piece of text that included Arabic, Greek, and Korean all at the same time, then you’d be out of luck. It just wasn’t possible to fit all these different characters on a single encoding.
06:06 Fortunately, this problem is a thing of the past thanks to the advent of Unicode, which is a single standardized and universal numeric representation of all characters from any spoken language. It even specifies emoji symbols!
06:22 In Unicode, each character is given a unique number called a code point that can’t be confused with any other character. However, because the standard defines almost one hundred fifty thousand characters, there’s no single font that could possibly display them all.
06:49 For example, if your text is mostly English with occasional foreign-language asides or citations, then you may want to allocate fewer bytes for Latin letters because they appear most frequently. In this case, you can use UTF-8, which is backward compatible with ASCII by using only eight bits, or a single byte, per character. That being said, UTF-8 may sometimes require as many as four bytes to encode an exotic character like an emoji symbol, so it’s a form of variable length encoding. Conversely, other popular Unicode encodings always use multiple bytes, which may be preferable when your texts predominantly consist of non-English characters. These days, UTF-8 is arguably the most widely used character encoding on the planet.
07:56 At the same time, it supports multiple languages, uses the previously mentioned compact representation, and was designed to be Internet-friendly. All in all, UTF-8 should become the default choice for your applications because you can’t go wrong with it. Even if you don’t think you’ll ever need to use characters other than English letters, embracing Unicode early on is still a good idea because you may eventually want to offer your content in other languages, or the content may be user generated, in which case you’ll need to support a wide range of characters anyway.
08:32 As a rule of thumb, always explicitly specify the character encoding of a text file that you open in Python, and make sure that it actually matches the encoding that the file was written with. If you’re creating a new file yourself, then stick with UTF-8, which is the most suitable encoding in most cases.
09:02 One of the most extreme but also very real examples of this problem can actually prevent you from installing a Python library. This is because many build tools will try to open the README file of a package as part of the installation procedure.
09:30 Character encoding is not the only thing you should keep in mind when you open a file in Python. Another thing that you may sometimes need to consider when working with text files in Python is the line-ending character, which you’ll learn about in the next lesson.
Become a Member to join the conversation.