Specifying the Character Encoding

Bartosz Zaczyński

Python Basics: Reading and Writing Files Bartosz Zaczyński 09:46

00:00 In this lesson, you’ll learn how to specify the character encoding of a text file in Python so that you can correctly read the file contents.

00:10 Decoding raw bytes into characters, and the other way around, requires that you choose and agree on an encoding scheme, which is usually known as a character encoding.

00:20 You can experiment with this concept by running a few lines of code in IDLE. Start by declaring a string of characters like "cash", which is the word that you saw in the previous lesson.

00:31 You can then encode this string into the corresponding bytes. What comes back is a bytes object, whose literal looks quite like a regular string, except that it starts with a lowercase letter "b". However, it’s actually a concealed sequence of numeric bytes that you can reveal by turning them into a list, for example.

00:54 If you look closely, these are exactly the same numeric ASCII codes that you saw earlier. Note that you can reverse the process by creating a new instance of the bytes() object, passing the list of integers, and calling .decode() on it.
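
The round trip described above could look roughly like this in Python 3. It's a sketch of what the lesson narrates rather than the exact code typed in the video, and the variable names are illustrative:

```python
# Encode a string into bytes, inspect the numeric values, and decode them back.
text = "cash"

encoded = text.encode("ascii")   # b'cash'
codes = list(encoded)            # [99, 97, 115, 104]

# Reverse the process: build a bytes object from the integers and decode it.
decoded = bytes(codes).decode("ascii")

print(encoded, codes, decoded)
```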

01:11 Don’t worry about the technical details, though. This is only to illustrate the idea behind encoding characters into bytes and decoding them back into characters.

01:20 Python does this automatically for you whenever you open a file in text mode, so this happens seamlessly in the background. Unfortunately, things can get more complicated when you stumble on some funky characters that aren’t defined in the original ASCII encoding table.

01:37 These could be letters with diacritic marks or symbols from non-Latin alphabets. ASCII was designed for the English language, after all. Let’s say you wanted to decode the following sequence of bytes.

01:50 I’m going to change the last two and append one more.

02:01 This produces the word "café" with an accent. Notice that although the word only has four characters, it was encoded using five bytes, and that’s because of the last character, which doesn’t have a corresponding ASCII code.
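
As a rough illustration, the modified byte sequence might be decoded as follows. The values match the ones listed in the discussion below the transcript. The encoding is named explicitly here, which is an assumption, since the video appears to rely on the default:

```python
# The last character of "café" takes two bytes in UTF-8.
codes = [99, 97, 102, 195, 169]

word = bytes(codes).decode("utf-8")

print(word)                    # café
print(len(word), len(codes))   # 4 5
print("é".encode("utf-8"))     # b'\xc3\xa9'
```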

02:15 How was it then possible for Python to decode it, you may ask? Well, when you call .decode() without naming an encoding, Python 3 assumes UTF-8. When you open a text file without requesting any particular character encoding, on the other hand, Python silently falls back to your operating system’s default character encoding. In my case, that default also happens to be UTF-8, which is a superset of ASCII, so it’s fully backward compatible, but at the same time, it extends ASCII with a much wider range of characters.

02:44 Note that this doesn’t mean it’ll be the same for you. Your operating system may be using a completely different character encoding. This is a problem because if you test your code on, say, macOS and it works, then it doesn’t necessarily mean it’ll work elsewhere.
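
To check which defaults apply on your own machine, something like the following may help. It's a minimal sketch that uses the standard locale and sys modules, neither of which the lesson itself shows:

```python
import locale
import sys

# Encoding that open() falls back to for text files when you don't pass one.
# It depends on the operating system and its locale settings.
file_default = locale.getpreferredencoding(False)

# Encoding that str.encode() and bytes.decode() use by default in Python 3.
str_default = sys.getdefaultencoding()

print(file_default)
print(str_default)
```

The first value can differ between computers, while the second one is the same on every Python 3 installation.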

03:01 It’s one of the reasons why you should always specify what character encoding to use. When in doubt, just request UTF-8, which has become the widespread standard across the world.

03:13 You can do this by passing a string with the encoding’s name to the relevant method. When you try something else, like ASCII, then you’re going to have a problem because one of the bytes doesn’t correspond to any known ASCII code. Similarly, when you specify a character encoding that can’t represent one of the letters from your text, Python won’t be able to encode a string into bytes.
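
Both failure modes can be reproduced with a few lines. This is a sketch under the assumption that the bytes are the UTF-8 form of "café" from earlier in the lesson:

```python
data = bytes([99, 97, 102, 195, 169])   # "café" encoded with UTF-8

print(data.decode("utf-8"))             # café

# Decoding fails: 195 is outside the 0-127 range that ASCII defines.
try:
    data.decode("ascii")
except UnicodeDecodeError as error:
    decode_error = error
    print(error)

# Encoding fails: ASCII has no code for the letter "é".
try:
    "café".encode("ascii")
except UnicodeEncodeError as error:
    encode_error = error
    print(error)
```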

03:39 These problems will also affect your text files, so to address them both, the built-in open() function and its Path.open() counterpart accept an encoding parameter.

03:51 When you open a file in text mode, which is the default mode, you should tell Python which character encoding the file was written with.

04:08 That’s because different character encodings will represent the same text differently. If you provide an incorrect encoding like here, then you’ll most likely end up with a familiar error

04:20 or, in the best-case scenario, some nonsensical output.
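
The three outcomes mentioned here (correct text, an error, and nonsensical output) can be sketched with a temporary file. The file name menu.txt and the choice of ASCII and Latin-1 as the wrong encodings are assumptions made for the demo, not necessarily what the video uses:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "menu.txt"   # hypothetical file name

    # Write the file with a known encoding.
    with open(path, mode="w", encoding="utf-8") as file:
        file.write("café")

    # Reading with the matching encoding gives back the original text.
    with path.open(encoding="utf-8") as file:
        correct = file.read()

    # ASCII can't decode the bytes of "é", so Python raises an error.
    try:
        with open(path, encoding="ascii") as file:
            file.read()
    except UnicodeDecodeError as error:
        failure = error

    # Latin-1 maps every byte to some character, so no error is raised,
    # but the result is garbled.
    with open(path, encoding="latin-1") as file:
        garbled = file.read()

print(correct)   # café
print(garbled)   # cafÃ©
```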

04:29 In general, you have to know the encoding of a text file that you’re about to open for reading. If you’re unsure, then there are libraries like chardet that can help you with that by trying to guess the encoding. However, there’s no guarantee they’ll succeed at all.

04:46 If you’d like to get a complete list of character encodings that your Python version supports, then import the aliases dictionary from encodings.aliases

04:59 and get all of its values.

05:04 These are the encoding names that you can use when you open a file in Python.
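
A minimal sketch of that lookup follows. The dictionary maps many alias names to the same codec, so wrapping the values in a set, which is an addition not mentioned in the lesson, removes the duplicates:

```python
from encodings.aliases import aliases

# Keys are alias names and values are canonical codec names,
# so the values contain many repeats.
codec_names = sorted(set(aliases.values()))

print(len(codec_names))
print(codec_names[:5])
```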

05:12 In early computing, people adopted dozens of character encodings to encompass the unique needs of different spoken languages. Because of the limited disk space at the time, each encoding assigned different characters to the same byte value, making those encodings mostly incompatible with each other. For example, the byte value 225 could represent any of the letters depicted in the first row of the table on the slide, and even more. Apart from that, once you had chosen a given character encoding for your text, you could only represent characters belonging to a few similar alphabets.
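
The table from the slide isn't part of this transcript, but the idea can be sketched by decoding the byte value 225 with a few legacy encodings. The three encodings below are picked for illustration and may not be the ones shown on the slide:

```python
# One byte value, three legacy encodings, three different letters.
byte = bytes([225])

decoded = {
    encoding: byte.decode(encoding)
    for encoding in ("latin-1", "cp1251", "iso8859-7")
}

for encoding, letter in decoded.items():
    print(encoding, letter)
```

Western European Latin-1 yields "á", the Cyrillic code page 1251 yields "б", and the Greek ISO 8859-7 yields "α".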

05:50 So if you wanted to write a piece of text that included Arabic, Greek, and Korean all at the same time, then you’d be out of luck. It just wasn’t possible to fit all these different characters into a single encoding.

06:06 Fortunately, this problem is a thing of the past thanks to the advent of Unicode, which is a single standardized and universal numeric representation of all characters from any spoken language. It even specifies emoji symbols!

06:22 In Unicode, each character is given a unique number called a code point that can’t be confused with any other character. However, because the standard defines almost one hundred fifty thousand characters, there’s no single font that could possibly display them all.

06:40 There’s a whole family of specific Unicode-to-byte encodings that may use a different number of bytes per character, depending on your primary language.

06:49 For example, if your text is mostly English with occasional foreign-language asides or citations, then you may want to allocate fewer bytes for Latin letters because they appear most frequently. In this case, you can use UTF-8, which stays backward compatible with ASCII by using only eight bits, or a single byte, for each ASCII character. That being said, UTF-8 may sometimes require as many as four bytes to encode an exotic character like an emoji symbol, so it’s a form of variable-length encoding. Conversely, other popular Unicode encodings always use multiple bytes, which may be preferable when your texts predominantly consist of non-English characters. These days, UTF-8 is arguably the most widely used character encoding on the planet.
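
To make the size difference concrete, the sketch below counts the bytes that a few characters need. It assumes UTF-16 and UTF-32 as the "other popular Unicode encodings", which the lesson doesn't name, and the sample characters are arbitrary:

```python
# Bytes needed per character in three Unicode encodings.
# The "-le" variants are used so that no byte order mark is added.
samples = ["a", "é", "€", "😀"]

sizes = {
    character: tuple(
        len(character.encode(encoding))
        for encoding in ("utf-8", "utf-16-le", "utf-32-le")
    )
    for character in samples
}

for character, (utf8, utf16, utf32) in sizes.items():
    print(f"U+{ord(character):04X}", utf8, utf16, utf32)
```

UTF-8 grows from one to four bytes across these samples, UTF-16 uses two or four, and UTF-32 always uses four.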

07:42 Many software programs, including Python, adopt it as their standard. This encoding remains backward compatible with ASCII because the first 128 characters have essentially identical byte values.

07:56 At the same time, it supports multiple languages, uses the previously mentioned compact representation, and was designed to be Internet-friendly. All in all, UTF-8 should become the default choice for your applications because you can’t go wrong with it. Even if you don’t think you’ll ever need to use characters other than English letters, embracing Unicode early on is still a good idea because you may eventually want to offer your content in other languages, or the content may be user generated, in which case you’ll need to support a wide range of characters anyway.

08:32 As a rule of thumb, always explicitly specify the character encoding of a text file that you open in Python, and make sure that it actually matches the encoding that the file was written with. If you’re creating a new file yourself, then stick with UTF-8, which is the most suitable encoding in most cases.
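
Following this rule of thumb with pathlib might look like the sketch below. The file name notes.txt and the sample text are made up for the example:

```python
import tempfile
from pathlib import Path

text = "Zażółć gęślą jaźń 😀"

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "notes.txt"   # hypothetical file name

    # Name the encoding both when writing and when reading.
    path.write_text(text, encoding="utf-8")
    restored = path.read_text(encoding="utf-8")

print(restored == text)
```

Because both calls name the same encoding, the text survives the round trip regardless of the operating system's default.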

08:52 Not specifying any character encoding when you open a text file is a common mistake, which some tools and sometimes even Python itself will warn you about.

09:02 One of the most extreme but also very real examples of this problem can actually prevent you from installing a Python library. This is because many build tools will try to open the README file of a package as part of the installation procedure.

09:17 If they fail to decode the characters in that file because of the wrong character encoding, then you’ll only be able to install the library on some operating systems, but not others.

09:30 Character encoding is not the only thing you should keep in mind when you open a file in Python. Another thing that you may sometimes need to consider when working with text files in Python is the line-ending character, which you’ll learn about in the next lesson.

dakshnavenki on Aug. 16, 2023

I tried the encode and decode functions in the Python 2.7.5 IDLE window, but the output shows the same characters and not the ASCII values mentioned here. Does Python 2.7.5 not support the encode and decode functions, or is there a difference between Python 2 and 3 for these functions?

Bartosz Zaczyński RP Team on Aug. 16, 2023

@dakshnavenki There are significant differences between Python 2 and 3 regarding string representation. In Python 2, there was no separate data type for representing sequences of bytes, so the string type (str) served this purpose instead. As a result, when you .encode() a Python 2 string using the specified encoding, you end up with another string:

# Python 2
>>> "cash".encode()
'cash'

In this case, the source string consists of ASCII letters only, so the resulting string that you see in the output is the same as the original string that you started with. On the other hand, when you try encoding a Unicode string with some exotic characters, then you’ll see a difference:

# Python 2
>>> u"café".encode("utf-8")
'caf\xc3\xa9'

Regardless of the string’s contents, to reveal the numeric byte values of its individual characters in Python 2, you can call ord() on them:

# Python 2
>>> [ord(character) for character in "cash"]
[99, 97, 115, 104]

>>> [ord(character) for character in u"café".encode("utf-8")]
[99, 97, 102, 195, 169]

Last but not least, I should mention that Python 2 has long been deprecated and is no longer maintained, nor does it receive security and bug fixes. Unless you have specific reasons to use an older version of the language, you should use Python 3 instead.
