Python 3 stores data as either a string (str) or bytes. In this lesson, you’ll practice with encode() and decode(), which allow you to convert between the two. You’ll also start to work with Unicode and learn the difference between an encoding and a code point. Unicode specifies code points for characters but not their encodings. There are several different ways of encoding Unicode. UTF-8 is the most common and is the default in Python 3.
You may recall the ASCII standard has 128 code points. This is not enough for all human languages. Unicode improves on this: it has 1,114,112 possible code points. That’s 17 * 2^16, so the code points run from 0 to 17 * 2^16 - 1, or 0 to hex 10FFFF.
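The arithmetic behind the size of the Unicode code space can be checked in the REPL. This is a minimal sketch; the variable names are illustrative and not from the lesson.

```python
import sys

# Unicode is organized as 17 planes of 2**16 code points each.
total_code_points = 17 * 2**16
highest_code_point = total_code_points - 1

print(total_code_points)        # 1114112
print(hex(highest_code_point))  # 0x10ffff

# Python exposes the same upper bound as sys.maxunicode.
print(sys.maxunicode == highest_code_point)  # True
```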
00:41 The first 128 code points in Unicode are ASCII, making it backward compatible. Unicode itself is not an encoding; it only specifies the map of code points. UTF-8, the default encoding in Python 3, is the most common and popular of the encodings.
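To make the difference between a code point and an encoding concrete, here is a small sketch. The sample character and variable names are assumptions chosen for illustration: one code point stays the same while three encodings turn it into different bytes.

```python
# ord() gives a character's code point; chr() goes the other way.
print(ord("a"))  # 97, the same value as in ASCII
print(chr(97))   # a

# One code point, several possible encodings.
char = "\u00e9"          # "é", code point U+00E9
print(hex(ord(char)))    # 0xe9

utf8_bytes = char.encode("utf-8")
utf16_bytes = char.encode("utf-16-le")
utf32_bytes = char.encode("utf-32-le")

print(utf8_bytes)   # b'\xc3\xa9'
print(utf16_bytes)  # b'\xe9\x00'
print(utf32_bytes)  # b'\xe9\x00\x00\x00'
```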
Encoding and decoding in Python is the process of moving between these two representations, str and bytes. Let’s start off with a sample encoding. Here’s good old 'hello' encoded into 'utf-8', and you get back the binary representation b'hello'. For ASCII, it’s pretty simple: there’s not much change.
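The round trip looks roughly like this. The 'hello' example is the one from the lesson; the accented string is an added illustration showing that non-ASCII characters take more than one byte in UTF-8.

```python
# ASCII-only text: the bytes look just like the string.
text = "hello"
data = text.encode("utf-8")
print(data)                  # b'hello'
print(type(data))            # <class 'bytes'>
print(data.decode("utf-8"))  # hello

# Non-ASCII text: each "é" becomes two bytes in UTF-8.
accented = "r\u00e9sum\u00e9"       # "résumé"
encoded = accented.encode("utf-8")
print(encoded)                      # b'r\xc3\xa9sum\xc3\xa9'
print(len(accented), len(encoded))  # 6 8

# Decoding with the same encoding restores the original string.
print(encoded.decode("utf-8") == accented)  # True
```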
Because Unicode is the default, all strings are Unicode and can contain any Unicode character. Most Unicode is even valid in identifiers, so if I felt like embracing my French-Canadian heritage and putting the appropriate accents in résumé, I could. Generally, it’s not considered good practice because these characters aren’t easy to type on most people’s keyboards, but it is now possible.
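As a quick illustration, an accented name works like any other variable. The assigned value here is a made-up placeholder, and str.isidentifier() is used only as a convenient way to check a name.

```python
# Accented letters are legal in Python 3 identifiers.
résumé = "Ten years of Python"
print(résumé)  # Ten years of Python

# str.isidentifier() reports whether a string would be a valid name.
is_valid_name = "résumé".isidentifier()
print(is_valid_name)  # True
```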
Not all characters in Unicode are valid in identifiers. Unfortunately, you cannot use emojis inside your identifiers. The list of what is supported is long and covers most languages, but it isn’t 100% of Unicode. In addition to string manipulation being Unicode-based, so are regular expressions. And finally, the default encoding for encode() and decode() is "utf-8", so if you don’t specify it, it’ll be UTF-8.
03:55 That being said, best practice is to always specify the encoding. This makes it easier for people who are switching back and forth between Python 2 and Python 3 not to get confused. Now that you’ve seen the basics behind Unicode and UTF-8, in the next lesson, I’m going to show you how UTF-8 actually works.
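As a recap of the last few points, here is a minimal sketch you can run in the REPL. The sample strings are illustrative and not from the lesson, and the Latin-1 mismatch at the end is just one hypothetical way a wrong or assumed encoding can go wrong.

```python
import re

# Emoji are not valid in identifiers, but accented letters are.
print("\N{GRINNING FACE}".isidentifier())  # False
print("r\u00e9sum\u00e9".isidentifier())   # True

# In a str pattern, \w matches Unicode word characters by default.
words = re.findall(r"\w+", "caf\u00e9 na\u00efve")
print(words)  # ['café', 'naïve']

# encode() and decode() fall back to UTF-8 when no encoding is given.
text = "r\u00e9sum\u00e9"  # "résumé"
implicit = text.encode()
explicit = text.encode("utf-8")
print(implicit == explicit)  # True

# Being explicit on both sides avoids a mismatch like this one.
garbled = explicit.decode("latin-1")
print(garbled)                   # rÃ©sumÃ©
print(explicit.decode("utf-8"))  # résumé
```

Spelling out the encoding on both the encode and decode side keeps the two in agreement, even when the defaults happen to match.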