Unicode in Python: Working With Character Encodings (Overview)

Python’s Unicode support is strong and robust, but it takes some time to master. There are many ways of encoding text into binary data, and in this course you’ll learn a bit of the history of encodings. You’ll also spend time learning the intricacies of Unicode, UTF-8, and how to use them when programming Python. You’ll practice with multiple examples and see how smooth working with text and binary data in Python can be!

By the end of this course, you’ll know:

  • What an encoding is
  • What ASCII is
  • How binary displays as octal and hex values
  • How UTF-8 encodes a code point
  • How to combine code points into a single glyph
  • Which built-in functions can help you


00:00 Welcome to Unicode and Character Encodings in Python. My name is Chris and I will be your guide. This course talks about what an encoding is and how it works, where ASCII came from and how it evolved, how binary bits can be described in oct and hex and how to use those to map to code points, the Unicode standard and the UTF-8 encoding thereof, how UTF-8 uses the underlying bits to encode a code point, how multiple code points can result in a single character or glyph, functions built into Python that can help you when you’re messing around with characters in Unicode, and other encodings. First off, the handling of strings and character encodings is one of the big changes between Python 2 and Python 3. In fact, it’s one of the better reasons to move from Python 2 to Python 3.

00:46 All the examples in this course will be Python 3 based. If you’re using a Python 2 interpreter, you’re not going to be able to follow along. It’s really easy to forget when you’re programming in a nice high-level language like Python that computers really only understand numbers.

01:00 When you’re dealing with text, you’re actually dealing with a mapping between a number and a character that is being displayed. The fundamental item that is being stored in memory is still a number. ASCII was one of the preeminent standards for this kind of mapping.

01:16 It specified that certain numbers represented certain letters, and so when the computer used those numbers in the context of a string it would produce the right letters.

01:26 The problem with ASCII was it really only encoded the Latin alphabet. It didn’t even include accented characters. It was invented by and for English speakers; it wasn’t until later that accents for other Western languages were added. By contrast, Unicode is an international standard and has enough space to encode all written languages. In fact, it has space to encode other things as well, like emojis. At one point in time, there was even a move to add Klingon to it, but it was turned down. But there’s still space left over if the standards body changes its mind. First off, a little history.

02:00 I think I mentioned that computers only understand numbers? Well, computers only understand numbers. In fact, it’s even worse than that—they really only understand binary. Everything is a 1 or a 0.

02:10 This goes down to how transistors work—they’re either on or off. So, inside of the computer, everything is represented as either True or False, on or off, or 1 or 0 to represent that. Everything on top of that is an abstraction.

02:25 A byte is a grouping of bits. In the early history of computers, the size of a byte differed from machine to machine. By the time PCs came around, there were 8 bits to a byte, and that’s pretty common now.

02:37 Now, most processors deal with more than one byte at a time, but instead of redefining how big a byte is, they have other terms like word for groupings of bytes.

02:46 An 8-bit byte can hold 2^8 combinations—that’s 256. The counting starts at 0, so the number range, instead of being from 1 to 256, is from 0 to 255.

03:00 Back in the olden times—and I’m talking about times so old that even an old man like me thinks they’re the past—IBM introduced BCD, or Binary Coded Decimal. This was an early encoding. It was very, very simple and very small. It used 6 bits to represent a character.
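These sizes are easy to sanity-check in Python. This is a quick aside, not part of the course’s sample code:

```python
# An 8-bit byte allows 2**8 = 256 values, numbered 0 through 255.
print(2 ** 8)       # 256
print(0b11111111)   # 255, the largest value one byte can hold
# BCD's 6 bits give only 2**6 = 64 combinations.
print(2 ** 6)       # 64
```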

03:15 This wasn’t enough to even fully cover the English language, so IBM extended BCD with EBCDIC—Extended Binary Coded Decimal Interchange Code.

03:25 This used a full 8 bits to describe a character and was so advanced it actually included lowercase letters. Around the same time as EBCDIC being standardized, ASCII was introduced.

03:36 ASCII was put together by a standards body rather than by a single company and became more popular across different platforms. ASCII only required 7 bits, but at the time most computers were using an 8-bit byte, so the lead bit was just left as 0. Sometimes, with transmission protocols such as those used over modems or terminals, that 8th bit would be used as a parity bit to make sure that the byte had been transmitted correctly.
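As a side note, a parity bit can be computed from the low 7 bits. The sketch below assumes even parity—the transcript doesn’t say which scheme was used, and both even and odd parity existed:

```python
def with_even_parity(byte7):
    """Set the 8th bit so the total number of 1-bits is even."""
    ones = bin(byte7).count("1")   # count the 1-bits in the 7-bit value
    parity = ones % 2              # 1 if the count is odd, else 0
    return byte7 | (parity << 7)   # place the parity bit in bit 7

print(bin(with_even_parity(0b1000001)))  # 'A' (65): already even, bit stays 0
print(bin(with_even_parity(0b1000011)))  # 'C' (67): odd count, parity bit set
```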

04:01 ASCII was adopted as an international standard in 1967, and quickly there were several iterations and extensions made on top of it. The extended ASCII format moved to a full 8 bits of description and added accented characters, allowing Western languages other than English to be described.

04:19 PCs used ASCII, so when they became the de facto standard, ASCII became the way of communicating between computers. For clarity’s sake, let’s establish some common terminology. First off, what’s a character?

04:31 This probably feels clear to you—it’s that one little single unit of text—but this term can actually get a little confusing depending on who you’re talking to. So for the purposes of this course, the word character is going to mean a minimal unit of text that has a semantic value. So, that includes things like emojis, or symbols in Han Chinese, as well as obvious stuff like the letter A.

04:52 A character set is just a collection of these characters, and these sets can be used across multiple languages. Think about the Latin character set that most European languages can use, the Greek character set that pretty much only the Greek language can use, and the Russian character set, which is used across certain Slavic languages.

05:10 A code point is a number that represents a single character in one of these sets of encoded characters. For example, in the ASCII standard, the capital letter 'A' is the decimal number 65.
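You can check this mapping in Python with the built-in ord() and chr() functions, which convert between a character and its code point:

```python
# ord() maps a character to its code point; chr() goes the other way.
print(ord("A"))   # 65
print(chr(65))    # A
```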

05:25 A code unit, by contrast to a code point, is the sequence of bits that represents that code point. In ASCII, the code point 65 means 'A', and it’s stored in the computer using that number.

05:37 In other encoding standards, that mapping may not apply. As I mentioned before, in the original ASCII standard, a code unit was 7 bits long, so that covered the numbers 0 to 127. Unicode supports different kinds of encodings, and some of those even have varying-length code units.

05:55 UTF-8, one of those encodings, is an 8-bit encoding, but a single code point can map to 1, 2, 3, or 4 code units, so multiple bytes may describe a single code point.
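You can see the varying code-unit count by encoding a few characters to UTF-8 and measuring the byte length. The sample characters here are my own picks, not from the course:

```python
# Each character below needs a different number of UTF-8 code units (bytes).
for char in "A", "é", "€", "🐍":
    print(char, len(char.encode("utf-8")))
# "A" takes 1 byte, "é" takes 2, "€" takes 3, and the snake emoji takes 4.
```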

06:08 That’s enough background. Let’s look at some code. In order to inspect some strings, I’ve written a quick little function inside of a file called show.py. The core part of this function is line 5, which uses the built-in ord() function, returning the code point of the character that is passed in.
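The contents of show.py aren’t reproduced in this transcript, so the following is only a guess at what code_points() might look like—the name matches the transcript, but the body is a reconstruction:

```python
def code_points(text):
    """Print and return the Unicode code point of each character in text."""
    points = [ord(char) for char in text]  # ord() gives the code point
    print(points)
    return points
```

Called on 'Hello there', it would print [72, 101, 108, 108, 111, 32, 116, 104, 101, 114, 101], matching the values discussed below.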

06:29 I’m going to import that function into the REPL and start with a simple string in English saying 'Hello there'. Calling code_points() on that prints out the code point for each character in the str (string).

06:42 If you look at this, capital 'H' is 72 in ASCII, so it maps down below. Six characters in, you’ll see 32: that’s a space (" ") in ASCII. Notice that every one of these numbers is below 128, which means they’re in the range of the original ASCII 7-bit standard.

07:02 Let’s look at something a little more challenging.

07:05 Here’s some Russian that says “da svidaniya”, or at least, that’s what the web page I copied it from said it did—I hope it says that. Running code_points() on it,

07:16 you get a significantly larger set of numbers. Now, the third character in is 32—a space—just like in 'Hello there'. And if you look near here at the end, there’s a character that’s 225, which is below 256 in the extended ASCII range.

07:34 That is the accented 'á'. Everything else here is from the Cyrillic alphabet, which has much higher code point numbers above the ASCII range. All of these, as you’ll notice, are sort of around a thousand.
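You can confirm the rough magnitude with ord(): the Cyrillic block starts at code point 1024 (0x0400). The letters below are my own examples, not necessarily the ones on screen:

```python
# Cyrillic letters live around code point 1024 and up.
print(ord("Д"))  # 1044
print(ord("я"))  # 1103
print(ord(" "))  # 32, a plain ASCII space for comparison
```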

07:48 That’s it for the introduction. Next up, I’ll dive deeper into Python strings and their relationship to ASCII.
