Unicode in Python: Working With Character Encodings (Overview)
Python’s Unicode support is strong and robust, but it takes some time to master. There are many ways of encoding text into binary data, and in this course you’ll learn a bit of the history of encodings. You’ll also spend time learning the intricacies of Unicode, UTF-8, and how to use them when programming Python. You’ll practice with multiple examples and see how smooth working with text and binary data in Python can be!
By the end of this course, you’ll know:
- What an encoding is
- What ASCII is
- How binary displays as octal and hex values
- How UTF-8 encodes a code point
- How to combine code points into a single glyph
- Which built-in functions can help you
00:00 Welcome to Unicode and Character Encodings in Python. My name is Chris and I will be your guide. This course talks about what an encoding is and how it works, where ASCII came from and how it evolved, how binary bits can be described in octal and hex and how to use those to map to code points, the Unicode standard and the UTF-8 encoding thereof, how UTF-8 uses the underlying bits to encode a code point, how multiple code points can result in a single character or glyph, functions built into Python that can help you when you’re messing around with characters in Unicode, and other encodings. First off, strings and character encoding are one of the big changes between Python 2 and Python 3. In fact, it’s one of the better reasons to move from Python 2 to Python 3.
00:46 All the examples in this course will be Python 3 based. If you’re using a Python 2 interpreter, you’re not going to be able to follow along. It’s really easy to forget when you’re programming in a nice high-level language like Python that computers really only understand numbers.
01:00 When you’re dealing with text, you’re actually dealing with a mapping between a number and a character that is being displayed. The fundamental item that is being stored in memory is still a number. ASCII was one of the preeminent standards for this kind of mapping.
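You can see that number-to-character mapping directly in the REPL with two built-ins: `ord()` goes from a character to its number, and `chr()` goes the other way.

```python
# Text is a mapping between numbers and characters:
# ord() gives the number behind a character, chr() reverses it.
print(ord("A"))  # 65
print(chr(65))   # A
```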
01:26 The problem with ASCII was it really only encoded the Latin alphabet. It didn’t even include accented characters. It was invented by and for English speakers; it wasn’t until later that accents for other Western languages were added. By contrast, Unicode is an international standard and has enough space to encode all written languages. In fact, it has space to encode other things as well, like emojis. At one point in time, there was even a move to add Klingon to it, but it was turned down. But there’s still space left over if the standard body changes its mind. First off, a little history.
02:00 I think I mentioned that computers only understand numbers? Well, computers only understand numbers. In fact, it’s even worse than that: they really only understand binary. Everything is a 1 or a 0. This goes down to how transistors work; they’re either on or off. So, inside of the computer, everything is represented as either True or False, on or off, 1 or 0. Everything on top of that is an abstraction.
02:25 A byte is a grouping of bits. In the early history of computers, the size of a byte differed from machine to machine. By the time PCs came around, 8 bits to a byte had become the norm, and that’s pretty much universal now.
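You can see that grouping from the REPL: `format()` with the `"08b"` specifier renders a character’s code as the 8-bit binary pattern that fills one byte.

```python
# A byte groups 8 bits. format() with "08b" shows a character's code
# as an 8-bit binary pattern.
code = ord("A")       # 65
bits = format(code, "08b")
print(bits)           # 01000001
print(len(bits))      # 8 bits in the byte
```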
03:00 In the olden times, and I’m talking about times so old that even an old man like me considers them ancient, IBM introduced BCD, or Binary Coded Decimal. This was an early encoding. It was very, very simple and very small. It used 6 bits to represent a character.
03:36 ASCII was put together by a standards body rather than by a single company and became more popular across different platforms. ASCII only required 7 bits, but at the time most computers were using an 8-bit byte, so the lead bit was just left as 0. Sometimes, using some transmission protocols like over modems or terminals, that 8th bit would be used as a parity bit to make sure that the byte had been transmitted correctly.
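To make the parity idea concrete, here is a small sketch of even parity over the spare 8th bit. This is a transmission-layer trick, not part of ASCII itself, and the function name `with_even_parity` is made up for illustration.

```python
def with_even_parity(code):
    """Return the 7-bit code with the 8th bit set so that the total
    number of 1 bits is even. A sketch of the parity trick described
    above; the name is invented for this example."""
    ones = bin(code & 0x7F).count("1")
    parity = ones % 2              # 1 if the count of 1 bits is odd
    return (code & 0x7F) | (parity << 7)

print(format(with_even_parity(ord("A")), "08b"))  # 01000001 (already even)
print(format(with_even_parity(ord("C")), "08b"))  # 11000011 (parity bit set)
```

A receiver would count the 1 bits in each arriving byte; an odd total signals a corrupted transmission.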
04:01 ASCII was adopted as an international standard in 1967, and quickly there were several iterations and extensions made on top of it. The extended ASCII formats moved to a full 8 bits per character and added accented characters, allowing Western languages other than English to be described.
04:19 PCs used ASCII, so when they became the de facto standard, ASCII became the way of communicating between computers. For clarity’s sake, let’s establish some common terminology. First off, what’s a character?
04:31 This probably feels clear to you—it’s that one little single unit of text—but this term can actually get a little confusing depending on who you’re talking to. So for the purposes of this course, the word character is going to mean a minimal unit of text that has a semantic value. So, that includes things like emojis, or symbols in Han Chinese, as well as obvious stuff like the letter A.
04:52 A character set is just a collection of these characters, and these sets can be used across multiple languages. Think about the Latin character set that most European languages can use, the Greek character set that pretty much only the Greek language can use, and the Russian character set, which is used across certain Slavic languages.
05:37 In other encoding standards, that mapping between a code point and a code unit may not be one-to-one. As I mentioned before, in the original ASCII standard, a code unit was 7 bits long, so that covered the numbers 0 through 127. Unicode supports different kinds of encodings, and some of those even have variable-length code units.
That’s enough background. Let’s look at some code. In order to inspect some strings, I’ve written a quick little function inside of a file called show.py. The core of this function is line 5, which uses the built-in ord() function to return the code point of the character that is passed in.
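The transcript refers to show.py without reproducing it, so here is a minimal sketch of what that helper might look like. The function name `code_points()` and its use of `ord()` come from the transcript; the exact body is my assumption.

```python
# show.py -- a sketch of the helper described in the transcript.
# (The real file isn't shown here; only the name code_points and
# its reliance on ord() are from the course.)
def code_points(text):
    # ord() returns the Unicode code point of each character.
    for char in text:
        print(char, ord(char))
```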
I’m going to import that function into the REPL and start with a simple string in English saying 'Hello there'. Calling code_points() on that prints out the code point for each one of the characters in the string. If you look at this, capital H maps to 72 in ASCII. Six characters in, you’ll see 32; that’s a space (" ") in ASCII. Notice that every one of these numbers is below 128, which means they’re all in the range of the original 7-bit ASCII standard.
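That below-128 property is easy to check programmatically; in fact, Python 3.7+ ships a string method for exactly this test.

```python
# Every code point in a pure-ASCII string is below 128.
text = "Hello there"
print(all(ord(ch) < 128 for ch in text))  # True
print(text.isascii())                     # True (built in since Python 3.7)
```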
Calling code_points() on a string that mixes accented and Cyrillic characters, you get a significantly larger set of numbers. Now, the third character in is 32, a space, just like in 'Hello there'. And if you look near the end, there’s a character that’s 225, which is below 256 in the extended ASCII range. That is the accented 'á'. Everything else here is from the Cyrillic alphabet, which has much higher code point numbers, above the ASCII range. All of these, as you’ll notice, are sort of around a thousand.
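The course’s exact non-English string isn’t reproduced in this transcript, so "Привет" below is my own example, but it shows the same effect: Cyrillic code points land well above the ASCII range, around a thousand.

```python
# Cyrillic letters have code points far above the ASCII range.
# ("Привет" is an example string, not necessarily the course's.)
for ch in "Привет":
    print(ch, ord(ch))
```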