Working With ASCII and the Python String Module

Unicode in Python: Working With Character Encodings Christopher Trudeau 05:49

There are tens if not hundreds of character encodings. In this lesson, you’ll start by exploring one of the simplest character encodings, ASCII. This is a good place to start learning about character encoding because ASCII is a small and contained encoding.

The built-in Python string module includes several constants that categorize ASCII text. You’ll use these string constants to identify character sets, such as string.ascii_letters, string.digits, string.whitespace, and string.punctuation.
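As a quick illustration of those constants, here is a minimal sketch that sorts single characters into the categories the string module defines. The `classify` helper is a hypothetical name made up for this example and is not part of the standard library.

```python
import string


def classify(ch):
    """Return a rough ASCII category for a single character.

    Hypothetical helper for illustration; assumes len(ch) == 1.
    """
    if ch in string.ascii_letters:
        return "letter"
    if ch in string.digits:
        return "digit"
    if ch in string.whitespace:
        return "whitespace"
    if ch in string.punctuation:
        return "punctuation"
    # Anything outside the ASCII constants, such as an accented letter
    return "other"


for ch in ["a", "7", " ", "!", "é"]:
    print(repr(ch), classify(ch))
```

The accented character falls through to "other" because these constants only describe ASCII text.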

00:00 In the previous lesson, I introduced you to characters, code points, and the encoding thereof. In this lesson, I’m going to dive further into ASCII and its support in the Python string module.

00:10 ASCII became one of the most common standards for encoding because it was used by PCs early on. ASCII only encodes the basic Latin alphabet. There are no accented characters. The original encoding is 7 bits, so 128 characters in total, and it can be divided up into a series of groups.

00:28 The first 32 are control characters. They’re non-printable. These include things like printer controls, the bell sound, and carriage return. The next chunk is the space, a series of symbols, and numbers.

00:42 After that come capital letters, a few more symbols, lowercase letters, a few more symbols, and then finally, the character for deletion. The original ASCII was a 7-bit encoding, and so went from 0 to 127. PCs used 8-bit bytes, so oftentimes, the leading 8th bit was used for parity checks during transmission.
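The groups described above can be checked with a short sketch that walks all 128 code points and counts how many land in each group. The `ascii_group` function and its group names are assumptions made for this example, not anything defined by Python or by the ASCII standard.

```python
from collections import Counter


def ascii_group(code):
    """Name the group an ASCII code point (0 to 127) belongs to.

    Hypothetical helper; the group names are chosen for illustration.
    """
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit ASCII code point")
    if code < 32:
        return "control"
    if code == 32:
        return "space"
    if code == 127:
        return "delete"
    ch = chr(code)
    if ch.isdigit():
        return "digit"
    if ch.isupper():
        return "uppercase"
    if ch.islower():
        return "lowercase"
    # Everything left in the printable range is a symbol
    return "symbol"


counts = Counter(ascii_group(code) for code in range(128))
for group, total in counts.items():
    print(group, total)
```

The symbols are scattered between the other groups rather than sitting in one block, which is why the sketch treats them as whatever is left over.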

01:04 It didn’t take long to figure out that ASCII was insufficient to handle other kinds of languages. Accented characters for Latin and Germanic languages were added by extending ASCII to use the full 8 bits. This wasn’t the only extension.

01:17 Another one was called Latin-1. Latin-1 was then modified by Microsoft to create Windows-1252. Latin-1 and 1252 are very, very close, which causes all sorts of problems because it looks like you can interchange them, but every once in a while, you’re going to run into a character difference.
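To see where the two encodings part ways, this sketch decodes every possible byte value with Python's built-in "latin-1" and "cp1252" codecs and collects the bytes that give different results. Treating a byte that cp1252 cannot decode as a difference is a choice made for this example.

```python
differing = []

for value in range(256):
    raw = bytes([value])
    latin = raw.decode("latin-1")  # Latin-1 maps every byte to a character
    try:
        windows = raw.decode("cp1252")
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves a few byte values undefined
        windows = None
    if latin != windows:
        differing.append(value)

print(f"{len(differing)} bytes differ, "
      f"from {hex(differing[0])} to {hex(differing[-1])}")

# One concrete case: the same byte, two different characters
print(repr(b"\x80".decode("latin-1")))
print(repr(b"\x80".decode("cp1252")))
```

In Python's codecs, the differences are confined to the bytes 0x80 through 0x9F, where Latin-1 has control characters and Windows-1252 has printable ones such as the euro sign.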

01:36 If you’re wondering why I’m spending so much time on ASCII when this is a course on Unicode, well, it turns out that Unicode, Latin-1, Windows-1252—they all use the first 128 code points from ASCII.

01:50 So, if you’re sticking with the characters that I described in the previous screen, then the encoding is compatible across all four of these standards. Although Unicode is quickly becoming the de facto encoding, due to history, you still run into other encodings quite frequently.
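That compatibility can be sketched by encoding the same ASCII-only text four ways and comparing the bytes. UTF-8 stands in for Unicode here as an assumption, since Unicode itself defines code points and needs a concrete encoding such as UTF-8 to produce bytes.

```python
text = "Hello, ASCII!"
encodings = ["ascii", "latin-1", "cp1252", "utf-8"]

# ASCII-only text produces identical bytes in all four encodings
encoded = {name: text.encode(name) for name in encodings}
for name, data in encoded.items():
    print(name, data)
print("all identical:", len(set(encoded.values())) == 1)

# Step outside ASCII and the encodings stop agreeing
accented = "café"
latin_bytes = accented.encode("latin-1")
utf8_bytes = accented.encode("utf-8")
print(latin_bytes)
print(utf8_bytes)
```

Once a character outside the first 128 code points appears, as with the accented letter, the byte sequences diverge and the choice of encoding starts to matter.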

02:06 The web is one of those places. Latin-1 was the original default encoding for documents delivered over HTTP. Anything with a MIME type of text/, unless you specify otherwise, is using Latin-1.

02:19 Of course, standards are, well, not always so standard, so depending on what web server you were using and what browsers you were using, there were subtle differences to this. In order to get around this, browsers try to guess the encoding. This works with a varying degree of success, although they’ve gotten much better in the recent past. Old coders like me used to spend a lot of time on Slashdot. If you’re not familiar, this is a website that aggregates technology news. It’s been around since 1997 and I’m pretty sure some of the code in there is still the original code. It’s notorious for not supporting Unicode, and you can see this in a comment that I’ve clipped here.