Working With ASCII and the Python String Module
There are tens if not hundreds of character encodings. In this lesson, you’ll start by exploring one of the simplest character encodings, ASCII. This is a good place to start learning about character encoding because ASCII is a small and contained encoding.
The built-in Python string module includes several constants that categorize ASCII text. You’ll use these string constants to identify character sets.
In the previous lesson, I introduced you to characters, code points, and the encoding thereof. In this lesson, I’m going to dive further into ASCII and its support in the Python string module.
00:10 ASCII became one of the most common standards for encoding because it was used by PCs early on. ASCII only encodes the basic Latin alphabet. There are no accented characters. The original encoding is 7 bits, so 128 characters in total, and it can be divided up into a series of groups.
00:28 The first 32 are control characters. They’re non-printable. These include things like printer controls, the bell sound, and carriage return. The next chunk is the space, a series of symbols, and numbers.
00:42 After that come capital letters, a few more symbols, lowercase letters, a few more symbols, and then finally, the character for deletion. The original ASCII was a 7-bit encoding, and so went from 0 to 127. PCs used 8-bit bytes, so oftentimes, the leading 8th bit was used for parity checks during transmission.
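The groups described above can be rebuilt from their code points. This is only a rough sketch: the variable names are made up for illustration, and the ranges are the standard ASCII positions.

```python
# Code points 0-31 are the non-printable control characters.
control = [chr(i) for i in range(32)]
bell = chr(7)               # the bell sound, written "\a" in Python
carriage_return = chr(13)   # written "\r" in Python

space = chr(32)
digits = "".join(chr(i) for i in range(48, 58))
uppercase = "".join(chr(i) for i in range(65, 91))
lowercase = "".join(chr(i) for i in range(97, 123))
delete = chr(127)           # the last of the 128 characters

print(len(control))   # 32
print(digits)
print(uppercase)
print(lowercase)

# Every ASCII code point fits in 7 bits, so it is below 2**7 == 128.
fits_in_7_bits = all(ord(c) < 2**7 for c in digits + uppercase + lowercase)
print(fits_in_7_bits)
```

The built-in ord() goes the other way, from a character back to its code point, so ord("A") gives 65.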
01:04 It didn’t take long to figure out that ASCII was insufficient to handle other kinds of languages. Accented characters for Latin and Germanic languages were added by extending ASCII to use the full 8 bits. This wasn’t the only extension.
01:17 Another one was called Latin-1. Latin-1 was then modified by Microsoft to create Windows-1252. Latin-1 and 1252 are very, very close, which causes all sorts of problems because it looks like you can interchange them, but every once in a while, you’re going to run into a character difference.
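To see the kind of difference being described, here is a small sketch that decodes the same bytes both ways. The byte values are picked for illustration only, and "latin-1" and "cp1252" are the names Python uses for these two codecs.

```python
# Four bytes: "A", an accented e, and two bytes from the 0x80-0x9F range,
# which is where Latin-1 and Windows-1252 disagree.
raw = bytes([0x41, 0xE9, 0x80, 0x92])

as_latin1 = raw.decode("latin-1")
as_cp1252 = raw.decode("cp1252")

# The first two characters match in both encodings.
print(as_latin1[:2] == as_cp1252[:2])

# Latin-1 maps 0x80 and 0x92 to invisible control characters,
# while Windows-1252 maps them to the euro sign and a curly apostrophe.
print(repr(as_latin1[2:]))
print(repr(as_cp1252[2:]))
```

Because most text never touches that byte range, the two encodings can look interchangeable until one of these characters shows up.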
01:36 If you’re wondering why I’m spending so much time on ASCII when this is a course on Unicode, well, it turns out that Unicode, Latin-1, Windows-1252—they all use the first 128 code points from ASCII.
01:50 So, if you’re sticking with the characters that I described in the previous screen, then the encoding is compatible across all four of these standards. Although Unicode is quickly becoming the de facto encoding, due to history, you still run into other encodings quite frequently.
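A quick way to check this compatibility is to encode the same ASCII-only text with each codec and compare the bytes. This is a sketch that uses UTF-8 as the Unicode encoding; the sample strings are arbitrary.

```python
text = "Hello, ASCII!"
encodings = ["ascii", "latin-1", "cp1252", "utf-8"]

# Encode the same text with each codec.
encoded = {name: text.encode(name) for name in encodings}

# For pure ASCII text, all four produce identical bytes.
all_same = len(set(encoded.values())) == 1
print(all_same)

# Outside the first 128 code points, the encodings diverge.
accented = "\u00e9"  # e with an acute accent
latin1_bytes = accented.encode("latin-1")
utf8_bytes = accented.encode("utf-8")
print(latin1_bytes, utf8_bytes)
```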
02:19 Of course, standards are, well, not always so standard, so depending on what web server you were using and what browsers you were using, there were subtle differences to this. In order to get around this, browsers try to guess the encoding. This works with a varying degree of success, although they’ve gotten much better in the recent past.
Old coders like me used to spend a lot of time on Slashdot. If you’re not familiar, this is a website that aggregates technology news. It’s been around since 1997 and I’m pretty sure some of the code in there is still the original code. It’s notorious for not supporting Unicode, and you can see this in a comment that I’ve clipped here.
02:58 This isn’t because a cat ran across this person’s keyboard—this is because the apostrophe has been interpreted in a different encoding and you get a whole bunch of garbage instead of the poster’s intent.
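This kind of garbage can be reproduced in a few lines. The sketch below assumes the text was written as UTF-8 and then read as Windows-1252, which is one common way this happens. The sample sentence is made up, and the exact encodings involved in the clipped comment are not known.

```python
# A curly apostrophe (U+2019) takes three bytes in UTF-8.
original = "It\u2019s broken"
utf8_bytes = original.encode("utf-8")

# Reading those bytes as Windows-1252 turns the apostrophe into
# three unrelated characters.
garbled = utf8_bytes.decode("cp1252")
print(garbled)

# The damage is reversible as long as nothing was lost along the way.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)
```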
You don’t see them as often anymore, but in the early 2000s, frequently web pages would be littered with these little question marks and blocks. Before browsers got better at guessing the encoding, this was the character that was shown if the character on the page couldn’t be shown in the browser’s current encoding. Thankfully, this problem is mostly solved now.
The Python string module defines a whole bunch of constants that are useful for looking at ASCII. Let’s take a look at a few of them.
ascii_letters is the combination of ascii_lowercase and ascii_uppercase.
digits are the numbers 0 through 9.
hexdigits are the numbers plus the letters a through f in both lowercase and uppercase.
octdigits are the first eight numbers, 0 through 7.
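These constants are plain strings, so you can print them and compare them directly. A short sketch:

```python
import string

print(string.ascii_lowercase)
print(string.ascii_uppercase)
print(string.ascii_letters)
print(string.digits)
print(string.hexdigits)
print(string.octdigits)

# ascii_letters is the lowercase letters followed by the uppercase letters.
letters_match = (
    string.ascii_letters == string.ascii_lowercase + string.ascii_uppercase
)
print(letters_match)

# Membership tests work like they do on any other string.
is_hex = all(ch in string.hexdigits for ch in "c0ffee")
print(is_hex)
```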
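If you pass a string of characters to .rstrip(), that tells it which characters to pull. Passing in string.punctuation combined with string.whitespace will pull all of the question marks and the exclamation marks and the space between them off the right-hand side of that string. string.whitespace on its own only covers the space, not the punctuation.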
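Here is a sketch of that kind of call. The sample string is hypothetical and may not match the one shown on screen. Note that the question marks and exclamation marks live in string.punctuation, so it is combined with string.whitespace here.

```python
import string

# Hypothetical sample text with trailing punctuation and a space in between.
shout = "What's wrong with ASCII?!? !!"

# .rstrip() removes any trailing characters found in its argument.
cleaned = shout.rstrip(string.punctuation + string.whitespace)
print(cleaned)

# With whitespace alone nothing is removed, because the string ends in "!".
whitespace_only = shout.rstrip(string.whitespace)
print(whitespace_only == shout)
```

.rstrip() stops at the first character from the right that isn’t in the set, so punctuation in the middle of the text, like the apostrophe, is left alone.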
You can use the .isprintable() method to see whether or not a string contains only printable characters. One word of caution: .isprintable() doesn’t actually use string.printable, so there’s a subtle difference between the two.
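The mismatch is easy to demonstrate. In this sketch, the variable names are just for illustration:

```python
import string

# Tab and newline are part of string.printable...
tab_in_printable = "\t" in string.printable
newline_in_printable = "\n" in string.printable

# ...but .isprintable() reports them as not printable.
tab_is_printable = "\t".isprintable()
newline_is_printable = "\n".isprintable()

# So string.printable as a whole fails its own namesake check.
whole_is_printable = string.printable.isprintable()

# Collect the characters the two disagree on.
disputed = "".join(ch for ch in string.printable if not ch.isprintable())

print(tab_in_printable, tab_is_printable)
print(whole_is_printable)
print(repr(disputed))
```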
This is because .isprintable() tells you whether or not something is printable within the repr() representation, that is, whether repr() shows the character as is instead of escaping it. That repr() representation escapes tabs and newlines rather than including them directly, so you get into the strange situation where string.printable, which does include those characters, isn’t printable.
05:33 Before digging into Unicode and how it’s represented, you’re going to need a little bit of computer science math. So in the next episode, I’ll be reviewing bits, bytes, octal, and hex representations.