Working With ASCII and the Python String Module
There are tens if not hundreds of character encodings. In this lesson, you’ll start by exploring one of the simplest character encodings, ASCII. This is a good place to start learning about character encoding because ASCII is a small and contained encoding.
The built-in Python string module includes several constants that categorize ASCII text. You’ll use these string constants to identify character sets, such as string.ascii_letters, string.digits, string.whitespace, and string.punctuation.
00:00 In the previous lesson, I introduced you to characters, code points, and the encoding thereof. In this lesson, I’m going to dive further into ASCII and its support in the Python string module.
00:10 ASCII became one of the most common standards for encoding because it was used by PCs early on. ASCII only encodes the basic Latin alphabet. There are no accented characters. The original encoding is 7 bits, so 128 characters in total, and it can be divided up into a series of groups.
00:28 The first 32 are control characters. They’re non-printable. These include things like printer controls, the bell sound, and carriage return. The next chunk is the space, a series of symbols, and numbers.
00:42 After that come capital letters, a few more symbols, lowercase letters, a few more symbols, and then finally, the character for deletion. The original ASCII was a 7-bit encoding, and so went from 0 to 127. PCs used 8-bit bytes, so oftentimes, the leading 8th bit was used for parity checks during transmission.
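The grouping described above can be checked with a short sketch. The function name ascii_group and its labels are made up for illustration. They are not part of the string module.

```python
from collections import Counter


def ascii_group(code_point: int) -> str:
    """Return an illustrative label for a 7-bit ASCII code point."""
    if not 0 <= code_point <= 127:
        raise ValueError("not a 7-bit ASCII code point")
    if code_point < 32:
        return "control"      # non-printable: bell, carriage return, ...
    if code_point == 127:
        return "delete"       # the final code point, DEL
    ch = chr(code_point)
    if ch == " ":
        return "space"
    if ch.isdigit():
        return "digit"
    if ch.isupper():
        return "uppercase"
    if ch.islower():
        return "lowercase"
    return "symbol"           # everything else is punctuation


# Tally how many of the 128 code points fall into each group.
counts = Counter(ascii_group(cp) for cp in range(128))
for group, total in counts.items():
    print(f"{group:<10} {total}")
```

The tallies add up to 128, with 32 control characters at the start of the table.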
01:04 It didn’t take long to figure out that ASCII was insufficient to handle other kinds of languages. Accented characters for Latin and Germanic languages were added by extending ASCII to use the full 8 bits. This wasn’t the only extension.
01:17 Another one was called Latin-1. Latin-1 was then modified by Microsoft to create Windows-1252. Latin-1 and 1252 are very, very close, which causes all sorts of problems because it looks like you can interchange them, but every once in a while, you’re going to run into a character difference.
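To see where the two encodings part ways, here is a minimal sketch that decodes every possible byte with Python's latin-1 and cp1252 codecs and records the byte values that disagree. It assumes Python's codec tables, in which a handful of cp1252 bytes are unassigned and raise an error when decoded.

```python
differences = []
for value in range(256):
    raw = bytes([value])
    latin1_char = raw.decode("latin-1")  # latin-1 maps every byte
    try:
        cp1252_char = raw.decode("cp1252")
    except UnicodeDecodeError:
        cp1252_char = None  # byte is unassigned in Python's cp1252 codec
    if latin1_char != cp1252_char:
        differences.append(value)

# Show the range of byte values where the two codecs disagree.
print(hex(differences[0]), hex(differences[-1]), len(differences))

# One concrete case: the same byte is the euro sign in cp1252
# but a non-printable control character in latin-1.
print(repr(b"\x80".decode("cp1252")))
print(repr(b"\x80".decode("latin-1")))
```

Every disagreement falls in the 0x80 to 0x9F block. Outside that block the two encodings are identical, which is why text often survives being mislabeled until one of those characters shows up.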
01:36 If you’re wondering why I’m spending so much time on ASCII when this is a course on Unicode, well, it turns out that Unicode, Latin-1, Windows-1252—they all use the first 128 code points from ASCII.
01:50 So, if you’re sticking with the characters that I described in the previous screen, then the encoding is compatible across all four of these standards. Although Unicode is quickly becoming the de facto encoding, due to history, you still run into other encodings quite frequently.
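That compatibility is easy to check. In the sketch below, UTF-8 stands in for Unicode and the sample strings are arbitrary choices: pure ASCII text produces identical bytes under all four codecs, while an accented character does not.

```python
text = "Hello, ASCII!"
encodings = ["ascii", "latin-1", "cp1252", "utf-8"]

# Encode the same ASCII-only text with each codec.
encoded = {name: text.encode(name) for name in encodings}
all_identical = len(set(encoded.values())) == 1
print(all_identical)

# Step outside the first 128 code points and the bytes diverge.
accented = "é"
latin1_bytes = accented.encode("latin-1")
utf8_bytes = accented.encode("utf-8")
print(latin1_bytes, utf8_bytes)
```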
02:06 The web is one of those places. Latin-1 was the original default encoding for documents delivered over HTTP. Anything with a MIME type of text/, unless you specify otherwise, is using Latin-1.
02:19 Of course, standards are, well, not always so standard, so depending on what web server you were using and what browsers you were using, there were subtle differences to this. In order to get around this, browsers try to guess the encoding. This works with a varying degree of success, although they’ve gotten much better in the recent past. Old coders like me used to spend a lot of time on Slashdot. If you’re not familiar, this is a website that aggregates technology news. It’s been around since 1997 and I’m pretty sure some of the code in there is still the original code. It’s notorious for not supporting Unicode, and you can see this in a comment that I’ve clipped here.
02:58 This isn’t because a cat ran across this person’s keyboard—this is because the apostrophe has been interpreted in a different encoding and you get a whole bunch of garbage instead of the poster’s intent.
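This kind of garbage can be reproduced in a couple of lines. The sketch below assumes the text was written as UTF-8 and then read back as Windows-1252. The exact encodings behind the clipped comment aren't stated, so treat this pairing as one plausible illustration.

```python
original = "It’s"  # uses a typographic apostrophe, U+2019

# Encode as UTF-8, then misread the bytes as Windows-1252.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)

# Reversing the mistake recovers the text, as long as no bytes were lost.
recovered = garbled.encode("cp1252").decode("utf-8")
print(recovered == original)
```

The single apostrophe takes three bytes in UTF-8, and each byte is displayed as its own Windows-1252 character, which is why one character turns into three.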
03:10 You don’t see them as often anymore, but in the early 2000s, web pages were frequently littered with little question marks and blocks. Before browsers got better at guessing the encoding, this placeholder was shown whenever a character on the page couldn’t be displayed in the browser’s current encoding. Thankfully, this problem is mostly solved now. The Python string module defines a whole bunch of constants that are useful for looking at ASCII. Let’s take a look at a few of them.
03:36 string.whitespace defines tab, newline, and others to be whitespace characters. ascii_lowercase and ascii_uppercase show the alphabet letters.
03:46 ascii_letters is the combination of those two. digits are the numbers 0 through 9. hexdigits are the numbers plus the letters a through f in both lowercase and uppercase. octdigits are the first eight numbers, 0 through 7. punctuation is the ASCII symbols.
04:02 And finally, string.printable shows all of these combined.
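Here is a short sketch that prints each of these constants and checks how they relate to one another. repr() is used so that the whitespace characters show up as escape sequences instead of blank space.

```python
import string

names = [
    "whitespace", "ascii_lowercase", "ascii_uppercase", "ascii_letters",
    "digits", "hexdigits", "octdigits", "punctuation", "printable",
]
for name in names:
    # Look up each constant by name and show it in escaped form.
    print(f"string.{name} = {getattr(string, name)!r}")

# ascii_letters is the lowercase letters followed by the uppercase ones.
letters_match = string.ascii_letters == (
    string.ascii_lowercase + string.ascii_uppercase
)

# printable is built from digits, letters, punctuation, and whitespace.
printable_match = string.printable == (
    string.digits + string.ascii_letters
    + string.punctuation + string.whitespace
)
print(letters_match, printable_match)
```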
04:08 Let’s crack open the REPL and take a look at this in practice. I’m going to import string so I can get access to those constants that I just showed you. Type in a question.
04:20 Now, let’s say you wanted to pull the punctuation and space off the right-hand side. The .rstrip() method removes characters from the right-hand end of a string.
04:28 If you pass an argument to .rstrip(), it tells the method which characters to remove. Passing in string.punctuation and string.whitespace will pull all of the question marks and the exclamation marks and the space between them off the right-hand side of that string.
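A sketch of that REPL step follows. The exact question typed in the video isn't captured in this transcript, so the string below is a made-up stand-in.

```python
import string

question = "Is this a question?! !!"  # hypothetical example text

# rstrip() treats its argument as a set of characters, not a suffix:
# it keeps removing trailing characters until it hits one not in the set.
stripped = question.rstrip(string.punctuation + string.whitespace)
print(repr(stripped))
```

Because the argument is a set of characters, their order doesn't matter, and stripping stops at the final n in question, the first character from the right that is neither punctuation nor whitespace.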
04:45 You can use the .isascii() method to see whether or not a value is ASCII.
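A quick sketch of .isascii() with arbitrary sample values. The method is available on str and bytes from Python 3.7 onward.

```python
plain = "resume"
accented = "résumé"

# True only when every character is in the 0 to 127 range.
plain_is_ascii = plain.isascii()
accented_is_ascii = accented.isascii()
print(plain_is_ascii, accented_is_ascii)

# Control characters count as ASCII too, and so does the empty string.
print("\t\n".isascii(), "".isascii())
```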
04:51 You can use the .isprintable() method to see whether or not it contains printable characters. One word of caution: .isprintable() doesn’t actually use string.printable, so there’s a subtle difference between the two.
05:06 .isprintable() on blanks is False, even though string.printable includes the tab and newline characters.
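The mismatch can be shown directly. The value assigned to blanks below is an assumption, since the transcript doesn't show what the variable held in the video.

```python
import string

blanks = "\t\n"  # hypothetical value for the variable used in the video

# Every character in blanks is a member of string.printable...
in_printable = all(ch in string.printable for ch in blanks)

# ...yet the str method reports that the value is not printable.
method_result = blanks.isprintable()
print(in_printable, method_result)

# The same applies to string.printable as a whole.
printable_is_printable = string.printable.isprintable()
print(printable_is_printable)

# An ordinary space is printable according to both.
print(" " in string.printable, " ".isprintable())
```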
05:13 This is because .isprintable() tells you whether or not every character in a value would be shown as-is, without escaping, in the repr() representation. repr() escapes tabs and newlines instead of showing them directly, so you get into the strange situation where string.printable, which does include those characters, isn’t printable.
05:33 Before digging into Unicode and how it’s represented, you’re going to need a little bit of computer science math. So in the next episode, I’ll be reviewing bits, bytes, octal, and hex representations.