Unicode in Python: Working With Character Encodings (Summary)
Congratulations on learning more about character encodings! In this lesson, you’ll cover a few caveats to remember when you’re working with encodings and see some resources you can check out to keep learning.
In this course, you learned about:
- Fundamental concepts of character encodings and numbering systems
- Integer, binary, octal, hex, str, and bytes literals in Python
- Differences between Unicode code points and UTF-8 encoding
- Python’s built-in functions related to character encoding and numbering systems
- Other encoding formats included in Python’s Standard Library
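As a quick refresher on the literals and built-ins listed above, here is a minimal sketch. The sample values (the number 97 and the word "café") are chosen only for illustration and are not taken from the course.

```python
# The same integer written as decimal, binary, octal, and hex literals.
values = [97, 0b1100001, 0o141, 0x61]
print(values)                        # [97, 97, 97, 97]

# Built-ins that convert between numbers and their string forms.
print(bin(97), oct(97), hex(97))     # 0b1100001 0o141 0x61
print(int("61", 16))                 # 97

# Built-ins that convert between code points and characters.
print(chr(97), ord("a"))             # a 97

# A str literal holds code points; a bytes literal holds raw bytes.
text = "caf\u00e9"                   # 'café', 4 code points
data = b"caf\xc3\xa9"                # the UTF-8 encoding of 'café', 5 bytes
print(text.encode("utf-8") == data)  # True
```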
It’s very important to know the encoding of any data you read. Using the wrong encoding may raise an exception or, worse, the read may succeed but give you the wrong content.
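Both failure modes can be reproduced in a few lines. This is only an illustrative sketch: the sample word and the two "wrong" encodings (ASCII and Latin-1) are assumptions picked to make each case easy to see.

```python
data = "caf\u00e9".encode("utf-8")   # b'caf\xc3\xa9', the UTF-8 bytes for 'café'

# Wrong guess, loud failure: ASCII has no mapping for bytes above 0x7F.
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)

# Wrong guess, silent failure: Latin-1 maps every byte to some character,
# so decoding succeeds but the result is garbled.
wrong = data.decode("latin-1")
print(wrong)                         # cafÃ©

# Right guess: the original text comes back.
right = data.decode("utf-8")
print(right)                         # café
```

The silent case is the more dangerous one, because nothing in the program signals that the text is now wrong.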
Wikipedia has some useful pages:
- List of Unicode Characters
- Unicode Block
- Combining Diacritical Marks
- Extended ASCII
You can also check out these resources:
- Python documentation: Unicode changes in Python 3
- Python documentation: Unicode how-to
- Python documentation: Supported encodings
- Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- Kunststube: What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
- Mozilla: A composite approach to language/encoding detection
Congratulations, you made it to the end of the course! What’s your #1 takeaway or favorite thing you learned? How are you going to put your newfound skills to use? Leave a comment in the discussion section and let us know.
Well, you’ve made it through eight lessons on Unicode. You’ll recall that I started off with the basics of encoding, talked about the Python string module and the constants that are available to manipulate ASCII, took a detour down Computer Science Lane and talked about bits and bytes and how they can be represented in oct and hex.
00:19 And no Unicode course would be complete without a section on Unicode. Lesson 5 talked about how UTF-8 actually is represented in binary. Lesson 6 looked at digraphs and ligatures and other kinds of combined characters.
00:45 In this lesson, I’m going to talk about a couple of remaining corner cases and point you at some references and possible future reading material. It’s important to remember that all input is bytes until it’s decoded.
A Python-specific problem is that open() takes an encoding argument, but it has a default, and the default is platform-specific. If you’re opening a text file, i.e. not specifying a binary mode, and you don’t explicitly name the encoding, you will get the operating system’s encoding.
On a Mac, that’s UTF-8. On older versions of Windows, it was cp1252. On more recent ones, it might be UTF-16. You can see what the default encoding is by calling the getpreferredencoding() function of the locale module.
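A minimal sketch of the points above: bytes stay bytes until you decode them, and naming the encoding in open() avoids relying on the platform default. The temporary file and its sample contents are assumptions made up for this example.

```python
import locale
import os
import tempfile

# The platform-dependent default; the value printed differs between machines.
print(locale.getpreferredencoding(False))

# Write known UTF-8 bytes to a temporary file (hypothetical sample text).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write("na\u00efve caf\u00e9".encode("utf-8"))

# Binary mode: nothing is decoded, so you get bytes back.
with open(path, "rb") as f:
    raw = f.read()

# Text mode with an explicit encoding: the same result on every platform.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

os.remove(path)
print(type(raw).__name__, type(text).__name__)  # bytes str
```

Leaving out encoding="utf-8" in the second open() call would make the result depend on whatever getpreferredencoding() reports on the machine running the code.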
[...] lookup() together, you can go back and forth.
Wikipedia has a ton of content on Unicode. There’s the Unicode article itself, and then there are breakdowns on Unicode character lists, the different sections of Unicode and how they’re blocked together, how to do the combinations, and then, of course, specifics to the encodings like UTF-8. In addition to Wikipedia, unicode.org itself has a rich amount of material and examples that you can pull from. If you’re looking for other encodings, it’s back to Wikipedia.
03:53 There’s plenty there on ASCII, extended ASCII, Latin-1, and Windows-1252. If my babbling about digraphs and ligatures was interesting to you, Wikipedia has got even more information there as well.
04:07 Joel on Software is a great source for programmers, and his blog entry on the minimum you need to know about Unicode is quite in-depth. Additionally, David Zentgraf’s article and the Mozilla article on detecting encodings also cover lots of useful information. Specific to Python, you can look at the What’s New in Python 3.0 article that talks about how text and bytes have changed, and the default Unicode mechanisms in Python 3.
04:33 Understanding Unicode is so necessary that Python has a full how-to on it, and deep within the documentation, you can find a full listing of the supported encodings. Given the topic, it seems only appropriate to say merci, grazie, gracias.