Unicode in Python: Working With Character Encodings (Summary)
Congratulations on learning more about character encodings! In this lesson, you’ll cover a few caveats to remember when you’re working with encodings and see some resources you can check out to keep learning.
In this course, you learned about:
- Fundamental concepts of character encodings and numbering systems
- Integer, binary, octal, hex, str, and bytes literals in Python
- Differences between Unicode code points and UTF-8 encoding
- Python’s built-in functions related to character encoding and numbering systems
- Other encoding formats included in Python’s Standard Library
It’s very important to know the encoding of any data you read. Using the wrong encoding may raise an exception or, worse, the data may decode without error but contain the wrong content.
Wikipedia has some useful pages:
- Unicode
- List of Unicode Characters
- Unicode Block
- Combining Diacritical Marks
- UTF-8
- ASCII
- Extended ASCII
- ISO/IEC 8859-1
- Windows-1252
- Digraph
- Orthographic ligature
You can also check out these resources:
- Python documentation: Unicode changes in Python 3
- Python documentation: Unicode how-to
- Python documentation: Supported encodings
- Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- Kunststube: What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
- Mozilla: A composite approach to language/encoding detection
Congratulations, you made it to the end of the course! What’s your #1 takeaway or favorite thing you learned? How are you going to put your newfound skills to use? Leave a comment in the discussion section and let us know.
00:00
Well, you’ve made it through eight lessons on Unicode. You’ll recall that I started off with the basics of encoding, talked about the Python string module and the constants that are available to manipulate ASCII, took a detour down Computer Science Lane and talked about bits and bytes and how they can be represented in oct and hex.
00:19 And no Unicode course would be complete without a section on Unicode. Lesson 5 talked about how UTF-8 actually is represented in binary. Lesson 6 looked at digraphs and ligatures and other kinds of combined characters.
00:33 Lesson 7 gave a tour of built-in Python functions that are helpful when dealing with Unicode or byte conversion. And the last lesson was on encodings besides UTF-8.
00:45 In this lesson, I’m going to talk about a couple of remaining corner cases and point you at some references and possible future reading material. It’s important to remember that all input is bytes until it’s decoded.
00:57 If you assume the encoding of your data, you may run into trouble. Let’s say you were accessing a recipe site API, and you got the following chunk of data.
01:09 If you make an assumption about the decoding…
01:15
you could be in trouble. Hex bc on its own is not valid UTF-8.
01:23 Change the encoding to Latin-1, and all of a sudden the data makes an awful lot more sense.
01:31
The symbol for one quarter in UTF-8 isn’t bc, but c2 bc. There are worse cases than getting an exception. At least when you get an exception, you know something went wrong.
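The on-screen data isn't reproduced in this transcript, so here is a minimal sketch of the same situation. The payload bytes are a made-up stand-in, not the course's actual API response.

```python
# Hypothetical payload: a lone 0xbc byte followed by ASCII text.
raw = b"\xbc cup of flour"

# Assuming UTF-8 fails, because 0xbc cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_result = "decoded"
except UnicodeDecodeError as exc:
    utf8_result = f"failed: {exc.reason}"

# Latin-1 maps every byte to a character, and 0xbc is the one-quarter sign.
latin1_text = raw.decode("latin-1")

# In UTF-8 the same character takes two bytes.
quarter_utf8 = "¼".encode("utf-8")

print(utf8_result)
print(latin1_text)
print(quarter_utf8)
```

Note that Latin-1 never raises on decode, so a clean result only tells you the bytes were decodable, not that Latin-1 was the right guess.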
01:43 Consider the following piece of Norse. Encoding it…
01:51 and then decoding it in UTF-16 by accident, results in a different character. No error, no exception. Your data is now dirty and wherever you put it, it’ll be wrong.
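A small sketch of that silent failure follows. The sample character is an assumption chosen for illustration, not the text used in the video, and the example names utf-16-le explicitly so the result doesn't depend on the machine's byte order.

```python
# Hypothetical sample: the letter eth (U+00F0), used in Old Norse.
original = "ð"

# Encoded as UTF-8 it becomes two bytes: c3 b0.
raw = original.encode("utf-8")

# Decoding those two bytes as UTF-16 (little-endian) reads them as a single
# 16-bit code unit, 0xB0C3, which happens to be a valid Hangul syllable.
wrong = raw.decode("utf-16-le")

print(raw)
print(f"U+{ord(wrong):04X}")
print(wrong == original)
```

No exception is raised at any point, which is exactly why this kind of mistake is hard to notice.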
02:05
A Python-specific problem is the open() function. open() accepts an encoding argument, but it’s optional, and the default is platform-specific. If you’re opening a text file, that is, not specifying binary mode, and you don’t explicitly name the encoding, you will get the platform’s preferred encoding.
02:26
On a Mac, that’s UTF-8. On older versions of Windows, it was cp1252. On more recent ones, it might be UTF-16. You can see what the default encoding is by calling the getpreferredencoding() function of the locale module.
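As a rough sketch of the safer habit, the snippet below checks the platform default and then names the encoding explicitly on both write and read. The file name and sample text are placeholders made up for this example.

```python
import locale
import os
import tempfile

# What open() would fall back to in text mode if no encoding is given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

text = "¼ cup of sugar"  # placeholder content

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "recipe.txt")  # hypothetical file name

    # Name the encoding explicitly so the result is the same on every platform.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

print(round_trip == text)
```

Passing encoding="utf-8" on every text-mode open() keeps the behavior identical across operating systems, whatever the locale reports.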
02:42
Python ships with a module that represents the Unicode database. It’s called unicodedata. You can use this to do lookups on your characters or on your code points.
02:53 Let’s look at it in action.
02:58
The name() function takes a str of a single character and returns the Unicode name for that character.
03:10
The lookup() function does the opposite. Given the name 'EURO SIGN', it returns the corresponding character.
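The terminal session from the video isn't part of this transcript, so here is a short sketch of those two calls. The variable names are only for illustration.

```python
import unicodedata

# Character to name.
euro_name = unicodedata.name("€")

# Name to character.
euro_char = unicodedata.lookup("EURO SIGN")

# The code point behind the character, for reference.
code_point = f"U+{ord(euro_char):04X}"

# Feeding one function's result into the other returns the starting character.
round_trip = unicodedata.lookup(unicodedata.name("é"))

print(euro_name)
print(euro_char)
print(code_point)
print(round_trip)
```

Both functions raise an exception for unknown input: name() raises ValueError for a character with no name unless you pass a default, and lookup() raises KeyError for a name it doesn't recognize.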
03:20
By using name() and lookup() together you can go back and forth.
Wikipedia has a ton of content on Unicode. There’s the Unicode article itself, and then there are breakdowns on Unicode character lists, the different sections of Unicode and how they’re blocked together, how to do the combinations, and then, of course, specifics to the encodings like UTF-8. In addition to Wikipedia, unicode.org itself has a rich amount of material and examples that you can pull from. If you’re looking for other encodings, it’s back to Wikipedia.
03:53 There’s plenty there on ASCII, extended ASCII, Latin-1, and Windows-1252. If my babbling about digraphs and ligatures was interesting to you, Wikipedia has got even more information there as well.
04:07 Joel on Software is a great source for programmers, and his blog entry on the minimum you need to know about Unicode is quite in-depth. Additionally, David Zentgraf’s article and the Mozilla article on detecting encodings also cover lots of useful information. Specific to Python, you can look at the What’s New in Python 3.0 article, which covers how text and bytes have changed and the default Unicode mechanisms in Python 3.
04:33 Understanding Unicode is so necessary that Python has a full how-to on it, and deep within the documentation, you can find a full listing of the supported encodings. Given the topic, it seems only appropriate to say merci, grazie, gracias.
04:49 Thanks for your attention. I hope it’s been informative.
Pradeep Kumar on July 6, 2020
Awesome Course!!!
Ranjit Shrivastva on Aug. 21, 2020
Interesting topic…Thanks for unicode detail.
sacsachin on Oct. 10, 2020
Great tutorial.
DoubleA on Jan. 24, 2021
Thank you for sharing your deep knowledge of the topic. For me as a beginner it’s hard to grasp 100% of the stuff just now, but the big picture has now become so much clearer!
Christopher Trudeau RP Team on Jan. 24, 2021
Glad you enjoyed it @DoubleA. Feel free to post questions if you need clarity on something.
Alain Rouleau on July 2, 2020
Very interesting, thanks!