Unicode
Unicode is a universal character encoding standard that assigns a unique number (code point) to every character in every language, plus symbols, emojis, and control characters. It enables consistent representation and handling of text across different platforms and languages.
It extends ASCII by preserving its first 128 characters while supporting more than a million code points, which encodings such as UTF-8 store using 1 to 4 bytes each.
Key Concepts
Code Points
A code point is a unique number assigned to each character in the Unicode standard, typically written in hexadecimal with a U+ prefix (e.g., U+0041 for “A”, U+1F600 for 😀).
Encoding Schemes
Encoding schemes determine how Unicode code points are stored as bytes (a few are compared in the sketch after this list):
- UTF-8: Variable-length encoding (1-4 bytes), backward compatible with ASCII
- UTF-16: Variable-length encoding (2 or 4 bytes), common on Windows
- UTF-32: Fixed-length encoding (4 bytes), simple but space-inefficient
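As a quick, illustrative sketch of the byte counts involved (the specific values assume the characters shown here):
>>> "A".encode("utf-8")       # 1 byte, identical to the ASCII encoding
b'A'
>>> "A".encode("utf-32-be")   # always 4 bytes per code point
b'\x00\x00\x00A'
>>> "世".encode("utf-8")      # 3 bytes in UTF-8
b'\xe4\xb8\x96'
>>> "😀".encode("utf-8")      # 4 bytes in UTF-8 (code point U+1F600)
b'\xf0\x9f\x98\x80'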
Python and Unicode
In Python, all strings are Unicode by default. The str type represents Unicode text, while the bytes type represents encoded text data.
>>> # Unicode string (str type)
>>> text = "Hello, 世界 🌍"
>>> # Encode to bytes
>>> utf8_bytes = text.encode("utf-8")
>>> utf8_bytes
b'Hello, \xe4\xb8\x96\xe7\x95\x8c \xf0\x9f\x8c\x8d'
>>> # Decode back to string
>>> decoded = utf8_bytes.decode("utf-8")
>>> decoded
'Hello, 世界 🌍'
Getting code points:
>>> # Get code point of a character
>>> ord("A")
65
>>> ord("世")
19990
>>> # Get character from code point
>>> chr(65)
'A'
>>> chr(19990)
'世'
Different sequences of code points can represent the same character. Unicode normalization ensures consistent representation:
>>> import unicodedata
>>> # Normalize to NFC (composed form)
>>> unicodedata.normalize("NFC", text)
'Hello, 世界 🌍'
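For example, "é" can be stored either as one precomposed code point or as "e" followed by a combining accent. A short sketch, assuming these exact code points:
>>> composed = "\u00e9"        # é as a single code point
>>> decomposed = "e\u0301"     # e + combining acute accent
>>> composed == decomposed
False
>>> unicodedata.normalize("NFC", decomposed) == composed
True
>>> unicodedata.normalize("NFD", composed) == decomposed
True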
Common Pitfalls
- Mixing encodings: Always know what encoding your data uses
- Assuming one character = one code point: Some characters, such as emoji with skin tone modifiers, are built from multiple code points (see the sketch after this list)
- File encoding issues: Specify encoding when opening files
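A brief sketch of how the first two pitfalls show up in practice (the byte values and code points below are assumptions for these specific examples):
>>> # Decoding UTF-8 bytes with the wrong codec silently produces mojibake
>>> "世".encode("utf-8").decode("latin-1")
'ä¸\x96'
>>> # One visual character can be several code points: thumbs-up + skin tone modifier
>>> thumbs = "\U0001F44D\U0001F3FD"   # 👍🏽
>>> len(thumbs)
2
>>> [hex(ord(c)) for c in thumbs]
['0x1f44d', '0x1f3fd']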
Best Practices
- Always specify encoding explicitly when reading/writing files
- Use UTF-8 as your default encoding
- Handle encoding errors gracefully with error handlers ("ignore", "replace", "strict")
- Be aware that string length (len()) counts code points, not visual characters; see the sketch after this list
Related Terms
- ASCII: 7-bit character encoding, subset of Unicode (first 128 characters)
- Code Unit: The minimal bit combination in a character encoding (e.g., 8 bits in UTF-8)
- BOM (Byte Order Mark): Optional marker at the start of a text stream indicating encoding
- Surrogate Pairs: UTF-16 mechanism for encoding code points beyond U+FFFF
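To make the last two entries concrete (assuming these specific byte values), UTF-16 splits a code point above U+FFFF into two surrogate code units, and Python's utf-8-sig codec prepends a UTF-8 BOM:
>>> # U+1F600 needs a surrogate pair in UTF-16: D83D (high) + DE00 (low)
>>> "\U0001F600".encode("utf-16-be").hex(" ")
'd8 3d de 00'
>>> # utf-8-sig writes the BOM bytes EF BB BF before the text
>>> "A".encode("utf-8-sig")
b'\xef\xbb\xbfA'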
Related Resources
Tutorial
Unicode & Character Encodings in Python: A Painless Guide
In this tutorial, you'll get a Python-centric introduction to character encodings and Unicode. Handling character encodings and numbering systems can at times seem painful and complicated, but this guide is here to help with easy-to-follow Python examples.