Unicode

Unicode is a universal character encoding standard that assigns a unique number, called a code point, to each character it covers: the letters and signs of the world's writing systems, plus symbols, emoji, and control characters. It enables consistent representation and handling of text across different platforms and languages.

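It extends ASCII by keeping ASCII's 128 characters as its first 128 code points, and it leaves room for 1,114,112 code points in total (U+0000 through U+10FFFF). How many bytes a code point occupies depends on the encoding. In UTF-8, for example, each code point takes 1 to 4 bytes.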

Key Concepts

Code Points

A code point is a unique number assigned to each character in the Unicode standard, typically written in hexadecimal with a U+ prefix (e.g., U+0041 for “A”, U+1F600 for 😀).
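
As a small illustration of the U+ notation described above, the following sketch formats a character's code point as a label. The helper name code_point_label is made up for this example and is not part of Python's standard library.

```python
def code_point_label(char: str) -> str:
    """Return the U+XXXX label for a single character (illustrative helper)."""
    # ord() gives the code point as an integer; format it as
    # uppercase hexadecimal, zero-padded to at least four digits.
    return f"U+{ord(char):04X}"


for ch in ["A", "世", "\U0001F600"]:
    print(ch, code_point_label(ch))
# A U+0041
# 世 U+4E16
# 😀 U+1F600
```

Code points above U+FFFF simply produce a longer label, because the format specifier sets a minimum width, not a maximum.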

Encoding Schemes

Encoding schemes determine how Unicode code points are stored as bytes:

  • UTF-8: Variable-length encoding (1-4 bytes), backward compatible with ASCII
  • UTF-16: Variable-length encoding (2 or 4 bytes), common in Windows
  • UTF-32: Fixed-length encoding (4 bytes), simple but space-inefficient
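
To make the size differences in the list above concrete, here is a minimal sketch that counts the bytes each encoding needs for one character. The function name encoded_sizes is hypothetical, and the little-endian codec variants are an assumption made so that Python doesn't prepend a byte order mark to the result.

```python
def encoded_sizes(char: str) -> dict:
    """Return the byte count of one character in each encoding (illustrative helper)."""
    # "utf-16-le" and "utf-32-le" fix the byte order, so no BOM is added
    # and the counts reflect only the character itself.
    return {
        "utf-8": len(char.encode("utf-8")),
        "utf-16": len(char.encode("utf-16-le")),
        "utf-32": len(char.encode("utf-32-le")),
    }


for ch in ["A", "\u00e9", "世", "\U0001F600"]:
    print(ch, encoded_sizes(ch))
# A {'utf-8': 1, 'utf-16': 2, 'utf-32': 4}
# é {'utf-8': 2, 'utf-16': 2, 'utf-32': 4}
# 世 {'utf-8': 3, 'utf-16': 2, 'utf-32': 4}
# 😀 {'utf-8': 4, 'utf-16': 4, 'utf-32': 4}
```

UTF-8 is the most compact for ASCII text, UTF-16 is more compact for many East Asian characters, and UTF-32 always uses four bytes.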

Python and Unicode

In Python 3, all strings are Unicode by default. The str type represents Unicode text as a sequence of code points, while bytes holds raw binary data, such as text that has been encoded.

Python
>>> # Unicode string (str type)
>>> text = "Hello, 世界 🌍"

>>> # Encode to bytes
>>> utf8_bytes = text.encode("utf-8")
>>> utf8_bytes
b'Hello, \xe4\xb8\x96\xe7\x95\x8c \xf0\x9f\x8c\x8d'

>>> # Decode back to string
>>> decoded = utf8_bytes.decode("utf-8")
>>> decoded
'Hello, 世界 🌍'

Getting code points:

Python
>>> # Get code point of a character
>>> ord("A")
65
>>> ord("世")
19990

>>> # Get character from code point
>>> chr(65)
'A'
>>> chr(19990)
'世'

Different sequences of code points can represent the same character. Unicode normalization ensures consistent representation:

Python
>>> import unicodedata

>>> # Normalize to NFC (composed form); this text is already composed, so it comes back unchanged
>>> unicodedata.normalize("NFC", text)
'Hello, 世界 🌍'
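
The effect of normalization is easier to see with a character that has two representations. The sketch below uses é, which can be written as one code point (U+00E9) or as the letter e followed by a combining acute accent (U+0301). The variable names are chosen for illustration only.

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e + COMBINING ACUTE ACCENT

# The two strings look the same on screen but compare as different.
print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 1 2

# NFC composes, NFD decomposes; after normalizing both sides
# to the same form, the comparison succeeds.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed)    # True
print(nfd == decomposed)  # True
```

Normalizing both sides to the same form before comparing or hashing text avoids mismatches between strings that look identical.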

Common Pitfalls

  • Mixing encodings: Always know what encoding your data uses
  • Assuming one character = one code point: Some characters require multiple code points, such as an emoji combined with a skin tone modifier
  • File encoding issues: Specify encoding when opening files
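
Two of the pitfalls above can be reproduced in a few lines. This is a minimal sketch with sample strings chosen for illustration: the first part decodes UTF-8 bytes with the wrong codec, and the second counts the code points in an emoji that carries a skin tone modifier.

```python
# Pitfall: mixing encodings. These bytes are UTF-8, but they are
# decoded as Latin-1, which silently yields garbled text.
utf8_bytes = "caf\u00e9".encode("utf-8")  # "café"
wrong = utf8_bytes.decode("latin-1")
right = utf8_bytes.decode("utf-8")
print(wrong)  # cafÃ©
print(right)  # café

# Pitfall: one visible character, several code points.
# THUMBS UP SIGN (U+1F44D) followed by a skin tone modifier (U+1F3FD).
thumbs_up = "\U0001F44D\U0001F3FD"
print(len(thumbs_up))  # 2
print([f"U+{ord(c):04X}" for c in thumbs_up])  # ['U+1F44D', 'U+1F3FD']
```

Decoding with Latin-1 never raises an error because every byte value is valid in that codec, which is why this kind of mistake often goes unnoticed until the text is displayed.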

Best Practices

  • Always specify encoding explicitly when reading/writing files
  • Use UTF-8 as your default encoding
  • Handle encoding errors gracefully with error handlers ("ignore", "replace", "strict")
  • Be aware that string length (len()) counts code points, not visual characters

Related Terms

  • ASCII: 7-bit character encoding whose 128 characters are the first 128 code points of Unicode
  • Code Unit: The minimal bit combination in a character encoding (e.g., 8 bits in UTF-8, 16 bits in UTF-16)
  • BOM (Byte Order Mark): Optional marker at the start of a text stream that signals the byte order and, in practice, the encoding
  • Surrogate Pairs: UTF-16 mechanism for encoding code points beyond U+FFFF as two 16-bit code units
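
Several of the points above can be tried out directly. The following is a rough sketch, not a recommended recipe: it shows the three error handlers, an explicit encoding argument when writing and reading a file, the BOM added by Python's "utf-8-sig" codec, and the surrogate pair that UTF-16 uses for U+1F600. The file name sample.txt and all variable names are made up for the example.

```python
import codecs
import os
import tempfile

data = "Hello, 世界".encode("utf-8")

# "strict" is the default handler: undecodable bytes raise an error.
try:
    data.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = data.decode("ascii", errors="ignore")    # drops the bad bytes
replaced = data.decode("ascii", errors="replace")  # U+FFFD for each bad byte

# Pass the encoding explicitly instead of relying on the platform default.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write("Hello, 世界")
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()

# The "utf-8-sig" codec prepends a BOM when encoding.
with_bom = "A".encode("utf-8-sig")

# U+1F600 lies beyond U+FFFF, so UTF-16 stores it as a surrogate pair.
surrogates = "\U0001F600".encode("utf-16-be").hex()

print(strict_failed)                          # True
print(repr(ignored))                          # 'Hello, '
print(replaced.count("\ufffd"))               # 6
print(round_trip)                             # Hello, 世界
print(with_bom.startswith(codecs.BOM_UTF8))   # True
print(surrogates)                             # d83dde00
```

Note that "ignore" and "replace" lose information, so they suit display or logging better than data that must round-trip.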

Tutorial

Unicode & Character Encodings in Python: A Painless Guide

In this tutorial, you'll get a Python-centric introduction to character encodings and Unicode. Handling character encodings and numbering systems can at times seem painful and complicated, but this guide is here to help with easy-to-follow Python examples.


By Dan Bader • Updated June 30, 2025 • Reviewed by Leodanis Pozo Ramos