Text Encoding

Text encoding is the conversion of text into a byte-based format that computers can store, transmit, and manipulate. Each character is mapped to a numeric code point, and an encoding defines how that code point is represented as a sequence of bytes.
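As a minimal sketch of the character-to-code-point mapping, the built-in ord() and chr() functions convert between a character and its code point. The sample characters and the code_points name are chosen here purely for illustration:

```python
# Sample characters picked for illustration only.
samples = ["A", "é", "€", "😀"]

# ord() returns the code point of a single character as an integer.
code_points = {char: ord(char) for char in samples}

for char, code_point in code_points.items():
    # Code points are conventionally written as U+ followed by hex digits.
    print(f"{char!r}: U+{code_point:04X} (decimal {code_point})")

# chr() is the inverse: it turns a code point back into a character.
round_tripped = [chr(code_point) for code_point in code_points.values()]
```

Note that a code point is only a number. It says nothing yet about how many bytes the character takes up, which is the job of the encoding.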

In Python, strings are Unicode text. Unicode is a standard that assigns a unique code point to each character and symbol it covers across the world’s writing systems. The most common encoding used in Python is UTF-8, which can encode every Unicode code point using one to four bytes. It's also the default for .encode() and .decode().
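To make the "one to four bytes" range concrete, here is a small sketch that encodes a few sample characters with UTF-8 and counts the resulting bytes. The characters and variable names are illustrative choices, not part of any API:

```python
# One sample character for each possible UTF-8 length.
samples = ["A", "é", "€", "😀"]

# Map each character to the number of bytes in its UTF-8 encoding.
byte_lengths = {char: len(char.encode("utf-8")) for char in samples}

for char, length in byte_lengths.items():
    print(f"{char!r} -> {char.encode('utf-8')!r} ({length} byte(s))")
```

ASCII characters such as "A" take a single byte, which is why the encoded form of plain English text looks identical to the original string.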

Understanding text encoding is essential for working with text data in Python, especially when dealing with multiple languages or special characters. Incorrect handling of encodings can lead to errors and data corruption.
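The sketch below illustrates two common failure modes, assuming bytes that were written as UTF-8 but are read back with a different codec. The sample word and variable names are hypothetical:

```python
# Bytes produced by encoding a non-ASCII string as UTF-8.
data = "café".encode("utf-8")

# Failure mode 1: silent corruption. Latin-1 accepts any byte value,
# so decoding succeeds but yields the wrong characters.
garbled = data.decode("latin-1")

# Failure mode 2: an explicit error. ASCII can't represent bytes
# above 127, so strict decoding raises UnicodeDecodeError.
try:
    data.decode("ascii")
    failed = False
except UnicodeDecodeError:
    failed = True

# errors="replace" trades the exception for U+FFFD placeholder characters.
replaced = data.decode("ascii", errors="replace")

print(garbled, failed, replaced)
```

The silent case is usually the more dangerous one, because the corrupted text can be saved and passed along without any warning.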

Example

Here’s an example where you encode a string into bytes using UTF-8 and then decode it back to a string:

Python
>>> text = "Hello, World!"

>>> # Encode into bytes
>>> encoded_text = text.encode("utf-8")
>>> encoded_text
b'Hello, World!'

>>> # Decode back to a string
>>> decoded_text = encoded_text.decode("utf-8")
>>> decoded_text
'Hello, World!'

In this example, the .encode() method converts the string to a bytes object using the UTF-8 encoding, while .decode() converts it back to a string.

Tutorial

Unicode & Character Encodings in Python: A Painless Guide

In this tutorial, you'll get a Python-centric introduction to character encodings and Unicode. Handling character encodings and numbering systems can at times seem painful and complicated, but this guide is here to help with easy-to-follow Python examples.


By Leodanis Pozo Ramos • Updated Jan. 7, 2025 • Reviewed by Dan Bader