`codecs`

The Python codecs module defines base classes for standard Python codecs (encoders and decoders) and provides access to the internal Python codec registry. It supports text encodings, text-to-text codecs, and bytes-to-bytes codecs.

Here’s a quick look at encoding and decoding text:

>>> import codecs

>>> codecs.encode("Hello, World!", "utf-8")
b'Hello, World!'
>>> codecs.decode(b"Hello, World!", "utf-8")
'Hello, World!'
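
The text-to-text and bytes-to-bytes codecs mentioned above can be reached through the same two functions. The following is a minimal sketch, assuming the standard `rot13` and `hex` codec aliases are available in your Python version:

```python
import codecs

# Text-to-text transform: str in, str out.
rot13 = codecs.encode("Hello, World!", "rot13")  # 'Uryyb, Jbeyq!'

# Bytes-to-bytes transform: bytes in, bytes out.
hexed = codecs.encode(b"Hello", "hex")  # b'48656c6c6f'
restored = codecs.decode(hexed, "hex")  # b'Hello'
```

These transforms are not text encodings, so they go through codecs.encode() and codecs.decode() rather than str.encode() and bytes.decode().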

Key Features

Provides encode() and decode() functions with configurable error handling strategies
Maintains a searchable registry of built-in and custom codecs
Supports text encodings, text-to-text transforms, and bytes-to-bytes transforms
Offers incremental encoders and decoders suitable for streaming data
Provides stream-oriented reader and writer classes for encoded files and network streams
Includes byte order mark (BOM) constants, such as codecs.BOM_UTF8, for detecting and writing BOMs
Allows registration of custom codecs and custom error handlers
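
As an illustration of the last feature, here is a rough sketch of a custom error handler. The handler name `codepoint` and the function `codepoint_handler` are hypothetical and chosen only for this example:

```python
import codecs

def codepoint_handler(error):
    # Hypothetical handler: only encoding errors are handled here.
    if isinstance(error, UnicodeEncodeError):
        bad = error.object[error.start:error.end]
        # Replace each unencodable character with a [U+XXXX] marker.
        replacement = "".join(f"[U+{ord(ch):04X}]" for ch in bad)
        # Return the replacement and the position at which to resume.
        return replacement, error.end
    raise error

# "codepoint" is a made-up name; any unused name should work.
codecs.register_error("codepoint", codepoint_handler)

encoded = codecs.encode("Café ☕", "ascii", errors="codepoint")
# b'Caf[U+00E9] [U+2615]'
```

Once registered, the handler name can be passed as the errors argument anywhere a built-in name such as "replace" is accepted.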

Frequently Used Classes and Functions

Object	Type	Description
`codecs.encode()`	Function	Encodes an object using a named registered codec
`codecs.decode()`	Function	Decodes an object using a named registered codec
`codecs.lookup()`	Function	Returns a `CodecInfo` object for a named encoding
`codecs.register()`	Function	Registers a custom codec search function with the registry
`codecs.register_error()`	Function	Registers a named error handling function for use during encoding or decoding
`codecs.iterencode()`	Function	Incrementally encodes strings from an iterator using a named codec
`codecs.iterdecode()`	Function	Incrementally decodes bytes from an iterator using a named codec
`codecs.IncrementalEncoder`	Class	Base class for building stateful incremental encoders
`codecs.IncrementalDecoder`	Class	Base class for building stateful incremental decoders
`codecs.StreamReader`	Class	Base class for reading and decoding data from a binary stream
`codecs.StreamWriter`	Class	Base class for encoding and writing data to a binary stream
`codecs.BOM_UTF8`	Constant	UTF-8 byte order mark (`b'\xef\xbb\xbf'`), used to signal UTF-8 encoding at the start of a byte stream
`codecs.BOM_UTF16`	Constant	UTF-16 BOM in native byte order; also available as `BOM_UTF16_BE` and `BOM_UTF16_LE` for explicit endianness
`codecs.BOM_UTF32`	Constant	UTF-32 BOM in native byte order; also available as `BOM_UTF32_BE` and `BOM_UTF32_LE` for explicit endianness
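
The BOM constants can be used to guess the encoding of a byte string from its first few bytes. This is a minimal sketch, not a complete detector; the helper `detect_bom` is a hypothetical name introduced here:

```python
import codecs

def detect_bom(data):
    """Return a codec name if data starts with a known BOM, else None."""
    # UTF-32 is checked before UTF-16 because BOM_UTF32_LE begins with
    # the same two bytes as BOM_UTF16_LE.
    candidates = [
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return None

payload = codecs.BOM_UTF8 + "héllo".encode("utf-8")
encoding = detect_bom(payload)              # 'utf-8-sig'
text = codecs.decode(payload, encoding)     # 'héllo', BOM stripped
```

The returned names are codecs that consume the BOM while decoding, so the decoded text does not start with a stray U+FEFF character.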

Examples

Decoding data that contains a byte outside the ASCII range, using different error strategies:

>>> import codecs

>>> data = "Caf\u00e9".encode("latin-1")
>>> codecs.decode(data, "ascii", errors="ignore")
'Caf'
>>> codecs.decode(data, "ascii", errors="replace")
'Caf\ufffd'
>>> codecs.decode(data, "ascii", errors="backslashreplace")
'Caf\\xe9'

Inspecting a codec’s metadata with the codecs.lookup() function:

>>> import codecs

>>> info = codecs.lookup("utf-8")
>>> info.name
'utf-8'
>>> info.incrementalencoder
<class 'encodings.utf_8.IncrementalEncoder'>
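
The `CodecInfo` object also exposes the canonical codec name and the stateless encode and decode functions. A short sketch of how these might be used, where the variable names and the encoding name `no-such-codec` are only illustrative:

```python
import codecs

# Aliases resolve to a canonical codec name.
utf8 = codecs.lookup("UTF8")
latin1 = codecs.lookup("latin-1")
canonical_names = (utf8.name, latin1.name)  # ('utf-8', 'iso8859-1')

# The stateless functions return (output, length_consumed) tuples.
encoded, consumed = utf8.encode("é")   # (b'\xc3\xa9', 1)
decoded, used = utf8.decode(encoded)   # ('é', 2)

# Unknown names raise LookupError.
try:
    codecs.lookup("no-such-codec")
    found = True
except LookupError:
    found = False
```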

Using codecs.iterdecode() to decode a stream of byte chunks incrementally:

>>> import codecs

>>> chunks = [b"Hell", b"o, ", b"W\xc3\xb6", b"rld!"]
>>> decoder = codecs.iterdecode(iter(chunks), "utf-8")
>>> list(decoder)
['Hell', 'o, ', 'Wö', 'rld!']
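
Incremental decoding matters most when a multi-byte character is split across chunks. The sketch below contrasts decoding each chunk on its own with codecs.iterdecode(), using made-up chunk boundaries:

```python
import codecs

# "ö" is b"\xc3\xb6" in UTF-8; here it straddles two chunks.
chunks = [b"W\xc3", b"\xb6rld!"]

# Decoding each chunk independently fails on the incomplete sequence.
naive_failed = False
try:
    naive = [chunk.decode("utf-8") for chunk in chunks]
except UnicodeDecodeError:
    naive_failed = True

# iterdecode() keeps the dangling byte until the next chunk arrives.
decoded = list(codecs.iterdecode(chunks, "utf-8"))  # ['W', 'örld!']
```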

Common Use Cases

The most common use cases for codecs include:

Encoding text to bytes for storage in files or transmission over a network
Decoding bytes from legacy systems that use non-UTF-8 encodings such as Latin-1 or Windows-1252
Streaming large files in a specific encoding without loading them fully into memory
Registering custom codecs for domain-specific or proprietary data formats
Applying named error handlers to control behavior when encountering unencodable or undecodable characters
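
For the legacy-encoding case, one possible approach is to try a list of candidate encodings in order. This is only a sketch: the helper `decode_legacy`, its default candidates, and the Latin-1 fallback are assumptions made for this example, not a recommendation from the codecs module itself:

```python
import codecs

def decode_legacy(data, encodings=("utf-8", "cp1252")):
    """Return (text, encoding_used) for the first encoding that decodes data."""
    for encoding in encodings:
        try:
            return codecs.decode(data, encoding), encoding
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a code point, so this never raises.
    return codecs.decode(data, "latin-1"), "latin-1"

text, used = decode_legacy("café".encode("cp1252"))  # ('café', 'cp1252')
```

Strict decoding with a fallback keeps mis-decoded text from passing silently, at the cost of guessing wrong when the bytes happen to be valid in more than one encoding.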

Real-World Example

A script can use codecs.iterdecode() to incrementally transcode a Latin-1 encoded log file to UTF-8 without loading the entire file into memory:

import codecs

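def transcode_to_utf8(source_path, dest_path):
    # Binary mode on both sides keeps all decoding and encoding explicit.
    with open(source_path, "rb") as src, open(dest_path, "wb") as dst:
        # Iterating over src yields one line of bytes at a time.
        reader = codecs.iterdecode(src, "latin-1")
        for line in reader:
            dst.write(line.encode("utf-8"))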

transcode_to_utf8("legacy_log.txt", "modern_log.txt")
print("Transcoding complete.")

Run it:

$ python transcode_log.py
Transcoding complete.

Because iterating over a binary file yields one line of bytes at a time, the iterdecode() call processes the file line by line. Each line is decoded from Latin-1 to a Python string, which is then re-encoded as UTF-8 and written to the output file.
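
If the source has no convenient line breaks, or uses a multi-byte encoding, the same idea can be applied to fixed-size blocks. The following is a sketch under those assumptions; `transcode_stream` and its `chunk_size` parameter are hypothetical names, and the demo uses in-memory streams in place of real files:

```python
import codecs
import io

def transcode_stream(src, dst, source_encoding, chunk_size=64 * 1024):
    """Copy src to dst, re-encoding from source_encoding to UTF-8."""
    # iter(callable, sentinel) calls src.read() until it returns b"".
    blocks = iter(lambda: src.read(chunk_size), b"")
    # iterdecode() buffers incomplete byte sequences between blocks.
    for text in codecs.iterdecode(blocks, source_encoding):
        dst.write(text.encode("utf-8"))

# Demo with a deliberately awkward chunk size that splits characters.
original = "naïve café ☕\n" * 3
src = io.BytesIO(original.encode("utf-16"))
dst = io.BytesIO()
transcode_stream(src, dst, "utf-16", chunk_size=5)
result = dst.getvalue().decode("utf-8")
```

Working on binary file objects rather than paths lets the same function serve files opened with open(path, "rb") as well as sockets or in-memory buffers.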

Tutorial

Unicode & Character Encodings in Python: A Painless Guide

In this tutorial, you'll get a Python-centric introduction to character encodings and Unicode. Handling character encodings and numbering systems can at times seem painful and complicated, but this guide is here to help with easy-to-follow Python examples.


For additional information on related topics, take a look at the following resources: