What Is the __pycache__ Folder in Python?

by Bartosz Zaczyński, May 13, 2024

When you develop a self-contained Python script, you might not notice anything unusual about your directory structure. However, as soon as your project becomes more complex, you’ll often decide to extract parts of the functionality into additional modules or packages. That’s when you may start to see a __pycache__ folder appearing out of nowhere next to your source files in seemingly random places:

project/
│
├── mathematics/
│   │
│   ├── __pycache__/
│   │
│   ├── arithmetic/
│   │   ├── __init__.py
│   │   ├── add.py
│   │   └── sub.py
│   │
│   ├── geometry/
│   │   │
│   │   ├── __pycache__/
│   │   │
│   │   ├── __init__.py
│   │   └── shapes.py
│   │
│   └── __init__.py
│
└── calculator.py

Notice that the __pycache__ folder can be present at different levels in your project’s directory tree when you have multiple subpackages nested in one another. At the same time, other packages or folders with your Python source files may not contain this mysterious cache directory.

You may encounter a similar situation after you clone a remote Git repository with a Python project and run the underlying code. So, what causes the __pycache__ folder to appear, and for what purpose?

In Short: It Makes Importing Python Modules Faster

Even though Python is an interpreted programming language, its interpreter doesn’t operate directly on your Python code, which would be very slow. Instead, when you run a Python script or import a Python module, the interpreter compiles your high-level Python source code into bytecode, which is an intermediate binary representation of the code.

This bytecode enables the interpreter to skip recurring steps, such as lexing and parsing the code into an abstract syntax tree and validating its correctness every time you run the same program. As long as the underlying source code hasn’t changed, Python can reuse the intermediate representation, which is immediately ready for execution. This saves time, speeding up your script’s startup time.

Remember that while loading the compiled bytecode from __pycache__ makes Python modules import faster, it doesn’t affect their execution speed!

Why bother with bytecode at all instead of compiling the code straight to the low-level machine code? While machine code is what executes on the hardware, providing the ultimate performance, it’s not as portable or quick to produce as bytecode.

Machine code is a set of binary instructions understood by your specific CPU architecture, wrapped in a container format like EXE, ELF, or Mach-O, depending on the operating system. In contrast, bytecode provides a platform-independent abstraction layer and is typically quicker to compile.

Python uses local __pycache__ folders to store the compiled bytecode of imported modules in your project. On subsequent runs, the interpreter will try to load precompiled versions of modules from these folders, provided they’re up-to-date with the corresponding source files. Note that this caching mechanism only kicks in for modules that you import in your code, not for a file that you execute directly as a script in the terminal.

In addition to this on-disk bytecode caching, Python keeps an in-memory cache of modules, which you can access through the sys.modules dictionary. It ensures that when you import the same module multiple times from different places within your program, Python will use the already imported module without needing to reload or recompile it. Both mechanisms work together to reduce the overhead of importing Python modules.
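As a small illustration of the in-memory cache, the following sketch imports a standard-library module twice and checks that both names refer to the single module object stored in sys.modules. The json module is only an arbitrary stand-in here, and the alias again is made up for the example:

```python
import sys

import json           # First import: Python loads the module and caches it
import json as again  # Second import: Python reuses the cached module object

# Both names point to the very same object held in sys.modules
same_object = json is again is sys.modules["json"]
print(same_object)
```

Because the second import is served from sys.modules, Python doesn't need to consult the __pycache__ folder again within the same run.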

Next, you’re going to find out exactly how much faster Python loads the cached bytecode as opposed to compiling the source code on the fly when you import a module.

How Much Faster Is Loading Modules From Cache?

The caching happens behind the scenes and usually goes unnoticed since Python is quite fast at compiling source code into bytecode. Besides, unless you often run short-lived Python scripts, the compilation step remains insignificant when compared to the total execution time. That said, without caching, the overhead associated with bytecode compilation could add up if you had lots of modules and imported them many times over.

To measure the difference in import time between a cached and uncached module, you can pass the -X importtime option to the python command or set the equivalent PYTHONPROFILEIMPORTTIME environment variable. When this option is enabled, Python will display a table summarizing how long it took to import each module, including the cumulative time in case a module depends on other modules.

Suppose you had a calculator.py script that imports and calls a utility function from a local arithmetic.py module:

Python calculator.py
from arithmetic import add

add(3, 4)

The imported module defines a single function:

Python arithmetic.py
def add(a, b):
    return a + b

As you can see, the main script delegates the addition of two numbers, three and four, to the add() function imported from the arithmetic module.

The first time you run your script, Python compiles and saves the bytecode of the module you imported into a local __pycache__ folder. If such a folder doesn’t already exist, then Python automatically creates one before moving on. Now, when you execute your script again, Python should find and load the cached bytecode as long as you didn’t alter the associated source code.

Once the cache warms up, it’ll contribute to a faster startup time of your Python script:

Shell
$ python -X importtime calculator.py
(...)
import time:     20092 |      20092 | arithmetic

$ python -X importtime calculator.py
(...)
import time:       232 |        232 | arithmetic

$ python -X importtime calculator.py
(...)
import time:       203 |        203 | arithmetic

Although the exact measurements may vary across runs, the improvement is clearly visible. Without the __pycache__ folder, the initial attempt to import arithmetic was two orders of magnitude slower than during subsequent runs. This change may look staggering at first glance, but these values are expressed in microseconds, so you most likely won’t even notice the difference despite such a dramatic drop in numbers.
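If you'd like a rough feel for where that time goes, you can compare compiling source code against loading an already compiled code object. The sketch below is only an approximation of what the import system does: it skips file access and cache validation entirely, the repeated add() source is a made-up workload, and the timings will differ on your machine:

```python
import marshal
import timeit

# Hypothetical module source, repeated to give the compiler some work to do
source = "def add(a, b):\n    return a + b\n" * 200

code = compile(source, "<arithmetic>", "exec")
payload = marshal.dumps(code)  # Similar to what follows the header in a .pyc file

# Time compiling from source versus deserializing the ready-made bytecode
compile_time = timeit.timeit(
    lambda: compile(source, "<arithmetic>", "exec"), number=50
)
load_time = timeit.timeit(lambda: marshal.loads(payload), number=50)

print(f"compile from source: {compile_time:.4f}s")
print(f"load from bytecode:  {load_time:.4f}s")
```

Deserializing usually comes out ahead, which is consistent with the import times shown above, but treat the numbers as indicative rather than as a benchmark.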

The performance gain is usually barely reflected by the startup time of most Python scripts, which you can assess using the time command on Unix-like systems:

Shell
$ rm -rf __pycache__
$ time python calculator.py

real    0m0.088s
user    0m0.064s
sys     0m0.028s

$ time python calculator.py

real    0m0.086s
user    0m0.060s
sys     0m0.030s

Here, the total execution time remains virtually the same regardless of whether the cache exists or not. Removing the __pycache__ folder delays the execution by about two milliseconds, which is negligible for most applications.

Python’s bytecode compiler is pretty snappy when you compare it to a more sophisticated counterpart in Java, which can take advantage of static typing.

For example, if you have a sample Calculator.java file, then you can either compile it to a .class file upfront, which is the usual way of working with Java code, or run the .java file directly. In the latter case, the Java runtime will compile the code in the background into a temporary location before running it:

Shell
$ javac Calculator.java
$ time java Calculator

real    0m0.039s
user    0m0.026s
sys     0m0.019s

$ time java Calculator.java

real    0m0.574s
user    0m1.182s
sys     0m0.069s

When you manually compile the source code using the javac command and run the resulting bytecode, the execution takes about forty milliseconds. On the other hand, when you let the java command handle the compilation, the total execution time rises to just over half a second. Therefore, unlike in Python, the Java compiler’s overhead is very noticeable, even with a parallel compilation on multiple CPU cores!

Now that you know the purpose of the __pycache__ folder, you might be curious about its content.

What’s Inside a __pycache__ Folder?

When you take a peek inside the __pycache__ folder, you’ll see one or more files that end with the .pyc extension. It stands for a compiled Python module:

Shell
$ ls -1 __pycache__
arithmetic.cpython-311.pyc
arithmetic.cpython-312.pyc
arithmetic.pypy310.opt-1.pyc
arithmetic.pypy310.opt-2.pyc
arithmetic.pypy310.pyc
solver.cpython-312.pyc
units.cpython-312.pyc

Each of these files contains the bytecode of the corresponding Python module, defined in the current package, that you imported at runtime. The compiled bytecode targets a specific Python implementation, version, and an optional optimization level. All of this information is encoded in the file name.

For example, a file named arithmetic.pypy310.opt-2.pyc is the bytecode of your arithmetic.py module compiled by PyPy 3.10 with an optimization level of two. This optimization removes the assert statements and discards any docstrings. Conversely, arithmetic.cpython-312.pyc represents the same module but compiled for CPython 3.12 without any optimizations.

In total, there are five bytecode variants compiled from a single arithmetic.py module in the __pycache__ folder above.

Such a file naming scheme ensures compatibility across different Python versions and flavors. When you run the same script using PyPy or an earlier CPython version, the interpreter will compile all imported modules against its own runtime environment so that it can reuse them later. The Python version, along with other metadata, is also stored in the .pyc file itself.
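To see how your own interpreter would name these files, you can ask the standard library. This short sketch uses sys.implementation.cache_tag and importlib.util.cache_from_source(). The file name arithmetic.py is just the running example and doesn't need to exist on disk:

```python
import sys
from importlib.util import cache_from_source

# Tag identifying the running interpreter, for example "cpython-312"
print(sys.implementation.cache_tag)

# Where the bytecode for arithmetic.py would be cached, without optimizations
default_path = cache_from_source("arithmetic.py", optimization="")

# The same module compiled with an optimization level of two
optimized_path = cache_from_source("arithmetic.py", optimization=2)

print(default_path)
print(optimized_path)
```

The exact output depends on the Python implementation and version that you run, which is precisely the point of the naming scheme.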

You’ll take a closer look at the .pyc files stored in the cache folder at the end of this tutorial. Now, it’s time to learn about the circumstances that trigger Python to create the cache folder.

When Does Python Create Cache Folders?

The interpreter will only store the compiled bytecode of Python modules in the __pycache__ folder when you import those modules or sometimes their parent package. It won’t create the cache folder when you run a regular Python script that doesn’t import any modules or packages. This is based on the assumption that modules are less likely to change and that you may import them multiple times during a single execution.

When you import an individual module from a package, Python will produce the corresponding .pyc file and cache it in the __pycache__ folder located inside that package. It’ll also compile the package’s __init__.py file but won’t touch any other modules or nested subpackages. However, if the imported module itself imports other modules, then those modules will also be compiled, and so on.

Here’s an example demonstrating the most basic case, assuming that you placed a suitable import statement in the calculator.py script below:

project/
│
├── arithmetic/
│   │
│   ├── __pycache__/
│   │   ├── __init__.cpython-312.pyc
│   │   └── add.cpython-312.pyc
│   │
│   ├── __init__.py
│   ├── add.py
│   └── sub.py
│
└── calculator.py

After running the script for the first time, Python will ensure that a __pycache__ folder exists in the arithmetic package. Then, it’ll compile the package’s __init__.py along with any modules imported from that package. In this case, you only requested to import the arithmetic.add module, so you can see a .pyc file associated with add.py but not with sub.py.

All of the following import statements would yield the same result depicted above:

Python
import arithmetic.add
from arithmetic import add
from arithmetic.add import add_function

Regardless of how you import a Python module and whether you import it in its entirety or just a specific symbol, such as a class or a constant, the interpreter compiles the whole module, as it can’t read modules partially.

Conversely, importing a whole package would normally cause Python to compile only __init__.py in that package. However, it’s fairly common for packages to expose their internal modules or subpackages from within __init__.py for more convenient access. For instance, consider this:

Python arithmetic/__init__.py
from arithmetic import add
from arithmetic.sub import sub_function

Imports within __init__.py like these create additional .pyc files, even when you use the plain import arithmetic statement in your script.

If you imported a subpackage or a deeply nested module or symbol, then all intermediate packages leading up to the top-level package would also have their __init__.py files compiled and placed in their respective cache folders. However, Python won’t go in the other direction by recursively scanning the nested subpackages, as that would be unnecessary. It’ll only compile the modules you really need by importing them explicitly or indirectly.
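You can reproduce this behavior in isolation. The following sketch builds a throwaway package in a temporary directory, imports a single module from it, and lists the resulting cache files. The package name demo_arithmetic is hypothetical, and the sketch explicitly switches on local bytecode caching in case your environment has disabled or redirected it:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Make sure bytecode caching is on and uses local __pycache__ folders
sys.dont_write_bytecode = False
sys.pycache_prefix = None

with tempfile.TemporaryDirectory() as tmp:
    # demo_arithmetic is a hypothetical package with two modules
    package = Path(tmp) / "demo_arithmetic"
    package.mkdir()
    (package / "__init__.py").write_text("")
    (package / "add.py").write_text("def add_function(a, b):\n    return a + b\n")
    (package / "sub.py").write_text("def sub_function(a, b):\n    return a - b\n")

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        from demo_arithmetic.add import add_function  # Only add.py is imported
    finally:
        sys.path.remove(tmp)

    # Collect the names of the .pyc files that Python wrote during the import
    cached = sorted(path.name for path in (package / "__pycache__").iterdir())

print(cached)
```

The list should contain entries for __init__.py and add.py only, since sub.py was never imported.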

What if Python has already compiled your module into a .pyc file, but you decide to modify its source code in the original .py file? You’ll find out next!

What Actions Invalidate the Cache?

Running an obsolete bytecode could cause an error or, worse, it could lead to completely unpredictable behavior. Fortunately, Python is clever enough to detect when you modify the source code of a compiled module and will recompile it as necessary.

To determine whether a module needs recompiling, Python uses one of two cache invalidation strategies:

  1. Timestamp-based
  2. Hash-based

The first one compares the source file’s size and its last-modified timestamp to the metadata stored in the corresponding .pyc file. Later, you’ll learn how these values are persisted in the .pyc file, along with other metadata.

In contrast, the second strategy calculates the hash value of the source file and checks it against a special field in the file’s header (PEP 552), which was introduced in Python 3.7. This strategy is more secure and deterministic but also a bit slower. That’s why the timestamp-based strategy remains the default for now.

When you artificially update the modification time (mtime) of the source file, for example, by using the touch command on macOS or Linux, you’ll force Python to compile the module again:

Shell
$ tree -D --dirsfirst
[Apr 26 09:48]  .
├── [Apr 26 09:48]  __pycache__
│     └── [Apr 26 09:48]  arithmetic.cpython-312.pyc
├── [Apr 26 09:48]  arithmetic.py
└── [Apr 26 09:48]  calculator.py

2 directories, 3 files

$ touch arithmetic.py
$ python calculator.py

$ tree -D --dirsfirst
[Apr 26 09:48]  .
├── [Apr 26 09:52]  __pycache__
│     └── [Apr 26 09:52]  arithmetic.cpython-312.pyc
├── [Apr 26 09:52]  arithmetic.py
└── [Apr 26 09:48]  calculator.py

2 directories, 3 files

Initially, the cached arithmetic.cpython-312.pyc file was last modified at 09:48 am. After touching the source file, arithmetic.py, Python deems the compiled bytecode outdated and recompiles the module when you run the script that imports arithmetic. This results in a new .pyc file with an updated timestamp of 09:52 am.
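The same check can be approximated in a few lines of Python. This is a simplified sketch of the timestamp-based comparison, not the interpreter's actual implementation. It compiles a temporary copy of arithmetic.py with py_compile, reads the modification time and size stored in the .pyc header, and compares them to the source file before and after simulating touch. The helper is_fresh() is made up for the example:

```python
import os
import py_compile
import tempfile
from pathlib import Path
from py_compile import PycInvalidationMode

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "arithmetic.py"
    source.write_text("def add(a, b):\n    return a + b\n")

    # Compile explicitly into a timestamp-based .pyc file
    pyc = py_compile.compile(
        str(source),
        cfile=str(Path(tmp) / "arithmetic.pyc"),
        doraise=True,
        invalidation_mode=PycInvalidationMode.TIMESTAMP,
    )
    header = Path(pyc).read_bytes()[:16]
    cached_mtime = int.from_bytes(header[8:12], "little")
    cached_size = int.from_bytes(header[12:16], "little")

    def is_fresh():
        # Hypothetical helper: both the mtime and the size must match
        stats = os.stat(source)
        return (
            (int(stats.st_mtime) & 0xFFFFFFFF) == cached_mtime
            and (stats.st_size & 0xFFFFFFFF) == cached_size
        )

    fresh_before = is_fresh()

    # Simulate `touch` by moving the modification time one minute ahead
    new_time = os.stat(source).st_mtime + 60
    os.utime(source, (new_time, new_time))
    fresh_after = is_fresh()

print(fresh_before, fresh_after)
```

The first check passes and the second one fails, even though the file's content never changed, which mirrors what happened in the shell session above.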

To produce hash-based .pyc files, you must use Python’s compileall module with the --invalidation-mode option set accordingly. For instance, this command will compile all modules in the current folder and subfolders into the so-called checked hash-based variant:

Shell
$ python -m compileall --invalidation-mode checked-hash .

The official documentation explains the difference between checked and unchecked variants of the hash-based .pyc files as follows:

For checked hash-based .pyc files, Python validates the cache file by hashing the source file and comparing the resulting hash with the hash in the cache file. If a checked hash-based cache file is found to be invalid, Python regenerates it and writes a new checked hash-based cache file. For unchecked hash-based .pyc files, Python simply assumes the cache file is valid if it exists. (Source)

Nevertheless, you can always override the default validation behavior of hash-based .pyc files with the --check-hash-based-pycs option when you run the Python interpreter.
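If you only need to compile a single file, then the py_compile module accepts the same invalidation modes. The sketch below, which works on a temporary copy of arithmetic.py, produces both hash-based variants and inspects their headers. The flag values one and three follow PEP 552, and the output file names are arbitrary:

```python
import py_compile
import tempfile
from importlib.util import source_hash
from pathlib import Path
from py_compile import PycInvalidationMode

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "arithmetic.py"
    source.write_bytes(b"def add(a, b):\n    return a + b\n")

    flags = {}
    hash_matches = []
    for mode in (
        PycInvalidationMode.UNCHECKED_HASH,
        PycInvalidationMode.CHECKED_HASH,
    ):
        pyc = py_compile.compile(
            str(source),
            cfile=str(Path(tmp) / f"{mode.name.lower()}.pyc"),
            doraise=True,
            invalidation_mode=mode,
        )
        header = Path(pyc).read_bytes()[:16]
        # Bytes 4 to 8 hold the flags, bytes 8 to 16 hold the source hash
        flags[mode.name] = int.from_bytes(header[4:8], "little")
        hash_matches.append(header[8:16] == source_hash(source.read_bytes()))

print(flags)
print(hash_matches)
```

In both variants, the hash stored in the header agrees with the one computed from the source. Only the flags differ, telling Python whether it should bother re-checking the hash on import.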

Knowing when and where Python creates the __pycache__ folders, as well as when it updates their content, will give you an idea about whether it’s safe to remove them.

Is It Safe to Remove a Cache Folder?

Yes, although you should ask yourself whether you really should! At this point, you understand that removing a __pycache__ folder is harmless because Python regenerates any missing cache files the next time you import the corresponding modules. Anyway, removing the individual cache folders by hand is a tedious job. Besides, it’ll only last until the next time you run your code.

The good news is that you can automate the removal of those cache folders from your project if you really insist, which you’ll do now.

How to Recursively Remove All Cache Folders?

Okay, you’ve already established that removing the cached bytecode that Python compiles is no big deal. However, the trouble with these __pycache__ folders is that they can appear in multiple subdirectories if you have a complex project structure. Finding and deleting them by hand would be a chore, especially since they rise like a phoenix from the ashes every time you run Python.

Below, you’ll find the platform-specific commands that recursively remove all __pycache__ folders from the current directory and all of its nested subdirectories in one go:

Windows PowerShell
PS> $dirs = Get-ChildItem -Path . -Filter __pycache__ -Recurse -Directory
PS> $dirs | Remove-Item -Recurse -Force

Shell
$ find . -type d -name __pycache__ -exec rm -rf {} +

You should exercise caution when running bulk delete commands, as they can remove more than you intended if you’re not careful. Always double-check the paths and filter criteria before you execute such commands!
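If you'd rather have one command that works on every platform, then you could write a similar cleanup in Python itself. This is a minimal sketch using pathlib and shutil. The function name remove_pycache() is made up, and the demonstration runs against a throwaway directory tree instead of a real project:

```python
import shutil
import tempfile
from pathlib import Path

def remove_pycache(root="."):
    """Delete every __pycache__ folder under root and return their paths."""
    # Collect the matches first so that deleting doesn't disturb the traversal
    targets = [path for path in Path(root).rglob("__pycache__") if path.is_dir()]
    for path in targets:
        if path.exists():  # Might be gone already if nested in another match
            shutil.rmtree(path)
    return targets

# Try it on a temporary directory tree that mimics a small project
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "mathematics" / "geometry" / "__pycache__"
    cache.mkdir(parents=True)
    (cache / "shapes.cpython-312.pyc").write_bytes(b"")
    removed = remove_pycache(tmp)
    still_there = cache.exists()

print(len(removed), still_there)
```

Returning the list of removed folders lets you review what was deleted, which is in keeping with the advice about double-checking bulk deletions.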

Deleting the __pycache__ folders can declutter your project’s workspace, but only provisionally. If you’re still annoyed by having to repeatedly run the recursive delete command, then you may prefer to take control of the cache folder handling in the first place. Next, you’ll explore two approaches to tackling this issue.

How to Prevent Python From Creating Cache Folders?

If you don’t want Python to cache the compiled bytecode, then you can pass the -B option to the python command when running a script. This will prevent the __pycache__ folders from appearing unless they already exist. That said, Python will continue to leverage any .pyc files that it can find in existing cache folders. It just won’t write new files to disk.

For a more permanent and global effect that spans across multiple Python interpreters, you can set the PYTHONDONTWRITEBYTECODE environment variable in your shell or its configuration file:

Shell
export PYTHONDONTWRITEBYTECODE=1

It’ll affect any Python interpreter, including one within a virtual environment that you’ve activated.
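Both settings surface inside the interpreter as the sys.dont_write_bytecode flag. As a quick sanity check, the following sketch launches two child interpreters, one with the -B option and one with the environment variable, and prints the flag's value in each:

```python
import os
import subprocess
import sys

snippet = "import sys; print(sys.dont_write_bytecode)"

# Child interpreter started with the -B option
with_flag = subprocess.run(
    [sys.executable, "-B", "-c", snippet],
    capture_output=True, text=True, check=True,
).stdout.strip()

# Child interpreter started with the environment variable set
with_env = subprocess.run(
    [sys.executable, "-c", snippet],
    capture_output=True, text=True, check=True,
    env={**os.environ, "PYTHONDONTWRITEBYTECODE": "1"},
).stdout.strip()

print(with_flag, with_env)
```

Either way, the flag should read True, meaning that the interpreter won't write new .pyc files.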

Still, you should think carefully about whether suppressing bytecode compilation is the right approach for your use case. The alternative is to tell Python to create the individual __pycache__ folders in a single shared location on your file system.

How to Store the Cache in a Centralized Folder?

When you disable the bytecode compilation altogether, you gain a cleaner workspace but lose the benefits of caching for faster load times. If you’d like to combine the best of both worlds, then you can instruct Python to write the .pyc files into a parallel tree rooted at the specified directory using the -X pycache_prefix option:

Shell
$ python -X pycache_prefix=/tmp/pycache calculator.py

In this case, you tell Python to cache the compiled bytecode in a temporary folder located at /tmp/pycache on your file system. When you run this command, Python won’t try to create local __pycache__ folders in your project anymore. Instead, it’ll mirror the directory structure of your project under the indicated root folder and store all .pyc files there:

tmp/
└── pycache/
    └── home/
        └── user/
            │
            ├── other_project/
            │   └── solver.cpython-312.pyc
            │
            └── project/
                │
                └── mathematics/
                    │
                    ├── arithmetic/
                    │   ├── __init__.cpython-312.pyc
                    │   ├── add.cpython-312.pyc
                    │   └── sub.cpython-312.pyc
                    │
                    ├── geometry/
                    │   ├── __init__.cpython-312.pyc
                    │   └── shapes.cpython-312.pyc
                    │
                    └── __init__.cpython-312.pyc

Notice two things here. First, because the cache directory is kept separate from your source code, there’s no need to nest the compiled .pyc files inside the __pycache__ folders. Secondly, since the hierarchy within such a centralized cache matches your project’s structure, you can share this cache folder between multiple projects.

Other advantages of this setup include an easier cleanup, as you can remove all .pyc files belonging to the same project with a single keystroke without having to manually traverse all the directories. Additionally, you can store the cache folder on a separate physical disk to take advantage of parallel reads, or keep the cache in a persistent volume when working with Docker containers.

Remember that you must use the -X pycache_prefix option every time you run the python command for this to work consistently. As an alternative, you can set the path to a shared cache folder through the PYTHONPYCACHEPREFIX environment variable:

Shell
export PYTHONPYCACHEPREFIX=/tmp/pycache

Either way, you can programmatically verify whether Python will use the specified cache directory or fall back to the default behavior and create local __pycache__ folders:

Python
>>> import sys
>>> sys.pycache_prefix
'/tmp/pycache'

The sys.pycache_prefix variable can be either a string or None.
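To preview where a given source file would end up under a centralized cache, you can combine sys.pycache_prefix with importlib.util.cache_from_source(). The sketch below temporarily sets the prefix at runtime for illustration. The paths are the hypothetical ones used in this section, and they assume a Unix-like file system:

```python
import sys
from importlib.util import cache_from_source

original_prefix = sys.pycache_prefix
try:
    # Has the same effect on the lookup below as -X pycache_prefix=/tmp/pycache
    sys.pycache_prefix = "/tmp/pycache"
    central_path = cache_from_source("/home/user/project/calculator.py")
finally:
    # Restore whatever the interpreter was started with
    sys.pycache_prefix = original_prefix

print(central_path)
```

The resulting path mirrors the source file's location under the prefix and doesn't contain a __pycache__ component.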

You’ve come a long way through this tutorial, and now you know a thing or two about dealing with __pycache__ folders in your Python projects. It’s finally time to see how you can work directly with the .pyc files stored in those folders.

What’s Inside a Cached .pyc File?

A .pyc file consists of a header with metadata followed by the serialized code object to be executed at runtime. The file’s header begins with a magic number that uniquely identifies the specific Python version the bytecode was compiled for. Next, there’s a bit field defined in PEP 552, which tells you which of the three .pyc variants explained earlier you’re dealing with: timestamp-based, unchecked hash-based, or checked hash-based.

In timestamp-based .pyc files, the bit field is filled with zeros and followed by two four-byte fields. These fields correspond to the Unix time of the last modification and the size of the source .py file, respectively:

Offset  Field Size  Field         Description
0       4           Magic number  Identifies the Python version
4       4           Bit field     Filled with zeros
8       4           Timestamp     The time of .py file’s modification
12      4           File size     Concerns the source .py file

Conversely, for hash-based .pyc files, the bit field can be equal to either one, indicating an unchecked variant, or three, meaning the checked variant. Then, instead of the timestamp and file size, there’s only one eight-byte field with the hash value of the Python source code:

Offset  Field Size  Field         Description
0       4           Magic number  Identifies the Python version
4       4           Bit field     Equals 1 (unchecked) or 3 (checked)
8       8           Hash value    Source code’s hash value

In both cases, the header is sixteen bytes long, which you can skip if you’re not interested in reading the encoded metadata. By doing so, you’ll jump straight to the code object serialized with the marshal module, which occupies the remaining part of the .pyc file.

With this information, you can X-ray one of your compiled .pyc files and directly execute the underlying bytecode, even if you no longer have the original .py file with the associated source code.

How to Read and Execute the Cached Bytecode?

By itself, Python will only import a companion .pyc file from the __pycache__ folder if the original .py file still exists. If you remove the source module after it’s already been compiled, then Python will refuse to import the cached .pyc file. That’s by design. However, you can always run the bytecode by hand if you want to.

The following Python script shows how you can read the .pyc file’s header, as well as how to deserialize and execute the bytecode that comes with it:

Python xray.py
 1  import marshal
 2  from datetime import datetime, timezone
 3  from importlib.util import MAGIC_NUMBER
 4  from pathlib import Path
 5  from pprint import pp
 6  from py_compile import PycInvalidationMode
 7  from sys import argv
 8  from types import SimpleNamespace
 9
10  def main(path):
11      metadata, code = load_pyc(path)
12      pp(vars(metadata))
13      if metadata.magic_number == MAGIC_NUMBER:
14          exec(code, globals())
15      else:
16          print("Bytecode incompatible with this interpreter")
17
18  def load_pyc(path):
19      with Path(path).open(mode="rb") as file:
20          return (
21              parse_header(file.read(16)),
22              marshal.loads(file.read()),
23          )
24
25  def parse_header(header):
26      metadata = SimpleNamespace()
27      metadata.magic_number = header[0:4]
28      metadata.magic_int = int.from_bytes(header[0:4][:2], "little")
29      metadata.python_version = f"3.{(metadata.magic_int - 2900) // 50}"
30      metadata.bit_field = int.from_bytes(header[4:8], "little")
31      metadata.pyc_type = {
32          0: PycInvalidationMode.TIMESTAMP,
33          1: PycInvalidationMode.UNCHECKED_HASH,
34          3: PycInvalidationMode.CHECKED_HASH,
35      }.get(metadata.bit_field)
36      if metadata.pyc_type is PycInvalidationMode.TIMESTAMP:
37          metadata.timestamp = datetime.fromtimestamp(
38              int.from_bytes(header[8:12], "little"),
39              timezone.utc,
40          )
41          metadata.file_size = int.from_bytes(header[12:16], "little")
42      else:
43          metadata.hash_value = header[8:16]
44      return metadata
45
46  if __name__ == "__main__":
47      main(argv[1])

There are several things going on here, so you can break them down line by line:

  • Line 11 receives a tuple comprising the metadata parsed from the file’s header and a deserialized code object ready to execute. Both are loaded from a .pyc file specified as the only required command-line argument.
  • Line 12 pretty prints the decoded metadata onto the screen.
  • Lines 13 to 16 conditionally execute the bytecode from the .pyc file using exec() or print an error message. To determine whether the file was compiled for the current interpreter version, this code fragment compares the magic number obtained from the header to the interpreter’s magic number. If everything succeeds, the loaded module’s symbols get imported into globals().
  • Lines 18 to 23 open the .pyc file in binary mode using the pathlib module, parse the header, and unmarshal the code object.
  • Lines 25 to 44 parse the header fields using their corresponding offsets and byte sizes, interpreting multi-byte values with the little-endian byte order.
  • Lines 28 and 29 extract the Python version from the magic number, which increments with each minor release of Python according to the formula 2900 + 50n, where n is the minor release of Python 3.11 or later.
  • Lines 31 to 35 determine the type of .pyc file based on the preceding bit field (PEP 552).
  • Lines 37 to 40 convert the source file’s modification time into a datetime object in the UTC timezone.

You can run the X-ray script outlined above against a .pyc file of your choice. When you enable Python’s interactive mode (-i), you’ll be able to inspect the variables and the state of the program after it terminates:

Shell
$ python -i xray.py __pycache__/arithmetic.cpython-312.pyc
{'magic_number': b'\xcb\r\r\n',
 'magic_int': 3531,
 'python_version': '3.12',
 'bit_field': 0,
 'pyc_type': <PycInvalidationMode.TIMESTAMP: 1>,
 'timestamp': datetime.datetime(2024, 4, 26, 17, 34, 57, tzinfo=….utc),
 'file_size': 32}
>>> add(3, 4)
7

The script prints out the decoded header fields, including the magic number and the source file’s modification time. Right after that, due to the python -i option, you’re dropped into the interactive Python REPL where you call add(), which was imported into the global namespace by executing the module’s bytecode.

This works as expected because the Python interpreter that you’re currently running happens to match the module’s bytecode version. Here’s what would happen if you tried to run another .pyc file targeting a different Python version or one of its alternative implementations:

Shell
$ python -i xray.py __pycache__/arithmetic.cpython-311.pyc
{'magic_number': b'\xa7\r\r\n',
 'magic_int': 3495,
 'python_version': '3.11',
 'bit_field': 0,
 'pyc_type': <PycInvalidationMode.TIMESTAMP: 1>,
 'timestamp': datetime.datetime(2024, 4, 25, 14, 40, 26, tzinfo=….utc),
 'file_size': 32}
Bytecode incompatible with this interpreter
>>> add(3, 4)
Traceback (most recent call last):
  ...
NameError: name 'add' is not defined

This time, the output shows a message indicating a mismatch between the bytecode version in the .pyc file and the interpreter being used. As a result, the bytecode wasn’t executed and the add() function wasn’t defined, so you can’t call it.

Now, if you X-ray a hash-based .pyc file (checked or unchecked), then this is what you might get:

Shell
$ python xray.py __pycache__/arithmetic.cpython-312.pyc
{'magic_number': b'\xcb\r\r\n',
 'magic_int': 3531,
 'python_version': '3.12',
 'bit_field': 3,
 'pyc_type': <PycInvalidationMode.CHECKED_HASH: 2>,
 'hash_value': b'\xf3\xdd\x87j\x8d>\x0e)'}

Python can compare the hash value embedded in the .pyc file with one that it calculates from the associated .py file by calling source_hash() on the source code:

Python
>>> from importlib.util import source_hash
>>> from pathlib import Path
>>> source_hash(Path("arithmetic.py").read_bytes())
b'\xf3\xdd\x87j\x8d>\x0e)'

This is a more reliable method of cache invalidation and verifying code integrity than comparing a volatile last-modified attribute of the source file. Notice how the computed hash value agrees with the one read from the .pyc file.

Now that you know how to import Python modules from a compiled binary form, it might feel tempting to distribute your commercial Python programs without sharing the source code.

Can Bytecode Obfuscate Python Programs?

Earlier, you learned that Python won’t import a module from a .pyc file if the associated .py file can’t be found. However, there’s one notable exception since that’s exactly what happens when you import Python code from a ZIP file specified on the PYTHONPATH. Such archives typically contain only the compiled .pyc files without the accompanying source code.
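Here's a minimal sketch of that exception in action. It compiles a small module, stores only the resulting .pyc file in a ZIP archive, and imports the module from the archive after adding it to sys.path, which has the same effect as listing it on the PYTHONPATH. The names zipped_arithmetic and bundle.zip are hypothetical, and note that the .pyc file sits at the top level of the archive rather than in a __pycache__ folder:

```python
import importlib
import py_compile
import sys
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # zipped_arithmetic is a hypothetical module name used only for this demo
    source = Path(tmp) / "zipped_arithmetic.py"
    source.write_text("def add(a, b):\n    return a + b\n")
    pyc = py_compile.compile(
        str(source),
        cfile=str(Path(tmp) / "zipped_arithmetic.pyc"),
        doraise=True,
    )

    # Put only the compiled file into the archive, leaving the source out
    archive = Path(tmp) / "bundle.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.write(pyc, arcname="zipped_arithmetic.pyc")

    sys.path.insert(0, str(archive))
    importlib.invalidate_caches()
    try:
        import zipped_arithmetic  # Loaded from the .pyc inside the archive
    finally:
        sys.path.remove(str(archive))

    result = zipped_arithmetic.add(3, 4)

print(result)
```

The import succeeds even though the archive contains no .py file at all.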

The ability to import compiled modules, either by hand or through these ZIP files, lets you implement a rudimentary code obfuscation mechanism. Unfortunately, it wouldn’t be particularly bulletproof since more tech-savvy users could try to decompile your .pyc files back into high-level Python code using specialized tools like uncompyle6 or pycdc.

But even without those external tools, you can disassemble Python’s bytecode into human-readable opcodes, making the analysis and reverse-engineering of your programs fairly accessible. The proper way to conceal Python source code is by compiling it to machine code. For example, you can help yourself with tools like Cython or rewrite the core parts of your code in a lower-level programming language like C, C++, or Rust.

How to Disassemble the Cached Bytecode?

Once you get a handle on a code object in Python, you can use the dis module from the standard library to disassemble the compiled bytecode. For the sake of example, you’ll quickly generate a code object yourself without having to rely on the .pyc files cached by Python:

Python
>>> from pathlib import Path
>>> source_code = Path("arithmetic.py").read_text(encoding="utf-8")
>>> module = compile(source_code, "arithmetic.py", mode="exec")
>>> module
<code object <module> at 0x7d09a9c92f50, file "arithmetic.py", line 1>

You call the built-in compile() function with the "exec" mode as an argument to compile the source code of an entire module. Now, you can display the human-readable opcode names of the resulting code object using dis:

Python
>>> from dis import dis
>>> dis(module)
  0           0 RESUME                   0

  1           2 LOAD_CONST               0 (<code object add at 0x7d...>)
              4 MAKE_FUNCTION            0
              6 STORE_NAME               0 (add)
              8 RETURN_CONST             1 (None)

Disassembly of <code object add at 0x7d...>:
  1           0 RESUME                   0

  2           2 LOAD_FAST                0 (a)
              4 LOAD_FAST                1 (b)
              6 BINARY_OP                0 (+)
             10 RETURN_VALUE

The opcodes MAKE_FUNCTION and STORE_NAME tell you that there’s a function named add() in this bytecode. When you look closely at the disassembled code object of that function, you’ll see that it takes two arguments called a and b, adds them using the binary plus operator (+), and returns the calculated value.
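If you'd rather draw those conclusions in code than read the listing by eye, then you can inspect the instructions programmatically. The sketch below assumes the same two-line add() function as the content of arithmetic.py and compiles it from a string so that it runs on its own. Exact opcodes vary between Python versions, so it only looks at the ones mentioned above:

```python
import dis
from types import CodeType

# Assumed content of arithmetic.py, inlined to keep the sketch self-contained
source_code = "def add(a, b):\n    return a + b\n"
module = compile(source_code, "arithmetic.py", mode="exec")

# Opcode names present at the module level
opnames = {instruction.opname for instruction in dis.get_instructions(module)}

# Names bound at the module level through STORE_NAME
defined_names = [
    instruction.argval
    for instruction in dis.get_instructions(module)
    if instruction.opname == "STORE_NAME"
]

# The function's own code object is stored among the module's constants
add_code = next(
    const for const in module.co_consts if isinstance(const, CodeType)
)

print(defined_names)
print(add_code.co_name, add_code.co_varnames)
```

The co_varnames attribute of the nested code object holds the local variable names, starting with the function's parameters, which is how you can tell that add() takes a and b.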

Alternatively, you could walk through the tree of nested code objects and iterate over their individual instructions to try to reconstruct the high-level Python source code:

Python
>>> from dis import Bytecode
>>> from types import CodeType

>>> def traverse(code):
...     print(code.co_name, code.co_varnames)
...     for instruction in Bytecode(code):
...         if isinstance(instruction.argval, CodeType):
...             traverse(instruction.argval)
...
>>> traverse(module)
<module> ()
add ('a', 'b')

This code snippet is far from being complete. Plus, decompiling is a complicated process in general. It often leads to imperfect results, as some information gets irreversibly lost when certain optimizations are applied during the bytecode compilation. Anyway, it should give you a rough idea of how decompilers work.
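To take the idea one small step further, here's a hedged sketch that recovers function signatures, but not their bodies, from the nested code objects. The function_stubs() helper is hypothetical and deliberately naive: it only handles plain positional parameters and doesn't distinguish functions from other nested code objects, such as class bodies or lambdas:

```python
from types import CodeType


def function_stubs(code, depth=0):
    """Yield a 'def name(params): ...' stub for each nested code object."""
    for const in code.co_consts:
        if isinstance(const, CodeType):
            # The first co_argcount local names are the positional parameters
            params = ", ".join(const.co_varnames[: const.co_argcount])
            yield "    " * depth + f"def {const.co_name}({params}): ..."
            yield from function_stubs(const, depth + 1)


# Assumed sample source, inlined to keep the sketch self-contained
source_code = (
    "def add(a, b):\n"
    "    return a + b\n"
    "\n"
    "def sub(a, b):\n"
    "    return a - b\n"
)
module = compile(source_code, "arithmetic.py", mode="exec")
stubs = list(function_stubs(module))

for stub in stubs:
    print(stub)
```

Reconstructing the function bodies is the hard part, as it requires mapping sequences of opcodes back onto expressions and statements. That's what dedicated decompilers attempt to do.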

Conclusion

In this tutorial, you’ve dived into the inner workings of Python’s bytecode caching mechanism. You now understand that caching is all about the modules you import. By storing compiled bytecode in __pycache__ folders, Python avoids the overhead of recompiling modules on every program run, leading to faster startup times.

Now you understand what triggers the creation of cache folders, how to suppress them, and how to move them to a centralized folder on your file system. Along the way, you built a utility tool to let you read and execute the individual .pyc files from a __pycache__ folder.

About Bartosz Zaczyński

Bartosz is a bootcamp instructor, author, and polyglot programmer in love with Python. He helps his students get into software engineering by sharing over a decade of commercial experience in the IT industry.
