May 13, 2024

Choosing the Right Python Serialization Tool

By Alyce Osbourne

Serialization is a key process in programming where data structures or object states are transformed into a format that can be conveniently stored, transmitted, and later reconstructed.

Let’s explore the serialization options provided by Python!

JSON

JavaScript Object Notation, otherwise known as JSON, is a human-readable serialization format that is also simple for machines to parse. Python includes support for this format in the json module.

JSON is supported by nearly every programming language and is the predominant format for exchanging data over networks, particularly in web APIs.

Key features:

Text-based and language-independent.
Ideal for lightweight and cross-language data interchange.
Supports basic data types like strings, numbers, lists, and dictionaries.

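Caveats:

Can only serialize basic types out of the box; other objects, such as custom classes, dates, or sets, must first be converted into something JSON understands.
Because the format is text-based, its delimiters, quoting, and repeated keys can balloon the size of the stored data, depending on your data structure.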

import json

# Serializing data
data = {'name': 'Arjan', 'profession': "Software Developer"}
json_data = json.dumps(data)
print(json_data)

# Deserializing data
data_loaded = json.loads(json_data)
print(data_loaded)
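To make the "simple data" caveat concrete, here is a minimal sketch of what happens with types JSON does not know about, and one way to work around it using the `default` parameter of `json.dumps`. The `encode_extra` helper and the `record` contents are illustrative assumptions, not part of any library.

```python
import json
from datetime import date

# A record containing two types the json module cannot serialize by default.
record = {"name": "Arjan", "started": date(2024, 5, 13), "skills": {"python"}}

try:
    json.dumps(record)
except TypeError as exc:
    print(f"Cannot serialize directly: {exc}")


def encode_extra(obj):
    """Hypothetical fallback: convert unsupported objects to JSON-friendly ones."""
    if isinstance(obj, date):
        return obj.isoformat()
    if isinstance(obj, set):
        return sorted(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


json_data = json.dumps(record, default=encode_extra)
restored = json.loads(json_data)
print(restored)
```

Note that the round trip is lossy: the date comes back as a plain string and the set as a list, so the loading side has to know how to convert them back.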

Marshal

The marshal module serializes and deserializes Python values in a binary format. It exists mainly to read and write the compiled bytecode stored in .pyc files and is not intended for general persistence, especially since the format may change between Python versions.

It can work for passing simple data between processes that run the same Python version, although the multiprocessing module itself relies on pickle. Outside of such cases, it’s generally not recommended for use. It can only serialize simple built-in types and is unsuitable for more complex data.

Key features:

Fast and specific to Python internal use.
Not secure against erroneous or maliciously constructed data.
Intended for Python bytecode serialization.

Caveats:

Not intended for general use.
The format is not guaranteed to be compatible across Python versions.
Python-specific format that other languages cannot read.
Can only serialize simple data types.

import marshal

# Serializing data
data = {'x': 1, 'y': 2}
serialized_data = marshal.dumps(data)

# Deserializing data
deserialized_data = marshal.loads(serialized_data)
print(deserialized_data)
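The limitation to simple types is easy to see in practice. The sketch below round-trips a dictionary of built-in values and then tries to marshal an instance of a custom class, which fails. The `Point` class is a made-up example.

```python
import marshal


class Point:
    """Hypothetical custom class used only to trigger the failure."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


# Built-in containers and primitives round-trip without trouble.
simple = {"x": 1, "tags": ("a", "b"), "ratio": 0.5}
round_tripped = marshal.loads(marshal.dumps(simple))
print(round_tripped)

# Instances of user-defined classes are rejected.
try:
    marshal.dumps(Point(1, 2))
    error = None
except ValueError as exc:
    error = exc
    print(f"marshal refused the object: {exc}")

# The format version is tied to the interpreter that wrote the data.
print(f"marshal format version: {marshal.version}")
```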

Pickle

The pickle module serializes Python object structures into byte streams and is more general than marshal. Unlike json, pickle can handle a wide variety of Python objects, including custom classes.

Of all of the options on the list, this is the most powerful but also the most dangerous.

Unpickling can execute arbitrary Python code, so bad actors can use a pickle file to deliver a malicious payload to your computer. Because the code runs inside an authorized application, it generally evades most antivirus technology. This means you should never, ever unpickle data from untrusted sources.
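A harmless sketch of why this is dangerous: an object can define `__reduce__` to tell pickle which callable to invoke when the data is loaded. The `Payload` class below is a made-up example that only evaluates an arithmetic expression, but an attacker could just as easily name a function that runs shell commands.

```python
import pickle


class Payload:
    """Hypothetical class whose pickle triggers a function call on load."""

    def __reduce__(self):
        # pickle stores "call eval with this argument" instead of the object.
        # A real attack would put something far less innocent here.
        return (eval, ("6 * 7",))


malicious_bytes = pickle.dumps(Payload())

# The receiving side never sees a Payload instance: loading the bytes
# simply runs the stored call and returns its result.
result = pickle.loads(malicious_bytes)
print(result)
```

The loading code does not need the `Payload` class at all, which is what makes untrusted pickle data so risky.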

Key features:

Can serialize complex Python objects.
Uses compact binary protocols.

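Caveats:

Potentially dangerous: loading untrusted data can run arbitrary code.
Highly coupled to the structure of the code that generated it.
Not always compatible between versions: older Python versions cannot read newer pickle protocols, and objects may fail to load if the classes that define them have changed.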

import pickle

# Serializing data
data = {'a': [1, 2, 3], 'b': None}
with open('data.pkl', 'wb') as file:
    pickle.dump(data, file)

# Deserializing data
with open('data.pkl', 'rb') as file:
    loaded_data = pickle.load(file)
print(loaded_data)

Shelve

The shelve module is a persistent key-value store that uses pickle to serialize objects. This means that shelve can store arbitrary Python objects. Since it uses pickle, it comes with the same inherent risks.

Key features:

Dictionary-like interface.
Stores pickled objects with a key.
Good for simple data storage solutions.

Caveats:

All of the ones presented by pickle.
Only tracks in-place changes to stored mutable objects if opened with writeback=True, which can use a lot of memory and make closing the shelf slow.
Large file sizes, as well as multiple files per shelf depending on the operating system.

import shelve

# Serializing data
with shelve.open('shelf.db') as shelf:
    shelf['info'] = {'name': 'Alice', 'occupation': 'Engineer'}

# Deserializing data
with shelve.open('shelf.db') as shelf:
    print(shelf['info'])
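The caveat about tracking changes to mutable objects is worth seeing in action. In this sketch, a list stored on the shelf is mutated in place twice: once with the default settings and once with `writeback=True`. The shelf name `demo_shelf` and the temporary directory are assumptions made so the example cleans up after itself.

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_shelf")

    with shelve.open(path) as shelf:
        shelf["numbers"] = [1, 2, 3]

    # Default behaviour: shelf["numbers"] returns a fresh copy, so the
    # in-place append is applied to the copy and then thrown away.
    with shelve.open(path) as shelf:
        shelf["numbers"].append(4)
    with shelve.open(path) as shelf:
        without_writeback = shelf["numbers"]

    # With writeback=True, accessed entries are cached and written back
    # when the shelf is closed, so the mutation persists.
    with shelve.open(path, writeback=True) as shelf:
        shelf["numbers"].append(4)
    with shelve.open(path) as shelf:
        with_writeback = shelf["numbers"]

print(without_writeback)
print(with_writeback)
```

Without writeback, the usual alternative is to read the value, modify it, and assign it back to the key explicitly.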

CTypes

Lastly, ctypes is a foreign function library for Python that provides C-compatible data types and allows calling functions in DLLs or shared libraries. It also provides a number of C-style data structures, such as Structure. While not directly intended for serialization, it can facilitate it by organizing data into structures and converting them to bytes.

Key features:

Ideal for interfacing with C code.
Provides C-compatible data types.
It isn’t a direct serialization tool but can be used as a tool to write serializers.
Maximum control over the representation of serialized data.

Caveats:

It isn’t intended as a serialization tool; as such, all serialization and deserialization functions will need to be handwritten.
You will have to maintain your own serialization tools.
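As a rough illustration of the approach, the sketch below defines a structure, converts it to raw bytes, and rebuilds it from those bytes. The `Point` structure and its fields are invented for this example, and it uses the machine's native byte order, so the bytes are only portable between systems with the same layout.

```python
import ctypes


class Point(ctypes.Structure):
    """Hypothetical fixed-layout record: two 32-bit ints and an 8-byte label."""

    _fields_ = [
        ("x", ctypes.c_int32),
        ("y", ctypes.c_int32),
        ("label", ctypes.c_char * 8),
    ]


point = Point(3, -7, b"origin")

# Serializing: a Structure exposes its memory, so bytes() copies it out.
raw = bytes(point)
print(len(raw), raw)

# Deserializing: build a new Structure from a copy of the raw bytes.
restored = Point.from_buffer_copy(raw)
print(restored.x, restored.y, restored.label)
```

Every byte of the output is determined by the field definitions, which is where the "maximum control" comes from, but details such as byte order, padding, and versioning are left entirely to you.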

Final thoughts

Python provides a number of tools for serializing, each with their own strengths and weaknesses. For most common cases, JSON will be the preferable choice, and for situations where you need to serialize more complex data, pickle and shelve can be used.

Be sure to weigh up the pros and cons of each option and choose the one that suits your problem the best.

Be sure to check out my post on creating custom collections.
