pickle — Python object serialization¶The pickle module implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, “marshalling,” [1] or “flattening”, however, to avoid confusion, the terms used here are “pickling” and “unpickling”. This documentation describes both the pickle module and the cPickle module. Relationship to other Python modules¶The pickle module has an optimized cousin called the cPickle module. As its name implies, cPickle is written in C, so it can be up to 1000 times faster than pickle. However it does not support subclassing of the Pickler() and Unpickler() classes, because in cPickle these are functions, not classes. Most applications have no need for this functionality, and can benefit from the improved performance of cPickle. Other than that, the interfaces of the two modules are nearly identical; the common interface is described in this manual and differences are pointed out where necessary. In the following discussions, we use the term “pickle” to collectively describe the pickle and cPickle modules. The data streams the two modules produce are guaranteed to be interchangeable. Python has a more primitive serialization module called marshal, but in general pickle should always be the preferred way to serialize Python objects. marshal exists primarily to support Python’s .pyc files. The pickle module differs from marshal several significant ways:
Warning The pickle module is not intended to be secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source. Note that serialization is a more primitive notion than persistence; although pickle reads and writes file objects, it does not handle the issue of naming persistent objects, nor the (even more complicated) issue of concurrent access to persistent objects. The pickle module can transform a complex object into a byte stream and it can transform the byte stream into an object with the same internal structure. Perhaps the most obvious thing to do with these byte streams is to write them onto a file, but it is also conceivable to send them across a network or store them in a database. The module shelve provides a simple interface to pickle and unpickle objects on DBM-style database files. Data stream format¶The data format used by pickle is Python-specific. This has the advantage that there are no restrictions imposed by external standards such as XDR (which can’t represent pointer sharing); however it means that non-Python programs may not be able to reconstruct pickled Python objects. By default, the pickle data format uses a printable ASCII representation. This is slightly more voluminous than a binary representation. The big advantage of using printable ASCII (and of some other characteristics of pickle‘s representation) is that for debugging or recovery purposes it is possible for a human to read the pickled file with a standard text editor. There are currently 3 different protocols which can be used for pickling.
Refer to PEP 307 for more information. If a protocol is not specified, protocol 0 is used. If protocol is specified as a negative value or HIGHEST_PROTOCOL, the highest protocol version available will be used. Changed in version 2.3: Introduced the protocol parameter. A binary format, which is slightly more efficient, can be chosen by specifying a protocol version >= 1. Usage¶To serialize an object hierarchy, you first create a pickler, then you call the pickler’s dump() method. To de-serialize a data stream, you first create an unpickler, then you call the unpickler’s load() method. The pickle module provides the following constant:
Note Be sure to always open pickle files created with protocols >= 1 in binary mode. For the old ASCII-based pickle protocol 0 you can use either text mode or binary mode as long as you stay consistent. A pickle file written with protocol 0 in binary mode will contain lone linefeeds as line terminators and therefore will look “funny” when viewed in Notepad or other editors which do not support this format. The pickle module provides the following functions to make the pickling process more convenient:
The pickle module also defines three exceptions:
The pickle module also exports two callables [2], Pickler and Unpickler:
It is possible to make multiple calls to the dump() method of the same Pickler instance. These must then be matched to the same number of calls to the load() method of the corresponding Unpickler instance. If the same object is pickled by multiple dump() calls, the load() will all yield references to the same object. [3] Unpickler objects are defined as:
What can be pickled and unpickled?¶The following types can be pickled:
Attempts to pickle unpicklable objects will raise the PicklingError exception; when this happens, an unspecified number of bytes may have already been written to the underlying file. Trying to pickle a highly recursive data structure may exceed the maximum recursion depth, a RuntimeError will be raised in this case. You can carefully raise this limit with sys.setrecursionlimit(). Note that functions (built-in and user-defined) are pickled by “fully qualified” name reference, not by value. This means that only the function name is pickled, along with the name of module the function is defined in. Neither the function’s code, nor any of its function attributes are pickled. Thus the defining module must be importable in the unpickling environment, and the module must contain the named object, otherwise an exception will be raised. [4] Similarly, classes are pickled by named reference, so the same restrictions in the unpickling environment apply. Note that none of the class’s code or data is pickled, so in the following example the class attribute attr is not restored in the unpickling environment: class Foo:
attr = 'a class attr'
picklestring = pickle.dumps(Foo)
These restrictions are why picklable functions and classes must be defined in the top level of a module. Similarly, when class instances are pickled, their class’s code and data are not pickled along with them. Only the instance data are pickled. This is done on purpose, so you can fix bugs in a class or add methods to the class and still load objects that were created with an earlier version of the class. If you plan to have long-lived objects that will see many versions of a class, it may be worthwhile to put a version number in the objects so that suitable conversions can be made by the class’s __setstate__() method. The pickle protocol¶This section describes the “pickling protocol” that defines the interface between the pickler/unpickler and the objects that are being serialized. This protocol provides a standard way for you to define, customize, and control how your objects are serialized and de-serialized. The description in this section doesn’t cover specific customizations that you can employ to make the unpickling environment slightly safer from untrusted pickle data streams; see section Subclassing Unpicklers for more details. Pickling and unpickling normal class instances¶
Pickling and unpickling extension types¶
An alternative to implementing a __reduce__() method on the object to be pickled, is to register the callable with the copy_reg module. This module provides a way for programs to register “reduction functions” and constructors for user-defined types. Reduction functions have the same semantics and interface as the __reduce__() method described above, except that they are called with a single argument, the object to be pickled. The registered constructor is deemed a “safe constructor” for purposes of unpickling as described above. Pickling and unpickling external objects¶For the benefit of object persistence, the pickle module supports the notion of a reference to an object outside the pickled data stream. Such objects are referenced by a “persistent id”, which is just an arbitrary string of printable ASCII characters. The resolution of such names is not defined by the pickle module; it will delegate this resolution to user defined functions on the pickler and unpickler. [7] To define external persistent id resolution, you need to set the persistent_id attribute of the pickler object and the persistent_load attribute of the unpickler object. To pickle objects that have an external persistent id, the pickler must have a custom persistent_id() method that takes an object as an argument and returns either None or the persistent id for that object. When None is returned, the pickler simply pickles the object as normal. When a persistent id string is returned, the pickler will pickle that string, along with a marker so that the unpickler will recognize the string as a persistent id. To unpickle external objects, the unpickler must have a custom persistent_load() function that takes a persistent id string and returns the referenced object. Here’s a silly example that might shed more light: import pickle
from cStringIO import StringIO
src = StringIO()
p = pickle.Pickler(src)
def persistent_id(obj):
if hasattr(obj, 'x'):
return 'the value %d' % obj.x
else:
return None
p.persistent_id = persistent_id
class Integer:
def __init__(self, x):
self.x = x
def __str__(self):
return 'My name is integer %d' % self.x
i = Integer(7)
print i
p.dump(i)
datastream = src.getvalue()
print repr(datastream)
dst = StringIO(datastream)
up = pickle.Unpickler(dst)
class FancyInteger(Integer):
def __str__(self):
return 'I am the integer %d' % self.x
def persistent_load(persid):
if persid.startswith('the value '):
value = int(persid.split()[2])
return FancyInteger(value)
else:
raise pickle.UnpicklingError, 'Invalid persistent id'
up.persistent_load = persistent_load
j = up.load()
print j
In the cPickle module, the unpickler’s persistent_load attribute can also be set to a Python list, in which case, when the unpickler reaches a persistent id, the persistent id string will simply be appended to this list. This functionality exists so that a pickle data stream can be “sniffed” for object references without actually instantiating all the objects in a pickle. [8] Setting persistent_load to a list is usually used in conjunction with the noload() method on the Unpickler. Subclassing Unpicklers¶By default, unpickling will import any class that it finds in the pickle data. You can control exactly what gets unpickled and what gets called by customizing your unpickler. Unfortunately, exactly how you do this is different depending on whether you’re using pickle or cPickle. [9] In the pickle module, you need to derive a subclass from Unpickler, overriding the load_global() method. load_global() should read two lines from the pickle data stream where the first line will the name of the module containing the class and the second line will be the name of the instance’s class. It then looks up the class, possibly importing the module and digging out the attribute, then it appends what it finds to the unpickler’s stack. Later on, this class will be assigned to the __class__ attribute of an empty class, as a way of magically creating an instance without calling its class’s __init__(). Your job (should you choose to accept it), would be to have load_global() push onto the unpickler’s stack, a known safe version of any class you deem safe to unpickle. It is up to you to produce such a class. Or you could raise an error if you want to disallow all unpickling of instances. If this sounds like a hack, you’re right. Refer to the source code to make this work. Things are a little cleaner with cPickle, but not by much. To control what gets unpickled, you can set the unpickler’s find_global attribute to a function or None. If it is None then any attempts to unpickle instances will raise an UnpicklingError. If it is a function, then it should accept a module name and a class name, and return the corresponding class object. It is responsible for looking up the class and performing any necessary imports, and it may raise an error to prevent instances of the class from being unpickled. The moral of the story is that you should be really careful about the source of the strings your application unpickles. Example¶For the simplest code, use the dump() and load() functions. Note that a self-referencing list is pickled and restored correctly. import pickle
data1 = {'a': [1, 2.0, 3, 4+6j],
'b': ('string', u'Unicode string'),
'c': None}
selfref_list = [1, 2, 3]
selfref_list.append(selfref_list)
output = open('data.pkl', 'wb')
# Pickle dictionary using protocol 0.
pickle.dump(data1, output)
# Pickle the list using the highest protocol available.
pickle.dump(selfref_list, output, -1)
output.close()
The following example reads the resulting pickled data. When reading a pickle-containing file, you should open the file in binary mode because you can’t be sure if the ASCII or binary format was used. import pprint, pickle
pkl_file = open('data.pkl', 'rb')
data1 = pickle.load(pkl_file)
pprint.pprint(data1)
data2 = pickle.load(pkl_file)
pprint.pprint(data2)
pkl_file.close()
Here’s a larger example that shows how to modify pickling behavior for a class. The TextReader class opens a text file, and returns the line number and line contents each time its readline() method is called. If a TextReader instance is pickled, all attributes except the file object member are saved. When the instance is unpickled, the file is reopened, and reading resumes from the last location. The __setstate__() and __getstate__() methods are used to implement this behavior. #!/usr/local/bin/python
class TextReader:
"""Print and number lines in a text file."""
def __init__(self, file):
self.file = file
self.fh = open(file)
self.lineno = 0
def readline(self):
self.lineno = self.lineno + 1
line = self.fh.readline()
if not line:
return None
if line.endswith("\n"):
line = line[:-1]
return "%d: %s" % (self.lineno, line)
def __getstate__(self):
odict = self.__dict__.copy() # copy the dict since we change it
del odict['fh'] # remove filehandle entry
return odict
def __setstate__(self, dict):
fh = open(dict['file']) # reopen file
count = dict['lineno'] # read from file...
while count: # until line count is restored
fh.readline()
count = count - 1
self.__dict__.update(dict) # update attributes
self.fh = fh # save the file object
A sample usage might be something like this: >>> import TextReader
>>> obj = TextReader.TextReader("TextReader.py")
>>> obj.readline()
'1: #!/usr/local/bin/python'
>>> obj.readline()
'2: '
>>> obj.readline()
'3: class TextReader:'
>>> import pickle
>>> pickle.dump(obj, open('save.p', 'wb'))
If you want to see that pickle works across Python processes, start another Python session, before continuing. What follows can happen from either the same process or a new process. >>> import pickle
>>> reader = pickle.load(open('save.p', 'rb'))
>>> reader.readline()
'4: """Print and number lines in a text file."""'
cPickle — A faster pickle¶The cPickle module supports serialization and de-serialization of Python objects, providing an interface and functionality nearly identical to the pickle module. There are several differences, the most important being performance and subclassability. First, cPickle can be up to 1000 times faster than pickle because the former is implemented in C. Second, in the cPickle module the callables Pickler() and Unpickler() are functions, not classes. This means that you cannot use them to derive custom pickling and unpickling subclasses. Most applications have no need for this functionality and should benefit from the greatly improved performance of the cPickle module. The pickle data stream produced by pickle and cPickle are identical, so it is possible to use pickle and cPickle interchangeably with existing pickles. [10] There are additional minor differences in API between cPickle and pickle, however for most applications, they are interchangeable. More documentation is provided in the pickle module documentation, which includes a list of the documented differences. Footnotes
|