
Overview

I want to serialize my complex objects. It looks simple, but every step creates a different problem.

In the end, other programmers must also be able to create a complex object that inherits from my parent class, and this object should be picklable on both Python 2.7 and Python 3.x.

I started with a simple object and used pickle.dump and pickle.load with success.

I then created several complex objects (similar but not identical); some of them can be dumped, and a few cannot.

Debugging

The pickle library knows which objects can be pickled and which cannot. In theory, this means pdb could be customized to support pickle debugging.

Alternative serialization libraries

I wanted a reliable serialization independent of the content of the object. So I searched for other serialization tools:

  • Cerealizer, whose self-test failed and which seems to be outdated.
  • MessagePack, which did not appear to be available for Python 3.
  • I tried JSON and got the error: builtins.TypeError: <lib.scan.Content object at 0x7f37f1e5da50> is not JSON serializable
  • I looked at marshal and shelve, but both refer back to pickle.
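
The JSON error in the list above appears because the json module only handles basic types (dict, list, str, numbers, and so on). A minimal sketch of one way around it, assuming a hypothetical `Content` class standing in for `lib.scan.Content` (its attributes `name` and `size` are invented for illustration), is to pass a `default` hook to `json.dumps`:

```python
import json

class Content(object):
    """Hypothetical stand-in for lib.scan.Content; the attributes are invented."""
    def __init__(self, name, size):
        self.name = name
        self.size = size

def to_jsonable(obj):
    # json.dumps calls this hook only for objects it cannot serialize itself.
    if isinstance(obj, Content):
        return {'name': obj.name, 'size': obj.size}
    raise TypeError('%r is not JSON serializable' % (obj,))

text = json.dumps(Content('scan', 3), default=to_jsonable, sort_keys=True)
print(text)
```

This only covers serialization; turning the dict back into an object on load would need a matching step, and every class has to be handled explicitly.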

Digging into using pickle

I have read "How to check if an object is pickleable", which did not give me an answer.

The closest I found was "How to find source of error in Python Pickle on massive object".

I adjusted this to:

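import pickle
import sys

# pickle._Pickler is the pure-Python pickler in Python 3; in Python 2,
# pickle.Pickler is already pure Python, so save() can be overridden.
if sys.version_info[0] >= 3:
    class MyPickler(pickle._Pickler):
        def save(self, obj):
            try:
                pickle._Pickler.save(self, obj)
            except Exception:
                # Report the object whose save() raised.
                print('pick(3.x) {0} of type {1}'.format(obj, type(obj)))
else:
    class MyPickler(pickle.Pickler):
        def save(self, obj):
            try:
                pickle.Pickler.save(self, obj)
            except Exception:
                print('pick(2.x)', obj, 'of type', type(obj))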

I call this code using:

import platform

def save(obj, file):
    if platform.python_implementation() == 'CPython':
        myPickler = MyPickler(file)
        myPickler.save(obj)

I expected save to run until an exception is raised and then print the content of obj, so that I could see exactly where the error occurs. But the result is:

pick(3.x)  <class 'module'> of type <class 'type'>
pick(3.x)  <class 'module'> of type <class 'type'>
pick(3.x)  <class 'Struct'> of type <class 'type'>
pick(3.x)  <class 'site.setquit.<locals>.Quitter'> of type <class 'type'>
pick(3.x)  <class 'site.setquit.<locals>.Quitter'> of type <class 'type'>
pick(3.x)  <class 'module'> of type <class 'type'>
pick(3.x)  <class 'sys.int_info'> of type <class 'type'>
...

This is just a small part of the output, and I do not understand it. It does not tell me which detail cannot be pickled, or how to solve this.

I have seen http://docs.python.org/3/library/pickle.html#what-can-be-pickled-and-unpickled but it does not help me much if I cannot detect which line in my code cannot be pickled.

The code in my complex object works as expected, in the end running generated code such as:

sys.modules['unum']

But when pickling, it seems the module is not handled as expected.

Attempt at a solution

Some background to clarify what I mean. I have had programs that worked and then suddenly stopped working. The cause might be an update or some other changed resource. There are also programs that work for others but not for me, and vice versa.

This is a general problem, so I want to develop a program that checks all kinds of resources. The number of different kinds of resources is huge, so I have one parent class with all the general behaviour, and a detail class, kept as small as possible, for each specific resource.

This is done in my child resource classes.

These resources have to be checked against different versions, e.g. Python 2.7 or Python 3.3. If you run Python 2.7.5, the resource is valid when Python 2.7 or higher is required, so the check must be a bit more than an equality test. This is specified as a single statement in the custom config file. There is a specific config file for each program, which must be as small as possible. One resource is checked with a single statement in the config file.

The general class is about 98% of the code. The specific resources and config are only about 2% of the code, so it is very easy to add new resources to check and new config files for new programs.
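
The "a bit more than equal" version check described above can be illustrated with tuple comparison. This is a minimal sketch, not the code from the question; the helper name `meets_minimum` is hypothetical:

```python
import sys

def meets_minimum(actual, required):
    """Return True if the version tuple `actual` is at least `required`."""
    # Tuples compare element by element, so (2, 7, 5) >= (2, 7) holds.
    return tuple(actual) >= tuple(required)

print(meets_minimum((2, 7, 5), (2, 7)))   # running 2.7.5, 2.7 or higher required
print(meets_minimum((2, 6, 9), (2, 7)))   # running 2.6.9, 2.7 or higher required
print(meets_minimum(sys.version_info[:3], (2, 7)))  # the running interpreter
```

Comparing tuples of integers avoids the pitfalls of comparing version strings, where '2.10' would sort before '2.7'.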

This child resource:

class R_Sys(r_base.R_Base):
    '''
    doc : http://docs.python.org/3/library/sys.html#module-sys

    sys.modules contains only the modules that have been imported

    statement :
    sys.modules['psutil'] #  may return false (installed but not imported)
    but the statements :
    import psutil
    sys.modules['psutil'] # will return true, now psutil is imported
    '''

    allowed_names = ('modules', 'path', 'builtin_module_names', 'stdin')

    allowed_keys_in_dict_config = ('name',)
    allowed_operators = ("R_NONE", "=", 'installed')  # installed only for modules

    class_group = 'Sys'
    module_used = sys   


    def __init__(self, check_type, group, name):
        super(R_Sys, self).__init__(check_type, group, name)

called by this config statement:

sc.analyse(r.R_Sys, c.ct('DETECT'), dict(name='path'))

can be pickled successfully. But with this config statement:

sc.analyse(r.R_Sys, c.ct('DETECT'),
                     dict(name='modules', tuplename='unum') )  

it fails.

In my opinion, this means that the 98% main code should be OK; otherwise the first statement would fail as well.

There are class attributes in the child class. These are required for it to function properly. And again, with the first call the dump executes well. I have not tried a load yet.
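
One possible explanation, offered as a guess rather than a diagnosis, is that the failing call ends up storing a module object (such as `sys.modules['unum']`) on the instance, and module objects cannot be pickled. A minimal sketch, using hypothetical `Holder` and `SafeHolder` classes that are not part of the code above, shows the failure and one common workaround: store only the module name in the pickled state and look the module up again on load.

```python
import pickle
import sys

class Holder(object):
    """Hypothetical stand-in for a resource object that keeps a module."""
    def __init__(self, module_name):
        self.module_name = module_name
        self.module = sys.modules[module_name]  # a module object

class SafeHolder(Holder):
    def __getstate__(self):
        # Copy the instance dict and leave out the module object itself.
        state = self.__dict__.copy()
        del state['module']
        return state

    def __setstate__(self, state):
        # Restore the attributes, then look the module up again by name.
        self.__dict__.update(state)
        self.module = sys.modules[self.module_name]

try:
    pickle.dumps(Holder('sys'))
    failed = False
except Exception as exc:
    failed = True
    print('Holder failed: {0}'.format(exc))

restored = pickle.loads(pickle.dumps(SafeHolder('sys')))
print(restored.module is sys)
```

The lookup in `__setstate__` assumes the module is already imported in the process that loads the pickle; using `importlib.import_module` there would relax that assumption.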

Bernard
  • Can you post the code of the unserializable object? – User Mar 06 '14 at 20:37
  • Not really. It is complex. With a lot of code which has nothing to do with pickling. So it would be very confusing and hard to detect. – Bernard Mar 07 '14 at 09:40
  • In general my preference is not a solution for this single object, because the problem may appear again with the next complex object. I am searching for a kind of "pickle debug". This "pickle debug" would pinpoint the one line of code that is wrong for pickling and, if possible, the type of error. If that line is found, with an error code, I assume 99% of the problem is solved. And not only for me, but for everybody using pickle. – Bernard Mar 07 '14 at 09:49
  • I am using pickle. So you just want to know which object is not picklable, or also the line of code where this object was created? The last one is very difficult. It could be done so that it tells you which attribute of something is not picklable, or the whole reference graph from the pickle.dumps input to the unpicklable object. A simple example would be good to see if we are talking about the same thing and to show the algorithm. Of course I could create it, but of what use is the example if it does not match your case? – User Mar 07 '14 at 11:07
  • @User I added additional info and code to my question and hope this is sufficient for you. The new text starts with the sentence: "Some background to clear what I mean." – Bernard Mar 07 '14 at 15:19
  • @user, do you have enough info? The full code is a few thousand lines, so it is not suitable to share here on Stack Overflow. Also, there should be more than 20 child resources and more than 100 config files, each with multiple config statements. To use pickle in the future as well, I feel it must be robust and able to handle the described complexity. – Bernard Mar 10 '14 at 10:26
  • @Bernard it is great that you have provided some code. Advice: do not complain about having too many lines to show, and do your best to make your own "minimal working example", which in this case means a minimal example where you have a problem pickling. It requires some effort, but it generally forces you to think about the problem and often leads to a solution. At least it helps others understand your problem, which is a prerequisite to helping you. – Jan Vlcinsky Apr 20 '14 at 21:02
  • @Jan in general I do agree with you; try to make a "minimal working example" Said more general "information hiding" show only what is relevant. This is what I am searching for in pickle. Pickle is for me a blackbox and I would like to have just a small exception raised like 'detail x can not be pickled because situation y'. Then I can solve my problem. – Bernard Apr 21 '14 at 07:53
  • I was running into similar problems and I decided to add a to_serializable_data and from_serializable_data in my objects. This returns a dictionary (filters __dict__) with what I need or set whatever is needed; and at the same time transforms unpickleable things (e.g. converts collections.DefaultDict to regular dict...) – Josep Valls Nov 19 '14 at 19:19
  • Once you get a pickling exception, can you not try pickling each of the contained objects separately? I have not used pickle or json that extensively to know the internals, but I expect that a list, tuple, or dict of pickleable objects is itself pickleable. So iterate through the failed object, find the failed contained object, iterate through that, as deep as you need to go. Keeping a recursion depth counter and item index or key at each level would probably be useful. – RufusVS Feb 16 '15 at 19:47
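
The attribute-by-attribute search suggested in the last comment can be sketched as follows. This is a minimal illustration using only the standard library; the function name `find_unpicklable` and the `Node` class are hypothetical, and the search only descends into dict values, sequences, and instance attributes.

```python
import pickle
import types

def find_unpicklable(obj, path='obj', seen=None):
    """Return (path, error) pairs for the innermost parts that fail to pickle."""
    seen = set() if seen is None else seen
    if id(obj) in seen:          # guard against reference cycles
        return []
    seen.add(id(obj))
    try:
        pickle.dumps(obj)
        return []                # this part pickles, nothing to report
    except Exception as exc:
        error = exc
    # Decide which contained objects to descend into.
    if isinstance(obj, dict):
        children = [('%s[%r]' % (path, k), v) for k, v in obj.items()]
    elif isinstance(obj, (list, tuple, set, frozenset)):
        children = [('%s[%d]' % (path, i), v) for i, v in enumerate(obj)]
    elif (hasattr(obj, '__dict__')
          and not isinstance(obj, (type, types.ModuleType))):
        children = [('%s.%s' % (path, k), v) for k, v in vars(obj).items()]
    else:
        children = []            # leaf: classes and modules are not expanded
    found = []
    for child_path, child in children:
        found.extend(find_unpicklable(child, child_path, seen))
    # If no child explains the failure, this object itself is the culprit.
    return found or [(path, error)]

class Node(object):
    pass

node = Node()
node.ok = [1, 2, 3]
node.bad = {'callback': lambda x: x}   # lambdas cannot be pickled by pickle
for where, why in find_unpicklable(node, 'node'):
    print('{0} -> {1}'.format(where, type(why).__name__))
```

Because every level is pickled again from scratch, this is slow on large objects; it is meant as a debugging aid, not something to run in production code.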

1 Answer

dill has some good diagnostic tools for pickling, the best of which is the pickle trace (similar to what you have implemented).

Let's build a complex object, and explore (the session below is Python 2):

>>> import dill
>>> class Foo(object):
...   @classmethod
...   def bar(self, x):
...     return self.z + x
...   def baz(self, z):
...     self.z = z
...   z = 1
...   zap = lambda self, x: x + self.bar(x)
... 
>>> f = Foo()
>>> f.zap(3)
7
>>> f.baz(7)
>>> f.z 
7

Turn on "pickle trace":

>>> dill.detect.trace(True)
>>> _f = dill.dumps(f)
T2: <class '__main__.Foo'>
F2: <function _create_type at 0x10f94a668>
T1: <type 'type'>
F2: <function _load_type at 0x10f94a5f0>
T1: <type 'object'>
D2: <dict object at 0x10f96bb40>
Cm: <classmethod object at 0x10f9ad408>
T4: <type 'classmethod'>
F1: <function bar at 0x10f9aa9b0>
F2: <function _create_function at 0x10f94a6e0>
Co: <code object bar at 0x10f9a9130, file "<stdin>", line 2>
F2: <function _unmarshal at 0x10f94a578>
D1: <dict object at 0x10e8d6168>
D2: <dict object at 0x10f96b5c8>
F1: <function baz at 0x10f9aaa28>
Co: <code object baz at 0x10f9a9ab0, file "<stdin>", line 5>
D1: <dict object at 0x10e8d6168>
D2: <dict object at 0x10f969d70>
F1: <function <lambda> at 0x10f9aaaa0>
Co: <code object <lambda> at 0x10f9a9c30, file "<stdin>", line 8>
D1: <dict object at 0x10e8d6168>
D2: <dict object at 0x10f97d050>
D2: <dict object at 0x10e97b4b0>
>>> f_ = dill.loads(_f)
>>> f_.z
7

Ok, dill can pickle this object… so let's make it harder. We first turn off trace.

>>> dill.detect.trace(False)
>>> 
>>> f.y = xrange(5)
>>> f.w = iter([1,2,3])
>>> 
>>> dill.pickles(f)
False

Ok, now dill fails. So what causes the failure? We can look at all of the objects that fail to pickle if we dig into our object f.

>>> dill.detect.badtypes(f)
<class '__main__.Foo'>
>>> dill.detect.badtypes(f, depth=1)
{'__hash__': <type 'method-wrapper'>, '__setattr__': <type 'method-wrapper'>, '__reduce_ex__': <type 'builtin_function_or_method'>, 'baz': <type 'instancemethod'>, '__reduce__': <type 'builtin_function_or_method'>, '__str__': <type 'method-wrapper'>, '__format__': <type 'builtin_function_or_method'>, '__getattribute__': <type 'method-wrapper'>, 'zap': <type 'instancemethod'>, '__delattr__': <type 'method-wrapper'>, '__repr__': <type 'method-wrapper'>, 'w': <type 'listiterator'>, '__dict__': <type 'dict'>, '__sizeof__': <type 'builtin_function_or_method'>, '__init__': <type 'method-wrapper'>}
>>> dill.detect.badobjects(f, depth=1)
{'__hash__': <method-wrapper '__hash__' of Foo object at 0x10f9b0050>, '__setattr__': <method-wrapper '__setattr__' of Foo object at 0x10f9b0050>, '__reduce_ex__': <built-in method __reduce_ex__ of Foo object at 0x10f9b0050>, 'baz': <bound method Foo.baz of <__main__.Foo object at 0x10f9b0050>>, '__reduce__': <built-in method __reduce__ of Foo object at 0x10f9b0050>, '__str__': <method-wrapper '__str__' of Foo object at 0x10f9b0050>, '__format__': <built-in method __format__ of Foo object at 0x10f9b0050>, '__getattribute__': <method-wrapper '__getattribute__' of Foo object at 0x10f9b0050>, 'zap': <bound method Foo.<lambda> of <__main__.Foo object at 0x10f9b0050>>, '__delattr__': <method-wrapper '__delattr__' of Foo object at 0x10f9b0050>, '__repr__': <method-wrapper '__repr__' of Foo object at 0x10f9b0050>, 'w': <listiterator object at 0x10f9b0550>, '__dict__': {'y': xrange(5), 'z': 7, 'w': <listiterator object at 0x10f9b0550>}, '__sizeof__': <built-in method __sizeof__ of Foo object at 0x10f9b0050>, '__init__': <method-wrapper '__init__' of Foo object at 0x10f9b0050>}

Hmmm. That's a lot. Of course, not all of these objects have to serialize for our object to serialize… however at least one of them is causing the failure.

The natural thing to do is look at the failure we are getting… So, what's the error that would be thrown? Maybe that will give a hint.

>>> dill.detect.errors(f)
PicklingError("Can't pickle <type 'listiterator'>: it's not found as __builtin__.listiterator",)

Aha, the listiterator is a bad object. Let's dig deeper by turning "trace" back on.

>>> dill.detect.trace(True)
>>> dill.pickles(f)
T2: <class '__main__.Foo'>
F2: <function _create_type at 0x10f94a668>
T1: <type 'type'>
F2: <function _load_type at 0x10f94a5f0>
T1: <type 'object'>
D2: <dict object at 0x10f9826e0>
Cm: <classmethod object at 0x10f9ad408>
T4: <type 'classmethod'>
F1: <function bar at 0x10f9aa9b0>
F2: <function _create_function at 0x10f94a6e0>
Co: <code object bar at 0x10f9a9130, file "<stdin>", line 2>
F2: <function _unmarshal at 0x10f94a578>
D1: <dict object at 0x10e8d6168>
D2: <dict object at 0x10f96b5c8>
F1: <function baz at 0x10f9aaa28>
Co: <code object baz at 0x10f9a9ab0, file "<stdin>", line 5>
D1: <dict object at 0x10e8d6168>
D2: <dict object at 0x10f969d70>
F1: <function <lambda> at 0x10f9aaaa0>
Co: <code object <lambda> at 0x10f9a9c30, file "<stdin>", line 8>
D1: <dict object at 0x10e8d6168>
D2: <dict object at 0x10f97d050>
D2: <dict object at 0x10e97b4b0>
Si: xrange(5)
F2: <function _eval_repr at 0x10f94acf8>
T4: <type 'listiterator'>
False

Indeed, it stops at the listiterator. However, notice (just above) that the xrange does pickle. So, let's replace the iter with xrange.

>>> f.w = xrange(1,4)  
>>> dill.detect.trace(False)
>>> dill.pickles(f)
True
>>> 

Our object now pickles again.

dill has a bunch of other pickle detection tools built-in, including methods to trace which object points to which (useful for debugging recursive pickling failures).

I believe that cloudpickle also has some similar tools to dill for pickle debugging… but the main tool in either case is similar to what you have built.
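
As a side note on the pickler from the question: it swallows every exception, so pickling carries on past the first failure, which is why the output keeps going. A minimal sketch of a variant that re-raises and remembers the chain of objects leading to the failure is shown below. It assumes Python 3 and relies on `pickle._Pickler`, the pure-Python pickler in CPython, which is a private name and not a documented API; the class name `TracingPickler` is hypothetical.

```python
import io
import pickle

class TracingPickler(pickle._Pickler):
    """Pure-Python pickler that remembers the chain of objects being saved."""

    def __init__(self, *args, **kwargs):
        pickle._Pickler.__init__(self, *args, **kwargs)
        self.stack = []      # objects currently being saved, outermost first
        self.failure = None  # type names along the path to the first failure

    def save(self, obj, *args, **kwargs):
        self.stack.append(obj)
        try:
            return pickle._Pickler.save(self, obj, *args, **kwargs)
        except Exception:
            if self.failure is None:
                # Record the path once, at the innermost failing object.
                self.failure = [type(o).__name__ for o in self.stack]
            raise            # re-raise so pickling stops at the first error
        finally:
            self.stack.pop()

# A generator cannot be pickled, so this nested structure fails.
data = {'a': [1, (i for i in range(3))]}
pickler = TracingPickler(io.BytesIO())
try:
    pickler.dump(data)
except Exception as exc:
    print('failed: {0}'.format(exc))
    print(' -> '.join(pickler.failure))
```

The recorded path only contains type names; storing `repr(o)` or attribute names instead would give more detail at the cost of noisier output.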

Mike McKerns
  • Thank you for your answer. I am not so familiar with SO. This question was asked one year ago, and I gave up on pickling altogether, mostly for the above reason, but also because of the security risk when receiving a pickle file from another source. So sorry you spent time on this. I would have liked to close this question, but I do not know how to close on SO. But maybe this question has value for others. If so, please do respond. Debugging pickle is an important subject. – Bernard Aug 26 '15 at 08:34
  • @Bernard: don't worry about it. It was a good question, so I left an answer. Feel free to leave it open for others if they find it, and maybe it will help. I would not however take pickles from 3rd parties. That is surely a bad idea. If you are looking for a secure pickle, then you have to use one that limits the datatypes to the most basic types… and still someone might hijack it. Having said that, pickling is great when you are passing objects to yourself. One thing I didn't mention is `pickletools.dis`, which reads pickles and turns them into code instructions. Very useful. – Mike McKerns Aug 26 '15 at 11:19
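
The `pickletools.dis` call mentioned in the last comment can be tried with a small example. This is a minimal sketch using only the standard library; the sample dictionary is made up for illustration.

```python
import pickle
import pickletools

# Pickle a small, made-up object with protocol 2.
payload = pickle.dumps({'name': 'path', 'values': [1, 2, 3]}, protocol=2)

# Print a disassembly of the pickle, one opcode per line, to stdout.
pickletools.dis(payload)

# genops yields (opcode, argument, position) triples for the same stream.
opcodes = [op.name for op, arg, pos in pickletools.genops(payload)]
print('{0} ... {1}'.format(opcodes[0], opcodes[-1]))
```

The disassembly walks the opcode stream without rebuilding the objects, which makes it a reasonable first look at a pickle whose contents are unknown.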