
Register handlers for dicts

Open lfvjimisola opened this issue 2 years ago • 3 comments

Consider the following dictionary:

requirements: Dict[UrnId, RequirementData]

where

@dataclass
class UrnId:
    urn: str
    id: str

with keys=False

  "requirements": {
        "UrnId(urn='ms-001', id='REQ_ms001_101')": {

with keys=True

    "requirements": {
        "json://\"ms-001:REQ_ms001_101\"": { 

Neither of them is really useful, as I would like to serialize it as something like

  "requirements": {
        "ms-001:REQ_ms001_101": {

Can I override this behaviour somehow?

Is there a way to override serialization of dicts in general?

Ideally, I would like to modify the dict upon serialization based on key so that urn is one level and id another.

So that,

  "requirements": {
        "UrnId(urn='U1', id='I1')": {...}
        "UrnId(urn='U1', id='I2')": {...}

becomes (i.e. adding one more level):

  "requirements": {
        "U1": {
              "I1": {...},
             "I2": {...}
          }

Potential code:

Not sure if there is a way to call flatten using the handlers instead of <call flatten recursively here>

class CustomDictHandler(jsonpickle.handlers.BaseHandler):
    def flatten(self, obj, data):
        if isinstance(obj, dict):
            new_dict = {}
            for key, value in obj.items():
                if isinstance(key, UrnId):
                    if key.urn not in new_dict:
                        new_dict[key.urn] = {}

                    new_dict[key.urn][key.id] = <call flatten recursively here>(value)

            return new_dict
        
        return jsonpickle.handlers.BaseHandler.flatten(self, obj, data)

lfvjimisola avatar Dec 05 '23 20:12 lfvjimisola

At the moment I don't believe that's possible in general by hooking into jsonpickle. That being said, it may be possible to define a custom __getstate__ on UrnId that returns what you're looking for when encoded.

Theelx avatar Dec 05 '23 20:12 Theelx

Ok. I tried adding __getstate__ to the UrnId class, but it does not change the key for me, unless I did something wrong.

You meant for the key representation without "json://"?

Because I looked at https://jsonpickle.readthedocs.io/en/latest/api.html#object.getstate and can't see how a __getstate__ on UrnId can manipulate the dictionary structure.

Or is there a way to manipulate the dictionary structure? If so, would you mind elaborating a little to get me started?

lfvjimisola avatar Dec 05 '23 20:12 lfvjimisola

I can give you an example __getstate__ if you can send me an entire minimal reproducible example (MRE) to get the output that you are currently getting, and an example of the output you want from that script.

Theelx avatar Dec 07 '23 16:12 Theelx

Here are a few notes in case they help.

I'll preface this with this important detail about the library's purpose -- jsonpickle isn't really geared towards customizing some of these more aesthetic aspects of serialization.

jsonpickle's primary use case is being able to reconstitute objects through json. Because of this, we will always need to embed some amount of out-of-band metadata in order to accomplish that goal.

Another important note is that you're using dicts with specialized objects as their keys. That's something that JSON itself cannot represent, so jsonpickle has to do something that's both general-purpose and capable of reconstituting the objects. That's why we have to serialize objects to embedded JSON inside strings: complex object keys require special handling.

Now, with that out of the way, there is one thing you can leverage about jsonpickle's behavior to get pretty close to your first example:

"requirements": {
        "ms-001:REQ_ms001_101": {

By default (without keys=True), when jsonpickle hits the UrnId objects in the dict's keys it uses repr(obj) to stringify each object when creating its corresponding JSON key.

We use embedded json when using json_str = jsonpickle.encode(requirements, keys=True) and that expects you to also use jsonpickle.decode(json_str, keys=True) in order to reconstitute the objects.

But, if you'd like to keep the json representation simpler what you can do is rely on jsonpickle for serializing the values of the dict (its guts) and then handle restoring the keys yourself. This might be a good enough middle ground for your use case.

Also, maybe you don't even care about decoding the objects, and all you care about is the encoded representation. If that's the case you can omit the from_repr() staticmethod and skip the latter part of the test function below.

So, here's the basic approach / workaround:

  • Don't use keys=True so that jsonpickle uses repr(key) on the dict keys.
  • Implement __repr__ so that the dict keys look nicer.
  • Restore dict keys manually from the __repr__ string after loading from jsonpickle to restore a copy of the original dict.

import json
from dataclasses import dataclass

import jsonpickle


@dataclass
class Requirement:
    package: str
    version: list


@dataclass
class UrnId:
    urn: str
    id: str

    def __hash__(self):
        return hash(repr(self))

    def __repr__(self):
        return f'{self.urn}:{self.id}'

    @staticmethod
    def from_repr(string):
        if isinstance(string, UrnId):
            return string
        try:
            urn, new_id = string.split(':', 1)
        except ValueError:
            return None
        return UrnId(urn, new_id)


def test_dataclass_custom_restoration():
    """Restore objects manually to simplify the JSON representation"""
    requirements = {
        UrnId('ms-001', 'REQ_ms001_101'): Requirement('pkz', [1, 0, 2]),
        UrnId('ms-002', 'REQ_ms002_101'): Requirement('pkz', [1, 1, 0]),
        UrnId('ms-003', 'REQ_ms003_101'): Requirement('pkz', [2, 0, 0]),
    }
    encoded = jsonpickle.encode(requirements)
    # If all you care about is the JSON output you can stop here.
    # If you need to restore objects from the JSON above, continue below.

    decoded = jsonpickle.decode(encoded)
    # Reconstitute the top-level dict keys
    new_requirements = {}
    for key, value in decoded.items():
        new_key = UrnId.from_repr(key)
        if new_key is None:
            continue
        new_requirements[new_key] = value

    assert requirements == new_requirements

I consider this a nice middle ground because the resulting JSON looks like the following:

{
  "ms-001:REQ_ms001_101": {"py/object": "__main__.Requirement", "package": "pkz", "version": [1, 0, 2]},
  "ms-002:REQ_ms002_101": {"py/object": "__main__.Requirement", "package": "pkz", "version": [1, 1, 0]},
  "ms-003:REQ_ms003_101": {"py/object": "__main__.Requirement", "package": "pkz", "version": [2, 0, 0]}
}

It's simpler than the default output, and while it could be simpler still, it's not too bad. Any further simplification of the data will require you to handle the serialization yourself.
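As one way of handling the serialization yourself, the dict can be restructured in plain Python before it is encoded. The following is a minimal sketch using only the standard library, not a jsonpickle feature: `nest_by_urn` is a hypothetical helper, and the `UrnId` and `Requirement` dataclasses are simplified stand-ins for the ones in this thread. The output is one-way and carries no metadata for restoring the objects.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen makes UrnId hashable, so it can be a dict key
class UrnId:
    urn: str
    id: str


@dataclass
class Requirement:
    package: str
    version: list


def nest_by_urn(requirements):
    """Group {UrnId: value} into {urn: {id: value-as-dict}} (hypothetical helper)."""
    nested = {}
    for key, value in requirements.items():
        # Create the per-urn level on first use, then file the value under its id.
        nested.setdefault(key.urn, {})[key.id] = asdict(value)
    return nested


requirements = {
    UrnId('U1', 'I1'): Requirement('pkz', [1, 0, 2]),
    UrnId('U1', 'I2'): Requirement('pkz', [1, 1, 0]),
}
nested = nest_by_urn(requirements)
print(json.dumps({'requirements': nested}, indent=2))
```

Here `asdict` converts each dataclass value to a plain dict, so `json.dumps` can handle the result without any custom encoder.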

I've closed this issue for now since it doesn't seem like there's anything actionable left to do, but please feel free to continue the conversation if you have any questions or discussion topics.

davvid avatar Apr 13 '24 22:04 davvid

I can give you an example __getstate__ if you can send me an entire minimal reproducible example (MRE) to get the output that you are currently getting, and an example of the output you want from that script.

@Theelx I totally missed your response. Sorry about that. Reading @davvid's answer now.

lfvjimisola avatar Apr 15 '24 07:04 lfvjimisola

@davvid Thank you for the very informative answer.

I realized as I started reading your reply that I didn't state that we are not deserializing the data (so we can skip from_repr). The JSON output is for third-party tools and it should be as clean as possible, since they have no use for the Python metadata needed for deserialization. For enumerations I could handle this using a custom handler.

With your solution we get:

{
  "ms-001:REQ_ms001_101": {"py/object": "__main__.Requirement", "package": "pkz", "version": [1, 0, 2]},
  "ms-002:REQ_ms002_101": {"py/object": "__main__.Requirement", "package": "pkz", "version": [1, 1, 0]},
  "ms-003:REQ_ms003_101": {"py/object": "__main__.Requirement", "package": "pkz", "version": [2, 0, 0]}
}

but py/object and package are really not wanted. However, this is an improvement.

However, your solution does not solve the challenge of restructuring the dicts (as mentioned initially), so that

"requirements": {
       "UrnId(urn='U1', id='I1')": {...}
       "UrnId(urn='U1', id='I2')": {...}
       

becomes (i.e. adding one more level):

"requirements": {
       "U1": {
             "I1": {...},
            "I2": {...}
         }

Any further simplification of the data will require you to handle the serialization yourself.

Do you mean without jsonpickle, or manually by some other means? @Theelx mentioned __getstate__; could that solve our dilemma somehow?

And just to confirm, there will not be a new feature implemented that allows us to register handlers for dicts?

lfvjimisola avatar Apr 15 '24 07:04 lfvjimisola