Flatten dictionaries function to better view ndjson data (preview, export as csv, `map_item`, etc.)

Open dale-wahl opened this issue 3 years ago • 4 comments

We've discussed this a bit offline. It could be useful to replace map_item, or at least to offer a different sort of data preview and perhaps a way to export as CSV, which many users prefer.

Right now this is the best I've been able to put together (by cannibalizing and adapting other code).

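from collections.abc import MutableMapping
import json


def _flatten_dict_gen(d, parent_key, sep):
    # Yield (flattened_key, value) pairs for every entry in d
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, MutableMapping):
            # Nested dict: recurse, prefixing child keys with new_key
            yield from flatten_dict(v, new_key, sep=sep).items()
        elif isinstance(v, list):
            # List: keep it in one column as a JSON string (dicts inside the list are flattened first)
            yield new_key, json.dumps([flatten_dict(item, new_key, sep=sep) if isinstance(item, MutableMapping) else item for item in v])
        else:
            # Scalar value: emit as-is
            yield new_key, v


def flatten_dict(d: MutableMapping, parent_key: str = '', sep: str = '.'):
    # Collect the generated pairs into a single-level dict
    return dict(_flatten_dict_gen(d, parent_key, sep))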

The handling of list objects is perhaps a bit odd. I was trying to avoid how a package like flatdict handles it, namely flattening lists completely so that you end up with dict_key:0, dict_key:1, dict_key:2, etc. as separate columns. It may not even be necessary to json.dumps() the list, but doing so means we know we can use json.loads() later if desired, and it keeps each list in a single column, which seemed preferable when testing with tweet dicts.
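To make the list handling concrete, here is a small sketch that runs the same logic (condensed into one function so the snippet is self-contained) on a made-up, tweet-like item. The sample data and its field names are assumptions for illustration, not real Twitter output.

```python
import json
from collections.abc import MutableMapping


def flatten_dict(d, parent_key="", sep="."):
    """Same logic as the function above, condensed into a single function."""
    flat = {}
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, MutableMapping):
            flat.update(flatten_dict(v, new_key, sep=sep))
        elif isinstance(v, list):
            # Lists stay in one column, serialized as a JSON string
            flat[new_key] = json.dumps([
                flatten_dict(item, new_key, sep=sep) if isinstance(item, MutableMapping) else item
                for item in v
            ])
        else:
            flat[new_key] = v
    return flat


# Made-up item for illustration only
item = {
    "id": "1",
    "author": {"username": "example", "metrics": {"followers": 10}},
    "entities": {"hashtags": [{"tag": "a"}, {"tag": "b"}]},
}

flat = flatten_dict(item)
print(list(flat))
# ['id', 'author.username', 'author.metrics.followers', 'entities.hashtags']

# The list column can be turned back into Python objects when needed
hashtags = json.loads(flat["entities.hashtags"])
print(hashtags)
# [{'entities.hashtags.tag': 'a'}, {'entities.hashtags.tag': 'b'}]
```

Note that dicts inside a list keep the full parent prefix in their keys (`entities.hashtags.tag`), because they are flattened with `new_key` as their parent key.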

dale-wahl, Jul 13 '22 15:07

I'd also note that this would need to work in conjunction with map_item. For example, with the Twitter datasource we would still want to create fields like hashtags and mentions in the regular format that certain processors expect, and we would still need to ensure that fields like id, body, and timestamp are properly labeled.

This could help resolve some issues I have seen where columns created by 4CAT processors are not easily accessible to other processors, because they are not "added" to the map_item function (though they do exist in the actual ndjson file and so can be accessed if, for example, you bypass map_item in iterate_items).
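One way to picture the combination is a merge in which the mapped fields keep priority and any flattened keys that the mapping does not cover are added alongside them. The sketch below is hypothetical: `merge_mapped_and_flattened` is not a 4CAT function, and the two input dicts are made-up stand-ins for a map_item result and a flattened ndjson item.

```python
def merge_mapped_and_flattened(mapped, flattened):
    """Hypothetical helper: mapped fields win, extra flattened keys are added."""
    merged = dict(mapped)
    for key, value in flattened.items():
        # Only add keys that the mapping did not already define
        merged.setdefault(key, value)
    return merged


# Stand-in for what a map_item-style function might return (made-up values)
mapped = {"id": "1", "body": "hello #a", "hashtags": "a"}

# Stand-in for the flattened raw item, including a column added by a processor
flattened = {"id": "raw-1", "text": "hello #a", "sentiment.score": 0.8}

merged = merge_mapped_and_flattened(mapped, flattened)
print(merged)
# {'id': '1', 'body': 'hello #a', 'hashtags': 'a', 'text': 'hello #a', 'sentiment.score': 0.8}
```

With this approach the processor-added `sentiment.score` column becomes visible without having to extend the mapping, while `id` keeps the value the mapping assigned.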

dale-wahl, Jul 13 '22 15:07

Using this logic, I made a processor that converts any ndjson file to csv: https://github.com/digitalmethodsinitiative/4cat/commit/394311135270671acd80ceace5181db6001a9618

Presumably we could incorporate it into the dataset class itself, but this seemed a simple way to test it on both different datasources and processor outputs that use ndjson.
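For readers who want the general idea without opening the commit, here is a rough standalone sketch of an ndjson-to-csv conversion built on the same flattening logic. It is not the code from the linked commit: `ndjson_to_csv` is a hypothetical name, it uses only the standard library, and it holds all rows in memory to collect the union of column names, which may not suit large datasets.

```python
import csv
import io
import json
from collections.abc import MutableMapping


def flatten_dict(d, parent_key="", sep="."):
    """Same flattening logic as earlier in this issue, condensed."""
    flat = {}
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, MutableMapping):
            flat.update(flatten_dict(v, new_key, sep=sep))
        elif isinstance(v, list):
            flat[new_key] = json.dumps([
                flatten_dict(item, new_key, sep=sep) if isinstance(item, MutableMapping) else item
                for item in v
            ])
        else:
            flat[new_key] = v
    return flat


def ndjson_to_csv(ndjson_lines, out_file, sep="."):
    """Hypothetical helper: write flattened ndjson items to out_file as csv."""
    rows = [flatten_dict(json.loads(line), sep=sep) for line in ndjson_lines if line.strip()]

    # Items may have different keys, so use the union of keys in first-seen order
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    writer = csv.DictWriter(out_file, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return fieldnames


# Made-up ndjson lines for illustration
ndjson = [
    '{"id": "1", "author": {"name": "a"}}',
    '{"id": "2", "tags": ["x", "y"]}',
]

buffer = io.StringIO()
columns = ndjson_to_csv(ndjson, buffer)
print(columns)
# ['id', 'author.name', 'tags']
```

Missing values are written as empty cells, and list columns remain JSON strings that can be parsed again with json.loads() after reading the csv.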

dale-wahl, Jul 21 '22 16:07

As of e8dac2273d7e3f6b3fe2cb7604cb18a651ccc4c4, it is possible to make the previewer show the underlying NDJSON rather than the mapped items. This doesn't fully address the issue, but it does perhaps offer part of the solution.

stijn-uva, Oct 18 '23 15:10