Flatten dictionaries function to better view ndjson data (preview, export as csv, `map_item`, etc.)
We've discussed doing this a bit offline. It could be useful as a replacement for `map_item`, or at least as a different sort of data preview, and perhaps as a way to export to csv, which many users prefer.
Right now this is the best I've been able to put together (through cannibalization and adaptation of other code).
```python
from collections.abc import MutableMapping
import json


def _flatten_dict_gen(d, parent_key, sep):
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, MutableMapping):
            # Nested dict: recurse and yield its items under the joined key
            yield from flatten_dict(v, new_key, sep=sep).items()
        elif isinstance(v, list):
            # List: keep it in one column as a JSON string, flattening any dicts inside it
            yield new_key, json.dumps([flatten_dict(item, new_key, sep=sep) if isinstance(item, MutableMapping) else item for item in v])
        else:
            yield new_key, v


def flatten_dict(d: MutableMapping, parent_key: str = '', sep: str = '.'):
    return dict(_flatten_dict_gen(d, parent_key, sep))
```
The handling of lists is perhaps a bit odd. I was trying to avoid what a package like flatdict does, namely flattening lists completely so that you end up with `dict_key:0`, `dict_key:1`, `dict_key:2`, etc. as columns. It may not even be necessary to `json.dumps()` the list, but doing so means we know we can use `json.loads()` later if desired, and it keeps each list in a single column, which seemed preferable when testing with tweet dicts.
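To make the list behaviour concrete, here is a small sketch. The flattening logic is the same as above, repeated in compact form only so the snippet runs on its own; the `tweet` dict is a made-up example and not the real Twitter schema.

```python
import json
from collections.abc import MutableMapping


def flatten_dict(d, parent_key="", sep="."):
    # Same behaviour as the function above, repeated so this snippet is self-contained
    flat = {}
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, MutableMapping):
            flat.update(flatten_dict(v, new_key, sep=sep))
        elif isinstance(v, list):
            flat[new_key] = json.dumps([
                flatten_dict(item, new_key, sep=sep) if isinstance(item, MutableMapping) else item
                for item in v
            ])
        else:
            flat[new_key] = v
    return flat


# Hypothetical tweet-like item, for illustration only
tweet = {
    "id": "1",
    "author": {"username": "alice"},
    "entities": {"hashtags": [{"tag": "4cat"}, {"tag": "dmi"}]},
}

flat = flatten_dict(tweet)
# The list stays in a single column as a JSON string...
hashtags_column = flat["entities.hashtags"]
# ...and can be turned back into a list of dicts when needed
hashtags = json.loads(hashtags_column)
```

Note that dicts inside a list keep the full parent prefix in their keys (here `entities.hashtags.tag`), since the list items are flattened with `new_key` as their parent key.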
I'd also note that this would need to be used in conjunction with `map_item`. For example, with the twitter datasource we would still want to create things like hashtags and mentions in the regular format expected by certain processors, and we need to ensure that things like `id`, `body` and `timestamp` are properly labeled.
This could help resolve some issues I have seen where columns created by 4CAT processors are not easily accessible to other processors because they are not "added" by the `map_item` function (though they do exist in the actual ndjson file and so can be accessed if you bypass `map_item` in `iterate_items`, for example).
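One way the two could be combined, as a rough sketch only: start from the mapped item so the expected fields keep their names and format, then add the flattened raw fields next to them. `merge_mapped_and_flat` and the `raw.` prefix are hypothetical names used for illustration and are not part of 4CAT; the example dicts are made up.

```python
def merge_mapped_and_flat(mapped: dict, flat: dict, prefix: str = "raw.") -> dict:
    # Mapped fields (id, body, timestamp, hashtags, ...) come first and are never overwritten
    merged = dict(mapped)
    for key, value in flat.items():
        # On a name collision, keep the raw value under a prefixed column name
        merged[prefix + key if key in mapped else key] = value
    return merged


# Hypothetical output of map_item and of the flatten function for the same item
mapped = {"id": "1", "body": "hello #4cat", "hashtags": "4cat"}
flat = {"id": "1", "text": "hello #4cat", "entities.hashtags": '[{"tag": "4cat"}]'}

merged = merge_mapped_and_flat(mapped, flat)
```

With this approach processors that rely on the mapped columns are unaffected, while any extra column present in the ndjson is still reachable.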
Using this logic, I made a processor to convert any ndjson to csv files: https://github.com/digitalmethodsinitiative/4cat/commit/394311135270671acd80ceace5181db6001a9618
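One detail any such conversion has to deal with is that flattened items do not necessarily share the same keys, so the csv header has to be the union of all keys. The sketch below shows one way to handle that with `csv.DictWriter`; it is not the code from the commit linked above, and `write_flat_rows_csv` is a hypothetical helper name.

```python
import csv
import io


def write_flat_rows_csv(rows, handle):
    # Materialise the rows so we can make two passes: one for the header, one to write
    rows = list(rows)
    # Union of all keys, in order of first appearance
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    # restval fills in columns that a given row does not have
    writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return fieldnames


# Hypothetical, already flattened items with different sets of keys
rows = [
    {"id": "1", "author.username": "alice"},
    {"id": "2", "entities.hashtags": '[{"tag": "4cat"}]'},
]

buffer = io.StringIO()
columns = write_flat_rows_csv(rows, buffer)
```

Holding all rows in memory is fine for a test but may not be for large datasets; a two-pass read of the ndjson file (first collecting keys, then writing) would avoid that.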
Presumably we could incorporate it into the dataset class itself, but a separate processor seemed a simple way to test it on both different datasources and processor outputs that use ndjson.
As of e8dac2273d7e3f6b3fe2cb7604cb18a651ccc4c4, it is possible to make the previewer show the underlying NDJSON rather than the mapped items. This doesn't fully address this issue but it does perhaps offer part of the solution.