Implement/document a way how to pass extra configuration to json.dump()
There is useful configuration to json.dump() which I'd like to pass through await crawler.export_data("export.json"), but I see no way to do that:
-
ensure_ascii- as someone living in a country using extended latin, setting this toFalseprevents Python to encode half of the characters as weird mess -
indent- allows me to read the output as a mere human -
sort_keys- may be useful for git scraping, not sure
The only workaround I can think of right now is something convoluted like:
from pathlib import Path
path = Path("export.json")
await crawler.export_data(path)
path.write_text(json.dumps(json.loads(path.read_text()), ensure_ascii=False, indent=2))
Hi @honzajavorek, thanks for your input. We are going to add a new option for providing additional keyword arguments for export_data.
Thanks for considering this!
Just my $.02 - there is the issue of export_data being the kind of 80:20 helper that decides the output format based on the destination filename. Adding JSON-specific kwargs to this doesn't feel right, and we discussed some other options - everybody, feel free to speak your mind :slightly_smiling_face:
- just add kwargs for both json, csv and whatever other format we may add in the future
- make a separate export helper for each format on the
BasicCrawler, e.g.export_data_jsonandexport_data_csv - keep just
export_data, but have it accept something likestr | JsonExportOptions | CsvExportOptions... the path could either be part of both option object types, or there could be two parameters - path and options. The decision tree in the implementation may be larger, but manageable and testable - keep everything the way it is and publish an example how to do this now - something like
json.dump(crawler.get_data(), dest_file, indent=2)should probably do it
I would go for 2, maybe combined with 1, but mentioning the format-specific methods in the export_data comment would do the job too, if you want to configure something specific to JSON or CSV, it makes sense to use a method specific for that format.
My hunch would be that the current .export_data() method is unnecessarily "magical". It implicitly decides format based on extension, which isn't clear from the outside, and will work 90 % of time, but surely there are edge cases when the extension won't match or won't be, for some reason, either .json or .csv. Also my issue shows that it perhaps does too many things, because once we need to pass JSON export options, it doesn't feel right in the interface, as there's also CSV. And CSV has a ton of compatibility options itself! Hence I tend to think that this method should be two explicit methods, which have the potential to take care of the specifics of individual formats. Also in the future, adding or deprecating a format is, IMHO, done in a cleaner way with separate functions.
Is this something I can work on?
Is this something I can work on?
Absolutely, open a PR when you're ready.
Re-opening due to this issue: https://github.com/apify/apify-docs/issues/2112.
I'd suggest introducing an arg like this to crawler.export_data:
additional_kwargs: ExportDataJsonKwargs | ExportDataCsvKwargs