
Implement/document a way to pass extra configuration to json.dump()

Open honzajavorek opened this issue 1 year ago • 8 comments

There are useful configuration options for json.dump() which I'd like to pass through await crawler.export_data("export.json"), but I see no way to do that:

  • ensure_ascii - as someone living in a country using extended Latin, setting this to False prevents Python from encoding half of the characters as a weird mess
  • indent - allows me to read the output as a mere human
  • sort_keys - may be useful for git scraping, not sure

The only workaround I can think of right now is something convoluted like:

import json
from pathlib import Path

# Export first, then re-serialize the file with the desired json.dump() options.
path = Path("export.json")
await crawler.export_data(path)
path.write_text(json.dumps(json.loads(path.read_text()), ensure_ascii=False, indent=2))

honzajavorek avatar Sep 15 '24 17:09 honzajavorek

Hi @honzajavorek, thanks for your input. We are going to add a new option for providing additional keyword arguments for export_data.

vdusek avatar Sep 16 '24 13:09 vdusek

Thanks for considering this!

honzajavorek avatar Sep 16 '24 17:09 honzajavorek

Just my $.02 - there is the issue of export_data being the kind of 80:20 helper that decides the output format based on the destination filename. Adding JSON-specific kwargs to this doesn't feel right, and we discussed some other options - everybody, feel free to speak your mind :slightly_smiling_face:

  1. just add kwargs for json, csv and whatever other format we may add in the future
  2. make a separate export helper for each format on the BasicCrawler, e.g. export_data_json and export_data_csv
  3. keep just export_data, but have it accept something like str | JsonExportOptions | CsvExportOptions... the path could either be part of both option object types, or there could be two parameters - path and options. The decision tree in the implementation may be larger, but manageable and testable
  4. keep everything the way it is and publish an example of how to do this now - something like json.dump(crawler.get_data(), dest_file, indent=2) should probably do it (see the sketch right after this list)
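
For option 4, a minimal sketch of what such a documented example could look like, assuming crawler.get_data() is a coroutine returning a dataset page whose .items attribute is a plain list of dicts (the helper name below is purely illustrative):

import json

async def export_json_with_options(crawler, path: str) -> None:
    # Pull the scraped items out of the default dataset (assumed to expose an .items list of dicts).
    data = await crawler.get_data()
    with open(path, "w", encoding="utf-8") as dest_file:
        # Any json.dump() options can be forwarded directly, e.g. readable indentation and raw UTF-8.
        json.dump(data.items, dest_file, ensure_ascii=False, indent=2)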

janbuchar avatar Sep 17 '24 09:09 janbuchar

I would go for 2, maybe combined with 1, but mentioning the format-specific methods in the export_data comment would do the job too. If you want to configure something specific to JSON or CSV, it makes sense to use a method specific to that format.

B4nan avatar Sep 17 '24 10:09 B4nan

My hunch would be that the current .export_data() method is unnecessarily "magical". It implicitly decides the format based on the extension, which isn't clear from the outside, and will work 90% of the time, but surely there are edge cases where the extension won't match or won't be, for some reason, either .json or .csv. My issue also shows that it perhaps does too many things: once we need to pass JSON export options, they don't feel right in the interface, as there's also CSV. And CSV has a ton of compatibility options itself! Hence I tend to think that this method should be split into two explicit methods, each of which can take care of the specifics of its format. Also, adding or deprecating a format in the future is, IMHO, done more cleanly with separate functions.
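
To make that concrete, a rough sketch of how two explicit methods might be shaped; the class and method names here are hypothetical, and get_data() is assumed to return a page object with an .items list of dicts:

from __future__ import annotations

import json
from pathlib import Path

class ExportHelpersSketch:
    # Hypothetical: one explicit export method per format, each exposing that format's own options.
    async def export_data_json(self, path: str | Path, **json_kwargs) -> None:
        # json_kwargs is forwarded straight to json.dumps(), e.g. ensure_ascii=False, indent=2, sort_keys=True.
        data = await self.get_data()  # assumed to return a page object with an .items list of dicts
        Path(path).write_text(json.dumps(data.items, **json_kwargs), encoding="utf-8")

    async def export_data_csv(self, path: str | Path, **csv_kwargs) -> None:
        # csv_kwargs would be forwarded to csv.writer()/csv.DictWriter() in the same way.
        ...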

honzajavorek avatar Sep 17 '24 15:09 honzajavorek

Is this something I can work on?

0xSolanaceae avatar Oct 05 '24 04:10 0xSolanaceae

Is this something I can work on?

Absolutely, open a PR when you're ready.

janbuchar avatar Oct 07 '24 07:10 janbuchar

Re-opening due to this issue: https://github.com/apify/apify-docs/issues/2112.

I'd suggest introducing an arg like this to crawler.export_data:

additional_kwargs: ExportDataJsonKwargs | ExportDataCsvKwargs
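
One possible shape for those kwargs containers, sketched as TypedDicts; the field lists below are an assumption for illustration, not settled API:

from typing import TypedDict

class ExportDataJsonKwargs(TypedDict, total=False):
    # Subset of json.dump() options that would be passed through unchanged.
    ensure_ascii: bool
    indent: int
    sort_keys: bool

class ExportDataCsvKwargs(TypedDict, total=False):
    # Subset of csv.writer() options that would be passed through unchanged.
    delimiter: str
    quotechar: str
    dialect: str

A call could then look something like await crawler.export_data("export.json", additional_kwargs={"ensure_ascii": False, "indent": 2}), with the accepted keys depending on the target format.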

vdusek avatar Nov 25 '25 14:11 vdusek