Performance ideas

Open eatonphil opened this issue 3 years ago • 0 comments

Catch-all, for now, for potential performance improvements to datastation/dsq.

  • SQL pre-processing
    • Import only used fields (see #71)
    • Pre-filter data in SQLiteWriter: only insert rows that match the WHERE clause
  • Support more input types using SQLiteWriter; this basically requires supporting expanded nested objects (see notes in #67)
  • Maybe handle JSONL in parallel, since a newline cannot appear inside an individual JSON line, so the input can be split on newlines
  • Get rid of map[string]any inside datastation
    • At the very least, put WriteRecord into the ResultWriter interface so SQLiteWriter can avoid map[string]any, which it converts from anyway
  • CSV parser improvements
    • Find a simdcsv Go implementation (https://github.com/minio/simdcsv is abandoned) or write a wrapper around https://github.com/geofflangdale/simdcsv
    • Maybe easier first step: write a parser that handles CSVs when there are no quotes and fall back to encoding/csv otherwise
    • Or actually investigate why encoding/csv is slow
  • Add benchmarks for every file format, not just CSV. Basically every file format needs to be worked on individually
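
A rough sketch of the "easier first step" for CSV above, assuming the whole input is already in memory. `readCSV` is a hypothetical name, not an existing dsq or datastation function, and this has not been benchmarked against encoding/csv. The idea: if the input contains no double quote at all, every newline is a record boundary and every comma is a field boundary, so plain splitting is enough; otherwise hand the input to encoding/csv.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"strings"
)

// readCSV is a hypothetical helper (not part of dsq). It takes a fast path
// when the input contains no double quotes and falls back to encoding/csv
// as soon as a quote is present anywhere in the input.
func readCSV(data []byte) ([][]string, error) {
	if bytes.IndexByte(data, '"') >= 0 {
		// Quotes present: fields may contain commas or newlines,
		// so let the standard library parser handle it.
		return csv.NewReader(bytes.NewReader(data)).ReadAll()
	}

	var rows [][]string
	for len(data) > 0 {
		var line []byte
		if i := bytes.IndexByte(data, '\n'); i >= 0 {
			line, data = data[:i], data[i+1:]
		} else {
			line, data = data, nil
		}
		// Tolerate Windows line endings.
		line = bytes.TrimSuffix(line, []byte("\r"))
		if len(line) == 0 {
			continue // skip empty lines, as encoding/csv does
		}
		// Assumption: unlike encoding/csv, this path does not check
		// that every record has the same number of fields.
		rows = append(rows, strings.Split(string(line), ","))
	}
	return rows, nil
}
```

Scanning the whole input for a quote up front is the simplest possible check. A streaming version would have to decide per chunk and is not sketched here.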

eatonphil, Jun 21 '22 02:06