
Alternative CSV reader

Jolanrensen opened this issue · 6 comments

should be investigated: https://github.com/doyaaaaaken/kotlin-csv

— Jolanrensen, Feb 13 '24 16:02

I tried FastCSV and would like to use it on the JVM: its performance is several times better than the existing reader's, and it beats pandas too. I assume you're aiming for KMP, so that's a different matter; just a note to keep in mind.

— koperagen, Feb 13 '24 17:02

Keep in mind that you can always write your own interface and hide the platform implementation behind it later.

— devcrocod, Feb 22 '24 12:02
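The suggestion above could look something like this minimal sketch: a small common interface for CSV reading so the concrete parser (FastCSV, kotlin-csv, ...) can be swapped per platform later. All names here are illustrative, not actual DataFrame API.

```kotlin
// Hypothetical common abstraction; a real implementation would delegate to
// FastCSV, kotlin-csv, etc. behind this interface.
interface CsvRowReader {
    /** Streams each record as a list of raw string fields. */
    fun read(text: String): Sequence<List<String>>
}

/** A naive stdlib-only stand-in for a real parser (no quoting/escaping). */
class SimpleCsvRowReader(private val delimiter: Char = ',') : CsvRowReader {
    override fun read(text: String): Sequence<List<String>> =
        text.lineSequence()
            .filter { it.isNotEmpty() }
            .map { it.split(delimiter) }
}

fun main() {
    val reader: CsvRowReader = SimpleCsvRowReader() // swap implementations freely
    val rows = reader.read("a,b\n1,2\n3,4").toList()
    println(rows) // [[a, b], [1, 2], [3, 4]]
}
```

Callers then only depend on `CsvRowReader`, so the backing library can change without touching call sites.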

I've been experimenting with different implementations to find the fastest one in combination with DataFrame.

Each test has two versions of the implementation:

  • The default version first loads the entire CSV into memory. This is usually the fastest for smaller CSVs, since the columns can be allocated at the right size right away. However, it can run into memory issues more quickly for larger CSV files.
  • That's why each test is accompanied by a "sequential" version. This version uses data collectors to stream the CSV rows into separate string columns directly. The downside is that we don't know the right amount of memory up front, so the ArrayLists need to grow accordingly, but we never materialize a full List<SomeCsvRowClass>, saving memory in the long run :)
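The "sequential" idea above can be sketched with stdlib-only code (names are illustrative, not the benchmark code): rows are streamed into per-column ArrayLists directly, and no list of row objects is ever built.

```kotlin
// Streams rows into one growable collector per column. Sizes are unknown up
// front, so the ArrayLists resize as rows arrive.
fun collectColumns(
    lines: Sequence<String>,
    delimiter: Char = ',',
): Map<String, List<String>> {
    val iter = lines.iterator()
    if (!iter.hasNext()) return emptyMap()
    val header = iter.next().split(delimiter)
    val columns = header.map { ArrayList<String>() }
    while (iter.hasNext()) {
        val fields = iter.next().split(delimiter)
        // missing trailing fields become empty strings
        for (i in header.indices) columns[i].add(fields.getOrElse(i) { "" })
    }
    return header.zip(columns).toMap()
}

fun main() {
    val csv = sequenceOf("name,age", "Alice,30", "Bob,25")
    println(collectColumns(csv)) // {name=[Alice, Bob], age=[30, 25]}
}
```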

We test:

Small CSV: 65.4 kB — [benchmark charts: ops/s, higher is better; s/op, lower is better]

Large CSV: 857.7 MB — [benchmark charts: ops/s, higher is better; s/op, lower is better]

— Jolanrensen, Sep 02 '24 12:09

I now added Deephaven-csv:

(s/op: Lower is better)

Benchmark                                    Mode  Cnt   Score    Error  Units
CsvBenchmark.apacheCsvReader                   ss   10   0.007 ±  0.003   s/op
CsvBenchmark.apacheCsvReaderSequential         ss   10   0.008 ±  0.003   s/op
CsvBenchmark.deephavenCsvReader                ss   10   0.009 ±  0.011   s/op
CsvBenchmark.fastCsvReader                     ss   10   0.004 ±  0.001   s/op
CsvBenchmark.fastCsvReaderSequential           ss   10   0.004 ±  0.002   s/op
CsvBenchmark.kotlinCsvReader                   ss   10   0.008 ±  0.001   s/op
CsvBenchmark.kotlinCsvReaderSequential         ss   10   0.007 ±  0.001   s/op
LargeCsvBenchmark.apacheCsvReader              ss    5  72.809 ± 16.879   s/op
LargeCsvBenchmark.apacheCsvReaderSequential    ss    5  46.433 ± 39.409   s/op
LargeCsvBenchmark.deephavenCsvReader           ss    5  16.640 ±  6.664   s/op
LargeCsvBenchmark.fastCsvReader                ss    5  59.848 ± 22.986   s/op
LargeCsvBenchmark.fastCsvReaderSequential      ss    5  40.747 ±  4.598   s/op
LargeCsvBenchmark.kotlinCsvReader              ss    5  80.383 ± 15.870   s/op
LargeCsvBenchmark.kotlinCsvReaderSequential    ss    5  68.547 ± 20.748   s/op

Note: The Deephaven integration might not be optimal yet:

  • It can parse values by type itself, but I haven't figured out how to supply custom parsers for it yet, so parsing a string column currently requires parsing twice (or more).
  • Deephaven allows defining your own (typed and unboxed) data collector, which could give an immense boost in combination with https://github.com/Kotlin/dataframe/pull/712

— Jolanrensen, Sep 02 '24 18:09
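A typed, unboxed data collector as mentioned above might be sketched like this: ints accumulate in a primitive IntArray that doubles on demand, so no per-value boxing occurs. All names are illustrative, not the Deephaven or DataFrame API.

```kotlin
// Hypothetical growable primitive-backed column collector.
class IntColumnCollector(initialCapacity: Int = 16) {
    private var data = IntArray(initialCapacity)
    var size = 0
        private set

    fun add(value: Int) {
        // grow by doubling; values stay unboxed the whole time
        if (size == data.size) data = data.copyOf(data.size * 2)
        data[size++] = value
    }

    /** Finish collection and hand back a right-sized primitive array. */
    fun toIntArray(): IntArray = data.copyOf(size)
}

fun main() {
    val col = IntColumnCollector(initialCapacity = 2)
    intArrayOf(1, 2, 3, 4, 5).forEach(col::add)
    println(col.toIntArray().toList()) // [1, 2, 3, 4, 5]
}
```

Compared with an `ArrayList<Int>`, this avoids one boxed `Integer` per value, which is where the memory win in the results below would come from.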

Combining Deephaven with https://github.com/Kotlin/dataframe/pull/712 is very promising. Reading the large CSV on the ColumnDataHolder branch with properly set-up Deephaven reading yields the following results: [memory and timing charts]

Doing the same on the master branch yields: [memory and timing charts]

Both in terms of memory and performance, there's something to gain from using Deephaven and primitive arrays, at least when it comes to reading CSVs :)

— Jolanrensen, Sep 17 '24 13:09

Deephaven with normal ArrayLists (that support nulls this time) and new parsers:

[benchmark chart]

— Jolanrensen, Sep 25 '24 17:09