Consider whether to keep using Apache Arrow as the intermediate representation
Columnify uses Apache Arrow Schema/Record as an intermediate representation between the various input formats and the output (currently only Parquet). Arrow is powerful: it offers fast memory access and a columnar representation. But the Go implementation is not perfect yet, e.g. the Arrow record type doesn't support some types in its sub-fields, so it is not yet applicable to Columnify. Additionally, the Arrow Go implementation doesn't support rich data conversion like PyArrow does. As a result, Columnify currently uses only the Arrow Schema as its required intermediate data.
So we have some options to tackle these problems:
- Remove the Arrow dependency. It's unnecessary now, and reducing dependencies would make this product easier to maintain. The Arrow Schema type is replaceable with an Avro schema or something similar.
- Improve Arrow! It's OSS, and we probably have plenty of chances to contribute to the Go Arrow implementation.
- Just keep the current Columnify implementation and watch the activity in the Arrow community.
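To make the first option a bit more concrete, here is a minimal sketch of what an Arrow-free intermediate schema might look like, using only the Go standard library. All names here (`FieldType`, `Field`, `Schema`, `Validate`) are hypothetical and are not part of Columnify or Arrow. Nested and logical types are deliberately left out, so this is an illustration under those assumptions rather than a proposed design.

```go
package main

import "fmt"

// FieldType enumerates the primitive types this sketch supports.
// Hypothetical: a real replacement would also need nested and logical types.
type FieldType int

const (
	Int64 FieldType = iota
	Float64
	String
	Boolean
)

// Field is one named, typed column of the intermediate schema.
type Field struct {
	Name     string
	Type     FieldType
	Nullable bool
}

// Schema is an ordered list of fields, standing in for the Arrow Schema.
type Schema struct {
	Fields []Field
}

// Validate checks a decoded record (column name -> value) against the schema.
// It reports the first missing required field or Go type mismatch it finds.
func (s Schema) Validate(record map[string]interface{}) error {
	for _, f := range s.Fields {
		v, ok := record[f.Name]
		if !ok || v == nil {
			if f.Nullable {
				continue
			}
			return fmt.Errorf("field %q: missing required value", f.Name)
		}
		var match bool
		switch f.Type {
		case Int64:
			_, match = v.(int64)
		case Float64:
			_, match = v.(float64)
		case String:
			_, match = v.(string)
		case Boolean:
			_, match = v.(bool)
		}
		if !match {
			return fmt.Errorf("field %q: unexpected Go type %T", f.Name, v)
		}
	}
	return nil
}

func main() {
	schema := Schema{Fields: []Field{
		{Name: "id", Type: Int64},
		{Name: "name", Type: String, Nullable: true},
	}}
	// A record that satisfies the schema, then one with a wrongly typed "id".
	fmt.Println(schema.Validate(map[string]interface{}{"id": int64(1)}))
	fmt.Println(schema.Validate(map[string]interface{}{"id": "1"}))
}
```

A type like this would have to be mapped from each input schema (e.g. Avro) and to the Parquet schema by hand, which is the maintenance cost to weigh against dropping the dependency.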
As a trivial side note, gocredits doesn't work with the Go Arrow dependency. https://github.com/reproio/columnify/issues/4
Arrow intermediate records should be memory efficient and would help reduce memory usage! https://github.com/reproio/columnify/issues/44
It can also validate input data against a given schema. https://github.com/reproio/columnify/issues/27