Replace infer_schema_length by infer_schema

Open josevalim opened this issue 1 year ago • 4 comments

Today infer_schema_length has an awkward API, since setting it to nil is used to infer all columns and 0 is used to disable it.

I propose:

infer_schema: true | false | non_neg_integer()

Where true enables, false disables, and the integer configures the length. The default can be the same as today.

josevalim avatar Aug 27 '24 13:08 josevalim

I like this, but what would we use for all rows? IIUC true -> default (1000 rows).

cigrainger avatar Aug 27 '24 14:08 cigrainger

true means all rows.
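
To make the mapping concrete, here is a minimal sketch of how the proposed three-valued option could translate to today's infer_schema_length convention (nil = all rows, 0 = disabled). It is written in Rust only for illustration; the InferSchema enum and the to_infer_schema_length function are hypothetical names, not part of Explorer.

```rust
/// Hypothetical model of the proposed `infer_schema` option.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InferSchema {
    /// `true`: infer from all rows.
    All,
    /// `false`: inference disabled.
    Disabled,
    /// `non_neg_integer()`: infer from at most this many rows.
    Rows(usize),
}

/// Translate to the current convention described in the issue:
/// `None` (nil) means all rows, `Some(0)` means disabled.
fn to_infer_schema_length(opt: InferSchema) -> Option<usize> {
    match opt {
        InferSchema::All => None,
        InferSchema::Disabled => Some(0),
        InferSchema::Rows(n) => Some(n),
    }
}

fn main() {
    for opt in [InferSchema::All, InferSchema::Disabled, InferSchema::Rows(1000)] {
        println!("{:?} -> {:?}", opt, to_infer_schema_length(opt));
    }
}
```

Note that in this sketch Rows(0) and Disabled end up equivalent, since non_neg_integer() still admits 0.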

josevalim avatar Aug 27 '24 14:08 josevalim

Thanks for improving this! Just sharing how DuckDB handles it. It has two parameters:

  • auto_detect: true | false
  • sample_size: BIGINT (-1 means all rows; default 20480)

ref: CSV Import – DuckDB; CSV Auto Detection – DuckDB
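
For comparison, a rough sketch of how a two-parameter design like this could collapse into a single "row limit" value. The function name from_duckdb_style is hypothetical, and two things here are assumptions of the sketch rather than documented DuckDB behavior: that auto_detect = false is treated as "no inference", and that negative values other than -1 are rejected.

```rust
/// Hypothetical helper: fold `auto_detect` and `sample_size` into one value,
/// where `None` means all rows and `Some(0)` means inference is disabled.
fn from_duckdb_style(auto_detect: bool, sample_size: i64) -> Result<Option<usize>, String> {
    if !auto_detect {
        // Assumption: disabling auto detection maps to "do not infer".
        return Ok(Some(0));
    }
    match sample_size {
        // -1 means all rows, as in the parameter list above.
        -1 => Ok(None),
        // Any other negative value fails the conversion and is reported as an error.
        n => usize::try_from(n)
            .map(Some)
            .map_err(|_| format!("invalid sample_size: {n}")),
    }
}

fn main() {
    println!("{:?}", from_duckdb_style(true, 20480));
    println!("{:?}", from_duckdb_style(true, -1));
    println!("{:?}", from_duckdb_style(false, 20480));
}
```

The sentinel -1 plays the same role as nil does in infer_schema_length today, which is the kind of special value the proposal tries to avoid.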

I am more than happy to take a stab at this

lei0zhou avatar Aug 27 '24 18:08 lei0zhou

Today infer_schema_length has an awkward API, since setting it to nil is used to infer all columns and 0 is used to disable it.

I propose:

infer_schema: true | false | non_neg_integer()

Where true enables, false disables, and the integer configures the length. The default can be the same as today.

👉🏼 Given an Option<NonZeroUsize> to infer the schema, my understanding is:

  • if it's None, the entire file is used
  • otherwise, the given number of rows is used
  • 0 is ruled out by the type itself, since a NonZeroUsize cannot hold 0

    /// Set the JSON reader to infer the schema of the file. Currently, this is only used when reading from
    /// [`JsonFormat::JsonLines`], as [`JsonFormat::Json`] reads in the entire array anyway.
    ///
    /// When using [`JsonFormat::JsonLines`], `max_records = None` will read the entire buffer in order to infer the
    /// schema, `Some(1)` would look only at the first record, `Some(2)` the first two records, etc.
    ///
    /// It is an error to pass `max_records = Some(0)`, as a schema cannot be inferred from 0 records when deserializing
    /// from JSON (unlike CSVs, there is no header row to inspect for column names).
    pub fn infer_schema_len(mut self, max_records: Option<NonZeroUsize>) -> Self {
        self.infer_schema_len = max_records;
        self
    }
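
A small sketch of how a caller might build that argument from a plain row count, using only the standard library. The helper name rows_to_max_records is hypothetical. One thing worth noticing: NonZeroUsize::new(0) returns None at runtime, so a caller who naively converts 0 would end up with the "read the entire buffer" case rather than an error.

```rust
use std::num::NonZeroUsize;

/// Hypothetical helper: convert a raw row count into the
/// `Option<NonZeroUsize>` shape that `infer_schema_len` takes.
fn rows_to_max_records(rows: usize) -> Option<NonZeroUsize> {
    // `NonZeroUsize::new` returns `None` when `rows` is 0.
    NonZeroUsize::new(rows)
}

fn main() {
    // A positive count is preserved.
    let two = rows_to_max_records(2);
    assert_eq!(two.map(NonZeroUsize::get), Some(2));

    // Zero silently becomes `None`, which the doc comment above
    // describes as "read the entire buffer".
    assert!(rows_to_max_records(0).is_none());

    println!("2 -> {:?}, 0 -> {:?}", two, rows_to_max_records(0));
}
```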

ceyhunkerti avatar Sep 25 '24 20:09 ceyhunkerti