takacsd comments

Results: 17 comments by takacsd

Can't cast to VARBINARY

If I do `cast(col, BINARY)`, it gets converted to `CAST("colname" AS VARBINARY)` and works fine. But if I do `cast(col, VARBINARY)`, it gets converted to `CAST("colname" AS BINARY)` which...

BUG: read_parquet converts pyarrow list type to numpy dtype

Ran into the same issue:

```py
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'a': pd.Series([['a'], ['a', 'b']], dtype=pd.ArrowDtype(pa.list_(pa.string())))})
df.to_parquet('test.parquet')  # SUCCESS
pd.read_parquet('test.parquet')  # *** FAIL
df.to_parquet('test.parquet')  # SUCCESS
pq.read_table('test.parquet').to_pandas(ignore_metadata=True, types_mapper=pd.ArrowDtype)  # SUCCESS
df.to_parquet('test.parquet', store_schema=False)  #...
```

BUG: read_parquet converts pyarrow list type to numpy dtype

@danielhanchen I think the problem is in the pandas-specific metadata. If the parquet file was created with something else (e.g., AWS Athena), pandas can read it just fine. ```py...

BUG: read_parquet converts pyarrow list type to numpy dtype

> The main issue I think is because `dtype` is a string I guess. I'm not 100% sure about how `_pandas_api.pandas_dtype` works, but presumably it's a large `dict` mapping types in...

BUG: read_parquet converts pyarrow list type to numpy dtype

@danielhanchen your approach only works here, and it just ignores the metadata. I'm not a pandas developer but I suppose they generated that metadata for a reason, so it may...

BUG: read_parquet converts pyarrow list type to numpy dtype

Yeah, after some experimenting, I think we need to give up on parsing the type string. These two:

```py
pd.Series([{'a': 1, 'b': 1}], dtype=pd.ArrowDtype(pa.struct({'a': pa.int64(), 'b': pa.int64()})))
pd.Series([{'a: int64, b':...
```
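A minimal sketch of why a type string can be ambiguous, assuming struct types are rendered as `struct<name: type, ...>`. The formatter `struct_type_string` is a hypothetical stand-in written for illustration, not pyarrow code, and the reading of the truncated comment above is an assumption: a field whose name itself contains `: ` and `, ` can render to the same text as two ordinary fields.

```python
def struct_type_string(fields: dict[str, str]) -> str:
    """Render a struct type the way the assumed convention does (hypothetical helper)."""
    body = ', '.join(f'{name}: {dtype}' for name, dtype in fields.items())
    return f'struct<{body}>'


# Two ordinary fields named 'a' and 'b'.
two_fields = struct_type_string({'a': 'int64', 'b': 'int64'})

# One field with the unusual name 'a: int64, b'.
one_odd_field = struct_type_string({'a: int64, b': 'int64'})

# Both render to the same text, so the string alone cannot tell them apart.
print(two_fields)
print(one_odd_field)
print(two_fields == one_odd_field)
```

Under this assumption, no parser working only from the string can recover the original field names in every case.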

BUG: read_parquet converts pyarrow list type to numpy dtype

I was bored:

```py
class ParseFail(Exception):
    pass

class Parsed(NamedTuple):
    type: pa.DataType
    end: int

class TypeStringParser:
    BASIC_TYPE_MATCHER = re.compile(r'\w+(\[[^\]]+\])?')
    TIMESTAMP_MATCHER = re.compile(r'timestamp\[([^,]+), tz=([^\]]+)\]')
    NAME_MATCHER = re.compile(r'\w+')  # this can be r'[^:]'...
```
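The comment above is cut off, so the full parser is not visible here. As a rough illustration of the general technique (recursive descent over a type string), here is a self-contained sketch that uses only the standard library. It is not the author's implementation: it returns nested tuples instead of `pa.DataType` objects, the grammar (`list<name: type>`, `struct<name: type, ...>`, and basic types with an optional `[...]` suffix) is an assumption, and `parse`, `parse_type`, `parse_field_name`, and `expect` are hypothetical names.

```python
import re
from typing import NamedTuple


class ParseFail(Exception):
    """Raised when the type string does not match the assumed grammar."""


class Parsed(NamedTuple):
    type: object  # nested tuple or str; a stand-in for a real Arrow type object
    end: int      # index just past the consumed text


BASIC = re.compile(r'\w+(\[[^\]]+\])?')  # e.g. int64, timestamp[ns, tz=UTC]
NAME = re.compile(r'\w+')                # simple field names only


def expect(s: str, pos: int, token: str) -> None:
    if not s.startswith(token, pos):
        raise ParseFail(f'expected {token!r} at position {pos}')


def parse_field_name(s: str, pos: int) -> tuple[str, int]:
    """Consume 'name: ' and return the name plus the position after it."""
    m = NAME.match(s, pos)
    if m is None:
        raise ParseFail(f'expected a field name at position {pos}')
    expect(s, m.end(), ': ')
    return m.group(0), m.end() + 2


def parse_type(s: str, pos: int = 0) -> Parsed:
    if s.startswith('list<', pos):
        _, item_pos = parse_field_name(s, pos + len('list<'))
        item = parse_type(s, item_pos)
        expect(s, item.end, '>')
        return Parsed(('list', item.type), item.end + 1)
    if s.startswith('struct<', pos):
        pos += len('struct<')
        fields = []
        while True:
            name, pos = parse_field_name(s, pos)
            field = parse_type(s, pos)
            fields.append((name, field.type))
            pos = field.end
            if s.startswith(', ', pos):
                pos += 2  # another field follows
                continue
            expect(s, pos, '>')
            return Parsed(('struct', tuple(fields)), pos + 1)
    m = BASIC.match(s, pos)
    if m is None:
        raise ParseFail(f'expected a type at position {pos}')
    return Parsed(m.group(0), m.end())


def parse(s: str):
    """Parse a whole type string, rejecting trailing text."""
    parsed = parse_type(s)
    if parsed.end != len(s):
        raise ParseFail(f'unexpected trailing text at position {parsed.end}')
    return parsed.type


print(parse('struct<a: int64, b: list<item: string>>'))
```

Because `NAME` only accepts `\w+`, this sketch rejects field names that contain spaces or punctuation rather than guessing at them.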

NetRocks' AWS plugin only shows 1000 files.

PR is up: #2786

NetRocks' AWS plugin only shows 1000 files.

Yes, thank you!