arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-17382: [C++] open_dataset doesn't ignore BOM in csv file when header's with quotes

Open ZMZ91 opened this issue 3 years ago • 3 comments

ZMZ91 avatar Aug 10 '22 09:08 ZMZ91

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

github-actions[bot] avatar Aug 10 '22 09:08 github-actions[bot]

https://issues.apache.org/jira/browse/ARROW-17382

github-actions[bot] avatar Aug 11 '22 01:08 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot] avatar Aug 11 '22 01:08 github-actions[bot]

Hi @pitrou, could you help review this pr?

ZMZ91 avatar Aug 18 '22 02:08 ZMZ91

@ZMZ91 I could, but the first step would be to get the CI runs fixed.

In addition, since this claims to fix a bug, there should be a unit test added somewhere.

pitrou avatar Aug 18 '22 10:08 pitrou

Thanks @pitrou. I've pushed a new commit and still got 2 ci failures. I'm not sure it's related with my change. Could you help check?

ZMZ91 avatar Aug 19 '22 04:08 ZMZ91

You are right, the failing CI checks are unrelated.

pitrou avatar Aug 19 '22 14:08 pitrou

@ZMZ91 There is already some logic in the CSV reader to skip the BOM: https://github.com/apache/arrow/blob/fe9ca423f7f2684d70e0cf0cff8f48a0175c0fdf/cpp/src/arrow/csv/reader.cc#L109-L121

Instead of adding the same logic in the CSV parser, you should instead try to find out what that logic (in the CSV reader) isn't sufficient here.

pitrou avatar Aug 19 '22 14:08 pitrou

I saw the code in reader.cc @pitrou. But it seems work on the entire csv file while the parser only takes a block of csv data and delimits rows and fields. And the data for each function are read respectively. Correct me if I'm not right.

ZMZ91 avatar Aug 22 '22 07:08 ZMZ91

That is why your approach is wrong: the BOM is only expected at the beginning of the file, not at the beginning of each CSV cell.

pitrou avatar Aug 22 '22 09:08 pitrou

CI failures are unrelated.

pitrou avatar Sep 13 '22 17:09 pitrou

Benchmark runs are scheduled for baseline = 8bf60b5deaa15a6e301221b57b419c9599e313b0 and contender = 01dce6a0c95242224eab1a5331e0a2524a4b9339. 01dce6a0c95242224eab1a5331e0a2524a4b9339 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2 [Failed] test-mac-arm [Failed :arrow_down:0.0% :arrow_up:0.0%] ursa-i9-9960x [Finished :arrow_down:0.18% :arrow_up:0.0%] ursa-thinkcentre-m75q Buildkite builds: [Finished] 01dce6a0 ec2-t3-xlarge-us-east-2 [Finished] 01dce6a0 test-mac-arm [Failed] 01dce6a0 ursa-i9-9960x [Finished] 01dce6a0 ursa-thinkcentre-m75q [Finished] 8bf60b5d ec2-t3-xlarge-us-east-2 [Failed] 8bf60b5d test-mac-arm [Failed] 8bf60b5d ursa-i9-9960x [Finished] 8bf60b5d ursa-thinkcentre-m75q Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Sep 14 '22 04:09 ursabot