ARROW-17382: [C++] open_dataset doesn't ignore BOM in csv file when header's with quotes
Thanks for opening a pull request!
If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.
Then could you also rename pull request title in the following format?
ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
or
MINOR: [${COMPONENT}] ${SUMMARY}
See also:
https://issues.apache.org/jira/browse/ARROW-17382
:warning: Ticket has not been started in JIRA, please click 'Start Progress'.
Hi @pitrou, could you help review this pr?
@ZMZ91 I could, but the first step would be to get the CI runs fixed.
In addition, since this claims to fix a bug, there should be a unit test added somewhere.
Thanks @pitrou. I've pushed a new commit and still got 2 ci failures. I'm not sure it's related with my change. Could you help check?
You are right, the failing CI checks are unrelated.
@ZMZ91 There is already some logic in the CSV reader to skip the BOM: https://github.com/apache/arrow/blob/fe9ca423f7f2684d70e0cf0cff8f48a0175c0fdf/cpp/src/arrow/csv/reader.cc#L109-L121
Instead of adding the same logic in the CSV parser, you should instead try to find out what that logic (in the CSV reader) isn't sufficient here.
I saw the code in reader.cc @pitrou. But it seems work on the entire csv file while the parser only takes a block of csv data and delimits rows and fields. And the data for each function are read respectively. Correct me if I'm not right.
That is why your approach is wrong: the BOM is only expected at the beginning of the file, not at the beginning of each CSV cell.
CI failures are unrelated.
Benchmark runs are scheduled for baseline = 8bf60b5deaa15a6e301221b57b419c9599e313b0 and contender = 01dce6a0c95242224eab1a5331e0a2524a4b9339. 01dce6a0c95242224eab1a5331e0a2524a4b9339 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Failed :arrow_down:0.0% :arrow_up:0.0%] ursa-i9-9960x
[Finished :arrow_down:0.18% :arrow_up:0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 01dce6a0 ec2-t3-xlarge-us-east-2
[Finished] 01dce6a0 test-mac-arm
[Failed] 01dce6a0 ursa-i9-9960x
[Finished] 01dce6a0 ursa-thinkcentre-m75q
[Finished] 8bf60b5d ec2-t3-xlarge-us-east-2
[Failed] 8bf60b5d test-mac-arm
[Failed] 8bf60b5d ursa-i9-9960x
[Finished] 8bf60b5d ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java