[SPARK-39901][CORE][SQL] Redesign `ignoreCorruptFiles` to make it more accurate by adding a new config `spark.files.ignoreCorruptFiles.errorClasses`
What changes were proposed in this pull request?
As described in issue SPARK-39901, the design of the current ignoreCorruptFiles feature has flaws. It matches IOException too broadly, so a file may be ignored by mistake when a transient or sporadic IO exception occurs.
This PR proposes a new config, spark.files.ignoreCorruptFiles.errorClasses. By setting this config, Spark users can specify exactly which exceptions cause a file to be treated as corrupt and ignored.
For example, if the config value is set as follows:
- java.io.IOException:not a Sequence file,java.io.EOFException (config format: className[:keyMsg],className[:keyMsg])
It means that the file is treated as corrupt and ignored when a java.io.IOException is encountered whose error message contains the key text "not a Sequence file", or when a java.io.EOFException is encountered (here only the class is checked, because no keyMsg is given).
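The matching rule described above can be sketched in plain Java. This is only an illustration, not the code in this PR: the class `CorruptFileErrorRules`, its nested `Rule` type, and the methods `parse` and `shouldIgnore` are hypothetical names. The sketch also assumes that the class name is compared exactly (subclasses do not match), that keyMsg is matched as a substring of the exception message, and that keyMsg itself contains no comma. It uses the fully qualified JDK names `java.io.IOException` and `java.io.EOFException`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not the implementation in this PR.
final class CorruptFileErrorRules {

    /** One parsed entry: an exception class name plus an optional message fragment. */
    static final class Rule {
        final String className;
        final String keyMsg; // null when only the class has to match

        Rule(String className, String keyMsg) {
            this.className = className;
            this.keyMsg = keyMsg;
        }

        boolean matches(Throwable t) {
            // Assumption: exact class-name comparison, so subclasses do not match.
            if (!t.getClass().getName().equals(className)) {
                return false;
            }
            if (keyMsg == null) {
                return true; // class-only rule
            }
            String msg = t.getMessage();
            return msg != null && msg.contains(keyMsg);
        }
    }

    /** Parses "className[:keyMsg],className[:keyMsg]"; an empty value yields no rules. */
    static List<Rule> parse(String conf) {
        List<Rule> rules = new ArrayList<>();
        if (conf == null || conf.trim().isEmpty()) {
            return rules;
        }
        // Assumption: keyMsg never contains a comma, so a plain split is enough.
        for (String entry : conf.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int sep = trimmed.indexOf(':');
            if (sep < 0) {
                rules.add(new Rule(trimmed, null));
            } else {
                String key = trimmed.substring(sep + 1).trim();
                rules.add(new Rule(trimmed.substring(0, sep).trim(), key.isEmpty() ? null : key));
            }
        }
        return rules;
    }

    /** Returns true when at least one rule matches the given exception. */
    static boolean shouldIgnore(List<Rule> rules, Throwable t) {
        for (Rule rule : rules) {
            if (rule.matches(t)) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, an IOException whose message contains "not a Sequence file" and any EOFException would be ignored, while an IOException with an unrelated message (for example a connection reset) would still fail the task.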
The default value of this config is "", which means that no error class list for ignoring corrupt files is set. In that case, the behavior of ignoreCorruptFiles remains exactly the same as before.
Why are the changes needed?
To fix the flaws in the current ignoreCorruptFiles feature.
Does this PR introduce any user-facing change?
Yes. Spark users can change the behavior of ignoreCorruptFiles by setting the new config. By default the behavior remains the same as before, so this is not a breaking change for users.
How was this patch tested?
Added new test cases and passed GA.
Was this patch authored or co-authored using generative AI tooling?
No.
cc @JoshRosen @LuciferYang, please take a look when you have time. Thanks.
cc @cloud-fan
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!