
[SPARK-39389] Filesystem closed should not be considered as corrupt files

Open · boneanxs opened this pull request 3 years ago · 9 comments

What changes were proposed in this pull request?

When ignoreCorruptFiles is enabled, the error should still be thrown if it is caused by Filesystem closed.

Why are the changes needed?

If we set ignoreCorruptFiles to true, then when some executors are preempted the driver will stop them, which can cause a Filesystem closed exception. Because ignoreCorruptFiles is enabled, the affected tasks finish immediately and return their results to the driver, and the driver will not reschedule a task once it has received a result, which can cause data loss. Since these files are not actually corrupt, Spark should handle this scenario.
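To illustrate the idea, here is a minimal sketch of the proposed check (not the actual FileScanRDD code; the helper and its signature are made up for illustration):

```scala
import java.io.IOException

// SPARK-39389: "Filesystem closed" is raised by Hadoop's DFSClient after the client
// has been shut down (e.g. when a preempted executor is stopped). It does not mean
// the file is corrupt, so it should not be swallowed by ignoreCorruptFiles.
def readIgnoringCorruptFiles[T](readNext: () => T, ignoreCorruptFiles: Boolean): Option[T] = {
  try {
    Some(readNext())
  } catch {
    case e: IOException if ignoreCorruptFiles &&
        Option(e.getMessage).exists(_.contains("Filesystem closed")) =>
      throw e // rethrow so the task fails and can be rescheduled
    case _: IOException if ignoreCorruptFiles =>
      None // treated as a corrupt file: skip the rest of this file
  }
}
```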

Does this PR introduce any user-facing change?

No

How was this patch tested?

boneanxs · Jun 06 '22 10:06

Can one of the admins verify this patch?

AmplabJenkins · Jun 06 '22 16:06

Does checking for filesystem closed exceptions completely fix this issue or are we vulnerable to race conditions? Skimming through the Hadoop DFSClient code, it looks like it only throws Filesystem closed IOExceptions from a checkOpen() call, but what happens if the FS is closed while a thread has proceeded past checkOpen() and is in the middle of some other operation? In that case we might get a different IOException but would presumably want to treat it the same way (because it is a side-effect of another task being interrupted rather than a side-effect of a corrupt file).

It seems like the core problem is that we have a really overly-broad catch block which catches corrupt files but also catches many other potential sources of exceptions. For example, let's say that we get an IOException caused by a transient network connectivity issue to external storage: this doesn't meet the intuitive definition of "corrupt file" but would still get caught in the dragnet of the current exception handler.

The current code seems biased towards identifying "false positive" instances of corruption (which can lead to correctness issues). If instead we wanted to bias towards false negatives (i.e. mis-identifying true corruption as a transient crash, therefore failing a user's job) then maybe we could have the code in readFunc wrap and re-throw exceptions in some sort of CorruptFileException wrapper and then modify the catch here to only ignore that new exception. This would require changes in a number of data sources, though. The FileScanRDD code might simply lack the necessary information to be able to identify true corruption cases, so pushing part of that decision one layer lower might make sense.
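To make that concrete, a rough sketch of what I mean (CorruptFileException and the helper names here are hypothetical, not existing Spark APIs):

```scala
import java.io.IOException

// Hypothetical wrapper type; not an existing Spark class.
class CorruptFileException(path: String, cause: Throwable)
  extends IOException(s"Corrupt file detected: $path", cause)

// Data-source side: only wrap failures that positively indicate corruption
// (e.g. a checksum mismatch); let every other exception propagate unchanged.
def checkedRead(path: String, read: () => Array[Byte],
    checksumOk: Array[Byte] => Boolean): Array[Byte] = {
  val bytes = read()
  if (!checksumOk(bytes)) throw new CorruptFileException(path, null)
  bytes
}

// Scan side: ignoreCorruptFiles only swallows the explicit wrapper, so transient
// IO errors (network blips, "Filesystem closed") still fail the task.
def handleReadError(e: Throwable, ignoreCorruptFiles: Boolean): Unit = e match {
  case _: CorruptFileException if ignoreCorruptFiles =>
    () // log a warning and skip the rest of the file
  case other =>
    throw other
}
```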

I think that could be a breaking change for certain users, though, so we'd need to treat it as a user-facing change and document it appropriately (and we might need to add escape-hatch flags in case users need to revert to the old (arguably buggy) behavior).

I'm not sure what the right course of action is; I just wanted to flag that there's a potentially broader issue here.

JoshRosen · Jun 07 '22 02:06

+1 to @JoshRosen's point. We also hit this problem in our ETL scenario: when a read from the filesystem fails abnormally (for example with a read timeout that would succeed on retry) rather than because the file is corrupt, it causes data loss. We eventually had to give up using this conf.

wayneguow · Jun 08 '22 09:06

Thanks for the reviews. I think @JoshRosen's approach is more appropriate; we should change the implementation to bias towards "false negatives", even though it needs more effort to implement and is a breaking change for users.

Should we implement that now, or should this PR just check whether the exception comes from DFSClient.checkOpen to make the code more robust?

cc: @mridulm @HyukjinKwon @dongjoon-hyun

boneanxs · Jun 24 '22 10:06

Maybe we can tackle the issues separately?

In the long-term I think we should reconsider the overall approach for detecting corrupt files and maybe move towards a "false negatives" design. This would be a breaking change and might require datasource changes, so I think it deserves a separate and broader discussion (maybe on a separate JIRA or on the dev mailing list).

In the short-term, I think it's okay to make an incremental improvement to the current design and fix the specific issue targeted by this PR. Merging the slightly-more-robust-via-checking-exception-source version of the change you proposed here delivers real value to users and solves a real problem even if it doesn't completely solve the larger pre-existing design problem in this feature. Could you update this PR to include the more robust method of error checking and ping us for a final look? Please add a code comment near the modified code which references SPARK-39389 and explains the motivation for the change.
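Something along these lines, as a rough sketch (the helper is illustrative, and the checkOpen stack-frame check is an assumption about where Hadoop's DFSClient raises this error):

```scala
import java.io.IOException

// SPARK-39389: a "Filesystem closed" IOException coming from DFSClient.checkOpen means
// the client was shut down (e.g. the executor was stopped after preemption), so the
// task should fail and be rescheduled instead of the file being skipped as corrupt.
def isFileSystemClosedError(e: IOException): Boolean = {
  val messageMatches = Option(e.getMessage).exists(_.contains("Filesystem closed"))
  val thrownFromCheckOpen = e.getStackTrace.exists { frame =>
    frame.getClassName.endsWith("DFSClient") && frame.getMethodName == "checkOpen"
  }
  messageMatches && thrownFromCheckOpen
}
```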

JoshRosen · Jul 01 '22 01:07

ping @JoshRosen, I've fixed the error handling code, can you take a look?

boneanxs · Jul 11 '22 06:07

Gentle ping @JoshRosen @mridulm @HyukjinKwon @wayneguow ...

boneanxs · Jul 18 '22 02:07

The test failure doesn't seem to be related to this PR.

boneanxs · Jul 25 '22 08:07

IMO, it would be better if users could configure which exceptions ignoreCorruptFiles is allowed to ignore.
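For example, something like this (a purely illustrative sketch; the config name and helper don't exist in Spark today):

```scala
// Walk the cause chain and only ignore exception classes the user explicitly listed,
// e.g. via a (hypothetical) conf like spark.sql.files.ignoreCorruptFiles.exceptions.
def isIgnorable(e: Throwable, ignorableClassNames: Set[String]): Boolean =
  Iterator.iterate(e)(_.getCause)
    .takeWhile(_ != null)
    .exists(t => ignorableClassNames.contains(t.getClass.getName))

// Example: only treat EOF/decoding errors as corrupt files, never a generic IOException.
val ignorable = Set("java.io.EOFException", "org.apache.parquet.io.ParquetDecodingException")
```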

wayneguow · Jul 28 '22 14:07

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] · Nov 09 '22 00:11