[SPARK-37210][CORE][SQL] Allow forced use of staging directory
What changes were proposed in this pull request?
Add forceUseStagingDir config to force use of staging dir when writing.
When forceUseStagingDir is set to true, committerOutputPath is set to the staging dir in InsertIntoHadoopFsRelationCommand, and the HadoopMapReduceCommitProtocol.newTaskTempFile method computes the absolute dir and calls newTaskTempFileAbsPath.
Why are the changes needed?
As discussed in SPARK-37210, errors or data loss may occur under some concurrent write scenarios.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added test case in InsertSuite.
Hi @dongjoon-hyun , could you please help me review it?
Can one of the admins verify this patch?
Thank you for making a PR, @wForget .
To @viirya and @sunchao . This issue has a reproducible example in the JIRA.
Why is this an issue particular to InsertIntoHadoopFsRelationCommand?
InsertIntoHiveTable always uses hive staging dir https://github.com/apache/spark/blob/b0c831d3408dddfbbf3acacbe8100a9e08b400de/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L107
InsertIntoHadoopFsRelationCommand only uses the Spark staging dir in dynamic overwrite mode; otherwise it uses table_location/_temporary, which leads to concurrency conflicts.
https://github.com/apache/spark/blob/b0c831d3408dddfbbf3acacbe8100a9e08b400de/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L171
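To illustrate the conflict, here is a rough sketch of the two path layouts. It is a simplified model, not Spark code: the helper names default_job_temp_dir and staging_job_temp_dir are hypothetical, and the layout assumes the usual FileOutputCommitter convention of <output>/_temporary/<appAttemptId> (attempt id 0 in Spark) together with a staging dir named .spark-staging-<jobId>.

```python
import posixpath


def default_job_temp_dir(table_location: str, app_attempt_id: int = 0) -> str:
    # Assumed layout without a staging dir: the committer works directly
    # under the table location, so the path does not depend on the job.
    return posixpath.join(table_location, "_temporary", str(app_attempt_id))


def staging_job_temp_dir(table_location: str, job_id: str, app_attempt_id: int = 0) -> str:
    # Assumed layout when committerOutputPath is the staging dir: the
    # committer's _temporary dir lives under a per-job directory.
    staging_dir = posixpath.join(table_location, ".spark-staging-" + job_id)
    return posixpath.join(staging_dir, "_temporary", str(app_attempt_id))


table = "/warehouse/db/tbl"

# Two concurrent jobs writing to the same table share one temp dir.
job_a_default = default_job_temp_dir(table)
job_b_default = default_job_temp_dir(table)
shared = job_a_default == job_b_default

# With a per-job staging dir, each job gets its own temp dir.
job_a_staging = staging_job_temp_dir(table, "job-a")
job_b_staging = staging_job_temp_dir(table, "job-b")
isolated = job_a_staging != job_b_staging
```

Under these assumptions, both jobs share /warehouse/db/tbl/_temporary/0 in the default case, so whichever job cleans up _temporary first can remove the other job's in-flight output. The staging layout keeps the two jobs apart.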
The use case looks suspicious to me. Is it a valid one? I'm not sure that InsertIntoHadoopFsRelationCommand guarantees concurrent writing to the same table.
It seems a reasonable requirement to concurrently write to different partitions of the same table. Are there any blocking issues?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!