
[SPARK-37210][CORE][SQL] Allow forced use of staging directory

Open wForget opened this issue 3 years ago • 5 comments

What changes were proposed in this pull request?

Add a forceUseStagingDir config that forces writes to go through a staging dir.

When forceUseStagingDir is true, I set committerOutputPath to the staging dir in InsertIntoHadoopFsRelationCommand. In HadoopMapReduceCommitProtocol.newTaskTempFile, I compute the absolute dir and call newTaskTempFileAbsPath.
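To make the path handling concrete, here is a rough Python model of the idea. It is not the Scala code from this patch: the function names are hypothetical, and the `.spark-staging-<jobId>` directory name is an assumption borrowed from the naming Spark uses for its dynamic-overwrite staging directory.

```python
from pathlib import PurePosixPath
from typing import Optional


def committer_output_path(table_location: str, job_id: str,
                          force_use_staging_dir: bool) -> PurePosixPath:
    """Pick the directory the committer writes under (hypothetical helper)."""
    base = PurePosixPath(table_location)
    if force_use_staging_dir:
        # Assumed naming: one staging directory per job under the table.
        return base / f".spark-staging-{job_id}"
    return base


def final_file_path(table_location: str, partition_dir: Optional[str],
                    filename: str) -> PurePosixPath:
    """Absolute destination of a task file, independent of the staging dir."""
    base = PurePosixPath(table_location)
    if partition_dir:
        return base / partition_dir / filename
    return base / filename
```

The point of the second helper is that, once the committer output path is the staging dir, the final location has to be computed from the table location explicitly, which is roughly what routing through an absolute-path temp file achieves.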

Why are the changes needed?

As discussed in SPARK-37210, errors or data loss may occur under some concurrent write scenarios.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added test case in InsertSuite.

wForget avatar Jul 30 '22 06:07 wForget

Hi @dongjoon-hyun , could you please help me review it?

wForget avatar Jul 30 '22 06:07 wForget

Can one of the admins verify this patch?

AmplabJenkins avatar Jul 31 '22 09:07 AmplabJenkins

Thank you for making a PR, @wForget .

To @viirya and @sunchao . This issue has a reproducible example in the JIRA.

dongjoon-hyun avatar Aug 01 '22 21:08 dongjoon-hyun

Why is this an issue particular to InsertIntoHadoopFsRelationCommand?

InsertIntoHiveTable always uses hive staging dir
https://github.com/apache/spark/blob/b0c831d3408dddfbbf3acacbe8100a9e08b400de/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala#L107

InsertIntoHadoopFsRelationCommand only uses spark staging dir in dynamic overwrite mode, otherwise it uses table_location/_temporary which leads to concurrency conflicts.
https://github.com/apache/spark/blob/b0c831d3408dddfbbf3acacbe8100a9e08b400de/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L171
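The conflict can be illustrated with a toy filesystem model. This is only a sketch under stated assumptions, not Spark or Hadoop code: it assumes that committing a job deletes the whole working root it used, so a shared table_location/_temporary is removed along with any other job's in-progress output. All function names and the job ids are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def write_task_output(work_dir: Path, partition: str, name: str) -> None:
    """Write one in-progress file under a job's working directory."""
    target = work_dir / partition / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("rows")


def commit_job(table: Path, work_dir: Path, cleanup_dir: Path) -> None:
    """Move a job's files into the table, then delete cleanup_dir."""
    if work_dir.exists():
        files = sorted(p for p in work_dir.rglob("*") if p.is_file())
        for src in files:
            dest = table / src.relative_to(work_dir)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dest))
    # Assumption: cleanup removes the whole directory, not just this job's part.
    shutil.rmtree(cleanup_dir, ignore_errors=True)


def run_two_jobs(shared_temporary: bool) -> list:
    """Return the files visible in the table after both jobs commit."""
    with tempfile.TemporaryDirectory() as root:
        table = Path(root) / "table"
        table.mkdir()
        if shared_temporary:
            shared = table / "_temporary"
            work_a, work_b = shared / "job_a", shared / "job_b"
            cleanup_a = cleanup_b = shared
        else:
            work_a = table / ".spark-staging-job_a"
            work_b = table / ".spark-staging-job_b"
            cleanup_a, cleanup_b = work_a, work_b
        # Both jobs write to different partitions before either commits.
        write_task_output(work_a, "p=1", "part-a")
        write_task_output(work_b, "p=2", "part-b")
        commit_job(table, work_a, cleanup_a)  # job A finishes first
        commit_job(table, work_b, cleanup_b)  # job B commits afterwards
        return sorted(p.relative_to(table).as_posix()
                      for p in table.rglob("*") if p.is_file())
```

In this model, `run_two_jobs(shared_temporary=True)` ends with only job A's file in the table, because A's cleanup removed B's pending output, while `run_two_jobs(shared_temporary=False)` keeps both partitions.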

wForget avatar Aug 02 '22 04:08 wForget

The use case looks suspicious to me. Is it a valid one? I'm not sure that InsertIntoHadoopFsRelationCommand guarantees concurrent writing to the same table.

Concurrently writing to different partitions of the same table seems like a reasonable requirement. Are there any blocking issues?

wForget avatar Aug 03 '22 09:08 wForget

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Nov 12 '22 00:11 github-actions[bot]