
[SUPPORT] Data duplication caused by a flaw in deleting invalid files before commit

Open beyond1920 opened this issue 1 year ago • 9 comments

Dear community,

One of our users reported that after today's run of their daily job, which writes to a Hudi COW table, the downstream reading jobs found many duplicate records. The daily job has been running in production for a long time, and this is the first time it has produced a wrong result. The user provided one duplicated record as an example to help debug. That record appeared in 3 base files, which belong to different file groups.

I looked at today's writer job: the Spark application finished successfully. In the driver log, two of those three files were marked as invalid files to delete, and only one file was valid. In the task log of the clean stage, the same two files were also marked for deletion, and there was no exception in the task either. The two files already existed on HDFS before the clean stage began, and they still existed after the clean stage finished.

Finally, I found that the root cause was a corner case in HDFS: fs.delete does not throw any exception, it only returns false when HDFS fails to delete the file. I also checked the fs.delete API, and its definition is reasonable.

I think we should check the return value of fs.delete in HoodieTable#deleteInvalidFilesByPartitions to avoid wrong results. Besides, we should review every other place that calls fs.delete. Any suggestions?
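To make the proposal concrete, here is a minimal sketch of checking the return value when deleting a batch of invalid files. It is not Hudi or Hadoop code: `MiniFs` and `InvalidFileCleaner` are hypothetical names, and `MiniFs.delete` only imitates a delete call that returns false instead of throwing.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a file system API; not the Hadoop FileSystem class. */
interface MiniFs {
  /** Returns false, without throwing, when the file could not be deleted. */
  boolean delete(String path) throws IOException;
}

/** Hypothetical helper; not the actual HoodieTable implementation. */
final class InvalidFileCleaner {
  private InvalidFileCleaner() {}

  /** Tries to delete every path, then fails loudly if any delete reported false. */
  static void deleteAllOrThrow(MiniFs fs, List<String> invalidFiles) throws IOException {
    List<String> failed = new ArrayList<>();
    for (String path : invalidFiles) {
      if (!fs.delete(path)) {
        // Remember the failure instead of silently ignoring the false return.
        failed.add(path);
      }
    }
    if (!failed.isEmpty()) {
      throw new IOException("Failed to delete invalid files: " + failed);
    }
  }
}
```

With a check like this, the writer job would fail before commit instead of finishing successfully and leaving the invalid base files visible to readers.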

beyond1920 avatar Jun 08 '24 16:06 beyond1920

You are right. We already have a recent fix for this: https://github.com/apache/hudi/pull/11343

danny0405 avatar Jun 09 '24 01:06 danny0405

@danny0405 Thanks for your attention. I checked #11343, but it does not fix this issue. This issue needs a fix in HoodieTable#deleteInvalidFilesByPartitions, so that a failure to delete the invalid files does not go unnoticed, whereas #11343 targets the clean service.

beyond1920 avatar Jun 09 '24 02:06 beyond1920

Hmm, would you mind putting up a fix for it?

danny0405 avatar Jun 10 '24 00:06 danny0405

Sure, I will put up a fix soon.

beyond1920 avatar Jun 11 '24 01:06 beyond1920

cc @yihua @nsivabalan @codope @xushiyan

ad1happy2go avatar Jun 13 '24 11:06 ad1happy2go

Thanks @beyond1920. Please put up a patch; I would like to review it as well.

nsivabalan avatar Jun 13 '24 14:06 nsivabalan

Thanks @nsivabalan. I think the underlying file system should guarantee that fs.delete throws an exception instead of returning false when it fails to delete a file. But it might take a long time to discuss this and get every file system type to agree on the rule. Should we introduce a new delete API in HoodieStorage that enforces this rule, or change the existing HoodieStorage#deleteDirectory and HoodieStorage#deleteFile APIs, to avoid unexpected behavior whenever fs.delete is called?

Or should we simply fix the current bug?
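For illustration only, the "new delete API" option could look roughly like the wrapper below. `LenientStorage` and `StrictStorage` are hypothetical names that merely imitate a boolean-returning storage interface; they are not the real HoodieStorage API.

```java
import java.io.IOException;

/** Hypothetical minimal storage interface with a boolean-returning delete. */
interface LenientStorage {
  boolean deleteFile(String path) throws IOException;
}

/** Hypothetical wrapper that turns a false return into an exception for every caller. */
final class StrictStorage {
  private final LenientStorage delegate;

  StrictStorage(LenientStorage delegate) {
    this.delegate = delegate;
  }

  /** Deletes the file or throws, so no call site can overlook a silent false. */
  void deleteFileOrThrow(String path) throws IOException {
    if (!delegate.deleteFile(path)) {
      throw new IOException("Delete returned false for " + path);
    }
  }
}
```

The design point is that the check lives in one place, so individual call sites no longer need to remember to inspect the return value.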

beyond1920 avatar Jun 14 '24 06:06 beyond1920

Should we introduce a new delete API in HoodieStorage that enforces this rule, or change the existing HoodieStorage#deleteDirectory and HoodieStorage#deleteFile APIs, to avoid unexpected behavior whenever fs.delete is called?

+1 for this way.

danny0405 avatar Jun 17 '24 00:06 danny0405

Is the main reason that different file system schemes treat file-not-found differently during fs.delete()? And are you proposing that HoodieStorage#deleteFile unify that?
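If that is the concern, one possible way to unify the behavior is to treat "false, and the file is gone" as success and "false, and the file is still there" as failure. The sketch below assumes this semantics; `ProbeFs` and `UnifiedDelete` are hypothetical names, not Hudi or Hadoop classes.

```java
import java.io.IOException;

/** Hypothetical file system view with only the two calls this sketch needs. */
interface ProbeFs {
  boolean delete(String path) throws IOException;

  boolean exists(String path) throws IOException;
}

/** Hypothetical helper that gives delete the same outcome on every scheme. */
final class UnifiedDelete {
  private UnifiedDelete() {}

  /** Returns normally if the path is gone afterwards; throws if it still exists. */
  static void ensureDeleted(ProbeFs fs, String path) throws IOException {
    if (fs.delete(path)) {
      return; // the file system reported a successful delete
    }
    if (fs.exists(path)) {
      // false and still present: a real failure, like the HDFS case in this issue
      throw new IOException("File still exists after delete: " + path);
    }
    // false and absent: the file was already missing, which is treated as success
  }
}
```

Under this assumption a missing file never fails the job, while a file that survives the delete always does.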

nsivabalan avatar Jun 18 '24 13:06 nsivabalan

This issue is now resolved and closed, following the merge of PR #11445.

rangareddy avatar Oct 30 '25 09:10 rangareddy