[SUPPORT] High number of duplicated records for certain commits

tped17 opened this issue on Sep 23, 2024 • 11 comments

Tips before filing an issue

  • Have you gone through our FAQs?

    • this link gives me a 404
  • Join the mailing list to engage in conversations and get faster support at [email protected].

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

We noticed an issue with two of our datasets where we have multiple rows with the same _hoodie_record_key, _hoodie_commit_time, and _hoodie_commit_seqno within the same file. Unfortunately, all of the problematic commits have already been archived. Below is an example of the duplicate records (I've redacted the exact record key, but it is the same for every row); each sequence number is repeated 64 times. A sketch of the kind of query that produces these counts follows the table.

+--------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+-------------------------------------------------------------------------------+----------------------------+-----+
|_hoodie_record_key|_hoodie_commit_time|_hoodie_file_name                                                              |_hoodie_commit_seqno        |count|
+--------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+-------------------------------------------------------------------------------+----------------------------+-----+
|XXXX  |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_77_2360995|64   |
|XXXX  |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_77_2360996|64   |
|XXXX  |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_78_2360993|64   |
|XXXX  |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_77_2360994|64   |
|XXXX  |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_78_2360994|64   |
|XXXX  |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_78_2360995|64   |
|XXXX  |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_78_2360996|64   |
|XXXX  |20240515220256697  |7f1d1a18-b025-4c25-8ba5-761667e1b6d4-0_1-7816-2840501_20240912033629311.parquet|20240515220256697_77_2360993|64   |
+--------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+-------------------------------------------------------------------------------+----------------------------+-----+
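For reference, counts like the ones above can be produced in spark-shell with a query along these lines (the table path is a placeholder):

import org.apache.spark.sql.functions.col

// Placeholder base path; read the table and count rows per key/commit/seqno.
val df = spark.read.format("hudi").load("s3://my-bucket/my_table_name")

df.groupBy("_hoodie_record_key", "_hoodie_commit_time", "_hoodie_file_name", "_hoodie_commit_seqno")
  .count()
  .filter(col("count") > 1)
  .orderBy(col("count").desc)
  .show(false)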

Here's the config we use:

hoodie.parquet.small.file.limit -> 104857600
hoodie.datasource.write.precombine.field -> eventVersion
hoodie.datasource.write.payload.class -> org.apache.hudi.common.model.EmptyHoodieRecordPayload
hoodie.bloom.index.filter.dynamic.max.entries -> 1106137
hoodie.cleaner.fileversions.retained -> 2
hoodie.parquet.max.file.size -> 134217728
hoodie.cleaner.parallelism -> 1500
hoodie.write.lock.client.num_retries -> 10
hoodie.delete.shuffle.parallelism -> 1500
hoodie.bloom.index.prune.by.ranges -> true
hoodie.metadata.enable -> false
hoodie.clean.automatic -> false
hoodie.datasource.write.operation -> upsert
hoodie.write.lock.wait_time_ms -> 600000
hoodie.metrics.reporter.type -> CLOUDWATCH
hoodie.datasource.write.recordkey.field -> timestamp,eventId,subType,trackedItem
hoodie.table.name -> my_table_name
hoodie.datasource.write.table.type -> COPY_ON_WRITE
hoodie.datasource.write.hive_style_partitioning -> true
hoodie.datasource.write.partitions.to.delete -> 
hoodie.write.lock.dynamodb.partition_key -> my_table_name_key
hoodie.cleaner.policy -> KEEP_LATEST_FILE_VERSIONS
hoodie.write.markers.type -> DIRECT
hoodie.metrics.on -> false
hoodie.datasource.write.reconcile.schema -> true
hoodie.datasource.write.keygenerator.class -> org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.cleaner.policy.failed.writes -> LAZY
hoodie.upsert.shuffle.parallelism -> 1500
hoodie.write.lock.dynamodb.table -> HoodieLockTable
hoodie.write.lock.provider -> org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
hoodie.datasource.write.partitionpath.field -> region,year,month,day,hour
hoodie.bloom.index.filter.type -> DYNAMIC_V0
hoodie.write.lock.wait_time_ms_between_retry -> 30000
hoodie.write.concurrency.mode -> optimistic_concurrency_control
hoodie.write.lock.dynamodb.region -> us-east-1
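These options are passed on a standard Spark datasource write; roughly, the upsert looks like the sketch below (the path and dataframe name are placeholders, and only a subset of the options above is shown):

// Sketch of the upsert as a Spark datasource write; inputDf and the
// S3 path are placeholders, and only a subset of the options above is shown.
inputDf.write
  .format("hudi")
  .option("hoodie.table.name", "my_table_name")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
  .option("hoodie.datasource.write.recordkey.field", "timestamp,eventId,subType,trackedItem")
  .option("hoodie.datasource.write.partitionpath.field", "region,year,month,day,hour")
  .option("hoodie.datasource.write.precombine.field", "eventVersion")
  .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator")
  .option("hoodie.datasource.write.hive_style_partitioning", "true")
  .mode("append")
  .save("s3://my-bucket/my_table_name")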

To Reproduce

We have not been able to reproduce this intentionally. This only happens occasionally in our dataset and it does not seem to follow any pattern that we've been able to discern.

Expected behavior

It is my understanding that we shouldn't be seeing a large number of duplicates per sequence number.

Environment Description

  • Hudi version : 0.11.1

  • Spark version : 3.2.1

  • Hive version : 3.1.3

  • Hadoop version : 3.2.1

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no

Additional context

For the datasets in which we found the issue, we run cleaning and clustering manually, and I noticed that our lock keys were incorrectly configured on the cleaning/clustering jobs, so it is possible that cleaning or clustering was running at the same time as data ingestion or deletion. Please let me know if you need any more info, thank you!
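For reference, these are the lock-related options from the config above that the ingestion, deletion, cleaning, and clustering jobs would all have to share for optimistic concurrency control to actually serialize them (a sketch; values mirror our config, and in our case the lock key was misconfigured on the cleaning/clustering jobs):

// Lock-related options that every writer against this table (ingestion,
// deletion, manual cleaning, manual clustering) must share so that OCC
// actually serializes conflicting operations; values mirror the config above.
val lockOpts = Map(
  "hoodie.write.concurrency.mode" -> "optimistic_concurrency_control",
  "hoodie.cleaner.policy.failed.writes" -> "LAZY",
  "hoodie.write.lock.provider" -> "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
  "hoodie.write.lock.dynamodb.table" -> "HoodieLockTable",
  "hoodie.write.lock.dynamodb.partition_key" -> "my_table_name_key", // must be identical for all jobs on this table
  "hoodie.write.lock.dynamodb.region" -> "us-east-1"
)

// Each job then merges these into its own write options, e.g. writer.options(lockOpts).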

tped17 avatar Sep 23 '24 15:09 tped17

Do these duplicates come from different partitions?

danny0405 avatar Sep 24 '24 01:09 danny0405

No, these are all in the same partition

tped17 avatar Sep 24 '24 01:09 tped17

Do you have any bulk_insert operations on the table?

danny0405 avatar Sep 24 '24 05:09 danny0405

Are the _hoodie_file_name and _hoodie_commit_seqno values real production data? The partitionId token in the file name looks different from the one in the seqno.

KnightChess avatar Sep 24 '24 13:09 KnightChess

We do not use any bulk_insert operations; everything should be an upsert. Yes, these are actual file names and sequence numbers.

tped17 avatar Sep 24 '24 15:09 tped17

@tped17 Is it possible to zip the .hoodie directory (without the metadata partitions) and attach it to the ticket? If not, can you provide the Hudi timeline?
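In case it helps, the active timeline can be listed from spark-shell with something like the sketch below (the base path is a placeholder):

import org.apache.hadoop.fs.Path

// Placeholder base path; list the instant files in the active timeline.
// Sub-directories such as the metadata table (.hoodie/metadata) are skipped.
val basePath = new Path("s3://my-bucket/my_table_name")
val fs = basePath.getFileSystem(spark.sparkContext.hadoopConfiguration)

fs.listStatus(new Path(basePath, ".hoodie"))
  .filter(_.isFile)
  .map(_.getPath.getName)
  .sorted
  .foreach(println)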

ad1happy2go avatar Sep 26 '24 12:09 ad1happy2go

I'm talking to my team and taking a look at the data to make sure that it's safe to share in this venue; I'll provide another update tomorrow at the latest.

tped17 avatar Sep 30 '24 22:09 tped17

I noticed the value of the precombine field (referred to as eventVersion above) is the same for all records. Could this cause duplication? If not, which record is kept during the upsert (is it the one being upserted, or just an arbitrary one)?

igorgelf avatar Oct 02 '24 15:10 igorgelf

@igorgelf It should not be the cause of the dups, as Hudi will pick only one of them.
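That said, if you want to be defensive about ties on the precombine field, one option is to dedupe the incoming batch yourself before the upsert; a rough sketch (key and precombine columns taken from the config above, inputDf is a placeholder for the batch):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Keep one row per record key, preferring the highest eventVersion;
// key columns match hoodie.datasource.write.recordkey.field above and
// inputDf is a placeholder for the batch being upserted.
val keyCols = Seq("timestamp", "eventId", "subType", "trackedItem").map(col)
val w = Window.partitionBy(keyCols: _*).orderBy(col("eventVersion").desc)

val dedupedDf = inputDf
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")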

ad1happy2go avatar Oct 03 '24 13:10 ad1happy2go

@ad1happy2go are you looking for the timeline data just for bad commits? Or would it be instructive to see other commits as well?

tped17 avatar Oct 14 '24 15:10 tped17

@tped17 Yeah, I want to see the commit file from which the dups entered the system. If you don't feel it's safe to share with the open community, we can also have a call to understand more. You can reach out to me (Aditya Goenka) on the OSS Hudi Slack and we can connect.

ad1happy2go avatar Oct 17 '24 09:10 ad1happy2go

Hi @tped17

I hope this issue is now resolved. If you are still encountering problems, please let me know so we can schedule a quick call to troubleshoot. I'll keep this ticket open for one more week and close it if there's no further response. Feel free to reopen it if the issue becomes reproducible again in the future.

rangareddy avatar Oct 30 '25 08:10 rangareddy

Closing this issue because the user doesn't have any follow-up questions.

rangareddy avatar Nov 05 '25 13:11 rangareddy