
DELETE Statement Deleting Another Record

[Open] Amar1404 opened this issue 1 year ago • 2 comments

Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at [email protected].

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

I have duplicate keys in a Hudi table due to insert statements. When I tried to delete one of the duplicates using an additional filter, both rows with that key were deleted.


To Reproduce

Steps to reproduce the behavior:

  1. Create a non-partitioned table and insert two records with the same record key (see the sketch after this list).
  2. Try to delete only one of the two rows by filtering on the record key and _hoodie_commit_seqno.
  3. Check the table: both records for that key have been deleted.
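
A minimal Spark SQL sketch of these steps, assuming a hypothetical non-partitioned table dup_key_repro with record key id (table and column names are illustrative, not from the original report):

-- Hypothetical non-partitioned table; bulk_insert allows duplicate keys to be written
CREATE TABLE dup_key_repro (
    id STRING,
    ts BIGINT,
    val STRING
) USING HUDI
OPTIONS(
  'hoodie.datasource.write.recordkey.field'='id',
  'hoodie.datasource.write.precombine.field'='ts',
  'hoodie.datasource.write.operation'='bulk_insert'
);

-- Two inserts with the same record key produce two rows for 'k1'
INSERT INTO dup_key_repro VALUES ('k1', 1, 'row-A');
INSERT INTO dup_key_repro VALUES ('k1', 2, 'row-B');

-- The seqno filter is meant to target only one of the duplicates,
-- but on 0.12.3 both rows with key 'k1' end up deleted
DELETE FROM dup_key_repro
WHERE id = 'k1' AND _hoodie_commit_seqno = '<seq no of row-A>';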

Expected behavior

The DELETE statement should delete only the single row matched by the filter.

Environment Description

  • Hudi version : 0.12.3

  • Spark version : 3.3

  • Hive version : 3

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : s3

  • Running on Docker? (yes/no) : no


Amar1404 · May 14 '24 04:05

@Amar1404 Can you please try 0.14.1? This was fixed there. I also ran the code below to demonstrate:

DROP TABLE issue_11212;
set hoodie.spark.sql.insert.into.operation=bulk_insert;
CREATE TABLE issue_11212 (
    ts BIGINT,
    uuid STRING,
    rider STRING,
    driver STRING,
    fare DOUBLE,
    city STRING
) USING HUDI
OPTIONS(
  'hoodie.datasource.write.recordkey.field'='uuid',
  'hoodie.datasource.write.precombine.field'='ts',
  'hoodie.datasource.write.operation'='bulk_insert'
);

INSERT INTO issue_11212
VALUES
(1695159649087,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco');

INSERT INTO issue_11212
VALUES
(1695159649087,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-C','driver-L',19.10,'san_francisco');

select * from issue_11212 where uuid = '334e26e9-8355-45cc-97c6-c31daf0df330';

SELECT * FROM issue_11212 WHERE uuid = '334e26e9-8355-45cc-97c6-c31daf0df330' and _hoodie_commit_seqno = '<seq no>';

DELETE FROM issue_11212 WHERE uuid = '334e26e9-8355-45cc-97c6-c31daf0df330' and _hoodie_commit_seqno = '<seq no>';

select * from issue_11212 where uuid = '334e26e9-8355-45cc-97c6-c31daf0df330';

Can you please check the above and let us know?

ad1happy2go · May 14 '24 06:05

@ad1happy2go - Is there any other way to do it on Hudi 0.12.3? For example, I am trying to set the config hoodie.combine.before.delete to false. Is there any other config that would help?
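
For illustration, setting that config at the Spark SQL session level would look like the sketch below; it is not confirmed that 0.12.3 honors it for this case, and as the next reply explains, deletes in 0.12 still match on the record key alone:

-- Illustrative only: session-level setting in the Spark SQL shell
set hoodie.combine.before.delete=false;

-- Same delete as before; on 0.12.3 this still matches on the record key,
-- so both duplicates are removed regardless of the seqno filter
DELETE FROM issue_11212
WHERE uuid = '334e26e9-8355-45cc-97c6-c31daf0df330'
  AND _hoodie_commit_seqno = '<seq no>';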

Amar1404 · May 14 '24 08:05

@ad1happy2go - Do you know any other way to delete a duplicated record from the Hudi table without rewriting the whole table?

Amar1404 · May 14 '24 08:05

@Amar1404 With 0.12 deletes always match records on the record key alone; that is why both of those records get removed. One workaround is to identify the duplicated records in the table, delete them by key, and then re-insert the row you want to keep.
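
A sketch of that delete-and-reinsert workaround against the example table above, using a hypothetical staging table issue_11212_keep and the same placeholder seq no (adjust names and filters to your table):

-- 1. Materialize the version of the row you want to keep before deleting
--    (plain CTAS so the data is copied out of the Hudi table first)
CREATE TABLE issue_11212_keep USING parquet AS
SELECT ts, uuid, rider, driver, fare, city
FROM issue_11212
WHERE uuid = '334e26e9-8355-45cc-97c6-c31daf0df330'
  AND _hoodie_commit_seqno = '<seq no of the row to keep>';

-- 2. Delete by record key; on 0.12 this removes every copy of the key
DELETE FROM issue_11212
WHERE uuid = '334e26e9-8355-45cc-97c6-c31daf0df330';

-- 3. Re-insert the single surviving copy and drop the staging table
INSERT INTO issue_11212 SELECT * FROM issue_11212_keep;
DROP TABLE issue_11212_keep;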

ad1happy2go · May 15 '24 08:05

@Amar1404 Did the approach work? Do you need any other help here?

ad1happy2go · May 16 '24 09:05

@ad1happy2go - That approach worked, thanks.

Amar1404 · May 16 '24 11:05