DELETE Statement Deleting Another Record
Describe the problem you faced
I have duplicate keys in a Hudi table due to the insert statement. When I tried to delete the key using an additional filter so that only one of the rows would match, both rows for the key were deleted.
To Reproduce
Steps to reproduce the behavior (a short sketch of these steps follows the list):
- Create a non-partitioned table and insert two records with the same record key.
- Try to delete only one of the two rows by filtering on the key and _hoodie_commit_seqno.
- Check the table: both records for that key have been deleted.
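A minimal sketch of these steps in Spark SQL, assuming a hypothetical non-partitioned table dup_demo; the table name, columns, and values are illustrative rather than taken from the report, and producing the duplicate in the first place needs an insert configuration that does not deduplicate (as in the maintainer's bulk_insert script further below):

CREATE TABLE dup_demo (
  id STRING,
  val STRING,
  ts BIGINT
) USING HUDI
OPTIONS(
  'hoodie.datasource.write.recordkey.field'='id',
  'hoodie.datasource.write.precombine.field'='ts',
  'hoodie.datasource.write.operation'='bulk_insert'
);
-- Two inserts with the same record key produce duplicate rows.
INSERT INTO dup_demo VALUES ('k1', 'first', 1);
INSERT INTO dup_demo VALUES ('k1', 'second', 2);
-- Attempt to delete only one of the duplicates via the commit sequence number.
DELETE FROM dup_demo WHERE id = 'k1' AND _hoodie_commit_seqno = '<seq no>';
-- On 0.12.3 this removes both rows for key 'k1'.
SELECT * FROM dup_demo WHERE id = 'k1';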
Expected behavior
The DELETE statement should remove only the row that matches the filter.
Environment Description
- Hudi version : 0.12.3
- Spark version : 3.3
- Hive version : 3
- Hadoop version :
- Storage (HDFS/S3/GCS..) : s3
- Running on Docker? (yes/no) : no
@Amar1404 Can you please try 0.14.1? This was fixed there. I also tried the code below to demonstrate:
DROP TABLE issue_11212;
set hoodie.spark.sql.insert.into.operation=bulk_insert;
CREATE TABLE issue_11212 (
ts BIGINT,
uuid STRING,
rider STRING,
driver STRING,
fare DOUBLE,
city STRING
) USING HUDI
OPTIONS(
'hoodie.datasource.write.recordkey.field'='uuid',
'hoodie.datasource.write.precombine.field'='ts',
'hoodie.datasource.write.operation'='bulk_insert'
);
INSERT INTO issue_11212
VALUES
(1695159649087,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco');
INSERT INTO issue_11212
VALUES
(1695159649087,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-C','driver-L',19.10,'san_francisco');
select * from issue_11212 where uuid = '334e26e9-8355-45cc-97c6-c31daf0df330';
SELECT * FROM issue_11212 WHERE uuid = '334e26e9-8355-45cc-97c6-c31daf0df330' and _hoodie_commit_seqno = '<seq no>';
DELETE FROM issue_11212 WHERE uuid = '334e26e9-8355-45cc-97c6-c31daf0df330' and _hoodie_commit_seqno = '<seq no>';
select * from issue_11212 where uuid = '334e26e9-8355-45cc-97c6-c31daf0df330';
Can you please check the above and let us know?
@ad1happy2go - Is there any other way to do this on Hudi 0.12.3? I am trying the config hoodie.combine.before.delete, setting it to false. Is there any other config that would help?
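For reference, that attempt would look roughly like this in Spark SQL. This is only a sketch of the reporter's attempted workaround; the reply below explains that 0.12 deletes by record key, so this setting may not help:

set hoodie.combine.before.delete=false;
DELETE FROM issue_11212 WHERE uuid = '334e26e9-8355-45cc-97c6-c31daf0df330' and _hoodie_commit_seqno = '<seq no>';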
@ad1happy2go - Do you know any other way to delete duplicated records from the Hudi table without rewriting the whole table?
@Amar1404 With 0.12 deletes are always based on the record key. That is why both of those records are getting removed. One way is to identify the duplicated records in the table, delete them by key, and then re-insert the rows you want to keep (see the sketch below).
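A sketch of that approach in Spark SQL, reusing the demo table above. The staging table name, the USING PARQUET format, and the rider filter used to pick the surviving copy are illustrative assumptions, not taken from the thread:

-- 1. Find record keys that appear more than once.
SELECT uuid, COUNT(*) AS cnt FROM issue_11212 GROUP BY uuid HAVING COUNT(*) > 1;

-- 2. Stage the copy of the duplicated key that should survive.
CREATE TABLE issue_11212_keep USING PARQUET AS
SELECT ts, uuid, rider, driver, fare, city
FROM issue_11212
WHERE uuid = '334e26e9-8355-45cc-97c6-c31daf0df330'
  AND rider = 'rider-C';  -- illustrative filter to pick the surviving copy

-- 3. Delete by record key; on 0.12 this removes every copy of the key.
DELETE FROM issue_11212 WHERE uuid = '334e26e9-8355-45cc-97c6-c31daf0df330';

-- 4. Re-insert the single surviving row, then drop the staging table.
INSERT INTO issue_11212 SELECT ts, uuid, rider, driver, fare, city FROM issue_11212_keep;
DROP TABLE issue_11212_keep;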
@Amar1404 Did the approach work? Do you need any other help here?
@ad1happy2go - that approach worked thanks.