
Optionally forestall duplicate ingestion

salsferrazza opened this issue 6 years ago • 4 comments

If the write disposition for an ingestion run is set to WRITE_APPEND, BigQuery will ingest the same file an arbitrary number of times, yielding duplicate records in the corresponding table that differ only in their bqds_batch_id values (the batch ID is shared by the entire ingestion iteration).

One possible way to avoid this duplication is to warn or fail if the same file is about to be uploaded twice, by searching existing batch records for the same file name as the one incoming from GCS.

Matching on the file name alone could be a relatively unreliable approach, depending on a publisher's specific naming conventions. An alternative that reduces false positives would be to take an MD5 hash (or similar) over the entire inbound file and use that for inbound file validation. While not foolproof, this should preserve publisher flexibility in naming while still statically evaluating and rejecting exact duplicate inbound files.
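As a minimal sketch of the hashing idea (function name and chunk size are illustrative, not part of the toolkit), the MD5 of the full file content can be computed in a streaming fashion so large inbound files don't need to fit in memory:

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 of a file's full content, streamed in 1 MiB chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

Two files with identical content then hash identically regardless of their names, which is exactly the property file-name matching lacks.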

salsferrazza avatar Dec 07 '19 21:12 salsferrazza

I prefer the approach of taking an MD5 hash of the file content. However, both approaches have their merits. We could support both and make it configurable: default to the file hash, but allow a user to override it within the individual file configs. If we take an MD5 hash, we'll also need a place to persist historical hashes. We could keep one at the row level, which would be redundant, or store a dummy file named with the hash in a separate path in the bucket upon successfully processing a file. That way we could check for the existence of a hash/file name with a single storage call.
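A rough sketch of the marker-object idea described above, with an in-memory class standing in for the bucket (the prefix, class, and function names are hypothetical; against real GCS this would be an object-existence check under a dedicated prefix):

```python
# After a file is processed, record its hash as a zero-byte "marker"
# object under a dedicated prefix; before processing, a single existence
# check tells us whether this content was seen before.
PROCESSED_PREFIX = "processed-hashes/"

class FakeBucket:
    """In-memory stand-in for a GCS bucket (illustration only)."""
    def __init__(self):
        self.objects = set()
    def exists(self, name):
        return name in self.objects
    def write(self, name, data=b""):
        self.objects.add(name)

def already_ingested(bucket, md5_hex):
    # One storage call: does the marker object for this hash exist?
    return bucket.exists(PROCESSED_PREFIX + md5_hex)

def mark_ingested(bucket, md5_hex):
    bucket.write(PROCESSED_PREFIX + md5_hex)
```

The marker objects cost almost nothing to store, and the dedup check stays O(1) storage calls per inbound file instead of a table scan.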

mservidio avatar Dec 08 '19 02:12 mservidio

The hash is available as file metadata for each GCS object - perhaps it makes sense to add this to the datashare_batch_id.

    Creation time:          Thu, 19 Apr 2018 18:50:40 GMT
    Update time:            Thu, 19 Apr 2018 18:50:40 GMT
    Storage class:          STANDARD
    Content-Length:         33685826
    Content-Type:           application/zip
    Hash (crc32c):          25iVgg==
    Hash (md5):             sEjyHtS53TO/t+wK9PFtog==
    ETag:                   CPzOweKAx9oCEAE=
    Generation:             1524163840862076
    Metageneration:         1
    ACL:                    [
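One detail worth noting when using the metadata hash: GCS reports `Hash (md5)` as the base64-encoded raw digest, not the usual hex string. A small sketch of converting between the two forms (function names are illustrative):

```python
import base64
import hashlib

def gcs_style_md5_b64(data: bytes) -> str:
    """GCS object metadata reports the MD5 as a base64-encoded raw digest."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")

def b64_md5_to_hex(b64_hash: str) -> str:
    """Convert the base64 metadata form to the conventional hex string."""
    return base64.b64decode(b64_hash).hex()
```

So whichever representation is persisted for dedup checks, inbound metadata hashes should be normalized to it first.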

salsferrazza avatar Oct 03 '20 21:10 salsferrazza

So the hash could be stored in a column separate from datashare_batch_id; then the ingestion function can halt if a COUNT(*) WHERE datashare_hash = '<hash_of_new_file>' returns anything but 0.

salsferrazza avatar Oct 03 '20 21:10 salsferrazza

The problem is that checking a single hash column would require querying all of the data. To minimize the queried set, we could combine the hash predicate with a statement date or similar, depending on the type of data. In that case it may also make sense to cluster the table on the date used.
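A sketch of what that narrowed check could look like (table and column names such as `datashare_hash` and `statement_date` are hypothetical here; the point is that the date predicate lets a table clustered or partitioned on that date prune data before the hash comparison):

```python
def build_dup_check_query(table: str) -> str:
    """Build a parameterized BigQuery dedup-check query string.

    The statement_date predicate narrows the scan so clustering on
    that column can prune blocks before datashare_hash is compared.
    """
    return (
        f"SELECT COUNT(*) AS n FROM `{table}` "
        "WHERE statement_date = @statement_date "
        "AND datashare_hash = @file_hash"
    )
```

The ingestion function would run this as a parameterized query and halt if `n` comes back nonzero.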

mservidio avatar Oct 03 '20 21:10 mservidio