Provide a way to derive the _id of an indexed document from its fields in templates that write to Elasticsearch
Related Template(s)
BigQueryToElasticsearch
What feature(s) are you requesting?
When this template indexes documents, it auto-generates the _id value of each indexed document. This makes it difficult to update specific indexed documents later. It would be better if a field or function could be provided that extracts or computes a document _id from the fields of the document being indexed (a BigQuery table or query-result row, in the case of this template).
I see in the code for the ElasticsearchIO builder that there is a method withIdFn (ElasticsearchIO.java line 1200), which suggests this is relatively easy to support and just needs to be exposed via template parameters.
Perhaps it is enough to add the withIdFn call to this chain
and add an appropriate option to https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/elasticsearch-common/src/main/java/com/google/cloud/teleport/v2/elasticsearch/options/ElasticsearchWriteOptions.java
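To make the idea concrete, here is a minimal sketch of the kind of id function I have in mind. It is plain Java with no Beam dependency: the row is modelled as a Map, and the class name IdFromFieldSketch, the helper idFromField and the field name order_id are all hypothetical, chosen only for illustration. In the template itself the equivalent logic would presumably be wrapped in whatever function type withIdFn accepts and driven by a new template option; that wiring is an assumption on my part and is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class IdFromFieldSketch {

    // Hypothetical helper: returns a function that reads the document _id
    // from the named field of a row. The field name would come from a
    // template parameter (assumption, no such option exists yet).
    static Function<Map<String, Object>, String> idFromField(String fieldName) {
        return row -> {
            Object value = row.get(fieldName);
            if (value == null) {
                // Fail loudly instead of silently falling back to an auto-generated id.
                throw new IllegalArgumentException("Row has no value for id field: " + fieldName);
            }
            return String.valueOf(value);
        };
    }

    public static void main(String[] args) {
        // Stand-in for one BigQuery row.
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("order_id", 1234);
        row.put("customer", "acme");

        Function<Map<String, Object>, String> idFn = idFromField("order_id");
        System.out.println(idFn.apply(row)); // prints 1234
    }
}
```

A single id field is probably the simplest thing to expose as a template parameter; computing the _id from several fields would need a richer option.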
The auto-generation of the _id causes another problem with these templates: if a write is retried because of an Elasticsearch timeout (Elasticsearch received the initial request and indexed the document, but did not respond fast enough to prevent the timeout and the retry), that document is duplicated in the index, since the _id is auto-generated and differs for each request.
This means, for example, that if you use BigQueryToElasticsearch with a table of 1M unique rows, you may end up with 1.1M indexed documents, of which 100K are duplicates.
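The retry problem can be illustrated with a small simulation. This is only a sketch under stated assumptions: FakeIndex is a hypothetical in-memory stand-in for an Elasticsearch index (no real client is involved), and it assumes that a write without an _id gets a fresh id per request while a write with an explicit _id overwrites the existing document.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class RetryDuplicateSketch {

    // Hypothetical in-memory stand-in for an index: _id -> document source.
    static final class FakeIndex {
        private final Map<String, String> docs = new HashMap<>();

        // Mimics indexing without an _id: every request gets a fresh id.
        void indexWithAutoId(String source) {
            docs.put(UUID.randomUUID().toString(), source);
        }

        // Mimics indexing with an explicit _id: the same id overwrites the document.
        void indexWithId(String id, String source) {
            docs.put(id, source);
        }

        int size() {
            return docs.size();
        }
    }

    // Sends the same document twice: the first attempt is stored but the
    // client times out, so the second attempt is the retry.
    static int countAfterRetry(boolean explicitId) {
        FakeIndex index = new FakeIndex();
        String source = "{\"order_id\": 1234}";
        for (int attempt = 0; attempt < 2; attempt++) {
            if (explicitId) {
                index.indexWithId("1234", source);
            } else {
                index.indexWithAutoId(source);
            }
        }
        return index.size();
    }

    public static void main(String[] args) {
        System.out.println(countAfterRetry(false)); // prints 2: the retry created a duplicate
        System.out.println(countAfterRetry(true));  // prints 1: the retry overwrote the same document
    }
}
```

With an _id derived from the row, a retried write becomes idempotent, which is why exposing an id function would also address the duplicates.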
Any update?
@alexandregiordanelli I have a PR open, but it is waiting for review/approval (I believe by a repo maintainer) before it can be merged.