vdk-oracle: optimize batching of payload rows with different keysets
Overview
Prerequisites
- Benchmark values: https://github.com/vmware/versatile-data-kit/issues/2933
- Use case of payload objects with different key sets: https://github.com/vmware/versatile-data-kit/blob/main/projects/vdk-plugins/vdk-oracle/tests/jobs/oracle-ingest-job-different-payloads-no-table/10_ingest.py#L6
We still want to be able to batch queries using cursor.executemany(). This is not feasible when the key sets differ, because we substitute values into an insert query with a fixed number of columns, e.g.
```python
f"INSERT INTO {table_name} ({', '.join(columns)}) VALUES ({', '.join([':' + str(i + 1) for i in range(len(columns))])})"
```
We call this for each ingestion payload. This means that the payload should be uniform, e.g. have the same keys and same number of values for every row (object). However, this contradicts our desired use case. We've solved this by further batching the payload by key set and then doing separate cursor.executemany() calls for each key set.
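The batching-by-key-set workaround can be sketched roughly as follows. This is an illustrative sketch, not the actual vdk-oracle implementation; the function name and payload are made up:

```python
from collections import defaultdict


def batch_by_keyset(payload):
    """Group rows by their set of keys, so each group is uniform
    and can be sent to a separate cursor.executemany() call."""
    batches = defaultdict(list)
    for row in payload:
        # A frozenset is built for every row: O(columns) per row.
        batches[frozenset(row.keys())].append(row)
    return batches


payload = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b", "extra": "x"},
    {"id": 3, "name": "c"},
]

batches = batch_by_keyset(payload)
# Two distinct key sets -> two separate executemany() calls are needed.
assert len(batches) == 2
```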
There are a few drawbacks to this approach.
- We create a `frozenset` for each row. Depending on the number of columns, this could be a problem, because constructing it is an O(n) operation.
- Frozen sets are hashable, but depending on the hash function, there is probably a better option to use as a key. This should be researched further.
- We have to recreate the key-value mappings so that order is preserved across the batch. Each batch contains rows that are dicts, and we have to convert them to lists in order to execute the query. The columns list was converted earlier from a key set, where order is not guaranteed, so each row must follow the order of the columns list; otherwise the queries fail. This requires visiting every element of every row and placing it in a list at the correct position.
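The reordering step described in the last drawback amounts to an extra O(columns) pass per row (illustrative values, not project code):

```python
rows = [{"name": "a", "id": 1}, {"id": 2, "name": "b"}]
columns = ["id", "name"]  # derived earlier from a key set; its order is arbitrary

# Every row is flattened into a list that follows the columns order,
# touching each value once per row.
ordered = [[row[col] for col in columns] for row in rows]
assert ordered == [[1, "a"], [2, "b"]]
```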
We might be constrained to linear time, because batching requires visiting every row at least once. Making the data uniform, however, may not require inspecting every value of every row, so there is room for optimization.
Proposed solution
A good alternative approach might be to compute the union of all key sets in the payload (i.e. every column that appears in any row). Then, for each row, missing keys are set to NULL, and a single cursor.executemany() call covers the whole payload.
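A minimal sketch of this idea, under the assumption that the driver binds Python None as SQL NULL (python-oracledb does); the helper name is hypothetical:

```python
def make_uniform(payload):
    """Pad every row to the union of all key sets, so one
    cursor.executemany() call can handle the entire payload."""
    # Union of all key sets = every column that appears anywhere.
    all_columns = sorted({key for row in payload for key in row})
    # Missing keys become None, which binds as SQL NULL.
    rows = [[row.get(col) for col in all_columns] for row in payload]
    return all_columns, rows


payload = [{"id": 1, "name": "a"}, {"id": 2, "extra": "x"}]
columns, rows = make_uniform(payload)
assert columns == ["extra", "id", "name"]
assert rows == [[None, 1, "a"], ["x", 2, None]]
```

With the data made uniform like this, the existing insert-statement template can be reused unchanged, and only one executemany() round trip is needed per payload.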
Acceptance criteria
- Decide if this is worth optimizing
- Implement optimization
- Measure results