[RFC] Use TensorVector everywhere
Category: Refactoring
Description:
The first commit comes from #3827 - see the description of changes there. I stopped maintaining the chain of commits as it was getting hard to manage, so after the general changes here are reviewed, I can carve out separate commits for more thorough review one by one.
Changes (some of them could be separated into individual PRs):
- A `device_id` parameter was added to the `ShareData` and `SetSample` APIs that take a `shared_ptr` - they already accept `order`, but the device it holds doesn't necessarily indicate the allocation device (see the sketch below). (This can be extracted to a separate PR, together with the other changes that just add some kind of `device_id` parameter.)
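A rough sketch of why the extra parameter is needed; the exact signature below is an assumption for illustration, only the added `device_id` comes from this change:

```cpp
// Sketch only - the real ShareData overload may differ. The point is that a raw
// shared_ptr carries no device information, and the AccessOrder (stream) passed
// alongside it does not have to live on the device that owns the allocation.
void ShareData(std::shared_ptr<void> ptr, size_t bytes,
               const TensorListShape<> &shape, DALIDataType type,
               int device_id,       // newly added: the device that owns `ptr`
               AccessOrder order = {});
```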
- TensorVector APIs are adjusted (a hedged usage sketch follows these points):
  - It now has two states: Contiguous and Noncontiguous - each can be either enforced or in "don't care" mode.
    - When enforced, changing the mode errors out.
    - When not enforced, setting individual samples marks the batch as currently non-contiguous.
  - Resize got an option to coalesce into one allocation or to split into individual per-sample allocations; if the data still fits, the backing allocation is kept - this is the point that can be adjusted.
  - Most metadata is kept in the batch as the source of truth. I tried to make sure it is set in the same order at every callsite, and added some tests that permute the setters.
  - Setting or copying in new samples is checked against that metadata and must match.
  - When we convert the buffer to non-contiguous while it had a contiguous backing allocation, we mark the `tensors_` that were pointing to that allocation as not sharing, but still aliasing - so we can call Resize() and get a new allocation without the error that we are trying to reallocate something that shares data.
  - The open question is how to treat variable batch size: when we make the TV smaller, unless we are sharing, we do not reset the buffer and we hog the memory. This roughly aligns with the reserve semantics, but might be problematic because of the memory hogging.
  - We need to add the AsTensor and AsReshapedTensor APIs; I tried a separate PR, but it needed some adjustments along the way (#3963).
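A minimal usage sketch of the states described above, assuming illustrative names (`BatchContiguity`, `SetContiguity`, `SetSample`, `AsTensor`, and the header path) rather than the final API:

```cpp
// Sketch only - type, enum, and method names below are assumptions inferred
// from the description above, not necessarily the final DALI API.
#include "dali/pipeline/data/tensor_vector.h"  // header path assumed

void ContiguityExample(dali::TensorVector<dali::CPUBackend> &tv,
                       dali::Tensor<dali::CPUBackend> &sample,
                       const dali::TensorListShape<> &shape) {
  // "Don't care" mode: the batch merely tracks its current state.
  tv.Resize(shape, dali::DALI_FLOAT);  // may coalesce into / keep one backing allocation
  tv.SetSample(0, sample);             // marks the batch as currently non-contiguous

  // Enforced mode: operations that would change the contiguity become errors.
  tv.SetContiguity(dali::BatchContiguity::Contiguous);
  // tv.SetSample(0, sample);          // would now fail instead of going non-contiguous

  // Viewing the whole batch as a single tensor requires a contiguous layout.
  auto whole = tv.AsTensor();
}
```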
- TensorList is removed - the current implementation is just replaced by an alias to TensorVector (sketched after the sub-points below). Replacing all the uses would introduce a lot of noise, so it will be done separately (we can then decide on the name we want to keep: TensorList, TensorVector, TensorBatch, etc.).
  - Some overloads for TensorList were removed, as they now duplicate the signature of TensorVector and are no longer needed.
  - The TensorVector version was preferred for sample access (for example in SequenceOperator).
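A minimal sketch of what the alias amounts to (assuming the same template parameterization as the existing classes):

```cpp
// Sketch only: TensorList becomes a plain alias for TensorVector, so the
// existing TensorList<Backend> spellings keep compiling unchanged.
template <typename Backend>
using TensorList = TensorVector<Backend>;
```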
- As the operators may now return non-contiguous data, some changes were necessary to prevent that from reaching the pipeline outputs (a hedged sketch of the decision logic follows these points):
  - This part is mostly about changes in op_graph, graph_descr.cc, pipeline.cc, and make_contiguous.*.
  - Every output now gets its own MakeContiguous operator (unconditionally).
  - MakeContiguous either passes the data through or copies it if the input was non-contiguous.
  - An additional pass over the graph was added to detect which operators infer their outputs (which currently means they return contiguous data) and which pass data through (in which case we must check whether the passed-through input is contiguous). This is a bit problematic, as PassThrough is static information in OpSchema, while CanInferOutputs() is part of the Operator object API.
  - MakeContiguous is extended to utilize the inferred information.
  - This will require a follow-up for Split & Merge et al. - they also pass through the data (parts of the batches), but in a different manner.
  - This part might be a bit messy, due to realizing pretty late how we handle duplicated outputs of the pipeline.
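A hedged sketch of the decision that the extra graph pass and MakeContiguous make together; `OpSchema` and `OperatorBase` stand in for the corresponding DALI types, and all method names other than CanInferOutputs() (mentioned above) are assumptions:

```cpp
// Illustrative sketch - not the actual DALI implementation.
// For each pipeline output we decide whether its MakeContiguous node can be a
// cheap pass-through or has to copy into a coalesced allocation.
enum class OutputKind { GuaranteedContiguous, PassThrough, Unknown };

OutputKind ClassifyProducer(const OpSchema &schema, const OperatorBase &op) {
  if (op.CanInferOutputs())                    // inferring outputs currently implies
    return OutputKind::GuaranteedContiguous;   // that they are allocated contiguously
  if (schema.HasPassThrough())                 // static schema info: the output aliases
    return OutputKind::PassThrough;            // an input, so contiguity depends on it
  return OutputKind::Unknown;                  // otherwise decide at run time
}

// Conceptually, inside MakeContiguous:
//   if the input batch is (known to be) contiguous -> pass it through / share it
//   otherwise                                      -> copy into a new contiguous batch
```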
Known issues:
- The GPU numpy reader triggers a problem with non-contiguous outputs; I haven't debugged it yet.
- It looks like I introduced a leak in one of the recent rebases, which shows up in the L3 tests.
Additional information:
Affected modules and functionalities:
- TensorVector/TensorList
- MakeContiguous
- Graph Creation
- Pipeline Outputs
Key points relevant for the review:
- How the TV keeps the memory when the batch size changes.
- How TV::Resize works in contiguous/non-contiguous modes.
- Do we want additional modes for that?
- Any alternative to adding MakeContiguous for every output?
Tests:
- [x] Existing tests apply
- TensorVector is now covered by both TV and TL tests
- must be tested on the whole DALI test suite
- perf to be verified on L3 tests
- some tests that verify the number of ops in the graph were adjusted to accommodate the additional MakeContiguous nodes.
- [x] New tests added
- new tests that check the mixed order of setting batch properties, etc.
- [ ] Python tests
- [ ] GTests
- [ ] Benchmark
- [ ] Other
- [ ] N/A
Checklist
Documentation
- [x] Existing documentation applies
- [ ] Documentation updated
- [x] Docstring
- [ ] Doxygen
- [ ] RST
- [ ] Jupyter
- [ ] Other
- [ ] N/A
DALI team only
Requirements
- [ ] Implements new requirements
- [ ] Affects existing requirements
- [ ] N/A
REQ IDs: N/A
JIRA TASK: DALI-2689, DALI-2548
!build
CI MESSAGE: [5225007]: BUILD STARTED
CI MESSAGE: [5225007]: BUILD FAILED
Changes were merged as separate PRs.