Scott Lee issues

Results: 9 issues by Scott Lee

[Datasets] Correct schema unification for Datasets with ragged Arrow arrays

Signed-off-by: Scott Lee ## Why are these changes needed? When creating Datasets with ragged arrays, the resulting Dataset incorrectly uses `ArrowTensorArray` instead of `ArrowVariableShapedTensorArray` as the underlying schema type. This...

[Data] Remove skip for missing logical plan in Dataset plan optimizer

## Why are these changes needed? WIP - checking what changes are needed to enable the optimizer by default 100%...

[Data] Implement Operators for `union()`

## Why are these changes needed? Implement the `LogicalOperator` and `PhysicalOperator` for `Dataset.union()`, and make `union()` lazy. This PR also introduces `Nary` and `NaryOperator` Logical/Physical Operators to support abstraction for...

@author-action-required

[Data] Add heterogeneous Ray Data + Train release test

## Why are these changes needed? - Modifies the existing multi node train benchmark code to enable testing with heterogeneous clusters. - Adds a new release test `read_images_train_1_gpu_5_cpu` with 1...

[Data] Reduce internal Ray Data stack trace output by default

## Why are these changes needed? Whenever there is any error when using Ray Data, the full stack trace is currently printed to stdout. If the exception originates from the...

[Data] Enabling local shuffle buffer reduces throughput when iterating with Ray Trainer

### What happened + What you expected to happen When iterating over a Ray Dataset within the `TorchTrainer` train loop, a non-`None` `local_shuffle_buffer_size` causes a decrease in throughput compared to...

bug

performance

data

ray 2.11

[Data] Fix progress bars being displayed as partially completed in Jupyter notebooks

[Data] Pass `DataContext` as constructor arg for `LogicalPlan` and `PhysicalPlan`

[Data] [Docs] Improve docs around Parquet filter predicate / column selection pushdown