litdata
Blazingly fast, distributed streaming of training data from any cloud storage for training AI models
Before submitting - [ ] Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements) - [ ] Did you read the [contributor guideline](https://github.com/Lightning-AI/lit-data/blob/main/.github/CONTRIBUTING.md), Pull Request...
## 🐛 Bug Training slows down as time progresses. ### To Reproduce Unfortunately, I don't know a good way to reproduce this. This happens with certain datasets. The...
## 🐛 Bug Time per sample grows as the number of processed samples grows. ### To Reproduce Steps to reproduce the behavior: Follow the example provided using `optimize`. Increase the number of samples....
## 🐛 Bug It is a known issue with PyTorch's `IterableDataset` that problems can occur when the dataset defines `__len__`. PyTorch Lightning even raises a warning to make the user...
When running `optimize`, my process somehow crashed after 4h (was estimated to take 10h). Now I have to restart it from scratch. Could you add a checkpointing feature such that...
`DataChunkRecipe` is not working when used in litgpt's TinyLlama pretraining example. Error: `AttributeError: 'SlimPajamaDataRecipe' object has no attribute 'is_generator'`. The type of `SlimPajamaDataRecipe` is `DataChunkRecipe`, and I find DataChunkRecipe object...
## 🚀 Feature `StreamingDataset` has enabled fast data reading, which is amazing when we have a large dataset. However, currently, it does not support reading just a fraction of data,...
## 🐛 Bug I'd like to debug the `random_images` function taken from the first example. However, when adding a `breakpoint()` line in that function, Python crashes. ### Environment - PyTorch...
## 🚀 Feature Streaming subsets of channels ### Motivation My geotiff data is typically multispectral and I do experiments using subsets of the channels. I would like to stream only...
## 🐛 Bug I'm attempting to train a model using litgpt and the openwebtext dataset. I launch the run as normal, following their examples, and the dataset preprocessing starts: However,...