Deependu

Results 70 comments of Deependu

Cool! 🔥

### Issues:
- The code part seems somewhat odd.
- Also, a weird white side-bar is present (which is not present in TorchText docs or any other).

Rest...

Hey, I'd love to work on this issue. Should I continue?

Hi, I'm interested in working on this feature. But before that, I must ensure I've understood it correctly. The current behavior for `optimize` is:

```python
optimize(
    fn=random_images,
    inputs=list(range(1000)),
    output_dir="my_dataset",
    num_workers=4,
    ...
```
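For reference, this is the shape of that call as I understand it from the LitData README; the body of `random_images` and the `chunk_bytes` value are my own illustrative assumptions, not part of the original comment:

```python
import numpy as np
from PIL import Image
from litdata import optimize

def random_images(index):
    # Produce one fake sample per input index (illustrative only).
    image = Image.fromarray(np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8))
    return {"index": index, "image": image, "class": np.random.randint(10)}

if __name__ == "__main__":
    optimize(
        fn=random_images,            # applied to every element of `inputs`
        inputs=list(range(1000)),    # 1000 indices -> 1000 optimized samples
        output_dir="my_dataset",     # chunks + index.json are written here
        num_workers=4,               # parallel optimization workers
        chunk_bytes="64MB",          # assumed chunk size, not from the comment
    )
```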

Hey @ethanwharris, we have the feature to subsample from the dataset, though the subsamples are optimized to come from as few chunks as possible. Indexing and slicing is also...
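From the user side that roughly looks like the sketch below; the `subsample` argument name and the slicing syntax are how I recall the API, so treat them as assumptions:

```python
from litdata import StreamingDataset

# Stream ~10% of the optimized dataset; LitData picks the subsample from as
# few chunks as possible, so fewer chunks need to be downloaded.
dataset = StreamingDataset("my_dataset", subsample=0.1)  # `subsample` assumed

sample = dataset[42]   # indexing into the streamed dataset
head = dataset[:10]    # slicing, as mentioned above
```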

Is it all about running only if it is executing in Studio, and doing nothing otherwise? Modified the code to be something like:

```python
def _cleanup_cache(self) -> None:
    if not _IS_IN_STUDIO:
        ...
```

How about logging a warning for this if they are running it outside?
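Combining that with the guard above, a minimal sketch of what it could look like; the placeholder `_IS_IN_STUDIO` value and the warning message are my assumptions, not the actual LitData code:

```python
import logging

logger = logging.getLogger(__name__)

# Placeholder: in LitData this flag is derived from the environment.
_IS_IN_STUDIO = False

def _cleanup_cache(self) -> None:
    if not _IS_IN_STUDIO:
        # Warn instead of silently skipping when running outside a Studio.
        logger.warning("Not running in a Lightning Studio; skipping cache cleanup.")
        return
    # ... existing cleanup logic stays unchanged and only runs in Studio ...
```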

Hi @yuzc19, you can set the `DATA_OPTIMIZER_CACHE_FOLDER` environment variable at the top of your script to specify the cache directory. This way, the cache_dir will be set to your desired...
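For example (the cache path here is just a placeholder):

```python
import os

# Set this before any LitData optimize/streaming code runs, i.e. at the very
# top of the script, so the cache directory is picked up correctly.
os.environ["DATA_OPTIMIZER_CACHE_FOLDER"] = "/path/to/custom/cache"
```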

- From [zstd's readme](https://github.com/facebook/zstd?tab=readme-ov-file#the-case-for-small-data-compression):

---

Also, ChatGPT says:

---

The graph shared has `10K different json files of roughly 1KB each`. LitData chunks on average will be 64MB or...
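To make the point concrete, a sketch of compressing the optimized output with zstd, where whole ~64MB chunks get compressed rather than thousands of tiny 1KB files; the `compression` parameter name and the tiny-record generator are assumptions on my side:

```python
from litdata import optimize

def to_record(i):
    # Hypothetical per-sample function producing a small ~1KB record.
    return {"id": i, "payload": "x" * 1024}

if __name__ == "__main__":
    optimize(
        fn=to_record,
        inputs=list(range(10_000)),   # ~10K small records, mirroring the graph
        output_dir="my_dataset_zstd",
        chunk_bytes="64MB",           # records are packed into ~64MB chunks
        compression="zstd",           # assumed parameter name for zstd compression
    )
```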

Really nice issue. @bhimrazy, my understanding of the issue is: the optimized dataset will contain only one sample, but while streaming, the same sample will be yielded multiple times (along with sample...

Also, I think shuffling in this case will be interesting. My approach for this will be (rough sketch after the list):

- Add an additional property in the index.json file called `sample_count`, which will contain how...
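A very rough sketch of the shuffling idea, assuming `sample_count` simply records how many times the single underlying sample should be served; the field name comes from the comment above, everything else here is my own assumption:

```python
import random

# Hypothetical: the optimized dataset stores one real sample, while index.json
# additionally records `sample_count` (how many times it should be yielded).
stored_samples = ["the_only_sample"]
sample_count = 1000  # would come from index.json under this proposal

# Virtual indices 0..sample_count-1 all resolve to the single stored sample,
# so shuffling just permutes the order in which the repeats are yielded.
virtual_indices = list(range(sample_count))
random.shuffle(virtual_indices)

stream = (stored_samples[i % len(stored_samples)] for i in virtual_indices)
```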