datachain
AI-dataframe to enrich, transform and analyze data from cloud storages for ML training and LLM apps
Refactoring the changes from #494 based on the comments, this introduces a way to persist the dataset even if an exception is thrown. With this change, the cleanup...
Related to https://github.com/iterative/datachain/issues/477 and https://github.com/iterative/studio/issues/10635#issuecomment-2381829809 & https://github.com/iterative/studio/issues/10635#issuecomment-2406225017 All the context for this change is in https://github.com/iterative/datachain/issues/477 but the tl;dr is: This behaviour is currently undocumented, untested and misunderstood. There is...
### Description To avoid having to use pandas in situations like [this](https://gitlab.com/iterative.ai/cse/customers/7-eleven/qco-image-catalog-dvcx/-/blob/main/scripts/qco-3-train-model.py?ref_type=heads#L27-29) (during model training), we need to implement the [unique](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html) method (and potentially also [nunique](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html)).
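To make the request concrete, here is a minimal pure-Python sketch of the semantics being asked for. It is not DataChain's API: the helper names `unique` and `nunique` are hypothetical and only mirror what the linked pandas functions compute (distinct values in order of first appearance, and their count).

```python
from typing import Hashable, Iterable, List


def unique(values: Iterable[Hashable]) -> List[Hashable]:
    """Return distinct values in order of first appearance (hypothetical helper)."""
    # dict preserves insertion order, so duplicates are dropped stably.
    return list(dict.fromkeys(values))


def nunique(values: Iterable[Hashable]) -> int:
    """Return the number of distinct values (hypothetical helper)."""
    return len(unique(values))


# Example: class labels collected before model training.
labels = ["cat", "dog", "cat", "bird", "dog"]
distinct_labels = unique(labels)
label_count = nunique(labels)
```

A chain-level implementation would presumably push this down to the database as a `DISTINCT` query rather than materialize values in Python; that is an assumption, not something the issue states.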
Implement chain `group_by`:

`group_by.py`:
```python
from datachain import C, DataChain
from datachain.lib import func
from datachain.sql.functions.path import file_ext

res = (
    DataChain.from_storage("s3://dql-50k-laion-files/")
    .group_by(
        cnt=func.count(),
        total_size=func.sum("file.size"),
        avg_size=func.avg("file.size"),
        partition_by=file_ext(C("file__path")),
    )
)
res.show()
```
...
### Description If the name of a top-level column which contains some subcolumns is the same as a level of a 1-level column, the `.select` method seems to only...
### Description Currently, the `.merge` method of `DataChain` expects both keys to be of the same type. This makes sense, but it would improve developers' quality of life a lot...
Previously, when listing a local FS, the `source` field was always set to the root of the FS, e.g. `file:///`, and the rest went in `path`. The idea behind this was to utilize...
Updates the requirements on [numpy](https://github.com/numpy/numpy) to permit the latest version. Release notes Sourced from numpy's releases. 2.1.2 (Oct 5, 2024) NumPy 2.1.2 Release Notes NumPy 2.1.2 is a maintenance release...
#306 is a workaround that throws an exception. But we need to implement this functionality with a potential schema change. This command should change my_dist to another value with a...
We need to specify clearly how `.order_by()` interacts with other methods. For instance, in SQL, the order of the results from a SELECT query is undefined unless there is an...
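The SQL point can be illustrated with a small sketch using Python's standard `sqlite3` module. The table and column names (`files`, `path`, `size`) are made up for the example, and this shows plain SQL behaviour, not how DataChain's `.order_by()` is implemented.

```python
import sqlite3

# In-memory database with a hypothetical "files" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT, size INTEGER)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [("b.jpg", 20), ("a.jpg", 30), ("c.jpg", 10)],
)

# No ORDER BY: SQL leaves the row order unspecified, so code must not rely on it.
unordered = [row[0] for row in conn.execute("SELECT path FROM files")]

# Explicit ORDER BY: the row order is now guaranteed by the query itself.
ordered = [row[0] for row in conn.execute("SELECT path FROM files ORDER BY size")]

conn.close()
```

Only `ordered` has a guaranteed sequence. Any specification of `.order_by()` would need to say which later chain methods preserve that guarantee and which do not.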