datachain

AI-dataframe to enrich, transform and analyze data from cloud storages for ML training and LLM apps

Results 155 datachain issues

Refactoring the changes from #494 based on the comments, this introduces a way to persist a dataset even if an exception is thrown. With this change, the cleanup...

Related to https://github.com/iterative/datachain/issues/477 and https://github.com/iterative/studio/issues/10635#issuecomment-2381829809 & https://github.com/iterative/studio/issues/10635#issuecomment-2406225017 . All the context for this change is in https://github.com/iterative/datachain/issues/477 , but the tl;dr is: this behaviour is currently undocumented, untested and misunderstood. There is...

### Description To avoid having to use pandas in situations like [this](https://gitlab.com/iterative.ai/cse/customers/7-eleven/qco-image-catalog-dvcx/-/blob/main/scripts/qco-3-train-model.py?ref_type=heads#L27-29) (during model training), we need to be able to implement the [unique](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html) method (potentially also [nunique](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html))

enhancement
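
To make the requested semantics concrete, here is a minimal plain-Python sketch of what `unique` and `nunique` compute. The helper names below are hypothetical and are not part of the datachain API; the sketch assumes hashable values and mirrors `pandas.unique` only in that it keeps values in order of first appearance.

```python
from typing import Hashable, Iterable, List


def unique(values: Iterable[Hashable]) -> List[Hashable]:
    """Return distinct values in order of first appearance (hypothetical helper)."""
    # dict preserves insertion order, so duplicates collapse onto their first occurrence.
    return list(dict.fromkeys(values))


def nunique(values: Iterable[Hashable]) -> int:
    """Return the number of distinct values (hypothetical helper)."""
    return len(set(values))


labels = ["cat", "dog", "cat", "bird", "dog"]
print(unique(labels))
print(nunique(labels))
```

A native implementation would presumably push this work down to the underlying query engine rather than materialize the column in Python; that part is not shown here.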

Implement chain `group_by`:

`group_by.py`:

```python
from datachain import C, DataChain
from datachain.lib import func
from datachain.sql.functions.path import file_ext

res = (
    DataChain.from_storage("s3://dql-50k-laion-files/")
    .group_by(
        cnt=func.count(),
        total_size=func.sum("file.size"),
        avg_size=func.avg("file.size"),
        partition_by=file_ext(C("file__path")),
    )
)
res.show()
```

...
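
For readers unfamiliar with the aggregation being proposed, the following plain-Python sketch computes the same three statistics (count, total size, average size) per file extension over an in-memory list. It does not use datachain at all: the sample data and the `group_by_extension` helper are hypothetical, and stripping the leading dot from the extension is an assumption about how `file_ext` behaves.

```python
from collections import defaultdict
from os.path import splitext
from typing import Dict, List, Tuple

# Hypothetical sample data: (path, size in bytes).
files: List[Tuple[str, int]] = [
    ("images/a.jpg", 100),
    ("images/b.jpg", 300),
    ("meta/a.json", 50),
]


def group_by_extension(rows: List[Tuple[str, int]]) -> Dict[str, Dict[str, float]]:
    """Group rows by file extension and aggregate their sizes."""
    sizes: Dict[str, List[int]] = defaultdict(list)
    for path, size in rows:
        ext = splitext(path)[1].lstrip(".")  # "images/a.jpg" -> "jpg"
        sizes[ext].append(size)
    return {
        ext: {
            "cnt": len(values),
            "total_size": sum(values),
            "avg_size": sum(values) / len(values),
        }
        for ext, values in sizes.items()
    }


stats = group_by_extension(files)
print(stats)
```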

### Description If the name of a top-level column which contains some subcolumns is the same as a level of a 1-level column, the `.select` method seems to only...

bug
triage

### Description Currently, the `.merge` method of `DataChain` expects both keys to be of the same type. This makes sense, but it would improve developers' quality of life a lot...

enhancement
triage

Before, when listing the local FS, the root of the FS was always set as the `source` field, e.g. `file:///`, and the rest was in `path`. The idea behind this was to utilize...
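
As an illustration of the earlier behaviour described above, the sketch below splits an absolute POSIX path into a fixed `file:///` source and a root-relative path. The `split_local_path` helper is hypothetical and only models the description in this snippet; it is not datachain code, and it does not show the new behaviour, which the snippet does not spell out.

```python
from pathlib import PurePosixPath
from typing import Tuple


def split_local_path(abs_path: str) -> Tuple[str, str]:
    """Split an absolute POSIX path into (source, path) with the FS root as source."""
    p = PurePosixPath(abs_path)
    if not p.is_absolute():
        raise ValueError(f"expected an absolute path, got {abs_path!r}")
    # The source is always the filesystem root; everything else goes into the path.
    return "file:///", str(p.relative_to("/"))


source, path = split_local_path("/home/user/data/img.png")
print(source, path)
```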

Updates the requirements on [numpy](https://github.com/numpy/numpy) to permit the latest version. Release notes, sourced from numpy's releases: 2.1.2 (Oct 5, 2024). NumPy 2.1.2 Release Notes: NumPy 2.1.2 is a maintenance release...

#306 is a workaround that throws an exception. But we need to implement this functionality with a potential schema change. This command should change my_dist to another value with a...

priority-p2

We need to specify clearly how `.order_by()` interacts with other methods. For instance, in SQL, the order of the results from a SELECT query is undefined unless there is an...
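
The SQL point can be demonstrated with the standard-library `sqlite3` module. This is a small sketch, not datachain code; the table and column names are made up for illustration. Only the query with an explicit `ORDER BY` has a guaranteed row order, so code should not rely on the order of the first result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT, size INTEGER)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [("b.jpg", 300), ("a.jpg", 100), ("c.jpg", 200)],
)

# No ORDER BY: the engine may return rows in any order.
unordered = conn.execute("SELECT path FROM files").fetchall()

# Explicit ORDER BY: the row order is defined by the query itself.
ordered = conn.execute("SELECT path FROM files ORDER BY size").fetchall()

print(unordered)
print(ordered)
```

Any operation applied after an ordering step raises the same question: unless the ordering is reapplied or documented as preserved, the safest assumption is that it is lost.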