Implement chain group_by

Open dreadatour opened this issue 1 year ago • 2 comments

Implement chain group_by:

group_by.py:

from datachain import C, DataChain
from datachain.lib import func
from datachain.sql.functions.path import file_ext


res = (
    DataChain.from_storage("s3://dql-50k-laion-files/")
    .group_by(
        cnt=func.count(),
        total_size=func.sum("file.size"),
        avg_size=func.avg("file.size"),
        partition_by=file_ext(C("file__path")),
    )
)

res.show()

Run:

$ python group_by.py
Processed: 1 rows [00:00, 1085.76 rows/s]
Generated: 1 rows [00:00, 1162.82 rows/s]
Cleanup: 1 tables [00:00, 6615.62 tables/s]
Listing s3://dql-50k-laion-files: 129136 objects [05:07, 419.31 objects/s]
Processed: 1 rows [05:09, 309.83s/ rows]
Generated: 129136 rows [05:04, 423.69 rows/s]
Cleanup: 1 tables [00:00, 257.19 tables/s]
  file_ext    cnt  total_size      avg_size
0      jpg  43042  1079645149  2.508353e+04
1     json  43047    29743128  6.909454e+02
2  parquet      5    15378208  3.075642e+06
3      txt  43042     2927814  6.802226e+01
$
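
For readers unfamiliar with the aggregation, here is a minimal plain-Python sketch of what the example computes: a row count, a size total, and a size average per file extension. It does not use DataChain at all. The `files` records and the `group_by_ext` helper are hypothetical, and the extension parsing via `PurePosixPath.suffix` is an assumption that may differ from `file_ext` on edge cases such as dotfiles or paths without an extension.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical (path, size) records standing in for listed storage objects.
files = [
    ("images/a.jpg", 100),
    ("images/b.jpg", 300),
    ("meta/a.json", 10),
    ("meta/b.json", 30),
    ("notes/a.txt", 5),
]


def group_by_ext(records):
    """Return count, total size and average size per file extension."""
    sizes = defaultdict(list)
    for path, size in records:
        # Partition key: extension without the leading dot (assumed behaviour).
        ext = PurePosixPath(path).suffix.lstrip(".")
        sizes[ext].append(size)
    return {
        ext: {
            "cnt": len(values),
            "total_size": sum(values),
            "avg_size": sum(values) / len(values),
        }
        for ext, values in sorted(sizes.items())
    }


result = group_by_ext(files)
for ext, row in result.items():
    print(ext, row)
```

Each output row corresponds to one partition, mirroring the `file_ext`, `cnt`, `total_size` and `avg_size` columns in the table above.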

See also tests.

dreadatour, Sep 27 '24 14:09

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 56999e8
Status: ✅  Deploy successful!
Preview URL: https://027a0733.datachain-documentation.pages.dev
Branch Preview URL: https://228-group-by.datachain-documentation.pages.dev

Codecov Report

Attention: Patch coverage is 95.59748% with 7 lines in your changes missing coverage. Please review.

Project coverage is 87.25%. Comparing base (437898c) to head (56999e8). Report is 4 commits behind head on main.

Files with missing lines          Patch %   Lines
src/datachain/query/dataset.py    69.23%    2 Missing and 2 partials :warning:
src/datachain/lib/func/func.py    91.66%    1 Missing and 2 partials :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #482      +/-   ##
==========================================
+ Coverage   87.15%   87.25%   +0.10%     
==========================================
  Files          92       96       +4     
  Lines        9834     9943     +109     
  Branches     1348     1362      +14     
==========================================
+ Hits         8571     8676     +105     
- Misses        910      911       +1     
- Partials      353      356       +3     
Flag Coverage Δ
datachain 87.22% <95.59%> (+0.10%) :arrow_up:
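
The figures in this report can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers copied from the report, with two assumptions labelled in the comments: that the displayed percentages are truncated rather than rounded, and that the patch spans 159 coverable lines. Neither is stated in the report; the line count is inferred from the 7 uncovered lines and the 95.59748% patch figure.

```python
# Figures copied from the coverage diff above.
base = {"lines": 9834, "hits": 8571, "misses": 910, "partials": 353}
head = {"lines": 9943, "hits": 8676, "misses": 911, "partials": 356}


def pct_truncated(hits, lines):
    """Coverage percentage truncated to two decimals (assumed display rule)."""
    return (hits * 10000 // lines) / 100


# Hits, misses and partials should add up to the total line count.
base_consistent = base["hits"] + base["misses"] + base["partials"] == base["lines"]
head_consistent = head["hits"] + head["misses"] + head["partials"] == head["lines"]

base_cov = pct_truncated(base["hits"], base["lines"])
head_cov = pct_truncated(head["hits"], head["lines"])

# Uncovered patch lines: missing plus partial, summed over the two listed files.
uncovered = (2 + 2) + (1 + 2)

# Assumption: 159 coverable patch lines, inferred rather than reported.
patch_lines = 159
patch_cov = (patch_lines - uncovered) / patch_lines * 100

print(base_consistent, head_consistent)
print(base_cov, head_cov, round(head_cov - base_cov, 2))
print(uncovered, round(patch_cov, 5))
```

Under these assumptions the totals are internally consistent, and the 7 uncovered lines match the sum of the per-file "Missing" and "partials" counts.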

Flags with carried forward coverage won't be shown.

codecov[bot], Sep 30 '24 17:09