datachain
Implement chain group_by
Implement chain group_by:
group_by.py:
from datachain import C, DataChain
from datachain.lib import func
from datachain.sql.functions.path import file_ext

res = (
    DataChain.from_storage("s3://dql-50k-laion-files/")
    .group_by(
        cnt=func.count(),
        total_size=func.sum("file.size"),
        avg_size=func.avg("file.size"),
        partition_by=file_ext(C("file__path")),
    )
)
res.show()
Run:
$ python group_by.py
Processed: 1 rows [00:00, 1085.76 rows/s]
Generated: 1 rows [00:00, 1162.82 rows/s]
Cleanup: 1 tables [00:00, 6615.62 tables/s]
Listing s3://dql-50k-laion-files: 129136 objects [05:07, 419.31 objects/s]
Processed: 1 rows [05:09, 309.83s/ rows]
Generated: 129136 rows [05:04, 423.69 rows/s]
Cleanup: 1 tables [00:00, 257.19 tables/s]
   file_ext    cnt  total_size      avg_size
0       jpg  43042  1079645149  2.508353e+04
1      json  43047    29743128  6.909454e+02
2   parquet      5    15378208  3.075642e+06
3       txt  43042     2927814  6.802226e+01
$
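To make the aggregation in the output above concrete, here is a minimal standard-library sketch of the same idea: partition records by file extension, then compute a count, a sum, and an average per partition. It does not use datachain at all. The `files` records and the `group_by_ext` helper are hypothetical, and the way the extension is derived here (`PurePosixPath.suffix`) is an assumption that may differ from what `file_ext` does for edge cases such as names with several dots.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical (path, size) records standing in for the `file` signal.
files = [
    ("images/a.jpg", 100),
    ("images/b.jpg", 300),
    ("meta/a.json", 40),
    ("meta/b.json", 60),
    ("notes/readme.txt", 10),
]


def group_by_ext(records):
    """Partition records by file extension and aggregate sizes per partition."""
    sizes = defaultdict(list)
    for path, size in records:
        # Assumption: the extension is the final suffix without the leading dot.
        ext = PurePosixPath(path).suffix.lstrip(".")
        sizes[ext].append(size)
    return {
        ext: {
            "cnt": len(values),
            "total_size": sum(values),
            "avg_size": sum(values) / len(values),
        }
        for ext, values in sorted(sizes.items())
    }


result = group_by_ext(files)
for ext, row in result.items():
    print(ext, row["cnt"], row["total_size"], row["avg_size"])
```

Each keyword argument passed to `group_by` in the script above corresponds to one key of the per-partition dictionary here, and `partition_by` corresponds to the dictionary key itself.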
See also tests.
Deploying datachain-documentation with Cloudflare Pages
| Latest commit: | 56999e8 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://027a0733.datachain-documentation.pages.dev |
| Branch Preview URL: | https://228-group-by.datachain-documentation.pages.dev |
Codecov Report
Attention: Patch coverage is 95.59748% with 7 lines in your changes missing coverage. Please review.
Project coverage is 87.25%. Comparing base (437898c) to head (56999e8). Report is 4 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/datachain/query/dataset.py | 69.23% | 2 Missing and 2 partials :warning: |
| src/datachain/lib/func/func.py | 91.66% | 1 Missing and 2 partials :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #482 +/- ##
==========================================
+ Coverage 87.15% 87.25% +0.10%
==========================================
Files 92 96 +4
Lines 9834 9943 +109
Branches 1348 1362 +14
==========================================
+ Hits 8571 8676 +105
- Misses 910 911 +1
- Partials 353 356 +3
| Flag | Coverage Δ | |
|---|---|---|
| datachain | 87.22% <95.59%> (+0.10%) | :arrow_up: |
Flags with carried forward coverage won't be shown.