Pool "load" operator in Zed
tl;dr - Inspired by the kinds of ideas proposed in #2754, @mccanne suggested a load operator that could be used to populate scratch pools with data from other pools.
Consider the following workflow using current Zed commit 7adbb36 and the Zeek zed-sample-data. Let's say I start by loading it all to one pool:
$ rm -rf $ZED_LAKE_ROOT
$ zed lake create -p samples
pool created: samples
$ zed lake load -p samples *
1szVwPnjln5K2tiUUXSckKGxqQ3 committed 1 segments
$ zed lake query -z 'from samples | count() by _path'
{_path:"http",count:144034 (uint64)} (=0)
{_path:"ntp",count:904} (0)
{_path:"ssh",count:22} (0)
...
If I want to extract a subset of that data to another pool, currently this requires a pipeline that invokes Zed multiple times:
$ zed lake create -p weird
pool created: weird
$ zed lake query 'from samples | _path=="weird"' | zed lake load -p weird -
1szWGf4n4AUAfFElCHtVxjwa05m committed 1 segments
$ zed lake query -z 'from weird | count() by _path'
{_path:"weird",count:24048 (uint64)} (=0)
However, @mccanne pointed out that a load operator could potentially be added in Zed to accomplish this in one shot, such as:
$ zed lake query 'from samples | _path=="weird" | load weird'
This could be particularly useful in cloud-centric use cases, since it would open up the potential to create scratch pools consisting of subsets of larger data sets, without having to require data to bounce all the way down to the client just to be sent back up to the cloud again.
We talked about this recently in sync and the front-end guys wanted to think through possible UX options for something like this...
The result of a load should be the commit object or an error. That way the client can tell whether the operation succeeded, and on success it receives the commit object, which fully describes the operation, as a handy result.
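As a sketch of how a script might consume that result (hypothetical, not a documented contract: it assumes load emits the commit identifier on stdout and the command exits nonzero on error):

```shell
# Hypothetical sketch only: assumes `load` prints the resulting commit
# identifier on stdout and `zed query` exits nonzero on failure.
if commit=$(zed query -z 'from samples | _path=="weird" | load "weird"'); then
  echo "load succeeded; commit object: $commit"
else
  echo "load failed" >&2
  exit 1
fi
```

With that contract, a client or script can branch on success and record exactly which commit the load produced.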
A community zync user that intends to use this feature made a good point about how this operator (and others we may create that could "write" data to a pool, e.g. delete #4142) "might be unsafe in some circumstances". That is, Zed currently lacks RBAC and similar mechanisms to allow fine-grained control over certain operations. For the bulk of current users that are still running standalone on the desktop this is no big deal, since they can already drop to the shell and load data, so being able to do it in the app or within a Zed pipeline at the CLI is just a nice convenience and not in any way a new exposure. But the user who spoke of load being potentially "unsafe" is unique because they're effectively running in production, where many of their users are intended to be "query only"; if those same users could invoke load, it would have the potential to corrupt data.
Something that could be done in the short term is to add a flag when starting the Zed service that disables any "write" operations such as load and delete. This doesn't necessarily have to happen at the same time as the initial implementation of load; if the attached PR merges without covering it, we can open a follow-on issue at that time.
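For illustration only (the flag shown below is hypothetical and does not exist in Zed today), such a switch might look like:

```shell
# Hypothetical: a read-only switch for the service. Neither the flag
# name nor its behavior exists in Zed today; this only illustrates the
# proposal that write operators (load, delete) be rejected globally
# for every client of this endpoint.
zed serve -lake $ZED_LAKE_ROOT -readonly
```

A coarse service-level switch like this would cover the "query only" production case described above without waiting for full RBAC.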
Verified in Zed commit f733ef3.
Repeating the steps from when the issue was opened, but now with the load happening from within the Zed pipeline:
$ zed -version
Version: v1.7.0-58-gf733ef37
$ zed create samples
pool created: samples 2PGPVpT3dj3kWo9xHZcX11kasEP
$ zed -use samples load *
(26/26) 44.71MB/44.71MB 0B/s 100.00%
2PGPYCRlEVGWerHibvqRBi1G6KR committed
$ zed create weird
pool created: weird 2PGPbPwfCC1o02Iu8Ma47kFmLxJ
$ zed query -z 'from samples | _path=="weird" | load "weird"'
0x10df7767693f212cecafa04887b0f388c4f5fb2b
$ zed query -z 'from weird | count() by _path'
{_path:"weird",count:24048(uint64)}
Docs are also available at https://zed.brimdata.io/docs/next/language/operators/load.
Several follow-on issues have also been spawned to further improve the operator's usability in the future:
- https://github.com/brimdata/zui/issues/2760
- https://github.com/brimdata/zed/issues/4570
- https://github.com/brimdata/zed/issues/4571
- https://github.com/brimdata/zed/issues/4574
Thanks @dianetc!