Question: How to perform in-place put/drop on a ZNG file?
Note: I asked this question in #3881. Creating a new issue since it is not directly related to that one.
I am evaluating Zed to perform CRUD on large JSON files (a few hundred MB to 1 GB) that are likely to contain deeply nested objects.
Could you share sample queries to put/cut/drop in place, storing the result directly in the same ZNG file?
What I have tried: putting a specific value in a deeply nested object. I am able to achieve it like this:
zq -i zng -o out.zng 'over arr | where id==2 | put level0.level1.level2.level3.value:=100' min.zng
It outputs only the object with id==2, and I am able to redirect that to out.zng.
However, I want to store the changed value in min.zng itself. How can I do this?
Attaching a sample ZNG file that mimics my use case for reference: min.zng.zip
Interesting. If you want to modify a value inside a JSON file in place, that's not something zq can do and I don't know of anything that would do this. Instead you would have to read the file, modify the contents, then write out the updated contents.
If your use case would allow using ZNG instead of JSON for the storage format, then you could do a zq query that makes the desired change and produces a new file, then you could do a file-system move of the new file to the original file.
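To make that concrete, here is a minimal Python sketch of the query-then-rename pattern (the `update_in_place` helper and temp-file naming are hypothetical; it assumes the `zq` binary is on your PATH):

```python
import os
import subprocess
import tempfile

def update_in_place(path, query):
    """Run a zq query over a ZNG file and swap the result into place.

    Hypothetical helper, not part of zq itself. The rename via
    os.replace is atomic on POSIX, so readers see either the old file
    or the new one, never a partially written file.
    """
    # Create the temp file in the same directory so the rename stays
    # on one filesystem (cross-device renames are not atomic).
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".zng")
    os.close(fd)
    try:
        subprocess.run(["zq", "-i", "zng", "-o", tmp, query, path], check=True)
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

You would call it with the full record-preserving query (such as the switch example below), not a bare filter, since the output wholly replaces the original file.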
As an example, on your sample file, suppose we want to change the record at id==6 so that level2 has a different value. You could run this query...
over arr | switch id (
  case 6 => put level0.level1.level2.value:="2-changed"
  default => pass
)
| merge id
| arr:=collect(this)
If you put the query above in a file, say update.zed, and then run this:
zq -I update.zed min.zng
you will get new ZNG output with the desired update to the specified field.
Since there's a lot going on here, it's easy to test and inspect the output by running this:
zq -I update.zed min.zng | zq -Z -pretty=2 'over arr | id==6' -
which will pull out the record with id==6 and let you have a look:
{
  id: 6,
  next: 7,
  _id: "625664e3ea574290b931f172",
  index: 0,
  guid: "e300c649-6f2c-4a60-9b51-bc1be08d0a14",
  isActive: false,
  balance: ",764.44",
  picture: "http://placehold.it/32x32",
  age: 38,
  eyeColor: "brown",
  name: "Hart Kline",
  gender: "male",
  company: "LUNCHPAD",
  email: "[email protected]",
  phone: "+1 (840) 496-2259",
  address: "643 Clara Street, Groveville, North Carolina, 4785",
  registered: "2015-11-02T04:02:38 -06:-30",
  latitude: 82.284556,
  longitude: -53.359112,
  tags: [
    "ex",
    "duis",
    "commodo",
    "et",
    "ad",
    "voluptate",
    "cupidatat"
  ],
  friends: [
    {
      id: 0,
      name: "Bradford Shaffer"
    },
    {
      id: 1,
      name: "Monroe Kent"
    },
    {
      id: 2,
      name: "John Carey"
    }
  ],
  greeting: "Hello, Hart Kline! You have 9 unread messages.",
  favoriteFruit: "strawberry",
  level0: {
    tags: [
      1,
      2,
      3
    ],
    value: "0",
    level1: {
      tags: [
        1,
        2,
        3
      ],
      value: "1",
      level2: {
        tags: [
          1,
          2,
          3
        ],
        value: "2-changed",
        level3: {
          tags: [
            1,
            2,
            3
          ],
          value: "3",
          level4: {
            ...
You could also do this pretty easily with Node or Python.
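As a point of comparison, here is a minimal Python sketch of the same update over the plain JSON form (the `update_json` name is hypothetical; it assumes the sample's arr/level0/level1/level2 layout):

```python
import json

def update_json(path, target_id, new_value):
    """Read the whole JSON file, change one nested field, write it back.

    Minimal sketch assuming the sample's shape:
    {"arr": [{"id": ..., "level0": {"level1": {"level2": {"value": ...}}}}]}
    """
    with open(path) as f:
        doc = json.load(f)
    for rec in doc["arr"]:
        if rec["id"] == target_id:
            rec["level0"]["level1"]["level2"]["value"] = new_value
    with open(path, "w") as f:
        json.dump(doc, f, indent=2)
```

Note that this still reads and rewrites the entire file; the pointed, no-full-rewrite update is exactly the part that needs the Zed lake.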
If you want fine-grained updates to complicated nested structures like this, we're planning to add CRUD updates to the Zed lake which wouldn't require reading and writing large objects, though this is a longer discussion...
@mccanne: Thanks a lot for the detailed answer. It gave a lot of clarity.
I would like to share the problem we wanted to solve and why we think zed can help us out.
I apologize for a slightly longer post.
Objectives:
- Ability to perform granular CRUD on large JSON objects. By granular, I mean the system should be capable of performing pointed updates/retrieval of a value or sub-object without recreating the entire JSON object.
- Provide transactional guarantees for queries across multiple readers/writers of JSON objects (similar to ACID compliance in a DB).
- Keep the turnaround time of queries on large JSON objects (say 1-2 GB in size) within an acceptable range for a better user experience.
- Keep the system resources under control.
- Optional: Indexing for faster queries.
In short, the user should not see much difference in the experience regardless of the size of the JSON objects.
Ideally, a database would fit the objectives.
There are a bunch of options available for small JSON objects, ranging from SQL-based (though forcing structure onto the JSON via an ORM layer is needed) to NoSQL-based (like MongoDB documents), since to some extent they help preserve the semi-structured nature of JSON. For smaller JSON objects, NoSQL checks off all the objectives listed above.
AFAIK, there isn't any DB-based solution that readily addresses all the objectives for large JSON objects. We even evaluated options to break the JSON into smaller pieces and store them in a NoSQL DB, because most NoSQL databases we have seen restrict the size of an individual document/entry (16 MB for MongoDB documents).
Then I investigated file-based options for storing JSON objects. Since keeping a large file in memory is not a choice, a streaming solution seemed like a good fit. However, I didn't find a streaming-based solution that meets our objectives; for example, jq and jj had their own sets of issues.
Zed comes pretty close to what we are looking for. In fact, the intuitive query building and the ZNG format are additional good reasons (which we didn't think of earlier) to investigate Zed further.
It would be of great help if you could share your thoughts on the following; it would help us understand the capabilities of Zed and how we can use it in our solution.
- Do you see any outright problems in Zed that would prevent large-JSON CRUD within an acceptable time (seconds, regardless of JSON file size)? Assume we will store the JSON in ZNG format.
- What client libraries does Zed support? Our solution stack is currently Dart-based. Which language client library would you recommend? Language is not a concern; ideally, we would prefer a library that is stable and well documented.
- I haven't checked the Zed APIs and how to use them. Is there any sample code available, maybe in your test suites, that I can look at to understand how to use the APIs? Pointers to the documentation would be of great help.
- Can the Zed APIs give transactional guarantees across concurrent reads/writes on the same ZNG file, or does this have to be handled outside Zed?
- Are there any specific features in Zed that you would like us to look at that you think might be useful?
- Are there any features on your immediate roadmap that would address some of these objectives?
Thanks for reading. Deeply appreciate the great work done by the zed team!
@muthu-rk: Not sure if you're still using Zed and watching this issue, but here are some updates and responses to your questions.
To start at a higher level, in your original inquiries you seemed to be hoping that zq would be able to do efficient changes-in-place to JSON files on disk as opposed to transforming while doing a full read/write pass through a file like you found. Having discussed the topic with the team, we doubt this is something that zq is likely to cover soon, if ever. Per the comment from @mccanne above, the more sophisticated operations you're seeking are something we intend to deliver via the Zed lake. With the lake, you'd effectively load your JSON data into a pool, make CRUD-like changes to it in the pool, and then you could extract some/all of it back out as JSON (or many other formats) as needed. If you've not kept up with all the activity in that area, you can find lots of docs at https://zed.brimdata.io/, and https://zed.brimdata.io/docs/next/commands/zed is probably the best doc to start from in terms of acquiring some hands-on familiarity with the concepts.
With that, here are responses to the specific questions in your most recent comment.
Do you see any outright problems in Zed that would prevent large-JSON CRUD within an acceptable time (seconds, regardless of JSON file size)? Assume we will store the JSON in ZNG format.
The planned enhancement for CRUD-like operations is tracked in #4024. It's likely to be implemented in the next few months. While we won't know for sure until it's finished, I expect that operations on large JSON objects would complete in seconds.
What client libraries does Zed support? Our solution stack is currently Dart-based. Which language client library would you recommend? Language is not a concern; ideally, we would prefer a library that is stable and well documented. I haven't checked the Zed APIs and how to use them. Is there any sample code available, maybe in your test suites, that I can look at to understand how to use the APIs? Pointers to the documentation would be of great help.
There's some support for client libraries in Go, JavaScript, and Python; see the docs below https://zed.brimdata.io/docs/next/libraries for more details. Some of the REST API is also documented at https://zed.brimdata.io/docs/next/lake/api. In many ways we've been adding enhancements/docs in these areas as user needs have surfaced, so if you start working with one of them and there's something you want to do that's not immediately obvious, let us know.
Can the Zed APIs give transactional guarantees across concurrent reads/writes on the same ZNG file, or does this have to be handled outside Zed?
Per the comment above, we'd only be ready to make this statement about data in a pool of a Zed lake (not a ZNG file on a filesystem), but I think the answer to your question is "yes". @mccanne or @nwt may chime in with more detail here.
Are there any specific features in Zed that you would like us to look at that you think might be useful?
In the comments above I think you've already been exposed to some of the most important ones (e.g., over and switch), but the language docs below https://zed.brimdata.io/docs/next/language are all worth reviewing. The trick shown in #4050 may also prove useful.
Are there any features on your immediate roadmap that would address some of these objectives?
I think the roll operator described in #4024 is probably the biggie. The other priorities on the short-term roadmap are largely focused on making the lake operate and perform at scale: the same language concepts pointed to above, but bigger/faster/easier.
I'll continue to hold this issue open in the event you have further questions, and as we deliver in some of these areas I'll check back and see if you're able to start making use of them.
@philrz: Thanks for the update. We parked large-JSON handling with Zed after raising these issues. Some of the updates you mentioned are interesting and we will explore them. I will reach out to you if I need further clarification.