[FEATURE] JSON Search Support
Is your feature request related to a problem? Community members have asked for easier JSON parsing and analysis capabilities which allow them to not only search JSON logs and extract fields without writing complex parse expressions, but perform computations on JSON array values, such as finding the sum of all values in the array, where the number of elements in the array is not known.
What solution would you like? Allow users to extract and transform data from JSON-formatted events and fields. Users should be able to extract all values in an array by specifying a wildcard for the individual element position and doing an aggregation operation on them. Users should be able to extract: 1/single or multiple top level fields 2/nested fields 3/keys in arrays and perform operations on the values.
** Examples ***
- User wants to look at an array of elements, across log entries to see how many times each element has occurred in the array.
- Match any item in a JSON array, regardless of the position in the array
- Pull out fields within a nested JSON file
What alternatives have you considered? No other solutions are available in PPL
Do you have any additional context? No
SELECT
json_extract(myblob, '$.name') AS name,
json_extract(myblob, '$.projects') AS projects
FROM dataset
it would help a lot to support json_extract function to manipulate json string field as shown above.
@anasalkouz @YANG-DB @brijos
Json Functions Proposal
I have created a working prototype for the json_extract function.
This is currently implemented in the sql sub-project to make the json functions available not only as PPL command. In other words: The function can be used (like any other built in function) in sql and ppl.
The proposed (and so far implemented syntax) is:
json_extract(<json>,<path>)
<json> Json as string. From an table cell or as literal (mandatory)
<path> a json path or a json pointer expression (mandatory).
The function returns the result as string (scalar value or full json)
An error is thrown when:
- Json is invalid
- Path is not a valid json path or json pointer expression
No error is thrown when:
- Path can not be located (we return just NULL)
Examples:
#JSON Path
select json_extract('{\"name\":\"saly\"}', '$.name')
#JSON Pointer
select json_extract('{\"name\":\"saly\"}', '/name')
Open questions:
- Should it be implemented as a built in sql function?
- Is the string to string approach sufficient (json input as string, function output as string)?
Not yet covered
- Optimizations
- Integ Tests
@anasalkouz @YANG-DB @rupal-bq any comments on the proposal so far?
Can we perhaps specify a document ID to use as the json for the query? I've got a lot of json blobs that I'd love to search through instead of breaking them up before ingest.
Can we perhaps specify a document ID to use as the json for the query? I've got a lot of json blobs that I'd love to search through instead of breaking them up before ingest.
can you post an example how this can look like?
- Should it be implemented as a built in sql function? yes this would be the case as this is true for Athena's json_extract function.
- Is the string to string approach sufficient (json input as string, function output as string)? i would think this suffices for now @salyh also it would be nice if the implementation support targeting json array elements as well as shown in the example below
source input
{
"Statement": [
{
"Effect": "Allow",
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::example-bucket/*"
},
{
"Effect": "Deny",
"Action": "s3:DeleteObject",
"Resource": "arn:aws:s3:::example-bucket/*"
}
]
}
query
SELECT ...
FROM ...
WHERE json_extract(p.policy_std, '$.Statement[*].Effect') = 'Allow'
AND f.name = 'hellopython';
Will add this ...
@salyh I've added a separate json & flatten issues:
- #3029
- #3027