sql icon indicating copy to clipboard operation
sql copied to clipboard

[FEATURE] JSON Search Support

Open brijos opened this issue 1 year ago • 9 comments

Is your feature request related to a problem? Community members have asked for easier JSON parsing and analysis capabilities which allow them to not only search JSON logs and extract fields without writing complex parse expressions, but perform computations on JSON array values, such as finding the sum of all values in the array, where the number of elements in the array is not known.

What solution would you like? Allow users to extract and transform data from JSON-formatted events and fields. Users should be able to extract all values in an array by specifying a wildcard for the individual element position and doing an aggregation operation on them. Users should be able to extract: 1/single or multiple top level fields 2/nested fields 3/keys in arrays and perform operations on the values.

** Examples ***

  • User wants to look at an array of elements, across log entries to see how many times each element has occurred in the array.
  • Match any item in a JSON array, regardless of the position in the array
  • Pull out fields within a nested JSON file

What alternatives have you considered? No other solutions are available in PPL

Do you have any additional context? No

brijos avatar May 03 '24 23:05 brijos

SELECT
  json_extract(myblob, '$.name') AS name,
  json_extract(myblob, '$.projects') AS projects
FROM dataset

it would help a lot to support json_extract function to manipulate json string field as shown above.

kedbirhan avatar May 08 '24 00:05 kedbirhan

Catch All Triage - 1 2 3 4 5 6

dblock avatar Jun 24 '24 16:06 dblock

@anasalkouz @YANG-DB @brijos

Json Functions Proposal

I have created a working prototype for the json_extract function.

This is currently implemented in the sql sub-project to make the json functions available not only as PPL command. In other words: The function can be used (like any other built in function) in sql and ppl.

The proposed (and so far implemented syntax) is:

json_extract(<json>,<path>)

<json> Json as string. From an table cell or as literal (mandatory)

<path> a json path or a json pointer expression (mandatory).

The function returns the result as string (scalar value or full json)

An error is thrown when:

  • Json is invalid
  • Path is not a valid json path or json pointer expression

No error is thrown when:

  • Path can not be located (we return just NULL)

Examples:

#JSON Path
select json_extract('{\"name\":\"saly\"}', '$.name')
#JSON Pointer
select json_extract('{\"name\":\"saly\"}', '/name')

Open questions:

  1. Should it be implemented as a built in sql function?
  2. Is the string to string approach sufficient (json input as string, function output as string)?

Not yet covered

  • Optimizations
  • Integ Tests

salyh avatar Jul 03 '24 15:07 salyh

@anasalkouz @YANG-DB @rupal-bq any comments on the proposal so far?

salyh avatar Jul 09 '24 07:07 salyh

Can we perhaps specify a document ID to use as the json for the query? I've got a lot of json blobs that I'd love to search through instead of breaking them up before ingest.

nateynateynate avatar Jul 11 '24 16:07 nateynateynate

Can we perhaps specify a document ID to use as the json for the query? I've got a lot of json blobs that I'd love to search through instead of breaking them up before ingest.

can you post an example how this can look like?

salyh avatar Jul 16 '24 14:07 salyh

  1. Should it be implemented as a built in sql function? yes this would be the case as this is true for Athena's json_extract function.
  2. Is the string to string approach sufficient (json input as string, function output as string)? i would think this suffices for now @salyh also it would be nice if the implementation support targeting json array elements as well as shown in the example below

source input


{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-bucket/*"
    },
    {
      "Effect": "Deny",
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::example-bucket/*"
    }
  ]
}

query

SELECT ...
FROM ...
WHERE json_extract(p.policy_std, '$.Statement[*].Effect') = 'Allow'
  AND f.name = 'hellopython';

kedbirhan avatar Jul 18 '24 18:07 kedbirhan

Will add this ...

salyh avatar Jul 23 '24 13:07 salyh

@salyh I've added a separate json & flatten issues:

  • #3029
  • #3027

YANG-DB avatar Sep 16 '24 18:09 YANG-DB