Dynamic endpoints
Fixes #370
Description
Adds support for dynamic queries by periodically re-fetching the given query.
Associates each dynamic query with a UUID and uses a static ref map to associate each UUID with its latest result.
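A minimal sketch of the "static ref map" idea, using only the standard library. The names (`dynamic_results`, `QueryResult`, string keys standing in for UUIDs) are illustrative, not the PR's actual code:

```rust
// Hypothetical sketch: a process-wide map from query id to latest result.
use std::collections::HashMap;
use std::sync::{OnceLock, RwLock};

#[derive(Clone, Debug, PartialEq)]
struct QueryResult {
    rows: Vec<String>, // placeholder for real record batches
}

// Lazily initialized global map; real code might use a `lazy_static`/`once_cell` static ref.
fn dynamic_results() -> &'static RwLock<HashMap<String, QueryResult>> {
    static MAP: OnceLock<RwLock<HashMap<String, QueryResult>>> = OnceLock::new();
    MAP.get_or_init(|| RwLock::new(HashMap::new()))
}

// Overwrite the stored result for a uuid with the latest fetch.
fn store_result(uuid: &str, result: QueryResult) {
    dynamic_results().write().unwrap().insert(uuid.to_string(), result);
}

// Read back whatever the periodic refresh last stored.
fn latest_result(uuid: &str) -> Option<QueryResult> {
    dynamic_results().read().unwrap().get(uuid).cloned()
}
```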
This PR has:
- [x] been tested to ensure log ingestion and log query work.
- [x] added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
- [x] added documentation for new or modified features or behaviors.
/claim #370
Can you look at the CI failures, @TomBebb?
Done
@TomBebb we need to add a few validations and update a few things in this PR:
- max cache duration should be 60 mins
- max number of unique query URLs the server stores should be 10
- results should be cached on disk, not in memory; for this, expose an env var where the user can provide a directory path, and use this path to store the results in parquet with the file name `uuid.parquet`
- the server should refresh the data every minute, deleting the old parquet and writing the new parquet to disk, so that whenever the client does a GET /{uuid} call the server has the preprocessed data and can return it from disk
Do let me know if you need further clarification on the points mentioned here.
Thanks!
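The validations and the refresh-to-disk step above could be sketched roughly as follows. The constants, function names, and the use of raw bytes in place of real parquet serialization are all assumptions for illustration, not the PR's implementation:

```rust
// Hedged sketch of the requested validations and the minute-by-minute refresh.
use std::fs;
use std::path::{Path, PathBuf};
use std::time::Duration;

const MAX_CACHE_DURATION: Duration = Duration::from_secs(60 * 60); // 60-minute cap
const MAX_UNIQUE_QUERIES: usize = 10; // at most 10 stored query urls

fn validate(cache_duration: Duration, current_count: usize) -> Result<(), String> {
    if cache_duration > MAX_CACHE_DURATION {
        return Err("cache duration must be <= 60 minutes".into());
    }
    if current_count >= MAX_UNIQUE_QUERIES {
        return Err("server already stores the maximum of 10 unique query urls".into());
    }
    Ok(())
}

// Called every minute: delete the stale file, then write a fresh `<uuid>.parquet`.
// Real code would serialize record batches with the `parquet` crate; plain bytes
// stand in for that here.
fn refresh_on_disk(cache_dir: &Path, uuid: &str, data: &[u8]) -> std::io::Result<PathBuf> {
    let path = cache_dir.join(format!("{uuid}.parquet"));
    let _ = fs::remove_file(&path); // ignore "not found" on the first write
    fs::write(&path, data)?;
    Ok(path)
}
```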
@nikhilsinhaparseable Cool, working on that now
What should the difference be between hitting the cache duration and waiting a minute?
@TomBebb
The cache duration sets how much data is stored for a particular uuid (e.g. 5 mins worth of data), but this 5 min range is not fixed; it is relative to the current timestamp.
Say I have made a POST /dynamic_query call with "cache-duration":"5m". You will generate a hash for this query, but you will also have to create a separate thread that runs every minute and updates the parquet on disk by fetching the latest 5 mins worth of data.
When a user calls GET /dynamic_query/{uuid} at any point in time, the server should return the latest 5 mins of preprocessed data available on disk.
We should also expose another endpoint, DELETE /dynamic_query/{uuid}, to delete the uuid and the corresponding parquet from disk.
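The rolling-window semantics described above can be sketched in a few lines. The `"<minutes>m"` duration format and the function names are assumptions based on the `"cache-duration":"5m"` example, not confirmed API:

```rust
use std::time::{Duration, SystemTime};

// Parse a duration like "5m"; the "<minutes>m" format is an assumption.
fn parse_cache_duration(s: &str) -> Option<Duration> {
    let mins: u64 = s.strip_suffix('m')?.parse().ok()?;
    Some(Duration::from_secs(mins * 60))
}

// The window is relative to "now": each minute-by-minute refresh re-queries
// the latest `cache_duration` worth of data rather than a fixed range.
fn query_window(now: SystemTime, cache_duration: Duration) -> (SystemTime, SystemTime) {
    (now - cache_duration, now)
}
```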
Implemented using advised notes, please can someone re-review?
@TomBebb below are the review comments:
- the env `DYNAMIC_QUERY_RESULTS_CACHE_PATH_ENV` should be optional; if the user provides it, the dynamic query endpoints work, otherwise they should return an error that the env is not set
- return the uuid as the response first and let a separate thread process the query and write the parquet to disk; the handler should not wait for the query to be processed before returning the uuid
- the query fails when I use an aggregate query like `Select count(*) from app2000 order by p_timestamp` (please test with other aggregate queries as well):
  thread 'actix-rt|system:0|arbiter:1' panicked at server/src/dynamic_query.rs:114:69:
  called `Result::unwrap()` on an `Err` value: Datafusion(SchemaError(FieldNotFound { field: Column { relation: Some(Bare { table: "app2000" }), name: "p_timestamp" }, valid_fields: [Column { relation: None, name: "count(*)" }] }, Some("")))
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
- `BTreeMap<Ulid, DynamicQuery>`: store this on disk in the root path of `DYNAMIC_QUERY_RESULTS_CACHE_PATH_ENV` so that you can load it into memory at server start
I will update if I find anything else.
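The "return the uuid first" review item above could be sketched like this, using a plain `std::thread` in place of the server's actual actix runtime; the function name and the uuid stand-in are hypothetical:

```rust
use std::thread;

// Sketch: the handler responds with the uuid immediately and lets a
// background worker run the query and write `<uuid>.parquet` later.
fn submit_dynamic_query(query: String) -> String {
    // Stand-in for generating a real Ulid/Uuid for the query.
    let uuid = format!("uuid-{}", query.len());
    let id = uuid.clone();
    thread::spawn(move || {
        // Real code would execute `query` here and persist the result
        // to disk under `<id>.parquet`.
        let _ = (id, query);
    });
    uuid // returned before the query is processed
}
```

The point is that the spawn is fire-and-forget from the handler's perspective; the client polls GET /dynamic_query/{uuid} for the preprocessed result.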
@nikhilsinhaparseable I cannot reproduce the aggregate query issue on my commit from before the master merge, but I can in commits since, and on master.
@nikhilsinhaparseable There is a DataFusion byte-serialization crate, `datafusion_proto`, but for now it only supports converting expressions to/from bytes. I can work around that, or just store the raw text query in the parquet file.
@TomBebb please update this thread when the PR is ready for review.