Search taking a lot of time using the clg binary
Bug
We are using CLP to compress the JSON-format logs generated by our Kubernetes cluster. A sample log is given below:
{ "log_time": "2023-08-29T13:55:09.477456Z", "stream": "stdout", "time": "2023-08-29T13:55:09.477456564Z", "@timestamp": "2023-08-29T19:25:09.477+05:30", "@version": "1", "message": " Method: POST;Root=1-64edf8bd-5c762a676349ee71616bb687 , Request Body : {"orders":[{"order_type":"normal","external_reference_id":"69426","items":[{"offset_in_minutes":"721","quantity":"1","external_product_id":"225090"}]}]}", "level": "INFO", "level_value": 20000, "request_id": "6f3f3651-a22b-42a0-b5fe-412d2167c5ca", "kubernetes_docker_id": "caa9102a169a1495e5790cb2c17cb21d0a279ffc50d802d413938870ba59c7c0", } When we are using the clg binary to search through the generated archive using various search queries, It is taking a lot of time to process each query (around 25-30s on average). We only search for request_id and namespace name as mentioned below. It is not feasible for us if the search takes so much time for each archive. Ideally, for one archive 4-5s is the expected search time by us. For example, To search for the request_id in the above log, we generally use the following queries
- info6f3f3651-a22b-42a0-b5fe-412d2167c5ca*
- 6f3f3651-a22b-42a0-b5fe-412d2167c5ca

Our archives are about 40 MB on average; the sizes of an archive's internal files and folders are as follows:
- var.dict : 18.2MB
- /s: 16.8MB
- var.segindex: 2.6MB
- metadata.db: 1.9MB
- logtype.dict: 736KB
- logtype.segindex: 36KB
- metadata: 4KB
Is this the correct way to write search queries, i.e., will they use the logtype and other dictionaries to search the archives efficiently? As mentioned earlier, searching a single archive takes more than 30 s, which is infeasible for us; we expect around 4-5 s per archive. Please let me know if I am doing anything wrong, e.g., whether the search queries themselves are inefficient.
CLP version
3a20c0d2bb831de7fa267d57d187dab8c3f092c1
Environment
Ubuntu 20.04; EC2 instance type: m5.8xlarge
Reproduction steps
NA
Hi @bb-rajakarthik,
Sorry for the delayed response.
I assume your JSON log events are printed with one event per line? Either way, I think CLP is having trouble parsing the timestamps from your events, which means it will store those strings in the variable dictionary. You can verify this by using make-dictionaries-readable to create human-readable versions of the dictionaries (which you can inspect). Storing the timestamps in the dictionary would lead to quite a large variable dictionary, which in turn would be slow to load and slow to search.
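If it helps, here's a rough sketch (in Python) for counting how many entries in the human-readable variable dictionary look like timestamps. It assumes the dump lists one entry per line; adjust the path and parsing to match make-dictionaries-readable's actual output:

```python
import re
import sys

# Rough sketch: count timestamp-like entries in a human-readable variable
# dictionary dump. Assumes one dictionary entry per line; the actual output
# format of make-dictionaries-readable may differ, so adjust accordingly.
ISO_TS = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")

def count_timestamp_entries(path):
    total = 0
    matches = 0
    with open(path, "r", errors="replace") as f:
        for line in f:
            total += 1
            if ISO_TS.search(line):
                matches += 1
    return matches, total

if __name__ == "__main__":
    matches, total = count_timestamp_entries(sys.argv[1])
    print(f"{matches}/{total} entries look like ISO timestamps")
```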
CLP currently doesn't have good support for parsing timestamps spread across a message (we are improving our schema-based parser to handle these cases). In unstructured log events, the timestamp typically appears at a relatively fixed location, so it's easy to parse; in JSON events, however, it can move around, and there can be multiple timestamps per event.
One potential solution for your case might be to preprocess the events to convert the timestamps to epoch timestamps; that way CLP could at least encode the timestamps as numbers rather than strings.
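For example, here's a minimal preprocessing sketch, assuming one JSON event per line and the timestamp field names from your sample (log_time, time, @timestamp). This isn't part of CLP, just a standalone filter you'd run before compression:

```python
import json
import sys
from datetime import datetime

# Hypothetical list of timestamp fields, based on the sample event above.
TIMESTAMP_FIELDS = ("log_time", "time", "@timestamp")

def to_epoch_millis(value):
    # datetime.fromisoformat handles the trailing "Z" and UTC offsets on
    # Python 3.11+; fractional digits beyond microseconds are truncated.
    return int(datetime.fromisoformat(value).timestamp() * 1000)

# Read one JSON event per line from stdin and write the converted events to
# stdout; compress the output with clp as usual.
for line in sys.stdin:
    event = json.loads(line)
    for field in TIMESTAMP_FIELDS:
        if field in event:
            event[field] = to_epoch_millis(event[field])
    print(json.dumps(event))
```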
In the future, we hope to open-source a tool specifically designed for compressing and searching JSON log events.
With respect to your queries, including delimiters around the strings you search for might improve performance. E.g., for query 1, you could try: *"INFO"*"6f3f3651-a22b-42a0-b5fe-412d2167c5ca"*. Without the delimiters, the wildcards are directly connected to the tokens, so CLP has to scan the dictionaries looking for substring matches rather than performing a table lookup.
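To illustrate why (this is a toy model, not CLP's actual implementation): think of the variable dictionary as a hash map from each distinct value to an ID. A fully delimited token can be resolved with a single lookup, while a token glued to a wildcard forces a scan over every entry:

```python
# Toy model (not CLP's actual implementation): the variable dictionary maps
# each distinct variable value to an ID.
var_dict = {"6f3f3651-a22b-42a0-b5fe-412d2167c5ca": 0, "64edf8bd": 1}

def exact_lookup(token):
    # A fully delimited token resolves with one hash lookup.
    return var_dict.get(token)

def substring_scan(fragment):
    # A token glued to a wildcard (e.g., info6f3f3651...*) forces a scan
    # over every dictionary entry, checking for substring matches.
    return [entry_id for value, entry_id in var_dict.items() if fragment in value]

print(exact_lookup("6f3f3651-a22b-42a0-b5fe-412d2167c5ca"))  # one lookup
print(substring_scan("6f3f3651"))                            # scans all entries
```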
Hope that helps.
Hi @kirkrodrigues, thanks for taking the time to respond and give me a detailed answer. Based on your answer to my question: if my log message is as follows, where all timestamp fields have been removed from the JSON log,

{ "stream": "stdout", "@version": "1", "message": " Method: POST;Root=1-64edf8bd-5c762a676349ee71616bb687 , Request Body : {\"orders\":[{\"order_type\":\"normal\",\"external_reference_id\":\"69426\",\"items\":[{\"offset_in_minutes\":\"721\",\"quantity\":\"1\",\"external_product_id\":\"225090\"}]}]}", "level": "INFO", "level_value": 20000, "request_id": "6f3f3651-a22b-42a0-b5fe-412d2167c5ca", "kubernetes_docker_id": "caa9102a169a1495e5790cb2c17cb21d0a279ffc50d802d413938870ba59c7c0", }

then search will definitely be faster, since the size of the variable dictionary is reduced significantly. Is my understanding correct?

Also, if I decide to filter out specific fixed fields from all of my JSON logs, will the time taken to search the archives be reduced? According to my understanding of CLP, if all JSON logs have a similar structure, the compression ratio should also increase. Is my understanding correct, @kirkrodrigues?
> If my log message is as follows ... where all timestamp fields have been removed from the JSON log, then search will definitely be faster, since the size of the variable dictionary is reduced significantly. Is my understanding correct?
Yes, that should improve search performance.
> Also, if I decide to filter out specific fixed fields from all of my JSON logs, will the time taken to search the archives be reduced?
To some extent, yes, just because CLP needs to scan less data. The size of the improvement will depend on how much of the logs are occupied by those fields.
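If you go that route, a filter similar to the timestamp-conversion sketch above would work; the field names here are just examples of fixed fields you might not need to search:

```python
import json
import sys

# Example fields to drop before compression; pick whichever fixed fields
# you don't need to search. These names are taken from the sample event.
DROP_FIELDS = ("kubernetes_docker_id", "level_value")

for line in sys.stdin:
    event = json.loads(line)
    for field in DROP_FIELDS:
        event.pop(field, None)
    print(json.dumps(event))
```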
> According to my understanding of CLP, if all JSON logs have a similar structure, the compression ratio should also increase.
Yes, if we compare
1. compressing JSON logs with a similar structure, to
2. compressing JSON logs with many different structures,

the compression ratio should be better for (1). The improvement will depend on how many different structures there are in total and, again, how much of the logs those structures occupy.
Hope that helps.
Hi @bb-rajakarthik,
We now have a distributed version of CLP (clp-json) specifically designed for compressing and searching JSON logs; we also have a single-threaded version, clp-s, which I think you've tried. Both should improve the performance of the searches you mentioned.
Let us know if you notice any more bottlenecks, otherwise I will close this issue.