Inquiry and Suggestion Regarding Time Partition Validation Against Server Time
Dear Development Team,
I hope this message finds you well. I am reaching out to discuss a particular aspect of the validate_time_partition function that has recently come to my attention. This function plays a critical role in validating the time partitioning of incoming data against the server's current time. The relevant code snippet for this functionality is as follows:
```rust
if parsed_timestamp.date() == Utc::now().naive_utc().date()
    && parsed_timestamp.hour() == Utc::now().naive_utc().hour()
    && parsed_timestamp.minute() == Utc::now().naive_utc().minute()
{
    Ok(true)
} else {
    Err(anyhow!(format!(
        "field {} and server time are not same",
        time_partition.unwrap()
    )))
}
```
The rationale behind this stringent time comparison intrigues me, especially in the context of a recent endeavor I embarked on. My goal is to incorporate the date into stream names and bulk-insert historical data accordingly. The introduction of time partitioning seemed like a beacon of possibility for this project. However, I encountered an unexpected challenge due to the function's requirement for the exact alignment of data timestamps with the server's current time.
Understanding the importance of this validation process is crucial for me. If it serves a specific purpose or security measure, I would greatly appreciate an explanation to better comprehend its necessity. On the other hand, if this strict comparison is not integral to the system's functionality or integrity, might I suggest either removing this constraint or introducing a toggleable mode? Such a modification could enhance flexibility, especially for scenarios like database migrations, where aligning every data point with the exact server time is not feasible.
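For illustration only, a toggleable check might look roughly like the sketch below. This is a minimal, hypothetical example: the `skip_server_time_check` flag (and how it would be set, e.g. via a header or configuration) is an assumption on my part, not part of the current codebase.

```rust
use anyhow::anyhow;
use chrono::{NaiveDateTime, Timelike, Utc};

/// Hypothetical sketch: only enforce the strict server-time comparison when
/// the toggle is off, so historical data can be ingested during migrations.
fn validate_time_partition(
    parsed_timestamp: NaiveDateTime,
    time_partition: Option<&str>,
    skip_server_time_check: bool, // hypothetical toggle, e.g. from a header or config
) -> anyhow::Result<bool> {
    if skip_server_time_check {
        // Accept any successfully parsed timestamp, e.g. for bulk-inserting old data.
        return Ok(true);
    }

    let now = Utc::now().naive_utc();
    if parsed_timestamp.date() == now.date()
        && parsed_timestamp.hour() == now.hour()
        && parsed_timestamp.minute() == now.minute()
    {
        Ok(true)
    } else {
        Err(anyhow!(
            "field {} and server time are not same",
            time_partition.unwrap_or("time partition")
        ))
    }
}
```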
The potential for this feature to significantly impact the parsability and migration of existing databases is substantial. It could open up new avenues for efficiently managing historical data, which is paramount for organizations looking to evolve their data infrastructure.
In light of this, I kindly ask for your consideration of this matter. An option to adjust the time validation requirement would be immensely beneficial for developers facing similar challenges. Your support in this regard would not only facilitate our current project but also enrich the tool's adaptability for a wide range of use cases.
Thank you very much for your time and understanding. I look forward to your feedback and am open to discussing this further if needed.
Warm regards,
Thanks for the detailed info about your question @mrchypark. We did try this without the check that you're referring to, but we found issues while testing. I'll explain the challenges in detail in a separate comment.
This validation will cause significant problems during log collection. For example, if Fluent Bit is configured to report logs once per second, logs generated in the last second of the previous minute will arrive at the 0th second of the next minute and fail the time validation. In an extreme case, if logs are uploaded once per minute, all log timestamps will fail validation under this rule.
Hi @mrchypark and @sp0cket
Thanks for your valuable comments on this topic. We had an internal discussion and came up with the solution below; please share your feedback on it -
- An optional header X-P-Time-Partition needs to be sent in the log creation API to enable time partitioning
- If provided, all the events in the batch will be validated at ingestion time as follows (see the sketch after this list):
  a. if the time-partition field does not exist in any of the events in the batch, reject the whole batch
  b. if the time-partition field in any of the events in the batch cannot be parsed as a valid datetime (e.g. 2024-04-04T00:41:00.000Z), reject the whole batch
  c. if the time-partition field in any of the events in the batch is more than a month older than the server timestamp, reject the whole batch
- Once all these validations pass, we can start the ingestion
- Since events can arrive out of order, each event in the batch has to be ingested individually and uploaded to storage under a prefix derived from its time-partition field
- Hence, if the time-partition header is provided, the prefix will be created based on the time-partition field in the event; otherwise the prefix will be created based on the server time
- We can also allow custom partitioning on fields other than the time field, via another header X-P-Custom-Partition, which can contain comma-separated field names from the log event
- The custom partition fields can be of integer or string type (with values no longer than 7-10 characters, just to keep the prefixes as short as possible)
- The server cannot validate the data sent for these fields; the user has to select the fields wisely (e.g. do not partition on id fields, but rather choose fields with a small number of distinct values in the application, such as status_code, os, level, etc.)
- We can allow a maximum of 3 such fields to be provided in the custom partition header
- Since no ordering validation can be done for out-of-order events, multiple parquet files can be created under the same prefix, which can reduce query performance because of possible sorting issues between the parquet files
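To make the proposed batch-level checks (a)-(c) concrete, here is a rough sketch of how they could be applied before ingestion. The function name, field access, and fixed 30-day limit are illustrative assumptions, not the actual implementation.

```rust
use anyhow::{anyhow, Result};
use chrono::{DateTime, Duration, Utc};
use serde_json::Value;

/// Illustrative sketch of the proposed batch validation: reject the whole
/// batch if any event is missing the time-partition field, has an
/// unparseable datetime, or is older than the allowed limit.
fn validate_batch(events: &[Value], time_partition: &str) -> Result<()> {
    let oldest_allowed = Utc::now() - Duration::days(30);
    for event in events {
        // (a) the time-partition field must exist in every event
        let raw = event
            .get(time_partition)
            .and_then(Value::as_str)
            .ok_or_else(|| anyhow!("missing field {} in event", time_partition))?;
        // (b) the field must parse as a valid datetime, e.g. 2024-04-04T00:41:00.000Z
        let ts: DateTime<Utc> = raw
            .parse()
            .map_err(|_| anyhow!("field {} is not a valid datetime", time_partition))?;
        // (c) the field must not be more than a month older than the server time
        if ts < oldest_allowed {
            return Err(anyhow!("field {} is older than 30 days", time_partition));
        }
    }
    Ok(())
}
```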
Example with custom partition - If X-P-Custom-Partition = Os, StatusCode
Events will be partitioned with below prefixes -
date=2024-04-30\hr=05\minute=03\Os=Linux\StatusCode=200\domain.port.parquet
date=2024-04-30\hr=05\minute=03\Os=Linux\StatusCode=401\domain.port.parquet
date=2024-04-30\hr=05\minute=03\Os=Windows\StatusCode=500\domain.port.parquet
date=2024-04-30\hr=05\minute=03\Os=Windows\StatusCode=400\domain.port.parquet
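To illustrate how such prefixes could be derived from an event, here is a rough sketch. The function, the '/' separator, and the handling of non-string values are assumptions for illustration; the actual prefix layout is up to the implementation.

```rust
use chrono::{DateTime, Timelike, Utc};
use serde_json::Value;

/// Hypothetical sketch: build a storage prefix from the event's time-partition
/// value plus any custom partition fields, producing something like
/// "date=2024-04-30/hr=05/minute=03/Os=Linux/StatusCode=200".
fn event_prefix(event: &Value, event_time: DateTime<Utc>, custom_fields: &[&str]) -> String {
    let mut prefix = format!(
        "date={}/hr={:02}/minute={:02}",
        event_time.date_naive(),
        event_time.hour(),
        event_time.minute()
    );
    for field in custom_fields {
        // Fields the user listed in X-P-Custom-Partition, e.g. Os, StatusCode.
        if let Some(value) = event.get(field) {
            // Strip surrounding quotes from JSON string values; numbers pass through.
            let raw = value.to_string();
            prefix.push_str(&format!("/{}={}", field, raw.trim_matches('"')));
        }
    }
    prefix
}
```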
Hi, thank you for sharing your proposal.
There are two main reasons why user-provided time partitioning is necessary for me:
- Migration of existing data
- Reliable storage for data collection that may experience delays
I provide solutions involving IoT devices in remote environments, where data transmission delays can range from a few minutes to several days. During such delays, edge devices store the data and then transmit all accumulated data once connectivity issues are resolved.
Regarding the second scenario, I find that a one-month time constraint could be appropriate depending on the situation.
While the first scenario might be a bit challenging, it seems feasible to move forward.
I also appreciate the addition of custom partitioning. Congratulations on completing the distributed version. I am a fan of the parseable approach. I look forward to more great projects in the future.
Our situation is similar to @mrchypark's, with data migration and latency issues as well. In our scenario, data migration is primarily used to collect offline vehicle fault logs (similar to a black box on an airplane) and aggregate them into a central database. The one-month limitation is a significant improvement, but I'm not sure whether there might be some exceptional cases. I'm wondering if there's a possibility of providing some compatibility for relatively old data, even at the expense of performance, to address these extreme scenarios.
@sp0cket we can make the one-month time constraint configurable. The user can provide an additional header X-P-Time-Partition-Limit at the time of log creation, whose value is expected to be an unsigned integer ending in 'd', e.g. 90d (for 90 days). If not provided, the default constraint of 30 days will apply; otherwise the user can use this header to allow logs older than 30 days to be ingested as well.
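A rough sketch of how such a header value could be parsed is below. The helper name and error messages are hypothetical, not the actual implementation.

```rust
use anyhow::{anyhow, Result};

/// Hypothetical sketch: parse an X-P-Time-Partition-Limit value such as "90d"
/// into a number of days, falling back to the 30-day default when absent.
fn parse_time_partition_limit(header: Option<&str>) -> Result<u64> {
    match header {
        None => Ok(30), // default: one month
        Some(value) => {
            let days = value
                .strip_suffix('d')
                .ok_or_else(|| anyhow!("limit must end with 'd', e.g. 90d"))?
                .parse::<u64>()
                .map_err(|_| anyhow!("limit must be an unsigned integer, e.g. 90d"))?;
            Ok(days)
        }
    }
}
```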