What are the semantics of the `ldes:timestampPath`?
The specification should explicitly define the semantics of `ldes:timestampPath`. What does it represent? I propose using this issue to discuss and clarify this.
Two key possibilities need consideration (both are sketched after the list):
- Does the timestamp represent the ingestion time of the member?
- Does the timestamp correspond to the creation time in the backend system?
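To make the two options concrete, here is a minimal sketch using rdflib; the stream and member IRIs are hypothetical, and only `ldes:timestampPath` itself comes from the spec:

```python
from datetime import datetime, timezone
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, PROV, XSD

LDES = Namespace("https://w3id.org/ldes#")

stream = URIRef("https://example.org/feed")       # hypothetical stream IRI
member = URIRef("https://example.org/dataset/1")  # hypothetical member IRI

# Option 1: the timestamp is the ingestion time, stamped by the LDES server.
ingestion = Graph()
ingestion.add((stream, LDES.timestampPath, PROV.generatedAtTime))
ingestion.add((member, PROV.generatedAtTime,
               Literal(datetime.now(timezone.utc).isoformat(),
                       datatype=XSD.dateTime)))

# Option 2: the timestamp is the creation/modification time in the backend.
backend = Graph()
backend.add((stream, LDES.timestampPath, DCTERMS.modified))
backend.add((member, DCTERMS.modified,
             Literal("2024-03-01T09:00:00Z", datatype=XSD.dateTime)))

print(ingestion.serialize(format="turtle"))
print(backend.serialize(format="turtle"))
```

The two graphs carry the same member; only the property designated by `ldes:timestampPath` (and therefore the ordering of the log) differs.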
The second approach may introduce complications. For example:
- Suppose you want to publish an LDES of the DCAT-AP Feed of Europe, aggregating data from backend systems in different countries (e.g., Belgium, Sweden). Since updates occur at different frequencies (e.g., Sweden updates daily, Belgium weekly), Belgian members from earlier in the week could be added after more recent Swedish members -- leading to out-of-order entries (simulated in the sketch after this list).
- A similar issue arises with an address registry spanning multiple municipalities, where some municipalities process updates slower than others, causing out-of-order arrivals.
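A small, purely illustrative simulation of the first scenario (the source names and update cadences are made up):

```python
from datetime import datetime, timedelta

day = timedelta(days=1)
monday = datetime(2025, 1, 6)

# Hypothetical backend events: (backend creation time, source, member id).
# Sweden publishes daily; Belgium batches the whole week on Friday.
swedish = [(monday + i * day, "SE", f"se-{i}") for i in range(5)]
belgian = [(monday + i * day, "BE", f"be-{i}") for i in range(5)]
belgian_batch_arrival = monday + 4 * day  # Friday

# The aggregator appends members in arrival order.
arrival_log = []
for created, source, mid in swedish:
    arrival_log.append((created, created, source, mid))             # same-day arrival
for created, _, mid in belgian:
    arrival_log.append((belgian_batch_arrival, created, "BE", mid)) # arrives Friday
arrival_log.sort(key=lambda e: e[0])  # order of the resulting LDES log

# If ldes:timestampPath points at the backend creation time, the log is out
# of order: Monday's Belgian member now sits after Thursday's Swedish one.
backend_timestamps = [created for _, created, _, _ in arrival_log]
print(backend_timestamps == sorted(backend_timestamps))  # False
```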
This concern ties into issue #62, which discusses whether members must be added in order.
Additionally, the second approach may not always be feasible. If backend systems do not include a timestamp, the ingestion process would need to generate one.
However, there are valid cases where reusing the backend system’s timestamp is desirable -- particularly when the data already contains meaningful timestamps and comes from a single, well-defined source.
Given these considerations, a more flexible approach might be needed rather than enforcing a strict rule. At the very least, the specification should include a note making publishers aware of these nuances.
Related issues: #10, #34, #35
In the first WG meeting of the LDES trajectory in 2025 we’ll propose a fix for this, clarifying that out-of-order members are indeed impossible based on the timestampPath.
However, I propose not to constrain the timestampPath to the semantics of when the member was added in the event source (although that might be the most common use case), as you may instead want to use the timestamp of when it happened in the real world, provided you can guarantee this will remain in-order (e.g., when I made an observation, rather than when the observation was received by the event source server).
I saw there was indeed a conflict between the semantics in the vocabulary and the spec text. I fixed this in https://github.com/SEMICeu/LinkedDataEventStreams/pull/71/commits/d6b36fe61011d222f2cc60381948f6622b8108da
We use the timestamp as a proxy for ordering the log, which is not ideal; an auto-incrementing log offset, perhaps combined with the commit time, would be easier to work with. However, I don't think it is a good idea to change this at this point, as it would be a backwards-incompatible change.
I will, however, argue that timestamps should always increment from one member to the next in order to maintain strict ordering in the stream. I think this is really important for keeping the memory footprint of the client minimal: imagine an LDES stream that gets created out of an existing data system. The system, say Wikidata, contains terabytes of data. When creating an LDES stream from this existing system, an initial consistent snapshot is made, and the full database is committed at the same time to the LDES; all members have exactly the same timestamp. In this case, as there is no order in the stream, the client needs to keep the full state of which members have been sent out until the whole snapshot is processed.
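To make the memory-footprint argument concrete, a minimal sketch (hypothetical member records, not any particular client implementation) of what a client must persist to resume safely:

```python
def resume_state(processed_members):
    """Compute the checkpoint a client must persist to resume a sync.

    processed_members: iterable of (timestamp, member_id), in stream order.
    With strictly increasing timestamps the checkpoint is just the last
    timestamp; with ties it is the last timestamp plus *all* member ids
    that share it.
    """
    last_ts, ids_at_last_ts = None, set()
    for ts, member_id in processed_members:
        if ts != last_ts:
            last_ts, ids_at_last_ts = ts, set()
        ids_at_last_ts.add(member_id)
    return last_ts, ids_at_last_ts

# A snapshot committed at one instant: every member shares the timestamp,
# so the checkpoint degenerates to the full member list.
snapshot = (("2025-01-06T00:00:00Z", f"member-{i}") for i in range(100_000))
_, pending = resume_state(snapshot)
print(len(pending))  # 100000 -- the whole snapshot must be remembered
```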
I think we are making things very difficult by allowing ambiguous order in the stream, as consistency will be much more difficult to guarantee. An example that would lead to inconsistency is multiple updates to one object at the same time (although nonsensical, the spec would allow it): depending on processing order, different clients would end up with different state.
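And a tiny illustration of that consistency problem (hypothetical updates): with two same-timestamp updates to one entity, the final state depends on processing order.

```python
# Two updates to the same entity, both carrying the same timestamp.
updates = [
    {"id": "obj-1", "ts": "2025-01-06T00:00:00Z", "name": "Alice"},
    {"id": "obj-1", "ts": "2025-01-06T00:00:00Z", "name": "Alicia"},
]

def replay(stream):
    """Materialize entity state from a stream, last write wins."""
    state = {}
    for update in stream:
        state[update["id"]] = update["name"]
    return state

# Nothing in the stream tells a client which order is correct:
print(replay(updates))            # {'obj-1': 'Alicia'}
print(replay(reversed(updates)))  # {'obj-1': 'Alice'}
```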
What do you think? Would there be a big downside to enforcing strict order?
> However, I propose not to constrain the timestampPath to the semantics of when the member was added in the event source (although that might be the most common use case), as you may instead want to use the timestamp of when it happened in the real world, provided you can guarantee this will remain in-order (e.g., when I made an observation, rather than when the observation was received by the event source server).
To me, it also makes sense to avoid restricting it to “added in the event source.” Instead, I would prefer a note that guides spec implementers on how to handle the required in-order timestampPath when they combine multiple data sources (or in any other scenario where in-order arrival cannot be guaranteed upstream). Such a note could raise their awareness of this possibility and help them mitigate the potential issue.
> We use the timestamp as a proxy for ordering the log, which is not ideal; an auto-incrementing log offset, perhaps combined with the commit time, would be easier to work with. However, I don't think it is a good idea to change this at this point, as it would be a backwards-incompatible change.
>
> I will, however, argue that timestamps should always increment from one member to the next in order to maintain strict ordering in the stream. I think this is really important for keeping the memory footprint of the client minimal: imagine an LDES stream that gets created out of an existing data system. The system, say Wikidata, contains terabytes of data. When creating an LDES stream from this existing system, an initial consistent snapshot is made, and the full database is committed at the same time to the LDES; all members have exactly the same timestamp. In this case, as there is no order in the stream, the client needs to keep the full state of which members have been sent out until the whole snapshot is processed.
>
> I think we are making things very difficult by allowing ambiguous order in the stream, as consistency will be much more difficult to guarantee. An example that would lead to inconsistency is multiple updates to one object at the same time (although nonsensical, the spec would allow it): depending on processing order, different clients would end up with different state.
>
> What do you think? Would there be a big downside to enforcing strict order?
We discussed this offline with @pietercolpaert and @ajuvercr and were wondering if the following would make sense for your use case.
First and foremost, we think the use case of terabytes of data on the same timestamp (for Wikidata, for example) can already be solved by using the last-modified time of the pages instead of the commit time on a different server, thus avoiding terabytes of data on the same timestamp altogether.
However, use cases will indeed still exist in which bursts of members share the same timestamp. We believe the solution to this is to spread the members over different fragments, which you then mark as immutable. Only the last fragment, to which new members can still be added, should remain mutable. If you then keep track of the emitted members for the current timestamp in the mutable fragments only, that list should be rather limited. This does, however, add the extra requirement that for a certain EventSource, you can publish a member only once.
Therefore, we think we don't need incremental counters to solve this problem, on the condition that the requirement that each member occurs only once is added to EventSource.
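A rough sketch of that bookkeeping (the class and fragment names are hypothetical, not an agreed API): the client remembers member IDs only for fragments that are still mutable, and forgets a fragment's IDs the moment it is marked immutable.

```python
class ClientState:
    """Resume state under the fragment-based proposal (illustrative only).

    Assumes each member appears at most once per EventSource, so membership
    in `seen` suffices to deduplicate re-reads of a mutable fragment.
    """

    def __init__(self):
        self.seen = {}  # fragment_url -> set of member ids seen there

    def process(self, fragment_url, members, immutable):
        known = self.seen.setdefault(fragment_url, set())
        new = [m for m in members if m not in known]
        known.update(new)
        if immutable:
            # An immutable fragment can never gain members: forget its ids.
            del self.seen[fragment_url]
        return new  # members to hand to the application

state = ClientState()
state.process("page-1", ["a", "b"], immutable=True)   # ids dropped right away
pending = state.process("page-2", ["c"], immutable=False)
print(pending, state.seen)  # ['c'] {'page-2': {'c'}}
```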
> First and foremost, we think the use case of terabytes of data on the same timestamp (for Wikidata, for example) can already be solved by using the last-modified time of the pages instead of the commit time on a different server, thus avoiding terabytes of data on the same timestamp altogether.
Yes, we can find a workaround, but in this case it implies a dependency on the domain. I want to find a solution that works without having to rely on domain-specific semantics or write code for a particular case. Just consider the Wikidata example as a placeholder for an enormous graph in a triplestore: I should be able to write a plugin for a triplestore that serializes a named graph as LDES without having to know what kind of triples are in the graph. This is also why I don't like the 'path' (timestamp/...) constructs; they introduce an unnecessary dependency on the domain. From my point of view the real problem we are solving is: how can we serialize a graph (or its mutations) as LDES in an unambiguous way? From a server perspective offering explicit order is cheap, but it makes writing clients so much simpler. With explicit order, the answer to the question 'where am I in processing the stream?' is a very straightforward one, and no buffering is required. I think we are making things very complicated by allowing for non-ordered streams.
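For contrast, a sketch assuming a strictly increasing explicit offset (a hypothetical construct, not part of the current spec): the resume state collapses to a single integer, with no buffering or deduplication.

```python
def handle(member):
    print("processing", member)

def consume(stream, last_offset):
    """Resume a strictly ordered stream from a single integer checkpoint.

    stream: iterable of (offset, member) with strictly increasing offsets.
    """
    for offset, member in stream:
        if offset <= last_offset:
            continue  # already processed on a previous run
        handle(member)
        last_offset = offset  # the checkpoint is just this one number
    return last_offset

log = [(1, "a"), (2, "b"), (3, "c")]
checkpoint = consume(log, last_offset=0)            # processes a, b, c
checkpoint = consume(log + [(4, "d")], checkpoint)  # processes only d
```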
If I understand you correctly @sandervd, the main use case is: when you replicate a very large LDES, you would like to know where you are in the stream. And you suggest using timestamps/... so the client can store this timestamp/... as the only data needed when resuming the client later on. We now suggest a way to not use any timestamp/..., as this is not domain-independent. When your data doesn't have a timestamp, it is not required to add an ingest counter, in my opinion.
Is the data you refer to too large to store in a B-tree fragmentation? I think with this fragmentation only log(n) fragments are mutable, and the client should only store the IDs of the members in these fragments, plus the relations found starting from these fragments. This information is your location in the stream.
I don't like adding awkward increment triples to entities, as they change the entities, which is a problem when you, for example, merge LDESes together :/
See #76, which has been closed!