
Additional THRIFT-Format without redundant Strings

Open arne-bdt opened this issue 1 year ago • 4 comments

Version

5.0.0-SNAPSHOT

Feature

The current THRIFT format is great, but in practice there are many duplicate strings that unnecessarily bloat the serialised data. Idea: extend the format with a duplicate-free list of all occurring strings, and in the remaining fields refer to a string only by its index in that list (an i32). For entire graphs, there would be exactly one list of strings. With the stream-based formats for RDF and results, there would be one such list per row, which the consumer builds up into a complete list, as each row only contains the strings that are new. Apart from this change, the format would otherwise be identical.
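A minimal sketch of the writer side of this idea, purely illustrative (the class and method names are made up; the actual branch works with the Thrift-generated classes):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical writer-side string table: every string seen so far maps to an i32 index;
// each row carries only the strings that are new in that row, plus indices instead of strings.
class StringTableEncoder {
    private final Map<String, Integer> index = new HashMap<>();

    // Returns the index for s; if s is new, it is appended to the row's list of new strings.
    int encode(String s, List<String> newStringsForThisRow) {
        Integer i = index.get(s);
        if (i != null) return i;
        int next = index.size();
        index.put(s, next);
        newStringsForThisRow.add(s);
        return next;
    }
}
// Per row: List<String> newStrings = new ArrayList<>();
//          int sIdx = encoder.encode(subject, newStrings);
//          int pIdx = encoder.encode(predicate, newStrings);
//          int oIdx = encoder.encode(object, newStrings);
//          // serialise (newStrings, sIdx, pIdx, oIdx) as the Thrift row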

Result: I serialised some graphs with THRIFT and the proposed new format:

  • cheeses-0.1.ttl: 1.14 MB (THRIFT) -> 0.29 MB (new format)
  • pizza.owl.rdf: 0.30 MB (THRIFT) -> 0.08 MB (new format)
  • bsbm-1m.nt.gz: 256.29 MB (THRIFT) -> 82.24 MB (new format)

(all without additional compression)

The speed of serialisation and deserialisation in memory hardly differs between both THRIFT-based formats.

Now one could argue that with LZ4 compression applied, you get practically the same result (number of bytes and speed) without an additional format. For whole graphs and datasets, that seems to be true.

With streaming, classic compression is of little use because it can only compress the character strings that occur in the current row. The previously transmitted character strings are lost for compression. --> This problem is probably ideally addressed by the proposed new format. --> This way, the proposed new format could lead to a massive reduction in network traffic when transporting query results.

The basic idea should also be applicable to the protobuf format.

What do you think?

For lack of a better idea, I chose the name THRIFT2. A first rough implementation of the idea can be found in my branch. I also implemented JMH benchmarks for the serialization and deserialization of graphs in different formats and with optional compression. I have not yet made any effort to test, with measurements, my theory that query results might benefit the most from the new format.

Are you interested in contributing a solution yourself?

Yes

arne-bdt, Feb 04 '24 00:02

What do you think?

Some initial thoughts - I haven't dived deep into the new code yet.

Improvement in results formats is the most interesting part here. The results case is interesting in that it is, typically, create-once, read-once so writing costs matter.

As you note, gzip gives a lot of benefit for the N-Triples-like formats (x8-x10 compression). And for publishing data, sticking to standards-based formats matters.

RAM space

All language parsers go through a String-to-IRI cache (ParserProfileStd; it is a per-parser-run cache). This reduces the memory costs: it saves about 1/3 of the RAM needed to store a large in-memory dataset. SPARQL result sets do not go through such a cache.

We could have a JVM-wide term cache (it would need to be multithreading-safe), shared across parser runs and with the receiving of results.
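For illustration, a sketch of what such a shared cache could look like using Caffeine (which Jena already uses elsewhere); the class, the size and the scope here are assumptions, not an existing Jena API:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import org.apache.jena.graph.Node;
import org.apache.jena.graph.NodeFactory;

// Hypothetical JVM-wide, thread-safe IRI term cache shared by parsers and result readers.
final class JvmTermCache {
    private static final Cache<String, Node> IRIS =
        Caffeine.newBuilder().maximumSize(100_000).build();   // size would need tuning

    static Node uri(String iriStr) {
        // Caffeine handles concurrent access; the mapping function runs at most once per key.
        return IRIS.get(iriStr, NodeFactory::createURI);
    }
}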

Results

So I think the improvement metric should be "faster", and whether size is a factor needs to be determined. It probably is a factor across the web. The size impact is made complicated because HTTP speeds over the public internet, over average corporate networks and within AWS/Azure/GCP are very different. The latter can be incredibly fast.

It'll be good to have a really good, non-standards results format.

BSBM data

bsbm-1m.nt.gz

My copy of bsbm-1m in RDF-THRIFT is 254M uncompressed and 28M compressed.

On my machine, for RDF-THRIFT: the disk is an NVMe SSD, although the test data is probably in the OS disk cache. I don't believe that I/O is a significant factor here, although it is a factor; the bandwidth of SSDs is really quite good.

bsbm-25m.rt : 6.3G
bsbm-25m.rt.gz : 684M
$ riot --time --sink bsbm-25m.rt
bsbm-25m.rt     : 18.86 sec : 24,997,044 Triples : 1,325,329.73 per second
$ riot --time --sink bsbm-25m.rt.gz
bsbm-25m.rt.gz  : 24.02 sec : 24,997,044 Triples : 1,040,719.60 per second

x8-x10 is what I expect when compressing N-triples. BSBM may be different in that it has a significant amount of longer strings randomly generated (IIRC).

At those speeds, parsing isn't a factor in data loading into TDB2. The TDB2 parallel loader does the parsing on a separate thread, passes data between threads in chunks of 100k triples, and has a queue of 10 blocks. The bottleneck becomes waiting for the loader to consume the triples.
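Not the actual TDB2 loader code, but a sketch of the hand-off pattern described above (the chunk size and queue depth are the figures mentioned; everything else is simplified):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import org.apache.jena.graph.Triple;

// The parser thread fills chunks of 100k triples and puts them on a bounded queue of 10 blocks;
// the loader thread drains the queue. put() blocks when the loader falls behind (back-pressure).
class ChunkedHandoff {
    static final int CHUNK_SIZE = 100_000;
    final BlockingQueue<List<Triple>> queue = new ArrayBlockingQueue<>(10);
    private List<Triple> current = new ArrayList<>(CHUNK_SIZE);

    void onTriple(Triple t) throws InterruptedException {
        current.add(t);
        if (current.size() == CHUNK_SIZE) {
            queue.put(current);
            current = new ArrayList<>(CHUNK_SIZE);
        }
    }
}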

afs, Feb 04 '24 17:02

So I looked at this problem extensively around 10 years ago and developed something we called Binary Tuple Format (BTF). This was all proprietary $dayjob work and it never went beyond the research stage, for several reasons which I'll go into in depth in this response. So there is no code and there are no reports I can share, but I can talk about it in general terms.

TL;DR Version:

  • Using GZip compression with formats like NTriples/RDF-Thrift for graphs, and SPARQL-TSV for results, generally offered the best overall performance (see discussion for what that means)
  • Specialised formats can offer some performance gains on specific metrics, e.g. data size, but this tends to not be that relevant when you look at overall performance

Binary Tuple Format

Binary Tuple Format (BTF) was an effort to address the problem of how we could communicate results faster between the server and the client. This was work at Cray, so we had some specific additional constraints that we were looking to take account of:

  1. The server had vastly more memory than the client(s)
  2. The server could parallelise the results output because we didn't send the results directly over the network. Rather, the server wrote a file to the parallel file system and merely communicated the metadata (the filename, result count, size, etc.) over the wire to the client.
  3. The client was often resource constrained (particularly for memory) relative to the server so any format had to allow for that

Design-wise, the format consisted of a global header (magic bytes, metadata, data type (triples, quads or results), result column names if applicable, prefix mappings, etc.), followed by one or more data blocks. Each data block consists of a block header (block type, plus block-type-specific header fields) followed by the actual data. This gave us a framework in which we could experiment extensively with different compression schemes and data layouts.
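To make the layout concrete, here is a rough outline of that container structure as Java records; these are reconstructions from the description above, not the actual (proprietary) BTF definitions, and every field name is an assumption:

import java.util.List;
import java.util.Map;

// Global header followed by one or more self-describing blocks.
record GlobalHeader(byte[] magic, String dataType,            // triples, quads or results
                    List<String> resultColumns,               // if applicable
                    Map<String, String> prefixMappings) {}
record BlockHeader(int blockType, byte[] typeSpecificFields) {}
record Block(BlockHeader header, byte[] data) {}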

Since an output file consisted of multiple blocks, we could generate these in parallel and then write the blocks out in any sequence we wanted (except for ordered SPARQL results, where order needed preserving).

We implemented a number of different block types, from what we called Tabular Blocks, which were basically uncompressed NTriples/SPARQL TSV, through to Compressed Dictionary Blocks, which used different encoding and compression techniques.

Quantifying Overall Performance

How you quantify performance becomes a key part of the problem. It's easy to focus on a single metric, e.g. overall data size, but we found through extensive comparisons between BTF, all the standard formats supported by Jena, and HDT that it wasn't the whole story.

Firstly data is never just written, it always needs to be read at some point as well. So at a bare minimum you need to consider the whole round trip i.e. both serialisation and parsing time.

Secondly there is also a computational cost (CPU and memory) associated with each data format. Applying any kind of compression is trading CPU time plus memory consumption for output size. Depending on the system the overhead may actually outweigh the benefit.

Aside - not specific to BTF but with other well known distributed compute frameworks we were running on our systems at the time it was actually a benefit to disable their default network compression behaviour. Our network interconnect was so good it wasn't worth trading the CPU time for smaller data transfers, it was faster just to send the uncompressed data. And as @afs notes networks have evolved hugely over the intervening years, that's not to say that there aren't cases where the network is the bottleneck, even in cloud/HPC environments, but that it isn't the only thing to consider these days.

Which is to say that any judgement of improved performance has to be holistic, it's not enough to just say that your new format consistently gives smaller file sizes if it does so at the cost of increased processing time and/or memory consumption. You MUST evaluate all those factors together somehow, e.g. weighted scoring, and I'm not sure we ever came up with a good metric for that.

Standards Matter

[XKCD "Standards" cartoon]

The RDF/Semantic Web community already has a problem IMO with over-proliferation of standards. For any new format to be broadly useful it has to be adopted across the wider community. If it's only in Jena it only benefits users whose entire workflow/lifecycle can be encompassed in Jena. As soon as they go outside of Jena they lose the benefits.

This might be fine for your specific use case, but if it doesn't benefit the majority of users are we actually adding sufficient value to justify it?

In most real-world use cases I've seen Jena is rarely involved in 100% of the workflow. Typically Jena's involvement terminates at Fuseki as a SPARQL endpoint and then other parts of the business use JavaScript, Python etc to implement UIs, data science etc on top of the SPARQL results.

Compressed Dictionary Blocks

Going back to BTF, the data block type that ended up giving the best overall performance (with the caveats on that term I described earlier) was what we called a Compressed Dictionary Block. This block type consisted of a fixed-size dictionary that was compressed with Gzip, followed by a series of rows. Each row is a fixed number of bytes, with N bytes for each column, where the value of N depends on the size of the dictionary in use. The value for each column is the offset of the associated RDF term in the block's dictionary. We assigned some special offsets to have special meanings, e.g. 0 means no value (particularly relevant for SPARQL results), 1 means repeat the term from the same column in the previous row, etc.

From experimentation we found 16MB to be the optimal maximum dictionary size as that meant N was 4 bytes so each row would be 4M bytes where M is the number of columns in each row (3 for triples, 4 for quads, arbitrary for SPARQL results).

Having a fixed maximum dictionary size had a couple of benefits:

  • We could populate multiple blocks of data in parallel
  • It limits how much data any client needs to process at once if it stream-processes the results. At most a client must hold the 16MB dictionary in memory and then read an additional 4×M bytes for each row.

Since rows are fixed-width, you can also achieve random access within the compressed file by reading very small portions of the global and block headers and then seeking to the position of any given tuple in the file.
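A sketch of reading such a block, based only on the description above (the special offsets, the 4-byte width and the seek arithmetic); names and signatures are invented, and this is not the original BTF code:

import java.nio.ByteBuffer;
import java.util.function.IntFunction;

final class DictionaryBlockReader {
    static final int NO_VALUE = 0;          // e.g. an unbound SPARQL variable
    static final int REPEAT_PREVIOUS = 1;   // same term as this column in the previous row

    // Decode one fixed-width row of M columns; 'dictionary' resolves a byte offset to its term.
    static String[] decodeRow(ByteBuffer rows, String[] previousRow,
                              IntFunction<String> dictionary, int columns) {
        String[] row = new String[columns];
        for (int c = 0; c < columns; c++) {
            int offset = rows.getInt();                        // fixed 4 bytes per column
            if (offset == NO_VALUE)             row[c] = null;
            else if (offset == REPEAT_PREVIOUS) row[c] = previousRow[c];
            else                                row[c] = dictionary.apply(offset);
        }
        return row;
    }

    // Fixed-width rows make random access simple: the byte position of row i in the row area.
    static long rowPosition(long rowAreaStart, int columns, long i) {
        return rowAreaStart + i * 4L * columns;
    }
}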

Advanced Dictionaries

We also experimented with a lot of tricks around the building of the dictionary though none of those ended up yielding that much benefit.

For example, in an early iteration we used variable-width offset encoding to reference the dictionary, i.e. the rows contain 1 to N bytes for each offset. This was combined with a form of Huffman coding for the dictionary, so that the most common terms would be given lower offsets in the dictionary and thus have smaller references in the row portion of the block.

In principle this meant that the rows portion of the block should be smaller. In practice this proved not to be the case, because with a variable-width encoding you end up using more bytes overall: the terms at the end of the dictionary end up needing 5 bytes to encode (assuming a 16MB dictionary) due to how variable-width encodings work.

Also, having variable-width rows adds some complexity to the row-reading logic, the cost of which can end up outweighing the benefit of a sometimes marginally smaller file size. Plus, if you want to allow random access to an arbitrary row within the result set, you now have to manually read through much of the file to do this, because you don't know how many bytes to skip in advance. We ended up throwing out all usage of variable-width encodings as a result.

Performance Comparisons

As noted earlier, we did extensive comparisons of BTF against all the built-in Jena formats (for RDF graphs, datasets and SPARQL results) and some other academic research such as HDT.

The end result was that BTF was about the same size as the equivalent Gzipped NTriples/SPARQL TSV, BUT it was slower to both read and write, so overall staying with the standard formats proved to be better. I would also note that over the lifetime of this research project, if anything, the gap between the Jena built-ins and BTF actually increased, as optimisations were introduced into the Jena parsers and writers and the JVM optimiser got better.

rvesse, Feb 05 '24 10:02

The current THRIFT format is great, but in practice there are many duplicate strings that unnecessarily bloat the serialised data. Idea: extend the format with a duplicate-free list of all occurring strings, and in the remaining fields refer to a string only by its index in that list (an i32). For entire graphs, there would be exactly one list of strings. With the stream-based formats for RDF and results, there would be one such list per row, which the consumer builds up into a complete list, as each row only contains the strings that are new. Apart from this change, the format would otherwise be identical.

You need to do this carefully such that you still allow stream processing of the data because that's very common in real-world use cases. Also there is a potential memory issue here in that a processor has to continuously build up the string table as it goes along (both for reading/writing).

In the pathological case a malicious user could construct a SPARQL query using a function like UUID() that generates a massive stream of results where the strings never repeat and effectively create a DoS attack against the server.

The speed of serialisation and deserialisation in memory hardly differs between both THRIFT-based formats.

The in-memory part worries me here. The reality is that these serialisations are typically exchanged via a file system or a network, where I/O overhead can often dominate. If we're going to do benchmarks we need to account for that somehow.

With streaming, classic compression is of little use because it can only compress the character strings that occur in the current row. The previously transmitted character strings are lost for compression. --> This problem is probably ideally addressed by the proposed new format. --> This way, the proposed new format could lead to a massive reduction in network traffic when transporting query results.

General purpose stream compression, e.g. Gzip, is usually fine provided you choose an appropriate buffer size for the Gzip stream. This avoids the problem you imply here. Compression schemes (e.g. Deflate) in formats like Gzip typically have a maximum-size window for references to previously used strings, so as long as your buffer size is appropriate you get good enough compression.
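For example, with Java's built-in Gzip support the stream can be given an explicit buffer size and flushed incrementally (the buffer size below is just an illustrative value to tune, not a recommendation):

import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

final class StreamingGzip {
    // syncFlush = true means flush() emits everything compressed so far without ending the
    // stream, so a streaming writer can flush after each batch of rows.
    static OutputStream wrap(OutputStream network) throws IOException {
        return new GZIPOutputStream(network, 64 * 1024, true);
    }
}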

In the pathological case with all unique strings any compression scheme will struggle but general purpose ones generally cope the best in such scenarios (potentially even dropping back to uncompressed stored data blocks where it makes sense).

rvesse, Feb 05 '24 11:02

Thank you very much for all your feedback.

@rvesse, regarding your attack scenario where the server could run out of memory due to storing all strings until the last result row is sent: that is indeed a valid concern. This vulnerability might make the format less suitable for public SPARQL servers. However, within our cluster, for service-to-service and service-to-UI client communications, it could serve as an effective bandwidth-saving alternative.

Regarding your concerns about the in-memory measurement, since bandwidth and I/O performance vary greatly across different scenarios, I thought it prudent to exclude these factors from the performance measurements. Are you suggesting that different disk flushing strategies might not have been adequately considered?

@afs, it appears that the Protobuf and Thrift formats do not utilize ParserProfileStd, which uses CacheCaffeine with a default cache size of 500. Instead, they employ CacheSimple with a default cache size of 5000. The rationale behind the use of two different LRU cache implementations and the varying default cache sizes is unclear to me. Neither Protobuf nor Thrift seems to utilize FactoryRDFCaching, and no caching is applied to results, only to graphs.
--> I could try to use the IRI cache for the results too and measure the impact. Would that be helpful?
--> I could try to use FactoryRDFCaching#createTypedLiteral for the graph and result deserialization and measure the impact. Would that be helpful?

The idea of a JVM-wide term cache is intriguing, but I am uncertain about its potential to slow down the system under heavy multithreaded workloads. An alternative could be making it optionally thread-local. Another concern is determining the optimal size for such a cache.

Back to my current approach:

Introducing another data format is not my intention.

To identify the optimal format for our needs, we conducted extensive benchmarks across various serialization formats, combined with different compression algorithms and levels. We also considered the frequency of graph writes versus reads and the implications for disk space over time. For our use case, RDF_THRIFT_VALUES (soon to be RDF_THRIFT) combined with LZ4_fastest emerged as the best choice. Our active datasets are maintained as in-memory graphs to ensure peak performance, with only changes being written as deltas into the RDBMS, following a CQRS with event sourcing pattern. A significant bottleneck we've identified is the substantial data traffic generated by services executing numerous parallel queries.

Upon examining the THRIFT results format, I noticed that all data fields contained strings, with many repetitions. In my initial implementation, I simply adapted the THRIFT format and code, substituting all strings with i32 indices and adding string lists to the rows. The modifications to the code were minimal. For deserialization in Java, the fastest approach seems to be using a single ArrayList<String>, adding new strings with #addAll, and retrieving corresponding strings with #get(index). The downside is that the consumer must retain the expanding string list in memory until the stream is processed. The upside is its simplicity and straightforwardness, with an equivalent to ArrayList<String> likely available in other programming languages.
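As an illustration of that reader side (the row shape here is made up; the real code reads the Thrift-generated row objects):

import java.util.ArrayList;
import java.util.List;

// Accumulate each row's new strings with addAll(), then resolve i32 indices with get().
class StringTableDecoder {
    private final List<String> strings = new ArrayList<>();

    void onRow(List<String> newStrings, int[] columnIndices, String[] out) {
        strings.addAll(newStrings);
        for (int c = 0; c < columnIndices.length; c++) {
            out[c] = strings.get(columnIndices[c]);
        }
    }
}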

THRIFT2 is a variation of THRIFT (screenshot of the modified BinaryRDF thrift definition).

I've conducted JMH-benchmarks on several graphs that I had access to:

  • pizza.owl.rdf and cheese.ttl (I sometimes left these out of some benchmarks due to their relatively small size)
  • RealGrid_EQ.xml, RealGrid_SSH.xml, RealGrid_TP.xml, RealGrid_SV.xml (these are sample datasets from the CGMES conformity assessment, available for download at ENTSO-E)
  • xxx_CGMES_EQ.xml, xxx_CGMES_SSH.xml, xxx_CGMES_TP.xml (these are real datasets from one of our projects and are not publicly available)
  • bsbm-1m.nt.gz

Each of these datasets, especially the EQ, SSH, TP, and SV graphs, has a unique structure, leading to varying distributions of subjects, predicates, and objects.

Since additional compression often seems to be relevant, I measured uncompressed, GZIP-compressed and LZ4Fastest-compressed.

Regarding the SPARQL query results, I opted for a straightforward "SELECT ?s ?p ?o FROM ..." query. Based on my observations, I don't believe the choice of query would significantly alter the outcomes.

Here I compared THRIFT with my implementation (currently named THRIFT2):

Speed of a roundtrip: The simplest approach to measure speed is a simple roundtrip of serialization and deserialization. This perspective is, of course, too simplified if the producer has significantly more or fewer resources than the consumer, or if data is written much more often than it is read (or the other way around).
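For reference, a minimal sketch of such a roundtrip benchmark for a graph, assuming Jena's RDFDataMgr and the existing RDF-THRIFT language constant (this is not the benchmark code from my branch, and the file path is a placeholder):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class RoundTripBench {
    Model model;

    @Setup
    public void load() {
        model = RDFDataMgr.loadModel("cheeses-0.1.ttl");    // placeholder path
    }

    @Benchmark
    public Model thriftRoundTrip() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        RDFDataMgr.write(out, model, Lang.RDFTHRIFT);        // serialise in memory
        Model copy = ModelFactory.createDefaultModel();
        RDFDataMgr.read(copy, new ByteArrayInputStream(out.toByteArray()), Lang.RDFTHRIFT);
        return copy;                                         // return to prevent dead-code elimination
    }
}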

Speed of SPARQL-Result Serialization+Deserialization-Roundtrip, THRIFT vs THRIFT2 (chart) --> For SPARQL results, THRIFT2 outperforms THRIFT in all cases and with all tested compressions.

Speed of Graph Serialization+Deserialization-Roundtrip, THRIFT vs THRIFT2 (chart) --> For graphs, THRIFT2 outperforms THRIFT in combination with compression, but not uncompressed.

Size: For the size, I simply calculated the number of triples per MB:

Serialized SPARQL-Results, THRIFT vs THRIFT2 (chart) --> Uncompressed THRIFT2 SPARQL results are in all cases at least 50% smaller than uncompressed THRIFT.

Serialized Graphs, THRIFT vs THRIFT2 (chart) --> Uncompressed THRIFT2 graphs are in all cases at least 50% smaller than uncompressed THRIFT.

Speed of serialization: SPARQL-Result Serialization and Graph Serialization, THRIFT vs THRIFT2 (charts) --> Overall, THRIFT2 is faster than THRIFT during serialization.

Speed of deserialization

Speed of SPARQL-Result Deserialization, THRIFT vs THRIFT2 (chart) --> In combination with compression, THRIFT2 outperforms THRIFT for SPARQL results. Uncompressed, there is no clear winner.

Speed of Graph Deserialization, THRIFT vs THRIFT2 (chart) --> For graphs, THRIFT2 outperforms THRIFT in combination with compression, but not uncompressed.

Questions: What do you think? Are the advantages shown in the benchmarks too small to justify an additional format? Should I move forward with the THRIFT variant, or maybe only with the results format but not with the graphs format? Would it be appropriate to move THRIFT2 (or whatever name it will have) to a separate package, to avoid introducing yet another format?

Benchmarks for SPARQL-result serialization/deserialization

Since I've conducted some benchmarks, it might be interesting to compare all available formats here.

Note: I tried the same approach with PROTO, but Protobuf itself seems to behave differently when deserializing string lists. Deserialization in PROTO2 is, in my opinion, too slow to be considered an option.

Size (charts): Serialized SPARQL-Results - uncompressed / with LZ4 compression; Serialized Graphs - uncompressed

Speed of a roundtrip (charts): Speed of SPARQL-Result Serialization+Deserialization-Roundtrip - uncompressed / with LZ4 compression / with GZIP compression

Speed of serialization (charts): Speed of SPARQL-Result Serialization - uncompressed / with LZ4 compression / with GZIP compression

Speed of deserialization (charts): Speed of SPARQL-Result Deserialization - uncompressed / with LZ4 compression / with GZIP compression

Benchmarks for graph serialization/deserialization

Size (charts): Serialized Graphs - uncompressed / with LZ4 compression / with GZIP compression

Speed of a roundtrip (charts): Speed of Graph Serialization+Deserialization-Roundtrip - uncompressed / with LZ4 compression / with GZIP compression

Speed of serialization (charts): Speed of Graph Serialization - uncompressed / with LZ4 compression / with GZIP compression

Speed of deserialization (charts): Speed of Graph Deserialization - uncompressed / with LZ4 compression / with GZIP compression

My Excel file with all the numbers: Apache Jena Serialization.xlsx

arne-bdt, Feb 12 '24 22:02

As far as I know, any new format could be implemented as a sidecar/external library, since Jena offers all the necessary APIs to register new formats for graph serialization and query results.

I think at this point, Jena does not need another format in the main project.

arne-bdt, Mar 18 '24 20:03