Sub-par concurrent read performance with jena-iri

Open Aklakan opened this issue 3 years ago • 5 comments

Version

4.6.0-SNAPSHOT

What happened?

I started again looking into the issues I had with Jena in Spark settings; related to https://issues.apache.org/jira/browse/JENA-2309

Right now I am investigating some long-standing performance issues where concurrent processing time does not scale directly with the number of cores. Concretely, I am comparing our spark+jena4-based tarql re-implementation with the original tarql (jena2).

One culprit is the jena-iri package: it uses synchronized singleton lexers, which introduce locking overhead between the worker threads. A quick fix is to make those lexers thread-local, which reduces the overhead. On my notebook, in power-save and performance mode, I get these improvements:

jena-4.6.0-SNAPSHOT: power save: 68 sec, performance: 21 sec

thread-local-fix: power save: 54 sec, performance: 19 sec

Profiler output (relevant column is the number of waits): image
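To illustrate the difference between the two approaches, here is a minimal sketch in plain Java. `ToyLexer` and `LexerAccess` are hypothetical stand-ins, not the jena-iri classes; the only assumption is that the lexer keeps mutable state and therefore cannot be shared without a lock.

```java
/** Hypothetical stand-in for a stateful lexer that is not thread-safe (not the jena-iri class). */
final class ToyLexer {
    // Mutable scratch state: this is what makes sharing one instance unsafe.
    private final StringBuilder buffer = new StringBuilder();

    /** Returns the text before the first ':' or "" if the input has no ':'. */
    String scheme(String iri) {
        buffer.setLength(0);
        for (int i = 0; i < iri.length(); i++) {
            char c = iri.charAt(i);
            if (c == ':') {
                return buffer.toString();
            }
            buffer.append(c);
        }
        return "";
    }
}

final class LexerAccess {
    // Variant 1: one shared instance behind a lock. Every worker thread
    // has to wait for the same monitor, which shows up as waits in a profiler.
    private static final ToyLexer SHARED = new ToyLexer();

    static String schemeSynchronized(String iri) {
        synchronized (SHARED) {
            return SHARED.scheme(iri);
        }
    }

    // Variant 2: one instance per thread. No lock is needed because
    // no two threads ever touch the same lexer.
    private static final ThreadLocal<ToyLexer> LOCAL = ThreadLocal.withInitial(ToyLexer::new);

    static String schemeThreadLocal(String iri) {
        return LOCAL.get().scheme(iri);
    }
}
```

The trade-off is one lexer instance per thread instead of one per JVM, which is usually cheap compared to lock contention, but it should be measured for the real lexers.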

A related issue I am currently investigating is that a lot of time is spent in the IRI parsing machinery, e.g. via E_IRI. For testing, I changed it to return the argument as given, which reduced the total processing time (in performance mode) from 19 to 13 seconds. That is around 30% of the time, predominantly spent in the jena-iri lexers. I am not yet sure whether anything can be optimized without compromising functionality.

Are you interested in making a pull request?

Yes

Aklakan avatar Aug 05 '22 20:08 Aklakan

Are you calling jena-iri directly?

1/ (repeated from JENA-2309) IRIx is an abstraction layer for replaceable IRI implementations.

One such IRI3986 implementation is https://github.com/afs/x4ld/tree/main/iri4ld . Minimal object creation - one object to record the results per parser call and RFC3986.create is thread-safe.

Other implementations can be plugged in.

2/ The parser pipeline uses a cache to avoid duplicate work: with the cache, IRI processing is no longer the primary cost when parsing on a single thread.

https://github.com/apache/jena/blob/main/jena-arq/src/main/java/org/apache/jena/riot/system/FactoryRDFCaching.java#L62 which incidentally has the benefit of reducing memory footprint (IIRC by about 1/3). Maybe that works in E_IRI.
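The caching idea can be sketched independently of Jena. The class below is a hypothetical, minimal LRU cache built on `java.util.LinkedHashMap`; it is not the code in FactoryRDFCaching, and the names `LruCache` and `misses()` are made up for illustration. It is deliberately not thread-safe, on the assumption that each parser run or worker thread owns its own cache.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical single-threaded LRU cache: repeated keys skip the expensive loader. */
final class LruCache<K, V> {
    private final Map<K, V> map;
    private final Function<K, V> loader;
    private int misses = 0;

    LruCache(int capacity, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true makes iteration order least-recently-used first,
        // so removeEldestEntry evicts the least recently used entry.
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns the cached value, calling the loader only on a miss. */
    V get(K key) {
        V value = map.get(key);
        if (value == null) {
            misses++;
            value = loader.apply(key); // the expensive step, e.g. IRI parsing
            map.put(key, value);
        }
        return value;
    }

    /** Number of times the loader had to run. */
    int misses() {
        return misses;
    }
}
```

For example, `new LruCache<String, java.net.URI>(5000, java.net.URI::create)` would parse each distinct string once while it stays in the cache; `java.net.URI` is only a stand-in for whatever object the real pipeline builds. Returning the same object for repeated keys is also what would reduce memory footprint.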

FYI: https://github.com/tarql/tarql/pull/99 upgrades tarql to Apache Jena 4.5.0

afs avatar Aug 05 '22 21:08 afs

Adding a cache to E_IRI/IRIx should be simple and I can check how much this improves.

How does the iri4ld implementation differ from jena's current default one functionality-wise? In any case, having less (needless) synchronization between threads is always better.

> FYI: https://github.com/tarql/tarql/pull/99 upgrades tarql to Apache Jena 4.5.0

Good to know that it's possible to compare the performance of spark-based tarql to original tarql within jena4! :) Especially because then the same IRI machinery is used.

In addition, I noticed that E_BNode also causes waits due to synchronization in a SecureRandom instance. This is probably better handled as a separate issue but for now I just wanted to document it here. My spark job's runtime (using a test mapping without iri()) jumps from ~4.5 to ~10 seconds only by adding a dummy bnode() call:

CONSTRUCT { <urn:example:s> <urn:example:p> ?a, ?b, ?c } # ... 16 columns in total
FROM <file:data.csv>
WHERE { BIND(bnode(?a) AS ?foobar) }

The same job with tarql/jena2 takes somewhere between 50 and 60 sec, and with bnode it seems to tend more towards 60 sec, so the effect is less visible in single-threaded processing. It seems that contention between threads on the bnode call is also a bottleneck.
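One possible way to avoid all threads waiting on a single `SecureRandom` is to give each thread its own instance. The sketch below is an assumption about a workaround, not what E_BNode does; `PerThreadRandom` and `randomLabel` are hypothetical names. Whether it removes the contention depends on the underlying provider, which may still serialize access to the operating system's entropy source.

```java
import java.security.SecureRandom;

/** Hypothetical helper: random labels drawn from a per-thread generator. */
final class PerThreadRandom {
    // Each thread lazily creates its own SecureRandom, so threads do not
    // compete for the lock inside one shared instance.
    private static final ThreadLocal<SecureRandom> RNG =
            ThreadLocal.withInitial(SecureRandom::new);

    /** Returns 128 random bits as 32 lowercase hex characters. */
    static String randomLabel() {
        byte[] bytes = new byte[16];
        RNG.get().nextBytes(bytes);
        StringBuilder sb = new StringBuilder(32);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

If cryptographic strength is not required for the labels, `java.util.concurrent.ThreadLocalRandom` is another contention-free option.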

Aklakan avatar Aug 06 '22 12:08 Aklakan

> How does the iri4ld implementation differ from jena's current default one functionality-wise?

Javadoc has the operations described: https://github.com/afs/x4ld/blob/main/iri4ld/src/main/java/org/seaborne/rfc3986/RFC3986.java

A Jena IRIProvider: https://gist.github.com/afs/a0bf740d1bd1fde283eabeab8b4ddb67

It is a Java-coded parser for RFC 3986. The parser is a single file (IRI3986), written with efficiency in mind. No sub-parsers or tokenizers.

jena-iri is a general system for IRIs. It is complicated to build.

iri4ld is simple to build and provides the operations needed for linked data. Like jena-iri, it is independent of the Jena RDF codebase. iri4ld has less in the way of extras not used by Jena.

The parser is IRI3986.java. It handles all URIs (except that it works on Java Unicode strings, so effectively RFC 3987).

It has some additional scheme-specific rule support for the common schemes: it covers "http:", "https:", "did:", "file:", "urn:uuid:", "urn:", "uuid:" (which is not official) and "example:" (RFC 7595).

afs avatar Aug 06 '22 15:08 afs

The parsers generate blank nodes by allocating a UUID once at the start of a parser run, then xor'ing the label into the random number. Unlabelled blank nodes get a not-writable label (it has a 0 byte in it) allocated from a counter.
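The scheme described above can be sketched roughly as follows. This is an illustration under assumptions, not Jena's allocator: `SeededBlankNodeLabels` is a hypothetical name, and hashing the label with `UUID.nameUUIDFromBytes` is a stand-in for whatever hash the parsers actually use. The counter-based handling of unlabelled blank nodes is not covered.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

/** Hypothetical sketch: one random seed per parser run, combined with each label by XOR. */
final class SeededBlankNodeLabels {
    // The only call to the random number generator: once per parser run.
    private final UUID seed = UUID.randomUUID();

    /** Maps a syntax label such as "b0" to a 128-bit identifier in hex. */
    String labelFor(String syntaxLabel) {
        // Assumption: derive 128 bits from the label with a name-based UUID.
        UUID hash = UUID.nameUUIDFromBytes(syntaxLabel.getBytes(StandardCharsets.UTF_8));
        long hi = seed.getMostSignificantBits() ^ hash.getMostSignificantBits();
        long lo = seed.getLeastSignificantBits() ^ hash.getLeastSignificantBits();
        return String.format("%016x%016x", hi, lo);
    }
}
```

Within one run the same syntax label always maps to the same identifier, while a new run gets a new seed and therefore different identifiers. Because the random generator is touched once per run rather than once per blank node, this approach does not contend on it in the hot path.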

afs avatar Aug 06 '22 15:08 afs

IRIx is not the place to put a cache. IRIx is general IRI machinery for any purpose.

The session is provided by a FactoryRDF (FactoryRDFCaching extends FactoryRDFStd implements FactoryRDF). The cache is then of NodeURIs.

afs avatar Aug 06 '22 15:08 afs