jena Improved implementation of PrefixMapStd

GitHub issue resolved #1474

Reimplementation of PrefixMapStd to combine "fast-track" with trie backing.

[x] Tests are included. (All existing tests apply - no new ones created)
[ ] Documentation change and updates are provided for the Apache Jena website
[x] Commits have been squashed to remove intermediate development commit messages.
[x] Key commit messages start with the issue number (GH-xxxx or JENA-xxxx)

By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.

See the Apache Jena "Contributing" guide.

Aug 09 '22 12:08 Aklakan

This PR is ready for review.

The small extra changes are: Longest-prefix-lookup-cache invalidation now only happens if there is an actual change; adding duplicates or removing non-existing entries do not trigger invalidation. Also, I wasn't sure about the thread-safety of PrefixMapStd - probably its better to have it so I added RWL locking.

Is there a place where the utility methods calcWithLock(Lock, Supplier) and runWithLock(Lock, Runnable) could be publicly added - or maybe they already exist in one of the dependencies?

Aug 24 '22 19:08 Aklakan

I converted back to draft because I have some pending updates which I need to benchmark against the well-working parts of the original PrefixMapStd (IRIs with / or #) first in order to determine whether performance-wise it would make sense to include them.

The general idea is to have PrefixMap implementation that auto-adapts to a given workload - such as parsing lots of queries without the overhead of updating reverse-lookup structures as well as updating them on-demand upon writing out RDF.

In essence the direction I am investigating is about building/updating the reverse-lookup (iri-to-prefix) structures lazily (upon abbreviating). This would buffer prefix-iri inserts/deletions similar to your BufferingPrefixMap; upon abbreviate only the delta is materialized into the reverse-lookup structures.

Sep 17 '22 15:09 Aklakan

Another choice is to restrict to the basic case of prefix at the final "/", "#" and ":" (for URNs). Only have the "fast path" abbreviate.

Do you have cases where abbreviation is not one of these?

If you are going for the complicated version,maybe the best way is to have a new PrefixMapCaching and leave PrefixMapStd.

Sep 17 '22 16:09 afs

Another choice is to restrict to the basic case of prefix at the final "/", "#" and ":" (for URNs). Only have the "fast path" abbreviate.

I think your suggestions of including ':' in the list and only using fast path (without resorting to scanning) would work efficiently and be sufficient for the vast majority of use cases. Without scanning, even relative IRIs (that do not contain any of the fast track chars) wouldn't cause problems.

Do you have cases where abbreviation is not one of these?

Right now I only have some initial experiments where I abuse the prefix map as a poor-mans dictionary encoding in order to reduce the amount of bytes that need to be parsed in order to produce triples/quads. For this I am using trie-based lookups to encode the data (so IRIs can be split anywhere), but I have yet to evaluate whether this actually gives a noticeable performance boost.

Sep 19 '22 17:09 Aklakan