janusgraph Index mutation failures don't cause transaction abort

When a JanusGraph vertex property indexed in ElasticSearch is updated concurrently in separate transactions and then committed, ElasticSearch fails and the indexing backend throws an exception like so:

java.lang.Exception: failure in bulk execution:
[0]: index [titan], type [verticesByDocumentExternal], id [j3s], message [VersionConflictEngineException[[titan][4] [verticesByDocumentExternal][j3s]: version conflict, current [3], provided [2]]]

That isn't a problem by itself. However, JanusGraph swallows this exception and proceeds to commit the transaction as if nothing had happened. See this snippet from StandardJanusGraph (I am on 0.1.1 but it appears the same on the master branch):


                        //2. Commit indexes - [FAILURE] all exceptions are collected and logged but nothing is aborted
                        indexFailures = mutator.commitIndexes();
                        if (!indexFailures.isEmpty()) {
                            status = LogTxStatus.SECONDARY_FAILURE;
                            for (Map.Entry<String,Throwable> entry : indexFailures.entrySet()) {
                                log.error("Error while commiting index mutations for transaction ["+transactionId+"] on index: " +entry.getKey(),entry.getValue());
                            }
                        }
                        //3. Log transaction if configured - [FAILURE] is recorded but does not cause exception
                        if (logTxIdentifier!=null) {
                            try {
                                userlogSuccess = false;
                                final Log userLog = backend.getUserLog(logTxIdentifier);
                                Future<Message> env = userLog.add(txLogHeader.serializeModifications(serializer, LogTxStatus.USER_LOG, tx, addedRelations, deletedRelations));
                                if (env.isDone()) {
                                    try {
                                        env.get();
                                    } catch (ExecutionException ex) {
                                        throw ex.getCause();
                                    }
                                }
                                userlogSuccess=true;
                            } catch (Throwable e) {
                                status = LogTxStatus.SECONDARY_FAILURE;
                                log.error("Could not user-log committed transaction ["+transactionId+"] to " + logTxIdentifier, e);
                            }
                         }

When this happens, I would expect the commit to fail, and I'd be fine with that - essentially the same behavior as conflicting transactions in JanusGraph have in many circumstances. But since it doesn't, my transaction completes and I am left with some weird state, essentially a transaction that is partially committed.

May 24 '17 18:05 sharpau

I think I noticed a similar issue when implementing the DynamoDB storage backend. Basically, JanusGraph does not check for InterruptedException either (unless something changed since the last time I dealt with the issue).

May 26 '17 15:05 amcp

Is there any progress on issue? I experience exactly the same behaviour of JG — commit succeeds even if mutation fails.

Aug 02 '18 12:08 fufler

Any progress? I think this is a serious bug which can lead to data inconsistency

Apr 24 '19 01:04 cjxqhhh

@cjxqhhh Take a look at documentation. As far as I understand this behavior is considered to be expected. With transaction recovery configured vertices index should be consistent, though you still may have issues with edges index if consistency locking is not set on edge labels.

Apr 25 '19 08:04 fufler

@cjxqhhh Take a look at documentation. As far as I understand this behavior is considered to be expected. With transaction recovery configured vertices index should be consistent, though you still may have issues with edges index if consistency locking is not set on edge labels.

Thanks~ However, I don't understand why it's designed this way, why not just throw the exception, letting the user know there is something wrong in the index backend so they can recommit the transactions? And how should I choose the startTime in the startTransactionRecovery function, thanks in advance.

Apr 29 '19 09:04 cjxqhhh

I don't understand why it's designed this way, why not just throw the exception, letting the user know there is something wrong in the index backend so they can recommit the transactions?

I do not know exact reason for such design decision, since this code was not written by me.

And how should I choose the startTime in the startTransactionRecovery function

Here is what documentation says about this param:

start time specifies the time since epoch where the recovery process should start reading from the write-ahead log

May 06 '19 05:05 fufler

I do think we should make it configurable. Some users would like to fail fast ~~rather than using a separate transaction recovery process.~~ so that they are aware.

Jun 30 '22 18:06 li-boxuan

I also wanted to point out the reason behind recommending transaction recovery process:

Let's assume we abort the transaction if the index mutation fails, and client code does a read-before-write like this:

// `vid` hits a composite index and `prop_mix_indexed` hits a mixed index
g.V().has("vid", vid).fold().coalesce(__.unfold(), __.addV().property("vid", vid).property("prop_mix_indexed", value))

If the mixed index (for property prop_mix_indexed) creation fails, the transaction will abort (as requested by this issue). When the client retries, Gremlin will witness that the vertex already exists, and thus not triggering vertex & property creation (and of course, index mutation) at all. The mixed index entry will still be missing. Therefore, you need to either avoid such usage (which is hard), or you should use transaction recovery rather than relying on transaction abort (if we implement this feature).

Jul 08 '22 18:07 li-boxuan