Index mutation failures don't cause transaction abort
When a JanusGraph vertex property indexed in ElasticSearch is updated concurrently in separate transactions and then committed, ElasticSearch fails and the indexing backend throws an exception like so:
java.lang.Exception: failure in bulk execution:
[0]: index [titan], type [verticesByDocumentExternal], id [j3s], message [VersionConflictEngineException[[titan][4] [verticesByDocumentExternal][j3s]: version conflict, current [3], provided [2]]]
That isn't a problem by itself. However, JanusGraph swallows this exception and proceeds to commit the transaction as if nothing had happened. See this snippet from StandardJanusGraph (I am on 0.1.1 but it appears the same on the master branch):
//2. Commit indexes - [FAILURE] all exceptions are collected and logged but nothing is aborted
indexFailures = mutator.commitIndexes();
if (!indexFailures.isEmpty()) {
status = LogTxStatus.SECONDARY_FAILURE;
for (Map.Entry<String,Throwable> entry : indexFailures.entrySet()) {
log.error("Error while commiting index mutations for transaction ["+transactionId+"] on index: " +entry.getKey(),entry.getValue());
}
}
//3. Log transaction if configured - [FAILURE] is recorded but does not cause exception
if (logTxIdentifier!=null) {
try {
userlogSuccess = false;
final Log userLog = backend.getUserLog(logTxIdentifier);
Future<Message> env = userLog.add(txLogHeader.serializeModifications(serializer, LogTxStatus.USER_LOG, tx, addedRelations, deletedRelations));
if (env.isDone()) {
try {
env.get();
} catch (ExecutionException ex) {
throw ex.getCause();
}
}
userlogSuccess=true;
} catch (Throwable e) {
status = LogTxStatus.SECONDARY_FAILURE;
log.error("Could not user-log committed transaction ["+transactionId+"] to " + logTxIdentifier, e);
}
}
When this happens, I would expect the commit to fail, and I'd be fine with that - essentially the same behavior as conflicting transactions in JanusGraph have in many circumstances. But since it doesn't, my transaction completes and I am left with some weird state, essentially a transaction that is partially committed.
I think I noticed a similar issue when implementing the DynamoDB storage backend. Basically, JanusGraph does not check for InterruptedException either (unless something changed since the last time I dealt with the issue).
Is there any progress on issue? I experience exactly the same behaviour of JG — commit succeeds even if mutation fails.
Any progress? I think this is a serious bug which can lead to data inconsistency
@cjxqhhh Take a look at documentation. As far as I understand this behavior is considered to be expected. With transaction recovery configured vertices index should be consistent, though you still may have issues with edges index if consistency locking is not set on edge labels.
@cjxqhhh Take a look at documentation. As far as I understand this behavior is considered to be expected. With transaction recovery configured vertices index should be consistent, though you still may have issues with edges index if consistency locking is not set on edge labels.
Thanks~ However, I don't understand why it's designed this way, why not just throw the exception, letting the user know there is something wrong in the index backend so they can recommit the transactions? And how should I choose the startTime in the startTransactionRecovery function, thanks in advance.
I don't understand why it's designed this way, why not just throw the exception, letting the user know there is something wrong in the index backend so they can recommit the transactions?
I do not know exact reason for such design decision, since this code was not written by me.
And how should I choose the startTime in the startTransactionRecovery function
Here is what documentation says about this param:
start time specifies the time since epoch where the recovery process should start reading from the write-ahead log
I do think we should make it configurable. Some users would like to fail fast ~~rather than using a separate transaction recovery process.~~ so that they are aware.
I also wanted to point out the reason behind recommending transaction recovery process:
Let's assume we abort the transaction if the index mutation fails, and client code does a read-before-write like this:
// `vid` hits a composite index and `prop_mix_indexed` hits a mixed index
g.V().has("vid", vid).fold().coalesce(__.unfold(), __.addV().property("vid", vid).property("prop_mix_indexed", value))
If the mixed index (for property prop_mix_indexed) creation fails, the transaction will abort (as requested by this issue).
When the client retries, Gremlin will witness that the vertex already exists, and thus not triggering vertex & property creation (and of course, index mutation) at all. The mixed index entry will still be missing. Therefore, you need to either avoid such usage (which is hard), or you should use transaction recovery rather than relying on transaction abort (if we implement this feature).