Exceptions during chunk commit are silently ignored
Erwin Vervaet opened BATCH-2415 and commented
A TaskletStep uses various helpers to loop over all chunks to be processed and process each of them in a separate transaction while managing things such as skippable and retryable exceptions. In pseudo code this whole system can be summarized as follows:
while (more chunks) { // RepeatTemplate usage in TaskletStep.doExecute
try { // RepeatTemplate.executeInternal
try { // StepContextRepeatCallback.doInIteration
chunkContext = queue.poll()
{ // TaskletStep: StepContextRepeatCallback.doInChunkContext
try { // TransactionTemplate.execute
{ // TaskletStep: ChunkTransactionCallback.doInTransaction
{ // ChunkOrientedTasklet.execute
chunk = chunkProvider.provide()
chunkProcessor.process(chunk) [A]
chunkContext.setComplete()
return RepeatStatus.continueIf(!chunk.isEnd())
}
}
} catch (Exception e) {
rollback()
}
commit() [B]
}
} finally {
if (!chunkContext.isComplete()) {
queue.add(chunkContext)
}
}
} catch (Exception e) {
{ // SimpleRetryExceptionHandler.handleException
if (exception is fatal) {
throw e;
}
}
}
}
This code exhibits the following type of behavior:
- If no exceptions occur during chunk processing, the chunk is marked as complete and the transaction commits. Since the chunk its complete it is not pushed back on the queue and the repeat loop ends normally.
- A non-fatal (i.e. retryable) exception that is thrown while processing the chunk (point [A]) will cause the chunk not to be marked complete and the transaction to rollback. Since the chunk is not complete its pushed back onto the queue of chunks to process. Furthermore, since the exception is not fatal it will be ignored by the RepeatTemplate and a new iteration will start, actually doing the retry.
So far so good. However, we're experiencing the following case:
- No exceptions occur during chunk processing. Consequently the chunk is marked as complete. Now an exception is thrown during transaction commit (point [B]). Since the chunk is already marked as complete the chunk is not pushed back on the queue and no retry is attempted. The commit failure caused correct chunk transaction rollback. However, since the exception is not deemed to be fatal it is silently ignored by the RepeatTemplate! End result: the batch completes normally but certain chunks were silently skipped!
The exception we get is (see also BATCH-2403):
2015-08-06 13:53:04.440 DEBUG o.s.b.c.s.i.SimpleRetryExceptionHandler - Handled non-fatal exception
javax.persistence.PersistenceException: org.hibernate.exception.LockAcquisitionException: could not execute statement
at org.hibernate.ejb.AbstractEntityManagerImpl.convert(AbstractEntityManagerImpl.java:1387) ~[hibernate-entitymanager-4.2.12.Final.jar:4.2.12.Final]
at org.hibernate.ejb.AbstractEntityManagerImpl.convert(AbstractEntityManagerImpl.java:1310) ~[hibernate-entitymanager-4.2.12.Final.jar:4.2.12.Final]
at org.hibernate.ejb.AbstractEntityManagerImpl.convert(AbstractEntityManagerImpl.java:1316) ~[hibernate-entitymanager-4.2.12.Final.jar:4.2.12.Final]
at org.hibernate.ejb.AbstractEntityManagerImpl$CallbackExceptionMapperImpl.mapManagedFlushFailure(AbstractEntityManagerImpl.java:1510) ~[hibernate-entitymanager-4.2.12.Final.jar:4.2.12.Final]
at org.hibernate.engine.transaction.synchronization.internal.SynchronizationCallbackCoordinatorNonTrackingImpl.beforeCompletion(SynchronizationCallbackCoordinatorNonTrackingImpl.java:110) ~[hibernate-core-4.2.12.Final.jar:4.2.12.Final]
at org.hibernate.engine.transaction.synchronization.internal.RegisteredSynchronization.beforeCompletion(RegisteredSynchronization.java:53) ~[hibernate-core-4.2.12.Final.jar:4.2.12.Final]
at bitronix.tm.BitronixTransaction.fireBeforeCompletionEvent(BitronixTransaction.java:478) ~[btm-2.1.2.jar:2.1.2]
at bitronix.tm.BitronixTransaction.commit(BitronixTransaction.java:193) ~[btm-2.1.2.jar:2.1.2]
at bitronix.tm.BitronixTransactionManager.commit(BitronixTransactionManager.java:120) ~[btm-2.1.2.jar:2.1.2]
at org.springframework.transaction.jta.JtaTransactionManager.doCommit(JtaTransactionManager.java:1012) ~[spring-tx-3.2.9.RELEASE.jar:3.2.9.RELEASE]
at org.springframework.transaction.support.AbstractPlatformTransactionManager.processCommit(AbstractPlatformTransactionManager.java:755) ~[spring-tx-3.2.9.RELEASE.jar:3.2.9.RELEASE]
at org.springframework.transaction.support.AbstractPlatformTransactionManager.commit(AbstractPlatformTransactionManager.java:724) ~[spring-tx-3.2.9.RELEASE.jar:3.2.9.RELEASE]
at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:148) ~[spring-tx-3.2.9.RELEASE.jar:3.2.9.RELEASE]
at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:267) ~[spring-batch-core-2.2.7.RELEASE.jar:na]
at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:77) ~[spring-batch-core-2.2.7.RELEASE.jar:na]
at org.springframework.batch.repeat.support.RepeatTemplate.getNextResult(RepeatTemplate.java:368) [spring-batch-infrastructure-2.2.7.RELEASE.jar:na]
...
Caused by: org.hibernate.exception.LockAcquisitionException: could not execute statement
at org.hibernate.dialect.MySQLDialect$1.convert(MySQLDialect.java:412) ~[hibernate-core-4.2.12.Final.jar:4.2.12.Final]
...
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.8.0_05]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[na:1.8.0_05]
...
I fail to see how that can be desired behavior. Or am I just missing something here?
Affects: 2.2.7, 3.0.4
Attachments:
- debug.log (399.30 kB)
- step_succeeds_when_commit_fails.patch (4.33 kB)
4 votes, 9 watchers
Michael Minella commented
- Does this issue persist in the 3.0.x branch? 2.2.x is no longer maintained.
- How is this issue different from BATCH-2403 which you point out?
Erwin Vervaet commented
- We're still on 2.2.7 so I haven't verified that yet. I'll upgrade and test this next week.
- BATCH-2403 is ultimately about the noRetry and noSkip methods on FaultTolerantStepBuilder. This ticket is about a logic path that can lead to a situation where Spring Batch rolls back a chunk and silently ignores the exception that caused the rollback, ultimately marking the batch as having completed successfully while some chunks were actually skipped.
Albert Strasser commented
Hi Erwin,
You might be correct that we are looking at 2 different issues here. My assumption was that just checking if an exception was fatal (last catch block of your example) seemed to me as a wrong implementation of skipping (which should default to false for any exception, not only for fatal ones). But you could be right that at this point we are already beyond the concept of skipping and therefore this would mean that we look at a major bug in handling the status of a chunk.
I can assure that this issue does persist in the 3.0.x branch as we are using 3.0.4 and experience exactly the same behaviour.
Erwin Vervaet commented
I can now also confirm that the problem is reproducible with Spring Batch 3.0.4. Attached you can find DEBUG level Spring Batch logging for an occurrence of the problem.
The logging is from two 50-item batches (documents_cuba_RLBTST17.zip and documents_cuba_RLBTST16.zip) that run concurrently. The system first picks up the documents_cuba_RLBTST17.zip batch and about 10 seconds later documents_cuba_RLBTST16.zip is also picked up.
Both batches complete successfully.
documents_cuba_RLBTST17.zip incorrectly skipped 2 items. The STEP_EXECUTION table contains the following (no skips):
| STATUS | COMMIT_COUNT | READ_COUNT | WRITE_COUNT | ROLLBACK_COUNT |
|---|---|---|---|---|
| COMPLETED | 49 | 50 | 48 | 2 |
documents_cuba_RLBTST16.zip incorrectly skipped 12 items. The STEP_EXECUTION table contains the following (no skips):
| STATUS | COMMIT_COUNT | READ_COUNT | WRITE_COUNT | ROLLBACK_COUNT |
|---|---|---|---|---|
| COMPLETED | 39 | 50 | 38 | 12 |
Erwin Vervaet commented
As a temporary work-around, we're now forcing a transaction flush (using EntityManager.flush()) from inside our ItemProcessor. This way Spring Batch will correctly retry or fail the batch if retries are exhausted.
Obviously this is not a real fix since a transaction commit might still fail for other reasons.
Erwin Vervaet commented
Michael Minella, Can you given any feedback related to this issue? Given that if confirmed this is a critical bug I would say this has some priority.
Andrei Amariei commented
Also affected by this issue. I attached [^step_succeeds_when_commit_fails.patch] that adds a failing test to spring-batch-samples module, maybe it helps moving it forward.
Andrei Amariei commented
One concern I have with flushing is that it can add side effects to an unsuccessful write phase.
For example:
- given inputs are entities and hibernate's
Session.saveOrUpdate(item)is used (as inHibernateItemWriter), flushing will increase the version field before throwing any exception. - therefore after rollback, when attempting to process the items individually, writing will fail with StaleObjectStateException.
- the same can also happen in commit phase, so maybe a way to skip all items in the chunk (instead of retrying) in case of write/commit failure would help.
About the attached test, maybe it's wrong to expect the step to fail (even if all the items are skipped), and it would be better to instead have a field StepExecution.commitSkipCount and a method SkipListener.onSkipInCommit() (the method can be placed in a new interface to keep backward compatibility). Similar for retry rules.
Erwin Vervaet commented
Andrei Amariei, you're right, the flush is just a hack. In our application it effectively fixes the problem: we've not seen any new occurrences in our production system. But I fully agree that it's not a real solution that applies generally.
Failing test with 5.0.3 (to add in ChunkOrientedStepIntegrationTests):
@Test
public void faultTolerantStepShouldFailWhenCommitFails() throws Exception {
StepBuilder stepBuilder = new StepBuilder("step", jobRepository);
FaultTolerantStepBuilder<String, String> faultTolerantStepBuilder = new FaultTolerantStepBuilder<>(stepBuilder);
faultTolerantStepBuilder.transactionManager(transactionManager);
faultTolerantStepBuilder.reader(getReader(new String[] { "a", "b", "c" }));
faultTolerantStepBuilder.writer(data -> TransactionSynchronizationManager
.registerSynchronization(new TransactionSynchronizationAdapter() {
@Override
public void beforeCommit(boolean readOnly) {
throw new RuntimeException("Simulate commit failure");
}
}));
step = faultTolerantStepBuilder.build();
JobParameters jobParameters = new JobParameters();
JobExecution jobExecution = jobRepository.createJobExecution(job.getName(), jobParameters);
StepExecution stepExecution = new StepExecution(step.getName(), jobExecution);
jobRepository.add(stepExecution);
step.execute(stepExecution);
Assert.assertEquals(BatchStatus.FAILED, stepExecution.getStatus());
}
Will be reviewed in #3950 .
Is there a workaround for this issue?
Hello. facing the same issue. is there any workaround currently?