NullPointerException using DBWAL and s3n
I'm hoping someone can shed some light on this particular issue. We have been having intermittent stability problems pushing Avro files using s3n and the DBWAL, and have seen this failure several times. After running a connector with 8 tasks for the last 4-5 days, all of the tasks started to fail at around the same time with the exception below. When we hit this issue, the only way to restart the job seems to be to drop all of the tables that the DBWAL is using and recreate the job. Is this a known issue? Is there a known fix or workaround? Thanks.
[2017-02-21 01:55:32,511] INFO Offset from WAL 304796 for topic partition id 46 (com.qubole.streamx.s3.wal.DBWAL:287)
[2017-02-21 01:55:32,512] INFO truncating table delete from pixel_events_s3_organic_events_wh_13 where id < 2865 (com.qubole.streamx.s3.wal.DBWAL:250)
[2017-02-21 01:55:32,512] INFO Finished recovery for topic partition organic_events_wh-46 (io.confluent.connect.hdfs.TopicPartitionWriter:213)
[2017-02-21 01:55:32,512] INFO truncating table delete from pixel_events_s3_organic_events_wh_38 where id < 2870 (com.qubole.streamx.s3.wal.DBWAL:250)
[2017-02-21 01:55:32,512] INFO Reading wal select * from pixel_events_s3_organic_events_wh_69 order by id desc limit 1 (com.qubole.streamx.s3.wal.DBWAL:211)
[2017-02-21 01:55:32,513] INFO Offset from WAL 298197 for topic partition id 13 (com.qubole.streamx.s3.wal.DBWAL:287)
[2017-02-21 01:55:32,513] INFO Reading wal select * from pixel_events_s3_organic_events_wh_79 order by id desc limit 1 (com.qubole.streamx.s3.wal.DBWAL:211)
[2017-02-21 01:55:32,513] INFO Offset from WAL 311797 for topic partition id 38 (com.qubole.streamx.s3.wal.DBWAL:287)
[2017-02-21 01:55:32,513] INFO Finished recovery for topic partition organic_events_wh-13 (io.confluent.connect.hdfs.TopicPartitionWriter:213)
[2017-02-21 01:55:32,514] INFO Finished recovery for topic partition organic_events_wh-38 (io.confluent.connect.hdfs.TopicPartitionWriter:213)
[2017-02-21 01:55:32,515] INFO Started recovery for topic partition paid_events_wh-23 (io.confluent.connect.hdfs.TopicPartitionWriter:198)
[2017-02-21 01:55:32,517] INFO Started recovery for topic partition organic_events_wh-29 (io.confluent.connect.hdfs.TopicPartitionWriter:198)
[2017-02-21 01:55:32,517] INFO Started recovery for topic partition organic_events_wh-15 (io.confluent.connect.hdfs.TopicPartitionWriter:198)
[2017-02-21 01:55:32,518] INFO Reading wal select * from pixel_events_s3_paid_events_wh_23 order by id desc limit 1 (com.qubole.streamx.s3.wal.DBWAL:211)
[2017-02-21 01:55:32,521] INFO Reading wal select * from pixel_events_s3_organic_events_wh_29 order by id desc limit 1 (com.qubole.streamx.s3.wal.DBWAL:211)
[2017-02-21 01:55:32,522] INFO Recovering file (com.qubole.streamx.s3.wal.DBWAL:223)
[2017-02-21 01:55:32,522] INFO Reading wal select * from pixel_events_s3_organic_events_wh_15 order by id desc limit 1 (com.qubole.streamx.s3.wal.DBWAL:211)
[2017-02-21 01:55:32,524] INFO truncating table delete from pixel_events_s3_organic_events_wh_29 where id < 2857 (com.qubole.streamx.s3.wal.DBWAL:250)
[2017-02-21 01:55:32,530] ERROR Task pixel_events_s3-7 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:142)
java.lang.NullPointerException
    at io.confluent.connect.hdfs.DataWriter.close(DataWriter.java:299)
    at io.confluent.connect.hdfs.HdfsSinkTask.close(HdfsSinkTask.java:110)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.commitOffsets(WorkerSinkTask.java:301)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.closePartitions(WorkerSinkTask.java:432)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:146)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:140)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:175)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[2017-02-21 01:55:32,531] ERROR Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:143)
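For reference, here is a rough sketch of the manual recovery we do today (drop the WAL tables, then recreate the job). The `<connector>_<topic>_<partition>` table-naming convention and the `wal_cleanup_sql` helper are only inferred from the log above, not confirmed against the streamx source, so treat this as a hypothetical illustration:

```python
# Hypothetical helper that generates the DROP statements we run against the
# DBWAL database before deleting and re-submitting the connector. The
# <connector>_<topic>_<partition> naming is inferred from the log lines such as
# "pixel_events_s3_organic_events_wh_13"; the partition count is site-specific.
def wal_cleanup_sql(connector, topics, partitions):
    """Return one DROP TABLE statement per (topic, partition) WAL table."""
    return [
        f"DROP TABLE IF EXISTS {connector}_{topic}_{p};"
        for topic in topics
        for p in range(partitions)
    ]

# The two topics visible in the log; 80 partitions is an assumed example value.
for stmt in wal_cleanup_sql("pixel_events_s3",
                            ["organic_events_wh", "paid_events_wh"], 80):
    print(stmt)
```

After running the generated SQL against the WAL database, we delete and re-submit the connector through the Kafka Connect REST API, which is what "recreate the job" means above. It would obviously be better to fix the NPE in `DataWriter.close` so this cleanup isn't needed at all.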