brokerProducer is nil, panic
Versions
| Sarama | Kafka | Go |
|---|---|---|
| 1.32.0 | 2.8.7 | 1.18 |
Problem Description
This is not an inevitable situation, but it happens many times
This is a crash triggered by brokerProducer=nil
Test scenario:
A message is sent every second, and the network connection is interrupted during the sending process. During the process, the producer will asynchronously close the producer, and Panic will occur in some cases in about 3-5 minutes
In github.com/Shopify/sarama/async_producer.go 560 lines
if msg.retries > pp.highWatermark {
// a new, higher, retry level; handle it and then back off
pp.newHighWatermark(msg.retries)
pp.backoff(msg.retries)
}
func (pp *partitionProducer) newHighWatermark(hwm int) {
Logger.Printf("producer/leader/%s/%d state change to [retrying-%d]\n", pp.topic, pp.partition, hwm)
pp.highWatermark = hwm
pp.retryState[pp.highWatermark].expectChaser = true
pp.parent.inFlight.Add(1)
// Note: This is going to cause panic, because brokerProducer=nil
pp.brokerProducer.input <- &ProducerMessage{Topic: pp.topic, Partition: pp.partition, flags: fin, retries: pp.highWatermark - 1}
Logger.Printf("producer/leader/%s/%d abandoning broker %d\n", pp.topic, pp.partition, pp.leader.ID())
pp.parent.unrefBrokerProducer(pp.leader, pp.brokerProducer)
pp.brokerProducer = nil
}
I also took a quick look at the Dispatch method, but couldn't figure out the specific reason This problem, only in 1.32.0, I have tested other versions without this problem
Hi @crcms Can you please test this with https://github.com/Shopify/sarama/pull/2182 ? Thank you!
Line 560 for 1.32.0 does not seem to dereference a brokerProducer:
https://github.com/Shopify/sarama/blob/9baf7a651a68631e143039fafd7a7bb13ebdceca/async_producer.go#L560
Would you be able to post the panic trace, confirm the version and ideally provide some Sarama logs?
Also what do you mean by During the process, the producer will asynchronously close the producer?
@slaunay I am just disconnecting the network during the running process.
@crcms if you can share the panic trace that would go a long way.
@slaunay This is a stack of panic.
goroutine 58 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0xc000218600, 0x1)
vendor/github.com/Shopify/sarama/async_producer.go:618 +0x19d
github.com/Shopify/sarama.(*partitionProducer).dispatch(0xc000218600)
vendor/github.com/Shopify/sarama/async_producer.go:562 +0x4f7
github.com/Shopify/sarama.withRecover(0x0)
vendor/github.com/Shopify/sarama/utils.go:43 +0x3e
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
vendor/github.com/Shopify/sarama/async_producer.go:513 +0x1fb
This is a must-have BUG, as long as you interrupt the network in the running process, this problem will occur about 3-5 minutes, and the number of retries set by Producer is the default.
Hi @crcms Can you please test this with #2182 ? Thank you!
Already testing, still has question
Thanks @crcms, the stack points us to line 618 for the nil derefencing vs line 560:
https://github.com/Shopify/sarama/blob/9baf7a651a68631e143039fafd7a7bb13ebdceca/async_producer.go#L615-L619
One scenario that would lead to the brokerProducer being nil prior to calling newHighWatermark would be if the following case <-pp.brokerProducer.abandoned matches:
https://github.com/Shopify/sarama/blob/9baf7a651a68631e143039fafd7a7bb13ebdceca/async_producer.go#L547-L558
We need a unit test case to reproduce the issue in order to implement the proper fix. Such test case can be used as well to confirm whether it still happens with #2182.
@slaunay thank you very much, but the issue #2182 I used it, and it's still problematic
We are having the same issue, any updates on this?
edit: sarama v1.34.0 (go 1.19) with kafka v2.4.1
As this seems to be related, on our production system (sarama v1.36.0, go 1.19; kafka v2.8.0) a pod just crashed with:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x877330]
goroutine 1168 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0x40008e38c0, 0x2)
/go/pkg/mod/github.com/!shopify/[email protected]/async_producer.go:620 +0x1e0
github.com/Shopify/sarama.(*partitionProducer).dispatch(0x40008e38c0)
/go/pkg/mod/github.com/!shopify/[email protected]/async_producer.go:564 +0x4a4
github.com/Shopify/sarama.withRecover(0xdf5190?)
/go/pkg/mod/github.com/!shopify/[email protected]/utils.go:43 +0x40
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
/go/pkg/mod/github.com/!shopify/[email protected]/async_producer.go:515 +0x208
Are there any updates on this?
@crcms You mentioned that this is a must have BUG? So is this panic expected behavior if there is a connection issue for some minutes?
As this seems to be related, on our production system (sarama v1.36.0, go 1.19; kafka v2.8.0) a pod just crashed with:
panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x877330] goroutine 1168 [running]: github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0x40008e38c0, 0x2) /go/pkg/mod/github.com/!shopify/[email protected]/async_producer.go:620 +0x1e0 github.com/Shopify/sarama.(*partitionProducer).dispatch(0x40008e38c0) /go/pkg/mod/github.com/!shopify/[email protected]/async_producer.go:564 +0x4a4 github.com/Shopify/sarama.withRecover(0xdf5190?) /go/pkg/mod/github.com/!shopify/[email protected]/utils.go:43 +0x40 created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer /go/pkg/mod/github.com/!shopify/[email protected]/async_producer.go:515 +0x208Are there any updates on this?
@crcms You mentioned that this is a must have BUG? So is this panic expected behavior if there is a connection issue for some minutes?
I face the same issue
got the same issue with [email protected] on prod when we lost our kafka:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0xfca3b0]
goroutine 469 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0x40006bc5a0, 0x1)
/home/jenkins/go/pkg/mod/github.com/!shopify/[email protected]/async_producer.go:708 +0x1e0
github.com/Shopify/sarama.(*partitionProducer).dispatch(0x40006bc5a0)
/home/jenkins/go/pkg/mod/github.com/!shopify/[email protected]/async_producer.go:648 +0x4a4
github.com/Shopify/sarama.withRecover(0x0?)
/home/jenkins/go/pkg/mod/github.com/!shopify/[email protected]/utils.go:43 +0x40
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
/home/jenkins/go/pkg/mod/github.com/!shopify/[email protected]/async_producer.go:599 +0x208
We also see this segfault when the broker goes away. Sarama version 1.31.1 (2022-02-01). Issue #2322 is a duplicate of this issue #2184, right? Can we mark one or the other as a duplicate?
Dec 10 11:07:19 dut2 OUR: panic: runtime error: invalid memory address or nil pointer dereference
Dec 10 11:07:19 dut2 OUR: [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1865bfd]
Dec 10 11:07:19 dut2 OUR: goroutine 131 [running]:
Dec 10 11:07:19 dut2 OUR: github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0xc001361380, 0x1)
Dec 10 11:07:19 dut2 OUR: /u0/REDACTED/source/telegraf-1.22.1/vendor/github.com/Shopify/sarama/async_producer.go:618 +0x19d
Dec 10 11:07:19 dut2 OUR: github.com/Shopify/sarama.(*partitionProducer).dispatch(0xc001361380)
Dec 10 11:07:19 dut2 OUR: /u0/REDACTED/source/telegraf-1.22.1/vendor/github.com/Shopify/sarama/async_producer.go:562 +0x50a
Dec 10 11:07:19 dut2 OUR: github.com/Shopify/sarama.withRecover(0xc0004a5400?)
Dec 10 11:07:19 dut2 OUR: /u0/REDACTED/source/telegraf-1.22.1/vendor/github.com/Shopify/sarama/utils.go:43 +0x3e
Dec 10 11:07:19 dut2 OUR: created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
Dec 10 11:07:19 dut2 OUR: /u0/REDACTED/source/telegraf-1.22.1/vendor/github.com/Shopify/sarama/async_producer.go:513 +0x1f6
Dec 10 11:07:19 dut2 systemd: OUR.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Just met the issue on sarama v1.29.1, any updates to solve it?
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1ded78f]
goroutine 799 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0xc0045287e0, 0x1)
/root/go/src/git.garena.com/shopee/seller-server/crm/crm-proactive-push/vendor/github.com/Shopify/sarama/async_producer.go:617 +0x1cf
github.com/Shopify/sarama.(*partitionProducer).dispatch(0xc0045287e0)
/root/go/src/git.garena.com/shopee/seller-server/crm/crm-proactive-push/vendor/github.com/Shopify/sarama/async_producer.go:561 +0x589
github.com/Shopify/sarama.withRecover(0xc003fcde00)
/root/go/src/git.garena.com/shopee/seller-server/crm/crm-proactive-push/vendor/github.com/Shopify/sarama/utils.go:43 +0x49
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
/root/go/src/git.garena.com/shopee/seller-server/crm/crm-proactive-push/vendor/github.com/Shopify/sarama/async_producer.go:512 +0x1aa
I have a currently feasible mitigation solution: first set the panic handler, rebuild all producers when any panic is caught, and because the panic handler is set, the producer will not automatically close when the producer internally crashes, and the input channel will be fast is filled and blocked, a producer rebuild is also required when this is detected.
got the same issue: panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x12b2609]
goroutine 74332 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0xc0c18b9140, 0x3)
github.com/Shopify/[email protected]/async_producer.go:620 +0x1a9
github.com/Shopify/sarama.(*partitionProducer).dispatch(0xc0c18b9140)
github.com/Shopify/[email protected]/async_producer.go:564 +0x537
github.com/Shopify/sarama.withRecover(0xc1ea43e580?)
github.com/Shopify/[email protected]/utils.go:43 +0x3e
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
github.com/Shopify/[email protected]/async_producer.go:515 +0x1ea
seems this issue was fixed by:fix: add nil check while producer is retrying (
Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur. Please check if the main branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.
Closing as fixed by https://github.com/IBM/sarama/pull/2387