sarama icon indicating copy to clipboard operation
sarama copied to clipboard

brokerProducer is nil, panic

Open simon28082 opened this issue 4 years ago • 15 comments

Versions
Sarama Kafka Go
1.32.0 2.8.7 1.18
Problem Description

This is not an inevitable situation, but it happens many times This is a crash triggered by brokerProducer=nil Test scenario: A message is sent every second, and the network connection is interrupted during the sending process. During the process, the producer will asynchronously close the producer, and Panic will occur in some cases in about 3-5 minutes In github.com/Shopify/sarama/async_producer.go 560 lines

if msg.retries > pp.highWatermark {
	// a new, higher, retry level; handle it and then back off
	pp.newHighWatermark(msg.retries)
	pp.backoff(msg.retries)
}

func (pp *partitionProducer) newHighWatermark(hwm int) {
	Logger.Printf("producer/leader/%s/%d state change to [retrying-%d]\n", pp.topic, pp.partition, hwm)
	pp.highWatermark = hwm

	
	pp.retryState[pp.highWatermark].expectChaser = true
	pp.parent.inFlight.Add(1)
//  Note: This is going to cause panic, because brokerProducer=nil
	pp.brokerProducer.input <- &ProducerMessage{Topic: pp.topic, Partition: pp.partition, flags: fin, retries: pp.highWatermark - 1}


	Logger.Printf("producer/leader/%s/%d abandoning broker %d\n", pp.topic, pp.partition, pp.leader.ID())
	pp.parent.unrefBrokerProducer(pp.leader, pp.brokerProducer)
	pp.brokerProducer = nil
}

I also took a quick look at the Dispatch method, but couldn't figure out the specific reason This problem, only in 1.32.0, I have tested other versions without this problem

simon28082 avatar Mar 20 '22 03:03 simon28082

Hi @crcms Can you please test this with https://github.com/Shopify/sarama/pull/2182 ? Thank you!

niamster avatar Mar 21 '22 11:03 niamster

Line 560 for 1.32.0 does not seem to dereference a brokerProducer: https://github.com/Shopify/sarama/blob/9baf7a651a68631e143039fafd7a7bb13ebdceca/async_producer.go#L560

Would you be able to post the panic trace, confirm the version and ideally provide some Sarama logs?

Also what do you mean by During the process, the producer will asynchronously close the producer?

slaunay avatar Mar 25 '22 03:03 slaunay

@slaunay I am just disconnecting the network during the running process.

simon28082 avatar Mar 30 '22 02:03 simon28082

@crcms if you can share the panic trace that would go a long way.

slaunay avatar Mar 31 '22 03:03 slaunay

@slaunay This is a stack of panic.

goroutine 58 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0xc000218600, 0x1)
        vendor/github.com/Shopify/sarama/async_producer.go:618 +0x19d
github.com/Shopify/sarama.(*partitionProducer).dispatch(0xc000218600)
        vendor/github.com/Shopify/sarama/async_producer.go:562 +0x4f7
github.com/Shopify/sarama.withRecover(0x0)
        vendor/github.com/Shopify/sarama/utils.go:43 +0x3e
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
        vendor/github.com/Shopify/sarama/async_producer.go:513 +0x1fb

This is a must-have BUG, as long as you interrupt the network in the running process, this problem will occur about 3-5 minutes, and the number of retries set by Producer is the default.

simon28082 avatar Apr 01 '22 09:04 simon28082

Hi @crcms Can you please test this with #2182 ? Thank you!

Already testing, still has question

simon28082 avatar Apr 01 '22 09:04 simon28082

Thanks @crcms, the stack points us to line 618 for the nil derefencing vs line 560: https://github.com/Shopify/sarama/blob/9baf7a651a68631e143039fafd7a7bb13ebdceca/async_producer.go#L615-L619

One scenario that would lead to the brokerProducer being nil prior to calling newHighWatermark would be if the following case <-pp.brokerProducer.abandoned matches: https://github.com/Shopify/sarama/blob/9baf7a651a68631e143039fafd7a7bb13ebdceca/async_producer.go#L547-L558

We need a unit test case to reproduce the issue in order to implement the proper fix. Such test case can be used as well to confirm whether it still happens with #2182.

slaunay avatar Apr 01 '22 21:04 slaunay

@slaunay thank you very much, but the issue #2182 I used it, and it's still problematic

simon28082 avatar Apr 02 '22 02:04 simon28082

We are having the same issue, any updates on this?

edit: sarama v1.34.0 (go 1.19) with kafka v2.4.1

NewbieOrange avatar Jun 15 '22 02:06 NewbieOrange

As this seems to be related, on our production system (sarama v1.36.0, go 1.19; kafka v2.8.0) a pod just crashed with:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x877330]

goroutine 1168 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0x40008e38c0, 0x2)
	/go/pkg/mod/github.com/!shopify/[email protected]/async_producer.go:620 +0x1e0
github.com/Shopify/sarama.(*partitionProducer).dispatch(0x40008e38c0)
	/go/pkg/mod/github.com/!shopify/[email protected]/async_producer.go:564 +0x4a4
github.com/Shopify/sarama.withRecover(0xdf5190?)
	/go/pkg/mod/github.com/!shopify/[email protected]/utils.go:43 +0x40
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
	/go/pkg/mod/github.com/!shopify/[email protected]/async_producer.go:515 +0x208

Are there any updates on this?

@crcms You mentioned that this is a must have BUG? So is this panic expected behavior if there is a connection issue for some minutes?

qitta avatar Oct 04 '22 07:10 qitta

As this seems to be related, on our production system (sarama v1.36.0, go 1.19; kafka v2.8.0) a pod just crashed with:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x877330]

goroutine 1168 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0x40008e38c0, 0x2)
	/go/pkg/mod/github.com/!shopify/[email protected]/async_producer.go:620 +0x1e0
github.com/Shopify/sarama.(*partitionProducer).dispatch(0x40008e38c0)
	/go/pkg/mod/github.com/!shopify/[email protected]/async_producer.go:564 +0x4a4
github.com/Shopify/sarama.withRecover(0xdf5190?)
	/go/pkg/mod/github.com/!shopify/[email protected]/utils.go:43 +0x40
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
	/go/pkg/mod/github.com/!shopify/[email protected]/async_producer.go:515 +0x208

Are there any updates on this?

@crcms You mentioned that this is a must have BUG? So is this panic expected behavior if there is a connection issue for some minutes?

I face the same issue

yusronhanan avatar Oct 21 '22 06:10 yusronhanan

got the same issue with [email protected] on prod when we lost our kafka:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0xfca3b0]
goroutine 469 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0x40006bc5a0, 0x1)
        /home/jenkins/go/pkg/mod/github.com/!shopify/[email protected]/async_producer.go:708 +0x1e0
github.com/Shopify/sarama.(*partitionProducer).dispatch(0x40006bc5a0)
        /home/jenkins/go/pkg/mod/github.com/!shopify/[email protected]/async_producer.go:648 +0x4a4
github.com/Shopify/sarama.withRecover(0x0?)
        /home/jenkins/go/pkg/mod/github.com/!shopify/[email protected]/utils.go:43 +0x40
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
        /home/jenkins/go/pkg/mod/github.com/!shopify/[email protected]/async_producer.go:599 +0x208

thothothotho avatar Nov 29 '22 14:11 thothothotho

We also see this segfault when the broker goes away. Sarama version 1.31.1 (2022-02-01). Issue #2322 is a duplicate of this issue #2184, right? Can we mark one or the other as a duplicate?

Dec 10 11:07:19 dut2 OUR: panic: runtime error: invalid memory address or nil pointer dereference
Dec 10 11:07:19 dut2 OUR: [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1865bfd]
Dec 10 11:07:19 dut2 OUR: goroutine 131 [running]:
Dec 10 11:07:19 dut2 OUR: github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0xc001361380, 0x1)
Dec 10 11:07:19 dut2 OUR: /u0/REDACTED/source/telegraf-1.22.1/vendor/github.com/Shopify/sarama/async_producer.go:618 +0x19d
Dec 10 11:07:19 dut2 OUR: github.com/Shopify/sarama.(*partitionProducer).dispatch(0xc001361380)
Dec 10 11:07:19 dut2 OUR: /u0/REDACTED/source/telegraf-1.22.1/vendor/github.com/Shopify/sarama/async_producer.go:562 +0x50a
Dec 10 11:07:19 dut2 OUR: github.com/Shopify/sarama.withRecover(0xc0004a5400?)
Dec 10 11:07:19 dut2 OUR: /u0/REDACTED/source/telegraf-1.22.1/vendor/github.com/Shopify/sarama/utils.go:43 +0x3e
Dec 10 11:07:19 dut2 OUR: created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
Dec 10 11:07:19 dut2 OUR: /u0/REDACTED/source/telegraf-1.22.1/vendor/github.com/Shopify/sarama/async_producer.go:513 +0x1f6
Dec 10 11:07:19 dut2 systemd: OUR.service: main process exited, code=exited, status=2/INVALIDARGUMENT

Quuxplusone avatar Dec 12 '22 15:12 Quuxplusone

Just met the issue on sarama v1.29.1, any updates to solve it?

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1ded78f]

goroutine 799 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0xc0045287e0, 0x1)
	/root/go/src/git.garena.com/shopee/seller-server/crm/crm-proactive-push/vendor/github.com/Shopify/sarama/async_producer.go:617 +0x1cf
github.com/Shopify/sarama.(*partitionProducer).dispatch(0xc0045287e0)
	/root/go/src/git.garena.com/shopee/seller-server/crm/crm-proactive-push/vendor/github.com/Shopify/sarama/async_producer.go:561 +0x589
github.com/Shopify/sarama.withRecover(0xc003fcde00)
	/root/go/src/git.garena.com/shopee/seller-server/crm/crm-proactive-push/vendor/github.com/Shopify/sarama/utils.go:43 +0x49
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
	/root/go/src/git.garena.com/shopee/seller-server/crm/crm-proactive-push/vendor/github.com/Shopify/sarama/async_producer.go:512 +0x1aa

flgg2yybb avatar Dec 15 '22 09:12 flgg2yybb

I have a currently feasible mitigation solution: first set the panic handler, rebuild all producers when any panic is caught, and because the panic handler is set, the producer will not automatically close when the producer internally crashes, and the input channel will be fast is filled and blocked, a producer rebuild is also required when this is detected.

edoger avatar Dec 19 '22 02:12 edoger

got the same issue: panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x12b2609]

goroutine 74332 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0xc0c18b9140, 0x3)
	github.com/Shopify/[email protected]/async_producer.go:620 +0x1a9
github.com/Shopify/sarama.(*partitionProducer).dispatch(0xc0c18b9140)
	github.com/Shopify/[email protected]/async_producer.go:564 +0x537
github.com/Shopify/sarama.withRecover(0xc1ea43e580?)
	github.com/Shopify/[email protected]/utils.go:43 +0x3e
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
	github.com/Shopify/[email protected]/async_producer.go:515 +0x1ea

asddongmen avatar May 23 '23 08:05 asddongmen

seems this issue was fixed by:fix: add nil check while producer is retrying (

asddongmen avatar May 23 '23 08:05 asddongmen

Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur. Please check if the main branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.

github-actions[bot] avatar Aug 21 '23 10:08 github-actions[bot]

Closing as fixed by https://github.com/IBM/sarama/pull/2387

dnwe avatar Aug 21 '23 12:08 dnwe