netpoll

"integer divide by zero" when dialing

Open cannium opened this issue 6 months ago • 3 comments

Describe the bug

I got this panic (caller frames omitted) during some chaos tests:

panic: runtime error: integer divide by zero

goroutine 310 [running]:
github.com/cloudwego/netpoll.(*roundRobinLB).Pick(0xc001280001?)
        /go/pkg/mod/github.com/cloudwego/[email protected]/poll_loadbalance.go:90 +0x6b
github.com/cloudwego/netpoll.(*manager).Pick(0xc0003dccf0)
        /go/pkg/mod/github.com/cloudwego/[email protected]/poll_manager.go:151 +0x79
github.com/cloudwego/netpoll.newPollDesc(0x77)
        /go/pkg/mod/github.com/cloudwego/[email protected]/net_polldesc.go:26 +0x36
github.com/cloudwego/netpoll.(*netFD).connect(0xc0003828c0, {0x15eef38, 0xc0003827e0}, {0xc0003828c0?, 0xc0012baa08?}, {0x15d8b40?, 0xc0012800e0?})
        /go/pkg/mod/github.com/cloudwego/[email protected]/net_netfd.go:134 +0xf9
github.com/cloudwego/netpoll.(*netFD).dial(0xc0003828c0, {0x15eef38, 0xc0003827e0}, {0x15f4780?, 0x0?}, {0x15f4780?, 0xc0005cc450?})
        /go/pkg/mod/github.com/cloudwego/[email protected]/net_netfd.go:84 +0x155
github.com/cloudwego/netpoll.socket({0x15eef38, 0xc0003827e0}, {0x142a6fd, 0x3}, 0x2, 0x1, 0xc00011ac40?, 0x0, {0x15f4780, 0x0}, ...)
        /go/pkg/mod/github.com/cloudwego/[email protected]/net_sock.go:119 +0x145
github.com/cloudwego/netpoll.internetSocket({0x15eef38, 0xc0003827e0}, {0x142a6fd, 0x3}, {0x15f4780, 0x0}, {0x15f4780, 0xc0005cc450}, 0x1, 0x0, ...)
        /go/pkg/mod/github.com/cloudwego/[email protected]/net_sock.go:47 +0xdc
github.com/cloudwego/netpoll.(*sysDialer).dialTCP(0xc0012bac30, {0x15eef38, 0xc0003827e0}, 0x0, 0xc0005cc450)
        /go/pkg/mod/github.com/cloudwego/[email protected]/net_tcpsock.go:178 +0x94
github.com/cloudwego/netpoll.DialTCP({0x15eef38, 0xc0003827e0}, {0x142a6fd, 0x3}, 0x0, 0xc0005cc450)
        /go/pkg/mod/github.com/cloudwego/[email protected]/net_tcpsock.go:170 +0x20f
github.com/cloudwego/netpoll.(*dialer).dialTCP(0x15eee58?, {0x15eef38, 0xc0003827e0}, {0x142a6fd, 0x3}, {0xc0005922c0?, 0x0?})
        /go/pkg/mod/github.com/cloudwego/[email protected]/net_dialer.go:116 +0x338
github.com/cloudwego/netpoll.(*dialer).DialConnection(0x20b7f60?, {0x142a6fd, 0x3}, {0xc0005922c0, 0x10}, 0x2540be400?)
        /go/pkg/mod/github.com/cloudwego/[email protected]/net_dialer.go:72 +0x125
github.com/cloudwego/netpoll.DialConnection(...)
        /go/pkg/mod/github.com/cloudwego/[email protected]/net_dialer.go:28

I skimmed the code, and my guess is that the return value of m.Run() is not checked: Run() failed and closed m, which presumably left the load balancer with no pollers, so m.balance.Pick() panicked.

https://github.com/cloudwego/netpoll/blob/b0bf57dc0a80447a5c84ae3ef7d5760b58ccd736/poll_manager.go#L143-L152
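To illustrate the suspected failure mode, here is a minimal sketch of a round-robin picker that takes an index modulo the number of pollers. It is not the netpoll source: `roundRobin`, `poll`, `unsafePick`, `errNoPolls`, and `didPanic` are hypothetical names, and the guarded `Pick` is only one possible way to avoid the panic, assuming the poller list can become empty after a failed start.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errNoPolls is a hypothetical sentinel error; it does not exist in netpoll.
var errNoPolls = errors.New("no pollers available")

// poll is a hypothetical stand-in for a poller.
type poll struct{ id int }

// roundRobin is a simplified, hypothetical round-robin load balancer.
type roundRobin struct {
	polls  []*poll
	cursor uint32
}

// unsafePick reproduces the failure mode: with an empty poller list,
// the modulo operation divides by zero and the runtime panics.
func (b *roundRobin) unsafePick() *poll {
	n := uint32(len(b.polls))
	idx := atomic.AddUint32(&b.cursor, 1) % n // panics when n == 0
	return b.polls[idx]
}

// Pick checks for an empty poller list and returns an error instead of panicking.
func (b *roundRobin) Pick() (*poll, error) {
	n := uint32(len(b.polls))
	if n == 0 {
		return nil, errNoPolls
	}
	idx := atomic.AddUint32(&b.cursor, 1) % n
	return b.polls[idx], nil
}

// didPanic runs f and reports whether it panicked.
func didPanic(f func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	empty := &roundRobin{} // simulates a manager whose pollers were closed
	fmt.Println("unsafe pick panicked:", didPanic(func() { empty.unsafePick() }))

	_, err := empty.Pick()
	fmt.Println("guarded pick error:", err)
}
```

Returning an error from the picker would let the dial path fail cleanly instead of crashing the process; whether that fits netpoll's internal API is an open question for the maintainers.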

cannium avatar Jul 10 '25 09:07 cannium

Hi, thanks for the feedback. May I have the tests that can reproduce this panic?

ppzqh avatar Jul 11 '25 03:07 ppzqh

> Hi, thanks for the feedback. May I have the tests that can reproduce this panic?

It's a system-level test, not a unit test, so I cannot easily share it with you. The test basically partitions the network (with iptables) at random and restores it after a while. You could build a similar test in your own environment.

cannium avatar Jul 11 '25 04:07 cannium

Okay. We're not sure we can reproduce it, but we will try.

ppzqh avatar Jul 16 '25 09:07 ppzqh