ppp icon indicating copy to clipboard operation
ppp copied to clipboard

pppd to terminate on "LCP: timeout sending Config-Requests"

Open jkroonza opened this issue 10 months ago • 13 comments

Hi,

We've picked up a few oddball cases where we expect pppd to terminate. It seems the common factor is the last line stating "LCP: timeout sending Config-Requests", eg:

Apr  6 18:56:56 hostname pppd[6145]: Plugin radius.so loaded.
Apr  6 18:56:56 hostname pppd[6145]: RADIUS plugin initialized.
Apr  6 18:56:56 hostname pppd[6145]: Plugin radattr.so loaded.
Apr  6 18:56:56 hostname pppd[6145]: RADATTR plugin initialized.
Apr  6 18:56:56 hostname pppd[6145]: Plugin pppol2tp.so loaded.
Apr  6 18:56:56 hostname pppd[6145]: pppd 2.5.1 started by root, uid 0
Apr  6 18:56:56 hostname pppd[6145]: Using interface ppp1
Apr  6 18:56:56 hostname pppd[6145]: Connect: ppp1 <--> 
Apr  6 18:56:56 hostname pppd[6145]: Overriding mtu 1500 to 1363
Apr  6 18:56:56 hostname pppd[6145]: Overriding mru 1500 to mtu value 1363
Apr  6 18:57:26 hostname pppd[6145]: LCP: timeout sending Config-Requests
Apr  7 23:24:15 hostname pppd[6145]: Terminating on signal 15 
Apr  7 23:24:15 hostname pppd[6145]: Connection terminated.
Apr  7 23:24:15 hostname pppd[6145]: Exit.

You will note more than 24h later I killed this using "kill 6145" at which point it terminated.

I find it a bit oddball that xl2tpd also doesn't kill the pppd since I'm assuming the remote l2tp service has become non-responsive ... but the simplest "fix" is to just have pppd terminate here I reckon, please confirm my assessment, will look into code.

jkroonza avatar Apr 07 '25 21:04 jkroonza

@paulusmack: What do you think?

Neustradamus avatar Apr 07 '25 22:04 Neustradamus

What options are in use here? If you can reproduce with the 'debug' option enabled so we can see the actual negotiation (if any), that might also help. Also, is this new behavior, or do older versions of pppd also do this?

paulusmack avatar Apr 30 '25 09:04 paulusmack

Hi,

I still need to find a reliable way to reproduce. Only seeing this on two specific hosts, where I suspect the "remote" side vanishes during setup (they're incoming from mobile source, which in my experience means they frequently disconnect and connect to the internet). My suspicion is that if LCP times out from establishing during the early phases, or during auth, then this can happen.

I've got no ideal if this is old or new behaviour, I've not seen (observed) it before, but that unfortunately doesn't eliminate this as a potential old existing issue.

jkroonza avatar May 01 '25 07:05 jkroonza

@nomis, @rfc1036, @zdzichu, @pali, @carlsonj, @masputra, @enaess, @chipitsine, @sthibaul, @alarig, @EasyNetDev, @anphsw, @dfskoll, @XWwalker, @kurt-vd, @nkukard, @kad, @tisj: What do you think?

Neustradamus avatar Oct 22 '25 14:10 Neustradamus

I do still see some cases where pppd instances don't close when they're supposed to. Usually this is a non-issue, but I have also seen a few on pppoe now. So something somewhere is up. I suspect (based on events the last few days) it might be related to radius blocking, but I've not managed to confirm yet. I don't think this is critical for a 2.5.3.

jkroonza avatar Oct 22 '25 18:10 jkroonza

Failing to exit after an LCP config-request timeout itself sounds like a problem, and my first guess would be that there's something wrong with the underlying serial transport -- it's hanging on close() or pppd is unexpectedly configured for demand or persist mode or something like that. A stack backtrace in this state might help. And I definitely agree that a debug log and set of options/configuration is going to help.

carlsonj avatar Oct 25 '25 21:10 carlsonj

The latter cases from 4 days ago should be fixed by #572 - but that won't affect the original issue mentioned here. I think the best approach is to attach with strace the next time it happens, I won't be surprised if it's something similar to #572 ...

jkroonza avatar Oct 26 '25 14:10 jkroonza

sorry for naive question. @jkroonza do you maybe have core dump collection configured on that host ? it would be quite interesting to see "gdb bt"

chipitsine avatar Oct 28 '25 18:10 chipitsine

is that "signal 15" arriving from xl2tpd or (I assume) from some segmentation fault ?

chipitsine avatar Oct 28 '25 18:10 chipitsine

I do have some experience running xl2tpd. never seen anything like that.

chipitsine avatar Oct 28 '25 18:10 chipitsine

sorry for naive question. @jkroonza do you maybe have core dump collection configured on that host ? it would be quite interesting to see "gdb bt"

We'd have to manually send SEGV since it doesn't actually crash, it merely stalls, I suspect (and I didn't have the savvy to use strace to at least get a system call or something that it blocks on) to maybe use /proc/${relevant_pid}/fd/ to try and figure out which socket is the problematic one.

is that "signal 15" arriving from xl2tpd or (I assume) from some segmentation fault ?

SIGTERM - was me doing kill ${relevant_pid} to terminate the relevant ppp that wasn't making progress.

I do have some experience running xl2tpd. never seen anything like that.

Keep in mind I'm running with some extra patches: https://github.com/xelerance/xl2tpd/pulls/jkroonza

Specifically https://github.com/xelerance/xl2tpd/pull/277 could cause other issues, and I have identified some additional new segfaults in xl2tpd, or more to the point - I suspect it triggered some existing use-after-free use-cases that was already there, but due to ordering of the select() code never showed effect, but I've not had time to sit down and dig into it properly yet with gdb.

jkroonza avatar Oct 29 '25 07:10 jkroonza

use after free leads to segmentation fault and immediate process termination. likely it is not the case

chipitsine avatar Oct 29 '25 08:10 chipitsine

I was more successful with adding debug "printf" than dancing with gdb

chipitsine avatar Oct 29 '25 08:10 chipitsine