Zombie startVpn threads spinning on runtime.futex
Intra shows a consistent [ 0.3% | 0.6% ] background CPU usage that stems from detached `startVpn-onNetworkConnected` threads which, for some reason, are waiting on a `runtime.futex`, along with other threads (spawned by go-mobile?) named `Thread-N` (where N is a positive int: Thread-10, Thread-92...). These threads wake up once every second and, in our experiments, steal up to [ 200ms | 500ms ] of CPU time every 10 seconds.
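The thread name `startVpn-onNetw` is the kernel's 15-byte truncation of `startVpn-onNetworkConnected`. Here is a minimal sketch of the thread-per-callback pattern that would produce such leaks, inferred from the thread names in the traces below (`NetworkListener` and the callback wiring are our assumptions, not Intra's actual code):

```kotlin
import android.net.ConnectivityManager
import android.net.Network

// Hypothetical sketch: a new named thread is spawned on every
// network-connected callback and never joined. If the work inside ends up
// parked on a Go-runtime futex, each WiFi toggle leaks one more thread.
class NetworkListener(private val startVpn: () -> Unit) :
    ConnectivityManager.NetworkCallback() {

    override fun onAvailable(network: Network) {
        Thread({ startVpn() }, "startVpn-onNetworkConnected").start()
    }
}
```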
Steps to reproduce:
- Install Intra (preferably on a rooted device or emulator), and run it.
- Start / stop WiFi thrice or so.
- Observe top stats and/or use perfetto.
- Collect stack-traces of running threads.
- See that `startVpn-onNetw` is detached from the VM but is spinning on `runtime.futex`.
adb shell
# find running threads
top -H -b -p <intra-pid>
# stack-trace, requires root
head /proc/<pid>/task/<tid>/stack
# or, dump stack-traces to /data/anr/trace_<id>
adb shell kill -3 <pid>
# from the host, pull the stack-trace out (requires root):
adb root && adb pull /data/anr/trace_<id> .
# or, pull the stack-trace from /data/anr via bugreport (no root):
adb bugreport
# debug-trace (if stack-traces don't work), needs root
adb root && adb shell debuggerd -b <pid> > bout
Example stack-traces (not debug-traces) from RethinkDNS (an Intra fork):
# I see four more such threads "Thread-N". Blocked on futex?
"Thread-25" prio=5 tid=39 Native
| group="main" sCount=1 dsCount=0 flags=1 obj=0x144c07c0 self=0x6f060a0800
| sysTid=7696 nice=0 cgrp=default sched=0/0 handle=0x6ed2caed50
| state=S schedstat=( 1379002999 1328594980 16006 ) utm=67 stm=70 core=0 HZ=100
| stack=0x6ed2bb8000-0x6ed2bba000 stackSize=991KB
| held mutexes=
kernel: (couldn't read /proc/self/task/7696/stack)
native: #00 pc 0000000000203c0c /data/app/com.celzero.bravedns-Kc7nGZjRdwCmpKai1wxqNA==/base.apk (offset 944000) (runtime.futex+28)
(no managed stack frames)
# and four more "startVpn-onNetw" threads, "not attached" to the VM (?), and blocked (?) on a futex.
"startVpn-onNetw" prio=10 (not attached)
| sysTid=7677 nice=0 cgrp=default
| state=S schedstat=( 3819114 1563 7 ) utm=0 stm=0 core=6 HZ=100
kernel: (couldn't read /proc/self/task/7677/stack)
native: #00 pc 0000000000203c0c /data/app/com.celzero.bravedns-Kc7nGZjRdwCmpKai1wxqNA==/base.apk (offset 944000) (runtime.futex+28)
# "not attached" sample 2. The other 2 also look the same, just the thread number is different.
"startVpn-onNetw" prio=10 (not attached)
| sysTid=7678 nice=0 cgrp=default
| state=S schedstat=( 4112652397 7301607460 128141 ) utm=92 stm=319 core=0 HZ=100
kernel: (couldn't read /proc/self/task/7678/stack)
native: #00 pc 0000000000203c10 /data/app/com.celzero.bravedns-Kc7nGZjRdwCmpKai1wxqNA==/base.apk (offset 944000) (runtime.futex+32)
Just a note: on our Intra fork, we moved tun2socks to Go 1.18 (from Go 1.15 at the time of this bug), replaced Threads with Kotlin Coroutines, and swapped LwIP for gvisor/netstack... and we no longer see that consistent [0.3% | 0.6%] background CPU usage. Our guess is that some combination of the above did the trick.
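For completeness, a rough sketch of that Thread-to-Coroutine swap, assuming a scope tied to the VPN service's lifetime (`VpnController`, `startVpn`, and the wiring here are illustrative placeholders, not the fork's actual code):

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.Job
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.cancel
import kotlinx.coroutines.launch

// Hypothetical sketch: instead of detaching a raw Thread per network event,
// launch the work in a scope whose jobs are tracked, so stale attempts get
// cancelled and nothing is orphaned across WiFi toggles.
class VpnController(private val startVpn: suspend () -> Unit) {
    private val scope = CoroutineScope(SupervisorJob() + Dispatchers.IO)
    private var pending: Job? = null

    fun onNetworkConnected() {
        pending?.cancel()            // drop any stale start attempt
        pending = scope.launch { startVpn() }
    }

    fun onDestroy() = scope.cancel() // tear everything down with the service
}
```

Cancellation is cooperative, though, so this alone wouldn't unpark a thread stuck inside native Go code; the Go 1.18 and gvisor/netstack moves likely mattered too, which is why we only guess at the combination.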