Intra icon indicating copy to clipboard operation
Intra copied to clipboard

Zombie startVpn threads spinning on runtime.futex

Open ignoramous opened this issue 4 years ago • 1 comments

Intra has consistent [ 0.3% | 0.6% ] background CPU usage which stem from detached startVpn-onNetworkConnected threads that for some reason are waiting on a runtime.futex along with other threads (spawned by go-mobile?) named Thread-N (where, N is a positive int: Thread-10, Thread-92...) waking up once every second and in our experiments, stealing up to [ 200ms | 500ms ] of CPU time every 10 seconds.

Steps to reproduce:

  1. Install Intra (preferably on a rooted device or emulator), and run it.
  2. Start / stop WiFi thrice or so.
  3. Observe top stats and/or use perfetto.
  4. Collect stack-traces of running threads.
  5. See that startVpn-onNetw is detached from the VM but is spinning on runtime.futex.
adb shell
# find running threads
top -H -b -p <intra-pid>
# stack-trace, requires root
head /proc/<pid>/tasks/<tid>/stack

# stack-trace but in /data/anr/trace_<id>
adb shell kill -3 <pid>
# from shell, pull the stack-trace out (requires root):
adb root && adb pull /data/anr/trace_<id> .
# or, pull stack-trace from /data/anr via bugreport (no root):
adb bugreport

# debug-trace (if stack-traces don't work), needs root
adb root && abd shell debuggerd -b <pid> > bout

Example stack-traces (not debug-trace), from RethinkDNS (an Intra fork).

# I see four more such threads "Thread-N". Blocked on futex?
"Thread-25" prio=5 tid=39 Native                                                
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x144c07c0 self=0x6f060a0800    
  | sysTid=7696 nice=0 cgrp=default sched=0/0 handle=0x6ed2caed50               
  | state=S schedstat=( 1379002999 1328594980 16006 ) utm=67 stm=70 core=0 HZ=100
  | stack=0x6ed2bb8000-0x6ed2bba000 stackSize=991KB                             
  | held mutexes=                                                               
  kernel: (couldn't read /proc/self/task/7696/stack)                            
  native: #00 pc 0000000000203c0c  /data/app/com.celzero.bravedns-Kc7nGZjRdwCmpKai1wxqNA==/base.apk (offset 944000) (runtime.futex+28)
  (no managed stack frames)                                                     

# and four more "startVpn-onNetw" threads, "not attached" to the VM (?), and blocked (?) on a futex.
"startVpn-onNetw" prio=10 (not attached)                                        
  | sysTid=7677 nice=0 cgrp=default                                             
  | state=S schedstat=( 3819114 1563 7 ) utm=0 stm=0 core=6 HZ=100              
  kernel: (couldn't read /proc/self/task/7677/stack)                            
  native: #00 pc 0000000000203c0c  /data/app/com.celzero.bravedns-Kc7nGZjRdwCmpKai1wxqNA==/base.apk (offset 944000) (runtime.futex+28)

# "not attached" sample 2. The other 2 also look the same, just the thread number is different.
"startVpn-onNetw" prio=10 (not attached)                                        
  | sysTid=7678 nice=0 cgrp=default                                             
  | state=S schedstat=( 4112652397 7301607460 128141 ) utm=92 stm=319 core=0 HZ=100
  kernel: (couldn't read /proc/self/task/7678/stack)                            
  native: #00 pc 0000000000203c10  /data/app/com.celzero.bravedns-Kc7nGZjRdwCmpKai1wxqNA==/base.apk (offset 944000) (runtime.futex+32)

ignoramous avatar Feb 22 '21 17:02 ignoramous

Just a note: On our Intra fork we moved tun2socks to Go1.18 (instead of Go1.15 at the time of this bug), replaced Threads with Kotlin Coroutines, and removed LwIP for gvisor/netstack... and we don't see that consistent [0.3% | 0.6%] bg CPU usage anymore. Any combination of the above might have done the trick is our guess.

ignoramous avatar Jul 25 '22 18:07 ignoramous