vmprof-python icon indicating copy to clipboard operation
vmprof-python copied to clipboard

infinite loop inside `vmp_walk_and_record_stack`

Open antocuni opened this issue 8 years ago • 3 comments

I have a PyPy process which is being profiled with vmprof.enable(native=False). I noticed it was consuming 100% of CPU, so I gdbed into it. This is what I found:

(gdb) where
#0  0x00007f5d7164dc10 in vmp_walk_and_record_stack () from /opt/pypy-c-jit-92954-e4158aeecc04-linux64/bin/libpypy-c.so
#1  0x00007f5d7164fd05 in sigprof_handler () from /opt/pypy-c-jit-92954-e4158aeecc04-linux64/bin/libpypy-c.so
#2  <signal handler called>
#3  0x00007f5d6e8932a3 in _int_free (av=0x7f5d6ebd7b20 <main_arena>, p=0x3596ecc0, have_lock=0) at malloc.c:3962
#4  0x00007f5d6e89753c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968
#5  0x00007f5d71644fea in g_restore_state () from /opt/pypy-c-jit-92954-e4158aeecc04-linux64/bin/libpypy-c.so
#6  0x00007f5d71645048 in _stacklet_switchstack.constprop.3 () from /opt/pypy-c-jit-92954-e4158aeecc04-linux64/bin/libpypy-c.so
#7  0x00007fffcef58270 in ?? ()
#8  0x00007f5d6b9a7a60 in ?? ()
#9  0x0000000035a1a320 in ?? ()
#10 0x000000003583c7a8 in ?? ()
#11 0x00007f5d73375010 in ?? ()
#12 0x00000000026aa640 in ?? ()
#13 0x0000000000000260 in ?? ()
#14 0x00007f5d716453c7 in stacklet_switch () from /opt/pypy-c-jit-92954-e4158aeecc04-linux64/bin/libpypy-c.so
#15 0x00007f5d7144b388 in pypy_g.switch () from /opt/pypy-c-jit-92954-e4158aeecc04-linux64/bin/libpypy-c.so
#16 0x00007f5d714276bd in pypy_g_StackletThread_switch () from /opt/pypy-c-jit-92954-e4158aeecc04-linux64/bin/libpypy-c.so
#17 0x00007f5d5e0cbfe7 in ?? ()
#18 0x00007f5d6b2db740 in ?? ()
#19 0x00007f5d70f99783 in pypy_g_crash_in_jit_4 () from /opt/pypy-c-jit-92954-e4158aeecc04-linux64/bin/libpypy-c.so

so, from this it seems that the vmprof signal triggered in the middle of a greenlet switch. Then, I looked at the disassembly:

(gdb)
Dump of assembler code for function vmp_walk_and_record_stack:
   0x00007f5d7164dbc0 <+0>:	push   %r15
   0x00007f5d7164dbc2 <+2>:	push   %r14
   0x00007f5d7164dbc4 <+4>:	mov    %rdi,%r14
   0x00007f5d7164dbc7 <+7>:	push   %r13
   0x00007f5d7164dbc9 <+9>:	push   %r12
   0x00007f5d7164dbcb <+11>:	push   %rbp
   0x00007f5d7164dbcc <+12>:	push   %rbx
   0x00007f5d7164dbcd <+13>:	mov    %rsi,%rbp
   0x00007f5d7164dbd0 <+16>:	mov    %edx,%ebx
   0x00007f5d7164dbd2 <+18>:	sub    $0x588,%rsp
   0x00007f5d7164dbd9 <+25>:	mov    0x1c481fc(%rip),%r15d        # 0x7f5d73295ddc <vmp_native_traces_enabled>
   0x00007f5d7164dbe0 <+32>:	mov    %r8,0x28(%rsp)
   0x00007f5d7164dbe5 <+37>:	test   %r15d,%r15d
   0x00007f5d7164dbe8 <+40>:	jne    0x7f5d7164dc78 <vmp_walk_and_record_stack+184>
   0x00007f5d7164dbee <+46>:	test   %rdi,%rdi
   0x00007f5d7164dbf1 <+49>:	je     0x7f5d7164dddf <vmp_walk_and_record_stack+543>
   0x00007f5d7164dbf7 <+55>:	cmp    $0x1,%ebx
   0x00007f5d7164dbfa <+58>:	mov    $0x2,%edx
   0x00007f5d7164dbff <+63>:	jg     0x7f5d7164dc30 <vmp_walk_and_record_stack+112>
   0x00007f5d7164dc01 <+65>:	jmpq   0x7f5d7164dddf <vmp_walk_and_record_stack+543>
   0x00007f5d7164dc06 <+70>:	nopw   %cs:0x0(%rax,%rax,1)
   0x00007f5d7164dc10 <+80>:	cmp    $0x3,%rax
   0x00007f5d7164dc14 <+84>:	je     0x7f5d7164dc58 <vmp_walk_and_record_stack+152>
   0x00007f5d7164dc16 <+86>:	mov    (%r14),%r14
   0x00007f5d7164dc19 <+89>:	lea    0x2(%r15),%edx
   0x00007f5d7164dc1d <+93>:	test   %r14,%r14
   0x00007f5d7164dc20 <+96>:	setne  %cl
   0x00007f5d7164dc23 <+99>:	cmp    %edx,%ebx
   0x00007f5d7164dc25 <+101>:	setge  %al
   0x00007f5d7164dc28 <+104>:	test   %al,%cl
   0x00007f5d7164dc2a <+106>:	je     0x7f5d7164dde2 <vmp_walk_and_record_stack+546>
   0x00007f5d7164dc30 <+112>:	mov    0x10(%r14),%rax
   0x00007f5d7164dc34 <+116>:	cmp    $0x1,%rax
=> 0x00007f5d7164dc38 <+120>:	jne    0x7f5d7164dc10 <vmp_walk_and_record_stack+80>
   0x00007f5d7164dc3a <+122>:	mov    0x8(%r14),%rcx
   0x00007f5d7164dc3e <+126>:	movslq %r15d,%rax
   0x00007f5d7164dc41 <+129>:	mov    %edx,%r15d
   0x00007f5d7164dc44 <+132>:	movq   $0x1,0x0(%rbp,%rax,8)
   0x00007f5d7164dc4d <+141>:	mov    %rcx,0x8(%rbp,%rax,8)
   0x00007f5d7164dc52 <+146>:	jmp    0x7f5d7164dc16 <vmp_walk_and_record_stack+86>
   0x00007f5d7164dc54 <+148>:	nopl   0x0(%rax)
   0x00007f5d7164dc58 <+152>:	mov    0x8(%r14),%rax
   0x00007f5d7164dc5c <+156>:	movslq %r15d,%rsi
   0x00007f5d7164dc5f <+159>:	mov    %ebx,%ecx
   0x00007f5d7164dc61 <+161>:	mov    %rbp,%rdi
   0x00007f5d7164dc64 <+164>:	mov    -0x8(%rax),%rdx
   0x00007f5d7164dc68 <+168>:	callq  0x7f5d7164d550 <vmprof_write_header_for_jit_addr>
   0x00007f5d7164dc6d <+173>:	mov    %eax,%r15d
   0x00007f5d7164dc70 <+176>:	jmp    0x7f5d7164dc16 <vmp_walk_and_record_stack+86>
   0x00007f5d7164dc72 <+178>:	nopw   0x0(%rax,%rax,1)
   0x00007f5d7164dc78 <+184>:	lea    0x80(%rsp),%r15

The infinite loop is between instructions +80 and +120. I tried to understand which part of the C code it corresponds to, but failed. However, since I am using native=False, I suspect it is inside an inlined call to vmp_walk_and_record_python_stack_only.

I didn't prove it yet, but I suspect it might be related to this pypy issue: https://bitbucket.org/pypy/pypy/issues/2683/infinite-traceback-with-greenlets

continulet seems to produce an infinite python stack when you switch. So, maybe vmprof triggered at the wrong time and started to sample the infinite stack. I'll try to find a reproducible example.

antocuni avatar Nov 11 '17 11:11 antocuni

Update; I tried to write a smaller reproducer; if you run this program under vmprof, it reliably segfaults inside vmp_walk_and_record_stack: not sure if it's another manifestation of the same problem or if it's unrelated though.

import itertools
import eventlet

def run(name):
    for i in itertools.count():
        if i % 10000 == 0:
            print name, i
        eventlet.sleep()

def main():
    a = eventlet.spawn(run, 'a')
    b = eventlet.spawn(run, 'b')
    c = eventlet.spawn(run, 'c')
    a.wait()
    b.wait()
    c.wait()

main()

antocuni avatar Nov 13 '17 10:11 antocuni

Ok, found it. The problem is that if vmprof samples the stack in the middle of a switch, it simply reads nonsense: most of the time it results in a segfault, but sometimes it might end up in a infinite cycle.

If my theory is correct, it is unrelated to PyPy issue 2683.

Fixed by this PyPy commit

I think that in theory this could happen also on CPython, if vmprof triggers in the middle of a greenlet switch. However, I never managed to reproduce the crash, so for now I'm just closing the issue.

antocuni avatar Nov 15 '17 15:11 antocuni

Reopening this because my fix is not complete: now it no longer segfaults but it stop taking samples after a switch(). I'll start a PyPy branch to fix it

antocuni avatar Nov 27 '17 20:11 antocuni