Segmentation fault when used with PyTorch

Open elistevens opened this issue 7 years ago • 0 comments

I get the following segmentation fault intermittently when profiling a program that uses PyTorch. I'm not sure whether PyTorch is relevant, since I haven't tried to minimize the program. The program appears to be exiting when the crash occurs, after running for about 5 minutes.

VMProf was invoked like so under Ubuntu 16.04:

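import os
import sys

import vmprof

# LunaTrainingApp is the application's own entry point (import not shown).
if __name__ == '__main__':
    # Profile only when the VMPROF environment variable is set.
    if os.getenv('VMPROF', None):
        PROFILE_FILE = 'vmprof_training.dat'
        flags = os.O_RDWR | os.O_CREAT | os.O_TRUNC
        if sys.platform == 'win32':
            flags |= os.O_BINARY
        outfd = os.open(PROFILE_FILE, flags)

        # Sample every 0.01 seconds, writing to the open file descriptor.
        vmprof.enable(outfd, period=0.01)
    try:
        LunaTrainingApp().main()
    finally:
        # Stop sampling even if the training run raises.
        if os.getenv('VMPROF', None):
            vmprof.disable()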

Unfortunately, the core file is larger than the available space on the system's hard drive (I think due to the artificially inflated RAM size of CUDA programs). However, I can reproduce the issue and run gdb commands if desired. Here's the backtrace:

#0  access_mem (as=<optimized out>, addr=140733193519104, val=0x7ffc393f6bb0, write=<optimized out>, arg=<optimized out>) at x86_64/Ginit.c:175
#1  0x00007ffff333b9dd in is_plt_entry (c=0x7ffc393f6d50) at x86_64/Gstep.c:43
#2  _ULx86_64_step (cursor=0x7ffc393f6d50) at x86_64/Gstep.c:126
#3  0x00007ffff355918f in vmp_walk_and_record_stack (frame=0x7ffc2eb556a8, result=result@entry=0x7fff95333020, max_depth=max_depth@entry=1019, signal=<optimized out>, signal@entry=1, pc=pc@entry=0)
    at src/vmp_stack.c:312
#4  0x00007ffff355a703 in get_stack_trace (current=current@entry=0xdca665b0, result=result@entry=0x7fff95333020, max_depth=max_depth@entry=1019, pc=pc@entry=0) at src/vmprof_unix.c:493
#5  0x00007ffff355a78f in _vmprof_sample_stack (p=p@entry=0x7fff95333000, tstate=tstate@entry=0xdca665b0, uc=uc@entry=0x7ffc393f7200) at src/vmprof_unix.c:98
#6  0x00007ffff355a912 in sigprof_handler (sig_nr=<optimized out>, info=<optimized out>, ucontext=<optimized out>) at src/vmprof_unix.c:242
#7  <signal handler called>
#8  0x00007fffb6373e2b in __device_stub__ZN5cudnn6detail24bn_fw_tr_1C11_singlereadIfLi512ELb1ELi1ELi2ELi20EEEv17cudnnTensorStructPKT_S2_PS3_PKfS8_ffPfS9_S9_S9_ffNS_15reduced_divisorEiSA_PNS0_19bnFwPersistentStateEifffiffP13cudnnStatus_tb(cudnnTensorStruct const&, float const*, cudnnTensorStruct const&, float*, float const*, float const*, float, float, float*, float*, float*, float*, float, float, cudnn::reduced_divisor&, int, cudnn::reduced_divisor&, cudnn::detail::bnFwPersistentState*, int, float, float, float, int, float, float, cudnnStatus_t*, bool) ()
   from /home/elis/edit/book/.venv/lib/python3.6/site-packages/torch/lib/libATen.so
#9  0x00007fff00020000 in ?? ()
#10 0x000000005daaaaab in ?? ()
#11 0x00007fff24200000 in ?? ()
#12 0x00007ffc00000001 in ?? ()
#13 0x00007fffb638e67c in cudnnBatchNormalizationForwardTraining () from /home/elis/edit/book/.venv/lib/python3.6/site-packages/torch/lib/libATen.so
#14 0x00007fffaefe1d3d in at::native::cudnn_batch_norm(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double, double) ()
   from /home/elis/edit/book/.venv/lib/python3.6/site-packages/torch/lib/libATen.so
#15 0x00007fffaf254d04 in at::Type::cudnn_batch_norm(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double, double) const ()
   from /home/elis/edit/book/.venv/lib/python3.6/site-packages/torch/lib/libATen.so
#16 0x00007fffd3ea3746 in torch::autograd::VariableType::cudnn_batch_norm (this=0x16e68f0, input=..., weight=..., bias=..., running_mean=..., running_var=..., training=true,
    exponential_average_factor=0.10000000000000001, epsilon=1.0000000000000001e-05) at torch/csrc/autograd/generated/VariableType.cpp:18662
#17 0x00007fffaefa36a7 in at::native::batch_norm(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double, double, bool) ()
   from /home/elis/edit/book/.venv/lib/python3.6/site-packages/torch/lib/libATen.so
#18 0x00007fffaf2547b6 in at::Type::batch_norm(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, double, double, bool) const ()
   from /home/elis/edit/book/.venv/lib/python3.6/site-packages/torch/lib/libATen.so
#19 0x00007fffd3e05e69 in torch::autograd::VariableType::batch_norm (this=0x16e68f0, input=..., weight=..., bias=..., running_mean=..., running_var=..., training=true, momentum=0.10000000000000001,
    eps=1.0000000000000001e-05, cudnn_enabled=true) at torch/csrc/autograd/generated/VariableType.cpp:18205
#20 0x00007fffd3f8533e in at::batch_norm (cudnn_enabled=true, eps=1.0000000000000001e-05, momentum=0.10000000000000001, training=true, running_var=..., running_mean=..., bias=..., weight=..., input=...)
    at /pytorch/torch/lib/tmp_install/include/ATen/Functions.h:2993
#21 torch::autograd::dispatch_batch_norm (cudnn_enabled=true, eps=1.0000000000000001e-05, momentum=0.10000000000000001, training=true, running_var=..., running_mean=..., bias=..., weight=..., input=...)
    at torch/csrc/autograd/generated/python_torch_functions_dispatch.h:941
#22 torch::autograd::THPVariable_batch_norm (self=<optimized out>, args=<optimized out>, kwargs=<optimized out>) at torch/csrc/autograd/generated/python_torch_functions.cpp:1419
#23 0x00000000004c4b0b in _PyCFunction_FastCallKeywords ()
#24 0x000000000054f3c4 in ?? ()
#25 0x0000000000553aaf in _PyEval_EvalFrameDefault ()
#26 0x000000000054efc1 in ?? ()
#27 0x000000000054f24d in ?? ()
#28 0x0000000000553aaf in _PyEval_EvalFrameDefault ()
#29 0x000000000054e4c8 in ?? ()
#30 0x00000000005582c2 in _PyFunction_FastCallDict ()
#31 0x0000000000459c11 in _PyObject_Call_Prepend ()
#32 0x000000000045969e in PyObject_Call ()
#33 0x0000000000552029 in _PyEval_EvalFrameDefault ()
#34 0x000000000054efc1 in ?? ()
#35 0x00000000005581e9 in _PyFunction_FastCallDict ()
#36 0x0000000000459c11 in _PyObject_Call_Prepend ()
#37 0x000000000045969e in PyObject_Call ()
#38 0x00000000004e050b in ?? ()
#39 0x0000000000459893 in _PyObject_FastCallDict ()
#40 0x000000000054f117 in ?? ()
#41 0x0000000000553aaf in _PyEval_EvalFrameDefault ()
#42 0x000000000054e4c8 in ?? ()
#43 0x00000000005582c2 in _PyFunction_FastCallDict ()
#44 0x0000000000459c11 in _PyObject_Call_Prepend ()
#45 0x000000000045969e in PyObject_Call ()
#46 0x0000000000552029 in _PyEval_EvalFrameDefault ()
#47 0x000000000054efc1 in ?? ()
#48 0x00000000005581e9 in _PyFunction_FastCallDict ()
#49 0x0000000000459c11 in _PyObject_Call_Prepend ()
#50 0x000000000045969e in PyObject_Call ()
#51 0x00000000004e050b in ?? ()
#52 0x0000000000459893 in _PyObject_FastCallDict ()
#53 0x000000000054f117 in ?? ()
#54 0x0000000000553aaf in _PyEval_EvalFrameDefault ()
#55 0x000000000054e4c8 in ?? ()
#56 0x00000000005582c2 in _PyFunction_FastCallDict ()
#57 0x0000000000459c11 in _PyObject_Call_Prepend ()
#58 0x000000000045969e in PyObject_Call ()
#59 0x0000000000552029 in _PyEval_EvalFrameDefault ()
#60 0x000000000054efc1 in ?? ()
#61 0x00000000005581e9 in _PyFunction_FastCallDict ()
#62 0x0000000000459c11 in _PyObject_Call_Prepend ()
#63 0x000000000045969e in PyObject_Call ()
#64 0x00000000004e050b in ?? ()
#65 0x0000000000459893 in _PyObject_FastCallDict ()
#66 0x000000000054f117 in ?? ()
#67 0x0000000000553aaf in _PyEval_EvalFrameDefault ()
#68 0x000000000054e4c8 in ?? ()
#69 0x00000000005582c2 in _PyFunction_FastCallDict ()
#70 0x0000000000459c11 in _PyObject_Call_Prepend ()
#71 0x000000000045969e in PyObject_Call ()
#72 0x0000000000552029 in _PyEval_EvalFrameDefault ()
#73 0x000000000054efc1 in ?? ()
#74 0x00000000005581e9 in _PyFunction_FastCallDict ()
#75 0x0000000000459c11 in _PyObject_Call_Prepend ()
#76 0x000000000045969e in PyObject_Call ()
#77 0x00000000004e050b in ?? ()
#78 0x0000000000459893 in _PyObject_FastCallDict ()
#79 0x000000000054f117 in ?? ()
#80 0x0000000000553aaf in _PyEval_EvalFrameDefault ()
#81 0x000000000054e4c8 in ?? ()
#82 0x00000000005582c2 in _PyFunction_FastCallDict ()
#83 0x0000000000459c11 in _PyObject_Call_Prepend ()
#84 0x000000000045969e in PyObject_Call ()
#85 0x0000000000552029 in _PyEval_EvalFrameDefault ()
#86 0x000000000054efc1 in ?? ()
#87 0x0000000000558146 in _PyFunction_FastCallDict ()
#88 0x0000000000459c11 in _PyObject_Call_Prepend ()
#89 0x000000000045969e in PyObject_Call ()
#90 0x00000000004e050b in ?? ()
#91 0x000000000045969e in PyObject_Call ()
#92 0x0000000000552029 in _PyEval_EvalFrameDefault ()
#93 0x000000000054efc1 in ?? ()
#94 0x000000000054ffee in PyEval_EvalCodeEx ()
#95 0x000000000048b86d in ?? ()
#96 0x000000000045969e in PyObject_Call ()
#97 0x0000000000552029 in _PyEval_EvalFrameDefault ()
#98 0x000000000054e4c8 in ?? ()
#99 0x000000000054f4f6 in ?? ()
#100 0x0000000000553aaf in _PyEval_EvalFrameDefault ()
#101 0x000000000054e4c8 in ?? ()
#102 0x000000000054f4f6 in ?? ()
#103 0x0000000000553aaf in _PyEval_EvalFrameDefault ()
#104 0x000000000054e4c8 in ?? ()
#105 0x00000000005582c2 in _PyFunction_FastCallDict ()
#106 0x0000000000459c11 in _PyObject_Call_Prepend ()
#107 0x000000000045969e in PyObject_Call ()
#108 0x000000000058e2c2 in ?? ()
#109 0x00007ffff7bbd7fc in start_thread (arg=0x7ffc393fe700) at pthread_create.c:465
#110 0x00007ffff6d44b5f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

$ .venv/bin/python
Python 3.6.3 (default, Oct 3 2017, 21:45:48)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

And I see vmprof-0.4.12.dist-info/ in my site-packages, so I'm guessing that's the version I'm using.
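
The version could also be confirmed from inside the interpreter rather than inferred from the dist-info directory. The following is a minimal sketch, not part of the original report: `installed_version` is a hypothetical helper name, and it assumes either `importlib.metadata` (Python 3.8+) or setuptools' `pkg_resources` (for older interpreters such as the 3.6.3 used here) is available.

```python
def installed_version(dist_name):
    """Return the installed version string of dist_name, or None if absent.

    Hypothetical helper for illustration; not part of vmprof.
    """
    try:
        from importlib import metadata  # standard library on Python 3.8+
    except ImportError:
        metadata = None

    if metadata is not None:
        try:
            return metadata.version(dist_name)
        except metadata.PackageNotFoundError:
            return None

    # Fallback for Python 3.6/3.7, assuming setuptools is installed.
    import pkg_resources
    try:
        return pkg_resources.get_distribution(dist_name).version
    except pkg_resources.DistributionNotFound:
        return None


if __name__ == '__main__':
    # Prints the version string, or None when vmprof is not installed.
    print('vmprof', installed_version('vmprof'))
```

Run with the same `.venv/bin/python` as above, this should report the version that the environment actually imports.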

Apr 27 '18 06:04 elistevens