Segmentation violation in `libnvcuvid.so.1`
When go-livepeer is under heavy load and video memory is constantly exhausted, the node often crashes with a segmentation fault.
Stack trace:
#0 0x00007fff8415d7e0 in ?? () from /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
#1 0x00007fff8415d952 in ?? () from /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
#2 0x00007fff8415d9ea in ?? () from /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
#3 0x00007fff841147c6 in ?? () from /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
#4 0x00007fff8412974b in ?? () from /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
#5 0x00007fff8410d665 in ?? () from /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
#6 0x00007fff381e70f3 in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.1
#7 0x00007fff381e26da in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.1
#8 0x00007fff381f1499 in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.1
#9 0x0000000000496c75 in nvenc_setup_encoder (avctx=avctx@entry=0x7fffcc43cc40) at libavcodec/nvenc.c:1259
#10 0x0000000000498758 in ff_nvenc_encode_init (avctx=0x7fffcc43cc40) at libavcodec/nvenc.c:1553
#11 0x00000000013cd05c in avcodec_open2 (avctx=avctx@entry=0x7fffcc43cc40, codec=codec@entry=0x2c06740 <ff_h264_nvenc_encoder>, options=0xc008cde1d0) at libavcodec/utils.c:951
#12 0x0000000000fd4200 in open_output (ictx=0x7fffe8001598, octx=0x7fffe80015f8) at lpms_ffmpeg.c:693
#13 transcode (h=h@entry=0x7fffe8001590, inp=inp@entry=0xc00000ff60, params=params@entry=0xc008cde180, results=results@entry=0xc0000c2410, decoded_results=decoded_results@entry=0xc0000c2440) at lpms_ffmpeg.c:1145
#14 0x0000000000fd4d68 in lpms_transcode (inp=0xc00000ff60, params=0xc008cde180, results=0xc0000c2410, nb_outputs=1, decoded_results=0xc0000c2440) at lpms_ffmpeg.c:1308
#15 0x0000000000fd1376 in _cgo_f32e5de116c8_Cfunc_lpms_transcode (v=0xc000079938) at cgo-gcc-prolog:140
#16 0x00000000004fedd0 in runtime.asmcgocall () at /usr/lib/go-1.13/src/runtime/asm_amd64.s:655
#17 0x0000000000000040 in ?? ()
#18 0x0000000001c32e80 in type.* ()
#19 0x00000000004fb401 in runtime.(*mheap).setSpan (h=<optimized out>, base=0, s=0xc000079938) at /usr/lib/go-1.13/src/runtime/mheap.go:1143
#20 runtime.(*mheap).scavengeSplit.func1 (s=0x4d3600 <runtime.mstart>) at /usr/lib/go-1.13/src/runtime/mheap.go:1459
#21 0x000000c0002f1980 in ?? ()
#22 0x00000000004d3600 in ?? () at /usr/lib/go-1.13/src/runtime/proc.go:1080
#23 0x0000000000000000 in ?? ()
nvenc.c:1259 is:
nv_status = p_nvenc->nvEncInitializeEncoder(ctx->nvencoder, &ctx->init_encode_params);
I think it is one of:
- We are not handling some errors correctly and, as a result, pass invalid data down to the Nvidia drivers, which leads to the segmentation fault
- FFmpeg's code does not handle errors correctly and passes invalid data to the drivers
- It is simply a bug in Nvidia's code
@j0sh What do you think about this one? It was hard to reproduce in GCP, but on the rig with 1660s I was hitting it often during my testing.
Probably the moral of the story is, "don't overwhelm the system" 😄
Combined with https://github.com/livepeer/lpms/issues/158 , it sounds like it may be a good idea to put reasonable limits in somewhere until we can spend the time to narrow this down further.
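As a rough illustration of what such a limit could look like, here is a minimal sketch of a per-GPU cap on concurrent transcode sessions, using a buffered channel as a counting semaphore. All names (`SessionLimiter`, `TryAcquire`, `ErrGPUBusy`) and the limit value are hypothetical and are not part of go-livepeer or LPMS; a safe per-card number would have to be measured on real hardware.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrGPUBusy is returned when a GPU already runs the maximum number of sessions.
// Hypothetical name, for illustration only.
var ErrGPUBusy = errors.New("gpu session limit reached")

// SessionLimiter caps the number of concurrent transcode sessions per GPU.
// The map is only written in the constructor, so concurrent reads are safe.
type SessionLimiter struct {
	slots map[string]chan struct{} // one buffered channel per GPU, used as a counting semaphore
}

// NewSessionLimiter creates a limiter that allows maxPerGPU sessions on each listed GPU.
func NewSessionLimiter(gpus []string, maxPerGPU int) *SessionLimiter {
	l := &SessionLimiter{slots: make(map[string]chan struct{}, len(gpus))}
	for _, g := range gpus {
		l.slots[g] = make(chan struct{}, maxPerGPU)
	}
	return l
}

// TryAcquire reserves a slot on the given GPU without blocking.
// On success it returns a release function that is safe to call more than once.
func (l *SessionLimiter) TryAcquire(gpu string) (release func(), err error) {
	ch, ok := l.slots[gpu]
	if !ok {
		return nil, fmt.Errorf("unknown gpu %q", gpu)
	}
	select {
	case ch <- struct{}{}:
		var once sync.Once
		return func() { once.Do(func() { <-ch }) }, nil
	default:
		// Reject instead of queueing, so the caller can pick another GPU
		// or refuse the job before any encoder is initialized.
		return nil, ErrGPUBusy
	}
}

func main() {
	limiter := NewSessionLimiter([]string{"0"}, 2) // 2 is an arbitrary example limit
	release, _ := limiter.TryAcquire("0")
	_, _ = limiter.TryAcquire("0")
	_, err := limiter.TryAcquire("0")
	fmt.Println("third session:", err)
	release()
	_, err = limiter.TryAcquire("0")
	fmt.Println("after release:", err)
}
```

The idea is to reject work before `avcodec_open2` is ever reached on an overloaded card. This only counts sessions and does not look at actual video memory usage.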
How many streams / what configuration until you started seeing segfaults? I've also seen hangs on the 1660 a couple times (probably the same problem as #158, but not certain yet). Unfortunately the hangs have been under relatively light load such as 4 input x 4 output renditions x 8 cards (128 encodes total for the system, but 16 encodes per card and 4 decodes).
I have a suspicion that something is weird with the 1660 rig anyway because it's about 2x slower transcoding compared to the 1070s, despite having better all-around hardware specs.
> Probably the moral of the story is, "don't overwhelm the system" 😄
Yep, but the problem here is that these issues manifest themselves when there is not enough video memory, and we don't have a way to constrain the system's load by video memory.
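One possible way to approximate a memory-based limit is sketched below: poll free video memory with `nvidia-smi` and refuse new sessions when a card drops below a budget. This is only an assumption about how it could be done, not something go-livepeer does. The helper names and the 500 MiB budget are hypothetical, and the check is inherently racy, because memory can be taken by another session between the query and the encoder initialization.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// parseFreeMiB parses the output of
//   nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
// which prints one integer (free memory in MiB) per GPU, one GPU per line.
func parseFreeMiB(out string) ([]int, error) {
	var free []int
	for _, line := range strings.Split(strings.TrimSpace(out), "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		n, err := strconv.Atoi(line)
		if err != nil {
			return nil, fmt.Errorf("unexpected nvidia-smi output %q: %w", line, err)
		}
		free = append(free, n)
	}
	return free, nil
}

// hasHeadroom reports whether GPU idx has at least needMiB of free memory.
func hasHeadroom(free []int, idx, needMiB int) bool {
	return idx >= 0 && idx < len(free) && free[idx] >= needMiB
}

// queryFreeMiB runs nvidia-smi and returns the free memory of every GPU in MiB.
func queryFreeMiB() ([]int, error) {
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=memory.free", "--format=csv,noheader,nounits").Output()
	if err != nil {
		return nil, err
	}
	return parseFreeMiB(string(out))
}

func main() {
	free, err := queryFreeMiB()
	if err != nil {
		fmt.Println("could not query GPUs:", err)
		return
	}
	const needMiB = 500 // assumed per-session budget; measure on real hardware
	for i := range free {
		fmt.Printf("gpu %d: free=%d MiB, accept new session: %v\n",
			i, free[i], hasHeadroom(free, i, needMiB))
	}
}
```

Spawning `nvidia-smi` per request would be too slow, so in practice the value would need to be cached and refreshed periodically, or read through NVML instead.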
> How many streams / what configuration until you started seeing segfaults?
I don't remember :(
> I have a suspicion that something is weird with the 1660 rig anyway because it's about 2x slower transcoding compared to the 1070s
Strange, for me the 1660 and the 1070 ran at the same speed.
We just hit a segmentation violation in AC (on a mainnet orchestrator).
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi fatal error: unexpected signal during runtime execution
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7f499c8e49b7]
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi runtime stack:
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi runtime.throw(0x1c973d6, 0x2a)
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi /usr/lib/go-1.13/src/runtime/panic.go:774 +0x72
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi runtime.sigpanic()
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi /usr/lib/go-1.13/src/runtime/signal_unix.go:378 +0x47c
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi goroutine 623 [syscall]:
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi runtime.cgocall(0xf76d00, 0xc0002bc8a0, 0xc0002bc8b0)
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi /usr/lib/go-1.13/src/runtime/cgocall.go:128 +0x5b fp=0xc0002bc870 sp=0xc0002bc838 pc=0x49c44b
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi github.com/livepeer/lpms/ffmpeg._Cfunc_lpms_transcode(0xc0007bd000, 0xc000b6a000, 0xc0000b1fb0, 0x3, 0xc0007fcad0, 0xc000000000)
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi _cgo_gotypes.go:270 +0x4d fp=0xc0002bc8a0 sp=0xc0002bc870 pc=0xc2301d
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi github.com/livepeer/lpms/ffmpeg.(*Transcoder).Transcode.func9(0xc0007bd000, 0xc000b6a000, 0xc0000b1fb0, 0xc000b6a000, 0x3, 0x3, 0xc0007fcad0, 0xc000a2a3b0)
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi /go/pkg/mod/github.com/livepeer/[email protected]/ffmpeg/ffmpeg.go:290 +0xac fp=0xc0002bc8e0 sp=0xc0002bc8a0 pc=0xc2624c
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi github.com/livepeer/lpms/ffmpeg.(*Transcoder).Transcode(0xc0007bcf80, 0xc0002bce08, 0xc0000d4240, 0x3, 0x3, 0x0, 0x0, 0x0)
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi /go/pkg/mod/github.com/livepeer/[email protected]/ffmpeg/ffmpeg.go:290 +0xabc fp=0xc0002bcda8 sp=0xc0002bc8e0 pc=0xc2432c
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi github.com/livepeer/go-livepeer/core.(*NvidiaTranscoder).Transcode(0xc000b6d890, 0xc000b2e450, 0x24, 0xc00067cb40, 0x4a, 0xc0003397a0, 0x3, 0x3, 0xc000b220a0, 0xc0002daf78, ...)
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi /build/core/transcoder.go:74 +0x116 fp=0xc0002bce40 sp=0xc0002bcda8 pc=0xf0de16
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi github.com/livepeer/go-livepeer/core.(*transcoderSession).loop(0xc000b6d8c0)
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi /build/core/lb.go:183 +0x1d4 fp=0xc0002bcfb8 sp=0xc0002bce40 pc=0xf04734
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi github.com/livepeer/go-livepeer/core.(*LoadBalancingTranscoder).createSession.func2(0xc000b6d8c0, 0xc0003bbf00)
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi /build/core/lb.go:107 +0x2b fp=0xc0002bcfd0 sp=0xc0002bcfb8 pc=0xf0ebeb
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi runtime.goexit()
Looks like it is the same problem.