Transcoding hangs under load
While doing throughput testing I hit a situation where transcoding hangs (reproduced three times).
It was transcoding one 10-minute generated video into seven 720p renditions across ten simultaneous streams. Looking at the output files, most were completely transcoded, except for a few. nvidia-smi shows that the Livepeer process is still using the GPU, but the incomplete output files are not growing in size.
There is a stopped GCE instance on which I saw this; I've left it intact.
To reproduce:
- Start the Ivan-gpu-p100 instance
- Go to /home/dark/go-livepeer
- Run bench.sh
- Wait 12 minutes
- Look at the /disk-1-temp directory: there will be 70 output files, most of them complete (same size), some smaller, and those smaller files are not growing in size (a sketch of that check follows below)
The same thing also happened on the Genesis box with one 1080 Ti, trying to transcode 50 streams of official_test_source_2s_keys_24pfs.mp4 with these renditions: P144p30fps16x9;P360p30fps16x9;P576p30fps16x9. Out of 50 streams, only 38 reported done in the console.
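For reference, this is a minimal sketch (not part of bench.sh, purely illustrative) of the stall check described above: poll the /disk-1-temp output directory and flag files whose size did not change between polls. The glob pattern and polling interval are assumptions.

```go
// stallwatch: flag output files that have stopped growing between polls.
// Purely illustrative; directory, glob pattern and interval are assumptions.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func main() {
	dir := "/disk-1-temp"
	prevSizes := map[string]int64{}
	for {
		time.Sleep(30 * time.Second)
		files, err := filepath.Glob(filepath.Join(dir, "out_*.ts"))
		if err != nil {
			fmt.Println("glob error:", err)
			continue
		}
		for _, f := range files {
			fi, err := os.Stat(f)
			if err != nil {
				continue
			}
			if prev, seen := prevSizes[f]; seen && prev == fi.Size() {
				fmt.Printf("stalled: %s (%d bytes)\n", f, fi.Size())
			}
			prevSizes[f] = fi.Size()
		}
	}
}
```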
Thanks for the report, a couple of notes:
- Plex reported what sounds like a similar problem during their own benchmarks. Would this be related? https://www.elpamsoft.com/?p=Plex-Hardware-Transcoding
- It would be good to also run a similar test with ffmpeg to further isolate the issue, or at least rule out LPMS.
- Do we know what the hardware is theoretically capable of handling, given our real-time transcoding target? 70 concurrent 720p streams sounds high. If things work OK with a smaller number, we can use that as a basis for a hard cap.
> Plex reported what sounds like a similar problem during their own benchmarks. Would this be related? https://www.elpamsoft.com/?p=Plex-Hardware-Transcoding

In the provided link I can't find any problems reported; can you point me to what they're referring to?

> It would be good to also run a similar test with ffmpeg to further isolate the issue, or at least rule out LPMS.

Will do.

> Do we know what the hardware is theoretically capable of handling, given our real-time transcoding target? 70 concurrent 720p streams sounds high. If things work OK with a smaller number, we can use that as a basis for a hard cap.

For me it hung with eight GPUs, two streams per GPU, and four renditions (P144p30fps16x9;P360p30fps16x9;P576p30fps16x9;P720p30fps16x9) per stream. That looks very underloaded.
> In the provided link I can't find any problems reported; can you point me to what they're referring to?

@darkdarkdragon See the note on "Exceeding VRAM":

> Exceeding the VRAM usage will cause new transcodes to buffer indefinitely, even once VRAM has dropped below the maximum.
> For me it hung with eight GPUs, two streams per GPU, and four renditions (P144p30fps16x9;P360p30fps16x9;P576p30fps16x9;P720p30fps16x9) per stream. That looks very underloaded.

So just 8 encodes per GPU? Curious, is there a pattern to which streams complete and which remain incomplete?

Does it make a difference if the streams themselves are staggered a bit, e.g. started ~500ms-1s after one another?
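To illustrate the staggering idea, a rough Go sketch (startStream, the stream count and the gap are placeholders, not actual bench harness code):

```go
// Launch each stream's transcode with a small delay between starts
// instead of kicking them all off at once.
package main

import (
	"fmt"
	"sync"
	"time"
)

func runStaggered(numStreams int, gap time.Duration, startStream func(i int)) {
	var wg sync.WaitGroup
	for i := 0; i < numStreams; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			startStream(i)
		}(i)
		time.Sleep(gap) // e.g. 500ms-1s between stream starts
	}
	wg.Wait()
}

func main() {
	runStaggered(10, 750*time.Millisecond, func(i int) {
		fmt.Printf("starting stream %d\n", i) // placeholder for the per-stream transcode
	})
}
```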
> See the note on "Exceeding VRAM":

I thought that was something specific to their Plex client. Also, they're talking about new transcodes; in our case the transcode hangs halfway through.
> So just 8 encodes per GPU? Curious, is there a pattern to which streams complete and which remain incomplete?

How can I distinguish between streams?

But there is a pattern: last time, out of 70 output files, 7 were stuck at the same point (same file size).
It was one GPU, 10 streams with 7 renditions (the same 720p renditions) each.
So it looks like it was one stream that didn't complete, and yet not a single stream reported completion (exited from the Transcode3 function).
> How can I distinguish between streams?

I usually have a file naming convention, something like out_<device_idx>_<sequence>_<rendition>.ts and so forth.
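For example, a sketch of that convention (the values here are made up):

```go
package main

import "fmt"

// outName builds a name like out_<device_idx>_<sequence>_<rendition>.ts,
// so a stalled file immediately tells you which GPU, sequence and rendition it belongs to.
func outName(deviceIdx, seq int, rendition string) string {
	return fmt.Sprintf("out_%d_%d_%s.ts", deviceIdx, seq, rendition)
}

func main() {
	// e.g. device 0, sequence 3, 720p rendition
	fmt.Println(outName(0, 3, "P720p30fps16x9")) // out_0_3_P720p30fps16x9.ts
}
```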
> So it looks like it was one stream that didn't complete
Ah, interesting. Is this consistently reproducible? Does it stop working at the same place each time? Do more streams make the issue worse? Anything in the logs?
> I usually have a file naming convention.

I'm using random names; I just took a piece of code from our transcoder.go.
> Is this consistently reproducible?

It hangs in probably 90% of cases - once the same test with 10 streams passed and only hung on the next iteration, but most of the time it hangs. It hangs at different points in different runs, but at the same point within one run. I just checked the old data: in two runs out of three it was 7 files that did not complete (so one stream), but in one case it was 21 unfinished files.
Example results from one run:
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_06b1ef519dc674f48992.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_081d4ade8d2a3c1ccebb.ts
-rw-rw-r-- 1 dark dark 10747904 Oct 11 20:42 out_0b60f83f94845bf86bdb.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_127fb018986db33eeb72.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_1b177f2ea9bfbac0f796.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_1d71d5c0d404fe97a07b.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_23d7ebb33bb4732ab6ab.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_273ddeb0cb7a8134829d.ts
-rw-rw-r-- 1 dark dark 10747904 Oct 11 20:42 out_2a6e92d7eb06e602fa72.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_2ba82c0a2af65c7bac19.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_2c0563db05615117ad73.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_2c72b44bc9f7e3c7d337.ts
-rw-rw-r-- 1 dark dark 33030144 Oct 11 20:43 out_35c62d7365775d2ff0b5.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_3780f053557fd5e0e738.ts
-rw-rw-r-- 1 dark dark 10747904 Oct 11 20:42 out_379f4eeae702a4cab428.ts
-rw-rw-r-- 1 dark dark 53215232 Oct 11 20:45 out_3b38dc1de25b10da2f7f.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_3e0f379c137383a7ac37.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_3f15fb903342acd8cb08.ts
-rw-rw-r-- 1 dark dark 10747904 Oct 11 20:42 out_41a03e29addc2be05f67.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_4e7f25fb48a9c55ccd1b.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_5273f496a27cb0ad2736.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_52a42c3b2e9eb8c461dc.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_575275fe839622eb1c4b.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_598d0039ce6581b5ade8.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_59ff7ae27a6dd2241205.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_5bd3c660b6e005cf9327.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_5e69fd492b31b1706473.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_65ede9ca1fbdb5febdfa.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_68c1a65652abd162ae4b.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_68e458e519b9a5adbc71.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_6e136ebd831afba76b0b.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_6e8b023cd6143979fcbc.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_732df4c1925b798ceda5.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_75d4b95583f334572acf.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_7b0b183433193ebfa81f.ts
-rw-rw-r-- 1 dark dark 10747904 Oct 11 20:42 out_7e32f629d36554c32ff6.ts
-rw-rw-r-- 1 dark dark 10747904 Oct 11 20:42 out_88d71c01e64b13bc0b51.ts
-rw-rw-r-- 1 dark dark 33030144 Oct 11 20:43 out_8a5c12640211854c659f.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_8d8e10a69b9d02abed1a.ts
-rw-rw-r-- 1 dark dark 33030144 Oct 11 20:43 out_98c62c6a8fe2bb6df0aa.ts
-rw-rw-r-- 1 dark dark 33030144 Oct 11 20:43 out_9b6eb73feab0a27a2018.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_9f863b0f2288a5991e14.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_9fe95c8e6d2451e76285.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_a3d185993ee929c04be7.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_a6d9127d4a77b8e04114.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_ab04342228b924fd758e.ts
-rw-rw-r-- 1 dark dark 53215232 Oct 11 20:45 out_ae8563ac3e870255f72f.ts
-rw-rw-r-- 1 dark dark 53215232 Oct 11 20:45 out_afcb68ad60d011e99832.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_b0833441295c9ea414c3.ts
-rw-rw-r-- 1 dark dark 33030144 Oct 11 20:43 out_b13f3529ee054b5fcd5c.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_b398704392443f53ca8a.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_b405241d4f059ef2eb82.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_b740a463b3c0e8410f5c.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_b7aa00e25b32ff3234a1.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_c6c9d9248e6634f9a519.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_c89d6ff988fd434f8837.ts
-rw-rw-r-- 1 dark dark 53215232 Oct 11 20:45 out_ca2fb40ef950c0eff0d8.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_d5fb385bf48e2a0f6666.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_d639398ee79d95e80eb9.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_d6f153c2442063dd1fa2.ts
-rw-rw-r-- 1 dark dark 53215232 Oct 11 20:45 out_dc5945ab712c1f033e2a.ts
-rw-rw-r-- 1 dark dark 53215232 Oct 11 20:45 out_dd4ce8d97ce55f4f0ab7.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_de78db1b1716946c7eb5.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_e799a345c04a0cc76444.ts
-rw-rw-r-- 1 dark dark 33030144 Oct 11 20:43 out_f47cb2d6f6298e1a6160.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_f555b25b8deb7394cb16.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_f88c514aacede3e92ffa.ts
-rw-rw-r-- 1 dark dark 10747904 Oct 11 20:42 out_f908a59eb870b14ffa59.ts
-rw-rw-r-- 1 dark dark 53215232 Oct 11 20:45 out_faafbc9d2b7eb76e5f13.ts
-rw-rw-r-- 1 dark dark 33030144 Oct 11 20:43 out_fcdebd9c16b987d707b6.ts
@j0sh can I enable logging inside the ffmpeg lib we're using? Or I could put fprintfs into the LPMS C code... Given that they are stuck at the same point for one stream, it's something with decoding (or filters?). But I don't understand why the other streams are not reporting completion.
> I'm using random names; I just took a piece of code from our transcoder.go.
For the benchmarking, wouldn't it be useful to have a more consistent naming scheme to help diagnose issues like these?
> enable logging inside the ffmpeg lib we're using?

Either:
- Don't call ffmpeg.InitFFmpeg(); that keeps the log level at the default (INFO) (sketch of this option below)
- Modify lpms_init to set enhanced logging, e.g. AV_LOG_DEBUG or even AV_LOG_TRACE: https://github.com/livepeer/lpms/blob/master/ffmpeg/lpms_ffmpeg.c#L65
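A sketch of the first option on the Go side, assuming the lpms ffmpeg package path; the -ffmpegdebug flag is hypothetical, not an existing bench option:

```go
// Only call ffmpeg.InitFFmpeg() when verbose logs aren't needed,
// so libav* stays at its default INFO verbosity while debugging.
package main

import (
	"flag"

	"github.com/livepeer/lpms/ffmpeg"
)

func main() {
	ffmpegDebug := flag.Bool("ffmpegdebug", false, "skip InitFFmpeg to keep default (INFO) ffmpeg logging")
	flag.Parse()

	if !*ffmpegDebug {
		ffmpeg.InitFFmpeg() // normal path: sets the quieter LPMS log level
	}
	// ... rest of the benchmark ...
}
```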
@j0sh and it is not a VRAM-exceeded problem - I've just looked at the VRAM consumption; it was using around 7M of memory on the card with 16M.
> it was using around 7M of memory on the card with 16M.

Just 7MB? Is that right?

Also, how is the measuring done? nvidia-smi dmon and pmon only show the instantaneous utilization of VRAM (e.g., how much VRAM was actually read and written during the polling interval), while nvidia-smi by itself shows the absolute amount of VRAM in use on each card, broken down per process.

From the Plex issue, it seems to have more to do with throughput rather than the amount of memory in use. Anyway, agreed that it's hard to say whether it's the same issue we're having, but it seemed like an interesting and somewhat similar data point.
> Just 7MB? Is that right?

Just the nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26 Driver Version: 430.26 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:00:04.0 Off | 0 |
| N/A 43C P0 56W / 250W | 7423MiB / 16280MiB | 37% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2117 C /tmp/go-build655338158/b001/exe/livepeer 7411MiB |
+-----------------------------------------------------------------------------+
Traced it down to this line:
https://github.com/FFmpeg/FFmpeg/blob/f7f4691f9f745f6a087879ab855dd65f9f16879d/libavcodec/nvdec.c#L162
It enters cuvidDestroyDecoder (a function in the Nvidia library) and never exits.
So this is either:
- a bug in the Nvidia drivers
- FFmpeg messing something up
- the Go runtime messing with threads/locks
Next, I'll try Nvidia's beta driver.
Possibly related: cuda-memcheck is indicating issues within cuvidDestroyDecoder. This is also present in ffmpeg itself, which would seem to exclude issues on the golang/cgo side.
If the beta driver doesn't help, it would be interesting to see if the hang is also reproducible with ffmpeg itself.
cuda-memcheck with a custom program shows this:
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuCtxSynchronize + 0x12d) [0x245a8d]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvcuvid.so.1 [0x2a380]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvcuvid.so.1 [0x1981e]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvcuvid.so.1 (cuvidDestroyDecoder + 0x64) [0x17974]
========= Host Frame:/home/josh/compiled/lib/libavcodec.so.58 [0x7a265]
========= Host Frame:/home/josh/compiled/lib/libavcodec.so.58 (avcodec_close + 0x6c) [0xb0b21]
========= Host Frame:/home/josh/compiled/lib/libavcodec.so.58 (avcodec_free_context + 0x19) [0x59d4b9]
========= Host Frame:./a.out [0x4f5d]
========= Host Frame:./a.out [0x7ffd]
========= Host Frame:./a.out [0x838e]
========= Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x76db]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x3f) [0x12188f]
This also occurs with ffmpeg:
$ cuda-memcheck ffmpeg -hwaccel cuvid -hwaccel_device 1 -c:v h264_cuvid -i in/short_testharness.m3u8 -c:a copy -c:v h264_nvenc -vf fps=30,scale_cuda=w=1280:h=720 -f null - -c:v h264_nvenc -vf fps=30,scale_cuda=1024:h=576 -f null -
...
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuCtxSynchronize + 0x12d) [0x245a8d]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvcuvid.so.1 [0x2a380]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvcuvid.so.1 [0x1981e]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvcuvid.so.1 (cuvidDestroyDecoder + 0x64) [0x17974]
========= Host Frame:/home/josh/compiled/lib/libavcodec.so.58 [0x7a265]
========= Host Frame:/home/josh/compiled/lib/libavcodec.so.58 (avcodec_close + 0x6c) [0xb0b21]
========= Host Frame:ffmpeg [0x2f71b]
========= Host Frame:ffmpeg [0xbbee]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
========= Host Frame:ffmpeg [0xbdba]
=========
========= ERROR SUMMARY: 1 error