Transcoding hangs under load
While doing throughput testing I hit a situation where transcoding hangs (reproduced three times).
It was transcoding one 10-minute generated video into seven 720p renditions across ten simultaneous streams. Looking at the output files, most were completely transcoded, except for a few. nvidia-smi shows that the Livepeer process is still using the GPU, but the incomplete output files are not growing in size.
There is a stopped GCE instance on which I saw this; I've left it intact.
To reproduce:
- Start the Ivan-gpu-p100 instance
- Go to /home/dark/go-livepeer
- Run bench.sh
- Wait 12 minutes
- Look at the /disk-1-temp directory: there will be 70 output files, most of them complete (same size), some smaller, and those smaller files are not growing in size (a sketch of that check follows below)
The same thing also happened on the Genesis box with one 1080 Ti, trying to transcode 50 streams of official_test_source_2s_keys_24pfs.mp4 with these renditions: P144p30fps16x9;P360p30fps16x9;P576p30fps16x9. Out of 50 streams, only 38 reported done in the console.
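For reference, this is a minimal sketch (not part of bench.sh, purely illustrative) of the stall check described above: poll the /disk-1-temp output directory and flag files whose size did not change between polls. The glob pattern and polling interval are assumptions.

```go
// stallwatch: flag output files that have stopped growing between polls.
// Purely illustrative; directory, glob pattern and interval are assumptions.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func main() {
	dir := "/disk-1-temp"
	prevSizes := map[string]int64{}
	for {
		time.Sleep(30 * time.Second)
		files, err := filepath.Glob(filepath.Join(dir, "out_*.ts"))
		if err != nil {
			fmt.Println("glob error:", err)
			continue
		}
		for _, f := range files {
			fi, err := os.Stat(f)
			if err != nil {
				continue
			}
			if prev, seen := prevSizes[f]; seen && prev == fi.Size() {
				fmt.Printf("stalled: %s (%d bytes)\n", f, fi.Size())
			}
			prevSizes[f] = fi.Size()
		}
	}
}
```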
Thanks for the report, a couple of notes:
- Plex reported what sounds like a similar problem during their own benchmarks. Would this be related? https://www.elpamsoft.com/?p=Plex-Hardware-Transcoding
- It would be good to also run a similar test with ffmpeg to further isolate the issue, or at least rule out LPMS.
- Do we know what the hardware is theoretically capable of handling, given our real-time transcoding target? 70 concurrent 720p streams sounds high. If things work OK with a smaller number, we can use that as a basis for a hard cap.
> Plex reported what sounds like a similar problem during their own benchmarks. Would this be related? https://www.elpamsoft.com/?p=Plex-Hardware-Transcoding

In the provided link I can't find any problems reported; can you point me to what they're referring to?

> It would be good to also run a similar test with ffmpeg to further isolate the issue, or at least rule out LPMS.

Will do.

> Do we know what the hardware is theoretically capable of handling, given our real-time transcoding target? 70 concurrent 720p streams sounds high. If things work OK with a smaller number, we can use that as a basis for a hard cap.

For me it hung with eight GPUs, two streams per GPU, and four renditions (P144p30fps16x9;P360p30fps16x9;P576p30fps16x9;P720p30fps16x9) per stream. That looks very underloaded.
> In the provided link I can't find any problems reported; can you point me to what they're referring to?

@darkdarkdragon See the note on "Exceeding VRAM":

> Exceeding the VRAM usage will cause new transcodes to buffer indefinitely, even once VRAM has dropped below the maximum.
> For me it hung with eight GPUs, two streams per GPU, and four renditions (P144p30fps16x9;P360p30fps16x9;P576p30fps16x9;P720p30fps16x9) per stream. That looks very underloaded.

So just 8 encodes per GPU? Curious, is there a pattern to which streams complete and which remain incomplete?

Does it make a difference if the streams themselves are staggered a bit, e.g. started ~500ms-1s after one another?
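To illustrate the staggering idea, a rough Go sketch (startStream, the stream count and the gap are placeholders, not actual bench harness code):

```go
// Launch each stream's transcode with a small delay between starts
// instead of kicking them all off at once.
package main

import (
	"fmt"
	"sync"
	"time"
)

func runStaggered(numStreams int, gap time.Duration, startStream func(i int)) {
	var wg sync.WaitGroup
	for i := 0; i < numStreams; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			startStream(i)
		}(i)
		time.Sleep(gap) // e.g. 500ms-1s between stream starts
	}
	wg.Wait()
}

func main() {
	runStaggered(10, 750*time.Millisecond, func(i int) {
		fmt.Printf("starting stream %d\n", i) // placeholder for the per-stream transcode
	})
}
```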
> See the note on "Exceeding VRAM":

I thought that was something specific to their Plex client. Also, they're talking about new transcodes; in our case the transcode hangs halfway through.
> So just 8 encodes per GPU? Curious, is there a pattern to which streams complete and which remain incomplete?

How can I distinguish between streams?

But there is a pattern: last time, out of 70 output files, 7 were stuck at the same point (same file size).
It was one GPU, 10 streams with 7 renditions (the same 720p renditions) each.
So it looks like it was one stream that didn't complete, and yet not a single stream reported completion (exited from the Transcode3 function).
> How can I distinguish between streams?

I usually have a file naming convention, something like out_<device_idx>_<sequence>_<rendition>.ts and so forth.
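For example, a sketch of that convention (the values here are made up):

```go
package main

import "fmt"

// outName builds a name like out_<device_idx>_<sequence>_<rendition>.ts,
// so a stalled file immediately tells you which GPU, sequence and rendition it belongs to.
func outName(deviceIdx, seq int, rendition string) string {
	return fmt.Sprintf("out_%d_%d_%s.ts", deviceIdx, seq, rendition)
}

func main() {
	// e.g. device 0, sequence 3, 720p rendition
	fmt.Println(outName(0, 3, "P720p30fps16x9")) // out_0_3_P720p30fps16x9.ts
}
```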
> So it looks like it was one stream that didn't complete
Ah, interesting. Is this consistently reproducible? Does it stop working at the same place each time? Do more streams make the issue worse? Anything in the logs?
> I usually have a file naming convention.

I'm using random names; I just took a piece of code from our transcoder.go.
> Is this consistently reproducible?

It hangs in probably 90% of cases - once the same test with 10 streams passed and only hung on the next iteration, but most of the time it hangs. It hangs at different points in different runs, but at the same point within one run. I just checked the old data: in two runs out of three it was 7 files that did not complete (so one stream), but in one case it was 21 unfinished files.
Example results from one run:
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_06b1ef519dc674f48992.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_081d4ade8d2a3c1ccebb.ts
-rw-rw-r-- 1 dark dark 10747904 Oct 11 20:42 out_0b60f83f94845bf86bdb.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_127fb018986db33eeb72.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_1b177f2ea9bfbac0f796.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_1d71d5c0d404fe97a07b.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_23d7ebb33bb4732ab6ab.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_273ddeb0cb7a8134829d.ts
-rw-rw-r-- 1 dark dark 10747904 Oct 11 20:42 out_2a6e92d7eb06e602fa72.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_2ba82c0a2af65c7bac19.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_2c0563db05615117ad73.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_2c72b44bc9f7e3c7d337.ts
-rw-rw-r-- 1 dark dark 33030144 Oct 11 20:43 out_35c62d7365775d2ff0b5.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_3780f053557fd5e0e738.ts
-rw-rw-r-- 1 dark dark 10747904 Oct 11 20:42 out_379f4eeae702a4cab428.ts
-rw-rw-r-- 1 dark dark 53215232 Oct 11 20:45 out_3b38dc1de25b10da2f7f.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_3e0f379c137383a7ac37.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_3f15fb903342acd8cb08.ts
-rw-rw-r-- 1 dark dark 10747904 Oct 11 20:42 out_41a03e29addc2be05f67.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_4e7f25fb48a9c55ccd1b.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_5273f496a27cb0ad2736.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_52a42c3b2e9eb8c461dc.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_575275fe839622eb1c4b.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_598d0039ce6581b5ade8.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_59ff7ae27a6dd2241205.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_5bd3c660b6e005cf9327.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_5e69fd492b31b1706473.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_65ede9ca1fbdb5febdfa.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_68c1a65652abd162ae4b.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_68e458e519b9a5adbc71.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_6e136ebd831afba76b0b.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_6e8b023cd6143979fcbc.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_732df4c1925b798ceda5.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_75d4b95583f334572acf.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_7b0b183433193ebfa81f.ts
-rw-rw-r-- 1 dark dark 10747904 Oct 11 20:42 out_7e32f629d36554c32ff6.ts
-rw-rw-r-- 1 dark dark 10747904 Oct 11 20:42 out_88d71c01e64b13bc0b51.ts
-rw-rw-r-- 1 dark dark 33030144 Oct 11 20:43 out_8a5c12640211854c659f.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_8d8e10a69b9d02abed1a.ts
-rw-rw-r-- 1 dark dark 33030144 Oct 11 20:43 out_98c62c6a8fe2bb6df0aa.ts
-rw-rw-r-- 1 dark dark 33030144 Oct 11 20:43 out_9b6eb73feab0a27a2018.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_9f863b0f2288a5991e14.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_9fe95c8e6d2451e76285.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_a3d185993ee929c04be7.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_a6d9127d4a77b8e04114.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_ab04342228b924fd758e.ts
-rw-rw-r-- 1 dark dark 53215232 Oct 11 20:45 out_ae8563ac3e870255f72f.ts
-rw-rw-r-- 1 dark dark 53215232 Oct 11 20:45 out_afcb68ad60d011e99832.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_b0833441295c9ea414c3.ts
-rw-rw-r-- 1 dark dark 33030144 Oct 11 20:43 out_b13f3529ee054b5fcd5c.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_b398704392443f53ca8a.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_b405241d4f059ef2eb82.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_b740a463b3c0e8410f5c.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_b7aa00e25b32ff3234a1.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_c6c9d9248e6634f9a519.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_c89d6ff988fd434f8837.ts
-rw-rw-r-- 1 dark dark 53215232 Oct 11 20:45 out_ca2fb40ef950c0eff0d8.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_d5fb385bf48e2a0f6666.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_d639398ee79d95e80eb9.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_d6f153c2442063dd1fa2.ts
-rw-rw-r-- 1 dark dark 53215232 Oct 11 20:45 out_dc5945ab712c1f033e2a.ts
-rw-rw-r-- 1 dark dark 53215232 Oct 11 20:45 out_dd4ce8d97ce55f4f0ab7.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_de78db1b1716946c7eb5.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_e799a345c04a0cc76444.ts
-rw-rw-r-- 1 dark dark 33030144 Oct 11 20:43 out_f47cb2d6f6298e1a6160.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_f555b25b8deb7394cb16.ts
-rw-rw-r-- 1 dark dark 95068592 Oct 11 20:47 out_f88c514aacede3e92ffa.ts
-rw-rw-r-- 1 dark dark 10747904 Oct 11 20:42 out_f908a59eb870b14ffa59.ts
-rw-rw-r-- 1 dark dark 53215232 Oct 11 20:45 out_faafbc9d2b7eb76e5f13.ts
-rw-rw-r-- 1 dark dark 33030144 Oct 11 20:43 out_fcdebd9c16b987d707b6.ts
@j0sh can I enable logging inside the ffmpeg lib we're using? Or I could put fprintfs into the LPMS C code... Given that they are stuck at the same point for one stream, it's something with decoding (or filters?). But I don't understand why the other streams are not reporting completion.
> I'm using random names; I just took a piece of code from our transcoder.go.
For the benchmarking, wouldn't it be useful to have a more consistent naming scheme to help diagnose issues like these?
> enable logging inside the ffmpeg lib we're using?

Either:
- Don't call ffmpeg.InitFFmpeg(); that keeps the log level at the default (INFO) (sketch of this option below)
- Modify lpms_init to set enhanced logging, e.g. AV_LOG_DEBUG or even AV_LOG_TRACE: https://github.com/livepeer/lpms/blob/master/ffmpeg/lpms_ffmpeg.c#L65
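A sketch of the first option on the Go side, assuming the lpms ffmpeg package path; the -ffmpegdebug flag is hypothetical, not an existing bench option:

```go
// Only call ffmpeg.InitFFmpeg() when verbose logs aren't needed,
// so libav* stays at its default INFO verbosity while debugging.
package main

import (
	"flag"

	"github.com/livepeer/lpms/ffmpeg"
)

func main() {
	ffmpegDebug := flag.Bool("ffmpegdebug", false, "skip InitFFmpeg to keep default (INFO) ffmpeg logging")
	flag.Parse()

	if !*ffmpegDebug {
		ffmpeg.InitFFmpeg() // normal path: sets the quieter LPMS log level
	}
	// ... rest of the benchmark ...
}
```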
@j0sh and it is not a VRAM-exceeded problem - I've just looked at the VRAM consumption; it was using around 7M of memory on the card with 16M.
> it was using around 7M of memory on the card with 16M.

Just 7MB? Is that right?

Also, how is the measuring done? nvidia-smi dmon and pmon only show the instantaneous utilization of VRAM (e.g., how much VRAM was actually read and written during the polling interval), while nvidia-smi by itself shows the absolute amount of VRAM in use on each card, broken down per process.

From the Plex issue, it seems to have more to do with throughput rather than the amount of memory in use. Anyway, agreed that it's hard to say whether it's the same issue we're having, but it seemed like an interesting and somewhat similar data point.
> Just 7MB? Is that right?

Just the nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26 Driver Version: 430.26 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:00:04.0 Off | 0 |
| N/A 43C P0 56W / 250W | 7423MiB / 16280MiB | 37% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2117 C /tmp/go-build655338158/b001/exe/livepeer 7411MiB |
+-----------------------------------------------------------------------------+
Traced it down to this line:
https://github.com/FFmpeg/FFmpeg/blob/f7f4691f9f745f6a087879ab855dd65f9f16879d/libavcodec/nvdec.c#L162
It enters cuvidDestroyDecoder (a function in the Nvidia library) and never exits.
So this is either:
- a bug in the Nvidia drivers
- FFmpeg messing something up
- the Go runtime messing with threads/locks
Next, I'll try Nvidia's beta driver.
Possibly related: cuda-memcheck is indicating issues within cuvidDestroyDecoder. This is also present in ffmpeg itself, which would seem to exclude issues on the golang/cgo side.
If the beta driver doesn't help, it would be interesting to see if the hang is also reproducible with ffmpeg itself.
cuda-memcheck with a custom program shows this:
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuCtxSynchronize + 0x12d) [0x245a8d]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvcuvid.so.1 [0x2a380]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvcuvid.so.1 [0x1981e]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvcuvid.so.1 (cuvidDestroyDecoder + 0x64) [0x17974]
========= Host Frame:/home/josh/compiled/lib/libavcodec.so.58 [0x7a265]
========= Host Frame:/home/josh/compiled/lib/libavcodec.so.58 (avcodec_close + 0x6c) [0xb0b21]
========= Host Frame:/home/josh/compiled/lib/libavcodec.so.58 (avcodec_free_context + 0x19) [0x59d4b9]
========= Host Frame:./a.out [0x4f5d]
========= Host Frame:./a.out [0x7ffd]
========= Host Frame:./a.out [0x838e]
========= Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x76db]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x3f) [0x12188f]
This also occurs with ffmpeg:
$ cuda-memcheck ffmpeg -hwaccel cuvid -hwaccel_device 1 -c:v h264_cuvid -i in/short_testharness.m3u8 -c:a copy -c:v h264_nvenc -vf fps=30,scale_cuda=w=1280:h=720 -f null - -c:v h264_nvenc -vf fps=30,scale_cuda=1024:h=576 -f null -
...
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuCtxSynchronize + 0x12d) [0x245a8d]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvcuvid.so.1 [0x2a380]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvcuvid.so.1 [0x1981e]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvcuvid.so.1 (cuvidDestroyDecoder + 0x64) [0x17974]
========= Host Frame:/home/josh/compiled/lib/libavcodec.so.58 [0x7a265]
========= Host Frame:/home/josh/compiled/lib/libavcodec.so.58 (avcodec_close + 0x6c) [0xb0b21]
========= Host Frame:ffmpeg [0x2f71b]
========= Host Frame:ffmpeg [0xbbee]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
========= Host Frame:ffmpeg [0xbdba]
=========
========= ERROR SUMMARY: 1 error