
[bug] SageMaker PyTorch image has compatibility issues between ffmpeg version and torchaudio.io.StreamReader

Open w238liu opened this issue 2 years ago • 7 comments

Checklist

  • [x] I've prepended issue tag with type of change: [bug]
  • [x] (If applicable) I've attached the script to reproduce the bug
  • [x] (If applicable) I've documented below the DLC image/dockerfile this relates to
  • [x] (If applicable) I've documented below the tests I've run on the DLC image
  • [x] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
  • [ ] I've built my own container based off DLC (and I've attached the code used to build my own image)

Concise Description: torchaudio.io.StreamReader requires an ffmpeg version between 4.1 and 4.4, but the current SageMaker PyTorch training image ships ffmpeg 5.1.2, so StreamReader fails to read video files.
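A minimal sketch of the version constraint described above, assuming the 4.1 to 4.4 range stated in this report is the one that applies. The helper names `parse_version` and `ffmpeg_in_supported_range` are hypothetical and are not part of torchaudio.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted version such as '5.1.2' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def ffmpeg_in_supported_range(version: str,
                              minimum: str = "4.1",
                              maximum: str = "4.4") -> bool:
    """Return True if the major.minor part of `version` lies in [minimum, maximum].

    The default bounds are the range quoted in this issue (an assumption,
    not a value read from torchaudio itself).
    """
    major_minor = parse_version(version)[:2]
    return parse_version(minimum) <= major_minor <= parse_version(maximum)


print(ffmpeg_in_supported_range("5.1.2"))  # version shipped in the image
print(ffmpeg_in_supported_range("4.4.2"))  # version reported to work
```

Only the major and minor components are compared, so any 4.4.x patch release counts as inside the range.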

DLC image/dockerfile:

763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.0.0-cpu-py310-ubuntu20.04-sagemaker

Current behavior: I first ran docker run -it --gpus all 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker /bin/bash to create a container from the docker image.

In the first test, I ran the following script to check whether torchaudio can find ffmpeg:

import torch
import torchaudio
from torchaudio.utils import ffmpeg_utils


print(torch.__version__)
print(torchaudio.__version__)
print(ffmpeg_utils.get_versions())
print(ffmpeg_utils.get_build_config())
print([k for k in ffmpeg_utils.get_video_decoders().keys() if 'cuvid' in k])

It errored out with the following message:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torchaudio/_extension/utils.py", line 85, in _init_ffmpeg
    _load_lib("libtorchaudio_ffmpeg")
  File "/opt/conda/lib/python3.10/site-packages/torchaudio/_extension/utils.py", line 61, in _load_lib
    torch.ops.load_library(path)
  File "/opt/conda/lib/python3.10/site-packages/torch/_ops.py", line 643, in load_library
    ctypes.CDLL(path)
  File "/opt/conda/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libavdevice.so.58: cannot open shared object file: No such file or directory

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torchaudio/_extension/utils.py", line 134, in wrapped
    _init_ffmpeg()
  File "/opt/conda/lib/python3.10/site-packages/torchaudio/_extension/utils.py", line 87, in _init_ffmpeg
    raise ImportError("FFmpeg libraries are not found. Please install FFmpeg.") from err
ImportError: FFmpeg libraries are not found. Please install FFmpeg.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/test/test_torchaudio.py", line 8, in <module>
    print(ffmpeg_utils.get_versions())
  File "/opt/conda/lib/python3.10/site-packages/torchaudio/_extension/utils.py", line 136, in wrapped
    raise RuntimeError(
RuntimeError: get_versions requires FFmpeg extension which is not available. Please refer to the stacktrace above for how to resolve this.
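The OSError above names libavdevice.so.58, which is the libavdevice major version that ships with FFmpeg 4.x (FFmpeg 5.x ships libavdevice.so.59). As a rough diagnostic, and only a sketch, the dynamic loader can be probed directly to see which of the two is present. The helper name `can_load` is hypothetical.

```python
import ctypes


def can_load(soname: str) -> bool:
    """Return True if the dynamic loader can open the shared library `soname`."""
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        # Raised when the library is missing or cannot be loaded.
        return False


# libavdevice major 58 belongs to FFmpeg 4.x, major 59 to FFmpeg 5.x.
for soname in ("libavdevice.so.58", "libavdevice.so.59"):
    print(soname, "loadable" if can_load(soname) else "missing")
```

If only libavdevice.so.59 is reported as loadable, that would be consistent with an FFmpeg 5.x install that this torchaudio build cannot link against.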

In the second test, I created some test video files

mkdir /opt/test
cd /opt/test
ffmpeg -f lavfi -i mandelbrot -t 3 -c:v libx265 -pix_fmt yuv420p10le -vtag hvc1 -y test_hevc_hdr.mp4
ffmpeg -f lavfi -i mandelbrot -t 3 -c:v libx265 -pix_fmt yuv420p -vtag hvc1 -y test_hevc_sdr.mp4
ffmpeg -f lavfi -i mandelbrot -t 3 -c:v libx264 -pix_fmt yuv420p -vtag avc1 -y test_h264_sdr.mp4

and ran the following script in the same folder

from torchaudio.io import StreamReader


def test_func(src: str, decoder: str, device: str = 'cpu'):
    if device == 'cuda':
        decode_config = {
            'buffer_chunk_size': 50,
            'decoder': f'{decoder}_cuvid',
            'hw_accel': 'cuda',
            "format": None,
        }
    else:
        decode_config = {
            'buffer_chunk_size': 50,
            'decoder': decoder,
            "decoder_option": {"threads": "0"},
            "format": "yuv420p",
        }

    video = StreamReader(src=src)

    video.add_basic_video_stream(1, **decode_config)

    stream = video.stream()
    frame, = next(stream)

    print(frame.device, frame.shape, frame.dtype)
    return frame


if __name__ == "__main__":
    test_videos = ['test_hevc_hdr.mp4', 'test_hevc_sdr.mp4', 'test_h264_sdr.mp4']
    decoders = ['hevc', 'hevc', 'h264']
    devices = ['cpu', 'cuda']

    for src_path, decoder in zip(test_videos, decoders):
        for device in devices:
            test_func(src_path, decoder, device)

It errored out with the same message as in the first test.

Expected behavior: The expected output of the first test is something like

2.0.0
2.0.0
{'libavutil': (56, 70, 100), 'libavcodec': (58, 134, 100), 'libavformat': (58, 76, 100), 'libavfilter': (7, 110, 100), 'libavdevice': (58, 13, 100)}
--prefix=/home/ubuntu/.conda/envs/torchqa --cc=/home/conda/feedstock_root/build_artifacts/ffmpeg_1671040255947/_build_env/bin/x86_64-conda-linux-gnu-cc --cxx=/home/conda/feedstock_root/build_artifacts/ffmpeg_1671040255947/_build_env/bin/x86_64-conda-linux-gnu-c++ --nm=/home/conda/feedstock_root/build_artifacts/ffmpeg_1671040255947/_build_env/bin/x86_64-conda-linux-gnu-nm --ar=/home/conda/feedstock_root/build_artifacts/ffmpeg_1671040255947/_build_env/bin/x86_64-conda-linux-gnu-ar --disable-doc --disable-openssl --enable-avresample --enable-demuxer=dash --enable-hardcoded-tables --enable-libfreetype --enable-libfontconfig --enable-libopenh264 --enable-gnutls --enable-libmp3lame --enable-libvpx --enable-pthreads --enable-vaapi --enable-gpl --enable-libx264 --enable-libx265 --enable-libaom --enable-libsvtav1 --enable-libxml2 --enable-pic --enable-shared --disable-static --enable-version3 --enable-zlib --pkg-config=/home/conda/feedstock_root/build_artifacts/ffmpeg_1671040255947/_build_env/bin/pkg-config
['av1_cuvid', 'h264_cuvid', 'hevc_cuvid', 'mjpeg_cuvid', 'mpeg1_cuvid', 'mpeg2_cuvid', 'mpeg4_cuvid', 'vc1_cuvid', 'vp8_cuvid', 'vp9_cuvid']
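To read the version dictionary above at a glance, the libavcodec major number can be mapped to an FFmpeg release series. This is a sketch only: the table covers the series relevant to this thread, and the name `ffmpeg_series` is hypothetical.

```python
# libavcodec major version -> FFmpeg release series.
LIBAVCODEC_TO_FFMPEG = {58: "4.x", 59: "5.x", 60: "6.x", 61: "7.x"}


def ffmpeg_series(versions: dict) -> str:
    """Map the 'libavcodec' entry of a get_versions()-style dict to a release series."""
    major = versions["libavcodec"][0]
    return LIBAVCODEC_TO_FFMPEG.get(major, "unknown")


# The dictionary from the expected output above.
expected = {
    "libavutil": (56, 70, 100),
    "libavcodec": (58, 134, 100),
    "libavformat": (58, 76, 100),
    "libavfilter": (7, 110, 100),
    "libavdevice": (58, 13, 100),
}
print(ffmpeg_series(expected))
```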

The expected output of the second test should be

cpu torch.Size([1, 3, 480, 640]) torch.uint8
cuda:0 torch.Size([1, 3, 480, 640]) torch.int16
cpu torch.Size([1, 3, 480, 640]) torch.uint8
cuda:0 torch.Size([1, 3, 480, 640]) torch.uint8
cpu torch.Size([1, 3, 480, 640]) torch.uint8
cuda:0 torch.Size([1, 3, 480, 640]) torch.uint8

Additional context: As far as I know, ffmpeg installed with conda install ffmpeg=4.4.2 -c conda-forge works well with StreamReader in torchaudio 2.0.1. However, I am not able to uninstall ffmpeg 5.1.2 and re-install ffmpeg 4.4.2, because conda cannot resolve the environment and reports it as inconsistent.

w238liu avatar Jun 06 '23 21:06 w238liu

Hi @w238liu,

Have you tried installing the latest PyTorch 2.0.1 image ( 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu118-ubuntu20.04-ec2 )? That might solve the issue you are seeing.

I pulled down the latest image and ran your first test and was able to receive the proper output:

> python3 test.py 
2.0.1
2.0.2
{'libavutil': (56, 70, 100), 'libavcodec': (58, 134, 100), 'libavformat': (58, 76, 100), 'libavfilter': (7, 110, 100), 'libavdevice': (58, 13, 100)}
--prefix=/opt/conda --cc=/home/conda/feedstock_root/build_artifacts/ffmpeg_1671040255947/_build_env/bin/x86_64-conda-linux-gnu-cc --cxx=/home/conda/feedstock_root/build_artifacts/ffmpeg_1671040255947/_build_env/bin/x86_64-conda-linux-gnu-c++ --nm=/home/conda/feedstock_root/build_artifacts/ffmpeg_1671040255947/_build_env/bin/x86_64-conda-linux-gnu-nm --ar=/home/conda/feedstock_root/build_artifacts/ffmpeg_1671040255947/_build_env/bin/x86_64-conda-linux-gnu-ar --disable-doc --disable-openssl --enable-avresample --enable-demuxer=dash --enable-hardcoded-tables --enable-libfreetype --enable-libfontconfig --enable-libopenh264 --enable-gnutls --enable-libmp3lame --enable-libvpx --enable-pthreads --enable-vaapi --enable-gpl --enable-libx264 --enable-libx265 --enable-libaom --enable-libsvtav1 --enable-libxml2 --enable-pic --enable-shared --disable-static --enable-version3 --enable-zlib --pkg-config=/home/conda/feedstock_root/build_artifacts/ffmpeg_1671040255947/_build_env/bin/pkg-config
['av1_cuvid', 'h264_cuvid', 'hevc_cuvid', 'mjpeg_cuvid', 'mpeg1_cuvid', 'mpeg2_cuvid', 'mpeg4_cuvid', 'vc1_cuvid', 'vp8_cuvid', 'vp9_cuvid']

This is still using ffmpeg=4.4.2 as seen in this call:

> conda list | grep ffmpeg
ffmpeg                    4.4.2           gpl_h8dda1f0_112    conda-forge

ohadkatz avatar Jun 29 '23 14:06 ohadkatz

@ohadkatz Hello, thanks for the suggestion. I just tried this image. It seems to work on an EC2 machine, but it does not work on SageMaker. Is there any plan to release a similar image for SageMaker?

Moreover, even on an EC2 machine with the EC2 image, the second test script errors out with a Segmentation fault. Any thoughts?

w238liu avatar Jul 18 '23 03:07 w238liu

Hi @w238liu, we have released the SageMaker containers with ffmpeg 4.4.2 installed.

That fixed the first issue you mentioned. Here is the release tag: https://github.com/aws/deep-learning-containers/releases/tag/v1.4-pt-sagemaker-2.0.1-tr-gpu-py310.

On the second issue, I was able to reproduce the segmentation fault both on this container and with upstream torch and torchaudio installed via conda install pytorch=2.0.1 pytorch-cuda=11.8 torchaudio -c pytorch -c nvidia -c defaults. To debug this, I enabled the Python fault handler via export PYTHONFAULTHANDLER=1 and got the result below:

cpu torch.Size([1, 3, 480, 640]) torch.uint8
Fatal Python error: Segmentation fault

Current thread 0x00007f1aaf63a740 (most recent call first):
  File "/opt/conda/lib/python3.10/site-packages/torchaudio/io/_stream_reader.py", line 753 in add_video_stream
  File "/opt/conda/lib/python3.10/site-packages/torchaudio/io/_stream_reader.py", line 668 in add_basic_video_stream
  File "//test.py", line 22 in test_func
  File "//test.py", line 38 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, gmpy2.gmpy2 (total: 21)
Segmentation fault (core dumped)

~~Based on the above, the issue occurs at https://github.com/pytorch/audio/blob/v2.0.2/torchaudio/io/_stream_reader.py#L753~~ ~~For the next step, i will create an issue to pytorch/audio and get help.~~
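The fault handler used above can also be switched on from inside the script instead of through the PYTHONFAULTHANDLER environment variable. The sketch below sends the crash traceback to a log file, which can be easier to retrieve from a container. The helper name `enable_fault_dump` and the log location are hypothetical.

```python
import faulthandler
import os
import tempfile


def enable_fault_dump(log_path: str):
    """Enable faulthandler and write crash tracebacks to `log_path`.

    The returned file object must stay open for as long as the handler
    is enabled, because faulthandler writes to its file descriptor.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical log location; adjust to wherever the container keeps logs.
log_path = os.path.join(tempfile.gettempdir(), "faulthandler.log")
log_file = enable_fault_dump(log_path)
print(faulthandler.is_enabled())
```

Calling this at the top of the test script, before StreamReader is created, should capture the same Python-level stack as the environment variable does.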

Diving deeper today, I realized that the ffmpeg distributed by pytorch doesn't support any of the *_cuvid decoders (see below), and that the ffmpeg used in this image (4.4.2 from conda-forge) shouldn't be the one that gets installed, as we want to stick to the pytorch distribution of ffmpeg.

Here is the output of the first reproduction script. Note the empty list [] at the end, which comes from print([k for k in ffmpeg_utils.get_video_decoders().keys() if 'cuvid' in k]):

2.0.1
2.0.2
{'libavutil': (56, 51, 100), 'libavcodec': (58, 91, 100), 'libavformat': (58, 45, 100), 'libavfilter': (7, 85, 100), 'libavdevice': (58, 10, 100)}
--prefix=/fsx/conda/envs/test_oss_audio --cc=/opt/conda/conda-bld/ffmpeg_1597178665428/_build_env/bin/x86_64-conda_cos6-linux-gnu-cc --disable-doc --disable-openssl --enable-avresample --enable-gnutls --enable-hardcoded-tables --enable-libfreetype --enable-libopenh264 --enable-pic --enable-pthreads --enable-shared --disable-static --enable-version3 --enable-zlib --enable-libmp3lame
[]

So for the second issue you mentioned, our plan is to change ffmpeg back to the pytorch distribution here, which means that the *_cuvid decoders won't be supported. Please let us know if you have any concerns about this change.
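If the *_cuvid decoders may be absent, a test script can choose its decoder from whatever FFmpeg actually lists rather than assuming GPU decoding is there. The sketch below reuses the two configurations from the reproduction script in this issue. The helper name `build_decode_config` is hypothetical.

```python
def build_decode_config(codec: str, available_decoders, prefer_gpu: bool = True) -> dict:
    """Return keyword arguments for add_basic_video_stream.

    The GPU (cuvid) decoder is used only when it appears in
    `available_decoders`; otherwise the CPU configuration is returned.
    """
    gpu_decoder = f"{codec}_cuvid"
    if prefer_gpu and gpu_decoder in set(available_decoders):
        return {
            "buffer_chunk_size": 50,
            "decoder": gpu_decoder,
            "hw_accel": "cuda",
            "format": None,
        }
    return {
        "buffer_chunk_size": 50,
        "decoder": codec,
        "decoder_option": {"threads": "0"},
        "format": "yuv420p",
    }


# An FFmpeg build that lists no cuvid decoders falls back to the CPU decoder.
print(build_decode_config("hevc", [])["decoder"])
# A build that lists hevc_cuvid selects the GPU decoder.
print(build_decode_config("hevc", ["h264_cuvid", "hevc_cuvid"])["decoder"])
```

In a real run, `available_decoders` could be taken from `ffmpeg_utils.get_video_decoders().keys()`, the same call used in the first test script.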

junpuf avatar Sep 20 '23 00:09 junpuf

Hi @junpuf , thanks for your work and analysis into this issue.

For the second issue, I believe it is now fixed by the release of torchaudio 2.1. See this closed issue and the release note for details.

As for ffmpeg, I do have concerns about the change from the conda-forge distribution to the pytorch distribution. Without CUDA decoders, I/O could be very slow for UHD HDR videos.

w238liu avatar Nov 13 '23 12:11 w238liu

Hi @w238liu, thanks for the update above. I read the issue that you opened and confirmed that the official documentation recommends using 'ffmpeg<7' from conda-forge (link).

Since we recently released the PyTorch 2.1 container, I will take it, override ffmpeg, re-run the 2 test cases, and get back to you.

junpuf avatar Nov 14 '23 22:11 junpuf

Hi @junpuf , is there any update regarding the 2 test cases? I have recently been working with some 4K HDR videos, and decoding them on CPUs is very slow. Is there any plan to release a container that supports the torchaudio GPU decoder in the near future?

w238liu avatar Jan 15 '24 21:01 w238liu

Hi @w238liu, today I'm checking whether GPU decoding can be enabled on the PyTorch Training Container with ffmpeg 6 from conda-forge.

First, I tried the test cases you provided on a g5 EC2 instance that has 2 GPUs. I installed pytorch, torchaudio, etc. into a conda environment, then installed ffmpeg=6.1 from conda-forge into the same environment, and was able to get the expected results. Below are the commands I used to create the conda environment:

mamba create -n myenv python=3.10 pytorch=2.2.0 pytorch-cuda=12.1 torchaudio --strict-channel-priority --override-channels -c https://aws-ml-conda.s3.us-west-2.amazonaws.com -c nvidia -c conda-forge
source activate myenv
mamba install ffmpeg=6 -c conda-forge

However, when I replicate the same setup in the container environment, I consistently get the error below, and I don't have a solution at the moment.

RuntimeError: Failed to initialize CodecContext: Operation not permitted
Exception raised from open_codec at /opt/conda/conda-bld/torchaudio_1706759466457/work/src/libtorio/ffmpeg/stream_reader/stream_processor.cpp:150

junpuf avatar Mar 28 '24 23:03 junpuf