Building (video) pipeline slow with high number of samples
Hello. I noticed that the time required for building the pipeline grows linearly with the number of samples in the video list. I am using more or less this code:
train_loader = DALIGenericIterator(
    pipelines=[
        MyPipeline(
            sample_list_path=video_file_list,  # txt file where each line is 'path label start end'
            shuffle=True,
            batch_size=self.batch_size,
            num_threads=2,
        )
    ],
    ...
)
I noticed that executing the above snippet can take a long time when the number of videos is big. I am referring only to the build time. For example:

1000 samples (rows) in the txt file -> 12.81 sec
2000 samples (rows) in the txt file -> 24.52 sec

My dataset is about 100x bigger than that, so this linearly increasing setup time is not a viable solution.
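To be clear about what I am measuring, here is roughly the timing code (a minimal sketch; the batch size is just a placeholder):

import time

# Same MyPipeline and video_file_list as in the snippet above;
# batch_size is a placeholder value.
pipe = MyPipeline(
    sample_list_path=video_file_list,
    shuffle=True,
    batch_size=8,
    num_threads=2,
)

start = time.perf_counter()
pipe.build()  # essentially all of the reported time is spent in this call
print(f"build time: {time.perf_counter() - start:.2f} s")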
Are these numbers expected in your experience? Maybe I am doing something wrong while configuring the operators... I am using DALI 1.20.0 from the official NVIDIA 22.12 container.
Hello, thanks for the question.
It looks like MyPipeline is a function that constructs a DALI pipeline for you. Is that right? Could you share its code? It would be easier to pinpoint the problem if we knew more about your use case, especially the parameters you pass to the video reader op. Thanks!
Sure!
from pathlib import Path
from typing import Union

from nvidia.dali import fn, pipeline_def


@pipeline_def(num_threads=2, device_id=0)
def SpeechPipeline(sample_list_path: Union[str, Path], shuffle: bool = False):
    frames, video_id, frame_num = fn.readers.video(
        name="speech_reader",
        device="gpu",
        file_list=f"{sample_list_path}",
        sequence_length=5,
        step=1,
        random_shuffle=shuffle,
        initial_fill=128,
        file_list_frame_num=True,
        enable_frame_num=True,
        file_list_include_preceding_frame=True,
    )
    return frames, video_id, frame_num
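For reference, the file list I pass via sample_list_path looks like this (paths and labels here are made up; start and end are frame indices because of file_list_frame_num=True):

/data/videos/clip_000.mp4 0 0 120
/data/videos/clip_001.mp4 1 30 180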
Thanks.
This looks fine. Unfortunately, upon pipeline creation the DALI video reader needs to create an FFmpeg (av) context for each file to look up the number of frames and other metadata, if present, which is why the build time grows linearly with the number of files. I think making this behavior better for a large number of files is a legitimate enhancement request. We are working on some improvements to the video reading capabilities and we definitely should look into this. Another possible solution would be the ability to provide the necessary metadata to the reader, so it could be computed once and reused instead of being parsed again during every pipeline build.
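Just to sketch that idea (purely hypothetical for now, the reader cannot consume such a cache today), the per-file metadata could be extracted once, e.g. with ffprobe, and stored next to the file list:

import json
import subprocess
from pathlib import Path

CACHE = Path("frame_counts.json")  # hypothetical cache next to the file list

def frame_count(path: str) -> int:
    # Count the packets of the first video stream; for typical files this
    # equals the number of frames and avoids decoding anything.
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-count_packets", "-show_entries", "stream=nb_read_packets",
         "-of", "csv=p=0", path],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

def cached_frame_counts(paths):
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    for p in paths:
        if p not in cache:  # each file is parsed at most once, ever
            cache[p] = frame_count(p)
    CACHE.write_text(json.dumps(cache))
    return cache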
One thing that comes to my mind as a possible workaround for now is to glue your videos together offline (using FFmpeg or something similar) into larger chunks, and use the sequence_length and step arguments to extract the same samples. Fewer files means fewer av contexts to create, so this should improve the build time, but I am not really sure by how much without trying it.
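For example, the FFmpeg concat demuxer can join clips without re-encoding (a rough sketch; paths and the chunk name are placeholders, and the start/end frame numbers in your file list would need to be shifted by each clip's offset inside the chunk):

import subprocess
from pathlib import Path

# Rough sketch: glue many short clips into one chunk with the FFmpeg
# concat demuxer (stream copy, no re-encode). All clips must share the
# same codec and encoding parameters for this to work.
clips = sorted(Path("clips").glob("*.mp4"))
Path("concat.txt").write_text(
    "".join(f"file '{c.resolve()}'\n" for c in clips)
)
subprocess.run(
    ["ffmpeg", "-f", "concat", "-safe", "0",
     "-i", "concat.txt", "-c", "copy", "chunk_000.mp4"],
    check=True,
)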
I hope that helps!