Training hangs on a specific video
Hi,
I am opening this issue because my training hangs when a specific video is in the dataset. I am using an old version of DALI (0.10) because the code I want to reproduce uses it. Videos created with exactly the same code work well; only this one causes problems. I am attaching the video file that causes the hang (stroller_2.mp4) and a very similar video that works fine (stroller_1.mp4). With the new version of DALI it seems to work. Unfortunately, I would prefer not to update DALI, since that could affect reproducibility. Do you have any idea why this particular file can cause problems and how I could fix it?
stroller_1.mp4 https://user-images.githubusercontent.com/17530823/194344261-45b4e882-2000-4f64-a362-6c15e60af288.mp4
stroller_2.mp4 https://user-images.githubusercontent.com/17530823/194344435-6ffa8990-5568-45e6-9e9d-2d7f3f3ca714.mp4
Just in case, I repeated the process of creating the video file with ffmpeg, but it didn't help. Both videos were made from frames of a single sequence from the DAVIS dataset: one from the odd frames and the other from the even frames. Additionally, a video created from all the frames does not cause any problems.
Hi @Piotr94,
Since version 0.10 we have made several improvements to the video decoder functionality that help in such cases. The most notable ones are the heuristic added in https://github.com/NVIDIA/DALI/pull/3247 and the better error handling in https://github.com/NVIDIA/DALI/pull/4022. The former fixes a problem where DALI fails to resume decoding from a keyframe that is not an IDR frame. For the libav-based decoder there is no difference between these two kinds of keyframes, while NVDEC behaves differently. Can you tell us which particular functionality change in the most recent DALI prevents you from updating?
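If you want to check whether stroller_2.mp4 actually contains such keyframes, a quick way is to list each frame's key_frame flag and picture type. A minimal sketch, assuming ffprobe from FFmpeg is installed on your machine (not part of DALI or the original code):

```python
# Sketch: print the key_frame flag and picture type of every video frame,
# using ffprobe (assumes FFmpeg is installed and on PATH).
import subprocess

def dump_frame_types(path):
    """Print key_frame and pict_type for each video frame in `path`."""
    out = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-select_streams", "v:0",
            "-show_frames",
            "-show_entries", "frame=key_frame,pict_type",
            "-of", "csv=p=0",
            path,
        ],
        capture_output=True, text=True, check=True,
    )
    for i, line in enumerate(out.stdout.strip().splitlines()):
        key_frame, pict_type = line.split(",")[:2]
        print(f"frame {i:4d}: key_frame={key_frame} pict_type={pict_type}")

dump_frame_types("stroller_2.mp4")
```

Comparing the output for stroller_1.mp4 and stroller_2.mp4 may show whether the problematic file starts (or resumes) on a keyframe with a different picture type.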
Thanks for the reply, @JanuszL! I am trying to reproduce the FastDVDnet results (https://github.com/m-tassano/fastdvdnet), for which the authors used an old version of DALI (https://github.com/m-tassano/fastdvdnet/releases/tag/v0.1). The current version of the repository is compatible with the new DALI, but based on my experiments it gives worse results (at least 0.3 dB lower PSNR on the validation set), and I don't understand why. The only differences between the versions are the dataloader, the versions of torch and a few other libraries, and adding .contiguous() to one tensor. To me the changes in the dataloader seem small, but the difference in results is significant. I don't know DALI deeply, so it's hard for me to judge whether these small changes are correct.
@Piotr94,
> The current version of the repository is compatible with the new DALI, but based on my experiments it gives worse results (at least 0.3 dB lower PSNR on the validation set), and I don't understand why. The only differences between the versions are the dataloader, the versions of torch and a few other libraries, and adding .contiguous() to one tensor.
Thank you for providing the details. This is definitely not expected. The data loading pipeline seems to use very basic operations that should work the same across versions. Is it possible to just update the DALI version and see if that makes any difference in the results? Debugging such issues with so many moving parts is quite challenging.
Simply updating the DALI version is unfortunately not possible, because one of the operations used in the old version of the code is no longer supported (CropCastPermute, which was replaced with CropMirrorNormalize). Here is the current version of the dataloader: https://github.com/m-tassano/fastdvdnet/blob/master/dataloaders.py, and here is the old version: https://github.com/m-tassano/fastdvdnet/blob/v0.1/dataloaders.py. Here is the commit with the changes: https://github.com/m-tassano/fastdvdnet/commit/96360196a24ff1a8d9444a9f48a35d2842eb2832
One of the (maybe important) differences I have noticed is replacing output_layout=types.NCHW with output_layout='FCHW'; another is removing "stop_at_epoch=False" from the pytorch.DALIGenericIterator parameters. Could these changes have any impact?
last_batch_policy is the new way of expressing the behavior previously controlled by stop_at_epoch.
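For reference, a minimal sketch of how the iterator is typically configured in recent DALI releases; my_pipeline, the output names, and the reader name are placeholders, and the mapping to the old stop_at_epoch behavior is an approximation, not an exact equivalence:

```python
# Sketch (recent DALI versions): last_batch_policy replaces stop_at_epoch.
from nvidia.dali.plugin.pytorch import DALIGenericIterator, LastBatchPolicy

loader = DALIGenericIterator(
    pipelines=[my_pipeline],   # placeholder: your video pipeline instance
    output_map=["data"],       # placeholder: names for the pipeline outputs
    reader_name="Reader",      # assumption: the reader op instance is named "Reader"
    # Roughly corresponds to the behavior previously requested via stop_at_epoch;
    # see also LastBatchPolicy.PARTIAL and LastBatchPolicy.DROP.
    last_batch_policy=LastBatchPolicy.FILL,
)
```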
What you can do is use a single, short test video and dump all frames from both pipelines for comparison. Before that, you can check whether the number of iterations and the number of returned samples match. Then briefly compare the frames themselves. Maybe there is something obvious we are missing.
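As a starting point, a minimal sketch of such a comparison, assuming each run has already dumped its batches to .npy files (old/batch_0000.npy, new/batch_0000.npy, ... are hypothetical paths, not part of the original code):

```python
# Sketch: compare frames dumped from the old and the new pipeline runs.
# Assumes each run saved its batches as old/batch_*.npy and new/batch_*.npy.
import glob
import numpy as np

old_files = sorted(glob.glob("old/batch_*.npy"))
new_files = sorted(glob.glob("new/batch_*.npy"))

# First check that both runs produced the same number of iterations.
print(f"iterations: old={len(old_files)} new={len(new_files)}")

for old_path, new_path in zip(old_files, new_files):
    old_batch = np.load(old_path).astype(np.float32)
    new_batch = np.load(new_path).astype(np.float32)
    if old_batch.shape != new_batch.shape:
        print(f"{old_path}: shape mismatch {old_batch.shape} vs {new_batch.shape}")
        continue
    diff = np.abs(old_batch - new_batch)
    print(f"{old_path}: max abs diff = {diff.max():.4f}, mean = {diff.mean():.6f}")
```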
Thanks for the help. If I have time, I'll test it the way you suggested. For now I have made a hack: I removed the first frame from the sequence I use to generate the video, and after that the video works.