
Warning while training model with DDP

Open AmmaraRazzaq opened this issue 3 years ago • 13 comments

Hi, I am getting the following warning when training a model with the FFCV dataloader + DDP.

[W reducer.cpp:362] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.

The same code works fine with the PyTorch dataloader + DDP.

AmmaraRazzaq avatar Feb 28 '22 15:02 AmmaraRazzaq

I think this warning was occurring because I was not putting the tensors on the GPU in the image and label pipelines; instead I was putting them on the GPU in the train and val loops. However, now only the image tensors are moved to the GPU; the label tensors are not.

# Assumed imports for this snippet (FFCV + PyTorch); Normalize() is
# assumed to be a custom/project-specific transform.
from typing import List

import torch as ch
from ffcv.fields.decoders import NDArrayDecoder, SimpleRGBImageDecoder
from ffcv.loader import Loader, OrderOption
from ffcv.pipeline.operation import Operation
from ffcv.transforms import Convert, ToDevice, ToTensor, ToTorchImage

loaders = {}
for name in ['train', 'val']:
    label_pipeline: List[Operation] = [NDArrayDecoder(), ToDevice(ch.device('cuda:0'))]
    image_pipeline: List[Operation] = [SimpleRGBImageDecoder(), Normalize(), ToTensor(),
                                       Convert(ch.float32), ToDevice(ch.device('cuda:0')),
                                       ToTorchImage()]
    # Create loaders
    loaders[name] = Loader(
        paths[f'{name}_beton_path'],
        batch_size=14,
        num_workers=6,
        order=OrderOption.RANDOM if name == 'train' else OrderOption.SEQUENTIAL,
        # distributed=(name == 'train'),
        # seed=0,
        drop_last=(name == 'train'),
        pipelines={
            'image': image_pipeline,
            'label': label_pipeline
        }
    )

AmmaraRazzaq avatar Mar 01 '22 12:03 AmmaraRazzaq

Resolved: the PyTorch dataset class should be given an array as input, but I was passing a list for the labels. NDArrayField and NDArrayDecoder() still appeared to work with a list, but no further changes could be applied to the labels after decoding.
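A minimal sketch of the difference (with hypothetical label values): NDArrayField expects each label to be a fixed-shape NumPy array, not a Python list.

```python
import numpy as np

# Hypothetical 5-label multilabel targets for two samples.
labels_as_list = [[1, 0, 0, 1, 0], [0, 1, 0, 0, 1]]    # what caused trouble
labels = np.asarray(labels_as_list, dtype=np.float32)  # what NDArrayField expects

print(labels.shape, labels.dtype)  # (2, 5) float32
```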

AmmaraRazzaq avatar Mar 02 '22 12:03 AmmaraRazzaq

Hi @AmmaraRazzaq I am facing the same error. I could not understand your last comment. Do you mind sharing it in a bit more detail? Thanks!

sachitkuhar avatar Mar 03 '22 07:03 sachitkuhar

Even after successfully moving the tensors to the GPU, the warning still persists:

[W reducer.cpp:362] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance. grad.sizes() = [2304, 576, 1, 1], strides() = [576, 1, 576, 576] bucket_view.sizes() = [2304, 576, 1, 1], strides() = [576, 1, 1, 1] (function operator())
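The stride numbers in the warning can be reproduced directly: they are exactly the channels_last strides of a [2304, 576, 1, 1] (1x1 conv) weight, while the DDP bucket view uses the contiguous strides. A small sketch to verify:

```python
import torch

# A 1x1 conv weight with the shape from the warning above.
w = torch.empty(2304, 576, 1, 1)

contig_strides = w.stride()                                    # bucket view layout
cl_strides = w.to(memory_format=torch.channels_last).stride()  # grad layout

print(contig_strides)  # (576, 1, 1, 1)
print(cl_strides)      # (576, 1, 576, 576)
```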

AmmaraRazzaq avatar Mar 03 '22 11:03 AmmaraRazzaq

Finally figured it out. The warning occurs because ToTorchImage() returns tensors in channels_last memory format. If the input tensor to a model is in channels_last memory format, the model must support that format as well; otherwise it gives the warning about grad strides not matching. The model can be converted to channels_last with model = model.to(memory_format=torch.channels_last), as explained here in detail. Alternatively, the channels_last parameter can be set to False in ToTorchImage(channels_last=False); it will then return the tensor in contiguous memory format, and there is no need to convert the model.
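Both fixes can be sketched with plain PyTorch (a small Conv2d stands in for the real model; the FFCV side is assumed):

```python
import torch

# Fix 1: convert the model to channels_last so it matches ToTorchImage()'s output.
model = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)
model = model.to(memory_format=torch.channels_last)

x = torch.randn(2, 3, 32, 32).to(memory_format=torch.channels_last)
out = model(x)
print(out.is_contiguous(memory_format=torch.channels_last))  # True

# Fix 2 (FFCV side): ToTorchImage(channels_last=False) yields contiguous
# tensors instead, so the model needs no conversion.
```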

I have found that the contiguous memory format is much faster; the channels_last memory format makes training slower than with the PyTorch data loader.
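A rough way to compare the two layouts (a micro-benchmark sketch on a stand-in model; on GPU, wrap the timers with torch.cuda.synchronize(), and note that results depend heavily on hardware and cuDNN):

```python
import time
import torch

def bench(memory_format, iters=10):
    # Stand-in model and batch; the real setup uses resnet101 and FFCV batches.
    model = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1)
    model = model.to(memory_format=memory_format)
    x = torch.randn(8, 3, 64, 64).to(memory_format=memory_format)
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(iters):
            model(x)
    return time.perf_counter() - start

t_contig = bench(torch.contiguous_format)
t_cl = bench(torch.channels_last)
print(f"contiguous: {t_contig:.4f}s, channels_last: {t_cl:.4f}s")
```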

AmmaraRazzaq avatar Mar 03 '22 14:03 AmmaraRazzaq

@AmmaraRazzaq What GPU are you using? Newer GPUs should be at least 10% faster with channels_last.

GuillaumeLeclerc avatar Mar 03 '22 16:03 GuillaumeLeclerc

Hi @GuillaumeLeclerc I am using Tesla V100-SXM2-32GB

AmmaraRazzaq avatar Mar 05 '22 18:03 AmmaraRazzaq

I have a V100 handy. Do you mind sharing a sample of your code that is faster with channels_last=False so I can investigate?

GuillaumeLeclerc avatar Mar 08 '22 03:03 GuillaumeLeclerc

Hi @GuillaumeLeclerc Thank you for offering to help. Here is the link to the code: https://github.com/AmmaraRazzaq/image_classification/blob/main/sample_code.py

AmmaraRazzaq avatar Mar 08 '22 15:03 AmmaraRazzaq

Sorry for the delay, can you give me exactly the parameters you are using (and which dataset). Thank you!

GuillaumeLeclerc avatar Mar 14 '22 22:03 GuillaumeLeclerc

Hi @GuillaumeLeclerc I can't share much detail with you, as this is a research project that is still in the development phase and has not been made open source yet. Could you let me know whether the parameters, the nature of the dataset, or the model architecture can affect the speed of model training?

AmmaraRazzaq avatar Mar 15 '22 06:03 AmmaraRazzaq

There are many very important factors, including:

  • the distribution of image resolutions
  • the amount of raw/JPEG used in the file
  • the amount of compression of the images
  • the shape of your labels
  • etc.

Can you provide a dataset where the images and labels have been replaced by noise?

GuillaumeLeclerc avatar Mar 15 '22 23:03 GuillaumeLeclerc

Hi @GuillaumeLeclerc Apologies for the late reply.

I am sharing the dataset files and sample code. I am working with the CheXpert dataset; the beton file for all the images is 165 GB, so I have created a beton file with 1000 images (~1.5 GB). Images are resized to 512x512, normalized to the range [-1, 1], and written to the beton file in 'raw' format. It is a multilabel classification problem with 5 labels per image.
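The normalization described above can be sketched in NumPy (resize omitted; the exact scaling used in the project is assumed to be a linear map from [0, 255] to [-1, 1]):

```python
import numpy as np

# A stand-in 512x512 RGB image in uint8.
img = np.random.randint(0, 256, size=(512, 512, 3), dtype=np.uint8)

# Linear map [0, 255] -> [-1, 1].
img_norm = img.astype(np.float32) / 127.5 - 1.0

print(img_norm.min() >= -1.0, img_norm.max() <= 1.0)  # True True
```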

Dataset files: https://github.com/AmmaraRazzaq/image_classification/tree/master/betonfiles
Code: https://github.com/AmmaraRazzaq/image_classification/blob/master/pyfiles/sample_code.py

I am using the resnet101 architecture with lr=2e-3, bs=24, gpus=4 (DDP training), the SGD optimizer with weight_decay=0 and momentum=0.9, and num_workers=6 in the dataloader.
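The optimizer configuration above maps to PyTorch like this (a linear layer stands in for resnet101; the DDP wrapping and dataloader are omitted):

```python
import torch

model = torch.nn.Linear(512, 5)  # stand-in for torchvision's resnet101
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=2e-3,
    momentum=0.9,
    weight_decay=0,
)
print(optimizer.defaults["lr"], optimizer.defaults["momentum"])  # 0.002 0.9
```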

AmmaraRazzaq avatar Mar 25 '22 09:03 AmmaraRazzaq