Trying to use VanillaDataManager as a component of my own pipeline, with multi-GPU parallel training
I want to use VanillaDataManager (nerfstudio 0.2.2) as a component of my own work. Single-GPU training works well, and I am now trying to extend it to multi-GPU parallel training.
However, I am not sure whether I have done this the right way. Here is part of my code:
"""
a training pipeline of my own - pipeline.py
"""
class MyPipeline(nn.Module):
    def __init__(
        self,
        device: str,
        test_mode: Literal["test", "val", "inference"] = "val",
        world_size: int = 1,
        local_rank: int = 0,
        **kwargs
    ):
        super().__init__()
        # build a bunch of log dirs for the experiment...
        # (self.instance_dir, self.batch_size, model_config and self.GPU_INDEX are set in code omitted here)

        # VanillaDataManager of nerfstudio
        datamanager_config = VanillaDataManagerConfig(
            # _target=RayPruningDataManager,
            dataparser=MinimalDataParserConfig(data=Path(self.instance_dir)),
            eval_num_rays_per_batch=self.batch_size,
            train_num_rays_per_batch=self.batch_size,
        )
        self.datamanager = VanillaDataManager(
            config=datamanager_config,
            device=device,
            test_mode=test_mode,
            world_size=world_size,
            local_rank=local_rank,
        )
        self.datamanager.to(device)
        assert self.datamanager.train_dataset is not None, "Missing input dataset"

        # initialize my NeRF model
        self.model = MyNerfModel(
            config=model_config,
            transform=self.datamanager.train_dataparser_outputs.dataparser_transform,
            scale=self.datamanager.train_dataparser_outputs.dataparser_scale,
            device=device,
        )
        self.model.to(device)

        # wrap the model with DistributedDataParallel
        self.model = torch.nn.parallel.DistributedDataParallel(
            self.model,
            device_ids=[self.GPU_INDEX],
            broadcast_buffers=False,
            find_unused_parameters=True,
        )

    def train(self) -> None:
        """training pipeline"""
        # ...
        ray_bundle, batch = self.datamanager.next_train(step)
        model_outputs = self.model(ray_bundle)
        metrics_dict = self.model.module.get_metrics_dict(model_outputs, batch)
        loss_dict = self.model.module.get_loss_dict(model_outputs, batch, metrics_dict)
        # ...
"""
Entrance to the training program - train.py
"""
def _set_random_seed(seed) -> None:
    """Set randomness seed in torch and numpy"""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


if __name__ == '__main__':
    # conf = '/root/home/workspace/tetra-nerf_modified/tetra_sdf/test.conf'
    # runner = TetraSDFTrainRunner(conf=conf)
    parser = argparse.ArgumentParser()
    parser.add_argument()  # omitted here...
    opt = parser.parse_args()
    """
    if opt.gpu == "auto":
        deviceIDs = GPUtil.getAvailable(order='memory', limit=1, maxLoad=0.5, maxMemory=0.5,
                                        includeNan=False, excludeID=[], excludeUUID=[])
        if len(deviceIDs) == 0:
            raise RuntimeError("No GPU available")
        gpu = deviceIDs[0]
    else:
        gpu = opt.gpu
    """
    gpu = opt.local_rank

    # set up distributed training
    if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
        rank = int(os.environ["RANK"])
        world_size = int(os.environ['WORLD_SIZE'])
        print(f"RANK and WORLD_SIZE in environ: {rank}/{world_size}")
    else:
        rank = -1
        world_size = -1

    print(opt.local_rank)
    torch.cuda.set_device(opt.local_rank)
    torch.distributed.init_process_group(
        backend='nccl',
        init_method='env://',
        world_size=world_size,
        rank=rank,
        timeout=timedelta(days=1, seconds=1800),
    )
    torch.distributed.barrier()

    test_mode = "val"
    device: TORCH_DEVICE = "cpu" if world_size == 0 else f"cuda:{opt.local_rank}"
    _set_random_seed(42 + rank)

    trainrunner = MyPipeline(device=device,
                             test_mode=test_mode,
                             world_size=world_size,
                             local_rank=opt.local_rank,
                             # opt.stuff...
                             )
    trainrunner.train()
The command I use to launch training is CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --nnodes=1 --node_rank=0 train.py
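(Side note: newer PyTorch releases deprecate torch.distributed.launch in favor of torchrun, so an equivalent launch would look roughly like CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 train.py, with the local rank read from the LOCAL_RANK environment variable instead of a --local_rank argument.)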
By the way, I noticed that the DataManager class has train_sampler and eval_sampler attributes that are never used. Aren't they needed for multi-GPU parallel training? Can anyone help? Thanks in advance.
I've observed the same thing: despite using multiple GPUs for training, the wall-clock training time stays the same, just with n times as many rays per step.
Inspecting the code, I wasn't able to find any use of a DistributedSampler for DDP, as in the PyTorch DDP example.
I believe this should live in each DataManager's setup_train() function, such as the one in VanillaDataManager (a minimal generic sketch is below).
This was also reported before in issues #1082 and #2475, and DDP was mentioned in other issues such as #1847. Maybe we can get more hints there.
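For reference, the generic pattern from the PyTorch DDP tutorial that I mean looks roughly like the sketch below (plain torch.utils.data, not nerfstudio's actual DataManager code; the function and variable names are just placeholders):

from torch.utils.data import DataLoader, DistributedSampler

def build_sharded_loader(dataset, batch_size, world_size, rank):
    # DistributedSampler hands each rank a disjoint subset of the dataset indices,
    # so every GPU iterates over different samples in each epoch.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler, num_workers=2)
    return loader, sampler

# per-process usage:
# loader, sampler = build_sharded_loader(train_dataset, batch_size=4096, world_size=world_size, rank=rank)
# for epoch in range(num_epochs):
#     sampler.set_epoch(epoch)  # reshuffle differently in each epoch
#     for batch in loader:
#         ...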
@jb-ye do you know if this is the expected behaviour?
@alancneves I observed this issue and sent a fix for it earlier: https://github.com/nerfstudio-project/nerfstudio/pull/1856. Not sure if it is the same issue for you.
Hi @jb-ye! I've seen that fix, which addresses averaging of the gradients across multiple GPUs.
However, I believe the problem here is more related to data loading. Without a sampler argument on the DataLoaders, in my understanding, all GPUs will load and train on the same data, rather than each getting its own subset as intended.
@alancneves I see your point. I thought that, at least for NeRF training, each GPU samples different rays, so essentially they are training on different data, no?
Yeah, that makes sense... Even using the same images, we can sample different parts of them. Thanks for the clarification @jb-ye!
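As a toy illustration of that point (not nerfstudio code, just plain torch): if every rank seeds its RNG differently, as the train.py above does with 42 + rank, then a uniform pixel/ray sampler already draws different indices on each GPU, even without a DistributedSampler. The numbers below are made up for the example:

import torch

num_pixels = 10_000   # placeholder: total candidate pixels across the training images
batch_size = 8        # placeholder: rays sampled per step

for rank in range(2):                                    # pretend there are 2 ranks
    g = torch.Generator().manual_seed(42 + rank)         # per-rank seed, as in train.py
    idx = torch.randint(0, num_pixels, (batch_size,), generator=g)
    print(f"rank {rank} samples pixels {idx.tolist()}")  # different indices on each rank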
Just another (related) question: we cannot guarantee that the N ray bundles drawn by the N GPUs will be completely distinct from each other, can we?
No, but since the number of sampled rays is much smaller than the number of pixels, the chance of overlap is low.
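A rough back-of-envelope check (illustrative numbers only, nothing measured from nerfstudio): with B rays per GPU drawn uniformly from P candidate pixels, the expected fraction of one rank's rays that also show up in another rank's batch is about B / P.

B = 4096              # rays per batch per GPU (illustrative)
P = 100 * 1_000_000   # e.g. 100 training images at ~1 megapixel each (illustrative)
print(f"expected overlap fraction ~ {B / P:.1e}")  # about 4.1e-05, i.e. negligible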