CUDA assertion failure when trying to resume splatfacto training
Describe the bug
I train a splatfacto model for 7000 iterations. When I then try to resume training from the checkpoint to continue to 30000 iterations, I get a crash.
To Reproduce
Steps to reproduce the behavior:
- Train a splatfacto model for 7000 iterations, something like
ns-train splatfacto --vis viewer+tensorboard --max-num-iterations 7000 colmap --data /media/abroun/9dcb521e-bc62-451f-aaa8-bc85c85e14f9/home/abroun/Pictures/photogrammetry/personal/knaresborough_castle/museum_20241212/complete_colmap/ --downscale-factor 4
- Then try to resume training to 30000 iterations
ns-train splatfacto --vis viewer+tensorboard --max-num-iterations 30000 --load-dir outputs/unnamed/splatfacto/2024-12-19_165704/nerfstudio_models colmap --data /media/abroun/9dcb521e-bc62-451f-aaa8-bc85c85e14f9/home/abroun/Pictures/photogrammetry/personal/knaresborough_castle/museum_20241212/complete_colmap/ --downscale-factor 4
- Observe a crash, which appears to be caused by numerous index-out-of-bounds assertion failures in a CUDA kernel
Logs from the crash:
(venv) abroun@Desktop-22:/src$ ns-train splatfacto --vis viewer+tensorboard --max-num-iterations 30000 --load-dir outputs/unnamed/splatfacto/2024-12-19_165704/nerfstudio_models colmap --data /media/abroun/9dcb521e-bc62-451f-aaa8-bc85c85e14f9/home/abroun/Pictures/photogrammetry/personal/knaresborough_castle/museum_20241212/complete_colmap/ --downscale-factor 4 --colmap-path sparse/0
/src/nerfstudio/field_components/activations.py:32: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd(cast_inputs=torch.float32)
/src/nerfstudio/field_components/activations.py:39: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
def backward(ctx, g):
──────────────────────────────────────────────────────── Config ────────────────────────────────────────────────────────
TrainerConfig(
_target=<class 'nerfstudio.engine.trainer.Trainer'>,
output_dir=PosixPath('outputs'),
method_name='splatfacto',
experiment_name=None,
project_name='nerfstudio-project',
timestamp='2024-12-19_170734',
machine=MachineConfig(seed=42, num_devices=1, num_machines=1, machine_rank=0, dist_url='auto', device_type='cuda'),
logging=LoggingConfig(
relative_log_dir=PosixPath('.'),
steps_per_log=10,
max_buffer_size=20,
local_writer=LocalWriterConfig(
_target=<class 'nerfstudio.utils.writer.LocalWriter'>,
enable=True,
stats_to_track=(
<EventName.ITER_TRAIN_TIME: 'Train Iter (time)'>,
<EventName.TRAIN_RAYS_PER_SEC: 'Train Rays / Sec'>,
<EventName.CURR_TEST_PSNR: 'Test PSNR'>,
<EventName.VIS_RAYS_PER_SEC: 'Vis Rays / Sec'>,
<EventName.TEST_RAYS_PER_SEC: 'Test Rays / Sec'>,
<EventName.ETA: 'ETA (time)'>
),
max_log_size=10
),
profiler='basic'
),
viewer=ViewerConfig(
relative_log_filename='viewer_log_filename.txt',
websocket_port=None,
websocket_port_default=7007,
websocket_host='0.0.0.0',
num_rays_per_chunk=32768,
max_num_display_images=512,
quit_on_train_completion=False,
image_format='jpeg',
jpeg_quality=75,
make_share_url=False,
camera_frustum_scale=0.1,
default_composite_depth=True
),
pipeline=VanillaPipelineConfig(
_target=<class 'nerfstudio.pipelines.base_pipeline.VanillaPipeline'>,
datamanager=FullImageDatamanagerConfig(
_target=<class 'nerfstudio.data.datamanagers.full_images_datamanager.FullImageDatamanager'>,
data=None,
masks_on_gpu=False,
images_on_gpu=False,
dataparser=ColmapDataParserConfig(
_target=<class 'nerfstudio.data.dataparsers.colmap_dataparser.ColmapDataParser'>,
data=PosixPath('/media/abroun/9dcb521e-bc62-451f-aaa8-bc85c85e14f9/home/abroun/Pictures/photogrammetry/p
ersonal/knaresborough_castle/museum_20241212/complete_colmap'),
scale_factor=1.0,
downscale_factor=4,
downscale_rounding_mode='floor',
scene_scale=1.0,
orientation_method='up',
center_method='poses',
auto_scale_poses=True,
assume_colmap_world_coordinate_convention=True,
eval_mode='interval',
train_split_fraction=0.9,
eval_interval=8,
depth_unit_scale_factor=0.001,
images_path=PosixPath('images'),
masks_path=None,
depths_path=None,
colmap_path=PosixPath('sparse/0'),
load_3D_points=True,
max_2D_matches_per_3D_point=0
),
camera_res_scale_factor=1.0,
eval_num_images_to_sample_from=-1,
eval_num_times_to_repeat_images=-1,
eval_image_indices=(0,),
cache_images='gpu',
cache_images_type='uint8',
max_thread_workers=None,
train_cameras_sampling_strategy='random',
train_cameras_sampling_seed=42,
fps_reset_every=100
),
model=SplatfactoModelConfig(
_target=<class 'nerfstudio.models.splatfacto.SplatfactoModel'>,
enable_collider=True,
collider_params={'near_plane': 2.0, 'far_plane': 6.0},
loss_coefficients={'rgb_loss_coarse': 1.0, 'rgb_loss_fine': 1.0},
eval_num_rays_per_chunk=4096,
prompt=None,
warmup_length=500,
refine_every=100,
resolution_schedule=3000,
background_color='random',
num_downscales=2,
cull_alpha_thresh=0.1,
cull_scale_thresh=0.5,
reset_alpha_every=30,
densify_grad_thresh=0.0008,
use_absgrad=True,
densify_size_thresh=0.01,
n_split_samples=2,
sh_degree_interval=1000,
cull_screen_size=0.15,
split_screen_size=0.05,
stop_screen_size_at=4000,
random_init=False,
num_random=50000,
random_scale=10.0,
ssim_lambda=0.2,
stop_split_at=15000,
sh_degree=3,
use_scale_regularization=False,
max_gauss_ratio=10.0,
output_depth_during_training=False,
rasterize_mode='classic',
camera_optimizer=CameraOptimizerConfig(
_target=<class 'nerfstudio.cameras.camera_optimizers.CameraOptimizer'>,
mode='off',
trans_l2_penalty=0.01,
rot_l2_penalty=0.001,
optimizer=None,
scheduler=None
),
use_bilateral_grid=False,
grid_shape=(16, 16, 8),
color_corrected_metrics=False
)
),
optimizers={
'means': {
'optimizer': AdamOptimizerConfig(
_target=<class 'torch.optim.adam.Adam'>,
lr=0.00016,
eps=1e-15,
max_norm=None,
weight_decay=0
),
'scheduler': ExponentialDecaySchedulerConfig(
_target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>,
lr_pre_warmup=1e-08,
lr_final=1.6e-06,
warmup_steps=0,
max_steps=30000,
ramp='cosine'
)
},
'features_dc': {
'optimizer': AdamOptimizerConfig(
_target=<class 'torch.optim.adam.Adam'>,
lr=0.0025,
eps=1e-15,
max_norm=None,
weight_decay=0
),
'scheduler': None
},
'features_rest': {
'optimizer': AdamOptimizerConfig(
_target=<class 'torch.optim.adam.Adam'>,
lr=0.000125,
eps=1e-15,
max_norm=None,
weight_decay=0
),
'scheduler': None
},
'opacities': {
'optimizer': AdamOptimizerConfig(
_target=<class 'torch.optim.adam.Adam'>,
lr=0.05,
eps=1e-15,
max_norm=None,
weight_decay=0
),
'scheduler': None
},
'scales': {
'optimizer': AdamOptimizerConfig(
_target=<class 'torch.optim.adam.Adam'>,
lr=0.005,
eps=1e-15,
max_norm=None,
weight_decay=0
),
'scheduler': None
},
'quats': {
'optimizer': AdamOptimizerConfig(
_target=<class 'torch.optim.adam.Adam'>,
lr=0.001,
eps=1e-15,
max_norm=None,
weight_decay=0
),
'scheduler': None
},
'camera_opt': {
'optimizer': AdamOptimizerConfig(
_target=<class 'torch.optim.adam.Adam'>,
lr=0.0001,
eps=1e-15,
max_norm=None,
weight_decay=0
),
'scheduler': ExponentialDecaySchedulerConfig(
_target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>,
lr_pre_warmup=0,
lr_final=5e-07,
warmup_steps=1000,
max_steps=30000,
ramp='cosine'
)
},
'bilateral_grid': {
'optimizer': AdamOptimizerConfig(
_target=<class 'torch.optim.adam.Adam'>,
lr=0.002,
eps=1e-15,
max_norm=None,
weight_decay=0
),
'scheduler': ExponentialDecaySchedulerConfig(
_target=<class 'nerfstudio.engine.schedulers.ExponentialDecayScheduler'>,
lr_pre_warmup=0,
lr_final=0.0001,
warmup_steps=1000,
max_steps=30000,
ramp='cosine'
)
}
},
vis='viewer+tensorboard',
data=None,
prompt=None,
relative_model_dir=PosixPath('nerfstudio_models'),
load_scheduler=True,
steps_per_save=2000,
steps_per_eval_batch=0,
steps_per_eval_image=100,
steps_per_eval_all_images=1000,
max_num_iterations=30000,
mixed_precision=False,
use_grad_scaler=False,
save_only_latest_checkpoint=True,
load_dir=PosixPath('outputs/unnamed/splatfacto/2024-12-19_165704/nerfstudio_models'),
load_step=None,
load_config=None,
load_checkpoint=None,
log_gradients=False,
gradient_accumulation_steps={},
start_paused=False
)
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
[17:07:34] Saving config to: outputs/unnamed/splatfacto/2024-12-19_170734/config.yml experiment_config.py:136
/src/nerfstudio/engine/trainer.py:137: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
self.grad_scaler = GradScaler(enabled=self.use_grad_scaler)
Saving checkpoints to: outputs/unnamed/splatfacto/2024-12-19_170734/nerfstudio_models trainer.py:142
Train dataset has over 500 images, overriding cache_images to cpu
/src/venv/lib/python3.10/site-packages/torchmetrics/functional/image/lpips.py:325: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
self.load_state_dict(torch.load(model_path, map_location="cpu"), strict=False)
╭─────────────── viser ───────────────╮
│ ╷ │
│ HTTP │ http://0.0.0.0:7007 │
│ Websocket │ ws://0.0.0.0:7007 │
│ ╵ │
╰─────────────────────────────────────╯
[17:07:48] Caching / undistorting eval images full_images_datamanager.py:230
Loading latest Nerfstudio checkpoint from load_dir...
/src/nerfstudio/engine/trainer.py:432: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
loaded_state = torch.load(load_path, map_location="cpu")
Done loading Nerfstudio checkpoint from
outputs/unnamed/splatfacto/2024-12-19_165704/nerfstudio_models/step-000006999.ckpt
logging events to: outputs/unnamed/splatfacto/2024-12-19_170734
[17:08:08] Caching / undistorting train images full_images_datamanager.py:230
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:1964: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [32,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [31,0,0], thread: [33,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
... snip (the same assertion repeats for many more threads and blocks) ...
../aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [22,0,0], thread: [127,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
Printing profiling stats, from longest to shortest duration in seconds
Trainer.train_iteration: 14.4561
VanillaPipeline.get_train_loss_dict: 14.1312
Traceback (most recent call last):
File "/src/venv/bin/ns-train", line 8, in <module>
sys.exit(entrypoint())
File "/src/nerfstudio/scripts/train.py", line 272, in entrypoint
main(
File "/src/nerfstudio/scripts/train.py", line 257, in main
launch(
File "/src/nerfstudio/scripts/train.py", line 190, in launch
main_func(local_rank=0, world_size=world_size, config=config)
File "/src/nerfstudio/scripts/train.py", line 101, in train_loop
trainer.train()
File "/src/nerfstudio/engine/trainer.py", line 270, in train
callback.run_callback_at_location(
File "/src/nerfstudio/engine/callbacks.py", line 116, in run_callback_at_location
self.run_callback(step=step)
File "/src/nerfstudio/engine/callbacks.py", line 106, in run_callback
self.func(*self.args, **self.kwargs, step=step)
File "/src/nerfstudio/models/splatfacto.py", line 341, in step_post_backward
self.strategy.step_post_backward(
File "/src/venv/lib/python3.10/site-packages/gsplat/strategy/default.py", line 173, in step_post_backward
n_dupli, n_split = self._grow_gs(params, optimizers, state, step)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/src/venv/lib/python3.10/site-packages/gsplat/strategy/default.py", line 303, in _grow_gs
split(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/src/venv/lib/python3.10/site-packages/gsplat/strategy/ops.py", line 135, in split
sel = torch.where(mask)[0]
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(venv) abroun@Desktop-22:/src$ ns-train splatfacto --vis viewer+tensorboard --max-num-iterations 30000 --load-dir outputs/unnamed/splatfacto/2024-12-19_165704/nerfstudio_models colmap --data /media/abroun/9dcb521e-bc62-451f-aaa8-bc85c85e14f9/home/abroun/Pictures/photogrammetry/personal/knaresborough_castle/museum_20241212/complete_colmap/ --downscale-factor 4 --colmap-path sparse/0
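(For what it's worth, the error message suggests that re-running with CUDA_LAUNCH_BLOCKING=1 should give a more accurate stack trace. I assume that just means prefixing the same command with the environment variable, e.g.
CUDA_LAUNCH_BLOCKING=1 ns-train splatfacto --vis viewer+tensorboard --max-num-iterations 30000 --load-dir outputs/unnamed/splatfacto/2024-12-19_165704/nerfstudio_models colmap --data /media/abroun/9dcb521e-bc62-451f-aaa8-bc85c85e14f9/home/abroun/Pictures/photogrammetry/personal/knaresborough_castle/museum_20241212/complete_colmap/ --downscale-factor 4 --colmap-path sparse/0
so that the failing kernel is reported synchronously.)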
Expected behavior
Ideally I'd like to be able to resume training so that I can experiment with adjusting learning rates, optimisation parameters, etc.
Additional context
My goal here is to experiment with different learning rates, optimisation parameters, etc.: find some settings that work well for a small number of iterations, and then adjust parameters further from that baseline. Being able to resume training from 7000 iterations would let me explore parameter changes over time much more quickly, rather than having to start from scratch each time. I'm not particularly familiar with the Nerfstudio project though, so if there's a better way of achieving this goal I'd be very grateful for some pointers. Cheers.
@abroun I am not an expert, but based on my experience with the repo, the problem is that your first command sets the maximum number of iterations to 7000, and that number is written into the config file that you later use to resume training. So 7000 is treated as your last iteration, and then you ask for 30000.
You need to set --max-num-iterations 30000 from the beginning and find a way to stop it at 7000, or save a checkpoint at 7000 -> stop the training -> make your modifications -> resume training; see the sketch below.
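For example, something along these lines might work (a rough sketch, not verified; I'm assuming --steps-per-save and --save-only-latest-checkpoint map to the steps_per_save and save_only_latest_checkpoint fields visible in the config dump above):
# train with the full 30000-iteration schedule, but keep checkpoints every 1000 steps
ns-train splatfacto --vis viewer+tensorboard --max-num-iterations 30000 --steps-per-save 1000 --save-only-latest-checkpoint False colmap --data <your-data> --downscale-factor 4
# stop the run around step 7000 (e.g. Ctrl+C), adjust parameters, then resume from the saved checkpoint
ns-train splatfacto --vis viewer+tensorboard --max-num-iterations 30000 --load-dir outputs/<experiment>/splatfacto/<timestamp>/nerfstudio_models colmap --data <your-data> --downscale-factor 4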
@abroun Hi, have you found a fix for the problem?
I received the same error when I tried to resume training from a different dataset location (same dataset, just moved to a different place in the file system) than the one I first trained on. I also wasn't loading the previous config.yml file, and was using this command to resume:
ns-train splatfacto --viewer.quit-on-train-completion=True --timestamp 'train-stage-2' --load-dir '/mnt/efs/data/resume-test/dataset/nerfstudio_models' --load-scheduler False --pipeline.model.use-scale-regularization True --max-num-iterations 30000 --pipeline.datamanager.cache-images 'disk' colmap --data '/mnt/efs/data/resume-test/dataset' --downscale-factor 1
What fixed the error for me was to first edit the config.yml file to point to the new dataset location, update the timestamp, and make sure max-num-iterations was set correctly inside the file. Then I just passed in the config.yml along with the checkpoint like this:
ns-train splatfacto --viewer.quit-on-train-completion=True --load-dir '/mnt/efs/data/resume-test/dataset/nerfstudio_models' --load-config '/mnt/efs/data/resume-test/dataset/config.yml' --load-scheduler False --pipeline.datamanager.cache-images 'disk' colmap --data '/mnt/efs/data/resume-test/dataset' --downscale-factor 1
I hope this helps. Although the original post doesn't suggest the dataset was moved, you can still pass --load-config to load your previous run; just make sure you change the values in config.yml to correspond to your new parameters (it seemed that most parameters given on the command line are overridden by config.yml when you pass --load-config). A rough sketch of the values I changed is below.
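For reference, the values I changed look roughly like this (just a sketch; the actual config.yml is a full YAML dump of the TrainerConfig shown above, complete with Python object tags, so edit the existing values in place rather than pasting this in):
max_num_iterations: 30000                  # total iterations for the resumed run
timestamp: train-stage-2                   # avoid clobbering the original run's outputs
data: /mnt/efs/data/resume-test/dataset    # dataset path, found under the datamanager/dataparser section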