instant-ngp freezes

faad3 opened this issue 3 years ago

I'm trying to train instant-ngp, but the process seems to hang: nothing happens in the terminal or in the GUI.

Running from docker dromni/nerfstudio:0.1.14

The command: ns-train instant-ngp --data data/nerfstudio/poster

Output:

[15:04:54] Using --data alias for --data.pipeline.datamanager.dataparser.data    train.py:223
──────────────────────────────── Config ────────────────────────────────
ExperimentConfig(
    output_dir=PosixPath('outputs'),
    method_name='instant-ngp',
    experiment_name=None,
    timestamp='2023-01-12_150454',
    machine=MachineConfig(seed=42, num_gpus=1, num_machines=1, machine_rank=0, dist_url='auto'),
    logging=LoggingConfig(
        relative_log_dir=PosixPath('.'),
        steps_per_log=10,
        max_buffer_size=20,
        local_writer=LocalWriterConfig(
            _target=<class 'nerfstudio.utils.writer.LocalWriter'>,
            enable=True,
            stats_to_track=(
                <EventName.ITER_TRAIN_TIME: 'Train Iter (time)'>,
                <EventName.TRAIN_RAYS_PER_SEC: 'Train Rays / Sec'>,
                <EventName.CURR_TEST_PSNR: 'Test PSNR'>,
                <EventName.VIS_RAYS_PER_SEC: 'Vis Rays / Sec'>,
                <EventName.TEST_RAYS_PER_SEC: 'Test Rays / Sec'>
            ),
            max_log_size=10
        ),
        enable_profiler=True
    ),
    viewer=ViewerConfig(
        relative_log_filename='viewer_log_filename.txt',
        start_train=True,
        zmq_port=None,
        launch_bridge_server=True,
        websocket_port=7007,
        ip_address='127.0.0.1',
        num_rays_per_chunk=64000,
        max_num_display_images=512,
        quit_on_train_completion=False,
        skip_openrelay=False
    ),
    trainer=TrainerConfig(
        steps_per_save=2000,
        steps_per_eval_batch=500,
        steps_per_eval_image=500,
        steps_per_eval_all_images=25000,
        max_num_iterations=30000,
        mixed_precision=True,
        relative_model_dir=PosixPath('nerfstudio_models'),
        save_only_latest_checkpoint=True,
        load_dir=None,
        load_step=None,
        load_config=None
    ),
    pipeline=DynamicBatchPipelineConfig(
        _target=<class 'nerfstudio.pipelines.dynamic_batch.DynamicBatchPipeline'>,
        datamanager=VanillaDataManagerConfig(
            _target=<class 'nerfstudio.data.datamanagers.base_datamanager.VanillaDataManager'>,
            dataparser=NerfstudioDataParserConfig(
                _target=<class 'nerfstudio.data.dataparsers.nerfstudio_dataparser.Nerfstudio'>,
                data=PosixPath('data/nerfstudio/poster'),
                scale_factor=1.0,
                downscale_factor=None,
                scene_scale=1.0,
                orientation_method='up',
                center_poses=True,
                auto_scale_poses=True,
                train_split_percentage=0.9
            ),
            train_num_rays_per_batch=8192,
            train_num_images_to_sample_from=-1,
            train_num_times_to_repeat_images=-1,
            eval_num_rays_per_batch=1024,
            eval_num_images_to_sample_from=-1,
            eval_num_times_to_repeat_images=-1,
            eval_image_indices=(0,),
            camera_optimizer=CameraOptimizerConfig(
                _target=<class 'nerfstudio.cameras.camera_optimizers.CameraOptimizer'>,
                mode='off',
                position_noise_std=0.0,
                orientation_noise_std=0.0,
                optimizer=AdamOptimizerConfig(
                    _target=<class 'torch.optim.adam.Adam'>,
                    lr=0.0006,
                    eps=1e-15,
                    weight_decay=0
                ),
                scheduler=SchedulerConfig(
                    _target=<class 'nerfstudio.engine.schedulers.ExponentialDecaySchedule'>,
                    lr_final=5e-06,
                    max_steps=10000
                ),
                param_group='camera_opt'
            ),
            camera_res_scale_factor=1.0
        ),
        model=InstantNGPModelConfig(
            _target=<class 'nerfstudio.models.instant_ngp.NGPModel'>,
            enable_collider=False,
            collider_params=None,
            loss_coefficients={'rgb_loss_coarse': 1.0, 'rgb_loss_fine': 1.0},
            eval_num_rays_per_chunk=8192,
            max_num_samples_per_ray=24,
            grid_resolution=128,
            contraction_type=<ContractionType.UN_BOUNDED_SPHERE: 2>,
            cone_angle=0.004,
            render_step_size=0.01,
            near_plane=0.05,
            far_plane=1000.0,
            use_appearance_embedding=False,
            background_color='random'
        ),
        target_num_samples=262144,
        max_num_samples_per_ray=1024
    ),
    optimizers={
        'fields': {
            'optimizer': AdamOptimizerConfig(
                _target=<class 'torch.optim.adam.Adam'>,
                lr=0.01,
                eps=1e-15,
                weight_decay=0
            ),
            'scheduler': None
        }
    },
    vis='tensorboard',
    data=PosixPath('data/nerfstudio/poster')
)
─────────────────────────────────────────────────────────────────────────
[15:04:54] Saving config to: outputs/data-nerfstudio-poster/instant-ngp/2023-01-12_150454/config.yml    experiment_config.py:122
[15:04:54] Saving checkpoints to: outputs/data-nerfstudio-poster/instant-ngp/2023-01-12_150454/nerfstudio_models    trainer.py:90
logging events to: outputs/data-nerfstudio-poster/instant-ngp/2023-01-12_150454
[15:04:54] Auto image downscale factor of 2    nerfstudio_dataparser.py:294
[15:04:55] Skipping 0 files in dataset split train.    nerfstudio_dataparser.py:156
[15:04:56] Skipping 0 files in dataset split val.    nerfstudio_dataparser.py:156
Setting up training dataset...
Caching all 204 images.
Setting up evaluation dataset...
Caching all 22 images.
None
No checkpoints to load, training from scratch

And after that, nothing happens. Maybe I'm doing something wrong? Thank you in advance.

faad3 avatar Jan 12 '23 16:01 faad3

I had the same issue running the docker image of version 0.1.14. cmd output:

Sending ping to the viewer Bridge Server... Successfully connected.
Sending ping to the viewer Bridge Server... Successfully connected.
[NOTE] Not running eval iterations since only viewer is enabled. Use --vis wandb or --vis tensorboard to run with eval instead.
Disabled tensorboard/wandb event writers
[02:36:54] Auto image downscale factor of 2    nerfstudio_dataparser.py:294
[02:36:55] Skipping 0 files in dataset split train.    nerfstudio_dataparser.py:156
           Skipping 0 files in dataset split val.    nerfstudio_dataparser.py:156
Setting up training dataset...
Caching all 204 images.
Setting up evaluation dataset...
Caching all 22 images.
None
No checkpoints to load, training from scratch
( ● ) NerfAcc: Setting up CUDA (This may take a few minutes the first time)Killed

This "NerfAcc: Setting up CUDA" will excute When using instant-ngp for the first time. But each time the process is automatically killed.Run the command “ns-train instant-ngp --data data/nerfstudio/poster” again with unsuccessful compilation, it would feezes. Maybe you could delete the nerfacc cache and try again,the cache is located in ~/.cache/torch_extensions/py310_cu116 Good luck.

ccysway avatar Jan 16 '23 02:01 ccysway

I ran into the same issue, too.

shuimoo avatar Jan 17 '23 11:01 shuimoo

Same problem here. The solution above (deleting the NerfAcc cache) is helpful.

meneldil12555 avatar Mar 18 '23 17:03 meneldil12555

I deleted the cache and reran the command, but it gets killed again, and the cycle repeats. Why is the process getting killed in the first place?

Vathys avatar Apr 26 '23 13:04 Vathys

Probably first-run CUDA setup, e.g. NerfAcc? (There is supposed to be a message to that effect in the server console.) You need to let it finish the first time; then it should cache the result for future runs. If you cancel out, it will redo the setup on the next try.
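
A quick way to check whether a previous first-run build actually completed and was cached, as a sketch assuming the default ~/.cache/torch_extensions layout (folder names vary by Python/CUDA version):

```python
# Sketch: look for compiled extension artifacts (*.so) under the default
# torch_extensions cache. If none are found, the first-run CUDA setup has not
# finished, and the next run will try to compile again.
from pathlib import Path

cache_root = Path.home() / ".cache" / "torch_extensions"
built = sorted(cache_root.glob("*/*/*.so")) if cache_root.exists() else []
if built:
    print("Cached CUDA extension builds found:")
    for so in built:
        print(f"  {so}")
else:
    print("No cached builds found; the first-run setup has not completed.")
```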

machenmusik avatar Apr 26 '23 14:04 machenmusik

For those using Windows, I found the temp files in C:\Users\<username>\AppData\Local\torch_extensions\py38_cu118. After deleting that folder I had to rebuild nerfacc with pip using: pip install git+https://github.com/KAIR-BAIR/nerfacc.git
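
The same cleanup on Windows, as a sketch based on the path in the comment above (the py38_cu118 folder name is an assumption and will differ with your Python/CUDA versions):

```python
# Sketch: remove the torch_extensions cache on Windows, using the location
# reported above (LOCALAPPDATA\torch_extensions\py38_cu118 is an assumption;
# adjust for your Python/CUDA versions), then rebuild nerfacc with pip.
import os
import shutil
from pathlib import Path

local_appdata = Path(os.environ.get("LOCALAPPDATA", Path.home() / "AppData" / "Local"))
cache_dir = local_appdata / "torch_extensions" / "py38_cu118"
if cache_dir.exists():
    shutil.rmtree(cache_dir)
    print(f"Removed {cache_dir}")
# Afterwards, rebuild nerfacc:
#   pip install git+https://github.com/KAIR-BAIR/nerfacc.git
```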

AndreeInCodeLand avatar Nov 16 '23 12:11 AndreeInCodeLand