
Errors for UNet3D application on distconv LBANN

Open JBae2 opened this issue 3 years ago • 2 comments

Hello, I am trying to run the supported UNet3D application code from the LBANN GitHub repository, but it fails.

Judging from the distconv environment and its related source code, input with the "labels" data_field does not appear to be supported yet. The source code also mentions that Distconv currently supports only CosmoFlow data.

Is it possible to run the UNet3D application on LBANN, or am I missing something? If you know anything about this, please advise.

This is the main function of my source code, which I modified from the unet3d example. The omitted functions are the same as in the original. Thank you.

if __name__ == '__main__':
    # Note the trailing space so the two concatenated sentences do not run together.
    desc = ('Construct and run the 3D U-Net on a 3D segmentation dataset. '
            'Running the experiment is only supported on LC systems.')
    parser = argparse.ArgumentParser(description=desc)
    lbann.contrib.args.add_scheduler_arguments(parser)

    # (parser.add_argument section omitted)

    lbann.contrib.args.add_optimizer_arguments(
        parser,
        default_optimizer="adam",
        default_learning_rate=0.001,
    )

    args = parser.parse_args()
    args.procs_per_node=4

    parallel_strategy = get_parallel_strategy_args(
        sample_groups=args.mini_batch_size,
        depth_groups=args.depth_groups)

    # Construct layer graph
    volume = lbann.Input(data_field='samples')
    segmentation = lbann.Input(data_field='labels')

    output = UNet3D()(volume)

    ce = lbann.CrossEntropy([output, segmentation])
    layers = list(lbann.traverse_layer_graph([volume, segmentation]))

    obj = lbann.ObjectiveFunction([ce])

    for l in layers:
        l.parallel_strategy = parallel_strategy

    # Setup model
    metrics = [lbann.Metric(ce, name='CE', unit='')]
    callbacks = [
        lbann.CallbackPrint(),
        lbann.CallbackTimer(),
        lbann.CallbackGPUMemoryUsage(),
        lbann.CallbackProfiler(skip_init=True),
    ]
    # # TODO: Use polynomial learning rate decay (https://github.com/LLNL/lbann/issues/1581)
    # callbacks.append(
    #     lbann.CallbackPolyLearningRate(
    #         power=1.0,
    #         num_epochs=100,
    #         end_lr=1e-5))
    model = lbann.Model(
        epochs=args.num_epochs,
        layers=layers,
        objective_function=obj,
        callbacks=callbacks,
    )

    # Setup optimizer
    optimizer = lbann.contrib.args.create_optimizer(args)

    # Setup data reader
    data_reader = create_unet3d_data_reader(
        train_dir=args.train_dir,
        test_dir=args.test_dir)

    # Setup trainer
    trainer = lbann.Trainer(mini_batch_size=args.mini_batch_size)

    # Runtime parameters/arguments
    environment = lbann.contrib.args.get_distconv_environment(
        num_io_partitions=args.depth_groups)
    if args.dynamically_reclaim_error_signals:
        environment['LBANN_KEEP_ERROR_SIGNALS'] = 0
    else:
        environment['LBANN_KEEP_ERROR_SIGNALS'] = 1
    lbann_args = ['--use_data_store']

    # Run experiment
    kwargs = lbann.contrib.args.get_scheduler_kwargs(args)
    lbann.contrib.launcher.run(
        trainer, model, data_reader, optimizer,
        job_name=args.job_name,
        environment=environment,
        lbann_args=lbann_args,
        batch_job=args.batch_job,
        **kwargs)
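
For readers following the "Runtime parameters/arguments" step, the `LBANN_KEEP_ERROR_SIGNALS` toggle can be isolated as a small pure-Python sketch. The helper name `with_error_signal_policy` and the stand-in dictionary are hypothetical and not part of LBANN. The only assumption is that `get_distconv_environment` returns a dict-like mapping, as the item assignments in the snippet suggest.

```python
def with_error_signal_policy(base_env, dynamically_reclaim):
    """Return a copy of base_env with LBANN_KEEP_ERROR_SIGNALS set.

    Hypothetical helper (not an LBANN API). It mirrors the if/else in the
    snippet: 0 when error signals are reclaimed dynamically, 1 otherwise.
    """
    env = dict(base_env)  # copy so the caller's mapping is left untouched
    env['LBANN_KEEP_ERROR_SIGNALS'] = 0 if dynamically_reclaim else 1
    return env


# Stand-in for the mapping returned by get_distconv_environment;
# the key and value here are made up for illustration.
base = {'EXAMPLE_DISTCONV_VAR': 1}

print(with_error_signal_policy(base, dynamically_reclaim=True))
print(with_error_signal_policy(base, dynamically_reclaim=False))
```

Returning a copy rather than mutating in place is a design choice for this sketch only; it is meant to make it easy to compare the two settings side by side when debugging a launch.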

JBae2 avatar Nov 17 '22 00:11 JBae2

@JBae2 There is a bug in the current UNet3D model: the Python representation of the model has drifted out of sync with some of the internal changes that have occurred in LBANN. This issue is currently being worked on in PR #2151, which is not yet complete.

bvanessen avatar Nov 28 '22 17:11 bvanessen

@bvanessen Can this be closed as #2151 is now merged?

benson31 avatar Feb 01 '23 15:02 benson31