PDEBench Config files / running the example code

Hi! Great work, it's really useful to have dataloaders set up for all these different PDE examples.

I've been struggling to run the example code, e.g. on the Advection data, which is what the data_download file fetches by default. (Btw, the downloader doesn't respect the data paths in the config file, which is a bit confusing for a new user.)

I started by trying

CUDA_VISIBLE_DEVICES='2' python train_models_forward.py +args=config.yaml

and got the following error: In 'config': Could not find 'args/config.yaml'. My understanding is that I had to replace config.yaml with config_pinn_pde1d.yaml, but I'm not entirely sure.

Once I did so, and changed the filename and root_path in my config, I got the error "local variable 'timedomain' referenced before assignment", which seems to be due to how the filename is parsed: https://github.com/pdebench/PDEBench/blob/f8c8493692e3290f1435b99e4eae66f224938414/pdebench/models/pinn/train.py#L217 It would be nice to raise a ValueError around this line, and perhaps to be more explicit about which filenames are allowed.
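
A minimal sketch of the kind of guard suggested here. The function name `time_domain_for` and the table entries are hypothetical; the substrings and values that `pdebench/models/pinn/train.py` actually expects are not reproduced.

```python
def time_domain_for(filename, known_domains):
    """Return the time domain registered for the first key found in `filename`.

    `known_domains` maps a filename substring to a (t_min, t_max) pair.
    Raises ValueError instead of falling through with an unassigned variable.
    """
    for key, domain in known_domains.items():
        if key in filename:
            return domain
    allowed = ", ".join(sorted(known_domains))
    raise ValueError(
        f"Unrecognized filename {filename!r}; "
        f"expected it to contain one of: {allowed}"
    )


# Placeholder table for illustration only, not PDEBench's real values.
EXAMPLE_DOMAINS = {"Advection": (0.0, 1.0), "Burgers": (0.0, 1.0)}
```

With a guard like this, an unsupported filename fails immediately with a message listing the accepted patterns, rather than with an unassigned-variable error later on.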

After fixing this, I got a shape error

File "/Users/gnegiar/PDEBench/pdebench/models/pinn/utils.py", line 355, in __init__
    self.bd_data_L = self.data_output[0, :, None]
IndexError: too many indices for tensor of dimension 1

https://github.com/pdebench/PDEBench/blob/f8c8493692e3290f1435b99e4eae66f224938414/pdebench/models/pinn/utils.py#L354, which I'm not sure how to fix. It seems that, because of the indexing with val_batch_idx, the tensor self.data_output becomes 1D. I'm surprised that the tensor isn't 3D to start with (I checked, and h5_file['tensor'] has shape (201, 1024)).
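
To make the failure concrete: assuming `val_batch_idx` is a single integer (an assumption, not confirmed against the code), indexing a `(201, 1024)` array with it leaves shape `(1024,)`, and `[0, :, None]` then indexes two axes of a 1-D tensor. The pure-Python sketch below only tracks shapes; the helper names are hypothetical.

```python
def shape_after_int_index(shape):
    """Shape that remains after indexing the first axis with one integer."""
    if len(shape) == 0:
        raise IndexError("cannot index a 0-d array")
    return tuple(shape[1:])


def check_indexable(shape, n_axes_indexed):
    """Raise early, with a readable message, if too many axes are indexed."""
    if n_axes_indexed > len(shape):
        raise ValueError(
            f"data has shape {shape} ({len(shape)}-D) but the code indexes "
            f"{n_axes_indexed} axes"
        )


stored = (201, 1024)                      # shape reported for h5_file['tensor']
selected = shape_after_int_index(stored)  # (1024,)
# In [0, :, None] the `0` and `:` each consume an axis; `None` only adds one.
AXES_CONSUMED = 2
```

Calling `check_indexable(selected, AXES_CONSUMED)` raises here, whereas a file with an extra leading sample axis would pass the same check.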

Perhaps I downloaded the wrong files? I'm using the data1D/Advection/Test/Advection_beta0.1.h5 file for now, thinking it would be the simplest.

Here is the config that I used for downloading (I didn't modify it):

args:
    filename: 'Advection_beta'
    dataverse_url: 'https://darus.uni-stuttgart.de'
    dataset_id: 'doi:10.18419/darus-2986'
    data_folder: 'data'

Any help would be greatly welcome!!

Nov 09 '22 20:11 GeoffNN

I'm also getting errors when trying to run the scripts in run_forward_1D.sh.

For instance, running this

CUDA_VISIBLE_DEVICES='2' python3 train_models_forward.py +args=config.yaml ++args.filename='1D_Burgers_Sols_Nu0.001.hdf5' ++args.model_name='FNO'

from the pdebench/models folder gives

In 'config': Could not find 'args/config.yaml'

Available options in 'args':
        config_1DCFD
        config_2DCFD
        config_3DCFD
        config_Darcy
        config_ReacDiff
        config_diff-react
        config_diff-sorp
        config_pinn_CFD1d
        config_pinn_diff-react
        config_pinn_diff-sorp
        config_pinn_pde1d
        config_pinn_swe2d
        config_rdb
Config search path:
        provider=hydra, path=pkg://hydra.conf
        provider=main, path=file:///home/ubuntu/PDEBench/pdebench/models/config
        provider=schema, path=structured://

Nov 14 '22 10:11 GeoffNN

Hi there, could you try removing the +args=config.yaml argument? I think this happens because we already specify config_path="config" and config_name="config" in the code, so it refers to the config.yaml file inside the config directory. The way we intended the config to be used is to copy the arguments from the provided config file for the problem to be solved and paste them into the config.yaml file. Another option would be to change your argument to +args=<specific_config_filename>, where specific_config_filename is e.g. config_1DCFD.yaml. Please let us know if this works.
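
A small pre-flight check in that spirit, using only the standard library. `available_options` and `resolve_option` are hypothetical helpers, not part of Hydra or PDEBench, and the `<config_dir>/args/*.yaml` layout is inferred from the search path and option list printed in the error message.

```python
from pathlib import Path


def available_options(config_dir, group="args"):
    """List option names (file stems) found in `<config_dir>/<group>/*.yaml`."""
    return sorted(p.stem for p in (Path(config_dir) / group).glob("*.yaml"))


def resolve_option(config_dir, requested, group="args"):
    """Return the option name if it exists, else raise with the valid choices."""
    suffix = ".yaml"
    name = requested[: -len(suffix)] if requested.endswith(suffix) else requested
    options = available_options(config_dir, group)
    if name not in options:
        raise FileNotFoundError(
            f"No '{group}/{name}.yaml' under {config_dir}; available: {options}"
        )
    return name
```

Running `resolve_option("pdebench/models/config", "config.yaml")` before launching would show right away that `config` is not among the options in the `args` group.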

Nov 15 '22 10:11 timothypraditia

Hi!

Thanks for the comment. This seems to work for this step, but unfortunately I still run into some errors.

When running using Burgers, I get a strange

FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = '/home/ubuntu/PDEBench/pdebench/data/1D/Burgers/1D_Burgers_Sols_Nu0.04.hdf5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

although the file is definitely in the correct location.

When running on DarcyFlow, I get a PyTorch dataloader issue:


Error executing job with overrides: ['++args.filename=1D_Advection_Sols_beta4.0.hdf5', '++args.model_name=FNO', '++args.base_path=/home/ubuntu/PDEBench/pdebench/data/1D/Advection/Train/']
Traceback (most recent call last):
  File "/home/ubuntu/PDEBench/pdebench/models/train_models_forward.py", line 169, in main
    run_training_FNO(
  File "/home/ubuntu/PDEBench/pdebench/models/fno/train.py", line 186, in run_training
    for xx, yy, grid in train_loader:
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
    return self._process_data(data)
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
    data.reraise()
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/PDEBench/pdebench/models/fno/utils.py", line 360, in __getitem__
    return self.data[idx,...,:self.initial_step,:], self.data[idx], self.grid
RuntimeError: CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

which I haven't been able to debug.

Could you please share which version of pytorch you tested the library on?

Thanks!

Nov 16 '22 03:11 GeoffNN

Could you please share which line of code from which script leads to the error? Could you also recheck that the file is indeed in the specified directory and is not corrupted? Or maybe you missed the 'Train' directory when specifying the base path? Try setting the base path to '/home/ubuntu/PDEBench/pdebench/data/1D/Burgers/Train/'.
Could you try either using CPU only or setting the number of workers to 1, and see if it works (this should also give a clearer error message)? @mtakamoto-D: could you please check the arguments for DarcyFlow and the version of PyTorch you're using?
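
A quick way to rule out the path problem is to check the file before training starts. The sketch below uses only the standard library; `check_data_file` is a hypothetical helper, not something PDEBench provides.

```python
import os


def check_data_file(base_path, filename):
    """Return the full path to the data file, or raise listing what is there."""
    full_path = os.path.join(base_path, filename)
    if os.path.isfile(full_path):
        return full_path
    if os.path.isdir(base_path):
        found = sorted(os.listdir(base_path))
        hint = f"directory exists and contains: {found}"
    else:
        hint = "directory does not exist"
    raise FileNotFoundError(f"{full_path} not found; {hint}")
```

If the base path is missing a subdirectory such as `Train`, the message lists the entries of the directory that was actually searched, which makes the mismatch easier to spot.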

Nov 16 '22 06:11 timothypraditia

Based on our experience, this occurs when training on an RTX 3090 GPU with CUDA 11.3+. Could you please apply the following modification to the corresponding part of train.py (e.g., for FNO, L101--104)?

gen_device = 'cuda'

train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size,
                                           num_workers=num_workers, shuffle=True,
                                           generator=torch.Generator(device=gen_device))

val_loader = torch.utils.data.DataLoader(val_data, batch_size=batch_size,
                                         num_workers=num_workers, shuffle=False,
                                         generator=torch.Generator(device=gen_device))
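
For context, a CUDA initialization error raised inside a DataLoader worker is commonly associated with worker processes touching tensors that already live on the GPU; whether that is the cause here is an assumption. One conservative workaround, in line with the earlier suggestion to reduce the number of workers, is to load data in the main process whenever the dataset is on a CUDA device. The helper below is a hypothetical, standard-library-only sketch that just builds keyword arguments; it is not part of PDEBench or PyTorch.

```python
def loader_kwargs(data_device, batch_size, num_workers, shuffle):
    """Keyword arguments for a DataLoader.

    Falls back to in-process loading (num_workers=0) when the dataset's
    tensors are on a CUDA device; otherwise keeps the requested worker count.
    """
    on_gpu = str(data_device).startswith("cuda")
    return {
        "batch_size": batch_size,
        "num_workers": 0 if on_gpu else num_workers,
        "shuffle": shuffle,
    }
```

It could be used as `torch.utils.data.DataLoader(train_data, **loader_kwargs("cuda", batch_size, num_workers, True))`.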

Nov 16 '22 15:11 mtakamoto-D

Hi, thanks for the reply.

You're right, I was omitting the 'Train/' directory. Thanks.
Both scripts now error out with the GPU initialization error. Stack trace below. I'm using V100 GPUs with CUDA 11.7 (also tried 11.6).

Error executing job with overrides: ['++args.base_path=/home/ubuntu/PDEBench/pdebench/data/1D/Burgers/Train/', '++args.filename=1D_Burgers_Sols_Nu0.001.hdf5', '++args.model_name=FNO']
Traceback (most recent call last):
  File "/home/ubuntu/PDEBench/pdebench/models/train_models_forward.py", line 169, in main
    run_training_FNO(
  File "/home/ubuntu/PDEBench/pdebench/models/fno/train.py", line 194, in run_training
    for xx, yy, grid in train_loader:
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
    return self._process_data(data)
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
    data.reraise()
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/PDEBench/pdebench/models/fno/utils.py", line 360, in __getitem__
    return self.data[idx,...,:self.initial_step,:], self.data[idx], self.grid
RuntimeError: CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

In the config file I switched to single_file:True -- using False was giving me this error:


Traceback (most recent call last):
  File "/home/ubuntu/PDEBench/pdebench/models/train_models_forward.py", line 246, in <module>
    main()
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/main.py", line 90, in decorated_main
    _run_hydra(
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/_internal/utils.py", line 296, in run_and_report
    raise ex
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/ubuntu/PDEBench/pdebench/models/train_models_forward.py", line 169, in main
    run_training_FNO(
  File "/home/ubuntu/PDEBench/pdebench/models/fno/train.py", line 88, in run_training
    train_data = FNODatasetMult(flnm,
  File "/home/ubuntu/PDEBench/pdebench/models/fno/utils.py", line 387, in __init__
    with h5py.File(self.file_path, 'r') as h5_file:
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/h5py/_hl/files.py", line 533, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/h5py/_hl/files.py", line 226, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 106, in h5py.h5f.open
FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = '/home/ubuntu/PDEBench/pdebench/data/1D/Burgers/Train/1D_Burgers_Sols_Nu0.001.hdf5.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

Note the hdf5.h5 in the error.

Here's the command I'm running and the config file.

python3 models/train_models_forward.py ++args.base_path=/home/ubuntu/PDEBench/pdebench/data/1D/Burgers/Train/ ++args.filename='1D_Burgers_Sols_Nu0.001.hdf5' ++args.model_name='FNO'

defaults:
  - _self_
  - override hydra/hydra_logging: disabled
  - override hydra/job_logging: disabled

hydra:
  output_subdir: null
  run:
    dir: .
    
args:
    model_name: 'FNO'
    if_training: True
    continue_training: False
    num_workers: 2
    batch_size: 5
    initial_step: 10
    t_train: 101
    model_update: 10
    single_file: True
    reduced_resolution: 1
    reduced_resolution_t: 1
    reduced_batch: 1
    epochs: 500
    learning_rate: 1.e-3
    scheduler_step: 100
    scheduler_gamma: 0.5
    #Unet
    in_channels: 2
    out_channels: 2
    ar_mode: True
    pushforward: True
    unroll_step: 20
    #FNO
    num_channels: 2
    modes: 12
    width: 20
    #Inverse
    training_type: autoregressive
    #Inverse MCMC
    mcmc_num_samples: 20
    mcmc_warmup_steps: 10
    mcmc_num_chains: 1
    num_samples_max: 1000
    in_channels_hid: 64
    inverse_model_type: InitialConditionInterp    
    #Inverse grad
    inverse_epochs: 100
    inverse_learning_rate: 0.2
    inverse_verbose_flag: False
    #Plotting
    plot: False
    channel_plot: 0 # Which channel/variable to be plotted
    x_min: -1
    x_max: 1
    y_min: -1
    y_max: 1
    t_min: 0
    t_max: 5

Nov 16 '22 23:11 GeoffNN

Hi there,

Yes, the single_file argument should be set to True for the Burgers dataset. I assume that line 184 in the fno/utils.py script would raise an error, since the file type is hdf5 and not h5. My suspicion is that the error is caused by a dimension mismatch, because the arguments for the FNO setup (num_channels, modes, and width) might be wrong.

@mtakamoto-D Could you please provide the config arguments for the Burgers dataset and change the assert statement to also accept hdf5 files (not only h5)?
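
A sketch of what such a relaxed check could look like, together with a guard against the doubled `.hdf5.h5` suffix reported earlier in this thread. The helper names and the accepted-extension tuple are assumptions; the actual assert in `fno/utils.py` is not reproduced here.

```python
import os

ACCEPTED_EXTENSIONS = (".h5", ".hdf5")  # assumed set of valid data-file suffixes


def has_accepted_extension(filename):
    """True if the filename ends with one of the accepted HDF5 suffixes."""
    return os.path.splitext(filename)[1].lower() in ACCEPTED_EXTENSIONS


def with_extension(name, default=".h5"):
    """Append `default` only when `name` has no accepted suffix yet."""
    return name if has_accepted_extension(name) else name + default
```

With this, `with_extension('1D_Burgers_Sols_Nu0.001.hdf5')` returns the name unchanged, while a bare stem such as `'1D_Burgers_Sols_Nu0.001'` gets `.h5` appended.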

Nov 17 '22 09:11 timothypraditia

@timothypraditia It seems I added the assert statement to avoid confusion with the multi-file data stored as .h5 (mine uses .hdf5). For user-friendliness, that assert statement should be removed. If you agree, I will comment it out.

Nov 17 '22 11:11 mtakamoto-D

@mtakamoto-D Ah I see, then I think the assert command should be fine. Could you please also provide the config file for the Burgers dataset training?

Nov 17 '22 12:11 timothypraditia

@GeoffNN: Makoto will provide the config files as soon as possible. In the meantime, could you try to copy the arguments in the file config_ReacDiff.yaml to the arguments in the file config.yaml and then try to run it?

Nov 17 '22 14:11 timothypraditia

Hi there. I have uploaded the missing config file for the 1D advection/Burgers equations, along with a modified run_forward_1D.sh. Could you please update your local repository to the latest version?

Nov 25 '22 17:11 mtakamoto-D

Hello,

I haven't had the bandwidth recently to try; I'll keep you posted as soon as I do. Happy holidays!

Dec 28 '22 23:12 GeoffNN