Config files / running the example code
Hi! Great work: it's really useful to have dataloaders set up for all these different PDE examples.
I've been struggling to run the example code, e.g. on the Advection data, which is what the data_download file fetches by default. (By the way, the downloader doesn't respect the data paths in the config file, which is a bit confusing for a new user.)
I started by trying
CUDA_VISIBLE_DEVICES='2' python train_models_forward.py +args=config.yaml
and got the following error: In 'config': Could not find 'args/config.yaml'
My understanding is that I had to replace config.yaml with config_pinn_pde1d.yaml, but I'm not entirely sure.
Once I did so, and changed the filename and root_path in my config, I got the error "local variable 'timedomain' referenced before assignment", which seems to be due to how the filename is parsed: https://github.com/pdebench/PDEBench/blob/f8c8493692e3290f1435b99e4eae66f224938414/pdebench/models/pinn/train.py#L217 It would be nice to raise a ValueError around this line, and perhaps to be more explicit about which filenames are allowed.
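The failure described above, where a variable is only assigned when the filename matches certain patterns, can be turned into an explicit error. The sketch below is hypothetical: the function name, the filename prefixes, and the time-domain values are placeholders chosen for illustration and are not taken from PDEBench's train.py.

```python
from typing import Dict, Tuple

# Hypothetical prefix -> (t_min, t_max) table. The real values must come
# from the dataset documentation; these numbers are placeholders.
KNOWN_TIME_DOMAINS: Dict[str, Tuple[float, float]] = {
    "1D_Advection": (0.0, 2.0),
    "1D_Burgers": (0.0, 2.0),
}


def time_domain_for(filename: str) -> Tuple[float, float]:
    """Return the time domain for a filename, or fail loudly if it is unknown."""
    for prefix, domain in KNOWN_TIME_DOMAINS.items():
        if filename.startswith(prefix):
            return domain
    allowed = ", ".join(sorted(KNOWN_TIME_DOMAINS))
    raise ValueError(
        f"Cannot infer the time domain from {filename!r}; "
        f"expected a name starting with one of: {allowed}"
    )
```

With a guard of this kind, an unexpected name such as 'Advection_beta0.1.h5' would produce a ValueError listing the accepted prefixes instead of an UnboundLocalError further down.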
After fixing this, I got a shape error
File "/Users/gnegiar/PDEBench/pdebench/models/pinn/utils.py", line 355, in __init__
self.bd_data_L = self.data_output[0, :, None]
IndexError: too many indices for tensor of dimension 1
This is at https://github.com/pdebench/PDEBench/blob/f8c8493692e3290f1435b99e4eae66f224938414/pdebench/models/pinn/utils.py#L354, and I'm not sure how to fix it. It seems that, because of the indexing with val_batch_idx, the tensor self.data_output becomes 1D. I'm surprised that the tensor isn't 3D to start with (I checked, and h5_file['tensor'] has shape (201, 1024)).
Perhaps I downloaded the wrong files? I'm using the data1D/Advection/Test/Advection_beta0.1.h5 file for now, thinking it would be the simplest.
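One way to catch this earlier is to check the array's dimensionality before any indexing happens. The sketch below assumes the loader expects a 3D array laid out as (samples, time, space); that layout is inferred from the indexing in the traceback, not confirmed by the PDEBench documentation. The helper names require_3d and h5_dataset_shapes are hypothetical, and h5py is imported lazily so that the shape check itself has no third-party dependency.

```python
from typing import Dict, Sequence, Tuple


def require_3d(shape: Sequence[int], name: str = "tensor") -> Tuple[int, ...]:
    """Raise a descriptive error unless the shape has exactly three dimensions."""
    if len(shape) != 3:
        raise ValueError(
            f"{name!r} has shape {tuple(shape)}; expected 3 dimensions "
            "(samples, time, space). A 2D array may hold a single sample."
        )
    return tuple(shape)


def h5_dataset_shapes(path: str) -> Dict[str, Tuple[int, ...]]:
    """List the shape of every top-level dataset in an HDF5 file."""
    import h5py  # third-party; only needed when a real file is inspected

    with h5py.File(path, "r") as h5_file:
        return {
            key: h5_file[key].shape
            for key in h5_file.keys()
            if isinstance(h5_file[key], h5py.Dataset)
        }
```

For the file above, require_3d((201, 1024)) raises a ValueError that names the shape, rather than the later IndexError inside the dataset constructor.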
Here is the config that I used for downloading (I didn't modify it):
args:
  filename: 'Advection_beta'
  dataverse_url: 'https://darus.uni-stuttgart.de'
  dataset_id: 'doi:10.18419/darus-2986'
  data_folder: 'data'
Any help would be greatly welcome!!
I'm also getting errors when trying to run the scripts in run_forward_1D.sh.
For instance, running this
CUDA_VISIBLE_DEVICES='2' python3 train_models_forward.py +args=config.yaml ++args.filename='1D_Burgers_Sols_Nu0.001.hdf5' ++args.model_name='FNO'
from the pdebench/models folder gives
In 'config': Could not find 'args/config.yaml'
Available options in 'args':
config_1DCFD
config_2DCFD
config_3DCFD
config_Darcy
config_ReacDiff
config_diff-react
config_diff-sorp
config_pinn_CFD1d
config_pinn_diff-react
config_pinn_diff-sorp
config_pinn_pde1d
config_pinn_swe2d
config_rdb
Config search path:
provider=hydra, path=pkg://hydra.conf
provider=main, path=file:///home/ubuntu/PDEBench/pdebench/models/config
provider=schema, path=structured://
Hi there, could you try removing the +args=config.yaml argument?
I think this is because we already specified config_path="config" and config_name="config" in the code, so it refers to the config.yaml file inside the config directory.
The way we intended the config to be used is by copying the arguments from the provided config file depending on the problem to be solved, and pasting them to the config.yaml file.
Another option would be to change your argument to +args=<specific_config_filename>, where specific_config_filename is e.g. config_1DCFD.yaml
Please let us know if this works.
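The "Available options" list in the error appears to correspond to the YAML files under config/args. A minimal sketch for checking a name before launching is shown below, assuming that layout (config.yaml at the top of the config directory, per-problem files under config/args). The helper names are hypothetical, and the code only inspects the directory; it does not call Hydra.

```python
from pathlib import Path
from typing import List


def available_arg_configs(config_dir: Path) -> List[str]:
    """Return the option names found in <config_dir>/args, without '.yaml'."""
    return sorted(p.stem for p in (config_dir / "args").glob("*.yaml"))


def check_arg_config(config_dir: Path, name: str) -> str:
    """Return the option name without '.yaml', or raise with the valid choices."""
    option = name[: -len(".yaml")] if name.endswith(".yaml") else name
    options = available_arg_configs(config_dir)
    if option not in options:
        raise ValueError(
            f"No args/{option}.yaml under {config_dir}; available: {options}"
        )
    return option
```

Called with the config directory from the error message and the name 'config.yaml', this raises immediately and lists the same options that Hydra reports.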
Hi!
Thanks for the comment. This seems to work for this step, but unfortunately I still run into some errors.
- When running using Burgers, I get a strange
FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = '/home/ubuntu/PDEBench/pdebench/data/1D/Burgers/1D_Burgers_Sols_Nu0.04.hdf5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
although the file is definitely in the correct location.
- When running on DarcyFlow, I get a PyTorch dataloader issue:
Error executing job with overrides: ['++args.filename=1D_Advection_Sols_beta4.0.hdf5', '++args.model_name=FNO', '++args.base_path=/home/ubuntu/PDEBench/pdebench/data/1D/Advection/Train/']
Traceback (most recent call last):
File "/home/ubuntu/PDEBench/pdebench/models/train_models_forward.py", line 169, in main
run_training_FNO(
File "/home/ubuntu/PDEBench/pdebench/models/fno/train.py", line 186, in run_training
for xx, yy, grid in train_loader:
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
data = self._next_data()
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
return self._process_data(data)
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
data.reraise()
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/_utils.py", line 461, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/ubuntu/PDEBench/pdebench/models/fno/utils.py", line 360, in __getitem__
return self.data[idx,...,:self.initial_step,:], self.data[idx], self.grid
RuntimeError: CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
which I haven't been able to debug.
Could you please share which version of PyTorch you tested the library on?
Thanks!
- Could you please share which line of code in which script leads to the error? Could you also recheck that the file is indeed in the specified directory and is not corrupted? Or perhaps you omitted the 'Train' directory when specifying the base path? Try setting the base path to '/home/ubuntu/PDEBench/pdebench/data/1D/Burgers/Train/'.
- Could you try either using the CPU only or setting the number of workers to 1 and see if it works (this should also give a clearer error message)? @mtakamoto-D: could you please check the arguments for DarcyFlow and the version of PyTorch you're using?
Based on our experience, this occurs when training on an RTX 3090 GPU with CUDA 11.3+. Could you please apply the following modification to the corresponding part of train.py (e.g., for FNO, L101--104)?
gen_device = 'cuda'
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size,
                                           num_workers=num_workers, shuffle=True,
                                           generator=torch.Generator(device=gen_device))
val_loader = torch.utils.data.DataLoader(val_data, batch_size=batch_size,
                                         num_workers=num_workers, shuffle=False,
                                         generator=torch.Generator(device=gen_device))
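A CUDA initialization error raised inside a DataLoader worker is commonly seen when the dataset returns tensors that already live on the GPU while the worker processes are created by fork. Besides the generator change above, one alternative is to load in the main process whenever the data is on the GPU. The sketch below is an illustration under that assumption and is not PDEBench's code: pick_num_workers and build_loader are hypothetical helpers, and PyTorch is imported lazily so the first helper can be tried without it.

```python
def pick_num_workers(requested: int, data_on_gpu: bool) -> int:
    """Use in-process loading (0 workers) when samples already live on the GPU."""
    if requested < 0:
        raise ValueError("num_workers must be non-negative")
    return 0 if data_on_gpu else requested


def build_loader(dataset, batch_size: int, requested_workers: int,
                 data_on_gpu: bool, shuffle: bool):
    """Create a DataLoader whose worker count is safe for GPU-resident data."""
    import torch  # third-party; imported lazily

    return torch.utils.data.DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        num_workers=pick_num_workers(requested_workers, data_on_gpu),
    )
```

The other common arrangement is to keep the dataset tensors on the CPU and move each batch to the device inside the training loop, which lets num_workers stay above zero.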
Hi, thanks for the reply.
- You're right, I was omitting the 'Train/' directory. Thanks.
- Both scripts now error out with the GPU initialization error. Stacktrace below. I'm using V100 GPUs with CUDA 11.7 (also tried 11.6).
Error executing job with overrides: ['++args.base_path=/home/ubuntu/PDEBench/pdebench/data/1D/Burgers/Train/', '++args.filename=1D_Burgers_Sols_Nu0.001.hdf5', '++args.model_name=FNO']
Traceback (most recent call last):
File "/home/ubuntu/PDEBench/pdebench/models/train_models_forward.py", line 169, in main
run_training_FNO(
File "/home/ubuntu/PDEBench/pdebench/models/fno/train.py", line 194, in run_training
for xx, yy, grid in train_loader:
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
data = self._next_data()
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
return self._process_data(data)
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
data.reraise()
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/_utils.py", line 461, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/ubuntu/PDEBench/pdebench/models/fno/utils.py", line 360, in __getitem__
return self.data[idx,...,:self.initial_step,:], self.data[idx], self.grid
RuntimeError: CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
In the config file I switched to single_file:True -- using False was giving me this error:
Traceback (most recent call last):
File "/home/ubuntu/PDEBench/pdebench/models/train_models_forward.py", line 246, in <module>
main()
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/main.py", line 90, in decorated_main
_run_hydra(
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/_internal/utils.py", line 389, in _run_hydra
_run_app(
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/_internal/utils.py", line 452, in _run_app
run_and_report(
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/_internal/utils.py", line 296, in run_and_report
raise ex
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/_internal/utils.py", line 213, in run_and_report
return func()
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/_internal/utils.py", line 453, in <lambda>
lambda: hydra.run(
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/home/ubuntu/PDEBench/pdebench/models/train_models_forward.py", line 169, in main
run_training_FNO(
File "/home/ubuntu/PDEBench/pdebench/models/fno/train.py", line 88, in run_training
train_data = FNODatasetMult(flnm,
File "/home/ubuntu/PDEBench/pdebench/models/fno/utils.py", line 387, in __init__
with h5py.File(self.file_path, 'r') as h5_file:
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/h5py/_hl/files.py", line 533, in __init__
fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/h5py/_hl/files.py", line 226, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 106, in h5py.h5f.open
FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = '/home/ubuntu/PDEBench/pdebench/data/1D/Burgers/Train/1D_Burgers_Sols_Nu0.001.hdf5.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
Note the hdf5.h5 in the error.
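The doubled extension suggests that the multi-file loader appends '.h5' to whatever filename it is given. A defensive way to build the path, sketched below with a hypothetical helper name, is to append a suffix only when no HDF5 extension is present and to list the directory contents when the file is missing, which also surfaces a forgotten 'Train/' component.

```python
from pathlib import Path

HDF5_SUFFIXES = (".h5", ".hdf5")


def resolve_data_file(base_path: str, filename: str,
                      default_suffix: str = ".h5") -> Path:
    """Join base_path and filename without doubling the HDF5 extension."""
    has_suffix = filename.endswith(HDF5_SUFFIXES)
    name = filename if has_suffix else filename + default_suffix
    path = Path(base_path) / name
    if not path.is_file():
        folder = path.parent
        # Show what is actually there so a wrong base_path is easy to spot.
        found = sorted(p.name for p in folder.iterdir()) if folder.is_dir() else []
        raise FileNotFoundError(
            f"{path} not found; {folder} contains: "
            f"{found or 'nothing (or the folder does not exist)'}"
        )
    return path
```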
Here's the command I'm running and the config file.
python3 models/train_models_forward.py ++args.base_path=/home/ubuntu/PDEBench/pdebench/data/1D/Burgers/Train/ ++args.filename='1D_Burgers_Sols_Nu0.001.hdf5' ++args.model_name='FNO'
defaults:
  - _self_
  - override hydra/hydra_logging: disabled
  - override hydra/job_logging: disabled
hydra:
  output_subdir: null
  run:
    dir: .
args:
  model_name: 'FNO'
  if_training: True
  continue_training: False
  num_workers: 2
  batch_size: 5
  initial_step: 10
  t_train: 101
  model_update: 10
  single_file: True
  reduced_resolution: 1
  reduced_resolution_t: 1
  reduced_batch: 1
  epochs: 500
  learning_rate: 1.e-3
  scheduler_step: 100
  scheduler_gamma: 0.5
  #Unet
  in_channels: 2
  out_channels: 2
  ar_mode: True
  pushforward: True
  unroll_step: 20
  #FNO
  num_channels: 2
  modes: 12
  width: 20
  #Inverse
  training_type: autoregressive
  #Inverse MCMC
  mcmc_num_samples: 20
  mcmc_warmup_steps: 10
  mcmc_num_chains: 1
  num_samples_max: 1000
  in_channels_hid: 64
  inverse_model_type: InitialConditionInterp
  #Inverse grad
  inverse_epochs: 100
  inverse_learning_rate: 0.2
  inverse_verbose_flag: False
  #Plotting
  plot: False
  channel_plot: 0 # Which channel/variable to be plotted
  x_min: -1
  x_max: 1
  y_min: -1
  y_max: 1
  t_min: 0
  t_max: 5
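For reference, the command above can also be assembled programmatically, which avoids shell-quoting mistakes if a value ever contains special characters. This is a sketch with a hypothetical helper; it only builds the argument list using the '++args.<key>=<value>' override form already used in this thread and does not invoke Hydra or start training.

```python
import shlex
from typing import Dict, List


def forward_training_command(script: str, overrides: Dict[str, str]) -> List[str]:
    """Build the argv list for the training script with ++args.* overrides."""
    return ["python3", script] + [
        f"++args.{key}={value}" for key, value in overrides.items()
    ]


cmd = forward_training_command(
    "models/train_models_forward.py",
    {
        "base_path": "/home/ubuntu/PDEBench/pdebench/data/1D/Burgers/Train/",
        "filename": "1D_Burgers_Sols_Nu0.001.hdf5",
        "model_name": "FNO",
    },
)
print(shlex.join(cmd))  # shell-safe rendering of the same command
```

The list form can be passed directly to subprocess.run, while shlex.join gives a string suitable for pasting into a shell script such as run_forward_1D.sh.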
Hi there,
Yes, the single_file argument should be set to True for the Burgers dataset. I assume that line 184 in the fno/utils.py script would give an error since the file type is hdf5 and not h5. My suspicion is that the error is caused by a dimension mismatch, because the arguments for the FNO setup (num_channels, modes, and width) might be wrong.
@mtakamoto-D Could you please provide the config arguments for the Burgers dataset and change the assert statement to accept also hdf5 file (not only h5)?
@timothypraditia It seems I added the assert statement so that multi-file data in .h5 format would not be confused with my data, which uses .hdf5. For user-friendliness, that assert statement should be removed. If you agree, I will comment it out.
@mtakamoto-D Ah I see, then I think the assert command should be fine. Could you please also provide the config file for the Burgers dataset training?
@GeoffNN: Makoto will provide the config files as soon as possible. In the meantime, could you try to copy the arguments in the file config_ReacDiff.yaml to the arguments in the file config.yaml and then try to run it?
Hi there. I have uploaded the missing config file for the 1D Advection/Burgers equations, along with a modification of run_forward_1D.sh. Could you please update your local repository to the latest version?
Hello,
I haven't had the bandwidth recently to try; I'll keep you posted as soon as I do. Happy holidays!