[Bug] HDF5 Output issues
Describe the bug
Around the end of May, I pulled the newest GEOSX code. Since then, I have had an HDF5 output issue: every run reports the same error at the same simulation time. I suspect it is due to the upgrade of the HDF5 library version.
To Reproduce
Expected behavior
If I don't output HDF5, there is no issue. But if I output HDF5, the simulation gets stuck due to the HDF5 output errors below.
Screenshots:
The same traceback is printed, interleaved, by many MPI ranks (14, 18, 19, 24, 25, 26, 37, 41, 49, 57, 67, 75, 80, 81, 89, 102, 106, 113, 121, ...). One complete trace from a single rank:

HDF5-DIAG: Error detected in HDF5 (1.12.1) MPI-process 81:
  #000: ../../hdf5/src/H5Dio.c line 291 in H5Dwrite(): can't write data
    major: Dataset
    minor: Write failed
  #001: ../../hdf5/src/H5VLcallback.c line 2113 in H5VL_dataset_write(): dataset write failed
    major: Virtual Object Layer
    minor: Write failed
  #002: ../../hdf5/src/H5VLcallback.c line 2080 in H5VL__dataset_write(): dataset write failed
    major: Virtual Object Layer
    minor: Write failed
  #003: ../../hdf5/src/H5VLnative_dataset.c line 200 in H5VL__native_dataset_write(): could not get a validated dataspace from file_space_id
    major: Invalid arguments to routine
    minor: Bad value
  #004: ../../hdf5/src/H5S.c line 266 in H5S_get_validated_dataspace(): selection + offset not within extent
    major: Dataspace
    minor: Out of range
Platform (please complete the following information):
- Machine: Stanford Sherlock
- Compiler: gcc/10.1.0
- GEOSX Version: 0.2.0 (develop, sha1: 233f16160)
Additional context
It seems that this problem can only be reproduced on the Stanford Sherlock cluster.
@Yifu93 A similar issue has been reported; please check the solution here (https://github.com/GEOSX/GEOSX/issues/1427).
@Yifu93 A similar issue has been reported; please check the solution here (#1427). @jhuang2601 Thank you. I tried that solution before, but it does not solve my problem. I have included more complete error logs above; the error message is slightly different.
Actually, I downloaded the HDF5 output to my local environment and tried to open it, but I have problems reading the phaseVolumeFraction element-center data. It raises: OSError: can't read data (wrong b-tree signature). From what I found online, this seems related to writing data from multiple processes. I'm afraid this is an environment issue.
@Yifu93 A similar issue has been reported; please check the solution here (#1427).
I deleted the code repo, re-cloned it, and built GEOSX from scratch to ensure the environment was not contaminated, but I still get the same error.
@Yifu93 Are you able to run a small HDF5 test case in parallel using MPI-IO (i.e., outside of GEOSX)?
For example, https://docs.h5py.org/en/stable/mpi.html if you want to use Python, but you can surely find resources in pure C/C++ out there.
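For reference, here is a minimal standalone parallel-write test in C (a sketch, not GEOSX code; the file name, dataset name, and one-int-per-rank layout are made up for illustration). It needs an HDF5 build with parallel support and should be launched with `mpirun`:

```c
#include <hdf5.h>
#include <mpi.h>

/* Minimal parallel HDF5 write test: each rank writes one integer
   into a shared 1-D dataset through the MPI-IO file driver. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Open the file with the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("mpio_test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One element per rank. */
    hsize_t dims[1] = { (hsize_t)size };
    hid_t filespace = H5Screate_simple(1, dims, NULL);
    hid_t dset = H5Dcreate2(file, "ranks", H5T_NATIVE_INT, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each rank selects and writes its own element. */
    hsize_t start[1] = { (hsize_t)rank }, count[1] = { 1 };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(1, count, NULL);
    int value = rank;
    H5Dwrite(dset, H5T_NATIVE_INT, memspace, filespace, H5P_DEFAULT, &value);

    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Pclose(fapl);
    H5Fclose(file);
    MPI_Finalize();
    return 0;
}
```

If that runs cleanly on Sherlock, the HDF5/MPI-IO stack itself is probably fine.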
@TotoGaz I did some tests. Previously, I output both pressure and phase volume fraction time-history data. If I only output pressure, there is no issue; but if I output phase volume fraction (only), I get these errors. Phase volume fraction has one more dimension than pressure.
@wrtobin Does that ring any bell to you? Is there anything tricky we do to write arrays in HDF5?
@wrtobin @TotoGaz For the phase volume fraction output, say we want to output at 10 specific times. We can successfully output phase volume fraction at times 1 and 2, but it always (every time) fails when writing at time 3, with the same error message. Based on the VTK output, though, there are no issues with the phase volume fraction results themselves.
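For what it's worth, the #004 frame in the log ("selection + offset not within extent") is the generic HDF5 diagnostic for a hyperslab selected past the dataset's current extent, which would match a time-history dataset running out of room at a given step. A minimal serial sketch (made-up file/dataset names, not GEOSX code) that reproduces the same error stack:

```c
#include <hdf5.h>

/* Reproduces the "#004: selection + offset not within extent" stack:
   the hyperslab selection itself is accepted, but H5Dwrite fails
   because step index 3 lies past the dataset's 2-step extent. */
int main(void)
{
    hid_t file = H5Fcreate("extent_test.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, H5P_DEFAULT);

    hsize_t dims[2] = { 2, 8 };  /* room for 2 time steps of 8 values */
    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate2(file, "history", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t start[2] = { 3, 0 };  /* step index 3: past the extent */
    hsize_t count[2] = { 1, 8 };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);

    double values[8] = { 0.0 };
    /* Fails and prints an HDF5-DIAG stack ending in
       "selection + offset not within extent", as in the report above,
       unless the dataset is grown first (e.g. via H5Dset_extent on a
       chunked dataset). */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace,
             H5P_DEFAULT, values);

    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}
```

So it may be worth checking how the time-history dataset's extent is grown between output steps for the multi-dimensional (per-phase) fields.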
Does that only fail in parallel?
@TotoGaz The model has 1.28 million grid blocks and I used 128 cores; I can't test it in serial.
Maybe you could change the mesh? The InternalMesh feature lets us define a really small mesh.
Yeah, anything that can be done to reduce the mesh size but still replicate the error would be beneficial. Debugging this sort of issue at scale, and without the input files describing the problem, is almost intractable, and unless this issue only occurs at scale, that's something best avoided.
Is this issue resolved?
Hello, I am looking at the HDF5 outputs.
I am trying to find out why colleagues of mine have periodic problems with their HDF5 outputs.
I am wondering why the HDF5 data is written in independent mode (H5P_DEFAULT ~ H5FD_MPIO_INDEPENDENT) rather than in collective mode (H5FD_MPIO_COLLECTIVE)?
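For concreteness, this is the switch being asked about: instead of passing H5P_DEFAULT (which implies H5FD_MPIO_INDEPENDENT) as the transfer property list, a dataset transfer property list can request collective I/O. A sketch with a hypothetical helper (`write_collective` is not a GEOSX function; the dataset/dataspace handles are assumed to exist):

```c
#include <hdf5.h>

/* Hypothetical helper: perform an H5Dwrite with collective MPI-IO.
   Note that in collective mode every rank in the communicator must
   make this call, even ranks whose selection is empty. */
static herr_t write_collective(hid_t dset, hid_t memspace, hid_t filespace,
                               const double *buffer)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    herr_t status = H5Dwrite(dset, H5T_NATIVE_DOUBLE,
                             memspace, filespace, dxpl, buffer);
    H5Pclose(dxpl);
    return status;
}
```

Collective mode lets the MPI-IO layer aggregate writes, which often performs better on parallel file systems, but it does impose the everyone-must-call requirement.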