Reading HDF5 files with ak.load does not scale
Something about reading HDF5 files via ak.load appears to be serialized and does not scale as the number of locales increases. Writing on the other hand does appear to scale.
Here's some preliminary data from the 16-node-cs-hdr machine we run some nightly testing on (rough specs listed in https://chapel-lang.org/perf/arkouda/). This is for the default problem size (3/4 GiB per locale) running on lustre backed by hard drives (not flash/nvme):
| locales | read | write |
|---|---|---|
| 1 | 3.82 GiB/s | 0.74 GiB/s |
| 2 | 1.92 GiB/s | 1.56 GiB/s |
| 4 | 2.52 GiB/s | 3.57 GiB/s |
| 8 | 1.85 GiB/s | 5.10 GiB/s |
| 16 | 1.96 GiB/s | 11.10 GiB/s |
| 32 | 1.85 GiB/s | 23.07 GiB/s |
I have no idea what this system is supposed to be capable of in terms of read/write speeds. Write scaling looks good (performance roughly doubles when we double the number of locales) but something about reads seems to be serialized.
We see this on a bigger Cray XC as well. It's also on Lustre, though I'm not sure how it's provisioned. Regardless, the scaling behavior is similar, though writes start to fall off at 64 nodes. I'm not sure yet whether we're hitting a hardware limit with this particular system or whether this is a bottleneck on the software side.
| locales | read | write |
|---|---|---|
| 1 | 2.59 GiB/s | 1.00 GiB/s |
| 2 | 1.82 GiB/s | 1.67 GiB/s |
| 4 | 2.02 GiB/s | 3.27 GiB/s |
| 8 | 2.52 GiB/s | 6.46 GiB/s |
| 16 | 2.51 GiB/s | 12.00 GiB/s |
| 32 | 2.69 GiB/s | 21.15 GiB/s |
| 64 | 2.68 GiB/s | 18.10 GiB/s |
| 128 | 2.82 GiB/s | 22.49 GiB/s |
| 256 | 3.33 GiB/s | 27.65 GiB/s |
FYI @mhmerrill @reuster986. I expect to start looking into this more in the near-ish term, but wanted to share preliminary results.
I think the current scheme writes out an hdf5 file per locale and there is support for reading back in with a different number of locales. I assume you care about being able to write and then read back in with a different number of locales, but is the file-per-locale important, or just an implementation detail? Would you be just as happy if one large file was created?
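To make sure we're talking about the same thing, the file-per-locale scheme I have in mind is roughly the following sketch (illustrative only, not Arkouda's actual write1DDistArray code; the file names and the elided dataset-write step are hypothetical):

```chapel
use BlockDist, HDF5;

config const size = 10**6;
var A = newBlockArr({1..size*numLocales}, int);

// Each locale creates its own file and writes only its local block of the
// distributed array (the dataset-write call itself is elided here).
coforall loc in Locales do on loc {
  const fname = "test_LOCALE" + loc.id:string + ".hdf";  // hypothetical name
  var fid = C_HDF5.H5Fcreate(fname.c_str(), C_HDF5.H5F_ACC_TRUNC,
                             C_HDF5.H5P_DEFAULT, C_HDF5.H5P_DEFAULT);
  // ... write A[A.localSubdomain()] as a dataset, e.g. via the H5LT
  //     "lite" routines ...
  C_HDF5.H5Fclose(fid);
}
```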
@ronawho Thanks for looking into this. In our applications, we care much more about reading than writing, unfortunately. Most of our datasets come in as hundreds or even thousands of medium-size files (MB to GB), so ideally our tests would target that scenario for reading, but it is difficult to generate test data like that, because arkouda currently only writes one file per locale. You're right that we would like to be able to write with one number of locales and read back in with a different number.
With the version of HDF5 that Chapel uses, I believe the parallelism is limited by the number of files being read, so I think having one large file would be serial. I'm far from an expert, so there might be a way to do single-file parallel reads in Chapel/HDF5, but I don't know that GenSymIO.chpl is aware of that functionality, if it exists.
I'm also far from an expert in HDF5, but I believe there are ways to do parallel reads/writes to a single file. We have some functionality to do that, but the writes currently require MPI, and I don't think that's an acceptable dependency.
Here's a snippet of code that uses that functionality, just for a performance comparison (this is on the big XC; I have not been able to get the MPI and hdf5-parallel dependencies set up on an IB cluster):
```chapel
use BlockDist, Time, HDF5, HDF5.IOusingMPI;

config const size = 10**8;
var A = newBlockArr({1..size*numLocales}, int);

var t: Timer; t.start();
proc markTimer(s) {
  var GiB = (A.size*numBytes(A.eltType)):real / (2**30):real;
  writef("%5s: %.2dr GiB/s (%.2drs)\n", s, GiB / t.elapsed(), t.elapsed());
  t.clear();
}

hdf5WriteDistributedArray(A, "test.h5", "dset");
markTimer("write");
hdf5ReadDistributedArray(A, "test.h5", "dset");
markTimer("read");
```
```bash
# load dependencies
module load cray-mpich
module load cray-hdf5-parallel
export MPICH_MAX_THREAD_SAFETY=multiple
```
| locales | read | write |
|---|---|---|
| 1 | 3.21 GiB/s | 0.46 GiB/s |
| 2 | 1.02 GiB/s | 0.72 GiB/s |
| 4 | 1.91 GiB/s | 0.73 GiB/s |
| 8 | 3.79 GiB/s | 0.78 GiB/s |
| 16 | 5.65 GiB/s | 0.77 GiB/s |
| 32 | 7.79 GiB/s | 0.78 GiB/s |
So in this case our hdf5{Write,Read}DistributedArray support has the inverse problem: reads scale, but writes don't. I don't know that there's anything deep to take away here beyond that the scaling issues are likely a software/HDF5-usage issue and nothing inherent to the hardware.
Anyway, I'll keep looking at this in the background. It's probably obvious by now, but this is not something I'm familiar with, so I expect to have a pretty steep learning curve, and I'm hoping to bring in other people who have more expertise with HDF5 to assist.
There's definitely something weird going on. I created some standalone benchmarks, one that calls write1DDistArray()/read_files_into_distributed_array() directly, and another that calls the msg wrappers tohdfMsg()/readAllHdfMsg(): https://github.com/mhmerrill/arkouda/compare/master...ronawho:IO-speed-test
Here's the performance for write1DDistArray()/read_files_into_distributed_array(), tohdfMsg()/readAllHdfMsg(), and the full IO.py benchmark on 16-node-cs-hdr:
```
./test-bin/IOSpeedTest -nl 16
> write 9.16 GiB/s (1.30s)
> read 16.51 GiB/s (0.72s)

./test-bin/IOSpeedTestMsg -nl 16
> write: 2.70 GiB/s (4.42s)
> read: 1.67 GiB/s (7.15s)

./benchmarks/run_benchmarks.py IO --trials=1 -nl 16
> write: 11.56 GiB/s (1.03s)
> read: 1.7912 GiB/s (6.66s)
```
Just doing a read and not a write for the msg variant:
```
./test-bin/IOSpeedTestMsg -nl 16
> write: 2.70 GiB/s (4.42s)
> read: 1.67 GiB/s (7.15s)

./test-bin/IOSpeedTestMsg -nl 16 --write=false
> read: 45.39 GiB/s (0.26s)
```
And adding some subtimers to readAllHdfMsg for the msg variant:
```
./test-bin/IOSpeedTestMsg -nl 16
write: 2.61 GiB/s (4.57s)
readAllHdfMsg get_dtype: 5.97s
readAllHdfMsg read_files_into_distributed_array: 0.52s
read: 1.80 GiB/s (6.61s)
```
So what I'm seeing here is that directly calling write1DDistArray()/read_files_into_distributed_array() performs and scales well. Calling the msg wrappers adds a lot of overhead for reads, but most of that time is spent in auxiliary setup (get_dtype), not the actual read call. And if we skip the write and just do a read, performance is much better for the msg wrapper variant.
To me this indicates that we're using some part of the HDF5 library incorrectly. Maybe we're leaving too many files open and increasing memory usage, or something along those lines. The current overhead in get_dtype is from a C_HDF5.H5Fis_hdf5 call, but commenting that out just moves the overhead to the C_HDF5.H5Fopen call, so it seems like it's something about our usage of the HDF5 library that's causing the issues.
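As a sanity check on that theory, here's a rough micro-benchmark sketch (the "test" prefix and per-locale file names are made up) that times only the H5Fis_hdf5 check plus an H5Fopen/H5Fclose pair for each per-locale file, with no dataset reads at all:

```chapel
use HDF5, Time;

config const prefix = "test";   // hypothetical file prefix
var t: Timer; t.start();
// Open and immediately close each locale's file to measure just the
// metadata/open cost, separately from any data movement.
coforall loc in Locales do on loc {
  const fname = prefix + "_LOCALE" + loc.id:string + ".hdf";
  if C_HDF5.H5Fis_hdf5(fname.c_str()) > 0 {
    var fid = C_HDF5.H5Fopen(fname.c_str(), C_HDF5.H5F_ACC_RDONLY,
                             C_HDF5.H5P_DEFAULT);
    C_HDF5.H5Fclose(fid);
  }
}
writeln("open/close time: ", t.elapsed(), " s");
```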
I'll keep digging, just posting a status update.
Some misc HDF5 perf links that may be helpful in the future:
- https://support.hdfgroup.org/HDF5/faq/perfissues.html
- https://portal.hdfgroup.org/display/knowledge/How+to+improve+performance+with+Parallel+HDF5
- https://portal.hdfgroup.org/display/knowledge/Linux+memory+handling+and+performance
@ronawho excellent sleuthing, and thank you for the update! I have run into a lot of gotchas with HDF5, but I never considered that the setup/metadata operations might be the bottleneck.
Here's another discussion of HDF5 performance issues that focuses on file open/close:
https://www.hdfgroup.org/wp-content/uploads/2018/04/avoid_truncate_white_paper_180219.pdf
Section 2 ("Slow File Open") looks promising, and they have a fix for it, but frustratingly they don't give much detail on the fix or indicate whether it has been incorporated into a more recent version of the HDF5 library.
I'm becoming more and more convinced that the overhead of file open is hurting us in a few of our important use cases, so it would be great if we could find someone familiar with HDF5 on parallel filesystems to look into this. I will try to find some expertise on our side.
Huh, that's an interesting paper. I have some experimental changes that improved read performance by spreading the opens across locales (but it was serial spreading, so I couldn't understand why it impacted performance).
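Roughly, by "serial spreading" I mean something like the following sketch (illustrative only, not the actual experimental diff; the file names are hypothetical): the opens are handed out round-robin across locales, but still issued one at a time from a serial loop.

```chapel
use HDF5;

// Hypothetical per-locale file names.
const filenames = [i in 0..#numLocales] "test_LOCALE" + i:string + ".hdf";

for (fname, i) in zip(filenames, 0..) {
  on Locales[i % numLocales] {   // spread the opens across locales...
    var fid = C_HDF5.H5Fopen(fname.c_str(), C_HDF5.H5F_ACC_RDONLY,
                             C_HDF5.H5P_DEFAULT);
    // ... query metadata / read this file's chunk here ...
    C_HDF5.H5Fclose(fid);
  }                              // ...but only one file at a time (serial)
}
```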
Let me open PRs for a few of the more principled changes I prototyped, and then I'll open a draft PR to discuss that effort (and maybe have you perf-test on real data).
I know the comments and work on this issue are a bit old, but a recent run-in with a separate testing issue makes me want to ask:
- What is the output of `ulimit -a` on a typical node?
@ronawho can we close this yet?
This isn't done and I haven't worked on it in a long time. That said, @bmcdonald3 has been looking at this occasionally as a background task, so I'd be inclined to leave it open for now.