LouisDDN
Interested in this PR too.
I don't think it makes sense, as the goal is to reproduce a workload, not to get maximum throughput out of the storage.
Hi, any update on this? AFAIK this is blocking for cosmo & resnet50 for MLPerfStorage v1.0.
As this PR appears to be updating the rules for v2.0, there was a recent discussion in the checkpointing subgroup about model sizes. The table below shows the memory requirements...
> > As this PR appears to be updating the rules for v2.0, there was a recent discussion in the checkpointing subgroup about model sizes. The table below shows the...
> This should be addressed already with the new PR [#278](https://github.com/argonne-lcf/dlio_benchmark/pull/278). Could you please check? [@LouisDDN](https://github.com/LouisDDN)

I tried this PR. The read performance is back to normal (not 50 ...
The issue actually persists for read if I use LLAMA8B Zero3 and 8 MPI processes. My node has 2 TB of RAM, and the 10 steps of write are just 1 TB.
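For context, a rough back-of-the-envelope estimate of where that ~1 TB comes from (assuming bf16 weights plus fp32 Adam moments and master weights, so roughly 14 bytes per parameter; the actual checkpoint layout may differ):

```python
# Rough estimate only; actual per-parameter bytes depend on the checkpoint layout.
params = 8e9            # assumed LLAMA8B parameter count
bytes_per_param = 14    # assumed: 2 (bf16 weights) + 4 + 4 + 4 (fp32 Adam m, v, master weights)
steps = 10
ranks = 8               # ZeRO-3 shards the state across the 8 MPI processes

per_step_bytes = params * bytes_per_param   # ~112 GB per checkpoint step (all ranks combined)
total_bytes = per_step_bytes * steps        # ~1.1 TB over 10 steps
per_rank_step = per_step_bytes / ranks      # ~14 GB written by each rank per step

print(f"per step: {per_step_bytes / 1e9:.0f} GB, "
      f"total: {total_bytes / 1e12:.2f} TB, "
      f"per rank per step: {per_rank_step / 1e9:.0f} GB")
```

Under those assumptions the ten checkpoint steps come out to roughly 1.1 TB in total, which is consistent with the "just 1 TB" above and fits inside the node's 2 TB of RAM.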
@zhenghh04 I am working on a new patch as an alternative to hash consing, based on Friday’s discussion. It will increase startup time but fully preserve the original I/O pattern...
> > @zhenghh04 I am working on a new patch as an alternative to hash consing, based on Friday’s discussion. It will increase startup time but fully preserve the original...
@zhenghh04, I just pushed the replacement for hash consing for the write operation. It is in this PR. I closed the two other PRs for simplicity. The algorithm is the...
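For anyone skimming the thread, here is a purely illustrative sketch of the tradeoff being discussed, not the actual patch (all names below are made up): instead of hash-consing identical segments into one shared buffer, a distinct random buffer is generated for every segment at startup, which makes startup slower but keeps each written segment unique so the on-disk write pattern matches the original workload.

```python
import numpy as np

def make_buffers_shared(n_segments, segment_bytes, seed=0):
    """Hash-consing-style approach: one buffer generated once and reused for
    every segment. Fast startup, but every write hits identical bytes."""
    rng = np.random.default_rng(seed)
    shared = rng.integers(0, 256, size=segment_bytes, dtype=np.uint8)
    return [shared] * n_segments  # all entries alias the same array

def make_buffers_unique(n_segments, segment_bytes, seed=0):
    """Alternative approach: a distinct random buffer per segment, generated
    up front. Startup is slower, but each write is unique, so the write
    pattern is closer to a real checkpoint."""
    rng = np.random.default_rng(seed)
    return [rng.integers(0, 256, size=segment_bytes, dtype=np.uint8)
            for _ in range(n_segments)]

def write_checkpoint(path, buffers):
    """Write every segment sequentially to a single file."""
    with open(path, "wb") as f:
        for buf in buffers:
            f.write(buf.tobytes())

if __name__ == "__main__":
    # Tiny sizes purely for illustration.
    bufs = make_buffers_unique(n_segments=16, segment_bytes=1 << 20)
    write_checkpoint("ckpt_step0.bin", bufs)
```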