LouisDDN
Interested in this PR too.
I don't think it makes sense, as the goal is to reproduce a workload, not to get maximum throughput out of the storage.
Hi, any update on this? AFAIK this is blocking for cosmo & resnet50 for MLPerfStorage v1.0.
As this PR appears to be updating the rules for v2.0, there was a recent discussion in the checkpointing subgroup about model sizes. The table below shows the memory requirements...
> > As this PR appears to be updating the rules for v2.0, there was a recent discussion in the checkpointing subgroup about model sizes. The table below shows the...
> This should be addressed already with the new PR [#278](https://github.com/argonne-lcf/dlio_benchmark/pull/278). Could you please check? [@LouisDDN](https://github.com/LouisDDN)

I tried this PR. The read performance is back to normal (not 50 ...
The issue actually persists for read if I use LLAMA8B Zero3 and 8 MPI processes. My node has 2 TB of RAM, and the 10 steps of write are just 1 TB.
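For context, a rough back-of-the-envelope estimate of where that ~1 TB comes from (assuming bf16 weights plus fp32 Adam moments and master weights, so roughly 14 bytes per parameter; the actual checkpoint layout may differ):

```python
# Rough estimate only; actual per-parameter bytes depend on the checkpoint layout.
params = 8e9            # assumed LLAMA8B parameter count
bytes_per_param = 14    # assumed: 2 (bf16 weights) + 4 + 4 + 4 (fp32 Adam m, v, master weights)
steps = 10
ranks = 8               # ZeRO-3 shards the state across the 8 MPI processes

per_step_bytes = params * bytes_per_param   # ~112 GB per checkpoint step (all ranks combined)
total_bytes = per_step_bytes * steps        # ~1.1 TB over 10 steps
per_rank_step = per_step_bytes / ranks      # ~14 GB written by each rank per step

print(f"per step: {per_step_bytes / 1e9:.0f} GB, "
      f"total: {total_bytes / 1e12:.2f} TB, "
      f"per rank per step: {per_rank_step / 1e9:.0f} GB")
```

Under those assumptions the ten checkpoint steps come out to roughly 1.1 TB in total, which is consistent with the "just 1 TB" above and fits inside the node's 2 TB of RAM.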
@zhenghh04 I am working on a new patch as an alternative to hash consing, based on Friday’s discussion. It will increase startup time but fully preserve the original I/O pattern...
> > @zhenghh04 I am working on a new patch as an alternative to hash consing, based on Friday’s discussion. It will increase startup time but fully preserve the original...
@zhenghh04, I just pushed the replacement for hash consing for the write operation. It is in this PR. I closed the two other PRs for simplicity. The algorithm is the...
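For anyone skimming the thread, here is a purely illustrative sketch of the tradeoff being discussed, not the actual patch (all names below are made up): instead of hash-consing identical segments into one shared buffer, a distinct random buffer is generated for every segment at startup, which makes startup slower but keeps each written segment unique so the on-disk write pattern matches the original workload.

```python
import numpy as np

def make_buffers_shared(n_segments, segment_bytes, seed=0):
    """Hash-consing-style approach: one buffer generated once and reused for
    every segment. Fast startup, but every write hits identical bytes."""
    rng = np.random.default_rng(seed)
    shared = rng.integers(0, 256, size=segment_bytes, dtype=np.uint8)
    return [shared] * n_segments  # all entries alias the same array

def make_buffers_unique(n_segments, segment_bytes, seed=0):
    """Alternative approach: a distinct random buffer per segment, generated
    up front. Startup is slower, but each write is unique, so the write
    pattern is closer to a real checkpoint."""
    rng = np.random.default_rng(seed)
    return [rng.integers(0, 256, size=segment_bytes, dtype=np.uint8)
            for _ in range(n_segments)]

def write_checkpoint(path, buffers):
    """Write every segment sequentially to a single file."""
    with open(path, "wb") as f:
        for buf in buffers:
            f.write(buf.tobytes())

if __name__ == "__main__":
    # Tiny sizes purely for illustration.
    bufs = make_buffers_unique(n_segments=16, segment_bytes=1 << 20)
    write_checkpoint("ckpt_step0.bin", bufs)
```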