Isaac-coderpro

Results 9 comments of Isaac-coderpro

> Hi Teams,
>
> I have run the **default training script** with the following changes based on the results table **1. GLOBAL_BATCH_SIZE=16384 2. WORLD_SIZE=4 (4 A100 40GB GPUs)**...
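For reference, the settings quoted above work out to 4096 samples per GPU if the global batch is split evenly across ranks. That even split is an assumption; the quoted text does not say how the script divides the batch. The helper name `per_rank_batch_size` is hypothetical and only illustrates the arithmetic:

```python
def per_rank_batch_size(global_batch_size: int, world_size: int) -> int:
    """Return the batch size each rank would see under an even split.

    Assumption: the global batch is divided equally across all ranks.
    """
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    if global_batch_size % world_size != 0:
        raise ValueError("global_batch_size must be divisible by world_size")
    return global_batch_size // world_size


# Values taken from the quoted comment: GLOBAL_BATCH_SIZE=16384, WORLD_SIZE=4.
local_batch = per_rank_batch_size(16384, 4)
print(local_batch)
```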

I get the same AUC as you. Please let me know what the problem is.

No, I ran on a 3090, and I cut a piece of day23 to verify the AUC and got 63%. Further, I can't see any dense weights loaded. Can you? ---Original--- From: "Arjun ***@***.***> Date: Tue, Apr 23, 2024 21:16...

> > 1008GB RAM and 80GB x 8 GPU (NVIDIA A800)

Did you fix the OOM issues?

> First issue - GPU Dockerfile hasn't been fixed since I brought it up in [!1373](https://github.com/mlcommons/inference/pull/1373). Had to replace it with this one I left in the comments: [#1373...

Yes, I did. Will this have an effect? ---Original--- From: "Arjun ***@***.***> Date: Mon, Jun 17, 2024 19:36 PM To: ***@***.***>; Cc: ***@***.******@***.***>; Subject: Re: [mlcommons/inference] DLRMv2 GPU Reference Implementation crashes with BusError...

> That should rule out any permission issue coming from the docker launch. Too many open files is usually due to a low setting for `ulimit -n`.
>
> Meanwhile...
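As a minimal sketch of the `ulimit -n` point in the quote above, the open-file limit can also be inspected and raised from inside a Python process with the standard `resource` module (Unix only). The helper name `raise_open_file_limit` and the target of 65536 are hypothetical choices for illustration, not something the thread prescribes:

```python
import resource
from typing import Tuple


def raise_open_file_limit(target: int) -> Tuple[int, int]:
    """Try to raise the soft RLIMIT_NOFILE toward `target`.

    Returns the (soft, hard) limits in effect afterwards. The soft limit is
    never lowered, and it is never raised above the hard limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft, hard  # already unlimited, nothing to do

    # An unprivileged process may only raise the soft limit up to the hard limit.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if new_soft <= soft:
        return soft, hard  # current limit is already at least as high

    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        # Some platforms reject large values; keep the existing limit.
        pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("after: ", raise_open_file_limit(65536))  # 65536 is an arbitrary example
```

Inside a container the hard limit is set by the container runtime, so if the hard limit itself is too low this sketch will not help and the limit has to be raised where the container is launched.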

Besides

> That should rule out any permission issue coming from the docker launch. Too many open files is usually due to a low setting for `ulimit -n`.
>
> ...