Isaac-coderpro
How can I fine-tune with bounds of datasets?
> Hi Teams, > > I have run the **default training script** with the following changes based on the results table **1. GLOBAL_BATCH_SIZE=16384 2. WORLD_SIZE=4 (4 A100 40GB GPUs)**...
I get the same AUC as you. Can you let me know what the problem is?
No, I ran on a 3090, and I cut a piece of day23 to verify the AUC and got 63%. Further, I can't see any dense weights being loaded. Can you?
> > 1008GB RAM and 80GB x 8 GPU (NVIDIA A800)

Did you fix the OOM issues?
> First issue - GPU Dockerfile hasn't been fixed since I brought it up in [#1373](https://github.com/mlcommons/inference/pull/1373). Had to replace it with this one I left in the comments: [#1373...
Yes, I did. Will this have an effect?
> That should rule out any permission issue coming from the docker launch. Too many open files is usually due to a low setting for `ulimit -n`. > > Meanwhile...
Besides > That should rule out any permission issue coming from the docker launch. Too many open files is usually due to a low setting for `ulimit -n`. > >...
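
For anyone hitting the "Too many open files" error mentioned in the quote: `ulimit -n` reports the soft limit on open file descriptors, and a process can inspect or raise its own soft limit (up to the hard limit) before spawning data-loader workers. Below is a minimal sketch using Python's standard `resource` module (Unix only). The helper name `raise_open_file_limit` and the target of 65536 are illustrative assumptions, not something taken from the reference implementation.

```python
import resource


def raise_open_file_limit(target):
    """Try to raise the soft open-file limit toward `target`.

    Hypothetical helper for illustration. Returns the (soft, hard) pair
    in effect afterwards. The soft limit is never lowered.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # An unprivileged process cannot exceed the hard limit, so cap the request.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)

    # Nothing to do if the limit is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or new_soft <= soft:
        return soft, hard

    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        # Some systems reject values above a kernel-wide cap; keep the old limit.
        pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    # 65536 is an assumed target; pick a value suited to your workload.
    print(raise_open_file_limit(65536))
```

If the hard limit inside the container is itself too low, the process cannot raise it; in that case the limit has to be set at launch, for example with Docker's `--ulimit nofile=<soft>:<hard>` option.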