Yifei Hu

Results 15 comments of Yifei Hu

Hi, is there any update on the training code?

@soloice It's been four years now... I just spent a whole afternoon learning the wrong version of the code...

I just figured it out. I followed the code in your blog > https://oldpan.me/archives/pytorch-gpu-memory-usage-track After comparing that code with the code in the GitHub README.md, I changed ``` frame =...

I'm hitting the same problem, with exactly the same symptoms as the original poster. Has the cause been identified?

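> > > > I'm hitting the same problem, with exactly the same symptoms as the original poster. Has the cause been identified? > > I'm currently using the WP Githuber MD plugin > and haven't found any problems with it so far. I updated to the latest version, 10.2.1, and the problem went away as well. It was probably a bug that has already been fixed.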

Same problem here: the $ symbol also fails to display in bash code.

Hi @hassanhub, sorry for the late reply, and big thanks to @adaniefei for providing more information. You can find my fork [here](https://github.com/Adam-fei/google-research/tree/debug) in the `debug` branch. I'll try to...

Hi. A big thank-you to @jalayrac. The [hmdb example of dmvr](https://github.com/deepmind/dmvr/blob/master/examples/linear_mmv_hmdb.py) runs successfully. However, despite using the new version of tfrecord, vatt still outputs the following error: I was running...

> NCCL WARN socketProgressOpt: Call to recv from 192.168.1.43 failed : Broken pipe. [rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature...

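> For 500 GB of data, training on 8 GPUs doesn't seem like enough; I'd estimate it would take several months or more. > > The error message looks like an out-of-memory error. I'm only getting the pipeline to run end to end for now; the real training will run on multiple machines. It's just that in the multi-node, multi-GPU setup the error appeared at the same location and with the same content as on a single machine, so I debugged on a single machine for convenience. If the error is an out-of-memory error, that confirms my guess. Does this framework load all the data into memory at once? Is there a more elegant way to load the data? Thanks a lot :)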