Haichen Huang
Hi @rohitgr7, thank you for your review. I've fixed the import issue. As for the docs, we've written the docstring in `ColossalAIStrategy`. We don't know what else we should do...
Hi @rohitgr7, after our discussion, we decided to add the docs in a separate PR. Could we merge this PR first?
Hi @rohitgr7, the package link has been added, and `HybridAdam` can be imported after installation. However, I still have a problem in my local tests. I passed all my tests...
Lightning requires those specific package versions. Installing a mismatched version may cause compatibility problems.
Hi @yli1994, thanks for raising this issue. I've already updated `environment.yaml`.
Hi @arpowers, it seems that either the `pytorch-lightning` package is not installed or your installed version is lower than 1.8.0. You can check [here](https://github.com/hpcaitech/ColossalAI/blob/main/examples/images/diffusion/environment.yaml) for the requirements.
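To check this quickly, here is a minimal sketch using only the standard library. The helper names `installed_version`, `release_tuple`, and `meets_minimum` are hypothetical, and the version comparison is a simplification that looks only at the leading numeric parts (so a pre-release such as `1.8.0rc1` is treated like `1.8.0`).

```python
import re
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def release_tuple(version_string):
    """Convert '1.8.0' to (1, 8, 0), keeping only the leading numeric parts."""
    parts = []
    for piece in version_string.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to the same length."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


found = installed_version("pytorch-lightning")
if found is None:
    print("pytorch-lightning is not installed")
elif not meets_minimum(found, "1.8.0"):
    print(f"pytorch-lightning {found} is older than 1.8.0")
else:
    print(f"pytorch-lightning {found} is OK")
```

If the `packaging` library is available, comparing `packaging.version.Version` objects handles pre-releases and other edge cases more robustly than this sketch.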
Hi @Sakura-gh, I believe the problem is caused by some process exiting early. Maybe you can add a synchronization call such as `torch.distributed.barrier()` at the end of the code to...
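A minimal sketch of such a final synchronization is below. The helper name `barrier_before_exit` is hypothetical, and the guards are an assumption added so that the snippet also runs harmlessly in a single, non-distributed process (or without `torch` installed); in a real distributed job only the `dist.barrier()` call matters.

```python
try:
    import torch.distributed as dist
except ImportError:
    # torch is not installed; the helper below then does nothing.
    dist = None


def barrier_before_exit():
    """Wait for all ranks if a process group is active.

    Returns True if a barrier was actually executed, False otherwise.
    """
    if dist is None or not dist.is_available() or not dist.is_initialized():
        return False
    # Blocks until every rank in the default process group reaches this point,
    # so no process exits while others are still communicating.
    dist.barrier()
    return True


if __name__ == "__main__":
    # ... training or inference code would go here ...
    barrier_before_exit()
```

Calling it as the last statement of the script keeps fast ranks alive until the slow ones have finished.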
Hi @nostalgicimp, I can't reproduce the error mentioned above. Try adding a barrier at the end of the code. Do you encounter the problem every time? Please...
Hi @nostalgicimp, I still can't reproduce the problem. Everything runs fine on my server. Maybe you could set the environment variable `CUDA_LAUNCH_BLOCKING=1` for more information about your...
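One way to set the variable from inside the script, as a minimal sketch: it assumes the assignment happens at the very top of the entry script, before `torch` is imported and before any CUDA work starts.

```python
import os

# Must be set before CUDA is initialized, so do it before importing torch.
# With synchronous kernel launches, a CUDA error is reported at the call
# that caused it instead of at some later, unrelated line.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch (and run the failing code) only after this point
```

Setting it on the command line when launching the job has the same effect. Synchronous launches slow execution down, so this is meant for debugging only.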
I have fixed part of this problem. If users only use the initialization functions in `torch.nn.init`, there will be no problem. But correct initialization can't be guaranteed when users...