Eryck

Results 2 issues of Eryck

I don't see the weight initialization in the similar TF code. Is it not needed?

In distributed training, the first GPU uses twice as much memory as the others. Before SWA is applied, memory usage is the same on every GPU.