Eryck issues

Results: 2 issues of Eryck

I don't see the weight initialization in similar TF code. Is it not needed?

In distributed training, the first GPU uses twice as much memory as the others. But before SWA is applied, memory usage is the same on every GPU.