Ziji Shi (Steven)

5 comments by Ziji Shi (Steven)

@merleyc Hi, the CPU utilization reported by `top` on a multi-core machine can exceed 100% because it is measured relative to a *single core*, i.e., if...
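To make the per-core convention concrete, here is a minimal sketch that rescales a `top`-style percentage to a fraction of the whole machine. The helper name `machine_wide_cpu_percent` and the sample values are hypothetical, chosen only for illustration.

```python
import os


def machine_wide_cpu_percent(top_percent, n_cores=None):
    """Rescale a per-core percentage (100% = one fully busy core)
    to a percentage of the whole machine.

    `machine_wide_cpu_percent` is a hypothetical helper, not part of `top`.
    """
    if n_cores is None:
        # os.cpu_count() may return None on some platforms; fall back to 1.
        n_cores = os.cpu_count() or 1
    if n_cores <= 0:
        raise ValueError("n_cores must be positive")
    return top_percent / n_cores


# A process shown at 250% on an assumed 8-core machine keeps 2.5 cores busy,
# which is 31.25% of the machine's total CPU capacity.
print(machine_wide_cpu_percent(250.0, n_cores=8))
```

So a reading above 100% just means the process is using more than one core at a time.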

GPU trace: ![Screenshot 2022-01-12 at 11 52 37](https://user-images.githubusercontent.com/16677443/149061050-40df48ce-10e4-4090-b6d5-e7dfc90757fb.png)

Hi Marc, Thanks for bringing this up! This is indeed a bug, and we are fixing it.

Hi Marc, Upon checking, this is not a bug. When applying BatchNorm on the default axis (last dim), BatchNorm reduces to LayerNorm, and since the size of gamma/beta depends on...

Empirically speaking, large-batch training usually leads to worse generalization because it tends to converge to sharp minima (ref: https://openreview.net/forum?id=H1oyRlYgg). You may wish to use large-batch optimizers like LAMB/LARS to alleviate...
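For intuition on what these optimizers do, below is a rough sketch of the layer-wise trust ratio at the core of LARS, written in plain Python. The function name `lars_local_lr`, the default coefficients, and the zero-norm fallback are assumptions made for this illustration, not any framework's API.

```python
import math


def lars_local_lr(weights, grads, trust_coefficient=0.001,
                  weight_decay=0.0, eps=1e-9):
    """Layer-wise learning-rate multiplier in the style of LARS:
    trust_coefficient * ||w|| / (||g|| + weight_decay * ||w||).

    Hypothetical helper for illustration; real implementations operate
    on tensors, once per layer.
    """
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grads))
    if w_norm == 0.0 or g_norm == 0.0:
        # Assumed fallback: leave the global learning rate unscaled.
        return 1.0
    return trust_coefficient * w_norm / (g_norm + weight_decay * w_norm + eps)


# Toy layer: the multiplier shrinks when gradients are large relative to
# the weights, and grows when they are small.
scale = lars_local_lr([3.0, 4.0], [0.6, 0.8])
print(scale)
```

Each layer's update is scaled by this ratio on top of the global learning rate, which is what lets the global rate be raised for large batches without some layers diverging.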