Ziji Shi (Steven)
@merleyc Hi, the CPU utilization reported by `top` on a multi-core machine can exceed 100% because it is measured relative to a *single core*, i.e., if...
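To make the point about `top` concrete, here is a minimal sketch that converts a per-core-relative percentage into a share of the whole machine. The helper name `machine_wide_percent` and the 8-core machine are hypothetical, chosen only for illustration, and the sketch assumes `top` is in its default mode, where 100% corresponds to one fully busy core.

```python
import os


def machine_wide_percent(top_cpu_percent, n_cores=None):
    """Convert a %CPU value measured against one core into a share of all cores.

    `machine_wide_percent` is a hypothetical helper, not part of any tool.
    """
    if n_cores is None:
        # Fall back to the number of logical cores visible to Python.
        n_cores = os.cpu_count() or 1
    return top_cpu_percent / n_cores


# A process that keeps four cores fully busy shows up as 400% in top.
# On an assumed 8-core machine, that is half of the total capacity.
print(machine_wide_percent(400.0, n_cores=8))  # 50.0
```

So a reading above 100% simply means the process is using more than one core at a time.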
GPU trace: 
Hi Marc, Thanks for bringing this up! This is indeed a bug, and we are fixing it.
Hi Marc, Upon checking, this is not a bug. When applying BatchNorm on the default axis (last dim), BatchNorm reduces to LayerNorm, and since the size of gamma/beta depends on...
Empirically speaking, large-batch training usually leads to worse generalization because it tends to converge to sharp local minima (ref: https://openreview.net/forum?id=H1oyRlYgg). You may wish to use large-batch optimizers like LAMB/LARS to alleviate...
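For readers unfamiliar with LARS, the sketch below illustrates its core idea, a layer-wise "trust ratio" that scales each layer's step by the ratio of its weight norm to its gradient norm. It is a simplified pure-Python illustration, not the implementation used by any particular library: momentum is omitted, the function names and default values are assumptions, and returning 1.0 when either norm is zero is one common convention rather than a requirement.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_local_lr(weights, grads, trust_coef=0.001, weight_decay=0.0, eps=1e-9):
    """Layer-wise learning-rate multiplier: trust_coef * ||w|| / (||g|| + wd * ||w||)."""
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    if w_norm == 0.0 or g_norm == 0.0:
        # Assumed fallback: leave the base learning rate unscaled.
        return 1.0
    return trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)


def lars_step(weights, grads, base_lr, trust_coef=0.001, weight_decay=0.0):
    """One momentum-free LARS update for a single layer (hypothetical helper)."""
    local_lr = lars_local_lr(weights, grads, trust_coef, weight_decay)
    return [
        w - base_lr * local_lr * (g + weight_decay * w)
        for w, g in zip(weights, grads)
    ]


# Example layer: ||w|| = 5 and ||g|| = 1, so the multiplier is about 0.005.
new_weights = lars_step([3.0, 4.0], [0.6, 0.8], base_lr=1.0)
print(new_weights)
```

Because the multiplier divides by the gradient norm, the size of the update is nearly independent of the gradient's scale, which is what makes this family of optimizers more stable at large batch sizes and correspondingly large learning rates.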