[Experiment] Squeeze and Excitation
Trying out Squeeze and Excitation.
Looks really good.
Code: https://github.com/tensorflow/minigo/pull/673 (Brian, it would be nice if you could take a look)
Inspirations:
- https://arxiv.org/abs/1709.01507
- https://github.com/LeelaChessZero/lczero-training/blob/master/tf/tfprocess.py#L533
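For anyone skimming the thread, here's a minimal sketch of what an SE block looks like in TF 1.x; the function name and reduction ratio are illustrative, not the exact code from the PR:

```python
import tensorflow as tf

def squeeze_excitation(x, channels, ratio=2):
    """Minimal SE block sketch for NHWC activations; `ratio` is the bottleneck reduction."""
    # Squeeze: global average pool over the spatial dims -> [N, C]
    pooled = tf.reduce_mean(x, axis=[1, 2])
    # Excitation: bottleneck FC -> ReLU -> FC -> sigmoid, giving one weight per channel
    hidden = tf.layers.dense(pooled, channels // ratio, activation=tf.nn.relu)
    weights = tf.layers.dense(hidden, channels, activation=tf.nn.sigmoid)
    # Scale: broadcast the per-channel weights back over H and W
    return x * tf.reshape(weights, [-1, 1, 1, channels])
```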
Results
Trained 6 networks, 2 each of baseline, Squeeze and Excitation (SE), and SE + bias
- Tensorboard shows lower error in all metrics
- Played 40 games for each of the last 5 checkpoints against a fixed external engine (LZ60)


The TensorFlow code was very slow at inference. It was mentioned that the average-pooling op is slower than reduce_mean; I investigated, but the slowdown appeared and disappeared and it's unclear why.
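For context, these are the two ways of writing the squeeze step that comparison refers to (a sketch; the 19x19 board size and channel count are assumptions):

```python
import tensorflow as tf

# x: NHWC activations from a residual block; 19x19 board and 256 channels assumed here.
x = tf.placeholder(tf.float32, [None, 19, 19, 256])

# Reduction-based global average pool:
pooled_reduce = tf.reduce_mean(x, axis=[1, 2], keepdims=True)
# Pooling-op version, the one that was reportedly slower in some runs:
pooled_avgpool = tf.layers.average_pooling2d(x, pool_size=19, strides=1)
```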
https://github.com/hujie-frank/SENet is the official repository for the paper.
Some links I used to profile performance:
add_run_metadata from: https://www.tensorflow.org/guide/graph_viz
chrome tracing from: https://towardsdatascience.com/howto-profile-tensorflow-1a49fb18073d
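Roughly what those two techniques look like together (a TF 1.x sketch based on the links above; `output_tensor` and the log/file names are placeholders):

```python
import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(output_tensor, options=run_options, run_metadata=run_metadata)

    # add_run_metadata: makes per-op timing/memory visible in TensorBoard's graph view.
    writer = tf.summary.FileWriter('logs', sess.graph)
    writer.add_run_metadata(run_metadata, 'step_0')

    # Chrome trace: open the resulting .json in chrome://tracing.
    trace = timeline.Timeline(step_stats=run_metadata.step_stats)
    with open('timeline_01.json', 'w') as f:
        f.write(trace.generate_chrome_trace_format())
```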
Chrome trace of SE: timeline_01.txt (was a .json, renamed for the upload)
@sethtroisi From a quick survey of the paper, the Squeeze-Excite (SE) approach looks very similar to what @lightvector has been doing with global properties. The main difference seems to be that SE only considers average pooling (though they suggest other aggregations), while the latter suggests that max pooling might also be useful; a rough sketch of that combination is below.
See https://github.com/lightvector/GoNN#update-oct-2018 for further reading on his research into the topic. He also discusses a bunch of other topics you might find inspiring for similar enhancements.
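To illustrate that difference, here's a hedged sketch of a squeeze step that keeps both the mean and the max per channel (not code from either project):

```python
import tensorflow as tf

def squeeze_mean_and_max(x):
    """Global pooling that keeps both the mean and the max per channel (NHWC input)."""
    mean_pool = tf.reduce_mean(x, axis=[1, 2])  # [N, C]
    max_pool = tf.reduce_max(x, axis=[1, 2])    # [N, C]
    # The [N, 2C] result would feed the excitation (dense) layers instead of the mean alone.
    return tf.concat([mean_pool, max_pool], axis=1)
```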
What do you want to do with this issue, now that we're doing it? :)
I'm planning to include details from v17 in this issue, then we'll close it out.
Cross Eval is showing v17 as much stronger, which I'm going to 80% attribute to this change!!!

great
What's the computational cost (if any) of SE versus non-SE, holding number of blocks constant?
+2% on TPU for training, +1% for inference.
On my personal machine I had to pin some operations to the GPU, otherwise it was 2x slower.
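For reference, pinning those ops explicitly looks roughly like this (a sketch; which ops actually needed pinning isn't recorded here):

```python
import tensorflow as tf

def se_on_gpu(x, channels):
    """Same SE ops as above, but pinned to the GPU so they don't fall back to the CPU."""
    with tf.device('/gpu:0'):
        pooled = tf.reduce_mean(x, axis=[1, 2])
        hidden = tf.layers.dense(pooled, channels // 2, activation=tf.nn.relu)
        scale = tf.layers.dense(hidden, channels, activation=tf.nn.sigmoid)
    return x * tf.reshape(scale, [-1, 1, 1, channels])
```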
They posted another paper; I don't know whether it is relevant to Go: Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks,
Jie Hu, Li Shen, Samuel Albanie, Gang Sun, Andrea Vedaldi, https://arxiv.org/abs/1810.12348
@sethtroisi, you might want to update your LeelaChessZero link to: https://github.com/LeelaChessZero/lczero-training/blob/master/tf/tfprocess.py#L645
What if a bias is used instead of the weighting (gamma)? Can it be confirmed that the bias alone wouldn't already give the same improvement as having both?
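For clarity on what "SE + bias" means here (in the spirit of the lc0 code linked above; a sketch, not the exact minigo implementation): the excitation outputs both a per-channel scale (gamma, through a sigmoid) and a per-channel bias (beta), so a bias-only ablation would just drop the scale path.

```python
import tensorflow as tf

def squeeze_excitation_with_bias(x, channels, ratio=2):
    """SE variant whose excitation outputs both a scale (gamma) and a bias (beta); NHWC."""
    pooled = tf.reduce_mean(x, axis=[1, 2])                                    # [N, C]
    hidden = tf.layers.dense(pooled, channels // ratio, activation=tf.nn.relu)
    params = tf.layers.dense(hidden, 2 * channels)                             # [N, 2C]
    gamma, beta = tf.split(params, 2, axis=1)
    gamma = tf.reshape(tf.nn.sigmoid(gamma), [-1, 1, 1, channels])
    beta = tf.reshape(beta, [-1, 1, 1, channels])
    return x * gamma + beta  # bias-only would be just `x + beta`
```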