
SPSA improvements [RFC]

Open ppigazzini opened this issue 6 years ago • 427 comments

Issue opened to collect info about possible future SPSA improvements.

SPSA references

SPSA is a fairly simple algorithm intended for local optimization (not global optimization). The wiki now has simple documentation explaining the SPSA implementation in fishtest. Here is other documentation:

SPSA implementation problems/improvements

  • we ask for "c_k_end" and "r_k_end" (final parameter values), but IMO we should ask for "c" and "r" (starting values): if those are too big, the SPSA diverges (see the sketch after this list)
  • we use "r_k = a_k / c_k^2" instead of "r_k = a_k" (I searched for a reference in the SPSA papers without success)
  • we set "c_k_end" and "r_k_end" for every single variable to be optimized (the original SPSA uses global values): this makes sense to account for the different sensitivity of the variables, but IMO this should be dealt with by an internal normalization of the variable values based upon the starting values and the bounds.
  • one iteration should correspond to a match of 2 games, but our worker code cannot support this, so we set one iteration to a match of 2*N_cores games
  • compute an averaged SP gradient per iteration to lower the noise
  • we have experimental code (special rounding and clipping) that nobody uses: I'm afraid that it's theoretically correct but not very useful given the rough way we use SPSA
  • the "A" parameter should be computed from the number of games
  • the worker passes rounded values to cutechess-cli: we should normalize the variable values so that all the variables have the same resolution
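
As an illustration of the first point, here is a minimal sketch (not the actual fishtest code) of the inverse mapping that is needed when the end values are requested, assuming the schedule used in the rundb.py excerpt quoted further below (c_k = c / k^gamma and r_k = a / (A + k)^alpha / c_k^2):

    # Sketch: derive the starting gains (c, a) so that the schedule
    #   c_k = c / k^gamma
    #   r_k = a / (A + k)^alpha / c_k^2
    # reaches the requested c_k_end and r_k_end at the last iteration.
    def spsa_start_gains(c_k_end, r_k_end, num_iterations,
                         alpha=0.602, gamma=0.101, A=0.0):
        c = c_k_end * num_iterations ** gamma
        a = r_k_end * c_k_end ** 2 * (A + num_iterations) ** alpha
        return c, a

    # Example: 100000 games, i.e. 50000 iterations of one game pair each.
    c, a = spsa_start_gains(c_k_end=4.0, r_k_end=0.002, num_iterations=50000)

If "c" and "r" (the starting values) were requested directly, this inverse mapping would not be needed and a diverging choice would be easier to spot.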

SPSA testing process (aka Time Control)


EDIT: this paragraph is outdated, I kept it to avoid disrupting the chain of posts:

  • read the wiki for a SPSA description https://github.com/glinscott/fishtest/wiki/Creating-my-first-test#tuning-with-spsa
  • the experience of these last years has shown that a very short time control on fishtest does not work:
    • with NNUE, workers running on dual CPU have time losses at ultra short time control (USTC)
    • SPSA using LTC or even ULTC has a high signal/noise ratio that helps the convergence. A ULTC match is very drawish, so in SPSA one side will win a game pair only if the random increments of the parameters are somehow aligned with the gradient direction

I suggest this process to optimize the developer time and the framework CPU.

  • first steps: run some SPSAs at ultra STC (e.g. 1+0.01) to find good "c_k_end", "r_k_end" values and some good starting values for the variables. This can be done either locally with a recent CPU or on fishtest.
  • last step: run a final SPSA on fishtest to optimize the variables at a longer TC (e.g. STC, 20+0.2, LTC, etc.)

I took an SPSA from fishtest and ran it locally changing only the TC; the results are similar:

  • 20+0.2 - original test on fishtest: https://tests.stockfishchess.org/tests/view/5e2dade6ab2d69d58394fb5e

[graph: 20+0.2 results]

  • 2+0.02:

[graph: 2+0.02 results]

  • 1+0.01:

[graph: 1+0.01 results]

  • 0.5+0.01:

[graph: 0.5+0.01 results]

ppigazzini avatar Feb 03 '20 19:02 ppigazzini

From my experience with SPSA, the main problem is the high level of noise in the results. If any proposal reduces this noise, I agree with it :-) You said:

"one iteration should be set to a 2 games for match, but our worker code cannot support this, so we set one iteration to a 2*N cores gamer for match"

Can we choose the number N? In particular, increase it. I think that below 100 games the result can be completely wrong and lead to a bad convergence.

MJZ1977 avatar Feb 11 '20 09:02 MJZ1977

@MJZ1977 the companion code of the seminal paper asks for the number of averaged SP gradients to be used per iteration. List updated, thank you :)

ppigazzini avatar Feb 11 '20 09:02 ppigazzini

The experimental options "careful clipping" and "randomized rounding" don't seem to have a first-order effect, so we could keep only one method to clip and to round.

  • careful clipping

[graph: careful clipping]

  • randomized rounding

[graph: randomized rounding]

  • careful clipping + randomized rounding

[graph: careful clipping + randomized rounding]

ppigazzini avatar Apr 19 '20 12:04 ppigazzini

@ppigazzini: what are the effects of these options? Did they change the number N of games before updating the parameters?

MJZ1977 avatar Apr 19 '20 14:04 MJZ1977

@MJZ1977 "careful clipping" https://github.com/glinscott/fishtest/commit/7eebda7e6d1f47f2672aefe46db35baee7cb5b1f and randomized rounding https://github.com/glinscott/fishtest/commit/5f63500db3f40569ea406a8b8b4b987f054ee79f are theoretical improvements with little/no effect on SPSA convergence wrt other parameters. People stuck to default, so the GUI was simplified dropping the possibility to chose them. I will do some other tests and then I will simplify the code dropping the options not useful.

https://github.com/glinscott/fishtest/blob/5b07986dab3e638292cd04d6cf95d89d9959faeb/fishtest/fishtest/rundb.py#L599-L625

ppigazzini avatar Apr 19 '20 15:04 ppigazzini

From what I'm finding online, alpha is usually 0.602, gamma at 0.101 is OK, and A is ~10% of the number of iterations. Would these be good defaults for the SPSA fields?

Sources: https://hackage.haskell.org/package/spsa-0.2.0.0/docs/Math-Optimization-SPSA.html https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4769712/ https://www.chessprogramming.org/SPSA https://www.jhuapl.edu/SPSA/PDF-SPSA/Spall_Implementation_of_the_Simultaneous.PDF

linrock avatar Apr 21 '20 03:04 linrock

@linrock it definitely makes sense to have defaults for the fields (actually, I was thinking they had defaults...). Also @ppigazzini suggests having A depend on the number of games. Shouldn't we call the field 'A [in %]' and give it a default of 10%, so that the field doesn't need to be adjusted when the number of games is changed?

vondele avatar Apr 21 '20 05:04 vondele

Ah yeah, I removed the SPSA defaults in the "create new test" redesign PR when all that should've been removed was the list of hard-coded params in the SPSA parameter list.

A as a percentage of # games makes sense. From what I'm reading, A is typically less than or equal to 10% of the expected # of iterations (2 games per iteration). So maybe it could be either (see the sketch after this list):

  • A (% games) with default of 5%
  • A (% iterations) with default of 10%
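
Both conventions give the same A when one iteration is one game pair; a tiny sketch of the equivalence (the numbers are just the proposed defaults):

    # A as a percentage of games vs. as a percentage of iterations,
    # assuming one iteration = 2 games (one game pair).
    def A_from_games(num_games, pct=0.05):
        return pct * num_games                # e.g. 5% of games

    def A_from_iterations(num_games, pct=0.10):
        return pct * (num_games / 2)          # e.g. 10% of iterations

    print(A_from_games(100_000), A_from_iterations(100_000))  # both 5000.0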

linrock avatar Apr 21 '20 18:04 linrock

Haha, in all this time I never realised that A was (/ should be) related to the number of games! :)

Regarding SPSA at very low tc, does that stress the server a lot because workers are continually returning small batches of data?

xoto10 avatar Apr 21 '20 19:04 xoto10

@xoto10 SPSA at very low TC can also be done locally :)

ppigazzini avatar Apr 21 '20 19:04 ppigazzini

@linrock either percentage seems fine to me. Probably games, since we specify #games for SPSA and not number of iterations. In the future, I could imagine that an iteration contains more than 2 games (i.e. batching for SPSA, @vdbergh?), to reduce server load, and because it presumably makes sense (but I don't know the SPSA details).

vondele avatar Apr 21 '20 20:04 vondele

@vondele I am working on a small PR to allow the server to set a batch_size. It is mainly for sprt but it will also work for spsa and fixed games although for those one may consider leaving it to the worker. We can see.

vdbergh avatar Apr 21 '20 20:04 vdbergh

@ppigazzini: I am trying to understand how the SPSA code works and my knowledge is very weak. Never mind, I am trying. In the file rundb.py, I find the following:

    # First excerpt: generate the next set of tuning parameters for a worker
    # batch; w_params are perturbed by +c*flip and b_params by -c*flip
    iter_local = spsa['iter'] + 1  # assume at least one completed,
                                   # and avoid division by zero
    for param in spsa['params']:
      c = param['c'] / iter_local ** spsa['gamma']
      flip = 1 if random.getrandbits(1) else -1
      result['w_params'].append({
        'name': param['name'],
        'value': self.spsa_param_clip_round(param, c * flip,
                                            spsa['clipping'], spsa['rounding']),
        'R': param['a'] / (spsa['A'] + iter_local) ** spsa['alpha'] / c ** 2,
        'c': c,
        'flip': flip,
      })
      result['b_params'].append({
        'name': param['name'],
        'value': self.spsa_param_clip_round(param, -c * flip, spsa['clipping'], spsa['rounding']),
      })
    # Second excerpt (update_spsa): update the current theta based on the
    # results from the worker; wins/losses are always in terms of w_params.
    # Note that "result" is reused here for the win/loss balance.
    result = spsa_results['wins'] - spsa_results['losses']
    summary = []
    w_params = self.get_params(run['_id'], worker)
    for idx, param in enumerate(spsa['params']):
      R = w_params[idx]['R']
      c = w_params[idx]['c']
      flip = w_params[idx]['flip']
      param['theta'] = self.spsa_param_clip_round(param, R * c * result * flip,
                                                  spsa['clipping'],
                                                  'deterministic')
      if grow_summary:
        summary.append({
          'theta': param['theta'],
          'R': R,
          'c': c,
        })

My questions are:

  • is "w_params / b_params" corresponding to "white / black parameters". If this is true, why always making "+c" for white and "-c" for black ? What I had understood is that we should make a couple of games with black and white alternating "+c/-c" and "-c/+c" ?
  • in the second part "update_spsa", the gradient is calculated from wins/losses from last results. Are the results corresponding to a specified number of games for a worker (like 200 in SPSA tests), or just for a couple of games? Within the worker, is black/white alternating ? we should have something like engine1(+c) - engine2(-c) white - black black - white with the same opening ?

And sorry for these "technical questions" ...

Update: latest version of the code

MJZ1977 avatar Apr 28 '20 10:04 MJZ1977

@MJZ1977 I think it is great somebody is looking at the implementation of SPSA. I'm still puzzled why our tuning attempts have such a low success rate (@linrock's recent experience). I do think we need a very large number of games, as the Elo differences we're looking for are so small, and the parameters of SPSA are not obvious or automatic, but I also think we need to critically audit the actual implementation, just in case.

vondele avatar Apr 28 '20 11:04 vondele

@MJZ1977 You should also look at the worker code to get the complete picture, at and below this line https://github.com/glinscott/fishtest/blob/db94846a0db8788fe8a8724678798dcc91d201e8/worker/games.py#L386

See https://github.com/zamar/spsa for the original implementation

tomtor avatar Apr 28 '20 12:04 tomtor

  • Do the results correspond to a specified number of games for a worker

@MJZ1977 A worker plays batches of 2*N-CPU games (white/black alternating) and requests a parameter update from the server after every batch.
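
A small illustration of what that batch scheme implies (the core count and total number of games below are hypothetical numbers, not fishtest defaults):

    # Each batch is one SPSA iteration: one game pair per core, colors alternating.
    n_cores = 8
    batch_size = 2 * n_cores             # games played before each theta update
    total_games = 100_000
    updates = total_games // batch_size   # theta updates if every batch has this size
    print(batch_size, updates)            # 16 games per batch, 6250 updates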

tomtor avatar Apr 28 '20 12:04 tomtor

@vondele SPSA claims to minimize the number of function evaluations. Classic SPSA evaluates the function only at "variables_values_k+delta; variables_values_k-delta" for the gradient estimation, so SPSA obviously diverges with a wrong delta. This is why I suggest testing the SPSA parameters locally at USTC before submitting to fishtest.

The one-sided SPSA computes the gradient with "variables_values_k+delta; variables_values_k", so having a CPU-cost-free function evaluation at variables_values_k it is possible to implement:

  • a policy to reset the variables value upon some conditions (like done with extra CPU cost in the paper linked by @linrock)
  • a policy to accept delta only if f(variables_values_k+delta) > f(variables_values_k) (for the maximization problem)

Neither policy can guarantee convergence with a bad delta, though. SPSA (and all gradient descent algorithms) works only to refine the starting values within the starting basin; to find better local maxima we should switch to global optimization algorithms based on function evaluations (Nelder-Mead, genetic, etc.) to explore the variable space. https://en.wikipedia.org/wiki/Global_optimization
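
A toy sketch of the two gradient estimates on a synthetic noisy objective (the objective, constants, and gains below are made up for illustration; this is not the fishtest code):

    import random

    def noisy_f(x):
        # Synthetic noisy objective standing in for "strength of the engine
        # with parameter value x" (to be maximized, optimum at x = 3).
        return -(x - 3.0) ** 2 + random.gauss(0.0, 0.5)

    def two_sided_step(theta, c, r):
        # Classic SPSA: evaluate only at theta + c*delta and theta - c*delta.
        delta = 1 if random.getrandbits(1) else -1
        g = (noisy_f(theta + c * delta) - noisy_f(theta - c * delta)) / (2 * c * delta)
        return theta + r * g

    def one_sided_step(theta, c, r):
        # One-sided variant: the evaluation at theta itself is available,
        # so an "accept delta only if it improves f" policy can be added.
        delta = 1 if random.getrandbits(1) else -1
        f_base, f_plus = noisy_f(theta), noisy_f(theta + c * delta)
        if f_plus <= f_base:
            return theta                      # reject, keep the current values
        g = (f_plus - f_base) / (c * delta)
        return theta + r * g

    theta = 0.0
    for _ in range(1000):
        theta = two_sided_step(theta, c=0.5, r=0.1)
    print(theta)   # should drift towards the optimum at 3.0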

ppigazzini avatar Apr 28 '20 12:04 ppigazzini

@tomtor: thank you for the links! Update: removed

MJZ1977 avatar Apr 28 '20 12:04 MJZ1977

@ppigazzini concerning Nelder-Mead, I did work on interfacing cutechess games to the nevergrad suite of optimizers: https://github.com/vondele/nevergrad4sf and picked TBPSA, which seems to be the recommended optimizer for noisy functions. I found it robust if given enough games (literally millions). Unfortunately, the optimized parameters seem very good at the TC at which they have been optimized (VSTC), but not transferable. Since I can't optimize at STC or LTC, it would need to be integrated in fishtest... but I'm not able to do that (time and experience with the framework lacking atm)... if somebody wants to pick it up, I would be happy to help.

vondele avatar Apr 28 '20 12:04 vondele

After making some tests, I think that one of the principal problems is that the random parameter "flip" only takes the values +1 or -1 (please correct me if I am wrong). So basically, fishtest always tries to change all variables at the same time. One improvement could be to take flip values from [+1, +0.1, -0.1, -1], for example; it corresponds to a random division by 10. In this case, we will have some tests with only 1 or 2 variables changing. I think it is also easy to implement, even if I don't have the knowledge to make it!
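
For reference, a minimal sketch of the proposed change next to the current scheme (the current flip draw matches the rundb.py excerpt quoted above):

    import random

    # Current scheme: Rademacher flip, every parameter moves by the full +/- c.
    flip = 1 if random.getrandbits(1) else -1

    # Proposed scheme (sketch): some parameters move by only a tenth of c,
    # so in a given batch only a subset of the parameters is perturbed strongly.
    flip = random.choice([+1, +0.1, -0.1, -1])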

MJZ1977 avatar Apr 29 '20 16:04 MJZ1977

If we want to tune 1 constant, it would be nice if tuning could simply test the start value and N values either side (3? 5?) and then display a bar chart of the resulting performance. That might give us an easy-to-read clue as to whether there's a trend in which values are better. We tend to do this manually atm but it seems easy for the tuner to do?

xoto10 avatar Apr 29 '20 17:04 xoto10

@MJZ1977

So basically, fishtest always tries to change all variables at the same time.

SPSA = Simultaneous perturbation stochastic approximation

One improvement could be to take flip values from [+1, +0.1, -0.1, -1], for example; it corresponds to a random division by 10. In this case, we will have some tests with only 1 or 2 variables changing. I think it is also easy to implement, even if I don't have the knowledge to make it!

Random [+1, -1] is the Rademacher distribution; you can use other distributions, but IMO the result will not change: we can get good fishtest gains from SPSA only for badly tuned parameters or when SPSA finds a new local maximum by serendipity.

SPSA, like other gradient algorithms, is a local optimization, useful to refine the starting values "without hopping from the starting basin".

@xoto10 you are talking about a global optimization algorithm; take a look at @vondele's work.

ppigazzini avatar Apr 29 '20 17:04 ppigazzini

While TBPSA might also work for global optimization (that's always hard), I don't think we're typically stuck in local minima. At least, I have never seen evidence of that. TBPSA seems to be just rather good at doing the right thing in the presence of noise, also in a (relatively small) number of dimensions. @xoto10 the bar chart will tell almost nothing in most cases, unless we play on the order of 240000 games per point (that's roughly a 1 Elo error, i.e. the typical gain from a tune).
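
A rough back-of-the-envelope for that order of magnitude (the per-game score standard deviation of ~0.3 is an assumption here, typical for drawish self-play; the exact value depends on TC and opening book):

    import math

    # Approximate 95% error bar on a measured Elo difference from N games,
    # assuming a per-game score standard deviation of ~0.3.
    def elo_error_95(num_games, score_std=0.3):
        elo_per_score = 1600 / math.log(10)   # dElo/dscore near a 50% score
        return 1.96 * elo_per_score * score_std / math.sqrt(num_games)

    print(round(elo_error_95(240_000), 2))    # ~0.83 Elo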

I once did a scan of one of the search parameters, and the graph is somewhere in a thread on GitHub, which I can't find right now; it looks like this: [graph: elo_stat]

vondele avatar Apr 29 '20 17:04 vondele

I don't think we're typically stuck in local minima. At least, I have never seen evidence of that.

In that case a (properly implemented) SPSA should be able to find a better value, but in my first post I collected all my doubts about our SPSA implementation.

A simple check is to set a blatantly wrong value for a parameter (e.g. Queen = 0.1 pawn, sorry I'm not an SF developer :) and see if our SPSA is able to recover a good value.

ppigazzini avatar Apr 29 '20 19:04 ppigazzini

I have made some tests since yesterday and come to the conclusion that SPSA is currently not working well because of too much noise in the individual results. As an example to explain my thought, I take this simple SPSA beginning with KnightSafeCheck = 590 https://tests.stockfishchess.org/tests/view/5ea9b5c469c5cb4e2aeb82fd and an SPRT of master vs KnightSafeCheck = 590 https://tests.stockfishchess.org/tests/view/5eaa93b769c5cb4e2aeb8370. The best value should be KnightSafeCheck = ~790, as in master. SPSA is oscillating, even if it seems to be increasing at the end. I use only 1 variable to avoid any bias.

The only solution to this is to make iterations of at least 200 games instead of 2*N games. For example, if the results are 60-40-100, it gives +20 to multiply by the same gradient. This is very different from multiplying "60" and "-40" by different gradients, which clearly increases the noise. This is my opinion, but I cannot be sure without making tests, which are impossible now.

An improvement could be to add an SPSA parameter for the minimum number of games per iteration, instead of the default 2*N.

MJZ1977 avatar Apr 30 '20 10:04 MJZ1977

I think one cannot say anything without first measuring the Elo difference between 590 and 790. If it takes many games just to detect the difference, one cannot expect SPSA to magically wander from 590 to 790.

vdbergh avatar Apr 30 '20 11:04 vdbergh

@MJZ1977 try with a blatantly wrong value, e.g. KnightSafeCheck = 790000. If SPSA is not able to recover a value that makes sense (e.g. 2000) then:

  • that parameter is not tunable with SPSA, or
  • our SPSA implementation has problems

ppigazzini avatar Apr 30 '20 12:04 ppigazzini

that might not be such a good test... this could be so far off that e.g. the local gradient is zero, so progress won't be made. Maybe it could be started from 0. But I agree it is good to have an estimate of the Elo importance of the term as well, and probably picking a term with e.g. ~10 Elo impact makes sense. (Maybe scaling initiative would be a candidate?)

vondele avatar Apr 30 '20 12:04 vondele

I launched a test beginning with a value of 10. So, you confirm that SPSA can't resolve a +/-5 Elo difference. It will not be easy to find a +5 Elo patch :-).

MJZ1977 avatar Apr 30 '20 12:04 MJZ1977

that might not be such a good test... this could be so far off that e.g. the local gradient is zero, so progress won't be made.

@vondele you know exactly how much you are off, so with the proper "c_k_end" the gradient should not be 0 (if that parameter has an effect at all). And here is one of the many problems of our SPSA implementation: we ask for "c_k_end" and not for "c_k_start", making it very hard for developers to control the SPSA behaviour.

ppigazzini avatar Apr 30 '20 13:04 ppigazzini