
How to train the current strongest model locally?

Open anonym-g opened this issue 11 months ago • 8 comments

I have successfully produced a model using the network file from the official website, trained locally on approximately 1000 games of data. But it plays poorly, not at all like a well-trained model.

Is this because I didn't use the raw checkpoint files? If so, how do I use them, and where should I put them?

anonym-g avatar Mar 07 '25 16:03 anonym-g

You might have to give more detail about what you're trying to do, and study the log files produced by your training script for what it did.

If you didn't use a "raw checkpoint file" then there's a pretty good chance that what you actually did was play 1000 games from a fully trained network, and then use those games to train an entirely new, randomly-initialized model from scratch. That would of course produce a very weak model; 1000 games is nowhere near enough to train an entirely new model from nothing. PyTorch training always uses the checkpoint files; the .bin.gz files are only used for playing games, not for training.

Take a look at your training directory, and explore the various subfolders inside it. You will be able to see how it is organized, and within it you can find the checkpoint files produced by whatever model you trained. You might also be interested in the -initial-checkpoint argument to train.py.
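
For example, one way to locate the newest checkpoint under a training directory before passing it to -initial-checkpoint might look like the sketch below. The `*.ckpt` glob and the directory layout are assumptions, not guaranteed by KataGo; inspect your own traindir for the actual file names.

```python
# Hypothetical helper: find the most recently modified .ckpt file
# anywhere under a KataGo training directory, e.g. to pass to
# train.py's -initial-checkpoint. The "*.ckpt" pattern is an
# assumption; check your own traindir for the real checkpoint names.
import glob
import os

def latest_checkpoint(traindir):
    ckpts = glob.glob(os.path.join(traindir, "**", "*.ckpt"), recursive=True)
    return max(ckpts, key=os.path.getmtime) if ckpts else None

if __name__ == "__main__":
    # Example path from this thread; substitute your own.
    print(latest_checkpoint("G:/Projects/KataGo/Training/BaseDir/train"))
```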

Also, normally 1000 games is a very tiny amount of data and is unlikely to improve the current model, and if you set the learning rate too high, or train too much on the data and overfit to those 1000 games, then there's a good chance it makes the model worse. The exception is if you intend to train the model on very special positions it hasn't seen before, or with new rules, or some other situation that the current model is very bad at or not trained for. Then it's possible that a relatively small amount of data could produce a lot of learning and improvement on that particular situation.

lightvector avatar Mar 08 '25 15:03 lightvector

@lightvector Thanks for your reply! I actually figured it out earlier today, just as you describe. I didn't know how PyTorch training works, so I didn't place the officially released checkpoint file in the train folder.

After putting the .ckpt file in the correct place, I successfully got a further-trained model. But just now I encountered another problem: during the second round of training, the log of train.py says:

['./train.py', '-traindir', 'G:/Projects/KataGo/Training/BaseDir/train/kata1-b28c512nbt', '-datadir', 'G:/Projects/KataGo/Training/BaseDir/shuffleddata/current/', '-exportdir', 'G:/Projects/KataGo/Training/BaseDir/torchmodels_toexport', '-exportprefix', 'kata1-b28c512nbt', '-pos-len', '19', '-batch-size', '64', '-model-kind', 'b28c512nbt', '-samples-per-epoch', '10000', '-swa-period-samples', '80000', '-quit-if-no-data', '-stop-when-train-bucket-limited', '-no-repeat-files', '-max-train-bucket-per-new-data', '8', '-max-train-bucket-size', '200000']
Using GPU device: NVIDIA GeForce RTX 4070 Laptop GPU
Seeding torch with 28429974062958983
{'version': 15, 'norm_kind': 'fixscaleonenorm', 'bnorm_epsilon': 0.0001, 'bnorm_running_avg_momentum': 0.001, 'initial_conv_1x1': False, 'trunk_num_channels': 512, 'mid_num_channels': 256, 'gpool_num_channels': 64, 'use_attention_pool': False, 'num_attention_pool_heads': 4, 'block_kind': [['rconv1', 'bottlenest2'], ['rconv2', 'bottlenest2'], ['rconv3', 'bottlenest2gpool'], ['rconv4', 'bottlenest2'], ['rconv5', 'bottlenest2'], ['rconv6', 'bottlenest2gpool'], ['rconv7', 'bottlenest2'], ['rconv8', 'bottlenest2'], ['rconv9', 'bottlenest2gpool'], ['rconv10', 'bottlenest2'], ['rconv11', 'bottlenest2'], ['rconv12', 'bottlenest2gpool'], ['rconv13', 'bottlenest2'], ['rconv14', 'bottlenest2'], ['rconv15', 'bottlenest2gpool'], ['rconv16', 'bottlenest2'], ['rconv17', 'bottlenest2'], ['rconv18', 'bottlenest2gpool'], ['rconv19', 'bottlenest2'], ['rconv20', 'bottlenest2'], ['rconv21', 'bottlenest2gpool'], ['rconv22', 'bottlenest2'], ['rconv23', 'bottlenest2'], ['rconv24', 'bottlenest2gpool'], ['rconv25', 'bottlenest2'], ['rconv26', 'bottlenest2'], ['rconv27', 'bottlenest2gpool'], ['rconv28', 'bottlenest2']], 'p1_num_channels': 64, 'g1_num_channels': 64, 'v1_num_channels': 128, 'sbv2_num_channels': 128, 'num_scorebeliefs': 8, 'v2_size': 144, 'bnorm_use_gamma': True, 'activation': 'mish', 'use_repvgg_init': True, 'use_repvgg_learning_rate': True, 'has_intermediate_head': True, 'intermediate_head_blocks': 28, 'trunk_normless': True}
swa_period_samples 80000.0
swa_scale 8
lookahead_alpha 0.5
lookahead_k 6
soft_policy_weight_scale 8.0
disable_optimistic_policy False
meta_kata_only_soft_policy False
value_loss_scale 0.6
td_value_loss_scales [0.6, 0.6, 0.6]
seki_loss_scale 1.0
variance_time_loss_scale 1.0
main_loss_scale 0.2
intermediate_loss_scale 0.8
Parameters in model:
conv_spatial.weight, [512, 22, 3, 3], 101376 params
linear_global.weight, [512, 19], 9728 params
blocks.0.normactconvp.norm.beta, [1, 512, 1, 1], 512 params

... ...

intermediate_value_head.linear_s3.weight, [8, 128], 1024 params
intermediate_value_head.linear_s3.bias, [8], 8 params
intermediate_value_head.linear_smix.weight, [8, 384], 3072 params
intermediate_value_head.linear_smix.bias, [8], 8 params
Total num params: 73162378
Total trainable params: 73162378
Using lookahead optimizer 0.5 6
Training in FP32.
Updated training data: G:\Projects\KataGo\Training\BaseDir\shuffleddata\current
Advancing trainbucket row 6998 to 14262, 7264 new rows
Fill per data 8.000, Max bucket size 200000
Old rows in bucket: 591436
New rows in bucket: 200000
Train steps since last reload: 6976 -> 0
Dropping 1/1 files in: G:\Projects\KataGo\Training\BaseDir\shuffleddata\current\train as already used
No new training files found in: G:\Projects\KataGo\Training\BaseDir\shuffleddata\current\train, quitting

But the selfplay data was generated normally, and the shuffler seems to be working fine:

Beginning: Finding files
Finished: Finding files in 0.00026917457580566406 seconds
Total number of files: 2
Total number of files with unknown row count: 2
Excluded count: 0
Excluded count due to looking like temp file: 0
Excluded count due to cmdline excludes file: 0
GC collect
Beginning: Sorting
Finished: Sorting in 3.337860107421875e-06 seconds
Beginning: Computing rows for unsummarized files
Finished: Computing rows for unsummarized files in 0.24188446998596191 seconds
Beginning: Processing found files
Finished: Processing found files in 5.9604644775390625e-06 seconds
Total rows found: 14262 (14262 usable)
Desired num rows: 11680 / 14262
Beginning: Computing desired rows
Using: G:/Projects/KataGo/Training/BaseDir/selfplay/kata1-b28c512nbt-s8326501440-d6998\tdata\99B01374F225843D.npz (5317-5317) (8945/11680 desired rows)
Using: G:/Projects/KataGo/Training/BaseDir/selfplay/kata1-b28c512nbt-s8326501440-d6998\tdata\1BD4C8722948A678.npz (0-0) (14262/11680 desired rows)
Finished: Computing desired rows in 1.811981201171875e-05 seconds
Finally, using: (0-14262) (14262/11680 desired rows)
GC collect
Writing 1 output files with 14262 kept / 11680 desired rows
Due to only_include_md5, filtering down to 2/2 files
Grouping 2 input files into 1 sharding groups
Beginning: Sharding
Finished: Sharding in 2.807805299758911 seconds
Beginning: Merging
Finished: Merging in 2.615654468536377 seconds
Number of rows by output file:
[('G:/Projects/KataGo/Training/BaseDir/shuffleddata/20250308-232704/train\\data0.npz', 14208)]
Cleaning up tmp dir: G:/Projects/KataGo/Training/BaseDir/shufflescratch/train\tmp.shuf0

anonym-g avatar Mar 08 '25 15:03 anonym-g

These lines are wrong:

Updated training data: G:\Projects\KataGo\Training\BaseDir\shuffleddata\current
Dropping 1/1 files in: G:\Projects\KataGo\Training\BaseDir\shuffleddata\current\train as already used

The training code has protection against reading the same training file twice to prevent overfitting if it gets run again without the shuffler running (which can happen with static datasets, or if the shuffler fails or is interrupted, and the run continues or is restarted etc). It uses the file path to distinguish whether the same training file is being read.
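
A minimal sketch of that protection (not KataGo's actual code, just an illustration of the idea) might look like:

```python
# Hypothetical sketch of a "no repeat files" guard: remember the
# resolved path of every training file already consumed, and report
# whether a given file is new. Resolving via os.path.realpath matters
# because a path through a symlink like .../shuffleddata/current can
# point at different real files at different times.
import os

used_paths = set()

def is_new_training_file(path):
    real = os.path.realpath(path)  # resolve any symlinks in the path
    if real in used_paths:
        return False  # already trained on this file; skip it
    used_paths.add(real)
    return True
```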

Notice how the shuffler outputs to a new data directory every time, e.g. .../shuffleddata/20250308-232704/, named with the current date and time, so the file path changes with each new shuffle and the training script can recognize the data as new. The path .../shuffleddata/current is supposed to be a symlink/shortcut to whatever the latest such directory is, and this line of code https://github.com/lightvector/KataGo/blob/master/python/train.py#L735 is supposed to resolve .../shuffleddata/current to the actual directory with the date and time, so that the training script sees the real new file path and can tell whether the training files are the same or not.
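
Here is a minimal, self-contained demonstration of that resolution (not KataGo code; the paths are made up):

```python
# The 'current' link points at a dated directory, and os.path.realpath
# recovers the dated path, so each fresh shuffle yields a fresh file
# path from the trainer's point of view.
import os
import tempfile

base = tempfile.mkdtemp()
dated = os.path.join(base, "20250308-232704")
os.makedirs(dated)
link = os.path.join(base, "current")
os.symlink(dated, link, target_is_directory=True)

print(os.path.realpath(link) == os.path.realpath(dated))  # True
```

Note that creating symlinks on Windows may require administrator rights or Developer Mode, which can make a scheme like this break.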

Perhaps os.path.realpath is not working on your system or there's some detail in how your setup is handling the training or shuffling directories that is interfering.

lightvector avatar Mar 08 '25 15:03 lightvector

If things are working correctly, the output log lines should look more like:

Updated training data: G:\Projects\KataGo\Training\BaseDir\shuffleddata\2025038-...

showing that it is resolving shuffleddata/current into the real path with the correct date and time.

lightvector avatar Mar 08 '25 15:03 lightvector

Alright, I will try to figure it out tomorrow (it's almost 12 p.m. in my time zone), thanks for the reply.

And just for reference, about the rare situation you mentioned: I set komiMean to 7, since nowadays Go AIs generally recognize a 7.5 komi as an advantage for White (about a 64% win rate) and a 6.5 komi as an advantage for Black (about 58%), so I think it would be quite easy for the AI to reach a draw at 7 komi without much performance loss. This might allow the model to better develop its ability to play the proper move, as there is no 0.5-komi burden.

anonym-g avatar Mar 08 '25 15:03 anonym-g

Just so you know, the official training already centers the mean komi around the fairest value. komiMean is not even used; instead komiAuto is used, which automatically sets the mean to whatever the network believes to be the fairest komi, taking into account the exact rules and the starting position. Almost all games are played from non-empty starting positions to improve training diversity, so taking the starting position into account when determining the fairest komi is relatively important.

lightvector avatar Mar 08 '25 16:03 lightvector

I think I've tackled the problem. It seems the ln -s command in shuffle.sh cannot properly create a symlink under Windows. So I changed the code to:

# python/selfplay/shuffle.sh, Line 102 ~ 106
# Just in case, give a little time for nfs
sleep 10

# # rm if it already exists
# rm -rf "$BASEDIR"/shuffleddata/current_tmp


# ln -s $OUTDIR "$BASEDIR"/shuffleddata/current_tmp
# mv -Tf "$BASEDIR"/shuffleddata/current_tmp "$BASEDIR"/shuffleddata/current

# Use Python to create the symlink (works on Windows)
python ./create_symlink.py "$BASEDIR/shuffleddata/$OUTDIR" "$BASEDIR/shuffleddata/current"

And added a python file:

# python/create_symlink.py
import os
import sys
import shutil

if len(sys.argv) != 3:
    print("Usage: python create_symlink.py <target> <link>")
    sys.exit(1)

target = sys.argv[1]
link = sys.argv[2]

# If the link path already exists, remove it first; guard with isdir
# so that a missing path doesn't make shutil.rmtree raise
if os.path.islink(link):
    os.remove(link)  # remove the old symlink
elif os.path.isdir(link):
    shutil.rmtree(link)  # remove a real directory left behind

os.symlink(target, link, target_is_directory=True)
print(f"Created symlink: {link} -> {target}")

anonym-g avatar Mar 09 '25 02:03 anonym-g

Well, after the second round of training, I observed a rather strange phenomenon: the model's search efficiency has greatly decreased.

The original model downloaded from the official website gets about 1000+ visits/s across different thread parameters (to be precise, as long as numSearchThreads >= 12), yet the locally trained model gets at most about 800:

.\katago.exe benchmark -visits 1200 -time 10 -config .\25_3_custom.cfg -model .\25_3_Custom_kata1-b28c512nbt-s8326517280-d21260.bin.gz

Ordered summary of results:

numSearchThreads =  5: 10 / 10 positions, visits/s = 449.04 nnEvals/s = 368.59 nnBatches/s = 147.88 avgBatchSize = 2.49 (26.8 secs) (EloDiff baseline)
numSearchThreads = 10: 10 / 10 positions, visits/s = 610.23 nnEvals/s = 505.83 nnBatches/s = 101.92 avgBatchSize = 4.96 (19.8 secs) (EloDiff +103)
numSearchThreads = 12: 10 / 10 positions, visits/s = 695.19 nnEvals/s = 546.71 nnBatches/s = 92.22 avgBatchSize = 5.93 (17.4 secs) (EloDiff +148)
numSearchThreads = 16: 10 / 10 positions, visits/s = 727.92 nnEvals/s = 590.22 nnBatches/s = 74.84 avgBatchSize = 7.89 (16.7 secs) (EloDiff +159)
numSearchThreads = 20: 10 / 10 positions, visits/s = 805.79 nnEvals/s = 652.28 nnBatches/s = 66.26 avgBatchSize = 9.84 (15.1 secs) (EloDiff +191)
numSearchThreads = 24: 10 / 10 positions, visits/s = 790.50 nnEvals/s = 649.93 nnBatches/s = 55.06 avgBatchSize = 11.80 (15.4 secs) (EloDiff +177)
numSearchThreads = 32: 10 / 10 positions, visits/s = 787.00 nnEvals/s = 673.16 nnBatches/s = 42.92 avgBatchSize = 15.68 (15.6 secs) (EloDiff +162)

What might the reason be?

anonym-g avatar Mar 09 '25 07:03 anonym-g