
Problem encountered while exporting model weights, with a solution

Open FengmingGo opened this issue 10 months ago • 5 comments

While exporting model weights today, I encountered the following error:

PS E:\selfplay\KataGo\python> python ./export_model_pytorch.py -checkpoint "E:\selfplay\train\checkpoint.ckpt" -export-dir E:\selfplay\models -filename-prefix b1c6nbt -model-name b1c6nbt
['./export_model_pytorch.py', '-checkpoint', 'E:\selfplay\train\checkpoint.ckpt', '-export-dir', 'E:\selfplay\models', '-filename-prefix', 'b1c6nbt', '-model-name', 'b1c6nbt']
Traceback (most recent call last):
  File "E:\selfplay\KataGo\python\export_model_pytorch.py", line 461, in <module>
    main(args)
    ~~~~^^^^^^
  File "E:\selfplay\KataGo\python\export_model_pytorch.py", line 65, in main
    model, swa_model, other_state_dict = load_model(checkpoint_file, use_swa, device="cpu", verbose=True)
                                         ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\selfplay\KataGo\python\load_model.py", line 37, in load_model
    state_dict = torch.load(checkpoint_file,map_location="cpu")
  File "C:\Users\15606\AppData\Local\Programs\Python\Python313\Lib\site-packages\torch\serialization.py", line 1470, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
    (1) In PyTorch 2.6, we changed the default value of the weights_only argument in torch.load from False to True. Re-running torch.load with weights_only set to False will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
    (2) Alternatively, to load with weights_only=True please check the recommended steps in the following error message.
    WeightsUnpickler error: Unsupported global: GLOBAL collections.defaultdict was not an allowed global by default. Please use torch.serialization.add_safe_globals([defaultdict]) or the torch.serialization.safe_globals([defaultdict]) context manager to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
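To illustrate why the checkpoint is rejected, here is a minimal sketch of an allowlisting unpickler built only on the standard library `pickle` module. It is an analogue of the behaviour the error message describes, not PyTorch's actual implementation; the names `AllowlistUnpickler` and `restricted_loads` are made up for this example.

```python
import io
import pickle
from collections import defaultdict


class AllowlistUnpickler(pickle.Unpickler):
    """Hypothetical unpickler that resolves only explicitly allowed globals."""

    def __init__(self, file, allowed=()):
        super().__init__(file)
        # Index allowed objects by the (module, name) pair pickle looks up.
        self._allowed = {(obj.__module__, obj.__qualname__): obj for obj in allowed}

    def find_class(self, module, name):
        # pickle calls this for every global referenced by the stream.
        try:
            return self._allowed[(module, name)]
        except KeyError:
            raise pickle.UnpicklingError(
                f"Unsupported global: {module}.{name}"
            ) from None


def restricted_loads(data, allowed=()):
    """Unpickle `data`, permitting only the globals listed in `allowed`."""
    return AllowlistUnpickler(io.BytesIO(data), allowed).load()


# A defaultdict(float) references two globals: the class and its factory.
payload = pickle.dumps(defaultdict(float, {"loss": 1.5}))

try:
    restricted_loads(payload)
except pickle.UnpicklingError as err:
    print("rejected:", err)

# Allowlisting both globals lets the same payload load.
print(restricted_loads(payload, allowed=[defaultdict, float]))
```

In this sketch the payload fails until both `defaultdict` and `float` are allowed, which mirrors the two-entry allowlist proposed in this issue.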

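How to solve this problem:

Add the following two calls to export_model_pytorch.py so that they run before the checkpoint is loaded (defaultdict must be imported from collections):

torch.serialization.add_safe_globals([defaultdict])
torch.serialization.add_safe_globals([float])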
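As a rough sketch of how this fix could be packaged, the helper below registers both types in one call. The function name `allow_checkpoint_globals` is hypothetical and not part of KataGo, and the `getattr` guard is an assumption added so the helper does nothing on PyTorch releases that lack `add_safe_globals`.

```python
from collections import defaultdict


def allow_checkpoint_globals():
    """Allowlist the types named in this issue for weights_only loading.

    Returns True if the allowlist was registered, False if the installed
    PyTorch has no torch.serialization.add_safe_globals.
    """
    # Imported lazily so this sketch can be imported without PyTorch installed.
    import torch

    add_safe_globals = getattr(torch.serialization, "add_safe_globals", None)
    if add_safe_globals is None:
        # Releases without this API also predate the weights_only=True
        # default that the error message attributes to PyTorch 2.6.
        return False

    # Both entries come from the fix reported in this issue.
    add_safe_globals([defaultdict, float])
    return True
```

Calling `allow_checkpoint_globals()` once near the top of the script, before `load_model` runs, should be enough because the allowlist is process-wide. Only allowlist types for checkpoints you trust, as the error message warns.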

FengmingGo, Mar 22 '25 09:03

After this change, no further errors occurred and the weight file was written successfully.

FengmingGo, Mar 22 '25 09:03

@lightvector

FengmingGo, Mar 22 '25 10:03

Thanks for the report. It looks like PyTorch 2.6 was released this year; I'll see about incorporating this fix.

lightvector, Mar 23 '25 12:03

Not only does export_model_pytorch.py have this bug, but train.py does as well. The solution is the same as mentioned above.

['./train.py', '-traindir', 'E:\selfplay\train\b2c16', '-datadir', 'E:\selfplay\trainingdata\train2', '-exportdir', 'E:\selfplay\export', '-exportprefix', 'b2c16', '-pos-len', '19', '-batch-size', '128', '-model-kind', 'b2c16nbt', '-samples-per-epoch', '60000', '-swa-period-samples', '80000', '-quit-if-no-data', '-no-repeat-files', '-lr-scale', '8', '-export-prob', '1']
Using GPU device: NVIDIA GeForce GTX 1660 SUPER
Seeding torch with 25451309904342131
Traceback (most recent call last):
  File "E:\selfplay\KataGo\python\train.py", line 1373, in <module>
    main(rank, world_size, args, multi_gpu_device_ids, readpipes, writepipes, barrier)
    ~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\selfplay\KataGo\python\train.py", line 534, in main
    (model_config, ddp_model, raw_model, swa_model, optimizer, metrics_obj, running_metrics, train_state, last_val_metrics) = load()
                                                                                                                              ~~~~^^
  File "E:\selfplay\KataGo\python\train.py", line 468, in load
    state_dict = torch.load(path_to_load_from, map_location=device)
  File "C:\Users\15606\AppData\Local\Programs\Python\Python313\Lib\site-packages\torch\serialization.py", line 1470, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
    (1) In PyTorch 2.6, we changed the default value of the weights_only argument in torch.load from False to True. Re-running torch.load with weights_only set to False will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
    (2) Alternatively, to load with weights_only=True please check the recommended steps in the following error message.
    WeightsUnpickler error: Unsupported global: GLOBAL collections.defaultdict was not an allowed global by default. Please use torch.serialization.add_safe_globals([defaultdict]) or the torch.serialization.safe_globals([defaultdict]) context manager to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
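The error message also names a context-manager form, `torch.serialization.safe_globals`, which limits the allowlist to a single load rather than the whole process. A minimal sketch, assuming a PyTorch release that provides it; the wrapper name `load_checkpoint_scoped` is hypothetical and not part of KataGo:

```python
from collections import defaultdict


def load_checkpoint_scoped(path, map_location="cpu"):
    """Load a trusted checkpoint with the allowlist active only for this call."""
    # Imported lazily so this sketch can be imported without PyTorch installed.
    import torch

    # defaultdict and float are the two types reported in this issue.
    with torch.serialization.safe_globals([defaultdict, float]):
        return torch.load(path, map_location=map_location, weights_only=True)
```

In train.py this would stand in for the `torch.load(path_to_load_from, map_location=device)` call shown in the traceback, for example `state_dict = load_checkpoint_scoped(path_to_load_from, map_location=device)`.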

@lightvector

FengmingGo, Mar 23 '25 14:03

Thanks.

FengmingGo, Mar 23 '25 14:03