KataGo
gtpForceMaxNNSize causes TRT backend: Got nonfinite for policy sum
When the option gtpForceMaxNNSize=true is configured, certain games cause a failure. Take this SGF, for example:
(
;GM[1]FF[4]CA[UTF-8]SZ[19]AW[pb][rb][gc][hc][kc][qc][cd][dd][ed][fd][hd][id][kd][he][je][gf][jf][kf][dg][hi][ji][li][mi][kj][lk][bl][cl][il][ll][ql][cm][lm][om][qm][cn][gn][jn][mn][do][ro][dp][jp][np][op][pp][dq][fq][nq][er][mr]AB[fb][kb][ob][bc][cc][dc][fc][ic][jc][pc][gd][jd][ld][pd][ge][ie][ke][le][if][qf][lj][mj][nj][hk][ik][jk][kk][bm][im][jm][km][rm][bn][in][ln][qn][rn][co][ho][po][qo][cp][qp][cq][hq][oq][pq][cr][dr][nr]KM[7.5]
;B[lo]ZZID[141]
)
it leads to the Got nonfinite for policy sum error when launched from Lizzie or other GUI applications. This appears to be because the TensorRT (TRT) backend does not handle requireExactNNLen correctly.
With the above SGF and the 28bnbt weights, the issue reproduces 100% of the time.
Thanks.
@lightvector @hyln9
One more piece of information: with gtpForceMaxNNSize=false, everything works fine.
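For anyone trying to reproduce this, the two cases differ only in one line of the config. This is a minimal sketch, assuming the option is set in the GTP config file passed to KataGo; the rest of the config is left unchanged.

```
# Failing case (TensorRT backend): Got nonfinite for policy sum
gtpForceMaxNNSize = true

# Working case: switch the line above to
# gtpForceMaxNNSize = false
```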
Thanks for reporting, I'll take a look.