
gtpForceMaxNNSize causes TRT backend: Got nonfinite for policy sum

Open kinfkong opened this issue 1 year ago • 2 comments

When the option gtpForceMaxNNSize=true is configured, certain positions trigger an error. For example, with this SGF:

(
  ;GM[1]FF[4]CA[UTF-8]SZ[19]AW[pb][rb][gc][hc][kc][qc][cd][dd][ed][fd][hd][id][kd][he][je][gf][jf][kf][dg][hi][ji][li][mi][kj][lk][bl][cl][il][ll][ql][cm][lm][om][qm][cn][gn][jn][mn][do][ro][dp][jp][np][op][pp][dq][fq][nq][er][mr]AB[fb][kb][ob][bc][cc][dc][fc][ic][jc][pc][gd][jd][ld][pd][ge][ie][ke][le][if][qf][lj][mj][nj][hk][ik][jk][kk][bm][im][jm][km][rm][bn][in][ln][qn][rn][co][ho][po][qo][cp][qp][cq][hq][oq][pq][cr][dr][nr]KM[7.5]
  ;B[lo]ZZID[141]
)

this leads to the Got nonfinite for policy sum error when KataGo is launched with Lizzie or other GPU applications. This appears to be because the TensorRT (TRT) backend does not handle requireExactNNLen correctly.

With the above SGF and the 28bnbt weights, the issue reproduces 100% of the time.

Thanks.

@lightvector @hyln9

kinfkong, Dec 18 '24 17:12

One more piece of information: with gtpForceMaxNNSize=false, everything works fine.
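
To compare the two settings side by side, something like the following Python sketch might help. It is only a rough illustration, not a verified reproduction: the file names (katago, model.bin.gz, gtp.cfg, repro.sgf) are placeholders, the helper names build_repro and run_repro are made up for this example, and the use of genmove w as the command that triggers the evaluation is an assumption (Lizzie itself drives the engine differently). It relies on KataGo's -override-config command-line option and the standard GTP loadsgf command.

```python
import subprocess

# Error message quoted in the report.
ERROR_TEXT = "Got nonfinite for policy sum"


def build_repro(katago="katago", model="model.bin.gz", config="gtp.cfg",
                sgf="repro.sgf", force_max=True):
    """Build the KataGo command line and GTP input for one run.

    All paths are hypothetical placeholders; substitute your own.
    """
    flag = "true" if force_max else "false"
    cmd = [
        katago, "gtp",
        "-model", model,
        "-config", config,
        # Override only the option under test, leaving the config file as-is.
        "-override-config", f"gtpForceMaxNNSize={flag}",
    ]
    # Assumption: loading the SGF and asking for a White move is enough
    # to force a neural net evaluation of the reported position.
    gtp = [f"loadsgf {sgf}", "genmove w", "quit"]
    return cmd, gtp


def run_repro(**kwargs):
    """Run KataGo once and report whether the error text appeared."""
    cmd, gtp = build_repro(**kwargs)
    proc = subprocess.run(cmd, input="\n".join(gtp) + "\n",
                          capture_output=True, text=True)
    return ERROR_TEXT in (proc.stdout + proc.stderr)
```

Calling run_repro(force_max=True) and run_repro(force_max=False) with real paths would then show whether the error appears only with the option enabled, as described above.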

kinfkong, Dec 18 '24 17:12

Thanks for reporting, I'll take a look.

lightvector, Dec 19 '24 15:12