panic: fatal error: concurrent map writes

Open ZeroYuJie opened this issue 2 years ago • 2 comments

I got the error "panic: concurrent map writes" from the BPE TokenizeWithCache func. It reads and writes the cache map without any synchronization, so calling it from several goroutines at once can panic.

func (b BPE) TokenizeWithCache(sequence string) (retVal []tokenizer.Token) {

	if hit, ok := b.Cache.cmap[sequence]; ok {
		return b.WordToTokens(hit)
	} else {
		word := b.MergeWord(sequence)
		retVal = b.WordToTokens(*word)
		if b.Cache != nil {
			b.Cache.SetValues([]CacheItem{
				{sequence, *word},
			})
		}
		return retVal
	}
}
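One detail that may be easy to miss: the method has a value receiver, so each call works on a copy of the BPE struct, yet the copies still share one cache. The sketch below illustrates this with hypothetical stand-in types (model, cache and store are not the library's names), assuming Cache is a pointer field, which the nil check above suggests.

```go
package main

import "fmt"

// cache and model are hypothetical stand-ins, not the library's types.
type cache struct {
	cmap map[string]int
}

type model struct {
	Cache *cache // pointer field: copying the struct copies only the address
}

// store has a value receiver, so m is a copy of the caller's struct,
// but m.Cache still points at the caller's cache and its map.
func (m model) store(key string, v int) {
	m.Cache.cmap[key] = v
}

func main() {
	m := model{Cache: &cache{cmap: map[string]int{}}}
	m.store("hello", 1)
	// The write made through the copy is visible through the original.
	fmt.Println(len(m.Cache.cmap)) // prints 1
}
```

So a value receiver alone gives no isolation between goroutines: every call ends up writing to the same underlying map.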

Please take a look.

ZeroYuJie, Aug 28 '23 14:08

@ZeroYuJie,

Please share the error log detail and example how to replicate. Thanks!

sugarme, Aug 29 '23 21:08

@sugarme I hit this in a multi-goroutine test. I use the function below to initialize the model tokenizer, store the result in a global variable, and then call it from many goroutines. The code looks like this:

func OfflineLLMTokenizerInit(modelName string) (*tokenizer.Tokenizer, error) {
	configFile, err := tokenizer.CachedPath(modelName, "tokenizer.json")
	if err != nil {
		return nil, err
	}
	tk, err := pretrained.FromFile(configFile)
	if err != nil {
		return nil, err
	}
	return tk, nil
}

var tk *tokenizer.Tokenizer

func main() {
	tk, _ = OfflineLLMTokenizerInit("NousResearch/Redmond-Puffin-13B")
	benchNum := 10000
	for i := 0; i < benchNum; i++ {
		go func(number int) {
			// random string of length 1000
			input := random.RandString(1000)
			encoderSingle, _ := tk.EncodeSingle(input, false)
			println(fmt.Sprintf("routine=%d,%s,len=%d", number, input, len(encoderSingle.Tokens)))
		}(i)
	}
	time.Sleep(time.Minute)
}

It then panics. [image: panic output] The stack trace: [image: stack trace]

I think the cause is the cache map b.Cache.cmap, which is read and written from several goroutines without synchronization. Either cmap should be a sync.Map, or this cache should be removed.
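To make the sync.Map idea concrete, here is a minimal sketch of a cache that is safe to share between goroutines. SafeCache and its Get, Set and Len methods are hypothetical names for illustration, not the library's API, and the []string value type is a placeholder for whatever the real cache stores.

```go
package main

import (
	"fmt"
	"sync"
)

// SafeCache is a hypothetical stand-in for the tokenizer cache.
// Its zero value is ready to use.
type SafeCache struct {
	m sync.Map // maps string -> []string
}

// Get returns the cached value for key, if any.
func (c *SafeCache) Get(key string) ([]string, bool) {
	v, ok := c.m.Load(key)
	if !ok {
		return nil, false
	}
	return v.([]string), true
}

// Set stores value under key; safe to call from many goroutines.
func (c *SafeCache) Set(key string, value []string) {
	c.m.Store(key, value)
}

// Len counts the entries by walking the map.
func (c *SafeCache) Len() int {
	n := 0
	c.m.Range(func(_, _ interface{}) bool {
		n++
		return true
	})
	return n
}

func main() {
	cache := &SafeCache{}
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			key := fmt.Sprintf("word-%d", n%50)
			// Two goroutines may both miss and both store the same key;
			// that is redundant work but not a data race.
			if _, ok := cache.Get(key); !ok {
				cache.Set(key, []string{key})
			}
		}(i)
	}
	wg.Wait()
	fmt.Println("cached entries:", cache.Len()) // prints "cached entries: 50"
}
```

A plain map guarded by a sync.RWMutex is another common option. Which one performs better depends on the read/write mix, so it would be worth benchmarking before picking one.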

ZeroYuJie, Aug 30 '23 06:08