langchaingo textsplitter.RecursiveCharacter does not work well

type LocalDoc2Chunk struct{}

func (l *LocalDoc2Chunk) Do(ctx context.Context, req *Doc2ChunkRequest) (*Doc2ChunkResponse, error) {
	c := &textsplitter.RecursiveCharacter{
		Separators:   []string{"\n\n", "\n", "。", "，", ",", "？", "?", "！", "!", "；", ";", ""},
		ChunkSize:    req.ChunkSize,
		ChunkOverlap: 0,
	}
	ret, err := c.SplitText(req.Content)
	if err != nil {
		return nil, err
	}
	return &Doc2ChunkResponse{Chunks: ret, Identity: req.Identity, ChunkSize: req.ChunkSize}, nil
}

The result is very different from what the LangChain Python package produces.

Aug 03 '23 09:08 dl942702882

Can you provide the result you get in python and go and the python code you use?

Aug 09 '23 11:08 FluffyKebab

Version: v0.1.3
The Go code below crashes with "fatal error: stack overflow" because Separators does not include the empty string "", whereas the Python package accepts the same separator list.

func TestChunks2(t *testing.T) {
	logger := log.DefaultLogger
	_ = utils.LoadConfig(logger)
	c := textsplitter.RecursiveCharacter{
		Separators:   []string{"\n\n", "\n", "，", " "},
		ChunkSize:    24,
		ChunkOverlap: 0,
	}
	a, err := c.SplitText(content2)
	if err != nil {
		t.Error(err)
		return
	}
	fmt.Println(len(a))
	for i := range a {
		fmt.Printf("chunk %d: %s\n", i, a[i])
	}
}

var content2 string = `
	天天猫您好，很高，

您好，很高兴为您服务，请问有什么可以帮助您的吗？

`
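
If the stack overflow really does come from the missing empty-string separator, as diagnosed above, one possible workaround is to make sure "" is always the last entry in the list before building the splitter. This is only a sketch: withFallbackSeparator is a hypothetical helper, not part of langchaingo, and it does not address the underlying recursion in the library.

```go
package main

import "fmt"

// withFallbackSeparator is a hypothetical helper (not part of langchaingo).
// It returns a copy of seps that ends with exactly one "" entry, so a
// recursive splitter always has a last-resort separator to fall back on.
func withFallbackSeparator(seps []string) []string {
	out := make([]string, 0, len(seps)+1)
	for _, s := range seps {
		if s != "" { // drop any existing "" so it only appears once, at the end
			out = append(out, s)
		}
	}
	return append(out, "")
}

func main() {
	seps := withFallbackSeparator([]string{"\n\n", "\n", "，", " "})
	fmt.Printf("%q\n", seps)
}
```

The returned slice can then be passed as the Separators field in the test above.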

Here is the Python code:

from langchain.text_splitter import RecursiveCharacterTextSplitter

if __name__ == '__main__':
    a = '''		天天猫您好，很高，

您好，很高兴为您服务，请问有什么可以帮助您的吗？

'''
    text_splitter = RecursiveCharacterTextSplitter(
        separators=['\n\n', '\n', '，', " "],
        chunk_size=6,
        chunk_overlap=0,
        length_function=len
    )
    texts = text_splitter.split_text(a)
    for item in texts:
        print(item)

This is the Python output, which is what I expect:

		天天猫您好
，很高，
您好
，很高兴为您服务
，请问有什么可以帮助您的吗？

Jan 17 '24 03:01 dl942702882

Another problem: the chunk size is calculated with Go's built-in len function, which counts bytes, not characters. In UTF-8 a Chinese character takes 3 to 4 bytes, so the ChunkSize parameter does not behave as expected for Chinese text.
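
To make the difference concrete, here is a small standalone sketch using only the Go standard library. byteAndRuneLen is a hypothetical name chosen for illustration; it is not a langchaingo function.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneLen is a hypothetical helper for illustration. It returns the
// length of s in bytes (what len reports) and in runes (Unicode code points).
func byteAndRuneLen(s string) (bytes int, runes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, r := byteAndRuneLen("您好，很高兴为您服务")
	fmt.Println(b) // 30: each of these characters is 3 bytes in UTF-8
	fmt.Println(r) // 10: the number of characters a reader would count
}
```

With a byte-based length, a chunk size of 24 therefore allows only 8 of these characters per chunk.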

Jan 17 '24 03:01 dl942702882

Hmm yes, we should do a pass and remove code that is utf8-naive.
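
As a rough illustration of what UTF-8-aware code could look like, the sketch below cuts a string into fixed-size pieces counted in runes rather than bytes. It is a deliberately naive fixed-size splitter, not the recursive algorithm and not langchaingo's implementation; splitByRunes is a hypothetical name.

```go
package main

import "fmt"

// splitByRunes is a hypothetical helper for illustration. It cuts s into
// pieces of at most size runes, so multi-byte characters are never split
// in the middle and size means "characters", not "bytes".
func splitByRunes(s string, size int) []string {
	if size <= 0 {
		return nil
	}
	runes := []rune(s) // decode UTF-8 once so indexing is per character
	chunks := make([]string, 0, (len(runes)+size-1)/size)
	for start := 0; start < len(runes); start += size {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
	}
	return chunks
}

func main() {
	for i, c := range splitByRunes("您好，很高兴为您服务", 6) {
		fmt.Printf("%d: %s\n", i, c)
	}
}
```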

Feb 19 '24 08:02 tmc