langchaingo

textsplitter.RecursiveCharacter does not work well

Open dl942702882 opened this issue 2 years ago • 4 comments

type LocalDoc2Chunk struct{}

func (l *LocalDoc2Chunk) Do(ctx context.Context, req *Doc2ChunkRequest) (*Doc2ChunkResponse, error) {
	c := &textsplitter.RecursiveCharacter{
		Separators:   []string{"\n\n", "\n", "。", ",", ",", "?", "?", "!", "!", ";", ";", ""},
		ChunkSize:    req.ChunkSize,
		ChunkOverlap: 0,
	}
	ret, err := c.SplitText(req.Content)
	if err != nil {
		return nil, err
	}
	return &Doc2ChunkResponse{Chunks: ret, Identity: req.Identity, ChunkSize: req.ChunkSize}, nil
}

The result is very different from what LangChain for Python produces.

dl942702882, Aug 03 '23 09:08

Can you provide the result you get in python and go and the python code you use?

FluffyKebab, Aug 09 '23 11:08

Version: v0.1.3
The Go test below fails with "fatal error: stack overflow". The cause is that Separators is missing the empty string ""; the Python package handles the same separator list without it.

func TestChunks2(t *testing.T) {
	logger := log.DefaultLogger
	_ = utils.LoadConfig(logger)
	c := textsplitter.RecursiveCharacter{
		Separators:   []string{"\n\n", "\n", ",", " "},
		ChunkSize:    24,
		ChunkOverlap: 0,
	}
	a, err := c.SplitText(content2)
	if err != nil {
		t.Error(err)
		return
	}
	fmt.Println(len(a))
	for i := range a {
		fmt.Printf("chunk %d: %s\n", i, a[i])
	}
}

var content2 string = `
	天天猫您好,很高,

您好,很高兴为您服务,请问有什么可以帮助您的吗?

`
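
Until the splitter handles this case itself, one possible workaround suggested by the report above is to make sure the separator list contains the empty string before it is assigned to Separators. This is a minimal sketch, assuming the stack overflow comes only from the missing "" fallback; `withFallbackSeparator` is a hypothetical helper and not part of langchaingo:

```go
package main

import "fmt"

// withFallbackSeparator is a hypothetical helper (not part of langchaingo).
// It returns a copy of seps that is guaranteed to contain "" as a fallback,
// appending it at the end when it is missing.
func withFallbackSeparator(seps []string) []string {
	out := make([]string, 0, len(seps)+1)
	out = append(out, seps...)
	for _, s := range seps {
		if s == "" {
			return out // fallback already present, nothing to add
		}
	}
	return append(out, "")
}

func main() {
	seps := withFallbackSeparator([]string{"\n\n", "\n", ",", " "})
	fmt.Printf("%q\n", seps) // prints ["\n\n" "\n" "," " " ""]
}
```

The returned slice can then be used as the Separators value in the test above.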

Here is the Python code:

from langchain.text_splitter import RecursiveCharacterTextSplitter

if __name__ == '__main__':
    a = '''		天天猫您好,很高,

您好,很高兴为您服务,请问有什么可以帮助您的吗?

'''
    text_splitter = RecursiveCharacterTextSplitter(
        separators=['\n\n', '\n', ',', " "],
        chunk_size=6,
        chunk_overlap=0,
        length_function=len
    )
    texts = text_splitter.split_text(a)
    for item in texts:
        print(item)

This is the Python output; it works very well:

		天天猫您好
,很高,
您好
,很高兴为您服务
,请问有什么可以帮助您的吗?


dl942702882, Jan 17 '24 03:01

Another problem: the chunk size is calculated with Go's built-in len function, but len counts bytes, not characters. In UTF-8, one Chinese character takes 3 to 4 bytes, so the ChunkSize parameter does not behave as expected for Chinese text.

dl942702882, Jan 17 '24 03:01
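
To make the byte-versus-character point concrete, here is a small sketch comparing `len` with `utf8.RuneCountInString` from the standard library on a fragment of the sample text. `byteAndRuneLen` is a hypothetical helper written for this illustration only:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneLen is a hypothetical helper for illustration.
// It returns the length of s in bytes and in Unicode code points (runes).
func byteAndRuneLen(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, r := byteAndRuneLen("很高兴为您服务")
	fmt.Println(b) // 21: each of these characters is 3 bytes in UTF-8
	fmt.Println(r) // 7: the number of characters a user would count
}
```

With a byte-based length, a ChunkSize of 24 would therefore hold only 8 such characters.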

> Another problem: the chunk size is calculated with Go's built-in len function, but len counts bytes, not characters. In UTF-8, one Chinese character takes 3 to 4 bytes, so the ChunkSize parameter does not behave as expected for Chinese text.

Hmm yes, we should do a pass and remove code that is utf8-naive.

tmc, Feb 19 '24 08:02
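
As a rough illustration of what UTF-8-aware code could look like, the sketch below cuts a string into pieces measured in runes rather than bytes, so a multi-byte character is never split. It is not the recursive algorithm and not how langchaingo implements splitting; `splitByRunes` is a hypothetical helper:

```go
package main

import "fmt"

// splitByRunes is a hypothetical helper for illustration. It cuts s into
// pieces of at most n runes, so no multi-byte UTF-8 sequence is broken.
func splitByRunes(s string, n int) []string {
	if n <= 0 {
		return nil
	}
	runes := []rune(s) // decode once so indexing is per character
	var chunks []string
	for len(runes) > 0 {
		end := n
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[:end]))
		runes = runes[end:]
	}
	return chunks
}

func main() {
	for i, c := range splitByRunes("请问有什么可以帮助您的吗", 6) {
		fmt.Printf("%d: %s\n", i, c)
	}
}
```

Slicing the same string by byte offsets (for example `s[:8]`) could end in the middle of a character and produce invalid UTF-8, which is the kind of behavior a cleanup pass would need to look for.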