textsplitter.RecursiveCharacter does not work well
`type LocalDoc2Chunk struct{}

func (l *LocalDoc2Chunk) Do(ctx context.Context, req *Doc2ChunkRequest) (*Doc2ChunkResponse, error) {
	c := &textsplitter.RecursiveCharacter{
		Separators:   []string{"\n\n", "\n", "。", ",", ",", "?", "?", "!", "!", ";", ";", ""},
		ChunkSize:    req.ChunkSize,
		ChunkOverlap: 0,
	}
	ret, err := c.SplitText(req.Content)
	if err != nil {
		return nil, err
	}
	return &Doc2ChunkResponse{Chunks: ret, Identity: req.Identity, ChunkSize: req.ChunkSize}, nil
}`
The result is very different from LangChain's Python implementation.
Can you provide the results you get in Python and in Go, along with the Python code you used?
version: v0.1.3
fatal error: stack overflow. The cause is that `Separators` is missing the empty string "". The Python package handles a separator list without it.
func TestChunks2(t *testing.T) {
	logger := log.DefaultLogger
	_ = utils.LoadConfig(logger)
	c := textsplitter.RecursiveCharacter{
		// Note: no "" fallback separator here; this is what triggers the stack overflow.
		Separators:   []string{"\n\n", "\n", ",", " "},
		ChunkSize:    24,
		ChunkOverlap: 0,
	}
	a, err := c.SplitText(content2)
	if err != nil {
		t.Error(err)
		return
	}
	fmt.Println(len(a))
	for i := range a {
		fmt.Printf("chunk %d: %s\n", i, a[i])
	}
}
var content2 string = `
天天猫您好,很高,
您好,很高兴为您服务,请问有什么可以帮助您的吗?
`
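Until the splitter handles this case itself, one possible workaround for the stack overflow reported above is to make sure the separator list always ends with the empty string before building the splitter. The snippet below is only a sketch under that assumption: `withEmptySeparator` is a hypothetical helper written for this comment, not part of langchaingo, and it relies on the report above that adding "" avoids the crash.

```go
package main

import "fmt"

// withEmptySeparator is a hypothetical helper (not part of langchaingo).
// It returns seps unchanged if it already contains "", otherwise it
// returns a copy with "" appended as the final fallback separator.
func withEmptySeparator(seps []string) []string {
	for _, s := range seps {
		if s == "" {
			return seps
		}
	}
	out := make([]string, 0, len(seps)+1)
	out = append(out, seps...)
	return append(out, "")
}

func main() {
	seps := withEmptySeparator([]string{"\n\n", "\n", ",", " "})
	fmt.Printf("%q\n", seps) // prints ["\n\n" "\n" "," " " ""]
}
```

The result can then be passed as the `Separators` field in the test above. The helper copies the slice instead of appending in place so the caller's slice is never modified.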
Here is the Python code:
from langchain.text_splitter import RecursiveCharacterTextSplitter

if __name__ == '__main__':
    a = ''' 天天猫您好,很高,
您好,很高兴为您服务,请问有什么可以帮助您的吗?
'''
    text_splitter = RecursiveCharacterTextSplitter(
        separators=['\n\n', '\n', ',', " "],
        chunk_size=6,
        chunk_overlap=0,
        length_function=len
    )
    texts = text_splitter.split_text(a)
    for item in texts:
        print(item)
The Python output is below; it works very well:
天天猫您好
,很高,
您好
,很高兴为您服务
,请问有什么可以帮助您的吗?
Another problem is that the chunk size is calculated with Go's `len` function, which counts bytes rather than characters. In UTF-8, a Chinese character takes 3 to 4 bytes, so the chunk size parameter does not behave as expected for Chinese text.
Hmm yes, we should do a pass and remove code that is utf8-naive.
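The byte-versus-character mismatch described above is easy to reproduce in isolation. The sketch below compares Go's built-in `len` with `utf8.RuneCountInString` from the standard library on a two-character string taken from the test content. `runeLen` is a hypothetical name used for illustration only; how the splitter should expose a length function is left open here.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeLen is a hypothetical length function that counts Unicode code
// points instead of bytes, which matches what Python's len does for str.
func runeLen(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	s := "您好"
	fmt.Println(len(s))     // 6: each of these characters is 3 bytes in UTF-8
	fmt.Println(runeLen(s)) // 2: two code points
}
```

With byte counting, a chunk size of 6 fits only two such characters, whereas the Python splitter with `length_function=len` fits six.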