semantic-kernel
Add support for overlap between paragraphs when using TextChunker to split text
Motivation and Context
When splitting large text into paragraphs for embedding, it is common to overlap neighbouring chunks so that semantic context is preserved across chunk boundaries. This pattern can improve semantic search quality.
Description
- Added an `overlapToken` parameter to the `TextChunker.SplitPlainTextParagraphs` and `TextChunker.SplitMarkdownParagraphs` functions. It allows up to `overlapToken` tokens to be shared between neighbouring paragraphs.
- Refactored some functions to use fewer for loops and fewer side effects.
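
To illustrate the overlap idea in a language-neutral way, here is a minimal Python sketch. It is not the code in this PR: the function name `split_with_overlap`, its parameters, and the whitespace tokenization are assumptions made for illustration only, and the actual `TextChunker` may count tokens and place boundaries differently.

```python
def split_with_overlap(text: str, max_tokens: int, overlap_tokens: int = 0) -> list[str]:
    """Split text into chunks of at most max_tokens whitespace-separated tokens.

    Each chunk after the first starts with the last overlap_tokens tokens
    of the previous chunk. Hypothetical helper, not part of Semantic Kernel.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be in [0, max_tokens)")

    tokens = text.split()  # assumption: one token per whitespace-separated word
    step = max_tokens - overlap_tokens  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in split_with_overlap("a b c d e f g h", max_tokens=4, overlap_tokens=1):
        print(chunk)
```

With `max_tokens=4` and `overlap_tokens=1`, the window advances three tokens at a time, so the last token of each chunk reappears as the first token of the next one.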
Contribution Checklist
- [x] The code builds clean without any errors or warnings
- [x] The PR follows SK Contribution Guidelines (https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
- [x] The code follows the .NET coding conventions (https://learn.microsoft.com/dotnet/csharp/fundamentals/coding-style/coding-conventions) verified with `dotnet format`
- [x] All unit tests pass, and I have added new tests where possible
- [x] I didn't break anyone :smile:
@microsoft-github-policy-service agree