Add support for overlap between paragraphs when using TextChunker to split text

Open • MonsterCoder opened this issue 2 years ago • 1 comment

Motivation and Context

When splitting large text into paragraphs for embedding, it is common to overlap neighboring chunks to preserve semantic context. This pattern can improve semantic search quality.

Description

  1. Added an overlapToken parameter to the TextChunker.SplitPlainTextParagraphs and TextChunker.SplitMarkdownParagraphs functions. It allows up to overlapToken tokens to overlap between neighboring paragraphs.
  2. Refactored some functions to use fewer for loops and fewer side effects.
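
To illustrate the overlap idea in item 1, here is a minimal, language-neutral sketch in Python. It is not the TextChunker implementation from this PR: the function name `chunk_with_overlap`, its parameters `max_tokens` and `overlap_tokens`, and the use of whitespace-separated words as "tokens" are all assumptions made for illustration only.

```python
from typing import List


def chunk_with_overlap(text: str, max_tokens: int, overlap_tokens: int = 0) -> List[str]:
    """Split text into chunks of at most max_tokens tokens, where each chunk
    after the first starts with the last overlap_tokens tokens of the
    previous chunk.

    Assumption: a "token" here is a whitespace-separated word, which is only
    a stand-in for whatever token counting the real chunker uses.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be in the range [0, max_tokens)")

    tokens = text.split()
    # Each new chunk advances by this many tokens, so consecutive chunks
    # share exactly overlap_tokens tokens.
    step = max_tokens - overlap_tokens

    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        # Stop once a chunk has reached the end of the text, so no trailing
        # chunk is emitted that contains only already-seen tokens.
        if start + max_tokens >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_with_overlap("a b c d e f g h", max_tokens=4, overlap_tokens=2):
        print(chunk)
```

With `overlap_tokens=0` the sketch degenerates to plain non-overlapping splitting, which mirrors the behavior before an overlap parameter is introduced.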

Contribution Checklist

  • [x] The code builds clean without any errors or warnings
  • [x] The PR follows SK Contribution Guidelines (https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
  • [x] The code follows the .NET coding conventions (https://learn.microsoft.com/dotnet/csharp/fundamentals/coding-style/coding-conventions) verified with dotnet format
  • [x] All unit tests pass, and I have added new tests where possible
  • [x] I didn't break anyone :smile:

MonsterCoder • May 24 '23 20:05

@microsoft-github-policy-service agree

MonsterCoder • May 24 '23 20:05