feat: add split_threshold to document splitter to avoid excessively short splits

Open Halpph opened this issue 1 year ago • 3 comments

Related Issues

Proposed Changes:

Pass a split_threshold to the document splitter. If the last chunk is smaller than the threshold, it is attached to the previous one. This avoids small chunks of 2 or 3 words that carry no useful meaning in a RAG pipeline.
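
To make the proposed behaviour concrete, here is a minimal sketch of the merge rule applied to word-based splitting. It is not the code from this PR: the function name `split_by_words` and its signature are hypothetical, and it assumes the threshold is measured in words and only the final chunk is checked.

```python
from typing import List


def split_by_words(text: str, split_length: int, split_threshold: int = 0) -> List[str]:
    """Split text into chunks of `split_length` words (illustrative helper).

    If the final chunk has fewer than `split_threshold` words, it is
    appended to the previous chunk instead of being emitted on its own.
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")

    words = text.split()
    # Fixed-size word windows; the last window may be shorter than the rest.
    chunks = [words[i:i + split_length] for i in range(0, len(words), split_length)]

    # Merge only when there is a previous chunk to attach the short tail to.
    if len(chunks) > 1 and len(chunks[-1]) < split_threshold:
        tail = chunks.pop()
        chunks[-1].extend(tail)

    return [" ".join(chunk) for chunk in chunks]
```

With `split_length=3` and `split_threshold=2`, a seven-word input yields two chunks rather than three, because the one-word tail is folded into the second chunk. A threshold of 0 leaves the plain splitting behaviour unchanged.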

How did you test it?

I wrote a unit test

Halpph avatar May 21 '24 15:05 Halpph

Pull Request Test Coverage Report for Build 9233759589

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.004%) to 90.597%

Files with Coverage Reduction | New Missed Lines | %
components/preprocessors/document_splitter.py | 1 | 98.63%

Totals Coverage Status
Change from base Build 9225765449: 0.004%
Covered Lines: 6696
Relevant Lines: 7391

💛 - Coveralls

coveralls avatar May 22 '24 07:05 coveralls

Good morning @anakin87, I think it should be ready for merge, can you take a look?

Halpph avatar May 22 '24 08:05 Halpph

You are welcome, I implemented the minor comments and rebased on the latest main, let me know when you'll merge it :)

Halpph avatar May 23 '24 13:05 Halpph

Thank you @Halpph, the docstring update looks good to me. I'll merge the PR once I have tested it locally once more. Great job!

julian-risch avatar May 24 '24 14:05 julian-risch