feat: add split_threshold to document splitter to avoid excessively short splits
Related Issues
- fixes #7275
Proposed Changes:
Add a split_threshold parameter to the document splitter. If the last chunk is smaller than the threshold, it is attached to the previous one. This avoids small chunks of 2 or 3 words that carry little meaning in a RAG pipeline.
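To illustrate the intended behavior, here is a minimal standalone sketch. It is not the code from this PR: the function name `split_by_words` and its whitespace-based word splitting are assumptions made for the example, and only the `split_threshold` idea comes from the description above.

```python
from typing import List


def split_by_words(text: str, split_length: int, split_threshold: int = 0) -> List[str]:
    """Hypothetical helper: split text into chunks of `split_length` words.

    If the last chunk has fewer than `split_threshold` words, it is
    appended to the previous chunk instead of being kept on its own.
    """
    words = text.split()
    # Fixed-size word windows; the last one may be shorter.
    chunks = [words[i:i + split_length] for i in range(0, len(words), split_length)]

    # Merge a too-short trailing chunk into its predecessor, if there is one.
    if len(chunks) > 1 and len(chunks[-1]) < split_threshold:
        tail = chunks.pop()
        chunks[-1].extend(tail)

    return [" ".join(chunk) for chunk in chunks]
```

With 12 words, `split_length=5` and `split_threshold=3`, the trailing 2-word chunk is folded into the second chunk, so the result has two chunks of 5 and 7 words. With the default `split_threshold=0`, nothing is merged and the short chunk is kept.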
How did you test it?
I wrote a unit test
Pull Request Test Coverage Report for Build 9233759589
Details
- 0 of 0 changed or added relevant lines in 0 files are covered.
- 1 unchanged line in 1 file lost coverage.
- Overall coverage increased (+0.004%) to 90.597%
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| components/preprocessors/document_splitter.py | 1 | 98.63% |
| Totals | |
|---|---|
| Change from base Build 9225765449: | 0.004% |
| Covered Lines: | 6696 |
| Relevant Lines: | 7391 |
💛 - Coveralls
Good morning @anakin87, I think it should be ready to merge. Can you take a look?
You are welcome, I implemented the minor comments and rebased on the latest main, let me know when you'll merge it :)
Thank you @Halpph, the docstring update looks good to me. I'll merge the PR after testing it locally once more. Great job!