Python: Text partitioning module
Motivation and Context
Porting the text partitioning module to python (Reopening of PR #427 )
Description
- This is adding the Text partitioning module in
semantic_kernel/semantic_functions/semantic_text_partitioner.pyand the function_extention insemantic_kernel/semantic_functions/function_extension.py - Compared to the C# version the files were added directly into the
semantic_functionsdirectory instead of thesemantic_functions/partitioningto not have too many nested directories.
Contribution Checklist
- [x] The code builds clean without any errors or warnings
- [x] The PR follows SK Contribution Guidelines (https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
- [x] The code follows the .NET coding conventions (https://learn.microsoft.com/dotnet/csharp/fundamentals/coding-style/coding-conventions) verified with
dotnet format - [x] All unit tests pass, and I have added new tests where possible
- [x] I didn't break anyone :smile:
Agree that this shouldn't need to be called "semantic".
I like the phrase "text chunking". As long as you all think implementing various "Chunkers" sounds good and makes sense.
In the future, we'll want to be able to chunk based off the number of characters, the number of tokens, and by other popular Python libraries like Spacy and NLTK.
Since this will be a common operation for anyone working with data in the SK, I think this can be considered a core SK Function.
Agreed. text_chunker make sense, there is also #404 referring to extend the chunker, good idea to align on this terminology.
@JTremb I'm ready to approve this PR once the terminology from "semantic text partitioning" -> "text chunking" is complete
poetry run pre-commit run -c .conf/.pre-commit-config.yaml -a will run flake8, isort, and black
poetry run ruff check .
should address your linting errors.
@awharrison-28 Great! I did the renaming of the terminology (files, functions) partition -> chunk.
Do you think I should also move the text_chunker in a different package than semantic_kernel.semantic_functions ?
For the linting errors , there's also files that I haven't modified so I did not push those changes. I can also add them in this PR.
Agree that this shouldn't need to be called "semantic".
I like the phrase "text chunking". As long as you all think implementing various "Chunkers" sounds good and makes sense.
In the future, we'll want to be able to chunk based off the number of characters, the number of tokens, and by other popular Python libraries like Spacy and NLTK.
Since this will be a common operation for anyone working with data in the SK, I think this can be considered a core SK Function.
@alexchaomander Can you help make sure changes made here are also made re: dotnet\src\SemanticKernel\SemanticFunctions\Partitioning\SemanticTextPartitioner.cs in dotnet kernel?
The terminology oracle agrees to use the name: Text Chunker. Let's update both codebases