semantic-kernel icon indicating copy to clipboard operation
semantic-kernel copied to clipboard

Python: Text partitioning module

Open JTremb opened this issue 2 years ago • 7 comments

Motivation and Context

Porting the text partitioning module to python (Reopening of PR #427 )

Description

  • This is adding the Text partitioning module in semantic_kernel/semantic_functions/semantic_text_partitioner.py and the function_extention in semantic_kernel/semantic_functions/function_extension.py
  • Compared to the C# version the files were added directly into the semantic_functions directory instead of the semantic_functions/partitioning to not have too many nested directories.

Contribution Checklist

  • [x] The code builds clean without any errors or warnings
  • [x] The PR follows SK Contribution Guidelines (https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
  • [x] The code follows the .NET coding conventions (https://learn.microsoft.com/dotnet/csharp/fundamentals/coding-style/coding-conventions) verified with dotnet format
  • [x] All unit tests pass, and I have added new tests where possible
  • [x] I didn't break anyone :smile:

JTremb avatar Apr 14 '23 13:04 JTremb

Agree that this shouldn't need to be called "semantic".

I like the phrase "text chunking". As long as you all think implementing various "Chunkers" sounds good and makes sense.

In the future, we'll want to be able to chunk based off the number of characters, the number of tokens, and by other popular Python libraries like Spacy and NLTK.

Since this will be a common operation for anyone working with data in the SK, I think this can be considered a core SK Function.

alexchaomander avatar Apr 18 '23 22:04 alexchaomander

Agreed. text_chunker make sense, there is also #404 referring to extend the chunker, good idea to align on this terminology.

JTremb avatar Apr 18 '23 23:04 JTremb

@JTremb I'm ready to approve this PR once the terminology from "semantic text partitioning" -> "text chunking" is complete

awharrison-28 avatar Apr 20 '23 22:04 awharrison-28

poetry run pre-commit run -c .conf/.pre-commit-config.yaml -a will run flake8, isort, and black
poetry run ruff check .

should address your linting errors.

awharrison-28 avatar Apr 20 '23 23:04 awharrison-28

@awharrison-28 Great! I did the renaming of the terminology (files, functions) partition -> chunk. Do you think I should also move the text_chunker in a different package than semantic_kernel.semantic_functions ?

For the linting errors , there's also files that I haven't modified so I did not push those changes. I can also add them in this PR.

JTremb avatar Apr 21 '23 00:04 JTremb

Agree that this shouldn't need to be called "semantic".

I like the phrase "text chunking". As long as you all think implementing various "Chunkers" sounds good and makes sense.

In the future, we'll want to be able to chunk based off the number of characters, the number of tokens, and by other popular Python libraries like Spacy and NLTK.

Since this will be a common operation for anyone working with data in the SK, I think this can be considered a core SK Function.

@alexchaomander Can you help make sure changes made here are also made re: dotnet\src\SemanticKernel\SemanticFunctions\Partitioning\SemanticTextPartitioner.cs in dotnet kernel?

lemillermicrosoft avatar Apr 21 '23 19:04 lemillermicrosoft

The terminology oracle agrees to use the name: Text Chunker. Let's update both codebases

dluc avatar Apr 21 '23 21:04 dluc