Python: Text partitioning module #450
awharrison-28
left a comment
I like it. Just a couple of overall comments/questions.
- I think this can just be called text_partitioner. I don't see anything that makes it semantic, i.e. splitting text based on meaning. It looks like it's just based on new lines and/or other common delimiters?
- Thoughts on calling it text_chunker? We've been using the term chunking quite a bit and this looks like a great start to adding 'chunking' to the core kernel.
- Wondering if this should go under Skills vs Functions.
@mkarle @alexchaomander @dluc thoughts?
Agree that this shouldn't need to be called "semantic". I like the phrase "text chunking", as long as you all think implementing various "Chunkers" sounds good and makes sense. In the future, we'll want to be able to chunk based on the number of characters, the number of tokens, and with other popular Python libraries like spaCy and NLTK. Since this will be a common operation for anyone working with data in SK, I think this can be considered a core SK Function.
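To make the "chunk by number of characters" idea concrete, here is a minimal sketch. The function name `chunk_by_chars`, its parameters, and the default delimiter set are assumptions made for illustration; none of them come from this PR or from the SK codebase. It cuts after the last delimiter that fits in the window and falls back to a hard cut when no delimiter is found.

```python
def chunk_by_chars(text: str, max_chars: int, delimiters: str = "\n.!? ") -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Hypothetical helper (not part of SK): cuts right after the last
    delimiter found in the current window, or makes a hard cut at
    max_chars when the window contains no usable delimiter.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Position of the last delimiter of any kind inside the window.
            cut = max(text.rfind(d, start, end) for d in delimiters)
            if cut > start:
                end = cut + 1  # keep the delimiter with the left-hand chunk
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end
    return chunks


if __name__ == "__main__":
    for piece in chunk_by_chars("Hello world. This is a test.", 15):
        print(repr(piece))
```

A token-based variant could reuse the same loop with a token counter in place of character offsets, and a spaCy or NLTK variant could replace the delimiter search with that library's sentence splitter.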
Agreed.
@JTremb I'm ready to approve this PR once the terminology change from "semantic text partitioning" -> "text chunking" is complete.
…b/semantic-kernel into feature/pythonPartitioning
should address your linting errors.
@awharrison-28 Great! I renamed the terminology (files, functions) from partition -> chunk. As for the linting errors, some of them are in files that I haven't modified, so I did not push those changes. I can also add them in this PR.
### Motivation and Context

To match changes in #450

### Description

This pull request renames the SemanticTextPartitioner class to TextChunker, and updates all references to this class in the code and documentation. The TextChunker class is responsible for splitting text into smaller chunks based on tokens and paragraphs, which is useful for processing large texts or code files. The new name TextChunker reflects the functionality of the class more clearly and avoids confusion with the SemanticPartitioner class, which is a different concept that splits text into semantic segments based on embeddings.

Details:

- Rename SemanticTextPartitioner.cs to TextChunker.cs and update the class name and namespace accordingly. The TextChunker class is now in the Text namespace, which is more consistent with its purpose and the other classes in the namespace.
- Update all references to SemanticTextPartitioner in the code and documentation to use TextChunker instead. This includes the ConversationSummarySkill and the DocumentImportController classes, which use the TextChunker class to split text into sentences and paragraphs for summarization and import.
- Update the README.md file for the GitHub Repo Q&A Bot sample to explain how the TextChunker class works with memory and embeddings.
- Make some minor formatting changes to the TextChunker class, such as adding line breaks and spaces for readability, and removing an unused using directive.
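As a rough illustration of splitting "based on tokens and paragraphs", the sketch below greedily packs already-split lines into paragraphs under a token budget. It is not the TextChunker implementation: `count_tokens` and `group_into_paragraphs` are hypothetical names, and counting whitespace-separated words is an assumed stand-in for a real tokenizer.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())


def group_into_paragraphs(lines: list[str], max_tokens: int) -> list[str]:
    """Greedily pack consecutive lines into paragraphs of at most max_tokens.

    Hypothetical helper: a single line that already exceeds the budget is
    emitted as its own paragraph rather than being split further.
    """
    paragraphs = []
    current = []
    current_tokens = 0
    for line in lines:
        n = count_tokens(line)
        # Flush the running paragraph if this line would overflow the budget.
        if current and current_tokens + n > max_tokens:
            paragraphs.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += n
    if current:
        paragraphs.append("\n".join(current))
    return paragraphs


if __name__ == "__main__":
    sample = ["one two three", "four five", "six seven eight nine", "ten"]
    for paragraph in group_into_paragraphs(sample, max_tokens=5):
        print(repr(paragraph))
```

Keeping the token counter as a separate function means it can be swapped for a model-specific tokenizer without touching the grouping logic.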
### Motivation and Context

Porting the text partitioning module to Python (reopening of PR #427).

### Description

- This adds the text partitioning module in `semantic_kernel/semantic_functions/semantic_text_partitioner.py` and the function extension in `semantic_kernel/semantic_functions/function_extension.py`.
- Compared to the C# version, the files were added directly to the `semantic_functions` directory instead of `semantic_functions/partitioning`, to avoid too many nested directories.
### Description

1. Fix error in chat route
2. Fix error in import document route
3. Add PATCH to CORS allowed methods

### Contribution Checklist

- [ ] The code builds clean without any errors or warnings
- [ ] The PR follows the [Contribution Guidelines](https://github.com/microsoft/chat-copilot/blob/main/CONTRIBUTING.md) and the [pre-submission formatting script](https://github.com/microsoft/chat-copilot/blob/main/CONTRIBUTING.md#development-scripts) raises no violations
- [ ] All unit tests pass, and I have added new tests where possible
- [ ] I didn't break anyone 😄