Python: Text partitioning module by JTremb · Pull Request #450 · microsoft/semantic-kernel

JTremb · 2023-04-14T13:12:48Z

Motivation and Context

Porting the text partitioning module to Python
(Reopening of PR #427)

Description

This adds the text partitioning module in semantic_kernel/semantic_functions/semantic_text_partitioner.py
and the function extension in semantic_kernel/semantic_functions/function_extension.py.
Unlike the C# version, the files were added directly to the semantic_functions directory rather than to semantic_functions/partitioning, to avoid too many nested directories.

Contribution Checklist

The code builds clean without any errors or warnings
The PR follows SK Contribution Guidelines (https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
The code follows the .NET coding conventions (https://learn.microsoft.com/dotnet/csharp/fundamentals/coding-style/coding-conventions) verified with dotnet format
All unit tests pass, and I have added new tests where possible
I didn't break anyone 😄

awharrison-28

I like it. Just a couple of overall comments/questions.

I think this can just be called text_partitioner. I don't see anything that makes it semantic - splitting text based on meaning. It looks like it's just based on newlines and/or other common delimiters?
Thoughts on calling it text_chunker? We've been using the term chunking quite a bit and this looks like a great start to adding 'chunking' to the core kernel.
Wondering if this should go under Skills vs Functions.

@mkarle @alexchaomander @dluc thoughts?

alexchaomander · 2023-04-18T22:36:18Z

Agree that this shouldn't need to be called "semantic".

I like the phrase "text chunking". As long as you all think implementing various "Chunkers" sounds good and makes sense.

In the future, we'll want to be able to chunk based on the number of characters, the number of tokens, and with other popular Python libraries like spaCy and NLTK.

Since this will be a common operation for anyone working with data in the SK, I think this can be considered a core SK Function.
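To make the "chunk by number of characters" idea concrete, here is a minimal sketch. It is not the code from this PR: the function name `chunk_by_chars` and its parameters are hypothetical, and it simply packs whitespace-separated words into pieces that stay within a character budget.

```python
from typing import List


def chunk_by_chars(text: str, max_chars: int) -> List[str]:
    """Split text into chunks of at most max_chars characters.

    Hypothetical helper for illustration only; breaks at whitespace
    where possible and hard-splits words longer than the budget.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: List[str] = []
    current = ""
    for word in text.split():
        # A single word longer than the budget is cut into fixed-size pieces.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]

        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding the word would exceed the budget: close the chunk.
            chunks.append(current)
            current = word

    if current:
        chunks.append(current)
    return chunks


print(chunk_by_chars("the quick brown fox", 10))
```

A token-based variant would presumably keep the same loop and replace `len(...)` with a tokenizer-backed counter.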

JTremb · 2023-04-18T23:43:34Z

Agreed. text_chunker makes sense. There is also #404, which refers to extending the chunker, so it's a good idea to align on this terminology.

awharrison-28 · 2023-04-20T22:40:24Z

@JTremb I'm ready to approve this PR once the terminology change from "semantic text partitioning" -> "text chunking" is complete

awharrison-28 · 2023-04-20T23:59:12Z

poetry run pre-commit run -c .conf/.pre-commit-config.yaml -a will run flake8, isort, and black
poetry run ruff check .

should address your linting errors.

JTremb · 2023-04-21T00:01:48Z

@awharrison-28 Great! I renamed the terminology (files, functions) partition -> chunk.
Do you think I should also move text_chunker to a different package than semantic_kernel.semantic_functions?

As for the linting errors, some are in files that I haven't modified, so I did not push those changes. I can also add them to this PR.

### Motivation and Context
To match changes in #450

### Description
This pull request renames the SemanticTextPartitioner class to TextChunker, and updates all references to this class in the code and documentation. The TextChunker class is responsible for splitting text into smaller chunks based on tokens and paragraphs, which is useful for processing large texts or code files. The new name TextChunker reflects the functionality of the class more clearly and avoids confusion with the SemanticPartitioner class, which is a different concept that splits text into semantic segments based on embeddings.

Details:
- Rename SemanticTextPartitioner.cs to TextChunker.cs and update the class name and namespace accordingly. The TextChunker class is now in the Text namespace, which is more consistent with its purpose and the other classes in the namespace.
- Update all references to SemanticTextPartitioner in the code and documentation to use TextChunker instead. This includes the ConversationSummarySkill and the DocumentImportController classes, which use the TextChunker class to split text into sentences and paragraphs for summarization and import.
- Update the README.md file for the GitHub Repo Q&A Bot sample to explain how the TextChunker class works with memory and embeddings.
- Make some minor formatting changes to the TextChunker class, such as adding line breaks and spaces for readability, and removing an unused using directive.
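As a rough illustration of splitting "based on tokens and paragraphs", the sketch below groups already-split lines into paragraphs under a token budget. It is an assumption-laden example, not the TextChunker implementation: the names `count_tokens` and `group_lines_into_paragraphs` are hypothetical, and whitespace-separated words stand in for real tokens.

```python
from typing import List


def count_tokens(text: str) -> int:
    # Stand-in token counter (assumption): one token per whitespace-separated word.
    return len(text.split())


def group_lines_into_paragraphs(lines: List[str], max_tokens: int) -> List[str]:
    """Greedily pack consecutive lines into paragraphs of at most max_tokens.

    A single line that exceeds the budget on its own is kept whole as
    its own paragraph; this sketch does not split it further.
    """
    paragraphs: List[str] = []
    current: List[str] = []
    current_tokens = 0

    for line in lines:
        n = count_tokens(line)
        # Close the current paragraph if this line would overflow it.
        if current and current_tokens + n > max_tokens:
            paragraphs.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += n

    if current:
        paragraphs.append("\n".join(current))
    return paragraphs


lines = ["one two three", "four five", "six seven eight nine", "ten"]
for paragraph in group_lines_into_paragraphs(lines, max_tokens=5):
    print(repr(paragraph))
```

Greedy packing keeps the original line order and never reorders text, which matters when the chunks are later summarized or embedded in sequence.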

JTremb added 2 commits April 14, 2023 08:47

Adding text partitioner and function extension that aggregates the partitioned results

4ada900

adding semantic text partioner and funtion extension tests

29875cd

alexchaomander added the "python" label (Pull requests for the Python Semantic Kernel) Apr 14, 2023

dluc and others added 2 commits April 14, 2023 12:50

Merge branch 'main' into feature/pythonPartitioning

98fb70d

Merge branch 'main' into feature/pythonPartitioning

46bdf95

awharrison-28 reviewed Apr 18, 2023

JTremb and others added 5 commits April 20, 2023 09:18

Merge branch 'main' into feature/pythonPartitioning

07e1df8

updating for the new way of instantiating the Kernel()

61f4183

formatting

d7fb4fe

replacing a line-ending with os.linesep to make sure tests pass in windows

b5028be

Merge branch 'main' into feature/pythonPartitioning

c484bf8

awharrison-28 and others added 6 commits April 20, 2023 15:40

Merge branch 'main' into feature/pythonPartitioning

2da6f48

renaming semantic_test_partitioner -> text_chunker

f500237

fix small issue during the renaming

75dcc71

Merge branch 'feature/pythonPartitioning' of https://github.com/JTremb/semantic-kernel into feature/pythonPartitioning

272342b

missed some comments and function names to during the renaming of partition -> chunk

2f5b3ce

Merge branch 'main' into feature/pythonPartitioning

766e4f6

JTremb added 7 commits April 20, 2023 20:02

formatting

4a22010

Merge branch 'feature/pythonPartitioning' of https://github.com/JTremb/semantic-kernel into feature/pythonPartitioning

792fe5e

isort

9d13878

moving text_chunker to semantic_kernel.text

1b7c91c

moving text_chunker unit test to tests.unit.text

a31eb9f

adding the missing __init__.py in semantic_kernel.text module

02b72b2

adding the missing __init__.py in semantic_kernel.text module

76fd81c

awharrison-28 added the "PR: ready to merge" label (PR has been approved by all reviewers, and is ready to merge.) Apr 21, 2023

awharrison-28 and others added 5 commits April 21, 2023 17:48

Merge branch 'main' into feature/pythonPartitioning

8968302

Merge branch 'main' into feature/pythonPartitioning

54bf0e0

Merge branch 'main' into feature/pythonPartitioning

08f3ab3

Merge branch 'main' into feature/pythonPartitioning

b7e3621

fixing unit test with the renaming of backend module

74d43bc

JTremb dismissed awharrison-28’s stale review via 74d43bc April 26, 2023 13:36

formatting

be6c3fc

lemillermicrosoft assigned shawncal Apr 26, 2023

JTremb and others added 2 commits April 26, 2023 11:27

Merge branch 'main' into feature/pythonPartitioning

09a655d

Merge branch 'main' into feature/pythonPartitioning

325c44f

awharrison-28 requested a review from lemillermicrosoft April 26, 2023 17:40

awharrison-28 approved these changes Apr 26, 2023

awharrison-28 added the "PR: ready to merge" label (PR has been approved by all reviewers, and is ready to merge.) and removed the "PR: feedback to address" label (Waiting for PR owner to address comments/questions) Apr 26, 2023

awharrison-28 and others added 3 commits April 26, 2023 14:01

Merge branch 'main' into feature/pythonPartitioning

52a2926

Merge branch 'main' into feature/pythonPartitioning

11c9b87

Merge branch 'main' into feature/pythonPartitioning

5d843fb

lemillermicrosoft approved these changes Apr 26, 2023

lemillermicrosoft enabled auto-merge (squash) April 26, 2023 22:45

lemillermicrosoft merged commit 971d2b6 into microsoft:main Apr 26, 2023

lemillermicrosoft mentioned this pull request Apr 26, 2023

Rename TextChunker from SemanticTextPartitioner #683

Merged

5 tasks

alexchaomander mentioned this pull request Apr 27, 2023

.NET: Create specialized TextChunker plugin #688

Closed

JTremb deleted the feature/pythonPartitioning branch July 17, 2023 15:06

lemillermicrosoft merged 35 commits into microsoft:main from JTremb:feature/pythonPartitioning

6 participants
