Python: Text partitioning module#450

Merged
lemillermicrosoft merged 35 commits into microsoft:main from JTremb:feature/pythonPartitioning on Apr 26, 2023

Conversation

@JTremb JTremb commented Apr 14, 2023

Contributor

Motivation and Context

Porting the text partitioning module to Python
(Reopening of PR #427)

Description

  • This adds the text partitioning module in semantic_kernel/semantic_functions/semantic_text_partitioner.py
    and the function extension in semantic_kernel/semantic_functions/function_extension.py
  • Unlike the C# version, the files were added directly to the semantic_functions directory instead of semantic_functions/partitioning, to avoid too many nested directories.

@alexchaomander alexchaomander added the "python" label (Pull requests for the Python Semantic Kernel) Apr 14, 2023

@awharrison-28 awharrison-28 left a comment

Contributor

I like it. Just a couple of overall comments/questions.

  • I think this can just be called text_partitioner. I don't see anything that makes it semantic, i.e. splitting text based on meaning. It looks like it's just based on new lines and/or other common delimiters?
  • Thoughts on calling it text_chunker? We've been using the term chunking quite a bit and this looks like a great start to adding 'chunking' to the core kernel.
  • Wondering if this should go under Skills vs Functions.

@mkarle @alexchaomander @dluc thoughts?
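
To illustrate the point about delimiters, here is a minimal sketch of purely syntactic splitting. It is not the code from this PR: the function name `split_by_delimiters`, the `DELIMITER_TIERS` list, and the character-based limit are all assumptions made for illustration. It tries coarse delimiters first (newlines), then finer ones (sentence punctuation, clause punctuation, whitespace), and never looks at meaning.

```python
import re
from typing import List

# Hypothetical delimiter tiers, tried from coarsest to finest.
DELIMITER_TIERS = [
    r"\n+",              # line breaks
    r"(?<=[.!?])\s+",    # whitespace after sentence-ending punctuation
    r"(?<=[,;:])\s+",    # whitespace after clause punctuation
    r"\s+",              # any whitespace
]


def split_by_delimiters(text: str, max_chars: int, tier: int = 0) -> List[str]:
    """Split text into pieces of at most max_chars using delimiters only."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    if tier >= len(DELIMITER_TIERS):
        # No delimiter left to try: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    pieces: List[str] = []
    for part in re.split(DELIMITER_TIERS[tier], text):
        # Each part is retried with the next, finer delimiter if still too long.
        pieces.extend(split_by_delimiters(part, max_chars, tier + 1))
    return pieces


sample = "First line.\nSecond line is longer. It has two sentences."
print(split_by_delimiters(sample, max_chars=30))
```

Note that this sketch only splits; it does not merge small neighbouring pieces back together, so a long run of text without punctuation ends up split word by word.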

@alexchaomander

Contributor

Agree that this shouldn't need to be called "semantic".

I like the phrase "text chunking", as long as you all think implementing various "Chunkers" sounds good and makes sense.

In the future, we'll want to be able to chunk based on the number of characters, the number of tokens, and with other popular Python libraries like spaCy and NLTK.

Since this will be a common operation for anyone working with data in the SK, I think this can be considered a core SK Function.
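
As a rough sketch of how one chunker could serve both the character-count and token-count cases mentioned above, the size measure can be passed in as a function. Everything here is hypothetical (`pack_into_chunks`, `count_chars`, `count_words`), and the whitespace word count is only a crude stand-in for a real tokenizer.

```python
from typing import Callable, Iterable, List


def count_chars(text: str) -> int:
    """Measure size in characters."""
    return len(text)


def count_words(text: str) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


def pack_into_chunks(
    pieces: Iterable[str],
    budget: int,
    measure: Callable[[str], int] = count_chars,
    joiner: str = " ",
) -> List[str]:
    """Greedily join consecutive pieces until adding one more would exceed the budget."""
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if not piece:
            continue  # skip empty pieces
        candidate = piece if not current else current + joiner + piece
        if current and measure(candidate) > budget:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(pack_into_chunks(["aaa", "bbb", "ccc"], budget=7))
print(pack_into_chunks(["aaa", "bbb", "ccc"], budget=1, measure=count_words))
```

A single piece that is larger than the budget is emitted as its own oversized chunk; a real chunker would need to decide whether to split it further.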

@JTremb

JTremb commented Apr 18, 2023

Contributor Author

Agreed, text_chunker makes sense. There is also #404, which is about extending the chunker, so it's a good idea to align on this terminology.

@awharrison-28

Contributor

@JTremb I'm ready to approve this PR once the terminology change from "semantic text partitioning" to "text chunking" is complete.

@awharrison-28

Contributor
Running `poetry run pre-commit run -c .conf/.pre-commit-config.yaml -a` (which will run flake8, isort, and black) and `poetry run ruff check .` should address your linting errors.

@JTremb

JTremb commented Apr 21, 2023

Contributor Author

@awharrison-28 Great! I renamed the terminology (files, functions) from partition to chunk.
Do you think I should also move the text_chunker to a different package than semantic_kernel.semantic_functions?

As for the linting errors, some are in files that I haven't modified, so I did not push those changes. I can also add them to this PR.

@awharrison-28 awharrison-28 added the "PR: ready to merge" label (PR has been approved by all reviewers, and is ready to merge) Apr 21, 2023
@awharrison-28 awharrison-28 added the "PR: ready to merge" label and removed the "PR: feedback to address" label (Waiting for PR owner to address comments/questions) Apr 26, 2023
@lemillermicrosoft lemillermicrosoft enabled auto-merge (squash) April 26, 2023 22:45
@lemillermicrosoft lemillermicrosoft merged commit 971d2b6 into microsoft:main Apr 26, 2023
adrianwyatt pushed a commit that referenced this pull request Apr 26, 2023
### Motivation and Context
To match changes in #450 

### Description
This pull request renames the SemanticTextPartitioner class to
TextChunker, and updates all references to this class in the code and
documentation. The TextChunker class is responsible for splitting text
into smaller chunks based on tokens and paragraphs, which is useful for
processing large texts or code files. The new name TextChunker reflects
the functionality of the class more clearly and avoids confusion with
the SemanticPartitioner class, which is a different concept that splits
text into semantic segments based on embeddings.

Details:
- Rename SemanticTextPartitioner.cs to TextChunker.cs and update the
class name and namespace accordingly. The TextChunker class is now in
the Text namespace, which is more consistent with its purpose and the
other classes in the namespace.
- Update all references to SemanticTextPartitioner in the code and
documentation to use TextChunker instead. This includes the
ConversationSummarySkill and the DocumentImportController classes, which
use the TextChunker class to split text into sentences and paragraphs
for summarization and import.
- Update the README.md file for the GitHub Repo Q&A Bot sample to
explain how the TextChunker class works with memory and embeddings.
- Make some minor formatting changes to the TextChunker class, such as
adding line breaks and spaces for readability, and removing an unused
using directive.
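
The two-stage flow described above (split into lines, then group lines into paragraphs under a token limit) can be sketched roughly as follows. This is an illustration, not the TextChunker implementation: the function names are hypothetical, and the estimate of about four characters per token is an assumption chosen to keep the example free of a real tokenizer.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Assumption: roughly four characters per token."""
    return len(text) // 4


def split_lines(text: str, max_tokens_per_line: int) -> List[str]:
    """Stage 1: break on newlines, then wrap on spaces any line over the limit."""
    lines: List[str] = []
    for raw in text.splitlines():
        current = ""
        for word in raw.split():
            candidate = f"{current} {word}".strip()
            if current and estimate_tokens(candidate) > max_tokens_per_line:
                lines.append(current)
                current = word
            else:
                current = candidate
        if current:
            lines.append(current)
    return lines


def group_paragraphs(lines: List[str], max_tokens_per_paragraph: int) -> List[str]:
    """Stage 2: join consecutive lines until the paragraph limit would be exceeded."""
    paragraphs: List[str] = []
    current: List[str] = []
    for line in lines:
        candidate = "\n".join(current + [line])
        if current and estimate_tokens(candidate) > max_tokens_per_paragraph:
            paragraphs.append("\n".join(current))
            current = [line]
        else:
            current.append(line)
    if current:
        paragraphs.append("\n".join(current))
    return paragraphs


lines = split_lines("alpha beta gamma delta\nepsilon zeta", max_tokens_per_line=3)
print(group_paragraphs(lines, max_tokens_per_paragraph=6))
```

Keeping the two stages separate lets a caller pick different limits for lines and paragraphs, for example short lines for summarization and larger paragraphs for embedding.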
dluc pushed a commit that referenced this pull request Apr 29, 2023
dluc pushed a commit that referenced this pull request Apr 29, 2023
dehoward pushed a commit to lemillermicrosoft/semantic-kernel that referenced this pull request Jun 1, 2023
dehoward pushed a commit to lemillermicrosoft/semantic-kernel that referenced this pull request Jun 1, 2023
@JTremb JTremb deleted the feature/pythonPartitioning branch July 17, 2023 15:06
golden-aries pushed a commit to golden-aries/semantic-kernel that referenced this pull request Oct 10, 2023
golden-aries pushed a commit to golden-aries/semantic-kernel that referenced this pull request Oct 24, 2023
### Motivation and Context

### Description

1. Fix error in chat route
2. Fix error in import document route
3. Add PATCH to CORS allowed methods

### Contribution Checklist

- [ ] The code builds clean without any errors or warnings
- [ ] The PR follows the [Contribution
Guidelines](https://github.com/microsoft/chat-copilot/blob/main/CONTRIBUTING.md)
and the [pre-submission formatting
script](https://github.com/microsoft/chat-copilot/blob/main/CONTRIBUTING.md#development-scripts)
raises no violations
- [ ] All unit tests pass, and I have added new tests where possible
- [ ] I didn't break anyone 😄
Labels

PR: ready to merge, python

6 participants