[C4GT] Document Uploader ( Text chunking into paragraphs + content retrieval ) #78

Closed
9 of 26 tasks
ChakshuGautam opened this issue May 15, 2023 · 11 comments

@ChakshuGautam
Collaborator

ChakshuGautam commented May 15, 2023

Project Details

AI Toolchain is a collection of tools for quickly building and deploying machine learning models for various use cases. Currently, the toolchain includes a text translation model, and more models may be added in the future. Similar to Hugging Face, it abstracts away the messy details of how a model works and exposes a clean API that you can orchestrate at the BFF (backend-for-frontend) level.

Features to be implemented

The idea is to implement a document uploader API that is async and returns the embeddings for chunks of that document. It should save the data for a short period until the user asks for the download. This data can then be uploaded by the user wherever they have a search engine. The current problem statement doesn't cover this.

How it works

Extract the text from the PDF file. Split the extracted text into chunks based on the cosine distance between the embeddings of neighbouring text segments. For each chunk, create vector embeddings using an Instructor model.
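The chunking step can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `embed` is a toy bag-of-words stand-in for a real embedding model such as Instructor, and the function names and the `threshold` value of 0.6 are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 if norm == 0 else 1.0 - dot / norm


def chunk_sentences(sentences: list[str], threshold: float = 0.6) -> list[list[str]]:
    """Start a new chunk whenever adjacent sentences are farther apart than threshold."""
    chunks: list[list[str]] = []
    previous = None
    for sentence in sentences:
        vector = embed(sentence)
        if previous is None or cosine_distance(previous, vector) > threshold:
            chunks.append([sentence])
        else:
            chunks[-1].append(sentence)
        previous = vector
    return chunks


sentences = [
    "The upload API accepts a PDF file.",
    "The API returns a document id for the PDF file.",
    "Rice farmers should irrigate the field weekly.",
    "Farmers should drain the rice field before harvest.",
]
chunks = chunk_sentences(sentences)  # two chunks: API sentences, farming sentences
```

With a real embedding model the threshold would need tuning on sample documents, since distances between dense vectors are distributed differently from these word-count vectors.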

Create APIs to upload the following document Types

  • PDF
  • Audio (transcription)
  • Video (transcription)

Behavior of Upload API

  • It takes a PDF file and uploads it to our database.
  • API returns a document id in response. For future calls, this document id should be used. Each document id maps to an index containing embeddings.
  • If you are indexing multiple documents, then pass document ids accordingly.
    Taken from here
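A rough sketch of the upload behaviour described above, using an in-memory store in place of a real database. The class and method names (`DocumentStore`, `upload`, `add_embeddings`, `indexes_for`) are hypothetical and only illustrate the document-id-to-index mapping.

```python
import uuid


class DocumentStore:
    """In-memory stand-in for the database behind the upload API (hypothetical)."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}
        self._indexes: dict[str, list[list[float]]] = {}

    def upload(self, pdf_bytes: bytes) -> str:
        """Store the raw file and return a new document id for future calls."""
        document_id = uuid.uuid4().hex
        self._files[document_id] = pdf_bytes
        self._indexes[document_id] = []  # filled later by the embedding job
        return document_id

    def add_embeddings(self, document_id: str, vectors: list[list[float]]) -> None:
        """Append chunk embeddings to the index that belongs to this document."""
        self._indexes[document_id].extend(vectors)

    def indexes_for(self, document_ids: list[str]) -> dict[str, list[list[float]]]:
        """Look up the index of each requested document id."""
        return {doc_id: self._indexes[doc_id] for doc_id in document_ids}


store = DocumentStore()
first_id = store.upload(b"%PDF-1.4 first")
second_id = store.upload(b"%PDF-1.4 second")
store.add_embeddings(first_id, [[0.1, 0.2]])
selected = store.indexes_for([first_id, second_id])
```

Passing several document ids to `indexes_for` mirrors the "pass document ids accordingly" behaviour for multi-document indexing.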

File Status API

  • This API is used to check the status of file upload.
  • It returns status and document id.
  • Possible values for status are yet_to_start, in_progress, completed, and failed
  • If the embeddings for a document are successfully created and indexed, then completed is returned.
    Taken from here
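The four status values and the response shape could be modelled along these lines. This is only a sketch: `Status`, `file_status`, `run_indexing`, and the `index_fn` callback are names assumed for the example, not part of the toolchain.

```python
from enum import Enum


class Status(str, Enum):
    YET_TO_START = "yet_to_start"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"


def file_status(jobs: dict, document_id: str) -> dict:
    """Return the payload described above: the status plus the document id."""
    return {"document_id": document_id, "status": jobs[document_id].value}


def run_indexing(jobs: dict, document_id: str, index_fn) -> None:
    """Move a job through its states; index_fn is a hypothetical embed-and-index step."""
    jobs[document_id] = Status.IN_PROGRESS
    try:
        index_fn(document_id)
    except Exception:
        jobs[document_id] = Status.FAILED
    else:
        # Only reached when embeddings were created and indexed without error.
        jobs[document_id] = Status.COMPLETED


def broken_index(document_id: str) -> None:
    raise RuntimeError("embedding model unavailable")


jobs = {"doc-1": Status.YET_TO_START, "doc-2": Status.YET_TO_START}
before = file_status(jobs, "doc-1")
run_indexing(jobs, "doc-1", lambda document_id: None)
run_indexing(jobs, "doc-2", broken_index)
```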

Chunking

Sample pdfs:

https://drive.google.com/drive/u/0/folders/1sAsuh-EFH-xmFYrxzhmj0VRUZYNzsyLw

OpenAI Embedding Alternatives

Learning Path

Complexity

Medium

Skills Required

Python, Knowledge of HuggingFace Transformers, NLP.

Name of Mentors:

@GautamR-Samagra

Project size

8 Weeks

Product Set Up

See the setup here

Acceptance Criteria

  • Unit Test Cases
  • e2e Test Cases
  • OpenAPI Spec/Postman Collection
  • Dockerfile for this module

Milestone

Every document type supported is a milestone.

Reference

  1. Gist with basic implementation
  2. LLM Town

C4GT

This issue is nominated for Code for GovTech (C4GT) 2023 edition.
C4GT is India's first annual coding program to create a community that can build and contribute to global Digital Public Goods. If you want to use Open Source GovTech to create impact, then this is the opportunity for you! More about C4GT here: https://codeforgovtech.in/


The scope of this ticket has now expanded to make it the 'content processing' part of 'FAQ bot'.
The FAQ bot allows a user to provide content in the form of CSVs, free text, PDFs, audio, and video, and the bot adds it to a 'Content DB'. The user can then interact with the bot via text or speech on related content, and the bot identifies relevant content using RAG techniques and responds to the user in a conversational manner.

This ticket covers the content processing part of the bot. It includes the following tasks in its scope:

@ajitg25

ajitg25 commented May 18, 2023

I understand the problem statement as taking the transcriptions and storing their embeddings in the database. I would like to contribute to this issue. Please assign it to me!

@chandra-pro

I have good knowledge of NLP and I also understand your problem, so I would like to contribute to this issue. Could you please assign it to me?

@Dhruv88

Dhruv88 commented May 19, 2023

I have worked on a similar problem statement earlier. We were given paragraphs on several topics, a question on a specific topic was asked, and we had to retrieve the answer to that query from the given paragraphs. Our solution was to convert the paragraphs into embeddings using a Hugging Face transformer model and index the embeddings with the FAISS library. For a question, we computed its embedding and retrieved the closest paragraph embeddings from the index using cosine similarity. The code notebook is linked for reference: click here

We used retrieval followed by question answering to solve the problem.

Thus, I think I can work on converting the above code into a proper API as required by the project.
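The retrieval step described in this comment can be sketched in a few lines. Note the substitutions: this uses a toy bag-of-words `embed` instead of a Hugging Face transformer model, and a brute-force scan instead of a FAISS index, so it only illustrates the cosine-similarity ranking. All function names are assumptions for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stands in for transformer embeddings)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 0.0 if norm == 0 else dot / norm


def retrieve(question: str, paragraphs: list[str], top_k: int = 1) -> list[str]:
    """Rank paragraphs by cosine similarity to the question (brute force, no index)."""
    query = embed(question)
    ranked = sorted(
        paragraphs,
        key=lambda paragraph: cosine_similarity(query, embed(paragraph)),
        reverse=True,
    )
    return ranked[:top_k]


paragraphs = [
    "The monsoon season in India usually begins in June.",
    "FAISS is a library for similarity search over dense vectors.",
    "A document id maps to an index that contains embeddings.",
]
best = retrieve("Which library does similarity search over vectors?", paragraphs)
```

The retrieved paragraph would then be handed to a question-answering model, which is outside the scope of this sketch.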

@ajitg25

ajitg25 commented May 19, 2023

I have built a FastAPI endpoint to upload the PDF file and extract the text, as described in the "Gist with basic implementation". I have implemented the requirements under "Behavior of Upload API". Please review it.

@Gautam-Rajeev
Collaborator

The approach initially suggested was to create a window over the embeddings and check for any sharp changes in them.

However, we are no longer sure that a change in the similarity score is a good enough signal: a paragraph may contain information about a variety of things, and this approach would then split it into different chunks.

I think what will be required is:
  • A benchmark model using GPT that takes all the text in a page and creates chunks out of it.
  • Exploring a topic-modelling-style approach that performs some kind of hierarchical clustering of a page into identified topics.

Some sample PDFs are provided here. A simple test can be done on a page and we can see if the text extracted is getting chunked into the same paragraphs as in the pdf.

@ajitg25

ajitg25 commented May 31, 2023

Okay sir. Currently I am dividing the page into chunks and then creating the embeddings. So what I need to do now is first divide the content of the pages by topic and then create the embeddings. Have I understood correctly, sir?

I will explore the PDFs you have attached.

@Gautam-Rajeev
Collaborator

Gautam-Rajeev commented Jun 1, 2023

That is correct.

Potential flow for solving this could be :

  • Pick a PDF from the ones provided in the folders; a good example seems to be https://drive.google.com/drive/u/0/folders/1sAsuh-EFH-xmFYrxzhmj0VRUZYNzsyLw.
  • Decide on an evaluation metric. For the above PDF, the text has neat paragraphs that can be considered chunks. Create a test set by parsing the PDF according to its headings to get the paragraphs.
  • Use various methods to extract text from the PDF and chunk it into paragraphs such that each chunk covers a similar topic.
  • Measure the accuracy of the chunking for various chunking methods by comparing your chunks vs the pdf paragraphs.
  • Store the chunks in a CSV/DB.
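One possible way to score a chunking method against the PDF's own paragraphs, as outlined in the flow above, is to compare chunk boundaries. The sketch below assumes both the reference paragraphs and the predicted chunks are lists of sentences in the same order; the metric choice (boundary precision, recall, F1) and the function names are assumptions, since the thread leaves the evaluation metric open.

```python
def boundaries(chunks: list[list[str]]) -> set[int]:
    """Sentence offsets at which one chunk ends and the next begins."""
    cuts, position = set(), 0
    for chunk in chunks[:-1]:
        position += len(chunk)
        cuts.add(position)
    return cuts


def boundary_scores(predicted: list[list[str]], reference: list[list[str]]) -> dict:
    """Precision, recall and F1 of predicted chunk boundaries against the reference."""
    pred, ref = boundaries(predicted), boundaries(reference)
    hits = len(pred & ref)
    # Convention: an empty boundary set counts as perfect on its own side.
    precision = hits / len(pred) if pred else 1.0
    recall = hits / len(ref) if ref else 1.0
    total = precision + recall
    f1 = 0.0 if total == 0 else 2 * precision * recall / total
    return {"precision": precision, "recall": recall, "f1": f1}


reference = [["s1", "s2"], ["s3", "s4"], ["s5"]]  # paragraphs parsed from the PDF
predicted = [["s1", "s2"], ["s3"], ["s4", "s5"]]  # output of a chunking method
scores = boundary_scores(predicted, reference)
```

Exact boundary matching is strict; a tolerance of one sentence on either side could be added if near-misses should count.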

Next steps:

  • Generate tags for each chunk that can be searched for various user questions/prompts.
  • Embed the tags using any vector embeddings
  • Integrate the setup within a vector DB

@ajitg25

ajitg25 commented Jun 3, 2023

Ok sir

@Codecreatermunesh

I have been working on this project since 20 May. I looked at many projects, but in the end I chose to understand everything related to this problem statement, and I am submitting a proposal only for this one. I have good knowledge of NLP and have been learning about Hugging Face Transformers for the past week. I am interested in this project. I have been doing machine learning for the past year, so I also have good knowledge of Python.

@Sanchariii

I have worked on a more or less similar project before, using a Hugging Face transformer model. I would like to contribute to this project.

@notinrange

Hello @GautamR-Samagra Sir, I would like to contribute to the development of the document uploader API within the AI Toolchain, to help streamline document processing, embedding generation, and indexing for enhanced machine learning workflows.

Gautam-Rajeev changed the title from "[C4GT] Document Uploader" to "[C4GT] Document Uploader ( Text chunking into paragraphs + content retrieval )" on Jun 25, 2023

9 participants