https://paperswithcode.com/paper/detect-order-construct-a-tree-construction
https://github.com/jannisborn/paperscraper
https://github.com/VikParuchuri/marker
https://github.com/jakespringer/echo-embeddings/blob/master/README.md
https://huggingface.co/blog/how-to-train-sentence-transformers
https://www.mercity.ai/blog-post/classify-long-texts-with-bert
https://github.com/desgeeko/pdfsyntax/
https://github.com/aminya/tocPDF
https://huggingface.co/McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-supervised/discussions
https://github.com/weaviate/recipes/blob/main/integrations/dspy/2.Writing-Blog-Posts-with-DSPy.ipynb
For LLM evaluations, one tool I like is UpTrain. (via Rohan Paul)
https://github.com/EleutherAI/lm-evaluation-harness
https://openai.com/research/summarizing-books
https://github.com/oobabooga/text-generation-webui/wiki/05-%E2%80%90-Training-Tab
https://github.com/huggingface/alignment-handbook
https://github.com/jondurbin/bagel
https://www.datacamp.com/tutorial/fine-tuning-llama-2
https://github.com/bublint/ue5-llama-lora?tab=readme-ov-file
https://zohaib.me/a-beginners-guide-to-fine-tuning-llm-using-lora/
- Surya: OCR and line detection in 90+ languages
- Tonic Validate x LlamaIndex: Implementing integration tests for LlamaIndex
- Building a Full-Stack Complex PDF AI chatbot w/ R.A.G (Llama Index)
- Mastering PDFs: Extracting Sections, Headings, Paragraphs, and Tables with Cutting-Edge Parser
- https://github.com/AymenKallala/RAG_Maestro
- AirbyteLoader: Airbyte is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes, with the largest catalog of ELT connectors.
- https://grobid.readthedocs.io/en/latest/Grobid-service/#use-grobid-test-console
- https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb
- https://github.com/titipata/scipdf_parser
- Semantic Chunking: Splits the text based on semantic similarity.
- Chunk Visualizer
- https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker
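The semantic-chunking idea above can be sketched in a few lines: embed each sentence and start a new chunk wherever the similarity between adjacent sentences drops below a threshold. A real pipeline would use a sentence-embedding model (e.g. via LangChain's SemanticChunker or sentence-transformers); here a simple bag-of-words vector stands in so the sketch stays dependency-free.

```python
# Minimal semantic-chunking sketch: split text where embedding similarity
# between adjacent sentences drops. The bag-of-words "embedding" below is a
# stand-in; swap in a real sentence-embedding model for actual use.
import math
import re
from collections import Counter

def embed(sentence: str) -> Counter:
    # Stand-in embedding: lowercase bag-of-words counts.
    return Counter(re.findall(r"\w+", sentence.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:
            # Topic shift detected: close the current chunk.
            chunks.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks

text = ("Transformers process tokens in parallel. Transformers use attention over tokens. "
        "Bread rises when yeast ferments. Yeast needs warmth to ferment bread.")
print(semantic_chunks(text))
```

With the toy embedding, the two transformer sentences group together and the two baking sentences form a second chunk.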
- Take company docs and chunk them. Ask Ollama to create questions based on the chunks. TDM (e/λ)
Ask Ollama to answer those questions based on the chunks (done separately to ensure verbose answers).
Use any embedding model to verify that each answer is genuinely related to its chunk,
plus maybe some keyword matches.
Pending step to remove further hallucinations (ask GPT-4/Mistral-medium to rate it from 1 to 5?).
With that, a decent dataset is acquired.
- LLaMA-Factory: Unify Efficient Fine-tuning of 100+ LLMs
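The question-generation workflow above can be sketched as a small pipeline. Note the hedges: `toy_llm` is a stand-in for a real client call (e.g. Ollama's REST API), and the word-overlap check is a cheap proxy for the embedding-similarity and keyword checks described above.

```python
# Hedged sketch of the Q&A dataset workflow: for each chunk, ask an LLM to
# write a question and then answer it, keeping only pairs whose answer stays
# grounded in the source chunk.
import re

def overlap(a: str, b: str) -> float:
    # Jaccard similarity over word sets; swap in embedding cosine similarity.
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def build_qa_dataset(chunks, llm, min_sim=0.2):
    dataset = []
    for chunk in chunks:
        q = llm(f"Write one question answered by this text:\n{chunk}")
        a = llm(f"Answer the question using only this text:\n{chunk}\nQ: {q}")
        # Keep the pair only if the answer relates to the chunk; a further
        # LLM-as-judge rating (1-5) could be added here to cut hallucinations.
        if overlap(a, chunk) >= min_sim:
            dataset.append({"question": q, "answer": a, "chunk": chunk})
    return dataset

def toy_llm(prompt: str) -> str:
    # Offline stand-in: echoes the text after the instruction line.
    return prompt.split("\n", 1)[1]

chunks = [
    "LoRA adds small low-rank adapter matrices to frozen model weights.",
    "QLoRA combines 4-bit quantization with LoRA adapters.",
]
print(len(build_qa_dataset(chunks, toy_llm)))
```

Generating questions and answers in two separate calls, as the note suggests, tends to produce more verbose answers than asking for both at once.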
- Democratizing LLMs: 4-bit Quantization for Optimal LLM Inference: A deep dive into model quantization with GGUF and llama.cpp and model evaluation with LlamaIndex
- 🦎 LazyAxolotl (from @maximelabonne's Large Language Model Course): a notebook for fine-tuning LLMs using Axolotl and RunPod. It can also use LLM AutoEval to automatically evaluate the trained model on Nous' benchmark suite. Axolotl YAML configurations (SFT or DPO) are available on GitHub and Hugging Face.
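For reference, a minimal Axolotl SFT config with QLoRA looks roughly like this (a sketch adapted from Axolotl's published example configs; verify field names against the repo's examples before use):

```yaml
base_model: NousResearch/Llama-2-7b-hf
load_in_4bit: true        # QLoRA: 4-bit base weights
adapter: qlora

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true  # apply LoRA to all linear layers

datasets:
  - path: mhenrichsen/alpaca_2k_test   # swap in your own Q&A dataset
    type: alpaca
val_set_size: 0.05

sequence_len: 4096
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002
optimizer: adamw_bnb_8bit
lr_scheduler: cosine

bf16: auto
gradient_checkpointing: true
output_dir: ./qlora-out
```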
- Fine-Tuning Pretrained Models
- https://github.com/yuhuixu1993/qa-lora
- https://github.com/modelscope/swift?tab=readme-ov-file#-getting-started
- ludwig-ai/ludwig#3814
- https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing
- https://github.com/unslothai/unsloth
- https://github.com/OpenAccess-AI-Collective/axolotl
- https://github.com/ggerganov/llama.cpp/tree/master/examples/finetune
- https://towardsdatascience.com/fine-tune-a-mistral-7b-model-with-direct-preference-optimization-708042745aac
- https://towardsdatascience.com/democratizing-llms-4-bit-quantization-for-optimal-llm-inference-be30cf4e0e34
- https://www.interconnects.ai/p/llm-synthetic-data
- https://github.com/Lightning-AI/litgpt?ref=zohaib.me
- https://kaitchup.substack.com/p/lora-adapters-when-a-naive-merge