llm-training-data

Here are 9 public repositories matching this topic...

ServiceNow / SyGra

SyGra - Graph-oriented Synthetic data generation Pipeline

python open-source ai multimodality synthetic-data synthetic-dataset-generation dpo image-datasets low-code-no-code llm-datasets llm-framework sft-data llm-training-data

Updated Feb 11, 2026
Python

b2bemaillists / b2b-email-leads-ranking

An open-source collection of datasets, guides, and rankings for B2B email marketing and lead generation. Your go-to resource for sales prospecting strategies.

email-marketing datasets lead-generation cold-email marketing-data b2b-leads llm-training-data b2b-marketing seo-dataset faq-dataset sales-prospecting

Updated Sep 13, 2025

emailmarketingdataset / Open-Email-Marketing-Dataset

This is the Open Email Marketing Dataset; you can use it without any restrictions.

email-marketing lead-generation jsonl gdpr-compliant cold-email marketing-dataset open-dataset llm-training-data b2b-dataset verified-emails seo-dataset

Updated Jul 12, 2025

BlazeWild / Custom_LLM_DataGen_Template

🔧 Modular pipeline for generating high-quality, domain-specific datasets for LLM fine-tuning — from PDFs and web scraping to synthetic Q&A generation, quality filtering, and training-ready formatting.

synthetic-dataset-generation template-generic-repo llm-training finetuning-llms finetuning-large-language-models llama3 llm-training-data lora-fine-tuning

Updated Jul 15, 2025
Python

deepakshroff / Capston-Gemini-ChatBot

👨‍🏫 This project was developed under the guidance of Mr. Lokesh Sir as part of the AI & ML Training Program. It explores LLM integration using Google Gemini APIs with a custom UI built on Streamlit.

api-client llm-training llm-training-data

Updated Jul 27, 2025
Python

AmanPriyanshu / Stratified-LLM-Subsets-100K-1M-Scale

Stratified LLM Subsets delivers diverse training data at 100K-1M scales across pre-training (FineWeb-Edu, Proof-Pile-2), instruction-following (Tulu-3, Orca AgentInstruct), and reasoning distillation (Llama-Nemotron). Embedding-based k-means clustering ensures maximum diversity across 5 high-quality open datasets.
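
The cluster-then-sample idea in that description can be illustrated with a minimal sketch. This is not the repository's code: the function names `kmeans` and `stratified_subset`, the equal per-cluster quota, and the toy two-dimensional "embeddings" are all assumptions made for illustration, and a small pure-Python Lloyd's k-means stands in for whatever clustering library the project actually uses.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns a cluster index per point."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input points (hypothetical choice).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

def stratified_subset(embeddings, k, per_cluster, seed=0):
    """Return sorted indices, drawing up to per_cluster items from each cluster."""
    labels = kmeans(embeddings, k, seed=seed)
    rng = random.Random(seed)
    chosen = []
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        chosen.extend(rng.sample(idx, min(per_cluster, len(idx))))
    return sorted(chosen)

# Toy data: two well-separated groups standing in for document embeddings.
toy_embeddings = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
]
subset = stratified_subset(toy_embeddings, k=2, per_cluster=1)
print(subset)  # one index from each group
```

Sampling an equal quota from every cluster keeps small, unusual regions of the embedding space represented, whereas uniform random sampling would mostly return items from the largest clusters.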