SyGra - Graph-oriented Synthetic data generation Pipeline
Updated Feb 11, 2026 - Python
An open-source collection of datasets, guides, and rankings for B2B email marketing and lead generation. Your go-to resource for sales prospecting strategies.
The Open Email Marketing Dataset follows; you can use it without any restrictions.
🔧 Modular pipeline for generating high-quality, domain-specific datasets for LLM fine-tuning — from PDFs and web scraping to synthetic Q&A generation, quality filtering, and training-ready formatting.
👨‍🏫 This project was developed under the guidance of Mr. Lokesh Sir as part of the AI & ML Training Program. It explores LLM integration using Google Gemini APIs with a custom UI built on Streamlit.
Stratified LLM Subsets delivers diverse training data at 100K-1M scales across pre-training (FineWeb-Edu, Proof-Pile-2), instruction-following (Tulu-3, Orca AgentInstruct), and reasoning distillation (Llama-Nemotron). Embedding-based k-means clustering ensures maximum diversity across 5 high-quality open datasets.
Sample edition of The Stack Enriched, an annotated, secure, and optimized code dataset.
High-quality dataset containing ~2k lines of synthetically generated LLM training examples for spontaneous observations