Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.
Updated Dec 23, 2025 - Python
A collection of handy Bash One-Liners and terminal tricks for data processing and Linux system maintenance.
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
Select, put and delete data from JSON, TOML, YAML, XML, INI, HCL and CSV files with a single tool. Also available as a go mod.
Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
Data transformation framework for AI. Ultra performant, with incremental processing. 🌟 Star if you like it!
A lightweight data processing framework built on DuckDB and 3FS.
A light-weight, flexible, and expressive statistical data testing library
Concurrent and multi-stage data ingestion and data processing with Elixir
Kubernetes-native platform to run massively parallel data/streaming jobs
Large-scale pretraining for dialogue
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/
Python Stream Processing
Easy Data Preparation with latest LLMs-based Operators and Pipelines.
Extract Transform Load for Python 3.5+
Concurrent Python made simple
Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
Data and tools for generating and inspecting OLMo pre-training data.
Scalable data preprocessing and curation toolkit for LLMs