🚀 Data Engineering
BigQuery data source for Apache Spark: read data from BigQuery into DataFrames and write DataFrames back into BigQuery tables.
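A minimal sketch of a read/write round trip, assuming the spark-bigquery connector jar is available to the Spark session (e.g. via `--packages`); the staging bucket and output table are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-example").getOrCreate()

# Read a public BigQuery table into a DataFrame.
df = (spark.read.format("bigquery")
      .load("bigquery-public-data.samples.shakespeare"))

words = df.groupBy("word").sum("word_count")

# Write back to BigQuery; the indirect write method stages data in GCS first.
(words.write.format("bigquery")
      .option("temporaryGcsBucket", "my-staging-bucket")  # placeholder bucket
      .mode("overwrite")
      .save("my_dataset.shakespeare_word_counts"))        # placeholder table

spark.stop()
```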
A curated repository of links to everything you'd want to learn about data engineering.
Apache Beam is a unified programming model for batch and streaming data processing.
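A minimal Beam pipeline sketch using the Python SDK's direct runner; the same transforms would apply unchanged to an unbounded (streaming) source such as Pub/Sub:

```python
import apache_beam as beam

# A tiny batch pipeline: Map transforms work the same on bounded and
# unbounded PCollections, which is what makes the model "unified".
with beam.Pipeline() as pipeline:
    (pipeline
     | "Create" >> beam.Create(["alpha", "beta", "gamma"])
     | "Lengths" >> beam.Map(lambda word: (word, len(word)))
     | "Print" >> beam.Map(print))
```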
🧙 Mage: build, run, and manage data pipelines for integrating and transforming data.
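Assuming this is Mage (mage-ai), whose description this matches, a minimal sketch of a data-loader block following the guard-style decorator import Mage generates; the function body is a placeholder:

```python
# In Mage, pipeline steps are Python "blocks"; the tool injects decorators
# at runtime, hence the guard-style import seen in generated blocks.
if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader

import pandas as pd


@data_loader
def load_orders(*args, **kwargs) -> pd.DataFrame:
    # Placeholder source; a real block might call an API or a warehouse.
    return pd.DataFrame({"order_id": [1, 2], "amount": [9.99, 24.50]})
```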
Dagster: an orchestration platform for the development, production, and observation of data assets.
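A minimal sketch of Dagster's asset API: two software-defined assets, where the downstream dependency is wired from the argument name; the data is a placeholder:

```python
import pandas as pd
from dagster import asset, materialize


@asset
def raw_orders() -> pd.DataFrame:
    # Placeholder source data; a real asset might read from an API or lake.
    return pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 5.5, 7.25]})


@asset
def order_stats(raw_orders: pd.DataFrame) -> pd.DataFrame:
    # Dagster infers the dependency on raw_orders from the parameter name.
    return raw_orders.agg({"amount": ["sum", "mean"]})


if __name__ == "__main__":
    materialize([raw_orders, order_stats])
```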
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
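A minimal Prefect 2-style sketch: tasks composed inside a flow, with retries on the flaky step; the task bodies are placeholders:

```python
from prefect import flow, task


@task(retries=2)
def fetch_numbers() -> list[int]:
    # Placeholder extract step; retries absorb transient failures.
    return [1, 2, 3]


@task
def total(numbers: list[int]) -> int:
    return sum(numbers)


@flow
def etl_flow() -> int:
    return total(fetch_numbers())


if __name__ == "__main__":
    print(etl_flow())
```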
☁️ Choose the optimal Google Compute Engine machine type or instance across the many Google Cloud Platform regions.
Apache Airflow: a platform to programmatically author, schedule, and monitor workflows.
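A minimal sketch using the TaskFlow API, assuming a recent Airflow 2.x release; the DAG name and task bodies are placeholders:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_etl():
    @task
    def extract() -> list[int]:
        return [1, 2, 3]

    @task
    def load(values: list[int]) -> None:
        print(f"loaded {len(values)} rows")

    # Calling one task with another's result defines the dependency.
    load(extract())


example_etl()
```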
Meltano: a declarative, code-first data integration engine for building and running ELT pipelines from reusable connectors, so you don't have to write, maintain, and scale your own API integrations.
A getting-started guide to Singer, the open standard for data extraction scripts (taps) and loading scripts (targets) that exchange JSON messages over stdin/stdout.
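A toy tap sketch using the singer-python helper library: it emits a schema, some records, and a state message as JSON lines on stdout, which a downstream target would consume:

```python
import singer

# Declare the shape of the "users" stream.
schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
    },
}

singer.write_schema("users", schema, key_properties=["id"])
singer.write_records("users", [
    {"id": 1, "name": "ada"},
    {"id": 2, "name": "grace"},
])
# State lets the next run resume incrementally.
singer.write_state({"users": {"max_id": 2}})
```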
Scrapy, a fast high-level web crawling & scraping framework for Python.
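A minimal spider sketch against Scrapy's own tutorial site; run it with `scrapy runspider quotes_spider.py -o quotes.json`:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract each quote and author, then follow pagination links.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        yield from response.follow_all(response.css("li.next a"), self.parse)
```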
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, and more, and comes with built-in Hadoop support.
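A minimal two-task sketch: Luigi runs `Extract` first because `Summarize` declares it in `requires()`, and it skips any task whose output target already exists:

```python
import luigi


class Extract(luigi.Task):
    def output(self):
        return luigi.LocalTarget("data/raw.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("1\n2\n3\n")


class Summarize(luigi.Task):
    def requires(self):
        # Dependency resolution: Extract must complete before run() here.
        return Extract()

    def output(self):
        return luigi.LocalTarget("data/summary.txt")

    def run(self):
        with self.input().open() as f:
            total = sum(int(line) for line in f)
        with self.output().open("w") as f:
            f.write(f"total={total}\n")


if __name__ == "__main__":
    luigi.build([Summarize()], local_scheduler=True)
```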
chispa: PySpark test helper methods with beautiful error messages.
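A minimal sketch of chispa's DataFrame comparison; on mismatch it raises with a readable, colorized row-by-row diff:

```python
from chispa import assert_df_equality
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("chispa-demo").getOrCreate()

expected = spark.createDataFrame([("ada", 1), ("grace", 2)], ["name", "id"])
actual = spark.createDataFrame([("ada", 1), ("grace", 2)], ["name", "id"])

# Passes silently; a mismatch raises with a row-by-row diff of the frames.
assert_df_equality(actual, expected)
```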
A dbt package of macros for unit testing that can be (re)used across dbt projects.
dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
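dbt models themselves are SQL, but dbt-core (1.5+) also exposes a programmatic Python entry point; a hedged sketch, assuming it runs from inside a dbt project, with a placeholder selector:

```python
# Requires dbt-core >= 1.5 and a working dbt project in the current directory.
from dbt.cli.main import dbtRunner

result = dbtRunner().invoke(["run", "--select", "staging"])  # placeholder selector
print("success:", result.success)
```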
Great Expectations: always know what to expect from your data.
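A minimal sketch using the classic pandas-backed API of older (pre-1.0) Great Expectations releases; the column name and bounds are placeholders:

```python
import great_expectations as ge
import pandas as pd

# Wrap a pandas DataFrame so it gains expect_* assertion methods.
df = ge.from_pandas(pd.DataFrame({"passenger_count": [1, 2, 4, 6]}))

# Each expectation returns a validation result; "success" says if it held.
result = df.expect_column_values_to_be_between("passenger_count", 1, 6)
print(result["success"])
```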
Apache Spark: a unified analytics engine for large-scale data processing.
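A minimal PySpark sketch: Spark plans the aggregation lazily and distributes it across executors; the data is a placeholder:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

df = spark.createDataFrame(
    [("books", 12.0), ("books", 3.5), ("games", 7.25)],
    ["category", "price"],
)

# Nothing executes until show() triggers the distributed job.
df.groupBy("category").agg(F.sum("price").alias("revenue")).show()

spark.stop()
```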
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io).
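A minimal sketch using the trino-python-client (`pip install trino`) against the built-in `tpch` sample catalog; host and user are placeholders:

```python
import trino

conn = trino.dbapi.connect(
    host="localhost",  # placeholder coordinator host
    port=8080,
    user="demo",       # placeholder user
    catalog="tpch",    # sample catalog bundled with many Trino setups
    schema="tiny",
)

cur = conn.cursor()
cur.execute("SELECT nationkey, name FROM nation ORDER BY nationkey LIMIT 5")
for row in cur.fetchall():
    print(row)
```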