Google BigQuery data source for Apache Spark
Updated Oct 1, 2024 · Scala
This project orchestrates a data processing workflow using Apache Airflow, Spark, Google Cloud Storage (GCS), and Snowflake. The workflow handles daily data updates, filters completed orders, and updates a Snowflake target table with the latest information, with Airflow providing scheduling and workflow management.
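The core transformation in that pipeline is the daily filter for completed orders. A minimal sketch of that step in plain Python follows; the field names (`order_id`, `status`) and the status value `"COMPLETED"` are assumptions for illustration, not the project's actual schema:

```python
# Sketch of the daily filter step: keep only completed orders.
# Column names and status values are hypothetical.

def filter_completed(orders):
    """Return only the orders marked as completed."""
    return [o for o in orders if o.get("status") == "COMPLETED"]

daily_batch = [
    {"order_id": 1, "status": "COMPLETED"},
    {"order_id": 2, "status": "PENDING"},
    {"order_id": 3, "status": "COMPLETED"},
]

completed = filter_completed(daily_batch)
print([o["order_id"] for o in completed])  # prints [1, 3]
```

In the real pipeline this filter would run as a Spark transformation, with the result merged into the Snowflake target table.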
StackExchange data is cleaned with Pig, queried with HiveQL, and TF-IDF is applied to obtain the top 10 words used by the top 10 StackExchange users.
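TF-IDF weighs how often a word appears in one user's posts (term frequency) against how many users use it at all (inverse document frequency), so words common to everyone score near zero. A self-contained sketch of that computation, treating each user's tokenized posts as one document (the function and variable names are illustrative, not from the repo):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF scores per document.

    docs: dict mapping user -> list of words from their posts.
    Returns dict mapping user -> {word: score}.
    """
    n = len(docs)
    # Document frequency: in how many users' corpora each word appears.
    df = Counter()
    for words in docs.values():
        df.update(set(words))
    scores = {}
    for user, words in docs.items():
        tf = Counter(words)
        total = len(words)
        scores[user] = {
            w: (c / total) * math.log(n / df[w]) for w, c in tf.items()
        }
    return scores

def top_words(docs, k=10):
    """Top-k words per user, ranked by TF-IDF score."""
    return {
        user: [w for w, _ in sorted(s.items(), key=lambda kv: -kv[1])[:k]]
        for user, s in tf_idf(docs).items()
    }

docs = {"alice": ["spark", "spark", "hadoop"], "bob": ["hadoop", "pig"]}
print(top_words(docs, k=1))  # prints {'alice': ['spark'], 'bob': ['pig']}
```

The project computes the same statistic at scale with HiveQL over the Pig-cleaned data rather than in-memory Python.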
A Python framework for managing Dataproc clusters and scheduling PySpark jobs on them. It also provides a Docker-based development environment for debugging PySpark jobs.
Challenge project from the course "Criando um Ecossistema Hadoop Totalmente Gerenciado com Google Cloud Dataproc" (Creating a Fully Managed Hadoop Ecosystem with Google Cloud Dataproc).
Experiments with MapReduce (mrjob) and Google Dataproc
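An mrjob job expresses a computation as a mapper that emits key/value pairs, a shuffle that groups pairs by key, and a reducer that folds each group. A minimal local simulation of that flow for word counting, written without the mrjob runner (all names here are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit (word, 1) for every word, like an MRJob mapper would.
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # Sum the counts for one word, like an MRJob reducer would.
    yield word, sum(counts)

def run(lines):
    # Map phase: flatten every line into (word, 1) pairs.
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle phase: sort by key so groupby collects each word's values.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: fold each group through the reducer.
    out = {}
    for word, group in groupby(pairs, key=itemgetter(0)):
        for k, v in reducer(word, (c for _, c in group)):
            out[k] = v
    return out

print(run(["hadoop spark", "spark spark"]))  # prints {'hadoop': 1, 'spark': 3}
```

With mrjob proper, the same mapper/reducer pair can run unchanged on a local runner or be submitted to a Dataproc-hosted Hadoop cluster.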
Welcome to the Learning and Experiments Hub—a dynamic repository capturing my journey of exploration and experimentation in the vast world of technology. This space serves as a digital canvas where I document my learning process, experiments, and discoveries.
Welcome to the MiniProjects Playground—an interactive space where learning meets doing! This repository is a collection of hands-on mini-projects that I've crafted after delving into various tech stacks and frameworks. From theory to application, each project is a testament to the practical side of coding.