- spark_emr_dev - Demo of submitting Hadoop ecosystem jobs to AWS EMR
- spark-etl-pipeline - Demo of various Spark ETL processes
- utility_Scala - Basic Scala/Spark programming demo
# ├── Dockerfile : Dockerfile to build the Scala/Spark environment
# ├── README.md
# ├── archived : legacy Spark scripts in Python/Java...
# ├── build.sbt : (Scala) sbt build file declaring the Spark/Scala dependencies
# ├── config : configs for various services, e.g. S3, DB, Hive..
# ├── data : sample data for some of the Spark script demos
# ├── output : where the Spark stream/batch jobs write their output
# ├── project : (Scala) other sbt settings : plugins.sbt, build.properties...
# ├── python : helper Python scripts
# ├── run_all_process.sh : script demoing a minimal end-to-end Spark process
# ├── script : helper shell scripts
# ├── src : (Scala) MAIN SCALA SPARK TESTS/SCRIPTS
# ├── target : where the final compiled jar is written (e.g. target/scala-2.11/spark-etl-pipeline-assembly-1.0.jar)
# └── travis_build.sh : Travis CI build file
- Modify the configs under config with your own credentials and rename them (e.g. twitter.config.dev -> twitter.config) to access services such as the data source, file system, and so on.
- Install SBT as the Scala dependency-management/build tool
- Install Java and Spark
- Modify build.sbt to align with your dev env (see the sketch after this list)
- Check the Spark ETL scripts under src
- Workflow : sbt clean compile -> sbt test -> sbt run -> sbt assembly -> spark-submit <spark-script>.jar
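A minimal build.sbt sketch for reference; the Scala/Spark versions and the dependency list below are assumptions, so align them with the repo's actual build.sbt and your own environment:

// build.sbt : illustrative sketch only, not the repo's exact build file
name := "spark-etl-pipeline"
version := "1.0"
scalaVersion := "2.11.12"

// Spark dependencies (compile scope so `sbt run` works; mark them "provided" if you only launch via spark-submit)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "2.4.0",
  "org.apache.spark" %% "spark-sql"       % "2.4.0",
  "org.apache.spark" %% "spark-streaming" % "2.4.0"
)

// the sbt-assembly plugin (declared in project/plugins.sbt) produces
// target/scala-2.11/spark-etl-pipeline-assembly-1.0.jar for spark-submit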
Quick Start
$ git clone https://github.com/yennanliu/spark-etl-pipeline.git && cd spark-etl-pipeline && bash run_all_process.sh
Quick Start Manually
# STEP 0)
$ cd ~ && git clone https://github.com/yennanliu/spark-etl-pipeline.git && cd spark-etl-pipeline
# STEP 1) download the required dependencies
$ sbt clean compile
# STEP 2) print the Twitter stream via Spark Streaming with sbt run
$ sbt run
# STEP 3) create the jar from the Spark Scala scripts
$ sbt assembly
$ spark-submit target/scala-2.11/spark-etl-pipeline-assembly-1.0.jar
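For reference, a minimal sketch of what the Twitter-printing Spark Streaming job behind sbt run might look like; the object name TwitterStreamPrinter, the spark-streaming-twitter (Bahir) dependency, and the credential handling are assumptions, not the repo's actual code:

// TwitterStreamPrinter.scala : minimal sketch only
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object TwitterStreamPrinter {
  def main(args: Array[String]): Unit = {
    // Twitter credentials are assumed to be set as twitter4j.oauth.* system properties,
    // e.g. loaded from config/twitter.config before the stream starts
    val conf = new SparkConf().setMaster("local[2]").setAppName("TwitterStreamPrinter")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // None -> fall back to the default twitter4j OAuth configuration
    val tweets = TwitterUtils.createStream(ssc, None)

    // print the text of each tweet in every 5-second micro-batch
    tweets.map(_.getText).print()

    ssc.start()
    ssc.awaitTermination()
  }
}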
# get fake page view event data
# run the script that generates the page-view events
$ sbt package
$ spark-submit \
--class DataGenerator.PageViewDataGenerator \
target/scala-2.11/spark-etl-pipeline_2.11-1.0.jar
# open another terminal to receive the events
$ curl 127.0.0.1:44444
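For context, a rough sketch of how a generator like DataGenerator.PageViewDataGenerator could serve fake page-view events over a plain socket on port 44444 (so curl 127.0.0.1:44444 can receive them); the event fields and generation logic below are illustrative assumptions, not the repo's actual implementation:

// PageViewDataGenerator.scala : minimal sketch only
import java.io.PrintWriter
import java.net.ServerSocket
import scala.util.Random

object PageViewDataGenerator {
  // hypothetical page paths for the fake events
  private val pages = Seq("/home", "/product", "/cart", "/checkout")

  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(44444)
    println("Serving fake page-view events on 127.0.0.1:44444 ...")
    while (true) {
      val socket = server.accept()                      // e.g. a curl 127.0.0.1:44444 client
      val out    = new PrintWriter(socket.getOutputStream, true)
      (1 to 100).foreach { _ =>
        // one CSV line per event: timestamp, page, user id
        out.println(s"${System.currentTimeMillis()},${pages(Random.nextInt(pages.length))},user-${Random.nextInt(1000)}")
      }
      socket.close()
    }
  }
}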
Quick Start Docker
# STEP 0)
$ git clone https://github.com/yennanliu/spark-etl-pipeline.git
# STEP 1)
$ cd spark-etl-pipeline
# STEP 2) docker build
$ docker build . -t spark_env
# STEP 3) ONE COMMAND : run the docker env, then sbt clean compile, sbt assembly, and spark-submit in one go
$ docker run --mount \
type=bind,\
source="$(pwd)"/.,\
target=/spark-etl-pipeline \
-i -t spark_env \
/bin/bash -c "cd ../spark-etl-pipeline && sbt clean compile && sbt assembly && spark-submit target/scala-2.11/spark-etl-pipeline-assembly-1.0.jar"
# STEP 3') : STEP BY STEP : access docker -> sbt clean compile -> sbt run -> sbt assembly -> spark-submit
# docker run
$ docker run --mount \
type=bind,\
source="$(pwd)"/.,\
target=/spark-etl-pipeline \
-i -t spark_env \
/bin/bash
# inside docker bash
root@942744030b57:~ cd ../spark-etl-pipeline && sbt clean compile && sbt run
root@942744030b57:~ cd ../spark-etl-pipeline && spark-submit target/scala-2.11/spark-etl-pipeline-assembly-1.0.jar
Ref
- Stream via Python socket
- Install Spark + YARN + Hadoop via Docker
Dataset
- Twitch API (stream)
- Dota2 API (stream)
- NYC TLC Trip Record dataset (taxi) (large dataset)
- Amazon Customer Reviews Dataset (large dataset)
- Github repo dataset (large dataset)
- Hacker news dataset (large dataset)
- Stackoverflow dataset (large dataset)
- Yelp dataset (large dataset)
- Relational dataset (RDBMS online free dataset)
- Awesome public streaming data
- NYC SUBWAY REALTIME API
- Github mirror data