This pipeline captures real-time bike station data from the JCDecaux public API and processes it with Dockerized services for messaging, stream processing, storage, and visualization. Kafka handles messaging, Spark Streaming processes the data, Elasticsearch stores it, and Kibana provides visualization.
The pipeline consists of the following components:
- JCDecaux API: Source of real-time bike station data.
- Kafka: Distributed messaging system for data distribution.
- Spark Streaming: Processes data in real-time.
- Elasticsearch: Stores data for easy querying.
- Kibana: Visualizes data stored in Elasticsearch.
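To make the first hop of this flow concrete, the following is a minimal sketch of how station records might be turned into Kafka messages. It is not the repository's producer: the helper names `to_message` and `publish` are hypothetical, the field names `number` and `contract_name` are assumed to match the station records the API returns, and the `send` callable stands in for a real Kafka client so the sketch runs without a broker.

```python
import json
from typing import Callable, Iterable, Tuple

TOPIC = "velib_stations"  # topic name used by this pipeline


def to_message(station: dict) -> Tuple[bytes, bytes]:
    """Turn one station record into a (key, value) pair of bytes.

    Assumption: each record carries 'contract_name' and 'number' fields,
    which together identify a station.
    """
    key = f"{station['contract_name']}:{station['number']}".encode("utf-8")
    value = json.dumps(station, sort_keys=True).encode("utf-8")
    return key, value


def publish(stations: Iterable[dict],
            send: Callable[[str, bytes, bytes], None]) -> int:
    """Pass every station to send(topic, key, value); return how many were sent."""
    count = 0
    for station in stations:
        key, value = to_message(station)
        send(TOPIC, key, value)
        count += 1
    return count
```

With the `kafka-python` package, `send` could plausibly wrap `KafkaProducer.send(topic, key=key, value=value)`. Keying by station keeps all updates for one station in the same partition, which preserves their order.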
All components run in Docker containers. Ensure Docker and Docker Compose are installed on your system.
- Clone the repository:
git clone https://github.com/Yasselo/Realtime_jcdecaux_pipeline
cd Realtime_jcdecaux_pipeline
- Build and start services: Run the following command to start all services, including Kafka, Spark, Elasticsearch, and Kibana, in Docker containers:
docker-compose up --build
This command pulls the necessary Docker images, builds the custom bike pipeline container, and starts all services defined in `docker-compose.yml`.
- Kafka Topic: A Kafka topic named `velib_stations` will be created automatically for bike station data.
- Environment Variables:
- `KAFKA_BROKER`: Kafka broker address (`kafka:9092`).
- `ELASTICSEARCH_HOST`: Elasticsearch host (`elasticsearch`).
- `SPARK_MASTER`: Spark master URL (`spark://spark:7077`).
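As an illustration of how a script inside the pipeline container could pick up these settings, here is a small sketch that reads the three variables and falls back to the values listed above. The `PipelineConfig` class and `load_config` function are hypothetical names, not part of the repository.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PipelineConfig:
    kafka_broker: str
    elasticsearch_host: str
    spark_master: str


def load_config(env: Optional[Mapping[str, str]] = None) -> PipelineConfig:
    """Read settings from the environment, defaulting to the documented values."""
    env = os.environ if env is None else env
    return PipelineConfig(
        kafka_broker=env.get("KAFKA_BROKER", "kafka:9092"),
        elasticsearch_host=env.get("ELASTICSEARCH_HOST", "elasticsearch"),
        spark_master=env.get("SPARK_MASTER", "spark://spark:7077"),
    )
```

Passing a plain dictionary instead of the real environment, for example `load_config({"SPARK_MASTER": "local[*]"})`, makes it easy to try a different setting without touching the container.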
- Start the Docker containers:
docker-compose up
- Start the pipeline:
docker exec -it bike_pipeline bash
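Wait for the services to finish initializing. Once they are running, enter the following command in the container's bash shell; the `&` runs the producer in the background while the consumer runs in the foreground: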
python3 producer & python3 consumer
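The "wait for the services to initialize" step can be checked rather than guessed. The sketch below polls a TCP port until it accepts connections or a timeout expires. The function name `wait_for_port` is hypothetical, and which ports to check is an assumption: `kafka:9092` comes from the configuration above, while `9200` is only the default Elasticsearch port and may differ in `docker-compose.yml`.

```python
import socket
import time


def wait_for_port(host: str, port: int,
                  timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError if the port is not reachable yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Inside the container, one might call `wait_for_port("kafka", 9092)` and `wait_for_port("elasticsearch", 9200)` before launching the producer and consumer. An open port only shows that the process is listening, so a service may still need a few more seconds before it is fully ready.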
- Kibana Access: Open http://localhost:5601 to view Kibana and check Elasticsearch data under "Index Management".
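Besides looking in Kibana, the arrival of data can be confirmed from a script using Elasticsearch's `_count` endpoint. The sketch below makes several assumptions: that Elasticsearch is reachable from the host at `http://localhost:9200`, that it accepts plain HTTP without authentication (Elasticsearch 8.x enables security by default, so this only works if the compose file turns it off), and that you know the index name the consumer writes to, which is passed in as a parameter because it is not stated here. The function names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def count_url(index: str, base_url: str = "http://localhost:9200") -> str:
    """Build the URL of the _count endpoint for one index."""
    return f"{base_url.rstrip('/')}/{quote(index, safe='')}/_count"


def parse_count(body: bytes) -> int:
    """Extract the document count from a _count response body."""
    return int(json.loads(body)["count"])


def document_count(index: str, base_url: str = "http://localhost:9200",
                   timeout: float = 5.0) -> int:
    """Ask Elasticsearch how many documents the index currently holds."""
    with urlopen(count_url(index, base_url), timeout=timeout) as resp:
        return parse_count(resp.read())
```

Calling `document_count` twice a few seconds apart with the consumer's index name should show the number growing while the pipeline is running.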
The following Docker images are used:
- Kafka: `wurstmeister/kafka:2.13-2.8.1`
- Spark: `bitnami/spark:3.2.4`
- Elasticsearch: `docker.elastic.co/elasticsearch/elasticsearch:8.8.2`
- Kibana: `docker.elastic.co/kibana/kibana:8.8.2`
- Custom Pipeline: Defined in `Dockerfile`, containing OpenJDK 11, Spark, Kafka, and Elasticsearch dependencies.
With the setup complete, real-time bike station data should now be visible in Kibana. This project demonstrates the use of a streaming data pipeline for processing and visualizing real-time data using Docker.