This project demonstrates knowledge of Data Engineering tools and concepts, and was also built as a learning exercise.
The pipeline extracts data from the Sunglasseshut web page and ultimately sinks it into a Power BI dashboard.
The tools that were used for this project are the following:
- Azure for hosting the infrastructure.
- Terraform as IaC for infrastructure provisioning.
- Airflow for orchestration.
- Docker for containerizing the pipeline.
- Insomnia for generating the code that sends requests to the web page's API.
- Power BI for data visualization.
- Python as the main programming language.
- The data is scraped using Insomnia and Python.
- The extracted data is converted into a DataFrame that is uploaded to an Azure Storage account using the Azure Identity and Azure Blob Storage client libraries (see the upload sketch below).
- The data is cleaned and the data types are validated using pydantic's data classes (see the validation sketch below).
- Finally, the data is delivered to Azure Synapse (the data warehouse) and Azure Database for PostgreSQL - Flexible Server (see the load sketch below).
- Users can then analyze the data using Power BI or any other visualization tool they prefer.
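A minimal sketch of the upload step is shown below, assuming `DefaultAzureCredential` for authentication; the storage account URL, container, and blob names are placeholders rather than the values used in this repository.

```python
# Sketch of the upload step: serialize the DataFrame and push it to Blob Storage.
# The account URL, container, and blob name are placeholders.
import pandas as pd
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

ACCOUNT_URL = "https://<storage-account>.blob.core.windows.net"  # placeholder


def upload_dataframe(df: pd.DataFrame, container: str, blob_name: str) -> None:
    """Serialize the DataFrame to CSV and upload it as a blob."""
    credential = DefaultAzureCredential()  # picks up the az CLI login, env vars, etc.
    service = BlobServiceClient(account_url=ACCOUNT_URL, credential=credential)
    blob_client = service.get_blob_client(container=container, blob=blob_name)
    csv_bytes = df.to_csv(index=False).encode("utf-8")
    blob_client.upload_blob(csv_bytes, overwrite=True)
```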
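The validation step could look roughly like the sketch below; the field names and types are assumptions, not the actual schema of the Sunglasseshut response.

```python
# Sketch of the cleaning/validation step using a pydantic data class.
# Field names and types are assumptions, not the real schema.
from pydantic import ValidationError
from pydantic.dataclasses import dataclass


@dataclass
class Product:
    product_id: str
    name: str
    brand: str
    price: float


def validate_records(raw_records: list[dict]) -> list[Product]:
    """Keep only the records whose fields and types pass pydantic validation."""
    valid = []
    for record in raw_records:
        try:
            valid.append(Product(**record))
        except (ValidationError, TypeError):
            # Drop records with missing fields or wrong types.
            continue
    return valid
```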
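One possible way to implement the PostgreSQL load is sketched below with SQLAlchemy and pandas' `to_sql`; the server FQDN, database, and table names are placeholders, the repository may load the data differently, and Synapse would need its own loader.

```python
# Sketch of loading the cleaned DataFrame into the PostgreSQL flexible server.
# The connection details and table name are placeholders.
import pandas as pd
from sqlalchemy import create_engine


def load_to_postgres(df: pd.DataFrame, user: str, password: str) -> None:
    """Write the DataFrame to a table, replacing any previous contents."""
    engine = create_engine(
        f"postgresql+psycopg2://{user}:{password}"
        "@<server-fqdn>.postgres.database.azure.com:5432/<database>"
    )
    df.to_sql("sunglasses_products", engine, if_exists="replace", index=False)
```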
The following tools must be installed and configured for the pipeline to work correctly:
- Azure CLI for account configuration and Terraform provisioning.
- Terraform to provision the infrastructure.
- Docker for running Airflow and containerizing the pipeline.
- (Optional) Linux or macOS to use the make commands from the Makefile.
The following commands can be used to initialize the pipeline; they will only work on Linux or macOS.
- `make init-terraform`: Initializes the Terraform backend inside the ./terraform directory. You will be asked to enter a password and then a user; these same credentials will be used for both Azure Synapse and PostgreSQL.
- `make environment`: Provisions the Azure infrastructure using Terraform.
- `make terraform-config`: Writes the configuration of the infrastructure created with Terraform to a file called "configuration.env" inside ./airflow/tasks. This file includes the FQDN, database names, etc.
- `make start-run`: Creates and starts the Airflow containers.
- `make az-login`: Logs in to Azure from within the container; this is necessary for all the Airflow tasks. It will prompt you with an authentication URL and code. Follow the instructions closely.
You can now log in to Airflow at http://localhost:8080/ and trigger the pipeline manually or wait for the next scheduled run. The default user and password are both "airflow"; you can change them in the docker-compose file.
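For reference, the sketch below shows how the pipeline tasks might be wired together as an Airflow DAG. The DAG id, schedule, and placeholder callables are illustrative assumptions; the actual DAG in this repository (under ./airflow) may be organized differently.

```python
# Illustrative Airflow DAG showing how the tasks might be chained.
# DAG id, schedule, and the placeholder callables are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract() -> None:
    """Placeholder for the scrape-and-build-DataFrame task."""


def validate_and_upload() -> None:
    """Placeholder for the pydantic validation and blob upload task."""


def load() -> None:
    """Placeholder for the Synapse/PostgreSQL load task."""


with DAG(
    dag_id="sunglasses_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    validate_task = PythonOperator(
        task_id="validate_and_upload", python_callable=validate_and_upload
    )
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> validate_task >> load_task
```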