This project creates a pipeline that takes data from the Properati web page (Properati is a real estate search site), processes it using Lambda functions and, finally, stores it in a Redshift database. The pipeline is orchestrated using AWS Step Functions and scheduled with AWS EventBridge. We also build a Flask REST API to interact with the database. This Flask app allows us to retrieve data and is hosted in the AWS Lightsail container service.
The tools that were used for the project are:
- AWS for hosting the infrastructure.
- AWS S3 as our storage.
- AWS Lambda as the executor.
- AWS Redshift as our data warehouse.
- AWS Step Functions for orchestrating our pipeline.
- AWS EventBridge for scheduling our pipeline.
- AWS Lightsail Containers for hosting our Flask REST API App.
- Terraform as IaC for the infra provisioning.
- Docker for containerizing our Flask app.
- Insomnia and Flask for testing and developing our REST API.
- Pytest for testing the response we receive from the webpage.
- Python as the main programming language.
- Data is extracted from Properati.
- The extracted data is validated, cleaned and uploaded to Redshift (a rough sketch of this step follows the list below).
- A Flask REST API is created for the database so we can interact with the data inside our Data Warehouse.
- Users can now analyze the data using any visualization tool they prefer or use the API to develop new solutions.
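To make the flow more concrete, here is a rough sketch of what the extract step could look like inside a Lambda function. The bucket name, listings URL and key layout are placeholders, not the project's actual values; this is an illustration under those assumptions, not the pipeline's real code.

```python
# Hypothetical extract step: fetch a Properati listings page and stage the raw
# payload in S3 so a later step can clean it and load it into Redshift.
import json
from datetime import datetime, timezone

import boto3
import requests

S3_BUCKET = "properati-raw-data"                       # placeholder bucket name
LISTINGS_URL = "https://www.properati.com.pe/s/venta"  # placeholder listings URL


def lambda_handler(event, context):
    # Fetch one page of listings from Properati.
    response = requests.get(LISTINGS_URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()

    # Stage the raw payload in S3; the next state in the Step Functions
    # state machine would validate, clean and load it into Redshift.
    key = f"raw/{datetime.now(timezone.utc):%Y/%m/%d/%H%M%S}.html"
    boto3.client("s3").put_object(Bucket=S3_BUCKET, Key=key, Body=response.text)

    return {"statusCode": 200, "body": json.dumps({"s3_key": key})}
```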
The following requirements need to be installed locally for the solution to work correctly:
- AWS CLI for configuring the account and provisioning with Terraform.
- AWS CLI Lightsail plugin for deploying our containers and pushing the Docker images to the AWS Lightsail Containers' repository.
- Terraform to provision the infrastructure.
- Docker to containerize the Flask REST API App image.
For testing, let's go to our root folder and run:
pytest
: This will run some tests to make sure the web page behaves as we expect.
- The first test will make sure that we receive a 200 response, meaning that the webpage exists and we have access to it.
- The second test will make sure that the limit of elements per page is 30 (a sketch of both checks follows this list).
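As a rough idea of what these checks could look like, here is a minimal sketch. The listings URL and the way the per-page limit is counted are assumptions and may differ from the project's actual test code.

```python
# Hypothetical versions of the two checks described above.
import requests

LISTINGS_URL = "https://www.properati.com.pe/s/venta"  # placeholder listings page
HEADERS = {"User-Agent": "Mozilla/5.0"}


def test_page_returns_200():
    # The webpage should exist and be accessible.
    response = requests.get(LISTINGS_URL, headers=HEADERS, timeout=30)
    assert response.status_code == 200


def test_page_limit_is_30():
    # Each results page is expected to expose at most 30 listings.
    response = requests.get(LISTINGS_URL, headers=HEADERS, timeout=30)
    listings = response.text.count("listing-card")  # placeholder listing marker
    assert listings <= 30
```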
Now, to create the pipeline, Terraform will provision everything we need. Just clone the repo and execute the following commands inside the terraform folder:
aws configure
: This command is used to log in to an AWS account using your secret access keys.
terraform init
: This will initialize Terraform in the folder.
terraform apply
: This will create our infrastructure. You will be prompted to input a Redshift user and password.
- (Only run the next command if you want to destroy the infrastructure)
terraform destroy
: This destroys the created infrastructure.
This pipeline is scheduled hourly, so we can wait up to an hour for the pipeline to run, or start our Step Functions state machine manually, as shown in the sketch below.
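If you prefer to start it from Python rather than the console, this is a minimal sketch using boto3; the state machine ARN is a placeholder and should be replaced with the ARN of the state machine Terraform creates.

```python
# Hypothetical manual trigger for the pipeline's Step Functions state machine.
import boto3

sfn = boto3.client("stepfunctions")

execution = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:properati-pipeline"  # placeholder ARN
)
print(execution["executionArn"])
```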
Path | Request Type | Parameters |
---|---|---|
`/properties` | GET | No parameters required. This request retrieves all the data from our database. |
`/properties` | POST | id (int), type (str), title (str), bedrooms (int), bathrooms (int), price (int), surface (int), district (str), geo_lon (float), geo_lat (float), place_lon (float), place_lat (float) |
`/properties/<int:id>` | GET | No parameters required. This request retrieves a specific property from our database by its id. |
- The Flask API URL can be found in the AWS Lightsail container service.
- The path URL/swagger-ui will show the documentation of the Flask API.
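As an illustration, a client could call these endpoints from Python like this. The base URL is a placeholder for your Lightsail service URL, and sending the POST parameters as a JSON body is an assumption that may not match the actual API.

```python
# Hypothetical client calls against the endpoints listed in the table above.
import requests

BASE_URL = "https://<your-lightsail-service>.amazonaws.com"  # placeholder service URL

# Retrieve every property stored in the data warehouse.
all_properties = requests.get(f"{BASE_URL}/properties").json()

# Insert a new property; field names come from the table above, values are examples.
new_property = {
    "id": 101, "type": "apartment", "title": "2BR near the park",
    "bedrooms": 2, "bathrooms": 1, "price": 120000, "surface": 75,
    "district": "Miraflores", "geo_lon": -77.03, "geo_lat": -12.12,
    "place_lon": -77.03, "place_lat": -12.12,
}
requests.post(f"{BASE_URL}/properties", json=new_property)

# Retrieve a single property by its id.
one_property = requests.get(f"{BASE_URL}/properties/101").json()
```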