
ETL-Houseslima-Properati

This project builds a pipeline that extracts data from the Properati web page (Properati is a real estate search site), processes it with AWS Lambda functions and, finally, stores it in an Amazon Redshift database. The pipeline is orchestrated with AWS Step Functions and scheduled with AWS EventBridge. We also build a Flask REST API to interact with the database; this Flask app lets us retrieve data and is hosted on the AWS Lightsail container service.
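
As a rough illustration of the Lambda-based processing, a minimal extraction handler could look like the sketch below. The URL, event shape, and return fields are assumptions for illustration, not the repo's actual code.

```python
import urllib.request

# Assumed listings URL for Lima houses; the real scraper may target a
# different Properati path.
PROPERATI_URL = "https://www.properati.com.pe/s/lima/casa/venta"

def lambda_handler(event, context):
    """Extraction step: fetch one results page and hand it to the next state."""
    page = event.get("page", 1)  # hypothetical input field from Step Functions
    req = urllib.request.Request(
        f"{PROPERATI_URL}/{page}",
        headers={"User-Agent": "Mozilla/5.0"},  # plain requests are often blocked
    )
    with urllib.request.urlopen(req) as resp:
        html = resp.read().decode("utf-8")
        status = resp.status
    # Downstream states validate, clean and load the parsed listings.
    return {"page": page, "status": status, "html": html}
```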

The tools used for this project are AWS Lambda, AWS Step Functions, AWS EventBridge, Amazon Redshift, AWS Lightsail, Terraform, Docker, Flask, and pytest.

Project's Architecture

(Architecture diagram: project_arch)

  1. Data is extracted from Properati.
  2. The extracted data is validated, cleaned and uploaded to Redshift (a sketch of this step follows this list).
  3. A Flask REST API exposes the database so we can interact with the data inside our data warehouse.
  4. Users can then analyze the data with any visualization tool they prefer, or use the API to build new solutions.
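
A minimal sketch of step 2, assuming the Redshift Data API and a properties table with the columns listed in the API section below (the cluster, database, and user names are placeholders):

```python
import boto3

# Placeholder identifiers; the real values come from the Terraform setup.
CLUSTER_ID = "properati-cluster"
DATABASE = "dev"
DB_USER = "awsuser"

def load_property(record: dict) -> None:
    """Validate a cleaned listing and insert it into Redshift via the Data API."""
    if record["price"] <= 0 or not record["district"]:
        raise ValueError("record failed validation")
    client = boto3.client("redshift-data")
    client.execute_statement(
        ClusterIdentifier=CLUSTER_ID,
        Database=DATABASE,
        DbUser=DB_USER,
        Sql=(
            "INSERT INTO properties (id, district, price) "
            "VALUES (:id, :district, :price)"
        ),
        Parameters=[
            {"name": "id", "value": str(record["id"])},
            {"name": "district", "value": record["district"]},
            {"name": "price", "value": str(record["price"])},
        ],
    )
```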

Project's requirements

The following requirements must be installed locally for the solution to work correctly:

  1. AWS CLI for configuring the account and provisioning with Terraform.
  2. AWS CLI Lightsail plugin for deploying our containers and pushing the Docker images to the AWS Lightsail containers' repository.
  3. Terraform to provision the infrastructure.
  4. Docker to containerize the Flask REST API image.

Start Pipeline

For testing, go to the root folder and run:

pytest: this runs a few tests to make sure the web page works as we expect.

  1. The first test makes sure that we receive an HTTP 200 response, meaning the web page exists and we have access to it.
  2. The second test makes sure that the limit of elements per page is 30 (both tests are sketched below).
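
Hedged sketches of what those two tests could look like (the URL and the listing-card CSS class are assumptions; the actual tests live in the repo):

```python
import requests
from bs4 import BeautifulSoup

# Assumed search URL; the real tests point at the page the scraper uses.
PROPERATI_URL = "https://www.properati.com.pe/s/lima/casa/venta"
HEADERS = {"User-Agent": "Mozilla/5.0"}

def test_page_is_reachable():
    # Test 1: the web page exists and we have access to it.
    response = requests.get(PROPERATI_URL, headers=HEADERS)
    assert response.status_code == 200

def test_page_has_30_elements():
    # Test 2: the limit of elements per page is 30.
    html = requests.get(PROPERATI_URL, headers=HEADERS).text
    soup = BeautifulSoup(html, "html.parser")
    listings = soup.find_all("div", class_="listing-card")  # hypothetical class
    assert len(listings) == 30
```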

Now, to create the pipeline, Terraform will provision everything we need. Just clone the repo and run the following commands inside the terraform folder:

  1. aws configure: This command signs in to an AWS account using your secret access keys.
  2. terraform init: This initializes Terraform in the folder.
  3. terraform apply: This creates our infrastructure. You will be prompted to enter a Redshift user and password.
  4. (Only run if you want to destroy the infrastructure) terraform destroy: This destroys the created infrastructure.

The pipeline is scheduled hourly, so we can either wait up to an hour for it to run or start the Step Functions state machine manually, as in the sketch below.
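
A manual run can be triggered from Python with boto3, for example (the state machine ARN below is a placeholder; copy the real one from the Step Functions console or the Terraform outputs):

```python
import boto3

# Placeholder ARN; replace it with the one created by terraform apply.
STATE_MACHINE_ARN = (
    "arn:aws:states:us-east-1:123456789012:stateMachine:properati-etl"
)

sfn = boto3.client("stepfunctions")
execution = sfn.start_execution(stateMachineArn=STATE_MACHINE_ARN, input="{}")
print("Started:", execution["executionArn"])
```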

Flask REST API

| Path | Request Type | Parameters |
| --- | --- | --- |
| /properties | GET | No parameters required. Retrieves all the data from our database. |
| /properties | POST | id (int), type (str), title (str), bedrooms (int), bathrooms (int), price (int), surface (int), district (str), geo_lon (float), geo_lat (float), place_lon (float), place_lat (float) |
| /properties/<int:id> | GET | No parameters required. Retrieves a specific property from our database by its id. |
  • The Flask API URL can be found in the AWS Lightsail container service.
  • The path URL/swagger-ui shows the documentation of the Flask API.
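
As a usage sketch (the base URL is a placeholder; use the one shown in the Lightsail container service):

```python
import requests

# Placeholder base URL; copy the real one from the Lightsail console.
BASE_URL = "https://container-service.example.amazonlightsail.com"

# GET /properties: retrieve every property in the warehouse.
all_properties = requests.get(f"{BASE_URL}/properties").json()

# GET /properties/<id>: retrieve a single property by its id.
one_property = requests.get(f"{BASE_URL}/properties/1").json()

# POST /properties: insert a new property (illustrative values).
payload = {
    "id": 9999, "type": "house", "title": "Casa en Miraflores",
    "bedrooms": 3, "bathrooms": 2, "price": 450000, "surface": 180,
    "district": "Miraflores", "geo_lon": -77.03, "geo_lat": -12.12,
    "place_lon": -77.03, "place_lat": -12.12,
}
requests.post(f"{BASE_URL}/properties", json=payload)
```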
