This documentation provides guidance on setting up, running, and understanding the codebase, including instructions for installing dependencies, configuring the environment, and running the analysis.
Before proceeding, ensure you have the following installed:
- Python (version 3.8 or higher)
- Git (if cloning the repository)
- A code editor or IDE (e.g., VS Code, PyCharm)
Here is how the project directory is organized:
project-directory/
│
├── Raw_datasets/
│ ├── items.csv
│ ├── promotion.csv
│ ├── sales.csv
│ ├── supermarkets.csv
│
├── src/
│ ├── Data_Engineering_Pretest.ipynb # Main notebook for running the analysis
│ ├── Clean_data.py # Functions for cleaning datasets
│
├── requirements.txt # List of dependencies
├── .env # Environment variables (e.g., database credentials)
├── README.md # Project overview and usage
└── Report/
└── Report on Data Engineering Pretest.pdf # Final report detailing tasks and insights
If the project is hosted on a Git repository, clone it:
git clone <repository-url>
cd project-directory

Set up a virtual environment to manage dependencies:
python -m venv venv
source venv/bin/activate # On Linux/Mac
venv\Scripts\activate   # On Windows

Install the required libraries using requirements.txt:
pip install -r requirements.txt

Create a .env file (if it doesn’t already exist) in the project root. Add the following variables:
DB_NAME=your_database_name
DB_USER=your_username
DB_PASSWORD=your_password
DB_HOST=your_host
DB_PORT=your_port
Replace placeholders with your actual PostgreSQL credentials.
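For reference, these credentials are typically assembled into a SQLAlchemy connection URL. In the project the variables would be loaded from .env via python-dotenv's load_dotenv(); the sketch below sets illustrative values inline so it is self-contained:

```python
import os

# Illustrative values; in the project these come from .env via load_dotenv().
os.environ["DB_NAME"] = "supermarket_data"
os.environ["DB_USER"] = "admin"
os.environ["DB_PASSWORD"] = "securepassword"
os.environ["DB_HOST"] = "localhost"
os.environ["DB_PORT"] = "5432"

def build_db_url():
    """Assemble a SQLAlchemy-style PostgreSQL URL from the DB_* variables."""
    e = os.environ
    return (
        f"postgresql+psycopg2://{e['DB_USER']}:{e['DB_PASSWORD']}"
        f"@{e['DB_HOST']}:{e['DB_PORT']}/{e['DB_NAME']}"
    )
```

The resulting URL can be passed directly to SQLAlchemy's create_engine().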
Use the database.py script to upload cleaned datasets to the PostgreSQL database:
python src/database.py

This script:
- Connects to the PostgreSQL database using credentials in .env.
- Creates tables for items, promotions, sales, and supermarkets.
- Loads the data from the data/ folder into the respective tables.
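The load step can be sketched with pandas and SQLAlchemy. database.py itself is not reproduced here; this sketch uses an in-memory SQLite engine so it runs anywhere, whereas the real script would pass the PostgreSQL URL built from the .env credentials:

```python
import pandas as pd
from sqlalchemy import create_engine

# In the project, the engine would use the PostgreSQL URL built from .env;
# SQLite in memory is used here only so the example is self-contained.
engine = create_engine("sqlite://")

# One DataFrame per CSV; in the real script this would be
# pd.read_csv("Raw_datasets/items.csv") rather than an inline frame.
items = pd.DataFrame({"item_id": [1, 2], "name": ["milk", "bread"]})

# Create (or replace) the table and bulk-insert the rows.
items.to_sql("items", engine, if_exists="replace", index=False)

# Verify the upload by reading the table back.
print(pd.read_sql_table("items", engine))
```

The same to_sql() call is repeated for promotions, sales, and supermarkets.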
Run the data cleaning script to preprocess and clean the datasets:
python src/data_cleaning.py

Use the analysis.py script to analyze branch-level sales patterns and promotion effectiveness:
python src/analysis.py

Run the visualization script to create charts and heatmaps for promotion effectiveness and sales trends:
python src/visualization.py

Alternatively, you can run the main.py script, which combines all the steps:
python src/main.py

The requirements.txt file lists all Python dependencies. Here’s an example:
pandas
numpy
matplotlib
seaborn
sqlalchemy
psycopg2-binary
python-dotenv
Install these by running:
pip install -r requirements.txt

The .env file holds sensitive credentials (e.g., database information). Example:
DB_NAME=supermarket_data
DB_USER=admin
DB_PASSWORD=securepassword
DB_HOST=localhost
DB_PORT=5432
- Data Cleaning: Handles missing values, duplicates, and data type corrections.
- Database Integration: Uploads cleaned data into a PostgreSQL database for centralized storage and analysis.
- Business Analysis: Generates actionable insights, such as branch-level sales patterns and promotion effectiveness.
- Visualization: Provides visual insights through heatmaps, bar plots, and other charts.
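The Data Cleaning step can be sketched with pandas; the frame below is a made-up stand-in for one of the raw CSVs, and the exact rules in the project's cleaning script may differ:

```python
import pandas as pd

# Hypothetical raw frame; the real script would read a CSV from Raw_datasets/.
raw = pd.DataFrame({
    "item_id": ["1", "2", "2", "3"],
    "price": [2.5, None, None, 4.0],
})

# Drop exact duplicate rows.
clean = raw.drop_duplicates()

# Handle missing values: fill absent prices with the column median.
clean["price"] = clean["price"].fillna(clean["price"].median())

# Correct data types: item_id should be an integer key, not a string.
clean["item_id"] = clean["item_id"].astype(int)
```

Each raw dataset gets the same treatment before being uploaded to the database.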
- Dependency Issues:
  - Ensure you’re using the correct Python version.
  - If errors occur during installation, update pip: pip install --upgrade pip
- Database Connection Errors:
  - Verify that the PostgreSQL server is running and the .env file contains the correct credentials.
- Data File Issues:
  - Ensure all required CSV files are in the data/ folder. Missing files will cause errors.
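When chasing database connection errors, it can help to first confirm the server is reachable at all. The helper below is hypothetical (not part of the project) and only tests TCP reachability of the configured host and port:

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check the default PostgreSQL port from the .env settings.
# port_open("localhost", 5432)
```

If this returns False, the problem is the server or network, not the credentials.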
- Add year information to the sales dataset for temporal analysis.
- Include additional validation steps for ensuring data quality during extraction.
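As a sketch of what the proposed validation step might look like, the helper below checks a sales extract for missing columns, nulls, and negative quantities; the column names are assumptions, not the project's actual schema:

```python
import pandas as pd

# Assumed schema for the sales extract; adjust to the real column names.
REQUIRED_COLUMNS = {"item_id", "supermarket_id", "quantity"}

def validate_sales(df):
    """Return a list of data-quality problems found in a sales extract."""
    problems = []
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        problems.append("missing columns: %s" % sorted(missing))
        return problems
    if df[sorted(REQUIRED_COLUMNS)].isna().any().any():
        problems.append("null values in required columns")
    if (df["quantity"] < 0).any():
        problems.append("negative quantities")
    return problems
```

An empty list means the extract passed; otherwise the pipeline can log or reject it before loading.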