PipeDreams is a data exploration and visualization tool designed to be simple, flexible, and powerful. Upload any CSV file, perform basic ETL transformations, visualize your data, and gain insights with built-in machine learning, all from a user-friendly Streamlit interface. If no file is uploaded, a default dataset (customers-100000.csv) is used for demonstration.
- CSV Upload: Upload any CSV file for immediate analysis and visualization.
- Default Dataset: If no file is uploaded, the app loads a sample dataset (customers-100000.csv) located in the data/ directory.
- ETL Transformations: Clean and transform data, remove missing values, and auto-convert data types.
- Data Visualization: Interactive charts (scatter, bar, line, histogram, and box plots) to gain insights from your data.
- Clustering Analysis: Use KMeans clustering to identify natural groupings within the data, such as customer segments.
- Predictive Analysis: A synthetic column (Annual Purchase Amount) is included for testing linear regression, allowing users to explore predictive analysis features.
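The ETL Transformations feature listed above (removing missing values and auto-converting data types) can be illustrated with a short, dependency-free sketch. This is not PipeDreams' actual code: the function names `convert` and `clean_rows` and the sample data are hypothetical, and conversion is done per value rather than per column to keep the example small.

```python
import csv
import io


def convert(value):
    """Try int, then float; fall back to the original string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def is_missing(value):
    """Blank strings count as missing; non-string values signal a malformed row."""
    return not isinstance(value, str) or value.strip() == ""


def clean_rows(csv_text):
    """Drop rows with any missing field, then auto-convert the remaining values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    cleaned = []
    for row in reader:
        if any(is_missing(v) for v in row.values()):
            continue  # skip incomplete or malformed rows
        cleaned.append({key: convert(v.strip()) for key, v in row.items()})
    return cleaned


# Hypothetical sample: the second row is missing "age" and is dropped.
sample = "name,age,spend\nAda,36,120.5\nBob,,80\nCy,41,n/a\n"
rows = clean_rows(sample)
print(rows)
```

Values that cannot be parsed as numbers (such as `n/a` here) are left as strings rather than treated as missing; a real pipeline may want to handle such placeholders explicitly.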
You can run PipeDreams using Docker to avoid setting up dependencies locally. The pre-built Docker image is available on DockerHub.
Pull the latest Docker image from DockerHub:
docker pull anuclei/pipedreams:latest
Run the application with Docker, exposing it on port 8501:
docker run -p 8501:8501 anuclei/pipedreams:latest
Once the container is running, open your browser and go to http://localhost:8501 to access the application.
PipeDreams can also be deployed on a Kubernetes cluster. This deployment scenario uses Minikube for local Kubernetes clusters and includes configurations for high availability and autoscaling.
For detailed instructions and YAML configurations, refer to the Kubernetes Deployment Guide in the k8s directory.
If you prefer not to use Docker, you can set up the app manually.
- Python (version 3.6 or higher)
- Clone the repository:
  git clone https://github.com/markjacksonfishing/pipedreams.git
  cd pipedreams
- Set up a virtual environment:
  - MacOS/Linux:
    python3 -m venv venv
    source venv/bin/activate
  - Windows:
    python -m venv venv
    .\venv\Scripts\activate
- Install dependencies:
  pip install -r requirements.txt
- Run the application:
  - MacOS/Linux:
    source venv/bin/activate
    streamlit run app.py
  - Windows:
    .\venv\Scripts\activate
    streamlit run app.py
  The application will open in your default web browser at http://localhost:8501.
- Deactivate the virtual environment (when finished):
  deactivate
- Start the application by following the setup steps above (or run it via Docker).
- Upload a CSV file using the file uploader in the app, or view the default dataset if no file is uploaded.
- Explore the data with built-in ETL transformations and interactive visualizations.
- Perform clustering analysis and predictive analysis on available data.
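To give a feel for what the clustering step does, here is a minimal pure-Python sketch of KMeans (Lloyd's algorithm). It is an illustration only, not the app's implementation; the `kmeans` function, its parameters, and the sample (age, spend) points are all hypothetical, and no feature scaling is applied.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain KMeans on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Hypothetical (age, spend) pairs forming two visibly separate groups.
customers = [(25, 200), (27, 220), (26, 210), (60, 900), (62, 950), (61, 920)]
labels, centroids = kmeans(customers, k=2)
print(labels)
```

Which group receives label 0 and which receives label 1 depends on the random starting centroids, so only the grouping itself is meaningful, not the label numbers.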
- Clustering Analysis: Select features for clustering, and the app will automatically group data into clusters using KMeans. This can reveal natural groupings in the data, such as customer segments.
- Predictive Analysis: Select features and a target variable (e.g., the synthetic Annual Purchase Amount) for linear regression. The app will generate a prediction model, display a mean squared error metric, and show an interactive scatter plot comparing actual vs. predicted values.
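The predictive analysis described above boils down to fitting a line and scoring it with mean squared error. The following is a rough single-feature sketch under stated assumptions: `fit_line`, `mean_squared_error`, and the sample numbers are hypothetical, the app itself lets you pick several features, and for brevity the error is computed on the same data used for fitting.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical data: age as the feature, annual purchase amount as the target.
ages = [20, 30, 40, 50]
amounts = [250.0, 340.0, 460.0, 550.0]

slope, intercept = fit_line(ages, amounts)
predicted = [slope * x + intercept for x in ages]
mse = mean_squared_error(amounts, predicted)
print(slope, intercept, mse)
```

The `amounts` and `predicted` lists are the two series that an actual vs. predicted scatter plot would compare; points close to the diagonal indicate a good fit.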
The default dataset, customers-100000.csv, is located in the data/ directory. If no CSV file is uploaded, this dataset will automatically load, allowing users to test the ETL transformations, visualizations, clustering, and predictive analysis features without needing their own data file.
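The fallback behavior described above might look roughly like the sketch below. It is an assumption about the general shape, not the code in app.py: `read_rows` and `DEFAULT_DATASET` are hypothetical names, and the upload is modeled as a bytes file-like object or `None`, which is how Streamlit's file uploader reports "nothing uploaded".

```python
import csv
import io

DEFAULT_DATASET = "data/customers-100000.csv"  # path named in this README


def read_rows(uploaded=None, default_path=DEFAULT_DATASET):
    """Return CSV rows from the upload if present, otherwise from the default file."""
    if uploaded is not None:
        # Assumes the upload is a bytes file-like object containing UTF-8 text.
        text = uploaded.read().decode("utf-8")
        return list(csv.DictReader(io.StringIO(text)))
    with open(default_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Simulate an upload with an in-memory file; passing None would read the default.
upload = io.BytesIO(b"name,age\nAda,36\n")
uploaded_rows = read_rows(upload)
print(uploaded_rows)
```

Calling `read_rows()` with no arguments reads the bundled dataset, so it only works when run from the repository root where the data/ directory is present.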