PipeDreams - CSV Data Explorer

PipeDreams is a data exploration and visualization tool designed to be simple, flexible, and powerful. Upload any CSV file, perform basic ETL transformations, visualize your data, and gain insights with built-in machine learning—all from a user-friendly Streamlit interface. If no file is uploaded, a default dataset (customers-100000.csv) is used for demonstration.

Features

- CSV Upload: Upload any CSV file for immediate analysis and visualization.
- Default Dataset: If no file is uploaded, the app loads a sample dataset (customers-100000.csv) located in the data/ directory.
- ETL Transformations: Clean and transform data, remove missing values, and auto-convert data types.
- Data Visualization: Interactive charts (scatter, bar, line, histogram, and box plots) to gain insights from your data.
- Clustering Analysis: Use KMeans clustering to identify natural groupings within the data, helping to segment and classify.
- Predictive Analysis: A synthetic column (Annual Purchase Amount) is included for testing linear regression, allowing users to explore predictive analysis features.
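
To make the ETL feature more concrete, here is a minimal sketch of removing missing values and auto-converting data types with pandas. It is not taken from app.py: the function name `basic_etl` and the sample columns are hypothetical, and the app's own cleaning logic may differ.

```python
import io

import pandas as pd


def basic_etl(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop incomplete rows, then convert columns to numbers where possible."""
    # Remove every row that has at least one missing value
    cleaned = df.dropna().reset_index(drop=True)
    for col in cleaned.columns:
        try:
            # Convert the column to a numeric dtype if every value parses
            cleaned[col] = pd.to_numeric(cleaned[col])
        except (ValueError, TypeError):
            # Non-numeric columns (e.g. names) are left unchanged
            pass
    return cleaned


# Illustrative data only; read everything as text so the conversion step is visible
raw = io.StringIO("name,age,spend\nAda,36,120.5\nBob,,80.0\nCy,41,95\n")
df = pd.read_csv(raw, dtype=str)
clean = basic_etl(df)
print(clean.dtypes)
```

The row for Bob is dropped because its age is missing, and the remaining age and spend columns become numeric.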

Getting Started with Docker

You can run PipeDreams using Docker to avoid setting up dependencies locally. The pre-built Docker image is available on DockerHub.

Pulling the Docker Image

Pull the latest Docker image from DockerHub:

```
docker pull anuclei/pipedreams:latest
```

Running the Docker Container

Run the application with Docker, exposing it on port 8501:

```
docker run -p 8501:8501 anuclei/pipedreams:latest
```

Once the container is running, open your browser and go to http://localhost:8501 to access the application.

Kubernetes Deployment

PipeDreams can also be deployed on a Kubernetes cluster. This deployment scenario uses Minikube for local Kubernetes clusters and includes configurations for high availability and autoscaling.

For detailed instructions and YAML configurations, refer to the Kubernetes Deployment Guide in the k8s directory.

Manual Installation

If you prefer not to use Docker, you can set up the app manually.

Prerequisites

Python (version 3.6 or higher)

Installation

Clone the repository:

```
git clone https://github.com/markjacksonfishing/pipedreams.git
cd pipedreams
```

Set up a virtual environment:

- MacOS/Linux:
```
python3 -m venv venv
source venv/bin/activate
```
- Windows:
```
python -m venv venv
.\venv\Scripts\activate
```

Install dependencies:
```
pip install -r requirements.txt
```
Run the application:
- MacOS/Linux:
```
source venv/bin/activate
streamlit run app.py
```
- Windows:
```
.\venv\Scripts\activate
streamlit run app.py
```
The application will open in your default web browser at http://localhost:8501.

Deactivate the virtual environment (when finished):
```
deactivate
```

How to Use

- Start the application by following the setup steps above (or run it via Docker).
- Upload a CSV file using the file uploader in the app, or view the default dataset if no file is uploaded.
- Explore the data with built-in ETL transformations and interactive visualizations.
- Perform clustering analysis and predictive analysis on available data.

Advanced Insights: Clustering and Predictive Analysis

Clustering Analysis: Select features for clustering, and the app will automatically group data into clusters using KMeans. This can reveal natural groupings in the data, such as customer segments.
Predictive Analysis: Select features and a target variable (e.g., the synthetic Annual Purchase Amount) for linear regression. The app will generate a prediction model, display a mean squared error metric, and show an interactive scatter plot comparing actual vs. predicted values.
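
The two analyses described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the code in app.py: the feature values are randomly generated, the cluster count of 2 and the 80/20 train/test split are arbitrary choices, and the target only stands in for a column like Annual Purchase Amount.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical features (think age and income): two well-separated groups
low = rng.normal(loc=[25, 30], scale=2.0, size=(50, 2))
high = rng.normal(loc=[55, 90], scale=2.0, size=(50, 2))
X = np.vstack([low, high])

# Clustering: assign each row to one of two KMeans clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Synthetic target that depends linearly on the second feature, plus noise
y = 100 + 20 * X[:, 1] + rng.normal(scale=5.0, size=len(X))

# Predictive analysis: fit on a training split, score on held-out rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("cluster sizes:", np.bincount(labels))
print("mean squared error:", round(mse, 2))
```

Plotting `y_test` against `y_pred` gives the actual vs. predicted scatter plot: the closer the points lie to the diagonal, the lower the mean squared error.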

Default Dataset: `customers-100000.csv`

The default dataset, customers-100000.csv, is located in the data/ directory. If no CSV file is uploaded, this dataset will automatically load, allowing users to test the ETL transformations, visualizations, clustering, and predictive analysis features without needing their own data file.
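
The fallback behavior can be sketched as a small loader. The function name `load_dataframe` is hypothetical and the real logic in app.py may differ. The sketch relies only on the fact that Streamlit's `st.file_uploader` returns `None` when nothing has been uploaded and a file-like object otherwise.

```python
import io
from pathlib import Path
from typing import IO, Optional

import pandas as pd

# Path of the bundled sample file, relative to the repository root
DEFAULT_CSV = Path("data") / "customers-100000.csv"


def load_dataframe(
    uploaded: Optional[IO] = None, default_path: Path = DEFAULT_CSV
) -> pd.DataFrame:
    """Hypothetical helper: read the uploaded CSV, or the default dataset if none was given."""
    source = uploaded if uploaded is not None else default_path
    return pd.read_csv(source)


# Simulate an upload with an in-memory file; in the app this object
# would come from st.file_uploader instead.
df = load_dataframe(io.StringIO("a,b\n1,2\n"))
print(df.shape)
```

Calling `load_dataframe()` with no argument from the repository root would read data/customers-100000.csv instead.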
