
Provider Data Quality Analytics & Dashboard

image

Team Name: beetrootRaita, Team Members: Adesh Gupta, Aman Behera, Pratham Singla

Video Submission: https://vimeo.com/1116586005?share=copy

Introduction

Healthcare provider data often suffers from duplication, inconsistent formatting, missing identifiers, and invalid or expired licenses, making reliable analysis and compliance tracking challenging. To address this, we developed a comprehensive data analytics and visualization platform that streamlines the workflow:

  1. Data Cleaning & Deduplication: Resolves entities and removes redundant records.

  2. Standardization & Quality Assessment: Ensures consistency across phone numbers, ZIP codes, and provider names.

  3. License Validation & Compliance: Checks state-specific medical board databases, with tailored logic for California and New York.

  4. Interactive Queries: Allows natural language queries via a small language model integrated with a SQL engine.

  5. Analytics Dashboard: Provides real-time insights to support credentialing and compliance management.

Table of Contents

The pipeline of the platform can be divided into the following parts:

  1. Dataset
  2. Preprocessing
    2.1. Provider Entity Resolution & Deduplication
    2.2. Data Quality Assessment & Standardization
    2.3. License Validation & Compliance Tracking
    2.4. Compliance Score Calculation
    2.5. NPI Matching (Present/Not Present)
    2.6. Outlier Removal
    2.7. Final Preprocessed Data Format & Summary
  3. Catering to Natural Language Queries
    3.1. Small Language Model
    3.2. SQL Engine
  4. Interactive Analytics Dashboard
    4.1. Frontend
    4.2. Backend

Requirements

  1. Docker
  2. Docker Compose
  3. Docker Model Runner
  4. NVIDIA-supported GPUs

Dataset

The datasets used in this project include:

  1. Provider Directory Dataset (provider_roster_with_errors.csv)
    Contains 500+ provider records with issues like duplicates, expired licenses, missing information, and formatting inconsistencies.

  2. NY State Medical License Database (ny_medical_license_database.csv)
    Official NY medical license records with license numbers, expiration dates, provider names, and specialties. Used to validate provider licenses in the main directory.

  3. CA State Medical License Database (ca_medical_license_database.csv)
    Similar to the NY dataset but for California. Used for license validation and detecting expired or mismatched records.

  4. Mock NPI Registry (mock_npi_registry.csv)
    Simulated NPI Registry for validating provider identifiers, checking missing NPIs, and cross-verifying basic details like name and specialty.

Preprocessing

Provider Entity Resolution & Deduplication

The dataset contained multiple records referring to the same healthcare provider, so duplicates had to be removed. A naive O(n²) approach over 500 records requires roughly 125,000 pairwise comparisons and took 40–45 seconds. To improve efficiency, we implemented a blocking-based deduplication pipeline, grouping records by keys such as NPI, phone, ZIP3 + name, and address. Comparisons were limited to records within the same block, and candidate pairs were scored using string and address similarity. This reduced runtime to 5–7 seconds while maintaining high accuracy, a roughly 6–9× speedup over the naive method.
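
A minimal sketch of the blocking idea is shown below. It assumes a pandas DataFrame with columns such as npi, practice_phone, practice_zip, last_name, and full_name (illustrative names; the actual pipeline lives in backend/pipeline.py and may differ):

# Sketch of blocking-based deduplication: only records sharing a blocking key are compared.
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd

def block_keys(row):
    """Yield blocking keys for one provider record."""
    if pd.notna(row.get("npi")):
        yield ("npi", row["npi"])
    if pd.notna(row.get("practice_phone")):
        yield ("phone", str(row["practice_phone"]))
    if pd.notna(row.get("practice_zip")) and pd.notna(row.get("last_name")):
        yield ("zip3_name", str(row["practice_zip"])[:3] + "|" + str(row["last_name"]).lower())

def candidate_pairs(df):
    """Group record indices into blocks and emit unique index pairs within each block."""
    blocks = defaultdict(list)
    for idx, row in df.iterrows():
        for key in block_keys(row):
            blocks[key].append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

def name_similarity(a, b):
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()

def find_duplicates(df, threshold=0.85):
    """Score each candidate pair on full-name similarity and keep pairs above the threshold."""
    return [
        (i, j)
        for i, j in candidate_pairs(df)
        if name_similarity(df.at[i, "full_name"], df.at[j, "full_name"]) >= threshold
    ]

Because each pair is only compared when the two records share at least one blocking key, the number of similarity computations drops from O(n²) to roughly the sum of squared block sizes, which is what yields the observed speedup.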

image image

Data Quality Assessment & Standardization

Raw provider data often contained inconsistencies in phone numbers, ZIP codes, names, and addresses, making cross-database matching unreliable. To address this, we designed a standardization pipeline that enforces consistent formatting before downstream processing (a code sketch follows the list):

  • Phone Numbers: Retain only digits, removing symbols and spaces.
    Example: (123)-456-7890 → 1234567890

  • ZIP Codes: Normalized with zero-padding, and ZIP+4 codes converted into the standard hyphenated format.
    Examples: 123 → 00123, 123456789 → 12345-6789

  • Name & Address Fields: Converted to title case to eliminate formatting discrepancies.
    Example: JOHN DOE → John Doe

  • Full Name Reconstruction: The full name is rebuilt dynamically from first name, last name, and credential to ensure consistency across records.
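
The following helpers sketch these normalization rules; the exact rules in the pipeline may differ slightly:

# Illustrative normalization helpers for the standardization step.
import re

def clean_phone(phone):
    """Keep digits only: '(123)-456-7890' -> '1234567890'."""
    return re.sub(r"\D", "", phone or "")

def clean_zip(zip_code):
    """Zero-pad short ZIPs and hyphenate ZIP+4: '123' -> '00123', '123456789' -> '12345-6789'."""
    digits = re.sub(r"\D", "", zip_code or "")
    if len(digits) == 9:
        return f"{digits[:5]}-{digits[5:]}"
    return digits.zfill(5) if digits else ""

def clean_name(name):
    """Title-case names: 'JOHN DOE' -> 'John Doe'."""
    return (name or "").strip().title()

def rebuild_full_name(first, last, credential):
    """Rebuild full_name from its components so it is consistent across records."""
    name = " ".join(p for p in (clean_name(first), clean_name(last)) if p)
    return f"{name}, {credential.strip()}" if credential else name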

License Validation & Compliance Tracking

For license validation, the provider roster had to be joined against two state license tables, each with a different key structure. Providers are first filtered by license state:

  1. California (CA): The system looks up the CA License Table using the license number as the join key.
  2. New York (NY): The system queries the NY License Table using a composite join key of license number plus expiration date to identify the correct record.

This ensures that license status is retrieved accurately according to state-specific requirements.
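
A pandas sketch of these state-specific joins is shown below; the DataFrame and column names (roster, ca_licenses, ny_licenses, expiration_date, status) are illustrative assumptions:

# Sketch of the CA/NY license joins described above.
import pandas as pd

def attach_license_status(roster, ca_licenses, ny_licenses):
    # CA: join on license number only.
    ca = roster[roster["license_state"] == "CA"].merge(
        ca_licenses[["license_number", "status"]],
        on="license_number",
        how="left",
    )
    # NY: join on the composite key (license number, expiration date).
    ny_lic = ny_licenses.rename(columns={"expiration_date": "license_expiration"})
    ny = roster[roster["license_state"] == "NY"].merge(
        ny_lic[["license_number", "license_expiration", "status"]],
        on=["license_number", "license_expiration"],
        how="left",
    )
    # Any other state: no matching license table, so status stays unknown.
    other = roster[~roster["license_state"].isin(["CA", "NY"])].assign(status=pd.NA)
    return pd.concat([ca, ny, other], ignore_index=True)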

image

Compliance Score Calculation

The compliance score was calculated from a combination of data quality metrics: completeness, validity, consistency, uniqueness, accuracy, missing NPI records, and expired licenses. Each metric was measured independently, and the overall score was computed as the average of these dimensions.
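
Since the overall score is an unweighted average of the individual dimensions, the calculation reduces to something like the sketch below (the dimension names and the example values are illustrative only):

# Compliance score as the average of per-dimension quality scores (each assumed 0-100).
def compliance_score(metrics):
    """metrics maps a dimension name (completeness, validity, ...) to a 0-100 score."""
    return sum(metrics.values()) / len(metrics)

# Example with made-up values:
score = compliance_score({
    "completeness": 92.0,
    "validity": 88.5,
    "consistency": 90.0,
    "uniqueness": 95.0,
    "accuracy": 85.0,
})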

image

NPI Matching (Present/Not Present)

NPI was used as the primary identifier for providers. For records with missing NPIs, fuzzy matching on names, phone numbers, license numbers, addresses, and city-state combinations ensured reliable linkage. The merged dataset includes a boolean npi_present flag indicating whether the provider’s NPI exists in the reference NPI roster, helping identify missing or unmatched records.
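
A simplified sketch of the presence check follows: exact match on NPI first, then a fuzzy fallback on the full name (the threshold, column names, and single-attribute fallback are simplifying assumptions):

# Sketch of the npi_present check with a fuzzy fallback.
from difflib import SequenceMatcher

def npi_present(provider, npi_registry, threshold=0.9):
    """Return True if the provider can be linked to a record in the NPI registry DataFrame."""
    if provider.get("npi") and provider["npi"] in set(npi_registry["npi"]):
        return True
    # Fuzzy fallback: compare the full name against every registry entry.
    name = str(provider.get("full_name", "")).lower()
    return any(
        SequenceMatcher(None, name, str(candidate).lower()).ratio() >= threshold
        for candidate in npi_registry["full_name"]
    )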

image

Outlier Removal

Outliers and invalid entries were filtered from the dataset. Records with unrealistic values in years_in_practice (outside 0–60) were removed. Missing or malformed entries in practice phone, ZIP code, and other key fields were standardized or cleaned to ensure data consistency.
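
The years_in_practice filter amounts to a simple range check, sketched here for a pandas DataFrame (bounds taken from the text above):

# Drop records whose years_in_practice falls outside the plausible 0-60 range.
import pandas as pd

def remove_outliers(df):
    years = pd.to_numeric(df["years_in_practice"], errors="coerce")
    return df[years.between(0, 60)].copy()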

Final Preprocessed Data Format & Summary

The dataset contains 30 columns: 28 from the original provider dataset, 1 indicating the license status (from the combined CA and NY datasets), and 1 indicating whether the NPI is present.

# Column Non-Null Count Dtype
0 provider_id 500 non-null object
1 npi 500 non-null int64
2 first_name 500 non-null object
3 last_name 500 non-null object
4 credential 500 non-null object
5 full_name 500 non-null object
6 primary_specialty 500 non-null object
7 practice_address_line1 500 non-null object
8 practice_address_line2 159 non-null object
9 practice_city 500 non-null object
10 practice_state 500 non-null object
11 practice_zip 500 non-null object
12 practice_phone 500 non-null object
13 mailing_address_line1 500 non-null object
14 mailing_address_line2 154 non-null object
15 mailing_city 500 non-null object
16 mailing_state 500 non-null object
17 mailing_zip 500 non-null object
18 license_number 500 non-null object
19 license_state 500 non-null object
20 license_expiration 500 non-null object
21 accepting_new_patients 500 non-null object
22 board_certified 500 non-null bool
23 years_in_practice 500 non-null int64
24 medical_school 500 non-null object
25 residency_program 500 non-null object
26 last_updated 500 non-null object
27 taxonomy_code 500 non-null object
28 status 500 non-null object
29 npi_present 500 non-null bool

Important: Why do we only store the relevant columns?

We limit the schema to relevant columns because the data will be passed to the small language model (SLM). Including hundreds of columns can easily exceed the SLM’s context window, which may lead to incoherent responses or even no response at all.

While preprocessing, we also generate a structured summary that is passed to the frontend for displaying the analysis. This ensures the schema is followed, keeping the data consistent and easily interpretable.

Example of Summary Passed to the Frontend:

{
  "total_records": 524,
  "candidate_pairs": 46229,
  "duplicate_pairs": 28,
  "unique_involved": 44,
  "clusters": 20,
  "outliers_removed": 0,
  "final_records": 500,
  "expired_licenses": 459,
  "missing_npi": 0,
  "providers_available": 165,
  "ca_state": 188,
  "ny_state": 312,
  "formatting_issues": 59,
  "compliance_rate": 8.2,
  "data_quality_score": 87.73
}

Catering to Natural Language Queries

Small Language Model

We hosted a locally optimized small language model (gemma2-it-GGUF) using the llama.cpp engine via a Docker Model Runner service for easy integration. It takes the natural-language query and generates a SQL query, which is fed into the SQL engine; the engine executes the statement and returns the fetched results.
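
The call to the model looks roughly like the sketch below, assuming the Docker Model Runner exposes an OpenAI-compatible chat-completions endpoint at the SQL_MODEL_URL configured in the setup section (the exact path may differ depending on your Model Runner version):

# Sketch of turning a natural-language question into SQL via the model runner.
import os
import requests

MODEL_URL = os.getenv("SQL_MODEL_URL", "http://model-runner.docker.internal:12434")
MODEL_NAME = os.getenv("SQL_MODEL_NAME", "hf.co/unsloth/gemma-3-270m-it-gguf")

def question_to_sql(question, schema):
    """Ask the SLM to translate a question into a MySQL query for the given schema."""
    response = requests.post(
        f"{MODEL_URL}/engines/v1/chat/completions",
        json={
            "model": MODEL_NAME,
            "messages": [
                {"role": "system", "content": f"Translate the question into MySQL for this schema:\n{schema}"},
                {"role": "user", "content": question},
            ],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"].strip()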

image

SQL Engine

A Dockerized MySQL service stores both the original and preprocessed data for the model to query. The model converts natural language queries into SQL statements, which are executed by the SQL engine. The results are then interpreted by the small language model and presented to the user.
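
Executing the generated SQL then comes down to a standard MySQL client call, sketched here with mysql-connector-python and the credentials from the environment variables in the setup section (the backend may use a different client library):

# Sketch of running the generated SQL against the Dockerized MySQL service.
import os
import mysql.connector  # from the mysql-connector-python package

def run_query(sql):
    conn = mysql.connector.connect(
        host="mysql",
        port=3306,
        user=os.getenv("MYSQL_USER", "appuser"),
        password=os.getenv("MYSQL_PASSWORD", "apppassword"),
        database=os.getenv("MYSQL_DATABASE", "company_db"),
    )
    try:
        cursor = conn.cursor(dictionary=True)
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        conn.close()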

image

Interactive Analytics Dashboard

Frontend

The frontend consists of an analytics dashboard, a duplicate analysis page, a provider directory page, and an AI assistant, built with Next.js and shadcn. It includes high-quality plots for deduplication, duplicate analysis, and anomaly detection, ensuring a smooth user experience and in-depth insights.

Analysis Dashboard

image

Duplicate Analysis Page

image

Provider Directory Page

image

AI Assistant Page

image

Plots for insights

Several plots are included to illustrate the results of the data analysis. These visualizations highlight key trends and patterns observed in the dataset. Examples are shown below:

Experience distribution by specialty

image

Provider count by city (Top 20)

image

Duplicate provider detection network

image

🚀 Setup Instructions

Prerequisites

Before you begin, ensure you have the following installed on your system:

  • Docker (version 20.10.0 or higher)
  • Docker Compose (version 2.0.0 or higher)
  • Docker Model Runner (for AI model hosting)
  • Git (for cloning the repository)
  • NVIDIA GPU (recommended for optimal AI model performance)

System Requirements

  • Memory: Minimum 8GB RAM (16GB recommended)
  • Storage: At least 10GB free disk space
  • CPU: Multi-core processor (4+ cores recommended)
  • GPU: NVIDIA GPU with CUDA support (optional but recommended)

Installation Steps

1. Clone the Repository

git clone https://github.com/4adex/hilabs.git
cd hilabs

2. Verify Project Structure

Ensure your project directory contains:

├── docker-compose.yml
├── backend/
│   ├── Dockerfile
│   ├── main.py
│   ├── pipeline.py
│   └── requirements.txt
├── dash-hilabs/
│   ├── Dockerfile
│   ├── package.json
│   └── app/
├── data/
│   ├── ca.csv
│   ├── ny.csv
│   ├── npi.csv
│   └── provider_roster_with_errors.csv
└── assets/

3. Set Up Docker Model Runner

Install and configure Docker Model Runner for AI model hosting:

# Install Docker Model Runner (if not already installed)
# Follow the official installation guide for your OS
# Ensure it's running on the default port 12434

4. Configure Environment Variables (Optional)

Create a .env file in the project root for custom configurations:

# Database Configuration
MYSQL_ROOT_PASSWORD=rootpassword
MYSQL_DATABASE=company_db
MYSQL_USER=appuser
MYSQL_PASSWORD=apppassword

# API Configuration
BACKEND_PORT=8000
FRONTEND_PORT=3000
MYSQL_PORT=3307

# AI Model Configuration
SQL_MODEL_URL=http://model-runner.docker.internal:12434
SQL_MODEL_NAME=hf.co/unsloth/gemma-3-270m-it-gguf

5. Build and Start Services

# Build and start all services in detached mode
docker-compose up --build -d

# Alternative: Start with live logs
docker-compose up --build

6. Verify Service Status

Check that all services are running correctly:

# Check service status
docker-compose ps

# View logs for specific services
docker-compose logs backend
docker-compose logs frontend-new
docker-compose logs mysql

7. Initialize Database

The database will be automatically initialized when the backend service starts. You can verify the initialization:

# Check backend logs for database initialization
docker-compose logs backend | grep -i "database\|mysql\|table"

8. Access the Application

Once all services are running, the application is available at:

  • Frontend dashboard: http://localhost:3000
  • Backend API: http://localhost:8000
  • MySQL: localhost:3307

Troubleshooting

Common Issues and Solutions

1. Port Conflicts

# Check if ports are already in use
sudo netstat -tulpn | grep -E ':(3000|8000|3307)'

# Stop conflicting services or change ports in docker-compose.yml

2. Docker Model Runner Connection Issues

# Verify Docker Model Runner is running
curl http://localhost:12434/health

# Restart Docker Model Runner if needed
# Check Docker Model Runner documentation for specific commands

3. Database Connection Issues

# Restart MySQL service
docker-compose restart mysql

# Check MySQL logs
docker-compose logs mysql

4. Frontend Build Issues

# Rebuild frontend with no cache
docker-compose build --no-cache frontend-new

5. Memory Issues

# Check Docker memory usage
docker stats

# Increase Docker memory limits in Docker Desktop settings

Service Health Checks

# Check if backend API is responsive
curl http://localhost:8000/health

# Check frontend accessibility
curl http://localhost:3000

# Test database connection
docker-compose exec mysql mysql -u appuser -p -e "SHOW DATABASES;"

Development Setup

For development purposes, you can run services individually:

Backend Development

cd backend
pip install -r requirements.txt
uvicorn main:app --reload --host 0.0.0.0 --port 8000

Frontend Development

cd dash-hilabs
npm install
npm run dev

Data Management

Loading Custom Data

  1. Place your CSV files in the data/ directory
  2. Update the data paths in the backend configuration
  3. Restart the backend service
# Restart backend to reload data
docker-compose restart backend

Backup and Restore

# Backup database
docker-compose exec mysql mysqldump -u root -p company_db > backup.sql

# Restore database
docker-compose exec -T mysql mysql -u root -p company_db < backup.sql

Stopping Services

# Stop all services
docker-compose down

# Stop and remove volumes (clears database)
docker-compose down -v

# Stop and remove all containers, networks, and images
docker-compose down --rmi all

Getting Started with the Application

  1. Access the Dashboard: Open http://localhost:3000 in your browser
  2. Explore Analytics: Navigate through the analytics dashboard to view data insights
  3. Use Natural Language Queries: Try asking questions like "Show me providers in California" or "How many licenses are expired?"
  4. Review Data Quality: Check the data quality reports and duplicate detection results
  5. Manage Providers: Use the provider directory to search and filter healthcare providers

Backend

The backend is implemented with FastAPI and handles the preprocessing pipeline as well as SQL connections, providing APIs for the dashboard and the natural language query engine.
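
As a rough illustration of the backend's shape (not the project's actual routes, which live in backend/main.py), the FastAPI app exposes endpoints such as the /health check probed in the troubleshooting section:

# Minimal, illustrative sketch of the FastAPI backend shape.
from fastapi import FastAPI

app = FastAPI(title="Provider Data Quality API")

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/query")  # hypothetical route name for the natural-language query flow
def query(payload: dict):
    # In the real backend this step would call the SLM and SQL engine described above.
    return {"sql": "...", "rows": []}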
