Team Name: beetrootRaita , Team Members: Adesh Gupta , Aman Behera , Pratham Singla
Video Submission: https://vimeo.com/1116586005?share=copy
Healthcare provider data often suffers from duplication, inconsistent formatting, missing identifiers, and invalid or expired licenses, making reliable analysis and compliance tracking challenging. To address this, we developed a comprehensive data analytics and visualization platform that streamlines the workflow:
-
Data Cleaning & Deduplication: Resolves entities and removes redundant records.
-
Standardization & Quality Assessment: Ensures consistency across phone numbers, ZIP codes, and provider names.
-
License Validation & Compliance: Checks state-specific medical board databases, with tailored logic for California and New York.
-
Interactive Queries: Allows natural language queries via a small language model integrated with a SQL engine.
-
Analytics Dashboard: Provides real-time insights to support credentialing and compliance management.
The pipeline of the platform can be divided into the following parts:
- Dataset
- Preprocessing
2.1. Provider Entity Resolution & Deduplication
2.2. Data Quality Assessment & Standardization
2.3. License Validation & Compliance Tracking
2.4. Compliance Score Calculation
2.5. NPI Matching (Present/Not Present)
2.6. Outlier Removal
2.7. Final Preprocessed Data Format & Summary - Catering Natural Language Queries
3.1. Small Language Model
3.2. SQL Engine - Interactive Analytics Dashboard
4.1. Frontend
4.2. Backend
- Docker
- Docker Compose
- Docker Model Runner
- NVIDIA-supported GPUs
The datasets used in this project include:
-
Provider Directory Dataset (
provider roster with errors.csv)
Contains 500+ provider records with issues like duplicates, expired licenses, missing information, and formatting inconsistencies. -
NY State Medical License Database (
ny medical license database.csv)
Official NY medical license records with license numbers, expiration dates, provider names, and specialties. Used to validate provider licenses in the main directory. -
CA State Medical License Database (
ca medical license database.csv)
Similar to the NY dataset but for California. Used for license validation and detecting expired or mismatched records. -
Mock NPI Registry (
mock npi registry.csv)
Simulated NPI Registry for validating provider identifiers, checking missing NPIs, and cross-verifying basic details like name and specialty.
The dataset contained multiple records referring to the same healthcare provider, requiring duplicate removal. A naive O(n²) approach for 500 records involved ~125,000 comparisons and took 40–45 seconds. To improve efficiency, we implemented a blocking-based deduplication pipeline, grouping records by keys like NPI, phone, ZIP3 + name, and address. Comparisons were limited to each block, and candidate pairs were scored using string and address similarity. This approach reduced runtime to 5–7 seconds while maintaining high accuracy, achieving over 10× speedup compared to the naive method.
Raw provider data often contained inconsistencies in phone numbers, ZIP codes, names, and addresses, making cross-database matching unreliable. To address this, we designed a standardization pipeline to enforce consistent formatting before downstream processing:
-
Phone Numbers: Retain only digits, removing symbols and spaces.
Example:(123)-456-7890→1234567890 -
ZIP Codes: Normalized with zero-padding and converted ZIP+4 codes into standard format.
Examples:123→00123,123456789→12345-6789 -
Name & Address Fields: Converted to title case to eliminate formatting discrepancies.
Example:JOHN DOE→John Doe -
Full Name Reconstruction: Rebuilt full name dynamically from first name, last name, and credential to ensure consistency across records.
For license validation, we needed to merge two tables, each having a different key structure. Providers are first filtered based on license state:
- California (CA): System looks up the CA License Table using license number as join key.
- New York (NY): The system queries the NY License Table using a composite join key of license number plus expiration date to identify the correct record.
This ensures that license status is retrieved accurately according to state-specific requirements.
The compliance score was calculated using a combination of data quality metrics: completeness, validity, consistency, uniqueness, accuracy, missing NPI records & expired licenses. Each metric was measured independently, and the overall score was computed as the average of these dimensions.
NPI was used as the primary identifier for providers. For missing NPIs, fuzzy matching using names, phone numbers, license numbers, addresses, and city-state combinations ensured reliable linkage. The merged dataset includes a boolean npi present flag indicating whether the provider’s NPI exists in the reference NPI roster, helping identify missing or unmatched records.
Outliers and invalid entries were filtered from the dataset. Records with unrealistic values in years_in_practice (outside 0–60) were removed. Missing or malformed entries in practice phone, ZIP code, and other key fields were standardized or cleaned to ensure data consistency.
The dataset contains 30 columns: 28 from the original provider dataset, 1 indicating the license status (from the combined CA and NY datasets), and 1 indicating whether the NPI is present.
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | provider_id | 500 non-null | object |
| 1 | npi | 500 non-null | int64 |
| 2 | first_name | 500 non-null | object |
| 3 | last_name | 500 non-null | object |
| 4 | credential | 500 non-null | object |
| 5 | full_name | 500 non-null | object |
| 6 | primary_specialty | 500 non-null | object |
| 7 | practice_address_line1 | 500 non-null | object |
| 8 | practice_address_line2 | 159 non-null | object |
| 9 | practice_city | 500 non-null | object |
| 10 | practice_state | 500 non-null | object |
| 11 | practice_zip | 500 non-null | object |
| 12 | practice_phone | 500 non-null | object |
| 13 | mailing_address_line1 | 500 non-null | object |
| 14 | mailing_address_line2 | 154 non-null | object |
| 15 | mailing_city | 500 non-null | object |
| 16 | mailing_state | 500 non-null | object |
| 17 | mailing_zip | 500 non-null | object |
| 18 | license_number | 500 non-null | object |
| 19 | license_state | 500 non-null | object |
| 20 | license_expiration | 500 non-null | object |
| 21 | accepting_new_patients | 500 non-null | object |
| 22 | board_certified | 500 non-null | bool |
| 23 | years_in_practice | 500 non-null | int64 |
| 24 | medical_school | 500 non-null | object |
| 25 | residency_program | 500 non-null | object |
| 26 | last_updated | 500 non-null | object |
| 27 | taxonomy_code | 500 non-null | object |
| 28 | status | 500 non-null | object |
| 29 | npi_present | 500 non-null | bool |
As shown in the table, the dataset contains 30 columns: 28 from the original provider dataset, 1 indicating the license status (from the combined CA and NY datasets), and 1 indicating whether the NPI is present or not.
Important: Why do we only store the relevant columns?
We limit the schema to relevant columns because the data will be passed to the small language model (SLM). Including hundreds of columns can easily exceed the SLM’s context window, which may lead to incoherent responses or even no response at all.
While preprocessing, we also generate a structured summary that is passed to the frontend for displaying the analysis. This ensures the schema is followed, keeping the data consistent and easily interpretable.
Example of Summary Passed to the Frontend:
{
"total_records": 524,
"candidate_pairs": 46229,
"duplicate_pairs": 28,
"unique_involved": 44,
"clusters": 20,
"outliers_removed": 0,
"final_records": 500,
"expired_licenses": 459,
"missing_npi": 0,
"providers_available": 165,
"ca_state": 188,
"ny_state": 312,
"formatting_issues": 59,
"compliance_rate": 8.2,
"data_quality_score": 87.73
}We hosted a locally optimized small language model (gemma2-it-GGUF) using the llama.cpp engine via a Docker model runner service for easy integration. It takes the Natural Language Query and generates the SQL query which is fed into the SQL Engine, which runs the commands and returns the fetched results.
A Dockerized MySQL service stores both the original and preprocessed data for the model to query. The model converts natural language queries into SQL statements, which are executed by the SQL engine. The results are then interpreted by the small language model and presented to the user.
The frontend consists of an analytics dashboard , duplicate analysis page, provider directory page, AI Assistant built using Next.js and shadcn. It includes high-quality plots for deduplication, duplicate analysis, anomaly detection, ensuring a smooth user experience and in-depth insights.
Analysis Dashboard
Duplicate Analysis Page
Provider Directort Page
AI Assistant Page
Several plots have been included to illustrate the results of the data analysis.These visualizations help highlight key trends and patterns observed in the dataset. Examples are shown below:
Experience distribution by speciality
Provider count by city (Top 20)
Duplicate provider detection network
Before you begin, ensure you have the following installed on your system:
- Docker (version 20.10.0 or higher)
- Docker Compose (version 2.0.0 or higher)
- Docker Model Runner (for AI model hosting)
- Git (for cloning the repository)
- NVIDIA GPU (recommended for optimal AI model performance)
- Memory: Minimum 8GB RAM (16GB recommended)
- Storage: At least 10GB free disk space
- CPU: Multi-core processor (4+ cores recommended)
- GPU: NVIDIA GPU with CUDA support (optional but recommended)
git clone https://github.com/4adex/hilabs.git
cd docker-compose-ai-appEnsure your project directory contains:
├── docker-compose.yml
├── backend/
│ ├── Dockerfile
│ ├── main.py
│ ├── pipeline.py
│ └── requirements.txt
├── dash-hilabs/
│ ├── Dockerfile
│ ├── package.json
│ └── app/
├── data/
│ ├── ca.csv
│ ├── ny.csv
│ ├── npi.csv
│ └── provider_roster_with_errors.csv
└── assets/
Install and configure Docker Model Runner for AI model hosting:
# Install Docker Model Runner (if not already installed)
# Follow the official installation guide for your OS
# Ensure it's running on the default port 12434Create a .env file in the project root for custom configurations:
# Database Configuration
MYSQL_ROOT_PASSWORD=rootpassword
MYSQL_DATABASE=company_db
MYSQL_USER=appuser
MYSQL_PASSWORD=apppassword
# API Configuration
BACKEND_PORT=8000
FRONTEND_PORT=3000
MYSQL_PORT=3307
# AI Model Configuration
SQL_MODEL_URL=http://model-runner.docker.internal:12434
SQL_MODEL_NAME=hf.co/unsloth/gemma-3-270m-it-gguf# Build and start all services in detached mode
docker-compose up --build -d
# Alternative: Start with live logs
docker-compose up --buildCheck that all services are running correctly:
# Check service status
docker-compose ps
# View logs for specific services
docker-compose logs backend
docker-compose logs frontend-new
docker-compose logs mysqlThe database will be automatically initialized when the backend service starts. You can verify the initialization:
# Check backend logs for database initialization
docker-compose logs backend | grep -i "database\|mysql\|table"Once all services are running:
- Frontend Dashboard: http://localhost:3000
- Backend API: http://localhost:8000
- API Documentation: http://localhost:8000/docs
- MySQL Database: localhost:3307
1. Port Conflicts
# Check if ports are already in use
sudo netstat -tulpn | grep -E ':(3000|8000|3307)'
# Stop conflicting services or change ports in docker-compose.yml2. Docker Model Runner Connection Issues
# Verify Docker Model Runner is running
curl http://localhost:12434/health
# Restart Docker Model Runner if needed
# Check Docker Model Runner documentation for specific commands3. Database Connection Issues
# Restart MySQL service
docker-compose restart mysql
# Check MySQL logs
docker-compose logs mysql4. Frontend Build Issues
# Rebuild frontend with no cache
docker-compose build --no-cache frontend-new5. Memory Issues
# Check Docker memory usage
docker stats
# Increase Docker memory limits in Docker Desktop settings# Check if backend API is responsive
curl http://localhost:8000/health
# Check frontend accessibility
curl http://localhost:3000
# Test database connection
docker-compose exec mysql mysql -u appuser -p -e "SHOW DATABASES;"For development purposes, you can run services individually:
cd backend
pip install -r requirements.txt
uvicorn main:app --reload --host 0.0.0.0 --port 8000cd dash-hilabs
npm install
npm run dev- Place your CSV files in the
data/directory - Update the data paths in the backend configuration
- Restart the backend service
# Restart backend to reload data
docker-compose restart backend# Backup database
docker-compose exec mysql mysqldump -u root -p company_db > backup.sql
# Restore database
docker-compose exec -T mysql mysql -u root -p company_db < backup.sql# Stop all services
docker-compose down
# Stop and remove volumes (clears database)
docker-compose down -v
# Stop and remove all containers, networks, and images
docker-compose down --rmi all- Access the Dashboard: Open http://localhost:3000 in your browser
- Explore Analytics: Navigate through the analytics dashboard to view data insights
- Use Natural Language Queries: Try asking questions like "Show me providers in California" or "How many licenses are expired?"
- Review Data Quality: Check the data quality reports and duplicate detection results
- Manage Providers: Use the provider directory to search and filter healthcare providers
The backend is implemented with FastAPI and handles the preprocessing pipeline as well as SQL connections, providing APIs for the dashboard and the natural language query engine.