A production-minded metadata governance service built in Python to scan database schemas, normalize metadata into a standardized data dictionary, and persist it reliably using idempotent, hash-driven ingestion patterns.
This project reflects how I design and reason about metadata systems after working extensively with enterprise data platforms—where correctness, repeatability, and operational resilience are non-negotiable.
The goal is not to demonstrate a framework, but to model how a real metadata ingestion service behaves in production when run repeatedly, monitored, and trusted by downstream consumers.
In most data platforms, metadata is treated as a by-product rather than a first-class asset.
That leads to duplication, drift, inconsistent catalogs, and brittle governance workflows.
This toolkit is built on the belief that:
- Metadata ingestion must be deterministic
- Re-runs should be safe and idempotent
- Failures should be observable and debuggable
- Local development should be frictionless, with a clear path to scale
Everything in this repository follows those principles.
Every design decision mirrors patterns I use in production systems:
- Idempotent ingestion using stable hash keys to prevent duplication (sketched just after this list)
- Clear separation of concerns between scanning, transformation, and persistence
- Explicit job lifecycle tracking for observability and failure analysis
- Operational hygiene through logging, disk monitoring, and cleanup routines
- Local-first defaults with environment-driven configuration for real databases
This is intentionally opinionated. The goal is correctness and reliability over convenience.
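To make the hash-key idea concrete, here is a minimal sketch; the field names and hashing choices are illustrative, not the repository's exact implementation. The key point is that a record is canonicalized before hashing, so the same logical metadata always produces the same key no matter how the record was assembled.

```python
import hashlib
import json


def metadata_hash_key(record: dict) -> str:
    """Compute a stable, order-independent hash for a metadata record.

    Illustrative sketch: the real service may hash different fields, but
    the principle is that the same (schema, table, column) definition
    always yields the same key, so re-ingestion cannot duplicate rows.
    """
    # Canonicalize: sorted keys and fixed separators make the JSON
    # serialization deterministic across runs and Python versions.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same logical record always yields the same key, regardless of dict order:
a = metadata_hash_key({"schema": "main", "table": "users", "column": "id", "type": "INTEGER"})
b = metadata_hash_key({"column": "id", "type": "INTEGER", "schema": "main", "table": "users"})
assert a == b
```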
What this project demonstrates:

- Production-grade Python system design beyond scripts and notebooks
- Hands-on experience with metadata ingestion and governance workflows
- Safe, repeatable persistence using hash-based upsert strategies
- API-driven orchestration with clear execution semantics (see the endpoint sketch after this list)
- Operational awareness (logging, monitoring, cleanup, failure visibility)
- Extensible architecture suitable for real enterprise integrations
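As a hedged illustration of those execution semantics, here is a minimal sketch of a trigger endpoint with explicit job lifecycle tracking. The helper name `run_scan_job` and the in-memory `JOBS` store are illustrative stand-ins for the real MetaDB-backed implementation.

```python
import uuid
from datetime import datetime, timezone

from fastapi import FastAPI

app = FastAPI()
JOBS: dict[str, dict] = {}  # stand-in for the MetaDB jobs table


def run_scan_job(job_id: str) -> None:
    """Placeholder for the real pipeline: scan -> transform -> persist."""


@app.post("/scan/trigger")
def trigger_scan() -> dict:
    """Record a job, run it, and persist the outcome either way."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "running",
                    "started_at": datetime.now(timezone.utc).isoformat()}
    try:
        run_scan_job(job_id)
        JOBS[job_id]["status"] = "succeeded"
    except Exception as exc:  # failures stay visible, never swallowed
        JOBS[job_id].update(status="failed", error=str(exc))
    JOBS[job_id]["finished_at"] = datetime.now(timezone.utc).isoformat()
    return {"job_id": job_id, "status": JOBS[job_id]["status"]}
```

Recording the terminal status and timestamps on every path, success or failure, is what makes `GET /jobs` useful for failure analysis.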
GitHub renders Mermaid diagrams automatically in README.md.
```mermaid
flowchart TD
    A["Source Database\n(SQLite / Postgres / MySQL)"] --> B["Schema Scanner\n(SQLAlchemy Inspector)"]
    B --> C["Transformer\n(Standardized Data Dictionary Model)"]
    C --> D["MetaDB\n(Idempotent Upserts via Hash Keys)"]
    D --> E["FastAPI Service\n(/scan, /jobs, /dictionary)"]
    E --> F["Consumers\n(Catalog / Governance / Analytics)"]
    E --> G["Ops Utilities\n(logging, cleanup, disk monitor)"]
```
```mermaid
flowchart LR
    subgraph API["FastAPI Layer"]
        R1["POST /scan/trigger"]
        R2["GET /jobs"]
        R3["GET /dictionary"]
    end
    subgraph Core["Core Runtime"]
        JR["Job Runner"]
        TX["Transformer"]
    end
    subgraph Scan["Scanning"]
        SC["SQLAlchemy Scanner"]
    end
    subgraph Store["Storage"]
        MDB["MetaDB (SQLite)"]
        SDB["Source DB"]
    end
    R1 --> JR
    JR --> SC
    SC --> SDB
    JR --> TX
    TX --> MDB
    R2 --> MDB
    R3 --> MDB
```
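The scanning step in the diagrams above is built on SQLAlchemy's Inspector API. A minimal sketch of that step, assuming a simple record shape (the function name and fields are illustrative):

```python
from sqlalchemy import create_engine, inspect


def scan_schema(source_db_url: str) -> list[dict]:
    """Return one record per column per table in the source database."""
    engine = create_engine(source_db_url)
    inspector = inspect(engine)
    records = []
    for table_name in inspector.get_table_names():
        # get_columns() returns dicts with name, type, nullable, etc.
        for column in inspector.get_columns(table_name):
            records.append({
                "table": table_name,
                "column": column["name"],
                "type": str(column["type"]),
                "nullable": column["nullable"],
            })
    return records


# Works against the local demo database or any SQLAlchemy-supported URL:
print(scan_schema("sqlite:///./.metadb/source_demo.sqlite"))
```

Because the Inspector abstracts over dialects, the same scanner code path serves SQLite, Postgres, and MySQL.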
The project defaults to a local SQLite setup for zero-friction onboarding.
No external database is required to run or evaluate the system.
Prerequisites:

- Python 3.12+
```bash
git clone <your-repo-url>
cd metadata-governance-toolkit
python -m venv .venv
source .venv/bin/activate
pip install .
mkdir -p .metadb
python -m uvicorn mgt.api.app:app --reload
```

Open the Swagger UI in your browser (FastAPI serves it at http://127.0.0.1:8000/docs by default).
Verify the service is up, then trigger a scan:

```bash
curl http://127.0.0.1:8000/health
curl -X POST http://127.0.0.1:8000/scan/trigger
```

This scans the source database schema and persists the results into MetaDB using idempotent logic.

List recent jobs:

```bash
curl "http://127.0.0.1:8000/jobs?limit=10"
```

Returns execution status, timestamps, and error details (if any).

Query the data dictionary:

```bash
curl "http://127.0.0.1:8000/dictionary?limit=20"
```

Returns normalized metadata records (one row per column per object).
Re-running scans will update existing entries, not duplicate them.
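Under the hood, that idempotency typically comes from upserting on the stable hash key. A minimal sketch using SQLite's `ON CONFLICT` clause (the table and column names are illustrative, not the repository's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS data_dictionary (
        hash_key    TEXT PRIMARY KEY,
        table_name  TEXT,
        column_name TEXT,
        column_type TEXT
    )
""")


def upsert(hash_key: str, table: str, column: str, col_type: str) -> None:
    # ON CONFLICT makes the write idempotent: a re-run carrying the same
    # hash_key updates the existing row instead of inserting a duplicate.
    conn.execute(
        """INSERT INTO data_dictionary (hash_key, table_name, column_name, column_type)
           VALUES (?, ?, ?, ?)
           ON CONFLICT(hash_key) DO UPDATE SET
               table_name  = excluded.table_name,
               column_name = excluded.column_name,
               column_type = excluded.column_type""",
        (hash_key, table, column, col_type),
    )


for _ in range(3):  # simulate three identical scan runs
    upsert("abc123", "users", "id", "INTEGER")
assert conn.execute("SELECT COUNT(*) FROM data_dictionary").fetchone()[0] == 1
```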
Create this folder and add PNGs:

```
docs/images/
```

Recommended files:

- docs/images/swagger-ui.png
- docs/images/trigger-scan.png
- docs/images/jobs.png
- docs/images/dictionary.png

Once you commit those files, they will render here.
```bash
docker build -t metadata-governance-toolkit .
docker run -p 8000:8000 metadata-governance-toolkit
```

All configuration is environment-driven.
Example (`.env.example`):

```
META_DB_URL=sqlite:///./.metadb/metadb.sqlite
SOURCE_DB_URL=sqlite:///./.metadb/source_demo.sqlite
LOG_LEVEL=INFO
```

The local SQLite setup is intended for development.
The same code path supports Postgres or other relational databases via SQLAlchemy.
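A minimal sketch of that pattern, reading the same variables as the example above and falling back to the local SQLite defaults (the Postgres URL in the comment is an example, not a required dependency):

```python
import os

from sqlalchemy import create_engine

# Environment-driven configuration with local-first defaults.
META_DB_URL = os.getenv("META_DB_URL", "sqlite:///./.metadb/metadb.sqlite")
SOURCE_DB_URL = os.getenv("SOURCE_DB_URL", "sqlite:///./.metadb/source_demo.sqlite")

# Pointing at Postgres means changing only the URL, e.g.
# SOURCE_DB_URL=postgresql+psycopg://user:pass@host:5432/dbname
meta_engine = create_engine(META_DB_URL)
source_engine = create_engine(SOURCE_DB_URL)
```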
This repository includes:
- Automated test execution via GitHub Actions
- Clean packaging using a src-layout
- Explicit dependency management
The intent is to keep the project runnable, verifiable, and reviewable at any point in time.
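As an example of the kind of check the CI pipeline can run end-to-end, here is a hedged sketch of an idempotency test (it assumes `/dictionary` returns a JSON list; the test body is illustrative, not the repository's actual suite):

```python
from fastapi.testclient import TestClient

from mgt.api.app import app

client = TestClient(app)


def test_rescan_is_idempotent():
    # Scan twice; the dictionary must not grow on the second run.
    client.post("/scan/trigger")
    first = len(client.get("/dictionary?limit=1000").json())
    client.post("/scan/trigger")
    second = len(client.get("/dictionary?limit=1000").json())
    assert second == first  # re-runs update, never duplicate
```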
- Runtime data (`.metadb/`, logs, SQLite files) is intentionally excluded from version control
- The demo database is auto-created locally to keep onboarding simple
- The architecture is designed to extend naturally toward:
  - async job execution
  - metadata export connectors (Collibra / DataHub / Atlas)
  - lineage and catalog integrations
This project is less about “how to use FastAPI” and more about how metadata systems should be built: deterministic, observable, and safe to run every day.
That mindset is what I bring to real production data platforms.



