A simple data quality tool. Collect and publish metrics about the quality of your data, wherever it lives.
- [X] get metrics
- [X] run tests
- [ ] publish metrics to aws
- [ ] publish metrics to prometheus
- [ ] publish metadata to DataHub
- [ ] build dashboards
- [ ] alerts based on CloudWatch
- [ ] other cloud providers?
- [ ] multiple data sources
- [ ] example dag with airflow
- [ ] example with prefect
- [ ] reconciliation between two data sources: % missing, matching vs. mismatched columns
Download from https://github.com/warfox/dqt
`dqt` is a command-line tool that runs on the JVM.
Make sure the JDBC driver for your database is on the classpath.
java -jar dqt.jar run -d datasource.edn -t table.edn
java -cp "/path/to/jdbc/driver/jar/:./dqt.jar" dqt.core run -d examples/postgres.edn -t examples/tables/employees.edn
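The two commands above differ in one important way: when `java` is started with `-jar`, other classpath settings are ignored, so a separate driver jar has to be supplied with `-cp` and the main class (`dqt.core`) named explicitly. The sketch below only assembles and prints such a command. The driver file name `postgresql.jar` is a hypothetical placeholder, and the `:` separator assumes Linux or macOS (Windows uses `;`).

```shell
#!/bin/sh
# Hypothetical driver location; point this at the JDBC driver jar you downloaded.
DRIVER_JAR="/path/to/jdbc/driver/postgresql.jar"
DQT_JAR="./dqt.jar"

# Join both jars into one classpath string (':' is the Linux/macOS separator).
CP="${DRIVER_JAR}:${DQT_JAR}"

# Assemble the command as text and print it instead of running it,
# so this sketch is safe to try without a database or the jars in place.
CMD="java -cp $CP dqt.core run -d examples/postgres.edn -t examples/tables/employees.edn"
echo "$CMD"
```

Once the printed command looks right for your paths, run it directly in the shell.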
Run the project directly, via `:main-opts` (`-m dqt.core`):
$ clojure -M:run
Run the project with parameters:
$ clojure -M:run -d datasource.edn -t table.edn
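The example datasource file shown further down sets `:host` to `#or [#env DATABASE_HOSTNAME "localhost"]`. That reads like an "environment variable with fallback" form (the same tags exist in the Aero configuration library; whether `dqt` uses Aero is an assumption here). Under that reading, the host can be changed per run without editing the file. The sketch below mimics the fallback in plain shell. `resolve_host` and the hostname `db.example.internal` are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Shell analogue of the fallback: use DATABASE_HOSTNAME when it is set,
# otherwise fall back to "localhost".
resolve_host() {
  echo "${DATABASE_HOSTNAME-localhost}"
}

unset DATABASE_HOSTNAME
resolve_host                                                 # prints: localhost

# Set the variable in a subshell so the override does not leak out.
( DATABASE_HOSTNAME=db.example.internal; resolve_host )      # prints: db.example.internal

# A per-run override would then look like this (printed, not executed here):
echo 'DATABASE_HOSTNAME=db.example.internal clojure -M:run -d datasource.edn -t table.edn'
```

Keeping the hostname in the environment means the same datasource file can serve local development and a shared database.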
Run the project’s tests (they’ll fail until you edit them):
$ clojure -T:build test
$ ./bin/kaocha
Build the uberjar:
$ clojure -T:build uberjar
This will produce an updated pom.xml file with synchronized dependencies inside the META-INF
directory inside target/classes and the uberjar in target. You can update the version (and SCM tag)
information in generated pom.xml by updating build.clj.
If you don’t want the pom.xml file in your project, you can remove it. The ci task will
still generate a minimal pom.xml as part of the uber task, unless you remove version
from build.clj.
Run that uberjar:
$ java -jar target/dqt-0.1.0-SNAPSHOT.jar
If you remove version from build.clj, the uberjar will become target/dqt-standalone.jar.
FIXME: listing of options this app accepts.
{:dbtype "postgresql"
:dbname "postgres"
:host #or [#env DATABASE_HOSTNAME "localhost"]
:user "postgres"
:password "postgres"
:ssl false
:classname "org.postgresql.Driver"
:sslfactory "org.postgresql.ssl.NonValidatingFactory"}

{:table-name :employees
:metrics [:row-count
:avg-length
:max-length
:min-length
:avg
:sum
:max
:min
:stddev
:variance]
:tests [[:row-count > 10]
[:avg-length-phone-number < 13]
[:stddev-salary > 4500]
[:sum-salary > 20000]
[:max-length-email < 30]]}

clj -M:dev:run run -d datasource.edn -t tables/employees.edn
bb run-example
bb dev
Run `docker compose up` to have Postgres running.
bb migrate
clj -M:dev:test
clj -M:dev:test --watch
bb test
bb test:watch
$ bin/kaocha
$ bin/kaocha --watch
Copyright © 2021 Warfox
Distributed under the MIT License.