-
Notifications
You must be signed in to change notification settings - Fork 3.3k
feat(ingestion): use approx_distinct when profiling Athena and Trino #13671
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ingestion): use approx_distinct when profiling Athena and Trino #13671
Conversation
Bundle ReportChanges will decrease total bundle size by 355 bytes (-0.0%) ⬇️. This is within the configured threshold ✅ Detailed changes
Affected Assets, Files, and Routes:view changes for bundle: datahub-react-web-esmAssets Changed:
|
Codecov ReportAll modified and coverable lines are covered by tests ✅ 📢 Thoughts on this report? Let us know! |
ec76fa7 to
3a3703f
Compare
metadata-ingestion/src/datahub/ingestion/source/ge_data_profiler.py
Outdated
Show resolved
Hide resolved
| expr = sa.func.APPROX_COUNT_DISTINCT(sa.column(column)) | ||
| elif ( | ||
| self.engine.dialect.name.lower() == AWSATHENA | ||
| or self.engine.dialect.name.lower() == TRINO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have trino tests that validate this logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the Trino integration tests include profiling of unique counts. Most of the unique counts in the test data are small and unchanged.
One difference that came up: approx_distinct can process JSON-typed columns whereas count(DISTINCT ...) errors out. So more stats will now be available for JSON columns!
Having a unique count also triggers distinct unique value frequency profiling in cases of low cardinality. That logic didn't have any exception handling, and failed trying to SORT BY a JSON column. I included fixes for both of these issues in the PR.
3a3703f to
21d0c32
Compare
21d0c32 to
752cf71
Compare
752cf71 to
94d914a
Compare
94d914a to
885b60b
Compare
885b60b to
be703c2
Compare
…nct_value_frequencies Great Expectations runs SORT BY in SQL by default, but not all column types are sortable (such as JSON data types on Athena/Trino). We could pass sort="none" and sort it in Python instead, but writing the SQL ourselves is straightforward and seems less bug-prone than relying on stringly-typed arguments into an external library.
be703c2 to
6780a22
Compare
No description provided.