Skip to content

Conversation

@ligfx
Copy link
Contributor

@ligfx ligfx commented Jun 2, 2025

No description provided.

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Jun 2, 2025
@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Jun 2, 2025
@codecov
Copy link

codecov bot commented Jun 2, 2025

Bundle Report

Changes will decrease total bundle size by 355 bytes (-0.0%) ⬇️. This is within the configured threshold ✅

Detailed changes
Bundle name Size Change
datahub-react-web-esm 19.66MB -355 bytes (-0.0%) ⬇️

Affected Assets, Files, and Routes:

view changes for bundle: datahub-react-web-esm

Assets Changed:

Asset Name Size Change Total Size Change (%)
assets/index-*.js -355 bytes 16.03MB -0.0%

@codecov
Copy link

codecov bot commented Jun 2, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

📢 Thoughts on this report? Let us know!

@ligfx ligfx force-pushed the athena_profiling_approx_distinct branch from ec76fa7 to 3a3703f Compare June 2, 2025 17:23
expr = sa.func.APPROX_COUNT_DISTINCT(sa.column(column))
elif (
self.engine.dialect.name.lower() == AWSATHENA
or self.engine.dialect.name.lower() == TRINO
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have trino tests that validate this logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the Trino integration tests include profiling of unique counts. Most of the unique counts in the test data are small and unchanged.

One difference that came up: approx_distinct can process JSON-typed columns whereas count(DISTINCT ...) errors out. So more stats will now be available for JSON columns!

Having a unique count also triggers distinct unique value frequency profiling in cases of low cardinality. That logic didn't have any exception handling, and failed trying to SORT BY a JSON column. I included fixes for both of these issues in the PR.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Jun 2, 2025
@ligfx ligfx force-pushed the athena_profiling_approx_distinct branch from 3a3703f to 21d0c32 Compare June 2, 2025 21:07
@ligfx ligfx force-pushed the athena_profiling_approx_distinct branch from 21d0c32 to 752cf71 Compare June 3, 2025 17:52
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Jun 3, 2025
@ligfx ligfx force-pushed the athena_profiling_approx_distinct branch from 752cf71 to 94d914a Compare June 10, 2025 17:01
@ligfx ligfx requested a review from hsheth2 June 10, 2025 23:37
@ligfx ligfx force-pushed the athena_profiling_approx_distinct branch from 94d914a to 885b60b Compare June 23, 2025 15:14
@datahub-cyborg datahub-cyborg bot added merge-pending-ci A PR that has passed review and should be merged once CI is green. and removed needs-review Label for PRs that need review from a maintainer. labels Jun 23, 2025
@ligfx ligfx force-pushed the athena_profiling_approx_distinct branch from 885b60b to be703c2 Compare June 25, 2025 13:54
ligfx added 3 commits June 25, 2025 12:13
…nct_value_frequencies

Great Expectations runs SORT BY in SQL by default, but not all column types are sortable
(such as JSON data types on Athena/Trino). We could pass sort="none" and sort it in Python
instead, but writing the SQL ourselves is straightforward and seems less bug-prone than
relying on stringly-typed arguments into an external library.
@ligfx ligfx force-pushed the athena_profiling_approx_distinct branch from be703c2 to 6780a22 Compare June 25, 2025 16:13
@ligfx ligfx merged commit 0f0119f into datahub-project:master Jun 25, 2025
59 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata merge-pending-ci A PR that has passed review and should be merged once CI is green.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants