feat(Dataquality aspect): Added Data Quality Metrics aspect to emit data quality metrics metadata into Datahub #9265

Open
wants to merge 4 commits into
base: master

Conversation

naresh-angala

Checklist

  • [ X ] The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@naresh-angala naresh-angala changed the title from "feat(Dataquality aspect): Added Data Quality Metrics aspect to emit data quality metrics into Datahub" to "feat(Dataquality aspect): Added Data Quality Metrics aspect to emit data quality metrics metadata into Datahub" Nov 17, 2023
@david-leifker david-leifker added the "product" label (PR or Issue related to the DataHub UI/UX) Nov 21, 2023
@maggiehays maggiehays added the "community-contribution" label (PR or Issue raised by member(s) of DataHub Community) Nov 29, 2023
@jjoyce0510
Collaborator

Hi there! What is the goal with this PR? Adding context in the description will be quite useful! Thanks in advance

@naresh-angala
Author

Hi there! What is the goal with this PR? Adding context in the description will be quite useful! Thanks in advance

Hi,

This PR adds a Data Quality Metrics capability. We are working on changes to support adding Data Quality metrics dynamically, as requested in the PR review comments.

Thanks.

@rtekal
Contributor

rtekal commented Aug 2, 2024

@naresh-angala I know that this PR is just the model changes. Can you please attach a documentation link that conveys the big picture and where it all fits?

Collaborator

@jjoyce0510 jjoyce0510 left a comment

Requested changes and clarifications have not been addressed on this PR

@rtekal
Contributor

rtekal commented Aug 7, 2024

Okay, the team will change the PR to Draft and work on the design changes. Thanks.

@sgm44

sgm44 commented Aug 7, 2024

This PR adds a Data Quality Metrics capability. We are working on changes to support adding Data Quality metrics dynamically, as requested in the PR review comments.

These are the data model changes needed to capture and report data quality dimensions. There was some back and forth on Slack in Oct 2023 on this topic, including example usage screens found here

Here was the simple Feature Goal statement:
As a data producer, I want quality metrics for my ingested datasets to display within the dataset view in the catalog so that the metrics are available to data consumers.

@sgm44

sgm44 commented Aug 7, 2024

@naresh-angala and @rtekal -- where are the GraphQL and UI updates related to this feature? Right now this looks like just PDL updates. Without the rest, I don't see how DataHub gets any value beyond the ability to ingest and store the data, which IMHO is pretty basic.

@naresh-angala
Author

@naresh-angala and @rtekal -- where are the GraphQL and UI updates related to this feature? Right now this looks like just PDL updates. Without the rest, I don't see how DataHub gets any value beyond the ability to ingest and store the data, which IMHO is pretty basic.

@sgm44 -- The initial plan was to get the data quality model changes reviewed and accepted first.
We will then update the code with the GraphQL and UI changes, once the model is accepted.

@naresh-angala
Author

@jjoyce0510 -- Can you share details on the points below?

  1. Can a feature be contributed across multiple PRs?
  2. If the feature needs to be contributed in a single PR, please share your comments on the approach below:
    --> Mark this PR as a draft and add the following changes incrementally:
    a) Dynamic list of Dimensions with a new aspect
    b) Mapper changes and GraphQL changes
    c) UI changes to display the quality metrics

Thanks.

@naresh-angala
Author

@jjoyce0510 -- Can you share details on the points below?

  1. Can a feature be contributed across multiple PRs?
  2. If the feature needs to be contributed in a single PR, please share your comments on the approach below:
    --> Mark this PR as a draft and add the following changes incrementally:
    a) Dynamic list of Dimensions with a new aspect
    b) Mapper changes and GraphQL changes
    c) UI changes to display the quality metrics

Thanks.

@jjoyce0510 -- Please provide details on the above points.

@naresh-angala naresh-angala marked this pull request as draft August 26, 2024 08:00
@Curiosity007

@naresh-angala Is there a tentative timeline for this feature to be fully integrated into the UI, GraphQL and backend? This is an integral part of DQ, and I would very much like to see it in the newest version.

@rtekal
Contributor

rtekal commented Sep 11, 2024

@naresh-angala told me: Targeting the last week of Sep to complete

@naresh-angala naresh-angala reopened this Oct 25, 2024
@github-actions github-actions bot removed the "product" label (PR or Issue related to the DataHub UI/UX) Oct 25, 2024
@naresh-angala
Author

Requested changes and clarifications have not been addressed on this PR

@jjoyce0510 : Updated the PR with dynamic dimension names and UI changes. Please review.

@naresh-angala naresh-angala marked this pull request as ready for review November 15, 2024 17:30
@naresh-angala
Author

UI Screenshots:
Dataset metrics: Chart view:
Screenshot 2024-11-15 at 10 25 10 PM

Schemafield metrics:
Screenshot 2024-11-15 at 10 25 18 PM

@naresh-angala
Author

UI Screenshots:
Dataset metrics: Table view:

Screenshot 2024-11-15 at 11 03 20 PM

@naresh-angala
Author

Requested changes and clarifications have not been addressed on this PR

@jjoyce0510: Have addressed all the requested changes. Please review.

@datahub-cyborg datahub-cyborg bot added the "needs-review" label (Label for PRs that need review from a maintainer.) Nov 21, 2024
Collaborator

@jjoyce0510 jjoyce0510 left a comment

We will take another look at this.

Note that we never had a design discussion around this non-trivial feature. It would be best to have a dedicated time to chat through this together.

Either way, we'll take another look and try to reverse interpret

Cheers
John

Collaborator

@jjoyce0510 jjoyce0510 left a comment

This is quite comprehensive!

This one requires more discussion at the strategic level before admission for the broader community. An RFC document detailing the thinking here might be the best place to discuss and engage others within the community.

A few top of mind concerns:

  1. Who is responsible for registering Dimension Names? e.g. creating the new Dimension Name Type entity
  2. Who is responsible for producing the Dimension Name Scores for Tables & Columns?
  3. Is there an example of a Data Quality dimension at the table and column level we could use to understand the use case a bit better?
  4. Is there a way to use Structured Properties + a custom UI tab to achieve what you want? It feels like you simply need a way to attach strongly typed numeric properties to tables and columns (which structured properties can be used for)
  5. What would a user-facing feature guide doc look like for this feature?

There are also lower level tactical comments on the code around variable naming, data modeling, etc, but I don't want to waste your time on those things before there is alignment at the strategic level.

Cheers
John

@hsheth2 hsheth2 added the "pending-submitter-response" label (Issue/request has been reviewed but requires a response from the submitter) and removed the "needs-review" label (Label for PRs that need review from a maintainer.) Dec 4, 2024
@rtekal
Contributor

rtekal commented Dec 6, 2024

cc: @naresh-angala, @mzaman

  1. Who is responsible for registering Dimension Names? e.g. creating the new Dimension Name Type entity

The Data Producer will submit an enhancement request to add a new dimension as an Entity Type.

  2. Who is responsible for producing the Dimension Name Scores for Tables & Columns?

The Data Producer will use their own algorithm to calculate the scores and will ingest them into the Data Catalog.

  3. Is there an example of a Data Quality dimension at the table and column level we could use to understand the use case a bit better?

Already provided in the PR above.

  4. Is there a way to use Structured Properties + a custom UI tab to achieve what you want? It feels like you simply need a way to attach strongly typed numeric properties to tables and columns (which structured properties can be used for)

  • Structured Properties could be used to show just the current score, but we are also showing the historical average.
  • We will show the time-series values in the future, which Structured Properties do not support.
  • We will also show information about the calculation method and the threshold score. These are quality attributes similar to "Assertions" and "Metadata Tests", so it makes sense to group them together with Assertions and Tests.

  5. What would a user-facing feature guide doc look like for this feature?

User doc in Markdown

Introduction

A Data Quality (DQ) Dimension is an industry term for a characteristic or attribute of data that can be measured against defined standards to determine the quality of the data. The aggregate score at the dimension level indicates how fit the data is for a given business purpose.

Data Catalog offers the standard Dimensions recognized in the industry, along with a scorecard that helps both Data Owners and Consumers make informed decisions about the quality of data. This gives users a common language for communicating and understanding data quality consistently across the enterprise.

Dimensions of Data Quality

A Data Quality Dimension is typically presented as a percentage or a total count.

| Dimension | Description |
| --- | --- |
| Accuracy | How well does a piece of information reflect reality? |
| Completeness | Does it fulfill your expectations of what’s comprehensive? |
| Consistency | Does information stored in one place match relevant data stored elsewhere? |
| Timeliness | Is your information available when you need it? |
| Validity | Is information in a specific format, does it follow business rules, or is it in an unusable format? |
| Uniqueness | Is this the only instance in which this information appears in the database? |
| Integrity | Is the data compliant to the referential integrity rules? |
| Semantic Correctness | Are Data values true to their meaning? |
| Reliability | Is the stability and reliability of data measured over time? |
| Duplication | Are there multiple instances in which this information appears in the dataset? |
| Precision | Analyzes the content of the data to identify inconsistencies and irregularities. |
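
As a rough illustration of how a producer might compute two of the dimensions above as percentages, here is a minimal Python sketch. The function names and the exact definitions (share of non-null values for Completeness, share of distinct values among the non-null values for Uniqueness) are assumptions made for illustration. They are not part of this PR, and each producer remains free to use its own algorithm.

```python
from typing import Optional, Sequence


def completeness_pct(values: Sequence[Optional[object]]) -> float:
    """Share of non-null values in a column, as a percentage."""
    if not values:
        return 0.0
    non_null = sum(1 for v in values if v is not None)
    return 100.0 * non_null / len(values)


def uniqueness_pct(values: Sequence[Optional[object]]) -> float:
    """Share of distinct values among the non-null values, as a percentage."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return 100.0 * len(set(non_null)) / len(non_null)


# Hypothetical column with one missing value and one repeated value.
emails = ["a@x.com", "b@x.com", None, "a@x.com"]
completeness = completeness_pct(emails)  # 3 of 4 values are present
uniqueness = uniqueness_pct(emails)      # 2 distinct values among 3 non-null
```

Returning 0.0 for an empty column is a design choice in this sketch; a producer could just as reasonably skip reporting the dimension in that case.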

Each Dimension contains dimension scores, which can be provided in either numerical or percentage format.

  • Current score (Required): The score as observed from the most recent data quality evaluation for a specific dimension
  • HistoricalWeightedScore (Optional): The score when weighted across several historical quality runs for a specific dimension
  • ScoreType (Required): enum field to define current and historical score format (Percentage or Numerical)
  • Notes (Optional): field to capture note/comment for a specific dimension
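
The score fields above could be modeled roughly as follows. This is only a Python sketch, not the PDL from this PR: the class and field names are hypothetical, the 0 to 100 range check for percentage scores is an assumption, and the weighted average is just one possible way to derive a historical weighted score, since the weighting scheme is not specified here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class ScoreType(Enum):
    """Format shared by the current and historical scores."""
    PERCENTAGE = "PERCENTAGE"
    NUMERICAL = "NUMERICAL"


@dataclass(frozen=True)
class DimensionScore:
    """Hypothetical record for one dimension's scores on a dataset or field."""
    dimension: str                                     # e.g. "Completeness"
    current_score: float                               # required
    score_type: ScoreType                              # required
    historical_weighted_score: Optional[float] = None  # optional
    notes: Optional[str] = None                        # optional

    def __post_init__(self) -> None:
        # Assumption: percentage scores are expressed on a 0-100 scale.
        if self.score_type is ScoreType.PERCENTAGE:
            for value in (self.current_score, self.historical_weighted_score):
                if value is not None and not 0.0 <= value <= 100.0:
                    raise ValueError(f"percentage score out of range: {value}")


def weighted_history(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of past run scores (one possible weighting scheme)."""
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Three past runs, with more recent runs weighted more heavily.
history = weighted_history([80.0, 90.0, 100.0], [1.0, 2.0, 3.0])
score = DimensionScore(
    dimension="Completeness",
    current_score=100.0,
    score_type=ScoreType.PERCENTAGE,
    historical_weighted_score=history,
)
```

Keeping a single ScoreType for both scores mirrors the field description above, where one enum defines the format of the current and the historical score.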

Data quality metrics will be available at the Dataset and Field level under the "Quality" tab in Data Catalog.

image

@datahub-cyborg datahub-cyborg bot added the "needs-review" label (Label for PRs that need review from a maintainer.) and removed the "pending-submitter-response" label (Issue/request has been reviewed but requires a response from the submitter) Dec 6, 2024
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community needs-review Label for PRs that need review from a maintainer. poc-marathon-dec-2023