feat(Dataquality aspect): Added Data Quality Metrics aspect to emit data quality metrics metadata into Datahub #9265
metadata-models/src/main/pegasus/com/linkedin/dataquality/DataQualityDimensionInfo.pdl
metadata-models/src/main/pegasus/com/linkedin/dataquality/DimensionScore.pdl
metadata-models/src/main/pegasus/com/linkedin/dataquality/DimensionScore.pdl
Hi there! What is the goal of this PR? Adding context in the description would be quite useful. Thanks in advance.
Hi, this PR adds a Data Quality Metrics capability. We are working on changes to support dynamically added Data Quality metrics, as per the PR review comments. Thanks.
@naresh-angala I know that this PR is just the model changes. Can you attach a documentation link that conveys the big picture and where it all fits, please?
metadata-models/src/main/pegasus/com/linkedin/dataquality/DataQualityDimensionInfo.pdl
metadata-models/src/main/pegasus/com/linkedin/dataquality/SchemaFieldQualityDimensionInfo.pdl
Requested changes and clarifications have not been addressed on this PR.
Okay, the team will change the PR to Draft and work on the design changes. Thanks.
These are the data model changes needed to support capturing and reporting data quality dimensions. There was a bit of back and forth in Slack in Oct 2023 on this topic, which included example usage screens found here. Here was the simple Feature Goal statement:
@naresh-angala and @rtekal -- where are the GraphQL and UI updates related to this feature? Right now this looks like just PDL updates. Without the rest, I don't see how DataHub gets any value beyond the ability to ingest and store the data, which IMHO is pretty basic.
@sgm44 -- The initial plan was to get the Data Quality model changes reviewed and accepted.
@jjoyce0510 -- Can you share details on the points below?
Thanks.
@jjoyce0510 -- Please provide details on the above points.
@naresh-angala Is there a tentative timeline for this feature to be fully integrated into the UI, GraphQL, and backend? This is an integral part of DQ, and we would very much like to see it in the newest version.
@naresh-angala told me: Targeting the last week of Sep to complete.
091db70 to eab2ac7
@jjoyce0510: Updated the PR with dynamic dimension names and UI changes. Please review.
@jjoyce0510: I have addressed all the requested changes. Please review.
metadata-models/src/main/pegasus/com/linkedin/dataquality/SchemaFieldQualityDimensionInfo.pdl
We will take another look at this.
Note that we never had a design discussion around this non-trivial feature. It would be best to set aside dedicated time to chat through this together.
Either way, we'll take another look and try to reverse-engineer the intent.
Cheers
John
This is quite comprehensive!
This one requires more discussion at the strategic level before it can be admitted for the broader community. An RFC document detailing the thinking here might be the best place to discuss and engage others within the community.
A few top of mind concerns:
- Who is responsible for registering Dimension Names? e.g. creating the new Dimension Name Type entity
- Who is responsible for producing the Dimension Name Scores for Tables & Columns?
- Is there an example of a Data Quality dimension at the table and column level we could use to understand the use case a bit better?
- Is there a way to use Structured Properties + a custom UI tab to achieve what you want? It feels like you simply need a way to attach strongly typed numeric properties to tables and columns (which structured properties can be used for)
- What would a user-facing feature guide doc look like for this feature?
There are also lower level tactical comments on the code around variable naming, data modeling, etc, but I don't want to waste your time on those things before there is alignment at the strategic level.
Cheers
John
cc: @naresh-angala, @mzaman
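To make the Structured Properties question above more concrete, here is a minimal sketch of what "strongly typed numeric properties attached to tables and columns" could look like in plain Python. It does not use the DataHub SDK or any DataHub model; `NumericProperty`, `PropertyStore`, the property name `completeness_pct`, and the target strings are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class NumericProperty:
    """Hypothetical definition of a typed numeric property (not a DataHub class)."""

    name: str
    min_value: Optional[float] = None
    max_value: Optional[float] = None

    def validate(self, value: object) -> float:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"{self.name} expects a number, got {type(value).__name__}")
        if self.min_value is not None and value < self.min_value:
            raise ValueError(f"{self.name} must be >= {self.min_value}")
        if self.max_value is not None and value > self.max_value:
            raise ValueError(f"{self.name} must be <= {self.max_value}")
        return float(value)


@dataclass
class PropertyStore:
    """Hypothetical in-memory store keyed by a table or column identifier."""

    definitions: Dict[str, NumericProperty] = field(default_factory=dict)
    values: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def register(self, prop: NumericProperty) -> None:
        self.definitions[prop.name] = prop

    def attach(self, target: str, name: str, value: object) -> None:
        # Looking up an unregistered property raises KeyError.
        prop = self.definitions[name]
        self.values.setdefault(target, {})[name] = prop.validate(value)


store = PropertyStore()
store.register(NumericProperty("completeness_pct", min_value=0.0, max_value=100.0))
store.attach("orders", "completeness_pct", 97.5)              # table level
store.attach("orders.customer_id", "completeness_pct", 92)    # column level
```

In this sketch, registering a property plays the role of "registering a Dimension Name", and attaching a value plays the role of "producing a score", which are the two responsibilities the first two questions ask about.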
The Data Producer will submit an enhancement request to add a new dimension as an Entity Type.
The Data Producer will use their own algorithm to calculate the scores and will ingest them into Data Catalog.
Already provided in the PR above.
User doc in Markdown
Introduction
Data Quality Dimension (DQ) is a popular industry term used to describe characteristics or attributes of data that can be measured against defined standards in order to determine the quality of data. The aggregate at the dimension level can fairly indicate the fitness of data for a certain business purpose. Data Catalog offers the standard Dimensions recognized in the industry and a scorecard so that both Data Owners and Consumers can make informed decisions on the quality of data. This approach provides a common language for users to communicate and comprehend the quality of data consistently across the enterprise.
Dimensions of Data Quality
A Data Quality Dimension is typically presented as a percentage or a total count.
Each Dimension will contain dimension scores, which can be provided in either numerical or percentage format.
Data quality metrics will be available at the Dataset and Field level under the "Quality" tab in Data Catalog.
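As a rough illustration of the description above, the following sketch shows how a producer might compute a dimension score with its own algorithm and express it either as a percentage or as a count, at dataset or field level. It is not the PDL model from this PR: `ScoreType`, `DimensionScore`, `completeness_pct`, and `score_column` are hypothetical names, and "Completeness" (share of non-null values) is assumed here as an example dimension.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Sequence


class ScoreType(Enum):
    PERCENTAGE = "PERCENTAGE"
    COUNT = "COUNT"


@dataclass(frozen=True)
class DimensionScore:
    """Hypothetical score record; field_path=None means a dataset-level score."""

    dimension: str
    score_type: ScoreType
    value: float
    field_path: Optional[str] = None


def completeness_pct(values: Sequence[object]) -> float:
    """Percentage of non-null values; one possible producer-side algorithm."""
    if not values:
        raise ValueError("cannot score an empty column")
    non_null = sum(1 for v in values if v is not None)
    return 100.0 * non_null / len(values)


def score_column(field_path: str, values: Sequence[object]) -> List[DimensionScore]:
    """Return the same dimension in both formats: a percentage and a null count."""
    null_count = sum(1 for v in values if v is None)
    return [
        DimensionScore("Completeness", ScoreType.PERCENTAGE, completeness_pct(values), field_path),
        DimensionScore("Completeness", ScoreType.COUNT, float(null_count), field_path),
    ]


emails = ["a@example.com", None, "b@example.com", "c@example.com"]
scores = score_column("email", emails)
```

Here `scores` holds a field-level percentage (75.0) and a field-level null count (1.0) for the made-up `email` column; a dataset-level score would use the same record with `field_path` left as `None`.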