Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: ML Model Backend Implementation #1896

Merged
merged 33 commits into from
Feb 17, 2021
Merged

feat: ML Model Backend Implementation #1896

merged 33 commits into from
Feb 17, 2021

Conversation

RyanHolstien
Copy link
Collaborator

@RyanHolstien RyanHolstien commented Sep 25, 2020

Fixes #1877

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable)

@RyanHolstien
Copy link
Collaborator Author

Build is failing due to a 502 bad gateway when reaching out to the gradle maven repo for ElasticSearch's rest client. Should pass now after the fix from a typo I made (if it can reach all the dependencies). Not sure how to force a rebuild without making more changes

@jplaisted
Copy link
Contributor

Where was the typo? Did you already commit it? We ran the build jobs for your last few commits (see the Xs or check mars on them).

@RyanHolstien
Copy link
Collaborator Author

RyanHolstien commented Oct 11, 2020

Where was the typo? Did you already commit it? We ran the build jobs for your last few commits (see the Xs or check mars on them).

Yeah the typo fix was the last commit. The previous failure was due to the restspec not having EvaluationData endpoints because I typoed the text in the Rest.li annotation for EvaluationDataResource which is now fixed. The current build failure is due to this error which does not seem to be code related, it hit a gateway error when reaching out to the Gradle repository to get a dependency:

Could not resolve all files for configuration ':metadata-utils:compileClasspath'.
Could not resolve org.elasticsearch.client:elasticsearch-rest-high-level-client:5.6.8.
Required by:
project :metadata-utils
> Could not resolve org.elasticsearch.client:elasticsearch-rest-high-level-client:5.6.8.
> Could not get resource 'https://plugins.gradle.org/m2/org/elasticsearch/client/elasticsearch-rest-high-level-client/5.6.8/elasticsearch-rest-high-level-client-5.6.8.pom'.
> Could not GET 'https://jcenter.bintray.com/org/elasticsearch/client/elasticsearch-rest-high-level-client/5.6.8/elasticsearch-rest-high-level-client-5.6.8.pom'. Received status code 502 from server: Bad Gateway

@mars-lan mars-lan changed the title #1877 ML Model Backend Implementation feat: ML Model Backend Implementation Oct 12, 2020
@mars-lan
Copy link
Contributor

Sorry but just noticed that all the ML-related models are placed under com.linkedin.ml.metadata namespace (see https://github.com/linkedin/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin/ml/metadata). @RyanHolstien do you mind if I submit a PR before this to move them to com.linkedin.ml instead?

@RyanHolstien
Copy link
Collaborator Author

Sorry but just noticed that all the ML-related models are placed under com.linkedin.ml.metadata namespace (see https://github.com/linkedin/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin/ml/metadata). @RyanHolstien do you mind if I submit a PR before this to move them to com.linkedin.ml instead?

@mars-lan Not at all, go for it. I'll update as needed 😄

@mars-lan
Copy link
Contributor

Sorry but just noticed that all the ML-related models are placed under com.linkedin.ml.metadata namespace (see https://github.com/linkedin/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin/ml/metadata). @RyanHolstien do you mind if I submit a PR before this to move them to com.linkedin.ml instead?

@mars-lan Not at all, go for it. I'll update as needed 😄

Actually nvm. Seems like the models have been referenced at multiple places, changing the namespace will quickly become a backward incompatible change. Let's live with them for now.

final List<EvaluatedOn> evaluationDataList = evaluationData.getEvaluationData()
.stream()
.filter(BaseData::hasDataset)
.filter(baseData -> DatasetUrn.ENTITY_TYPE.equals(baseData.getDataset().getEntityType()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this? since dataset field is of type DatasetUrn, shouldn't the entity type be the same as that of DatasetUrn?

public <URN extends Urn> List<GraphBuilder.RelationshipUpdates> buildRelationships(@Nonnull URN urn, @Nonnull EvaluationData evaluationData) {
final List<EvaluatedOn> evaluationDataList = evaluationData.getEvaluationData()
.stream()
.filter(BaseData::hasDataset)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dataset seems to be a required field in BaseData model. Do we really need the check hasDataset?

/**
* How was the data preprocessed for evaluation (e.g., tokenization of sentences, cropping of images, any filtering such as dropping images without faces)?
*/
preProcessing: optional string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you share a usecase where storing this property in the edge will be useful? I wouldn't recommend adding this field as an edge property.
cc @keremsahin1 @camelliazhang

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RyanHolstien Thanks for your contribution! Could you please help me to understand this relationship better?

  1. Could you please provide a few examples?
  2. How do the queries look like when this relationship is involved? Are you trying to do a filter based on "preProcessing" property? How many possible values are we talking about? In general, I won't recommend to add an arbitrary string property on an edge

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed it. My thought was essentially that it would be useful information to return as a result of searching for the relationship or perhaps filtering on certain types of preprocessing that had occurred similar to how OwnedBy has OwnershipType. If the issue is that it's a arbitrary string I don't currently have a list of reasonable preprocessing examples for an enum so we can leave it off and add it later if it's worthwhile.

/**
* How was the data preprocessed for evaluation (e.g., tokenization of sentences, cropping of images, any filtering such as dropping images without faces)?
*/
preProcessing: optional string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as before. Don't think we need to store this property in the edge.


import static com.linkedin.metadata.dao.internal.BaseGraphWriterDAO.RemovalOption.REMOVE_ALL_EDGES_FROM_SOURCE;

public class TrainedOnBuilderFromTrainingData extends BaseRelationshipBuilder<TrainingData> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar comments as provided in EvaluatedOnBuilderFromEvaluationData.

private MLModelDocument getDocumentToUpdateFromAspect(MLModelUrn urn, EvaluationData evaluationData) {
final MLModelDocument doc = new MLModelDocument();

if (evaluationData.hasEvaluationData()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more of a generic comment. hasX() doesn't guarantee that getX() will not be null. I suggest we replace this with (evaluationData.getEvaluationData() != null). If you can make similar changes in other parts of the code that will be great! Related discussion #1950 (comment)

@RyanHolstien
Copy link
Collaborator Author

Sorry about the delay, made the requested changes and updated the code to reflect changes to master over time. Let me know if there's anything else needed to get this merged in.

Copy link
Contributor

@camelliazhang camelliazhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RyanHolstien Thanks for taking time addressing all the review comments!

/**
* Data model for a ML Model entity
*/
record MLModelEntity includes BaseEntity {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the entity model for graph. Could you please share some context on this? Such as a list of main graph queries to support?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our current graph use cases for ML Models include querying models that have been trained/evaluated on certain datasets. I created the entity model based on the other current entity models which are rather minimalistic with just the URN and URN components. The main value is the edge relationships between datasets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your information @RyanHolstien
I do see the search APIs (via ESSearchDAO) are implemented but not Neo4jQueryDAO. I assume you will have a follow up for building some APIs that leverages query DAO (for graph)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a separate API & service layer diverging from Rest.li in our fork, so changes from there aren't directly transferrable. If no one else picks it up before I have time I can do the follow up mapping it over to the equivalent Rest.li based resources.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, SGTM.

"platform^0.055",
"type^0.01"
],
"default_operator": "OR"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me now. My suggestion is to use one operator as default operator query string, either OR or AND, to reduce the possible confusions from users. With query_string, user can always specify and overwrite the default operator. Full syntax can be found here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax

Copy link
Contributor

@camelliazhang camelliazhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @RyanHolstien for your contributions and persistence! Great work here!

@RyanHolstien
Copy link
Collaborator Author

Thanks for taking the time to review @camelliazhang ! I'm not sure what the build error is about, the failing test passes locally. The error is saying that the number of arguments doesn't line up with the constructor call in my new search sanity test, but it is structured the same way as the other tests. I don't have access to run it again without pushing another commit. I can try pushing a harmless commit like adding a local variable for one of the Strings in that test class to see if it passes the build so this can be merged if that's the only way to trigger another build.

@camelliazhang
Copy link
Contributor

The error is saying that the number of arguments doesn't line up with the constructor call in my new search sanity test,

@jplaisted Could you please take a look for the failure of new search sanity test? Thanks

@jplaisted
Copy link
Contributor

Yeah I added a new parameter, which I acknowledges breaks code, but at the time i thought one no one was using it besides me.

It is weird to me it passes locally but not on the CI. I'd think it'd fail on both. I don't think CI tries to rebase on master before running changes?

In any case, make sure you've rebased on master, and run ./gradlew idea build. It'll probably fail locally then too, something might be cached.

Check out this: #2067. Basically there's some new sanity tests to test your config too, just pass the super constructor the search config.

@arunvasudevan
Copy link
Collaborator

arunvasudevan commented Feb 8, 2021

@RyanHolstien It would be great if the GMS sample API calls section in the README is updated for MLModel entity - https://github.com/linkedin/datahub/blob/master/gms/README.md#sample-api-calls

Copy link
Contributor

@jplaisted jplaisted left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Thoughts on making the L lower case? MlModel?

https://google.github.io/styleguide/javaguide.html#s5.3-camel-case

We don't use google's style guide, but still a useful reference.

@shirshanka
Copy link
Contributor

Nit: Thoughts on making the L lower case? MlModel?

https://google.github.io/styleguide/javaguide.html#s5.3-camel-case

We don't use google's style guide, but still a useful reference.

Since the ML here is a short-form for Machine Learning ... maybe MLModel is more appropriate.
Seems like Apple agrees: https://developer.apple.com/documentation/coreml/mlmodel

@shirshanka shirshanka dismissed jywadhwani’s stale review February 17, 2021 21:26

Looks like Jyoti's concerns are addressed now.

@shirshanka shirshanka merged commit ea86ade into datahub-project:master Feb 17, 2021
@shirshanka
Copy link
Contributor

Thanks @RyanHolstien for the lift and @camelliazhang, @jywadhwani for detailed reviews!

@arunvasudevan arunvasudevan mentioned this pull request Mar 12, 2021
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ML Model Backend Implementation
8 participants