feat: ML Model Backend Implementation #1896

Merged
merged 33 commits into from
Feb 17, 2021
Changes from 9 commits
Commits
7b724a5
Merge pull request #1 from linkedin/master
RyanHolstien Apr 9, 2020
8475ad3
Merge pull request #2 from linkedin/master
RyanHolstien Jul 2, 2020
c18c55b
Merge pull request #3 from linkedin/master
RyanHolstien Sep 16, 2020
dc474a7
Merge pull request #4 from linkedin/master
RyanHolstien Sep 25, 2020
c7d519c
feat(1877): add backend implementation for ML Model
Sep 25, 2020
d347b96
Merge branch 'master' into feature/DATAHUB-1877-MLModelBackendImpl
Sep 25, 2020
81a342e
Merge pull request #5 from linkedin/master
RyanHolstien Sep 25, 2020
02a9970
Merge branch 'master' into feature/DATAHUB-1877-MLModelBackendImpl
Sep 25, 2020
26a2908
update mce cli examples
Sep 25, 2020
8c897fb
changes for review comments
Oct 6, 2020
87ab115
modify owners field in index per review comments
Oct 7, 2020
938b8d2
Merge pull request #6 from linkedin/master
RyanHolstien Oct 7, 2020
42dc2f5
Merge branch 'master' into feature/DATAHUB-1877-MLModelBackendImpl
Oct 7, 2020
10f5d89
update package of GMS factories and use BaseSearchableClient
Oct 7, 2020
29eab26
fix for checkstyle
Oct 7, 2020
391304c
remove unused comma analyzers and tokenizers
Oct 8, 2020
3ad1b2f
Merge pull request #7 from linkedin/master
RyanHolstien Oct 8, 2020
912b2a6
Merge branch 'master' into feature/DATAHUB-1877-MLModelBackendImpl
Oct 8, 2020
69bbcbd
fix endpoint naming for evaluation data
Oct 9, 2020
059def7
fixes for IndexBuilder per review comments
Oct 14, 2020
2bc1963
update IndexBuilder with efficient setting of urn derived fields
Oct 14, 2020
075ecfa
remove unnecessary filters and fields from index builder and relation…
Jan 25, 2021
3f83e7f
Merge pull request #8 from linkedin/master
RyanHolstien Jan 25, 2021
21b3c23
Merge branch 'master' into feature/DATAHUB-1877-MLModelBackendImpl
Jan 26, 2021
5646b85
remove unused imports, fix formatting, and add new integration test t…
Jan 26, 2021
8215a60
add custom analyzer to split up URN components and fix typos
Jan 27, 2021
25daf59
update query template to include urn components
Jan 27, 2021
1735ff3
consolidate query template to one operator
Feb 2, 2021
ea9434c
Merge pull request #9 from linkedin/master
RyanHolstien Feb 9, 2021
86b54cb
Merge branch 'master' into feature/DATAHUB-1877-MLModelBackendImpl
Feb 9, 2021
f4e27af
update sanity test to master
Feb 10, 2021
22c233b
Merge pull request #10 from linkedin/master
RyanHolstien Feb 16, 2021
c2dbf80
Merge branch 'master' into feature/DATAHUB-1877-MLModelBackendImpl
Feb 16, 2021
5 changes: 3 additions & 2 deletions docker/elasticsearch-setup/Dockerfile
@@ -3,11 +3,12 @@ FROM jwilder/dockerize:0.6.1

RUN apk add --no-cache curl

-COPY corpuser-index-config.json dataprocess-index-config.json dataset-index-config.json /
+COPY corpuser-index-config.json dataprocess-index-config.json dataset-index-config.json ml-model-index-config.json /

CMD dockerize \
-wait http://$ELASTICSEARCH_HOST:$ELASTICSEARCH_PORT \
-timeout 120s \
curl -XPUT $ELASTICSEARCH_HOST:$ELASTICSEARCH_PORT/corpuserinfodocument --data @corpuser-index-config.json && \
curl -XPUT $ELASTICSEARCH_HOST:$ELASTICSEARCH_PORT/dataprocessdocument --data @dataprocess-index-config.json && \
-curl -XPUT $ELASTICSEARCH_HOST:$ELASTICSEARCH_PORT/datasetdocument --data @dataset-index-config.json
+curl -XPUT $ELASTICSEARCH_HOST:$ELASTICSEARCH_PORT/datasetdocument --data @dataset-index-config.json && \
+curl -XPUT $ELASTICSEARCH_HOST:$ELASTICSEARCH_PORT/mlmodeldocument --data @ml-model-index-config.json
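
As a rough sketch of what the CMD above chains together, the following Python builds the same sequence of index-creation requests from a table of index names and config files. The `INDEX_CONFIGS` table and the `setup_command` helper are hypothetical names used only for illustration; the index names, file names, and curl flags are the ones in the diff.

```python
# Index name -> config file, as listed in the Dockerfile diff.
INDEX_CONFIGS = {
    "corpuserinfodocument": "corpuser-index-config.json",
    "dataprocessdocument": "dataprocess-index-config.json",
    "datasetdocument": "dataset-index-config.json",
    "mlmodeldocument": "ml-model-index-config.json",
}


def setup_command(host, port):
    """Build the chained curl command; each PUT runs only if the previous one succeeded."""
    puts = [
        f"curl -XPUT {host}:{port}/{index} --data @{config}"
        for index, config in INDEX_CONFIGS.items()
    ]
    return " && ".join(puts)


if __name__ == "__main__":
    # Hypothetical host and port, for display only.
    print(setup_command("elasticsearch", 9200))
```

Keeping the pairs in one table makes it harder to add a config file to COPY and forget the matching PUT, which is the two-place change this diff makes by hand.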
212 changes: 212 additions & 0 deletions docker/elasticsearch-setup/ml-model-index-config.json
@@ -0,0 +1,212 @@
{
"settings": {
"index": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": "3",
"max_gram": "50"
},
"custom_delimiter": {
"split_on_numerics": "false",
"split_on_case_change": "false",
"type": "word_delimiter",
"preserve_original": "true",
"catenate_words": "false"
}
},
"char_filter": {
"ml_model_pattern": {
Contributor

Are we using this character filter anywhere?

Collaborator Author

@RyanHolstien, Oct 26, 2020

The char_filter here gets used on every ml_model_pattern (used in the name field). It replaces all the dots with slashes. Why it does this, or what the significance is, I'm not exactly sure; I copied it from Dataset. From what I can tell, it doesn't actually seem to be used. It would make sense if, for example, the slash pattern were used together with this character filter, since that would allow tokenizing a name like "aws_us_east_1.modelName" (which would be pattern-replaced to aws_us_east_1/modelName and then split into two separate tokens on the slash). However, ml_model_pattern and its analogue dataset_pattern already tokenize on both slashes and dots.

Do you have any insight from the LinkedIn side into why dots are undesirable in the name field?
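
To make the point above concrete, here is a minimal Python sketch (plain `re`, not Elasticsearch itself) comparing the two routes: the dot-to-slash char filter followed by the slash tokenizer, versus the `[./]` pattern tokenizer on its own. The helper names are hypothetical, the lowercase filter is left out, and dropping empty tokens is an assumption of the sketch.

```python
import re


def pattern_replace(text, pattern, replacement):
    """Rough stand-in for a pattern_replace char filter."""
    return re.sub(pattern, replacement, text)


def pattern_tokenize(text, pattern):
    """Rough stand-in for a pattern tokenizer: split on matches, skip empty tokens."""
    return [token for token in re.split(pattern, text) if token]


name = "aws_us_east_1.modelName"

# Route 1: char filter "[.]" -> "/", then the slash_tokenizer pattern "[/]".
via_char_filter = pattern_tokenize(pattern_replace(name, r"[.]", "/"), r"[/]")

# Route 2: the ml_model_pattern tokenizer pattern "[./]" with no char filter.
via_tokenizer = pattern_tokenize(name, r"[./]")

print(via_char_filter)
print(via_tokenizer)
```

Both routes yield the same two tokens for this name, which is why the char filter looks redundant for pure tokenization.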

Contributor

There are a couple of issues I can see here with the mappings and settings. It would be great if you could confirm them:

  1. The name field uses the model_pattern analyzer, but that doesn't seem to be defined anywhere. I do see an ml_model_pattern analyzer defined, but no model_pattern analyzer.
  2. The ml_model_pattern analyzer uses the ml_model_pattern tokenizer, but where are we using the ml_model_pattern char_filter?
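
A mismatch like the first item can be caught mechanically. The sketch below is a hypothetical helper, not part of this PR: it walks an index config and reports every field whose `analyzer` is not defined under `settings.index.analysis.analyzer`. The sample config is made up to mirror the situation described in the comment, and built-in analyzers such as `standard` would also be reported unless they are allowed explicitly.

```python
def undefined_analyzers(index_config):
    """Return (field path, analyzer name) pairs whose analyzer is not defined in settings."""
    analysis = index_config.get("settings", {}).get("index", {}).get("analysis", {})
    defined = set(analysis.get("analyzer", {}))
    missing = []

    def walk(properties, prefix=""):
        for field, spec in properties.items():
            path = prefix + field
            analyzer = spec.get("analyzer")
            if analyzer is not None and analyzer not in defined:
                missing.append((path, analyzer))
            # Multi-fields ("fields") and nested objects ("properties") can carry analyzers too.
            walk(spec.get("fields", {}), path + ".")
            walk(spec.get("properties", {}), path + ".")

    # Assumes the typed layout used in this file: mappings -> <type> -> properties.
    for type_mapping in index_config.get("mappings", {}).values():
        walk(type_mapping.get("properties", {}))
    return missing


# Hypothetical config: the analyzer is defined as ml_model_pattern but referenced as model_pattern.
sample = {
    "settings": {"index": {"analysis": {"analyzer": {
        "ml_model_pattern": {"type": "custom", "tokenizer": "ml_model_pattern"},
    }}}},
    "mappings": {"doc": {"properties": {
        "name": {"type": "text", "analyzer": "model_pattern"},
    }}},
}

print(undefined_analyzers(sample))
```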

Contributor

To answer your question: we replaced dots with slashes in datasets to support the browse feature. We chose this uniform standard because some dataset names were of the form foo.bar while others were of the form /foo/bar. You don't need this in your settings/mappings if you don't intend to support browse.
This also means that we should improve our documentation :) Thanks for the question.
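
As an illustration of the browse rationale above, this Python sketch normalizes both name styles to one slash-separated path and then lists the ancestor prefixes a browse view could group by, roughly what a path_hierarchy tokenizer emits. The function names are hypothetical and this is not how the index computes browsePaths; it only shows why a single separator helps.

```python
def to_browse_path(name):
    """Normalize 'foo.bar' and '/foo/bar' to the same slash-separated form."""
    parts = [part for part in name.replace(".", "/").split("/") if part]
    return "/" + "/".join(parts)


def path_hierarchy(path, delimiter="/"):
    """List each ancestor prefix of the path, shortest first."""
    parts = [part for part in path.split(delimiter) if part]
    return [delimiter + delimiter.join(parts[: i + 1]) for i in range(len(parts))]


for raw in ("foo.bar", "/foo/bar"):
    normalized = to_browse_path(raw)
    print(raw, "->", normalized, "->", path_hierarchy(normalized))
```

With one separator, both spellings land under the same `/foo` node instead of producing two unrelated browse trees.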

"pattern": "[.]",
"type": "pattern_replace",
"replacement": "/"
}
},
"normalizer": {
"my_normalizer": {
"filter": [
"lowercase"
],
"type": "custom"
}
},
"analyzer": {
"whitespace_lowercase": {
"filter": [
"lowercase"
],
"tokenizer": "whitespace"
},
"slash_pattern": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "slash_tokenizer"
},
"ml_model_pattern": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "ml_model_pattern"
},
"comma_pattern": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "comma_tokenizer"
},
"custom_browse": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "path_hierarchy_tokenizer"
},
"custom_ngram": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "custom_ngram"
},
"custom_keyword": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "keyword"
},
"comma_pattern_ngram": {
"filter": [
"lowercase",
"autocomplete_filter"
],
"type": "custom",
"tokenizer": "comma_tokenizer"
},
"delimit": {
"filter": [
"lowercase",
"custom_delimiter"
],
"tokenizer": "whitespace"
},
"ml_model_pattern_ngram": {
"filter": [
"lowercase",
"autocomplete_filter"
],
"type": "custom",
"tokenizer": "ml_model_pattern"
},
"custom_browse_slash": {
"filter": [
"lowercase"
],
"type": "custom",
"tokenizer": "path_hierarchy"
}
},
"tokenizer": {
"path_hierarchy_tokenizer": {
"type": "path_hierarchy",
"replacement": "/",
"delimiter": "."
},
"custom_ngram": {
"type": "ngram",
"min_gram": "3",
"max_gram": "50"
},
"slash_tokenizer": {
"pattern": "[/]",
"type": "pattern"
},
"comma_tokenizer": {
"pattern": ",",
"type": "pattern"
},
"ml_model_pattern": {
"pattern": "[./]",
"type": "pattern"
}
}
}
}
},
"mappings": {
"doc": {
"properties": {
"browsePaths": {
"type": "text",
"fields": {
"length": {
"type": "token_count",
"analyzer": "slash_pattern"
}
},
"analyzer": "custom_browse_slash",
"fielddata": true
},
"origin": {
"type": "keyword",
"fields": {
"ngram": {
"type": "text",
"analyzer": "custom_ngram"
}
},
"normalizer": "my_normalizer"
},
"hasOwners": {
"type": "boolean"
},
"name": {
"type": "keyword"
},
"num_inputs": {
"type": "long"
},
"num_outputs": {
"type": "long"
},
"owners": {
"type": "text",
"fields": {
"ngram": {
"type": "text",
"analyzer": "comma_pattern_ngram"
}
},
"analyzer": "comma_pattern"
},
"orchestrator": {
"type": "keyword",
"fields": {
"ngram": {
"type": "text",
"analyzer": "custom_ngram"
}
},
"normalizer": "my_normalizer"
},
"urn": {
"type": "keyword",
"normalizer": "my_normalizer"
},
"inputs": {
"type": "keyword",
"normalizer": "my_normalizer"
},
"outputs": {
"type": "keyword",
"normalizer": "my_normalizer"
}
}
}
}
}
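
For readers unfamiliar with the `comma_pattern_ngram` analyzer applied to `owners.ngram` above, here is a small Python approximation of its pipeline: split on commas, lowercase, then emit edge n-grams with the configured `min_gram` of 3 and `max_gram` of 50. This is a sketch under those assumptions, not Elasticsearch's implementation, and the sample owner string is made up.

```python
def edge_ngrams(token, min_gram=3, max_gram=50):
    """Prefixes of the token from min_gram to max_gram characters; shorter tokens yield nothing."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]


def comma_pattern_ngram(text):
    """Approximate the analyzer: comma tokenizer, lowercase filter, edge n-gram filter."""
    tokens = [token.lower() for token in text.split(",") if token]
    return [gram for token in tokens for gram in edge_ngrams(token)]


print(comma_pattern_ngram("jdoe,Amy"))
```

Indexing prefixes this way is what lets a partial query such as `jdo` match the owner `jdoe` at search time.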