Add doc about search document & some cleanup

linkedin · Dec 19, 2019 · 3ba1492 · 3ba1492
1 parent 8f120f1
commit 3ba1492
Showing 19 changed files with 126 additions and 11 deletions.
diff --git a/docs/architecture/architecture.md b/docs/architecture/architecture.md
@@ -1,11 +1,11 @@
 # DataHub Architecture
 ![datahub-architecture](../imgs/datahub-architecture.png)
 
+## Generalized Metadata Architecture (GMA)
+Refer to [GMA](../what/gma.md).
+
 ## Metadata Serving
 Refer to [metadata-serving](metadata-serving.md).
 
 ## Metadata Ingestion
-Refer to [metadata-ingestion](metadata-ingestion.md).
-
-## What is Generalized Metadata Architecture (GMA)?
-Refer to [GMA](../what/gma.md).
+Refer to [metadata-ingestion](metadata-ingestion.md).
diff --git a/docs/architecture/metadata-ingestion.md b/docs/architecture/metadata-ingestion.md
@@ -0,0 +1,2 @@
+# Metadata Ingestion Architecture
+
diff --git a/docs/architecture/metadata-serving.md b/docs/architecture/metadata-serving.md
@@ -0,0 +1,3 @@
+# Metadata Serving Architecture
+
+![metadata-serving](../imgs/metadata-serving.png)
diff --git a/docs/how/entity-onboarding.md b/docs/how/entity-onboarding.md
@@ -0,0 +1,2 @@
+# How to onboard an entity?
+
diff --git a/docs/how/graph-onboarding.md b/docs/how/graph-onboarding.md
@@ -0,0 +1,2 @@
+# How to onboard to GMA graph?
+
diff --git a/docs/how/metadata-modelling.md b/docs/how/metadata-modelling.md
@@ -1,5 +1,5 @@
-# How to model metadata for GMA?
-GMA uses [rest.li](https://rest.li), which is LinkedIn's open source REST framework.  
+# How to model metadata?
+[GMA](../what/gma.md) uses [rest.li](https://rest.li), which is LinkedIn's open source REST framework.  
 All metadata in GMA needs to be modelled using [Pegasus schema (PDSC)](https://linkedin.github.io/rest.li/DATA-Data-Schema-and-Templates) which is the data schema for [rest.li](https://rest.li).
 
 Conceptually we’re modelling metadata as a hybrid graph of nodes ([entities](../what/entity.md)) and edges ([relationships](../what/relationship.md)), with additional documents ([metadata aspects](../what/aspect.md)) attached to each node. 

diff --git a/docs/how/search-onboarding.md b/docs/how/search-onboarding.md
@@ -0,0 +1,2 @@
+# How to onboard to GMA search?
+
diff --git a/docs/imgs/metadata-serving.png b/docs/imgs/metadata-serving.png
diff --git a/docs/what/aspect.md b/docs/what/aspect.md
@@ -1,4 +1,5 @@
-# What is a GMA aspect?
+# What is a metadata aspect?
+
 A metadata aspect is a structured document, or more precisely a `record` in [PDSC](https://linkedin.github.io/rest.li/DATA-Data-Schema-and-Templates),
  that represents a specific kind of metadata (e.g. ownership, schema, statistics, upstreams). 
  A metadata aspect on its own has no meaning (e.g. ownership for what?) and must be associated with a particular entity (e.g. ownership for PageViewEvent). 

diff --git a/docs/what/delta.md b/docs/what/delta.md
@@ -1,4 +1,4 @@
-# What is Delta in GMA?
+# What is a metadata delta?
 
 Rest.li supports [partial update](https://linkedin.github.io/rest.li/user_guide/restli_server#partial_update) natively without needing explicitly defined models. 
 However, the granularity of update is always limited to each field in a PDSC model. 

diff --git a/docs/what/entity.md b/docs/what/entity.md
@@ -1,4 +1,5 @@
-# What is a GMA entity?
+# What is an entity?
+
 An entity is very similar to the concept of a [resource](https://linkedin.github.io/rest.li/user_guide/restli_server#writing-resources) in [rest.li](http://rest.li/). 
 Generally speaking, an entity should have a defined [URN](urn.md) and a corresponding 
 [CRUD](https://en.wikipedia.org/wiki/Create,_read,_update_and_delete) API for the metadata associated with a particular instance of the entity. A particular instance of an entity is essentially a node in the [metadata graph](graph.md). 

diff --git a/docs/what/gma.md b/docs/what/gma.md
@@ -0,0 +1,2 @@
+# What is Generalized Metadata Architecture (GMA)?
+
diff --git a/docs/what/gms.md b/docs/what/gms.md
@@ -0,0 +1,2 @@
+# What is Generalized Metadata Store (GMS)?
+
diff --git a/docs/what/graph.md b/docs/what/graph.md
@@ -0,0 +1 @@
+# What is GMA graph?
diff --git a/docs/what/relationship.md b/docs/what/relationship.md
@@ -1,4 +1,4 @@
-# What is a GMA relationship?
+# What is a relationship?
 
 A relationship is a named associate between exactly two entities, a source and a destination. 
 

diff --git a/docs/what/search-document.md b/docs/what/search-document.md
@@ -0,0 +1,94 @@
+# What is a search document?
+
+[Search documents](https://en.wikipedia.org/wiki/Search_engine_indexing) are also modeled using [PDSC](https://linkedin.github.io/rest.li/DATA-Data-Schema-and-Templates) explicitly. 
+In many ways, the model for a Document is very similar to an [Entity](entity.md) and [Relationship](relationship.md) model, 
+where each attribute/field contains a value that’s derived from various metadata aspects. 
+However, a search document is also allowed to have array-type attributes that contain only primitives or enum items. 
+This is because most full-text search engines support membership testing against an array field, e.g. an array field containing all the terms used in a document.
+
+One obvious use of the attributes is to perform search filtering, e.g. find every `User` whose first name or last name is similar to “Joe” and who reports up to `userFoo`. 
+Since the document also serves as the main interface for the search API, the attributes can also be used to format the search snippet. 
+As a result, one may be tempted to add as many attributes as needed. This is acceptable as the underlying search engine is designed to index a large number of fields.
+
+Below is an example schema for the `User` search document. Note that:
+1. Each search document is required to have a type-specific `urn` field, which generally maps to an entity in the [graph](graph.md).
+2. Similar to `Entity`, each document has an optional `removed` field for "soft deletion". 
+This is captured in [BaseDocument](../../metadata-models/src/main/pegasus/com/linkedin/metadata/search/BaseDocument.pdsc), which is expected to be included by all documents.
+3. Similar to `Entity`, all remaining fields are made `optional` to support partial updates.
+4. `management` shows an example of a string array field.
+5. `ownedDatasets` shows an example of how a field can be derived from metadata [aspects](aspect.md) associated with other types of entity (in this case, `Dataset`).
+
+```json
+{
+  "type": "record",
+  "name": "BaseDocument",
+  "namespace": "com.linkedin.metadata.search",
+  "doc": "Common fields that apply to all documents",
+  "fields": [
+    {
+      "name": "removed",
+      "type": "boolean",
+      "doc": "Whether the entity has been removed or not",
+      "optional": true,
+      "default": false
+    }
+  ]
+}
+```
+
+```json
+{
+ "type": "record",
+ "name": "UserDocument",
+ "namespace": "com.linkedin.metadata.search",
+ "doc": "Data model for user entity search",
+ "include": [
+   "BaseDocument"
+ ],
+ "fields": [
+   {
+     "name": "urn",
+     "type": "com.linkedin.common.CorpuserUrn",
+     "doc": "Urn for the user"
+   },
+   {
+     "name": "firstName",
+     "type": "string",
+     "doc": "First name of the user",
+     "optional": true
+   },
+   {
+     "name": "lastName",
+     "type": "string",
+     "doc": "Last name of the user",
+     "optional": true
+   },
+   {
+     "name": "management",
+     "type": {
+       "type": "array",
+       "items": "com.linkedin.common.CorpuserUrn"
+     },
+     "doc": "The chain of management all the way to CEO",
+     "default": [],
+     "optional": true
+   },
+   {
+     "name": "costCenter",
+     "type": "int",
+     "doc": "Code for the cost center",
+     "optional": true
+   },
+   {
+     "name": "ownedDatasets",
+     "type": {
+       "type": "array",
+       "items": "com.linkedin.common.DatasetUrn"
+     },
+     "doc": "The list of datasets the user owns",
+     "default": [],
+     "optional": true
+   }
+ ]
+}
+```
diff --git a/docs/what/search-index.md b/docs/what/search-index.md
@@ -0,0 +1 @@
+# What is GMA search index?
diff --git a/docs/what/snapshot.md b/docs/what/snapshot.md
@@ -1,4 +1,4 @@
-# What is a snapshot in GMA?
+# What is a snapshot?
 
 A metadata snapshot models the current state of one or multiple metadata [aspects](aspect.md) associated with a particular [entity](entity.md). 
 Each entity type is expected to have:

diff --git a/docs/what/urn.md b/docs/what/urn.md
@@ -0,0 +1,2 @@
+# What is URN?
+
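As a rough illustration of the filtering use case described in `docs/what/search-document.md` above (users whose first or last name is similar to "Joe" and who report up to `userFoo`), the following Python sketch mimics the query against in-memory documents. It is not how GMA or any particular search engine implements search: the `UserDocument` dataclass, the `is_similar` and `search_users` helpers, the similarity threshold, and the URN strings are all assumptions made for this example. The point is only that an array field such as `management` allows a simple membership test.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class UserDocument:
    """Hypothetical in-memory mirror of a few UserDocument schema fields."""
    urn: str
    firstName: Optional[str] = None
    lastName: Optional[str] = None
    management: List[str] = field(default_factory=list)
    removed: bool = False  # soft-deletion flag, as in BaseDocument


def is_similar(value: Optional[str], query: str, threshold: float = 0.8) -> bool:
    """Crude stand-in for a search engine's fuzzy text match (assumed threshold)."""
    if value is None:
        return False
    return SequenceMatcher(None, value.lower(), query.lower()).ratio() >= threshold


def search_users(docs: List[UserDocument], name: str, manager_urn: str) -> List[str]:
    """Return URNs of non-removed users matching the name who report up to manager_urn."""
    return [
        d.urn
        for d in docs
        if not d.removed
        and (is_similar(d.firstName, name) or is_similar(d.lastName, name))
        and manager_urn in d.management  # membership test on the array field
    ]


# Illustrative data only; the URN values are made up for this sketch.
docs = [
    UserDocument("urn:li:corpuser:joe1", "Joe", "Smith", ["urn:li:corpuser:userFoo"]),
    UserDocument("urn:li:corpuser:joe2", "Joe", "Brown", ["urn:li:corpuser:userBar"]),
    UserDocument("urn:li:corpuser:ann", "Ann", "Lee", ["urn:li:corpuser:userFoo"]),
    UserDocument("urn:li:corpuser:joe3", "Joey", "Gray", ["urn:li:corpuser:userFoo"], removed=True),
]
matches = search_users(docs, "Joe", "urn:li:corpuser:userFoo")
print(matches)
```

Keeping `management` as a flat array of URNs is what makes the "reports up to" condition a single membership check rather than a graph traversal at query time.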