Add doc about search document & some cleanup
Kerem Sahin committed Dec 19, 2019
1 parent 8f120f1 commit 3ba1492
Showing 19 changed files with 126 additions and 11 deletions.
8 changes: 4 additions & 4 deletions docs/architecture/architecture.md
@@ -1,11 +1,11 @@
# DataHub Architecture
![datahub-architecture](../imgs/datahub-architecture.png)

## Generalized Metadata Architecture (GMA)
Refer to [GMA](../what/gma.md).

## Metadata Serving
Refer to [metadata-serving](metadata-serving.md).

## Metadata Ingestion
Refer to [metadata-ingestion](metadata-ingestion.md).

## What is Generalized Metadata Architecture (GMA)?
Refer to [GMA](../what/gma.md).
Refer to [metadata-ingestion](metadata-ingestion.md).
2 changes: 2 additions & 0 deletions docs/architecture/metadata-ingestion.md
@@ -0,0 +1,2 @@
# Metadata Ingestion Architecture

3 changes: 3 additions & 0 deletions docs/architecture/metadata-serving.md
@@ -0,0 +1,3 @@
# Metadata Serving Architecture

![metadata-serving](../imgs/metadata-serving.png)
2 changes: 2 additions & 0 deletions docs/how/entity-onboarding.md
@@ -0,0 +1,2 @@
# How to onboard an entity?

2 changes: 2 additions & 0 deletions docs/how/graph-onboarding.md
@@ -0,0 +1,2 @@
# How to onboard to GMA graph?

4 changes: 2 additions & 2 deletions docs/how/metadata-modelling.md
@@ -1,5 +1,5 @@
# How to model metadata for GMA?
GMA uses [rest.li](https://rest.li), which is LinkedIn's open source REST framework.
# How to model metadata?
[GMA](../what/gma.md) uses [rest.li](https://rest.li), which is LinkedIn's open source REST framework.
All metadata in GMA needs to be modelled using [Pegasus schema (PDSC)](https://linkedin.github.io/rest.li/DATA-Data-Schema-and-Templates) which is the data schema for [rest.li](https://rest.li).

Conceptually we’re modelling metadata as a hybrid graph of nodes ([entities](../what/entity.md)) and edges ([relationships](../what/relationship.md)), with additional documents ([metadata aspects](../what/aspect.md)) attached to each node.
2 changes: 2 additions & 0 deletions docs/how/search-onboarding.md
@@ -0,0 +1,2 @@
# How to onboard to GMA search?

Binary file added docs/imgs/metadata-serving.png
3 changes: 2 additions & 1 deletion docs/what/aspect.md
@@ -1,4 +1,5 @@
# What is a GMA aspect?
# What is a metadata aspect?

A metadata aspect is a structured document, or more precisely a `record` in [PDSC](https://linkedin.github.io/rest.li/DATA-Data-Schema-and-Templates),
that represents a specific kind of metadata (e.g. ownership, schema, statistics, upstreams).
A metadata aspect on its own has no meaning (e.g. ownership for what?) and must be associated with a particular entity (e.g. ownership for PageViewEvent).
2 changes: 1 addition & 1 deletion docs/what/delta.md
@@ -1,4 +1,4 @@
# What is Delta in GMA?
# What is a metadata delta?

Rest.li supports [partial update](https://linkedin.github.io/rest.li/user_guide/restli_server#partial_update) natively without needing explicitly defined models.
However, the granularity of update is always limited to each field in a PDSC model.
3 changes: 2 additions & 1 deletion docs/what/entity.md
@@ -1,4 +1,5 @@
# What is a GMA entity?
# What is an entity?

An entity is very similar to the concept of a [resource](https://linkedin.github.io/rest.li/user_guide/restli_server#writing-resources) in [rest.li](http://rest.li/).
Generally speaking, an entity should have a defined [URN](urn.md) and a corresponding
[CRUD](https://en.wikipedia.org/wiki/Create,_read,_update_and_delete) API for the metadata associated with a particular instance of the entity. A particular instance of an entity is essentially a node in the [metadata graph](graph.md).
2 changes: 2 additions & 0 deletions docs/what/gma.md
@@ -0,0 +1,2 @@
# What is Generalized Metadata Architecture (GMA)?

2 changes: 2 additions & 0 deletions docs/what/gms.md
@@ -0,0 +1,2 @@
# What is Generalized Metadata Store (GMS)?

1 change: 1 addition & 0 deletions docs/what/graph.md
@@ -0,0 +1 @@
# What is GMA graph?
2 changes: 1 addition & 1 deletion docs/what/relationship.md
@@ -1,4 +1,4 @@
# What is a GMA relationship?
# What is a relationship?

A relationship is a named association between exactly two entities, a source and a destination.

94 changes: 94 additions & 0 deletions docs/what/search-document.md
@@ -0,0 +1,94 @@
# What is a search document?

[Search documents](https://en.wikipedia.org/wiki/Search_engine_indexing) are also modeled using [PDSC](https://linkedin.github.io/rest.li/DATA-Data-Schema-and-Templates) explicitly.
In many ways, the model for a Document is very similar to an [Entity](entity.md) and [Relationship](relationship.md) model,
where each attribute/field contains a value that’s derived from various metadata aspects.
However, a search document may also have array-type attributes, provided each array contains only primitives or enum items.
This is because most full-text search engines support membership testing against an array field, e.g. an array field containing all the terms used in a document.
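
To make the idea of membership testing concrete, here is a minimal plain-Python sketch. It only mimics the semantics and is not the API of any search engine; the document is represented as a dict, and the helper `matches_term` and the example URN values are hypothetical.

```python
# Hypothetical search document, represented as a plain dict for illustration.
user_doc = {
    "urn": "urn:li:corpuser:joe",
    "management": ["urn:li:corpuser:manager1", "urn:li:corpuser:userFoo"],
}


def matches_term(doc, field, term):
    """Return True if `term` equals the field value, or is a member of an array field."""
    value = doc.get(field)
    if isinstance(value, list):
        # Array field: the document matches if any element equals the term.
        return term in value
    # Scalar field: plain equality.
    return value == term
```

Because the array holds only primitive-like values, a single equality check per element is enough to decide whether the document matches.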

One obvious use of the attributes is to perform search filtering, e.g. give me all `User`s whose first name or last name is similar to “Joe” and who report up to `userFoo`.
Since the document also serves as the main interface for the search API, the attributes can also be used to format the search snippet.
As a result, one may be tempted to add as many attributes as needed. This is acceptable as the underlying search engine is designed to index a large number of fields.
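
The filtering example above can be sketched in plain Python. This is an illustration of the query's logic under stated assumptions, not the actual search API: documents are dicts, "similar" is approximated with `difflib.SequenceMatcher` and an arbitrary 0.8 threshold, and the names `is_similar`, `search_users`, and the sample URNs are hypothetical.

```python
from difflib import SequenceMatcher


def is_similar(a, b, threshold=0.8):
    """Rough stand-in for fuzzy matching: case-insensitive similarity ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def search_users(docs, name, reports_to):
    """Return URNs of users whose first or last name resembles `name`
    and whose management chain contains `reports_to`."""
    results = []
    for doc in docs:
        name_match = any(
            is_similar(doc.get(field) or "", name)
            for field in ("firstName", "lastName")
        )
        in_chain = reports_to in (doc.get("management") or [])
        if name_match and in_chain:
            results.append(doc["urn"])
    return results


# Hypothetical sample documents.
docs = [
    {"urn": "urn:li:corpuser:joe", "firstName": "Joe", "lastName": "Smith",
     "management": ["urn:li:corpuser:manager1", "urn:li:corpuser:userFoo"]},
    {"urn": "urn:li:corpuser:jane", "firstName": "Jane", "lastName": "Doe",
     "management": ["urn:li:corpuser:userFoo"]},
    {"urn": "urn:li:corpuser:joey", "firstName": "Joey",
     "management": ["urn:li:corpuser:userBar"]},
]

matches = search_users(docs, "Joe", "urn:li:corpuser:userFoo")
```

A real search engine would evaluate both conditions against its index rather than scanning every document; the sketch only shows how name fields and an array field combine in one filter.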

Below is an example schema for the `User` search document. Note that:
1. Each search document is required to have a type-specific `urn` field, which generally maps to an entity in the [graph](graph.md).
2. Similar to `Entity`, each document has an optional `removed` field for "soft deletion".
This is captured in [BaseDocument](../../metadata-models/src/main/pegasus/com/linkedin/metadata/search/BaseDocument.pdsc), which is expected to be included by all documents.
3. Similar to `Entity`, all remaining fields are made `optional` to support partial updates.
4. `management` shows an example of a string array field.
5. `ownedDatasets` shows an example of how a field can be derived from metadata [aspects](aspect.md) associated with other types of entity (in this case, `Dataset`).

```json
{
  "type": "record",
  "name": "BaseDocument",
  "namespace": "com.linkedin.metadata.search",
  "doc": "Common fields that apply to all documents",
  "fields": [
    {
      "name": "removed",
      "type": "boolean",
      "doc": "Whether the entity has been removed or not",
      "optional": true,
      "default": false
    }
  ]
}
```

```json
{
  "type": "record",
  "name": "UserDocument",
  "namespace": "com.linkedin.metadata.search",
  "doc": "Data model for user entity search",
  "include": [
    "BaseDocument"
  ],
  "fields": [
    {
      "name": "urn",
      "type": "com.linkedin.common.CorpuserUrn",
      "doc": "Urn for the user"
    },
    {
      "name": "firstName",
      "type": "string",
      "doc": "First name of the user",
      "optional": true
    },
    {
      "name": "lastName",
      "type": "string",
      "doc": "Last name of the user",
      "optional": true
    },
    {
      "name": "management",
      "type": {
        "type": "array",
        "items": "com.linkedin.common.CorpuserUrn"
      },
      "doc": "The chain of management all the way to CEO",
      "default": [],
      "optional": true
    },
    {
      "name": "costCenter",
      "type": "int",
      "doc": "Code for the cost center",
      "optional": true
    },
    {
      "name": "ownedDatasets",
      "type": {
        "type": "array",
        "items": "com.linkedin.common.DatasetUrn"
      },
      "doc": "The list of dataset the user owns",
      "default": [],
      "optional": true
    }
  ]
}
```
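
Note 3 above says that all fields other than `urn` are optional to support partial updates. A minimal Python sketch of that idea follows, assuming a partial document carries the `urn` plus only the fields that changed. The helper `apply_partial_update` and the sample values are hypothetical and do not represent the actual GMA indexing code.

```python
def apply_partial_update(existing, partial):
    """Merge a partial search document into an existing one with the same urn.

    Fields absent from `partial` keep their existing values; the input
    dicts are left unmodified.
    """
    if existing["urn"] != partial["urn"]:
        raise ValueError("partial update must target the same urn")
    merged = dict(existing)
    merged.update({k: v for k, v in partial.items() if k != "urn"})
    return merged


# Hypothetical documents: only `costCenter` is carried in the update.
existing = {
    "urn": "urn:li:corpuser:joe",
    "firstName": "Joe",
    "lastName": "Smith",
    "costCenter": 100,
}
partial = {"urn": "urn:li:corpuser:joe", "costCenter": 200}

merged = apply_partial_update(existing, partial)
```

If the non-`urn` fields were required, every update would have to resend the full document, including values derived from unrelated aspects.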
1 change: 1 addition & 0 deletions docs/what/search-index.md
@@ -0,0 +1 @@
# What is GMA search index?
2 changes: 1 addition & 1 deletion docs/what/snapshot.md
@@ -1,4 +1,4 @@
# What is a snapshot in GMA?
# What is a snapshot?

A metadata snapshot models the current state of one or multiple metadata [aspects](aspect.md) associated with a particular [entity](entity.md).
Each entity type is expected to have:
2 changes: 2 additions & 0 deletions docs/what/urn.md
@@ -0,0 +1,2 @@
# What is URN?
