Documentation update part-2

Kerem Sahin · Kerem Sahin · commit 2b867f8c5f9c · 2019-12-19T17:23:48.000-08:00
diff --git a/docs/architecture/metadata-ingestion.md b/docs/architecture/metadata-ingestion.md
@@ -1,2 +1,64 @@
 # Metadata Ingestion Architecture
 
+## MCE Consumer Job
+
+## MAE Consumer Job
+
+All emitted [MAE]s are consumed by a Kafka Streams job, [mae-consumer-job], which updates the [graph] and [search index] accordingly.
+The job itself is entity-agnostic: whenever a specific metadata aspect changes, it invokes the corresponding graph and search index builders.
+Each builder instructs the job how to update the graph and search index based on the metadata change.
+A builder can optionally use [Remote DAO] to fetch additional metadata from other sources to help compute the final update.
+
+To ensure that metadata changes are processed in the correct chronological order,
+MAEs are keyed by the entity [URN], meaning all MAEs for a particular entity will be processed sequentially by a single Kafka Streams thread.
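+
+The ordering guarantee above follows from how keyed messages map to partitions. The snippet below is a minimal sketch of that idea, not the actual Kafka partitioner: it uses `String.hashCode()` as a stand-in hash (an assumption made for illustration), and the class name, method name, and URN value are hypothetical.
+
+```java
+final class UrnPartitioningSketch {
+
+  // Maps a key to a partition deterministically, so equal keys always share a partition.
+  // Assumption: String.hashCode() stands in for whatever hash the real producer applies.
+  static int partitionFor(String urn, int numPartitions) {
+    // floorMod keeps the result in [0, numPartitions) even for negative hash codes.
+    return Math.floorMod(urn.hashCode(), numPartitions);
+  }
+}
+```
+
+Because every MAE keyed by the same URN lands in the same partition, and a partition is processed by one stream thread at a time, events for that entity are handled in the order they were written.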
+
+## Search and Graph Index Builders
+
+As described in the [Metadata Modelling] section, the [Entity], [Relationship], and [Search Document] models do not directly encode the logic of how each field should be derived from metadata.
+Instead, this logic should be provided in the form of a graph or search index builder.
+
+Builders register the metadata [aspect]s they are interested in with the [MAE Consumer Job](#mae-consumer-job) and are invoked whenever an MAE involving a registered aspect is received.
+If the MAE itself doesn't contain all the metadata needed, builders can use Remote DAO to fetch it from GMS directly.
+
+```java
+public abstract class BaseIndexBuilder<DOCUMENT extends RecordTemplate> {
+
+ BaseIndexBuilder(@Nonnull List<Class<? extends RecordTemplate>> snapshotsInterested);
+
+ @Nullable
+ public abstract List<DOCUMENT> getDocumentsToUpdate(@Nonnull RecordTemplate snapshot);
+
+ @Nonnull
+ public abstract Class<DOCUMENT> getDocumentType();
+}
+```
+
+```java
+public interface GraphBuilder<SNAPSHOT extends RecordTemplate> {
+ GraphUpdates build(SNAPSHOT snapshot);
+
+ @Value
+ class GraphUpdates {
+   List<? extends RecordTemplate> entities;
+   List<RelationshipUpdates> relationshipUpdates;
+ }
+
+ @Value
+ class RelationshipUpdates {
+   List<? extends RecordTemplate> relationships;
+   BaseGraphWriterDAO.RemovalOption preUpdateOperation;
+ }
+}
+```
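+
+The registration and invocation flow described in this section can be illustrated with a small registry. This is a minimal sketch under stated assumptions, not the job's actual implementation: `BuilderRegistrySketch`, `register`, and `dispatch` are hypothetical names, and plain `Object`s stand in for Pegasus records so the example stays self-contained.
+
+```java
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.function.Consumer;
+
+final class BuilderRegistrySketch {
+
+  // Aspect class -> callbacks of the builders interested in that aspect.
+  private final Map<Class<?>, List<Consumer<Object>>> buildersByAspect = new HashMap<>();
+
+  // Registers a builder callback for one aspect type.
+  void register(Class<?> aspectType, Consumer<Object> builder) {
+    buildersByAspect.computeIfAbsent(aspectType, k -> new ArrayList<>()).add(builder);
+  }
+
+  // Invokes every builder registered for the aspect's exact class; returns how many ran.
+  int dispatch(Object aspect) {
+    List<Consumer<Object>> builders =
+        buildersByAspect.getOrDefault(aspect.getClass(), new ArrayList<>());
+    for (Consumer<Object> builder : builders) {
+      builder.accept(aspect);
+    }
+    return builders.size();
+  }
+}
+```
+
+A change to an aspect that no builder registered for simply results in no invocation, which is what keeps the job itself entity-agnostic.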
+
+[MAE]: ../what/mxe.md#metadata-audit-event-mae
+[graph]: ../what/graph.md
+[search index]: ../what/search-index.md
+[mae-consumer-job]: ../../metadata-jobs/mae-consumer-job
+[Remote DAO]: ../architecture/metadata-serving.md#remote-dao
+[URN]: ../what/urn.md
+[Metadata Modelling]: ../how/metadata-modelling.md
+[Entity]: ../what/entity.md
+[Relationship]: ../what/relationship.md
+[Search Document]: ../what/search-document.md
+[Aspect]: ../what/aspect.md
diff --git a/docs/architecture/metadata-serving.md b/docs/architecture/metadata-serving.md
@@ -1,3 +1,128 @@
 # Metadata Serving Architecture
 
-![metadata-serving](../imgs/metadata-serving.png)
+This section describes how metadata is served in GMA. In particular, it demonstrates how GMA can efficiently service different types of queries, including key-value lookups, complex queries, and full-text search.
+The diagram below shows the high-level metadata serving architecture.
+
+![metadata-serving](../imgs/metadata-serving.png)
+
+There are four types of Data Access Object ([DAO]) that standardize the way metadata is accessed.
+This section describes each type of DAO, its purpose, and its interface.
+
+These DAOs rely heavily on [Java Generics](https://docs.oracle.com/javase/tutorial/extra/generics/index.html) so that the core logic can remain type-neutral.
+However, as there's no inheritance in [Pegasus], the generics often fall back to extending [RecordTemplate] instead of the desired types (e.g. [entity], [relationship], or metadata [aspect]).
+It is possible to add runtime type checking to avoid binding the DAO to an unexpected type, at the expense of a slight degradation in performance.
+
+## Key-value DAO (Local DAO)
+
+[GMS] uses [Local DAO] to store and retrieve metadata [aspect]s from the local document store.
+The base class and its simple key-value interface are shown below.
+As the DAO is a generic class, it needs to be bound to a specific type during instantiation.
+Each entity type will need to instantiate its own version of the DAO.
+
+```java
+public abstract class BaseLocalDAO<ASPECT extends UnionTemplate> {
+
+ public abstract <URN extends Urn, METADATA extends RecordTemplate> void 
+  add(Class<METADATA> type, URN urn, METADATA value);
+
+ public abstract <URN extends Urn, METADATA extends RecordTemplate> 
+  Optional<METADATA> get(Class<METADATA> type, URN urn, int version);
+
+ public abstract <URN extends Urn, METADATA extends RecordTemplate> 
+  ListResult<Integer> listVersions(Class<METADATA> type, URN urn, int start, 
+    int pageSize);
+
+ public abstract <METADATA extends RecordTemplate> ListResult<Urn> listUrns( 
+  Class<METADATA> type, int start, int pageSize);
+
+ public abstract <URN extends Urn, METADATA extends RecordTemplate> 
+  ListResult<METADATA> list(Class<METADATA> type, URN urn, int start, int pageSize);
+}
+```
+
+Another important function of [Local DAO] is to automatically emit [MAE]s whenever the metadata is updated.
+This is doable because MAEs effectively use the same [Pegasus] models, so a [RecordTemplate] can be easily converted into the corresponding [GenericRecord].
+
+## Search DAO
+
+Search DAO is also a generic class that can be bound to a specific type of search document.
+The DAO provides three APIs:
+* A `search` API that takes the search input, a [Filter], a [SortCriterion], and some pagination parameters, and returns a [SearchResult].
+* An `autoComplete` API that allows typeahead-style autocomplete based on the current input, a target field, and a [Filter], and returns an [AutoCompleteResult].
+* A `filter` API that allows filtering without a search input. It takes a [Filter], a [SortCriterion], and pagination parameters as input and returns a [SearchResult].
+
+```java
+public abstract class BaseSearchDAO<DOCUMENT extends RecordTemplate> {
+
+  public abstract SearchResult<DOCUMENT> search(String input, Filter filter, 
+        SortCriterion sortCriterion, int from, int size);
+
+  public abstract AutoCompleteResult autoComplete(String input, String field,
+        Filter filter, int limit);
+
+  public abstract SearchResult<DOCUMENT> filter(Filter filter, SortCriterion sortCriterion, 
+        int from, int size);
+}
+```
+
+## Query DAO
+
+Query DAO allows clients, e.g. [GMS](../what/gms.md) or the [MAE Consumer Job](metadata-ingestion.md#mae-consumer-job), to perform both graph and non-graph queries against the metadata graph.
+For instance, a GMS can use the Query DAO to find "all the datasets owned by users who are part of the group `foo` and report to `bar`", which naturally translates to a graph query.
+Alternatively, a client may wish to retrieve "all the datasets that are stored under /jobs/metrics", which doesn't involve any graph traversal.
+
+Below is the base class for Query DAOs, which contains the `findEntities` and `findRelationships` methods.
+Each method has two versions: one that involves graph traversal and one that doesn't.
+You can use `findMixedTypesEntities` and `findMixedTypesRelationships` for queries that return a mixture of different types of entities or relationships.
+As these methods return a list of [RecordTemplate], callers will need to manually cast the results back to the specific entity type using [isInstance()](https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#isInstance-java.lang.Object-) or reflection.
+
+Note that the generics (ENTITY, RELATIONSHIP) are purposely left untyped, as these types are native to the underlying graph DB and will most likely differ from one implementation to another.
+
+```java
+public abstract class BaseQueryDAO<ENTITY, RELATIONSHIP> {
+
+ public abstract <ENTITY extends RecordTemplate> List<ENTITY> findEntities(
+  Class<ENTITY> type, Filter filter, int offset, int count);
+
+ public abstract <ENTITY extends RecordTemplate> List<ENTITY> findEntities(
+  Class<ENTITY> type, Statement function);
+
+ public abstract List<RecordTemplate> findMixedTypesEntities(Statement function);
+
+ public abstract <ENTITY extends RecordTemplate, RELATIONSHIP extends RecordTemplate> List<RELATIONSHIP> 
+  findRelationships(Class<ENTITY> entityType, Class<RELATIONSHIP> relationshipType, Filter filter, int offset, int count);
+
+ public abstract <RELATIONSHIP extends RecordTemplate> List<RELATIONSHIP> 
+  findRelationships(Class<RELATIONSHIP> type, Statement function);
+
+ public abstract List<RecordTemplate> findMixedTypesRelationships(
+  Statement function);
+}
+```
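+
+Narrowing a mixed-type result list with `isInstance()`, as mentioned above, might look like the following. This is a minimal sketch: `MixedResultsSketch` and `filterByType` are hypothetical names, and plain `Object`s stand in for [RecordTemplate] instances so the example stays self-contained.
+
+```java
+import java.util.ArrayList;
+import java.util.List;
+
+final class MixedResultsSketch {
+
+  // Returns only the elements that are instances of the requested type, safely cast.
+  static <T> List<T> filterByType(List<?> mixed, Class<T> type) {
+    List<T> result = new ArrayList<>();
+    for (Object item : mixed) {
+      // isInstance is the runtime equivalent of instanceof for a Class object.
+      if (type.isInstance(item)) {
+        result.add(type.cast(item));
+      }
+    }
+    return result;
+  }
+}
+```
+
+Using `Class.cast` after the `isInstance` check avoids unchecked-cast warnings and keeps the type check and the cast in one place.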
+
+## Remote DAO
+
+[Remote DAO] is simply a specialized read-only implementation of [Local DAO].
+Rather than retrieving metadata from local storage, Remote DAO fetches the metadata from another [GMS].
+The mapping between [entity] type and GMS is implemented as a hard-coded map.
+
+To prevent a circular dependency (a [rest.li] service depends on Remote DAO, which in turn would depend on the rest.li client generated by each rest.li service),
+Remote DAO needs to construct raw rest.li requests directly, instead of using each entity's rest.li request builder.
+
+
+[AutocompleteResult]: ../../metadata-dao/src/main/pegasus/com/linkedin/metadata/query/AutoCompleteResult.pdsc
+[Filter]: ../../metadata-dao/src/main/pegasus/com/linkedin/metadata/query/Filter.pdsc
+[SortCriterion]: ../../metadata-dao/src/main/pegasus/com/linkedin/metadata/query/SortCriterion.pdsc
+[SearchResult]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/SearchResult.java
+[RecordTemplate]: https://github.com/linkedin/rest.li/blob/master/data/src/main/java/com/linkedin/data/template/RecordTemplate.java
+[GenericRecord]: https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/generic/GenericRecord.java
+[DAO]: https://en.wikipedia.org/wiki/Data_access_object
+[Pegasus]: https://linkedin.github.io/rest.li/DATA-Data-Schema-and-Templates
+[relationship]: ../what/relationship.md
+[entity]: ../what/entity.md
+[aspect]: ../what/aspect.md
+[GMS]: ../what/gms.md
+[Local DAO]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/BaseLocalDAO.java
+[Remote DAO]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/BaseRemoteDAO.java
+[MAE]: ../what/mxe.md#metadata-audit-event-mae
+[rest.li]: https://rest.li
diff --git a/docs/what/gma.md b/docs/what/gma.md
@@ -1,2 +1,9 @@
 # What is Generalized Metadata Architecture (GMA)?
 
+GMA is the name for DataHub's backend infrastructure. Unlike existing architectures, GMA will be able to efficiently service the three most common query patterns (`document-oriented`, `complex` & `graph` queries, and `fulltext search`)
+with minimal onboarding effort.
+
+GMA also embraces a distributed model, where each team owns, develops, and operates its own [metadata services](gms.md), while the metadata is automatically aggregated to populate the central metadata graph and search indexes.
+This is made possible by standardizing the metadata models and the access layer.
+We strongly believe that GMA can bring tremendous leverage to any team that has a need to store and access metadata.
+Moreover, standardizing metadata modeling promotes a model-first approach to development, resulting in a more concise, consistent, and highly connected metadata ecosystem that benefits all DataHub users.
diff --git a/docs/what/gms.md b/docs/what/gms.md
@@ -1,2 +1,8 @@
-# What is Generalized Metadata Store (GMS)?
+# What is Generalized Metadata Service (GMS)?
 
+
+
+## Central vs Distributed
+Any entity, such as `datasets` or `users`, that is [onboarded](../how/entity-onboarding.md) to GMA can be designed to be served through a separate microservice.
+These microservices are called GMS. Although we have a central GMS ([datahub-gms](../../gms)) that serves all entities,
+we could easily have multiple GMS, each responsible for a specific entity, such as `dataset-gms`, `user-gms`, etc.
diff --git a/docs/what/graph.md b/docs/what/graph.md
@@ -1 +1,15 @@
 # What is GMA graph?
+
+All the [entities](entity.md) and [relationships](relationship.md) are stored in a graph database, Neo4j. 
+The graph always represents the current state of the world and has no direct support for versioning or history. 
+However, as stated in the [Metadata Modeling](../how/metadata-modelling.md) section, 
+the graph is merely a derived view of all metadata [aspects](aspect.md) and thus can always be rebuilt directly from historic [MAEs](mxe.md#metadata-audit-event-mae).
+Consequently, it is possible to build a specific snapshot of the graph in time by replaying MAEs up to that point.
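+
+Rebuilding a point-in-time view by replaying events can be sketched as follows. This is only an illustration of the idea, not GMA's implementation: the `Event` class is a hypothetical stand-in for an MAE, the state is reduced to one string value per URN, and the events are assumed to be already sorted by timestamp.
+
+```java
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+
+final class ReplaySketch {
+
+  // Hypothetical stand-in for an MAE: which entity changed, when, and its new value.
+  static final class Event {
+    final String urn;
+    final long timestamp;
+    final String value;
+
+    Event(String urn, long timestamp, String value) {
+      this.urn = urn;
+      this.timestamp = timestamp;
+      this.value = value;
+    }
+  }
+
+  // Rebuilds the state as of `cutoff` by applying, in order, every event at or before it.
+  // Assumes `events` is sorted by ascending timestamp.
+  static Map<String, String> replayUntil(List<Event> events, long cutoff) {
+    Map<String, String> state = new LinkedHashMap<>();
+    for (Event event : events) {
+      if (event.timestamp > cutoff) {
+        break; // everything after the cutoff is ignored
+      }
+      state.put(event.urn, event.value); // later events overwrite earlier ones
+    }
+    return state;
+  }
+}
+```
+
+Replaying the same events with a later cutoff yields a newer snapshot, which is why the graph itself does not need built-in versioning.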
+
+In theory, the system can work with any generic [OLTP](https://en.wikipedia.org/wiki/Online_transaction_processing) graph DB that supports the following operations:
+* Dynamic creation, modification, and removal of nodes and edges
+* Dynamic attachment of key-value properties to each node and edge
+* Transactional partial updates of properties of a specific node or edge
+* Fast ID-based retrieval of nodes and edges
+* Efficient queries involving both graph traversal and property value filtering
+* Efficient bidirectional graph traversal
diff --git a/docs/what/search-index.md b/docs/what/search-index.md
@@ -1 +1,17 @@
 # What is GMA search index?
+
+Each [search document](search-document.md) type (or [entity](entity.md) type) will be mapped to an independent search index in Elasticsearch. 
+Beyond the standard search engine features (analyzer, tokenizer, filter queries, faceting, sharding, etc.),
+GMA also supports the following specific features:
+* Partial update of indexed documents
+* Membership testing on multi-value fields
+* Zero-downtime switch between indices
+
+Check out [Search DAO](../architecture/metadata-serving.md#search-dao) for search query abstraction in GMA.
+
+## Search Automation (TBD)
+
+We aim to automate index creation, schema evolution, and reindexing so that teams only need to focus on the search document model and their custom [Index Builder](../architecture/metadata-ingestion.md#search-and-graph-index-builders) logic.
+As the logic changes, a new version of the index will be created and populated from historic MAEs.
+Once it's fully populated, the team can switch to the new version through a simple config change in their [GMS](gms.md).
+They can also roll back to an older version of the index whenever needed.