Commit 2b867f8

author
Kerem Sahin
committed
Documentation update part-2
1 parent 3ba1492 commit 2b867f8

6 files changed

+232
-2
lines changed

Lines changed: 62 additions & 0 deletions
@@ -1,2 +1,64 @@
11
# Metadata Ingestion Architecture
22

3+
## MCE Consumer Job
4+
5+
## MAE Consumer Job
6+
7+
All emitted [MAE]s are consumed by a Kafka Streams job, [mae-consumer-job], which updates the [graph] and [search index] accordingly.
8+
The job itself is entity-agnostic: whenever a specific metadata aspect changes, it invokes the corresponding graph and search index builders.
9+
The builder should instruct the job how to update the graph and search index based on the metadata change.
10+
The builder can optionally use [Remote DAO] to fetch additional metadata from other sources to help compute the final update.
11+
12+
To ensure that metadata changes are processed in the correct chronological order,
13+
MAEs are keyed by the entity [URN], meaning all MAEs for a particular entity are processed sequentially by a single Kafka Streams thread.
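
The ordering guarantee above follows from key-based partitioning: events with the same key always land in the same partition, and a partition is consumed in order. The sketch below illustrates that idea only. The class `UrnPartitioning`, the URN strings, and the use of `String.hashCode()` are assumptions made for illustration; Kafka's own partitioner uses a different hash function, and this is not code from mae-consumer-job.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of key-based partitioning; not the mae-consumer-job code.
final class UrnPartitioning {

  private UrnPartitioning() { }

  // Simplified stand-in for a key-based partitioner: equal keys always map to the same partition.
  static int partitionFor(String urn, int numPartitions) {
    return Math.floorMod(urn.hashCode(), numPartitions);
  }

  // Groups event payloads by partition, preserving arrival order within each partition.
  // Each event is a two-element array: {urn, payload}.
  static Map<Integer, List<String>> assign(List<String[]> keyedEvents, int numPartitions) {
    Map<Integer, List<String>> byPartition = new LinkedHashMap<>();
    for (String[] event : keyedEvents) {
      int partition = partitionFor(event[0], numPartitions);
      byPartition.computeIfAbsent(partition, p -> new ArrayList<>()).add(event[1]);
    }
    return byPartition;
  }

  public static void main(String[] args) {
    List<String[]> events = new ArrayList<>();
    events.add(new String[] {"urn:example:a", "a-v1"});
    events.add(new String[] {"urn:example:b", "b-v1"});
    events.add(new String[] {"urn:example:a", "a-v2"});
    // Both events for urn:example:a end up in the same list, in arrival order.
    System.out.println(assign(events, 4));
  }
}
```

Because every event for one URN is handled by whichever thread owns its partition, two updates to the same entity cannot be applied out of order, while updates to different entities can still be processed in parallel.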
14+
15+
## Search and Graph Index Builders
16+
17+
As described in the [Metadata Modelling] section, the [Entity], [Relationship], and [Search Document] models do not directly encode the logic of how each field should be derived from metadata.
18+
Instead, this logic should be provided in the form of a graph or search index builder.
19+
20+
The builders register the metadata [aspect]s they are interested in with the [MAE Consumer Job](#mae-consumer-job) and are invoked whenever an MAE involving the corresponding aspect is received.
21+
If the MAE itself doesn’t contain all the metadata needed, builders can use Remote DAO to fetch from GMS directly.
22+
23+
```java
public abstract class BaseIndexBuilder<DOCUMENT extends RecordTemplate> {

  BaseIndexBuilder(@Nonnull List<Class<? extends RecordTemplate>> snapshotsInterested);

  @Nullable
  public abstract List<DOCUMENT> getDocumentsToUpdate(@Nonnull RecordTemplate snapshot);

  @Nonnull
  public abstract Class<DOCUMENT> getDocumentType();
}
```
35+
36+
```java
public interface GraphBuilder<SNAPSHOT extends RecordTemplate> {
  GraphUpdates build(SNAPSHOT snapshot);

  @Value
  class GraphUpdates {
    List<? extends RecordTemplate> entities;
    List<RelationshipUpdates> relationshipUpdates;
  }

  @Value
  class RelationshipUpdates {
    List<? extends RecordTemplate> relationships;
    BaseGraphWriterDAO.RemovalOption preUpdateOperation;
  }
}
```
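
The registration and invocation flow described above can be sketched as a small registry keyed by aspect class. This is a simplified illustration, not the job's actual wiring: `BuilderRegistry`, `OwnershipAspect`, and the use of plain strings as "updates" are all hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a metadata aspect.
final class OwnershipAspect {
  final String owner;

  OwnershipAspect(String owner) {
    this.owner = owner;
  }
}

// Hypothetical sketch of aspect-based builder dispatch; names are illustrative only.
final class BuilderRegistry {

  // Maps an aspect class to the builders that registered interest in it.
  private final Map<Class<?>, List<Function<Object, String>>> builders = new HashMap<>();

  // Registers a builder for one aspect type.
  <T> void register(Class<T> aspectType, Function<T, String> builder) {
    builders.computeIfAbsent(aspectType, k -> new ArrayList<>())
        .add(aspect -> builder.apply(aspectType.cast(aspect)));
  }

  // Invokes every builder registered for the runtime class of the changed aspect.
  List<String> dispatch(Object changedAspect) {
    List<String> updates = new ArrayList<>();
    List<Function<Object, String>> interested = builders.get(changedAspect.getClass());
    if (interested == null) {
      return updates; // no builder cares about this aspect
    }
    for (Function<Object, String> builder : interested) {
      updates.add(builder.apply(changedAspect));
    }
    return updates;
  }
}
```

A caller would register a builder once, for example `registry.register(OwnershipAspect.class, a -> "owner=" + a.owner)`, and the job would then call `dispatch` for each changed aspect it receives. Aspects with no registered builder produce no updates.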
53+
54+
[MAE]: ../what/mxe.md#metadata-audit-event-mae
55+
[graph]: ../what/graph.md
56+
[search index]: ../what/search-index.md
57+
[mae-consumer-job]: ../../metadata-jobs/mae-consumer-job
58+
[Remote DAO]: ../architecture/metadata-serving.md#remote-dao
59+
[URN]: ../what/urn.md
60+
[Metadata Modelling]: ../how/metadata-modelling.md
61+
[Entity]: ../what/entity.md
62+
[Relationship]: ../what/relationship.md
63+
[Search Document]: ../what/search-document.md
64+
[Aspect]: ../what/aspect.md

docs/architecture/metadata-serving.md

Lines changed: 126 additions & 1 deletion
@@ -1,3 +1,128 @@
11
# Metadata Serving Architecture
22

3-
![metadata-serving](../imgs/metadata-serving.png)
3+
This section describes how metadata is served in GMA. In particular, it demonstrates how GMA can efficiently service different types of queries, including key-value lookups, complex queries, and full-text search.
4+
The diagram below shows a high-level view of the metadata serving architecture.
5+
6+
![metadata-serving](../imgs/metadata-serving.png)
7+
8+
There are four types of Data Access Object ([DAO]) that standardize the way metadata is accessed.
9+
This section describes each type of DAO, its purpose, and the interface.
10+
11+
These DAOs rely heavily on [Java Generics](https://docs.oracle.com/javase/tutorial/extra/generics/index.html) so that the core logic can remain type-neutral.
12+
However, as there's no inheritance in [Pegasus], the generics often fall back to extending [RecordTemplate] instead of the desired types (e.g. [entity], [relationship], metadata [aspect], etc.).
13+
It is possible to add runtime type checking to avoid binding the DAO to an unexpected type, at the expense of a slight degradation in performance.
14+
15+
## Key-value DAO (Local DAO)
16+
17+
[GMS] uses [Local DAO] to store and retrieve metadata [aspect]s from the local document store.
18+
The base class and its simple key-value interface are shown below.
19+
As the DAO is a generic class, it needs to be bound to a specific type during instantiation.
20+
Each entity type will need to instantiate its own version of DAO.
21+
22+
```java
public abstract class BaseLocalDAO<ASPECT extends UnionTemplate> {

  public abstract <URN extends Urn, METADATA extends RecordTemplate> void
      add(Class<METADATA> type, URN urn, METADATA value);

  public abstract <URN extends Urn, METADATA extends RecordTemplate>
      Optional<METADATA> get(Class<METADATA> type, URN urn, int version);

  public abstract <URN extends Urn, METADATA extends RecordTemplate>
      ListResult<Integer> listVersions(Class<METADATA> type, URN urn, int start,
          int pageSize);

  public abstract <METADATA extends RecordTemplate> ListResult<Urn> listUrns(
      Class<METADATA> type, int start, int pageSize);

  public abstract <URN extends Urn, METADATA extends RecordTemplate>
      ListResult<METADATA> list(Class<METADATA> type, URN urn, int start, int pageSize);
}
```
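
To make the key-value contract above more concrete, here is a minimal in-memory sketch of a versioned store with the same shape of `add`, `get`, and `listVersions`. It makes several assumptions that the source does not confirm: URNs are plain strings, versions are numbered from 0 in insertion order, and `listVersions` returns a plain list rather than a `ListResult`. The real DAO may number versions differently and persists to a document store.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory illustration of a versioned key-value store; not BaseLocalDAO itself.
final class InMemoryVersionedStore {

  // One append-only list of values per (aspect type, urn) pair.
  private final Map<String, List<Object>> store = new HashMap<>();

  private static String key(Class<?> type, String urn) {
    return type.getName() + "|" + urn;
  }

  // Appends a new version of the aspect for the given URN.
  <T> void add(Class<T> type, String urn, T value) {
    store.computeIfAbsent(key(type, urn), k -> new ArrayList<>()).add(value);
  }

  // Returns the requested version, or empty if it does not exist.
  <T> Optional<T> get(Class<T> type, String urn, int version) {
    List<Object> versions = store.get(key(type, urn));
    if (versions == null || version < 0 || version >= versions.size()) {
      return Optional.empty();
    }
    return Optional.of(type.cast(versions.get(version)));
  }

  // Returns one page of version numbers, beginning at start.
  <T> List<Integer> listVersions(Class<T> type, String urn, int start, int pageSize) {
    List<Object> versions = store.get(key(type, urn));
    int total = versions == null ? 0 : versions.size();
    List<Integer> page = new ArrayList<>();
    for (int v = Math.max(start, 0); v < total && page.size() < pageSize; v++) {
      page.add(v);
    }
    return page;
  }
}
```

Passing the aspect class as the first argument is what lets a single store serve many aspect types while still returning a correctly typed value to the caller.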
42+
43+
Another important function of [Local DAO] is to automatically emit [MAE]s whenever the metadata is updated.
44+
This is doable because MAEs effectively use the same [Pegasus] models, so a [RecordTemplate] can be easily converted into the corresponding [GenericRecord].
45+
46+
## Search DAO
47+
48+
Search DAO is also a generic class that can be bound to a specific type of search document.
49+
The DAO provides three APIs:
* A `search` API that takes the search input, a [Filter], a [SortCriterion], and some pagination parameters, and returns a [SearchResult].
* An `autoComplete` API that allows typeahead-style autocomplete based on the current input and a [Filter], and returns an [AutocompleteResult].
* A `filter` API that allows filtering without a search input. It takes a [Filter] and a [SortCriterion] as input and returns a [SearchResult].
53+
54+
```java
public abstract class BaseSearchDAO<DOCUMENT extends RecordTemplate> {

  public abstract SearchResult<DOCUMENT> search(String input, Filter filter,
      SortCriterion sortCriterion, int from, int size);

  public abstract AutoCompleteResult autoComplete(String input, String field,
      Filter filter, int limit);

  public abstract SearchResult<DOCUMENT> filter(Filter filter, SortCriterion sortCriterion,
      int from, int size);
}
```
67+
68+
## Query DAO
69+
70+
Query DAO allows clients, e.g. [GMS](../what/gms.md), [MAE Consumer Job](metadata-ingestion.md#mae-consumer-job) etc, to perform both graph & non-graph queries against the metadata graph.
71+
For instance, a GMS can use the Query DAO to find out “all the datasets owned by the users who are part of the group `foo` and report to `bar`,” which naturally translates to a graph query.
72+
Alternatively, a client may wish to retrieve “all the datasets that are stored under /jobs/metrics”, which doesn't involve any graph traversal.
73+
74+
Below is the base class for Query DAOs, which contains the `findEntities` and `findRelationships` methods.
75+
Each method has two versions: one that involves graph traversal and one that doesn't.
76+
You can use `findMixedTypesEntities` and `findMixedTypesRelationships` for queries that return a mixture of different types of entities or relationships.
77+
As these methods return a list of [RecordTemplate], callers will need to manually cast them back to the specific entity type using [isInstance()](https://docs.oracle.com/javase/8/docs/api/java/lang/Class.html#isInstance-java.lang.Object-) or reflection.
78+
79+
Note that the generics (ENTITY, RELATIONSHIP) are purposely left untyped, as these types are native to the underlying graph DB and will most likely differ from one implementation to another.
80+
81+
```java
public abstract class BaseQueryDAO<ENTITY, RELATIONSHIP> {

  public abstract <ENTITY extends RecordTemplate> List<ENTITY> findEntities(
      Class<ENTITY> type, Filter filter, int offset, int count);

  public abstract <ENTITY extends RecordTemplate> List<ENTITY> findEntities(
      Class<ENTITY> type, Statement function);

  public abstract List<RecordTemplate> findMixedTypesEntities(Statement function);

  public abstract <ENTITY extends RecordTemplate, RELATIONSHIP extends RecordTemplate> List<RELATIONSHIP>
      findRelationships(Class<ENTITY> entityType, Class<RELATIONSHIP> relationshipType, Filter filter, int offset, int count);

  public abstract <RELATIONSHIP extends RecordTemplate> List<RELATIONSHIP>
      findRelationships(Class<RELATIONSHIP> type, Statement function);

  public abstract List<RecordTemplate> findMixedTypesRelationships(
      Statement function);
}
```
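
For the mixed-type methods, the manual cast mentioned above can be done with `Class.isInstance` and `Class.cast`. The helper below is a minimal sketch of that pattern. `MixedResults` and `ofType` are hypothetical names, and the helper works on any `List<?>`, so it would apply to a `List<RecordTemplate>` just as well.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for narrowing a mixed-type result list; not part of BaseQueryDAO.
final class MixedResults {

  private MixedResults() { }

  // Returns only the items that are instances of the given type, in their original order.
  static <T> List<T> ofType(List<?> mixed, Class<T> type) {
    List<T> matches = new ArrayList<>();
    for (Object item : mixed) {
      if (type.isInstance(item)) {
        matches.add(type.cast(item));
      }
    }
    return matches;
  }
}
```

Checking with `isInstance` before calling `cast` avoids a `ClassCastException` when the list contains types the caller is not interested in.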
102+
103+
## Remote DAO
104+
105+
[Remote DAO] is nothing but a specialized read-only implementation of [Local DAO].
106+
Rather than retrieving metadata from local storage, Remote DAO fetches the metadata from another [GMS].
107+
The mapping between [entity] type and GMS is implemented as a hard-coded map.
108+
109+
To prevent a circular dependency (a [rest.li] service depends on Remote DAO, which in turn depends on the rest.li client generated by each rest.li service),
110+
Remote DAO will need to construct raw rest.li requests directly, instead of using each entity’s rest.li request builder.
111+
112+
113+
[AutocompleteResult]: ../../metadata-dao/src/main/pegasus/com/linkedin/metadata/query/AutoCompleteResult.pdsc
114+
[Filter]: ../../metadata-dao/src/main/pegasus/com/linkedin/metadata/query/Filter.pdsc
115+
[SortCriterion]: ../../metadata-dao/src/main/pegasus/com/linkedin/metadata/query/SortCriterion.pdsc
116+
[SearchResult]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/SearchResult.java
117+
[RecordTemplate]: https://github.com/linkedin/rest.li/blob/master/data/src/main/java/com/linkedin/data/template/RecordTemplate.java
118+
[GenericRecord]: https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/generic/GenericRecord.java
119+
[DAO]: https://en.wikipedia.org/wiki/Data_access_object
120+
[Pegasus]: https://linkedin.github.io/rest.li/DATA-Data-Schema-and-Templates
121+
[relationship]: ../what/relationship.md
122+
[entity]: ../what/entity.md
123+
[aspect]: ../what/aspect.md
124+
[GMS]: ../what/gms.md
125+
[Local DAO]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/BaseLocalDAO.java
126+
[Remote DAO]: ../../metadata-dao/src/main/java/com/linkedin/metadata/dao/BaseRemoteDAO.java
127+
[MAE]: ../what/mxe.md#metadata-audit-event-mae
128+
[rest.li]: https://rest.li

docs/what/gma.md

Lines changed: 7 additions & 0 deletions
@@ -1,2 +1,9 @@
11
# What is Generalized Metadata Architecture (GMA)?
22

3+
GMA is the name for DataHub's backend infrastructure. Unlike existing architectures, GMA is designed to efficiently service the three most common query patterns (`document-oriented` queries, `complex` & `graph` queries, and `fulltext search`) with minimal onboarding effort.
5+
6+
GMA also embraces a distributed model, where each team owns, develops and operates their own [metadata services](gms.md), while the metadata are automatically aggregated to populate the central metadata graph and search indexes.
7+
This is made possible by standardizing the metadata models and the access layer.
8+
We strongly believe that GMA can bring tremendous leverage to any team that has a need to store and access metadata.
9+
Moreover, standardizing metadata modeling promotes a model-first approach to development, resulting in a more concise, consistent, and highly connected metadata ecosystem that benefits all DataHub users.

docs/what/gms.md

Lines changed: 7 additions & 1 deletion
@@ -1,2 +1,8 @@
1-
# What is Generalized Metadata Store (GMS)?
1+
# What is Generalized Metadata Service (GMS)?
22

3+
4+
5+
## Central vs Distributed
6+
Any entity, such as `datasets` or `users`, that is [onboarded](../how/entity-onboarding.md) to GMA can be served through a separate microservice.
7+
These microservices are called GMS. Although we have a central GMS ([datahub-gms](../../gms)) which serves all entities,
8+
we could easily have multiple GMS, each responsible for a specific entity, such as `dataset-gms`, `user-gms`, etc.

docs/what/graph.md

Lines changed: 14 additions & 0 deletions
@@ -1 +1,15 @@
11
# What is GMA graph?
2+
3+
All the [entities](entity.md) and [relationships](relationship.md) are stored in a graph database, Neo4j.
4+
The graph always represents the current state of the world and has no direct support for versioning or history.
5+
However, as stated in the [Metadata Modeling](../how/metadata-modelling.md) section,
6+
the graph is merely a derived view of all metadata [aspects](aspect.md) and thus can always be rebuilt directly from historic [MAEs](mxe.md#metadata-audit-event-mae).
7+
Consequently, it is possible to build a specific snapshot of the graph in time by replaying MAEs up to that point.
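
The replay idea can be sketched as a fold over an ordered event log that stops at a cutoff. This is a simplified illustration under stated assumptions: the `Event` class (with a timestamp, a URN, and a single string value) is a hypothetical stand-in rather than the MAE schema, events are assumed to be sorted by timestamp, and the "graph" is reduced to a map from URN to its latest value.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of point-in-time replay; the event shape is an assumption, not the MAE schema.
final class MaeReplay {

  // Simplified stand-in for an MAE: when it happened, which entity, and the new aspect value.
  static final class Event {
    final long timestamp;
    final String urn;
    final String aspectValue;

    Event(long timestamp, String urn, String aspectValue) {
      this.timestamp = timestamp;
      this.urn = urn;
      this.aspectValue = aspectValue;
    }
  }

  private MaeReplay() { }

  // Rebuilds the state as of cutoff (inclusive) by applying events in order; the last write per URN wins.
  static Map<String, String> snapshotAt(List<Event> orderedEvents, long cutoff) {
    Map<String, String> state = new HashMap<>();
    for (Event event : orderedEvents) {
      if (event.timestamp > cutoff) {
        break; // events are assumed to be sorted by timestamp
      }
      state.put(event.urn, event.aspectValue);
    }
    return state;
  }
}
```

Replaying the same log with a later cutoff yields a later snapshot, which is why the graph itself does not need to store history.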
8+
9+
In theory, the system can work with any generic [OLTP](https://en.wikipedia.org/wiki/Online_transaction_processing) graph DB that supports the following operations:
10+
* Dynamic creation, modification, and removal of nodes and edges
* Dynamic attachment of key-value properties to each node and edge
* Transactional partial updates of properties of a specific node or edge
* Fast ID-based retrieval of nodes & edges
* Efficient queries involving both graph traversal and property value filtering
* Efficient bidirectional graph traversal

docs/what/search-index.md

Lines changed: 16 additions & 0 deletions
@@ -1 +1,17 @@
11
# What is GMA search index?
2+
3+
Each [search document](search-document.md) type (or [entity](entity.md) type) will be mapped to an independent search index in Elasticsearch.
4+
Beyond the standard search engine features (analyzer, tokenizer, filter queries, faceting, sharding, etc.),
5+
GMA also supports the following specific features:
6+
* Partial update of indexed documents
7+
* Membership testing on multi-value fields
8+
* Zero downtime switch between indices
9+
10+
Check out [Search DAO](../architecture/metadata-serving.md#search-dao) for search query abstraction in GMA.
11+
12+
## Search Automation (TBD)
13+
14+
We aim to automate index creation, schema evolution, and reindexing so that the team only needs to focus on the search document model and their custom [Index Builder](../architecture/metadata-ingestion.md#search-and-graph-index-builders) logic.
15+
As the logic changes, a new version of the index will be created and populated from historic MAEs.
16+
Once it’s fully populated, the team can switch to the new version through a simple config change from their [GMS](gms.md).
17+
They can also roll back to an older version of the index whenever needed.
