Commit 2152d30

yoonhyejin authored and hsheth2 committed

docs: merge cli guide (datahub-project#10464)

Co-authored-by: Harshal Sheth <[email protected]>

1 parent 93eeefc · commit 2152d30

File tree: 5 files changed, +75 −142 lines changed

docs-website/sidebars.js

Lines changed: 0 additions & 5 deletions

@@ -219,11 +219,6 @@ module.exports = {
         id: "docs/managed-datahub/approval-workflows",
         className: "saasOnly",
       },
-      {
-        "Metadata Ingestion With Acryl": [
-          "docs/managed-datahub/metadata-ingestion-with-acryl/ingestion",
-        ],
-      },
       {
         "DataHub API": [
           {

docs/components.md

Lines changed: 1 addition & 1 deletion

@@ -38,7 +38,7 @@ either Kafka or using the Metadata Store Rest APIs directly. DataHub supports an
 a host of capabilities including schema extraction, table & column profiling, usage information extraction, and more.
 
 Getting started with the Ingestion Framework is as simple: just define a YAML file and execute the `datahub ingest` command.
-Learn more by heading over the the [Metadata Ingestion](https://datahubproject.io/docs/metadata-ingestion/) guide.
+Learn more by heading over the [Metadata Ingestion](https://datahubproject.io/docs/metadata-ingestion/) guide.
 
 ## GraphQL API
 
docs/managed-datahub/metadata-ingestion-with-acryl/ingestion.md

Lines changed: 0 additions & 116 deletions
This file was deleted.

docs/managed-datahub/welcome-acryl.md

Lines changed: 1 addition & 1 deletion

@@ -49,7 +49,7 @@ Acryl DataHub employs a push-based metadata ingestion model. In practice, this m
 
 This approach comes with another benefit: security. By managing your own instance of the agent, you can keep the secrets and credentials within your walled garden. Skip uploading secrets & keys into a third-party cloud tool.
 
-To push metadata into DataHub, Acryl provide's an ingestion framework written in Python. Typically, push jobs are run on a schedule at an interval of your choosing. For our step-by-step guide on ingestion, click [here](docs/managed-datahub/metadata-ingestion-with-acryl/ingestion.md).
+To push metadata into DataHub, Acryl provides an ingestion framework written in Python. Typically, push jobs are run on a schedule at an interval of your choosing. For our step-by-step guide on ingestion, click [here](../../metadata-ingestion/cli-ingestion.md).
 
 ### Discovering Metadata
 
Lines changed: 73 additions & 19 deletions

@@ -1,56 +1,108 @@
 # CLI Ingestion
 
-## Installing the CLI
+Batch ingestion involves extracting metadata from a source system in bulk. Typically, this happens on a predefined schedule using the [Metadata Ingestion](../docs/components.md#ingestion-framework) framework.
+The metadata that is extracted includes point-in-time instances of dataset, chart, dashboard, pipeline, user, group, usage, and task metadata.
 
-Make sure you have installed DataHub CLI before following this guide.
+## Installing DataHub CLI
 
-```shell
-# Requires Python 3.8+
+:::note Required Python Version
+Installing DataHub CLI requires Python 3.6+.
+:::
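Since the added note pins a minimum Python version, a quick interpreter self-check before installing can save a confusing pip failure. This snippet is an editorial illustration, not part of the commit:

```shell
# Fail fast (non-zero exit) if the interpreter is older than the stated minimum.
python3 -c 'import sys; assert sys.version_info >= (3, 6), sys.version'
echo "Python version OK: $(python3 -c 'import sys; print(sys.version.split()[0])')"
```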
+
+Run the following commands in your terminal:
+
+```
 python3 -m pip install --upgrade pip wheel setuptools
 python3 -m pip install --upgrade acryl-datahub
-# validate that the install was successful
-datahub version
-# If you see "command not found", try running this instead: python3 -m datahub version
+python3 -m datahub version
 ```
 
+Your command line should return the proper version of DataHub upon executing these commands successfully.
+
+
 Check out the [CLI Installation Guide](../docs/cli.md#installation) for more installation options and troubleshooting tips.
 
-After that, install the required plugin for the ingestion.
+
+## Installing Connector Plugins
+
+Our CLI follows a plugin architecture. You must install connectors for different data sources individually.
+For a list of all supported data sources, see [the open source docs](../docs/cli.md#sources).
+Once you've found the connectors you care about, simply install them using `pip install`.
+For example, to install the `mysql` connector, you can run
 
 ```shell
-pip install 'acryl-datahub[datahub-rest]' # install the required plugin
+pip install --upgrade 'acryl-datahub[mysql]'
 ```
-
 Check out the [alternative installation options](../docs/cli.md#alternate-installation-options) for more reference.
 
 ## Configuring a Recipe
 
-Create a `recipe.yml` file that defines the source and sink for metadata, as shown below.
+Create a [Recipe](recipe_overview.md) YAML file that defines the source and sink for metadata, as shown below.
 
 ```yaml
-# recipe.yml
+# example-recipe.yml
+
+# MySQL source configuration
 source:
-  type: <source_name>
+  type: mysql
   config:
-    option_1: <value>
-    ...
+    username: root
+    password: password
+    host_port: localhost:3306
 
+# Recipe sink configuration.
 sink:
-  type: <sink_type_name>
+  type: "datahub-rest"
   config:
-    ...
+    server: "https://<your domain name>.acryl.io/gms"
+    token: <Your API key>
 ```
+The **source** configuration block defines where to extract metadata from. This can be an OLTP database system, a data warehouse, or something as simple as a file. Each source has custom configuration depending on what is required to access metadata from the source. To see the configurations required for each supported source, refer to the [Sources](source_overview.md) documentation.
+
+The **sink** configuration block defines where to push metadata into. Each sink type requires specific configuration, detailed in the [Sinks](sink_overview.md) documentation.
+
+To configure your instance of DataHub as the destination for ingestion, set the "server" field of your recipe to point to your Acryl instance's domain suffixed by the path `/gms`, as shown above.
+A complete example of a DataHub recipe file, which reads from MySQL and writes into a DataHub instance, is shown above.
 
 For more information and examples on configuring recipes, please refer to [Recipes](recipe_overview.md).
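When iterating on a recipe, it can help to write to a local file and inspect the output before pushing to a live server. The sketch below swaps in DataHub's `file` sink; the filename and this debugging workflow are editorial assumptions to verify against the Sinks documentation, not part of the commit:

```yaml
# test-recipe.yml -- hypothetical local-debugging variant of the recipe above
source:
  type: mysql
  config:
    username: root
    password: password
    host_port: localhost:3306

# Write metadata events to a local JSON file instead of a DataHub instance,
# so the extracted metadata can be reviewed before real ingestion.
sink:
  type: file
  config:
    filename: ./mysql_metadata.json
```

Running `datahub ingest -c test-recipe.yml` then produces a reviewable file rather than mutating your DataHub instance.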
 
-## Ingesting Metadata
 
-You can run ingestion using `datahub ingest` like below.
+### Using Recipes with Authentication
+In Acryl DataHub deployments, only the `datahub-rest` sink is supported, which simply means that metadata will be pushed to the REST endpoints exposed by your DataHub instance. The required configurations for this sink are
+
+1. **server**: the location of the REST API exposed by your instance of DataHub
+2. **token**: a unique API key used to authenticate requests to your instance's REST API
+
+The token can be retrieved by logging in as an admin: go to the Settings page and generate a Personal Access Token with your desired expiration date.
+
+<p align="center">
+  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/saas/home-(1).png"/>
+</p>
+
+<p align="center">
+  <img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/saas/settings.png"/>
+</p>
+
+:::info Secure Your API Key
+Please keep your API key secure and avoid sharing it.
+If you are on Acryl Cloud and your key is compromised for any reason, please reach out to the Acryl team at [email protected].
+:::
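One practical way to honor the warning above is to keep the token out of the recipe file entirely: DataHub recipes support `${VAR}`-style environment-variable substitution. The variable name below is an illustrative choice, not something the commit prescribes:

```yaml
# Hedged sketch: read the API key from the environment at ingestion time
# instead of committing it to the recipe file.
sink:
  type: "datahub-rest"
  config:
    server: "https://<your domain name>.acryl.io/gms"
    token: ${DATAHUB_API_KEY}  # export DATAHUB_API_KEY=... before running
```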
 
+
+## Ingesting Metadata
+The final step is invoking the DataHub CLI to ingest metadata based on your recipe configuration file.
+To do so, simply run `datahub ingest` with a pointer to your YAML recipe file:
 ```shell
 datahub ingest -c <path/to/recipe.yml>
 ```
 
+## Scheduling Ingestion
+
+Ingestion can either be run in an ad-hoc manner by a system administrator or scheduled for repeated executions. Most commonly, ingestion will be run on a daily cadence.
+To schedule your ingestion job, we recommend using a job scheduler such as [Apache Airflow](https://airflow.apache.org/). For simpler deployments, a cron job on an always-up machine can also work.
+Note that each source system will require a separate recipe file. This allows you to schedule ingestion from different sources independently or together.
+Learn more about scheduling ingestion in the [Scheduling Ingestion Guide](/metadata-ingestion/schedule_docs/intro.md).
+
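As a concrete illustration of the cron option mentioned above, a crontab entry could run a recipe nightly. The binary path, recipe path, and log location are editorial assumptions, not part of the commit:

```
# Hypothetical crontab entry: run the recipe every day at 02:00 and append
# the CLI's output to a log file for later inspection.
0 2 * * * /usr/local/bin/datahub ingest -c /opt/recipes/example-recipe.yml >> /var/log/datahub-ingest.log 2>&1
```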
 ## Reference
 
 Please refer to the following pages for advanced guides on CLI ingestion.

@@ -59,8 +111,10 @@ Please refer to the following pages for advanced guides on CLI ingestion.
 - [UI Ingestion Guide](../docs/ui-ingestion.md)
 
 :::tip Compatibility
+
 DataHub server uses a 3 digit versioning scheme, while the CLI uses a 4 digit scheme. For example, if you're using DataHub server version 0.10.0, you should use CLI version 0.10.0.x, where x is a patch version.
 We do this because we do CLI releases at a much higher frequency than server releases, usually every few days vs twice a month.
 
 For ingestion sources, any breaking changes will be highlighted in the [release notes](../docs/how/updating-datahub.md). When fields are deprecated or otherwise changed, we will try to maintain backwards compatibility for two server releases, which is about 4-6 weeks. The CLI will also print warnings whenever deprecated options are used.
 :::
+