feat(cli): quickstart - experimental support for backup restore (#5418)
shirshanka authored Jul 25, 2022
1 parent 86012fd commit 941770f
Showing 7 changed files with 300 additions and 4 deletions.
1 change: 1 addition & 0 deletions docker/mysql-setup/init.sql
@@ -16,6 +16,7 @@ create table if not exists metadata_aspect_v2 (
);

-- create default records for datahub user if not exists
DROP TABLE if exists temp_metadata_aspect_v2;
CREATE TABLE temp_metadata_aspect_v2 LIKE metadata_aspect_v2;
INSERT INTO temp_metadata_aspect_v2 (urn, aspect, version, metadata, createdon, createdby) VALUES(
'urn:li:corpuser:datahub',
@@ -127,7 +127,7 @@ services:
hostname: kafka-setup
image: linkedin/datahub-kafka-setup:${DATAHUB_VERSION:-head}
mysql:
- command: --character-set-server=utf8mb4 --collation-server=utf8mb4_bin
+ command: --character-set-server=utf8mb4 --collation-server=utf8mb4_bin --default-authentication-plugin=mysql_native_password
container_name: mysql
environment:
- MYSQL_DATABASE=datahub
@@ -136,6 +136,7 @@ services:
- MYSQL_ROOT_PASSWORD=datahub
hostname: mysql
image: mariadb:10.5.8
# image: mysql:8
ports:
- ${DATAHUB_MAPPED_MYSQL_PORT:-3306}:3306
volumes:
6 changes: 6 additions & 0 deletions docs/how/backup-datahub.md
@@ -1,5 +1,11 @@
# Taking backup of DataHub

## Production

The recommended backup strategy is to periodically dump the table `datahub.metadata_aspect_v2` so that it can be recreated from the dump; most managed DB services (e.g. AWS RDS) support this. Then run [restore indices](./restore-indices.md) to recreate the indices.

To back up Time Series Aspects (which power usage and dataset profiles), you also have to back up Elasticsearch, which is possible via AWS OpenSearch. Otherwise, you will have to reingest dataset profiles from your sources after a disaster.
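
As a rough sketch of the periodic dump described above, the following Python snippet assembles a `mysqldump` invocation for the `metadata_aspect_v2` table. The host, port, user, and output path are placeholders (assumptions, not values prescribed by DataHub), and it assumes `mysqldump` is on the `PATH` with credentials supplied through a MySQL option file. A managed service such as AWS RDS may offer its own snapshot mechanism that is a better fit.

```python
import subprocess
from typing import List


def build_dump_command(host: str, port: int, user: str, out_file: str) -> List[str]:
    """Assemble a mysqldump call for the DataHub aspect table.

    host, port, user and out_file are illustrative placeholders.
    """
    return [
        "mysqldump",
        f"--host={host}",
        f"--port={port}",
        f"--user={user}",
        "--single-transaction",  # consistent snapshot for InnoDB tables
        f"--result-file={out_file}",
        "datahub",  # database name
        "metadata_aspect_v2",  # table that stores the aspects
    ]


def run_dump(cmd: List[str]) -> None:
    # Raises CalledProcessError if mysqldump exits with a non-zero status.
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(" ".join(build_dump_command("localhost", 3306, "datahub", "aspects.sql")))
```

Scheduling `run_dump` from cron or a similar scheduler would give the periodic behavior; restoring still requires the restore-indices step afterwards.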

## Quickstart

To back up your quickstart, see [this document](../quickstart.md#backing-up-your-datahub-quickstart-experimental).
11 changes: 10 additions & 1 deletion docs/how/restore-indices.md
@@ -7,9 +7,18 @@ When a new version of the aspect gets ingested, GMS initiates an MAE event for t
the search and graph indices. As such, we can fetch the latest version of each aspect in the local database and produce
MAE events corresponding to the aspects to restore the search and graph indices.

## Quickstart

If you're using the quickstart images, you can use the `datahub` CLI to restore indices.

```
datahub docker quickstart --restore-indices
```
See [this section](../quickstart.md#restoring-only-the-index-use-with-care) for more information.

## Docker-compose

- Run the following command from root to send MAE for each aspect in the Local DB.
+ If you are on a custom docker-compose deployment, run the following command (you need to check out [the source repository](https://github.com/datahub-project/datahub)) from the root of the repo to send MAE for each aspect in the Local DB.

```
./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices
47 changes: 46 additions & 1 deletion docs/quickstart.md
@@ -150,14 +150,59 @@ To stop DataHub's quickstart, you can issue the following command.
datahub docker quickstart --stop
```

- ### Resetting DataHub
+ ### Resetting DataHub (a.k.a factory reset)

To cleanse DataHub of all of its state (e.g. before ingesting your own), you can use the CLI `nuke` command.

```
datahub docker nuke
```

### Backing up your DataHub Quickstart (experimental)

The quickstart image is not recommended for use as a production instance. See [Moving to production](#move-to-production) for recommendations on setting up your production cluster. However, if you want to back up your current quickstart state (e.g. you have a company demo coming up and want a copy of the quickstart data that you can restore later), you can supply the `--backup` flag to quickstart.
```
datahub docker quickstart --backup
```
This command takes a backup of your MySQL image and writes it by default to your `~/.datahub/quickstart/` directory as the file `backup.sql`. You can customize this by passing a `--backup-file` argument.
For example:
```
datahub docker quickstart --backup --backup-file /home/my_user/datahub_backups/quickstart_backup_2002_22_01.sql
```
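
If you take backups regularly, a date-stamped file name keeps them from overwriting each other. The snippet below is a minimal sketch of that idea: the `backup_command` helper, the directory, and the file-name pattern are illustrative assumptions, and only the `--backup` and `--backup-file` flags come from the text above.

```python
from datetime import date
from pathlib import Path
from typing import List


def backup_command(backup_dir: str, day: date) -> List[str]:
    """Build the quickstart backup command with a date-stamped file name.

    backup_dir and the file-name pattern are hypothetical choices.
    """
    backup_file = Path(backup_dir) / f"quickstart_backup_{day:%Y_%m_%d}.sql"
    return [
        "datahub", "docker", "quickstart",
        "--backup",
        "--backup-file", str(backup_file),
    ]


if __name__ == "__main__":
    # Print the command for today's date; pass the list to subprocess.run to execute it.
    print(" ".join(backup_command("/home/my_user/datahub_backups", date.today())))
```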
:::note

Note that the Quickstart backup does not include any timeseries data (dataset statistics, profiles, etc.), so you will lose that information if you delete all your indexes and restore from this backup.

:::

### Restoring your DataHub Quickstart (experimental)
As you might imagine, these backups can be restored. The following sections describe the different options you have for restoring your backup.

#### Restoring a backup (primary + index) [most common]
To restore a previous backup, run the following command:
```
datahub docker quickstart --restore
```
This command will pick up the `backup.sql` file located under `~/.datahub/quickstart` and use it to restore your primary database as well as the Elasticsearch indexes.

To supply a specific backup file, use the `--restore-file` option.
```
datahub docker quickstart --restore --restore-file /home/my_user/datahub_backups/quickstart_backup_2002_22_01.sql
```

#### Restoring only the index [to deal with index out of sync / corruption issues]
Another situation that can come up is that the index becomes corrupt or misses some updates. To re-bootstrap the index from the primary store, run the following command, which syncs the index with the primary store.
```
datahub docker quickstart --restore-indices
```

#### Restoring a backup (primary but NO index) [rarely used]
Sometimes, you might want to just restore the state of your primary database (MySQL), but not re-index the data. To do this, you have to explicitly disable the restore-indices capability.

```
datahub docker quickstart --restore --no-restore-indices
```
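
The three restore scenarios above differ only in which flags are passed. As a sketch, the hypothetical `restore_command` helper below maps each scenario to its command line; the function and its parameter names are assumptions made for illustration, while the flags themselves are the ones documented in this section.

```python
from typing import List, Optional


def restore_command(
    primary: bool = True,
    indices: bool = True,
    restore_file: Optional[str] = None,
) -> List[str]:
    """Map a restore scenario to the quickstart flags described above."""
    base = ["datahub", "docker", "quickstart"]
    if primary:
        cmd = base + ["--restore"]
        if restore_file is not None:
            cmd += ["--restore-file", restore_file]
        if not indices:
            # Restore MySQL only and skip re-indexing.
            cmd.append("--no-restore-indices")
        return cmd
    if restore_file is not None:
        raise ValueError("a restore file only applies when restoring the primary store")
    if indices:
        # Rebuild the index from the existing primary store.
        return base + ["--restore-indices"]
    raise ValueError("nothing to restore: enable primary, indices, or both")


if __name__ == "__main__":
    print(" ".join(restore_command()))
    print(" ".join(restore_command(primary=False)))
    print(" ".join(restore_command(indices=False)))
```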


### Upgrading your local DataHub

If you have been testing DataHub locally and a new version of DataHub has been released that you want to try, you can just issue the quickstart command again. It will pull down newer images and restart your instance without losing any data.
2 changes: 1 addition & 1 deletion metadata-ingestion/setup.cfg
@@ -1,5 +1,5 @@
[flake8]
- max-complexity = 15
+ max-complexity = 20
ignore =
# Ignore: line length issues, since black's formatter will take care of them.
E501,