Update harlequin-databricks docs with latest (#97)
connection methods and recommendations
alexmalins authored Sep 3, 2024
1 parent 9210923 commit 66af2fd
Showing 1 changed file with 64 additions and 23 deletions.
87 changes: 64 additions & 23 deletions src/docs/databricks/index.md
@@ -53,51 +53,92 @@ pipx install harlequin[databricks]

## Usage and Configuration

For a minimum connection you are going to need:
To connect to Databricks, provide the following as CLI arguments:

- server-hostname
- http-path
- access-token
- credentials for one of the following authentication methods:
  - a personal access token (PAT)
  - a username and password
  - an OAuth U2M auth type (`databricks-oauth` or `azure-oauth`)
  - a service principal client ID and secret for OAuth M2M


### Personal Access Token (PAT) authentication:

```bash
harlequin -a databricks --server-hostname my_databricks.cloud.databricks.com --http-path /sql/1.0/endpoints/1234567890abcdef --access-token dabpi***
harlequin -a databricks --server-hostname ***.cloud.databricks.com --http-path /sql/1.0/endpoints/*** --access-token dabpi***
```

Authentication is also possible using a username and password (known as basic authentication):
### Username and password (basic) authentication:

```bash
harlequin -a databricks --server-hostname my_databricks.cloud.databricks.com --http-path /sql/1.0/endpoints/1234567890abcdef --username my_user --password my_pass
harlequin -a databricks --server-hostname ***.cloud.databricks.com --http-path /sql/1.0/endpoints/*** --username *** --password ***
```

Or by using [OAuth user-to-machine (U2M) authentication](https://docs.databricks.com/en/dev-tools/python-sql-connector.html#auth-u2m):
### OAuth U2M authentication:

For [OAuth user-to-machine (U2M) authentication](https://docs.databricks.com/en/dev-tools/python-sql-connector.html#auth-u2m)
supply either `databricks-oauth` or `azure-oauth` to the `--auth-type` CLI argument:

```bash
harlequin -a databricks --server-hostname my_databricks.cloud.databricks.com --http-path /sql/1.0/endpoints/1234567890abcdef --auth-type databricks-oauth
harlequin -a databricks --server-hostname ***.cloud.databricks.com --http-path /sql/1.0/endpoints/*** --auth-type databricks-oauth
```

For more details on command line options, run:
### OAuth M2M authentication:

For [OAuth machine-to-machine (M2M) authentication](https://docs.databricks.com/en/dev-tools/python-sql-connector.html#oauth-machine-to-machine-m2m-authentication)
you need to install [databricks-sdk](https://github.com/databricks/databricks-sdk-py), an optional
dependency of `harlequin-databricks`, with `pip install databricks-sdk`, and supply the
`--client-id` and `--client-secret` CLI arguments:

```bash
harlequin --help
harlequin -a databricks --server-hostname ***.cloud.databricks.com --http-path /sql/1.0/endpoints/*** --client-id *** --client-secret ***
```

## Using Unity Catalog and experiencing slow legacy `hive_metastore` indexing?
## Store an alias for your connection string

We recommend you include an alias for your connection string in your `.bash_profile`/`.zprofile` so
you can launch harlequin-databricks with a short command like `hdb` each time.

Run this command (once) to create the alias:

```bash
echo 'alias hdb="harlequin -a databricks --server-hostname ***.cloud.databricks.com --http-path /sql/1.0/endpoints/1234567890abcdef --access-token dabpi***"' >> ~/.bash_profile
```
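
As an alternative to an alias, the connection details can be kept out of the profile file itself. The following is only a sketch: it defines a shell function that reads them from environment variables. The variable names `DATABRICKS_SERVER_HOSTNAME`, `DATABRICKS_HTTP_PATH` and `DATABRICKS_TOKEN` are assumptions made for this example (harlequin-databricks is not claimed to read them itself); only the `harlequin` flags come from the sections above.

```bash
# Sketch only: the three variable names below are hypothetical and must be
# exported elsewhere (for example by a secrets manager or a private env file).
hdb() {
    if [ -z "${DATABRICKS_SERVER_HOSTNAME:-}" ] ||
       [ -z "${DATABRICKS_HTTP_PATH:-}" ] ||
       [ -z "${DATABRICKS_TOKEN:-}" ]; then
        echo "hdb: set DATABRICKS_SERVER_HOSTNAME, DATABRICKS_HTTP_PATH and DATABRICKS_TOKEN first" >&2
        return 1
    fi
    # Pass the values to the documented PAT flags; "$@" forwards any extra
    # arguments given to hdb on to harlequin.
    harlequin -a databricks \
        --server-hostname "$DATABRICKS_SERVER_HOSTNAME" \
        --http-path "$DATABRICKS_HTTP_PATH" \
        --access-token "$DATABRICKS_TOKEN" \
        "$@"
}
```

With the function defined in your profile, `hdb` launches Harlequin with the PAT connection, and extra flags can be appended, for example `hdb --skip-legacy-indexing`.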

Indexing legacy metastores is slow on Databricks because it requires a SQL call for every table in
the legacy metastore to extract column metadata. This means refreshing Harlequin's Data Catalog
pane takes a long time for Databricks instances with lots of tables in legacy metastores like
`hive_metastore`.
## Using Unity Catalog and want fast Data Catalog indexing?

If your Databricks instance runs Unity Catalog, and you only want the Unity Catalog assets
listed in the Data Catalog pane, supply the `--skip-legacy-indexing` CLI flag when loading
Harlequin.
Supply the `--skip-legacy-indexing` command line flag if you do not care about legacy metastores
(e.g. `hive_metastore`) being indexed in Harlequin's Data Catalog pane.

This flag means only Unity Catalogs will be indexed - legacy metastores will not appear.
This flag will skip indexing of old non-Unity Catalog metastores (i.e. they won't appear in the
Data Catalog pane with this flag).

Because of the way legacy Databricks metastores work, a separate SQL query is required to fetch
the metadata of each table in a legacy metastore. This means indexing them for Harlequin's Data Catalog pane is slow.

Databricks's Unity Catalog upgrade brought
[Information Schema](https://docs.databricks.com/en/sql/language-manual/sql-ref-information-schema.html),
which allows harlequin-databricks to fetch metadata for all Unity Catalog assets with only two SQL queries.

So if your Databricks instance is running Unity Catalog, and you no longer care about the legacy
metastores, setting the `--skip-legacy-indexing` CLI flag is recommended as it will mean
much faster indexing & refreshing of the assets in the Data Catalog pane.
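
As an illustration, the flag can be appended to any of the connection commands shown earlier. The sketch below defines an alias for a PAT connection with the flag set. The alias name `hdb_fast` is hypothetical, and the `***` placeholders stand for your own values.

```bash
# Hypothetical alias name; replace the *** placeholders with your own values.
alias hdb_fast='harlequin -a databricks --server-hostname ***.cloud.databricks.com --http-path /sql/1.0/endpoints/*** --access-token dabpi*** --skip-legacy-indexing'
```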

## Other CLI options:

For more details on command line options, run:

```bash
harlequin --help
```

Indexing Unity Catalogs is a super-fast operation requiring Harlequin to send only two SQL queries
to Databricks because of
[Information Schema](https://docs.databricks.com/en/sql/language-manual/sql-ref-information-schema.html).
## Issues, Contributions and Feature Requests

## Issues and Contributing
Please report bugs/issues with the harlequin-databricks adapter via its GitHub
[issues](https://github.com/alexmalins/harlequin-databricks/issues) page. You are welcome to
attempt fixes yourself by forking that repo then opening a [PR](https://github.com/alexmalins/harlequin-databricks/pulls).

Head over to the [alexmalins/harlequin-databricks](https://github.com/alexmalins/harlequin-databricks/) repo on GitHub.
For feature suggestions, please post in the harlequin-databricks repo
[discussions](https://github.com/alexmalins/harlequin-databricks/discussions).
