feat(ingestion) Ingest Tags from s3 bucket on an AWS Glue job and S3 Data Lake Ingest Job #4689

Merged 37 commits on Apr 29, 2022
Changes from 34 commits
e1af525
begin work on tags
Jiafi Apr 18, 2022
e576aa0
working ingest from s3 and updated unit tests
Jiafi Apr 19, 2022
585d4ea
working s3 data lake ingest bucket tags
Jiafi Apr 19, 2022
9956022
cleanup
Jiafi Apr 19, 2022
cc00805
updated source docs for glue and s3_data_lake
Jiafi Apr 19, 2022
06e7750
Merge branch 'master' into jordan/tags
Jiafi Apr 19, 2022
ae629df
isort and fix test failure caused by config validator
Jiafi Apr 19, 2022
6761c77
add flake8 ignore for complexity and fix unit tests
Jiafi Apr 20, 2022
a0dd95f
Merge branch 'master' into jordan/tags
Jiafi Apr 20, 2022
fe98c4d
add object tagging functionality
Jiafi Apr 20, 2022
90e07d4
update unit tests to test for object tags
Jiafi Apr 20, 2022
3e13e6b
test cleanup
Jiafi Apr 20, 2022
b3a716d
Merge branch 'master' into jordan/tags
Jiafi Apr 21, 2022
6c31b37
fix bug in s3 ingest. Needed to use make_tag_urn
Jiafi Apr 21, 2022
bddf131
Merge branch 'jordan/tags' of github.com:Jiafi/datahub into jordan/tags
Jiafi Apr 21, 2022
7fa7bee
forgot make_tag_urn on the object
Jiafi Apr 21, 2022
f6dadb9
fix bug with pushing two GlobalTags at the same time. update tests f…
Jiafi Apr 21, 2022
b833ccb
Merge branch 'master' into jordan/tags
Jiafi Apr 21, 2022
3d060b5
fix typo
Jiafi Apr 25, 2022
eab89e3
Merge branch 'jordan/tags' of github.com:Jiafi/datahub into jordan/tags
Jiafi Apr 25, 2022
ae9e432
Merge branch 'master' into jordan/tags
Jiafi Apr 25, 2022
13b6f50
Merge branch 'master' into jordan/tags
Jiafi Apr 25, 2022
3262cf5
Merge branch 'master' into jordan/tags
Jiafi Apr 25, 2022
5ad5835
add some errors for poor configuration. Do not push tags as an aspec…
Jiafi Apr 26, 2022
f76e226
Merge branch 'jordan/tags' of github.com:Jiafi/datahub into jordan/tags
Jiafi Apr 26, 2022
c788f62
Merge branch 'master' into jordan/tags
Jiafi Apr 26, 2022
79376f2
lintFIx
Jiafi Apr 26, 2022
784e64d
instead of raising an exception, log that it could not grab the curre…
Jiafi Apr 26, 2022
6090ca1
Merge branch 'master' into jordan/tags
Jiafi Apr 26, 2022
541cdd6
remove similar exception from aws glue
Jiafi Apr 26, 2022
ee19426
Merge branch 'master' into jordan/tags
Jiafi Apr 27, 2022
de063bf
make key optional and only apply bucket tags when key is not None
Jiafi Apr 27, 2022
683d5e4
lintFix
Jiafi Apr 27, 2022
6096d75
change 2 logs when cant connect to datahub api to warns from debug
Jiafi Apr 27, 2022
85d2ebf
add error handling and warns for when no tags exist on a bucket and o…
Jiafi Apr 28, 2022
7668a6f
Merge branch 'master' into jordan/tags
Jiafi Apr 28, 2022
544ea97
Merge branch 'master' into jordan/tags
Jiafi Apr 28, 2022
91 changes: 45 additions & 46 deletions metadata-ingestion/source_docs/glue.md
@@ -17,9 +17,9 @@ This plugin extracts the following:
- Table metadata, such as owner, description and parameters
- Jobs and their component transformations, data sources, and data sinks

| Capability | Status | Details |
| -----------| ------ | ---- |
| Platform Instance | ✔ | [link](../../docs/platform-instances.md) |
| Capability | Status | Details |
| ----------------- | ------ | ---------------------------------------- |
| Platform Instance | ✔ | [link](../../docs/platform-instances.md) |
| Data Containers | ✔️ | |
| Data Domains | ✔️ | [link](../../docs/domains.md) |

@@ -41,31 +41,28 @@ sink:
```

## IAM permissions

For ingesting datasets, the following IAM permissions are required:

```json
{
"Effect": "Allow",
"Action": [
"glue:GetDatabases",
"glue:GetTables"
],
"Resource": [
"arn:aws:glue:$region-id:$account-id:catalog",
"arn:aws:glue:$region-id:$account-id:database/*",
"arn:aws:glue:$region-id:$account-id:table/*"
]
"Effect": "Allow",
"Action": ["glue:GetDatabases", "glue:GetTables"],
"Resource": [
"arn:aws:glue:$region-id:$account-id:catalog",
"arn:aws:glue:$region-id:$account-id:database/*",
"arn:aws:glue:$region-id:$account-id:table/*"
]
}
```

For ingesting jobs (`extract_transforms: True`), the following additional permissions are required:

```json
{
"Effect": "Allow",
"Action": [
"glue:GetDataflowGraph",
"glue:GetJobs",
],
"Resource": "*"
"Effect": "Allow",
"Action": ["glue:GetDataflowGraph", "glue:GetJobs"],
"Resource": "*"
}
```

@@ -75,33 +72,35 @@ plus `s3:GetObject` for the job script locations.

Note that a `.` is used to denote nested fields in the YAML recipe.

| Field | Required | Default | Description |
|---------------------------------|----------|--------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `aws_region` | ✅ | | AWS region code. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `aws_access_key_id` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `aws_secret_access_key` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `aws_session_token` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `aws_role` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `aws_profile` | | | Named AWS profile to use, if not set the default will be used |
| `extract_transforms` | | `True` | Whether to extract Glue transform jobs. |
| `database_pattern.allow` | | | List of regex patterns for databases to include in ingestion. |
| `database_pattern.deny` | | | List of regex patterns for databases to exclude from ingestion. |
| `database_pattern.ignoreCase` | | `True` | Whether to ignore case sensitivity during pattern matching. |
| `table_pattern.allow` | | | List of regex patterns for tables to include in ingestion. |
| `table_pattern.deny` | | | List of regex patterns for tables to exclude from ingestion. |
| `table_pattern.ignoreCase` | | `True` | Whether to ignore case sensitivity during pattern matching. |
| `platform` | | `glue` | Override for platform name. Allowed values - `glue`, `athena` |
| `platform_instance` | | None | The Platform instance to use while constructing URNs. |
| `underlying_platform` | | `glue` | @deprecated(Use `platform`) Override for platform name. Allowed values - `glue`, `athena` |
| `ignore_unsupported_connectors` | | `True` | Whether to ignore unsupported connectors. If disabled, an error will be raised. |
| `emit_s3_lineage` | | `True` | Whether to emit S3-to-Glue lineage. |
| `glue_s3_lineage_direction` | | `upstream` | If `upstream`, S3 is upstream to Glue. If `downstream` S3 is downstream to Glue. |
| `extract_owners` | | `True` | When enabled, extracts ownership from Glue directly and overwrites existing owners. When disabled, ownership is left empty for datasets. |
| `domain.domain_key.allow` | | | List of regex patterns for tables to set domain_key domain key (domain_key can be any string like `sales`. There can be multiple domain key specified. |
| `domain.domain_key.deny` | | | List of regex patterns for tables to not assign domain_key. There can be multiple domain key specified. |
| `domain.domain_key.ignoreCase` | | `True` | Whether to ignore case sensitivity during pattern matching.There can be multiple domain key specified. |
| `catalog_id` | | None | The aws account id where the target glue catalog lives. If None, datahub will ingest glue catalog in aws caller's account. |
| Field | Required | Default | Description |
| ------------------------------- | -------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `aws_region` | ✅ | | AWS region code. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `aws_access_key_id` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `aws_secret_access_key` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `aws_session_token` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `aws_role` | | Autodetected | See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `aws_profile` | | | Named AWS profile to use, if not set the default will be used |
| `extract_transforms` | | `True` | Whether to extract Glue transform jobs. |
| `database_pattern.allow` | | | List of regex patterns for databases to include in ingestion. |
| `database_pattern.deny` | | | List of regex patterns for databases to exclude from ingestion. |
| `database_pattern.ignoreCase` | | `True` | Whether to ignore case sensitivity during pattern matching. |
| `table_pattern.allow` | | | List of regex patterns for tables to include in ingestion. |
| `table_pattern.deny` | | | List of regex patterns for tables to exclude from ingestion. |
| `table_pattern.ignoreCase` | | `True` | Whether to ignore case sensitivity during pattern matching. |
| `platform` | | `glue` | Override for platform name. Allowed values - `glue`, `athena` |
| `platform_instance` | | None | The Platform instance to use while constructing URNs. |
| `underlying_platform` | | `glue` | @deprecated(Use `platform`) Override for platform name. Allowed values - `glue`, `athena` |
| `ignore_unsupported_connectors` | | `True` | Whether to ignore unsupported connectors. If disabled, an error will be raised. |
| `emit_s3_lineage` | | `True` | Whether to emit S3-to-Glue lineage. |
| `glue_s3_lineage_direction` | | `upstream` | If `upstream`, S3 is upstream to Glue. If `downstream`, S3 is downstream to Glue. |
| `use_s3_bucket_tags` | | None | Whether to create tags from the S3 bucket's tags for the tables ingested by Glue. Note that this does not apply tags to any folders ingested, only the files. |
| `use_s3_object_tags` | | None | Whether to create tags from the S3 object's tags for the tables ingested by Glue. |
| `extract_owners` | | `True` | When enabled, extracts ownership from Glue directly and overwrites existing owners. When disabled, ownership is left empty for datasets. |
| `domain.domain_key.allow` | | | List of regex patterns for tables to set domain_key domain key (domain_key can be any string like `sales`. There can be multiple domain key specified. |
| `domain.domain_key.deny` | | | List of regex patterns for tables to not assign domain_key. There can be multiple domain key specified. |
| `domain.domain_key.ignoreCase` | | `True` | Whether to ignore case sensitivity during pattern matching.There can be multiple domain key specified. |
| `catalog_id` | | None | The aws account id where the target glue catalog lives. If None, datahub will ingest glue catalog in aws caller's account. |
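
As a rough illustration of what the two tag options might do, the sketch below turns an S3 `TagSet` (the list of `{"Key": ..., "Value": ...}` entries that boto3 returns from `get_bucket_tagging` and `get_object_tagging`) into DataHub tag URNs and merges bucket and object tags into a single list. The helper names `tag_set_to_urns` and `merge_tag_urns` are hypothetical, the `key:value` naming of tags is an assumption, and `make_tag_urn` here is a local stand-in for the helper of the same name mentioned in the commit messages. It is not the implementation in this PR.

```python
from typing import Dict, List


def make_tag_urn(tag: str) -> str:
    # Local stand-in (assumption): builds a URN of the form "urn:li:tag:<name>".
    return f"urn:li:tag:{tag}"


def tag_set_to_urns(tag_set: List[Dict[str, str]]) -> List[str]:
    """Convert an S3 TagSet into a list of tag URNs."""
    urns = []
    for tag in tag_set:
        # Assumption: the tag name joins the S3 key and value as "key:value".
        urns.append(make_tag_urn(f"{tag['Key']}:{tag['Value']}"))
    return urns


def merge_tag_urns(bucket_urns: List[str], object_urns: List[str]) -> List[str]:
    """Combine bucket and object tag URNs, dropping duplicates but keeping order."""
    return list(dict.fromkeys([*bucket_urns, *object_urns]))


# Example input shaped like the "TagSet" field of a boto3 tagging response.
bucket_tags = tag_set_to_urns([{"Key": "team", "Value": "data"}])
object_tags = tag_set_to_urns(
    [{"Key": "team", "Value": "data"}, {"Key": "pii", "Value": "false"}]
)
all_tags = merge_tag_urns(bucket_tags, object_tags)
```

Merging into one list before emitting reflects the commit note about not pushing two sets of global tags at the same time; whether the source deduplicates in exactly this way is not confirmed by the diff.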

### Cross-account ingestion
