feat(ingest): add snowflake-queries source #10835
Walkthrough

The recent updates to the metadata ingestion module focus on enhancing Snowflake integration. Key changes include adding dependencies for Snowflake queries, refining lineage mapping, and improving schema and usage statistics extraction. Additionally, new entities and test cases have been introduced to support these enhancements, ensuring robust functionality and better handling of external lineage information and query details.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (6)
- metadata-ingestion/setup.py (2 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_lineage_v2.py (8 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (1 hunks)
- metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py (21 hunks)
- metadata-models/src/main/pegasus/com/linkedin/query/QueryUsageStatistics.pdl (1 hunks)
- metadata-models/src/main/resources/entity-registry.yml (1 hunks)
Additional comments not posted (33)
metadata-models/src/main/pegasus/com/linkedin/query/QueryUsageStatistics.pdl (5)
18-18: Well-documented field: `queryCount`.
The field `queryCount` is well-documented and includes a `TimeseriesField` annotation for time series data.
24-24: Well-documented field: `queryCost`.
The field `queryCost` is well-documented and includes a `TimeseriesField` annotation for time series data.
30-30: Well-documented field: `lastExecutedAt`.
The field `lastExecutedAt` is well-documented and includes a `TimeseriesField` annotation for time series data.
36-36: Well-documented field: `uniqueUserCount`.
The field `uniqueUserCount` is well-documented and includes a `TimeseriesField` annotation for time series data.
42-42: Well-documented field: `userCounts`.
The field `userCounts` is well-documented and includes a `TimeseriesFieldCollection` annotation for time series data collection.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (9)
49-69: Good use of configuration mixins and default values.
The `SnowflakeQueriesConfig` class effectively uses configuration mixins and provides sensible default values for its fields.
73-77: Well-structured report class.
The `SnowflakeQueriesReport` class is well-structured and extends `SourceReport`.
80-107: Comprehensive initialization.
The `__init__` method provides a comprehensive initialization of the `SnowflakeQueriesSource` class, setting up the context, configuration, report, and aggregator.
108-112: Factory method for creating instances.
The `create` method is a factory method that parses configuration and creates an instance of `SnowflakeQueriesSource`.
113-123: Efficient use of cached property for local temp path.
The `local_temp_path` method efficiently uses the `cached_property` decorator to manage the local temporary path.
124-151: Efficient handling of audit log and work units.
The `get_workunits_internal` method efficiently handles the audit log and generates metadata work units.
152-196: Detailed method for fetching audit log.
The `fetch_audit_log` method is detailed and includes TODO comments for future enhancements.
202-293: Comprehensive audit log response parsing.
The `_parse_audit_log_response` method provides comprehensive parsing of audit log responses, converting them into `PreparsedQuery` objects.
295-296: Simple method for retrieving the report.
The `get_report` method is simple and straightforward, returning the `SnowflakeQueriesReport` instance.
metadata-models/src/main/resources/entity-registry.yml (1)
507-507: Correctly added `queryUsageStatistics` aspect.
The `queryUsageStatistics` aspect has been correctly added to the list of aspects for the `query` entity.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_lineage_v2.py (6)
33-33: Correctly imported `KnownLineageMapping`.
The `KnownLineageMapping` class has been correctly imported from `datahub.sql_parsing.sql_parsing_aggregator`.
268-271: Correctly updated method to return `Iterable[KnownLineageMapping]`.
The `_populate_external_lineage_from_copy_history` method has been correctly updated to return an iterable of `KnownLineageMapping` objects.
277-280: Correctly updated method to return `Iterable[KnownLineageMapping]`.
The `_populate_external_lineage_from_show_query` method has been correctly updated to return an iterable of `KnownLineageMapping` objects.
Line range hint 355-371: Correctly updated method to return `Optional[KnownLineageMapping]`.
The `_process_external_lineage_result_row` method has been correctly updated to return an optional `KnownLineageMapping` object.
268-281: Efficient handling of external upstreams.
The `_populate_external_upstreams` method efficiently handles the addition of external upstreams using the updated methods.
268-281: Efficient handling of work units.
The `get_workunits` method efficiently handles the addition of work units using the updated methods.
metadata-ingestion/setup.py (2)
412-412: Addition of `snowflake-queries` plugin dependency looks good.
The new `snowflake-queries` plugin dependency has been added correctly.
665-665: Addition of `snowflake-queries` entry point looks good.
The new entry point for `snowflake-queries` has been added correctly.
metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py (10)
439-450: New Method: Verify the usage of `add` method.
The `add` method is new. Ensure it is used appropriately throughout the code.
17-17: New Imports: Ensure they are necessary and used correctly.
The imports from `datahub.configuration.time_window_config`, `datahub.sql_parsing.sql_parsing_common`, and `datahub.sql_parsing.sqlglot_lineage` are new. Ensure these are necessary and used correctly in the code.
Also applies to: 34-34
Verification successful
Verified: The new imports are necessary and used correctly.
- `get_time_bucket` is used in multiple files, including `sql_parsing_aggregator.py`.
- `QueryTypeProps` is used in `sql_parsing_aggregator.py` and other related files.
- `ColumnLineageInfo` is used extensively in `sql_parsing_aggregator.py` and other files.
Scripts executed
The following scripts were executed for the analysis:
Script:
    #!/bin/bash
    # Description: Verify that the new imports are necessary and used in the code.
    # Test: Search for the usage of the new imports. Expect: Usage of each import in the code.
    rg --type python $'get_time_bucket'
    rg --type python $'QueryTypeProps'
    rg --type python $'ColumnLineageInfo'
Length of output: 198
Script:
    #!/bin/bash
    # Description: Verify that the new imports are necessary and used in the code.
    # Test: Search for the usage of the new imports. Expect: Usage of each import in the code.
    rg --type py 'get_time_bucket'
    rg --type py 'QueryTypeProps'
    rg --type py 'ColumnLineageInfo'
Length of output: 14733
147-169: New Class: Verify the usage of `PreparsedQuery`.
The `PreparsedQuery` class is new. Ensure it is used appropriately throughout the code.
Verification successful
Verified Usage: `PreparsedQuery` class
The `PreparsedQuery` class is used appropriately throughout the codebase, as evidenced by its presence in multiple files.
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py
- metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Scripts executed
The following scripts were executed for the analysis:
Script:
    #!/bin/bash
    # Description: Verify the usage of the new `PreparsedQuery` class.
    # Test: Search for the usage of the `PreparsedQuery` class. Expect: Usage of the class in the code.
    rg --type python $'PreparsedQuery'
Length of output: 64
Script:
    #!/bin/bash
    # Description: Verify the usage of the new `PreparsedQuery` class.
    # Test: Search for the usage of the `PreparsedQuery` class. Expect: Usage of the class in the code.
    rg --type py 'PreparsedQuery'
Length of output: 1100
140-145: New Class: Verify the usage of `KnownLineageMapping`.
The `KnownLineageMapping` class is new. Ensure it is used appropriately throughout the code.
Verification successful
Verified Usage: The `KnownLineageMapping` class is used appropriately throughout the codebase.
The `KnownLineageMapping` class is found in multiple files, indicating proper integration and usage:
- metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_lineage_v2.py
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py
Scripts executed
The following scripts were executed for the analysis:
Script:
    #!/bin/bash
    # Description: Verify the usage of the new `KnownLineageMapping` class.
    # Test: Search for the usage of the `KnownLineageMapping` class. Expect: Usage of the class in the code.
    rg --type python $'KnownLineageMapping'
Length of output: 69
Script:
    #!/bin/bash
    # Description: Verify the usage of the new `KnownLineageMapping` class.
    # Test: Search for the usage of the `KnownLineageMapping` class. Expect: Usage of the class in the code.
    rg 'KnownLineageMapping'
Length of output: 1374
249-249: New Parameters and Attributes: Verify the usage of `generate_query_usage_statistics` and `_query_usage_counts` in `SqlParsingAggregator`.
The `generate_query_usage_statistics` parameter and `_query_usage_counts` attribute are new. Ensure they are used appropriately throughout the code.
Also applies to: 263-263, 270-271, 368-376
Verification successful
New Parameters and Attributes: Verified the usage of `generate_query_usage_statistics` and `_query_usage_counts` in `SqlParsingAggregator`.
The `generate_query_usage_statistics` parameter and `_query_usage_counts` attribute are used appropriately throughout the codebase.
- `generate_query_usage_statistics` is used in `sql_parsing_aggregator.py` and `snowflake_queries.py`.
- `_query_usage_counts` is used in `sql_parsing_aggregator.py`.
Scripts executed
The following scripts were executed for the analysis:
Script:
    #!/bin/bash
    # Description: Verify the usage of the new `generate_query_usage_statistics` parameter and `_query_usage_counts` attribute in `SqlParsingAggregator`.
    # Test: Search for the usage of the `generate_query_usage_statistics` parameter. Expect: Usage of the parameter in the code.
    rg --type python $'generate_query_usage_statistics'
    # Test: Search for the usage of the `_query_usage_counts` attribute. Expect: Usage of the attribute in the code.
    rg --type python $'_query_usage_counts'
Length of output: 151
Script:
    #!/bin/bash
    # Description: Verify the usage of the new `generate_query_usage_statistics` parameter and `_query_usage_counts` attribute in `SqlParsingAggregator`.
    # Test: Search for the usage of the `generate_query_usage_statistics` parameter. Expect: Usage of the parameter in the code.
    rg 'generate_query_usage_statistics'
    # Test: Search for the usage of the `_query_usage_counts` attribute. Expect: Usage of the attribute in the code.
    rg '_query_usage_counts'
Length of output: 1859
202-202: New Attributes: Verify the usage of `num_preparsed_queries` and `num_query_usage_stats_generated` in `SqlAggregatorReport`.
The `num_preparsed_queries` and `num_query_usage_stats_generated` attributes are new. Ensure they are used appropriately throughout the code.
Also applies to: 225-225
Verification successful
New Attributes: Verified usage of `num_preparsed_queries` and `num_query_usage_stats_generated` in `SqlAggregatorReport`.
The attributes `num_preparsed_queries` and `num_query_usage_stats_generated` are used appropriately in the codebase:
- `num_preparsed_queries`: `metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py` lines 202 and 225.
- `num_query_usage_stats_generated`: `metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py` lines 225 and 226.
Scripts executed
The following scripts were executed for the analysis:
Script:
    #!/bin/bash
    # Description: Verify the usage of the new `num_preparsed_queries` and `num_query_usage_stats_generated` attributes in `SqlAggregatorReport`.
    # Test: Search for the usage of the `num_preparsed_queries` attribute. Expect: Usage of the attribute in the code.
    rg --type python $'num_preparsed_queries'
    # Test: Search for the usage of the `num_query_usage_stats_generated` attribute. Expect: Usage of the attribute in the code.
    rg --type python $'num_query_usage_stats_generated'
Length of output: 153
Script:
    #!/bin/bash
    # Description: Verify the usage of the new `num_preparsed_queries` and `num_query_usage_stats_generated` attributes in `SqlAggregatorReport`.
    # Test: Search for the usage of the `num_preparsed_queries` attribute. Expect: Usage of the attribute in the code.
    rg 'num_preparsed_queries'
    # Test: Search for the usage of the `num_query_usage_stats_generated` attribute. Expect: Usage of the attribute in the code.
    rg 'num_query_usage_stats_generated'
Length of output: 535
107-107: New Attribute: Verify the usage of `column_usage` in `QueryMetadata`.
The `column_usage` attribute is new. Ensure it is used appropriately throughout the code.
Verification successful
Verified: The `column_usage` attribute is used appropriately throughout the codebase.
The `column_usage` attribute is integrated and utilized correctly in various parts of the code.
- File: metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
- File: metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py
Scripts executed
The following scripts were executed for the analysis:
Script:
    #!/bin/bash
    # Description: Verify the usage of the new `column_usage` attribute in `QueryMetadata`.
    # Test: Search for the usage of the `column_usage` attribute. Expect: Usage of the attribute in the code.
    rg --type python $'column_usage'
Length of output: 62
Script:
    #!/bin/bash
    # Description: Verify the usage of the new `column_usage` attribute in `QueryMetadata`.
    # Test: Search for the usage of the `column_usage` attribute. Expect: Usage of the attribute in the code.
    rg 'column_usage'
Length of output: 1746
Script:
    #!/bin/bash
    # Description: Extract relevant code segments where `column_usage` is used for manual review.
    # Extract lines containing `column_usage` in the relevant files for manual inspection.
    rg 'column_usage' -A 3 -B 3 metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
    rg 'column_usage' -A 3 -B 3 metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py
Length of output: 4412
1140-1152: New Methods: Verify the usage of `_gen_remaining_queries` and `_gen_query` methods.
The `_gen_remaining_queries` and `_gen_query` methods are new. Ensure they are used appropriately throughout the code.
Also applies to: 1156-1240
Verification successful
New Methods: Verified the usage of `_gen_remaining_queries` and `_gen_query` methods.
The `_gen_remaining_queries` and `_gen_query` methods are defined and used appropriately within `metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py`.
- `_gen_remaining_queries` is defined at line 1140 and used at line 1131.
- `_gen_query` is defined at line 1156 and used at lines 1160 and 1170.
Scripts executed
The following scripts were executed for the analysis:
Script:
    #!/bin/bash
    # Description: Verify the usage of the new `_gen_remaining_queries` and `_gen_query` methods.
    # Test: Search for the usage of the `_gen_remaining_queries` method. Expect: Usage of the method in the code.
    rg --type python $'def _gen_remaining_queries'
    # Test: Search for the usage of the `_gen_query` method. Expect: Usage of the method in the code.
    rg --type python $'def _gen_query'
Length of output: 145
Script:
    #!/bin/bash
    # Description: Verify the usage of the new `_gen_remaining_queries` and `_gen_query` methods.
    # Test: Search for the definition of the `_gen_remaining_queries` method.
    rg 'def _gen_remaining_queries'
    # Test: Search for the definition of the `_gen_query` method.
    rg 'def _gen_query'
    # Test: Search for the usage of the `_gen_remaining_queries` method.
    rg '_gen_remaining_queries'
    # Test: Search for the usage of the `_gen_query` method.
    rg '_gen_query'
Length of output: 880
659-669: New Method: Verify the usage of `add_preparsed_query` method.
The `add_preparsed_query` method is new. Ensure it is used appropriately throughout the code.
Verification successful
New Method: Verify the usage of `add_preparsed_query` method.
The `add_preparsed_query` method is used appropriately in the codebase to handle `PreparsedQuery` objects.
- `metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py`: Lines around 650-670
Scripts executed
The following scripts were executed for the analysis:
Script:
    #!/bin/bash
    # Description: Verify the usage of the new `add_preparsed_query` method.
    # Test: Search for the usage of the `add_preparsed_query` method. Expect: Usage of the method in the code.
    rg --type python $'def add_preparsed_query'
Length of output: 75
Script:
    #!/bin/bash
    # Description: Verify the usage of the new `add_preparsed_query` method.
    # Test: Search for the usage of the `add_preparsed_query` method. Expect: Usage of the method in the code.
    rg 'add_preparsed_query'
Length of output: 337
Script:
    #!/bin/bash
    # Description: Extract the context of `add_preparsed_query` method usage.
    # Extract lines around the usage of `add_preparsed_query` method for deeper inspection.
    rg -A 5 -B 5 'add_preparsed_query' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 1414
133-133: New Attribute: Verify the usage of `column_usage` in `KnownQueryLineageInfo`.
The `column_usage` attribute is new. Ensure it is used appropriately throughout the code.
treff7es
left a comment
A couple of mypy issues there.
Actionable comments posted: 1
Outside diff range, codebase verification and nitpick comments (5)
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_data_reader.py (1)
Line range hint 42-54: Ensure SQL query safety.
The SQL query construction should be reviewed to prevent SQL injection. Consider using parameterized queries to enhance security.
    sql = 'select * from "{db_name}"."{schema_name}"."{table_name}" sample ({sample_size} rows);'
    cursor.execute(sql)
You can use a parameterized query to avoid SQL injection:
    sql = 'select * from "{db_name}"."{schema_name}"."{table_name}" sample (%s rows);'
    cursor.execute(sql, (sample_size,))
metadata-ingestion/src/datahub/ingestion/source/fivetran/config.py (1)
71-73: Ensure Correct Description for `database` and `log_schema`.
The descriptions for `database` and `log_schema` fields should clearly explain their purpose related to the Fivetran connector log.
    - database: str = Field(description="The fivetran connector log database.")
    - log_schema: str = Field(description="The fivetran connector log schema.")
    + database: str = Field(description="The database where the Fivetran connector logs are stored.")
    + log_schema: str = Field(description="The schema within the Fivetran connector log database.")
metadata-ingestion/src/datahub/ingestion/source/redshift/lineage_v2.py (3)
Line range hint 34-50: Consider initializing known_urns in the constructor.
To ensure all attributes are initialized in the constructor, consider initializing `self.known_urns` in the `__init__` method.
    - self.known_urns: Set[str] = set() # will be set later
    + self.known_urns: Set[str] = set()
Line range hint 290-293: Consider adding a detailed TODO comment.
The TODO comment should provide more details on what needs to be implemented.
    - # TODO actor
    + # TODO: Implement actor extraction for lineage rows.
Line range hint 295-303: Improve logging for filtered targets.
Consider adding more details to the log message for better debugging.
    - logger.debug(
    -     f"Skipping lineage for {target.urn()} as it is not in known_urns"
    - )
    + logger.debug(
    +     f"Skipping lineage for target URN: {target.urn()} as it is not in known_urns"
    + )
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (36)
- metadata-ingestion/setup.py (2 hunks)
- metadata-ingestion/src/datahub/ingestion/api/source.py (2 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_audit.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/fivetran/config.py (2 hunks)
- metadata-ingestion/src/datahub/ingestion/source/redshift/lineage_v2.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_assertion.py (4 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py (8 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_connection.py (7 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_data_reader.py (2 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_lineage_v2.py (12 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_profiler.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_query.py (5 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema.py (14 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema_gen.py (19 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_summary.py (6 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_usage_v2.py (13 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_utils.py (10 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_v2.py (20 hunks)
- metadata-ingestion/src/datahub/ingestion/source/sql/sql_config.py (4 hunks)
- metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py (24 hunks)
- metadata-ingestion/tests/integration/snowflake/common.py (2 hunks)
- metadata-ingestion/tests/integration/snowflake/snowflake_golden.json (13 hunks)
- metadata-ingestion/tests/integration/snowflake/snowflake_privatelink_golden.json (2 hunks)
- metadata-ingestion/tests/integration/snowflake/test_snowflake_failures.py (3 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_add_known_query_lineage.json (1 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_basic_lineage.json (1 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_column_lineage_deduplication.json (2 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_multistep_temp_table.json (1 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_overlapping_inserts.json (2 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_overlapping_inserts_from_temp_tables.json (3 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_table_rename.json (2 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_temp_table.json (2 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_view_lineage.json (1 hunks)
- metadata-ingestion/tests/unit/stateful_ingestion/state/test_redundant_run_skip_handler.py (3 hunks)
- metadata-ingestion/tests/unit/test_snowflake_source.py (9 hunks)
Files skipped from review due to trivial changes (1)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_basic_lineage.json
Additional context used
Ruff
metadata-ingestion/tests/unit/stateful_ingestion/state/test_redundant_run_skip_handler.py
48-48: Local variable `mock_connect` is assigned to but never used
Remove assignment to unused variable `mock_connect`
(F841)
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_utils.py
195-203: Return the negated condition directly
Inline condition
(SIM103)
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_connection.py
119-119: Use `key not in dict` instead of `key not in dict.keys()`
Remove `.keys()`
(SIM118)
metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
452-452: Use of `functools.lru_cache` or `functools.cache` on methods can lead to memory leaks
(B019)
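For readers unfamiliar with the B019 finding, the following is a minimal sketch of why caching a method at class level can keep instances alive, and one possible per-instance alternative. The class names `LeakyResolver` and `PerInstanceResolver` and the helper `survives_deletion` are hypothetical illustrations; they are not taken from `sql_parsing_aggregator.py` and do not reproduce the code at line 452.

```python
import functools
import gc
import weakref


class LeakyResolver:
    """Caches on the method itself; the class-level cache keeps `self` alive."""

    def __init__(self, name: str) -> None:
        self.name = name

    @functools.lru_cache(maxsize=None)  # the pattern that B019 flags
    def resolve(self, key: str) -> str:
        return f"{self.name}:{key}"


class PerInstanceResolver:
    """Builds one cache per instance, so the cache is freed with the instance."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Wrap the bound method; the cache is now owned by this instance.
        self.resolve = functools.lru_cache(maxsize=None)(self._resolve_uncached)

    def _resolve_uncached(self, key: str) -> str:
        return f"{self.name}:{key}"


def survives_deletion(cls) -> bool:
    """Return True if an instance is still reachable after its last name is deleted."""
    obj = cls("db")
    obj.resolve("table")  # populate the cache
    ref = weakref.ref(obj)
    del obj
    gc.collect()  # needed to break the reference cycle in the per-instance variant
    return ref() is not None
```

In the first class the cache lives on the class and stores `self` as part of each key, so the instance cannot be freed while the cache entry exists. Whether that matters for a given method depends on how many instances are created and how long the process runs.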
Additional comments not posted (172)
metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_view_lineage.json (1)
84-100: Verify the structure and format of new entities.
Ensure that the new entities added to the `subjects` array follow the correct structure and format.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_assertion.py (2)
Line range hint 63-78: Ensure SQL query safety and verify processing logic.
The SQL query construction should be reviewed to prevent SQL injection. Consider using parameterized queries to enhance security. Verify that the processing logic correctly handles the fetched data.
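As a rough illustration of the parameterized-query advice, here is a minimal sketch using the standard-library `sqlite3` module as a stand-in for the Snowflake connector (an assumption made so the example is runnable; placeholder syntax differs between drivers, and `sqlite3` uses `?`). The helpers `quote_identifier` and `sample_rows` are hypothetical names. Note that placeholders bind values only; identifiers such as table names cannot be bound and need separate quoting or validation.

```python
import sqlite3
from typing import Any, List, Tuple


def quote_identifier(name: str) -> str:
    """Quote a SQL identifier by doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def sample_rows(conn: sqlite3.Connection, table_name: str, limit: int) -> List[Tuple[Any, ...]]:
    """Fetch up to `limit` rows; the limit is bound as a parameter, not interpolated."""
    sql = f"SELECT * FROM {quote_identifier(table_name)} LIMIT ?"
    return conn.execute(sql, (limit,)).fetchall()


# Usage: an in-memory database with a table name that contains a space.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "my table" (id INTEGER)')
conn.executemany('INSERT INTO "my table" VALUES (?)', [(1,), (2,), (3,)])
rows = sample_rows(conn, "my table", 2)
```

The split between quoted identifiers and bound values is the design point here: binding covers the value-shaped inputs, while names still need their own handling.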
Line range hint 103-121: Verify row processing logic.
Ensure that the row processing logic correctly handles the data and generates the appropriate metadata change proposals.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_summary.py (2)
Line range hint 67-127: Ensure SQL query safety and verify processing logic.
The SQL query construction should be reviewed to prevent SQL injection. Consider using parameterized queries to enhance security. Verify that the processing logic correctly handles the fetched data.
Line range hint 129-131: Verify method for correctness.
Ensure that the method correctly returns the summary report.
metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_add_known_query_lineage.json (1)
116-129: Ensure Consistent Usage of URNs.
The URNs for datasets and schema fields must follow a consistent pattern. Verify that the URNs used here match the expected format and refer to the correct entities.
Verification successful
Ensure Consistent Usage of URNs.
The URNs in the JSON file follow the expected format and refer to the correct entities. The patterns for `urn:li:dataset` and `urn:li:schemaField` are consistent throughout the file.
Datasets:
- urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.foo,PROD)
- urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.bar,PROD)
Schema Fields:
- urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.foo,PROD),a)
- urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.foo,PROD),b)
- urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.foo,PROD),c)
- urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.bar,PROD),a)
- urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.bar,PROD),b)
- urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.bar,PROD),c)
Scripts executed
The following scripts were executed for the analysis:
Script:
    #!/bin/bash
    # Description: Verify the consistency of URNs in the JSON file.
    # Test: Search for URNs in the file. Expect: All URNs follow the expected pattern and refer to valid entities.
    rg --type json 'urn:li:dataset:|urn:li:schemaField:' metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_add_known_query_lineage.json
Length of output: 1670
metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_multistep_temp_table.json (1)
109-125: Ensure Consistent Usage of URNs.
The URNs for datasets and schema fields must follow a consistent pattern. Verify that the URNs used here match the expected format and refer to the correct entities.
Verification successful
URNs are consistent and valid.
The URNs for datasets and schema fields in the specified JSON file follow the expected patterns and refer to the correct entities.
- urn:li:dataset:(urn:li:dataPlatform:redshift,<namespace>,PROD)
- urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,<namespace>,PROD),<field>)
Scripts executed
The following scripts were executed for the analysis:
Script:
    #!/bin/bash
    # Description: Verify the consistency of URNs in the JSON file.
    # Test: Search for URNs in the file. Expect: All URNs follow the expected pattern and refer to valid entities.
    rg --type json 'urn:li:dataset:|urn:li:schemaField:' metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_multistep_temp_table.json
Length of output: 1861
Script:
    #!/bin/bash
    # Description: Verify the consistency and validity of URNs in the JSON file.
    # Search for dataset URNs and validate their patterns.
    rg --type json 'urn:li:dataset:\(urn:li:dataPlatform:redshift,[^)]*,PROD\)' metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_multistep_temp_table.json
    # Search for schemaField URNs and validate their patterns.
    rg --type json 'urn:li:schemaField:\(urn:li:dataset:\(urn:li:dataPlatform:redshift,[^)]*,PROD\),[^)]*\)' metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_multistep_temp_table.json
Length of output: 3184
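The same two patterns can be applied from Python when a stricter check than a substring search is wanted. This is a small sketch, not part of the PR: the regular expressions are copied from the `rg` commands in the script above, while the function name `classify_urn` and the use of `fullmatch` (whole-string matching, which is stricter than what `rg` does) are choices made for this example.

```python
import re

# Patterns copied from the rg commands in the verification script.
DATASET_URN = re.compile(
    r"urn:li:dataset:\(urn:li:dataPlatform:redshift,[^)]*,PROD\)"
)
SCHEMA_FIELD_URN = re.compile(
    r"urn:li:schemaField:\(urn:li:dataset:\(urn:li:dataPlatform:redshift,[^)]*,PROD\),[^)]*\)"
)


def classify_urn(urn: str) -> str:
    """Return 'schemaField', 'dataset', or 'unknown' for a candidate URN string."""
    if SCHEMA_FIELD_URN.fullmatch(urn):
        return "schemaField"
    if DATASET_URN.fullmatch(urn):
        return "dataset"
    return "unknown"


# Usage with URNs of the shape listed earlier in this review.
dataset_kind = classify_urn(
    "urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.foo,PROD)"
)
field_kind = classify_urn(
    "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.foo,PROD),a)"
)
```

Both patterns are specific to the `redshift` platform and the `PROD` environment used in these golden files, so URNs for other platforms or environments are reported as `unknown`.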
metadata-ingestion/src/datahub/ingestion/source/fivetran/config.py (1)
16-18: Verify Import of `SnowflakeConnectionConfig`.
Ensure that the `SnowflakeConnectionConfig` import is necessary and correctly used in the file.
Verification successful
Verified Import of `SnowflakeConnectionConfig`.
The import of `SnowflakeConnectionConfig` is necessary and correctly used in the file `metadata-ingestion/src/datahub/ingestion/source/fivetran/config.py`. It is utilized in the definition of the `SnowflakeDestinationConfig` class, which inherits from `SnowflakeConnectionConfig`.
- Line 1: `SnowflakeConnectionConfig` is imported.
- Line 2: `SnowflakeDestinationConfig` class extends `SnowflakeConnectionConfig`.
Scripts executed
The following scripts were executed for the analysis:
Script:
    #!/bin/bash
    # Description: Verify the usage of `SnowflakeConnectionConfig` in the file.
    # Test: Search for `SnowflakeConnectionConfig` in the file. Expect: The import is necessary and correctly used.
    rg 'SnowflakeConnectionConfig' metadata-ingestion/src/datahub/ingestion/source/fivetran/config.py
Length of output: 189
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_profiler.py (1)
Line range hint 167-171: Ensure Proper Handling of Database and Schema Selection.
The `callable_for_db_connection` method should handle database and schema selection correctly, especially when the PUBLIC schema is absent.
Verification successful
Verified: Proper Handling of Database and Schema Selection.
The `callable_for_db_connection` method correctly handles database and schema selection. It explicitly manages the case where the PUBLIC schema is absent by checking for a schema name and executing the `USE SCHEMA` command only if necessary.
- The method retrieves the native connection.
- It executes the `USE DATABASE` command.
- It checks for a schema name and executes the `USE SCHEMA` command if one is provided.
Scripts executed
The following scripts were executed for the analysis:
Script:
    #!/bin/bash
    # Description: Verify the handling of database and schema selection in the `callable_for_db_connection` method.
    # Test: Search for the method in the file. Expect: Proper handling of database and schema selection, especially when the PUBLIC schema is absent.
    rg 'callable_for_db_connection' metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_profiler.py
Length of output: 241
Script:
    #!/bin/bash
    # Description: Retrieve the full implementation of the `callable_for_db_connection` method to verify the handling of database and schema selection.
    # Use ast-grep to extract the implementation of `callable_for_db_connection` method from the file.
    ast-grep --lang python --pattern 'def callable_for_db_connection(self, db_name: str) -> Callable: $$$' metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_profiler.py
Length of output: 2040
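The ordering described in the verification above (select the database first, then the schema only if one is given) can be sketched as follows. This is a minimal illustration, not the profiler's actual code: `build_use_statements` and its quoting style are assumptions made for the example.

```python
from typing import List, Optional


def build_use_statements(db_name: str, schema_name: Optional[str] = None) -> List[str]:
    """Hypothetical helper: return session-setup statements in execution order."""
    statements = [f'USE DATABASE "{db_name}"']
    # Only select a schema when one was provided, so a database that has
    # no PUBLIC schema does not cause the setup to fail.
    if schema_name:
        statements.append(f'USE SCHEMA "{schema_name}"')
    return statements
```

Keeping the schema step conditional is what makes the setup tolerant of databases where no default schema exists.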
metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_column_lineage_deduplication.json (2)
96-112: Ensure correct entity formatting.
The new subjects added under the `querySubjects` aspect appear correctly formatted and consistent with the existing structure.
160-182: Ensure correct entity formatting.
The new subjects added under the `querySubjects` aspect appear correctly formatted and consistent with the existing structure.
metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_overlapping_inserts.json (2)
121-137: Ensure correct entity formatting.
The new subjects added under the `querySubjects` aspect appear correctly formatted and consistent with the existing structure.
185-201: Ensure correct entity formatting.
The new subjects added under the `querySubjects` aspect appear correctly formatted and consistent with the existing structure.
metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_table_rename.json (2)
84-100: Ensure correct entity formatting.
The new subjects added under the `querySubjects` aspect appear correctly formatted and consistent with the existing structure.
199-215: Ensure correct entity formatting.
The new subjects added under the `querySubjects` aspect appear correctly formatted and consistent with the existing structure.
metadata-ingestion/src/datahub/ingestion/source/sql/sql_config.py (3)
11-13: Correct mixin replacements.
The new mixins `EnvConfigMixin` and `PlatformInstanceConfigMixin` are correctly imported and used.
Line range hint 34-62: New class `SQLFilterConfig` looks good.
The new class `SQLFilterConfig` and its fields are correctly defined and adhere to best practices.
63-76: Updates to `SQLCommonConfig` class look good.
The updates to the `SQLCommonConfig` class and its fields are correctly defined and adhere to best practices.
metadata-ingestion/tests/integration/snowflake/test_snowflake_failures.py (8)
4-9: Imports look good!
The added imports are relevant for the tests in this file.
76-79: Test case for missing role access looks good!
The test correctly checks for the `PipelineInitError` when the role is not granted.
4-4: Test case for missing warehouse access looks good!
The test correctly simulates the condition and asserts the expected failure message.
Also applies to: 76-79
4-4: Test case for no databases with access looks good!
The test correctly simulates the condition and asserts the expected failure message.
Also applies to: 76-79
4-4: Test case for no tables access looks good!
The test correctly simulates the condition and asserts the expected failure message.
Also applies to: 76-79
4-4: Test case for listing columns error looks good!
The test correctly simulates the condition and asserts the expected warning message.
Also applies to: 76-79
4-4: Test case for listing primary keys error looks good!
The test correctly simulates the condition and asserts the expected warning message.
Also applies to: 76-79
4-4: Test cases for missing permissions look good!
The tests correctly simulate the conditions and assert the expected failure messages.
Also applies to: 76-79
metadata-ingestion/tests/unit/stateful_ingestion/state/test_redundant_run_skip_handler.py (5)
2-2: Imports look good!
The added import `Iterable` is relevant for the tests in this file.
Line range hint 28-49: Fixture setup looks good!
The `stateful_source` fixture correctly sets up the `SnowflakeV2Source` with the necessary configurations.
Tools
Ruff
48-48: Local variable `mock_connect` is assigned to but never used
Remove assignment to unused variable `mock_connect`
(F841)
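The F841 finding above is typically resolved by dropping the `as` binding on the patch context manager. The sketch below shows the pattern with the standard library only; the patched target (`os.getcwd`) and the function name are stand-ins chosen for the example, not the fixture's real patch target.

```python
import os
from unittest.mock import patch


def current_dir_under_patch() -> str:
    # No "as mock_connect" binding here: the patch is still active inside
    # the block, but no unused local variable is created, so F841 is not raised.
    with patch("os.getcwd", return_value="/patched"):
        return os.getcwd()
```

If the mock object is needed for assertions later, keeping the binding and actually using it is the alternative fix.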
47-49: Test case for redundant run job IDs looks good!
The test correctly validates the job IDs for both lineage and usage extractors.
47-49: Test case for redundant run skip handler looks good!
The test correctly covers multiple scenarios and validates the skip logic and suggested time windows.
47-49: Utility functions and checkpoint tests look good!
The functions and tests correctly validate the checkpoint creation logic using mocks and assertions.
metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_temp_table.json (1)
84-100: JSON data for SQL parsing test cases looks good!
The structure and values are correct and consistent with the expected schema.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_utils.py (7)
22-33: Class `SnowflakeStructuredReportMixin` looks good!
The methods correctly use the `structured_reporter` for reporting warnings and errors.
Line range hint 36-63: Class `SnowflakeCommonProtocol` looks good!
The class defines essential methods and properties for Snowflake integration.
Line range hint 65-141: Class `SnowsightUrlBuilder` looks good!
The methods are well-structured and handle various scenarios for building URLs.
Line range hint 143-225: Class `SnowflakeFilterMixin` looks good!
The methods correctly implement the filtering logic based on the configurations.
Tools
Ruff
195-203: Return the negated condition directly
Inline condition
(SIM103)
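For readers unfamiliar with SIM103, the rule flags an `if` that only returns `False`/`True` and suggests returning the negated condition instead. The functions below are hypothetical stand-ins (the flagged code at lines 195-203 is not shown in this review); they illustrate the shape of the refactor only.

```python
import re


def is_allowed_verbose(name: str, deny_regex: str) -> bool:
    # Shape that triggers SIM103: the branches only return boolean literals.
    if re.match(deny_regex, name):
        return False
    return True


def is_allowed(name: str, deny_regex: str) -> bool:
    # Equivalent form suggested by the rule: return the negated condition.
    return not re.match(deny_regex, name)
```

Both functions return the same result for every input; the second simply removes the redundant branching.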
227-258: Class `SnowflakeIdentifierMixin` looks good!
The methods correctly handle identifiers based on the configurations.
Line range hint 259-283: Class `SnowflakeCommonMixin` looks good!
The methods correctly combine the functionalities of the mixins and provide additional utilities.
Line range hint 259-283: Method `warn_if_stateful_else_error` looks good!
The method correctly checks the configuration and logs appropriately.
metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_overlapping_inserts_from_temp_tables.json (1)
179-193: Ensure consistency in entity representation.
The JSON structure looks correct. However, ensure that all entities are consistently represented across the dataset.
metadata-ingestion/src/datahub/ingestion/source/redshift/lineage_v2.py (5)
Line range hint 271-288: Ensure proper exception handling and logging.
The method handles exceptions and logs warnings. Ensure that the logging provides sufficient context for debugging.
Line range hint 305-320: Handle missing DDL in STL scan entries.
The method logs a warning for missing DDL. Ensure that the warning provides sufficient context for debugging.
Line range hint 322-333: Ensure consistent handling of DDL in view lineage.
The method handles DDL for views. Ensure that the handling is consistent with other methods.
Line range hint 335-348: Ensure proper handling of source and target URNs.
The method handles source and target URNs for copy commands. Ensure that the handling is consistent and correct.
Line range hint 391-393: Ensure consistent generation of metadata work units.
The method generates metadata work units. Ensure that the generation is consistent and follows best practices.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_connection.py (8)
Line range hint 135-155: Ensure comprehensive validation for OAuth configuration.
The method provides detailed validation for OAuth configuration. Ensure all edge cases are covered.
197-200: Ensure correct generation of SQLAlchemy URL.
The method correctly generates the SQLAlchemy URL with the provided parameters.
Line range hint 225-263: Ensure proper handling of private key in connection arguments.
The method correctly handles the private key for connection arguments.
Line range hint 263-301: Ensure proper handling of OAuth connection.
The method correctly handles OAuth connection generation.
305-314: Ensure proper handling of key pair connection.
The method correctly handles key pair connection generation.
Line range hint 318-342: Ensure proper handling of native connection.
The method correctly handles native connection generation.
349-362: Ensure proper exception handling for connection generation.
The method handles exceptions correctly when generating a connection.
114-114: Remove unnecessary `.keys()` call.
Use `key not in dict` instead of `key not in dict.keys()`.
    - if v not in _VALID_AUTH_TYPES.keys():
    + if v not in _VALID_AUTH_TYPES:
Likely invalid or redundant comment.
metadata-ingestion/src/datahub/ingestion/api/source.py (5)
117-119: Ensure context is truncated correctly.
The method correctly truncates the context if it exceeds the maximum length.
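As a rough illustration of the truncation behavior described above, the sketch below cuts a context string at a maximum length and appends an ellipsis. The function name, the default limit, and the ellipsis suffix are all assumptions for the example; the review does not show the real implementation.

```python
def truncate_context(context: str, max_len: int = 300) -> str:
    """Hypothetical helper: keep at most `max_len` characters of context."""
    if len(context) > max_len:
        # Mark the cut so readers can tell the log entry was shortened.
        return context[:max_len] + "..."
    return context
```

Strings at or below the limit pass through unchanged, which keeps short log entries intact.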
Line range hint 142-146: Ensure correct retrieval of log entries.
The method correctly retrieves log entries of the specified type.
Line range hint 166-188: Ensure correct reporting of work units.
The method correctly reports work units and updates the relevant metrics.
Line range hint 194-199: Ensure correct reporting of warnings.
The method correctly reports warnings using structured logs.
Line range hint 225-232: Ensure correct computation of statistics.
The method correctly computes statistics for the report.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py (9)
7-7: Imports look good.
The new imports from `pydantic` and `datahub.configuration.source_common` are appropriate for the added functionality.
Also applies to: 12-16
84-96: New fields in `SnowflakeFilterConfig` look good.
The added fields for `database_pattern`, `schema_pattern`, and `match_fully_qualified_names` are appropriate for filtering configurations.
103-125: Root validator logic is sound but check backward compatibility.
The root validator ensures proper configuration for schema patterns and maintains backward compatibility. Verify that the deprecation warning is communicated effectively to users.
128-134: New field in `SnowflakeIdentifierConfig` looks good.
The `convert_urns_to_lowercase` field with a default value of `True` is appropriate for identifier configurations.
146-167: New fields in `SnowflakeConfig` look good.
The added fields for including table and view lineage are appropriate for lineage configurations.
158-168: Root validator logic is sound but check dependency on `include_table_lineage`.
The root validator ensures that `include_table_lineage` is set to `True` when `include_view_lineage` is enabled. Verify that this dependency is clearly documented and communicated to users.
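The dependency between the two flags can be sketched without pydantic. The example below swaps the root validator for a plain dataclass `__post_init__` check so it runs with the standard library alone; the class name, defaults, and error message are assumptions, not the source's code.

```python
from dataclasses import dataclass


@dataclass
class LineageFlags:
    """Hypothetical stand-in for the two lineage options on the config."""

    include_table_lineage: bool = True
    include_view_lineage: bool = True

    def __post_init__(self) -> None:
        # View lineage depends on table lineage, so reject the one
        # combination where views are requested but tables are disabled.
        if self.include_view_lineage and not self.include_table_lineage:
            raise ValueError(
                "include_table_lineage must be True when include_view_lineage is enabled"
            )
```

Failing at construction time surfaces the misconfiguration before any ingestion work starts, which is the same effect a root validator has.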
Line range hint 170-365: New fields and validators in `SnowflakeV2Config` look good.
The added fields and validators for usage statistics, technical schema, primary and foreign keys, column lineage, lazy schema resolver, tags, and other configurations are appropriate for Snowflake V2.
327-330: Method `get_sql_alchemy_url` looks good.
The method constructs a SQLAlchemy URL for Snowflake using the connection configuration.
Line range hint 371-417: Method `validate_shares` looks good.
The method validates the `shares` configuration, ensuring that platform instances and databases are correctly configured.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema.py (9)
10-10: Import looks good.
The new import from `datahub.ingestion.source.snowflake.snowflake_connection` is appropriate for the added functionality.
Line range hint 186-229: New methods in `SnowflakeDataDictionary` look good.
The added methods for showing databases, getting databases, and getting schemas for a database are appropriate for data dictionary operations.
Line range hint 270-299: Method `get_tables_for_database` looks good but verify error handling.
The method retrieves tables for a given database. Verify that the error handling for the query is sufficient.
Line range hint 303-313: Method `get_tables_for_schema` looks good.
The method retrieves tables for a given schema in a database.
Line range hint 331-361: Method `get_views_for_database` looks good but verify pagination logic.
The method retrieves views for a given database with pagination. Verify that the pagination logic handles large result sets correctly.
Line range hint 424-438: Method `get_pk_constraints_for_schema` looks good.
The method retrieves primary key constraints for a given schema in a database.
Line range hint 443-471: Method `get_fk_constraints_for_schema` looks good.
The method retrieves foreign key constraints for a given schema in a database.
Line range hint 475-496: Method `get_tags_for_database_without_propagation` looks good.
The method retrieves tags for a database without propagation.
Line range hint 530-541: Method `get_tags_on_columns_for_table` looks good.
The method retrieves tags on columns for a given table in a database.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (8)
1-12: Imports look good.
The new imports from `pydantic`, `pathlib`, and `typing_extensions` are appropriate for the added functionality.
57-87: New fields in `SnowflakeQueriesExtractorConfig` look good.
The added fields for window, deny usernames, temporary tables pattern, and local temp path are appropriate for query extraction configurations.
92-94: New field in `SnowflakeQueriesSourceConfig` looks good.
The added field for connection configuration is appropriate for Snowflake queries.
108-148: New methods in `SnowflakeQueriesExtractor` look good.
The added methods for initializing the extractor, handling configurations, and managing temporary paths are appropriate for query extraction.
175-203: Method `get_workunits_internal` looks good but verify caching logic.
The method retrieves work units for Snowflake queries. Verify that the caching logic for the audit log is sufficient.
205-258: Method `fetch_audit_log` looks good but verify error handling.
The method fetches the audit log for Snowflake queries. Verify that the error handling for parsing audit log rows is sufficient.
259-365: Method `_parse_audit_log_row` looks good but verify JSON parsing logic.
The method parses a row from the audit log. Verify that the JSON parsing logic for specific fields is sufficient.
402-501: Function `_build_enriched_audit_log_query` looks good.
The function constructs a query for fetching enriched audit logs with appropriate filters and pagination.
metadata-ingestion/tests/unit/test_snowflake_source.py (5)
27-27: Import looks good.
The new import from `datahub.ingestion.source.snowflake.snowflake_utils` is appropriate for the added functionality.
448-460: Function `test_aws_cloud_region_from_snowflake_region_id` looks good.
The function correctly tests the conversion of Snowflake region ID to AWS cloud region.
470-472: Function `test_google_cloud_region_from_snowflake_region_id` looks good.
The function correctly tests the conversion of Snowflake region ID to Google Cloud region.
Line range hint 482-492: Function `test_azure_cloud_region_from_snowflake_region_id` looks good.
The function correctly tests the conversion of Snowflake region ID to Azure cloud region.
502-504: Function `test_unknown_cloud_region_from_snowflake_region_id` looks good.
The function correctly tests the handling of unknown Snowflake region IDs.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_lineage_v2.py (17)
10-10: Import Statement for Closeable Interface Added
The `Closeable` interface was added. This is necessary for ensuring that resources are properly released when the object is no longer needed.
18-19: Import Statement for SnowflakeConnection and SnowflakePermissionError Added
The `SnowflakeConnection` and `SnowflakePermissionError` imports were added, which are essential for handling Snowflake connections and related errors.
32-32: Import Statement for KnownLineageMapping Added
The `KnownLineageMapping` import was added. This is crucial for handling known lineage mappings in the lineage extraction process.
104-104: Class SnowflakeLineageExtractor Now Implements Closeable
The `SnowflakeLineageExtractor` class now implements the `Closeable` interface. This is important for ensuring that resources are properly released.
121-121: Connection Initialization in Constructor
The `SnowflakeConnection` is now initialized in the constructor, which aligns with the PR objectives.
130-130: Use of SnowflakeConnection
The `SnowflakeConnection` is now assigned to `self.connection` in the constructor, which ensures that the connection is available throughout the class methods.
262-265: Use of KnownLineageMapping in _populate_external_upstreams
The `_populate_external_upstreams` method now uses `KnownLineageMapping`. This improves how external lineage data is processed and aggregated.
271-275: Use of KnownLineageMapping in _populate_external_upstreams
The `_populate_external_upstreams` method now uses `KnownLineageMapping` for show queries as well. This ensures consistency in handling external lineage data.
287-287: Return Type Changed to Iterable[KnownLineageMapping]
The `_populate_external_lineage_from_show_query` method now returns `Iterable[KnownLineageMapping]`. This aligns with the improved handling of lineage data.
321-321: Return Type Changed to Iterable[KnownLineageMapping]
The `_populate_external_lineage_from_copy_history` method now returns `Iterable[KnownLineageMapping]`. This aligns with the improved handling of lineage data.
329-334: Use of KnownLineageMapping in _populate_external_lineage_from_copy_history
The `_populate_external_lineage_from_copy_history` method now uses `KnownLineageMapping`. This improves how external lineage data is processed and aggregated.
349-349: Return Type Changed to Optional[KnownLineageMapping]
The `_process_external_lineage_result_row` method now returns `Optional[KnownLineageMapping]`. This aligns with the improved handling of lineage data.
355-355: Return None for Non-discovered Tables
The `_process_external_lineage_result_row` method now returns `None` if the table is not in `discovered_tables`. This ensures that only relevant tables are processed.
362-368: Use of KnownLineageMapping in _process_external_lineage_result_row
The `_process_external_lineage_result_row` method now uses `KnownLineageMapping` for creating lineage mappings. This improves the consistency and clarity of the lineage data.
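The pattern described in the last few comments (a per-row function returning an optional mapping, and a caller that yields only the non-`None` results) can be sketched as below. `LineageMapping`, `process_row`, and `collect_mappings` are hypothetical names standing in for `KnownLineageMapping` and the extractor methods; the field names are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Set, Tuple


@dataclass(frozen=True)
class LineageMapping:
    """Hypothetical stand-in for KnownLineageMapping."""

    upstream: str
    downstream: str


def process_row(
    table: str, external_location: str, discovered_tables: Set[str]
) -> Optional[LineageMapping]:
    # Rows for tables that were never discovered are skipped by returning None.
    if table not in discovered_tables:
        return None
    return LineageMapping(upstream=external_location, downstream=table)


def collect_mappings(
    rows: Iterable[Tuple[str, str]], discovered_tables: Set[str]
) -> Iterator[LineageMapping]:
    # The caller filters out None so downstream consumers see only real mappings.
    for table, location in rows:
        mapping = process_row(table, location, discovered_tables)
        if mapping is not None:
            yield mapping
```

Returning `Optional` from the row processor keeps the skip decision in one place and lets the generator stay a simple filter.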
423-428: Added Dataset Pattern Validation in map_query_result_upstreams
The `map_query_result_upstreams` method now includes dataset pattern validation. This ensures that only allowed datasets are processed.
509-514: Added Dataset Pattern Validation in build_finegrained_lineage_upstreams
The `build_finegrained_lineage_upstreams` method now includes dataset pattern validation. This ensures that only allowed datasets are processed.
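To make "dataset pattern validation" concrete, here is a minimal allow/deny filter over upstream names built on the standard `re` module. It is a sketch under stated assumptions: the function names, the deny-wins precedence, and case-insensitive matching are choices made for the example and may differ from the source's pattern class.

```python
import re
from typing import List, Sequence


def is_dataset_allowed(
    name: str, allow: Sequence[str] = (".*",), deny: Sequence[str] = ()
) -> bool:
    # Deny patterns take precedence over allow patterns in this sketch.
    if any(re.match(pattern, name, re.IGNORECASE) for pattern in deny):
        return False
    return any(re.match(pattern, name, re.IGNORECASE) for pattern in allow)


def filter_upstreams(
    upstreams: Sequence[str], allow: Sequence[str] = (".*",), deny: Sequence[str] = ()
) -> List[str]:
    # Keep only the upstream datasets that pass the pattern check.
    return [name for name in upstreams if is_dataset_allowed(name, allow, deny)]
```

Applying the same check in both the table-level and column-level paths is what keeps filtered datasets out of the emitted lineage.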
565-567: Added close Method to Implement Closeable
The `close` method was added to the `SnowflakeLineageExtractor` class to fulfill the `Closeable` interface requirements. This method should ensure that any resources are properly released.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_usage_v2.py (13)
12-12: Import Statement for Closeable Interface Added
The `Closeable` interface was added. This is necessary for ensuring that resources are properly released when the object is no longer needed.
17-18: Import Statement for SnowflakeConnection and SnowflakePermissionError Added
The `SnowflakeConnection` and `SnowflakePermissionError` imports were added, which are essential for handling Snowflake connections and related errors.
109-109: Class SnowflakeUsageExtractor Now Implements Closeable
The `SnowflakeUsageExtractor` class now implements the `Closeable` interface. This is important for ensuring that resources are properly released.
114-114: Connection Initialization in Constructor
The `SnowflakeConnection` is now initialized in the constructor, which aligns with the PR objectives.
122-122: Use of SnowflakeConnection
The `SnowflakeConnection` is now assigned to `self.connection` in the constructor, which ensures that the connection is available throughout the class methods.
203-203: Use of SnowflakeConnection in _get_workunits_internal
The `_get_workunits_internal` method now uses `self.connection.query` for querying Snowflake. This ensures consistency in how queries are executed.
235-235: Added Dataset Pattern Validation in _get_workunits_internal
The `_get_workunits_internal` method now includes dataset pattern validation. This ensures that only allowed datasets are processed.
289-289: Added Warning for Failed Usage Statistics Parsing
A warning is logged if parsing usage statistics fails. This helps in identifying issues during the ingestion process.
372-373: Assertion for Connection in _get_snowflake_history
An assertion is added to ensure that `self.connection` is not `None` before querying. This prevents potential runtime errors.
395-396: Assertion for Connection in _check_usage_date_ranges
An assertion is added to ensure that `self.connection` is not `None` before querying. This prevents potential runtime errors.
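The two assertions above follow a common pattern for optional attributes: assert non-`None` right before use, which both fails fast and lets type checkers narrow the type. A minimal sketch, with `FakeConnection` and `UsageExtractorSketch` as invented stand-ins for the real classes:

```python
from typing import List, Optional


class FakeConnection:
    """Hypothetical connection that echoes the SQL it receives."""

    def query(self, sql: str) -> List[str]:
        return [sql]


class UsageExtractorSketch:
    def __init__(self, connection: Optional[FakeConnection]) -> None:
        self.connection = connection

    def fetch(self, sql: str) -> List[str]:
        # Fail fast with a clear AssertionError instead of an AttributeError
        # on None; the assert also narrows Optional[...] for type checkers.
        assert self.connection is not None
        return self.connection.query(sql)
```

Note that assertions are removed when Python runs with `-O`, so this guards against programming mistakes rather than user-facing error conditions.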
505-505: Added Warning for Failed Operation History Parsing
A warning is logged if parsing operation history fails. This helps in identifying issues during the ingestion process.
564-564: Added Dataset Pattern Validation in _is_object_valid
The `_is_object_valid` method now includes dataset pattern validation. This ensures that only allowed datasets are processed.
590-592: Added close Method to Implement Closeable
The `close` method was added to the `SnowflakeUsageExtractor` class to fulfill the `Closeable` interface requirements. This method should ensure that any resources are properly released.
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_audit.py (1)
195-195: Handle URNs with Different Lengths in from_urn Method
The `from_urn` method now handles URNs with different lengths. This ensures that both standard and non-standard URNs are processed correctly.
metadata-ingestion/tests/integration/snowflake/common.py (3)
531-531: LGTM! The query condition is correctly handled.
The inclusion of view lineage and exclusion of column lineage is correctly implemented in the query.
607-610: LGTM! The query condition is correctly handled.
The time window for copying lineage history is correctly implemented in the query.
608-610: LGTM! The query condition is correctly handled.
The time window for copying lineage history is correctly implemented in the query.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_v2.py (9)
131-132: Improvement: Initialize connection in the constructor.
Initializing the connection in the constructor is a good practice for better resource management.
141-141: Refactor: Use composition for connection.
Using composition for the connection (i.e., `self.connection`) improves code readability and reusability.
236-238: Update: Use SnowflakeConnectionConfig for connection parsing.
Using `SnowflakeConnectionConfig` for connection parsing aligns with the new connection handling approach.
264-264: Update: Use SnowflakeConnection in check_capabilities.
The function now uses `SnowflakeConnection` for querying the Snowflake database, which aligns with the new connection handling approach.
426-426: Improvement: Reinitialize connection at the start.
Reinitializing the connection at the start of the function ensures that the latest connection settings are used.
432-434: Improvement: Use SnowsightUrlBuilder for external URL generation.
Using `SnowsightUrlBuilder` for external URL generation improves the handling of external URLs.
538-538: Update: Use SnowflakeConnection for session metadata queries.
The function now uses `SnowflakeConnection` for querying the Snowflake database for session metadata, which aligns with the new connection handling approach.
567-570: Update: Use SnowflakeConnection for Snowsight URL generation.
The function now uses `SnowflakeConnection` for querying the Snowflake database to generate the Snowsight URL, which aligns with the new connection handling approach.
Line range hint 618-618: Improvement: Ensure proper resource management.
The function ensures that the connection and extractors are properly closed, which improves resource management.
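The resource-management pattern in this file (objects that expose `close`, released by the owning source) can be sketched with `contextlib.closing` from the standard library. The classes below are hypothetical stand-ins; the source's `Closeable` interface and extractor classes are not reproduced here.

```python
from contextlib import closing


class FakeConnection:
    """Hypothetical connection that records whether it was closed."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class ExtractorSketch:
    """Hypothetical extractor that owns a connection and releases it on close."""

    def __init__(self, connection: FakeConnection) -> None:
        self.connection = connection

    def close(self) -> None:
        self.connection.close()


def run_and_close() -> bool:
    connection = FakeConnection()
    # closing() calls extractor.close() on exit, even if the body raises.
    with closing(ExtractorSketch(connection)):
        pass
    return connection.closed
```

Any object with a `close` method works with `closing`, which is why giving the extractors a `close` method makes cleanup uniform.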
metadata-ingestion/setup.py (2)
414-414: Addition: Include `snowflake-queries` dependency.
The `snowflake-queries` dependency has been added, which is necessary for the new Snowflake queries source.
667-667: Addition: Register `snowflake-queries` source in entry points.
The `snowflake-queries` source has been added to the entry points, making it discoverable.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_query.py (2)
363-363: LGTM!
The `upstreams_deny_pattern` parameter addition is appropriate and the function logic is intact.
414-414: LGTM!
The `downstreams_deny_pattern` parameter addition and its usage in `create_deny_regex_sql_filter` are appropriate.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema_gen.py (6)
Line range hint 146-176: LGTM!
The `snowsight_url_builder` parameter addition and its assignment are appropriate and consistent with the rest of the initialization logic.
365-368: LGTM!
The addition of the `match_fully_qualified_names` parameter in the `is_schema_allowed` function call is appropriate and consistent with the method's logic.
479-479: LGTM!
The addition of the `view_pattern.allowed` check is appropriate and ensures only allowed views are processed.
512-512: LGTM!
The addition of the `table_pattern.allowed` check is appropriate and ensures only allowed tables are processed.
Line range hint 789-799: LGTM!
The usage of `snowsight_url_builder` for generating external URLs is appropriate and consistent with the method's logic.
984-987: LGTM!
The usage of `snowsight_url_builder` for generating external URLs is appropriate and consistent with the method's logic.
metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py (15)
108-108: Verify the usage of `column_usage` field.
Ensure that the addition of `column_usage` to the `QueryMetadata` dataclass is correctly utilized and doesn't introduce issues.
134-134: Verify the usage of `column_usage` field.
Ensure that the addition of `column_usage` to the `KnownQueryLineageInfo` dataclass is correctly utilized and doesn't introduce issues.
Verification successful
Verify the usage of `column_usage` field.
The `column_usage` field has been integrated into the `sql_parsing_aggregator.py` file and is used in various contexts such as assignments and function calls. The integration appears thorough and appropriate for tracking column usage without introducing issues.
- Instances of `column_usage` are found in the following lines:
  - Assignments: Lines where `column_usage` is assigned values.
  - Function calls: Lines where `column_usage` is passed as an argument.
  - Conditionals: Lines where `column_usage` is used in conditions.
Scripts executed
The following scripts were executed for the analysis:
Script:
    #!/bin/bash
    # Description: Verify the usage of `column_usage` field in the file.
    # Test: Search for the usage of `column_usage`. Expect: Appropriate usage for tracking column usage.
    rg --type python 'column_usage' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 131
Script:
    #!/bin/bash
    # Description: Verify the usage of `column_usage` field in the file.
    # Test: Search for the usage of `column_usage`. Expect: Appropriate usage for tracking column usage.
    rg 'column_usage' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 714
203-203: Verify the usage of `num_preparsed_queries` field.
Ensure that the addition of `num_preparsed_queries` to the `SqlAggregatorReport` dataclass is correctly utilized and doesn't introduce issues.
Verification successful
The `num_preparsed_queries` field is correctly utilized for tracking preparsed queries.
- The field is declared and initialized to 0.
- It is incremented within the code, indicating it is used to count occurrences of preparsed queries.
Scripts executed
The following scripts were executed for the analysis:
Script:
    #!/bin/bash
    # Description: Verify the usage of `num_preparsed_queries` field in the file.
    # Test: Search for the usage of `num_preparsed_queries`. Expect: Appropriate usage for tracking preparsed queries.
    rg --type python 'num_preparsed_queries' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 140
Script:
    #!/bin/bash
    # Description: Verify the usage of `num_preparsed_queries` field in the file.
    # Test: Search for the usage of `num_preparsed_queries`. Expect: Appropriate usage for tracking preparsed queries.
    rg --type py 'num_preparsed_queries' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 191
222-222: Verify the usage of `num_lineage_skipped_due_to_filters` field.
Ensure that the addition of `num_lineage_skipped_due_to_filters` to the `SqlAggregatorReport` dataclass is correctly utilized and doesn't introduce issues.
Verification successful
Verified the usage of `num_lineage_skipped_due_to_filters` field.
The `num_lineage_skipped_due_to_filters` field is correctly utilized for tracking skipped lineage due to filters.
- Field declaration: Line 222
- Field usage: Incremented within the file
Scripts executed
The following scripts were executed for the analysis:
Script:
    #!/bin/bash
    # Description: Verify the usage of `num_lineage_skipped_due_to_filters` field in the file.
    # Test: Search for the usage of `num_lineage_skipped_due_to_filters`. Expect: Appropriate usage for tracking skipped lineage.
    rg --type python 'num_lineage_skipped_due_to_filters' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 153
Script:
    #!/bin/bash
    # Description: Verify the usage of `num_lineage_skipped_due_to_filters` field in the file.
    # Test: Search for the usage of `num_lineage_skipped_due_to_filters`. Expect: Appropriate usage for tracking skipped lineage.
    rg 'num_lineage_skipped_due_to_filters' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 220
257-259: Verify the usage of `generate_query_usage_statistics` parameter.
Ensure that the addition of `generate_query_usage_statistics` to the `SqlParsingAggregator` class constructor is correctly utilized and doesn't introduce issues.
Verification successful
The `generate_query_usage_statistics` parameter is correctly utilized.
- The parameter is declared, assigned to an instance variable, and used in logical conditions within the `SqlParsingAggregator` class to control the generation of query usage statistics.
Scripts executed
The following scripts were executed for the analysis:
Script:
    #!/bin/bash
    # Description: Verify the usage of `generate_query_usage_statistics` parameter in the file.
    # Test: Search for the usage of `generate_query_usage_statistics`. Expect: Appropriate usage for controlling query usage statistics generation.
    rg --type python 'generate_query_usage_statistics' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 150
Script:
    #!/bin/bash
    # Description: Verify the usage of `generate_query_usage_statistics` parameter in the file.
    # Test: Search for the usage of `generate_query_usage_statistics`. Expect: Appropriate usage for controlling query usage statistics generation.
    rg -t py 'generate_query_usage_statistics' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 459
230-230: Verify the usage of `num_query_usage_stats_generated` field.
Ensure that the addition of `num_query_usage_stats_generated` to the `SqlAggregatorReport` dataclass is correctly utilized and doesn't introduce issues.
Verification successful
The `num_query_usage_stats_generated` field is correctly utilized.
The field `num_query_usage_stats_generated` is used to track the generation of query usage statistics and is incremented where those statistics are produced.
- Declaration and initialization: `num_query_usage_stats_generated: int = 0`
- Increment operation: `self.report.num_query_usage_stats_generated += 1`
Scripts executed
The following scripts were executed for the analysis:
Script:
    #!/bin/bash
    # Description: Verify the usage of `num_query_usage_stats_generated` field in the file.
    # Test: Search for the usage of `num_query_usage_stats_generated`. Expect: Appropriate usage for tracking generated query usage statistics.
    rg --type python 'num_query_usage_stats_generated' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 150
Script:
    #!/bin/bash
    # Description: Verify the usage of `num_query_usage_stats_generated` field in the file.
    # Test: Search for the usage of `num_query_usage_stats_generated`. Expect: Appropriate usage for tracking generated query usage statistics.
    rg --type py 'num_query_usage_stats_generated' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 225
18-18: Verify the usage of `get_time_bucket`.
Ensure that the usage of `get_time_bucket` aligns with the intended functionality and handles time-based operations correctly.
Verification successful
The usage of `get_time_bucket` aligns with the intended functionality.
The function is correctly used to convert timestamps into bucketed time formats based on the specified bucket duration.
- Location: `metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py`
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of `get_time_bucket` in the file.
# Test: Search for the usage of `get_time_bucket`. Expect: Appropriate usage for time-based operations.
rg --type python 'get_time_bucket' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 134
Script:
#!/bin/bash
# Description: Verify the usage of `get_time_bucket` in the file.
# Test: Search for the usage of `get_time_bucket`. Expect: Appropriate usage for time-based operations.
rg --type py 'get_time_bucket' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 206
Script:
#!/bin/bash
# Description: Retrieve more lines of context around the usage of `get_time_bucket` to ensure it aligns with the intended functionality.
# Test: Display lines around the usage of `get_time_bucket`.
rg --type py -A 5 -B 5 'get_time_bucket' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 1171
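As an illustration of what converting timestamps into bucketed time can look like, here is a minimal sketch. It is not the `get_time_bucket` implementation from the PR: the `BucketDuration` enum and the `bucket_start` helper are hypothetical names introduced for this example, and only hour and day buckets are assumed.

```python
from datetime import datetime, timezone
from enum import Enum


class BucketDuration(str, Enum):
    """Hypothetical stand-in for a bucket-duration setting."""

    DAY = "DAY"
    HOUR = "HOUR"


def bucket_start(ts: datetime, duration: BucketDuration) -> datetime:
    """Truncate a timestamp to the start of its hour or day bucket."""
    if duration is BucketDuration.HOUR:
        return ts.replace(minute=0, second=0, microsecond=0)
    # Any other duration is treated as a day bucket in this sketch.
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


ts = datetime(2024, 7, 4, 13, 45, 30, tzinfo=timezone.utc)
hourly = bucket_start(ts, BucketDuration.HOUR)
daily = bucket_start(ts, BucketDuration.DAY)
```

Truncating to the bucket start means every query timestamp inside the same hour or day maps to the same key, which is what makes per-bucket usage counts possible.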
273-275: Verify the usage of `self.generate_query_usage_statistics`.
Ensure that the assignment of `generate_query_usage_statistics` to `self.generate_query_usage_statistics` is correctly utilized in the class methods.
Verification successful
The assignment and usage of `self.generate_query_usage_statistics` are correctly utilized in the class methods.
- The variable is assigned appropriately and used in multiple conditional checks to control the generation of query usage statistics.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of `self.generate_query_usage_statistics` in the file.
# Test: Search for the usage of `self.generate_query_usage_statistics`. Expect: Appropriate usage for controlling query usage statistics generation.
rg --type python 'self.generate_query_usage_statistics' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 155
Script:
#!/bin/bash
# Description: Verify the usage of `self.generate_query_usage_statistics` in the file.
# Test: Search for the usage of `self.generate_query_usage_statistics`. Expect: Appropriate usage for controlling query usage statistics generation.
rg 'self.generate_query_usage_statistics' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 403
262-263: Verify the usage of `is_temp_table` and `is_allowed_table` parameters.
Ensure that the addition of `is_temp_table` and `is_allowed_table` parameters to the `SqlParsingAggregator` class constructor is correctly utilized and doesn't introduce issues.
Verification successful
Verified the usage of `is_temp_table` and `is_allowed_table` parameters.
The parameters `is_temp_table` and `is_allowed_table` have been correctly integrated into the `SqlParsingAggregator` class. They are assigned to instance variables and utilized in methods to control table filtering.
- The parameters are assigned to `self._is_temp_table` and `self._is_allowed_table`.
- Methods `is_temp_table` and `is_allowed_table` use these instance variables to perform their logic.
- These methods are used in various parts of the class to appropriately filter tables.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of `is_temp_table` and `is_allowed_table` parameters in the file.
# Test: Search for the usage of `is_temp_table` and `is_allowed_table`. Expect: Appropriate usage for controlling table filtering.
rg --type python 'is_temp_table|is_allowed_table' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 151
Script:
#!/bin/bash
# Description: Verify the usage of `is_temp_table` and `is_allowed_table` parameters in the file.
# Test: Search for the usage of `is_temp_table` and `is_allowed_table`. Expect: Appropriate usage for controlling table filtering.
rg 'is_temp_table|is_allowed_table' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 964
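The pattern described above (optional predicates passed to the constructor, stored on private attributes, and wrapped by public methods) can be sketched roughly as follows. The `TableFilter` class is a hypothetical stand-in, not the real `SqlParsingAggregator`, and the fallback behaviour when no predicate is supplied (nothing is temporary, everything is allowed) is an assumption of this sketch.

```python
from typing import Callable, Optional


class TableFilter:
    """Hypothetical holder for optional table predicates."""

    def __init__(
        self,
        is_temp_table: Optional[Callable[[str], bool]] = None,
        is_allowed_table: Optional[Callable[[str], bool]] = None,
    ) -> None:
        self._is_temp_table = is_temp_table
        self._is_allowed_table = is_allowed_table

    def is_temp_table(self, name: str) -> bool:
        # Assumed default: without a predicate, no table is temporary.
        if self._is_temp_table is None:
            return False
        return self._is_temp_table(name)

    def is_allowed_table(self, name: str) -> bool:
        # Assumed default: without a predicate, every table is allowed.
        if self._is_allowed_table is None:
            return True
        return self._is_allowed_table(name)


table_filter = TableFilter(
    is_temp_table=lambda name: name.startswith("tmp_"),
    is_allowed_table=lambda name: not name.endswith("_private"),
)
```

Keeping the predicates optional lets callers that do not care about filtering construct the object without extra arguments.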
381-389: Verify the usage of `query_usage_counts` field.
Ensure that the addition of `query_usage_counts` to the `SqlParsingAggregator` class is correctly utilized and doesn't introduce issues.
Verification successful
The `query_usage_counts` field is correctly utilized.
The `query_usage_counts` field added to the `SqlParsingAggregator` class is appropriately used for tracking query usage statistics.
- Lines involving `query_usage_counts` include initialization, checks for its presence, and usage in various operations.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of `query_usage_counts` field in the file.
# Test: Search for the usage of `query_usage_counts`. Expect: Appropriate usage for tracking query usage statistics.
rg --type python 'query_usage_counts' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 137
Script:
#!/bin/bash
# Description: Verify the usage of `query_usage_counts` field in the file.
# Test: Search for the usage of `query_usage_counts`. Expect: Appropriate usage for tracking query usage statistics.
rg 'query_usage_counts' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 685
281-283: Verify the usage of the `usage_config` check.
Ensure that the check for `usage_config` when `generate_usage_statistics` or `generate_query_usage_statistics` is enabled is correctly utilized and doesn't introduce issues.
Verification successful
The usage of the `usage_config` check is correctly utilized.
The code contains several assertions to ensure that `usage_config` is not `None` when necessary, indicating that the check is correctly handled and does not introduce issues.
- Line defining `usage_config` as `Optional[BaseUsageConfig] = None`
- Line assigning `usage_config` to `self.usage_config`
- Line checking if `self.usage_config` is `None`
- Several assertions `assert self.usage_config is not None`
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of the `usage_config` check in the file.
# Test: Search for the usage of the `usage_config` check. Expect: Appropriate usage for ensuring `usage_config` is provided.
rg --type python 'usage_config' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 131
Script:
#!/bin/bash
# Description: Verify the usage of the `usage_config` check in the file.
# Test: Search for the usage of the `usage_config` check. Expect: Appropriate usage for ensuring `usage_config` is provided.
rg --type py 'usage_config' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 662
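A constructor-time check of this kind might look like the sketch below. The `UsageConfig` dataclass and `Aggregator` class are hypothetical stand-ins for `BaseUsageConfig` and `SqlParsingAggregator`, and raising `ValueError` is an assumption; the review only states that the code checks whether `self.usage_config` is `None` and asserts on it later.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UsageConfig:
    """Hypothetical stand-in for a usage configuration object."""

    top_n_queries: int = 10


class Aggregator:
    """Hypothetical aggregator that validates its usage settings."""

    def __init__(
        self,
        generate_usage_statistics: bool = False,
        generate_query_usage_statistics: bool = False,
        usage_config: Optional[UsageConfig] = None,
    ) -> None:
        self.generate_usage_statistics = generate_usage_statistics
        self.generate_query_usage_statistics = generate_query_usage_statistics
        self.usage_config = usage_config

        # Fail early: either usage feature needs a config to work with.
        needs_config = generate_usage_statistics or generate_query_usage_statistics
        if needs_config and self.usage_config is None:
            raise ValueError("usage_config is required when usage statistics are enabled")
```

Validating in the constructor surfaces a misconfiguration at start-up rather than partway through a run.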
4-4: Verify the usage of `functools`.
Ensure that the usage of `functools` doesn't introduce memory leaks or other issues as warned by static analysis tools.
Verification successful
Verified the appropriate usage of `functools`.
The `functools.lru_cache` decorator is correctly applied to the `_name_from_urn` method, providing caching benefits without introducing memory leaks or other issues.
- `metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py`:
  - `import functools`
  - `@functools.lru_cache(maxsize=128)`
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of `functools` in the file.
# Test: Search for the usage of `functools`. Expect: Appropriate usage without memory leaks.
rg --type python 'functools' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 128
Script:
#!/bin/bash
# Description: Verify the usage of `functools` in the file.
# Test: Search for the usage of `functools`. Expect: Appropriate usage without memory leaks.
rg 'functools' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 138
Script:
#!/bin/bash
# Description: Gather context around the usage of `functools.lru_cache` in the file.
# Test: Search for the function where `functools.lru_cache` is applied and get surrounding lines to understand the context.
rg -A 10 '@functools.lru_cache' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 555
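For readers unfamiliar with the decorator, the following sketch shows `functools.lru_cache(maxsize=128)` on a small URN-parsing helper. The `name_from_urn` function and its parsing logic are hypothetical and only loosely modelled on the `_name_from_urn` method mentioned above. It is written as a module-level function here because caching a bound method also keys on `self`, which is what the static-analysis warning is about.

```python
import functools


@functools.lru_cache(maxsize=128)
def name_from_urn(urn: str) -> str:
    """Extract the table name from a dataset URN (simplified, hypothetical parsing)."""
    # Strip the "urn:li:dataset:(" prefix and the trailing ")".
    inner = urn[len("urn:li:dataset:("):-1]
    _platform, name, _env = inner.split(",")
    return name


urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.test_schema.table_2,PROD)"
first = name_from_urn(urn)   # computed and stored in the cache
second = name_from_urn(urn)  # served from the cache
info = name_from_urn.cache_info()
```

The bounded `maxsize` keeps at most 128 entries, evicting the least recently used one when the cache is full.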
148-169: Verify the usage of `PreparsedQuery` dataclass.
Ensure that the `PreparsedQuery` dataclass is correctly utilized and its fields align with the intended functionality.
Verification successful
Verified the usage of `PreparsedQuery` dataclass.
The `PreparsedQuery` dataclass is correctly utilized in the codebase. The fields and usage align with the intended functionality of representing pre-parsed queries.
- Defined as a dataclass.
- Used in type annotations.
- Instantiated in the method `add_preparsed_query`.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of `PreparsedQuery` dataclass in the file.
# Test: Search for the usage of `PreparsedQuery`. Expect: Appropriate usage for representing pre-parsed queries.
rg --type python 'PreparsedQuery' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 133
Script:
#!/bin/bash
# Description: Verify the usage of `PreparsedQuery` dataclass in the file.
# Test: Search for the usage of `PreparsedQuery`. Expect: Appropriate usage for representing pre-parsed queries.
rg 'PreparsedQuery' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 303
Script:
#!/bin/bash
# Description: Gather surrounding context of `PreparsedQuery` usages in the file.
# Test: Search for the surrounding context of `PreparsedQuery` usages.
rg 'PreparsedQuery' -A 5 -B 5 metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 1751
141-146: Verify the usage of `KnownLineageMapping` dataclass.
Ensure that the `KnownLineageMapping` dataclass is correctly utilized and its fields align with the intended functionality.
Verification successful
The `KnownLineageMapping` dataclass is correctly utilized.
The `KnownLineageMapping` dataclass is used appropriately in type hints and type checks to handle lineage information, ensuring it aligns with the intended functionality.
- Definition of `KnownLineageMapping`: lines 141-146
- Usage in `add` method: lines following the definition
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of `KnownLineageMapping` dataclass in the file.
# Test: Search for the usage of `KnownLineageMapping`. Expect: Appropriate usage for lineage mappings.
rg --type python 'KnownLineageMapping' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 138
Script:
#!/bin/bash
# Description: Verify the usage of `KnownLineageMapping` dataclass in the file.
# Test: Search for the usage of `KnownLineageMapping`. Expect: Appropriate usage for lineage mappings.
rg 'KnownLineageMapping' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 258
Script:
#!/bin/bash
# Description: Inspect the context of `KnownLineageMapping` usages in the file.
# Test: Extract the context lines around the usages of `KnownLineageMapping` to verify alignment with the expected functionality.
rg 'KnownLineageMapping' -A 5 -B 5 metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 1015
226-226: Verify the usage of `num_queries_skipped_due_to_filters` field.
Ensure that the addition of `num_queries_skipped_due_to_filters` to the `SqlAggregatorReport` dataclass is correctly utilized and doesn't introduce issues.
Verification successful
Verified the usage of `num_queries_skipped_due_to_filters` field.
The addition of `num_queries_skipped_due_to_filters` to the `SqlAggregatorReport` dataclass is correctly utilized to track the number of queries skipped due to filters.
- The field is incremented appropriately within the `_gen_query` method when a query doesn't involve any allowed tables.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of `num_queries_skipped_due_to_filters` field in the file.
# Test: Search for the usage of `num_queries_skipped_due_to_filters`. Expect: Appropriate usage for tracking skipped queries.
rg --type python 'num_queries_skipped_due_to_filters' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 153
Script:
#!/bin/bash
# Description: Verify the usage of `num_queries_skipped_due_to_filters` field in the file.
# Test: Search for the usage of `num_queries_skipped_due_to_filters`. Expect: Appropriate usage for tracking skipped queries.
rg 'num_queries_skipped_due_to_filters' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 220
Script:
#!/bin/bash
# Description: Verify the context of the usage of `num_queries_skipped_due_to_filters` field in the file.
# Test: Search for the context where `self.report.num_queries_skipped_due_to_filters += 1` is used.
rg 'self\.report\.num_queries_skipped_due_to_filters \+= 1' -A 5 metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Length of output: 355
Script:
#!/bin/bash
# Description: Verify the context of the usage of `num_queries_skipped_due_to_filters` field in the file.
# Test: Search for the full context of the method or function where `self.report.num_queries_skipped_due_to_filters += 1` occurs.
rg 'def ' -A 20 metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py | rg -B 20 'self\.report\.num_queries_skipped_due_to_filters \+= 1'
Length of output: 987
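The skip-and-count behaviour described above can be sketched as follows. This is an assumption-laden illustration rather than the `_gen_query` code: the `Report` dataclass and `keep_queries` helper are hypothetical, and each query is reduced to the list of table names it touches.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Report:
    """Hypothetical report with a single skip counter."""

    num_queries_skipped_due_to_filters: int = 0


def keep_queries(
    queries: Iterable[List[str]],
    is_allowed_table: Callable[[str], bool],
    report: Report,
) -> List[List[str]]:
    """Keep queries that touch at least one allowed table; count the rest."""
    kept: List[List[str]] = []
    for tables in queries:
        if not any(is_allowed_table(table) for table in tables):
            report.num_queries_skipped_due_to_filters += 1
            continue
        kept.append(tables)
    return kept


report = Report()
kept = keep_queries(
    [["db.public.orders"], ["db.scratch.tmp"], ["db.scratch.tmp", "db.public.users"]],
    is_allowed_table=lambda name: ".public." in name,
    report=report,
)
```

Recording the skip in the report, rather than dropping the query silently, makes the effect of the filters visible after a run.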
metadata-ingestion/tests/integration/snowflake/snowflake_privatelink_golden.json (2)
3895-3959: LGTM! Schema fields are correctly added to the `querySubjects` aspect.
The schema fields are correctly specified with appropriate field paths, types, and other properties.
4174-4238: LGTM! Schema fields are correctly added to the `querySubjects` aspect.
The schema fields are correctly specified with appropriate field paths, types, and other properties.
metadata-ingestion/tests/integration/snowflake/snowflake_golden.json (13)
Line range hint 1-1: Approved: Addition of new dataset entity.
The addition of the new dataset entity `urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.test_schema.table_2,PROD)` is consistent with the PR summary.
4524-4566: Approved: Addition of multiple dataset and schemaField entities.
The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_1` is consistent with the PR summary.
5138-5174: Approved: Addition of multiple schemaField entities.
The addition of multiple schemaField entities related to `test_db.test_schema.table_2` is consistent with the PR summary.
5202-5243: Approved: Addition of multiple dataset and schemaField entities.
The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_10` is consistent with the PR summary.
5755-5796: Approved: Addition of multiple dataset and schemaField entities.
The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_4` is consistent with the PR summary.
5988-6029: Approved: Addition of multiple dataset and schemaField entities.
The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_5` is consistent with the PR summary.
6421-6462: Approved: Addition of multiple dataset and schemaField entities.
The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_3` is consistent with the PR summary.
6860-6901: Approved: Addition of multiple dataset and schemaField entities.
The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_6` is consistent with the PR summary.
7043-7084: Approved: Addition of multiple dataset and schemaField entities.
The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_8` is consistent with the PR summary.
7096-7167: Approved: Addition of multiple dataset and schemaField entities.
The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_1` and `test_db.test_schema.view_1` is consistent with the PR summary.
7268-7309: Approved: Addition of multiple dataset and schemaField entities.
The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_9` is consistent with the PR summary.
7435-7476: Approved: Addition of multiple dataset and schemaField entities.
The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_7` is consistent with the PR summary.
7488-7559: Approved: Addition of multiple dataset and schemaField entities.
The addition of multiple dataset and schemaField entities related to `test_db.test_schema.table_2` and `test_db.test_schema.view_2` is consistent with the PR summary.
| self.aggregator._is_temp_table = ( | ||
| lambda name: DatasetUrn.create_from_ids( | ||
| self.platform, | ||
| name, | ||
| env=self.config.env, | ||
| platform_instance=self.config.platform_instance, | ||
| ).urn() | ||
| not in self.known_urns | ||
| ) |
Improve readability by using a named function for the lambda.
Using a named function improves readability and maintainability.
- self.aggregator._is_temp_table = (
-     lambda name: DatasetUrn.create_from_ids(
-         self.platform,
-         name,
-         env=self.config.env,
-         platform_instance=self.config.platform_instance,
-     ).urn()
-     not in self.known_urns
- )
+ def is_temp_table(name: str) -> bool:
+     return DatasetUrn.create_from_ids(
+         self.platform,
+         name,
+         env=self.config.env,
+         platform_instance=self.config.platform_instance,
+     ).urn() not in self.known_urns
+ self.aggregator._is_temp_table = is_temp_table
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- metadata-ingestion/tests/unit/stateful_ingestion/state/test_redundant_run_skip_handler.py (3 hunks)
Additional comments not posted (2)
metadata-ingestion/tests/unit/stateful_ingestion/state/test_redundant_run_skip_handler.py (2)
2-2: Correctly updated return type to `Iterable[SnowflakeV2Source]`
The return type of the `stateful_source` fixture has been correctly updated to `Iterable[SnowflakeV2Source]`, aligning with the changes in the source code.
47-49: Proper use of context manager for `SnowflakeV2Source`
The addition of a context manager to manage the `SnowflakeV2Source` instance ensures proper resource management and cleanup.
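The fixture shape described here (a generator that yields a source from inside a `with` block) can be sketched without pytest as below. `FakeSource` is a hypothetical stand-in for `SnowflakeV2Source`; in a real test module the generator would presumably carry a `@pytest.fixture` decorator, which is omitted so the sketch runs on its own.

```python
from typing import Iterable


class FakeSource:
    """Hypothetical source that tracks whether it was closed."""

    def __init__(self) -> None:
        self.closed = False

    def __enter__(self) -> "FakeSource":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()

    def close(self) -> None:
        self.closed = True


def stateful_source() -> Iterable[FakeSource]:
    # Yielding inside the `with` block keeps the source open for the test
    # body and closes it once the generator is resumed afterwards.
    with FakeSource() as source:
        yield source


gen = iter(stateful_source())
source = next(gen)           # what the test body would receive
open_during_test = not source.closed
next(gen, None)              # the teardown step: resumes and exits the `with`
closed_after_test = source.closed
```

Because the function yields rather than returns, its annotation is an iterable of the source type, which matches the return-type change noted in the review.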
Actionable comments posted: 2
Outside diff range, codebase verification and nitpick comments (2)
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (2)
71-77: Local temporary path configuration for audit logs.
The `local_temp_path` field is configured to store the audit log locally. The TODO comments suggest improvements for caching and local testing.
Ensure the improvements for caching and local testing are tracked and implemented.
402-502: Build enriched audit log query.
The `_build_enriched_audit_log_query` function correctly builds the SQL query to fetch enriched audit logs from Snowflake. The TODO comments suggest improvements for generating better query fingerprints, adding table filter clauses, and dropping unnecessary columns.
Ensure the improvements suggested in the TODO comments are tracked and implemented.
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (3)
- docs-website/vercel-setup.sh (2 hunks)
- metadata-ingestion/scripts/install_deps.sh (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (1 hunks)
Files skipped from review due to trivial changes (1)
- docs-website/vercel-setup.sh
Additional comments not posted (17)
metadata-ingestion/scripts/install_deps.sh (1)
21-22: Addition of `krb5-devel` dependency for `yum` systems.
The addition of `krb5-devel` is correctly placed under the `yum` package manager section. This ensures that Kerberos development libraries are available for systems using `yum`.
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (16)
92-94: Snowflake connection configuration.
The `connection` field is correctly defined to configure the Snowflake connection.
96-99: Snowflake Queries Extractor Report fields.
The fields for the time window and SQL aggregator report are correctly defined.
103-106: Snowflake Queries Source Report field.
The field for the queries extractor report is correctly defined.
108-120: Initialization of SnowflakeQueriesExtractor.
The constructor initializes the connection, configuration, reports, and SQL aggregator.
155-165: Local temporary path for audit logs.
The `local_temp_path` method ensures a temporary directory is created for storing audit logs. It logs the path being used.
166-170: Check for temporary tables.
The `is_temp_table` method checks if a table name matches any of the temporary table patterns.
172-174: Check for allowed tables.
The `is_allowed_table` method checks if a table name is allowed based on dataset patterns.
175-203: Generate work units from queries.
The `get_workunits_internal` method generates work units from the queries. It handles the audit log caching and iterates through the queries to add them to the SQL aggregator.
204-257: Fetch audit logs from Snowflake.
The `fetch_audit_log` method fetches audit logs from Snowflake. It includes TODO comments for fetching additional information and handling errors.
259-262: Generate dataset identifier from qualified name.
The `get_dataset_identifier_from_qualified_name` method generates a dataset identifier from a qualified name.
263-365: Parse audit log row.
The `_parse_audit_log_row` method parses a row from the audit log and generates a `PreparsedQuery` object. It includes TODO comments for filtering table names and mapping email addresses.
368-373: Initialization of SnowflakeQueriesSource.
The constructor initializes the context, configuration, reports, and queries extractor.
385-388: Create SnowflakeQueriesSource from config.
The `create` method creates a `SnowflakeQueriesSource` instance from a configuration dictionary and pipeline context.
390-392: Generate work units from queries.
The `get_workunits_internal` method generates work units from the queries using the queries extractor.
394-395: Get report for SnowflakeQueriesSource.
The `get_report` method returns the report for the `SnowflakeQueriesSource`.
504-515: Snowflake query type mappings.
The `SNOWFLAKE_QUERY_TYPE_MAPPING` constant correctly maps Snowflake query types to internal query types.
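A mapping constant of this kind is typically a plain dictionary from the warehouse's query-type strings to an internal enum. The sketch below is purely illustrative: the `QueryType` enum, the `QUERY_TYPE_MAPPING` entries, and the `map_query_type` helper are hypothetical and do not reproduce the contents of `SNOWFLAKE_QUERY_TYPE_MAPPING`.

```python
from enum import Enum
from typing import Dict


class QueryType(Enum):
    """Hypothetical internal query types."""

    INSERT = "INSERT"
    UPDATE = "UPDATE"
    DELETE = "DELETE"
    SELECT = "SELECT"
    UNKNOWN = "UNKNOWN"


# Hypothetical entries; the real constant's keys and values are not shown in the review.
QUERY_TYPE_MAPPING: Dict[str, QueryType] = {
    "INSERT": QueryType.INSERT,
    "UPDATE": QueryType.UPDATE,
    "DELETE": QueryType.DELETE,
    "SELECT": QueryType.SELECT,
}


def map_query_type(raw: str) -> QueryType:
    """Look up a raw query-type string, falling back to UNKNOWN."""
    return QUERY_TYPE_MAPPING.get(raw.upper(), QueryType.UNKNOWN)
```

Using `dict.get` with a fallback keeps unexpected query types from raising while still making them distinguishable downstream.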
| class SnowflakeQueriesExtractorConfig(SnowflakeIdentifierConfig, SnowflakeFilterConfig): | ||
| # TODO: Support stateful ingestion for the time windows. | ||
| window: BaseTimeWindowConfig = BaseTimeWindowConfig() |
Add support for stateful ingestion for the time windows.
The TODO comment indicates that support for stateful ingestion for the time windows is pending.
Do you want me to generate the implementation for stateful ingestion or open a GitHub issue to track this task?
| # TODO: make this a proper allow/deny pattern | ||
| deny_usernames: List[str] = [] | ||
|
|
Consider making this a proper allow/deny pattern.
The TODO comment suggests that the deny_usernames field should be converted to a proper allow/deny pattern.
Consider refactoring this field to support a proper allow/deny pattern.
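One way such a refactor could look is sketched below: a small allow/deny holder built on regular expressions, replacing a flat list of denied usernames. The `AllowDeny` dataclass is a standalone, hypothetical stand-in written for this example and is not the project's own pattern class; the case-insensitive matching and the deny-wins rule are assumptions.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class AllowDeny:
    """Hypothetical allow/deny pattern: deny rules win over allow rules."""

    allow: List[str] = field(default_factory=lambda: [".*"])
    deny: List[str] = field(default_factory=list)

    def allowed(self, value: str) -> bool:
        # A value matching any deny pattern is rejected outright.
        if any(re.match(pattern, value, re.IGNORECASE) for pattern in self.deny):
            return False
        # Otherwise it must match at least one allow pattern.
        return any(re.match(pattern, value, re.IGNORECASE) for pattern in self.allow)


usernames = AllowDeny(deny=[r"svc_.*", r"etl_bot$"])
```

Compared with a plain list, patterns let a single rule cover a family of service accounts, and the `allow` side makes it possible to restrict ingestion to a known set of users.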
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (3)
- metadata-ingestion/setup.py (2 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema_gen.py (19 hunks)
- metadata-models/src/main/resources/entity-registry.yml (1 hunks)
Files skipped from review due to trivial changes (2)
- metadata-ingestion/setup.py
- metadata-models/src/main/resources/entity-registry.yml
Files skipped from review as they are similar to previous changes (1)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema_gen.py
* feat(forms) Handle deleting forms references when hard deleting forms (datahub-project#10820) * refactor(ui): Misc improvements to the setup ingestion flow (ingest uplift 1/2) (datahub-project#10764) Co-authored-by: John Joyce <[email protected]> Co-authored-by: John Joyce <[email protected]> * fix(ingestion/airflow-plugin): pipeline tasks discoverable in search (datahub-project#10819) * feat(ingest/transformer): tags to terms transformer (datahub-project#10758) Co-authored-by: Aseem Bansal <[email protected]> * fix(ingestion/unity-catalog): fixed issue with profiling with GE turned on (datahub-project#10752) Co-authored-by: Aseem Bansal <[email protected]> * feat(forms) Add java SDK for form entity PATCH + CRUD examples (datahub-project#10822) * feat(SDK) Add java SDK for structuredProperty entity PATCH + CRUD examples (datahub-project#10823) * feat(SDK) Add StructuredPropertyPatchBuilder in python sdk and provide sample CRUD files (datahub-project#10824) * feat(forms) Add CRUD endpoints to GraphQL for Form entities (datahub-project#10825) * add flag for includeSoftDeleted in scroll entities API (datahub-project#10831) * feat(deprecation) Return actor entity with deprecation aspect (datahub-project#10832) * feat(structuredProperties) Add CRUD graphql APIs for structured property entities (datahub-project#10826) * add scroll parameters to openapi v3 spec (datahub-project#10833) * fix(ingest): correct profile_day_of_week implementation (datahub-project#10818) * feat(ingest/glue): allow ingestion of empty databases from Glue (datahub-project#10666) Co-authored-by: Harshal Sheth <[email protected]> * feat(cli): add more details to get cli (datahub-project#10815) * fix(ingestion/glue): ensure date formatting works on all platforms for aws glue (datahub-project#10836) * fix(ingestion): fix datajob patcher (datahub-project#10827) * fix(smoke-test): add suffix in temp file creation (datahub-project#10841) * feat(ingest/glue): add helper method to permit user or group 
ownership (datahub-project#10784) * feat(): Show data platform instances in policy modal if they are set on the policy (datahub-project#10645) Co-authored-by: Hendrik Richert <[email protected]> * docs(patch): add patch documentation for how implementation works (datahub-project#10010) Co-authored-by: John Joyce <[email protected]> * fix(jar): add missing custom-plugin-jar task (datahub-project#10847) * fix(): also check exceptions/stack trace when filtering log messages (datahub-project#10391) Co-authored-by: John Joyce <[email protected]> * docs(): Update posts.md (datahub-project#9893) Co-authored-by: Hyejin Yoon <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * chore(ingest): update acryl-datahub-classify version (datahub-project#10844) * refactor(ingest): Refactor structured logging to support infos, warnings, and failures structured reporting to UI (datahub-project#10828) Co-authored-by: John Joyce <[email protected]> Co-authored-by: Harshal Sheth <[email protected]> * fix(restli): log aspect-not-found as a warning rather than as an error (datahub-project#10834) * fix(ingest/nifi): remove duplicate upstream jobs (datahub-project#10849) * fix(smoke-test): test access to create/revoke personal access tokens (datahub-project#10848) * fix(smoke-test): missing test for move domain (datahub-project#10837) * ci: update usernames to not considered for community (datahub-project#10851) * env: change defaults for data contract visibility (datahub-project#10854) * fix(ingest/tableau): quote special characters in external URL (datahub-project#10842) * fix(smoke-test): fix flakiness of auto complete test * ci(ingest): pin dask dependency for feast (datahub-project#10865) * fix(ingestion/lookml): liquid template resolution and view-to-view cll (datahub-project#10542) * feat(ingest/audit): add client id and version in system metadata props (datahub-project#10829) * chore(ingest): Mypy 1.10.1 pin 
(datahub-project#10867) * docs: use acryl-datahub-actions as expected python package to install (datahub-project#10852) * docs: add new js snippet (datahub-project#10846) * refactor(ingestion): remove company domain for security reason (datahub-project#10839) * fix(ingestion/spark): Platform instance and column level lineage fix (datahub-project#10843) Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * feat(ingestion/tableau): optionally ingest multiple sites and create site containers (datahub-project#10498) Co-authored-by: Yanik Häni <[email protected]> * fix(ingestion/looker): Add sqlglot dependency and remove unused sqlparser (datahub-project#10874) * fix(manage-tokens): fix manage access token policy (datahub-project#10853) * Batch get entity endpoints (datahub-project#10880) * feat(system): support conditional write semantics (datahub-project#10868) * fix(build): upgrade vercel builds to Node 20.x (datahub-project#10890) * feat(ingest/lookml): shallow clone repos (datahub-project#10888) * fix(ingest/looker): add missing dependency (datahub-project#10876) * fix(ingest): only populate audit stamps where accurate (datahub-project#10604) * fix(ingest/dbt): always encode tag urns (datahub-project#10799) * fix(ingest/redshift): handle multiline alter table commands (datahub-project#10727) * fix(ingestion/looker): column name missing in explore (datahub-project#10892) * fix(lineage) Fix lineage source/dest filtering with explored per hop limit (datahub-project#10879) * feat(conditional-writes): misc updates and fixes (datahub-project#10901) * feat(ci): update outdated action (datahub-project#10899) * feat(rest-emitter): adding async flag to rest emitter (datahub-project#10902) Co-authored-by: Gabe Lyons <[email protected]> * feat(ingest): add snowflake-queries source (datahub-project#10835) * fix(ingest): improve `auto_materialize_referenced_tags_terms` error handling (datahub-project#10906) * docs: add new company to adoption 
list (datahub-project#10909) * refactor(redshift): Improve redshift error handling with new structured reporting system (datahub-project#10870) Co-authored-by: John Joyce <[email protected]> Co-authored-by: Harshal Sheth <[email protected]> * feat(ui) Finalize support for all entity types on forms (datahub-project#10915) * Index ExecutionRequestResults status field (datahub-project#10811) * feat(ingest): grafana connector (datahub-project#10891) Co-authored-by: Shirshanka Das <[email protected]> Co-authored-by: Harshal Sheth <[email protected]> * fix(gms) Add Form entity type to EntityTypeMapper (datahub-project#10916) * feat(dataset): add support for external url in Dataset (datahub-project#10877) * docs(saas-overview) added missing features to observe section (datahub-project#10913) Co-authored-by: John Joyce <[email protected]> * fix(ingest/spark): Fixing Micrometer warning (datahub-project#10882) * fix(structured properties): allow application of structured properties without schema file (datahub-project#10918) * fix(data-contracts-web) handle other schedule types (datahub-project#10919) * fix(ingestion/tableau): human-readable message for PERMISSIONS_MODE_SWITCHED error (datahub-project#10866) Co-authored-by: Harshal Sheth <[email protected]> * Add feature flag for view defintions (datahub-project#10914) Co-authored-by: Ethan Cartwright <[email protected]> * feat(ingest/BigQuery): refactor+parallelize dataset metadata extraction (datahub-project#10884) * fix(airflow): add error handling around render_template() (datahub-project#10907) * feat(ingestion/sqlglot): add optional `default_dialect` parameter to sqlglot lineage (datahub-project#10830) * feat(mcp-mutator): new mcp mutator plugin (datahub-project#10904) * fix(ingest/bigquery): changes helper function to decode unicode scape sequences (datahub-project#10845) * feat(ingest/postgres): fetch table sizes for profile (datahub-project#10864) * feat(ingest/abs): Adding azure blob storage ingestion source 
(datahub-project#10813) * fix(ingest/redshift): reduce severity of SQL parsing issues (datahub-project#10924) * fix(build): fix lint fix web react (datahub-project#10896) * fix(ingest/bigquery): handle quota exceeded for project.list requests (datahub-project#10912) * feat(ingest): report extractor failures more loudly (datahub-project#10908) * feat(ingest/snowflake): integrate snowflake-queries into main source (datahub-project#10905) * fix(ingest): fix docs build (datahub-project#10926) * fix(ingest/snowflake): fix test connection (datahub-project#10927) * fix(ingest/lookml): add view load failures to cache (datahub-project#10923) * docs(slack) overhauled setup instructions and screenshots (datahub-project#10922) Co-authored-by: John Joyce <[email protected]> * fix(airflow): Add comma parsing of owners to DataJobs (datahub-project#10903) * fix(entityservice): fix merging sideeffects (datahub-project#10937) * feat(ingest): Support System Ingestion Sources, Show and hide system ingestion sources with Command-S (datahub-project#10938) Co-authored-by: John Joyce <[email protected]> * chore() Set a default lineage filtering end time on backend when a start time is present (datahub-project#10925) Co-authored-by: John Joyce <[email protected]> Co-authored-by: John Joyce <[email protected]> * Added relationships APIs to V3. Added these generic APIs to V3 swagger doc. 
(datahub-project#10939) * docs: add learning center to docs (datahub-project#10921) * doc: Update hubspot form id (datahub-project#10943) * chore(airflow): add python 3.11 w/ Airflow 2.9 to CI (datahub-project#10941) * fix(ingest/Glue): column upstream lineage between S3 and Glue (datahub-project#10895) * fix(ingest/abs): split abs utils into multiple files (datahub-project#10945) * doc(ingest/looker): fix doc for sql parsing documentation (datahub-project#10883) Co-authored-by: Harshal Sheth <[email protected]> * fix(ingest/bigquery): Adding missing BigQuery types (datahub-project#10950) * fix(ingest/setup): feast and abs source setup (datahub-project#10951) * fix(connections) Harden adding /gms to connections in backend (datahub-project#10942) * feat(siblings) Add flag to prevent combining siblings in the UI (datahub-project#10952) * fix(docs): make graphql doc gen more automated (datahub-project#10953) * feat(ingest/athena): Add option for Athena partitioned profiling (datahub-project#10723) * fix(spark-lineage): default timeout for future responses (datahub-project#10947) * feat(datajob/flow): add environment filter using info aspects (datahub-project#10814) * fix(ui/ingest): correct privilege used to show tab (datahub-project#10483) Co-authored-by: Kunal-kankriya <[email protected]> * feat(ingest/looker): include dashboard urns in browse v2 (datahub-project#10955) * add a structured type to batchGet in OpenAPI V3 spec (datahub-project#10956) * fix(ui): scroll on the domain sidebar to show all domains (datahub-project#10966) * fix(ingest/sagemaker): resolve incorrect variable assignment for SageMaker API call (datahub-project#10965) * fix(airflow/build): Pinning mypy (datahub-project#10972) * Fixed a bug where the OpenAPI V3 spec was incorrect. The bug was introduced in datahub-project#10939. 
(datahub-project#10974) * fix(ingest/test): Fix for mssql integration tests (datahub-project#10978) * fix(entity-service) exist check correctly extracts status (datahub-project#10973) * fix(structuredProps) casing bug in StructuredPropertiesValidator (datahub-project#10982) * bugfix: use anyOf instead of allOf when creating references in openapi v3 spec (datahub-project#10986) * fix(ui): Remove ant less imports (datahub-project#10988) * feat(ingest/graph): Add get_results_by_filter to DataHubGraph (datahub-project#10987) * feat(ingest/cli): init does not actually support environment variables (datahub-project#10989) * fix(ingest/graph): Update get_results_by_filter graphql query (datahub-project#10991) * feat(ingest/spark): Promote beta plugin (datahub-project#10881) Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * feat(ingest): support domains in meta -> "datahub" section (datahub-project#10967) * feat(ingest): add `check server-config` command (datahub-project#10990) * feat(cli): Make consistent use of DataHubGraphClientConfig (datahub-project#10466) Deprecates get_url_and_token() in favor of a more complete option: load_graph_config() that returns a full DatahubClientConfig. This change was then propagated across previous usages of get_url_and_token so that connections to DataHub server from the client respect the full breadth of configuration specified by DatahubClientConfig. I.e: You can now specify disable_ssl_verification: true in your ~/.datahubenv file so that all cli functions to the server work when ssl certification is disabled. 
Fixes datahub-project#9705 * fix(ingest/s3): Fixing container creation when there is no folder in path (datahub-project#10993) * fix(ingest/looker): support platform instance for dashboards & charts (datahub-project#10771) * feat(ingest/bigquery): improve handling of information schema in sql parser (datahub-project#10985) * feat(ingest): improve `ingest deploy` command (datahub-project#10944) * fix(backend): allow excluding soft-deleted entities in relationship-queries; exclude soft-deleted members of groups (datahub-project#10920) - allow excluding soft-deleted entities in relationship-queries - exclude soft-deleted members of groups * fix(ingest/looker): downgrade missing chart type log level (datahub-project#10996) * doc(acryl-cloud): release docs for 0.3.4.x (datahub-project#10984) Co-authored-by: John Joyce <[email protected]> Co-authored-by: RyanHolstien <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: Pedro Silva <[email protected]> * fix(protobuf/build): Fix protobuf check jar script (datahub-project#11006) * fix(ui/ingest): Support invalid cron jobs (datahub-project#10998) * fix(ingest): fix graph config loading (datahub-project#11002) Co-authored-by: Pedro Silva <[email protected]> * feat(docs): Document __DATAHUB_TO_FILE_ directive (datahub-project#10968) Co-authored-by: Harshal Sheth <[email protected]> * fix(graphql/upsertIngestionSource): Validate cron schedule; parse error in CLI (datahub-project#11011) * feat(ece): support custom ownership type urns in ECE generation (datahub-project#10999) * feat(assertion-v2): changed Validation tab to Quality and created new Governance tab (datahub-project#10935) * fix(ingestion/glue): Add support for missing config options for profiling in Glue (datahub-project#10858) * feat(propagation): Add models for schema field docs, tags, terms (datahub-project#2959) (datahub-project#11016) Co-authored-by: Chris Collins <[email protected]> * docs: 
standardize terminology to DataHub Cloud (datahub-project#11003) * fix(ingestion/transformer): replace the externalUrl container (datahub-project#11013) * docs(slack) troubleshoot docs (datahub-project#11014) * feat(propagation): Add graphql API (datahub-project#11030) Co-authored-by: Chris Collins <[email protected]> * feat(propagation): Add models for Action feature settings (datahub-project#11029) * docs(custom properties): Remove duplicate from sidebar (datahub-project#11033) * feat(models): Introducing Dataset Partitions Aspect (datahub-project#10997) Co-authored-by: John Joyce <[email protected]> Co-authored-by: John Joyce <[email protected]> * feat(propagation): Add Documentation Propagation Settings (datahub-project#11038) * fix(models): chart schema fields mapping, add dataHubAction entity, t… (datahub-project#11040) * fix(ci): smoke test lint failures (datahub-project#11044) * docs: fix learning center color scheme & typo (datahub-project#11043) * feat: add cloud main page (datahub-project#11017) Co-authored-by: Jay <[email protected]> * feat(restore-indices): add additional step to also clear system metadata service (datahub-project#10662) Co-authored-by: John Joyce <[email protected]> * docs: fix typo (datahub-project#11046) * fix(lint): apply spotless (datahub-project#11050) * docs(airflow): example query to get datajobs for a dataflow (datahub-project#11034) * feat(cli): Add run-id option to put sub-command (datahub-project#11023) Adds an option to assign run-id to a given put command execution. This is useful when transformers do not exist for a given ingestion payload, we can follow up with custom metadata and assign it to an ingestion pipeline. 
* fix(ingest): improve sql error reporting calls (datahub-project#11025) * fix(airflow): fix CI setup (datahub-project#11031) * feat(ingest/dbt): add experimental `prefer_sql_parser_lineage` flag (datahub-project#11039) * fix(ingestion/lookml): enable stack-trace in lookml logs (datahub-project#10971) * (chore): Linting fix (datahub-project#11015) * chore(ci): update deprecated github actions (datahub-project#10977) * Fix ALB configuration example (datahub-project#10981) * chore(ingestion-base): bump base image packages (datahub-project#11053) * feat(cli): Trim report of dataHubExecutionRequestResult to max GMS size (datahub-project#11051) * fix(ingestion/lookml): emit dummy sql condition for lookml custom condition tag (datahub-project#11008) Co-authored-by: Harshal Sheth <[email protected]> * fix(ingestion/powerbi): fix issue with broken report lineage (datahub-project#10910) * feat(ingest/tableau): add retry on timeout (datahub-project#10995) * change generate kafka connect properties from env (datahub-project#10545) Co-authored-by: david-leifker <[email protected]> * fix(ingest): fix oracle cronjob ingestion (datahub-project#11001) Co-authored-by: david-leifker <[email protected]> * chore(ci): revert update deprecated github actions (datahub-project#10977) (datahub-project#11062) * feat(ingest/dbt-cloud): update metadata_endpoint inference (datahub-project#11041) * build: Reduce size of datahub-frontend-react image by 50-ish% (datahub-project#10878) Co-authored-by: david-leifker <[email protected]> * fix(ci): Fix lint issue in datahub_ingestion_run_summary_provider.py (datahub-project#11063) * docs(ingest): update developing-a-transformer.md (datahub-project#11019) * feat(search-test): update search tests from datahub-project#10408 (datahub-project#11056) * feat(cli): add aspects parameter to DataHubGraph.get_entity_semityped (datahub-project#11009) Co-authored-by: Harshal Sheth <[email protected]> * docs(airflow): update min version for plugin v2 
(datahub-project#11065) * doc(ingestion/tableau): doc update for derived permission (datahub-project#11054) Co-authored-by: Pedro Silva <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: Harshal Sheth <[email protected]> * fix(py): remove dep on types-pkg_resources (datahub-project#11076) * feat(ingest/mode): add option to exclude restricted (datahub-project#11081) * fix(ingest): set lastObserved in sdk when unset (datahub-project#11071) * doc(ingest): Update capabilities (datahub-project#11072) * chore(vulnerability): Log Injection (datahub-project#11090) * chore(vulnerability): Information exposure through a stack trace (datahub-project#11091) * chore(vulnerability): Comparison of narrow type with wide type in loop condition (datahub-project#11089) * chore(vulnerability): Insertion of sensitive information into log files (datahub-project#11088) * chore(vulnerability): Risky Cryptographic Algorithm (datahub-project#11059) * chore(vulnerability): Overly permissive regex range (datahub-project#11061) Co-authored-by: Harshal Sheth <[email protected]> * fix: update customer data (datahub-project#11075) * fix(models): fixing the datasetPartition models (datahub-project#11085) Co-authored-by: John Joyce <[email protected]> * fix(ui): Adding view, forms GraphQL query, remove showing a fallback error message on unhandled GraphQL error (datahub-project#11084) Co-authored-by: John Joyce <[email protected]> * feat(docs-site): hiding learn more from cloud page (datahub-project#11097) * fix(docs): Add correct usage of orFilters in search API docs (datahub-project#11082) Co-authored-by: Jay <[email protected]> * fix(ingest/mode): Regexp in mode name matcher didn't allow underscore (datahub-project#11098) * docs: Refactor customer stories section (datahub-project#10869) Co-authored-by: Jeff Merrick <[email protected]> * fix(release): fix full/slim suffix on tag (datahub-project#11087) * feat(config): 
support alternate hashing algorithm for doc id (datahub-project#10423) Co-authored-by: david-leifker <[email protected]> Co-authored-by: John Joyce <[email protected]> * fix(emitter): fix typo in get method of java kafka emitter (datahub-project#11007) * fix(ingest): use correct native data type in all SQLAlchemy sources by compiling data type using dialect (datahub-project#10898) Co-authored-by: Harshal Sheth <[email protected]> * chore: Update contributors list in PR labeler (datahub-project#11105) * feat(ingest): tweak stale entity removal messaging (datahub-project#11064) * fix(ingestion): enforce lastObserved timestamps in SystemMetadata (datahub-project#11104) * fix(ingest/powerbi): fix broken lineage between chart and dataset (datahub-project#11080) * feat(ingest/lookml): CLL support for sql set in sql_table_name attribute of lookml view (datahub-project#11069) * docs: update graphql docs on forms & structured properties (datahub-project#11100) * test(search): search openAPI v3 test (datahub-project#11049) * fix(ingest/tableau): prevent empty site content urls (datahub-project#11057) Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * feat(entity-client): implement client batch interface (datahub-project#11106) * fix(snowflake): avoid reporting warnings/info for sys tables (datahub-project#11114) * fix(ingest): downgrade column type mapping warning to info (datahub-project#11115) * feat(api): add AuditStamp to the V3 API entity/aspect response (datahub-project#11118) * fix(ingest/redshift): replace r'\n' with '\n' to avoid token error redshift serverless… (datahub-project#11111) * fix(entiy-client): handle null entityUrn case for restli (datahub-project#11122) * fix(sql-parser): prevent bad urns from alter table lineage (datahub-project#11092) * fix(ingest/bigquery): use small batch size if use_tables_list_query_v2 is set (datahub-project#11121) * fix(graphql): add missing entities to EntityTypeMapper and 
EntityTypeUrnMapper (datahub-project#10366) * feat(ui): Changes to allow editable dataset name (datahub-project#10608) Co-authored-by: Jay Kadambi <[email protected]> * fix: remove saxo (datahub-project#11127) * feat(mcl-processor): Update mcl processor hooks (datahub-project#11134) * fix(openapi): fix openapi v2 endpoints & v3 documentation update * Revert "fix(openapi): fix openapi v2 endpoints & v3 documentation update" This reverts commit 573c1cb. * docs(policies): updates to policies documentation (datahub-project#11073) * fix(openapi): fix openapi v2 and v3 docs update (datahub-project#11139) * feat(auth): grant type and acr values custom oidc parameters support (datahub-project#11116) * fix(mutator): mutator hook fixes (datahub-project#11140) * feat(search): support sorting on multiple fields (datahub-project#10775) * feat(ingest): various logging improvements (datahub-project#11126) * fix(ingestion/lookml): fix for sql parsing error (datahub-project#11079) Co-authored-by: Harshal Sheth <[email protected]> * feat(docs-site) cloud page spacing and content polishes (datahub-project#11141) * feat(ui) Enable editing structured props on fields (datahub-project#11042) * feat(tests): add md5 and last computed to testResult model (datahub-project#11117) * test(openapi): openapi regression smoke tests (datahub-project#11143) * fix(airflow): fix tox tests + update docs (datahub-project#11125) * docs: add chime to adoption stories (datahub-project#11142) * fix(ingest/databricks): Updating code to work with Databricks sdk 0.30 (datahub-project#11158) * fix(kafka-setup): add missing script to image (datahub-project#11190) * fix(config): fix hash algo config (datahub-project#11191) * test(smoke-test): updates to smoke-tests (datahub-project#11152) * fix(elasticsearch): refactor idHashAlgo setting (datahub-project#11193) * chore(kafka): kafka version bump (datahub-project#11211) * readd UsageStatsWorkUnit * fix merge problems * change logo --------- Co-authored-by: Chris 
Collins <[email protected]> Co-authored-by: John Joyce <[email protected]> Co-authored-by: John Joyce <[email protected]> Co-authored-by: John Joyce <[email protected]> Co-authored-by: dushayntAW <[email protected]> Co-authored-by: sagar-salvi-apptware <[email protected]> Co-authored-by: Aseem Bansal <[email protected]> Co-authored-by: Kevin Chun <[email protected]> Co-authored-by: jordanjeremy <[email protected]> Co-authored-by: skrydal <[email protected]> Co-authored-by: Harshal Sheth <[email protected]> Co-authored-by: david-leifker <[email protected]> Co-authored-by: sid-acryl <[email protected]> Co-authored-by: Julien Jehannet <[email protected]> Co-authored-by: Hendrik Richert <[email protected]> Co-authored-by: Hendrik Richert <[email protected]> Co-authored-by: RyanHolstien <[email protected]> Co-authored-by: Felix Lüdin <[email protected]> Co-authored-by: Pirry <[email protected]> Co-authored-by: Hyejin Yoon <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: cburroughs <[email protected]> Co-authored-by: ksrinath <[email protected]> Co-authored-by: Mayuri Nehate <[email protected]> Co-authored-by: Kunal-kankriya <[email protected]> Co-authored-by: Shirshanka Das <[email protected]> Co-authored-by: ipolding-cais <[email protected]> Co-authored-by: Tamas Nemeth <[email protected]> Co-authored-by: Shubham Jagtap <[email protected]> Co-authored-by: haeniya <[email protected]> Co-authored-by: Yanik Häni <[email protected]> Co-authored-by: Gabe Lyons <[email protected]> Co-authored-by: Gabe Lyons <[email protected]> Co-authored-by: 808OVADOZE <[email protected]> Co-authored-by: noggi <[email protected]> Co-authored-by: Nicholas Pena <[email protected]> Co-authored-by: Jay <[email protected]> Co-authored-by: ethan-cartwright <[email protected]> Co-authored-by: Ethan Cartwright <[email protected]> Co-authored-by: Nadav Gross <[email protected]> Co-authored-by: Patrick Franco Braz <[email 
protected]> Co-authored-by: pie1nthesky <[email protected]> Co-authored-by: Joel Pinto Mata (KPN-DSH-DEX team) <[email protected]> Co-authored-by: Ellie O'Neil <[email protected]> Co-authored-by: Ajoy Majumdar <[email protected]> Co-authored-by: deepgarg-visa <[email protected]> Co-authored-by: Tristan Heisler <[email protected]> Co-authored-by: Andrew Sikowitz <[email protected]> Co-authored-by: Davi Arnaut <[email protected]> Co-authored-by: Pedro Silva <[email protected]> Co-authored-by: amit-apptware <[email protected]> Co-authored-by: Sam Black <[email protected]> Co-authored-by: Raj Tekal <[email protected]> Co-authored-by: Steffen Grohsschmiedt <[email protected]> Co-authored-by: jaegwon.seo <[email protected]> Co-authored-by: Renan F. Lima <[email protected]> Co-authored-by: Matt Exchange <[email protected]> Co-authored-by: Jonny Dixon <[email protected]> Co-authored-by: Pedro Silva <[email protected]> Co-authored-by: Pinaki Bhattacharjee <[email protected]> Co-authored-by: Jeff Merrick <[email protected]> Co-authored-by: skrydal <[email protected]> Co-authored-by: AndreasHegerNuritas <[email protected]> Co-authored-by: jayasimhankv <[email protected]> Co-authored-by: Jay Kadambi <[email protected]> Co-authored-by: David Leifker <[email protected]>
- [...] `snowflake.connector.SnowflakeConnection` with a new `SnowflakeConnection` type.
- [...] `SnowflakeConnection` around using composition, removing some mixin classes like `SnowflakeQueryMixin` and `SnowflakeConnectionMixin`.
- `self.query(...)` is now `self.connection.query(...)`. As part of this, the connection is initialized in the constructor instead of in the `get_workunits_internal` method.
- [...] `SnowflakeCommonMixin` by introducing the `SnowflakeFilterMixin` and `SnowflakeIdentifierMixin` instead. I'm not fully convinced this is the best design - something with composition like SnowsightUrlBuilder might actually be better, but would increase the size of the diff even more.

Follow up TODOs:
- [...] `include_view_lineage` flag.

Checklist
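The move from mixins to composition described in the refactoring notes above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `FakeNativeConnection`, `ConnectionWrapper`, and `LineageExtractor` are hypothetical names, and the fake connection stands in for the real driver so the example runs without Snowflake installed.

```python
from typing import Any, Iterable, List, Tuple


class FakeNativeConnection:
    """Hypothetical stand-in for a driver-level connection (not the Snowflake API)."""

    def __init__(self) -> None:
        self.executed: List[str] = []

    def run(self, sql: str) -> List[Tuple[Any, ...]]:
        # Record the statement and return a canned single-row result.
        self.executed.append(sql)
        return [(1,)]


class ConnectionWrapper:
    """Hypothetical wrapper type: the only place that talks to the driver."""

    def __init__(self, native: FakeNativeConnection) -> None:
        self._native = native

    def query(self, sql: str) -> List[Tuple[Any, ...]]:
        return self._native.run(sql)


class LineageExtractor:
    """Holds a connection (composition) instead of inheriting a query mixin."""

    def __init__(self, connection: ConnectionWrapper) -> None:
        # The connection is supplied at construction time rather than being
        # created inside get_workunits_internal.
        self.connection = connection

    def get_workunits_internal(self) -> Iterable[Tuple[Any, ...]]:
        # With a query mixin this call would have been self.query(...).
        yield from self.connection.query("SELECT 1")


native = FakeNativeConnection()
extractor = LineageExtractor(ConnectionWrapper(native))
rows = list(extractor.get_workunits_internal())
```

One consequence of this shape is that a test can pass in a fake connection directly, without patching mixin methods on the extractor class.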
Summary by CodeRabbit
New Features
- [...] `QueryUsageStatistics` to track dataset usage statistics.
Improvements
Bug Fixes
Tests
Chores