Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(ingest): replace sqllineage/sqlparse with our SQL parser #12020

Conversation

sagar-salvi-apptware
Copy link
Contributor

@sagar-salvi-apptware sagar-salvi-apptware commented Dec 3, 2024

  • We currently use sqllineage for mode/superset lineage, despite having a better SQL parser implemented.
  • We currently depend on sqllineage==1.3.8, which transitively depends on sqlparse==0.4.4
  • sqlparse==0.4.4 has a high sev vulnerability (see https://security.snyk.io/package/pip/sqlparse/0.4.4), and the first fixed version is 0.5.x
  • The first sqllineage version compatible with sqllparse 0.5.x is sqllineage 1.5.x. However, that depends on sqlalchemy>2, which many of our dependencies are incompatible with.
  • The easiest fix was to just move everything to use our sqlglot parser instead of upgrading our dep.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Dec 3, 2024
@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Dec 3, 2024
@hsheth2 hsheth2 requested a review from mayurinehate December 4, 2024 04:03
Copy link
Collaborator

@mayurinehate mayurinehate left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable.
Awaiting response on comments and few changes around reporting.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Dec 5, 2024
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Dec 5, 2024
Copy link
Collaborator

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will also need updates to gx-plugin

"column-timestamp",
]
assert sorted(SqlLineageSQLParser(sql_query).get_tables()) == expected_tables
assert sorted(SqlLineageSQLParser(sql_query).get_columns()) == expected_columns
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep these tests around, and make it test with the new parser

see

for an example

Copy link
Contributor Author

@sagar-salvi-apptware sagar-salvi-apptware Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @hsheth2,

I've updated the tests to align with SQLGlot, and I noticed some changes in the outputs. Could you please verify them?
CC. @mayurinehate

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Dec 5, 2024
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Dec 6, 2024
@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Dec 6, 2024
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Dec 6, 2024
metadata-ingestion-modules/gx-plugin/setup.py Outdated Show resolved Hide resolved
@@ -1,5 +0,0 @@
Note! The integration can use an SQL parser to try to parse the tables the chart depends on. This parsing is disabled by default,
but can be enabled by setting `parse_table_names_from_sql: true`. The default parser is based on the [`sqllineage`](https://pypi.org/project/sqllineage/) package.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we still require them to set parse_table_names_from_sql right?

let's update the docs here instead of removing them entirely

Copy link
Contributor Author

@sagar-salvi-apptware sagar-salvi-apptware Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we need parse_table_names_from_sql, Added again

try:
runner = LineageRunner(query)
sql_parser_in_tables = create_lineage_sql_parsed_result(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably be setting the default db / schema when calling create_lineage_sql_parsed_result

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @hsheth2 , We don’t need it here since we only require table names. In the previous implementation as well, we didn’t include the database/schema.

metadata-ingestion/tests/unit/utilities/test_utilities.py Outdated Show resolved Hide resolved
tables_list.sort()
assert tables_list == ["my_view.SQL_TABLE_NAME"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change in particular seems problematic - can you look at the git history to understand why this test was added, and what we need to do to actually maintain most of the previous functionality? we'll likely need to tweak one of our sources

Copy link
Contributor Author

@sagar-salvi-apptware sagar-salvi-apptware Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @hsheth2
It was added for looker token, but right now its not being used anymore
so technically this Test-Case is invalid
Let me your thought on this ?

CC. @mayurinehate

e95446b

ans.add(col_ref.column)
return list(ans)

def get_downstream_columns(self) -> List[str]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like the new get_downstream_columns is similar in functionality to the original get_columns? or am I mistaken on that?

overall I'm a bit confused why you added that extra method, when we previously didn't have anything like it?

Copy link
Contributor Author

@sagar-salvi-apptware sagar-salvi-apptware Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @hsheth2 ,

I added the get_downstream_columns method to verify what we’re getting in downstream columns since our new parser provides both upstream and downstream column lineage, which wasn’t possible with sqllineage.

Additionally, I noticed that the get_columns method from the sqllineage parser isn’t used anywhere in the source code—it’s only referenced in test cases.

I’ve now removed the get_downstream_columns method.

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Dec 6, 2024
@sagar-salvi-apptware sagar-salvi-apptware force-pushed the fix/ING-722-remove-sqlparse-linege-deps branch from 103eb28 to c1d7596 Compare December 9, 2024 10:10
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Dec 9, 2024
@datahub-cyborg datahub-cyborg bot added merge-pending-ci A PR that has passed review and should be merged once CI is green. and removed needs-review Label for PRs that need review from a maintainer. labels Dec 9, 2024
@hsheth2 hsheth2 changed the title fix: Remove sqlparse and sqllineage deps fix(ingest): replace sqllineage/sqlparse with our SQL parser Dec 10, 2024
@hsheth2 hsheth2 merged commit 57b12bd into datahub-project:master Dec 10, 2024
73 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ingestion PR or Issue related to the ingestion of metadata merge-pending-ci A PR that has passed review and should be merged once CI is green.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants