Failed to extract some records due to: source produced an invalid metadata work unit - DB2 & SQLalchemy #5465

Closed
QKotel opened this issue Jul 22, 2022 · 3 comments
Labels
bug Bug report ingestion PR or Issue related to the ingestion of metadata

Comments

@QKotel

QKotel commented Jul 22, 2022

Describe the bug
When ingesting DB2 metadata with the sqlalchemy source, only container and dataPlatformInstance are created.

This results in datasets without schemas in DataHub.

To Reproduce
Steps to reproduce the behavior:

  • Recipe config:
{
  "source": {
    "type": "sqlalchemy",
    "config": {
      "platform": "db2",
      "connect_uri": "db2+ibm_db:/{connectionstring}",
      "env": "enviroment",
    },
  },
  "sink": {
    "type": "datahub-rest",
    "config": {"server": "gms_uri"},
  },
}
  • CLI Version: > 0.8.38
  • Datahub Version: 0.8.38
  • SQLAlchemy dialect: ibm-db-sa

Expected behavior
Ingestion of datasets including schemas

Cause
Filepath: datahub/ingestion/source/sql/sql_common.py
Class: SQLAlchemySource
Method: get_table_properties
Line: 1009

    def get_table_properties(
        self, inspector: Inspector, schema: str, table: str
    ) -> Tuple[Optional[str], Optional[Dict[str, str]], Optional[str]]:
        try:
            location: Optional[str] = None
            # SQLALchemy stubs are incomplete and missing this method.
            # PR: https://github.com/dropbox/sqlalchemy-stubs/pull/223.
            table_info: dict = inspector.get_table_comment(table, schema)  # type: ignore
        except NotImplementedError:
            description: Optional[str] = None
            properties: Dict[str, str] = {}
        except ProgrammingError as pe:
            # Snowflake needs schema names quoted when fetching table comments.
            logger.debug(
                f"Encountered ProgrammingError. Retrying with quoted schema name for schema {schema} and table {table}",
                pe,
            )
            description = None
            properties = {}
            table_info: dict = inspector.get_table_comment(table, f'"{schema}"')  # type: ignore
        else:
            description = table_info["text"] #<== ## PROBLEM ##

            # The "properties" field is a non-standard addition to SQLAlchemy's interface.
            properties = table_info.get("properties", {})
        return description, properties, location

In my case, table_info["text"] at the marked line is the tuple (None,) rather than a string or None, which causes the MCP to no longer match the schema defaults.
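
The failure mode described above can be illustrated without a DB2 connection. The following is a minimal sketch, assuming a hypothetical stand-in inspector (`FakeDb2Inspector`, which is not part of SQLAlchemy or ibm-db-sa) whose `get_table_comment` returns the shape reported in this issue. It only shows that the value assigned to `description` is a tuple rather than the `Optional[str]` the method signature promises.

```python
from typing import Any, Dict, Optional


class FakeDb2Inspector:
    """Hypothetical stand-in that mimics the return shape reported in this issue."""

    def get_table_comment(
        self, table: str, schema: Optional[str] = None
    ) -> Dict[str, Any]:
        # Reported behaviour: "text" holds a one-element tuple instead of str/None.
        return {"text": (None,)}


inspector = FakeDb2Inspector()
table_info = inspector.get_table_comment("MY_TABLE", "MY_SCHEMA")
description = table_info["text"]

# get_table_properties declares Optional[str] for the description,
# so anything that is neither None nor a str violates that contract.
is_valid_description = description is None or isinstance(description, str)
print(type(description).__name__, is_valid_description)
```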

Solution

#...#
            table_info: dict = inspector.get_table_comment(table, f'"{schema}"')  # type: ignore
        else:
            description = table_info["text"][0]

#...#

I'm reporting this as a bug because I haven't tested how the change affects other sources.
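
Since indexing with `[0]` has not been tested against other sources, it is worth noting what it would do there: for a dialect that returns a plain non-empty string it would yield only the first character, and for one that returns `None` it would raise `TypeError`. A more defensive variant could normalize the value instead. The helper below is a sketch under that assumption; `normalize_table_comment` is a hypothetical name and is not part of DataHub.

```python
from typing import Any, Optional


def normalize_table_comment(text: Any) -> Optional[str]:
    """Return the table comment as Optional[str], whatever shape the dialect produced."""
    # Unwrap a row-like value such as (None,) or ("some comment",).
    if isinstance(text, (tuple, list)):
        text = text[0] if text else None
    if text is None:
        return None
    # Plain strings pass through unchanged; other scalars are stringified.
    return str(text)
```

Under this sketch, the marked line would become `description = normalize_table_comment(table_info["text"])`, which leaves string and `None` comments from other dialects untouched.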

@QKotel QKotel added the bug Bug report label Jul 22, 2022
@maggiehays maggiehays added the ingestion PR or Issue related to the ingestion of metadata label Jul 22, 2022
@MugdhaHardikar-GSLab
Contributor

@QKotel Please confirm if your issue is solved.

@siddiquebagwan
Contributor

Hi @QKotel
Is this still an issue? If not, we will close it after a few days of inactivity.

@QKotel
Author

QKotel commented Sep 5, 2022

Hello @MugdhaHardikar-GSLab, the change has fixed my problem, thank you.

@QKotel QKotel closed this as completed Sep 5, 2022