spark lineage plugin - remove_partition_pattern config parameter has no effect #11942

Open
lorak89 opened this issue Nov 25, 2024 · 1 comment
Labels
bug Bug report ingestion PR or Issue related to the ingestion of metadata

Comments

lorak89 commented Nov 25, 2024

Describe the bug
I am using the acryl-spark-lineage-0.2.16 plugin and trying to use the spark.datahub.metadata.remove_partition_pattern configuration option to remove dynamic parts of the file path in the lineage. Example:

  • original path: "s3a://test-bucket/output-app-test/aaa/120w/bbb/"
  • remove_partition_pattern: "120w/"
  • expected path after transformation: "s3a://test-bucket/output-app-test/aaa/bbb/"
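The expected transformation above can be sketched in a few lines of Python. This only illustrates the behavior the reporter expects (every regex match removed from the path); it is not the plugin's code, and the helper name `remove_path_pattern` is hypothetical.

```python
import re


def remove_path_pattern(path: str, pattern: str) -> str:
    """Hypothetical helper: delete every match of `pattern` from `path`."""
    return re.sub(pattern, "", path)


original = "s3a://test-bucket/output-app-test/aaa/120w/bbb/"
transformed = remove_path_pattern(original, "120w/")
print(transformed)  # s3a://test-bucket/output-app-test/aaa/bbb/
```

The same pattern also matches the dataset name that appears in the log without a trailing slash (output-app-test/aaa/120w/bbb), so a plain "remove all matches" implementation would be expected to change it.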

Relevant part of the log file (the transformed path is identical to the original dataset name):
[spark-listener-group-shared] INFO io.openlineage.spark.agent.util.RemovePathPatternUtils - Removing path pattern from dataset name output-app-test/aaa/120w/bbb
[spark-listener-group-shared] DEBUG io.openlineage.spark.agent.util.RemovePathPatternUtils - Transformed path is output-app-test/aaa/120w/bbb

I checked the implementation, and it looks like remove_partition_pattern is not implemented at all. I am referring to this part of the code: https://github.com/datahub-project/datahub/blob/master/metadata-integration/java/openlineage-converter/src/main/java/io/datahubproject/openlineage/dataset/HdfsPathDataset.java#L54

To Reproduce
Steps to reproduce the behavior:

  1. Create a simple PySpark app that reads from and writes to S3.
  2. Configure the Spark app to use acryl-spark-lineage-0.2.16 and set remove_partition_pattern to a pattern that matches some part of the path (it doesn't matter whether it is an input or output path).
  3. Check the log file for "Removing path pattern from dataset name" to confirm that the path was not transformed (this can also be checked in the UI).
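A rough sketch of the configuration from step 2, expressed as a spark-submit command built in Python. Only the plugin version and the remove_partition_pattern key and value come from this report. The Maven coordinate, listener class, REST server URL, and the script name `repro_app.py` are assumptions for illustration and should be checked against the plugin's documentation and your environment.

```python
import shlex

# Settings for the reproduction. Only the last entry is taken from the issue;
# the others are assumed values for a typical local setup.
LINEAGE_CONF = {
    "spark.jars.packages": "io.acryl:acryl-spark-lineage:0.2.16",  # assumed coordinate
    "spark.extraListeners": "datahub.spark.DatahubSparkListener",  # assumed listener class
    "spark.datahub.rest.server": "http://localhost:8080",          # placeholder URL
    "spark.datahub.metadata.remove_partition_pattern": "120w/",
}


def spark_submit_command(app_path: str, conf: dict) -> str:
    """Render a spark-submit command line with one --conf flag per setting."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app_path)
    return " ".join(shlex.quote(arg) for arg in args)


command = spark_submit_command("repro_app.py", LINEAGE_CONF)
print(command)
```

After running the generated command, the log can be searched for "Removing path pattern from dataset name" as described in step 3.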

Expected behavior
The path should be transformed according to the given regexp (matching parts should be removed).

Additional context
I tried this on both input and output paths, and it does not work for either.

@lorak89 lorak89 added the bug Bug report label Nov 25, 2024
@RyanHolstien RyanHolstien added the ingestion PR or Issue related to the ingestion of metadata label Nov 26, 2024
@Michalosu

@treff7es, could you please check this issue? You recently fixed another bug with duplicated inputs/outputs in this plugin. Thanks again, and I believe you're the best person to address this issue 🙏
