Describe the bug
I am using the acryl-spark-lineage-0.2.16 plugin and trying to use the spark.datahub.metadata.remove_partition_pattern configuration option to remove some dynamic parts of the file path in the lineage. Example:
original path: "s3a://test-bucket/output-app-test/aaa/120w/bbb/"
remove_partition_pattern: "120w/"
expected path after transformation: "s3a://test-bucket/output-app-test/aaa/bbb/"
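For context, this is roughly how the option is wired into my Spark session. Only the remove_partition_pattern key and value come from this report; the package coordinates and listener class are what I recall from the DataHub Spark lineage docs, so treat them as assumptions:

```python
from pyspark.sql import SparkSession

# Sketch of the session setup; package/listener names are assumptions based on
# the DataHub Spark lineage docs, only remove_partition_pattern is from this report.
spark = (
    SparkSession.builder
    .appName("remove-partition-pattern-test")
    .config("spark.jars.packages", "io.acryl:acryl-spark-lineage:0.2.16")
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    .config("spark.datahub.metadata.remove_partition_pattern", "120w/")
    .getOrCreate()
)
```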
Part of the log file:
[spark-listener-group-shared] INFO io.openlineage.spark.agent.util.RemovePathPatternUtils - Removing path pattern from dataset name output-app-test/aaa/120w/bbb
[spark-listener-group-shared] DEBUG io.openlineage.spark.agent.util.RemovePathPatternUtils - Transformed path is output-app-test/aaa/120w/bbb
I was checking the implementation, and it looks like remove_partition_pattern is not implemented at all. I am referring to this part of the code: https://github.com/datahub-project/datahub/blob/master/metadata-integration/java/openlineage-converter/src/main/java/io/datahubproject/openlineage/dataset/HdfsPathDataset.java#L54
To Reproduce
Steps to reproduce the behavior:
1. Create a simple PySpark app that reads from and writes to S3 (a minimal sketch follows these steps).
2. Configure the Spark app to use acryl-spark-lineage-0.2.16 and set remove_partition_pattern to a value that matches some part of the path (it doesn't matter whether it is an input or an output path).
3. Check the log file for "Removing path pattern from dataset name" to confirm that the path was not transformed (it can also be checked in the UI).
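A minimal sketch of such an app, assuming the session is configured as shown earlier; the input path comes from the example above, while the output prefix is a made-up placeholder:

```python
# Minimal reproduction sketch. Assumes `spark` was created with the
# acryl-spark-lineage configs shown earlier in this report.
input_path = "s3a://test-bucket/output-app-test/aaa/120w/bbb/"
# Hypothetical output location; any S3 path whose name contains "120w/" will do.
output_path = "s3a://test-bucket/output-app-test/aaa/120w/out/"

df = spark.read.parquet(input_path)               # read a dataset whose path contains "120w/"
df.write.mode("overwrite").parquet(output_path)   # write so lineage is emitted

# After the job runs, the emitted dataset names still contain "120w/",
# even though remove_partition_pattern is set to "120w/".
```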
Expected behavior
The path should be transformed according to the given regexp (matching parts should be removed).
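In plain Python regex terms, this is the transformation I expect; it is only an illustration of the desired result, not the plugin's actual code:

```python
import re

path = "output-app-test/aaa/120w/bbb"
pattern = "120w/"

# Expected: parts matching the pattern are removed from the dataset name.
print(re.sub(pattern, "", path))  # -> "output-app-test/aaa/bbb"
```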
Additional context
I tried this on both input and output paths, and it does not work for either.
@treff7es, could you please check this issue? You recently fixed another bug with duplicated inputs/outputs in this plugin. Thanks again, and I believe you're the best person to address this one 🙏