
ETL examples are unclear and hard to use #1743

Closed
jplaisted opened this issue Jul 22, 2020 · 8 comments
Labels
feature-request Request for a new feature to be added

Comments

@jplaisted
Contributor

jplaisted commented Jul 22, 2020

Filing some feedback from SpotHero earlier today so we can track and improve this.

We have some Python scripts as examples of ETL'ing data into DataHub. However, a SpotHero developer found that these made onboarding difficult.

Specific issues:

  • Python 2.7 rather than Python 3
  • They deal with large, untyped JSON blobs, which makes it hard to A) read them and B) determine whether they match the schema
  • It was unclear whether Python is the suggested method. Companies are likely to take these fairly literally and use them as the basis for their own ETL scripts, and Python is perhaps not the best choice. Even at LI we don't use Python for these; most (if not all) of our ingestion pipelines are in Java.

So specific action items:

  • If we keep the Python example around, upgrade to Python 3 and see how else we can improve typing (are there avro / pegasus bindings?)
  • Write Java or other language examples too.
  • Write better documentation around best practices. Make it clear these are just examples. Give other examples of what we do at LI (we have instrumented services that write metadata, plus some crawlers and some Azkaban jobs; it totally depends on the source data and how it can be extracted).
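To illustrate the typing point above: a hypothetical sketch of what typed Python bindings could look like. The dataclasses below are illustrative only (they mirror the avro record names but are not an existing DataHub API); the benefit is that a mistyped or missing field fails at construction time rather than at ingestion.

```python
# Illustrative sketch, not DataHub code: typed construction of a snapshot
# instead of a hand-written, untyped JSON blob.
from dataclasses import dataclass, field, asdict
from typing import List, Optional

@dataclass
class CorpUserInfo:
    active: bool
    displayName: str
    email: str
    title: Optional[str] = None  # optional fields get explicit defaults

@dataclass
class CorpUserSnapshot:
    urn: str
    aspects: List[CorpUserInfo] = field(default_factory=list)

# A typo such as a missing required field now raises immediately,
# instead of producing a blob that fails somewhere downstream.
snapshot = CorpUserSnapshot(
    urn="urn:li:corpuser:datahub",
    aspects=[CorpUserInfo(active=True, displayName="John Doe",
                          email="jdoe@example.com")],
)
print(asdict(snapshot)["aspects"][0]["displayName"])  # John Doe
```

Real avro or pegasus codegen would produce richer classes than this, but even plain dataclasses show the readability gap between typed records and raw dicts.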

cc @keremsahin1

@jplaisted jplaisted added the bug Bug report label Jul 22, 2020
@mars-lan
Contributor

I believe we've already migrated all the ETL scripts to Python 3 and updated the doc to reflect that: https://github.com/linkedin/datahub/tree/master/metadata-ingestion

That said, I agree that the Python avro binding is not the easiest to use. We can also do a better job of communicating expectations, and finally we can add more Java examples.

@keremsahin1
Contributor

Thanks @jplaisted & @mars-lan. I agree that we should consider providing Java-based ETL code, perhaps closer to what we use internally, at least for some of the most-requested platforms.

@mars-lan
Contributor

Ideally we'll find a SQLAlchemy equivalent in Java so we can have one ETL that works across many systems.

@mars-lan mars-lan added feature-request Request for a new feature to be added and removed bug Bug report labels Jul 23, 2020
@liangjun-jiang
Contributor

liangjun-jiang commented Jul 23, 2020

I also agree that a Java-based ETL solution would help. The reason is that DataHub already has built-in SerDes for the different aspects of an entity. If ETL users could just feed in data in JSON format (rather than the current Avro) and run something like

java  -jar ETL.jar produce -d data.json

it would be much easier than what we deal with today: setting up a Python environment and, most importantly, getting the Avro data format exactly right.

One example: internally, we are adding a following relationship (a user can follow another user). After adding this new aspect to CorpUser, I created sample ETL data.
Because a union type is used, I had no way of knowing that fieldDiscriminator needed to be there, and had to fumble around for a few hours.

{"auditHeader": None,
 "proposedSnapshot": ("com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot", {
     "urn": "urn:li:corpuser:datahub",
     "aspects": [
         ("com.linkedin.pegasus2avro.identity.CorpUserInfo", {
             "active": True,
             "displayName": "John Doe",
             "fullName": "John Doe",
             "email": "[email protected]",
             "title": "CEO"}),
         ("com.linkedin.pegasus2avro.common.Follow", {
             "followees": [
                 {"followee": {"corpUser": "urn:li:corpuser:kzhang10",
                               "fieldDiscriminator": "corpUser"}}]})]}),
 "proposedDelta": None}
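A small helper would have made the discriminator requirement impossible to miss. This is a hypothetical sketch, not DataHub code: it encodes the rule that the union-branch key and fieldDiscriminator must agree in one place.

```python
# Hypothetical helper (not part of DataHub): build the followee union member
# with its fieldDiscriminator filled in, so the branch key ("corpUser") and
# the discriminator can never disagree.
def followee_union(urn: str) -> dict:
    return {"followee": {"corpUser": urn, "fieldDiscriminator": "corpUser"}}

follow_aspect = (
    "com.linkedin.pegasus2avro.common.Follow",
    {"followees": [followee_union("urn:li:corpuser:kzhang10")]},
)
```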

@shakti-garg
Contributor

shakti-garg commented Aug 12, 2020

@jplaisted
We are also trying to build a Java-based ETL solution.
It is based on the following design considerations:

  1. It should sit outside the DataHub tool, to promote loose coupling in the handshake between internal data systems and DataHub
  2. It should be able to host adapters for various kinds of source systems, so that adding one becomes plug-and-play
  3. Adapters would contain the business logic specific to each source system

Creating MCE messages by JSON interpolation (the way the Python mce-cli and others currently do) looks error-prone because:

  1. The JSON blobs are very large, and hard to test and debug
  2. String interpolation is error-prone, hard to maintain, and prone to formatting issues.

I see Java bindings being generated in the build alongside the JSON. Would it be a good idea to publish them as a jar, so they can be used as programmatic contracts instead of relying only on AVSC? Bindings for multiple languages could also be exposed to promote an ecosystem around DataHub.
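As a stdlib-only illustration of why programmatic contracts beat string interpolation: even a tiny hand-rolled validator (purely illustrative, not DataHub's API; generated bindings would enforce all of this automatically) catches shape errors before a message is ever sent.

```python
# Illustrative only: a minimal record validator sketching what published,
# generated bindings would enforce for free. Not part of DataHub.
def validate_record(record: dict, required_fields: dict) -> list:
    """Return a list of problems: missing fields or wrong value types."""
    problems = []
    for name, expected_type in required_fields.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}, "
                            f"got {type(record[name]).__name__}")
    return problems

# A toy "contract" for one aspect; real contracts come from the AVSC schemas.
CORP_USER_INFO = {"active": bool, "displayName": str, "email": str}

print(validate_record({"active": "yes", "displayName": "John Doe"},
                      CORP_USER_INFO))
# ['active: expected bool, got str', 'missing field: email']
```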

Thanks

@jplaisted
Contributor Author

The metadata-events/mxe-avro-1.7 module's jar should already contain the Java generated avro classes in it.

We should probably publish it to a maven repository, along with a lot of our other jars. The only artifacts we have right now are our docker images on commit to master, and source code tars on release. I'll make a separate issue for that.

@jplaisted
Contributor Author

Also, FWIW, at LI we really only use Java for our pipelines that emit MCEs, for much the same reasons you describe. Again, the original idea behind the Python scripts in open source was that they were smaller and easier to set up. But since most people would prefer Java for the type bindings, I think we should provide those examples.

@shirshanka
Contributor

Filing this away as closed with the new typed Python ingestion framework:
https://datahubproject.io/docs/metadata-ingestion

Not perfect yet (is anything ever? :)), but at least hopefully not unclear and hard to use anymore.

chriscollins3456 pushed a commit to chriscollins3456/datahub that referenced this issue Sep 14, 2023
6 participants