
ETL examples are unclear and hard to use #1743

Closed
jplaisted opened this issue Jul 22, 2020 · 8 comments
Labels
feature-request Request for a new feature to be added

Comments

@jplaisted
Contributor

jplaisted commented Jul 22, 2020

Filing some feedback from SpotHero earlier today so we can track and improve this.

We have some Python scripts as examples of ETL'ing data into DataHub. However, a SpotHero developer found that these made onboarding difficult.

Specific issues:

  • Python 2.7 rather than Python 3
  • They deal with large, untyped JSON blobs, which makes it hard to A) read them and B) determine whether they match the schema
  • It was unclear whether Python is the suggested method. Companies are likely to take these fairly literally and use them as the basis for their own ETL scripts, and Python is perhaps not the best choice. Even at LI we don't use Python for these; most (if not all) of our ingestion pipelines are in Java.

So specific action items:

  • If we keep the Python example around, upgrade to Python 3 and see how else we can improve typing (are there avro / pegasus bindings?)
  • Write Java or other language examples too.
  • Write better documentation around best practices. Make it clear these are just examples. Give other examples of what we do at LI (we have instrumented services that write metadata, plus some crawlers and some Azkaban jobs; it totally depends on the source data and how it can be extracted).
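To illustrate the typing point above: a hypothetical sketch of what typed Python bindings could look like. The dataclasses below are illustrative only (they mirror the avro record names but are not an existing DataHub API); the benefit is that a mistyped or missing field fails at construction time rather than at ingestion.

```python
# Illustrative sketch, not DataHub code: typed construction of a snapshot
# instead of a hand-written, untyped JSON blob.
from dataclasses import dataclass, field, asdict
from typing import List, Optional

@dataclass
class CorpUserInfo:
    active: bool
    displayName: str
    email: str
    title: Optional[str] = None  # optional fields get explicit defaults

@dataclass
class CorpUserSnapshot:
    urn: str
    aspects: List[CorpUserInfo] = field(default_factory=list)

# A typo such as a missing required field now raises immediately,
# instead of producing a blob that fails somewhere downstream.
snapshot = CorpUserSnapshot(
    urn="urn:li:corpuser:datahub",
    aspects=[CorpUserInfo(active=True, displayName="John Doe",
                          email="jdoe@example.com")],
)
print(asdict(snapshot)["aspects"][0]["displayName"])  # John Doe
```

Real avro or pegasus codegen would produce richer classes than this, but even plain dataclasses show the readability gap between typed records and raw dicts.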

cc @keremsahin1

@jplaisted jplaisted added the bug Bug report label Jul 22, 2020
@mars-lan
Contributor

I believe we've already migrated all the ETL scripts to Python 3 and updated the doc to reflect that: https://github.com/linkedin/datahub/tree/master/metadata-ingestion

That said, I agree that the Python avro binding is not the easiest to use. We can also do a better job of communicating expectations, and finally we can add more Java examples.

@keremsahin1
Contributor

Thanks @jplaisted & @mars-lan. I agree that we should consider providing Java-based ETL code, perhaps closer to what we use internally, at least for some of the most-requested platforms.

@mars-lan
Contributor

Ideally we'll find a SQLAlchemy equivalent in Java so we can have one ETL that works across many systems.

@mars-lan mars-lan added feature-request Request for a new feature to be added and removed bug Bug report labels Jul 23, 2020
@liangjun-jiang
Contributor

liangjun-jiang commented Jul 23, 2020

I also agree that a Java-based ETL solution would help. The reason is that DataHub already has built-in SerDes for the different aspects of an entity. If ETL users could just feed in data in JSON format (rather than the current Avro) and run something like

java  -jar ETL.jar produce -d data.json

it would be much easier than what we deal with today: setting up a Python environment and, most importantly, getting the Avro data format exactly right.

One example: internally, we are adding a following relationship (a user can follow another user). After adding this new aspect to CorpUser, I created sample ETL data.
Because a union type is used, I had no way of knowing that fieldDiscriminator needed to be there, and had to fumble around for a few hours.

{"auditHeader": None,
 "proposedSnapshot": ("com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot", {
     "urn": "urn:li:corpuser:datahub",
     "aspects": [
         ("com.linkedin.pegasus2avro.identity.CorpUserInfo", {
             "active": True,
             "displayName": "John Doe",
             "fullName": "John Doe",
             "email": "[email protected]",
             "title": "CEO"}),
         ("com.linkedin.pegasus2avro.common.Follow", {
             "followees": [
                 {"followee": {"corpUser": "urn:li:corpuser:kzhang10",
                               "fieldDiscriminator": "corpUser"}}]})]}),
 "proposedDelta": None}
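A small helper would have made the discriminator requirement impossible to miss. This is a hypothetical sketch, not DataHub code: it encodes the rule that the union-branch key and fieldDiscriminator must agree in one place.

```python
# Hypothetical helper (not part of DataHub): build the followee union member
# with its fieldDiscriminator filled in, so the branch key ("corpUser") and
# the discriminator can never disagree.
def followee_union(urn: str) -> dict:
    return {"followee": {"corpUser": urn, "fieldDiscriminator": "corpUser"}}

follow_aspect = (
    "com.linkedin.pegasus2avro.common.Follow",
    {"followees": [followee_union("urn:li:corpuser:kzhang10")]},
)
```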

@shakti-garg
Contributor

shakti-garg commented Aug 12, 2020

@jplaisted
We are also trying to build a Java-based ETL solution.
It is based on the following design considerations:

  1. It should sit outside the DataHub tool, to promote loose coupling in the handshake between internal data systems and DataHub
  2. It should be able to host adapters for various kinds of source systems, so that adding one becomes plug-and-play
  3. Adapters would contain the business logic specific to each source system

Creating MCE messages by JSON interpolation (the way the Python mce-cli and others currently do) looks error-prone because:

  1. The JSON blobs are very large, and hard to test and debug
  2. String interpolation is error-prone, hard to maintain, and prone to formatting issues.

I see Java bindings being generated in the build alongside the JSON. Would it be a good idea to publish them as a jar, so they can be used as programmatic contracts instead of relying only on AVSC? Bindings for multiple languages could also be exposed to promote an ecosystem around DataHub.
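As a stdlib-only illustration of why programmatic contracts beat string interpolation: even a tiny hand-rolled validator (purely illustrative, not DataHub's API; generated bindings would enforce all of this automatically) catches shape errors before a message is ever sent.

```python
# Illustrative only: a minimal record validator sketching what published,
# generated bindings would enforce for free. Not part of DataHub.
def validate_record(record: dict, required_fields: dict) -> list:
    """Return a list of problems: missing fields or wrong value types."""
    problems = []
    for name, expected_type in required_fields.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}, "
                            f"got {type(record[name]).__name__}")
    return problems

# A toy "contract" for one aspect; real contracts come from the AVSC schemas.
CORP_USER_INFO = {"active": bool, "displayName": str, "email": str}

print(validate_record({"active": "yes", "displayName": "John Doe"},
                      CORP_USER_INFO))
# ['active: expected bool, got str', 'missing field: email']
```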

Thanks

@jplaisted
Contributor Author

The metadata-events/mxe-avro-1.7 module's jar should already contain the Java generated avro classes in it.

We should probably publish it to a maven repository, along with a lot of our other jars. The only artifacts we have right now are our docker images on commit to master, and source code tars on release. I'll make a separate issue for that.

@jplaisted
Contributor Author

Also, FWIW, at LI we really only use Java for our pipelines that emit MCEs, for much the same reasons you describe. Again, the original idea behind the Python scripts in open source was that they were smaller and easier to set up. But since most people would prefer Java for the type bindings, I think we should provide those examples.

@shirshanka
Contributor

Filing this away as closed with the new typed Python ingestion framework:
https://datahubproject.io/docs/metadata-ingestion

Not perfect yet (is anything ever? :)), but at least hopefully not unclear and hard to use anymore.

chriscollins3456 pushed a commit to chriscollins3456/datahub that referenced this issue Sep 14, 2023
6 participants