ETL examples are unclear and hard to use #1743
Comments
I believe we've already migrated all the ETL scripts to Python 3 and updated the docs to reflect that: https://github.com/linkedin/datahub/tree/master/metadata-ingestion. That said, I agree that the Python Avro binding is not the easiest to use. We can also do a better job of communicating the expectations, and we could add more Java examples.
Thanks @jplaisted & @mars-lan. I agree that we should consider providing Java-based ETL code, perhaps closer to what we use internally, at least for some of the most requested platforms.
Ideally we'll find the SQLAlchemy equivalent in Java so we can have one ETL that works across many systems.
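For context, here is a minimal sketch (not DataHub code) of the SQLAlchemy property being referred to: the same extraction logic can cover many databases just by swapping the connection URI. The URIs below are placeholders.

```python
from sqlalchemy import create_engine, inspect

def extract_tables(connection_uri: str):
    """Yield (schema, table, columns) for every table the URI points at."""
    engine = create_engine(connection_uri)
    inspector = inspect(engine)
    for schema in inspector.get_schema_names():
        for table in inspector.get_table_names(schema=schema):
            columns = inspector.get_columns(table, schema=schema)
            yield schema, table, columns

# The same function works against MySQL, Postgres, Snowflake, etc.:
# extract_tables("mysql+pymysql://user:pass@host/db")
# extract_tables("postgresql://user:pass@host/db")
```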
I also agree with a Java-based ETL solution. The reason is that DataHub already has built-in SerDes for the different aspects of an entity. If ETL users could just feed in data in JSON format (compared to the current Avro) and run something like a single command, it would be much easier than the Python environment setup we have been dealing with. One example is that, internally, we are creating …
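To make the "feed JSON, run one command" idea concrete, here is a hypothetical sketch; the endpoint path and payload shape are illustrative assumptions, not the actual DataHub GMS contract.

```python
# Hypothetical sketch only: ingest a plain-JSON metadata record with one call,
# instead of hand-building Avro records in a dedicated Python environment.
import json
import sys

import requests

GMS_URL = "http://localhost:8080/entities?action=ingest"  # assumed endpoint

def ingest_json(path: str) -> None:
    """POST the JSON file at `path` to the metadata service."""
    with open(path) as f:
        payload = json.load(f)
    resp = requests.post(GMS_URL, json=payload)
    resp.raise_for_status()

if __name__ == "__main__":
    ingest_json(sys.argv[1])  # e.g. python ingest_json.py my_dataset.json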
@jplaisted
Creating MCE messages in a JSON-interpolation way (the way the Python mce-cli and other scripts currently do it) looks error prone.
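To illustrate the concern, a small sketch of the interpolation style (field names and URNs are illustrative): when the message is built as a raw dict or interpolated JSON string, nothing catches a misspelled field or a wrong type until the message is rejected downstream.

```python
# Illustrative only: hand-rolling an MCE as a plain dict.
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,my_table,PROD)"
owner = "urn:li:corpuser:jdoe"

mce = {
    "proposedSnapshot": {
        "urn": dataset_urn,
        "aspects": [
            # The typo "ownerr" is not caught at build time; it only fails
            # when the server or a downstream consumer rejects the message.
            {"ownership": {"owners": [{"ownerr": owner, "type": "DATAOWNER"}]}}
        ],
    }
}
```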
I see Java bindings being generated in the build along with JSON. Would it be a good idea to publish them as a jar, so they can be used as programmatic contracts instead of relying only on AVSC? This could also be extended to multiple-language bindings to promote an ecosystem around DataHub. Thanks
We should probably publish it to a Maven repository, along with a lot of our other jars. The only artifacts we have right now are our Docker images (built on commits to master) and source code tars on release. I'll make a separate issue for that.
Also, FWIW, we really only use Java for our pipelines that emit MCEs at LinkedIn, for much the same reasons you describe. Again, the original idea behind the Python scripts in open source was that they were smaller and easier to set up. But since most people would prefer Java due to the typed bindings, I think we should provide those examples.
Filing this away as closed with the new typed Python ingestion framework. Not perfect yet (is anything ever? :)), but hopefully at least no longer unclear and hard to use.
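A hedged sketch of what the typed approach enables, in contrast to the dict-interpolation example above; the exact module and class names are based on the acryl-datahub Python package and should be treated as an assumption here.

```python
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetSnapshotClass,
    MetadataChangeEventClass,
    OwnerClass,
    OwnershipClass,
    OwnershipTypeClass,
)

# Typed constructors catch misspelled fields and wrong types when the object
# is built, rather than when the server rejects the message.
mce = MetadataChangeEventClass(
    proposedSnapshot=DatasetSnapshotClass(
        urn="urn:li:dataset:(urn:li:dataPlatform:hive,my_table,PROD)",
        aspects=[
            OwnershipClass(
                owners=[
                    OwnerClass(
                        owner="urn:li:corpuser:jdoe",
                        type=OwnershipTypeClass.DATAOWNER,
                    )
                ]
            )
        ],
    )
)

emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit_mce(mce)
```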
Filing some feedback from SpotHero earlier today so we can track and improve this.
We have some Python scripts as examples of how to ETL data into DataHub. However, in a SpotHero developer's experience, these made onboarding difficult.
Specific issues:
So specific action items:
cc @keremsahin1