Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Start adding java ETL examples, starting with kafka etl. #1805

Merged
merged 3 commits into from
Sep 11, 2020
Merged

Start adding java ETL examples, starting with kafka etl. #1805

merged 3 commits into from
Sep 11, 2020

Conversation

jplaisted
Copy link
Contributor

@jplaisted jplaisted commented Aug 12, 2020

We've had a few requests to start providing Java examples rather than Python due to type safety.

I've also started to add these to metadata-ingestion-examples to make it clearer these are examples. They can be used directly or as a basis for other things.

This is a work in progress. After we port all the examples to Java we'll delete the python versions.

#1743

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable)

We've had a few requests to start providing Java examples rather than Python due to type safety.

I've also started to add these to metadata-ingestion-examples to make it clearer these are *examples*. They can be used directly or as a basis for other things.

This is a work in progress. After we port all the examples to Java we'll delete the python versions.
@jplaisted jplaisted requested a review from mars-lan August 12, 2020 23:32
Copy link
Contributor

@mars-lan mars-lan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the intention to eventually drop all Python-based ETL scripts or have them co-exist? We even have some closure ETL under contrib, wondering if we should have a language-specific structure, e.g. /metadata-ingestion/java /metadata-ingestion/python instead?

@cobolbaby
Copy link
Contributor

Is the intention to eventually drop all Python-based ETL scripts or have them co-exist? We even have some closure ETL under contrib, wondering if we should have a language-specific structure, e.g. /metadata-ingestion/java /metadata-ingestion/python instead?

Is it possible to migrate the ingestion scripts under contrib?

@jplaisted
Copy link
Contributor Author

Is the intention to eventually drop all Python-based ETL scripts or have them co-exist? We even have some closure ETL under contrib, wondering if we should have a language-specific structure, e.g. /metadata-ingestion/java /metadata-ingestion/python instead?

I did initially have that (under metadata-ingestion-examples/python and java), but it was kind of awkward imo. Quite nested. And maintaining two apps isn't very ideal.

So yes, I pivoted and my new idea was to delete the Python examples afterward.

@jplaisted
Copy link
Contributor Author

Is it possible to migrate the ingestion scripts under contrib?

The only downside is we may stop maintaining them.

@liangjun-jiang
Copy link
Contributor

I'd say, for right now, put this ingestion example under contrib. The reason is that, my understanding is that ingesting with Kafka is easier by leveraging existing mce-job as a template. ingesting with hive and others existing Python examples would be much time consuming. We don't want to confuse potential adapters.
Unless one day we could really port what all Python scripts can offer, we can move the Java based ETL be official ingestion solution.

@mars-lan
Copy link
Contributor

mars-lan commented Aug 15, 2020

Is the intention to eventually drop all Python-based ETL scripts or have them co-exist? We even have some closure ETL under contrib, wondering if we should have a language-specific structure, e.g. /metadata-ingestion/java /metadata-ingestion/python instead?

I did initially have that (under metadata-ingestion-examples/python and java), but it was kind of awkward imo. Quite nested. And maintaining two apps isn't very ideal.

So yes, I pivoted and my new idea was to delete the Python examples afterward.

Replacing all Python examples might take time, and I'd expect some people will still prefer Python over Java. Maintaining both is indeed non-ideal either so one option is to eventually send the Python scripts to contrib, along with other clojure-based ETL scripts for the community to maintain over the long term.

@jplaisted
Copy link
Contributor Author

I'll move them to contrib.

Involved moving some files in contrib/metadata-ingestion to a specific jdbc directory first.
@jplaisted
Copy link
Contributor Author

@mars-lan sorry for the delay, PTAL. I'll move each example as I port them.

@mars-lan
Copy link
Contributor

@mars-lan sorry for the delay, PTAL. I'll move each example as I port them.

Some got moved to /controib/metadata-ingestion/jdbc which I assume is the incorrect location?

@jplaisted
Copy link
Contributor Author

Some got moved to /controib/metadata-ingestion/jdbc which I assume is the incorrect location?

See that commit description. The contrib/metadata-ingestion dir seemed to just be a jdbc example at the top level. I moved those into a jdbc dir.

@mars-lan
Copy link
Contributor

Some got moved to /controib/metadata-ingestion/jdbc which I assume is the incorrect location?

See that commit description. The contrib/metadata-ingestion dir seemed to just be a jdbc example at the top level. I moved those into a jdbc dir.

e.g. I don't see any mentioning of jdbc in this? It's using python hive lib directly? https://github.com/linkedin/datahub/blob/12607e30c6f33c35652c425bd983e69ac0860543/contrib/metadata-ingestion/jdbc/bin/dataset-hive-generator.py. All the .hs files are Haskell so I don't think they're using JDBC either?

@jplaisted
Copy link
Contributor Author

That commit as in this one. Not 1805 overall lol.

So if you look at the state of things before, it was

contrib/metadata-integestion/
   bin/
   config/
   openldap-etl/
   sample/
   README.md # talking about JDBC

I assumed everything that wasn't openldap-etl was jdbc related.

Part of the issue here is lack of structure and documentation to start with. It is not clear to me what these are for. I want this to be very well structured, with everything under metadata-ingestion/ ideally be a separate contributed thing. Just because things are in contrib/ doesn't meant they should be messy. I don't know what, at a glance, bin, config, and sample are for...

@jplaisted
Copy link
Contributor Author

I don't really want to block the java ports on cleaning up contrib, unless you have a clear path forward. Are we sure we can't just delete these python examples? :p

@mars-lan
Copy link
Contributor

I don't really want to block the java ports on cleaning up contrib, unless you have a clear path forward. Are we sure we can't just delete these python examples? :p

Let's move the existing Haskell stuff to /contrib/metadata-ingestion/haskell to be align with new python stuff you're moving over? I'll do a separate clean up of those later.

@jplaisted
Copy link
Contributor Author

Done

@jplaisted jplaisted merged commit 6ece2d6 into datahub-project:master Sep 11, 2020
@mars-lan mars-lan mentioned this pull request Sep 15, 2020
4 tasks
@shubhamg931
Copy link
Contributor

@jplaisted Why are we maintaining separate metadata-ingestion-examples directory for the ported examples? imo keeping it in metadata-ingestion/java folder would have been a better option while they are being ported as suggested by @mars-lan

Is the intention to eventually drop all Python-based ETL scripts or have them co-exist? We even have some closure ETL under contrib, wondering if we should have a language-specific structure, e.g. /metadata-ingestion/java /metadata-ingestion/python instead?

Two directories are creating confusion I guess

@jplaisted
Copy link
Contributor Author

jplaisted commented Oct 19, 2020

  1. Minor, but more nested was going to be more painful, imo
  2. I wanted to make it clear these are examples that people can extrapolate. We had a few questions that made me think people were unclear about their purpose. e.g. people writing their own things in python and then complaining about python, when they could've just as easily written their own things in Java / Kotlin / whatever.
  3. I did want to remove the Python examples entirely, but Mars suggested moving them to contrib instead. In either case I highly suspect they'll become out of date and break due to lack of any testing (including compile time checking).

In theory, if we port all the examples to Java, we'll have one folder. I'll try to find more time to continue porting things.

@jplaisted jplaisted deleted the etl branch February 26, 2021 22:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants