collaboration with Apache Arrow org #28
The next steps would be to collect any Arrow-relevant IP and make a code donation to the Apache Arrow project (like we did with Go http://incubator.apache.org/ip-clearance/arrow-go-library.html and Ruby), then continue development there. Let me know if you want to proceed with that, and I'll be ready to help. |
Thanks for your help! In general I find all the legalistic tomfoolery horrifying, but I won't let that stop us unless there actually is some problem here. Would this involve abandoning the MIT license in favor of some form of Apache license? Is the Apache license in any sense more restrictive? Any thoughts from the other Julia people on this? Any feeling for whether the rest of the Arrow community will be upset that this is not a wrapper? I'm hoping they might not care if they don't use Julia, but admittedly I won't be particularly happy if I clone this repo a year from now and find that someone completely rewrote it as a wrapper. I'm also pretty concerned that making this part of the Arrow repository will make it 100 times harder to work on than if it were e.g. in https://github.com/JuliaIO. (Again, feedback is welcome here, I wasn't necessarily planning on doing much more work on this in the near future, so I'm definitely open to input from anyone who feels they may be a potential contributor.) |
Tagging some more Julia people that may have interest in this, apologies if you don't: @ararslan @ihnorton @nalimilan @tanmaykm (feel free to add others) |
(Arrow developer/PMC, hoping to also answer some questions, haven't used Julia in 5 years, so I rather count as non-knowledgeable in that space)
Given that most commits are from you and the other two contributors only contributed small changes, this will be very easy at the current stage. It will involve some "paperwork" emails but they should be straightforward.
Yes, this would involve changing the license to Apache. The Apache license is much longer in terms of text but is normally as widely accepted as the MIT license. If you are concerned, there are plenty of short descriptions of what the Apache license is about; read one of them.
No, in the Arrow repository we already have distinct implementations in JavaScript, Java, Go, C++ and Rust. They share no code, but we do have integration tests between them to ensure that they are all compatible. That is one of the main advantages of having them all in a single repository: to ensure that they all work together even though they don't share anything.
What are the benefits of doing it in JuliaIO? |
When you're talking about software redistribution, being careful about IP lineage and third party code makes everyone's lives (especially the lawyers') a lot easier in the long run. The process itself is not onerous despite any appearances.
The main difference between Apache 2.0 and MIT license is that the Apache license provides a patent grant to users, which provides additional security / peace of mind for developers who may produce commercial software that depends on a project. MIT and Apache 2.0 are compatible from a code reuse perspective as permissive licenses.
I'm curious what might be perceived as "harder". No one (out of > 150 contributors) has been having a hard time contributing to the project as far as I can tell -- I think that "use pull requests and do not break the build" is a reasonable bar of professionalism for an open source contributor. We also ask that contributors help maintain an intelligible change log for the project and write commit messages / PR descriptions that explain the content of their work. |
Thanks guys. I think that's good enough for me. I suppose my consternation about the project being harder to work on in the arrow repo just comes from the fact that the Julia community is relatively small, and this is something I've grown pretty comfortable with. No worries. It would probably be convenient if we could keep a mirror in an Arrow.jl repo somewhere, as that would probably simplify its installation with the Julia package manager, not sure what that would look like but I've seen such things done elsewhere. Ok, let's give people some time to respond with comments and then move ahead. Thanks again! |
Seems like what we'd want to do after voting on a Julia release is to push the changes and tag the new release in a repo that's connected to the Julia package management system (it seems like GitHub -- or git repositories at least -- and Julia packages are intertwined, is that right?). |
I'm not really convinced that moving the code into the main arrow repo is a good idea. Here are some issues I see:
I do think it would be fantastic to make sure we sync the testing story somehow. But I don't think we need to go to one repo for that, there seem to be lots of options to make sure stuff gets tested in sync while maintaining the current repo structure. On the license: that to me seems mostly extra work that at least I don't want to deal with ;) If folks want to change it, go ahead... But if we keep the repos distinct, I don't see a need to change the license. There is also again a situation that within the julia ecosystem the MIT license is by far the most used, and it might be easier to just stick with that. But, mostly, I'm agnostic about this point. |
I don't see why packaging and development process need to be tightly coupled. You only need to update the package manager when a release actually happens, and this update to the package manager should take < 1 minute to do.
We are cutting separate JavaScript releases already, so I don't think this is an issue
This seems like FUD to me. I really question how valuable the contributions are from people who are put off by a "who moved my cheese" type of issue. This is already a pretty difficult project to contribute to on the spectrum of open source projects just by its low-level systems nature (data structures, binary protocols, file formats, serialization, etc.) One could possibly create an apache/Arrow.jl repo, but this would make integration testing and other things a lot harder; the monorepo brings a lot of benefits when you are dealing with binary interoperability |
I agree it would make sense to move the Julia package to the same repo as all other implementations. The advantages of that approach may not be striking right now, but I imagine it will make it much easier in the long run to keep all implementations in sync while the Arrow format evolves. It would make sense to have some kind of mirror repo to work with the Julia package manager, though. |
The concept of Julia packages is indeed closely tied to git repositories (but not necessarily GitHub). In principle it's possible to make a package out of any type of repository, but indeed it does greatly simplify things if the repo is in the standard format like you see for Arrow.jl right now. I feel like this problem is probably solvable with git and GitHub though, isn't it possible to make a mirror of a directory or something? I seem to remember seeing some setups like that but can't find any good examples at the moment. I think at this point it is worth asking though: why is there such eagerness to have everything in a single repository? In my experience this actually makes things more difficult, not easier. I don't understand the thinking behind why it is so important to have so many languages in that one repo. We work on things in separate repos all the time and it's never an issue in the slightest, even when it involves IO and binary stuff. Regardless, I'm willing to donate it to that repo as long as we have some sort of "Arrow.jl", however that would work. |
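One way to approach the "mirror of a directory" idea is `git subtree split`, which rewrites the history of a single subdirectory onto its own branch. The sketch below is only an illustration in a throwaway repository: the `julia/` directory, the `julia-mirror` branch name and the file contents are assumptions made for the example, not the layout of the actual Arrow repository.

```shell
# Demonstrate extracting one subdirectory of a monorepo onto its own branch.
# Everything here lives in a temporary directory; all names are hypothetical.
demo_subtree_mirror() (
    set -eu
    work=$(mktemp -d)
    git init -q "$work/monorepo"
    cd "$work/monorepo"
    git config user.email "demo@example.com"
    git config user.name "Demo"

    # A stand-in monorepo with two language directories.
    mkdir julia cpp
    echo 'name = "Arrow"' > julia/Project.toml
    echo 'placeholder' > cpp/README
    git add .
    git commit -q -m "Initial monorepo layout"

    # Rewrite the history of julia/ onto a new branch whose root is julia/.
    git subtree split --prefix=julia -b julia-mirror >/dev/null 2>&1

    # List the top-level files on the split branch.
    git ls-tree --name-only julia-mirror

    rm -rf "$work"
)

demo_subtree_mirror
```

In a setup like this, the split branch could then be pushed to a standalone repository at release time (for example with `git push <mirror-url> julia-mirror:master`, where the mirror URL is whatever repository the package manager knows about). Whether that fits the Julia package manager's expectations is something that would still need checking.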
Well, it is on julia, for all practical purposes.
That is not how it works on julia. The whole workflow of downloading a package, working on a dev version of the package etc. is tightly integrated with the package manager. We also would no longer have access to most of the tool integration we have if the code moved into the arrow repo (things like attobot, femtocleaner, integrated testing & build in VS Code, Documenter.jl, Coverage.jl and probably more stuff).
Ah, so you are using tags that include things like
I think in reality we would constantly get PRs against the repo that the package manager knows about because that is the workflow that every julia documentation recommends. I think that would be cumbersome. I also think it is quite vital to make it easy to open PRs even for contributors that are not low level super devs. There are lots of little things in the Arrow.jl codebase where even less experienced devs can help, and I would not want to lose them. A good example is the one commit I contributed to this repo. I think pretty much anyone could have done that, certainly no low level coding experience required.
I think a productive conversation at this point would be what kind of testing we would like to see. Once we have figured that out, we can see whether that would be easier with two repos or a monorepo. I don't really know what type of integration testing there is right now in the arrow repo between the different implementations, I think it would be really helpful to understand that better. |
This project is held together by its inter-language binary integration tests. A single pull request may affect multiple implementations -- if we split the project up into multiple git repos, effectively we would have a network of circular dependencies in the CI. So if you needed to make a change that affected both Java and C++, you need a way to integration-test two PRs jointly. If you merge a breaking change in one repo, all the builds in the other repo break. By working in a monorepo, we maintain harmonious CI builds across all the subprojects. As an example, the JavaScript developers recently merged support for emitting binary streams from JS to consume in other implementations: apache/arrow@fc7a382. In this patch, the CI verifies that the JavaScript data emitted can be consumed by Java and C++. As time goes on, the matrix of implementations producing and consuming data will grow larger. It's much easier to stay in sync this way |
That sounds a little weird to me, as projects depend critically on other projects in different repos all the time, I don't quite see why this should be a special case. My understanding was that proper versioning was supposed to take care of this sort of thing. Right now there are (more than one) Julia Feather repos that depend on this package, if I break one of them I consider myself to have screwed up, just like if a numpy tag breaks pandas those guys would have screwed up. (Though @wesm undoubtedly has about a billion times more experience than I do with this sort of thing, so perhaps I don't know what I'm talking about.) Anyway, I really don't want to get us all into an argument here about the relative merits of how this gets set up, my interest is just in making sure this gets maintained and letting people know that they can integrate their Arrow stuff with Julia. I think having some sort of mirror repo or something might solve that. I'll ask around on Julia discourse about whether anyone has done something like this. I'm imagining an "arrow/Arrow.jl" that somehow links to the appropriate part of the arrow repo. |
The relationship between the different Arrow implementations is different from a normal package dependency. We have made deliberate changes to the binary format over the last 2.5 years, and it's a lot of work to keep everything in sync -- it sounds like the burden will be on the Julia-Arrow developers to create bespoke tooling to assist with integration testing against the other implementations. There will likely be cases where the Julia implementation will "slip" and become incompatible with the other implementations, as it will not be possible in all cases for CI to force the incompatibility to be resolved with each patch contributed to the project. We could set up some nightly builds to at least complain to the mailing list if something is broken in any 24 hour time span. This is to say, all of this discussion is moot until the Julia implementation is able to consume and produce valid Arrow binary messages, and validate them against a source of truth (i.e. the JSON integration test format developed). |
FWIW, I'm all in favor of moving the code under the apache arrow monorepo. We've been dancing around on the outside for too long and it's more than worth the little extra packaging/release work to be more tightly integrated w/ the rest of the arrow community. We have plenty of options to work around any current julia package system quirks, like mirroring a stand-alone repo that we would sync w/ releases. And in 0.7 (which should be officially tagged alpha today!), the new package manager is much more flexible w/ where package source code lives & how it is structured. |
Yes, that is a very good point and I had a feeling that we were getting too far away from it. Is there some sort of standard set of tests we can run through? Since I pretty much wrote this from scratch, it seems like the burden's on me right now to sit down and make sure I get this to a point where it reliably passes some sort of standard test, and it might take me a little while before I really get a chance to dig into it anyway. Is it a problem that this is an incomplete implementation? (We've yet to do structs. I'm assuming I'd have to do some of the basic IPC work but I don't imagine that being very hard.) |
@ExpandingMan, I'm happy to help w/ implementing the message protocol side of things. I've perused the spec a few times. Happy to coordinate efforts. |
@quinnj, any PR would be welcome, in fact I should make you an admin. I don't know what's needed for the message protocols, but I'd imagine it would be quite simple. Unless it would require us to implement structs, that I'd imagine would be at least a little bit of work. If you have any ideas about how to test whether we are compliant with what's in the main arrow repo, that is also extremely welcome. Realistically we need to do that before we can seriously think about pushing this over there. |
Building complete binary read and write support and also being able to interface with the JSON integration test format is a sort of big project, I'd guess at least 80 hours of development time. If it got done in less time than that, I would be extraordinarily impressed. @trxcllnt how much time do you think you spent on this on the JS side? Ultimately, implementations need to make their way into the integration test suite as proof of compliance https://github.com/apache/arrow/tree/master/integration -- at the moment we have C++, Java, and JavaScript running there. |
My thinking was that we are already a huge part of the way there. Looking at the test examples, I still believe that to be the case, although like I said, we haven't implemented structs. @quinnj, do you see any reason why this should be very far off? I'll look into running those tests. By the way, looks like it shouldn't be too difficult to deal with the package issues with the new package manager, see here. |
Maybe the best strategy would be to get the test stuff up and running first, and once julia 1.0 is released (I mean the final version, not the alpha), take a look whether the new package manager would make the monorepo easier? |
@ExpandingMan I'd say the sooner we can get the Julia code shepherded into the Apache project, the better; waiting is likely to create more work for the PMC if you start collecting a longer contributor list. It doesn't need to wait for integration tests |
At least it would be a good idea to switch to the Apache license immediately. |
Ok, at this point I definitely have every intent of moving this to the monorepo, but we have to figure out what's required on the technical side first. As I've said, I don't really see why we'd have If indeed this would require a large effort, I'm not sure the appetite would be there right now just to get this moved over to the monorepo as quickly as possible. We'll probably see a Julia 0.7 release candidate today and most of us on the Julia side will probably be busy making sure all the packages we use run on it without issue (should be mostly done for this package). I already use Arrow.jl for Feather.jl every day without issue, so at least for me there's not some huge urgency to change things. That said, if the effort required is more modest I'd like to try to finish the work within the next few weeks or so. Since most of us already use so much stuff on Apache License 2.0 anyway, I have no objection to changing the license for this immediately. What would be involved in this? Just changing the license file? |
Having now seen 3 implementations (one of which I did) go through the process of building this and getting it working with all the integration tests, I'm sticking with my estimation of the effort involved based on what is built so far. It's possible you could do it in less time, but I imagine in the course of doing so you would want to add a healthy amount of new unit tests and spend some time thinking about abstractions related to memory management / zero-copy. Changing the license now does not help. The main task will be determining the ownership of the IP, any third party licenses for code that you did not author yourselves, and obtaining the consent of the IP owners (i.e. filing some CLAs) to move the code to the Apache foundation. |
There actually is already some of this in there, the copying semantics are completely predictable and users have full control over when it occurs. Anyway, point taken, I'll trust your judgment that a huge amount of work remains on this. To be brutally honest, I don't have any motivation to undertake a big project on this right now, there are lots of other things I could be working on that would be a lot more useful to me, and what's already here should already be plenty useful for a variety of Julia packages already (I hope). I would like to compile a rough list of the remaining work, however. To that end, I may spend some time looking at the integration tests this weekend. @quinnj , perhaps you already have a much better feel for what's missing than I do? In the meantime, please just know that this package is here, and I am completely willing to donate it to the arrow monorepo, so if anyone comes to you guys expressing interest in Julia support for arrow, please make them aware that this is a (perhaps small) head start. |
IIRC the bulk of the time was spent in developer UX/build tooling. We went to great lengths to make sure the integration tests execute against all the JS build targets, and I spent a bunch of time fixing issues around that. The actual integration runner validation tests are quite slim. Even recently doing the IPC Writer, the most difficult bit of the integration work was adapting node's chunked Stream APIs to the Arrow message format. The single largest boost to my velocity was adding the ability to dump out all the test JSON via the integration runner, then writing out the corresponding Java/C++ arrow files/streams to disk. This allowed me to rapidly iterate on the JS tests in isolation, and use the integration script later in the process. If you want to take advantage of the commands we put in to do this, here's the process:
# build arrow-cpp
cd ~/arrow/cpp
mkdir build && cd build
cmake ..
# target `all` builds the integration test commands json-to-arrow, file-to-stream etc.
make all
# build arrow-java
cd ~/arrow/java
mvn install
# init arrow-js (needs node v9+)
cd ~/arrow/js
npm install
# generate integration test JSON via integration.py
# generates file/stream binary files for both C++ and Java
npm run create:testdata
# check that the files were written
ls -lR ~/arrow/js/test/data
# should see the following dirs:
# cpp
# java
# json
You might get an error at Cheers! |
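The final "check that the files were written" step above could also be scripted rather than eyeballed. This is a minimal sketch, assuming only the three directory names listed in the steps (`cpp`, `java`, `json`); the helper name `check_testdata` is made up for illustration and is not part of the Arrow tooling.

```shell
# check_testdata DIR: succeed only if DIR contains the cpp, java and json
# subdirectories; report each missing one on stderr.
check_testdata() {
    dir=$1
    status=0
    for sub in cpp java json; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing: $dir/$sub" >&2
            status=1
        fi
    done
    return "$status"
}

# Example call, using the path from the steps above.
if check_testdata "$HOME/arrow/js/test/data"; then
    echo "test data directories present"
else
    echo "test data incomplete"
fi
```

It only checks that the directories exist, not that the generated files inside them are valid.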
Thanks, that's useful information. Having looked into this a little bit more today, I definitely realize that one vital thing that I haven't implemented is the IPC spec. I still believe that alone shouldn't be terribly difficult as I tried to design the things I have implemented to make something like that easy. Perhaps I'm kidding myself, we'll see. Hopefully I'll spend some time on it this weekend, I definitely feel this package should have that, even if I don't have an immediate use for it myself. |
I'm a little late to the party, but I'd like to put my vote in for joining the monorepo. I think the testing benefits speak for themselves and the opportunity to have more "arrow trained" eyes and hands on the code will be a huge win |
I would like to see this move to Apache as well, so that questions can be part of the mailing list. At the risk of overcommitting, MapD has a vested interest in having the full Arrow implementation available in Julia as we try to build a MapD julia package. Right now, I'm at the intersection of most people not knowing Julia AND not being able to ask the canonical group of folks about Arrow. The former can be solved pretty easily, the latter being the more important. |
Thanks @randyzwitch. Clearly, I would like to see the Julia community involved. The world is very small, and with open source being as hard as it is to build, we would be much stronger working together than in isolation |
Again, in case it wasn't clear or got lost in the course of this thread, I have no objections. Clearly what remains to be done is to write some code for loading the standard arrow IPC metadata formats using FlatBuffers.jl and using them to construct the necessary objects. We haven't implemented arrow structs yet and I'm unsure whether all of this is possible without them, though I think loading arrays should be very simple. I am rather confused over what exactly constitutes valid Arrow metadata however: I would like to understand why the feather metadata format and those described in the arrow documentation look so different. At this point I feel sufficiently confused by the documentation that I would probably have to dig into the source code to really understand how to implement the arrow IPC metadata... Again, I'm not going to work on this in the near future since I don't have any immediate need for it, but if anyone wants to put in the effort I'd be happy to spend the time to read and test PR's, even if they are only bits and pieces. |
I think I would have time in about a month or two to really put some effort in to get this integrated w/ apache arrow proper. That shouldn't dissuade anyone else from having a go; I'm interested in contributing, but I have a bit of a backlog at the moment before I could push on this. |
Cool. The best way to transition would be to set up a PR with the code donation and we can conduct the IP clearance which won't take long. We will also need to discuss it on the mailing list. Contributors will need to submit ICLAs to the ASF secretary |
Hello all, just wanted to echo a brief exchange I had with Wes on the Arrow mailing list: I have undertaken finishing this in earnest. While I'm at it, I've undertaken a cleanup and overhaul that will make this package more suitable for implementing the full standard. I expect things to go quickly once I have the metadata completely cracked, but for now I'm having a very suspicious problem with FlatBuffers.jl, and it's looking like I might have to make some PR's there to get this working (I'll of course post an issue over there once I have everything together). Anyway, I'll invite more scrutiny once I am a little farther along. When I get closer to tagging, I'll be in a better position to inquire about help getting this into the mono-repo. I can't say what my timeline is, but I have lots of use for a completed version of this package, so I'm fairly committed to getting it done. Stay tuned. |
Does anyone know what |
That was definitely one of my concerns: I'm not aware of any good options right now on how the Julia package manager would deal with this being in the mono-repo. |
I think the monorepo issue is a distraction from the governance and community organization issues. If there is some structural problem that prevents a Julia codebase from existing within a repository subdirectory, then we can create a separate git repository. But I view it as undesirable because it will make integration testing much more difficult |
@ExpandingMan I'm not sure how far you are into your development of a pure Julia implementation of the spec, but I noticed that Cxx.jl was recently (~a month ago) fixed up to work with Julia 1.0+. I was wondering if wrapping the C++ libraries would present an easy way of getting the main features implemented (as python does), with the option to then slowly rewrite parts in Julia over time? (I'm a bit out of the loop so this comment may be misguided - equally don't know if this is the best place to post this) |
That's not misguided, there are many advantages to simply having a wrapper with Cxx.jl. I am a little bit concerned with the status of Cxx.jl until it matures a bit more. We just lost JavaCall.jl until Julia 2.0 because of an issue with the JVM, that's a loss which hurts (though I'd make the trade for Cxx). I had made what I felt was good progress re-writing this. Last I worked on it, I was in the process of writing functions for ingesting the flatbuffer metadata, and was successfully able to read all the metadata I had output from C++. I then went dark for a while mostly because I have been going through a phase of being extremely busy at my job for a good while now. I still intend on going back and finishing it, but unfortunately it's something I have to do entirely in my free time. I have considered a hybrid approach of reading the metadata using C++, but implementing the bulk of the data ingestion in Julia with |
Thanks, makes sense. The JavaCall issue is actually really annoying - hadn't seen that. I have been building up a reasonable codebase around some Java APIs on 1.0, and was going to upgrade soon.. |
How is the implementation coming along? Can this be used to make use of the Apache Plasma store? |
Perhaps you want to ask at https://github.com/JuliaData/Arrow.jl |
Yeah this package is deprecated now in favor of the JuliaData one, I'm pleased to say. The new one is based partially on my rewrite of this but is mostly original work by @quinnj . I'll archive this package and put a notification on the README when I get a chance. |
I'm opening this issue in case anything should be done to ensure we have collaboration with the rest of the Apache Arrow organization. I'll join the mailing list today and reach out, but I'm not sure what if anything else should be done.
I suspect we'll meet with some skepticism because I didn't spend very much time looking at the C++ arrow code when I wrote this, so ensuring that this package has identical behaviors to those wrapping C++ is potentially difficult.
I think it's pretty safe to say that those of us primarily using Julia consider it very important to have a pure Julia implementation, in my view it would be very unfortunate if we wound up replacing parts of this with a wrapper. It's just much easier and nicer to work on if it's in pure Julia, and one of the goals of the language is to be able to replace C++ in this sort of use case. At the core of the Julia implementation should probably always be a "maximally efficient" AbstractVector for referencing Arrow data.
Otherwise, suggested changes to be more compliant with arrow standards are certainly welcome.
Will post back here if there's any relevant news.
@quinnj @sglyon @davidanthoff @wesm