collaboration with Apache Arrow org #28

Open
ExpandingMan opened this issue May 30, 2018 · 45 comments

@ExpandingMan
Owner

I'm opening this issue in case anything should be done to ensure we have collaboration with the rest of the Apache Arrow organization. I'll join the mailing list today and reach out, but I'm not sure what if anything else should be done.

I suspect we'll meet with some skepticism because I didn't spend very much time looking at the C++ arrow code when I wrote this, so ensuring that this package has identical behaviors to those wrapping C++ is potentially difficult.

I think it's pretty safe to say that those of us primarily using Julia consider it very important to have a pure Julia implementation. In my view it would be very unfortunate if we wound up replacing parts of this with a wrapper. It's just much easier and nicer to work on if it's in pure Julia, and one of the goals of the language is to be able to replace C++ in this sort of use case. At the core of the Julia implementation should probably always be a "maximally efficient" AbstractVector for referencing Arrow data.

Otherwise, suggested changes to be more compliant with arrow standards are certainly welcome.

Will post back here if there's any relevant news.

@quinnj @sglyon @davidanthoff @wesm

@wesm

wesm commented May 30, 2018

The next steps would be to collect any Arrow-relevant IP and make a code donation to the Apache Arrow project (like we did with Go http://incubator.apache.org/ip-clearance/arrow-go-library.html and Ruby), then continue development there. Let me know if you want to proceed with that, and I'll be ready to help.

@ExpandingMan
Owner Author

Thanks for your help!

In general I find all the legalistic tomfoolery horrifying, but I won't let that stop us unless there actually is some problem here.

Would this involve abandoning the MIT license in favor of some form of Apache license? Is the apache license in any sense more restrictive? Any thoughts from the other Julia people on this?

Any feeling for whether the rest of the arrow community will be upset that this is not a wrapper? I'm hoping they might not care if they don't use Julia, but admittedly I won't be particularly happy if I clone this repo a year from now and find that someone completely rewrote it as a wrapper.

I'm also pretty concerned that making this part of the arrow repository will make it a 100 times harder to work on than if it were e.g. in https://github.com/JuliaIO. (Again, feedback is welcome here, I wasn't necessarily planning on doing much more work on this in the near future, so I'm definitely open to input from anyone who feels they may be a potential contributor.)

@ExpandingMan
Owner Author

Tagging some more Julia people that may have interest in this, apologies if you don't

@ararslan @ihnorton @nalimilan @tanmaykm

(feel free to add others)

@xhochy

xhochy commented May 30, 2018

(Arrow developer/PMC, hoping to also answer some questions. I haven't used Julia in 5 years, so I count as rather non-knowledgeable in that space.)

> In general I find all the legalistic tomfoolery horrifying, but I won't let that stop us unless there actually is some problem here.

Given that most commits are from you and the other two contributors only contributed small changes, this will be very easy at the current stage. It will involve some "paperwork" emails, but they should be straightforward.

> Would this involve abandoning the MIT license in favor of some form of Apache license? Is the apache license in any sense more restrictive? Any thoughts from the other Julia people on this?

Yes, this would involve changing the license to Apache. The Apache license is much longer in terms of text, but it is generally as widely accepted as the MIT license. If you are concerned, there are plenty of short descriptions of what the Apache license is about; reading one of them should help.

> Any feeling for whether the rest of the arrow community will be upset that this is not a wrapper?

No, in the Arrow repository we already have distinct implementations in JavaScript, Java, Go, C++ and Rust. They share no code, but we do have integration tests between them to ensure that they are all compatible. That is one of the main advantages of having them all in a single repository: to ensure that they all work together even though they don't share anything.

> I'm also pretty concerned that making this part of the arrow repository will make it a 100 times harder to work on than if it were e.g. in https://github.com/JuliaIO.

What are the benefits of doing it in JuliaIO?

@wesm

wesm commented May 30, 2018

> In general I find all the legalistic tomfoolery horrifying, but I won't let that stop us unless there actually is some problem here.

When you're talking about software redistribution, being careful about IP lineage and third party code makes everyone's lives (especially the lawyers') a lot easier in the long run. The process itself is not onerous despite any appearances.

> Would this involve abandoning the MIT license in favor of some form of Apache license? Is the apache license in any sense more restrictive? Any thoughts from the other Julia people on this?

The main difference between Apache 2.0 and MIT license is that the Apache license provides a patent grant to users, which provides additional security / peace of mind for developers who may produce commercial software that depends on a project. MIT and Apache 2.0 are compatible from a code reuse perspective as permissive licenses.

> I'm also pretty concerned that making this part of the arrow repository will make it a 100 times harder to work on than if it were e.g. in https://github.com/JuliaIO.

I'm curious what might be perceived as "harder". No one (out of > 150 contributors) has been having a hard time contributing to the project as far as I can tell -- I think that "use pull requests and do not break the build" is a reasonable bar of professionalism for an open source contributor. We also ask that contributors help maintain an intelligible change log for the project and write commit messages / PR descriptions that explain the content of their work.

@ExpandingMan
Owner Author

Thanks guys. I think that's good enough for me.

I suppose my consternation about the project being harder to work on in the arrow repo just comes from the fact that the Julia community is relatively small, and this is something I've grown pretty comfortable with. No worries.

It would probably be convenient if we could keep a mirror in an Arrow.jl repo somewhere, as that would probably simplify its installation with the Julia package manager, not sure what that would look like but I've seen such things done elsewhere.

Ok, let's give people some time to respond with comments and then move ahead. Thanks again!

@wesm

wesm commented May 30, 2018

> It would probably be convenient if we could keep a mirror in an Arrow.jl repo somewhere, as that would probably simplify its installation with the Julia package manager, not sure what that would look like but I've seen such things done elsewhere.

Seems like what we'd want to do after voting on a Julia release is to push the changes and tag the new release in a repo that's connected to the Julia package management system (it seems like GitHub -- or git repositories at least -- and Julia packages are intertwined, is that right?).

@davidanthoff
Contributor

I'm not really convinced that moving the code into the main arrow repo is a good idea. Here are some issues I see:

  • The julia package manager right now is really built around the notion of one repo per package. Deviating from that will make things a lot more complicated (if it is possible at all). Aside from just the package manager, there are lots and lots of other tools in the julia ecosystem that make the same assumption, and it would be really, really painful to move away from that (things like editors, the whole CI infrastructure etc.)
  • The julia package presumably would have to version in sync with all the other arrow implementations. I don't think that is desirable at this point, I think the julia package at this point needs to be able to cut a release on its own schedule.
  • I think realistically we probably will have more contributors to this package from regular julia contributors. Those are very familiar with the current setup (one repo per package, everything conforms to the normal julia package patterns), and I think we would make it less likely to get PRs from folks if the code was moved into a different repo with lots of other stuff. There is of course a flip side to this point, namely that we would probably be more likely to get PRs from regular arrow contributors if the code lived in the main arrow repo. In my unscientific, subjective judgment we are more likely to get PRs from julia folks, though, so I think we should make it easy for them.

I do think it would be fantastic to make sure we sync the testing story somehow. But I don't think we need to go to one repo for that, there seem to be lots of options to make sure stuff gets tested in sync while maintaining the current repo structure.

On the license: that to me seems mostly extra work that at least I don't want to deal with ;) If folks want to change it, go ahead... But if we keep the repos distinct, I don't see a need to change the license. There is also again a situation that within the julia ecosystem the MIT license is by far the most used, and it might be easier to just stick with that. But, mostly, I'm agnostic about this point.

@wesm

wesm commented May 30, 2018

> The julia package manager right now is really built around the notion of one repo per package. Deviating from that will make things a lot more complicated (if it is possible at all).

I don't see why packaging and development process need to be tightly coupled. You only need to update the package manager when a release actually happens, and this update to the package manager should take < 1 minute to do.

> I think the julia package at this point needs to be able to cut a release on its own schedule.

We are cutting separate JavaScript releases already, so I don't think this is an issue

> I think we would make it less likely to get PRs from folks if the code was moved into a different repo with lots of other stuff.

This seems like FUD to me. I really question how valuable the contributions are from people who are put off by a "who moved my cheese" type of issue. This is already a pretty difficult project to contribute to on the spectrum of open source projects just by its low-level systems nature (data structures, binary protocols, file formats, serialization, etc.)

One could possibly create an apache/Arrow.jl repo, but this would make integration testing and other things a lot harder; the monorepo brings a lot of benefits when you are dealing with binary interoperability

@nalimilan

I agree it would make sense to move the Julia package to the same repo as all other implementations. The advantages of that approach may not be striking right now, but I imagine it will make it much easier in the long run to keep all implementations in sync while the Arrow format evolves. It would make sense to have some kind of mirror repo to work with the Julia package manager, though.

@ExpandingMan
Owner Author

The concept of a Julia package is indeed closely tied to git repositories (but not necessarily GitHub). In principle it's possible to make a package out of any type of repository, but it does greatly simplify things if the repo is in the standard format like you see for Arrow.jl right now.

I feel like this problem is probably solvable with git and GitHub, though. Isn't it possible to make a mirror of a directory or something? I seem to remember seeing some setups like that but can't find any good examples at the moment.

I think at this point it is worth asking though: why is there such eagerness to have everything in a single repository? In my experience this actually makes things more difficult, not easier. I don't understand the thinking behind why it is so important to have so many languages in that one repo. We work on things in separate repos all the time and it's never an issue in the slightest, even when it involves IO and binary stuff.

Regardless, I'm willing to donate it to that repo as long as we have some sort of "Arrow.jl", however that would work.

@davidanthoff
Contributor

> I don't see why packaging and development process need to be tightly coupled.

Well, it is on julia, for all practical purposes.

> You only need to update the package manager when a release actually happens

That is not how it works on julia. The whole workflow of downloading a package, working on a dev version of the package etc. is tightly integrated with the package manager. We also would no longer have access to most of the tool integration we have if the code moved into the arrow repo (things like attobot, femtocleaner, integrated testing & build in VS Code, Documenter.jl, Coverage.jl and probably more stuff).

> We are cutting separate JavaScript releases already, so I don't think this is an issue

Ah, so you are using tags that include things like apache-arrow-js in the name, right? The tooling on the julia side expects tags to be of the form v0.1.0, so that is another thing we would have to manually sort out...
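For illustration, the tag mismatch described above could be bridged by a small mapping step when mirroring releases. The sketch below is only that: the `apache-arrow-julia` prefix and the `to_julia_tag` helper are hypothetical, and the real monorepo tag format may differ.

```python
import re
from typing import Optional

# Hypothetical monorepo tag shape: "<prefix>-<major>.<minor>.<patch>".
# The default prefix "apache-arrow-julia" is an assumption for illustration.
MONOREPO_TAG = re.compile(r"^(?P<prefix>[a-z][a-z-]*)-(?P<version>\d+\.\d+\.\d+)$")


def to_julia_tag(tag: str, prefix: str = "apache-arrow-julia") -> Optional[str]:
    """Return a 'vX.Y.Z' tag for a matching monorepo tag, else None."""
    match = MONOREPO_TAG.match(tag)
    if match is None or match.group("prefix") != prefix:
        return None
    return "v" + match.group("version")


# Made-up tag names; only the Julia-prefixed one is translated.
tags = ["apache-arrow-0.9.0", "apache-arrow-js-0.3.1", "apache-arrow-julia-0.1.0"]
julia_tags = [t for t in (to_julia_tag(tag) for tag in tags) if t is not None]
print(julia_tags)
```

A mirror script could run such a mapping after each release and create the `vX.Y.Z` tag in the stand-alone repository, leaving the monorepo tags untouched.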

> This seems like FUD to me. I really question how valuable the contributions are from people who are put off by a "who moved my cheese" type of issue. This is already a pretty difficult project to contribute to on the spectrum of open source projects just by its low-level systems nature (data structures, binary protocols, file formats, serialization, etc.)

I think in reality we would constantly get PRs against the repo that the package manager knows about, because that is the workflow that every julia documentation recommends. I think that would be cumbersome. I also think it is quite vital to make it easy to open PRs even for contributors that are not low level super devs. There are lots of little things in the Arrow.jl codebase where even less experienced devs can help, and I would not want to lose them. A good example is the one commit I contributed to this repo. I think pretty much anyone could have done that, certainly no low level coding experience required.

> One could possibly create an apache/Arrow.jl repo, but this would make integration testing and other things a lot harder; the monorepo brings a lot of benefits when you are dealing with binary interoperability

I think a productive conversation at this point would be what kind of testing we would like to see. Once we have figured that out, we can see whether that would be easier with two repos or a monorepo.

I don't really know what type of integration testing there is right now in the arrow repo between the different implementations, I think it would be really helpful to understand that better.

@wesm

wesm commented May 30, 2018

> I don't understand the thinking behind why it is so important to have so many languages in that one repo.

This project is held together by its inter-language binary integration tests. A single pull request may affect multiple implementations -- if we split the project up into multiple git repos, effectively we would have a network of circular dependencies in the CI. So if you needed to make a change that affected both Java and C++, you would need a way to integration-test two PRs jointly. If you merge a breaking change in one repo, all the builds in the other repo break. By working in a monorepo, we maintain harmonious CI builds across all the subprojects.

As an example, the JavaScript developers recently merged support for emitting binary streams from JS to consume in other implementations: apache/arrow@fc7a382. In this patch, the CI verifies that the JavaScript data emitted can be consumed by Java and C++. As time goes on, the matrix of implementations producing and consuming data will grow larger. It's much easier to stay in sync this way
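To make the growth of that matrix concrete, here is a rough Python sketch that enumerates producer/consumer pairs. It assumes every implementation is tested against every other one in both directions, which may not match how the actual integration runner is configured; `integration_pairs` is a hypothetical helper.

```python
from itertools import product


def integration_pairs(implementations):
    """Every (producer, consumer) pair made of two distinct implementations."""
    return [(p, c) for p, c in product(implementations, repeat=2) if p != c]


current = ["C++", "Java", "JavaScript"]
with_julia = current + ["Julia"]

# The number of ordered pairs is n * (n - 1) for n implementations.
print(len(integration_pairs(current)))     # 6
print(len(integration_pairs(with_julia)))  # 12
```

Under that assumption, adding one implementation to three doubles the number of pairs that have to stay compatible.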

@ExpandingMan
Owner Author

> This project is held together by its inter-language binary integration tests. A single pull request may affect multiple implementations -- if we split the project up into multiple git repos, effectively we would have a network of circular dependencies in the CI.

That sounds a little weird to me, as projects depend critically on other projects in different repos all the time, I don't quite see why this should be a special case. My understanding was that proper versioning was supposed to take care of this sort of thing. Right now there are (more than one) Julia Feather repos that depend on this package, if I break one of them I consider myself to have screwed up, just like if a numpy tag breaks pandas those guys would have screwed up. (Though @wesm undoubtedly has about a billion times more experience than I do with this sort of thing, so perhaps I don't know what I'm talking about.)

Anyway, I really don't want to get us all into an argument here about the relative merits of how this gets set up, my interest is just in making sure this gets maintained and letting people know that they can integrate their Arrow stuff with Julia. I think having some sort of mirror repo or something might solve that. I'll ask around on Julia discourse about whether anyone has done something like this. I'm imagining an "arrow/Arrow.jl" that somehow links to the appropriate part of the arrow repo.

@wesm

wesm commented May 30, 2018

The relationship between the different Arrow implementations is different from a normal package dependency. We have made deliberate changes to the binary format over the last 2.5 years, and it's a lot of work to keep everything in sync -- it sounds like the burden will be on the Julia-Arrow developers to create bespoke tooling to assist with integration testing against the other implementations. There will likely be cases where the Julia implementation will "slip" and become incompatible with the other implementations, as it will not be possible in all cases for CI to force the incompatibility to be resolved with each patch contributed to the project. We could set up some nightly builds to at least complain to the mailing list if something is broken in any 24 hour time span.

This is to say, all of this discussion is moot until the Julia implementation is able to consume and produce valid Arrow binary messages, and validate them against a source of truth (i.e. the JSON integration test format developed).
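As a rough picture of what validating against a JSON source of truth can look like, the sketch below compares decoded columns with a JSON description. The field names (`columns`, `VALIDITY`, `DATA`) and the `check_batch` helper are simplifying assumptions for illustration, not the actual integration test format or runner.

```python
import json


def expected_column(column):
    """Expand a JSON column description into a list, using None for nulls."""
    return [
        value if valid else None
        for value, valid in zip(column["DATA"], column["VALIDITY"])
    ]


def check_batch(json_text, decoded):
    """Return the names of columns whose decoded values differ from the JSON."""
    batch = json.loads(json_text)
    mismatches = []
    for column in batch["columns"]:
        name = column["name"]
        if decoded.get(name) != expected_column(column):
            mismatches.append(name)
    return mismatches


# Made-up reference data: column "a" has a null in the middle.
reference = json.dumps({
    "count": 3,
    "columns": [
        {"name": "a", "count": 3, "VALIDITY": [1, 0, 1], "DATA": [10, 0, 30]},
        {"name": "b", "count": 3, "VALIDITY": [1, 1, 1], "DATA": [1.5, 2.5, 3.5]},
    ],
})

# Pretend output of a reader under test; column "b" is deliberately wrong.
decoded = {"a": [10, None, 30], "b": [1.5, 2.5, 0.0]}
print(check_batch(reference, decoded))
```

An implementation would pass when the list of mismatches is empty for every batch, in both the read and the write direction.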

@quinnj

quinnj commented May 30, 2018

FWIW, I'm all in favor of moving the code under the apache arrow monorepo. We've been dancing around on the outside for too long and it's more than worth the little extra packaging/release work to be more tightly integrated w/ the rest of the arrow community. We have plenty of options to work around any current julia package system quirks, like mirroring a stand-alone repo that we would sync w/ releases. And in 0.7 (which should be officially tagged alpha today!), the new package manager is much more flexible w/ where package source code lives & how it is structured.

@ExpandingMan
Owner Author

ExpandingMan commented May 30, 2018

> This is to say, all of this discussion is moot until the Julia implementation is able to consume and produce valid Arrow binary messages, and validate them against a source of truth (i.e. the JSON integration test format developed).

Yes, that is a very good point and I had a feeling that we were getting too far away from it.

Is there some sort of standard set of tests we can run through? Since I pretty much wrote this from scratch, it seems like the burden's on me right now to sit down and make sure I get this to a point where it reliably passes some sort of standard test, and it might take me a little while before I really get a chance to dig into it anyway.

Is it a problem that this is an incomplete implementation? (We've yet to do structs. I'm assuming I'd have to do some of the basic IPC work but I don't imagine that being very hard.)

@quinnj

quinnj commented May 30, 2018

@ExpandingMan, I'm happy to help w/ implementing the message protocol side of things. I've perused the spec a few times. Happy to coordinate efforts.

@ExpandingMan
Owner Author

@quinnj, any PR would be welcome, in fact I should make you an admin. I don't know what's needed for the message protocols, but I'd imagine it would be quite simple, unless it would require us to implement structs, which I'd imagine would be at least a little bit of work. If you have any ideas about how to test whether we are compliant with what's in the main arrow repo, that is also extremely welcome. Realistically we need to do that before we can seriously think about pushing this over there.

@wesm

wesm commented May 30, 2018

Building complete binary read and write support and also being able to interface with the JSON integration test format is a sort of big project, I'd guess at least 80 hours of development time. If it got done in less time than that, I would be extraordinarily impressed. @trxcllnt how much time do you think you spent on this on the JS side?

Ultimately, implementations need to make their way into the integration test suite as proof of compliance https://github.com/apache/arrow/tree/master/integration -- at the moment we have C++, Java, and JavaScript running there.

@ExpandingMan
Owner Author

ExpandingMan commented May 30, 2018

> Building complete binary read and write support and also being able to interface with the JSON integration test format is a sort of big project, I'd guess at least 80 hours of development time.

My thinking was that we are already a huge part of the way there. Looking at the test examples, I still believe that to be the case, although like I said, we haven't implemented structs. @quinnj, do you see any reason why this should be very far off?

I'll look into running those tests.

By the way, looks like it shouldn't be too difficult to deal with the package issues with the new package manager, see here.

@davidanthoff
Contributor

Maybe the best strategy would be to get the test stuff up and running first, and once julia 1.0 is released (I mean the final version, not the alpha), take a look whether the new package manager would make the monorepo easier?

@wesm

wesm commented May 30, 2018

@ExpandingMan I'd say the sooner we can get the Julia code shepherded into the Apache project, the better; waiting is likely to create more work for the PMC if you start collecting a longer contributor list. It doesn't need to wait for integration tests

@nalimilan

At least it would be a good idea to switch to the Apache license immediately.

@ExpandingMan
Owner Author

Ok, at this point I definitely have every intent of moving this to the monorepo, but we have to figure out what's required on the technical side first. As I've said, I don't really see why we'd have $\ge 80$ man-hours of work ahead of us to have a minimal implementation that's useful for IPC and compliant with the tests in that package, but I don't really know what I'm talking about so perhaps there's some big piece of this that I'm missing.

If indeed this would require a large effort, I'm not sure the appetite would be there right now just to get this moved over to the monorepo as quickly as possible. We'll probably see a Julia 0.7 release candidate today and most of us on the Julia side will probably be busy making sure all the packages we use run on it without issue (should be mostly done for this package). I already use Arrow.jl for Feather.jl every day without issue, so at least for me there's not some huge urgency to change things. That said, if the effort required is more modest I'd like to try to finish the work within the next few weeks or so.

Since most of us already use so much stuff under the Apache License 2.0 anyway, I have no objection to changing the license for this immediately. What would be involved in this? Just changing the license file?

@wesm

wesm commented May 31, 2018

> As I've said, I don't really see why we'd have $\ge 80$ man-hours of work ahead of us to have a minimal implementation that's useful for IPC and compliant with the tests in that package, but I don't really know what I'm talking about so perhaps there's some big piece of this that I'm missing.

Having now seen 3 implementations (one of which I did) go through the process of building this and getting it working with all the integration tests, I'm sticking with my estimation of the effort involved based on what is built so far. It's possible you could do it in less time, but I imagine in the course of doing so you would want to add a healthy amount of new unit tests and spend some time thinking about abstractions related to memory management / zero-copy.

Changing the license now does not help. The main task will be determining the ownership of the IP, any third party licenses for code that you did not author yourselves, and obtaining the consent of the IP owners (i.e. filing some CLAs) to move the code to the Apache foundation.

@ExpandingMan
Owner Author

> spend some time thinking about abstractions related to memory management / zero-copy

There actually is already some of this in there, the copying semantics are completely predictable and users have full control over when it occurs.
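The idea of predictable, user-controlled copying can be illustrated outside Julia as well. The following Python sketch (not Arrow.jl code; the buffer contents are made up) contrasts a zero-copy view over a byte buffer with an explicit copy.

```python
from array import array

# A byte buffer standing in for memory-mapped Arrow data (native C ints).
buffer = bytearray(array("i", [1, 2, 3, 4]).tobytes())

# Zero-copy: the view reinterprets the same bytes as integers.
view = memoryview(buffer).cast("i")

# Explicit copy: the caller decides when the data is materialized.
copied = list(view)

# Writing through the view changes the underlying buffer, not the copy.
view[0] = 99
first_in_buffer = array("i", bytes(buffer))[0]
print(first_in_buffer, copied[0])
```

The view plays the role that an `AbstractVector` over Arrow memory would play in Julia: element access reads the original bytes, and a copy happens only when the user asks for one.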

Anyway, point taken, I'll trust your judgment that a huge amount of work remains on this. To be brutally honest, I don't have any motivation to undertake a big project on this right now. There are lots of other things I could be working on that would be a lot more useful to me, and what's already here should be plenty useful for a variety of Julia packages (I hope).

I would like to compile a rough list of the remaining work, however. To that end, I may spend some time looking at the integration tests this weekend. @quinnj , perhaps you already have a much better feel for what's missing than I do?

In the meantime, please just know that this package is here, and I am completely willing to donate it to the arrow monorepo, so if anyone comes to you guys expressing interest in Julia support for arrow, please make them aware that this is a (perhaps small) head start.

@trxcllnt

trxcllnt commented May 31, 2018

> @trxcllnt how much time do you think you spent on this on the JS side?

IIRC the bulk of the time was spent in developer UX/build tooling. We went to great lengths to make sure the integration tests execute against all the JS build targets, and I spent a bunch of time fixing issues around that.

The actual integration runner validation tests are quite slim. Even recently doing the IPC Writer, the most difficult bit of the integration work was adapting node's chunked Stream APIs to the Arrow message format.

The single largest boost to my velocity was adding the ability to dump out all the test JSON via the integration runner, then writing out the corresponding Java/C++ arrow files/streams to disk. This allowed me to rapidly iterate on the JS tests in isolation, and use the integration script later in the process.

If you want to take advantage of the commands we put in to do this, here's the process:

    # build arrow-cpp
    cd ~/arrow/cpp
    mkdir build && cd build
    cmake ..
    # target `all` builds the integration test commands json-to-arrow, file-to-stream etc.
    make all

    # build arrow-java
    cd ~/arrow/java
    mvn install

    # init arrow-js (needs node v9+)
    cd ~/arrow/js
    npm install

    # generate integration test JSON via integration.py
    # generates file/stream binary files for both C++ and Java
    npm run create:testdata

    # check that the files were written
    ls -lR ~/arrow/js/test/data
    # should see the following dirs:
    # cpp
    # java
    # json

You might get an error at npm run create:testdata if you don't have python 3.x aliased as python3. If you do, you can edit this line to be the proper command on your system, then re-run the create:testdata command.
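Once the data is on disk, iterating in isolation mostly means pairing each JSON description with the binary files produced from it. A minimal sketch, assuming files in `cpp/` and `java/` share the base name of their JSON source (the naming scheme, the `.arrow` extension and the `pair_test_files` helper are assumptions; only the three directory names come from the listing above):

```python
import tempfile
from pathlib import Path


def pair_test_files(root):
    """Map each JSON test name to same-named files under cpp/ and java/."""
    root = Path(root)
    pairs = {}
    for json_file in sorted((root / "json").glob("*.json")):
        stem = json_file.stem
        pairs[stem] = sorted(
            path.relative_to(root).as_posix()
            for producer in ("cpp", "java")
            for path in (root / producer).glob(stem + ".*")
        )
    return pairs


# Build a throwaway directory tree with made-up file names to show the result.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("json/simple.json", "cpp/simple.arrow", "java/simple.arrow"):
        path = Path(tmp) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    pairs = pair_test_files(tmp)

print(pairs)
```

A test loop for a new implementation could then read each binary file and compare the result with the JSON of the same name.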

Cheers!

@ExpandingMan
Owner Author

Thanks, that's useful information.

Having looked into this a little bit more today, I definitely realize that one vital thing that I haven't implemented is the IPC spec. I still believe that alone shouldn't be terribly difficult as I tried to design the things I have implemented to make something like that easy. Perhaps I'm kidding myself, we'll see. Hopefully I'll spend some time on it this weekend, I definitely feel this package should have that, even if I don't have an immediate use for it myself.

@sglyon
Contributor

sglyon commented Jun 5, 2018

I'm a little late to the party, but I'd like to put my vote in for joining the monorepo. I think the testing benefits speak for themselves and the opportunity to have more "arrow trained" eyes and hands on the code will be a huge win

@randyzwitch

I would like to see this move to Apache as well, so that questions can be part of the mailing list. At the risk of overcommitting, MapD has a vested interest in having the full Arrow implementation available in Julia as we try to build a MapD julia package. Right now, I'm at the intersection of most people not knowing Julia AND not being able to ask the canonical group of folks about Arrow. The former can be solved pretty easily, the latter being the more important.

(cc: @tmostak @wesm )

@wesm

wesm commented Jul 2, 2018

Thanks @randyzwitch. Clearly, I would like to see the Julia community involved. The world is very small, and with open source being as hard as it is to build, we would be much stronger working together than in isolation

@ExpandingMan
Owner Author

Again, in case it wasn't clear or got lost in the course of this thread, I have no objections.

Clearly what remains to be done is to write some code for loading the standard arrow IPC metadata formats using FlatBuffers.jl and using them to construct the necessary objects. We haven't implemented arrow structs yet and I'm unsure whether all of this is possible without them, though I think loading arrays should be very simple. I am rather confused over what exactly constitutes valid Arrow metadata however: I would like to understand why the feather metadata format and those described in the arrow documentation look so different. At this point I feel sufficiently confused by the documentation that I would probably have to dig into the source code to really understand how to implement the arrow IPC metadata...
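One small, concrete difference between the two on-disk layouts is the magic string that brackets each file. To my understanding, Arrow files begin and end with `ARROW1`, while Feather (version 1) files use `FEA1`. The sketch below only sniffs those bytes and does not parse any metadata; `sniff_format` is a hypothetical helper and the sample payloads are made up.

```python
ARROW_FILE_MAGIC = b"ARROW1"
FEATHER_V1_MAGIC = b"FEA1"


def sniff_format(data: bytes) -> str:
    """Guess the container format from the leading and trailing magic bytes."""
    if data.startswith(ARROW_FILE_MAGIC) and data.endswith(ARROW_FILE_MAGIC):
        return "arrow-file"
    if data.startswith(FEATHER_V1_MAGIC) and data.endswith(FEATHER_V1_MAGIC):
        return "feather-v1"
    return "unknown"


# Placeholder bodies; real files carry FlatBuffers metadata between the magics.
print(sniff_format(b"ARROW1\x00\x00" + b"payload" + b"ARROW1"))
print(sniff_format(b"FEA1" + b"payload" + b"FEA1"))
print(sniff_format(b"something else"))
```

Everything between the magic strings is where the two formats diverge, so a check like this only tells a reader which metadata schema to expect next.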

Again, I'm not going to work on this in the near future since I don't have any immediate need for it, but if anyone wants to put in the effort I'd be happy to spend the time to read and test PR's, even if they are only bits and pieces.

@quinnj

quinnj commented Jul 2, 2018

I think I would have time in about a month or two to really put some effort in to get this integrated w/ apache arrow proper. That shouldn't dissuade anyone else from having a go. I'm interested in contributing, but I have a bit of a backlog at the moment before I could push on this.

@wesm

wesm commented Jul 2, 2018

Cool. The best way to transition would be to set up a PR with the code donation and we can conduct the IP clearance which won't take long. We will also need to discuss it on the mailing list. Contributors will need to submit ICLAs to the ASF secretary

@ExpandingMan
Owner Author

Hello all, just wanted to echo a brief exchange I had with Wes on the Arrow mailing list:

I have started finishing this in earnest. While I'm at it, I'm doing a cleanup and overhaul that will make this package more suitable for implementing the full standard. I expect things to go quickly once I have the metadata completely cracked, but for now I'm having a very suspicious problem with FlatBuffers.jl, and it's looking like I might have to make some PRs there to get this working (I'll of course post an issue over there once I have everything together).

Anyway, I'll invite more scrutiny once I am a little farther along. When I get closer to tagging, I'll be in a better position to inquire about help getting this into the mono-repo. I can't say what my timeline is, but I have lots of use for a completed version of this package, so I'm fairly committed to getting it done.

Stay tuned.

@davidanthoff
Contributor

Does anyone know what ]dev Arrow would do if the code was in the mono-repo?

@ExpandingMan
Owner Author

That was definitely one of my concerns: I'm not aware of any good options right now on how the Julia package manager would deal with this being in the mono-repo.

@wesm

wesm commented Mar 19, 2019

I think the monorepo issue is a distraction from the governance and community organization issues. If there is some structural problem that prevents a Julia codebase from existing within a repository subdirectory, then we can create a separate git repository. But I view it as undesirable because it will make integration testing much more difficult.

@kcajf

kcajf commented May 17, 2019

@ExpandingMan I'm not sure how far you are into your development of a pure Julia implementation of the spec, but I noticed that Cxx.jl was recently (~a month ago) fixed up to work with Julia 1.0+. I was wondering if wrapping the C++ libraries would be an easy way of getting the main features implemented (as Python does), with the option to then slowly rewrite parts in Julia over time?

(I'm a bit out of the loop so this comment may be misguided. Equally, I don't know if this is the best place to post this.)

@ExpandingMan
Owner Author

That's not misguided; there are many advantages to simply having a wrapper with Cxx.jl. I am a little bit concerned about the status of Cxx.jl until it matures a bit more. We just lost JavaCall.jl until Julia 2.0 because of an issue with the JVM, and that's a loss which hurts (though I'd make the trade for Cxx).

I had made what I felt was good progress re-writing this. Last I worked on it, I was in the process of writing functions for ingesting the flatbuffer metadata, and was successfully able to read all the metadata I had output from C++. I then went dark for a while mostly because I have been going through a phase of being extremely busy at my job for a good while now. I still intend on going back and finishing it, but unfortunately it's something I have to do entirely in my free time.

I have considered a hybrid approach of reading the metadata using C++, but implementing the bulk of the data ingestion in Julia with AbstractVectors like I have in my dev branch. I haven't looked closely enough at the C++ API to know how practical that is.
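
The hybrid idea above relies on a vector type that reads values straight out of an Arrow buffer. Below is a minimal sketch of that idea, written in Python only for illustration. The class name `NullableInt32View` and its offset parameters are hypothetical, and this is not how this package or any Arrow library is implemented. It assumes the Arrow layout for a nullable primitive array: a validity bitmap with least-significant-bit numbering, where a set bit means the slot holds a value, plus a buffer of little-endian fixed-width values. A Julia version would presumably be an `AbstractVector{Union{T,Missing}}` whose `getindex` does the same two lookups.

```python
import struct
from collections.abc import Sequence


class NullableInt32View(Sequence):
    """Read-only view over int32 values and a validity bitmap in one buffer."""

    def __init__(self, buf, validity_offset, values_offset, length):
        self._buf = memoryview(buf)              # no copy of the underlying bytes
        self._validity_offset = validity_offset  # byte offset of the bitmap
        self._values_offset = values_offset      # byte offset of the int32 values
        self._length = length                    # number of logical elements

    def __len__(self):
        return self._length

    def _is_valid(self, i):
        # Bit i lives in byte i // 8, at position i % 8 counted from the LSB.
        byte = self._buf[self._validity_offset + (i >> 3)]
        return (byte >> (i & 7)) & 1 == 1

    def __getitem__(self, i):
        if isinstance(i, slice):
            return [self[j] for j in range(*i.indices(self._length))]
        if i < 0:
            i += self._length
        if not 0 <= i < self._length:
            raise IndexError(i)
        if not self._is_valid(i):
            return None  # null slot; the stored value bytes are ignored
        return struct.unpack_from("<i", self._buf, self._values_offset + 4 * i)[0]


# Bitmap 0b00001101 marks elements 0, 2 and 3 as valid and element 1 as null.
# The bitmap is padded to 8 bytes here so the values start at offset 8.
buf = bytes([0b00001101]) + bytes(7) + struct.pack("<4i", 10, 0, 30, 40)
view = NullableInt32View(buf, validity_offset=0, values_offset=8, length=4)
values = list(view)
```

With this split, the metadata reader (C++ or otherwise) only has to supply the two offsets and the length; element access stays in the host language and never copies the buffer.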

@kcajf

kcajf commented May 17, 2019

Thanks, makes sense.

The JavaCall issue is actually really annoying; I hadn't seen that. I have been building up a reasonable codebase around some Java APIs on 1.0, and was going to upgrade soon.

@mkschulze

How is the implementation coming along? Can this be used to make use of the Apache Plasma store?

@ViralBShah

Perhaps you want to ask at https://github.com/JuliaData/Arrow.jl

@ExpandingMan
Owner Author

Yeah, this package is now deprecated in favor of the JuliaData one, I'm pleased to say. The new one is based partially on my rewrite of this but is mostly original work by @quinnj.

I'll archive this package and put a notification on the README when I get a chance.
