[DISCUSS] Reducing cadence of major arrow-rs releases introducing patch releases #5368
Comments
I have been taking a somewhat brute force approach to this, by simply reducing the cadence of releases. This isn't necessarily ideal, but it does somewhat avoid this issue. I think the key to making this work is to devise a minimally intrusive process where we can still make breaking changes while also maintaining patch releases. I don't really know the best way to achieve this. |
I once proposed an idea named Libraries relying on Most libraries only use a part of arrow's public API, which is relatively stable. However, if there are changes, libraries can update their MSAV, allowing downstream users to be aware of and adapt to these changes. Just so you know, I'm not sure if it's a good idea. |
Here are some options I can think of:
I think option 1 is the lowest maintainer overhead approach, but has the drawbacks of:
|
Thanks for the idea @Xuanwo -- it isn't clear to me how the MSAV approach would be different from having arrow-rs release minor versions (e.g. |
MSAV enables arrow to make breaking changes without requiring library developers to update their MSAV, provided the library remains unaffected. For instance, the library And, yes. I think this approach is difficult to implement across the entire ecosystem. |
I like this proposal. In essence, it means decoupling versions of Arrow the project and arrow-rs the crate. An even more robust way of versioning would be to only release a major version every X months, and only if there were any breaking changes. Otherwise, only a minor version would be published. Regarding the MSAV approach: I don't think this is sound advice to users of arrow-rs - it would lead to occasionally broken builds. That's because, for example, a library MSRV works only because all new versions of Rust are guaranteed to be backward compatible - they never introduce breaking changes. This is reflected in the fact that Rust itself has been on major version 1.x for a long time now. |
Thanks @aljazerzen -- can you be clearer about what proposal you are referring to? |
I mean the original one. I found this issue because I'm running into the problem of managing transitive dependencies on arrow-rs. Because I depend on duckdb-rs, and because it had not been updated to use arrow-rs 50, I cannot upgrade my dependency on arrow-rs, since I want compatible versions. Other popular crates (such as serde or regex) don't cause this problem because they only publish minor or patch releases. If this was the case with arrow-rs, then duckdb-rs could depend on arrow-rs~=1.49, which would be compatible with a new version of arrow-rs 1.50. Obviously arrow-rs cannot go back to version 1.x, but it could stop releasing new major versions that don't contain breaking changes, or at least release them less frequently. It would mean a much smaller toll on downstream maintainers. |
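To make the compatibility argument above concrete, here is a minimal sketch of Cargo's default (caret) version requirement rule, assuming plain `major.minor.patch` versions with no pre-release tags. The `Version` struct and `caret_compatible` function are hypothetical names for illustration and are not part of arrow-rs or Cargo. Note that `~=` in the comment above is Python-style notation; in Cargo a bare `"1.49"` already means `^1.49`.

```rust
/// A plain `major.minor.patch` version (pre-release and build metadata ignored).
/// Hypothetical helper type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Version {
    pub major: u64,
    pub minor: u64,
    pub patch: u64,
}

impl Version {
    pub const fn new(major: u64, minor: u64, patch: u64) -> Self {
        Version { major, minor, patch }
    }
}

/// Returns true if `candidate` satisfies the caret requirement `^req`,
/// i.e. Cargo would treat it as a semver-compatible upgrade.
pub fn caret_compatible(req: Version, candidate: Version) -> bool {
    // Never resolve to something older than what was requested.
    if candidate < req {
        return false;
    }
    if req.major > 0 {
        // ^1.49.0 allows >=1.49.0, <2.0.0
        candidate.major == req.major
    } else if req.minor > 0 {
        // ^0.20.0 allows >=0.20.0, <0.21.0
        candidate.major == 0 && candidate.minor == req.minor
    } else {
        // ^0.0.3 allows only 0.0.3
        candidate.major == 0 && candidate.minor == 0 && candidate.patch == req.patch
    }
}
```

Under this rule a dependency on 49.0.0 does not accept 50.0.0, while a hypothetical dependency on 1.49.0 would accept 1.50.0, which is the difference the comment describes.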
Yes, I think this is the key -- the actual version number isn't really important -- what is important is not releasing breaking changes.
Indeed |
Another point that @tustvold made that is worth repeating is that in the current model, we sometimes make new APIs that are expected to be changed prior to the next breaking major release, so ensuring we don't release such an API would be an additional overhead / require some additional discipline |
I think we should separate two issues that appear to have gotten conflated:
I'm in favour of 1. and we could probably aim to hew closer to quarterly major releases (we're relatively close atm). I think 2. is harder, and tbh I am not sure how many people are really calling for this. We could/should do patch releases when sufficient functionality has accumulated, but I'm less keen on committing to a regular cadence |
What about doing something like tokio: https://docs.rs/tokio/latest/tokio/#unstable-features? |
With my Examples: In DataFusion we had several features sit for 2+ months downstream (for example apache/datafusion#8693 from @Jefffrey), waiting on a release that contained a non-breaking API change. InfluxDB: @erratic-pattern is working on a feature internally that is waiting on #5433 (though I think that technically is a breaking API change) |
Similarly, in Lance we are often waiting for Arrow to be released and then DataFusion to be released with reference to that Arrow version. However, whenever possible, we will implement some workaround in Lance. For example, we have a custom cast function that handles FSL. But there are some cases where we can't easily implement a workaround. For example, we added S3 encryption support in object-store. It's hard to say how often it would be challenging for us. But if it does become challenging, I think I would volunteer to work on putting together the minor or patch releases. |
Precisely speaking, only if DataFusion exposes arrow-rs's types in public APIs, AND the end-user of DataFusion needs to use those types to talk with other crates using arrow-rs. It's totally fine to have multiple versions of arrow-rs in the project if arrow-rs is only used as an internal implementation detail. Therefore, this makes me think that perhaps decoupled versions can help at this stage. Specifically, we can have a set of "core" crates, which define the types used in public APIs (e.g., |
Thanks @xxchan -- I think you understand the issue and structure. The additional maintenance burden comes from handling breaking changes to the public APIs (and since there are a lot of public APIs there are a lot of potentially breaking changes) |
|
Or, as it was discussed above, not have new releases of major arrow versions in the first place. |
For context #5623 is proposing a breaking change to arrow-array. We have had similar breaking changes to arrow-schema as part of adding support for view types. I therefore think even if we did separate the versioning of the individual crates, we would still need the ability to create breaking changes. R.e. pyo3 I would like someone to please clarify if #5566 is a breaking change, as if so that has already been merged to main... |
The next release is going to have to be breaking because of the PyO3 and object_store upgrades. Whilst not breaking in and of themselves they introduce a version upgrade hazard due to the way cargo handles dependency resolution across compatibility ranges. I anticipate cutting this in the next few weeks. We may also want to bring forward a fix for incorrect interval ordering, although I'm still unsure how best to solve that one. |
reply #5566 (comment) here
This sounds like bumping MSRV. Consumers are required to bump their Rust version, but there are strong arguments that this should not be a semver breaking change. rust-lang/api-guidelines#231 (comment) I’m not familiar with pyo3, and not sure whether the analogy is precise though. From a more practical point of view, I think it depends on whether bumping the major version brings benefits to users. For users who don’t use pyo3, it definitely brings disadvantages. For users using pyo3, the workload seems to be the same. They can just pin to an older arrow version if they don’t want to upgrade pyo3 for a while. (Similar to the solution for MSRV) There seems to be no large difference whether they pin to an older major or minor arrow version. I now roughly feel that bumping the major version unnecessarily might bring more harm (in productivity) to the ecosystem than including some “little” breaking changes in minor versions (like tokio unstable). Although the latter might be more “correct”. Just a random personal feeling, correct me if I’m wrong. |
The issue is defined in more detail here - https://doc.rust-lang.org/cargo/reference/resolver.html#version-incompatibility-hazards And further expanded upon here - https://github.com/dtolnay/semver-trick?tab=readme-ov-file#coordinated-upgrades Basically, say a user has a dependency on pyo3 0.20 in their project: as soon as we publish a minor release their project will start failing to build (assuming no lockfile) with a thoroughly opaque error about two identically named types not being equal. Rustc does actually hint at what the issue might be, but unless people happen to know cargo's somewhat peculiar versioning behaviour, it may not be very obvious. In the past people have filed issues on this repo or pinged maintainers on discord/slack. The rust docs are fairly unambiguous that the correct response to this is to yank the release - https://doc.rust-lang.org/cargo/reference/resolver.html#semver-breaking-patch-release-breaks-the-build Whilst I agree it is unfortunate, and I had really hoped to avoid this release being breaking, I'm not really sure we can just pretend it isn't a breaking change... Ultimately we should still be |
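The "two identically named types not being equal" hazard can be sketched without any real dependency. In the block below, the modules `dep_v0_20` and `dep_v0_21` and the type `Handle` are hypothetical stand-ins for two semver-incompatible versions of the same crate that cargo has resolved side by side; they are not pyo3's actual API.

```rust
use std::any::TypeId;

/// Stand-in for version 0.20 of some dependency (hypothetical).
pub mod dep_v0_20 {
    pub struct Handle;
}

/// Stand-in for version 0.21 of the same dependency (hypothetical).
pub mod dep_v0_21 {
    pub struct Handle;
}

/// Stand-in for a library function compiled against the newer version.
pub fn takes_new_handle(_handle: dep_v0_21::Handle) {}

/// Reports whether two types are the same type as far as the compiler is concerned.
pub fn same_type<A: 'static, B: 'static>() -> bool {
    TypeId::of::<A>() == TypeId::of::<B>()
}

// Uncommenting the body below fails to compile with a "mismatched types" error,
// even though both types are spelled `Handle`:
//
// pub fn user_code() {
//     takes_new_handle(dep_v0_20::Handle);
// }
```

With real crates the two paths look identical in user code, which is why the resulting error tends to be confusing unless one knows that two versions are present in the dependency graph.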
Unfortunately, I agree that because of the bump of This will take a heavy toll downstream, because crates use arrow types in their public interfaces, so they will face the same problems as arrow is facing with pyo3. It is a shame that the whole If I understand the "semver-trick" this would be the release process:
This would also mean that most of the code in arrow 51.1 could be removed and just replaced with dependencies on arrow 52. This is all way too much work, while also not solving the problem of publishing the new major versions. It would just make lagging behind the latest arrow version less of a problem because old versions would contain most of the new changes. Another solution would be to move pyarrow module into a separate crate that can be versioned independently. |
As a big user of the individual arrow-* sub-crates, it would make my life a lot easier if each sub-crate was versioned independently (similar to how any of the mainstream Rust projects do it). After creating a single library with public APIs exposing the Arrow traits, it only took a week before I ran into this issue with some of my users who wanted to use versions 47 and 51 in their projects. From my perspective, there was zero difference between the two versions and I didn't understand why the major version was bumped for crates like It's a maintenance headache for a crate's users if the maintainers aren't following SemVer, since that's such a huge part of the expectations of the ecosystem built around Rust at a crates ecosystem level. |
I agree and also proposed this before #5368 (comment). I think this is the ultimate solution and we have to follow this in “1.0” status for the main libraries. IMO the main reasons why single version is used are:
|
I’m a little confused: what prevents us from holding one vote together for arrow-array v1 and pyarrow v2? |
Nothing in theory -- I think the limiting factor for this is maintainer bandwidth, which is a scarce resource. Insofar as those on this thread who are interested in this topic can lend a hand (preparing / reviewing PRs, branches, etc) it would add to the available bandwidth and make some of the other proposals more feasible. I am personally willing to run the actual voting / release process, but I don't have the bandwidth to create the PRs / manage the required branches for this process. |
It's probably worth mentioning that making breaking changes and updating the major version of these crates is not that big of a deal to most downstream users if the individual sub-crates are versioned independently. Just compare the number of downloads for Breaking changes usually come with improvements and feature additions for the clients of the individual crates; however, when there are no changes to the crates which are being used it becomes extremely problematic because there is no motivation among all dependents to coalesce on the most recent version. This is as @Xuanwo said:
This also has consequences for security updates and vulnerability resolution. If your library clients are conditioned to assume that new versions of their dependencies are not relevant to them, they will be less likely to pull in new security/vulnerability fixes you've pushed to the project. Having separate version numbers for all of the sub-crates does not need to be any more of a burden on the maintainers of this crate than the current versioning approach. It should even be less work for maintainers as only a subset of the crates should have version updates due to a single change. Compared to using separate versions for sub-crates, what is the benefit of the current scheme for anyone? |
Simplicity, it will be frustrating and hard for downstreams to reason about what combination of crate versions are compatible with one another if they are versioned independently. As most downstreams will need to use a combination of arrow crates, this will turn upgrades into a labour intensive mess, especially as cargo's behaviour of using multiple versions concurrently would not lead to helpful errors if you got it wrong. Currently we try hard to keep breaking changes small, at worst requiring updating a few call sites, we're additionally going to try delaying breaking changes to a quarterly schedule after the next release. I suggest we proceed with that as the plan and circle back in 6 months and assess. |
Usually I just look at crates.io if I'm using an older version of crates; however, I'm most likely to use the most recent versions if the project is adhering to SemVer like the majority of the Rust community.
This doesn't help users who are currently dealing with supporting multiple major versions for the low-level crates which independently don't have a reason to have multiple major versions... addendum: I'd also like to note that with the current state there is no good solution which can be "figured out" - it's literally impossible. If the sub-crates are versioned separately, at least downstream users have a chance to get a reasonable setup going. |
They do have a reason, I articulated it above, you may not like the reason but there is a reason 😆 Regardless my expectation is that the cadence of breaking changes to arrow-array and similar core crates will align with that of the 3 monthly releases, and so this is a moot point anyway. I appreciate your frustration, if perhaps not your tone, but we're doing the best we can |
Apologies for the tone. As you said, I am frustrated and being forced to use this crate has not been a good experience because of the versioning approach. |
I also agree with this to some degree. But it at least requires some nontrivial work, e.g., updating release CI workflow or scripts, changing crates' And as @alamb mentioned above, this option is not excluded from consideration. If we want to help, we can contribute what the actual changes need to be, and then the maintainers may consider adopting the new workflow. |
I would love to help review PRs that made it easier to manage / track versions and do non breaking releases. Thank you @xxchan |
I tried to capture the outcome of this discussion in #5737 and document what the updated plan for releases is. Feedback welcome. |
This has been a really interesting and productive discussion. I wanted to add a few notes/questions:
|
I'm sorry to be curt, but I'm growing a bit frustrated that this conversation appears to be going in circles, the core abstractions in arrow-rs are not stable, can't be treated as such, and no amount of release hackery will change this. |
I agree with this point. Unless/until we have more discipline about keeping the APIs stable, major version bumps are required. I see conventional commits / unstable feature flags, etc as a way to improve discipline around keeping APIs stable. Maybe one way to reduce maintainer bandwidth requirements would be to implement a CI check for breaking API changes. I filed #5791 to track this idea. Anyone interested in helping ease the maintenance burden could maybe help figure out how to automate some of it |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As more people use arrow, the overall burden to users from frequent major releases is increasing. Furthermore, the pace of breaking API changes is decreasing, so the burden on maintainers to avoid breaking changes is decreasing
As the arrow crate becomes more widely used in the ecosystem by projects other than DataFusion and other early adopters, the frequent major releases cause several issues:
- parquet and arrow releases are coupled, so releasing a version of parquet requires releasing a new version of arrow
- The major version bumps impose non-trivial overhead on user crates. Some crates like arrow_serde have implemented clever, though complex, workarounds like having feature flags for each arrow version (see the recent discussion with @chmp on arrow_serde chmp/serde_arrow#131)
Also, from what I can see many of the recent arrow-rs changes aren't really adding new APIs, they are more like filling in feature gaps and fixing bugs, which is also reflected in the slower pace of the last few releases.
Describe the solution you'd like
I propose we set a more regular major release cadence (e.g. every 3 months) and only do minor, compatible, releases between those releases.
This would absolutely require more maintainer effort, but at this stage in the project the effort may be more manageable as the APIs are in a pretty good place I think
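The proposed cadence can be written down as a small decision rule. This is only a sketch under stated assumptions: the 3-month figure comes from the proposal above, the "only if there are breaking changes" refinement comes from a comment in the thread, and `ReleaseKind`, `MAJOR_CADENCE_MONTHS`, and `next_release_kind` are hypothetical names, not part of any arrow-rs release tooling.

```rust
/// Kind of release to cut next (hypothetical type for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReleaseKind {
    /// May contain breaking API changes.
    Major,
    /// Must remain semver-compatible with the last major release.
    Minor,
}

/// Assumed minimum spacing between major releases, per the proposal.
pub const MAJOR_CADENCE_MONTHS: u32 = 3;

/// Decides the next release kind. A major release is cut only when the
/// cadence window has elapsed AND breaking changes are actually waiting;
/// otherwise the release stays minor and breaking changes are held back.
pub fn next_release_kind(
    months_since_last_major: u32,
    has_pending_breaking_changes: bool,
) -> ReleaseKind {
    if has_pending_breaking_changes && months_since_last_major >= MAJOR_CADENCE_MONTHS {
        ReleaseKind::Major
    } else {
        ReleaseKind::Minor
    }
}
```

The cost this rule hides is the one raised in the discussion: breaking changes that arrive early have to be kept off the release branch until the next major window.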
Describe alternatives you've considered
I think there are various alternatives to trigger releases / what cadence. I don't have a hugely strong opinion in this matter
Additional context
At some point in the past we actually had fewer major releases -- see #1120
There was non-trivial process overhead so we (well, really I) abandoned it and went YOLO on major releases as there wasn't really any maintenance bandwidth to do anything else