-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
S3 support #426
S3 support #426
Conversation
25d4fe9
to
f76ff32
Compare
cmake RemoteHandle::read() remove StreamFuture(StreamFuture&&) python bindings and tests read to device memory dependencies: aws-sdk-cpp parse_s3_path cmake: adding AWSSDK to python built RemoteFile.from_url() benchmark benchmark: --api numpy ibenchmark: --api cudf cleanup benchmark: adding --api cudf-fsspec read(): use PushAndPopContext impl. pread ensure_aws_s3_api_is_initalized AwsS3Client S3Context clean up SameThreadExecutor use shared pointer cmake: AWSSDK COMPONENTS s3 transfer create_transfer_manager clean up BufferAsStream test_read: larger data size KVIKIO_NVTX_FUNC_RANGE benchmark: use random RemoteHandle::read_to_host(): print bandwidth benchmark clean up remove the use of the transfer module ci: some more aws-sdk-cpp benchmark: pytest.importorskip("moto") remote_handle.pyx remote_file.py make remote IO optional don't use typing_extensions dependencies: boto3 and moto more aws-sdk-cpp trigger CI error if remote module wasn't built
python/kvikio/tests/test_aws_s3.py
Outdated
# TODO: remove before PR merge. Trigger CI error if the remote module wasn't built | ||
import kvikio._lib.remote_handle # isort: skip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[debug] remove before PR merging
@KyleFromNVIDIA, what is the status of the building work? Is it ready for review/merge? |
This reverts commit b978323.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some concerns about the handling of the aws-sdk-cpp
dependency and how it will constrain our compatibility between RAPIDS and the conda-forge ecosystem. This dependency is used by many packages and it is tightly pinned, meaning that RAPIDS will need to track and continually migrate its dependency to align exactly with the conda-forge ecosystem. See threads in this review.
clean - remove all existing build artifacts and configuration (start over) | ||
libkvikio - build and install the libkvikio C++ code | ||
kvikio - build and install the kvikio Python package (requires libkvikio) | ||
--no-s3 - build with no AWS S3 support |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this just be defined as a part of --cmake-args
, with no special --no-s3
flag? It is possible for users to specify the option via --cmake-args
. If the user gives a value there, we probably want to respect that with higher priority than defaulting to "ON" based on the absence of a --no-s3
flag.
@@ -64,11 +64,13 @@ requirements: | |||
- rapids-build-backend >=0.3.0,<0.4.0.dev0 | |||
- scikit-build-core >=0.10.0 | |||
- libkvikio ={{ version }} | |||
- aws-sdk-cpp>=1.11.267 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aws-sdk-cpp
requires a nearly-exact match at runtime (version x.x.x) and we need compatibility with conda-forge. Note that we will need to maintain this pinning very actively and align with conda-forge for every RAPIDS release, or we risk breaking environment compatibility with conda-forge. This is a large part of why I am hesitant on adding this dependency -- it constrains us tightly, and we do not benefit from the automated migrations that are used for conda-forge recipes to "rebuild the world" when pinnings like aws-sdk-cpp are updated.
We always need to be matching this line: https://github.com/conda-forge/conda-forge-pinning-feedstock/blob/2e592a612d925988747cd6daa9e271328ceb3bfc/recipe/conda_build_config.yaml#L268
- aws-sdk-cpp>=1.11.267 | |
- aws-sdk-cpp==1.11.379 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for my ignorance, but why do we need an exact match? Naively, I would have thought that a lower-limit would be more flexible? Do all packages in conda-forge needs to maintain this pinning if they use aws-sdk-cpp
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bdice, would it help if we remove the version requirement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First I'll explain the migration logic. Then I'll answer the questions you posed about other options.
Migrations and why they're needed
The aws-sdk-cpp
release cadence is very frequent. Some of those releases (there's a huge number of them) are picked up by conda-forge, but taking the new version requires "rebuilding the world" because aws-sdk-cpp
is a core C++ dependency of many packages (see mamba repoquery whoneeds -c conda-forge aws-sdk-cpp
). Updates happen via a conda-forge migration (tracked here) which rebuild everything that depends on aws-sdk-cpp
.
See the number of recent AWS-related rebuilds of libarrow, for instance:
https://github.com/conda-forge/arrow-cpp-feedstock/commits/main/recipe/meta.yaml
I'll ignore the aws-crt-cpp
and other rebuilds, and only focus on aws-sdk-cpp
because that's the dependency we are introducing here.
In the past 12 months, there have been 6 rebuilds of libarrow for new versions of aws-sdk-cpp
(1.11.379, 1.11.329, 1.11.267, 1.11.242, 1.11.210, 1.11.182).
Every time that conda-forge migrates to a new version, we must use the same version so that RAPIDS (KvikIO specifically) can be installed alongside the latest conda-forge packages.
Try this:
mamba create -n test --dry-run libarrow
Note that aws-sdk-cpp 1.11.379
is included in the environment because there has been a recent rebuild of libarrow
.
Can we use a lower limit (or no version pinning)?
If we were to use a lower limit (or no version pinning), kvikio would use the latest aws-sdk-cpp
when it is built. That would impose a runtime pinning on the latest aws-sdk-cpp
. However, if a conda-forge migration has not happened yet (or is incomplete), those kvikio
packages would be incompatible with the latest conda-forge packages of things like libarrow. We need to match the migrators so we don't get "latest aws-sdk-cpp
" and instead align with "current version used by conda-forge's ecosystem-wide pinnings". Typically conda-forge maintainers will start a migration as soon as a new aws-sdk-cpp
is released, but that's not guaranteed to happen immediately. Sometimes migrations are delayed or skipped, which would leave our packages unusable if we had picked up an aws-sdk-cpp
version that conda-forge did not adopt with a migration.
So what do we do?
RAPIDS is not built with conda-forge infrastructure, but aims to be compatible with the conda-forge ecosystem. We do not benefit from conda-forge's automation for version migrations but we must track them in order to be compatible. (The conda-forge bots that "rebuild the world" do not rebuild RAPIDS recipes, but maybe we could create such a tool.)
To remain compatible with conda-forge, we need to match the exact pinnings for core packages like fmt
, spdlog
, aws-sdk-cpp
, and others. We have a lot of pain from matching conda-forge migrations for fmt
and spdlog
(see Keith's warning and this problem in cuspatial that comes from us being behind on spdlog/fmt migrations)
Our best course of action for now is to:
- pin
aws-sdk-cpp
to the exact version in this YAML file: https://github.com/conda-forge/conda-forge-pinning-feedstock/blob/f291cfdaf8c328337cb8cd1f63c63caceeda8991/recipe/conda_build_config.yaml#L267-L268 (that's a permalink but we should track the latest as it changes)- we cannot pin more loosely --
aws-sdk-cpp
imposes run-exports that pin very tightly, which typically indicates that there is no guarantee of stable ABI across versions
- we cannot pin more loosely --
- manually update that version when conda-forge issues a migration
- consider adding tooling to help automate migrations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened rapidsai/build-planning#100 to think more about conda-forge migration support in RAPIDS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, thanks for the explanation @bdice! I now see why you were hesitant from the get-go 😊
What if we move to use libcurl
instead? They take API and ABI stability very serious: https://curl.se/libcurl/features.html#stableapi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was actually thinking of using libcurl from the start since we only need the very basic S3 operations. I don’t think it will be too hard to implement and it will make Azure Blob Storage support straightforward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That might be a better option than aws-sdk-cpp
if we can use libcurl
. It seems like libcurl
is already a dependency of some packages we depend on in RAPIDS, so that gives me more confidence in it. See $ mamba repoquery whoneeds -c conda-forge libcurl 2>&1 | awk '{print $1;}' | sort | uniq
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, I will give libcurl a try
run: | ||
- python | ||
- numpy >=1.23,<3.0a0 | ||
- cupy >=12.0.0 | ||
- zarr | ||
- aws-sdk-cpp>=1.11.267 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not list a run dependency, this will be handled by run-exports: https://github.com/conda-forge/aws-sdk-cpp-feedstock/blob/f82d968670bdb9939ed7604a5fb7bb4885e2e6ba/recipe/meta.yaml#L14
- aws-sdk-cpp>=1.11.267 |
@@ -52,6 +52,7 @@ requirements: | |||
{% else %} | |||
- libcufile-dev # [linux] | |||
{% endif %} | |||
- aws-sdk-cpp>=1.11.267 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Match conda-forge.
- aws-sdk-cpp>=1.11.267 | |
- aws-sdk-cpp==1.11.379 |
@@ -83,6 +84,7 @@ outputs: | |||
{% else %} | |||
- libcufile-dev # [linux] | |||
{% endif %} | |||
- aws-sdk-cpp>=1.11.267 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this and add a host
dependency on aws-sdk-cpp==1.11.379
here so that we can have compatible runtime pinnings inserted automatically by the run-exports.
|
||
retcode = run_cmd( | ||
cmd=[ | ||
sys.executable or "python", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think sys.executable
should always be defined? Are there cases where that isn't safe?
], | ||
cwd=benchmarks_path, | ||
) | ||
assert retcode == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this assert anything about the stdout/stderr outputs?
Co-authored-by: Bradley Dice <[email protected]>
@madsbk Would you please share the status of this work? Are you planning to include AWS C++ SDK changes in 24.12? |
Support read directly from a http server like: ```python import kvikio import cupy with kvikio.RemoteFile.from_http_url("http://127.0.0.1:9000/myfile") as f: ary = cupy.empty(f.nbytes, dtype="uint8") f.read(ary) ``` This PR is the first step to support S3 using libcurl instead of [aws-s3-sdk](#426), which has some pros and cons: * Pros * The [global conda pinning issue](#426 (comment)) is less of a problem. * We can support other protocols such as http, ftp, and Azure’s storage, without much work. * We avoid the [free-after-main issue in aws-s3-sdk](https://github.com/rapidsai/kvikio/blob/000126516db430988ab9af5ee1576ca3fe6afe27/cpp/include/kvikio/remote_handle.hpp#L87-L94). This is huge since we would otherwise have to pass around a `S3Context` in libcudf and cudf to handle shutdown correctly. This is not a problem in libcurl, see https://curl.se/libcurl/c/libcurl.html under `Global constants`. * Cons * Hard to support the AWS configuration file. We will require the user to either specify the options programmatically or through environment variables like `AWS_ACCESS_KEY_ID ` and `AWS_SECRET_ACCESS_KEY `. Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Kyle Edwards (https://github.com/KyleFromNVIDIA) - Lawrence Mitchell (https://github.com/wence-) URL: #464
Replaced by #479. |
Implements AWS S3 read support using libcurl: ```python import kvikio import cupy with kvikio.RemoteFile.from_s3_url("s://my-bucket/my-file") as f: ary = cupy.empty(f.nbytes, dtype="uint8") f.read(ary) ``` Supersedes #426 Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Bradley Dice (https://github.com/bdice) - Lawrence Mitchell (https://github.com/wence-) - Vyas Ramasubramani (https://github.com/vyasr) URL: #479
Implements AWS S3 read support.
Tasks