Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify batch and stream API check #271

Merged

Conversation

madsbk
Copy link
Member

@madsbk madsbk commented Aug 28, 2023

We now use the cuFileReadAsync symbol to check for availability of both the stream and batch API of cuFile.

We used to look for the mangled symbol _ZTS13CUfileOpError to determine the availability of the batch API. However, since mangling is compiler specific, we prefer using cuFileReadAsync even though it means that the batch API isn't available until CUDA version 12.2.

Additionally, this PR also implements CUFILE_CHECK_STREAM_IO(), which is used to check async IO errors.
cc. @tell-rebanta

@madsbk madsbk added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Aug 28, 2023
@madsbk madsbk force-pushed the unify_batch_and_stream_api_check branch from 109835e to f4f71ab Compare August 28, 2023 13:04
@madsbk madsbk force-pushed the unify_batch_and_stream_api_check branch from f4f71ab to 4dc2cee Compare August 28, 2023 13:36
@madsbk madsbk force-pushed the unify_batch_and_stream_api_check branch from 0d5f440 to e4a4622 Compare August 28, 2023 14:29
@madsbk madsbk force-pushed the unify_batch_and_stream_api_check branch from e4a4622 to 5358b35 Compare August 28, 2023 14:45
@madsbk madsbk marked this pull request as ready for review August 28, 2023 15:19
@madsbk madsbk requested review from a team as code owners August 28, 2023 15:19
cpp/CMakeLists.txt Outdated Show resolved Hide resolved
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the error handling for negative bytes written in the async write is incorrect. And, if I read the docs correctly, the pointer we're passing is also bad :(.

cpp/include/kvikio/shim/cufile.hpp Show resolved Hide resolved
check(cudaStreamSynchronize(stream) == cudaSuccess);
// Note, `bytes_done` might be negative, which indicate an IO error thus we
// use `CUFILE_CHECK_STREAM_IO` to check for errors.
CUFILE_CHECK_STREAM_IO(bytes_done);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is a garbage error reporting api.

cout << "Performing async I/O using file handle" << endl;
off_t f_off{0};
off_t d_off{0};
ssize_t bytes_done{0};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is incorrect, per the documentation for cuFileWriteAsync:

The bytes_written pointer should be allocated with cuMemHostAlloc or registered with cuMemHostRegister, and the pointer to access that memory from the device can be obtained by using cuMemHostGetDevicePointer.

Which we are not doing here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, fixed

cpp/CMakeLists.txt Show resolved Hide resolved
if (_nbytes < 0) { \
throw(_exception_type){std::string{"cuFile error at: "} + __FILE__ + ":" + \
KVIKIO_STRINGIFY(__LINE__) + ": " + \
cufileop_status_error((CUfileOpError)(CUFILEOP_BASE_ERR - _nbytes))}; \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the documentation says that this returns:

  • The number of bytes successfully written. or;
  • -1 on IO errors. or;
  • All other errors will return a negative integer value of the CUfileOpError enum value.

But this code suggests that it returns -(some_cufile_op_error - CUFILEOP_BASE_ERR). Which can't be right given the above because it value-puns CU_FILE_DRIVER_NOT_INITIALIZED with an IO error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the docs are correct. I now just print the raw value of _nbytes. Let's fix this in a follow-up when we know the intended behavior :)

cpp/include/kvikio/error.hpp Show resolved Hide resolved
@madsbk madsbk requested a review from wence- August 30, 2023 07:21
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some tiny fixes, LGTM!

cpp/examples/basic_io.cpp Show resolved Hide resolved
* until execution time.
* @param bytes_read Pointer to the bytes read from file. This pointer should be a non-NULL value
* and *bytes_read set to 0. The bytes_read memory should be allocated with cuMemHostAlloc/malloc/
* mmap or registered with cuMemHostRegister.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you have just copied the cufile docs here, but this sentence makes no sense to me. AIUI, just plain malloc and mmap are completely unknown to the driver. So does this mean use cuMemHostAlloc or cuMemAlloc or cuMemMap? Or am I fully even more confused than I thought?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it means that it has to be allocated on the heap (not stack)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct.

cpp/include/kvikio/file_handle.hpp Outdated Show resolved Hide resolved
madsbk and others added 3 commits August 30, 2023 13:22
@madsbk
Copy link
Member Author

madsbk commented Aug 30, 2023

Thanks @robertmaynard and @wence-

@madsbk
Copy link
Member Author

madsbk commented Aug 30, 2023

/merge

@rapids-bot rapids-bot bot merged commit 63d76a9 into rapidsai:branch-23.10 Aug 30, 2023
27 checks passed
@madsbk madsbk deleted the unify_batch_and_stream_api_check branch August 30, 2023 12:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improves an existing functionality non-breaking Introduces a non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants