Conversation
1b523a0 to
974ac99
Compare
|
Hello, @SFN-eu . Thanks for this huge PR ! Would you mind testing on https://github.com/alicevision/dataset_monstree (full) and put the output of the depthmap node somewhere ? Thanks ! |
|
So apparently the monstertree dataset causes it to intermittently segfault on some (but not all) of the stages of depthMapEstimation, the joys of testing and it-works-on-my-{machine,dataset}. Will get back once I've fixed the issue. |
|
So, the segfaults only happened a couple of times with the meshroom GUI , then disappeared. I have since been unable to replicate them, even with valgrind, so I'm tentatively putting them down to using development versions of both AliceVision and Meshroom. If anyone has any similar issues, please report. In the meantime, here is the result of running the default "photogrammetry" pipeline on monstertree/full: |
|
There is a huge difference between the cuda result and your. The most visible is the depthmapfilter. The output is almost sparse in your output ? |
36e89ec to
3c91498
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds a SYCL-based alternative to the existing CUDA implementation of the depthMapEstimation and depthMapFiltering pipeline stages. Using AdaptiveCpp as the SYCL implementation, this enables running these stages on a much wider variety of hardware including CPUs and non-NVIDIA GPUs. The PR addresses issue #439 (removing CUDA dependency). The SYCL and CUDA implementations are mutually exclusive at build time.
Changes:
- New
depthMap_sycldirectory with SYCL reimplementation of the depthMap subsystem (SGM, Refine, volume IO, mipmap images, device cache, multi-device dispatch, etc.) - CMake build system modifications to support AdaptiveCpp/SYCL as an alternative to CUDA, including a new
USE_SYCLoption andadd_sycl_to_targetintegration - Conditional compilation in the pipeline entry points (
main_depthMapEstimation.cpp,main_depthMapFiltering.cpp) to switch between CUDA and SYCL implementations
Reviewed changes
Copilot reviewed 49 out of 49 changed files in this pull request and generated 18 comments.
Show a summary per file
| File | Description |
|---|---|
| src/CMakeLists.txt | Adds ALICEVISION_USE_SYCL option and AdaptiveCpp discovery |
| src/cmake/Helpers.cmake | Adds USE_SYCL option to alicevision_add_library |
| src/aliceVision/CMakeLists.txt | Conditionally adds depthMap_sycl subdirectory |
| src/aliceVision/depthMap_sycl/CMakeLists.txt | Build definition for the SYCL depthMap library |
| src/software/pipeline/CMakeLists.txt | SYCL build targets for estimation and filtering executables |
| src/software/pipeline/main_depthMapEstimation.cpp | Conditional includes and function calls for SYCL path |
| src/software/pipeline/main_depthMapFiltering.cpp | Conditional includes and function calls for SYCL path |
| src/aliceVision/depthMap_sycl/computeOnMultiDevices.{hpp,cpp} | Multi-device dispatch with load balancing |
| src/aliceVision/depthMap_sycl/DepthMapEstimator.{hpp,cpp} | Main depth map estimation orchestration |
| src/aliceVision/depthMap_sycl/Sgm.{hpp,cpp} | Semi-Global Matching implementation |
| src/aliceVision/depthMap_sycl/Refine.{hpp,cpp} | Refinement step implementation |
| src/aliceVision/depthMap_sycl/NormalMapEstimator.{hpp,cpp} | Normal map estimation |
| src/aliceVision/depthMap_sycl/sycl/*.{hpp,cpp} | Device-side utilities: memory, matrix, color, cache, mipmap, etc. |
| src/aliceVision/depthMap_sycl/*.{hpp,cpp} | Host-side utilities: params, depth lists, IO, tiles |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| id = inthsr(device.get_info<sycl::info::device::vendor_id>()) + | ||
| strhsr(platform.get_info<sycl::info::platform::name>()); // + has the advantage of avoiding collisions around zero, unlike xor. It's also fast, and commutativity is not a problem for our usecase |
| std::rethrow_exception(e); | ||
| } catch (sycl::exception const &e) { | ||
| ALICEVISION_LOG_INFO("Caught asynchronous SYCL exception " << e.code() << ": \"" | ||
| << e.what() << "\" Warning: Only allocation faliures will dealt with!"); |
| // allocate final deth/similarity map tile list in host memory | ||
| ALICEVISION_LOG_DEBUG(deviceName <<": Allocating final deth/similarity map tile list in host memory"); |
| // compute gausian blur | ||
| float sumFactor = 0.0f; | ||
|
|
||
| for(int j = -downscale; j <= downscale; j++) // Note: gausian radius is downscale level |
| #pragma once | ||
|
|
||
| #define TSIM_REFINE_USE_HALF | ||
|
|
||
| namespace aliceVision { | ||
| namespace depthMap { | ||
|
|
||
| /* | ||
| * @note TSim is the similarity type for volume in device memory. | ||
| * @note TSimAcc is the similarity accumulation type for volume in device memory. | ||
| * @note TSimRefine is the similarity type for volume refinement in device memory. | ||
| */ | ||
|
|
||
| #ifdef TSIM_USE_FLOAT | ||
| using TSim = float; | ||
| using TSimAcc = float; | ||
| #else | ||
| using TSim = unsigned char; | ||
| using TSimAcc = unsigned int; // TSimAcc is the similarity accumulation type | ||
| #endif | ||
|
|
||
| #ifdef TSIM_REFINE_USE_HALF | ||
| using TSimRefine = sycl::half; |
| // Symmetric Householder reductio3 to tridiago3al form. | ||
|
|
||
| static inline void sycl_tred2(double V[3][3], double d[3], double e[3]) |
| * @brief Support class to reppresent a subpart of a patch pattern | ||
| * Each patch pattern subpart gives one similarity score. | ||
| * | ||
| * @note We use a static function to aquire a single global reference | ||
| */ | ||
| struct PatchPatternSubpart | ||
| { | ||
| sycl::float2 coordinates[ALICEVISION_DEVICE_PATCH_MAX_COORDS_PER_SUBPARTS]; //< subpart coordinate list | ||
| int nbCoordinates; //< subpart number of coordinate | ||
| float level; //< subpart related mipmap level (>=0) | ||
| float downscale; //< subpart related mipmap downscale (>=1) | ||
| float weight; //< subpart related similarity weight in range (0, 1) | ||
| int wsh; //< subpart half-width (full and circle) | ||
| bool isCircle; //< subpart is a circle | ||
| }; | ||
|
|
||
| /** | ||
| * @struct PatchPattern | ||
| * @brief Support class to reppresent a patch pattern | ||
| */ | ||
| class PatchPattern | ||
| { | ||
| public: | ||
| PatchPatternSubpart subparts[ALICEVISION_DEVICE_PATCH_MAX_SUBPARTS]; //< patch pattern subparts (one similarity per subpart) | ||
| int nbSubparts; //< patch pattern number of subparts (>0) | ||
|
|
||
| // Singleton, no copy operator | ||
| void operator=(PatchPattern const&) = delete; | ||
|
|
||
| // Default destructor | ||
| ~PatchPattern() = default; | ||
|
|
||
| /** | ||
| * @brief helper func to always get the same Patch Pattern | ||
| */ | ||
| static PatchPattern& getGlobalPatchPattern() { | ||
| static PatchPattern instance; | ||
| return instance; | ||
| } | ||
| private: | ||
| // Singleton, private default constructor | ||
| PatchPattern() = default; | ||
| }; | ||
|
|
||
| /** | ||
| * @brief Build user custom patch pattern singelton |
| const float angleDifference = (M_PI * 2.f) / subpart.nbCoordinates; | ||
|
|
||
| // compute patch pattern relative coordinates | ||
| for (int j = 0; j < subpart.nbCoordinates; ++j) | ||
| { | ||
| sycl::float2& coords = subpart.coordinates[j]; | ||
|
|
||
| const float radians = angleDifference * j; | ||
| coords.x() = std::cos(radians) * radiusValue; | ||
| coords.y() = std::sin(radians) * radiusValue; | ||
| } | ||
|
|
||
| subpart.wsh = int(subpartParams.radius + std::pow(2.f, subpartParams.level - 1.f)); | ||
| subpart.nbCoordinates = subpartParams.nbCoordinates; |
| #include <aliceVision/alicevision_omp.hpp> | ||
| #include <sycl/sycl.hpp> | ||
|
|
||
| // Needed for checking device caracteristics |
| inline const void getEigenVectorsDesc(sycl::float3& cg, /*sycl::float3& v1, sycl::float3& v2, */sycl::float3& v3/*, float& d1, float& d2, float& d3*/) | ||
| { | ||
| double V[3][3], d[3]; | ||
|
|
||
| const double xmean = xsum / count; | ||
| const double ymean = ysum / count; | ||
| const double zmean = zsum / count; | ||
|
|
||
| cg = sycl::double3(xmean, ymean, zmean).convert<float>(); | ||
|
|
||
| V[0][0] = (xxsum - xsum * xmean - xsum * xmean + xmean * xmean * count) / count; | ||
| V[0][1] = (xysum - ysum * xmean - xsum * ymean + xmean * ymean * count) / count; | ||
| V[0][2] = (xzsum - zsum * xmean - xsum * zmean + xmean * zmean * count) / count; | ||
| V[1][0] = (xysum - xsum * ymean - ysum * xmean + ymean * xmean * count) / count; | ||
| V[1][1] = (yysum - ysum * ymean - ysum * ymean + ymean * ymean * count) / count; | ||
| V[1][2] = (yzsum - zsum * ymean - ysum * zmean + ymean * zmean * count) / count; | ||
| V[2][0] = (xzsum - xsum * zmean - zsum * xmean + zmean * xmean * count) / count; | ||
| V[2][1] = (yzsum - ysum * zmean - zsum * ymean + zmean * ymean * count) / count; | ||
| V[2][2] = (zzsum - zsum * zmean - zsum * zmean + zmean * zmean * count) / count; | ||
|
|
||
| // should be sorted | ||
| sycl_eigen_decomposition(V, d); | ||
|
|
||
| /* | ||
| v1 = sycl::normalize(sycl::float3((float)V[0][2], (float)V[1][2], (float)V[2][2])); | ||
| v2 = sycl::normalize(sycl::float3((float)V[0][1], (float)V[1][1], (float)V[2][1])); | ||
| */ | ||
| v3 = sycl::normalize(sycl::float3((float)V[0][0], (float)V[1][0], (float)V[2][0])); | ||
|
|
||
| /* | ||
| d1 = (float)d[2]; | ||
| d2 = (float)d[1]; | ||
| d3 = (float)d[0]; | ||
| */ | ||
| } | ||
|
|
||
| inline const bool computePlaneByPCA(sycl::float3& p, sycl::float3& n) |
|
Hi! Cool work! Happy to see that another backend is in the works for this project. And once AdaptiveCpp has sufficient Apple Metal support, I could drop my Metal implementation whatsoever. Do you have an approximate timeline on when Metal can be used with AdaptiveCpp? I would then consider dropping my upstreaming plans for MTL-AliceVision if the same could be achieved with a unified backend :).
And if I may chip in on this: If one asked me, I'd support a unified (and backend agnostic) DepthMap library, at least as long as multiple backends need to coexist. Because if they were exclusive to one another, a system with mixed GPU vendors could not utilize all hardware. I had started to think about how one could design that and came up with a concept similar to the |
It already works on Apple CPU, Metal support is slowly being patched in at AdaptiveCpp/AdaptiveCpp#864; see also https://adaptivecpp.github.io/AdaptiveCpp/install-metal/#enabling-the-metal-backend. Tl;dr it's an experimental option that's still rather barebones and misses features that this port relies on (namely, USM pointers and the
This is literally how the AdaptiveCPP backend works. The new DepthMap library is unified and, in some ways, "backend agnostic", because acpp abstracts everything away and just presents to the end programmer a list of all available devices on the system (and does in fact use OMP, CUDA, HIP, Intel ZE ecc. under the hood, handling synchronization and memcpy's between them at runtime). It also has an experimental multi-device queue (https://adaptivecpp.github.io/AdaptiveCpp/multi-device-queue/) for automatic work distribution, but currently "This extension should not yet be used for any production workloads" (testing how well it performs is something I want to do as part of the performance tuning, but correctness comes first). |
f4b667f to
1e555ed
Compare
|
P.s. does anyone know how to stop GitHub from spamming the full list of existing commits every time I rebase onto the tip of the |
|
Understood, thanks. We'll see if the |
|
I mean, it would be possible if the maintainers wished for it, but in other benchmarks AdaptiveCPP has shown itself to be better (i.e. faster) at running CUDA applications than CUDA itself (for those reading who are so inclined, you can check out the full paper: https://dl.acm.org/doi/full/10.1145/3731125.3731127). So basically I don't see the point of using the CUDA implementation if you want to use the SYCL one ... which can also use CUDA. I kept it around because it's far better tested, and coming in with a PR that deletes an entire library didn't seem particularly helpful. |
070a90d to
ff24a36
Compare
bcc6e98 to
aab5591
Compare
aab5591 to
73d3ec5
Compare
…ith full trilinear interpolation
…memcpy and memset on Y slices of volumes
…ace with rest of project
…SYCL as well as CUDA
use-after-free issues
…imilarityVolue.hpp
…s to reduce (device) memory usage and reallocation
…thmic time) into using std::unordered_map (hashmap, ammortized linear time)
…tion in sycl/buffer.hpp
… non-existant compute object if a tile was skipped
…iginal CUDA implementation, and further changes should be checked against regressions.
…nd support at the same time
459be82 to
6881103
Compare
Should now be done with #6881103! The option in question is "--backend", valid options are 0 (CUDA) and 1 (SYCL) depending on what backends where enabled with the ALICEVISION_DEPTHMAP_BACKEND compile option. |







Description
This PR adds an SYCL alternative to the CUDA implementation of the depthMapEstimation and depthMapFiltering stages of the pipeline, allowing them to be run on a much wider variety of hardware (including CPUs).
Features list
Todo list
Implementation remarks
See also: #439 (comment)
SYCL is an abstract CPP standard for writing device-agnostic accelerated code, developed by Khronos group. It is fairly similar to CUDA, with the main exception being that all code is valid CPP and as such can live together in the same file. More documentation is available at: https://github.khronos.org/SYCL_Reference/index.html
Various implementations exist, I went with AdaptiveCPP as it was the only free, open source on that supported almost every piece of currently available hardware (with the exception of Metal GPUs, although support is under development). The implementation of the AdaptiveCPP generic backend merits its own discussion: it uses JIT compilation to optimize code at run-time for the device (and even the workload, depending on environment options). See https://adaptivecpp.github.io/AdaptiveCpp/compilation/ and https://adaptivecpp.github.io/AdaptiveCpp/generic-sscp/ for more information.
The implementation is a fairly strict translation of the existing code, with the exception of mipmap images, that had to be re-implemented from scratch, and the elimination of CUDA constant memory (as SYCL has no equivalent). Given SYCL's kernel launch semantics, it was significantly easier to put kernel invocation and code in the same .cpp unit, instead of separate .cu and .cuh files. However, the code is similar enough that kernels between the two versions can be compared with diff.
Main questions: