Skip to content

[WIP] Dev/port cuda to sycl#2077

Draft
SFN-eu wants to merge 52 commits intoalicevision:developfrom
SFN-eu:dev/port-cuda-to-sycl
Draft

[WIP] Dev/port cuda to sycl#2077
SFN-eu wants to merge 52 commits intoalicevision:developfrom
SFN-eu:dev/port-cuda-to-sycl

Conversation

@SFN-eu
Copy link

@SFN-eu SFN-eu commented Mar 11, 2026

Description

This PR adds an SYCL alternative to the CUDA implementation of the depthMapEstimation and depthMapFiltering stages of the pipeline, allowing them to be run on a much wider variety of hardware (including CPUs).

Features list

  • Reimplement the src/aliceVision/depthMap subdirectory in SYCL
  • Free bonus: approx. 30% memory usage reduction when using the new implementation
  • Possible fix for [Request] Remove CUDA dependency #439

Todo list

  • Initial correct implementation
  • Performance: match CUDA speed (this implementation is currently a bit slower, but there is room for improvement)
  • Rebase on up-to-date develop branch (or is it preferred to resolve this in the merge commit?)
  • Test on other hardware and with other pipelines (currently tested with a default pipeline on an RTX3060 and a Ryzen 7 5700G)
    • Monstree-full

Implementation remarks

See also: #439 (comment)

SYCL is an abstract CPP standard for writing device-agnostic accelerated code, developed by Khronos group. It is fairly similar to CUDA, with the main exception being that all code is valid CPP and as such can live together in the same file. More documentation is available at: https://github.khronos.org/SYCL_Reference/index.html

Various implementations exist, I went with AdaptiveCPP as it was the only free, open source on that supported almost every piece of currently available hardware (with the exception of Metal GPUs, although support is under development). The implementation of the AdaptiveCPP generic backend merits its own discussion: it uses JIT compilation to optimize code at run-time for the device (and even the workload, depending on environment options). See https://adaptivecpp.github.io/AdaptiveCpp/compilation/ and https://adaptivecpp.github.io/AdaptiveCpp/generic-sscp/ for more information.

The implementation is a fairly strict translation of the existing code, with the exception of mipmap images, that had to be re-implemented from scratch, and the elimination of CUDA constant memory (as SYCL has no equivalent). Given SYCL's kernel launch semantics, it was significantly easier to put kernel invocation and code in the same .cpp unit, instead of separate .cu and .cuh files. However, the code is similar enough that kernels between the two versions can be compared with diff.

Main questions:

  • Currently, building with SYCL requires disabling CUDA. IS this appropriate, or should the build system be modified (and generally, are their problems with the current build system config)?
  • To avoid removing the existing implementation, the new one lives under src/aliceVision/depthMap_sycl/. Should this be moved?
  • This adds AdaptiveCPP as a dependency. Do new dependencies need any special treatment?
  • Is there any applicable documentation? The new implementation behaves identically to the old one from a user's perspective, and I couldn't find any developer documentation.
  • The depthMap pipeline doesn't have any tests that I could find. Do we want to add them/did I miss something?

@SFN-eu SFN-eu changed the title Dev/port cuda to sycl [WIP} Dev/port cuda to sycl Mar 11, 2026
@SFN-eu SFN-eu changed the title [WIP} Dev/port cuda to sycl [WIP] Dev/port cuda to sycl Mar 11, 2026
@SFN-eu SFN-eu marked this pull request as draft March 11, 2026 07:57
@SFN-eu SFN-eu force-pushed the dev/port-cuda-to-sycl branch 2 times, most recently from 1b523a0 to 974ac99 Compare March 11, 2026 08:25
@servantftransperfect
Copy link
Contributor

Hello, @SFN-eu .

Thanks for this huge PR !

Would you mind testing on https://github.com/alicevision/dataset_monstree (full) and put the output of the depthmap node somewhere ?

Thanks !

@SFN-eu
Copy link
Author

SFN-eu commented Mar 11, 2026

So apparently the monstertree dataset causes it to intermittently segfault on some (but not all) of the stages of depthMapEstimation, the joys of testing and it-works-on-my-{machine,dataset}. Will get back once I've fixed the issue.

@SFN-eu
Copy link
Author

SFN-eu commented Mar 11, 2026

So, the segfaults only happened a couple of times with the meshroom GUI , then disappeared. I have since been unable to replicate them, even with valgrind, so I'm tentatively putting them down to using development versions of both AliceVision and Meshroom. If anyone has any similar issues, please report.

In the meantime, here is the result of running the default "photogrammetry" pipeline on monstertree/full:
image
Here is the raw output of the DepthMap and DepthMapFilter stages, as generated by meshroom:
https://github.com/SFN-eu/tmp

@servantftransperfect
Copy link
Contributor

There is a huge difference between the cuda result and your. The most visible is the depthmapfilter. The output is almost sparse in your output ?

@SFN-eu
Copy link
Author

SFN-eu commented Mar 12, 2026

Oops, missed that, apologies. Having had a closer look (and now knowing what to look for) there are two issues:

A brightness difference in DepthMap (CUDA on right)
image

Some glitches in DepthMapFiltered (CUDA still on right)
image

I'll fix these and see if I can get CI to work.

@SFN-eu SFN-eu force-pushed the dev/port-cuda-to-sycl branch 2 times, most recently from 36e89ec to 3c91498 Compare March 12, 2026 17:42
@SFN-eu
Copy link
Author

SFN-eu commented Mar 12, 2026

Digging into the problem more. Interestingly the SYCL implementation (top) reports a better sim map, despite performing worse:
image

Also, apparently device accelerated code is only ran in DepthMapFilter for normal map estimation. The exact same CPU code (src/aliceVision/fuseCut/Fuser.cpp) is being run to filter both depth maps, but the inaccuracies compound and end up getting rejected by the filter (as they should be, if the filter is doing it's job).

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a SYCL-based alternative to the existing CUDA implementation of the depthMapEstimation and depthMapFiltering pipeline stages. Using AdaptiveCpp as the SYCL implementation, this enables running these stages on a much wider variety of hardware including CPUs and non-NVIDIA GPUs. The PR addresses issue #439 (removing CUDA dependency). The SYCL and CUDA implementations are mutually exclusive at build time.

Changes:

  • New depthMap_sycl directory with SYCL reimplementation of the depthMap subsystem (SGM, Refine, volume IO, mipmap images, device cache, multi-device dispatch, etc.)
  • CMake build system modifications to support AdaptiveCpp/SYCL as an alternative to CUDA, including a new USE_SYCL option and add_sycl_to_target integration
  • Conditional compilation in the pipeline entry points (main_depthMapEstimation.cpp, main_depthMapFiltering.cpp) to switch between CUDA and SYCL implementations

Reviewed changes

Copilot reviewed 49 out of 49 changed files in this pull request and generated 18 comments.

Show a summary per file
File Description
src/CMakeLists.txt Adds ALICEVISION_USE_SYCL option and AdaptiveCpp discovery
src/cmake/Helpers.cmake Adds USE_SYCL option to alicevision_add_library
src/aliceVision/CMakeLists.txt Conditionally adds depthMap_sycl subdirectory
src/aliceVision/depthMap_sycl/CMakeLists.txt Build definition for the SYCL depthMap library
src/software/pipeline/CMakeLists.txt SYCL build targets for estimation and filtering executables
src/software/pipeline/main_depthMapEstimation.cpp Conditional includes and function calls for SYCL path
src/software/pipeline/main_depthMapFiltering.cpp Conditional includes and function calls for SYCL path
src/aliceVision/depthMap_sycl/computeOnMultiDevices.{hpp,cpp} Multi-device dispatch with load balancing
src/aliceVision/depthMap_sycl/DepthMapEstimator.{hpp,cpp} Main depth map estimation orchestration
src/aliceVision/depthMap_sycl/Sgm.{hpp,cpp} Semi-Global Matching implementation
src/aliceVision/depthMap_sycl/Refine.{hpp,cpp} Refinement step implementation
src/aliceVision/depthMap_sycl/NormalMapEstimator.{hpp,cpp} Normal map estimation
src/aliceVision/depthMap_sycl/sycl/*.{hpp,cpp} Device-side utilities: memory, matrix, color, cache, mipmap, etc.
src/aliceVision/depthMap_sycl/*.{hpp,cpp} Host-side utilities: params, depth lists, IO, tiles

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +21 to +22
id = inthsr(device.get_info<sycl::info::device::vendor_id>()) +
strhsr(platform.get_info<sycl::info::platform::name>()); // + has the advantage of avoiding collisions around zero, unlike xor. It's also fast, and commutativity is not a problem for our usecase
std::rethrow_exception(e);
} catch (sycl::exception const &e) {
ALICEVISION_LOG_INFO("Caught asynchronous SYCL exception " << e.code() << ": \""
<< e.what() << "\" Warning: Only allocation faliures will dealt with!");
Comment on lines +155 to +156
// allocate final deth/similarity map tile list in host memory
ALICEVISION_LOG_DEBUG(deviceName <<": Allocating final deth/similarity map tile list in host memory");
Comment on lines +100 to +103
// compute gausian blur
float sumFactor = 0.0f;

for(int j = -downscale; j <= downscale; j++) // Note: gausian radius is downscale level
Comment on lines +7 to +29
#pragma once

#define TSIM_REFINE_USE_HALF

namespace aliceVision {
namespace depthMap {

/*
* @note TSim is the similarity type for volume in device memory.
* @note TSimAcc is the similarity accumulation type for volume in device memory.
* @note TSimRefine is the similarity type for volume refinement in device memory.
*/

#ifdef TSIM_USE_FLOAT
using TSim = float;
using TSimAcc = float;
#else
using TSim = unsigned char;
using TSimAcc = unsigned int; // TSimAcc is the similarity accumulation type
#endif

#ifdef TSIM_REFINE_USE_HALF
using TSimRefine = sycl::half;
Comment on lines +12 to +14
// Symmetric Householder reductio3 to tridiago3al form.

static inline void sycl_tred2(double V[3][3], double d[3], double e[3])
Comment on lines +24 to +69
* @brief Support class to reppresent a subpart of a patch pattern
* Each patch pattern subpart gives one similarity score.
*
* @note We use a static function to aquire a single global reference
*/
struct PatchPatternSubpart
{
sycl::float2 coordinates[ALICEVISION_DEVICE_PATCH_MAX_COORDS_PER_SUBPARTS]; //< subpart coordinate list
int nbCoordinates; //< subpart number of coordinate
float level; //< subpart related mipmap level (>=0)
float downscale; //< subpart related mipmap downscale (>=1)
float weight; //< subpart related similarity weight in range (0, 1)
int wsh; //< subpart half-width (full and circle)
bool isCircle; //< subpart is a circle
};

/**
* @struct PatchPattern
* @brief Support class to reppresent a patch pattern
*/
class PatchPattern
{
public:
PatchPatternSubpart subparts[ALICEVISION_DEVICE_PATCH_MAX_SUBPARTS]; //< patch pattern subparts (one similarity per subpart)
int nbSubparts; //< patch pattern number of subparts (>0)

// Singleton, no copy operator
void operator=(PatchPattern const&) = delete;

// Default destructor
~PatchPattern() = default;

/**
* @brief helper func to always get the same Patch Pattern
*/
static PatchPattern& getGlobalPatchPattern() {
static PatchPattern instance;
return instance;
}
private:
// Singleton, private default constructor
PatchPattern() = default;
};

/**
* @brief Build user custom patch pattern singelton
Comment on lines +227 to +240
const float angleDifference = (M_PI * 2.f) / subpart.nbCoordinates;

// compute patch pattern relative coordinates
for (int j = 0; j < subpart.nbCoordinates; ++j)
{
sycl::float2& coords = subpart.coordinates[j];

const float radians = angleDifference * j;
coords.x() = std::cos(radians) * radiusValue;
coords.y() = std::sin(radians) * radiusValue;
}

subpart.wsh = int(subpartParams.radius + std::pow(2.f, subpartParams.level - 1.f));
subpart.nbCoordinates = subpartParams.nbCoordinates;
#include <aliceVision/alicevision_omp.hpp>
#include <sycl/sycl.hpp>

// Needed for checking device caracteristics
Comment on lines +335 to +371
inline const void getEigenVectorsDesc(sycl::float3& cg, /*sycl::float3& v1, sycl::float3& v2, */sycl::float3& v3/*, float& d1, float& d2, float& d3*/)
{
double V[3][3], d[3];

const double xmean = xsum / count;
const double ymean = ysum / count;
const double zmean = zsum / count;

cg = sycl::double3(xmean, ymean, zmean).convert<float>();

V[0][0] = (xxsum - xsum * xmean - xsum * xmean + xmean * xmean * count) / count;
V[0][1] = (xysum - ysum * xmean - xsum * ymean + xmean * ymean * count) / count;
V[0][2] = (xzsum - zsum * xmean - xsum * zmean + xmean * zmean * count) / count;
V[1][0] = (xysum - xsum * ymean - ysum * xmean + ymean * xmean * count) / count;
V[1][1] = (yysum - ysum * ymean - ysum * ymean + ymean * ymean * count) / count;
V[1][2] = (yzsum - zsum * ymean - ysum * zmean + ymean * zmean * count) / count;
V[2][0] = (xzsum - xsum * zmean - zsum * xmean + zmean * xmean * count) / count;
V[2][1] = (yzsum - ysum * zmean - zsum * ymean + zmean * ymean * count) / count;
V[2][2] = (zzsum - zsum * zmean - zsum * zmean + zmean * zmean * count) / count;

// should be sorted
sycl_eigen_decomposition(V, d);

/*
v1 = sycl::normalize(sycl::float3((float)V[0][2], (float)V[1][2], (float)V[2][2]));
v2 = sycl::normalize(sycl::float3((float)V[0][1], (float)V[1][1], (float)V[2][1]));
*/
v3 = sycl::normalize(sycl::float3((float)V[0][0], (float)V[1][0], (float)V[2][0]));

/*
d1 = (float)d[2];
d2 = (float)d[1];
d3 = (float)d[0];
*/
}

inline const bool computePlaneByPCA(sycl::float3& p, sycl::float3& n)
@philippremy
Copy link
Contributor

Hi! Cool work! Happy to see that another backend is in the works for this project. And once AdaptiveCpp has sufficient Apple Metal support, I could drop my Metal implementation whatsoever. Do you have an approximate timeline on when Metal can be used with AdaptiveCpp?

I would then consider dropping my upstreaming plans for MTL-AliceVision if the same could be achieved with a unified backend :).

Main questions:

  • To avoid removing the existing implementation, the new one lives under src/aliceVision/depthMap_sycl/. Should this be moved?

And if I may chip in on this: If one asked me, I'd support a unified (and backend agnostic) DepthMap library, at least as long as multiple backends need to coexist. Because if they were exclusive to one another, a system with mixed GPU vendors could not utilize all hardware. I had started to think about how one could design that and came up with a concept similar to the ComputeOnMultiGPUs approach: something like a work-orchestrator pattern. The orchestrator class provides a unified API and internally handles dispatching to different backends (and potentially multiple devices). That would also allow for a more fine-grained device/vendor selection, if required. That would leave us with a few API classes (like DepthMapEstimatorOrchestrator, RefineOrchestrator, SgmOrchestrator), abstracting all backend implementation details away.

@SFN-eu
Copy link
Author

SFN-eu commented Mar 13, 2026

Hi! Cool work! Happy to see that another backend is in the works for this project. And once AdaptiveCpp has sufficient Apple Metal support, I could drop my Metal implementation whatsoever. Do you have an approximate timeline on when Metal can be used with AdaptiveCpp?

It already works on Apple CPU, Metal support is slowly being patched in at AdaptiveCpp/AdaptiveCpp#864; see also https://adaptivecpp.github.io/AdaptiveCpp/install-metal/#enabling-the-metal-backend. Tl;dr it's an experimental option that's still rather barebones and misses features that this port relies on (namely, USM pointers and the double type). You would probably have to ask them for a more detailed timeline, if it even exists.

And if I may chip in on this: If one asked me, I'd support a unified (and backend agnostic) DepthMap library, at least as long as multiple backends need to coexist. Because if they were exclusive to one another, a system with mixed GPU vendors could not utilize all hardware. I had started to think about how one could design that and came up with a concept similar to the ComputeOnMultiGPUs approach: something like a work-orchestrator pattern.

This is literally how the AdaptiveCPP backend works. The new DepthMap library is unified and, in some ways, "backend agnostic", because acpp abstracts everything away and just presents to the end programmer a list of all available devices on the system (and does in fact use OMP, CUDA, HIP, Intel ZE ecc. under the hood, handling synchronization and memcpy's between them at runtime). It also has an experimental multi-device queue (https://adaptivecpp.github.io/AdaptiveCpp/multi-device-queue/) for automatic work distribution, but currently "This extension should not yet be used for any production workloads" (testing how well it performs is something I want to do as part of the performance tuning, but correctness comes first).

@SFN-eu SFN-eu force-pushed the dev/port-cuda-to-sycl branch from f4b667f to 1e555ed Compare March 13, 2026 15:24
@SFN-eu
Copy link
Author

SFN-eu commented Mar 13, 2026

P.s. does anyone know how to stop GitHub from spamming the full list of existing commits every time I rebase onto the tip of the develop branch? It's getting a bit old...

@philippremy
Copy link
Contributor

Understood, thanks. We'll see if the double problem will get resolved. Metal itself has no native support for 64-bit floating-point types, that is something I encountered in my port and just switched to floats.
And by unified library I meant the original CUDA library and the SYCL port. I think it'll be interesting to merge these into a single library with a unified interface - as long as the old CUDA code remains in the project. Of course that becomes irrelevant if the SYCL part is intended to replace the existing CUDA lib altogether.

@SFN-eu
Copy link
Author

SFN-eu commented Mar 13, 2026

I mean, it would be possible if the maintainers wished for it, but in other benchmarks AdaptiveCPP has shown itself to be better (i.e. faster) at running CUDA applications than CUDA itself (for those reading who are so inclined, you can check out the full paper: https://dl.acm.org/doi/full/10.1145/3731125.3731127).

So basically I don't see the point of using the CUDA implementation if you want to use the SYCL one ... which can also use CUDA. I kept it around because it's far better tested, and coming in with a PR that deletes an entire library didn't seem particularly helpful.

@SFN-eu SFN-eu force-pushed the dev/port-cuda-to-sycl branch 3 times, most recently from 070a90d to ff24a36 Compare March 13, 2026 17:15
@SFN-eu
Copy link
Author

SFN-eu commented Mar 14, 2026

Narrowed down the issue to the initial similarity volume computation: SYCL (first image) implementation is in fact rather more sparse:
image
image

@SFN-eu SFN-eu force-pushed the dev/port-cuda-to-sycl branch 2 times, most recently from bcc6e98 to aab5591 Compare March 17, 2026 12:54
@SFN-eu
Copy link
Author

SFN-eu commented Mar 17, 2026

Ok, the problem is (at least in part) with my implementation of mipmap images (SYCL top), they are not getting filtered properly:

image

(apologies for the perhaps slow progress, I have been going through the code with a fine toothed comb and have found a whole host of other minor glitches, but this precise issue is sill escaping me ... 🙄)

@SFN-eu SFN-eu force-pushed the dev/port-cuda-to-sycl branch from aab5591 to 73d3ec5 Compare March 18, 2026 07:45
sfn added 28 commits March 20, 2026 18:48
…s to reduce (device) memory usage and reallocation
…thmic time) into using std::unordered_map (hashmap, ammortized linear time)
… non-existant compute object if a tile was skipped
…iginal CUDA implementation, and further changes should be checked against regressions.
@SFN-eu SFN-eu force-pushed the dev/port-cuda-to-sycl branch from 459be82 to 6881103 Compare March 20, 2026 17:59
@SFN-eu
Copy link
Author

SFN-eu commented Mar 20, 2026

Could you change the build system, to ensure that we can build depthMap (in cuda) and depthMap_sycl at the same time?
We will need to keep both implementations for a while to fully validate in production that there is no regression and expose an option to choose between one implementation and another.

Should now be done with #6881103! The option in question is "--backend", valid options are 0 (CUDA) and 1 (SYCL) depending on what backends where enabled with the ALICEVISION_DEPTHMAP_BACKEND compile option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants