[WIP] Dev/port cuda to sycl by SFN-eu · Pull Request #2077 · alicevision/AliceVision

SFN-eu · 2026-03-11T07:53:37Z

Description

This PR adds a SYCL alternative to the CUDA implementation of the depthMapEstimation and depthMapFiltering stages of the pipeline, allowing them to be run on a much wider variety of hardware (including CPUs).

Features list

Reimplement the src/aliceVision/depthMap subdirectory in SYCL
Free bonus: approx. 30% memory usage reduction when using the new implementation
Possible fix for [Request] Remove CUDA dependency #439

Todo list

Initial correct implementation
Performance: match CUDA speed (this implementation is currently a bit slower, but there is room for improvement)
Rebase on up-to-date develop branch (or is it preferred to resolve this in the merge commit?)
Test on other hardware and with other pipelines (currently tested with a default pipeline on an RTX3060 and a Ryzen 7 5700G)
- Monstree-full

Implementation remarks

Pull request overview

This PR adds a SYCL-based alternative to the existing CUDA implementation of the depthMapEstimation and depthMapFiltering pipeline stages. Using AdaptiveCpp as the SYCL implementation, this enables running these stages on a much wider variety of hardware including CPUs and non-NVIDIA GPUs. The PR addresses issue #439 (removing CUDA dependency). The SYCL and CUDA implementations are mutually exclusive at build time.

Changes:

New depthMap_sycl directory with SYCL reimplementation of the depthMap subsystem (SGM, Refine, volume IO, mipmap images, device cache, multi-device dispatch, etc.)
CMake build system modifications to support AdaptiveCpp/SYCL as an alternative to CUDA, including a new USE_SYCL option and add_sycl_to_target integration
Conditional compilation in the pipeline entry points (main_depthMapEstimation.cpp, main_depthMapFiltering.cpp) to switch between CUDA and SYCL implementations

Reviewed changes

Copilot reviewed 49 out of 49 changed files in this pull request and generated 18 comments.

File	Description
src/CMakeLists.txt	Adds `ALICEVISION_USE_SYCL` option and AdaptiveCpp discovery
src/cmake/Helpers.cmake	Adds `USE_SYCL` option to `alicevision_add_library`
src/aliceVision/CMakeLists.txt	Conditionally adds `depthMap_sycl` subdirectory
src/aliceVision/depthMap_sycl/CMakeLists.txt	Build definition for the SYCL depthMap library
src/software/pipeline/CMakeLists.txt	SYCL build targets for estimation and filtering executables
src/software/pipeline/main_depthMapEstimation.cpp	Conditional includes and function calls for SYCL path
src/software/pipeline/main_depthMapFiltering.cpp	Conditional includes and function calls for SYCL path
src/aliceVision/depthMap_sycl/computeOnMultiDevices.{hpp,cpp}	Multi-device dispatch with load balancing
src/aliceVision/depthMap_sycl/DepthMapEstimator.{hpp,cpp}	Main depth map estimation orchestration
src/aliceVision/depthMap_sycl/Sgm.{hpp,cpp}	Semi-Global Matching implementation
src/aliceVision/depthMap_sycl/Refine.{hpp,cpp}	Refinement step implementation
src/aliceVision/depthMap_sycl/NormalMapEstimator.{hpp,cpp}	Normal map estimation
src/aliceVision/depthMap_sycl/sycl/*.{hpp,cpp}	Device-side utilities: memory, matrix, color, cache, mipmap, etc.
src/aliceVision/depthMap_sycl/*.{hpp,cpp}	Host-side utilities: params, depth lists, IO, tiles

src/aliceVision/depthMap_sycl/sycl/DeviceCache.cpp

+        id = inthsr(device.get_info<sycl::info::device::vendor_id>()) +
+            strhsr(platform.get_info<sycl::info::platform::name>()); // + has the advantage of avoiding collisions around zero, unlike xor. It's also fast, and commutativity is not a problem for our use case
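
The trade-off mentioned in that comment can be illustrated with a small standalone sketch. It assumes 64-bit hash values and uses `std::hash` as a stand-in; `hashString`, `combineAdd` and `combineXor` are hypothetical names, not the PR's `inthsr`/`strhsr` helpers. With XOR, two equal hashes cancel to zero, whereas wrapping addition only reaches zero in rare cases. Both combinations are commutative.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Combine two hashes by wrapping (modulo 2^64) addition, as in the excerpt.
inline std::uint64_t combineAdd(std::uint64_t a, std::uint64_t b)
{
    return a + b;
}

// Combine two hashes by XOR, shown only for comparison.
inline std::uint64_t combineXor(std::uint64_t a, std::uint64_t b)
{
    return a ^ b;
}

// Hypothetical stand-in for a string hashing helper.
inline std::uint64_t hashString(const std::string& s)
{
    return static_cast<std::uint64_t>(std::hash<std::string>{}(s));
}
```

For any value `h`, `combineXor(h, h)` is always zero, while `combineAdd(h, h)` is `2 * h` modulo 2^64.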


src/aliceVision/depthMap_sycl/computeOnMultiDevices.cpp

+      std::rethrow_exception(e);
+    } catch (sycl::exception const &e) {
+        ALICEVISION_LOG_INFO("Caught asynchronous SYCL exception " << e.code() << ": \""
+                             << e.what() << "\" Warning: Only allocation failures will be dealt with!");
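
The rethrow-and-catch idiom in this hunk can be sketched without SYCL. In SYCL 2020 an asynchronous handler receives a list of exception pointers; the sketch below assumes a plain `std::vector<std::exception_ptr>` instead, and uses `std::bad_alloc` as a stand-in for the allocation-failure case (in real SYCL code that would be a `sycl::exception` whose code is `sycl::errc::memory_allocation`). `AsyncErrorSummary` and `drainAsyncErrors` are hypothetical names.

```cpp
#include <exception>
#include <new>
#include <stdexcept>
#include <vector>

// Counts of what was found while draining a batch of deferred exceptions.
struct AsyncErrorSummary
{
    int allocationFailures = 0;
    int otherErrors = 0;
};

// Rethrow each stored exception so that it can be classified by type.
inline AsyncErrorSummary drainAsyncErrors(const std::vector<std::exception_ptr>& errors)
{
    AsyncErrorSummary summary;
    for (const std::exception_ptr& e : errors)
    {
        if (!e)
            continue;  // rethrowing a null exception_ptr is undefined behaviour
        try
        {
            std::rethrow_exception(e);
        }
        catch (const std::bad_alloc&)
        {
            ++summary.allocationFailures;  // the only case a caller would recover from
        }
        catch (const std::exception&)
        {
            ++summary.otherErrors;  // logged, but not handled further
        }
    }
    return summary;
}
```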


src/aliceVision/depthMap_sycl/DepthMapEstimator.cpp

+    // allocate final depth/similarity map tile list in host memory
+    ALICEVISION_LOG_DEBUG(deviceName <<": Allocating final depth/similarity map tile list in host memory");


src/aliceVision/depthMap_sycl/sycl/deviceMipmapBuilder.hpp

+                // compute Gaussian blur
+                float sumFactor = 0.0f;
+
+                for(int j = -downscale; j <= downscale; j++) // Note: Gaussian radius is downscale level
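
The role of `sumFactor` in a loop of this shape can be shown with a standalone sketch: weights are accumulated over offsets `-radius..radius` and then divided by their sum so that the filter preserves overall brightness. The `sigma` parameter and the function name `gaussianWeights` are assumptions; the excerpt does not show how the PR derives its weights.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Normalised 1D Gaussian weights for offsets -radius..radius.
// sigma is a free parameter here (an assumption, not taken from the PR).
inline std::vector<float> gaussianWeights(int radius, float sigma)
{
    std::vector<float> weights;
    weights.reserve(static_cast<std::size_t>(2 * radius + 1));

    float sumFactor = 0.0f;
    for (int j = -radius; j <= radius; ++j)
    {
        const float w = std::exp(-static_cast<float>(j * j) / (2.0f * sigma * sigma));
        weights.push_back(w);
        sumFactor += w;
    }

    // Divide by the accumulated sum so that the weights add up to 1.
    for (float& w : weights)
        w /= sumFactor;

    return weights;
}
```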


src/aliceVision/depthMap_sycl/sycl/planeSweeping/similarity.hpp

+#pragma once
+
+#define TSIM_REFINE_USE_HALF
+
+namespace aliceVision {
+namespace depthMap {
+
+/*
+ * @note TSim is the similarity type for volume in device memory.
+ * @note TSimAcc is the similarity accumulation type for volume in device memory.
+ * @note TSimRefine is the similarity type for volume refinement in device memory.
+ */
+
+#ifdef TSIM_USE_FLOAT
+using TSim = float;
+using TSimAcc = float;
+#else
+using TSim = unsigned char;
+using TSimAcc = unsigned int;  // TSimAcc is the similarity accumulation type
+#endif
+
+#ifdef TSIM_REFINE_USE_HALF
+using TSimRefine = sycl::half;


src/aliceVision/depthMap_sycl/sycl/eig33.hpp

+// Symmetric Householder reduction to tridiagonal form.
+
+static inline void sycl_tred2(double V[3][3], double d[3], double e[3])


src/aliceVision/depthMap_sycl/sycl/PatchPattern.hpp

+ * @brief Support class to represent a subpart of a patch pattern
+ *        Each patch pattern subpart gives one similarity score.
+ *
+ * @note We use a static function to acquire a single global reference
+ */
+struct PatchPatternSubpart
+{
+    sycl::float2 coordinates[ALICEVISION_DEVICE_PATCH_MAX_COORDS_PER_SUBPARTS];   //< subpart coordinate list
+    int nbCoordinates;  //< subpart number of coordinate
+    float level;        //< subpart related mipmap level (>=0)
+    float downscale;    //< subpart related mipmap downscale (>=1)
+    float weight;       //< subpart related similarity weight in range (0, 1)
+    int wsh;            //< subpart half-width (full and circle)
+    bool isCircle;      //< subpart is a circle
+};
+
+/**
+ * @class PatchPattern
+ * @brief Support class to represent a patch pattern
+ */
+class PatchPattern
+{
+public:
+    PatchPatternSubpart subparts[ALICEVISION_DEVICE_PATCH_MAX_SUBPARTS];  //< patch pattern subparts (one similarity per subpart)
+    int nbSubparts; //< patch pattern number of subparts (>0)
+
+    // Singleton, no copy operator
+    void operator=(PatchPattern const&) = delete;
+
+    // Default destructor
+    ~PatchPattern() = default;
+
+    /**
+     * @brief helper func to always get the same Patch Pattern
+     */
+    static PatchPattern& getGlobalPatchPattern() {
+        static PatchPattern instance;
+        return instance;
+    }
+private:
+    // Singleton, private default constructor
+    PatchPattern() = default;
+};
+
+/**
+ * @brief Build user custom patch pattern singleton


src/aliceVision/depthMap_sycl/sycl/PatchPattern.hpp

+                const float angleDifference = (M_PI * 2.f) / subpart.nbCoordinates;
+
+                // compute patch pattern relative coordinates
+                for (int j = 0; j < subpart.nbCoordinates; ++j)
+                {
+                    sycl::float2& coords = subpart.coordinates[j];
+
+                    const float radians = angleDifference * j;
+                    coords.x() = std::cos(radians) * radiusValue;
+                    coords.y() = std::sin(radians) * radiusValue;
+                }
+
+                subpart.wsh = int(subpartParams.radius + std::pow(2.f, subpartParams.level - 1.f));
+                subpart.nbCoordinates = subpartParams.nbCoordinates;
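
The circle sampling in this hunk can be reproduced in isolation. The sketch below is a simplified stand-in, not the PR's code: it takes the number of coordinates as an explicit argument, so the count is known before the angular step is derived from it, and it returns plain `std::pair<float, float>` values instead of `sycl::float2`. The name `circleCoordinates` is hypothetical.

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Evenly spaced points on a circle of the given radius, starting at angle 0.
inline std::vector<std::pair<float, float>> circleCoordinates(int nbCoordinates, float radius)
{
    std::vector<std::pair<float, float>> coords;
    if (nbCoordinates <= 0)
        return coords;  // avoids a division by zero below

    const float twoPi = 6.28318530718f;
    const float angleDifference = twoPi / static_cast<float>(nbCoordinates);

    for (int j = 0; j < nbCoordinates; ++j)
    {
        const float radians = angleDifference * static_cast<float>(j);
        coords.emplace_back(std::cos(radians) * radius, std::sin(radians) * radius);
    }
    return coords;
}
```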


src/aliceVision/depthMap_sycl/computeOnMultiDevices.cpp

+#include <aliceVision/alicevision_omp.hpp>
+#include <sycl/sycl.hpp>
+
+// Needed for checking device characteristics


src/aliceVision/depthMap_sycl/sycl/eig33.hpp

+    inline const void getEigenVectorsDesc(sycl::float3& cg, /*sycl::float3& v1, sycl::float3& v2, */sycl::float3& v3/*, float& d1, float& d2, float& d3*/)
+    {
+        double V[3][3], d[3];
+
+        const double xmean = xsum / count;
+        const double ymean = ysum / count;
+        const double zmean = zsum / count;
+
+        cg = sycl::double3(xmean, ymean, zmean).convert<float>();
+
+        V[0][0] = (xxsum - xsum * xmean - xsum * xmean + xmean * xmean * count) / count;
+        V[0][1] = (xysum - ysum * xmean - xsum * ymean + xmean * ymean * count) / count;
+        V[0][2] = (xzsum - zsum * xmean - xsum * zmean + xmean * zmean * count) / count;
+        V[1][0] = (xysum - xsum * ymean - ysum * xmean + ymean * xmean * count) / count;
+        V[1][1] = (yysum - ysum * ymean - ysum * ymean + ymean * ymean * count) / count;
+        V[1][2] = (yzsum - zsum * ymean - ysum * zmean + ymean * zmean * count) / count;
+        V[2][0] = (xzsum - xsum * zmean - zsum * xmean + zmean * xmean * count) / count;
+        V[2][1] = (yzsum - ysum * zmean - zsum * ymean + zmean * ymean * count) / count;
+        V[2][2] = (zzsum - zsum * zmean - zsum * zmean + zmean * zmean * count) / count;
+
+        // should be sorted
+        sycl_eigen_decomposition(V, d);
+
+        /*
+        v1 = sycl::normalize(sycl::float3((float)V[0][2], (float)V[1][2], (float)V[2][2]));
+        v2 = sycl::normalize(sycl::float3((float)V[0][1], (float)V[1][1], (float)V[2][1]));
+        */
+        v3 = sycl::normalize(sycl::float3((float)V[0][0], (float)V[1][0], (float)V[2][0]));
+
+        /*
+        d1 = (float)d[2];
+        d2 = (float)d[1];
+        d3 = (float)d[0];
+        */
+    }
+
+    inline const bool computePlaneByPCA(sycl::float3& p, sycl::float3& n)
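
The covariance entries in this hunk are built from running sums. Because `ysum * xmean` and `xsum * ymean` both equal `count * xmean * ymean`, each entry reduces algebraically to `sum_ab / count - mean_a * mean_b`. The sketch below uses that shorter form; `PointSums` is a hypothetical standalone type that mirrors the member names visible in the excerpt, and it stops before the eigen decomposition.

```cpp
#include <array>

// Running sums over 3D points, enough to build a population covariance matrix.
struct PointSums
{
    double xsum = 0, ysum = 0, zsum = 0;
    double xxsum = 0, yysum = 0, zzsum = 0;
    double xysum = 0, xzsum = 0, yzsum = 0;
    double count = 0;

    void add(double x, double y, double z)
    {
        xsum += x; ysum += y; zsum += z;
        xxsum += x * x; yysum += y * y; zzsum += z * z;
        xysum += x * y; xzsum += x * z; yzsum += y * z;
        count += 1.0;
    }

    // Symmetric 3x3 covariance (divides by count, as in the excerpt).
    // The caller must have added at least one point.
    std::array<std::array<double, 3>, 3> covariance() const
    {
        const double xmean = xsum / count;
        const double ymean = ysum / count;
        const double zmean = zsum / count;

        std::array<std::array<double, 3>, 3> V{};
        V[0][0] = xxsum / count - xmean * xmean;
        V[1][1] = yysum / count - ymean * ymean;
        V[2][2] = zzsum / count - zmean * zmean;
        V[0][1] = V[1][0] = xysum / count - xmean * ymean;
        V[0][2] = V[2][0] = xzsum / count - xmean * zmean;
        V[1][2] = V[2][1] = yzsum / count - ymean * zmean;
        return V;
    }
};
```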


philippremy · 2026-03-13T15:03:22Z

Hi! Cool work! Happy to see that another backend is in the works for this project. And once AdaptiveCpp has sufficient Apple Metal support, I could drop my Metal implementation altogether. Do you have an approximate timeline on when Metal can be used with AdaptiveCpp?

I would then consider dropping my upstreaming plans for MTL-AliceVision if the same could be achieved with a unified backend :).

Main questions:

To avoid removing the existing implementation, the new one lives under src/aliceVision/depthMap_sycl/. Should this be moved?

And if I may chip in on this: If one asked me, I'd support a unified (and backend agnostic) DepthMap library, at least as long as multiple backends need to coexist. Because if they were exclusive to one another, a system with mixed GPU vendors could not utilize all hardware. I had started to think about how one could design that and came up with a concept similar to the ComputeOnMultiGPUs approach: something like a work-orchestrator pattern. The orchestrator class provides a unified API and internally handles dispatching to different backends (and potentially multiple devices). That would also allow for a more fine-grained device/vendor selection, if required. That would leave us with a few API classes (like DepthMapEstimatorOrchestrator, RefineOrchestrator, SgmOrchestrator), abstracting all backend implementation details away.
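
A minimal sketch of the work-orchestrator idea described above, under stated assumptions: `IDepthMapBackend`, `CountingBackend` and the round-robin policy are hypothetical and exist nowhere in AliceVision; only the class name `DepthMapEstimatorOrchestrator` is taken from the comment. Callers see one API, and the orchestrator decides which backend handles each tile.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical backend interface (for illustration only).
class IDepthMapBackend
{
public:
    virtual ~IDepthMapBackend() = default;
    virtual std::string name() const = 0;
    virtual void computeTile(int tileId) = 0;
};

// Unified entry point that hides which backend processes which tile.
class DepthMapEstimatorOrchestrator
{
public:
    void addBackend(std::unique_ptr<IDepthMapBackend> backend)
    {
        _backends.push_back(std::move(backend));
    }

    // Round-robin dispatch; returns the backend name used for each tile.
    std::vector<std::string> run(int tileCount)
    {
        std::vector<std::string> assignment;
        if (_backends.empty())
            return assignment;

        for (int tile = 0; tile < tileCount; ++tile)
        {
            IDepthMapBackend& backend = *_backends[static_cast<std::size_t>(tile) % _backends.size()];
            backend.computeTile(tile);
            assignment.push_back(backend.name());
        }
        return assignment;
    }

private:
    std::vector<std::unique_ptr<IDepthMapBackend>> _backends;
};

// Trivial stand-in backend that only counts the tiles it was given.
class CountingBackend final : public IDepthMapBackend
{
public:
    explicit CountingBackend(std::string name) : _name(std::move(name)) {}
    std::string name() const override { return _name; }
    void computeTile(int) override { ++tilesDone; }
    int tilesDone = 0;

private:
    std::string _name;
};
```

A real orchestrator would more likely balance by device capability or measured throughput than by round-robin; that policy is left out here.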

SFN-eu · 2026-03-13T15:22:48Z

> Hi! Cool work! Happy to see that another backend is in the works for this project. And once AdaptiveCpp has sufficient Apple Metal support, I could drop my Metal implementation altogether. Do you have an approximate timeline on when Metal can be used with AdaptiveCpp?

It already works on Apple CPUs. Metal support is slowly being patched in at AdaptiveCpp/AdaptiveCpp#864; see also https://adaptivecpp.github.io/AdaptiveCpp/install-metal/#enabling-the-metal-backend. TL;DR: it's an experimental option that's still rather barebones and is missing features that this port relies on (namely, USM pointers and the double type). You would probably have to ask them for a more detailed timeline, if one even exists.

> And if I may chip in on this: If one asked me, I'd support a unified (and backend agnostic) DepthMap library, at least as long as multiple backends need to coexist. Because if they were exclusive to one another, a system with mixed GPU vendors could not utilize all hardware. I had started to think about how one could design that and came up with a concept similar to the ComputeOnMultiGPUs approach: something like a work-orchestrator pattern.

This is literally how the AdaptiveCpp backend works. The new DepthMap library is unified and, in some ways, "backend agnostic", because acpp abstracts everything away and simply presents the programmer with a list of all available devices on the system (under the hood it does in fact use OMP, CUDA, HIP, Intel ZE, etc., handling synchronization and memcpys between them at runtime). It also has an experimental multi-device queue (https://adaptivecpp.github.io/AdaptiveCpp/multi-device-queue/) for automatic work distribution, but currently "This extension should not yet be used for any production workloads" (testing how well it performs is something I want to do as part of the performance tuning, but correctness comes first).

SFN-eu · 2026-03-13T15:25:18Z

P.S. Does anyone know how to stop GitHub from spamming the full list of existing commits every time I rebase onto the tip of the develop branch? It's getting a bit old...

philippremy · 2026-03-13T15:31:52Z

Understood, thanks. We'll see whether the double problem gets resolved. Metal itself has no native support for 64-bit floating-point types; I ran into that in my port and simply switched to floats.
And by unified library I meant the original CUDA library and the SYCL port. I think it'll be interesting to merge these into a single library with a unified interface - as long as the old CUDA code remains in the project. Of course that becomes irrelevant if the SYCL part is intended to replace the existing CUDA lib altogether.

SFN-eu · 2026-03-13T15:39:03Z

I mean, it would be possible if the maintainers wished for it, but in other benchmarks AdaptiveCpp has shown itself to be better (i.e. faster) at running CUDA applications than CUDA itself (for those reading who are so inclined, you can check out the full paper: https://dl.acm.org/doi/full/10.1145/3731125.3731127).

So basically I don't see the point of using the CUDA implementation if you want to use the SYCL one ... which can also use CUDA. I kept it around because it's far better tested, and coming in with a PR that deletes an entire library didn't seem particularly helpful.

SFN-eu · 2026-03-14T12:35:45Z

Narrowed down the issue to the initial similarity volume computation: the output of the SYCL implementation (first image) is in fact considerably sparser:

SFN-eu · 2026-03-17T16:35:38Z

OK, the problem is (at least in part) with my implementation of mipmap images (SYCL on top), which are not getting filtered properly:

(Apologies for the perhaps slow progress. I have been going through the code with a fine-toothed comb and have found a whole host of other minor glitches, but this precise issue is still escaping me ... 🙄)

SFN-eu · 2026-03-20T18:05:42Z

> Could you change the build system, to ensure that we can build depthMap (in cuda) and depthMap_sycl at the same time?
> We will need to keep both implementations for a while to fully validate in production that there is no regression and expose an option to choose between one implementation and another.

Should now be done with commit 6881103! The option in question is "--backend"; valid values are 0 (CUDA) and 1 (SYCL), depending on which backends were enabled with the ALICEVISION_DEPTHMAP_BACKEND compile option.
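
A rough sketch of how such a runtime choice could be validated against what was compiled in. Only the numeric values (0 for CUDA, 1 for SYCL) come from the comment above; the macros `EXAMPLE_WITH_CUDA` and `EXAMPLE_WITH_SYCL`, the `Backend` enum and `parseBackend` are hypothetical, since the thread does not show the preprocessor symbols or the parsing code.

```cpp
#include <stdexcept>
#include <string>

// Backend identifiers as described in the comment: 0 = CUDA, 1 = SYCL.
enum class Backend : int { Cuda = 0, Sycl = 1 };

// Hypothetical compile-time switches standing in for whatever the build
// system actually defines; both default to "enabled" in this sketch.
#ifndef EXAMPLE_WITH_CUDA
#define EXAMPLE_WITH_CUDA 1
#endif
#ifndef EXAMPLE_WITH_SYCL
#define EXAMPLE_WITH_SYCL 1
#endif

// True if support for the given backend was compiled into this binary.
constexpr bool isCompiledIn(Backend b)
{
    return b == Backend::Cuda ? (EXAMPLE_WITH_CUDA != 0) : (EXAMPLE_WITH_SYCL != 0);
}

// Validate the integer passed on the command line and map it to a backend.
inline Backend parseBackend(int value)
{
    if (value != 0 && value != 1)
        throw std::invalid_argument("unknown backend id: " + std::to_string(value));

    const Backend backend = static_cast<Backend>(value);
    if (!isCompiledIn(backend))
        throw std::runtime_error("requested backend was not enabled at build time");

    return backend;
}
```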

SFN-eu changed the title ~~Dev/port cuda to sycl~~ [WIP} Dev/port cuda to sycl Mar 11, 2026

SFN-eu changed the title ~~[WIP} Dev/port cuda to sycl~~ [WIP] Dev/port cuda to sycl Mar 11, 2026

SFN-eu marked this pull request as draft March 11, 2026 07:57

SFN-eu force-pushed the dev/port-cuda-to-sycl branch 2 times, most recently from 1b523a0 to 974ac99 March 11, 2026 08:25

SFN-eu force-pushed the dev/port-cuda-to-sycl branch 2 times, most recently from 36e89ec to 3c91498 March 12, 2026 17:42

fabiencastan requested a review from Copilot March 13, 2026 07:53

Copilot started reviewing on behalf of fabiencastan March 13, 2026 07:54

Copilot AI reviewed Mar 13, 2026

SFN-eu force-pushed the dev/port-cuda-to-sycl branch from f4b667f to 1e555ed March 13, 2026 15:24

SFN-eu force-pushed the dev/port-cuda-to-sycl branch 3 times, most recently from 070a90d to ff24a36 March 13, 2026 17:15

SFN-eu force-pushed the dev/port-cuda-to-sycl branch 2 times, most recently from bcc6e98 to aab5591 March 17, 2026 12:54

SFN-eu force-pushed the dev/port-cuda-to-sycl branch from aab5591 to 73d3ec5 March 18, 2026 07:45

sfn added 28 commits March 20, 2026 18:48

62c8256 Replace INFINITY macro with the more correct std::numeric_limits<float>::infinity()
d7aa2cf Create minimal shell "accessor" objects for copying over to device, with full trilinear interpolation
40dd5db Rework memory layout of multidimensional arrays to allow easy use of memcpy and memset on Y slices of volumes
5d97bdf Pass queues by reference
b449d23 Port deviceSimilarityVolume.cpp, including all device code and interface with rest of project
9d36e7c Port depthMapEstimantion and depthMapFiltering to be compatible with SYCL as well as CUDA
36d814f Various bugfixes 2: debug statement boogaloo
5a4fe7f Poort DeviceCache.cpp to SYCL
3786233 Port DeviceMipmapImage.cpp to SYCL
59dd098 Finish device implementation of DeviceMipmapImage, along fixing some use-after-free issues
63bbfc9 Fix mismatched symbols between deviceSimilarityVolume.cpp and deviceSimilarityVolue.hpp
7247698 Misc formatting cleanup and bug fixes
34f5201 Rework construction, storage and destruction of Sgm and Refine objects to reduce (device) memory usage and reallocation
c6b57a4 Swap device cache from using std::map (ordered red-black tree, logarithmic time) into using std::unordered_map (hashmap, ammortized linear time)
1fd82dc Consolidate all multi-dimensional array coordinate to address calculation in sycl/buffer.hpp
8789d69 Fix bug in DepthMapEstimator.cpp that could result in dereferencing a non-existant compute object if a tile was skipped
24834e1 Port deviceDepthSimilarityMap to SYCL
3d5c33b Fix miscelanous memory access and allocation issues
5e9f049 Collection of minor correctness and performance improvements (with a big impact)
0e8b162 General cleanup while tracking down an issue (which will be fixed in the next commit)
cc6b64d Fix mipmap images
d7aeb7b General cleanup
0b0192a Performance improvements
7038ab5 Correctness fixes. The port now produces equivalent results to the original CUDA implementation, and further changes should be checked against regressions.
7e5c7eb Fix compile error after rebasing
1d58da8 Fix CUDA build
a653faa Correctness fixes
6881103 Rework build and library system to allow building cuda and sycl backend support at the same time

SFN-eu force-pushed the dev/port-cuda-to-sycl branch from 459be82 to 6881103 March 20, 2026 17:59

5 participants