The autovectorisation section is a bit brief. In particular, the vectorisation cost model for the targeted SIMD extensions will often assume that unaligned SIMD loads are expensive. If you’re just doing one multiply, for example, then an unaligned load may not be worth it. The structure will, by default, have 4-byte alignment. Sticking alignas(16) on it will likely enable more vectorisation. GCC and clang also have a non-standard vector_size attribute that lets you define arbitrary-sized vectors; the compiler will then try to lower operations on them to SIMD instructions. This can typically give you a big benefit for this kind of thing, without needing any custom vector code.
It’s also worth noting that newer vector extensions put a lot of effort into scatter-gather operations specifically to enable autovectorisation. These are annoying to implement (imagine if you do a scatter store to four different pages and each one page faults and is paged in from disk!), but they make it possible for the compiler to vectorise loops over more complex data structures, which can be a huge win.
I’d have thought there was a fairly narrow space where CPU SIMD mattered in game engines. Most of the places where SIMD will get an 8x speedup, GPU offload will get a 1000x speedup. This is even more true on systems with a unified memory architecture where there’s much less cost in moving data from CPU to GPU. It would be nice for the article to discuss this.
Early SIMD (MMX, 3DNow!, SSE1) on mainstream CPUs typically gave a 10-30% speedup in games, but adding a dedicated GPU more than doubled the resolution and massively increased the rendering quality.
Reading a bit more:
An earlier article from the same author mentions that to switch to GPU they’d have to change algorithm: