In this blog post, I (Aki) discuss different optimizations that can speed up Stan sampling. I approach this from the user viewpoint, and you can thank numerous (10+) Stan developers for many of the speed-ups mentioned below. The speed-ups have required hard work from developers working on Stan core C++/OCaml (a big part of Stan is written in C++, and the Stan compiler handling the Stan model code is written in OCaml) and on the interfaces (Cmd/R/Python/Julia). The examples are based on CmdStanR and GCC on Ubuntu Linux, but the same optimizations can be used in RStan, PyStan, CmdStanPy, CmdStan, and on Windows/Mac.
Stan code is first transpiled to C++ code (by the stanc compiler), which is then compiled to machine code and linked with libraries (by a C++ compiler and linker), and thus there are different places for optimizations. Some of the optimizations can also benefit from or require changes to the Stan code.
- C++ compiler optimization options
- C++ compiler CPU architecture options
- stanc optimization options
- External BLAS / LAPACK libraries
- External BLAS / LAPACK libraries with multithreading
- Within chain threading with reduce_sum
- Solver tolerances
Many of the optimizations are related to CPU vs memory access speed, memory locality, branch prediction, etc. As these are excellently presented and demonstrated in the blog post What scientists must know about hardware to write fast code, I won't go into the details here, but will mention the relevant points along the way.
The downside of the more extensive optimizations can be significantly increased compilation time. Also, switching some of the compiler and library options may require rebuilding CmdStan, which usually takes much more time than just compiling one model. This is the reason why not all optimizations are enabled by default in Stan. When installing CmdStan, some of the Stan C++ code related to math, the sampler, I/O, etc. is compiled. When compiling a model, only the model part needs to be compiled and linked with the previously compiled parts and libraries. Changing some of the compiler options mentioned below may require rebuilding the CmdStan installation, which increases the time needed for experimenting with those options.
C++ compiler optimization options
C++ compilers can do a lot of different optimizations. By default, when calling a C++ compiler, the speed optimizations are not enabled (-O0), but CmdStan compilation by default enables most of the optimizations the GCC compiler has (-O3). For more details, you can start from GCC Options That Control Optimization.
-O[123]
Although the GCC documentation says -O0 reduces the compilation time, I've observed -O1 to have 50% shorter compilation time than -O0 and to produce much faster sampling. In one example, sampling with -O1 was 36 times faster than with -O0. There is no reason to use -O0.
In my experiments, the differences in compilation time and sampling speed are small between -O1, -O2, and -O3. I've observed -O1 to have a 10-20% shorter compilation time, but 0-15% longer sampling time, so it may be useful for experiments that require frequent recompilation while the sampling itself is very fast. I think it's fine to use -O3 by default.
The -O option can be controlled by adding to make/local (3 is the default)
O = 3
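If you use CmdStanR, make/local can also be edited from R; a minimal sketch (here trying O = 1; as noted above, changing this may require rebuilding CmdStan):
library(cmdstanr)
# append a line to CmdStan's make/local (the default optimization level is O = 3)
cmdstan_make_local(cpp_options = list("O = 1"), append = TRUE)
# changing compiler options may require rebuilding CmdStan
rebuild_cmdstan(cores = 4)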
STAN_CPP_OPTIMS=true
There are some further C++ optimizations that Stan developers tested a couple of years ago to improve sampling speed. These have been collected into a list and can be enabled with a single STAN_CPP_OPTIMS option. In my experiments, these provided an additional few percent reduction in sampling time, but increased the compilation time with high variability. As the expected benefits for many models are small, and the compilation time can more than double, these are not enabled by default. If you want to try to squeeze out some extra speed-up for a posterior with a very long sampling time, you can enable these by adding to make/local
STAN_CPP_OPTIMS=true
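If you prefer to set this per model instead of globally in make/local, CmdStanR's cpp_options argument can pass makefile flags when the model is compiled; a minimal sketch (model.stan is just a placeholder name):
library(cmdstanr)
# compile just this model with the extra C++ optimizations enabled
mod <- cmdstan_model("model.stan", cpp_options = list(STAN_CPP_OPTIMS = TRUE))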
C++ compiler CPU architecture options
The C++ compiler -O optimization levels do many optimizations, but they still produce code that will also run on older CPUs. More modern CPUs have extended instruction sets that can speed up computation significantly, but a compiler option is required to enable them. As different CPU architectures and compilers define these options slightly differently, I briefly review the different cases:
- recent gcc x86 and Arm:
  - -march=X: Tells the compiler that X is the minimal architecture the binary must run on. The compiler is free to use architecture-specific instructions. This option behaves differently on Arm and x86. On Arm, -march does not override -mtune, but on x86 -march does override both -mtune and -mcpu.
  - -mtune=X: Tells the compiler to optimize for microarchitecture X, but does not allow the compiler to change the ABI or make assumptions about available instructions. This option has more-or-less the same meaning on Arm and x86.
  - -mcpu=X: On Arm, this option is a combination of -march and -mtune. It simultaneously specifies the target architecture and optimizes for a given microarchitecture. On x86, this option is a deprecated synonym for -mtune.
- clang: starting from 12.0, -mtune works the same as in gcc
- So based on the documentation of the recent compilers, -mtune=native should not use features that are not available, but…
- … older compilers may have different behavior or may not recognize the specific CPU details, and then using -mtune=native may fail, but -mtune=generic is likely to work. If only -march=native has been defined, it usually implies -mtune=native, except when the compiler doesn't recognize all the details and switches to -mtune=generic, which seems to have happened in those Windows cases where just dropping -mtune=native did help.
-march=native can have a big effect, and if that is included, adding -mtune=native doesn't seem to hurt anything. These are not enabled by default, as some compilers have problems with them. Enabling -march (and -mtune) can increase the compilation time by up to 30%, but can also reduce the sampling time by up to 50%, as discussed in a discourse post.
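With CmdStanR, one way to try these flags is to append them to make/local and rebuild CmdStan; a minimal sketch, assuming your compiler accepts both flags (as noted above, on some Windows setups dropping -mtune=native helps):
library(cmdstanr)
# add the architecture flags to make/local and rebuild CmdStan
cmdstan_make_local(cpp_options = list("CXXFLAGS += -march=native -mtune=native"),
                   append = TRUE)
rebuild_cmdstan(cores = 4)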
The biggest reason -march can have a big effect is that it enables the use of SIMD ("single instruction, multiple data") instructions in modern CPUs. SIMD instructions pack several floating point computations into one instruction, so that one CPU clock step has the potential for higher performance. For maximal benefit, this requires memory-aligned numbers, no interruptions, and no loops (see more in What scientists must know about hardware to write fast code). I've seen the biggest speed-ups with code including big matrix-vector multiplications (e.g. regression with many covariates) or Cholesky factorizations (e.g. GPs and other multi-normals). In these cases, the computation is done with BLAS/LAPACK routines (also available internally in Eigen), which have highly optimized code that takes the best advantage of SIMD. See some speed comparisons in a discourse post. If you don't see reduced sampling time, either your problem is too small or the computation is in for loops.
stanc optimization options
The stanc compiler is the part of Stan that transpiles Stan model code to C++ code. stanc has its own optimization flags.
--O1
--O1 enables various optimizations. The release notes for 2.29 say that --O1 applies optimizations that are simple, do not dramatically change the program, and are unlikely to noticeably slow down compile times. These optimizations include dead code elimination, copy and constant propagation, automatic-differentiation level optimization, and detection of opportunities to represent parameter vectors and matrices as structs-of-arrays.
The structs-of-arrays optimization uses a better memory layout for matrix and vector parameters. Stan's autodiff needs to store, in addition to the parameter values, the partial derivatives.
The original memory layout of parameters was aimed at flexibility, allowing assignments to individually indexed elements of matrices and vectors: each value and the corresponding partial derivative were stored together, and the whole matrix or vector was represented as an array of structures (AoS). While this provided flexibility, it breaks memory locality. Since 2.29, the --O1 optimization enables a structure-of-arrays (SoA) memory layout for matrix and vector parameters that are not indexed (all operations are in vectorized form). For models with matrix or vector parameters with many elements, this option can provide up to a 50% reduction in sampling time. See the 2.29 release notes for some additional description of this feature.
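With CmdStanR, the stanc --O1 flag can be passed when compiling the model, for example (model.stan is just a placeholder name):
library(cmdstanr)
# pass the stanc --O1 flag when compiling the model
mod <- cmdstan_model("model.stan", stanc_options = list("O1"))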
-march=native and --O1 are complementary: sometimes only one of them helps, and sometimes both help, giving a larger combined reduction in sampling time.
As the --O1 SoA optimization requires that the relevant parameters are not indexed, changes to the Stan code may have a big effect. When the AoS memory layout is used, for-loops in Stan code can have the same speed as the vectorized form. For example, if sigma is AoS-mapped, then
for (n in 1:N) {
  sigma[n] = exp(sigma[n]);
}
and
sigma = exp(sigma);
have about the same speed. But if we use the vectorized form and --O1 enables the SoA memory layout, then we may see up to a 40% reduction in the sampling time.
STAN_NO_RANGE_CHECKS
This option should be used only when you trust that all the indexing in your code is correct. Range checks are on by default and verify that indexing doesn't go out of range, which could otherwise cause unexpected results or memory corruption. Out-of-range indexing with the checks disabled will not cause your computer to explode, but it can cause your Stan sampling to fail without a good message explaining why. However, if you trust your code, you can get an additional reduction in sampling time. After the other optimizations, I've seen a few percent reduction in sampling time with range checks disabled.
You can disable the range checks by adding to make/local
STAN_NO_RANGE_CHECKS=true
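A sketch of how this could instead be passed per model with CmdStanR's cpp_options (I assume the makefile flag is accepted there, as for the other flags above):
library(cmdstanr)
# disable range checks for this model only (use only if you trust the indexing)
mod <- cmdstan_model("model.stan", cpp_options = list(STAN_NO_RANGE_CHECKS = TRUE))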
External BLAS / LAPACK libraries
BLAS and LAPACK libraries are highly optimized libraries for vector and matrix computation. There are different BLAS and LAPACK libraries with further CPU-specific optimizations.
By default, Stan uses the Eigen library for a lot of the computation. Eigen has its own internal BLAS and LAPACK routines that are fast if -march=native is used. Most of the time there is no reason to use external BLAS and LAPACK libraries. Using external BLAS and LAPACK libraries adds some overhead in passing the computation to the external libraries, and even if external BLAS and LAPACK are enabled, Eigen uses them only if the matrices and vectors are big enough that the overhead can be assumed to be negligible.
In my experiments presented in a discourse post, most of the time the external libraries were not faster, but there was also a case where using the Intel MKL library did provide a 19% reduction in sampling time compared to Eigen's internal routines. As installing the external libraries is an additional burden, and I also encountered one missing function implementation, I don't recommend using external libraries by default (unless using multithreading, as discussed below). However, if you know your model includes big matrix multiplications or factorizations (like Cholesky), and you think even a 20% reduction in sampling time would be useful, then it can be worth testing them. When making the speed comparison between external and Eigen's internal BLAS/LAPACK, remember to compare against Eigen with -march=native enabled when compiling the model.
See the discourse thread for instructions on how to enable external BLAS/LAPACK libraries.
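As an illustration only (the discourse thread has the actual instructions), routing Eigen to an external BLAS/LAPACK typically involves defining Eigen's EIGEN_USE_BLAS and EIGEN_USE_LAPACKE macros and adding the link flags; the library names below assume a Linux system with OpenBLAS and LAPACKE installed:
library(cmdstanr)
# tell Eigen to route big enough matrix operations to OpenBLAS/LAPACKE and link against them
cmdstan_make_local(cpp_options = list(
  "CXXFLAGS += -DEIGEN_USE_BLAS -DEIGEN_USE_LAPACKE",
  "LDLIBS += -lopenblas -llapacke"
), append = TRUE)
rebuild_cmdstan(cores = 4)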
External BLAS / LAPACK libraries with multithreading
Some Eigen docs do mention that multithreading in the internal BLAS/LAPACK would be possible, but the change seems to be so recent that it didn't work the last time I tested. However, I was able to enable multithreading in external BLAS/LAPACK libraries. When multithreading is enabled for the BLAS/LAPACK libraries, the multithreading is used within one chain without any modifications to the Stan code. The multithreading is applied only to the external BLAS/LAPACK routines, that is, only to big enough matrix computations. For some models, the computation can be dominated by such matrix operations, for example, Gaussian processes and other multi-normal models with big covariance matrices. For such models, enabling multithreading in external BLAS/LAPACK can produce up to linear scaling in speed with respect to the number of physical cores used. See some timing comparisons in a discourse post.
The benefit of using multithreading for BLAS/LAPACK applies specifically to models where the computation is not trivially factorizable into smaller chunks, and in addition, it usually doesn't require any changes to the Stan code (sometimes some operations may be slightly faster than others; see the above-mentioned comparison).
I mentioned physical cores above. Modern CPUs usually have X physical cores and 2X virtual cores. While virtual cores enable running more threads simultaneously, there is only one floating point computation pipeline per physical core, and if the sampling is dominated by computation filling the floating point pipelines, additional virtual-core threads will not help, and are likely to increase the computation time due to the cost of switching between threads that use the floating point pipeline.
See the discourse thread for instructions on how to enable external BLAS/LAPACK libraries and multithreading for them.
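The number of BLAS/LAPACK threads is typically controlled with an environment variable such as OPENBLAS_NUM_THREADS or MKL_NUM_THREADS; a minimal sketch from R, assuming the CmdStan process launched by CmdStanR inherits the variable (mod and standata are placeholders):
# use 4 BLAS threads; keep parallel_chains * BLAS threads at or below the number of physical cores
Sys.setenv(OPENBLAS_NUM_THREADS = 4)
fit <- mod$sample(data = standata, chains = 4, parallel_chains = 1)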
Within chain threading with reduce_sum
If some part of the computation can be divided into many independent computations whose results are then summed together, then it is possible to use the reduce_sum approach with multithreading. For example, if the likelihood is a product of conditionally independent likelihood terms, then the log likelihood is a sum of independent log likelihood terms. The following code for log-normal regression with single threading
target += lognormal_lpdf(Y | Intercept + Xc * b, sigma);
can be changed to
target += reduce_sum(partial_log_lik_lpmf, seq, grainsize, Y, Xc, b, Intercept, sigma);
where array[n] int seq = linspaced_int_array(n, 1, n); and partial_log_lik_lpmf is defined as
real partial_log_lik_lpmf(array[] int seq, int start, int end, data vector Y, data matrix Xc, vector b, real Intercept, real sigma) {
  return lognormal_lpdf(Y[start:end] | Intercept + Xc[start:end] * b, sigma);
}
grainsize tells the size of the blocks when dividing the computation into several blocks.
A discourse post shows some timing results for using reduce_sum. These timing experiments were made without the stanc --O1 optimization.
The reduce_sum approach internally uses slice indexing of the parameter vector, which disables the SoA memory layout, and thus there can be cases where using reduce_sum with 2 threads has similar performance to not using it with 1 thread and --O1.
reduce_sum requires changes to the code, but template code for reduce_sum can be generated with brms: make_stancode(y ~ x, family=lognormal(), data=d, threads=2). See also
When using reduce_sum, the models need to be compiled with an additional flag that can be set in make/local as
STAN_THREADS=true
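With CmdStanR, the flag can be set when compiling the model, and the number of threads per chain is chosen at sampling time, for example (model.stan and standata are placeholders):
library(cmdstanr)
mod <- cmdstan_model("model.stan", cpp_options = list(stan_threads = TRUE))
# 4 chains in parallel, each using 2 threads for reduce_sum
fit <- mod$sample(data = standata, chains = 4, parallel_chains = 4,
                  threads_per_chain = 2)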
See more in Reduce-sum in the Stan User's Guide and a Reduce-sum tutorial.
Solver tolerances
This one is a bit different, but as it can have a big effect on performance, I include it here (and we have a paper about this).
Stan functions for 1D integration, algebra solvers, ODE solvers, and the differential-algebraic equation (DAE) solver accept tolerance arguments. By default, the tolerance values are small to get accurate results in different situations. In some cases, the tolerances can be much larger without a significant impact on the accuracy, while the sampling speed can be orders of magnitude faster. The paper An importance sampling approach for reliable and efficient inference in Bayesian ordinary differential equation models discusses how larger tolerance values can be used safely.