exp: Padé denominator solve via LinearSolve default algorithm choice#239
exp: Padé denominator solve via LinearSolve default algorithm choice#239ChrisRackauckas-Claude wants to merge 10 commits into
Conversation
…nctions
`phi(A, k)` / `phi!` previously computed phi_0(A), ..., phi_k(A) by calling
`phiv_dense!` on each of the m basis vectors, each exponentiating an
(m+k) x (m+k) block matrix -- overall O(m (m+k)^3).
This implements Algorithm 5.1 of Al-Mohy & Liu, "A scaling and recovering
algorithm for the matrix phi-functions" (arXiv:2506.01193, 2025), which
computes all phi functions simultaneously in O(k m^3): scale A by 2^-s,
evaluate the [m/m] diagonal Pade approximant to phi_p via Paterson--Stockmeyer,
recover the lower-index approximants with the recurrence
R^{(j)} = z R^{(j+1)} + 1/j!, and undo the scaling with the double-argument
formula. Backward-error-optimal Pade degree m and scaling s are selected from
the paper's theta table (Table 3.1) and the alpha_r / t cost analysis.
The new path is the default for Float64/ComplexF64 dense inputs (its Pade
tables are tuned for double precision); other element types and callers passing
the legacy `caches` bundle keep the basis-vector path. Diagonal inputs are
unchanged.
Validated against a high-precision block-exponential reference and the legacy
algorithm: worst-case relative error ~1e-14 across normal, non-normal,
large-norm, complex, Hessenberg and zero matrices for k = 1..10, matching or
beating the old accuracy. Measured speedups over the basis-vector path: 18x at
n=50, 51x at n=100, 72x at n=200 (k=3).
Closes SciML#235.
Co-Authored-By: Chris Rackauckas <[email protected]>
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
Address review feedback on SciML#236: * PhiPadeCache: a reusable workspace holding every scratch buffer. The strided Float64/ComplexF64 path (`_phi_almohy!`) now runs fully in place via mul!/getrf!/getrs! and broadcasting, allocating nothing once a cache exists (verified 0 bytes with @allocated across sizes). Pass it as the `caches` keyword to `phi!` to reuse across calls. (@stevengj: "do in-place, maybe?") * Container-preserving generic path (`_phi_almohy_generic`) for immutable dense matrices: an SMatrix input stays an SMatrix throughout instead of being converted to a Matrix. Routed only for `!ismutable` inputs, so sparse/other mutable non-strided matrices keep the legacy path. (@stevengj: "unfortunate to use Matrix here if you start with SMatrix") * Export and document PhiPadeCache and phi!; bump to 1.31.0 (new public API, SemVer minor). Tests: workspace reuse + zero-allocation assertion, complex-matrix workspace, and SMatrix static-in/static-out accuracy vs the block-exponential reference. Full basictests + Aqua + JET pass locally (the only failures are the pre-existing ForwardDiff-1.x Issue 42 asserts, green under CI's pinned ForwardDiff 0.10.x). Co-Authored-By: Chris Rackauckas <[email protected]> Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
`LAPACK.getrf!(A, ipiv)` with a preallocated pivot vector only exists on newer Julia; on the LTS (1.10) it errored, breaking the workspace solve. ccall `getrf!` directly through libblastrampoline (as `StegrWork` does for `stegr!`) so the factorization stays zero-allocation on all supported versions; the 4-argument `getrs!` used for the solve is available everywhere. Verified on Julia 1.10 and 1.12: correctness ~2.5e-15 vs the BigFloat reference and 0 bytes allocated on cache reuse. Co-Authored-By: Chris Rackauckas <[email protected]> Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
The workspace solve used qualified `LinearAlgebra.BlasInt` and `LinearAlgebra.LAPACK.getrs!`, which SciMLTesting's ExplicitImports `all_qualified_accesses_are_public` check flags as reaching into non-public internals (they are not on its ignore list). Mirror `exp_noalloc.jl` instead: import `BlasInt` (an allow-listed explicit import) and ccall both `getrf!` and `getrs!` through `BLAS.@blasfunc`/`BLAS.libblastrampoline` (allow-listed names). No qualified access to non-public names remains in `phi_almohy.jl`. Behavior unchanged: verified correctness ~2.5e-15 and 0-byte cache reuse on Julia 1.10 and 1.12; Aqua and JET still pass. Co-Authored-By: Chris Rackauckas <[email protected]> Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
Re-analysis pass over the Al-Mohy--Liu implementation: * Paterson--Stockmeyer rewritten in Horner form with the delta trick. The previous block scheme (mirroring the reference MATLAB) used explicit Atau powers plus two products per block -- up to 12 multiplications for m = 12 where Fasi's pi_m(tau) = 7, so the parameter selection was optimizing a cost the evaluator did not achieve. The Horner form's count now equals pi_m(tau) exactly. Measured 10-20% end-to-end speedup on the workspace path across n = 25..200. * Numerical failure returns codes instead of throwing, so adaptive integrators can reject a step rather than abort: cache.info[] is 0 on success, the LAPACK getrf info for a singular Pade denominator, or -1 for non-finite results (LAPACK does not flag NaN pivots, so a post-solve finiteness check backs the code); outputs are NaN-filled on failure. Fixes a latent InexactError where NaN/Inf input crashed in ceil(Int, t) during parameter selection before reaching the solve, and the generic path's SingularException from `\`. Wrongly sized workspaces (programmer errors) still throw, with informative messages. * Workspace vectors sized from p at construction (was a fixed cap that BoundsError'd for p >~ 48) with an explicit capacity check; opnorm(A, 1) hoisted out of the per-degree ell loop (was recomputed 8x); dead Atau/Id buffers dropped (2n^2 less memory). * Fix a closure-capture leak in the test helpers: phi_block_reference assigned to `n`, clobbering the enclosing testset's `n` (Julia closures assign to existing outer locals); the helpers now declare their temporaries `local`. Verified: worst relative error 8.8e-15 over 33 BigFloat-referenced cases (real/complex, normal/nonnormal, p = 1..8); 0 bytes allocated on cache reuse (Float64 and ComplexF64, n = 10..100, p = 1..8, Julia 1.10 and 1.12); NaN/Inf inputs return info != 0 with NaN outputs and no throw; full basictests 574 pass on 1.10 and 1.12 (only the pre-existing ForwardDiff-1.x Issue 42 failures remain, which pass under CI's pinned ForwardDiff 0.10); ExplicitImports clean. Co-Authored-By: Chris Rackauckas <[email protected]> Co-Authored-By: Claude Fable 5 <[email protected]> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
…alls Two review-driven changes: * _phi_be_coeff evaluated (m+p)!, m!, (2m+p)!, (2m+p+1)! separately before dividing, so intermediates could spuriously overflow (Inf/Inf = NaN for p >~ 159) even where the quotient is representable (@stevengj). Rewritten as a product of sub-unit factors, prod 1/(m+p+j) * prod 1/(m+j), which is monotonically decreasing and can only underflow when the true value does. Matches the BigFloat-exact value to 1.1e-15 over m = 1:12, p = 1:60; new p = 30 regression test. * The hand-rolled getrf!/getrs! ccalls are gone: the solve now uses the public LinearAlgebra API, lu!(Dfact; check = false) + issuccess + ldiv!, which supports the matrix RHS directly and keeps the no-throw return-code semantics (issuccess false -> info > 0, NaN-filled outputs). LinearSolve.jl was considered but its LinearCache only solves vector right-hand sides through the public path, which would force a column-by-column loop or reaching into cache internals; plain lu!/ldiv! achieves ccall-free with zero new dependencies. Cost: the pivot vector is now the one per-call allocation (8n + ~100 bytes, measured 112 B at n = 7 to 928 B at n = 100; all O(n^2) buffers still reused), since no public API preallocates lu! pivots. Allocation tests assert that O(n) bound. The BlasInt import and ipiv field are removed; phi_almohy.jl is now entirely public-API. Verified on Julia 1.10 and 1.12: 254-assertion Phi testset passes (including new large-p and allocation-bound tests), full basictests 605 pass (only the pre-existing ForwardDiff-1.x Issue 42 failures remain), ExplicitImports clean, NaN/Inf return codes intact. Co-Authored-By: Chris Rackauckas <[email protected]> Co-Authored-By: Claude Fable 5 <[email protected]> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
With LinearSolve 4.0.0 supporting LinearProblem(A, B::Matrix) like A\B (SciML/LinearSolve.jl#1072), PhiPadeCache now embeds a LinearSolve cache and _phi_solve! runs the Pade denominator solve through it instead of raw lu!/ldiv!. GenericLUFactorization is used because it refactorizes in place with a cached pivot vector: workspace reuse is now fully allocation-free (0 bytes measured for Float64 and ComplexF64 at n = 7..100, p = 3..8, on Julia 1.10 and 1.12), improving on the lu! version's per-call O(n) pivot allocation. (LUFactorization currently copies A on every dense refactorization -- an in-place opt-in is being added upstream, after which the LAPACK path can be swapped in for large n.) Return-code semantics are unchanged: retcode Success -> info 0, singular -> info 1 with NaN-filled outputs, non-finite results -> -1. Tests tightened back to @allocated == 0; 254-assertion Phi testset passes on 1.10 and 1.12 against LinearSolve 4.0.0. Co-Authored-By: Chris Rackauckas <[email protected]> Co-Authored-By: Claude Fable 5 <[email protected]> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
ExpMethodHigham2005Base solved (V - U) X = (V + U) with raw LAPACK.gesv! calls; route both sites (small-norm and squaring branches) through LinearSolve's default algorithm choice with the batched matrix RHS instead. This picks the best factorization per size/type (e.g. RecursiveFactorization for small matrices when loaded), and fails with a retcode rather than a LAPACKException on pathological input. Verified exp ≈ Base.exp for n = 5, 40, 120 at small and large norms (both code paths) against LinearSolve 4.0.0. Co-Authored-By: Chris Rackauckas <[email protected]> Co-Authored-By: Claude Fable 5 <[email protected]> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
|
Benchmarked honestly: as written, this is a performance regression in the hot regime — do not merge yet.
Slower up to ~1.7× for n ≤ 60 (the Krylov- Next revision here: thread a cached |
ForwardDiff 1.0 made iszero(::Dual) require zero partials as well, so Dual(0, 1) fell through to intlog2, where ceil(Int, nx) dropped the partials and 2^intlog2(0) = 2^64 overflowed to 0; the final result^0 = one(result) then returned derivative 0 instead of 1 (Issue 42 testset). Gate the scaling on nx > 1 instead: values in (0, 1] already produced s = 0, and Dual comparison uses the value only, so zero-valued Duals with nonzero partials now take the s = 0 branch. Verified against both ForwardDiff 0.10 and 1.x. Widens the ForwardDiff test compat to "0.10.13, 1" accordingly; with LinearSolve 4 in the dependency tree, PureKLU 1.1 forces ForwardDiff >= 1, which is how this latent failure surfaced. Co-Authored-By: Chris Rackauckas <[email protected]> Co-Authored-By: Claude Fable 5 <[email protected]>
The per-call solve(LinearProblem(temp, X)) added init, matrix copies, and a fresh factorization allocation on every exponential! call, which measured slower than the raw LAPACK.gesv! it replaced for n <= 60. Build a LinearSolve.init workspace once in alloc_mem with workspace-owned n-by-n buffers and alias = LinearAliasSpecifier( alias_A = true, alias_b = true), so each call just refills the buffers and solve! refactorizes in place (lu!) with no O(n^2) copy (LinearSolve >= 4.1). Both Pade branches (small-norm and scaling-and-squaring) route through the shared helper, which restores gesv!'s SingularException contract on a failed factorization. Best-of-N per-call times (Julia 1.12, OpenBLAS, this machine), old gesv! vs cached LinearSolve: ~1.3-1.8x slower for n <= 30 (fixed solve!/property overhead of a few microseconds), parity at n = 60, and 1.2-1.5x faster for n >= 120. Allocations on cache reuse stay O(n) (e.g. 1.4 KB at n = 60). Co-Authored-By: Chris Rackauckas <[email protected]> Co-Authored-By: Claude Fable 5 <[email protected]>
Stacked on #236 (branches from it; shares the LinearSolve v4 dependency — merge #236 first).
First step of moving the package's remaining linear solves onto LinearSolve:
ExpMethodHigham2005Base's Padé denominator solve(V−U) X = (V+U)(both the small-norm and squaring branches) now goes through LinearSolve's default algorithm choice with the v4 batched matrix RHS, replacing the rawLAPACK.gesv!calls. Benefits: per-size/type algorithm selection (e.g. RecursiveFactorization when loaded), retcode-based failure instead ofLAPACKException, consistency withphi!.Verified locally against LinearSolve 4.0.0:
exponential! ≈ Base.expfor n = 5, 40, 120 at small and large norms (exercising both converted sites).Scoped small per review preference. Follow-ups, in order: (1)
ExpMethodHigham2005(exp_noalloc.jl) with a workspace-embedded LinearSolve cache once the upstream in-place dense-LU option lands (SciML/LinearSolve.jl#1074 discussion); (2) benchmark whether a cachedinit/solve!here is worth the extraalloc_memplumbing. Deliberately not convertingexp_generic.jl(exotic-eltype path; generic LU is already optimal there) or the SMatrix\(container-preserving).CI note: needs LinearSolve 4.0.0 registered in General (in flight) to resolve.
🤖 Generated with Claude Code