exp: Padé denominator solve via LinearSolve default algorithm choice by ChrisRackauckas-Claude · Pull Request #239 · SciML/ExponentialUtilities.jl

ChrisRackauckas-Claude · 2026-07-05T02:42:25Z

Please ignore until reviewed by @ChrisRackauckas. Draft; opened by an agent at Chris's direction.

Stacked on #236 (branches from it; shares the LinearSolve v4 dependency — merge #236 first).

First step of moving the package's remaining linear solves onto LinearSolve: ExpMethodHigham2005Base's Padé denominator solve (V−U) X = (V+U) (both the small-norm and squaring branches) now goes through LinearSolve's default algorithm choice with the v4 batched matrix RHS, replacing the raw LAPACK.gesv! calls. Benefits: per-size/type algorithm selection (e.g. RecursiveFactorization when loaded), retcode-based failure instead of LAPACKException, consistency with phi!.

Verified locally against LinearSolve 4.0.0: exponential! ≈ Base.exp for n = 5, 40, 120 at small and large norms (exercising both converted sites).

Scoped small per review preference. Follow-ups, in order: (1) ExpMethodHigham2005 (exp_noalloc.jl) with a workspace-embedded LinearSolve cache once the upstream in-place dense-LU option lands (SciML/LinearSolve.jl#1074 discussion); (2) benchmark whether a cached init/solve! here is worth the extra alloc_mem plumbing. Deliberately not converting exp_generic.jl (exotic-eltype path; generic LU is already optimal there) or the SMatrix \ (container-preserving).

CI note: needs LinearSolve 4.0.0 registered in General (in flight) to resolve.

🤖 Generated with Claude Code

…nctions `phi(A, k)` / `phi!` previously computed phi_0(A), ..., phi_k(A) by calling `phiv_dense!` on each of the m basis vectors, each exponentiating an (m+k) x (m+k) block matrix -- overall O(m (m+k)^3). This implements Algorithm 5.1 of Al-Mohy & Liu, "A scaling and recovering algorithm for the matrix phi-functions" (arXiv:2506.01193, 2025), which computes all phi functions simultaneously in O(k m^3): scale A by 2^-s, evaluate the [m/m] diagonal Pade approximant to phi_p via Paterson--Stockmeyer, recover the lower-index approximants with the recurrence R^{(j)} = z R^{(j+1)} + 1/j!, and undo the scaling with the double-argument formula. Backward-error-optimal Pade degree m and scaling s are selected from the paper's theta table (Table 3.1) and the alpha_r / t cost analysis. The new path is the default for Float64/ComplexF64 dense inputs (its Pade tables are tuned for double precision); other element types and callers passing the legacy `caches` bundle keep the basis-vector path. Diagonal inputs are unchanged. Validated against a high-precision block-exponential reference and the legacy algorithm: worst-case relative error ~1e-14 across normal, non-normal, large-norm, complex, Hessenberg and zero matrices for k = 1..10, matching or beating the old accuracy. Measured speedups over the basis-vector path: 18x at n=50, 51x at n=100, 72x at n=200 (k=3). Closes SciML#235. Co-Authored-By: Chris Rackauckas <[email protected]> Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh

@allocated

Address review feedback on SciML#236: * PhiPadeCache: a reusable workspace holding every scratch buffer. The strided Float64/ComplexF64 path (`_phi_almohy!`) now runs fully in place via mul!/getrf!/getrs! and broadcasting, allocating nothing once a cache exists (verified 0 bytes with @allocated across sizes). Pass it as the `caches` keyword to `phi!` to reuse across calls. (@stevengj: "do in-place, maybe?") * Container-preserving generic path (`_phi_almohy_generic`) for immutable dense matrices: an SMatrix input stays an SMatrix throughout instead of being converted to a Matrix. Routed only for `!ismutable` inputs, so sparse/other mutable non-strided matrices keep the legacy path. (@stevengj: "unfortunate to use Matrix here if you start with SMatrix") * Export and document PhiPadeCache and phi!; bump to 1.31.0 (new public API, SemVer minor). Tests: workspace reuse + zero-allocation assertion, complex-matrix workspace, and SMatrix static-in/static-out accuracy vs the block-exponential reference. Full basictests + Aqua + JET pass locally (the only failures are the pre-existing ForwardDiff-1.x Issue 42 asserts, green under CI's pinned ForwardDiff 0.10.x). Co-Authored-By: Chris Rackauckas <[email protected]> Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh

`LAPACK.getrf!(A, ipiv)` with a preallocated pivot vector only exists on newer Julia; on the LTS (1.10) it errored, breaking the workspace solve. ccall `getrf!` directly through libblastrampoline (as `StegrWork` does for `stegr!`) so the factorization stays zero-allocation on all supported versions; the 4-argument `getrs!` used for the solve is available everywhere. Verified on Julia 1.10 and 1.12: correctness ~2.5e-15 vs the BigFloat reference and 0 bytes allocated on cache reuse. Co-Authored-By: Chris Rackauckas <[email protected]> Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh

The workspace solve used qualified `LinearAlgebra.BlasInt` and `LinearAlgebra.LAPACK.getrs!`, which SciMLTesting's ExplicitImports `all_qualified_accesses_are_public` check flags as reaching into non-public internals (they are not on its ignore list). Mirror `exp_noalloc.jl` instead: import `BlasInt` (an allow-listed explicit import) and ccall both `getrf!` and `getrs!` through `BLAS.@blasfunc`/`BLAS.libblastrampoline` (allow-listed names). No qualified access to non-public names remains in `phi_almohy.jl`. Behavior unchanged: verified correctness ~2.5e-15 and 0-byte cache reuse on Julia 1.10 and 1.12; Aqua and JET still pass. Co-Authored-By: Chris Rackauckas <[email protected]> Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh

Re-analysis pass over the Al-Mohy--Liu implementation: * Paterson--Stockmeyer rewritten in Horner form with the delta trick. The previous block scheme (mirroring the reference MATLAB) used explicit Atau powers plus two products per block -- up to 12 multiplications for m = 12 where Fasi's pi_m(tau) = 7, so the parameter selection was optimizing a cost the evaluator did not achieve. The Horner form's count now equals pi_m(tau) exactly. Measured 10-20% end-to-end speedup on the workspace path across n = 25..200. * Numerical failure returns codes instead of throwing, so adaptive integrators can reject a step rather than abort: cache.info[] is 0 on success, the LAPACK getrf info for a singular Pade denominator, or -1 for non-finite results (LAPACK does not flag NaN pivots, so a post-solve finiteness check backs the code); outputs are NaN-filled on failure. Fixes a latent InexactError where NaN/Inf input crashed in ceil(Int, t) during parameter selection before reaching the solve, and the generic path's SingularException from `\`. Wrongly sized workspaces (programmer errors) still throw, with informative messages. * Workspace vectors sized from p at construction (was a fixed cap that BoundsError'd for p >~ 48) with an explicit capacity check; opnorm(A, 1) hoisted out of the per-degree ell loop (was recomputed 8x); dead Atau/Id buffers dropped (2n^2 less memory). * Fix a closure-capture leak in the test helpers: phi_block_reference assigned to `n`, clobbering the enclosing testset's `n` (Julia closures assign to existing outer locals); the helpers now declare their temporaries `local`. Verified: worst relative error 8.8e-15 over 33 BigFloat-referenced cases (real/complex, normal/nonnormal, p = 1..8); 0 bytes allocated on cache reuse (Float64 and ComplexF64, n = 10..100, p = 1..8, Julia 1.10 and 1.12); NaN/Inf inputs return info != 0 with NaN outputs and no throw; full basictests 574 pass on 1.10 and 1.12 (only the pre-existing ForwardDiff-1.x Issue 42 failures remain, which pass under CI's pinned ForwardDiff 0.10); ExplicitImports clean. Co-Authored-By: Chris Rackauckas <[email protected]> Co-Authored-By: Claude Fable 5 <[email protected]> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh

@stevengj

…alls Two review-driven changes: * _phi_be_coeff evaluated (m+p)!, m!, (2m+p)!, (2m+p+1)! separately before dividing, so intermediates could spuriously overflow (Inf/Inf = NaN for p >~ 159) even where the quotient is representable (@stevengj). Rewritten as a product of sub-unit factors, prod 1/(m+p+j) * prod 1/(m+j), which is monotonically decreasing and can only underflow when the true value does. Matches the BigFloat-exact value to 1.1e-15 over m = 1:12, p = 1:60; new p = 30 regression test. * The hand-rolled getrf!/getrs! ccalls are gone: the solve now uses the public LinearAlgebra API, lu!(Dfact; check = false) + issuccess + ldiv!, which supports the matrix RHS directly and keeps the no-throw return-code semantics (issuccess false -> info > 0, NaN-filled outputs). LinearSolve.jl was considered but its LinearCache only solves vector right-hand sides through the public path, which would force a column-by-column loop or reaching into cache internals; plain lu!/ldiv! achieves ccall-free with zero new dependencies. Cost: the pivot vector is now the one per-call allocation (8n + ~100 bytes, measured 112 B at n = 7 to 928 B at n = 100; all O(n^2) buffers still reused), since no public API preallocates lu! pivots. Allocation tests assert that O(n) bound. The BlasInt import and ipiv field are removed; phi_almohy.jl is now entirely public-API. Verified on Julia 1.10 and 1.12: 254-assertion Phi testset passes (including new large-p and allocation-bound tests), full basictests 605 pass (only the pre-existing ForwardDiff-1.x Issue 42 failures remain), ExplicitImports clean, NaN/Inf return codes intact. Co-Authored-By: Chris Rackauckas <[email protected]> Co-Authored-By: Claude Fable 5 <[email protected]> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh

@allocated

With LinearSolve 4.0.0 supporting LinearProblem(A, B::Matrix) like A\B (SciML/LinearSolve.jl#1072), PhiPadeCache now embeds a LinearSolve cache and _phi_solve! runs the Pade denominator solve through it instead of raw lu!/ldiv!. GenericLUFactorization is used because it refactorizes in place with a cached pivot vector: workspace reuse is now fully allocation-free (0 bytes measured for Float64 and ComplexF64 at n = 7..100, p = 3..8, on Julia 1.10 and 1.12), improving on the lu! version's per-call O(n) pivot allocation. (LUFactorization currently copies A on every dense refactorization -- an in-place opt-in is being added upstream, after which the LAPACK path can be swapped in for large n.) Return-code semantics are unchanged: retcode Success -> info 0, singular -> info 1 with NaN-filled outputs, non-finite results -> -1. Tests tightened back to @allocated == 0; 254-assertion Phi testset passes on 1.10 and 1.12 against LinearSolve 4.0.0. Co-Authored-By: Chris Rackauckas <[email protected]> Co-Authored-By: Claude Fable 5 <[email protected]> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh

ExpMethodHigham2005Base solved (V - U) X = (V + U) with raw LAPACK.gesv! calls; route both sites (small-norm and squaring branches) through LinearSolve's default algorithm choice with the batched matrix RHS instead. This picks the best factorization per size/type (e.g. RecursiveFactorization for small matrices when loaded), and fails with a retcode rather than a LAPACKException on pathological input. Verified exp ≈ Base.exp for n = 5, 40, 120 at small and large norms (both code paths) against LinearSolve 4.0.0. Co-Authored-By: Chris Rackauckas <[email protected]> Co-Authored-By: Claude Fable 5 <[email protected]> Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh

ChrisRackauckas-Claude · 2026-07-05T05:40:46Z

Benchmarked honestly: as written, this is a performance regression in the hot regime — do not merge yet.

exponential! best-of-N, old (gesv!) vs new (solve(LinearProblem(...)) per call), Julia 1.12, LinearSolve 4.0.0:

n	norm	old	new	new + RF loaded
10	small	10 µs	17 µs	18 µs
30	small	81 µs	96 µs	88 µs
60	small	358 µs	561 µs	454 µs
60	large	534 µs	792 µs	568 µs
120	small	3.96 ms	3.52 ms	4.00 ms
250	small	9.96 ms	9.32 ms	9.80 ms

Slower up to ~1.7× for n ≤ 60 (the Krylov-Hm sizes this package hits hardest), roughly neutral at n ≥ 120. The cost is the per-call init (A/b copies + default-alg dispatch), which gesv!'s single in-place LAPACK call doesn't pay; RecursiveFactorization narrows but doesn't close the gap.

Next revision here: thread a cached init/solve! LinearSolve cache through alloc_mem so repeated calls skip init and copies, then re-benchmark. If cached solve still can't match gesv! at n ≤ 60, the honest conclusion is that this solve should stay on gesv! (or the conversion should be behavior-only for the retcode benefit with an explicit perf note).

ForwardDiff 1.0 made iszero(::Dual) require zero partials as well, so Dual(0, 1) fell through to intlog2, where ceil(Int, nx) dropped the partials and 2^intlog2(0) = 2^64 overflowed to 0; the final result^0 = one(result) then returned derivative 0 instead of 1 (Issue 42 testset). Gate the scaling on nx > 1 instead: values in (0, 1] already produced s = 0, and Dual comparison uses the value only, so zero-valued Duals with nonzero partials now take the s = 0 branch. Verified against both ForwardDiff 0.10 and 1.x. Widens the ForwardDiff test compat to "0.10.13, 1" accordingly; with LinearSolve 4 in the dependency tree, PureKLU 1.1 forces ForwardDiff >= 1, which is how this latent failure surfaced. Co-Authored-By: Chris Rackauckas <[email protected]> Co-Authored-By: Claude Fable 5 <[email protected]>

The per-call solve(LinearProblem(temp, X)) added init, matrix copies, and a fresh factorization allocation on every exponential! call, which measured slower than the raw LAPACK.gesv! it replaced for n <= 60. Build a LinearSolve.init workspace once in alloc_mem with workspace-owned n-by-n buffers and alias = LinearAliasSpecifier( alias_A = true, alias_b = true), so each call just refills the buffers and solve! refactorizes in place (lu!) with no O(n^2) copy (LinearSolve >= 4.1). Both Pade branches (small-norm and scaling-and-squaring) route through the shared helper, which restores gesv!'s SingularException contract on a failed factorization. Best-of-N per-call times (Julia 1.12, OpenBLAS, this machine), old gesv! vs cached LinearSolve: ~1.3-1.8x slower for n <= 30 (fixed solve!/property overhead of a few microseconds), parity at n = 60, and 1.2-1.5x faster for n >= 120. Allocations on cache reuse stay O(n) (e.g. 1.4 KB at n = 60). Co-Authored-By: Chris Rackauckas <[email protected]> Co-Authored-By: Claude Fable 5 <[email protected]>

ChrisRackauckas and others added 8 commits July 2, 2026 13:32

ChrisRackauckas and others added 2 commits July 5, 2026 02:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

exp: Padé denominator solve via LinearSolve default algorithm choice#239

exp: Padé denominator solve via LinearSolve default algorithm choice#239
ChrisRackauckas-Claude wants to merge 10 commits into
SciML:masterfrom
ChrisRackauckas-Claude:exp-linearsolve

ChrisRackauckas-Claude commented Jul 5, 2026

Uh oh!

ChrisRackauckas-Claude commented Jul 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Uh oh!

Conversation

ChrisRackauckas-Claude commented Jul 5, 2026

Uh oh!

ChrisRackauckas-Claude commented Jul 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants