Skip to content

exp: Padé denominator solve via LinearSolve default algorithm choice#239

Draft
ChrisRackauckas-Claude wants to merge 10 commits into
SciML:masterfrom
ChrisRackauckas-Claude:exp-linearsolve
Draft

exp: Padé denominator solve via LinearSolve default algorithm choice#239
ChrisRackauckas-Claude wants to merge 10 commits into
SciML:masterfrom
ChrisRackauckas-Claude:exp-linearsolve

Conversation

@ChrisRackauckas-Claude

Copy link
Copy Markdown
Contributor

Please ignore until reviewed by @ChrisRackauckas. Draft; opened by an agent at Chris's direction.

Stacked on #236 (branches from it; shares the LinearSolve v4 dependency — merge #236 first).

First step of moving the package's remaining linear solves onto LinearSolve: ExpMethodHigham2005Base's Padé denominator solve (V−U) X = (V+U) (both the small-norm and squaring branches) now goes through LinearSolve's default algorithm choice with the v4 batched matrix RHS, replacing the raw LAPACK.gesv! calls. Benefits: per-size/type algorithm selection (e.g. RecursiveFactorization when loaded), retcode-based failure instead of LAPACKException, consistency with phi!.

Verified locally against LinearSolve 4.0.0: exponential! ≈ Base.exp for n = 5, 40, 120 at small and large norms (exercising both converted sites).

Scoped small per review preference. Follow-ups, in order: (1) ExpMethodHigham2005 (exp_noalloc.jl) with a workspace-embedded LinearSolve cache once the upstream in-place dense-LU option lands (SciML/LinearSolve.jl#1074 discussion); (2) benchmark whether a cached init/solve! here is worth the extra alloc_mem plumbing. Deliberately not converting exp_generic.jl (exotic-eltype path; generic LU is already optimal there) or the SMatrix \ (container-preserving).

CI note: needs LinearSolve 4.0.0 registered in General (in flight) to resolve.

🤖 Generated with Claude Code

ChrisRackauckas and others added 8 commits July 2, 2026 13:32
…nctions

`phi(A, k)` / `phi!` previously computed phi_0(A), ..., phi_k(A) by calling
`phiv_dense!` on each of the m basis vectors, each exponentiating an
(m+k) x (m+k) block matrix -- overall O(m (m+k)^3).

This implements Algorithm 5.1 of Al-Mohy & Liu, "A scaling and recovering
algorithm for the matrix phi-functions" (arXiv:2506.01193, 2025), which
computes all phi functions simultaneously in O(k m^3): scale A by 2^-s,
evaluate the [m/m] diagonal Pade approximant to phi_p via Paterson--Stockmeyer,
recover the lower-index approximants with the recurrence
R^{(j)} = z R^{(j+1)} + 1/j!, and undo the scaling with the double-argument
formula. Backward-error-optimal Pade degree m and scaling s are selected from
the paper's theta table (Table 3.1) and the alpha_r / t cost analysis.

The new path is the default for Float64/ComplexF64 dense inputs (its Pade
tables are tuned for double precision); other element types and callers passing
the legacy `caches` bundle keep the basis-vector path. Diagonal inputs are
unchanged.

Validated against a high-precision block-exponential reference and the legacy
algorithm: worst-case relative error ~1e-14 across normal, non-normal,
large-norm, complex, Hessenberg and zero matrices for k = 1..10, matching or
beating the old accuracy. Measured speedups over the basis-vector path: 18x at
n=50, 51x at n=100, 72x at n=200 (k=3).

Closes SciML#235.

Co-Authored-By: Chris Rackauckas <[email protected]>
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
Address review feedback on SciML#236:

* PhiPadeCache: a reusable workspace holding every scratch buffer. The
  strided Float64/ComplexF64 path (`_phi_almohy!`) now runs fully in place
  via mul!/getrf!/getrs! and broadcasting, allocating nothing once a cache
  exists (verified 0 bytes with @allocated across sizes). Pass it as the
  `caches` keyword to `phi!` to reuse across calls. (@stevengj: "do
  in-place, maybe?")

* Container-preserving generic path (`_phi_almohy_generic`) for immutable
  dense matrices: an SMatrix input stays an SMatrix throughout instead of
  being converted to a Matrix. Routed only for `!ismutable` inputs, so
  sparse/other mutable non-strided matrices keep the legacy path.
  (@stevengj: "unfortunate to use Matrix here if you start with SMatrix")

* Export and document PhiPadeCache and phi!; bump to 1.31.0 (new public
  API, SemVer minor).

Tests: workspace reuse + zero-allocation assertion, complex-matrix
workspace, and SMatrix static-in/static-out accuracy vs the block-exponential
reference. Full basictests + Aqua + JET pass locally (the only failures are
the pre-existing ForwardDiff-1.x Issue 42 asserts, green under CI's pinned
ForwardDiff 0.10.x).

Co-Authored-By: Chris Rackauckas <[email protected]>
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
`LAPACK.getrf!(A, ipiv)` with a preallocated pivot vector only exists on
newer Julia; on the LTS (1.10) it errored, breaking the workspace solve.
ccall `getrf!` directly through libblastrampoline (as `StegrWork` does for
`stegr!`) so the factorization stays zero-allocation on all supported
versions; the 4-argument `getrs!` used for the solve is available
everywhere.

Verified on Julia 1.10 and 1.12: correctness ~2.5e-15 vs the BigFloat
reference and 0 bytes allocated on cache reuse.

Co-Authored-By: Chris Rackauckas <[email protected]>
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
The workspace solve used qualified `LinearAlgebra.BlasInt` and
`LinearAlgebra.LAPACK.getrs!`, which SciMLTesting's ExplicitImports
`all_qualified_accesses_are_public` check flags as reaching into non-public
internals (they are not on its ignore list). Mirror `exp_noalloc.jl`
instead: import `BlasInt` (an allow-listed explicit import) and ccall both
`getrf!` and `getrs!` through `BLAS.@blasfunc`/`BLAS.libblastrampoline`
(allow-listed names). No qualified access to non-public names remains in
`phi_almohy.jl`.

Behavior unchanged: verified correctness ~2.5e-15 and 0-byte cache reuse on
Julia 1.10 and 1.12; Aqua and JET still pass.

Co-Authored-By: Chris Rackauckas <[email protected]>
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
Re-analysis pass over the Al-Mohy--Liu implementation:

* Paterson--Stockmeyer rewritten in Horner form with the delta trick. The
  previous block scheme (mirroring the reference MATLAB) used explicit
  Atau powers plus two products per block -- up to 12 multiplications for
  m = 12 where Fasi's pi_m(tau) = 7, so the parameter selection was
  optimizing a cost the evaluator did not achieve. The Horner form's count
  now equals pi_m(tau) exactly. Measured 10-20% end-to-end speedup on the
  workspace path across n = 25..200.

* Numerical failure returns codes instead of throwing, so adaptive
  integrators can reject a step rather than abort: cache.info[] is 0 on
  success, the LAPACK getrf info for a singular Pade denominator, or -1
  for non-finite results (LAPACK does not flag NaN pivots, so a post-solve
  finiteness check backs the code); outputs are NaN-filled on failure.
  Fixes a latent InexactError where NaN/Inf input crashed in
  ceil(Int, t) during parameter selection before reaching the solve, and
  the generic path's SingularException from `\`. Wrongly sized workspaces
  (programmer errors) still throw, with informative messages.

* Workspace vectors sized from p at construction (was a fixed cap that
  BoundsError'd for p >~ 48) with an explicit capacity check; opnorm(A, 1)
  hoisted out of the per-degree ell loop (was recomputed 8x); dead Atau/Id
  buffers dropped (2n^2 less memory).

* Fix a closure-capture leak in the test helpers: phi_block_reference
  assigned to `n`, clobbering the enclosing testset's `n` (Julia closures
  assign to existing outer locals); the helpers now declare their
  temporaries `local`.

Verified: worst relative error 8.8e-15 over 33 BigFloat-referenced cases
(real/complex, normal/nonnormal, p = 1..8); 0 bytes allocated on cache
reuse (Float64 and ComplexF64, n = 10..100, p = 1..8, Julia 1.10 and
1.12); NaN/Inf inputs return info != 0 with NaN outputs and no throw;
full basictests 574 pass on 1.10 and 1.12 (only the pre-existing
ForwardDiff-1.x Issue 42 failures remain, which pass under CI's pinned
ForwardDiff 0.10); ExplicitImports clean.

Co-Authored-By: Chris Rackauckas <[email protected]>
Co-Authored-By: Claude Fable 5 <[email protected]>
Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
…alls

Two review-driven changes:

* _phi_be_coeff evaluated (m+p)!, m!, (2m+p)!, (2m+p+1)! separately before
  dividing, so intermediates could spuriously overflow (Inf/Inf = NaN for
  p >~ 159) even where the quotient is representable (@stevengj). Rewritten
  as a product of sub-unit factors,
  prod 1/(m+p+j) * prod 1/(m+j), which is monotonically decreasing and can
  only underflow when the true value does. Matches the BigFloat-exact value
  to 1.1e-15 over m = 1:12, p = 1:60; new p = 30 regression test.

* The hand-rolled getrf!/getrs! ccalls are gone: the solve now uses the
  public LinearAlgebra API, lu!(Dfact; check = false) + issuccess + ldiv!,
  which supports the matrix RHS directly and keeps the no-throw return-code
  semantics (issuccess false -> info > 0, NaN-filled outputs).
  LinearSolve.jl was considered but its LinearCache only solves vector
  right-hand sides through the public path, which would force a
  column-by-column loop or reaching into cache internals; plain lu!/ldiv!
  achieves ccall-free with zero new dependencies. Cost: the pivot vector is
  now the one per-call allocation (8n + ~100 bytes, measured 112 B at
  n = 7 to 928 B at n = 100; all O(n^2) buffers still reused), since no
  public API preallocates lu! pivots. Allocation tests assert that O(n)
  bound. The BlasInt import and ipiv field are removed; phi_almohy.jl is
  now entirely public-API.

Verified on Julia 1.10 and 1.12: 254-assertion Phi testset passes
(including new large-p and allocation-bound tests), full basictests 605
pass (only the pre-existing ForwardDiff-1.x Issue 42 failures remain),
ExplicitImports clean, NaN/Inf return codes intact.

Co-Authored-By: Chris Rackauckas <[email protected]>
Co-Authored-By: Claude Fable 5 <[email protected]>
Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
With LinearSolve 4.0.0 supporting LinearProblem(A, B::Matrix) like A\B
(SciML/LinearSolve.jl#1072), PhiPadeCache now embeds a LinearSolve cache
and _phi_solve! runs the Pade denominator solve through it instead of
raw lu!/ldiv!. GenericLUFactorization is used because it refactorizes in
place with a cached pivot vector: workspace reuse is now fully
allocation-free (0 bytes measured for Float64 and ComplexF64 at
n = 7..100, p = 3..8, on Julia 1.10 and 1.12), improving on the lu!
version's per-call O(n) pivot allocation. (LUFactorization currently
copies A on every dense refactorization -- an in-place opt-in is being
added upstream, after which the LAPACK path can be swapped in for large
n.) Return-code semantics are unchanged: retcode Success -> info 0,
singular -> info 1 with NaN-filled outputs, non-finite results -> -1.

Tests tightened back to @allocated == 0; 254-assertion Phi testset
passes on 1.10 and 1.12 against LinearSolve 4.0.0.

Co-Authored-By: Chris Rackauckas <[email protected]>
Co-Authored-By: Claude Fable 5 <[email protected]>
Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
ExpMethodHigham2005Base solved (V - U) X = (V + U) with raw
LAPACK.gesv! calls; route both sites (small-norm and squaring branches)
through LinearSolve's default algorithm choice with the batched matrix
RHS instead. This picks the best factorization per size/type (e.g.
RecursiveFactorization for small matrices when loaded), and fails with
a retcode rather than a LAPACKException on pathological input.

Verified exp ≈ Base.exp for n = 5, 40, 120 at small and large norms
(both code paths) against LinearSolve 4.0.0.

Co-Authored-By: Chris Rackauckas <[email protected]>
Co-Authored-By: Claude Fable 5 <[email protected]>
Claude-Session: https://claude.ai/code/session_01EYp371jx6LurezUDhKcYRh
@ChrisRackauckas-Claude

Copy link
Copy Markdown
Contributor Author

Benchmarked honestly: as written, this is a performance regression in the hot regime — do not merge yet.

exponential! best-of-N, old (gesv!) vs new (solve(LinearProblem(...)) per call), Julia 1.12, LinearSolve 4.0.0:

n norm old new new + RF loaded
10 small 10 µs 17 µs 18 µs
30 small 81 µs 96 µs 88 µs
60 small 358 µs 561 µs 454 µs
60 large 534 µs 792 µs 568 µs
120 small 3.96 ms 3.52 ms 4.00 ms
250 small 9.96 ms 9.32 ms 9.80 ms

Slower up to ~1.7× for n ≤ 60 (the Krylov-Hm sizes this package hits hardest), roughly neutral at n ≥ 120. The cost is the per-call init (A/b copies + default-alg dispatch), which gesv!'s single in-place LAPACK call doesn't pay; RecursiveFactorization narrows but doesn't close the gap.

Next revision here: thread a cached init/solve! LinearSolve cache through alloc_mem so repeated calls skip init and copies, then re-benchmark. If cached solve still can't match gesv! at n ≤ 60, the honest conclusion is that this solve should stay on gesv! (or the conversion should be behavior-only for the retcode benefit with an explicit perf note).

ChrisRackauckas and others added 2 commits July 5, 2026 02:25
ForwardDiff 1.0 made iszero(::Dual) require zero partials as well, so
Dual(0, 1) fell through to intlog2, where ceil(Int, nx) dropped the
partials and 2^intlog2(0) = 2^64 overflowed to 0; the final
result^0 = one(result) then returned derivative 0 instead of 1 (Issue
42 testset). Gate the scaling on nx > 1 instead: values in (0, 1]
already produced s = 0, and Dual comparison uses the value only, so
zero-valued Duals with nonzero partials now take the s = 0 branch.
Verified against both ForwardDiff 0.10 and 1.x.

Widens the ForwardDiff test compat to "0.10.13, 1" accordingly; with
LinearSolve 4 in the dependency tree, PureKLU 1.1 forces ForwardDiff
>= 1, which is how this latent failure surfaced.

Co-Authored-By: Chris Rackauckas <[email protected]>
Co-Authored-By: Claude Fable 5 <[email protected]>
The per-call solve(LinearProblem(temp, X)) added init, matrix copies,
and a fresh factorization allocation on every exponential! call, which
measured slower than the raw LAPACK.gesv! it replaced for n <= 60.
Build a LinearSolve.init workspace once in alloc_mem with
workspace-owned n-by-n buffers and alias = LinearAliasSpecifier(
alias_A = true, alias_b = true), so each call just refills the buffers
and solve! refactorizes in place (lu!) with no O(n^2) copy
(LinearSolve >= 4.1). Both Pade branches (small-norm and
scaling-and-squaring) route through the shared helper, which restores
gesv!'s SingularException contract on a failed factorization.

Best-of-N per-call times (Julia 1.12, OpenBLAS, this machine), old
gesv! vs cached LinearSolve: ~1.3-1.8x slower for n <= 30 (fixed
solve!/property overhead of a few microseconds), parity at n = 60,
and 1.2-1.5x faster for n >= 120. Allocations on cache reuse stay
O(n) (e.g. 1.4 KB at n = 60).

Co-Authored-By: Chris Rackauckas <[email protected]>
Co-Authored-By: Claude Fable 5 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants