
[WIP] Laser forEach iterators#479

Draft
mratsim wants to merge 16 commits into master from laser-iterators

Conversation

@mratsim
Owner

@mratsim mratsim commented Dec 12, 2020

This replaces the use of mapX_inline and applyX_inline with the forEach / forEachContiguous / forEachParallel / forEachSerial Laser iterators.

This is particularly valuable for recurrent neural networks like GRU, because we can implement the equations in a straightforward manner with meaningful variable names instead of magic x, y, z, and we would have needed an apply11_inline anyway (with the correct var/non-var parameters).

TODO:

  • There is a significant parallel performance regression on GRU when running test_nnp_gru. While before this PR the parallel version was "only" 2x slower than serial, it is now 13x slower than serial, which probably signals a false sharing issue.
    Even then, RNNs are a type of iterative stencil computation that requires special care and is often used for polyhedral benchmarking, so another approach using tiling is probably needed to speed those up properly. (see https://github.com/numforge/laser/blob/master/research/automatic_loop_nest_scheduling.md#polyhedral-approaches)

Not in scope:

  • using forEach for backward propagation in GRU: this is a headache-inducing refactoring

@mratsim
Owner Author

mratsim commented Jan 3, 2021

Some changes in the BLAS L1 operators require changing autograd to +.=, because the += based on apply2_inline was somehow doing an implicit broadcast. But then we have cascading issues that require changing +.= to use broadcast2, plus broadcast2 fixes, and then the Stack autograd layer fails its tests. See 2929cb5 in the laser-iterators-stashed branch.

@greptile-apps

greptile-apps bot commented Mar 16, 2026

Greptile Summary

This WIP PR replaces the existing applyX_inline / mapX_inline family of higher-order tensor operations with the Laser forEach / forEachSerial / forEachStaged iterators across ~20 files. The motivating benefit — demonstrated clearly in nnp_gru.nim — is the ability to fuse multi-tensor element-wise loops with meaningful variable names rather than cryptic x, y, z placeholders, and to avoid needing bespoke apply4_inline / apply10_inline variants.

Key changes:

  • higher_order_applymap.nim now imports and re-exports foreach / foreach_staged, making forEach available throughout the tensor module.
  • All in-place scalar and element-wise operators (+=, -=, *=, /=, ^.=, etc.) in operators_broadcasted.nim and operators_blas_l1.nim are ported to forEach.
  • Activation functions, optimizer update loops (SGD, SGDMomentum, Adam), error metric functions, and GRU cell forward/backward passes are all migrated.
  • nnp_gru.nim gains a clean, readable fused GRU cell loop (7–10 tensors in a single forEach), and the previous n_tmp copy-back workaround is correctly eliminated.
  • laser/tensor/initialization.nim disables OpenMP in copyFromRaw and setZero to avoid contention when these are called inside an outer OMP region (referencing issue #485, "OpenMP barrier slowness in neural network examples with high core count").
  • nnp_sigmoid_cross_entropy.nim reverts its parallel map2_inline + sum: reduction to a sequential for loop pending a working forEachStaged implementation.

Issues to address before merging:

  • The acknowledged 13× parallel slowdown in GRU (test_nnp_gru) stems from forEach distributing work element-wise across all batch_size × hidden_size elements of 7–10 tensor arguments, many of which are non-contiguous column slices of the same buffer. Adjacent rows in hidden land on the same cache line, causing severe false sharing. Using forEachSerial for the inner loop (as done in softmax) would recover serial performance until a tiled approach is ready.
  • sigmoid_cross_entropy now runs sequentially, a direct performance regression for training workloads.
  • copyFromRaw/setZero unconditionally lose their OpenMP path; an omp_in_parallel() guard would preserve performance for standalone use.
  • optimizerAdam is changed from func to proc without a comment; the reason is unclear since the function body does not call forEach.

Confidence Score: 2/5

  • Not safe to merge: acknowledged 13× GRU parallel regression, reverted parallel sigmoid cross-entropy, and disabled OpenMP in core initialization paths are significant regressions.
  • The logical migration of forEach is correct and the code reads much better, but there are three unresolved performance regressions (GRU parallel, sigmoid_cross_entropy sequential, initialization OMP disabled) that the PR itself marks as WIP. Merging in the current state would degrade performance for real training workloads.
  • src/arraymancer/nn_primitives/nnp_gru.nim (false sharing / 13× slowdown), src/arraymancer/nn_primitives/nnp_sigmoid_cross_entropy.nim (sequential fallback), src/arraymancer/laser/tensor/initialization.nim (OpenMP removed unconditionally)

Important Files Changed

  • src/arraymancer/nn_primitives/nnp_gru.nim: Core GRU forward/backward rewritten using forEach for loop fusion; logically correct but introduces a severe 13× parallel slowdown due to false sharing across non-contiguous tensor slices, and correctly removes the n_tmp workaround.
  • src/arraymancer/nn_primitives/nnp_sigmoid_cross_entropy.nim: Parallel map2_inline/sum: reduction replaced with a sequential for loop, introducing a performance regression; the forEachStaged variant is commented out as a TODO pending an OpenMP fix.
  • src/arraymancer/laser/tensor/initialization.nim: OpenMP-parallel copyFromRaw and setZero unconditionally replaced with serial copyMem/zeroMem to avoid contention when called inside OMP regions; hurts standalone bulk allocation performance.
  • src/arraymancer/nn/optimizers/optimizers.nim: SGD, SGDMomentum, and Adam optimizer update loops cleanly migrated from apply2_inline/apply3_inline to forEach with meaningful variable names; optimizerAdam downgraded from func to proc without explanation.
  • src/arraymancer/tensor/higher_order_applymap.nim: Adds foreach/foreach_staged imports and exports, and replaces the apply(f: proc(x: var T)) OpenMP loop with forEach; semantically equivalent change.
  • src/arraymancer/nn_primitives/nnp_activation.nim: All activation functions (sigmoid, relu, tanh) and their backward passes migrated from map_inline/apply_inline to forEach with descriptive variable names; semantically equivalent.
  • src/arraymancer/tensor/math_functions.nim: In-place math operations (melwise_mul, melwise_div, mreciprocal, mnegate, mabs, mclamp) migrated to forEach; complex reciprocal correctly split into a forEach x in t, r in result pattern.
  • src/arraymancer/tensor/operators_broadcasted.nim: Broadcast scalar operators (+.=, -.=, *.=, /.=, ^.=) migrated from apply_inline to forEach; all semantically equivalent.
  • src/arraymancer/laser/strided_iteration/foreach_common.nim: New shared infrastructure for the forEach macros; handles tensor aliasing, var/let detection, raw pointer extraction, and shape assertion for multi-tensor iteration.
  • src/arraymancer/stats/kde.nim: Single apply_inline call in kde normalization replaced with forEach x in result: x *= normFactor; semantically identical.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[forEach / forEachSerial / forEachStaged macros] --> B[higher_order_applymap.nim\nexports forEach]
    B --> C[tensor/math_functions.nim\nmelwise_mul, melwise_div, mreciprocal, etc.]
    B --> D[tensor/operators_broadcasted.nim\n+.= -.= *.= /.= ^.=]
    B --> E[tensor/operators_blas_l1.nim\n*= /=]
    B --> F[nn_primitives/nnp_activation.nim\nsigmoid, relu, tanh + backwards]
    B --> G[nn/optimizers/optimizers.nim\nSGD, SGDMomentum, Adam update]
    B --> H[ml/metrics/common_error_functions.nim\nsquared_error, relative_error, etc.]
    B --> I[ml/dimensionality_reduction/pca.nim\nexplained_variance]
    B --> J[stats/kde.nim\nnormalization]

    A --> K[nn_primitives/nnp_gru.nim]
    K --> K1[gru_cell_inference\nforEach over 7 tensor slices]
    K --> K2[gru_cell_forward\nforEach over 10 tensor slices]
    K --> K3[gru_cell_backward\nforEach for dh update]

    A --> L[nn_primitives/nnp_softmax.nim\nforEachSerial inside OMP loop]
    A --> M[nn_primitives/nnp_softmax_cross_entropy.nim\nforEachSerial inside OMP loop]

    N[laser/tensor/initialization.nim] --> N1[copyFromRaw: OMP removed\nserial copyMem]
    N --> N2[setZero: OMP removed\nserial zeroMem]

    O[nnp_sigmoid_cross_entropy.nim] --> O1[parallel sum removed\nsequential for loop]

    style K1 fill:#f99,stroke:#c33
    style K2 fill:#f99,stroke:#c33
    style O1 fill:#f99,stroke:#c33
    style N1 fill:#fc9,stroke:#c63
    style N2 fill:#fc9,stroke:#c63

Comments Outside Diff (2)

  1. src/arraymancer/laser/tensor/initialization.nim, line 145-163 (link)

    OpenMP disabled in copyFromRaw and setZero without a clear migration path

    Both copyFromRaw (line ~145) and setZero (line ~172) have their OpenMP-parallel chunks replaced with serial copyMem/zeroMem calls. While the comment correctly identifies that calling these inside an OpenMP parallel region causes contention, this also serializes standalone bulk tensor allocation and zeroing, which directly hurts performance for any code that allocates tensors outside a parallel region.

    Consider making the serial path conditional on whether the caller is already inside an OMP region (e.g., via an omp_in_parallel() guard), rather than disabling parallelism unconditionally:

    when defined(openmp):
      if omp_in_parallel() == 0:
        # not yet inside a parallel region: keep the full parallel version
        omp_parallel_chunks(...): copyMem(...)
      else:
        # nested inside an outer OMP region: fall back to serial
        copyMem(...)
    else:
      copyMem(...)

    This preserves performance for the common single-threaded allocation path while avoiding contention when nested inside an OMP region.

  2. src/arraymancer/nn/optimizers/optimizers.nim, line 234-242 (link)

    optimizerAdam changed from func to proc without explanation

    optimizerAdam was previously declared as func, which gives the Nim compiler a noSideEffect guarantee that enables additional optimisations and improves readability. The change to proc removes this guarantee. The function body does not call forEach or any new side-effectful API — it only builds the result object — so the reason for the downgrade is unclear.

    If the change was forced by a compiler error (e.g., zeros_like or fields being non-noSideEffect), please add a comment explaining this. Otherwise, restoring the func keyword would be preferable:
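A minimal sketch of what the restoration could look like, assuming the side-effect report comes from a callee rather than the body itself (the `{.cast(noSideEffect).}:` block is standard Nim; the signature and body are elided placeholders, not the real code from optimizers.nim):

    # Hypothetical sketch: keep the noSideEffect guarantee; if a callee
    # (e.g. zeros_like) spuriously reports side effects, silence only that part.
    func optimizerAdam(...): ... =
      {.cast(noSideEffect).}:
        # build and return the Adam result object as before
        ...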

Last reviewed commit: d2d625a

Comment on lines +44 to +52
result = 0.T
for xi, ti in zip(input, target):
result += (-ti * xi + max(xi,0) + ln1p(exp(-abs(xi))) ) / T(input.shape[1])

# TODO - Parallel fused map-reduce, openmp issue - https://github.com/mratsim/Arraymancer/issues/485
# forEachStaged ii in input, ti in target:
# before_loop:
# var local_sum{.exportc.} = 0.T
# in_loop:

Performance regression: parallel reduction replaced with sequential loop

The previous implementation used map2_inline with a parallel-friendly sum: combinator, giving OpenMP-accelerated summation. The new code replaces it with a plain sequential for xi, ti in zip(input, target) loop, removing all parallelism from this hot-path loss function.

The commented-out forEachStaged approach shows this was noticed, but the fallback is now strictly slower for large batches. Referencing issue #485 is helpful, but users relying on multi-core performance for training will see a regression here until the staged version is implemented. At minimum, the old parallel code should be kept as the active path while the forEachStaged variant is developed, or the TODO should clearly call out that this is a temporary performance regression.
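For reference, one possible completion of the commented-out staged reduction, assuming the section names of laser's forEachStaged (before_loop / in_loop / after_loop with an omp_critical merge); this is a sketch only and remains untested until issue #485 is resolved:

    # Sketch: per-thread partial sums, merged under a critical section.
    # Dividing once at the end is equivalent to dividing each term by
    # input.shape[1] as the sequential loop does.
    forEachStaged xi in input, ti in target:
      before_loop:
        var local_sum{.exportc.} = T(0)
      in_loop:
        local_sum += -ti * xi + max(xi, 0) + ln1p(exp(-abs(xi)))
      after_loop:
        omp_critical:
          result += local_sum
    result /= T(input.shape[1])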

Comment on lines +82 to +97
forEach wrx in W3x[_, sr], # Wr*x
wzx in W3x[_, sz], # Wz*x
wx in W3x[_, s], # W*x
urh in U3h[_, sr], # Ur*h
uzh in U3h[_, sz], # Uz*h
uh in U3h[_, s], # U*h
h in hidden: # hidden state
# Reset (r) gate and Update (z) gate
let r = sigmoid(wrx + urh)
let z = sigmoid(wzx + uzh)

# Candidate hidden state ñ
let n = tanh(wx + r * uh)

# h' = (1 - z) *. ñ + z *. h
h = (1-z) * n + z*h

forEach over non-contiguous slices of the same buffer amplifies false sharing

W3x[_, sr], W3x[_, sz], and W3x[_, s] are all non-contiguous column-slices of the same underlying W3x buffer (stride > 1 along axis 1). When forEach distributes these 7 tensors element-wise across OpenMP threads, adjacent rows of the same tensor land on the same cache line, and threads competing on adjacent rows in hidden (the only mutated tensor) will cause false sharing.

This is the root cause of the 13× parallel slowdown mentioned in the PR description: the fused forEach loop parallelises over batch_size × hidden_size elements, but the GRU equations produce a write pattern across hidden that can collide at the cache-line level between threads.

If parallel execution is needed short-term, consider looping over the batch dimension sequentially and using forEachSerial for the inner hidden-size dimension (matching the structure already used in softmax and softmax_cross_entropy_backward), or serialise the entire cell with forEachSerial until the polyhedral tiling approach is ready.
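The short-term restructuring suggested above might look like the following sketch (variable names are taken from the diff; the per-row slicing and the `||` OpenMP iterator are assumptions and have not been tested against this PR):

    # Sketch: parallelise over the batch (row) dimension only, so each thread
    # owns whole rows of `hidden` and writes never share a cache line; the
    # hidden-size loop stays serial per thread, mirroring nnp_softmax.nim.
    for b in 0 || (batch_size - 1):
      forEachSerial wrx in W3x[b, sr], wzx in W3x[b, sz], wx in W3x[b, s],
                    urh in U3h[b, sr], uzh in U3h[b, sz], uh in U3h[b, s],
                    h in hidden[b, _]:
        let r = sigmoid(wrx + urh)
        let z = sigmoid(wzx + uzh)
        let n = tanh(wx + r * uh)   # candidate hidden state ñ
        h = (1 - z) * n + z * h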
