
[WIP] Laser forEach iterators#479

Draft
mratsim wants to merge 16 commits into master from laser-iterators

Conversation

@mratsim
Owner

@mratsim mratsim commented Dec 12, 2020

This replaces the use of mapX_inline and applyX_inline with the forEach / forEachContiguous / forEachParallel / forEachSerial Laser iterators.

This is particularly valuable for recurrent neural networks like GRU, because we can implement the equations in a straightforward manner with meaningful variable names instead of magic x, y, z, and we would have needed an apply11_inline anyway (with the correct var/non-var parameters).

TODO:

  • There is a significant parallel performance regression on GRU when running test_nnp_gru. While before this PR the parallel version was "only" 2x slower than serial, it is now 13x slower than serial, which probably signals a false sharing issue.
    Even then, RNNs are a type of iterative stencil computation that requires special care and is often used for polyhedral benchmarking, so another approach using tiling is probably needed to speed those up properly. (see https://github.com/numforge/laser/blob/master/research/automatic_loop_nest_scheduling.md#polyhedral-approaches)

Not in scope:

  • using forEach for backward propagation in GRU: this is a headache-inducing refactoring

@mratsim
Owner Author

mratsim commented Jan 3, 2021

Some changes in the BLAS L1 operators require changing autograd to +.=, because the += based on apply2_inline was somehow doing an implicit broadcast. But then we have cascading issues that require changing +.= to use broadcast2, plus broadcast2 fixes, and then the Stack autograd layer fails its tests. See 2929cb5 in the laser-iterators-stashed branch.

@greptile-apps

greptile-apps bot commented Mar 16, 2026

Greptile Summary

This WIP PR replaces the existing applyX_inline / mapX_inline family of higher-order tensor operations with the Laser forEach / forEachSerial / forEachStaged iterators across ~20 files. The motivating benefit — demonstrated clearly in nnp_gru.nim — is the ability to fuse multi-tensor element-wise loops with meaningful variable names rather than cryptic x, y, z placeholders, and to avoid needing bespoke apply4_inline / apply10_inline variants.

Key changes:

  • higher_order_applymap.nim now imports and re-exports foreach / foreach_staged, making forEach available throughout the tensor module.
  • All in-place scalar and element-wise operators (+=, -=, *=, /=, ^.=, etc.) in operators_broadcasted.nim and operators_blas_l1.nim are ported to forEach.
  • Activation functions, optimizer update loops (SGD, SGDMomentum, Adam), error metric functions, and GRU cell forward/backward passes are all migrated.
  • nnp_gru.nim gains a clean, readable fused GRU cell loop (7–10 tensors in a single forEach), and the previous n_tmp copy-back workaround is correctly eliminated.
  • laser/tensor/initialization.nim disables OpenMP in copyFromRaw and setZero to avoid contention when these are called inside an outer OMP region (referencing issue #485, "OpenMP barrier slowness in neural network examples with high core count").
  • nnp_sigmoid_cross_entropy.nim reverts its parallel map2_inline + sum: reduction to a sequential for loop pending a working forEachStaged implementation.

Issues to address before merging:

  • The acknowledged 13× parallel slowdown in GRU (test_nnp_gru) stems from forEach distributing work element-wise across all batch_size × hidden_size elements of 7–10 tensor arguments, many of which are non-contiguous column slices of the same buffer. Adjacent rows in hidden land on the same cache line, causing severe false sharing. Using forEachSerial for the inner loop (as done in softmax) would recover serial performance until a tiled approach is ready.
  • sigmoid_cross_entropy now runs sequentially, a direct performance regression for training workloads.
  • copyFromRaw/setZero unconditionally lose their OpenMP path; an omp_in_parallel() guard would preserve performance for standalone use.
  • optimizerAdam is changed from func to proc without a comment; the reason is unclear since the function body does not call forEach.

Confidence Score: 2/5

  • Not safe to merge: acknowledged 13× GRU parallel regression, reverted parallel sigmoid cross-entropy, and disabled OpenMP in core initialization paths are significant regressions.
  • The logical migration of forEach is correct and the code reads much better, but there are three unresolved performance regressions (GRU parallel, sigmoid_cross_entropy sequential, initialization OMP disabled) that the PR itself marks as WIP. Merging in the current state would degrade performance for real training workloads.
  • src/arraymancer/nn_primitives/nnp_gru.nim (false sharing / 13× slowdown), src/arraymancer/nn_primitives/nnp_sigmoid_cross_entropy.nim (sequential fallback), src/arraymancer/laser/tensor/initialization.nim (OpenMP removed unconditionally)

Important Files Changed

  • src/arraymancer/nn_primitives/nnp_gru.nim: Core GRU forward/backward rewritten using forEach for loop fusion; logically correct but introduces a severe 13× parallel slowdown due to false sharing across non-contiguous tensor slices, and correctly removes the n_tmp workaround.
  • src/arraymancer/nn_primitives/nnp_sigmoid_cross_entropy.nim: Parallel map2_inline/sum: reduction replaced with a sequential for loop, introducing a performance regression; the forEachStaged variant is commented out as a TODO pending an OpenMP fix.
  • src/arraymancer/laser/tensor/initialization.nim: OpenMP-parallel copyFromRaw and setZero unconditionally replaced with serial copyMem/zeroMem to avoid contention when called inside OMP regions; hurts standalone bulk allocation performance.
  • src/arraymancer/nn/optimizers/optimizers.nim: SGD, SGDMomentum, and Adam optimizer update loops cleanly migrated from apply2_inline/apply3_inline to forEach with meaningful variable names; optimizerAdam downgraded from func to proc without explanation.
  • src/arraymancer/tensor/higher_order_applymap.nim: Adds foreach/foreach_staged imports and exports, and replaces the apply(f: proc(x: var T)) OpenMP loop with forEach; semantically equivalent change.
  • src/arraymancer/nn_primitives/nnp_activation.nim: All activation functions (sigmoid, relu, tanh) and their backward passes migrated from map_inline/apply_inline to forEach with descriptive variable names; semantically equivalent.
  • src/arraymancer/tensor/math_functions.nim: In-place math operations (melwise_mul, melwise_div, mreciprocal, mnegate, mabs, mclamp) migrated to forEach; complex reciprocal correctly split into a forEach x in t, r in result pattern.
  • src/arraymancer/tensor/operators_broadcasted.nim: Broadcast scalar operators (+.=, -.=, *.=, /.=, ^.=) migrated from apply_inline to forEach; all semantically equivalent.
  • src/arraymancer/laser/strided_iteration/foreach_common.nim: New shared infrastructure for the forEach macros; handles tensor aliasing, var/let detection, raw pointer extraction, and shape assertion for multi-tensor iteration.
  • src/arraymancer/stats/kde.nim: Single apply_inline call in kde normalization replaced with forEach x in result: x *= normFactor; semantically identical.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[forEach / forEachSerial / forEachStaged macros] --> B[higher_order_applymap.nim\nexports forEach]
    B --> C[tensor/math_functions.nim\nmelwise_mul, melwise_div, mreciprocal, etc.]
    B --> D[tensor/operators_broadcasted.nim\n+.= -.= *.= /.= ^.=]
    B --> E[tensor/operators_blas_l1.nim\n*= /=]
    B --> F[nn_primitives/nnp_activation.nim\nsigmoid, relu, tanh + backwards]
    B --> G[nn/optimizers/optimizers.nim\nSGD, SGDMomentum, Adam update]
    B --> H[ml/metrics/common_error_functions.nim\nsquared_error, relative_error, etc.]
    B --> I[ml/dimensionality_reduction/pca.nim\nexplained_variance]
    B --> J[stats/kde.nim\nnormalization]

    A --> K[nn_primitives/nnp_gru.nim]
    K --> K1[gru_cell_inference\nforEach over 7 tensor slices]
    K --> K2[gru_cell_forward\nforEach over 10 tensor slices]
    K --> K3[gru_cell_backward\nforEach for dh update]

    A --> L[nn_primitives/nnp_softmax.nim\nforEachSerial inside OMP loop]
    A --> M[nn_primitives/nnp_softmax_cross_entropy.nim\nforEachSerial inside OMP loop]

    N[laser/tensor/initialization.nim] --> N1[copyFromRaw: OMP removed\nserial copyMem]
    N --> N2[setZero: OMP removed\nserial zeroMem]

    O[nnp_sigmoid_cross_entropy.nim] --> O1[parallel sum removed\nsequential for loop]

    style K1 fill:#f99,stroke:#c33
    style K2 fill:#f99,stroke:#c33
    style O1 fill:#f99,stroke:#c33
    style N1 fill:#fc9,stroke:#c63
    style N2 fill:#fc9,stroke:#c63

Comments Outside Diff (2)

  1. src/arraymancer/laser/tensor/initialization.nim, line 145-163 (link)

    OpenMP disabled in copyFromRaw and setZero without a clear migration path

    Both copyFromRaw (line ~145) and setZero (line ~172) have their OpenMP-parallel chunks replaced with serial copyMem/zeroMem calls. While the comment correctly identifies that calling these inside an OpenMP parallel region causes contention, this also serializes standalone bulk tensor allocation and zeroing, which directly hurts performance for any code that allocates tensors outside a parallel region.

    Consider making the serial path conditional on whether the caller is already inside an OMP region (e.g., via an omp_in_parallel() guard), rather than disabling parallelism unconditionally:

    when defined(openmp):
      if omp_in_parallel() == 0:
        # not yet inside a parallel region: keep the full parallel version
        omp_parallel_chunks(...): copyMem(...)
      else:
        # nested inside an outer OMP region: fall back to serial
        copyMem(...)
    else:
      copyMem(...)

    This preserves performance for the common single-threaded allocation path while avoiding contention when nested inside an OMP region.

  2. src/arraymancer/nn/optimizers/optimizers.nim, line 234-242 (link)

    optimizerAdam changed from func to proc without explanation

    optimizerAdam was previously declared as func, which gives the Nim compiler a noSideEffect guarantee that enables additional optimisations and improves readability. The change to proc removes this guarantee. The function body does not call forEach or any new side-effectful API — it only builds the result object — so the reason for the downgrade is unclear.

    If the change was forced by a compiler error (e.g., zeros_like or fields being non-noSideEffect), please add a comment explaining this. Otherwise, restoring the func keyword would be preferable:
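A minimal sketch of what the restoration could look like, assuming the side-effect report comes from a callee rather than the body itself (the `{.cast(noSideEffect).}:` block is standard Nim; the signature and body are elided placeholders, not the real code from optimizers.nim):

    # Hypothetical sketch: keep the noSideEffect guarantee; if a callee
    # (e.g. zeros_like) spuriously reports side effects, silence only that part.
    func optimizerAdam(...): ... =
      {.cast(noSideEffect).}:
        # build and return the Adam result object as before
        ...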

Last reviewed commit: d2d625a

Comment on lines +44 to +52
result = 0.T
for xi, ti in zip(input, target):
result += (-ti * xi + max(xi,0) + ln1p(exp(-abs(xi))) ) / T(input.shape[1])

# TODO - Parallel fused map-reduce, openmp issue - https://github.com/mratsim/Arraymancer/issues/485
# forEachStaged ii in input, ti in target:
# before_loop:
# var local_sum{.exportc.} = 0.T
# in_loop:

Performance regression: parallel reduction replaced with sequential loop

The previous implementation used map2_inline with a parallel-friendly sum: combinator, giving OpenMP-accelerated summation. The new code replaces it with a plain sequential for xi, ti in zip(input, target) loop, removing all parallelism from this hot-path loss function.

The commented-out forEachStaged approach shows this was noticed, but the fallback is now strictly slower for large batches. Referencing issue #485 is helpful, but users relying on multi-core performance for training will see a regression here until the staged version is implemented. At minimum, the old parallel code should be kept as the active path while the forEachStaged variant is developed, or the TODO should clearly call out that this is a temporary performance regression.
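For reference, one possible completion of the commented-out staged reduction, assuming the section names of laser's forEachStaged (before_loop / in_loop / after_loop with an omp_critical merge); this is a sketch only and remains untested until issue #485 is resolved:

    # Sketch: per-thread partial sums, merged under a critical section.
    # Dividing once at the end is equivalent to dividing each term by
    # input.shape[1] as the sequential loop does.
    forEachStaged xi in input, ti in target:
      before_loop:
        var local_sum{.exportc.} = T(0)
      in_loop:
        local_sum += -ti * xi + max(xi, 0) + ln1p(exp(-abs(xi)))
      after_loop:
        omp_critical:
          result += local_sum
    result /= T(input.shape[1])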

Comment on lines +82 to +97
forEach wrx in W3x[_, sr], # Wr*x
wzx in W3x[_, sz], # Wz*x
wx in W3x[_, s], # W*x
urh in U3h[_, sr], # Ur*h
uzh in U3h[_, sz], # Uz*h
uh in U3h[_, s], # U*h
h in hidden: # hidden state
# Reset (r) gate and Update (z) gate
let r = sigmoid(wrx + urh)
let z = sigmoid(wzx + uzh)

# Candidate hidden state ñ
let n = tanh(wx + r * uh)

# h' = (1 - z) *. ñ + z *. h
h = (1-z) * n + z*h

forEach over non-contiguous slices of the same buffer amplifies false sharing

W3x[_, sr], W3x[_, sz], and W3x[_, s] are all non-contiguous column-slices of the same underlying W3x buffer (stride > 1 along axis 1). When forEach distributes these 7 tensors element-wise across OpenMP threads, adjacent rows of the same tensor land on the same cache line, and threads competing on adjacent rows in hidden (the only mutated tensor) will cause false sharing.

This is the root cause of the 13× parallel slowdown mentioned in the PR description: the fused forEach loop parallelises over batch_size × hidden_size elements, but the GRU equations produce a write pattern across hidden that can collide at the cache-line level between threads.

If parallel execution is needed short-term, consider looping over the batch dimension sequentially and using forEachSerial for the inner hidden-size dimension (matching the structure already used in softmax and softmax_cross_entropy_backward), or serialise the entire cell with forEachSerial until the polyhedral tiling approach is ready.
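The short-term restructuring suggested above might look like the following sketch (variable names are taken from the diff; the per-row slicing and the `||` OpenMP iterator are assumptions and have not been tested against this PR):

    # Sketch: parallelise over the batch (row) dimension only, so each thread
    # owns whole rows of `hidden` and writes never share a cache line; the
    # hidden-size loop stays serial per thread, mirroring nnp_softmax.nim.
    for b in 0 || (batch_size - 1):
      forEachSerial wrx in W3x[b, sr], wzx in W3x[b, sz], wx in W3x[b, s],
                    urh in U3h[b, sr], uzh in U3h[b, sz], uh in U3h[b, s],
                    h in hidden[b, _]:
        let r = sigmoid(wrx + urh)
        let z = sigmoid(wzx + uzh)
        let n = tanh(wx + r * uh)   # candidate hidden state ñ
        h = (1 - z) * n + z * h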
