Conversation
…igate (was already slower than serial)
c86fdc9 to 1910009
Some changes in operator BLAS L1 require changing autograd to …
Greptile Summary

This WIP PR replaces the existing …

Key changes:
Issues to address before merging:
Confidence Score: 2/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
  A[forEach / forEachSerial / forEachStaged macros] --> B[higher_order_applymap.nim\nexports forEach]
  B --> C[tensor/math_functions.nim\nmelwise_mul, melwise_div, mreciprocal, etc.]
  B --> D[tensor/operators_broadcasted.nim\n+.= -.= *.= /.= ^.=]
  B --> E[tensor/operators_blas_l1.nim\n*= /=]
  B --> F[nn_primitives/nnp_activation.nim\nsigmoid, relu, tanh + backwards]
  B --> G[nn/optimizers/optimizers.nim\nSGD, SGDMomentum, Adam update]
  B --> H[ml/metrics/common_error_functions.nim\nsquared_error, relative_error, etc.]
  B --> I[ml/dimensionality_reduction/pca.nim\nexplained_variance]
  B --> J[stats/kde.nim\nnormalization]
  A --> K[nn_primitives/nnp_gru.nim]
  K --> K1[gru_cell_inference\nforEach over 7 tensor slices]
  K --> K2[gru_cell_forward\nforEach over 10 tensor slices]
  K --> K3[gru_cell_backward\nforEach for dh update]
  A --> L[nn_primitives/nnp_softmax.nim\nforEachSerial inside OMP loop]
  A --> M[nn_primitives/nnp_softmax_cross_entropy.nim\nforEachSerial inside OMP loop]
  N[laser/tensor/initialization.nim] --> N1[copyFromRaw: OMP removed\nserial copyMem]
  N --> N2[setZero: OMP removed\nserial zeroMem]
  O[nnp_sigmoid_cross_entropy.nim] --> O1[parallel sum removed\nsequential for loop]
  style K1 fill:#f99,stroke:#c33
  style K2 fill:#f99,stroke:#c33
  style O1 fill:#f99,stroke:#c33
  style N1 fill:#fc9,stroke:#c63
  style N2 fill:#fc9,stroke:#c63
```
```nim
result = 0.T
for xi, ti in zip(input, target):
  result += (-ti * xi + max(xi,0) + ln1p(exp(-abs(xi))) ) / T(input.shape[1])

# TODO - Parallel fused map-reduce, openmp issue - https://github.com/mratsim/Arraymancer/issues/485
# forEachStaged ii in input, ti in target:
#   before_loop:
#     var local_sum{.exportc.} = 0.T
#   in_loop:
```
Performance regression: parallel reduction replaced with sequential loop

The previous implementation used `map2_inline` with a parallel-friendly `sum:` combinator, giving OpenMP-accelerated summation. The new code replaces it with a plain sequential `for xi, ti in zip(input, target)` loop, removing all parallelism from this hot-path loss function.

The commented-out `forEachStaged` approach shows this was noticed, but the fallback is now strictly slower for large batches. Referencing issue #485 is helpful, but users relying on multi-core performance for training will see a regression here until the staged version is implemented. At minimum, either keep the old parallel code as the active path while the `forEachStaged` variant is developed, or make the TODO clearly call out that this is a temporary performance regression.
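For illustration, the numerically stable per-element term in the diff, and the equivalence between the sequential fallback and a staged partial-sum reduction (the pattern the commented-out `forEachStaged` sketches with `local_sum`), can be checked in a few lines of Python. This is a hedged sketch with arbitrary sample values and chunk size, not the library code:

```python
import math

def stable_bce_term(x, t):
    # Numerically stable -t*x + log(1 + exp(x)), written as in the diff:
    # -t*x + max(x, 0) + ln1p(exp(-|x|))
    return -t * x + max(x, 0.0) + math.log1p(math.exp(-abs(x)))

xs = [-3.0, -0.5, 0.0, 2.0, 10.0]
ts = [0.0, 1.0, 1.0, 0.0, 1.0]

# Sequential reduction (the PR's current fallback)
seq = sum(stable_bce_term(x, t) for x, t in zip(xs, ts))

# Staged reduction: per-chunk partial sums combined at the end,
# mirroring the per-thread local_sum pattern in the forEachStaged TODO
pairs = list(zip(xs, ts))
chunks = [pairs[i:i + 2] for i in range(0, len(pairs), 2)]
partials = [sum(stable_bce_term(x, t) for x, t in c) for c in chunks]
staged = sum(partials)

assert abs(seq - staged) < 1e-12
```

Because addition over floats is only approximately associative, the staged result can differ from the sequential one in the last bits, which is why the comparison uses a tolerance rather than exact equality.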
```nim
forEach wrx in W3x[_, sr],  # Wr*x
        wzx in W3x[_, sz],  # Wz*x
        wx in W3x[_, s],    # W*x
        urh in U3h[_, sr],  # Ur*h
        uzh in U3h[_, sz],  # Uz*h
        uh in U3h[_, s],    # U*h
        h in hidden:        # hidden state
  # Reset (r) gate and Update (z) gate
  let r = sigmoid(wrx + urh)
  let z = sigmoid(wzx + uzh)

  # Candidate hidden state ñ
  let n = tanh(wx + r * uh)

  # h' = (1 - z) *. ñ + z *. h
  h = (1-z) * n + z*h
```
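For reference, the element-wise math inside this `forEach` maps directly onto the standard GRU gate equations. A pure-Python sketch (random values stand in for the precomputed `W3x`/`U3h` projections; the variable names mirror the bindings above and are otherwise placeholders):

```python
import math, random

random.seed(0)
batch, hidden = 4, 8

def rand_mat():
    return [[random.gauss(0, 1) for _ in range(hidden)] for _ in range(batch)]

# Stand-ins for the W3x / U3h column-slices in the diff
wrx, wzx, wx = rand_mat(), rand_mat(), rand_mat()
urh, uzh, uh = rand_mat(), rand_mat(), rand_mat()
h = rand_mat()  # hidden state, updated in place

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

for b in range(batch):
    for j in range(hidden):
        r = sigmoid(wrx[b][j] + urh[b][j])      # reset gate
        z = sigmoid(wzx[b][j] + uzh[b][j])      # update gate
        n = math.tanh(wx[b][j] + r * uh[b][j])  # candidate hidden state ñ
        # h' = (1 - z) *. ñ + z *. h
        h[b][j] = (1 - z) * n + z * h[b][j]
```

Every iteration is independent of every other, which is why a fused element-wise `forEach` is semantically correct here; the performance problem discussed below is purely about how those independent writes are scheduled across threads.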
forEach over non-contiguous slices of same buffer amplifies false sharing

`W3x[_, sr]`, `W3x[_, sz]`, and `W3x[_, s]` are all non-contiguous column-slices of the same underlying `W3x` buffer (stride > 1 along axis 1). When `forEach` distributes these 7 tensors element-wise across OpenMP threads, adjacent rows of the same tensor land on the same cache line, and threads competing on adjacent rows of `hidden` (the only mutated tensor) will cause false sharing.

This is the root cause of the 13× parallel slowdown mentioned in the PR description: the fused `forEach` loop parallelises over batch_size × hidden_size elements, but the GRU equations produce a write pattern across `hidden` that can collide at the cache-line level between threads.

If parallel execution is needed short-term, consider looping over the batch dimension sequentially and using `forEachSerial` for the inner hidden-size dimension (matching the structure already used in softmax and `softmax_cross_entropy_backward`), or serialise the entire cell with `forEachSerial` until the polyhedral tiling approach is ready.
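One way to realise a partition along the batch dimension is to hand each worker a contiguous, disjoint block of rows, so that no two workers ever write to the same cache line of `hidden`. A minimal Python sketch of such a partition — illustrative only; the real fix would live in the Nim/OpenMP scheduling, and `row_partitions` is a hypothetical helper:

```python
def row_partitions(batch: int, workers: int) -> list[range]:
    # Split `batch` rows into `workers` contiguous, disjoint blocks.
    # Each worker then owns whole rows of the mutated tensor, so writes
    # from different workers never interleave within a cache line.
    base, extra = divmod(batch, workers)
    parts, start = [], 0
    for w in range(workers):
        stop = start + base + (1 if w < extra else 0)
        parts.append(range(start, stop))
        start = stop
    return parts

parts = row_partitions(7, 3)  # → [range(0, 3), range(3, 5), range(5, 7)]
```

Contrast this with element-wise interleaving over batch_size × hidden_size, where neighbouring elements of the same row can be claimed by different threads.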
This replaces the use of `mapX_inline` and `applyX_inline` by the `forEach` / `forEachContiguous` / `forEachParallel` / `forEachSerial` laser iterators.

This is particularly valuable for recurrent neural networks like GRU because we can implement the equations in a straightforward manner with meaningful variable names instead of magic `x`, `y`, `z`, and we would have needed an `apply11_inline` anyway (with the correct var/non-var parameters).

TODO:
Even then, RNNs are a type of iterative stencil computation that requires special care and is often used for polyhedral benchmarking, so another approach using tiling is probably needed to properly speed them up (see https://github.com/numforge/laser/blob/master/research/automatic_loop_nest_scheduling.md#polyhedral-approaches).
Not in scope: