Open
Description
The prefetch in the ARM implementation of Stream (https://github.com/google/highway/blob/master/hwy/ops/arm_neon-inl.h#L4061) can degrade throughput. On a Jetson Nano, I have a memset-like operation that achieves 11 GB/s with Store but drops to ~3.5 GB/s with Stream unless I remove the prefetch. Can the prefetch be removed or made optional?
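To illustrate what "optional" could mean, here is a minimal standalone sketch. It does not use Highway's actual code: `SKETCH_STREAM_PREFETCH`, `StreamSketch` and `FillSketch` are hypothetical names, and the sketch assumes the prefetch is a write hint issued before an ordinary 16-byte store. The prefetch is compiled in only when the macro is set to 1.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical opt-in switch (not an existing Highway macro).
// Defaults to 0, i.e. no prefetch is issued.
#ifndef SKETCH_STREAM_PREFETCH
#define SKETCH_STREAM_PREFETCH 0
#endif

// Stand-in for a 16-byte Stream: optional write-prefetch hint, then a store.
// Assumption: this mirrors the general shape of the code in question.
inline void StreamSketch(const uint8_t* src, uint8_t* dst) {
#if SKETCH_STREAM_PREFETCH && (defined(__GNUC__) || defined(__clang__))
  // rw = 1 (prefetch for write), locality = 0 (no temporal locality).
  __builtin_prefetch(dst, 1, 0);
#endif
  std::memcpy(dst, src, 16);
}

// Memset-like loop: whole 16-byte blocks via StreamSketch, then a byte tail.
inline void FillSketch(uint8_t value, uint8_t* dst, size_t bytes) {
  uint8_t block[16];
  std::memset(block, value, sizeof(block));
  size_t i = 0;
  for (; i + 16 <= bytes; i += 16) StreamSketch(block, dst + i);
  for (; i < bytes; ++i) dst[i] = value;
}
```

Building with `-DSKETCH_STREAM_PREFETCH=1` would restore the prefetch, so a comparison like the Store vs. Stream one above could be run from a single source file.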