Interesting, I might try this out on some of the other tasks on HighLoad. It would be nice if we could get some hard evidence that engaging the stream prefetcher by interleaving is actually the reason for the perf gain. Otherwise, I don’t buy explanations like these at face value, especially when virtual memory is involved.
One thing that makes it tricky to optimize on the platform is that you have limited ways to exfiltrate profiling data. Your only way of introspecting the system is to dump info to stderr.
If you have an Intel CPU, can’t you try it locally? You can just pipe /dev/urandom into your program.
Maybe run the program under perf (or whatever the equivalent is on macOS or Windows) to count cache misses?
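Something along these lines is what I had in mind; a sketch assuming a count-the-matching-bytes style task (the needle byte and the kernel are placeholders, not the actual task):

```rust
// Minimal local harness: read stdin into memory, time one pass over it,
// and report to stderr (which is also the only output channel the
// platform itself gives you).
use std::io::Read;
use std::time::Instant;

fn main() {
    let mut buf = Vec::new();
    std::io::stdin().read_to_end(&mut buf).unwrap();

    let start = Instant::now();
    // Placeholder kernel: count newline bytes.
    let count = buf.iter().filter(|&&b| b == b'\n').count();
    eprintln!("count = {count}, elapsed = {:?}", start.elapsed());
}
```

and then on Linux something like `head -c $((1 << 30)) /dev/urandom > input.bin` followed by `perf stat -e cache-misses,cache-references ./harness < input.bin`.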
You can, but you need to have the same hardware as the target platform (in this case a Haswell server). I have a Coffee Lake desktop processor, so whatever results I get locally will not be representative.
I was thinking of running both the streaming and the interleaved versions on the same machine to see if you could replicate the same speedup (even on a different microarchitecture).
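Concretely, a sketch of what I mean; “interleaved” here is my reading of the post (split the buffer in half and walk two streams in lockstep), not its actual code:

```rust
use std::time::Instant;

// One linear pass: the hardware sees a single forward stream.
fn count_streaming(buf: &[u8], needle: u8) -> usize {
    buf.iter().filter(|&&b| b == needle).count()
}

// Walk both halves of the buffer in lockstep, so the hardware sees two
// concurrent forward streams.
fn count_interleaved(buf: &[u8], needle: u8) -> usize {
    let (lo, hi) = buf.split_at(buf.len() / 2);
    let mut count = 0;
    for (&a, &b) in lo.iter().zip(hi) {
        count += (a == needle) as usize + (b == needle) as usize;
    }
    // `hi` is one element longer than `lo` when the length is odd.
    if buf.len() % 2 == 1 {
        count += (hi[hi.len() - 1] == needle) as usize;
    }
    count
}

fn main() {
    // Size is a placeholder; pick something well past the LLC. Every
    // page is written first so untouched zero pages don't skew the
    // measurement.
    let mut buf = vec![0u8; 1 << 28];
    for (i, b) in buf.iter_mut().enumerate() {
        *b = i as u8;
    }
    for (name, f) in [
        ("streaming", count_streaming as fn(&[u8], u8) -> usize),
        ("interleaved", count_interleaved as fn(&[u8], u8) -> usize),
    ] {
        let start = Instant::now();
        let count = f(&buf, b'\n');
        eprintln!("{name}: count = {count}, elapsed = {:?}", start.elapsed());
    }
}
```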
That said, I looked up the phrase mentioned in the blog post in the January 2023 version of the Intel Optimization Manual, and I could only find it under “EARLIER GENERATIONS OF INTEL® 64” for Sandy Bridge.
Not sure it would apply to Skylake/Coffee Lake.
Good point, this piqued my interest. I know what I’ll be doing next weekend :P
I’d love to benchmark this against my bytecount crate at some point (but alas! I currently lack the time). I think you can win a tiny bit by better amortizing the horizontal additions.
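To spell out the amortization I mean, here is a sketch with AVX2 intrinsics; this is the general trick as I’d write it down, not bytecount’s actual code:

```rust
// Count `needle` bytes 32 at a time, amortizing the horizontal adds.
// _mm256_cmpeq_epi8 yields 0xFF (i.e. -1) per matching byte, so
// subtracting the mask from a byte-wise accumulator adds 1 per match.
// A byte lane can only hold 255 of those, so only once every 255
// vectors do we flush the accumulator into four u64 lanes via
// _mm256_sad_epu8 (SAD against zero = horizontal add of 8 bytes).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn count_avx2(buf: &[u8], needle: u8) -> usize {
    use std::arch::x86_64::*;

    let needles = _mm256_set1_epi8(needle as i8);
    let zero = _mm256_setzero_si256();
    let mut totals = zero; // four u64 partial sums
    let mut acc = zero;    // 32 byte-wide counters
    let mut pending = 0;   // vectors since the last flush

    let mut chunks = buf.chunks_exact(32);
    for chunk in &mut chunks {
        let v = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
        acc = _mm256_sub_epi8(acc, _mm256_cmpeq_epi8(v, needles));
        pending += 1;
        if pending == 255 {
            totals = _mm256_add_epi64(totals, _mm256_sad_epu8(acc, zero));
            acc = zero;
            pending = 0;
        }
    }
    totals = _mm256_add_epi64(totals, _mm256_sad_epu8(acc, zero));

    // Sum the four u64 lanes and handle the scalar tail.
    let mut lanes = [0u64; 4];
    _mm256_storeu_si256(lanes.as_mut_ptr() as *mut __m256i, totals);
    lanes.iter().sum::<u64>() as usize
        + chunks.remainder().iter().filter(|&&b| b == needle).count()
}
```

(You’d gate the call behind `is_x86_feature_detected!("avx2")`.) The expensive horizontal reduction then runs once every 255 vectors instead of in the hot loop.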