Interesting, I might try this out on some of the other tasks on HighLoad. It would be nice if we could get some hard evidence that engaging the stream prefetcher by interleaving is actually the reason for the perf gain. Otherwise, I don’t buy explanations like these at face value, especially when virtual memory is involved.
One thing that makes it tricky to optimize on the platform is that you have limited ways to exfiltrate profiling data. Your only way of introspecting the system is to dump info to stderr.
If you have an Intel CPU, can’t you try it locally? You can just pipe /dev/urandom into your program.
Maybe run the program under perf (or whatever the equivalent is on macOS or Windows) to count cache misses?
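Something along these lines is what I had in mind; a sketch assuming a count-the-matching-bytes style task (the needle byte and the kernel are placeholders, not the actual task):

```rust
// Minimal local harness: read stdin into memory, time one pass over it,
// and report to stderr (which is also the only output channel the
// platform itself gives you).
use std::io::Read;
use std::time::Instant;

fn main() {
    let mut buf = Vec::new();
    std::io::stdin().read_to_end(&mut buf).unwrap();

    let start = Instant::now();
    // Placeholder kernel: count newline bytes.
    let count = buf.iter().filter(|&&b| b == b'\n').count();
    eprintln!("count = {count}, elapsed = {:?}", start.elapsed());
}
```

and then on Linux something like `head -c $((1 << 30)) /dev/urandom > input.bin` followed by `perf stat -e cache-misses,cache-references ./harness < input.bin`.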
You can, but you need to have the same hardware as the target platform (in this case a Haswell server). I have a Coffee Lake desktop processor, so whatever results I get locally will not be representative.
I was thinking of running both the streaming and the interleaved versions on the same machine to see if you could replicate the same speedup (even on a different microarchitecture).
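Concretely, a sketch of what I mean; “interleaved” here is my reading of the post (split the buffer in half and walk two streams in lockstep), not its actual code:

```rust
use std::time::Instant;

// One linear pass: the hardware sees a single forward stream.
fn count_streaming(buf: &[u8], needle: u8) -> usize {
    buf.iter().filter(|&&b| b == needle).count()
}

// Walk both halves of the buffer in lockstep, so the hardware sees two
// concurrent forward streams.
fn count_interleaved(buf: &[u8], needle: u8) -> usize {
    let (lo, hi) = buf.split_at(buf.len() / 2);
    let mut count = 0;
    for (&a, &b) in lo.iter().zip(hi) {
        count += (a == needle) as usize + (b == needle) as usize;
    }
    // `hi` is one element longer than `lo` when the length is odd.
    if buf.len() % 2 == 1 {
        count += (hi[hi.len() - 1] == needle) as usize;
    }
    count
}

fn main() {
    // Size is a placeholder; pick something well past the LLC. Every
    // page is written first so untouched zero pages don't skew the
    // measurement.
    let mut buf = vec![0u8; 1 << 28];
    for (i, b) in buf.iter_mut().enumerate() {
        *b = i as u8;
    }
    for (name, f) in [
        ("streaming", count_streaming as fn(&[u8], u8) -> usize),
        ("interleaved", count_interleaved as fn(&[u8], u8) -> usize),
    ] {
        let start = Instant::now();
        let count = f(&buf, b'\n');
        eprintln!("{name}: count = {count}, elapsed = {:?}", start.elapsed());
    }
}
```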
That said, I looked up the phrase mentioned in the blog post in the January 2023 version of the Intel Optimization Manual, and I could only find it under “EARLIER GENERATIONS OF INTEL® 64” for Sandy Bridge.
Not sure it would apply to Skylake/Coffee Lake.
Good point, this piqued my interest. I know what I’ll be doing next weekend :P
I’d love to benchmark this against my bytecount crate at some point (but alas! I currently lack the time). I think you can win a tiny bit by better amortizing the horizontal additions.
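To spell out the amortization I mean, here is a sketch with AVX2 intrinsics; this is the general trick as I’d write it down, not bytecount’s actual code:

```rust
// Count `needle` bytes 32 at a time, amortizing the horizontal adds.
// _mm256_cmpeq_epi8 yields 0xFF (i.e. -1) per matching byte, so
// subtracting the mask from a byte-wise accumulator adds 1 per match.
// A byte lane can only hold 255 of those, so only once every 255
// vectors do we flush the accumulator into four u64 lanes via
// _mm256_sad_epu8 (SAD against zero = horizontal add of 8 bytes).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn count_avx2(buf: &[u8], needle: u8) -> usize {
    use std::arch::x86_64::*;

    let needles = _mm256_set1_epi8(needle as i8);
    let zero = _mm256_setzero_si256();
    let mut totals = zero; // four u64 partial sums
    let mut acc = zero;    // 32 byte-wide counters
    let mut pending = 0;   // vectors since the last flush

    let mut chunks = buf.chunks_exact(32);
    for chunk in &mut chunks {
        let v = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
        acc = _mm256_sub_epi8(acc, _mm256_cmpeq_epi8(v, needles));
        pending += 1;
        if pending == 255 {
            totals = _mm256_add_epi64(totals, _mm256_sad_epu8(acc, zero));
            acc = zero;
            pending = 0;
        }
    }
    totals = _mm256_add_epi64(totals, _mm256_sad_epu8(acc, zero));

    // Sum the four u64 lanes and handle the scalar tail.
    let mut lanes = [0u64; 4];
    _mm256_storeu_si256(lanes.as_mut_ptr() as *mut __m256i, totals);
    lanes.iter().sum::<u64>() as usize
        + chunks.remainder().iter().filter(|&&b| b == needle).count()
}
```

(You’d gate the call behind `is_x86_feature_detected!("avx2")`.) The expensive horizontal reduction then runs once every 255 vectors instead of in the hot loop.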