runtime: improve scaling of lock2 #68578

Closed
@rhysh

Description

Part of my experience with runtime.mutex values is that process-wide demand for a single mutex can slow the program to a crawl. I don't think that alone will come as a surprise, though the magnitude of slowdown is more than I'd expect. The main surprise is that programs can have a hard time recovering from that performance cliff once they've fallen off.

I've described a case of "larger magnitude than I expected" as #57070. I've described (perhaps indirectly) "performance cliff that defies self-recovery" as #61426, #61427, and #61428. The runtime's ChanContended benchmark shows clear degradation with increasing GOMAXPROCS, which I suspect is the same problem.


Here's what I think is happening:

For some users of runtime.mutex values, the critical section is very fast. When exiting the critical section, unlock2 checks whether another thread has gone to sleep while waiting for the mutex. If unlock2 finds one, it wakes a single lock2 sleeper. While waking the other thread, unlock2 removes that sleeper's note within the mutex. The details differ between lock_sema.go and lock_futex.go: the former pops a single entry off the linked list of waiting Ms, while the latter resets the whole state word to 0.
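For concreteness, the futex-based release path has roughly this shape (paraphrased and abbreviated from lock_futex.go as of go1.23; lock-rank, profiling, and m.locks bookkeeping elided):

func unlock2(l *mutex) {
	// Release the lock, observing the previous state in one atomic swap.
	// The swap also resets the word to 0, discarding the note that other
	// sleepers may still exist.
	v := atomic.Xchg(key32(&l.key), mutex_unlocked)
	if v == mutex_unlocked {
		throw("unlock of unlocked lock")
	}
	if v == mutex_sleeping {
		// At least one lock2 caller is asleep; wake exactly one of them.
		futexwakeup(key32(&l.key), 1)
	}
	// ... profiling and lock-count bookkeeping elided ...
}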

Now, the previous user of the mutex (recent caller of unlock2) is awake, and the lock2 caller that it woke is awake. If the previous user needs the same mutex again, it enters lock2 and we now have two non-sleeping threads attempting to acquire the mutex. Whichever one wins, it will soon call unlock2. If the caller of unlock2 finds that another thread is sleeping in lock2 waiting for this mutex, it will wake that thread.

Given enough time, a caller of lock2 will give up on the procyield/osyield section and go to sleep. But, how long does it take a thread to give up? And how long does it take a thread to complete another pass through the critical section, call unlock2, and wake an additional thread?
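For reference, the futex-based acquire path is roughly the following (again paraphrased and abbreviated from lock_futex.go; the semaphore-based version in lock_sema.go has the same spin-then-sleep structure):

func lock2(l *mutex) {
	// Speculative grab.
	v := atomic.Xchg(key32(&l.key), mutex_locked)
	if v == mutex_unlocked {
		return
	}
	wait := v
	spin := 0
	if ncpu > 1 {
		spin = active_spin // 4
	}
	for {
		// Try for the lock, spinning on the CPU.
		for i := 0; i < spin; i++ {
			for l.key == mutex_unlocked {
				if atomic.Cas(key32(&l.key), mutex_unlocked, wait) {
					return
				}
			}
			procyield(active_spin_cnt) // 30 PAUSE/YIELD-style iterations
		}
		// Try for the lock, yielding the OS thread.
		for i := 0; i < passive_spin; i++ { // passive_spin == 1
			for l.key == mutex_unlocked {
				if atomic.Cas(key32(&l.key), mutex_unlocked, wait) {
					return
				}
			}
			osyield()
		}
		// Give up: record that a sleeper exists, then sleep.
		v = atomic.Xchg(key32(&l.key), mutex_sleeping)
		if v == mutex_unlocked {
			return
		}
		wait = mutex_sleeping
		futexsleep(key32(&l.key), mutex_sleeping, -1)
	}
}

So before sleeping, a thread makes active_spin (4) procyield passes plus passive_spin (1) osyield pass, each preceded by loads and CAS attempts on the mutex word; that is roughly where the "about 5 times per wake cycle" figure below comes from.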

It looks like the time a thread in lock2 takes to go to sleep is over 100 times longer than the runtime's fastest critical sections. Preliminary measurements on an M1 MacBook Air with darwin/arm64 show ~6000 ns of spinning between sleeps and ~14 ns for an uncontended channel operation. That means, I think, that nearly all of the threads that need a runtime.mutex for a quick critical section will be awake the whole time. Being awake means repeatedly loading the mutex value (about 5 times per wake cycle) and attempting to CAS it.
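For scale, the uncontended cost is easy to reproduce with the neighboring ChanUncontended benchmark; its "op" is also 200 channel operations, so ~14 ns per operation would show up as roughly 2.8 µs/op:

$ go test runtime -test.run='^$' -test.bench=ChanUncontended -test.cpu=1 -test.count=10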

As more threads demand access to the mutex's cache line, updating it becomes more expensive. Acquiring the mutex involves updating that cache line. Going to sleep and waking sleepers also require updates to that cache line.


BenchmarkChanContended does 100 sends and 100 receives on a buffered channel with no channel-based waiting. That's 200 lock2/unlock2 pairs per "op". Each additional thread allowed by GOMAXPROCS reduces the whole-process throughput, with minor exceptions of around 1% near the limit of physical cores.
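The benchmark body is roughly the following (paraphrasing runtime/chan_test.go; all workers share one channel, and the buffer is sized so sends never block on a full buffer, which keeps every channel operation a short lock2/unlock2-protected critical section):

func BenchmarkChanContended(b *testing.B) {
	const C = 100
	myc := make(chan int, C*runtime.GOMAXPROCS(0))
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			// 100 buffered sends then 100 receives on the shared channel:
			// 200 acquisitions of the channel's internal runtime.mutex per op.
			for i := 0; i < C; i++ {
				myc <- 0
			}
			for i := 0; i < C; i++ {
				<-myc
			}
		}
	})
}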

Below is the behavior I see on an Intel i7-13700H (linux/amd64, 6 P cores with 2 threads each plus 8 E cores). When the process is allowed to use 4 threads, the whole-process throughput is 1/2 of what it was with 1 thread. With 8 threads, throughput is halved again. With 12, it's halved yet again. At GOMAXPROCS=20 the 200 channel operations take an average of 44 µs, so on average there's an unlock2 call every 220 ns, each with a fresh opportunity to wake a sleeping thread.

$ go version
go version go1.23rc2 linux/amd64
$ go test runtime -test.run='^$' -test.bench=ChanContended -test.cpu="$(seq 1 20 | tr '\n' ',')" -test.count=10
goos: linux
goarch: amd64
pkg: runtime
cpu: 13th Gen Intel(R) Core(TM) i7-13700H
                 │      -      │
                 │   sec/op    │
ChanContended      3.167µ ± 1%
ChanContended-2    4.723µ ± 6%
ChanContended-3    5.563µ ± 3%
ChanContended-4    6.527µ ± 2%
ChanContended-5    7.848µ ± 4%
ChanContended-6    8.683µ ± 1%
ChanContended-7    11.19µ ± 0%
ChanContended-8    13.96µ ± 1%
ChanContended-9    17.15µ ± 5%
ChanContended-10   20.17µ ± 3%
ChanContended-11   24.14µ ± 0%
ChanContended-12   29.16µ ± 1%
ChanContended-13   32.72µ ± 4%
ChanContended-14   36.23µ ± 0%
ChanContended-15   36.10µ ± 1%
ChanContended-16   37.73µ ± 5%
ChanContended-17   37.59µ ± 4%
ChanContended-18   41.37µ ± 6%
ChanContended-19   42.87µ ± 1%
ChanContended-20   44.36µ ± 2%
geomean            17.43µ

And here's the behavior I see on an M1 MacBook Air (darwin/arm64, 4 P cores plus 4 E cores). With 5 threads, throughput is less than half of what the program can do with one thread.

$ go version
go version go1.23rc2 darwin/arm64
$ go test runtime -test.run='^$' -test.bench=ChanContended -test.cpu="$(seq 1 8 | tr '\n' ',')" -test.count=10
goos: darwin
goarch: arm64
pkg: runtime
cpu: Apple M1
                │      -      │
                │   sec/op    │
ChanContended     2.639µ ± 1%
ChanContended-2   3.303µ ± 6%
ChanContended-3   4.548µ ± 1%
ChanContended-4   5.041µ ± 0%
ChanContended-5   5.788µ ± 1%
ChanContended-6   6.171µ ± 0%
ChanContended-7   6.470µ ± 0%
ChanContended-8   6.405µ ± 0%
geomean           4.829µ

Another perspective is to consider how much on-CPU time the process has. The data below show that during 1.78 seconds of wall-clock time, the process's 20 threads were on-CPU for a total of 27.74 seconds within lock2 calls. Those threads were not sleeping!

$ go test runtime -test.run='^$' -test.bench=ChanContended -test.cpu=20 -test.count=1 -test.cpuprofile=/tmp/p
goos: linux
goarch: amd64
pkg: runtime
cpu: 13th Gen Intel(R) Core(TM) i7-13700H
BenchmarkChanContended-20    	   26667	     44404 ns/op
PASS
ok  	runtime	1.785s

$ go tool pprof -peek runtime.lock2 /tmp/p
File: runtime.test
Type: cpu
Time: Jul 24, 2024 at 8:45pm (UTC)
Duration: 1.78s, Total samples = 31.32s (1759.32%)
Showing nodes accounting for 31.32s, 100% of 31.32s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context 	 	 
----------------------------------------------------------+-------------
                                            27.74s   100% |   runtime.lockWithRank
     4.57s 14.59% 14.59%     27.74s 88.57%                | runtime.lock2
                                            19.50s 70.30% |   runtime.procyield
                                             2.74s  9.88% |   runtime.futexsleep
                                             0.84s  3.03% |   runtime.osyield
                                             0.07s  0.25% |   runtime.(*lockTimer).begin
                                             0.02s 0.072% |   runtime.(*lockTimer).end
----------------------------------------------------------+-------------

In short, we have lock2 implementations that on the page appear to let threads sleep but in practice keep them all spinning, and spinning threads at least correlate with (and, I think, also cause) slower lock handoffs.

I think we can do better. I have some ideas on how, but at this point I'd love feedback on whether others agree with my analysis and on the state of the art here today.

CC @golang/runtime


Labels

NeedsFix, Performance, compiler/runtime
