fix(spanner): fix negative values for max_in_use_sessions metrics#10449
fix(spanner): fix negative values for max_in_use_sessions metrics#10449
Conversation
spanner/session_test.go
Outdated
| if (sp.idleList.Len() + | ||
| int(sp.createReqs)) != 1 { |
There was a problem hiding this comment.
nit: Can this be on one line (I found it hard to read in the current form)
spanner/session_test.go
Outdated
| sh.recycle() | ||
| } | ||
|
|
||
| for true { |
There was a problem hiding this comment.
nit: Can we add an escape hatch to the loop (e.g. stop after X time)? Now a future bug could cause this to loop for ever, which is harder to debug than a test failure after for example 2 seconds.
spanner/session.go
Outdated
| // Decrease the number of sessions in use. | ||
| p.decNumInUseLocked(ctx) | ||
| // Decrease the number of sessions in use, only when not from idle list. | ||
| if !isExpire { |
There was a problem hiding this comment.
I think that it would be better to add an additional argument to the remove(..) method that explicitly says whether the session was in use or not. So something like this:
func (p *sessionPool) remove(s *session, isExpire bool, wasInUse bool) {
...
if wasInUse {
p.decNumInUseLocked(ctx)
}
}In theory, it could be that this method is called in the future to remove sessions that have not expired, but that also were not in use at the time that they were being removed, and then we could re-introduce a similar bug as the one here. That is less likely with an explicit argument that clearly says what it is for.
208c629 to
53f7f2c
Compare
53f7f2c to
78e81d5
Compare
spanner/session.go
Outdated
| func deleteSession(ctx context.Context, s *session, wg *sync.WaitGroup) { | ||
| defer wg.Done() | ||
| s.destroyWithContext(ctx, false) | ||
| s.destroyWithContext(ctx, false, true) |
There was a problem hiding this comment.
Is deleteSession only called for sessions that are in use at that moment?
Note that inUse means that the session was checked out of the pool at the moment that this method is being called. So in this case it would mean that we are calling deleteSession(..) for a session that was checked out.
There was a problem hiding this comment.
Renamed the function to closeSession as it will only be called when doing application cleanup.
There was a problem hiding this comment.
But won't that then mean that the metric will drop to a negative value for a (very) short time when the application is shutting down? Assume that the situation is that:
- The pool has 100 sessions.
- 10 of them are in use.
- The application shuts down and closes the client.
- The client closes all sessions and calls this method for all 100 sesssions.
- The inUseSessions metric drop from 10 to -90 for a very short time.
There was a problem hiding this comment.
wasInUse was set to false in latest commit, hence negative values should not appear, and graphs will just show last exported value.
| p.mu.Unlock() | ||
| } | ||
| s.destroy(false) | ||
| s.destroy(false, true) |
There was a problem hiding this comment.
Is destroy() only called for sessions that are in use? I would expect that it could also be called for a session that is in the list of idle sessions, and in that case it was not in use.
There was a problem hiding this comment.
sessionHandle is the wrapper which is created only for transactions(to be used) so we can assume any call for destroy using sessionHandle was in use
There was a problem hiding this comment.
Ah, that makes it clearer, thanks!
(#10508) * fix(spanner): add debug log to print full stack trace when negative value happens * skip decrementing num_in_use metric count when session is destroyed from healthchecks.
Internal bug: b/343756862