Regression in Scheduler Performance in Large Scale Clusters #127912
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/sig scheduling
/assign @alculquicondor
/kind regression
@hakuna-matatah can share that?
/unassign
@hakuna-matatah just to be sure, can you try the latest patch version?
@hakuna-matatah Could you share your cluster setup as well, e.g. kube-apiserver version and etcd version?
k8s-scheduler v1.31.0: kubernetes/pkg/scheduler/framework/parallelize/parallelism.go, lines 56 to 65 at 9edcffc
k8s-scheduler v1.30.5: kubernetes/pkg/scheduler/framework/parallelize/parallelism.go, lines 56 to 65 at 74e84a9
prometheus v1.19.1 (used in v1.31.0):

```go
func (g *gauge) Add(val float64) {
	for {
		oldBits := atomic.LoadUint64(&g.valBits)
		newBits := math.Float64bits(math.Float64frombits(oldBits) + val)
		if atomic.CompareAndSwapUint64(&g.valBits, oldBits, newBits) {
			return
		}
	}
}
```

prometheus v1.16.0 (used in v1.30.5):

```go
func (g *gauge) Add(val float64) {
	for {
		oldBits := atomic.LoadUint64(&g.valBits)
		newBits := math.Float64bits(math.Float64frombits(oldBits) + val)
		if atomic.CompareAndSwapUint64(&g.valBits, oldBits, newBits) {
			return
		}
	}
}
```

So I would like to reproduce the issue first to do further checking.
The APIServer version is already mentioned in the issue description (v1.30.5, v1.31.0). Also, the code change between versions w.r.t. Prometheus that you mentioned above is something I had noticed before as well, but apparently pprof is telling me a different story. In your setup, are you simulating 5k nodes using kwok? I ask because scheduler decision making / CPU cycles / Prometheus operations are proportional to the number of nodes being evaluated (in the 5k-node case it evaluates 500 nodes, i.e. 10%) when you schedule these 50k pods on 5k nodes at 1k QPS. If I may ask, what CPU cycles for Prometheus have you seen in your tests for 1.30 and 1.31? Could you share the flame graphs if you have them handy?
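For context on the "500 nodes, i.e. 10%" figure: when percentageOfNodesToScore is left at its default, the scheduler picks an adaptive percentage, documented as 50 minus one point per 125 nodes with a 5% floor. The sketch below reproduces that calculation from the documented behavior; treat the constants as illustrative rather than as the scheduler's exact code.

```go
package main

import "fmt"

const (
	// Illustrative values matching the documented adaptive behavior of
	// percentageOfNodesToScore; the authoritative constants live in pkg/scheduler.
	minFeasibleNodesToFind           = 100
	minFeasibleNodesPercentageToFind = 5
)

// numFeasibleNodesToFind mirrors the adaptive formula: start at 50% and drop
// one percentage point per 125 nodes, never going below 5% (or 100 nodes).
func numFeasibleNodesToFind(numAllNodes int32) int32 {
	if numAllNodes < minFeasibleNodesToFind {
		return numAllNodes
	}
	adaptivePercentage := int32(50) - numAllNodes/125
	if adaptivePercentage < minFeasibleNodesPercentageToFind {
		adaptivePercentage = minFeasibleNodesPercentageToFind
	}
	numNodes := numAllNodes * adaptivePercentage / 100
	if numNodes < minFeasibleNodesToFind {
		return minFeasibleNodesToFind
	}
	return numNodes
}

func main() {
	fmt.Println(numFeasibleNodesToFind(5000)) // 500, i.e. 10% of a 5k-node cluster
}
```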
Yes, I use 5k nodes and 50k pods as you described in the issue, but the results don't look very different.
1.31.0
1.30.5
It looks like in your results both 1.30 and 1.31 are performing badly in terms of scheduler latency. Based on the throughput numbers in your comment, it is possible that you have not tweaked the scheduler component's QPS settings; that would generally explain the numbers maxing out where they do. Could you double check whether you have set the scheduler QPS settings to at least 1000/1000?
I guess it may be because I run kwok on an HDD server. Regarding the config, here is what I changed according to what you described above.
Yeah, that is the test config, and it looks good to me ^^^.
The defaults are 50/100, which is why your throughput results are capping off at ~51, with max (burst) at ~146. Also, change the APIServer
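For reference, the scheduler's client QPS and burst live under clientConnection in the KubeSchedulerConfiguration file; a minimal sketch of raising both to 1000 (file contents are illustrative, not taken from this thread):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  # defaults are 50 (qps) / 100 (burst); raise both so the scheduler's
  # own client is not the bottleneck at ~1k pods/s
  qps: 1000
  burst: 1000
```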
Ah, I get what you mean, let me tweak it later and run again.
@hakuna-matatah can you share the number of nodes used in your first run?
Both the test results that I posted for 1.31 and 1.30 in the description of this issue are on
Here are the launch parameters of
And my test results:
1.31.0 + NVMe SSD
1.30.5 + NVMe SSD
1.31.0 + HDD
1.30.5 + HDD
Will try to post pprof results later.
The binary for 'kube-scheduler' is built with go1.22 in your test, right? @hakuna-matatah
Yes, I have just used the vanilla k8s scheduler. Based on go.mod it seems to use 1.22, so yes.
Based on the results, it appears that your throughput is capping at
I think you need to reach higher throughput to put more load on the system; that might help surface the difference in performance. If you look at my results, it's sitting at
In your setup, do you also have
They share the same one; let me split it and see if it does better.
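(Aside, in case it helps anyone reproducing this: events are typically split onto a dedicated etcd via kube-apiserver's --etcd-servers-overrides flag. The endpoints below are placeholders, not values from this thread.)

```sh
# Placeholder endpoints; routes core-group Event objects to a separate etcd cluster.
kube-apiserver \
  --etcd-servers=https://etcd-main:2379 \
  --etcd-servers-overrides=/events#https://etcd-events:2379
```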
I separated the events etcd, and the throughput looks better. But 1.31 is still no worse than 1.30 on my server.
1.31
1.30
I see you finally got your setup working similarly to mine, based on the throughput numbers I see for 1.30. Thank you for the effort of trying to reproduce. I will try to run the test once again (however, this time I will run with
Here is how I set it up via kwok:
Kwok configuration
Kwok Launch Command
What happened?
Scheduler throughput and performance have regressed in 1.31 compared to 1.30.
What did you expect to happen?
Scheduler throughput and performance on 1.31 should at least stay the same as on 1.30, or improve.
How can we reproduce it (as minimally and precisely as possible)?
I'm leveraging the test that I have written to measure scheduler throughput and performance by creating pods directly against the APIServer, without KCM controllers in the picture.
Settings:
QPS set to 1000 and total pods set to 50k.
Test results: You would get roughly the following latency and throughput numbers for 1.30v:
Latency:
Throughput:
Test results: You would get roughly the following latency and throughput numbers for 1.31v:
Latency:
Throughput:
Anything else we need to know?
You can see that on 1.31v, the latency for the create_to_schedule phase increased 3X or more (I have posted the run with the lowest latency and highest throughput among the tests I ran), and throughput has reduced significantly from ~936 to ~704 at peak/p99.
When I looked at the pprof of the runs on 1.30v and 1.31v, the major differences showed up as follows:
k8s.io/kubernetes/pkg/scheduler/framework/parallelize/parallelism.go (overall this is slightly higher on 1.31v, i.e. ~57% vs ~46% on 1.30v)
1.31v pprof
1.30v pprof
You can see that the % of CPU cycles/time spent on Prometheus operations has more than doubled for the same amount of work, i.e. 50k pods at 1K QPS ^^^
I can post flame graphs as well
Food for thought:
Generally, we should batch Prometheus gauge operations for better performance, given the CPU cycles they consume in the scheduleOne goroutine (we schedule pods serially in a single goroutine by the nature of the scheduler). Also, I don't think consumers need the level of precision we provide today: users generally scrape Prometheus metrics at 10 sec, 30 sec, 1 min or 5 min intervals at best. Would like to know what the community thinks about this. At the very least, we should have a feature gate to configure the precision/frequency of emitting Prometheus metrics.
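To make the batching idea concrete, here is a rough, hypothetical sketch (not code from kube-scheduler or client_golang; batchedGauge and the flush interval are made up for illustration) of a wrapper that accumulates gauge deltas locally and flushes them on an interval:

```go
package metricsbatch

import (
	"sync/atomic"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// batchedGauge accumulates integer deltas locally and flushes the sum to the
// underlying Prometheus gauge periodically, so the scheduling hot path pays
// for a single atomic add instead of a CAS retry loop on the shared gauge.
type batchedGauge struct {
	g       prometheus.Gauge
	pending int64
}

// newBatchedGauge wraps g and starts a background flusher. The flush interval
// bounds how stale the exported value can be; even 1s is far finer than the
// 10s+ scrape intervals mentioned above.
func newBatchedGauge(g prometheus.Gauge, flushEvery time.Duration) *batchedGauge {
	bg := &batchedGauge{g: g}
	go func() {
		for range time.Tick(flushEvery) {
			if d := atomic.SwapInt64(&bg.pending, 0); d != 0 {
				bg.g.Add(float64(d)) // one update to the real gauge per interval
			}
		}
	}()
	return bg
}

// Add is the cheap call made from the scheduling goroutine.
func (bg *batchedGauge) Add(delta int64) {
	atomic.AddInt64(&bg.pending, delta)
}
```

The trade-off is that the exported value can lag by up to one flush interval, which seems acceptable given the scrape intervals above.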
Kubernetes version
Cloud provider
OS version
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"
Linux ip-172-16-60-69.us-west-2.compute.internal 5.10.224-212.876.amzn2.x86_64 #1 SMP Thu Aug 22 16:55:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)