
Regression in Scheduler Performance in Large Scale Clusters #127912

Open
hakuna-matatah opened this issue Oct 8, 2024 · 26 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@hakuna-matatah
Contributor

hakuna-matatah commented Oct 8, 2024

What happened?

Scheduler throughput and performance have regressed in 1.31 compared to 1.30.

What did you expect to happen?

Scheduler throughput and performance on 1.31 should at least match 1.30, or improve.

How can we reproduce it (as minimally and precisely as possible)?

I'm leveraging a test that I wrote to measure scheduler throughput and performance by creating pods directly against the API server, without KCM controllers in the picture.

Settings:

  • You can run this test with the QPS set to 1000 and the total pods set to 50k.
Test results: You would get roughly the following latency and throughput numbers for v1.30:

Latency:

    {
      "data": {
        "Perc50": 4853.606009,
        "Perc90": 6501.635529,
        "Perc99": 7152.44798
      },
      "unit": "ms",
      "labels": {
        "Metric": "pod_startup"
      }
    },
    {
      "data": {
        "Perc50": 3750.943,
        "Perc90": 5345.859,
        "Perc99": 5861.101
      },
      "unit": "ms",
      "labels": {
        "Metric": "create_to_schedule"
      }
    }

Throughput:

{
  "perc50": 803,
  "perc90": 856,
  "perc99": 936,
  "max": 936
}
Test results: You would get roughly the following latency and throughput numbers for v1.31:

Latency:

    {
      "data": {
        "Perc50": 10675.556409,
        "Perc90": 17805.60988,
        "Perc99": 19445.20954
      },
      "unit": "ms",
      "labels": {
        "Metric": "pod_startup"
      }
    },
    {
      "data": {
        "Perc50": 9579.25,
        "Perc90": 16696.174,
        "Perc99": 18275.346
      },
      "unit": "ms",
      "labels": {
        "Metric": "create_to_schedule"
      }
    }

Throughput:

{
  "perc50": 638,
  "perc90": 683,
  "perc99": 713,
  "max": 713
}

Anything else we need to know?

You can see that on v1.31, the latency for the create_to_schedule phase increased 3x or more (I have posted the run with the lowest latency and highest throughput among the tests I ran), and throughput dropped significantly, from ~936 to ~704 at peak/p99.

When I looked at the pprof of the runs on v1.30 and v1.31, the major differences showed up as follows:

  • Prometheus.(*gauge).Add [k8s.io/client-go/util/workqueue.ParallelizeUntil.func1] (main contributing factor)
  • k8s.io/kubernetes/pkg/scheduler/framework/parallelize.Parallelizer.Until.func1
    k8s.io/kubernetes/pkg/scheduler/framework/parallelize/parallelism.go (overall this is slightly higher on v1.31, i.e. ~57% vs ~46% on v1.30)
v1.31 pprof: [Screenshot 2024-10-07 at 5.19.48 PM] [Screenshot 2024-10-07 at 5.20.19 PM]
v1.30 pprof: [Screenshot 2024-10-07 at 5.07.22 PM] [Screenshot 2024-10-07 at 5.21.18 PM]

You can see that the percentage of CPU cycles/time spent in Prometheus operations more than doubled for the same amount of work, i.e. 50k pods at 1k QPS ^^^

I can post flame graphs as well

Food for thought:

Generally, we should batch Prometheus gauge operations to cut the CPU cycles consumed in the scheduleOne goroutine (we schedule pods serially in a single goroutine, given the nature of the scheduler). Also, I don't think consumers need the level of precision we provide today: users generally scrape Prometheus metrics at 10s, 30s, 1min, or 5min intervals at the shortest. I'd like to know what the community thinks about this. At the very least, we could have a feature gate to configure the precision with which Prometheus metrics are emitted.
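
To make the batching idea concrete, here is a minimal, self-contained sketch (not scheduler code; the metric name, label value, and worker count are made up for illustration) that contrasts the current per-piece Inc/Dec pattern with a batched one, using client_golang and client-go's ParallelizeUntil helper:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	gauge := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "sketch_goroutines",
		Help: "Goroutines running a scheduling operation (illustrative only).",
	}, []string{"operation"})
	g := gauge.WithLabelValues("Filter")

	pieces := 50000
	work := func(piece int) { _ = piece * piece } // stand-in for per-node Filter work

	// Current pattern: one Inc and one Dec per work piece,
	// i.e. 2*pieces atomic CAS loops on the gauge.
	perPiece := func(piece int) {
		g.Inc()
		work(piece)
		g.Dec()
	}
	start := time.Now()
	workqueue.ParallelizeUntil(context.Background(), 16, pieces, perPiece)
	fmt.Println("per-piece gauge updates:", time.Since(start))

	// Batched pattern: one Inc/Dec per parallel run; the gauge loses
	// per-piece granularity but the hot path avoids the atomic updates.
	start = time.Now()
	g.Inc()
	workqueue.ParallelizeUntil(context.Background(), 16, pieces, work)
	g.Dec()
	fmt.Println("batched gauge updates:  ", time.Since(start))
}

Whether the coarser gauge is acceptable depends on how much precision the goroutines metric actually needs, which is exactly the question above.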

Kubernetes version

$ kubectl version

sh-4.2$ kubectl version
Client Version: v1.30.4-eks-a737599
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.5-eks-ce1d5eb
sh-4.2$ kubectl version
Client Version: v1.31.0-eks-a737599
Kustomize Version: v5.4.2
Server Version: v1.31.0-eks-a737599

Cloud provider

AWS

OS version

# On Linux:
$ cat /etc/os-release

NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"

$ uname -a

Linux ip-172-16-60-69.us-west-2.compute.internal 5.10.224-212.876.amzn2.x86_64 #1 SMP Thu Aug 22 16:55:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux



Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@hakuna-matatah hakuna-matatah added the kind/bug Categorizes issue or PR as related to a bug. label Oct 8, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 8, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@hakuna-matatah
Contributor Author

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 8, 2024
@hakuna-matatah
Contributor Author

hakuna-matatah commented Oct 8, 2024

/assign @alculquicondor

@dims
Member

dims commented Oct 8, 2024

/kind regression

@k8s-ci-robot k8s-ci-robot added the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Oct 8, 2024
@haosdent
Member

haosdent commented Oct 8, 2024

I can post flame graphs as well

@hakuna-matatah can you share that?

@alculquicondor
Member

/unassign
/assign @macsko

@alculquicondor
Member

Server Version: v1.31.0-eks-a737599

@hakuna-matatah just to be sure, can you try the latest patch version?

@haosdent
Member

haosdent commented Oct 9, 2024

@hakuna-matatah Could you share your cluster setup as well, e.g. kube-apiserver version and etcd version?
I tried using Kwok to reproduce locally. So far, the results look the same between 1.30 and 1.31.
As for whether any code change could have caused the regression, nothing looks obvious.

k8s-scheduler v1.31.0

func (p Parallelizer) Until(ctx context.Context, pieces int, doWorkPiece workqueue.DoWorkPieceFunc, operation string) {
	goroutinesMetric := metrics.Goroutines.WithLabelValues(operation)
	withMetrics := func(piece int) {
		goroutinesMetric.Inc()
		doWorkPiece(piece)
		goroutinesMetric.Dec()
	}
	workqueue.ParallelizeUntil(ctx, p.parallelism, pieces, withMetrics, workqueue.WithChunkSize(chunkSizeFor(pieces, p.parallelism)))
}

k8s-scheduler v1.30.5

func (p Parallelizer) Until(ctx context.Context, pieces int, doWorkPiece workqueue.DoWorkPieceFunc, operation string) {
	goroutinesMetric := metrics.Goroutines.WithLabelValues(operation)
	withMetrics := func(piece int) {
		goroutinesMetric.Inc()
		doWorkPiece(piece)
		goroutinesMetric.Dec()
	}
	workqueue.ParallelizeUntil(ctx, p.parallelism, pieces, withMetrics, workqueue.WithChunkSize(chunkSizeFor(pieces, p.parallelism)))
}

prometheus v1.19.1 (used in v1.31.0)

https://github.com/prometheus/client_golang/blob/6e3f4b1091875216850a486b1c2eb0e5ea852f98/prometheus/gauge.go#L122-L130

func (g *gauge) Add(val float64) {
	for {
		oldBits := atomic.LoadUint64(&g.valBits)
		newBits := math.Float64bits(math.Float64frombits(oldBits) + val)
		if atomic.CompareAndSwapUint64(&g.valBits, oldBits, newBits) {
			return
		}
	}
}

prometheus v1.16.0 (used in v1.30.5)

https://github.com/prometheus/client_golang/blob/3583c1e1d085b75cab406c78b015562d45552b39/prometheus/gauge.go#L122-L130

func (g *gauge) Add(val float64) {
	for {
		oldBits := atomic.LoadUint64(&g.valBits)
		newBits := math.Float64bits(math.Float64frombits(oldBits) + val)
		if atomic.CompareAndSwapUint64(&g.valBits, oldBits, newBits) {
			return
		}
	}
}
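
Since (*gauge).Add is byte-for-byte identical in the two client_golang versions, one way to rule the metrics library in or out would be a micro-benchmark pinned to each version. A sketch (benchmark name and label value are made up; run it once with v1.16.0 and once with v1.19.1 in go.mod and compare ns/op):

package gaugebench

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus"
)

// BenchmarkGaugeIncDec mirrors the per-piece Inc/Dec pattern used in the
// scheduler's Parallelizer.Until, under parallel contention.
func BenchmarkGaugeIncDec(b *testing.B) {
	vec := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "bench_goroutines",
		Help: "Gauge used only to benchmark Inc/Dec cost.",
	}, []string{"operation"})
	g := vec.WithLabelValues("Filter")

	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			g.Inc()
			g.Dec()
		}
	})
}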

So I would like to reproduce the issue first before doing further checking.

@hakuna-matatah
Contributor Author

@haosdent

Could you share your cluster setup as well, e.g. kube-apiserver version and etcd version?

The API server versions are already mentioned in the issue description (v1.30.5, v1.31.0).
For etcd, it is on 3.5.15.

Also, the code change between versions w.r.t. the Prometheus client that you mentioned above is something I had noticed before as well. But pprof is telling me a different story.

In your setup, are you simulating 5k nodes using Kwok? Scheduler decision making / CPU cycles / Prometheus operations are proportional to the number of nodes being evaluated (in the 5k-node case it evaluates 500 nodes, i.e. 10%; see the sketch below) when you schedule these 50k pods onto 5k nodes at 1k QPS.
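
For context on where that ~10% comes from: kube-scheduler's adaptive percentageOfNodesToScore shrinks as the cluster grows. A simplified sketch of that calculation (the constants reflect the upstream defaults as I understand them, so treat the exact values as an approximation rather than the authoritative implementation):

package main

import "fmt"

// numFeasibleNodesToFind approximates how kube-scheduler picks how many nodes
// to evaluate when percentageOfNodesToScore is left at its default.
func numFeasibleNodesToFind(numAllNodes int32) int32 {
	const (
		minFeasibleNodes           int32 = 100
		minFeasibleNodesPercentage int32 = 5
	)
	if numAllNodes < minFeasibleNodes {
		return numAllNodes
	}
	// Adaptive percentage: 50% minus one point per 125 nodes, floored at 5%.
	percentage := int32(50) - numAllNodes/125
	if percentage < minFeasibleNodesPercentage {
		percentage = minFeasibleNodesPercentage
	}
	numNodes := numAllNodes * percentage / 100
	if numNodes < minFeasibleNodes {
		return minFeasibleNodes
	}
	return numNodes
}

func main() {
	fmt.Println(numFeasibleNodesToFind(5000)) // prints 500, i.e. 10% of a 5k-node cluster
}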

If I may ask, what CPU cycles for Prometheus did you see in your tests for 1.30 and 1.31? Could you share the flame graphs if you have them handy?

@haosdent
Member

haosdent commented Oct 9, 2024

In your setup, are you simulating 5k nodes using Kwok? Scheduler decision making / CPU cycles / Prometheus operations are proportional to the number of nodes being evaluated when you schedule these 50k pods onto 5k nodes at 1k QPS.

Yes, I use 5k nodes and 50k pods as you described in the issue.

But the results do not look very different.

1.31.0

SchedulingThroughput: {
  "perc50": 50,
  "perc90": 51,
  "perc99": 51,
  "max": 146
}
{
  "data": {
    "Perc50": 169711.952,
    "Perc90": 307111.845,
    "Perc99": 338075.19
  },
  "unit": "ms",
  "labels": {
    "Metric": "create_to_schedule"
  }
},
{
  "data": {
    "Perc50": 169715.269042,
    "Perc90": 307115.448418,
    "Perc99": 338062.491898
  },
  "unit": "ms",
  "labels": {
    "Metric": "pod_startup"
  }
},

1.30.5

SchedulingThroughput: {
  "perc50": 50,
  "perc90": 51,
  "perc99": 51,
  "max": 146
}
{
  "data": {
    "Perc50": 187034.303,
    "Perc90": 336674.553,
    "Perc99": 370054.485
  },
  "unit": "ms",
  "labels": {
    "Metric": "create_to_schedule"
  }
},
{
  "data": {
    "Perc50": 187038.148303,
    "Perc90": 336678.248519,
    "Perc99": 370057.865132
  },
  "unit": "ms",
  "labels": {
    "Metric": "pod_startup"
  }
}

@hakuna-matatah
Contributor Author

But the results do not look very different.

It looks like in your results both 1.30 and 1.31 are performing poorly in terms of scheduler latency: create_to_schedule "Perc99": 370054.485 (1.30.5) vs 338062.491898 (1.31.0).

Based on the throughput numbers in your comment, it is possible that you have not tweaked the scheduler component's QPS settings; that would explain the numbers capping at "max": 146 in your results.

Could you double-check whether you have set the scheduler QPS settings to at least 1000/1000?

@haosdent
Member

haosdent commented Oct 9, 2024

performing poorly in terms of scheduler latency

I guess it may be because I run Kwok on a server with an HDD.

Regarding the config, here is what I changed according to what you described above.

diff --git a/clusterloader2/testing/scheduler-throughput/config.yaml b/clusterloader2/testing/scheduler-throughput/config.yaml
index ff98a335a..1925adeb3 100644
--- a/clusterloader2/testing/scheduler-throughput/config.yaml
+++ b/clusterloader2/testing/scheduler-throughput/config.yaml
@@ -1,4 +1,4 @@
-{{$totalSchedulerThroughputPods := DefaultParam .CL2_SCHEDULER_THROUGHPUT_PODS 5000}}
+{{$totalSchedulerThroughputPods := DefaultParam .CL2_SCHEDULER_THROUGHPUT_PODS 50000}}
 {{$defaultQps := DefaultParam .CL2_DEFAULT_QPS  500}}
 {{$defaultBurst := DefaultParam .CL2_DEFAULT_BURST 1000}}
 {{$uniformQps := DefaultParam .CL2_UNIFORM_QPS 500}}
@@ -16,7 +16,8 @@ tuningSets:
     burst: {{$defaultBurst}}
 - name: UniformQPS
   qpsLoad:
-    qps: {{$uniformQps}}
+          # qps: {{$uniformQps}}
+    qps: 1000

@hakuna-matatah
Contributor Author

Yeah, that is the test config, and it looks good to me ^^^.
What I was referring to in my earlier comment are the scheduler component tweaks:

--kube-api-qps
--kube-api-burst

The defaults are 50/100, which is why your throughput results are capping out at ~51 with a max (burst) of 146.

Also, change the API server's --max-mutating-requests-inflight to 750 and --max-requests-inflight to 1500 so that the API server doesn't throttle the requests. It doesn't mean that many in-flight requests are actually exercised; it's just a higher limit to avoid APF throttling.

@haosdent
Member

haosdent commented Oct 9, 2024

Ah, I get what you mean, let me tweak it later and run again.

@alculquicondor
Member

@hakuna-matatah can you share the number of nodes used in your first run?

@hakuna-matatah
Contributor Author

can you share the number of nodes used in your first run?

Both of the test results I posted for 1.31 and 1.30 in the description of this issue are on a 5k-node cluster - @alculquicondor

@haosdent
Member

Here are the launch parameters of kube-scheduler and kube-apiserver.

kube-scheduler ... --kube-api-qps=1000 --kube-api-burst=1000
kube-apiserver ... --max-mutating-requests-inflight=750 --max-requests-inflight=1500

And here are my test results:

1.31.0 + NVME SSD

{
  "perc50": 306,
  "perc90": 383,
  "perc99": 441,
  "max": 459
}
{
  "data": {
    "Perc50": 4963.981121,
    "Perc90": 11769.484125,
    "Perc99": 12426.396285
  },
  "unit": "ms",
  "labels": {
    "Metric": "pod_startup"
  }
},
{
  "data": {
    "Perc50": 4960.29,
    "Perc90": 11763.426,
    "Perc99": 12422.835
  },
  "unit": "ms",
  "labels": {
    "Metric": "create_to_schedule"
  }
}

1.30.5 + NVME SSD

{
  "perc50": 578,
  "perc90": 643,
  "perc99": 665,
  "max": 665
}
{
  "data": {
    "Perc50": 18239.979316,
    "Perc90": 28795.556615,
    "Perc99": 30748.889733
  },
  "unit": "ms",
  "labels": {
    "Metric": "pod_startup"
  }
},
{
  "data": {
    "Perc50": 18232.406,
    "Perc90": 28767.094,
    "Perc99": 30745.207
  },
  "unit": "ms",
  "labels": {
    "Metric": "create_to_schedule"
  }
}

1.31.0 + HDD

{
  "perc50": 408,
  "perc90": 553,
  "perc99": 605,
  "max": 636
}
{
  "data": {
    "Perc50": 11264.737408,
    "Perc90": 13824.312798,
    "Perc99": 14374.699844
  },
  "unit": "ms",
  "labels": {
    "Metric": "pod_startup"
  }
}
{
  "data": {
    "Perc50": 11260.401,
    "Perc90": 13819.908,
    "Perc99": 14370.655
  },
  "unit": "ms",
  "labels": {
    "Metric": "create_to_schedule"
  }
}

1.30.5 + HDD

{
  "perc50": 453,
  "perc90": 602,
  "perc99": 654,
  "max": 661
}
{
  "data": {
    "Perc50": 23474.124919,
    "Perc90": 27309.497551,
    "Perc99": 27659.268635
  },
  "unit": "ms",
  "labels": {
    "Metric": "pod_startup"
  }
},
{
  "data": {
    "Perc50": 23465.388,
    "Perc90": 27303.328,
    "Perc99": 27654.799
  },
  "unit": "ms",
  "labels": {
    "Metric": "create_to_schedule"
  }
}

I will try to post pprof results later.

@haosdent
Member

The binary for 'kube-scheduler' is built via go1.22 in your test, right? @hakuna-matatah

@alculquicondor
Member

/unassign @macsko
/assign @haosdent
Since it looks like you are making progress

@k8s-ci-robot k8s-ci-robot assigned haosdent and unassigned macsko Oct 11, 2024
@hakuna-matatah
Contributor Author

The binary for 'kube-scheduler' is built via go1.22 in your test, right?

Yes. I have just used the vanilla k8s scheduler. Based on go.mod, it seems to use Go 1.22, so yes.

@hakuna-matatah
Contributor Author

Based on the results, it appears that your throughput is capping at ~600 for both versions here.

I think you need to reach higher throughput to put more load on the system; that might help surface the difference in performance. If you look at my results, it's sitting at ~936 for 1.30 here.

In your setup, do you also have events split out from the main etcd cluster? @haosdent

@haosdent
Member

In your setup, do you also have events split out from the main etcd cluster?

They share the same one; let me split it and see if it gets better.

@haosdent
Member

haosdent commented Oct 14, 2024

I separated the events etcd, and the throughput looks better. But 1.31 still does not look worse than 1.30 on my server.

1.31

# 1st run
{
  "perc50": 758,
  "perc90": 928,
  "perc99": 1019,
  "max": 1019
}
{
  "data": {
    "Perc50": 31758.654095,
    "Perc90": 34779.523829,
    "Perc99": 35434.285248
  },
  "unit": "ms",
  "labels": {
    "Metric": "pod_startup"
  }
},
{
  "data": {
    "Perc50": 604.554,
    "Perc90": 999.083,
    "Perc99": 1147.571
  },
  "unit": "ms",
  "labels": {
    "Metric": "create_to_schedule"
  }
}
# 2nd run
{
  "perc50": 755,
  "perc90": 774,
  "perc99": 999,
  "max": 999
}
{
  "data": {
    "Perc50": 868.94607,
    "Perc90": 4454.312928,
    "Perc99": 5262.844059
  },
  "unit": "ms",
  "labels": {
    "Metric": "pod_startup"
  }
}
{
  "data": {
    "Perc50": 525.668,
    "Perc90": 925.705,
    "Perc99": 1047.774
  },
  "unit": "ms",
  "labels": {
    "Metric": "create_to_schedule"
  }
}
# 3rd run
{
  "perc50": 756,
  "perc90": 890,
  "perc99": 1004,
  "max": 1004
}
{
  "data": {
    "Perc50": 11640.859249,
    "Perc90": 14229.300136,
    "Perc99": 15493.851436
  },
  "unit": "ms",
  "labels": {
    "Metric": "pod_startup"
  }
},
{
  "data": {
    "Perc50": 542.441,
    "Perc90": 945.327,
    "Perc99": 1090.351
  },
  "unit": "ms",
  "labels": {
    "Metric": "create_to_schedule"
  }
}

1.30

# 1st run
{
  "perc50": 675,
  "perc90": 829,
  "perc99": 889,
  "max": 889
}
{
  "data": {
    "Perc50": 16591.625663,
    "Perc90": 17942.435568,
    "Perc99": 18584.784782
  },
  "unit": "ms",
  "labels": {
    "Metric": "pod_startup"
  }
},
{
  "data": {
    "Perc50": 4474.532,
    "Perc90": 5561.24,
    "Perc99": 5885.108
  },
  "unit": "ms",
  "labels": {
    "Metric": "create_to_schedule"
  }
}
# 2nd run
{
  "perc50": 784,
  "perc90": 836,
  "perc99": 981,
  "max": 981
}
{
  "data": {
    "Perc50": 11128.261189,
    "Perc90": 19480.209226,
    "Perc99": 19984.189789
  },
  "unit": "ms",
  "labels": {
    "Metric": "pod_startup"
  }
},
{
  "data": {
    "Perc50": 533.046,
    "Perc90": 932.872,
    "Perc99": 1062.666
  },
  "unit": "ms",
  "labels": {
    "Metric": "create_to_schedule"
  }
}
# 3rd run
{
  "perc50": 786,
  "perc90": 814,
  "perc99": 975,
  "max": 975
}
{
  "data": {
    "Perc50": 10278.805716,
    "Perc90": 15989.451618,
    "Perc99": 16332.52332
  },
  "unit": "ms",
  "labels": {
    "Metric": "pod_startup"
  }
},
{
  "data": {
    "Perc50": 521.541,
    "Perc90": 924.316,
    "Perc99": 1019.648
  },
  "unit": "ms",
  "labels": {
    "Metric": "create_to_schedule"
  }
}

@hakuna-matatah
Contributor Author

hakuna-matatah commented Oct 16, 2024

I see you finally got your setup working similarly to mine, based on the throughput numbers I see for 1.30. Thank you for the effort to try to reproduce this.
It's interesting that you see similar performance on 1.30 and 1.31: your numbers are close to what I was seeing on my 1.30 setup, but my 1.31 numbers seem to be a bit off in terms of performance/throughput. I wonder why that could be? 😕

I will try to run the test once again (this time with 1.31.1 instead of 1.31.0) to see if I'm able to match my 1.30 performance/throughput.

@haosdent
Member

Here is how I set it up via Kwok.

Kwok configuration

# cat /root/.kwok/kwok.yaml
kind: KwokctlConfiguration
componentsPatches:
- name: kube-apiserver
  extraArgs:
  - key: max-mutating-requests-inflight
    value: "750"
  - key: max-requests-inflight
    value: "1500"
  - key: etcd-servers-overrides
    value: "/events#http://${replace to another etcd instance}"
- name: kube-scheduler
  extraArgs:
  - key: kube-api-qps
    value: "1000"
  - key: kube-api-burst
    value: "1000"

Kwok Launch Command

kwokctl create cluster --name=test-cluster --wait=5m --kube-scheduler-image=registry.k8s.io/kube-scheduler:v1.31.0 --kube-apiserver-image=registry.k8s.io/kube-apiserver:v1.31.0 --etcd-image=registry.k8s.io/etcd:3.5.15-0
kwokctl scale node --name test-cluster --replicas 5000
