Image pull failed: Back-off pulling image (Kubernetes Executor)
## Status update 2022-04-07

- We are testing on an EKS cluster (1.19) in order to reliably reproduce the error messages listed in the issue description and comments.
- So far we haven't been able to generate the error with our test configuration.

### Test matrix

| Test Date | Runner Version | EKS | Test Result |
| ------ | ------ | ------ | ------ |
| 2022-04-07 | 14.9.0 | 1.19 | No errors pulling images |

### Configurations used

<details><summary>gitlab-ci.yml</summary>

```yaml
stages:
  - build
  - test

build-image:
  stage: build
  variables:
    DOCKER_DRIVER: overlay2
    DOCKER_HOST: tcp://dockerservice:2375/
    DOCKER_TLS_CERTDIR: ""
  image: docker:20.10.14
  services:
    - name: docker:20.10.14-dind
      alias: dockerservice
  before_script:
    - sleep 30
    - docker info
  script:
    - echo ${CI_BUILD_TOKEN} | docker login -u gitlab-ci-token --password-stdin ${CI_REGISTRY}
    - |
      docker build --pull --no-cache -t ${CI_REGISTRY_IMAGE}:1.0.0 - << EOF
      FROM alpine:3.13.1
      RUN apk update
      RUN apk upgrade
      RUN apk add --no-cache git ca-certificates curl python3 jq apache2-utils nmap-ncat netcat-openbsd wget
      ENTRYPOINT []
      EOF
    - docker push ${CI_REGISTRY_IMAGE}:1.0.0

test:
  stage: test
  image: ${CI_REGISTRY_IMAGE}:1.0.0
  variables:
    GIT_STRATEGY: none
  script:
    - sleep 10
  parallel: 50
```

</details>

<details><summary>config.toml</summary>

```toml
concurrent = 90
check_interval = 1
log_level = "debug"

[session_server]
  session_timeout = 1800

[[runners]]
  url = "https://gitlab.com/"
  token = "__REDACTED__"
  executor = "kubernetes"
  [runners.custom_build_dir]
  [runners.kubernetes]
    # service_account = "image-pull-secret-sa"
    pull_policy = "always"
    # image_pull_secrets = ["regcred"]
    # image = "aaro301/alpine:latest"
    namespace_overwrite_allowed = ""
    privileged = true
    allow_privilege_escalation = true
    service_account_overwrite_allowed = ""
    pod_annotations_overwrite_allowed = ""
    terminationGracePeriodSeconds = 30
    [runners.kubernetes.affinity]
    [runners.kubernetes.volumes]
    [runners.kubernetes.dns_config]
```

</details>
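The commented-out `service_account` and `image_pull_secrets` options in the config above refer to a pre-provisioned registry credential instead of the per-job secret the runner normally generates. For reference, a minimal sketch of what that setup looks like (the names `image-pull-secret-sa` and `regcred` come from the config, the `runners` namespace from the logs below; the secret data here is an empty placeholder, not real credentials):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: regcred
  namespace: runners
type: kubernetes.io/dockerconfigjson
data:
  # Placeholder: base64-encoded .docker/config.json with credentials
  # for the private registry (e.g. registry.example.net)
  .dockerconfigjson: e30K
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: image-pull-secret-sa
  namespace: runners
imagePullSecrets:
  - name: regcred
```

Pods created under this service account inherit `regcred` automatically, which sidesteps the per-job secret creation involved in the race discussed below.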
## Problem

We are currently experiencing a problem on our Kubernetes cluster: pods that need to pull an image through an ImagePullSecret *sometimes* fail with the error `image pull failed: Back-off pulling image`.

This problem appeared after we upgraded our Kubernetes cluster from 1.17 to 1.18.

# How to reproduce

1. Configure a kube-executor runner and set **pull_policy = "always"** to maximize the chances of reproducing the error.
2. Create a project with this gitlab-ci.yml:

```yaml
stages:
  - build
  - test

build-image:
  stage: build
  variables:
    DOCKER_HOST: tcp://127.0.0.1:2375
    DOCKER_TLS_CERTDIR: ""
  image: docker:19.03.15
  services:
    - docker:19.03.15-dind
  script:
    - echo ${CI_BUILD_TOKEN} | docker login -u gitlab-ci-token --password-stdin ${CI_REGISTRY}
    - |
      docker build --pull --no-cache -t ${CI_REGISTRY_IMAGE}:1.0.0 - << EOF
      FROM alpine:3.13.1
      RUN apk update --quiet && \
          apk upgrade --quiet && \
          apk add --quiet --no-cache git ca-certificates curl python3 jq apache2-utils nmap-ncat netcat-openbsd wget
      ENTRYPOINT []
      EOF
    - docker push ${CI_REGISTRY_IMAGE}:1.0.0

test:
  stage: test
  image: ${CI_REGISTRY_IMAGE}:1.0.0
  variables:
    GIT_STRATEGY: none
  script:
    - sleep 10
  parallel: 30
```

3. Schedule this project to run every 10 minutes until some jobs fail with `Image pull failed: Back-off pulling image`.

# Tests & Results

We tried multiple configurations on staging environments:

| Versions | EKS 1.17.12 | EKS 1.18.9 | EKS 1.19.6 | EKS 1.21.2 | EKS 1.18.9 (Master) + EKS 1.17.12 (Nodes) |
|--------------|:-----------:|:-------------:|:-------------:|:-------------:|:-----------------------------------:|
| Runner 13.3 | OK | Random errors | Random errors | Untested | Untested |
| Runner 13.8 | OK | Random errors | Random errors | Untested | OK |
| Runner 13.9 | OK | Random errors | Random errors | Untested | OK |
| Runner 13.12 | OK | Random errors | Random errors | Random errors | OK |

# Debug & Trace logs

## Runner 13.8.0 / Gitlab 13.8.4

```
ERROR: Job failed (system failure): prepare environment: image pull failed: Back-off pulling image "registry.example.net/tests/issue-image:1.0.0". Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information  duration=2m54.750286918s job=22436 project=550 runner=aDykLRhX
WARNING: Failed to process runner  builds=18 error=prepare environment: image pull failed: Back-off pulling image "registry.example.net/tests/issue-image:1.0.0". Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information  executor=kubernetes runner=aDykLRhX
```
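For context when reading the kubelet and apiserver logs below: for each job, the runner creates a short-lived pull secret and references it from the build pod. A simplified sketch reconstructed from the log lines (the object names come from the logs; the secret type and exact fields are assumptions):

```yaml
# Per-job pull secret, created by the runner just before the pod
# (type and contents are assumptions; the name is taken from the logs)
apiVersion: v1
kind: Secret
metadata:
  name: runner-adyklrhx-project-550-concurrent-216pxpk
  namespace: runners
type: kubernetes.io/dockercfg
---
# Build pod referencing that secret (heavily abbreviated)
apiVersion: v1
kind: Pod
metadata:
  name: runner-adyklrhx-project-550-concurrent-21q7gp7
  namespace: runners
spec:
  imagePullSecrets:
    - name: runner-adyklrhx-project-550-concurrent-216pxpk
  containers:
    - name: build
      image: registry.example.net/tests/issue-image:1.0.0
```

The node is only authorized to read the secret once the node-authorizer graph has linked the pod (and therefore its secret) to the node; if the kubelet attempts its first pull before that link exists, that would explain the `NODE DENY` / `denied: access forbidden` sequence in the logs below.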
## Kubelet 1.18.9

```
kubelet: E0305 10:59:52.282050 3958 reflector.go:178] object-"runners"/"runner-adyklrhx-project-550-concurrent-216pxpk": Failed to list *v1.Secret: secrets "runner-adyklrhx-project-550-concurrent-216pxpk" is forbidden: User "system:node:issue.eu-west-1.compute.internal" cannot list resource "secrets" in API group "" in the namespace "runners": no relationship found between node "issue.eu-west-1.compute.internal" and this object
kubelet: I0305 10:59:52.347492 4012 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-lxr6w" (UniqueName: "kubernetes.io/secret/14b74129-b719-4183-bffc-530b5d9d5bb3-default-token-lxr6w") pod "runner-adyklrhx-project-550-concurrent-21q7gp7" (UID: "14b74129-b719-4183-bffc-530b5d9d5bb3")
kubelet: W0305 10:59:53.509762 4012 kubelet_pods.go:863] Unable to retrieve pull secret runners/runner-adyklrhx-project-550-concurrent-216pxpk for runners/runner-adyklrhx-project-550-concurrent-21q7gp7 due to failed to sync secret cache: timed out waiting for the condition. The image pull may not succeed.
kubelet: E0305 10:59:55.315834 4012 pod_workers.go:191] Error syncing pod 14b74129-b719-4183-bffc-530b5d9d5bb3 ("runner-adyklrhx-project-550-concurrent-21q7gp7_runners(14b74129-b719-4183-bffc-530b5d9d5bb3)"), skipping: failed to "StartContainer" for "build" with ErrImagePull: "rpc error: code = Unknown desc = Error response from daemon: Get https://registry.example.net/v2/tests/issue-image/manifests/1.0.0: denied: access forbidden"
kubelet: E0305 10:59:56.125427 4012 pod_workers.go:191] Error syncing pod 14b74129-b719-4183-bffc-530b5d9d5bb3 ("runner-adyklrhx-project-550-concurrent-21q7gp7_runners(14b74129-b719-4183-bffc-530b5d9d5bb3)"), skipping: failed to "StartContainer" for "build" with ImagePullBackOff: "Back-off pulling image \"registry.example.net/tests/issue-image:1.0.0\""
kubelet: E0305 10:59:57.146155 4012 pod_workers.go:191] Error syncing pod 14b74129-b719-4183-bffc-530b5d9d5bb3 ("runner-adyklrhx-project-550-concurrent-21q7gp7_runners(14b74129-b719-4183-bffc-530b5d9d5bb3)"), skipping: failed to "StartContainer" for "build" with ImagePullBackOff: "Back-off pulling image \"registry.example.net/tests/issue-image:1.0.0\""
```

**NB**: on 1.17.12 we also see the first message ("no relationship found between...") but **not** the message "due to failed to sync secret cache: timed out waiting for the condition. The image pull may not succeed."

## EKS "Master" 1.18.9

```
09:59:52.181921 - scheduler.go:731 - pod runners/runner-adyklrhx-project-550-concurrent-21q7gp7 is bound successfully on node "issue.eu-west-1.compute.internal", 12 nodes evaluated, 2 nodes were found feasible.
09:59:52.163165 - audit.k8s.io/v1 - create binding (@/api/v1/namespaces/runners/pods/runner-adyklrhx-project-550-concurrent-21q7gp7/binding & issue.eu-west-1.compute.internal) - (HTTP 201)
09:59:52.201538 - node_authorizer.go:197 - NODE DENY: node "issue.eu-west-1.compute.internal" cannot get unknown secret runners/runner-adyklrhx-project-550-concurrent-216pxpk
09:59:52.201492 - audit.k8s.io/v1 - no relationship found between node "xxx" and this object (the ImagePullSecret) - (HTTP 403)
09:59:52.208891 - audit.k8s.io/v1 - get pods - (HTTP 200)
09:59:52.212011 - audit.k8s.io/v1 - patch pod status (running) - (HTTP 200)
09:59:53.522568 - audit.k8s.io/v1 - list the ImagePullSecret - (HTTP 200)
09:59:56.133547 - audit.k8s.io/v1 - get the ImagePullSecret - (HTTP 200)
09:59:56.139091 - audit.k8s.io/v1 - patch pod status: ErrImagePull - (HTTP 200)
```

**NB**: on 1.17.12 we see no difference except that the last audit message is `patch pod status: RUNNING` instead of `ErrImagePull`.

# Assumption

- We have ruled out a race condition between the creation of the pod and the creation of the secret, because we observed pods that still hit ImagePullBackOff after waiting 30s to be scheduled on our cluster.
- Since we can't reproduce the problem with kubelet 1.17 nodes under a 1.18 master, we assume the problem is linked to kubelet >= 1.18 combined with the GitLab Runner design.
- Could it be a race condition between the scheduler and the node_authorizer/graph_populator? https://github.com/kubernetes/kubernetes/pull/46941

# Other related issues

- Seems to be the same problem as https://gitlab.com/gitlab-org/gitlab-runner/-/issues/4846
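Until the root cause is fixed, one way to soften the impact: the error surfaces as a runner system failure (`Job failed (system failure): prepare environment: ...`), so affected jobs can be retried automatically. A sketch using the standard GitLab CI `retry` keyword, applied to the `test` job from the reproduction pipeline above:

```yaml
test:
  stage: test
  image: ${CI_REGISTRY_IMAGE}:1.0.0
  variables:
    GIT_STRATEGY: none
  script:
    - sleep 10
  parallel: 30
  # Retry up to twice, but only on runner system failures such as
  # "image pull failed: Back-off pulling image"
  retry:
    max: 2
    when: runner_system_failure
```

This does not address the race itself; it only keeps intermittently affected pipelines green at the cost of occasional extra pod startups.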