Infinite loop on canary deployment using Gateway API and EKS #1732

Open
mketh-nhs opened this issue Nov 27, 2024 · 3 comments

@mketh-nhs commented Nov 27, 2024

Describe the bug

I'm attempting to use Flagger on AWS EKS with the Gateway API, using the AWS Gateway API Controller. I have followed the tutorial at https://docs.flagger.app/tutorials/gatewayapi-progressive-delivery, but when a canary deployment is triggered, Flagger gets stuck in a loop: it starts the canary analysis, changes the HTTPRoute weightings (in this case 10% to the canary, 90% to the primary), and then restarts the analysis from the beginning. It never fails the canary after reaching the progress deadline timeout. It doesn't even appear to reach the rollout stage, as the logs never show the load-test webhook running; the pre-rollout check does run and succeed, but then runs again on the next pass round the loop. As an experiment I also disabled all metric checks, as I don't think it even gets as far as running them. Looking at the traffic weightings in AWS VPC Lattice, I can see them alternating between the 90%/10% split and briefly going back to 100%/0% before the loop repeats.
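
For what it's worth, the loop can also be observed from the Kubernetes side with standard kubectl commands (namespace and resource names match the canary spec below):

kubectl -n test get canary podinfo -w          # watch the canary phase and weight as Flagger loops
kubectl -n test describe canary podinfo        # recent Flagger events recorded on the canary
kubectl -n test get httproute podinfo -o yaml  # backendRef weights Flagger writes to the HTTPRoute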

I have also tried setting skipAnalysis to true, which successfully promoted the canary, so the problem seems to be something to do with the analysis stage itself.
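
For clarity, the skipAnalysis experiment only added this top-level field to the Canary spec (placement as I understand it from the Flagger docs):

spec:
  skipAnalysis: true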

My canary configuration is as follows:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  progressDeadlineSeconds: 60
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    port: 9898
    targetPort: 9898
    hosts:
      - flaggertest.k8s.testdomain.uk
    gatewayRefs:
      - name: testgateway
        namespace: test
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    webhooks:
      - name: smoke-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 15s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://podinfo-canary.test:9898/token | grep token"
      - name: load-test-get
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          type: cmd
          cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.test:9898/"

Flagger logs (in debug mode):

{"level":"info","ts":"2024-11-27T10:28:51.097Z","caller":"flagger/main.go:149","msg":"Starting flagger version 1.38.0 revision b6ac5e19aa7fa2949bbc8bf37a0f6c1e31b1745d mesh provider gatewayapi:v1beta1"}
{"level":"info","ts":"2024-11-27T10:28:51.097Z","caller":"clientcmd/client_config.go:659","msg":"Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work."}
{"level":"info","ts":"2024-11-27T10:28:51.097Z","caller":"clientcmd/client_config.go:659","msg":"Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work."}
{"level":"info","ts":"2024-11-27T10:28:51.130Z","caller":"flagger/main.go:441","msg":"Connected to Kubernetes API v1.30.6-eks-7f9249a"}
{"level":"info","ts":"2024-11-27T10:28:51.130Z","caller":"flagger/main.go:294","msg":"Waiting for canary informer cache to sync"}
{"level":"info","ts":"2024-11-27T10:28:51.130Z","caller":"cache/shared_informer.go:313","msg":"Waiting for caches to sync for flagger"}
{"level":"info","ts":"2024-11-27T10:28:51.230Z","caller":"cache/shared_informer.go:320","msg":"Caches are synced for flagger"}
{"level":"info","ts":"2024-11-27T10:28:51.230Z","caller":"flagger/main.go:301","msg":"Waiting for metric template informer cache to sync"}
{"level":"info","ts":"2024-11-27T10:28:51.230Z","caller":"cache/shared_informer.go:313","msg":"Waiting for caches to sync for flagger"}
{"level":"info","ts":"2024-11-27T10:28:51.331Z","caller":"cache/shared_informer.go:320","msg":"Caches are synced for flagger"}
{"level":"info","ts":"2024-11-27T10:28:51.331Z","caller":"flagger/main.go:308","msg":"Waiting for alert provider informer cache to sync"}
{"level":"info","ts":"2024-11-27T10:28:51.331Z","caller":"cache/shared_informer.go:313","msg":"Waiting for caches to sync for flagger"}
{"level":"info","ts":"2024-11-27T10:28:51.432Z","caller":"cache/shared_informer.go:320","msg":"Caches are synced for flagger"}
{"level":"info","ts":"2024-11-27T10:28:51.447Z","caller":"flagger/main.go:206","msg":"Connected to metrics server http://prometheus-server.flagger-system.svc.cluster.local:80"}
{"level":"debug","ts":"2024-11-27T10:28:51.447Z","caller":"controller/controller.go:99","msg":"Creating event broadcaster"}
{"level":"info","ts":"2024-11-27T10:28:51.447Z","caller":"server/server.go:45","msg":"Starting HTTP server on port 8080"}
{"level":"info","ts":"2024-11-27T10:28:51.448Z","caller":"controller/controller.go:191","msg":"Starting operator"}
{"level":"info","ts":"2024-11-27T10:28:51.448Z","caller":"controller/controller.go:200","msg":"Started operator workers"}
{"level":"info","ts":"2024-11-27T10:28:51.454Z","caller":"controller/controller.go:312","msg":"Synced test/podinfo"}
{"level":"info","ts":"2024-11-27T10:29:01.522Z","caller":"router/gateway_api_v1beta1.go:218","msg":"HTTPRoute podinfo.test updated","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T10:29:01.553Z","caller":"controller/events.go:33","msg":"Starting canary analysis for podinfo.test","canary":"podinfo.test"}
{"level":"debug","ts":"2024-11-27T10:29:01.553Z","logger":"event-broadcaster","caller":"record/event.go:377","msg":"Event(v1.ObjectReference{Kind:\"Canary\", Namespace:\"test\", Name:\"podinfo\", UID:\"e138fd88-a30e-4e98-ba2d-86b9214f3e5f\", APIVersion:\"flagger.app/v1beta1\", ResourceVersion:\"2456463\", FieldPath:\"\"}): type: 'Normal' reason: 'Synced' Starting canary analysis for podinfo.test"}
{"level":"info","ts":"2024-11-27T10:29:01.567Z","caller":"controller/events.go:33","msg":"Pre-rollout check smoke-test passed","canary":"podinfo.test"}
{"level":"debug","ts":"2024-11-27T10:29:01.567Z","logger":"event-broadcaster","caller":"record/event.go:377","msg":"Event(v1.ObjectReference{Kind:\"Canary\", Namespace:\"test\", Name:\"podinfo\", UID:\"e138fd88-a30e-4e98-ba2d-86b9214f3e5f\", APIVersion:\"flagger.app/v1beta1\", ResourceVersion:\"2456463\", FieldPath:\"\"}): type: 'Normal' reason: 'Synced' Pre-rollout check smoke-test passed"}
{"level":"info","ts":"2024-11-27T10:29:01.595Z","caller":"controller/events.go:33","msg":"Advance podinfo.test canary weight 10","canary":"podinfo.test"}
{"level":"debug","ts":"2024-11-27T10:29:01.595Z","logger":"event-broadcaster","caller":"record/event.go:377","msg":"Event(v1.ObjectReference{Kind:\"Canary\", Namespace:\"test\", Name:\"podinfo\", UID:\"e138fd88-a30e-4e98-ba2d-86b9214f3e5f\", APIVersion:\"flagger.app/v1beta1\", ResourceVersion:\"2456463\", FieldPath:\"\"}): type: 'Normal' reason: 'Synced' Advance podinfo.test canary weight 10"}
{"level":"info","ts":"2024-11-27T10:29:31.525Z","caller":"router/gateway_api_v1beta1.go:218","msg":"HTTPRoute podinfo.test updated","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T10:29:31.562Z","caller":"controller/events.go:33","msg":"Starting canary analysis for podinfo.test","canary":"podinfo.test"}
{"level":"debug","ts":"2024-11-27T10:29:31.563Z","logger":"event-broadcaster","caller":"record/event.go:377","msg":"Event(v1.ObjectReference{Kind:\"Canary\", Namespace:\"test\", Name:\"podinfo\", UID:\"e138fd88-a30e-4e98-ba2d-86b9214f3e5f\", APIVersion:\"flagger.app/v1beta1\", ResourceVersion:\"2456677\", FieldPath:\"\"}): type: 'Normal' reason: 'Synced' Starting canary analysis for podinfo.test"}
{"level":"info","ts":"2024-11-27T10:29:31.601Z","caller":"controller/events.go:33","msg":"Pre-rollout check smoke-test passed","canary":"podinfo.test"}
{"level":"debug","ts":"2024-11-27T10:29:31.603Z","logger":"event-broadcaster","caller":"record/event.go:377","msg":"Event(v1.ObjectReference{Kind:\"Canary\", Namespace:\"test\", Name:\"podinfo\", UID:\"e138fd88-a30e-4e98-ba2d-86b9214f3e5f\", APIVersion:\"flagger.app/v1beta1\", ResourceVersion:\"2456677\", FieldPath:\"\"}): type: 'Normal' reason: 'Synced' Pre-rollout check smoke-test passed"}
{"level":"info","ts":"2024-11-27T10:29:31.629Z","caller":"controller/events.go:33","msg":"Advance podinfo.test canary weight 10","canary":"podinfo.test"}
{"level":"debug","ts":"2024-11-27T10:29:31.629Z","logger":"event-broadcaster","caller":"record/event.go:377","msg":"Event(v1.ObjectReference{Kind:\"Canary\", Namespace:\"test\", Name:\"podinfo\", UID:\"e138fd88-a30e-4e98-ba2d-86b9214f3e5f\", APIVersion:\"flagger.app/v1beta1\", ResourceVersion:\"2456677\", FieldPath:\"\"}): type: 'Normal' reason: 'Synced' Advance podinfo.test canary weight 10"}

Any ideas on what might be going wrong?

To Reproduce

Expected behavior

Canary rollout progresses and succeeds

Additional context

  • Flagger version: 1.38.0
  • Kubernetes version: 1.30.6-eks-7f9249a
  • Service Mesh provider: VPC Lattice (via AWS Gateway API controller)
  • Ingress provider: VPC Lattice (via AWS Gateway API controller)
@stefanprodan
Member

Can you please try Flagger 1.39? We fixed a drift detection problem for Gateway API.
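
(A minimal upgrade sketch, assuming Flagger was installed with its official Helm chart; the release name, namespace and chart values below are inferred from the logs above and may differ for other installs:)

helm repo add flagger https://flagger.app
helm repo update
helm upgrade -i flagger flagger/flagger \
  --namespace flagger-system \
  --set meshProvider=gatewayapi:v1beta1 \
  --set metricsServer=http://prometheus-server.flagger-system.svc.cluster.local:80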

@mketh-nhs
Author

Thanks @stefanprodan. I have now upgraded but unfortunately still seem to have the same issue:

{"level":"info","ts":"2024-11-27T12:26:46.205Z","caller":"flagger/main.go:149","msg":"Starting flagger version 1.39.0 revision 4d497b2a9d2a0726071dd0b16a92f2e63a9130e2 mesh provider gatewayapi:v1beta1"}
{"level":"info","ts":"2024-11-27T12:26:46.205Z","caller":"clientcmd/client_config.go:659","msg":"Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work."}
{"level":"info","ts":"2024-11-27T12:26:46.206Z","caller":"clientcmd/client_config.go:659","msg":"Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work."}
{"level":"info","ts":"2024-11-27T12:26:46.225Z","caller":"flagger/main.go:441","msg":"Connected to Kubernetes API v1.30.6-eks-7f9249a"}
{"level":"info","ts":"2024-11-27T12:26:46.225Z","caller":"flagger/main.go:294","msg":"Waiting for canary informer cache to sync"}
{"level":"info","ts":"2024-11-27T12:26:46.225Z","caller":"cache/shared_informer.go:313","msg":"Waiting for caches to sync for flagger"}
{"level":"info","ts":"2024-11-27T12:26:46.326Z","caller":"cache/shared_informer.go:320","msg":"Caches are synced for flagger"}
{"level":"info","ts":"2024-11-27T12:26:46.326Z","caller":"flagger/main.go:301","msg":"Waiting for metric template informer cache to sync"}
{"level":"info","ts":"2024-11-27T12:26:46.327Z","caller":"cache/shared_informer.go:313","msg":"Waiting for caches to sync for flagger"}
{"level":"info","ts":"2024-11-27T12:26:46.427Z","caller":"cache/shared_informer.go:320","msg":"Caches are synced for flagger"}
{"level":"info","ts":"2024-11-27T12:26:46.427Z","caller":"flagger/main.go:308","msg":"Waiting for alert provider informer cache to sync"}
{"level":"info","ts":"2024-11-27T12:26:46.427Z","caller":"cache/shared_informer.go:313","msg":"Waiting for caches to sync for flagger"}
{"level":"info","ts":"2024-11-27T12:26:46.527Z","caller":"cache/shared_informer.go:320","msg":"Caches are synced for flagger"}
{"level":"info","ts":"2024-11-27T12:26:46.537Z","caller":"flagger/main.go:386","msg":"Notifications enabled for https://hooks.slack.com/servic"}
{"level":"info","ts":"2024-11-27T12:26:46.537Z","caller":"server/server.go:45","msg":"Starting HTTP server on port 8080"}
{"level":"info","ts":"2024-11-27T12:26:46.538Z","caller":"controller/controller.go:191","msg":"Starting operator"}
{"level":"info","ts":"2024-11-27T12:26:46.538Z","caller":"controller/controller.go:200","msg":"Started operator workers"}
{"level":"info","ts":"2024-11-27T12:30:22.809Z","caller":"controller/controller.go:312","msg":"Synced test/podinfo"}
{"level":"info","ts":"2024-11-27T12:30:26.566Z","caller":"router/kubernetes_default.go:175","msg":"Service podinfo-canary.test created","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:30:26.585Z","caller":"router/kubernetes_default.go:175","msg":"Service podinfo-primary.test created","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:30:26.585Z","caller":"controller/events.go:33","msg":"all the metrics providers are available!","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:30:26.608Z","caller":"canary/deployment_controller.go:323","msg":"Deployment podinfo-primary.test created","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:30:26.612Z","caller":"controller/events.go:45","msg":"podinfo-primary.test not ready: waiting for rollout to finish: observed deployment generation less than desired generation","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:30:56.554Z","caller":"controller/events.go:33","msg":"all the metrics providers are available!","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:30:56.581Z","caller":"canary/hpa_reconciler.go:104","msg":"HorizontalPodAutoscaler v2 podinfo-primary.test created","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:30:56.598Z","caller":"router/kubernetes_default.go:175","msg":"Service podinfo.test created","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:30:56.598Z","caller":"controller/scheduler.go:257","msg":"Scaling down Deployment podinfo.test","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:30:56.626Z","caller":"router/gateway_api_v1beta1.go:164","msg":"HTTPRoute podinfo.test created","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:30:56.662Z","caller":"controller/events.go:33","msg":"Initialization done! podinfo.test","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:32:26.585Z","caller":"router/gateway_api_v1beta1.go:220","msg":"HTTPRoute podinfo.test updated","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:33:26.586Z","caller":"router/gateway_api_v1beta1.go:220","msg":"HTTPRoute podinfo.test updated","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:35:26.592Z","caller":"controller/events.go:33","msg":"New revision detected! Scaling up podinfo.test","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:35:56.587Z","caller":"router/gateway_api_v1beta1.go:220","msg":"HTTPRoute podinfo.test updated","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:35:56.607Z","caller":"controller/events.go:33","msg":"Starting canary analysis for podinfo.test","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:35:56.623Z","caller":"controller/events.go:33","msg":"Pre-rollout check smoke-test passed","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:35:56.647Z","caller":"controller/events.go:33","msg":"Advance podinfo.test canary weight 10","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:36:26.588Z","caller":"router/gateway_api_v1beta1.go:220","msg":"HTTPRoute podinfo.test updated","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:36:26.611Z","caller":"controller/events.go:33","msg":"Starting canary analysis for podinfo.test","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:36:26.630Z","caller":"controller/events.go:33","msg":"Pre-rollout check smoke-test passed","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:36:26.651Z","caller":"controller/events.go:33","msg":"Advance podinfo.test canary weight 10","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:36:56.598Z","caller":"router/gateway_api_v1beta1.go:220","msg":"HTTPRoute podinfo.test updated","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:36:56.619Z","caller":"controller/events.go:33","msg":"Starting canary analysis for podinfo.test","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:36:56.638Z","caller":"controller/events.go:33","msg":"Pre-rollout check smoke-test passed","canary":"podinfo.test"}
{"level":"info","ts":"2024-11-27T12:36:56.662Z","caller":"controller/events.go:33","msg":"Advance podinfo.test canary weight 10","canary":"podinfo.test"}

@mketh-nhs
Author

Any updates on this please, @stefanprodan? In terms of behaviour it does appear to be similar to the bug that was fixed in 1.39.
