
Conversation


@fabriziopandini fabriziopandini commented Oct 6, 2018

What this PR does / why we need it:
kubeadm now automatically creates a new stacked etcd member when joining a new control plane node (does not apply to external etcd)

Which issue(s) this PR fixes:
Fixes kubernetes/kubeadm#1123

Special notes for your reviewer:
IMO two points deserve more attention:

  • The fact that the local/stacked etcd client now uses the IP defined by the API server advertise address
  • How the joining node discovers the list of endpoints for the existing etcd members

I will keep this in WIP until there is agreement on the above points

Release note:

kubeadm now automatically creates a new stacked etcd member when joining a new control plane node (does not apply to external etcd)

/sig cluster-lifecycle
/kind feature

/cc @timothysc
/cc @chuckha
/cc @detiber
@kubernetes/sig-cluster-lifecycle-pr-reviews

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Oct 6, 2018
@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubeadm labels Oct 6, 2018

@neolit123 neolit123 left a comment


thanks for the PR @fabriziopandini . added a couple of comments.

Member


probably better for `%v` to be `%q`?

Member Author


done

Member


hm, did we investigate if this approach is safe?
what happens if the port is already taken on that endpoint?

Contributor


That's the default peer port and "shouldn't collide" but we should definitely add a check here.

Member Author


added check

Member


Member -> member:

Member Author


done

Member


possibly `creating` should be uppercase for consistency?

Contributor


There was a recent issue about golang 1.11 as well needing to use Infof vs. Infoln.

Member Author


using `Infoln`

Member


should we make such ports into constants?

Contributor


Yes we should add them to the default consts file.

Member Author


done in a separate PR to keep the scope of this PR as small as possible
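The port constants moved to the defaults file would look roughly like this (the constant names here are assumptions; the actual identifiers in the separate PR may differ):

```go
package main

import "fmt"

// Hypothetical constants mirroring what kubeadm keeps in its defaults file.
const (
	// EtcdListenClientPort is the port etcd serves client requests on.
	EtcdListenClientPort = 2379
	// EtcdListenPeerPort is the port etcd uses for peer-to-peer communication.
	EtcdListenPeerPort = 2380
)

func main() {
	fmt.Printf("client: %d, peer: %d\n", EtcdListenClientPort, EtcdListenPeerPort)
	// → client: 2379, peer: 2380
}
```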

Member


`for of` -> `for` or `of` only.

Member Author


done

Member


i wonder if the connection times increase with the number of etcd pods.

Contributor


I would up the timeout for new connections. The default client timeout inside the api-server is 20 seconds.

Member Author


increased timeout to 20 seconds


@chuckha chuckha left a comment


Is the expectation that we've already copied over the etcd ca key/cert and generated the correct certificates?

Contributor


"(when there the etcd cluster is empty)" => "(when the etcd cluster is empty)"

Member Author


done

Contributor


I almost want this map[string]string to be a struct, something like etcdClusterConfiguration or similar. Do you think that would add anything here?

Contributor


I don't feel strongly here b/c we're re-wrapping an etcd member struct. I would change it to [string]IP at a minimum though.

Contributor


Also testability of this function will require using some of the etcd utils I built eons ago under apiserver/storage

Member Author


done using a struct
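A struct along these lines replaces the `map[string]string`; the shape and field names below are a hypothetical sketch of the idea, not the PR's exact type:

```go
package main

import "fmt"

// Member wraps the bits of an etcd member the join flow cares about.
// (Hypothetical shape; the struct in the PR may carry more fields.)
type Member struct {
	Name    string // etcd member name, usually the node name
	PeerURL string // peer URL, e.g. https://10.0.0.2:2380
}

// initialCluster renders the --initial-cluster flag value from a member list.
func initialCluster(members []Member) string {
	s := ""
	for i, m := range members {
		if i > 0 {
			s += ","
		}
		s += m.Name + "=" + m.PeerURL
	}
	return s
}

func main() {
	members := []Member{
		{Name: "cp-0", PeerURL: "https://10.0.0.2:2380"},
		{Name: "cp-1", PeerURL: "https://10.0.0.3:2380"},
	}
	fmt.Println(initialCluster(members))
	// → cp-0=https://10.0.0.2:2380,cp-1=https://10.0.0.3:2380
}
```

Compared with a bare `map[string]string`, the struct gives each field a documented meaning and keeps member order stable when rendering flags.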

Contributor


I think this is a good approach with one minor concern: there is a race condition that the PodIP may change after the PodIP lookup and before our client access, but I'm ok living with that possibility.

I wonder if there is a clean way to solve this by managing etcd as a statefulset. The other thought I had was adding a Service object to the static manifest and managing that by hand (by kubeadm). The goal of these two ideas is to provide a stable DNS name instead of a PodIP lookup. Either way, those types of changes would be way out of scope for this PR.

Contributor


So the ClusterStatus should have the endpoints.
Or, as discussed on the call, put the metadata in an annotation on the pod to make it easier to extract.

Contributor


+1 to using ClusterStatus or a pre-set annotation to use as a starting point.

After we have the initial endpoint list, we should use that to query the etcd cluster itself for the list of endpoints.

Contributor


Potentially, we should also raise an error if the queried endpoints do not match the expected endpoints, or if the cluster is not in a healthy state to start.

Member Author


done: reading from ClusterStatus, then calling sync to get the real list of endpoints from etcd
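The two-step discovery agreed on above can be sketched like this: seed the endpoint list from ClusterStatus, then trust whatever the etcd cluster itself reports. The `queryCluster` callback is a stand-in for a real clientv3 `Sync`/`MemberList` call:

```go
package main

import (
	"fmt"
	"sort"
)

// discoverEndpoints seeds discovery from the endpoints recorded in the
// kubeadm-config ClusterStatus, then replaces them with the list the etcd
// cluster itself reports (the "sync" step).
func discoverEndpoints(fromClusterStatus []string,
	queryCluster func(seed []string) []string) []string {
	endpoints := queryCluster(fromClusterStatus)
	sort.Strings(endpoints) // stable order for logging and comparison
	return endpoints
}

func main() {
	seed := []string{"https://10.0.0.2:2379"}
	// Stub: the cluster reports one more member than ClusterStatus knew about.
	query := func(seed []string) []string {
		return append(seed, "https://10.0.0.3:2379")
	}
	fmt.Println(discoverEndpoints(seed, query))
}
```

Sorting both the expected and the queried lists also makes the mismatch check suggested above a simple slice comparison.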


@timothysc timothysc left a comment


Bunch of comments, plus we should highlight that folks will need to do this with odd numbers of members for stacked deploys.

Contributor


Why are you removing loopback from the SAN?

Contributor


I'd have to dig through history but I remember us adding it on purpose.

Member Author


sounds strange, but restored

Contributor


We should do a local client member check, to verify it's correct, and to list the current members in the log line.

@timothysc

/assign @timothysc @detiber

Contributor


Is this method used during upgrade? If so, it seems odd to call 'AddMember()' for an instance that is already a member of the etcd cluster.

Member Author


No, this is used only when adding a new control plane instance
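The join-only path can be sketched as follows; `memberAdd` here is a stand-in for a real clientv3 `Cluster.MemberAdd` call against the existing endpoints, and the function names are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
)

// addMember registers a joining control plane node with the existing etcd
// cluster before its local etcd static pod starts.
func addMember(name, peerURL string,
	memberAdd func(peerURLs []string) error) error {
	if name == "" || peerURL == "" {
		return errors.New("member name and peer URL are required")
	}
	// Only called on join, never on upgrade: an upgraded node is
	// already a member of the cluster.
	return memberAdd([]string{peerURL})
}

func main() {
	fake := func(peerURLs []string) error {
		fmt.Println("MemberAdd:", peerURLs)
		return nil
	}
	if err := addMember("cp-1", "https://10.0.0.3:2380", fake); err != nil {
		fmt.Println("join failed:", err)
	}
}
```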


@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 16, 2018
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 21, 2018
@fabriziopandini

@neolit123 @chuckha @timothysc @detiber Thanks for the valuable feedback!

Now:

  1. The list of etcd members is initially discovered from the ClusterStatus, and then etcd is queried for the real list of members.
  2. Upgrades now also work with stacked etcd.

The last open point to address before removing WIP is the usage of the API server advertise address, which can lead to problems if the user chooses an advertise address that doesn't correspond to any IP address on the machine.

@timothysc

@fabriziopandini Is this ready to go? If so can you remove the WIPs from this and other PRs.

@fabriziopandini

@timothysc the last point pending is the discussion about the usage of the API server advertise address for etcd

@timothysc

@fabriziopandini ^ want to update now?

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 24, 2018
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini, timothysc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 26, 2018
@fabriziopandini fabriziopandini changed the title [WIP] - kubeadm stacked etcd kubeadm stacked etcd Oct 26, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 26, 2018
@timothysc timothysc added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 26, 2018
@fabriziopandini

@timothysc this is ready to go for me now.
If we can get some help testing this before the end of the cycle, that would be great...

@fabriziopandini

/test pull-kubernetes-e2e-kops-aws

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@fabriziopandini

/lgtm cancel

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 26, 2018
@neolit123

/lgtm

