Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GKE to use nodePool instead of nodeConfig #323

Open
wants to merge 1 commit into
base: cloud-foundation
Choose a base branch
from

Conversation

sourced-nan
Copy link

use of nodePools to provide flexibility in node level autoscaling and management, as well multiple node pool in one cluster.

@ocsig ocsig added the cloud-foundations Cloud Foundation Toolkit development label Nov 3, 2018
masterAuth:
username: <FIXME:user_name>
password: <FIXME:password>
username: <FIXME:username>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we really be using basic auth here? It's considered a security weakness: https://cloud.google.com/kubernetes-engine/docs/concepts/security-overview

Also, trailing backspace might cause confusion in YAML

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was provided as an example and in no means best practice. Also, basic auth is still the default behavior when creating via Gui.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, the GUI cautions against using basic auth with the following message "Basic authentication is not recommended because it provides no confidentiality protection for transmitted credentials".

Starting with version 1.12, clusters will have basic authentication and client certificate issuance disabled by default, even via GUI.

username: <FIXME:user_name>
password: <FIXME:password>
username: <FIXME:username>
password: <FIXME:password>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we really be using basic auth here? It's considered a security weakness: https://cloud.google.com/kubernetes-engine/docs/concepts/security-overview

Also, trailing backspace might cause confusion in YAML

maxNodeCount: 2
management:
autoUpgrade: True
autoRepair: True

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's best practice to add a line at EOF

@@ -34,4 +35,4 @@ resources:
clusterIpv4Cidr: ${CLUSTERIPV4_CIDR}
ipAllocationPolicy:
useIpAliases: True
servicesIpv4CidrBlock: ${SERVICESIPV4_CIDRBLOCK}
servicesIpv4CidrBlock: ${SERVICESIPV4_CIDRBLOCK}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's best practice to add a line at EOF

- https://www.googleapis.com/auth/devstorage.read_only
- https://www.googleapis.com/auth/logging.write
- https://www.googleapis.com/auth/monitoring
nodePools:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Malformed YAML, extra-spaces. nodePools should align with subnetwork

- https://www.googleapis.com/auth/devstorage.read_only
- https://www.googleapis.com/auth/logging.write
- https://www.googleapis.com/auth/monitoring
nodePools:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No initialClusterVersion value like in other templates

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was because cluster version changes frequently and as an example, this will just default to the default version. the other could probably be changed though.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be beneficial to have some consistency among the examples. If a user starts with the gke_regional.yaml example file, and does not know they need to add initialClusterVersion themselves (since it's not in the example for gke_regional.yaml), they will end up with a broken default version from gke.py.schema and the following error message:

Waiting for create [operation-1542046313989-57a7ba2c5c389-96c2ff4c-4b16d716]...failed.                                                                       
ERROR: (gcloud.deployment-manager.deployments.create) Error in Operation [operation-1542046313989-57a7ba2c5c389-96c2ff4c-4b16d716]: errors:
- code: RESOURCE_ERROR
  location: /deployments/my-gke-regional/resources/myk8sregional
  message: '{"ResourceType":"gcp-types/container-v1beta1:projects.locations.clusters","ResourceErrorCode":"400","ResourceErrorMessage":{"code":400,"message":"Master
    version \"1.9.7-gke.6\" is unsupported.","status":"INVALID_ARGUMENT","statusMessage":"Bad
    Request","requestPath":"https://container.googleapis.com/v1beta1/projects/cft-test-220920/locations/us-east1/clusters","httpMethod":"POST"}}'

They will then have to figure out themselves which versions are working at the moment and add those to the example themselves, or they will need to go ahead and start modifying the default in gke.py.schema, which is the less elegant but more intuitive thing to do. Both of those seem not ideal.

Additionally, I'm curious why are we specifying initialClusterVersion in gke.py as a required parameter (and not an optional one) to begin with?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, I'm curious why are we specifying initialClusterVersion in gke.py as a required parameter (and not an optional one) to begin with?

We shouldn't make initialClusterVersion a required parameter... I see this in the .py file

'properties':
            {
                'cluster':
                    {
                        'name':
                            name + '-cluster',
                        'initialNodeCount':
                            propc.get('initialNodeCount'),
                        'initialClusterVersion':
                            propc.get('initialClusterVersion')
                    }
            }

Completely missed it first pass - but this essentially makes this field required.

autoscaling:
enabled: True
minNodeCount: 1
maxNodeCount: 3

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This configuration results in unschedulable pods for K8s management pods:

Problem
Your cluster has 7 unschedulable pods

Details
Unschedulable pods
event-exporter-v0.1.9-5c8fb98cdb-8ggrr
heapster-v1.5.2-6d67555d76-sqhn9
kube-dns-autoscaler-69c5cbdcdd-pbzfh
kube-dns-db496df86-j9k9b
kubernetes-dashboard-7dbf5b9c66-6zmvw
and other 2 unschedulable pods.

Possible Actions
Increase maximum size limit for autoscaling in one or more node pools that have autoscaling enabled.

- https://www.googleapis.com/auth/devstorage.read_only
- https://www.googleapis.com/auth/logging.write
- https://www.googleapis.com/auth/monitoring
autoscaling:
Copy link

@erikkugel erikkugel Nov 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no 'taints' specified for the zonal cluster, but we specify them for global and global private. We should consider being consistent.

taints:
- key: mykey1
value: value1
effect: NO_SCHEDULE
Copy link

@erikkugel erikkugel Nov 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifying NO_SCHEDULE for a minNodeCount: 1 and maxNodeCount: 3 with the specified node type results in un-scheduled system pods for k8s. To address this without increasing autoscaling definitions, use PREFER_NO_SCHEDULE instead of NO_SCHEDULE

https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
https://cloud.google.com/kubernetes-engine/docs/reference/rest/v1beta1/NodeConfig#NodeTaint

@fawix
Copy link
Contributor

fawix commented Nov 12, 2018

@sourced-vince - can you please check this?

@fawix fawix removed cloud-foundations Cloud Foundation Toolkit development labels Nov 30, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants