Merge pull request juju#10979 from achilleasa/dev-fetch-network-interfaces-in-instance-poller

juju#10979

## Description of change

This PR concludes the refactoring work for the instance poller.

### The old approach

Prior to this PR, the machiner worker would periodically wake up, collect the device information that was visible to it and send it to the controller via the `SetObservedNetworkConfig` API call. The implementation of this API method (which lives in the `networkingcommon` package) would then query the network provider for additional interface info and merge it with the data sent in by the machiner. The final piece of fused data would then be broken down into two slices:
- a `[]state.LinkLayerDeviceArgs`
- a `[]state.LinkLayerDeviceAddresses`

A state-based setter method would then be invoked for each one of those slices to update the link layer devices collection and the machine addresses collection.

However, the original approach had some caveats:
- We would need to query the provider for network info each time the machiner worker wakes up; if we have deployed a large number of machines, we would be making an API call to the provider for each machine in the model even though the provider-based information doesn't really change that often.
- For detecting, populating and keeping in sync the **provider** addresses for the machine, we were relying on the machiner worker running and always being able to connect to the API server.

### The new approach

In the new world, the machiner worker is the primary source for collecting and updating the link layer device (and LLD address) collection. The code that queried the underlying provider for extra info has now been removed from the `SetObservedNetworkConfig` code path.

Instead, the instancepoller (a more dependable worker since it runs on the controller) will now also collect (if the provider supports it) the interface information using a single batch call at the same time that it queries the instance information (yet another batch call).
If the provider does not support querying for network interfaces (e.g. MAAS, LXD), the instancepoller will create a fake list of minimal network interface information using the address info obtained via the instance status query.

This information is then sent to the controller via a new instancepoller API method called `SetProviderNetworkConfig`. The controller-side implementation:
- attempts to infer the space ID for each provided interface address (private and/or shadow) using one of the following mechanisms (depending on which bits of information are provided by the instance poller):
  - match address to a subnet CIDR (multiple matches are considered ambiguous and will cause an error)
  - match CIDR **and** providerID to a subnet (this one is unambiguous)

In the case where we cannot match an address to any subnet (e.g. a public shadow address), we automatically assign it to the **alpha** space. The operator can (or will be able to) later use the juju CLI to re-assign the address, so using a sane default should not cause any issues.

Finally, since the interface information pushed by the instance poller may contain provider-specific IDs (network, address, subnet IDs) that are not visible to the machiner worker, the newly introduced `SetProviderNetworkConfig` method also makes a _best-effort_ attempt to match (by address) each instancepoller-reported interface to an existing linklayer device/address combination and backfill the provider-specific bits.

## Caveats and potential issues

Since the instancepoller will detect the machine before it actually starts (hence, before the machiner worker runs), it would be a good opportunity to pre-populate the link layer device/address collections from the instancepoller-sourced data. Unfortunately, some providers (I am looking at you ec2!) do not report the names for network interfaces **and** our linklayer device document IDs include the interface name.
Therefore, pre-populating and then merging the machiner worker information is not really an option as we would end up with different documents. The chosen approach does not have this problem as we only backfill the provider ID bits to devices that have already been added by the machiner. The caveat here is that this backfill process will take a while to complete; once a machine enters the long poll group, it won't be refreshed for about 15 minutes, which is more or less the latency we expect to see.

Furthermore, I would like to point out that while testing this code, I noticed that in the case of EC2, the link layer device collection and its sibling IP address collection **DO** in fact include duplicate entries for the same link layer device. This also happens in the 2.6 branch, which does not include any of the instancepoller refactoring work, and might point to a potential bug.

The following examples (link layer device and ip.address collection bits) were obtained by bootstrapping a 2.6 controller on AWS:

```json
{
  "_id" : "181280d3-d776-4790-82bc-fd7f7b7bd8a0:m#0#d#unsupported0",
  "name" : "unsupported0",
  "model-uuid" : "181280d3-d776-4790-82bc-fd7f7b7bd8a0",
  "mtu" : 0,
  "providerid" : "eni-067314af609749f26",
  "machine-id" : "0",
  "type" : "ethernet",
  "mac-address" : "0e:89:85:5c:80:47",
  "is-auto-start" : true,
  "is-up" : true,
  "parent-name" : "",
  "txn-revno" : NumberLong(2),
  "txn-queue" : [ "5de64e7d90f35328503a5868_33348be2", "5de64e7d90f35328503a5869_d8534326" ]
}
...
{
  "_id" : "181280d3-d776-4790-82bc-fd7f7b7bd8a0:m#0#d#ens5",
  "name" : "ens5",
  "model-uuid" : "181280d3-d776-4790-82bc-fd7f7b7bd8a0",
  "mtu" : 9001,
  "machine-id" : "0",
  "type" : "ethernet",
  "mac-address" : "0e:89:85:5c:80:47",
  "is-auto-start" : true,
  "is-up" : true,
  "parent-name" : "",
  "txn-revno" : NumberLong(2),
  "txn-queue" : [ "5de64e7d90f35328503a586d_0056020f", "5de64e7d90f35328503a586f_b036dc72" ]
}
```

```json
{
  "_id" : "181280d3-d776-4790-82bc-fd7f7b7bd8a0:m#0#d#unsupported0#ip#172.31.36.2",
  "model-uuid" : "181280d3-d776-4790-82bc-fd7f7b7bd8a0",
  "device-name" : "unsupported0",
  "machine-id" : "0",
  "subnet-cidr" : "172.31.32.0/20",
  "config-method" : "dynamic",
  "value" : "172.31.36.2",
  "txn-revno" : NumberLong(2),
  "txn-queue" : [ "5de64e7d90f35328503a5869_d8534326" ]
}
...
{
  "_id" : "181280d3-d776-4790-82bc-fd7f7b7bd8a0:m#0#d#ens5#ip#172.31.36.2",
  "model-uuid" : "181280d3-d776-4790-82bc-fd7f7b7bd8a0",
  "device-name" : "ens5",
  "machine-id" : "0",
  "subnet-cidr" : "172.31.32.0/20",
  "config-method" : "static",
  "value" : "172.31.36.2",
  "gateway-address" : "172.31.32.1",
  "is-default-gateway" : true,
  "txn-revno" : NumberLong(2),
  "txn-queue" : [ "5de64e7d90f35328503a586f_b036dc72" ]
}
```

## QA steps

Bootstrap on ec2, gce, maas and lxd with `--logging-config='<root>=ERROR;juju.worker.instancepoller=DEBUG;juju.apiserver.instancepoller=DEBUG'`

If the provider supports spaces, create a new space from a subnet and spin up one machine in the alpha space (or the default equivalent on MAAS) and one in the new space. Tailing the debug logs should show that the instancepoller has detected new provider addresses, each of which should include a space ID as its suffix.

```console
$ juju debug-log -m {controller/default} --include-module juju.worker.instancepoller --include-module juju.apiserver.instancepoller --replay --level DEBUG
```