This is a very surprising result and I’d love to see some root-cause analysis. On older Xen, packets sent between two VMs on the same host skipped things like checksum calculation (which was a big saving and is now always offloaded to hardware) and were just sent in a ring buffer. I wonder if the same is happening but they’re being copied on the CPU, whereas the real NIC is doing DMA.
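For anyone who wants to check what the guest actually sees, ethtool reports which offloads the virtual NIC claims to support. A rough sketch (mine, not from the article), assuming a Linux guest with ethtool installed and a placeholder interface name of eth0:

    # Rough sketch: dump which offloads a (virtual) NIC advertises inside the
    # guest. Assumes a Linux VM with ethtool installed; "eth0" is a placeholder.
    import subprocess

    def offload_features(iface="eth0"):
        out = subprocess.run(["ethtool", "-k", iface],
                             capture_output=True, text=True, check=True).stdout
        features = {}
        for line in out.splitlines()[1:]:      # first line is just a header
            if ":" not in line:
                continue
            name, value = line.split(":", 1)
            features[name.strip()] = value.strip().startswith("on")
        return features

    if __name__ == "__main__":
        feats = offload_features()
        for key in ("rx-checksumming", "tx-checksumming",
                    "tcp-segmentation-offload", "generic-receive-offload"):
            print(key, "->", feats.get(key))

Whether those show as "on" in the guest doesn't prove where the work actually happens, but it's a quick way to see what the paravirtualised device advertises.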
I also wouldn’t be surprised if local connections still end up bouncing out to the NIC. I think most cloud providers now run their paravirtualised / emulated devices on SmartNICs. This should be no worse than sending to another VM on a remote machine, but it’s possible that they don’t optimise for the loopback case and so end up hitting a slow path.
I don’t think cloud providers do this yet, but I proposed making encrypted network streams the default model for communication between Arm Confidential Computing Architecture Realms on Azure and providing TLS and QUIC offload, so that local-Realm to local-Realm could skip the encryption entirely. That should give a benefit from colocation with graceful fallback.
Yeah I was surprised too! But it makes sense once some of the implementation details are examined. Regarding Azure (which has particularly slow loopback forwarding), the article says:
The explanation lies in Azure’s accelerated networking, which is turned on by default for AKS. Packets destined for a VM on the same physical host take longer because they do not benefit from hardware acceleration. Instead, they are handled as exception packets in the software-based programmable virtual switch (VFP) on the physical host. [https://www.usenix.org/conference/nsdi18/presentation/firestone]
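For anyone who wants to reproduce the effect rather than take the article’s word for it, the measurement itself is simple. Here’s a minimal TCP ping-pong sketch (my own, not the article’s benchmark; the port and message size are arbitrary): run it as "server" on one VM and "client <ip>" on another, once for a pair of VMs on the same physical host and once for a pair on different hosts, and compare the percentiles.

    # Minimal ping-pong RTT sketch (not the article's benchmark). Run
    # "python3 pingpong.py server" on one VM and
    # "python3 pingpong.py client <ip>" on another.
    import socket, statistics, sys, time

    PORT, MSG, ROUNDS = 5001, b"x" * 64, 10_000

    def server():
        with socket.create_server(("0.0.0.0", PORT)) as srv:
            conn, _ = srv.accept()
            with conn:
                conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
                while (data := conn.recv(len(MSG))):
                    conn.sendall(data)                    # echo it straight back

    def client(host):
        with socket.create_connection((host, PORT)) as conn:
            conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            samples = []
            for _ in range(ROUNDS):
                t0 = time.perf_counter_ns()
                conn.sendall(MSG)
                conn.recv(len(MSG))     # small message, assume one read suffices
                samples.append((time.perf_counter_ns() - t0) / 1000)   # µs
            samples.sort()
            print(f"p50 {statistics.median(samples):.1f} µs  "
                  f"p99 {samples[int(len(samples) * 0.99)]:.1f} µs")

    if __name__ == "__main__":
        server() if sys.argv[1] == "server" else client(sys.argv[2])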
This is one of those undocumented behavior distinctions you can certainly discover by experiment, and which may appear stable, but which make me want to assume the worst case across the whole domain. If I’m riding on an abstracted, paid service, I can’t depend on such a fine detail of behavior to stay constant.
I haven’t found much about AWS Nitro loopback connections – they aren’t mentioned in https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ena-nitro-perf.html – though the article didn’t measure a performance penalty, so I suppose there’s nothing for AWS to warn their customers about.
Is it just my impression, or does the data he presents show better latency in at least two of the plots?
I agree. In particular, for
seems rather a stretch - most of the blue-only area is around 50µs, while almost all of the orange-only area is above 70µs.
To be fair, for
they say
I would also assume that this is a synthetic benchmark. Would the results be the same at, say, 50% network saturation? How about 80%?
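That would be easy to bolt onto a ping-pong test like the one sketched above: keep some bulk streams running to the same destination while measuring, and see how the latency percentiles move. A crude sketch (mine, not the article’s; you’d have to tune the number of streams, or add pacing, to hit a specific utilisation like 50% or 80%):

    # Crude background-load generator to pair with a latency test. Run "sink"
    # on the target VM, then "load <ip> <n_streams>" next to the latency client,
    # and compare percentiles with and without it. Port and sizes are arbitrary.
    import socket, sys, threading

    SINK_PORT, CHUNK = 5002, b"\0" * (1 << 20)        # 1 MiB writes

    def sink():
        with socket.create_server(("0.0.0.0", SINK_PORT)) as srv:
            while True:
                conn, _ = srv.accept()
                threading.Thread(target=drain, args=(conn,), daemon=True).start()

    def drain(conn):
        with conn:
            while conn.recv(1 << 16):                 # discard everything
                pass

    def load(host, streams):
        def blast():
            with socket.create_connection((host, SINK_PORT)) as conn:
                while True:
                    conn.sendall(CHUNK)
        for _ in range(streams):
            threading.Thread(target=blast, daemon=True).start()
        threading.Event().wait()              # keep the daemon threads running

    if __name__ == "__main__":
        sink() if sys.argv[1] == "sink" else load(sys.argv[2], int(sys.argv[3]))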