Azure Virtual Networks Do Not Exist

In this post, I want to share the most important thing that you should know when you are designing connectivity and security solutions in Microsoft Azure: Azure virtual networks do not exist.

A Fiction Of Your Own Mind

I understand why Microsoft has chosen to use familiar terms and concepts with Azure networking. It’s hard enough for folks who have worked exclusively with on-premises technologies to get to grips with all of the (ongoing) change in The Cloud. Imagine how bad it would be if we ripped out everything they knew about networking and replaced it with something else.

In a way, that’s exactly what happens when you use Azure’s networking. It is most likely very different to what you have previously used. Azure is a multi-tenant cloud. Countless thousands of tenants are signed up and using a single global physical network. If we want to avoid all the pains of traditional hosting and enable self-service, then something different has to be done to abstract the underlying physical network. Microsoft has used VXLAN to create software-defined networking; this means that an Azure customer can create their own networks with address spaces that have nothing to do with the underlying physical network. The Azure fabric tracks what is running where, and which NICs can talk to each other, and forwards packets as required.

In Azure, everything is either a physical (rare) or a virtual (most common) machine. This includes all the PaaS resources and even those so-called serverless resources. When you drill down far enough in the platform, you will find a machine with an operating system and a NIC. That NIC is connected to a network of some kind, either an Azure-hosted one (in the platform) or a virtual network that you created.

The NIC Is The Router

The above image is from a slide I use quite often in my Azure networking presentations. I use it to get a concept across to the audience.

Every virtual machine (except for Azure VMware Solution) is hosted on a Hyper-V host, and remember that most PaaS services are hosted in virtual machines. In the image, there are two virtual machines that want to talk to each other. They are connected to a common virtual network that uses a customer-defined prefix of 10.0.0.0/8.

The source VM sends a packet to 10.10.1.5. The packet exits the VM’s guest OS and hits the Azure NIC. The NIC is connected to a virtual switch in the host – did you know that in Hyper-V, the switch port is a part of the NIC to enable consistent processing no matter what host the VM is moved to? The virtual switch encapsulates the packet to enable transmission across the physical network – the physical network has no idea about the customer’s prefix of 10.0.0.0/8. How could it? I’d guess that 80% of customers use all or some of that prefix. Encapsulation allows the packet to hide the customer-defined source and destination addresses. The Azure Fabric knows where the customer’s destination (10.10.1.5) is running, so it uses the physical destination host’s address in the encapsulated packet.

Now the packet is free to travel across the physical Azure network – across the rack, data centre, region or even the global network – to reach the destination host. Now the packet moves up the stack, is de-encapsulated, and is dropped into the NIC of the destination VM where things like NSG rules (how the NSG is associated doesn’t matter) are processed.

Here’s what you need to learn here:

  1. The packet went directly from source to destination at the customer level. Sure, it travelled along a Microsoft physical network, but we don’t see that. We see that the packet left the source NIC and arrived directly at the destination NIC.
  2. Each NIC is effectively its own router.
  3. Each NIC is where NSG rules are processed: source NIC for outbound rules and destination NIC for inbound rules (see the sketch below).
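
To make that third point concrete, here is a minimal Bicep sketch of an NSG with one inbound and one outbound rule. The names, prefixes, and ports are my own examples, not something taken from the diagram; however you associate the NSG, the outbound rule is evaluated at the sending NIC and the inbound rule at the receiving NIC.

// Minimal sketch: one NSG with an inbound and an outbound rule.
// Names, prefixes, and ports are illustrative only.
resource nsg 'Microsoft.Network/networkSecurityGroups@2023-11-01' = {
  name: 'nsg-example'
  location: resourceGroup().location
  properties: {
    securityRules: [
      {
        name: 'AllowHttpsInbound' // evaluated at the destination (receiving) NIC
        properties: {
          priority: 100
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefix: '10.0.1.0/24'
          sourcePortRange: '*'
          destinationAddressPrefix: '10.10.1.0/24'
          destinationPortRange: '443'
        }
      }
      {
        name: 'AllowSqlOutbound' // evaluated at the source (sending) NIC
        properties: {
          priority: 100
          direction: 'Outbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefix: '10.10.1.0/24'
          sourcePortRange: '*'
          destinationAddressPrefix: '10.10.2.0/24'
          destinationPortRange: '1433'
        }
      }
    ]
  }
}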

The Virtual Network Does Not Exist

Have you ever noticed that every Azure subnet has a default gateway that you cannot ping?

In the above example, no packets travelled across a virtual network. There were no magical wires. Packets didn’t go to a default gateway of the source subnet, get routed to a default gateway of a destination subnet and then to the destination NIC. You might have noticed in the diagram that the source and destination were on different peered virtual networks. When you peer a virtual network, an operator is not sent sprinting into the Azure data centres to install patch cables. There is no mysterious peering connection.

This is the beauty and simplicity of Azure networking in action. When you create a virtual network, you are simply stating:

Anything connected to this network can communicate with each other.
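
In Bicep terms, that statement is really all a virtual network is – a declared address space and some subnets. This is a minimal sketch with made-up names, reusing the 10.0.0.0/8 example prefix from earlier:

// A virtual network is just a declaration: an address space and subnets.
// It is an instruction to the fabric, not a set of switches and cables.
resource vnetBlack 'Microsoft.Network/virtualNetworks@2023-11-01' = {
  name: 'vnet-black'
  location: resourceGroup().location
  properties: {
    addressSpace: {
      addressPrefixes: [
        '10.0.0.0/8' // customer-defined; the physical network never sees it
      ]
    }
    subnets: [
      {
        name: 'snet-workload'
        properties: {
          addressPrefix: '10.10.1.0/24'
        }
      }
    ]
  }
}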

Why do we create subnets? In the past, subnets were for broadcast control. We used them for network isolation. In Azure:

  • We can isolate items from each other in the same subnet using NSG rules.
  • We don’t have broadcasts – they aren’t possible.

Our reasons for creating subnets are greatly reduced, and so are our subnet counts. We create subnets when there is a technical requirement – for example, an Azure Bastion requires a dedicated subnet. We should end up with much simpler, smaller virtual networks.

How To Think of Azure Networks

I cannot say that I know how the underlying Azure fabric works. But I can imagine it pretty well. I think of it simply as a mapping system. And I explain it using Venn diagrams.

Here’s an example of a single virtual network with some connected Azure resources.

Connecting these resources to the same virtual network is an instruction to the fabric to say: “Let these things be able to route to each other”. When the app service (with VNet Integration) wants to send a packet to the virtual machine, the NIC on the source (platform-hosted) VM will send the packets directly to the NIC of the destination VM.

Two more virtual networks, blue and green, are created. Note that none of the virtual networks are connected/peered. Resources in the black network can talk only to each other. Resources in the blue network can talk only to each other. Resources in the green network can talk only to each other.

Now we will introduce some VNet peering:

  • Black <> Blue
  • Black <> Green

As I stated earlier, no virtual cables are created. Instead, the fabric has created new mappings. These new mappings enable new connectivity:

  • Black resources can talk with blue resources
  • Black resources can talk with green resources.

However, green resources cannot talk directly to blue resources – this would require routing to be enabled via the black network with the current peering configuration.
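
If you scripted the above peerings in Bicep, you would see that each peering is just a pair of child resources, one on each virtual network, and nothing is created “in between”. A rough sketch, assuming the virtual networks already exist with the names I have made up here (the black-to-green pair follows the same pattern):

// Sketch: peer black<>blue. The black<>green pair follows the same pattern.
// No peering (mapping) exists between blue and green, so they cannot talk directly.
resource vnetBlack 'Microsoft.Network/virtualNetworks@2023-11-01' existing = {
  name: 'vnet-black'
}

resource vnetBlue 'Microsoft.Network/virtualNetworks@2023-11-01' existing = {
  name: 'vnet-blue'
}

resource blackToBlue 'Microsoft.Network/virtualNetworks/virtualNetworkPeerings@2023-11-01' = {
  parent: vnetBlack
  name: 'peer-black-to-blue'
  properties: {
    remoteVirtualNetwork: {
      id: vnetBlue.id
    }
    allowVirtualNetworkAccess: true // "let these NICs map to each other"
    allowForwardedTraffic: true
  }
}

resource blueToBlack 'Microsoft.Network/virtualNetworks/virtualNetworkPeerings@2023-11-01' = {
  parent: vnetBlue
  name: 'peer-blue-to-black'
  properties: {
    remoteVirtualNetwork: {
      id: vnetBlack.id
    }
    allowVirtualNetworkAccess: true
    allowForwardedTraffic: true
  }
}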

I can implement isolation within the VNets using NSG rules. If I want further inspection and filtering from a firewall appliance then I can deploy one and force traffic to route via it using BGP or User-Defined Routing.

Wrapping Up

The above simple concept is the biggest barrier I think that many people have when it comes to good Azure network design. If you grasp the fact that virtual networks do not exist and that packets route directly from source to destination, and you can process those two facts, then you are well on your way to designing well-connected/secured networks and being able to troubleshoot them.

If You Liked This Article

If you liked this article, then why don’t you check out my custom Azure training with my company, Cloud Mechanix. My next course is Azure Firewall Deep Dive, a two-day virtual course where I go through how to design and implement Azure Firewall, including every feature. This two-day course runs on February 12/13, timed for (but not limited to) European attendees.

Will International Events Impact Cloud Computing

You must have been hiding under a rock if you haven’t noticed how cloud computing has become the default in IT. I have started to wonder about the future of cloud computing. Certain international events have the potential to disrupt cloud computing in a major way. I’m going to play out two scenarios in this post and illustrate what the possible problems may be.

Bear In The East

Russia expanded their conflict with Ukraine in February 2022. This was the largest signal so far that the leadership of Russia wanted to expand their post-Soviet borders to include some of the former USSR nations. The war in Ukraine is taking much longer than expected and has eaten into the Russian military, thanks to the determination of the Ukrainian people. However, we know that Russia has eyes elsewhere.

The Baltic nations (Lithuania, Latvia and Estonia) provide a potential land link between Russia and the Baltic Sea. North of those nations is Finland, a country with a long & wild border with Russia – and also one with a history of conflict with Russia. Finland (and Sweden) has recognised the potential of this expanded threat by joining NATO.

If you read “airport thrillers” like me, then you’ll know that Sweden has an island called Gotland in the Baltic Sea. It plays a huge strategic role in controlling that sea. If Russia were to take that island, they could prevent resupply via the Baltic Sea to the Baltic countries and Finland, leaving only air, land, and the long route up North – speaking of which …

Norway also shares a land border with Russia to the north of Finland. The northern Norwegian coast faces the main route from Murmansk (a place I attacked many times when playing the old Microprose F-19 game). Murmansk is the home of the Russian Northern Fleet. Their route to the Atlantic is north of the Norwegian coast and south between Iceland and Ireland.

In the Arctic is Svalbard, a group of islands that is home to polar bears and some pretty tough people. These islands are also eyed up by Russia – I’m told that it’s not unusual to hear stories of some kind of espionage there.

So Russia could move west and attack. What would happen then?

Nordic Azure Regions

There are several Azure regions in the Nordics:

  • Norway East, paired with Norway West
  • Sweden Central, paired with Sweden South
  • One is “being built” in Espoo, Finland, just outside the capital of Helsinki.

Norway West is a small facility that is hosted in a third-party data centre and is restricted to a few customers.

I say “being built” with the Finnish region because I suspect that it’s been active for a while with selected customers. Not long after the announcement of the region (2022), I had a nationally strategic customer tell me that the local Microsoft data centre salesperson was telling them to stop deploying in Azure West Europe (Netherlands) and to start using the new Finnish region.

FYI: the local Microsoft data centre salesperson has a target of selling only the local Azure region. The local subsidiary has to make a usage commitment to HQ before a region is approved. Adoption in another part of Azure doesn’t contribute to this target.

I remember this conversation because it was not long after tanks rolled into Ukraine and talk of Finland joining NATO began heating up. I asked my customer: “Let’s say you place nationally critical services into the new Finnish region. What is one of the first things that Russia will send missiles to?” Yes, they will aim to shut down any technology and communications systems first … including Azure regions. All the systems hosted in Espoo will disappear in a flaming pile of debris. I advised the customer that if I were them, I would continue to use cloud regions that were as far away as possible while still meeting legal requirements.

Norway’s situation is worse. Their local and central governments have to comply with a data placement law, which prevents the placement of certain data outside of Norway. If you’re using Azure, you have no choice: you must use Norway East, which is in urban Oslo (the capital on the south coast). Private enterprises can choose any of the European regions (they typically take West Europe/Netherlands, paired with North Europe/Ireland) so they have a form of disaster recovery (I’ll come back to this topic later). However, Norway East users cannot replicate into Norway West – the Stavanger-located region is only available to a select (allegedly) three customers and it is very small.

FYI: restricted access paired regions are not unusual in Azure.

Disaster Recovery

So a hypersonic missile just took out my Azure region – what do I do next? In an ideal world, all of your data was replicated in another location. Critical systems were already built with redundant replicas. Other systems can be rebuilt by executing pipelines with another Azure region selected.

Let’s shoot all of that down, shall we?

So I have used Norway East. And I’ve got a bunch of PaaS data storage systems. Many of those storage systems (for example, Azure Backup Recovery Services vaults) are built on blob storage. Blob storage offers geo-redundancy, which is restricted to the paired region. If my data storage can only replicate to the paired region and there is no paired region available to me, then there is no replication option. You will need to bake your own replication system.
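
As a trivial illustration of what “bake your own” means, here is a hedged Bicep sketch: the platform’s geo-redundant SKU only knows about the paired region, so a second, independent account in a region of your choosing – plus a copy job you own (AzCopy, object replication, etc.) – is the DIY alternative. Names and regions are examples only.

// Sketch: geo-redundant storage can only replicate to the paired region, so the
// "bake your own" alternative is a second independent account plus your own copy job.
param primaryLocation string = 'norwayeast'
param drLocation string = 'westeurope' // your choice, not the platform's pairing

resource primarySa 'Microsoft.Storage/storageAccounts@2023-01-01' = {
  name: 'stprim${uniqueString(resourceGroup().id)}'
  location: primaryLocation
  kind: 'StorageV2'
  sku: {
    name: 'Standard_GRS' // the replica lands only in the paired region
  }
}

resource drSa 'Microsoft.Storage/storageAccounts@2023-01-01' = {
  name: 'stdr${uniqueString(resourceGroup().id)}'
  location: drLocation
  kind: 'StorageV2'
  sku: {
    name: 'Standard_LRS' // you own the replication job (AzCopy, object replication, etc.)
  }
}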

Some compute/data resource types offer replication in any region. For example, Cosmos DB can replicate to other regions, but that comes with potential sync/latency issues. Azure VMs offer Azure Site Recovery, which enables replication to any region. This is where I expect the “cloud native” types to shout “GitOps!”, but they always seem to focus only on compute and forget things like data – no, we won’t be putting massive data stores in an AKS container 🙂

Has anyone not experienced capacity issues in an Azure region in the last few years? There are probably many causes for that so we won’t go down that rabbit hole. But a simple task of deploying a new AVD worker pool or a firewall with zone resilience commonly results in a failure because the region doesn’t have capacity. What would happen if Norway East disappeared and all of the tenants started to failover/redeploy to other European regions? Let’s just say that there would be massive failures everywhere.

Orange Man In The West

Greenland is an autonomous territory of the Kingdom of Denmark, which is an EU member state. US president-elect, Donald Trump, has been sabre-rattling about Greenland recently. He wants the US to take it over by either economic (trade war) or military means.

If the USA goes into a trade war with Denmark, then it will go into a trade war with all of the EU. Neither side will win. If the tech giants continue to personally support Donald Trump then I can imagine the EU retaliating against them. Considering that Microsoft, Amazon, and Google are American companies, sanctions against those companies would be bad – the cost of cloud computing could rocket and make it unviable.

If the USA invaded Greenland (a NATO ally by virtue of being a Danish territory) then it would lead to a very unpleasant situation between NATO/EU and the USA. One could imagine that American companies would be shunned, not just emotionally but also legally. That would end Azure, AWS, and Google in the EU.

So how would one recover from losing their data and compute platform? It’s not like you can just live migrate a petabyte data lake or a workload based on Azure Functions.

The Answer

I don’t have a good answer. I know of an organisation that had an “only do VMs in Azure” policy. I remember being dumbfounded at the time. They explained that it was for support reasons. But looking back on it, they abstracted themselves from Azure by use of an operating system. They could simply migrate/restore their VMs to another location if necessary – on-prem, another cloud, another country. They are not tied to the cloud platform, the location, or the hardware. But they do lose so many of the benefits of using the cloud.

I expect someone will say “use on-prem for DR”. OK, so you’ll build a private cloud, at huge expense and let it sit there doing nothing on the off-chance that it might be used. If I was in that situation then I wouldn’t be using Azure/etc at all!

I’ve been wondering for a while if the EU could fund/sponsor the creation of an IT sector in Europe that is independent from the USA. It would need an operating system, productivity software, and a cloud platform. We don’t have any tech giants as big or as cash-rich as Microsoft in the EU, so this would have to be sponsored. I also think that it would have to be a collaboration. My fear is that it would be bogged down in bureaucracy and have a heavy Germany/France-first influence. But I am looking at the news every day and realising that we need to consider a non-USA solution.

Wrapping Up

I’m all doom and gloom today. Maybe it’s all of the negativity in the news that is bringing me down. I see continued war in Ukraine, Russia attacking infrastructure in the Baltic sea, and threats from the USA. The world has changed and we all will need to start thinking about how we act in it.

Manage Existing Azure Firewall With Firewall Policy Using Bicep

In this post, I want to discuss how I recently took over the management of an existing Azure Firewall using Firewall Policy/Azure Firewall Manager and Bicep.

Background

We had a customer set up many years ago using our old templated Azure deployment based on ARM. At the centre of their network is Azure Firewall. That firewall plays a big role in the customer’s micro-segmented network, with over 40,000 lines of ARM code defining the many firewall rules.

The firewall was deployed before Azure Firewall Manager (AFM) was released. AFM is a pretty GUI that enables the management of several Azure networking resource types, including Azure Firewall. But when it comes to managing the firewall, AFM uses a resource called Firewall Policy; you don’t have to touch AFM at all – you can deploy a Firewall Policy, link the firewall to it (via Resource ID), and edit the Firewall Policy directly (Azure Portal or code) to manage the firewall settings and rules.
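
In code terms, that link is just the Firewall Policy’s resource ID on the firewall resource. A minimal, hedged Bicep sketch (raw resources rather than the AVM modules I describe later, with made-up names, and assuming the hub VNet and public IP already exist):

// Sketch: a Firewall Policy plus an Azure Firewall linked to it by resource ID.
resource hubVnet 'Microsoft.Network/virtualNetworks@2023-11-01' existing = {
  name: 'vnet-hub'
}

resource fwPip 'Microsoft.Network/publicIPAddresses@2023-11-01' existing = {
  name: 'pip-afw-hub'
}

resource fwPolicy 'Microsoft.Network/firewallPolicies@2023-11-01' = {
  name: 'afwp-hub'
  location: resourceGroup().location
  properties: {
    sku: {
      tier: 'Standard'
    }
  }
}

resource firewall 'Microsoft.Network/azureFirewalls@2023-11-01' = {
  name: 'afw-hub'
  location: resourceGroup().location
  properties: {
    sku: {
      name: 'AZFW_VNet'
      tier: 'Standard'
    }
    firewallPolicy: {
      id: fwPolicy.id // this one property is the entire "link"
    }
    ipConfigurations: [
      {
        name: 'ipconfig1'
        properties: {
          subnet: {
            id: '${hubVnet.id}/subnets/AzureFirewallSubnet'
          }
          publicIPAddress: {
            id: fwPip.id
          }
        }
      }
    ]
  }
}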

One of the nicest features of Azure Firewall is a result of it being an Azure PaaS resource. Like every other resource type (there are exceptions sometimes), Azure Firewall is completely manageable via code. Not only can you deploy the firewall, but you can also operate it on a day-to-day basis using ARM/Bicep/Terraform/Pulumi if you want: the settings and the firewall rules. That means you can have complete change control and rollback using the features of Git in DevOps, GitHub, etc.

All new features in Azure Firewall have surfaced only via Firewall Policy since the general availability release of AFM. A legacy Azure Firewall that doesn’t have a Firewall Policy is missing many security and management features. The team that works regularly with this customer approached me about adding Firewall Policy to the customer’s deployment and including that in the code.

The Old Code

As I said before, the old code was written in ARM. I won’t get into it here, but we couldn’t add the required code to do the following without significant risk:

  • A module for Firewall Policy
  • Updating the module for Azure Firewall to include the link to the Firewall Policy.

I got a peer to give me a second opinion and he agreed with my original assessment. We should:

  1. Create a new set of code to manage the Azure Firewall using Bicep.
  2. Introduce Firewall Policy via Bicep.
  3. Remove the ARM module for Azure Firewall from the ARM code.
  4. Leave the rest of the hub as is (ARM) because this is a mission-critical environment.

The High-Level Plan

I decided to do the following:

  1. Set up a new repo just for the Azure Firewall and Firewall Policy.
  2. Deploy the new code in there.
  3. Create a test environment and test like crazy there.
  4. Preserve the existing Azure Firewall public IP because it was used in DNAT rules and by remote parties in their firewall rules.
  5. We agreed that there should be “no” downtime in the process but I wanted time for a rollback just in case. I would create non-parameterised ARM exports of the entire hub, the GatewaySubnet route table (critical to routing intent and a risk point in this kind of work), and the Azure Firewall. Our primary rollback plan would be to run the un-modified ARM code to restore everything as it was.

The Build

I needed an environment to work in. I did a non-parameterised export of the hub, including the Azure Firewall. I decompiled that to Bicep and deployed it to a dedicated test subscription. This did require some clean-up:

  • The public IP of the firewall would be different so DNAT rules would need a new destination IP.
  • Every rules collection group (many hundreds of them) had a resource ID that needed to be removed – see regex searches in Visual Studio Code.

The deployment into the test environment was a two-stage job – I needed the public IP address to obtain the destination address value for the DNAT rules.

Now I had a clone of the production environment, including all the settings and firewall rules.

The Bicep Code

I’ve been doing a lot of Bicep since the Spring of this year (2024). I’ve been using Azure Verified Modules (AVM) since early Summer – it’s what we’ve decided should be our standard approach, emulating the styling of Azure Verified Solutions.

We don’t use Microsoft’s landing zones. I have dug into them and found a commonality. The code is too impressive. The developer has been too clever. Very often, “customer configuration” is hard-coded into the Bicep. For example, the image template for Azure Image Builder (in the AVD landing zone) is broken up across many variables which are unioned until a single variable is produced. The image template is a file that should be easy to get at and commonly updated.

A managed service provider knows that architecture (the code) should be separated from customer configuration. This allows the customer configuration to be frequently updated separately from the architecture. And, in turn, it should be possible to update the architecture without having to re-import the customer configuration.

My code design is simple:

  • main.bicep, which deploys the Azure Firewall (AVM) and the Firewall Policy (AVM).
  • A two-property parameter controls, via true/false (bool) values, whether or not each of the two resources is deployed.
  • A main.bicepparam supplies parameters to configure the SKUs/features/settings of the Azure Firewall and Firewall Policy using custom types (enabling complete Intellisense in VS Code).
  • A simple module documents the Rules Collections in a single array. This array is returned as an output to main.bicep and fed as a single value to the Firewall Policy module (see the sketch after this list).
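
Here is a heavily trimmed sketch of that layout. The module paths and parameter names are hypothetical stand-ins, not our real code:

// main.bicep (sketch; the module paths and parameter names are hypothetical)
param deployFirewallPolicy bool
param deployAzureFirewall bool
param firewallPolicyName string
param azureFirewallSettings object // the real code uses custom types for full IntelliSense

// The firewall rules live in their own module and come back as one array.
module fwRules 'modules/rulesCollectionGroups.bicep' = {
  name: 'fwRules'
}

module fwPolicy 'modules/firewallPolicy.bicep' = if (deployFirewallPolicy) {
  name: 'fwPolicy'
  params: {
    name: firewallPolicyName
    ruleCollectionGroups: fwRules.outputs.ruleCollectionGroups
  }
}

module firewall 'modules/azureFirewall.bicep' = if (deployAzureFirewall) {
  name: 'firewall'
  params: {
    settings: azureFirewallSettings
    // the firewall is linked to the policy by resource ID
    firewallPolicyId: resourceId('Microsoft.Network/firewallPolicies', firewallPolicyName)
  }
}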

I did attempt to document the Rules Collections as ARM and use the Bicep function to load an ARM file. This was my preference because it would simplify producing the firewall rules from the Azure Portal and inputting them into the file, both for the migration and for future operations. However, the Bicep functions that load a file (loadTextContent/loadJsonContent) are limited to 131,072 characters, which was far too few. The eventual Rules Collection Group module had over 40,000 lines!

My test process eventually gave me a clean result from start to finish.

The Migration

The migration was scheduled for late at night. Earlier in the afternoon, a freeze was put in place on the firewall rules. That enabled me to:

  1. Use Azure Firewall Manager to start the process of producing a Firewall Policy. I chose the option to import the rules from the existing production firewall. I then clicked the link to export the rules to ARM and saved the file locally.
  2. I decompiled the ARM code to Bicep. I copied and pasted the 3 Rules Collection Groups into my Rules Collection Group module.
  3. I then ran the deployment with no resources enabled. This told me that the pipeline was functioning correctly against the production environment.
  4. When the time came, I made my “backups” of the production hub and firewall.
  5. I updated the parameters to enable the deployment of the Firewall Policy. That was a quick run – the Azure Firewall was not touched so there was no update to the Firewall. This gave me one last chance to compare the firewall settings and rules before the final steps began.
  6. I removed the DNS settings from the Azure Firewall. I found in testing that I could not attach a Firewall Policy to an Azure Firewall if both contained DNS settings. I had to remove those settings from the production firewall. This could have caused some downtime to any clients using the firewall as their DNS server, but that feature had not been rolled out yet.
  7. I updated the parameters to enable management of the Azure Firewall. The code here included the name of the in-place Public IP Address. The parameters also included the resource IDs of the hub Virtual Network and the Log Analytics Workspace (Resource-Specific tables in the code). The pipeline ran … this was the key part because the Bicep code was updating the firewall with the resource ID of the Firewall Policy. Everything worked perfectly … almost … the old diagnostics settings were still there and had to be removed because the new code used a new naming standard. One quick deletion and a re-run and all was good.
  8. One of my colleagues ran a bunch of pre-documented and pre-verified tests to confirm that all was well.
  9. I then commented out the code for the Azure Firewall from the old ARM code for the hub. I re-ran the pipeline and cleaned up some errors until we had a repeated clean run.

The technical job was done:

  • Azure Firewall was managed using a Firewall Policy.
  • Azure Firewall had modern diagnostics settings.
  • The configuration is being done using code (Bicep).

You might say “Aidan, there’s a PowerShell script to do that job”. Yes, there is, but it wasn’t going to produce the code that we needed to leave in place. This task did the work and has left the customer with code that is extremely flexible, with every resource property available as a mandatory/optional property through a documented type specific to the resource type. As long as no bugs are found, the code can be used as is to configure any settings/features/rules in Azure Firewall or Azure Firewall Manager, either through the parameters files (SKUs and settings) or the Rules Collection Groups module (firewall rules).

Azure Firewall Deep Dive Training

If you thought that this post was interesting then please do check out my Azure Firewall Deep Dive course that is running on February 12th – February 13th, 2025 from 09:30-16:00 UK/Irish time/10:30-17:00 Amsterdam/Berlin time. I’ve run this course twice in the last two weeks and the feedback has been super.

Azure Firewall Deep Dive Training

I’ll tell you about my new virtual training course on Azure Firewall and share some schedule information in this post.

Background

I’ve been talking about Azure Firewall for years. I’ve done lots of sessions at user groups and conferences. I’ve done countless handovers with customers and colleagues. One of my talking points is that I reckoned that I could teach someone with a little Azure/networking knowledge everything there is to know about Azure Firewall in 2 days. And that’s what I decided to do!

I was updating one of my sessions earlier in the year when I realised that it was pretty much the structure of a training course. Instead of me just listing out features or barely discussing architecture to squeeze it into a 45-60 minute-long session, I could take the time to dive deep and share all that I know or could research.

The Course

I produced a 2-day course that could be taught in-person, but my primary vector is virtual/online – it’s hard to get a bunch of people from all over into one place, and there is also a cost to me in hosting a physical event that would increase the cost of the course. I decided that virtual was best, with an option of doing it in person if a suitable opportunity arose.

The course content is delivered using a combination of presentation and demo. Presentation lets me explain the whats, whys, and so on. Demonstration lets me show you how.

The demo lab is built from a Bicep deployment, based on Azure Verified Modules (AVM). A hub & spoke network architecture is created with an Application Gateway, a simple VM workload, and a simple App Services (Private Endpoint) workload. The demonstrations follow a “hands-on guide”; this guide is written as if this was a step-by-step hands-on course, instructing the reader exactly which button to click and what/where to type. Each exercise builds on the last, eventually resulting in a secure network architecture with all of the security, monitoring, and management bells and whistles.

Why did I opt for demo instead of hands-on? Hands-on works for in-person classes, but online you cannot assist in the same way when people struggle. In addition, waiting for attendees to complete labs would add another day (and cost) to the class.

Before each class, I share all of the content that I use:

  • System requirements and setup instructions.
  • The Bicep files for the demo lab.
  • The hands-on lab instructions
  • The PowerPoint
  • And a few more useful bits

I always update content – for example, my first run of this class was during Microsoft Ignite 2024 and I added a few bits from the news. Therefore I share the updated content with attendees after the course.

The First Run

I ran the class for the first time earlier this week, November 20-21, 2024. Attendees from all around Europe joined me for 2 days. At first they were quiet. Online is tough for speakers like me because I look for visual feedback on how I’m doing. But then the questions started coming – people were interested in what I was saying. Interaction also makes the class more interesting for me – sometimes you get comments that cover things you didn’t originally include and everyone benefits – I updated the course with one such item at the end of day 1!

I shared a 4-question anonymous survey to learn what people thought. The feedback was awesome.

Feedback

This course was previously run in November 2024 for a European audience. The survey feedback was as follows:

How would you rate this course?

  • Excellent: 83%
  • Good: 17%

Was This Course Worth Your Time?

  • Yes: 100%

Would you recommend this course to others?

  • Yes: 100%

Some of the comments:

“I think it is a very good introduction to Azure Firewall, but it goes beyond foundational concepts so medium-experienced admins will also get value from this. I like the sections on architecture and explanations of routing and DNS. I think this course will enable people to do a good job more than for example az 700 because of the more practical approach. You are good at explaining the material”.

“Just what I wanted from a Deep dive course.”

“Perfectly delivered. Crystal clear content and very well explained”.

Future Classes

I have this class scheduled for two more runs, each timed for different parts of the world:

The classes are ultra-affordable. A few hundred Euros/dollars gets you custom content based on real-world usage. I did find a virtual 2-day course on Palo Alto firewalls that cost $1,700! You’ll also find that I offer early-bird registration pricing and discounts for more than one booking. If you have a large group (5+) then we might be able to figure out a lower rate 🙂

More To Come

More classes are coming! I have an old one to reinvent based on lots of experience over the years and at least 1 new one to write from scratch. Watch out for more!

Azure Image Builder Job Fails With TCP 60000, 5986 or 22 Errors

In this post, I will explain how to solve the situation when an Azure Image Builder job fails with the following errors:

[ERROR] connection error: unknown error Post "https://10.1.10.9:5986/wsman": proxyconnect tcp: dial tcp 10.0.1.4:60000: i/o timeout
[ERROR] WinRM connection err: unknown error Post "https://10.1.10.9:5986/wsman": proxyconnect tcp: dial tcp 10.0.1.4:60000: i/o timeout

[ERROR] connection error: unknown error Post "https://10.1.10.8:5986/wsman": context deadline exceeded
[ERROR] WinRM connection err: unknown error Post "https://10.1.10.8:5986/wsman": context deadline exceeded

Note that the second error will probably be the following if you are building a Linux VM:

[ERROR] connection error: unknown error Post "https://10.1.10.8:22/ssh": context deadline exceeded
[ERROR] SSH connection err: unknown error Post "https://10.1.10.8:22/ssh": context deadline exceeded

The Scenario

I’m using Azure Image Builder to prepare a reusable image for Azure Virtual Desktop with some legacy software packages from external vendors. Things like re-packaging (MSIX) will be a total support no-no, so the software must be pre-installed into the image.

I need a secure solution:

  • The virtual machine should not be reachable on the Internet.
  • Software packages will be shared from a storage account with a private endpoint for the blob service.

This scenario requires that I prepare a virtual network and customise the Image Template to use an existing subnet ID. That’s all good. I even looked at the PowerShell example from Microsoft which told me to allow TCP 60000-60001 from Azure Load Balancer to Virtual Network:

I also added my customary DenyAll rule at priority 4000 – the built-in Deny rule doesn’t deny all that much!

I did that … and the job failed, initially with the first of the errors above, related to TCP 60000. Weird!

Troubleshooting Time

Having migrated countless legacy applications with missing networking documentation into micro-segmented Azure networks, I knew what my next steps were:

  1. Deploy Log Analytics and a Storage Account for logs
  2. Enable VNet Flow Logs with Traffic Analytics on the staging subnet (where the build is happening) NSG
  3. Recreate the problem (do a new build)
  4. Check the NTANetAnalytics table in Log Analytics

And that’s what I did. Immediately I found that there were comms problems between the Private Endpoint (Azure Container Instance) and the proxy VM. TCP 60000 was attempted and denied because the source was not the Azure Load Balancer.

I added a rule to solve the first issue:

I re-ran the test (yes, this is painfully slow) and the job failed.

This time the logs showed failures from the proxy VM to the staging (build) VM on TCP 5986. If you’re building a Linux VM then this will be TCP 22.

I added a third rule:

Now when I test I see the status switch from Running, to Distributing, to Success.

Root Cause?

Adding my DenyAll rule caused the scenario to vary from the Microsoft docs. The built-in AllowVnetInBound rule is too open because it allows all sources in a routed “network”, including other networks in a hub & spoke. So I micro-segment using a low priority DenyAll rule.

The default AllowVnetInBound rule would have allowed the container>proxy and proxy>VM traffic, but I had overridden it. So I needed to create rules to allow that traffic.
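
For reference, here is roughly what the resulting inbound rule set looked like, written as a hedged Bicep sketch. The rule names, priorities, and the 10.0.1.0/24 staging prefix are my own placeholders; use TCP 22 instead of 5986 if you are building a Linux VM.

// Sketch of the inbound rules on the staging subnet's NSG once the build succeeded.
// 10.0.1.0/24 stands in for the staging subnet prefix; names and priorities are examples.
resource nsg 'Microsoft.Network/networkSecurityGroups@2023-11-01' = {
  name: 'nsg-aib-staging'
  location: resourceGroup().location
  properties: {
    securityRules: [
      {
        // The documented rule: the proxy VM sits behind an Azure Load Balancer.
        name: 'AllowAzureLoadBalancerInbound'
        properties: {
          priority: 400
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefix: 'AzureLoadBalancer'
          sourcePortRange: '*'
          destinationAddressPrefix: '10.0.1.0/24'
          destinationPortRanges: [
            '60000'
            '60001'
          ]
        }
      }
      {
        // Private Endpoint/Container Instance to the proxy VM.
        name: 'AllowContainerToProxyInbound'
        properties: {
          priority: 410
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefix: '10.0.1.0/24'
          sourcePortRange: '*'
          destinationAddressPrefix: '10.0.1.0/24'
          destinationPortRange: '60000'
        }
      }
      {
        // Proxy VM to the build VM: WinRM for Windows, TCP 22 for Linux.
        name: 'AllowProxyToBuildVmInbound'
        properties: {
          priority: 420
          direction: 'Inbound'
          access: 'Allow'
          protocol: 'Tcp'
          sourceAddressPrefix: '10.0.1.0/24'
          sourcePortRange: '*'
          destinationAddressPrefix: '10.0.1.0/24'
          destinationPortRange: '5986'
        }
      }
      {
        // My customary catch-all at a low priority.
        name: 'DenyAllInbound'
        properties: {
          priority: 4000
          direction: 'Inbound'
          access: 'Deny'
          protocol: '*'
          sourceAddressPrefix: '*'
          sourcePortRange: '*'
          destinationAddressPrefix: '*'
          destinationPortRange: '*'
        }
      }
    ]
  }
}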

Lots of Speaking Activity

After a quiet few pandemic years with no in-person events and the arrival of twins, my in-person presentation activity was minimal. My activity has started to increase, and there have been plenty of recent events and more are scheduled for the near future.

The Recent Past

Experts Live Netherlands 2024

It was great to return to The Netherlands to present at Experts Live Netherlands. Many European IT workers know the Experts Live brand; Isidora Maurer (MVP) has nurtured & shepherded this conference brand over the years, starting with a European-wide conference and then working with others to branch it out to localised events that give more people a chance to attend. I presented at this event a few years ago but personal plans prevented me from submitting again until this year. And I was delighted to be accepted as a speaker.

Hosted in Nieuwegein, just a short train ride from Schiphol airport in Amsterdam (Dutch public transport is amazing) the conference featured a packed expo hall and many keen attendees. I presented my “Azure Firewall: The Legacy Firewall Killer” session to a standing-room-only room.

TechMentor Microsoft HQ 2024

The first conference I attended was WinConnections 2004 in Lake Las Vegas. That conference changed my career. I knew that TechMentor had become something like that – the quality of the people I knew who were presenting at the event in the past was superb. I had the chance to submit some sessions this time around and was happy to have 3 accepted, including a pre-conference all-day seminar.

I worked my tail off on that “pre-con”. It’s an expansion of one of my favourite sessions that many events are scared of, probably because they think it’s too niche or too technical: “Routing – The Virtual Cabling of Secure Azure Networking”. Expanding a 1 hour session to a full day might seem daunting but I had to limit how much content I included! Plus I had to make this a demo session. I worked endless hours on a Bicep deployment to build a demo lab for the attendees. This was necessary because it would take too long to build by hand. I had issues with Azure randomly failing and with version stuff changing inside Microsoft. As one might expect, the demo gods were not kind on the day and I quickly had to pivot from hands-on labs to demos. While the questions were few during the class, there were lots of conversations during the breaks and even on the following days.

My second session was “Azure Firewall: The Legacy Firewall Killer” – this is a popular session. I like doing this topic because it gives me a chance to crack a few jokes – my family will groan at that thought!

My final session was the one that I was most worried about. “Your Azure Migration Project Is Doomed To FAIL” was never accepted by any event before. I think the title might seem negative but it’s meant to be funny. The content is based on my experience dealing with mid-large organisations who never quite understand the difference between cloud migration and cloud adoption. I explain this through several fictional stories. There is liberal use of images from Unsplash and opportunities for some laughs. This had been the session that I was least confident in, but it worked.

TechMentor puts a lot of effort into mixing the attendees and the presenters. On the first night, attendees and presenters went to a local pizza place/bar and sat in small booths. We had to talk to each other. The booth that I was at featured people from all over the USA with different backgrounds. People came and went, but we talked and were the last to leave. On the second day, lunch was an organised affair where each presenter was host to a table. Attendees could grab lunch and sit with a presenter to discuss what was on their minds. I knew that migrations were a hot topic. And I also knew that some of those attendees were either doing their first migration or first re-attempt at a migration. I was able to tune my session a tiny bit to the audience and it hit home. I think the best thing about this was the attention I saw in the room, the verbal feedback that I heard just after the session, and the folks who came up to talk to me after.

A Break

I brought my family to the airport the day before I flew to TechMentor. They were going to Spain for 4 weeks and I joined them a few days later after a l-o-n-g Seattle-Los Angeles-Dublin-Alicante journey (I really should have stayed one extra night in Seattle and taken the quicker 1-hop via Iceland).

33+ Celsius heat, sunshine, a pool, and a relaxed atmosphere in a residential town (we didn’t go to a “hotel town”) made it a great place to work for a week and then do two weeks of vacation.

I went running most mornings, doing 5-7KMs. I enjoy getting up early in places like this, picking a route to explore on a map, and hitting the streets to discover the locality and places to go with my family. It’s so different to home where I have just two routes with footpaths that I can use.

Coming home was a shock. Ireland isn’t the sunniest or the warmest place in the world, but it feels like mid-winter at the moment. I think I acclimatised to Spain as much as a pasty Irish person can. This morning I even had to put a jacket on and do a couple of KMs to wait for my legs to warm up before picking up the pace.

Upcoming Events

There are three confirmed events coming up:

Nieuwegein (Netherlands) September 11: Azure Fest 2025

I return to this Dutch city in a few days to do a new session “Azure Virtual Network Manager”. I’ve been watching this product develop since the private preview. It’s not quite ready (pricing is hopefully being fixed) but it could be a complete game changer for managing complex Azure networks for secure/compliant PaaS and IaaS deployments. I’ll discuss and demo the product, sharing what I like and don’t like.

Dublin (Ireland) October 7: Microsoft Azure Community & AI Day

Organised by Nicolas Chang (MVP) this event will feature a long list of Irish MVPs discussing Azure and AI in a rapid-fire set of short sessions. I don’t think that the event page has gone live yet so watch out for it. I will be presenting the “Azure Virtual Network Manager” again at this event.

TBA: Nordics

I’ve confirmed my speaking slots for 2 sessions at an event that has not publicly announced the agenda yet. I look forward to heading north and sharing some of my experiences.

My Sessions

If you are curious, then you can see my Sessionize public profile here, which is where you’ll see my collection of available sessions.

Azure Back To School 2024 – Govern Azure Networking Using Azure Virtual Network Manager

This post about Azure Virtual Network Manager is a part of the online community event, Azure Back To School 2024. In this post, I will discuss how you can use Azure Virtual Network Manager (AVNM) to centrally manage large numbers of Azure virtual networks in a rapidly changing/agile and/or static environment.

Challenges

Organisations around the globe have a common experience: dealing with a large number of networks that rapidly appear/disappear is very hard. If those networks are centrally managed then there is a lot of re-work. If the networks are managed by developers/operators then there is a lot of governance/verification work.

You need to ensure that networks are connected and are routed according to organisation requirements. Mandatory security rules must be put in place to either allow required traffic or to block undesired flows.

That wasn’t a big deal in the old days when there were maybe 3-4 huge, overly trusting subnets in the data centre. Network designs change when we take advantage of the ability to transform when deploying to the cloud; we break those networks down into much smaller Azure virtual networks and implement micro-segmentation. This approach introduces simplified governance and a superior security model that can reliably build barriers to advanced persistent threats. Things sound better until you realise that there are now many more networks and subnets than there ever were in the on-premises data centre, and each one requires management.

This is what Azure Virtual Network Manager was created to help with.

Introducing Azure Virtual Network Manager

AVNM is not a new product but it has not gained a lot of traction yet – I’ll get into that a little later. Spoiler alert: things might be changing!

The purpose of AVNM is to centralise configuration of Azure virtual networks and to introduce some level of governance. Don’t get me wrong, AVNM does not replace Azure Policy. In fact, AVNM uses Azure Policy to do some of the leg work. The concept is to bring a network-specialist toolset to the centralised control of networks instead of using a generic toolset (Azure Policy) that can be … how do I say this politely … hmm … mysterious and a complete pain in the you-know-what to troubleshoot.

AVNM has a growing set of features to assist us:

  • Network groups: A way to identify virtual networks or subnets that we want to manage.
  • Connectivity configurations: Manage how multiple virtual networks are connected.
  • Security admin rules: Enforce security rules at the point of subnet connection (the NIC).
  • Routing configurations: Deploy user-defined routes by policy.
  • Verifier: Verify that the networks allow the required flows.

Deployment Methodology

The approach is pretty simple:

  1. Identify a collection of networks/subnets you want to configure by creating a Network Group.
  2. Build a configuration, such as connectivity, security admin rules, or routing.
  3. Deploy the configuration targeting a Network Group and one or more Azure regions.

The configuration you build will be deployed to the network group members in the selected region(s).

Network Groups

Network groups are at the heart of AVNM’s scalable configuration. You will probably build several or many network groups, each collecting a set of subnets or networks that have some common configuration requirement. This means that you can have a large collection of targets for one configuration deployment.

Network Groups can be:

  • Static: You manually add specific networks to the group. This is ideal for a limited and (normally) unchanging set of targets to receive a configuration.
  • Dynamic: You will define a query based on one or more parameters to automatically discover current and future networks. The underlying mechanism that is used for this discovery is Azure Policy – the query is created as a policy and assigned to the scope of the query.

Dynamic groups are what you should end up using most of the time. For example, in a governed environment, Azure resources are often tagged. One can query virtual networks with specific tags and in specific Azure regions and have them automatically appear in a network group. If a developer/operator creates a new network, governance will kick in and tag those networks. Azure Policy will discover the networks and instantly inform AVNM that a new group member was discovered – any configurations applied to the group will be immediately deployed to the new network. That sounds pretty nice, right?
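
To give a feel for the building blocks, here is a hedged Bicep sketch of a network manager with one network group. I am showing static membership for brevity (the dynamic variant is the Azure Policy query described above), the names and scopes are made up, and the property names are from my memory of the API, so treat this as a starting point rather than a reference:

resource avnm 'Microsoft.Network/networkManagers@2023-11-01' = {
  name: 'avnm-demo'
  location: resourceGroup().location
  properties: {
    networkManagerScopes: {
      subscriptions: [
        subscription().id
      ]
      managementGroups: []
    }
    networkManagerScopeAccesses: [
      'Connectivity'
      'SecurityAdmin'
    ]
  }
}

resource spokesGroup 'Microsoft.Network/networkManagers/networkGroups@2023-11-01' = {
  parent: avnm
  name: 'ng-spokes'
  properties: {
    description: 'Spoke virtual networks that receive the standard configurations'
  }
}

// Static member shown for brevity; a dynamic group is populated by an Azure Policy query instead.
resource spokeMember 'Microsoft.Network/networkManagers/networkGroups/staticMembers@2023-11-01' = {
  parent: spokesGroup
  name: 'member-spoke1'
  properties: {
    resourceId: resourceId('Microsoft.Network/virtualNetworks', 'vnet-spoke1')
  }
}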

Connectivity Configurations

Before we continue, I want you to understand that virtual network peering is not some magical line or pipe. It’s simply an instruction to the Azure network fabric saying “NIC collection A can now talk with NIC collection B”.

We often want to either simplify the connectivity of networks or to automate desired connectivity. Doing this at scale can be done using code, but doing it in an agile environment requires trust. Failure usually happens between the chair and the keyboard, so we want to automate desired connectivity, especially when that connectivity enables integration or plays a role in security/compliance.

Connectivity Configurations enable three types of network architecture:

  • Hub-and-spoke: This is the most common design I see being required and the only one I’ve ever implemented for mid-large clients. A central regional hub is deployed for security/transit. Workloads/data are placed in spokes and are peered only with the hub (the network core). A router/firewall is normally (not always) the next hop to leave a spoke.
  • Full mesh: Every virtual network is connected directly to every other virtual network.
  • Hub-and-spoke with mesh: All spokes are connected to the hub. All spokes are connected to each other. Traffic to/from the outside world must go through the hub. Traffic to other spokes goes directly to the destination.

Mesh is interesting. Why would one use it? Normally one would not – a firewall in the hub is a desirable thing to implement micro-segmentation and advanced security features such as Intrusion Detection and Prevention System (IDPS). But there are business requirements that can override security for limited scenarios. Imagine you have a collection of systems that must integrate with minimised latency. If you force a hop through a firewall then latency will potentially be doubled. If that firewall is deemed an unnecessary security barrier for these limited integrations by the business, then this is a scenario where a full mesh can play a role.

This is why I started off discussing peering. Whether a system is in the same subnet/network or not, it doesn’t matter. The physical distance matters, not the virtual distance. Peering is not a cable or a connection – it’s just an instruction.

However, Virtual Network Peering is not even used in mesh! Mesh uses something different, called a Connected Group, which can handle the scale of many interconnected virtual networks. One configuration inter-connects all the virtual networks without having to create 1-1 peerings between many virtual networks.

A very nice option with this configuration is the ability to automatically remove pre-existing peering connections to clean up unwanted previous designs.
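
Here is a hedged Bicep sketch of a hub-and-spoke Connectivity Configuration. The hub VNet name and the network group are assumptions carried over from my earlier examples, and the property names are from my recollection of the API:

resource avnm 'Microsoft.Network/networkManagers@2023-11-01' existing = {
  name: 'avnm-demo'
}

resource spokesGroup 'Microsoft.Network/networkManagers/networkGroups@2023-11-01' existing = {
  parent: avnm
  name: 'ng-spokes'
}

resource hubAndSpoke 'Microsoft.Network/networkManagers/connectivityConfigurations@2023-11-01' = {
  parent: avnm
  name: 'cc-hub-and-spoke'
  properties: {
    connectivityTopology: 'HubAndSpoke' // 'Mesh' creates a Connected Group instead
    hubs: [
      {
        resourceId: resourceId('Microsoft.Network/virtualNetworks', 'vnet-hub')
        resourceType: 'Microsoft.Network/virtualNetworks'
      }
    ]
    appliesToGroups: [
      {
        networkGroupId: spokesGroup.id
        groupConnectivity: 'None' // 'DirectlyConnected' adds the spoke-to-spoke mesh
        useHubGateway: 'True'
        isGlobal: 'False'
      }
    ]
    deleteExistingPeering: 'True' // the clean-up option mentioned above
  }
}

Remember that defining the configuration is not enough on its own; it still has to be deployed to a Network Group and one or more regions, which is the deployment step from the methodology above.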

Security Admin Rules

What is a Network Security Group (NSG) rule? It’s a Hyper-V port ACL that is implemented at the NIC of the virtual machine (yours or in the platform hosting your PaaS service). The subnet or NIC association is simply a scaling/targeting system; the rules are always implemented at the NIC where the virtual switch port is located.

NSGs do not scale well. Imagine you need to deploy a rule to all subnets/NICs to allow/block a flow. How many edits will you need to do? And how much time will you waste on prioritising rules to ensure that your rule is processed first?

Security Admin Rules are also implemented using Port ACLs, but they are always processed first. You can create a rule or a set of rules and deploy it to a Network Group. All NICs will be updated and your rules will always be processed first.

Tip: Consider using VNet Flow Logs to troubleshoot Security Admin Rules.
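
As a sketch of what that looks like in Bicep (again hedged – the resource hierarchy and property names are from my memory and the names are made up), a configuration contains rule collections, which target network groups and contain the rules themselves:

resource avnm 'Microsoft.Network/networkManagers@2023-11-01' existing = {
  name: 'avnm-demo'
}

resource secConfig 'Microsoft.Network/networkManagers/securityAdminConfigurations@2023-11-01' = {
  parent: avnm
  name: 'sac-baseline'
  properties: {
    description: 'Organisation-wide rules that are always processed before NSG rules'
  }
}

resource ruleCollection 'Microsoft.Network/networkManagers/securityAdminConfigurations/ruleCollections@2023-11-01' = {
  parent: secConfig
  name: 'rc-baseline'
  properties: {
    appliesToGroups: [
      {
        networkGroupId: resourceId('Microsoft.Network/networkManagers/networkGroups', 'avnm-demo', 'ng-spokes')
      }
    ]
  }
}

// Evaluated at every NIC in the group, before any NSG rule.
resource denyRdpFromInternet 'Microsoft.Network/networkManagers/securityAdminConfigurations/ruleCollections/rules@2023-11-01' = {
  parent: ruleCollection
  name: 'DenyRdpFromInternet'
  kind: 'Custom'
  properties: {
    priority: 100
    direction: 'Inbound'
    access: 'Deny'
    protocol: 'Tcp'
    sources: [
      {
        addressPrefixType: 'ServiceTag'
        addressPrefix: 'Internet'
      }
    ]
    destinations: [
      {
        addressPrefixType: 'IPPrefix'
        addressPrefix: '0.0.0.0/0'
      }
    ]
    sourcePortRanges: [
      '0-65535'
    ]
    destinationPortRanges: [
      '3389'
    ]
  }
}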

Routing Configurations

This is one of the newer features in AVNM and was a technical blocker for me until it was introduced. Routing plays a huge role in a security design, forcing traffic from the spoke through a firewall in the hub. Typically, in VNet-based hub deployments, we place one user-defined route (UDR) in each subnet to make that flow happen. That doesn’t scale well and relies on trust. Some have considered using BGP routing to accomplish this but that can be easily overridden after quite a bit of effort/cost to get the route propagated in the first place.
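
For context, this is the kind of user-defined route we stamp into every spoke subnet today, and it is what a routing configuration can now generate for us. A simple sketch – the firewall’s private IP and the names are examples:

// The classic "everything via the hub firewall" user-defined route that a
// routing configuration can now stamp out for us. Example values only.
resource rtSpoke 'Microsoft.Network/routeTables@2023-11-01' = {
  name: 'rt-spoke-workload'
  location: resourceGroup().location
  properties: {
    disableBgpRoutePropagation: true
    routes: [
      {
        name: 'default-via-firewall'
        properties: {
          addressPrefix: '0.0.0.0/0'
          nextHopType: 'VirtualAppliance'
          nextHopIpAddress: '10.0.0.4' // example: the hub firewall's private IP
        }
      }
    ]
  }
}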

AVNM introduced a preview to centrally configure UDRs and deploy them to Network Groups with just a few clicks. There are a few variations on this concept to decide how granular you want the resulting Route Tables to be:

  • One is shared with virtual networks.
  • One is shared with all subnets in a virtual network.
  • One per subnet.

Verification

This is a feature that I’m a little puzzled about and I am left wondering if it will play a role in some other future feature. The idea is that you can test your configurations to ensure that they are working. There is a LOT of cross-over with Network Watcher and there is a common limitation: it only works with virtual machines.

What’s The Bad News?

Once routing configurations go generally available, I would want to use AVNM in every deployment that I do in the future. But there is a major blocker: pricing. AVNM is priced per subscription at $73/month. For those of you with a handful of subscriptions, that’s not much at all. But for those of us who saw that the subscription is a natural governance boundary and use LOTS of subscriptions (like in Microsoft Cloud Adoption Framework), this is a huge deal – it can make AVNM the most expensive thing we do in Azure!

The good news is that the message has gotten through to Microsoft and some folks in Azure networking have publicly commented that they are considering changes to the way that the pricing of AVNM is calculated.

The other bit of bad news is an oldie: Azure Policy. Dynamic network group membership is built by Azure Policy. If a new virtual network is created by a developer, it can be hours before policy detects it and informs AVNM. In my testing, I’ve verified that once AVNM sees the new member, it triggers the deployment immediately, but the use of Azure Policy does create latency, enabling some bad practices to be implemented in the meantime.

Summary

I was a downer on AVNM early on. But recent developments and some of the ideas that the team is working on have won me over. The only real blocker is pricing, but I think that the team is serious about fixing that. I stated earlier that AVNM hasn’t gotten a lot of traction. I think that this should change once pricing is fixed and routing configurations are GA.

I recently demonstrated using AVNM to build out the connectivity and routing of a hub-and-spoke with micro-segmentation at a conference. Using Azure Portal, the entire configuration probably took less than 10 minutes. Imagine that: 10 minutes to build out your security and compliance model for now and for the future.

Azure Route Server Saves The Day

In this post, I will discuss a recent scenario where we used Azure Route Server branch-to-branch routing to rescue a client.

The Original Network Design

This client is a large organisation with a global footprint. They had a previous WAN design that was out of scope for our engagement. The heart of the design was Meraki SD-WAN, connecting their global locations. I like Meraki – it’s relatively simple and it just works – that’s coming from me, an Azure networking person with little on-premises networking experience.

The client started using the services of a cloud provider (not Microsoft). The client followed the guidance of the vendor and deployed a leased line connection to a cloud region that was close to their headquarters and to their own main data centre. The leased line provides low latency connectivity between applications hosted on-premises and applications/data hosted in the other cloud.

Adding Azure

The customer wanted to start using Azure for general compute/data tasks. My employer was engaged to build the original footprint and to get them started on their journey.

I led the platform build-out, delegating most of the hands-on and focusing on the design. We did some research and determined the best approach to integrate with the other cloud vendor was via ExpressRoute. The Azure footprint was placed in an Azure region very close to the other vendor’s region.

An ExpressRoute circuit was deployed to connect the other cloud to a VNet-based hub in Azure – always my preference because of the scalability, security/governance concepts, and the superiority over Virtual WAN hub when it comes to flexibility and troubleshooting. The Meraki solution from the Azure Marketplace was added to the hub to connect Azure to the SD-WAN, and BGP propagation with Azure was enabled using Azure Route Server. To be honest – that was relatively simple.

The customer now had two clouds, plus an interconnect:

  • The other vendor via a leased line.
  • Azure via SD-WAN.
  • And an interconnect between Azure and the other cloud via ExpressRoute.

Along Came a Digger

My day-to-day involvement with the client was over months previously. I got a message early one morning from a colleague. The client was having a serious networking issue and could I get online. The issue was that an excavator/digger had torn up the lines that provided connectivity between the client’s data centre and the other cloud.

Critical services in the other Cloud were unavailable:

  • App integration and services with the on-premises data centre.
  • App availability to end users in the global offices.

I thought about it for a short while and checked out my theory online. One of the roles of Azure Route Server is to enable branch-to-branch connectivity between “on-premises” locations connected via ExpressRoute/VPN.

Forget that the other cloud is a cloud – think of the other cloud’s region as an on-premises site that is connected via ExpressRoute and the above Microsoft diagram makes sense – we can interconnect the two locations via BGP propagation through Azure Route Server:

  • The “on-premises” location via ExpressRoute
  • The SD-WAN via the Meraki which is already peered with Azure Route Server

I presented the idea to the client. They processed the information quickly and the plan was implemented quickly. How quickly? It’s one setting in Azure Route Server!

The Solution

The workaround was to use Azure as a temporary route to the other Cloud. The client had routes from their data centre and global offices to Azure via the Meraki SD-WAN. BGP routes were propagating between the SD-WAN connected locations, thanks to the peering between the Meraki NVA in the Azure hub and Azure Route Server.

BGP routes were also propagating between the other cloud and Azure thanks to ExpressRoute.

The BGP routes that did exist between the SD-WAN and the other cloud were gone because the leased line was down – and was going to be down for some time.

We wanted to fill the gap – get routes from the other cloud and the SD-WAN to propagate through Azure. If we did that then the SD-WAN locations and the other cloud could route via the Meraki and the ExpressRoute gateway in the Azure Hub – Azure would become the gateway between the SD-WAN and the other cloud.

The solution was very simple: enable branch-to-branch connectivity in Azure Route Server. There’s a little wait when you do that and then you run a command to check the routes that are being advertised to the Route Server peer (the Meraki NVA in this case).
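If you prefer to script the change, a minimal PowerShell sketch looks something like this – the resource group, Route Server, and peering names are hypothetical, and you should confirm the parameters against the Az.Network documentation for your module version:

```powershell
# Enable branch-to-branch connectivity so that routes learned from the
# ExpressRoute gateway and from BGP-peered NVAs are re-advertised to each other.
Update-AzRouteServer -ResourceGroupName 'hub-rg' `
    -RouteServerName 'hub-routeserver' `
    -AllowBranchToBranchTraffic

# Give it a few minutes, then check what the Route Server is advertising
# to the Meraki NVA peering.
Get-AzRouteServerPeerAdvertisedRoute -ResourceGroupName 'hub-rg' `
    -RouteServerName 'hub-routeserver' `
    -PeerName 'meraki-nva'
```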

The result was near instant. Routes were advertised. We checked Azure Monitor metrics on the ExpressRoute circuit and could see a spike in traffic that coincided with the change. The plan had worked.

The Results

I had not heard anything for a while. This morning, I learned that the client was happy with the fix. In fact, the user experience was faster.

Go back to the original diagram before Azure and I can explain. Users are located in the branch offices around the world. Their client applications are connecting to services/data in the other cloud. Their route is a “backhaul”:

  1. SD-WAN to central data centre
  2. Leased line over long distance to the other cloud

When we introduced the “Azure bypass” after the leased line failure, a new route appeared for end users:

  1. SD-WAN to Azure
  2. A very short distance hop over ExpressRoute

Latency was reduced quite a bit, so the user experience improved. On the other hand, latency between the on-premises data centre and the other cloud has increased because the SD-WAN adds a new hop, but at least the path is available. The original leased line is still down after a few weeks – and that is not the fault of the client!

Some Considerations

Ideally, one would have two leased lines in place for failover. That incurs extra cost and it was not possible in this case. What about Azure ExpressRoute Metro? That is still in preview at this time and is not available in the Azure metro in question.

However, this workaround has offered a triangle of connectivity. When the leased line is repaired, I will recommend that the triangle becomes their failover – if any one path fails, the other two will take its place, bringing the automatic recoverability that was part of the concept of the original ARPANET.

The other change is that the other cloud should become another site in the Meraki SD-WAN to improve the user app experience.

If we keep branch-to-branch connectivity then we need to answer the question: what is the best path? For example, we want the data centre to route directly to the other cloud when the leased line is available because that offers the lowest latency. But what if a route via Azure is accidentally preferred? We need control.

In Azure Route Server, we have the option to control connectivity from the Azure perspective (my focus):

  • (Default) Prefer ExpressRoute: Any routes received over ExpressRoute will be preferred. This would offer sub-optimal routes because on-premises prefixes will also be received from the other cloud over ExpressRoute.
  • Prefer VPN: Any routes received over VPN will be preferred. This would offer sub-optimal routes because the other cloud’s prefixes will also be received from on-premises over the SD-WAN.
  • Use AS Path: Let the admin/network team advertise a preferred path. This offers the desired control – “use this path unless something goes wrong”. A sketch of making this change follows this list.
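If you manage the Route Server with PowerShell, the change might look like the sketch below. Treat it as a sketch: the names are hypothetical and the -HubRoutingPreference parameter is my assumption of how the Az.Network module exposes the routing preference feature, so verify it against the current documentation.

```powershell
# Assumption: Update-AzRouteServer exposes routing preference via
# -HubRoutingPreference (values ExpressRoute, VpnGateway, ASPath).
# Switch from the default (ExpressRoute) to AS path so that the advertised
# AS path decides which route wins when a prefix is learned from both sides.
Update-AzRouteServer -ResourceGroupName 'hub-rg' `
    -RouteServerName 'hub-routeserver' `
    -HubRoutingPreference 'ASPath'
```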

Azure’s Software Defined Networking

In this post, I will explain why Azure’s software-defined networking (virtual networks) differs from the cable-defined networking of on-premises networks.

Background

Why am I writing this post? I guess that this one has been a long time coming. I noticed a trend early in my working days with Azure. Most of the people who work with Azure from the infrastructure/platform point of view are server admins. Their work includes doing all of the resource stuff you’d expect, such as Azure SQL, VMs, App Services, … virtual networks, Network Security Groups, Azure Firewall, routing, … wait … isn’t that networking stuff? Why isn’t the network admin doing that?

I think the answer to that question is complicated. A few years ago, I added a question for the audience to some of my presentations on Azure networking. I asked who was an ON-PREMISES networking admin versus an ON-PREMISES something-else. And then I said: “the ‘server admins’ are going to understand what I will teach more easily than the network admins will”. I could see many heads nodding in agreement. Network admins typically struggle with Azure networking because it is very different.

Cable-Defined Networking

Normally, on-premises networking is “cable-defined”. That phrase means that packets go from source to destination based on physical connections. Those connections might be indirect:

  • Appliances such as routers decide what turn to take at a junction point
  • Firewalls either block or allow packets
  • Other appliances might convert signals from electrons to photons or radio waves.

A connection is always there and, more often than not, it’s a cable. Cables make packet flow predictable.

Look at the diagram of your typical on-premises firewall. It will have ethernet ports for different types of networks:

  • External
  • Management
  • Site-to-site connectivity
  • DMZ
  • Internal
  • Secure zone

Each port connects to a subnet that belongs to a certain network zone. Each subnet has one or more switches that only connect to servers in that subnet. The switches have uplinks to the appropriate port in the firewall, thus defining the security context of that subnet. It also means that traffic from a server in the DMZ network must pass through the firewall, via the cable to the firewall, to reach another subnet.

In short, if a cable does not make the connection, then the connection is not possible. That makes things very predictable – you control the security and performance model by connecting or not connecting cables.

Software-Defined Networking

Azure is a cloud, and as a cloud, it must enable self-service. Imagine being a cloud subscriber and having to open a support call to create a network or a subnet. Maybe you need to wait 3 days while some operators plug in cables and run Cisco commands. Or the operators need to order more switches because they’ve run out of capacity, and you might need to wait weeks. Is this the hosting of the 2000s or is it The Cloud?

Azure’s software-defined networking enables the customer to run a command themselves (via the Portal, script, infrastructure-as-code, or API) to create and configure networks without any involvement from Microsoft staff. If I need a new network, a subnet, a firewall, a WAF, or almost anything networking in Azure (with the exception of a working ExpressRoute circuit) then I don’t need any human interaction from a support staff member – I do it and have the resource anywhere from a few seconds to 45 minutes later, depending on the resource type.
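As an illustration of that self-service, here’s a minimal PowerShell (Az module) sketch that creates a virtual network with a customer-chosen address space – the resource group, names, location, and prefixes are hypothetical:

```powershell
# Define a subnet and a virtual network with an address space of my choosing.
# The prefix has nothing to do with Microsoft's physical network.
$subnet = New-AzVirtualNetworkSubnetConfig -Name 'snet-app' -AddressPrefix '10.0.1.0/24'

New-AzVirtualNetwork -Name 'vnet-spoke1' `
    -ResourceGroupName 'network-rg' `
    -Location 'westeurope' `
    -AddressPrefix '10.0.0.0/16' `
    -Subnet $subnet
```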

This is because the physical network of Azure is overlaid with a software-defined network based on VXLAN. In simple terms, you have no visibility of the physical network. You use simulated networks that hide the underlying complexities, scale, and addressing. You create networks of your own address/prefix choice and use them. Your choice of addresses affects only your networks because it has nothing to do with how packets route at the physical layer – that is handled by traditional networking and is a matter only for the operators of the Microsoft global network/Azure.

A diagram helps … and here’s one that I use in my Azure networking presentations.

In this diagram, we see a source and a destination running in Azure. In case you were not aware:

  • Just about everything in Azure runs in a virtual machine, even so-called serverless computing. That virtual machine might be hidden in the platform but it is there. Exceptions might include some very expensive SKUs for SAP services and Azure VMware hosts.
  • The hosts for those virtual machines are running (drumroll please) Hyper-V, which, as one may now be forced to agree, is scalable 😀

The source wants to send a packet to a destination. The source is connected to a Virtual Network and has the address of 10.0.1.4. The destination is connected to another virtual network (the virtual networks are peered) and has an address of 10.10.1.4. The virtual machine guest OS sends the packet to the NIC where the Azure fabric takes over. The fabric knows what hosts the source and destination are running on. The packet is encapsulated by the fabric – the letter is put into a second envelope. The envelope has a new source address, that of the source host, and a new destination, the address of the destination host. This enables the packet to traverse the physical network of Microsoft’s data centres even if 1000s of tenants are using the 10.x.x.x prefixes. The packet reaches the destination host where it is decapsulated, unpacking the original packet and enabling the destination host to inject the packet into the NIC of the destination.

This is why you cannot implement GRE networking in Azure.

Virtual Networks Aren’t What You Think

The software-defined networking in Azure maintains a mapping. When you create a virtual network, a new map is created. It tells Azure that NICs (your explicitly created NICs or those of platform resources that are connected to your network) that connect to the virtual network are able to talk to each other. The map also tracks what Hyper-V hosts the NICs are running on. The purpose of the virtual network is to define what NICs are allowed to talk to each other – to enforce the isolation that is required in a multi-tenant cloud.

What happens when you peer two virtual networks? Does a cable monkey run out with some CAT6 and create a connection? Is the cable monkey creating a virtual connection? Does that connection create a bottleneck?

The answer to that last question is a hint as to what happens when you implement virtual network peering. The speed of connections between a source and destination in different virtual networks is the potential speed of their NICs – the slowest NIC (actually the slowest VM, based on things like RSS/VMQ/SR-IOV) in any source/destination flow is the bottleneck.

VNet peering does not create a “connection”. Instead, the mapping that is maintained by the fabric is altered. Think of it like a Venn diagram. VNet1 has a circle encompassing its NICs. VNet2 has a circle encompassing its NICs. Once you implement peering, a new circle is created that encompasses VNet1 and VNet2 – any source in VNet1 can talk directly (using encapsulation/decapsulation) to any destination in VNet2, and vice versa, without going through some resource in the virtual networks.
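If you script it, peering is just two resource updates – one on each virtual network – and no gateway, circuit, or “cable” is created. A minimal PowerShell sketch, with hypothetical names:

```powershell
# Look up the two virtual networks.
$vnet1 = Get-AzVirtualNetwork -Name 'vnet1' -ResourceGroupName 'network-rg'
$vnet2 = Get-AzVirtualNetwork -Name 'vnet2' -ResourceGroupName 'network-rg'

# Peering is directional, so it is created on both sides.
# Only the fabric's mapping changes; no connection resource carries the traffic.
Add-AzVirtualNetworkPeering -Name 'vnet1-to-vnet2' -VirtualNetwork $vnet1 -RemoteVirtualNetworkId $vnet2.Id
Add-AzVirtualNetworkPeering -Name 'vnet2-to-vnet1' -VirtualNetwork $vnet2 -RemoteVirtualNetworkId $vnet1.Id
```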

You might have noticed before now that you cannot ping the default gateway in an Azure virtual network. It doesn’t exist because there is no cable to a subnet appliance to reach other subnets.

You also might have noticed that tools like traceroute are pretty useless in Azure. That’s because the expected physical hops are not there. This is why tools like Test-NetConnection (Windows PowerShell) or Network Watcher Connection Troubleshoot/Connection Monitor are so important.
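For example, here’s a minimal sketch of how I’d check a flow instead of reaching for traceroute – the VM, resource group, and destination address/port are hypothetical:

```powershell
# From inside a guest OS: test TCP reachability rather than hop-by-hop paths.
Test-NetConnection -ComputerName 10.10.1.4 -Port 443

# From the platform: ask Network Watcher to verify connectivity from a VM to a
# destination; this also evaluates NSG rules and routes along the way.
# (The source VM needs the Network Watcher agent extension installed.)
$vm = Get-AzVM -ResourceGroupName 'app-rg' -Name 'vm-source'
$nw = Get-AzNetworkWatcher -Location $vm.Location
Test-AzNetworkWatcherConnectivity -NetworkWatcher $nw `
    -SourceId $vm.Id `
    -DestinationAddress '10.10.1.4' `
    -DestinationPort 443
```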

Direct Connections

Now you know what’s happening under the covers. What does that mean? When a packet goes from source to destination, there is no hop. Have a look at the diagram below.

It’s not an unusual diagram. There’s an on-prem network on the left that connects to Azure virtual networks using a VPN tunnel that is terminated in Azure by a VPN Gateway. The VPN Gateway is deployed into a hub VNet. There’s some stuff in the hub, including a firewall. Services/data are deployed into spoke VNets – the spoke VNets are peered with the hub.

One can immediately see that the firewall, in the middle, is intended to protect the Azure VNets from the on-premises network(s). That’s all good. But this is where the problems begin. Many will look at that diagram and think that this protection will just work.

If we take what I’ve explained above, we’ll understand what will really happen. The VPN Gateway is implemented in the platform as two Azure virtual machines. Packets will come in over the tunnel to one of those VMs. Then the packets will hit the NIC of the VM to route to a spoke VNet. What path will those packets take? There’s a firewall in the pretty diagram. The firewall is placed right in the middle! And that firewall is ignored. That’s because packets leaving the VPN Gateway VM will be encapsulated and go straight to the destination NIC in one of the spokes, as if they were teleported.

To get the flow that you require for security purposes, you need to understand Azure routing and implement the flow via either BGP or User-Defined Routing.
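As an example of the User-Defined Routing approach, here’s a minimal PowerShell sketch of a route table that steers traffic arriving in the GatewaySubnet towards a spoke prefix through the firewall – the names, prefixes, and firewall IP address are all hypothetical:

```powershell
# A route that sends traffic for the spoke prefix to the firewall's private IP
# instead of letting the fabric deliver it directly to the destination NIC.
$route = New-AzRouteConfig -Name 'spoke1-via-firewall' `
    -AddressPrefix '10.1.0.0/16' `
    -NextHopType 'VirtualAppliance' `
    -NextHopIpAddress '10.0.2.4'

$routeTable = New-AzRouteTable -Name 'rt-gatewaysubnet' `
    -ResourceGroupName 'hub-rg' `
    -Location 'westeurope' `
    -Route $route

# Associate the route table with the GatewaySubnet so that packets arriving
# over the VPN/ExpressRoute gateway are forced through the firewall.
$vnet = Get-AzVirtualNetwork -Name 'vnet-hub' -ResourceGroupName 'hub-rg'
Set-AzVirtualNetworkSubnetConfig -VirtualNetwork $vnet -Name 'GatewaySubnet' `
    -AddressPrefix '10.0.0.0/27' -RouteTable $routeTable
$vnet | Set-AzVirtualNetwork
```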

Now have a look at this diagram of a virtual appliance firewall running in Azure from Palo Alto.

Look at all those pretty subnets. What is the purpose of them? Oh, I know that there’s public, management, VPN, etc. But why are they all connecting to different NICs? Are there physical cables to restrict/control the flow of packets between some spoke virtual network and a DMZ virtual network? Nope. What forces packets to the firewall? Azure routing does. So what do those NICs in the firewall do? They don’t isolate, and they don’t add performance, because the VM size controls overall NIC throughput and speed. They just complicate!

The real reason for all those NICs is to simulate the eth0, eth1, etc. that are referenced by the Palo Alto software. It enables Palo Alto to keep the software consistent between on-prem appliances and their Azure Marketplace appliance. That’s it – it saves Palo Alto some money. Meanwhile, Azure Firewall uses a single IP address on the virtual network (via a Standard tier load balancer, though you might notice each compute instance’s IP as a source) and there is no sacrifice in security.

Wrapping Up

There have been countless times over the years when having some level of understanding of what is happening under the covers has helped me. If you grasp the fundamentals of how packets really get from A to B then you are better prepared to design, deploy, operate, or troubleshoot Azure networking.

Network Rules Versus Application Rules for Internal Traffic

This post is about using either Network Rules or Application Rules in Azure Firewall for internal traffic. I’m going to discuss a common scenario, a “problem” that can occur, and how you can deal with that problem.

The Rules Types

There are three kinds of rules in Azure Firewall:

  • DNAT Rules: Control traffic originating from the Internet, directed to a public IP address attached to Azure Firewall, and translated to a private IP address/port in an Azure virtual network. This is implicitly applied as a Network Rule. I rarely use DNAT Rules – most modern applications are HTTP/S and enter the virtual network(s) via an Application Gateway/WAF.
  • Application Rules: Control traffic going to HTTP, HTTPS, or MSSQL destinations (including Azure database services).
  • Network Rules: Control anything going anywhere. A sketch comparing the two internal-traffic options follows this list.
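To make the comparison concrete for internal traffic, here is a rough PowerShell sketch of the same flow expressed as an Application Rule and as a Network Rule in a firewall policy – the rule names, prefixes, FQDN, and IP address are hypothetical:

```powershell
# Application Rule: expressed in terms of protocol/port and destination FQDN.
$appRule = New-AzFirewallPolicyApplicationRule -Name 'allow-internal-app' `
    -SourceAddress '192.168.10.0/24' `
    -TargetFqdn 'app.internal.contoso.com' `
    -Protocol 'Https:443'

# Network Rule: expressed in terms of source, destination IP address, port,
# and protocol.
$netRule = New-AzFirewallPolicyNetworkRule -Name 'allow-internal-app' `
    -SourceAddress '192.168.10.0/24' `
    -DestinationAddress '10.10.1.4' `
    -DestinationPort 443 `
    -Protocol 'TCP'
```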

The Scenario

We have an internal client, which could be:

  • On-premises over private networking
  • Connecting via point-to-site VPN
  • Connected to a virtual network, either the same one as the Azure Firewall or a peered virtual network.

The client needs to connect, with SSL/TLS authentication, to a server. The server is connected to another virtual network/subnet. The route to the server goes through the Azure Firewall. I’ll complicate things by saying that the server is a PaaS resource with a Private Endpoint – this doesn’t affect the core problem but it makes troubleshooting more difficult 🙂

NSG rules and firewall rules have been accounted for and created. The essential connection is either HTTPS or MSSQL and is implemented as an Application Rule.

The Problem

The client attempts a connection to the server. You end up with some type of application error stating that there was either a timeout or a problem with SSL/TLS authentication.

You begin to troubleshoot:

  • Azure Firewall shows the traffic is allowed.
  • NSG Flow Logs show nothing – you panic until you remember/read that Private Endpoints do not generate flow logs – I told you that I’d complicate things 🙂 You can consider VNet Flow Logs to try to capture this data and then you might discover the cause.

You try a few things and discover two facts:

  • If you disconnect the NSG from the subnet then the connection works. Hmm – the NSG rules look correct, so the traffic should be allowed: traffic from the client prefix(es) is permitted to the server IP address/port, and the NSG rules match the firewall rules.
  • You change the Application Rule to a Network Rule (with the NSG still associated and unchanged) and the connection is allowed.

So, something is going on with the Application Rules.

The Root Cause

In this case, the Application Rule is SNATing the connection. In other words, when the connection is relayed from the Azure Firewall instance to the server, the source IP is no longer that of the client; the source IP address is that of a compute instance in the AzureFirewallSubnet.

That is why:

  • The connection works when you remove the NSG
  • The connection works when you use a Network Rule with the NSG – the Network Rule does not SNAT the connection.

Solutions

There are two solutions to the problem:

Using Application Rules

If you want to continue to use Application Rules then the fix is to modify the NSG rule: change the source IP prefix(es) to the AzureFirewallSubnet prefix.
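For example, a PowerShell sketch of what that NSG rule change might look like – the NSG name, rule name, priority, and the AzureFirewallSubnet prefix are hypothetical:

```powershell
# Rewrite the existing inbound allow rule so that its source is the
# AzureFirewallSubnet prefix, because the Application Rule SNATs the
# connection to one of the firewall's compute instances.
$nsg = Get-AzNetworkSecurityGroup -Name 'nsg-server' -ResourceGroupName 'app-rg'

Set-AzNetworkSecurityRuleConfig -NetworkSecurityGroup $nsg `
    -Name 'Allow-Https-From-Clients' `
    -Access Allow -Direction Inbound -Priority 200 -Protocol Tcp `
    -SourceAddressPrefix '10.0.2.0/26' -SourcePortRange '*' `
    -DestinationAddressPrefix '10.10.1.4' -DestinationPortRange '443'

$nsg | Set-AzNetworkSecurityGroup
```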

The downsides to this are:

  • The NSG rules are inconsistent with the Azure Firewall rules.
  • The NSG rules are no longer restricting traffic to documented approved clients.

Using Network Rules

My preference is to use Network Rules for all inbound and east-west traffic. Yes, we lose some of the “Layer-7” features but we still have core features, including IDPS in the Premium SKU.

In contrast to using Application Rules:

  • The NSG rules are consistent with the Azure Firewall rules.
  • The NSG rules restrict traffic to the documented approved clients.

When To Use Application Rules?

In my sessions/classes, I teach:

  • Use DNAT rules for the rare occasion where Internet clients will connect to Azure resources via the public IP address of Azure Firewall.
  • Use Application Rules for outbound connections to the Internet, including Azure resources reached via public endpoints, through the Azure Firewall (see the sketch after this list).
  • Use Network Rules for everything else.
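To illustrate the second point, a minimal PowerShell sketch of an outbound, FQDN-based Application Rule – the rule and collection names, source prefix, and FQDNs are hypothetical, and the collection still has to be placed into a rule collection group in your firewall policy:

```powershell
# Outbound Internet access suits Application Rules because the destination can
# be expressed as FQDNs rather than ever-changing public IP addresses.
$webRule = New-AzFirewallPolicyApplicationRule -Name 'allow-windows-update' `
    -SourceAddress '10.0.0.0/16' `
    -TargetFqdn '*.update.microsoft.com', '*.windowsupdate.com' `
    -Protocol 'Https:443'

# Wrap the rule in an allow collection; add this to a rule collection group
# in the firewall policy to deploy it.
$collection = New-AzFirewallPolicyFilterRuleCollection -Name 'outbound-web' `
    -Priority 200 -Rule $webRule -ActionType Allow
```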

This approach limits “weird sh*t” errors like the one described above and means that NSG rules are effectively clones of the Azure Firewall rules, with some additional rules to control traffic inside of the virtual network/subnet.