Background
The lack of TLS certificates for websites has long been a problem, posing security risks for both a site's visitors and its owner. At GoDaddy, we are always on the lookout for opportunities to better secure our services. The GoDaddy Parking team supports about 80 million domains, and the list continues to grow daily. In this article, we discuss how we recently took on the interesting challenge of dynamically provisioning TLS certs for those domains while still serving content for them.
The following points illustrate some of the reasons why providing TLS certs for domains while serving content is difficult:
- We support about 80 million domains, and the set of domain names is changing at any given time.
- On average, the queries per second (QPS) hitting our parked pages are around 70,000, with peaks going over 140,000.
- A large share of that traffic is generated by bots.
- We also need to optimize ROI, which means keeping infrastructure costs under tight control.
First iteration
We originally tested using TLS certs provided by Let's Encrypt with a small number of domains. We liked their fast response time. However, as we slowly increased the number of certificate provisioning requests, we soon ran into throttling problems due to the number of domains we need to support. After discussions with both Let's Encrypt and the GoDaddy CA, we chose the GoDaddy CA as our certificate provider so we could support our ever-growing number of domains in a cost-efficient way.
The following architecture diagram shows a simplified version of the first iteration of our dynamic TLS cert provisioning system:
NGINX is open-source software well known for its ability to scale and handle huge amounts of incoming traffic, which makes it an ideal candidate for our parking application given the high QPS. We heavily utilize its rate limiting, ACL, and reverse proxy features with our own customizations. We separated the cert provisioning part into its own API layer, which requests certificates from the GoDaddy CA through GoDaddy's ACME product offering. With a single active region and two more in standby mode, we felt pretty good about this solution. And it worked! Or so we thought.
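To make that layout concrete, here is a minimal sketch of the NGINX pattern just described. The zone size, rate, upstream name, paths, and certificate locations are all hypothetical placeholders, not our production values:

```nginx
# A sketch only: zone sizes, rates, names, and paths are placeholders.
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;

# The separate cert provisioning API layer (hypothetical host and port).
upstream cert_api {
    server cert-provisioning.internal:8443;
}

server {
    listen 443 ssl;
    ssl_certificate     /etc/ssl/parking-default.crt;  # placeholder cert
    ssl_certificate_key /etc/ssl/parking-default.key;

    # ACL example: drop a known-abusive range (TEST-NET placeholder CIDR).
    deny 192.0.2.0/24;

    # Only requests that need a new cert are proxied to the API layer,
    # rate limited per client so bots can't overwhelm the backend.
    location /provision {
        limit_req zone=per_ip burst=20 nodelay;
        proxy_pass https://cert_api;
    }
}
```

The shape is what matters: NGINX absorbs the raw traffic, and only a small, rate-limited slice of requests ever reaches the compute-heavy provisioning layer.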
Bots and traffic spikes
Having worked on parking applications for a while, we expected bot activity and traffic spikes. But we were still surprised by how our cluster reacted when overwhelmed with TCP/IP requests. We watched the cluster desperately try to scale out to serve the increased traffic while both the newly spun-up and the existing nodes kept failing due to resource exhaustion, which in turn further reduced the number of available nodes: a downward spiral. We had to increase the cluster's initial size so it could handle well more than our normal traffic, and we also adjusted the auto-scaling policy. The system seemed to have stabilized.
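Our exact policy isn't the point here, but as a hedged sketch, the two adjustments on AWS EC2 Auto Scaling look roughly like this (the group name and all numbers are hypothetical):

```bash
# Raise the group's floor so steady-state capacity already absorbs spikes
# (group name and sizes are hypothetical).
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name parking-nginx \
  --min-size 30 --desired-capacity 30 --max-size 90

# Attach a target-tracking policy that scales out on average CPU
# before nodes reach resource exhaustion.
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name parking-nginx \
  --policy-name cpu-target-50 \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "TargetValue": 50.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ASGAverageCPUUtilization"
    }
  }'
```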
How much does this cost?!
And then we saw the projection of our annual AWS cost skyrocket to about $1.9M. For readers who are familiar with how AWS charges based on outgoing traffic volume, this probably isn't a surprise at all. We knew we had to do something, and our engineering team went on a mission to cut cost. Every cost detail was examined obsessively; at our traffic volumes, every little bit adds up quickly. The following are a few interesting actions that helped reduce costs and might help other engineering teams facing similar challenges:
- Close the connection when we don't have a TLS cert to serve yet. The default behavior for our server was to return a self-signed cert when we couldn't serve a TLS cert signed by our CA. Industry-standard TLS certificates carry keys of at least 2,048 bits, so even a self-signed cert that browsers reject anyway adds real bytes to every handshake. By terminating the connection when we don't have the cert, we managed to save $90K per year (see the NGINX sketch after this list).
- Separate the components so they can scale differently. Serving the initial TCP request and the actual TLS cert provisioning carry vastly different compute costs and see very different QPS, since we won't provision new certs for repeat visits. We decided to separate out the NGINX component to let it do what it is best known for: serving high traffic. This way, the most compute-heavy component, the TLS cert provisioning API layer, receives a far smaller QPS behind the rate-limiting and caching layer.
- Remove two non-essential HTML tags from the default page served out of NGINX. Our landers heavily utilize JavaScript, so removing the `noscript` HTML tag does not negatively impact our use case. We also removed part of the `meta` HTML tag. The reduction in content size from removing these two tags, multiplied by our high QPS, led to a very nice saving of $180K per year.
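For the first item in the list, one way to close the connection in NGINX (version 1.19.4 and later) is the `ssl_reject_handshake` directive, which aborts the TLS handshake instead of answering with a fallback certificate. A minimal sketch, with hypothetical domains and paths:

```nginx
# Catch-all: abort the handshake for any SNI name we have no cert for,
# instead of serving a self-signed fallback (NGINX 1.19.4+).
server {
    listen 443 ssl default_server;
    ssl_reject_handshake on;
}

# Domains with provisioned certs are matched by server_name as usual.
server {
    listen 443 ssl;
    server_name example-parked-domain.com;                    # hypothetical
    ssl_certificate     /etc/ssl/example-parked-domain.crt;   # placeholder
    ssl_certificate_key /etc/ssl/example-parked-domain.key;
}
```

Rejecting the handshake means no certificate bytes at all go out for names we can't serve yet, which is where the egress saving comes from.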
What does the current solution look like?
After experiencing the resource exhaustion and cost increases with our original plan, we ended up with the following simplified deployment model:
The TLS Terminator handles both HTTP and HTTPS traffic. When the firewall proxies HTTPS traffic to the terminator, it preserves the client's IP address by prepending a Proxy Protocol v2 header to the TCP connection. The terminator parses that header to recover the client's IP, while the proxy's own address is still the source of the TCP packets themselves, so the terminator knows both the client IP and the proxy IP.
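A minimal NGINX sketch of this arrangement, assuming the standard `ngx_http_realip_module` and placeholder addresses and paths:

```nginx
server {
    # Expect a PROXY protocol header on connections coming from the firewall.
    listen 443 ssl proxy_protocol;
    ssl_certificate     /etc/ssl/parked.crt;  # hypothetical placeholder
    ssl_certificate_key /etc/ssl/parked.key;

    # Only trust PROXY protocol information from the firewall's range
    # (10.0.0.0/8 is a placeholder, not our production CIDR).
    set_real_ip_from 10.0.0.0/8;
    real_ip_header proxy_protocol;

    # $proxy_protocol_addr is the client IP carried in the header; the
    # TCP connection's own source address remains the proxy's IP.
}
```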
Where we are and what the future looks like
We currently serve around 65 million active TLS certs and have expanded our TLS support to the GoDaddy Web Forwarding service and partner parked domains. This enables both GoDaddy and partner customers to have:
- enhanced data protection
- trust from their customers
- improved search engine ranking
- support for regulatory compliance
We look forward to working with more partners and are super excited to have this opportunity to help secure the internet, making the web a safer and more trustworthy place for everyone!