Evolving Okta’s edge infrastructure

Okta is constantly evolving our cloud infrastructure to meet the needs of our customers. We place reliability and scalability at the core of our design decisions for services that process billions of authentications per month. This article dives into how a recent project to remove one of our most heavily trafficked services yielded significant operational and reliability improvements.

A glimpse at Okta’s edge

In the past, three core services received and facilitated the majority of customer traffic to Okta’s Workforce Identity Cloud at the edge: an Application Load Balancer to protect against request floods, an Apache-based service for SSL termination, and an Nginx-based service for routing and business logic.

 

[Figure: Application Load Balancer in action]

These services are deployed globally and scaled to handle the large volumes of traffic that Okta processes daily. While the services are performant, their tight coupling has gradually become an operational challenge. Always looking to improve our infrastructure, a cross-functional team set out to reevaluate these services.

One too many?

Okta’s customers expect a performant and available service, and it is imperative that we meet these expectations. We routinely process countless login flows, but operating on the public internet invites the unexpected. Whether customer-initiated or otherwise, Okta at times receives large influxes of traffic in a short period of time.

In closely examining the services we operate, we determined that Apache’s per-thread bottleneck limited our ability to handle large influxes of traffic without impact and decided to remove this service from our edge entirely. This would move Nginx’s event-driven architecture forward in our stack as a better means of handling unpredictable traffic patterns.
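To illustrate the event-driven idea in general terms, here is a toy Python sketch (not Nginx itself, and not Okta's code) in which a single thread serves a burst of simulated connections. The function names and the simulated idle time are assumptions made for the example.

```python
import asyncio
import threading


async def handle_connection(conn_id: int, idle_seconds: float) -> int:
    # Awaiting hands control back to the event loop, so a slow or idle
    # connection does not tie up a thread while it waits.
    await asyncio.sleep(idle_seconds)
    return threading.get_ident()


async def serve_burst(connections: int, idle_seconds: float = 0.01) -> set:
    # Start every simulated connection at once and wait for all of them.
    tasks = [handle_connection(i, idle_seconds) for i in range(connections)]
    thread_ids = await asyncio.gather(*tasks)
    return set(thread_ids)


def run_burst(connections: int) -> set:
    # Returns the set of thread IDs that served the burst.
    return asyncio.run(serve_burst(connections))
```

Calling run_burst(500) returns a set with a single thread ID: all 500 simulated connections were multiplexed on one thread. A thread-per-connection model would need a worker for each connection that is open at the same time, which is the kind of limit that shows up during a sudden influx.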

 

[Figure: Application Load Balancer without Apache]

Increased reliability was the primary motivating factor, but terminating several hundred Apache servers would also significantly reduce operational toil. Between improved service reliability, a lower volume of system log aggregation, less operational toil, and cost savings, we stood to benefit greatly from decommissioning our Apache service.

Our team developed a robust choreography to shift traffic away from the Apache service and directly to Nginx, enabling us to quickly iterate on issues uncovered in test environments before gradually applying this change to our global production environments.
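The mechanics of the traffic shift are not described here, so the following is only a sketch of one common way to shift traffic gradually: deterministic percentage bucketing. The function name, the request key, and the use of SHA-256 are all assumptions for illustration.

```python
import hashlib


def routes_directly_to_nginx(request_key: str, rollout_percent: int) -> bool:
    """Decide whether a request bypasses Apache at a given rollout percentage.

    request_key is a hypothetical stable identifier (for example a tenant or
    connection ID). Hashing it gives the same bucket on every call.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < rollout_percent
```

Because each key always lands in the same bucket, raising the percentage only ever adds traffic to the new path. Anything already routed directly to Nginx at 10% stays there at 50%, which makes issues found along the way easier to reproduce.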

In testing

Okta’s Apache service runs a lightweight Java application with custom configurations to process incoming requests before passing them on to our Nginx service. In removing Apache, we had to ensure any functionality provided by the service was recreated within Nginx. Through synthetic testing, our team quickly identified several patterns of incoming requests that were no longer handled properly with the Apache service removed.

The “double slash” problem

Because Apache once served as the initial application to receive customer traffic, it contained logic developed over the years to identify and respond properly to malformed requests. As Okta’s edge infrastructure matured, this functionality largely moved earlier in the stack. Still, we quickly identified one issue with removing Apache: we were no longer able to properly handle requests containing //.

In testing environments without the Apache service, Nginx returned improper status codes for any request containing //. As a contrived example, API calls for /api/v1/users continued to behave as expected, but calls to //api/v1/users were observed to return client error HTTP responses.

Our Apache service had handled these requests with a simple rewrite rule. Without that rewrite, Nginx returned error codes, so we introduced an equivalent rewrite rule in Nginx to restore this functionality.
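As a rough illustration of what such a rewrite does, here is a minimal Python sketch of the normalization. It is not the actual Apache or Nginx rule, and the function name is hypothetical. It assumes only the path should be normalized and the query string left untouched.

```python
import re

# Two or more consecutive slashes.
_REPEATED_SLASHES = re.compile(r"/{2,}")


def collapse_slashes(request_target: str) -> str:
    """Collapse repeated slashes in the path portion of a request target."""
    # Split off the query string so that values such as "?next=//x"
    # are passed through unchanged.
    path, sep, query = request_target.partition("?")
    return _REPEATED_SLASHES.sub("/", path) + sep + query
```

With this sketch, collapse_slashes("//api/v1/users") yields "/api/v1/users", while a well-formed "/api/v1/users" passes through unchanged.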

With RFC 3986 in mind, this was our first brush with Hyrum's Law: with enough users, every observable behavior of a system ends up being depended on by somebody.

The “query string” problem

With a robust suite of synthetic tests to validate the resolution of the “double slash” problem, we began a phased rollout of the Apache service removal into our staging environments. As the amount of traffic processed without Apache gradually increased, we again observed Nginx returning improper response codes for certain requests previously processed by Apache without issue.

As before, we uncovered a case where Apache had been rewriting malformed requests into a format that Nginx could process. Against RFC 1738, Apache had been rewriting encoded query strings into decoded values. As an example, requests to /api/v1/users%3Flimit=1 were being decoded and passed to Nginx as /api/v1/users?limit=1. Without Apache in the request path, Nginx was unable to process encoded query strings and returned an error to the client originating the request. To address this, we introduced an additional rewrite rule in our Nginx configuration, and we were able to continue the rollout.
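The effect of that rewrite can be sketched in Python as follows. This is an illustration of the behavior described above, not the Nginx rule itself. The function name is hypothetical, and it assumes that only the first encoded question mark is treated as the query delimiter, and only when no literal ? is already present.

```python
import re

# Percent-encoded question mark, matched as %3F or %3f.
_ENCODED_QMARK = re.compile(r"%3F", re.IGNORECASE)


def decode_query_delimiter(request_target: str) -> str:
    """Turn the first encoded "?" into a real query-string delimiter."""
    if "?" in request_target:
        # A real delimiter already exists, so any %3F is part of a value.
        return request_target
    return _ENCODED_QMARK.sub("?", request_target, count=1)
```

Here decode_query_delimiter("/api/v1/users%3Flimit=1") yields "/api/v1/users?limit=1", matching the example above, while a request that already contains a literal ? is left alone.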

Several iterations later

Removing such a prominent service proved no simple feat but was ultimately achieved without sustained customer impact. This effort saw several iterations over time, but the focus on outcome remained the same:

  • Remove a performance bottleneck
  • Improve service reliability
  • Reduce operational toil

With this effort complete across all environments, the benefits have quickly become apparent, and the team is already planning our next improvements.
