Why Netflix Rolled Its Own Node.js Functions-as-a-Service for its API Platform
Ever since hook.io introduced Functions-as-a-Service (FaaS) in 2014, developers have been seizing this new tech with two happy hands. It was the next horizon in the dream of serverless computing: an all-in-one “no-ops” platform that allows developers to develop, launch and manage application functionalities, only without the hassle of, you know, building the infrastructure that usually goes with them. Barely four years later, FaaS has become a turnkey tool in the cloud engineering kit, a standard built-in offering from cloud service providers in the form of AWS Lambda, Google Cloud Functions and Microsoft Azure Functions.
Engineers love the “no-ops” aspect of FaaS, which makes it possible to simply upload modular chunks of functionality onto the cloud provider of your choice and then execute them as isolated, reliable, and low-latency production services. Enterprises love that their devs can deploy code to production faster than ever before. Netflix, a company respected for being an early and extremely effective adopter of cloud native tech, happily embraced FaaS to keep the films flowing smoothly to its 130 million customers streaming 140 million hours of video each day.
The New Stack spoke with Yunong Xiao, a software engineer at Netflix and design/architecture lead for the Netflix API Platform, about the company’s experience rolling its own in-house FaaS capabilities for its API platform.
Netflix began building this solution in early 2015 before FaaS solutions were a built-in offering of cloud service providers. Now, as Netflix moves beyond the specific use case of their API Platform — and as cloud offerings have matured — they are enabling Function use cases broadly across the company, built on top of AWS Lambda.
What drove the decision to embrace FaaS for the Netflix API platform?
FaaS features are a perfect fit for the Netflix API Platform, which provides engineers the ability to write and deploy tier-1 services using JavaScript without having to manage infrastructure or operations. A JavaScript-based FaaS platform, which lets UI engineers deploy JavaScript functions as production services, meant we could deliver latency-sensitive services right in the heart of every request to the Netflix API Platform.
At what point did writing your own FaaS platform just seem like the right thing to do?
A few years ago we had a compelling use case for serverless in the Netflix API Platform team. We have customers in more than 190 countries, which means having many client teams who each own distinct UIs, and that requires rapid innovation coupled with high availability. At the same time, our client teams use the backend for frontend (BFF) pattern for their UIs, meaning there’s a bespoke service for each version of the UIs that they own. We design our product for innovation and have hundreds of A/B tests each year, each with many variants. To enable this type of rapid innovation, these BFFs are owned by the client teams themselves and typically get changed with every release.
It’s hard to design, build, and operate high-performance, low-latency, and highly available services, even for seasoned server engineers with years of experience. Expecting client engineers to own and operate services with these requirements would be unreasonable, since their core expertise is building superlative user interfaces.
FaaS and serverless allow each client team to offload the architecture and operations of their services to a common platform maintained by my team, the API platform team, and allow them to focus on just writing the business logic that differentiates each BFF from the next.
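To make that division of labor concrete, here is a minimal sketch of what a client team's function could look like. The interview does not describe NodeQuark's actual programming model, so the `(request, services)` handler signature, the `getRecommendations` call, and the stub data are all hypothetical and exist only for illustration.

```javascript
// Hypothetical BFF function for a TV UI. The (request, services) signature
// and every name below are illustrative assumptions, not NodeQuark's API.
async function tvHomeScreen(request, services) {
  const rows = await services.getRecommendations(request.profileId);
  // UI-specific business logic: this client shows at most two rows and
  // needs only a title and an image for each item.
  return rows.slice(0, 2).map((row) => ({
    name: row.name,
    items: row.items.map(({ title, image }) => ({ title, image })),
  }));
}

// Stand-in for the backend access a shared platform would provide.
const stubServices = {
  async getRecommendations(profileId) {
    return [
      { name: "Trending", items: [{ title: "Show A", image: "a.jpg", cast: ["x"] }] },
      { name: "New", items: [{ title: "Show B", image: "b.jpg", cast: ["y"] }] },
      { name: "Comedy", items: [{ title: "Show C", image: "c.jpg", cast: ["z"] }] },
    ];
  },
};

// Local invocation; in a FaaS setup the platform would call the function.
tvHomeScreen({ profileId: "profile-1" }, stubServices).then((screen) => {
  console.log(JSON.stringify(screen, null, 2));
});
```

The point of the sketch is the split: everything outside `tvHomeScreen` (service access, hosting, scaling) would belong to the platform, while the function body holds only the logic that makes this BFF different from the next.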
And you built your own in-house FaaS platform because…
When we started on this journey in 2015, we were unable to find a third-party FaaS platform that satisfied our API Platform needs. Though many current offerings like AWS Lambda have come a long way since then, at that time most external uses of FaaS and serverless offerings were for latency-insensitive, event-driven tasks, not large-scale, latency-sensitive services. Additionally, running highly available services on the Netflix API Platform requires integration with the Netflix services stack, which also did not exist at the time. Our platform consists of many different components, such as developer tooling, runtime, infrastructure orchestration, and operations tooling.
The runtime, by the way, is named NodeQuark — since it uses Node.js.
How does NodeQuark fit in with the rest of the Netflix stack?
First, our choices are driven by the goal of streamlining the entire software development lifecycle.
We start with the ability to bootstrap a consistent development environment for each engineer. Our developer productivity tools team built a local development tool called Newt (Netflix Workflow Toolkit), a collection of extensible development tools maintained by our engineering tools teams. Through Newt, we provide a native development environment and workflow: it bootstraps a FaaS development environment in which we can develop, test, and debug functions locally. Under the hood, we provision a Docker container locally that contains the FaaS runtime, and the tooling seamlessly syncs code back and forth between the container and the local host, providing debugging information via logs and debugging ports.
On the build and management side, we’ve built a function index, which immutably versions and stores the functions. Since we have many teams using functions, the index is multitenant and supports namespaces for teams and projects.
We use Spinnaker and Titus under the hood to manage our infrastructure: Spinnaker as the CI/CD tool and Titus for container orchestration. Spinnaker allows us to coordinate the complex deployment interactions needed to support FaaS. Titus allows us to deploy containers reliably on a massive scale. Newt helps simplify container development, both when iterating locally and when onboarding to Titus. Having a consistent container environment between Newt and Titus helps developers deploy with confidence. The Newt tooling also provides a CLI to easily manage the deployment process.
Operationally, we use Atlas for metrics and dashboards, which provide runtime visibility into the health of each service, and we integrate with PagerDuty for alerts. These metrics, alerts, and dashboards are all automatically generated for each new function, ensuring full visibility and operability for each service.
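As a rough illustration of what an immutably versioned, namespaced index means in practice, consider the following in-memory sketch. It is not Netflix's implementation: the `FunctionIndex` class, its method names, and the `namespace/name` key format are assumptions made up for this example.

```javascript
// Hypothetical in-memory function index. The class, its methods, and the
// "namespace/name" key format are assumptions made for illustration.
class FunctionIndex {
  constructor() {
    // Maps "namespace/name" to an ordered list of frozen version records.
    this.entries = new Map();
  }

  // Publishing never overwrites: it appends a new, frozen version record.
  publish(namespace, name, source) {
    const key = `${namespace}/${name}`;
    const versions = this.entries.get(key) || [];
    const record = Object.freeze({
      namespace,
      name,
      version: versions.length + 1,
      source,
    });
    this.entries.set(key, [...versions, record]);
    return record;
  }

  // Returns the requested version, or undefined if it was never published.
  get(namespace, name, version) {
    const versions = this.entries.get(`${namespace}/${name}`) || [];
    return versions[version - 1];
  }

  // Returns the most recently published version, or undefined.
  latest(namespace, name) {
    const versions = this.entries.get(`${namespace}/${name}`) || [];
    return versions[versions.length - 1];
  }
}

const index = new FunctionIndex();
index.publish("tv-ui", "homeScreen", "first draft");
index.publish("tv-ui", "homeScreen", "second draft");
index.publish("mobile-ui", "homeScreen", "mobile draft");
console.log(index.latest("tv-ui", "homeScreen"));
console.log(index.latest("mobile-ui", "homeScreen"));
```

Because each team publishes under its own namespace, two teams can both own a function called `homeScreen` without colliding, and an older version stays retrievable after a newer one is published.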
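One way a platform can generate metrics for every function without asking its authors to do anything is to wrap each function at registration time. The sketch below shows that idea only. It does not use the Atlas or PagerDuty APIs: a plain `Map` stands in for the metrics backend, and the `instrument` helper and its counter names are hypothetical.

```javascript
// Hypothetical sketch: a plain Map stands in for a metrics backend.
// The instrument() helper and the counter names are assumptions.
function instrument(name, fn, metrics) {
  const stats = { invocations: 0, errors: 0, totalMs: 0 };
  metrics.set(name, stats);

  // The returned wrapper records a call count, an error count, and the
  // cumulative latency, then passes the result or the error through.
  return async (...args) => {
    const start = process.hrtime.bigint();
    stats.invocations += 1;
    try {
      return await fn(...args);
    } catch (err) {
      stats.errors += 1;
      throw err;
    } finally {
      // Convert elapsed nanoseconds to milliseconds.
      stats.totalMs += Number(process.hrtime.bigint() - start) / 1e6;
    }
  };
}

const metrics = new Map();
const hello = instrument("tv-ui/hello", async (who) => `hello ${who}`, metrics);

hello("world").then((greeting) => {
  console.log(greeting, metrics.get("tv-ui/hello"));
});
```

Since the wrapping happens in the platform, every newly registered function gets the same baseline counters, which is what would make uniform dashboards and alerts possible.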
Wow. Still getting our heads around Netflix having an in-house Developer Productivity Tools Team to make sure your devs have perfect-fit tools for doing their work.
We’re pragmatic with our use of technologies, understanding that it’s all just a means to an end with the main goal of supporting the business. You can see this in our adoption of AWS and of many open source technologies such as gRPC, Node.js, and Docker.