From Pets to Cattle: Going Cloud Native with Apache Kafka
The COVID-19 pandemic accelerated changes that were already underway in the IT industry. It underscored the need for low latency and speed in serving customers, and boosted the importance of emerging best practices.
The organizations that survived or even thrived over the last year and a half monetized speed and met the demand for just-in-time updates and robust performance. No matter what the industry, that expectation of a real-time user experience is not going to change.
One thing has become undeniable: Highly distributed businesses necessitate highly distributed operations. Apache Kafka is an open source platform for continuous, distributed event streaming, used at the likes of Airbnb, Goldman Sachs, Netflix, PayPal, Spotify and Uber. When Kafka is leveraged correctly in the cloud, it becomes the driver of that distribution.
“Kafka itself is really the core element of this new kind of fast, low-latency architecture,” said Maureen Fleming, program VP of intelligent process automation research at IDC, a global market analyst. She told The New Stack that pandemic business resiliency was grounded in event-driven architecture that organizes software design and data around the user journey. Chances are, when you see distributed systems updating in real time, “Kafka is going to be there keeping track of it all.”
But while Kafka powers high-performance data pipelines, streaming analytics, data integration and mission-critical applications at 70% of the Fortune 500, the transition to cloud native often isn’t smooth.
Moving the already distributed Kafka to the much more distributed cloud — or more likely diverse multi and hybrid clouds — can suck up a lot of time and money. Leveraging managed cloud services like Confluent Cloud, however, can help transform your Kafka from a high-maintenance pet into replaceable cattle.
When Kafka Turns into a Pesky Pet
Dan Rosanova, head of product for Confluent Cloud, analogizes Kafka’s ability to put data in motion to a mesh or network for connecting data that’s usually locked away in silos, databases or storage.
Kafka was created in 2011 by Jay Kreps, Neha Narkhede and Jun Rao, who at the time were engineers at LinkedIn. The founders saw that companies in a variety of industries would need a system like Kafka. But for most organizations, building data infrastructure with open source components wasn’t their focus and detracted from their core business. Thus, the trio founded Confluent in 2014 to bridge this gap. In June, Confluent launched an initial public offering that helped raise its market capitalization to $11.5 billion.
Fleming described Kafka as the first system to aggregate highly distributed data from microservices and systems into a centralized place where it is categorized, stored and published as “topics,” which can then be propagated anywhere. It works great on-premise.
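To make the “topics” idea concrete, here is a minimal sketch using Kafka’s standard Java producer client; the broker address, topic name and payload are illustrative placeholders, not taken from any deployment described in this article.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TopicPublishSketch {
    public static void main(String[] args) {
        // Placeholder broker address for illustration only.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each event is appended to a named, categorized log (a "topic")
            // that any number of downstream consumers can subscribe to.
            producer.send(new ProducerRecord<>("orders", "order-1042", "{\"status\":\"shipped\"}"));
        }
    }
}
```

Once an event lands on a topic, any system that subscribes to that topic receives it, which is what lets the same data be propagated anywhere.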
But when teams looked to migrate many or all of those topics into the cloud, Kafka had to be customized and re-platformed for each environment it was deployed to. Where to deploy became the primary decision development teams focused on, and even executives weighed in on which cloud or data center each asset would live in.
“Projects were delayed because the most important decision is where are you going to put your software or your asset,” Fleming said. “All of a sudden, moving Kafka to a cloud architecture meant that all of the things that made you want to be flexible — and want to use it — went away.”
The challenge, Rosanova said, is that Kafka is a distributed system, which doesn’t automatically translate to the cloud. Distributed systems, he said, are like having a bunch of pets: “You have to feed, take care of, clean up after them, so you control the environment to make that easier. In the cloud, the environment is always changing — machines fail, networking changes, instances restart — so you have to change your assumptions.”
Organizations are often surprised at the delay to ROI when moving to the cloud. On-premise data centers were always your pets, and the cloud is meant to be more your cattle, but it really takes effort to get there.
The initial pains of running Kafka in the cloud include:
- Networking challenges. The network policy doesn’t really change in your own data center, but in the cloud it’s all virtual and can change really quickly. As a result, different teams in your organization can make different choices, said Rosanova.
- Security challenges. Moving anything to the cloud, and Kafka in particular, means you need to secure not just how the machines talk to each other but how you talk to the machines. Making and configuring these decisions becomes another time sucker (a sketch of what that client-side configuration can look like follows this list).
- Surprise costs. The initial move from on-premise to cloud can actually cost more. It’s common to over-provision or to need faster disks than you initially thought, especially when you get sprawl. According to Rosanova, “It’s a huge project to get to the cloud, and then you’re burning more money than expected” in search of the right size.
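Much of that security work surfaces as plain client configuration. Below is a minimal sketch, in Java, of what “how you talk to the machines” can look like for a Kafka client connecting over TLS with SASL/PLAIN authentication; the endpoint and credential placeholders are illustrative, not drawn from any setup described here.

```java
import java.util.Properties;

public class SecureClientConfigSketch {
    // Endpoint and credentials below are placeholders, not real values.
    public static Properties secureClientProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker.example.com:9093");
        // Encrypt traffic between the client and the brokers.
        props.put("security.protocol", "SASL_SSL");
        // Authenticate the client itself, not just the machines it runs on.
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.plain.PlainLoginModule required "
            + "username=\"<api-key>\" password=\"<api-secret>\";");
        return props;
    }
}
```

The same handful of properties has to be decided, distributed and rotated for every client in every environment, which is where the time goes.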
And it’s not just a one-off moving expense. He said, “If you have to be multi-cloud, you hit the same hurdles again. Every cloud provider offers similar functionality, but with subtle differences, your teams will have to discover and work around.”
More organizations these days are running on multiple clouds, particularly in industries that go through frequent mergers and acquisitions. Cloud migration isn’t always a first step, Rosanova said, because that makes an already huge merger-and-acquisition project even bigger.
Turning Cloud Native Kafka into Cattle
There’s a definitive trend — from Kubernetes to cloud to data management — toward getting as high in the stack as possible with less to manage and more time to focus on increasing business value and cutting costs.
“The highest level of abstractions are the closest way to an Easy button that you can get to in the cloud,” said Rosanova.
If you’re running Kubernetes yourself with Kafka, it’s going to take you months to set up in any new environment, he said. But if you’re using a SaaS or serverless to manage and monitor Kafka, it’s going to take a matter of days. This hugely decreases your time to value.
“When you get in a good place in the cloud, you can spin up resources very quickly — and the higher you are up the abstraction stack, you can do it more easily,” Rosanova said. “Or, if you’re using a SaaS provider you can get something going today.”
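As one illustration of what “spinning up resources” looks like once the cluster itself is someone else’s problem, Kafka’s Admin API can create a new topic in seconds; the endpoint, topic name and sizing below are assumptions for the sketch, and with a managed service the same call is simply pointed at the provider’s bootstrap address.

```java
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class SpinUpTopicSketch {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        // Placeholder endpoint; in practice, reuse secured client properties like the earlier sketch.
        props.put("bootstrap.servers", "broker.example.com:9093");

        try (AdminClient admin = AdminClient.create(props)) {
            // Six partitions, replication factor 3: illustrative sizing, not a recommendation.
            NewTopic topic = new NewTopic("clickstream-events", 6, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```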
And there’s no doubt that using a managed cloud is better for your security — and your costs and the environment.
“If you’re using the cloud correctly, you are almost definitely getting better security than your own data center,” Rosanova said. “The large cloud providers have huge, advanced security teams monitoring their data and networks.” He cited real-time monitoring, direct connections to the FBI and governmental partnerships. That experience means the providers handle DDoS and malware attacks well, he said, and that they are better all the way down to the physical security of the hardware.
Cloud vendors treat their own data centers like pets, so you can be automatically compliant and secure and can treat them as ephemeral cattle. That’s important, because security isn’t usually your bottom line.
Finally, by taking Kafka from on-premise to the cloud, you gain scalability. Cyclical workloads used to take all year to maintain, Rosanova said. A retail business would begin scaling up for the holiday rush in July, and then, by January, it would spend months tearing down just to build up again.
But then, he noted, “in the cloud, here’s a slider and a button and an API call, and it can be down to hours.”
Up until about five years ago, the decision about where to deploy dominated, Fleming said. Now, with Kubernetes and software solutions to abstract and manage the cloud, that decision has become trivial for all of her clients. Cloud native Kafka has given them portability with ease.
The analyst added, “With cloud native architecture, you can build what you need and not worry about it being ignored.”
Making Room for Hyper-Personalization at AO
Back in 2017, the household appliances and electronics online retailer AO was looking to increase customer satisfaction. To gain that edge in e-commerce, the company decided to focus on creating a unified, hyper-personalized user journey.
AO had always relied on historical customer data, but wanted to combine that with real-time signals — like clickstream events — based on visitor behavior, which would then feed into individualized marketing automation. For example, based on a visitor’s onsite behavior, including product and category views, they are pinged with an appropriate voucher for 10% off and free delivery.
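AO’s actual pipeline isn’t public, but the pattern described above maps naturally onto Kafka Streams. The following sketch assumes hypothetical “clickstream-events” and “voucher-offers” topics and a simplistic string event format; it filters product views out of the raw clickstream and emits a voucher offer for each qualifying visitor.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class VoucherSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "voucher-sketch");
        // Placeholder endpoint; real deployments would add the security properties shown earlier.
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker.example.com:9093");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Events keyed by visitor ID; values are simple "eventType:detail" strings for illustration.
        KStream<String, String> clicks = builder.stream("clickstream-events");

        clicks
            // Keep only product-view events from the raw clickstream.
            .filter((visitorId, event) -> event.startsWith("product-view:"))
            // Turn each qualifying view into a voucher offer for that visitor.
            .mapValues(event -> "voucher:10-percent-off-free-delivery")
            .to("voucher-offers");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```

A downstream marketing automation service subscribing to the offers topic can then deliver the voucher to the visitor in real time.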
To break down the data silos standing in the way of that unified view of an onsite visitor, AO’s customer personalization team adopted an event streaming approach, built on Kafka and Confluent Cloud.
The managed cloud solution allowed AO to “liberate data from our heritage systems and combine it with real-time signals from our customers to deliver a hyper-personalized experience,” said Jon Vines, head of data engineering and integration at AO, in a case study published by Confluent. “You just can’t do that with a data lake and batch processing.”
This hyper-personalized approach quickly showed a significant uptick in customer conversion rates in A/B testing, Vines said.
Before moving to Confluent Cloud, developers would spend up to three days on rebuilds every time there was a broker outage in Kafka. By moving away from a self-managed environment and into Confluent Cloud, with its continuous monitoring of Kafka clusters, failures can be detected earlier and resolved automatically, averting major incidents. That frees the development team to focus on new features and applications.
AO’s developers are also able to move more quickly; Vines said they are now able to deploy from beta into production in half an hour.
“Pace became even more crucial during the pandemic because the world moved so rapidly from predominantly in-store shopping to online,” he said. “The speed at which we are able to create new use cases that improve the customer journey with Confluent Cloud is helping us to cement our online market leadership position, even as we continue to adapt to ongoing changes.”