Many factors contribute to an organisation’s data infrastructure becoming a chaotic mess: data silos, scaling difficulties, and poor governance, to name a few. Often, it’s a matter of one business need stacking on top of another until the complexity becomes unmanageable.
Shaun Clowes, Chief Product Officer of Confluent, explained, “Every time you have a new data need, what do you do? You connect the destination point-to-point to all the different sources needed. Have a new mobile app? Add some wires. New features in the web app? Add some wires. Each new need? Even more wires. Over time, your data infrastructure turns into a giant, tangled mess of interconnections. These links aren’t just complex; they’re also surprisingly fragile.”
Speaking at Confluent’s Kafka Summit in Bangalore, Clowes highlighted that many businesses fail to anticipate how their data infrastructure will cope with the volume and scale of the programs and applications they plan to adopt.
“Imagine a system upstream changes, causing schema drift, and suddenly the data is no longer compatible. What happens? All the systems depending on that data fail. Sometimes they fail loudly in the middle of the night, but sometimes, even worse, they fail silently and go undetected for a long time. Meanwhile, all systems downstream also break, and the cycle continues. Your data needs and business requirements keep accelerating. One day, it’s new features for a mobile app; the next, it’s AI inference or classification. And so, the brittle connections keep multiplying,” he said.
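The loud-versus-silent distinction Clowes draws is easiest to see in code. The following is a minimal, hypothetical Python sketch of a downstream consumer that checks each record against the schema it expects and raises immediately on drift rather than quietly passing malformed data along; the field names are illustrative, not taken from Clowes’s remarks.

```python
# Hypothetical downstream check: fail loudly on schema drift instead of
# silently propagating records whose shape has changed upstream.
EXPECTED_FIELDS = {"order_id": str, "amount": float, "currency": str}


class SchemaDriftError(Exception):
    """Raised when an upstream record no longer matches the expected schema."""


def validate(record: dict) -> dict:
    missing = EXPECTED_FIELDS.keys() - record.keys()
    if missing:
        raise SchemaDriftError(f"missing fields: {sorted(missing)}")
    for field, expected_type in EXPECTED_FIELDS.items():
        if not isinstance(record[field], expected_type):
            raise SchemaDriftError(
                f"field {field!r} is {type(record[field]).__name__}, "
                f"expected {expected_type.__name__}"
            )
    return record


# A silently failing pipeline would instead do record.get("amount", 0.0)
# and keep running on wrong numbers, which is the harder failure to detect.
```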
Paradigm shift
To address data chaos, Clowes suggests that enterprises need to unlearn old habits quickly.
Firstly, there should be a shift from point-to-point data connections to a model where multiple subscribers can access and benefit from the most crucial data in the enterprise.
Secondly, enterprises should transition from mere data consumption to providing well-formed, valuable, and reusable data to other participants in the ecosystem — essentially, data products.
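That shift maps naturally onto Kafka’s publish-subscribe model, where one well-formed stream serves many independent subscribers. The sketch below uses the confluent-kafka Python client and is purely illustrative: the broker address, topic name, and consumer groups are assumptions, not details from Clowes’s talk.

```python
from confluent_kafka import Producer, Consumer

BROKER = "localhost:9092"   # assumed broker address
TOPIC = "orders"            # hypothetical shared data product

# One team publishes the event once...
producer = Producer({"bootstrap.servers": BROKER})
producer.produce(TOPIC, key="o-123", value=b'{"order_id": "o-123", "amount": 42.5}')
producer.flush()


# ...and any number of teams subscribe independently, each with its own
# consumer group, instead of wiring another point-to-point connection.
def make_consumer(group_id: str) -> Consumer:
    consumer = Consumer({
        "bootstrap.servers": BROKER,
        "group.id": group_id,          # e.g. "mobile-app", "fraud-detection"
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe([TOPIC])
    return consumer


mobile = make_consumer("mobile-app")
analytics = make_consumer("analytics")

msg = mobile.poll(5.0)
if msg is not None and msg.error() is None:
    print(msg.value())
```

Because each consumer group tracks its own offsets, adding a new subscriber does not disturb the ones already reading the stream.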
Within this data mess, it’s important to recognise the existence of two separate but interconnected realms: the operational estate and the analytical estate.
In the operational estate, an enterprise’s online applications collaborate to deliver business processes and enhance the experiences of employees and customers. Meanwhile, the analytical estate reassembles data to support informed business decision-making.
“The focus has largely been on the analytical estate, which makes sense because data products were initially conceived to make sense of all this data across many systems from the operational domain,” Clowes noted. “But these data products are still derived from data born in the operational estate, which remains oblivious to their existence. Even a minor change in the operational estate can disrupt all these meticulously created data products.”
Solving the equation
More often than not, architects and developers find themselves relying on brittle, point-to-point connections when building new features or capabilities, or integrating AI into applications. This highlights the need for a new, universal data product — one that operates across the operational and analytical estates, Clowes explained.
To achieve this, enterprises can harness the capabilities of a data streaming platform like Kafka. Clowes outlined four key capabilities of such platforms:
- Stream
- Connect
- Govern
- Process
“Firstly, we connect to any data system, whether operational or analytical, as a source or a sink. By doing so, we set all the most important data in your organisation in motion as streams. We ensure these streams are trustworthy by implementing quality controls and governance checks. Once the data is deemed trustworthy, we then catalogue it so it can be discovered by other teams, who can then use it for their own use cases, whether those are in real time or batch. We enrich these streams by stitching, evolving, and improving data from across your organisation to produce really high-value data assets,” Clowes elaborated.
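One common way Kafka users put the governance piece into practice is to register a schema and serialise every message against it, so that incompatible changes are rejected according to the registry’s configured compatibility rules rather than discovered downstream. The snippet below is a minimal sketch using the confluent-kafka Python client and a Schema Registry assumed to run at localhost:8081; the “orders” subject and its fields are hypothetical.

```python
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

# Hypothetical Avro schema for the stream; in practice this lives in the registry.
schema_str = """
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}
"""

sr_client = SchemaRegistryClient({"url": "http://localhost:8081"})  # assumed URL
avro_serializer = AvroSerializer(sr_client, schema_str)

producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed broker
order = {"order_id": "o-123", "amount": 42.5}

# Serialisation registers/validates the schema, so a record that does not
# match is rejected at produce time instead of breaking consumers later.
producer.produce(
    topic="orders",
    value=avro_serializer(order, SerializationContext("orders", MessageField.VALUE)),
)
producer.flush()
```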
Analytical issues
While Kafka is recognised as the open standard for integrating data across the operational estate, addressing the analytical estate requires a different approach due to its historically siloed nature, remarked Jay Kreps, Confluent Co-Founder and CEO.
“Kafka often serves as the data feed into this area. However, when considering the various tools available, they generally run on not just streams, but largely on extensive tables of data. This is the essence of a data warehouse or data lake. Historically, in the analytical space, these tables have been isolated within different vendor-specific tools. Each data warehouse is its own little island. However, this has been changing recently. There’s been a shift towards open, shared tables stored in cloud storage,” he observed.
“It’s common to take the streams of data from Kafka, then write them into Iceberg, stream by stream. But this setup is far from perfect. Currently, the integration is mostly surface-level, and maintaining it is quite painful. The main source of this pain is the vast diversity of data. Almost every business event needs to be represented in the analytical estate, so all our Kafka streams must be mapped into our data lake or warehouse, often involving incredibly detailed and error-prone mappings,” said Kreps.
Many enterprises use Apache Iceberg to manage data within the analytical estate, but as Kreps pointed out, many use it quite inefficiently.
Customers often run hundreds of small Spark jobs, each parsing the relevant fields out of an event stream and writing them to a table, the Confluent CEO added.
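A minimal, hypothetical PySpark Structured Streaming job of that shape might look like the sketch below. The broker, topic, field names, and Iceberg table are assumptions for illustration, and it presumes a Spark session already configured with the Iceberg runtime and a catalog.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("orders-to-iceberg").getOrCreate()

# Hand-maintained mapping from the event stream to table columns;
# this is the fragile part described above.
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
])

orders = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
    .option("subscribe", "orders")                         # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), order_schema).alias("o"))
    .select("o.order_id", "o.amount")
)

# Append the parsed fields to an Iceberg table (the name is illustrative).
query = (
    orders.writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/orders")
    .toTable("analytics.orders")
)
query.awaitTermination()
```

Multiply this by hundreds of topics and the hand-maintained field mappings become the weak point.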
“That method works until it doesn’t. When something upstream changes, that Spark job may fail, if you’re lucky, or it might simply send incorrect data downstream, leading to extensive processing and days spent cleaning up the mess. This isn’t the ideal way for these components to integrate. So, what if we could do this better? What if we could truly unite these two areas and provide real integration between the streams and the tables, effectively merging the streaming world with the batch world?” he wondered.
To achieve this, the solution must integrate across three major layers of the stack — stream, governance, and processing. This was the rationale behind Confluent reimagining its Apache Flink solution earlier this year as a cloud-native service.
“Flink is becoming the standard for real-time stream processing, and we believe Apache Iceberg is the standard for the tables of data shared in the analytical estate. By combining these into a coherent platform, we aim to make the world of streaming accessible to all applications and companies,” Kreps said.
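What that combination can look like in practice is sketched below with PyFlink SQL: a Kafka topic is exposed as a streaming table and continuously inserted into an Iceberg table by a single job. This is an illustrative assumption about the general pattern, not Confluent’s managed service; it presumes the Flink Kafka and Iceberg connector jars are available, and all names and addresses are hypothetical.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming Table API environment (connector jars assumed to be on the classpath).
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Expose a Kafka topic as a streaming table.
t_env.execute_sql("""
    CREATE TABLE orders_stream (
        order_id STRING,
        amount   DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'flink-orders',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Register an Iceberg catalog backed by a local warehouse path.
t_env.execute_sql("""
    CREATE CATALOG lake WITH (
        'type' = 'iceberg',
        'catalog-type' = 'hadoop',
        'warehouse' = 'file:///tmp/iceberg/warehouse'
    )
""")
t_env.execute_sql("CREATE DATABASE IF NOT EXISTS lake.analytics")
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS lake.analytics.orders (
        order_id STRING,
        amount   DOUBLE
    )
""")

# One continuous job keeps the stream and the table in sync.
t_env.execute_sql(
    "INSERT INTO lake.analytics.orders SELECT order_id, amount FROM orders_stream"
)
```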
Data excellence
For Confluent, data infrastructure transcends mere features and functionalities — it’s fundamentally about democratisation, especially in the booming AI landscape. Enterprises require platforms capable of managing the extensive volume of requirements, Kreps noted.
“When we consider what’s necessary to truly integrate AI into our companies and leverage it as a powerful tool, those AI applications will be the ones that bridge the traditional divides between the analytical and operational estates. They’ll use data in complex ways similar to our analytics workloads but will do so in customer-facing functions that drive parts of the business and demand robust SLAs. This necessitates real software engineering to make these applications viable globally. Consequently, a new data platform is essential — one that simplifies this integration, which is the data streaming platform,” he explained.
Clowes echoed similar views on the accessibility of data in today’s enterprises: “We’re no longer caught in a relentless struggle with point-to-point links and brittle data. Instead, we find ourselves in a virtuous cycle where data sets become more than the sum of their parts. They’re immediately available for teams across the enterprise to utilise as soon as they are created.”