Your Data Stack Is Outdated — Here’s How to Future-Proof It

The data landscape is undergoing a tectonic shift. While data warehouses have served as the system of record for decades, the sheer variety of today’s use cases demands a more flexible approach.
The answer? Unbundling the data platform.
The Case for Unbundling
Unbundling means breaking down a product stack into specialized, composable components. Applied to data platforms, this translates to decoupling storage, compute, query engines, and data tools while ensuring seamless interoperability. Think of it as building your data platform with Lego blocks.
The traditional bundled approach, exemplified by data warehouses, imposes significant organizational constraints. These platforms typically lock users into closed storage formats, with compute entirely controlled by the vendor. Teams are restricted to a single query engine or SQL runtime, preventing them from leveraging specialized tools for specific needs. While platforms like Snowflake excel at traditional dashboarding and analytics, modern data workflows demand more diverse capabilities: real-time analytics with ClickHouse, distributed SQL processing for complex queries, ML frameworks like PyTorch for model training, and vector databases for AI embeddings. Lock-in to a single compute runtime makes adopting these purpose-built tools, or whatever comes next, nearly impossible. Catalogs and metadata, meanwhile, remain proprietary and isolated.
An unbundled platform shatters these limitations. Organizations gain the freedom to choose any data format that suits their needs. They can bring their own compute or leverage vendor-managed options. Multiple query engines and management tools become available simultaneously, allowing teams to use the right tool for each workload. Perhaps most importantly, the platform enables true interoperability across catalogs and engines.
I saw this approach validated directly while building Uber’s large-scale data platform. As a principal engineer there, I helped migrate the company’s rapidly growing data from a proprietary on-premises data warehouse to an open transactional data lake that served at least five distinct use cases through specialized compute engines, each offering the best cost-performance ratio and feature set for its workload. The platform’s architecture let Uber run Apache Spark for data science, Apache Flink for stream processing, Apache Hive for ETL, and Presto for interactive analytics, while retaining select portions of the data warehouse for complex BI operations.
The Evolution of Data Platforms
The industry is converging toward open data lakehouses as the new foundation. This evolution traces back to the 2000s, when we relied on relational databases and on-premises data warehouses. The rise of social networks and machine learning made Hadoop installations prominent, enabling companies like LinkedIn to build rich user experiences and a suite of data-driven products such as job recommendations and “People You May Know.” As a lead engineer on LinkedIn’s data infrastructure team, I watched this pattern spread as company after company adopted similar architectures to build similar products.
Widespread migrations to cloud Spark platforms and cloud data warehouses followed. Historically, data warehouses and data lakes operated as separate stacks with distinct capabilities and use cases. That artificial divide is now dissolving as modern open table formats bring transactional capabilities and ACID guarantees to data lakes, eliminating the need to maintain parallel silos.
The last five years have seen these approaches converge. The unnatural separation between cloud warehouses and cloud data lakes has broken down, in large part because cloud providers already price compute and storage separately. This convergence points toward a more integrated, flexible future for data.
Blueprint for an Unbundled Data Platform
Core Components
The storage layer forms the foundation of an unbundled platform. Modern implementations leverage file formats like Parquet and ORC alongside table formats such as Hudi, Iceberg, and Delta Lake. Interoperability layers, including Apache XTable and Delta Lake UniForm, ensure seamless data access across different systems.
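To make the storage layer concrete, here is a minimal sketch of writing records to an open table format on cloud storage, using Apache Hudi’s Spark datasource as one example. The bucket path, table name, and schema are hypothetical, and a Spark session with the Hudi bundle on its classpath is assumed:

```python
from pyspark.sql import SparkSession

# Assumes the Hudi Spark bundle is on the classpath.
spark = (
    SparkSession.builder
    .appName("lakehouse-storage-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.createDataFrame(
    [("order-1", "2024-01-01", 42.0)],
    ["order_id", "ds", "amount"],
)

# Write to an open table format; the underlying data files are Parquet.
# "s3://my-lake/orders" is a placeholder for any S3/GCS/ADLS location.
(
    df.write.format("hudi")
    .option("hoodie.table.name", "orders")
    .option("hoodie.datasource.write.recordkey.field", "order_id")
    .option("hoodie.datasource.write.partitionpath.field", "ds")
    .option("hoodie.datasource.write.precombine.field", "ds")
    .mode("append")
    .save("s3://my-lake/orders")
)
```

Because the table lives in an open format on object storage, any engine that speaks Hudi, or reads it through an interoperability layer, can query it without making copies.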
Above storage sits the metadata and catalog layer. Operational catalogs handle table resolution and query planning, while specialized components manage data governance and sharing. Interoperability layers at this level let different engines and catalogs communicate smoothly.
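As an illustration, here is a hedged sketch of pointing an engine at a shared operational catalog, using Apache Iceberg’s Spark catalog backed by an existing Hive Metastore. The catalog name, namespace, and metastore URI are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime is on the classpath.
spark = (
    SparkSession.builder
    .appName("catalog-sketch")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hive")
    .config("spark.sql.catalog.lake.uri", "thrift://metastore-host:9083")
    .getOrCreate()
)

# Any engine configured against the same catalog resolves the same tables,
# which is what makes engines interchangeable above a shared metadata layer.
spark.sql("SELECT * FROM lake.analytics.orders LIMIT 10").show()
```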
Data management encompasses several crucial functions. Workflow schedulers orchestrate complex operations, while ETL frameworks handle data transformations. Optimization tools continuously improve performance, and ingestion capabilities efficiently move data in and out of the platform.
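A hedged sketch of what this orchestration can look like when each stage is a swappable component, using Apache Airflow as the scheduler; the DAG, scripts, and schedule are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Ingestion, transformation, and table optimization run as independent
# steps, so any one of them can be swapped for a different tool.
with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest",
        bash_command="spark-submit ingest_orders.py",
    )
    transform = BashOperator(
        task_id="transform",
        bash_command="spark-submit transform_orders.py",
    )
    optimize = BashOperator(
        task_id="optimize",  # e.g., compaction or clustering of table files
        bash_command="spark-submit optimize_orders.py",
    )

    ingest >> transform >> optimize
```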
The query layer connects everything. Analytics engines, data warehouses, and data frame libraries provide various ways to interact with the data. ML and AI training frameworks tap into the same data source, while direct-query databases eliminate unnecessary data movement.
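The payoff is easiest to see in miniature: two different engines querying a single copy of the data. A sketch with hypothetical paths, using DuckDB for interactive SQL and PyArrow/pandas for data science work:

```python
import duckdb
import pyarrow.parquet as pq

# One copy of the data; in a lakehouse this would be the table's
# location on S3/GCS/ADLS rather than a local path.
PATH = "warehouse/orders"

# Engine 1: an embedded SQL engine for interactive analytics.
revenue = duckdb.sql(
    f"SELECT ds, sum(amount) AS revenue "
    f"FROM read_parquet('{PATH}/*.parquet') "
    f"GROUP BY ds ORDER BY revenue DESC LIMIT 5"
).fetchall()

# Engine 2: a dataframe stack for data science, reading the same files.
orders = pq.read_table(PATH).to_pandas()

print(revenue)
print(orders.describe())
```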
Real-World Implementation Examples
Uber’s implementation showcases the power of this approach. Its 250-petabyte enterprise data lake, built on Apache Hudi, stores data in open formats such as Avro and Apache Parquet. As discussed above, Uber employs multiple query engines and uses modern machine learning frameworks like Ray to build critical product features such as ETA prediction. An extended Hive Metastore handles access control across the platform.
Walmart has taken a different path, balancing warehouse and data lake architectures. Their open data lakehouse delivers exceptional data freshness. Spark handles batch and stream processing, while BigQuery serves warehousing needs. They’ve integrated Presto and Trino for cost-effective interactive queries, with multiple catalog synchronization services ensuring data consistency across systems.
Future Challenges and Opportunities
Several critical areas demand attention as the industry moves forward. Unstructured data support needs enhancement, requiring better integration with existing storage layers and improved format handling. New specialized file formats must emerge to optimize point lookups and model serving. Row-based formats will play an increasingly important role as cloud storage gains low-latency access and begins to double as a lightweight serving store.
Catalog interoperability presents another frontier. Today’s operational catalogs often exist in isolation, and open-source options like Hive Metastore are slowly becoming outdated. The development of new open-source catalogs and better interoperability standards will define the next phase of evolution.
Building Your Next Data Platform
Success in this new paradigm requires adherence to a few key principles. Start with an open data lakehouse foundation, ensuring storage remains independent of any specific engine. Design a modular architecture that treats cloud storage as the single source of truth. Preserve compute flexibility to avoid lock-in, and build efficiency in from day one by embracing incremental processing and disciplined data management, as sketched below.
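To ground that last principle, here is a hedged sketch of incremental processing using Apache Hudi’s incremental query, which reads only records committed after a checkpoint instead of rescanning the full table. The path and instant time are placeholders:

```python
from pyspark.sql import SparkSession

# Hudi Spark bundle assumed on the classpath.
spark = SparkSession.builder.getOrCreate()

# Pull only the records committed after the last checkpoint; downstream
# jobs never reprocess the whole table on each run.
incremental = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load("s3://my-lake/orders")
)

incremental.createOrReplaceTempView("orders_delta")
spark.sql("SELECT ds, count(*) FROM orders_delta GROUP BY ds").show()
```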
The future belongs to unbundled data platforms, which offer the flexibility, efficiency, and future-proofing essential for tomorrow’s data challenges. The transition demands careful planning and sound architectural decisions, but the payoff is a fundamental transformation of both technology and teams. Beyond the clear technical benefits of reduced vendor lock-in, access to best-of-breed technology, and greater room to innovate, unbundled platforms create more productive, empowered teams. Data engineers can focus on driving value rather than wrestling with outdated technology. Analysts and data scientists gain immediate access to trustworthy, single-source-of-truth data through their preferred tools, regardless of use case. Those who recognize and act on this shift will have a clear advantage in the years ahead.