Ep 51: Operationalizing your warehouse, streaming analytics, and cereal (w/ Arjun Narayan of Materialize and Nathan Bean of General Mills)
How exactly companies are leveraging their warehouse to operationalize their business
It turns out data plays a big role in getting cereal manufactured and delivered so you can enjoy your Cheerios reliably for breakfast. In this episode of The Analytics Engineering Podcast, we talk with Arjun Narayan, CEO of Materialize, a company building an operational warehouse, and Nathan Bean, a data leader at General Mills responsible for all of the company's manufacturing analytics and insights.
We discuss Materialize’s founding story, how streaming technology has matured, and how exactly companies are leveraging their warehouse to operationalize their business, in this case at one of the largest consumer product companies in the United States.
Nathan and Arjun touch on the significance of real-time data streaming for operational decision-making, how SQL is democratizing real-time data access, and the development journey of Materialize.
Key takeaways from this episode:
Do you think that we needed to go through the whole noSQL movement in the early days of large-scale cloud distributed systems?
Arjun: SQL is appropriate because a lot of people had Google Envy. The thing that fundamentally broke a lot of these systems was a few internet-scale companies that genuinely had orders of magnitude more data volume than anybody had ever had before. Take an inventory management system: lots of companies have been buying and selling stuff and building inventory management systems.
What happens is that when Amazon comes on board, they have orders of magnitude more SKUs than anybody has had before because of internet scale. There is a great video from around '94 about why Jeff Bezos decided to sell books on the internet: on the internet, you can have an unlimited number of SKUs, and books are the longest-tail distribution of SKUs. But all that just breaks these databases. Once Google really started trying to index every web page, they ended up building bespoke systems that could scale for their workload size.
The mistake is a lot of people who didn't have that problem copied their architecture because it was cool. Google may have needed MapReduce, and they did. All the other people who copied Google saying that's the wave of the future because Google is the future were making a mistake. And we generally accept that today.
What specific attributes or features of the Materialize technology contribute to its performance, user-friendliness, and cost-effectiveness in enabling companies like General Mills to take real-time actions?
Arjun: If you take batch historical analytics, you have these elastically scalable cloud data warehouses like BigQuery or Snowflake where you can write lots of complex logic that deals with huge amounts of data and joins it all together. It's very performant and, relatively speaking, simple to use.
You just write SQL, which can get really gnarly and complex in a good way because it's able to express a lot of complicated business logic. That's wonderful, but the latencies don't get much better than hours. If you have meaningful data set sizes, maybe, if you can really squeeze everything out of it, you get down to double-digit minutes.
If you need something real-time, you have to give up all of this infrastructure. You have to go to some sort of bespoke streaming pipeline using Kafka or Kafka and Flink or a variety of these technologies, which are orders of magnitude harder to program.
You essentially have to write a bunch of Java code. You don't have the flexibility of SQL. You don't have the operational ease of a cloud data warehouse. Most companies that I've encountered have this hard trade-off where they have to switch from system A to system B and build a bunch of bespoke streaming infrastructure.
Of course, because it's so non-trivial, this is restricted to a very select class of use cases that are existentially important to that firm. Most use cases that would benefit from being real-time don't get to be real-time because the streaming team is already overburdened.
Are there unique aspects or "secret sauce" elements that set Materialize apart from other technologies in this space that you can share?
Arjun: Materialize aims to provide a modern data stack-like SQL environment where users can write SQL and get the benefits of real-time incrementally updating materialized views of the SQL as the underlying data changes.
It eliminates the need for complex bespoke streaming pipelines and offers the flexibility of SQL for real-time use cases, addressing the limitations of traditional batch processing and bespoke streaming solutions. This unique approach makes it a game-changer for companies like General Mills, enabling them to make real-time decisions with ease and cost-effectiveness.
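To make the idea of an incrementally updating view concrete, here is a minimal Python sketch of the general technique. It is not Materialize's implementation; the function, the class, and the sample inventory rows are all hypothetical names chosen for illustration. It compares recomputing a per-SKU total from scratch, as a batch job would, with keeping the same total current by applying one change at a time.

```python
from collections import defaultdict


def batch_total_by_sku(rows):
    """Recompute the whole aggregate from scratch, as a batch job would."""
    totals = defaultdict(int)
    for sku, qty in rows:
        totals[sku] += qty
    return dict(totals)


class IncrementalTotalBySku:
    """Hypothetical view that keeps the same aggregate current per change."""

    def __init__(self):
        self.sums = defaultdict(int)    # running quantity per SKU
        self.counts = defaultdict(int)  # number of live rows per SKU

    def apply(self, sku, qty, diff=1):
        # diff=+1 inserts a row; diff=-1 retracts a previously inserted row.
        self.sums[sku] += diff * qty
        self.counts[sku] += diff
        if self.counts[sku] == 0:
            # No rows left for this SKU, so it drops out of the view.
            del self.sums[sku]
            del self.counts[sku]

    def result(self):
        return dict(self.sums)


# Illustrative inventory events: (sku, quantity).
rows = [("cheerios", 10), ("chex", 4), ("cheerios", 5)]
view = IncrementalTotalBySku()
for sku, qty in rows:
    view.apply(sku, qty)

# Retract one row; the view updates without rescanning the other rows.
view.apply("chex", 4, diff=-1)
remaining = [("cheerios", 10), ("cheerios", 5)]
assert view.result() == batch_total_by_sku(remaining)
```

Each change costs a constant amount of work here, while the batch function rescans every row. A real system has to do the same for joins and arbitrary SQL, which is the hard part this sketch leaves out.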
Looking 10 years out, what do you hope will be true for the data industry?
Nathan: Certainly, I have two main hopes for the future of the data industry. Firstly, I'd like to see a convergence between streaming and batch processing. Currently, they're often treated as distinct approaches, but I believe it would be beneficial if they were seen as implementation details. This would simplify data management and build more confidence among customers. Having both streaming and batch processes can create confusion.
Secondly, I'm a strong advocate for open semantic layers or headless BI. I envision a future where businesses can easily define and use semantic layers for their data, making it more accessible and understandable. Additionally, there's room for improvement in handling business metrics, especially when it comes to dealing with aggregates.
I also appreciate LINQ, a query language developed by Microsoft, and hope for its return. LINQ is known for its elegance and ease of use. PRQL is another example of a more declarative approach to data transformation, and I believe there's potential for a "SQL 2.0" to address some of the limitations of SQL.