Lessons From Building a Workflow Platform

Introduction

Our names are Kim Björkman and Nicolas Vivot and we are software engineers on the Workflow Engine team at CADDi.

We provide the internal batch processing platform on which DRAWER's ingestion pipeline runs.

This article is the 21st in the advent calendar, and is about what we learned while building this platform.

Background

Back when the first proof-of-concept version of CADDi DRAWER was built, it included an ingestion pipeline responsible for processing drawings, orchestrating them, and applying various machine-learning models to extract data and bring information into the digital age. This pipeline, cleverly labeled “The Pipeline” internally, was a monolith written in TypeScript and hosted on Cloud Run. It was glued together by Cloud Tasks queues to form a “workflow-like” experience: one process could trigger others by pushing new tasks onto queues. While this solution worked decently for a while, it came with serious drawbacks.
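
To make the "pushing new tasks onto queues" pattern concrete, here is a minimal in-memory sketch. It is not the original TypeScript code and does not use Cloud Tasks; the step names (ingest, extract, index) and the run helper are hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical steps. Each one decides on its own which step comes next
# by pushing a new task onto the queue.
def ingest(drawing_id, enqueue):
    enqueue("extract", drawing_id)
    return f"ingested {drawing_id}"

def extract(drawing_id, enqueue):
    enqueue("index", drawing_id)
    return f"extracted {drawing_id}"

def index(drawing_id, enqueue):
    # Last step: nothing further is enqueued.
    return f"indexed {drawing_id}"

HANDLERS = {"ingest": ingest, "extract": extract, "index": index}

def run(initial_tasks):
    """Drain the queue, letting each handler enqueue follow-up tasks."""
    queue = deque(initial_tasks)
    log = []

    def enqueue(step, drawing_id):
        queue.append((step, drawing_id))

    while queue:
        step, drawing_id = queue.popleft()
        log.append(HANDLERS[step](drawing_id, enqueue))
    return log
```

Note that the overall shape of the "workflow" is never written down in one place: it only exists implicitly in the enqueue calls scattered across the handlers, which is part of why this style is hard to observe and change.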

Scalability

The monolithic design of The Pipeline, deployed on Cloud Run, imposed significant scalability limitations. Although Cloud Run supports horizontal scaling up to 1000 instances, it wasn’t sufficient for our growing workloads and wasn’t cost-effective. Additionally, Cloud Run’s hard limits on CPU and memory per instance capped the processing power of each instance, restricting overall throughput.

Orchestration

The monolithic nature of “The Pipeline” resulted in a lack of modularity and extremely high complexity. Much of the system’s design was based on tribal knowledge, a legacy of its proof-of-concept origins. Documentation was sparse, and making updates—let alone adding new processes—was a daunting task requiring access to gatekeepers of this tribal knowledge.

Visibility

The Pipeline offered limited visibility into the state of processing. While individual processes could be monitored, there was no way to get a comprehensive view of the entire “workflow” or the state of its subcomponents. This made debugging and optimization difficult.

The Workflow Engine

In 2023, a team was assembled to address these challenges and create a robust workflow engine for CADDi’s async batch-processing needs. This new platform would not only resolve the shortcomings of The Pipeline but also lay the foundation for future scalability and transparency.

After evaluating several options and conducting proofs of concept, we decided to base the new platform on Kubernetes and leverage Argo Workflows for workflow orchestration. This combination provided the scalability and modularity required for a more modern workflow engine.

While Kubernetes and Argo Workflows offered powerful abstractions, integrating them with our existing systems wasn’t straightforward. We quickly realized that running a workflow platform is not only about orchestrating workflows and providing compute power. We also needed to develop additional services to handle the complexities surrounding the platform, such as:

  • Real-time cluster resource monitoring & management
  • Cross-cluster routing
  • Priority management
  • Platform I/O standardization
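
As an illustration of how priority management and resource monitoring can interact, the following is a minimal sketch of a priority-ordered admission queue that only starts a workflow when enough CPU is free. It is not our actual service: the Admitter class, the single free_cpu number, and the "lower value means higher priority" convention are all assumptions made for the example.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

@dataclass(order=True)
class Pending:
    priority: int                    # assumption: lower value = more urgent
    seq: int                         # tie-breaker, keeps FIFO order per priority
    name: str = field(compare=False)
    cpu: int = field(compare=False)  # CPU units the workflow requests

class Admitter:
    def __init__(self, free_cpu):
        self.free_cpu = free_cpu
        self._heap = []
        self._seq = count()

    def submit(self, name, priority, cpu):
        heapq.heappush(self._heap, Pending(priority, next(self._seq), name, cpu))

    def admit(self):
        """Start pending workflows in priority order while the most urgent one fits."""
        admitted = []
        while self._heap and self._heap[0].cpu <= self.free_cpu:
            item = heapq.heappop(self._heap)
            self.free_cpu -= item.cpu
            admitted.append(item.name)
        return admitted

    def release(self, cpu):
        """Return capacity when a workflow finishes."""
        self.free_cpu += cpu
```

This sketch uses strict priority: if the most urgent workflow does not fit, nothing behind it is started either. That avoids starving large high-priority work, at the cost of leaving some capacity idle; a real scheduler may choose a different trade-off.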

Today, the new workflow engine processes thousands of workflows daily, with millions of drawings per week. Key features include:

  • Scalability
    • The new platform provides a new level of scalability to our business, allowing reliable concurrent customer onboarding while using less than 20% of the current full capacity.
    • It is now easier to unlock further scalability and localization by expanding the service to multiple clusters across many regions.
  • Multi-Tenancy & Isolation
    • GCP IAM policies ensure secure tenant isolation and data separation.
    • GKE node pool specialization combined with node selectors and taints further ensure isolation and efficient resource usage.
  • Visibility
    • The Argo Workflows UI provides comprehensive insights into the states of workflows and steps.
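
To illustrate the node selector and taint mechanism mentioned above, here is a sketch that builds an Argo Workflows manifest pinned to a tenant-specific node pool. The nodeSelector, tolerations, and serviceAccountName fields are standard Workflow spec fields, but everything else is hypothetical: the tenant_workflow helper, the example.com/pool and example.com/dedicated keys, the service account naming, and the image are placeholders, not our production configuration.

```python
import json

def tenant_workflow(tenant, image):
    """Build a Workflow manifest that only schedules onto the tenant's node pool."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": f"{tenant}-ingest-"},
        "spec": {
            "entrypoint": "main",
            # Hypothetical per-tenant service account carrying the IAM identity.
            "serviceAccountName": f"{tenant}-runner",
            # Node selector: pods may only land on nodes with this label.
            "nodeSelector": {"example.com/pool": tenant},
            # Toleration: lets pods onto nodes tainted for this tenant, which
            # pods without the toleration cannot be scheduled on.
            "tolerations": [
                {
                    "key": "example.com/dedicated",
                    "operator": "Equal",
                    "value": tenant,
                    "effect": "NoSchedule",
                }
            ],
            "templates": [{"name": "main", "container": {"image": image}}],
        },
    }

if __name__ == "__main__":
    print(json.dumps(tenant_workflow("acme", "example.com/ingest:latest"), indent=2))
```

The two settings work as a pair: the taint keeps other workloads off the specialized nodes, while the node selector keeps the tenant's pods from drifting onto general-purpose nodes.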

Lessons Learned

  1. The importance of collaboration
  2. Design for scalability from the start
  3. Stay on your toes
  4. Clearly define ownership and responsibilities

The importance of collaboration to build trust and adoption

When we first launched our internal platform, we were excited to provide teams across the company with a tool to streamline their workflows. However, we quickly ran into a classic chicken-and-egg problem: we struggled to get users on board because our platform didn’t yet include all the features they needed. At the same time, we couldn’t prioritize or implement those features effectively without input and feedback from users.

This initial challenge taught us a valuable lesson: the importance of collaboration. Instead of trying to build a perfect solution in isolation, we shifted our focus to engaging directly with our potential users. We organized feedback sessions, established clear communication channels, and invited teams to co-design the features they needed most. By making collaboration the cornerstone of our development process, we were not only able to prioritize the most impactful features but also foster trust and enthusiasm among our users.

Over time, we hope this approach will pay off. As more teams begin to use the platform and see its value, their feedback will help us evolve it into a truly indispensable tool. Seen this way, the chicken-and-egg problem isn’t a roadblock but a catalyst for creating a culture of collaboration and shared ownership.

Design for scalability from the start

Retrofitting scalability and multi-tenancy into an existing system is far more challenging than designing with these principles in mind from the beginning. Early design choices that account for growth, such as modular architecture and clear tenant separation, not only save significant rework later but also ensure the platform can adapt to evolving needs seamlessly. This foresight helps avoid bottlenecks and technical debt, enabling the system to scale gracefully as adoption increases.

Stay on your toes

Projects often start with a set of specifications and assumptions, which are almost guaranteed to change over time, especially for long-term initiatives. Optimizations made too early can lead to unforeseen challenges and require significant adjustments to accommodate new specifications, resulting in technical debt.

To navigate this, it’s essential to remain flexible and open to reevaluating decisions as the project evolves. Unexpected issues are almost inevitable, and these can force teams to revisit and rethink previous choices.

For example, one major challenge we faced stemmed from our initial decision to adopt a multi-tenant/multi-namespace GKE architecture with Workload Identity Federation. We later discovered hidden limitations in Google’s implementation, which prevented our architecture from scaling as intended. This unexpected roadblock required close collaboration with Google engineers and ultimately led us to transition to a multi-tenant/single-namespace strategy as a workaround.

This experience reinforced the importance of anticipating changes, accommodating unforeseen complexities, and avoiding rigid assumptions early in the design process. A lack of clear specifications at the start can lead to suboptimal architectural decisions. When specifications arrive late, after key technical choices have been made, the result is often a compromise: bending the technology to fit the requirements and accumulating technical debt.

Clearly define ownership and responsibilities

Integrating the platform into the existing system turned out to be more intricate than we initially expected. With limited resources, our team stepped in to take temporary ownership of the integration components. However, without a dedicated product manager or detailed specifications, we had to work with minimal guidance in the early stages.

What began as a relatively small task evolved into a critical part of the system as new requirements emerged over time. The “temporary” ownership became a long-term responsibility, and this vital component struggled with insufficient resources and attention for much of the project.

This taught us a key lesson: establishing clear ownership and accountability from the start is crucial. It ensures that critical components are properly managed and supported, preventing them from being deprioritized or overlooked as the project progresses.

Conclusion

The journey of building and refining our platform has been an ongoing process of learning and adapting. From overcoming the scalability and visibility challenges of “The Pipeline” to creating a modern workflow engine, we’ve seen firsthand the importance of modularity, collaboration, and foresight.

Key takeaways from our experience include the value of engaging users early to foster adoption, the necessity of designing for scalability from the outset, and the need to remain flexible in the face of evolving requirements. Most importantly, clearly defining ownership and responsibilities ensures that every component of the system is properly supported and prioritized.

We’re proud of how far our platform has come and excited for its future growth, knowing that the lessons we’ve learned will guide us in creating even more impactful solutions.