Neosync as a Developer Data Platform

Introduction

As a developer, you need data in order to build, test and debug features. Most of us end up hand-writing mock data and stuffing it into a script that runs when we stand up our local environment. But this only covers a small subset of the use-cases that we need data for and it usually isn't representative of the data that we see in production in structure and messiness. Ideally, we would love to test against production data but there are obvious security and privacy concerns.

So then how do we get high-quality data to test and debug our code?

This was the question that started Neosync and the one that we outlined in our Introducing Neosync blog. Over the last year, we've been working with companies of all shapes and sizes to help them anonymize and orchestrate sensitive data from production to lower level environments.

During that time, customers have been asking us to help them generate and move data to other parts of their stack so they can test their streaming infrastructure, messaging queues, data pipelines, batch processes, APIs and more.

And it's become clear that there is an opportunity for Neosync to support more than just databases. We've now reached a point where we believe that it's time to introduce a new type of platform that allows developers to self-serve pretty much any kind of data that they need to build, test and debug their code, applications and infrastructure.

For a while now, our customers at Neosync have been referring to us as a Developer Data Platform. Initially, we didn't think too much of it but recently we've realized that this is a perfect name for this type of platform. So we decided to make it official.

Neosync is a Developer Data Platform, or DDP.

What does a Developer Data Platform do?

Definition: A Developer Data Platform allows developers to define, generate, orchestrate and self-serve data for databases, APIs, data pipelines, messaging queues, batch processes and more in order to build, test and debug their code and applications in a security and privacy-compliant way.

This definition might seem overwhelming but a DDP really does just four things:

  1. Orchestrates data - DDPs can move data across environments, systems and applications.
  2. Generates data - DDPs can generate net new synthetic data that matches an existing schema or type definition.
  3. Anonymizes data - DDPs can anonymize data so that it can be safely used in lower-level environments without privacy and security concerns.
  4. Subsets data - DDPs can subset data in order to generate specific data sets for testing or debugging.
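To make these four capabilities concrete, here is a minimal sketch in Python. It is not Neosync's API: the function names, the `PROD_USERS` rows and the `USER_SCHEMA` mapping are all hypothetical, and the hashing shown is only a stand-in for real anonymization techniques.

```python
import hashlib
import random

# Hypothetical "production" rows; the field names are illustrative only.
PROD_USERS = [
    {"id": 1, "email": "ada@example.com", "country": "US", "age": 36},
    {"id": 2, "email": "linus@example.com", "country": "FI", "age": 54},
    {"id": 3, "email": "grace@example.com", "country": "US", "age": 45},
]

def anonymize(row):
    """Swap the email for a deterministic hashed placeholder (assumption:
    a real system would use salting or stronger techniques)."""
    digest = hashlib.sha256(row["email"].encode("utf-8")).hexdigest()[:12]
    return {**row, "email": f"user_{digest}@example.test"}

def subset(rows, predicate):
    """Keep only the rows needed for a specific test or debugging session."""
    return [row for row in rows if predicate(row)]

def generate(schema, count, seed=0):
    """Create net new rows from a schema that maps column -> generator."""
    rng = random.Random(seed)
    return [{col: gen(rng, i) for col, gen in schema.items()} for i in range(count)]

def orchestrate(rows, destination):
    """Move rows into a destination (here just a list standing in for a database)."""
    destination.extend(rows)
    return len(rows)

# Hypothetical schema with the same columns as PROD_USERS.
USER_SCHEMA = {
    "id": lambda rng, i: 1000 + i,
    "email": lambda rng, i: f"synthetic_{i}@example.test",
    "country": lambda rng, i: rng.choice(["US", "FI", "DE"]),
    "age": lambda rng, i: rng.randint(18, 90),
}

staging_db = []
us_only = subset(PROD_USERS, lambda row: row["country"] == "US")
orchestrate([anonymize(row) for row in us_only], staging_db)
orchestrate(generate(USER_SCHEMA, 2), staging_db)
```

After this runs, `staging_db` holds two anonymized US users plus two synthetic users, all with the same columns as the source rows and none with a real email address.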

Here's how we think about it architecturally.

[Figure: DDP architecture diagram]

There can be a number of inputs into a DDP such as a database, stream, batch files or even API calls. These data sources feed their data into a DDP where it is anonymized or used as a reference for synthetic data generation. That data is then orchestrated and synced down to databases, object storage or other destinations in lower-level environments.
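The flow described above can be sketched as a small job that reads from several sources, applies a transformation and writes to several destinations. This is an illustration under our own assumptions, not how Neosync is implemented: the `Job` class, the source functions and `mask_name` are hypothetical names, and plain lists stand in for a staging database and object storage.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]

@dataclass
class Job:
    """Hypothetical sync job: read from sources, transform, write to destinations."""
    sources: List[Callable[[], Iterable[Record]]]
    transform: Callable[[Record], Record]
    destinations: List[Callable[[Record], None]]

    def run(self) -> int:
        count = 0
        for read in self.sources:
            for record in read():
                safe = self.transform(record)  # anonymize before anything is written
                for write in self.destinations:
                    write(safe)
                count += 1
        return count

def database_source():
    # Stand-in for rows read from a production database.
    yield {"name": "Ada Lovelace", "plan": "pro"}

def api_source():
    # Stand-in for records returned by an API call.
    yield {"name": "Grace Hopper", "plan": "free"}

def mask_name(record):
    # Simplest possible anonymization: redact the sensitive field.
    return {**record, "name": "REDACTED"}

staging_table = []  # stands in for a database in a lower-level environment
object_store = []   # stands in for object storage

job = Job(
    sources=[database_source, api_source],
    transform=mask_name,
    destinations=[staging_table.append, object_store.append],
)
synced = job.run()
```

The point of the design is that the transform sits between every source and every destination, so no raw record can reach a lower-level environment.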

From there, developers can self-serve the data they need based on their use-case from a centralized, version controlled and audited platform. This makes it easier for organizations to track who is using what data and how it's being used. Even though that data is anonymized or synthetically generated, it's still important to understand usage patterns. A centralized platform also gives developers a way to re-use data sets that others on their team may have already generated.
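As a rough illustration of what "centralized, version controlled and audited" could mean in practice, here is a toy catalog in Python. Everything in it is an assumption made for the example (the `DatasetCatalog` class, its methods, the data set name and the user names); it is not a description of Neosync's actual interface.

```python
from datetime import datetime, timezone

class DatasetCatalog:
    """Hypothetical catalog: versioned data sets plus a log of who fetched what."""

    def __init__(self):
        self._versions = {}  # data set name -> list of row snapshots
        self.audit_log = []  # (timestamp, user, name, version) tuples

    def publish(self, name, rows):
        """Store a new immutable snapshot and return its 1-based version number."""
        versions = self._versions.setdefault(name, [])
        versions.append([dict(row) for row in rows])
        return len(versions)

    def checkout(self, name, user, version=None):
        """Return a copy of a snapshot (latest by default) and record the access."""
        versions = self._versions[name]
        if version is None:
            version = len(versions)
        self.audit_log.append((datetime.now(timezone.utc), user, name, version))
        return [dict(row) for row in versions[version - 1]]

catalog = DatasetCatalog()
catalog.publish("checkout-bug-repro", [{"order_id": 1, "total": 0}])
v2 = catalog.publish(
    "checkout-bug-repro",
    [{"order_id": 1, "total": 0}, {"order_id": 2, "total": -5}],
)

latest = catalog.checkout("checkout-bug-repro", user="dev_a")
first = catalog.checkout("checkout-bug-repro", user="dev_b", version=1)
```

Here a second developer re-uses a data set that a teammate already published, and `catalog.audit_log` records both accesses along with the version each one received.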

Isn't this just ETL?

This is the main question that I expect people to ask and I think it's a fair one. At the end of the day, anytime you're moving data across systems, applications or environments, ETL is the first thing that comes to mind. A DDP definitely does ETL, but it also does things that traditional ETL platforms don't do, and it leaves out some things that ETL platforms do.

So I made a table to compare the features of an ETL platform and DDP.

| Feature | ETL Platform | Developer Data Platform (DDP) |
| --- | --- | --- |
| Data Orchestration | Yes | Yes |
| Data Transformation | SQL-based Transformations | Synthetic Data Generation and Data Anonymization |
| Data Generation | No | Yes |
| Data Anonymization | No | Yes |
| Data Subsetting | No | Yes |
| Integration with Databases | Yes | Yes |
| Integration with APIs | Limited | Yes |
| Integration with Messaging Queues | Limited | Yes |
| Integration with Batch Processes | Limited | Yes |
| Self-Serve Data Access | No | Yes |
| Version Control | No | Yes |
| Audit Trails | Yes | Yes |
| Schema Adherence | Often handled | Ensures adherence to existing schemas and type definitions |
| Performance Testing | No | Yes |
| Custom Data Set Creation | No | Yes |
| Privacy Compliance | Not typically addressed | Ensures data is anonymized for privacy compliance |

As you can see, the majority of the differences come from the supported data sources and from data generation and anonymization. In my mind this is the main difference between ETL platforms and DDPs. DDPs specialize in supporting different types of inputs and in a core generation/anonymization engine.

Who needs a Developer Data Platform?

There are two questions you can ask yourself to see if you need a DDP:

  1. Do you care about giving/having a great developer experience?
  2. Do you have sensitive data that you would rather not expose to lower-level environments?

If you answered yes to either or both of these questions, then you should be using a DDP.

Whether that's user data or other sensitive business data, DDPs will emerge as the go-to platform for data orchestration, generation and anonymization across the stack.

The road ahead

We're at the very beginning of this journey but it's become clear that the world needs a DDP. Developers expect to have a world-class developer experience and you can't offer that without world-class data. In addition, data security and privacy requirements are only becoming more important for organizations. We believe a DDP is the right solution to protect data and improve developer productivity and efficiency.

Our roadmap is long but we're excited to create this category and bring this platform to market.

Evis & Nick