Using Neosync to Anonymize and Securely Move Data Across AWS Accounts

Intro

Neosync is a data anonymization and synthetic data platform that customers use to anonymize sensitive production data and sync it across environments. One of the most common questions we get from customers is, "How can we move data across our cloud accounts, since our production and staging systems are in different accounts?" It makes sense given that most customers segregate their environments into different AWS/GCP/Azure accounts. So I wanted to take a minute and talk through how we solve this with Neosync.

In this blog we'll go through a best practice architecture review for how to securely move data across AWS accounts.

Requirements

First, let's expand on our requirements. Generally, when we talk to customers about this use case, the same requirements come up:

  1. I don't want to give unauthorized developers access to our production environment.
  2. I want to anonymize my sensitive production data.
  3. I want to limit the amount of network traffic crossing VPCs.

These are all reasonable requirements that most mature, cloud-native companies will have. And as you're about to see, using this architecture we can meet all of these requirements.

Architecture review

Let's diagram out the traditional customer cloud environment using segregated accounts. We're going to use AWS in this example, but you can pretty much substitute any other cloud provider.

[Diagram: production and staging environments in separate AWS accounts, each in its own VPC with a database in a private subnet]

In this diagram, we have two environments, each in its own VPC. In each environment, we have a database in a private subnet along with a bunch of other resources. Since we really only care about the database in this blog, I've left everything else out for brevity.

This is pretty standard and generally what most segregated environments look like.

Now, back to our original question. How do we securely move data across environments? Let's build on this diagram.

The first thing we'll need is another AWS account that we can use as a shared account. In this account we'll create an S3 bucket and use it as our staging ground. Let's update our diagram.

[Diagram: the two environments plus a shared AWS account containing an S3 bucket as the staging ground]
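
One piece of glue the diagram implies but doesn't show: the staging bucket needs a cross-account bucket policy so the production account can write anonymized data into it and lower-level accounts can read it back out. Here's a minimal sketch using boto3; the account IDs, role names and bucket name are placeholders I've made up for illustration:

```python
import json
import boto3

# Hypothetical values -- substitute your own account IDs, roles and bucket.
BUCKET = "neosync-staging-ground"
PROD_ROLE = "arn:aws:iam::111111111111:role/neosync-prod-writer"
STAGE_ROLE = "arn:aws:iam::222222222222:role/neosync-stage-reader"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Production account writes anonymized data into the bucket.
            "Sid": "AllowProdWrite",
            "Effect": "Allow",
            "Principal": {"AWS": PROD_ROLE},
            "Action": ["s3:PutObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
        {
            # Staging (and other lower-level) accounts read it back out.
            "Sid": "AllowStageRead",
            "Effect": "Allow",
            "Principal": {"AWS": STAGE_ROLE},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
        },
    ],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```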

Why S3?

Why are we using S3? Why don't we just use a couple of bastion hosts to tunnel into each environment? Ah, good question.

First, S3 is cheaper than running an EC2 instance as a bastion host and, depending on the amount of data you're moving, may have cheaper data transfer costs if you're moving data across regions. On top of that, configuring, managing and scaling bastion hosts is a pain. You're likely not going to want them running all of the time, so you'll want to spin them up, use them and bring them back down. Meaning you'll need to Terraform that entire solution, which isn't the biggest deal, but it's still a pain.

Second, if you use S3 as a staging ground, you can have different syncing schedules from production -> stage. Meaning that you can sync from prod -> stage once a month, but then folks can pull from stage as many times as they want since the data will still be there (as long as you don't delete it). A developer can do whatever they need with the data, mess it up, then just blow away their environment and re-sync a fresh copy without having to trigger the entire pipeline again.

Third, while we haven't talked about Neosync yet, another reason S3 is great is that you limit direct connections across environments: each environment only ever talks to S3, never to the other environment's network. We'll talk about this more in a minute.
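
One practical note on that second point: the "data will still be there" benefit only holds if you don't delete objects too aggressively, so you may want an S3 lifecycle rule to control how long staged snapshots live. A minimal boto3 sketch, with a hypothetical bucket, prefix and retention window:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical: expire staged snapshots under the "prod-sync/" prefix after
# 35 days, long enough to cover a monthly prod -> stage sync cadence.
s3.put_bucket_lifecycle_configuration(
    Bucket="neosync-staging-ground",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-staged-snapshots",
                "Filter": {"Prefix": "prod-sync/"},
                "Status": "Enabled",
                "Expiration": {"Days": 35},
            }
        ]
    },
)
```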

Let's add in some data flow lines so we can see how data is going to flow across the system.

[Diagram: data flowing from the production database to the shared S3 bucket, and from the bucket into the staging database]

Adding Neosync into the mix

Now that we have our data flow diagrammed out, let's update our diagram to include Neosync and talk through what's happening.

[Diagram: Neosync deployed in the private subnet of each environment, syncing through the shared S3 bucket]

We've deployed Neosync and attached it to our private subnet in both environments. From a deployment perspective, you can deploy Neosync into Kubernetes or onto an EC2 instance; you'll just need to set up the networking so it can reach the database.
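
To make that networking step concrete, here's a rough sketch of opening the database port to Neosync using boto3. The security group IDs are made up, and it assumes a Postgres database (port 5432) with Neosync running behind its own security group in the same VPC:

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical IDs: the database's security group and the security group
# attached to the Neosync deployment (EKS nodes or an EC2 instance).
DB_SG = "sg-0db0000000000000a"
NEOSYNC_SG = "sg-0ns0000000000000b"

# Allow Neosync to reach Postgres inside the private subnet.
ec2.authorize_security_group_ingress(
    GroupId=DB_SG,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 5432,
            "ToPort": 5432,
            "UserIdGroupPairs": [{"GroupId": NEOSYNC_SG}],
        }
    ],
)
```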

Let's start with our production environment. The entire Neosync product is deployed here, including the Frontend, Backend, Worker, Orchestrator and more. This is where developers create their jobs and define the schema transformations. Here, Neosync reads from the production database, anonymizes the data and streams the anonymized output to S3.

Now S3 holds anonymized production data that any environment with access to the bucket can retrieve.
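
As a quick sanity check, any environment with read access to the bucket can list what the production sync wrote. This is plain boto3, not a Neosync feature, and it reuses the hypothetical bucket and prefix from earlier:

```python
import boto3

s3 = boto3.client("s3")

# List the anonymized snapshots the production sync wrote to the staging ground.
resp = s3.list_objects_v2(Bucket="neosync-staging-ground", Prefix="prod-sync/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"], obj["LastModified"])
```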

Let's move to our staging environment.

The unique thing here is that only the CLI and the Neosync API server are deployed in staging. We don't need to deploy all of the resources because we're just using Neosync as a way to retrieve the data from S3 and stream it to another database, which is exactly what's happening here.

The Neosync CLI retrieves its configuration from the Neosync API Server and then streams the anonymized data from S3 directly into the staging database.

This allows us to fulfill our requirements. No developers were given extra privileges. Sensitive production data was anonymized before it reached lower-level environments. Inter-network traffic was minimized by using S3 as a staging ground instead of bastion hosts.

And we get the added benefit of being able to hydrate other databases (development, local, CI) from S3 without having to run a production sync again, since the data is already in S3!

Wrapping up

In this blog, we looked at how to securely move data across AWS accounts using Neosync and S3. This is a common architecture that we see customers adopt because it's fast, reliable and secure. Using S3 as our staging ground, we can flexibly sync data down to lower-level environments without giving developers extra permissions to access production infrastructure.

