Powerful database anonymizer with flexible rules. Written in Rust.
Datanymizer is created & supported by Evrone. See what else we develop with Rust.
You can find more information in the articles in English and Russian.
Database -> Dumper (+Faker) -> Dump.sql
You can import or process your dump with any supported database, without third-party importers.
Datanymizer generates a database-native dump.
There are several ways to install pg_datanymizer; choose the one most convenient for you.
# Linux / macOS / Windows (MinGW, etc.). Installs into ./bin/ by default
$ curl -sSfL https://raw.githubusercontent.com/datanymizer/datanymizer/main/cli/pg_datanymizer/install.sh | sh -s
# Or a shorter way
$ curl -sSfL https://git.io/pg_datanymizer | sh -s
# Specify installation directory and version
$ curl -sSfL https://git.io/pg_datanymizer | sudo sh -s -- -b /usr/local/bin v0.2.0
# Alpine Linux (wget)
$ wget -q -O - https://git.io/pg_datanymizer | sh -s
# Installs the latest stable release
$ brew install datanymizer/tap/pg_datanymizer
# Builds the latest version from the repository
$ brew install --HEAD datanymizer/tap/pg_datanymizer
$ docker run --rm -v `pwd`:/app -w /app datanymizer/pg_datanymizer
First, inspect your database schema, choose fields with sensitive data, and create a config file based on it.
# config.yml
tables:
  - name: markets
    rules:
      name_translations:
        template:
          format: '{"en": "{{_1}}", "ru": "{{_2}}"}'
          rules:
            - words:
                min: 1
                max: 2
            - words:
                min: 1
                max: 2
  - name: franchisees
    rules:
      operator_mail:
        template:
          format: user-{{_1}}-{{_2}}
          rules:
            - random_num: {}
            - email:
                kind: Safe
      operator_name:
        first_name: {}
      operator_phone:
        phone:
          format: +###########
      name_translations:
        template:
          format: '{"en": "{{_1}}", "ru": "{{_2}}"}'
          rules:
            - words:
                min: 2
                max: 3
            - words:
                min: 2
                max: 3
  - name: users
    rules:
      first_name:
        first_name: {}
      last_name:
        last_name: {}
  - name: customers
    rules:
      email:
        template:
          format: user-{{_1}}-{{_2}}
          rules:
            - random_num: {}
            - email:
                kind: Safe
                uniq:
                  required: true
                  try_count: 5
      phone:
        phone:
          format: +7##########
          uniq: true
      city:
        city: {}
      age:
        random_num:
          min: 10
          max: 99
      first_name:
        first_name: {}
      last_name:
        last_name: {}
      birth_date:
        datetime:
          from: 1990-01-01T00:00:00+00:00
          to: 2010-12-31T00:00:00+00:00
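To make the template rule in the config above more concrete, here is a minimal Python sketch of how numbered placeholders such as {{_1}} and {{_2}} could be filled from a list of sub-rules. It is an illustration only, not Datanymizer's implementation (the tool itself uses a full template engine). The names render_template, random_num and safe_email are hypothetical stand-ins.

```python
import random
import re

# Matches numbered placeholders such as {{_1}} or {{ _2 }}.
PLACEHOLDER = re.compile(r"\{\{\s*_(\d+)\s*\}\}")


def render_template(fmt, sub_rules):
    """Replace each {{_N}} with the output of the N-th sub-rule (1-based)."""
    # Call every sub-rule once, so a repeated placeholder reuses its value.
    values = [rule() for rule in sub_rules]

    def substitute(match):
        return str(values[int(match.group(1)) - 1])

    return PLACEHOLDER.sub(substitute, fmt)


# Hypothetical stand-ins for the `random_num` and `email` rules.
def random_num():
    return random.randint(0, 9999)


def safe_email():
    return f"user{random.randint(0, 9999)}@example.com"
```

For example, render_template("user-{{_1}}-{{_2}}", [random_num, safe_email]) yields a string with the same shape as the operator_mail rule.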
Then create a dump from your database instance:
pg_datanymizer -f /tmp/dump.sql -c ./config.yml postgres://postgres:postgres@localhost/test_database
This creates a new dump file, /tmp/dump.sql, containing a native SQL dump for a PostgreSQL database.
You can import the fake data from this dump into a new PostgreSQL database with the following command:
psql -U postgres -d new_database < /tmp/dump.sql
The dumper can also stream the dump to STDOUT, like pg_dump, so you can use it in pipelines:
pg_datanymizer -c ./config.yml postgres://postgres:postgres@localhost/test_database > /tmp/dump.sql
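Because the dump goes to STDOUT, it can be consumed by another program without an intermediate file. The following is a minimal sketch, assuming you want to compress the stream on the fly from Python; stream_to_gzip is a hypothetical helper, and only the -c option and connection URL shown above are used.

```python
import gzip
import shutil
import subprocess


def stream_to_gzip(argv, out_path):
    """Run a command and gzip everything it writes to STDOUT into out_path."""
    with gzip.open(out_path, "wb") as out, subprocess.Popen(
        argv, stdout=subprocess.PIPE
    ) as proc:
        # Copy in chunks, so the whole dump never has to fit in memory.
        shutil.copyfileobj(proc.stdout, out)
    # Leaving the `with` block waits for the process, so returncode is set.
    if proc.returncode != 0:
        raise RuntimeError(f"command exited with status {proc.returncode}")


# Usage (requires pg_datanymizer on PATH and a reachable database):
# stream_to_gzip(
#     ["pg_datanymizer", "-c", "./config.yml",
#      "postgres://postgres:postgres@localhost/test_database"],
#     "/tmp/dump.sql.gz",
# )
```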
You can specify which tables to include in or exclude from the dump.
To dump data only from public.markets and public.users:
# config.yml
#...
filter:
  only:
    - public.markets
    - public.users
To ignore those tables and dump data from all the others:
# config.yml
#...
filter:
  except:
    - public.markets
    - public.users
You can also specify data and schema filters separately.
This is equivalent to the previous example.
# config.yml
#...
filter:
  data:
    except:
      - public.markets
      - public.users
To skip both the schema and the data of all other tables:
# config.yml
#...
filter:
  schema:
    only:
      - public.markets
      - public.users
To skip the schema for the markets table and dump data only from the users table:
# config.yml
#...
filter:
  data:
    only:
      - public.users
  schema:
    except:
      - public.markets
You can use wildcards in the filter section:
- ? matches exactly one occurrence of any character;
- * matches arbitrarily many (including zero) occurrences of any character.
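The wildcard semantics described above can be sketched in a few lines of Python. This is an illustration of the matching rules only, not the code Datanymizer uses; wildcard_to_regex and matches are hypothetical names, and the sketch assumes a pattern must match the whole table name.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a filter pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")            # exactly one character
        elif ch == "*":
            parts.append(".*")           # zero or more characters
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


def matches(pattern, table_name):
    """Return True if the whole table name matches the pattern."""
    return wildcard_to_regex(pattern).fullmatch(table_name) is not None
```

Under these assumptions, matches("public.*", "public.users") is true, while matches("public.user?", "public.user") is false because ? requires one character.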
You can specify a condition (an SQL WHERE clause) and a row limit for the dumped data per table:
# config.yml
tables:
  - name: people
    query:
      # don't dump some rows
      dump_condition: "last_name <> 'Sensitive'"
      # select maximum 100 rows
      limit: 100
As an additional option, you can specify an SQL condition that defines which rows will be transformed (anonymized):
# config.yml
tables:
  - name: people
    query:
      # don't dump some rows
      dump_condition: "last_name <> 'Sensitive'"
      # preserve original values for some rows
      transform_condition: "NOT (first_name = 'John' AND last_name = 'Doe')"
      # select maximum 100 rows
      limit: 100
You can use the dump_condition, transform_condition and limit options in any combination (only transform_condition; transform_condition and limit; etc.).
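As a conceptual model of how the three options interact, the sketch below runs the same conditions against an in-memory SQLite table. It does not reproduce the queries Datanymizer actually builds: SQLite stands in for PostgreSQL, plan_rows and demo are hypothetical names, the ORDER BY is added only to make the output deterministic, and applying the limit after dump_condition is an assumption.

```python
import sqlite3

DUMP_CONDITION = "last_name <> 'Sensitive'"
TRANSFORM_CONDITION = "NOT (first_name = 'John' AND last_name = 'Doe')"
LIMIT = 100


def plan_rows(conn):
    """Return (first_name, last_name, will_transform) for every dumped row."""
    query = (
        f"SELECT first_name, last_name, ({TRANSFORM_CONDITION}) "
        f"FROM people WHERE {DUMP_CONDITION} ORDER BY id LIMIT {LIMIT}"
    )
    return [(first, last, bool(flag)) for first, last, flag in conn.execute(query)]


def demo():
    """Build a three-row table and show which rows are dumped or transformed."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE people "
        "(id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
    )
    conn.executemany(
        "INSERT INTO people (first_name, last_name) VALUES (?, ?)",
        [("John", "Doe"), ("Jane", "Roe"), ("Max", "Sensitive")],
    )
    return plan_rows(conn)
```

In this model, the row for Max Sensitive is not dumped at all, John Doe is dumped with his original values, and Jane Roe is dumped after anonymization.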
You can specify global variables available from any template rule.
# config.yml
tables:
  - name: users
    rules:
      bio:
        template:
          format: "User bio is {{var_a}}"
      age:
        template:
          format: "{{_0 | float * global_multiplicator}}"
#...
globals:
  var_a: Global variable 1
  global_multiplicator: 6
Rule | Description |
---|---|
email | Emails with different options |
ip | IP addresses. Supports IPv4 and IPv6 |
words | Lorem words of varying length |
first_name | First name generator |
last_name | Last name generator |
city | City name generator |
phone | Random phone numbers in a configurable format |
pipeline | Use a pipeline to generate more complicated values |
capitalize | Works like a filter: capitalizes the input value |
template | Template engine for generating random text with included rules |
digit | Random digit (in range 0..9) |
random_num | Random number with min and max options |
password | Passwords of configurable length (supports min and max options) |
datetime | DateTime strings with options (from and to) |
more than 70 rules in total... | |
For the complete list of rules, please refer to this document.
You can specify that result values must be unique (they are not unique by default). You can use short or full syntax.
Short:
uniq: true
Full:
uniq:
  required: true
  try_count: 5
Uniqueness is ensured by re-generating a value whenever it duplicates one that was already produced.
You can customize the number of attempts with try_count (this is an optional field; the default number of tries depends on the rule).
Currently, uniqueness is supported by: email, ip, phone, random_num.
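The re-generation strategy described above can be sketched as a simple retry loop. This is a minimal illustration, not Datanymizer's code: generate_unique and random_phone are hypothetical names, and raising an error once the attempts are exhausted is an assumption about what "required" implies.

```python
import random


def generate_unique(generate, seen, try_count=5):
    """Call `generate` until it returns a value not in `seen`.

    Gives up after `try_count` attempts and raises RuntimeError.
    """
    for _ in range(try_count):
        value = generate()
        if value not in seen:
            seen.add(value)
            return value
    raise RuntimeError(f"no unique value after {try_count} attempts")


def random_phone():
    # Hypothetical stand-in for the `phone` rule with format +7##########,
    # assuming each # is replaced by a random digit.
    return "+7" + "".join(random.choice("0123456789") for _ in range(10))
```

A caller would keep one `seen` set per column, for example seen = set() followed by repeated generate_unique(random_phone, seen) calls.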
You can specify the locale for individual rules:
first_name:
  locale: RU
The default locale is EN, but you can specify a different default locale:
tables:
  # ........
default:
  locale: RU
We also support ZH_TW (Traditional Chinese) and RU (translation in progress).
You can reference values of other row fields in templates.
Use prev for original values and final for anonymized ones:
tables:
  - name: some_table
    # You must specify the order of rule execution when using `final`
    rule_order:
      - greeting
      - options
    rules:
      first_name:
        first_name: {}
      greeting:
        template:
          # Keeping the first name, but anonymizing the last name
          format: "Hello, {{ prev.first_name }} {{ final.last_name }}!"
      options:
        template:
          # Using the anonymized value again
          format: "{greeting: \"{{ final.greeting }}\"}"
When using final, you must specify the order of rule execution with rule_order.
All rules not listed will be placed at the beginning (i.e. you must list only rules with final).
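The ordering rule stated above (unlisted rules first, then the listed ones) can be expressed as a small function. This is only a sketch of that rule: execution_order is a hypothetical name, and keeping the unlisted rules in their config order is an assumption.

```python
def execution_order(rule_names, rule_order):
    """Return rule names with unlisted rules first, then those in rule_order."""
    listed = set(rule_order)
    # Rules not mentioned in rule_order keep their relative config order.
    unlisted = [name for name in rule_names if name not in listed]
    return unlisted + list(rule_order)
```

For the config above, execution_order(["first_name", "greeting", "options"], ["greeting", "options"]) runs first_name before greeting, and greeting before options, so final.greeting is available when options is rendered.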
We implemented a built-in key-value store that allows information to be exchanged between anonymized rows.
It is available via the special functions in templates.
Take a look at an example:
tables:
  - name: users
    rules:
      name:
        template:
          # Save a name to the store as a side effect, the key is `user_names.<USER_ID>`
          format: "{{ _1 }}{{ store_write(key='user_names.' ~ prev.id, value=_1) }}"
          rules:
            - person_name: {}
  - name: user_operations
    rules:
      user_name:
        template:
          # Using the saved value again
          format: "{{ store_read(key='user_names.' ~ prev.user_id) }}"
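To show the idea behind the store, here is a minimal Python model of the example above: a name generated for a users row is saved under user_names.<USER_ID> and later read back for the matching user_operations rows. Store and anonymize are hypothetical names, and the sketch assumes the users table is processed before user_operations.

```python
class Store:
    """A tiny in-memory key-value store shared across rows and tables."""

    def __init__(self):
        self._data = {}

    def write(self, key, value):
        self._data[key] = value

    def read(self, key):
        return self._data[key]


def anonymize(users, user_operations, fake_name):
    """Anonymize user names and reuse them in the related operations."""
    store = Store()
    new_users = []
    for row in users:
        name = fake_name()
        # Counterpart of store_write(key='user_names.' ~ prev.id, value=_1)
        store.write(f"user_names.{row['id']}", name)
        new_users.append({**row, "name": name})
    # Counterpart of store_read(key='user_names.' ~ prev.user_id)
    new_ops = [
        {**row, "user_name": store.read(f"user_names.{row['user_id']}")}
        for row in user_operations
    ]
    return new_users, new_ops
```

This keeps the anonymized name consistent between the two tables, which is the purpose of the store.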
- PostgreSQL
- MySQL or MariaDB (TODO)
- pg_datanymizer CLI application manual.
- config.yml file specification.
- Full list of transformation rules.
- Integration testing manual.
Mac to Linux
rustup target add x86_64-unknown-linux-gnu
brew tap messense/macos-cross-toolchains
brew install x86_64-unknown-linux-gnu
CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_LINKER=x86_64-linux-gnu-gcc cargo build --target x86_64-unknown-linux-gnu --release --features openssl/vendored