Datagen is a web server with a UI and a set of APIs (documented with Swagger) to interact with.
Datagen is able to:
- Push data to multiple systems (HDFS, S3, ADLS, GCS, Kafka, Hive, HBase, etc.)
- Handle multiple formats (Avro, Parquet, ORC, JSON, CSV)
- Generate structured data of different types (string, integer, timestamp, UUID, etc.)
- Provide filterable pre-defined data (countries, 40K+ cities, 35K+ names, phone patterns, etc.)
- Make data respect rules (min, max, repartition, pattern, regex, etc.)
- Add relations between generated data (one column's value depends on other columns' values, with possibly complex evaluation)
- Generate unstructured data leveraging AI (OpenAI, Bedrock, Ollama, or any locally embedded model)
- Schedule data generation
- Analyze existing data to mimic it (currently in alpha)
Refer to the documentation to understand Datagen and learn how to deploy and use it.
Datagen is made with:
- Java 17
- Maven 3.9
It is built for:
- CDP 7.1.9+
- AWS, GCP, Azure
- Docker & K8s
Datagen is a Spring Boot project that depends on SDKs to interact with various services.
Datagen can be run in two ways:
- As a standalone web server, anywhere, in which case you need to provide the configuration manually
- Fully integrated into CDP using a parcel and a CSD, and available as a service through Cloudera Manager (configuration is automatic)
For a fast and easy deployment, follow this procedure:
wget https://datagen-repo.s3.eu-west-3.amazonaws.com/1.0.0/standalone/datagen-standalone-files.tar.gz
tar -xvzf datagen-standalone-files.tar.gz
cd datagen_standalone-1.0.0/
./launch.sh
In this standalone deployment, TLS is deactivated by default and a user/password is required (admin/admin by default).
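The standalone steps above can be wrapped in a few shell functions so that a failed download or extraction stops the deployment instead of launching from a half-prepared directory. This is only a sketch: the function names (`datagen_archive_url`, `datagen_install_dir`, `datagen_deploy`) are hypothetical, and it assumes that other releases follow the same URL and directory naming as 1.0.0, which this document does not confirm.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the documented standalone steps in reusable functions.
# Assumption: releases other than 1.0.0 use the same URL and directory layout.

# Print the download URL of the standalone archive for a given version.
datagen_archive_url() {
  local version="$1"
  printf 'https://datagen-repo.s3.eu-west-3.amazonaws.com/%s/standalone/datagen-standalone-files.tar.gz\n' "$version"
}

# Print the directory name the archive is expected to unpack into.
datagen_install_dir() {
  local version="$1"
  printf 'datagen_standalone-%s\n' "$version"
}

# Download, unpack and launch Datagen; stops at the first failing step.
# Note: this changes the current directory of the calling shell.
datagen_deploy() {
  local version="${1:-1.0.0}"
  wget "$(datagen_archive_url "$version")" &&
    tar -xvzf datagen-standalone-files.tar.gz &&
    cd "$(datagen_install_dir "$version")" &&
    ./launch.sh
}
```

After sourcing the file, `datagen_deploy 1.0.0` runs the same commands as the procedure above, chained with `&&` so that each step runs only if the previous one succeeded.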
For a deployment integrated into CDP, refer to the detailed procedure available in Datagen's documentation.
In that deployment, TLS is activated by default and a user/password is required (admin/admin by default).