DSS Technical Architecture
In Production
"Data prep is taking way too much time away from developing advanced analytics that drive our business"
"We are using multiple platforms and tools for data prep, model development and deployment"
Operationalization
Orchestrate data workflows in production: advanced workflow automation, performance monitoring, …
Diagram: automation environment, operated by system and database administrators; operational data (log files, customer touch points, …); large-scale data storage & processing systems (Big Data / distributed: HDFS; analytical databases / DWH: MPP; in-memory processing); dev. or prod. environments.
A FEW NOTES:
- DSS is installed on your own infrastructure (whether your own data center or in the cloud).
- DSS does not ingest the data; instead, it connects to your infrastructure and pushes the calculations down to it to avoid data movement. But…
- … DSS can also do local processing, including in Python or R, so the hosting server may need enough memory/CPU (see the sketch below).
- When integrated with a Hadoop cluster, DSS usually sits on an edge node / gateway node.
Diagram: DSS reads/writes/executes against your data storage & processing systems via Hadoop client libs & configuration, JDBC connections, or custom Python connectors; targets include Hadoop / HDFS & distributed processing, analytical SQL DBs / DWH, operational SQL DBs, and APIs, 3rd-party apps & data, custom data sources.
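To make the local-processing note concrete, here is a minimal sketch of a Python recipe as it would run inside DSS; the dataset and column names are hypothetical, and the point is that get_dataframe() materializes the data as a pandas DataFrame in the DSS host's memory, which is what drives the RAM sizing for local Python/R work.

    # Minimal DSS Python recipe sketch (dataset and column names are hypothetical).
    # get_dataframe() loads the whole input into the DSS host's memory as pandas.
    import dataiku
    import pandas as pd

    raw = dataiku.Dataset("raw_events")           # hypothetical input dataset
    df = raw.get_dataframe()                      # pulled into local RAM

    # Local, in-memory transformation running on the DSS host itself
    df["event_date"] = pd.to_datetime(df["event_ts"]).dt.date
    daily = df.groupby("event_date", as_index=False).size()

    out = dataiku.Dataset("events_prepared")      # hypothetical output dataset
    out.write_with_schema(daily)                  # written back through the DSS connector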
Dedicated environments can be justified by security requirements or by maturity (early stage vs production oriented).
Diagram: an active DSS and a passive DSS instance.
Where Can Processing Occur
- Local server (host file system, remote file system, …)
- In SQL database (Vertica, Greenplum, Redshift, PostgreSQL, …)
- In Hadoop / Spark (AWS EMR, …)
- In Kubernetes & Docker
Automation env: 64-128 GB of RAM (+ 64 GB in pre-prod)
The automation node runs, maintains, and monitors project workflows and models in production. Since most actions are batch jobs, you can spread the activity across the 24 hours of the day and optimize resource consumption. You can also use a non-production automation node to validate your project before going to production (see the API-driven trigger sketched below).

Scoring env: 4+ GB per node, deployed as a fleet of n nodes
Scoring nodes are real-time production nodes for scoring or labeling with prediction models. A single node does not require a lot of memory, but these nodes are generally deployed on dedicated clusters of containers or virtual machines.
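A common way to drive this kind of batch activity is through the DSS public API; the sketch below uses the dataikuapi Python client with a placeholder host, API key, project key, and scenario id, and assumes the scenario object's run_and_wait() helper.

    # Hypothetical sketch: trigger a scenario on an automation node via the public API.
    # Host URL, API key, project key and scenario id are placeholders.
    import dataikuapi

    client = dataikuapi.DSSClient("https://automation-node.example.com:11200", "MY_API_KEY")
    scenario = client.get_project("MY_PROJECT").get_scenario("rebuild_datasets")

    # Blocks until the scenario run finishes; inspect the returned run object as needed.
    run = scenario.run_and_wait()
    print("Scenario run finished:", run)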
Memory usage on the DSS server side can be controlled at the Unix level using Linux cgroups.
Database resource management can be done on the DB side at the user level when per-user credentials mode is activated.
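As a generic illustration of the cgroups mechanism (not DSS's own cgroup integration, which is configured within DSS itself), the sketch below creates a cgroup v2 group with a memory cap and places a process in it; it assumes a cgroup v2 unified hierarchy mounted at /sys/fs/cgroup and root privileges.

    # Generic Linux cgroup v2 illustration (not the DSS-specific integration).
    # Assumes a cgroup v2 unified hierarchy at /sys/fs/cgroup and root privileges.
    import os
    from pathlib import Path

    group = Path("/sys/fs/cgroup/dss-demo")        # hypothetical group name
    group.mkdir(exist_ok=True)

    # Cap the memory usage of everything placed in this group at 64 GiB.
    (group / "memory.max").write_text(str(64 * 1024**3))

    # Move a process (here: the current one) into the group.
    (group / "cgroup.procs").write_text(str(os.getpid()))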
Single Server Installation
Diagrams: three data-access patterns for a single-server Design/Automation node.
- Managed file system datasets: users reach the DSS Core from their browser over HTTP(S); jobs read and write the datasets by streaming the data through DSS.
- SQL datasets: jobs are submitted to the database over JDBC, so the processing runs inside the database.
- Hadoop/Spark datasets: jobs are submitted over RPC to the cluster manager; workers and executors process the HDFS datasets directly.
*Fast Path connections are available where provided (e.g. Redshift + S3, Spark + Cloud, Spark + HDFS, Teradata + Hadoop, …). Otherwise, data is streamed through DSS using abstracted representations to ensure compatibility across natively compatible storage systems.
Diagram: inside the DSS node, the DSS Core exposes the Public API and the UI to users over HTTP(S), spawns Job Kernels and Notebook Kernels, and manages Configuration, Logs, and Data; external services are reached over HTTP(S), LDAP, JDBC, RMI, and RPC.
Main design environment with multiple concurrent users:
● RAM (if no Hadoop/Spark cluster)
○ small: 2-3 people on <5 GB of data: 32-64 GB
○ medium: 2-3 people with heavy machine learning: 64-128 GB
○ large: 5+ people on large data, with ML: 256+ GB
● 8 to 32 cores (if no Hadoop/Spark cluster; 1 core per simultaneous active user)
● Config: 128-256 GB of disk for DSS
● Storage for your data if stored locally: ~10x the size of the raw data
● SSD recommended
DSS Automation
Diagram: on the automation node, client applications call the Public API and viewers/admins use the UI over HTTP(S); the DSS Core runs the Orchestrator and Job Kernels and holds Configuration, Logs, and Data; it reaches Hadoop (through the Hadoop client/proxy) and external services over HTTP(S), LDAP, JDBC, RMI, and RPC.

Batch production environment:
- 64-128 GB of RAM (if no Hadoop/Spark cluster)
- 8 to 32 cores (if no Hadoop/Spark cluster)
- 128-256 GB of disk for DSS (+ storage for your data if stored locally)
Diagram: on the API (scoring) node, client applications call the Public API over HTTP(S); the DSS Core serves the deployed models, code, data, and configuration, and reaches external systems over HTTP(S), JDBC, RMI, and RPC.

Real-time production environment:
- 4-8 GB of RAM (per instance)
- 2 to 4 cores (per instance)
- 32 GB of disk (per instance)
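From the client side, a real-time scoring call to such a node can look like the hedged sketch below, using the dataikuapi APINodeClient; the node URL, service id, endpoint id, and feature names are all placeholders.

    # Hypothetical sketch: query a prediction endpoint exposed by a DSS API (scoring) node.
    # Node URL, service id, endpoint id and feature names are placeholders.
    import dataikuapi

    client = dataikuapi.APINodeClient("https://scoring-node.example.com:12000", "churn_service")

    features = {"age": 42, "plan": "premium", "monthly_usage_gb": 17.5}
    result = client.predict_record("churn_endpoint", features)

    print("Prediction result:", result)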
Supporting Installations
- Data sources: JDBC entry point; network connectivity
- Hadoop: ports + workers required by the specific distribution; network connectivity
- Spark: executors + a callback (two-way connection) to DSS
Privileged Ports
DSS itself cannot run on ports 80 or 443: it does not run as root, so it cannot bind to these privileged ports.
The recommended setup to make DSS available on ports 80 or 443 is to run a reverse proxy (nginx or Apache) on the same machine, forwarding traffic from ports 80/443 to the DSS port.
(https://doc.dataiku.com/dss/latest/installation/proxies.html)
It is highly recommended to reserve at least 100 GB of space for the data directory.
The backend is a single point of failure; if it goes down, it does not go down alone. It is therefore supposed to handle as little actual processing as possible. The backend can spawn child processes: custom scenario steps/triggers, Scala validation, the API node DevServer, macros, etc.
DSS Components and Processes
Main server-side processes include the backend, the IPython (Jupyter) notebook server, and webapp backends.
With a single service user:
• Data access is partitioned by connection security rules
• Project access is partitioned by project security rules
• User groups are split, with specific permissions for executing code
• All code is executed with the DSS service account
• Therefore: in this model, a hostile coder can write arbitrary code which runs as the DSS service account

With end users (per-user impersonation):
• In addition to the previous capabilities, it is now possible to prevent hostile code from overriding access rules
• The identity of the logged-in DSS user is propagated for the execution of local and remote (Hadoop and container) code
• This allows traceability and reliance on the internal access controls of the clusters
• It also permits more per-user resource control
Execution of Hadoop and Spark code: single service user vs end users
- Dataiku objects: in both modes, Dataiku enforces security per project and per connection.
- DSS host (Python, R, bash, …): single service user: all users' actions run using a service account; end users: each user runs processes using her/his own account.
- Hadoop (Spark, Impala, Hive, …): single service user: a single service account authenticated with Kerberos; end users: each user launches Hadoop jobs as her/his own account using Kerberos.
Diagram: individual DSS users (U1, U2, U3) and groups are mapped to their own identities on Hadoop, so cluster-side audit tools can trace activity per user.
Production (Automation/API nodes, operated by the IT team, serving end users):
1. Run the project in production
2. Monitor performance and results

Pre-production (Automation/API nodes, operated by the IT team):
1. Validate the project's automation
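One way to move a project from the design environment into pre-production and then production is through project bundles; the rough outline below uses the dataikuapi client with placeholder hosts, keys, project key, and bundle id, and the bundle method names are quoted from memory, so check them against the client's API reference.

    # Rough outline of bundle-based promotion (method names quoted from memory;
    # hosts, API keys, project key and bundle id are placeholders).
    import dataikuapi

    # On the design node: export a versioned bundle of the project.
    design = dataikuapi.DSSClient("https://design-node.example.com:11200", "DESIGN_API_KEY")
    project = design.get_project("MY_PROJECT")
    project.export_bundle("v12")
    project.download_exported_bundle_archive_to_file("v12", "/tmp/MY_PROJECT-v12.zip")

    # On the (pre-)production automation node: import and activate the bundle.
    automation = dataikuapi.DSSClient("https://automation-node.example.com:11200", "AUTOMATION_API_KEY")
    target = automation.get_project("MY_PROJECT")
    target.import_bundle_from_archive("/tmp/MY_PROJECT-v12.zip")
    target.activate_bundle("v12")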
Permissions
Allow groups to use the code env and define their level of use, e.g. use only, or can manage/update.
Container Exec
Build Docker images that include the libraries of your code env; build for specific container configs or for all configs.
Logs
Review any errors from the code env installation.
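For administrators who want to audit which code envs exist on a node, a small hedged sketch with the dataikuapi client is shown below; the host and API key are placeholders, and the dictionary keys are assumptions about the shape of the returned entries.

    # Hypothetical sketch: list the code environments defined on a DSS node.
    # Host, API key and the dictionary keys used below are assumptions.
    import dataikuapi

    client = dataikuapi.DSSClient("https://design-node.example.com:11200", "ADMIN_API_KEY")

    for env in client.list_code_envs():
        # Each entry describes one code env (language, name, ...).
        print(env.get("envLang"), env.get("envName"))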