DSS Technical Architecture
In Production
"Data prep is taking way too much time away from developing advanced analytics that drive our business"
"We are using multiple platforms and tools for data prep, model development and deployment"
Operationalization
Orchestrate data workflows in production: advanced workflow automation, performance monitoring, …
Diagram: automation environment, operated by system and database administrators; operational data (log files, customer touch points, …); large-scale data storage & processing systems (Big Data / distributed: HDFS; analytical databases / DWH: MPP; in-memory processing); dev. or prod. environments.
A FEW NOTES:
- DSS is installed on your own infrastructure (whether your own data center or in the cloud).
- DSS does not ingest the data; instead, it connects to your infrastructure and pushes the calculations down to it to avoid data movement. But…
- … DSS can also do local processing, including in Python or R, so the hosting server may need enough memory/CPU (see the sketch below).
- When integrated with a Hadoop cluster, DSS usually sits on an edge node / gateway node.
Diagram: DSS reads/writes/executes against your data storage & processing systems via Hadoop client libs & configuration, JDBC connections, or custom Python connectors; targets include Hadoop / HDFS & distributed processing, analytical SQL DBs / DWH, operational SQL DBs, and APIs, 3rd-party apps & data, custom data sources.
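To make the local-processing note concrete, here is a minimal sketch of a Python recipe as it would run inside DSS; the dataset and column names are hypothetical, and the point is that get_dataframe() materializes the data as a pandas DataFrame in the DSS host's memory, which is what drives the RAM sizing for local Python/R work.

    # Minimal DSS Python recipe sketch (dataset and column names are hypothetical).
    # get_dataframe() loads the whole input into the DSS host's memory as pandas.
    import dataiku
    import pandas as pd

    raw = dataiku.Dataset("raw_events")           # hypothetical input dataset
    df = raw.get_dataframe()                      # pulled into local RAM

    # Local, in-memory transformation running on the DSS host itself
    df["event_date"] = pd.to_datetime(df["event_ts"]).dt.date
    daily = df.groupby("event_date", as_index=False).size()

    out = dataiku.Dataset("events_prepared")      # hypothetical output dataset
    out.write_with_schema(daily)                  # written back through the DSS connector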
Dedicated environments can be justified by security requirements or by maturity (early stage vs production oriented).
Diagram: an active DSS and a passive DSS instance.
Where Can Processing Occur
- Local server (host file system, remote file system, …)
- In SQL database (Vertica, Greenplum, Redshift, PostgreSQL, …)
- In Hadoop / Spark (AWS EMR, …)
- In Kubernetes & Docker
Automation env: 64-128 GB of RAM (+ 64 GB in pre-prod)
The automation node runs, maintains, and monitors project workflows and models in production. Since most actions are batch jobs, you can spread the activity across the 24 hours of the day and optimize resource consumption. You can also use a non-production automation node to validate your project before going to production (see the API-driven trigger sketched below).

Scoring env: 4+ GB per node, deployed as a fleet of n nodes
Scoring nodes are real-time production nodes for scoring or labeling with prediction models. A single node does not require a lot of memory, but these nodes are generally deployed on dedicated clusters of containers or virtual machines.
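A common way to drive this kind of batch activity is through the DSS public API; the sketch below uses the dataikuapi Python client with a placeholder host, API key, project key, and scenario id, and assumes the scenario object's run_and_wait() helper.

    # Hypothetical sketch: trigger a scenario on an automation node via the public API.
    # Host URL, API key, project key and scenario id are placeholders.
    import dataikuapi

    client = dataikuapi.DSSClient("https://automation-node.example.com:11200", "MY_API_KEY")
    scenario = client.get_project("MY_PROJECT").get_scenario("rebuild_datasets")

    # Blocks until the scenario run finishes; inspect the returned run object as needed.
    run = scenario.run_and_wait()
    print("Scenario run finished:", run)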
Memory usage on the DSS server side can be controlled at the Unix level using Linux cgroups.
Database resource management can be done on the DB side at the user level when per-user credentials mode is activated.
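As a generic illustration of the cgroups mechanism (not DSS's own cgroup integration, which is configured within DSS itself), the sketch below creates a cgroup v2 group with a memory cap and places a process in it; it assumes a cgroup v2 unified hierarchy mounted at /sys/fs/cgroup and root privileges.

    # Generic Linux cgroup v2 illustration (not the DSS-specific integration).
    # Assumes a cgroup v2 unified hierarchy at /sys/fs/cgroup and root privileges.
    import os
    from pathlib import Path

    group = Path("/sys/fs/cgroup/dss-demo")        # hypothetical group name
    group.mkdir(exist_ok=True)

    # Cap the memory usage of everything placed in this group at 64 GiB.
    (group / "memory.max").write_text(str(64 * 1024**3))

    # Move a process (here: the current one) into the group.
    (group / "cgroup.procs").write_text(str(os.getpid()))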
Single Server Installation
Diagrams: three data-access patterns for a single-server Design/Automation node.
- Managed file system datasets: users reach the DSS Core from their browser over HTTP(S); jobs read and write the datasets by streaming the data through DSS.
- SQL datasets: jobs are submitted to the database over JDBC, so the processing runs inside the database.
- Hadoop/Spark datasets: jobs are submitted over RPC to the cluster manager; workers and executors process the HDFS datasets directly.
*Fast Path connections are available where provided (e.g. Redshift + S3, Spark + Cloud, Spark + HDFS, Teradata + Hadoop, …). Otherwise, data is streamed through DSS using abstracted representations to ensure compatibility across natively compatible storage systems.
Diagram: inside the DSS node, the DSS Core exposes the Public API and the UI to users over HTTP(S), spawns Job Kernels and Notebook Kernels, and manages Configuration, Logs, and Data; external services are reached over HTTP(S), LDAP, JDBC, RMI, and RPC.
Main design environment with multiple concurrent users:
● RAM (if no Hadoop/Spark cluster)
○ small: 2-3 people on <5 GB of data: 32-64 GB
○ medium: 2-3 people with heavy machine learning: 64-128 GB
○ large: 5+ people on large data, with ML: 256+ GB
● 8 to 32 cores (if no Hadoop/Spark cluster; 1 core per simultaneous active user)
● Config: 128-256 GB of disk for DSS
● Storage for your data if stored locally: ~10x the size of the raw data
● SSD recommended
DSS Automation
Diagram: on the automation node, client applications call the Public API and viewers/admins use the UI over HTTP(S); the DSS Core runs the Orchestrator and Job Kernels and holds Configuration, Logs, and Data; it reaches Hadoop (through the Hadoop client/proxy) and external services over HTTP(S), LDAP, JDBC, RMI, and RPC.

Batch production environment:
- 64-128 GB of RAM (if no Hadoop/Spark cluster)
- 8 to 32 cores (if no Hadoop/Spark cluster)
- 128-256 GB of disk for DSS (+ storage for your data if stored locally)
Diagram: on the API (scoring) node, client applications call the Public API over HTTP(S); the DSS Core serves the deployed models, code, data, and configuration, and reaches external systems over HTTP(S), JDBC, RMI, and RPC.

Real-time production environment:
- 4-8 GB of RAM (per instance)
- 2 to 4 cores (per instance)
- 32 GB of disk (per instance)
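From the client side, a real-time scoring call to such a node can look like the hedged sketch below, using the dataikuapi APINodeClient; the node URL, service id, endpoint id, and feature names are all placeholders.

    # Hypothetical sketch: query a prediction endpoint exposed by a DSS API (scoring) node.
    # Node URL, service id, endpoint id and feature names are placeholders.
    import dataikuapi

    client = dataikuapi.APINodeClient("https://scoring-node.example.com:12000", "churn_service")

    features = {"age": 42, "plan": "premium", "monthly_usage_gb": 17.5}
    result = client.predict_record("churn_endpoint", features)

    print("Prediction result:", result)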
Supporting Installations
- Data sources: JDBC entry point; network connectivity
- Hadoop: ports + workers required by the specific distribution; network connectivity
- Spark: executors + a callback (two-way connection) to DSS
Privileged Ports
DSS itself cannot run on ports 80 or 443: it does not run as root, so it cannot bind to these privileged ports.
The recommended setup to make DSS available on ports 80 or 443 is to run a reverse proxy (nginx or Apache) on the same machine, forwarding traffic from ports 80/443 to the DSS port.
(https://doc.dataiku.com/dss/latest/installation/proxies.html)
It is highly recommended to reserve at least 100 GB of space for the data directory.
The backend is a single point of failure; if it goes down, it does not go down alone. It is therefore supposed to handle as little actual processing as possible. The backend can spawn child processes: custom scenario steps/triggers, Scala validation, the API node DevServer, macros, etc.
DSS Components and Processes
Main server-side processes include the backend, the IPython (Jupyter) notebook server, and webapp backends.
With a single service user:
• Data access is partitioned by connection security rules
• Project access is partitioned by project security rules
• User groups are split, with specific permissions for executing code
• All code is executed with the DSS service account
• Therefore: in this model, a hostile coder can write arbitrary code which runs as the DSS service account

With end users (per-user impersonation):
• In addition to the previous capabilities, it is now possible to prevent hostile code from overriding access rules
• The identity of the logged-in DSS user is propagated for the execution of local and remote (Hadoop and container) code
• This allows traceability and reliance on the internal access controls of the clusters
• It also permits more per-user resource control
Execution of Hadoop and Spark code: single service user vs end users
- Dataiku objects: in both modes, Dataiku enforces security per project and per connection.
- DSS host (Python, R, bash, …): single service user: all users' actions run using a service account; end users: each user runs processes using her/his own account.
- Hadoop (Spark, Impala, Hive, …): single service user: a single service account authenticated with Kerberos; end users: each user launches Hadoop jobs as her/his own account using Kerberos.
Diagram: individual DSS users (U1, U2, U3) and groups are mapped to their own identities on Hadoop, so cluster-side audit tools can trace activity per user.
Production (Automation/API nodes, operated by the IT team, serving end users):
1. Run the project in production
2. Monitor performance and results

Pre-production (Automation/API nodes, operated by the IT team):
1. Validate the project's automation
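One way to move a project from the design environment into pre-production and then production is through project bundles; the rough outline below uses the dataikuapi client with placeholder hosts, keys, project key, and bundle id, and the bundle method names are quoted from memory, so check them against the client's API reference.

    # Rough outline of bundle-based promotion (method names quoted from memory;
    # hosts, API keys, project key and bundle id are placeholders).
    import dataikuapi

    # On the design node: export a versioned bundle of the project.
    design = dataikuapi.DSSClient("https://design-node.example.com:11200", "DESIGN_API_KEY")
    project = design.get_project("MY_PROJECT")
    project.export_bundle("v12")
    project.download_exported_bundle_archive_to_file("v12", "/tmp/MY_PROJECT-v12.zip")

    # On the (pre-)production automation node: import and activate the bundle.
    automation = dataikuapi.DSSClient("https://automation-node.example.com:11200", "AUTOMATION_API_KEY")
    target = automation.get_project("MY_PROJECT")
    target.import_bundle_from_archive("/tmp/MY_PROJECT-v12.zip")
    target.activate_bundle("v12")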
Permissions
Allow groups to use the code env and define their level of use, e.g. use only, or can manage/update.
Container Exec
Build Docker images that include the libraries of your code env; build for specific container configs or for all configs.
Logs
Review any errors from the code env installation.
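For administrators who want to audit which code envs exist on a node, a small hedged sketch with the dataikuapi client is shown below; the host and API key are placeholders, and the dictionary keys are assumptions about the shape of the returned entries.

    # Hypothetical sketch: list the code environments defined on a DSS node.
    # Host, API key and the dictionary keys used below are assumptions.
    import dataikuapi

    client = dataikuapi.DSSClient("https://design-node.example.com:11200", "ADMIN_API_KEY")

    for env in client.list_code_envs():
        # Each entry describes one code env (language, name, ...).
        print(env.get("envLang"), env.get("envName"))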