Material for the DP-900 exam

DP-900

Azure Data Fundamentals: Explore core data concepts

  • Data types: structured, semi-structured, or unstructured.

  • Data stores: file stores, databases

  • File formats:

    • Delimited text files
    • JavaScript Object Notation (JSON)
    • Extensible Markup Language (XML)
  • Optimized File Format

    • Avro: a row-based format, created by Apache.
    • ORC (Optimized Row Columnar): organizes data into columns; developed by Hortonworks for Apache Hive.
    • Parquet: another columnar format, created by Cloudera and Twitter; very efficient compression and encoding schemes.
  • Databases
    • Relational databases
      • tables that represent entities
      • entity is assigned a primary key that uniquely identifies it
      • keys + normalization = remove duplicate data
      • use SQL
    • Non-relational databases
      • Key-value databases
      • Document databases (JSON)
      • Column family databases (key, column families)
      • Graph databases
  • Transactional data processing

    • records specific events that the organization wants to track
    • typically involves money / goods / services
    • The work performed by transactional systems is Online Transaction Processing (OLTP).
    • DB is optimized for both read and write operations
    • CRUD operations
    • OLTP systems should enforce ACID: Atomicity (the entire transaction happens at once or not at all), Consistency, Isolation, Durability (changes are persisted); see the sketch below
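    • A minimal T-SQL sketch of an atomic transaction (the Account table and its columns are illustrative, not from the exam material):

        -- A money transfer: both updates succeed together or not at all (atomicity);
        -- the committed result is isolated from other sessions and durable.
        -- Assumes an Account(AccountID, Balance) table already exists.
        BEGIN TRANSACTION;
            UPDATE Account SET Balance = Balance - 100 WHERE AccountID = 1;
            UPDATE Account SET Balance = Balance + 100 WHERE AccountID = 2;
        COMMIT TRANSACTION;
        -- On an error, issue ROLLBACK TRANSACTION instead of COMMIT.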
  • Analytical data processing

    • read-mostly
    • flow: files / OLTP > data lake > ETL > warehouse-OLAP (facts and dimensions) > analytics
  • Data lakes

    • store for file-based data that must be collected and analyzed.
  • Data warehouses

    • relational schema
    • read operations
    • Uses denormalization for performance (star schema)
  • Key job roles

    • Database administrators - manage databases, permissions, backups, policies
    • Data engineers - data integration, data cleaning routines, governance rules, and pipelines
    • Data analysts - create visualizations
    • Data scientists - build models

Microsoft Azure Data Fundamentals: Explore relational data in Azure

Explore fundamental relational data concepts

  • collections of real-world entities represented as tables

  • entity = a record of information, i.e., an object or event

  • Relational tables = structured data, i.e., each row in a table has the same columns

  • Normalization

    • Separate each entity into its own TABLE.
    • Separate each discrete attribute into its own COLUMN.
    • Uniquely identify each entity row using a PRIMARY KEY.
    • Use FOREIGN KEY columns to link related entities.
    • Types:
      • 1st Normal Form (1NF): each column holds a single (atomic) value
      • 2nd Normal Form (2NF): remove duplicated data into separate tables linked by foreign keys
      • 3rd Normal Form (3NF): move columns that do not depend on the primary key out of the table (see the example below)
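    • A minimal sketch of two normalized tables (illustrative names, not from the exam material):

        -- Each entity gets its own table, each attribute its own column,
        -- a primary key per row, and a foreign key linking related entities.
        CREATE TABLE Customer
        (
            CustomerID INT PRIMARY KEY,
            FirstName  VARCHAR(50) NOT NULL,
            City       VARCHAR(50)
        );

        CREATE TABLE SalesOrder
        (
            OrderID    INT PRIMARY KEY,
            OrderDate  DATE NOT NULL,
            CustomerID INT NOT NULL REFERENCES Customer(CustomerID)  -- foreign key
        );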
  • SQL (Structured Query Language)

    • used to communicate with a relational database.
  • Types of SQL dialect

    • Transact-SQL (T-SQL): Microsoft SQL Server and Azure SQL services.
    • pgSQL: PostgreSQL.
    • PL/SQL: Oracle.
  • Statement types

    • DDL (Data Definition Language), i.e., CREATE, ALTER, RENAME, DROP (dangerous)
    • DML (Data Manipulation Language), i.e., SELECT, INSERT, UPDATE, DELETE
    • DCL (Data Control Language), i.e., GRANT, DENY, REVOKE
    • TCL (Transaction Control Language), i.e., ROLLBACK (see the examples below)
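    • One illustrative statement per category (the Demo table and SomeUser login are assumptions):

        CREATE TABLE Demo (DemoID INT PRIMARY KEY, Name VARCHAR(50));  -- DDL
        INSERT INTO Demo (DemoID, Name) VALUES (1, 'Bike');            -- DML
        GRANT SELECT ON Demo TO SomeUser;                              -- DCL (SomeUser must exist)
        BEGIN TRANSACTION; DELETE FROM Demo; ROLLBACK TRANSACTION;     -- TCL (undo the delete)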
  • Columns marked as NOT NULL are referred to as mandatory columns

  • Views

    • virtual table

    • Used to save a complex query (often over joined tables) for reuse

        CREATE VIEW Deliveries AS
        SELECT o.OrderDate, c.FirstName
        FROM [Order] AS o JOIN Customer AS c ON o.CustomerID = c.ID;

        SELECT OrderDate, FirstName FROM Deliveries;
      
  • Stored Procedure

    • SQL code that you can save, so the code can be reused

        CREATE PROCEDURE procedure_name2
            @ProductID INT
        AS
        SELECT * FROM SalesLT.Product WHERE ProductID = @ProductID;
        GO

        EXEC procedure_name2 680;
      
  • Index

    • an index is like the index at the back of a book
    • the database management system uses indexes to fetch data quickly
    • the primary key automatically gets an index, hence faster lookups
    • a table can have more than one index
    • an index covers one or more columns
    • when you insert, update, or delete data, the indexes must also be updated (write overhead)
    • 2 types of Indexes:
      • Clustered - Primary Key of a table

        • Table can have only 1 clustered index
        • Data is stored in the order of index
      • Non-clustered - created separately

        • Index is stored separately, with pointers to the data
        • CREATE NONCLUSTERED INDEX index_name ON table (col_name); e.g., CREATE INDEX idx_ProductName ON Product(Name);
  • Unnormalized

    • Data duplication exists
    • Because of duplication, it is hard to change data consistently
    • Not recommended

Relational database services in Azure

  • Azure SQL VM (IaaS) - fully compatible with on-premises SQL Server
  • Azure managed services (PaaS)
    • Azure SQL Database
    • Azure SQL Managed Instance
    • Azure Database for open-source engines, i.e., MySQL, MariaDB, PostgreSQL
  • Azure SQL Edge (IoT)

Azure SQL VM - IaaS

  • a virtual machine with SQL Server installed
  • self-administered (OS, updates, etc.)
  • "lift and shift" migration of existing on-premises workloads
  • full control - memory, CPU, etc.

Azure SQL VM Installation

  • Search for "Virtual Machine" in portal
  • New > Select images (SQL)
  • U can ssh to virtualMachine in Cloudshell or
  • Search box "Azure SQL"
  • Under "SQL virtual machines"> select "Free SQL Server Server 2019 Ubuntu "
  • Subscripption= Free, egion =Free,Image = Free ,Size = Least
  • name = my-sql-server-vm
  • create
  • Save certificate to a folder when asked
  • Passed
  • To connect to server from local : ck on to resources page > settings >click on Connect

Azure SQL Managed Instance

  • PaaS
  • All features of SQL Server plus extra features
  • Recommended when migrating from on-premises
  • Cross-database queries
  • Can send emails
  • SQL Server Agent, e.g., to run jobs
  • Virtual network (only for vCore)
  • Supports analysis and reporting
  • Access to SQL Server Agent, Database Mail
  • Linked servers, Service Broker

Azure SQL Database

  • platform as a service (PaaS)
  • High availability
  • Automatic Updates
  • Autobackups
  • Serverless
  • Flexible
  • Auto encryption (TDE)
  • Authentication
  • Read replica DB can be created (for running reports against the replica)
  • Low cost
  • Isn't fully compatible with on-premises SQL Server
  • types :
    • Single Database
      • Default
      • charged per hour for the resources
      • Serverless
        • shared by databases belonging to other Azure subscribers
        • automatically scales
    • Elastic Pool
      • multiple databases can share the same resources
      • multiple-tenancy
      • For databases with resource requirements that vary over time, and can help you to reduce costs
  • Lab note: server testdeepak, login deepak / test-123; test query: SELECT * FROM SalesLT.Product;

Other DBs (all are almost the same)

  • enable organizations to move to Azure
  • MySQL (Oracle): LAMP apps, i.e., Linux, Apache, MySQL, and PHP
  • MariaDB: a fork of MySQL
  • PostgreSQL (uses pgSQL; supports custom data types and geometric data)
  • Types
    • Single Server: Basic, General Purpose, and Memory Optimized
    • Flexible Server: more control and server customizations, better cost optimization
    • Hyperscale: the database is split across nodes. Data is split into chunks based on the value of a partition key or sharding key
  • Ports:
    • SQL Server = 1433
    • MySQL = 3306
    • PostgreSQL = 5432
  • Security
    • Transparent data encryption (TDE)- encrypting data at rest.
    • Transport Layer Security (TLS) to encrypt data that is transmitted across a network
    • Dynamic data masking (DDM) limits sensitive data exposure by masking it to non-privileged users.
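    • A minimal T-SQL sketch of dynamic data masking (the Customer table and its Email column are assumptions):

        -- Non-privileged users see a masked value such as aXXX@XXXX.com
        ALTER TABLE Customer
        ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

        -- Optionally allow a specific user to see unmasked values
        GRANT UNMASK TO ReportingUser;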

Explore non-relational data in Azure

Explore Azure Storage

  • Azure Blob Storage (binary large objects)
    • store massive amounts of unstructured data as binary large objects, or blobs
    • uses containers
    • can read and write blobs inside a container
    • virtual folders, similar to a file system
    • types : Block , Page , Append
      • Block (100 MB) = Max of 50k Blocks = 4.7 TB
      • Page = Into Pages , Fast Read and Write , Max 8 TB
      • Append = Used for append operation ie.,logs . 4MB - 195 GB
    • 3 access tiers
      • Hot = high performance, for frequently accessed data
      • Cool = lower performance, lower storage cost
      • Archive = lowest storage cost, long time to access data
  • Azure DataLake Storage Gen2
    • newer version of Gen1
    • Advantages = blob storage + cost control of storage tiers + hierarchical namespace + analytics
    • for analytical data lakes
    • Integrates with Azure HDInsight, Azure Databricks, and Azure Synapse Analytics
    • Data Lake Storage Gen2 = Azure Storage + the hierarchical namespace checkbox (under Advanced)
    • Hierarchical namespace = folder operations (rename, delete, copy, etc.)
    • Enable it when creating the storage account, or upgrade an existing Azure Storage account to support Data Lake Gen2
    • After upgrading you can't revert to a flat namespace.
  • Azure Files (File Share)
    • up to 100 TB of data in a single storage account
    • maximum size of a single file is 1 TB
    • up to 2,000 concurrent connections per shared file
    • upload files using the Azure portal or the AzCopy utility
    • Azure File Sync service keeps a local copy synchronized
    • 2 tiers:
      • Standard: hard disk-based hardware in a datacenter
      • Premium: uses solid-state disks, greater throughput
    • 2 file-sharing protocols
      • Server Message Block (SMB): supported across operating systems
      • Network File System (NFS): only some macOS and Linux versions (requires Premium)
  • Azure Table Storage (NoSQL; prefer Cosmos DB instead)
    • Row = Key , Value
    • each row holding the entire data for a logical entity
    • must have a unique key
    • no concept of foreign keys, relationships, stored procedures, views,
    • denormalized
    • The number of fields in each row can be different
    • Partition Key
      • used to group related data

      • rows with the same partition key are stored together

      • partitions are independent and can grow or shrink in size

      • items in the same partition are stored in row key order

      • enables faster searches; range queries fetch a contiguous block of rows

      • Ex

          +------+---------+------------+-----------+------------------+
          | Pkey | RowKey  | TimeStamp  | Property1 | Property2        |
          +------+---------+------------+-----------+------------------+
          | 1    | unique1 | 2022-01-01 | Deepak    | Bangalore, India |
          +------+---------+------------+-----------+------------------+
          | 2    | unique2 | 2022-01-01 | John      |                  |
          +------+---------+------------+-----------+------------------+
          | 2    | unique3 | 2022-01-01 |           | Newyork,India    |
          +------+---------+------------+-----------+------------------+
        

Cosmos DB

  • NoSQL DB
  • Horizontal scaling, high scalability
  • Use APIs to query data
  • No administration
  • Auto scale (no limit)
  • Many features similar to SQL Server
  • Can use SQL-like syntax to query NoSQL data (see the sketch below)
  • Multi-region writes: globally distributed users work against a local replica
  • Creation: Account > DB > Container > Item. Note: a separate account is needed for each API
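  • A minimal sketch of a Core (SQL) API query over JSON items (c is the conventional container alias; the name property is an assumption):

      SELECT c.id, c.name
      FROM c
      WHERE c.name = 'Deepak'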

Cosmos DB APIs

  • Documents :
    • Types :
      • Core (SQL): JSON documents, SQL syntax
      • Mongo (binary JSON - BSON)
    • Ex: Person(name, Address()). In an RDBMS, Address would be a separate table
    • Core SQL :DB > Container > Item
    • Mongo : DB >Collection >Doc
  • Table API
  • Column (Cassandra)
    • rows and columns
    • has one primary key column
    • other columns are grouped into column families
    • Eg: (Pk=1, col2=(name="deepak", ph="123")), (Pk=2, col2=(name="ABC", email="abc@gmail"))
    • compatible with Apache Cassandra
    • not mandatory for every row to have the same columns.
    • KeySpace > Table >Row
    • SQL to Query (SELECT * FROM Employees WHERE ID = 2)
  • Key-value: map or dict
  • Gremlin (graph)
    • used to store complex relationships
    • uses nodes and edges
    • Eg: organization charts
    • Azure Cosmos DB, Gremlin API
    • DB > Graph > Node, Edge

Microsoft Azure Data Fundamentals: Explore data analytics in Azure

Explore fundamentals of modern data warehousing

  • Elements of a modern data warehousing solution
    • Data ingestion and processing - ETL, ELT
    • Analytical data store - data warehouses, file-system based data lakes, and hybrid architectures
    • Analytical data model - cubes, facts and dimensions, star schema
    • Data visualization - comparisons, dashboards, key performance indicators (KPIs), reports
  • Data ingestion pipelines
    • Azure Data Factory or Azure Synapse Analytics
    • one or more activities
    • Activities = data flow that incrementally manipulates the data until an output dataset is produced.
    • Pipelines consist of one or more activities (built-in activities, linked service )
    • Eg : Azure Blob Store linked service to ingest the input dataset, Azure SQL Database to run a stored procedure, Run task on Azure Databricks or Azure HDInsight, or apply custom logic using an Azure Function.
  • Data warehouse
    • a relational database, i.e., data stored in a schema
    • optimized for data analytics rather than transactional workloads
    • denormalized into a schema
    • facts and dimensions
  • Data lake
    • tabular schemas can be applied to semi-structured data files (schema-on-read)
    • stores structured, semi-structured, and even unstructured data
  • Hybrid approaches
    • used with Spark
    • combines data lakes and data warehouses in a lake database or data lakehouse
    • data is stored as files in a data lake > exposed as tables (with a defined structure) > queried with SQL using SQL pools (Azure Synapse Analytics); see the sketch below
    • PolyBase: query external tables
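    • A minimal sketch of querying data lake files with a Synapse serverless SQL pool (the storage URL and file path are placeholders):

        SELECT TOP 10 *
        FROM OPENROWSET(
            BULK 'https://mydatalake.dfs.core.windows.net/files/sales/*.parquet',
            FORMAT = 'PARQUET'
        ) AS sales;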
  • Azure services for analytical stores
    • Azure Synapse Analytics
      • unified data analytics solution
      • single service interface for multiple analytical capabilities
      • Pipelines similar to Azure Data Factory.
      • SQL pools / data warehouse
      • Apache Spark
      • Azure Synapse Data Explorer - data analytics using Kusto Query Language (KQL)
      • Does not support data extraction from multiple sources (use Data Factory)
      • MPP - massively parallel processing across compute nodes
    • Azure Databricks
      • Apache Spark data processing platform with SQL database semantics
      • notebook to query
      • visualize data in web-based interface.
      • Multiple sources
    • Azure HDInsight
      • Apache Spark
      • Apache Hadoop - a distributed system that uses MapReduce jobs
      • Apache HBase - an open-source, large-scale NoSQL database
      • Apache Kafka - a message broker for data stream processing
      • Apache Storm - open-source real-time data processing

Fundamentals of real-time analytics

  • 2 Types:

    • Batch processing
      • multiple data records are collected and stored, then processed in a single operation
      • Eg: a postpaid bill
      • Adv: handles large volumes of data; processing can be scheduled; suits complex operations
      • Disadv: latency; input data needs to be prepared
    • Stream processing
      • source data is constantly monitored and processed in real time
      • Adv: low latency, real time, no preparation of data
      • Disadv: not for high volumes; operates on recent data; simple operations
  • General architecture for stream processing

    • An event generates some data, e.g., a social media post or a log entry
    • The data is captured, e.g., in a folder in a cloud data store or a table in a database; the source may be a "queue"
    • The event data is processed by a perpetual query running over a time window, e.g., count the number of sensor emissions per minute
    • The results are written to an output sink, e.g., a file, a database table, etc.
  • Azure Stream Analytics (PaaS)

    • Ingest data from an input, e.g., an Azure event hub, Azure IoT Hub, or Azure Storage blob container
    • Process the data by using a SQL-like query (see the sketch below)
    • Write the results to Data Lake Storage Gen2, Azure SQL Database, Synapse, Azure Functions, an event hub, Power BI, or others
    • Work is created as a Stream Analytics job
    • Stream Analytics cluster = a dedicated tenant for Stream Analytics jobs
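    • A minimal sketch of a Stream Analytics query that counts events per minute (the input/output aliases and the EventTime field are placeholders defined on the job):

        SELECT System.Timestamp() AS WindowEnd, COUNT(*) AS EventCount
        INTO [sql-output]
        FROM [eventhub-input] TIMESTAMP BY EventTime
        GROUP BY TumblingWindow(minute, 1)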
  • Spark Structured Streaming :

    • develop streaming solutions on Apache Spark-based services: Synapse, Databricks, HDInsight
  • Delta Lake

    • data lake + transactional consistency + schema enforcement
    • supports streaming and batch
    • can be used as a source or a sink
  • Azure Data Explorer:

    • a database and analytics service

    • ingesting and querying batch and streaming data with a time-series element

    • a standalone Azure service

    • Azure Synapse Data Explorer runtime in an Azure Synapse Analytics workspace.

    • Uses Kusto Query Language (KQL)

    • Query telemetry data that includes a timestamp attribute.

    • Example :

        LogEvents
        | where StartTime > datetime(2021-12-31) 
        | where EventType == 'Error'
        | project StartTime, EventType , Message
      

Fundamentals of data visualization

  • Microsoft Power BI

    • data analysts can use to build interactive data visualisations for business users
    • Dashboard - a single page of visuals pinned from reports
    • Reports - a collection of visuals that can span multiple pages
    • Power BI Report Builder = author and publish paginated reports
    • Power BI Desktop - create interactive reports (and pin visuals to dashboards)
    • Power BI service = published (Desktop) reports + dashboards
    • Power BI Desktop
      • import data from a wide range of data sources
      • combine and organize into analytics data model
      • create reports that contain interactive visualizations of the data.
    • Power BI service
      • reports can be published and interacted with by business users
      • basic data modeling using a web browser (limited functionality)
    • Power BI phone app.
    • Users can consume reports, dashboards, and apps in the Power BI service
  • Concepts of data modeling

    • Facts and Dimensions
    • numeric values are called measures; entities are called dimensions
    • Eg: a table containing numeric measures for sales (such as revenue or quantity) and dimensions for products, customers, and time
    • the model forms a multidimensional structure called a cube
    • denormalized star schema (best practice)
    • star schema: a fact table related to one or more dimension tables
    • Dim = person details, Fact = transactions done by the person
    • hierarchies = drill up or drill down to find aggregated values at different levels
    • the Model tab of Power BI Desktop is used to define your analytical model (see the query sketch below)
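    • A minimal sketch of an aggregate query over a star schema (FactSales and DimProduct are illustrative table names):

        SELECT d.Category, SUM(f.Revenue) AS TotalRevenue
        FROM FactSales AS f
        JOIN DimProduct AS d ON f.ProductKey = d.ProductKey
        GROUP BY d.Category;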
  • Visualizations

    • Tables and text: simplest way to communicate data
    • Bar and column: compare numeric values for discrete categories.
    • Line: examine trends, often over time.
    • Pie: visually compare categorized values as proportions of a total.
    • Scatter: compare two numeric measures, identify relationship or correlation
    • Maps: compare values for different geographic areas or locations.
  • Create Reports

    1. Power BI Desktop > Get Data > Web > add URL > OK
    2. Model tab to create the model, i.e., formats, hierarchies
    3. Enable visualizations: File > Options and Settings > Security section > Use Map and Filled Map visuals > OK
    4. The Fields pane contains the model created
  • Visuals that can be pinned to a dashboard (provide no additional data): text boxes, images, videos, streaming data, web content

  • Cannot be pinned in Power BI: interactive reports, datasets, dashboards, Excel (.xlsx) workbooks, SSRS reports

================================== REFERENCES

  • Descriptive - average revenue
  • Diagnostic - why is average revenue low?
  • Predictive - what average revenue will be (e.g., during COVID)
  • Prescriptive - YouTube recommended videos
  • Cognitive - AI, self-driving car

  • SQL Server in a VM (RDBMS) - user installs SQL Server in the VM (applies the SQL image to the VM)
  • SQL Managed Instance - fully managed by Azure
  • Azure SQL Database (PaaS)
  • Azure Database for other DBs (MariaDB, PostgreSQL)

DP-900 | Azure Data Fundamentals Certification (https://www.youtube.com/watch?v=0f9JLKgfFXM)

DataStore Types

  • RDBMS
  • NoSQL
  • Analytical DB
  • Object/Blob/File

Without Cloud

  • Buying infrastructure is expensive
  • Hard to utilize the entire infrastructure efficiently all the time
  • A team is needed to maintain it

With Cloud

  • On-demand provisioning (pay per use)
  • Easier scaling
  • High availability
  • Low latency by adding more data centers (for users in other parts of the world)
  • Maintainability
  • Backups in case of crashes
  • Going global and CDNs
  • Security

Terminology

  • IaaS (Infrastructure as a Service)

    • SQL (image) installed on a VM by the user
    • User is responsible for everything
  • PaaS (Platform as a Service) - recommended

    • Cloud provider manages the infrastructure and platform
    • Eg: Azure SQL DB, Cosmos DB, etc.
    • Create SQL Db
      • Search for SQL Database in the portal
      • Create a new resource group
      • Select everything free
      • Configure DB > Basic or (General Purpose > Serverless)
      • Select locally redundant storage
      • (username/admin) my-paas / Password123
      • Networking > Public (firewall settings)
      • Networking > Add current client IP address, and Allow Azure services and resources to access this server = YES
    • Search for "SQL databases" in the portal, or "Azure SQL" > SQL databases
    • Azure provides a Query Editor
  • SaaS (Software as a Service)

    • Online Excel, Outlook, CRM, Box, etc.
    • Service provider is responsible for the runtime

DataFormats

Structured (RDBMS)

  • tables, rows, and columns
  • Schema
  • Relationship between other tables
  • index on the primary key, for efficient queries
  • constraints, e.g., NOT NULL
  • Used in OLTP (online transaction processing)
    • Eg: online payments
    • Heavy Writes
    • Azure SQL DB , DB for MySQL , DB for Postgresql
  • OLAP (Online Analytical Processing)
    • Eg: ETL, warehouse
    • aggregates data from other DBs
    • Azure Synapse Analytics
  • OLTP vs OLAP
    • OLTP uses row-wise storage; OLAP uses column-wise storage (columns compress well because values share the same type)
    • OLTP data typically cannot be distributed; OLAP (columnar) data can be distributed across multiple nodes (see the sketch below)
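  • A minimal sketch of switching a table to column-wise (analytical) storage in Azure SQL / Synapse (FactSales is an illustrative table with no existing clustered index):

      CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales ON FactSales;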

Semi-structured (JSON, key-value)

  • Flexible schema (schema not enforced, no constraints)
  • Horizontal scaling possible
  • Cosmos DB is used for NoSQL / semi-structured data

Azure Data Studio

  • Create SQL server
    • Free
    • server name: my-sql-database-server-new, credentials azuser / Password123
    • Configure DbServer
      • General Purpose > Serverless
    • locally Redundant
    • Networking (public , all access , add client)
  • Download Azure Data Studio
    • click on New Connection
    • server (server name from the resource dashboard)
    • enter username / password
    • connect

Create new Cosmos SQL :

  • Data Stored as JSON but use SQL to fetch data
  • Create
    • Free tier and locally redundant
    • account name = any unique name (cosmos-sql-udemy)
    • East Asia
    • Select the "limit total account throughput" option
    • Goto resource > Add Container
    • Mandatory =
      • Db id
      • throughput= manual
      • Container id
      • partition key column = "/pk"
    • To run queries, open Data Explorer
      • cosmos-sql-udemy-db > core-sql-container > Items > New Item
      • save
      • Query: SELECT * FROM c WHERE c.name = "deepak"

Create new Cosmos Mongo :

  • use a Mongo filter to pull data
  • a container is called a collection in Cosmos Mongo
  • uses sharding (similar to partitioning)
  • sharding splits data across multiple shards
  • shard key = partition key
  • filter = {"key": value}

Serverless vs provisioned throughput (RU)

  • RU is billed per hour; serverless is billed on usage
  • RU: no autoscale; serverless: autoscale
  • RU: multi-region; serverless: no
  • RU: unlimited data; serverless: 50 GB

Azure Storage (i.e., unstructured data - pictures)

  • Similar to a hard disk / file share
  • Azure Disks: additional storage that acts like a hard disk
  • Azure Files (file storage): file shares
  • Azure Blob (object storage): objects uploaded and accessed via REST API, etc.

Azure Data Factory

  • define and schedule data pipelines
  • integrate your pipelines with other Azure services
  • persist the results in another data store.
  • ETL

Azure Stream Analytics

  • Streaming ETL Engine

Azure Data Explorer

  • a standalone service; also available as a runtime in Azure Synapse Analytics
  • querying of log and telemetry data

Azure Purview

  • map your data and track data lineage across multiple data sources and systems

Microsoft Power BI

  • analytical data modeling and reporting
  • Power BI reports can be created by using the Power BI Desktop application
  • published and delivered through Power BI service and Power BI mobile app

Azure Region :

  • A region is a geographic location that hosts services
  • Region > Zone > Datacenter

Azure Zone:

  • Part of a region
  • A region can have one or many zones at the same or different locations, each with one or more separate data centers

Purchase

    1. vCore
      • Serverless
        • Auto scale
        • Charged only on usage and data stored
      • Provisioned
        • Can be configured
        • Charged based on memory and compute
    2. DTU
      • A bundle (compute and memory)
      • Compute and memory cannot be configured independently

Data Redundancy (replicating data)

  • LRS (Locally Redundant Storage) - 3 synced copies maintained in the same data center
  • ZRS (Zone-Redundant Storage) - 3 synced copies in different availability zones
  • GRS (Geo-Redundant Storage) - LRS + 1 async copy to a secondary region
  • RA-GRS - GRS + read access to the secondary region
  • GZRS (Geo-Zone-Redundant Storage) - ZRS + 1 async copy to a secondary region (most expensive)

Block Storage

  • Hard disk attached to a VM (SSD, HDD, premium SSD)
  • Use Azure Disks
  • Add Resource > Virtual Machines > Add > Disks
  • (Alternate) Resources > Storage > select the storage/disk name created > Data storage > Containers (different from Docker/kubectl containers)
  • Types: managed (by Azure), unmanaged (by the user)

Sources for stream processing

  • Azure Event Hubs: manages queues of event data
  • Azure IoT Hub: Internet-of-Things (IoT) devices
  • Azure Data Lake Store Gen2: a highly scalable storage service for batch / streaming data
  • Apache Kafka: often used with Apache Spark; Azure HDInsight can be used to create a Kafka cluster

Sinks for stream processing

  • Azure Event Hubs
  • Azure Data Lake Store Gen2 or Azure Blob storage
  • Azure SQL Database, Azure Synapse Analytics, or Azure Databricks
  • Microsoft Power BI

Spark on Microsoft Azure

  • Azure Synapse Analytics
  • Azure Databricks
  • Azure HDInsight

Analytics Types

  • Descriptive - what happened; SQL; e.g., average revenue
  • Diagnostic - why it happened; e.g., why is average revenue low?
  • Predictive (ML) - what will happen; e.g., expected average revenue during COVID
  • Prescriptive - use predictive analysis and take the next step; e.g., YouTube recommended videos, autocomplete
  • Cognitive (AI + ML) - e.g., self-driving cars

Big Data

  • 3 Vs = volume, variety, velocity
  • Storage
    • DataWarehouse
      • Processed Data
      • SQL
    • DataLake
      • Raw data (e.g., Google Cloud Storage, Azure Data Lake Storage Gen2)
      • Built on Blob Storage (Gen2)
      • Parquet format
      • Synapse can directly read from Datalake

Azure Specific Services

  • Synapse (End to End Analytics)
    • Data integration + loading the DW + data analytics
    • SQL
    • Spark jobs
  • Data Factory
    • Managed service
    • Serverless
    • ETL
    • Data pipelines
  • Power BI ( Visualization)

Hadoop , Spark , Databricks

  • Hadoop
    • Supports a variety of data
    • HDFS
    • MapReduce / parallelization across nodes
    • Java, Python
    • Hive (query)
    • Spark (in-memory)
  • Databricks
    • Web-based managed service for Spark

Parquet

  • Storage Format
  • Open source
  • Columnar storage
  • High compression
  • Supported by Azure Data Factory, Azure Data Lake, Blob Storage

Hadoop and Spark Services

  • Azure HDInsight (old)
  • Azure Databricks - new (managed service)
    • Read Data from Azure SQL Database, Cosmos,Event Hubs
  • Synapse
  • Datafactory

Databricks

Massive Parallel Processing (MPP)

  • compute spread across multiple nodes
  • Spark ,Synapse

Pipelines

Batch Pipelines

  • Huge volumes of data
  • Processed in groups
  • Triggered by time (e.g., hours) or by number of records
  • High latency

Streaming pipelines

  • Small numbers of records at a time
  • Use Event Hubs
  • Realtime
  • Low Latency

Synapse

  • The target data store itself acts as the transformation engine
  • Synapse Analytics
  • Linked services, control flow, data movement

Synapse Analytics

  • End to End Analytics
  • SQL
  • Spark (Spark jobs)
  • Pipelines
  • Power BI, Cosmos DB, Azure ML, Azure Data Lake, GCP
  • Parquet, CSV, JSON
  • Consumption model: dedicated, serverless
  • PolyBase: run SQL on external data sources (see the sketch below)
  • Data Explorer pools can be used to run near real-time analytics on large volumes of logs
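  • A minimal sketch of a PolyBase-style external table over data lake files (the data source, file format, and path are placeholders that must be created first):

      CREATE EXTERNAL TABLE dbo.SalesExternal
      (
          OrderID INT,
          Revenue DECIMAL(10, 2)
      )
      WITH (
          LOCATION = '/sales/',            -- folder path in the data lake
          DATA_SOURCE = MyDataLakeSource,  -- assumed: created with CREATE EXTERNAL DATA SOURCE
          FILE_FORMAT = ParquetFileFormat  -- assumed: created with CREATE EXTERNAL FILE FORMAT
      );

      SELECT COUNT(*) FROM dbo.SalesExternal;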

Misc Notes

  • ARM template (JSON) - copy/paste configuration of environments

  • Synapse - does not support data extraction from multiple sources (use Data Factory)

  • SQL Managed Instance DB = point-to-site VPN connections and private endpoints

  • Databricks = no streaming; users cannot start/stop the cluster

  • Object store - files / audio / video, e.g., Blob

  • Control node - responsible for interacting with the application

  • Azure Cosmos DB Data Migration Tool - used for migration
