The document discusses LinkedIn's OpenFabric project which aims to simplify their data center network architecture. Some key points:
- The architecture is based on the needs of LinkedIn's applications which require high intra- and inter-DC bandwidth.
- It uses a simplified design with single switch hardware and software SKUs, no overlays or LAGs, and moves complexity from switches to software.
- The control plane was redesigned for simplicity with a custom routing protocol instead of BGP to provide full topology visibility without configuration.
- The goal is to treat the entire fabric as code and enable applications to directly interact with and control the infrastructure. Telemetry and machine learning are used for monitoring.
2. LinkedIn Infrastructure
• Infrastructure architecture based on application’s behavior & requirements
• Pre-planned static topology
• Single operator
• Single tenant with many applications
• As opposed to multi-tenant with different (or unknown) needs
• 34% annual infrastructure growth and over half a billion users
3. Traffic Demands
• High intra- and inter-DC bandwidth demand due to organic growth
• Every single byte of member activity creates thousands of bytes of east-west traffic inside the data center:
• Application Call Graph
• Metrics, Analytics and Tracking via Kafka
• Hadoop and Offline Jobs
• Machine Learning
• Data Replications
• Search and Indexing
• Ads, recruiting solutions, etc.
4. The Need to Access the Code
BGP Requirements on the switch software:
FlowSpec
Tweak/Remove ASN in AS-Path
Telemetry, etc.
We had to change our architecture to fit the product that works for most…
5. R. White, S. Zandi - IETF Journal March 2017
“Ownership to control your own future… specifically, owning your
architecture means the ability to intertwine your network and your
business in a way that leads to competitive advantage”
6. Own End to End To Enable and Control
Edge Network to Eyeballs (EdgeConnect)
Backbone Network (Falco)
Data Center Network (Open19 + Falco + OpenFabric)
Bare Metal HW (Open19)
OS / Kernel (Linux)
Container (LPS)
Application
End-to-end control enables us to move problems and/or complexities to the code, OS, network, client software, or solve them by architecture…
7. Core Design Principles
• Simplicity: “perfection has been reached not when there is nothing left to
add, but when there is nothing left to take away.”
• Openness: Use community-based tools where possible.
• Independence: Refuse to develop a dependence on a single vendor or
vendor-driven architecture (and hence avoid the inevitable forklift
upgrades)
• Programmability: Being able to modify the behavior of the data center
fabric in near real time, without touching devices…
8. The Building Block
Linux on Merchant Silicon (ODM/OEM): Project Falco
Rather than big chassis switches:
• Designed around robustness (NSR, ISSU, etc.)
• Feature-rich but mostly irrelevant to LinkedIn's needs (FCoE, VXLAN, EVPN, MCLAG, etc.)
10. Simplifying Data Center Network
Unified Architecture
Single SKU (hardware and software) for all switches while procuring hardware from multiple
ODM channels (multi-homing)
Minimum Features (L3: IPv4, IPv6 Routing)
No Overlay - For the infrastructure, the application is stateless
No LAG (Link Aggregation)
No Middle-box (Firewall, Load-balancer, etc.) moved to application
The network is only a set of intermediate boxes running Linux
12. Present: Altair Design
[Diagram: 64 pods, each with 32 ToRs and 4 leaves, interconnected by four spine planes of 32 spines each]
True 5 Stage Clos Architecture (Maximum Path Length: 5 Chipsets to Minimize Latency)
Moved complexity from big boxes to our advantage, where we can manage and control!
Single SKU - Same Chipset - Uniform IO design (Bandwidth, Latency and Buffering)
Dedicated control plane, OAM and CPU for each ASIC
13. Non-Blocking Parallel Fabrics
[Diagram: servers attach to ToRs; each ToR uplinks into four parallel fabrics (Fabric 1-4)]
21. Tier 1
ToR - Top of the Rack
Broadcom Tomahawk 32x 100G
10/25/50/100G Attachment
Regular Server Attachment: 10G
Each Cabinet: 96 Dense Compute Units
Half Cabinet (Leaf-Zone): 48x 10G ports for servers + 4 uplinks of 50G
Full Cabinet: 2x Single-ToR Zones: 48 + 48 = 96 Servers
[Diagram: Project Falco topology: Server, ToR, Leaf, and Spine tiers]
22. Tier 2
Leaf
Broadcom Tomahawk 32x 100G
Non-Blocking Topology:
32x downlinks of 50G to serve 32 ToR
32x uplinks of 50G to provide 1:1 Over-subscription
23. Tier 3
Spine
Broadcom Tomahawk 32x 100G
Non-Blocking Topology:
64x downlinks to provide 1:1 over-subscription
Serves 64 pods (32 ToRs per pod)
~100,000 servers total; each pod holds approximately 1,550 compute nodes
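To make the tier arithmetic of slides 21-23 concrete, here is a minimal sizing sketch in Python. It only restates the port and server counts given above; the constant names are illustrative and not part of any LinkedIn tooling.

```python
# Back-of-the-envelope sizing for the Altair fabric described on slides 21-23.
# All figures restate the slides; names are illustrative only.

TOMAHAWK_PORTS_100G = 32          # single SKU: Broadcom Tomahawk, 32x 100G

# Tier 1: ToR (half-cabinet leaf-zone)
SERVERS_PER_TOR = 48              # 48x 10G server-facing ports
TOR_UPLINKS_50G = 4               # 4x 50G uplinks, one per fabric plane

# Tier 2: Leaf (per pod, per fabric plane)
TORS_PER_LEAF = 32                # 32x 50G downlinks
LEAF_UPLINKS_50G = 32             # 32x 50G uplinks -> 1:1 (non-blocking)

# Tier 3: Spine
PODS = 64                         # 64 downlinks, one per pod
TORS_PER_POD = 32

servers_per_pod = TORS_PER_POD * SERVERS_PER_TOR
total_servers = PODS * servers_per_pod

tor_down_gbps = SERVERS_PER_TOR * 10
tor_up_gbps = TOR_UPLINKS_50G * 50
leaf_down_gbps = TORS_PER_LEAF * 50
leaf_up_gbps = LEAF_UPLINKS_50G * 50

print(f"servers per pod : {servers_per_pod}")    # 1536 (~1,550 on the slide)
print(f"total servers   : {total_servers}")      # 98304 (~100,000 on the slide)
print(f"ToR  over-subscription: {tor_down_gbps / tor_up_gbps:.1f}:1")   # 480G/200G = 2.4:1
print(f"Leaf over-subscription: {leaf_down_gbps / leaf_up_gbps:.1f}:1") # 1600G/1600G = 1.0:1
```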
33. [Diagram: the four parallel fabrics with ToRs and servers, as on slide 13]
Oversimplified!
34. OpenFabric Project
Devices treated as a standard host
Configurations are eliminated, not automated
Once installed, no changes to configuration
Just upgrades, as with any other server
to manage the fabric as one thing
35. OpenFabric Project
Self-Defined Programmable Data Center
Distributed Routing Protocol (v4+v6)
SRv6 to enable end-to-end control
Centralized Policies: Controller Based Traffic Optimizer
Enables Self-Healing Network
36. SRv6 in DC makes sense!
Same Forwarding Plane
The Internet Protocol is responsible for host-to-host (end-to-end) delivery
Does not require an MPLS stack on hosts and middle boxes!
Merchant Silicon Support
Currently working with Microsoft team for SAI adoption
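As a rough illustration of how the same IPv6 forwarding plane gives end-to-end control, the sketch below models a headend pushing an SRv6 segment list and each segment endpoint applying the basic "End" behavior (decrement Segments Left, copy the next segment into the destination address). This is a simplified model with assumed addresses, not LinkedIn's implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SRv6Packet:
    """Simplified IPv6 packet with a segment routing header (SRH)."""
    dst: str                                             # current IPv6 destination address
    segments: List[str] = field(default_factory=list)    # SRH segment list (reverse order on the wire)
    segments_left: int = 0

def encapsulate(path: List[str], inner_dst: str) -> SRv6Packet:
    """Headend behavior: push the engineered path as a segment list."""
    segs = list(reversed(path + [inner_dst]))   # last segment is the final destination
    return SRv6Packet(dst=path[0], segments=segs, segments_left=len(segs) - 1)

def end_behavior(pkt: SRv6Packet) -> SRv6Packet:
    """SRv6 'End': decrement Segments Left and copy the next segment into the DA."""
    if pkt.segments_left > 0:
        pkt.segments_left -= 1
        pkt.dst = pkt.segments[pkt.segments_left]
    return pkt

# Steer a flow through two specific spines before reaching the server (addresses invented):
pkt = encapsulate(["fc00:0:1::1", "fc00:0:7::1"], "fc00:0:42::1")
while pkt.segments_left > 0:
    pkt = end_behavior(pkt)
    print("forwarding toward", pkt.dst)
```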
37. Distributed Control Plane Complexity
[Diagram: 64 pods of 32 ToRs each and four spine planes (W, X, Y, Z), with every switch individually numbered]
38. Current: BGP in DC
BGP Session over Link-Local Address
Separate ASN per box (eBGP)
RFC5549 - Advertising IPv4 NLRI over IPv6 sessions
Single Link Between Boxes (Multipath AS-Path Relax)
Anycast VIPs /32 & /128
Best Path Selection: Hop Count (Shortest AS-Path)
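A small sketch of how the rules above collapse best-path selection to hop counting: with a distinct ASN per box and multipath as-path relax, every path that ties the shortest AS-path length becomes an ECMP next hop even though the ASNs differ. The ASNs and next-hop addresses below are invented for illustration.

```python
# Illustration of "Best Path Selection: Hop Count (Shortest AS-Path)" with
# multipath as-path relax: all equally short AS-paths are kept for ECMP,
# even though each box advertises its own ASN. Values are made up.

# Candidate eBGP paths for an anycast VIP, as (link-local next hop, AS-path) pairs.
candidates = [
    ("fe80::1", [65101, 65201]),          # via leaf 1 -> spine plane 1
    ("fe80::2", [65102, 65202]),          # via leaf 2 -> spine plane 2
    ("fe80::3", [65103, 65203, 65301]),   # longer detour, loses on hop count
]

def ecmp_set(paths):
    """Keep every path tying the shortest AS-path length (multipath as-path relax)."""
    shortest = min(len(as_path) for _, as_path in paths)
    return [nh for nh, as_path in paths if len(as_path) == shortest]

print(ecmp_set(candidates))   # ['fe80::1', 'fe80::2']
```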
40. Control Plane Requirements
Fast, simple distributed control plane
No tags, bells, or whistles (no hacks, no policy)
Auto discover neighbors, establish adjacency and build RIB
Minimal (to zero) configuration
Must use TLVs for future, backward-compatible extensibility
Must carry MPLS labels (per node/interface)
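Because the requirements call for TLV-based, backward-compatible extensibility, here is a minimal type-length-value framing sketch: receivers skip unknown types, so new attributes (for example a per-node MPLS label) can be added later. The type numbers and field layout are invented for illustration and do not come from any draft.

```python
import struct

# Minimal TLV framing sketch: 1-byte type, 2-byte length, then the value.
# Unknown types are skipped, which is what keeps a TLV protocol backward compatible.

def encode_tlv(t: int, value: bytes) -> bytes:
    return struct.pack("!BH", t, len(value)) + value

def decode_tlvs(buf: bytes):
    offset = 0
    while offset < len(buf):
        t, length = struct.unpack_from("!BH", buf, offset)
        offset += 3
        yield t, buf[offset:offset + length]
        offset += length

# Example: an adjacency advertisement carrying a node ID and a per-node MPLS label.
msg = encode_tlv(1, b"tor-17.pod-4") + encode_tlv(2, (16017).to_bytes(3, "big"))
for t, v in decode_tlvs(msg):
    if t == 1:
        print("node id   :", v.decode())
    elif t == 2:
        print("node label:", int.from_bytes(v, "big"))
    else:
        pass  # unknown TLV: skip it and stay forward compatible
```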
41. BGP in Data Center
• BGP is an inter-domain routing protocol designed to connect different autonomous systems and to provide policy
and control between different routing domains when selecting a best path.
• True: BGP can scale and is extensible. BGP has many policy knobs.
• A data center fabric is operated under a single administrative domain, not as a series
of individual routers with different policies and decision processes.
42. Control Plane: Routing Options
BGP:
• Heavy weight; lots of features and “stuff” that are not needed
• Modifications required to support single-IP configuration
• Does not supply a full topology view
• Proven scaling
IS-IS:
• Not proven to scale in this environment
• Light weight
• Most requirements for zero configuration are already met
• Provides a full topology view
Build New:
• A lot of work
• But could use bits and pieces from other places
43. IETF Drafts
BGP-based SPF
• Shortest Path Routing Extensions for BGP Protocol - draft-keyupate-idr-bgp-spf-02
OpenFabric - Flooding Optimization in a Clos network
• draft-white-openfabric-00
Self-Defined Networks:
BGP Logical Link Discovery Protocol (LLDP) Peer Discovery
• draft-acee-idr-lldp-peer-discovery-00
44. Forwarding Challenges
• ECMP is Blind!
• End to end path selection is required for some applications.
• Application / Operator cannot easily enforce a path...
47. • Forwarding is based on the shortest (and random) path
• We move forward toward a destination regardless of road conditions, traffic jams (congestion), etc.
Reachability ≠ Availability
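A short sketch of why ECMP is "blind": the uplink is chosen by hashing the flow's 5-tuple, so the decision is stable per flow but never looks at congestion, loss, or latency. The hash choice and link names are illustrative.

```python
# Why "ECMP is blind": the uplink is picked by hashing the 5-tuple, so a flow
# always takes the same path regardless of how congested that path is.

import hashlib

UPLINKS = ["fabric-1", "fabric-2", "fabric-3", "fabric-4"]

def ecmp_pick(src_ip, dst_ip, proto, src_port, dst_port):
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return UPLINKS[int.from_bytes(digest[:4], "big") % len(UPLINKS)]

# Nothing in the decision looks at queue depth, loss, or latency; two flows to the
# same destination can both land on an already congested fabric.
print(ecmp_pick("10.0.1.5", "10.0.9.9", "tcp", 33412, 443))
print(ecmp_pick("10.0.1.6", "10.0.9.9", "tcp", 33412, 443))
```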
48. OpenFabric Control Plane
Control Plane (Distribute Reachability):
• Fast, simple distributed control plane
• No tags, bells, or whistles
• Auto discover fabric locality
• Auto discover neighbors
• No configuration required
Centralized Policy:
• Expresses engineered paths
• Expresses filters
• Expresses QoS (where needed)
• Minimal configuration (server information)
NOS:
• Use deployed tools where possible (Kafka)
• “Standard” distribution: same as servers + packages/tools
50. Management Plane: Reducing Protocols
[Diagram: a network element whose NOS runs a Kafka network agent on top of the ASIC drivers, exporting management-plane data (SNMP, Syslog, etc.), system & environmental data, and packet & flow data over Kafka instead of SYSLOG, SNMP, and SFLOW]
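As a hedged sketch of the "Kafka network agent" idea, assuming the commonly used kafka-python client: the switch publishes interface counters to a Kafka topic instead of exporting SYSLOG/SNMP/SFLOW. The broker address, topic name, sampling interval, and counter source are illustrative, not LinkedIn's actual agent.

```python
# Sketch of a switch-side Kafka network agent (assumes the kafka-python client).
# It publishes interface counters to a topic instead of SYSLOG/SNMP/SFLOW.

import json, socket, time
from pathlib import Path
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-broker:9092"],                # illustrative broker
    value_serializer=lambda v: json.dumps(v).encode(),
)

def interface_counters():
    """Read per-interface byte counters from the Linux kernel."""
    for iface in Path("/sys/class/net").iterdir():
        stats = iface / "statistics"
        yield {
            "switch": socket.gethostname(),
            "interface": iface.name,
            "rx_bytes": int((stats / "rx_bytes").read_text()),
            "tx_bytes": int((stats / "tx_bytes").read_text()),
            "ts": time.time(),
        }

while True:
    for sample in interface_counters():
        producer.send("network-telemetry", sample)   # one message per interface
    producer.flush()
    time.sleep(10)
```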
51. Kafka Pub/Sub Pipeline: Record, Process and Replay Network State
[Diagram: network elements run a Kafka agent that publishes management-plane (SNMP, Syslog, etc.), system & environmental, and packet & flow data to a Kafka broker; the monitoring and management system consumes it for machine learning & data processing, alert processing, event correlation, log retention, and a data store]
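On the monitoring side, a matching consumer sketch (again assuming kafka-python): reading the topic from the earliest retained offset is what makes it possible to record and replay network state. Topic, broker, and field names are illustrative.

```python
# Minimal consumer sketch for the monitoring side of the pipeline (kafka-python).
# Reading from the earliest offset is what enables "record and replay network state".

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "network-telemetry",
    bootstrap_servers=["kafka-broker:9092"],
    auto_offset_reset="earliest",           # replay from the start of retention
    value_deserializer=lambda m: json.loads(m.decode()),
)

for record in consumer:
    sample = record.value
    # Feed into alerting, event correlation, or ML feature pipelines here.
    print(sample["switch"], sample["interface"], sample["tx_bytes"])
```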
52. Programmable Data Center
A data center fabric that distributes traffic efficiently and effectively across all available links,
while maintaining the lowest latency and providing the most possible bandwidth to different
applications based on their needs and priorities.
Forwarding traffic based on demands & patterns:
• Application
• Latency
• Loss
• Bandwidth (Throughput)
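As a hedged sketch of what "forwarding based on demands & patterns" could look like in a centralized traffic optimizer: candidate paths are scored against an application's latency, loss, and bandwidth priorities and the best one is selected (for example, to be expressed as an SRv6 segment list). All paths, metrics, and weights below are invented.

```python
# Sketch of demand-aware path selection in a centralized traffic optimizer.
# Candidate paths, telemetry numbers, and application weights are all invented;
# the point is only that the choice depends on the application's priorities.

CANDIDATE_PATHS = {
    ("spine-3",):           {"latency_us": 9,  "loss_pct": 0.00, "free_gbps": 12},
    ("spine-7",):           {"latency_us": 9,  "loss_pct": 0.02, "free_gbps": 38},
    ("spine-7", "leaf-2"):  {"latency_us": 14, "loss_pct": 0.00, "free_gbps": 45},
}

APP_PROFILES = {
    # Higher weight = the application cares more about that dimension.
    "search": {"latency_us": 5.0, "loss_pct": 10.0, "free_gbps": 0.5},
    "hadoop": {"latency_us": 0.5, "loss_pct": 1.0,  "free_gbps": 5.0},
}

def score(metrics, profile):
    # Lower latency/loss is better, more free bandwidth is better.
    return (profile["free_gbps"] * metrics["free_gbps"]
            - profile["latency_us"] * metrics["latency_us"]
            - profile["loss_pct"] * metrics["loss_pct"] * 100)

def best_path(app):
    profile = APP_PROFILES[app]
    return max(CANDIDATE_PATHS, key=lambda p: score(CANDIDATE_PATHS[p], profile))

print("search ->", best_path("search"))   # picks the low-latency, loss-free path
print("hadoop ->", best_path("hadoop"))   # picks the path with the most headroom
```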
54. OpenFabric
• The fabric software/architecture that enables applications to meet and interact with the infrastructure to:
• Discover and Learn
• Provision
• Manage
• Control
• Monitor
55. Project Altair: The Evolution of LinkedIn’s Data Center Network
Project Falco: Decoupling Switching Hardware and Software
Open19: A New Vision for the Data Center