The document discusses LinkedIn's OpenFabric project which aims to simplify their data center network architecture. Some key points:
- The architecture is based on the needs of LinkedIn's applications which require high intra- and inter-DC bandwidth.
- It uses a simplified design with single switch hardware and software SKUs, no overlays or LAGs, and moves complexity from switches to software.
- The control plane was redesigned for simplicity with a custom routing protocol instead of BGP to provide full topology visibility without configuration.
- The goal is to treat the entire fabric as code and enable applications to directly interact with and control the infrastructure. Telemetry and machine learning are used for monitoring.
2. LinkedIn Infrastructure
• Infrastructure architecture based on application’s behavior & requirements
• Pre-planned static topology
• Single operator
• Single tenant with many applications
• As opposed to multi-tenant with different (or unknown) needs
• 34% annual infrastructure growth and over half a billion users
3. Traffic Demands
• High intra- and inter-DC bandwidth demand due to organic growth
• Every single byte of member activity creates thousands of bytes of east-west traffic inside the data center:
• Application Call Graph
• Metrics, Analytics and Tracking via Kafka
• Hadoop and Offline Jobs
• Machine Learning
• Data Replications
• Search and Indexing
• Ads, recruiting solutions, etc.
4. The Need to Access the Code
BGP Requirements on the switch software:
FlowSpec
Tweak/Remove ASN in AS-Path
Telemetry, etc.
We had to change our architecture to fit the product that works for most…
5. R. White, S. Zandi - IETF Journal March 2017
“Ownership to control your own future… specifically, owning your
architecture means the ability to intertwine your network and your
business in a way that leads to competitive advantage”
6. Own End to End To Enable and Control
Edge Network to Eyeballs (EdgeConnect)
Backbone Network (Falco)
Data Center Network (Open19 + Falco + OpenFabric)
Bare Metal HW (Open19)
OS / Kernel (Linux)
Container (LPS)
Application
End-to-end control enables us to move problems and/or complexities to the code, OS, network, client software, or solve them by architecture…
7. Core Design Principles
• Simplicity: “perfection has been reached not when there is nothing left to
add, but when there is nothing left to take away.”
• Openness: Use community-based tools where possible.
• Independence: Refuse to develop a dependence on a single vendor or
vendor-driven architecture (and hence avoid the inevitable forklift
upgrades)
• Programmability: Being able to modify the behavior of the data center
fabric in near real time, without touching devices…
8. The Building Block
Linux on Merchant Silicon (ODM/OEM): Project Falco
Rather than big chassis switches:
• Designed around robustness (NSR, ISSU, etc.)
• Feature-rich but mostly irrelevant to LinkedIn's needs (FCoE, VXLAN, EVPN, MCLAG, etc.)
10. Simplifying Data Center Network
Unified Architecture
Single SKU (hardware and software) for all switches while procuring hardware from multiple
ODM channels (multi-homing)
Minimum Features (L3: IPv4, IPv6 Routing)
No Overlay - For the infrastructure, the application is stateless
No LAG (Link Aggregation)
No Middle-box (Firewall, Load-balancer, etc.) moved to application
The network is only a set of intermediate boxes running Linux
12. Present: Altair Design
[Diagram: 64 pods, each with 32 ToRs and 4 leaves, interconnected by four spine planes of 32 spines each]
True 5 Stage Clos Architecture (Maximum Path Length: 5 Chipsets to Minimize Latency)
Moved complexity from big boxes to our advantage, where we can manage and control!
Single SKU - Same Chipset - Uniform IO design (Bandwidth, Latency and Buffering)
Dedicated control plane, OAM and CPU for each ASIC
13. Non-Blocking Parallel Fabrics
[Diagram: servers attach to ToRs; each ToR uplinks into four parallel fabrics (Fabric 1-4)]
21. Tier 1
ToR - Top of the Rack
Broadcom Tomahawk 32x 100G
10/25/50/100G Attachment
Regular Server Attachment: 10G
Each Cabinet: 96 Dense Compute Units
Half Cabinet (Leaf-Zone): 48x 10G ports for servers + 4 uplinks of 50G
Full Cabinet: 2x Single-ToR Zones: 48 + 48 = 96 Servers
[Diagram: Project Falco topology: Server, ToR, Leaf, and Spine tiers]
22. Tier 2
Leaf
Broadcom Tomahawk 32x 100G
Non-Blocking Topology:
32x downlinks of 50G to serve 32 ToR
32x uplinks of 50G to provide 1:1 Over-subscription
23. Tier 3
Spine
Broadcom Tomahawk 32x 100G
Non-Blocking Topology:
64x downlinks to provide 1:1 over-subscription
Serves 64 pods (32 ToRs per pod)
~100,000 servers total; each pod holds approximately 1,550 compute nodes
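To make the tier arithmetic of slides 21-23 concrete, here is a minimal sizing sketch in Python. It only restates the port and server counts given above; the constant names are illustrative and not part of any LinkedIn tooling.

```python
# Back-of-the-envelope sizing for the Altair fabric described on slides 21-23.
# All figures restate the slides; names are illustrative only.

TOMAHAWK_PORTS_100G = 32          # single SKU: Broadcom Tomahawk, 32x 100G

# Tier 1: ToR (half-cabinet leaf-zone)
SERVERS_PER_TOR = 48              # 48x 10G server-facing ports
TOR_UPLINKS_50G = 4               # 4x 50G uplinks, one per fabric plane

# Tier 2: Leaf (per pod, per fabric plane)
TORS_PER_LEAF = 32                # 32x 50G downlinks
LEAF_UPLINKS_50G = 32             # 32x 50G uplinks -> 1:1 (non-blocking)

# Tier 3: Spine
PODS = 64                         # 64 downlinks, one per pod
TORS_PER_POD = 32

servers_per_pod = TORS_PER_POD * SERVERS_PER_TOR
total_servers = PODS * servers_per_pod

tor_down_gbps = SERVERS_PER_TOR * 10
tor_up_gbps = TOR_UPLINKS_50G * 50
leaf_down_gbps = TORS_PER_LEAF * 50
leaf_up_gbps = LEAF_UPLINKS_50G * 50

print(f"servers per pod : {servers_per_pod}")    # 1536 (~1,550 on the slide)
print(f"total servers   : {total_servers}")      # 98304 (~100,000 on the slide)
print(f"ToR  over-subscription: {tor_down_gbps / tor_up_gbps:.1f}:1")   # 480G/200G = 2.4:1
print(f"Leaf over-subscription: {leaf_down_gbps / leaf_up_gbps:.1f}:1") # 1600G/1600G = 1.0:1
```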
33. [Diagram: the four parallel fabrics with ToRs and servers, as on slide 13]
Oversimplified!
34. OpenFabric Project
Devices treated as a standard host
Configurations are eliminated, not automated
Once installed, no changes to configuration
Just upgrades, as with any other server
to manage the fabric as one thing
35. OpenFabric Project
Self-Defined Programmable Data Center
Distributed Routing Protocol (v4+v6)
SRv6 to enable end-to-end control
Centralized Policies: Controller Based Traffic Optimizer
Enables Self-Healing Network
36. SRv6 in DC makes sense!
Same Forwarding Plane
The Internet Protocol is responsible for host-to-host (end-to-end) delivery
Does not require an MPLS stack on hosts and middle boxes!
Merchant Silicon Support
Currently working with Microsoft team for SAI adoption
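As a rough illustration of how the same IPv6 forwarding plane gives end-to-end control, the sketch below models a headend pushing an SRv6 segment list and each segment endpoint applying the basic "End" behavior (decrement Segments Left, copy the next segment into the destination address). This is a simplified model with assumed addresses, not LinkedIn's implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SRv6Packet:
    """Simplified IPv6 packet with a segment routing header (SRH)."""
    dst: str                                             # current IPv6 destination address
    segments: List[str] = field(default_factory=list)    # SRH segment list (reverse order on the wire)
    segments_left: int = 0

def encapsulate(path: List[str], inner_dst: str) -> SRv6Packet:
    """Headend behavior: push the engineered path as a segment list."""
    segs = list(reversed(path + [inner_dst]))   # last segment is the final destination
    return SRv6Packet(dst=path[0], segments=segs, segments_left=len(segs) - 1)

def end_behavior(pkt: SRv6Packet) -> SRv6Packet:
    """SRv6 'End': decrement Segments Left and copy the next segment into the DA."""
    if pkt.segments_left > 0:
        pkt.segments_left -= 1
        pkt.dst = pkt.segments[pkt.segments_left]
    return pkt

# Steer a flow through two specific spines before reaching the server (addresses invented):
pkt = encapsulate(["fc00:0:1::1", "fc00:0:7::1"], "fc00:0:42::1")
while pkt.segments_left > 0:
    pkt = end_behavior(pkt)
    print("forwarding toward", pkt.dst)
```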
37. Distributed Control Plane Complexity
[Diagram: 64 pods of 32 ToRs each and four spine planes (W, X, Y, Z), with every switch individually numbered]
38. Current: BGP in DC
BGP Session over Link-Local Address
Separate ASN per box (eBGP)
RFC5549 - Advertising IPv4 NLRI over IPv6 sessions
Single Link Between Boxes (Multipath AS-Path Relax)
Anycast VIPs /32 & /128
Best Path Selection: Hop Count (Shortest AS-Path)
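A small sketch of how the rules above collapse best-path selection to hop counting: with a distinct ASN per box and multipath as-path relax, every path that ties the shortest AS-path length becomes an ECMP next hop even though the ASNs differ. The ASNs and next-hop addresses below are invented for illustration.

```python
# Illustration of "Best Path Selection: Hop Count (Shortest AS-Path)" with
# multipath as-path relax: all equally short AS-paths are kept for ECMP,
# even though each box advertises its own ASN. Values are made up.

# Candidate eBGP paths for an anycast VIP, as (link-local next hop, AS-path) pairs.
candidates = [
    ("fe80::1", [65101, 65201]),          # via leaf 1 -> spine plane 1
    ("fe80::2", [65102, 65202]),          # via leaf 2 -> spine plane 2
    ("fe80::3", [65103, 65203, 65301]),   # longer detour, loses on hop count
]

def ecmp_set(paths):
    """Keep every path tying the shortest AS-path length (multipath as-path relax)."""
    shortest = min(len(as_path) for _, as_path in paths)
    return [nh for nh, as_path in paths if len(as_path) == shortest]

print(ecmp_set(candidates))   # ['fe80::1', 'fe80::2']
```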
40. Control Plane Requirements
Fast, simple distributed control plane
No tags, bells, or whistles (no hacks, no policy)
Auto discover neighbors, establish adjacency and build RIB
Minimal (to zero) configuration
Must use TLVs for future, backward-compatible extensibility
Must carry MPLS labels (per node/interface)
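Because the requirements call for TLV-based, backward-compatible extensibility, here is a minimal type-length-value framing sketch: receivers skip unknown types, so new attributes (for example a per-node MPLS label) can be added later. The type numbers and field layout are invented for illustration and do not come from any draft.

```python
import struct

# Minimal TLV framing sketch: 1-byte type, 2-byte length, then the value.
# Unknown types are skipped, which is what keeps a TLV protocol backward compatible.

def encode_tlv(t: int, value: bytes) -> bytes:
    return struct.pack("!BH", t, len(value)) + value

def decode_tlvs(buf: bytes):
    offset = 0
    while offset < len(buf):
        t, length = struct.unpack_from("!BH", buf, offset)
        offset += 3
        yield t, buf[offset:offset + length]
        offset += length

# Example: an adjacency advertisement carrying a node ID and a per-node MPLS label.
msg = encode_tlv(1, b"tor-17.pod-4") + encode_tlv(2, (16017).to_bytes(3, "big"))
for t, v in decode_tlvs(msg):
    if t == 1:
        print("node id   :", v.decode())
    elif t == 2:
        print("node label:", int.from_bytes(v, "big"))
    else:
        pass  # unknown TLV: skip it and stay forward compatible
```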
41. BGP in Data Center
• BGP is an inter-domain routing protocol designed to connect different autonomous systems and to provide policy
and control between different routing domains when selecting a best path.
• True: BGP can scale and is extensible. BGP has many policy knobs.
• A data center fabric is operated under a single administrative domain, not as a series
of individual routers with different policies and decision processes.
42. Control Plane: Routing Options
BGP:
• Heavy weight; lots of features and “stuff” that are not needed
• Modifications required to support single-IP configuration
• Does not supply a full topology view
• Proven scaling
IS-IS:
• Not proven to scale in this environment
• Light weight
• Most requirements for zero configuration are already met
• Provides a full topology view
Build New:
• A lot of work
• But could use bits and pieces from other places
43. IETF Drafts
BGP-based SPF
• Shortest Path Routing Extensions for BGP Protocol - draft-keyupate-idr-bgp-spf-02
OpenFabric - Flooding Optimization in a Clos network
• draft-white-openfabric-00
Self-Defined Networks:
BGP Logical Link Discovery Protocol (LLDP) Peer Discovery
• draft-acee-idr-lldp-peer-discovery-00
44. Forwarding Challenges
• ECMP is Blind!
• End to end path selection is required for some applications.
• Application / Operator cannot easily enforce a path...
47. • Forwarding is based on the shortest (and random) path
• We move forward toward a destination regardless of road conditions, traffic jams (congestion), etc.
Reachability ≠ Availability
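A short sketch of why ECMP is "blind": the uplink is chosen by hashing the flow's 5-tuple, so the decision is stable per flow but never looks at congestion, loss, or latency. The hash choice and link names are illustrative.

```python
# Why "ECMP is blind": the uplink is picked by hashing the 5-tuple, so a flow
# always takes the same path regardless of how congested that path is.

import hashlib

UPLINKS = ["fabric-1", "fabric-2", "fabric-3", "fabric-4"]

def ecmp_pick(src_ip, dst_ip, proto, src_port, dst_port):
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return UPLINKS[int.from_bytes(digest[:4], "big") % len(UPLINKS)]

# Nothing in the decision looks at queue depth, loss, or latency; two flows to the
# same destination can both land on an already congested fabric.
print(ecmp_pick("10.0.1.5", "10.0.9.9", "tcp", 33412, 443))
print(ecmp_pick("10.0.1.6", "10.0.9.9", "tcp", 33412, 443))
```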
48. OpenFabric Control Plane
Control Plane (Distribute Reachability):
• Fast, simple distributed control plane
• No tags, bells, or whistles
• Auto discover fabric locality
• Auto discover neighbors
• No configuration required
Centralized Policy:
• Expresses engineered paths
• Expresses filters
• Expresses QoS (where needed)
• Minimal configuration (server information)
NOS:
• Use deployed tools where possible (Kafka)
• “Standard” distribution: same as servers + packages/tools
50. Management Plane: Reducing Protocols
[Diagram: a network element whose NOS runs a Kafka network agent on top of the ASIC drivers, exporting management-plane data (SNMP, Syslog, etc.), system & environmental data, and packet & flow data over Kafka instead of SYSLOG, SNMP, and SFLOW]
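As a hedged sketch of the "Kafka network agent" idea, assuming the commonly used kafka-python client: the switch publishes interface counters to a Kafka topic instead of exporting SYSLOG/SNMP/SFLOW. The broker address, topic name, sampling interval, and counter source are illustrative, not LinkedIn's actual agent.

```python
# Sketch of a switch-side Kafka network agent (assumes the kafka-python client).
# It publishes interface counters to a topic instead of SYSLOG/SNMP/SFLOW.

import json, socket, time
from pathlib import Path
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-broker:9092"],                # illustrative broker
    value_serializer=lambda v: json.dumps(v).encode(),
)

def interface_counters():
    """Read per-interface byte counters from the Linux kernel."""
    for iface in Path("/sys/class/net").iterdir():
        stats = iface / "statistics"
        yield {
            "switch": socket.gethostname(),
            "interface": iface.name,
            "rx_bytes": int((stats / "rx_bytes").read_text()),
            "tx_bytes": int((stats / "tx_bytes").read_text()),
            "ts": time.time(),
        }

while True:
    for sample in interface_counters():
        producer.send("network-telemetry", sample)   # one message per interface
    producer.flush()
    time.sleep(10)
```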
51. Kafka Pub/Sub Pipeline: Record, Process and Replay Network State
[Diagram: network elements run a Kafka agent that publishes management-plane (SNMP, Syslog, etc.), system & environmental, and packet & flow data to a Kafka broker; the monitoring and management system consumes it for machine learning & data processing, alert processing, event correlation, log retention, and a data store]
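On the monitoring side, a matching consumer sketch (again assuming kafka-python): reading the topic from the earliest retained offset is what makes it possible to record and replay network state. Topic, broker, and field names are illustrative.

```python
# Minimal consumer sketch for the monitoring side of the pipeline (kafka-python).
# Reading from the earliest offset is what enables "record and replay network state".

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "network-telemetry",
    bootstrap_servers=["kafka-broker:9092"],
    auto_offset_reset="earliest",           # replay from the start of retention
    value_deserializer=lambda m: json.loads(m.decode()),
)

for record in consumer:
    sample = record.value
    # Feed into alerting, event correlation, or ML feature pipelines here.
    print(sample["switch"], sample["interface"], sample["tx_bytes"])
```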
52. Programmable Data Center
A data center fabric that distributes traffic efficiently and effectively across all available links,
while maintaining the lowest latency and providing the most possible bandwidth to different
applications based on their needs and priorities.
Forwarding traffic based on demands & patterns:
• Application
• Latency
• Loss
• Bandwidth (Throughput)
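As a hedged sketch of what "forwarding based on demands & patterns" could look like in a centralized traffic optimizer: candidate paths are scored against an application's latency, loss, and bandwidth priorities and the best one is selected (for example, to be expressed as an SRv6 segment list). All paths, metrics, and weights below are invented.

```python
# Sketch of demand-aware path selection in a centralized traffic optimizer.
# Candidate paths, telemetry numbers, and application weights are all invented;
# the point is only that the choice depends on the application's priorities.

CANDIDATE_PATHS = {
    ("spine-3",):           {"latency_us": 9,  "loss_pct": 0.00, "free_gbps": 12},
    ("spine-7",):           {"latency_us": 9,  "loss_pct": 0.02, "free_gbps": 38},
    ("spine-7", "leaf-2"):  {"latency_us": 14, "loss_pct": 0.00, "free_gbps": 45},
}

APP_PROFILES = {
    # Higher weight = the application cares more about that dimension.
    "search": {"latency_us": 5.0, "loss_pct": 10.0, "free_gbps": 0.5},
    "hadoop": {"latency_us": 0.5, "loss_pct": 1.0,  "free_gbps": 5.0},
}

def score(metrics, profile):
    # Lower latency/loss is better, more free bandwidth is better.
    return (profile["free_gbps"] * metrics["free_gbps"]
            - profile["latency_us"] * metrics["latency_us"]
            - profile["loss_pct"] * metrics["loss_pct"] * 100)

def best_path(app):
    profile = APP_PROFILES[app]
    return max(CANDIDATE_PATHS, key=lambda p: score(CANDIDATE_PATHS[p], profile))

print("search ->", best_path("search"))   # picks the low-latency, loss-free path
print("hadoop ->", best_path("hadoop"))   # picks the path with the most headroom
```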
54. OpenFabric
• The fabric software/architecture that enables applications to meet and interact with the infrastructure to:
• Discover and Learn
• Provision
• Manage
• Control
• Monitor
55. Project Altair: The Evolution of LinkedIn’s Data Center Network
Project Falco: Decoupling Switching Hardware and Software
Open19: A New Vision for the Data Center