Service Architectures at Scale
Lessons from Google and eBay
Randy Shoup
@randyshoup
linkedin.com/in/randyshoup
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/service-arch-scale-google-ebay
Presented at QCon London
www.qconlondon.com
Architecture Evolution
•  eBay
o  5th generation today
o  Monolithic Perl → Monolithic C++ → Java → microservices
•  Twitter
o  3rd generation today
o  Monolithic Rails → JS / Rails / Scala → microservices
•  Amazon
o  Nth generation today
o  Monolithic C++ → Java / Scala → microservices
Service Architectures at Scale
•  Ecosystem of Services
•  Building a Service
•  Operating a Service
•  Service Anti-Patterns
Ecosystem of Services
•  Hundreds to thousands of independent services
•  Many layers of dependencies, no strict tiers
•  Graph of relationships, not a hierarchy
[Diagram: dependency graph connecting services A, B, C, D, E, F, and G]
Evolution, not Intelligent Design
•  No centralized, top-down design of the system
•  Variation and natural selection
o  Create / extract new services when needed to solve a problem
o  Deprecate services when no longer used
o  Services justify their existence through usage
•  Appearance of clean layering is an emergent property
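To illustrate how layering can emerge from the dependency graph rather than being designed up front, here is a minimal sketch. The service names and edges in `DEPENDS_ON` are hypothetical, and the sketch assumes the graph has no cycles; it simply places each service one level above its deepest dependency.

```python
from functools import lru_cache
from typing import Dict, List

# Hypothetical dependency graph: service -> services it calls.
# Nobody assigned these services to tiers; only the edges are given.
DEPENDS_ON: Dict[str, List[str]] = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": ["D"],
}


@lru_cache(maxsize=None)
def layer(service: str) -> int:
    """Return 0 for a service with no dependencies, else 1 + its deepest dependency.

    Assumes the graph is acyclic; a cycle would recurse without end.
    """
    deps = DEPENDS_ON[service]
    return 0 if not deps else 1 + max(layer(dep) for dep in deps)


def layers() -> Dict[int, List[str]]:
    """Group services by the layer derived from the graph."""
    grouped: Dict[int, List[str]] = {}
    for name in sorted(DEPENDS_ON):
        grouped.setdefault(layer(name), []).append(name)
    return grouped


if __name__ == "__main__":
    for level, names in sorted(layers().items()):
        print(level, names)
```

The layers are computed after the fact from who calls whom, which is the sense in which clean layering is an emergent property.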
Google Service Layering
•  Cloud Datastore: NoSQL service
o  Highly scalable and resilient
o  Strong transactional consistency
o  SQL-like rich query capabilities
•  Megastore: geo-scale structured database
o  Multi-row transactions
o  Synchronous cross-datacenter replication
•  Bigtable: cluster-level structured storage
o  (row, column, timestamp) -> cell contents
•  Colossus: next-generation clustered file system
o  Block distribution and replication
•  Borg: cluster management infrastructure
o  Task scheduling, machine assignment
[Diagram: layered stack, top to bottom: Cloud Datastore, Megastore, Bigtable, Colossus, Borg]
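The Bigtable mapping "(row, column, timestamp) -> cell contents" can be pictured with a toy in-memory sketch. This is only an illustration of the data model; `TinyTable` and its methods are hypothetical names, not Bigtable's API, and nothing here reflects how Bigtable stores or distributes data.

```python
from typing import Dict, Optional, Tuple

# A cell is addressed by (row key, column name, timestamp).
CellKey = Tuple[str, str, int]


class TinyTable:
    """Toy model of a sparse, versioned map from (row, column, timestamp) to bytes."""

    def __init__(self) -> None:
        self._cells: Dict[CellKey, bytes] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        # Writing with a new timestamp adds a version; older versions remain.
        self._cells[(row, column, timestamp)] = value

    def get(self, row: str, column: str, timestamp: int) -> Optional[bytes]:
        return self._cells.get((row, column, timestamp))

    def get_latest(self, row: str, column: str) -> Optional[bytes]:
        # Collect every version of this cell and return the newest one.
        versions = [
            (ts, value)
            for (r, c, ts), value in self._cells.items()
            if r == row and c == column
        ]
        if not versions:
            return None
        return max(versions, key=lambda item: item[0])[1]


if __name__ == "__main__":
    table = TinyTable()
    table.put("user#42", "profile:name", 1, b"Ada")
    table.put("user#42", "profile:name", 2, b"Ada Lovelace")
    print(table.get_latest("user#42", "profile:name"))
```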
Architecture without an Architect?
•  No “Architect” title / role
•  (+) No central approval for technology decisions
o  Most technology decisions made locally instead of globally
o  Better decisions in the field
•  (-) eBay Architecture Review Board
o  Central approval body for large-scale projects
o  Usually far too late in the process to be valuable
o  Experienced engineers saying “no” after the fact vs. encoding knowledge in a reusable library, tool, or service
Standardization	
•  Standardized communication
o  Network protocols
o  Data formats
o  Interface schema / specification
•  Standardized infrastructure
o  Source control
o  Configuration management
o  Cluster management
o  Monitoring, alerting, diagnosing, etc.
Standards become standards by being better than the alternatives!
“Enforcing” Standardization
•  Encouraged via
o  Libraries
o  Support in underlying services
o  Code reviews
o  Searchable code
The easiest way to encourage best practices is with *code*!
Make it really easy to do the right thing, and harder to do the wrong thing!
Service Independence
•  No standardization of service internals
o  Programming languages
o  Frameworks
o  Persistence mechanisms
In a mature ecosystem of services, we standardize the arcs of the graph, not the nodes!
Creating New Services
•  Spinning out a new service
o  Almost always built for a particular use case first
o  If successful and appropriate, form a team and generalize for multiple use cases
•  Pragmatism wins
•  Examples
o  Google File System
o  Bigtable
o  Megastore
o  Google App Engine
o  Gmail
Deprecating Old Services
•  What if a service is a failure?
o  Repurpose technologies for other uses
o  Redeploy people to other teams
•  Examples
o  Google Wave -> Google Apps
o  Multiple generations of core services
“Every service at Google is either deprecated or not ready yet.”
-- Google engineering proverb
Service Architectures at Scale
•  Ecosystem of Services
•  Building a Service
•  Operating a Service
•  Service Anti-Patterns
Characteristics of an Effective Service
•  Single-purpose
•  Simple, well-defined interface
•  Modular and independent
•  Isolated persistence (!)
[Diagram: services A, B, C, D, and E]
Goals of a Service Owner
•  Meet the needs of my clients …
o  Functionality
o  Quality
o  Performance
o  Stability and reliability
o  Constant improvement over time
•  … at minimum cost and effort
o  Leverage common tools and infrastructure
o  Leverage other services
o  Automate building, deploying, and operating my service
o  Optimize for efficient use of resources
Responsibilities of a Service Owner
•  End-to-end Ownership
o  Team owns service from design to deployment to retirement
o  No separate maintenance or sustaining engineering team
o  DevOps philosophy of “You build it, you run it”
•  Autonomy and Accountability
o  Freedom to choose technology, methodology, working environment
o  Responsibility for the results of those choices
Service as Bounded Context
•  Primary focus on my service
o  Clients which depend on my service
o  Services which my service depends on
o  Cognitive load is very bounded
•  Very little worry about
o  The complete ecosystem
o  The underlying infrastructure
•  è Small, nimble service teams
Service	
Client  
A	
Client  
B	
Client  
C
Service-Service Relationships
•  Vendor – Customer Relationship
o  Friendly and cooperative, but structured
o  Clear ownership and division of responsibility
o  Customer can choose to use service or not (!)
•  Service-Level Agreement (SLA)
o  Promise of service levels by the provider
o  Customer needs to be able to rely on the service, like a utility
Service-Service Relationships
•  Charging and Cost Allocation
o  Charge customers for *usage* of the service
o  Aligns economic incentives of customer and provider
o  Motivates both sides to optimize for efficiency
o  (+) Pre- / post-allocation at Google
Maintaining Service Quality
•  Small incremental changes
o  Easy to reason about and understand
o  Risk of code change is nonlinear in the size of the change
o  (-) Initial memcache service submission
•  Solid Development Practices
o  Code reviews before submission
o  Automated tests for everything
•  Google build and test system
o  Uses production cluster manager
o  Runs millions of tests per day in parallel
o  All acceptance tests run before code is accepted into source control
Maintaining Interface Stability
•  Backward / forward compatibility of interfaces
o  Can *never* break your clients’ code
o  Often multiple interface versions
o  Sometimes multiple deployments
o  Majority of changes don’t impact the interface in any way
•  Explicit deprecation policy
o  Strong incentive to wean customers off old versions (!)
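One way to picture "multiple interface versions" with an explicit deprecation signal is the sketch below. It is an assumption-laden illustration, not a description of how Google or eBay do it: the handler names, the version strings, and the response shapes are all hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple


def get_user_v1(user_id: int) -> dict:
    """Original interface: a single 'name' field. Existing clients rely on it."""
    return {"id": user_id, "name": "Ada Lovelace"}


def get_user_v2(user_id: int) -> dict:
    """Newer interface: split name fields, built on the same underlying data."""
    first, last = get_user_v1(user_id)["name"].split(" ", 1)
    return {"id": user_id, "first_name": first, "last_name": last}


# Both versions stay deployed so that old clients never break.
HANDLERS: Dict[str, Callable[[int], dict]] = {"v1": get_user_v1, "v2": get_user_v2}

# Explicit deprecation policy: versions listed here still work but carry a notice.
DEPRECATED: Dict[str, str] = {"v1": "v1 is deprecated; please migrate to v2"}


def handle(version: str, user_id: int) -> Tuple[dict, Optional[str]]:
    """Dispatch to the requested version and return (body, deprecation notice)."""
    if version not in HANDLERS:
        raise ValueError(f"unknown interface version: {version}")
    return HANDLERS[version](user_id), DEPRECATED.get(version)


if __name__ == "__main__":
    print(handle("v1", 7))
    print(handle("v2", 7))
```

The notice is returned alongside the body instead of inside it, so the v1 response that existing clients parse stays unchanged.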
Service Architectures at Scale
•  Ecosystem of Services
•  Building a Service
•  Operating a Service
•  Service Anti-Patterns
Predictable Performance
•  Services at scale highly exposed to performance variability
•  Imagine an operation …
o  1ms median latency, but 1 second latency at 99.99%ile (1 in 10,000)
o  Service using one machine → 0.01% slow
o  Service using 5,000 machines → 50% slow
•  Predictability trumps average performance
o  Low latency + inconsistent performance != low latency
o  Far easier to program to consistent performance
o  Tail latencies are *much* more important than average latencies
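A rough check of the fan-out arithmetic, as a sketch under assumptions the slide does not state: each request touches every machine, waits for the slowest reply, and slowdowns on different machines are independent. Under those assumptions the exact figure for 5,000 machines is about 39%; the 50% quoted on the slide matches the simpler estimate of machines × probability, which is an upper bound. The function names are hypothetical.

```python
def fraction_slow(p_slow: float, machines: int) -> float:
    """Probability that at least one of `machines` independent calls is slow."""
    return 1.0 - (1.0 - p_slow) ** machines


def linear_estimate(p_slow: float, machines: int) -> float:
    """Back-of-the-envelope upper bound: machines * p_slow, capped at 1."""
    return min(1.0, p_slow * machines)


if __name__ == "__main__":
    p = 0.0001  # 1 in 10,000 calls hits the slow 99.99th-percentile case
    for n in (1, 100, 5000):
        print(n, round(fraction_slow(p, n), 4), round(linear_estimate(p, n), 4))
```

Either way, a tail that is negligible on one machine dominates the experience once a request depends on thousands of them.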
Google App Engine Memcache Service
•  Periodic “hiccups” in latency at 99.99%ile and beyond
•  Very difficult to detect and diagnose
•  → Slab memory allocation
Service Reliability
•  Systems at scale highly exposed to failure
o  Software, hardware, service failures
o  Sharks and backhoes
o  Operator “oops”
•  Resilience in depth
o  Redundancy for machine / cluster / data center failures
o  Load-balancing and flow control for service invocations
o  Rapid rollback for “oops”
Service Reliability: Deployment
•  Incremental Deployment
o  Canary systems
o  Staged rollouts
o  Rapid rollback
•  eBay “Feature Flags”
o  Decouple code deployment from feature deployment
o  Rapidly turn on / off features without redeploying code
o  Typically deploy with feature turned off, then turn on as a separate step
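The feature-flag idea can be sketched as follows. This is a minimal in-process illustration, not eBay's implementation; `FeatureFlags`, the flag name `new_discount`, and the pricing rule are hypothetical. A real system would typically read flags from a configuration service so they can change without a restart.

```python
from typing import Dict, Iterable, Optional


class FeatureFlags:
    """Holds on/off switches that can change at runtime, independent of deploys."""

    def __init__(self, flags: Optional[Dict[str, bool]] = None) -> None:
        self._flags: Dict[str, bool] = dict(flags or {})

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so new code ships dark.
        return self._flags.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled


def checkout_total_cents(prices_cents: Iterable[int], flags: FeatureFlags) -> int:
    """Both code paths are deployed; the flag decides which one runs."""
    total = sum(prices_cents)
    if flags.is_enabled("new_discount"):
        total -= total // 10  # hypothetical new feature: 10% off
    return total


if __name__ == "__main__":
    flags = FeatureFlags()                             # step 1: deploy with the feature off
    print(checkout_total_cents([1000, 500], flags))
    flags.set("new_discount", True)                    # step 2: turn it on separately
    print(checkout_total_cents([1000, 500], flags))
    flags.set("new_discount", False)                   # turn it off again, no redeploy
    print(checkout_total_cents([1000, 500], flags))
```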
Service Reliability: Monitoring
•  Instrumentation
o  Common monitoring service
o  Machine / instance statistics: CPU, memory, I/O
o  Request statistics: request rate, error rate, latency distribution
o  Application / service statistics
o  Downstream service invocations
•  Diagnosability
o  In-process web server with current statistics
o  Distributed tracing of requests through multiple service invocations
You can have too much alerting, but you can never have too much monitoring!
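As a small illustration of the request statistics listed above (error rate and latency distribution), here is a sketch of an in-process recorder. The class and method names are hypothetical, and it uses the nearest-rank percentile definition for simplicity; production monitoring systems usually use histograms or sketches instead of keeping every sample.

```python
import math
from typing import List


class RequestStats:
    """Records per-request latency and outcome, and reports summary statistics."""

    def __init__(self) -> None:
        self.latencies_ms: List[float] = []
        self.errors = 0

    def record(self, latency_ms: float, ok: bool = True) -> None:
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.errors += 1

    def error_rate(self) -> float:
        total = len(self.latencies_ms)
        return self.errors / total if total else 0.0

    def percentile(self, pct: float) -> float:
        """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
        if not self.latencies_ms:
            raise ValueError("no samples recorded")
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(pct * len(ordered) / 100))
        return ordered[rank - 1]


if __name__ == "__main__":
    stats = RequestStats()
    for latency in [1, 2, 3, 4, 5, 6, 7, 8, 9]:
        stats.record(latency)
    stats.record(1000, ok=False)  # one slow, failed request
    print("median:", stats.percentile(50))
    print("p99:", stats.percentile(99))
    print("error rate:", stats.error_rate())
```

Reporting percentiles alongside the median is what makes a tail-latency "hiccup" visible at all; an average would mostly hide it.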
Service Architectures at Scale
•  Ecosystem of Services
•  Building a Service
•  Operating a Service
•  Service Anti-Patterns
Service Anti-Patterns
•  The “Mega-Service”
o  Overbroad area of responsibility is difficult to reason about, change
o  Leads to more upstream / downstream dependencies
•  Shared persistence
o  Breaks encapsulation, encourages “backdoor” interface violations
o  Unhealthy and near-invisible coupling of services
o  (-) Initial eBay SOA efforts
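The alternative to shared persistence is isolated persistence: each service owns its data, and other services reach it only through the published interface. The sketch below is a hypothetical in-memory illustration (the service names, SKU, and methods are invented for the example), not a description of eBay's systems.

```python
from typing import Dict


class InventoryService:
    """Owns its data store; no other service reads or writes it directly."""

    def __init__(self) -> None:
        self._stock: Dict[str, int] = {"widget": 3}  # private to this service

    def available(self, sku: str) -> int:
        return self._stock.get(sku, 0)

    def reserve(self, sku: str, quantity: int) -> bool:
        """The only way to change stock, so the invariant 'never negative' holds."""
        current = self._stock.get(sku, 0)
        if quantity <= 0 or quantity > current:
            return False
        self._stock[sku] = current - quantity
        return True


class OrderService:
    """Depends on the inventory interface, not on the inventory tables."""

    def __init__(self, inventory: InventoryService) -> None:
        self._inventory = inventory

    def place_order(self, sku: str, quantity: int) -> str:
        return "confirmed" if self._inventory.reserve(sku, quantity) else "rejected"


if __name__ == "__main__":
    inventory = InventoryService()
    orders = OrderService(inventory)
    print(orders.place_order("widget", 2))
    print(orders.place_order("widget", 2))
    print(inventory.available("widget"))
```

If `OrderService` wrote to the stock table directly, the inventory team could not change its schema or enforce its rules without coordinating with every such caller, which is the near-invisible coupling the slide warns about.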
Thank You!
•  @randyshoup
•  linkedin.com/in/randyshoup
•  Slides will be at slideshare.net/randyshoup