Scheduling on Large Clusters
Based on Google’s Omega Paper
Sameer Tiwari
Hadoop Architect, Pivotal Inc.
stiwari@gopivotal.com, @sameertech
Scheduling on Large Clusters
● Goals
o High Utilization
o Honor user-defined constraints
o Maintain High Efficiency
● Issues
o Unpredictable load
o Varying types of load
o Increasing load and cluster size
Types of Schedulers
● Monolithic
o Single Resource Manager and Scheduler
o Google Borg
● Two Level
o Single Resource Manager and multiple schedulers
o Mesos, Hadoop-on-Demand (HOD project)
● Shared state
o Multiple schedulers with access to all resources
o Google Omega
Monolithic Schedulers
● Stable, in use since the 1990s
● Issues
o Head-of-line blocking
o Scalability is limited
o Popular with HPC community
▪ Maui -> Moab(R), Platform LSF (IBM)
o Multi-path scheduling addresses some of these problems
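Head-of-line blocking can be illustrated with a minimal sketch. It assumes a single scheduler that makes placement decisions strictly in arrival order; the job names and decision times are hypothetical numbers chosen for illustration, not measurements from any real system.

```python
from collections import deque


def completion_times(jobs):
    """Serve jobs strictly in arrival order through one scheduler.

    jobs: list of (name, decision_time_seconds) pairs (hypothetical values).
    Returns {name: clock time at which that job's placement is decided}.
    """
    queue = deque(jobs)
    clock = 0.0
    done = {}
    while queue:
        name, decision_time = queue.popleft()
        clock += decision_time  # every later job waits for this decision
        done[name] = clock
    return done


# One slow-to-place service job arrives ahead of two quick batch jobs.
jobs = [("big-service", 60.0), ("small-batch-1", 0.1), ("small-batch-2", 0.1)]

single = completion_times(jobs)          # one shared queue
batch_only = completion_times(jobs[1:])  # same batch jobs with nothing ahead of them

print(single["small-batch-1"])      # waits behind the 60 s decision
print(batch_only["small-batch-1"])  # decided almost immediately
```

In the shared queue the small jobs inherit the delay of the large one, which is the blocking effect the slide refers to.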
Statically Partitioned Schedulers
● Common with Hadoop deployments
o Assumes full control of resources
o Dedicated or statically partitioned clusters
● Issues
o Low utilization
o Data fragmentation
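A small worked example of the low-utilization issue, as a sketch only. It assumes work cannot cross partition boundaries; the partition names, capacities, and demands are made up for illustration.

```python
def utilization(partitions):
    """Fraction of total capacity in use.

    partitions: {name: (capacity, demand)} in arbitrary resource units.
    Demand beyond a partition's capacity cannot spill into another partition.
    """
    used = sum(min(capacity, demand) for capacity, demand in partitions.values())
    total = sum(capacity for capacity, _ in partitions.values())
    return used / total


# Hypothetical cluster split statically in half: one side idle, one overloaded.
static = {"hadoop": (100, 30), "web": (100, 160)}

# The same machines and the same total demand (190), treated as one pool.
pooled = {"all": (200, 190)}

print(utilization(static))  # 0.65
print(utilization(pooled))  # 0.95
```

The idle capacity in one partition cannot absorb the excess demand in the other, so the statically split cluster runs at lower utilization even though total demand is identical.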
Dynamic Schedulers
● AKA: Two-level Schedulers
o Resource Manager dynamically partitions a cluster
o Resources presented to partitions as “offers”
o Partitions request resources as needed
o e.g. Mesos and Hadoop on Demand (HOD)
● Issues
o Pessimistic locking is used during allocation
o Not suitable for “long running” jobs
o Gang scheduling (e.g. MPI jobs) can cause deadlocks
o Each scheduler has no idea about any other scheduler
▪ Pre-emption is tricky
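The offer mechanism and the gang-scheduling problem can be sketched as follows. This is a toy model, not the Mesos API: the `ResourceManager` class, its `offer` and `decline` methods, and the machine names are all hypothetical. The one assumption taken from the slide is that an offered machine is locked to a single framework until that framework answers.

```python
class ResourceManager:
    """Toy two-level resource manager with pessimistic (exclusive) offers."""

    def __init__(self, machines):
        self.free = set(machines)
        self.offered = {}  # machine -> framework currently holding the offer

    def offer(self, framework, count):
        """Offer up to `count` free machines; they stay locked until answered."""
        batch = sorted(self.free)[:count]
        for machine in batch:
            self.free.remove(machine)
            self.offered[machine] = framework
        return batch

    def decline(self, framework, machines):
        """Return offered machines to the free pool."""
        for machine in machines:
            if self.offered.get(machine) == framework:
                del self.offered[machine]
                self.free.add(machine)


GANG_SIZE = 3  # each framework runs a job that needs 3 machines at once

rm = ResourceManager(["m1", "m2", "m3", "m4"])
held_a = rm.offer("A", 2)
held_b = rm.offer("B", 2)

# Both frameworks hold partial offers and wait for more; nothing is free.
stuck = (len(held_a) < GANG_SIZE and len(held_b) < GANG_SIZE and not rm.free)
print(stuck)  # True

# The stalemate only ends if one framework gives its offer back.
rm.decline("B", held_b)
held_a += rm.offer("A", 1)
print(len(held_a) >= GANG_SIZE)  # True
```

Neither framework can see what the other is holding, which is why the sketch needs an explicit `decline` to make progress.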
Trivia
● What type of scheduler is Hadoop YARN?
o Each job's App Master requests resources from a single RM
o But the App Master provides a job-management service, not scheduling
o Effectively, it's a Monolithic Scheduler
Shared State Schedulers
● No external Resource Manager
● Each scheduler has full access to the cluster
● Each scheduler holds a copy of the cluster state
● Optimistic concurrency control
o Updates are made atomically in a transaction
o When transactions conflict, only one commit succeeds
o Failed transactions are retried
● Gang scheduling does not result in resource hoarding
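The optimistic concurrency control described here can be sketched with per-machine version counters. This is a minimal illustration under stated assumptions, not Omega's implementation: `CellState`, `commit`, and `schedule` are hypothetical names, and conflict detection by version comparison is one simple choice among several.

```python
class CellState:
    """Toy shared cluster state: an owner and a version counter per machine."""

    def __init__(self, machines):
        self.owner = {m: None for m in machines}
        self.version = {m: 0 for m in machines}

    def snapshot(self):
        """Give a scheduler its own private copy of the state."""
        return dict(self.owner), dict(self.version)

    def commit(self, job, claims, seen_versions):
        """Claim all machines atomically, or none if any changed since the snapshot."""
        if any(self.version[m] != seen_versions[m] for m in claims):
            return False  # conflict: another scheduler committed first
        for m in claims:
            self.owner[m] = job
            self.version[m] += 1
        return True


def schedule(cell, job, needed, max_retries=3):
    """Plan against a fresh snapshot and retry when the commit conflicts."""
    for _ in range(max_retries):
        owners, versions = cell.snapshot()
        free = [m for m in sorted(owners) if owners[m] is None]
        if len(free) < needed:
            return None
        claims = free[:needed]
        if cell.commit(job, claims, versions):
            return claims
    return None


cell = CellState(["m1", "m2", "m3", "m4"])

# Two schedulers snapshot the same state and pick the same machines.
_, versions_a = cell.snapshot()
_, versions_b = cell.snapshot()
first = cell.commit("job-a", ["m1", "m2"], versions_a)
second = cell.commit("job-b", ["m1", "m2"], versions_b)
print(first, second)  # True False

# The losing scheduler re-syncs and tries again with up-to-date state.
print(schedule(cell, "job-b", 2))  # ['m3', 'm4']
```

No machine is locked while a scheduler is deciding; the cost of a conflict is only the redone planning work.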
Shared State Schedulers
● Each scheduler is free to choose its own policy
● Requires a common understanding of
o Resources
o Precedence
● Relies on post-facto enforcement
● Results in high utilization and efficiency
Questions?

Editor's Notes

  • #3: Users can ask for colocation, or for a particular rack or machine. Efficiency here means fast allocation.
  • #7: Works well with small jobs (<< cluster resources) and short-lived jobs that give up resources frequently.
  • #8: Works well with small jobs (<< cluster resources) and short-lived jobs that give up resources frequently.
  • #9:
    * Addresses two issues of the two-level scheduler approach:
      - limited parallelism due to pessimistic concurrency control
      - restricted visibility of resources in a scheduler framework
    * No head-of-line blocking
    * Potential cost of redoing work when the optimistic concurrency assumptions are incorrect
    * Resource hoarding is not possible with all-or-nothing resource allocation
    * To prevent starvation: incremental transactions, i.e. accept all but the conflicting transactions
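The difference between all-or-nothing and incremental commits in note #9 can be sketched as below. The `commit` function, the `owners` mapping, and the job and machine names are hypothetical; the sketch only assumes that a claim conflicts when the machine is already owned.

```python
def commit(owners, job, claims, all_or_nothing):
    """Try to claim machines in a shared state `owners` (machine -> job or None).

    Returns the list of machines actually granted to `job`.
    """
    available = [m for m in claims if owners[m] is None]
    if all_or_nothing and len(available) < len(claims):
        return []  # gang semantics: one conflict rejects the whole transaction
    for m in available:
        owners[m] = job  # incremental semantics: keep every non-conflicting claim
    return available


# m1 is already taken, so any transaction that claims it has one conflict.
owners = {"m1": "other-job", "m2": None, "m3": None}

gang = commit(dict(owners), "mpi-job", ["m1", "m2", "m3"], all_or_nothing=True)
incremental = commit(dict(owners), "batch-job", ["m1", "m2", "m3"], all_or_nothing=False)

print(gang)         # []
print(incremental)  # ['m2', 'm3']
```

The gang job ends up holding nothing rather than a useless partial allocation, so it cannot hoard resources, while the incremental job makes partial progress instead of being starved by repeated conflicts.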