Goals:
- Make it easier for users (query submitters) to understand resource limits. That is, apply limits per-query rather than per-task (users don't know what a task is)
- Prevent workers from running out of memory and crashing
- Better utilize the cluster's memory, rather than having a "big query" queue that limits ingest
Assumptions:
- The delta memory (including temporary allocations) required for an `Operator` to process a `Page` is linear, with a small constant, in the size of the `Page`
- Maximum number of running splits (for all tasks) per worker is relatively small (100s)
- Almost all queries only need a small amount of memory, but some need several orders of magnitude more
- We only need to track large memory allocations, everything else is insignificant
Other observations:
- It's infeasible to predict the memory usage of an arbitrary query in advance
- For many, but not all, Operators, it's easy to derive the maximum memory usage
Key Concepts:
- per-query limit: queries will have a global limit on the amount of distributed memory they may use. If they exceed this limit they will be killed.
- per-query-per-machine limit: there will also be a limit on how much they may use per machine. This is primarily to prevent workers from running out of memory.
- "memory pools": workers will have one or more "memory pools". Every query is assigned to a memory pool and all allocations come out of that pool. These pools serve a few purposes:
  - They provide a low-latency way to track memory allocations. The coordinator can set the size of a pool and the queries in that pool, after which workers are free to allocate as much memory as is in the pool without further synchronization with the coordinator
  - They provide resource isolation, which is critical for avoiding deadlocks (we avoid deadlocks by running the largest query in a separate, "reserved", pool)
  - The coordinator can move queries between pools. For example, after the largest query finishes, the second largest would be promoted to the "reserved" pool
  - They can be used for QoS, by segregating workloads into different pools (this part may or may not be implemented as part of this project)
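The pool mechanics described above can be sketched as follows. This is a minimal illustration, not Presto's actual implementation; the class and method names (`MemoryPool`, `tryReserve`, `free`) are assumed for the example. The key property it shows is the low-latency path: workers reserve against a locally known pool size without synchronizing with the coordinator.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a worker-side memory pool. The coordinator sets the
// pool size; workers then allocate freely up to that size with no further
// coordination, and per-query usage is reported back asynchronously.
public class MemoryPool {
    private final long maxBytes; // size set by the coordinator
    private final AtomicLong reservedBytes = new AtomicLong();
    // per-query reservations, so usage can be reported back to the coordinator
    private final Map<String, AtomicLong> queryReservations = new ConcurrentHashMap<>();

    public MemoryPool(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    // Returns false when the pool is exhausted; the caller then blocks
    // (or the query is killed, depending on policy).
    public boolean tryReserve(String queryId, long bytes) {
        while (true) {
            long current = reservedBytes.get();
            if (current + bytes > maxBytes) {
                return false;
            }
            if (reservedBytes.compareAndSet(current, current + bytes)) {
                queryReservations.computeIfAbsent(queryId, q -> new AtomicLong())
                        .addAndGet(bytes);
                return true;
            }
        }
    }

    public void free(String queryId, long bytes) {
        reservedBytes.addAndGet(-bytes);
        queryReservations.get(queryId).addAndGet(-bytes);
    }

    public long getReservedBytes() {
        return reservedBytes.get();
    }

    public static void main(String[] args) {
        MemoryPool pool = new MemoryPool(1 << 20); // 1 MB pool
        boolean first = pool.tryReserve("query_1", 512 * 1024);
        boolean second = pool.tryReserve("query_2", 768 * 1024); // exceeds remaining space
        System.out.println(first + " " + second + " " + pool.getReservedBytes());
    }
}
```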
Design:
- Memory allocations may be "required" or "revokable". "Revokable" allocations may be revoked at any time, and this must not affect the completion or correctness of the query
- To avoid deadlocks, it's sufficient to guarantee that the query with the largest memory limit can make progress.
- Queries will be submitted with a high limit (say 10% of the cluster's memory), but will rarely use this much (see "assumptions" section). Therefore, it's desirable to plan queries on the assumption that they won't use much memory (say 10-way hash join), but then be able to re-plan them to change to a larger (say 100-way hash join) plan. This doesn't change the memory limit for the query, just the way that the query is executed.
- The current memory management system allows queries to use memory proportional to the number of stages they have, since limits are only enforced per-task. It's an explicit goal of this project to remove this behavior
Workers:
- New config flag to set "reserved memory"; this accounts for memory needed by Presto itself, and for all the small allocations that we don't track
- Memory pool usage is reported to the coordinator
- Add a new query memory manager that tracks all memory used on the worker by a particular query. It will enforce the per-query-per-machine limit and request memory from the local pool. The memory usage will also be communicated back to the coordinator via the existing stats API
- This manager can then be configured to either kill the query or pause it when it reaches the limit. Pausing it should help implement adaptation.
- `Operator` interface will be changed to have a `getMemoryUsage` method
- `Driver` will tally memory usage by `Operator`s and report it to the query memory manager. It will then block if the pool is empty (this may exceed the pool's limit, but only by a small constant factor of the `Page` size. See "assumptions" section)
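The `Operator`/`Driver` interaction could look roughly like this. The `getMemoryUsage` method is named in the proposal; `QueryMemoryManager` and its `updateMemoryUsage` method are assumed names for illustration.

```java
import java.util.List;

// Each Operator exposes its current memory footprint.
interface Operator {
    long getMemoryUsage(); // bytes currently held by this operator
}

// Hypothetical per-query, per-worker manager (see "Workers" section).
class QueryMemoryManager {
    private long reportedBytes;

    // Returns false when the local pool is empty; the Driver then blocks.
    synchronized boolean updateMemoryUsage(long newTotalBytes) {
        reportedBytes = newTotalBytes;
        return true; // a real version would check the pool and limits here
    }

    synchronized long getReportedBytes() {
        return reportedBytes;
    }
}

public class Driver {
    private final List<Operator> operators;
    private final QueryMemoryManager memoryManager;

    public Driver(List<Operator> operators, QueryMemoryManager memoryManager) {
        this.operators = operators;
        this.memoryManager = memoryManager;
    }

    // Called between Pages: sum usage across operators and report it.
    // Any overshoot past the pool is bounded by a small constant factor
    // of the Page size (see "assumptions" section).
    public boolean reportMemoryUsage() {
        long total = 0;
        for (Operator operator : operators) {
            total += operator.getMemoryUsage();
        }
        return memoryManager.updateMemoryUsage(total);
    }

    public static void main(String[] args) {
        QueryMemoryManager manager = new QueryMemoryManager();
        List<Operator> operators = List.of(() -> 1000L, () -> 2000L);
        Driver driver = new Driver(operators, manager);
        driver.reportMemoryUsage();
        System.out.println(manager.getReportedBytes()); // 3000
    }
}
```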
Coordinator:
- Add cluster memory manager that will track cluster memory usage and memory pools on workers. Guarantees that no deadlocks occur by running the largest query in its own memory pool (i.e. reserving all the memory in advance). It will also enforce memory limits at the per-query level
- (optional): task scheduling should be aware of the memory usage on workers and avoid ones that are running low
- (note): Since memory usage is reported asynchronously, it's possible for a query to exceed its limit, but finish before the coordinator realizes this and kills it. This could lead to queries failing non-deterministically, but seems unimportant to address, as it should be mostly theoretical
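The coordinator-side deadlock-avoidance rule (run the largest query in its own reserved pool, promote the next largest when it finishes) can be sketched as below. All names here (`ClusterMemoryManager`, `QueryInfo`) are illustrative, not the actual Presto classes.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical coordinator-side manager: at most one query occupies the
// "reserved" pool at a time, guaranteeing the largest query can progress.
public class ClusterMemoryManager {
    public record QueryInfo(String queryId, long reservedBytes) {}

    private String reservedPoolQuery; // at most one query owns the reserved pool

    // Pick the largest running query for promotion, if the pool is free.
    public Optional<String> selectQueryForReservedPool(List<QueryInfo> running) {
        if (reservedPoolQuery != null) {
            return Optional.empty(); // reserved pool is occupied
        }
        return running.stream()
                .max(Comparator.comparingLong(QueryInfo::reservedBytes))
                .map(QueryInfo::queryId);
    }

    public void assignToReservedPool(String queryId) {
        reservedPoolQuery = queryId;
    }

    // When the reserved query finishes, the pool frees up and the next
    // largest query can be promoted on the following refresh.
    public void queryFinished(String queryId) {
        if (queryId.equals(reservedPoolQuery)) {
            reservedPoolQuery = null;
        }
    }

    public static void main(String[] args) {
        ClusterMemoryManager manager = new ClusterMemoryManager();
        List<QueryInfo> running = List.of(
                new QueryInfo("q1", 10_000_000L),
                new QueryInfo("q2", 500_000_000L),
                new QueryInfo("q3", 2_000_000L));
        System.out.println(manager.selectQueryForReservedPool(running).orElse("none")); // q2
    }
}
```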
Tentative roadmap:
This is a tentative roadmap toward a full implementation of the system described above.
- M1:
- Track memory usage from all `Operator`s with significant usage. The ORC reader and table writers in particular are not currently tracked. Track all of these as "required" allocations
- Memory pools are implemented, but treated as if they had infinite size
- Limit enforcement is unchanged (still uses the per-task limit)
- JMX stats for these, and maybe expose them in web UI too
- M2:
- Change policy for killing queries that exceed their memory limit from per-task to per-query/per-query-per-machine
- These limits can be configured separately. Since the desired per-query limit is likely a function of the cluster size, the default could be infinite, or it could be configured as a percentage of cluster memory. The per-query-per-machine limit can default to something like the current per-task limit multiplied by the average number of tasks assigned to a machine
- M3:
- Flag to turn on killing of queries when the general pool runs out of memory. This can be used in situations where throughput is more important than guaranteeing that queries complete
- Implement blocking when pools run out of memory
- (optional): Add a config flag to toggle deadlock prevention and memory pool limits, for use cases where the latency overhead is unacceptable
- M4:
- Measure performance impact of planning queries in such a way that re-planning is unnecessary (plan JOINs/GROUP BYs on all machines...etc)
- (optional): Implement re-planning, if the above experiment shows this is necessary. Probably just increasing the number of nodes used for GROUP BYs and JOINs (including semi joins)
- Remove `experimental_big_query` flag
- M5:
- Implement "revokable" memory. This would primarily be used for increasing table scan parallelism, and for partial aggregations
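The "required" vs. "revokable" distinction from the design section could be modeled roughly as follows. This is a hedged sketch under assumed names (`QueryMemoryContext`, `revoke`); the point is that revokable memory can be reclaimed at any time, e.g. by flushing a partial aggregation, without affecting query correctness.

```java
// Hypothetical per-query memory context separating required from
// revokable reservations (names are illustrative, not Presto's API).
public class QueryMemoryContext {
    private long requiredBytes;
    private long revokableBytes;
    private final Runnable revokeCallback; // e.g. flush partial aggregation state

    public QueryMemoryContext(Runnable revokeCallback) {
        this.revokeCallback = revokeCallback;
    }

    public void reserveRequired(long bytes) {
        requiredBytes += bytes;
    }

    public void reserveRevokable(long bytes) {
        revokableBytes += bytes;
    }

    // Called by the memory manager under pressure: only revokable memory
    // is reclaimed; required memory stays until the query releases it.
    public long revoke() {
        long freed = revokableBytes;
        revokableBytes = 0;
        revokeCallback.run();
        return freed;
    }

    public long getTotalBytes() {
        return requiredBytes + revokableBytes;
    }

    public static void main(String[] args) {
        QueryMemoryContext context = new QueryMemoryContext(
                () -> System.out.println("flushing partial aggregation"));
        context.reserveRequired(1000);
        context.reserveRevokable(5000);
        long freed = context.revoke();
        System.out.println(freed + " " + context.getTotalBytes()); // 5000 1000
    }
}
```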
Future work:
- Analyze queries to determine an upper bound on their memory usage, based on the plan
- It should be possible to define rules which set the resource limits for a query, similar to how the queueing system's rules work
- It should be possible to set limits on the CPU time used by a query