Koordinator
What Is Koordinator
According to the official documentation, Koordinator is a modern scheduling system that colocates microservices, AI, and big data workloads on Kubernetes. It achieves high utilization by combining elastic resource quotas, efficient pod packing, resource overcommitment, and resource sharing with container-level resource isolation.
Koordinator is high-performance and scalable, and, most importantly, proven in large-scale production environments. It allows you to build container orchestration systems that support enterprise production environments.
Use Case: Node-Level Colocation
Node-level colocation in Kubernetes refers to the practice of scheduling pods from different applications or workloads onto the same node intentionally. This is done to optimize resource utilization, reduce latency between communicating pods, or leverage hardware-specific capabilities.
Online workloads tend to be bursty in their resource usage: they often request a large amount of resources even though their true utilization may average below 10%. Offline workloads like OLAP and batch processing jobs, on the other hand, have weaker latency requirements, yet they can consistently run close to full utilization most of the time. This is where node-level colocation comes in: by running both online and offline workloads on the same machine, we can exploit the difference in their requirements and usage patterns, reducing the need to run them on separate servers while keeping the negative side effects on both workloads minimal.
Figure 1: Example of colocating different tiers of workloads on the same machine based on current projected utilization.
The goal is thus to let offline workloads use the resources left unused by the online workloads on the machines they share, which allows us to achieve higher overall resource utilization in the cluster. The Eunomia node-level colocation project achieves this by performing real-time resource estimation for online workloads. This provides an accurate view of potentially reusable resources, which external resource schedulers can act on: scheduling more offline work when reusable resources increase, or evicting workloads when resource estimates become tight.
This allows other teams that wish to run workloads with weaker latency requirements (known as offline workloads) to make use of the available resources as a result of peaks and troughs in the cluster's actual resource utilization, which is denoted by the blue section in Figure 1b above.
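The estimation described above can be sketched roughly as follows. The function name, the colocation safety ratio, and the single-resource (CPU-only) focus are illustrative assumptions, not Eunomia's actual implementation:

```python
# Sketch of reclaimable-resource estimation for node-level colocation.
# Only a fraction of the node (safety_ratio) is treated as colocatable,
# and the real-time usage estimate of online workloads is subtracted;
# what remains may be offered to offline workloads.

def reclaimable_cpu(node_allocatable: float,
                    online_usage_estimate: float,
                    safety_ratio: float = 0.65) -> float:
    """CPU (in cores) that offline workloads may reuse on this node."""
    budget = node_allocatable * safety_ratio
    return max(budget - online_usage_estimate, 0.0)

# Example: a 32-core node where online pods currently use ~8 cores,
# with half the node considered colocatable.
print(reclaimable_cpu(32.0, 8.0, safety_ratio=0.5))  # 32*0.5 - 8 = 8.0 cores
```

As online usage rises, the reclaimable figure shrinks toward zero, which is the signal a scheduler would use to stop placing, or start evicting, offline workloads.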
Use Case: Node Resource Overcommit
In Kubernetes, when specifying resources for a Deployment or Pod, you are able to specify both requests and limits. Specifying a request smaller than the limit means the workload is guaranteed a minimum resource allocation equal to its request, while the maximum it may use is capped by its limit. The difference between the request and the limit is not guaranteed by the environment the container runs in. When the sum of all workloads' limits exceeds the total resource capacity (i.e. node allocatable) on the node, the node is considered to be overcommitted.
In many organisations, services are often scaled according to very conservative estimates (i.e. sized too large) and set to the maximum size they are estimated to need. This results in a lot of underutilized resources. One strategy to improve overall resource utilization is to lower the workload's resource requests (its guaranteed minimum) while maintaining its resource limits. The factor by which we divide the resource limits is called the overcommit factor; a factor of 2 halves the resource limits to give the resource requests.
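The overcommit factor arithmetic can be sketched as follows; the function name and the millicore units are illustrative:

```python
# Minimal sketch of deriving requests from limits with an overcommit factor.

def derive_request(limit_millicores: int, overcommit_factor: float) -> int:
    """Divide the resource limit by the overcommit factor to get the request."""
    return int(limit_millicores / overcommit_factor)

# With a factor of 2, a 4000m CPU limit yields a 2000m request, so roughly
# twice as many such workloads fit on a node when packed by request
# instead of by limit.
limits = [4000, 2000, 1000]
requests = [derive_request(limit, 2.0) for limit in limits]
print(requests)  # [2000, 1000, 500]
```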
Additional safeguards and optimizations within the resource overcommit scope of Koordinator include:
- Usage class scheduling: Integrates with a scheduler plugin to avoid colocating too many workloads of the same usage class on one node, since workloads of the same class are likely to burst at the same time
- Usage class resource allocation: Allocates resources based on usage class to avoid sharing a cpuset among workloads of the same usage class
- NUMA-aware CPU alignment: Minimizes the overhead resulting from crossing the NUMA socket boundary by allocating cpusets to the same NUMA socket on a best-effort basis
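The NUMA-aware, best-effort cpuset placement described above might look roughly like this sketch; the socket layout and function names are assumptions for illustration, not Koordinator's API:

```python
# Best-effort NUMA-aware cpuset allocation: prefer a single socket,
# spill across the socket boundary only when no socket can fit the request.

def allocate_cpuset(free_cpus_by_socket: dict[int, list[int]], n: int) -> list[int]:
    """Pick n free CPU IDs, preferring to keep them on one NUMA socket."""
    # First, try to satisfy the whole request from a single socket.
    for cpus in free_cpus_by_socket.values():
        if len(cpus) >= n:
            return cpus[:n]
    # Otherwise take CPUs from multiple sockets (crossing the boundary).
    allocated: list[int] = []
    for cpus in free_cpus_by_socket.values():
        take = min(n - len(allocated), len(cpus))
        allocated += cpus[:take]
        if len(allocated) == n:
            break
    return allocated

free = {0: [0, 1, 2, 3], 1: [8, 9]}
print(allocate_cpuset(free, 3))  # fits on socket 0: [0, 1, 2]
print(allocate_cpuset(free, 5))  # spills across sockets: [0, 1, 2, 3, 8]
```

A real agent would additionally weigh usage classes when choosing the socket, per the bullet points above.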
Use Case: Node Resource Allocation
After workloads have been scheduled onto a machine, many allocation decisions remain for node-level resources, including CPU, memory, I/O, and network. This is typically achieved via a combination of technologies like Linux cgroups and Intel Resource Director Technology (RDT).
Allocation decisions need to be made for several scenarios, such as the following:
- Resource Enforcement: For pods or workloads running on the machine, users may wish to apply resource enforcement on their containers, such as limiting CPU usage or IO bandwidth.
- Resource Overcommit: In order to uphold usage class guarantees, the agent needs to apply enforcement measures to mitigate the risks associated with overcommitting.
- Node-Level Colocation: To safely allow colocating offline workloads with online workloads, the agent needs to minimize the impacts caused by offline workloads on the rest of the machine.
In order to achieve colocation and overcommit safely, there need to be protections and safeguards, both when colocating offline workloads with online ones and when overcommitting resources across multiple online workloads on a single machine.
A non-exhaustive list of resources controlled by Koordinator includes:
- CPU: CFS quota, cpuset, group identity
- Memory: memory limit, OOM QoS, memory QoS, memcg minimum watermark adjustment
- Memory Bandwidth: memory bandwidth, L3 cache allocation
- I/O: blkio bandwidth, IO QoS
These resources may be allocated using different methods, such as policy-based allocation (static configuration) or prediction/recommendation-based allocation (dynamic configuration based on some estimation).
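As a rough illustration of the policy-based (static) approach, the following sketch maps a QoS class to the cgroup v2 settings a node agent might write. The policy table, class names, and chosen file set are assumptions for illustration, not Koordinator's actual configuration:

```python
# Sketch of policy-based allocation rendered as cgroup v2 file contents.

# Static per-class policy: (CPU quota in microseconds per 100ms period,
# memory limit in bytes). "max" means unthrottled/unlimited in cgroup v2.
POLICY = {
    "online":  ("max",   "max"),            # no throttling for latency-sensitive pods
    "offline": ("50000", str(4 * 2**30)),   # 0.5 CPU, 4 GiB for best-effort pods
}

def render_cgroup_config(qos_class: str) -> dict[str, str]:
    """Map a QoS class to the cgroup v2 files the node agent would write."""
    cpu_max, memory_max = POLICY[qos_class]
    return {
        "cpu.max": f"{cpu_max} 100000",   # "<quota> <period>" format
        "memory.max": memory_max,
    }

print(render_cgroup_config("offline"))
# {'cpu.max': '50000 100000', 'memory.max': '4294967296'}
```

A prediction-based allocator would compute these values dynamically from usage estimates instead of reading them from a static table.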
Use Case: Unified Scheduling
Besides the allocation and isolation with oversubscription introduced above, it is crucial to have a global view for scheduling and allocation across all online and offline services. This unified scheduling system provides a bird's-eye view of, and control over, the whole cluster, including both online and offline workloads. It allows us to schedule in advance to avoid potential resource and isolation issues, and at the same time it optimizes workload arrangement for better resource utilization and improved performance of colocated services.
This diagram illustrates an architecture for running Presto on Kubernetes, potentially enhanced with Yarn for resource management and some custom components for scheduling and performance prediction. Let's break down the components and their interactions:
Core Components:
- K8s API Server: The central control plane component of Kubernetes. It manages all cluster resources and receives requests from other components.
- Unified Scheduler: This likely represents a custom or enhanced scheduler within Kubernetes, responsible for placing Presto worker pods and potentially other workloads onto the nodes. It might incorporate logic from both Yarn and Presto for optimized placement.
- Nodes: The worker machines in the Kubernetes cluster. Each node runs a kubelet (the Kubernetes agent) and can host multiple pods.
- Presto Worker Pod: These pods execute Presto worker processes, which handle query processing tasks assigned by the Presto coordinator.
- Yarn Container Pod: These pods encapsulate Yarn containers, providing resource isolation and management within Kubernetes. Presto worker pods can run within Yarn containers to leverage Yarn's resource management capabilities.
- MMC Worker Pod: These pods appear to be related to memory management or caching (MMC might stand for Memory Management Controller or something similar). They likely cooperate with Presto workers for optimized memory usage or caching of query results.
- Eunomia Agent: Eunomia is a resource management framework. The Eunomia agent running on each node interacts with the scheduler to provide resource information and manage resource allocation for pods.
Presto-Specific Components:
- Presto Coordinator: The central component of Presto, responsible for parsing queries, planning execution, and distributing tasks to the Presto workers.
- Presto Memory Plugin/Insight Plugin: These are likely custom plugins developed to enhance Presto's memory management and performance insight capabilities.
- Scheduling Hint: This suggests a mechanism to provide hints or directives to the scheduler to influence the placement of Presto worker pods (e.g., node affinity, pod affinity/anti-affinity).
Other Components:
- Yarn RM (Resource Manager): The central component of Yarn, responsible for managing cluster resources and scheduling applications.
- Remote Scheduler/Remote Kubelet Service: These components within Yarn RM probably interface with the Kubernetes API server, allowing Yarn to schedule and manage resources within the Kubernetes cluster.
- Offline Orchestrator: This component likely handles the initial deployment or scaling of Presto and Yarn components in Kubernetes.
- DSP Predictor/Insight Store: These components are part of a custom performance prediction subsystem (DSP possibly stands for Data Stream Processor or a similar term). The predictor uses information from the Insight Store (which presumably holds performance metrics) to forecast future resource needs or query execution times.
Data Flow (Simplified):
- Clients submit queries to the Presto Coordinator.
- The Presto Coordinator interacts with the Unified Scheduler (which in turn might interact with Yarn RM) to determine where to place and run the Presto worker pods.
- The Unified Scheduler uses information from the Insight Plugin, DSP Predictor, and potentially Scheduling Hints to make placement decisions.
- Presto worker pods are scheduled onto nodes and execute query processing tasks.
- MMC worker pods assist with memory management or caching.
- The Insight Plugin collects performance metrics and stores them in the Insight Store.
- The DSP Predictor analyzes the stored data to provide performance predictions.
This architecture demonstrates a complex integration of Presto, Yarn, Kubernetes, and custom components aimed at providing optimized scheduling, resource management, and performance prediction for Presto workloads running in a Kubernetes environment. The custom components suggest a focus on performance tuning and efficient utilization of cluster resources.