Scaling and efficiency
Mesos aims to provide a highly scalable and efficient mechanism that enables a variety of frameworks to share cluster resources effectively. Distributed applications are varied, can have different priorities in different contexts, and are continuously evolving; this led Mesos' design philosophy towards customizable resource allocation policies that users can define and set as per their requirements.
Resource allocation
The Mesos resource allocation module contains the policy that the Mesos master uses to determine the type and quantity of resource offers to be made to each framework. Organizations can customize it to implement their own allocation policy, for example, fair sharing, priority, and so on, which allows for fine-grained resource sharing. Custom allocation modules can also be developed to address specific needs.
The resource allocation module is responsible for making sure that resources are shared fairly among competing frameworks. The choice of algorithm used to implement the sharing policy has a great bearing on the efficiency of a cluster manager.
One of the most popular allocation algorithms, max-min fairness, works well in a homogeneous environment, that is, one where resource requirements are fairly proportional between different competing users, such as a Hadoop cluster. However, scheduling resources across frameworks with heterogeneous resource demands poses a more complex challenge. What is a suitable fair share allocation policy if user A runs tasks that require two CPUs and 8 GB RAM each, while user B runs tasks that require four CPUs and 2 GB RAM each? User A's tasks are RAM-heavy, while user B's tasks are CPU-heavy. How, then, should a combined pool of RAM and CPU resources be distributed between the two users?
The latter scenario is a common one faced by Mesos, designed as it is to manage resources primarily in a heterogeneous environment. To address this, Mesos has the Dominant Resource Fairness algorithm (DRF) as its default resource allocation policy, which is far more suitable for heterogeneous environments. The algorithm is described in detail in the following sections.
The Dominant Resource Fairness algorithm (DRF)
Job scheduling in datacenters is not limited to CPUs alone but extends to other resources, such as memory and disk, as well. When resource demands are varied, with some tasks being CPU-intensive and others memory- or disk-intensive, the max-min fairness algorithm falls short. Herein lies the need for a resource scheduling mechanism that gives every user in a heterogeneous environment a fair share of the resources it needs most. In simple terms, DRF is an adaptation of the max-min fairness algorithm to fairly share heterogeneous resources among users.
Let's consider the following example to understand how the algorithm works.
We will assume that resources are divisible and are offered in multiples of each user's demand vector, that is, one task's worth at a time.
Consider a case where the total resources available are eight CPUs and 10 GB memory. User 1 runs tasks that require one CPU and 3 GB memory, and user 2 runs tasks that require three CPUs and 1 GB memory. Before we proceed to analyze how the DRF algorithm will allocate tasks, let's understand the concepts of the dominant resource and share:
- Dominant resource: This refers to the resource (CPU or memory) that is most required by the user. In this case, user 1 runs tasks that have higher memory requirements (3 GB per task), so the dominant resource for user 1 is memory. On the other hand, user 2 runs computation-heavy tasks (three CPUs per task) and hence has CPU as its dominant resource.
- Dominant share: This refers to the fraction of the dominant resource that the user is allocated. Referring to our example, user 1's dominant share is 30% (3/10), while user 2's dominant share is 37.5% (3/8).
The DRF allocation module tracks the dominant share of each user and makes a note of the resources allocated to each user. DRF begins allocation by offering resources (CPU or memory) to the user with the lowest dominant share among all the competing users. The user then has the option to accept the offer if it meets its requirement.
Now, let's look at each step taken by the DRF algorithm to allocate resources for users 1 and 2. For simplicity's sake, we will ignore the resources that get released back into the pool after the completion of small tasks and assume that every resource offer is accepted and that each user runs an unbounded number of tasks with the stated resource requirements. Every user 1 task consumes one-eighth of the total CPU and three-tenths of the total memory, making memory user 1's dominant resource. Every user 2 task consumes three-eighths of the total CPU and one-tenth of the total memory, making CPU user 2's dominant resource.
Each row of the allocation table provides the following information:
- User Selected: The user that has been offered resources by the algorithm
- Resource share: The fraction of the total available resources of each type that is allocated to the user in the offer round
- Dominant share: The resource share of the dominant resource
- Dominant share percentage: The dominant share expressed as a percentage (%)
- CPU Total Allocation: The sum of the CPU resources allocated to all the users up to and including the current offer round
- Memory Total Allocation: The sum of the memory resources allocated to all the users up to and including the current offer round
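Because the table is simply the DRF loop traced round by round, it can be reproduced with a few lines of code. The following is a minimal Python sketch of the allocation loop used in this example; it is not the Mesos allocator code, and the drf_allocate helper, the user labels, and the resource names are illustrative assumptions. Each line it prints corresponds to one row of the table, and the narrated steps that follow match its output. The weights parameter defaults to equal weights and is used again in the Weighted DRF discussion later in this section.

def drf_allocate(capacity, demands, weights=None, rounds=20):
    """Offer one task's worth of resources at a time to the user with the
    lowest (weighted) dominant share. A sketch only, not the Mesos allocator."""
    weights = weights or {user: 1.0 for user in demands}
    allocated = {user: {r: 0.0 for r in capacity} for user in demands}
    used = {r: 0.0 for r in capacity}
    tasks = {user: 0 for user in demands}

    def dominant_share(user):
        # Fraction of the user's most heavily used resource, scaled by the user's weight.
        return max(allocated[user][r] / capacity[r] for r in capacity) / weights[user]

    for _ in range(rounds):
        user = min(demands, key=dominant_share)   # lowest (weighted) dominant share goes first
        demand = demands[user]
        # Simplification: stop as soon as the selected user's next task does not fit;
        # in this example that coincides with a resource becoming saturated.
        if any(used[r] + demand[r] > capacity[r] for r in capacity):
            break
        for r in capacity:
            allocated[user][r] += demand[r]
            used[r] += demand[r]
        tasks[user] += 1
        share = max(allocated[user][r] / capacity[r] for r in capacity)
        print(user, "selected |",
              {r: round(allocated[user][r] / capacity[r], 3) for r in capacity},
              "| dominant share:", f"{share:.1%}",
              "| totals:", used)
    return tasks, allocated

# Totals and per-task demands from the example: 8 CPUs and 10 GB memory in all.
capacity = {"cpus": 8.0, "mem": 10.0}
demands = {"user1": {"cpus": 1.0, "mem": 3.0},   # memory-heavy tasks
           "user2": {"cpus": 3.0, "mem": 1.0}}   # CPU-heavy tasks
drf_allocate(capacity, demands)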
To begin with, both users have a dominant share of 0% (as no resource is allocated as yet). We will assume that DRF chooses user 1 to offer resources to first, although had we assumed user 2, the final outcome would have been the same. Here are the steps it will follow:
- User 1 will receive the required set of resources to run a task. The dominant share for its dominant resource (memory) will get increased to 30%.
- Since user 2's dominant share is 0%, it will receive resources in the next round. The dominant share for its dominant resource (CPU) will increase to 37.5%.
- As User 1 now has the lower dominant share (30%), it will receive the next set of resources. Its dominant share rises to 60%.
- User 2, which now has the lower dominant share (37.5%), will be offered resources next. Its dominant share rises to 75%.
- The process will continue until there are no more resources to allocate to run the user tasks. In this case, after step 4, the CPU resources will get saturated (highlighted in red).
- The process will continue if any resources are freed or the resource requirement changes.
Primarily, DRF aims to maximize the minimum dominant share across all users. In this example, DRF allocated the following:
- Two tasks to user 1 with a total allocation of two CPUs, 6 GB memory, and a dominant share % of 60 (Memory).
- Two tasks to user 2 with a total allocation of six CPUs, 2 GB memory, and a dominant share % of 75 (CPU).
This can be diagrammatically depicted as follows:
Weighted DRF
We have so far assumed that all users are entitled to an equal share of the resources. The algorithm can also be modified so that one user or a set of users is favored over the others during resource allocation. This is referred to as Weighted DRF, wherein resources are not shared equally among users. Sharing can be weighted on a per-user or per-resource basis, the former being the more common approach.
Let's consider a per-user weighted computation of the previous example. For every user i and resource j, the weights are set as w1,j = 3 and w2,j = 1. This implies that user 1 is entitled to three times the share of all the resources in the system compared to user 2. If both weights have the value 1, the allocation is carried out in accordance with the normal DRF algorithm (as described before).
Now, let's look at each step taken by the DRF algorithm to allocate resources for users 1 and 2.
To begin with, both users have a dominant share of 0% (as no resource is allocated as yet). We will assume that Weighted DRF chooses user 1 to offer resources to first, although had we assumed user 2, the final outcome would have been the same. Here are the steps that it will follow:
- User 1 will receive the required set of resources to run a task. The dominant share for its dominant resource (memory) gets increased to 10% (30% divided by 3).
- Since user 2's dominant share is 0%, it will receive resources in the next round. The dominant share for its dominant resource (CPU) will increase to 37.5%.
- As user 1 now has the lower dominant share (10%), it will receive the next set of resources. Its dominant share will rise to 20% (60% divided by 3).
- User 1 still has the lower dominant share (20%), so it is offered resources again, raising its dominant share to 30% (90% divided by 3).
- The process will continue until there are no more resources left to allocate to the user tasks. In this case, after step 4, the memory resources will get saturated (highlighted in red).
- The process will continue if any resources are freed or the resource requirement changes.
Weighted DRF aims to prioritize resource sharing based on the weight assigned to each user. In this example, Weighted DRF allocated the following:
- Three tasks to user 1 with a total allocation of three CPUs and 9 GB memory
- Only one task to user 2 with a total allocation of three CPUs and 1 GB memory
This can be diagrammatically depicted as follows:
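Reusing the illustrative drf_allocate sketch from earlier in this section, the weighted variant only changes the weights argument passed in; the expected task counts match the allocation just described:

# Weighted DRF: user 1 is given three times user 2's entitlement (w1 = 3, w2 = 1).
capacity = {"cpus": 8.0, "mem": 10.0}
demands = {"user1": {"cpus": 1.0, "mem": 3.0},
           "user2": {"cpus": 3.0, "mem": 1.0}}
tasks, allocation = drf_allocate(capacity, demands,
                                 weights={"user1": 3.0, "user2": 1.0})
print(tasks)   # expected: {'user1': 3, 'user2': 1}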
In addition to this, it is possible to create custom allocation modules that cater to an organization's specific resource allocation needs. This will be covered later in this chapter.
Let's now look at some of the important properties that DRF follows/satisfies:
- Progressive Filling: Allocation by progressive filling in DRF increases the dominant shares of all users at the same rate, while the users' other resource allocations increase proportionally to their demands. This continues until at least one resource is saturated, at which point the allocations of the users that require the saturated resource are frozen and these users drop out of further filling. Progressive filling then proceeds recursively among the remaining users and ends when there is no user left whose dominant share can be increased.
- Share Guarantee: The DRF algorithm allocates resources to users via "progressive filling", which ensures that every user's dominant share allocation increases at the same rate and continues until one resource gets saturated and the resource allocation is frozen. This indirectly ensures that all users are treated equally and are guaranteed 1/n of at least one resource.
- Strategy-proofness: This property of DRF ensures that users cannot, at any given point in time, benefit from an increased allocation by falsifying their resource demands. If a user does try to game the system by demanding extra resources, the DRF algorithm allocates resources in a manner that may well turn out to be detrimental to this user.
- Pareto efficiency: This property of DRF implies that it is not possible to increase the allocation of one user without decreasing the allocation of at least one other user. Courtesy of the progressive filling algorithm, allocating more resources to one specific user will naturally hurt the others.
- Envy-freeness: DRF is envy-free because no user has any reason to prefer, or envy, the resource allocation of another. Envy could arise only when, for instance, user 1 envies user 2, whose dominant share for a particular resource is higher. However, because resource allocation is done via progressive filling, the dominant shares of both users 1 and 2 will be the same by the time the resource in question is saturated, so such envy is neither beneficial nor warranted.
Configuring resource offers on Mesos
A common problem encountered is that frameworks sometimes do not accept any resource offers due to improper resource configuration settings on the slaves. For example, the Elasticsearch framework requires ports 9200 and 9300, but the default port range configuration on the Mesos slaves is 31000 to 32000.
The slaves must be configured correctly so that the right resource offers are made to frameworks that can then accept them. This can be done as follows:
- In the mesos-slave command, add the necessary resource parameters. Here's an example:

  --resources='ports:[9200-9200,9300-9300]' ...
- Create a file called resources under /etc/mesos-slave whose content is the necessary resource string. Run the following command to verify the content:

  $ cat /etc/mesos-slave/resources
  ports:[9200-9200,9300-9300]
  $
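For reference, the required port ranges can be combined with the other resource types in a single resource string when starting the slave. The following invocation is purely illustrative; the master URL, work directory, and the CPU and memory figures are placeholder assumptions and should be adapted to your cluster:

$ mesos-slave --master=zk://<master-host>:2181/mesos \
    --work_dir=/var/lib/mesos \
    --resources='cpus:8;mem:10240;ports:[9200-9200,9300-9300,31000-32000]'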