Completely Fair Scheduler

Completely Fair Scheduler
Original author(s) Ingo Molnár
Developer(s) Linux kernel developers
Written in C
Operating system Linux kernel
Type process scheduler
License GPL-2.0
Website kernel.org
Location of the "Completely Fair Scheduler" (a process scheduler) in a simplified structure of the Linux kernel.

The Completely Fair Scheduler (CFS) is a process scheduler which was merged into the 2.6.23 (October 2007) release of the Linux kernel and is the default scheduler. It handles CPU resource allocation for executing processes, and aims to maximize overall CPU utilization while also maximizing interactive performance.

Con Kolivas's work with CPU scheduling, most significantly his implementation of "fair scheduling" named Rotating Staircase Deadline, inspired Ingo Molnár to develop his CFS, as a replacement for the earlier O(1) scheduler, crediting Kolivas in his announcement.[1]

In contrast to the previous O(1) scheduler used in older Linux 2.6 kernels, the CFS scheduler implementation is not based on run queues. Instead, a red-black tree implements a "timeline" of future task execution. Additionally, the scheduler uses nanosecond granularity accounting, the atomic units by which an individual process' share of the CPU was allocated (thus making redundant the previous notion of timeslices). This precise knowledge also means that no specific heuristics are required to determine the interactivity of a process, for example.[2]

Like the old O(1) scheduler, CFS uses a concept called "sleeper fairness", which considers sleeping or waiting tasks equivalent to those on the runqueue. This means that interactive tasks which spend most of their time waiting for user input or other events get a comparable share of CPU time when they need it.

Algorithm

The data structure used for the scheduling algorithm is a red-black tree in which the nodes are scheduler specific structures, entitled "sched_entity". These are derived from the general task_struct process descriptor, with added scheduler elements.

The nodes are indexed by processor "execution time" in nanoseconds.[3]

A "maximum execution time" is also calculated for each process. This time is based upon the idea that an "ideal processor" would equally share processing power amongst all processes. Thus, the maximum execution time is the time the process has been waiting to run, divided by the total number of processes, or in other words, the maximum execution time is the time the process would have expected to run on an "ideal processor".

When the scheduler is invoked to run a new process, the operation of the scheduler is as follows:

  1. The left most node of the scheduling tree is chosen (as it will have the lowest spent execution time), and sent for execution.
  2. If the process simply completes execution, it is removed from the system and scheduling tree.
  3. If the process reaches its maximum execution time or is otherwise stopped (voluntarily or via interrupt) it is reinserted into the scheduling tree based on its new spent execution time.
  4. The new left-most node will then be selected from the tree, repeating the iteration.

If the process spends a lot of its time sleeping, then its spent time value is low and it automatically gets the priority boost when it finally needs it. Hence such tasks do not get less processor time than the tasks that are constantly running.

OS background

CFS is an implementation of a well-studied, classic scheduling algorithm called weighted fair queuing.[4]

Originally invented for packet networks, fair queuing had been previously applied to CPU scheduling under the name stride scheduling. However, CFS uses terminology different from that normally applied to fair queuing. "Service error" (the amount by which a process's obtained CPU share differs from its expected CPU share) is called "wait_runtime" in Linux's implementation, and "queue virtual time" (QVT) is called "fair_clock".

The fair queuing CFS scheduler has a scheduling complexity of O(log N), where N is the number of tasks in the runqueue. Choosing a task can be done in constant time, but reinserting a task after it has run requires O(log N) operations, because the runqueue is implemented as a red-black tree.

CFS is the first implementation of a fair queuing process scheduler widely used in a general-purpose operating system.[4]

Fairer algorithms

Technically, the name "Completely Fair Scheduler" is not entirely correct, since the algorithm only guarantees the "unfair" level to be less than O(n), where n is the number of processes. There are more complicated algorithms which can give better bounds over the "unfair" levels (e.g. O(log n)).

The Linux kernel received a patch for CFS in November 2010 for the 2.6.38 kernel that has made the scheduler fairer for use on desktops and workstations. Developed by Mike Galbraith using ideas suggested by Linus Torvalds, the patch implements a feature called autogrouping that significantly boosts interactive desktop performance.[5] The explanation of the basic algorithm implementation was included by Mike Galbraith in a post about the patch:[6]

Each task's signal struct contains an inherited pointer to a refcounted autogroup struct containing a task group pointer, the default for all tasks pointing to the init_task_group. When a task calls __proc_set_tty(), the process wide reference to the default group is dropped, a new task group is created, and the process is moved into the new task group. Children thereafter inherit this task group, and increase its refcount. On exit, a reference to the current task group is dropped when the last reference to each signal struct is dropped. The task group is destroyed when the last signal struct referencing it is freed. At runqueue selection time, IFF a task has no cgroup assignment, its current autogroup is used.

The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP is selected, but can be disabled via the boot option noautogroup, and can be also be turned on/off on the fly.

The primary issues solved by this are for multi-core as well as multi-cpu (SMP) systems experiencing increased interactive response times while performing other tasks that use many CPU-intensive threads in those tasks. A simple explanation is that, with this patch applied, one will be able to still watch a video, read email and perform other typical desktop activities without glitches or choppiness while, say, compiling the Linux kernel or encoding video.

Initial patches for the autogroup feature tied grouping to a tty (terminal), but the eventually merged patch settled on an implementation that tied groups to sessions created via the setsid() system call.[7] The autogroup feature implements task group creation only for fair class tasks (that is, processes scheduled according to the default SCHED_OTHER policy, but not, for example, processes scheduled under realtime policies) and, as such, leaves the way open for enhancement. Even at this basic implementation this patch can make Linux on the desktop a reality for all those who have found desktop performance to be less than desired.[8] As Linus Torvalds put it:[9]

So I think this is firmly one of those "real improvement" patches.

Good job. Group scheduling goes from "useful for some specific server loads" to "that's a killer feature".

See also

References

This article is issued from Wikipedia - version of the 12/2/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.