Control groups, or cgroups for short, allow you to set limits on resources for processes and their children. This is the mechanism that Docker uses to control limits on memory, swap, CPU, and storage and network I/O resources. … Every Docker container is assigned a cgroup that is unique to that container. All of the processes in the container will be in the same group. That means that it's easy to control resources for each container as a whole without worrying about what might be running. If a container is redeployed with new processes added, you can have Docker assign the same policy and it will apply to all of them.
—Sean P. Kane & Karl Matthias, Docker Up & Running, 2nd Edition
Constraining the CPU utilization available (or even the CPU cores accessible) to a group of processes is fundamental to isolation. Even under fair scheduling policies, processes that fork lots of CPU-intensive children (e.g. the Apache web server) can swamp available CPU bandwidth, and cause performance issues for other processes. CPU and CPUSet control groups provide powerful mechanisms to allocate CPU resources to groups of processes. Docker and other container environments make use of these mechanisms to isolate container resource usage, and the hierarchical nature of control groups makes it easy for containers to further apportion allocated resources to subgroups of processes within themselves.
In this studio, you will:
- use the cgroups v2 CPU controller to apply weights to, or constrain the bandwidth of, a group of processes
- observe CPU usage with the time utility, through the cpu.stat interface, and with the Function Tracer and KernelShark
Please complete the required exercises below. We encourage you to work in groups of 2 or 3 people on each studio (groups may change from studio to studio), though you may complete any studio by yourself if you prefer.
As you work through these exercises, please record your answers, and when finished upload them along with the relevant source code to the appropriate spot on Canvas.
Make sure that the name of each person who worked on these exercises is listed in the first answer, and make sure you number each of your responses so it is easy to match your responses with each exercise.
As the answer to the first exercise, please list the names of the people who worked together on this studio.
In this studio you will use cgroups to apply limits to the CPU usage of a process (or group of processes), and observe that usage. In particular, we will be using the CPUSet controller to limit processes to a subset of the system's CPU cores, and the CPU controller to constrain and observe those processes' usage of time on those cores.
Make sure, before proceeding, that you have enabled cgroups v2 on your Raspberry Pi according to the instructions in the previous studio.
For this exercise, you will begin by using the CPU controller. First, navigate into the root cgroup directory, /sys/fs/cgroup, and verify that the controller is available by inspecting the contents of the cgroup.controllers file. Then, enable it for any children by adding it to the cgroup.subtree_control file.
To do this, you will have to run the terminal in root mode, i.e. first issue either the
sudo su or
sudo bash command.
From there, issue the command:
echo "+cpu" > cgroup.subtree_control
Now, inspect the contents of cgroup.subtree_control to verify that the controller has been added.
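The steps above can be sketched as the following root-shell session (assuming the cgroup v2 hierarchy is mounted at /sys/fs/cgroup, as is standard):

```shell
# Run as root (e.g. after sudo su):
cd /sys/fs/cgroup
cat cgroup.controllers        # verify that "cpu" is listed as available
echo "+cpu" > cgroup.subtree_control
cat cgroup.subtree_control    # verify that "cpu" now appears here
```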
Next, create a child control group, contained within the root group, which will monitor a CPU-intensive parallel task. To create the child, simply create a subdirectory within the root cgroup directory.
Navigate into this subdirectory, and list its contents,
verifying that (1) the cgroup.controllers file lists the cpu controller, and that (2) the cpu.stat file is present.
As the answer to this exercise, please list the contents of the subdirectory, and show the contents of its cgroup.controllers file.
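A possible sequence, still as root in /sys/fs/cgroup (the subdirectory name matrix_group is illustrative; any name works):

```shell
mkdir matrix_group         # create the child cgroup
cd matrix_group
ls                         # cpu.stat and cgroup.controllers should be present
cat cgroup.controllers     # should list "cpu"
```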
To practice using the CPU controller,
you will use a program that generates heavy CPU usage on all available cores.
Please download the parallel_dense_mm.c program.
It takes a single command-line argument, specifying the matrix size n. It creates two dense matrices of size n x n with randomly-generated values, then multiplies them, using OpenMP to run in parallel.
Compile it against the OpenMP library using:
gcc -Wall -o parallel_dense_mm parallel_dense_mm.c -fopenmp
You are going to run this program in the CPU
cgroup you created.
To do so, write another program, called exec_time, that (1) prints its PID; (2) blocks on input from stdin; then, once it receives any input character, proceeds to exec the following command:
time ./parallel_dense_mm 500
Note that the
time command is not a standalone utility,
but is actually a command built into the bash shell,
and therefore cannot be executed from the
exec family of functions.
So, you will need to install a
time utility executable,
which you can do with the command:
sudo apt-get install time
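The control flow of exec_time can be sketched as a shell script (the studio expects a compiled program, but the logic is the same; the matrix size 500 matches the command given above):

```shell
#!/bin/bash
echo $$        # 1. print our PID, so it can be written into cgroup.procs
read -r _      # 2. block until input arrives on stdin
# 3. replace this process with the timed matrix multiply;
#    /usr/bin/time is used because the bash builtin cannot be exec'd:
exec /usr/bin/time ./parallel_dense_mm 500
```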
Compile and run your program. After it prints its PID,
but before pressing a key to proceed,
add it to the CPU
cgroup you created by writing its PID into the
cgroup.procs file in that subdirectory.
Then, allow the program to proceed. Once it completes,
compare its output (i.e., the output of the time utility that measured the execution time of the matrix multiply program) to the values reported in the cpu.stat file within the cgroup subdirectory.
As the answer to this exercise, show the values reported by time, and the contents of the cpu.stat file.
Explain how these values compare.
Remember, the values in
cpu.stat are reported in microseconds.
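The sequence might look like the following root-shell session (the cgroup name matrix_group and PID 1234 are both illustrative stand-ins):

```shell
echo 1234 > /sys/fs/cgroup/matrix_group/cgroup.procs  # add exec_time to the cgroup
# ... press a key in exec_time's terminal and wait for it to finish ...
cat /sys/fs/cgroup/matrix_group/cpu.stat
# cpu.stat reports (in microseconds) fields including:
#   usage_usec  - total CPU time consumed by the group
#   user_usec   - time spent in user mode
#   system_usec - time spent in kernel mode
# These should roughly match time's user/sys figures, summed across threads.
```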
CPU contention can affect the execution time of programs, and interference by a process that incurs heavy CPU usage can slow down other processes running on the system. To observe this phenomenon, you will run two concurrent instances of the parallel matrix multiply, and see how the resulting contention affects its timing.
Modify your exec_time program so that it can be passed a command-line argument specifying the size of the matrices, which it will then pass as an argument to the parallel_dense_mm program, instead of using a hard-coded value.
Compile the program.
Now, you will run two concurrent instances of your program: one which you will time, and the other which will cause CPU contention. Proceed as follows:
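One way to set up the contention (matrix sizes are illustrative; the larger instance exists only to keep every core busy):

```shell
# Terminal 1: create sustained contention on all cores
./exec_time 2000    # press a key to start the large multiply
# Terminal 2: once terminal 1 is running, time the smaller instance
./exec_time 500     # press a key to start; note the reported times
```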
As the answer to this exercise, please copy the terminal output that shows the execution time of the second, smaller instance of the matrix multiply program. How does this compare to the measured time from the previous exercise? What does this tell you about the effect of CPU contention?
In addition to providing monitoring functionality,
cgroups can also control CPU access by applying weights
(similar to nice values under the CFS scheduler).
For this exercise, you will again create CPU contention by running two instances of your program,
so that all cores have two CPU heavy threads running concurrently.
This time, however, you will add one instance to a CPU cgroup and give it a higher weight.
First, print the contents of the
cpu.weight file in the CPU
cgroup you created.
As part of the answer to this exercise, write the reported value.
Proceed as follows:
- Add the PID of one of the two instances to the cgroup.procs file in your CPU cgroup.
- Write a higher value into the cpu.weight file than the currently configured value; recall that this value can range from 1-10000.
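For instance (run as root in the CPU cgroup's directory; 1234 stands in for the PID printed by the timed instance, and 10000 is just one possible choice of weight):

```shell
cat cpu.weight             # the default weight is typically 100
echo 1234 > cgroup.procs   # illustrative PID of the timed exec_time instance
echo 10000 > cpu.weight    # raise this group's relative share to the maximum
```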
As the answer to this exercise, please (1) report the original (default) value of cpu.weight; (2) tell us what value you then set for that file;
then (3) show the execution time of the second, smaller instance of the matrix multiply program.
How does this compare to the measured time from the previous exercise?
What does this tell you about how CPU
cgroups can affect process access to CPU resources?
cgroups can also constrain CPU bandwidth
for a process, or group of processes,
running on contended processors or cores.
For this exercise, you will use the same technique to create CPU contention,
but this time you will constrain the bandwidth of one of the instances of the parallel matrix multiply program.
First, reset the value of the
cpu.weight controller file
to its original, default value (you should have retrieved this in the previous exercise).
Next, apply a bandwidth limit by writing into the
cpu.max interface file.
This file takes the format:
MAX PERIOD
where MAX indicates the maximum total time (in microseconds) that processes in the cgroup can execute on contended CPUs within each PERIOD (also in microseconds) of elapsed time. This restricts the bandwidth of processes in that cgroup.
Use values that are sufficiently small that you will be able to see throttling behavior.
For example, if your exec_time program measured an elapsed time of t seconds running parallel_dense_mm with matrices of size 500x500, then use a MAX of t/5 seconds (converted to microseconds) and a PERIOD of at least twice the MAX.
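As a concrete, illustrative configuration (run as root in the cgroup's directory; note that the kernel only accepts a PERIOD between about 1 ms and 1 s, so large windows must be scaled down proportionally):

```shell
MAX=50000      # us of CPU time allowed per period (illustrative)
PERIOD=100000  # us window length (must be between 1000 and 1000000)
echo "$MAX $PERIOD" > cpu.max
# Resulting bandwidth, as a percentage of one CPU:
echo $(( 100 * MAX / PERIOD ))    # prints 50
```

With these numbers the group's threads may together consume at most half of one core's worth of time, regardless of how many cores they are spread across.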
Now, proceed to measure the execution time the same way you did in the previous exercise,
running an instance of your program with large matrices,
and a second instance with 500x500 matrices, which is added to the cgroup to constrain its bandwidth.
As the answer to this exercise, please (1) tell us what values you set in the cpu.max file; (2) calculate the resulting bandwidth constraint, based on those numbers;
and (3) show the execution time of the second, smaller instance of the matrix multiply program.
How does this compare to the measured times from the previous exercises?
What does this tell you about how the bandwidth constraint scales execution times?
To look more closely at how the CPU
cgroup enforces bandwidth constraints,
you will use
ftrace (short for Function Tracer) to trace
the execution of
parallel_dense_mm when it has a bandwidth constraint applied,
then use KernelShark to view the results of the trace.
First, on your Raspberry Pi, install these utilities with the command:
sudo apt-get install trace-cmd kernelshark
The function tracer is extraordinarily powerful,
and a full exploration of its capabilities is beyond the scope of this studio.
However, to generate a basic trace, which will allow you to inspect the scheduler's behavior
when it applies a bandwidth constraint to a cgroup,
you will use the command:
sudo trace-cmd record -e sched_switch ./exec_time 500
In particular, for this exercise, proceed as follows:
- Run the trace-cmd command above and, before allowing exec_time to proceed, write its PID into the cgroup.procs file in your CPU cgroup (with the bandwidth constraint from the previous exercise still in place).
Now, use KernelShark to inspect your trace.
Note: Running KernelShark, which is required for the following questions, requires a GUI. This means that if you have a headless setup for your Raspberry Pi, you will need to connect to it with a VNC viewer (as detailed in Exercise 4 of Studio 2), or use X11 forwarding via your ssh client.
On Mac/Linux, this is done by simply passing '-X' to the ssh command line. On macOS you may also need to install an X11 server, such as XQuartz. Other ssh clients should have similarly straightforward configuration options to enable.
On Windows, you may need to use a non-native ssh client, like PuTTY, in conjunction with an X11 server like Xming. To enable X11 forwarding in PuTTY, first ensure that Xming is running on your Windows computer. Then, in PuTTY, expand the Connection settings in the left sidebar, expand SSH, then click the X11 settings menu. Check the "Enable X11 forwarding" option. Now, you can click the Session menu in the left sidebar, and connect to your Pi via ssh.
Open the trace file that was produced in the previous exercise, with the command:
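By default, trace-cmd record writes its output to trace.dat in the current directory, so presumably the command is:

```shell
kernelshark trace.dat
```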
By default, you will be looking at a timeline for each CPU core in the system. Each process in the system will be given a unique color so you can track individual processes as they are scheduled on and off of processors as well as when they may be migrated between cores.
Start by zooming in on the trace until you can make out discrete events. To zoom in: press and hold the left mouse button; drag the cursor to the right; and then release to define a zoom window. Zooming out is the reverse: press and hold the left mouse button; drag the cursor to the left; and then release the mouse button.
We can also enable a process-centric view rather than a CPU-centric view. In the KernelShark window, go to the Plots menu, select Tasks, and then find the parallel_dense_mm process and click on its check box to activate it.
Scroll down or enlarge the viewing window until you see the timeline for that process
at the bottom. This timeline only shows the activity of this one process, and
different colors represent execution time on different processors (red boxes on this
timeline represent time where this task was not scheduled on any processor).
You can use the CPU and task timelines to see exactly how your process executed over its lifetime. If you zoom in to where you can see discrete events, you can mouse over those events to see exactly when each thread of the process was preempted.
As the answer to this exercise, please discuss what the trace tells you about the behavior of the process when it is scheduled on contended CPUs, and how you observe the bandwidth limits being applied. Also, please take a screenshot of the complete trace, as well as a screenshot of a zoomed-in area that highlights the scheduler's behavior, and submit these with your answers.
The remaining exercises of this studio are intended to introduce the CPUSet controller, and reinforce how cgroups and namespace scoping can allocate resources (in this case, CPU cores) to a container, which is then able to apportion those resources among its processes.
First, create a hierarchy of cgroups, and delegate ownership to the pi user. These steps are similar to those given in the previous studio, but you will also be enabling the CPUSet controller. Open a root shell (e.g. with sudo bash), then navigate to the root of the cgroup hierarchy, /sys/fs/cgroup.
In the root shell, issue the following commands:
mkdir pi_containers (This creates a hierarchy of cgroups in which the pi user can launch containers.)
chown pi:pi pi_containers (This delegates control of the hierarchy to the pi user, allowing it to create new children.)
chown pi:pi pi_containers/cgroup.procs (This allows the pi user to move processes within the hierarchy.)
chown pi:pi pi_containers/cgroup.subtree_control (This allows the pi user to enable controllers throughout the hierarchy. However, the root user still retains control over the controllers and interfaces at the root of the hierarchy, allowing the administrator to allocate resources to the pi_containers hierarchy, which can then distribute those resources amongst its containers.)
echo "+cpuset" > cgroup.subtree_control (This ensures that the CPUSet controller is available to the pi_containers hierarchy.)
Now, close the root shell. As the pi user (without sudo), navigate into the pi_containers directory. From there, first add the cpuset controller to the cgroup.subtree_control file, so that the controller is available to containers created under this hierarchy.
Then, still from within pi_containers, create a new directory that will hold processes not yet assigned to an individual container. Recall that processes may only be in leaf nodes of a cgroup hierarchy (with the exception of processes in the root cgroup); this directory allows processes to be moved into the pi_containers hierarchy without being explicitly assigned to an individual container.
Print the PID of the shell (echo $$), then launch a root shell. Write that PID into the new directory's cgroup.procs file. Exit the root shell, then print the contents of that cgroup.procs file to confirm that the shell has, indeed, been added to the cgroup. Now that it is in the portion of the hierarchy controlled by the pi user, that user can move it (and its children) freely within the hierarchy.
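A sketch of this sequence (the directory name procs and the PID 2345 are both illustrative):

```shell
echo $$      # note the pi shell's PID, e.g. 2345
sudo bash    # root is needed to move a process into the delegated subtree
echo 2345 > /sys/fs/cgroup/pi_containers/procs/cgroup.procs
exit         # back to the pi shell
cat /sys/fs/cgroup/pi_containers/procs/cgroup.procs   # should list 2345
```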
Still as the pi user, create a new directory under /sys/fs/cgroup/pi_containers (e.g., called container1) that will serve as the cgroup for your container.
List the contents of this directory, and verify that the
cpuset.cpus interface file is present.
You will use this interface to restrict your container to a subset of the available CPUs.
Write the value
2-3 into the file,
which restricts your container to CPUs 2 and 3,
then print its contents to verify that the write was successful.
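For example, as the pi user (using the illustrative name container1):

```shell
cd /sys/fs/cgroup/pi_containers
mkdir container1
ls container1                  # cpuset.cpus should be among the interface files
echo 2-3 > container1/cpuset.cpus
cat container1/cpuset.cpus     # should print: 2-3
```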
Now, run the
cgroupns_child_exec program like in the previous studio,
having it create new PID, user, and cgroup namespaces, and have it join the new
cgroup you created, e.g. with the command:
./cgroupns_child_exec -pCU -M '0 1000 1' -G '0 1000 1' /sys/fs/cgroup/pi_containers/container1/cgroup.procs ./simple_init
You should then be presented with the init prompt in your new container.
To allow your container to further apportion its resources
(i.e. the CPU cores it has been allocated),
it will need to create nested cgroups. From the container's shell, create a new cgroup that will restrict any container processes that are moved into it to only execute on CPU 2. As such, in the following instructions, we assume you've called the directory cpu2.
Remember, because processes can only be in leaf node cgroups, to enable controllers in cpu2 and move processes into it, other processes in the container will need to be in their own leaf node cgroup.
So (still from your container's shell), create another directory called default. Then, print the contents of cgroup.procs, and write all PIDs present in that file into the default cgroup's cgroup.procs file.
Note that the last entry in cgroup.procs is likely the PID of the process printing its contents (e.g. cat), and so you may not be able to write that PID into the default cgroup.
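Inside the container's shell, the sequence might look like this (assuming the leaf cgroups are named cpu2 and default, as above):

```shell
# From the container's root cgroup directory (as seen inside the container):
mkdir cpu2 default
cat cgroup.procs    # processes still in the (soon to be non-leaf) root
for pid in $(cat cgroup.procs); do
    # the PID of the cat above may fail, since that process has exited:
    echo "$pid" > default/cgroup.procs 2>/dev/null
done
```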
Once you have done this, print the contents of both cgroup.procs files.
You should see that there is still a process with PID 0 listed in
container1/cgroup.procs that you cannot move out.
This is the parent of the container's init process.
Because it is outside of the container's scope,
you cannot move it from inside the container.
To move it out of that cgroup,
so that only leaf nodes of that
cgroup contain processes,
open another terminal window (as non-root, but outside of the container).
From there, print the contents of /sys/fs/cgroup/pi_containers/container1/cgroup.procs. This will show you the PID of that parent process from the scope of the global PID namespace. From this terminal window, move that process out of container1 (e.g., into the directory for unassigned processes you created earlier).
From the terminal running inside your container, print the contents of each of the container's cgroup.procs files (in its root, cpu2, and default cgroups).
At this point, only the default cgroup should have processes.
From the other terminal window outside the container, print the contents of the container's default cgroup.procs file (from outside, /sys/fs/cgroup/pi_containers/container1/default/cgroup.procs).
As part of the answer to this exercise, please show the contents of this file
when printed from inside and outside the container.
Explain why the PIDs listed do not match.
Then, from inside the container, print the contents of /proc/self/cgroup. From outside the container, print the contents of /proc/PID/cgroup,
where PID is one of the PIDs listed in
cgroup.procs when printed from outside the container
(in other words, the global namespace PID of one of the processes in the container).
As the remainder of the answer to this exercise, please show the contents of both files,
and explain how
cgroup namespace scoping makes their contents differ.
Make sure to keep your container's terminal window open,
as you will continue to use it in the next exercise.
Now that your
cgroup hierarchy is configured
with a subtree for your container,
and all container processes are in a leaf node in that subtree,
your container should be able to split its resources among child processes.
For this exercise, you will have your container allocate only a single CPU core
to the parallel matrix multiply program,
then time its execution to confirm that the restriction works correctly.
From your container, navigate into the container's root cgroup directory. From there, enable the CPUSet controller for child cgroups by writing +cpuset into the cgroup.subtree_control file.
Then, list the contents of the
cpu2 directory to verify that the
cpuset.cpus controller file is now present.
Write the value "2" into that file.
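These steps might look like the following, from the container's shell in its root cgroup directory:

```shell
echo "+cpuset" > cgroup.subtree_control
ls cpu2                     # cpuset.cpus should now be listed
echo 2 > cpu2/cpuset.cpus   # restrict the cpu2 group to CPU 2 only
cat cpu2/cpuset.cpus        # should print: 2
```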
Now, in your container's bash shell, you will launch a new shell by running /bin/bash.
In other words, at this point, your container should be running the following hierarchy of processes:
simple_init → /bin/bash → /bin/bash
The nested bash shell will be used to constrain process execution to the cpu2 cgroup. Have the shell move itself into that cgroup by issuing:
echo $$ > cpu2/cgroup.procs
Then, print the contents of the cpu2/cgroup.procs file to verify that the move was successful.
Next, navigate into the directory where you compiled your parallel_dense_mm program.
Run it using the
time utility you installed,
on matrices of size 500x500, i.e.:
/usr/bin/time ./parallel_dense_mm 500
(If you get an error message stating that /usr/bin/time does not exist, run which time to see the path of the utility.)
As the answer to this exercise, please show the output produced by the time utility.
What do the values show you about the parallelism of the execution in this context?
In other words, do the values confirm that the program was restricted to a single core,
and if so, how?
Page updated Wednesday, February 9, 2022, by Marion Sudvarg and Chris Gill.