CSE 522S: Studio 9

CPU Control and Timing Events


Control groups, or cgroups for short, allow you to set limits on resources for processes and their children. This is the mechanism that Docker uses to control limits on memory, swap, CPU, and storage and network I/O resources. … Every Docker container is assigned a cgroup that is unique to that container. All of the processes in the container will be in the same group. That means that it's easy to control resources for each container as a whole without worrying about what might be running. If a container is redeployed with new processes added, you can have Docker assign the same policy and it will apply to all of them.

—Sean P. Kane & Karl Matthias, Docker Up & Running, 2nd Edition

Constraining the CPU utilization available (or even the CPU cores accessible) to a group of processes is fundamental to isolation. Even under fair scheduling policies, processes that fork lots of CPU-intensive children (e.g. the Apache web server) can swamp available CPU bandwidth, and cause performance issues for other processes. CPU and CPUSet control groups provide powerful mechanisms to allocate CPU resources to groups of processes. Docker and other container environments make use of these mechanisms to isolate container resource usage, and the hierarchical nature of control groups makes it easy for containers to further apportion allocated resources to subgroups of processes within themselves.

In this studio, you will:

  1. Practice using the cgroups v2 CPU controller to apply weights to, or constrain the bandwidth of, a group of processes
  2. Observe the effects of these techniques by monitoring CPU usage with the time utility, through the cpu.stat interface, and with the Function Tracer and KernelShark
  3. Integrate the CPUSet controller into your simple container environment to constrain container execution to a subset of available cores.

Please complete the required exercises below. We encourage you to work in groups of 2 or 3 people on each studio (groups may change from studio to studio), though you may complete any studio by yourself if you prefer.

As you work through these exercises, please record your answers, and when finished upload them along with the relevant source code to the appropriate spot on Canvas.

Make sure that the name of each person who worked on these exercises is listed in the first answer, and make sure you number each of your responses so it is easy to match your responses with each exercise.


Required Exercises

  1. As the answer to the first exercise, please list the names of the people who worked together on this studio.

  2. In this studio you will use cgroups to apply limits to the CPU usage of a process (or group of processes), and observe that usage. In particular, we will be using the cgroups v2 CPUSet controller to limit processes to a subset of the system's CPU cores, and the CPU controller to constrain and observe those processes' usage of time on those cores.

    Make sure, before proceeding, that you have enabled cgroups v2 on your Raspberry Pi according to the instructions in the previous studio.

    For this exercise, you will begin by using the cpu controller. First, navigate into the /sys/fs/cgroup directory and verify that the cpu controller is available by inspecting the contents of cgroup.controllers. Then, enable it for any children by adding it to the cgroup.subtree_control file. To do this, you will need a root shell, i.e. first issue either the sudo su or sudo bash command. From there, issue the command:

    echo "+cpu" > cgroup.subtree_control

    Now, inspect the contents of cgroup.subtree_control to verify that the controller has been added.

    Next, create a child control group, contained within the root group, which will monitor a CPU-intensive parallel task. To create the child, simply create a subdirectory within the root cgroup. Navigate into this subdirectory, and list its contents, verifying that (1) the cgroup.controllers file lists the cpu controller, and that (2) the cpu.stat file is present. As the answer to this exercise, please list the contents of the subdirectory, and show the contents of the cpu.stat file.
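
    For example, still in the root shell and from within /sys/fs/cgroup, the following sequence creates and inspects such a child group (the name studio9 is only an example; any name will do):

    mkdir studio9
    cd studio9
    ls
    cat cgroup.controllers
    cat cpu.stat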

  3. To practice using the CPU controller, you will use a program that generates heavy CPU usage on all available cores. Please download the parallel_dense_mm.c program. It takes a single command line argument, specifying the matrix size n. It creates two dense matrices of size n x n with randomly-generated values, then multiplies them, using OpenMP to run in parallel. Compile it against the OpenMP library using:

    gcc -Wall -o parallel_dense_mm parallel_dense_mm.c -fopenmp

    You are going to run this program in the CPU cgroup you created. To do so, write another program, called exec_time, that (1) prints its PID; (2) blocks on input from stdin; then, once it receives any input character, proceeds to exec the following command:

    time ./parallel_dense_mm 500

    Note that the time command you would normally type at the prompt is not a standalone utility, but a keyword built into the bash shell, and therefore cannot be executed via the exec family of functions. So, you will need to install a standalone time utility executable, which you can do with the command:

    sudo apt-get install time
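
    With the standalone time utility installed, one possible sketch of the exec_time program is shown below. This is only an illustration of the structure described above (print the PID, block on stdin, then exec); your implementation may differ, and error checking is minimal:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* (1) Print this process's PID so it can be written into cgroup.procs. */
        printf("%d\n", (int)getpid());
        fflush(stdout);

        /* (2) Block until any input character arrives on stdin. */
        getchar();

        /* (3) Replace this process with the timed matrix multiply. execlp()
           searches PATH, so it finds the standalone time utility installed above. */
        execlp("time", "time", "./parallel_dense_mm", "500", (char *)NULL);

        /* Only reached if the exec fails. */
        perror("execlp");
        return 1;
    }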

    Compile and run your program. After it prints its PID, but before pressing a key to proceed, add it to the CPU cgroup you created by writing its PID into the cgroup.procs file in that subdirectory.

    Then, allow the program to proceed. Once it completes, compare its output (i.e., the output of the time utility that measured the execution time of the matrix multiply program) to the values reported in the cpu.stat file within the cgroup. As the answer to this exercise, show the values reported by time and the contents of the cpu.stat file. Explain how these values compare. Remember, the values in cpu.stat are reported in microseconds.

  4. CPU contention can affect the execution time of programs, and interference from a process that incurs heavy CPU usage can slow down other processes running on the system. To observe this phenomenon, you will run two concurrent instances of the parallel matrix multiply, and see how the resulting contention affects the execution time of the instance you measure.

    Modify your exec_time program so that it can be passed a command-line argument specifying the size of the matrices, which it will then pass as an argument to the exec function, instead of using a hard-coded value. Compile the program.
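
    For example, the modified sketch might look like the following (again, just one possible approach):

    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        /* The matrix size now comes from the command line; fall back to "500". */
        const char *size = (argc > 1) ? argv[1] : "500";

        printf("%d\n", (int)getpid());
        fflush(stdout);
        getchar();

        execlp("time", "time", "./parallel_dense_mm", size, (char *)NULL);
        perror("execlp");
        return 1;
    }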

    Now, you will run two concurrent instances of your program: one which you will time, and the other which will cause CPU contention. Proceed as follows:

    1. In one terminal window, run (but do not yet provide it with input to make it proceed) one instance of your program with a large matrix size (e.g. 5000). This instance will run long enough that it is guaranteed to start before, and end after, the second instance of the program.
    2. In a second terminal window, run (but do not yet provide it with input to make it proceed) one instance of your program with the same matrix size as in the previous exercise (i.e. 500). You will measure the execution time of this instance.
    3. Press enter in the first terminal window to kick off the larger matrix multiply, then immediately press enter in the second terminal window to kick off the smaller instance that you will time.
    4. Once the second, smaller instance completes, use CTRL+C to terminate the first instance of the matrix multiply.

    As the answer to this exercise, please copy the terminal output that shows the execution time of the second, smaller instance of the matrix multiply program. How does this compare to the measured time from the previous exercise? What does this tell you about the effect of CPU contention?

  5. In addition to providing monitoring functionality, CPU cgroups can also control CPU access by applying weights (similar to nice values under the CFS scheduler). For this exercise, you will again create CPU contention by running two instances of your program, so that all cores have two CPU-heavy threads running concurrently. This time, however, you will add one instance to a CPU cgroup, and give it a higher weight.

    First, print the contents of the cpu.weight file in the CPU cgroup you created. As part of the answer to this exercise, write the reported value.

    Proceed as follows:

    1. In one terminal window, run (but do not yet provide it with input to make it proceed) one instance of your program with a large matrix size (e.g. 5000). This instance will run long enough that it is guaranteed to start before, and end after, the second instance of the program.
    2. In a second terminal window, run (but do not yet provide it with input to make it proceed) one instance of your program with the same matrix size as in the previous exercise (i.e. 500). You will measure the execution time of this instance.
    3. In a third terminal window, write the PID of the second instance of the program (i.e., the smaller instance that we will be timing) into the cgroup.procs file in your CPU cgroup.
    4. Then, you will apply a higher weight to the execution of this process. Write a larger value into the cpu.weight file than the currently configured value; recall that this value can range from 1 to 10000 (an example is shown after this list).
    5. Press enter in the first terminal window to kick off the larger matrix multiply, then immediately press enter in the second terminal window to kick off the smaller instance that you will time.
    6. Once the second, smaller instance completes, use CTRL+C to terminate the first instance of the matrix multiply.
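
    For example, assuming the default weight you printed above was 100, writing a much larger value into your cgroup's directory makes the effect easy to observe (10000, the maximum, is just one possible choice):

    echo 10000 > cpu.weight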

    As the answer to this exercise, please (1) report the original (default) value in your cgroup's cpu.weight file, (2) tell us what value you then set for that file, then (3) show the execution time of the second, smaller instance of the matrix multiply program. How does this compare to the measured time from the previous exercise? What does this tell you about how CPU cgroups can affect process access to CPU resources?

  6. CPU cgroups can also constrain CPU bandwidth for a process, or group of processes, running on contended processors or cores. For this exercise, you will use the same technique to create CPU contention, but this time you will constrain the bandwidth of one of the instances of the parallel matrix multiply program.

    First, reset the value of the cpu.weight controller file to its original, default value (you should have retrieved this in the previous exercise). Next, apply a bandwidth limit by writing into the cpu.max interface file. This file takes the format:

    MAX PERIOD

    Where MAX indicates the maximum total time (in microseconds) that the processes in the cgroup, combined, may execute during each PERIOD (also in microseconds) of elapsed wall-clock time. This restricts the CPU bandwidth of the processes in that cgroup to MAX/PERIOD.

    Use values that are sufficiently small that you will be able to see throttling behavior. For example, if your exec_time program measured an elapsed time of t seconds to run parallel_dense_mm with matrices of size 500x500, then use a MAX of t/5 seconds (converted to microseconds) and a PERIOD at least twice the MAX value.
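
    As a purely illustrative example: if t were 2 seconds, then t/5 = 0.4 seconds, i.e. a MAX of 400000 microseconds; choosing a PERIOD of 800000 microseconds would then be configured (from within your cgroup's directory) as:

    echo "400000 800000" > cpu.max

    This limits the cgroup as a whole to at most 400000/800000 = 50% of one CPU's worth of execution time in each period. Substitute values based on your own measured t.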

    Now, proceed to measure the execution time the same way you did in the previous exercise, running an instance of your program with large matrices, and a second instance with 500x500 matrices, which is added to the cgroup to constrain its bandwidth.

    As the answer to this exercise, please (1) tell us what values you set in the cpu.max file; (2) calculate the resulting bandwidth constraint, based on those numbers; and (3) show the execution time of the second, smaller instance of the matrix multiply program. How does this compare to the measured times from the previous exercises? What does this tell you about how the bandwidth constraint scales execution times?

  7. To look more closely at how the CPU cgroup enforces bandwidth constraints, you will use ftrace (short for Function Tracer) to trace the execution of parallel_dense_mm when it has a bandwidth constraint applied, then use KernelShark to view the results of the trace.

    First, on your Raspberry Pi, install these utilities with the command:

    sudo apt-get install trace-cmd kernelshark

    The function tracer is extraordinarily powerful, and a full exploration of its capabilities is beyond the scope of this studio. However, to generate a basic trace, which will allow you to inspect the scheduler's behavior when it applies a bandwidth constraint to a cgroup, you will use the command:

    sudo trace-cmd record -e sched_switch ./exec_time 500

    In particular, for this exercise, proceed as follows:

    1. In one terminal window, run (but do not yet provide it with input to make it proceed) one instance of your program with a large matrix size (e.g. 5000). This instance will run long enough that it is guaranteed to start before, and end after, the second instance of the program.
    2. In a second terminal window, run (but do not yet provide it with input to make it proceed) one instance of your program with the same matrix size as in the previous exercise (i.e. 500). This is the instance you will trace, so launch it with the command listed above.
    3. In a third terminal window, write the PID of the second instance of the program (i.e., the smaller instance that we will be timing) into the cgroup.procs file in your CPU cgroup.
    4. Press enter in the first terminal window to kick off the larger matrix multiply, then immediately press enter in the second terminal window to kick off the smaller instance that is being traced.
    5. Once the second, smaller instance completes, use CTRL+C to terminate the first instance of the matrix multiply.

    Now, use KernelShark to inspect your trace.

    Note: Running KernelShark, which is required for the following questions, requires a GUI. This means that if you have a headless setup for your Raspberry Pi, you will need to connect to it with a VNC viewer (as detailed in Exercise 4 of Studio 2), or use X11 forwarding via your ssh client.

    On Mac/Linux, this is done by simply passing '-X' to the ssh command line. On a Mac, you may also need to install an X11 server, such as XQuartz. Other ssh clients should have similarly straightforward configuration options for enabling X11 forwarding.

    On Windows, you may need to use a non-native ssh client, like PuTTY, in conjunction with an X11 server like Xming. To enable X11 forwarding in PuTTY, first ensure that Xming is running on your Windows computer. Then, in PuTTY, expand the Connection settings in the left sidebar, expand SSH, then click the X11 settings menu. Check the "Enable X11 forwarding" option. Now, you can click the Session menu in the left sidebar, and connect to your Pi via ssh.

    Open the trace file produced by the trace-cmd record command above, with the command:

    kernelshark trace.dat

    By default, you will be looking at a timeline for each CPU core in the system. Each process in the system will be given a unique color so you can track individual processes as they are scheduled on and off of processors as well as when they may be migrated between cores.

    Start by zooming in on the trace until you can make out discrete events. To zoom in: press and hold the left mouse button; drag the cursor to the right; and then release to define a zoom window. Zooming out is the reverse: press and hold the left mouse button; drag the cursor to the left; and then release the mouse button.

    We can also enable a process-centric view rather than a CPU-centric view. In the Kernelshark window, go to the Plots menu, select Tasks, and then find the process parallel_dense_mm and click on the check box to activate it. Scroll down or enlarge the viewing window until you see the timeline for that process at the bottom. This timeline only shows the activity of this one process, and different colors represent execution time on different processors (red boxes on this timeline represent time where this task was not scheduled on any processor).

    You can use the CPU and task timelines to see exactly how your process executed over its lifetime. If you zoom in to where you can see discrete events, you can mouse over those events to see exactly when each thread of the process was preempted.

    As the answer to this exercise, please discuss what the trace tells you about the behavior of the process when it is scheduled on contended CPUs, and how you observe the bandwidth limits being applied. Also, please take a screenshot of the complete trace, as well as a screenshot of a zoomed-in area that highlights the scheduler's behavior, and submit these with your answers.

  8. The remaining exercises of this studio are intended to introduce the CPUSet cgroup controller, and reinforce how cgroup delegation and namespace scoping can allocate resources (in this case, CPU cores) to a container, which can then apportion those resources among its processes.

    First, create a hierarchy of cgroups, and delegate ownership to the pi user. These steps are similar to those given in the previous studio, but you will also be enabling the cpuset controller:

    Open a root shell (e.g. with sudo su), then navigate to the /sys/fs/cgroup directory. In the root shell, issue the following commands:

    mkdir pi_containers (This creates a hierarchy of cgroups in which the pi user can launch containers.)

    chown pi:pi pi_containers (This delegates control of the hierarchy to the pi user, allowing it to create new children.)

    chown pi:pi pi_containers/cgroup.procs (This allows the pi user to move processes within the hierarchy.)

    chown pi:pi pi_containers/cgroup.subtree_control (This allows the pi user to enable controllers throughout the hierarchy. However, the root user still retains control over the controllers and interfaces at the root of the hierarchy, allowing the administrator to allocate resources to the pi user, which can then distribute those resources amongst its containers.)

    echo "+cpuset" > cgroup.subtree_control (This ensures that the CPUSet controller is available to the pi user's cgroup hierarchy.)

    Now, close the root shell. As the pi user (without sudo), navigate into the /sys/fs/cgroup/pi_containers directory. From there, first add the cpuset controller to cgroup.subtree_control so that the controller is available to containers created under this cgroup. Then, still from within /sys/fs/cgroup/pi_containers, create a new directory called default. Recall that processes may only be placed in leaf cgroups of the hierarchy (with the exception of the root cgroup); this directory allows processes to be moved into the pi_containers hierarchy without being explicitly assigned to an individual container cgroup.
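
    For example, as the pi user:

    cd /sys/fs/cgroup/pi_containers
    echo "+cpuset" > cgroup.subtree_control
    mkdir default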

    Print the PID of your (non-root) shell (echo $$), then launch a root shell. From the root shell, write that PID into the /sys/fs/cgroup/pi_containers/default/cgroup.procs file. (Root access is needed for this step because moving a process out of the root cgroup requires write access to cgroup.procs files outside the hierarchy delegated to the pi user.) Exit the root shell, then print the contents of default/cgroup.procs to confirm that the shell has, indeed, been added to the cgroup. Now that it is in the portion of the hierarchy controlled by the pi user, that user can move it (and its children) freely within the hierarchy.

    Still as the pi user, create a new directory under /sys/fs/cgroup/pi_containers (e.g., called container1) that will serve as the cgroup for your container. List the contents of this directory, and verify that the cpuset.cpus interface file is present. You will use this interface to restrict your container to a subset of the available CPUs. Write the value 2-3 into the file, which restricts your container to CPUs 2 and 3, then print its contents to verify that the write was successful.
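
    For example, from within /sys/fs/cgroup/pi_containers:

    echo "2-3" > container1/cpuset.cpus
    cat container1/cpuset.cpus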

    Now, run the cgroupns_child_exec program as in the previous studio, having it create new PID, user, and cgroup namespaces, and have it join the new cgroup you created, e.g. with the command:

    ./cgroupns_child_exec -pCU -M '0 1000 1' -G '0 1000 1' /sys/fs/cgroup/pi_containers/container1/cgroup.procs ./simple_init

    From the init prompt in your new container, launch /bin/bash. To allow your container to further apportion its resources (i.e. the CPU cores it has been allocated), it will need to create nested cgroups under pi_containers/container1. From the container's shell, create a new cgroup under /sys/fs/cgroup/pi_containers/container1. This cgroup will restrict any container processes that are moved into it to only execute on CPU 2. As such, in the following instructions, we assume you've called the directory cpu2.

    Remember: because processes may only be placed in leaf cgroups, before you can enable controllers in container1 and move processes into cpu2, the container's other processes must be moved into their own leaf cgroup under container1. So (still from your container's shell), create another cgroup under container1 called default. Then, print the contents of container1/cgroup.procs and write all PIDs present in that file into the container1/default/cgroup.procs file. Note that the last entry in container1/cgroup.procs is likely the PID of the process printing its contents (e.g. cat), which will have exited by the time you try to move it, so you may not be able to write that PID into the container1/default/cgroup.procs file.
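
    One way to do this from the container's shell, working in /sys/fs/cgroup/pi_containers/container1, is sketched below. The 0 entry is skipped, since it corresponds to a process outside the container's PID namespace, and the write for the PID of the already-exited cat process may fail with an error that you can ignore:

    mkdir default
    for p in $(cat cgroup.procs); do [ "$p" != "0" ] && echo "$p" > default/cgroup.procs; done
    cat cgroup.procs
    cat default/cgroup.procs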

    Once you have done this, print the contents of both container1/cgroup.procs and container1/default/cgroup.procs. You should see that there is still a process with PID 0 listed in container1/cgroup.procs that you cannot move out. This is the parent of the container's init process, i.e., the cgroupns_child_exec process. Because it is outside of the container's scope, you cannot move it from inside the container.

    To move it out of that cgroup, so that only leaf nodes of that cgroup contain processes, open another terminal window (as non-root, but outside of the container). From there, print the contents of /sys/fs/cgroup/pi_containers/container1/cgroup.procs. This will show you the PID of the cgroupns_child_exec process from the scope of the global PID namespace.

    From this terminal window, move that process into the /sys/fs/cgroup/pi_containers/default cgroup by writing its PID into that cgroup's cgroup.procs file.

    From the terminal running inside your container, navigate into /sys/fs/cgroup/pi_containers/container1. Print the contents of the following files:

    cgroup.procs
    default/cgroup.procs
    cpu2/cgroup.procs

    At this point, only the default cgroup (container1/default) should contain any processes.

    From the other terminal window outside the container, print the contents of the /sys/fs/cgroup/pi_containers/container1/default/cgroup.procs file. As part of the answer to this exercise, please show the contents of this file when printed from inside and outside the container. Explain why the PIDs listed do not match.

    Then, from inside the container, print the contents of /proc/self/cgroup. From outside the container, print the contents of /proc/PID/cgroup, where PID is one of the PIDs listed in cgroup.procs when printed from outside the container (in other words, the global namespace PID of one of the processes in the container). As the remainder of the answer to this exercise, please show the contents of both files, and explain how cgroup namespace scoping makes their contents differ. Make sure to keep your container's terminal window open, as you will continue to use it in the next exercise.

  9. Now that your cgroup hierarchy is configured with a subtree for your container, and all container processes are in a leaf node in that subtree, your container should be able to split its resources among child processes. For this exercise, you will have your container allocate only a single CPU core to the parallel matrix multiply program, then time its execution to confirm that the restriction works correctly.

    From your container, navigate into the /sys/fs/cgroup/pi_containers/container1 directory. From there, enable the CPUSet controller for child cgroups by writing +cpuset into cgroup.subtree_control. Then, list the contents of the cpu2 directory to verify that the cpuset.cpus interface file is now present. Write the value "2" into that file.
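
    For example, from within /sys/fs/cgroup/pi_containers/container1:

    echo "+cpuset" > cgroup.subtree_control
    ls cpu2
    echo "2" > cpu2/cpuset.cpus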

    Now, in your container's bash shell, you will launch a new shell by running /bin/bash. In other words, at this point, your container should be running the following hierarchy of processes:

    simple_init → /bin/bash → /bin/bash

    The nested bash shell will be used to constrain process execution to the cpu2 cgroup. Have the shell move itself into that cgroup:

    echo $$ > cpu2/cgroup.procs

    Then, print the contents of the cpu2/cgroup.procs file to verify that the move was successful.

    Next, navigate into the directory where you compiled your parallel_dense_mm program. Run it using the time utility you installed, on matrices of size 500x500, i.e.:

    /usr/bin/time ./parallel_dense_mm 500

    (If you get an error message stating that /usr/bin/time does not exist, run which time to see the path of the utility.)

    As the answer to this exercise, please show the output produced by the time utility. What do the values show you about the parallelism of the execution in this context? In other words, do the values confirm that the program was restricted to a single core, and if so, how?


  10. Things to Turn In:

    As described above, please upload a single document containing your numbered answers to the exercises, along with the source code you wrote for this studio (e.g., your exec_time program), to the appropriate spot on Canvas.


    Page updated Wednesday, February 9, 2022, by Marion Sudvarg and Chris Gill.