CSE 522S - Advanced Operating Systems

CSE 522S: Studio 8

Observing Memory Events

To be sure, many users would love more memory. On modern systems, however, the problem is not really one of sharing too little among too many, but of properly using and keeping track of the bounty.

—Robert Love, Linux System Programming, 2nd Edition, Chapter 9, pp. 293.

As one of the most important resources in computer systems, memory must be managed carefully to efficiently utilize the resource. Misuses of memory, whether intentional (e.g. malicious memory overallocation) or accidental (e.g. programs with significant memory leaks) can lead to unwanted system interference. Understanding how the Linux kernel provides mechanisms to constrain the memory use of a process, or group of processes, is important for minimizing interference, especially for modular or componentized systems, e.g. those using containers and Docker.

In this studio, you will:

Use the cgroups v2 memory controller to constrain the memory use of a group of processes.
Monitor the statistics and event files provided by cgroups and its memory controller to observe out-of-memory events.
Integrate cgroups into your simple container environment.

Please complete the required exercises below. We encourage you to work in groups of 2 or 3 people on each studio (groups may change from studio to studio), though you may complete any studio by yourself if you prefer.

As you work through these exercises, please record your answers, and when finished upload them along with the relevant source code to the appropriate spot on Canvas.

Make sure that the name of each person who worked on these exercises is listed in the first answer, and make sure you number each of your responses so it is easy to match your responses with each exercise.

Required Exercises

As the answer to the first exercise, please list the names of the people who worked together on this studio.
In this studio you will use a resource management feature of the Linux kernel, called cgroups, to apply limits on the amount of memory a process (or group of processes) can acquire. The cgroups feature consists of several subsystems (or controllers), each of which is responsible for a particular resource type (such as CPUs, memory, I/O, or networks). It provides a pseudofilesystem through which users can get and set parameters and limits associated with a subsystem. In particular, we will be using the cgroups v2 memory controller.

The Raspberry Pi enables both cgroups v1 and v2 by default. However, the Linux kernel does not allow both versions of the same controller to be active at the same time. So, you will begin by configuring your Raspberry Pi's boot settings to (1) disable the cgroups v1 subsystem, and (2) enable the memory controller for use by cgroups v2.

The Raspberry Pi OS launches the systemd daemon during system startup. This utility is responsible for configuring much of the kernel and userspace functionality of the Raspberry Pi, including mounting the appropriate cgroups pseudofilesystem(s). Boot-time behavior can be changed by adding parameters to the kernel command line, which the Raspberry Pi reads from the /boot/cmdline.txt file. Note that parameters in that file are separated by spaces and must all appear on a single line.

To disable the cgroups v1 subsystem, add the following parameter to the end of the line in that file:

cgroup_no_v1=all

Then, to enable memory cgroups, add the following parameters:

cgroup_memory=1 cgroup_enable=memory

To apply these settings, reboot your Raspberry Pi.

After the reboot, you can check that only the cgroups v2 subsystem is mounted, and verify that the memory controller is enabled, by issuing the following two commands:

mount | grep cgroup

cat /proc/cgroups

Do so, then as the answer to this exercise (1) show the output of these two commands, (2) explain what each command does, and (3) indicate what the output tells you about which cgroups subsystem is mounted, and which controllers are enabled.
The cgroups pseudofilesystem is arranged hierarchically. By default, all tasks in the system are included in the root cgroup, which in this case is located at /sys/fs/cgroup/.

The read-only cgroup.controllers file lists the controllers that are available to the cgroup. The cgroup.subtree_control file defines the list of controllers that are available to any children of the cgroup. In other words, a child's cgroup.controllers file is a read-only copy of its parent's cgroup.subtree_control file.
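Since cgroup.controllers holds a space-separated list of controller names, checking for a particular controller amounts to a whole-token search. The following is only a minimal sketch of that idea; the function name has_controller is hypothetical, and the caller is assumed to have already read the file's text into a string.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if `name` appears as a whole whitespace-separated token in
 * `list` (text read from a cgroup.controllers file), 0 otherwise.
 * has_controller is a hypothetical helper, not part of any library. */
int has_controller(const char *list, const char *name)
{
    assert(list != NULL && name != NULL);
    size_t len = strlen(name);
    if (len == 0)
        return 0;
    const char *p = list;
    while (*p != '\0') {
        p += strspn(p, " \t\n");            /* skip separators */
        size_t tok = strcspn(p, " \t\n");   /* length of this token */
        if (tok == len && strncmp(p, name, len) == 0)
            return 1;
        p += tok;
    }
    return 0;
}
```

Matching whole tokens matters here: a plain substring search for "cpu" would also match "cpuset".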

For this studio, you will only be using the memory controller. First, verify that it is available by inspecting the contents of cgroup.controllers. Then, enable it for any children by adding it to the cgroup.subtree_control file. To do this, you will have to run the terminal in root mode, i.e. first issue either the sudo su or sudo bash command.

Enabling a controller involves writing a "+", followed by the controller name, to the cgroup.subtree_control file, e.g.:

echo "+memory" > cgroup.subtree_control

Now, inspect the contents of cgroup.subtree_control, and try to remove any controllers listed (besides the memory controller you just added). Note that you might not be able to remove some controllers (you will note any error messages as part of the answer to this exercise). Removing a controller is similar to adding it, except you write a "-" followed by the controller name, e.g.:

echo "-cpu" > cgroup.subtree_control

For this exercise you will create a child control group, contained within the root group, which will monitor a task that we will write in the next exercise. To create the child, simply create a subdirectory within the root cgroup. Navigate into this subdirectory, and list its contents. Then, try to remove the memory controller from its cgroup.controllers file, and note any error messages that appear.

As the answer to this exercise, please (1) write the contents of the cgroup.controllers and cgroup.subtree_control files in the root cgroup, (2) write the contents of those files in the child cgroup you created, (3) show the list of files in the child cgroup directory, (4) write any error messages printed when you tried to remove controllers from the root's cgroup.subtree_control, then (5) write any error messages printed when you tried to write to the child's cgroup.controllers file, and explain why you think you saw those errors.
Now, you will use the child cgroup you created to monitor the memory usage induced by a program you will write. This concept is an important notion for cgroups: a program can induce more memory usage than its corresponding process, allowing it to bypass traditional resource limit mechanisms, e.g., by forking several child processes (a classic forkbomb attack). However, because a child process will automatically be a part of its parent's cgroup, the constraints enforced by cgroup controllers are applied against the total resource usage induced by a program.

Write a program (outside of the cgroups filesystem!) that implements a forkbomb. In a loop, the program should request a significant amount of memory (at least a page) by calling malloc() without freeing it, delay for a short time (long enough to allow for observation, e.g. a second), then fork() a child process. Because the fork() is also in the loop body, your program will generate an exponentially increasing number of child processes. Compile your program.
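To get a feel for how quickly the process count grows, the doubling can be worked out with a little arithmetic. The sketch below is not the forkbomb itself. It assumes an idealized model in which every fork() succeeds and no process exits, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of live processes after `rounds` loop iterations, assuming every
 * fork() succeeds and no process exits: each round doubles the count. */
uint64_t processes_after(unsigned rounds)
{
    assert(rounds < 64);    /* keep the shift well defined */
    return (uint64_t)1 << rounds;
}

/* Total malloc() calls made across all processes over those rounds:
 * 1 + 2 + 4 + ... + 2^(rounds-1) = 2^rounds - 1. */
uint64_t malloc_calls_after(unsigned rounds)
{
    return processes_after(rounds) - 1;
}
```

Under this model, a one-second delay per iteration would give roughly a thousand processes after ten seconds, which is why the delay should be long enough to observe the growth but the program should not be left running for long.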

Open another terminal window, which you will use to run your program. Before doing so, you will write the PID of this terminal into the child control group you created. Issue the following command to see its PID:

echo $$

Then, in your first terminal window (which should be running as root, i.e. with sudo su or sudo bash) write the PID into the cgroup.procs file in your child cgroup. Then, print the contents of the cgroup.procs, the memory.current, and the memory.stat files.

To verify that the shell running in your second terminal window is in the new cgroup, you can print the contents of the file /proc/self/cgroup. Please do so, note the output, and check that the reported cgroup membership matches what you expect.
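On a system using only cgroups v2, /proc/self/cgroup contains a single entry of the form "0::" followed by the cgroup's path relative to the cgroup root. As a rough illustration of how a program might extract that path, here is a minimal sketch; the function name cgroup_v2_path and the path "/child" used in the usage note are assumptions for illustration only.

```c
#include <assert.h>
#include <string.h>

/* Copy the cgroup path from a v2 entry of the form "0::/some/path\n"
 * into `out`. Returns 0 on success, or -1 if the line is not a v2 entry
 * or the path does not fit in `cap` bytes. Hypothetical helper. */
int cgroup_v2_path(const char *line, char *out, size_t cap)
{
    assert(line != NULL && out != NULL);
    static const char prefix[] = "0::";
    size_t plen = sizeof(prefix) - 1;
    if (strncmp(line, prefix, plen) != 0)
        return -1;
    const char *path = line + plen;
    size_t len = strcspn(path, "\n");   /* stop at the end of the line */
    if (len + 1 > cap)
        return -1;
    memcpy(out, path, len);
    out[len] = '\0';
    return 0;
}
```

For a shell that has been moved into a child cgroup named child, the line "0::/child" would yield the path "/child", while a v1-style entry with a non-empty controller list would be rejected.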

Still in your second terminal window, run the forkbomb program, wait a couple of seconds, then print the contents of those three files again, before terminating your forkbomb.

After terminating the forkbomb, print the contents of the same three files.

As the answer to this exercise, please show the contents of the three files, i.e., cgroup.procs, memory.current, and memory.stat, before, during, and after the forkbomb program's execution. Please explain the significance of the contents of the cgroup.procs and memory.current files, and how those contents changed. Also, please pick one statistic from the memory.stat file, explain how it changed through the lifetime of the program, and explain its significance. Additionally, please show the contents of the /proc/self/cgroup file as you saw them when you inspected it from the second terminal window.
Now, in addition to using the memory controller to observe the forkbomb program, you will constrain its memory usage using the memory.max interface.

In your second terminal window, which is a member of your child cgroup, rewrite and recompile your forkbomb program so it no longer delays before each call to fork().

In the first terminal window, running as root, write a value into memory.max that is larger than the current value of memory.current, but small enough that the forkbomb will cause this limit to be exceeded relatively quickly. Then, print the contents of memory.max to see the value it actually stores. This value may differ slightly from what you wrote into the file.

Note: If you need to reset the memory.max value, so that there is no longer an enforced maximum, you can do so by writing the value "max" into the file. If you need to remove a cgroup, you can do so once the "populated" field of its cgroup.events file has the value 0, by removing its directory with rmdir.

Next, run your forkbomb program in the second terminal window. Observe what happens, and, in the first terminal window, print the contents of the memory.events file.

As the answer to this exercise, please (1) tell us what value you wrote into memory.max, what value was subsequently printed from the contents of that file, and why you think those values might have been different. Then (2) explain what happened to the forkbomb, show the contents of the memory.events file, and explain the significance of those contents.
The cgroups memory controller, in addition to providing a way to enforce a hard limit on memory usage, also supplies the memory.high interface. This allows an administrator to define a memory usage threshold beyond which (1) a "high" event is recorded in the memory.events file, and (2) the Linux kernel begins aggressively reclaiming memory from the processes in the cgroup (though those processes will not be killed).

For this exercise, you will write a monitoring program that prints a notification when the cgroup exceeds its "high" memory threshold, or when the cgroup changes its "populated" state. The monitoring program should do the following:
1. Take exactly two command-line arguments (and print a helpful usage message if more or fewer are given) which are the paths to the files named cgroup.events and memory.events within the child group, respectively.
  
  The memory.events file contains counters for when the low, high, and max memory thresholds are crossed.
  
  The cgroup.events file contains two binary values, which indicate whether the cgroup is populated (i.e. it, or its children, have member processes) and whether it is frozen (i.e. its member processes have been placed in a suspended state).
2. Attempt to open both files, read-only, using the open() system call. If either file cannot be opened, it should print a helpful error message and exit.
3. Create an inotify instance, then subsequently add watches for both files, watching for IN_MODIFY events.
4. In a loop, watch for events on these files by reading from the file descriptor returned by the inotify_init() function.
  
  Your program should associate the watch descriptors with the file descriptors returned by the opened files, so that, if either file has been changed, your program knows which is the corresponding opened file descriptor.
5. If the cgroup.events file has been changed, the monitor program should print a message indicating whether the cgroup is populated or not.
6. If the memory.events file has been changed, the monitor program should print the current value of its "high" field.
(NOTE: Remember that to perform subsequent reads of the entire contents of a file, you need to use lseek to set the file offset back to the beginning!)
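The rewind-and-reread step from the note above, along with picking one field out of the "key value" lines found in both event files, might look something like the following. This is only a sketch of those two pieces, not a complete monitor: the names read_from_start and find_field are hypothetical, and it assumes the file is small enough to be returned by a single read() call.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Re-read a file from its beginning through an already-open descriptor.
 * Returns the number of bytes stored in `buf` (NUL-terminated), or -1.
 * Assumes one read() returns the whole (small) file. */
ssize_t read_from_start(int fd, char *buf, size_t cap)
{
    assert(buf != NULL && cap > 0);
    if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
        return -1;
    ssize_t n = read(fd, buf, cap - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return n;
}

/* Find a line of the form "<key> <number>" in `text` and store the
 * number in `*value`. Returns 0 if the key was found, -1 otherwise. */
int find_field(const char *text, const char *key, long long *value)
{
    size_t klen = strlen(key);
    const char *line = text;
    while (line != NULL && *line != '\0') {
        if (strncmp(line, key, klen) == 0 && line[klen] == ' ' &&
            sscanf(line + klen + 1, "%lld", value) == 1)
            return 0;
        line = strchr(line, '\n');
        if (line != NULL)
            line++;     /* step past the newline to the next line */
    }
    return -1;
}
```

A monitor could call read_from_start() on the descriptor that matches the watch descriptor in each inotify event, then call find_field() with "high" or "populated". Requiring a space right after the key keeps a lookup for "oom" from matching an "oom_kill" line.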

Compile your monitor program, and rewrite and recompile your forkbomb program so it again delays before each call to fork().

This time, have three terminal windows open: (1) one running as root, (2) one running your monitor program, and (3) one whose shell will be added to the cgroup.

In (3), again print its PID to the terminal window with echo $$

In (2), launch your monitor program, then in (1) write the PID of terminal window (3) into the cgroup.procs file. Then, also in (1), print the contents of memory.current. Write a value into memory.high that is larger than the current value of memory.current, but small enough that the forkbomb will cause this threshold to be exceeded relatively quickly. Then, print the contents of memory.high to see the value it actually stores. Again, this value may differ slightly from what you wrote into the file.

Now, in (3), launch your forkbomb. Allow it to run until you see your monitor program begin to show output. Let the monitor program print a few messages, then kill your forkbomb, and close terminal (3).

As the answer to this exercise, please say (1) the value you wrote into memory.high and the value it subsequently showed as being stored, and why you think those values might have been different. Then (2) please show the output of the monitor program, and explain what the output tells you about the behavior of the forkbomb, and what happened when you closed the terminal.
For the final exercise, you will integrate cgroups into the simple container environment from the previous studio, using cgroups namespaces to define a delegation boundary, such that you can create your container without root privileges in its own cgroup, then allow it to create children of that cgroup and place child processes into those children.

First, you will need to ensure that cgroups v2 has been mounted with the nsdelegate option. Issue the command:

mount | grep cgroup

If the list in parentheses does not include the "nsdelegate" option, you will need to remount cgroups with that option:

sudo mount -t cgroup2 -o remount,nsdelegate /sys/fs/cgroup

Now, you'll create a hierarchy of cgroups, and delegate ownership to the pi user. This will allow that user (without administrative privileges) to create individual cgroups to manage resource allocation for individual containers, launch containers in those cgroups, then allow those containers to further partition resource usage among the processes in their scope.

To begin with, open a root shell (e.g. with sudo su), then navigate to the /sys/fs/cgroup directory. In the root shell, issue the following commands:

mkdir pi_containers (This creates a hierarchy of cgroups in which the pi user can launch containers.)

chown pi:pi pi_containers (This delegates control of the hierarchy to the pi user, allowing it to create new children.)

chown pi:pi pi_containers/cgroup.procs (This allows the pi user to move processes within the hierarchy.)

chown pi:pi pi_containers/cgroup.subtree_control (This allows the pi user to enable controllers throughout the hierarchy. However, the root user still retains control over the controllers and interfaces at the root of the hierarchy, allowing the administrator to allocate resources to the pi user, which can then distribute those resources amongst its containers.)

Now, close the root shell. As the pi user (without sudo), navigate into the /sys/fs/cgroup/pi_containers directory. In it, create a new directory, called default. Recall that processes may only be placed in the leaf nodes of the cgroup hierarchy (with the exception of processes in the root cgroup); this directory allows processes to be moved into the pi_containers hierarchy without being explicitly assigned to an individual container cgroup.

Print the PID of the shell (echo $$), then launch a root shell. Write that PID into the /sys/fs/cgroup/pi_containers/default/cgroup.procs file. Exit the root shell, then print the contents of default/cgroup.procs to confirm that the shell has, indeed, been added to the cgroup. Now that it is in the portion of the hierarchy controlled by the pi user, that user can move it (and its children) freely within the hierarchy.

This means that, from within that same shell, you can launch a new container as the pi user, and use cgroup namespaces to constrain it to a cgroup within the hierarchy. To do this, please download the cgroupns_child_exec.c program.

This program is similar to the userns_child_exec.c program from the previous studio, but with additional functionality to allow the cloned child process to join a new cgroup namespace. In doing so, it takes an additional command-line argument specifying a cgroup.procs file, and writes its PID to this file, before cloning the child into its new namespace. This has the effect of isolating the child's view of the cgroups hierarchy to the specified cgroup. This means that, even if the child runs as the pi user (which has permissions to a broader subtree of the hierarchy), it cannot move itself out of the portion of the hierarchy into which it has been placed. Please take a look at the joinCgroup function, and the call to that function from main, to understand how this functionality was programmed.
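Joining a cgroup from within a program comes down to writing a PID, as decimal text, into the target cgroup.procs file. The sketch below illustrates that one step under stated assumptions; it is not the joinCgroup function from cgroupns_child_exec.c, and the name write_pid is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write `pid` as decimal text to the existing file at `procs_path`
 * (for example a cgroup.procs file). Returns 0 on success, -1 on
 * failure. Hypothetical helper, not the course's joinCgroup. */
int write_pid(const char *procs_path, pid_t pid)
{
    assert(procs_path != NULL);
    char text[32];
    int len = snprintf(text, sizeof text, "%ld\n", (long)pid);
    if (len < 0 || (size_t)len >= sizeof text)
        return -1;
    int fd = open(procs_path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t written = write(fd, text, (size_t)len);
    int closed = close(fd);
    return (written == (ssize_t)len && closed == 0) ? 0 : -1;
}
```

A process could pass getpid() to move itself before calling clone(). When the target is a real cgroup.procs file, the write can fail for reasons beyond ordinary file permissions, so checking the return value is worthwhile.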

Compile the program, then (as the pi user), create a new directory under /sys/fs/cgroup/pi_containers (e.g., called container1) that will serve as the cgroup for your container.

Now, run the program, creating new PID and user namespaces as before, as well as a new cgroup namespace, and have it join the new cgroup you created, e.g. with the command:

./cgroupns_child_exec -pCU -M '0 1000 1' -G '0 1000 1' /sys/fs/cgroup/pi_containers/container1/cgroup.procs ./simple_init

(Make sure that simple_init is in the same directory as cgroupns_child_exec!)

Congratulations! You have created a new container in its own cgroup.

From the init prompt, launch /bin/bash. Open another terminal window. You will use both terminal windows (inside and outside the container) to compare the scoped view of the container with the system configuration that has been applied to it.

From both terminal windows, print the contents of the /sys/fs/cgroup/pi_containers/container1/cgroup.procs file. As part of the answer to this exercise, please show the contents, and explain why you think they are different.

Then, from your container, print the contents of the /proc/self/cgroup file. From the other terminal window, print the contents of the /proc/PID/cgroup file, where PID is the last PID that was listed from this terminal window (i.e. outside the container) in the /sys/fs/cgroup/pi_containers/container1/cgroup.procs file. As part of the answer to this exercise, please show the contents, describe how the two files are related, and explain why you think the contents are different.

Now, from your container, you will try to move a process back out of the container1 cgroup, and into the default cgroup. Because both directories are in the same subtree of the cgroup hierarchy that is controlled by the pi user, and because the container is running as the pi user, cgroup delegation rules would suggest that this should be possible. However, the scoping provided by cgroup namespaces should prevent this from happening. To see this behavior, issue the following command:

echo $$ > /sys/fs/cgroup/pi_containers/default/cgroup.procs

As the remainder of the answer to this exercise, please describe what happened, and explain how the isolation provided by cgroup namespaces impacted this behavior.

Things to Turn In:

Your answers to the above exercises.
Your code for the forkbomb program.
Your code for the monitoring program.

Page updated Monday, February 7, 2022, by Marion Sudvarg and Chris Gill.