To be sure, many users would love more memory. On modern systems, however, the problem is not really one of sharing too little among too many, but of properly using and keeping track of the bounty.
—Robert Love, Linux System Programming, 2nd Edition, Chapter 9, pp. 293.
As one of the most important resources in a computer system, memory must be managed carefully so that it is used efficiently. Misuses of memory, whether intentional (e.g. malicious memory overallocation) or accidental (e.g. programs with significant memory leaks), can lead to unwanted system interference. Understanding how the Linux kernel provides mechanisms to constrain the memory use of a process, or group of processes, is important for minimizing interference, especially for modular or componentized systems, e.g. those using containers and Docker.
In this studio, you will:
Use the cgroups v2 memory controller to constrain the memory use of a group of processes
Use cgroups and its memory controller to observe out-of-memory events
Integrate cgroups into your simple container environment
Please complete the required exercises below. We encourage you to work in groups of 2 or 3 people on each studio (groups are allowed to change from studio to studio), though if you would prefer to complete any studio by yourself, that is allowed.
As you work through these exercises, please record your answers, and when finished upload them along with the relevant source code to the appropriate spot on Canvas.
Make sure that the name of each person who worked on these exercises is listed in the first answer, and make sure you number each of your responses so it is easy to match your responses with each exercise.
As the answer to the first exercise, please list the names of the people who worked together on this studio.
In this studio you will use a resource management feature of the Linux kernel, called cgroups, to apply limits on the amount of memory a process (or group of processes) can acquire.
The cgroups feature consists of several subsystems (or controllers), each of which is responsible for a particular resource type (such as CPUs, memory, I/O, or networks).
It provides a pseudofilesystem through which users can get and set parameters and limits associated with a subsystem.
In particular, we will be using the cgroups v2 memory controller.
The Raspberry Pi enables both cgroups v1 and v2 by default.
However, the Linux kernel does not allow both versions of the same controller to be active at once.
So, you will begin by configuring your Raspberry Pi's boot settings to (1) disable the cgroups v1 subsystem, and (2) enable the memory controller for use by cgroups v2.
The Raspberry Pi OS launches the systemd daemon during system startup.
This utility is responsible for configuring much of the kernel and userspace functionality of the Raspberry Pi, including mounting the appropriate cgroups pseudofilesystem(s).
Certain commands (boot parameters) placed in the /boot/cmdline.txt file change this boot-time behavior.
Note that commands in that file are separated by spaces and must all stay on one line.
To disable the cgroups v1 subsystem, add the following command to the end of the file:
cgroup_no_v1=all
Then, to enable memory cgroups, add the following commands:
cgroup_memory=1 cgroup_enable=memory
To apply these settings, reboot your Raspberry Pi.
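Before (or after) the reboot, it can be worth double-checking that the edit took. The following is only a minimal sketch: the function names has_boot_param and report_boot_params are hypothetical helpers, not part of any tool, and the sketch assumes the file holds a single line of space-separated parameters.

```shell
# has_boot_param FILE PARAM
# Succeeds if PARAM appears as a whole, space-separated word in FILE.
# FILE is assumed to hold a single-line command line, such as
# /boot/cmdline.txt (before the reboot) or /proc/cmdline (after it).
has_boot_param() {
    tr ' ' '\n' < "$1" | grep -qxF -- "$2"
}

# report_boot_params FILE
# Prints one status line for each parameter added in this studio.
report_boot_params() {
    for param in cgroup_no_v1=all cgroup_memory=1 cgroup_enable=memory; do
        if has_boot_param "$1" "$param"; then
            echo "present: $param"
        else
            echo "missing: $param"
        fi
    done
}
```

For example, after loading these functions into a shell, report_boot_params /boot/cmdline.txt checks the file you edited, and report_boot_params /proc/cmdline checks what the running kernel was actually booted with.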
After the reboot, you can check that only the cgroups v2 subsystem is mounted, and verify that the memory controller is enabled, by issuing the following two commands:
mount | grep cgroup
cat /proc/cgroups
Do so, then as the answer to this exercise (1) show the output of these two commands, (2) explain what each command does, and (3) indicate what the output tells you about which cgroups subsystem is mounted, and which controllers are enabled.
The cgroups pseudofilesystem is arranged hierarchically.
By default, all tasks in the system are included in the root cgroup, which in this case is located at /sys/fs/cgroup/.
The read-only cgroup.controllers file lists the controllers that are available to the cgroup. The cgroup.subtree_control file defines the list of controllers that are available to any children of the cgroup.
In other words, a child's cgroup.controllers file is a read-only copy of its parent's cgroup.subtree_control file.
For this studio, you will only be using the memory controller.
First, verify that it is available by inspecting the contents of cgroup.controllers.
Then, enable it for any children by adding it to the cgroup.subtree_control file.
To do this, you will have to run the terminal in root mode, i.e. first issue either the sudo su or sudo bash command.
Enabling a controller involves writing a "+", followed by the controller name, to the cgroup.subtree_control file, e.g.:
echo "+memory" > cgroup.subtree_control
Now, inspect the contents of cgroup.subtree_control, and try to remove any controllers listed (besides the memory controller you just added).
Note that you might not be able to remove some controllers (you will note any error messages as part of the answer to this exercise).
Removing a controller is similar to adding it, except you write a "-" followed by the controller name, e.g.:
echo "-cpu" > cgroup.subtree_control
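The check-then-enable sequence described above can be wrapped in a small helper. This is a sketch under stated assumptions: enable_controller is a hypothetical name, the cgroup directory is passed in as an argument, and on a real cgroup it must be run as root. On a real cgroup, reading cgroup.subtree_control back afterwards shows the controller name without the "+".

```shell
# enable_controller CGROUP_DIR NAME
# Writes "+NAME" to CGROUP_DIR/cgroup.subtree_control, but only if NAME is
# listed in CGROUP_DIR/cgroup.controllers (a space-separated list).
enable_controller() {
    dir=$1
    name=$2
    if tr ' ' '\n' < "$dir/cgroup.controllers" | grep -qxF -- "$name"; then
        echo "+$name" > "$dir/cgroup.subtree_control"
    else
        echo "controller '$name' is not available in $dir" >&2
        return 1
    fi
}
```

For example, enable_controller /sys/fs/cgroup memory performs the same write as the echo command shown earlier, after first confirming that the memory controller is available.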
For this exercise you will create a child control group, contained within the root group, which will monitor a task that we will write in the next exercise.
To create the child, simply create a subdirectory within the root cgroup.
Navigate into this subdirectory, and list its contents.
Then, try to remove the memory controller from its cgroup.controllers file, and note any error messages that appear.
As the answer to this exercise, please (1) write the contents of the cgroup.controllers and cgroup.subtree_control files in the root cgroup, (2) write the contents of those files in the child cgroup you created, (3) show the list of files in the child cgroup directory, (4) write any error messages printed when you tried to remove controllers from the root's cgroup.subtree_control, then (5) write any error messages printed when you tried to write to the child's cgroup.controllers file, and explain why you think you saw those errors.
Now, you will use the child cgroup you created to monitor the memory usage induced by a program you will write.
This concept is an important notion for cgroups: a program can induce more memory usage than its corresponding process, allowing it to bypass traditional resource limit mechanisms, e.g., by forking several child processes (a classic forkbomb attack).
However, because a child process will automatically be a part of its parent's cgroup, the constraints enforced by cgroup controllers are applied against the total resource usage induced by a program.
Write a program (outside of the cgroups filesystem!) that implements a forkbomb: in a loop, the program should request a significant amount of memory (at least a page) by calling malloc() without freeing it, delay for a short time (sufficiently long to allow for observability, e.g. a second), then fork() a child process.
Because the fork() is also in the loop body, your program will generate an exponentially increasing number of child processes.
Compile your program.
Open another terminal window, which you will use to run your program. Before doing so, you will write the PID of this terminal's shell into the child control group you created. Issue the following command to see its PID:
echo $$
Then, in your first terminal window (which should be running as root, i.e. with sudo su or sudo bash), write the PID into the cgroup.procs file in your child cgroup.
Then, print the contents of the cgroup.procs, the memory.current, and the memory.stat files.
To verify that the shell running in your second terminal window is in the new cgroup, you can print the contents of the file /proc/self/cgroup from that window.
Please do so, note the output, and check that the reported cgroup membership matches what you expect.
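If you want to pull out just the cgroups v2 path from that file, one possible sketch follows. The function name cgroup_v2_path is hypothetical; the sketch relies on the cgroups v2 entry in /proc/PID/cgroup having the form "0::/path".

```shell
# cgroup_v2_path FILE
# Prints the cgroups v2 path recorded in a /proc/PID/cgroup-style FILE.
# The v2 entry has the form "0::/path"; any other entries are skipped.
cgroup_v2_path() {
    sed -n 's/^0:://p' "$1"
}
```

Running cgroup_v2_path /proc/self/cgroup in the second terminal window prints the path of the cgroup that shell belongs to, relative to the cgroup root.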
Still in your second terminal window, run the forkbomb program, wait a couple of seconds, then print the contents of those three files again, before terminating your forkbomb.
After terminating the forkbomb, print the contents of the same three files.
As the answer to this exercise, please show the contents of the three files, i.e., cgroup.procs, memory.current, and memory.stat, before, during, and after the forkbomb program's execution.
Please explain the significance of the contents of the cgroup.procs and memory.current files, and how those contents changed. Also, please pick one statistic from the memory.stat file, explain how it changed through the lifetime of the program, and explain its significance.
Additionally, please show the contents of the /proc/self/cgroup file, when you inspected it from the second terminal window.
Now, in addition to using the memory controller to observe the forkbomb program, you will constrain its memory usage using the memory.max interface.
In your second terminal window, which is a member of your child cgroup, rewrite and recompile your forkbomb program so it no longer delays before each call to fork().
In the first terminal window, running as root, write a value into memory.max that is larger than the current value of memory.current, but close enough to it that the forkbomb will cause this limit to be exceeded relatively quickly.
Then, print the contents of memory.max to see the value that was stored.
This value may differ slightly from what you wrote into the file.
Note: If you need to reset the memory.max value, such that there is no longer an enforced maximum, you can do so by writing the value "max" into the file.
If you need to remove a cgroup, you can do so if the "populated" field of its cgroup.events file has a value of 0.
It can be removed by removing its directory using rmdir.
Next, run your forkbomb program in the second terminal window. Observe what happens, and, in the first terminal window, print the contents of the memory.events file.
As the answer to this exercise, please (1) tell us what value you wrote into memory.max, what value was subsequently printed from the contents of that file, and why you think those values might have been different.
Then (2) explain what happened to the forkbomb, show the contents of the memory.events file, and explain the significance of those contents.
The cgroups memory controller, in addition to providing a way to enforce a hard limit on memory usage, also supplies the memory.high interface.
This allows an administrator to define a memory usage threshold beyond which (1) a "high" event, in the memory.events file, is triggered, which subsequently (2) signals to the Linux kernel that it should begin aggressively reclaiming memory from the processes in the cgroup (though those processes will not be killed).
For this exercise, you will write a monitoring program that prints a notification when the cgroup exceeds its "high" memory threshold, or when the cgroup changes its "populated" state.
The monitoring program should do the following:
Take exactly two command-line arguments (and print a helpful usage message if more or fewer are given), which are the paths to the files named cgroup.events and memory.events within the child group, respectively.
The memory.events file contains counters for when the low, high, and max memory thresholds are crossed.
The cgroup.events file contains two binary values, which indicate whether the cgroup is populated (i.e. it, or its children, have member processes) and whether it is frozen (i.e. its member processes have been placed in a suspended state).
Attempt to open both files, read-only, using the open() system call.
If either file cannot be opened, it should print a helpful error message and exit.
Create an inotify instance, then subsequently add watches for both files, watching for IN_MODIFY events.
In a loop, watch for events on these files by reading from the file descriptor returned by the inotify_init() function.
Your program should associate the watch descriptors with the file descriptors returned by the opened files, so that, if either file has been changed, your program knows which is the corresponding opened file descriptor.
If the cgroup.events file has been changed, the monitor program should print a message indicating whether the cgroup is populated or not.
If the memory.events file has been changed, the monitor program should print the current value of its "high" field.
(NOTE: Remember that to perform subsequent reads of the entire contents of a file, you need to use lseek to set the file offset back to the beginning!)
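Before writing the monitor, it may help to look by hand at the fields it has to parse. The sketch below is a shell stand-in for that parsing step only (it does not use inotify); the function name event_value is hypothetical, and it assumes both files use one "key value" pair per line.

```shell
# event_value FILE KEY
# Prints the value stored next to KEY in a flat "key value" file such as
# cgroup.events or memory.events. Prints nothing if KEY is absent.
event_value() {
    awk -v key="$2" '$1 == key { print $2 }' "$1"
}
```

For example, from inside the child cgroup directory, event_value memory.events high prints the current "high" counter, and event_value cgroup.events populated prints 1 or 0.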
Compile your monitor program, and rewrite and recompile your forkbomb program so it again delays before each call to fork().
This time, have three terminal windows open: (1) running as root, (2) running your monitor program, and (3) a shell that will be added to the cgroup.
In (3), again print its PID to the terminal window with echo $$.
In (2), launch your monitor program, then in (1) write the PID of terminal window (3) into the cgroup.procs file.
Then, also in (1), print the contents of memory.current.
Write a value into memory.high that is larger than the current value of memory.current, but close enough to it that the forkbomb will cause this limit to be exceeded relatively quickly.
Then, print the contents of memory.high to see the value that was stored.
Again, this value may differ slightly from what you wrote into the file.
Now, in (3), launch your forkbomb. Allow it to run until you see your monitor program begin to show output. Let the monitor program print a few messages, then kill your forkbomb, and close terminal (3).
As the answer to this exercise, please say (1) the value you wrote into memory.high and the value it subsequently showed as being stored, and why you think those values might have been different.
Then (2) please show the output of the monitor program, and explain what the output tells you about the behavior of the forkbomb, and what happened when you closed the terminal.
For the final exercise, you will integrate cgroups into the simple container environment from the previous studio, using cgroup namespaces to define a delegation boundary, such that you can create your container without root privileges in its own cgroup, then allow it to create children of that cgroup and place child processes into those children.
First, you will need to ensure that cgroups v2 has been mounted with the nsdelegate option.
Issue the command:
mount | grep cgroup
If the list in parentheses does not include the "nsdelegate" option, you will need to remount cgroups with that option:
sudo mount -t cgroup2 -o remount,nsdelegate /sys/fs/cgroup
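The same check can be scripted against the kernel's mount table. This is a minimal sketch: has_nsdelegate is a hypothetical helper name, and it assumes the usual /proc/mounts layout in which the third field is the filesystem type and the fourth is the comma-separated option list.

```shell
# has_nsdelegate [MOUNTS_FILE]
# Succeeds if a cgroup2 mount listed in MOUNTS_FILE (default: /proc/mounts)
# carries the nsdelegate option.
has_nsdelegate() {
    awk '$3 == "cgroup2" {
             opts = "," $4 ","
             if (index(opts, ",nsdelegate,")) found = 1
         }
         END { exit !found }' "${1:-/proc/mounts}"
}
```

For example, has_nsdelegate && echo "nsdelegate is set" reports whether the remount above is still needed.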
Now, you'll create a hierarchy of cgroups, and delegate ownership to the pi user.
This will allow that user (without administrative privileges) to create individual cgroups to manage resource allocation for individual containers, launch containers in those cgroups, then allow those containers to further partition resource usage among the processes in their scope.
To begin with, open a root shell (e.g. with sudo su), then navigate to the /sys/fs/cgroup directory.
In the root shell, issue the following commands:
mkdir pi_containers
(This creates a hierarchy of cgroups in which the pi user can launch containers.)
chown pi:pi pi_containers
(This delegates control of the hierarchy to the pi user, allowing it to create new children.)
chown pi:pi pi_containers/cgroup.procs
(This allows the pi user to move processes within the hierarchy.)
chown pi:pi pi_containers/cgroup.subtree_control
(This allows the pi user to enable controllers throughout the hierarchy.
However, the root user still retains control over the controllers and interfaces at the root of the hierarchy, allowing the administrator to allocate resources to the pi user, which can then distribute those resources amongst its containers.)
Now, close the root shell. As the pi user (without sudo), navigate into the /sys/fs/cgroup/pi_containers directory.
In it, create a new directory, called default.
Recall that processes must only be in the leaf nodes of the cgroup hierarchy (with the exception of processes in the root cgroup); this directory allows processes to be moved into the pi_containers hierarchy without being explicitly assigned to an individual container cgroup.
Print the PID of the shell (echo $$), then launch a root shell.
Write that PID into the /sys/fs/cgroup/pi_containers/default/cgroup.procs file.
Exit the root shell, then print the contents of the default/cgroup.procs file to confirm that the shell has, indeed, been added to the cgroup.
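The confirmation step can also be expressed as a yes-or-no check rather than by reading the list by eye. In this sketch, pid_in_cgroup is a hypothetical helper name, and the cgroup.procs file is assumed to list one PID per line.

```shell
# pid_in_cgroup CGROUP_DIR PID
# Succeeds if PID is listed in CGROUP_DIR/cgroup.procs (one PID per line).
pid_in_cgroup() {
    grep -qxF -- "$2" "$1/cgroup.procs"
}
```

For example, pid_in_cgroup /sys/fs/cgroup/pi_containers/default $$ && echo "shell is in default" checks the shell you just moved.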
Now that it is in the portion of the hierarchy controlled by the pi user, that user can move it (and its children) freely within the hierarchy.
This means that, from within that same shell, you can launch a new container as the pi user, and use cgroup namespaces to constrain it to a cgroup within the hierarchy.
To do this, please download the cgroupns_child_exec.c program.
This program is similar to the userns_child_exec.c program from the previous studio, but with additional functionality to allow the cloned child process to join a new cgroup namespace.
In doing so, it takes an additional command-line argument specifying a cgroup.procs file, and writes its PID to this file, before cloning the child into its new namespace.
This has the effect of isolating the child's view of the cgroups hierarchy to the specified cgroup.
This means that, even if the child runs as the pi user (which has permissions to a broader subtree of the hierarchy), it cannot move itself out of the portion of the hierarchy into which it has been placed.
Please take a look at the joinCgroup function, and the call to that function from main, to understand how this functionality was programmed.
Compile the program, then (as the pi user), create a new directory under /sys/fs/cgroup/pi_containers (e.g., called container1) that will serve as the cgroup for your container.
Now, run the program, creating new PID and user namespaces as before, as well as a new cgroup namespace, and have it join the new cgroup you created, e.g. with the command:
./cgroupns_child_exec -pCU -M '0 1000 1' -G '0 1000 1' /sys/fs/cgroup/pi_containers/container1/cgroup.procs ./simple_init
(Make sure that simple_init is in the same directory as cgroupns_child_exec!)
Congratulations! You have created a new container in its own cgroup.
From the init prompt, launch /bin/bash.
Open another terminal window. You will use both terminal windows (inside and outside the container)
to compare the scoped view of the container with the system configuration that has been applied to it.
From both terminal windows, print the contents of the /sys/fs/cgroup/pi_containers/container1/cgroup.procs file.
As part of the answer to this exercise, please show the contents, and explain why you think they are different.
Then, from your container, print the contents of the /proc/self/cgroup file.
From the other terminal window, print the contents of the /proc/PID/cgroup file, where PID is the last PID that was listed from this terminal window (i.e. outside the container) in the /sys/fs/cgroup/pi_containers/container1/cgroup.procs file.
As part of the answer to this exercise, please show the contents, describe how the two files are related, and explain why you think the contents are different.
Now, from your container, you will try to move a process back out of the container1 cgroup, and into the default cgroup.
Because both directories are in the same subtree of the cgroup hierarchy that is controlled by the pi user, and because the container is running as the pi user, cgroup delegation rules would suggest that this should be possible.
However, the scoping provided by cgroup namespaces should prevent this from happening.
To see this behavior, issue the following command:
echo $$ > /sys/fs/cgroup/pi_containers/default/cgroup.procs
As the remainder of the answer to this exercise, please describe what happened, and explain how the isolation provided by cgroup namespaces impacted this behavior.
Page updated Monday, February 7, 2022, by Marion Sudvarg and Chris Gill.