Buffering writes provides a significant performance improvement, and consequently, any operating system even halfway deserving the mark "modern" implements delayed writes via buffers. … The Linux kernel, like any modern operating system kernel, implements a complex layer of caching, buffering, and I/O management between devices and applications. … The page cache is an in-memory store of recently accessed data from an on-disk filesystem. Disk access is painfully slow, particularly relative to today's processor speeds. Storing requested data in memory allows the kernel to fulfill subsequent requests for the same data from memory, avoiding repeated disk access. The page cache exploits the concept of temporal locality, a type of locality of reference, which says that a resource accessed at one point has a high probability of being accessed again in the near future. The memory consumed to cache data on its first access therefore pays off, as it prevents future expensive disk accesses.
—Robert Love, Linux System Programming, 2nd Edition
Disk access is orders of magnitude slower than access to main memory (RAM),
which itself induces significant latency compared to CPU cache and registers.
As such, disk I/O bandwidth is a resource that must be shared among processes.
We have already explored cgroups
as a monitoring and constraint mechanism
by which memory and CPU resources can be allocated among groups of processes;
similarly, cgroups
also provide an IO controller.
However, not all reads from, and writes to, files on disk
immediately trigger I/O activity.
The Linux kernel utilizes the page cache,
a set of data structures that allow information on disk
to be stored in memory for faster access.
This comes at a cost, however: there is a tradeoff between optimizing I/O and providing sufficient memory resources to other parts of the kernel and to processes running on the system.
As such, pages in the cache are occasionally evicted,
and, further, when pages are dirtied (i.e., when a process writes to a file),
they must occasionally be written back to disk to maintain consistency.
In this studio, you will:
- use the cgroups v2 IO controller to apply limits to, and observe, the I/O usage of processes;
- explore how the page cache and different file access patterns affect I/O behavior, and advise the kernel of intended access patterns with posix_fadvise;
- observe page cache writeback behavior; and
- trace I/O activity with ftrace and KernelShark.
Please complete the required exercises below, as well as any optional enrichment exercises that you wish to complete. We encourage you to work in groups of 2 or 3 people on each studio (groups may change from studio to studio), though you may complete any studio by yourself if you prefer.
As you work through these exercises, please record your answers, and when finished upload them along with the relevant source code to the appropriate spot on Canvas.
Make sure that the name of each person who worked on these exercises is listed in the first answer, and number each of your responses so it is easy to match them with the corresponding exercise.
As the answer to the first exercise, please list the names of the people who worked together on this studio.
In this studio you will use the cgroups v2
IO controller
to apply limits to the I/O usage of a process (or group of processes),
and observe that usage.
Make sure, before proceeding, that you have enabled
cgroups v2
on your Raspberry Pi
according to the instructions in the
earlier studio.
For this exercise, you will set up a cgroup
with the io
controller.
First, navigate into the /sys/fs/cgroup
directory
and verify that the controller is available by inspecting the contents of cgroup.controllers.
Then, enable it for any children by adding it to the cgroup.subtree_control
file.
To do this, you will have to run the terminal in root mode, i.e. first issue either the
sudo su
or sudo bash
command.
From there, issue the command:
echo "+io" > cgroup.subtree_control
Now, inspect the contents of cgroup.subtree_control
to verify that the controller has been added.
Next, create a child control group, contained within the root group,
which will monitor several IO-intensive tasks.
To create the child, simply create a subdirectory within the root cgroup.
Navigate into this subdirectory, and list its contents,
verifying that (1) the cgroup.controllers
file lists the io
controller,
and that (2) the io.stat
file is present.
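For example, from a root shell in /sys/fs/cgroup (the directory name io_studio below is just an example; any name will do):
mkdir io_studio
cd io_studio
ls
cat cgroup.controllers
cat io.stat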
As the answer to this exercise, please list the contents of the subdirectory,
and show the contents of the io.stat
file.
To practice using the IO controller,
you will use a program that generates heavy I/O activity
with different access patterns.
You will first create a large file from which your program will read,
using the dd
utility.
This command-line tool can copy and create files,
including copying from special device node files.
This will enable you to create a file filled with 0-bits by copying from /dev/zero.
For more information, see the online man page
or issue the command:
man 1 dd
First, issue the df -h
command
to verify that your storage device has sufficient space.
Look at the available space listed for the filesystem mounted on "/".
If it shows that there is at least 5G of available space,
you will create a file consisting of 256MB of 0's.
If you have less than 5G of empty space, then you are likely using a MicroSD
with less than 16GB capacity; if so, you should use a larger one!
To create the file, issue the following command
(from a directory outside of the cgroups
hierarchy,
e.g. from within your home directory):
dd if=/dev/zero of=<filename> bs=1M count=256
Linux may leverage its page cache to keep this file in memory,
and avoid actually writing it back to the MicroSD card until it thinks it needs to do so.
To flush it to disk, run the sync
utility:
sudo sync
Even after the file is written to disk, it may still be maintained in the page cache,
to allow fast access from processes that may need to read from, or write to it.
Because the purpose of this exercise is to examine I/O behavior,
we want to guarantee that subsequent reads from the file have to go directly to disk.
The /proc pseudo-filesystem's sys subsystem provides access to virtual memory settings through several interfaces in the /proc/sys/vm directory.
You can flush the page cache by opening a root shell (sudo su) and then issuing the command:
echo 1 > /proc/sys/vm/drop_caches
Now, you will use a program, read_access.c,
to access this file using several reads,
which can either be in sequence, or from random locations in the file.
The program takes two command line arguments:
the first specifies the path to the file,
and the second specifies the access behavior.
It defines a hardcoded value for the size of the file in bytes,
and allocates a page-sized buffer (i.e. 4096 bytes) that it will read into from the file.
It first (1) attempts to open the file for reading (and exits with a descriptive error message if it can't),
(2) uses the second command line argument to determine if it should use SEQUENTIAL or RANDOM access patterns,
then (3) prints its PID to stdout, before (4) blocking on input from stdin
so that it can be placed into a cgroup
before proceeding.
Then, once it receives input, it proceeds to access the file as follows.
It runs a loop; in each iteration it attempts to read a page-sized number of bytes from the file into the buffer. It continues for 10,000 iterations, reading about 40 MB in total. If the specified access pattern is RANDOM, then before each read it seeks to a random location in the file.
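For reference, the core of such a program might look like the following minimal sketch. This is not the actual read_access.c source; the hardcoded sizes, iteration count, and error handling here are illustrative.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define FILE_SIZE (256L * 1024 * 1024) /* assumed hardcoded size of the test file, in bytes */
#define BUF_SIZE 4096                  /* one page */
#define ITERATIONS 10000

int main(int argc, char *argv[])
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <file> SEQUENTIAL|RANDOM\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    int random_access = (strcmp(argv[2], "RANDOM") == 0);

    /* Print the PID, then block on stdin so the process can be added to a cgroup first */
    printf("%d\n", getpid());
    getchar();

    char buf[BUF_SIZE];
    for (int i = 0; i < ITERATIONS; i++) {
        if (random_access) {
            /* Seek to a random offset in the file before each read */
            off_t off = (off_t)(rand() % (FILE_SIZE - BUF_SIZE));
            if (lseek(fd, off, SEEK_SET) < 0) {
                perror("lseek");
                return 1;
            }
        }
        if (read(fd, buf, BUF_SIZE) < 0) {
            perror("read");
            return 1;
        }
    }
    close(fd);
    return 0;
}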
Compile the program with no optimizations (i.e. with the -O0
flag),
then run it in a non-root shell,
having it open the large file you created,
specifying the SEQUENTIAL access pattern.
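For example (the source and output file names here are assumptions; adjust them to match the file you downloaded):
gcc -O0 -o read_access read_access.c
./read_access <filename> SEQUENTIAL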
Open another terminal, then launch a root shell.
When the program prints its PID, before proceeding,
use the root shell to add it to the cgroup
you created in the previous exercise.
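Writing a PID into a cgroup's cgroup.procs file moves that process into the cgroup. For example, if the program printed PID 1234 and your child cgroup is the io_studio directory used as an example above, from the root shell:
echo 1234 > /sys/fs/cgroup/io_studio/cgroup.procs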
Have the program proceed, then once it has completed,
print the contents of the io.stat
file in the cgroup
you created.
Flush the page cache again, using your root shell.
Then, run the program again, this time using the RANDOM access pattern.
Again, move the program into your cgroup
after it prints its PID.
As the answer to this exercise, show the values reported by io.stat
after each run of the program.
Subtract the first set of values from the second set to get the I/O statistics for the second run,
and report those values as well.
Please explain how and why the statistics for the first run and second run were different.
Now that you have a sense of how different file access patterns affect I/O behavior,
we will introduce a mechanism by which a userspace program can advise the kernel about its intended access patterns,
so that the kernel can optimize access.
The posix_fadvise
system call allows a process to announce
an intention to access file data in a specific pattern,
allowing the kernel to optimize its page caching and read-ahead
(reading from disk beyond the bounds of calls to read(), and storing that extra data in the page cache) behavior.
This does not represent a binding contract with the kernel; it is free to make its own decisions about optimization, especially when honoring the advice would affect the performance of other processes.
To proceed with testing these mechanisms, we want to start with a clean page cache, so flush the page cache again with the command:
echo 1 > /proc/sys/vm/drop_caches
Modify your read_access.c
program so that,
instead of reading from the file using a fixed number of iterations,
it takes a third command-line argument specifying the number of times it will read.
It should also take an optional fourth command-line argument specifying advice to the kernel
(you can limit this to the POSIX_FADV_SEQUENTIAL
and POSIX_FADV_RANDOM
advice constants).
If the SEQUENTIAL access pattern is specified as the second command-line argument,
and if advice is additionally specified,
then after your program opens the file,
it should use posix_fadvise
to mark the first PAGE_SIZE * iterations
bytes with the specified advice pattern.
For more information, see the online man page
or issue the command:
man 2 posix_fadvise
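For example, the relevant addition might look like the following sketch; the helper function name, the PAGE_SIZE macro, and the assumption that iterations and advice have already been parsed from the command line are all illustrative.
#include <fcntl.h>   /* posix_fadvise() and the POSIX_FADV_* constants */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096   /* assumed to match the program's read buffer size */

/* Advise the kernel about the first PAGE_SIZE * iterations bytes of fd. */
static void apply_advice(int fd, long iterations, int advice)
{
    int err = posix_fadvise(fd, 0, (off_t)PAGE_SIZE * iterations, advice);
    if (err != 0) {
        /* posix_fadvise returns an error number directly; it does not set errno */
        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
        exit(EXIT_FAILURE);
    }
}
This would be called after the file is opened, only when the SEQUENTIAL pattern and an advice argument were both given.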
Now, proceed as follows. Before each set of runs below, flush the page cache, and reset your cgroup's I/O statistics by removing and re-creating its directory with rmdir and mkdir.
Compile read_access
with no optimizations,
then run it twice in a row using the SEQUENTIAL access pattern,
with 10000 iterations, but no advice:
./read_access <filename> SEQUENTIAL 10000
Each time, place it into the I/O cgroup,
and print the contents of io.stat
after each run.
Run read_access
twice in a row using the SEQUENTIAL access pattern,
with 10000 iterations, and with the POSIX_FADV_SEQUENTIAL
advice:
./read_access <filename> SEQUENTIAL 10000 POSIX_FADV_SEQUENTIAL
Each time, place it into the I/O cgroup,
and print the contents of io.stat
after each run.
Run read_access
twice in a row using the SEQUENTIAL access pattern,
with 10000 iterations, and with the POSIX_FADV_RANDOM
advice:
./read_access <filename> SEQUENTIAL 10000 POSIX_FADV_RANDOM
Each time, place it into the I/O cgroup,
and print the contents of io.stat
after each run.
You should now have three sets of io.stat
contents,
where each set has the contents of the file after the first, then the second,
run of your program.
As the answer to this exercise, please report those statistics.
Also, for each set, subtract the first set of values from the second set to get the I/O statistics for the second run,
and report those values as well.
Explain how the behavior differed for each piece of advice issued to the kernel.
So far, we have explored file read behavior.
In general, when files are read, the kernel first looks in its page cache,
then, if the contents are not there, it reads from the disk.
Writes are similarly optimized: the kernel writes to the page cache first,
marking the corresponding areas as dirty,
and only occasionally flushes those dirty areas to disk.
Writing to the drop_caches interface (as you did above to clear the page cache) frees clean pages, but it does not write back dirty pages; dirty data must first be flushed to disk (for example, with sync) before those pages can be dropped. Dropping caches also means that the kernel discards its in-memory copies of file data, which is inefficient in scenarios where you want to force a writeback but keep the file data in cache for fast access.
In this exercise, you will explore file write, and writeback, behavior.
This is actually mediated by two cgroup
controllers
under the cgroups v2
subsystem:
the IO controller monitors and constrains disk writes, just as it does for disk reads. However, because the page cache is held in memory, it is the memory controller that monitors the page cache and constrains its use:
when processes exceed their high memory limit,
the kernel will begin to aggressively evict pages
(triggering writebacks to disk).
The interplay between memory and I/O limits
(e.g. what happens when memory limits force writebacks,
but I/O limits constrain the rate at which writebacks are possible)
is a complex topic, and is beyond the scope of this course.
We will, for this exercise, look at how both controllers monitor
the page cache and subsequent writebacks.
Download the program write_access.c,
which functions very similarly to read_access, but instead writes to the specified file.
Compile it with no optimizations, then run the program using the RANDOM access pattern,
and place it into a new cgroup
that has both memory and IO controllers enabled.
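For example, from a root shell in /sys/fs/cgroup (the name wb_studio is just an example; this assumes the memory controller appears in cgroup.controllers):
echo "+io +memory" > cgroup.subtree_control
mkdir wb_studio
As before, move the program into this cgroup by writing its PID into wb_studio/cgroup.procs.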
Now, print the contents of both the io.stat
and memory.stat
files.
To trigger a writeback to disk (but without emptying the page cache),
you can use the sync
utility (you used this before
to synchronize the large file you created with dd).
Calling it is simple:
sudo sync
Print the contents of both the io.stat
and memory.stat
files again.
As the answer to this exercise, please indicate (1) how the I/O statistics changed before and after the call to sync, and (2) how the file_mapped, file_dirty, and file_writeback fields in the memory.stat file changed.
What does this information say about the writeback behavior of the page cache?
Along with your answers, please upload your modified read_access.c program, as well as your modified write_access.c program if you did the corresponding optional enrichment exercise.
For this exercise, you will run the read_access
program like you did in the previous exercises,
but instead of using the cgroup
to measure its I/O statistics,
you will trace its behavior with ftrace
and KernelShark.
First, flush the page cache.
Then, run the program in SEQUENTIAL access mode using only 100 iterations,
and trace it with ftrace, i.e.
sudo trace-cmd record -e sched_switch ./read_access <filename> SEQUENTIAL 100
Move or rename trace.dat
so it isn't overwritten by subsequent traces.
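For example (the new name is arbitrary):
sudo mv trace.dat trace_sequential.dat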
Then, flush the page cache again,
and trace the program with ftrace, but running in RANDOM access mode.
Now, open both trace files with KernelShark.
(If you have a headless setup for your Raspberry Pi,
please refer to the instructions in the previous studio
for running KernelShark with a GUI.)
Look closely at how the different access patterns affect the behavior of the read()
calls.
As the answer to this exercise, please explain how this is reflective of what you observed
in the cgroups
statistics reported in the previous exercise.
Also, please include a screenshot from KernelShark from each trace that highlights and reinforces your observations.
We will now explore the writeback behavior that occurs when I/O constraints are applied to a process.
First, create a new IO cgroup, then constrain its write IOPS using the io.max interface file.
The wiops value constrains the maximum number of write I/O operations per second allowed for processes in the group. The corresponding wios value obtained from io.stat in the previous exercise should tell you how many write I/O operations the writeback incurred; set an appropriate write IOPS limit so that the writeback should take several seconds.
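For example, if your root filesystem is on the MicroSD card at /dev/mmcblk0 (device numbers 179:0; check with ls -l /dev/mmcblk0 or lsblk), you could write a line like the following into the cgroup's io.max file, where 100 is just a placeholder limit:
echo "179:0 wiops=100" > io.max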
Modify the write_access
program so that, before it terminates,
it triggers a writeback using the sync()
system call,
which functions similarly to the sync
command-line utility.
For more information, see the online man page
or issue the command:
man 2 sync
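The change itself is small; for example (a sketch, assuming the call is added at the end of main() after the write loop):
#include <unistd.h>   /* sync() */

/* ... after the write loop, before the program exits ... */
sync();   /* request writeback of dirty page-cache data to disk */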
Now, run the program in RANDOM access mode,
and trace it with ftrace, i.e.
sudo trace-cmd record -e sched_switch ./write_access <filename> RANDOM
Then, open the trace file with KernelShark.
Look at the trace that occurs after the call to sync().
As the answer to this exercise, please explain how this is reflective of the write IOPS limit you set
in the IO cgroup
controller interface.
Also, please include a screenshot from KernelShark that highlights and reinforces your observations.