Buffering writes provides a significant performance improvement, and consequently, any operating system even halfway deserving the mark "modern" implements delayed writes via buffers. … The Linux kernel, like any modern operating system kernel, implements a complex layer of caching, buffering, and I/O management between devices and applications. … The page cache is an in-memory store of recently accessed data from an on-disk filesystem. Disk access is painfully slow, particularly relative to today's processor speeds. Storing requested data in memory allows the kernel to fulfill subsequent requests for the same data from memory, avoiding repeated disk access. The page cache exploits the concept of temporal locality, a type of locality of reference, which says that a resource accessed at one point has a high probability of being accessed again in the near future. The memory consumed to cache data on its first access therefore pays off, as it prevents future expensive disk accesses.
—Robert Love, Linux System Programming, 2nd Edition
Disk access is orders of magnitude slower than access to main memory (RAM),
which itself induces significant latency compared to CPU cache and registers.
As such, disk I/O bandwidth is a resource that must be shared among processes.
We have already explored cgroups
as a monitoring and constraint mechanism
by which memory and CPU resources can be allocated among groups of processes;
similarly, cgroups
also provide an IO controller.
However, not all reads from, and writes to, files on disk
immediately trigger I/O activity.
The Linux kernel utilizes the page cache,
a set of data structures that allow information on disk
to be stored in memory for faster access.
This comes at a cost, however: there is a tradeoff between optimizing I/O and providing sufficient memory resources to other parts of the kernel and to processes running on the system.
As such, pages in the cache are occasionally evicted,
and, further, when pages are dirtied (i.e., when a process writes to a file),
they must occasionally be written back to disk to maintain consistency.
In this studio, you will:
- use the cgroups v2 IO controller to apply limits to, and observe, the I/O usage of processes;
- explore how the page cache and different file access patterns affect I/O behavior, and advise the kernel of intended access patterns with posix_fadvise;
- observe page cache writeback behavior; and
- trace I/O activity with ftrace and KernelShark.
Please complete the required exercises below, as well as any optional enrichment exercises that you wish to complete. We encourage you to work in groups of 2 or 3 people on each studio (groups may change from studio to studio), though you may complete any studio by yourself if you prefer.
As you work through these exercises, please record your answers, and when finished upload them along with the relevant source code to the appropriate spot on Canvas.
Make sure that the name of each person who worked on these exercises is listed in the first answer, and number each of your responses so it is easy to match them with the corresponding exercise.
As the answer to the first exercise, please list the names of the people who worked together on this studio.
In this studio you will use the cgroups v2
IO controller
to apply limits to the I/O usage of a process (or group of processes),
and observe that usage.
Make sure, before proceeding, that you have enabled
cgroups v2
on your Raspberry Pi
according to the instructions in the
earlier studio.
For this exercise, you will set up a cgroup
with the io
controller.
First, navigate into the /sys/fs/cgroup
directory
and verify that the controller is available by inspecting the contents of cgroup.controllers.
Then, enable it for any children by adding it to the cgroup.subtree_control
file.
To do this, you will have to run the terminal in root mode, i.e. first issue either the
sudo su
or sudo bash
command.
From there, issue the command:
echo "+io" > cgroup.subtree_control
Now, inspect the contents of cgroup.subtree_control
to verify that the controller has been added.
Next, create a child control group, contained within the root group,
which will monitor several IO-intensive tasks.
To create the child, simply create a subdirectory within the root cgroup.
Navigate into this subdirectory, and list its contents,
verifying that (1) the cgroup.controllers
file lists the io
controller,
and that (2) the io.stat
file is present.
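For example, from a root shell in /sys/fs/cgroup (the directory name io_studio below is just an example; any name will do):
mkdir io_studio
cd io_studio
ls
cat cgroup.controllers
cat io.stat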
As the answer to this exercise, please list the contents of the subdirectory,
and show the contents of the io.stat
file.
To practice using the IO controller,
you will use a program that generates heavy I/O activity
with different access patterns.
You will first create a large file from which your program will read,
using the dd
utility.
This command-line tool can copy and create files,
including copying from special device node files.
This will enable you to create a file filled with 0-bits by copying from /dev/zero.
For more information, see the online man page
or issue the command:
man 1 dd
First, issue the df -h
command
to verify that your storage device has sufficient space.
Look at the available space listed for the filesystem mounted on "/".
If it shows that there is at least 5G of available space,
you will create a file consisting of 256MB of 0's.
If you have less than 5G of empty space, then you are likely using a MicroSD
with less than 16GB capacity; if so, you should use a larger one!
To create the file, issue the following command
(from a directory outside of the cgroups
hierarchy,
e.g. from within your home directory):
dd if=/dev/zero of=<filename> bs=1M count=256
Linux may leverage its page cache to keep this file in memory,
and avoid actually writing it back to the MicroSD card until it thinks it needs to do so.
To flush it to disk, run the sync
utility:
sudo sync
Even after the file is written to disk, it may still be maintained in the page cache,
to allow fast access from processes that may need to read from, or write to it.
Because the purpose of this exercise is to examine I/O behavior,
we want to guarantee that subsequent reads from the file have to go directly to disk.
The /proc pseudo-filesystem's sys subsystem provides access to virtual memory settings through several interfaces in the /proc/sys/vm directory.
You can flush the page cache by opening a root shell (sudo su) and then issuing the command:
echo 1 > /proc/sys/vm/drop_caches
Now, you will use a program, read_access.c,
to access this file using several reads,
which can either be in sequence, or from random locations in the file.
The program takes two command line arguments:
the first specifies the path to the file,
and the second specifies the access behavior.
It defines a hardcoded value for the size of the file in bytes,
and allocates a page-sized buffer (i.e. 4096 bytes) that it will read into from the file.
It first (1) attempts to open the file for reading (and exits with a descriptive error message if it can't),
(2) uses the second command line argument to determine if it should use SEQUENTIAL or RANDOM access patterns,
then (3) prints its PID to stdout, before (4) blocking on input from stdin
so that it can be placed into a cgroup
before proceeding.
Then, once it receives input, it proceeds to access the file as follows.
It runs a loop; in each iteration it attempts to read a page-sized number of bytes from the file into the buffer. It continues for 10,000 iterations, reading about 40 MB in total. If the specified access pattern is RANDOM, then before each read it seeks to a random location in the file.
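For reference, the core of such a program might look like the following minimal sketch. This is not the actual read_access.c source; the hardcoded sizes, iteration count, and error handling here are illustrative.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define FILE_SIZE (256L * 1024 * 1024) /* assumed hardcoded size of the test file, in bytes */
#define BUF_SIZE 4096                  /* one page */
#define ITERATIONS 10000

int main(int argc, char *argv[])
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <file> SEQUENTIAL|RANDOM\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    int random_access = (strcmp(argv[2], "RANDOM") == 0);

    /* Print the PID, then block on stdin so the process can be added to a cgroup first */
    printf("%d\n", getpid());
    getchar();

    char buf[BUF_SIZE];
    for (int i = 0; i < ITERATIONS; i++) {
        if (random_access) {
            /* Seek to a random offset in the file before each read */
            off_t off = (off_t)(rand() % (FILE_SIZE - BUF_SIZE));
            if (lseek(fd, off, SEEK_SET) < 0) {
                perror("lseek");
                return 1;
            }
        }
        if (read(fd, buf, BUF_SIZE) < 0) {
            perror("read");
            return 1;
        }
    }
    close(fd);
    return 0;
}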
Compile the program with no optimizations (i.e. with the -O0
flag),
then run it in a non-root shell,
having it open the large file you created,
specifying the SEQUENTIAL access pattern.
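For example (the source and output file names here are assumptions; adjust them to match the file you downloaded):
gcc -O0 -o read_access read_access.c
./read_access <filename> SEQUENTIAL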
Open another terminal, then launch a root shell.
When the program prints its PID, before proceeding,
use the root shell to add it to the cgroup
you created in the previous exercise.
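Writing a PID into a cgroup's cgroup.procs file moves that process into the cgroup. For example, if the program printed PID 1234 and your child cgroup is the io_studio directory used as an example above, from the root shell:
echo 1234 > /sys/fs/cgroup/io_studio/cgroup.procs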
Have the program proceed, then once it has completed,
print the contents of the io.stat
file in the cgroup
you created.
Flush the page cache again, using your root shell.
Then, run the program again, this time using the RANDOM access pattern.
Again, move the program into your cgroup
after it prints its PID.
As the answer to this exercise, show the values reported by io.stat
after each run of the program.
Subtract the first set of values from the second set to get the I/O statistics for the second run,
and report those values as well.
Please explain how and why the statistics for the first run and second run were different.
Now that you have a sense of how different file access patterns affect I/O behavior,
we will introduce a mechanism by which a userspace program can advise the kernel about its intended access patterns,
so that the kernel can optimize access.
The posix_fadvise
system call allows a process to announce
an intention to access file data in a specific pattern,
allowing the kernel to optimize its page caching and read-ahead
(reading from disk beyond the bounds of calls to read(), and storing that extra data in the page cache) behavior.
This does not represent a binding contract with the kernel; it is free to make its own decisions about optimization, especially when honoring the advice would affect the performance of other processes.
To proceed with testing these mechanisms, we want to start with a clean page cache, so flush the page cache again with the command:
echo 1 > /proc/sys/vm/drop_caches
Modify your read_access.c
program so that,
instead of reading from the file using a fixed number of iterations,
it takes a third command-line argument specifying the number of times it will read.
It should also take an optional fourth command-line argument specifying advice to the kernel
(you can limit this to the POSIX_FADV_SEQUENTIAL
and POSIX_FADV_RANDOM
advice constants).
If the SEQUENTIAL access pattern is specified as the second command-line argument,
and if advice is additionally specified,
then after your program opens the file,
it should use posix_fadvise
to mark the first PAGE_SIZE * iterations
bytes with the specified advice pattern.
For more information, see the online man page
or issue the command:
man 2 posix_fadvise
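For example, the relevant addition might look like the following sketch; the helper function name, the PAGE_SIZE macro, and the assumption that iterations and advice have already been parsed from the command line are all illustrative.
#include <fcntl.h>   /* posix_fadvise() and the POSIX_FADV_* constants */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096   /* assumed to match the program's read buffer size */

/* Advise the kernel about the first PAGE_SIZE * iterations bytes of fd. */
static void apply_advice(int fd, long iterations, int advice)
{
    int err = posix_fadvise(fd, 0, (off_t)PAGE_SIZE * iterations, advice);
    if (err != 0) {
        /* posix_fadvise returns an error number directly; it does not set errno */
        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
        exit(EXIT_FAILURE);
    }
}
This would be called after the file is opened, only when the SEQUENTIAL pattern and an advice argument were both given.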
Now, proceed as follows. Before each set of runs below, flush the page cache, and reset your cgroup's I/O statistics by removing and re-creating its directory with rmdir and mkdir.
Compile read_access
with no optimizations,
then run it twice in a row using the SEQUENTIAL access pattern,
with 10000 iterations, but no advice:
./read_access <filename> SEQUENTIAL 10000
Each time, place it into the I/O cgroup,
and print the contents of io.stat
after each run.
Run read_access
twice in a row using the SEQUENTIAL access pattern,
with 10000 iterations, and with the POSIX_FADV_SEQUENTIAL
advice:
./read_access <filename> SEQUENTIAL 10000 POSIX_FADV_SEQUENTIAL
Each time, place it into the I/O cgroup,
and print the contents of io.stat
after each run.
Run read_access
twice in a row using the SEQUENTIAL access pattern,
with 10000 iterations, and with the POSIX_FADV_RANDOM
advice:
./read_access <filename> SEQUENTIAL 10000 POSIX_FADV_RANDOM
Each time, place it into the I/O cgroup,
and print the contents of io.stat
after each run.
You should now have three sets of io.stat
contents,
where each set has the contents of the file after the first, then the second,
run of your program.
As the answer to this exercise, please report those statistics.
Also, for each set, subtract the first set of values from the second set to get the I/O statistics for the second run,
and report those values as well.
Explain how the behavior differed for each piece of advice issued to the kernel.
So far, we have explored file read behavior.
In general, when files are read, the kernel first looks in its page cache,
then, if the contents are not there, it reads from the disk.
Writes are similarly optimized: the kernel writes to the page cache first,
marking the corresponding areas as dirty,
and only occasionally flushes those dirty areas to disk.
Writing to the drop_caches interface (as you did above to clear the page cache) frees clean pages, but it does not write back dirty pages; dirty data must first be flushed to disk (for example, with sync) before those pages can be dropped. Dropping caches also means that the kernel discards its in-memory copies of file data, which is inefficient in scenarios where you want to force a writeback but keep the file data in cache for fast access.
In this exercise, you will explore file write, and writeback, behavior.
This is actually mediated by two cgroup
controllers
under the cgroups v2
subsystem:
the IO controller monitors and constrains disk writes, just as it does for disk reads. However, because the page cache is held in memory, it is the memory controller that monitors the page cache and constrains its use:
when processes exceed their high memory limit,
the kernel will begin to aggressively evict pages
(triggering writebacks to disk).
The interplay between memory and I/O limits
(e.g. what happens when memory limits force writebacks,
but I/O limits constrain the rate at which writebacks are possible)
is a complex topic, and is beyond the scope of this course.
We will, for this exercise, look at how both controllers monitor
the page cache and subsequent writebacks.
Download the program write_access.c,
which functions very similarly to read_access, but instead writes to the specified file.
Compile it with no optimizations, then run the program using the RANDOM access pattern,
and place it into a new cgroup
that has both memory and IO controllers enabled.
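For example, from a root shell in /sys/fs/cgroup (the name wb_studio is just an example; this assumes the memory controller appears in cgroup.controllers):
echo "+io +memory" > cgroup.subtree_control
mkdir wb_studio
As before, move the program into this cgroup by writing its PID into wb_studio/cgroup.procs.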
Now, print the contents of both the io.stat
and memory.stat
files.
To trigger a writeback to disk (but without emptying the page cache),
you can use the sync
utility (you used this before
to synchronize the large file you created with dd).
Calling it is simple:
sudo sync
Print the contents of both the io.stat
and memory.stat
files again.
As the answer to this exercise, please indicate (1) how the I/O statistics changed before and after the call to sync, and (2) how the file_mapped, file_dirty, and file_writeback fields in the memory.stat file changed.
What does this information say about the writeback behavior of the page cache?
Along with your answers, please upload your modified read_access.c program, as well as your modified write_access.c program if you did the corresponding optional enrichment exercise.
For this exercise, you will run the read_access
program like you did in the previous exercises,
but instead of using the cgroup
to measure its I/O statistics,
you will trace its behavior with ftrace
and KernelShark.
First, flush the page cache.
Then, run the program in SEQUENTIAL access mode using only 100 iterations,
and trace it with ftrace, i.e.
sudo trace-cmd record -e sched_switch ./read_access <filename> SEQUENTIAL 100
Move or rename trace.dat
so it isn't overwritten by subsequent traces.
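For example (the new name is arbitrary):
sudo mv trace.dat trace_sequential.dat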
Then, flush the page cache again,
and trace the program with ftrace, but running in RANDOM access mode.
Now, open both trace files with KernelShark.
(If you have a headless setup for your Raspberry Pi,
please refer to the instructions in the previous studio
for running KernelShark with a GUI.)
Look closely at how the different access patterns affect the behavior of the read()
calls.
As the answer to this exercise, please explain how this is reflective of what you observed
in the cgroups
statistics reported in the previous exercise.
Also, please include a screenshot from KernelShark from each trace that highlights and reinforces your observations.
We will now explore the writeback behavior that occurs when I/O constraints are applied to a process.
First, create a new IO cgroup, then constrain its write IOPS using the io.max interface file.
The wiops value constrains the maximum number of write I/O operations per second allowed for processes in the group. The corresponding wios value obtained from io.stat in the previous exercise should tell you how many write I/O operations the writeback incurred; set an appropriate write IOPS limit so that the writeback should take several seconds.
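For example, if your root filesystem is on the MicroSD card at /dev/mmcblk0 (device numbers 179:0; check with ls -l /dev/mmcblk0 or lsblk), you could write a line like the following into the cgroup's io.max file, where 100 is just a placeholder limit:
echo "179:0 wiops=100" > io.max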
Modify the write_access
program so that, before it terminates,
it triggers a writeback using the sync()
system call,
which functions similarly to the sync
command-line utility.
For more information, see the online man page
or issue the command:
man 2 sync
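The change itself is small; for example (a sketch, assuming the call is added at the end of main() after the write loop):
#include <unistd.h>   /* sync() */

/* ... after the write loop, before the program exits ... */
sync();   /* request writeback of dirty page-cache data to disk */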
Now, run the program in RANDOM access mode,
and trace it with ftrace, i.e.
sudo trace-cmd record -e sched_switch ./write_access <filename> RANDOM
Then, open the trace file with KernelShark.
Look at the trace that occurs after the call to sync().
As the answer to this exercise, please explain how this is reflective of the write IOPS limit you set
in the IO cgroup
controller interface.
Also, please include a screenshot from KernelShark that highlights and reinforces your observations.