CSE 522S - Advanced Operating Systems

CSE 522S: Studio 6

Isolation with Namespaces

Inside each container, you see a filesystem, network interfaces, disks, and other resources that all appear to be unique to the container despite sharing the kernel with all the other processes on the system. The primary network interface on the actual machine, for example, is a single shared resource. But inside your container it will look like it has an entire network interface to itself. This is a really useful abstraction: it's what makes your container feel like a machine all by itself. The way this is implemented in the kernel is with namespaces. Namespaces take a traditionally global resource and present the container with its own unique and unshared version of that resource.

—Sean P. Kane & Karl Matthias, Docker Up & Running, 2nd Edition

Namespaces allow groups of processes to share an isolated instance of a global resource. They are fundamental to the operation of containers, enabling the Linux kernel to establish isolated subsystems that, unlike virtual machines, share the same kernel, but still can appear from within, in many ways, to exist on their own. This studio introduces namespaces, and establishes the foundations by which containers use them.

In this studio, you will:

Be introduced to the UTS, PID, and mount namespace types
Gain experience creating and joining namespaces using the unshare, setns, and clone system calls
Learn how to mount a proc pseudo-filesystem, isolating the ps command to a PID namespace
Create a simple init process
Explore the differences between chroot and mount namespaces
Set up your own simple container environment

Please complete the required exercises below. We encourage you to work in groups of 2 or 3 people on each studio (groups may change from studio to studio), though you may complete any studio by yourself if you prefer.

As you work through these exercises, please record your answers, and when finished upload them along with the relevant source code to the appropriate spot on Canvas.

Make sure that the name of each person who worked on these exercises is listed in the first answer, and make sure you number each of your responses so it is easy to match your responses with each exercise.

Required Exercises

As the answer to the first exercise, list the names of the people who worked together on this studio.
As a simple demonstration, to introduce the concept of namespaces, you will write a program that creates a new UTS namespace, sets its hostname, then observe that the hostname, from the perspective of processes in the broader system outside of that namespace, is unaffected.

Because namespaces (besides user namespaces, which we will cover next time) require administrative privileges to create, please complete this studio on your Raspberry Pi.

Hint: this exercise closely follows parts of the article, Namespaces in operation, part 2: the namespaces API and you may use any code (with attribution) from the unshare.c program referenced in that article.

Write a program that uses the unshare() system call to create and join a new UTS namespace. For more information, see the online man page or issue the command:

man 2 unshare

The program should then change the hostname of its namespace with the sethostname() system call. For more information, see the online man page or issue the command:

man 2 sethostname

The program should next call gethostname(), and print the system hostname (of its namespace) to the terminal. It should finally enter an infinite loop such that it does not terminate until interrupted, allowing the namespace to remain active.
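The three calls described above can be sketched as a single helper. This is only a minimal illustration, not the unshare.c program from the article: the function name set_hostname_in_new_uts_ns and its parameters are hypothetical, and your own program may be structured quite differently.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (the name is an assumption, not part of the studio).
 * Moves the calling process into a new UTS namespace, sets the hostname
 * there, and reads it back into buf. Returns 0 on success, -1 on error.
 */
int set_hostname_in_new_uts_ns(const char *name, char *buf, size_t len)
{
    if (len == 0)
        return -1;

    /* Creating a UTS namespace requires CAP_SYS_ADMIN, hence sudo. */
    if (unshare(CLONE_NEWUTS) == -1) {
        perror("unshare");
        return -1;
    }
    /* This changes the hostname only inside the new namespace. */
    if (sethostname(name, strlen(name)) == -1) {
        perror("sethostname");
        return -1;
    }
    if (gethostname(buf, len) == -1) {
        perror("gethostname");
        return -1;
    }
    buf[len - 1] = '\0';  /* gethostname may not terminate a truncated name */
    return 0;
}
```

A main() function could call this helper, print buf, and then loop on pause() so that the process, and with it the namespace, stays alive until interrupted.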

Compile and run the program on your Raspberry Pi (using sudo to gain administrative privileges). With it still running, open a new terminal window, and issue the hostname command.

As the answer to this exercise, please show the output of your program, and the hostname shown in the other terminal. Do you observe the expected behavior? Why or why not?
This time, you will use the setns() system call to allow a process to enter an existing UTS namespace.

Hint: this exercise closely follows parts of the article, Namespaces in operation, part 2: the namespaces API and you may use any code (with attribution) from the ns_exec.c program referenced in that article.

Modify your program (you do not need to create another copy) from the previous exercise so that, after the call to unshare(), but before calling sethostname(), it prints its PID to the terminal.

Write another program that takes two command-line arguments. The first should be an integer that specifies a PID. The program should attempt to open() the corresponding /proc/PID/ns/uts file (read only), then join the corresponding namespace with the setns() system call. For more information, see the online man page or issue the command:

man 2 setns

The second command line argument should be a command, which the program should execute with execvp in the namespace it has joined. For more information, see the online man page or issue the command:

man 3 execvp

To practice good C programming style, be sure to have your program check the number of command-line arguments and the return values from system calls, then report any errors appropriately and exit if necessary.
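As a rough sketch of the open(), setns(), and execvp() sequence described above, consider the following. It is not the ns_exec.c program from the article; the helper names uts_ns_path and join_uts_and_exec are hypothetical, and argument parsing in main() is left to you.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Build "/proc/<pid>/ns/uts" in buf. Returns 0, or -1 if it does not fit. */
int uts_ns_path(pid_t pid, char *buf, size_t len)
{
    int n = snprintf(buf, len, "/proc/%ld/ns/uts", (long)pid);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Join the UTS namespace of process pid, then replace the calling process
 * with the command in argv (which must end with a NULL pointer).
 * Returns only on failure, with -1.
 */
int join_uts_and_exec(pid_t pid, char *const argv[])
{
    char path[64];

    if (uts_ns_path(pid, path, sizeof path) == -1) {
        fprintf(stderr, "namespace path too long\n");
        return -1;
    }
    int fd = open(path, O_RDONLY);
    if (fd == -1) {
        perror("open");
        return -1;
    }
    /* CLONE_NEWUTS makes setns() fail unless fd refers to a UTS namespace. */
    if (setns(fd, CLONE_NEWUTS) == -1) {
        perror("setns");
        close(fd);
        return -1;
    }
    close(fd);

    execvp(argv[0], argv);
    perror("execvp");  /* reached only if execvp failed */
    return -1;
}
```

With this layout, main() would convert argv[1] to a PID, check argc, and pass &argv[2] as the command vector.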

Compile and run the program that creates a new namespace (using sudo to gain administrative privileges) and observe the PID that it prints.

In another terminal window, compile and run the program that joins an existing namespace (using sudo to gain administrative privileges), passing the PID printed by the first program, and having it launch a shell in the new namespace (e.g. by passing /bin/bash as the second command-line argument). In that shell, issue the hostname command to verify that it is, indeed, in the new UTS namespace created by the first program.

As the answer to this exercise, please show the output from both programs.
For the rest of this studio, you will take existing programs, then modify them, to establish a set of tools that will enable you to create a very simple container. You will continue to build on these tools in the next several studios to gain experience with the foundational mechanisms that Linux provides for container environments, before using more advanced tools like Docker.

So far in this studio, you have explored two of the three primary system calls related to namespaces. For this and the remaining exercises, you will instead use the clone() system call. For more information, see the online man page or issue the command:

man 2 clone

On your Raspberry Pi, retrieve and compile the ns_child_exec.c program. This program creates a child process, using clone() in new namespace(s), according to flags provided as command-line arguments. Subsequent argument(s) specify a binary or command that the cloned child process executes with execvp() in those new namespace(s). Please read through the program's code to understand how it works, and please see the Namespaces in operation, part 4: more on PID namespaces article for information about, and examples for using, the ns_child_exec program.
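To make the role of clone() more concrete, here is a small sketch of creating a child in new namespaces and waiting for it. It is not the code of ns_child_exec.c, which you should read directly; the names run_in_new_namespaces, print_pids, and CHILD_STACK_SIZE are hypothetical, and the child here runs a plain function rather than calling execvp().

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (1024 * 1024)

/* Example child function: report PIDs as seen from the child's namespace. */
int print_pids(void *arg)
{
    (void)arg;
    printf("PID: %ld, parent PID: %ld\n", (long)getpid(), (long)getppid());
    fflush(stdout);  /* flush explicitly rather than relying on exit handling */
    return 0;
}

/*
 * Hypothetical helper: run fn(arg) in a child created with clone(), placing
 * the child in the namespaces named by ns_flags (e.g. CLONE_NEWPID).
 * Returns the child's exit status, or -1 on error.
 */
int run_in_new_namespaces(int ns_flags, int (*fn)(void *), void *arg)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL) {
        perror("malloc");
        return -1;
    }

    /* The stack grows downward on most architectures, so pass its top.
       SIGCHLD lets the parent wait for the child with waitpid(). */
    pid_t pid = clone(fn, stack + CHILD_STACK_SIZE, ns_flags | SIGCHLD, arg);
    if (pid == -1) {
        perror("clone");
        free(stack);
        return -1;
    }

    int status;
    if (waitpid(pid, &status, 0) == -1) {
        perror("waitpid");
        free(stack);
        return -1;
    }
    free(stack);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_in_new_namespaces(CLONE_NEWPID, print_pids, NULL) requires administrative privileges; passing 0 for ns_flags creates an ordinary child in the caller's namespaces.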

Create a new program that uses the getpid() and getppid() functions to print its PID and its parent PID to the terminal. Compile this program, then run ns_child_exec with verbose output (using sudo to gain administrative privileges), to execute the program you wrote in its own PID namespace. The verbose output option will cause ns_child_exec to print the PID of the child process it creates, from the scope of the root PID namespace. The child process, on the other hand, will print its (and its parent's) PIDs from the scope of the new namespace.

As the answer to this exercise, please show the output from both processes. State which command-line flag is passed to ns_child_exec to specify a new PID namespace and which flag specifies verbose output. Please also explain, briefly, why the parent PID of the child process changes, and explain the mechanisms used by ns_child_exec to ensure that its child process does not become an orphan.
A process launched in a new PID namespace can also inhabit a new mount namespace, which enables it to mount a proc pseudo-filesystem for its PID namespace in the /proc directory, without affecting the mount visible to processes in the root namespace.

Hint: this exercise closely follows parts of the article, Namespaces in operation, part 3: PID namespaces and you may use any code (with attribution) from the pidns_init_sleep.c program referenced in that article.

Modify the program you created in the previous exercise that prints its, and its parent's, PIDs. After it prints these values, it should mount the proc pseudofilesystem for its namespace into the /proc directory. Because the /proc filesystem is likely marked as shared, meaning that events will propagate to its peer groups outside the namespace, you will first need to mark it as private. To do so, you will first use the mount() system call to update the namespace's propagation type to MS_PRIVATE, then use a second mount() system call to mount its proc filesystem. For more information, see the online man page or issue the command:

man 2 mount
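The two mount() calls described above might look roughly like the following. This is a sketch under assumptions: the helper name mount_private_proc and its target parameter are hypothetical (the exercise uses /proc), and it changes the propagation type of only the mount at target, not of any mounts beneath it.

```c
#include <stdio.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: mark the mount at target as private, then mount a
 * proc instance on top of it. Returns 0 on success, -1 on error.
 */
int mount_private_proc(const char *target)
{
    /* Step 1: change only the propagation type. For this kind of call the
       source, filesystem type, and data arguments are ignored. */
    if (mount(NULL, target, NULL, MS_PRIVATE, NULL) == -1) {
        perror("mount (MS_PRIVATE)");
        return -1;
    }
    /* Step 2: mount the proc filesystem for the caller's PID namespace. */
    if (mount("proc", target, "proc", 0, NULL) == -1) {
        perror("mount (proc)");
        return -1;
    }
    return 0;
}
```

In the exercise, the program would call mount_private_proc("/proc") after printing its PIDs and before it executes ps u.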

After mounting the proc filesystem, your program should, with one of the exec family of functions, run the ps u command. For more information, you can look at the call to execlp in the pidns_init_sleep.c program, see the online man page, or issue the command:

man 3 exec

Compile your program, then run ns_child_exec with verbose output (using sudo to gain administrative privileges) to run it in its own PID and mount namespaces. After it completes, call sudo ps u. Compare the outputs of both runs of the ps u utility (the one you called directly from the scope of the global namespace, and the one called by your process in the scope of the new namespace) to verify that ps lists processes from the perspective of the namespace from which it was called.

Note: if you did not correctly mark the /proc mount as private before mounting the new proc filesystem in the scope of the new PID namespace, then (1) the mount event propagates to the global mount namespace, so that (2) when the program that created the new namespace exits, and the PID namespace is therefore destroyed, the /proc mount will be unmounted (since that namespace no longer exists), which means that (3) it will also be unmounted in the parent namespace. If this happens, then when you call sudo ps u directly from the shell, you may get an error message indicating that /proc is not mounted. In this case, make sure that your program that mounts /proc in the new namespace first correctly marks the /proc filesystem as private. Then, from a terminal window (in the global namespace), remount /proc by issuing the following command:

sudo mount proc /proc -t proc

As the answer to this exercise, please show the output of your program, as well as the output of your call to ps u from the terminal. Please explain any similarities and differences you observe. Also, state which command-line flag is passed to ns_child_exec to specify a new mount namespace.
The first process in a PID namespace has PID 1 in that namespace, and, in many ways, becomes the "init" process for that namespace, meaning that it becomes the parent to, and reaps, orphaned processes in its namespace. For this exercise, you will use an existing program that performs some of the "init" functionality.

Hint: this exercise closely follows parts of the article, Namespaces in operation, part 4: more on PID namespaces and you may use any code (with attribution) from the orphan.c program referenced in that article.

On your Raspberry Pi, retrieve and compile the simple_init.c program. This is a simple init-style program to be used as the "init" process in a PID namespace. The program reaps the status of its children and provides a simple shell facility for executing commands.
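For background on what "reaping" means, the sketch below shows one common pattern: a non-blocking waitpid() loop that collects every child that has already terminated. It is not taken from simple_init.c and may differ from what that program does, so read its source to answer the exercise; the helper name reap_children is hypothetical.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: collect every child that has already terminated,
 * without blocking. Returns the number of children reaped.
 */
int reap_children(void)
{
    int reaped = 0;
    int status;
    pid_t pid;

    /* -1 means "any child"; with WNOHANG, waitpid() returns 0 instead of
       blocking when children exist but none has terminated yet. */
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        printf("reaped child %ld\n", (long)pid);
        reaped++;
    }
    /* ECHILD just means there are no children left to wait for. */
    if (pid == -1 && errno != ECHILD)
        perror("waitpid");
    return reaped;
}
```

A loop like this is typically run whenever the process learns that a child's state has changed, for example after receiving SIGCHLD.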

Write a new program that calls fork() to create a child process. For more information, see the online man page or issue the command:

man 2 fork

The parent process should (1) print the child process's PID, (2) print its own PID, and then (3) print its parent's PID. It should then (4) sleep (e.g. with the sleep function) for one second before exiting, which ensures that it has not yet exited when the child process becomes active. For more information, see the online man page or issue the command:

man 3 sleep

The child process should (1) print its PID then (2) print its parent's PID. Make sure to have the parent and child processes print appropriate messages so you can distinguish which process is printing. Then, the child process should (3) sleep for two seconds, guaranteeing that its parent has exited, making it an orphan. The child should then (4) print its parent's PID again before exiting.

Compile your program, then run ns_child_exec with verbose output to launch simple_init in its own PID namespace (using sudo to gain administrative privileges), passing the -v flag as a command line argument to simple_init to specify verbose output, e.g. with the following command:

sudo ./ns_child_exec -pv ./simple_init -v

NOTE: Using CTRL+C will not exit the terminal provided by simple_init, but you can issue the exit or quit commands.

From the simple_init shell, run your program that creates an orphaned child process. As the answer to this exercise, please show all terminal output (including the output of simple_init when it is launched). Please also explain, briefly, what mechanisms simple_init uses to reap orphaned processes in its namespace.
A Linux container is typically given its own view of the filesystem, isolating it from the rest of the system. Besides the files specific to its functionality, and a set of directories into which it can store data, it also needs access to necessary OS files and directories that enable it to interface with the Linux OS.

For this exercise, you will simulate the creation of a container by launching a shell in its own isolated section of the filesystem by using the chroot mechanism.

First, create a new directory that will serve as the root of your container's filesystem. This directory should exist at the root of the filesystem, e.g. mkdir /containerfs

Inside this directory, use the mount utility to bind-mount the following files and directories:
- /bin [D] *
- /dev/console [F]
- /dev/pts [D]
- /dev/shm [D]
- /etc/hostname [F]
- /etc/hosts [F]
- /lib [D] *
- /proc [D] *
- /sys [D] *
- /usr/bin [D] *
- /usr/lib [D] *
(* indicates to mount read-only)

([D] indicates to create a directory as the mount-point, [F] indicates to create an empty file as the mount-point)
For more information, see the online man page or issue the command:

man 8 mount

Now, use the chroot command-line utility to launch a new shell, with its root directory as the directory you just created. For more information, see the online man page or issue the command:

man 1 chroot

From the new shell, create a home directory in the shell's root directory, then navigate into it and create a few files in there. Run the command ls -l to view the files you've created. Also, run the ps command to verify that the proc filesystem is mounted at /proc. In another terminal window, navigate to the directory you created to be the root directory of the "container," enter the home directory you created, and run ls -l to view the files in that directory. As the answer to this exercise, show the output of both ls commands and the ps command.

NOTE: you can issue the exit command to exit the chroot jail environment.
This time, instead of launching a shell in a chroot jail, you will create a shell process in its own mount namespace.

First, you will create a new program that sets up the mount namespace, including bind-mounting the necessary directories and files listed in the previous exercise, then pivoting to a new root directory. This program will be linked against simple_init, and will provide a function that takes a const char * argument indicating the path to the directory that will serve as the root of the container's filesystem.

To ensure that mount events from the new mount namespace do not propagate to the root namespace, the function should first recursively set all mounts, starting from the root, to private:

mount("","/","",MS_PRIVATE | MS_REC,NULL);

To allow the specified directory to become the root of the filesystem, that directory needs to be a mount-point. So, as a next step, the function should use the mount() system call to bind-mount the specified directory to itself.

Next, the function should use the chdir() system call to set its working directory to the supplied path. For more information, see the online man page or issue the command:

man 2 chdir

Next, the function should use the mount() system call to bind-mount all necessary directories listed in the previous exercise into the specified directory. Be sure to use the same settings (read-only where specified) as before. This time, however, it should mount the proc pseudo-filesystem directly:

mount("proc","proc","proc",0,NULL)
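One way to express a single bind mount, with an optional read-only setting, is sketched below. The helper name bind_mount and its parameters are hypothetical. The sketch assumes the mount-point already exists (a directory for [D] entries, an empty file for [F] entries), and it relies on the fact that a bind mount is made read-only by a second, remounting mount() call rather than by passing MS_RDONLY in the first call.

```c
#include <stdio.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: bind-mount src onto dst, optionally read-only.
 * The mount-point dst must already exist. Returns 0 on success, -1 on error.
 */
int bind_mount(const char *src, const char *dst, int read_only)
{
    /* For a bind mount, the filesystem type and data arguments are ignored. */
    if (mount(src, dst, NULL, MS_BIND, NULL) == -1) {
        perror("mount (bind)");
        return -1;
    }
    /* MS_RDONLY is not applied by the initial bind mount, so remount the
       new mount-point with MS_REMOUNT to make it read-only. */
    if (read_only &&
        mount(NULL, dst, NULL, MS_BIND | MS_REMOUNT | MS_RDONLY, NULL) == -1) {
        perror("mount (remount read-only)");
        return -1;
    }
    return 0;
}
```

Since the function has already changed its working directory to the container's root, a call such as bind_mount("/bin", "bin", 1) would use a path relative to that directory as the mount-point.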

Finally, to establish a new root directory, the function will need to use the pivot_root system call. This requires the new root to contain a subdirectory into which the old root filesystem will be mounted.

Have your program create a directory called old-root using the mkdir() system call. For more information, see the online man page or issue the command:

man 2 mkdir

HINT: You can use S_IRWXU as the mode argument for mkdir().

Then, have your program call pivot_root() to make the current directory (.) the new root and move the old root filesystem to the old-root directory. For more information, see the online man page or issue the command:

man 2 pivot_root

To practice good C programming style, be sure to have your function check the return values from all system calls, then report any errors appropriately and exit if necessary. For the mkdir() system call, an errno of EEXIST means the directory already exists, and is therefore likely not indicative of undesired behavior for your function.
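The mkdir() and pivot_root() steps, including the EEXIST case, can be sketched as follows. The helper name pivot_into is hypothetical, and it repeats the chdir() into the new root so that the sketch stands on its own. Because glibc has traditionally provided no wrapper for pivot_root(), the sketch invokes it through syscall(); the final chdir("/") is an extra step, not required by the text above, that moves the working directory to the new root.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: make new_root (which must be a mount-point) the root
 * of the caller's mount namespace, leaving the old root under /old-root.
 * Returns 0 on success, -1 on error.
 */
int pivot_into(const char *new_root)
{
    if (chdir(new_root) == -1) {
        perror("chdir");
        return -1;
    }
    /* An existing old-root directory is fine; any other error is not. */
    if (mkdir("old-root", S_IRWXU) == -1 && errno != EEXIST) {
        perror("mkdir");
        return -1;
    }
    /* Invoke the system call directly, since a library wrapper may be absent. */
    if (syscall(SYS_pivot_root, ".", "old-root") == -1) {
        perror("pivot_root");
        return -1;
    }
    /* Make sure the working directory refers to the new root. */
    if (chdir("/") == -1) {
        perror("chdir");
        return -1;
    }
    return 0;
}
```

After a successful call, the previous root filesystem remains reachable under /old-root inside the new mount namespace.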

Create a header file declaring the function, and call this function from simple_init.c before it enters its while loop to read shell commands. You can pass the following argument to the function:

argv[optind]

This allows you to specify the path to the container's root directory as a command-line argument to simple_init. Your simple_init.c program should verify that argv[optind] is not null, i.e., that a path to a directory was provided as a command-line argument, before calling the function. This allows you to run simple_init without mapping a new root directory, if desired (e.g. if you are not running it in its own mount namespace).

Compile your modified simple_init program, linking the other C file containing your mount namespace initialization function against it, then run ns_child_exec to launch simple_init in its own PID, UTS, and mount namespaces, supplying the path to the container's root directory to simple_init, e.g. with the following command:

sudo ./ns_child_exec -pmuv ./simple_init /containerfs

Congratulations! You have created a simple container.

The shell provided by simple_init has only very basic functionality, but it does allow you to launch the bash shell, i.e. by executing /bin/bash. Do so, then in your new container, create a home directory in the shell's root directory (if you have not done so already), then navigate into it and create a few files in there. Run the command ls -l to view the files you've created. In another terminal window, navigate to the directory you created to be the root directory of the "container," enter the home directory you created, and run ls -l to view the files in that directory.

Also, run the ps command to verify that the proc filesystem is mounted at /proc. From a new terminal window, run ps to compare the processes from the perspective of the root namespace, to the processes listed by your program.

Then, from the shell provided by simple_init, issue the command:

hostname <newhostname>

to create a new hostname for your container. Then, issue the hostname command to verify that the new hostname was applied. In another terminal window, issue the hostname command to verify that the original hostname of your Raspberry Pi was not affected in the root namespace.

As the answer to this exercise, show the output of both terminals, i.e. the ps, ls, and hostname commands run from the bash shell launched by your container's simple_init, and from the terminal window in the root namespace. Also, state which command-line flag is passed to ns_child_exec to specify a new UTS namespace.

Be sure to keep the directory that serves as the root of your container intact! You will continue to use this simple container environment in future studios.
Things to Turn In:
- Your answers to the above exercises
- Your code for the program that creates a UTS namespace with unshare() and prints its PID
- Your code for the program that joins the UTS namespace inhabited by a given process, then execvp()s a command
- Your code for the program that prints its, and its parent's, PIDs then mounts the proc pseudofilesystem to /proc
- Your code for the program that creates an orphaned child process to be reaped by simple_init
- Your code for the function that mounts directories, then establishes a new root directory, for a new mount namespace
- Your modified simple_init.c program that calls this function

Page updated Tuesday, January 25, 2022, by Marion Sudvarg and Chris Gill.
