"Although it would be nice if every critical region consisted of code that did nothing more complicated than incrementing a variable, reality is much crueler."
—Robert Love, Linux Kernel Development 3rd Ed., Chapter 10, p. 183
Locking primitives are important user space tools for concurrent and parallel programming. Two main types of locks exist: locks that spin while a program is waiting to acquire the lock, versus those that cause the program to sleep. Spinlocks consume CPU cycles as a process waits, but are well suited for low-latency applications when critical sections are very short. Other locks allow processes to sleep while waiting, which costs fewer CPU cycles but may result in longer latency with processes sleeping and waking up.
In this studio, you will:
Please complete the required exercises below, as well as any optional enrichment exercises that you wish to complete. We encourage you to work in groups of 2 or 3 people on each studio (the groups may change from studio to studio), though you may complete any studio by yourself if you prefer.
As you work through these exercises, please record your answers. When you finish, each person who worked on them should log into Canvas, select this course for this semester, upload a file containing the answers (along with any other files the assignment asks for), and submit those files for this studio assignment. There should be a separate submission from each person who worked together.
Make sure that the name of each person who worked on these exercises is listed in the first answer, and number each of your responses so it is easy to match your responses with each exercise.
Download the workload.c program. This program produces several seconds of work on each processor in a system. Build and run the program on your Raspberry Pi (see the comment in the program's code for the command line you will need to use to build it) and verify that it occupies all of your system's cores by using top and pressing 1 (or, even better, using trace-cmd and kernelshark).
As the answer to this exercise, please explain why (based on the output of the top utility, or of trace-cmd or kernelshark) you think the program is occupying all of the cores.
Notice that multiple threads may execute the critical_section() function concurrently (or even simultaneously if they are on different cores). This is undesirable, since critical sections typically protect important shared data. First we will build a spin lock to protect access to the critical section.
Write an initially empty lock function and an initially empty unlock function. Each of these functions should take a pointer to a volatile int as its parameter.
Note: Recall that the volatile specifier tells the compiler that the value of a variable may change unexpectedly. In this case, the compiler interprets the volatile int * type declaration to mean "a pointer to an int that's volatile" (the value pointed at by the pointer may change unexpectedly, not that the pointer itself may change).
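As a sketch, the initially empty functions might start out as follows (the function names here are hypothetical; the studio lets you choose your own):

```c
/* Hypothetical names -- each function takes a pointer to a volatile int,
   i.e., the pointed-to value may change at any time, though the pointer
   itself does not. */
void my_lock( volatile int * lock ) { /* initially empty */ (void) lock; }
void my_unlock( volatile int * lock ) { /* initially empty */ (void) lock; }
```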
To treat an integer like a lock, we need to define two values that represent the locked and unlocked states, respectively. Define these at the top of your program with a precompiler #define statement for each. Also, declare an integer variable for the state of the lock itself, and initialize the variable to be in the unlocked state.
Inside the parallel region of the program, insert calls to the lock and unlock functions before and after the critical section of the workload.c program, respectively, passing the address of the lock variable into each of those calls.
As the answer to this exercise, please show your statements to declare the two values and the initial value of the lock.
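For reference, a minimal sketch of such declarations might look like the following (the names and values here are one plausible choice, not prescribed by the studio; any two distinct values work for the spin-lock version):

```c
/* Hypothetical state values for the spin lock. */
#define UNLOCKED 0
#define LOCKED   1

/* The shared lock variable, initialized to the unlocked state. */
volatile int lock = UNLOCKED;
```

Inside the parallel region, the critical section would then be bracketed by calls like my_lock( &lock ); and my_unlock( &lock ); (function names again hypothetical).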
To implement the lock and unlock functions we'll use GCC's built-in atomic functions. If we were working in C++ we could use C++11's atomics. If we didn't have access to GCC, or if speed were critical, we could implement these with assembly instructions. The atomic built-in functions are documented in the GCC manual.
For the spin lock we will use the function __atomic_compare_exchange(). The first three arguments determine the meaning of this function: ptr, expected, and desired. When called, this function atomically compares the contents of the location pointed to by ptr with the contents of the variable pointed to by expected, and if they are equal, writes the value of the variable pointed to by desired into the location pointed to by ptr. The last three arguments specify whether a weak exchange is allowed and which memory orderings to use, but we'll just opt for a strong ordering for this studio, as in:
__atomic_compare_exchange( ptr, expected, desired, 0, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE)
To implement the lock function, you should check for the unlocked state and write the value of the locked state. However, it's possible that this function will fail (for example, if another process already holds the lock). Thus, your lock function should attempt to swap the state of the lock variable, and continue retrying until it succeeds. WARNING: when it fails, __atomic_compare_exchange() overwrites the value of the variable pointed to by expected with the value it actually found. Please make sure to re-initialize the expected variable's value as needed.
Note that only the lock variable needs to be accessible at a scope outside of the lock and unlock functions, and declaring the expected and desired variables outside of those functions makes them susceptible to races. Instead, you should declare (and initialize) the expected and desired variables inside of those functions, so that they only exist within the stack frame for a single call to one of those functions.
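Putting the pieces above together, a sketch of the spin-lock acquire might look like the following (the function name and state values are assumptions, not prescribed by the studio):

```c
/* Hypothetical state values. */
#define UNLOCKED 0
#define LOCKED   1

void my_lock( volatile int * lock )
{
    /* expected and desired live on this call's stack frame,
       so they cannot race with other threads' copies. */
    int expected = UNLOCKED;
    int desired  = LOCKED;

    /* Retry until the compare-and-swap succeeds.  On failure,
       __atomic_compare_exchange() overwrites expected with the value
       it actually found, so re-initialize it before each attempt. */
    while ( ! __atomic_compare_exchange( lock, &expected, &desired, 0,
                                         __ATOMIC_ACQ_REL,
                                         __ATOMIC_ACQUIRE ) ) {
        expected = UNLOCKED;
    }
}
```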
Implement the unlock function using the same __atomic_compare_exchange() function. However, since we only expect the unlock function to be called when we already hold the lock, it should succeed unless something is drastically wrong. If the call to __atomic_compare_exchange() fails, then rather than retrying, your code should print out an error message and return.
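A sketch of that release path, under the same hypothetical names and state values as above:

```c
#include <stdio.h>

/* Hypothetical state values. */
#define UNLOCKED 0
#define LOCKED   1

void my_unlock( volatile int * lock )
{
    int expected = LOCKED;
    int desired  = UNLOCKED;

    /* We should already hold the lock, so a single attempt should
       succeed; if it fails, something is drastically wrong, so just
       report the error and return rather than retrying. */
    if ( ! __atomic_compare_exchange( lock, &expected, &desired, 0,
                                      __ATOMIC_ACQ_REL,
                                      __ATOMIC_ACQUIRE ) ) {
        fprintf( stderr, "unlock failed: lock was not in the locked state\n" );
    }
}
```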
As the answer to this exercise, please show your code for both the lock and unlock functions.
You can also set the environment variable in a separate statement, with:
In the latter case, you only need to set the environment variable once in your current shell, not every time you run the workload.
As the answer to this exercise, please explain why (based on the output you saw) the program only allows one thread at a time into the critical section.
Remove the bodies of your lock() and unlock() functions, and the variables that held the locked and unlocked states (but don't remove the lock variable or the statements that #define the locked and unlocked state values).
The sleep and wakeup mechanism we will use for this second version of userspace synchronization is a system call named futex, which stands for a fast userspace mutex. This system call handles the mechanisms for sleeping and waking processes, but userspace library code must decide how and when to use this capability. Specifically, the futex library function is designed to implement a semaphore on top of an integer. There are three states:
Unlocked: 1
Locked, no process sleeping: 0
At least one process is sleeping: any negative number
Since the futex is designed to implement a semaphore, this means that processes lock and unlock the futex by atomic increments and decrements. When a process claims the futex, it atomically decrements the integer by one. When a process releases the futex, it atomically increments the integer by one. If two processes never conflict, then the value of the futex integer will always be zero or one, and no process will ever have to sleep (and thus, you will never need to make a futex system call).
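The uncontended case described above can be sketched with GCC's atomic add/subtract built-ins (the function names here are hypothetical; 1 means unlocked, per the states above):

```c
/* Uncontended futex-style claim/release using GCC's atomic built-ins. */

/* Claim the futex: atomic decrement.  Returns the new value:
   0 means "claimed, no waiters"; a negative value means contention. */
int futex_claim( volatile int * word )
{
    return __atomic_sub_fetch( word, 1, __ATOMIC_ACQ_REL );
}

/* Release the futex: atomic increment.  Returns the new value:
   1 means "unlocked with no one waiting". */
int futex_release( volatile int * word )
{
    return __atomic_add_fetch( word, 1, __ATOMIC_ACQ_REL );
}
```

Starting from 1 (unlocked), a claim yields 0 and a release yields 1 again, so in this uncontended case no futex system call is ever needed.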
However, if multiple processes try to lock the futex simultaneously, they will
decrement the integer value to be negative. In this case, a process that gets
some value less than zero will want to go to sleep, and the kernel then
must become involved. The semantics and the particulars of this process
are documented in the man pages produced by the following commands:
man 2 futex and
man 7 futex. Note that the
FUTEX_WAIT version of the futex syscall will only put the caller
to sleep if the lock variable whose address was passed to it has the value that
was also passed to that syscall.
Make sure that the
#define values you have declared
for the unlocked and locked state values are consistent with the semantics
described above, and if not update them accordingly. Also make sure that your
lock variable is initialized to be in the unlocked state per those values.
As the answer to this exercise, please explain whether or not you needed to change the #define values in your code, and why or why not.
Implement your lock function according to the following algorithm.
a. Atomically decrement the lock variable: ret_val = __atomic_sub_fetch( ptr, 1, __ATOMIC_ACQ_REL ); where ptr is the address of your lock variable.
b. Check whether ret_val is greater than or equal to zero, and if so the lock function should simply return successfully (since the lock has been acquired).
c. Otherwise (ret_val is less than zero) we need to sleep, so invoke the system call as follows: syscall( SYS_futex, ptr, FUTEX_WAIT, ret_val, NULL ); and then go back to step a.
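The steps above can be sketched as follows (the function name is hypothetical; the state values follow the semaphore convention, with 1 meaning unlocked and 0 meaning locked):

```c
#define _GNU_SOURCE        /* for syscall() */
#include <linux/futex.h>   /* FUTEX_WAIT */
#include <sys/syscall.h>   /* SYS_futex */
#include <unistd.h>        /* syscall() */
#include <stddef.h>        /* NULL */

void my_futex_lock( volatile int * ptr )
{
    for (;;) {
        /* a. Atomically decrement the lock variable. */
        int ret_val = __atomic_sub_fetch( ptr, 1, __ATOMIC_ACQ_REL );

        /* b. A non-negative result means we acquired the lock. */
        if ( ret_val >= 0 )
            return;

        /* c. Sleep -- the kernel only puts us to sleep if *ptr still
           equals ret_val -- then retry from step a once woken up. */
        syscall( SYS_futex, ptr, FUTEX_WAIT, ret_val, NULL );
    }
}
```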
Implement your unlock function according to the following algorithm.
a. Atomically increment the lock variable: ret_val = __atomic_add_fetch( ptr, 1, __ATOMIC_ACQ_REL ); where ptr is the address of your lock variable.
b. If ret_val is 1, no other process was waiting, and the unlock function can simply return.
c. Otherwise (ret_val is less than 1), one or more processes may be sleeping: set the lock variable to the unlocked state with __atomic_store_n( ptr, 1, __ATOMIC_RELEASE ); and then invoke the system call: syscall( SYS_futex, ptr, FUTEX_WAKE, INT_MAX ); to wake them up.
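A sketch of that release algorithm (function name again hypothetical; 1 means unlocked, 0 means locked, negative means sleepers):

```c
#define _GNU_SOURCE        /* for syscall() */
#include <linux/futex.h>   /* FUTEX_WAKE */
#include <sys/syscall.h>   /* SYS_futex */
#include <unistd.h>        /* syscall() */
#include <limits.h>        /* INT_MAX */

void my_futex_unlock( volatile int * ptr )
{
    /* a. Atomically increment the lock variable. */
    int ret_val = __atomic_add_fetch( ptr, 1, __ATOMIC_ACQ_REL );

    /* b. A result of 1 means no one was waiting: we're done. */
    if ( ret_val == 1 )
        return;

    /* c. Otherwise processes may be sleeping: mark the lock unlocked
       and wake every process waiting on it. */
    __atomic_store_n( ptr, 1, __ATOMIC_RELEASE );
    syscall( SYS_futex, ptr, FUTEX_WAKE, INT_MAX );
}
```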
You will need to include additional header files to provide declarations for syscall, the futex constants, etc. (please see the individual man pages in section 2 of the command line reference manual for those, e.g., man 2 syscall), and you also may need to include one or more C library headers that define constants.
As the answer to this exercise, please show your implementations of the new lock and unlock functions.
sudo GOMP_CPU_AFFINITY=0,1,2,3 OMP_NUM_THREADS=4 OMP_PROC_BIND=true trace-cmd record -e sched_switch ./your_spin_lock_executable
Notice that your spin lock is able to do synchronization entirely in userspace, while the futex lock sometimes requires the intervention of the kernel. As the answer to this exercise, please explain whether or not it would be possible to do a sleep lock entirely in userspace (i.e. with no system calls), and why or why not?
Page updated Saturday, November 14, 2020, by Chris Gill.