"Although it would be nice if every critical region consisted of code that did nothing more complicated than incrementing a variable, reality is much crueler."
—Robert Love, Linux Kernel Development 3rd Ed., Chapter 10, p. 183
Locking primitives are important user-space tools for concurrent and parallel programming. Two main types of locks exist: locks that spin while a program waits to acquire the lock, and locks that cause the program to sleep while it waits. Spinlocks consume CPU cycles while a process waits, but are well suited to low-latency applications in which critical sections are very short. Sleeping locks cost fewer CPU cycles, but may incur longer latency as processes sleep and wake up.
In this studio, you will build and compare two user-space locks: a spin lock and a sleeping lock based on the futex system call.
Please complete the required exercises below, as well as any optional enrichment exercises that you wish to complete. We encourage you to work in groups of 2 or 3 people on each studio (the groups are allowed to change from studio to studio), though you may complete any studio by yourself if you prefer.
As you work through these exercises, please record your answers, and when finished, upload them along with the relevant source code to the appropriate spot on Canvas. If you work in a group, only one member should upload the answers (and any other requested files) for the studio; if you need to resubmit (e.g., if a studio is marked incomplete when we grade it), the same person who submitted the studio originally should do so.
Make sure that the name of each person who worked on these exercises is listed in the first answer, and number each of your responses so it is easy to match your responses with the exercises.
This studio uses the workload.c program, which produces several seconds of work on each processor in a system. Build and run the program on your Raspberry Pi (see the comment in the program's code for the command line you will need to use to build it), and verify that it occupies all of your system's cores by using top and pressing 1 (or, even better, by using htop or kernelshark).
As the answer to this exercise, please explain why, based on the output of the top utility (or of htop or kernelshark), you think the program is occupying all of the cores.
Multiple threads can execute the critical_section() function concurrently (or even simultaneously, if they are running on different cores). This is undesirable, since critical sections typically protect important shared data. First we will build a spin lock to protect access to the critical_section() function.
Write initially empty lock and unlock functions. Each of these functions should take a pointer to a volatile integer.
Note: Recall that the volatile specifier tells the compiler that the value of a variable may change unexpectedly. In this case, the compiler interprets the volatile int * type declaration to mean "a pointer to an int that's volatile": the value pointed at by the pointer may change unexpectedly, not the pointer itself.
To treat an integer like a lock, we need to define two values that represent the locked and unlocked states, respectively. Define these at the top of your program with preprocessor #define directives. Also, declare an integer variable for the state of the lock itself, and initialize the variable to be in the unlocked state.
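For concreteness, a minimal sketch of these declarations might look like the following (the names and the particular values 0 and 1 are illustrative choices, not requirements):

```c
/* Illustrative lock-state values; any two distinct integers work for now. */
#define UNLOCKED 0
#define LOCKED   1

/* The shared lock variable, initialized to the unlocked state. */
volatile int the_lock = UNLOCKED;

/* Initially empty lock and unlock functions, to be filled in later. */
void lock(volatile int *l)   { (void)l; }
void unlock(volatile int *l) { (void)l; }
```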
Inside the parallel region of the program, insert calls to the lock and unlock functions before and after the critical section of the workload.c program, respectively, passing the address of the lock variable into each of those calls.
As the answer to this exercise, please show your statements to declare the two values and the initial value of the lock.
To implement the lock and unlock functions, we'll use GCC's built-in atomic functions. If we were working in C++, we could use C++11's atomic operations. If we didn't have access to GCC, or if speed were very critical, we could implement these with assembly instructions. The atomic built-in functions are documented here.
For the spin lock we will use the function __atomic_compare_exchange(). The first three arguments determine the meaning of this function: ptr, expected, and desired. When called, this function atomically compares the contents of the location pointed to by ptr with the contents of the variable pointed to by expected, and if they are equal, writes the value of the variable pointed to by desired into the location pointed to by ptr; it returns true on success and false on failure. The last three arguments specify whether a weak exchange is permitted and which memory orderings to use, but we'll just opt for a strong exchange with conservative orderings for this studio, as in:
__atomic_compare_exchange( ptr, expected, desired, 0, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE)
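A small, self-contained illustration of these semantics, assuming GCC (the try_cas helper and the variable names are hypothetical, introduced only for this sketch):

```c
/* A shared integer that try_cas operates on. */
int value = 0;

/* Attempt to atomically change value from *expected to desired.
   Returns nonzero on success; on failure, *expected is overwritten
   with the current contents of value. */
int try_cas(int *expected, int desired) {
    return __atomic_compare_exchange(&value, expected, &desired, 0,
                                     __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}
```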
To implement the lock function, you should check for the unlocked state and write the value of the locked state. However, it's possible that this exchange will fail (for example, if another process already holds the lock). Thus, your lock function should attempt to swap the state of the lock variable, and keep retrying until it succeeds. WARNING: if the exchange fails, the function overwrites the value of the variable pointed to by expected with the current contents of the location pointed to by ptr! Please make sure to re-initialize the expected variable's value as needed.
Note that only the lock variable needs to be accessible at a scope outside of the lock and unlock functions. Declaring the expected and desired variables outside of those functions makes them susceptible to races; instead, declare (and initialize) the expected and desired variables inside those functions, so that they exist only within the stack frame of a single call to one of those functions.
Implement the unlock function using the same __atomic_compare_exchange() function. However, since we only expect the unlock function to be called when we already hold the lock, it should succeed unless something is drastically wrong. If the call to __atomic_compare_exchange() fails, then rather than retrying, your code should print an error message and return.
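Putting the pieces together, one possible sketch of the two functions, assuming the illustrative UNLOCKED/LOCKED values 0 and 1 (your names and values may differ):

```c
#include <stdio.h>

#define UNLOCKED 0
#define LOCKED   1

volatile int the_lock = UNLOCKED;

void lock(volatile int *l) {
    int expected = UNLOCKED;
    int desired  = LOCKED;
    /* Retry until the exchange succeeds; a failed exchange overwrites
       expected with the lock's current value, so reset it each time. */
    while (!__atomic_compare_exchange(l, &expected, &desired, 0,
                                      __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE)) {
        expected = UNLOCKED;
    }
}

void unlock(volatile int *l) {
    int expected = LOCKED;
    int desired  = UNLOCKED;
    /* Should succeed on the first try; failure means we didn't hold the lock. */
    if (!__atomic_compare_exchange(l, &expected, &desired, 0,
                                   __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE)) {
        fprintf(stderr, "unlock failed: lock value was %d\n", expected);
    }
}
```

Resetting expected inside the loop matters because each failed exchange replaces it with the lock's current (locked) value.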
As the answer to this exercise, please show your code for both the lock and unlock functions.
Run the program with its OpenMP threads bound to specific cores by setting the GOMP_CPU_AFFINITY environment variable:
GOMP_CPU_AFFINITY=0,1,2,3 ./workload
You can also set the environment variable in a separate statement, with:
export GOMP_CPU_AFFINITY=0,1,2,3
./workload
In the latter case, you only need to set the environment variable once in your current shell, not every time you run the workload.
As the answer to this exercise, please explain why (based on the output you saw) the program only allows one thread at a time into the critical section.
Now empty out the bodies of your lock() and unlock() functions, along with the expected and desired variables that held the locked and unlocked state values (don't remove the lock variable or the statements that #define the locked and unlocked state values).
The sleep and wakeup mechanism we will use for this second version of userspace synchronization is a system call named futex, which stands for a fast userspace mutex. This system call handles the mechanisms for sleeping and waking processes, but userspace library code must decide how and when to use this capability. Specifically, the futex library function is designed to implement a semaphore on top of an integer. There are three states:
Unlocked: 1
Locked: 0
At least one process is sleeping: any negative number
Since the futex is designed to implement a semaphore, this means that processes lock and unlock the futex by atomic increments and decrements. When a process claims the futex, it atomically decrements the integer by one. When a process releases the futex, it atomically increments the integer by one. If two processes never conflict, then the value of the futex integer will always be zero or one, and no process will ever have to sleep (and thus, you will never need to make a futex system call).
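The uncontended path can be sketched with the atomic add/subtract built-ins alone (claim and release are hypothetical names used only for this sketch; 1 means unlocked, per the states above):

```c
int futex_word = 1;  /* 1 = unlocked, 0 = locked, negative = sleepers */

/* Claim the futex: atomically decrement; returns the new value. */
int claim(void)   { return __atomic_sub_fetch(&futex_word, 1, __ATOMIC_ACQ_REL); }

/* Release the futex: atomically increment; returns the new value. */
int release(void) { return __atomic_add_fetch(&futex_word, 1, __ATOMIC_ACQ_REL); }
```

As long as claims and releases strictly alternate, the value stays at 0 or 1 and no futex system call is ever needed.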
However, if multiple processes try to lock the futex simultaneously, they will
decrement the integer value to be negative. In this case, a process that gets
some value less than zero will want to go to sleep, and the kernel then
must become involved. The semantics and the particulars of this process
are documented in the man pages produced by the commands man 2 futex and man 7 futex. Note that the FUTEX_WAIT version of the futex syscall will only put the caller to sleep if the lock variable whose address was passed to it still has the value that was also passed to that syscall.
Make sure that the #define values you have declared for the unlocked and locked states are consistent with the semantics described above, and if not, update them accordingly. Also make sure that your lock variable is initialized to be in the unlocked state per those values.
As the answer to this exercise, please explain whether or not you needed to change the #define values in your code, and why or why not.
Implement your lock function according to the following algorithm.
a. Atomically decrement the lock variable and capture the result:
ret_val = __atomic_sub_fetch( ptr, 1, __ATOMIC_ACQ_REL );
where ptr is the address of your lock variable.
b. Check whether ret_val is greater than or equal to zero, and if so the lock function should simply return successfully (since the lock has been acquired).
c. Otherwise (ret_val is less than zero), we need to sleep, so invoke the system call as follows:
syscall( SYS_futex, ptr, FUTEX_WAIT, ret_val, NULL );
and then go back to step a.
Implement your unlock function according to the following algorithm.
a. Atomically increment the lock variable and capture the result:
ret_val = __atomic_add_fetch( ptr, 1, __ATOMIC_ACQ_REL );
where ptr is the address of your lock variable.
b. If ret_val is exactly 1, no process was sleeping on the lock, and the unlock function can simply return.
c. Otherwise, at least one process may be sleeping, so store the unlocked value with:
__atomic_store_n( ptr, 1, __ATOMIC_RELEASE );
and then invoke the system call:
syscall( SYS_futex, ptr, FUTEX_WAKE, INT_MAX );
You will need to include additional header files to provide declarations for syscall, futex, etc. (please see the individual man pages in section 2 of the command-line reference manual for those, e.g., man 2 syscall), and you also may need to include one or more C library headers that define constants (e.g., limits.h for INT_MAX).
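Assembling the algorithm and headers above, a Linux-only sketch might look like the following (the exact set of includes can vary by system, and the names are illustrative):

```c
#define _GNU_SOURCE
#include <limits.h>       /* INT_MAX */
#include <unistd.h>       /* syscall() */
#include <sys/syscall.h>  /* SYS_futex */
#include <linux/futex.h>  /* FUTEX_WAIT, FUTEX_WAKE */

/* Lock states: 1 = unlocked, 0 = locked, negative = at least one sleeper. */

void lock(volatile int *ptr) {
    for (;;) {
        int ret_val = __atomic_sub_fetch(ptr, 1, __ATOMIC_ACQ_REL);
        if (ret_val >= 0)
            return;  /* acquired without sleeping */
        /* Sleep only while *ptr still holds ret_val, then go back to step a. */
        syscall(SYS_futex, ptr, FUTEX_WAIT, ret_val, NULL);
    }
}

void unlock(volatile int *ptr) {
    int ret_val = __atomic_add_fetch(ptr, 1, __ATOMIC_ACQ_REL);
    if (ret_val != 1) {  /* someone may be sleeping */
        __atomic_store_n(ptr, 1, __ATOMIC_RELEASE);
        syscall(SYS_futex, ptr, FUTEX_WAKE, INT_MAX);
    }
}
```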
As the answer to this exercise, please show your implementations of the new lock and unlock functions.
sudo GOMP_CPU_AFFINITY=0,1,2,3 OMP_NUM_THREADS=4 OMP_PROC_BIND=true trace-cmd record -e sched_switch ./your_spin_lock_executable
Notice that your spin lock is able to do synchronization entirely in userspace, while the futex lock sometimes requires the intervention of the kernel. As the answer to this exercise, please explain whether or not it would be possible to do a sleep lock entirely in userspace (i.e., with no system calls), and why or why not.