CSE 422S: Studio 9

Build Your Own Locks

"Although it would be nice if every critical region consisted of code that did nothing more complicated than incrementing a variable, reality is much crueler."

—Robert Love, Linux Kernel Development 3rd Ed., Chapter 10, pp. 183

Locking primitives are important user space tools for concurrent and parallel programming. Two main types of locks exist: locks that spin while a program is waiting to acquire the lock, versus those that cause the program to sleep. Spinlocks consume CPU cycles as a process waits, but are well suited for low-latency applications when critical sections are very short. Other locks allow processes to sleep while waiting, which costs fewer CPU cycles but may result in longer latency with processes sleeping and waking up.

In this studio, you will:

  1. Build a userspace spin lock, using atomic instructions
  2. Build a userspace sleep lock, using atomic instructions and futexes

Please complete the required exercises below, as well as any optional enrichment exercises that you wish to complete. We encourage you to please work in groups of 2 or 3 people on each studio (and the groups are allowed to change from studio to studio) though if you would prefer to complete any studio by yourself that is allowed.

As you work through these exercises, please record your answers, and when finished upload them along with the relevant source code to the appropriate spot on Canvas. If you work in a group with other people, only one of you should please upload the answers (and any other files requested) for the studio, and if you need to resubmit (e.g., if a studio were to be marked incomplete when we grade it) the same person who submitted the studio originally should please do that.

Make sure that the name of each person who worked on these exercises is listed in the first answer, and make sure you number each of your responses so it is easy to match your responses with each exercise.

Required Exercises

  1. As the answer to the first exercise, please list the names of the people who worked together on this studio.

  2. Download the workload.c program. This program produces several seconds of work on each processor in a system. Build and run the program on your Raspberry Pi (see the comment in the program's code for the command line you will need to use to build it) and verify that it occupies all of your system's cores by using top and pressing 1 (or even better, using htop or kernelshark).

    As the answer to this exercise, please explain why (based on the output of the top utility (or of htop or kernelshark) you think the program is occupying all of the cores.

  3. Right now, all threads can execute the critical_section() function concurrently (or even simultaneously if they are on different cores). This is undesirable, since critical sections typically protect important shared data. First we will build a spin lock to protect access to the critical_section() function.

    Write an initially empty lock and unlock function. Each of these functions should take a pointer to a volatile integer.

    Note: Recall that the volatile specifier tells the compiler that the value of a variable may change. In this case, the compiler interprets the volatile int * type declaration to mean "a pointer to an int that's volatile" (the value pointed at by the pointer may change unexpectedly, not that the pointer itself may change.)

    To treat an integer like a lock, we need to define two values that represent the locked and unlocked states, respectively. Define these at the top of your program with a pre-compiler #define statement. Also, declare an integer variable for the state of the lock itself, and initialize the variable to be in the unlocked state.

    Inside the parallel region of the program, insert calls to the lock and unlock functions before and after the critical section of the workload.c program, respectively, passing the address of the lock variable into each of those calls.

    As the answer to this exercise, please show your statements to declare the two values and the initial value of the lock.

  4. In order to implement the lock and unlock functions we'll use GCC's built-in atomic instructions. If we were working in C++ we could use C++ 11's atomic instructions. If we didn't have access to GCC, or if speed was very critical, we could implement these with assembly instructions.

    The atomic built-in functions are documented here. For the spin lock we will use the function __atomic_compare_exchange() . The first three arguments determine the meaning of this function: ptr, expected, and desired. When called, this function atomically compares the contents of the location pointed to by ptr with the contents of the variable pointed to by expected, and if they are equal, writes the value of the variable pointed to by desired into the location pointed to by ptr. The last three arguments specify special memory orderings, but we'll just opt for a strong ordering for this studio, as in:

    __atomic_compare_exchange( ptr, expected, desired, 0, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE)

    To implement the lock function, you should check for the unlocked state and write the value of the locked state. However, it's possible that this function will fail (for example, if another process already holds the lock). Thus, your lock function should attempt to swap the state of the lock variable, and continue retrying until it succeeds. WARNING: If it fails, the function may overwrite the value of the variable pointed to by expected! Please make sure to re-initialize the expected variable's value as needed.

    Note that only the lock variable needs to be accessible at a scope outside of the lock and unlock functions, and that declaring the expected and desired variables outside of those functions makes them susceptible to races - instead, you should please declare (and initialize) the expected and desired variables inside of those functions, so that they only exist within the stack frame for a single call to one of those functions.

    Implement the unlock function using the same __atomic_compare_exchange() function. However, since we only expect the unlock function to be called when we already hold the lock, it should succeed unless something is drastically wrong. If the call to __atomic_compare_exchange() fails, then rather than retrying the function, your code should print out an error message and return.

    As the answer to this exercise, please show your code for both the lock and unlock functions.

  5. Compile and run your program on your Raspberry Pi, and use each thread's finishing statement to verify that only one thread enters the critical section at a time. To ensure that each thread of the program runs on a separate core of your Raspberry Pi, run it with an environment variable as follows:

    GOMP_CPU_AFFINITY=0,1,2,3 ./workload

    You can also set the environment variable in a separate statement, with:

    export GOMP_CPU_AFFINITY=0,1,2,3

    In the latter case, you only need to set the environment variable once in your current shell, not every time you run the workload.

    As the answer to this exercise, please explain why (based on the output you saw) the program only allows one thread at a time into the critical section.

  6. Now we will implement a sleep lock. The lock described above consumes processor time while it's waiting, because it continually retries the lock operation until it succeeds. To implement our sleep lock, we will replace this behavior with one where we try to acquire the lock, but if we fail, the thread sleeps until it is later woken up. To begin, make a copy of your program and delete the bodies of the lock() and unlock() functions, and the variables that held the locked and unlocked states (don't remove the lock variable or the statements that #define the locked and unlocked state values).

    The sleep and wakeup mechanism we will use for this second version of userspace synchronization is a system call named futex, which stands for a fast userspace mutex. This system call handles the mechanisms for sleeping and waking processes, but userspace library code must decide how and when to use this capability. Specifically, the futex library function is designed to implement a semaphore on top of an integer. There are three states:

    At least one process is sleeping:any negative number

    Since the futex is designed to implement a semaphore, this means that processes lock and unlock the futex by atomic increments and decrements. When a process claims the futex, it atomically decrements the integer by one. When a process releases the futex, it atomically increments the integer by one. If two processes never conflict, then the value of the futex integer will always be zero or one, and no process will ever have to sleep (and thus, you will never need to make a futex system call).

    However, if multiple processes try to lock the futex simultaneously, they will decrement the integer value to be negative. In this case, a process that gets some value less than zero will want to go to sleep, and the kernel then must become involved. The semantics and the particulars of this process are documented in the man pages produced by the following commands: man 2 futex and man 7 futex. Note that the FUTEX_WAIT version of the futex syscall will only put the caller to sleep if the lock variable whose address was passed to it has the value that was also passed to that syscall.

    Make sure that the #define values you have declared for the unlocked and locked state values are consistent with the semantics described above, and if not update them accordingly. Also make sure that your lock variable is initialized to be in the unlocked state per those values.

    As the answer to this exercise, please explain whether or not you needed to change the #define values in your code, and why or why not.

  7. Implement your lock function according to the following algorithm.

    1. Decrement the lock variable with ret_val = __atomic_sub_fetch( ptr, 1, __ATOMIC_ACQ_REL ); where ptr is the address of your lock variable.
    2. Check whether ret_val is greater than or equal to zero, and if so the lock function should simply return successfully (since the lock has been acquired).
    3. Otherwise (if ret_val is less than zero) we need to sleep, so invoke the system call as follows: syscall( SYS_futex, ptr, FUTEX_WAIT, ret_val, NULL ); and then go back to step a.

    Implement your unlock function according to the following algorithm.

    1. Increment the lock integer with ret_val = __atomic_add_fetch( ptr, 1, __ATOMIC_ACQ_REL ); where ptr is the address of your lock variable
    2. Check to see if the return value is 1 (unlocked), and if so the unlock function should simply return successfully
    3. Otherwise, we need to set the lock into the unlocked state and then wake up any sleeping threads. First set the lock integer to 1 with __atomic_store_n( ptr, 1, __ATOMIC_RELEASE ); and then invoke the system call: syscall( SYS_futex, ptr, FUTEX_WAKE, INT_MAX );

    You will need to include additional header files to provide declarations for syscall, futex, etc. (please see the individual man pages in section 2 of the command line reference manual for those, e.g., man 2 syscall, etc.), and you also may need to include one or more C libraries that define constants.

    As the answer to this exercise, please show your implementations of the new lock and unlock functions.

  8. Run your program and verify that only one thread is able to access the critical section at a time.

    As the answer to this exercise, please explain why you think only one thread at a time can access the critical section.

  9. Trace both versions of your program, as in

    sudo GOMP_CPU_AFFINITY=0,1,2,3 OMP_NUM_THREADS=4 OMP_PROC_BIND=true trace-cmd record -e sched_switch ./your_spin_lock_executable

    and take a screen shot showing both behaviors.

    Notice that your spin lock is able to do synchronization entirely in userspace, while the futex lock sometimes requires the intervention of the kernel. As the answer to this exercise, please explain whether or not it would be possible to do a sleep lock entirely in userspace (i.e. with no system calls), and why or why not?

Things to turn in