"Proper synchronization -- locking that is free of deadlocks, scalable, and clean -- requires design decisions from start through finish."
—Robert Love, Linux Kernel Development, 3rd Ed., Chapter 9, pp. 172.
In this assignment, you will implement a kernel module in which kernel threads will run concurrently on your Raspberry Pi to cooperatively compute prime numbers. To do so, you will use kernel memory allocation and deallocation, synchronization, and logging.
As for the previous lab assignment, for this lab you are encouraged to work with one or two other people as a team, though you may complete this assignment individually if you prefer. Teams of more than 3 people are not allowed for this assignment.
In this lab, you will:
ktime.h
header file for a useful library for taking time stamps and measuring elapsed time.
Note that the ktime_t
type is not transparent,
and should therefore only be accessed and manipulated through the ktime library functions.Please read through the grading rubric for the lab assignments in this course, and please keep the points it makes in mind as you design, implement, and test your lab solution.
Please complete the exercises described below. As you work through them, please take notes regarding your observations, and when you are finished please write-up a cohesive report (as in the previous lab assignment) that you will submit along with the other files you will develop for this assignment. Since part of the assignment will be to present graphs of results obtained from trials you will run to assess performance, you may choose to use a non-text file format (e.g., .docx or .pdf) for your report if you'd like.
At the top of your report file, please again list the names of the people who worked together on this lab, and a working e-mail address for each of those people. You are again encouraged to work in groups of 2 or 3 people on this lab assignment, though you may work individually instead if you prefer - groups larger than 3 are not allowed for this assignment.
Your report file should include:
In this lab you will create a kernel module that will (1) allocate kernel memory for an array of integers from 2 up to a specified upper bound; (2) initialize that memory to contain those integers in ascending order; (3) spawn a specified number of kernel threads, which will then go through the array concurrently, "crossing out" numbers that are not primes by setting them to zero; and (4) once all the threads have finished, print out the remaining prime numbers along with several statistics about the module's performance.
Instead of pinning those threads to specific cores, in this lab you will allow them to run wherever Linux puts them. The threads will first wait at a synchronization barrier until all other threads have been successfully spawned and have arrived at the barrier. Once all threads have arrived at the barrier (once), they will each repeat the following operations: (safely) (1) select the next prime number from an array of integers, and (2) "cross out" all of its larger multiples in the array by setting their values to zero (thus marking them as non-prime). In doing so, these threads will implement a cooperative, concurrent (and possibly parallel) version of the Sieve of Eratosthenes algorithm.
Correct but inefficient processing: even the single threaded version of this algorithm has some built-in inefficiencies, in that some non-prime elements in the array (e.g., 6, 10, 12, 14, etc.) may be "crossed out" repeatedly when their different prime factors are visited; the multi-threaded version has an inherent data race that similarly may allow a thread to perform futile processing of a non-prime element (e.g., threads begin processing of 2, 3, and 4 concurrently even though 4 would have been marked earlier as non-prime in the single-threaded version, and not processed when it was reached). Since these inefficiencies impact only performance and not correctness of the algorithm, in this assignment you will record and analyze them instead of trying to fix them (except optionally for extra credit as noted below).
Your kernel module should use the module_param
macro
to define two ulong
module parameters, indicating (1) the
number of threads to be used (must be 1 or greater), and (2) an upper
bound on the range of primes to compute (must be 2 or greater). These
parameters should be named num_threads
and
upper_bound
, respectively. Declare and provide default
values (of 1 and 10 respectively) for two static unsigned
long
variables to store the values for these module parameters,
so that in the event that your module is loaded without parameters,
the module will use a single thread to compute all primes in the range
from 2 to 10 inclusive.
In addition to these parameters, your kernel module should have at least the following global variables that are shared by all of its threads: (1) a pointer to an array of counter variables in which each thread will keep track of how many times it has "crossed out" a non-prime number, (2) a pointer to the array of integers, within which primes will be computed, (3) an integer for the position of the current number (prime in the single-threaded version) that is being processed, and (4) an atomic variable that indicates whether or not the computation of prime numbers has finished in each thread. In addition, your kernel module may have additional global variables to implement correct barrier synchronization (e.g., a counter for how many threads still need to synchronize), protection of any critical sections (e.g., a lock), etc.
How you implement the barrier is up to you, but it must work correctly twice! One straightforward way to do this is using spin locks with both a counter for how many threads threads still need to arrive at the barrier and an atomic variable for possible states of barrier synchronization (no barriers completed, one barrier completed, two barriers completed, etc.) but then you must make sure the module initialization function and the barrier function set/reset these variables appropriately.
The prime computation function should repeatedly (1) safely (as a critical section protected by a mutex or spin lock) store the value of the global position variable in a local variable, and then advance the global position variable until it either reaches another non-zero value in the array or exceeds the last position in the array; (2) if the position in the local variable is greater than the last element in the array then the function should simply return; otherwise the function should (3) safely (as a critical section protected by a mutex or spin lock) go to each number in the array that is a larger multiple of the local position variable, set each of those larger multiples to zero, and increment its counter each time it does that. Note that since each thread has its own independent counter, there should be no need to protect the counter with locks, etc.
Otherwise (if both module parameters are valid), the module's init function should then allocate kernel memory, using
kmalloc()
with GFP_KERNEL
, for an array of
integers large enough to hold all integers from 2 up to (and
including) the number in the upper bound parameter. If that fails,
the init function should print an appropriate message
to the system log, make sure the pointer to the array of integers is
0, and then simply return.
Otherwise (if the previous allocation succeded), the
init function should then allocate additional kernel
memory, again using kmalloc()
with
GFP_KERNEL
, for an array with individual unsigned
integers that will be used as counters, with which each kernel
thread will keep track of how many
"cross out" operations it performs - the size of that array should be
the number of threads, given in the num_threads
module
parameter. If that fails, the init function should
print an appropriate message to the system log and then (before returning)
make sure the pointer to the array of counters is 0, and then either
defer to the module's exit function to do the following, or itself perform
the following: deallocate the memory for the (previously successfully
allocated) integer array using kfree()
, and make sure the
pointer to the array of integers is also 0.
Otherwise, the init function should iterate through the array of per-thread counter variables, setting each one to 0, and then should iterate through the integer array, setting each successive location to the next integer from 2 up to (and including) the upper bound, and then also should set the value of the variable for the current position, to the position that contains the number 2. The init function then should set the atomic variable that indicates whether or not processing has completed, to indicate that processing has not completed.
Finally, the init function should then spawn as many threads as are indicated by the
num_threads
module parameter, passing a pointer to a different element of the
array of counters (for tracking how many "cross out" operations each thread performs) to each thread (as an argument
to that thread's thread function).
Remember that while you debug, it can be useful to monitor the system log in a separate terminal window using a command like one of the following:
tail -f /var/log/syslog
dmesg -wH
Otherwise (if initialization was successful and processing has completed) the module's exit function should:
After that, the module's exit function should
deallocate the memory for both arrays (of integers and counters, respectively) using
kfree()
twice, and then return.
A Note on Kernel Thread Termination
Kernel threads, unlike userspace threads, exist on their own, and are cleaned up after their thread function returns. This has a couple of implications.
First, if a kernel module creates a kernel thread, the module does not necessarily have to call kthread_stop()
to force the thread to end.
As long as the thread function is guaranteed to return without interference,
the kernel module does not have to call kthread_stop() in its exit function.
However, if the thread puts itself to sleep (i.e. through a call to schedule()
),
then a call to kthread_stop()
is necessary: kthread_stop()
wakes up the kernel thread,
allowing it to resume execution. It also allows the function kthread_should_stop()
to return true if called within the thread.
Finally, because threads are cleaned up after exiting, calls to kthread_stop()
that take a pointer to a thread which has already exited are problematic.
The pointer may no longer be valid, or because of the behavior of the kernel's slab allocator,
it may even point to a different thread.
So, considering all of this, an easy rule of thumb is that all three of these functions should exist together:
kthread_stop()
(called outside of the thread by a module function)kthread_should_stop()
(checked within the thread)schedule()
(called within the thread)If any one is present, then the others probably should be there as well.
Then, make a copy of your module and in that copy: (1) change the type of element in the array of numbers to be sieved, to be
atomic_t
(if a different type was used previously to store integers),
(2) remove the lock from the second critical section of the prime computation function, i.e. the section that sets multiples of the local position to 0, and
(3) re-implement this section to use atomic operations such as atomic_set()
, atomic_add()
,
atomic_read()
, etc. to (safely) perform all reading and writing of elements in the array.
Also compile and test your new module that uses atomic operations, and when you are satisfied that it runs correctly, add a section titled "Module Design and Implementation" to your report, and in it describe how you implemented the features of this assignment, including locking versus atomic operations in the first and second modules.
Add a section titled "Module Performance" to your report, and in it summarize what trials you ran, and what trends you noticed in (1) the timing of the initialization and prime computation sections for each module and (2) the number of unnecessary "cross out" operations performed, for different numbers of threads and different upper bounds, in each module. Please be sure to include tests that include multiple threads per cpu and potentially large upper bounds of your Sieve.
For each of the modules, please graph (e.g., using Excel or gnuplot) the timing results as follows: completion time (initialization time plus prime computation time) on the y-axis vs. the upper bound on the range of primes on the x-axis, with a separate curve ("data series" in Excel) for each different number of threads used - leaving out curves that are essentially identical.
For each of the modules, please graph (e.g., using Excel or gnuplot) the efficiency results as follows: number of unnecessary "cross out" operations on the y-axis vs. the upper bound on the range of primes on the x-axis, with a separate curve ("data series" in Excel) for each different number of threads used - leaving out curves that are essentially identical. In your report please at least give the names of the graph files you are turning in with those plots, and if you would like to use a non-text format please also feel free to include pictures of the plots with your discussion. At the end of that section please compare and contrast the trends you saw with the two different modules, and discuss briefly why any trends differed and why any other trends were similar between the modules.
Things to turn in:
EXTRA CREDIT
Implement one of the following options (worth up to 5 percent of the value of the lab assignment):
Compile your additional modules, and on your Raspberry Pi run representative trials that show how well those new versions perform compared to the original versions, in terms of initialization and processing times as well as in terms of how many unnecessary "cross out" operations were performed, for different numbers of threads and different upper bounds on the range of primes to compute.
Add a section titled "Extra Credit" to your report, and in that section please document (1) how you designed and implemented this feature, (2) output results (and/or text or figures summarizing them), and (3) a brief analysis of what those results say about the strengths and weaknesses of the original versus new versions. Please also submit the additional source code files for the new modules with your report and other items noted above.
Compile your additional modules, and on your Raspberry Pi run representative trials that illustrate how well this new version performs compared to both of the original versions (with a simple locking stratgy versus with atomic operations), in terms of initialization and processing times for different numbers of threads, different upper bounds on the range of primes to compute, and different numbers of locks.
Add a section titled "Extra Credit" to your report, and in that section please document (1) how you designed and implemented this feature, (2) output results (and/or text or figures summarizing them), and (3) a brief analysis of what those results say about the strengths and weaknesses of the original versions versus this new one and about how the choice of the number of locks may affect performance. Please also submit the additional source code file for the new module with your report and other items noted above.
Build and run your new module and test its correctness. Also profile its performance and compare it to how the other modules you implemented performed.
Add a section titled "Extra Credit" to your report, and in that section please document (1) how you designed and implemented this feature, (2) output results (and/or text or figures summarizing them), and a brief comparison of the module's performance compared to the performance of the other modules you implemented, and (3) a brief explanation of why the new implementation remains safe.
Last updated Thursday March 10th, 2022 by James Orr.