"I wish it need not have happened in my time," said Frodo.
"So do I," said Gandalf, "and so do all who live to see such times. But that
is not for them to decide. All we have to decide is what to do with the time
that is given us."
—The Fellowship of the Ring , Book 1, Chapter 2
Benchmarking programs can give important insights into how they perform (including where potential performance bottlenecks may exist) under different conditions. In addition, benchmarking in userspace serves as an introduction to the concepts behind benchmarking in the kernel, which in combination allows us to measure the impact of the kernel on userspace performance and vice versa.
In this studio, you will:
Please complete the required exercises below, as well as any optional enrichment exercises that you wish to complete. We encourage you to work in groups of 2 or 3 people on each studio (groups may change from studio to studio), though you may complete any studio by yourself if you prefer.
As you work through these exercises, please record your answers, and when finished upload them along with the relevant source code to the appropriate spot on Canvas. If you work in a group, only one of you should upload the answers (and any other files requested) for the studio, and if you need to resubmit (e.g., if a studio is marked incomplete when we grade it), the same person who submitted the studio originally should do that.
Make sure that the name of each person who worked on these exercises is listed in the first answer, and make sure you number each of your responses so it is easy to match your responses with each exercise.
Note: you will do all of today's userspace timing and benchmarking exercises on shell.cec.wustl.edu rather than on your Raspberry Pi.
The final exercises on in-kernel timing will require you to load a kernel module on your Pi.
As the answer to the first exercise, list the names of the people who worked together on this studio.
First, we need some programs to benchmark. Please download a code package that includes five programs here.
Use sftp to transfer that file over to shell.cec.wustl.edu.
Once you have done that, log in to shell.cec.wustl.edu, then connect to the Linux Lab cluster by issuing the command:
qlogin -q all.q
Now, unzip the package, build the programs with the Makefile that is provided in it (i.e., simply by issuing the command make in that directory), and run each program a few times. As the answer to this exercise, describe briefly what each program does.
The command for timing programs is time. This is actually a special command built into the bash shell, so its documentation can be found under man 1 bash. Use that command to capture the timing of a few runs of each of the test programs, using different parameter values so that they take more or less time to run. As the answer to this exercise, show the results of those runs.
The time command outputs three different pieces of timing information. As the answer to this exercise, please say what they are and briefly explain the differences among them.
Run the two commands below, and as the answer to this exercise explain the relationship between the user and real timing information for these runs. In particular, what could explain why (especially for programs that involve parallel execution) the time command might report a user time value that is larger than the real time value?
time ./dense_mm 1000
time ./parallel_dense_mm 1000
Examine sing.c and execute the following command:
time ./sing 1000
As the answer to this exercise, explain the relationship between the user and sys timing information for that program, especially in comparison to their relationship when you time another command like
time ./dense_mm 1000
The resolution of a clock can be queried with the function clock_getres. You can find the documentation for this function in the manual pages: man 2 clock_getres.
Warning: Internet versions of man pages may not be up to date. Use the version on shell.cec.wustl.edu or another Linux server that you know to be up to date.
Look through the clocks available at the man 2 clock_getres page. As the answer to this exercise, name a clock that would be well suited for userspace benchmarking (and explain briefly why), and then name a clock that would be poorly suited for userspace benchmarking (and explain briefly why).
Use clock_getres to write a short program called getres.c that gives the resolutions for several different clock types. This function requires a structure called a timespec, which is also documented in the man 2 clock_getres page and is the basic data structure used to report timing information from the kernel to userspace. As the answer to this exercise, copy and paste your program output (include at least one _COARSE clock type) and explain briefly what is meant by the resolution values that were output by your program.
As the answer to this exercise, explain how CLOCK_MONOTONIC and CLOCK_MONOTONIC_COARSE differ, and why.
Write a short program called getdelay.c that uses the function clock_gettime to figure out how long a call to clock_gettime takes. As the answer to this exercise, report this value, and describe briefly how you obtained it.
Copy parallel_dense_mm.c into a new file called timed_parallel_dense_mm.c. First modify the code in the new file so that you time the critical computational loop with the CLOCK_MONOTONIC_RAW clock. Then modify the code again so that the program takes a second parameter (which defaults to 1) and executes the timed segment multiple times. Your program should output the minimum, mean, and maximum times over all timed iterations.
Run your program for 100 iterations with matrix size 100. As the answer to this exercise, show the reported timing values, and based on the minimum, average (mean or median), and maximum timing values, say briefly what you think a reasonable estimate of the "common case" running time actually is, and why.
As we did in the modules studio, save a copy of simple_module.c, but rename it to ktime_module.c.
Modify the module as follows:
Declare two global variables of type ktime_t. Call these init_time and exit_time.
Use one of the ktime_get* functions described in the ktime accessors documentation in both your module init and exit functions to set the corresponding ktime_t global variables.
In the module exit function, use the ktime_sub macro to get the elapsed time between module initialization and exit.
Use the ktime_to_timespec64 macro, which returns a timespec64 structure (similar to timespec), to convert the elapsed time to a printable type.
Add a printk() statement in the module exit function to print the seconds and nanoseconds elapsed from module initialization to exit.
Now, update your Makefile so it contains the appropriate .o file target for your new module, and then issue the following commands to generate the .ko file for your new module:
module add arm-rpi
and then
LINUX_SOURCE=path to your Linux kernel source code
make -C $LINUX_SOURCE ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- M=$PWD modules
Then, use sftp to copy the generated .ko file over to your Raspberry Pi, and load the module using sudo insmod. Wait a few seconds, then unload the module using sudo rmmod. Then, use dmesg to confirm that the module is working correctly.
As the answer to this exercise, please (1) tell us which ktime_get* function you used, and why, and (2) show us the last few lines of system log output, including your module's message stating how long it had remained loaded.
In addition to the answers above, please submit:
getres.c
getdelay.c
timed_parallel_dense_mm.c
ktime_module.c
As the answer to any of these optional exercises that you would like to try, please briefly describe what you learned (and give answers to any questions the exercise contains).
Another benchmarking technique involves reading directly from the Time Stamp Counter (TSC) on the Intel x86 CPU.
This register can be accessed using the rdtsc instruction, as in the following code:
static inline unsigned long long rdtsc_get(void)
{
    unsigned long high, low;
    /* rdtsc places the low 32 bits in EAX and the high 32 bits in EDX */
    asm volatile ("rdtsc" : "=a" (low), "=d" (high));
    return ((unsigned long long) low) | (((unsigned long long) high) << 32);
}
Modify timed_parallel_dense_mm.c to time the critical computational loop using the value returned by the provided rdtsc_get function. Your program should output the minimum, mean, and maximum times over all timed iterations. As the answer to this exercise, please submit your modified timed_parallel_dense_mm.c, and show the reported timing values.