CSE 422S: Studio 6

Timing and Benchmarking

"I wish it need not have happened in my time," said Frodo.
"So do I," said Gandalf, "and so do all who live to see such times. But that is not for them to decide. All we have to decide is what to do with the time that is given, us."

The Fellowship of the Ring , Book 1, Chapter 2

Benchmarking programs can give important insights into how they perform (including where potential performance bottlenecks may exist) under different conditions. In addition, benchmarking in userspace serves as an introduction to the concepts behind benchmarking in the kernel, which in combination allows us to measure the impact of the kernel on userspace performance and vice versa.

In this studio, you will:

  1. Benchmark userspace programs using command line tools
  2. Benchmark userspace programs using Linux's clock functions
  3. Use the same clocks to measure elapsed time in a kernel module

Please complete the required exercises below, as well as any optional enrichment exercises that you wish to complete. We encourage you to please work in groups of 2 or 3 people on each studio (and the groups are allowed to change from studio to studio) though if you would prefer to complete any studio by yourself that is allowed.

As you work through these exercises, please record your answers, and when you finish them each and every person who worked on them should please log into Canvas, select this course in this semester, and then upload a file containing them and also upload any other files the assignment asks for, and submit those files for this studio assignment (there should be a separate submission from each person who worked together on them).

Make sure that the name of each person who worked on these exercises is listed in the first answer, and make sure you number each of your responses so it is easy to match your responses with each exercise.

Required Exercises

Note: you will do all of today's userspace timing and benchmarking exercises on shell.cec.wustl.edu rather than on your Raspberry Pi. The final exercises on in-kernel timing will require you to load a kernel module on your Pi.

  1. As the answer to the first exercise, list the names of the people who worked together on this studio.

  2. First, we need some programs to benchmark. Please download a code package that includes five programs here.

    Use sftp to transfer that file over to shell.cec.wustl.edu.

    Once you have done that, log in to shell.cec.wustl.edu, then connect to the Linux Lab cluster by issuing the command:

    qlogin -q all.q

    Now, unzip the package, build the programs with the Makefile that is provided in it (i.e., simply by issuing the command make in that directory), and run each program a few times. As the answer to this exercise, describe briefly what each program does.

  3. Coarse grained benchmarking can be done directly from the command line, using a bash command named time. This is actually a special command built into the bash shell, so its documentation can be found under man 1 bash. Use that command to capture the timing of a few runs of each of the test programs, using different parameter values so that they take more or less time to run. As the answer to this exercise, show the results of those runs.

  4. The time command outputs three different pieces of timing information. As the answer to this exercise please say what they are and briefly explain the differences among them.

  5. Now compare the results of the following two commands (note that they may take a while to complete). As the answer to this exercise please describe and briefly explain what you observe about the relationships among the user and real timing information for these runs. Especially, what could explain why (especially for programs that involved parallel execution) the time command might report a user time value that is larger than the real time value?

  6. Look at the code in sing.c and execute the following command. As the answer to this exercise, please describe and explain what you observed about the relationship between the user and sys timing information for that program, especially in comparison to their relationship when you time another command like (for example).

  7. Now we're going to switch to using the C API for Linux's clocks. First, we'll look at exactly what clocks are available and get some info about each one, with the function clock_getres. You can find the documentation for this function in the manual pages: man 2 clock_getres. Warning: Internet versions of man pages may not be up to date. Use the version on shell.cec.wustl.edu or another Linux server that you know to be up to date.

    Look through the clocks available at the man 2 clock_getres page. As the answer to this exercise, name a clock that would be well suited for userspace benchmarking (and explain briefly why), and the name a clock that would be poorly suited for userspace benchmarking (and explain briefly why).

  8. Next, use clock_getres to write a short program called getres.c that gives the resolutions for several different clock types. This function requires a structure called a timespec, which is also documented in the man 2 clock_getres page and is the basic data structure used to report timing information from the kernel to userspace.

    As the answer to this exercise, copy and paste your program output (include at least one _COARSE clock type) and explain briefly what is meant by the resolution values that were output by your program.

  9. Based on the descriptions of Linux timers, the HZ variable, and jiffies from chapter 11 of the LKD text book (the assigned readings for today), as the answer to this exercise please explain briefly what you think the difference is between CLOCK_MONOTONIC and CLOCK_MONOTONIC_COARSE, and why.

  10. Write a second short program called getdelay.c that uses the function clock_gettime to figure out how long a call to clock_gettime takes. As the answer to this exercise, report this value, and describe briefly how you obtained it.

  11. Copy parallel_dense_mm.c into a new file called timed_parallel_dense_mm.c. First modify the code in the new file so that you time the critical computational loop with the CLOCK_MONOTONIC_RAW clock. Then modify the code again so that the program takes a second parameter (which defaults to 1) and executes the timed segment multiple times. Your program should output the minimum, mean, and maximum times over all timed iterations.

    Run your program for 100 iterations with matrix size 100. As the answer to this exercise, show the reported timing values, and based on the minimum, average (mean or median), and maximum timing values, say briefly what you think a reasonable estimate of the "common case" running time actually is, and why.

  12. Like we did in the previous studio, save a copy of simple_module.c, but rename it to ktime_module.c.

    Modify the module as follows:

    Now, update your Makefile so it contains the appropriate .o file target for your new module, and then generate the .ko file for your new module, as follows:

    module add raspberry

    and then

    LINUX_SOURCE=path to your Linux kernel source code

    make -C $LINUX_SOURCE ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- M=$PWD modules

    Then, use sftp to copy the generated .ko file over to your Raspberry Pi, and load the module using sudo insmod. Wait a few seconds, then unload the module using sudo rmmod. Then, use dmesg to confirm that the module is working correctly.

    As the answer to this exercise, please (1) tell us which ktime_get* function you used, and why, and (2) show us the last few lines of system log output, including your module's message stating how long it had remained loaded.

  13. Things to turn in

    In addition to the answers above, please submit:

    Optional Enrichment Exercises

    As the answer to any of these optional exercies that you would like to try, please briefly describe what you learned (and give answers to any questions the exercise contains).

  14. Another benchmarking technique involves reading directly from the Time Stamp Counter (TSC) on the Intel x86 CPU. This register can be accessed using the readtsc instruction, like in the following code:

          static inline unsigned long long rdtsc_get(void) {
            unsigned long high, low;
            asm volatile ("rdtsc" : "=a" (low), "=b" (high));
            return ( (unsigned long long) low) |
                    ( ( (unsigned long long) high ) << 32 );

    Modify timed_parallel_dense_mm.c to time the critical computational loop using the value returned by the provided rdtsc function. Your program should output the minimum, mean, and maximum times over all timed iterations.

    Keep in mind that the TSC register can overflow, so your program should verify that the value returned after the critical section is greater than the value returned before.

    As the answer to this exercise, please submit your modified timed_parallel_dense_mm.c, and show the reported timing values.

Page updated Tuesday October 13, 2020, by Marion Sudvarg and Chris Gill.