CSE 522S: Studio 3

Hardware Counters and Loadable Kernel Modules


"You see," he continued, beginning to feel better, "once there was no time at all, and people found it very inconvenient. They never knew whether they were eating lunch or dinner, and they were always missing trains. So time was invented to help them keep track of the day and get places when they should. When they began to count all the time that was available, what with 60 seconds in a minute and 60 minutes in an hour and 24 hours in a day and 365 days in a year, it seemed as if there was very much more than could ever be used."

The Phantom Tollbooth, Norton Juster, Chapter 3

In this studio, you will:

  1. Read from hardware counters provided by both the Intel and ARM platforms.
  2. Review building and installing kernel modules.
  3. Write kernel modules to compare different methods of accessing the ARM PMU.

Please complete the required exercises below. We encourage you to work in groups of 2 or 3 people on each studio (groups are allowed to change from studio to studio), though you may complete any studio by yourself if you prefer.

As you work through these exercises, please record your answers. When you finish, each person who worked on them should log into Canvas, select this course in this semester, upload a file containing the answers along with any other files the assignment asks for, and submit those files for this studio assignment. There should be a separate submission from each person who worked together on them.

Make sure that the name of each person who worked on these exercises is listed in the first answer, and make sure you number each of your responses so it is easy to match your responses with each exercise.


Required Exercises

  1. As the answer to the first exercise, list the names of the people who worked together on this studio.
  2. The Intel x86 instruction set architecture provides a Time Stamp Counter (TSC) register, which can be read directly from userspace with the rdtsc instruction, as in the following code:

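          /* Read the 64-bit Time Stamp Counter: rdtsc returns the low
             32 bits in EAX and the high 32 bits in EDX. */
          static inline unsigned long long rdtsc_get(void) {
            unsigned int high, low;
            asm volatile ("rdtsc" : "=a" (low), "=d" (high));
            return ( (unsigned long long) low) |
                    ( ( (unsigned long long) high ) << 32 );
          }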
        

    On the Linux Lab cluster (accessed with the qlogin command from shell.cec.wustl.edu), write a program that maintains an array of unsigned long long integers of size 100. In a loop, read the value from the TSC twice in a row, then write the elapsed number of cycles (the difference between the two values) into the array.

    Keep in mind that the TSC register can overflow, so your program should verify that the second value returned is greater than the first.
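
    The sampling loop and the check described above can be sketched as follows. This is only one possible arrangement, not a required design: the counter_fn type, collect_elapsed, and fake_counter are hypothetical names introduced here, and uint64_t is used in place of unsigned long long. The counter is passed in as a function pointer so the loop can be tried with a scripted stand-in before wiring in rdtsc_get. A pair of reads whose second value is not greater than the first is discarded and retried.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter-source type; on x86 this could point at rdtsc_get(). */
typedef uint64_t (*counter_fn)(void);

/*
 * Fills samples[0..n-1] with the difference between two back-to-back
 * counter reads. A pair whose second read is not greater than the first
 * is discarded and retried. Returns the number of discarded pairs.
 * Note: this loops forever if the counter never advances.
 */
static size_t collect_elapsed(counter_fn read_counter, uint64_t *samples,
                              size_t n)
{
    size_t filled = 0, discarded = 0;

    assert(read_counter != NULL && samples != NULL);

    while (filled < n) {
        uint64_t first = read_counter();
        uint64_t second = read_counter();

        if (second > first)
            samples[filled++] = second - first;
        else
            discarded++;
    }
    return discarded;
}

/* Scripted stand-in counter for trying the loop without real hardware. */
static uint64_t fake_counter(void)
{
    static const uint64_t script[] = { 100, 130, UINT64_MAX, 5, 200, 245 };
    static size_t next = 0;
    uint64_t value = script[next];

    next = (next + 1) % (sizeof(script) / sizeof(script[0]));
    return value;
}
```

    With the scripted counter, the first pair yields 30 cycles, the second pair is discarded because the second value is smaller than the first, and the third pair yields 45 cycles.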

    After doing this 100 times, your program should calculate and print the minimum, maximum, mean, and standard deviation of the elapsed cycles between rdtsc calls. As the answer to this exercise, please report these values.
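
    One way to compute the summary in userspace is sketched below, under a few assumptions: the struct cycle_stats type and the summarize function are hypothetical names, the variance is the population variance (dividing by n rather than n - 1), and the standard deviation is left to the caller as the square root of the variance, e.g. via sqrt() from <math.h>, which typically requires linking with -lm.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the summary values. */
struct cycle_stats {
    unsigned long long min;
    unsigned long long max;
    double mean;
    double variance;   /* population variance; stddev is its square root */
};

/* Two-pass summary: first min, max, and mean, then the squared deviations. */
static struct cycle_stats summarize(const unsigned long long *samples,
                                    size_t n)
{
    struct cycle_stats s;
    double sum = 0.0, sq_diff = 0.0;
    size_t i;

    assert(samples != NULL && n > 0);

    s.min = s.max = samples[0];
    for (i = 0; i < n; i++) {
        if (samples[i] < s.min)
            s.min = samples[i];
        if (samples[i] > s.max)
            s.max = samples[i];
        sum += (double)samples[i];
    }
    s.mean = sum / (double)n;

    for (i = 0; i < n; i++) {
        double d = (double)samples[i] - s.mean;
        sq_diff += d * d;
    }
    s.variance = sq_diff / (double)n;
    return s;
}
```

    For the samples 2, 4, 4, 4, 5, 5, 7, 9 this gives a mean of 5 and a variance of 4, so the standard deviation is 2.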

    Your home directory resides on networked storage, and therefore the contents remain the same on both shell.cec.wustl.edu and all of the Linux Lab cluster machines. This means that, besides using terminal-based text editors (nano, vim, emacs), you can also use Visual Studio Code to edit your programs remotely.

  3. ARM also provides a cycle counter, the Performance Monitors Cycle Count Register (PMCCNTR) on the Performance Monitor Unit (PMU). Unlike the TSC register on x86, the PMCCNTR is not, by default, accessible from userspace. However, it can be read from kernel code, e.g. by kernel modules.

    First, download this driver file, and place it in the arch/arm/include/asm directory in your Linux kernel source tree.

    Then, on the Linux Lab cluster, please create a new directory to hold your kernel modules, e.g. /project/scratch01/compile/your-username/modules and cd into it. Save a copy of enable_ccnt_522.c. In that directory also create a Makefile that contains the line

    obj-m := enable_ccnt_522.o

    and save that file. Now, modify the kernel module so that it measures the elapsed times between reads from the PMCCNTR for 100 samples, and reports the minimum, maximum, mean, and standard deviation of the elapsed cycles between reads, similarly to the previous exercise.

    You can use the pmccntr_get function, which takes no arguments and returns a uint64_t value, to retrieve the cycle counts.

    As this is a kernel module, use printk to print these values to the kernel log.

    Hints for Math in the Kernel

    From LKD:

    “When a user-space process uses floating-point instructions, the kernel manages the transition from integer to floating point mode. What the kernel has to do when using floating-point instructions varies by architecture, but the kernel normally catches a trap and then initiates the transition from integer to floating point mode.

    “Unlike user-space, the kernel does not have the luxury of seamless support for floating point because it cannot easily trap itself. Using a floating point inside the kernel requires manually saving and restoring the floating point registers, among other possible chores. The short answer is: Don't do it! Except in the rare cases, no floating-point operations are in the kernel.”

    For calculating the mean and standard deviation, you will likely need to use 64-bit division and a square-root function. The cross-compiler may not be able to generate the correct operations for 64-bit division on the target 32-bit platform.

    For 64-bit division, use the

    u64 div_u64(u64 dividend, u32 divisor)

    function declared in

    <linux/math64.h>

    For an integer approximation of the square root, use the int_sqrt() or int_sqrt64() function declared in include/linux/kernel.h.
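
    The integer-only arithmetic described in these hints can be tried out in userspace first. The sketch below is an illustration under stated assumptions, not kernel code: sketch_div_u64 and sketch_int_sqrt64 are hypothetical userspace stand-ins for the kernel's div_u64() and int_sqrt64(), and struct int_stats and summarize_int are likewise names invented here. It also assumes that the sum of the samples and the sum of the squared deviations both fit in 64 bits. The mean and standard deviation are truncated toward zero because no floating point is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's div_u64() (hypothetical name). */
static uint64_t sketch_div_u64(uint64_t dividend, uint32_t divisor)
{
    assert(divisor != 0);
    return dividend / divisor;
}

/* Userspace stand-in for int_sqrt64(): floor(sqrt(x)), integers only. */
static uint32_t sketch_int_sqrt64(uint64_t x)
{
    uint64_t result = 0;
    uint64_t bit = (uint64_t)1 << 62;   /* largest power of four in 64 bits */

    while (bit > x)
        bit >>= 2;
    while (bit != 0) {
        if (x >= result + bit) {
            x -= result + bit;
            result = (result >> 1) + bit;
        } else {
            result >>= 1;
        }
        bit >>= 2;
    }
    return (uint32_t)result;
}

/* Hypothetical container for the integer summary values. */
struct int_stats {
    uint64_t min, max, mean, stddev;
};

/* Integer-only summary; assumes the sums below do not overflow 64 bits. */
static struct int_stats summarize_int(const uint64_t *samples, uint32_t n)
{
    struct int_stats s;
    uint64_t sum = 0, sq_sum = 0;
    uint32_t i;

    assert(samples != NULL && n > 0);

    s.min = s.max = samples[0];
    for (i = 0; i < n; i++) {
        if (samples[i] < s.min)
            s.min = samples[i];
        if (samples[i] > s.max)
            s.max = samples[i];
        sum += samples[i];
    }
    s.mean = sketch_div_u64(sum, n);

    for (i = 0; i < n; i++) {
        /* Absolute deviation, computed without going negative. */
        uint64_t d = samples[i] > s.mean ? samples[i] - s.mean
                                         : s.mean - samples[i];
        sq_sum += d * d;
    }
    s.stddev = sketch_int_sqrt64(sketch_div_u64(sq_sum, n));
    return s;
}
```

    Inside a kernel module, the two stand-ins would be replaced by the real div_u64() and int_sqrt64() calls, and the assert checks and userspace headers would be dropped.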

    Before building the module, set the LINUX_SOURCE shell variable by issuing the command

    LINUX_SOURCE=path to your Linux kernel source code

    (Note that the path above should end in something like linux_source/linux)

    (Note also that you can add the command above as an individual line at the end of the file ~/.bashrc so that it is run automatically whenever you log in.)

    and finally compile via

    make -C $LINUX_SOURCE ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- M=$PWD modules

    which if successful should produce a kernel module file named enable_ccnt_522.ko

    Boot up your Raspberry Pi, open up a terminal window, create a directory to hold your kernel modules, and use sftp to get the enable_ccnt_522.ko file you just built.

    Use the insmod utility to load your kernel module into the kernel, as in:

    sudo insmod enable_ccnt_522.ko

    If you received no error messages, then your module has been successfully loaded. To confirm your module was loaded, you can also issue the command

    lsmod

    to see a listing of all currently loaded kernel modules.

    To see the values your kernel module reported for the minimum, maximum, mean, and standard deviation of the elapsed cycles between reads, issue the dmesg command, which prints the contents of the kernel log to the terminal. As the answer to this exercise, please show the output of the system log.

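    (Note: Since you may be frequently building kernel modules for this class, you can create a bash alias for the make command. Use single quotes so that $LINUX_SOURCE and $PWD are expanded when the alias is run rather than when it is defined, e.g.:

    alias makemodule='make -C $LINUX_SOURCE ARCH=arm CROSS_COMPILE=arm-linux-gnueabihf- M=$PWD modules'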

    You can additionally add this as a line to the end of your ~/.bashrc file.)

  4. The Linux kernel provides indirect access to hardware counters, including the ARM PMU, via the perf_event_open system call.

    Create a program directly on your Raspberry Pi that uses this system call to get a file descriptor that, when read, supplies the current value of the PMCCNTR.

    HINT: Follow the instructions given under the "Performance Information from a Linux Application" header on this page, using PERF_COUNT_HW_CPU_CYCLES as the config field of the perf_event_attr argument to the syscall.
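
    As a rough illustration of the hint above, the sketch below opens a cycle counter with perf_event_open and reads it. The helper names open_cycle_counter and read_cycles are hypothetical, and the choices pid = 0 (the calling thread), cpu = -1 (any CPU), no event group, and no flags are assumptions made here rather than requirements of the exercise. glibc provides no wrapper for this system call, so it is invoked through syscall(). Depending on the system's perf_event_paranoid setting and on whether the PMU driver is loaded, the open may fail, in which case the helper returns -1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Opens a CPU-cycle counter for the calling thread; returns an fd or -1. */
static int open_cycle_counter(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;

    /* pid = 0: this thread; cpu = -1: any CPU; group_fd = -1; flags = 0. */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Reads the current count into *cycles; returns 0 on success, -1 on error. */
static int read_cycles(int fd, uint64_t *cycles)
{
    ssize_t got;

    assert(cycles != NULL);
    got = read(fd, cycles, sizeof(*cycles));
    return got == (ssize_t)sizeof(*cycles) ? 0 : -1;
}
```

    A program would call open_cycle_counter() once, then call read_cycles() twice in a row for each sample and record the difference, closing the file descriptor when finished.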

    Then, similarly to the previous exercises, take 100 samples of the elapsed cycles between two subsequent reads, then have your program print the minimum, maximum, mean, and standard deviation of these samples. As the answer to this exercise, report those values.

    If you are using the Raspberry Pi 4 or 4B, and if you haven't done so already in Studio 1, you will need to modify the arm-pmu entry in the device tree for the Pi 4's BCM2711 board, which will allow the Linux kernel to load the hw perfevents driver that will be used in a later studio. To do so, open the file arch/arm/boot/dts/bcm2711.dtsi, find the entry for the arm-pmu, then set the "compatible" line as follows:

    compatible = "arm,cortex-a72-pmu", "arm,cortex-a15-pmu", "arm,armv8-pmuv3";

    Then, recompile the Linux kernel, and install the new build onto your Raspberry Pi 4 using the instructions in Studio 2. After rebooting your Raspberry Pi, verify that the driver has loaded by running the command:

    dmesg | grep "perfevents"

    You should see a line that contains the phrase, "hw perfevents: enabled"

  5. The ARM PMU has a register, the Performance Monitor User Enable Register (PMUSERENR), that, when set, allows direct access to the PMU from userspace. To set this register, save another copy of enable_ccnt_522.c, but with a different name, and update your Makefile to compile this module.

    Modify the module code, replacing the call to the pmccntr_enable_once function with a call to pmccntr_enable_once_user.

    Build your module, then load it on the Pi, confirming as before (with lsmod or dmesg) that it has loaded correctly.

    Next, retrieve the same driver file onto your Raspberry Pi, and create a program that #includes it. You may also need to include <stdbool.h>.

    Your program should call the pmccntr_get() function directly, taking 100 samples of the elapsed time between calls, and then print the minimum, maximum, mean, and standard deviation as before.

    As the answer to this exercise, report these values. Please explain why you think they are different from (or similar to) the values reported in the previous two exercises.

  6. To remove a kernel module, the rmmod utility calls the underlying delete_module system call.

    Attempt to remove the kernel module you loaded earlier, but without elevating your privilege with sudo, as in:

    rmmod enable_ccnt_522

    You should receive an error message. As the answer to this exercise, please write the error message you received.

  7. Additionally, please take a look at the source code for the delete_module and the init_module system calls.

    As the answer to this exercise, please (1) copy the lines of code in delete_module that the kernel uses to verify that the calling process has permission to remove the module, and (2) indicate which function is called, for this same purpose, in init_module.

  8. Things to turn in

    Please submit


    Page updated Monday, November 29, 2021, by Marion Sudvarg and Chris Gill.