Hardware Counters for Non-Intrusive Benchmarking - Demistifying the Intel PMU

Hardware counters provided by the Intel performance monitoring unit (PMU) are a great way to obtain metrics such as L1/2/3 cache misses, instructions executed, etc at fine-grained resolution. While there are a number of tools such as linux-perf, VTune, etc. for obtaining said metrics, my (admittedly non-exhaustive) search concluded that there was no good way to obtain hardware metrics for a sub-section of code, as shown below:

int main() {
    /***************************************
    * Initialize array with random numbers *
    ****************************************/
    int n = 1000000;
    long* arr = (long*)malloc(sizeof(long)*n);
    for (int i=0; i<n; i+=1) {
        arr[i] = random();
    }
    /*************************
    *** Code to benchmark ****
    **************************/
    sort(arr, n);
    /*************************/
    return 0;
}

Suppose we only want to benchmark the section of code for sorting while omitting the overhead of initializing the array, I found it difficult to use existing tools in such cases without imposing additional overhead.

This is where the facilities provided by the Intel PMU come in handy.

That being said, Intel's Developer Manual which details how to use the PMU is not the easiest document to decipher. The information pertinent to using the PMU is scattered in multiple places in the manual and the internet. I've put together the relevant information I found in this blogpost.


The Basics


If there is one realization I've had after poring over the Intel manual for a full weekend, it is the following:  
Registers are the interface to the PMU.
That's it.

If you set the right values in the right input registers, you can obtain desired hardware metrics by reading associated output registers. The challenge is identifying what registers to set (program), how to program them, and from where to read the performance metrics. The subsequent sections answer these questions.


How to program MSRs?


The desired performance events can be obtained by programming certain model-specific registers, or MSRs. A useful command-line facility is msr-tools, which provides commands rdmsr and wrmsr to read and write MSRs respectively.

node:~> sudo apt install msr-tools

node:~> sudo wrmsr 0x186 0x43412e

node:~> sudo rdmsr 0x186

43412e


Above, we are programming the MSR at 0x186 to the value 0x43412e. This happens to be the performance event for measuring L3 cache misses.

How to read performance counters?


Now that we have programmed the MSR, where/how do we obtain L3 cache misses? By using the rdpmc instruction. Both Intel CPUs and the Linux kernel support this instruction, and it can be used as follows:
unsigned long rdpmc_l3_cache_misses() {
    unsigned long a, d, c;
    c = 0;
    __asm__ __volatile__("rdpmc" : "=a" (a), "=d" (d) : "c" (c));
    return ((unsigned long)a) | (((unsigned long)d) << 32);
}
rdpmc stands for read performance monitoring counter (PMC in short). There is a one-to-one mapping between MSRs and PMCs. Since we programmed register 0x186 which is the MSR IA32_PERFEVTSEL0, i.e, the zeroth performance monitoring MSR, the corresponding zeroth PMC can be read by passing the argument c=0 to the rdpmc instruction.

To obtain L3 cache misses for a particular section of code, simply read the PMC before and after as follows:
...

unsigned long l3_misses = rdpmc_l3_cache_misses();
<...
code to benchmark
...>
l3_misses = rdpmc_l3_cache_misses() - l3_misses;

...

Exploring the PMU's capabilities 

cpuid is a useful tool to explore the hardware's capabilities. The 0x0AH leaf gives information about the PMU.

node:~> sudo apt install cpuid

node:~> cpuid -1 -l 0x0ah

CPU:

   Architecture Performance Monitoring Features (0xa/eax):

      version ID                               = 0x4 (4)

      number of counters per logical processor = 0x4 (4)

      bit width of counter                     = 0x30 (48)

      length of EBX bit vector                 = 0x7 (7)

   Architecture Performance Monitoring Features (0xa/ebx):

      core cycle event not available           = false

      instruction retired event not available  = false

      reference cycles event not available     = false

      last-level cache ref event not available = false

      last-level cache miss event not avail    = false

      branch inst retired event not available  = false

      branch mispred retired event not avail   = false

   Architecture Performance Monitoring Features (0xa/edx):

      number of fixed counters    = 0x3 (3)

      bit width of fixed counters = 0x30 (48)

Intel CPUs have different architecture families, such as Skylake, Kaby Lake, etc. For instance, the Intel Xeon Silver 4114 10-core CPU belongs to the Skylake architecture family:

node:~> cat /sys/devices/cpu/caps/pmu_name

skylake


There are three types of performance events supported by the PMU:

  1. Fixed counters: The cpuid 0x0AH leaf shows that the version ID of the PMU is 4, and there are three fixed counters supported by PMU. These counters are retired instructions, core cycles, and TSC reference cycles (Table 18-2 in the manual). These three are commonly used performance events and hence incorporated as fixed counters. All PMUs with version ID 2 and higher support these fixed counters, which includes processors since Intel Core 2 Duo T 7700 released in 2007.
  2. Architectural performance events: Aside from the fixed counters, the PMU provides limited MSRs to program performance events of choice. The cpuid leaf shows that there are four counters per logical processor. These MSRs are named IA32_PERFEVTSELx, where x ranges from 0-3 in our case. Some processors can provide up to 8 counters, but the register address of these counters is fixed starting from 0x186 for MSR IA32_PERFEVTSEL0 for all Intel architectures. Architectural performance events have the same event code independent of the PMU architecture family. There are seven such events highlighted in green (Table 18-1 in the manual) typically supported by all Intel PMUs. Note that the first three architectural events are the same as fixed counters.
  3. Non-architectural performance events: Some less frequently used performance events such as L2 and L1 cache misses can be specific to the PMU's architecture family. Programming the MSRs for these events requires looking up the manual for specifications of that particular architecture family.

Enabling the PMU

The PMU is disabled by default, and can be turned on by programming the global performance control register IA32_PERF_GLOBAL_CTRL at address 0x38f. The format of the register is shown below (Figure 18-3 from the manual):


Note that here we assume the PMU version to be 2 or higher, which includes processors since Intel Core 2 Duo T 7700 released in 2007. To enable the fixed counters, we need to set bits 32-34. To enable all the PMCs, we need to set the lower 4 bits since our CPU has 4 available performance counters.

node:~> sudo wrmsr 0x38f 0x70000000f


Instead, if the CPU had 8 available counters, we would set the lower 8 bits.


node:~> sudo wrmsr 0x38f 0x7000000ff


A final step involves enabling the rdpmc instruction for user-level programs. This is optional, but is useful if you want to read performance counters without using sudo.

node:~> sudo echo 2 > /sys/devices/cpu/rdpmc


Enabling Fixed Counters


For PMUs that support fixed counters (version ID 2 or higher), an additional register IA32_FIXED_CTR_CTRL at address 0x38d needs to be programmed. The layout of the register is as follows (Figure 18-2 in the manual):

Each fixed counter has 4 associated bits. Taking counter 0 as example, bits 0-1 enable the counter for programs of different ring levels. Bit 2 is reserved, and bit 3 is set to indicate in case an overflow occurs in the PMC (note that PMC is 48 bits wide, see output of cpuid 0x0AH leaf). The three fixed counters can be enabled for all ring levels by setting the MSR IA32_FIXED_CTR_CTRL as follows:

node:~> sudo wrmsr 0x38d 0x333

Note that we have not set the overflow control (bit 3), as we are not handling this corner case. For information on how to detect and handle overflow, refer to section 18.2.2 in the manual.

Once the fixed counters have been enabled, they can be accessed as follows:
// Total number of instructions executed (retired)
unsigned long rdpmc_retired_inst() {
    unsigned long a, d, c;
    c = (1<<30);
    __asm__ __volatile__("rdpmc" : "=a" (a), "=d" (d) : "c" (c));
    return ((unsigned long)a) | (((unsigned long)d) << 32);
}

// Total number of core cycles. Note that when CPU frequency
// scaling is enabled (default), time elapsed does not scale
// linearly with core cycles.
unsigned long rdpmc_core_cycles() {
    unsigned long a, d, c;
    c = (1<<30) + 1;
    __asm__ __volatile__("rdpmc" : "=a" (a), "=d" (d) : "c" (c));
    return ((unsigned long)a) | (((unsigned long)d) << 32);
}

// Reference cycles at the rate of the system's time stamp counter (TSC).
// This metric is unaffected by CPU frequency scaling.
unsigned long rdpmc_tsc_reference_cyles() {
    unsigned long a, d, c;
    c = (1<<30) + 2;
    __asm__ __volatile__("rdpmc" : "=a" (a), "=d" (d) : "c" (c));
    return ((unsigned long)a) | (((unsigned long)d) << 32);
}

Programming performance events


Aside from the fixed counters, the PMU provides limited MSRs that can be programmed for the desired performance event. These programmable MSRs are called IA32_PERFEVTSELx (x ranges from 0-3 in our case as Intel Xeon Silver 4114 processors have 4 available counters), and can be programmed for a wide range of events. The layout of these MSRs is as follows (Figure 18-1 in the manual):


A detailed description of all the fields in the counter can be found in section 18.2.1.1 in the manual. Here I will point out two important fields - event select (bits 0-7) and unit mask (bits 8-15). Any desired performance event can be programmed by setting these fields correctly.

For the seven architectural performance events, these two fields can be obtained from Table 18-1 in the manual. For last level cache misses (i.e., L3 misses), the event is 0x2e and the umask is 0x41. We have previously programmed IA32_PERFEVTSEL0 (0x186) to measure L3 cache misses.

Let's take another example. Suppose we want to determine the L3 cache miss rate, we will also need to obtain the number of L3 cache references. There is an architectural performance event for measuring last level cache (L3) references with event 0x2e and umask 0x4f. Notice that the event is the same as for L3 misses, but the umask is different.

node:~> sudo wrmsr 0x187 0x434f2e

Here we have programmed IA32_PERFEVTSEL1 (0x187) to measure LLC references. The corresponding performance counter can be obtained as follows:
unsigned long rdpmc_l3_cache_refs() {
    unsigned long a, d, c;
    c = 1;
    __asm__ __volatile__("rdpmc" : "=a" (a), "=d" (d) : "c" (c));
    return ((unsigned long)a) | (((unsigned long)d) << 32);
}
Notice that the argument c=1 which reads from the PMC mapping to IA32_PERFEVTSEL1.

There are a lot more non-architectural performance events (100+) that are specific to the architecture family. Our CPU belongs to the Skylake architecture family, and the list of events supported are listed in Table 19-3 in the manual. Here I will take one example of L2 cache misses, with event 0x24 and umask 0x3f:

node:~> sudo wrmsr 0x188 0x433f24

Having programmed IA32_PERFEVTSEL2 (0x188), L2 cache misses can be accessed as follows:
unsigned long rdpmc_l2_cache_miss() {
    unsigned long a, d, c;
    c = 2;
    __asm__ __volatile__("rdpmc" : "=a" (a), "=d" (d) : "c" (c));
    return ((unsigned long)a) | (((unsigned long)d) << 32);
}

Conclusion


Intel PMU is a powerful tool for benchmarking with near-zero overhead. There are 100s of useful hardware metrics that can be captured using the PMU's capabilities. I hope that this blogpost will serve as a guide to navigate the Intel manual for programming different performance events.