This article is the first of a three-part series on the PERF (linux-tools) performance measurement and profiling system.
Part 1 demonstrates how to use PERF to identify and analyze the hottest execution spots in a program. It covers the basic PERF commands, options and software performance events.
Part 2 introduces hardware performance events and demonstrates how to measure hardware events across an entire application. It defines and discusses several useful rates and ratios for performance assessment and analysis.
Part 3 uses hardware performance event sampling to identify and analyze program hot-spots.
Each article builds on the usage and background information presented in the previous articles. All articles use the same two example programs:
Naive: Naive matrix multiplication program (textbook algorithm).
Interchange: An improved matrix multiplication program (loop nest interchange algorithm).
The textbook algorithm has known performance issues due to its memory access pattern. The loop nest interchange algorithm has a different access pattern which reads/writes data more efficiently.
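As a rough illustration of the two access patterns, the sketch below places a textbook (i, j, k) loop nest next to an interchanged (i, k, j) loop nest. This is not the code from naive.c: the function names, the flat row-major float arrays and the dimension parameter n are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* Textbook order (i, j, k). The inner loop reads b[k*n + j] for
 * increasing k, so consecutive reads of b are n elements apart. */
void multiply_textbook(const float *a, const float *b, float *c, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

/* Interchanged order (i, k, j). The inner loop walks along one row
 * of b and one row of c, so consecutive accesses are adjacent. */
void multiply_interchange(const float *a, const float *b, float *c, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            c[i * n + j] = 0.0f;
        }
        for (size_t k = 0; k < n; k++) {
            float r = a[i * n + k];
            for (size_t j = 0; j < n; j++) {
                c[i * n + j] += r * b[k * n + j];
            }
        }
    }
}
```

Both functions compute the same product. Only the order in which memory is visited differs, and that difference is what the measurements in this series are meant to expose.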
Source: Makefile, naive.c, readme
Common code: test_common.h, test_common.c, rpi_pmu.h, rpi_pmu.c
Introduction
Performance Events for Linux, called “PERF,” is the standard profiling infrastructure on Linux. You can analyze the performance of a program using the tools provided with PERF, or you can build your own tools on top of PERF.
PERF consists of two major subsystems:
A kernel SYSCALL that provides access to performance data including operating system software events and hardware performance events, and
A collection of user-space tools to collect, display and analyze performance data.
PERF is the standard means of access to the hardware performance counters. It supports both counting mode and sampling mode.
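For readers curious about the kernel side, here is a minimal sketch of counting mode that uses the perf_event_open system call directly to count the cpu-clock software event around a function call. The helper names measure_cpu_clock and spin are made up for this example. The call can fail on systems that restrict access to performance events, in which case the sketch simply returns -1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* glibc provides no wrapper for perf_event_open, so go through syscall(). */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                            int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Hypothetical workload: a short busy loop. */
void spin(void)
{
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 1000000UL; i++) {
        x += i;
    }
}

/* Hypothetical helper: runs work() and returns the raw cpu-clock count
 * (nanoseconds) for the calling thread, or -1 if the counter is unavailable. */
long long measure_cpu_clock(void (*work)(void))
{
    assert(work != NULL);

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_SOFTWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_SW_CPU_CLOCK;
    attr.disabled = 1;        /* start stopped, enable explicitly below */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* pid = 0, cpu = -1: this thread, on whichever CPU it runs. */
    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) {
        return -1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    ssize_t n = read(fd, &count, sizeof(count));
    close(fd);
    return n == (ssize_t)sizeof(count) ? (long long)count : -1;
}
```

Calling measure_cpu_clock(spin) is roughly what the perf stat tool does on your behalf for a whole program. The user-space tools are usually the more convenient route.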
PERF is part of the kernel, so Raspbian Wheezy (3.6.11+) already has PERF built in. However, if you want to work with PERF, you need to install the linux-tools package using either Synaptic or apt-get. The following command does the job:
sudo apt-get install linux-tools-3.6
Be sure to install the version which matches the kernel. You can display information about the currently installed Linux distribution (including the kernel version) by entering: uname -a.
This note is an introduction to the PERF user-space tools. It shows how to use PERF to analyze the textbook matrix multiplication program (naive.c). Once you see how easy it is to use PERF, you can explore the command line options on your own. To get a complete list of PERF commands, enter:
perf --help
To get usage information about a particular PERF command (e.g., report), enter:
perf report --help
I recommend reading the PERF tutorial on the PERF Wiki, too.
What can you measure with PERF?
PERF supports a large number of predefined events. To get a list of supported events, enter:
perf list
PERF supports hardware events and software events. The hardware events are measured using the hardware performance counters. The available events are processor implementation specific. On the Raspberry Pi, the available events are those supported by the Broadcom BCM2835 processor, which uses the ARM1176JZF-S core. Please see the ARM1176JZF-S Technical Reference Manual (TRM) for more information about ARM11 hardware performance events. The Raspberry Pi 2 is the second generation Pi and has a Broadcom BCM2836 processor. The BCM2836 is a quad-core Cortex-A7 (ARMv7) machine. Please consult the ARM Cortex-A7 Technical Reference Manual for details about hardware performance counters and events.
Although this tutorial demonstrates PERF on Raspberry Pi, the techniques are applicable to other architectures like x86. This part of the tutorial series uses Linux software performance events which are common to all architectures.
Certain commonly used hardware events have PERF symbolic names such as:
cpu-cycles OR cycles
instructions
cache-references
cache-misses
branch-instructions OR branches
branch-misses
bus-cycles
stalled-cycles-frontend OR idle-cycles-frontend
stalled-cycles-backend OR idle-cycles-backend
ref-cycles
You may also specify a hardware performance event by its processor-specific event identifier, usually a hexadecimal value. (See the TRM for event identifiers and more.)
PERF supports several pre-defined software events:
cpu-clock
task-clock
page-faults OR faults
context-switches OR cs
cpu-migrations OR migrations
minor-faults
major-faults
alignment-faults
emulation-faults
The cpu-clock event measures the passage of time. It uses the Linux CPU clock as the timing source. This tutorial focuses on the cpu-clock event because execution time is a good starting point for performance analysis. This event is easy to use and understand. Hardware events are a little bit more difficult to use and we discuss them in a separate tutorial.
Elapsed time and event counting
The first step in analysis is to establish baseline performance. The baseline gives us a way to assess the efficacy of any changes that we make to the source or the build process. Please don’t forget that the compiler performs many optimizations which are enabled by command line options. Sometimes the best way to improve performance is through flag mining, i.e., finding the combination of compiler options that yields the best performance.
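One simple way to express a comparison against the baseline is as a speed-up factor: baseline time divided by the time after the change. The helper below is only a sketch, and its name and parameters are made up for illustration.

```c
#include <assert.h>

/* Speed-up of a modified build relative to the baseline.
 * Both arguments are execution times in seconds.
 * A result above 1.0 means the modified program is faster. */
double speedup(double baseline_s, double modified_s)
{
    assert(baseline_s > 0.0 && modified_s > 0.0);
    return baseline_s / modified_s;
}
```

For instance, a hypothetical change that cuts a 16 second run to 8 seconds gives speedup(16.0, 8.0), which is 2.0. A value below 1.0 signals that the change made things worse.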
The cpu-clock event is exactly what we need to establish a baseline. The perf stat command:
perf stat -e cpu-clock ./naive
runs the textbook matrix multiplication program named naive and displays the following output:
Performance counter stats for './naive':
16515.108000 cpu-clock
16.752186472 seconds time elapsed
The cpu-clock value is reported in milliseconds (about 16.5 seconds of CPU time here), followed by the elapsed wall-clock time in seconds. These times are the baseline for future comparisons.
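The two numbers can also be combined into a rough CPU utilization figure, i.e., the share of elapsed time that the program actually spent on the CPU. The function below is a sketch of that arithmetic. Its name and parameters are assumptions for this example and are not part of PERF.

```c
#include <assert.h>

/* Fraction of elapsed (wall-clock) time spent executing on the CPU.
 * cpu_clock_ms: the cpu-clock value printed by perf stat (milliseconds)
 * elapsed_s:    the "seconds time elapsed" value (seconds) */
double cpu_utilization(double cpu_clock_ms, double elapsed_s)
{
    assert(elapsed_s > 0.0);
    return (cpu_clock_ms / 1000.0) / elapsed_s;
}
```

With the output above, cpu_utilization(16515.108, 16.752186472) comes out at roughly 0.986, so naive spent nearly all of its elapsed time executing rather than waiting.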
We can measure more than one event as shown in the command and output below:
> perf stat -e cpu-clock,faults ./naive
Performance counter stats for './naive':
16508.392000 cpu-clock
885 page-faults
16.740473493 seconds time elapsed
By the way, the “>” before example commands is the shell’s command line prompt. Don’t enter this character when you try the example commands! In this example, PERF measures two software events: cpu-clock and page-faults.
What questions should I ask?
There are three over-arching practical questions in performance analysis.
Which parts of the program take the most execution time?
Does the number of software or hardware events indicate an actual performance issue to be fixed?
How can we fix the performance issue?
We need to be detectives in order to answer these questions.
The code regions which take the most execution time are called hot spots. The hot spots are the best places to tune and optimize because a little bit of effort on making a hot spot faster can have a big pay-off. It’s not worth spending engineering effort on a bit of code that executes only once or doesn’t take much time. Since this kind of analysis deals with execution time, you can easily guess that the cpu-clock event is part of the analytical method.
The second question asks whether the number of events, such as page-faults, indicates an actual performance issue to be fixed. In the second example above, are 885 page faults too many? To answer this question, we need to have some knowledge or intuition about the algorithms and data structures in the program. We at least need to know if the program is CPU-bound (does a lot of computation) or is memory-bound (needs to read/write a lot of data in memory). We also need to know where the events are occurring in the program. If the events correlate with a known hot spot, then we may have found a real performance issue and need to fix it.
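As a manual cross-check on where page faults occur, a code region can be bracketed with calls to getrusage, which reports the fault counts charged to the process. This is a sketch only. It is not how PERF attributes events, and the helper names and the 4096-byte touch stride are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/resource.h>

/* Total page faults (minor + major) charged to this process so far,
 * or -1 if the query fails. */
long page_faults_so_far(void)
{
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0) {
        return -1;
    }
    return usage.ru_minflt + usage.ru_majflt;
}

/* Hypothetical region of interest: allocates `bytes` bytes, writes one
 * byte every 4096 bytes, and returns the number of page faults observed
 * while doing so (or -1 on failure). */
long faults_while_touching(size_t bytes)
{
    assert(bytes > 0);
    long before = page_faults_so_far();
    /* volatile keeps the compiler from discarding the writes */
    volatile unsigned char *buf = malloc(bytes);
    if (before < 0 || buf == NULL) {
        free((void *)buf);
        return -1;
    }
    for (size_t i = 0; i < bytes; i += 4096) {
        buf[i] = 1;
    }
    long after = page_faults_so_far();
    free((void *)buf);
    return after < 0 ? -1 : after - before;
}
```

The count returned depends on the allocator, the page size and the kernel configuration, so treat it as an indication of where faults cluster rather than as an exact figure.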
The last question is the hardest one to answer. Let’s assume that we have found a hot spot with a high number of page faults. We need to study the program’s algorithms and data structures and determine if the program has an unnecessarily large working set size and touches many pages within a short period of time. We might need to reduce or reorganize the data so that the data reside on fewer pages. Once the code changes are finished, we must measure the performance of the modified program and compare its execution time against the baseline to see if there is any speed-up.
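To get a feel for how the access pattern affects the number of pages touched, the sketch below counts the distinct pages visited by a strided walk through memory. It assumes a page-aligned starting address, and the function name and parameters are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Number of distinct pages touched when accessing `count` elements that
 * are `stride_bytes` apart, starting at a page-aligned address. */
size_t pages_touched(size_t count, size_t stride_bytes, size_t page_bytes)
{
    assert(page_bytes > 0);
    size_t pages = 0;
    size_t last_page = 0;
    for (size_t i = 0; i < count; i++) {
        size_t page = (i * stride_bytes) / page_bytes;
        /* Offsets only increase, so a new page number means a new page. */
        if (i == 0 || page != last_page) {
            pages++;
            last_page = page;
        }
    }
    return pages;
}
```

Taking a hypothetical 500 x 500 matrix of 4-byte floats and 4096-byte pages, walking one row is pages_touched(500, 4, 4096), which is 1, while walking one column is pages_touched(500, 2000, 4096), which is 244. The same 500 elements are spread over far more pages in the column walk.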