perf(1)
perf list show supported hw/sw events & metrics
-v ........ print longer event descriptions
--details . print information on the perf event names
and expressions used internally by events
perf stat
-p <pid> ..... show stats for running process
-o <file> .... write output to file (default stderr)
-I <ms> ...... show stats periodically over interval <ms>
-e <ev> ...... select event(s)
-M <met> ..... print metric(s), this adds the metric events
--all-user ... configure all selected events for user space
--all-kernel . configure all selected events for kernel space
perf top
-p <pid> .. show stats for running process
-F <hz> ... sampling frequency
-K ........ hide kernel threads
perf record
-p <pid> ............... record stats for running process
-o <file> .............. write output to file (default perf.data)
-F <hz> ................ sampling frequency
--call-graph <method> .. [fp, dwarf, lbr] method used to capture backtraces
fp : use frame-pointer, need to compile with
-fno-omit-frame-pointer
dwarf: use .cfi debug information
lbr : use hardware last branch record facility
-g ..................... short-hand for --call-graph fp
-e <ev> ................ select event(s)
--all-user ............. configure all selected events for user space
--all-kernel ........... configure all selected events for kernel space
perf report
-M intel .............. use intel disassembly in annotate
-n .................... annotate symbols with nr of samples
--stdio ............... report to stdio; if not present, TUI mode is used
-g graph,0.5,callee ... show callee based call chains with value >0.5%
Useful <ev>:
page-faults
minor-faults
major-faults
cpu-cycles
task-clock
Select specific events
Events to sample are specified with the -e option: either pass a comma-separated list or pass -e multiple times.
Events are specified in the form name[:modifier]. The list and description of the modifiers can be found in the perf-list(1) manpage under EVENT MODIFIERS.
# L1 i$ misses in user space
# L2 i$ stats in user/kernel space mixed
# Sample specified events.
perf stat -e L1-icache-load-misses:u \
-e l2_rqsts.all_code_rd:uk,l2_rqsts.code_rd_hit:k,l2_rqsts.code_rd_miss:k \
-- stress -c 2
The --all-user and --all-kernel options append a :u and :k modifier, respectively, to all specified events. Therefore the following two command lines are equivalent.
# 1)
perf stat -e cycles:u,instructions:u -- ls
# 2)
perf stat --all-user -e cycles,instructions -- ls
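Conceptually, the rewrite performed by --all-user can be pictured as appending the modifier to every entry of the comma-separated event list. The following is only an illustrative sketch: with_modifier is a hypothetical helper (not part of perf), and it assumes none of the events already carries a modifier.

```shell
# with_modifier is a hypothetical helper, not part of perf.
# usage: with_modifier <comma-separated-events> <modifier>
# Assumes the events do not carry a modifier yet.
with_modifier() {
    printf '%s\n' "$1" | sed "s/,/:$2,/g; s/\$/:$2/"
}

with_modifier cycles,instructions u
# cycles:u,instructions:u
```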
Raw events
In case perf does not provide a symbolic name for an event, the event can be
specified in raw form as r + UMask + EventCode, with both values written as hex digits.
The following is an example for the L2_RQSTS.CODE_RD_HIT event with EventCode=0x24 and UMask=0x10 on my laptop with a Sandy Bridge uarch, which gives the raw event r1024.
perf stat -e l2_rqsts.code_rd_hit -e r1024 -- ls
# Performance counter stats for 'ls':
#
# 33.942 l2_rqsts.code_rd_hit
# 33.942 r1024
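The raw name can be assembled mechanically from the two values. A minimal sketch, assuming the simple two-field encoding described above (two hex digits of UMask followed by two hex digits of EventCode); raw_event is a hypothetical helper and not part of perf.

```shell
# raw_event is a hypothetical helper, not part of perf.
# usage: raw_event <umask> <event-code>
raw_event() {
    printf 'r%02x%02x\n' "$1" "$2"
}

raw_event 0x10 0x24
# r1024
```

The result can then be passed to perf as usual, for example `perf stat -e "$(raw_event 0x10 0x24)" -- ls`.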
Find raw performance counter events (intel)
The intel/perfmon repository provides performance event databases for the different Intel uarchs.
The table in mapfile.csv can be used to look up the corresponding uarch; just grab the family and model from procfs.
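# Print <vendor>-<family>-<model>, with the model in upper-case hex
# as used by the keys in mapfile.csv.
cat /proc/cpuinfo | awk '/^vendor_id/  { V=$3 }
                         /^cpu family/ { F=$4 }
                         /^model\s*:/  { printf "%s-%d-%X\n",V,F,$3 }'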
The table in performance monitoring events describes how events are sorted into the different files.
Raw events for perf's own symbolic names
Perf also defines its own symbolic names for some events. An example is the cache-references event. The perf_event_open(2) manpage gives the following description.
perf_event_open(2)
PERF_COUNT_HW_CACHE_REFERENCES
Cache accesses. Usually this indicates Last Level Cache accesses but this
may vary depending on your CPU. This may include prefetches and coherency
messages; again this depends on the design of your CPU.
The sysfs can be consulted to get the concrete performance counter behind such a symbolic name on the given system, shown here for the related cache-misses event.
cat /sys/devices/cpu/events/cache-misses
# event=0x2e,umask=0x41
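The sysfs string can be turned into the raw form accepted by -e. This is a sketch under the assumption that the entry only contains plain event= and umask= fields, as in the output above; sysfs_to_raw is a hypothetical helper and not part of perf.

```shell
# sysfs_to_raw is a hypothetical helper, not part of perf.
# It only understands the "event=0x..,umask=0x.." fields; any other
# fields that may appear in a sysfs entry are ignored.
sysfs_to_raw() {
    ev=$(printf '%s\n' "$1" | sed -n 's/.*event=\(0x[0-9a-fA-F]*\).*/\1/p')
    um=$(printf '%s\n' "$1" | sed -n 's/.*umask=\(0x[0-9a-fA-F]*\).*/\1/p')
    printf 'r%02x%02x\n' "${um:-0}" "${ev:-0}"
}

sysfs_to_raw "event=0x2e,umask=0x41"
# r412e
```

On a live system the argument could come straight from sysfs, for example `sysfs_to_raw "$(cat /sys/devices/cpu/events/cache-misses)"`.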
Flamegraph
Flamegraph with single event trace
perf record -g -e cpu-cycles -p <pid>
perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > cycles-flamegraph.svg
Flamegraph with multiple event traces
perf record -g -e cpu-cycles,page-faults -p <pid>
perf script --per-event-dump
# fold & generate as above
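With --per-event-dump, perf script writes one file per event instead of printing to stdout; according to perf-script(1) the files are named perf.data.EVENT.dump. The loop below sketches the "fold & generate" step for each of them. It assumes a FlameGraph checkout in ./FlameGraph as in the single-event example; svg_name is a hypothetical helper used only to derive the output file name.

```shell
# svg_name is a hypothetical helper:
#   perf.data.cpu-cycles.dump -> cpu-cycles-flamegraph.svg
svg_name() {
    ev=${1#perf.data.}
    ev=${ev%.dump}
    printf '%s-flamegraph.svg\n' "$ev"
}

for dump in perf.data.*.dump; do
    [ -e "$dump" ] || continue   # glob did not match, nothing to fold
    FlameGraph/stackcollapse-perf.pl < "$dump" \
        | FlameGraph/flamegraph.pl > "$(svg_name "$dump")"
done
```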
Examples
Estimate max instructions per cycle
#define NOP4 "nop\nnop\nnop\nnop\n"
#define NOP32 NOP4 NOP4 NOP4 NOP4 NOP4 NOP4 NOP4 NOP4
#define NOP256 NOP32 NOP32 NOP32 NOP32 NOP32 NOP32 NOP32 NOP32
#define NOP2048 NOP256 NOP256 NOP256 NOP256 NOP256 NOP256 NOP256 NOP256
int main() {
for (unsigned i = 0; i < 2000000; ++i) {
asm volatile(NOP2048);
}
}
perf stat -e cycles,instructions ./noploop
# Performance counter stats for './noploop':
#
# 1.031.075.940 cycles
# 4.103.534.341 instructions # 3,98 insn per cycle
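As a sanity check on the numbers above, the loop body accounts for 2000000 x 2048 nop instructions, and the reported ratio is simply instructions divided by cycles. A small sketch of that arithmetic; ipc is a hypothetical helper and not part of perf.

```shell
# Instructions issued by the nop blocks: iterations x nops per iteration.
echo $((2000000 * 2048))
# 4096000000

# ipc is a hypothetical helper: <instructions> / <cycles>, two decimals.
ipc() {
    LC_ALL=C awk -v insn="$1" -v cyc="$2" 'BEGIN { printf "%.2f\n", insn / cyc }'
}

ipc 4103534341 1031075940
# 3.98
```

The small surplus over 4096000000 in the measured instruction count comes from the loop overhead and process startup.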
Caller vs callee callstacks
The following example shows a scenario with these two call chains:
main -> do_foo() -> do_work()
main -> do_bar() -> do_work()
perf report --stdio -g graph,caller
# Children Self Command Shared Object Symbols
# ........ ........ ....... .................... .................
#
# 49.71% 49.66% bench bench [.] do_work
# |
# --49.66%--_start <- callstack bottom
# __libc_start_main
# 0x7ff366c62ccf
# main
# |
# |--25.13%--do_bar
# | do_work <- callstack top
# |
# --24.53%--do_foo
# do_work
perf report --stdio -g graph,callee
# Children Self Command Shared Object Symbols
# ........ ........ ....... .................... .................
#
# 49.71% 49.66% bench bench [.] do_work
# |
# ---do_work <- callstack top
# |
# |--25.15%--do_bar
# | main
# | 0x7ff366c62ccf
# | __libc_start_main
# | _start <- callstack bottom
# |
# --24.55%--do_foo
# main
# 0x7ff366c62ccf
# __libc_start_main
# _start <- callstack bottom
References
- intel/perfmon - intel PMU event database per uarch
- intel/perfmon-html - an HTML rendered version of the PMU events with search
- intel/perfmon/mapfile.csv - processor family to uarch mapping
- linux/perf/events - x86 PMU events known to perf tools
- linux/arch/events - x86 PMU events linux kernel
- wikichip - computer architecture wiki
- perf-list(1) - manpage
- perf_event_open(2) - manpage
- intel/sdm - intel software developer manuals (e.g. Optimization Reference Manual)