perf(1)

perf list     show supported hw/sw events & metrics
  -v ........ print longer event descriptions
  --details . print information on the perf event names
              and expressions used internally by events

perf stat
  -p <pid> ..... show stats for running process
  -o <file> .... write output to file (default stderr)
  -I <ms> ...... show stats periodically over interval <ms>
  -e <ev> ...... select event(s)
  -M <met> ..... print metric(s), adds the events the metrics need
  --all-user ... configure all selected events for user space
  --all-kernel . configure all selected events for kernel space

perf top
  -p <pid> .. show stats for running process
  -F <hz> ... sampling frequency
  -K ........ hide kernel symbols

perf record
  -p <pid> ............... record stats for running process
  -o <file> .............. write output to file (default perf.data)
  -F <hz> ................ sampling frequency
  --call-graph <method> .. [fp, dwarf, lbr] method used to capture backtraces
                           fp   : use frame-pointer, needs code compiled with
                                  -fno-omit-frame-pointer
                           dwarf: use .cfi debug information
                           lbr  : use hardware last branch record facility
  -g ..................... short-hand for --call-graph fp
  -e <ev> ................ select event(s)
  --all-user ............. configure all selected events for user space
  --all-kernel ........... configure all selected events for kernel space

perf report
  -n .................... annotate symbols with nr of samples
  --stdio ............... report to stdio, if not present tui mode is used
  -g graph,0.5,callee ... show callee based call chains with value >0.5%
  -M intel .............. use intel disassembly in annotate

Useful <ev>:
  page-faults
  minor-faults
  major-faults
  cpu-cycles
  task-clock

Select specific events

Events to sample are specified with the -e option: either pass a comma-separated list or pass -e multiple times.

Events are specified in the form name[:modifier]. The modifiers are listed and described in the perf-list(1) manpage under EVENT MODIFIERS.

# Count the specified events:
#   L1 i$ misses in user space
#   L2 i$ stats in user/kernel space mixed
perf stat -e L1-icache-load-misses:u \
          -e l2_rqsts.all_code_rd:uk,l2_rqsts.code_rd_hit:k,l2_rqsts.code_rd_miss:k \
          -- stress -c 2

The --all-user and --all-kernel options append the :u or :k modifier, respectively, to all specified events. The following two command lines are therefore equivalent.

# 1)
perf stat -e cycles:u,instructions:u -- ls

# 2)
perf stat --all-user -e cycles,instructions -- ls

Raw events

If perf does not provide a symbolic name for an event, the event can be specified in raw form as r + UMask + EventCode, with both values written in hex.

The following example uses the L2_RQSTS.CODE_RD_HIT event, which has EventCode=0x24 and UMask=0x10 on my laptop with a sandybridge uarch.

perf stat -e l2_rqsts.code_rd_hit -e r1024 -- ls
# Performance counter stats for 'ls':
#
#       33.942      l2_rqsts.code_rd_hit
#       33.942      r1024
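
The encoding can be sketched as a small shell helper. The function name raw_event is hypothetical and not part of perf, and the sketch assumes the simple case described above, where only UMask and EventCode are set and each fits in one byte.

```shell
# Hypothetical helper (not part of perf): build the raw event string
# r<UMask><EventCode> from two one-byte values given in hex or decimal.
raw_event() {
    printf 'r%02x%02x\n' "$1" "$2"
}

raw_event 0x10 0x24   # prints r1024
```

The result can be passed directly to -e, as in the perf stat call above.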

Find raw performance counter events (intel)

The intel/perfmon repository provides performance event databases for the different intel uarchs.

The table in mapfile.csv can be used to look up the corresponding uarch; just grab the family and model from the procfs.

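# Print the <vendor>-<family>-<model> key once (first cpu only), with the
# model in upper-case hex to match the mapfile.csv keys.
awk '/^vendor_id/     { V=$3 }
     /^cpu family/    { F=$4 }
     /^model[ \t]*:/  { printf "%s-%d-%X\n",V,F,$3; exit }' /proc/cpuinfo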

The table in performance monitoring events describes how events are sorted into the different files.

Raw events for perf's own symbolic names

perf also defines its own symbolic names for some events. An example is the cache-references event, for which the perf_event_open(2) manpage gives the following description.

perf_event_open(2)

PERF_COUNT_HW_CACHE_REFERENCES
    Cache accesses.  Usually this indicates Last Level Cache accesses but this
    may vary depending on your CPU.  This may include prefetches and coherency
    messages; again this depends on the design of your CPU.

The sysfs can be consulted to get the concrete performance counter encoding on the given system, shown here for the related cache-misses event.

cat /sys/devices/cpu/events/cache-misses
# event=0x2e,umask=0x41
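
Such a sysfs descriptor can be converted into the raw form r + UMask + EventCode. The following is a minimal sketch, assuming the descriptor only carries event= and umask= fields as in the output above; any other fields are ignored. The function name sysfs_to_raw is hypothetical.

```shell
# Hypothetical helper: convert "event=0x2e,umask=0x41" to r<UMask><EventCode>.
# The body runs in a subshell, so the IFS and set -f changes stay local.
sysfs_to_raw() (
    event=0
    umask_val=0
    IFS=,      # split the descriptor on commas
    set -f     # no filename expansion while splitting
    for field in $1; do
        case $field in
            event=*) event=${field#event=} ;;
            umask=*) umask_val=${field#umask=} ;;
        esac
    done
    printf 'r%02x%02x\n' "$umask_val" "$event"
)

sysfs_to_raw "event=0x2e,umask=0x41"   # prints r412e
```

On a real system the argument would be the content of the sysfs file, for example the output of the cat command above.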

Flamegraph

Flamegraph with single event trace

perf record -g -e cpu-cycles -p <pid>
perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > cycles-flamegraph.svg

Flamegraph with multiple event traces

perf record -g -e cpu-cycles,page-faults -p <pid>
perf script --per-event-dump
# fold & generate as above
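
With multiple events, the fold & generate step runs once per dump file. The following is a minimal sketch, assuming perf script --per-event-dump wrote files named perf.data.<event>.dump into the current directory (the naming described in perf-script(1)) and that the FlameGraph scripts live in ./FlameGraph as above. The FLAMEGRAPH_DIR variable and the function name are hypothetical.

```shell
# Hypothetical helper: generate one flamegraph per perf.data.<event>.dump file.
# FLAMEGRAPH_DIR is an assumed variable pointing at the FlameGraph checkout.
FLAMEGRAPH_DIR=${FLAMEGRAPH_DIR:-FlameGraph}

flamegraph_per_event() {
    for dump in perf.data.*.dump; do
        [ -e "$dump" ] || continue     # glob matched nothing
        event=${dump#perf.data.}       # strip the prefix ...
        event=${event%.dump}           # ... and the suffix, leaving the event name
        "$FLAMEGRAPH_DIR/stackcollapse-perf.pl" < "$dump" \
            | "$FLAMEGRAPH_DIR/flamegraph.pl" > "$event-flamegraph.svg"
    done
}
```

Calling flamegraph_per_event after the perf script step would then leave one <event>-flamegraph.svg per recorded event.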

Examples

Estimate max instructions per cycle

#define NOP4        "nop\nnop\nnop\nnop\n"
#define NOP32       NOP4   NOP4   NOP4   NOP4   NOP4   NOP4   NOP4   NOP4
#define NOP256      NOP32  NOP32  NOP32  NOP32  NOP32  NOP32  NOP32  NOP32
#define NOP2048     NOP256 NOP256 NOP256 NOP256 NOP256 NOP256 NOP256 NOP256

int main() {
  for (unsigned i = 0; i < 2000000; ++i) {
    asm volatile(NOP2048);
  }
}

perf stat -e cycles,instructions ./noploop
# Performance counter stats for './noploop':
#
#     1.031.075.940      cycles
#     4.103.534.341      instructions       #    3,98  insn per cycle
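
The insn per cycle value is the instruction count divided by the cycle count. As a quick check, the counts from the output above (thousands separators removed) can be fed into a small helper; the function name ipc is hypothetical.

```shell
# Hypothetical helper: IPC = instructions / cycles, rounded to two decimals.
# LC_ALL=C keeps the decimal point independent of the locale.
ipc() {
    LC_ALL=C awk -v i="$1" -v c="$2" 'BEGIN { printf "%.2f\n", i / c }'
}

ipc 4103534341 1031075940   # prints 3.98
```

Of the counted instructions, 2000000 * 2048 = 4096000000 are the nops in the loop body; the rest is loop overhead and process startup.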

Caller vs callee callstacks

The example below uses a scenario with these two call chains:

  • main -> do_foo() -> do_work()
  • main -> do_bar() -> do_work()

perf report --stdio -g graph,caller

# Children      Self  Command  Shared Object         Symbols
# ........  ........  .......  ....................  .................
#
#  49.71%    49.66%   bench    bench                 [.] do_work
#          |
#           --49.66%--_start                <- callstack bottom
#                     __libc_start_main
#                     0x7ff366c62ccf
#                     main
#                     |
#                     |--25.13%--do_bar
#                     |          do_work    <- callstack top
#                     |
#                      --24.53%--do_foo
#                                do_work

perf report --stdio -g graph,callee

# Children      Self  Command  Shared Object         Symbols
# ........  ........  .......  ....................  .................
#
#  49.71%    49.66%   bench    bench                 [.] do_work
#          |
#          ---do_work                       <- callstack top
#             |
#             |--25.15%--do_bar
#             |          main
#             |          0x7ff366c62ccf
#             |          __libc_start_main
#             |          _start             <- callstack bottom
#             |
#              --24.55%--do_foo
#                        main
#                        0x7ff366c62ccf
#                        __libc_start_main
#                        _start             <- callstack bottom
