Profiling

Author: Ferdinand Majerech

https://github.com/kiith-sa/profiling

Intro

* in most cases

Overview

Real world machines - cache

_images/caches.png

Real world machines - branches

Numbers everyone should know (Jeff Dean)

L1 cache hit (data not in register)   ~0.5 ns          few cycles
Branch mispredict                     ~5 ns
Lmax cache reference (L1/L2 miss)     ~7 ns            ~14x L1 hit
Mutex lock/unlock                     ~25 ns
Main memory reference (Lmax miss)     ~100 ns          ~14x Lmax hit
Compress 1K bytes w/ Snappy (Google)  ~3,000 ns
Send 2K bytes over 1 Gbps network     ~20,000 ns
Read 4K randomly from SSD             ~150,000 ns
Read 1 MB sequentially from RAM       ~250,000 ns
Round trip within same datacenter     ~500,000 ns
Read 1 MB sequentially from SSD       ~1,000,000 ns
Disk seek                             ~10,000,000 ns
Read 1 MB sequentially from disk      ~20,000,000 ns   ~80x RAM + seek
Send packet CA->Netherlands->CA       ~150,000,000 ns

Files (workshop)

commands.txt     Commands for copy-pasting
small.cfg        Small CFG file for testing
huge.cfg         Huge CFG file for testing
cfg*.cpp/cfg*.h  Sample code (CFG parser) for profiling
diy.h            DIY profiling example
LICENSE_1_0.txt  Boost license

Building the binary to profile (workshop)

On inlining

Evolution of a cfg parser

Evolution of a cfg parser

bin    huge.cfg 10  small.cfg 50000
cfg    ~1290ms      ~1070ms
cfg2   ~880ms       ~1010ms
cfg3   ~700ms       ~700ms
cfg4   ~445ms       ~605ms
cfg5   ~420ms       ~595ms

Tools

time

time ./cfg small.cfg 200:

real 0m0.011s
user 0m0.009s
sys  0m0.002s

gprof

Do not use gprof.*

* except to get function call counts

valgrind

callgrind

callgrind - example

valgrind --tool=callgrind ./cfg huge.cfg 1:

I   refs:      757,793,107
I1  misses:          4,312
LLi misses:          3,776
I1  miss rate:         0.0%
LLi miss rate:         0.0%

D   refs:      358,393,103  (212,485,868 rd + 145,907,235 wr)
D1  misses:      2,020,202  (  1,864,991 rd +     155,211 wr)
LLd misses:      1,491,826  (  1,341,624 rd +     150,202 wr)
D1  miss rate:         0.5% (        0.8%   +         0.1%  )
LLd miss rate:         0.4% (        0.6%   +         0.1%  )

LL refs:         2,024,514  (  1,869,303 rd +     155,211 wr)
LL misses:       1,495,602  (  1,345,400 rd +     150,202 wr)
LL miss rate:          0.1% (        0.1%   +         0.1%  )

Branches:       93,924,200  ( 75,266,948 cond +  18,657,252 ind)
Mispredicts:     8,760,703  (  4,822,612 cond +   3,938,091 ind)
Mispred rate:          9.3% (        6.4%     +        21.1%   )

KCachegrind

_images/kcachegrind.png

KCachegrind examples

kcachegrind callgrind.out.[NUMBER]

massif

_images/massif-visualizer.png

~/.valgrindrc (workshop)

--num-callers=20
--callgrind:compress-strings=no
--callgrind:dump-line=yes
--callgrind:dump-instr=yes
--callgrind:compress-pos=no
--callgrind:separate-threads=no
--callgrind:collect-jumps=yes
--callgrind:collect-systime=yes
--callgrind:collect-bus=yes
--callgrind:cache-sim=yes
--callgrind:branch-sim=yes
--callgrind:I1=32768,8,64
--callgrind:D1=32768,8,64
--callgrind:LL=262144,8,64

perf

perf - commands

perf list

perf list lists the events perf can record on this machine

_images/perf_list.png

perf list

Some interesting events:

cycles                   CPU cycles spent
instructions             Instructions executed
branches                 All branches
branch-misses            Mispredicted branches (~5 ns)
L1-Xcache-XXXX-misses    L1 cache misses (<= ~7 ns)
cache-references         Lmax cache accesses (~7 ns)
cache-misses             Lmax cache misses (~100 ns)
stalled-cycles-frontend  Cycles the CPU frontend is doing nothing
stalled-cycles-backend   Cycles the CPU itself is doing nothing

perf stat

perf stat ./tharsis-game --threads=3:

_images/perf_stat.png

perf stat

perf record

perf report

perf report

perf top

perf top - example

DIY

DIY - why

Despiker

Appendix 1: Patterns

Patterns - STL

Patterns - Cache

Appendix 2: Optimizations

Optimizations - polymorphism

Optimizations - cache

class Bad
{
    uint64_t rarelyAccessed[8];
    // cache line (64 bytes) boundary
    std::string alwaysAccessed;
};

class Good
{
    // Fits the first cache line
    std::string alwaysAccessed;
    uint64_t rarelyAccessed[8];
};

Optimizations - branch prediction

Optimizations - alloc (time) overhead

Optimizations - alloc overhead

class Foo {
public:
    static Foo* freelist; // Start of free list

    static Foo* alloc() {
        Foo* f;
        if(freelist) { // Reuse from freelist
            f        = freelist;
            freelist = f->next;
        }
        else {
            f = new Foo();
        }
        return f;
    }

    static void dealloc(Foo* f) { // Move to freelist
        f->next  = freelist;
        freelist = f;
    }

private:
    Foo* next; // For use by freelist
    // Other Foo code here...
};
// In one .cpp file: Foo* Foo::freelist = nullptr;

Optimizations - compiler params

Optimizations - inlining

Optimizations - intrinsics

Optimizations - int types

Optimizations - 0 (Andrei Alexandrescu)

It's over

https://github.com/kiith-sa/profiling