Flame Graphs for Go With pprof

Posted on

This week, I was working on a Go program and I wanted to understand which part was taking the most time. I had seen some people use flame graphs for this, but had never made one myself, so I decided to try it out. It took a little time to figure out the right tools to use, but once I did it was pretty easy. Here’s what one looks like (full size):

On the X axis, a wider bar means more time spent, and on the Y axis you can see the call stack (functions lower down are calling functions higher up). There’s no particular meaning to the colors.

Unfortunately I can’t share the actual program I was working on, but I’ll show you the steps on a little program called whoami which starts a web server that responds to each request by simply writing back the IP address of the requester:

package main

import (
  "log"
  "net/http"
  _ "net/http/pprof"
)

func main() {
  http.HandleFunc("/", handle)
  log.Fatal(http.ListenAndServe(":8080", nil))
}

func handle(w http.ResponseWriter, r *http.Request) {
  log.Printf("handling request from: %s", r.RemoteAddr)
  if _, err := w.Write([]byte(r.RemoteAddr)); err != nil {
    log.Printf("could not write IP: %s", err)
  }
}

Go’s standard library includes some tools for profiling the running program through its various pprof packages and utilities. Here, I’m importing net/http/pprof, which exposes /debug/pprof endpoints on the DefaultServeMux.

Profiling our web server

I want to make sure whoami is actually serving requests when I profile it, so I’m using wrk to generate lots of requests by running wrk -d 30s 'http://localhost:8080'. With that running, I can fetch a 20 second CPU profile from the pprof server:

$ go tool pprof \
  -raw -output=cpu.txt \
  'http://localhost:8080/debug/pprof/profile?seconds=20'

This creates a file called cpu.txt containing the decoded pprof samples, which are what I need to build my flame graph.

Brendan Gregg, the inventor of flame graphs, has published some scripts to turn pprof output into an interactive flame graph. Since whoami is a Go program, I can use stackcollapse-go.pl to convert the samples to the right format for flamegraph.pl (both from this repository):

$ ./stackcollapse-go.pl cpu.txt | flamegraph.pl > flame.svg

Click here to see the result!

Make it faster!

One thing I noticed though is that about 30% of the time it takes to serve() each request seems to be spent in log.Printf, which needs to make a write system call to print the message to the terminal:

Annotated flame graph showing log.Printf about a third as wide as serve

Maybe we can make our server faster by removing the logging? But to know if we can make it “faster,” we need to know how fast it is right now.

One interesting thing about flame graphs is that they don’t measure time in seconds, they measure it in samples. When you run pprof, it checks what your program is doing 100 times per second, so the flame graph is just an aggregation of “how many times was each unique stack trace sampled.” (This means that if you have a function that’s called rarely, it might not even appear in the flame graph at all!)

So to tell how fast whoami is in absolute terms, we can use wrk to gather some initial statistics. I want to run wrk again (rather than use the results from when I was profiling) because profiling your program will slow it down.

$ wrk -d 10s 'http://localhost:8080'
Running 10s test @ http://localhost:8080
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   152.41us   51.27us   2.81ms   86.24%
    Req/Sec    31.30k     2.27k   39.31k    72.77%
  628837 requests in 10.10s, 76.76MB read
Requests/sec:  62261.70
Transfer/sec:      7.60MB

When I remove the call to log.Printf and re-run the server, wrk now reports:

$ wrk -d 10s 'http://localhost:8080'
Running 10s test @ http://localhost:8080
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   117.23us   30.28us   1.02ms   78.11%
    Req/Sec    40.28k     3.09k   47.51k    69.80%
  809605 requests in 10.10s, 98.83MB read
Requests/sec:  80160.78
Transfer/sec:      9.79MB

Sure enough, it looks like the average latency decreased by around 30% (from 152 to 117 µs), and the requests per second correspondingly increased from 62k to 80k, around 30%!

I don’t have super high confidence in this measurement though. I’m doing this all on my laptop, so I’m not sure if the system calls that wrk is making to send the requests are slowing down the system calls whoami is making to read the requests and write the responses at all.

tl;dr: How to Make a Flame Graph from a pprof source

Download the scripts from Brendan Gregg’s FlameGraph repo and then assuming <source> is either a pprof file or URL, run these commands:

$ go tool pprof -raw -output=cpu.txt <source>
$ stackcollapse-go.pl cpu.txt | flamegraph.pl > cpu.svg

More in Golang: