The Inner Product, December 2002
Jonathan Blow (jon@number-none.com)
Last updated 17 January 2004
 
Interactive Profiling, Part 1
Related articles:
Interactive Profiling, Part 2 
(January 2003)
Interactive Profiling, Part 3 (February 2003)
 
Working with different development houses, I've seen that it's often extremely 
difficult to get anything done. Most developers are working at a low level of 
efficiency. Our programming languages, compilers, debuggers, source control 
systems, and profiling tools are all, in one way or another, bad. This costs us 
huge amounts of time and money. Today I'd like to attack one corner of this 
problem space, the inadequacy of our profiling tools.
For the practicing game developer, there are a handful of commercial profiling 
products to choose from; two representative examples are Intel's VTune and the 
CodeWarrior Hierarchical Profiler (HiProf). Despite being profilers, these 
products are ill-suited to profiling games in several ways; using them in a 
production environment tends to be painful. Usually a profiler will slow down the game 
tremendously, preventing us from reproducing the situations we want to profile. 
VTune is not slow if you do not collect any information other than the program 
counter at each sample; but that information is of very limited use. 
Because these products don't satisfy our needs, just about every game gets 
rigged with a homemade profiler. We manually mark sections of the code with 
telltale names like "render_world" or "simulate_physics"; a heads-up display 
shows, in real time, how much CPU is being spent in each area. 
Interactive Profiling as a Different Thing than Batch Profiling
There are some ways in which even the simplest in-game profiler is more useful 
than a tool like VTune or HiProf. With a realtime heads-up display, you can see 
correlations between the frame rate and the activity of each game subsystem, and 
use human pattern-matching and intelligence to figure out what's happening.
Suppose, in a map with a lot of objects and busy activity, your game's frame 
rate stutters annoyingly; say, roughly one frame out of every 30 takes three times as 
long as it should. If you try to diagnose the problem with VTune or HiProf, 
you'll meet with some difficulty. Since the long frames are intermixed with 
normal frames, the code causing the problem is only taking 1 2/3 as much CPU as 
it should, when averaged over time; this probably won't be sufficient for that 
code to stand out as the culprit in the profiling report. 
You might be able to track down the problem in VTune by enabling the "collect 
chronology data" option. You'll suffer through annoying slowdowns, and if the 
problem is timestep-dependent, it may not happen during profiling. To robustly 
solve this problem, you can build your game engine to support full journaling 
and playback; that's useful in its own right, but it's also a big engineering 
task that we'd rather not require just to obtain a profile. After all this, you 
may discover that the problem is caused by your AI pathfinding code. You may 
even know that a specific loop is eating up a lot of time. But the profiler 
can't tell you why. 
On the other hand, with an in-engine interactive profiler you see the problem 
instantly, because you're watching a live report that changes from frame to 
frame. Every time the frame stutter happens, you'll see the program section "AI 
Pathfind" jump to the top of the chart. And because you're running around inside 
the game world, you may be able to easily correlate the spikes with specific 
problems -- for example, now that you know the problem is pathfinding, you can 
see that it happens whenever a monster walks near a certain kind of lamp. So you 
go investigate the geometry of the lamp to get to the heart of the matter. 
Because you can turn on the interactive profiler at any time, you don't have to 
strain to reproduce a test case; when it happens, bam, you hit the function key 
that turns on the profiler and shows you what's happening. 
Batch profilers try to help you visualize data, with bar charts and pie charts 
and hierarchical diagrams. But your game renderer is already a rich 
visualization of your game's variables, a far more complex and meaningful one 
than any batch profiler will give you. Augmenting the world rendering with some 
supplementary data is usually the best visualization approach. 
Basically, there are two things batch profilers are good for: (1) giving you a 
broad idea of your program's behavior averaged over time, and (2) giving you 
precise instruction-level performance data for small sections of code. (1) can 
be obtained just as well through a simple in-engine profiler, and (2) represents 
a small fraction of the profiling use cases of a modern game.
As an industry we frequently write our own in-engine profilers, but it's seldom 
articulated that interactive profiling is a different paradigm than batch 
profiling. The batch profilers available to us now are somewhat useful, but most 
of the time we really want a good interactive profiler. If the tool vendors 
understood this, they could do some neat stuff. 
Because we as game developers are not in the business of selling tools, the 
profilers we write are usually minimal. I think it would be quite worthwhile to 
spend some up-front effort to build a high-quality interactive profiler that 
people can just plug into their games. So, in this article and the next, I will 
take steps toward that. 
Goals of the Interactive Profiler
I want this profiler to be effortless to use. It should be so fast that you can 
always leave it enabled in every build of your game, only removing it when the 
time comes to ship the final product. Disabling profilers with #ifdefs in 
various builds is bad; suppose you see a problem and want to check it out, but 
whoops, your current build doesn't have profiling turned on, so you have to 
stop, change the #define, do a full rebuild which may take a while, then run the 
game and try to reproduce the case (which can be pretty hard sometimes). That's 
the kind of process impediment that stifles game development. By itself it 
doesn't seem like much, but we build up these situations where many impediments 
affect our daily progress, and it becomes a chore to do anything. 
Using the profiler, I'd like it to be easy to detect the kinds of performance 
problems we care about. I want the profiler's behavior to be reliable; for 
example, it should not change in character based on the frame rate. And I think 
it would be really neat if it provided some tools that help us jump to more of a 
meta-level and see the different kinds of behaviors our games exhibit, and how 
common each behavior is within the overall gamut. 
Profiler Overview
I'll create a small header file that should be included in all code that wants 
to be profiled. The header contains macros for declaring different zones of the 
program (rendering, input, physics, etc.) and for stating that code execution is 
entering or leaving a given zone. 
When we enter a zone, we'll ask a high-resolution timer what time it is, and 
record that. Upon leaving the zone, we subtract that entry time from the current 
time, which tells us how long we were inside. We add this to a total-time 
accumulator for that zone. Once per frame, we draw the accumulated results and 
reset them. 
With a little bit of extra accounting, we can separately track what HiProf 
refers to as the "hierarchical time", which is the time spent inside a zone and 
all zones that it calls; and "self time", which only counts time within a 
specific zone body and stops when a new zone is entered. 
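This bookkeeping is easy to get wrong around nesting, so here is a minimal sketch of it. This is my own illustration, not the article's source: get_ticks is a hand-advanced counter standing in for the real high-resolution timer (so the example is deterministic), and a fixed-depth stack replaces proper error handling.

```cpp
#include <cassert>

typedef long long Profiling_Int64;

// Stand-in for the high-resolution timer (rdtsc in the real profiler).
// A counter we advance by hand keeps the example deterministic.
static Profiling_Int64 current_ticks = 0;
static Profiling_Int64 get_ticks() { return current_ticks; }

struct Program_Zone {
    const char *name;
    Profiling_Int64 total_self_ticks;   // time in this zone's own body
    Profiling_Int64 total_hier_ticks;   // time in this zone plus its callees
    Profiling_Int64 t_self_start;
    Profiling_Int64 t_hier_start;
    Program_Zone(const char *n) : name(n), total_self_ticks(0),
        total_hier_ticks(0), t_self_start(0), t_hier_start(0) {}
};

const int MAX_ZONE_DEPTH = 64;
static Program_Zone *zone_stack[MAX_ZONE_DEPTH];
static int zone_depth = 0;

void Enter_Zone(Program_Zone &zone) {
    Profiling_Int64 now = get_ticks();
    if (zone_depth > 0) {
        // Pause the parent's self-timer; its hierarchical timer keeps running.
        Program_Zone *parent = zone_stack[zone_depth - 1];
        parent->total_self_ticks += now - parent->t_self_start;
    }
    zone.t_self_start = now;
    zone.t_hier_start = now;
    zone_stack[zone_depth++] = &zone;
}

void Leave_Zone(Program_Zone &zone) {
    Profiling_Int64 now = get_ticks();
    zone.total_self_ticks += now - zone.t_self_start;
    zone.total_hier_ticks += now - zone.t_hier_start;
    zone_depth--;
    if (zone_depth > 0) {
        // Resume the parent's self-timer.
        zone_stack[zone_depth - 1]->t_self_start = now;
    }
}
```

Once per frame, the display code reads the accumulated tick totals for each zone and resets them. (This sketch ignores recursive zones, which need extra care with the hierarchical start time.)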
Zone Declaration
I chose to declare zones by instantiating structs in the global namespace. Each 
struct contains the variables used to store the raw timing data for that zone. 
For convenience I created a macro called Define_Zone that declares the 
appropriate instance (see Listing 1). When entering or leaving a zone, we refer 
to the name of that global zone variable; all references to zones are resolved 
at link time, so no CPU is spent at runtime matching zone IDs. Also, you get a 
compile-time error if you try to use a zone that isn't declared, which is good 
for catching spelling mistakes.
There are two alternatives for declaring zones that people often use, but they 
both have big disadvantages. One method is to create a global header file with 
enumerated constants for each zone; these constants index into an array. With 
this approach, adding a new zone requires a full rebuild of the project, and 
full rebuilds are horrible for many code bases. Also, when tracking down a 
performance problem, we often want to incrementally create temporary zones, 
which is painful with this scheme. The struct-based scheme presented here is 
totally confined to the files being worked on, requiring minimal recompilation.
The other alternative is to use strings to designate zone names, but this incurs 
a lot of runtime overhead, and spelling errors can go unnoticed for a long time.
 
Listing 1:

    struct Program_Zone {
        Program_Zone(char *name);

        char *name;
        int index;
        Profiling_Int64 total_self_ticks;
        Profiling_Int64 t_self_start;
        ....   // Other members deleted for brevity
    };
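Listing 1 shows the struct itself; the Define_Zone macro is not shown, but it might expand along these lines (a sketch with names of my own choosing; the registration-counter detail is an assumption about how the index member gets filled in):

```cpp
#include <cassert>
#include <cstring>

typedef long long Profiling_Int64;

struct Program_Zone {
    Program_Zone(const char *name);
    const char *name;
    int index;
    Profiling_Int64 total_self_ticks;
    Profiling_Int64 t_self_start;
};

// Each zone registers itself at static-initialization time; the index
// lets display code address per-zone data in flat arrays.
static int num_zones = 0;

Program_Zone::Program_Zone(const char *name_)
    : name(name_), total_self_ticks(0), t_self_start(0) {
    index = num_zones++;
}

// Define_Zone(foo) creates a global variable zone_foo. Enter/Leave
// macros refer to zone_foo by name, so the linker resolves all zone
// references and a misspelled zone fails at compile time.
#define Define_Zone(name) Program_Zone zone_##name(#name)

Define_Zone(render_world);
Define_Zone(simulate_physics);
```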
Entering and Leaving Zones
It is easy to define macros to indicate that you are entering or leaving a given 
zone. When entering, the zone is pushed onto a stack; when leaving, it's popped. 
However, I don't like using these macros directly; it is too easy to forget an 
exit declaration in functions that have a lot of exit points (Listing 2). I 
prefer to make a C++ dummy class that contains no local data, but that calls 
Enter_Zone on construction, and Leave_Zone on destruction. Then I can 
instantiate that class whenever I want to enter a zone, and the leaving will 
occur automatically.
 
Listing 2:

    bool foo(char *s) {
        Enter_Zone(foo);

        if (s == NULL) {
            Exit_Zone(foo);
            return false;
        }

        char *t;
        // Do something complicated with s,
        // putting the result into t...
        t = result;

        if (!well_formed(t)) {
            Exit_Zone(foo);
            return false;
        }

        do_something_complicated_with(t);
        Exit_Zone(foo);
        return true;
    }
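The dummy class might look like the sketch below. Enter_Zone and Leave_Zone are stubbed here to just track nesting depth so the example is self-contained; the class and macro names are my own, not the article's.

```cpp
#include <cassert>
#include <cstddef>

// Stub zone bookkeeping; real versions would record timestamps too.
static int zone_depth = 0;
static int enter_count = 0, leave_count = 0;

struct Program_Zone { const char *name; };

void Enter_Zone(Program_Zone &) { zone_depth++; enter_count++; }
void Leave_Zone(Program_Zone &) { zone_depth--; leave_count++; }

// The "dummy class": enter on construction, leave on destruction, so
// every return path out of the scope exits the zone automatically.
struct Zone_Scope {
    Program_Zone &zone;
    Zone_Scope(Program_Zone &z) : zone(z) { Enter_Zone(zone); }
    ~Zone_Scope() { Leave_Zone(zone); }
};

#define Profile_Scope(name) Zone_Scope profile_scope_##name(zone_##name)

Program_Zone zone_foo = { "foo" };

// foo() from Listing 2, rewritten: no explicit Exit_Zone calls needed,
// even though the function has multiple exit points.
bool foo(const char *s) {
    Profile_Scope(foo);
    if (s == NULL) return false;
    // ... complicated work elided ...
    return true;
}
```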
Getting the Time
We want to acquire an accurate timestamp as quickly as possible. We could use 
the Windows API function QueryPerformanceCounter ("QPC") for this, but QPC is 
very slow. In Windows 2000/XP it's a wrapper function in a DLL, which calls a 
different wrapper function in a different DLL, which triggers a kernel trap to a 
routine that reads the PCI bus clock. If you want to be saved from the agony of 
including windows.h in every file of your game, you'll probably wrap QPC yet 
again. Add all that up, and it's just not lightweight.
Instead I wrote some inline assembly language that uses the rdtsc instruction, 
which has been available on all new x86 chips for years now. This gives me 
integer timestamps in units of some arbitrary chip frequency, but I want time 
measured in seconds. For maximum speed, I perform all the timestamp accounting 
directly on these integer timestamps, and only convert to seconds once per 
frame, when it's time to process and display the profiling data. To perform the 
conversion, I use QPC at the beginning and end of the frame to get a frame 
duration in seconds; by measuring how many rdtsc ticks have passed during the 
same interval, I can convert from rdtsc ticks to seconds.
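As a sketch of that per-frame conversion: the __rdtsc compiler intrinsic below stands in for the article's inline assembly, the wall-clock frame duration would come from QPC on Windows, and the function names are mine.

```cpp
#include <cassert>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>  // __rdtsc intrinsic, equivalent to inline asm rdtsc
#endif

typedef unsigned long long Profiling_Int64;

// Read the processor's timestamp counter.
Profiling_Int64 get_rdtsc() {
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    return 0;  // not x86; a real port would read that CPU's cycle counter
#endif
}

// Once per frame: given the tick counts at frame start and end, and the
// frame duration in seconds from a wall-clock source (QPC on Windows),
// convert a zone's accumulated tick total into seconds.
double ticks_to_seconds(Profiling_Int64 accumulated_ticks,
                        Profiling_Int64 frame_start_ticks,
                        Profiling_Int64 frame_end_ticks,
                        double frame_seconds) {
    Profiling_Int64 frame_ticks = frame_end_ticks - frame_start_ticks;
    if (frame_ticks == 0) return 0.0;
    double seconds_per_tick = frame_seconds / (double)frame_ticks;
    return accumulated_ticks * seconds_per_tick;
}
```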
Unfortunately, newer mobile processors have this feature called SpeedStep, 
wherein they dynamically adjust their clock speed based on demand. Even though 
rdtsc's name ("Read Time Stamp Counter") indicates that it was intended as a 
timer, someone at Intel decided that it's okay for rdtsc to dumbly report clock 
cycles without compensating for SpeedStep. As a result, if you're running on a 
laptop and your app has bottlenecks that involve blocking on I/O or accidentally 
sleeping, rdtsc is unreliable as a measurement of time. Fortunately, the 
profiler can detect this situation and report that SpeedStep is causing a 
problem (it just looks for fluctuation in the number of rdtsc ticks per QPC 
second). However, there is no way to fix the profiling data itself. If this 
happens to you, I recommend using a desktop machine for profiling.
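That detection can be sketched as a pure function over the per-frame measurements the profiler already makes. The 2% tolerance here is my own assumption, not a figure from the article.

```cpp
#include <cassert>

// Once per frame we already know how many rdtsc ticks elapsed per QPC
// second. If the CPU clock is constant, that ratio is stable; if it
// swings beyond measurement noise, the clock speed is changing
// (SpeedStep) and rdtsc-based timings can't be trusted.
bool clock_speed_is_fluctuating(const double *ticks_per_second,
                                int num_frames,
                                double tolerance = 0.02) {
    if (num_frames < 2) return false;
    double lo = ticks_per_second[0], hi = ticks_per_second[0];
    for (int i = 1; i < num_frames; i++) {
        if (ticks_per_second[i] < lo) lo = ticks_per_second[i];
        if (ticks_per_second[i] > hi) hi = ticks_per_second[i];
    }
    return (hi - lo) > tolerance * hi;
}
```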
Displaying the Data
We draw the profiling data with one zone per line, sorting by CPU usage so that 
the most expensive zones are at the top, as shown in Figure 1. But since timings 
in a game fluctuate a lot, if we do this naively, lines of output will move 
around haphazardly and the display will be hard to read.
This is a good application for the frame-rate-independent IIR filters I 
introduced two months ago (See "Toward Better Scripting, Part I"). We apply 
these once per frame to the timing values; the resulting numbers are more stable 
and the display is easier to read. By varying the filter coefficients, we can 
decide whether we want to see long-term averages or very recent values. 
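One way to formulate such a frame-rate-independent filter is with an exponential decay per unit time, sketched below; the exact form in the scripting article may differ, and the time constant tau is my notation.

```cpp
#include <cassert>
#include <cmath>

// Frame-rate-independent low-pass filter: each update decays the old
// value by exp(-dt/tau), so smoothing behaves the same no matter how
// long each frame took. tau (seconds) is the filter coefficient knob:
// large tau shows long-term averages, small tau shows recent values.
double filter_update(double filtered, double sample, double dt, double tau) {
    double k = exp(-dt / tau);
    return filtered * k + sample * (1.0 - k);
}
```

Frame-rate independence means one update over dt seconds gives the same result as two updates over dt/2 each, which is exactly what keeps the display consistent as the frame rate varies.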
But we don't just want to know about frame-time hogs. Consistency of frame rate 
is very important, so we should also seek out routines that are inconsistent in 
the amount of time they take. In the scripting article, I showed how to compute 
the variance of a filtered variable. We square-root this to get the standard 
deviation (stdev), which is measured in seconds. Now the profiler can display 
and sort by self-stdev and hierarchical-stdev. In addition, we color-code each 
line of output depending on how unpredictable the zone's timing is; steady zones 
are drawn in a cool color, and vibrating zones are drawn in a hot color.
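A sketch of that statistic, pairing a filtered mean with a filtered variance; this is one reasonable formulation and may differ in detail from the scripting article's exact update rule.

```cpp
#include <cassert>
#include <cmath>

// Filtered mean and variance of a zone's frame time: both are
// exponential moving averages. The standard deviation (square root of
// the variance) is in the same units as the samples, i.e. seconds, so
// the profiler can display it, sort by it, and map it to a color.
struct Filtered_Stat {
    double mean;
    double variance;
    Filtered_Stat() : mean(0.0), variance(0.0) {}

    void update(double sample, double dt, double tau) {
        double k = exp(-dt / tau);
        double delta = sample - mean;          // deviation from current mean
        mean = mean * k + sample * (1.0 - k);
        variance = variance * k + delta * delta * (1.0 - k);
    }

    double stdev() const { return sqrt(variance); }
};
```

A zone with a perfectly steady cost drives its stdev toward zero; a zone whose cost oscillates keeps a large stdev even if its average looks innocent, which is exactly the frame-rate-consistency culprit we want to surface.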
Sample Code, and Next Time
This month's sample code is a toy game system with the profiler built in. 
There's a simple world full of crates. Besides rendering, the game has dummy 
functions for AI, audio, etc. To ensure that the different systems run slowly 
enough for the profile to be interesting across a variety of machines, they 
artificially slow themselves down. Some functions, like the AI code, fluctuate 
their CPU usage so that the profiler can detect that variance.
The profiler itself is very modular; you should be able to use it in your own 
game with minimal effort. We've only just gotten started exploring what we can 
do with an interactive profiler; next month we'll look at some data processing 
and visualization enhancements.