The Inner Product, December 2002
Jonathan Blow (jon@number-none.com)
Last updated 17 January 2004
 
Interactive Profiling, Part 1
Related articles:
Interactive Profiling, Part 2 
(January 2003)
Interactive Profiling, Part 3 (February 2003)
 
Working with different development houses, I've seen that it's often extremely 
difficult to get anything done. Most developers are working at a low level of 
efficiency. Our programming languages, compilers, debuggers, source control 
systems, and profiling tools are all, in one way or another, bad. This costs us 
huge amounts of time and money. Today I'd like to attack one corner of this 
problem space, the inadequacy of our profiling tools.
For the practicing game developer, there are a handful of commercial profiling 
products to choose from; two representative examples are Intel's VTune and the 
CodeWarrior Hierarchical Profiler (HiProf). Despite being profilers, these 
products are ill-suited to profiling games in several ways; using them in a 
production environment tends to be painful. Usually a profiler will slow down the game 
tremendously, preventing us from reproducing the situations we want to profile. 
VTune is not slow if you do not collect any information other than the program 
counter at each sample; but that information is of very limited use. 
Because these products don't satisfy our needs, just about every game gets 
rigged with a homemade profiler. We manually mark sections of the code with 
telltale names like "render_world" or "simulate_physics"; a heads-up display 
shows, in real time, how much CPU is being spent in each area. 
Interactive Profiling as a Different Thing than Batch Profiling
There are some ways in which even the simplest in-game profiler is more useful 
than a tool like VTune or HiProf. With a realtime heads-up display, you can see 
correlations between the frame rate and the activity of each game subsystem, and 
use human pattern-matching and intelligence to figure out what's happening.
Suppose, in a map with a lot of objects and busy activity, your game's frame 
rate stutters annoyingly; say, roughly one frame out of every 30 takes three times as 
long as it should. If you try to diagnose the problem with VTune or HiProf, 
you'll meet with some difficulty. Since the long frames are intermixed with 
normal frames, the code causing the problem is only taking 1 2/3 as much CPU as 
it should, when averaged over time; this probably won't be sufficient for that 
code to stand out as the culprit in the profiling report. 
You might be able to track down the problem in VTune by enabling the "collect 
chronology data" option. You'll suffer through annoying slowdowns, and if the 
problem is timestep-dependent, it may not happen during profiling. To robustly 
solve this problem, you can build your game engine to support full journaling 
and playback; that's useful in its own right, but it's also a big engineering 
task that we'd rather not require just to obtain a profile. After all this, you 
may discover that the problem is caused by your AI pathfinding code. You may 
even know that a specific loop is eating up a lot of time. But the profiler 
can't tell you why. 
On the other hand, with an in-engine interactive profiler you see the problem 
instantly, because you're watching a live report that changes from frame to 
frame. Every time the frame stutter happens, you'll see the program section "AI 
Pathfind" jump to the top of the chart. And because you're running around inside 
the game world, you may be able to easily correlate the spikes with specific 
problems -- for example, now that you know the problem is pathfinding, you can 
see that it happens whenever a monster walks near a certain kind of lamp. So you 
go investigate the geometry of the lamp to get to the heart of the matter. 
Because you can turn on the interactive profiler at any time, you don't have to 
strain to reproduce a test case; when it happens, bam, you hit the function key 
that turns on the profiler and shows you what's happening. 
Batch profilers try to help you visualize data, with bar charts and pie charts 
and hierarchical diagrams. But your game renderer is already a rich 
visualization of your game's variables, a far more complex and meaningful one 
than any batch profiler will give you. Augmenting the world rendering with some 
supplementary data is usually the best visualization approach. 
Basically, there are two things batch profilers are good for: (1) giving you a 
broad idea of your program's behavior averaged over time, and (2) giving you 
precise instruction-level performance data for small sections of code. (1) can 
be obtained just as well through a simple in-engine profiler, and (2) represents 
a small fraction of the profiling use cases of a modern game.
As an industry we frequently write our own in-engine profilers, but it's seldom 
articulated that interactive profiling is a different paradigm than batch 
profiling. The batch profilers available to us now are somewhat useful, but most 
of the time we really want a good interactive profiler. If the tool vendors 
understood this, they could do some neat stuff. 
Because we as game developers are not in the business of selling tools, the 
profilers we write are usually minimal. I think it would be quite worthwhile to 
spend some up-front effort to build a high-quality interactive profiler that 
people can just plug into their games. So, in this article and the next, I will 
take steps toward that. 
Goals of the Interactive Profiler
I want this profiler to be effortless to use. It should be so fast that you can 
always leave it enabled in every build of your game, only removing it when the 
time comes to ship the final product. Disabling profilers with #ifdefs in 
various builds is bad; suppose you see a problem and want to check it out, but 
whoops, your current build doesn't have profiling turned on, so you have to 
stop, change the #define, do a full rebuild which may take a while, then run the 
game and try to reproduce the case (which can be pretty hard sometimes). That's 
the kind of process impediment that stifles game development. By itself it 
doesn't seem like much, but we build up these situations where many impediments 
affect our daily progress, and it becomes a chore to do anything. 
Using the profiler, I'd like it to be easy to detect the kinds of performance 
problems we care about. I want the profiler's behavior to be reliable; for 
example, it should not change in character based on the frame rate. And I think 
it would be really neat if it provided some tools that help us jump to more of a 
meta-level and see the different kinds of behaviors our games exhibit, and how 
common each behavior is within the overall gamut. 
Profiler Overview
I'll create a small header file that should be included in all code that wants 
to be profiled. The header contains macros for declaring different zones of the 
program (rendering, input, physics, etc.) and for stating that code execution is 
entering or leaving a given zone. 
When we enter a zone, we'll ask a high-resolution timer what time it is, and 
record that. Upon leaving the zone, we subtract that entry time from the current 
time, which tells us how long we were inside. We add this to a total-time 
accumulator for that zone. Once per frame, we draw the accumulated results and 
reset them. 
With a little bit of extra accounting, we can separately track what HiProf 
refers to as the "hierarchical time", which is the time spent inside a zone and 
all zones that it calls; and "self time", which only counts time within a 
specific zone body and stops when a new zone is entered. 
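This bookkeeping is easy to get wrong around nesting, so here is a minimal sketch of it. This is my own illustration, not the article's source: get_ticks is a hand-advanced counter standing in for the real high-resolution timer (so the example is deterministic), and a fixed-depth stack replaces proper error handling.

```cpp
#include <cassert>

typedef long long Profiling_Int64;

// Stand-in for the high-resolution timer (rdtsc in the real profiler).
// A counter we advance by hand keeps the example deterministic.
static Profiling_Int64 current_ticks = 0;
static Profiling_Int64 get_ticks() { return current_ticks; }

struct Program_Zone {
    const char *name;
    Profiling_Int64 total_self_ticks;   // time in this zone's own body
    Profiling_Int64 total_hier_ticks;   // time in this zone plus its callees
    Profiling_Int64 t_self_start;
    Profiling_Int64 t_hier_start;
    Program_Zone(const char *n) : name(n), total_self_ticks(0),
        total_hier_ticks(0), t_self_start(0), t_hier_start(0) {}
};

const int MAX_ZONE_DEPTH = 64;
static Program_Zone *zone_stack[MAX_ZONE_DEPTH];
static int zone_depth = 0;

void Enter_Zone(Program_Zone &zone) {
    Profiling_Int64 now = get_ticks();
    if (zone_depth > 0) {
        // Pause the parent's self-timer; its hierarchical timer keeps running.
        Program_Zone *parent = zone_stack[zone_depth - 1];
        parent->total_self_ticks += now - parent->t_self_start;
    }
    zone.t_self_start = now;
    zone.t_hier_start = now;
    zone_stack[zone_depth++] = &zone;
}

void Leave_Zone(Program_Zone &zone) {
    Profiling_Int64 now = get_ticks();
    zone.total_self_ticks += now - zone.t_self_start;
    zone.total_hier_ticks += now - zone.t_hier_start;
    zone_depth--;
    if (zone_depth > 0) {
        // Resume the parent's self-timer.
        zone_stack[zone_depth - 1]->t_self_start = now;
    }
}
```

Once per frame, the display code reads the accumulated tick totals for each zone and resets them. (This sketch ignores recursive zones, which need extra care with the hierarchical start time.)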
Zone Declaration
I chose to declare zones by instantiating structs in the global namespace. Each 
struct contains the variables used to store the raw timing data for that zone. 
For convenience I created a macro called Define_Zone that declares the 
appropriate instance (see Listing 1). When entering or leaving a zone, we refer 
to the name of that global zone variable; all references to zones are resolved 
at link time, so no CPU is spent at runtime matching zone IDs. Also, you get a 
compile-time error if you try to use a zone that isn't declared, which is good 
for catching spelling mistakes.
There are two alternatives for declaring zones that people often use, but they 
both have big disadvantages. One method is to create a global header file with 
enumerated constants for each zone; these constants index into an array. With 
this approach, adding a new zone requires a full rebuild of the project, and 
full rebuilds are horrible for many code bases. Also, when tracking down a 
performance problem, we often want to incrementally create temporary zones, 
which is painful with this scheme. The struct-based scheme presented here is 
totally confined to the files being worked on, requiring minimal recompilation.
The other alternative is to use strings to designate zone names, but this incurs 
a lot of runtime overhead, and spelling errors can go unnoticed for a long time.
 
Listing 1:

    struct Program_Zone {
        Program_Zone(char *name);

        char *name;
        int index;
        Profiling_Int64 total_self_ticks;
        Profiling_Int64 t_self_start;
        ....   // Other members deleted for brevity
    };
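Listing 1 shows the struct itself; the Define_Zone macro is not shown, but it might expand along these lines (a sketch with names of my own choosing; the registration-counter detail is an assumption about how the index member gets filled in):

```cpp
#include <cassert>
#include <cstring>

typedef long long Profiling_Int64;

struct Program_Zone {
    Program_Zone(const char *name);
    const char *name;
    int index;
    Profiling_Int64 total_self_ticks;
    Profiling_Int64 t_self_start;
};

// Each zone registers itself at static-initialization time; the index
// lets display code address per-zone data in flat arrays.
static int num_zones = 0;

Program_Zone::Program_Zone(const char *name_)
    : name(name_), total_self_ticks(0), t_self_start(0) {
    index = num_zones++;
}

// Define_Zone(foo) creates a global variable zone_foo. Enter/Leave
// macros refer to zone_foo by name, so the linker resolves all zone
// references and a misspelled zone fails at compile time.
#define Define_Zone(name) Program_Zone zone_##name(#name)

Define_Zone(render_world);
Define_Zone(simulate_physics);
```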
Entering and Leaving Zones
It is easy to define macros to indicate that you are entering or leaving a given 
zone. When entering, the zone is pushed onto a stack; when leaving, it's popped. 
However, I don't like using these macros directly; it is too easy to forget an 
exit declaration in functions that have a lot of exit points (Listing 2). I 
prefer to make a C++ dummy class that contains no local data, but that calls 
Enter_Zone on construction, and Leave_Zone on destruction. Then I can 
instantiate that class whenever I want to enter a zone, and the leaving will 
occur automatically.
 
Listing 2:

    bool foo(char *s) {
        Enter_Zone(foo);

        if (s == NULL) {
            Exit_Zone(foo);
            return false;
        }

        char *t;
        // Do something complicated with s,
        // putting the result into t...
        t = result;

        if (!well_formed(t)) {
            Exit_Zone(foo);
            return false;
        }

        do_something_complicated_with(t);
        Exit_Zone(foo);
        return true;
    }
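The dummy class might look like the sketch below. Enter_Zone and Leave_Zone are stubbed here to just track nesting depth so the example is self-contained; the class and macro names are my own, not the article's.

```cpp
#include <cassert>
#include <cstddef>

// Stub zone bookkeeping; real versions would record timestamps too.
static int zone_depth = 0;
static int enter_count = 0, leave_count = 0;

struct Program_Zone { const char *name; };

void Enter_Zone(Program_Zone &) { zone_depth++; enter_count++; }
void Leave_Zone(Program_Zone &) { zone_depth--; leave_count++; }

// The "dummy class": enter on construction, leave on destruction, so
// every return path out of the scope exits the zone automatically.
struct Zone_Scope {
    Program_Zone &zone;
    Zone_Scope(Program_Zone &z) : zone(z) { Enter_Zone(zone); }
    ~Zone_Scope() { Leave_Zone(zone); }
};

#define Profile_Scope(name) Zone_Scope profile_scope_##name(zone_##name)

Program_Zone zone_foo = { "foo" };

// foo() from Listing 2, rewritten: no explicit Exit_Zone calls needed,
// even though the function has multiple exit points.
bool foo(const char *s) {
    Profile_Scope(foo);
    if (s == NULL) return false;
    // ... complicated work elided ...
    return true;
}
```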
Getting the Time
We want to acquire an accurate timestamp as quickly as possible. We could use 
the Windows API function QueryPerformanceCounter ("QPC") for this, but QPC is 
very slow. In Windows 2000/XP it's a wrapper function in a DLL, which calls a 
different wrapper function in a different DLL, which triggers a kernel trap to a 
routine that reads the PCI bus clock. If you want to be saved from the agony of 
including windows.h in every file of your game, you'll probably wrap QPC yet 
again. Add all that up, and it's just not lightweight.
Instead I wrote some inline assembly language that uses the rdtsc instruction, 
which has been available on all new x86 chips for years now. This gives me 
integer timestamps in units of some arbitrary chip frequency, but I want time 
measured in seconds. For maximum speed, I perform all the timestamp accounting 
directly on these integer timestamps, and only convert to seconds once per 
frame, when it's time to process and display the profiling data. To perform the 
conversion, I use QPC at the beginning and end of the frame to get a frame 
duration in seconds; by measuring how many rdtsc ticks have passed during the 
same interval, I can convert from rdtsc ticks to seconds.
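As a sketch of that per-frame conversion: the __rdtsc compiler intrinsic below stands in for the article's inline assembly, the wall-clock frame duration would come from QPC on Windows, and the function names are mine.

```cpp
#include <cassert>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>  // __rdtsc intrinsic, equivalent to inline asm rdtsc
#endif

typedef unsigned long long Profiling_Int64;

// Read the processor's timestamp counter.
Profiling_Int64 get_rdtsc() {
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    return 0;  // not x86; a real port would read that CPU's cycle counter
#endif
}

// Once per frame: given the tick counts at frame start and end, and the
// frame duration in seconds from a wall-clock source (QPC on Windows),
// convert a zone's accumulated tick total into seconds.
double ticks_to_seconds(Profiling_Int64 accumulated_ticks,
                        Profiling_Int64 frame_start_ticks,
                        Profiling_Int64 frame_end_ticks,
                        double frame_seconds) {
    Profiling_Int64 frame_ticks = frame_end_ticks - frame_start_ticks;
    if (frame_ticks == 0) return 0.0;
    double seconds_per_tick = frame_seconds / (double)frame_ticks;
    return accumulated_ticks * seconds_per_tick;
}
```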
Unfortunately, newer mobile processors have this feature called SpeedStep, 
wherein they dynamically adjust their clock speed based on demand. Even though 
rdtsc's name ("Read Time Stamp Counter") indicates that it was intended as a 
timer, someone at Intel decided that it's okay for rdtsc to dumbly report clock 
cycles without compensating for SpeedStep. As a result, if you're running on a 
laptop and your app has bottlenecks that involve blocking on I/O or accidentally 
sleeping, rdtsc is unreliable as a measurement of time. Fortunately, the 
profiler can detect this situation and report that SpeedStep is causing a 
problem (it just looks for fluctuation in the number of rdtsc ticks per QPC 
second). However, there is no way to fix the profiling data itself. If this 
happens to you, I recommend using a desktop machine for profiling.
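That detection can be sketched as a pure function over the per-frame measurements the profiler already makes. The 2% tolerance here is my own assumption, not a figure from the article.

```cpp
#include <cassert>

// Once per frame we already know how many rdtsc ticks elapsed per QPC
// second. If the CPU clock is constant, that ratio is stable; if it
// swings beyond measurement noise, the clock speed is changing
// (SpeedStep) and rdtsc-based timings can't be trusted.
bool clock_speed_is_fluctuating(const double *ticks_per_second,
                                int num_frames,
                                double tolerance = 0.02) {
    if (num_frames < 2) return false;
    double lo = ticks_per_second[0], hi = ticks_per_second[0];
    for (int i = 1; i < num_frames; i++) {
        if (ticks_per_second[i] < lo) lo = ticks_per_second[i];
        if (ticks_per_second[i] > hi) hi = ticks_per_second[i];
    }
    return (hi - lo) > tolerance * hi;
}
```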
Displaying the Data
We draw the profiling data with one zone per line, sorting by CPU usage so that 
the most expensive zones are at the top, as shown in Figure 1. But since timings 
in a game fluctuate a lot, if we do this naively, lines of output will move 
around haphazardly and the display will be hard to read.
This is a good application for the frame-rate-independent IIR filters I 
introduced two months ago (See "Toward Better Scripting, Part I"). We apply 
these once per frame to the timing values; the resulting numbers are more stable 
and the display is easier to read. By varying the filter coefficients, we can 
decide whether we want to see long-term averages or very recent values. 
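One way to formulate such a frame-rate-independent filter is with an exponential decay per unit time, sketched below; the exact form in the scripting article may differ, and the time constant tau is my notation.

```cpp
#include <cassert>
#include <cmath>

// Frame-rate-independent low-pass filter: each update decays the old
// value by exp(-dt/tau), so smoothing behaves the same no matter how
// long each frame took. tau (seconds) is the filter coefficient knob:
// large tau shows long-term averages, small tau shows recent values.
double filter_update(double filtered, double sample, double dt, double tau) {
    double k = exp(-dt / tau);
    return filtered * k + sample * (1.0 - k);
}
```

Frame-rate independence means one update over dt seconds gives the same result as two updates over dt/2 each, which is exactly what keeps the display consistent as the frame rate varies.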
But we don't just want to know about frame-time hogs. Consistency of frame rate 
is very important, so we should also seek out routines that are inconsistent in 
the amount of time they take. In the scripting article, I showed how to compute 
the variance of a filtered variable. We square-root this to get the standard 
deviation (stdev), which is measured in seconds. Now the profiler can display 
and sort by self-stdev and hierarchical-stdev. In addition, we color-code each 
line of output depending on how unpredictable the zone's timing is; steady zones 
are drawn in a cool color, and vibrating zones are drawn in a hot color.
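A sketch of that statistic, pairing a filtered mean with a filtered variance; this is one reasonable formulation and may differ in detail from the scripting article's exact update rule.

```cpp
#include <cassert>
#include <cmath>

// Filtered mean and variance of a zone's frame time: both are
// exponential moving averages. The standard deviation (square root of
// the variance) is in the same units as the samples, i.e. seconds, so
// the profiler can display it, sort by it, and map it to a color.
struct Filtered_Stat {
    double mean;
    double variance;
    Filtered_Stat() : mean(0.0), variance(0.0) {}

    void update(double sample, double dt, double tau) {
        double k = exp(-dt / tau);
        double delta = sample - mean;          // deviation from current mean
        mean = mean * k + sample * (1.0 - k);
        variance = variance * k + delta * delta * (1.0 - k);
    }

    double stdev() const { return sqrt(variance); }
};
```

A zone with a perfectly steady cost drives its stdev toward zero; a zone whose cost oscillates keeps a large stdev even if its average looks innocent, which is exactly the frame-rate-consistency culprit we want to surface.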
Sample Code, and Next Time
This month's sample code is a toy game system with the profiler built in. 
There's a simple world full of crates. Besides rendering, the game has dummy 
functions for AI, audio, etc. To ensure that the different systems run slowly 
enough for the profile to be interesting across a variety of machines, they 
artificially slow themselves down. Some functions, like the AI code, fluctuate 
their CPU usage so that the profiler can detect that variance.
The profiler itself is very modular; you should be able to use it in your own 
game with minimal effort. We've only just gotten started exploring what we can 
do with an interactive profiler; next month we'll look at some data processing 
and visualization enhancements.