The Inner Product, February 2003
Jonathan Blow (jon@number-none.com)
Last updated 17 January 2004
 
Interactive 
Profiling, Part 3:
Behavior Clustering
Related articles:
Interactive Profiling, Part 1 
(December 2002)
Interactive Profiling, Part 2 (January 2003)
Note: This article is currently missing its figures.  
We will attempt to remedy this situation in the future.
For the past two months I've been talking about profiling. I started out by 
decrying the lack of good profiling tools for games. I talked about the 
difference between a "batch profiler", like VTune or the CodeWarrior profiler, 
and an "interactive profiler", like the ones we usually build into our games. I 
then proceeded to make a big wish list of features that a good interactive 
profiler might provide.
Chief among those features was the ability to abstract the program's timing 
measurements into higher-level "behaviors", and to do some analysis regarding 
the frequency and consistency of these behaviors. The point is that modern games 
don't exhibit one consistent pattern of resource usage. At one time they might 
be fill-rate limited by the graphics card; at another they'll be slowed down by 
an abundance of pathfinding and physics. A tool that makes these patterns clear 
would be valuable.
Finding a Clustering Method
Let's look at Table 1, which we first saw in last month's column. Here we see 
three behaviors: behavior A is rendering-heavy, B is physics-heavy, and C is 
AI-heavy. Imagine these numbers came from a profiling run of a game in progress; 
as each frame is processed, we get a new set of timing numbers. Now suppose that 
the next frame is measured, and we get the following numbers: "rendering 72%, 
physics_sim 17%, ai_pathfind 11%". Those numbers are very much like behavior A. 
Even though the numbers do not exactly match, we would expect the profiling tool 
to classify this frame as having behavior A.
 
Table 1:
| Code Module | Behavior A | Behavior B | Behavior C | 
| rendering | 75% | 25% | 25% | 
| physics_sim | 15% | 65% | 15% | 
| ai_pathfind | 10% | 10% | 60% | 
That's not hard to implement when you have a predefined list of behaviors like 
Table 1; it's harder when you start from scratch with no preconceived notions of 
program behavior. In general we are faced with a clustering problem: given a 
bunch of vectors containing timing data, we want to arrange the vectors into 
groups, where the centroid of each group can be used as a summary.
Clustering is a common task in computer science and engineering applications, so 
many algorithms have been developed, for example "k-means clustering". Here's 
how k-means works: first, randomly partition your input into a set of initial 
clusters. Find the centroid of each cluster. These clusters obviously won't be 
perfect yet. To fix that, you iterate over each input vector, and re-assign it 
to the cluster whose centroid is nearest. Now recompute the centroids. Repeat 
this reassignment process until the results converge, and you perform an entire 
pass without changing anything.
The name "k-means" indicates that there are some number of clusters (k of them), 
and that we are representing each cluster by its center (by the mean of the 
input values). 
Unfortunately, k-means and most other clustering methods are batch algorithms. 
They require you to have all the vectors available at once, and to access those 
vectors randomly. Since I want an interactive profiler, which can classify 
behaviors "on the fly" during a single program run, these algorithms are 
unsuitable.
Non-Batch Clustering
There's a special term for clustering algorithms that can handle streaming 
input: they are called "online clustering algorithms". For example, there's an 
"online k-means".
Online k-means works like this: start with k cluster centers arbitrary 
distributed through space. For each input vector you want to classify, find the 
closest cluster point, classify the vector into that cluster, and move the 
cluster point closer to that vector. (See "Radial Basis Function Learning" in 
For More Information).
Online k-means really is an incremental version of batch k-means; but making the 
algorithm incremental is a big change, so the explanations of the two versions 
sound fairly different. 
How do you decide how many clusters are necessary to accurately represent your 
data stream? There are some heuristics that involve removing cluster points when 
you have too many (some of them are lying idle, never attracting input) and 
adding cluster points when you have too few (when the radius of space attributed 
to a particular cluster point is too large). Unfortunately, these heuristics all 
require some kind of a priori knowledge of the data: you need to know how many 
samples is the minimum for considering a point to represent a valid cluster, or 
how spatially large a good cluster is likely to be. In other words, you need a 
good guess about the duration and scale of your data.
But we don't want to have to guess at that stuff in order to get a good profile. 
Really we want the computer to figure out appropriate values from context. And 
those values may need to change. If we are running our game for 5 minutes and 
see some frames that spend 70% rendering, 75% rendering, or 80% rendering, we 
probably want to draw distinctions between these and call them three different 
behaviors. But then if we walk into another room in the game, and we start 
seeing a lot of behaviors like B and C from Table 1, then perhaps all of the 
previous frames should be categorized as A.
Clearly, some kind of hierarchical clustering is necessary. Probably it's best 
to let the user decide at which detail level to view the abstractions; we should 
maintain a tree of behaviors from which the user selects a particular view. 
Unfortunately, the typical methods, like "Hierarchical Agglomerative Clustering" 
tend to be data-intensive batch processes.
Visualization
There's one other problem. In a full-sized game, there may be a few hundred 
different sub-areas of the program for which we get profiling data. In 
clustering terms, we're clustering data points that have a few hundred 
dimensions each. Suppose we manage to do this, and we get a result with 15 
different behavior clusters. Those clusters are still points in a 
high-dimensional space. We can list them, like Table 1, but we'd like better 
methods of visualization. Maybe we want to somehow project these clusters onto a 
2D chart, and have the clusters that are most alike placed near each other on 
that chart. Maybe the two dimensions of the chart can even mean something, 
related to the two biggest ways in which the behaviors tend to vary.
There are some well-known methods for projecting multidimensional data onto a 
2-plane. These methods tend to fall into two categories: (1) ad-hoc, producing 
poor results, and (2) complicated, slow, and scary. 
There's another alternative, though: a hybrid kind of algorithm that clusters 
and projects simultaneously, and does a good job of both.
The Self-Organizing Map
In this month's sample code, to perform the classification of CPU behavior, I 
used this hybrid algorithm, the Kohonen Self-Organizing Map (abbreviated SOM).
The SOM is an array of points in the input space; all the points are initialized 
to random values. For each incoming frame of profiling data, we find the closest 
point in the SOM. We then move that closest-match point toward the position 
represented by the incoming frame, just like online k-means. But the SOM does 
something extra: in addition to giving the clusters positions in the input 
space, it treats them as having positions in another space also. The clusters 
have fixed positions along a regular grid, in what I will call "neighbor space". 
Whenever you move a cluster center toward an input in input space, you also look 
up its neighbors in neighbor space, and move those neighbors through input space 
in the same direction, thought not by as much.
This neighbor space is usually implemented just by storing the cluster data in a 
2D array, and by iterating across nearby cells in that array. You can think of 
this as applying a "move filter" to the array, where the filter kernel is high 
in the center but tapers off with increasing radius (see Figure 1).
After many iterations of this, the clusters in input space will converge, just 
like k-means. The size of the move filter shrinks over time, as the clusters 
become settled, until it becomes a point. The "Self-Organizing" part of SOM 
represents the idea that, because we applied that move filter in neighborhood 
space, we were constantly influencing neighboring array cells to take on similar 
values. So after we are done processing our inputs, we not only get cluster 
centers, but the clusters are arranged in the 2D grid according to similarity. 
So we can generate a 2D display where similar clusters are drawn near each 
other, which can help us understand the data.
This is a quick-and-dirty description of the SOM, with some simplifications. For 
reasons of isotropy, like we saw when looking at networking ("Networking, Part 
2: Transmitting Vectors", July 2002, Game Developer Magazine), it is usually 
better to use a hexagonal tiling than a rectangular grid. Also, the SOM grid can 
consist of any number of dimensions, but 2 dimensions is the most common choice, 
since a map with more dimensions is harder to visualize. Finally, initialization 
of the SOM is best done using some attributes of the training data, since a 
random spread in a high-dimensional space is likely to produce points that are 
not near anything in your data set; in this case, the only way array elements 
become used is when they get dragged into use by a neighbor, so it takes a while 
before the SOM is working at reasonable effectiveness.
For some references that explain self-organizing maps in greater detail, see For 
More Information.
Adapting the SOM to realtime tasks
The SOM is a batch algorithm, for two reasons. To produce good results, the 
algorithm wants a randomly-ordered sampling of the data. Otherwise, you can 
imagine first presenting all your samples of behavior A; then you present an 
equal number of samples of behavior B which happens to land in a cluster 
neighboring A, and the movement filter for B wipes out most of A. The second 
reason is that the SOM algorithm wants to have a global notion of time; early in 
training, the movement filter is very large, and over time it shrinks, until 
toward the end of training it is nearly a point. But with live input data, we 
don't know how long we will be running, and we may wish to run indefinitely.
So I made some modifications to the SOM algorithm to adapt it for realtime data. 
These modifications are sort of questionable, since I have not analyzed them 
deeply, and I can think of some ways they may be problematic. But to a first 
approximation they work okay.
Shrinking The Filter
Rather than using a global time, I introduced the idea of "local 
convergence", which is a value between 0 and 1. The filter size centered around 
any given array cell depends on that cell's local convergence: a convergence of 
0 means a big filter, 1 means a small filter.
Whenever a point in the SOM is generally staying in the same place, its 
convergence value increases; if it starts getting yanked someplace else, its 
convergence value decreases. I use the IIR filters described in "Toward Better 
Scripting, Part I" (October 2002 issue of Game Developer) to track the trend of 
a point's motion. So if many training steps confirm the positions of some region 
of the map, that region will no longer feel the need to upset its neighbors. But 
if an area of the map keeps being upset, its "local time" will go backward and 
it will jostle its neighbors in a bid to create more space for the competing 
values.
This "hardening of the map" does to some extent alleviate the "B wipes out A" 
problem, but does not solve it. To do so, I would implement a hierarchical 
series of maps, with maps that change quickly feeding data into maps that change 
more slowly. Unfortunately I haven't had time to try this (see the note at the 
end of the article) but plan to soon.
One further problem is that there's no way for the SOM to grow. To achieve 
reasonable results, you need to construct it with enough initial nodes. But I 
believe hierarchical maps would solve this also.
Initialization
I wanted a good way to initialize the SOM. As I mentioned earlier, random 
initialization is not so good. The way it's usually done for batch data is to 
perform principal component analysis on all the input, find the two 
non-axis-aligned dimensions along which it has the greatest spread, and 
initialize the SOM array entries to evenly cover that spread. Another way of 
saying this is, you find the covariance of the data with respect to the mean, 
then pick the two longest axes of the multidimensional ellipsoid; this is just 
an n-dimensional version of what we did in "My Friend, The Covariance Body" 
(September 2002 issue of Game Developer).
Since I don't want to wait for the whole batch of input data, I just collect 10 
frames' worth of "warmup points" before the SOM kicks into training mode. Then, 
because I didn't want to write n-dimensional matrix decomposition code, I 
approximate the spread as follows: First I pick a random warmup point, then I 
find the warmup point that is furthest from it. I find the vector between them, 
and that is the longest axis of the spread. Then I remove that dimension from 
the warmup data by subtracting each point's projection along that axis. Then I 
choose another random point, find the warmup point furthest from that one, and 
that is the second longest axis of the spread. I then initialize the SOM entries 
to points evenly spaced across the plane defined by these two axes, centered on 
the mean. If the axes are degenerate within some numerical threshold, then I 
just seed the SOM with random offsets from the mean.
Sample Code
This month's sample code is an updated version of last month's toy game world. 
In addition to AI-heavy and entity-render-heavy behavior modes, there's now a 
sun in the sky; when you look toward the sun, ridiculous amounts of lens flare 
are drawn, bogging the system down that way. As you move around the game world, 
you'll cause various behaviors that are mixtures of AI, entity, and lens flare. 
Whenever mouselook is enabled so that you can move around, the game system takes 
each frame's profiling data and classifies it using an SOM.
When mouselook is disabled and the SOM is paused, you can move the mouse pointer 
over the SOM's display and look at the behaviors it has detected. There are a 
number of visualizations enabled, which are explained in the README. These 
visualizations aren't as intuitive as I'd like, but after a bit of exposure you 
get used to them and can infer interesting relationships about the clusters by 
looking at colored diagrams (Figure 2).
Future Work
So far this work has been pretty experimental, but I think it shows some good 
promise. I am going to try some hierarchical versions of the SOM and online 
k-means, and will report on their effectiveness in a future column. In the 
meantime, next month we'll go back to a more conventional topic, like 
cool-looking graphics.
References
"Radial Basis Function Learning", Paolo Palmeri,
http://miles.cnuce.cnr.it/~palmeri/datam/articles/okmc.ps.gz
"The Self-Organizing Map (SOM)", Teuvo Kohonen,
http://www.cis.hut.fi/projects/somtoolbox/theory/somalgorithm.shtml
"Enhancing SOM based data visualization", Juha Vesanto, Johan Himberg et al,
http://www.cis.hut.fi/projects/ide/publications/html/iizuka98