Optimizing memory usage of a frame-based profiler


This is a continuation of a previous post about frame-based profiling. This post explains in detail how I went about decreasing memory usage of a frame-based profiler I'm using to measure performance of processes (aka ECS Systems) in Tharsis.

Shaving off bytes

My first frame profiler worked by adding a (64-byte) ZoneStorage struct to a global array whenever a Zone ended. ZoneStorage consisted of the time when the zone was entered/exited (both 8-byte, since I need high precision), a fixed-size info string and some extra data adding up to 64 bytes per zone. This quickly ate through my memory when there were many zones in a frame (e.g. 10000 zones per frame * 30 FPS * 64 bytes == 19.2 MiB/s). I was aware of a trivial way to save some memory (moving the info strings out of ZoneStorage so short strings can really be short), but at the time this was 'good enough' as I didn't need long profiling runs (and profiled on a machine with 16 gigs of RAM).

struct ZoneStorage
    // In seconds since startup
    double start;
    double end;
    // Index of the parent zone
    uint  parent;
    char[43] info;
    // Size of used space in info
    ubyte infoBytes;
    static assert(ZoneStorage.sizeof == 64, "Unexpected size of ZoneStorage");

When rewriting the profiler, I wanted to decrease memory usage as much as possible, especially for heavy workloads with many zones per frame or high FPS. I started by moving the info string to a separate buffer, with simple layout (note: I didn't want to deduplicate strings as that could consume a lot of CPU time):

ubyte length1
char[length1] infoStr1
ubyte length2
char[length2] infoStr2

This cuts memory per zone to 21 + info.length bytes, which is a decent improvement to start with.

Another 'obvious' improvement is to remove parent and use deltas for time. parent can be inferred from zone start and end times. The idea with delta values is to use something like startSincePreviousZone and duration instead of start and end. Shorter time spans like time between zones or zone durations should hopefully need less precision to represent.

It turned out this didn't work as well. float was out of question due to its lack of precision, so I considered storing hnsec-precise (hectonanosecond, or tenth of a microsecond) time deltas using uint. That doesn't work either, because a single second has 10M hnsecs which means the longest time delta we could represent is about 430 seconds. Which is OK for most cases, but could blow up unexpectedly from time to time.

Overall, this would decrease the size of a zone to 9 + info.length bytes, which is pretty good, but at the cost of inability to handle big time gaps.

Thinking in Events

To decrease memory usage even further, I had to think about the problem differently. Obviously, I couldn't make ZoneData any smaller without making the maximum supported time delta too short.

In heavy workloads (say, 10000 zones per frame), we can start thinking in terms of zones per second or time between zones. If we have (10000 * 30FPS) 300k zones recorded per second, average time between zones will be in the order of microseconds, or 10s of hnsecs. That can be represented by a 4 byte uint; in fact it can be represented by a byte. Many zones would not have such short durations, because zones are hierarchical. However, most time gaps between individual zone starts and ends would be so short.

Zone can be separated into zone start and zone end: events occuring in separate points in time. Instead of a ZoneData array, we can store profiling data as a stream of such events, represented by an event type ID followed by event-specific data. For example, a ZoneStart/ZoneEnd event can use the following format:

EventID (ubyte) EventID.ZoneStart
ubyte timeSinceLastEvent

We can use other events to represent time spans longer than what ubyte can represent as well as zone info (which then becomes a more general 'info' event that may be used for more than just zones). With this, zone size in heavy workloads goes down to 2 + 2 + 2 + info.length bytes, although this size will increase in lighter loads where time gaps between events can't be represented by the ubyte in zone events.

A basic time span event would take 3 or 5 bytes (EventID + ushort``||``uint). However, in heavy workloads where there are few hnsecs between events, 2 or 4 byte precision may be more than needed. Usually we just need an extra bit to make the time gap slightly longer than the 255 hnsecs representable on byte in a ZoneStart or ZoneEnd event. I ended up using time span events that specify fixed-length time spans (256 or 65536 hnsecs). These can be stored in 1 byte (event ID). This leads to unnecessary overhead in light workloads, but it turns out to be not that much (550 kiB/h in 'idle'), and can be minimized by also adding an event type that specifies exact time span.

The event stream itself is a simple array of bytes. The events described above have an extra benefit of using 1-byte values for everything, removing any need to deal with endianness. The event stream is still written to by an RAII Zone struct. Example fragment of an event stream:

EventID (ubyte) EventID.LongTimeSpan  
EventID (ubyte) EventID.TimeSpan  
EventID (ubyte) EventID.ZoneStart start of zone "first"
ubyte timeSinceLastEvent  
EventID (ubyte) EventID.Info  
ubyte infoLength 9
char[infoLength] info "first"
EventID (ubyte) EventID.TimeSpan  
EventID (ubyte) EventID.ZoneEnd end of zone "zone1"
ubyte timeSinceLastEvent  
EventID (ubyte) EventID.ZoneStart start of zone "second"
ubyte timeSinceLastEvent  
EventID (ubyte) EventID.Info  
ubyte infoLength 10
char[infoLength] info "second"
EventID (ubyte) EventID.ZoneEnd end of zone "second"
ubyte timeSinceLastEvent  
EventID (ubyte) EventID.TimeSpan  

Having a linear stream of events also naturally lends itself to wrapping in a range-style API to generate type-safe event objects on-the-fly and use the full power of standard D modules like std.range and std.range. I'm writing about that in the next post.