The importance of profiling automation

Needless to say, one of the basic pillars of immersion in a game is performance, perceived as smoothness and responsiveness. A game with terrible performance breaks the immersion that could otherwise have been present. But even if we understand that as an industry, we still don’t seem to put constant effort into making sure that pillar in our games is solid. A few months ago I went to Sony’s PlayStation Devcon 2017 where I got to talk to many fellow rendering programmers as well as tools developers. One of the things that caught my attention was that even though most (if not all) of the people I talked to were working on big AAA games or technology, few of them showed any interest in profiling automation. This was also evident during a rendering round table where I was pretty much the only one asking for metrics output from Sony’s tools to use in automation. I thought that perhaps the work and knowledge on automation were left to SQEs (Software Quality Engineers), but when I asked, many told me they had few or no automation tools to check for performance regressions. That is alarming because AAA games tend to be huge beasts with a lot of people making content and writing features, but rarely do you have as many people as you need to keep performance in check as the game evolves. In the end it seems like we rely on heroic efforts at the end of the project, or even patches after launch. So my idea with this short piece is to explain why profiling automation is important and what features are required to get better automation.

Game with terrible performance on release.

The performance of a game is one of the things that seems to become critical only at the end of a project. That goes hand in hand with the constant worry about “premature optimization”, which is far more often “overdue optimization” and serves as the justification for poor architectural choices and poor use of available resources. But while you might think that you don’t have the time to focus on performance as you are trying to add new features, it is always important to see the performance impact of the code you add. Many programmers think they can get by on assumptions alone, but unfortunately most if not all programmers (myself included) make mistakes when assessing performance without profiling data. For some reason unknown to me, programmers are fully aware of the limitations of estimating tasks during a project, yet they are very confident in their understanding of what the performance issues are on a project or how the code they just submitted impacts it.

Personal experience.

My perceptions are based on the fact that before joining EA Vancouver I was a consultant, mostly contracted to improve performance. My clients knew that they had performance issues they needed to fix, but they couldn’t have an employee assigned to the task, nor hire a new full-time employee just to fix that. Having done that type of work for so long, I told my clients that I would work on whatever they asked for (I always signed staff-augmentation-type contracts) but I always recommended that they assign me a task to just do profiling. Thankfully every single client agreed, but they also told me that they knew what the issues were and what to look for. Instead I would profile CPU, GPU, and IO without looking specifically for the performance issues the client wanted fixed. More often than not the profiling data showed that the biggest performance issues were completely different from what they thought. That constant repetition of the experience only reinforced my view that as developers we are just as good at estimating performance as we are at estimating tasks in a project. But that wasn’t the only common issue among my different clients. Many of them blamed other teams or technology for their performance issues, and when they knew nobody else was to blame they often found out too late about performance regressions they had introduced. I can’t even count the number of times that someone had to be sent on a scavenger hunt to find the change or changes that caused a performance regression in the last week or month.

Solution.

What is the solution to this issue? Profiling. But there are two distinct approaches to handle it. One is to change the culture of the programmers so they profile every change they make (even if they don’t think it is performance-sensitive code), and at the same time change the culture of management to make sure it gets done with a proper allocation of time for the task. That’s pretty much impossible to do, especially for teams within a game that don’t have the same performance-focused mindset. The odds of getting that done on a rendering or systems team are considerably higher than on the UI team. But even in the rendering team it is hard to allocate the time to get that done, and it is even harder for non-technical management to understand the importance of doing it. The other approach is to move away from manual profiling and processes and instead rely on automation. Automated performance runs can get rid of the issue by providing data as often as your hardware infrastructure allows. In essence, a good automation system with a good dashboard does the high-level profiling that an engineer would have done, and makes the low-level profiling data available to the engineers as well. At the same time the dashboard can provide high-level performance numbers for technical management. With that data in hand they can ask each team about performance issues and start planning and allocating time to get them fixed.

Defining profiling smoke.

Before you even look at the tools and metrics you want to gather, it is critical that your team defines what a profiling run is. This is critical because if the runs are not consistent in terms of what the game is doing during profiling, then it will be extremely hard to decipher performance trends out of the noise in the profiling data. And not only that, but people will not value the profiling data since they know a performance spike might go away by itself by the time the next automated profiling run gets done. Obviously each team within a game will have different profiling scenarios that they consider important, so it is critical that any automated profiling covers those scenarios. But it is also important that the profiling smoke tests not be segmented per team; instead have a single smoke that covers all the different scenarios. The value of doing that is that since the different subsystems in a game run concurrently, many issues will arise at the interaction points of those subsystems, which are owned by different teams. For example, the UI team might think that the scenario they need to test is a bunch of UI-heavy screens, but perhaps a change in UI impacts performance during normal gameplay even though there might just be a few UI elements for the HUD. The other important aspect is determinism. Most games are not fully deterministic, but it is important that the profiling smoke be as deterministic as possible within the constraints of the game. Any engineer should have the ability to clearly correlate the data of two different profiling smoke runs, and that’s only possible if there is an acceptable level of determinism in the runs.

Data to gather and process.

The next step is to determine what type of data to gather and how the gathering process impacts the performance of the profiling smoke itself. The level of detail can range from capturing instruction-level counters at a fast rate, to very high-level timers that just gather how long a section of the game took to execute (for example, time to run through a whole level). The benefit of the low-level approach is that the engineer has access to specific detail to tackle a problem, but at the same time the amount of data generated is huge and it is bound to impact the performance of the profiling smoke itself. The high-level approach has the benefit of reducing the amount of data to gather and process, and it is very lightweight in its impact, but at the same time it doesn’t offer the engineer much more information than “this ran slower/faster than before”. After doing this for a while I have found the happy medium to be timestamped markers captured on a per-frame basis. The data they offer is good enough to pinpoint specific issues (assuming there are enough markers), it is a model that works just as well on CPU and GPU, and it tends to be reasonable to manage in terms of cycles and memory spent per frame. There are multiple tools that can generate and visualize this data:

  • Intel GPA
  • Microsoft PIX
  • SN Systems Razor
  • RAD Game Tools Telemetry

Unfortunately, to the best of my knowledge only Intel VTune and RAD’s Telemetry support exporting the profiling data (feel free to correct me in the comments). Any system that can’t export the data won’t work for this. If that’s the case, then you will have to implement the CPU and GPU markers on your own. This will allow you to export the data for the automation system, and the markers can also be used for an in-game profiler if desired.
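If you do end up rolling your own markers, they don’t need to be complicated. Below is a minimal sketch of a scoped CPU marker that records begin/end timestamps so they can later be exported for automation. The names (MarkerEvent, GFrameEvents, PROFILE_SCOPE) are made up for illustration and are not part of any of the tools mentioned above, and a real implementation would keep the events per thread and flush them to disk every frame.

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// One begin/end pair per marker.
struct MarkerEvent
{
	const char*   Name;    // Assumed to be a string literal that outlives the capture.
	std::uint64_t BeginNs;
	std::uint64_t EndNs;
};

static std::vector<MarkerEvent> GFrameEvents; // Hypothetical storage, flushed per frame.

static std::uint64_t NowNs()
{
	return std::chrono::duration_cast<std::chrono::nanoseconds>(
		std::chrono::steady_clock::now().time_since_epoch()).count();
}

// RAII scope: records the begin timestamp on construction and the end timestamp on destruction.
class ScopedCpuMarker
{
public:
	explicit ScopedCpuMarker(const char* InName)
		: Index(GFrameEvents.size())
	{
		GFrameEvents.push_back({ InName, NowNs(), 0 });
	}
	~ScopedCpuMarker()
	{
		GFrameEvents[Index].EndNs = NowNs();
	}
private:
	std::size_t Index;
};

#define PROFILE_CONCAT_INNER(A, B) A##B
#define PROFILE_CONCAT(A, B) PROFILE_CONCAT_INNER(A, B)
#define PROFILE_SCOPE(Name) ScopedCpuMarker PROFILE_CONCAT(CpuMarker_, __LINE__)(Name)

void UpdateParticles()
{
	PROFILE_SCOPE("UpdateParticles");
	// ... actual work measured by the marker ...
}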

One thing to keep in mind is that the data will be used for full runs instead of a few frames. This means that whatever system you implement (or product you purchase) must be able to cope with a considerable amount of data. Even with a modest number of CPU and GPU markers per frame you still generate a sizable amount of data. Just as an example, in one of the profiling smokes I run at work I can capture 45 minutes of data with around 120 markers per frame. A full capture contains the data for around 20 million markers, which means that just storing the beginning and end 64-bit timestamps (with no name) for each marker generates ~320 megabytes of uncompressed data. At the same time, whatever tool you use to generate metrics from the source data must be able to cope with that amount of data, especially if you want a quick turnaround. Once you have all the data you can generate the relevant metrics that will be the “electrocardiogram” of the game’s performance. The data that you decide to display will be highly dependent on the needs of the game, but here are a few ideas (a small sketch of how some of them could be computed follows the list).

  • General
    • CPU frame time.
      • Breakdown per subsystem or team.
    • High-water CPU frame time.
    • High-water number of consecutive CPU spikes.
    • Number of allocations per frame.
    • Loading
      • Time to load a level.
      • Time to leave a game and go back to the main menu.
      • High-water memory use during loading phases.
  • Rendering
    • GPU frame time.
    • High-water GPU frame time.
    • High-water number of consecutive GPU spikes.
    • High-water video memory used.
    • Number of missed v-syncs.
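As a rough illustration of how a few of the metrics above could be derived from the captured data, here is a sketch that computes the average frame time, the high-water frame time, and the longest run of consecutive spikes from a list of per-frame CPU times. The data layout and the spike threshold are assumptions for the example; the same idea applies to GPU times.

#include <algorithm>
#include <vector>

struct FrameStats
{
	double AverageMs;
	double HighWaterMs;           // Worst single frame in the run.
	int    LongestSpikeRunFrames; // Longest run of consecutive frames above the threshold.
};

// FrameTimesMs: one entry per frame, extracted from the exported marker data.
FrameStats ComputeFrameStats(const std::vector<double>& FrameTimesMs, double SpikeThresholdMs)
{
	FrameStats Stats = { 0.0, 0.0, 0 };
	if (FrameTimesMs.empty())
	{
		return Stats;
	}

	double Total = 0.0;
	int CurrentRun = 0;
	for (double FrameMs : FrameTimesMs)
	{
		Total += FrameMs;
		Stats.HighWaterMs = std::max(Stats.HighWaterMs, FrameMs);

		// Count consecutive frames above the spike threshold.
		CurrentRun = (FrameMs > SpikeThresholdMs) ? CurrentRun + 1 : 0;
		Stats.LongestSpikeRunFrames = std::max(Stats.LongestSpikeRunFrames, CurrentRun);
	}
	Stats.AverageMs = Total / static_cast<double>(FrameTimesMs.size());
	return Stats;
}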

Conclusion.

When all is said and done the whole team will know that whatever they do will be profiled. That means that performance won’t just vanish as features are added. In the worst-case scenario you will still have performance issues, but you will have the ability to either revert a changelist or decide how performance will be recovered in order to leave that new piece of code in the depot. Another huge benefit is that this workflow helps keep performance in the mind of the developers, instead of just getting a feature implemented at any cost. So in the end technical management will have assured visibility of performance as the game evolves, the engineers will have the proper profiling data when technical management complains, and the rest of production will know whether what they plan to deliver to the end user will actually be achievable in terms of performance.

MallocTrackerView

Last time I talked about the implementation of MallocTracker and the integration with Unreal Engine 4. While I made an effort to explain the insides of MallocTracker, one thing that was clearly lacking was how to interpret the output it generates. In this post I will talk about MallocTrackerView and about looking at the generated output.

Text output.

When allocations are not tagged and scopes are not defined then the output generated by MallocTracker is not that meaningful but some context is still provided. Let’s look at three allocations as text:

0x0000000063b40000,Main Thread,Unknown,127471600,GlobalScope,UnnamedAllocation
0x000000008bda0000,Main Thread,Unknown,123401431,GlobalScope,UnnamedAllocation
0x00000000ba8d83d0,Main Thread,Unknown,48,GlobalScope,UnnamedTArray

As you can see there is one line per allocation, and each line contains a set of comma-separated fields. The format for each line is:

Address, Thread name, Group name, Bytes, Scope Stack, Name

So even with the allocations completely untagged there is still the thread context, which might be useful. But let’s see what we get when we use the UObject MallocTracker scope support, as well as adding a scope in FArchiveAsync::Precache() and tagging the single allocation done in that function with the filename of the loaded file:

0x0000000063b40000,Main Thread,UObject,127471600,GlobalScope|HierarchicalInstancedStaticMeshComponent|HierarchicalInstancedStaticMeshComponent,../../../../dev/UnrealEngine/PZ4.8/../../../Unreal/KiteDemo/Content/Maps/GoldenPath/Height/Height_x2_y2.umap
0x000000008bda0000,Main Thread,Systems,123401431,GlobalScope|FArchiveAsyncPrecache,../../../../dev/UnrealEngine/PZ4.8/../../../Unreal/KiteDemo/Content/KiteDemo/Environments/Landscape/Materials/M_Grass_Landscape_01_INST_HOLE.uasset
0x00000000ba8d83d0,Main Thread,Unknown,48,GlobalScope|Material|M_CustomDepthBillboard,UnnamedTArray

With the proper tagging the data is far more meaningful. We now know that a 121.56 MiB allocation was done for the HierarchicalInstancedStaticMeshComponent UObject that was loading the Height_x2_y2.umap file from disk. The same quality of output is available for the other allocations. While the output does seem very useful, it is pretty much impossible to deal with the huge number of allocations in an actual game this way. The Kite demo, after loading everything, has over 600k allocations ranging from 16 bytes to 121.56 MiB across 18 different threads.

One sensible thing to do is to import that data into a spreadsheet such as Excel, which can already deal with the CSV output generated by MallocTracker. That is very useful when looking for specific allocations, such as the biggest ones. But in terms of navigation it is still hard to deal with, since there are many duplicates that could be collapsed (meaning allocations with the same name, size, group and thread but different addresses). So it is necessary to make something custom. Enter MallocTrackerView.

MallocTrackerView.

MallocTrackerView is a tool I’m writing in C# to visualize the data generated by MallocTracker better than I could with Excel or a text editor. This is my first piece of code written from scratch in C#; in the past I only had to make relatively minor changes to C# tools, such as an optimization for Unreal Engine’s MemoryProfiler2. Initially my idea was to write it in Python since I had already used it while doing some research for Electronic Arts and I thought that it was a good fit. The reason I decided against it is that Unreal Engine 4 doesn’t use it at all and it seems the only presence of it is in third-party libraries. Using C++ was also a possibility, but integrating a proper UI library would be too much. So considering that there are already a bunch of tools written in C# within the engine’s depot I decided to do it in C#. The first thing I had to consider once I had chosen C# was that I would need a TreeView control with support for columns, since the idea was to have three columns:

  • Name. This would be the thread name, scope name or allocation name depending on where you are in the hierarchy.
  • Size. Accumulated size of the allocations under that node.
  • Count. Accumulated number of allocations done under that node. Since there can be multiple allocations with the same name under the same thread and scope stack, it is possible that a leaf node has a count higher than one.

Since the default C# TreeView control didn’t properly support that and I really wasn’t into writing my own solution (after all one of the reasons to use C# is to focus on the actual tool rather than infrastructure), I did a rather quick search to see how people solved the need for a similar control. As usual Stack Overflow pointed me to the solution which was to use ObjectListView. After doing a bit more research it seemed like people were happy with it so I decided to use it.

MallocTrackerView right now is very much alpha, but in its current state it is fairly useful because it covers the most relevant use case that isn’t covered by a text editor or Excel, which is to look at the data in terms of scopes. Let’s go over it by opening a MallocTracker dump from the Kite demo:
MallocTrackerViewThread
That’s what the UI looks like after opening a MallocTracker dump. The root nodes in the tree view are the different threads. As you go deeper into the tree the scopes are shown, as well as the allocations. Here is an example:
MallocTrackerViewMaterial
Here we are looking at the allocations done for the ScotsPineMedium_Billboard_Mat material UObject from the main thread. This is where we can see the usefulness of having such a system available: we now know the specific allocations related to that UObject, and we are correctly accounting for the actual memory allocated rather than relying on hand-written GetResourceSize and serialization numbers. By comparison, this is the info you get from the obj list command for ScotsPineMedium_Billboard_Mat:

Object: Material /Game/KiteDemo/Environments/Trees/ScotsPineMedium/ScotsPineMedium_Billboard_Mat.ScotsPineMedium_Billboard_Mat
NumKBytes: 3K, MaxKBytes: 3K, ResKBytes: 62505K, ExclusiveResKBytes: 0K

Even better is to filter the allocations by scope name where we can see related allocations. Here is the output when looking at the allocations that contain the scope name “ScotsPineMedium”:
MallocTrackerViewScotsPineMedium
And you can dig even deeper by combining filters; here I filter by allocation group and by allocations with a certain name:
MallocTrackerViewUObjectAlloc
I think this clearly shows what can be done with MallocTracker enabled, even when only three allocations were properly tagged.

Bad news.

I made a pull request to integrate MallocTracker into Unreal Engine 4 and it was rejected by Epic. I think it is worth going over their reasons for not integrating it because someone else might have the same doubts. Let’s go over their concerns as stated in the response to the pull request:

  • It’s too much support burden to maintain those tags. I think this concern comes from a lack of understanding of how the system works. As a programmer you don’t have to tag the allocations. It is better to tag allocations, but if you don’t, you still have the scopes to provide context to the allocations, as you have seen in the previous examples. If you decide to tag an allocation (and you can use MallocTrackerView to determine which allocations to tag), then you don’t have to worry about that allocation ever again. The tagging is a change in a single line of code (be it an FMemory call, a container constructor, etc.) that you don’t need to maintain. What is a burden is maintaining all the UObjects’ GetResourceSize() functions, so I beg to differ. It is also worth noting that this kind of tagging isn’t of uncertain scalability; I have seen it used throughout the code of full AAA games without this concern ever being relevant.
  • It most likely won’t be used by anyone at a higher level. This makes the assumption that non-engine people currently have a good grasp of the memory usage of their game code and assets. While it is true that the MemReport command does offer some insights, it is still not enough and it is certainly harder to visualize than using MallocTracker with MallocTrackerView.
  • At a lower level it usually doesn’t provide enough information. The only tool provided by the engine that does a better job is MallocProfiler with the MemoryProfiler2 tool. But the issues with it are that it is basically unusable in any memory-constrained environment (particularly consoles), the performance is completely unacceptable to the point that the game becomes unplayable, and many of the allocations have the exact same callstack even though they refer to completely different UObjects. Instead, MallocTracker runs with pretty much the same performance as having it disabled, it has a considerably lower memory overhead (at least an order of magnitude smaller; for example it needs 35.3 MiB to store 614,145 allocations), and it does provide proper information, even for allocations done to create UObjects from Blueprints.

In the end it is fine to have this pull request rejected, after all Epic can’t accept whatever is sent their way, but I happen to think that having this in the engine would be rather useful for everybody. To Epic’s credit, given the open nature of the engine anybody can ask me for the changes and get to use this even if Epic themselves won’t integrate it. That is much better than what you can probably do with Unity or some other closed engine.

Overview video.

Adding memory tracking features to Unreal Engine 4.

It has been a while since I last talked about memory allocation and tracking. I have had the time to implement the tracking in Unreal Engine 4 and I think it is time to go over it. I’m going to make the assumption that you have read the previous blog posts I made about this: “Limitations of memory tracking features in Unreal Engine 4” and “Memory allocation and tracking“.

Unreal Engine 4 memory management API.

Basic allocation methods.

There are three basic ways to allocate or free memory within the Unreal Engine 4:

  • Use of the GMalloc pointer. This is a way to get access to the global allocator. Which allocator is used depends on GCreateMalloc().
  • FMemory functions. These are static functions such as Malloc(), Realloc(), and Free(). They also use GMalloc for the allocations, but they check whether GMalloc is defined before every allocation, reallocation or free. If GMalloc is nullptr then GCreateMalloc() is called.
  • Global new and delete operators. By default they are only defined in the modules in ModuleBoilerplate.h which means that many calls to new and delete were not being handled within the Unreal Engine 4 memory system. The overloaded operators actually call the FMemory functions.

There are cases where memory isn’t allocated and freed through those mechanisms, which shouldn’t happen if at all possible. To catch those cases I submitted a pull request, already integrated and set for release in version 4.9, that catches those allocations via the C runtime library call _CrtSetAllocHook(). One such example is that the engine’s integration of zlib didn’t do allocations through the engine’s facilities; using _CrtSetAllocHook() I found that out and made a pull request with the fix, which is also set for release in 4.9.

The basic API for both, the direct calls to GMalloc and the FMemory functions, is this:

virtual void* Malloc( SIZE_T Count, uint32 Alignment = DEFAULT_ALIGNMENT ) = 0;
virtual void* Realloc( void* Original, SIZE_T Count, uint32 Alignment = DEFAULT_ALIGNMENT ) = 0;
virtual void Free( void* Original ) = 0;

These are the places that would need to be modified if we want to add any new piece of data per allocation.

Integrating with the engine

Similar to the approach I took for the stomp allocator, I made a new class called FMallocTracker that derives from FMalloc, which allows me to hook it up to the Unreal memory allocation system. Since a valid allocator must be passed when creating the FMallocTracker instance, all the actual allocations will be done with that allocator. The FMallocTracker is only there to keep tracking information, nothing else. But that is not enough; we actually need to get the tracking data we want all the way down to the allocator. So the first step is to modify the allocator’s functions when the memory tracking feature is enabled:

#if USE_MALLOC_TRACKER
virtual void* Malloc( SIZE_T Count, uint32 Alignment, const uint32 GroupId, const char * const Name ) = 0;
virtual void* Realloc( void* Original, SIZE_T Count, uint32 Alignment, const uint32 GroupId, const char * const Name ) = 0;
#else
virtual void* Malloc( SIZE_T Count, uint32 Alignment = DEFAULT_ALIGNMENT ) = 0;
virtual void* Realloc( void* Original, SIZE_T Count, uint32 Alignment = DEFAULT_ALIGNMENT ) = 0;
#endif // USE_MALLOC_TRACKER

The new parameters are:

  • Name. Name of the allocation. This name can be whatever you want but the recommendation is that it be a literal that is easy to search for. I will show the means to provide more context later in this post.
  • Group. This is the id of the group that is responsible for this allocation. These are the groups that I have defined, but it is something that you should definitely tune for your needs (a purely illustrative sketch follows this list).
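As a purely illustrative sketch (these names are assumptions, not the exact groups from my code), a group list could look something like this:

// Hypothetical allocation groups; tune these to match how your teams are organized.
enum EAllocationGroup : uint32
{
	GROUP_UNKNOWN = 0,  // Default for untagged allocations.
	GROUP_UOBJECT,
	GROUP_RENDERING,
	GROUP_GAMEPLAY,
	GROUP_UI,
	GROUP_AUDIO,
	GROUP_PHYSICS,
	GROUP_SYSTEMS,
	GROUP_COUNT
};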

This change has to be propagated to all the allocators to make it transparent within the engine, but once that work is done you can tag allocations without worrying about the underlying implementation. The benefit of tagging allocations isn’t just for tracking purposes; it is also relevant as code documentation. Having worked on large codebases as a consultant, this kind of tagging is a great benefit when ramping up, when dealing with interactions among different groups, and when fixing memory-related crashes.

The next step is to integrate with the new and delete operators. As I already mentioned, in the engine they were defined in ModuleBoilerplate.h; to provide better coverage I first moved them to MemoryBase.h. The next step was to define new operator overloads to pass in the name and group.

OPERATOR_NEW_MSVC_PRAGMA FORCEINLINE void* operator new  (size_t Size, const uint32 Alignment, const uint32 GroupId, const char * const Name)	OPERATOR_NEW_NOTHROW_SPEC{ return FMemory::Malloc(Size, Alignment, GroupId, Name); }
OPERATOR_NEW_MSVC_PRAGMA FORCEINLINE void* operator new[](size_t Size, const uint32 Alignment, const uint32 GroupId, const char * const Name)	OPERATOR_NEW_NOTHROW_SPEC{ return FMemory::Malloc(Size, Alignment, GroupId, Name); }
OPERATOR_NEW_MSVC_PRAGMA FORCEINLINE void* operator new  (size_t Size, const uint32 Alignment, const uint32 GroupId, const char * const Name, const std::nothrow_t&)	OPERATOR_NEW_NOTHROW_SPEC	{ return FMemory::Malloc(Size, Alignment, GroupId, Name); }
OPERATOR_NEW_MSVC_PRAGMA FORCEINLINE void* operator new[](size_t Size, const uint32 Alignment, const uint32 GroupId, const char * const Name, const std::nothrow_t&)	OPERATOR_NEW_NOTHROW_SPEC	{ return FMemory::Malloc(Size, Alignment, GroupId, Name); }

To avoid having to fill up the code with checks for USE_MALLOC_TRACKER, it is nice to provide some defines so call sites can create those allocations as if USE_MALLOC_TRACKER was set, without incurring unnecessary cost when it isn’t. The intention is that this should be something with very little to no performance cost. So here is the basic definition:

#if USE_MALLOC_TRACKER
	// Argument order matches the USE_MALLOC_TRACKER signatures declared above (GroupId before Name).
	#define PZ_NEW(GroupId, Name) new(DEFAULT_ALIGNMENT, (GroupId), (Name))
	#define PZ_NEW_ALIGNED(Alignment, GroupId, Name) new((Alignment), (GroupId), (Name))
	#define PZ_NEW_ARRAY(GroupId, Name, Type, Num)  reinterpret_cast<Type*>(FMemory::Malloc((Num) * sizeof(Type), DEFAULT_ALIGNMENT, (GroupId), (Name)))
	#define PZ_NEW_ARRAY_ALIGNED(Alignment, GroupId, Name, Type, Num)  reinterpret_cast<Type*>(FMemory::Malloc((Num) * sizeof(Type), (Alignment), (GroupId), (Name)))
#else
	#define PZ_NEW(GroupId, Name) new(DEFAULT_ALIGNMENT)
	#define PZ_NEW_ALIGNED(Alignment, GroupId, Name) new((Alignment))
	#define PZ_NEW_ARRAY(GroupId, Name, Type, Num)  reinterpret_cast<Type*>(FMemory::Malloc((Num) * sizeof(Type), DEFAULT_ALIGNMENT))
	#define PZ_NEW_ARRAY_ALIGNED(Alignment, GroupId, Name, Type, Num)  reinterpret_cast<Type*>(FMemory::Malloc((Num) * sizeof(Type), (Alignment)))
#endif // USE_MALLOC_TRACKER

Here are a couple of examples of how tagged allocations look compared to non-tagged allocations in terms of code (an approximate code equivalent follows the two samples):
TaggedSample0
TaggedSample1
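Since those two screenshots may not come across here, the following is an approximation of the kind of before/after involved; the call site, sizes and names are made up for illustration:

// Before the change (untagged): in the MallocTracker output this shows up as
// "UnnamedAllocation" in the Unknown group.
uint8* UntaggedData = (uint8*)FMemory::Malloc(NumVertices * VertexStride);

// After the change (tagged): shows up under the chosen group with a searchable name.
// The argument order follows the USE_MALLOC_TRACKER signature shown earlier.
uint8* TaggedData = (uint8*)FMemory::Malloc(NumVertices * VertexStride, DEFAULT_ALIGNMENT,
                                            GROUP_RENDERING, "FMyMeshComponent::VertexData");

// Or, using the helper macros so the call site compiles the same way
// whether or not USE_MALLOC_TRACKER is enabled:
uint8* ScratchBuffer = PZ_NEW_ARRAY(GROUP_RENDERING, "FMyMeshComponent::ScratchBuffer", uint8, ScratchSize);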

Tracking allocations done by containers.

One of the issues that comes up when trying to name allocations in an easily recognizable way is dealing with containers. There is hardly ever a single instance of anything within the engine or when making a game, be it the positions of particles or the number of players, so a lot of containers are used within the engine. When an allocation is made within a container it wouldn’t be too useful to have a generic name. Let’s look at this example from FMeshParticleVertexFactory::DataType:

/** The streams to read the texture coordinates from. */
TArray<FVertexStreamComponent,TFixedAllocator<MAX_TEXCOORDS> > TextureCoordinates;

A generic name for allocations done within the allocator assigned to that container would be something like “TFixedAllocator::ResizeAllocation”. It doesn’t say much. Instead, a better name for all allocations related to that container would be something like “FMeshParticleVertexFactory::DataType::TextureCoordinates”. In order to do this we need to be able to assign names and groups to the containers in such a way that whenever an allocation is done within that container, the name and group of the container are fetched to tag the allocation. To do that we need to make changes to the containers and to the allocators that can be used with those containers. That involves adding a pointer and a 32-bit unsigned integer per container when building with USE_MALLOC_TRACKER enabled, and changing the necessary constructors to add that optional information. One of the constructors for a TArray would look like this:

TArray(const uint32 GroupId = GROUP_UNKNOWN, const char * const Name = "UnnamedTArray")
	: ArrayNum(0)
	, ArrayMax(0)
#if USE_MALLOC_TRACKER
	, ArrayName(Name)
	, ArrayGroupId(GroupId)
#endif // USE_MALLOC_TRACKER
{}

With those changes in place we have the necessary information to send to the allocators to be able to tag those allocations. The next step is to look at those allocators and make the necessary changes to pass that information to the underlying allocator being used. Those container allocators in general use the FMemory functions to allocate memory, and the FContainerAllocatorInterface defines the ResizeAllocation function that actually does the allocation. Similar to the previous changes, we need to add the name and group for the allocation.

#if USE_MALLOC_TRACKER
	void ResizeAllocation(int32 PreviousNumElements, int32 NumElements, SIZE_T NumBytesPerElement, const uint32 GroupId, const char * const Name);
#else
	void ResizeAllocation(int32 PreviousNumElements, int32 NumElements, SIZE_T NumBytesPerElement);
#endif // USE_MALLOC_TRACKER

Since we don’t want to fill up the engine’s code with ifdefs, we again rely on a define to simplify that:

#if USE_MALLOC_TRACKER
#define PZ_CONTAINER_RESIZE_ALLOCATION(ContainerPtr, PreviousNumElements, NumElements, NumBytesPerElement, GroupId, Name) (ContainerPtr)->ResizeAllocation((PreviousNumElements), (NumElements), (NumBytesPerElement), (GroupId), (Name))
#else
#define PZ_CONTAINER_RESIZE_ALLOCATION(ContainerPtr, PreviousNumElements, NumElements, NumBytesPerElement, GroupId, Name) (ContainerPtr)->ResizeAllocation((PreviousNumElements), (NumElements), (NumBytesPerElement))
#endif // USE_MALLOC_TRACKER

With that in place we can pass the ArrayName and ArrayGroupId to the container allocator.
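For example, inside the TArray growth path the call could end up looking roughly like this (a simplified sketch; the exact function it lives in and the surrounding logic are omitted):

// Somewhere inside TArray when it needs to grow its storage:
PZ_CONTAINER_RESIZE_ALLOCATION(&AllocatorInstance, ArrayMax, NewArrayMax, sizeof(ElementType),
                               ArrayGroupId, ArrayName);
// When USE_MALLOC_TRACKER is 0 the last two arguments compile away,
// so the call site stays the same in both configurations.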

One thing that is also necessary is the ability to change the name or group of a container after construction, because that is what’s needed to name allocations done by containers of containers. One such example is where we need to set the name or group after a FindOrAdd in any of these TMap containers:

/** Map of object to their outers, used to avoid an object iterator to find such things. **/
TMap<UObjectBase*, TSet<UObjectBase*> > ObjectOuterMap;
TMap<UClass*, TSet<UObjectBase*> > ClassToObjectListMap;
TMap<UClass*, TSet<UClass*> > ClassToChildListMap;

Once that’s done, all the allocations done by the containers will be tagged properly as they change. So now we just need to set the name for the container. Going back to the FMeshParticleVertexFactory::DataType::TextureCoordinates example, we can now set the name and group for the allocations:

DataType()
	: TextureCoordinates(GROUP_RENDERING, "FMeshParticleVertexFactory::DataType::TextureCoordinates")
	, bInitialized(false)
{
}

Defining scopes.

As part of the “Memory allocation and tracking” post I mentioned the need to define scopes for allocations in order to provide context. The scopes are not the same as callstacks (which is something already provided by MallocProfiler). Many allocations happen within the same callstack but refer to completely different UObjects. That is even more prevalent with the use of Blueprints. Because of that, it is very useful to have scopes that allow tracking of memory usage even within Blueprints.

To leverage the code already present in the engine I took the approach of reusing the FScopeCycleCounterUObject struct, which is used to define scopes related to UObjects for the stats system. The engine already has those scopes placed where necessary, and you can still place your own allocation-tracking-specific scopes by using the FMallocTrackerScope class. Also, to improve visibility, two scopes are created automatically on each FScopeCycleCounterUObject: a scope with the name of the class of the UObject, and a scope with the name of the UObject. That makes it easier to collapse the data per class name when we eventually create a tool to visualize the data. To get a better sense of the complexity let’s look at a single scope coming from the Elemental demo:
ScopeSample
If we analyze the allocations under that scope we see the following:

Address Thread Name Group Bytes Name
0x0000000023156420 Main Thread UObject 96 InterpGroupInst
0x00000000231cf000 Main Thread Unknown 64 UnnamedTSet
0x0000000023168480 Main Thread UObject 80 InterpTrackInstMove
0x0000000028ee8480 Main Thread Unknown 64 UnnamedTSet
0x0000000022bc2420 Main Thread Unknown 32 UnnamedTArray
0x00000000231563c0 Main Thread UObject 96 InterpGroupInst
0x00000000231cefc0 Main Thread Unknown 64 UnnamedTSet
0x0000000023168430 Main Thread UObject 80 InterpTrackInstMove
0x00000000231cef80 Main Thread Unknown 64 UnnamedTSet
0x0000000022bc2400 Main Thread Unknown 32 UnnamedTArray
0x0000000023156360 Main Thread UObject 96 InterpGroupInst
0x00000000231cef40 Main Thread Unknown 64 UnnamedTSet
0x00000000231683e0 Main Thread UObject 80 InterpTrackInstMove
0x0000000028ee8380 Main Thread Unknown 64 UnnamedTSet
0x0000000022bc23e0 Main Thread Unknown 32 UnnamedTArray
0x00000000231cef00 Main Thread UObject 64 InterpTrackInstAnimControl
0x00000000231ceec0 Main Thread UObject 64 InterpTrackInstVisibility

Those are just 17 allocations in the Play function of a Blueprint. The actual number of allocations present when I made the capture of the Elemental demo was 584,454. The number of unique scopes is pretty high as well: 4,175. And with that we are just talking about the 607 MiB allocated at the time of the capture, even though the peak number of bytes allocated was 603 MiB. This goes to show the need for this kind of memory tracking.

MallocTracker implementation

As I mentioned previously, MallocTracker was implemented in a similar fashion to the stomp allocator I made before. MallocTracker was made to be lightweight and follow the performance requirements mentioned in “Memory allocation and tracking“.

The implementation is fast enough to be enabled by default without causing too much of an impact on performance, and with a fairly low overhead in terms of memory. For example, the Elemental demo showed a tracking overhead of ~30 MiB and under 2 ms on the CPU in a debug build, and even less in optimized builds. As usual there is a compromise between memory overhead and performance, and those numbers are based on the approach I decided to take. There are other approaches which will favor either performance or memory overhead, but I think I struck a reasonable balance.

To start analyzing the implementation let’s look at a concrete example. This is what happens when FMemory::Malloc() gets called (a rough code sketch of this flow follows the list):

  1. FMemory::Malloc() gets called requesting a certain number of bytes with a given name and group assigned to that allocation.
  2. FMemory::Malloc() calls FMallocTracker::Malloc() with the same arguments, assuming that GMalloc points to an FMallocTracker instance.
  3. FMallocTracker::Malloc() allocates the actual memory by using the allocator that was passed in during FMallocTracker creation, in this case FMallocBinned.
  4. FMallocTracker::Malloc() atomically updates some global allocation stats such as peak number of bytes allocated, peak number of allocations, etcetera.
  5. FMallocTracker::Malloc() gets the PerThreadData instance assigned to the current thread.
  6. FMallocTracker::Malloc() calls PerThreadData::AddAllocation to store the allocation’s data in this thread’s containers.
  7. FMallocTracker::Malloc() returns the pointer the underlying allocator returned in step 3.
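Here is a rough sketch of what that flow could look like in code. Member names are simplified, the statistics are sketched with std::atomic, the helper functions are only declared, and this is not the exact code from the pull request:

#include <atomic>

// Sketch only: a simplified FMallocTracker wrapping another FMalloc.
class FMallocTracker : public FMalloc
{
public:
	virtual void* Malloc(SIZE_T Count, uint32 Alignment, const uint32 GroupId, const char* const Name) override
	{
		// Step 3: do the real allocation with the allocator passed in at creation time (e.g. FMallocBinned).
		void* Ptr = UsedMalloc->Malloc(Count, Alignment);

		// Step 4: update the global statistics atomically.
		const uint64 NewAllocatedBytes = AllocatedBytes.fetch_add(Count) + Count;
		const uint64 NewNumAllocations = NumAllocations.fetch_add(1) + 1;
		UpdatePeaks(NewAllocatedBytes, NewNumAllocations);

		// Steps 5 and 6: record the allocation in the calling thread's data,
		// together with the current scope stack, group and name.
		PerThreadData& ThreadData = GetPerThreadData();
		ThreadData.AddAllocation(Ptr, Count, GroupId, Name);

		// Step 7: hand back the pointer from the underlying allocator.
		return Ptr;
	}

private:
	FMalloc*            UsedMalloc;           // Allocator passed in at creation.
	std::atomic<uint64> AllocatedBytes{0};
	std::atomic<uint64> NumAllocations{0};

	void UpdatePeaks(uint64 NewBytes, uint64 NewCount);  // Updates the peak stats atomically.
	PerThreadData& GetPerThreadData();                   // Returns the calling thread's tracking data.
};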

Global statistics.

Very few global stats are included. They are there to give you a very quick overview but nothing else. The global stats collected are:

  • Allocated Bytes. Number of bytes allocated when the data was dumped.
  • Number of allocations. Number of allocations when the data was dumped. A high number of allocations will usually cause high memory fragmentation.
  • Peak number of allocated bytes. The highest number of bytes allocated since MallocTracker was enabled.
  • Peak number of allocations. The highest number of living allocations since MallocTracker was enabled.
  • Overhead bytes. Number of bytes that are MallocTracker internal overhead.

All those stats are updated atomically since they are accessed by all threads doing allocations.

Data per thread.

In order to improve performance and avoid contention as multiple threads allocate and free memory, most of the work is done on a per-thread basis. All allocations and scope stacks are stored per thread. Every allocation has a relevant scope stack defined, and the uppermost scope is always the GlobalScope. The same scope names usually show up in multiple scope stacks, so to minimize memory overhead all scope names for a thread are stored uniquely and referenced from the scope stacks. And since the same scope stacks show up in multiple allocations, the scope stacks are stored uniquely as well. Let’s look at a concrete example, with scope names in blue and allocations in orange:
SampleAllocDiagram
To store that data we use three distinct arrays which aren’t shared among the different threads (sketched in code after the list):

  • Unique scope names. This stores the unique scope names used in this thread. At least the GlobalScope is assured to be there. New scope names are added as they are pushed onto the stack.
  • Unique scope stacks. This stores unique stacks as a dynamic array of fixed-size arrays where indices to the relevant scope names are stored.
  • Allocations. Data for each allocation. This includes the address of the allocation, size in bytes, group and name assigned to the allocation, and the index to the unique scope stack.
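In code, the per-thread storage could be sketched roughly like this; the fixed scope depth and the exact field names are assumptions:

// Per-thread tracking data; under normal operation only the owning thread writes to it.
struct PerThreadData
{
	static const int32 MaxScopeDepth = 16;        // Assumed fixed depth per scope stack.

	struct ScopeStack
	{
		int32 NameIndices[MaxScopeDepth];         // Indices into UniqueScopeNames.
		int32 Depth;
	};

	struct AllocationData
	{
		void*       Address;
		SIZE_T      Bytes;
		uint32      GroupId;
		const char* Name;                         // Points to caller-owned storage.
		int32       ScopeStackIndex;              // Index into UniqueScopeStacks.
	};

	TArray<const char*>    UniqueScopeNames;      // At least "GlobalScope" is always present.
	TArray<ScopeStack>     UniqueScopeStacks;
	TArray<AllocationData> Allocations;
};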

If we refer to the previous graph we see that we have five allocations. Here is the data to store those five allocations:
SampleAllocData

Reallocating and freeing memory.

Reallocations and frees make things a bit more complicated because it is rather common within Unreal Engine to see allocations happening on one thread and then being reallocated or freed on a different thread. That means we can’t assume that we will find the relevant allocation in the per-thread data of the calling thread, so we need to introduce some locking. To reduce contention, instead of a global lock each per-thread data instance has its own lock. Both when reallocating and when freeing, the per-thread data of the calling thread is checked first for the existing allocation. If it isn’t found there, then the other threads’ per-thread data are searched, locking them in the process one at a time. This ensures that contention is reduced and the rest of the threads can keep themselves busy as long as possible.

Dealing with names.

In order to make the MallocTracker fast enough and acceptable in terms of memory consumption I had to put in place a restriction on any kind of naming, be it the name for the allocations or scopes. The restriction is that the lifetime of the memory where those names are stored must be the same or longer than the actual allocation or scope. The reason is that making any copy of that data greatly impacts performance and memory consumption, so only pointers are stored. While that may seem like a complex restriction to live with, I find it to be perfectly fine since you should know the lifetime of your own allocations. If you don’t know about the lifetime of your allocations in order to know for how long to keep those names alive then you have bigger problems to deal with.

Another implementation detail with respect to allocation and scope names is the need to deal with both ANSI and wide character names. To make this more transparent, all those pointers are assumed to be ANSI unless the 63rd bit of the pointer is set, in which case the pointer is assumed to point to a wide character name. FMallocTracker provides a way to get a pointer with that bit set for wide char names, and to set it if necessary for FNames, which can be either wide or ANSI. When the data is written out, the names are handled properly and dumped to the file.
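The wide character tagging can be sketched like this; a simplified illustration of the idea rather than the exact code:

// Bit 63 marks a name pointer as pointing to a wide character string.
static const uint64 WideCharNameBit = 1ull << 63;

// Tag a wide character name so it can be stored in the same pointer field as ANSI names.
inline const char* MakeWideCharName(const TCHAR* WideName)
{
	return reinterpret_cast<const char*>(reinterpret_cast<uint64>(WideName) | WideCharNameBit);
}

inline bool IsWideCharName(const char* Name)
{
	return (reinterpret_cast<uint64>(Name) & WideCharNameBit) != 0;
}

// Strip the tag bit before using the pointer as an actual wide string.
inline const TCHAR* GetWideCharName(const char* Name)
{
	return reinterpret_cast<const TCHAR*>(reinterpret_cast<uint64>(Name) & ~WideCharNameBit);
}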

Conclusion.

Unfortunately I won’t be able to convey the true usefulness of this type of system until I make a tool to visualize the data properly, but you can take my word that it is really useful. Finding fragmentation issues and dealing with rampant memory usage is so much easier with this data. This is considerably better than what is already provided in the engine in terms of performance, memory consumption, and data quality. The next step would be to fully transition the engine to use the tagged allocations, but that’s something that can be done as needed. It certainly doesn’t make sense to spend too much time tagging allocations that aren’t causing issues. Instead it is better to just tag the big allocations to get more insight into the specific issues. But even if you find tagging allocations too boring, you may still get useful data. Here is the data for the biggest allocations captured in the Elemental demo.
SampleFull

Sample data and source code.

To make more sense out of this I also provide sample data. The sample data comes from running a modified version of the Elemental demo on a test build. You can download the data from here and view it with any text editor that supports big files.
You can also view the code by looking at the pull request I made for Epic Games to see if they will take this update. The pull request is available here.

Video overview.

Memory stomp allocator for Unreal Engine 4.

As I have been working on the memory tracking feature, one of the issues I had to deal with was memory stomps. They are hard to track down even with the debug allocator provided by Unreal Engine 4. I will go over the cases to handle and the actual implementation.

Symptoms of a memory stomp.

The symptoms of a memory stomp can be as evident as an explicit error about a corrupted heap, or as subtle as unexpected behavior without any crash at all, which is why they are so hard to catch. This is exacerbated by the fact that any performance-aware allocator won’t actually request pages from the OS on every allocation, but will rather request multiple pages at once and assign addresses as necessary. Only when they run out of available pages will they request new pages from the OS. In that situation the OS won’t be able to do anything to let us know that we messed up (by, for example, throwing an access violation exception). Instead execution continues as usual, but the behavior isn’t what is expected since we could effectively be operating on memory that is unrelated to what we want to read or write. Just as an example, I had to deal with a case where some CPU particles would end up at the origin and change color every now and then. After looking for a logic error, I was able to determine that the issue had nothing to do with the code changing the color or transform, but rather with a memory overrun in unrelated code that ended up in the same pool in the binned allocator. Another example is when pointers get overwritten. If you don’t realize that the pointer itself was overwritten somewhere else, rather than the data being pointed to being corrupted, you may waste some time. Depending on the nature of the code accessing the data pointer it may or may not crash. Still, the symptom would be similar to that of overwritten data.

Types of memory stomps.

A memory stomp can be defined as performing an invalid type of operation on memory. All these invalid operations are hard to catch due to their nature. They are:

  • Memory overrun. Reading or writing off the end of an allocation.
static const size_t NumBytes = 1U << 6U;
uint8_t * const TestBytes = new uint8_t[NumBytes];
// Memory overrun:
static const size_t SomeVariable = 7U;
TestBytes[1U << SomeVariable] = 1U;
delete [] TestBytes;
  • Memory underrun. Reading or writing off the beginning of an allocation.
static const size_t NumBytes = 1U << 6U;
uint8_t * const TestBytes = new uint8_t[NumBytes];
// Memory underrun:
TestBytes[-128] = 1U;
delete [] TestBytes;
  • Operation after free. Reading or writing an allocation that was already freed.
static const size_t NumBytes = 1U << 6U;
uint8_t * const TestBytes = new uint8_t[NumBytes];
delete [] TestBytes;
// Operation after free:
TestBytes[0] = 1U;

One point of confusion that does come up is whether dereferencing a null pointer could be considered a memory stomp. Given that the behavior when dereferencing null is undefined, it isn’t really possible to classify it as a memory stomp. So to keep things simple, I rely on the target platform’s OS to deal with that case.

How it works.

The stomp allocator works by using the memory protection features of the different operating systems. They allow us to tag different memory pages (which are usually 4 KiB each) with different access modes. The access modes are OS-dependent, but they all offer four basic protection modes: execute, read, write, and no-access. Based on that, the stomp allocator allocates at least two pages: one page where the actual allocation lives, and an extra page that will be used to detect the memory stomp. The requested allocation is pushed to the end of the last valid page in such a way that any read or write beyond that point causes a segmentation fault, since you would be trying to access a protected page. Since people say that a picture is worth a thousand words, here is a reference diagram:
MallocStompDiag

Diagram not to scale. The memory allocation information + the sentinel only accounts for 0.39% of the two pages shown.
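To make the layout concrete, here is a simplified sketch of how an overrun-detecting allocation can be built with VirtualAlloc/VirtualProtect on Windows. It omits alignment handling and the Realloc/Free paths, and the actual implementation in the pull request differs in the details:

#include <windows.h>
#include <cstdint>

struct FAllocationData
{
	void*    FullAllocationPointer; // Base of all the pages obtained from the OS.
	SIZE_T   FullSize;              // Total size including the trailing no-access page.
	SIZE_T   RequestedSize;         // Size the caller asked for.
	uint64_t Sentinel;              // Expected to be 0xdeadbeefdeadbeef on 64-bit builds.
};

void* StompMalloc(SIZE_T Size)
{
	const SIZE_T PageSize  = 4096;
	const SIZE_T TotalSize = Size + sizeof(FAllocationData);
	// Enough pages for the allocation info plus the payload, plus one trailing no-access page.
	const SIZE_T FullSize  = ((TotalSize + PageSize - 1) / PageSize) * PageSize + PageSize;

	uint8_t* Base = (uint8_t*)VirtualAlloc(nullptr, FullSize, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
	if (Base == nullptr)
	{
		return nullptr;
	}

	// Protect the last page so any read or write past the end of the allocation faults immediately.
	DWORD OldProtect = 0;
	VirtualProtect(Base + FullSize - PageSize, PageSize, PAGE_NOACCESS, &OldProtect);

	// Push the returned pointer to the end of the last accessible page.
	uint8_t* Result = Base + FullSize - PageSize - Size;

	// Store the bookkeeping (including the underrun sentinel) right before the returned pointer.
	FAllocationData* AllocData = (FAllocationData*)(Result - sizeof(FAllocationData));
	AllocData->FullAllocationPointer = Base;
	AllocData->FullSize              = FullSize;
	AllocData->RequestedSize         = Size;
	AllocData->Sentinel              = 0xdeadbeefdeadbeefULL;

	return Result;
}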

As is visible, there is a waste of space due to the fact that we have to deal with full pages. That is the price we have to pay. But the waste is limited to a full page plus the unused space in the first valid page. So the stomp allocator isn’t something that you would have enabled by default, but it is a tool to help you find those hard-to-catch issues. Another nice benefit is that this approach works on many platforms. As a pull request to Epic I’m providing the implementation for Windows, Xbox One, Linux and Mac. But it can be implemented on PlayStation 4 (details under NDA) and perhaps even on mobile platforms, as long as they provide functionality similar to mprotect or VirtualProtect. Another aspect is the use of a sentinel to detect an underrun. The sentinel is the last member of the allocation information block shown as “Memory allocation information” in the diagram, which is necessary to be able to deallocate the full allocation on free and to know how much data to copy on Realloc. When an allocation is freed, the sentinel is checked against the expected value (which in my code is 0xdeadbeef or 0xdeadbeefdeadbeef depending on whether it is a 32-bit or 64-bit build). If the sentinel doesn’t match the expected value then an underrun was detected and execution is stopped in the debugger. But that will only work for the specific cases where the sentinel itself gets overwritten. So changing the layout with a different mode helps with this issue. Here is the diagram:
MallocStompDiagUnder

Diagram not to scale.

This basically flips where the no-access page is placed. This allows finding underruns as long as the write happens before the “Memory allocation information” shown in the diagram. That piece of data is small (the size depends on the architecture used to compile). The sentinel is still present to deal with small underrun errors. The underrun errors that manage to go below that point will actually hit the no-access page, which will trigger the response we want from the OS. The last aspect is that any attempt to read or write memory that was already freed will fail as it happens. A performance-aware allocator (such as the binned allocator) would add those pages to a free list and return them when someone else requests memory, without actually making a request to the OS for new pages. That is one of the reasons why they are fast. The stomp allocator instead decommits the freed pages, which makes any read or write operation on those pages invalid.

Memory allocation and tracking.

After that optimization interlude, let’s go back to memory allocation and tracking. As I mentioned last time, I would like to go over the different requirements of the memory allocation and tracking facilities necessary for shipping a AAA game without having a horrible time doing memory optimizations.

Solution Requirements

Most AAA games have a lot of resources that go in and out of system and video memory constantly. This is exacerbated in sandbox games, and less so in stadium games (such as NBA 2K15). In any case, the solution needs to be fast and provide proper data related to all the allocations happening. A fast solution that doesn’t provide proper tracking facilities won’t do, and a solution with proper tracking facilities that is slow won’t do either. Both aspects are important in equal measure. At the same time, ALL allocations must go through it, which implies that neither client code nor third-party libraries may allocate memory on their own, and that the global new and delete operators should be overridden.

Tracking Information

The solution must provide relevant allocation tracking information. The information must be as general as how much memory is used overall, as specific as the address returned for a particular allocation, and everything in between. Any allocation must have relevant tracking information that is defined when it is allocated and can be used as a reference for programmers to detect issues.

General Information

The general information provided should be extremely simple. It should provide the following items:

  • Allocated bytes.
  • Number of allocations.
  • Peak allocated bytes.
  • Peak number of allocations.

Allocations Grouping

Just like there are different groups/teams focused on different areas of a game, allocations should also be grouped. Some of the groups are rendering, gameplay, UI, audio, etc. The different groups have different memory allocation patterns and requirements. As such, it is a good idea to group allocations based on that criteria because it provides the following benefits:

  • Optimal allocation setup. Not all groups have the same allocation needs so it is better to allow allocation setup per group. That means that for example, not all groups need mutexed allocators, not all groups may use the same small block allocator, etc.
  • Budget tracking and enforcement. It is essential for each group to have its own amount of RAM that it can keep track of, and that systems programmers can assign in agreement with the different groups. Basically this guarantees that they all have their fair share and that everybody keeps memory usage under control.
  • Easier to detect corruption issues. Since all allocations have a group assigned with a proper allocation setup, it isn’t that hard to solve corruption issues or issues during allocation. The grouping provides good initial context.
  • Better performance. Since mutexes are not required in all the groups or allocators that cost can be avoided. On groups that require mutexes there is also less contention since there isn’t a single mutexed allocator (such as the global allocator) to lock. It is also possible to make trade-offs between absolute performance, and peak memory usage when it comes to defining how memory is allocated.

Allocation Naming

All allocations should be “named” in order to have proper ways to recognize the different allocations. What the name implies is up to whoever requests the memory, and you may enforce naming conventions, but that tag should be available in order to track memory allocations. For the sake of performance, these tags should only be available on non-shipping builds.

Allocation Scoping

The solution must allow the stacking of scopes per thread to provide more context to allocations. This provides better context than simply using the callstack of the allocation, and it is far cheaper than grabbing the callstack. In Unreal an example might be a scope created during UObject creation so that all allocations happening for that UObject are nested under that scope. All non-scoped allocations would still belong to a global scope. Here is an example of a scope stack with the allocation as a leaf node and its relevant data (a minimal sketch of such a scope helper follows the example):

Main Thread												Pointer				Bytes	Group
	Global Scope
		UGameEngine::Init
			/Game/Maps/LandscapeMap.umap
				AddToWorld
					PersistentLevel
						ConstructObject
							FPhysXAllocator::allocate	0x000000000b093fe0	131720	Physics
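A scope like the ones above can be implemented cheaply as a small RAII helper that pushes a name onto a per-thread stack which the tracker reads when an allocation comes in. This is just a sketch of the idea, not the final API:

// Per-thread scope stack; thread_local keeps pushes and pops free of locks.
static thread_local const char* GScopeStack[64];
static thread_local int         GScopeDepth = 0;

class FMemScope
{
public:
	explicit FMemScope(const char* ScopeName)
	{
		if (GScopeDepth < 64)
		{
			GScopeStack[GScopeDepth] = ScopeName;
		}
		++GScopeDepth;
	}
	~FMemScope()
	{
		--GScopeDepth;
	}
};

void AddToWorld()
{
	FMemScope Scope("AddToWorld");
	// Any allocation made here is attributed under GlobalScope|...|AddToWorld
	// for this thread until Scope goes out of scope.
}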

Allocation flagging

Allocations may provide optional flags that, depending on the allocator used, will have a certain meaning. Some flag examples are:

  • Lifetime flags. Provides hints as to how long this allocation is expected to live. This allows the allocator to be more clever when allocating memory to reduce memory fragmentation.
  • Clear allocation flag. Lets the allocator know that you want the allocated memory cleared to zero before it is returned.

Performance

The solution must provide acceptable performance even in non-shipping builds with the tracking features enabled. By acceptable I mean that the frame time should never go above 50 ms per frame with tracking features enabled. When it goes above that threshold people start trying to avoid using the tracking features, and that’s the slippery slope you will have to recover from at the worst possible time: shipping time. Also, the performance hit and general overhead on shipping builds should be slim to none.

Allocations Grouping

To implement the best possible allocation scheme without exposing the complexities to the client code, it makes sense to define multiple allocators per group. Those allocators would be called in order, and whichever allocator succeeds at allocating the memory returns the allocation. So for example, under a normal group three allocators could be added:

  • Static Small Block Allocator (SSBA). This could be a static small block allocator that doesn’t grow and accepts allocations up to 256 bytes.
  • Dynamic Small Block Allocator (DSBA). This could be a dynamic small block allocator that grows as necessary and accepts allocations up to 1024 bytes.
  • Standard Allocator (SA). Standard OS allocator that allocates any size allocation.

So a request for 1032 bytes would try the SSBA first, then the DSBA, and finally it would request the memory from the SA. It is also perfectly fine to have just one allocator if the allocator you use provides the functionality you need. For example, you may use jemalloc, which already handles different sized allocations by itself with the proper locking mechanisms.
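The fallback behavior can be expressed very simply. Here is a sketch of the idea; the interface and names are made up for the example:

#include <cstddef>
#include <utility>
#include <vector>

// Minimal allocator interface for the example.
class IAllocator
{
public:
	virtual ~IAllocator() {}
	// Returns nullptr if this allocator cannot or will not serve the request.
	virtual void* Allocate(std::size_t Bytes) = 0;
};

// A group holds an ordered list of allocators; the first one that succeeds wins.
class FAllocationGroup
{
public:
	explicit FAllocationGroup(std::vector<IAllocator*> InAllocators)
		: Allocators(std::move(InAllocators))
	{
	}

	void* Allocate(std::size_t Bytes)
	{
		// E.g. { SSBA, DSBA, SA }: a 1032-byte request falls through the two
		// small block allocators and ends up in the standard allocator.
		for (IAllocator* Allocator : Allocators)
		{
			if (void* Ptr = Allocator->Allocate(Bytes))
			{
				return Ptr;
			}
		}
		return nullptr;
	}

private:
	std::vector<IAllocator*> Allocators;
};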

Conclusion

With all this data we ought to be able to create the proper facilities for memory allocation and tracking. Next time we will delve into the API that we would eventually implement. If you think I forgot any other item please don’t hesitate to comment.

Limitations of memory tracking features in Unreal Engine 4.

Since June of last year I have been working as a contractor on a AAA game based on Unreal Engine 4. One of the big issues on most AAA titles, and one that certainly bit me on this project, is memory consumption. It is a tough challenge to provide the best experience to the end user while staying within the constraints of the available memory, even more so when working on a sandbox game, which is my case. Unreal Engine 4 does provide some facilities for tracking memory, but I have come to see that they are not up to the task when it comes to big sandbox games. But first let’s dig into the two means provided by the engine: memory report commands and the MallocProfiler.

Note: for this post I will assume that you have read the blog post by Ben Zeigler called “Debugging and Optimizing Memory” on the official Unreal Engine blog. Also, this article is based on what’s available in Unreal Engine 4.6.

Memory report commands

The memory report commands are a set of console commands that let you see memory usage in general. They are all bundled together in a single, very convenient command: “Memreport -full”. Behind the scenes it executes the following commands:

Mem FromReport
obj list -alphasort
rhi.DumpMemory
LogOutStatLevels
ListSpawnedActors
DumpParticleMem
ConfigMem
r.DumpRenderTargetPoolMemory
ListTextures -alphasort
ListSounds -alphasort
ListParticleSystems -alphasort
obj list class=SoundWave -alphasort
obj list class=SkeletalMesh -alphasort
obj list class=StaticMesh -alphasort
obj list class=Level -alphasort

All these commands rely on three mechanisms for gathering data: the global allocator stats, the memory tracking stat defines, and each object’s UObject::GetResourceSize().

Global allocator stats

LogMemory: Platform Memory Stats for WindowsNoEditor
LogMemory: Process Physical Memory: 722.09 MB used, 945.73 MB peak
LogMemory: Process Virtual Memory: 732.14 MB used, 1379.78 MB peak
LogMemory: Physical Memory: 9624.43 MB used, 24551.27 MB total
LogMemory: Virtual Memory: 1026.31 MB used, 8388608.00 MB total
LogMemory:
LogMemory: Memory Stats:
LogMemory: FMemStack (gamethread) current size = 0.00 MB
LogMemory: FPageAllocator (all threads) allocation size [used/ unused] = [0.00 / 0.56] MB
LogMemory: Nametable memory usage = 1.54 MB

This is the most basic information. The depth of the data is very limited, but it is the first thing to look at when assessing memory issues. The first four lines are data that comes from GlobalMemoryStatusEx and GetProcessMemoryInfo, which are completely generic; it’s data you can already see in the Task Manager.
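
For reference, here is roughly where those first four lines come from, using the same two Win32 calls (a standalone sketch with error handling omitted; link against Psapi.lib):

#include <windows.h>
#include <psapi.h>
#include <cstdio>

void DumpBasicMemoryStats()
{
    // System-wide physical and virtual memory.
    MEMORYSTATUSEX MemStatus = {};
    MemStatus.dwLength = sizeof(MemStatus);
    GlobalMemoryStatusEx(&MemStatus);

    // Per-process working set numbers.
    PROCESS_MEMORY_COUNTERS Counters = {};
    GetProcessMemoryInfo(GetCurrentProcess(), &Counters, sizeof(Counters));

    std::printf("Process Physical Memory: %.2f MB used, %.2f MB peak\n",
                Counters.WorkingSetSize / (1024.0 * 1024.0),
                Counters.PeakWorkingSetSize / (1024.0 * 1024.0));
    std::printf("Physical Memory: %.2f MB used, %.2f MB total\n",
                (MemStatus.ullTotalPhys - MemStatus.ullAvailPhys) / (1024.0 * 1024.0),
                MemStatus.ullTotalPhys / (1024.0 * 1024.0));
}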

The last three lines are specific bits of memory which usually don’t account for much, but they are still useful to see. The FMemStack size is the memory used by the linear stack allocator. That allocator is a singleton and it is used for multiple things, such as storing data for rendering composition passes, shadow information, etc.

The second line refers to the allocation size of the page allocator. This allocator stores its data internally split into normal page size allocations and small page size allocations, and the statistic provided includes both. FMemStack uses the FPageAllocator, so the two are related.

The last entry shows the size of the table that stores all the FName entries. Since FNames are used throughout the engine, this can be a sizable piece of memory, especially in sandbox games with lots of different UObjects.

Memory tracking stat defines

The Unreal stat system provides the following macros to create stat entries for memory tracking:

#define DECLARE_MEMORY_STAT(CounterName,StatId,GroupId)
#define DECLARE_MEMORY_STAT_POOL(CounterName,StatId,GroupId,Pool)
#define DECLARE_MEMORY_STAT_EXTERN(CounterName,StatId,GroupId, API)
#define DECLARE_MEMORY_STAT_POOL_EXTERN(CounterName,StatId,GroupId,Pool, API)

#define INC_MEMORY_STAT_BY(StatId,Amount)
#define DEC_MEMORY_STAT_BY(StatId,Amount)
#define SET_MEMORY_STAT(StatId,Value)

#define INC_MEMORY_STAT_BY_FName(Stat, Amount)
#define DEC_MEMORY_STAT_BY_FName(Stat,Amount)
#define SET_MEMORY_STAT_FName(Stat,Value)
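
For illustration, here is how such a stat might be declared and kept up to date for a hypothetical pool. FMyPool, STAT_MyPoolMemory, and STATGROUP_MyGame are made-up names, and the exact group-declaration macro may vary between engine versions.

// Assumes the usual UE4 stats headers are included; names are hypothetical.
DECLARE_STATS_GROUP(TEXT("MyGame"), STATGROUP_MyGame, STATCAT_Advanced);
DECLARE_MEMORY_STAT(TEXT("My Pool Memory"), STAT_MyPoolMemory, STATGROUP_MyGame);

void* FMyPool::Allocate(SIZE_T NumBytes)
{
    void* Ptr = FMemory::Malloc(NumBytes);
    // The stat is only as accurate as the code that remembers to update it
    // every time the pool grows or shrinks.
    INC_MEMORY_STAT_BY(STAT_MyPoolMemory, NumBytes);
    return Ptr;
}

void FMyPool::Free(void* Ptr, SIZE_T NumBytes)
{
    DEC_MEMORY_STAT_BY(STAT_MyPoolMemory, NumBytes);
    FMemory::Free(Ptr);
}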

These macros allow you to define, set, increase, and decrease each memory tracking stat entry. The flaw of this approach is that each stat needs to be increased and decreased correctly, and in a place that is actually near where the memory is allocated or freed. For example, let’s look at one case in the engine:

void FPrimitiveSceneInfo::RemoveFromScene(bool bUpdateStaticDrawLists)
{
	check(IsInRenderingThread());

	// implicit linked list. The destruction will update this "head" pointer to the next item in the list.
	while(LightList)
	{
		FLightPrimitiveInteraction::Destroy(LightList);
	}

	// Remove the primitive from the octree.
	check(OctreeId.IsValidId());
	check(Scene->PrimitiveOctree.GetElementById(OctreeId).PrimitiveSceneInfo == this);
	Scene->PrimitiveOctree.RemoveElement(OctreeId);
	OctreeId = FOctreeElementId();

	IndirectLightingCacheAllocation = NULL;

	DEC_MEMORY_STAT_BY(STAT_PrimitiveInfoMemory, sizeof(*this) + StaticMeshes.GetAllocatedSize() + Proxy->GetMemoryFootprint());

	if (bUpdateStaticDrawLists)
	{
		RemoveStaticMeshes();
	}
}

In that piece of code we can see how STAT_PrimitiveInfoMemory is decreased, but not anywhere near where the actual memory was freed. The memory was freed inside the allocator defined for the StaticMeshes array and the scene proxy, and all of that was triggered by removing an element from the octree. If someone changes the memory usage of the octree, this stat will report the wrong memory consumption, which leads to wrong decisions when optimizing memory usage. The same happens if FPrimitiveSceneInfo changes, especially when new containers are added to the class.

Keeping allocation data up to date by means of stat defines is very error prone. The data gets out of date, it gets written incorrectly, and it doesn’t actually track memory, just estimates of what the programmer thinks consumes memory. The last mechanism, UObject::GetResourceSize(), has the same issues.

Memory tracking with UObject::GetResourceSize()

Objects:

                                         Class  Count  NumKBytes  MaxKBytes  ResKBytes ExclusiveResKBytes
                            AIPerceptionSystem      1          0K          0K          0K          0K
                                AISense_Damage      1          0K          0K          0K          0K
                               AISense_Hearing      1          0K          0K          0K          0K
                            AISense_Prediction      1          0K          0K          0K          0K
                                 AISense_Sight      1          0K          0K          0K          0K
                                  AISense_Team      1          0K          0K          0K          0K
                                 AISense_Touch      1          0K          0K          0K          0K
                                      AISystem      1          0K          0K          0K          0K
                                  AmbientSound     21         15K         15K          0K          0K
               AnimNotify_PlayParticleEffect_C     11          1K          1K          0K          0K
                                  AnimSequence     78      53453K      53453K      53333K      53333K
                        AnimSingleNodeInstance     85        241K        242K          0K          0K
                                 ArrayProperty    729         85K         85K          0K          0K
                           AssetObjectProperty      2          0K          0K          0K          0K

In the ideal situation this function reports the memory usage of a specific UObject. The function is defined this way:

/**
 * Returns the size of the object/ resource for display to artists/ LDs in the Editor. The
 * default behavior is to return 0 which indicates that the resource shouldn't display its
 * size which is used to not confuse people by displaying small sizes for e.g. objects like
 * materials
 *
 * @param	Type	Indicates which resource size should be returned
 * @return	Size of resource as to be displayed to artists/ LDs in the Editor.
 */
virtual SIZE_T GetResourceSize(EResourceSizeMode::Type Mode)
{
	return 0;
}

It should just take a call to the specific UObject::GetResourceSize() to get that data, which is extremely useful to have. It tries to answer questions such as “which UObjects in my scene do I need to optimize?” or “which UObject-based classes do I need to optimize?”. But again, this is only as good as the implementation written for it. This is another mechanism that gets outdated, that is implemented incorrectly (for example, by assuming the data is stored within the object rather than behind pointers), or that is simply left empty. For example, one may ask “which skeletal mesh in my scene should I optimize?” Let’s look at the implementation in the engine as of Unreal Engine 4.6:

SIZE_T USkeletalMesh::GetResourceSize(EResourceSizeMode::Type Mode)
{
	return 0;
}

So that’s not good. This mechanism is useless or outdated. So to fix this I had to write something like this:

SIZE_T USkeletalMesh::GetResourceSize(EResourceSizeMode::Type Mode)
{
	SIZE_T ResSize = sizeof(USkeletalMesh);

	ResSize += Materials.GetAllocatedSize();
	ResSize += SkelMirrorTable.GetAllocatedSize();

	for (int32 i = 0, n = LODInfo.Num(); i < n; ++i)
	{
		ResSize += LODInfo[i].GetResourceSize(Mode);
	}
	ResSize += LODInfo.GetAllocatedSize();

#if WITH_EDITORONLY_DATA
	for (int32 i = 0, n = OptimizationSettings.Num(); i < n; ++i)
	{
		ResSize += OptimizationSettings[i].GetResourceSize(Mode);
	}
	ResSize += OptimizationSettings.GetAllocatedSize();
#endif

	for (int32 i = 0, n = MorphTargets.Num(); i < n; ++i)
	{
		ResSize += MorphTargets[i]->GetResourceSize(Mode);
	}
	ResSize += MorphTargets.GetAllocatedSize();

	ResSize += RefBasesInvMatrix.GetAllocatedSize();

	ResSize += ClothingAssets.GetAllocatedSize();

#if WITH_APEX_CLOTHING
	for (int32 i = 0, n = ClothingAssets.Num(); i < n; ++i)
	{
		ResSize += ClothingAssets[i].GetResourceSize(Mode);
	}
#endif

	ResSize += CachedStreamingTextureFactors.GetAllocatedSize();

	ResSize += Sockets.GetAllocatedSize();

	return ResSize;
}

And then again, you have to check that the called GetResourceSize() functions actually return a proper value. For example, ClothingAssets is an array of FClothingAssetData, but that struct didn’t have a GetResourceSize() implementation (probably because it isn’t a UObject itself, even though we still need the memory usage of that resource). This implementation will also get outdated, and perhaps I missed some data when I wrote it. You just can’t rely on this to get proper memory usage data.

MallocProfiler

The malloc profiler is a completely different approach. It is a feature in the engine that keeps track of all memory allocated through the new and delete operators, as well as anything that goes through the global memory allocator. It collects basic stats such as memory operations, modules loaded, etc., but also the actual memory operations (allocations, frees, reallocations) together with the callstack for each one. Grabbing the callstack isn’t optional, since the callstack is the only piece of data that differentiates one allocation from another. But the fact that it has to capture the callstack makes it incredibly slow to use in-game, and it generates a huge output file. For example, I have seen capture files of ~15GB, and each frame took more than one second (yes, second, not millisecond) while the game was running.

The data can be visualized with a custom tool written in C# called MemoryProfiler 2. This is an extremely slow and memory-inefficient tool: a 15GB capture takes close to 30 minutes to open and more than 80GB of system memory. Again, the issue isn’t only the tool but also the sheer amount of data generated, made worse by the fact that loading the data is single threaded, with the code structured in a way that makes it hard to multithread. But even if the tool were faster and the data were compressed properly, it still wouldn’t help that much because the information is limited. Let’s look at the data:
MemoryProfiler2
Here we see the most relevant tab in terms of memory usage by resources, because it visualizes the different allocations as a call tree. In the screenshot the allocations related to USkeletalMesh serialization are visible. The issue is that if you have many USkeletalMeshes, you can’t find out which ones should be optimized; you just know that serializing all the USkeletalMeshes takes a certain amount of memory. That doesn’t provide really useful data for tackling memory optimizations, except when you are doing general optimizations (such as rearranging the data inside USkeletalMesh to reduce compiler padding).

Next steps

After looking at the limitations, it is obvious that there is a place in the engine for a better way to do and track allocations. We need something that covers the whole spectrum of allocations and allows us to keep track of them, but without making it impossible for the game to run fast enough when it’s enabled. Next time I will start digging into a solution and be more explicit about the requirements.

Optimizing Unreal Engine 4 asset cooking. CRC.

As I was working on the GPU particle system in Unreal Engine 4, I had to build my own scene. The process of baking the raw assets so they can be properly consumed by the engine is called “cooking” in Unreal Engine 4 lingo. I found myself cooking assets quite often, and one thing I noticed was that the cooking process took a while just to determine which assets needed to be cooked. I noticed this because I would try to cook the assets without actually having saved my changes. While this may not seem like a big deal, it is especially annoying for me since I had a pretty bad experience with the build system at Dreamworks R&D: that build system would take more than 3 minutes to determine that nothing had to be built, and I’m talking about code, not assets. So I decided to look at this. As usual I will do all the profiling on my machine:

MachineSpecs

What I decided to do was cook my simple particles scene (you can see screenshots of it here) once, and then cook it again with the profiler running. That way I made sure that the cooking process only runs through the step of verifying what needs to be done, and it should determine that there aren’t any assets to build. After doing that, here are the results:

MemCrc_Slow

Since this is an asset baking process it is fairly normal to spend a lot of time dealing with memory, so I wasn’t too surprised about the CPU time spent on memset and memcpy. That’s something I will try to optimize later, and I know it will take some work. The memset in particular is caused by editor builds, which fill memory with a debug value (0xCD) on every new allocation and before freeing an allocation. But the call to FCrc::MemCrc_DEPRECATED() seemed like low-hanging fruit, since it is a pretty self-contained process. Let’s look at the code, which is also available here:

/**
 * DEPRECATED
 * These tables and functions are deprecated because they're using tables and implementations
 * which give values different from what a user of a typical CRC32 algorithm might expect.
 */

uint32 FCrc::MemCrc_DEPRECATED(const void* InData, int32 Length, uint32 CRC/*=0 */)
{
    // Based on the Slicing-by-8 implementation found here:
    // http://slicing-by-8.sourceforge.net/

    CRC = ~BYTESWAP_ORDER32(CRC);

    const uint8* __restrict Data = (uint8*)InData;

    // First we need to align to 32-bits
    int32 InitBytes = Align(Data, 4) - Data;

    if (Length > InitBytes)
    {
        Length -= InitBytes;

        for (; InitBytes; --InitBytes)
        {
            CRC = (CRC >> 8) ^ CRCTablesSB8_DEPRECATED[0][(CRC & 0xFF) ^ *Data++];
        }

        auto Data4 = (const uint32*)Data;
        for (uint32 Repeat = Length / 8; Repeat; --Repeat)
        {
            uint32 V1 = *Data4++ ^ CRC;
            uint32 V2 = *Data4++;
            CRC =
                CRCTablesSB8_DEPRECATED[7][ V1 & 0xFF] ^
                CRCTablesSB8_DEPRECATED[6][(V1 >> 8) & 0xFF] ^
                CRCTablesSB8_DEPRECATED[5][(V1 >> 16) & 0xFF] ^
                CRCTablesSB8_DEPRECATED[4][ V1 >> 24 ] ^
                CRCTablesSB8_DEPRECATED[3][ V2 & 0xFF] ^
                CRCTablesSB8_DEPRECATED[2][(V2 >> 8) & 0xFF] ^
                CRCTablesSB8_DEPRECATED[1][(V2 >> 16) & 0xFF] ^
                CRCTablesSB8_DEPRECATED[0][ V2 >> 24 ];
        }
        Data = (const uint8*)Data4;

        Length %= 8;
    }

    for (; Length; --Length)
    {
        CRC = (CRC >> 8) ^ CRCTablesSB8_DEPRECATED[0][(CRC & 0xFF) ^ *Data++];
    }

    return BYTESWAP_ORDER32(~CRC);
}

As seen in the code, this is based on the slicing-by-8 version of CRC32, which is much faster than the standard implementation. The reason is that it allows for greater instruction-level parallelism: in the standard implementation each table lookup must complete before the next index can be computed, while here up to 8 table lookups can potentially be done in parallel. For precise details, a paper was written about the algorithm here. So that’s fairly optimal already.
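
To make that serial dependency concrete, here is what the generic one-byte-at-a-time table-driven loop looks like (a sketch of the textbook algorithm, not the engine’s exact deprecated variant): each iteration needs the CRC produced by the previous byte before it can index the table, so the lookups cannot overlap.

#include <cstddef>
#include <cstdint>

static uint32_t CRCTable[256];

// Builds the standard reflected CRC-32 table (polynomial 0xEDB88320).
// Call once before hashing.
static void InitCrcTable()
{
    for (uint32_t i = 0; i < 256; ++i)
    {
        uint32_t C = i;
        for (int K = 0; K < 8; ++K)
            C = (C & 1) ? (0xEDB88320u ^ (C >> 1)) : (C >> 1);
        CRCTable[i] = C;
    }
}

uint32_t Crc32ByteAtATime(const void* InData, size_t Length, uint32_t CRC = 0)
{
    const uint8_t* Data = static_cast<const uint8_t*>(InData);
    CRC = ~CRC;
    for (; Length; --Length)
    {
        // Serial dependency: this lookup needs the CRC from the previous byte.
        CRC = (CRC >> 8) ^ CRCTable[(CRC ^ *Data++) & 0xFF];
    }
    return ~CRC;
}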

But what are the alternatives? The easiest one would be to use SSE 4.2, which includes a CRC32 instruction. While there is a good speedup, as shown here, that couldn’t be relied on for PC because SSE 4.2 only became available with the Nehalem architecture, and I don’t think AMD supported it before Bulldozer. So the more feasible alternative is to get rid of CRC32 and move to some other checksum algorithm.

There are many alternatives for generating hashes. In this case we need checksums, so there is no need for a cryptographic hash function, even though Unreal Engine 4 already provides SHA-1. Some of the alternatives are FNV, CityHash, SpookyHash, MurmurHash, etc. They are all valid, and Charles Bloom, who works on Oodle, made a blog post covering the performance and properties of some of those algorithms, so I won’t attempt to cover that. But I dug deeper and came across xxHash, written by Yann Collet, who is also the author of the LZ4 compression algorithm. The performance shown by xxHash, the fact that it is used by LZ4, and its permissive license (New BSD License) made up my mind about attempting to integrate it.
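
To give an idea of how little code the change itself needs, here is a sketch of a possible replacement entry point. It assumes xxHash’s public XXH32() function and header; MemHash_Fast is a made-up name, and the real work in the engine is updating the call sites and accepting that every cached hash changes.

#include "xxhash.h" // xxHash's public header

// Hypothetical replacement for the CRC-based memory checksum.
uint32 MemHash_Fast(const void* InData, int32 Length, uint32 Seed = 0)
{
    // XXH32 takes the buffer, its length in bytes, and a seed,
    // and returns a 32-bit hash.
    return XXH32(InData, static_cast<size_t>(Length), Seed);
}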

But not everything is that simple. While xxHash is extremely easy to integrate (like the stb libraries written by Sean Barrett), it has a considerable impact on the engine: since all the hashes change, all assets have to be rebuilt. This may not be a huge issue for me because I just have a few simple scenes, but it has a big impact on teams working on AAA games. From my point of view it is worthwhile to bite the bullet if the cooking time is reduced after that. So, what happens when we replace CRC32 with xxHash in terms of performance?

MemCrc_Fast

That’s a 3X improvement over the old code. In terms of wall-clock time there is almost a 20% improvement over the old version. I think that justifies the cooking of the assets.

This particular change may not make it into the master branch of Unreal Engine 4 because it pulls in a new dependency (xxHash) and I have no idea if that’s acceptable. Still, I think it is a meaningful change that some people may want to integrate.

The lesson this time is to think outside the box. If you have seen my past posts you may have expected me to optimize throughput by helping the compiler or looking at the number of uops. Instead, since this is a pretty self-contained problem, I knew there had to be faster alternatives to CRC32 that fit the needs of the engine.