MallocTrackerView

Last time I talked about the implementation of MallocTracker and its integration with Unreal Engine 4. While I made a good effort to explain the internals of MallocTracker, one thing that was clearly lacking was how to interpret the output it generates. In this post I will talk about MallocTrackerView and look at that output.

Text output.

When allocations are not tagged and scopes are not defined, the output generated by MallocTracker is not that meaningful, but some context is still provided. Let’s look at three allocations as text:

0x0000000063b40000,Main Thread,Unknown,127471600,GlobalScope,UnnamedAllocation
0x000000008bda0000,Main Thread,Unknown,123401431,GlobalScope,UnnamedAllocation
0x00000000ba8d83d0,Main Thread,Unknown,48,GlobalScope,UnnamedTArray

As you can see there is one line per allocation, and each line contains a set of comma-separated fields. The format for each line is:

Address, Thread name, Group name, Bytes, Scope Stack, Name

So even with the allocations completely untagged, there is still the thread context, which might be useful. But let’s see what we get when we use the UObject MallocTracker scope support, add a scope in FArchiveAsync::Precache(), and tag the single allocation done in that function with the filename of the loaded file:

0x0000000063b40000,Main Thread,UObject,127471600,GlobalScope|HierarchicalInstancedStaticMeshComponent|HierarchicalInstancedStaticMeshComponent,../../../../dev/UnrealEngine/PZ4.8/../../../Unreal/KiteDemo/Content/Maps/GoldenPath/Height/Height_x2_y2.umap
0x000000008bda0000,Main Thread,Systems,123401431,GlobalScope|FArchiveAsyncPrecache,../../../../dev/UnrealEngine/PZ4.8/../../../Unreal/KiteDemo/Content/KiteDemo/Environments/Landscape/Materials/M_Grass_Landscape_01_INST_HOLE.uasset
0x00000000ba8d83d0,Main Thread,Unknown,48,GlobalScope|Material|M_CustomDepthBillboard,UnnamedTArray

With proper tagging the data is far more meaningful. We now know that a 121.56 MiB allocation was done for the HierarchicalInstancedStaticMeshComponent UObject that was loading the Height_x2_y2.umap file from disk. The same quality of output is available for the other two allocations. While the output is very useful, it is pretty much impossible to deal with the huge number of allocations in an actual game as plain text. The Kite demo, after loading everything, has over 600k live allocations, ranging from 16 bytes to 121.56 MiB, spread across 18 different threads.

One sensible thing to do is to import that data into a spreadsheet such as Excel, which can already deal with the CSV output generated by MallocTracker. That is very useful when looking for specific allocations, such as the biggest ones. But in terms of navigation it is still hard to deal with, since there are many duplicates that could be collapsed (meaning allocations with the same name, size, group and thread but different addresses). So it is necessary to make something custom. Enter MallocTrackerView.

MallocTrackerView.

MallocTrackerView is a tool I’m writing in C# to visualize the data generated by MallocTracker better than I could with Excel or a text editor. This is my first piece of code written from scratch in C#; in the past I only had to make relatively minor changes to C# tools, such as an optimization for Unreal Engine’s MemoryProfiler2. Initially my idea was to write it in Python, since I had already used it while doing some research for Electronic Arts and I thought it was a good fit. The reason I decided against it is that Unreal Engine 4 doesn’t use Python at all; it seems its only presence is in third-party libraries. Using C++ was also a possibility, but integrating a proper UI library would be too much work. So, considering that there are already a bunch of tools written in C# within the engine’s depot, I decided to do it in C#. The first thing I had to consider once I had chosen C# was that I would need a TreeView control with support for columns, since the idea was to have three columns:

  • Name. This would be the thread name, scope name or allocation name depending on where you are in the hierarchy.
  • Size. Accumulated size of the allocations under that node.
  • Count. Accumulated number of allocations done under that node. Since there can be multiple allocations with the same name under the same thread and scope stack, it is possible that a leaf node has a count higher than one.

Since the default C# TreeView control didn’t properly support that, and I really wasn’t into writing my own solution (after all, one of the reasons to use C# was to focus on the actual tool rather than infrastructure), I did a quick search to see how people solved the need for a similar control. As usual, Stack Overflow pointed me to the solution, which was to use ObjectListView. After a bit more research it seemed like people were happy with it, so I decided to use it.

MallocTrackerView right now is very much alpha, but in its current state it is fairly useful because it covers the most relevant use case that isn’t covered by a text editor or Excel, which is to look at the data in terms of scopes. Let’s go over it by opening a MallocTracker dump taken in the Kite demo:
MallocTrackerViewThread
That’s how the UI looks after opening a MallocTracker dump. The root nodes in the tree view are the different threads. As you go deeper into the tree, the scopes are shown, as well as the allocations. Here is an example:
MallocTrackerViewMaterial
Here we are looking at the allocations done in the ScotsPineMedium_Billboard_Mat material UObject from the main thread. This is where we can see the usefulness of having such a system available: we now know the specific allocations related to that UObject, correctly accounting for the actual memory allocated rather than relying on hand-written GetResourceSize() implementations and serialization numbers. By comparison, this is the info you get from the obj list command for ScotsPineMedium_Billboard_Mat:

Object NumKBytes MaxKBytes ResKBytes ExclusiveResKBytes
Material /Game/KiteDemo/Environments/Trees/ScotsPineMedium/ScotsPineMedium_Billboard_Mat.ScotsPineMedium_Billboard_Mat 3K 3K 62505K 0K

Even better is to filter the allocations by scope name, where we can see related allocations together. Here is the output when looking at the allocations that contain the scope name “ScotsPineMedium”:
MallocTrackerViewScotsPineMedium
And you can dig even deeper by combining filters; here I filter by allocation group and by allocations with a certain name:
MallocTrackerViewUObjectAlloc
I think this clearly shows what can be done with MallocTracker enabled, even if only three allocations were properly tagged.

Bad news.

I made a pull request to integrate MallocTracker into Unreal Engine 4 and it was rejected by Epic. I think it is worth going over their reasons for not integrating it, because someone else might have the same doubts. Let’s go over their concerns as shown in the response to the pull request:

  • It’s too much support burden to maintain those tags. I think this concern comes from a lack of understanding of how the system works. As a programmer you don’t have to tag the allocations. It is better to tag them, but if you don’t, you still have the scopes to provide context to the allocations, as you have seen in the previous examples. If you decide to tag an allocation (and you can use MallocTrackerView to determine which allocations are worth tagging), then you don’t have to worry about that allocation ever again. The tagging is a change in a single line of code (be it an FMemory call, a container constructor, etc.) that you don’t need to maintain. What is a burden is maintaining all the UObject GetResourceSize() functions, so I beg to differ. It is also worth noting that this kind of tagging isn’t of unproven scalability; I have seen it used throughout the code of full AAA games without this concern ever being relevant.
  • It most likely won’t be used by anyone at a higher level. This makes the assumption that non-engine people currently have a good grasp of the memory usage of their game code and assets. While it is true that the MemReport command does offer some insight, it is still not enough, and it is certainly harder to visualize than using MallocTracker with MallocTrackerView.
  • At a lower level it usually doesn’t provide enough information. The only tool provided by the engine that does a better job is MallocProfiler with the MemoryProfiler2 tool. But the issues with it are that it is basically unusable in any memory-constrained environment (particularly consoles), the performance is completely unacceptable to the point that the game becomes unplayable, and many of the allocations have the exact same callstack even though they refer to completely different UObjects. MallocTracker instead runs with pretty much the same performance as having it disabled, has a considerably lower memory overhead (at least an order of magnitude smaller; for example it needs 35.3 MiB to store 614,145 allocations), and it does provide proper information even for allocations done to create UObjects from Blueprints.

In the end it is fine to have this pull request rejected; after all, Epic can’t accept whatever is sent their way, but I happen to think that having this in would be rather useful for everybody. To Epic’s credit, given the open nature of the engine, anybody can ask me for the changes and get to use this even if Epic themselves won’t integrate it. That is much better than what you can probably do with Unity or some other closed engine.

Overview video.

Adding memory tracking features to Unreal Engine 4.

It has been a while since I last talked about memory allocation and tracking. I have had the time to implement the tracking in Unreal Engine 4, and I think it is time I go over it. I’m going to assume that you have read the previous blog posts I made about this: “Limitations of memory tracking features in Unreal Engine 4” and “Memory allocation and tracking“.

Unreal Engine 4 memory management API.

Basic allocation methods.

There are three basic ways to allocate or free memory within Unreal Engine 4 (a minimal illustration follows the list):

  • Use of the GMalloc pointer. This is a way to get access to the global allocator. Which allocator is used depends on GCreateMalloc().
  • FMemory functions. These are static functions such as Malloc(), Realloc(), and Free(). They also use GMalloc for the allocations, but they check whether GMalloc is defined before every allocation, reallocation or free. If GMalloc is nullptr then GCreateMalloc() is called.
  • Global new and delete operators. By default they are only defined in the modules that include ModuleBoilerplate.h, which means that many calls to new and delete were not being handled by the Unreal Engine 4 memory system. The overloaded operators actually call the FMemory functions.
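
Here is a minimal illustration of those three entry points; the calls are the engine’s, the example itself is mine:

void* FromGMalloc = GMalloc->Malloc(256);	// Direct use of the global allocator (assumes GMalloc was already created).
void* FromFMemory = FMemory::Malloc(256);	// Creates GMalloc through GCreateMalloc() if needed.
uint8* FromNew    = new uint8[256];		// Routed through FMemory only in modules with the overloaded operators.

FMemory::Free(FromGMalloc);
FMemory::Free(FromFMemory);
delete [] FromNew;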

There are cases where memory isn’t allocated and freed through those mechanisms, which shouldn’t happen if possible. To catch those cases I submitted a pull request, which is already integrated and set for release in version 4.9, that catches those allocations via the C runtime library call _CrtSetAllocHook(). One such example is that the engine’s integration of zlib didn’t do allocations through the engine’s facilities; using _CrtSetAllocHook() I found that out and made a pull request with the fix, which is also set for release in 4.9.
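
Here is a minimal sketch of how such a hook is installed on Windows. The hook signature is the documented _CrtSetAllocHook() one, and it is only active with the debug CRT; the hook name and body are mine:

#include <crtdbg.h>

int EngineBypassAllocHook(int AllocType, void* UserData, size_t Size, int BlockType, long RequestNumber, const unsigned char* FileName, int LineNumber)
{
	if (AllocType == _HOOK_ALLOC)
	{
		// Break or log here to find allocations that bypass the engine's allocators.
	}
	return 1; // Non-zero allows the operation to proceed.
}

// Installed early during startup:
_CrtSetAllocHook(EngineBypassAllocHook);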

The basic API, for both the direct calls to GMalloc and the FMemory functions, is this:

virtual void* Malloc( SIZE_T Count, uint32 Alignment = DEFAULT_ALIGNMENT ) = 0;
virtual void* Realloc( void* Original, SIZE_T Count, uint32 Alignment = DEFAULT_ALIGNMENT ) = 0;
virtual void Free( void* Original ) = 0;

These are the places that would need to be modified if we want to add any new piece of data per allocation.

Integrating with the engine.

Similar to the approach I took for the stomp allocator, I made a new class called FMallocTracker that derives from FMalloc, which allows me to hook it into the Unreal memory allocation system. Since a valid allocator must be passed when creating the FMallocTracker instance, all the actual allocations will be done with that allocator; the FMallocTracker is only there to keep tracking information, nothing else. But that is not enough, we actually need to get the tracking data we want all the way to the allocator. So the first step is to modify the allocator’s functions when the memory tracking feature is enabled:

#if USE_MALLOC_TRACKER
virtual void* Malloc( SIZE_T Count, uint32 Alignment, const uint32 GroupId, const char * const Name ) = 0;
virtual void* Realloc( void* Original, SIZE_T Count, uint32 Alignment, const uint32 GroupId, const char * const Name ) = 0;
#else
virtual void* Malloc( SIZE_T Count, uint32 Alignment = DEFAULT_ALIGNMENT ) = 0;
virtual void* Realloc( void* Original, SIZE_T Count, uint32 Alignment = DEFAULT_ALIGNMENT ) = 0;
#endif // USE_MALLOC_TRACKER

The new parameters are:

  • Name. The name of the allocation. This name can be whatever you want, but the recommendation is that it be a literal that is easy to search for. I will show the means to provide more context later in this post.
  • Group. This is the id of the group that is responsible for this allocation. These are the groups that I have defined, but it is something that you should definitely tune for your needs (a sketch follows the list).
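
As a sketch, a group id enumeration could look like the following; GROUP_UNKNOWN and GROUP_RENDERING show up later in this post, the rest of the names are illustrative:

enum EAllocationGroup
{
	GROUP_UNKNOWN = 0,	// Untagged allocations end up here.
	GROUP_UOBJECT,		// UObject-related allocations.
	GROUP_RENDERING,	// Rendering data.
	GROUP_SYSTEMS,		// Core engine systems.
	GROUP_COUNT
};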

This change implies modifying all the allocators to make it transparent within the engine, but once that work is done you can tag allocations without worrying about the underlying implementation. The benefit of tagging allocations isn’t just for tracking purposes; it is also relevant as code documentation. Having worked on large codebases as a consultant, this kind of tagging is a great benefit when ramping up, when dealing with interactions among different groups, and when fixing memory-related crashes.

The next step is to integrate with the new and delete operators. As I already mentioned, in the engine they were defined in ModuleBoilerplate.h; to provide better coverage I first moved them to MemoryBase.h. The next step was to define our new operator overloads to pass in the name and group:

OPERATOR_NEW_MSVC_PRAGMA FORCEINLINE void* operator new  (size_t Size, const uint32 Alignment, const uint32 GroupId, const char * const Name)	OPERATOR_NEW_NOTHROW_SPEC{ return FMemory::Malloc(Size, Alignment, GroupId, Name); }
OPERATOR_NEW_MSVC_PRAGMA FORCEINLINE void* operator new[](size_t Size, const uint32 Alignment, const uint32 GroupId, const char * const Name)	OPERATOR_NEW_NOTHROW_SPEC{ return FMemory::Malloc(Size, Alignment, GroupId, Name); }
OPERATOR_NEW_MSVC_PRAGMA FORCEINLINE void* operator new  (size_t Size, const uint32 Alignment, const uint32 GroupId, const char * const Name, const std::nothrow_t&)	OPERATOR_NEW_NOTHROW_SPEC	{ return FMemory::Malloc(Size, Alignment, GroupId, Name); }
OPERATOR_NEW_MSVC_PRAGMA FORCEINLINE void* operator new[](size_t Size, const uint32 Alignment, const uint32 GroupId, const char * const Name, const std::nothrow_t&)	OPERATOR_NEW_NOTHROW_SPEC	{ return FMemory::Malloc(Size, Alignment, GroupId, Name); }

To avoid having to fill up the code with checks for USE_MALLOC_TRACKER, it is nice to provide some defines that create those allocations as if USE_MALLOC_TRACKER were set, but without incurring unnecessary cost when it isn’t. The intention is that this should have very little to no performance cost. So here is the basic definition:

#if USE_MALLOC_TRACKER
	#define PZ_NEW(GroupId, Name) new(DEFAULT_ALIGNMENT, (GroupId), (Name))
	#define PZ_NEW_ALIGNED(Alignment, GroupId, Name) new((Alignment), (GroupId), (Name))
	#define PZ_NEW_ARRAY(GroupId, Name, Type, Num)  reinterpret_cast<Type*>(FMemory::Malloc((Num) * sizeof(Type), DEFAULT_ALIGNMENT, (GroupId), (Name)))
	#define PZ_NEW_ARRAY_ALIGNED(Alignment, GroupId, Name, Type, Num)  reinterpret_cast<Type*>(FMemory::Malloc((Num) * sizeof(Type), (Alignment), (GroupId), (Name)))
#else
	#define PZ_NEW(GroupId, Name) new(DEFAULT_ALIGNMENT)
	#define PZ_NEW_ALIGNED(Alignment, GroupId, Name) new((Alignment))
	#define PZ_NEW_ARRAY(GroupId, Name, Type, Num)  reinterpret_cast<Type*>(FMemory::Malloc((Num) * sizeof(Type), DEFAULT_ALIGNMENT))
	#define PZ_NEW_ARRAY_ALIGNED(Alignment, GroupId, Name, Type, Num)  reinterpret_cast<Type*>(FMemory::Malloc((Num) * sizeof(Type), (Alignment)))
#endif // USE_MALLOC_TRACKER

Here are a couple of examples of how tagged allocations look compared to non-tagged allocations in terms of code (a text version follows the samples):
TaggedSample0
TaggedSample1
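
Since those samples are images, here is the same kind of change in text form; the callsite and the names are made up for illustration:

// Before: untagged, shows up as UnnamedAllocation under GROUP_UNKNOWN.
FStreamingRequest* Request = new FStreamingRequest();

// After: tagged with a group and a searchable literal name.
FStreamingRequest* Request = PZ_NEW(GROUP_SYSTEMS, "FStreamingRequest") FStreamingRequest();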

Tracking allocations done by containers.

One of the issues that comes up when naming allocations in a way that is simple to recognize is dealing with containers. There is hardly ever a single instance of anything within the engine or a game, be it the positions of particles or the number of players, so a lot of containers are used within the engine. When making an allocation within a container it wouldn’t be too useful to have a generic name. Let’s look at this example from FMeshParticleVertexFactory::DataType:

/** The streams to read the texture coordinates from. */
TArray<FVertexStreamComponent,TFixedAllocator<MAX_TEXCOORDS> > TextureCoordinates;

A generic name for allocations done within the allocator assigned to that container would be something like “TFixedAllocator::ResizeAllocation”. It doesn’t say much. Instead, a better name for all allocations related to that container would be something like “FMeshParticleVertexFactory::DataType::TextureCoordinates”. In order to do this we need to be able to assign names and groups to the containers in such a way that whenever an allocation is done within that container, the name and group of the container are fetched to tag that allocation. For that we need to make changes to the containers and to the allocators that can be used with those containers. That involves adding a pointer and a 32-bit unsigned integer per container when building with USE_MALLOC_TRACKER enabled, and changing the necessary constructors to accept that optional information. One of the constructors for a TArray would look like this:

TArray(const uint32 GroupId = GROUP_UNKNOWN, const char * const Name = "UnnamedTArray")
	: ArrayNum(0)
	, ArrayMax(0)
#if USE_MALLOC_TRACKER
	, ArrayName(Name)
	, ArrayGroupId(GroupId)
#endif // USE_MALLOC_TRACKER
{}
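
A hypothetical usage of that constructor would then look like this (the member name is made up):

// Named, grouped container; the extra arguments are ignored when
// USE_MALLOC_TRACKER is disabled.
TArray<FVector> CachedPositions(GROUP_RENDERING, "FMySystem::CachedPositions");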

With those changes in place we have the necessary information to send to the allocators to tag those allocations. The next step is to look at those allocators and make the necessary changes to pass that information to the underlying allocator being used. The container allocators in general use the FMemory way of allocating memory, and FContainerAllocatorInterface defines the ResizeAllocation() function that actually does the allocation. Similarly to the previous changes, we need to add the name and group for the allocation:

#if USE_MALLOC_TRACKER
	void ResizeAllocation(int32 PreviousNumElements, int32 NumElements, SIZE_T NumBytesPerElement, const uint32 GroupId, const char * const Name);
#else
	void ResizeAllocation(int32 PreviousNumElements, int32 NumElements, SIZE_T NumBytesPerElement);
#endif // USE_MALLOC_TRACKER

Again, since we don’t want to fill up the engine’s code with ifdefs, we rely on a define to simplify that:

#if USE_MALLOC_TRACKER
#define PZ_CONTAINER_RESIZE_ALLOCATION(ContainerPtr, PreviousNumElements, NumElements, NumBytesPerElement, GroupId, Name) (ContainerPtr)->ResizeAllocation((PreviousNumElements), (NumElements), (NumBytesPerElement), (GroupId), (Name))
#else
#define PZ_CONTAINER_RESIZE_ALLOCATION(ContainerPtr, PreviousNumElements, NumElements, NumBytesPerElement, GroupId, Name) (ContainerPtr)->ResizeAllocation((PreviousNumElements), (NumElements), (NumBytesPerElement))
#endif // USE_MALLOC_TRACKER

With that in place we can pass the ArrayName and ArrayGroupId to the container allocator.

One thing that is also necessary is being able to change the name or group of a container after construction, because that is what’s needed to name allocations done by containers of containers. One such example is shown below, where we need to set the name or group after a FindOrAdd() in any of these TMap containers (a hypothetical sketch follows the declarations):

/** Map of object to their outers, used to avoid an object iterator to find such things. **/
TMap<UObjectBase*, TSet<UObjectBase*> > ObjectOuterMap;
TMap<UClass*, TSet<UObjectBase*> > ClassToObjectListMap;
TMap<UClass*, TSet<UClass*> > ClassToChildListMap;
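
A sketch of what that could look like; SetAllocationInfo() is an assumed name for the post-construction setter, not actual engine API, and GROUP_UOBJECT is the illustrative group from earlier:

// 'Outer' here is whatever key we are adding to the map.
TSet<UObjectBase*>& ObjectsInOuter = ObjectOuterMap.FindOrAdd(Outer);
#if USE_MALLOC_TRACKER
ObjectsInOuter.SetAllocationInfo(GROUP_UOBJECT, "ObjectOuterMap");
#endif // USE_MALLOC_TRACKER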

Once that’s done, all the allocations done by those containers will be tagged properly as they change. So now we just need to set the name for the container. Going back to the FMeshParticleVertexFactory::DataType::TextureCoordinates example, we can now set the name and group for its allocations:

DataType()
	: TextureCoordinates(GROUP_RENDERING, "FMeshParticleVertexFactory::DataType::TextureCoordinates")
	, bInitialized(false)
{
}

Defining scopes.

As part of the “Memory allocation and tracking” post I mentioned the need to define scopes for the allocations in order to provide context. Scopes are not the same as callstacks (which are something MallocProfiler already provides): many allocations happen within the same callstack but refer to completely different UObjects. That is even more prevalent with the use of Blueprints. Because of that it is very useful to have scopes that allow tracking memory usage even within Blueprints.
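
As a sketch of the idea, a hand-placed scope using the FMallocTrackerScope class mentioned below would look something like this (the constructor signature is my assumption):

void FArchiveAsync::Precache(/* ... */)
{
	FMallocTrackerScope MallocScope("FArchiveAsyncPrecache");

	// Every allocation made while MallocScope is alive gets
	// "FArchiveAsyncPrecache" pushed onto its scope stack.
}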

To leverage the code already present in the engine I took the approach of reusing the FScopeCycleCounterUObject struct, which is used to define scopes related to UObjects for the stats system. The engine already has those scopes placed where necessary, and you can still place your own allocation-tracking-specific scopes by using the FMallocTrackerScope class. Also, to improve visibility, two scopes are created automatically in each FScopeCycleCounterUObject: a scope with the name of the class of the UObject, and a scope with the name of the UObject. That makes it easier to collapse the data per class name when we eventually create a tool to visualize the data. To get a better sense of the complexity, let’s look at a single scope coming from the Elemental demo:
ScopeSample
If we analyze the allocations under that scope we see the following:

Address Thread Name Group Bytes Name
0x0000000023156420 Main Thread UObject 96 InterpGroupInst
0x00000000231cf000 Main Thread Unknown 64 UnnamedTSet
0x0000000023168480 Main Thread UObject 80 InterpTrackInstMove
0x0000000028ee8480 Main Thread Unknown 64 UnnamedTSet
0x0000000022bc2420 Main Thread Unknown 32 UnnamedTArray
0x00000000231563c0 Main Thread UObject 96 InterpGroupInst
0x00000000231cefc0 Main Thread Unknown 64 UnnamedTSet
0x0000000023168430 Main Thread UObject 80 InterpTrackInstMove
0x00000000231cef80 Main Thread Unknown 64 UnnamedTSet
0x0000000022bc2400 Main Thread Unknown 32 UnnamedTArray
0x0000000023156360 Main Thread UObject 96 InterpGroupInst
0x00000000231cef40 Main Thread Unknown 64 UnnamedTSet
0x00000000231683e0 Main Thread UObject 80 InterpTrackInstMove
0x0000000028ee8380 Main Thread Unknown 64 UnnamedTSet
0x0000000022bc23e0 Main Thread Unknown 32 UnnamedTArray
0x00000000231cef00 Main Thread UObject 64 InterpTrackInstAnimControl
0x00000000231ceec0 Main Thread UObject 64 InterpTrackInstVisibility

Those are just 17 allocations from the Play function in a Blueprint. The actual number of allocations at the time I made the capture on the Elemental demo was 584,454. The number of unique scopes is pretty high as well: 4,175. And with that we are just talking about the 607 MiB allocated at the time of the capture, even though the peak number of bytes allocated was 603 MiB. This goes to show the need for this kind of memory tracking.

MallocTracker implementation.

As I mentioned previously, MallocTracker was implemented in a similar fashion to the stomp allocator I made before. MallocTracker was made to be lightweight and to follow the performance requirements mentioned in “Memory allocation and tracking“.

The implementation is fast enough to be enabled by default without causing too much of an impact on performance, and with a fairly low overhead in terms of memory. For example, the Elemental demo showed a tracking overhead of ~30 MiB and under 2 ms on the CPU in a debug build, and even less in optimized builds. As usual there is a compromise between memory overhead and performance, and those numbers are based on the approach I decided to take. There are other approaches which would favor either performance or memory overhead further, but I think I struck a reasonable balance.

To start analyzing the implementation let’s look at a concrete example. This is what happens when FMemory::Malloc() gets called (a code sketch follows the list):

AllocDiagram

  1. FMemory::Malloc() gets called, requesting a certain number of bytes with a given name and group assigned to that allocation.
  2. FMemory::Malloc() calls FMallocTracker::Malloc() with the same arguments, assuming GMalloc points to an FMallocTracker instance.
  3. FMallocTracker::Malloc() allocates the actual memory by using the allocator that was passed in during FMallocTracker creation, in this case FMallocBinned.
  4. FMallocTracker::Malloc() atomically updates some global allocation stats, such as the peak number of bytes allocated, the peak number of allocations, etcetera.
  5. FMallocTracker::Malloc() gets the PerThreadData instance assigned to the current thread.
  6. FMallocTracker::Malloc() calls PerThreadData::AddAllocation() to store the allocation’s data in this thread’s containers.
  7. FMallocTracker::Malloc() returns the pointer the underlying allocator returned in step 3.
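
In code, a simplified sketch of that flow looks like this; the member and helper names (UsedMalloc, UpdateGlobalStats, GetPerThreadData) are assumptions:

#if USE_MALLOC_TRACKER
void* FMallocTracker::Malloc(SIZE_T Count, uint32 Alignment, const uint32 GroupId, const char* const Name)
{
	// Step 3: the wrapped allocator (FMallocBinned here) does the real work.
	void* Ptr = UsedMalloc->Malloc(Count, Alignment);

	// Step 4: update the global stats atomically.
	UpdateGlobalStats(Count);

	// Steps 5 and 6: record the allocation in this thread's data.
	FPerThreadData& ThreadData = GetPerThreadData();
	ThreadData.AddAllocation(Ptr, Count, GroupId, Name);

	// Step 7: hand back the pointer from step 3.
	return Ptr;
}
#endif // USE_MALLOC_TRACKER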

Global statistics.

Very few global stats are included. They are there to give you a very quick overview, nothing else. The global stats collected are:

  • Allocated Bytes. Number of bytes allocated when the data was dumped.
  • Number of allocations. Number of allocations when the data was dumped. A high number of allocations will usually cause high memory fragmentation.
  • Peak number of allocated bytes. The highest number of bytes allocated since MallocTracker was enabled.
  • Peak number of allocations. The highest number of living allocations since MallocTracker was enabled.
  • Overhead bytes. Number of bytes that are MallocTracker internal overhead.

All those stats are updated atomically since they are being accessed by all threads doing allocations.

Data per thread.

In order to improve performance and avoid resource contention as multiple threads allocate and free memory, most of the work is done on a per-thread basis. All allocations and scope stacks are stored per thread. All allocations have a relevant scope stack defined, and the uppermost scope is the GlobalScope. The same scope names usually show up in multiple scope stacks, so in order to minimize memory overhead all the scope names for a thread are stored uniquely and referenced from the scope stacks. And since scope stacks show up in multiple allocations, we store the scope stacks uniquely as well. Let’s look at a concrete example, scope names in blue and allocations in orange:
SampleAllocDiagram
To store that data we have three distinct arrays which aren’t shared among the different threads (sketched in code after the list):

  • Unique scope names. This stores the unique scope names used in this thread. At least the GlobalScope is assured to be there. New scope names are stored as they are pushed onto the scope stack.
  • Unique scope stacks. This stores unique stacks by creating a dynamic array of fixed-size arrays where the indices of the relevant scope names are stored.
  • Allocations. Data for each allocation. This includes the address of the allocation, the size in bytes, the group and name assigned to the allocation, and the index of its unique scope stack.
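
Here is a sketch of that per-thread storage; all type and field names are assumptions based on the description above:

struct FScopeStack
{
	static const int32 MaxDepth = 16;	// Assumed fixed depth.
	int32 NameIndices[MaxDepth];		// Indices into UniqueScopeNames.
	int32 Depth;
};

struct FAllocationEntry
{
	void*		Ptr;
	SIZE_T		Size;
	uint32		GroupId;
	const char*	Name;			// May be a tagged wide-char pointer, see below.
	int32		ScopeStackIndex;	// Index into UniqueScopeStacks.
};

struct FPerThreadData
{
	TArray<const char*>		UniqueScopeNames;	// Index 0 is always "GlobalScope".
	TArray<FScopeStack>		UniqueScopeStacks;
	TArray<FAllocationEntry>	Allocations;
	FCriticalSection		Lock;			// Needed for cross-thread frees, see below.
};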

If we refer to the previous graph we see that we have five allocations. Here is the data needed to store those five allocations:
SampleAllocData

Reallocating and freeing memory.

Reallocations and frees make things a bit more complicated, because it is rather common within Unreal Engine to see allocations happen on one thread and then be reallocated or freed on a different thread. That means that we can’t assume that we will find the relevant allocation in the per-thread data of the calling thread, and it also means that we need to introduce some locking. In order to reduce contention, instead of a global lock each per-thread data instance has its own lock. Both when reallocating and when freeing, the per-thread data of the calling thread is checked first for the existing allocation. If it isn’t found there, we look for that allocation in the other threads’ per-thread data, locking them one at a time in the process (see the sketch below). This ensures that contention is reduced and the other threads can keep themselves busy as long as possible.
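
A sketch of that lookup when freeing; again, the names are assumptions:

void FMallocTracker::Free(void* Original)
{
	UsedMalloc->Free(Original);

	FPerThreadData& Current = GetPerThreadData();
	{
		FScopeLock ScopeLock(&Current.Lock);
		if (Current.RemoveAllocation(Original))
		{
			return;
		}
	}

	// Allocated on another thread: scan the other per-thread data,
	// locking them one at a time.
	for (FPerThreadData* Other : AllPerThreadData)
	{
		FScopeLock ScopeLock(&Other->Lock);
		if (Other->RemoveAllocation(Original))
		{
			return;
		}
	}
}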

Dealing with names.

In order to make MallocTracker fast enough and acceptable in terms of memory consumption, I had to put in place a restriction on any kind of naming, be it names for allocations or scopes. The restriction is that the lifetime of the memory where those names are stored must be the same as or longer than that of the actual allocation or scope. The reason is that making a copy of that data greatly impacts performance and memory consumption, so only pointers are stored. While that may seem like a complex restriction to live with, I find it to be perfectly fine, since you should know the lifetime of your own allocations. If you don’t know the lifetime of your allocations well enough to know how long to keep those names alive, then you have bigger problems to deal with.

Another implementation detail with respect to allocation and scope names is the need to deal with both ANSI and wide character names. To make this more transparent, all those pointers are assumed to point to ANSI names unless bit 63 of the pointer is set, in which case the pointer is assumed to point to a wide character name. FMallocTracker provides a way to get a pointer with that bit set for wide char names, and to set it if necessary for FNames, which can be either wide or ANSI. At the time of writing the output the names are handled properly and dumped to file.
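
A minimal sketch of that pointer-tagging scheme, assuming a 64-bit build; the function names are mine:

static const UPTRINT WideNameBit = UPTRINT(1) << 63;

FORCEINLINE const char* MakeWideNamePointer(const TCHAR* const WideName)
{
	return reinterpret_cast<const char*>(reinterpret_cast<UPTRINT>(WideName) | WideNameBit);
}

FORCEINLINE bool IsWideNamePointer(const char* const Name)
{
	return (reinterpret_cast<UPTRINT>(Name) & WideNameBit) != 0;
}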

Conclusion.

Unfortunately I won’t be able to convey the true usefulness of this type of system until I make a tool to visualize the data properly, but you can take my word that it is really useful. Finding fragmentation issues and dealing with rampant memory usage is so much easier with this data. This is considerably better than what is already provided in the engine in terms of performance, memory consumption, and data quality. The next step would be to fully transition the engine to use tagged allocations, but that’s something that can be done as needed. It certainly doesn’t make sense to spend too much time tagging allocations when many of them are not conflicting. Instead it is better to just tag the big allocations to get more insight into the specific issues. But even if you find tagging allocations too boring, you may still get useful data. Here is the data of the biggest allocations captured on the Elemental demo:
SampleFull

Sample data and source code.

To make more sense out of this I also provide sample data. The sample data comes from running a modified version of the Elemental demo on a test build. You can download the data from here and view it with any text editor that supports big files.
You can also view the code by looking at the pull request I made for Epic Games to see if they will take this update. The pull request is available here.

Video overview.

Memory stomp allocator for Unreal Engine 4.

While I have been working on the memory tracking feature, one of the issues I had to deal with is memory stomps. They are hard to track down even with the debug allocator provided by Unreal Engine 4. I will go over what the cases to handle are, and then the actual implementation.

Symptoms of a memory stomp.

The symptoms of a memory stomp can be as clearly evident as an explicit error about a corrupted heap, or as complex as unexpected behavior without any crash at all, which is why they are so hard to catch. This is exacerbated by the fact that any performance-aware allocator won’t actually request pages from the OS on every allocation, but rather request multiple pages at once and assign addresses as necessary. Only when it runs out of available pages will it request new pages from the OS. In that situation the OS won’t be able to do anything to let us know that we messed up (for example, by throwing an access violation exception). Instead execution continues as usual, but behavior isn’t what is expected, since we could effectively be operating on memory that is unrelated to what we want to read or write. Just as an example, I had to deal with a case where some CPU particles were behaving in such a way that they would end up at the origin, with a color change every now and then. After looking for a logic error, I was able to determine that the issue had nothing to do with the code changing the color or transform, but rather with a memory overrun in unrelated code which ended up in the same pool in the binned allocator. Another example is when pointers get overwritten. If you don’t see that the pointer itself was overwritten somewhere else, rather than the data being pointed to getting corrupted, you may waste some time. Depending on the nature of the code accessing the data pointer, it may or may not crash. Still, the symptom would be similar to that of overwritten data.

Types of memory stomps.

A memory stomp could be defined as doing any of several types of invalid operations with memory. All these invalid operations are hard to catch due to their nature. They are:

  • Memory overrun. Reading or writing off the end of an allocation.
static const size_t NumBytes = 1U << 6U;
uint8_t * const TestBytes = new uint8_t[NumBytes];
// Memory overrun:
static const size_t SomeVariable = 7U;
TestBytes[1U << SomeVariable] = 1U;
delete [] TestBytes;
  • Memory underrun. Reading or writing off the beginning of an allocation.
static const size_t NumBytes = 1U << 6U;
uint8_t * const TestBytes = new uint8_t[NumBytes];
// Memory underrun:
TestBytes[-128] = 1U;
delete [] TestBytes;
  • Operation after free. Reading or writing an allocation that was already freed.
static const size_t NumBytes = 1U << 6U;
uint8_t * const TestBytes = new uint8_t[NumBytes];
delete [] TestBytes;
// Operation after free:
TestBytes[0] = 1U;

One point of confusion that does come up is whether dereferencing a null pointer could be considered a memory stomp. Given that the behavior when dereferencing null is undefined, it wouldn’t be possible to say that it is effectively a memory stomp. So to keep things simple, I rely on the target platform’s OS to deal with that case.

How it works.

The stomp allocator works by using the memory protection features of the different operating systems. They allow us to tag different memory pages (which are usually 4 KiB each) with different access modes. The access modes are OS-dependent, but they all offer four basic protection modes: execute, read, write, and no-access. Based on that, the stomp allocator allocates at least two pages: one page where the actual allocation lives, and an extra page that will be used to detect the memory stomp. The requested allocation is pushed to the end of the last valid page, in such a way that any read or write beyond that point causes a segmentation fault, since you would be trying to access a protected page. Since people say that a picture is worth a thousand words, here is a reference diagram:
MallocStompDiag

Diagram not to scale. The memory allocation information + the sentinel only accounts for 0.39% of the two pages shown.
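
To make the layout concrete, here is a simplified, Windows-only sketch of the allocation side; the struct layout and names are assumptions, and alignment handling and error checking are omitted:

struct FAllocationData
{
	void*	FullAllocationPointer;	// Base address returned by VirtualAlloc().
	SIZE_T	FullSize;		// Total committed size, including the no-access page.
	SIZE_T	Size;			// Size the caller requested.
	SIZE_T	Sentinel;		// 0xdeadbeef or 0xdeadbeefdeadbeef depending on the build.
};

static const SIZE_T SentinelExpected = 0xdeadbeefdeadbeef;
static const SIZE_T PageSize = 4096U;

void* StompMalloc(SIZE_T Size)
{
	const SIZE_T NeededBytes = sizeof(FAllocationData) + Size;
	const SIZE_T PayloadPages = (NeededBytes + PageSize - 1U) / PageSize;
	const SIZE_T TotalSize = (PayloadPages + 1U) * PageSize; // Plus one no-access page.

	uint8* const Base = (uint8*)VirtualAlloc(nullptr, TotalSize, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);

	// Push the allocation against the protected page so overruns fault immediately.
	uint8* const NoAccessPage = Base + TotalSize - PageSize;
	uint8* const Result = NoAccessPage - Size;

	FAllocationData* const Data = (FAllocationData*)(Result - sizeof(FAllocationData));
	Data->FullAllocationPointer = Base;
	Data->FullSize = TotalSize;
	Data->Size = Size;
	Data->Sentinel = SentinelExpected;

	DWORD OldProtect;
	VirtualProtect(NoAccessPage, PageSize, PAGE_NOACCESS, &OldProtect);
	return Result;
}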

As the diagram shows, there is a waste of space due to the fact that we have to deal with full pages. That is the price we have to pay, but the waste is limited to a full page plus the unused space in the first valid page. So the stomp allocator isn’t something that you would have enabled by default; it is a tool to help you find those hard-to-catch issues. Another nice benefit is that this approach should work fine on many platforms. As a pull request to Epic I’m providing the implementation for Windows, Xbox One, Linux and Mac, but it can be implemented on PlayStation 4 (details under NDA) and perhaps even on mobile platforms, as long as they provide functionality similar to what’s provided by mprotect or VirtualProtect.

Another aspect for safety is the use of a sentinel to detect underruns. The sentinel is the last member of the AllocationData struct, shown as “Memory allocation information” in the diagram, which is necessary to be able to deallocate the full allocation on free, and to know how much information to copy on Realloc. When an allocation is freed, the sentinel is checked against the expected value (which in my code is 0xdeadbeef or 0xdeadbeefdeadbeef depending on whether it is a 32-bit or 64-bit build). If the sentinel doesn’t match the expected value then an underrun was detected, and the debugger is stopped. But that will only work for the specific cases where the sentinel itself is overwritten, so changing the layout with a different mode helps with this issue. Here is the diagram:
MallocStompDiagUnder

Diagram not to scale.

This mode basically flips where the no-access page is placed. It allows finding underruns as long as the write happens before the “Memory allocation information” shown in the diagram; that piece of data is small (the exact size depends on the architecture targeted). The sentinel is still present to deal with small underrun errors, and the underrun errors that manage to go below that point will actually hit the no-access page, which triggers the response we want from the OS. The last aspect is that any attempt to read or write memory that was already freed will fail as it happens. A performance-aware allocator (such as the binned allocator) would add those pages to a free list and return them when someone else requests memory, without actually making a request to the OS for new pages; that is one of the reasons why they are fast. The stomp allocator instead decommits the freed pages, which makes any read or write operation on those pages invalid.
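
And the matching free side of the same sketch, with the sentinel check and the decommit described above:

void StompFree(void* const Ptr)
{
	FAllocationData* const Data = (FAllocationData*)((uint8*)Ptr - sizeof(FAllocationData));
	if (Data->Sentinel != SentinelExpected)
	{
		// Underrun detected: the allocation information was overwritten.
		__debugbreak();
	}

	// Decommit all the pages so any later read or write faults immediately.
	VirtualFree(Data->FullAllocationPointer, Data->FullSize, MEM_DECOMMIT);
}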

Optimizing sorting of base passes in Unreal Engine 4.

I know I said I was going to dig into the memory tracking solution, but I realized that I might as well do an optimization, since people seem to find them very useful. So I decided to look into profiling and optimizing the Landscape Mountains map, which you can download from the Unreal Engine Marketplace. If you haven’t seen it, here is a video:

I decided to profile 900 frames starting from the 100th frame, since I didn’t want to profile the initial streaming of the level. So here is the basic thread view.

ThreadProfile

The top thread is doing much more work than the rest; that thread is the render thread. It was obvious that to improve performance I would have to optimize the work happening on the render thread, so let’s look at it.

RenderThreadProfile

The highlighted functions caught my attention because they accounted for almost 8% of the time spent on the render thread, and sorting is usually a fairly self-contained problem to solve. So let’s dig into the actual code execution.

InitViewsCallstack

One of the first calls made each frame to render the scene is FDeferredShadingSceneRenderer::InitViews(), which is in charge of initializing the scene’s views by checking visibility, sorting elements for the different passes, initializing the dynamic shadows, etc. The sorting of the static meshes with their respective drawing policies kicks off the functions we saw in the profiling data. Let’s look at that function:

void FDeferredShadingSceneRenderer::SortBasePassStaticData(FVector ViewPosition)
{
	// If we're not using a depth only pass, sort the static draw list buckets roughly front to back, to maximize HiZ culling
	// Note that this is only a very rough sort, since it does not interfere with state sorting, and each list is sorted separately
	if (EarlyZPassMode == DDM_None)
	{
		SCOPE_CYCLE_COUNTER(STAT_SortStaticDrawLists);

		for (int32 DrawType = 0; DrawType < FScene::EBasePass_MAX; DrawType++)
		{
			Scene->BasePassNoLightMapDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassSimpleDynamicLightingDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassCachedVolumeIndirectLightingDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassCachedPointIndirectLightingDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassHighQualityLightMapDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassDistanceFieldShadowMapLightMapDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassLowQualityLightMapDrawList[DrawType].SortFrontToBack(ViewPosition);
		}
	}
}

That’s extremely simple to understand. It just sorts the different draw lists one after the other, twice: once for the default base pass, and once for the masked base pass. The first thing that becomes obvious is that we could sort those draw lists asynchronously on different threads and get some other work going while that’s being done. But in order to do that we need to know that the sort is self-contained and doesn’t affect global state. To find that out we have to dig deeper into the callstack. Let’s look at the code of the next function, TStaticMeshDrawList::SortFrontToBack():

template<typename DrawingPolicyType>
void TStaticMeshDrawList<DrawingPolicyType>::SortFrontToBack(FVector ViewPosition)
{
	// Cache policy link bounds
	for (typename TDrawingPolicySet::TIterator DrawingPolicyIt(DrawingPolicySet); DrawingPolicyIt; ++DrawingPolicyIt)
	{
		FBoxSphereBounds AccumulatedBounds(ForceInit);

		FDrawingPolicyLink& DrawingPolicyLink = *DrawingPolicyIt;
		for (int32 ElementIndex = 0; ElementIndex < DrawingPolicyLink.Elements.Num(); ElementIndex++)
		{
			FElement& Element = DrawingPolicyLink.Elements[ElementIndex];

			if (ElementIndex == 0)
			{
				AccumulatedBounds = Element.Bounds;
			}
			else
			{
				AccumulatedBounds = AccumulatedBounds + Element.Bounds;
			}
		}
		DrawingPolicyIt->CachedBoundingSphere = AccumulatedBounds.GetSphere();
	}

	SortViewPosition = ViewPosition;
	SortDrawingPolicySet = &DrawingPolicySet;

	OrderedDrawingPolicies.Sort( TCompareStaticMeshDrawList<DrawingPolicyType>() );
}

In this piece of code there are a couple of things that could be changed. First of all, you can see that inside the inner loop ElementIndex is initialized on the first iteration, but then it is checked on every other iteration just to see if it’s the first. Now, you could say that the branch predictor in the CPU will figure out that ElementIndex won’t be 0 in the following iterations, but it is better to write code that doesn’t rely too much on branch predictors, since they are fairly different from one CPU to another. Let’s make that change first, plus a couple of minor changes:

template<typename DrawingPolicyType>
void TStaticMeshDrawList<DrawingPolicyType>::SortFrontToBack(FVector ViewPosition)
{
	// Cache policy link bounds
	for (typename TDrawingPolicySet::TIterator DrawingPolicyIt(DrawingPolicySet); DrawingPolicyIt; ++DrawingPolicyIt)
	{
		FBoxSphereBounds AccumulatedBounds(ForceInit);

		FDrawingPolicyLink& DrawingPolicyLink = *DrawingPolicyIt;

		const int32 NumElements = DrawingPolicyLink.Elements.Num();
		if (NumElements > 0)
		{
			AccumulatedBounds = DrawingPolicyLink.Elements[0].Bounds;
		}

		for (int32 ElementIndex = 1; ElementIndex < NumElements; ElementIndex++)
		{
			AccumulatedBounds = AccumulatedBounds + DrawingPolicyLink.Elements[ElementIndex].Bounds;
		}

		DrawingPolicyIt->CachedBoundingSphere = AccumulatedBounds.GetSphere();
	}

	SortViewPosition = ViewPosition;
	SortDrawingPolicySet = &DrawingPolicySet;

	OrderedDrawingPolicies.Sort( TCompareStaticMeshDrawList<DrawingPolicyType>() );
}

Now, the actual issue with sorting multiple draw lists concurrently with this code is that there are two static variables in TStaticMeshDrawList:

/**
 * Static variables for getting data into the Compare function.
 * Ideally Sort would accept a non-static member function which would avoid having to go through globals.
 */
static TDrawingPolicySet* SortDrawingPolicySet;
static FVector SortViewPosition;

Now this is a case where code comments get outdated, because it is possible to send data to the Compare function without using globals and without using thread-local storage or anything similar. Since the assumption was that sending data to the sort function wasn’t possible, here is how the predicate class was written:

/** Helper struct for sorting */
template<typename DrawingPolicyType>
struct TCompareStaticMeshDrawList
{
	FORCEINLINE bool operator()( const FSetElementId& A, const FSetElementId& B ) const
	{
		// Use static Compare from TStaticMeshDrawList
		return TStaticMeshDrawList<DrawingPolicyType>::Compare( A, B ) < 0;
	}
};

Now that we know that, let’s see how those two static variables are used:

template<typename DrawingPolicyType>
int32 TStaticMeshDrawList<DrawingPolicyType>::Compare(FSetElementId A, FSetElementId B)
{
	const FSphere& BoundsA = (*SortDrawingPolicySet)[A].CachedBoundingSphere;
	const FSphere& BoundsB = (*SortDrawingPolicySet)[B].CachedBoundingSphere;

	// Assume state buckets with large bounds are background geometry
	if (BoundsA.W >= HALF_WORLD_MAX / 2 && BoundsB.W < HALF_WORLD_MAX / 2)
	{
		return 1;
	}
	else if (BoundsB.W >= HALF_WORLD_MAX / 2 && BoundsA.W < HALF_WORLD_MAX / 2)
	{
		return -1;
	}
	else
	{
		const float DistanceASquared = (BoundsA.Center - SortViewPosition).SizeSquared();
		const float DistanceBSquared = (BoundsB.Center - SortViewPosition).SizeSquared();
		// Sort front to back
		return DistanceASquared > DistanceBSquared ? 1 : -1;
	}
}

So it is clear that the first thing we need to do to make this asynchronous is to get rid of SortDrawingPolicySet and SortViewPosition. First, the predicate class must be rewritten to include the parameters that used to be in the static variables:

/** Helper struct for sorting */
template<typename DrawingPolicyType>
struct TCompareStaticMeshDrawList
{
private:
	const typename TStaticMeshDrawList<DrawingPolicyType>::TDrawingPolicySet * const SortDrawingPolicySet;
	const FVector SortViewPosition;

public:
	TCompareStaticMeshDrawList(const typename TStaticMeshDrawList<DrawingPolicyType>::TDrawingPolicySet * const InSortDrawingPolicySet, const FVector InSortViewPosition)
		: SortDrawingPolicySet(InSortDrawingPolicySet)
		, SortViewPosition(InSortViewPosition)
	{
	}

	FORCEINLINE bool operator()( const FSetElementId& A, const FSetElementId& B ) const
	{
		// Use static Compare from TStaticMeshDrawList
		return TStaticMeshDrawList<DrawingPolicyType>::Compare(A, B, SortDrawingPolicySet, SortViewPosition) < 0;
	}
};

And replace the Compare implementation to fit the new changes:

template<typename DrawingPolicyType>
int32 TStaticMeshDrawList<DrawingPolicyType>::Compare(FSetElementId A, FSetElementId B, const TDrawingPolicySet * const InSortDrawingPolicySet, const FVector InSortViewPosition)
{
	const FSphere& BoundsA = (*InSortDrawingPolicySet)[A].CachedBoundingSphere;
	const FSphere& BoundsB = (*InSortDrawingPolicySet)[B].CachedBoundingSphere;

	// Assume state buckets with large bounds are background geometry
	if (BoundsA.W >= HALF_WORLD_MAX / 2 && BoundsB.W < HALF_WORLD_MAX / 2)
	{
		return 1;
	}
	else if (BoundsB.W >= HALF_WORLD_MAX / 2 && BoundsA.W < HALF_WORLD_MAX / 2)
	{
		return -1;
	}
	else
	{
		const float DistanceASquared = (BoundsA.Center - InSortViewPosition).SizeSquared();
		const float DistanceBSquared = (BoundsB.Center - InSortViewPosition).SizeSquared();
		// Sort front to back
		return DistanceASquared > DistanceBSquared ? 1 : -1;
	}
}

Since the static variables are gone we need to modify TStaticMeshDrawList::SortFrontToBack():

template<typename DrawingPolicyType>
void TStaticMeshDrawList<DrawingPolicyType>::SortFrontToBack(FVector ViewPosition)
{
	// Cache policy link bounds
	for (typename TDrawingPolicySet::TIterator DrawingPolicyIt(DrawingPolicySet); DrawingPolicyIt; ++DrawingPolicyIt)
	{
		FBoxSphereBounds AccumulatedBounds(ForceInit);

		FDrawingPolicyLink& DrawingPolicyLink = *DrawingPolicyIt;

		const int32 NumElements = DrawingPolicyLink.Elements.Num();
		if (NumElements > 0)
		{
			AccumulatedBounds = DrawingPolicyLink.Elements[0].Bounds;
		}

		for (int32 ElementIndex = 1; ElementIndex < NumElements; ElementIndex++)
		{
			AccumulatedBounds = AccumulatedBounds + DrawingPolicyLink.Elements[ElementIndex].Bounds;
		}

		DrawingPolicyIt->CachedBoundingSphere = AccumulatedBounds.GetSphere();
	}

	OrderedDrawingPolicies.Sort(TCompareStaticMeshDrawList<DrawingPolicyType>(&DrawingPolicySet, ViewPosition));
}

Now we should be able to sort the draw lists asynchronously. To do that we will need to replace the code in FDeferredShadingSceneRenderer::SortBasePassStaticData() with something that kicks off asynchronous calls to TStaticMeshDrawList::SortFrontToBack(), and it will have to return some kind of container that can be used to determine whether those tasks have finished. That way we can issue the tasks and avoid waiting for them on the rendering thread until we reach a point in the code where we need them done (and hopefully by then they will be done already). Unfortunately Unreal Engine 4.6 doesn’t support futures and other features that would make it easy to launch these functions asynchronously; it is my understanding that they will come in Unreal Engine 4.8. Meanwhile we will have to write a standard task. Let’s do that.

For the sorting task, we will sort a single draw list per task, and whoever needs to sort multiple draw lists will create multiple tasks. This task only needs two parameters: the draw list to sort, and the view position for the sorting. The only purpose of the task will be to call the SortFrontToBack() function of the draw list to sort, with the view position as a parameter, nothing else. Let’s implement that:

template<typename StaticMeshDrawList>
class FSortFrontToBackTask
{
private:
	StaticMeshDrawList * const StaticMeshDrawListToSort;
	const FVector ViewPosition;

public:
	FSortFrontToBackTask(StaticMeshDrawList * const InStaticMeshDrawListToSort, const FVector InViewPosition)
		: StaticMeshDrawListToSort(InStaticMeshDrawListToSort)
		, ViewPosition(InViewPosition)
	{

	}

	FORCEINLINE TStatId GetStatId() const
	{
		RETURN_QUICK_DECLARE_CYCLE_STAT(FSortFrontToBackTask, STATGROUP_TaskGraphTasks);
	}

	ENamedThreads::Type GetDesiredThread()
	{
		return ENamedThreads::AnyThread;
	}

	static ESubsequentsMode::Type GetSubsequentsMode() { return ESubsequentsMode::TrackSubsequents; }

	void DoTask(ENamedThreads::Type CurrentThread, const FGraphEventRef& MyCompletionGraphEvent)
	{
		StaticMeshDrawListToSort->SortFrontToBack(ViewPosition);
	}
};

That’s a bit of boilerplate code that won’t be necessary once futures are supported in Unreal Engine 4.8. Anyway, now we need to create a function to dispatch those tasks. Since I like to keep the single-threaded and multithreaded implementations available for debugging, I decided to write a new function called FDeferredShadingSceneRenderer::AsyncSortBasePassStaticData(). Let’s implement that:

void FDeferredShadingSceneRenderer::AsyncSortBasePassStaticData(const FVector &InViewPosition, FGraphEventArray &OutSortEvents)
{
	// If we're not using a depth only pass, sort the static draw list buckets roughly front to back, to maximize HiZ culling
	// Note that this is only a very rough sort, since it does not interfere with state sorting, and each list is sorted separately
	if (EarlyZPassMode == DDM_None)
	{
		for (int32 DrawType = 0; DrawType < FScene::EBasePass_MAX; DrawType++)
		{
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<FNoLightMapPolicy> > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassNoLightMapDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<FSimpleDynamicLightingPolicy> > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassSimpleDynamicLightingDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<FCachedVolumeIndirectLightingPolicy> > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassCachedVolumeIndirectLightingDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<FCachedPointIndirectLightingPolicy> > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassCachedPointIndirectLightingDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<TLightMapPolicy<HQ_LIGHTMAP> > > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassHighQualityLightMapDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<TDistanceFieldShadowsAndLightMapPolicy<HQ_LIGHTMAP> > > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassDistanceFieldShadowMapLightMapDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<TLightMapPolicy<LQ_LIGHTMAP> > > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassLowQualityLightMapDrawList[DrawType]), InViewPosition));
		}
	}
}

With that we have dispatched all the sorting tasks and OutSortEvents can be used to determine if all the tasks are done. So let’s go back to FDeferredShadingSceneRenderer::InitViews() and implement the support for asynchronous sorting:

#if PZ_ASYNC_FRONT_TO_BACK_SORT
	FGraphEventArray SortEvents;
	AsyncSortBasePassStaticData(AverageViewPosition, SortEvents);
#else
	SortBasePassStaticData(AverageViewPosition);
#endif // PZ_ASYNC_FRONT_TO_BACK_SORT

That way we get to keep both approaches, should you have to debug an issue with either. At this point the sort tasks were launched, but we can’t assume they will be done before we leave the function, so let’s check for that and make sure they are done:

#if PZ_ASYNC_FRONT_TO_BACK_SORT
	if (SortEvents.Num())
	{
		FTaskGraphInterface::Get().WaitUntilTasksComplete(SortEvents, ENamedThreads::RenderThread);
	}
#endif // PZ_ASYNC_FRONT_TO_BACK_SORT

That way we halt execution on the rendering thread until the tasks are done, but hopefully they will already be done by then.

So now the question is: did this change actually improve performance? Let’s compare the basic data, capturing the same frames as before:

AsyncSortSummaryComparison
All positive numbers show performance improvements. Before-optimization numbers are on the left, after-optimization numbers on the right. Blue bars in the histogram show the frames before the optimization, orange bars after.

With VTune configured to show frames that took more than 20 ms as slow, and less than 11 ms as fast, it is clearly visible that we have more frames with good or fast performance and fewer frames with slow performance. Overall, frames moved to the right in the histogram, showing that this definitely improved performance. Let’s look at the work on the different threads:

AsyncSortThreadsBefore
Before

AsyncSortThreadsAfter
After

The render thread is highlighted in both profiles. It is visible that the work on the rendering thread was reduced considerably, while the work on the threads that consume tasks increased. So shifting work from the render thread to the task threads, without spending time waiting for those tasks to finish, improved performance.
Going wide isn’t always the answer, especially when the code you need to optimize was written to be single-threaded. But as programmers we need to get better at doing stuff asynchronously, and to design for multithreading from the start, because cases like this show a clear benefit from going wide.

If you would like to see the whole code change and have access to the Unreal Engine 4 github then you can check the whole change here: https://github.com/EpicGames/UnrealEngine/pull/820

Limitations of memory tracking features in Unreal Engine 4.

Since June last year I have been working as a contractor on a AAA game based on Unreal Engine 4. One of the big issues on most AAA titles, and one that certainly bit me on this project, is memory consumption. It is a tough challenge to provide the best experience to the end user while staying within the constraints of the available memory, and even more so when working on a sandbox game, which is my case. While Unreal Engine 4 does provide some facilities for tracking memory, I have come to see that they are not up to the task when it comes to big sandbox games. But first let's dig into the two means provided by the engine: memory report commands and MallocProfiler.

Notes: For this post I will assume that you have read the blog post by Ben Zeigler called “Debugging and Optimizing Memory” on the official Unreal Engine blog. Also, this article is based on what’s available in Unreal Engine 4.6.

Memory report commands

The memory report commands are a set of console commands that let you inspect memory usage in general. They are all bundled together in a single, very convenient command: "Memreport -full". Behind the scenes it executes the following commands (a sketch of how such a bundle can be driven follows the list):

Mem FromReport
obj list -alphasort
rhi.DumpMemory
LogOutStatLevels
ListSpawnedActors
DumpParticleMem
ConfigMem
r.DumpRenderTargetPoolMemory
ListTextures -alphasort
ListSounds -alphasort
ListParticleSystems -alphasort
obj list class=SoundWave -alphasort
obj list class=SkeletalMesh -alphasort
obj list class=StaticMesh -alphasort
obj list class=Level -alphasort
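
As a hedged sketch of what such a bundle boils down to (not the engine's actual implementation; ExecMemReportSketch is an illustrative name): each entry is just a console command executed against an output device, which is also how a custom subset could be scripted:

void ExecMemReportSketch(UWorld* World, FOutputDevice& Ar)
{
	// Illustrative subset; the real bundle routes all output to a log file.
	const TCHAR* Commands[] = {
		TEXT("Mem FromReport"),
		TEXT("obj list -alphasort"),
		TEXT("rhi.DumpMemory"),
		TEXT("ListTextures -alphasort"),
	};
	for (const TCHAR* Cmd : Commands)
	{
		GEngine->Exec(World, Cmd, Ar); // each command appends its own section
	}
}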

All these commands rely on three mechanisms for gathering data: global allocator stats, memory tracking stat defines, and each object's UObject::GetResourceSize().

Global allocator stats

LogMemory: Platform Memory Stats for WindowsNoEditor
LogMemory: Process Physical Memory: 722.09 MB used, 945.73 MB peak
LogMemory: Process Virtual Memory: 732.14 MB used, 1379.78 MB peak
LogMemory: Physical Memory: 9624.43 MB used, 24551.27 MB total
LogMemory: Virtual Memory: 1026.31 MB used, 8388608.00 MB total
LogMemory:
LogMemory: Memory Stats:
LogMemory: FMemStack (gamethread) current size = 0.00 MB
LogMemory: FPageAllocator (all threads) allocation size [used/ unused] = [0.00 / 0.56] MB
LogMemory: Nametable memory usage = 1.54 MB

This is the most basic information. The depth of the data provided is very limited, but it is the first thing to look at when assessing memory issues. The first four lines show data that comes from GlobalMemoryStatusEx and GetProcessMemoryInfo, which are completely generic; it's data that you can already see in the Task Manager.
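
For reference, a minimal sketch of those two Win32 calls, with error handling omitted:

#include <windows.h>
#include <psapi.h>

void DumpBasicMemoryStats()
{
	// System-wide physical/virtual memory, the same data the Task Manager shows.
	MEMORYSTATUSEX MemStatus = { sizeof(MemStatus) };
	GlobalMemoryStatusEx(&MemStatus);

	// Per-process usage and peaks (working set, pagefile usage).
	PROCESS_MEMORY_COUNTERS ProcMem = {};
	GetProcessMemoryInfo(GetCurrentProcess(), &ProcMem, sizeof(ProcMem));
}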

The last three lines are specific bits of memory which usually don't amount to much, but they are still useful to see. The FMemStack size is the memory used by the linear stack allocator. That allocator is a singleton and it is used for multiple things, such as storing data for rendering composition passes, shadow information, etc.
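
As a quick illustration of why "linear" matters here, a minimal FMemStack usage sketch, to the best of my recollection of the API:

	// Everything pushed after the mark is released in bulk when the mark
	// goes out of scope; there are no individual frees.
	FMemMark Mark(FMemStack::Get());
	uint8* Temp = FMemStack::Get().PushBytes(1024, 16); // 1 KiB, 16-byte aligned
	// ... use Temp for transient, frame-local work ...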

The second line refers to the allocation size handled by the page allocator. This allocator stores its data internally, split into normal-page-size allocations and small-page-size allocations, and the statistics provided include both. The FMemStack uses the FPageAllocator, so the two are related.

The last entry shows the size of the table that stores all the FName entries. Since FNames are used throughout the engine, this can be a sizable piece of memory, especially in sandbox games with lots of different UObjects.

Memory tracking stat defines

The Unreal stat system provides the following macros to create stat entries for memory tracking:

#define DECLARE_MEMORY_STAT(CounterName,StatId,GroupId)
#define DECLARE_MEMORY_STAT_POOL(CounterName,StatId,GroupId,Pool)
#define DECLARE_MEMORY_STAT_EXTERN(CounterName,StatId,GroupId, API)
#define DECLARE_MEMORY_STAT_POOL_EXTERN(CounterName,StatId,GroupId,Pool, API)

#define INC_MEMORY_STAT_BY(StatId,Amount)
#define DEC_MEMORY_STAT_BY(StatId,Amount)
#define SET_MEMORY_STAT(StatId,Value)

#define INC_MEMORY_STAT_BY_FName(Stat, Amount)
#define DEC_MEMORY_STAT_BY_FName(Stat,Amount)
#define SET_MEMORY_STAT_FName(Stat,Value)
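
A minimal usage sketch, assuming a custom stat; STAT_MyPoolMemory and the pool functions are made-up names for illustration:

// Declare the stat once in a translation unit.
DECLARE_MEMORY_STAT(TEXT("My Pool Memory"), STAT_MyPoolMemory, STATGROUP_Memory);

void* AllocFromPool(SIZE_T NumBytes)
{
	// Increased right where the allocation happens, so the stat can't drift.
	INC_MEMORY_STAT_BY(STAT_MyPoolMemory, NumBytes);
	return FMemory::Malloc(NumBytes);
}

void FreeToPool(void* Ptr, SIZE_T NumBytes)
{
	DEC_MEMORY_STAT_BY(STAT_MyPoolMemory, NumBytes);
	FMemory::Free(Ptr);
}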

These macros allow you to define, set, increase, and decrease each memory tracking stat entry. The fault of this approach is that each stat needs to be increased and decreased correctly, and in a place that is actually near where the memory is allocated or freed. Let's look at an example in the engine:

void FPrimitiveSceneInfo::RemoveFromScene(bool bUpdateStaticDrawLists)
{
	check(IsInRenderingThread());

	// implicit linked list. The destruction will update this "head" pointer to the next item in the list.
	while(LightList)
	{
		FLightPrimitiveInteraction::Destroy(LightList);
	}

	// Remove the primitive from the octree.
	check(OctreeId.IsValidId());
	check(Scene->PrimitiveOctree.GetElementById(OctreeId).PrimitiveSceneInfo == this);
	Scene->PrimitiveOctree.RemoveElement(OctreeId);
	OctreeId = FOctreeElementId();

	IndirectLightingCacheAllocation = NULL;

	DEC_MEMORY_STAT_BY(STAT_PrimitiveInfoMemory, sizeof(*this) + StaticMeshes.GetAllocatedSize() + Proxy->GetMemoryFootprint());

	if (bUpdateStaticDrawLists)
	{
		RemoveStaticMeshes();
	}
}

In that piece of code we can see how STAT_PrimitiveInfoMemory is decreased, but not anywhere near where the actual memory was freed. The memory was freed inside the allocator defined for the StaticMeshes array and the scene proxy, and all of that was triggered by removing an element from the octree. If someone changes the memory usage of the octree, then this stat will reflect the wrong memory consumption, which leads to wrong decisions when optimizing memory usage. The same happens if FPrimitiveSceneInfo changes, especially when new containers are added to the class.

The process of keeping allocation data up to date by means of stat defines is very error prone. The data gets out of date, it gets written wrong, and it doesn't actually track memory, just estimates of what the programmer thinks could consume memory. And the last mechanism, UObject::GetResourceSize(), has the same issues.

Memory tracking with UObject::GetResourceSize()

Objects:

                                         Class  Count  NumKBytes  MaxKBytes  ResKBytes ExclusiveResKBytes
                            AIPerceptionSystem      1          0K          0K          0K          0K
                                AISense_Damage      1          0K          0K          0K          0K
                               AISense_Hearing      1          0K          0K          0K          0K
                            AISense_Prediction      1          0K          0K          0K          0K
                                 AISense_Sight      1          0K          0K          0K          0K
                                  AISense_Team      1          0K          0K          0K          0K
                                 AISense_Touch      1          0K          0K          0K          0K
                                      AISystem      1          0K          0K          0K          0K
                                  AmbientSound     21         15K         15K          0K          0K
               AnimNotify_PlayParticleEffect_C     11          1K          1K          0K          0K
                                  AnimSequence     78      53453K      53453K      53333K      53333K
                        AnimSingleNodeInstance     85        241K        242K          0K          0K
                                 ArrayProperty    729         85K         85K          0K          0K
                           AssetObjectProperty      2          0K          0K          0K          0K

In the ideal situation this function provides the memory usage of a specific UObject. The function is defined this way:

/**
 * Returns the size of the object/ resource for display to artists/ LDs in the Editor. The
 * default behavior is to return 0 which indicates that the resource shouldn't display its
 * size which is used to not confuse people by displaying small sizes for e.g. objects like
 * materials
 *
 * @param	Type	Indicates which resource size should be returned
 * @return	Size of resource as to be displayed to artists/ LDs in the Editor.
 */
virtual SIZE_T GetResourceSize(EResourceSizeMode::Type Mode)
{
	return 0;
}

It should just take a call to the specific UObject::GetResourceSize() and we should get the data, which is extremely useful data to have. It tries to answer questions such as "which UObjects do I need to optimize in the scene?" and "which UObject-based classes do I need to optimize?". But again, this is only as good as the implementation written for it. This is another mechanism that gets outdated, that gets implemented wrong (for example, by assuming the data is stored within the object rather than just pointers to the data), or that is just empty. For example, one may ask "which skeletal mesh in my scene should I optimize?" Let's look at the implementation in the engine as of Unreal Engine 4.6:

SIZE_T USkeletalMesh::GetResourceSize(EResourceSizeMode::Type Mode)
{
	return 0;
}

So that’s not good. This mechanism is useless or outdated. So to fix this I had to write something like this:

SIZE_T USkeletalMesh::GetResourceSize(EResourceSizeMode::Type Mode)
{
	SIZE_T ResSize = sizeof(USkeletalMesh);

	ResSize += Materials.GetAllocatedSize();
	ResSize += SkelMirrorTable.GetAllocatedSize();

	for (int32 i = 0, n = LODInfo.Num(); i < n; ++i)
	{
		ResSize += LODInfo[i].GetResourceSize(Mode);
	}
	ResSize += LODInfo.GetAllocatedSize();

#if WITH_EDITORONLY_DATA
	for (int32 i = 0, n = OptimizationSettings.Num(); i < n; ++i)
	{
		ResSize += OptimizationSettings[i].GetResourceSize(Mode);
	}
	ResSize += OptimizationSettings.GetAllocatedSize();
#endif

	for (int32 i = 0, n = MorphTargets.Num(); i < n; ++i)
	{
		ResSize += MorphTargets[i]->GetResourceSize(Mode);
	}
	ResSize += MorphTargets.GetAllocatedSize();

	ResSize += RefBasesInvMatrix.GetAllocatedSize();

	ResSize += ClothingAssets.GetAllocatedSize();

#if WITH_APEX_CLOTHING
	for (int32 i = 0, n = ClothingAssets.Num(); i < n; ++i)
	{
		ResSize += ClothingAssets[i].GetResourceSize(Mode);
	}
#endif

	ResSize += CachedStreamingTextureFactors.GetAllocatedSize();

	ResSize += Sockets.GetAllocatedSize();

	return ResSize;
}

And then again, you have to check that the called GetResourceSize() functions actually return the proper values. For example, ClothingAssets is an array of FClothingAssetData, but that struct didn't have a GetResourceSize() implementation (probably because it isn't a UObject in itself, yet we still need the memory usage of that resource). This implementation will also get outdated, and perhaps I missed some data when I wrote it. You just can't rely on this to get proper memory usage data.
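
For completeness, a hypothetical sketch of the kind of helper FClothingAssetData would need; the member names here are illustrative, not the actual 4.6 layout:

// Hypothetical: FClothingAssetData is a plain struct, so a member function
// following the same pattern is enough. Member names are illustrative.
SIZE_T FClothingAssetData::GetResourceSize(EResourceSizeMode::Type Mode)
{
	SIZE_T ResSize = 0;
	ResSize += ApexFileName.GetAllocatedSize();          // FString storage
	ResSize += ClothCollisionVolumes.GetAllocatedSize(); // illustrative TArray member
	return ResSize;
}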

MallocProfiler

The malloc profiler is a completely different approach. It is a feature in the engine that keeps track of all memory allocated through the new and delete operators, as well as anything that goes through the global memory allocator. It collects basic stats such as memory operations, modules loaded, etc., but also the actual memory operations (allocations, frees, reallocations) together with the callstack for each. Grabbing the callstack isn't optional, since the callstack is the only piece of data that differentiates one allocation from another. But the fact that it has to capture the callstack makes it incredibly slow to use in-game, and it generates a huge output file. For example, I have seen capture files of ~15 GB, and each frame took more than one second (yes, second, not millisecond) while the game was running.
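
Conceptually the profiler hooks the global allocator as a proxy. Here is a heavily simplified sketch of that pattern; the class name and the Track* calls are illustrative placeholders, and the real FMallocProfiler does considerably more:

class FMallocProxySketch : public FMalloc
{
	FMalloc* InnerMalloc; // the real allocator every call forwards to

	void TrackAllocation(void* Ptr, SIZE_T Size) { /* record op + callstack */ }
	void TrackFree(void* Ptr)                    { /* record the free */ }

public:
	explicit FMallocProxySketch(FMalloc* InMalloc) : InnerMalloc(InMalloc) {}

	virtual void* Malloc(SIZE_T Size, uint32 Alignment) override
	{
		void* Ptr = InnerMalloc->Malloc(Size, Alignment);
		TrackAllocation(Ptr, Size); // capturing the callstack here is what makes it slow
		return Ptr;
	}

	virtual void Free(void* Ptr) override
	{
		TrackFree(Ptr);
		InnerMalloc->Free(Ptr);
	}

	// Realloc and the rest of the FMalloc interface follow the same pattern
	// (omitted for brevity).
};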

The data can be visualized with a custom tool written in C# called MemoryProfiler 2. This is an extremely slow and memory-inefficient tool: a 15 GB capture takes close to 30 minutes to open and more than 80 GB of system memory. Again, the issue isn't only the tool, but also the sheer amount of data generated, made worse by the fact that loading the data is single threaded, with the code structured in a way that makes it hard to multithread. But even if the tool were faster, and the data were compressed properly, it still wouldn't help that much, because the information is limited. Let's look at the data:
[Figure: MemoryProfiler2]
Here we see the tab most relevant to memory usage by resources, because it visualizes the different allocations as a call tree. In the screenshot the allocations related to USkeletalMesh serialization are visible. The issue is that if you have many USkeletalMeshes, you can't find out which of them should be optimized; you just know that serializing all the USkeletalMeshes takes a certain amount of memory. This doesn't provide really useful data for tackling memory optimizations, except for general optimizations (such as rearranging the data inside USkeletalMesh to reduce compiler padding).

Next steps

After looking at the limitations, it is obvious that there is a place in the engine for a better way to do and track allocations. We need something that covers the whole spectrum of allocations, that allows us to keep track of them, and that does so without making it impossible for the game to run fast enough when it's enabled. Next time I will start digging into a solution for that and be more explicit about the requirements.