Adding memory tracking features to Unreal Engine 4.

It has been a while since I last talked about memory allocation and tracking. I have had the time to implement the tracking in Unreal Engine 4 and I think it is time to go over it. I’m going to assume that you have read the previous blog posts I made on the subject: “Limitations of memory tracking features in Unreal Engine 4” and “Memory allocation and tracking”.

Unreal Engine 4 memory management API.

Basic allocation methods.

There are three basic ways to allocate or free memory within the Unreal Engine 4:

  • Use of the GMalloc pointer. This is a way to get access to the global allocator. Which allocator is used depends on GCreateMalloc().
  • FMemory functions. These are static functions such as Malloc(), Realloc(), and Free(). They also use GMalloc for the allocations, but before every allocation, reallocation, or free they check whether GMalloc has been set. If GMalloc is nullptr then GCreateMalloc() is called.
  • Global new and delete operators. By default they are only defined per module, in ModuleBoilerplate.h, which means that many calls to new and delete were not being handled within the Unreal Engine 4 memory system. The overloaded operators actually call the FMemory functions.
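
The lazy creation performed on the FMemory path can be sketched as follows. This is a standalone illustration rather than engine code: `IAllocatorSketch`, `FCrtAllocatorSketch`, `CreateGlobalAllocator()`, and `FMemorySketch` are hypothetical stand-ins for FMalloc, a concrete allocator, GCreateMalloc(), and FMemory, and the sketch ignores thread safety and alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for the FMalloc interface.
struct IAllocatorSketch {
    virtual ~IAllocatorSketch() = default;
    virtual void* Malloc(std::size_t Count) = 0;
    virtual void Free(void* Original) = 0;
};

// Hypothetical concrete allocator backed by the C runtime.
struct FCrtAllocatorSketch final : IAllocatorSketch {
    void* Malloc(std::size_t Count) override { return std::malloc(Count); }
    void Free(void* Original) override { std::free(Original); }
};

// Stand-in for GMalloc: stays null until the first request arrives.
static IAllocatorSketch* GAllocatorSketch = nullptr;
static int GCreateCount = 0;

// Stand-in for GCreateMalloc(): picks the allocator to use.
static void CreateGlobalAllocator() {
    static FCrtAllocatorSketch Instance;
    GAllocatorSketch = &Instance;
    ++GCreateCount;
}

// Stand-in for the FMemory static functions: every entry point checks
// the global pointer before forwarding to it.
struct FMemorySketch {
    static void* Malloc(std::size_t Count) {
        if (GAllocatorSketch == nullptr) { CreateGlobalAllocator(); }
        assert(GAllocatorSketch != nullptr);
        return GAllocatorSketch->Malloc(Count);
    }
    static void Free(void* Original) {
        if (GAllocatorSketch == nullptr) { CreateGlobalAllocator(); }
        assert(GAllocatorSketch != nullptr);
        GAllocatorSketch->Free(Original);
    }
};
```

Code that calls the global pointer directly skips that check, which is why the two paths are listed separately above.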

There are cases where memory isn’t allocated and freed through those mechanisms, which should be avoided where possible. To catch those cases I submitted a pull request, already integrated and set for release in version 4.9, that catches those allocations via the C runtime library call _CrtSetAllocHook(). One example is the engine’s integration of zlib, which didn’t allocate through the engine’s facilities. I found that out using _CrtSetAllocHook() and made a pull request with the fix, which is also set for release in 4.9.

The basic API for both the direct calls to GMalloc and the FMemory functions is this:

virtual void* Malloc( SIZE_T Count, uint32 Alignment = DEFAULT_ALIGNMENT ) = 0;
virtual void* Realloc( void* Original, SIZE_T Count, uint32 Alignment = DEFAULT_ALIGNMENT ) = 0;
virtual void Free( void* Original ) = 0;

These are the places that would need to be modified if we want to add any new piece of data per allocation.

Integrating with the engine

Similar to the approach I took for the stomp allocator, I made a new class called FMallocTracker that derives from FMalloc, which allows me to hook it up to the Unreal memory allocation system. Since a valid allocator must be passed when creating the FMallocTracker instance, all the actual allocations are done with that allocator. The FMallocTracker is only there to keep tracking information, nothing else. But that is not enough: we actually need to get the tracking data we want all the way down to the allocator. So the first step is to modify the allocator’s functions when we have the memory tracking feature enabled:

#if USE_MALLOC_TRACKER
virtual void* Malloc( SIZE_T Count, uint32 Alignment, const uint32 GroupId, const char * const Name ) = 0;
virtual void* Realloc( void* Original, SIZE_T Count, uint32 Alignment, const uint32 GroupId, const char * const Name ) = 0;
#else
virtual void* Malloc( SIZE_T Count, uint32 Alignment = DEFAULT_ALIGNMENT ) = 0;
virtual void* Realloc( void* Original, SIZE_T Count, uint32 Alignment = DEFAULT_ALIGNMENT ) = 0;
#endif // USE_MALLOC_TRACKER

The new parameters are:

  • Name. Name of the allocation. This name can be whatever you want, but the recommendation is that it be a literal that is easy to search for. I will show the means to provide more context later in this post.
  • Group. This is the id of the group that is responsible for this allocation. These are the groups that I have defined, but it is something that you should definitely tune for your needs.

This change implies changes to all the allocators to make it transparent within the engine, but once that work is done you can tag the allocations without worrying about the underlying implementation. The benefit of tagging allocations isn’t just for tracking purposes; it is also relevant for code documentation. Having worked on large codebases as a consultant, I have found this kind of tagging to be a great benefit when ramping up, when dealing with interactions among different groups, and when fixing memory-related crashes.
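
As a minimal sketch of the wrapping arrangement, assuming a much smaller interface than the real FMalloc: the tracker owns no memory of its own, forwards every request to the allocator it was given, and only records the tag. `ITaggedAllocatorSketch`, `FCrtTaggedAllocator`, and `FTrackingAllocatorSketch` are hypothetical names, and the inner allocator ignores alignment for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical, reduced version of a tagged allocator interface.
struct ITaggedAllocatorSketch {
    virtual ~ITaggedAllocatorSketch() = default;
    virtual void* Malloc(std::size_t Count, std::uint32_t Alignment,
                         std::uint32_t GroupId, const char* Name) = 0;
    virtual void Free(void* Original) = 0;
};

// Inner allocator: does the real work and ignores the tags.
struct FCrtTaggedAllocator final : ITaggedAllocatorSketch {
    void* Malloc(std::size_t Count, std::uint32_t, std::uint32_t, const char*) override {
        return std::malloc(Count);
    }
    void Free(void* Original) override { std::free(Original); }
};

// Tracker: keeps bookkeeping only and forwards to the inner allocator.
class FTrackingAllocatorSketch final : public ITaggedAllocatorSketch {
public:
    explicit FTrackingAllocatorSketch(ITaggedAllocatorSketch* InInner) : Inner(InInner) {
        assert(Inner != nullptr); // a valid allocator must be supplied
    }

    void* Malloc(std::size_t Count, std::uint32_t Alignment,
                 std::uint32_t GroupId, const char* Name) override {
        void* Ptr = Inner->Malloc(Count, Alignment, GroupId, Name);
        if (Ptr != nullptr) {
            ++LiveAllocations;
            LastGroupId = GroupId;
            LastName = Name; // pointer only, the string is not copied
        }
        return Ptr;
    }

    void Free(void* Original) override {
        if (Original != nullptr) { --LiveAllocations; }
        Inner->Free(Original);
    }

    std::size_t LiveAllocations = 0;
    std::uint32_t LastGroupId = 0;
    const char* LastName = nullptr;

private:
    ITaggedAllocatorSketch* Inner;
};
```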

The next step is to integrate with the new and delete operators. As I already mentioned, in the engine they were defined in ModuleBoilerplate.h; to provide better coverage I first moved them to MemoryBase.h. The next step was to define our new operator overloads to pass in the name and group.

OPERATOR_NEW_MSVC_PRAGMA FORCEINLINE void* operator new  (size_t Size, const uint32 Alignment, const uint32 GroupId, const char * const Name)	OPERATOR_NEW_NOTHROW_SPEC{ return FMemory::Malloc(Size, Alignment, GroupId, Name); }
OPERATOR_NEW_MSVC_PRAGMA FORCEINLINE void* operator new[](size_t Size, const uint32 Alignment, const uint32 GroupId, const char * const Name)	OPERATOR_NEW_NOTHROW_SPEC{ return FMemory::Malloc(Size, Alignment, GroupId, Name); }
OPERATOR_NEW_MSVC_PRAGMA FORCEINLINE void* operator new  (size_t Size, const uint32 Alignment, const uint32 GroupId, const char * const Name, const std::nothrow_t&)	OPERATOR_NEW_NOTHROW_SPEC	{ return FMemory::Malloc(Size, Alignment, GroupId, Name); }
OPERATOR_NEW_MSVC_PRAGMA FORCEINLINE void* operator new[](size_t Size, const uint32 Alignment, const uint32 GroupId, const char * const Name, const std::nothrow_t&)	OPERATOR_NEW_NOTHROW_SPEC	{ return FMemory::Malloc(Size, Alignment, GroupId, Name); }

To avoid having to fill up the code with checks for USE_MALLOC_TRACKER, it is nice to provide some defines that create those allocations as if USE_MALLOC_TRACKER was set, but without incurring unnecessary cost when it isn’t. The intention is that this should be something with very little to no performance cost. So here is the basic definition:

#if USE_MALLOC_TRACKER
	#define PZ_NEW(GroupId, Name) new(DEFAULT_ALIGNMENT, (GroupId), (Name))
	#define PZ_NEW_ALIGNED(Alignment, GroupId, Name) new((Alignment), (GroupId), (Name))
	#define PZ_NEW_ARRAY(GroupId, Name, Type, Num)  reinterpret_cast<Type*>(FMemory::Malloc((Num) * sizeof(Type), DEFAULT_ALIGNMENT, (GroupId), (Name)))
	#define PZ_NEW_ARRAY_ALIGNED(Alignment, GroupId, Name, Type, Num)  reinterpret_cast<Type*>(FMemory::Malloc((Num) * sizeof(Type), (Alignment), (GroupId), (Name)))
#else
	#define PZ_NEW(GroupId, Name) new(DEFAULT_ALIGNMENT)
	#define PZ_NEW_ALIGNED(Alignment, GroupId, Name) new((Alignment))
	#define PZ_NEW_ARRAY(GroupId, Name, Type, Num)  reinterpret_cast<Type*>(FMemory::Malloc((Num) * sizeof(Type), DEFAULT_ALIGNMENT))
	#define PZ_NEW_ARRAY_ALIGNED(Alignment, GroupId, Name, Type, Num)  reinterpret_cast<Type*>(FMemory::Malloc((Num) * sizeof(Type), (Alignment)))
#endif // USE_MALLOC_TRACKER

Here are a couple of examples of how the tagged allocations look compared to the non-tagged allocations in terms of code:
[Image: TaggedSample0]
[Image: TaggedSample1]
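
As a rough, self-contained illustration of how a tagged call site differs from an untagged one (this is not the engine code), the sketch below uses a reduced placement `operator new` that takes only a group and a name. `SKETCH_NEW`, `SketchDelete`, `FParticle`, `GROUP_RENDERING`, and the globals that record the last tag are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

#define USE_MALLOC_TRACKER 1                   // set to 0 to compile the tags away
constexpr std::uint32_t GROUP_RENDERING = 1;   // hypothetical group id

// Hypothetical record of the last tag that reached the allocator.
static std::uint32_t GLastGroupId = 0;
static const char* GLastName = nullptr;

#if USE_MALLOC_TRACKER
// Reduced tagged overload: group and name only, no alignment.
void* operator new(std::size_t Size, std::uint32_t GroupId, const char* Name) {
    assert(Name != nullptr);
    GLastGroupId = GroupId;
    GLastName = Name;
    void* Ptr = std::malloc(Size);
    if (Ptr == nullptr) { throw std::bad_alloc(); }
    return Ptr;
}
// Matching placement delete, used only if the constructor throws.
void operator delete(void* Ptr, std::uint32_t, const char*) noexcept { std::free(Ptr); }
#define SKETCH_NEW(GroupId, Name) new((GroupId), (Name))
#else
#define SKETCH_NEW(GroupId, Name) new
#endif

struct FParticle { float X = 0.0f; float Y = 0.0f; float Z = 0.0f; };

// Releases an object created through SKETCH_NEW.
template <typename T>
void SketchDelete(T* Object) {
    if (Object == nullptr) { return; }
#if USE_MALLOC_TRACKER
    Object->~T();
    std::free(Object);
#else
    delete Object;
#endif
}

FParticle* MakeParticle() {
    // Untagged call site:  return new FParticle();
    return SKETCH_NEW(GROUP_RENDERING, "MakeParticle::FParticle") FParticle();
}
```

The call site reads almost the same in both builds, and with the flag set to 0 the macro expands to a plain `new`, so the tag arguments cost nothing.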

Tracking allocations done by containers.

One of the issues that comes up when naming allocations in a way that is simple to recognize is dealing with containers. There is hardly ever a single instance of anything within the engine or when making a game, be it positions of particles or number of players, so a lot of containers are used within the engine. When making an allocation within a container it wouldn’t be too useful to have a generic name. Let’s look at this example from FMeshParticleVertexFactory::DataType:

/** The streams to read the texture coordinates from. */
TArray<FVertexStreamComponent,TFixedAllocator<MAX_TEXCOORDS> > TextureCoordinates;

A generic name for allocations done within the allocator assigned to that container would be something like “TFixedAllocator::ResizeAllocation”. It doesn’t say much. A better name for all allocations related to that container would be something like “FMeshParticleVertexFactory::DataType::TextureCoordinates”. To do this we need to be able to assign names and groups to the containers in such a way that whenever an allocation is done within a container, the name and group of the container are fetched to tag that allocation. That requires changes to the containers and to the allocators that can be used with those containers. It involves adding a pointer and a 32-bit unsigned integer per container when building with USE_MALLOC_TRACKER enabled, and changing the necessary constructors to accept that optional information. One of the constructors for a TArray would look like this:

TArray(const uint32 GroupId = GROUP_UNKNOWN, const char * const Name = "UnnamedTArray")
	: ArrayNum(0)
	, ArrayMax(0)
#if USE_MALLOC_TRACKER
	, ArrayName(Name)
	, ArrayGroupId(GroupId)
#endif // USE_MALLOC_TRACKER
{}

With those changes in place we have the necessary information to send to the allocators to be able to tag those allocations. The next step is to look at those allocators and make the necessary changes to pass that information to the underlying allocator being used. Those container allocators in general use the FMemory method of allocating memory, and the FContainerAllocatorInterface defines the ResizeAllocation function that actually does the allocation of memory. Similarly to the previous changes, we need to add the name and group for the allocation.

#if USE_MALLOC_TRACKER
	void ResizeAllocation(int32 PreviousNumElements, int32 NumElements, SIZE_T NumBytesPerElement, const uint32 GroupId, const char * const Name);
#else
	void ResizeAllocation(int32 PreviousNumElements, int32 NumElements, SIZE_T NumBytesPerElement);
#endif // USE_MALLOC_TRACKER

Again, since we don’t want to fill up the engine’s code with ifdefs, we rely on a define to simplify that:

#if USE_MALLOC_TRACKER
#define PZ_CONTAINER_RESIZE_ALLOCATION(ContainerPtr, PreviousNumElements, NumElements, NumBytesPerElement, GroupId, Name) (ContainerPtr)->ResizeAllocation((PreviousNumElements), (NumElements), (NumBytesPerElement), (GroupId), (Name))
#else
#define PZ_CONTAINER_RESIZE_ALLOCATION(ContainerPtr, PreviousNumElements, NumElements, NumBytesPerElement, GroupId, Name) (ContainerPtr)->ResizeAllocation((PreviousNumElements), (NumElements), (NumBytesPerElement))
#endif // USE_MALLOC_TRACKER

With that in place we can pass the ArrayName and ArrayGroupId to the container allocator.

It is also necessary to be able to change the name or group of a container after construction, because that is what makes it possible to name allocations done by containers of containers. One such example is the following, where we need to set the name or group after a FindOrAdd in any of these TMap containers:
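
To make the flow concrete, here is a minimal sketch, under simplifying assumptions, of a container that stores its tag and hands it to its allocator whenever it grows. It is not the TArray implementation: `FIntArraySketch`, `FHeapAllocatorSketch`, `TaggedRealloc`, and `GTagLog` are hypothetical, the array only stores `int`, and the allocator's ResizeAllocation drops the PreviousNumElements parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

constexpr std::uint32_t GROUP_UNKNOWN = 0;     // hypothetical group ids
constexpr std::uint32_t GROUP_RENDERING = 1;

// Hypothetical log of the last tag that reached the allocation function.
struct FTagLog { std::uint32_t GroupId = GROUP_UNKNOWN; const char* Name = nullptr; int Calls = 0; };
static FTagLog GTagLog;

// Stand-in for a tagged FMemory::Realloc.
void* TaggedRealloc(void* Original, std::size_t Count, std::uint32_t GroupId, const char* Name) {
    GTagLog.GroupId = GroupId;
    GTagLog.Name = Name;
    ++GTagLog.Calls;
    return std::realloc(Original, Count);
}

// Container allocator: receives the tag from the container on every resize.
class FHeapAllocatorSketch {
public:
    FHeapAllocatorSketch() = default;
    FHeapAllocatorSketch(const FHeapAllocatorSketch&) = delete;
    FHeapAllocatorSketch& operator=(const FHeapAllocatorSketch&) = delete;
    ~FHeapAllocatorSketch() { std::free(Data); }

    void ResizeAllocation(int NumElements, std::size_t NumBytesPerElement,
                          std::uint32_t GroupId, const char* Name) {
        void* NewData = TaggedRealloc(Data, static_cast<std::size_t>(NumElements) * NumBytesPerElement,
                                      GroupId, Name);
        assert(NewData != nullptr);
        Data = NewData;
    }
    void* GetAllocation() const { return Data; }

private:
    void* Data = nullptr;
};

// Container: stores the tag once and forwards it whenever it grows.
class FIntArraySketch {
public:
    explicit FIntArraySketch(std::uint32_t GroupId = GROUP_UNKNOWN, const char* Name = "UnnamedArray")
        : ArrayGroupId(GroupId), ArrayName(Name) {}

    void Add(int Value) {
        if (ArrayNum == ArrayMax) {
            ArrayMax = (ArrayMax == 0) ? 4 : ArrayMax * 2;
            Allocator.ResizeAllocation(ArrayMax, sizeof(int), ArrayGroupId, ArrayName);
        }
        static_cast<int*>(Allocator.GetAllocation())[ArrayNum++] = Value;
    }
    int Num() const { return ArrayNum; }
    int operator[](int Index) const { return static_cast<int*>(Allocator.GetAllocation())[Index]; }

private:
    FHeapAllocatorSketch Allocator;
    int ArrayNum = 0;
    int ArrayMax = 0;
    std::uint32_t ArrayGroupId;
    const char* ArrayName;
};
```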

/** Map of object to their outers, used to avoid an object iterator to find such things. **/
TMap<UObjectBase*, TSet<UObjectBase*> > ObjectOuterMap;
TMap<UClass*, TSet<UObjectBase*> > ClassToObjectListMap;
TMap<UClass*, TSet<UClass*> > ClassToChildListMap;
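
A small sketch of the idea, using standard containers instead of TMap and TSet: the inner container is default-constructed with a generic tag by the map, so the tag has to be assigned right after the lookup. `FTaggedSetSketch`, `SetAllocationTag`, `FindOrAddTagged`, and the group ids are hypothetical, and the inner `std::set` is not itself tracked here.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>

constexpr std::uint32_t GROUP_UNKNOWN = 0;   // hypothetical group ids
constexpr std::uint32_t GROUP_UOBJECT = 2;

// Hypothetical set that carries a tag which can be changed after construction.
class FTaggedSetSketch {
public:
    void SetAllocationTag(std::uint32_t InGroupId, const char* InName) {
        assert(InName != nullptr);
        GroupId = InGroupId;
        Name = InName;
    }
    std::uint32_t GetGroupId() const { return GroupId; }
    const char* GetName() const { return Name; }
    void Add(int Value) { Items.insert(Value); }

private:
    std::set<int> Items;
    std::uint32_t GroupId = GROUP_UNKNOWN;
    const char* Name = "UnnamedTSet";
};

// Stand-in for a FindOrAdd followed by tagging the inner container.
FTaggedSetSketch& FindOrAddTagged(std::map<int, FTaggedSetSketch>& Map, int Key,
                                  std::uint32_t GroupId, const char* Name) {
    FTaggedSetSketch& Inner = Map[Key];    // created with the generic tag if absent
    Inner.SetAllocationTag(GroupId, Name); // later allocations carry the real tag
    return Inner;
}
```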

Once that’s done then all the allocations done by the containers will be tagged properly as they change. So now we just need to set the name for the container. Going back to the FMeshParticleVertexFactory::DataType::TextureCoordinates example, we can now set the name and group for the allocation:

DataType()
	: TextureCoordinates(GROUP_RENDERING, "FMeshParticleVertexFactory::DataType::TextureCoordinates")
	, bInitialized(false)
{
}

Defining scopes.

As part of the “Memory allocation and tracking” post I mentioned the need to define scopes for the allocations in order to provide context. The scopes are not the same as callstacks (which is something already provided by MallocProfiler). Many allocations happen within the same callstack but refer to completely different UObjects. That is even more prevalent with the use of Blueprints. Because of that, it is very useful to have scopes that allow tracking of memory usage even within Blueprints.

To leverage the code already present in the engine I took the approach of reusing the FScopeCycleCounterUObject struct, which is used to define scopes related to UObjects for the stats system. The engine already places those scopes where necessary, and you can still place your own allocation-tracking-specific scopes by using the FMallocTrackerScope class. Also, to improve visibility, two scopes are created automatically for each FScopeCycleCounterUObject: a scope with the name of the UObject’s class, and a scope with the name of the UObject. That makes it easier to collapse the data per class name when we eventually create a tool to visualize the data. To get a better sense of the complexity let’s look at a single scope coming from the Elemental demo:
[Image: ScopeSample]
If we analyze the allocations under that scope we see the following:

Address             Thread Name  Group    Bytes  Name
0x0000000023156420  Main Thread  UObject     96  InterpGroupInst
0x00000000231cf000  Main Thread  Unknown     64  UnnamedTSet
0x0000000023168480  Main Thread  UObject     80  InterpTrackInstMove
0x0000000028ee8480  Main Thread  Unknown     64  UnnamedTSet
0x0000000022bc2420  Main Thread  Unknown     32  UnnamedTArray
0x00000000231563c0  Main Thread  UObject     96  InterpGroupInst
0x00000000231cefc0  Main Thread  Unknown     64  UnnamedTSet
0x0000000023168430  Main Thread  UObject     80  InterpTrackInstMove
0x00000000231cef80  Main Thread  Unknown     64  UnnamedTSet
0x0000000022bc2400  Main Thread  Unknown     32  UnnamedTArray
0x0000000023156360  Main Thread  UObject     96  InterpGroupInst
0x00000000231cef40  Main Thread  Unknown     64  UnnamedTSet
0x00000000231683e0  Main Thread  UObject     80  InterpTrackInstMove
0x0000000028ee8380  Main Thread  Unknown     64  UnnamedTSet
0x0000000022bc23e0  Main Thread  Unknown     32  UnnamedTArray
0x00000000231cef00  Main Thread  UObject     64  InterpTrackInstAnimControl
0x00000000231ceec0  Main Thread  UObject     64  InterpTrackInstVisibility

Those are just 17 allocations in the Play function of a Blueprint. The actual number of live allocations when I made the capture on the Elemental demo was 584454. The number of unique scopes is pretty high as well: 4175. And with that we are just talking about the 607MiBs allocated at the time of the capture even though the peak number of bytes allocated was 603MiBs. This goes to show the need for this kind of memory tracking.
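
The scope mechanism described in this section can be sketched with a per-thread stack and RAII objects. This is an illustration under assumptions, not the FMallocTrackerScope or FScopeCycleCounterUObject code: `GetScopeStack`, `FScopeSketch`, and `FObjectScopeSketch` are hypothetical, and only pointers to the names are stored, so the names must outlive the scopes.

```cpp
#include <cassert>
#include <vector>

// Per-thread stack of scope names. The bottom entry is always the global scope.
inline std::vector<const char*>& GetScopeStack() {
    thread_local std::vector<const char*> Stack{ "GlobalScope" };
    return Stack;
}

// Pushes a scope name on construction and pops it on destruction.
class FScopeSketch {
public:
    explicit FScopeSketch(const char* Name) {
        assert(Name != nullptr);
        GetScopeStack().push_back(Name);
    }
    ~FScopeSketch() { GetScopeStack().pop_back(); }
    FScopeSketch(const FScopeSketch&) = delete;
    FScopeSketch& operator=(const FScopeSketch&) = delete;
};

// Mirrors the two automatic scopes per object: class name first, then
// object name. Members are destroyed in reverse order, so the object
// scope is popped before the class scope.
class FObjectScopeSketch {
public:
    FObjectScopeSketch(const char* ClassName, const char* ObjectName)
        : ClassScope(ClassName), ObjectScope(ObjectName) {}

private:
    FScopeSketch ClassScope;
    FScopeSketch ObjectScope;
};
```

Any allocation made while such an object is alive can read the current stack to learn which class and object it belongs to.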

MallocTracker implementation

As I mentioned previously, MallocTracker was implemented in a similar fashion to the stomp allocator I made previously. The MallocTracker was made to be lightweight and to follow the performance requirements mentioned in “Memory allocation and tracking”.

The implementation is fast enough to be enabled by default without causing too much of an impact on performance, and with a fairly low overhead in terms of memory. For example, the Elemental demo showed a tracking overhead of ~30MiBs and under 2 ms on the CPU on a debug build, and even less on optimized builds. As usual there is a compromise between memory overhead and performance; those numbers are based on the approach I decided to take. There are other approaches which favor either performance or memory overhead, but I think I struck a reasonable balance.

To start analyzing the implementation let’s look at a concrete example. This is what happens when FMemory::Malloc() gets called:

  1. FMemory::Malloc() gets called requesting a certain number of bytes with a given name and group assigned to that allocation.
  2. FMemory::Malloc() calls FMallocTracker::Malloc() with the same arguments assuming that GMalloc points to an FMallocTracker instance.
  3. FMallocTracker::Malloc() allocates the actual memory by using the allocator that was passed in during FMallocTracker creation, in this case FMallocBinned.
  4. FMallocTracker::Malloc() modifies atomically some global allocation stats such as peak number of bytes allocated, peak number of allocations, etcetera.
  5. FMallocTracker::Malloc() gets the assigned PerThreadData instance for the current thread.
  6. FMallocTracker::Malloc() calls PerThreadData::AddAllocation to store the allocation’s data on this thread’s containers.
  7. FMallocTracker::Malloc() returns the pointer the underlying allocator returned in step 3.
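
Steps 3 to 7 can be sketched roughly as follows. This is a simplified stand-in, not the FMallocTracker code: `std::malloc` plays the role of the underlying allocator, `FPerThreadDataSketch` and `TrackedMalloc` are hypothetical, and the per-thread storage uses `std::vector`, which a real tracker could not do without care because the vector would itself allocate through the tracked path.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// What is remembered for each live allocation.
struct FAllocationRecord {
    void* Address;
    std::size_t Bytes;
    std::uint32_t GroupId;
    const char* Name; // pointer only, the string is not copied
};

// Storage owned by a single thread.
struct FPerThreadDataSketch {
    std::vector<FAllocationRecord> Allocations;
    void AddAllocation(const FAllocationRecord& Record) { Allocations.push_back(Record); }
};

// Step 5: fetch the data assigned to the calling thread.
inline FPerThreadDataSketch& GetPerThreadData() {
    thread_local FPerThreadDataSketch Data;
    return Data;
}

// Global stats shared by all threads.
static std::atomic<std::size_t> GAllocatedBytes{0};
static std::atomic<std::size_t> GNumAllocations{0};

void* TrackedMalloc(std::size_t Count, std::uint32_t GroupId, const char* Name) {
    assert(Name != nullptr);
    void* Ptr = std::malloc(Count);                                // step 3
    if (Ptr == nullptr) { return nullptr; }
    GAllocatedBytes.fetch_add(Count, std::memory_order_relaxed);   // step 4
    GNumAllocations.fetch_add(1, std::memory_order_relaxed);
    GetPerThreadData().AddAllocation({Ptr, Count, GroupId, Name}); // steps 5 and 6
    return Ptr;                                                    // step 7
}
```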

Global statistics.

Very few global stats are included. They are there to give you a very quick overview but nothing else. The global stats collected are:

  • Allocated Bytes. Number of bytes allocated when the data was dumped.
  • Number of allocations. Number of allocations when the data was dumped. A high number of allocations will usually cause high memory fragmentation.
  • Peak number of allocated bytes. The highest number of bytes allocated since MallocTracker was enabled.
  • Peak number of allocations. The highest number of living allocations since MallocTracker was enabled.
  • Overhead bytes. Number of bytes that are MallocTracker internal overhead.

All those stats are atomically updated since they are being accessed by all threads doing allocations.
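
One way to keep a peak value correct without a lock is a compare-and-swap loop. The sketch below shows that technique with `std::atomic`; `FGlobalStatsSketch` is a hypothetical name, and this is one possible approach, not necessarily how MallocTracker does it.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

struct FGlobalStatsSketch {
    std::atomic<std::size_t> AllocatedBytes{0};
    std::atomic<std::size_t> PeakAllocatedBytes{0};

    void OnAlloc(std::size_t Bytes) {
        // fetch_add returns the previous value, so add Bytes to get the new total.
        const std::size_t Now = AllocatedBytes.fetch_add(Bytes, std::memory_order_relaxed) + Bytes;

        // Raise the peak only while our total is still larger than the stored one.
        // On failure compare_exchange_weak reloads Peak with the current value.
        std::size_t Peak = PeakAllocatedBytes.load(std::memory_order_relaxed);
        while (Now > Peak &&
               !PeakAllocatedBytes.compare_exchange_weak(Peak, Now, std::memory_order_relaxed)) {
        }
    }

    void OnFree(std::size_t Bytes) {
        const std::size_t Before = AllocatedBytes.fetch_sub(Bytes, std::memory_order_relaxed);
        assert(Before >= Bytes); // freeing more than was allocated is a bookkeeping bug
        (void)Before;
    }
};
```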

Data per thread.

In order to improve performance and avoid contention for resources as multiple threads allocate and free memory, most of the work is done on a per-thread basis. All allocations and scope stacks are stored per thread. Every allocation has a relevant scope stack, and the uppermost scope is the GlobalScope. The same scope names usually show up in multiple scope stacks, so to minimize memory overhead all scope names for a thread are stored uniquely and referenced from the scope stacks. And since the same scope stack shows up in multiple allocations, scope stacks are stored uniquely as well. Let’s look at a concrete example, with scope names in blue and allocations in orange:
[Image: SampleAllocDiagram]
To store that data we would have three distinct arrays which aren’t shared among the different threads:

  • Unique scope names. This stores the unique scope names used in this thread. At least the GlobalScope is assured to be there. New scope names are stored as they are pushed onto the stack.
  • Unique scope stacks. This stores unique stacks as a dynamic array of fixed-size arrays in which indices to the relevant scope names are stored.
  • Allocations. Data for each allocation. This includes the address of the allocation, size in bytes, group and name assigned to the allocation, and the index of the unique scope stack.
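
The first two arrays can be sketched like this, assuming a fixed maximum stack depth and linear searches for brevity. `FPerThreadScopesSketch`, `MaxScopeDepth`, and `NoScope` are hypothetical; a real implementation would likely use faster lookups.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t MaxScopeDepth = 8;     // hypothetical fixed stack depth
constexpr std::uint16_t NoScope = 0xFFFF;    // marks unused slots in a stack
using FScopeStackSketch = std::array<std::uint16_t, MaxScopeDepth>;

struct FPerThreadScopesSketch {
    std::vector<const char*> UniqueNames;         // index 0 is always GlobalScope
    std::vector<FScopeStackSketch> UniqueStacks;  // each entry holds indices into UniqueNames

    FPerThreadScopesSketch() { UniqueNames.push_back("GlobalScope"); }

    // Returns the index of Name, storing it only if it was not seen before.
    std::uint16_t InternName(const char* Name) {
        for (std::size_t Index = 0; Index < UniqueNames.size(); ++Index) {
            if (std::strcmp(UniqueNames[Index], Name) == 0) {
                return static_cast<std::uint16_t>(Index);
            }
        }
        UniqueNames.push_back(Name);
        return static_cast<std::uint16_t>(UniqueNames.size() - 1);
    }

    // Returns the index of the stack, storing it only if it was not seen before.
    std::uint32_t InternStack(const std::vector<const char*>& Names) {
        assert(Names.size() <= MaxScopeDepth);
        FScopeStackSketch Stack;
        Stack.fill(NoScope);
        for (std::size_t Index = 0; Index < Names.size(); ++Index) {
            Stack[Index] = InternName(Names[Index]);
        }
        for (std::size_t Index = 0; Index < UniqueStacks.size(); ++Index) {
            if (UniqueStacks[Index] == Stack) {
                return static_cast<std::uint32_t>(Index);
            }
        }
        UniqueStacks.push_back(Stack);
        return static_cast<std::uint32_t>(UniqueStacks.size() - 1);
    }
};
```

Each allocation record then only needs to keep the returned stack index, so repeated scopes cost a few bytes per allocation rather than a copy of every name.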

If we refer to the previous graph we see that we have five allocations. Here is the data to store those five allocations:
[Image: SampleAllocData]

Reallocating and freeing memory.

Reallocations and frees make things a bit more complicated because it is rather common within the Unreal Engine to see allocations made on one thread and then reallocated or freed on a different thread. That means we can’t assume that we will find the relevant allocation in the per-thread data of the calling thread, which in turn means that we need to introduce some locking. In order to avoid contention on a single global lock, each per-thread data class has its own lock. On both reallocation and free, the per-thread data of the calling thread is checked first for the existing allocation. If it isn’t found there, the other per-thread data instances are searched, locking them one at a time in the process. This ensures that contention is reduced and the other threads can keep themselves busy as long as possible.
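
A minimal sketch of that lookup order, assuming threads are identified by an explicit index to keep the example deterministic: `FThreadBucketSketch` and `FTrackerRegistrySketch` are hypothetical, and only the address and size are recorded.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unordered_map>
#include <vector>

// Per-thread data with its own lock.
struct FThreadBucketSketch {
    std::mutex Lock;
    std::unordered_map<void*, std::size_t> Allocations; // address -> size in bytes

    void Add(void* Address, std::size_t Bytes) {
        std::lock_guard<std::mutex> Guard(Lock);
        Allocations[Address] = Bytes;
    }
    bool TryRemove(void* Address) {
        std::lock_guard<std::mutex> Guard(Lock);
        return Allocations.erase(Address) == 1;
    }
};

class FTrackerRegistrySketch {
public:
    explicit FTrackerRegistrySketch(std::size_t NumThreads) : Buckets(NumThreads) {}

    void OnMalloc(std::size_t ThreadIndex, void* Address, std::size_t Bytes) {
        assert(ThreadIndex < Buckets.size());
        Buckets[ThreadIndex].Add(Address, Bytes);
    }

    // Returns the index of the bucket that owned the allocation, or -1 if unknown.
    int OnFree(std::size_t ThreadIndex, void* Address) {
        assert(ThreadIndex < Buckets.size());
        // Common case: the allocation was made by the calling thread.
        if (Buckets[ThreadIndex].TryRemove(Address)) {
            return static_cast<int>(ThreadIndex);
        }
        // Otherwise search the other threads, holding only one lock at a time.
        for (std::size_t Index = 0; Index < Buckets.size(); ++Index) {
            if (Index != ThreadIndex && Buckets[Index].TryRemove(Address)) {
                return static_cast<int>(Index);
            }
        }
        return -1;
    }

private:
    std::vector<FThreadBucketSketch> Buckets;
};
```

Because no two locks are ever held at once, the search cannot deadlock against another thread doing the same search in a different order.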

Dealing with names.

In order to make the MallocTracker fast enough and acceptable in terms of memory consumption I had to put in place a restriction on any kind of naming, be it the name for the allocations or scopes. The restriction is that the lifetime of the memory where those names are stored must be the same or longer than the actual allocation or scope. The reason is that making any copy of that data greatly impacts performance and memory consumption, so only pointers are stored. While that may seem like a complex restriction to live with, I find it to be perfectly fine since you should know the lifetime of your own allocations. If you don’t know about the lifetime of your allocations in order to know for how long to keep those names alive then you have bigger problems to deal with.

Another implementation detail with respect to allocation and scope names is the need to deal with ANSI and wide character names. To make this more transparent, all those pointers are assumed to be ANSI unless the 63rd bit in the pointer is set, in which case the pointer is assumed to point to a wide character name. FMallocTracker provides a way to get a pointer with that bit set for wide char names, and to set it when necessary for FNames, which can be either wide or ANSI. When the data is written out, the names are decoded accordingly and dumped to file.
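
The pointer tagging can be sketched as below. The sketch assumes 64-bit pointers, assumes that “the 63rd bit” means the most significant bit (bit index 63), and assumes the platform never hands out user-space addresses with that bit set. The function names are hypothetical and are not the FMallocTracker API.

```cpp
#include <cassert>
#include <cstdint>

static_assert(sizeof(void*) == 8, "this sketch assumes 64-bit pointers");

// Assumed flag position: the most significant bit of the pointer.
constexpr std::uintptr_t WideNameBit = std::uintptr_t(1) << 63;

// ANSI names are stored as-is.
inline const void* TagAnsiName(const char* Name) { return Name; }

// Wide names are stored with the flag bit set.
inline const void* TagWideName(const wchar_t* Name) {
    const std::uintptr_t Bits = reinterpret_cast<std::uintptr_t>(Name);
    assert((Bits & WideNameBit) == 0); // the bit must be free to use as a flag
    return reinterpret_cast<const void*>(Bits | WideNameBit);
}

inline bool IsWideName(const void* Tagged) {
    return (reinterpret_cast<std::uintptr_t>(Tagged) & WideNameBit) != 0;
}

// Clears the flag bit to recover the original wide character pointer.
inline const wchar_t* GetWideName(const void* Tagged) {
    assert(IsWideName(Tagged));
    return reinterpret_cast<const wchar_t*>(reinterpret_cast<std::uintptr_t>(Tagged) & ~WideNameBit);
}

inline const char* GetAnsiName(const void* Tagged) {
    assert(!IsWideName(Tagged));
    return static_cast<const char*>(Tagged);
}
```

A tagged pointer must never be dereferenced directly; it has to go through one of the getters first.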

Conclusion.

Unfortunately I won’t be able to convey the true usefulness of this type of system until I make a tool to visualize the data properly, but you can take my word that it is really useful. Finding fragmentation issues and dealing with rampant memory usage is so much easier with this data. This is considerably better than what is already provided in the engine in terms of performance, memory consumption, and data quality. The next step would be to fully transition the engine to use tagged allocations, but that’s something that can be done as needed. It certainly doesn’t make sense to spend too much time tagging allocations when many of them are not problematic. Instead it is better to just tag the big allocations to get more insight into the specific issues. But even if you find tagging allocations too boring, you may still get useful data. Here is the data of the biggest allocations captured on the Elemental demo.
[Image: SampleFull]

Sample data and source code.

To make more sense out of this I also provide sample data. The sample data comes from running a modified version of the Elemental demo on a test build. You can download the data from here and view it with any text editor that supports big files.
You can also view the code by looking at the pull request I made for Epic Games to see if they will take this update. The pull request is available here.
