Memory allocation and tracking.

After that optimization interlude, let's go back to memory allocation and tracking. As I mentioned last time, I would like to go over the different requirements for the memory allocation and tracking facilities necessary to ship a AAA game without having a horrible time doing memory optimizations.

Solution Requirements

Most AAA games have a lot of resources that go in and out of system and video memory constantly. This is exacerbated in sandbox games, and less so in stadium games (such as NBA 2K15). In any case, the solution needs to be fast and provide proper data about every allocation that happens. A fast solution that doesn't provide proper tracking facilities won't do, and a solution with proper tracking facilities that is slow won't do either; both aspects are important in equal measure. At the same time, ALL allocations must go through it, which implies that neither client code nor third-party libraries may allocate memory on their own, and that the global new and delete operators should be overridden.

Tracking Information

The solution must provide relevant allocation tracking information. The information must range from something as general as how much memory is used overall, to something as specific as the address returned for a particular allocation, and everything in between. Every allocation must carry relevant tracking information, defined at allocation time, that programmers can use as a reference to detect issues.

General Information

The general information provided should be extremely simple. It should cover the following items (a minimal sketch of a struct holding them follows the list):

  • Allocated bytes.
  • Number of allocations.
  • Peak allocated bytes.
  • Peak number of allocations.
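
A minimal sketch, in Unreal-style C++, of a struct holding those four counters (the struct and method names are hypothetical; FMath::Max is the engine's own utility):

struct FGeneralAllocationStats
{
	uint64 AllocatedBytes      = 0; // bytes currently allocated
	uint64 AllocationCount     = 0; // number of live allocations
	uint64 PeakAllocatedBytes  = 0; // high-water mark for allocated bytes
	uint64 PeakAllocationCount = 0; // high-water mark for live allocations

	void OnAlloc(uint64 Bytes)
	{
		AllocatedBytes += Bytes;
		++AllocationCount;
		PeakAllocatedBytes = FMath::Max(PeakAllocatedBytes, AllocatedBytes);
		PeakAllocationCount = FMath::Max(PeakAllocationCount, AllocationCount);
	}

	void OnFree(uint64 Bytes)
	{
		AllocatedBytes -= Bytes;
		--AllocationCount;
	}
};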

Allocations Grouping

Just like there are different groups/teams focused on different areas of a game, allocations should also be grouped. Some of the groups are rendering, gameplay, UI, audio, etc. The different groups have different memory allocation patterns and requirements. As such, it is a good idea to group allocations along those lines because it provides the following benefits (a small configuration sketch follows the list):

  • Optimal allocation setup. Not all groups have the same allocation needs, so it is better to allow an allocation setup per group. That means, for example, that not all groups need mutexed allocators, not all groups may use the same small block allocator, etc.
  • Budget tracking and enforcement. It is essential that each group has its own amount of RAM that it can keep track of, assigned by systems programmers in agreement with the different groups. Basically this guarantees that every group gets its fair share and that everybody keeps memory usage under control.
  • Easier to detect corruption issues. Since every allocation has a group assigned with a proper allocation setup, it isn't that hard to solve corruption issues or issues during allocation. The grouping provides good initial context.
  • Better performance. Since mutexes are not required in all groups or allocators, that cost can be avoided. In groups that do require mutexes there is also less contention, since there isn't a single mutexed allocator (such as the global allocator) to lock. It is also possible to make trade-offs between absolute performance and peak memory usage when defining how memory is allocated.
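
As a small configuration sketch, here is one way the groups and their per-group setup could be expressed (all names here are hypothetical; a real project would define its own set):

// Hypothetical allocation groups.
enum class EAllocGroup : uint8
{
	Rendering,
	Gameplay,
	UI,
	Audio,
	Count
};

// Per-group setup: a memory budget in bytes and whether allocations from
// this group can come from multiple threads (and therefore need a mutex).
struct FGroupConfig
{
	uint64 BudgetBytes;
	bool   bThreadSafe;
};

static const FGroupConfig GGroupConfigs[(int32)EAllocGroup::Count] =
{
	{ 512 * 1024 * 1024, true  }, // Rendering: big budget, allocates from several threads
	{ 256 * 1024 * 1024, true  }, // Gameplay
	{  32 * 1024 * 1024, false }, // UI: single-threaded, no mutex needed
	{  64 * 1024 * 1024, true  }, // Audio
};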

Allocation Naming

All allocations should be "named" in order to have a proper way to recognize the different allocations. What the name implies is up to whoever requests the memory, and you may enforce naming conventions, but the tag should be available in order to track memory allocations. For the sake of performance, these tags should only be available in non-shipping builds.
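
A minimal sketch of how the tags could be passed in while compiling to nothing on shipping builds (GTrackedMalloc and TRACKED_ALLOC are hypothetical; UE_BUILD_SHIPPING is the engine's own define):

#if UE_BUILD_SHIPPING
	#define TRACKED_ALLOC(Bytes, Group, Name) GTrackedMalloc.Alloc(Bytes, Group)
#else
	#define TRACKED_ALLOC(Bytes, Group, Name) GTrackedMalloc.Alloc(Bytes, Group, Name)
#endif

// Usage (inside some function): the tag survives only on non-shipping builds.
void* Buffer = TRACKED_ALLOC(4096, EAllocGroup::Rendering, "NavMeshTileCache");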

Allocation Scoping

The solution must allow stacking scopes per thread to provide more context to allocations. This provides better context than simply using the callstack of the allocation, and it is far cheaper than grabbing the callstack. In Unreal, an example might be a scope created during UObject creation so that all allocations happening for that UObject are nested under that scope. All non-scoped allocations would still belong to a global scope. Here is an example of a scope stack with the allocation as a leaf node and its relevant data:

Main Thread												Pointer				Bytes	Group
	Global Scope
		UGameEngine::Init
			/Game/Maps/LandscapeMap.umap
				AddToWorld
					PersistentLevel
						ConstructObject
							FPhysXAllocator::allocate	0x000000000b093fe0	131720	Physics
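
A minimal sketch of how such scopes could be stacked, assuming a hypothetical per-thread FAllocScopeStack (PREPROCESSOR_JOIN and TEXT are the engine's own macros; the AddToWorld body is illustrative):

// RAII helper: pushes a named scope onto the calling thread's scope stack on
// construction and pops it on destruction, so every allocation made in
// between is attributed to the full scope path.
class FAllocScope
{
public:
	explicit FAllocScope(const TCHAR* InName)
	{
		FAllocScopeStack::Get().Push(InName);
	}

	~FAllocScope()
	{
		FAllocScopeStack::Get().Pop();
	}
};

#define ALLOC_SCOPE(Name) FAllocScope PREPROCESSOR_JOIN(AllocScope_, __LINE__)(TEXT(Name))

// Usage: allocations made while the scope object is alive nest under "AddToWorld".
void UWorld::AddToWorld(/* ... */)
{
	ALLOC_SCOPE("AddToWorld");
	// ... allocations here are tagged with the current scope stack ...
}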

Allocation flagging

Allocations may provide optional flags whose exact meaning depends on the allocator used. Some example flags (a small sketch follows the list):

  • Lifetime flags. Provide hints as to how long the allocation is expected to live. This allows the allocator to be more clever when allocating memory in order to reduce memory fragmentation.
  • Clear allocation flag. Lets the allocator know that you want the allocated memory cleared to zero before it is returned.
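
A small sketch of how the flags might look as a bitmask (all names are hypothetical, reusing GTrackedMalloc from the naming example above):

enum EAllocFlags
{
	ALLOCFLAG_None       = 0,
	ALLOCFLAG_ShortLived = 1 << 0, // hint: expected to be freed within a few frames
	ALLOCFLAG_LongLived  = 1 << 1, // hint: expected to live until level unload
	ALLOCFLAG_ZeroMemory = 1 << 2, // clear the returned memory to zero
};

// Usage: zeroed memory that the allocator may place in a long-lived region.
void* Ptr = GTrackedMalloc.Alloc(4096, EAllocGroup::Audio, ALLOCFLAG_LongLived | ALLOCFLAG_ZeroMemory);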

Performance

The solution must provide acceptable performance even in non-shipping builds with the tracking features enabled. By acceptable I mean that the frame time should never go above 50ms per frame with tracking enabled. When it goes above that threshold, people start trying to avoid using the tracking features, and that's a slippery slope you will have to recover from at the worst possible time: shipping time. Also, the performance hit and general overhead on shipping builds should be slim to none.

Multiple Allocators per Group

To implement the best possible allocation scheme without exposing the complexity to client code, it makes sense to define multiple allocators per group. Those allocators are tried in order, and whichever allocator succeeds at allocating the memory services the request. For example, under a normal group three allocators could be added:

  • Static Small Block Allocator (SSBA). This could be a static small block allocator that doesn’t grow and accepts allocations up to 256 bytes.
  • Dynamic Small Block Allocator (DSBA). This could be a dynamic small block allocator that grows as necessary and accepts allocations up to 1024 bytes.
  • Standard Allocator (SA). Standard OS allocator that allocates any size allocation.

So a request for 1032 bytes would try the SSBA first, then the DSBA, and finally it would request the memory from the SA. It is also perfectly fine to have just one allocator if the allocator you use provides the functionality you need. For example, you may use jemalloc, which already handles different allocation sizes by itself with the proper locking mechanisms.
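
A minimal sketch of that fallback chain, assuming a hypothetical IAllocator interface whose Alloc() returns nullptr when it cannot service a request:

void* FGroupAllocator::Alloc(SIZE_T Bytes)
{
	// Allocators were registered in priority order, e.g. SSBA, DSBA, then SA.
	for (int32 AllocatorIndex = 0; AllocatorIndex < AllocatorChain.Num(); ++AllocatorIndex)
	{
		void* Ptr = AllocatorChain[AllocatorIndex]->Alloc(Bytes);
		if (Ptr != nullptr)
		{
			return Ptr;
		}
	}
	return nullptr; // every allocator in the chain refused the request
}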

Conclusion

With all this data we ought to be able to create the proper facilities for memory allocation and tracking. Next time we will delve into the API that we would eventually implement. If you think I forgot an item, please don't hesitate to comment.

Optimizing sorting of base passes in Unreal Engine 4.

I know I said I was going to dig into the memory tracking solution, but I realized that I might as well do an optimization first since people seem to find them very useful. So I decided to profile and optimize the Landscape Mountains map, which you can download from the Unreal Engine Marketplace. If you haven't seen it, here is a video:

[Video: Landscape Mountains]

I decided to profile 900 frames starting from the 100th frame, since I didn't want to profile the initial streaming of the level. So here is the basic thread view.

[Image: thread profile overview]

The top thread is doing much more work than the rest; that thread is the render thread. It was obvious that to improve performance I would have to optimize the work happening on the rendering thread. So let's look at that work.

[Image: render thread profile]

The highlighted functions caught my attention because they accounted for almost 8% of the time spent on the render thread, and sorting is usually a self-contained problem to solve. So let's dig into the actual code execution.

[Image: InitViews callstack]

One of the first calls made each frame to render the scene is FDeferredShadingSceneRenderer::InitViews(), which is in charge of initializing the scene's views by checking visibility, sorting elements for the different passes, initializing the dynamic shadows, etc. The sorting of the static meshes with the respective drawing policy kicks off the calls to the functions we saw in the profiling data. Let's look at that function:

void FDeferredShadingSceneRenderer::SortBasePassStaticData(FVector ViewPosition)
{
	// If we're not using a depth only pass, sort the static draw list buckets roughly front to back, to maximize HiZ culling
	// Note that this is only a very rough sort, since it does not interfere with state sorting, and each list is sorted separately
	if (EarlyZPassMode == DDM_None)
	{
		SCOPE_CYCLE_COUNTER(STAT_SortStaticDrawLists);

		for (int32 DrawType = 0; DrawType < FScene::EBasePass_MAX; DrawType++)
		{
			Scene->BasePassNoLightMapDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassSimpleDynamicLightingDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassCachedVolumeIndirectLightingDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassCachedPointIndirectLightingDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassHighQualityLightMapDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassDistanceFieldShadowMapLightMapDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassLowQualityLightMapDrawList[DrawType].SortFrontToBack(ViewPosition);
		}
	}
}

That's extremely simple to understand. It just sorts the different draw lists one after the other, twice: once for the default base pass and once for the masked base pass. The first thing that becomes obvious is that we could sort those draw lists asynchronously on different threads and get some other work going while that's being done. But in order to do that, we need to know that the sort is self-contained and doesn't touch global state. To find out we have to dig deeper into the callstack. Let's look at the code of the next function, TStaticMeshDrawList::SortFrontToBack():

template<typename DrawingPolicyType>
void TStaticMeshDrawList<DrawingPolicyType>::SortFrontToBack(FVector ViewPosition)
{
	// Cache policy link bounds
	for (typename TDrawingPolicySet::TIterator DrawingPolicyIt(DrawingPolicySet); DrawingPolicyIt; ++DrawingPolicyIt)
	{
		FBoxSphereBounds AccumulatedBounds(ForceInit);

		FDrawingPolicyLink& DrawingPolicyLink = *DrawingPolicyIt;
		for (int32 ElementIndex = 0; ElementIndex < DrawingPolicyLink.Elements.Num(); ElementIndex++)
		{
			FElement& Element = DrawingPolicyLink.Elements[ElementIndex];

			if (ElementIndex == 0)
			{
				AccumulatedBounds = Element.Bounds;
			}
			else
			{
				AccumulatedBounds = AccumulatedBounds + Element.Bounds;
			}
		}
		DrawingPolicyIt->CachedBoundingSphere = AccumulatedBounds.GetSphere();
	}

	SortViewPosition = ViewPosition;
	SortDrawingPolicySet = &DrawingPolicySet;

	OrderedDrawingPolicies.Sort( TCompareStaticMeshDrawList<DrawingPolicyType>() );
}

In this piece of code there are a couple of things that could be changed. First of all, you can see that inside the inner loop AccumulatedBounds is only assigned directly on the first iteration, yet the ElementIndex == 0 check is evaluated on every iteration just to detect that first pass. Now, you could say that the CPU's branch predictor would figure out that ElementIndex won't be 0 in the following iterations, but it is better to write code without relying too much on branch predictors since they differ considerably from one CPU to another. Let's make that change first, plus a couple of minor ones:

template<typename DrawingPolicyType>
void TStaticMeshDrawList<DrawingPolicyType>::SortFrontToBack(FVector ViewPosition)
{
	// Cache policy link bounds
	for (typename TDrawingPolicySet::TIterator DrawingPolicyIt(DrawingPolicySet); DrawingPolicyIt; ++DrawingPolicyIt)
	{
		FBoxSphereBounds AccumulatedBounds(ForceInit);

		FDrawingPolicyLink& DrawingPolicyLink = *DrawingPolicyIt;

		const int32 NumElements = DrawingPolicyLink.Elements.Num();
		if (NumElements > 0)
		{
			AccumulatedBounds = DrawingPolicyLink.Elements[0].Bounds;
		}

		for (int32 ElementIndex = 1; ElementIndex < NumElements; ElementIndex++)
		{
			AccumulatedBounds = AccumulatedBounds + DrawingPolicyLink.Elements[ElementIndex].Bounds;
		}

		DrawingPolicyIt->CachedBoundingSphere = AccumulatedBounds.GetSphere();
	}

	SortViewPosition = ViewPosition;
	SortDrawingPolicySet = &DrawingPolicySet;

	OrderedDrawingPolicies.Sort( TCompareStaticMeshDrawList<DrawingPolicyType>() );
}

Now the actual issue with doing multiple draw lists sorts concurrently with this code is that there are two static variables in TStaticMeshDrawList:

/**
 * Static variables for getting data into the Compare function.
 * Ideally Sort would accept a non-static member function which would avoid having to go through globals.
 */
static TDrawingPolicySet* SortDrawingPolicySet;
static FVector SortViewPosition;

Now this is a case where code comments get outdated because it is possible to send data to the Compare function without using globals and without using thread-local storage or anything similar. Since the assumption was that sending data to the sort function wasn’t possible, here is how the predicate class was written:

/** Helper struct for sorting */
template<typename DrawingPolicyType>
struct TCompareStaticMeshDrawList
{
	FORCEINLINE bool operator()( const FSetElementId& A, const FSetElementId& B ) const
	{
		// Use static Compare from TStaticMeshDrawList
		return TStaticMeshDrawList<DrawingPolicyType>::Compare( A, B ) < 0;
	}
};

Now that we know that, let's see how those two static variables are used:

template<typename DrawingPolicyType>
int32 TStaticMeshDrawList<DrawingPolicyType>::Compare(FSetElementId A, FSetElementId B)
{
	const FSphere& BoundsA = (*SortDrawingPolicySet)[A].CachedBoundingSphere;
	const FSphere& BoundsB = (*SortDrawingPolicySet)[B].CachedBoundingSphere;

	// Assume state buckets with large bounds are background geometry
	if (BoundsA.W >= HALF_WORLD_MAX / 2 && BoundsB.W < HALF_WORLD_MAX / 2)
	{
		return 1;
	}
	else if (BoundsB.W >= HALF_WORLD_MAX / 2 && BoundsA.W < HALF_WORLD_MAX / 2)
	{
		return -1;
	}
	else
	{
		const float DistanceASquared = (BoundsA.Center - SortViewPosition).SizeSquared();
		const float DistanceBSquared = (BoundsB.Center - SortViewPosition).SizeSquared();
		// Sort front to back
		return DistanceASquared > DistanceBSquared ? 1 : -1;
	}
}

So it is clear that the first thing we need to make this asynchronous is to get rid of SortDrawingPolicySet and SortViewPosition. First, the predicate class must be rewritten to take the parameters that used to be in the static variables:

/** Helper struct for sorting */
template<typename DrawingPolicyType>
struct TCompareStaticMeshDrawList
{
private:
	const typename TStaticMeshDrawList<DrawingPolicyType>::TDrawingPolicySet * const SortDrawingPolicySet;
	const FVector SortViewPosition;

public:
	TCompareStaticMeshDrawList(const typename TStaticMeshDrawList<DrawingPolicyType>::TDrawingPolicySet * const InSortDrawingPolicySet, const FVector InSortViewPosition)
		: SortDrawingPolicySet(InSortDrawingPolicySet)
		, SortViewPosition(InSortViewPosition)
	{
	}

	FORCEINLINE bool operator()( const FSetElementId& A, const FSetElementId& B ) const
	{
		// Use static Compare from TStaticMeshDrawList
		return TStaticMeshDrawList<DrawingPolicyType>::Compare(A, B, SortDrawingPolicySet, SortViewPosition) < 0;
	}
};

And replace the Compare implementation to fit the new changes:

template<typename DrawingPolicyType>
int32 TStaticMeshDrawList<DrawingPolicyType>::Compare(FSetElementId A, FSetElementId B, const TDrawingPolicySet * const InSortDrawingPolicySet, const FVector InSortViewPosition)
{
	const FSphere& BoundsA = (*InSortDrawingPolicySet)[A].CachedBoundingSphere;
	const FSphere& BoundsB = (*InSortDrawingPolicySet)[B].CachedBoundingSphere;

	// Assume state buckets with large bounds are background geometry
	if (BoundsA.W >= HALF_WORLD_MAX / 2 && BoundsB.W < HALF_WORLD_MAX / 2)
	{
		return 1;
	}
	else if (BoundsB.W >= HALF_WORLD_MAX / 2 && BoundsA.W < HALF_WORLD_MAX / 2)
	{
		return -1;
	}
	else
	{
		const float DistanceASquared = (BoundsA.Center - InSortViewPosition).SizeSquared();
		const float DistanceBSquared = (BoundsB.Center - InSortViewPosition).SizeSquared();
		// Sort front to back
		return DistanceASquared > DistanceBSquared ? 1 : -1;
	}
}

Since the static variables are gone we need to modify TStaticMeshDrawList::SortFrontToBack():

template<typename DrawingPolicyType>
void TStaticMeshDrawList<DrawingPolicyType>::SortFrontToBack(FVector ViewPosition)
{
	// Cache policy link bounds
	for (typename TDrawingPolicySet::TIterator DrawingPolicyIt(DrawingPolicySet); DrawingPolicyIt; ++DrawingPolicyIt)
	{
		FBoxSphereBounds AccumulatedBounds(ForceInit);

		FDrawingPolicyLink& DrawingPolicyLink = *DrawingPolicyIt;

		const int32 NumElements = DrawingPolicyLink.Elements.Num();
		if (NumElements > 0)
		{
			AccumulatedBounds = DrawingPolicyLink.Elements[0].Bounds;
		}

		for (int32 ElementIndex = 1; ElementIndex < NumElements; ElementIndex++)
		{
			AccumulatedBounds = AccumulatedBounds + DrawingPolicyLink.Elements[ElementIndex].Bounds;
		}

		DrawingPolicyIt->CachedBoundingSphere = AccumulatedBounds.GetSphere();
	}

	OrderedDrawingPolicies.Sort(TCompareStaticMeshDrawList<DrawingPolicyType>(&DrawingPolicySet, ViewPosition));
}

Now we should be able to sort the draw lists asynchronously. To do that we will need to replace the code in FDeferredShadingSceneRenderer::SortBasePassStaticData() with something that kicks off asynchronous calls to TStaticMeshDrawList::SortFrontToBack(), and it will have to return some kind of container that can be used to determine whether those tasks have finished. That way we can issue the tasks and avoid waiting for them on the rendering thread unless we reach a point in the code where we need them done (and hopefully by then they are done already). Unfortunately, Unreal Engine 4.6 doesn't support futures and other features that would make it easy to launch these functions asynchronously; it is my understanding that those will come in Unreal Engine 4.8. Meanwhile, we will have to write a standard task. Let's do that.

For the sorting task, we will sort a single draw list per task, and whoever needs to sort multiple draw lists will create multiple tasks. The task only needs two parameters: the draw list to sort and the view position to sort from. The only purpose of the task is to call the SortFrontToBack() function of the draw list with the view position as a parameter, nothing else. Let's implement that:

template<typename StaticMeshDrawList>
class FSortFrontToBackTask
{
private:
	StaticMeshDrawList * const StaticMeshDrawListToSort;
	const FVector ViewPosition;

public:
	FSortFrontToBackTask(StaticMeshDrawList * const InStaticMeshDrawListToSort, const FVector InViewPosition)
		: StaticMeshDrawListToSort(InStaticMeshDrawListToSort)
		, ViewPosition(InViewPosition)
	{

	}

	FORCEINLINE TStatId GetStatId() const
	{
		RETURN_QUICK_DECLARE_CYCLE_STAT(FSortFrontToBackTask, STATGROUP_TaskGraphTasks);
	}

	ENamedThreads::Type GetDesiredThread()
	{
		return ENamedThreads::AnyThread;
	}

	static ESubsequentsMode::Type GetSubsequentsMode() { return ESubsequentsMode::TrackSubsequents; }

	void DoTask(ENamedThreads::Type CurrentThread, const FGraphEventRef& MyCompletionGraphEvent)
	{
		StaticMeshDrawListToSort->SortFrontToBack(ViewPosition);
	}
};

That's a bit of boilerplate code that won't be necessary once futures are supported in Unreal Engine 4.8. Anyway, now we need a function to dispatch those tasks. Since I like to keep the single-threaded and multithreaded implementations available for debugging, I decided to write a new function called FDeferredShadingSceneRenderer::AsyncSortBasePassStaticData(). Let's implement that:

void FDeferredShadingSceneRenderer::AsyncSortBasePassStaticData(const FVector &InViewPosition, FGraphEventArray &OutSortEvents)
{
	// If we're not using a depth only pass, sort the static draw list buckets roughly front to back, to maximize HiZ culling
	// Note that this is only a very rough sort, since it does not interfere with state sorting, and each list is sorted separately
	if (EarlyZPassMode == DDM_None)
	{
		for (int32 DrawType = 0; DrawType < FScene::EBasePass_MAX; DrawType++)
		{
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<FNoLightMapPolicy> > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassNoLightMapDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<FSimpleDynamicLightingPolicy> > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassSimpleDynamicLightingDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<FCachedVolumeIndirectLightingPolicy> > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassCachedVolumeIndirectLightingDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<FCachedPointIndirectLightingPolicy> > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassCachedPointIndirectLightingDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<TLightMapPolicy<HQ_LIGHTMAP> > > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassHighQualityLightMapDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<TDistanceFieldShadowsAndLightMapPolicy<HQ_LIGHTMAP> > > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassDistanceFieldShadowMapLightMapDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<TLightMapPolicy<LQ_LIGHTMAP> > > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassLowQualityLightMapDrawList[DrawType]), InViewPosition));
		}
	}
}

With that we have dispatched all the sorting tasks and OutSortEvents can be used to determine if all the tasks are done. So let’s go back to FDeferredShadingSceneRenderer::InitViews() and implement the support for asynchronous sorting:

#if PZ_ASYNC_FRONT_TO_BACK_SORT
	FGraphEventArray SortEvents;
	AsyncSortBasePassStaticData(AverageViewPosition, SortEvents);
#else
	SortBasePassStaticData(AverageViewPosition);
#endif // PZ_ASYNC_FRONT_TO_BACK_SORT

That way we get to keep both approaches should you have to debug an issue with either. At this point the sort tasks have been launched, but we can't assume they will be done before we leave the function, so let's check for that and make sure they are:

#if PZ_ASYNC_FRONT_TO_BACK_SORT
	if (SortEvents.Num())
	{
		FTaskGraphInterface::Get().WaitUntilTasksComplete(SortEvents, ENamedThreads::RenderThread);
	}
#endif // PZ_ASYNC_FRONT_TO_BACK_SORT

In that way we halt execution on the rendering thread until the tasks are done, but hopefully they should be done by then.

So now the question is, did this change actually improve performance? Let’s compare the basic data capturing the same frames as before:

[Image: before/after frame summary comparison] All positive numbers show performance improvements. Before-optimization numbers are on the left, after-optimization numbers on the right. Blue bars in the histogram show the frames before the optimization, orange bars show the frames after.

With VTune configured to show frames that took more than 20ms as slow, and less than 11ms as fast, it is clearly visible that we have more frames with good or fast performance and fewer frames with slow performance. Overall, all frames moved to the right in the histogram, showing that this change definitely improved performance. Let's look at the work on the different threads:

[Image: thread view before optimization]

[Image: thread view after optimization]

The render thread is highlighted in both profiles. It is visible that the work on the rendering thread was reduced considerably, while the work on the threads that consume tasks increased. Shifting the work from the render thread to the task threads, without spending time waiting for those tasks to finish, is what improved performance.

Going wide isn't always the answer, especially when the code you need to optimize was written to be single-threaded. But as programmers we need to get better at doing things asynchronously, and to design for multithreading from the start, because this is a clear case where it is beneficial to go wide.

If you would like to see the whole code change and have access to the Unreal Engine 4 github then you can check the whole change here: https://github.com/EpicGames/UnrealEngine/pull/820

Limitations of memory tracking features in Unreal Engine 4.

Since June of last year I have been working as a contractor on a AAA game based on Unreal Engine 4. One of the big issues on most AAA titles, and one that certainly bit me on this project, is memory consumption. It is a tough challenge to provide the best experience to the end user while staying within the constraints of the available memory, and even more so when working on a sandbox game, which is my case. While Unreal Engine 4 does provide some facilities for tracking memory, I have come to see that they are not up to the task when it comes to big sandbox games. But first, let's dig into the two means provided by the engine: memory report commands and MallocProfiler.

Notes: For this post I will assume that you have read the blog post by Ben Zeigler called “Debugging and Optimizing Memory” on the official Unreal Engine blog. Also, this article is based on what’s available in Unreal Engine 4.6.

Memory report commands

The memory report commands are a bunch of different console commands that let you see memory usage in general. They are all bundled together in a single, very convenient command: "Memreport -full". Behind the scenes it executes the following commands:

Mem FromReport
obj list -alphasort
rhi.DumpMemory
LogOutStatLevels
ListSpawnedActors
DumpParticleMem
ConfigMem
r.DumpRenderTargetPoolMemory
ListTextures -alphasort
ListSounds -alphasort
ListParticleSystems -alphasort
obj list class=SoundWave -alphasort
obj list class=SkeletalMesh -alphasort
obj list class=StaticMesh -alphasort
obj list class=Level -alphasort

All these commands rely on three mechanisms for gathering data: global allocator stats, memory tracking stat defines, and each object's UObject::GetResourceSize().

Global allocator stats

LogMemory: Platform Memory Stats for WindowsNoEditor
LogMemory: Process Physical Memory: 722.09 MB used, 945.73 MB peak
LogMemory: Process Virtual Memory: 732.14 MB used, 1379.78 MB peak
LogMemory: Physical Memory: 9624.43 MB used, 24551.27 MB total
LogMemory: Virtual Memory: 1026.31 MB used, 8388608.00 MB total
LogMemory:
LogMemory: Memory Stats:
LogMemory: FMemStack (gamethread) current size = 0.00 MB
LogMemory: FPageAllocator (all threads) allocation size [used/ unused] = [0.00 / 0.56] MB
LogMemory: Nametable memory usage = 1.54 MB

This is the most basic information. The depth of the data provided is very limited, but it is the first thing to look at when assessing memory issues. The first four lines are data that comes from GlobalMemoryStatusEx and GetProcessMemoryInfo, which are completely generic; it's data you can already see in the Task Manager.

The last three lines are specific bits of memory which usually don't amount to much, but they are still useful to see. The FMemStack size is the memory used by the linear stack allocator. That allocator is a singleton and is used for multiple things, such as storing data for rendering composition passes, shadow information, etc.

The second line refers to the allocation size of the page allocator. This allocator also stores its data internally, split into normal-page-size allocations and small-page-size allocations; the statistics provided include both. FMemStack uses the FPageAllocator, so the two are related.

The last entry just shows the size of the table that stores all the FName entries. Since FNames are used throughout the engine, this table can be a sizable piece of memory, especially in sandbox games with lots of different UObjects.

Memory tracking stat defines

The Unreal stat system provides the following macros to create stat entries for memory tracking (a short usage sketch follows the list):

#define DECLARE_MEMORY_STAT(CounterName,StatId,GroupId)
#define DECLARE_MEMORY_STAT_POOL(CounterName,StatId,GroupId,Pool)
#define DECLARE_MEMORY_STAT_EXTERN(CounterName,StatId,GroupId, API)
#define DECLARE_MEMORY_STAT_POOL_EXTERN(CounterName,StatId,GroupId,Pool, API)

#define INC_MEMORY_STAT_BY(StatId,Amount)
#define DEC_MEMORY_STAT_BY(StatId,Amount)
#define SET_MEMORY_STAT(StatId,Value)

#define INC_MEMORY_STAT_BY_FName(Stat, Amount)
#define DEC_MEMORY_STAT_BY_FName(Stat,Amount)
#define SET_MEMORY_STAT_FName(Stat,Value)
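
A minimal usage sketch based on the signatures above; the stat entry, the stat group, and the surrounding class are made up for illustration:

DECLARE_MEMORY_STAT(TEXT("Path Cache Memory"), STAT_PathCacheMemory, STATGROUP_Memory);

void FPathCache::Resize(int32 NewNodeCount)
{
	// Remove the old size from the stat before the container reallocates...
	DEC_MEMORY_STAT_BY(STAT_PathCacheMemory, Nodes.GetAllocatedSize());
	Nodes.SetNum(NewNodeCount);
	// ...and add the new size afterwards. Forgetting either half silently
	// corrupts the stat, which is exactly the weakness described below.
	INC_MEMORY_STAT_BY(STAT_PathCacheMemory, Nodes.GetAllocatedSize());
}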

These macros allow you to define, set, increase, and decrease each memory tracking stat entry. But the weakness of this approach is that each stat needs to be increased and decreased correctly, in a place that is actually near where the memory is allocated or freed. For example, let's look at one place in the engine:

void FPrimitiveSceneInfo::RemoveFromScene(bool bUpdateStaticDrawLists)
{
	check(IsInRenderingThread());

	// implicit linked list. The destruction will update this "head" pointer to the next item in the list.
	while(LightList)
	{
		FLightPrimitiveInteraction::Destroy(LightList);
	}

	// Remove the primitive from the octree.
	check(OctreeId.IsValidId());
	check(Scene->PrimitiveOctree.GetElementById(OctreeId).PrimitiveSceneInfo == this);
	Scene->PrimitiveOctree.RemoveElement(OctreeId);
	OctreeId = FOctreeElementId();

	IndirectLightingCacheAllocation = NULL;

	DEC_MEMORY_STAT_BY(STAT_PrimitiveInfoMemory, sizeof(*this) + StaticMeshes.GetAllocatedSize() + Proxy->GetMemoryFootprint());

	if (bUpdateStaticDrawLists)
	{
		RemoveStaticMeshes();
	}
}

In that piece of code we can see how STAT_PrimitiveInfoMemory is decreased, but not anywhere near where the actual memory was freed. The memory was freed inside the allocator defined for the StaticMeshes array and the scene proxy, and all of that was triggered by removing an element from the octree. If someone changes the memory usage of the octree, then this stat will reflect the wrong memory consumption, which leads to wrong decisions when optimizing memory usage. The same happens if FPrimitiveSceneInfo changes, especially when new containers are added to the class.

The process of keeping allocation data up to date by means of stat defines is very error prone. The data gets out of date, it gets written wrong, and it doesn't actually track memory, just estimates of what the programmer thinks could consume memory. The last mechanism, the use of UObject::GetResourceSize(), has the same issues.

Memory tracking with UObject::GetResourceSize()

Objects:

                                         Class  Count  NumKBytes  MaxKBytes  ResKBytes ExclusiveResKBytes
                            AIPerceptionSystem      1          0K          0K          0K          0K
                                AISense_Damage      1          0K          0K          0K          0K
                               AISense_Hearing      1          0K          0K          0K          0K
                            AISense_Prediction      1          0K          0K          0K          0K
                                 AISense_Sight      1          0K          0K          0K          0K
                                  AISense_Team      1          0K          0K          0K          0K
                                 AISense_Touch      1          0K          0K          0K          0K
                                      AISystem      1          0K          0K          0K          0K
                                  AmbientSound     21         15K         15K          0K          0K
               AnimNotify_PlayParticleEffect_C     11          1K          1K          0K          0K
                                  AnimSequence     78      53453K      53453K      53333K      53333K
                        AnimSingleNodeInstance     85        241K        242K          0K          0K
                                 ArrayProperty    729         85K         85K          0K          0K
                           AssetObjectProperty      2          0K          0K          0K          0K

In the ideal situation this function provides the memory usage of a specific UObject. The function is defined this way:

/**
 * Returns the size of the object/ resource for display to artists/ LDs in the Editor. The
 * default behavior is to return 0 which indicates that the resource shouldn't display its
 * size which is used to not confuse people by displaying small sizes for e.g. objects like
 * materials
 *
 * @param	Type	Indicates which resource size should be returned
 * @return	Size of resource as to be displayed to artists/ LDs in the Editor.
 */
virtual SIZE_T GetResourceSize(EResourceSizeMode::Type Mode)
{
	return 0;
}

It should just take a call to the specific UObject::GetResourceSize() and we should get the data, which is extremely useful data to have. It tries to answer questions such as "which UObjects do I need to optimize in the scene?", "which UObject-based classes do I need to optimize?", and others of that nature. But again, this is only as good as the implementation written for it. This is another mechanism that gets outdated, that is implemented wrong (for example, by assuming the data is stored within an object and not just pointers to the data), or that is just empty. For example, one may ask "which skeletal mesh in my scene should I optimize?" Let's look at the implementation in the engine as of Unreal Engine 4.6:

SIZE_T USkeletalMesh::GetResourceSize(EResourceSizeMode::Type Mode)
{
	return 0;
}

So that's not good: this mechanism is either useless or outdated. To fix it I had to write something like this:

SIZE_T USkeletalMesh::GetResourceSize(EResourceSizeMode::Type Mode)
{
	SIZE_T ResSize = sizeof(USkeletalMesh);

	ResSize += Materials.GetAllocatedSize();
	ResSize += SkelMirrorTable.GetAllocatedSize();

	for (int32 i = 0, n = LODInfo.Num(); i < n; ++i)
	{
		ResSize += LODInfo[i].GetResourceSize(Mode);
	}
	ResSize += LODInfo.GetAllocatedSize();

#if WITH_EDITORONLY_DATA
	for (int32 i = 0, n = OptimizationSettings.Num(); i < n; ++i)
	{
		ResSize += OptimizationSettings[i].GetResourceSize(Mode);
	}
	ResSize += OptimizationSettings.GetAllocatedSize();
#endif

	for (int32 i = 0, n = MorphTargets.Num(); i < n; ++i)
	{
		ResSize += MorphTargets[i]->GetResourceSize(Mode);
	}
	ResSize += MorphTargets.GetAllocatedSize();

	ResSize += RefBasesInvMatrix.GetAllocatedSize();

	ResSize += ClothingAssets.GetAllocatedSize();

#if WITH_APEX_CLOTHING
	for (int32 i = 0, n = ClothingAssets.Num(); i < n; ++i)
	{
		ResSize += ClothingAssets[i].GetResourceSize(Mode);
	}
#endif

	ResSize += CachedStreamingTextureFactors.GetAllocatedSize();

	ResSize += Sockets.GetAllocatedSize();

	return ResSize;
}

And then again, you have to check that the called GetResourceSize() functions actually return the proper value. For example, ClothingAssets is an array of FClothingAssetData, but that struct didn't have a GetResourceSize() implementation (probably because it isn't a UObject itself, though we still need the memory usage of that resource). Also, this implementation will get outdated, and perhaps I missed some data when I wrote it. You just can't rely on this to get proper memory usage data.

MallocProfiler

The malloc profiler is a completely different approach. It is a feature in the engine that keeps track of all memory allocated through the new and delete operators, as well as anything that goes through the global memory allocator. It collects basic stats (memory operations, modules loaded, etc.) but also records the actual memory operations (allocations, frees, reallocations) together with the callstack for each. Grabbing the callstack isn't optional, since the callstack is the only piece of data that differentiates one allocation from another. But the fact that it has to capture the callstack makes it incredibly slow to use in-game, and it generates a huge output file. For example, I have seen capture files of ~15GB, and each frame took more than one second (yes, second, not millisecond) while the game was running.

The data can be visualized with a custom tool written in C# called MemoryProfiler 2. This is an extremely slow and memory-inefficient tool: a 15GB capture takes close to 30 minutes to open and more than 80GB of system memory. Again, the issue isn't only the tool itself, but also the sheer amount of data generated, made worse by the fact that loading the data is single-threaded, with the code structured in a way that makes it hard to multithread. But even if the tool were faster, and the data were compressed properly, it still wouldn't help much, because the information is limited. Let's look at the data:
[Image: MemoryProfiler 2 call-tree view]
Here we see the most relevant tab in terms of memory usage by resources, because it visualizes the different allocations as a call tree. In the screenshot, the allocations related to USkeletalMesh serialization are visible. The issue is that if you have many USkeletalMeshes, you can't find out which of them should be optimized; you just know that serializing all the USkeletalMeshes takes a certain amount of memory. This doesn't provide really useful data for tackling memory optimizations, except in cases where you are doing general optimizations (such as rearranging the data inside USkeletalMesh to reduce compiler padding).

Next steps

After looking at the limitations, it is obvious that there is a place in the engine for a better way to do and track allocations. We need something that covers the whole spectrum of allocations, that allows us to keep track of them, but that at the same time doesn't make it impossible for the game to run fast enough when tracking is enabled. Next time I will start digging into a solution and be more explicit about the requirements.