Optimizing sorting of base passes in Unreal Engine 4.

I know I said I was going to dig into the memory tracking solution, but I realized that I might as well do an optimization since people seem to find them very useful. So I decided to look into profiling and optimizing the Landscape Mountains map which you can download from the Unreal Engine Marketplace. If you haven’t seen it here is a video:

I decided to profile 900 frames starting from the 100th frame since I didn’t want to profile the initial streaming of the level. Here is the basic thread view.

ThreadProfile

The top thread is doing much more work than the rest: that thread is the render thread. It was obvious that to improve performance I would have to optimize the work happening on the rendering thread, so let’s look at that work.

RenderThreadProfile

The highlighted functions caught my attention because they accounted for almost 8% of the time spent on the render thread, and sorting is usually a fairly self-contained problem to tackle. So let’s dig into the actual code execution.

InitViewsCallstack

One of the first calls made each frame to render the scene is to FDeferredShadingSceneRenderer::InitViews(), which is in charge of initializing the scene’s views by checking visibility, sorting elements for the different passes, initializing the dynamic shadows, etc. The sorting of the static meshes with their respective drawing policy kicks off the calls to the functions we saw in the profiling data. Let’s look at that function:

void FDeferredShadingSceneRenderer::SortBasePassStaticData(FVector ViewPosition)
{
	// If we're not using a depth only pass, sort the static draw list buckets roughly front to back, to maximize HiZ culling
	// Note that this is only a very rough sort, since it does not interfere with state sorting, and each list is sorted separately
	if (EarlyZPassMode == DDM_None)
	{
		SCOPE_CYCLE_COUNTER(STAT_SortStaticDrawLists);

		for (int32 DrawType = 0; DrawType < FScene::EBasePass_MAX; DrawType++)
		{
			Scene->BasePassNoLightMapDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassSimpleDynamicLightingDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassCachedVolumeIndirectLightingDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassCachedPointIndirectLightingDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassHighQualityLightMapDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassDistanceFieldShadowMapLightMapDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassLowQualityLightMapDrawList[DrawType].SortFrontToBack(ViewPosition);
		}
	}
}

That’s extremely simple to understand. It just sorts the different draw lists one after the other, twice: once for the default base pass and once for the masked base pass. The first thing that becomes obvious is that we could sort those draw lists asynchronously on different threads and get some other work going while that’s being done. But in order to do that we need to know that the sort is self-contained and doesn’t affect global state. To find out we have to dig deeper into the callstack. Let’s look at the code of the next function, TStaticMeshDrawList::SortFrontToBack():

template<typename DrawingPolicyType>
void TStaticMeshDrawList<DrawingPolicyType>::SortFrontToBack(FVector ViewPosition)
{
	// Cache policy link bounds
	for (typename TDrawingPolicySet::TIterator DrawingPolicyIt(DrawingPolicySet); DrawingPolicyIt; ++DrawingPolicyIt)
	{
		FBoxSphereBounds AccumulatedBounds(ForceInit);

		FDrawingPolicyLink& DrawingPolicyLink = *DrawingPolicyIt;
		for (int32 ElementIndex = 0; ElementIndex < DrawingPolicyLink.Elements.Num(); ElementIndex++)
		{
			FElement& Element = DrawingPolicyLink.Elements[ElementIndex];

			if (ElementIndex == 0)
			{
				AccumulatedBounds = Element.Bounds;
			}
			else
			{
				AccumulatedBounds = AccumulatedBounds + Element.Bounds;
			}
		}
		DrawingPolicyIt->CachedBoundingSphere = AccumulatedBounds.GetSphere();
	}

	SortViewPosition = ViewPosition;
	SortDrawingPolicySet = &DrawingPolicySet;

	OrderedDrawingPolicies.Sort( TCompareStaticMeshDrawList<DrawingPolicyType>() );
}

In this piece of code there are a couple of things that could be changed. First of all, you can see that inside the inner loop AccumulatedBounds is only assigned directly on the first iteration, yet every iteration checks ElementIndex just to see whether it is the first one. Now, you could argue that the CPU’s branch predictor will quickly figure out that ElementIndex won’t be 0 on the following iterations, but it is better to write code that doesn’t lean too heavily on branch predictors since they differ quite a bit from one CPU to another. Let’s make that change first, plus a couple of minor changes:

template<typename DrawingPolicyType>
void TStaticMeshDrawList<DrawingPolicyType>::SortFrontToBack(FVector ViewPosition)
{
	// Cache policy link bounds
	for (typename TDrawingPolicySet::TIterator DrawingPolicyIt(DrawingPolicySet); DrawingPolicyIt; ++DrawingPolicyIt)
	{
		FBoxSphereBounds AccumulatedBounds(ForceInit);

		FDrawingPolicyLink& DrawingPolicyLink = *DrawingPolicyIt;

		const int32 NumElements = DrawingPolicyLink.Elements.Num();
		if (NumElements > 0)
		{
			AccumulatedBounds = DrawingPolicyLink.Elements[0].Bounds;
		}

		for (int32 ElementIndex = 1; ElementIndex < NumElements; ElementIndex++)
		{
			AccumulatedBounds = AccumulatedBounds + DrawingPolicyLink.Elements[ElementIndex].Bounds;
		}

		DrawingPolicyIt->CachedBoundingSphere = AccumulatedBounds.GetSphere();
	}

	SortViewPosition = ViewPosition;
	SortDrawingPolicySet = &DrawingPolicySet;

	OrderedDrawingPolicies.Sort( TCompareStaticMeshDrawList<DrawingPolicyType>() );
}

Now the actual issue with doing multiple draw lists sorts concurrently with this code is that there are two static variables in TStaticMeshDrawList:

/**
 * Static variables for getting data into the Compare function.
 * Ideally Sort would accept a non-static member function which would avoid having to go through globals.
 */
static TDrawingPolicySet* SortDrawingPolicySet;
static FVector SortViewPosition;

Now this is a case where code comments get outdated because it is possible to send data to the Compare function without using globals and without using thread-local storage or anything similar. Since the assumption was that sending data to the sort function wasn’t possible, here is how the predicate class was written:

/** Helper struct for sorting */
template<typename DrawingPolicyType>
struct TCompareStaticMeshDrawList
{
	FORCEINLINE bool operator()( const FSetElementId& A, const FSetElementId& B ) const
	{
		// Use static Compare from TStaticMeshDrawList
		return TStaticMeshDrawList<DrawingPolicyType>::Compare( A, B ) < 0;
	}
};

Now that we know that, let’s see how those two static variables are used:

template<typename DrawingPolicyType>
int32 TStaticMeshDrawList<DrawingPolicyType>::Compare(FSetElementId A, FSetElementId B)
{
	const FSphere& BoundsA = (*SortDrawingPolicySet)[A].CachedBoundingSphere;
	const FSphere& BoundsB = (*SortDrawingPolicySet)[B].CachedBoundingSphere;

	// Assume state buckets with large bounds are background geometry
	if (BoundsA.W >= HALF_WORLD_MAX / 2 && BoundsB.W < HALF_WORLD_MAX / 2)
	{
		return 1;
	}
	else if (BoundsB.W >= HALF_WORLD_MAX / 2 && BoundsA.W < HALF_WORLD_MAX / 2)
	{
		return -1;
	}
	else
	{
		const float DistanceASquared = (BoundsA.Center - SortViewPosition).SizeSquared();
		const float DistanceBSquared = (BoundsB.Center - SortViewPosition).SizeSquared();
		// Sort front to back
		return DistanceASquared > DistanceBSquared ? 1 : -1;
	}
}

So it is clear that the first thing we need to do to make this asynchronous is to get rid of SortDrawingPolicySet and SortViewPosition. First, the predicate class must be rewritten to take the parameters that used to be in the static variables:

/** Helper struct for sorting */
template<typename DrawingPolicyType>
struct TCompareStaticMeshDrawList
{
private:
	const typename TStaticMeshDrawList<DrawingPolicyType>::TDrawingPolicySet * const SortDrawingPolicySet;
	const FVector SortViewPosition;

public:
	TCompareStaticMeshDrawList(const typename TStaticMeshDrawList<DrawingPolicyType>::TDrawingPolicySet * const InSortDrawingPolicySet, const FVector InSortViewPosition)
		: SortDrawingPolicySet(InSortDrawingPolicySet)
		, SortViewPosition(InSortViewPosition)
	{
	}

	FORCEINLINE bool operator()( const FSetElementId& A, const FSetElementId& B ) const
	{
		// Use static Compare from TStaticMeshDrawList
		return TStaticMeshDrawList<DrawingPolicyType>::Compare(A, B, SortDrawingPolicySet, SortViewPosition) < 0;
	}
};

And replace the Compare implementation to fit the new changes:

template<typename DrawingPolicyType>
int32 TStaticMeshDrawList<DrawingPolicyType>::Compare(FSetElementId A, FSetElementId B, const TDrawingPolicySet * const InSortDrawingPolicySet, const FVector InSortViewPosition)
{
	const FSphere& BoundsA = (*InSortDrawingPolicySet)[A].CachedBoundingSphere;
	const FSphere& BoundsB = (*InSortDrawingPolicySet)[B].CachedBoundingSphere;

	// Assume state buckets with large bounds are background geometry
	if (BoundsA.W >= HALF_WORLD_MAX / 2 && BoundsB.W < HALF_WORLD_MAX / 2)
	{
		return 1;
	}
	else if (BoundsB.W >= HALF_WORLD_MAX / 2 && BoundsA.W < HALF_WORLD_MAX / 2)
	{
		return -1;
	}
	else
	{
		const float DistanceASquared = (BoundsA.Center - InSortViewPosition).SizeSquared();
		const float DistanceBSquared = (BoundsB.Center - InSortViewPosition).SizeSquared();
		// Sort front to back
		return DistanceASquared > DistanceBSquared ? 1 : -1;
	}
}

Since the static variables are gone we need to modify TStaticMeshDrawList::SortFrontToBack():

template<typename DrawingPolicyType>
void TStaticMeshDrawList<DrawingPolicyType>::SortFrontToBack(FVector ViewPosition)
{
	// Cache policy link bounds
	for (typename TDrawingPolicySet::TIterator DrawingPolicyIt(DrawingPolicySet); DrawingPolicyIt; ++DrawingPolicyIt)
	{
		FBoxSphereBounds AccumulatedBounds(ForceInit);

		FDrawingPolicyLink& DrawingPolicyLink = *DrawingPolicyIt;

		const int32 NumElements = DrawingPolicyLink.Elements.Num();
		if (NumElements > 0)
		{
			AccumulatedBounds = DrawingPolicyLink.Elements[0].Bounds;
		}

		for (int32 ElementIndex = 1; ElementIndex < NumElements; ElementIndex++)
		{
			AccumulatedBounds = AccumulatedBounds + DrawingPolicyLink.Elements[ElementIndex].Bounds;
		}

		DrawingPolicyIt->CachedBoundingSphere = AccumulatedBounds.GetSphere();
	}

	OrderedDrawingPolicies.Sort(TCompareStaticMeshDrawList<DrawingPolicyType>(&DrawingPolicySet, ViewPosition));
}

Now we should be able to sort the draw lists asynchronously. To do that we will need to replace the code in FDeferredShadingSceneRenderer::SortBasePassStaticData() with something that kicks off asynchronous calls to TStaticMeshDrawList::SortFrontToBack(), and it will have to return some kind of container that can be used to determine whether those tasks finished. That way we can issue the tasks and avoid waiting for them on the rendering thread unless we reach a point in the code where we need them done (and hopefully by then they will be done already). Unfortunately Unreal Engine 4.6 doesn’t support futures or other features that would make it easy to launch these functions asynchronously; it is my understanding that they will come in Unreal Engine 4.8. Meanwhile we will have to write a standard task graph task. Let’s do that.

For the sorting task, we will sort a single draw list per task, and whoever needs to sort multiple draw lists will create multiple tasks. The task only needs two parameters: the draw list to sort and the view position to sort from. Its only purpose is to call SortFrontToBack() on that draw list with the view position as a parameter, nothing else. Let’s implement that:

template<typename StaticMeshDrawList>
class FSortFrontToBackTask
{
private:
	StaticMeshDrawList * const StaticMeshDrawListToSort;
	const FVector ViewPosition;

public:
	FSortFrontToBackTask(StaticMeshDrawList * const InStaticMeshDrawListToSort, const FVector InViewPosition)
		: StaticMeshDrawListToSort(InStaticMeshDrawListToSort)
		, ViewPosition(InViewPosition)
	{

	}

	FORCEINLINE TStatId GetStatId() const
	{
		RETURN_QUICK_DECLARE_CYCLE_STAT(FSortFrontToBackTask, STATGROUP_TaskGraphTasks);
	}

	ENamedThreads::Type GetDesiredThread()
	{
		return ENamedThreads::AnyThread;
	}

	static ESubsequentsMode::Type GetSubsequentsMode() { return ESubsequentsMode::TrackSubsequents; }

	void DoTask(ENamedThreads::Type CurrentThread, const FGraphEventRef& MyCompletionGraphEvent)
	{
		StaticMeshDrawListToSort->SortFrontToBack(ViewPosition);
	}
};

That’s a bit of boilerplate code that won’t be necessary once futures are supported in Unreal Engine 4.8. Anyway, now we need a function to dispatch those tasks. Since I like to keep both the single-threaded and the multithreaded implementations available for debugging, I decided to write a new function called FDeferredShadingSceneRenderer::AsyncSortBasePassStaticData(). Let’s implement that:

void FDeferredShadingSceneRenderer::AsyncSortBasePassStaticData(const FVector &InViewPosition, FGraphEventArray &OutSortEvents)
{
	// If we're not using a depth only pass, sort the static draw list buckets roughly front to back, to maximize HiZ culling
	// Note that this is only a very rough sort, since it does not interfere with state sorting, and each list is sorted separately
	if (EarlyZPassMode == DDM_None)
	{
		for (int32 DrawType = 0; DrawType < FScene::EBasePass_MAX; DrawType++)
		{
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<FNoLightMapPolicy> > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassNoLightMapDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<FSimpleDynamicLightingPolicy> > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassSimpleDynamicLightingDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<FCachedVolumeIndirectLightingPolicy> > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassCachedVolumeIndirectLightingDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<FCachedPointIndirectLightingPolicy> > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassCachedPointIndirectLightingDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<TLightMapPolicy<HQ_LIGHTMAP> > > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassHighQualityLightMapDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<TDistanceFieldShadowsAndLightMapPolicy<HQ_LIGHTMAP> > > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassDistanceFieldShadowMapLightMapDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<TLightMapPolicy<LQ_LIGHTMAP> > > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassLowQualityLightMapDrawList[DrawType]), InViewPosition));
		}
	}
}

With that we have dispatched all the sorting tasks and OutSortEvents can be used to determine if all the tasks are done. So let’s go back to FDeferredShadingSceneRenderer::InitViews() and implement the support for asynchronous sorting:

#if PZ_ASYNC_FRONT_TO_BACK_SORT
	FGraphEventArray SortEvents;
	AsyncSortBasePassStaticData(AverageViewPosition, SortEvents);
#else
	SortBasePassStaticData(AverageViewPosition);
#endif // PZ_ASYNC_FRONT_TO_BACK_SORT

In that way we get to keep both approaches should you have to debug an issue with either. At this point the sort tasks have been launched, but we can’t assume they will be done before we leave the function, so let’s check for that and make sure they are done:

#if PZ_ASYNC_FRONT_TO_BACK_SORT
	if (SortEvents.Num())
	{
		FTaskGraphInterface::Get().WaitUntilTasksComplete(SortEvents, ENamedThreads::RenderThread);
	}
#endif // PZ_ASYNC_FRONT_TO_BACK_SORT

In that way we halt execution on the rendering thread until the tasks are done, but hopefully they will already be done by then.
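
For reference, PZ_ASYNC_FRONT_TO_BACK_SORT is nothing more than a compile-time switch. A minimal sketch of how it could be declared (where exactly it lives is a choice, I just assume a private renderer header):

// Compile-time toggle for the asynchronous base pass sort.
// Set it to 0 to fall back to the original single-threaded
// SortBasePassStaticData() path when debugging a sorting issue.
#ifndef PZ_ASYNC_FRONT_TO_BACK_SORT
	#define PZ_ASYNC_FRONT_TO_BACK_SORT 1
#endif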

So now the question is, did this change actually improve performance? Let’s compare the basic data capturing the same frames as before:

AsyncSortSummaryComparison

All positive numbers show performance improvements. Before-optimization numbers are on the left, after-optimization numbers on the right. Blue bars in the histogram show the frames before the optimization, orange bars show the frames after.

With VTune configured to show frames that took more than 20ms as slow, and less than 11ms as fast, it is clearly visible that we have more frames with good or fast performance and fewer frames with slow performance. Overall, all frames moved to the right in the histogram, showing that this definitely improved performance. Let’s look at the work on the different threads:

AsyncSortThreadsBefore
Before

AsyncSortThreadsAfter
After

The render thread is highlighted in both profiles. It is visible that the work on the rendering thread was reduced considerably, while the work on the threads that consume tasks increased. So shifting work from the render thread to the task threads, without spending time waiting for those tasks to finish, improved performance.

Going wide isn’t always the answer, especially when the code you need to optimize was written to be single threaded. But as programmers we need to get better at doing work asynchronously and to design for multithreading from the start, because this is a clear case where it is beneficial to go wide.

If you would like to see the whole code change and have access to the Unreal Engine 4 github then you can check the whole change here: https://github.com/EpicGames/UnrealEngine/pull/820

Limitations of memory tracking features in Unreal Engine 4.

Since June last year I have been working as a contractor on an AAA game based on Unreal Engine 4. One of the big issues on most AAA titles, and one which certainly bit me on this project, is memory consumption. It is a tough challenge to provide the best experience to the end user while staying within the constraints of the available memory, and even more so when working on a sandbox game, which is my case. While Unreal Engine 4 does provide some facilities for tracking memory, I have come to see that they are not up to the task when it comes to big sandbox games. But first let’s dig into the two means provided by the engine: memory report commands and MallocProfiler.

Notes: For this post I will assume that you have read the blog post by Ben Zeigler called “Debugging and Optimizing Memory” on the official Unreal Engine blog. Also, this article is based on what’s available in Unreal Engine 4.6.

Memory report commands

The memory report commands are a bunch of different console commands that allow you to see memory usage in general. All those commands are bundled together in a single, very convenient command: “Memreport -full”. Behind the scenes it executes the following commands:

Mem FromReport
obj list -alphasort
rhi.DumpMemory
LogOutStatLevels
ListSpawnedActors
DumpParticleMem
ConfigMem
r.DumpRenderTargetPoolMemory
ListTextures -alphasort
ListSounds -alphasort
ListParticleSystems -alphasort
obj list class=SoundWave -alphasort
obj list class=SkeletalMesh -alphasort
obj list class=StaticMesh -alphasort
obj list class=Level -alphasort

All these commands rely on three mechanisms for gathering data: global allocator stats, memory tracking stat defines, and each object’s UObject::GetResourceSize().

Global allocator stats

LogMemory: Platform Memory Stats for WindowsNoEditor
LogMemory: Process Physical Memory: 722.09 MB used, 945.73 MB peak
LogMemory: Process Virtual Memory: 732.14 MB used, 1379.78 MB peak
LogMemory: Physical Memory: 9624.43 MB used, 24551.27 MB total
LogMemory: Virtual Memory: 1026.31 MB used, 8388608.00 MB total
LogMemory:
LogMemory: Memory Stats:
LogMemory: FMemStack (gamethread) current size = 0.00 MB
LogMemory: FPageAllocator (all threads) allocation size [used/ unused] = [0.00 / 0.56] MB
LogMemory: Nametable memory usage = 1.54 MB

This is the most basic information. The depth of the data provided is very limited, but it is the first thing to look at when assessing memory issues. The first four lines are data that comes from GlobalMemoryStatusEx and GetProcessMemoryInfo, which are completely generic; it’s data you can already see in the Task Manager.
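
In case you are curious where those first four lines come from, here is a rough standalone sketch of the kind of Win32 calls involved (this is not the engine’s actual wrapper code, just an illustration; link against psapi.lib):

#include <windows.h>
#include <psapi.h>
#include <cstdio>

// Rough equivalent of the data behind the first four LogMemory lines.
void PrintBasicMemoryStats()
{
	MEMORYSTATUSEX MemStatus = { sizeof(MemStatus) };
	GlobalMemoryStatusEx(&MemStatus); // system-wide physical/virtual memory

	PROCESS_MEMORY_COUNTERS ProcessCounters = { sizeof(ProcessCounters) };
	GetProcessMemoryInfo(GetCurrentProcess(), &ProcessCounters, sizeof(ProcessCounters));

	printf("Process Physical Memory: %.2f MB used, %.2f MB peak\n",
		ProcessCounters.WorkingSetSize / (1024.0 * 1024.0),
		ProcessCounters.PeakWorkingSetSize / (1024.0 * 1024.0));
	printf("Physical Memory: %.2f MB used, %.2f MB total\n",
		(MemStatus.ullTotalPhys - MemStatus.ullAvailPhys) / (1024.0 * 1024.0),
		MemStatus.ullTotalPhys / (1024.0 * 1024.0));
}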

The last three lines are specific bits of memory which usually don’t take that much but it is still useful to see. The FMemStack size is the memory used by the linear stack allocator. That allocator is a singleton and it is used for multiple things such as storing data for rendering composition passes, shadows information, etc.

The second line refers to the allocation size done by the page allocator. This allocator also stores its data internally split into normal page size allocations and small page size allocations, and the statistic provided includes both. FMemStack uses the FPageAllocator so the two are related.

The last entry just shows the size of the table that stores all the FName entries. Since FNames are used throughout the engine it can be a sizable piece of memory, especially in sandbox games with lots of different UObjects.

Memory tracking stat defines

The Unreal stat system provides the following macros to create stat entries for memory tracking:

#define DECLARE_MEMORY_STAT(CounterName,StatId,GroupId)
#define DECLARE_MEMORY_STAT_POOL(CounterName,StatId,GroupId,Pool)
#define DECLARE_MEMORY_STAT_EXTERN(CounterName,StatId,GroupId, API)
#define DECLARE_MEMORY_STAT_POOL_EXTERN(CounterName,StatId,GroupId,Pool, API)

#define INC_MEMORY_STAT_BY(StatId,Amount)
#define DEC_MEMORY_STAT_BY(StatId,Amount)
#define SET_MEMORY_STAT(StatId,Value)

#define INC_MEMORY_STAT_BY_FName(Stat, Amount)
#define DEC_MEMORY_STAT_BY_FName(Stat,Amount)
#define SET_MEMORY_STAT_FName(Stat,Value)

These macros allow you to define, set, increase, and decrease each memory tracking stat entry. The flaw of this approach is that the stat needs to be increased and decreased properly, and in a place that is actually near where the memory is allocated or freed. Let’s look at an example in the engine:

void FPrimitiveSceneInfo::RemoveFromScene(bool bUpdateStaticDrawLists)
{
	check(IsInRenderingThread());

	// implicit linked list. The destruction will update this "head" pointer to the next item in the list.
	while(LightList)
	{
		FLightPrimitiveInteraction::Destroy(LightList);
	}

	// Remove the primitive from the octree.
	check(OctreeId.IsValidId());
	check(Scene->PrimitiveOctree.GetElementById(OctreeId).PrimitiveSceneInfo == this);
	Scene->PrimitiveOctree.RemoveElement(OctreeId);
	OctreeId = FOctreeElementId();

	IndirectLightingCacheAllocation = NULL;

	DEC_MEMORY_STAT_BY(STAT_PrimitiveInfoMemory, sizeof(*this) + StaticMeshes.GetAllocatedSize() + Proxy->GetMemoryFootprint());

	if (bUpdateStaticDrawLists)
	{
		RemoveStaticMeshes();
	}
}

In that piece of code we can see how STAT_PrimitiveInfoMemory is decreased, but not anywhere near where the actual memory was freed. The memory was freed inside the allocator defined for the StaticMeshes array and the scene proxy, and all of that was triggered by removing an element from the octree. If someone changes the memory usage of the octree, this stat will reflect the wrong memory consumption, which leads to wrong decisions when optimizing memory usage. The same happens if FPrimitiveSceneInfo changes, especially when new containers are added to the class.

The process of keeping allocation data up to date by means of stat defines is very error prone. The data does get out of date, it gets written wrong, and it doesn’t actually track memory, just estimates of what the programmer thinks could consume memory. And the last mechanism, UObject::GetResourceSize(), has the same issues.
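
To make the failure mode concrete, here is a hedged sketch of how one of these stats is typically kept up to date by hand. The class and stat name are made up for illustration; only the macros are the real ones listed above:

// Illustrative only: FMyRenderResource and STAT_MyResourceMemory are hypothetical.
DECLARE_MEMORY_STAT(TEXT("My Resource Memory"), STAT_MyResourceMemory, STATGROUP_Memory);

void FMyRenderResource::AllocateData(int32 NumElements)
{
	Payload.AddZeroed(NumElements);
	// The programmer has to remember to mirror every allocation...
	INC_MEMORY_STAT_BY(STAT_MyResourceMemory, Payload.GetAllocatedSize());
}

void FMyRenderResource::FreeData()
{
	// ...and every free, with exactly the same size computation. If Payload
	// grows somewhere else, or a new container is added to the class, the
	// stat silently drifts away from the real memory usage.
	DEC_MEMORY_STAT_BY(STAT_MyResourceMemory, Payload.GetAllocatedSize());
	Payload.Empty();
}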

Memory tracking with UObject::GetResourceSize()

Objects:

                                         Class  Count  NumKBytes  MaxKBytes  ResKBytes ExclusiveResKBytes
                            AIPerceptionSystem      1          0K          0K          0K          0K
                                AISense_Damage      1          0K          0K          0K          0K
                               AISense_Hearing      1          0K          0K          0K          0K
                            AISense_Prediction      1          0K          0K          0K          0K
                                 AISense_Sight      1          0K          0K          0K          0K
                                  AISense_Team      1          0K          0K          0K          0K
                                 AISense_Touch      1          0K          0K          0K          0K
                                      AISystem      1          0K          0K          0K          0K
                                  AmbientSound     21         15K         15K          0K          0K
               AnimNotify_PlayParticleEffect_C     11          1K          1K          0K          0K
                                  AnimSequence     78      53453K      53453K      53333K      53333K
                        AnimSingleNodeInstance     85        241K        242K          0K          0K
                                 ArrayProperty    729         85K         85K          0K          0K
                           AssetObjectProperty      2          0K          0K          0K          0K

In the ideal situation this function provides memory usage of a specific UObject. The function is defined this way:

/**
 * Returns the size of the object/ resource for display to artists/ LDs in the Editor. The
 * default behavior is to return 0 which indicates that the resource shouldn't display its
 * size which is used to not confuse people by displaying small sizes for e.g. objects like
 * materials
 *
 * @param	Type	Indicates which resource size should be returned
 * @return	Size of resource as to be displayed to artists/ LDs in the Editor.
 */
virtual SIZE_T GetResourceSize(EResourceSizeMode::Type Mode)
{
	return 0;
}

It should just take a call to the specific UObject::GetResourceSize() implementation to get that data, which is extremely useful to have. It tries to answer questions like “which UObjects in my scene do I need to optimize?” or “which UObject-based classes do I need to optimize?”. But again, this is only as good as the implementation written for it. This is another mechanism that gets outdated, that is implemented wrong (for example, by assuming the data is stored within the object rather than just pointers to the data), or that is just empty. For example, one may ask “which skeletal mesh in my scene should I optimize?” Let’s look at the implementation in the engine as of Unreal Engine 4.6:

SIZE_T USkeletalMesh::GetResourceSize(EResourceSizeMode::Type Mode)
{
	return 0;
}

So that’s not good. This mechanism is useless or outdated. So to fix this I had to write something like this:

SIZE_T USkeletalMesh::GetResourceSize(EResourceSizeMode::Type Mode)
{
	SIZE_T ResSize = sizeof(USkeletalMesh);

	ResSize += Materials.GetAllocatedSize();
	ResSize += SkelMirrorTable.GetAllocatedSize();

	for (int32 i = 0, n = LODInfo.Num(); i < n; ++i)
	{
		ResSize += LODInfo[i].GetResourceSize(Mode);
	}
	ResSize += LODInfo.GetAllocatedSize();

#if WITH_EDITORONLY_DATA
	for (int32 i = 0, n = OptimizationSettings.Num(); i < n; ++i)
	{
		ResSize += OptimizationSettings[i].GetResourceSize(Mode);
	}
	ResSize += OptimizationSettings.GetAllocatedSize();
#endif

	for (int32 i = 0, n = MorphTargets.Num(); i < n; ++i)
	{
		ResSize += MorphTargets[i]->GetResourceSize(Mode);
	}
	ResSize += MorphTargets.GetAllocatedSize();

	ResSize += RefBasesInvMatrix.GetAllocatedSize();

	ResSize += ClothingAssets.GetAllocatedSize();

#if WITH_APEX_CLOTHING
	for (int32 i = 0, n = ClothingAssets.Num(); i < n; ++i)
	{
		ResSize += ClothingAssets[i].GetResourceSize(Mode);
	}
#endif

	ResSize += CachedStreamingTextureFactors.GetAllocatedSize();

	ResSize += Sockets.GetAllocatedSize();

	return ResSize;
}

And then again, you have to check that the called GetResourceSize() functions actually return the proper value. For example, ClothingAssets is an array of FClothingAssetData, but that struct didn’t have a GetResourceSize() implementation (probably because it isn’t a UObject in itself, but we still need the memory usage of that resource). Also, this implementation will get outdated, and perhaps I missed some data when I wrote it. You just can’t rely on this to get proper memory usage data.
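
For completeness, the kind of fix I mean is simply giving the struct its own accounting function and calling it from USkeletalMesh::GetResourceSize(). Here is a hedged sketch of the shape it could take; the members accounted for are illustrative and the real struct may hold more data:

// FClothingAssetData is not a UObject, so it can't override UObject::GetResourceSize().
// Instead we add a plain member function with a matching signature.
SIZE_T FClothingAssetData::GetResourceSize(EResourceSizeMode::Type Mode)
{
	SIZE_T ResSize = 0;

	// Count whatever the struct actually owns. These members are illustrative:
	// strings, cached collision data, and any other containers it holds.
	ResSize += ApexFileName.GetAllocatedSize();
	ResSize += ClothCollisionVolumes.GetAllocatedSize();

	return ResSize;
}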

MallocProfiler

The malloc profiler is a completely different approach. It is a feature in the engine that keeps track of all memory allocated through the new and delete operators, as well as anything that goes through the global memory allocator. It collects basic stats such as memory operations, modules loaded, etc., but also records the actual memory operations (allocations, frees, reallocations) together with the callstack for each. Grabbing the callstack isn’t optional, since the callstack is the only piece of data that differentiates one allocation from another. But the fact that it has to capture the callstack makes it incredibly slow to use in-game, and it generates a huge output file. For example, I have seen capture files of ~15GB, and each frame took more than one second (yes, second, not millisecond) while the game was running.

The data can be visualized with a custom tool written in C# called MemoryProfiler 2. This is an extremely slow and inefficient tool in terms of memory: a 15GB capture takes close to 30 minutes to open and more than 80GB of system memory. Again, the issue isn’t only the tool but also the sheer amount of data generated, and that is made worse by the fact that loading the data is single threaded, with the code structured in a way that makes it hard to multithread. But even if the tool were faster, and the data were compressed properly, it still wouldn’t help that much because the information is limited. Let’s look at the data:
MemoryProfiler2
Here we see the most relevant tab in terms of memory usage by resources, because it visualizes the different allocations as a call tree. In the screenshot the allocations related to USkeletalMesh serialization are visible. The issue is that if you have many USkeletalMeshes, you can’t find out which of them should be optimized; you just know that serializing all of them takes a certain amount of memory. This doesn’t provide really useful data to tackle memory optimizations except in the cases where you are doing general optimizations (such as rearranging the data inside USkeletalMesh to reduce compiler padding).

Next steps

After looking at the limitations, it is obvious that there is a place in the engine for a better way to make and track allocations. We need something that covers the whole spectrum of allocations and allows us to keep track of them, but does so without making it impossible for the game to run fast enough when it’s enabled. Next time I will start digging into a solution for that and be more explicit about the requirements.

Optimizing Unreal Engine 4 asset cooking. CRC.

As I was working on the GPU particle system in Unreal Engine 4, I had to build my own scene. The process of baking the raw assets to be properly consumed by the engine is called “cooking” in Unreal Engine 4’s lingo. I found myself cooking assets quite often, and one thing I noticed was that the cooking process took a while to determine which assets needed to be cooked. The reason I noticed was that I would try to cook the assets without actually having saved my changes. While this may not seem like a big deal, it is especially annoying for me since I had a pretty bad experience with the build system at Dreamworks R&D: that build system would take more than 3 minutes to determine that nothing had to be built, and I’m talking about code, not assets. So I decided to look at this. As usual I will do all the profiling on my machine:

MachineSpecs

What I decided to do was cook my simple particles scene (of which you can see screenshots here) once, and then cook it again with the profiler running. That way I ensured that the cooking process only runs through the step of verifying what needs to be done, and it should determine that there isn’t any asset to build. After doing that, here are the results:

MemCrc_Slow

Since this is an asset baking process it is fairly normal to spend a lot of time dealing with memory, so I wasn’t too surprised about the CPU time spent on memset and memcpy. That’s something I will try to optimize later, and that I know will take some work. The memset in particular is caused by the editor build, which fills memory with a debug value (0xCD) on every new allocation and before freeing an allocation. But the call to FCrc::MemCrc_DEPRECATED() seemed like low-hanging fruit since that’s a pretty self-contained process. Let’s look at the code, which is also available here:

/**
 * DEPRECATED
 * These tables and functions are deprecated because they're using tables and implementations
 * which give values different from what a user of a typical CRC32 algorithm might expect.
 */

uint32 FCrc::MemCrc_DEPRECATED(const void* InData, int32 Length, uint32 CRC/*=0 */)
{
    // Based on the Slicing-by-8 implementation found here:
    // http://slicing-by-8.sourceforge.net/

    CRC = ~BYTESWAP_ORDER32(CRC);

    const uint8* __restrict Data = (uint8*)InData;

    // First we need to align to 32-bits
    int32 InitBytes = Align(Data, 4) - Data;

    if (Length > InitBytes)
    {
        Length -= InitBytes;

        for (; InitBytes; --InitBytes)
        {
            CRC = (CRC >> 8) ^ CRCTablesSB8_DEPRECATED[0][(CRC & 0xFF) ^ *Data++];
        }

        auto Data4 = (const uint32*)Data;
        for (uint32 Repeat = Length / 8; Repeat; --Repeat)
        {
            uint32 V1 = *Data4++ ^ CRC;
            uint32 V2 = *Data4++;
            CRC =
                CRCTablesSB8_DEPRECATED[7][ V1 & 0xFF] ^
                CRCTablesSB8_DEPRECATED[6][(V1 >> 8) & 0xFF] ^
                CRCTablesSB8_DEPRECATED[5][(V1 >> 16) & 0xFF] ^
                CRCTablesSB8_DEPRECATED[4][ V1 >> 24 ] ^
                CRCTablesSB8_DEPRECATED[3][ V2 & 0xFF] ^
                CRCTablesSB8_DEPRECATED[2][(V2 >> 8) & 0xFF] ^
                CRCTablesSB8_DEPRECATED[1][(V2 >> 16) & 0xFF] ^
                CRCTablesSB8_DEPRECATED[0][ V2 >> 24 ];
        }
        Data = (const uint8*)Data4;

        Length %= 8;
    }

    for (; Length; --Length)
    {
        CRC = (CRC >> 8) ^ CRCTablesSB8_DEPRECATED[0][(CRC & 0xFF) ^ *Data++];
    }

    return BYTESWAP_ORDER32(~CRC);
}

As seen in the code, this is based on the slicing-by-8 version of CRC32, which is way faster than the standard implementation. The reason is that it allows for greater instruction-level parallelism, because in the standard implementation each table lookup must be completed before computing the next index, while here 8 table lookups can potentially be done in parallel. For precise details, a paper was written about the algorithm here. So that’s fairly optimal already.
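
For comparison, this is roughly what the standard table-driven CRC32 loop looks like (the table name is illustrative). Note how each iteration’s table index depends on the CRC produced by the previous byte, which is exactly the serial dependency that slicing-by-8 breaks up:

// Classic byte-at-a-time table-driven CRC32, shown only to illustrate the
// dependency chain: the table index of iteration i needs the CRC computed in
// iteration i-1, so the lookups cannot overlap the way they do in slicing-by-8.
static uint32 MemCrc32_Simple(const void* InData, int32 Length, uint32 CRC = 0)
{
	const uint8* Data = (const uint8*)InData;
	CRC = ~CRC;
	for (int32 Index = 0; Index < Length; ++Index)
	{
		CRC = (CRC >> 8) ^ CRCTable[(CRC ^ Data[Index]) & 0xFF];
	}
	return ~CRC;
}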

But what are the alternatives? The easiest one would be to use SSE 4.2, which includes a CRC32 instruction. While there is a good speedup as shown here, that wouldn’t be integrated on PC because SSE 4.2 became available with the Nehalem architecture, and I don’t think AMD supported it before Bulldozer. So the other, more feasible alternative is to get rid of CRC32 and move to some other checksum algorithm.

There are many alternatives to generate hashes. In this case we only need to generate checksums; there is no need for a cryptographic hash function, even though Unreal Engine 4 already provides SHA-1. Some of the alternatives are FNV, CityHash, SpookyHash, MurmurHash, etc. They are all valid, and Charles Bloom, who works on Oodle, made a blog post covering the performance and properties of some of those algorithms, so I won’t attempt to cover that. But I dug deeper and came across xxHash, written by Yann Collet, who is also the author of the LZ4 compression algorithm. The performance shown by xxHash, the fact that it is used by LZ4, and the permissive license (New BSD License) made me make up my mind about attempting to integrate it.
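
To give an idea of what the integration looks like, the hashing path can be swapped with something as small as the sketch below. XXH32() is the real xxHash entry point; the wrapper name and where you call it from are my own choice, and the byte-swapping and seed semantics of the old CRC path are not preserved:

#include "xxhash.h" // the new third-party dependency

// Hedged sketch of an xxHash-based replacement for FCrc::MemCrc_DEPRECATED().
// The resulting values are completely different from the CRC32 ones, which is
// why every asset has to be re-cooked after the switch.
uint32 MemHashFast(const void* InData, int32 Length, uint32 Seed = 0)
{
	return XXH32(InData, Length, Seed);
}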

But not everything is that simple. While xxHash is extremely easy to integrate (like the stb libraries written by Sean Barrett), it has a considerable impact on the engine: the fact that all the hashes will change implies that all assets have to be rebuilt. This may not be a huge issue for me because I just have a few simple scenes, but it has a big impact on teams working on AAA games. From my point of view it is worthwhile to bite the bullet if the cooking time is reduced after that. So, what happens when we replace CRC32 with xxHash in terms of performance?

MemCrc_Fast

That’s a 3X improvement over the old code. In terms of wall-clock time there is almost a 20% improvement over the old version. I think that justifies re-cooking the assets.

This change in particular may not make it into the master branch of Unreal Engine 4 because it pulls in a new dependency (xxHash), and I have no idea if that’s acceptable. Still, I think it is a meaningful change that some people may want to integrate.

The lesson this time is to think outside the box. If you have seen my past posts you may have thought that I would take the path of optimizing the throughput by helping the compiler or looking at the number of uops. Instead I realized that since this is a pretty self-contained problem, there had to be faster alternatives to CRC32 that fit the needs of the engine.

Unreal Engine 4 GPU particle system micro-optimizations. Part 3.

On part one of the series I made an optimization to BuildParticleVertexBuffer which got rid of an unnecessary check when setting the half precision floats that store the indices. On part two I made a separate optimization, and in the process I wrote a stress test to help me understand the issues. It’s now time to go back to BuildParticleVertexBuffer since it was the biggest hotspot in the stress test. As usual I will do all the profiling on my machine:

MachineSpecs

I profiled 600 frames where the player was positioned in a corner pointing directly towards the emitters in the scene. Read part two of the series to see a couple of screenshots of the scene.

BuildParticleVertexBuffer-Part3-Slow

Compared to everything else, BuildParticleVertexBuffer is by far the biggest hotspot and something that definitely needs to be improved. What makes matters even worse is that this happens on the render thread, which is critical as mentioned in part two of the series. So let’s look at the code of the function.

static void BuildParticleVertexBuffer( FVertexBufferRHIParamRef VertexBufferRHI, const TArray<uint32>& InTiles )
{
	const int32 TileCount = InTiles.Num();
	const int32 IndexCount = TileCount * GParticlesPerTile;
	const int32 BufferSize = IndexCount * sizeof(FParticleIndex);
	const int32 Stride = 1;
	FParticleIndex* RESTRICT ParticleIndices = (FParticleIndex*)RHILockVertexBuffer( VertexBufferRHI, 0, BufferSize, RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = InTiles[Index];
		const FVector2D TileOffset( FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ), FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY ) );
		for ( int32 ParticleY = 0; ParticleY < GParticleSimulationTileSize; ++ParticleY )
		{
			for ( int32 ParticleX = 0; ParticleX < GParticleSimulationTileSize; ++ParticleX )
			{
				const float IndexX = TileOffset.X + ((float)ParticleX / (float)GParticleSimulationTextureSizeX) + (0.5f / (float)GParticleSimulationTextureSizeX);
				const float IndexY = TileOffset.Y + ((float)ParticleY / (float)GParticleSimulationTextureSizeY) + (0.5f / (float)GParticleSimulationTextureSizeY);
				// on some platforms, union and/or bitfield writes to Locked memory are really slow, so use a forced int write instead
				// and in fact one 32-bit write is faster than two uint16 writes (i.e. using .Encoded)
  				FParticleIndex Temp;
  				Temp.X = IndexX;
  				Temp.Y = IndexY;
  				*(uint32*)ParticleIndices = *(uint32*)&Temp;
				// move to next particle
				ParticleIndices += Stride;
			}
		}
	}
	RHIUnlockVertexBuffer( VertexBufferRHI );
}

As you can see there is a specific comment mentioning that on some platforms writing unions and/or bitfields to locked memory is really slow, and that a forced integer write is faster instead. But what are those platforms? Does it make sense to do that for all platforms? I don’t know which exact platforms whoever wrote this code was referring to (one of the downsides of having access to Unreal Engine 4 only on GitHub and not on Epic’s Perforce server is that you can’t use something like Perforce’s time-lapse view to see when and by whom this was written). If anybody has any specific information about that, please comment or let me know. Anyway, I decided to make a single change: I would get rid of the temporary FParticleIndex variable used to write the floats IndexX and IndexY, which is then written as a uint32. Instead I would use SetNoChecks() from the previous part in the series and set the floats directly. That simplifies the code, but doesn’t necessarily do the same for the assembly. So here is the code:

static void BuildParticleVertexBuffer( FVertexBufferRHIParamRef VertexBufferRHI, const TArray<uint32>& InTiles )
{
	const int32 TileCount = InTiles.Num();
	const int32 IndexCount = TileCount * GParticlesPerTile;
	const int32 BufferSize = IndexCount * sizeof(FParticleIndex);
	FParticleIndex* RESTRICT ParticleIndices = (FParticleIndex*)RHILockVertexBuffer( VertexBufferRHI, 0, BufferSize, RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = InTiles[Index];
		const FVector2D TileOffset( FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ), FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY ) );
		for ( int32 ParticleY = 0; ParticleY < GParticleSimulationTileSize; ++ParticleY )
		{
			for ( int32 ParticleX = 0; ParticleX < GParticleSimulationTileSize; ++ParticleX )
			{
				const float IndexX = TileOffset.X + ((float)ParticleX / (float)GParticleSimulationTextureSizeX) + (0.5f / (float)GParticleSimulationTextureSizeX);
				const float IndexY = TileOffset.Y + ((float)ParticleY / (float)GParticleSimulationTextureSizeY) + (0.5f / (float)GParticleSimulationTextureSizeY);

#if PLATFORM_WINDOWS
				ParticleIndices->X.SetNoChecks(IndexX);
				ParticleIndices->Y.SetNoChecks(IndexY);

				++ParticleIndices;
#else
				const int32 Stride = 1;
				// on some platforms, union and/or bitfield writes to Locked memory are really slow, so use a forced int write instead
				// and in fact one 32-bit write is faster than two uint16 writes (i.e. using .Encoded)
  				FParticleIndex Temp;
  				Temp.X = IndexX;
  				Temp.Y = IndexY;
  				*(uint32*)ParticleIndices = *(uint32*)&Temp;
				// move to next particle
				ParticleIndices += Stride;
#endif // PLATFORM_WINDOWS
			}
		}
	}
	RHIUnlockVertexBuffer( VertexBufferRHI );
}

I decided to see what Intel Architecture Code Analyzer had to say in terms of latency.

BuildParticleVertexBuffer-Part3-Latency

That doesn’t look good: the estimates said that the latency went from 53 cycles to 68 cycles, mostly due to pressure on port 1. But since those are estimates, it is critical to actually run the code and profile it. These are the results:

BuildParticleVertexBuffer-Part3-Fast

With that simple change I managed to cut the CPU time of the top hotspot in half and got the CPI (cycles per instruction retired) rate to go from 0.704 to 0.417. There are a couple of lessons here. The first one is to never rely 100 percent on static analysis tools. They are useful, but when it comes to performance the only proper way to measure is by profiling at runtime. The other lesson is that you should validate platform assumptions. You cannot make the end user pay for wrong platform generalizations. Make sure that the assumptions are correct, and write specialized code for each platform if necessary. Do not forget that as programmers, at the end of the day we are paid to deliver a good experience; we are not paid to write generalized code that only fits the lowest common denominator. After all, the end user doesn’t even care about our code.

Unreal Engine 4 GPU particle system micro-optimizations. Part 2.

On the previous part of the series I showed a micro-optimization to the GPU particle system’s BuildParticleVertexBuffer function. As part of the verification process, as well as to understand how the system worked, I created a scene of a hollow box with 20 GPU particle emitters. I gave them a basic behavior in Cascade and nothing else. Here are two screenshots:

ParticlesClose
ParticlesFar

Using that scene I decided to profile the system. As usual I did this on my machine:

MachineSpecs

This time I profiled 600 frames where the player was positioned in a corner pointing directly towards the emitters.

BuildTileVertexBuffer-Profile

The first hotspot is BuildParticleVertexBuffer, but that’s something I covered in the previous part of the series. The next one is BuildTileVertexBuffer. Now you may ask what’s the point of optimizing something that doesn’t seem to take that much CPU time. The answer is the next screenshot.

BuildTileVertexBuffer-Callstack

This function is being called from the render thread which, given the current implementation in Unreal Engine 4, is a considerable deal for performance. For the time being a lot of the rendering work happens on a single rendering thread, so that thread can set the wall time for the frame since there is a lot of work to be done on it. To prove that, see how work is distributed on the different threads:

BuildTileVertexBuffer-Threads

0x1F98 is the thread id for the rendering thread, and 0x1B60 is the thread id for the main thread. It is clearly visible that performance is dominated by the work done on the rendering thread. Thankfully Epic is working on parallel rendering which will alleviate the issue, but for the time being it is critical to optimize the render thread.

With that out of the way let’s look at the code:

/**
 * Builds a vertex buffer containing the offsets for a set of tiles.
 * @param TileOffsetsRef - The vertex buffer to fill. Must be at least TileCount * sizeof(FVector4) in size.
 * @param Tiles - The tiles which will be drawn.
 * @param TileCount - The number of tiles in the array.
 */
static void BuildTileVertexBuffer( FParticleBufferParamRef TileOffsetsRef, const uint32* Tiles, int32 TileCount )
{
	int32 Index;
	const int32 AlignedTileCount = ComputeAlignedTileCount(TileCount);
	FVector2D* TileOffset = (FVector2D*)RHILockVertexBuffer( TileOffsetsRef, 0, AlignedTileCount * sizeof(FVector2D), RLM_WriteOnly );
	for ( Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = Tiles[Index];
		TileOffset[Index] = FVector2D(
			FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ),
			FMath::Fractional( FMath::TruncFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY )
									  );
	}
	for ( ; Index < AlignedTileCount; ++Index )
	{
		TileOffset[Index] = FVector2D(100.0f, 100.0f);
	}
	RHIUnlockVertexBuffer( TileOffsetsRef );
}

In this function, at first sight, there isn’t anything terribly wrong with the exception of the use of the Index variable. For no good reason a dependency is created between the two loops when in fact they are completely independent; each loop could be controlled independently. Let’s rewrite it:

static void BuildTileVertexBuffer( FParticleBufferParamRef TileOffsetsRef, const uint32* Tiles, int32 TileCount )
{
	const int32 AlignedTileCount = ComputeAlignedTileCount(TileCount);
	FVector2D* TileOffset = (FVector2D*)RHILockVertexBuffer( TileOffsetsRef, 0, AlignedTileCount * sizeof(FVector2D), RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = Tiles[Index];
		TileOffset[Index] = FVector2D(
			FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ),
			FMath::Fractional( FMath::TruncFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY )
									  );
	}
	for ( int32 Index = TileCount; Index < AlignedTileCount; ++Index )
	{
		TileOffset[Index] = FVector2D(100.0f, 100.0f);
	}
	RHIUnlockVertexBuffer( TileOffsetsRef );
}

But shouldn’t the Visual Studio compiler realize that? Let’s look at the assembly with Intel Architecture Code Analyzer. The old code is on the left, and the new code is on the right.

BuildTileVertexBuffer-FirstOpt

The change got rid of two move instructions and the number of uops (micro-ops) went from 48 to 46. But we shouldn’t expect a huge improvement in performance just from that; we need to reduce the number of micro-ops further and hopefully that will also improve the instruction-level parallelism. Looking at the code, in the top loop there isn’t any specific need to construct an FVector2D and assign it to the current tile offset; I could just write the X and Y components directly as two independent operations. I could also do that for the loop below. Here is the new code:

static void BuildTileVertexBuffer( FParticleBufferParamRef TileOffsetsRef, const uint32* Tiles, int32 TileCount )
{
	const int32 AlignedTileCount = ComputeAlignedTileCount(TileCount);
	FVector2D* TileOffset = (FVector2D*)RHILockVertexBuffer( TileOffsetsRef, 0, AlignedTileCount * sizeof(FVector2D), RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = Tiles[Index];
		TileOffset[Index].X = FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX );
		TileOffset[Index].Y = FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY );
	}
	for ( int32 Index = TileCount; Index < AlignedTileCount; ++Index )
	{
		TileOffset[Index].X = 100.0f;
		TileOffset[Index].Y = 100.0f;
	}
	RHIUnlockVertexBuffer( TileOffsetsRef );
}

Would that make any improvements at all? Let’s look at it with Intel Architecture Code Analyzer. The old code on the left, the new code on the right.

BuildTileVertexBuffer-SecondOpt

Now we went from 48 uops to 41. Our optimization looks good from the static analysis, but let’s profile it.

BuildTileVertexBuffer-OptProfile

That cut the CPU time in half, which is good for the rendering thread.

I think the main lesson here is that the compiler may not optimize everything even if you think it should. Compilers in general are pretty limited in terms of optimizations, so as programmers we need to be aware of their limitations and how to get around them (which might involve trial and error, and looking at the disassembly directly or with some static analysis tool).

Unreal Engine 4 GPU particle system micro-optimizations. Part 1.

This time I’m going to go over two optimizations which were the very first that I made, and that just got integrated in Unreal Engine 4. Hopefully they will ship with version 4.4. When I first got the Unreal Engine 4 license, the most impressive demo available at the time was the Effects Cave demo. The demo looks like this:

So I started profiling the first 120 frames of the demo to get a sense of how CPU performance was being spent. As usual I profiled on my machine, which has the following specs:

MachineSpecs

Below is the result of the profiling run. Keep in mind that this was done on Unreal Engine 4.1:

BuildParticleVertexBuffer-Slow

It isn’t that often that you get to profile something that is such an obvious hotspot. Here the issue is pretty clear in terms of performance, I had to focus on BuildParticleVertexBuffer.

To make some sense of this it is good to see how the GPU particle system works. The GPU particle system, unlike the traditional CPU particle system, allows a high particle count to be rendered efficiently, at the cost of some flexibility. The actual emission of the particles still happens on the CPU, but once they are emitted the rest of the process happens on the GPU. The GPU particle system is implemented with tiles (4×4 tiles by default) in a set of textures (1024×1024 by default). There are two textures for position data and one for velocity, and those textures are double buffered. They are indexed by a vertex buffer where each index is stored as two half precision floats. The particles are collided against the information in the depth buffer. In particular, BuildParticleVertexBuffer creates the vertex buffer used to store the indices of the particles for a given set of tiles and fills it up with the data. Let’s look at how this is done:

static void BuildParticleVertexBuffer( FVertexBufferRHIParamRef VertexBufferRHI, const TArray<uint32>& InTiles )
{
	const int32 TileCount = InTiles.Num();
	const int32 IndexCount = TileCount * GParticlesPerTile;
	const int32 BufferSize = IndexCount * sizeof(FParticleIndex);
	const int32 Stride = 1;
	FParticleIndex* RESTRICT ParticleIndices = (FParticleIndex*)RHILockVertexBuffer( VertexBufferRHI, 0, BufferSize, RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = InTiles[Index];
		const FVector2D TileOffset( FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ), FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY ) );
		for ( int32 ParticleY = 0; ParticleY < GParticleSimulationTileSize; ++ParticleY )
		{
			for ( int32 ParticleX = 0; ParticleX < GParticleSimulationTileSize; ++ParticleX )
			{
				const float IndexX = TileOffset.X + ((float)ParticleX / (float)GParticleSimulationTextureSizeX) + (0.5f / (float)GParticleSimulationTextureSizeX);
				const float IndexY = TileOffset.Y + ((float)ParticleY / (float)GParticleSimulationTextureSizeY) + (0.5f / (float)GParticleSimulationTextureSizeY);
				// on some platforms, union and/or bitfield writes to Locked memory are really slow, so use a forced int write instead
				// and in fact one 32-bit write is faster than two uint16 writes (i.e. using .Encoded)
  				FParticleIndex Temp;
  				Temp.X = IndexX;
  				Temp.Y = IndexY;
  				*(uint32*)ParticleIndices = *(uint32*)&Temp;
				// move to next particle
				ParticleIndices += Stride;
			}
		}
	}
	RHIUnlockVertexBuffer( VertexBufferRHI );
}

As you can see there isn’t anything really suspicious about it; we are just generating the coordinates that index a particle. The compiler already generated fairly efficient code in terms of the input since most of the parameters are set at compile time. I decided to look at the assembly and throughput with Intel Architecture Code Analyzer. Here was part of the output:

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - C:\dev\UnrealEngine\master\Engine\Binaries\Win64\UE4Client-Win64-Test.exe
Binary Format - 64Bit
Architecture  - SNB
Analysis Type - Throughput

*******************************************************************
Intel(R) Architecture Code Analyzer Mark Number 1
*******************************************************************

Throughput Analysis Report
--------------------------
Block Throughput: 31.55 Cycles       Throughput Bottleneck: FrontEnd, Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 27.4    0.0 | 27.4 | 12.5   6.0  | 12.5   6.0  | 13.0 | 31.2 |
-------------------------------------------------------------------------
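
For reference, IACA only analyzes the instructions between its marker macros, so a block like the loop above gets bracketed roughly like this (a sketch of how such a measurement is set up; iacaMarks.h ships with the analyzer, and IACA_VC64_START / IACA_VC64_END are its MSVC x64 markers):

#include "iacaMarks.h"	// header distributed with the Intel Architecture Code Analyzer

// ...
IACA_VC64_START;	// begin the analyzed block
for ( int32 ParticleX = 0; ParticleX < GParticleSimulationTileSize; ++ParticleX )
{
	// ... loop body under analysis ...
}
IACA_VC64_END;		// end the analyzed block
// ...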

What surprised me in the assembly was seeing comparisons against constants that weren’t clearly visible in the source code. Obviously I was missing something. Looking at the code, nothing immediately obvious would cause comparisons against constants other than the GPU particle simulation parameters, such as the sizes of the textures. But then I noticed something particular about the type of the data. As I mentioned previously, the indexing data is a pair of half-precision floats. They are defined in FParticleIndex:

/**
 * Per-particle information stored in a vertex buffer for drawing GPU sprites.
 */
struct FParticleIndex
{
	/** The X coordinate of the particle within the texture. */
	FFloat16 X;
	/** The Y coordinate of the particle within the texture. */
	FFloat16 Y;
};
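
FFloat16 itself is a small wrapper around the 16-bit half-float encoding, exposing the sign, exponent, and mantissa as bitfields. Roughly, and trimmed to the parts relevant here (a paraphrased sketch, not a verbatim copy of the engine's type):

class FFloat16
{
public:
	union
	{
		struct
		{
			uint16 Mantissa : 10;	// low 10 bits
			uint16 Exponent : 5;	// biased exponent (bias of 15)
			uint16 Sign : 1;	// high bit
		} Components;

		uint16 Encoded;	// the raw 16-bit value
	};

	// Assigning a float goes through Set(), shown below.
	FFloat16& operator=( float FP32Value );
};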

But as you can see from BuildParticleVertexBuffer, the values written to the temporary FParticleIndex are two single-precision floats. There is no explicit cast in the code, so the conversion has to happen on assignment. So I looked at how the assignment worked and arrived at this function, which is called from the assignment operator:

FORCEINLINE void FFloat16::Set( float FP32Value )
{
	FFloat32 FP32(FP32Value);

	// Copy sign-bit
	Components.Sign = FP32.Components.Sign;

	// Check for zero, denormal or too small value.
	if ( FP32.Components.Exponent <= 112 )			// Too small exponent? (0+127-15)
	{
		// Set to 0.
		Components.Exponent = 0;
		Components.Mantissa = 0;
	}
	// Check for INF or NaN, or too high value
	else if ( FP32.Components.Exponent >= 143 )		// Too large exponent? (31+127-15)
	{
		// Set to 65504.0 (max value)
		Components.Exponent = 30;
		Components.Mantissa = 1023;
	}
	// Handle normal number.
	else
	{
		Components.Exponent = int32(FP32.Components.Exponent) - 127 + 15;
		Components.Mantissa = uint16(FP32.Components.Mantissa >> 13);
	}
}

As you can see this function handles single-precision floats that are out of the range of a half-precision float, flushing too-small values to zero and clamping too-large ones. Those were the comparisons I was seeing. But is this meaningful in any way for us? Of course! Thinking about the data is critical, and when you do that in this case there is no need for those checks. Our indexing coordinates are always in the (0, 1) range, well within the normal range of a half-precision float, so we should be assigning the values directly. To do that I wrote the following function:

FORCEINLINE void FFloat16::SetNoChecks( const float FP32Value )
{
	const FFloat32 FP32(FP32Value);

	// Make absolutely sure that you never pass in a single precision floating
	// point value that may actually need the checks. If you are not 100% sure
	// of that just use Set().

	Components.Sign = FP32.Components.Sign;
	Components.Exponent = int32(FP32.Components.Exponent) - 127 + 15;
	Components.Mantissa = uint16(FP32.Components.Mantissa >> 13);
}
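
With that in place, the inner loop of BuildParticleVertexBuffer can fill the temporary through the new function instead of relying on the implicit conversion. A minimal sketch of what the write then looks like (not the exact patch):

// Innermost loop of BuildParticleVertexBuffer, using the check-free path:
FParticleIndex Temp;
Temp.X.SetNoChecks( IndexX );	// IndexX is always in (0, 1), so the range checks are unnecessary
Temp.Y.SetNoChecks( IndexY );	// same for IndexY
*(uint32*)ParticleIndices = *(uint32*)&Temp;	// still a single 32-bit write into locked memory
ParticleIndices += Stride;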

Now that those checks are gone, performance should be better. Again I used the Intel Architecture Code Analyzer, and here is the output.

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - C:\dev\UnrealEngine\master\Engine\Binaries\Win64\UE4Client-Win64-Test.exe
Binary Format - 64Bit
Architecture  - SNB
Analysis Type - Throughput

*******************************************************************
Intel(R) Architecture Code Analyzer Mark Number 1
*******************************************************************

Throughput Analysis Report
--------------------------
Block Throughput: 23.40 Cycles       Throughput Bottleneck: Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 22.0    0.0 | 21.9 | 7.5    4.5  | 7.5    4.5  | 6.0  | 23.1 |
-------------------------------------------------------------------------

As you can see the throughput of the block went from 31.55 to 23.40 cycles, and the front end is no longer a bottleneck. Time to profile and see if there is a difference at runtime.

BuildParticleVertexBuffer-Fast

That small change yielded a performance win of ~27% over the original code with ~45% fewer instructions retired. A good win for such a small change.

As with all the optimizations I have done so far, understanding the input and output data is what allowed me to determine how to optimize. I can’t stress enough how important it is to understand the data. If you don’t understand it, the odds of improving performance are very low.

In the next part I will talk about another micro-optimization closely related to this one.

Optimizing Unreal Engine 4’s async file IO thread.

After the first post I made on the ToLower() optimization, a lot of people rightfully wondered why I optimized that function rather than the number of calls made to it. The answer is that I optimized both. While reducing the number of ToLower() calls would have made a significant difference on its own, it is also important for the code that does execute to be optimal, even if it isn’t executed very often. The sum of all these inefficiencies is what causes performance issues, because it is hard to see the performance decrease as the product evolves; this is what people refer to as “death by a thousand cuts”. So I first optimized the function to set a minimum bar for the performance of each call, and then went ahead and reduced the number of calls to it. Of course, depending on your time constraints you may set priorities differently.

As usual I’m going to be profiling on my machine that has the following specs:

MachineSpecs

So let’s begin by looking at the profile. Just like last time, this profiles the first 1000 frames of the Elemental demo starting from frame two.

FAsyncIOSystemSlow

As you can see, a lot of time is being spent in the ToLower() function, something I already optimized but whose fix isn’t included in this profile because I profile each change on its own. So the approach this time is to reduce the number of calls to the function. Looking at the callstacks, I found out that a lot of the calls were coming from the AsyncIOSystem thread.

FAsyncIOSystemCallstack

As the name says, the AsyncIOSystem is an asynchronous IO system. It is cross-platform and implemented as a single-threaded runnable. What is a runnable in Unreal Engine 4, you ask? Here is the answer:

/**
 * Interface for "runnable" objects.
 *
 * A runnable object is an object that is "run" on an arbitrary thread. The call usage pattern is
 * Init(), Run(), Exit(). The thread that is going to "run" this object always uses those calling
 * semantics. It does this on the thread that is created so that any thread specific uses (TLS, etc.)
 * are available in the contexts of those calls. A "runnable" does all initialization in Init().
 *
 * If initialization fails, the thread stops execution and returns an error code. If it succeeds,
 * Run() is called where the real threaded work is done. Upon completion, Exit() is called to allow
 * correct clean up.
 */
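
In other words, a runnable is just an object implementing UE4's FRunnable interface. A minimal sketch (FMyWorker is a made-up name for illustration):

class FMyWorker : public FRunnable
{
public:
	virtual bool Init() override
	{
		// One-time initialization, executed on the worker thread.
		return true;
	}

	virtual uint32 Run() override
	{
		// The actual threaded work happens here.
		return 0;
	}

	virtual void Exit() override
	{
		// Clean up on the worker thread before it goes away.
	}
};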

New IO requests can be queued up from any thread, and those requests can be cancelled as well. Whenever a new request is sent, the following data is stored in the given order:

Type                | Name                                     | Description
uint64              | RequestIndex                             | Index of the request.
int32               | FileSortKey                              | NOT USED. File sort key on media, INDEX_NONE if not supported or unknown.
FString             | FileName                                 | Name of file.
int64               | Offset                                   | Offset into file.
int64               | Size                                     | Size in bytes of data to read.
int64               | UncompressedSize                         | Uncompressed size in bytes of original data, 0 if data is not compressed on disc.
void*               | Dest                                     | Pointer to memory region used to read data into.
ECompressionFlags   | CompressionFlags                         | Flags for controlling decompression. Valid flags are no compression, zlib compression, and specific flags related to zlib compression.
FThreadSafeCounter* | Counter                                  | Thread safe counter that is decremented once work is done.
EAsyncIOPriority    | Priority                                 | Priority of request. Allows minimum, low, below normal, normal, high, and maximum priority.
uint32              | bIsDestroyHandleRequest : 1              | True if this is a request to destroy the handle.
uint32              | bHasAlreadyRequestedHandleToBeCached : 1 | True if the caching of the handle was already requested.
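
Expressed as code, the request data looks roughly like this (a sketch following the table above, not the engine's exact declaration):

struct FAsyncIORequest
{
	uint64              RequestIndex;
	int32               FileSortKey;	// not used
	FString             FileName;
	int64               Offset;
	int64               Size;
	int64               UncompressedSize;
	void*               Dest;
	ECompressionFlags   CompressionFlags;
	FThreadSafeCounter* Counter;
	EAsyncIOPriority    Priority;
	uint32              bIsDestroyHandleRequest : 1;
	uint32              bHasAlreadyRequestedHandleToBeCached : 1;
};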

So the most obvious culprit of the performance drain must be the FileName. Let’s look at where it is used:

FAsyncIOSystemFileNameRefs
The use in the Tick function caught my attention right away; it lined up properly with the callstack that I knew was flooding my output window. The Tick function of FAsyncIOSystem does the following things in order (sketched in pseudocode after the list):

  • Creates file handles for the outstanding requests.
  • Creates a copy of the next request that should be handled, based on platform-specific criteria such as the layout on disc or the highest set priority.
  • Fulfills the pending request.
    • This can be a request to destroy the handle, or a retrieve/create of a handle to fulfill a compressed or uncompressed read.
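
Put together, one Tick looks roughly like this in pseudocode (the helper names are illustrative, not the engine's):

void TickSketch()
{
	// 1) Make sure a file handle exists for each outstanding request;
	//    this consults the name-to-handle cache once per request.
	for ( FAsyncIORequest& Request : OutstandingRequests )
	{
		CacheHandleIfNeeded( Request );
	}

	// 2) Copy the next request to service, chosen by platform criteria
	//    such as layout on disc or highest priority.
	FAsyncIORequest PendingRequest;
	if ( CopyNextRequest( PendingRequest ) )
	{
		// 3) Fulfill it: either destroy the cached handle, or perform a
		//    compressed/uncompressed read through a cached or new handle,
		//    which means another lookup in the cache.
		if ( PendingRequest.bIsDestroyHandleRequest )
		{
			DestroyCachedHandle( PendingRequest );
		}
		else
		{
			FulfillRead( PendingRequest );
		}
	}
}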

To determine if a request handle needs to be created or retrieved from the cache, there is a name-to-handle map that maps the FString of the filename to an IFileHandle*. If the filename is in the map then there is a cached handle for it. The lookup is a FindRef() on the TMap, which calls FindId(); here is the code:

/**
 * Finds an element with the given key in the set.
 * @param Key - The key to search for.
 * @return The id of the set element matching the given key, or the NULL id if none matches.
 */
FSetElementId FindId(KeyInitType Key) const
{
	if(HashSize)
	{
		for(FSetElementId ElementId = GetTypedHash(KeyFuncs::GetKeyHash(Key));
			ElementId.IsValidId();
			ElementId = Elements[ElementId].HashNextId)
		{
			if(KeyFuncs::Matches(KeyFuncs::GetSetKey(Elements[ElementId].Value),Key))
			{
				// Return the first match, regardless of whether the set has multiple matches for the key or not.
				return ElementId;
			}
		}
	}
	return FSetElementId();
}

The ElementId is generated by the GetTypedHash() function, which for our key type generates a CRC32 hash; this is where all the time in FCrc::Strihash_DEPRECATED<wchar_t>() was being spent. And KeyFuncs::Matches() looks like this:

static FORCEINLINE bool Matches(KeyInitType A,KeyInitType B)
{
	return A == B;
}

While that looks pretty reasonable for fundamental integral types, comparing FStrings ends up calling Stricmp to do a case-insensitive lexicographical comparison. This is where the ToLower() call is made:

template <typename CharType>
static inline int32 Stricmp( const CharType* String1, const CharType* String2 )
{
	// walk the strings, comparing them case insensitively
	for (; *String1 || *String2; String1++, String2++)
	{
		CharType Char1 = TChar<CharType>::ToLower(*String1), Char2 = TChar<CharType>::ToLower(*String2);
		if (Char1 != Char2)
		{
			return Char1 - Char2;
		}
	}
	return 0;
}

So now we know what it takes to find something in the cache, but how often does that happen? The answer is in the FAsyncIOSystemBase::Tick() function, which shows that it happens once per outstanding request, and then once more if a request is pending. I measured the number of requests made before the very first frame was rendered: 2096 requests were queued. Considering that the AsyncIOSystem thread runs at above-normal priority, this ticking happens pretty often, and the numbers add up quickly. We need to fix this.

To fix this I took a rather simple approach: make sure that finding something in the cache involves comparisons between integral types. The easiest way was to add another field to the IO request data, a 32-bit hash of the filename. The hash is generated whenever a new request is queued up (be it an IO request or a file handle destroy request), and that hash is then used to find cached file handles. To generate the hash I decided to use something already available in the engine rather than integrating something like FNV-1 or xxHash, so I used a CRC32 hash.
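
A sketch of the idea (the FileNameHash field, the hashed map, and the choice of FCrc::Strihash_DEPRECATED are my illustration of the change, not the exact engine code; hash collisions are ignored for the purposes of the sketch):

// Extend the request data with a hash that is computed once, at queue time.
struct FAsyncIORequest
{
	// ... fields listed earlier ...
	FString FileName;
	uint32  FileNameHash;	// 32-bit hash of FileName
};

// The handle cache keyed on the hash instead of the FString.
TMap<uint32, IFileHandle*> NameHashToHandleMap;

// When a request is queued (IO reads and handle-destroy requests alike):
void OnRequestQueued( FAsyncIORequest& NewRequest )
{
	NewRequest.FileNameHash = FCrc::Strihash_DEPRECATED( *NewRequest.FileName );
}

// Finding a cached handle now compares two uint32s instead of doing a
// case-insensitive string comparison per bucket entry:
IFileHandle* FindCachedHandle( const FAsyncIORequest& Request )
{
	return NameHashToHandleMap.FindRef( Request.FileNameHash );
}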

So after doing that change let’s look at the profile:

FAsyncIOSystemFast

Pretty impressive: the call to ToLower() doesn’t show up anymore because only 0.030ms are spent in it across all 1000 frames. The call to FCrc::Strihash_DEPRECATED<wchar_t>() doesn’t show up either because only 7.9ms are spent in it across all 1000 frames.

The lesson this time relates to the tendency of developers to hide complexity under very generic functions that have huge performance relevance. In particular, performance was suboptimal because it isn’t obvious that A == B in KeyFuncs::Matches implies a Stricmp call for an FString. That’s why in my own code I tend not to overload operators; they usually hide complexity when, as programmers, we need to be fully aware of the complexity of what we ship. Programmers also forget that our main objective isn’t to create generic solutions for problems we may have in the future. Our main objective is to ship the best experience to the end user, and that means writing code that solves the actual problem we need to solve, with the available hardware resources, and within the available time. If you care a lot about the future of your code then worry about making optimizable code rather than making a grand design that abstracts complexity behind ten different design patterns. The truth is that the end user doesn’t care if you use ten different patterns in a generic design, but they do care if the load times are high.

And knowing your performance data is critical. In the case of the Elemental demo, 2096 IO requests were issued and fulfilled before the first frame was rendered. Being aware of that is critical to making the proper design and performance decisions. In fact, given that data I would be inclined to further optimize this by changing the AoS layout of FAsyncIORequest to a SoA layout so that the hashes are all stored together, reducing the CPI in the Tick function, but I will leave that up to you.
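
For illustration, the SoA arrangement could look something like this (made-up names, just to show the idea of keeping the hot hashes densely packed):

struct FOutstandingRequestsSoA
{
	TArray<uint32>          FileNameHashes;	// hot: scanned every Tick for handle lookups
	TArray<FAsyncIORequest> Requests;	// cold: the rest of the request data, indexed in parallel
};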