Optimizing sorting of base passes in Unreal Engine 4.

I know I said I was going to dig into the memory tracking solution, but I realized that I might as well do an optimization since people seem to find them very useful. So I decided to look into profiling and optimizing the Landscape Mountains map, which you can download from the Unreal Engine Marketplace. If you haven’t seen it, here is a video:

I decided to profile 900 frames, starting from the 100th frame, since I didn’t want to capture the initial streaming of the level. So here is the basic thread view.

ThreadProfile

The top thread is doing much more work than the rest: that is the render thread. It was obvious that to improve performance I would have to optimize the work happening on the render thread, so let’s look at it.

RenderThreadProfile

The highlighted functions caught my attention because they accounted for almost 8% of the time spent on the render thread, and sorting is usually a very self-contained problem to tackle. So let’s dig into the actual code execution.

InitViewsCallstack

One of the first calls made each frame to render the scene is to FDeferredShadingSceneRenderer::InitViews(), which is in charge of initializing the scene’s views: checking visibility, sorting elements for the different passes, initializing the dynamic shadows, etc. The sorting of the static meshes with their respective drawing policies kicks off the calls to the functions we saw in the profiling data. Let’s look at that function:

void FDeferredShadingSceneRenderer::SortBasePassStaticData(FVector ViewPosition)
{
	// If we're not using a depth only pass, sort the static draw list buckets roughly front to back, to maximize HiZ culling
	// Note that this is only a very rough sort, since it does not interfere with state sorting, and each list is sorted separately
	if (EarlyZPassMode == DDM_None)
	{
		SCOPE_CYCLE_COUNTER(STAT_SortStaticDrawLists);

		for (int32 DrawType = 0; DrawType < FScene::EBasePass_MAX; DrawType++)
		{
			Scene->BasePassNoLightMapDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassSimpleDynamicLightingDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassCachedVolumeIndirectLightingDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassCachedPointIndirectLightingDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassHighQualityLightMapDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassDistanceFieldShadowMapLightMapDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassLowQualityLightMapDrawList[DrawType].SortFrontToBack(ViewPosition);
		}
	}
}

That’s extremely simple to understand. It sorts each of the draw lists one after the other, twice: once for the default base pass and once for the masked base pass. The first thing that becomes obvious is that we could sort those draw lists asynchronously on different threads and get some other work going while that’s being done. But in order to do that we need to know that the sort is self-contained and doesn’t affect global state, so we have to dig deeper into the callstack. Let’s look at the code of the next function, TStaticMeshDrawList::SortFrontToBack():

template<typename DrawingPolicyType>
void TStaticMeshDrawList<DrawingPolicyType>::SortFrontToBack(FVector ViewPosition)
{
	// Cache policy link bounds
	for (typename TDrawingPolicySet::TIterator DrawingPolicyIt(DrawingPolicySet); DrawingPolicyIt; ++DrawingPolicyIt)
	{
		FBoxSphereBounds AccumulatedBounds(ForceInit);

		FDrawingPolicyLink& DrawingPolicyLink = *DrawingPolicyIt;
		for (int32 ElementIndex = 0; ElementIndex < DrawingPolicyLink.Elements.Num(); ElementIndex++)
		{
			FElement& Element = DrawingPolicyLink.Elements[ElementIndex];

			if (ElementIndex == 0)
			{
				AccumulatedBounds = Element.Bounds;
			}
			else
			{
				AccumulatedBounds = AccumulatedBounds + Element.Bounds;
			}
		}
		DrawingPolicyIt->CachedBoundingSphere = AccumulatedBounds.GetSphere();
	}

	SortViewPosition = ViewPosition;
	SortDrawingPolicySet = &DrawingPolicySet;

	OrderedDrawingPolicies.Sort( TCompareStaticMeshDrawList<DrawingPolicyType>() );
}

In this piece of code there are a couple of things that could be changed. First of all, inside the inner loop AccumulatedBounds is assigned on the first iteration, but the ElementIndex == 0 check is still evaluated on every subsequent iteration just to detect that first one. Now, you could say that the branch predictor in the CPU will figure out that ElementIndex won’t be 0 on the following iterations, but it is better to write code without relying too much on branch predictors, since they differ considerably from one CPU to another. Let’s make that change first, plus a couple of minor changes:

template<typename DrawingPolicyType>
void TStaticMeshDrawList<DrawingPolicyType>::SortFrontToBack(FVector ViewPosition)
{
	// Cache policy link bounds
	for (typename TDrawingPolicySet::TIterator DrawingPolicyIt(DrawingPolicySet); DrawingPolicyIt; ++DrawingPolicyIt)
	{
		FBoxSphereBounds AccumulatedBounds(ForceInit);

		FDrawingPolicyLink& DrawingPolicyLink = *DrawingPolicyIt;

		const int32 NumElements = DrawingPolicyLink.Elements.Num();
		if (NumElements > 0)
		{
			AccumulatedBounds = DrawingPolicyLink.Elements[0].Bounds;
		}

		for (int32 ElementIndex = 1; ElementIndex < NumElements; ElementIndex++)
		{
			AccumulatedBounds = AccumulatedBounds + DrawingPolicyLink.Elements[ElementIndex].Bounds;
		}

		DrawingPolicyIt->CachedBoundingSphere = AccumulatedBounds.GetSphere();
	}

	SortViewPosition = ViewPosition;
	SortDrawingPolicySet = &DrawingPolicySet;

	OrderedDrawingPolicies.Sort( TCompareStaticMeshDrawList<DrawingPolicyType>() );
}
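Stripped of the engine types, the same hoist-the-first-iteration pattern can be shown with a minimal stand-in bounds type (a standalone sketch, not engine code; Box and AccumulateBounds are illustrative names):

```cpp
#include <algorithm>
#include <vector>

// Minimal stand-in for the engine's bounds type; one axis is enough
// to show the accumulation pattern.
struct Box
{
	float MinX, MaxX;

	Box operator+(const Box& Other) const
	{
		return Box{ std::min(MinX, Other.MinX), std::max(MaxX, Other.MaxX) };
	}
};

// Seed the accumulator from element 0 and start the loop at index 1,
// so no "is this the first element?" branch runs inside the loop.
Box AccumulateBounds(const std::vector<Box>& Elements)
{
	Box Accumulated{ 0.0f, 0.0f };

	const int NumElements = static_cast<int>(Elements.size());
	if (NumElements > 0)
	{
		Accumulated = Elements[0];
	}

	for (int ElementIndex = 1; ElementIndex < NumElements; ++ElementIndex)
	{
		Accumulated = Accumulated + Elements[ElementIndex];
	}
	return Accumulated;
}
```

Seeding the accumulator outside the loop removes the per-iteration branch entirely instead of hoping the branch predictor hides its cost.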

Now, the actual obstacle to running multiple draw-list sorts concurrently with this code is that there are two static variables in TStaticMeshDrawList:

/**
 * Static variables for getting data into the Compare function.
 * Ideally Sort would accept a non-static member function which would avoid having to go through globals.
 */
static TDrawingPolicySet* SortDrawingPolicySet;
static FVector SortViewPosition;

Now this is a case where code comments get outdated because it is possible to send data to the Compare function without using globals and without using thread-local storage or anything similar. Since the assumption was that sending data to the sort function wasn’t possible, here is how the predicate class was written:

/** Helper struct for sorting */
template<typename DrawingPolicyType>
struct TCompareStaticMeshDrawList
{
	FORCEINLINE bool operator()( const FSetElementId& A, const FSetElementId& B ) const
	{
		// Use static Compare from TStaticMeshDrawList
		return TStaticMeshDrawList<DrawingPolicyType>::Compare( A, B ) < 0;
	}
};

Now that we know that, let’s see how those two static variables are used:

template<typename DrawingPolicyType>
int32 TStaticMeshDrawList<DrawingPolicyType>::Compare(FSetElementId A, FSetElementId B)
{
	const FSphere& BoundsA = (*SortDrawingPolicySet)[A].CachedBoundingSphere;
	const FSphere& BoundsB = (*SortDrawingPolicySet)[B].CachedBoundingSphere;

	// Assume state buckets with large bounds are background geometry
	if (BoundsA.W >= HALF_WORLD_MAX / 2 && BoundsB.W < HALF_WORLD_MAX / 2)
	{
		return 1;
	}
	else if (BoundsB.W >= HALF_WORLD_MAX / 2 && BoundsA.W < HALF_WORLD_MAX / 2)
	{
		return -1;
	}
	else
	{
		const float DistanceASquared = (BoundsA.Center - SortViewPosition).SizeSquared();
		const float DistanceBSquared = (BoundsB.Center - SortViewPosition).SizeSquared();
		// Sort front to back
		return DistanceASquared > DistanceBSquared ? 1 : -1;
	}
}

So it is clear that the first thing we need to do to make this asynchronous is to get rid of SortDrawingPolicySet and SortViewPosition. First, the predicate class must be rewritten to include the parameters that used to live in the static variables:

/** Helper struct for sorting */
template<typename DrawingPolicyType>
struct TCompareStaticMeshDrawList
{
private:
	const typename TStaticMeshDrawList<DrawingPolicyType>::TDrawingPolicySet * const SortDrawingPolicySet;
	const FVector SortViewPosition;

public:
	TCompareStaticMeshDrawList(const typename TStaticMeshDrawList<DrawingPolicyType>::TDrawingPolicySet * const InSortDrawingPolicySet, const FVector InSortViewPosition)
		: SortDrawingPolicySet(InSortDrawingPolicySet)
		, SortViewPosition(InSortViewPosition)
	{
	}

	FORCEINLINE bool operator()( const FSetElementId& A, const FSetElementId& B ) const
	{
		// Use static Compare from TStaticMeshDrawList
		return TStaticMeshDrawList<DrawingPolicyType>::Compare(A, B, SortDrawingPolicySet, SortViewPosition) < 0;
	}
};
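Outside the engine, the same idea maps directly onto std::sort: the comparator is a function object constructed with the state it needs, so several sorts can run concurrently, each with its own comparator instance. A minimal sketch with made-up types (Sphere and CompareFrontToBack here are illustrative, not the UE4 types):

```cpp
#include <algorithm>
#include <vector>

struct Sphere { float X, Y, Z, W; };	// center + radius, like FSphere

// The comparator carries its own view position and bounds array instead of
// reading globals, so concurrent sorts can't stomp on each other's state.
struct CompareFrontToBack
{
	const std::vector<Sphere>* Bounds;
	float ViewX, ViewY, ViewZ;

	bool operator()(int A, int B) const
	{
		const Sphere& SA = (*Bounds)[A];
		const Sphere& SB = (*Bounds)[B];
		const float DA = (SA.X - ViewX) * (SA.X - ViewX)
		               + (SA.Y - ViewY) * (SA.Y - ViewY)
		               + (SA.Z - ViewZ) * (SA.Z - ViewZ);
		const float DB = (SB.X - ViewX) * (SB.X - ViewX)
		               + (SB.Y - ViewY) * (SB.Y - ViewY)
		               + (SB.Z - ViewZ) * (SB.Z - ViewZ);
		return DA < DB;	// front to back: nearest first
	}
};

// Sort indices into Bounds front to back relative to the view position.
std::vector<int> SortFrontToBack(const std::vector<Sphere>& Bounds,
                                 float ViewX, float ViewY, float ViewZ)
{
	std::vector<int> Order(Bounds.size());
	for (int i = 0; i < (int)Order.size(); ++i) Order[i] = i;
	std::sort(Order.begin(), Order.end(),
	          CompareFrontToBack{ &Bounds, ViewX, ViewY, ViewZ });
	return Order;
}
```

Because each call builds its own comparator, two threads can sort two different lists at the same time with different view positions and never interfere.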

And replace the Compare implementation to fit the new changes:

template<typename DrawingPolicyType>
int32 TStaticMeshDrawList<DrawingPolicyType>::Compare(FSetElementId A, FSetElementId B, const TDrawingPolicySet * const InSortDrawingPolicySet, const FVector InSortViewPosition)
{
	const FSphere& BoundsA = (*InSortDrawingPolicySet)[A].CachedBoundingSphere;
	const FSphere& BoundsB = (*InSortDrawingPolicySet)[B].CachedBoundingSphere;

	// Assume state buckets with large bounds are background geometry
	if (BoundsA.W >= HALF_WORLD_MAX / 2 && BoundsB.W < HALF_WORLD_MAX / 2)
	{
		return 1;
	}
	else if (BoundsB.W >= HALF_WORLD_MAX / 2 && BoundsA.W < HALF_WORLD_MAX / 2)
	{
		return -1;
	}
	else
	{
		const float DistanceASquared = (BoundsA.Center - InSortViewPosition).SizeSquared();
		const float DistanceBSquared = (BoundsB.Center - InSortViewPosition).SizeSquared();
		// Sort front to back
		return DistanceASquared > DistanceBSquared ? 1 : -1;
	}
}

Since the static variables are gone we need to modify TStaticMeshDrawList::SortFrontToBack():

template<typename DrawingPolicyType>
void TStaticMeshDrawList<DrawingPolicyType>::SortFrontToBack(FVector ViewPosition)
{
	// Cache policy link bounds
	for (typename TDrawingPolicySet::TIterator DrawingPolicyIt(DrawingPolicySet); DrawingPolicyIt; ++DrawingPolicyIt)
	{
		FBoxSphereBounds AccumulatedBounds(ForceInit);

		FDrawingPolicyLink& DrawingPolicyLink = *DrawingPolicyIt;

		const int32 NumElements = DrawingPolicyLink.Elements.Num();
		if (NumElements > 0)
		{
			AccumulatedBounds = DrawingPolicyLink.Elements[0].Bounds;
		}

		for (int32 ElementIndex = 1; ElementIndex < NumElements; ElementIndex++)
		{
			AccumulatedBounds = AccumulatedBounds + DrawingPolicyLink.Elements[ElementIndex].Bounds;
		}

		DrawingPolicyIt->CachedBoundingSphere = AccumulatedBounds.GetSphere();
	}

	OrderedDrawingPolicies.Sort(TCompareStaticMeshDrawList<DrawingPolicyType>(&DrawingPolicySet, ViewPosition));
}

Now we should be able to sort the draw lists asynchronously. To do that we will need to replace the code in FDeferredShadingSceneRenderer::SortBasePassStaticData() with code that kicks off asynchronous calls to TStaticMeshDrawList::SortFrontToBack(), and it will have to return some kind of container that can be used to determine whether those tasks have finished. That way we can issue the tasks and avoid waiting for them on the rendering thread until we reach a point in the code where we actually need them done (and hopefully by then they will have finished already). Unfortunately Unreal Engine 4.6 doesn’t support futures and similar features that would make it easy to launch these functions asynchronously; it is my understanding that they will come in Unreal Engine 4.8. Meanwhile we will have to write a standard task. Let’s do that.

For the sorting task, we will sort a single draw list per task; whoever needs to sort multiple draw lists will create multiple tasks. The task only needs two parameters, the draw list to sort and the view position for the sorting, and its only job is to call the SortFrontToBack() function of that draw list with the view position as a parameter, nothing else. Let’s implement that:

template<typename StaticMeshDrawList>
class FSortFrontToBackTask
{
private:
	StaticMeshDrawList * const StaticMeshDrawListToSort;
	const FVector ViewPosition;

public:
	FSortFrontToBackTask(StaticMeshDrawList * const InStaticMeshDrawListToSort, const FVector InViewPosition)
		: StaticMeshDrawListToSort(InStaticMeshDrawListToSort)
		, ViewPosition(InViewPosition)
	{

	}

	FORCEINLINE TStatId GetStatId() const
	{
		RETURN_QUICK_DECLARE_CYCLE_STAT(FSortFrontToBackTask, STATGROUP_TaskGraphTasks);
	}

	ENamedThreads::Type GetDesiredThread()
	{
		return ENamedThreads::AnyThread;
	}

	static ESubsequentsMode::Type GetSubsequentsMode() { return ESubsequentsMode::TrackSubsequents; }

	void DoTask(ENamedThreads::Type CurrentThread, const FGraphEventRef& MyCompletionGraphEvent)
	{
		StaticMeshDrawListToSort->SortFrontToBack(ViewPosition);
	}
};

That’s a bit of boilerplate code that won’t be necessary once futures are supported in Unreal Engine 4.8. Anyway, now we need a function to dispatch those tasks. Since I like to keep the single-threaded and multithreaded implementations available for debugging, I decided to write a new function called FDeferredShadingSceneRenderer::AsyncSortBasePassStaticData(). Let’s implement that:

void FDeferredShadingSceneRenderer::AsyncSortBasePassStaticData(const FVector &InViewPosition, FGraphEventArray &OutSortEvents)
{
	// If we're not using a depth only pass, sort the static draw list buckets roughly front to back, to maximize HiZ culling
	// Note that this is only a very rough sort, since it does not interfere with state sorting, and each list is sorted separately
	if (EarlyZPassMode == DDM_None)
	{
		for (int32 DrawType = 0; DrawType < FScene::EBasePass_MAX; DrawType++)
		{
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<FNoLightMapPolicy> > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassNoLightMapDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<FSimpleDynamicLightingPolicy> > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassSimpleDynamicLightingDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<FCachedVolumeIndirectLightingPolicy> > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassCachedVolumeIndirectLightingDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<FCachedPointIndirectLightingPolicy> > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassCachedPointIndirectLightingDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<TLightMapPolicy<HQ_LIGHTMAP> > > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassHighQualityLightMapDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<TDistanceFieldShadowsAndLightMapPolicy<HQ_LIGHTMAP> > > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassDistanceFieldShadowMapLightMapDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<TLightMapPolicy<LQ_LIGHTMAP> > > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassLowQualityLightMapDrawList[DrawType]), InViewPosition));
		}
	}
}

With that we have dispatched all the sorting tasks and OutSortEvents can be used to determine if all the tasks are done. So let’s go back to FDeferredShadingSceneRenderer::InitViews() and implement the support for asynchronous sorting:

#if PZ_ASYNC_FRONT_TO_BACK_SORT
	FGraphEventArray SortEvents;
	AsyncSortBasePassStaticData(AverageViewPosition, SortEvents);
#else
	SortBasePassStaticData(AverageViewPosition);
#endif // PZ_ASYNC_FRONT_TO_BACK_SORT

That way we keep both approaches available should we have to debug an issue with either. At this point the sort tasks have been launched, but we can’t assume they will be done before we leave the function, so let’s wait for them to finish:

#if PZ_ASYNC_FRONT_TO_BACK_SORT
	if (SortEvents.Num())
	{
		FTaskGraphInterface::Get().WaitUntilTasksComplete(SortEvents, ENamedThreads::RenderThread);
	}
#endif // PZ_ASYNC_FRONT_TO_BACK_SORT

This halts execution on the rendering thread until the tasks are done, but hopefully they will already have finished by then.
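This dispatch-now, wait-later shape is exactly what futures express directly. For comparison, here is roughly how it would look in standard C++ with std::async; the types and names are hypothetical and this is not the engine’s task graph:

```cpp
#include <algorithm>
#include <future>
#include <vector>

// Hypothetical stand-in for a draw list: sorting it is an independent,
// self-contained operation once it no longer touches shared statics.
using DrawList = std::vector<float>;

void SortListFrontToBack(DrawList& List)
{
	std::sort(List.begin(), List.end());	// placeholder for the real sort
}

// Dispatch one sort per list; the returned futures play the role
// of the FGraphEventArray.
std::vector<std::future<void>> AsyncSortAll(std::vector<DrawList>& Lists)
{
	std::vector<std::future<void>> SortEvents;
	for (DrawList& List : Lists)
	{
		SortEvents.push_back(
			std::async(std::launch::async, [&List] { SortListFrontToBack(List); }));
	}
	return SortEvents;
}

// ... do other render-thread work here, then block only when the
// sorted lists are actually needed:
void WaitForSorts(std::vector<std::future<void>>& SortEvents)
{
	for (auto& Event : SortEvents) { Event.get(); }
}
```

The calling thread only blocks inside WaitForSorts(), and only for whatever work hasn’t already completed by then.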

So now the question is, did this change actually improve performance? Let’s compare the basic data capturing the same frames as before:

AsyncSortSummaryComparison

All positive numbers show performance improvements. Before-optimization numbers are on the left, after-optimization numbers on the right. Blue bars in the histogram show frames before the optimization, orange bars after.

With VTune configured to mark frames that took more than 20ms as slow and less than 11ms as fast, it is clearly visible that we have more frames with good or fast performance and fewer frames with slow performance. Overall the frames moved to the right in the histogram, showing that this definitely improved performance. Let’s look at the work on the different threads:

AsyncSortThreadsBefore

Before

AsyncSortThreadsAfter

After

The render thread is highlighted in both profiles. It is visible that the work on the render thread was reduced considerably while the work on the task threads increased. Shifting work from the render thread to the task threads, without then spending time waiting for those tasks to finish, improved performance.

Going wide isn’t always the answer, especially when the code you need to optimize was written to be single-threaded. But as programmers we need to get better at doing work asynchronously, and at designing for multithreading from the start, because this is a clear case where it is beneficial to go wide.

If you would like to see the whole code change and have access to the Unreal Engine 4 github then you can check the whole change here: https://github.com/EpicGames/UnrealEngine/pull/820


Unreal Engine 4 GPU particle system micro-optimizations. Part 3.

On part one of the series I made an optimization to BuildParticleVertexBuffer which got rid of an unnecessary check when setting the half-precision floats that make up the indices. On part two I made a separate optimization, and in the process I wrote a stress test to help me understand the issues. It’s now time to go back to BuildParticleVertexBuffer, since it was the biggest hotspot in the stress test. As usual I did all the profiling on my machine:

MachineSpecs

I profiled 600 frames where the player was positioned in a corner pointing directly towards the emitters in the scene. Read part two of the series to see a couple of screenshots of the scene.

BuildParticleVertexBuffer-Part3-Slow

Compared to everything else, BuildParticleVertexBuffer is by far the biggest hotspot and something that definitely needs to be improved. What makes matters even worse is that this happens on the render thread, which is critical as mentioned on part two of the series. So let’s look at the code of the function.

static void BuildParticleVertexBuffer( FVertexBufferRHIParamRef VertexBufferRHI, const TArray<uint32>& InTiles )
{
	const int32 TileCount = InTiles.Num();
	const int32 IndexCount = TileCount * GParticlesPerTile;
	const int32 BufferSize = IndexCount * sizeof(FParticleIndex);
	const int32 Stride = 1;
	FParticleIndex* RESTRICT ParticleIndices = (FParticleIndex*)RHILockVertexBuffer( VertexBufferRHI, 0, BufferSize, RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = InTiles[Index];
		const FVector2D TileOffset( FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ), FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY ) );
		for ( int32 ParticleY = 0; ParticleY < GParticleSimulationTileSize; ++ParticleY )
		{
			for ( int32 ParticleX = 0; ParticleX < GParticleSimulationTileSize; ++ParticleX )
			{
				const float IndexX = TileOffset.X + ((float)ParticleX / (float)GParticleSimulationTextureSizeX) + (0.5f / (float)GParticleSimulationTextureSizeX);
				const float IndexY = TileOffset.Y + ((float)ParticleY / (float)GParticleSimulationTextureSizeY) + (0.5f / (float)GParticleSimulationTextureSizeY);
				// on some platforms, union and/or bitfield writes to Locked memory are really slow, so use a forced int write instead
				// and in fact one 32-bit write is faster than two uint16 writes (i.e. using .Encoded)
  				FParticleIndex Temp;
  				Temp.X = IndexX;
  				Temp.Y = IndexY;
  				*(uint32*)ParticleIndices = *(uint32*)&Temp;
				// move to next particle
				ParticleIndices += Stride;
			}
		}
	}
	RHIUnlockVertexBuffer( VertexBufferRHI );
}

As you can see, there is a specific comment saying that on some platforms writing unions and/or bitfields to locked memory is really slow, and that a forced integer write is faster instead. But which platforms are those? Does it make sense to do that on all of them? I don’t know exactly which platforms whoever wrote this code was referring to (one of the downsides of having access to Unreal Engine 4 only on GitHub, rather than on Epic’s Perforce server, is that you can’t use something like Perforce’s time-lapse view to see when and by whom the code was written). If anybody has any specific information about that, please comment or let me know. Anyway, I decided to make a single change: get rid of the temporary FParticleIndex variable used to write the floats IndexX and IndexY, which was then written as a uint32, and instead use SetNoChecks() from the previous part in the series to set the floats directly. That simplifies the code, but doesn’t necessarily do the same for the assembly. So here is the code:

static void BuildParticleVertexBuffer( FVertexBufferRHIParamRef VertexBufferRHI, const TArray<uint32>& InTiles )
{
	const int32 TileCount = InTiles.Num();
	const int32 IndexCount = TileCount * GParticlesPerTile;
	const int32 BufferSize = IndexCount * sizeof(FParticleIndex);
	FParticleIndex* RESTRICT ParticleIndices = (FParticleIndex*)RHILockVertexBuffer( VertexBufferRHI, 0, BufferSize, RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = InTiles[Index];
		const FVector2D TileOffset( FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ), FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY ) );
		for ( int32 ParticleY = 0; ParticleY < GParticleSimulationTileSize; ++ParticleY )
		{
			for ( int32 ParticleX = 0; ParticleX < GParticleSimulationTileSize; ++ParticleX )
			{
				const float IndexX = TileOffset.X + ((float)ParticleX / (float)GParticleSimulationTextureSizeX) + (0.5f / (float)GParticleSimulationTextureSizeX);
				const float IndexY = TileOffset.Y + ((float)ParticleY / (float)GParticleSimulationTextureSizeY) + (0.5f / (float)GParticleSimulationTextureSizeY);

#if PLATFORM_WINDOWS
				ParticleIndices->X.SetNoChecks(IndexX);
				ParticleIndices->Y.SetNoChecks(IndexY);

				++ParticleIndices;
#else
				const int32 Stride = 1;
				// on some platforms, union and/or bitfield writes to Locked memory are really slow, so use a forced int write instead
				// and in fact one 32-bit write is faster than two uint16 writes (i.e. using .Encoded)
  				FParticleIndex Temp;
  				Temp.X = IndexX;
  				Temp.Y = IndexY;
  				*(uint32*)ParticleIndices = *(uint32*)&Temp;
				// move to next particle
				ParticleIndices += Stride;
#endif // PLATFORM_WINDOWS
			}
		}
	}
	RHIUnlockVertexBuffer( VertexBufferRHI );
}
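The packing trick the comment describes, emitting both half-precision indices with a single 32-bit store rather than two 16-bit stores, can be sketched outside the engine like this (a deliberately simplified float-to-half conversion that truncates the mantissa and flushes denormals to zero; UE4’s FFloat16 handles more cases, and FloatToHalf/PackParticleIndex are illustrative names):

```cpp
#include <cstdint>
#include <cstring>

// Simplified float -> half conversion: truncates the mantissa, flushes
// denormals/underflow to signed zero, saturates overflow to infinity.
static uint16_t FloatToHalf(float Value)
{
	uint32_t Bits;
	std::memcpy(&Bits, &Value, sizeof(Bits));
	const uint32_t Sign = (Bits >> 16) & 0x8000u;
	const int32_t  Exp  = (int32_t)((Bits >> 23) & 0xFFu) - 127 + 15;
	const uint32_t Mant = Bits & 0x7FFFFFu;
	if (Exp <= 0)  return (uint16_t)Sign;              // underflow -> signed zero
	if (Exp >= 31) return (uint16_t)(Sign | 0x7C00u);  // overflow  -> infinity
	return (uint16_t)(Sign | ((uint32_t)Exp << 10) | (Mant >> 13));
}

// Combine X and Y into one 32-bit value so the write to (potentially slow)
// locked memory is a single 32-bit store instead of two 16-bit stores.
static uint32_t PackParticleIndex(float X, float Y)
{
	return (uint32_t)FloatToHalf(X) | ((uint32_t)FloatToHalf(Y) << 16);
}
```

Whether the single 32-bit store actually wins over two 16-bit stores is exactly the platform-dependent question the original comment raises, which is why it has to be profiled per platform.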

I decided to see what the Intel Architecture Code Analyzer had to say in terms of latency.

BuildParticleVertexBuffer-Part3-Latency

That doesn’t look good: the estimate says the latency went from 53 cycles to 68 cycles, mostly due to pressure on port 1. But since those are estimates, it is critical to actually run the code and profile it. These are the results:

BuildParticleVertexBuffer-Part3-Fast

With that simple change I managed to cut the CPU time of the top hotspot in half and get the CPI rate (cycles per instruction retired) to go from 0.704 to 0.417. There are a couple of lessons here. The first is to never rely 100 percent on static analysis tools. They are useful, but when it comes to performance the only proper way to measure is to profile at runtime. The other lesson is that you should validate platform assumptions. You cannot make the end user pay for wrong platform generalizations. Make sure the assumptions are correct, and write specialized code for each platform if necessary. Do not forget that as programmers, at the end of the day, we are paid to deliver a good experience; we are not paid to write generalized code that only fits the lowest common denominator. After all, the end user doesn’t care about our code.

Optimizing AABB transform in Unreal Engine 4

Last time I didn’t go over the top hotspot in Unreal Engine 4’s Elemental demo, so I will tackle that issue now. Let’s look at the profiling:

TransformSlowSource

The FBox::TransformBy() function transforms the bounding box with an FTransform, which stores a quaternion for the rotation, a translation vector, and a scale vector. The FTransform object might be SIMD or scalar depending on the platform, but even in the SIMD version there is a whole lot of work involved in transforming a given position (if you have access to the source, check FTransform::TransformFVector4() in Engine\Source\Runtime\Core\Public\Math\TransformVectorized.h). The FTransform object itself is smaller than an FMatrix (3 SIMD registers vs 4), but the transform operation isn’t as straightforward. On top of that, given the OOP nature of the implementation, it doesn’t take advantage of the fact that there is data that could be reused across the vertices of the bounding box. Here is the old implementation:

FBox FBox::TransformBy(const FTransform & M) const
{
	FVector Vertices[8] =
	{
		FVector(Min),
		FVector(Min.X, Min.Y, Max.Z),
		FVector(Min.X, Max.Y, Min.Z),
		FVector(Max.X, Min.Y, Min.Z),
		FVector(Max.X, Max.Y, Min.Z),
		FVector(Max.X, Min.Y, Max.Z),
		FVector(Min.X, Max.Y, Max.Z),
		FVector(Max)
	};

	FBox NewBox(0);

	for (int32 VertexIndex = 0; VertexIndex < ARRAY_COUNT(Vertices); VertexIndex++)
	{
		FVector4 ProjectedVertex = M.TransformPosition(Vertices[VertexIndex]);
		NewBox += ProjectedVertex;
	}

	return NewBox;
}

Fortunately FBox has a TransformBy implementation that takes an FMatrix and is far better, since it is fairly small vectorized code that takes all the vertices of the bounding box into account at once. Instead of transforming each vertex and adding it to the new bounding box while dealing with the quaternion, it transforms all the vertices with the matrix (which takes just 3 _mm_shuffle_ps, 3 _mm_mul_ps, and 3 _mm_add_ps per vertex) and then recalculates the bounding box (which takes 7 _mm_min_ps and 7 _mm_max_ps).
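Ignoring the SIMD details, the shape of that matrix-based version is: build the 8 corners, transform each one with plain matrix math, and re-accumulate the min/max. A scalar sketch (row-vector convention as in UE4, but Vec3/Box3/TransformPoint are stand-in types and names, not the engine’s):

```cpp
#include <algorithm>

struct Vec3 { float X, Y, Z; };
struct Box3 { Vec3 Min, Max; };

// Row-vector convention like UE4: P' = P * M, with the translation
// stored in the fourth row of the matrix.
static Vec3 TransformPoint(const float M[4][4], Vec3 P)
{
	return Vec3{
		P.X * M[0][0] + P.Y * M[1][0] + P.Z * M[2][0] + M[3][0],
		P.X * M[0][1] + P.Y * M[1][1] + P.Z * M[2][1] + M[3][1],
		P.X * M[0][2] + P.Y * M[1][2] + P.Z * M[2][2] + M[3][2] };
}

// Transform all 8 corners, then rebuild the axis-aligned box from
// the running min/max of the transformed points.
static Box3 TransformBy(const Box3& B, const float M[4][4])
{
	const Vec3 Corners[8] = {
		{ B.Min.X, B.Min.Y, B.Min.Z }, { B.Min.X, B.Min.Y, B.Max.Z },
		{ B.Min.X, B.Max.Y, B.Min.Z }, { B.Max.X, B.Min.Y, B.Min.Z },
		{ B.Max.X, B.Max.Y, B.Min.Z }, { B.Max.X, B.Min.Y, B.Max.Z },
		{ B.Min.X, B.Max.Y, B.Max.Z }, { B.Max.X, B.Max.Y, B.Max.Z } };

	Vec3 Min = TransformPoint(M, Corners[0]);
	Vec3 Max = Min;
	for (int i = 1; i < 8; ++i)
	{
		const Vec3 P = TransformPoint(M, Corners[i]);
		Min = Vec3{ std::min(Min.X, P.X), std::min(Min.Y, P.Y), std::min(Min.Z, P.Z) };
		Max = Vec3{ std::max(Max.X, P.X), std::max(Max.Y, P.Y), std::max(Max.Z, P.Z) };
	}
	return Box3{ Min, Max };
}
```

The SIMD version does the same thing, just four lanes at a time, which is why the min/max accumulation maps so cleanly onto _mm_min_ps and _mm_max_ps.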

So using FMatrix in the current implementation is better than using FTransform. But in general the engine uses FTransforms for local-to-world transforms. That means we would have to rewrite the code to use matrices instead, or convert the FTransform into an FMatrix when transforming the bounding box. So you may ask “how much work would it take to change from FTransform to FMatrix?” Here is part of the answer:

TransfromByVA

As you can see, there is a whole bunch of work to be done if we want to do that. That raises the question: can the conversion pay for itself with the cycles saved by not using an FTransform? Switching from FTransform to FMatrix isn’t cheap in terms of number of instructions, but since it mostly involves shuffling it shouldn’t be hard to pipeline. To determine that I decided to use another tool from Intel, the Intel Architecture Code Analyzer (IACA). Basically this tool lets you statically analyze the throughput and latency of a snippet of code. People who have worked on the PlayStation 3 will find it similar to SN Tuner’s view of even and odd instructions on the SPUs. If you haven’t seen that, take a look at Jaymin Kessler’s video on software pipelining. Anyway, IACA is a nice tool for getting an estimate of the performance of a snippet of code, and it works just fine for our purpose.

So let’s take a look at the throughput for FTransform::TransformFVector4(). Keep in mind that this is called eight times, once for each vertex of the bounding box.

Throughput Analysis Report
--------------------------
Block Throughput: 27.05 Cycles       Throughput Bottleneck: Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 17.0   0.0  | 9.0  | 8.5    6.0  | 8.5    6.0  | 5.0  | 27.0 |
-------------------------------------------------------------------------

This means that we take ~27 cycles per vertex or ~216 for the whole bounding box. If we look at the throughput of FBox::TransformBy() with the FMatrix input it looks like this:

Throughput Analysis Report
--------------------------
Block Throughput: 80.00 Cycles       Throughput Bottleneck: Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 24.0   0.0  | 38.0 | 17.0   8.0  | 17.0   8.0  | 18.0 | 80.0 |
-------------------------------------------------------------------------

That’s ~80 cycles for all the vertices which is much better. But since we need to go from FTransform to FMatrix let’s see what that takes by getting the throughput for the conversion:

Throughput Analysis Report
--------------------------
Block Throughput: 29.00 Cycles       Throughput Bottleneck: Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 6.0    0.0  | 6.0  | 5.5    3.0  | 5.5    3.0  | 5.0  | 29.0 |
-------------------------------------------------------------------------

That’s ~29 cycles. So if we add the cost of the conversion plus the actual transform, we get ~109 cycles, roughly half of the ~216 cycles of the per-vertex FTransform path.

In theory this should work out just fine, so let’s rewrite the function:

FBox FBox::TransformBy(const FTransform & M) const
{
	return TransformBy(M.ToMatrixWithScale());
}
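The stress test itself is nothing more than a tight loop over the same input, timed, with a checksum carried through so the compiler can’t eliminate the work as dead code. A hypothetical harness of that shape (TransformOnce is a toy stand-in for the function under test, not the actual benchmark used here):

```cpp
#include <chrono>

// Toy workload standing in for FBox::TransformBy; in the real stress test
// this would be the function under measurement.
static float TransformOnce(float Seed)
{
	return Seed * 1.000001f + 0.25f;
}

// Run the workload many times over the same input so the measured cost is
// compute, not cache misses, and accumulate a checksum so the compiler
// can't optimize the loop away.
static float StressTest(int Iterations, double* OutMilliseconds)
{
	const auto Start = std::chrono::high_resolution_clock::now();
	float Checksum = 1.0f;
	for (int i = 0; i < Iterations; ++i)
	{
		Checksum = TransformOnce(Checksum);
	}
	const auto End = std::chrono::high_resolution_clock::now();
	*OutMilliseconds = std::chrono::duration<double, std::milli>(End - Start).count();
	return Checksum;
}
```

Reusing the same input on every iteration is deliberate: it keeps the data hot in cache, so the delta between the two implementations reflects instruction cost rather than memory behavior.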

Pretty simple change, but since IACA only provides estimates, we actually have to profile this to see if it improves performance. To do that I first made a stress test that transformed the same AABB 214748364 times, to make sure I could measure the performance delta and to avoid any data or instruction cache misses “polluting” the profiling. So here is the profiling data, with the old code in red and the new one in green.

TransformByStressProfile

Pretty impressive win considering the small change in the code. But let’s see how it works on the Elemental demo like last time:

TransformByElementalProfile

We got an average 6x improvement on the top hotspot. And here is a look at the profiling data; see how far it moved down the list of hotspots.

TransformByFast

I think the lesson here is that, just like last time, it is really important to think about the data and the platform. Don’t take the OOP approach of “I will write this correctly for a single vertex and then call it multiple times,” as if the number of vertices on a bounding box could ever change. Instead, take advantage of the fact that you have exactly 8 vertices and write something optimal and concise.
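To make that concrete, here is a minimal, self-contained sketch of the approach: transform the 8 corners of the box with a plain matrix and re-fit the min/max. The `Vec3`/`Mat4`/`Box` types are simplified stand-ins of my own, not UE4’s actual `FVector`/`FMatrix`/`FBox`, so treat this as an illustration of the shape of the code rather than the engine’s implementation.

```cpp
#include <algorithm>
#include <cfloat>

// Simplified stand-in types; UE4's real FVector/FMatrix/FBox differ.
struct Vec3 { float X, Y, Z; };

// Row-major affine 4x4 matrix: rows 0-2 are the basis axes, row 3 the origin.
struct Mat4 { float M[4][4]; };

// Transform a position by the matrix (W assumed to be 1).
static Vec3 TransformPosition(const Mat4& m, const Vec3& v)
{
    return {
        v.X * m.M[0][0] + v.Y * m.M[1][0] + v.Z * m.M[2][0] + m.M[3][0],
        v.X * m.M[0][1] + v.Y * m.M[1][1] + v.Z * m.M[2][1] + m.M[3][1],
        v.X * m.M[0][2] + v.Y * m.M[1][2] + v.Z * m.M[2][2] + m.M[3][2],
    };
}

struct Box
{
    Vec3 Min, Max;

    // Transform all 8 corners and re-fit the box. The corner count is a
    // compile-time constant, so the compiler can fully unroll this loop.
    Box TransformBy(const Mat4& m) const
    {
        Box out = { {  FLT_MAX,  FLT_MAX,  FLT_MAX },
                    { -FLT_MAX, -FLT_MAX, -FLT_MAX } };
        for (int i = 0; i < 8; ++i)
        {
            // Bits of i select Min or Max per axis to enumerate the corners.
            const Vec3 corner = { (i & 1) ? Max.X : Min.X,
                                  (i & 2) ? Max.Y : Min.Y,
                                  (i & 4) ? Max.Z : Min.Z };
            const Vec3 p = TransformPosition(m, corner);
            out.Min = { std::min(out.Min.X, p.X), std::min(out.Min.Y, p.Y), std::min(out.Min.Z, p.Z) };
            out.Max = { std::max(out.Max.X, p.X), std::max(out.Max.Y, p.Y), std::max(out.Max.Z, p.Z) };
        }
        return out;
    }
};
```

A transform-taking overload would then just convert to a matrix once and forward to this path, which is exactly the shape of the one-line rewrite above.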

The other lesson is not to be afraid of doing some data contortion to improve performance. In the ideal case there wouldn’t be a need for it, but when that’s not the case, see if you can get a performance win by shuffling data around, even if the shuffling itself has a cost.

And as usual: always profile. Don’t rely solely on static analysis like IACA, and don’t think so highly of your skills that you extrapolate performance gains; just profile and assess.
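The stress-test approach described above can be sketched as a small harness: run the code under test enough times that the delta is measurable, and time the whole batch rather than individual calls. This is my own minimal sketch, not the harness used for the numbers above.

```cpp
#include <chrono>

// Time a batch of iterations of fn and return the elapsed milliseconds.
// In a real benchmark, make sure the result of fn is consumed (e.g. stored
// to a volatile) so the compiler cannot optimize the work away.
template <typename F>
double TimeBatchMs(F&& fn, int iterations)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        fn();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Timing the same workload before and after a change, on the same machine, is what actually validates an optimization; the IACA estimates only suggest where to look.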

Optimizing a string function in UE4

Back in March, during GDC, Epic Games released Unreal Engine 4 with a subscription-based license. As soon as it was released I decided to license it to be able to learn and exercise my skills, something I hadn’t been doing since I was helping EA Sports reach the alpha stage on NHL 15, doing back-end work for the UI. As soon as I got it I started profiling to understand the performance characteristics of the engine. As this blog goes on you will see that performance is something very important to me =) . Since then I have come up with half a dozen optimizations, some of which are already integrated in Epic’s Perforce, and some of which are still under review. Today I’m going to talk about a really basic optimization with a considerable impact on performance, which involves converting a character to lower case.

As with any serious performance work, I started by profiling, which provides the most critical data for optimization. I refuse to do any optimization work unless I have profiling data that backs it up. I also add the before/after profiling data to the commit message to make explicit why the work was needed, and I make sure I include the specs of the hardware used for profiling. All the profiling I will show in this post was done on this machine:

MachineSpecs

The first thing I did was hook up Intel’s VTune to Unreal Engine 4, since I try to avoid non-automated profiling as much as possible. After some work dealing with the Unreal Build Tool I got it integrated properly: I was able to add frame markers and events, and to start and stop the capture of profiling data from code. With that ready I decided to profile the Elemental demo, a demo meant to show the potential of Unreal Engine 4, first shown back in 2012. Martin Mittring gave a presentation about it at SIGGRAPH 2012. If you have never seen the demo, here is a video of it:

I decided to profile the Elemental demo from frame 2 to frame 1000, to skip all the loading of data in the first frame. That meant nothing would be profiled during the first ~3.6 seconds of the demo; the profiler would then be enabled at the start of the second frame and stopped at the end of frame 1000. Given that this is a scripted demo, this setup allowed me to verify that any optimization I made actually improved performance. The profiling showed this:

ToLower-Opt-Slow

For now just ignore FBox::TransformBy(), since I will talk about it in another post. After that, the top hotspot was TChar&lt;wchar_t&gt;::ToLower(). Since the Unreal Engine 4 EULA allows me to, I will share this small piece of code.

template <> inline
TChar<WIDECHAR>::CharType TChar<WIDECHAR>::ToLower(CharType c)
{
	// compiler generates incorrect code if we try to use TEXT('char') instead of the numeric values directly
	// some special cases
	switch (WIDECHAR(c))
	{
		// these are not 32 apart
	case 159: return 255; // diaeresis y
	case 140: return 156; // digraph ae

		// characters within the 192 - 255 range which have no uppercase/lowercase equivalents
	case 240:
	case 208:
	case 223:
	case 247:
		return c;
	}

	if ((c >= 192 && c < 223) || (c >= LITERAL(CharType, 'A') && c <= LITERAL(CharType, 'Z')))
	{
		return c + UPPER_LOWER_DIFF;
	}

	// no lowercase equivalent
	return c;
}

This converts a wide char from upper case to lower case. Taking that piece of code on its own, most people wouldn’t expect it to be a top hotspot; unfortunately, programmers in general don’t seem to think about the performance implications of their string-handling code. In particular, this code uses too many branches, which are hard for the branch predictor to predict given the essentially random nature of the input. There were two things to be done here: the most obvious was to reduce the number of calls to the function, but the other was to optimize the function itself by reducing the number of branches. I will go into how I reduced the number of calls in another post, so let’s focus on optimizing the actual function.

The code makes it pretty explicit what the valid input is and which inputs just return the input char unchanged. Since the range of valid input is small, it really lends itself to a look-up table. The table only needs to be as big as the valid input range, leaving a single branch to check that the input char fits within that range. The first step was to create the look-up table, which was really easy:

static const size_t ConversionMapSize = 256U;
static const uint8 ConversionMap[ConversionMapSize] =
{
	0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F,
	0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F,
	0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2A, 0x2B, 0x2C, 0x2D, 0x2E, 0x2F,
	0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x3A, 0x3B, 0x3C, 0x3D, 0x3E, 0x3F,
	0x40, 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
	'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 0x5B, 0x5C, 0x5D, 0x5E, 0x5F,
	0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6A, 0x6B, 0x6C, 0x6D, 0x6E, 0x6F,
	0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7A, 0x7B, 0x7C, 0x7D, 0x7E, 0x7F,
	0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88, 0x89, 0x8A, 0x8B, 0x9C, 0x8D, 0x8E, 0x8F,
	0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, 0x98, 0x99, 0x9A, 0x9B, 0x9C, 0x9D, 0x9E, 0xFF,
	0xA0, 0xA1, 0xA2, 0xA3, 0xA4, 0xA5, 0xA6, 0xA7, 0xA8, 0xA9, 0xAA, 0xAB, 0xAC, 0xAD, 0xAE, 0xAF,
	0xB0, 0xB1, 0xB2, 0xB3, 0xB4, 0xB5, 0xB6, 0xB7, 0xB8, 0xB9, 0xBA, 0xBB, 0xBC, 0xBD, 0xBE, 0xBF,
	0xE0, 0xE1, 0xE2, 0xE3, 0xE4, 0xE5, 0xE6, 0xE7, 0xE8, 0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF,
	0xF0, 0xF1, 0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xDF,
	0xE0, 0xE1, 0xE2, 0xE3, 0xE4, 0xE5, 0xE6, 0xE7, 0xE8, 0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF,
	0xF0, 0xF1, 0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF
};

This table is just 256 bytes long, which is 4 cache lines on the platforms relevant to Unreal Engine 4 (64-byte cache lines). This shouldn’t be a concern at all, since the use case is converting full strings char by char: the first char may incur a cache miss to fetch the table, but the rest of the chars will be cache hits.

Now that we have the look-up table, we can index it with the input char as long as that char is within the range covered by the table. For anything outside that range we just return the input char unchanged.

template <> inline 
TChar<WIDECHAR>::CharType TChar<WIDECHAR>::ToLower( CharType c )
{
	return (c < static_cast<CharType>(ConversionMapSize)) ? static_cast<CharType>(ConversionMap[c]) : c;
}

Now we have a single branch. Time to profile the optimization. Since I have everything automated, I ran the exact same process, capturing the same number of frames, and this was the performance difference in the function:

ToLower-Opt-Fast

Without much effort I decreased the CPU time spent in the function by ~41%. This is a big difference that shows how performance can be dominated by branches. It also shows how critical it is to profile before optimizing. I’m pretty sure most programmers would have assumed the time was dominated by some complex subsystem (such as animation) rather than something as mundane as dealing with chars.
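To see the shape of the optimization outside the engine, here is a self-contained sketch of the same branchless-table idea. To keep it short it builds the table at startup and only encodes the ASCII A–Z rule, unlike the hand-written 256-entry table above, which also covers the Latin-1 special cases; the names are my own, not UE4’s.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Build the 256-entry lower-case map once at startup. Sketch only: this
// handles just ASCII A-Z; the table in the article also maps Latin-1.
static std::array<wchar_t, 256> MakeToLowerMap()
{
    std::array<wchar_t, 256> map{};
    for (std::size_t i = 0; i < map.size(); ++i)
        map[i] = static_cast<wchar_t>(i);       // identity by default
    for (wchar_t c = L'A'; c <= L'Z'; ++c)
        map[c] = c - L'A' + L'a';               // the only rewritten range
    return map;
}

static const std::array<wchar_t, 256> ToLowerMap = MakeToLowerMap();

// Single branch: chars outside the table pass through unchanged.
inline wchar_t ToLowerFast(wchar_t c)
{
    return (static_cast<std::size_t>(c) < ToLowerMap.size()) ? ToLowerMap[c] : c;
}

// The intended use case: converting a whole string char by char, so the
// table stays hot in cache after the first access.
std::wstring ToLowerString(std::wstring s)
{
    for (wchar_t& c : s)
        c = ToLowerFast(c);
    return s;
}
```

The per-char cost is one compare plus one table load, instead of a chain of data-dependent branches the predictor can’t learn.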

The basic lesson here is that you should profile before doing any optimization work; don’t assume that the most complex subsystems are always the bottleneck. Even more important is to be aware of your target platform and to write code in a way that suits it.