Unreal Engine 4 GPU particle system micro-optimizations. Part 2.

In the previous part of the series I showed a micro-optimization of the BuildParticleVertexBuffer function in the GPU particle system. As part of the verification process, and to understand how the system worked, I created a scene of a hollow box with 20 GPU particle emitters. I gave them basic behavior in Cascade and nothing else. Here are two screenshots:

[Images: the particle scene viewed up close and from afar]

Using that scene I decided to profile the system. As usual I did this on my machine:

[Image: machine specifications]

This time I profiled 600 frames where the player was positioned in a corner pointing directly towards the emitters.

[Image: profile showing BuildParticleVertexBuffer and BuildTileVertexBuffer as hotspots]

The first hotspot is BuildParticleVertexBuffer, but that's something I covered in the previous part of the series. The next one is BuildTileVertexBuffer. Now you may ask what the point is of optimizing something that doesn't seem to take that much CPU time. The answer is in the next screenshot.

[Image: callstack showing BuildTileVertexBuffer being called from the render thread]

This function is called from the render thread, which, given the current implementation in Unreal Engine 4, makes it a considerable deal for performance. For the time being a lot of the rendering work happens on a single rendering thread, so that thread can set the wall time for the frame since there is so much work to be done on it. To prove that, see how the work is distributed across the different threads:

[Image: per-thread distribution of CPU time]

0x1F98 is the thread id of the rendering thread, and 0x1B60 is the thread id of the main thread. It is clearly visible that performance is dominated by the work done on the rendering thread. Thankfully Epic is working on parallel rendering, which will alleviate the issue, but for the time being it is critical to optimize the render thread.

With that out of the way let’s look at the code:

/**
 * Builds a vertex buffer containing the offsets for a set of tiles.
 * @param TileOffsetsRef - The vertex buffer to fill. Must be at least ComputeAlignedTileCount(TileCount) * sizeof(FVector2D) in size.
 * @param Tiles - The tiles which will be drawn.
 * @param TileCount - The number of tiles in the array.
 */
static void BuildTileVertexBuffer( FParticleBufferParamRef TileOffsetsRef, const uint32* Tiles, int32 TileCount )
{
	int32 Index;
	const int32 AlignedTileCount = ComputeAlignedTileCount(TileCount);
	FVector2D* TileOffset = (FVector2D*)RHILockVertexBuffer( TileOffsetsRef, 0, AlignedTileCount * sizeof(FVector2D), RLM_WriteOnly );
	for ( Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = Tiles[Index];
		TileOffset[Index] = FVector2D(
			FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ),
			FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY )
			);
	}
	for ( ; Index < AlignedTileCount; ++Index )
	{
		TileOffset[Index] = FVector2D(100.0f, 100.0f);
	}
	RHIUnlockVertexBuffer( TileOffsetsRef );
}
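To make the address math concrete, here is a minimal, engine-free sketch of what the Fractional/TruncToFloat pair computes: it unpacks a linear tile index into a normalized (X, Y) offset within the simulation texture. The TileCountX/TileCountY values below are placeholders I picked for illustration; the real values in the engine depend on the simulation texture size.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-ins for the engine's tile-count globals.
static const int TileCountX = 64;
static const int TileCountY = 64;

// Mirrors the FMath::Fractional / FMath::TruncToFloat math in
// BuildTileVertexBuffer for non-negative tile indices, where the
// fractional part is x - floor(x).
static void TileIndexToOffset(unsigned TileIndex, float& OutX, float& OutY)
{
	const float Column = (float)TileIndex / (float)TileCountX;
	OutX = Column - std::floor(Column);              // column within the row, normalized
	const float Row = std::floor(Column) / (float)TileCountY;
	OutY = Row - std::floor(Row);                    // row within the texture, normalized
}
```

With a 64x64 tile grid, tile 65 sits at column 1, row 1, so both offsets come out as 1/64.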

At first sight there isn't anything terribly wrong in this function, with the exception of the use of the Index variable. For no good reason a dependency is created between the two loops when in fact they are completely independent; each loop could be controlled by its own variable. Let's rewrite it:

static void BuildTileVertexBuffer( FParticleBufferParamRef TileOffsetsRef, const uint32* Tiles, int32 TileCount )
{
	const int32 AlignedTileCount = ComputeAlignedTileCount(TileCount);
	FVector2D* TileOffset = (FVector2D*)RHILockVertexBuffer( TileOffsetsRef, 0, AlignedTileCount * sizeof(FVector2D), RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = Tiles[Index];
		TileOffset[Index] = FVector2D(
			FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ),
			FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY )
			);
	}
	for ( int32 Index = TileCount; Index < AlignedTileCount; ++Index )
	{
		TileOffset[Index] = FVector2D(100.0f, 100.0f);
	}
	RHIUnlockVertexBuffer( TileOffsetsRef );
}

But shouldn’t the Visual Studio compiler realize that on its own? Let’s look at the assembly with the Intel Architecture Code Analyzer. The old code is on the left, and the new code is on the right.

[Image: Intel Architecture Code Analyzer comparison of the old and new assembly, first optimization]

The change got rid of two move instructions and reduced the number of uops (micro-ops) from 48 to 46. But we shouldn’t expect a huge improvement in performance from that alone. We need to reduce the number of micro-ops further, and hopefully that will also improve the instruction-level parallelism. Looking at the code, in the top loop there isn’t any specific need to construct an FVector2D and assign it to the current tile offset; I could just write the X and Y components directly as two independent operations. I could do the same for the loop below. Here is the new code:

static void BuildTileVertexBuffer( FParticleBufferParamRef TileOffsetsRef, const uint32* Tiles, int32 TileCount )
{
	const int32 AlignedTileCount = ComputeAlignedTileCount(TileCount);
	FVector2D* TileOffset = (FVector2D*)RHILockVertexBuffer( TileOffsetsRef, 0, AlignedTileCount * sizeof(FVector2D), RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = Tiles[Index];
		TileOffset[Index].X = FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX );
		TileOffset[Index].Y = FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY );
	}
	for ( int32 Index = TileCount; Index < AlignedTileCount; ++Index )
	{
		TileOffset[Index].X = 100.0f;
		TileOffset[Index].Y = 100.0f;
	}
	RHIUnlockVertexBuffer( TileOffsetsRef );
}

Would that make any improvement at all? Let’s look at it with the Intel Architecture Code Analyzer. The old code is on the left, the new code on the right.

[Image: Intel Architecture Code Analyzer comparison, second optimization]

Now we went from 48 uops to 41. Our optimization looks good from the static analysis, but let’s profile it.

[Image: profile after the optimizations]

That cut the function's CPU time in half, which is a good win for the rendering thread.

I think the main lesson here is that the compiler may not optimize everything even when you think it should. Compilers in general are fairly limited in the optimizations they perform, so as programmers we need to be aware of their limitations and how to work around them (which might involve trial and error, and looking at the disassembly directly or with a static analysis tool).
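The shared-induction-variable pattern is easy to reproduce outside the engine. Here is a minimal sketch of the before/after shapes (the Fill functions below are hypothetical, not engine code):

```cpp
#include <cassert>

// Before: a single Index variable threads the two loops together, so the
// second loop's starting point depends on where the first loop ended.
void FillDependent(float* Out, const float* In, int Count, int AlignedCount)
{
	int Index;
	for (Index = 0; Index < Count; ++Index)
		Out[Index] = In[Index] * 2.0f;
	for (; Index < AlignedCount; ++Index)   // reuses Index from the loop above
		Out[Index] = 0.0f;
}

// After: each loop owns its induction variable. The loops are now trivially
// independent, which gives the compiler more scheduling freedom.
void FillIndependent(float* Out, const float* In, int Count, int AlignedCount)
{
	for (int Index = 0; Index < Count; ++Index)
		Out[Index] = In[Index] * 2.0f;
	for (int Index = Count; Index < AlignedCount; ++Index)
		Out[Index] = 0.0f;
}
```

Both versions produce identical output; the difference only shows up in the generated code, which is exactly why it pays to verify with a tool rather than assume the compiler will untangle it.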
