Unreal Engine 4 GPU particle system micro-optimizations. Part 1.

This time I’m going to go over two optimizations, the very first ones I made, which just got integrated into Unreal Engine 4. Hopefully they will ship with version 4.4. When I first got the Unreal Engine 4 license, the most impressive demo available at the time was the Effects Cave demo.

So I started profiling the first 120 frames of the demo to get a sense of how CPU time was being spent. As usual, I profiled on my machine, which has the following specs:

[Image: machine specs]

Below is the result of the profiling run. Keep in mind that this was done on Unreal Engine 4.1:

[Image: profile showing BuildParticleVertexBuffer as the top hotspot]

It isn’t often that you get to profile something with such an obvious hotspot. The performance issue was pretty clear: I had to focus on BuildParticleVertexBuffer.

To make sense of this it helps to see how the GPU particle system works. Unlike the traditional CPU particle system, the GPU particle system allows a high particle count to be rendered efficiently, at the cost of some flexibility. The actual emission of the particles still happens on the CPU, but once they are emitted the rest of the process happens on the GPU. The GPU particle system stores its state in a set of textures (1024×1024 by default) divided into tiles (4×4 by default). There are two textures for position data and one for velocity, and those textures are double buffered. They are indexed through a vertex buffer in which each particle's index is stored as two half precision floats that address into those textures. Particle collision uses the information in the depth buffer. In particular, BuildParticleVertexBuffer creates the vertex buffer that stores the particle indices for a given set of tiles, and fills it with the data. Let’s look at how this is done:

static void BuildParticleVertexBuffer( FVertexBufferRHIParamRef VertexBufferRHI, const TArray<uint32>& InTiles )
{
	const int32 TileCount = InTiles.Num();
	const int32 IndexCount = TileCount * GParticlesPerTile;
	const int32 BufferSize = IndexCount * sizeof(FParticleIndex);
	const int32 Stride = 1;
	FParticleIndex* RESTRICT ParticleIndices = (FParticleIndex*)RHILockVertexBuffer( VertexBufferRHI, 0, BufferSize, RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = InTiles[Index];
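		// The fractional part of TileIndex / TileCountX gives the tile's normalized X
		// offset; the integer part, divided by TileCountY, gives its normalized Y offset.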
		const FVector2D TileOffset( FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ), FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY ) );
		for ( int32 ParticleY = 0; ParticleY < GParticleSimulationTileSize; ++ParticleY )
		{
			for ( int32 ParticleX = 0; ParticleX < GParticleSimulationTileSize; ++ParticleX )
			{
				const float IndexX = TileOffset.X + ((float)ParticleX / (float)GParticleSimulationTextureSizeX) + (0.5f / (float)GParticleSimulationTextureSizeX);
				const float IndexY = TileOffset.Y + ((float)ParticleY / (float)GParticleSimulationTextureSizeY) + (0.5f / (float)GParticleSimulationTextureSizeY);
				// on some platforms, union and/or bitfield writes to Locked memory are really slow, so use a forced int write instead
				// and in fact one 32-bit write is faster than two uint16 writes (i.e. using .Encoded)
				FParticleIndex Temp;
				Temp.X = IndexX;
				Temp.Y = IndexY;
				*(uint32*)ParticleIndices = *(uint32*)&Temp;
				// move to next particle
				ParticleIndices += Stride;
			}
		}
	}
	RHIUnlockVertexBuffer( VertexBufferRHI );
}

As you can see, there isn’t anything really suspicious about it; we are just generating the coordinates that index each particle. The compiler already generated fairly efficient code, since most of the parameters are known at compile time. So I decided to look at the generated assembly and its throughput with the Intel Architecture Code Analyzer (IACA).
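
IACA analyzes the instructions between special marker bytes that you compile into the binary, so the block of interest has to be bracketed by hand. Here is a minimal sketch of how the inner loop might be marked, assuming the iacaMarks.h header that ships with the tool; on 64-bit MSVC the IACA_VC64_* variants are needed because x64 MSVC has no inline assembly:

#include <iacaMarks.h>

for ( int32 ParticleX = 0; ParticleX < GParticleSimulationTileSize; ++ParticleX )
{
	IACA_VC64_START		// start marker for the analyzed block
	const float IndexX = TileOffset.X + ((float)ParticleX / (float)GParticleSimulationTextureSizeX) + (0.5f / (float)GParticleSimulationTextureSizeX);
	// ... rest of the loop body as above ...
	IACA_VC64_END		// end marker for the analyzed block
}

Here was part of the output: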

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - C:\dev\UnrealEngine\master\Engine\Binaries\Win64\UE4Client-Win64-Test.exe
Binary Format - 64Bit
Architecture  - SNB
Analysis Type - Throughput

*******************************************************************
Intel(R) Architecture Code Analyzer Mark Number 1
*******************************************************************

Throughput Analysis Report
--------------------------
Block Throughput: 31.55 Cycles       Throughput Bottleneck: FrontEnd, Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 27.4    0.0 | 27.4 | 12.5   6.0  | 12.5   6.0  | 13.0 | 31.2 |
-------------------------------------------------------------------------

What surprised me in the assembly was seeing comparisons against constants that weren’t clearly visible in the source code. Obviously I was missing something. Looking at the code, there was nothing immediately obvious that would cause comparisons against constants other than the GPU particle simulation parameters, such as the sizes of the textures. But then I noticed something particular about the type of the data. As I mentioned previously, the indexing data is two half precision floats. They are defined in FParticleIndex:

/**
 * Per-particle information stored in a vertex buffer for drawing GPU sprites.
 */
struct FParticleIndex
{
	/** The X coordinate of the particle within the texture. */
	FFloat16 X;
	/** The Y coordinate of the particle within the texture. */
	FFloat16 Y;
};
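
FFloat16 (and the FFloat32 helper that appears below) exposes the IEEE 754 bit layout through a Components bitfield. As a rough sketch of the relevant parts (my own reconstruction for a little-endian platform, not the full engine declaration):

struct FFloat16
{
	union
	{
		struct
		{
			uint16	Mantissa : 10;	// 10-bit mantissa
			uint16	Exponent : 5;	// 5-bit exponent, bias 15
			uint16	Sign : 1;	// sign bit
		} Components;
		uint16	Encoded;	// the raw 16-bit value
	};
};

struct FFloat32
{
	union
	{
		struct
		{
			uint32	Mantissa : 23;	// 23-bit mantissa
			uint32	Exponent : 8;	// 8-bit exponent, bias 127
			uint32	Sign : 1;	// sign bit
		} Components;
		float	FloatValue;	// the raw 32-bit value
	};
};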

But as you can see from the function, the values written to the temporary FParticleIndex are two single precision floats. No explicit cast appears in the code, so the conversion had to happen on assignment. I decided to see how that assignment worked and came to this function, which is called from the assignment operator:

FORCEINLINE void FFloat16::Set( float FP32Value )
{
	FFloat32 FP32(FP32Value);

	// Copy sign-bit
	Components.Sign = FP32.Components.Sign;

	// Check for zero, denormal or too small value.
	if ( FP32.Components.Exponent <= 112 )			// Too small exponent? (0+127-15)
	{
		// Set to 0.
		Components.Exponent = 0;
		Components.Mantissa = 0;
	}
	// Check for INF or NaN, or too high value
	else if ( FP32.Components.Exponent >= 143 )		// Too large exponent? (31+127-15)
	{
		// Set to 65504.0 (max value)
		Components.Exponent = 30;
		Components.Mantissa = 1023;
	}
	// Handle normal number.
	else
	{
		Components.Exponent = int32(FP32.Components.Exponent) - 127 + 15;
		Components.Mantissa = uint16(FP32.Components.Mantissa >> 13);
	}
}

As you can see, this function takes care of single precision floats that are out of the range of a half precision float. Those were the comparisons I was seeing. But is this meaningful in any way for us? Of course! Thinking about the data is critical, and when you do that here, there is no need for those checks: our indexing data always fits within the normal range of a half precision float.
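
To see why, look at the two magic numbers. The half exponent is computed as the FP32 exponent - 127 + 15, so any FP32 exponent of 112 or less underflows to zero and any exponent of 143 or more lands in the Inf/NaN encoding. The indices we generate lie between the half-texel offset 0.5/1024 = 2^-11 and just under 1.0, whose biased FP32 exponents are 116 and 126 respectively, comfortably inside that range. A quick standalone check (my own sketch, not engine code):

#include <cstdio>
#include <cstring>
#include <cstdint>

// Extract the biased 8-bit exponent field from a single precision float.
static uint32_t Fp32Exponent( float Value )
{
	uint32_t Bits;
	std::memcpy( &Bits, &Value, sizeof(Bits) );
	return (Bits >> 23) & 0xFF;
}

int main()
{
	// Smallest and largest coordinates BuildParticleVertexBuffer produces
	// for a 1024x1024 simulation texture.
	const float MinIndex = 0.5f / 1024.0f;			// half-texel offset, 2^-11
	const float MaxIndex = 1.0f - 0.5f / 1024.0f;	// just under 1.0

	printf( "min exponent: %u\n", Fp32Exponent( MinIndex ) );	// prints 116
	printf( "max exponent: %u\n", Fp32Exponent( MaxIndex ) );	// prints 126
	return 0;
}

Since the checks can never fire for this data, we should be assigning the values directly. To do that I wrote the following function: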

FORCEINLINE void FFloat16::SetNoChecks( const float FP32Value )
{
	const FFloat32 FP32(FP32Value);

	// Make absolutely sure that you never pass in a single precision floating
	// point value that may actually need the checks. If you are not 100% sure
	// of that just use Set().

	Components.Sign = FP32.Components.Sign;
	Components.Exponent = int32(FP32.Components.Exponent) - 127 + 15;
	Components.Mantissa = uint16(FP32.Components.Mantissa >> 13);
}
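
With that in place, the inner-loop assignment in BuildParticleVertexBuffer changes along these lines (a sketch of the idea, not necessarily the exact engine change):

// The coordinates are known to be in range, so skip the range checks
// but keep the single 32-bit write.
FParticleIndex Temp;
Temp.X.SetNoChecks( IndexX );
Temp.Y.SetNoChecks( IndexY );
*(uint32*)ParticleIndices = *(uint32*)&Temp;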

Now that those checks are gone, performance should be better. Again I used the Intel Architecture Code Analyzer, and here is the output:

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - C:\dev\UnrealEngine\master\Engine\Binaries\Win64\UE4Client-Win64-Test.exe
Binary Format - 64Bit
Architecture  - SNB
Analysis Type - Throughput

*******************************************************************
Intel(R) Architecture Code Analyzer Mark Number 1
*******************************************************************

Throughput Analysis Report
--------------------------
Block Throughput: 23.40 Cycles       Throughput Bottleneck: Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 22.0    0.0 | 21.9 | 7.5    4.5  | 7.5    4.5  | 6.0  | 23.1 |
-------------------------------------------------------------------------

As you can see, the throughput of the block went from 31.55 to 23.40 cycles, and the front end is no longer a bottleneck. Time to profile and see whether there is a difference at runtime.

[Image: profile showing BuildParticleVertexBuffer after the optimization]

That small change yielded a performance win of ~27% over the original code, with ~45% fewer instructions retired. A good performance win for such a small change.

As with all the optimizations I have done so far, understanding the input and output data is what allowed me to determine how to optimize. I can’t stress enough how important it is to understand the data. If you don’t understand the data, the odds of improving performance are very low.

In the next part I will talk about another micro-optimization closely related to this one.
