Optimizing Unreal Engine 4 asset cooking. CRC.

As I was working on the GPU particle system in Unreal Engine 4, I had to build my own scene. The process of baking the raw assets so they can be properly consumed by the engine is called "cooking" in Unreal Engine 4's lingo. I found myself cooking assets quite often, and one thing I noticed was that the cooking process took a while just to determine which assets needed to be cooked. I noticed this because I would try to cook the assets without actually saving my changes. While this may not seem like a big deal, it is especially annoying for me since I had a pretty bad experience with the build system at DreamWorks R&D: it would take more than 3 minutes to determine that nothing had to be built, and I'm talking about code, not assets. So I decided to look at this. As usual I will do all the profiling on my machine:

MachineSpecs

What I decided to do was to cook my simple particles scene (you can see screenshots of it here) once, and then cook it again with the profiler running. That way I ensure that the cooking process only runs through the verification of what needs to be done, and it should determine that there isn't any asset to build. After doing that, here are the results:

MemCrc_Slow

Since this is an asset baking process it is fairly normal to spend a lot of time dealing with memory, so I wasn't too surprised about the CPU time spent on memset and memcpy. That's something I will try to optimize later, and I know it will take some work. The memset in particular is caused by editor builds filling memory with a debug value (0xCD) on every new allocation and before freeing an allocation. But the call to FCrc::MemCrc_DEPRECATED() seemed like a low-hanging fruit since it is a pretty self-contained process. Let's look at the code, which is also available here:

/**
 * DEPRECATED
 * These tables and functions are deprecated because they're using tables and implementations
 * which give values different from what a user of a typical CRC32 algorithm might expect.
 */

uint32 FCrc::MemCrc_DEPRECATED(const void* InData, int32 Length, uint32 CRC/*=0 */)
{
    // Based on the Slicing-by-8 implementation found here:
    // http://slicing-by-8.sourceforge.net/

    CRC = ~BYTESWAP_ORDER32(CRC);

    const uint8* __restrict Data = (uint8*)InData;

    // First we need to align to 32-bits
    int32 InitBytes = Align(Data, 4) - Data;

    if (Length > InitBytes)
    {
        Length -= InitBytes;

        for (; InitBytes; --InitBytes)
        {
            CRC = (CRC >> 8) ^ CRCTablesSB8_DEPRECATED[0][(CRC & 0xFF) ^ *Data++];
        }

        auto Data4 = (const uint32*)Data;
        for (uint32 Repeat = Length / 8; Repeat; --Repeat)
        {
            uint32 V1 = *Data4++ ^ CRC;
            uint32 V2 = *Data4++;
            CRC =
                CRCTablesSB8_DEPRECATED[7][ V1 & 0xFF] ^
                CRCTablesSB8_DEPRECATED[6][(V1 >> 8) & 0xFF] ^
                CRCTablesSB8_DEPRECATED[5][(V1 >> 16) & 0xFF] ^
                CRCTablesSB8_DEPRECATED[4][ V1 >> 24 ] ^
                CRCTablesSB8_DEPRECATED[3][ V2 & 0xFF] ^
                CRCTablesSB8_DEPRECATED[2][(V2 >> 8) & 0xFF] ^
                CRCTablesSB8_DEPRECATED[1][(V2 >> 16) & 0xFF] ^
                CRCTablesSB8_DEPRECATED[0][ V2 >> 24 ];
        }
        Data = (const uint8*)Data4;

        Length %= 8;
    }

    for (; Length; --Length)
    {
        CRC = (CRC >> 8) ^ CRCTablesSB8_DEPRECATED[0][(CRC & 0xFF) ^ *Data++];
    }

    return BYTESWAP_ORDER32(~CRC);
}

As seen in the code, this is based on the slicing-by-8 version of CRC32, which is way faster than the standard implementation. The reason is that it allows for greater instruction-level parallelism: in the standard implementation each table lookup must be completed before the next index can be computed, while here 8 table lookups can potentially be done in parallel. For precise details, see the paper written about the algorithm here. So that's fairly optimal already.
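
For contrast, here is a minimal sketch of the standard byte-at-a-time CRC32 loop, assuming a single 256-entry lookup table named CRCTable (the name is mine, for illustration). Every iteration's table index depends on the CRC produced by the previous iteration, which is exactly the serial dependency chain that slicing-by-8 breaks up:

extern const uint32 CRCTable[256]; // hypothetical 256-entry CRC32 table

// Standard table-driven CRC32: one lookup per byte, and each lookup's index
// depends on the previous iteration's result, so the loads cannot overlap.
uint32 Crc32Reference(const void* InData, int32 Length, uint32 CRC = 0)
{
    const uint8* Data = (const uint8*)InData;
    CRC = ~CRC;
    for (; Length; --Length)
    {
        CRC = (CRC >> 8) ^ CRCTable[(CRC & 0xFF) ^ *Data++];
    }
    return ~CRC;
}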

But what are the alternatives? The easiest one would be to use SSE 4.2, which includes a CRC32 instruction. While there is a good speedup as shown here, that couldn't be relied upon for PC because SSE 4.2 only became available with Intel's Nehalem architecture, and I don't think AMD supported it before Bulldozer. So the other, more feasible alternative is to get rid of CRC32 and move to some other checksum algorithm.
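
Just to illustrate what that hardware path would look like, here is a sketch using the SSE 4.2 intrinsics. This is an assumption on my part, not engine code, and note that the instruction implements the CRC32-C (Castagnoli) polynomial rather than the table-based polynomial above, so the resulting values would differ anyway:

#include <nmmintrin.h> // SSE 4.2 intrinsics

// Hardware CRC32-C: consume 8 bytes per instruction, then finish the tail
// byte by byte. A runtime check for SSE 4.2 support would still be needed.
uint32 MemCrc32C(const void* InData, int32 Length, uint32 CRC = 0)
{
    const uint8* Data = (const uint8*)InData;
    uint64 Running = ~CRC;
    for (; Length >= 8; Length -= 8, Data += 8)
    {
        Running = _mm_crc32_u64(Running, *(const uint64*)Data);
    }
    uint32 Result = (uint32)Running;
    for (; Length; --Length)
    {
        Result = _mm_crc32_u8(Result, *Data++);
    }
    return ~Result;
}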

There are many alternatives for generating hashes. In this case we need to generate checksums, so there is no need for a cryptographic hash function; even then, Unreal Engine 4 already provides SHA-1. Some of the alternatives are FNV, CityHash, SpookyHash, MurmurHash, etc. They are all valid, and Charles Bloom, who works on Oodle, made a blog post covering the performance and properties of some of those algorithms, so I won't attempt to cover that. But I dug deeper and came across xxHash, written by Yann Collet, who is also the author of the LZ4 compression algorithm. The performance shown by xxHash, the fact that it is used by LZ4, and the permissive license (New BSD License) made up my mind about making an attempt to integrate it.
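
The change itself boils down to a one-line wrapper. Here is a sketch of the hook-up, assuming xxhash.h has been added to the engine's third-party code (reusing the incoming CRC value as the seed is my choice, not anything dictated by the engine):

#include "xxhash.h"

// Route the engine's memory checksum through xxHash. XXH32 takes the buffer,
// its length, and a seed, which maps naturally onto the old interface.
uint32 FCrc::MemCrc_DEPRECATED(const void* InData, int32 Length, uint32 CRC /*=0 */)
{
    return XXH32(InData, Length, CRC);
}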

But not everything is that simple. While xxHash is extremely easy to integrate (like the stb libraries written by Sean Barrett), it has a considerable impact on the engine: since all the hashes change, all assets have to be rebuilt. This may not be a huge issue for me because I just have a few simple scenes, but it has a big impact on teams working on AAA games. From my point of view it is worthwhile to bite the bullet if the cooking time is reduced after that. So, what happens when we replace CRC32 with xxHash in terms of performance?

MemCrc_Fast

That's a 3x improvement over the old code, and in terms of wall-clock time there is almost a 20% improvement over the old version. I think that justifies the one-time recooking of the assets.

This in particular may not make it into the master branch of Unreal Engine 4 because it pulls in a new dependency (xxHash), and I have no idea if that's acceptable. Still, I think it is a meaningful change that some people may want to integrate.

The lesson this time is to think outside the box. If you have read my past posts you may have thought I would take the path of optimizing throughput by helping the compiler or looking at the number of uops. Instead I realized that since this is a pretty self-contained problem, there had to be faster alternatives to CRC32 that fit the needs of the engine.


Unreal Engine 4 GPU particle system micro-optimizations. Part 3.

On part one of the series I made an optimization to BuildParticleVertexBuffer which got rid of an unnecessary check when setting the half-precision floats that store the indices. On part two I made a separate optimization, and in the process I made a stress test to help me understand the issues. It's now time to go back to BuildParticleVertexBuffer since it was the biggest hotspot in the stress test. As usual I will do all the profiling on my machine:

MachineSpecs

I profiled 600 frames where the player was positioned in a corner pointing directly towards the emitters in the scene. Read part two of the series to see a couple of screenshots of the scene.

BuildParticleVertexBuffer-Part3-Slow

Compared to everything else, BuildParticleVertexBuffer is by far the biggest hotspot and something that definitely needs to be improved. What makes matters even worse is that this happens on the render thread, which is critical as mentioned on part two of the series. So let's look at the code of the function.

static void BuildParticleVertexBuffer( FVertexBufferRHIParamRef VertexBufferRHI, const TArray<uint32>& InTiles )
{
	const int32 TileCount = InTiles.Num();
	const int32 IndexCount = TileCount * GParticlesPerTile;
	const int32 BufferSize = IndexCount * sizeof(FParticleIndex);
	const int32 Stride = 1;
	FParticleIndex* RESTRICT ParticleIndices = (FParticleIndex*)RHILockVertexBuffer( VertexBufferRHI, 0, BufferSize, RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = InTiles[Index];
		const FVector2D TileOffset( FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ), FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY ) );
		for ( int32 ParticleY = 0; ParticleY < GParticleSimulationTileSize; ++ParticleY )
		{
			for ( int32 ParticleX = 0; ParticleX < GParticleSimulationTileSize; ++ParticleX )
			{
				const float IndexX = TileOffset.X + ((float)ParticleX / (float)GParticleSimulationTextureSizeX) + (0.5f / (float)GParticleSimulationTextureSizeX);
				const float IndexY = TileOffset.Y + ((float)ParticleY / (float)GParticleSimulationTextureSizeY) + (0.5f / (float)GParticleSimulationTextureSizeY);
				// on some platforms, union and/or bitfield writes to Locked memory are really slow, so use a forced int write instead
				// and in fact one 32-bit write is faster than two uint16 writes (i.e. using .Encoded)
  				FParticleIndex Temp;
  				Temp.X = IndexX;
  				Temp.Y = IndexY;
  				*(uint32*)ParticleIndices = *(uint32*)&Temp;
				// move to next particle
				ParticleIndices += Stride;
			}
		}
	}
	RHIUnlockVertexBuffer( VertexBufferRHI );
}

As you can see, there is a specific comment mentioning that on some platforms writing unions and/or bitfields to locked memory is really slow, and that a forced integer write is faster instead. But what are those platforms? Does it make sense to do this on all of them? I don't know which platforms whoever wrote this code was referring to (one of the downsides of only having access to Unreal Engine 4 on GitHub, and not on Epic's Perforce server, is that you can't use something like Perforce's time-lapse view to see when and who wrote this). If anybody has any specific information about that, please comment or let me know. Anyway, I decided to make a single change: get rid of the temporary FParticleIndex variable used to write the floats IndexX and IndexY, which is then written as a uint32, and instead use SetNoChecks() from the previous part in the series to set the floats directly. That simplifies the code, but doesn't necessarily do the same for the assembly. So here is the code:

static void BuildParticleVertexBuffer( FVertexBufferRHIParamRef VertexBufferRHI, const TArray<uint32>& InTiles )
{
	const int32 TileCount = InTiles.Num();
	const int32 IndexCount = TileCount * GParticlesPerTile;
	const int32 BufferSize = IndexCount * sizeof(FParticleIndex);
	FParticleIndex* RESTRICT ParticleIndices = (FParticleIndex*)RHILockVertexBuffer( VertexBufferRHI, 0, BufferSize, RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = InTiles[Index];
		const FVector2D TileOffset( FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ), FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY ) );
		for ( int32 ParticleY = 0; ParticleY < GParticleSimulationTileSize; ++ParticleY )
		{
			for ( int32 ParticleX = 0; ParticleX < GParticleSimulationTileSize; ++ParticleX )
			{
				const float IndexX = TileOffset.X + ((float)ParticleX / (float)GParticleSimulationTextureSizeX) + (0.5f / (float)GParticleSimulationTextureSizeX);
				const float IndexY = TileOffset.Y + ((float)ParticleY / (float)GParticleSimulationTextureSizeY) + (0.5f / (float)GParticleSimulationTextureSizeY);

#if PLATFORM_WINDOWS
				ParticleIndices->X.SetNoChecks(IndexX);
				ParticleIndices->Y.SetNoChecks(IndexY);

				++ParticleIndices;
#else
				const int32 Stride = 1;
				// on some platforms, union and/or bitfield writes to Locked memory are really slow, so use a forced int write instead
				// and in fact one 32-bit write is faster than two uint16 writes (i.e. using .Encoded)
  				FParticleIndex Temp;
  				Temp.X = IndexX;
  				Temp.Y = IndexY;
  				*(uint32*)ParticleIndices = *(uint32*)&Temp;
				// move to next particle
				ParticleIndices += Stride;
#endif // PLATFORM_WINDOWS
			}
		}
	}
	RHIUnlockVertexBuffer( VertexBufferRHI );
}

I decided to see what the Intel Architecture Code Analyzer had to say in terms of latency.

BuildParticleVertexBuffer-Part3-Latency

That doesn't look good: the estimates said that the latency went from 53 cycles to 68 cycles, mostly due to pressure on port 1. But since those are estimates, it is critical to actually run the code and profile it. These are the results:

BuildParticleVertexBuffer-Part3-Fast

With that simple change I managed to cut the CPU time of the top hotspot in half and get the CPI (cycles per instruction retired) rate to go from 0.704 to 0.417. There are a couple of lessons here. The first one is to never rely 100 percent on static analysis tools. They are useful, but when it comes to performance the only proper way to measure is by profiling at runtime. The other lesson is that you should validate platform assumptions. You cannot make the end user pay for wrong platform generalizations. Make sure that the assumptions are correct, and write specialized code for each platform if necessary. Do not forget that as programmers, at the end of the day we are paid to deliver a good experience. We are not paid to write generalized code that only fits the lowest common denominator; after all, the end user doesn't even care about our code.

Unreal Engine 4 GPU particle system micro-optimizations. Part 2.

On the previous part of the series I showed a micro-optimization to the GPU particle system for the BuildParticleVertexBuffer function. As part of the verification process, as well as to understand how the system worked, I created a scene of a hollow box with 20 GPU particle emitters. I gave them a basic behavior in Cascade and nothing else. Here are two screenshots:

ParticlesClose

ParticlesFar

Using that scene I decided to profile the system. As usual I did this on my machine:

MachineSpecs

This time I profiled 600 frames where the player was positioned in a corner pointing directly towards the emitters.

BuildTileVertexBuffer-Profile

The first hotspot is BuildParticleVertexBuffer, but that's something I covered on the previous part of the series. The next one is BuildTileVertexBuffer. Now you may ask what's the point of optimizing something that doesn't seem to take that much CPU time. The answer is in the next screenshot.

BuildTileVertexBuffer-Callstack

This function is being called from the render thread, which given the current implementation of Unreal Engine 4 is a considerable deal for performance. For the time being a lot of the rendering work happens on a single rendering thread, so that thread can set the wall time for the frame since there is so much work to be done on it. To prove that, see how work is distributed on the different threads:

BuildTileVertexBuffer-Threads

0x1F98 is the thread id of the rendering thread, and 0x1B60 is the thread id of the main thread. It is clearly visible that performance is dominated by the work done on the rendering thread. Thankfully Epic is working on parallel rendering, which will alleviate the issue, but for the time being it is critical to optimize the render thread.

With that out of the way let’s look at the code:

/**
 * Builds a vertex buffer containing the offsets for a set of tiles.
 * @param TileOffsetsRef - The vertex buffer to fill. Must be at least TileCount * sizeof(FVector4) in size.
 * @param Tiles - The tiles which will be drawn.
 * @param TileCount - The number of tiles in the array.
 */
static void BuildTileVertexBuffer( FParticleBufferParamRef TileOffsetsRef, const uint32* Tiles, int32 TileCount )
{
	int32 Index;
	const int32 AlignedTileCount = ComputeAlignedTileCount(TileCount);
	FVector2D* TileOffset = (FVector2D*)RHILockVertexBuffer( TileOffsetsRef, 0, AlignedTileCount * sizeof(FVector2D), RLM_WriteOnly );
	for ( Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = Tiles[Index];
		TileOffset[Index] = FVector2D(
			FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ),
			FMath::Fractional( FMath::TruncFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY )
									  );
	}
	for ( ; Index < AlignedTileCount; ++Index )
	{
		TileOffset[Index] = FVector2D(100.0f, 100.0f);
	}
	RHIUnlockVertexBuffer( TileOffsetsRef );
}

At first sight there isn't anything terribly wrong in this function, with the exception of the use of the Index variable. For no good reason it creates a dependency between the two loops when in fact they are completely independent; each loop could be controlled independently. Let's rewrite it:

static void BuildTileVertexBuffer( FParticleBufferParamRef TileOffsetsRef, const uint32* Tiles, int32 TileCount )
{
	const int32 AlignedTileCount = ComputeAlignedTileCount(TileCount);
	FVector2D* TileOffset = (FVector2D*)RHILockVertexBuffer( TileOffsetsRef, 0, AlignedTileCount * sizeof(FVector2D), RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = Tiles[Index];
		TileOffset[Index] = FVector2D(
			FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ),
			FMath::Fractional( FMath::TruncFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY )
									  );
	}
	for ( int32 Index = TileCount; Index < AlignedTileCount; ++Index )
	{
		TileOffset[Index] = FVector2D(100.0f, 100.0f);
	}
	RHIUnlockVertexBuffer( TileOffsetsRef );
}

But shouldn’t the Visual Studio compiler realize that? Let’s look at the assembly with Intel Architecture Code Analyzer. The old code is on the left, and the new code is on the right.

BuildTileVertexBuffer-FirstOpt

The change got rid of two move instructions, and the number of uops (micro-ops) went from 48 to 46. But we shouldn't expect a huge improvement in performance from just that. We need to reduce the number of micro-ops further, and hopefully that will also improve the instruction-level parallelism. Looking at the code, in the top loop there isn't any specific need to construct an FVector2D and assign it to the current tile offset. I could just write the X and Y components directly as two independent operations, and do the same in the loop below. Here is the new code:

static void BuildTileVertexBuffer( FParticleBufferParamRef TileOffsetsRef, const uint32* Tiles, int32 TileCount )
{
	const int32 AlignedTileCount = ComputeAlignedTileCount(TileCount);
	FVector2D* TileOffset = (FVector2D*)RHILockVertexBuffer( TileOffsetsRef, 0, AlignedTileCount * sizeof(FVector2D), RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = Tiles[Index];
		TileOffset[Index].X = FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX );
		TileOffset[Index].Y = FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY );
	}
	for ( int32 Index = TileCount; Index < AlignedTileCount; ++Index )
	{
		TileOffset[Index].X = 100.0f;
		TileOffset[Index].Y = 100.0f;
	}
	RHIUnlockVertexBuffer( TileOffsetsRef );
}

Would that make any improvements at all? Let’s look at it with Intel Architecture Code Analyzer. The old code on the left, the new code on the right.

BuildTileVertexBuffer-SecondOpt

Now we went from 48 uops to 41. Our optimization looks good from the static analysis, but let’s profile it.

BuildTileVertexBuffer-OptProfile

That cut the CPU time in half, which is a good win for the rendering thread.

I think the main lesson here is that the compiler may not optimize everything, even when you think it should. Compilers in general are pretty limited in terms of optimizations, so as programmers we need to be aware of their limitations and how to work around them (which might involve trial and error, or looking at the disassembly directly or with a static analysis tool).

Unreal Engine 4 GPU particle system micro-optimizations. Part 1.

This time I'm going to go over two optimizations, which were the very first that I made and that just got integrated into Unreal Engine 4. Hopefully they will ship with version 4.4. When I first got the Unreal Engine 4 license, the most impressive demo available at the time was the Effects Cave demo. The demo looks like this:

So I started by profiling the first 120 frames of the demo to get a sense of how CPU performance was being spent. As usual I profiled on my machine, which has the following specs:

MachineSpecs

Below is the result of the profiling run. Keep in mind that this was done on Unreal Engine 4.1:

BuildParticleVertexBuffer-Slow

It isn't that often that you get to profile something with such an obvious hotspot. The issue here is pretty clear in terms of performance: I had to focus on BuildParticleVertexBuffer.

To make some sense of this it is good to understand how the GPU particle system works. Unlike the traditional CPU particle system, the GPU particle system allows a high particle count to be rendered efficiently, at the cost of some flexibility. The actual emission of the particles still happens on the CPU, but once they are emitted the rest of the process happens on the GPU. The GPU particle system is implemented with tiles (4×4 tiles by default) in a set of textures (1024×1024 by default): two textures for position data and one for velocity, and those textures are double buffered. They are indexed by a vertex buffer where each index is stored as two half-precision floats that address into those textures. Particle collisions are computed using the information in the depth buffer. In particular, BuildParticleVertexBuffer creates the vertex buffer used to store the indices of the particles for a given set of tiles and fills it up with the data. Let's look at how this is done:

static void BuildParticleVertexBuffer( FVertexBufferRHIParamRef VertexBufferRHI, const TArray<uint32>& InTiles )
{
	const int32 TileCount = InTiles.Num();
	const int32 IndexCount = TileCount * GParticlesPerTile;
	const int32 BufferSize = IndexCount * sizeof(FParticleIndex);
	const int32 Stride = 1;
	FParticleIndex* RESTRICT ParticleIndices = (FParticleIndex*)RHILockVertexBuffer( VertexBufferRHI, 0, BufferSize, RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = InTiles[Index];
		const FVector2D TileOffset( FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ), FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY ) );
		for ( int32 ParticleY = 0; ParticleY < GParticleSimulationTileSize; ++ParticleY )
		{
			for ( int32 ParticleX = 0; ParticleX < GParticleSimulationTileSize; ++ParticleX )
			{
				const float IndexX = TileOffset.X + ((float)ParticleX / (float)GParticleSimulationTextureSizeX) + (0.5f / (float)GParticleSimulationTextureSizeX);
				const float IndexY = TileOffset.Y + ((float)ParticleY / (float)GParticleSimulationTextureSizeY) + (0.5f / (float)GParticleSimulationTextureSizeY);
				// on some platforms, union and/or bitfield writes to Locked memory are really slow, so use a forced int write instead
				// and in fact one 32-bit write is faster than two uint16 writes (i.e. using .Encoded)
  				FParticleIndex Temp;
  				Temp.X = IndexX;
  				Temp.Y = IndexY;
  				*(uint32*)ParticleIndices = *(uint32*)&Temp;
				// move to next particle
				ParticleIndices += Stride;
			}
		}
	}
	RHIUnlockVertexBuffer( VertexBufferRHI );
}

As you can see there isn't anything really suspicious about it; we are just generating the coordinates that index a particle. The compiler already generated fairly efficient code in terms of the input since most of the parameters are set at compile time. I decided to look at the assembly and throughput with the Intel Architecture Code Analyzer. Here was part of the output:

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - C:\dev\UnrealEngine\master\Engine\Binaries\Win64\UE4Client-Win64-Test.exe
Binary Format - 64Bit
Architecture  - SNB
Analysis Type - Throughput

*******************************************************************
Intel(R) Architecture Code Analyzer Mark Number 1
*******************************************************************

Throughput Analysis Report
--------------------------
Block Throughput: 31.55 Cycles       Throughput Bottleneck: FrontEnd, Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 27.4    0.0 | 27.4 | 12.5   6.0  | 12.5   6.0  | 13.0 | 31.2 |
-------------------------------------------------------------------------

What surprised me in the assembly was seeing comparisons against constants that weren't clearly visible in the source code. Obviously I was missing something. Looking at the code, there wasn't anything immediately obvious that would cause comparisons against constants other than the ones related to the GPU particle simulation parameters, such as the size of the textures. But then I noticed something particular in the code related to the type of the data. As I mentioned previously, the indexing data is two half-precision floats. They are defined in FParticleIndex:

/**
 * Per-particle information stored in a vertex buffer for drawing GPU sprites.
 */
struct FParticleIndex
{
	/** The X coordinate of the particle within the texture. */
	FFloat16 X;
	/** The Y coordinate of the particle within the texture. */
	FFloat16 Y;
};

But as you can see from the function, the values set on the temporary FParticleIndex are two single-precision floats. No explicit cast was done in the code, so the conversion had to happen on assignment. That's why I decided to see how the assignment worked, and came to this function, which is called from the assignment operator:

FORCEINLINE void FFloat16::Set( float FP32Value )
{
	FFloat32 FP32(FP32Value);

	// Copy sign-bit
	Components.Sign = FP32.Components.Sign;

	// Check for zero, denormal or too small value.
	if ( FP32.Components.Exponent <= 112 )			// Too small exponent? (0+127-15)
	{
		// Set to 0.
		Components.Exponent = 0;
		Components.Mantissa = 0;
	}
	// Check for INF or NaN, or too high value
	else if ( FP32.Components.Exponent >= 143 )		// Too large exponent? (31+127-15)
	{
		// Set to 65504.0 (max value)
		Components.Exponent = 30;
		Components.Mantissa = 1023;
	}
	// Handle normal number.
	else
	{
		Components.Exponent = int32(FP32.Components.Exponent) - 127 + 15;
		Components.Mantissa = uint16(FP32.Components.Mantissa >> 13);
	}
}

As you can see, this function takes care of transforming single-precision floats that are out of the range of a half-precision float. Those were the comparisons I was seeing. But is this meaningful in any way for us? Of course! Thinking about the data is critical, and when you do that in this case, there is no need for those checks: our indexing data fits perfectly within a half-precision float (with the default 1024×1024 simulation textures the indices fall roughly in [0.0005, 1), comfortably inside the normal range of a half float), so we should be assigning the values directly. To do that I wrote the following function:

FORCEINLINE void FFloat16::SetNoChecks( const float FP32Value )
{
	const FFloat32 FP32(FP32Value);

	// Make absolutely sure that you never pass in a single precision floating
	// point value that may actually need the checks. If you are not 100% sure
	// of that just use Set().

	Components.Sign = FP32.Components.Sign;
	Components.Exponent = int32(FP32.Components.Exponent) - 127 + 15;
	Components.Mantissa = uint16(FP32.Components.Mantissa >> 13);
}
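
For reference, here is a sketch of the component layouts that this code manipulates. This is my own illustration following the standard IEEE 754 binary16/binary32 layouts, not the engine's actual definitions, which live in its Float16/Float32 headers:

struct FFloat16Components // 16 bits total
{
	uint16 Mantissa : 10; // fraction bits
	uint16 Exponent : 5;  // biased by 15
	uint16 Sign : 1;
};

struct FFloat32Components // 32 bits total
{
	uint32 Mantissa : 23; // fraction bits
	uint32 Exponent : 8;  // biased by 127
	uint32 Sign : 1;
};

That makes the arithmetic in the functions above clear: the exponent is rebased from a bias of 127 to a bias of 15, and only the top 10 of the 23 mantissa bits are kept.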

Now that those checks are gone performance should be better. Again I used Intel Architecture Code Analyzer, and here is the output.

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - C:\dev\UnrealEngine\master\Engine\Binaries\Win64\UE4Client-Win64-Test.exe
Binary Format - 64Bit
Architecture  - SNB
Analysis Type - Throughput

*******************************************************************
Intel(R) Architecture Code Analyzer Mark Number 1
*******************************************************************

Throughput Analysis Report
--------------------------
Block Throughput: 23.40 Cycles       Throughput Bottleneck: Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 22.0    0.0 | 21.9 | 7.5    4.5  | 7.5    4.5  | 6.0  | 23.1 |
-------------------------------------------------------------------------

As you can see, the throughput of the block went from 31.55 to 23.40 cycles, and the front end is no longer a bottleneck. Time to profile and see if there is a difference at runtime.

BuildParticleVertexBuffer-Fast

That small change yielded a performance win of ~27% over the original code with ~45% fewer instructions retired. A good performance win for such a small change.

As with all the optimizations I have done so far, understanding the input and output data is what allowed me to determine how to optimize. I can't stress enough how important it is to understand the data. If you don't understand the data, the odds of improving performance are very low.

On the next part I will talk about another micro-optimization closely related to this one.