Optimizing Unreal Engine 4 asset cooking. CRC.

As I was working on the GPU particle system in Unreal Engine 4, I had to build my own scene. The process of baking the raw assets so they can be properly consumed by the engine is called “cooking” in Unreal Engine 4’s lingo. I found myself cooking assets quite often, and one thing I noticed was that the cooking process took a while to determine which assets needed to be cooked. The reason I noticed was that I would try to cook the assets without actually saving my changes. While this may not seem like a big deal, it is especially annoying for me since I had a pretty bad experience working with the build system at Dreamworks R&D. That build system would take more than 3 minutes to determine that nothing had to be built, and I’m talking about code, not assets. So I decided to look at this. As usual I will do all the profiling on my machine:

MachineSpecs

What I decided to do was cook my simple particles scene (you can see screenshots of it here) once, and then cook it again with the profiler running. That way I ensure that the cooking process only runs through the step of verifying what needs to be done, and it should determine that there isn’t any asset to build. After doing that, here are the results:

MemCrc_Slow

Since this is an asset baking process it is fairly normal to spend a lot of time dealing with memory, so I wasn’t too surprised about the CPU time spent on memset and memcpy. That’s something I will try to optimize later, and I know it will take some work. The memset in particular is caused by the editor build, which fills memory with a debug value (0xCD) on any new allocation and before freeing an allocation. But the call to FCrc::MemCrc_DEPRECATED() seemed like a low-hanging fruit since that’s a pretty self-contained process. Let’s look at the code, which is also available here:

/**
 * DEPRECATED
 * These tables and functions are deprecated because they're using tables and implementations
 * which give values different from what a user of a typical CRC32 algorithm might expect.
 */

uint32 FCrc::MemCrc_DEPRECATED(const void* InData, int32 Length, uint32 CRC/*=0 */)
{
    // Based on the Slicing-by-8 implementation found here:
    // http://slicing-by-8.sourceforge.net/

    CRC = ~BYTESWAP_ORDER32(CRC);

    const uint8* __restrict Data = (uint8*)InData;

    // First we need to align to 32-bits
    int32 InitBytes = Align(Data, 4) - Data;

    if (Length > InitBytes)
    {
        Length -= InitBytes;

        for (; InitBytes; --InitBytes)
        {
            CRC = (CRC >> 8) ^ CRCTablesSB8_DEPRECATED[0][(CRC & 0xFF) ^ *Data++];
        }

        auto Data4 = (const uint32*)Data;
        for (uint32 Repeat = Length / 8; Repeat; --Repeat)
        {
            uint32 V1 = *Data4++ ^ CRC;
            uint32 V2 = *Data4++;
            CRC =
                CRCTablesSB8_DEPRECATED[7][ V1 & 0xFF] ^
                CRCTablesSB8_DEPRECATED[6][(V1 >> 8) & 0xFF] ^
                CRCTablesSB8_DEPRECATED[5][(V1 >> 16) & 0xFF] ^
                CRCTablesSB8_DEPRECATED[4][ V1 >> 24 ] ^
                CRCTablesSB8_DEPRECATED[3][ V2 & 0xFF] ^
                CRCTablesSB8_DEPRECATED[2][(V2 >> 8) & 0xFF] ^
                CRCTablesSB8_DEPRECATED[1][(V2 >> 16) & 0xFF] ^
                CRCTablesSB8_DEPRECATED[0][ V2 >> 24 ];
        }
        Data = (const uint8*)Data4;

        Length %= 8;
    }

    for (; Length; --Length)
    {
        CRC = (CRC >> 8) ^ CRCTablesSB8_DEPRECATED[0][(CRC & 0xFF) ^ *Data++];
    }

    return BYTESWAP_ORDER32(~CRC);
}

As seen in the code, this is based on the slicing-by-8 version of CRC32, which is way faster than the standard implementation. The reason is that it allows for greater instruction-level parallelism, because in the standard implementation each table lookup must be completed before computing the next index. In this case 8 table lookups can potentially be done in parallel. For the precise details there is a paper written about the algorithm here. So that’s fairly optimal already.

But what are the alternatives? The easiest one would be to use SSE 4.2, which includes the CRC32 instruction. While there is a good speedup as shown here, that wouldn’t be viable on PC because SSE 4.2 only became available with the Nehalem architecture, and I don’t think AMD supported it before Bulldozer. So the other, more feasible alternative is to get rid of CRC32 and move to some other checksum algorithm.

There are many alternatives for generating hashes. In this case we need to generate checksums, so there is no need for any cryptographic hash function (and even then, Unreal Engine 4 already provides SHA-1). Some of the alternatives are FNV, CityHash, SpookyHash, MurmurHash, etc. They are all valid, and Charles Bloom, who works on Oodle, made a blog post covering the performance and properties of some of those algorithms, so I won’t attempt to cover that. But I dug deeper and came across xxHash, written by Yann Collet, who is also the author of the LZ4 compression algorithm. The performance shown by xxHash, the fact that it was used by LZ4, and the permissive license (New BSD License) made up my mind about making an attempt to integrate it.
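To give an idea of how small the change is, here is a sketch of what a drop-in replacement could look like. This is only an illustration under the assumption that xxhash.h/xxhash.c are added to the build; it is not the exact patch I made, and the wrapper name is hypothetical.

#include "xxhash.h" // third-party xxHash header, assumed to be added to the engine's build

// Hypothetical replacement for the CRC-based memory hash. Note that the values it
// produces are completely different from CRC32, which is why every cooked asset
// gets invalidated the first time this runs.
uint32 MemHashFast(const void* InData, int32 Length, uint32 Seed /*= 0*/)
{
	return XXH32(InData, static_cast<size_t>(Length), Seed);
}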

But not everything is that simple. While xxHash is extremely easy to integrate (like the stb libraries written by Sean Barrett), it has a considerable impact on the engine. The fact that all the hashes will change implies that all assets have to be rebuilt. This may not be a huge issue for me because I just have a few simple scenes, but it has a big impact on teams working on AAA games. From my point of view it is worthwhile to bite the bullet if the cooking time is reduced after that. So, what happens when we replace CRC32 with xxHash in terms of performance?

MemCrc_Fast

That’s a 3x improvement over the old code. In terms of wall-clock time there is almost a 20% improvement over the old version. I think that justifies re-cooking the assets.

This change in particular may not make it into the master branch of Unreal Engine 4 because it pulls in a new dependency (xxHash) and I have no idea if that’s acceptable. Still, I think it is a meaningful change that some people may want to integrate.

The lesson this time is to think outside the box. If you have seen my past posts you may have thought that I would take the path of optimizing the throughput by helping the compiler or looking at the number of uops. But instead I realized that since this is a pretty self-contained problem, there had to be faster alternatives to CRC32 that fit the needs of the engine.


Unreal Engine 4 GPU particle system micro-optimizations. Part 3.

On part one of the series I made an optimization to BuildParticleVertexBuffer which got rid of an unnecessary check when setting the half-precision floats that store the indices. On part two I made a separate optimization, but in the process I made a stress test to help me understand the issues. It’s now time to go back to BuildParticleVertexBuffer since it was the biggest hotspot in the stress test. As usual I will do all the profiling on my machine:

MachineSpecs

I profiled 600 frames where the player was positioned in a corner pointing directly towards the emitters in the scene. Read part two of the series to see a couple of screenshots of the scene.

BuildParticleVertexBuffer-Part3-Slow

Compared to everything else BuildParticleVertexBuffer is by far the biggest hotspot and something that definitely needs to be improved. What makes matters even worse is that this happens on the render thread, which is critical as mentioned on part two of the series. So let’s look at the code of the function.

static void BuildParticleVertexBuffer( FVertexBufferRHIParamRef VertexBufferRHI, const TArray<uint32>& InTiles )
{
	const int32 TileCount = InTiles.Num();
	const int32 IndexCount = TileCount * GParticlesPerTile;
	const int32 BufferSize = IndexCount * sizeof(FParticleIndex);
	const int32 Stride = 1;
	FParticleIndex* RESTRICT ParticleIndices = (FParticleIndex*)RHILockVertexBuffer( VertexBufferRHI, 0, BufferSize, RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = InTiles[Index];
		const FVector2D TileOffset( FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ), FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY ) );
		for ( int32 ParticleY = 0; ParticleY < GParticleSimulationTileSize; ++ParticleY )
		{
			for ( int32 ParticleX = 0; ParticleX < GParticleSimulationTileSize; ++ParticleX )
			{
				const float IndexX = TileOffset.X + ((float)ParticleX / (float)GParticleSimulationTextureSizeX) + (0.5f / (float)GParticleSimulationTextureSizeX);
				const float IndexY = TileOffset.Y + ((float)ParticleY / (float)GParticleSimulationTextureSizeY) + (0.5f / (float)GParticleSimulationTextureSizeY);
				// on some platforms, union and/or bitfield writes to Locked memory are really slow, so use a forced int write instead
				// and in fact one 32-bit write is faster than two uint16 writes (i.e. using .Encoded)
  				FParticleIndex Temp;
  				Temp.X = IndexX;
  				Temp.Y = IndexY;
  				*(uint32*)ParticleIndices = *(uint32*)&Temp;
				// move to next particle
				ParticleIndices += Stride;
			}
		}
	}
	RHIUnlockVertexBuffer( VertexBufferRHI );
}

As you can see there is a specific comment that mentions that on some platforms writing unions and/or bitfields to locked memory is really slow, and that a forced integer write is going to be faster instead. But what are those platforms? Does it make sense to do that for all platforms? I don’t know which exact platforms whoever wrote this code was referring to (one of the downsides of having access to Unreal Engine 4 only on Github and not on Epic’s Perforce server is that you can’t use something like the time-lapse view in Perforce to see when and who wrote this). If anybody has any specific information about that please comment or let me know. Anyway, I decided that I would make a single change. I would get rid of the temporary FParticleIndex variable used to write the floats IndexX and IndexY, which is then written as a uint32. Instead of that I would use SetNoChecks() from the previous part in the series, and set the floats directly. That simplifies the code but doesn’t necessarily do the same for the assembly. So here is the code:

static void BuildParticleVertexBuffer( FVertexBufferRHIParamRef VertexBufferRHI, const TArray<uint32>& InTiles )
{
	const int32 TileCount = InTiles.Num();
	const int32 IndexCount = TileCount * GParticlesPerTile;
	const int32 BufferSize = IndexCount * sizeof(FParticleIndex);
	FParticleIndex* RESTRICT ParticleIndices = (FParticleIndex*)RHILockVertexBuffer( VertexBufferRHI, 0, BufferSize, RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = InTiles[Index];
		const FVector2D TileOffset( FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ), FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY ) );
		for ( int32 ParticleY = 0; ParticleY < GParticleSimulationTileSize; ++ParticleY )
		{
			for ( int32 ParticleX = 0; ParticleX < GParticleSimulationTileSize; ++ParticleX )
			{
				const float IndexX = TileOffset.X + ((float)ParticleX / (float)GParticleSimulationTextureSizeX) + (0.5f / (float)GParticleSimulationTextureSizeX);
				const float IndexY = TileOffset.Y + ((float)ParticleY / (float)GParticleSimulationTextureSizeY) + (0.5f / (float)GParticleSimulationTextureSizeY);

#if PLATFORM_WINDOWS
				ParticleIndices->X.SetNoChecks(IndexX);
				ParticleIndices->Y.SetNoChecks(IndexY);

				++ParticleIndices;
#else
				const int32 Stride = 1;
				// on some platforms, union and/or bitfield writes to Locked memory are really slow, so use a forced int write instead
				// and in fact one 32-bit write is faster than two uint16 writes (i.e. using .Encoded)
  				FParticleIndex Temp;
  				Temp.X = IndexX;
  				Temp.Y = IndexY;
  				*(uint32*)ParticleIndices = *(uint32*)&Temp;
				// move to next particle
				ParticleIndices += Stride;
#endif // PLATFORM_WINDOWS
			}
		}
	}
	RHIUnlockVertexBuffer( VertexBufferRHI );
}

I decided to see what Intel Architecture Code Analyzer had to say in terms of latency.

BuildParticleVertexBuffer-Part3-Latency

That doesn’t look good: the estimates say that the latency went from 53 cycles to 68 cycles, mostly due to pressure on port 1. But since those are estimates, it is critical to actually run the code and profile it. These are the results:

BuildParticleVertexBuffer-Part3-Fast

With that simple change I managed to cut the CPU time of the top hotspot in half and get the CPI rate (cycles per instruction retired) to go from 0.704 to 0.417. There are a couple of lessons here. The first one is to never rely 100 percent on static analysis tools. They are useful, but when it comes to performance the only proper way to measure is by profiling at runtime. The other lesson is that you should validate the platform assumptions. You cannot make the end user pay for wrong platform generalizations. Make sure that the assumptions are correct, and write specialized code for each platform if necessary. Do not forget that as programmers, at the end of the day we are paid to deliver a good experience. We are not paid to write generalized code that only fits the lowest common denominator; after all, the end user doesn’t even care about our code.

Unreal Engine 4 GPU particle system micro-optimizations. Part 2.

On the previous part of the series I showed a micro-optimization to the GPU particle system for the BuildParticleVertexBuffer function. As part of the verification process as well as to understand how the system worked I created a scene of a hollow box with 20 GPU particle emitters. I gave them a basic behavior in Cascade and nothing else. Here are two screenshots:

ParticlesClose

ParticlesFar

Using that scene I decided to profile the system. As usual I did this on my machine:

MachineSpecs

This time I profiled 600 frames where the player was positioned in a corner pointing directly towards the emitters.

BuildTileVertexBuffer-Profile

The first hotspot is on BuildParticleVertexBuffer but that’s something I covered on the previous part of the series. The next one is BuildTileVertexBuffer. Now you may ask what’s the point of optimizing something that doesn’t seem to take that much CPU time. The answer is the next screenshot.

BuildTileVertexBuffer-Callstack

This function is being called from the render thread which, given the current implementation in Unreal Engine 4, is a considerable deal for performance. For the time being a lot of the rendering work happens on a single rendering thread, so that thread can set the wall time for the frame since there is a lot of work to be done on it. To prove that, see how work is distributed on the different threads:

BuildTileVertexBuffer-Threads

0x1F98 is the thread id for the rendering thread, and 0x1B60 is the thread id for the main thread. It is clearly visible that performance is dominated by the work done on the rendering thread. Thankfully Epic is working on parallel rendering which will alleviate the issue, but for the time being it is critical to optimize the render thread.

With that out of the way let’s look at the code:

/**
 * Builds a vertex buffer containing the offsets for a set of tiles.
 * @param TileOffsetsRef - The vertex buffer to fill. Must be at least TileCount * sizeof(FVector4) in size.
 * @param Tiles - The tiles which will be drawn.
 * @param TileCount - The number of tiles in the array.
 */
static void BuildTileVertexBuffer( FParticleBufferParamRef TileOffsetsRef, const uint32* Tiles, int32 TileCount )
{
	int32 Index;
	const int32 AlignedTileCount = ComputeAlignedTileCount(TileCount);
	FVector2D* TileOffset = (FVector2D*)RHILockVertexBuffer( TileOffsetsRef, 0, AlignedTileCount * sizeof(FVector2D), RLM_WriteOnly );
	for ( Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = Tiles[Index];
		TileOffset[Index] = FVector2D(
			FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ),
			FMath::Fractional( FMath::TruncFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY )
									  );
	}
	for ( ; Index < AlignedTileCount; ++Index )
	{
		TileOffset[Index] = FVector2D(100.0f, 100.0f);
	}
	RHIUnlockVertexBuffer( TileOffsetsRef );
}

In this function, at first sight, there isn’t anything terribly wrong with the exception of the use of the Index variable. For no good reason a dependency is created between the two loops when in fact they are completely independent. Each loop could be controlled independently. Let’s rewrite it:

static void BuildTileVertexBuffer( FParticleBufferParamRef TileOffsetsRef, const uint32* Tiles, int32 TileCount )
{
	const int32 AlignedTileCount = ComputeAlignedTileCount(TileCount);
	FVector2D* TileOffset = (FVector2D*)RHILockVertexBuffer( TileOffsetsRef, 0, AlignedTileCount * sizeof(FVector2D), RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = Tiles[Index];
		TileOffset[Index] = FVector2D(
			FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ),
			FMath::Fractional( FMath::TruncFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY )
									  );
	}
	for ( int32 Index = TileCount; Index < AlignedTileCount; ++Index )
	{
		TileOffset[Index] = FVector2D(100.0f, 100.0f);
	}
	RHIUnlockVertexBuffer( TileOffsetsRef );
}

But shouldn’t the Visual Studio compiler realize that? Let’s look at the assembly with Intel Architecture Code Analyzer. The old code is on the left, and the new code is on the right.

BuildTileVertexBuffer-FirstOpt

The change got rid of two move instructions and the number of uops (micro-ops) went from 48 to 46. But we shouldn’t expect a huge improvement in performance from just doing that. We need to reduce the number of micro-ops further and hopefully that will also improve the instruction-level parallelism. Looking at the code, in the top loop there isn’t any specific need to construct an FVector2D and assign it to the current tile offset. I could just write the X and Y components directly as two independent operations. I could also do that for the loop below. Here is the new code:

static void BuildTileVertexBuffer( FParticleBufferParamRef TileOffsetsRef, const uint32* Tiles, int32 TileCount )
{
	const int32 AlignedTileCount = ComputeAlignedTileCount(TileCount);
	FVector2D* TileOffset = (FVector2D*)RHILockVertexBuffer( TileOffsetsRef, 0, AlignedTileCount * sizeof(FVector2D), RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = Tiles[Index];
		TileOffset[Index].X = FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX );
		TileOffset[Index].Y = FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY );
	}
	for ( int32 Index = TileCount; Index < AlignedTileCount; ++Index )
	{
		TileOffset[Index].X = 100.0f;
		TileOffset[Index].Y = 100.0f;
	}
	RHIUnlockVertexBuffer( TileOffsetsRef );
}

Would that make any improvements at all? Let’s look at it with Intel Architecture Code Analyzer. The old code on the left, the new code on the right.

BuildTileVertexBuffer-SecondOpt

Now we went from 48 uops to 41. Our optimization looks good from the static analysis, but let’s profile it.

BuildTileVertexBuffer-OptProfile

That cut the CPU time in half, which is good for the rendering thread.

I think the main lesson here is that the compiler may not optimize everything even if you think it should. Compilers in general are pretty limited in terms of optimizations so as programmers we need to be aware of their limitations and how to get around them (which might involve trial and error, and looking at the disassembly directly or with some static analysis).

Unreal Engine 4 GPU particle system micro-optimizations. Part 1.

This time I’m going to go over two optimizations which are the very first that I made, and that just got integrated in Unreal Engine 4. Hopefully they will ship with version 4.4. When I first got the Unreal Engine 4 license the most impressive demo available at the time was the Effects Cave demo. The demo looks like this:

So I started profiling the first 120 frames of the demo to get a sense of how CPU performance was being spent. As usual I profiled in my machine that has the following specs:

MachineSpecs

Below is the result of the profiling run. Keep in mind that this was done on Unreal Engine 4.1:

BuildParticleVertexBuffer-Slow

It isn’t that often that you get to profile something that is such an obvious hotspot. The issue here is pretty clear in terms of performance: I had to focus on BuildParticleVertexBuffer.

To make some sense of this it is good to see how the GPU particle system works. The GPU particle system, unlike the traditional CPU particle system, allows for a high particle count to be rendered efficiently but with some loss of flexibility. The actual emission of the particles still happens on the CPU, but once they are emitted the rest of the process happens on the GPU. The GPU particle system is implemented with tiles (4×4 tiles by default) in a set of textures (1024×1024 by default). There are two textures for position data and one for velocity, and those textures are double buffered. They are indexed by a vertex buffer where each index is stored as two half-precision floats that address those textures. The particles are collided against the information in the depth buffer. In particular, BuildParticleVertexBuffer creates the vertex buffer used to store the indices for the particles for a given set of tiles and fills it up with the data. Let’s look at how this is done:

static void BuildParticleVertexBuffer( FVertexBufferRHIParamRef VertexBufferRHI, const TArray<uint32>& InTiles )
{
	const int32 TileCount = InTiles.Num();
	const int32 IndexCount = TileCount * GParticlesPerTile;
	const int32 BufferSize = IndexCount * sizeof(FParticleIndex);
	const int32 Stride = 1;
	FParticleIndex* RESTRICT ParticleIndices = (FParticleIndex*)RHILockVertexBuffer( VertexBufferRHI, 0, BufferSize, RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = InTiles[Index];
		const FVector2D TileOffset( FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ), FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY ) );
		for ( int32 ParticleY = 0; ParticleY < GParticleSimulationTileSize; ++ParticleY )
		{
			for ( int32 ParticleX = 0; ParticleX < GParticleSimulationTileSize; ++ParticleX )
			{
				const float IndexX = TileOffset.X + ((float)ParticleX / (float)GParticleSimulationTextureSizeX) + (0.5f / (float)GParticleSimulationTextureSizeX);
				const float IndexY = TileOffset.Y + ((float)ParticleY / (float)GParticleSimulationTextureSizeY) + (0.5f / (float)GParticleSimulationTextureSizeY);
				// on some platforms, union and/or bitfield writes to Locked memory are really slow, so use a forced int write instead
				// and in fact one 32-bit write is faster than two uint16 writes (i.e. using .Encoded)
  				FParticleIndex Temp;
  				Temp.X = IndexX;
  				Temp.Y = IndexY;
  				*(uint32*)ParticleIndices = *(uint32*)&Temp;
				// move to next particle
				ParticleIndices += Stride;
			}
		}
	}
	RHIUnlockVertexBuffer( VertexBufferRHI );
}

As you can see there isn’t anything really suspicious about it; we are just generating the coordinates that index a particle. The compiler already generated fairly efficient code in terms of the input since most of the parameters are set at compile time. I decided to look at the assembly and throughput with the Intel Architecture Code Analyzer. Here was part of the output:

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - C:\dev\UnrealEngine\master\Engine\Binaries\Win64\UE4Client-Win64-Test.exe
Binary Format - 64Bit
Architecture  - SNB
Analysis Type - Throughput

*******************************************************************
Intel(R) Architecture Code Analyzer Mark Number 1
*******************************************************************

Throughput Analysis Report
--------------------------
Block Throughput: 31.55 Cycles       Throughput Bottleneck: FrontEnd, Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 27.4    0.0 | 27.4 | 12.5   6.0  | 12.5   6.0  | 13.0 | 31.2 |
-------------------------------------------------------------------------

What surprised me in the assembly was seeing comparisons with constants that weren’t clearly visible in the source code. Obviously I was missing something in the code. If you look at the code there isn’t anything immediately obvious that would cause comparisons with constants other than the ones related to the GPU particle simulation parameters, such as the size of the textures. But then I noticed something particular in the code related to the type of the data. As I mentioned previously, the indexing data is two half-precision floats. They are defined in FParticleIndex:

/**
 * Per-particle information stored in a vertex buffer for drawing GPU sprites.
 */
struct FParticleIndex
{
	/** The X coordinate of the particle within the texture. */
	FFloat16 X;
	/** The Y coordinate of the particle within the texture. */
	FFloat16 Y;
};

But as you can see from the function, the values set to the temporary FParticleIndex are two single-precision floats. No explicit cast was done in the code, so it had to happen on assignment. That’s why I decided to see how that assignment was done, and came to this function, which is called from the assignment operator.

FORCEINLINE void FFloat16::Set( float FP32Value )
{
	FFloat32 FP32(FP32Value);

	// Copy sign-bit
	Components.Sign = FP32.Components.Sign;

	// Check for zero, denormal or too small value.
	if ( FP32.Components.Exponent <= 112 )			// Too small exponent? (0+127-15)
	{
		// Set to 0.
		Components.Exponent = 0;
		Components.Mantissa = 0;
	}
	// Check for INF or NaN, or too high value
	else if ( FP32.Components.Exponent >= 143 )		// Too large exponent? (31+127-15)
	{
		// Set to 65504.0 (max value)
		Components.Exponent = 30;
		Components.Mantissa = 1023;
	}
	// Handle normal number.
	else
	{
		Components.Exponent = int32(FP32.Components.Exponent) - 127 + 15;
		Components.Mantissa = uint16(FP32.Components.Mantissa >> 13);
	}
}

As you can see this function takes care of single-precision floats that are out of the range of a half-precision float. Those were the comparisons I was seeing. But is this meaningful in any way for us? Of course! Thinking about the data is critical. When you do that in this case, there is no need for those checks: our indexing data consists of normalized texture coordinates well inside the range of a half-precision float, so we can assign the values directly. To do that I wrote the following function:

FORCEINLINE void FFloat16::SetNoChecks( const float FP32Value )
{
	const FFloat32 FP32(FP32Value);

	// Make absolutely sure that you never pass in a single precision floating
	// point value that may actually need the checks. If you are not 100% sure
	// of that just use Set().

	Components.Sign = FP32.Components.Sign;
	Components.Exponent = int32(FP32.Components.Exponent) - 127 + 15;
	Components.Mantissa = uint16(FP32.Components.Mantissa >> 13);
}

Now that those checks are gone performance should be better. Again I used Intel Architecture Code Analyzer, and here is the output.

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - C:\dev\UnrealEngine\master\Engine\Binaries\Win64\UE4Client-Win64-Test.exe
Binary Format - 64Bit
Architecture  - SNB
Analysis Type - Throughput

*******************************************************************
Intel(R) Architecture Code Analyzer Mark Number 1
*******************************************************************

Throughput Analysis Report
--------------------------
Block Throughput: 23.40 Cycles       Throughput Bottleneck: Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 22.0    0.0 | 21.9 | 7.5    4.5  | 7.5    4.5  | 6.0  | 23.1 |
-------------------------------------------------------------------------

As you can see the throughput of the block went from 31.55 to 23.40 cycles, and the front end is no longer a bottleneck. Time to profile to see if there is a difference at runtime.

BuildParticleVertexBuffer-Fast

That small change yielded a performance win of ~27% over the original code with ~45% fewer instructions retired. A good performance win for such a small change.

As with all the optimizations I have done so far, understanding the input and output data allowed me to determine how to optimize. I can’t stress enough how important it is to understand the data. If you don’t understand the data then the odds of improving performance are very low.

On the next part I will talk about another micro-optimization closely related to this one.

Optimizing Unreal Engine 4’s async file IO thread.

After the first post I made on the ToLower() optimization a lot of people rightfully wondered why I optimized that function and not the number of calls made to it. The answer is that I did optimize both. The reason is that while the reduction of ToLower() calls would have made a significant difference, it is also important for the actual code executed to be optimal even if it isn’t executed very often. The sum of all these inefficiencies is the cause of performance issues, because it is hard to see the performance decrease as the product evolves. This is what people refer to as “death by a thousand cuts”. So I first optimized the function to make sure there was a minimum bar of performance set for the function call, and then I went ahead and reduced the number of calls to the function. Of course, depending on your time restrictions you may set priorities differently.

As usual I’m going to be profiling on my machine that has the following specs:

MachineSpecs

So let’s begin by looking at the profile. Just like last time, this is a profile of the Elemental demo from frame two to frame 1000.

FAsyncIOSystemSlow

As you can see a lot of time is being spent on the ToLower() function, something I already optimized, but that optimization wasn’t included in this profile because I profile each change on its own. So the approach this time is to reduce the number of calls to the function. Looking at the callstacks I found out that a lot of the calls were coming from the AsyncIOSystem thread.

FAsyncIOSystemCallstack

As the name says, the AsyncIOSystem is an asynchronous IO system. It is a cross-platform, single-threaded runnable. What is a runnable in Unreal Engine 4, you ask? Here is the answer:

/**
 * Interface for "runnable" objects.
 *
 * A runnable object is an object that is "run" on an arbitrary thread. The call usage pattern is
 * Init(), Run(), Exit(). The thread that is going to "run" this object always uses those calling
 * semantics. It does this on the thread that is created so that any thread specific uses (TLS, etc.)
 * are available in the contexts of those calls. A "runnable" does all initialization in Init().
 *
 * If initialization fails, the thread stops execution and returns an error code. If it succeeds,
 * Run() is called where the real threaded work is done. Upon completion, Exit() is called to allow
 * correct clean up.
 */
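As a concrete illustration, here is a minimal hypothetical runnable following that Init()/Run()/Exit() pattern. This is just a sketch to show the interface, not the engine’s FAsyncIOSystemBase; the class name and its body are made up.

// A minimal hypothetical runnable following the Init()/Run()/Exit() pattern described above.
class FExampleRunnable : public FRunnable
{
public:
	virtual bool Init() override { return true; }

	virtual uint32 Run() override
	{
		while (StopCounter.GetValue() == 0)
		{
			// ... do the threaded work here, e.g. pop and fulfill pending IO requests ...
			FPlatformProcess::Sleep(0.01f);
		}
		return 0;
	}

	virtual void Stop() override { StopCounter.Increment(); }
	virtual void Exit() override {}

private:
	FThreadSafeCounter StopCounter;
};

// The thread itself would be spawned with something along the lines of:
// FRunnableThread::Create(new FExampleRunnable(), TEXT("ExampleThread"));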

New IO requests can be queued up from any thread, and those requests can be cancelled as well. Whenever a new request is sent, the following data is stored (type, name, and description), in the given order:

  • uint64 RequestIndex - Index of the request.
  • int32 FileSortKey - NOT USED. File sort key on media, INDEX_NONE if not supported or unknown.
  • FString FileName - Name of file.
  • int64 Offset - Offset into file.
  • int64 Size - Size in bytes of data to read.
  • int64 UncompressedSize - Uncompressed size in bytes of original data, 0 if data is not compressed on disc.
  • void* Dest - Pointer to memory region used to read data into.
  • ECompressionFlags CompressionFlags - Flags for controlling decompression. Valid flags are no compression, zlib compression, and specific flags related to zlib compression.
  • FThreadSafeCounter* Counter - Thread safe counter that is decremented once work is done.
  • EAsyncIOPriority Priority - Priority of request. Allows minimum, low, below normal, normal, high, and maximum priority.
  • uint32 bIsDestroyHandleRequest : 1 - True if this is a request to destroy the handle.
  • uint32 bHasAlreadyRequestedHandleToBeCached : 1 - True if the caching of the handle was already requested.

The most obvious culprit of the performance drain must be the FileName, so let’s look at where it is used:

FAsyncIOSystemFileNameRefs

The use in the Tick function caught my attention right away; it lined up properly with the callstack that I knew was flooding my output window. The Tick function of FAsyncIOSystem does the following things in order:

  • Create file handles for the outstanding requests.
  • Creates a copy of the next request that should be handled based on a specific platform criteria such as the layout on disc or the highest set priority.
  • Fulfill the pending request.
    • Which can be a request to destroy the handle, or to retrieve/create a handle to fulfill a compressed or uncompressed read.

To determine if a request handle needs to be created or retrieved from the cache there is a name-to-handle map that maps the FString of the filename to an IFileHandle*. If the filename is in the map then there is a cached handle for it. To find it, the code does a FindRef() on the TMap, which calls FindId(); here is the code:

/**
 * Finds an element with the given key in the set.
 * @param Key - The key to search for.
 * @return The id of the set element matching the given key, or the NULL id if none matches.
 */
FSetElementId FindId(KeyInitType Key) const
{
	if(HashSize)
	{
		for(FSetElementId ElementId = GetTypedHash(KeyFuncs::GetKeyHash(Key));
			ElementId.IsValidId();
			ElementId = Elements[ElementId].HashNextId)
		{
			if(KeyFuncs::Matches(KeyFuncs::GetSetKey(Elements[ElementId].Value),Key))
			{
				// Return the first match, regardless of whether the set has multiple matches for the key or not.
				return ElementId;
			}
		}
	}
	return FSetElementId();
}

The ElementId is generated by the GetTypedHash() function, which for our type generates a CRC32 hash; this is where all the time in FCrc::Strihash_DEPRECATED<wchar_t>() was being spent. And KeyFuncs::Matches() looks like this:

static FORCEINLINE bool Matches(KeyInitType A,KeyInitType B)
{
	return A == B;
}

While that looks pretty reasonable for fundamental integral types, for FStrings it ends up calling Stricmp to do a case-insensitive lexicographical comparison. This is where the ToLower() call is made:

template <typename CharType>
static inline int32 Stricmp( const CharType* String1, const CharType* String2 )
{
	// walk the strings, comparing them case insensitively
	for (; *String1 || *String2; String1++, String2++)
	{
		CharType Char1 = TChar<CharType>::ToLower(*String1), Char2 = TChar<CharType>::ToLower(*String2);
		if (Char1 != Char2)
		{
			return Char1 - Char2;
		}
	}
	return 0;
}

So now we know what it implies to find something in the cache, but how often does that happen? The answer is in the FAsyncIOSystemBase::Tick() function, which shows that it happens once per outstanding request, and then once more if a request is pending. I measured the number of requests done before the very first frame was rendered: there were 2096 requests queued. Considering that the AsyncIOSystem thread has above normal priority, this can happen pretty often. The numbers add up pretty quickly. We need to fix this.

To fix this I took a rather simple approach, which is to make sure that finding something in the cache only involves comparisons between integral types. The easiest way was to add another field to the IO request data: a 32-bit hash of the filename. The hash is generated whenever a new IO request is queued up (be it an IO request or a file handle destroy request), and then that hash is used to find cached file handles. To generate the hash I decided to use something already found in the engine rather than integrating something like FNV-1 or xxHash, so I used a CRC32 hash.
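A rough sketch of the idea is below. The struct, field, and map names are hypothetical and not the actual engine members; the point is only that the hash is computed once at queue time and the handle cache is keyed on an integer instead of an FString.

// Hypothetical shape of the change: hash the filename once when the request is queued,
// then key the handle cache on the integer hash so the lookup is an integer compare
// instead of a case-insensitive string compare.
struct FAsyncIORequestSketch
{
	FString FileName;
	uint32  FileNameHash; // computed once, at queue time
};

// Queue time:
// Request.FileNameHash = FCrc::Strihash_DEPRECATED(*Request.FileName);

// Tick time, cached handle lookup:
// TMap<uint32, IFileHandle*> NameHashToHandleMap;
// IFileHandle* CachedHandle = NameHashToHandleMap.FindRef(Request.FileNameHash);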

So after doing that change let’s look at the profile:

FAsyncIOSystemFast

Pretty impressive: the call to ToLower() isn’t there anymore because only 0.030ms are spent on it across all 1000 frames. The call to FCrc::Strihash_DEPRECATED<wchar_t>() isn’t there either because only 7.9ms are spent on it across all 1000 frames.

The lesson this time is related to the tendency of developers to hide complexity under very generic functions which have huge performance relevance. In particular, performance was suboptimal because it isn’t obvious that A == B in KeyFuncs::Matches implies a Stricmp call for an FString. That’s why in my own code I tend not to overload operators; they usually hide complexity when as programmers we need to be fully aware of the complexity of what we ship. Programmers also forget that our main objective isn’t to create generic solutions for problems that we may have in the future. Our main objective is to ship the best experience to the end user, and that means writing code that solves the actual problem we need to solve, with the available hardware resources, and within the available time. If you care a lot about the future of your code then worry about writing optimizable code rather than making a grand design to abstract complexities using ten different design patterns. The truth is that the end user doesn’t care if you use ten different patterns in a generic design, but they do care if the load times are high.

And knowing your performance data is critical. In the case of the Elemental demo, 2096 IO requests were done and fulfilled before the first frame was rendered. Being aware of that is critical to making the proper design and performance decisions. In fact, given that data I would be inclined to further optimize this by changing the AoS nature of FAsyncIORequest to SoA so that the hashes are all stored together, which should reduce the CPI in the Tick function, but I will leave that up to you.
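For reference, a rough sketch of what that SoA layout could look like (the type and member names are hypothetical, not engine code):

// Hypothetical SoA layout: the hashes scanned by Tick() are packed contiguously so the
// hot loop walks dense cache lines, while the rest of the per-request data sits in
// parallel arrays sharing the same index.
struct FAsyncIORequestsSoA
{
	TArray<uint32>  FileNameHashes; // hot: scanned when matching cached handles
	TArray<FString> FileNames;      // cold: only needed when a file is actually opened
	TArray<int64>   Offsets;
	TArray<int64>   Sizes;
	TArray<void*>   DestBuffers;
};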

Optimizing AABB transform in Unreal Engine 4

Last time I didn’t go over the top hotspot in Unreal Engine 4’s Elemental demo, so I will tackle that issue now. Let’s look at the profiling:

TransformSlowSource

The FBox::TransformBy() function transforms the bounding box with an FTransform, which stores a quaternion for the rotation, a translation vector, and a scale vector. This FTransform object might be SIMD or scalar depending on the platform, but even in the SIMD solution there is a whole lot of work going on to transform the given position with the FTransform. If you have access to the source check FTransform::TransformFVector4() in Engine\Source\Runtime\Core\Public\Math\TransformVectorized.h. The FTransform object itself is smaller than an FMatrix (3 SIMD registers vs 4 SIMD registers), but the transform operation isn’t as straightforward. On top of that, given the OOP nature of the implementation, it doesn’t take advantage of the fact that there is data that could be reused for every vertex of the bounding box. Here is the old implementation:

FBox FBox::TransformBy(const FTransform & M) const
{
	FVector Vertices[8] =
	{
		FVector(Min),
		FVector(Min.X, Min.Y, Max.Z),
		FVector(Min.X, Max.Y, Min.Z),
		FVector(Max.X, Min.Y, Min.Z),
		FVector(Max.X, Max.Y, Min.Z),
		FVector(Max.X, Min.Y, Max.Z),
		FVector(Min.X, Max.Y, Max.Z),
		FVector(Max)
	};

	FBox NewBox(0);

	for (int32 VertexIndex = 0; VertexIndex < ARRAY_COUNT(Vertices); VertexIndex++)
	{
		FVector4 ProjectedVertex = M.TransformPosition(Vertices[VertexIndex]);
		NewBox += ProjectedVertex;
	}

	return NewBox;
}

Fortunately FBox has a TransformBy implementation that takes an FMatrix that is way better, since it is fairly small vectorized code that takes into account all the vertices of the bounding box at once. Instead of transforming each vertex and adding them to the new bounding box while dealing with the quaternion, it transforms all the vertices with the matrix (which just takes 3 _mm_shuffle_ps, 3 _mm_mul_ps, and 3 _mm_add_ps per vertex) and then recalculates the bounding box (which takes 7 _mm_min_ps and 7 _mm_max_ps).
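To make the shape of that computation concrete, here is a rough standalone sketch of the same idea using raw SSE intrinsics. This is not the engine’s VectorRegister implementation; it assumes a row-major 4×4 matrix multiplied by row-vector points, and the function names are made up.

#include <xmmintrin.h>

// Transform a point (x, y, z, 1) by a row-major 4x4 matrix stored as four __m128 rows.
// The three shuffles splat x, y and z; the last add folds in the translation row.
static inline __m128 TransformPointSSE(__m128 Point, const __m128 Rows[4])
{
	const __m128 X = _mm_shuffle_ps(Point, Point, _MM_SHUFFLE(0, 0, 0, 0));
	const __m128 Y = _mm_shuffle_ps(Point, Point, _MM_SHUFFLE(1, 1, 1, 1));
	const __m128 Z = _mm_shuffle_ps(Point, Point, _MM_SHUFFLE(2, 2, 2, 2));
	return _mm_add_ps(_mm_add_ps(_mm_mul_ps(X, Rows[0]), _mm_mul_ps(Y, Rows[1])),
	                  _mm_add_ps(_mm_mul_ps(Z, Rows[2]), Rows[3]));
}

// Rebuild the bounds from the 8 transformed corners: 7 min and 7 max operations.
static inline void MinMaxCorners(const __m128 Corners[8], __m128& OutMin, __m128& OutMax)
{
	OutMin = Corners[0];
	OutMax = Corners[0];
	for (int i = 1; i < 8; ++i)
	{
		OutMin = _mm_min_ps(OutMin, Corners[i]);
		OutMax = _mm_max_ps(OutMax, Corners[i]);
	}
}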

So using FMatrix in the current implementation is better than using FTransform. But in general the engine uses FTransforms for local-to-world transforms. That means that we would have to rewrite the code to use matrices instead, or we could convert the FTransform into an FMatrix when transforming the bounding box. So you may ask “how much work would it take to change from FTransform to FMatrix?” Here is part of the answer:

TransfromByVA

As you can see there is a whole bunch of work to be done if we want to do that. That raises the question: can the conversion pay for itself with the cycles saved by not using an FTransform? The process of switching from FTransform to FMatrix isn’t cheap in terms of the number of instructions, but since it involves a lot of shuffling it shouldn’t be hard to pipeline. To determine that I decided to use another tool from Intel, the Intel Architecture Code Analyzer (IACA). Basically this tool allows you to statically analyze the throughput and latency of a snippet of code. People who have worked on the PlayStation 3 will find it similar to SN Tuner’s view of even and odd instructions in the SPU. If you haven’t seen that then take a look at Jaymin Kessler’s video on software pipelining. Anyway, IACA is a nice tool to get an estimate of the performance of a snippet of code and it works just fine for our purpose.
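For those who haven’t used IACA: you bracket the snippet you care about with the marker macros from the iacaMarks.h header that ships with the tool, rebuild, and run the iaca command-line tool on the resulting binary. A minimal sketch of the marking is below; the function is just an example, and I’m assuming the 64-bit MSVC variants of the macros since the analyzed binary is a Win64 build.

#include "iacaMarks.h" // header that ships with IACA; defines the marker macros

void AnalyzeTransform(FBox& Box, const FTransform& Transform)
{
	IACA_VC64_START // 64-bit MSVC marker; plain IACA_START/IACA_END on other compilers
	Box = Box.TransformBy(Transform);
	IACA_VC64_END
}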

So let’s take a look at the throughput for FTransform::TransformFVector4(). Keep in mind that this is called eight times, once for each vertex of the bounding box.

Throughput Analysis Report
--------------------------
Block Throughput: 27.05 Cycles       Throughput Bottleneck: Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 17.0   0.0  | 9.0  | 8.5    6.0  | 8.5    6.0  | 5.0  | 27.0 |
-------------------------------------------------------------------------

This means that we take ~27 cycles per vertex or ~216 for the whole bounding box. If we look at the throughput of FBox::TransformBy() with the FMatrix input it looks like this:

Throughput Analysis Report
--------------------------
Block Throughput: 80.00 Cycles       Throughput Bottleneck: Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 24.0   0.0  | 38.0 | 17.0   8.0  | 17.0   8.0  | 18.0 | 80.0 |
-------------------------------------------------------------------------

That’s ~80 cycles for all the vertices which is much better. But since we need to go from FTransform to FMatrix let’s see what that takes by getting the throughput for the conversion:

Throughput Analysis Report
--------------------------
Block Throughput: 29.00 Cycles       Throughput Bottleneck: Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 6.0    0.0  | 6.0  | 5.5    3.0  | 5.5    3.0  | 5.0  | 29.0 |
-------------------------------------------------------------------------

That’s ~29 cycles. So if we add the cost of conversion plus the actual transform then we get ~109 cycles.

So in theory this should work out just fine, so let’s rewrite the function:

FBox FBox::TransformBy(const FTransform & M) const
{
	return TransformBy(M.ToMatrixWithScale());
}

Pretty simple change, but since IACA provides estimates we have to profile this to see if it actually improves performance. To do that I first made a stress test that transformed the same AABB 214748364 times, to make sure that I could measure the performance delta and to avoid any data or instruction cache misses “polluting” the profiling. So here is the profiling data with the old code in red, and the new one in green:

TransformByStressProfile

Pretty impressive win considering the small change in the code. But let’s see how it works on the Elemental demo like last time:

TransformByElementalProfile

We got an average of a 6x improvement on the top hotspot. And here is a look at the profiling data; see how it went down the list of hotspots.

TransformByFast

I think the lesson here is that, just like last time, it is really important to think about the data and the platform. Don’t take the OOP approach of “I will do this right for one vertex and then call it multiple times” as if the number of vertices in a bounding box could change at some point. Instead take advantage of the fact that you have 8 vertices and see how you can write something optimal and concise.

The other lesson is to not be afraid of doing some data contortion to improve performance. In the optimal case there wouldn’t be a need for that, but if that’s not the case then see if you can get a performance win by shuffling data around even if shuffling itself has a cost.

And as usual, always profile. Don’t rely solely on static analysis like IACA, and don’t think too highly of your ability to extrapolate performance gains; just profile and assess.

Optimizing string function in UE4

Back in March, during GDC, Epic Games released Unreal Engine 4 with a subscription-based license. As soon as it was released I decided to license it to be able to learn and exercise my skills. That was something I wasn’t doing back in March since I was helping EA Sports reach alpha stage on NHL 15, doing back-end work for the UI. As soon as I got it I started profiling to understand the performance characteristics of the engine. As this blog goes on you will see that performance is something very important for me =) . Since then I have come up with half a dozen optimizations, some of which are already integrated in Epic’s Perforce, and some of which are still under review. So today I’m going to talk about a really basic optimization with a considerable impact on performance, which involves converting a character to lower case.

As with any serious performance work, I started by profiling, which provides the most critical piece of data for optimization. I refuse to do any optimization work unless I have profiling data that backs up the work. I also add the before/after profiling data to the commit information to make explicit why the work needed to be done, and I make sure I include the specs of the hardware used for profiling. All the profiling that I will show in this post was done on this machine:

MachineSpecs

The first thing I did was hook up Intel’s VTune to Unreal Engine 4, since I try to avoid non-automated profiling as much as possible. After some work dealing with the Unreal Build Tool I got it integrated properly. I was able to add frame markers, events, and start and stop the capturing of profiling data from code. With that ready I decided to profile the Elemental demo. The Elemental demo, first shown back in 2012, was meant to show the potential of Unreal Engine 4. Martin Mittring made a presentation about it at SIGGRAPH 2012. If you have never seen the demo here is a video of it:

I decided to profile the Elemental demo from frame 2 to frame 1000, just to skip all the loading of data for the first frame. That meant that for the first ~3.6 seconds of the demo nothing would be profiled; then the profiler would be enabled at the start of the second frame and stopped at the end of frame 1000. Given that this is a scripted demo, it allowed me to make sure that any optimization I did actually improved performance. The profiling showed this:

ToLower-Opt-Slow

For now just ignore FBox::TransformBy() since I will talk about it in another post. After that, the top hotspot was TChar<wchar_t>::ToLower(). Since the Unreal Engine 4 EULA allows me to, I will share this small piece of code.

template <> inline
TChar<WIDECHAR>::CharType TChar<WIDECHAR>::ToLower(CharType c)
{
	// compiler generates incorrect code if we try to use TEXT('char') instead of the numeric values directly
	// some special cases
	switch (WIDECHAR(c))
	{
		// these are not 32 apart
	case 159: return 255; // diaeresis y
	case 140: return 156; // digraph ae

		// characters within the 192 - 255 range which have no uppercase/lowercase equivalents
	case 240:
	case 208:
	case 223:
	case 247:
		return c;
	}

	if ((c >= 192 && c < 223) || (c >= LITERAL(CharType, 'A') && c <= LITERAL(CharType, 'Z')))
	{
		return c + UPPER_LOWER_DIFF;
	}

	// no lowercase equivalent
	return c;
}

This converts a wide char from upper case to lower case. If you take that piece of code on its own most people wouldn’t think of it as being a top hotspot. Unfortunately programmers in general don’t seem to think about the performance implications of their string handling code. In particular the code in Unreal Engine 4 used too many branches, which is hard for the branch predictor given the random nature of the input. There were two things to be done here: the most obvious one was to reduce the number of calls to the function, but the other one was to actually optimize the function by reducing the number of branches. I will go into how I reduced the number of calls in another post, so let’s focus on optimizing the actual function.

The code makes it pretty explicit what the valid input is, and what just returns the input char unchanged. Since the range of valid input is small it really lends itself to using a look-up table. The look-up table only has to cover the valid input range, so the only branch left is the one that checks that the input char falls within that range. The first thing was to create the look-up table, which was really easy.

static const size_t ConversionMapSize = 256U;
static const uint8 ConversionMap[ConversionMapSize] =
{
	0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F,
	0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F,
	0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2A, 0x2B, 0x2C, 0x2D, 0x2E, 0x2F,
	0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x3A, 0x3B, 0x3C, 0x3D, 0x3E, 0x3F,
	0x40, 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
	'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 0x5B, 0x5C, 0x5D, 0x5E, 0x5F,
	0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6A, 0x6B, 0x6C, 0x6D, 0x6E, 0x6F,
	0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7A, 0x7B, 0x7C, 0x7D, 0x7E, 0x7F,
	0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88, 0x89, 0x8A, 0x8B, 0x9C, 0x8D, 0x8E, 0x8F,
	0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, 0x98, 0x99, 0x9A, 0x9B, 0x9C, 0x9D, 0x9E, 0xFF,
	0xA0, 0xA1, 0xA2, 0xA3, 0xA4, 0xA5, 0xA6, 0xA7, 0xA8, 0xA9, 0xAA, 0xAB, 0xAC, 0xAD, 0xAE, 0xAF,
	0xB0, 0xB1, 0xB2, 0xB3, 0xB4, 0xB5, 0xB6, 0xB7, 0xB8, 0xB9, 0xBA, 0xBB, 0xBC, 0xBD, 0xBE, 0xBF,
	0xE0, 0xE1, 0xE2, 0xE3, 0xE4, 0xE5, 0xE6, 0xE7, 0xE8, 0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF,
	0xF0, 0xF1, 0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xDF,
	0xE0, 0xE1, 0xE2, 0xE3, 0xE4, 0xE5, 0xE6, 0xE7, 0xE8, 0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF,
	0xF0, 0xF1, 0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF
};

This table is just 256 bytes long, which is 4 cache lines on the platforms relevant to Unreal Engine 4. This shouldn’t be a concern at all since the use case for this is converting full strings char by char. The first time we would get a cache miss to fetch the table, but it would be a cache hit for the rest of the chars.

Since we have the look-up table we can now index it with the input char, as long as the input char is within the range covered by the table. For anything outside of that range we just return the input char unchanged.

template <> inline 
TChar<WIDECHAR>::CharType TChar<WIDECHAR>::ToLower( CharType c )
{
	return (c < static_cast<CharType>(ConversionMapSize)) ? static_cast<CharType>(ConversionMap[c]) : c;
}

Now we have a single branch. Time to profile the optimization. Since I have everything automated I ran the exact same process that captured the same number of frames, and this was the performance difference in the function:

ToLower-Opt-Fast

Without much effort I decreased the CPU time spent in the function by ~41%. This is a big difference that shows how performance can be dominated by branches. It also shows how critical it is to profile before optimizing. I’m pretty sure most programmers would have thought that the performance would be dominated by the work of some complex subsystem (such as animation) rather than something more mundane like dealing with chars.

The basic lesson here is that you should profile before doing any optimization work; don’t assume that the most complex subsystems are always the bottleneck. Even more important is to be aware of your target platform and write code in a way that suits it.