Optimizing Unreal Engine 4’s async file IO thread

After my first post, on the ToLower() optimization, a lot of people rightfully wondered why I optimized the function itself rather than the number of calls made to it. The answer is that I optimized both. While reducing the number of ToLower() calls makes the bigger difference, it is also important for the code that does execute to be optimal, even if it isn’t executed often. The sum of all these small inefficiencies is what causes performance issues, because it is hard to see the performance decrease as the product evolves. This is what people refer to as “death by a thousand cuts”. So I first optimized the function itself to set a minimum performance bar for each call, and then went ahead and reduced the number of calls. Of course, depending on your time constraints you may set your priorities differently.

As usual I’m going to be profiling on my machine that has the following specs:

[Image: MachineSpecs]

So let’s begin by looking at the profile. Just like last time, this covers frames 2 through 1000 of the Elemental demo.

[Image: FAsyncIOSystemSlow]

As you can see, a lot of time is being spent in the ToLower() function. I had already optimized it, but that change isn’t included in this profile because I profile each change on its own. So the approach this time is to reduce the number of calls to the function. Looking at the callstacks, I found that a lot of the calls were coming from the AsyncIOSystem thread.

[Image: FAsyncIOSystemCallstack]

As the name says, the AsyncIOSystem is an asynchronous IO system. It is a cross-platform, single-threaded runnable. What is a runnable in Unreal Engine 4, you ask? Here is the answer:

/**
 * Interface for "runnable" objects.
 *
 * A runnable object is an object that is "run" on an arbitrary thread. The call usage pattern is
 * Init(), Run(), Exit(). The thread that is going to "run" this object always uses those calling
 * semantics. It does this on the thread that is created so that any thread specific uses (TLS, etc.)
 * are available in the contexts of those calls. A "runnable" does all initialization in Init().
 *
 * If initialization fails, the thread stops execution and returns an error code. If it succeeds,
 * Run() is called where the real threaded work is done. Upon completion, Exit() is called to allow
 * correct clean up.
 */

New IO requests can be queued up from any thread, and those requests can be cancelled as well. Whenever a new request is sent, the following data is stored, in the given order:

  • uint64 RequestIndex - Index of the request.
  • int32 FileSortKey - NOT USED. File sort key on media, INDEX_NONE if not supported or unknown.
  • FString FileName - Name of the file.
  • int64 Offset - Offset into the file.
  • int64 Size - Size in bytes of the data to read.
  • int64 UncompressedSize - Uncompressed size in bytes of the original data, 0 if the data is not compressed on disc.
  • void* Dest - Pointer to the memory region the data is read into.
  • ECompressionFlags CompressionFlags - Flags for controlling decompression. Valid flags are no compression, zlib compression, and specific flags related to zlib compression.
  • FThreadSafeCounter* Counter - Thread-safe counter that is decremented once the work is done.
  • EAsyncIOPriority Priority - Priority of the request; allows minimum, low, below normal, normal, high, and maximum priority.
  • uint32 bIsDestroyHandleRequest : 1 - True if this is a request to destroy the handle.
  • uint32 bHasAlreadyRequestedHandleToBeCached : 1 - True if caching of the handle was already requested.
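
Put together, the request data would look roughly like the following struct (a reconstruction from the list above, not necessarily the engine’s exact declaration):

struct FAsyncIORequest
{
	uint64              RequestIndex;             // index of the request
	int32               FileSortKey;              // not used
	FString             FileName;                 // name of the file to read from
	int64               Offset;                   // offset into the file
	int64               Size;                     // size in bytes of the data to read
	int64               UncompressedSize;         // 0 if not compressed on disc
	void*               Dest;                     // memory region the data is read into
	ECompressionFlags   CompressionFlags;         // decompression control flags
	FThreadSafeCounter* Counter;                  // decremented once the work is done
	EAsyncIOPriority    Priority;                 // request priority
	uint32              bIsDestroyHandleRequest : 1;              // destroy-the-handle request?
	uint32              bHasAlreadyRequestedHandleToBeCached : 1; // handle caching already requested?
};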

The most obvious culprit of the performance drain must be the FileName, so let’s look at where it is used:

[Image: FAsyncIOSystemFileNameRefs]
The use in the Tick function caught my attention right away; it lined up with the callstack that I knew was flooding my output window. The Tick function of FAsyncIOSystem does the following things, in order:

  • Creates file handles for the outstanding requests.
  • Creates a copy of the next request that should be handled, based on platform-specific criteria such as the layout on disc or the highest set priority.
  • Fulfills the pending request.
    • This can be a request to destroy the handle, or a retrieval/creation of a handle to fulfill a compressed or uncompressed read.

To determine whether a request’s handle needs to be created or retrieved from the cache, there is a name-to-handle map that maps the FString of the filename to an IFileHandle*. If the filename is in the map, there is a cached handle for it. The lookup is a FindRef() on the TMap, which calls FindId(); here is the code:

/**
 * Finds an element with the given key in the set.
 * @param Key - The key to search for.
 * @return The id of the set element matching the given key, or the NULL id if none matches.
 */
FSetElementId FindId(KeyInitType Key) const
{
	if(HashSize)
	{
		for(FSetElementId ElementId = GetTypedHash(KeyFuncs::GetKeyHash(Key));
			ElementId.IsValidId();
			ElementId = Elements[ElementId].HashNextId)
		{
			if(KeyFuncs::Matches(KeyFuncs::GetSetKey(Elements[ElementId].Value),Key))
			{
				// Return the first match, regardless of whether the set has multiple matches for the key or not.
				return ElementId;
			}
		}
	}
	return FSetElementId();
}

The ElementId is generated by the GetTypedHash() function, which for our type generates a CRC32 hash; this is where all the time in FCrc::Strihash_DEPRECATED<wchar_t>() was being spent. And KeyFuncs::Matches() looks like this:

static FORCEINLINE bool Matches(KeyInitType A,KeyInitType B)
{
	return A == B;
}

While that looks pretty reasonable for fundamental integral types, for FStrings the comparison is a call to Stricmp, which does a case-insensitive lexicographical comparison. This is where the ToLower() call is made:

template <typename CharType>
static inline int32 Stricmp( const CharType* String1, const CharType* String2 )
{
	// walk the strings, comparing them case insensitively
	for (; *String1 || *String2; String1++, String2++)
	{
		CharType Char1 = TChar<CharType>::ToLower(*String1), Char2 = TChar<CharType>::ToLower(*String2);
		if (Char1 != Char2)
		{
			return Char1 - Char2;
		}
	}
	return 0;
}

So now we know what it costs to find something in the cache, but how often does that happen? The answer is in the FAsyncIOSystemBase::Tick() function, which shows that it happens once per outstanding request, and then once more if a request is pending. I measured the number of requests made before the very first frame was rendered: 2096 requests were queued. Considering that the AsyncIOSystem thread runs at above-normal priority, this can happen pretty often, and the numbers add up quickly. We need to fix this.

To fix this I took a rather simple approach: make sure that finding something in the cache only involves comparisons between integral types. The easiest way was to add another field to the IO request data, a 32-bit hash of the filename. The hash is generated whenever a new IO request is queued up (be it an IO request or a file-handle destroy request), and that hash is then used to find cached file handles. To generate the hash I decided to use something already found in the engine rather than integrating something like FNV-1 or xxHash, so I used a CRC32 hash.
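
In sketch form the change looks like this (hypothetical names; I’m assuming FCrc::Strihash_DEPRECATED as the CRC32, since that is what the FString key was already being hashed with):

// Sketch of the idea, not the actual engine change.
// When a request is queued, hash the filename once and store it in the
// FileNameHash field assumed to be added to the request data.
IORequest.FileNameHash = FCrc::Strihash_DEPRECATED(*IORequest.FileName);

// The handle cache is then keyed by the hash instead of the FString, so a
// lookup hashes and compares integers; no CRC over wide chars, no Stricmp.
TMap<uint32, IFileHandle*> NameHashToHandleMap;
IFileHandle* CachedHandle = NameHashToHandleMap.FindRef(IORequest.FileNameHash);

Note that, unlike comparing full strings, a bare 32-bit hash can collide; for a file-handle cache that trade-off is easy to reason about, but it is worth being aware of.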

So after doing that change let’s look at the profile:

[Image: FAsyncIOSystemFast]

Pretty impressive: the call to ToLower() isn’t there anymore because only 0.030ms are spent in it across all 1000 frames, and the call to FCrc::Strihash_DEPRECATED<wchar_t>() isn’t there either because only 7.9ms are spent in it across all 1000 frames.

The lesson this time relates to the tendency of developers to hide complexity under very generic functions that have huge performance relevance. In this case performance was suboptimal because it isn’t obvious that A == B in KeyFuncs::Matches implies a Stricmp call for an FString. That’s why in my own code I tend not to overload operators: they hide complexity at a point where, as programmers, we need to be fully aware of the complexity of what we ship. Programmers also forget that our main objective isn’t to create generic solutions for problems we may have in the future. Our main objective is to ship the best experience to the end user, and that means writing code that solves the actual problem we need to solve, with the available hardware resources, and within the available time. If you care about the future of your code, then worry about making it optimizable rather than making a grand design that abstracts complexity behind ten different design patterns. The truth is that the end user doesn’t care whether you used ten different patterns in a generic design, but they do care if the load times are high.

And knowing your performance data is critical. In the case of the Elemental demo, 2096 IO requests were made and fulfilled before the first frame was rendered; being aware of that is critical to making the proper design and performance decisions. In fact, given that data, I would be inclined to optimize this further by changing the AoS nature of FAsyncIORequest and moving it to SoA, so that the hashes are all stored together and the CPI in the Tick function goes down. But I will leave that up to you.
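
As a rough illustration of that AoS-to-SoA idea (a hypothetical layout, nothing like this exists in the engine):

// Hypothetical illustration only. Today each request is a fat struct, so
// the hashes scanned by Tick() are spread out, one per struct:
TArray<FAsyncIORequest> OutstandingRequests; // AoS

// Storing each field in its own parallel array packs the hashes together,
// so scanning them in Tick() touches far fewer cache lines:
struct FOutstandingRequestsSoA
{
	TArray<uint32> FileNameHashes; // the hot data for the cache lookup
	TArray<int64>  Offsets;
	TArray<int64>  Sizes;
	// ...one parallel array per remaining field, all indexed by request slot.
};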


Optimizing AABB transform in Unreal Engine 4

Last time I didn’t go over the top hotspot in Unreal Engine 4’s Elemental demo, so I will tackle that now. Let’s look at the profile:

[Image: TransformSlowSource]

The FBox::TransformBy() function transforms the bounding box with an FTransform, which stores a quaternion for the rotation, a translation vector, and a scale vector. The FTransform object might be SIMD or scalar depending on the platform, but even in the SIMD version there is a whole lot of work going on to transform a given position. If you have access to the source, check FTransform::TransformFVector4() in Engine\Source\Runtime\Core\Public\Math\TransformVectorized.h. The FTransform object itself is smaller than an FMatrix (3 SIMD registers vs 4 SIMD registers), but the transform operation isn’t as straightforward. On top of that, given the OOP nature of the implementation, it doesn’t take advantage of the fact that there is data that could be reused for every vertex of the bounding box. Here is the old implementation:

FBox FBox::TransformBy(const FTransform & M) const
{
	FVector Vertices[8] =
	{
		FVector(Min),
		FVector(Min.X, Min.Y, Max.Z),
		FVector(Min.X, Max.Y, Min.Z),
		FVector(Max.X, Min.Y, Min.Z),
		FVector(Max.X, Max.Y, Min.Z),
		FVector(Max.X, Min.Y, Max.Z),
		FVector(Min.X, Max.Y, Max.Z),
		FVector(Max)
	};

	FBox NewBox(0);

	for (int32 VertexIndex = 0; VertexIndex < ARRAY_COUNT(Vertices); VertexIndex++)
	{
		FVector4 ProjectedVertex = M.TransformPosition(Vertices[VertexIndex]);
		NewBox += ProjectedVertex;
	}

	return NewBox;
}

Fortunately FBox has a TransformBy() overload that takes an FMatrix, which is way better since it is fairly small vectorized code that takes into account all the vertices of the bounding box at once. Instead of transforming each vertex while dealing with the quaternion and adding it to the new bounding box, it transforms all the vertices with the matrix (which takes just 3 _mm_shuffle_ps, 3 _mm_mul_ps, and 3 _mm_add_ps per vertex) and then recalculates the bounding box (which takes 7 _mm_min_ps and 7 _mm_max_ps).
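
The min/max part of that recalculation, in sketch form (my reconstruction of the idea, not the engine’s exact code):

#include <xmmintrin.h>

// Given the 8 transformed vertices, 7 mins and 7 maxes rebuild the box.
static void ComputeBounds(const __m128 TransformedVertices[8], __m128& OutMin, __m128& OutMax)
{
	OutMin = TransformedVertices[0];
	OutMax = TransformedVertices[0];
	for (int VertexIndex = 1; VertexIndex < 8; ++VertexIndex)
	{
		OutMin = _mm_min_ps(OutMin, TransformedVertices[VertexIndex]); // 7 _mm_min_ps total
		OutMax = _mm_max_ps(OutMax, TransformedVertices[VertexIndex]); // 7 _mm_max_ps total
	}
}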

So using FMatrix in the current implementation is better than using FTransform. But in general the engine uses FTransforms for local-to-world transforms. That means we would have to either rewrite the code to use matrices instead, or convert the FTransform to an FMatrix when transforming the bounding box. You may ask, “how much work would it take to change from FTransform to FMatrix?” Here is part of the answer:

[Image: TransfromByVA]

As you can see, there is a whole bunch of work to be done if we want to do that. That raises the question: can I do the conversion on the fly with the cycles saved by not transforming with an FTransform? Converting an FTransform to an FMatrix isn’t cheap in terms of instruction count, but since it involves a lot of shuffling it shouldn’t be hard to pipeline. To determine that, I decided to use another tool from Intel, the Intel Architecture Code Analyzer (IACA). This tool statically analyzes the throughput and latency of a snippet of code. People who have worked on the PlayStation 3 will find it similar to SN Tuner’s view of even and odd instructions on the SPUs; if you haven’t seen that, take a look at Jaymin Kessler’s video on software pipelining. Anyway, IACA is a nice tool for estimating the performance of a snippet of code, and it works just fine for our purpose.
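
To use it, you mark the code you care about and run the tool over the resulting binary. The markers come from iacaMarks.h, which ships with IACA (on 64-bit MSVC the IACA_VC64_START/IACA_VC64_END variants are needed, since inline assembly isn’t available there); roughly:

#include "iacaMarks.h" // ships with IACA, defines the marker macros

void FunctionUnderAnalysis()
{
	IACA_START // IACA only analyzes the instructions between the markers
	// ...the snippet whose throughput/latency we want estimated...
	IACA_END
}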

So let’s take a look at the throughput for FTransform::TransformFVector4(). Keep in mind that this is called eight times, once for each vertex of the bounding box.

Throughput Analysis Report
--------------------------
Block Throughput: 27.05 Cycles       Throughput Bottleneck: Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 17.0   0.0  | 9.0  | 8.5    6.0  | 8.5    6.0  | 5.0  | 27.0 |
-------------------------------------------------------------------------

This means that we take ~27 cycles per vertex or ~216 for the whole bounding box. If we look at the throughput of FBox::TransformBy() with the FMatrix input it looks like this:

Throughput Analysis Report
--------------------------
Block Throughput: 80.00 Cycles       Throughput Bottleneck: Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 24.0   0.0  | 38.0 | 17.0   8.0  | 17.0   8.0  | 18.0 | 80.0 |
-------------------------------------------------------------------------

That’s ~80 cycles for all the vertices which is much better. But since we need to go from FTransform to FMatrix let’s see what that takes by getting the throughput for the conversion:

Throughput Analysis Report
--------------------------
Block Throughput: 29.00 Cycles       Throughput Bottleneck: Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 6.0    0.0  | 6.0  | 5.5    3.0  | 5.5    3.0  | 5.0  | 29.0 |
-------------------------------------------------------------------------

That’s ~29 cycles. So if we add the cost of conversion plus the actual transform then we get ~109 cycles.

So in theory this should work out just fine, so let’s rewrite the function:

FBox FBox::TransformBy(const FTransform & M) const
{
	return TransformBy(M.ToMatrixWithScale());
}

Pretty simple change, but since IACA provides estimates, we have to profile to see if it actually improves performance. To do that I first made a stress test that transformed the same AABB 214748364 times, to make sure I could measure the performance delta and to avoid any data or instruction cache misses “polluting” the profile. So here is the profiling data with the old code in red, and the new one in green:

[Image: TransformByStressProfile]
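
The harness was along these lines (a sketch of the idea, not the exact test code):

// Transforming the same box over and over keeps the data and instructions
// hot in the caches, so the delta between the two TransformBy() versions
// dominates the measurement instead of memory effects.
const FBox Box(FVector(-1.0f, -1.0f, -1.0f), FVector(1.0f, 1.0f, 1.0f));
const FTransform Transform(FQuat(FVector(0.0f, 0.0f, 1.0f), 0.5f), FVector(10.0f, 0.0f, 0.0f));

FBox Accumulated(0);
const double StartSeconds = FPlatformTime::Seconds();
for (int32 Iteration = 0; Iteration < 214748364; ++Iteration)
{
	Accumulated += Box.TransformBy(Transform); // keep the result live so the compiler can't drop the loop
}
const double ElapsedSeconds = FPlatformTime::Seconds() - StartSeconds;
UE_LOG(LogTemp, Log, TEXT("FBox::TransformBy stress: %fs (%s)"), ElapsedSeconds, *Accumulated.ToString());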

Pretty impressive win considering the small change in the code. But let’s see how it works on the Elemental demo like last time:

[Image: TransformByElementalProfile]

We got an average 6x improvement on the top hotspot. And here is a look at the profiling data; see how far it dropped down the list of hotspots.

[Image: TransformByFast]

I think the lesson here, just like last time, is that it is really important to think about the data and the platform. Don’t take the OOP approach of “I will do this right for one vertex and then call it multiple times”, as if the number of vertices in a bounding box could change at some point. Instead, take advantage of the fact that you have 8 vertices and see how you can write something optimal and concise.

The other lesson is to not be afraid of doing some data contortion to improve performance. In the optimal case there wouldn’t be a need for that, but if that’s not the case then see if you can get a performance win by shuffling data around even if shuffling itself has a cost.

And as usual: always profile. Don’t rely on static analysis like IACA, and don’t think too highly of your skills to extrapolate performance gains; just profile and assess.

Optimizing string function in UE4

Back in March, during GDC, Epic Games released Unreal Engine 4 with a subscription-based license. As soon as it was released I decided to license it to learn and exercise my skills, something I couldn’t do back in March since I was helping EA Sports reach alpha on NHL 15, doing back-end work for the UI. As soon as I got it I started profiling to understand the performance characteristics of the engine. As this blog goes on you will see that performance is something very important to me =) . Since then I have come up with half a dozen optimizations, some of which are already integrated in Epic’s Perforce, and some of which are still under review. So today I’m going to talk about a really basic optimization with a considerable impact on performance: converting a character to lower case.

As with any serious performance work, I started by profiling, which provides the most critical data for optimization. I refuse to do any optimization work unless I have profiling data that backs it up. I also add the before/after profiling data to the commit message to make explicit why the work needed to be done, and I make sure to include the specs of the hardware used for profiling. All the profiling in this post was done on this machine:

[Image: MachineSpecs]

The first thing I did was hook up Intel’s VTune to Unreal Engine 4, since I also try to avoid non-automated profiling as much as possible. After some work dealing with the Unreal Build Tool I got it integrated properly: I was able to add frame markers and events, and to start and stop the capture of profiling data from code. With that ready, I decided to profile the Elemental demo, a demo first shown back in 2012 to show the potential of Unreal Engine 4. Martin Mittring made a presentation about it at SIGGRAPH 2012. If you have never seen the demo, here is a video of it:

[Video: Unreal Engine 4 Elemental demo]
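
As for the markers and capture control mentioned above: they come from Intel’s ITT API, which ships with VTune. The integration boils down to calls along these lines (a sketch of the kind of instrumentation used, not the exact code):

#include <ittnotify.h> // Intel's instrumentation and tracing API, ships with VTune

static __itt_domain* GProfilerDomain = __itt_domain_create("UnrealEngine4");

void OnFrameBegin() { __itt_frame_begin_v3(GProfilerDomain, nullptr); } // frame marker: begin
void OnFrameEnd()   { __itt_frame_end_v3(GProfilerDomain, nullptr); }   // frame marker: end

void StartProfileCapture() { __itt_resume(); } // start capturing (launch collection paused)
void StopProfileCapture()  { __itt_pause(); }  // stop capturing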

I decided to profile the Elemental demo from frame 2 to frame 1000, just to skip all the data loading in the first frame. That meant that nothing would be profiled during the first ~3.6 seconds of the demo; the profiler would be enabled at the start of the second frame and stopped at the end of frame 1000. Given that this is a scripted demo, this let me make sure that any optimization I did actually improved performance. The profiling showed this:

[Image: ToLower-Opt-Slow]

For now just ignore FBox::TransformBy(), since I will talk about it in another post. After that, the top hotspot was TChar<wchar_t>::ToLower(). Since the Unreal Engine 4 EULA allows it, I will share this small piece of code:

template <> inline
TChar<WIDECHAR>::CharType TChar<WIDECHAR>::ToLower(CharType c)
{
	// compiler generates incorrect code if we try to use TEXT('char') instead of the numeric values directly
	// some special cases
	switch (WIDECHAR(c))
	{
		// these are not 32 apart
	case 159: return 255; // diaeresis y
	case 140: return 156; // digraph ae

		// characters within the 192 - 255 range which have no uppercase/lowercase equivalents
	case 240:
	case 208:
	case 223:
	case 247:
		return c;
	}

	if ((c >= 192 && c < 223) || (c >= LITERAL(CharType, 'A') && c <= LITERAL(CharType, 'Z')))
	{
		return c + UPPER_LOWER_DIFF;
	}

	// no lowercase equivalent
	return c;
}

This converts a wide char from upper case to lower case. Taken on its own, most people wouldn’t think of this piece of code as a top hotspot. Unfortunately, programmers in general don’t seem to think about the performance implications of their string-handling code. In particular, the Unreal Engine 4 code used too many branches, which would be hard for the branch predictor to predict given the random nature of the input. There were two things to be done here: the most obvious was to reduce the number of calls to the function, but the other was to optimize the function itself by reducing the number of branches. I will go into how I reduced the number of calls in another post, so let’s focus on optimizing the actual function.

The code makes it pretty explicit what the valid input range is, and what simply returns the input char. Since the range of valid input is small, it really lends itself to a look-up table. The table has to cover the whole valid input range so that the only branch needed is the one that checks whether the input char is within that range. The first step was to create the look-up table, which was really easy:

static const size_t ConversionMapSize = 256U;
static const uint8 ConversionMap[ConversionMapSize] =
{
	0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F,
	0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F,
	0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2A, 0x2B, 0x2C, 0x2D, 0x2E, 0x2F,
	0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x3A, 0x3B, 0x3C, 0x3D, 0x3E, 0x3F,
	0x40, 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
	'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 0x5B, 0x5C, 0x5D, 0x5E, 0x5F,
	0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6A, 0x6B, 0x6C, 0x6D, 0x6E, 0x6F,
	0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7A, 0x7B, 0x7C, 0x7D, 0x7E, 0x7F,
	0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88, 0x89, 0x8A, 0x8B, 0x9C, 0x8D, 0x8E, 0x8F,
	0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, 0x98, 0x99, 0x9A, 0x9B, 0x9C, 0x9D, 0x9E, 0xFF,
	0xA0, 0xA1, 0xA2, 0xA3, 0xA4, 0xA5, 0xA6, 0xA7, 0xA8, 0xA9, 0xAA, 0xAB, 0xAC, 0xAD, 0xAE, 0xAF,
	0xB0, 0xB1, 0xB2, 0xB3, 0xB4, 0xB5, 0xB6, 0xB7, 0xB8, 0xB9, 0xBA, 0xBB, 0xBC, 0xBD, 0xBE, 0xBF,
	0xE0, 0xE1, 0xE2, 0xE3, 0xE4, 0xE5, 0xE6, 0xE7, 0xE8, 0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF,
	0xF0, 0xF1, 0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xDF,
	0xE0, 0xE1, 0xE2, 0xE3, 0xE4, 0xE5, 0xE6, 0xE7, 0xE8, 0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF,
	0xF0, 0xF1, 0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF
};

This table is just 256 bytes long, which is 4 cache lines on the relevant platforms for Unreal Engine 4. That shouldn’t be a concern at all, since the use case is converting full strings char by char: the first char may miss the cache when fetching the table, but the rest of the chars will hit.

Since we have the look-up table, we can now index into it with the input char, as long as the input char is within the range covered by the table. For anything outside that range we just return the input char unchanged:

template <> inline
TChar<WIDECHAR>::CharType TChar<WIDECHAR>::ToLower( CharType c )
{
	return (c < static_cast<CharType>(ConversionMapSize)) ? static_cast<CharType>(ConversionMap[c]) : c;
}
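
A few sanity checks against the special cases called out in the old implementation (hypothetical test code, using the engine’s check() macro):

check(TChar<WIDECHAR>::ToLower(WIDECHAR('A')) == WIDECHAR('a'));     // regular upper case
check(TChar<WIDECHAR>::ToLower(WIDECHAR(159)) == WIDECHAR(255));     // diaeresis y, not 32 apart
check(TChar<WIDECHAR>::ToLower(WIDECHAR(140)) == WIDECHAR(156));     // digraph ae, not 32 apart
check(TChar<WIDECHAR>::ToLower(WIDECHAR(223)) == WIDECHAR(223));     // no lowercase equivalent
check(TChar<WIDECHAR>::ToLower(WIDECHAR(0x3A9)) == WIDECHAR(0x3A9)); // outside the table, returned as-is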

Now we have a single branch. Time to profile the optimization: since I have everything automated, I ran the exact same process, captured the same number of frames, and this was the performance difference in the function:

[Image: ToLower-Opt-Fast]

Without much effort I decreased the CPU time spent in the function by ~41%. This is a big difference that shows how performance can be dominated by branches. It also shows how critical it is to profile before optimizing: I’m pretty sure most programmers would have assumed the time would be dominated by some complex subsystem (such as animation) rather than something as mundane as dealing with chars.

The basic lesson here is to profile before doing any optimization work; don’t assume that the most complex subsystems are always the bottleneck. Even more important is to be aware of your target platform and to write code in a way that suits it.