Optimizing a string function in UE4

Back in March, during GDC, Epic Games released Unreal Engine 4 with a subscription-based license. As soon as it was released I decided to license it to learn and exercise my skills, something I hadn't been doing back in March since I was helping EA Sports reach the alpha stage on NHL 15, doing back-end work for the UI. As soon as I got it I started profiling to understand the performance characteristics of the engine. As this blog goes on you will see that performance is something very important to me =) . Since then I have come up with half a dozen optimizations, some of which are already integrated in Epic's Perforce, and some of which are still under review. So today I'm going to talk about a really basic optimization with a considerable impact on performance: converting a character to lower case.

As with any serious performance work, I started by profiling, which provides the most critical data for optimization. I refuse to do any optimization work unless I have profiling data that backs it up. I also add the before/after profiling data to the commit message to make explicit why the work was needed, and I make sure to include the specs of the hardware used for profiling. All the profiling that I will show in this post was done on this machine:

[Image: specs of the machine used for profiling]

The first thing I did was hook Intel's VTune up to Unreal Engine 4, since I also try to avoid non-automated profiling as much as possible. After some work dealing with the Unreal Build Tool I got it integrated properly: I was able to add frame markers and events, and to start and stop the capture of profiling data from code. With that ready I decided to profile the Elemental demo, a demo first shown back in 2012 to show the potential of Unreal Engine 4. Martin Mittring gave a presentation about it at SIGGRAPH 2012. If you have never seen the demo, here is a video of it:

I decided to profile the Elemental demo from frame 2 to frame 1000, to skip all the data loading in the first frame. That meant that nothing would be profiled for the first ~3.6 seconds of the demo; the profiler would then be enabled at the start of the second frame and stopped at the end of frame 1000. Given that this is a scripted demo, this setup allowed me to make sure that any optimization I did actually improved performance. The profiling showed this:

[Image: VTune hotspots before the optimization (ToLower-Opt-Slow)]

For now just ignore FBox::TransformBy(), since I will talk about it in another post. After that, the top hotspot was TChar<wchar_t>::ToLower(). Since the Unreal Engine 4 EULA allows it, I will share this small piece of code:

template <> inline
TChar<WIDECHAR>::CharType TChar<WIDECHAR>::ToLower(CharType c)
{
	// compiler generates incorrect code if we try to use TEXT('char') instead of the numeric values directly
	// some special cases
	switch (WIDECHAR(c))
	{
		// these are not 32 apart
	case 159: return 255; // diaeresis y
	case 140: return 156; // digraph ae

		// characters within the 192 - 255 range which have no uppercase/lowercase equivalents
	case 240:
	case 208:
	case 223:
	case 247:
		return c;
	}

	if ((c >= 192 && c < 223) || (c >= LITERAL(CharType, 'A') && c <= LITERAL(CharType, 'Z')))
	{
		return c + UPPER_LOWER_DIFF;
	}

	// no lowercase equivalent
	return c;
}

This converts a wide char from upper case to lower case. Taken on its own, most people wouldn't expect this piece of code to be a top hotspot. Unfortunately, programmers in general don't seem to think about the performance implications of their string-handling code. In particular, this code uses too many branches, which are hard for the branch predictor to predict given the random nature of the input. There were two things to be done here: the most obvious was to reduce the number of calls to the function, but the other was to optimize the function itself by reducing the number of branches. I will get to how I reduced the number of calls in another post, so let's focus on optimizing the actual function.

The code makes it pretty explicit what the valid input is and what just returns the input char unchanged. Since the range of valid input is small, it really lends itself to a look-up table. The table only has to be as big as the valid input range, leaving a single branch to check that the input char fits within that range. The first step was to create the look-up table, which was really easy:

static const size_t ConversionMapSize = 256U;
static const uint8 ConversionMap[ConversionMapSize] =
{
	0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F,
	0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F,
	0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2A, 0x2B, 0x2C, 0x2D, 0x2E, 0x2F,
	0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x3A, 0x3B, 0x3C, 0x3D, 0x3E, 0x3F,
	0x40, 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
	'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 0x5B, 0x5C, 0x5D, 0x5E, 0x5F,
	0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6A, 0x6B, 0x6C, 0x6D, 0x6E, 0x6F,
	0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7A, 0x7B, 0x7C, 0x7D, 0x7E, 0x7F,
	0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88, 0x89, 0x8A, 0x8B, 0x9C, 0x8D, 0x8E, 0x8F,
	0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, 0x98, 0x99, 0x9A, 0x9B, 0x9C, 0x9D, 0x9E, 0xFF,
	0xA0, 0xA1, 0xA2, 0xA3, 0xA4, 0xA5, 0xA6, 0xA7, 0xA8, 0xA9, 0xAA, 0xAB, 0xAC, 0xAD, 0xAE, 0xAF,
	0xB0, 0xB1, 0xB2, 0xB3, 0xB4, 0xB5, 0xB6, 0xB7, 0xB8, 0xB9, 0xBA, 0xBB, 0xBC, 0xBD, 0xBE, 0xBF,
	0xE0, 0xE1, 0xE2, 0xE3, 0xE4, 0xE5, 0xE6, 0xE7, 0xE8, 0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF,
	0xF0, 0xF1, 0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xDF,
	0xE0, 0xE1, 0xE2, 0xE3, 0xE4, 0xE5, 0xE6, 0xE7, 0xE8, 0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF,
	0xF0, 0xF1, 0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF
};

This table is just 256 bytes, which is four cache lines on the relevant platforms for Unreal Engine 4. That shouldn't be a concern at all, since the use case is converting full strings char by char: the first char may incur a cache miss to fetch the table, but the rest of the chars will be cache hits.

With the look-up table in place, we can index it with the input char as long as the char falls within the range covered by the table. For anything outside that range we just return the input char unchanged.

template <> inline 
TChar<WIDECHAR>::CharType TChar<WIDECHAR>::ToLower( CharType c )
{
	return (c < static_cast<CharType>(ConversionMapSize)) ? static_cast<CharType>(ConversionMap[c]) : c;
}

Now we have a single branch. Time to profile the optimization. Since everything was automated, I ran the exact same process capturing the same number of frames, and this was the performance difference in the function:

[Image: VTune hotspots after the optimization (ToLower-Opt-Fast)]

Without much effort I decreased the CPU time spent in the function by ~41%. This is a big difference that shows how performance can be dominated by branches. It also shows how critical it is to profile before optimizing. I'm pretty sure most programmers would have guessed that the time would be dominated by some complex subsystem (such as animation) rather than something as mundane as dealing with chars.

The basic lesson here is that you should profile before doing any optimization work; don't assume that the most complex subsystems are always the bottleneck. Even more important is to be aware of your target platform and write code in a way that suits it.
