Optimizing AABB transform in Unreal Engine 4

Last time I didn’t go over the top hotspot in Unreal Engine 4’s Elemental demo, so I will tackle that now. Let’s look at the profiling data:

[Image: TransformSlowSource — profiling of the original code]

The FBox::TransformBy() function transforms the bounding box with an FTransform, which stores a quaternion for the rotation, a translation vector, and a scale vector. The FTransform implementation might be SIMD or scalar depending on the platform, but even in the SIMD version there is a whole lot of work involved in transforming a given position with an FTransform. If you have access to the source, check FTransform::TransformFVector4() in Engine\Source\Runtime\Core\Public\Math\TransformVectorized.h. The FTransform object itself is smaller than an FMatrix (3 SIMD registers vs. 4), but the transform operation isn’t as straightforward. On top of that, given the OOP nature of the implementation, it doesn’t take advantage of the fact that there is data that could be reused for every vertex of the bounding box. Here is the old implementation, followed by a rough sketch of the per-vertex work it implies:

FBox FBox::TransformBy(const FTransform & M) const
{
	FVector Vertices[8] =
	{
		FVector(Min),
		FVector(Min.X, Min.Y, Max.Z),
		FVector(Min.X, Max.Y, Min.Z),
		FVector(Max.X, Min.Y, Min.Z),
		FVector(Max.X, Max.Y, Min.Z),
		FVector(Max.X, Min.Y, Max.Z),
		FVector(Min.X, Max.Y, Max.Z),
		FVector(Max)
	};

	FBox NewBox(0);

	for (int32 VertexIndex = 0; VertexIndex < ARRAY_COUNT(Vertices); VertexIndex++)
	{
		FVector4 ProjectedVertex = M.TransformPosition(Vertices[VertexIndex]);
		NewBox += ProjectedVertex;
	}

	return NewBox;
}
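
For context, M.TransformPosition() on an FTransform conceptually applies the scale, then the quaternion rotation, then the translation for every single vertex. A scalar sketch of that per-vertex work (illustrative only; the engine’s actual implementation in TransformVectorized.h is SIMD) looks roughly like this:

struct Vec3 { float X, Y, Z; };
struct Quat { float X, Y, Z, W; };

static Vec3 Cross(const Vec3& A, const Vec3& B)
{
	return { A.Y * B.Z - A.Z * B.Y, A.Z * B.X - A.X * B.Z, A.X * B.Y - A.Y * B.X };
}

// Roughly what transforming one point by (Scale, Rotation, Translation) costs:
// a component-wise scale, a quaternion rotation (two cross products plus a few
// multiply-adds), and a translation -- repeated for all 8 corners of the box.
static Vec3 TransformPosition(const Vec3& P, const Vec3& Scale, const Quat& Q, const Vec3& T)
{
	Vec3 S = { P.X * Scale.X, P.Y * Scale.Y, P.Z * Scale.Z };

	// Rotate by the unit quaternion Q: v' = v + 2w(q x v) + 2(q x (q x v))
	Vec3 U = { Q.X, Q.Y, Q.Z };
	Vec3 C1 = Cross(U, S);
	Vec3 C2 = Cross(U, C1);
	Vec3 R = {
		S.X + 2.0f * (Q.W * C1.X + C2.X),
		S.Y + 2.0f * (Q.W * C1.Y + C2.Y),
		S.Z + 2.0f * (Q.W * C1.Z + C2.Z)
	};

	return { R.X + T.X, R.Y + T.Y, R.Z + T.Z };
}

None of this is hard on its own, but compare it with what a single matrix transform needs per vertex.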

Fortunately FBox has a TransformBy overload that takes an FMatrix, which is way better: it is fairly small vectorized code that takes into account all the vertices of the bounding box at once. Instead of transforming each vertex with the quaternion and adding it to the new bounding box one at a time, it transforms all the vertices with the matrix (which takes just 3 _mm_shuffle_ps, 3 _mm_mul_ps, and 3 _mm_add_ps per vertex) and then recalculates the bounding box (which takes 7 _mm_min_ps and 7 _mm_max_ps).
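
To give a feel for what that vectorized path looks like, here is a rough sketch of the same idea written directly with SSE intrinsics (this is illustrative, not the engine’s VectorRegister code; it assumes a row-major 4x4 matrix with the translation in the last row, the way FMatrix lays it out, and points with W == 1):

#include <xmmintrin.h>

// Transform one point (X, Y, Z, 1) by a row-major 4x4 matrix stored as four __m128 rows:
// 3 shuffles to splat the components, 3 multiplies, and 3 adds per vertex.
__m128 TransformPoint(__m128 P, const __m128 Rows[4])
{
	__m128 X = _mm_shuffle_ps(P, P, _MM_SHUFFLE(0, 0, 0, 0));
	__m128 Y = _mm_shuffle_ps(P, P, _MM_SHUFFLE(1, 1, 1, 1));
	__m128 Z = _mm_shuffle_ps(P, P, _MM_SHUFFLE(2, 2, 2, 2));

	__m128 Result = _mm_mul_ps(X, Rows[0]);
	Result = _mm_add_ps(Result, _mm_mul_ps(Y, Rows[1]));
	Result = _mm_add_ps(Result, _mm_mul_ps(Z, Rows[2]));
	Result = _mm_add_ps(Result, Rows[3]); // translation row, since W == 1
	return Result;
}

// Rebuild the bounding box from the 8 transformed corners: 7 min and 7 max operations.
void BoundsOfCorners(const __m128 Corners[8], __m128& OutMin, __m128& OutMax)
{
	OutMin = Corners[0];
	OutMax = Corners[0];
	for (int i = 1; i < 8; ++i)
	{
		OutMin = _mm_min_ps(OutMin, Corners[i]);
		OutMax = _mm_max_ps(OutMax, Corners[i]);
	}
}

Compare that with the scale-rotate-translate sequence above and it is easy to see where the cycles go.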

So using the FMatrix overload in the current implementation is better than using the FTransform one. But in general the engine uses FTransforms for local-to-world transforms. That means we would either have to rewrite the code to use matrices instead, or convert the FTransform into an FMatrix when transforming the bounding box. So you may ask “how much work would it take to change from FTransform to FMatrix?” Here is part of the answer:

[Image: TransfromByVA — TransformBy call sites across the engine]

As you can see, there is a whole bunch of work to be done if we want to go that route. That raises the question: can the cost of converting to an FMatrix be covered by the cycles saved by not transforming with the FTransform? The switch from FTransform to FMatrix isn’t cheap in terms of number of instructions, but since it mostly involves shuffling it shouldn’t be hard to pipeline. To find out I decided to use another tool from Intel, the Intel Architecture Code Analyzer (IACA). Basically, this tool lets you statically analyze the throughput and latency of a snippet of code. People who have worked on the PlayStation 3 will find it similar to SN Tuner’s view of even and odd instructions on the SPUs; if you haven’t seen that, take a look at Jaymin Kessler’s video on software pipelining. Anyway, IACA is a nice tool for getting an estimate of the performance of a snippet of code, and it works just fine for our purpose.
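
If you haven’t used IACA before: you wrap the code you care about with a pair of markers from the iacaMarks.h header that ships with the tool, compile, and then point the iaca executable at the resulting binary to get reports like the ones below. A minimal usage sketch (assuming the IACA_START/IACA_END macros from that header; 64-bit MSVC builds use the IACA_VC64_START/IACA_VC64_END variants instead):

#include "iacaMarks.h" // ships with IACA, provides the analysis markers

float SumArray(const float* Values, int Count)
{
	float Total = 0.0f;
	IACA_START // first instruction of the block IACA will analyze
	for (int i = 0; i < Count; ++i)
	{
		Total += Values[i];
	}
	IACA_END // end of the analyzed block
	return Total;
}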

So let’s take a look at the throughput for FTransform::TransformFVector4(). Keep in mind that this is called eight times, once for each vertex of the bounding box.

Throughput Analysis Report
--------------------------
Block Throughput: 27.05 Cycles       Throughput Bottleneck: Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 17.0   0.0  | 9.0  | 8.5    6.0  | 8.5    6.0  | 5.0  | 27.0 |
-------------------------------------------------------------------------

This means that we take ~27 cycles per vertex or ~216 for the whole bounding box. If we look at the throughput of FBox::TransformBy() with the FMatrix input it looks like this:

Throughput Analysis Report
--------------------------
Block Throughput: 80.00 Cycles       Throughput Bottleneck: Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 24.0   0.0  | 38.0 | 17.0   8.0  | 17.0   8.0  | 18.0 | 80.0 |
-------------------------------------------------------------------------

That’s ~80 cycles for all the vertices, which is much better. But since we need to go from FTransform to FMatrix, let’s see what that costs by getting the throughput for the conversion:

Throughput Analysis Report
--------------------------
Block Throughput: 29.00 Cycles       Throughput Bottleneck: Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 6.0    0.0  | 6.0  | 5.5    3.0  | 5.5    3.0  | 5.0  | 29.0 |
-------------------------------------------------------------------------

That’s ~29 cycles. So if we add the cost of the conversion to the cost of the matrix transform, we get ~109 cycles, roughly half of the ~216 cycles of the FTransform path.

In theory this should work out just fine, so let’s rewrite the function:

FBox FBox::TransformBy(const FTransform & M) const
{
	return TransformBy(M.ToMatrixWithScale());
}

Pretty simple change, but since IACA only provides estimates we still have to profile to see whether it actually improves performance. To do that I first wrote a stress test that transformed the same AABB 214748364 times, both to make sure I could measure the performance delta and to keep data or instruction cache misses from “polluting” the profiling; a sketch of that kind of harness follows the capture below. Here is the profiling data, with the old code in red and the new one in green:

[Image: TransformByStressProfile — stress test profiling, old code in red, new code in green]
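
The harness itself doesn’t have to be anything fancy; something along these lines (a hypothetical sketch, not the exact code behind the capture above) is enough to expose the difference between the two TransformBy paths:

// Hypothetical stress loop: transform the same box over and over so everything
// stays hot in the caches and the transform itself dominates the profile.
FBox StressTransformBy(const FBox& Box, const FTransform& Transform, int32 Iterations)
{
	FBox Accumulated(0);
	for (int32 i = 0; i < Iterations; ++i)
	{
		Accumulated += Box.TransformBy(Transform);
	}
	// Returning the accumulated box keeps the compiler from discarding the work.
	return Accumulated;
}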

Pretty impressive win considering how small the change is. But let’s see how it does in the Elemental demo, like last time:

[Image: TransformByElementalProfile — Elemental demo profiling]

We got an average 6x improvement on the top hotspot. And here is a look at the profiling data; notice how the function moved down the list of hotspots.

[Image: TransformByFast — updated hotspot list]

I think the lesson here is that, just like last time, it is really important to think about the data and the platform. Don’t take the OOP approach of “I will do this right for one vertex and then call it multiple times”, as if the number of vertices in a bounding box might change at some point. Instead, take advantage of the fact that you have exactly 8 vertices and see how you can write something optimal and concise for that case.

The other lesson is not to be afraid of doing some data contortion to improve performance. In the ideal case there wouldn’t be any need for it, but when that isn’t the case, see whether you can get a performance win by shuffling data around, even if the shuffling itself has a cost.

And as usual: always profile. Don’t rely solely on static analysis like IACA, and don’t think so highly of your skills that you just extrapolate performance gains; profile and assess.
