The importance of profiling automation

Needless to say, one of the basic pillars of immersion in a game is performance, perceived as smoothness and responsiveness. A game with terrible performance breaks the immersion that could otherwise have been present. But even though we understand that as an industry, we still don’t seem to put constant effort into making sure that pillar is solid in our games. A few months ago I went to Sony’s PlayStation Devcon 2017, where I got to talk to many fellow rendering programmers as well as tools developers. One of the things that caught my attention was that even though most (if not all) of the people I talked to were working on big AAA games or technology, few of them showed any interest in profiling automation. This was also evident during a rendering round table where I was pretty much the only one asking for metrics output from Sony’s tools to use in automation. I thought that perhaps the work and knowledge on automation were left to SQEs (Software Quality Engineers), but when I asked, many told me they had few or no automation tools to check for performance regressions. That is alarming because AAA games tend to be huge beasts with a lot of people making content and writing features, but you rarely have as many people as you need to keep performance in check as the game evolves. In the end we seem to rely on heroic efforts at the end of the project, or even on patches after launch. So my idea with this short piece is to explain why profiling automation is important and what features are required to get better automation.

Game with terrible performance on release.

The performance of a game is one of those things that seems to become critical only at the end of a project. That goes in line with the constant worry about “premature optimization”, which far more often is “overdue optimization” and serves as the justification for poor architectural choices and poor use of available resources. While you might think you don’t have the time to focus on performance as you add new features, it is always important to see the performance impact of the code you add. Many programmers think they can get by on assumptions alone, but unfortunately most if not all programmers (myself included) make mistakes when assessing performance without profiling data. For some reason unknown to me, programmers are fully aware of the limitations of estimating tasks during a project, yet they are very confident in their understanding of what the performance issues on a project are, or of how the code they just submitted impacts them.

Personal experience.

My perceptions are based on the fact that before joining EA Vancouver I was a consultant, and I was mostly contracted to improve performance. My clients knew they had performance issues they needed fixed, but they couldn’t assign an employee to the task, nor hire a new full-time employee just to fix it. Having done that type of work for so long, I told my clients that I would work on whatever they asked for (I always signed staff-augmentation-type contracts), but I always recommended that they assign me a task to just do profiling. Thankfully every single client agreed, but they also told me they already knew what the issues were and what to look for. Instead I would profile CPU, GPU, and IO without looking specifically for the performance issues the client wanted fixed. More often than not the profiling data showed that the biggest performance issues were completely different from what they thought. That experience, repeated over and over, only reinforced my view that as developers we are just as good at estimating performance as we are at estimating tasks in a project. But that wasn’t the only issue common among my different clients. Many of them blamed other teams or technology for their performance issues, and when they knew nobody else was to blame they often found out too late about performance regressions they had introduced. I can’t even count the number of times someone had to be sent on a scavenger hunt trying to find the change or changes that caused a performance regression in the last week or month.

Solution.

What is the solution to this issue? Profiling. But there are two distinct approaches to it. One is to change the culture of the programmers so they profile every change they make (even if they don’t think it is performance-sensitive code), and at the same time change the culture of management to make sure it gets done with a proper allocation of time for the task. That’s pretty much impossible to do, especially for teams within a game that don’t have the same performance-focused mindset. The odds of getting that done on a rendering or systems team are considerably higher than on the UI team. But even on the rendering team it is hard to allocate the time, and it is even harder for non-technical management to understand the importance of doing so. The other approach is to move away from manual profiling and processes and instead rely on automation. Automated performance runs can get rid of the issue by providing data as often as your hardware infrastructure allows. In essence, a good automation system with a good dashboard does the high-level profiling an engineer would have done, and it also makes the low-level profiling data available to the engineers. At the same time the dashboard can provide high-level performance numbers for technical management. With that data in hand they can ask each team about performance issues and start planning and allocating time to get them fixed.

Defining profiling smoke.

Before you even look at tools and the metrics you want to gather, it is critical that your team defines what a profiling run is. This is critical because if the runs are not consistent in terms of what the game is doing during profiling, it will be extremely hard to decipher performance trends out of the noise in the profiling data. And not only that, but people will not value the profiling data since they know a performance spike might go down by itself by the time the next automated profiling run gets done. Obviously each team within a game will have different profiling scenarios that they consider important, so it is critical that any automated profiling covers those scenarios. But it is also important that the profiling smoke tests not be segmented per team; instead there should be a single smoke that covers all the different scenarios. The value of doing that is that since the different subsystems in a game run concurrently, many issues arise at the interaction points of subsystems owned by different teams. For example, the UI team might think that the scenario they need to test is a bunch of UI-heavy screens, but perhaps a change in UI impacts performance during normal gameplay even though only a few UI elements are present for the HUD. The other important aspect is determinism. Most games are not fully deterministic, but it is important that the profiling smoke be as deterministic as possible within the constraints of the game. Any engineer should be able to clearly correlate the data of two different profiling smoke runs, and that’s only possible if there is an acceptable level of determinism in the runs.

Data to gather and process.

The next step is to determine what type of data to gather and how the gathering process itself impacts the performance of the profiling smoke. The level of detail can range from capturing instruction-level counters at a fast rate, to very high-level timers that just record how long a section of the game took to execute (for example, the time to run through a whole level). The benefit of the low-level approach is that the engineer has access to very specific detail to tackle a problem, but the amount of data generated is huge and it is bound to impact the performance of the profiling smoke itself. The high-level approach has the benefit of reducing the amount of data to gather and process, and it is very lightweight in its impact, but it doesn’t offer the engineer much more information than “this ran slower/faster than before”. After doing this for a while I have found the happy medium to be timestamped markers emitted on a per-frame basis. The data they offer is good enough to pinpoint specific issues (assuming there are enough markers), it is a model that works just as well on the CPU and the GPU, and it tends to be reasonable to manage in terms of cycles and memory spent per frame. There are multiple tools that can generate and visualize this data:

  • Intel GPA
  • Microsoft PIX
  • SN Systems Razor
  • RAD Game Tools Telemetry

Unfortunately, to the best of my knowledge only Intel VTune and RAD’s Telemetry support exporting the profiling data (feel free to correct me in the comments). Any system that can’t export the data won’t work for this. If that’s the case, then you will have to implement the CPU and GPU markers on your own. That allows you to export the data for the automation system, and the same markers can also drive an in-game profiler if desired.
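
As a rough illustration of what such hand-rolled markers can look like, here is a minimal sketch of a scoped CPU marker that records begin/end timestamps into a per-frame buffer. All names here (FProfileMarker, GFrameEvents, PROFILE_SCOPE) are hypothetical and not part of any engine API; a real implementation would also need thread safety and a GPU counterpart built on GPU timestamp queries.

#include <chrono>
#include <cstdint>
#include <vector>

// One begin/end timestamp pair for a named scope within a frame.
struct FMarkerEvent
{
	const char* Name;
	uint64_t    BeginNs;
	uint64_t    EndNs;
};

// Per-frame storage; flushed to the capture file (with the frame index) at end of frame.
static std::vector<FMarkerEvent> GFrameEvents;

// RAII helper: timestamps the enclosing scope and appends it to the frame buffer.
class FProfileMarker
{
public:
	explicit FProfileMarker(const char* InName)
		: Name(InName)
		, Begin(std::chrono::steady_clock::now())
	{
	}

	~FProfileMarker()
	{
		const auto End = std::chrono::steady_clock::now();
		GFrameEvents.push_back({ Name, ToNs(Begin), ToNs(End) });
	}

private:
	static uint64_t ToNs(std::chrono::steady_clock::time_point T)
	{
		return (uint64_t)std::chrono::duration_cast<std::chrono::nanoseconds>(T.time_since_epoch()).count();
	}

	const char* Name;
	std::chrono::steady_clock::time_point Begin;
};

#define PROFILE_CONCAT_INNER(A, B) A##B
#define PROFILE_CONCAT(A, B) PROFILE_CONCAT_INNER(A, B)
#define PROFILE_SCOPE(NameLiteral) FProfileMarker PROFILE_CONCAT(ProfileMarker_, __LINE__)(NameLiteral)

void UpdateParticles()
{
	PROFILE_SCOPE("UpdateParticles"); // on the order of a hundred of these per frame is plenty for trend data
	// ... actual work for this frame ...
}

At the end of each frame the buffer would be tagged with the frame index, appended to the capture, and cleared; the GPU side follows the same pattern but resolves its timestamp queries a few frames later.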

One thing to keep in mind is that the data will be used for full runs instead of just a few frames. This means that whatever system you implement (or product you purchase) must be able to cope with a considerable amount of data. Even with a modest number of CPU and GPU markers per frame you still generate a sizable amount of data. Just as an example, in one of the profiling smokes I can run at work I can capture 45 minutes of data with around 120 markers per frame. A full capture contains the data for around 20 million markers, which means that just storing the begin and end 64-bit timestamps (with no name) for each marker generates ~320 megabytes of uncompressed data. At the same time, whatever tool you use to generate metrics from the source data must also be able to cope with that amount of data, especially if you want a quick turnaround. Once you have all the data you can generate the relevant metrics that will be the “electrocardiogram” of the game’s performance. The data you decide to display will be highly dependent on the needs of the game, but here are a few ideas (a short sketch of deriving a few of these metrics from per-frame data follows the list).

  • General
    • CPU frame time.
      • Breakdown per subsystem or team.
    • High-water CPU frame time.
    • High-water number of consecutive CPU spikes.
    • Number of allocations per frame.
    • Loading
      • Time to load a level.
      • Time to leave a game and go back to the main menu.
      • High-water memory use during loading phases.
  • Rendering
    • GPU frame time.
    • High-water GPU frame time.
    • High-water number of consecutive GPU spikes.
    • High-water video memory used.
    • Number of missed v-syncs.
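
To make the dashboard side a bit more concrete, here is a minimal sketch (not tied to any particular tool or engine) of how a few of these metrics could be reduced from per-frame CPU times. The frame times are assumed to have already been extracted from the exported marker data, and the 33.3 ms spike threshold is just an example value.

#include <algorithm>
#include <vector>

struct FFrameMetrics
{
	double MaxFrameTimeMs = 0.0;      // high-water frame time
	double AvgFrameTimeMs = 0.0;
	int    MaxConsecutiveSpikes = 0;  // high-water number of consecutive spikes
};

// FrameTimesMs contains one entry per frame, derived from the exported marker data.
FFrameMetrics ReduceFrameTimes(const std::vector<double>& FrameTimesMs, double SpikeThresholdMs = 33.3)
{
	FFrameMetrics Metrics;
	double Sum = 0.0;
	int CurrentRun = 0;

	for (double FrameMs : FrameTimesMs)
	{
		Sum += FrameMs;
		Metrics.MaxFrameTimeMs = std::max(Metrics.MaxFrameTimeMs, FrameMs);

		// Track the longest run of frames above the spike threshold.
		CurrentRun = (FrameMs > SpikeThresholdMs) ? CurrentRun + 1 : 0;
		Metrics.MaxConsecutiveSpikes = std::max(Metrics.MaxConsecutiveSpikes, CurrentRun);
	}

	if (!FrameTimesMs.empty())
	{
		Metrics.AvgFrameTimeMs = Sum / (double)FrameTimesMs.size();
	}
	return Metrics;
}

The same reduction works for GPU frame times or per-subsystem breakdowns; only the input series changes.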

Conclusion.

When everything is said and done, the whole team will know that whatever they do will be profiled. That means performance won’t just vanish as features are added. In the worst-case scenario you will still have performance issues, but you will have the ability to either revert a changelist or decide how the performance will be recovered in order to keep that new piece of code in the depot. Another huge benefit is that this workflow helps keep performance in the mind of the developers instead of just getting a feature implemented at any cost. So in the end technical management gets visibility into performance as the game evolves, the engineers have the proper profiling data when technical management complains, and the rest of production knows whether what they plan to deliver to the end user will actually be achieved in terms of performance.

Optimizing sorting of base passes in Unreal Engine 4.

I know I said I was going to dig into the memory tracking solution, but I realized that I might as well do an optimization since people seem to find them very useful. So I decided to profile and optimize the Landscape Mountains map, which you can download from the Unreal Engine Marketplace. If you haven’t seen it, here is a video:

I decided to profile 900 frames starting from the 100th frame since I didn’t want to profile the initial streaming of the level. Here is the basic thread view.

ThreadProfile

The top thread is doing much more work than the rest; that thread is the render thread. It was obvious that to improve performance I would have to optimize the work happening on the rendering thread, so let’s look at that work.

RenderThreadProfile

The highlighted functions caught my attention because they accounted for almost 8% of the time spent on the render thread, and sorting is usually a fairly self-contained problem to solve. So let’s dig into the actual code execution.

InitViewsCallstack

One of the first calls made each frame to render the scene is to FDeferredShadingSceneRenderer::InitViews(), which is in charge of initializing the scene’s views by checking visibility, sorting elements for the different passes, initializing the dynamic shadows, etc. The sorting of the static meshes with the respective drawing policy kicks off the calls to the functions we saw in the profiling data. Let’s look at that function:

void FDeferredShadingSceneRenderer::SortBasePassStaticData(FVector ViewPosition)
{
	// If we're not using a depth only pass, sort the static draw list buckets roughly front to back, to maximize HiZ culling
	// Note that this is only a very rough sort, since it does not interfere with state sorting, and each list is sorted separately
	if (EarlyZPassMode == DDM_None)
	{
		SCOPE_CYCLE_COUNTER(STAT_SortStaticDrawLists);

		for (int32 DrawType = 0; DrawType < FScene::EBasePass_MAX; DrawType++)
		{
			Scene->BasePassNoLightMapDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassSimpleDynamicLightingDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassCachedVolumeIndirectLightingDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassCachedPointIndirectLightingDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassHighQualityLightMapDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassDistanceFieldShadowMapLightMapDrawList[DrawType].SortFrontToBack(ViewPosition);
			Scene->BasePassLowQualityLightMapDrawList[DrawType].SortFrontToBack(ViewPosition);
		}
	}
}

That’s extremely simple to understand. It just sorts the different draw lists one after the other, twice: once for the default base pass and once for the masked base pass. The first thing that becomes obvious is that we could sort those draw lists asynchronously on different threads and get some other work going while that’s being done. But in order to do that we need to know that the sort is self-contained and doesn’t affect global state. To verify that we have to dig deeper into the call stack. Let’s look at the code of the next function, TStaticMeshDrawList::SortFrontToBack():

template<typename DrawingPolicyType>
void TStaticMeshDrawList<DrawingPolicyType>::SortFrontToBack(FVector ViewPosition)
{
	// Cache policy link bounds
	for (typename TDrawingPolicySet::TIterator DrawingPolicyIt(DrawingPolicySet); DrawingPolicyIt; ++DrawingPolicyIt)
	{
		FBoxSphereBounds AccumulatedBounds(ForceInit);

		FDrawingPolicyLink& DrawingPolicyLink = *DrawingPolicyIt;
		for (int32 ElementIndex = 0; ElementIndex < DrawingPolicyLink.Elements.Num(); ElementIndex++)
		{
			FElement& Element = DrawingPolicyLink.Elements[ElementIndex];

			if (ElementIndex == 0)
			{
				AccumulatedBounds = Element.Bounds;
			}
			else
			{
				AccumulatedBounds = AccumulatedBounds + Element.Bounds;
			}
		}
		DrawingPolicyIt->CachedBoundingSphere = AccumulatedBounds.GetSphere();
	}

	SortViewPosition = ViewPosition;
	SortDrawingPolicySet = &DrawingPolicySet;

	OrderedDrawingPolicies.Sort( TCompareStaticMeshDrawList<DrawingPolicyType>() );
}

In this piece of code there are a couple of things that could be changed. First of all, you can see that inside the inner loop the check for ElementIndex == 0 only matters on the first iteration, yet it is evaluated on every iteration. Now, you could say that the branch predictor in the CPU would figure out that ElementIndex won’t be 0 in the following iterations, but it is better to write code that doesn’t rely too much on branch predictors since they are fairly different from one CPU to another. Let’s make that change first, plus a couple of minor changes:

template<typename DrawingPolicyType>
void TStaticMeshDrawList<DrawingPolicyType>::SortFrontToBack(FVector ViewPosition)
{
	// Cache policy link bounds
	for (typename TDrawingPolicySet::TIterator DrawingPolicyIt(DrawingPolicySet); DrawingPolicyIt; ++DrawingPolicyIt)
	{
		FBoxSphereBounds AccumulatedBounds(ForceInit);

		FDrawingPolicyLink& DrawingPolicyLink = *DrawingPolicyIt;

		const int32 NumElements = DrawingPolicyLink.Elements.Num();
		if (NumElements > 0)
		{
			AccumulatedBounds = DrawingPolicyLink.Elements[0].Bounds;
		}

		for (int32 ElementIndex = 1; ElementIndex < NumElements; ElementIndex++)
		{
			AccumulatedBounds = AccumulatedBounds + DrawingPolicyLink.Elements[ElementIndex].Bounds;
		}

		DrawingPolicyIt->CachedBoundingSphere = AccumulatedBounds.GetSphere();
	}

	SortViewPosition = ViewPosition;
	SortDrawingPolicySet = &DrawingPolicySet;

	OrderedDrawingPolicies.Sort( TCompareStaticMeshDrawList<DrawingPolicyType>() );
}

Now the actual issue with doing multiple draw lists sorts concurrently with this code is that there are two static variables in TStaticMeshDrawList:

/**
 * Static variables for getting data into the Compare function.
 * Ideally Sort would accept a non-static member function which would avoid having to go through globals.
 */
static TDrawingPolicySet* SortDrawingPolicySet;
static FVector SortViewPosition;

Now this is a case where code comments get outdated because it is possible to send data to the Compare function without using globals and without using thread-local storage or anything similar. Since the assumption was that sending data to the sort function wasn’t possible, here is how the predicate class was written:

/** Helper struct for sorting */
template<typename DrawingPolicyType>
struct TCompareStaticMeshDrawList
{
	FORCEINLINE bool operator()( const FSetElementId& A, const FSetElementId& B ) const
	{
		// Use static Compare from TStaticMeshDrawList
		return TStaticMeshDrawList<DrawingPolicyType>::Compare( A, B ) < 0;
	}
};

Now that we know that, let’s see how those two static variables are used:

template<typename DrawingPolicyType>
int32 TStaticMeshDrawList<DrawingPolicyType>::Compare(FSetElementId A, FSetElementId B)
{
	const FSphere& BoundsA = (*SortDrawingPolicySet)[A].CachedBoundingSphere;
	const FSphere& BoundsB = (*SortDrawingPolicySet)[B].CachedBoundingSphere;

	// Assume state buckets with large bounds are background geometry
	if (BoundsA.W >= HALF_WORLD_MAX / 2 && BoundsB.W < HALF_WORLD_MAX / 2)
	{
		return 1;
	}
	else if (BoundsB.W >= HALF_WORLD_MAX / 2 && BoundsA.W < HALF_WORLD_MAX / 2)
	{
		return -1;
	}
	else
	{
		const float DistanceASquared = (BoundsA.Center - SortViewPosition).SizeSquared();
		const float DistanceBSquared = (BoundsB.Center - SortViewPosition).SizeSquared();
		// Sort front to back
		return DistanceASquared > DistanceBSquared ? 1 : -1;
	}
}

So it is clear that first thing we need to make this asynchronous is to get rid of SortDrawingPolicySet and SortViewPosition. First the predicate class must be rewritten to include the parameters that used to be in the static variables:

/** Helper struct for sorting */
template<typename DrawingPolicyType>
struct TCompareStaticMeshDrawList
{
private:
	const typename TStaticMeshDrawList<DrawingPolicyType>::TDrawingPolicySet * const SortDrawingPolicySet;
	const FVector SortViewPosition;

public:
	TCompareStaticMeshDrawList(const typename TStaticMeshDrawList<DrawingPolicyType>::TDrawingPolicySet * const InSortDrawingPolicySet, const FVector InSortViewPosition)
		: SortDrawingPolicySet(InSortDrawingPolicySet)
		, SortViewPosition(InSortViewPosition)
	{
	}

	FORCEINLINE bool operator()( const FSetElementId& A, const FSetElementId& B ) const
	{
		// Use static Compare from TStaticMeshDrawList
		return TStaticMeshDrawList<DrawingPolicyType>::Compare(A, B, SortDrawingPolicySet, SortViewPosition) < 0;
	}
};

And replace the Compare implementation to fit the new changes:

template<typename DrawingPolicyType>
int32 TStaticMeshDrawList<DrawingPolicyType>::Compare(FSetElementId A, FSetElementId B, const TDrawingPolicySet * const InSortDrawingPolicySet, const FVector InSortViewPosition)
{
	const FSphere& BoundsA = (*InSortDrawingPolicySet)[A].CachedBoundingSphere;
	const FSphere& BoundsB = (*InSortDrawingPolicySet)[B].CachedBoundingSphere;

	// Assume state buckets with large bounds are background geometry
	if (BoundsA.W >= HALF_WORLD_MAX / 2 && BoundsB.W < HALF_WORLD_MAX / 2)
	{
		return 1;
	}
	else if (BoundsB.W >= HALF_WORLD_MAX / 2 && BoundsA.W < HALF_WORLD_MAX / 2)
	{
		return -1;
	}
	else
	{
		const float DistanceASquared = (BoundsA.Center - InSortViewPosition).SizeSquared();
		const float DistanceBSquared = (BoundsB.Center - InSortViewPosition).SizeSquared();
		// Sort front to back
		return DistanceASquared > DistanceBSquared ? 1 : -1;
	}
}

Since the static variables are gone we need to modify TStaticMeshDrawList::SortFrontToBack():

template<typename DrawingPolicyType>
void TStaticMeshDrawList<DrawingPolicyType>::SortFrontToBack(FVector ViewPosition)
{
	// Cache policy link bounds
	for (typename TDrawingPolicySet::TIterator DrawingPolicyIt(DrawingPolicySet); DrawingPolicyIt; ++DrawingPolicyIt)
	{
		FBoxSphereBounds AccumulatedBounds(ForceInit);

		FDrawingPolicyLink& DrawingPolicyLink = *DrawingPolicyIt;

		const int32 NumElements = DrawingPolicyLink.Elements.Num();
		if (NumElements > 0)
		{
			AccumulatedBounds = DrawingPolicyLink.Elements[0].Bounds;
		}

		for (int32 ElementIndex = 1; ElementIndex < NumElements; ElementIndex++)
		{
			AccumulatedBounds = AccumulatedBounds + DrawingPolicyLink.Elements[ElementIndex].Bounds;
		}

		DrawingPolicyIt->CachedBoundingSphere = AccumulatedBounds.GetSphere();
	}

	OrderedDrawingPolicies.Sort(TCompareStaticMeshDrawList<DrawingPolicyType>(&DrawingPolicySet, ViewPosition));
}

Now we should be able to sort the draw lists asynchronously. To do that we will need to replace the code in FDeferredShadingSceneRenderer::SortBasePassStaticData() to kick off asynchronous calls to TStaticMeshDrawList::SortFrontToBack(), and it will have to return some kind of container that can be used to determine whether those tasks have finished. That way we can issue the tasks and avoid waiting for them on the rendering thread unless we reach a point in the code where we need them done (and hopefully by then they are done already). Unfortunately Unreal Engine 4.6 doesn’t support futures and other features that would make it easy to launch these functions asynchronously. It is my understanding that they will come in Unreal Engine 4.8, but meanwhile we will have to write a standard task. Let’s do that.

For the sorting task we will sort a single draw list per task, and whoever needs to sort multiple draw lists will create multiple tasks. The task only needs two parameters: the draw list to sort and the view position to sort from. Its only purpose is to call the SortFrontToBack() function of that draw list with the view position as a parameter, nothing else. Let’s implement that:

template<typename StaticMeshDrawList>
class FSortFrontToBackTask
{
private:
	StaticMeshDrawList * const StaticMeshDrawListToSort;
	const FVector ViewPosition;

public:
	FSortFrontToBackTask(StaticMeshDrawList * const InStaticMeshDrawListToSort, const FVector InViewPosition)
		: StaticMeshDrawListToSort(InStaticMeshDrawListToSort)
		, ViewPosition(InViewPosition)
	{

	}

	FORCEINLINE TStatId GetStatId() const
	{
		RETURN_QUICK_DECLARE_CYCLE_STAT(FSortFrontToBackTask, STATGROUP_TaskGraphTasks);
	}

	ENamedThreads::Type GetDesiredThread()
	{
		return ENamedThreads::AnyThread;
	}

	static ESubsequentsMode::Type GetSubsequentsMode() { return ESubsequentsMode::TrackSubsequents; }

	void DoTask(ENamedThreads::Type CurrentThread, const FGraphEventRef& MyCompletionGraphEvent)
	{
		StaticMeshDrawListToSort->SortFrontToBack(ViewPosition);
	}
};

That’s a bit of boilerplate code that won’t be necessary once futures are supported in Unreal Engine 4.8. Anyway, now we need a function to dispatch those tasks. Since I like to keep the single-threaded and multi-threaded implementations available for debugging, I decided to write a new function called FDeferredShadingSceneRenderer::AsyncSortBasePassStaticData(). Let’s implement that:

void FDeferredShadingSceneRenderer::AsyncSortBasePassStaticData(const FVector &InViewPosition, FGraphEventArray &OutSortEvents)
{
	// If we're not using a depth only pass, sort the static draw list buckets roughly front to back, to maximize HiZ culling
	// Note that this is only a very rough sort, since it does not interfere with state sorting, and each list is sorted separately
	if (EarlyZPassMode == DDM_None)
	{
		for (int32 DrawType = 0; DrawType < FScene::EBasePass_MAX; DrawType++)
		{
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<FNoLightMapPolicy> > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassNoLightMapDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<FSimpleDynamicLightingPolicy> > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassSimpleDynamicLightingDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<FCachedVolumeIndirectLightingPolicy> > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassCachedVolumeIndirectLightingDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<FCachedPointIndirectLightingPolicy> > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassCachedPointIndirectLightingDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<TLightMapPolicy<HQ_LIGHTMAP> > > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassHighQualityLightMapDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<TDistanceFieldShadowsAndLightMapPolicy<HQ_LIGHTMAP> > > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassDistanceFieldShadowMapLightMapDrawList[DrawType]), InViewPosition));
			OutSortEvents.Add(TGraphTask<FSortFrontToBackTask<TStaticMeshDrawList<TBasePassDrawingPolicy<TLightMapPolicy<LQ_LIGHTMAP> > > > >::CreateTask(
				nullptr, ENamedThreads::AnyThread).ConstructAndDispatchWhenReady(&(Scene->BasePassLowQualityLightMapDrawList[DrawType]), InViewPosition));
		}
	}
}

With that we have dispatched all the sorting tasks and OutSortEvents can be used to determine if all the tasks are done. So let’s go back to FDeferredShadingSceneRenderer::InitViews() and implement the support for asynchronous sorting:

#if PZ_ASYNC_FRONT_TO_BACK_SORT
	FGraphEventArray SortEvents;
	AsyncSortBasePassStaticData(AverageViewPosition, SortEvents);
#else
	SortBasePassStaticData(AverageViewPosition);
#endif // PZ_ASYNC_FRONT_TO_BACK_SORT

That way we get to keep both approaches should you have to debug an issue with either. At this point the sort tasks have been launched, but we can’t assume they will be done before we leave the function, so let’s check for that and make sure they are:

#if PZ_ASYNC_FRONT_TO_BACK_SORT
	if (SortEvents.Num())
	{
		FTaskGraphInterface::Get().WaitUntilTasksComplete(SortEvents, ENamedThreads::RenderThread);
	}
#endif // PZ_ASYNC_FRONT_TO_BACK_SORT

In that way we halt execution on the rendering thread until the tasks are done, but hopefully they should be done by then.

So now the question is, did this change actually improve performance? Let’s compare the basic data capturing the same frames as before:

AsyncSortSummaryComparison

All positive numbers show performance improvements. Before-optimization numbers are to the left, after-optimization to the right. Blue bars in the histogram show the frames before optimization, orange shows them after optimization.

With VTune configured to treat frames that took more than 20ms as slow, and less than 11ms as fast, it is clearly visible that we have more frames with good or fast performance and fewer frames with slow performance. Overall the frames moved to the right in the histogram, showing that this definitely improved performance. Let’s look at the work on the different threads:

AsyncSortThreadsBefore

Before

AsyncSortThreadsAfter

After

The render thread is highlighted in both profiles. It is visible that the work on the rendering thread was reduced considerably, while the work on the threads that consume tasks increased. So shifting the work from the render thread to the task threads, without spending time waiting for those tasks to finish, improved performance.

Going wide isn’t always the answer, especially when the code you need to optimize was written to be single-threaded. But as programmers we need to get better at doing work asynchronously and to design for multithreading from the start, because this is a clear case where it is beneficial to go wide.

If you would like to see the whole code change and have access to the Unreal Engine 4 github then you can check the whole change here: https://github.com/EpicGames/UnrealEngine/pull/820

Optimizing Unreal Engine 4 asset cooking. CRC.

As I was working on the GPU particle system in Unreal Engine 4, I had to build my own scene. The process of baking the raw assets to be properly consumed by the engine is called “cooking” in Unreal Engine 4’s lingo. I found myself cooking assets quite often, and one thing I noticed was that the cooking process took a while just to determine which assets needed to be cooked. The reason I noticed was that I would try to cook the assets without actually having saved my changes. While this may not seem like a big deal, it is especially annoying for me since I had a pretty bad experience with the build system at Dreamworks R&D: that build system would take more than 3 minutes to determine that nothing had to be built, and I’m talking about code, not assets. So I decided to look at this. As usual I will do all the profiling on my machine:

MachineSpecs

What I did was cook my simple particles scene (which you can see screenshots of here) once, and then cook it again with the profiler running. That way I ensure that the cooking process only runs through the work of verifying what needs to be done, and it should determine that there isn’t any asset to build. After doing that, here are the results:

MemCrc_Slow

Since this is an asset baking process it is fairly normal to spend a lot of time dealing with memory, so I wasn’t too surprised about the CPU time spent on memset and memcpy. That’s something I will try to optimize later, and I know it will take some work. The memset in particular is caused by the editor build, which fills memory with a debug value (0xCD) on any new allocation and before freeing an allocation. But the call to FCrc::MemCrc_DEPRECATED() seemed like low-hanging fruit since that’s a pretty self-contained process. Let’s look at the code, which is also available here:

/**
 * DEPRECATED
 * These tables and functions are deprecated because they're using tables and implementations
 * which give values different from what a user of a typical CRC32 algorithm might expect.
 */

uint32 FCrc::MemCrc_DEPRECATED(const void* InData, int32 Length, uint32 CRC/*=0 */)
{
    // Based on the Slicing-by-8 implementation found here:
    // http://slicing-by-8.sourceforge.net/

    CRC = ~BYTESWAP_ORDER32(CRC);

    const uint8* __restrict Data = (uint8*)InData;

    // First we need to align to 32-bits
    int32 InitBytes = Align(Data, 4) - Data;

    if (Length > InitBytes)
    {
        Length -= InitBytes;

        for (; InitBytes; --InitBytes)
        {
            CRC = (CRC >> 8) ^ CRCTablesSB8_DEPRECATED[0][(CRC & 0xFF) ^ *Data++];
        }

        auto Data4 = (const uint32*)Data;
        for (uint32 Repeat = Length / 8; Repeat; --Repeat)
        {
            uint32 V1 = *Data4++ ^ CRC;
            uint32 V2 = *Data4++;
            CRC =
                CRCTablesSB8_DEPRECATED[7][ V1 & 0xFF] ^
                CRCTablesSB8_DEPRECATED[6][(V1 >> 8) & 0xFF] ^
                CRCTablesSB8_DEPRECATED[5][(V1 >> 16) & 0xFF] ^
                CRCTablesSB8_DEPRECATED[4][ V1 >> 24 ] ^
                CRCTablesSB8_DEPRECATED[3][ V2 & 0xFF] ^
                CRCTablesSB8_DEPRECATED[2][(V2 >> 8) & 0xFF] ^
                CRCTablesSB8_DEPRECATED[1][(V2 >> 16) & 0xFF] ^
                CRCTablesSB8_DEPRECATED[0][ V2 >> 24 ];
        }
        Data = (const uint8*)Data4;

        Length %= 8;
    }

    for (; Length; --Length)
    {
        CRC = (CRC >> 8) ^ CRCTablesSB8_DEPRECATED[0][(CRC & 0xFF) ^ *Data++];
    }

    return BYTESWAP_ORDER32(~CRC);
}

As seen in the code, this is based on the slicing-by-8 version of CRC32, which is way faster than the standard implementation. The reason is that it allows for greater instruction-level parallelism: in the standard implementation each table lookup must be completed before computing the next index, while here 8 table lookups can potentially be done in parallel. For precise details, a paper was written about the algorithm here. So that’s fairly optimal already.
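
For contrast, here is a minimal sketch of the classic byte-at-a-time, table-driven CRC32 loop (the 256-entry table is assumed to be precomputed elsewhere). Each iteration needs the updated CRC before it can even compute the next table index, so the lookups form a serial chain; that is exactly the dependency slicing-by-8 breaks up.

#include <cstdint>

// Classic table-driven CRC32: one dependent table lookup per input byte.
static uint32_t MemCrc32_Simple(const void* InData, int32_t Length, uint32_t CRC, const uint32_t CRCTable[256])
{
	const uint8_t* Data = (const uint8_t*)InData;
	CRC = ~CRC;
	for (int32_t Index = 0; Index < Length; ++Index)
	{
		// The next table index depends on the CRC produced by the previous
		// iteration, so there is no instruction-level parallelism to exploit.
		CRC = (CRC >> 8) ^ CRCTable[(CRC ^ Data[Index]) & 0xFF];
	}
	return ~CRC;
}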

But what are the alternatives? The easiest one would be to use SSE 4.2, which includes a CRC32 instruction. While there is a good speedup as shown here, that wouldn’t work as a general PC solution because SSE 4.2 became available with the Nehalem architecture, and I don’t think AMD supported it before Bulldozer. So the other, more feasible alternative is to get rid of CRC32 and move to some other checksum algorithm.
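
Just to illustrate what the rejected SSE 4.2 path looks like, here is a sketch using the real _mm_crc32_* intrinsics. Note that they compute CRC-32C (the Castagnoli polynomial), so the results would not match the existing tables anyway; this is only an illustration of the idea, not something that was integrated.

#include <cstdint>
#include <cstring>
#include <nmmintrin.h> // SSE 4.2 intrinsics

// CRC-32C over a buffer using the SSE 4.2 CRC32 instruction, 8 bytes at a time.
static uint32_t MemCrc32C_SSE42(const void* InData, int32_t Length, uint32_t CRC)
{
	const uint8_t* Data = (const uint8_t*)InData;
	uint64_t CRC64 = ~CRC;

	while (Length >= 8)
	{
		uint64_t Chunk;
		memcpy(&Chunk, Data, sizeof(Chunk)); // unaligned-safe load
		CRC64 = _mm_crc32_u64(CRC64, Chunk);
		Data += 8;
		Length -= 8;
	}

	uint32_t CRC32 = (uint32_t)CRC64;
	while (Length--)
	{
		CRC32 = _mm_crc32_u8(CRC32, *Data++);
	}
	return ~CRC32;
}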

There are many alternatives for generating hashes. In this case we need to generate checksums; there is no need for a cryptographic hash function, even though Unreal Engine 4 already provides SHA-1. Some of the alternatives are FNV, CityHash, SpookyHash, MurmurHash, etc. They are all valid, and Charles Bloom, who works on Oodle, made a blog post covering the performance and properties of some of those algorithms, so I won’t attempt to cover that. But I dug deeper and came across xxHash, written by Yann Collet, who is also the author of the LZ4 compression algorithm. The performance shown by xxHash, the fact that it is used by LZ4, and the permissive license (New BSD License) made me make up my mind about attempting to integrate it.
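
To give an idea of how small the integration surface is, here is a sketch of a drop-in-style wrapper around xxHash’s 32-bit hash. XXH32 is the library’s actual entry point, but the wrapper name is mine and the seed choice is arbitrary; the real cost of the change is elsewhere, in the call sites and in the invalidation of previously cooked data.

#include <cstdint>
#include "xxhash.h" // xxHash library by Yann Collet

// Replacement for a CRC32-style memory checksum, using xxHash32.
// Note: the values are completely different from CRC32, so every
// previously cached/cooked asset key becomes invalid.
static uint32_t MemHash32_xxHash(const void* InData, int32_t Length, uint32_t Seed = 0)
{
	return XXH32(InData, (size_t)Length, Seed);
}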

But not everything is that simple. While xxHash is extremely easy to integrate (like the stb libraries written by Sean Barrett), it has a considerable impact on the engine. The fact that all the hashes will change implies that all assets have to be rebuilt. This may not be a huge issue for me because I just have a few simple scenes, but it has a big impact on teams working on AAA games. From my point of view it is worth biting the bullet if the cooking time is reduced afterwards. So, what happens when we replace CRC32 with xxHash in terms of performance?

MemCrc_Fast

That’s a 3x improvement over the old code. In terms of wall-clock time there is almost a 20% improvement over the old version. I think that justifies re-cooking the assets.

Anyway, this in particular may not make it into the master branch of Unreal Engine 4 because it does pull in a new dependency (xxHash) and I have no idea if that’s acceptable. Still, I think it is a meaningful change that some people may want to integrate.

The lesson this time is to think outside the box. If you have seen my past posts you may have thought I would take the path of optimizing throughput by helping the compiler or looking at the number of uops. But instead I realized that since this is a pretty self-contained problem, there had to be faster alternatives to CRC32 that fit the needs of the engine.

Unreal Engine 4 GPU particle system micro-optimizations. Part 3.

On part one of the series I made an optimization to BuildParticleVertexBuffer which got rid of an unnecessary check when setting the half precision floats that hold the indices. On part two I made a separate optimization, and in the process I made a stress test to help me understand the issues. It’s now time to go back to BuildParticleVertexBuffer since it was the biggest hotspot in the stress test. As usual I will do all the profiling on my machine:

MachineSpecs

I profiled 600 frames where the player was positioned in a corner pointing directly towards the emitters in the scene. Read part two of the series to see a couple of screenshots of the scene.

BuildParticleVertexBuffer-Part3-Slow

Compared to everything else, BuildParticleVertexBuffer is by far the biggest hotspot and something that definitely needs to be improved. What makes matters even worse is that this happens on the render thread, which is critical as mentioned on part two of the series. So let’s look at the code of the function.

static void BuildParticleVertexBuffer( FVertexBufferRHIParamRef VertexBufferRHI, const TArray<uint32>& InTiles )
{
	const int32 TileCount = InTiles.Num();
	const int32 IndexCount = TileCount * GParticlesPerTile;
	const int32 BufferSize = IndexCount * sizeof(FParticleIndex);
	const int32 Stride = 1;
	FParticleIndex* RESTRICT ParticleIndices = (FParticleIndex*)RHILockVertexBuffer( VertexBufferRHI, 0, BufferSize, RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = InTiles[Index];
		const FVector2D TileOffset( FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ), FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY ) );
		for ( int32 ParticleY = 0; ParticleY < GParticleSimulationTileSize; ++ParticleY )
		{
			for ( int32 ParticleX = 0; ParticleX < GParticleSimulationTileSize; ++ParticleX )
			{
				const float IndexX = TileOffset.X + ((float)ParticleX / (float)GParticleSimulationTextureSizeX) + (0.5f / (float)GParticleSimulationTextureSizeX);
				const float IndexY = TileOffset.Y + ((float)ParticleY / (float)GParticleSimulationTextureSizeY) + (0.5f / (float)GParticleSimulationTextureSizeY);
				// on some platforms, union and/or bitfield writes to Locked memory are really slow, so use a forced int write instead
				// and in fact one 32-bit write is faster than two uint16 writes (i.e. using .Encoded)
  				FParticleIndex Temp;
  				Temp.X = IndexX;
  				Temp.Y = IndexY;
  				*(uint32*)ParticleIndices = *(uint32*)&Temp;
				// move to next particle
				ParticleIndices += Stride;
			}
		}
	}
	RHIUnlockVertexBuffer( VertexBufferRHI );
}

As you can see, there is a specific comment that mentions that on some platforms writing unions and/or bitfields to locked memory is really slow, and a forced integer write is faster instead. But what are those platforms? Does it make sense to do that for all platforms? I don’t know which exact platforms whoever wrote this code was referring to (one of the downsides of only having access to Unreal Engine 4 on GitHub and not on Epic’s Perforce server is that you can’t use something like the time-lapse view in Perforce to see when and by whom this was written). If anybody has any specific information about that please comment or let me know. Anyway, I decided to make a single change: I would get rid of the temporary FParticleIndex variable used to write the floats IndexX and IndexY, which was then written as a uint32. Instead I would use SetNoChecks() from the previous part in the series and set the floats directly. That simplifies the code but doesn’t necessarily do the same for the assembly. So here is the code:

static void BuildParticleVertexBuffer( FVertexBufferRHIParamRef VertexBufferRHI, const TArray<uint32>& InTiles )
{
	const int32 TileCount = InTiles.Num();
	const int32 IndexCount = TileCount * GParticlesPerTile;
	const int32 BufferSize = IndexCount * sizeof(FParticleIndex);
	FParticleIndex* RESTRICT ParticleIndices = (FParticleIndex*)RHILockVertexBuffer( VertexBufferRHI, 0, BufferSize, RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = InTiles[Index];
		const FVector2D TileOffset( FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ), FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY ) );
		for ( int32 ParticleY = 0; ParticleY < GParticleSimulationTileSize; ++ParticleY )
		{
			for ( int32 ParticleX = 0; ParticleX < GParticleSimulationTileSize; ++ParticleX )
			{
				const float IndexX = TileOffset.X + ((float)ParticleX / (float)GParticleSimulationTextureSizeX) + (0.5f / (float)GParticleSimulationTextureSizeX);
				const float IndexY = TileOffset.Y + ((float)ParticleY / (float)GParticleSimulationTextureSizeY) + (0.5f / (float)GParticleSimulationTextureSizeY);

#if PLATFORM_WINDOWS
				ParticleIndices->X.SetNoChecks(IndexX);
				ParticleIndices->Y.SetNoChecks(IndexY);

				++ParticleIndices;
#else
				const int32 Stride = 1;
				// on some platforms, union and/or bitfield writes to Locked memory are really slow, so use a forced int write instead
				// and in fact one 32-bit write is faster than two uint16 writes (i.e. using .Encoded)
  				FParticleIndex Temp;
  				Temp.X = IndexX;
  				Temp.Y = IndexY;
  				*(uint32*)ParticleIndices = *(uint32*)&Temp;
				// move to next particle
				ParticleIndices += Stride;
#endif // PLATFORM_WINDOWS
			}
		}
	}
	RHIUnlockVertexBuffer( VertexBufferRHI );
}

I decided to see what Intel Architecture Code Analyzer had to say in terms of latency.

BuildParticleVertexBuffer-Part3-Latency

That doesn’t look good: the estimates said that the latency went from 53 cycles to 68 cycles, mostly due to pressure on port 1. But since those are estimates, it is critical to actually run the code and profile it. These are the results:

BuildParticleVertexBuffer-Part3-Fast

With that simple change I managed to cut the CPU time of the top hotspot in half and get the CPI (cycles per instruction retired) rate to go from 0.704 to 0.417. There are a couple of lessons here. The first one is to never rely 100 percent on static analysis tools. They are useful, but when it comes to performance the only proper way to measure is by profiling at runtime. The other lesson is that you should validate platform assumptions. You cannot make the end user pay for wrong platform generalizations. Make sure that the assumptions are correct, and write specialized code for each platform if necessary. Do not forget that as programmers, at the end of the day we are paid to deliver a good experience. We are not paid to write generalized code that only fits the lowest common denominator; after all, the end user doesn’t even care about our code.

Unreal Engine 4 GPU particle system micro-optimizations. Part 2.

On the previous part of the series I showed a micro-optimization to the GPU particle system’s BuildParticleVertexBuffer function. As part of the verification process, and to understand how the system worked, I created a scene of a hollow box with 20 GPU particle emitters. I gave them a basic behavior in Cascade and nothing else. Here are two screenshots:

ParticlesClose

ParticlesFar

Using that scene I decided to profile the system. As usual I did this on my machine:

MachineSpecs

This time I profiled 600 frames where the player was positioned in a corner pointing directly towards the emitters.

BuildTileVertexBuffer-Profile

The first hotspot is BuildParticleVertexBuffer, but that’s something I covered on the previous part of the series. The next one is BuildTileVertexBuffer. Now you may ask what’s the point of optimizing something that doesn’t seem to take that much CPU time. The answer is in the next screenshot.

BuildTileVertexBuffer-Callstack

This function is being called from the render thread, which given the current implementation of Unreal Engine 4 is a considerable deal for performance. For the time being a lot of the rendering work happens on a single rendering thread, so that thread can set the wall time for the frame since there is a lot of work to be done on it. To prove that, see how work is distributed on the different threads:

BuildTileVertexBuffer-Threads

0x1F98 is the thread id for the rendering thread, and 0x1B60 is the thread id for the main thread. It is clearly visible that performance is dominated by the work done on the rendering thread. Thankfully Epic is working on parallel rendering, which will alleviate the issue, but for the time being it is critical to optimize the render thread.

With that out of the way let’s look at the code:

/**
 * Builds a vertex buffer containing the offsets for a set of tiles.
 * @param TileOffsetsRef - The vertex buffer to fill. Must be at least TileCount * sizeof(FVector4) in size.
 * @param Tiles - The tiles which will be drawn.
 * @param TileCount - The number of tiles in the array.
 */
static void BuildTileVertexBuffer( FParticleBufferParamRef TileOffsetsRef, const uint32* Tiles, int32 TileCount )
{
	int32 Index;
	const int32 AlignedTileCount = ComputeAlignedTileCount(TileCount);
	FVector2D* TileOffset = (FVector2D*)RHILockVertexBuffer( TileOffsetsRef, 0, AlignedTileCount * sizeof(FVector2D), RLM_WriteOnly );
	for ( Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = Tiles[Index];
		TileOffset[Index] = FVector2D(
			FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ),
			FMath::Fractional( FMath::TruncFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY )
									  );
	}
	for ( ; Index < AlignedTileCount; ++Index )
	{
		TileOffset[Index] = FVector2D(100.0f, 100.0f);
	}
	RHIUnlockVertexBuffer( TileOffsetsRef );
}

In this function, at first sight there isn’t anything terribly wrong with the exception of the use of the Index variable. For no good reason a dependency is created between the two loops when in fact they are completely independent; each loop could be controlled independently. Let’s rewrite it:

static void BuildTileVertexBuffer( FParticleBufferParamRef TileOffsetsRef, const uint32* Tiles, int32 TileCount )
{
	const int32 AlignedTileCount = ComputeAlignedTileCount(TileCount);
	FVector2D* TileOffset = (FVector2D*)RHILockVertexBuffer( TileOffsetsRef, 0, AlignedTileCount * sizeof(FVector2D), RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = Tiles[Index];
		TileOffset[Index] = FVector2D(
			FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ),
			FMath::Fractional( FMath::TruncFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY )
									  );
	}
	for ( int32 Index = TileCount; Index < AlignedTileCount; ++Index )
	{
		TileOffset[Index] = FVector2D(100.0f, 100.0f);
	}
	RHIUnlockVertexBuffer( TileOffsetsRef );
}

But shouldn’t the Visual Studio compiler realize that? Let’s look at the assembly with Intel Architecture Code Analyzer. The old code is on the left, and the new code is on the right.

BuildTileVertexBuffer-FirstOpt

The change got rid of two move instructions and the number of uops (micro-ops) went from 48 to 46. But we shouldn’t expect a huge improvement in performance from just that; we need to reduce the number of micro-ops further and hopefully also improve the instruction-level parallelism. Looking at the code, in the top loop there isn’t any specific need to construct an FVector2D and assign it to the current tile offset. I could just write the X and Y components directly as two independent operations, and I could do the same for the loop below. Here is the new code:

static void BuildTileVertexBuffer( FParticleBufferParamRef TileOffsetsRef, const uint32* Tiles, int32 TileCount )
{
	const int32 AlignedTileCount = ComputeAlignedTileCount(TileCount);
	FVector2D* TileOffset = (FVector2D*)RHILockVertexBuffer( TileOffsetsRef, 0, AlignedTileCount * sizeof(FVector2D), RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = Tiles[Index];
		TileOffset[Index].X = FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX );
		TileOffset[Index].Y = FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY );
	}
	for ( int32 Index = TileCount; Index < AlignedTileCount; ++Index )
	{
		TileOffset[Index].X = 100.0f;
		TileOffset[Index].Y = 100.0f;
	}
	RHIUnlockVertexBuffer( TileOffsetsRef );
}

Would that make any improvements at all? Let’s look at it with Intel Architecture Code Analyzer. The old code on the left, the new code on the right.

BuildTileVertexBuffer-SecondOpt

Now we went from 48 uops to 41. Our optimization looks good from the static analysis, but let’s profile it.

BuildTileVertexBuffer-OptProfile

That cut the CPU time in half, which is good for the rendering thread.

I think the main lesson here is that the compiler may not optimize everything even if you think it should. Compilers in general are pretty limited in terms of optimizations, so as programmers we need to be aware of their limitations and how to work around them (which might involve trial and error, and looking at the disassembly directly or with some static analysis tool).

Unreal Engine 4 GPU particle system micro-optimizations. Part 1.

This time I’m going to go over the very first two optimizations I made, which just got integrated into Unreal Engine 4. Hopefully they will ship with version 4.4. When I first got the Unreal Engine 4 license, the most impressive demo available at the time was the Effects Cave demo. The demo looks like this:

So I started profiling the first 120 frames of the demo to get a sense of how CPU performance was being spent. As usual I profiled on my machine, which has the following specs:

MachineSpecs

Below is the result of the profiling run. Keep in mind that this was done on Unreal Engine 4.1:

BuildParticleVertexBuffer-Slow

It isn’t that often that you get to profile something that is such an obvious hotspot. Here the issue is pretty clear in terms of performance, I had to focus on BuildParticleVertexBuffer.

To make sense of this it is good to see how the GPU particle system works. The GPU particle system, unlike the traditional CPU particle system, allows a high particle count to be rendered efficiently, at the cost of some flexibility. The actual emission of the particles still happens on the CPU, but once they are emitted the rest of the process happens on the GPU. The GPU particle system is implemented with tiles (4×4 tiles by default) in a set of textures (1024×1024 by default). There are two textures for position data and one for velocity, and those textures are double buffered. They are indexed by a vertex buffer where each particle’s index is stored as two half-precision floats. The particles are collided against the information in the depth buffer. In particular, BuildParticleVertexBuffer creates the vertex buffer used to store the indices for the particles for a given set of tiles and fills it with data. Let’s look at how this is done:

static void BuildParticleVertexBuffer( FVertexBufferRHIParamRef VertexBufferRHI, const TArray<uint32>& InTiles )
{
	const int32 TileCount = InTiles.Num();
	const int32 IndexCount = TileCount * GParticlesPerTile;
	const int32 BufferSize = IndexCount * sizeof(FParticleIndex);
	const int32 Stride = 1;
	FParticleIndex* RESTRICT ParticleIndices = (FParticleIndex*)RHILockVertexBuffer( VertexBufferRHI, 0, BufferSize, RLM_WriteOnly );
	for ( int32 Index = 0; Index < TileCount; ++Index )
	{
		const uint32 TileIndex = InTiles[Index];
		const FVector2D TileOffset( FMath::Fractional( (float)TileIndex / (float)GParticleSimulationTileCountX ), FMath::Fractional( FMath::TruncToFloat( (float)TileIndex / (float)GParticleSimulationTileCountX ) / (float)GParticleSimulationTileCountY ) );
		for ( int32 ParticleY = 0; ParticleY < GParticleSimulationTileSize; ++ParticleY )
		{
			for ( int32 ParticleX = 0; ParticleX < GParticleSimulationTileSize; ++ParticleX )
			{
				const float IndexX = TileOffset.X + ((float)ParticleX / (float)GParticleSimulationTextureSizeX) + (0.5f / (float)GParticleSimulationTextureSizeX);
				const float IndexY = TileOffset.Y + ((float)ParticleY / (float)GParticleSimulationTextureSizeY) + (0.5f / (float)GParticleSimulationTextureSizeY);
				// on some platforms, union and/or bitfield writes to Locked memory are really slow, so use a forced int write instead
				// and in fact one 32-bit write is faster than two uint16 writes (i.e. using .Encoded)
  				FParticleIndex Temp;
  				Temp.X = IndexX;
  				Temp.Y = IndexY;
  				*(uint32*)ParticleIndices = *(uint32*)&Temp;
				// move to next particle
				ParticleIndices += Stride;
			}
		}
	}
	RHIUnlockVertexBuffer( VertexBufferRHI );
}

As you can see there isn’t anything really suspicious about it; we are just generating the coordinates that index a particle. The compiler already generated fairly efficient code in terms of the input since most of the parameters are set at compile time. I decided to look at the assembly and throughput with the Intel Architecture Code Analyzer. Here was part of the output:

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - C:\dev\UnrealEngine\master\Engine\Binaries\Win64\UE4Client-Win64-Test.exe
Binary Format - 64Bit
Architecture  - SNB
Analysis Type - Throughput

*******************************************************************
Intel(R) Architecture Code Analyzer Mark Number 1
*******************************************************************

Throughput Analysis Report
--------------------------
Block Throughput: 31.55 Cycles       Throughput Bottleneck: FrontEnd, Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 27.4    0.0 | 27.4 | 12.5   6.0  | 12.5   6.0  | 13.0 | 31.2 |
-------------------------------------------------------------------------

What surprised me in the assembly was seeing comparisons with constants that weren’t clearly visible in the source code. Obviously I was missing something. Looking at the code there wasn’t anything immediately obvious that would cause comparisons with constants other than those related to the GPU particle simulation parameters, such as the size of the textures. But then I noticed something particular in the code related to the type of the data. As I mentioned previously, the indexing data are two half-precision floats. They are defined in FParticleIndex:

/**
 * Per-particle information stored in a vertex buffer for drawing GPU sprites.
 */
struct FParticleIndex
{
	/** The X coordinate of the particle within the texture. */
	FFloat16 X;
	/** The Y coordinate of the particle within the texture. */
	FFloat16 Y;
};

But as you can see from the function, the values assigned to the temporary FParticleIndex are two single-precision floats. No explicit cast was done in the function, so it had to be done on assignment. That’s why I decided to see how that assignment worked, and I came to this function, which is called from the assignment operator.

FORCEINLINE void FFloat16::Set( float FP32Value )
{
	FFloat32 FP32(FP32Value);

	// Copy sign-bit
	Components.Sign = FP32.Components.Sign;

	// Check for zero, denormal or too small value.
	if ( FP32.Components.Exponent <= 112 )			// Too small exponent? (0+127-15)
	{
		// Set to 0.
		Components.Exponent = 0;
		Components.Mantissa = 0;
	}
	// Check for INF or NaN, or too high value
	else if ( FP32.Components.Exponent >= 143 )		// Too large exponent? (31+127-15)
	{
		// Set to 65504.0 (max value)
		Components.Exponent = 30;
		Components.Mantissa = 1023;
	}
	// Handle normal number.
	else
	{
		Components.Exponent = int32(FP32.Components.Exponent) - 127 + 15;
		Components.Mantissa = uint16(FP32.Components.Mantissa >> 13);
	}
}

As you can see this function takes care of single precision floats that are out of range of a half precision float, flushing values that are too small to zero and clamping values that are too large to 65504. Those were the comparisons I was seeing. But is this meaningful in any way for us? Of course! Thinking about the data is critical. And when you do that in this case, there is no need for those checks: our indexing data, which is just normalized texture coordinates, fits perfectly within a half precision float, so we should be assigning the values directly. To do that I wrote the following function:

FORCEINLINE void FFloat16::SetNoChecks( const float FP32Value )
{
	const FFloat32 FP32(FP32Value);

	// Make absolutely sure that you never pass in a single precision floating
	// point value that may actually need the checks. If you are not 100% sure
	// of that just use Set().

	Components.Sign = FP32.Components.Sign;
	Components.Exponent = int32(FP32.Components.Exponent) - 127 + 15;
	Components.Mantissa = uint16(FP32.Components.Mantissa >> 13);
}
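
For completeness, this is roughly how the inner loop shown at the top changes; the only difference is calling SetNoChecks() explicitly instead of going through the FFloat16 assignment operator (the exact call site is my reconstruction):

				// Sketch: write the half precision components directly, skipping the
				// range checks, since IndexX and IndexY are always valid texture coordinates.
				FParticleIndex Temp;
				Temp.X.SetNoChecks( IndexX );
				Temp.Y.SetNoChecks( IndexY );
				*(uint32*)ParticleIndices = *(uint32*)&Temp;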

Now that those checks are gone, performance should be better. Again I used the Intel Architecture Code Analyzer, and here is the output:

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - C:\dev\UnrealEngine\master\Engine\Binaries\Win64\UE4Client-Win64-Test.exe
Binary Format - 64Bit
Architecture  - SNB
Analysis Type - Throughput

*******************************************************************
Intel(R) Architecture Code Analyzer Mark Number 1
*******************************************************************

Throughput Analysis Report
--------------------------
Block Throughput: 23.40 Cycles       Throughput Bottleneck: Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 22.0    0.0 | 21.9 | 7.5    4.5  | 7.5    4.5  | 6.0  | 23.1 |
-------------------------------------------------------------------------

As you can see the throughput of the block improved from 31.55 to 23.40 cycles, and the front end is no longer a bottleneck. Time to profile and see if there is a difference at runtime.

BuildParticleVertexBuffer-Fast

That small change yielded a performance win of ~27% over the original code with ~45% fewer instructions retired. A good return for such a small change.

As with all the optimizations I have done so far, understanding the input and output data is what allowed me to determine how to optimize. I can’t stress enough how important it is to understand your data. If you don’t understand the data then the odds of improving performance are very low.

In the next part I will talk about another micro-optimization closely related to this one.

Optimizing Unreal Engine 4’s async file IO thread.

After the first post I made on the ToLower() optimization a lot of people rightfully wondered why I optimized the function itself and not the number of calls made to it. The answer is that I did both. While reducing the number of ToLower() calls would have made a significant difference on its own, it is also important for the code that does execute to be optimal, even if it isn’t executed very often. The sum of all these small inefficiencies is what causes performance issues, because it is hard to see the gradual decrease as the product evolves; this is what people refer to as "death by a thousand cuts". So I first optimized the function to set a minimum performance bar for each call, and then I went ahead and reduced the number of calls. Of course, depending on your time constraints you may set your priorities differently.

As usual I’m going to be profiling on my machine, which has the following specs:

MachineSpecs

So let’s begin by looking at the profile. Just like last time, this covers the first 1000 frames of the Elemental demo starting from frame two.

FAsyncIOSystemSlow

As you can see a lot of time is being spent in the ToLower() function, something I already optimized, but that change isn’t included in this profile because I profile each change on its own. The approach this time is to reduce the number of calls to the function. Looking at the callstacks I found out that a lot of the calls were coming from the AsyncIOSystem thread.

FAsyncIOSystemCallstack

As the name says, the AsyncIOSystem is an asynchronous IO system. It is a cross-platform, single-threaded runnable. What is a runnable in Unreal Engine 4, you ask? Here is the answer:

/**
 * Interface for "runnable" objects.
 *
 * A runnable object is an object that is "run" on an arbitrary thread. The call usage pattern is
 * Init(), Run(), Exit(). The thread that is going to "run" this object always uses those calling
 * semantics. It does this on the thread that is created so that any thread specific uses (TLS, etc.)
 * are available in the contexts of those calls. A "runnable" does all initialization in Init().
 *
 * If initialization fails, the thread stops execution and returns an error code. If it succeeds,
 * Run() is called where the real threaded work is done. Upon completion, Exit() is called to allow
 * correct clean up.
 */
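
To make the Init()/Run()/Exit() pattern concrete, here is a minimal sketch of a runnable. It assumes the standard FRunnable interface and FRunnableThread::Create(); the class and member names are mine:

// Minimal sketch of a runnable: Init() runs first on the new thread, Run()
// does the threaded work until a stop is requested, and Exit() cleans up.
class FMyRunnable : public FRunnable
{
public:
	virtual bool Init() override
	{
		bStopRequested = false;
		return true;	// returning false would stop the thread with an error
	}

	virtual uint32 Run() override
	{
		while (!bStopRequested)
		{
			// ... do the actual threaded work here ...
		}
		return 0;
	}

	virtual void Stop() override
	{
		bStopRequested = true;
	}

	virtual void Exit() override
	{
		// clean up anything created in Init()
	}

private:
	FThreadSafeBool bStopRequested;
};

// The thread that "runs" the object is created separately, e.g.:
// FRunnableThread* Thread = FRunnableThread::Create( new FMyRunnable(), TEXT("MyRunnable") );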

New IO requests can be queued up from any thread, and those requests can be cancelled as well. Whenever a new request is sent the following data is stored, in the given order:

Type                | Name                                 | Description
uint64              | RequestIndex                         | Index of the request.
int32               | FileSortKey                          | NOT USED. File sort key on media, INDEX_NONE if not supported or unknown.
FString             | FileName                             | Name of file.
int64               | Offset                               | Offset into file.
int64               | Size                                 | Size in bytes of data to read.
int64               | UncompressedSize                     | Uncompressed size in bytes of original data, 0 if data is not compressed on disc.
void*               | Dest                                 | Pointer to memory region used to read data into.
ECompressionFlags   | CompressionFlags                     | Flags for controlling decompression. Valid flags are no compression, zlib compression, and specific flags related to zlib compression.
FThreadSafeCounter* | Counter                              | Thread safe counter that is decremented once work is done.
EAsyncIOPriority    | Priority                             | Priority of request. Allows minimum, low, below normal, normal, high, and maximum priority.
uint32 : 1          | bIsDestroyHandleRequest              | True if this is a request to destroy the handle.
uint32 : 1          | bHasAlreadyRequestedHandleToBeCached | True if the caching of the handle was already requested.
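
Put together, the per-request data corresponds roughly to a struct like the following. This is only a sketch reconstructed from the table above, not the engine’s exact declaration (the real type is FAsyncIORequest):

// Rough sketch of the per-request data described above (field order preserved).
struct FAsyncIORequestSketch
{
	uint64				RequestIndex;
	int32				FileSortKey;		// NOT USED
	FString				FileName;
	int64				Offset;
	int64				Size;
	int64				UncompressedSize;
	void*				Dest;
	ECompressionFlags	CompressionFlags;
	FThreadSafeCounter*	Counter;
	EAsyncIOPriority	Priority;
	uint32				bIsDestroyHandleRequest : 1;
	uint32				bHasAlreadyRequestedHandleToBeCached : 1;
};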

The most obvious culprit of the performance drain must be the FileName, so let’s look at where it is used:

FAsyncIOSystemFileNameRefs
The use in the Tick function caught my attention right away; it lined up with the callstack shown above. The Tick function of FAsyncIOSystem does the following things in order (a rough sketch follows the list):

  • Create file handles for the outstanding requests.
  • Create a copy of the next request that should be handled, chosen by platform-specific criteria such as the layout on disc or the highest set priority.
  • Fulfill the pending request.
    • This can be a request to destroy the handle, or a request that retrieves/creates a handle to fulfill a compressed or uncompressed read.
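
Here is that flow as a rough sketch; every helper below is a hypothetical stand-in for the engine’s real logic, named only to mirror the steps above:

// Hypothetical outline of the work done per tick, mirroring the steps above.
// The helpers are stubs; the real implementation lives in FAsyncIOSystemBase.
static void CreateHandlesForOutstandingRequests() { /* step 1 */ }
static bool CopyNextRequestToFulfill( FAsyncIORequestSketch& OutRequest ) { return false; /* step 2 */ }
static void DestroyCachedHandle( const FAsyncIORequestSketch& Request ) {}
static void FulfillRequest( const FAsyncIORequestSketch& Request ) { /* compressed or uncompressed read */ }

static void TickSketch()
{
	CreateHandlesForOutstandingRequests();			// 1. warm the handle cache
	FAsyncIORequestSketch Request;
	if (CopyNextRequestToFulfill( Request ))		// 2. platform-specific pick of the next request
	{
		if (Request.bIsDestroyHandleRequest)
		{
			DestroyCachedHandle( Request );			// 3a. drop the cached handle
		}
		else
		{
			FulfillRequest( Request );				// 3b. fulfill the read
		}
	}
}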

To determine whether a request handle needs to be created or can be retrieved from the cache, there is a name-to-handle map that maps the FString of the filename to an IFileHandle*. If the filename is in the map, there is a cached handle for it. The lookup is a FindRef() on the TMap, which calls FindId(); here is the code:

/**
 * Finds an element with the given key in the set.
 * @param Key - The key to search for.
 * @return The id of the set element matching the given key, or the NULL id if none matches.
 */
FSetElementId FindId(KeyInitType Key) const
{
	if(HashSize)
	{
		for(FSetElementId ElementId = GetTypedHash(KeyFuncs::GetKeyHash(Key));
			ElementId.IsValidId();
			ElementId = Elements[ElementId].HashNextId)
		{
			if(KeyFuncs::Matches(KeyFuncs::GetSetKey(Elements[ElementId].Value),Key))
			{
				// Return the first match, regardless of whether the set has multiple matches for the key or not.
				return ElementId;
			}
		}
	}
	return FSetElementId();
}

The ElementId is generated by the GetTypedHash() function, which for our key type generates a CRC32 hash; this is where all the time in FCrc::Strihash_DEPRECATED<wchar_t>() was being spent. And KeyFuncs::Matches() looks like this:

static FORCEINLINE bool Matches(KeyInitType A,KeyInitType B)
{
	return A == B;
}

While that looks pretty reasonable for fundamental integral types, for FStrings that comparison turns into a call to Stricmp, a case-insensitive lexicographical comparison. This is where the ToLower() call is made:

template <typename CharType>
static inline int32 Stricmp( const CharType* String1, const CharType* String2 )
{
	// walk the strings, comparing them case insensitively
	for (; *String1 || *String2; String1++, String2++)
	{
		CharType Char1 = TChar<CharType>::ToLower(*String1), Char2 = TChar<CharType>::ToLower(*String2);
		if (Char1 != Char2)
		{
			return Char1 - Char2;
		}
	}
	return 0;
}

So now we know what finding something in the cache implies, but how often does that happen? The answer is in the FAsyncIOSystemBase::Tick() function, which shows that it happens once per outstanding request, and then once more when a request is pending. I measured the number of requests made before the very first frame was rendered: 2096 requests were queued. Considering that the AsyncIOSystem thread runs at above normal priority and ticks often, the numbers add up pretty quickly. We need to fix this.

To fix this I took a rather simple approach: make sure that finding something in the cache only involves comparisons between integral types. The easiest way was to add another field to the IO request data, a 32-bit hash of the filename. The hash is generated whenever a new request is queued up (be it an IO request or a file handle destroy request), and that hash is then used to find cached file handles. To generate the hash I decided to use something already available in the engine rather than integrating something like FNV-1 or xxHash, so I used a CRC32 hash.
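
A sketch of the core of that change follows. The function and variable names are mine, and it assumes FCrc::StrCrc32() as the engine-provided CRC32 helper; the real change also has to thread the hash through the request data and the cache map, which I am not showing here:

// Sketch: hash the filename once, when the request is queued...
uint32 HashFileName( const FString& FileName )
{
	return FCrc::StrCrc32( *FileName );
}

// ...and key the handle cache on that hash, so a cache lookup hashes and
// compares a uint32 instead of CRC32-hashing and case-insensitively
// comparing the whole FString on every Tick.
IFileHandle* FindCachedHandle( const TMap<uint32, IFileHandle*>& HashToHandleMap, uint32 FileNameHash )
{
	return HashToHandleMap.FindRef( FileNameHash );
}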

So after doing that change let’s look at the profile:

FAsyncIOSystemFast

Pretty impressive: the call to ToLower() isn’t there anymore because only 0.030ms is spent in it across all 1000 frames, and the call to FCrc::Strihash_DEPRECATED<wchar_t>() isn’t there either because only 7.9ms is spent in it across all 1000 frames.

The lesson this time is related to the tendency of developers to hide complexity under very generic functions that have huge performance relevance. In particular, performance was suboptimal because it isn’t obvious that an A == B in KeyFuncs::Matches implies a Stricmp call for an FString. That’s why in my own code I tend not to overload operators; they usually hide complexity when, as programmers, we need to be fully aware of the complexity of what we ship. Programmers also forget that our main objective isn’t to create generic solutions to problems we may have in the future. Our main objective is to ship the best experience to the end user, and that means writing code that solves the actual problem we need to solve, with the available hardware resources, and within the available time. If you care a lot about the future of your code then worry about making it optimizable rather than making a grand design that abstracts complexity behind ten different design patterns. The truth is that the end user doesn’t care whether you used ten different patterns in a generic design, but they do care if the load times are high.

And knowing your performance data is critical. In the case of the Elemental demo, 2096 IO requests were queued and fulfilled before the first frame was rendered. Being aware of that is critical to making the proper design and performance decisions. In fact, given that data I would be inclined to optimize this further by changing the AoS nature of FAsyncIORequest into a SoA layout, so that the hashes are stored contiguously and the CPI of the Tick function is reduced, but I will leave that up to you (see the sketch below).
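
As a rough illustration of that idea, this is what such a SoA layout could look like; the type and member names are hypothetical and not engine code:

// Hypothetical SoA layout: the hot data (the filename hashes inspected while
// deciding which request to service) is contiguous in its own array, while
// the cold per-request data is only touched once a request is actually picked.
struct FOutstandingRequestsSoA
{
	TArray<uint32>	FileNameHashes;		// hot: read every Tick
	TArray<FString>	FileNames;			// cold
	TArray<int64>	Offsets;			// cold
	TArray<int64>	Sizes;				// cold
	// ... one array per remaining FAsyncIORequest field ...
};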