Unusual performance drops for `Tex.SetColors` & `Mesh.SetVerts`
Description
When testing my HL2 application, I started to notice my color image sprites would occasionally have "performance hiccups" in their IStepper's Step function. If this function call takes too long in the main thread, it will block all the other ISteppers from rendering. I then noticed the inconsistency in performance for their texture's SetColors method. The main variables to take into consideration for quicker updates of a texture's color data seemed to be:
- Which architecture (ARM vs ARM64 for HL2 applications, PC as "ground truth")
- NoMips vs Dynamic textures
Platform / Environment
- Razer Blade 15 Laptop (x64)
- HoloLens 2 (ARM / ARM64)
- Using StereoKit's latest NuGet package (preview 3.7.3).
Test Program
A simple UWP C# single-file (Program.cs) application was used for testing:
namespace SetColorsSpeedTest
{
using StereoKit;
using StereoKit.Framework;
using System;
using System.Collections.Generic;
using System.Diagnostics;
public class ImagePlane : IStepper
{
private Tex tex;
private Sprite sprite;
private int imageWidth = 896;
private int imageHeight = 504;
private Pose windowPose = Matrix.T(new Vec3(0, 0, -1)).Pose;
private Vec2 imageAspectRatio;
private Stopwatch stopwatch = new Stopwatch();
public ImagePlane()
{
this.tex = new Tex(TexType.Dynamic | TexType.ImageNomips, TexFormat.Rgba32);
this.tex.SetColors(this.imageWidth, this.imageHeight, new Color32[this.imageWidth * this.imageHeight]);
this.sprite = Sprite.FromTex(tex, SpriteType.Single);
this.imageAspectRatio.x = 60.0f * U.cm;
this.diagnosticTimes = new List<long>(numValues);
}
public bool Enabled => true;
public bool Initialize()
{
return true;
}
public void Shutdown()
{
}
// diagnostic states
private static int numValues = 100;
private int numOutliers = 5;
private List<long> diagnosticTimes;
public void Step()
{
if (this.diagnosticTimes.Count == numValues)
{
this.diagnosticTimes.Sort();
// print diagnostics information
Trace.WriteLine($"min: {this.diagnosticTimes[0]}ms, " +
$"max: {this.diagnosticTimes[numValues - 1]}ms, " +
$"median: {this.diagnosticTimes[numValues / 2 - 1]}ms");
Trace.WriteLine($"printing top {numOutliers} longest times:");
for (int i = numValues - numOutliers - 1; i < numValues; i++)
{
long outlier = this.diagnosticTimes[i];
Trace.WriteLine($"#{i + 1}: {outlier}ms");
}
// safely exit the application
SK.Quit();
return;
}
Color32[] newColors = new Color32[this.imageWidth * this.imageHeight];
stopwatch.Start();
this.tex.SetColors(this.imageWidth, this.imageHeight, newColors);
long l = stopwatch.ElapsedMilliseconds;
stopwatch.Reset();
UI.WindowBegin("image plane", ref this.windowPose, moveType: UIMove.FaceUser);
UI.Image(this.sprite, this.imageAspectRatio);
UI.WindowEnd();
diagnosticTimes.Add(l);
}
}
internal class Program
{
static void Main(string[] args)
{
// Initialize StereoKit
SKSettings settings = new SKSettings
{
appName = "SetColorsSpeedTest",
assetsFolder = "Assets",
};
if (!SK.Initialize(settings))
Environment.Exit(1);
SK.AddStepper(new ImagePlane());
// Create assets used by the app
Pose cubePose = new Pose(0, 0, -0.5f, Quat.Identity);
Model cube = Model.FromMesh(
Mesh.GenerateRoundedCube(Vec3.One * 0.1f, 0.02f),
Default.MaterialUI);
Matrix floorTransform = Matrix.TS(0, -1.5f, 0, new Vec3(30, 0.1f, 30));
Material floorMaterial = new Material(Shader.FromFile("floor.hlsl"));
floorMaterial.Transparency = Transparency.Blend;
// Core application loop
while (SK.Step(() =>
{
if (SK.System.displayType == Display.Opaque)
Default.MeshCube.Draw(floorMaterial, floorTransform);
UI.Handle("Cube", ref cubePose, cube.Bounds);
cube.Draw(cubePose.ToMatrix());
})) ;
SK.Shutdown();
}
}
}
Logs or exception details
All timing measurements were made over 100 render frames. For each variable, I performed the measurement 3 times and kept the most average-looking results.
Legend
| key | meaning |
|---|---|
| min | Minimum time of one function call for n frames |
| max | Maximum time of one function call for n frames |
| median | Time of one function call for frame # n/2 |
| outliers | Top 5 longest times of the function call over n frames |
Diagnostic times
| NoMips | Dynamic | |
| x64 (PC) |
min: 4ms, max: 22ms, median: 5ms
printing top 5 outliers:
#95: 10ms
#96: 10ms
#97: 11ms
#98: 11ms
#99: 13ms
#100: 22ms
|
min: 4ms, max: 13ms, median: 4ms
printing top 5 outliers:
#95: 7ms
#96: 7ms
#97: 8ms
#98: 8ms
#99: 9ms
#100: 13ms
|
| ARM (HL2) |
min: 4ms, max: 37ms, median: 6ms
printing top 5 outliers:
#95: 11ms
#96: 15ms
#97: 23ms
#98: 25ms
#99: 30ms
#100: 37ms
|
min: 3ms, max: 33ms, median: 5ms
printing top 5 outliers:
#95: 8ms
#96: 8ms
#97: 9ms
#98: 9ms
#99: 23ms
#100: 33ms
|
| ARM64 (HL2) |
min: 10ms, max: 51ms, median: 13ms
printing top 5 outliers:
#95: 26ms
#96: 27ms
#97: 27ms
#98: 34ms
#99: 45ms
#100: 51ms
|
min: 10ms, max: 40ms, median: 12ms
printing top 5 outliers:
#95: 20ms
#96: 20ms
#97: 22ms
#98: 27ms
#99: 37ms
#100: 40ms
|
Discussion
- As you can see above in the ARM architecture, there are always outliers 5-10x slower than the median frame. What could be the reason for these significant outliers?
- ~Why aren't we allowed to set a Texture as
DynamicandImageNomipsfrom the start? Could this be contributing to the two outliers as seen in ARM -> Dynamic -> Atlased/Single?~ - Why would ARM be much faster than ARM64 when copying this color data to the GPU?
I'll look into this in detail in the morning, thanks a bunch for writing this up! The ARM vs ARM64 thing is ...unexpected to say the least, the big difference being that ARM64 uses .NET Native, and ARM generally does not.
I'll also note that you can specify more than one TexType, it's a bit flag. So TexType.Dynamic | TexType.ImageNomips will get you both at the same time.
Edit: Also Single vs. Atlased shouldn't make a difference here, you should see logs in the console noting that Atlased isn't implemented, and it falls back to Single anyhow :)
Thanks so much for the speedy response!
I did a big facepalm after seeing the docs about Atlased not being implemented after making the table 😅 That's awesome that TexType is a bit flag!
So, I just tested the ARM application again with the TexType set to TexType.Dynamic | TexType.ImageNomips and still am receiving the following outliers:
min: 3ms, max: 46ms, median: 5ms
printing top 5 longest times:
#95: 7ms
#96: 7ms
#97: 8ms
#98: 10ms
#99: 19ms
#100: 46ms
These outliers are my main concern for creating the issue, but the ARM vs ARM64 comparison just peaks my curiosity too!
I also noticed calling Mesh.SetVerts without calculating the bounds looks like it should simply be copying data in:
https://github.com/StereoKit/StereoKit/blob/dce5945141c000a16fd7497243fab45bd507882b/StereoKitC/libraries/sk_gpu.h#L1006
However, it could be that the Map operation is taking a varying amount of time: https://stackoverflow.com/questions/40808759/id3d11devicecontextmap-slow-performance
Perhaps an option to specify the D3D11_MAP_FLAG_DO_NOT_WAIT flag could boost performance? <- src.
I tested out a very similar design to this sample program for the ARM architecture on HL2 and received outliers that are a factor of ~3-10 of the median time.
Mesh.SetVerts output (ARM + HL2)
MeshRenderer: 163ms
MeshRenderer: 30ms
MeshRenderer: 31ms
MeshRenderer: 27ms
MeshRenderer: 106ms
MeshRenderer: 27ms
MeshRenderer: 27ms
MeshRenderer: 28ms
MeshRenderer: 41ms
MeshRenderer: 20ms
Yeah, I do suspect that a lot of the issue here is related to interaction with the GPU, which can be kinda complicated. Basically, if the GPU is busy doing something like drawing the previous frame, then it has to wait until the GPU unlocks before a Map can occur.
D3D11_MAP_FLAG_DO_NOT_WAIT is perhaps part of the solution, but that flag IIRC should cause the Map call to error out instead of waiting! You can then attempt to Map later on when the device isn't busy, but rescheduling that task is not exactly simple in this case. There needs to be some amount of interaction with SK's async asset pipeline here to let these slide until later in the frame, or to the next frame.