Compute shaders and Camera.Render in Unity: workaround for re-ordering

While working on the 1.3 release of my Panorama Capture script, I encountered a mysterious and difficult-to-reproduce issue on certain older GPUs such as the NVIDIA GTX 675M. I was making a sequence of calls to Camera.Render() and ComputeShader.Dispatch(), and Unity would blithely re-order them without regard for the read/write dependencies between them, resulting in very strange panorama images like this:

[Image: distorted panorama]

The function of the compute shader was to take the result of the rendering and store it in a compute buffer, so that the RenderTexture could then be re-used. This is roughly what my code looked like:

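// The RenderTexture constructor needs a depth argument; 24 bits is an assumption
renderTexture = new RenderTexture(width, height, 24);
// One 4-byte uint per pixel, with room for all 10 images
ComputeBuffer computeBuffer = new ComputeBuffer(10 * width * height, 4);
copyShader.SetTexture(kernelIdx, "source", renderTexture);
copyShader.SetBuffer(kernelIdx, "result", computeBuffer);
// The shader declares width and height, so they must be set as well
copyShader.SetInt("width", width);
copyShader.SetInt("height", height);
cam.targetTexture = renderTexture;
for (int i=0; i < 10; i++) {
  // Set cam.transform.position/rotation based on i
  cam.Render();
  // Image i is written at offset i * width * height in the buffer
  copyShader.SetInt("startIdx", i * width * height);
  // Round the group counts up so the whole image is covered
  copyShader.Dispatch(kernelIdx, (width  + threadsX - 1) / threadsX,
                                 (height + threadsY - 1) / threadsY, 1);
}
pixels = new uint[10 * width * height];
computeBuffer.GetData(pixels);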

The goal is to render 10 images and copy them all into computeBuffer in order. But on some GPUs, the Render() and Dispatch() calls are executed out of order. Sometimes all the Render() calls are done before any of the Dispatch() calls, so all 10 images in computeBuffer are identical. Other times, the Dispatch() calls are done just one iteration early or late, shifting the images up or down in the buffer or duplicating certain images. I don’t know whether this is a Unity bug or a GPU memory model limitation, but I needed to find a workaround.
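
One way to spot the duplication symptom without eyeballing the output is to compare consecutive images in the buffer after reading it back. The following is only a rough sketch: CaptureDiagnostics and FindRepeatedImages are hypothetical names, not part of my script, and it assumes that two different camera views always differ in at least one pixel, which may not hold for a degenerate scene.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical diagnostic helper, not part of the Panorama Capture script.
public static class CaptureDiagnostics
{
    // Returns the indices of images whose pixels exactly match the previous
    // image, for a buffer holding imageCount images stored back to back.
    public static List<int> FindRepeatedImages(uint[] pixels, int imageCount, int pixelsPerImage)
    {
        if (pixels.Length < imageCount * pixelsPerImage)
            throw new ArgumentException("pixels is too small for the given layout");

        var repeated = new List<int>();
        for (int i = 1; i < imageCount; i++)
        {
            int prev = (i - 1) * pixelsPerImage;  // start of the previous image
            int curr = i * pixelsPerImage;        // start of the current image
            bool same = true;
            for (int p = 0; p < pixelsPerImage; p++)
            {
                if (pixels[prev + p] != pixels[curr + p]) { same = false; break; }
            }
            if (same) repeated.Add(i);
        }
        return repeated;
    }
}
```

With the buffer from the snippet above, this would be called as CaptureDiagnostics.FindRepeatedImages(pixels, 10, width * height). An empty list does not prove the ordering was correct (shifted images would not be caught), so treat it as a quick sanity check only.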

I discovered that reading the entire result buffer after every Dispatch() eliminates the issue, but at an enormous performance cost:

for (int i=0; i < 10; i++) {
  // Set cam.transform.position/rotation based on i
  cam.Render();
  copyShader.SetInt("startIdx", i * width * height);
  copyShader.Dispatch(kernelIdx, (width  + threadsX - 1) / threadsX,
                                 (height + threadsY - 1) / threadsY, 1);
  computeBuffer.GetData(pixels);
}

Rather than retrieving the entire huge buffer containing all the data, a more efficient method is to introduce a second, very small buffer, and retrieve that instead. Here’s what the modified compute shader looks like:

#pragma kernel Copy

Texture2D source;
RWStructuredBuffer<uint> result, forceWaitBuffer;
int width, height, startIdx, forceWaitValue;
SamplerState MyPointRepeatSampler;

[numthreads(32,32,1)]
void Copy (uint3 id : SV_DispatchThreadID) {
    if (id.x >= width || id.y >= height) return;
    if (id.x == width - 1 && id.y == height - 1 && id.z == 0)
        forceWaitBuffer[0] = forceWaitValue;

    float4 color = source.SampleLevel(MyPointRepeatSampler,
                                      float2(((float)id.x + 0.5)/ width,
                                             ((float)id.y + 0.5)/ height), 0);
    color *= 255.0;
    result[startIdx + (id.y * width) + id.x] =
        ((int)color.r << 16) | ((int)color.g << 8) | (int)color.b;
}
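
For reference, here is a rough sketch of how the packed values written by this kernel could be decoded on the CPU after GetData(). PackedPixel and its methods are hypothetical names I am using for illustration. It assumes each color channel lies in 0..255 after the multiplication by 255 (that is, the source colors are in the 0..1 range), and it reflects that the kernel discards alpha.

```csharp
using System;

// Hypothetical helper for reading back the packed pixels on the CPU.
public static class PackedPixel
{
    // Mirrors the shader's packing: (r << 16) | (g << 8) | b.
    public static uint Pack(byte r, byte g, byte b)
    {
        return ((uint)r << 16) | ((uint)g << 8) | b;
    }

    // Splits a packed value back into its three 8-bit channels.
    public static void Unpack(uint packed, out byte r, out byte g, out byte b)
    {
        r = (byte)((packed >> 16) & 0xFF);
        g = (byte)((packed >> 8) & 0xFF);
        b = (byte)(packed & 0xFF);
    }

    // Buffer index of pixel (x, y) in image imageIdx, matching
    // startIdx + (id.y * width) + id.x with startIdx = imageIdx * width * height.
    public static int IndexOf(int imageIdx, int x, int y, int width, int height)
    {
        return imageIdx * width * height + y * width + x;
    }
}
```

For example, the pixel at (x, y) in image i would be read as pixels[PackedPixel.IndexOf(i, x, y, width, height)] and then passed to Unpack.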

The main difference is that I’ve added a second RWStructuredBuffer, forceWaitBuffer, and an int, forceWaitValue. I then pick just one value of (id.x, id.y, id.z), and the thread with that ID writes forceWaitValue to the buffer. Here I use the last thread inside the image, but the choice is probably unimportant. On the CPU side I just read this buffer back and check the result:

ComputeBuffer forceWaitBuffer = new ComputeBuffer(1, 4);
copyShader.SetBuffer(kernelIdx, "forceWaitBuffer", forceWaitBuffer);
uint[] forceWaitResult = new uint[1];
for (int i=0; i < 10; i++) {
  // Set cam.transform.position/rotation based on i
  cam.Render();
  int forceWaitValue = i;
  copyShader.SetInt("forceWaitValue", forceWaitValue);
  copyShader.SetInt("startIdx", i * width * height);
  copyShader.Dispatch(kernelIdx, (width  + threadsX - 1) / threadsX,
                                 (height + threadsY - 1) / threadsY, 1);
  forceWaitBuffer.GetData(forceWaitResult);
  if (forceWaitResult[0] != forceWaitValue)
    Debug.LogError("Unexpected forceWaitResult value");
}

Checking the result is not strictly necessary but helps avoid situations where the buffer was accidentally not set for whatever reason.

The same technique can be applied to force synchronization of the Camera.Render() call: we don’t want to retrieve all of its pixels (which is very slow), but we can quickly retrieve just one of them (in this case a single corner pixel), which ensures the rendering has completed:

Texture2D forceWaitTexture = new Texture2D(1, 1);
for (int i=0; i < 10; i++) {
  // Set cam.transform.position/rotation based on i
  cam.Render();
  RenderTexture.active = renderTexture;
  forceWaitTexture.ReadPixels(new Rect(width - 1, height - 1, 1, 1), 0, 0);

  copyShader.SetInt("forceWaitValue", i);
  copyShader.SetInt("startIdx", i * width * height);
  copyShader.Dispatch(kernelIdx, (width  + threadsX - 1) / threadsX,
                                 (height + threadsY - 1) / threadsY, 1);
  forceWaitBuffer.GetData(forceWaitResult);
}
RenderTexture.active = null;

In my testing so far this has produced reliably correct results on every GPU supporting compute shaders that I have tried, even when taking many images in quick succession or capturing very large images. However, I have not tested this on all devices and Unity versions (I’m on 5.1.2f1), and there may be limitations that I am unaware of.
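
A side note on the Dispatch() arguments used throughout: they rely on integer ceiling division so that image sizes that are not a multiple of the thread group size are still fully covered, which is also why the kernel begins with a bounds check. A minimal sketch of that arithmetic follows; DispatchMath and GroupCount are hypothetical names, and threadsX and threadsY are assumed to match the [numthreads(32,32,1)] declaration in the shader.

```csharp
using System;

// Hypothetical helper illustrating the group-count arithmetic.
public static class DispatchMath
{
    // Smallest number of groups of threadsPerGroup threads covering size items.
    public static int GroupCount(int size, int threadsPerGroup)
    {
        if (size <= 0 || threadsPerGroup <= 0)
            throw new ArgumentOutOfRangeException();
        return (size + threadsPerGroup - 1) / threadsPerGroup;
    }
}
```

With 32 threads per group, a width of 1024 needs 32 groups, while a width of 1025 needs 33; the extra threads in the last group fall outside the image and return early.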
