9
12bitfloat
126d

PSA: The smaller the compute shader workgroups, the more efficient they are, down to the wave size (32 on NVIDIA). Not exactly sure why, but it looks like if you don't need group shared memory, you should always make your workgroups wave-sized.

Just this alone gave me a 30%+ performance increase. Combined with a few other changes, it got me from 50 µs to 10 µs, yay!
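
To make "wave sized" concrete, here is a rough sketch that counts how many waves one workgroup occupies. It assumes a wave size of 32 (the NVIDIA figure above); the name `waves_per_group` is made up for illustration and is not part of any graphics API.

```python
import math

WAVE_SIZE = 32  # assumption: NVIDIA wave (warp) size, as stated in the post


def waves_per_group(x, y, z, wave_size=WAVE_SIZE):
    """Number of waves needed to run one workgroup of x*y*z threads."""
    return math.ceil(x * y * z / wave_size)


# A wave-sized group, a 64-thread group, and a 256-thread group.
for size in [(32, 1, 1), (8, 8, 1), (16, 16, 1)]:
    print(size, "->", waves_per_group(*size), "wave(s) per workgroup")
```

A workgroup that is not a multiple of the wave size still rounds up to whole waves, so for example 5*5*1 would take one full wave with 7 lanes idle.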

Comments
  • 4
    Update: Actually, I'm kinda wrong. I have a fullscreen workload, and it is fastest with 8*8*1 workgroups. Both 8*4*1 (wave-sized) and 16*16*1 are noticeably slower...

    Guess if you're reading from an image per globalInvocationId, cache also plays a big role and having 64 threads closer together in terms of cache access outweighs some of the gains of smaller workgroup sizes?
  • 2
    No idea what you are talking about but sounds cool
  • 2
    Welcome to the world of performance. Profile everything, your assumptions are probably wrong.
  • 2
    SIMD performance can be really hard to analytically predict.

    In the case of compute shaders, it really boils down in the end to them needing access to something else besides their own vertex/geometry/pixel/whatever.

    That forces intrinsic dependencies between them, which, coupled with caching and threading phenomena (as you correctly said), can unpredictably impact performance.
  • 2
    @Lensflare I just came here to say this. No idea, but sounds cool. Also, please write more about the topics we don't understand, because it has the same flavor of fun as reading about pseudo-esoteric wizard rituals in third-party DnD supplements.

    More blood sacrifice please.
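
For the fullscreen case in the update comment above, here is a rough sketch of how the three workgroup sizes compare on paper. The 1920x1080 resolution and the helper name `describe_dispatch` are assumptions for illustration only. The numbers show the bookkeeping trade-off (group count, waves per group, wasted threads at the edges); they say nothing about cache behavior, which still has to be profiled.

```python
import math

WAVE_SIZE = 32  # assumption: NVIDIA wave size, per the original post


def describe_dispatch(width, height, gx, gy, wave_size=WAVE_SIZE):
    """Summarize a 2D dispatch that covers width*height pixels with gx*gy workgroups."""
    groups_x = math.ceil(width / gx)   # round up so the right edge is covered
    groups_y = math.ceil(height / gy)  # round up so the bottom edge is covered
    threads = gx * gy
    launched = groups_x * gx * groups_y * gy
    return {
        "groups": groups_x * groups_y,
        "threads_per_group": threads,
        "waves_per_group": math.ceil(threads / wave_size),
        "idle_threads": launched - width * height,  # threads outside the image
    }


# Hypothetical 1920x1080 target; the thread does not state a resolution.
for gx, gy in [(8, 4), (8, 8), (16, 16)]:
    print(f"{gx}*{gy}*1:", describe_dispatch(1920, 1080, gx, gy))
```

Threads that land outside the image are usually handled with an early bounds check on the global invocation ID in the shader.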