12bitfloat

1y

PSA: The smaller your compute shader workgroups, the more efficient they are, down to the wave size (32 on NVIDIA). Not exactly sure why, but it looks like if you don't need group shared memory, you should always make your workgroups wave sized.

Just this alone gave me a 30%+ performance increase. Combined with a few other changes, it got me from 50 µs to 10 µs, yay!
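For anyone wanting to play with the numbers, here is a rough sketch of the arithmetic behind picking a workgroup size: how many waves one workgroup occupies and how many workgroups a dispatch needs. It is plain Python, not Vulkan code. The helper names and the 1,000,000-item workload are made up for illustration, and the wave size of 32 is just the NVIDIA figure from the post.

```python
import math

WAVE_SIZE = 32  # assumption: NVIDIA wave size mentioned in the post


def waves_per_workgroup(local_size, wave_size=WAVE_SIZE):
    """Number of waves needed to run one workgroup of the given (x, y, z) size."""
    x, y, z = local_size
    return math.ceil(x * y * z / wave_size)


def dispatch_counts(work_items, local_size):
    """Workgroup counts per axis so that every work item is covered."""
    return tuple(math.ceil(n / l) for n, l in zip(work_items, local_size))


if __name__ == "__main__":
    items = (1_000_000, 1, 1)  # hypothetical 1D workload
    for local in [(32, 1, 1), (64, 1, 1), (256, 1, 1)]:
        groups = dispatch_counts(items, local)
        print(local, "groups:", groups, "waves per group:", waves_per_workgroup(local))
```

This only shows how the counts change with workgroup size. Whether the smaller size is actually faster still has to be measured on the target GPU.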

random

vulkan

psa

Ranter

Comments

4

12bitfloat

10996

1y

Update: Actually, I'm kinda wrong. I have a fullscreen workload, and that is fastest with 8*8*1 workgroups. Both 8*4*1 (wave sized) and 16*16*1 are noticeably slower...

Guess if you're reading from an image per globalInvocationId, cache also plays a big role, and having 64 threads close together in terms of cache access outweighs some of the gains of smaller workgroup sizes?
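To put rough numbers on the three workgroup shapes mentioned above, here is a small sketch that counts workgroups, waves per workgroup, and idle threads for a fullscreen pass. The 1920x1080 target and the `tile_stats` helper are assumptions for illustration only, and the wave size of 32 is the NVIDIA figure from the original post. It says nothing about cache behaviour by itself; it only shows how the tiling differs.

```python
import math

WAVE_SIZE = 32  # assumption: NVIDIA wave size from the original post


def tile_stats(width, height, local_x, local_y, wave_size=WAVE_SIZE):
    """Basic tiling numbers for covering a width x height image with 2D workgroups."""
    groups_x = math.ceil(width / local_x)
    groups_y = math.ceil(height / local_y)
    threads = local_x * local_y
    launched = groups_x * groups_y * threads
    return {
        "groups": groups_x * groups_y,
        "waves_per_group": math.ceil(threads / wave_size),
        # Threads that fall outside the image and do no useful work.
        "idle_threads": launched - width * height,
    }


if __name__ == "__main__":
    # Hypothetical 1920x1080 fullscreen target.
    for lx, ly in [(8, 4), (8, 8), (16, 16)]:
        print(f"{lx}*{ly}*1:", tile_stats(1920, 1080, lx, ly))
```

At this resolution 16*16*1 does not divide the height evenly, so some threads in the last row of workgroups sit idle, while 8*4*1 and 8*8*1 tile it exactly. That alone would not explain the timings, so treat it as one input next to actual profiling.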
2

Lensflare

21732

1y

No idea what you are talking about but sounds cool
2

CoreFusionX

3611

1y

SIMD performance can be really hard to analytically predict.

In the case of compute shaders, it really boils down in the end to them needing access to something else besides their own vertex/geometry/pixel/whatever.

That forces intrinsic dependencies between them, which, coupled with caching and threading phenomena, as you correctly said, can unpredictably impact performance.
2

Wisecrack

9419

1y

@Lensflare I just came here to say this. No idea, but sounds cool. Also, please write more about the topic we don't understand, because it has the same flavor of fun as reading about pseudo-esoteric wizard rituals in third-party DnD supplements.

More blood sacrifice please.


devRant © 2021 Hexical Labs LLC