llvm

Ranter

12bitfloat

11034

Comments

3

Lensflare

21900

119d

Stop molesting those poor allocations!
2

jestdotty

6686

119d

why would you even have a need for such magic
1

12bitfloat

11034

119d

@jestdotty It's useful when you're running some wide algorithm over variable-length data, e.g. UTF-8 decoding, where instead of reading each byte on its own you read the entire (potential) 4 bytes as one uint. Obviously, for the last 3 bytes this would require reading past the end of the allocation.

Though I think it's easier to just have two functions: one for the interior of the buffer and one that handles the tail...
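
A minimal sketch of that two-function split, in Rust. The names `load4_interior`, `load4_tail` and `load4` are made up for illustration, and the zero-padding in the tail path is an assumption about what a decoder would want to see past the end of the data:

```rust
/// Interior path: the caller guarantees i + 4 <= buf.len(), so all four
/// bytes are inside the buffer. The bytes are packed little-endian.
fn load4_interior(buf: &[u8], i: usize) -> u32 {
    u32::from_le_bytes([buf[i], buf[i + 1], buf[i + 2], buf[i + 3]])
}

/// Tail path: fewer than 4 bytes remain, so copy what is left into a
/// zero-padded scratch array instead of reading past the allocation.
/// Requires i <= buf.len().
fn load4_tail(buf: &[u8], i: usize) -> u32 {
    let rest = &buf[i..];
    let n = rest.len().min(4);
    let mut scratch = [0u8; 4];
    scratch[..n].copy_from_slice(&rest[..n]);
    u32::from_le_bytes(scratch)
}

/// Dispatcher (hypothetical helper): picks the path based on how many
/// bytes remain after position `i`.
fn load4(buf: &[u8], i: usize) -> u32 {
    if i + 4 <= buf.len() {
        load4_interior(buf, i)
    } else {
        load4_tail(buf, i)
    }
}
```

In a real decoder you would probably run one loop that only calls the interior version while at least 4 bytes remain, then a second short loop over the tail, rather than branching on every load.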
2

lorentz

15396

118d

@12bitfloat You would usually only call the wide algo on the well-aligned middle. In fact, some algos also have versions for multiples of 8 bytes, just so they don't need to bounds-check after each 64-bit item.

Any special reason you can't use SIMD?
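
For illustration, here is roughly what "wide algo on the aligned middle" can look like in Rust, using the standard `slice::align_to`. The ASCII check is just a stand-in workload picked for this sketch, and `is_ascii_wide` is a made-up name:

```rust
/// Returns true if every byte in `buf` is below 0x80.
/// The unaligned head and tail are checked byte by byte; the aligned
/// middle is checked 8 bytes at a time.
fn is_ascii_wide(buf: &[u8]) -> bool {
    // One bit set at the top of each of the 8 bytes in a word.
    const HIGH_BITS: u64 = 0x8080_8080_8080_8080;

    // SAFETY: every bit pattern is a valid u64, and `align_to` only
    // hands out a middle slice that is correctly aligned and in bounds.
    let (head, middle, tail) = unsafe { buf.align_to::<u64>() };

    head.iter().all(|b| *b < 0x80)
        && middle.iter().all(|w| (*w & HIGH_BITS) == 0)
        && tail.iter().all(|b| *b < 0x80)
}
```

The high-bit mask is the same in every byte position, so this particular check does not depend on endianness. How the buffer gets split between head, middle and tail depends on where the allocation happens to sit, which is why all three parts run the same test.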
1

12bitfloat

11034

118d

@lorentz Yeah, I think that makes the most sense. For some reason I was worried about branch misprediction on the two consecutive loops, but the remaining size of your buffer will literally be in L1, if not in a register.

Regarding SIMD, I haven't had a good idea for an efficient implementation yet. One major hurdle in UTF-8 decoding is that you have to smash the lower 6 bits of each of the up to 3 continuation bytes together. The only good ways I've found are the pext instruction (which is microcoded and gigaslow on Zen 2, so I can't use it) and an unrolled loop.
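
To make the bit-smashing concrete, here is a portable software stand-in for what pext computes, as a rough Rust sketch. `pext_soft` is a made-up name, and the sketch assumes the sequence bytes were packed with the lead byte in the most significant position, so the mask keeps the payload bits of the lead byte plus the low 6 bits of each continuation byte:

```rust
/// Software stand-in for parallel bit extract: gathers the bits of `src`
/// selected by `mask` and packs them contiguously into the low bits of
/// the result, preserving their order.
fn pext_soft(src: u32, mut mask: u32) -> u32 {
    let mut out = 0u32;
    let mut bit = 0u32;
    while mask != 0 {
        // Isolate the lowest set bit of the mask.
        let lowest = mask & mask.wrapping_neg();
        if src & lowest != 0 {
            out |= 1u32 << bit;
        }
        bit += 1;
        // Clear that bit and move on to the next selected position.
        mask &= mask - 1;
    }
    out
}
```

For the 3-byte sequence E2 82 AC packed as 0x00E282AC, the mask 0x000F3F3F selects 4 + 6 + 6 payload bits, and the result is the code point 0x20AC. On x86-64 with BMI2 the hardware version is exposed in Rust as `core::arch::x86_64::_pext_u32`.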

I've used the latter, and it's surprisingly fast with a few other tricks (a LUT for the first byte plus a fast path for 1-byte sequences).
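
Something along these lines, as a rough Rust sketch rather than the actual implementation. `LEN_LUT` and `decode_one` are names invented here, the unrolled loop is written out as straight-line match arms, and the sketch deliberately does no validation of continuation bytes, overlong encodings or surrogates:

```rust
/// Sequence length per lead byte, indexed by its top 5 bits (byte >> 3).
/// 0 marks a byte that cannot start a sequence.
const LEN_LUT: [u8; 32] = [
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0xxxxxxx: ASCII
    0, 0, 0, 0, 0, 0, 0, 0, // 10xxxxxx: continuation byte
    2, 2, 2, 2, // 110xxxxx
    3, 3, // 1110xxxx
    4, // 11110xxx
    0, // 11111xxx: never valid
];

/// Decodes one code point from the front of `buf`.
/// Returns (code point, bytes consumed), or None if the buffer is empty,
/// truncated, or starts with a byte that cannot lead a sequence.
fn decode_one(buf: &[u8]) -> Option<(u32, usize)> {
    let b0 = *buf.first()?;
    // Fast path: 1-byte sequences skip the table entirely.
    if b0 < 0x80 {
        return Some((b0 as u32, 1));
    }
    let len = LEN_LUT[(b0 >> 3) as usize] as usize;
    if len == 0 || buf.len() < len {
        return None;
    }
    // Low 6 payload bits of continuation byte `i` (not validated here).
    let cont = |i: usize| (buf[i] & 0x3F) as u32;
    // Unrolled merge: one straight-line expression per sequence length.
    let cp = match len {
        2 => ((b0 & 0x1F) as u32) << 6 | cont(1),
        3 => ((b0 & 0x0F) as u32) << 12 | cont(1) << 6 | cont(2),
        _ => ((b0 & 0x07) as u32) << 18 | cont(1) << 12 | cont(2) << 6 | cont(3),
    };
    Some((cp, len))
}
```

Calling `decode_one` in a loop and advancing by the returned length walks a whole buffer; the length check is where the interior/tail split would come in for a wide-load version.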