thought it was going to be a pain to emulate VPERMD/VPERMPS on AArch64 since I can't easily go through memory with what we're doing, but it turns out TRN1 makes the whole index setup easy. it goes a little something like:
; assume we have some vector named 'data' that we
; want to permute and we are given a 256-bit wide
; index register with 32-bit elements like:
;
; indices -> [4, 1, 2, 6, 7, 0, 3, 5]
;
; and its corresponding 8-bit element equivalent
; (bytes of each element shown most significant first)
;
; [0, 0, 0, 4, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 6, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 5]
trn1 z30.b, indices.b, indices.b
trn1 z30.h, z30.h, z30.h
; our vector is now like:
;
; [4, 4, 4, 4, 1, 1, 1, 1, 2, 2, 2, 2, 6, 6, 6, 6, 7, 7, 7, 7, 0, 0, 0, 0, 3, 3, 3, 3, 5, 5, 5, 5]
;
; now we need to turn them into byte indices instead of word indices
lsl z30.b, z30.b, #2
; now our vector is like:
;
; [16, 16, 16, 16, 4, 4, 4, 4, 8, 8, 8, 8, 24, 24, 24, 24, 28, 28, 28, 28, 0, 0, 0, 0, 12, 12, 12, 12, 20, 20, 20, 20]
;
; cool! right now all four bytes of each element hold the index of the
; element's first byte, so the other three need to point at the bytes after it
;
; so let's slam 0x03020100 into a temporary and broadcast it into a vector
mov w0, #0x0100
movk w0, #0x0302, lsl #16
dup z29.s, w0
; now let's add it to our indices
add z30.b, z30.b, z29.b
; now the index vector is like so
;
; [19, 18, 17, 16, 7, 6, 5, 4, 11, 10, 9, 8, 27, 26, 25, 24, 31, 30, 29, 28, 3, 2, 1, 0, 15, 14, 13, 12, 23, 22, 21, 20]
;
; now all that's left to do is fire it into TBL
tbl z30.b, data.b, z30.b
; and now z30 holds the data shuffled as if by VPERMD or VPERMPS
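
if you want to sanity-check the index math without hardware, here's a rough scalar model of the sequence above in Python. it's only a sketch: it assumes a 256-bit vector and little-endian byte order inside each 32-bit element (so the byte lists read low byte first, the reverse of the comments above), and the helper names (`to_bytes`, `trn1`, `tbl`, `build_byte_indices`, `permute_via_tbl`) are made up for illustration, not taken from any real library.

```python
def to_bytes(words):
    """Flatten 32-bit elements into a byte list, low byte first."""
    return [b for w in words for b in w.to_bytes(4, "little")]

def trn1(a, b, size):
    """Model TRN1 on byte lists with `size`-byte elements:
    even-numbered elements of a and b are interleaved."""
    out = []
    for i in range(0, len(a), 2 * size):
        out += a[i:i + size] + b[i:i + size]
    return out

def tbl(table, idx):
    """Model byte-wise TBL: an out-of-range index yields 0."""
    return [table[i] if i < len(table) else 0 for i in idx]

def build_byte_indices(index_words):
    idx = to_bytes(index_words)
    idx = trn1(idx, idx, 1)                # trn1 z30.b, indices.b, indices.b
    idx = trn1(idx, idx, 2)                # trn1 z30.h, z30.h, z30.h
    idx = [(b << 2) & 0xFF for b in idx]   # lsl z30.b, z30.b, #2
    offsets = [0, 1, 2, 3] * len(index_words)   # dup of 0x03020100
    return [(b + k) & 0xFF for b, k in zip(idx, offsets)]  # add z30.b, ...

def permute_via_tbl(data_words, index_words):
    out = tbl(to_bytes(data_words), build_byte_indices(index_words))
    return [int.from_bytes(bytes(out[i:i + 4]), "little")
            for i in range(0, len(out), 4)]

indices = [4, 1, 2, 6, 7, 0, 3, 5]
data = [0x11111111 * (i + 1) for i in range(8)]
# the TBL route should match a plain element-wise permute
assert permute_via_tbl(data, indices) == [data[i] for i in indices]
```

the first element's byte indices come out as 16, 17, 18, 19 here, which is the same thing as the 19, 18, 17, 16 in the comments above, just listed low byte first.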
left out the index register sanitizing for brevity, but essentially, prior to all this, you should bitwise AND each 32-bit element in the index vector with 0b111 (7) to clear out any junk in there, since someone can be silly and set bits other than the ones strictly used by VPERMD/VPERMPS
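
to see why the masking matters, here's a tiny scalar sketch in Python (the names `sanitize` and `first_byte_index` are hypothetical, and it assumes a 256-bit vector, so valid byte indices are 0 to 31). the TRN1 steps only keep the low byte of each index element, so junk in the upper three bytes drops out on its own, but junk in bits 3 and up of the low byte survives into the shift and can push the byte index out of range:

```python
def sanitize(index_words):
    """Keep only the low 3 bits of each 32-bit index element."""
    return [w & 0b111 for w in index_words]

def first_byte_index(word):
    """Byte index the TRN1 + LSL #2 setup produces for one index element:
    the low byte shifted left by 2 and truncated to 8 bits."""
    return ((word & 0xFF) << 2) & 0xFF

junk = 0xFFFFFF0D                  # low 3 bits say element 5, the rest is junk
raw = first_byte_index(junk)       # 52, past the end of a 32-byte table
clean = first_byte_index(sanitize([junk])[0])   # 20, first byte of element 5
assert raw >= 32 and clean == 20
```

with the unmasked value the byte index lands outside the 32-byte table, so the lookup would not return element 5 like VPERMD/VPERMPS would.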
