another handwritten note. i was trying to figure out whether there was a better way to emulate HSUBPS/HSUBPD on aarch64. turns out what we have is fine, and this is only useful for the 256-bit AVX variants, which i haven't implemented yet
ignore the missing bit of the paper. i cut a piece out to make a grocery list a few weeks before i actually used it for this
