lifning

๐Ÿ—ฆ old and unimproved! ๐Ÿ—ง

  • gender? 'ardly she/'er!

if you read my header image, i'm sorry


๐Ÿ“ blog
lifning.info/
๐Ÿ’ช demogroup
hell-labs.itch.io/
๐ŸŒ fediverse
snoot.tube/@lifning
๐Ÿ•ธ๏ธ website league
beam.phosphor.buzz/@lifning

jckarter
@jckarter

Performance-oriented developers usually throw around "data-oriented design" when talking about optimizing for cutting-edge CPUs and GPUs with advanced SIMD units, but some of their tricks are surprisingly effective even for programming lowly retro architectures like the 6502. In particular, the "struct of arrays", or SoA, data layout technique can be effective at optimizing code that works with larger integer types or data structures on these limited architectures.


An 8-bit processor usually only operates on 8 bits at a time, but even back in the day, we wanted programs that work with larger numbers and keep tables of those larger numbers in memory. We can do this the obvious way you would on a contemporary computer, storing each 16-bit value in two contiguous bytes of memory, one after the other, like you'd get if you wrote a global array literal in C:

const uint16_t funny_numbers[] = {
  69,
  219,
  420,
  6969,
  20721,
  31337,
};

Now let's consider an operation that does a lookup into the table:

uint8_t encode_funny_number(uint8_t index) {
  uint16_t value = funny_numbers[index];
  return (value & 0xFF) ^ (value >> 8);
}

To compile this down for a 6502, we have to compute the address of the index in the table, then work on one byte at a time, advancing the index as we go:

encode_funny_number:
  asl a ; multiply by two to get the byte offset into the table
  tax ; move the offset from accumulator register A to index register X
  lda funny_numbers,X ; fetch the first byte into register A
  inx ; advance the index to the next byte
  eor funny_numbers,X ; xor the next byte into register A
  rts ; return from the function

You could think of the 16-bit number as being a "struct" of two eight-bit bytes, and the lookup table as being an "array of structs", or AoSโ€”each entry is stored in a contiguous block in memory immediately after the entry before it. But we can also store the table as a "struct of arrays", where each part of the struct is in a separate array. It might be weird to think of an integer this way, but we could have one array for the low bytes of each entry, followed by a separate array for the high bytes of each entry:

const uint8_t funny_numbers_lo[] = {
  69 & 0xFF,
  219 & 0xFF,
  420 & 0xFF,
  6969 & 0xFF,
  20721 & 0xFF,
  31337 & 0xFF,
};

const uint8_t funny_numbers_hi[] = {
  69 >> 8,
  219 >> 8,
  420 >> 8,
  6969 >> 8,
  20721 >> 8,
  31337 >> 8,
};

Even though the values are split up and spread into two tables, the index into each table is now a constant byte offset, so we can implement the lookup with much less index manipulation, saving two whole instructions, and four cycles, over the AoS version:

encode_funny_number:
  tax ; move the offset from accumulator register A to index register X
  ; we don't need to multiply A by 2 anymore!
  lda funny_numbers_lo,X ; fetch the first byte into register A
  ; we don't need to increment X anymore!
  eor funny_numbers_hi,X ; xor the next byte into register A
  rts ; return from the function

You must log in to comment.

in reply to @jckarter's post:

doing this is also very useful with stuff like 8051's movc a, @a+dptr, where you can only cheaply access 256 bytes with the a register, so with an array of 256 16-bit integers, instead of

mov dptr, #addr
add a, dpl
mov dpl, a
clr a
addc a, dph
mov dph, a
clr a
movc a, @a+dptr
mov r6, a
mov a, #1
movc a, @a+dptr
mov r7, a

you can just do

mov dptr, #addr
mov r6, a
movc a, @a+dptr
xch a, r6
inc dph
movc a, @a+dptr
mov r7, a

Nice! This trick also came in handy for me on an AVR8 project. Even though the AVR doesn't really have compound addressing modes, it's still cheaper to put each byte lookup table on a separate "page" and increment the high byte of the index register, like:

ldi ZH, table >> 8 ; load address table + index into Z register
mov ZL, index
ld result, Z ; load first byte
inc ZH ; move to next page
ld tmp, Z ; load second byte
eor result, tmp