previous: https://cohost.org/haloopdy/post/2914882-adventures-in-raycas
in the last one, i finally had a raycasting game running on arduboy, which has an 8 bit 16mhz processor and 2.5kib ram:
but i want more
textured raycasting
textures are significantly more complex than flat surfaces. instead of pulling nice whole bytes out of a shading array, i now have to:
- calculate the horizontal coordinate within the texture where the ray hit the wall for every ray cast (every vertical slice)
- read the texture data for that x coordinate
- for every pixel in the wall stripe, calculate which texture pixel should go there
- draw each individual pixel of the wall
if you remember from before, individually setting pixels is far slower than doing it per-byte. and performing complicated texture coordinate math EVERY pixel? seems impossible.
let's do it.
first, i rewrite some code so that walls are drawn per-pixel again, and i prep a lot of other code to prepare for textures. i also add that "x coordinate within texture" calculation to the wall. but i'm still not drawing any textures, this is just "prepwork". and...
the commit message says "HORRIBLY SLOW" but i think i was just spoiled. it dropped from 45 to less than 30 fps just from the prepwork. however, when i went back to the profiler, i noticed something odd.
what you're seeing is the debug assembly and the profiler's "percentage cpu spent on each assembly instruction". the gray line at the top is the c++ code and the red underneath is the generated assembly.
if you're familiar with this stuff, you might immediately see something is off. why is SO MUCH ASSEMBLY generated for a simple left shift and bitwise AND? it couldn't be... but it was. the AVR instruction set, and this hardware, does not support arbitrary left or right shift, only simple shift by 1. arbitrary left and right shift is achieved with a Barrel Shifter and i guess they just didn't want to include one. as such, these arbitrary left + right shifts, which the compiler can't reason about and optimize, turn into loops. uuggghh. so now, most of the cpu is spent in this silly little loop of CALCULATING pixels (not drawing).
i have a commit message here called "No left shift, brain damage". seems apt; i was not expecting a fundamental operation to simply not exist, though i suppose it's really my ignorance that was the problem. so on this platform, it is actually significantly faster to multiply by 8 than it is to left shift 3. and it's EVEN faster to multiply by 64 than it is to left shift 6. this is why they tell you not to micro-optimize prematurely; it's pretty natural for people to think left shift is faster than multiply, but in this case they'd actually be making things worse.
so i had to make a lookup table for left shift of all things in order to generate individual bitmasks before i could even think about textures. this also meant replacing things like this:
uint16_t bofs = (yStart >> 3) * WIDTH;
with weird stuff like
uint16_t bofs = (yStart & 0b1111000) * BWIDTH;
where BWIDTH is pre-right-shifted, effectively making the math the same (along with the bitwise AND) but significantly faster.
next, i actually draw the texture:
fancy, i know
and finally, i start copying over the code from the Lodev tutorial again, but this time i'm actively rewriting it all to be as optimized as possible for this system, precalculating as much as i can, reducing data sizes, and... and...
WORKING! as i say in my commit message. sure it runs at 20fps but it works. i think i almost cried. being the pessimist that i am, i thought "i won't be able to make this much faster" but i got to work trying anyway
optimizations again
i'll just go through these quickly, but imagine me in a half-crazed state, arched over the computer and trying every weird hack i can to get it to run faster, repeatedly dumping the program into the profiler and rejoicing at tiny percentages gained.
- reduced texture precision (2 bytes instead of 4 for calculations)
- removing safety out of bounds checks for texture reading by making a 16 bit left shift lookup table with 17 entries, the last one duplicated
- using do-while loops instead of for loops to reduce the cost of branching (it's complicated, this isn't a normal optimization)
- replace 2 of the divisions with a hacky lookup table (this was the most improvement by FAR)
i want to talk about that division a bit. if you recall before, the loop only had 4 total divisions in it (just repeated every vertical slice). i was able to remove one of them before, but that still left 3. i was now able to remove a further TWO of them by using a "large" division lookup table. i had the luxury of knowing that all the divisions would be of the form 1 / (0 < x <= 1), and the fact that x in the range of 0 to 1 is only 8 bits means the table only needs 256 entries. i unfortunately ended up having to do some hacks because sometimes x can be slightly greater than 1, but the division lookup table still saved an absolutely ENORMOUS amount of performance. so much so that:
i got it back up to running at a smooth 35fps again. i really can't tell you how accomplished i felt in that moment. i felt so accomplished that i felt i could tackle:
sprites
ok, a fast AND textured raycaster is good and all, but like... again, what game are you going to play that just has walls, even if you can put fancy stuff on them. i still needed more, and sprites were the answer. with sprites, i could (theoretically) make a real game.
sprites come with their OWN complexities though:
- i was already pushing the cpu up to 100% again just trying to run it at 30fps
- sprites require even MORE complicated math since they span 2 dimensions (even though they're bilboards)
- sprites are basically "draw over what you just drew again", so unlike the textured raycasting which was just an "upgrade", sprites are purely "all additional load".
but i needed them. so i set to work doing the same thing as before, hunched over my computer and going crazy with a mixture of glee and frustration. i haven't programmed like this in a very long time, i'm getting old (kinda) and this kind of passion doesn't come often anymore. i once again copied the sprite code from the Lodev tutorial and got to work making absolutely massive adjustments, completely redoing most of it. i can tell i was really focused, because there's only like 3 commits between "starting sprites" and here:
a working raycaster, with all the bells and whistles i originally imagined. sure it runs at a paltry 20 fps again but that is PLAYABLE! especially when you consider it's just a tiny 1 inch or so display on a little credit card sized system. assets btw are from https://opengameart.org/content/tileset-1bit-color.
conclusion
i want to thank you for coming along with me on this journey...
is what i'd say if i hadn't gone insane. i'm not finished! i must have it, i must have the holy grail of performance AND features.
let's start by adding some moving and animated sprites. oh haha funny, the previous version already supported that, oops, just had to add the movement and frame update logic.
but i need more performance! 1 byte is enough precision to describe the location of sprites if i reduce the map size to 16x16; i don't have a lot of space anyway and you can simply load a new map with doorways between them. with 4 bits for the integer portion and 4 bits for the fractional part, i'm really pushing this thing. but i still need MORE!
oh are those some floats still leftover because i needed the precision? hacks, get rid of them, i only need 8 bits :)
oh quicksort is faster? no it's not, there's only 16 sprite slots, use insertion sort (this is a real thing)
oops, i'm STILL using some floating points even after removing most of them? remove them all (almost)
oh you need a distance cache to know if a wall is in front of a sprite? half resolution, sorry pixel-perfect fanatics. might as well halve the resolution of the shading calculation while i'm at it
ah i have a teeny tiny map, i can now take up a bit more of RAM with some cached values to increase performance by another couple percent
and then finally, the coup de grace, the absolute most ridiculous, most beneficial performance improvement to date. and i can't believe it worked. loop unrolling. why loop over bits when i can just repeat the code 8 times and loop over bytes? because the latter breeds insanity. i could make a whole blog post about the loop unrolling (and i think i might in the future) but for now... let me just bask in the glow of raycasting
real conclusion
but seriously, thank you for reading through all that. this was a work of passion for me, and even if i don't end up making a game, the process of creating the most optimized piece of code i could was something else. i've recently made even more optimizations to it so i don't have to halve the shading resolution, and i even had enough overhead to add an inverse shading effect to give the illusion of fog, it increases the amount of environments you could do:
and now, if you don't mind, i'll end with a probably eye-destroying collage of the evolution of the raycaster

🏳️⚧️