• she/they
  1. enby. Retrotech. Demoscene. Programming. Drum & Bass DJ. Amiga Junglist. Autistic. Trans. Adult by circumstance, not by choice.


haloopdy
@haloopdy

previous: https://cohost.org/haloopdy/post/2914882-adventures-in-raycas

in the last one, i finally had a raycasting game running on arduboy, which has an 8 bit 16mhz processor and 2.5kib ram:

untextured raycasting game

but i want more

textured raycasting


textures are significantly more complex than flat surfaces. instead of pulling nice whole bytes out of a shading array, i now have to:

  • calculate the horizontal coordinate within the texture where the ray hit the wall for every ray cast (every vertical slice)
  • read the texture data for that x coordinate
  • for every pixel in the wall stripe, calculate which texture pixel should go there
  • draw each individual pixel of the wall

if you remember from before, individually setting pixels is far slower than doing it per-byte. and performing complicated texture coordinate math EVERY pixel? seems impossible.

let's do it.

first, i rewrite some code so that walls are drawn per-pixel again, and i prep a lot of other code to prepare for textures. i also add that "x coordinate within texture" calculation to the wall. but i'm still not drawing any textures, this is just "prepwork". and...

slower raycaster

the commit message says "HORRIBLY SLOW" but i think i was just spoiled. it dropped from 45 to less than 30 fps just from the prepwork. however, when i went back to the profiler, i noticed something odd.

profiler output showing a seemingly simple left shift producing a huge amount of assembly

what you're seeing is the debug assembly and the profiler's "percentage cpu spent on each assembly instruction". the gray line at the top is the c++ code and the red underneath is the generated assembly.

if you're familiar with this stuff, you might immediately see something is off. why is SO MUCH ASSEMBLY generated for a simple left shift and bitwise AND? it couldn't be... but it was. the AVR instruction set, and this hardware, does not support arbitrary left or right shift, only simple shift by 1. arbitrary left and right shift is achieved with a Barrel Shifter and i guess they just didn't want to include one. as such, these arbitrary left + right shifts, which the compiler can't reason about and optimize, turn into loops. uuggghh. so now, most of the cpu is spent in this silly little loop of CALCULATING pixels (not drawing).

i have a commit message here called "No left shift, brain damage". seems apt; i was not expecting a fundamental operation to simply not exist, though i suppose it's really my ignorance that was the problem. so on this platform, it is actually significantly faster to multiply by 8 than it is to left shift 3. and it's EVEN faster to multiply by 64 than it is to left shift 6. this is why they tell you not to micro-optimize prematurely; it's pretty natural for people to think left shift is faster than multiply, but in this case they'd actually be making things worse.

so i had to make a lookup table for left shift of all things in order to generate individual bitmasks before i could even think about textures. this also meant replacing things like this:

uint16_t bofs = (yStart >> 3) * WIDTH;

with weird stuff like

uint16_t bofs = (yStart & 0b1111000) * BWIDTH;

where BWIDTH is pre-right-shifted, effectively making the math the same (along with the bitwise AND) but significantly faster.

next, i actually draw the texture:

a simple brick wall tile

fancy, i know

and finally, i start copying over the code from the Lodev tutorial again, but this time i'm actively rewriting it all to be as optimized as possible for this system, precalculating as much as i can, reducing data sizes, and... and...

textured raycasting, finally!

WORKING! as i say in my commit message. sure it runs at 20fps but it works. i think i almost cried. being the pessimist that i am, i thought "i won't be able to make this much faster" but i got to work trying anyway

optimizations again

i'll just go through these quickly, but imagine me in a half-crazed state, arched over the computer and trying every weird hack i can to get it to run faster, repeatedly dumping the program into the profiler and rejoicing at tiny percentages gained.

  • reduced texture precision (2 bytes instead of 4 for calculations)
  • removing safety out of bounds checks for texture reading by making a 16 bit left shift lookup table with 17 entries, the last one duplicated
  • using do-while loops instead of for loops to reduce the cost of branching (it's complicated, this isn't a normal optimization)
  • replace 2 of the divisions with a hacky lookup table (this was the most improvement by FAR)

i want to talk about that division a bit. if you recall before, the loop only had 4 total divisions in it (just repeated every vertical slice). i was able to remove one of them before, but that still left 3. i was now able to remove a further TWO of them by using a "large" division lookup table. i had the luxury of knowing that all the divisions would be of the form 1 / (0 < x <= 1), and the fact that x in the range of 0 to 1 is only 8 bits means the table only needs 256 entries. i unfortunately ended up having to do some hacks because sometimes x can be slightly greater than 1, but the division lookup table still saved an absolutely ENORMOUS amount of performance. so much so that:

raycaster with textures running at 35fps

i got it back up to running at a smooth 35fps again. i really can't tell you how accomplished i felt in that moment. i felt so accomplished that i felt i could tackle:

sprites

ok, a fast AND textured raycaster is good and all, but like... again, what game are you going to play that just has walls, even if you can put fancy stuff on them. i still needed more, and sprites were the answer. with sprites, i could (theoretically) make a real game.

sprites come with their OWN complexities though:

  • i was already pushing the cpu up to 100% again just trying to run it at 30fps
  • sprites require even MORE complicated math since they span 2 dimensions (even though they're bilboards)
  • sprites are basically "draw over what you just drew again", so unlike the textured raycasting which was just an "upgrade", sprites are purely "all additional load".

but i needed them. so i set to work doing the same thing as before, hunched over my computer and going crazy with a mixture of glee and frustration. i haven't programmed like this in a very long time, i'm getting old (kinda) and this kind of passion doesn't come often anymore. i once again copied the sprite code from the Lodev tutorial and got to work making absolutely massive adjustments, completely redoing most of it. i can tell i was really focused, because there's only like 3 commits between "starting sprites" and here:

raycasting with sprites!

a working raycaster, with all the bells and whistles i originally imagined. sure it runs at a paltry 20 fps again but that is PLAYABLE! especially when you consider it's just a tiny 1 inch or so display on a little credit card sized system. assets btw are from https://opengameart.org/content/tileset-1bit-color.


conclusion

i want to thank you for coming along with me on this journey...

is what i'd say if i hadn't gone insane. i'm not finished! i must have it, i must have the holy grail of performance AND features.

let's start by adding some moving and animated sprites. oh haha funny, the previous version already supported that, oops, just had to add the movement and frame update logic.

but i need more performance! 1 byte is enough precision to describe the location of sprites if i reduce the map size to 16x16; i don't have a lot of space anyway and you can simply load a new map with doorways between them. with 4 bits for the integer portion and 4 bits for the fractional part, i'm really pushing this thing. but i still need MORE!

oh are those some floats still leftover because i needed the precision? hacks, get rid of them, i only need 8 bits :)

oh quicksort is faster? no it's not, there's only 16 sprite slots, use insertion sort (this is a real thing)

oops, i'm STILL using some floating points even after removing most of them? remove them all (almost)

oh you need a distance cache to know if a wall is in front of a sprite? half resolution, sorry pixel-perfect fanatics. might as well halve the resolution of the shading calculation while i'm at it

ah i have a teeny tiny map, i can now take up a bit more of RAM with some cached values to increase performance by another couple percent

and then finally, the coup de grace, the absolute most ridiculous, most beneficial performance improvement to date. and i can't believe it worked. loop unrolling. why loop over bits when i can just repeat the code 8 times and loop over bytes? because the latter breeds insanity. i could make a whole blog post about the loop unrolling (and i think i might in the future) but for now... let me just bask in the glow of raycasting

a smooth, actually working raycaster

real conclusion

but seriously, thank you for reading through all that. this was a work of passion for me, and even if i don't end up making a game, the process of creating the most optimized piece of code i could was something else. i've recently made even more optimizations to it so i don't have to halve the shading resolution, and i even had enough overhead to add an inverse shading effect to give the illusion of fog, it increases the amount of environments you could do:

raycaster with fog instead of shadows

and now, if you don't mind, i'll end with a probably eye-destroying collage of the evolution of the raycaster

show many gifs the first raycasting attempt added some shading smoother shading smooth fps maze generation first textured raycaster fast textured raycaster sprites! final version

You must log in to comment.

in reply to @haloopdy's post:

This is a really neat project, and very impressive results!

In your example:

uint16_t bofs = (yStart & 0b1111000) * BWIDTH;

why mask out the rightmost bits of yStart? I can see that that makes it equivalent to your previous code, but (without context) it seems like that’s just losing precision on the operation. It looks like you’re switching from yStart / 8 * WIDTH to yStart * (WIDTH/8) but throwing away three bits of yStart first (which of course takes another operation!)

i'll start with the simple answer and then do the complicated answer; you don't have to read the complicated answer if you don't want lol

consider the y value 9, with bits 0b1001. (9 >> 3) * 128 = 128, because 9 >> 3 = 1. now consider the special BWIDTH=16 which is 128 pre-rightshifted by 3. 9 * 16 = 144, which is incorrect. and so, i must mask off the bottom bits using & to replicate the right shift: 9 & 0b11111000 * 16 = 8 * 16 = 128

now for the complicated answer. consider that, in reality, i need to march through each y coordinate of the screen 1 by 1 from wall start to wall finish. the most naive loop would be:

for(uint8_t y = wallstart; y < wallend; y++)
   arduboy.drawPixel(x, y, WHITE);

this would create very simple untextured wall with no shading. now consider that drawPixel is very slow, and i have access to the actual screen buffer. the screen buffer is a simple 1 dimensional buffer of size 1024. each byte represents a column (vertical) of 8 pixels, and each byte moves to the right until eventually moving to the next row. if i wanted to find the byte index (i call bofs) that contained the coordinate (x,y), the code would be:

//+x is actually correct, i don't know why i didn't include it before
uint8_t bofs = x + (y >> 3) * WIDTH;

where WIDTH is screen width, or 128. right shift 3 is the same as dividing by 8. so, in effect, we index normally with x, then offset by row length for each y coordinate but only for steps of 8 in y, since each byte holds 8 vertical pixels.

now, consider i want to get rid of that right shift. the above code has the effect that the 3 rightmost bits are "taken off", regardless what they are. this makes sense: the y values from 0 (0b0000) to 7 (0b0111) are all within the same byte, and starting from 8 (0b1000) and going to 15 (0b1111) are in the next byte. so the bottom three bits are essentially "ignored", and each step of (y >> 3) * WIDTH increments by WIDTH. if width is 128, there's no way to get, say, 64 out of that equation, or 16, or 144, only pure multiples of 128.

so when removing the rightshift by precalculating WIDTH >> 3 = 16, i need to make sure it operates the same, that no values other than multiples of 128 are used, even though i'm multiplying by 16. that's what the & is for, the bottom 3 bits make absolutely sure that y & 0b11111000 is a multiple of 8, and so (multiple of 8) * (multiple of 16) will always equal (multiple of 128).

that's a really long way of saying "more precision isn't always desired", since the point of the original code is to skip through chunks of 128 and not smaller ones.

Oh, division lookup table is an interesting idea. I kinda wonder why I didn't think of that; because I've been trying to code a simple raycaster ( https://twitter.com/sleeps_darkly/status/1637717739363254272 ) for GBA, and got tripped hard on that the ARM7TDMI cpu does not have hardware integer division. So I worked with making it fixed point, but it had some weird accumulating error which caused everything slowly flatten with time; so in the end I abandoned it.

ah, i have no idea what the difficulty difference would be. i have the benefit of working with only bits, though perhaps the fact that the gba has 32 bit operations on the ALU would make up for it, i don't know. i've seen impressive 3d on the gba (like an attempted tomb raider port, https://www.youtube.com/watch?v=_GVSLcqGP7g) but maybe they're just ultra geniuses and a raycaster is even more difficult than what i'm doing here lol. it's cool that you tried, if you get the urge you should try it again! though judging from the screenshot (2fps, my goodness) i would assume it is much more difficult on the gba than on the arduboy, as i simply copied the initial code from the raycasting tutorial and it ran pretty OK.