rgmechex
@rgmechex

In my latest video about the Cloud item swap glitch in Super Mario World, I discussed how it requires the SNES's CPU to execute from "open bus"--a term reserved for the memory region where no hardware is mapped. Essentially, this means that when the CPU tries to read a byte from this region, it reads a stale value that is held in an internal register (the Memory Data Register, or MDR). Instruction fetch operations also use the MDR, so even instruction opcodes, operands, and other internal cycle operations can affect the MDR value and ultimately what value is read from "open bus" regions. For a better explanation, you should really just check out the video--the 2nd chapter is all about how open bus works.

[Video thumbnail: 'Using Lakitu's Cloud to Defeat Bowser Quickly']
This video is over an hour long, just as a warning!
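To make the open bus idea concrete, here is a minimal Python sketch of a bus that remembers the last value it carried. The `Bus` class, its method names, and the addresses used are all made up for illustration; real hardware is more involved than this toy model.

```python
class Bus:
    """Toy model of a data bus with open-bus behaviour (illustrative only)."""

    def __init__(self, mapped):
        self.mapped = dict(mapped)  # address -> byte, for locations with hardware
        self.mdr = 0x00             # last value that was on the data bus

    def read(self, addr):
        # A mapped location drives the bus; an unmapped one leaves the old value.
        if addr in self.mapped:
            self.mdr = self.mapped[addr]
        return self.mdr

    def write(self, addr, value):
        # The CPU drives the bus on every write, so the MDR always changes.
        self.mdr = value & 0xFF
        if addr in self.mapped:
            self.mapped[addr] = self.mdr


# Hypothetical setup: two mapped bytes, everything else is open bus.
bus = Bus({0x7E0000: 0x20, 0x0001FF: 0x00})
first = bus.read(0x7E0000)        # mapped: returns 0x20 and latches it
stale = bus.read(0x045678)        # unmapped: the latched 0x20 comes back
bus.write(0x0001FF, 0x56)         # a write latches 0x56
after_write = bus.read(0x045678)  # unmapped again: now reads 0x56
print(hex(first), hex(stale), hex(after_write))
```

The last read is the important one: a write changed what open bus "contains", which is exactly what matters for the stack pushes discussed later.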

Today I'm going to go into detail about a few caveats with executing from open bus. While it is pretty complicated, most instructions behave in a way that you would expect. There are some others that are a bit different, and can greatly affect an open bus codepath.


(This article will be written in the context of the Super NES's 65c816 CPU.)

Instructions take multiple machine cycles to execute. Some, like CLC (clear carry flag) only take 2, the fewest possible. Some, like INC $yyxx,X (increment) take 7, since they require more work to execute. The CLC only has to clear the carry flag, a single bit in the CPU's processor flag register, while the INC has to read a byte from memory, add 1 to it, and then write it back to memory.

[Figure: CLC takes only 2 cycles, while INC $yyxx,X takes 7.]
We're going to assume for this article that all machine cycles take the same amount of time. (Surprise! Some can be faster or slower than others, depending on how many master cycles each machine cycle uses. This depends on memory and data bus speeds.)

All instructions share their first cycle of execution--fetching the opcode. This is the byte that tells the CPU what instruction is coming up next. This is trivial, and necessary as you can see. Can't know what to do if you don't even look at the instruction in the first place!

All instructions also use up at least one other cycle. Even NOP (no operation) takes up 2 cycles in total. I like to believe that the CPU can't really prepare for what's coming on this cycle, since it has to wait for the opcode to be fetched from memory, so it can't yet tell whether it needs to do anything at all. That makes the minimum number of cycles per instruction 2.

Most instructions will have operands. For example, the INC $yyxx,X from before has two operand bytes--they are combined to form a 16-bit memory address. You can reasonably assume that, just like the instruction opcode, the instruction operands are pretty important to be able to execute the instruction successfully. How are you going to increment a byte in memory without knowing what the memory address in question is?

This is why for almost all instructions, the machine cycles following the opcode fetch cycle are the operand fetch cycles. So the INC $yyxx,X fetches the INC opcode ($FE) on the first cycle, then the two bytes that form the $yyxx on the next two cycles.
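As a small illustration of that fetch order, the sketch below lays out the three bytes of INC $yyxx,X the way they sit in memory: opcode first, then the low operand byte, then the high one. The helper name and the address $1234 are just examples, not anything from the video.

```python
def encode_inc_abs_x(address):
    """Return the bytes of INC $yyxx,X in the order the CPU fetches them."""
    opcode = 0xFE                  # INC absolute,X
    low = address & 0xFF           # $xx, fetched on cycle 2
    high = (address >> 8) & 0xFF   # $yy, fetched on cycle 3
    return bytes([opcode, low, high])


encoded = encode_inc_abs_x(0x1234)
print(encoded.hex(" "))  # fe 34 12
```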

[Figure: JMP $yyxx takes 3 cycles, NOP takes 2, STA ($xx,S),Y takes 7, RTI takes 7, LDA ($xx,X) takes 6, and PLA takes 4.]
See if you can figure out what exactly each cycle is used for! Some of them are really tricky.

There are two exceptions to this! But first let's see how the JSR $yyxx instruction works.

This instruction takes 6 machine cycles to execute. The first is the opcode fetch, and the next two are the operand fetches, just like I just described. On the fourth cycle, the CPU is idle (can't say exactly why, it's beyond my ability!). Then on the last two cycles, the program counter (see footnote 1) is pushed to the stack--high byte first, then the low byte second.

[Figure: JSR $yyxx uses 6 machine cycles to execute.]
I want to say the idle cycle is actually when the program counter is decremented by 1 for reasons. (And operand fetch always increments the program counter by 1.)

Okay, not too crazy. Now let's look at how the JSL $zzyyxx instruction works. You might expect the same exact workflow, but with 24-bit addresses instead of 16-bit addresses. And indeed, this instruction takes 8 cycles to execute (one extra for reading the longer operand, and one extra for pushing the program bank too).

But that's not how it works. The first cycle is the opcode fetch, and the next two are operand fetches. This gets the lower 16 bits of the target address. However, it's at this point, on the fourth cycle, that the program bank (the upper 8 bits of the 24-bit program counter) is pushed to the stack. Then on the fifth cycle, the CPU is idle like before, and on the sixth cycle, the final byte of the operand is finally fetched--the bank of the target address. Then the lower 16 bits of the program counter are pushed to the stack on the final two cycles.

[Figure: JSL $zzyyxx takes 8 machine cycles to execute.]
Maybe the last operand fetch cycle is a different kind of operand fetch that doesn't increment the program counter like all the others. Either way, still don't know what the idle cycle does here.

Even though things are done a bit out of order, the end result is the same. This also happens with the JSR ($yyxx,X) instruction:

First cycle is an opcode fetch, second is an operand fetch. Then the 16-bit program counter (see footnote 1) is pushed to the stack on the third and fourth cycles. The fifth cycle fetches the other half of the operand address, and in the sixth cycle, the X register is added to the result to form the indirect address. Then the 16-bit address at that location in memory is read, using the last two cycles.

[Figure: JSR ($yyxx,X) takes 8 machine cycles to execute.]
Here, the second operand fetch may or may not increment the program counter--but it doesn't matter since it just gets overwritten with the 2 bytes read from memory on the final 2 cycles.
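The three cycle orders described above can be written down as plain data, which makes the odd ones easy to spot. This is only a sketch: the cycle labels and the helper `late_operand_fetches` are names invented here, and the idle and index-add cycles are both lumped together as "internal".

```python
OPCODE, OPERAND, INTERNAL, PUSH, POINTER = (
    "opcode fetch", "operand fetch", "internal", "stack push", "pointer read")

# Per-cycle schedules as described in the text (cycle 1 is the first entry).
SCHEDULES = {
    "JSR $yyxx": [OPCODE, OPERAND, OPERAND, INTERNAL, PUSH, PUSH],
    "JSL $zzyyxx": [OPCODE, OPERAND, OPERAND, PUSH, INTERNAL, OPERAND, PUSH, PUSH],
    "JSR ($yyxx,X)": [OPCODE, OPERAND, PUSH, PUSH, OPERAND, INTERNAL, POINTER, POINTER],
}


def late_operand_fetches(schedule):
    """Return the 1-based cycles where an operand is fetched after a stack push."""
    seen_push = False
    late = []
    for cycle, kind in enumerate(schedule, start=1):
        if kind == PUSH:
            seen_push = True
        elif kind == OPERAND and seen_push:
            late.append(cycle)
    return late


for name, schedule in SCHEDULES.items():
    print(f"{name}: {len(schedule)} cycles, late operand fetches at {late_operand_fetches(schedule)}")
```

Only JSL (cycle 6) and JSR ($yyxx,X) (cycle 5) fetch part of their operand after something has already been written to the stack.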

Normally, the software developer doesn't need to even be aware of how this works. They may need to know how many cycles total an instruction can use up, but they don't really need to know what exactly is happening on each of these cycles. Buuuut, this does affect how execution within open bus works!


Let's see what happens when we are executing from open bus then. Suppose the MDR had a value of $20, and we were executing from $045678. (The program bank is $04 and the program counter is $5678.)

  1. The CPU fetches an opcode, and receives the $20 from the MDR.
  2. $20 corresponds to the JSR $yyxx instruction, so the lower 8 bits of the operand are read--again $20 from the MDR.
  3. The upper 8 bits of the operand are read--also $20 from the MDR, since it hasn't been changed at all yet.
  4. The CPU is idle on the fourth cycle of a JSR $yyxx instruction.
  5. The upper 8 bits of the program counter are pushed to the stack. This means $56 is written to the stack, and the MDR becomes $56.
  6. The lower 8 bits of the program counter are pushed to the stack. For reasons explained in footnote 1, $7A gets written to the stack, and the MDR becomes $7A.

The instruction ended up being JSR $2020, and the MDR finished with the lower 8 bits of the return address. This is typical behavior of open bus execution, where the operand bytes just match the opcode byte. Other examples include JMP $4C4C, JML $5C5C5C, SBC $FFFFFF,X, LDA ($B2), etc.
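For instructions that finish fetching their operand before anything is written, the decoded instruction is just the opcode byte repeated. A rough sketch, assuming exactly that (the table `EXAMPLES` and the function name are made up here, and only the opcodes mentioned above are included):

```python
# opcode -> (display template, number of operand bytes)
EXAMPLES = {
    0x20: ("JSR ${:04X}", 2),
    0x4C: ("JMP ${:04X}", 2),
    0x5C: ("JML ${:06X}", 3),
    0xFF: ("SBC ${:06X},X", 3),
    0xB2: ("LDA (${:02X})", 1),
}


def decode_from_open_bus(mdr):
    """Decode an instruction whose opcode and operand bytes all read as `mdr`."""
    template, operand_len = EXAMPLES[mdr]
    # Every operand fetch returns the same stale byte as the opcode fetch.
    operand = int.from_bytes(bytes([mdr] * operand_len), "little")
    return template.format(operand)


decoded = [decode_from_open_bus(op) for op in EXAMPLES]
print(decoded)
```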

However, let's do the same thing if the MDR had a value of $22, and we were executing from $045678 again.

  1. The CPU fetches an opcode, and receives the $22 from the MDR.
  2. $22 corresponds to the JSL $zzyyxx instruction, so the lower 8 bits of the operand are read--again $22 from the MDR.
  3. The upper 8 bits of the operand are read--also $22 from the MDR, since it hasn't been changed at all yet.
  4. The program bank byte is pushed to the stack. This means $04 is written to the stack, and the MDR becomes $04.
  5. The CPU is idle on the fifth cycle of a JSL $zzyyxx instruction.
  6. The bank byte of the operand is read--except now this is $04 from the MDR since that's what was written last.
  7. The upper 8 bits of the program counter are pushed to the stack. This means $56 is written to the stack, and the MDR becomes $56.
  8. The lower 8 bits of the program counter are pushed to the stack. For reasons explained in footnote 1, $7B gets written to the stack, and the MDR becomes $7B.

The instruction ended up being JSL $042222, and the MDR finished with the lower 8 bits of the return address again. Notice that not all of the operand bytes are identical this time! Due to the different order of operations for the JSL instruction, the bank byte is different. In fact, it will always match the current program bank, so a JSL from open bus will always stay in the same program bank it was executed in (which pretty much defeats the purpose of a JSL in the first place!).
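The eight steps above can be replayed in a few lines of Python. This is a sketch under the same assumptions as the walkthrough (every fetch hits open bus, every push updates the MDR); the function name and return shape are invented for illustration.

```python
def jsl_from_open_bus(program_bank, pc, mdr=0x22):
    """Replay the 8 cycles of JSL with every fetch reading the MDR."""
    ret = (pc + 3) & 0xFFFF   # return address: next instruction minus one
    stack = []

    def push(value):
        nonlocal mdr
        stack.append(value)
        mdr = value           # a write puts its value on the bus

    opcode = mdr              # cycle 1: opcode fetch
    low = mdr                 # cycle 2: operand low byte
    high = mdr                # cycle 3: operand high byte
    push(program_bank)        # cycle 4: push program bank
    #                           cycle 5: idle
    bank = mdr                # cycle 6: operand bank byte, now the pushed bank
    push(ret >> 8)            # cycle 7: push return address high byte
    push(ret & 0xFF)          # cycle 8: push return address low byte
    target = (bank << 16) | (high << 8) | low
    return opcode, target, stack, mdr


opcode, target, stack, final_mdr = jsl_from_open_bus(0x04, 0x5678)
print(f"JSL ${target:06X}, pushed {[f'${b:02X}' for b in stack]}, MDR ${final_mdr:02X}")
```

Trying other values for `program_bank` shows the same pattern every time: the target bank always equals the bank the instruction was executed in.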

Just for completeness, let's look at the last instruction. Suppose the MDR had a value of $FC, and we're still executing from $045678.

  1. The CPU fetches an opcode, and receives the $FC from the MDR.
  2. $FC corresponds to the JSR ($yyxx,X) instruction, so on the second cycle, the lower 8 bits of the operand are read--again $FC from the MDR.
  3. The upper 8 bits of the program counter are pushed to the stack. This means $56 is written to the stack, and the MDR becomes $56.
  4. The lower 8 bits of the program counter are pushed to the stack. For reasons explained in footnote 1, $7A gets written to the stack, and the MDR becomes $7A.
  5. The upper 8 bits of the operand are read--except now this is $7A from the MDR since that's what was written last.
  6. The X register is added to the operand address of $7AFC. We'll just say X is $00 for simplicity's sake. This makes the indirect address $7AFC.
  7. The lower 8 bits of the target address are read at the indirect address $047AFC. In this instance, this is also open bus, so the value $7A is read.
  8. The upper 8 bits of the target address are read at the indirect address plus one, $047AFD. This is open bus yet again, so another $7A is read.

The instruction ended up being JSR ($7AFC,X), and the target address ended up being $7A7A. With this instruction, the upper byte of the operand address is based off of the current program counter. So that means this instruction will read from different locations in memory depending on where the instruction is executed in memory itself! The target instruction may or may not come from open bus--if it does, it is also based off of the return address, just like in this example.
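Here is a sketch of those eight cycles that also lets some addresses be mapped, to show both outcomes for the target address. The function, the `mapped` dictionary, and the jump table bytes in the second call are all hypothetical; only the first call mirrors the walkthrough above.

```python
def jsr_abs_x_from_open_bus(bank, pc, x=0x00, mdr=0xFC, mapped=None):
    """Trace JSR ($yyxx,X); `mapped` holds any bytes that are NOT open bus."""
    mapped = mapped or {}
    ret = (pc + 2) & 0xFFFF            # return address: next instruction minus one

    def read(offset):
        nonlocal mdr
        addr = (bank << 16) | (offset & 0xFFFF)
        if addr in mapped:             # mapped memory drives the bus
            mdr = mapped[addr]
        return mdr                     # otherwise the stale MDR comes back

    def push(value):
        nonlocal mdr
        mdr = value                    # a write always updates the MDR

    opcode = read(pc)                  # cycle 1: opcode fetch
    low = read(pc + 1)                 # cycle 2: operand low byte
    push(ret >> 8)                     # cycle 3: push return address high byte
    push(ret & 0xFF)                   # cycle 4: push return address low byte
    high = read(pc + 2)                # cycle 5: operand high byte (reads the push)
    operand = (high << 8) | low
    pointer = (operand + x) & 0xFFFF   # cycle 6: add X
    target_low = read(pointer)         # cycle 7: target low byte
    target_high = read(pointer + 1)    # cycle 8: target high byte
    return opcode, operand, (target_high << 8) | target_low


all_open = jsr_abs_x_from_open_bus(0x04, 0x5678)
# Hypothetical: pretend a jump table entry of $8000 is mapped at $047AFC.
with_table = jsr_abs_x_from_open_bus(
    0x04, 0x5678, mapped={0x047AFC: 0x00, 0x047AFD: 0x80})
print([f"${v:04X}" for v in all_open[1:]], [f"${v:04X}" for v in with_table[1:]])
```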

This actually made the difference in an old Super Mario World credits warp route. Instead of the instruction being JSR ($FCFC,X), it was JSR ($17FC,X), since the return address that got pushed to the stack was $4A17. This means the jump table was located in work RAM, and at a particular byte that could be manipulated by controlling Mario, allowing for arbitrary code execution.
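The relationship between the pushed return address and the operand boils down to one line of arithmetic. A quick sketch (the function name is made up, and the check that $0000-$1FFF is work RAM is an assumption about the bank the code was running in):

```python
def operand_for_open_bus_jsr_fc(return_address):
    """Operand of JSR ($yyxx,X) run from open bus with an MDR of $FC."""
    low = 0xFC                      # fetched before the pushes: still the opcode byte
    high = return_address & 0xFF    # fetched after the pushes: return address low byte
    return (high << 8) | low


smw_operand = operand_for_open_bus_jsr_fc(0x4A17)
example_operand = operand_for_open_bus_jsr_fc(0x567A)
# Assumption: offsets $0000-$1FFF in this bank mirror work RAM.
in_work_ram = smw_operand < 0x2000
print(f"${smw_operand:04X}", f"${example_operand:04X}", in_work_ram)
```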


You can support Retro Game Mechanics Explained on Patreon here! Any and all support is greatly appreciated!


  1. In the 65xx series processors, the memory address that is pushed to the stack for JSR and JSL return addresses is the location of the next instruction minus one. So even though, say, a JSL takes up 4 bytes, if JSL $zzyyxx is located at address $01C000, $01C003 is pushed to the stack, even though the following instruction starts at $01C004.
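The "minus one" rule is easy to check with a bit of arithmetic. This sketch uses made-up names, shows the result as a full 24-bit address for readability (JSR only pushes the lower 16 bits), and assumes the program counter wraps within its bank:

```python
LENGTHS = {"JSR": 3, "JSL": 4}   # instruction lengths in bytes


def pushed_return_address(address, mnemonic):
    """Address pushed by JSR/JSL at `address`: the next instruction minus one."""
    bank = address & 0xFF0000
    pc = address & 0xFFFF
    # Assumption: the 16-bit program counter wraps inside the current bank.
    return bank | ((pc + LENGTHS[mnemonic] - 1) & 0xFFFF)


footnote_case = pushed_return_address(0x01C000, "JSL")
jsr_case = pushed_return_address(0x045678, "JSR")
jsl_case = pushed_return_address(0x045678, "JSL")
print(f"${footnote_case:06X} ${jsr_case:06X} ${jsl_case:06X}")
```

The last two values match the $7A and $7B low bytes that get pushed in the open bus walkthroughs.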

