coolreader18

and we are all sisters

known compiler nerd. jewish, ws71. i'm a maintainer of RustPython (bit less active though cause of ✨burnout✨) and i like open source and rust in general


(transcribed from a rant to a friend)

so i'm switching rustpython from a 64-bit Instruction enum with opargs (e.g. jump target, stack index, flags, etc.) stored as fields in the variants to an 8-bit Instruction enum with opargs stored as u8s out-of-line
so enum Instruction { Jump(0x12_u32) } -> enum Instruction { Jump }; struct CodeUnit { op: Instruction::Jump, arg: 0x12_u8 }
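roughly, the shape of the change is something like this (a sketch, not the literal rustpython definitions):

    // before: each variant carries its oparg inline, so every instruction is
    // sized for the biggest variant (discriminant + a u32 field)
    enum OldInstruction {
        Jump { target: u32 },
        LoadConst { idx: u32 },
        ReturnValue,
    }

    // after: the opcode is a bare 1-byte enum and the oparg sits next to it
    // as a single u8, so one unit of code is 2 bytes
    #[repr(u8)]
    enum Instruction {
        Jump,
        LoadConst,
        ReturnValue,
    }

    struct CodeUnit {
        op: Instruction,
        arg: u8,
    }
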
however, 8 bits is not a lot of space; an index into the constants array (for LoadConst { idx }) overflows easily once there are >255 string/int/tuple/etc literals in the function
CPython's solution is a special ExtendedArg marker opcode: when it encounters, for example, ExtendedArg(0x1), LoadConst(0x23), it interprets that as LoadConst((0x1 << 8) | 0x23)
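the decoding side is basically just an accumulator; a minimal sketch of it (made-up names, not cpython's or rustpython's actual loop):

    enum Instruction { ExtendedArg, LoadConst, Jump }
    struct CodeUnit { op: Instruction, arg: u8 }

    // walk the code, folding ExtendedArg prefixes into the next real oparg
    fn decode(code: &[CodeUnit]) {
        let mut extended: u32 = 0;
        for unit in code {
            // high bits from any preceding ExtendedArg(s), low 8 bits from this unit
            let oparg = (extended << 8) | unit.arg as u32;
            match unit.op {
                // stash the bits; the next real instruction picks them up
                Instruction::ExtendedArg => extended = oparg,
                // a real instruction consumes the accumulated oparg
                Instruction::LoadConst | Instruction::Jump => {
                    extended = 0;
                    let _idx_or_target = oparg; // constants index / jump target
                }
            }
        }
    }
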
it's sort of weird, but, ok, I didn't have any better ideas, and it's totally worth shrinking Instruction by 4x, especially since most instructions won't need anywhere near all 32 possible bits of the oparg
so i'm implementing it in the compiler and i'm realizing that there's sort of an issue: rustpython's bytecode compiler (and cpython's, cause I vaguely modeled this part after it) uses a block structure; instead of branching opcodes having targets that are the index of an instruction in the instructions array, as they are in the finalized bytecode, they're an index of a Block (basically just a Vec<Instruction>), and there's a final pass near the end that goes through and rewrites the block indices to real pc (program counter) values
which works fine when one instruction in the IR is the same size as one instruction in the final code
but what if I loop through all the blocks, record a mapping of block indices -> pc values, and then, when I loop through again to adjust the jump targets and write out the actual bytecode, the encoding size of an oparg has changed?
like what if Jump(BlockIdx(0x30)) -> [Jump(0x30)] becomes Jump(InstrIdx(0x10_42)) -> [ExtendedArg(0x10), Jump(0x42)]
then the pc offsets are all fucked up
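the crux is that how many code units an instruction takes depends on the value of its oparg, and a jump's oparg is a pc that depends on the size of everything before it. cpython has an instrsize() helper for the first half; a hypothetical rust equivalent:

    // how many code units an instruction needs: one for itself, plus one
    // ExtendedArg prefix per extra byte of oparg above the low 8 bits
    fn instr_size(oparg: u32) -> u32 {
        if oparg <= 0xff { 1 }
        else if oparg <= 0xffff { 2 }
        else if oparg <= 0xff_ffff { 3 }
        else { 4 }
    }

so computing block offsets needs instr_size(), instr_size() needs the final opargs, and the final opargs of jumps need the block offsets. circular.
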
so I decided to just stick an assert there that screams if that happens and come back to it later

that was about an hour or two ago

now i decide to go look at the cpython codebase to see if I could figure out what clever data-structures-and-algorithms-course-type solution they've come up with to get around that issue

and i find THIS:

    do {
        extended_arg_recompile = 0;
        for (/* each instr */) {
            adjust(instr_jump_target);
            if (new_instrsize != old_instrsize) {
                extended_arg_recompile = 1;
            }
        }

    /* XXX: This is an awful hack that could hurt performance, but
        on the bright side it should work until we come up
        with a better solution.

        The issue is that in the first loop blocksize() is called
        which calls instrsize() which requires i_oparg be set
        appropriately. There is a bootstrap problem because
        i_oparg is calculated in the second loop above.

        So we loop until we stop seeing new EXTENDED_ARGs.
        The only EXTENDED_ARGs that could be popping up are
        ones in jump instructions.  So this should converge
        fairly quickly.
    */
    } while (extended_arg_recompile);

that was my first thought to fix it too, like, "oh, I guess you could put it in a loop and just hope it doesn't go on forever", but I didn't think that's what the reference implementation of python would do?? I seriously don't mean to make this a judgement on the cpython developers; this totally works fine for what it does (and it does say it's an awful hack). it just was surprising/funny cause in almost every other situation I've been able to be like "well, how does cpython do it" and find a really solid solution, and here it's just "idfk, this works though". same bud.
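for what it's worth, the same fixpoint in rust-flavored pseudocode (a sketch of the shape of it, not what rustpython actually ships):

    struct Instr {
        oparg: u32,
        jump_target: Option<usize>, // Some(block index) if this is a branch
    }

    // same size rule as the instr_size sketch above
    fn instr_size(oparg: u32) -> u32 {
        if oparg <= 0xff { 1 } else if oparg <= 0xffff { 2 }
        else if oparg <= 0xff_ffff { 3 } else { 4 }
    }

    // resolve block-index jump targets to pc offsets, looping until no jump's
    // encoding size changes anymore (i.e. until it converges)
    fn resolve_jumps(blocks: &mut [Vec<Instr>]) {
        loop {
            // pass 1: each block's starting pc, under the current size estimates
            let mut offsets = Vec::with_capacity(blocks.len());
            let mut pc = 0u32;
            for block in blocks.iter() {
                offsets.push(pc);
                pc += block.iter().map(|i| instr_size(i.oparg)).sum::<u32>();
            }
            // pass 2: write the real targets; if any oparg now needs more (or
            // fewer) ExtendedArg prefixes, the offsets above are stale, go again
            let mut changed = false;
            for block in blocks.iter_mut() {
                for instr in block.iter_mut() {
                    if let Some(target) = instr.jump_target {
                        let old = instr_size(instr.oparg);
                        instr.oparg = offsets[target];
                        changed |= instr_size(instr.oparg) != old;
                    }
                }
            }
            if !changed {
                break;
            }
        }
    }

(same bet as the cpython comment above: in practice the sizes settle after a couple of iterations)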



in reply to @coolreader18's post:

This is a classic problem in linkers as well, since short jumps are usually smaller than long jumps even on ISAs with fixed-width instructions like ARM, because a long jump might require two instructions instead of one. There it’s called “linker relaxation.”

I think this iterative process is common, or else you just accept wasting some space.

One method: check whether a jump would fit in the short form even if everything else were in the long form, and if so, relax it. This gets it right for everything but the cases near the boundary. Good if your short range is big enough; otherwise the “uncertain region” eats a bunch of your short-form range.
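A sketch of that heuristic (in Rust to match the rest of the thread; the layout and names here are made up, not from any particular linker):

    // One-pass conservative relaxation: compute offsets assuming every jump
    // takes its long form, then shorten only the jumps that are in range even
    // under that pessimistic layout. Anything near the boundary stays long,
    // so the result is always correct, just not always minimal.
    struct Instr {
        base_size: u32,             // size not counting the long-form penalty
        jump_target: Option<usize>, // index of the target instruction, if a branch
    }

    const LONG_EXTRA: u32 = 1; // extra units a long-form jump costs (assumed)

    fn relax_once(code: &[Instr], short_range: u32) -> Vec<bool /* is short */> {
        // offsets with every jump pessimistically long
        let mut offsets = Vec::with_capacity(code.len());
        let mut pc = 0u32;
        for instr in code {
            offsets.push(pc);
            pc += instr.base_size
                + if instr.jump_target.is_some() { LONG_EXTRA } else { 0 };
        }
        // a jump may be short only if it fits even in this all-long layout
        code.iter()
            .enumerate()
            .map(|(i, instr)| match instr.jump_target {
                Some(t) => offsets[t].abs_diff(offsets[i]) <= short_range,
                None => false,
            })
            .collect()
    }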