2015-10-28

Super Mario 64: Known values and bits

Three very common optimization possibilities (i.e. many non-trivial subroutines can be optimized in these ways) in Super Mario 64 are exemplified by the following pieces of code. I had to extract the pieces of code in a different way for this entry, so this time, their offsets are Nintendo 64 memory addresses, not file offsets.

You should read the entries on the MIPS instruction set first.

8027BE14: SW $24, 28($29) LW $25, 28($29) 80286790: LUI $10, 0x8034 LUI $8, 0x8034 LH $8, -14516($8) LH $10, -14574($10) LUI $1, 0x8034 80288CE4: ANDI $4, $4, 0xFFFF ANDI $5, $5, 0xFFFF ANDI $6, $6, 0xFFFF ANDI $14, $5, 0x000F OR $5, $14, $0 ANDI $15, $5, 0xFFFF

Available expressions (#avail)

Available expression analysis can be performed on the first block: 8027BE14: SW $24, 28($29) LW $25, 28($29) in order for the LW to use the result stored by the SW.

On 32-bit MIPS processors, the LW is completely unneeded, and reads of $25 can simply be replaced with reads of $24, the register that still contains the value stored to stack slot 28.

However, on 64-bit MIPS processors, which the Nintendo 64 contains, the LW forces the value loaded from memory to be properly sign-extended from 32 to 64 bits. Because the arguments to subroutines may or may not be properly sign-extended, the LW is actually useful here; it's just wasteful in terms of memory access. It can be replaced with ADDU without any change in program meaning: 8027BE14: SW $24, 28($29) ADDU $25, $24, $0

This only works if there are no aliases involved and if the value stored to the stack slot is still in a register, hence requiring an available expression analysis.

Constant reuse as a post-pass optimization (#constreuse)

A post-pass optimizer could replace sequences of code that use the same constant but then overwrite their own content such that the constant becomes unavailable immediately: 80286790: LUI $10, 0x8034 LUI $8, 0x8034 LH $8, -14516($8) LH $10, -14574($10) LUI $1, 0x8034 so that they load the constant once into the last written register and perform all operations based on that new register: 80286790: LUI $1, 0x8034 LH $8, -14516($1) LH $10, -14574($1)

As of the end of this code, the only three things that need to happen are as follows:

  • $1 contains 0x80340000;
  • $8 contains the 32-bit value loaded from memory at 0x8033C74C;
  • $10 contains the 32-bit value loaded from memory at 0x8033C712.

This is what the rewritten code does, in as many instructions.

Such a chain of instructions contains LUI instructions, all loading the same constant value, followed by code that reads the value and overwrites it in the same register. LUI ADDIU, LUI ORI and LUI followed by any memory load instruction all match this description. The code simplification needs to use LUI on the last written register first, replace references to the duplicate registers with that register, and delete the superfluous LUI instructions.

Value range analysis (#range)

Value range analysis, more specifically bitmask analysis (as described in the paper "Range and Bitmask Analysis for Hardware Optimization in High-Level Synthesis" by Marcel Gort and Jason H. Anderson, University of Toronto), allows a compiler to figure out some unneeded bitwise operations.

The compiler did not perform this analysis; let's follow the code in this snippet to figure out where and how: 80288CE4: ANDI $4, $4, 0xFFFF ; Bits #63..16 of $4 now known to be 0 ANDI $5, $5, 0xFFFF ; Bits #63..16 of $5 now known to be 0 ANDI $6, $6, 0xFFFF ; Bits #63..16 of $6 now known to be 0 ANDI $14, $5, 0x000F ; Bits #63..4 of $14 now known to be 0 OR $5, $14, $0 ; Move $14 (bits #3..0 of $5) back into $5 ANDI $15, $5, 0xFFFF ; Ensure bits #63..16 of $5 are unset

First, as is usual in subroutines, this piece of code makes sure that its arguments, of type unsigned short (uint16_t), do not have any garbage bits set. Then, it computes argument 2 & 0xF and stores this back to an unsigned short (uint16_t) variable. The compiler did not realize that bits #63..16 were already 0, though. The last ANDI instruction is unneeded, so it can be deleted, and later reads of $15 replaced with reads of $5.

No comments :

Post a Comment