2015-10-17

Super Mario 64 subroutine 6CD8, part 2: Memory optimization

This entry will focus on a subroutine in Super Mario 64, USA version, for the Nintendo 64. It's a MIPS III subroutine at offset 6CD8 (hexadecimal) within the ROM.

I already simplified this subroutine in part 1, about basic blocks. The original subroutine, as found in the ROM, is in that part. This entry will build on the simplified version.

Previously on Old Games, New Tech

6CD8: ADDIU $29, $29, -40 ; Grow the stack by 40 bytes SW $31, 28($29) ; There are JALs below, so store the return address SW $4, 40($29) ; Store argument 1 to the stack SW $5, 44($29) ; Store argument 2 to the stack SW $16, 24($29) ; Preserve the caller's value of $16 SW $0, 36($29) ; Initialize stack slot 36 to 0 LH $16, 42($29) ; Grab the low 16 bits of stack slot 40 (big-endian) BEQ $16, $0, 6D10 ; If those bits are equal to 0, go to 6D10 NOP 6CFC: ADDIU $1, $0, 1 BEQ $16, $1, 6D24 ; If those bits are equal to 1, go to 6D24 NOP BEQ $0, $0, 6D30 ; Unconditional branch to 6D30 NOP 6D10: JAL 0x8024BA8C ; Subroutine call - may use our values of $4 and $5 NOP SW $2, 36($29) ; Store the subroutine's return value into stack slot 36 BEQ $0, $0, 6D30 ; Unconditional branch to 6D30 NOP 6D24: JAL 0x8024B9B8 ; Subroutine call - may use our values of $4 and $5 NOP SW $2, 36($29) ; Store the subroutine's return value into stack slot 36 6D30: LW $2, 36($29) ; Load our return value from stack slot 36 LW $31, 28($29) ; Restore our return address after JALs LW $16, 24($29) ; Restore the caller's value of $16 ADDIU $29, $29, 40 ; Return the stack to its old size JR $31 ; Return after the JAL that called us NOP

Read-after-write dependency elimination (#raw)

An optimization that works within single basic blocks, read-after-write dependency elimination reuses a value that has been stored to memory (and is still in a processor register) instead of reloading it.

Such an optimization is possible here: 6CE0: SW $4, 40($29) ; Store argument 1 to the stack ... LH $16, 42($29) ; Grab the low 16 bits of stack slot 40 (big-endian) However, reusing the value of $4 instead of using LH requires sign-extending it from 16 bits to 32: 6CE0: SW $4, 40($29) ; Store argument 1 to the stack ... SLL $16, $4, 16 ; Kill the upper 16 bits of $4 into $16 SRA $16, $16, 16 ; Bring 16 copies of bit #31 down On some processors, the impact of adding this instruction (and polluting the instruction cache with it) is greater than the impact of reloading the value from the data cache. This optimization is therefore not always warranted.

Stack slot elimination (#sse)

If a stack slot is used to hold a local variable, and its address is not taken (e.g. to pass a pointer to it to another subroutine), modern compilers will try to keep its value in a register. That register may need to be written to the stack later on, due to the need to preserve caller-saved registers across subroutine calls, but most of the time, this optimization is worthwhile.

This subroutine has one such stack slot: the one 36 bytes after the stack pointer, used to hold the return value. What if, instead of allocating a stack slot, a register was used?

Modern compilers use the static single assignment (SSA) form to represent values that don't exist in memory in order to allocate registers to them. Here, the stack slot elimination results in roughly this function: function 6CD8(%arg1 signed 16-bit, %arg2) { _start: %1 = 0 decide %arg1Lo == 0? yes: is0; no: test1 test1: decide %arg1Lo == 1? yes: is1; no: returnValue is0: %2 = call 0x8024BA8C(%arg1, %arg2) goto returnValue is1: %3 = call 0x8024B9B8(%arg1, %arg2) goto returnValue returnValue: %4 = phi(%1 from test1, %2 from is0, %3 from is1) set return value to %4 return }

Wait... phi? What's that? (#phi)

The Phi node is a kind of node in SSA which merges multiple values that come from multiple basic blocks but are logically considered to be one value for further processing. If a register is assigned to such a node, it makes sense to assign the same register to all of the nodes that could make up its value. If that register is not available for all of the referenced nodes, then it is required to move the value into the register assigned to the Phi node when generating the final machine code.

With that in mind, let's assign $8 to the Phi node and rewrite the subroutine: 6CD8: ADDIU $29, $29, -40 SW $31, 28($29) SW $4, 40($29) SW $5, 44($29) SW $16, 24($29) OR $8, $0, $0 ; Initialize $8 to 0 LH $16, 42($29) BEQ $16, $0, 6D10 NOP 6CFC: ADDIU $1, $0, 1 BEQ $16, $1, 6D24 NOP BEQ $0, $0, 6D30 NOP 6D10: JAL 0x8024BA8C NOP ADDU $8, $2, $0 ; Store the subroutine's return value into $8 BEQ $0, $0, 6D30 NOP 6D24: JAL 0x8024B9B8 NOP ADDU $8, $2, $0 ; Store the subroutine's return value into $8 6D30: ADDU $2, $8, $0 ; Load our return value from $8 LW $31, 28($29) LW $16, 24($29) ADDIU $29, $29, 40 JR $31 NOP

Data movement optimization (#move)

If a piece of data is being moved from one register to another, especially the result of a Phi node, it may be worthwhile to rewrite the code as if the value was written to the moved-to register to begin with, while keeping constraints on callee-saved and caller-saved registers, and see if the resulting subroutine is shorter. In this subroutine's case, there is this instruction: ADDU $2, $8, $0 ; Load our return value from $8 Other moves cannot be rewritten as if they were loaded from $8, because then the code would no longer move from the return value.

With that in mind, let's assign $2 to the Phi node, delete the move from $2 to itself, and rewrite the subroutine: 6CD8: ADDIU $29, $29, -40 SW $31, 28($29) SW $4, 40($29) SW $5, 44($29) SW $16, 24($29) OR $2, $0, $0 ; Initialize $2 to 0 LH $16, 42($29) BEQ $16, $0, 6D10 NOP 6CFC: ADDIU $1, $0, 1 BEQ $16, $1, 6D20 NOP BEQ $0, $0, 6D28 NOP 6D10: JAL 0x8024BA8C NOP BEQ $0, $0, 6D28 NOP 6D20: JAL 0x8024B9B8 NOP 6D28: LW $31, 28($29) LW $16, 24($29) ADDIU $29, $29, 40 JR $31 NOP

The return value of the chosen subroutine is written to $2 for 6CD8. The return value of 6CD8 needs to be in $2 for the caller. No movement needs to occur in order to forward the return value as a return value. This is the optimal use of $2 for this subroutine.

The subroutine is shaping up nicely, but there are still a few things to be done. Until next time!

No comments :

Post a Comment