This entry will focus on a subroutine in Super Mario 64, USA version, for the Nintendo 64. It's a MIPS III subroutine at offset 6CD8 (hexadecimal) within the ROM.
I already simplified this subroutine in part 1, about basic blocks. The original subroutine, as found in the ROM, is in that part. This entry will build on the simplified version.
Previously on Old Games, New Tech
6CD8:
ADDIU $29, $29, -40 ; Grow the stack by 40 bytes
SW $31, 28($29) ; There are JALs below, so store the return address
SW $4, 40($29) ; Store argument 1 to the stack
SW $5, 44($29) ; Store argument 2 to the stack
SW $16, 24($29) ; Preserve the caller's value of $16
SW $0, 36($29) ; Initialize stack slot 36 to 0
LH $16, 42($29) ; Grab the low 16 bits of stack slot 40 (big-endian)
BEQ $16, $0, 6D10 ; If those bits are equal to 0, go to 6D10
NOP
6CFC:
ADDIU $1, $0, 1
BEQ $16, $1, 6D24 ; If those bits are equal to 1, go to 6D24
NOP
BEQ $0, $0, 6D30 ; Unconditional branch to 6D30
NOP
6D10:
JAL 0x8024BA8C ; Subroutine call - may use our values of $4 and $5
NOP
SW $2, 36($29) ; Store the subroutine's return value into stack slot 36
BEQ $0, $0, 6D30 ; Unconditional branch to 6D30
NOP
6D24:
JAL 0x8024B9B8 ; Subroutine call - may use our values of $4 and $5
NOP
SW $2, 36($29) ; Store the subroutine's return value into stack slot 36
6D30:
LW $2, 36($29) ; Load our return value from stack slot 36
LW $31, 28($29) ; Restore our return address after JALs
LW $16, 24($29) ; Restore the caller's value of $16
ADDIU $29, $29, 40 ; Return the stack to its old size
JR $31 ; Return after the JAL that called us
NOP
Read-after-write dependency elimination (#raw)
An optimization that works within single basic blocks, read-after-write dependency elimination reuses a value that has been stored to memory (and is still in a processor register) instead of reloading it.
Such an optimization is possible here:
6CE0:
SW $4, 40($29) ; Store argument 1 to the stack
...
LH $16, 42($29) ; Grab the low 16 bits of stack slot 40 (big-endian)
However, reusing the value of $4
instead of using LH
requires sign-extending it from 16 bits to 32:
6CE0:
SW $4, 40($29) ; Store argument 1 to the stack
...
SLL $16, $4, 16 ; Kill the upper 16 bits of $4 into $16
SRA $16, $16, 16 ; Bring 16 copies of bit #31 down
On some processors, the impact of adding this instruction (and polluting the instruction cache with it) is greater than the impact of reloading the value from the data cache. This optimization is therefore not always warranted.
Stack slot elimination (#sse)
If a stack slot is used to hold a local variable, and its address is not taken (e.g. to pass a pointer to it to another subroutine), modern compilers will try to keep its value in a register. That register may need to be written to the stack later on, due to the need to preserve caller-saved registers across subroutine calls, but most of the time, this optimization is worthwhile.
This subroutine has one such stack slot: the one 36 bytes after the stack pointer, used to hold the return value. What if, instead of allocating a stack slot, a register was used?
Modern compilers use the static single assignment (SSA) form to represent values that don't exist in memory in order to allocate registers to them. Here, the stack slot elimination results in roughly this function:
function 6CD8(%arg1 signed 16-bit, %arg2) {
_start:
%1 = 0
decide %arg1Lo == 0? yes: is0; no: test1
test1:
decide %arg1Lo == 1? yes: is1; no: returnValue
is0:
%2 = call 0x8024BA8C(%arg1, %arg2)
goto returnValue
is1:
%3 = call 0x8024B9B8(%arg1, %arg2)
goto returnValue
returnValue:
%4 = phi(%1 from test1, %2 from is0, %3 from is1)
set return value to %4
return
}
Wait... phi
? What's that? (#phi)
The Phi node is a kind of node in SSA which merges multiple values that come from multiple basic blocks but are logically considered to be one value for further processing. If a register is assigned to such a node, it makes sense to assign the same register to all of the nodes that could make up its value. If that register is not available for all of the referenced nodes, then it is required to move the value into the register assigned to the Phi node when generating the final machine code.
With that in mind, let's assign $8
to the Phi node and rewrite the subroutine:
6CD8:
ADDIU $29, $29, -40
SW $31, 28($29)
SW $4, 40($29)
SW $5, 44($29)
SW $16, 24($29)
OR $8, $0, $0 ; Initialize $8 to 0
LH $16, 42($29)
BEQ $16, $0, 6D10
NOP
6CFC:
ADDIU $1, $0, 1
BEQ $16, $1, 6D24
NOP
BEQ $0, $0, 6D30
NOP
6D10:
JAL 0x8024BA8C
NOP
ADDU $8, $2, $0 ; Store the subroutine's return value into $8
BEQ $0, $0, 6D30
NOP
6D24:
JAL 0x8024B9B8
NOP
ADDU $8, $2, $0 ; Store the subroutine's return value into $8
6D30:
ADDU $2, $8, $0 ; Load our return value from $8
LW $31, 28($29)
LW $16, 24($29)
ADDIU $29, $29, 40
JR $31
NOP
Data movement optimization (#move)
If a piece of data is being moved from one register to another, especially the result of a Phi node, it may be worthwhile to rewrite the code as if the value was written to the moved-to register to begin with, while keeping constraints on callee-saved and caller-saved registers, and see if the resulting subroutine is shorter. In this subroutine's case, there is this instruction:
ADDU $2, $8, $0 ; Load our return value from $8
Other moves cannot be rewritten as if they were loaded from $8
, because then the code would no longer move from the return value.
With that in mind, let's assign $2
to the Phi node, delete the move from $2
to itself, and rewrite the subroutine:
6CD8:
ADDIU $29, $29, -40
SW $31, 28($29)
SW $4, 40($29)
SW $5, 44($29)
SW $16, 24($29)
OR $2, $0, $0 ; Initialize $2 to 0
LH $16, 42($29)
BEQ $16, $0, 6D10
NOP
6CFC:
ADDIU $1, $0, 1
BEQ $16, $1, 6D20
NOP
BEQ $0, $0, 6D28
NOP
6D10:
JAL 0x8024BA8C
NOP
BEQ $0, $0, 6D28
NOP
6D20:
JAL 0x8024B9B8
NOP
6D28:
LW $31, 28($29)
LW $16, 24($29)
ADDIU $29, $29, 40
JR $31
NOP
The return value of the chosen subroutine is written to $2
for 6CD8
. The return value of 6CD8
needs to be in $2
for the caller. No movement needs to occur in order to forward the return value as a return value. This is the optimal use of $2
for this subroutine.
The subroutine is shaping up nicely, but there are still a few things to be done. Until next time!
No comments :
Post a Comment