This entry will focus on a second subroutine in Super Mario 64, USA version, for the Nintendo 64. It's a MIPS III subroutine at offset 2B3C (hexadecimal) within the ROM.
Last time, we found that the programmer who wrote this C function didn't do so well, leading to the compiler determining that every use of a pointer in a global variable needed to reload its value from memory in order to store a new value through it.
Before writing this entry, I rewrote the subroutine as if the programmer had written a better version of the C function and fed it to the compiler.
Because some data probably exists around the global pointer variable, I chose to take a copy of the global pointer inside the function (struct state* local_state = global_state;) instead of using a non-pointer global variable (struct state global_state).
A better subroutine to work with
2B3C:
  ADDIU   $29, $29, -8
  LUI     $14, 0x8034
  LUI     $15, 0x8034
  LW      $15, -20364($15)
  LW      $14, -20372($14)
  SUBU    $24, $14, $15
  SRA     $25, $24, 3
  SW      $25, 4($29)
  LUI     $9, 0x8034
  LW      $9, -20376($9)    ; Load the state pointer into $9, just once
  LUI     $8, 0x8034
  ADDIU   $8, $8, -20440
  SW      $8, 64($9)
  ADDIU   $10, $0, 2
  SW      $10, 68($9)
  ADDIU   $12, $0, 1
  SW      $12, 0($9)
  LUI     $14, 0x8033
  ADDIU   $14, $14, -19872
  SW      $14, 8($9)
  LUI     $24, 0x8033
  LUI     $25, 0x8033
  ADDIU   $25, $25, -19872
  ADDIU   $24, $24, -19664
  SUBU    $8, $24, $25
  SW      $8, 12($9)
  SW      $0, 4($9)
  LUI     $11, 0x8033
  ADDIU   $11, $11, -19664
  SW      $11, 16($9)
  LUI     $13, 0x8034
  ADDIU   $13, $13, -25920
  SW      $13, 24($9)
  ADDIU   $15, $0, 4096
  SW      $15, 20($9)
  ADDIU   $25, $0, 2048
  SW      $25, 28($9)
  LUI     $9, 0x8020
  ADDIU   $9, $9, 30976
  SW      $9, 32($9)
  ADDIU   $11, $0, 1024
  SW      $11, 36($9)
  LUI     $13, 0x8022
  ADDIU   $13, $13, 28672
  SW      $13, 40($9)
  LUI     $15, 0x8022
  LUI     $1, 0x0001
  ORI     $1, $1, 0xF000
  ADDIU   $15, $15, 28672
  ADDU    $24, $15, $1
  SW      $24, 44($9)
  LUI     $8, 0x8034
  LW      $8, -20364($8)
  SW      $8, 48($9)
  LW      $10, 4($29)
  SLL     $11, $10, 3
  SW      $11, 52($9)
  LUI     $13, 0x8020
  ADDIU   $13, $13, 28160
  SW      $13, 56($9)
  ADDIU   $15, $0, 2304
  SW      $15, 60($9)
  JR      $31
  ADDIU   $29, $29, 8
Which is somewhat shorter, but there can still be work done on it.
Constant propagation (#constprop)
Look at these two parts of the subroutine:
    LUI     $24, 0x8033       ; $24 = 0x80330000;
  LUI     $25, 0x8033       ; $25 = 0x80330000;
  ADDIU   $25, $25, -19872  ; $25 = 0x8032B260;
  ADDIU   $24, $24, -19664  ; $24 = 0x8032B330;
  SUBU    $8, $24, $25      ; $8 = 0x8032B330 - 0x8032B260;
    LUI     $15, 0x8022       ; $15 = 0x80220000;
  LUI     $1, 0x0001        ; $1 = 0x10000;
  ORI     $1, $1, 0xF000    ; $1 = 0x1F000;
  ADDIU   $15, $15, 28672   ; $15 = 0x80227000;
  ADDU    $24, $15, $1      ; $24 = 0x80227000 + 0x1F000;
Each of them ends with an operation that results in a constant. The operations can be replaced with loading the result into their result registers:
    ADDIU   $8, $0, 208       ; $8 = 0x8032B330 - 0x8032B260 = 208;
    LUI     $24, 0x8024
  ORI     $24, $24, 0x6000  ; $24 = 0x80227000 + 0x1F000 = 0x80246000;
As for the original constants, they may be removed, but only if they are not needed for further computations. More formally, a constant is needed in a register if an instruction following loading it into a register reads the register, or if the basic block containing the subroutine call has a branch, possibly through other basic blocks, to a basic block that reads the register. (Yes, that's similar to the algorithm for callee-saved registers.)
Additionally, a constant in a register is unneeded if its value is fully overwritten later but it has not been read in the meantime.
Fortunately for us, $24 and $25, used by SUBU, are both fully overwritten later. As for the ADDU, $15 is fully overwritten later, and $1 is not read again. So we can remove all the constants that were used for the computations!
Constant reuse (#constreuse)
I used a modern version of GCC for MIPS (4.9.1) just before writing this section to make sure that it performs this optimization. It does, so it is indeed a case of Old Games, New Tech.
Before assigning a register to hold a constant, modern compilers record this fact into a table and attempt to reuse the constant to build other constants later on in the same basic block.
The compiler used for Super Mario 64 has not done this. It loads 0x80340000 5 times, 0x80330000 4 times, and 0x80200000 2 times in the same basic block. (Some of these constants are no longer needed because of the constant propagation step above having removed some operations. They're still counted here.)
In order to reuse a partial constant, we can assign a register to hold it for the entire duration for which it is needed. Let's assign $4 to hold 0x80340000, $5 to hold 0x80330000 and $6 to hold 0x80200000 and rewrite the subroutine.
The rewritten subroutine (#rewritten)
2B3C:
  ADDIU   $29, $29, -8
  LUI     $4, 0x8034        ; $4 = 0x80340000, just once
  LW      $15, -20364($4)   ; Load from 0x8033B074 (via $4)
  LW      $14, -20372($4)   ; Load from 0x8033B06C (via $4)
  SUBU    $24, $14, $15
  SRA     $25, $24, 3
  SW      $25, 4($29)
  LW      $9, -20376($4)    ; Load (via $4) into $9, just once
  ADDIU   $8, $4, -20440    ; Form the constant via $4
  SW      $8, 64($9)
  ADDIU   $10, $0, 2
  SW      $10, 68($9)
  ADDIU   $12, $0, 1
  SW      $12, 0($9)
  LUI     $5, 0x8033        ; $5 = 0x80330000, just once
  ADDIU   $14, $5, -19872   ; Form the constant via $5
  SW      $14, 8($9)
  ADDIU   $8, $0, 208       ; $8 = 0x8032B330 - 0x8032B260 = 208;
  SW      $8, 12($9)
  SW      $0, 4($9)
  ADDIU   $11, $5, -19664   ; Form the constant via $5
  SW      $11, 16($9)
  ADDIU   $13, $4, -25920   ; Form the constant via $4
  SW      $13, 24($9)
  ADDIU   $15, $0, 4096
  SW      $15, 20($9)
  ADDIU   $25, $0, 2048
  SW      $25, 28($9)
  LUI     $6, 0x8020        ; $6 = 0x80200000, just once
  ADDIU   $9, $6, 30976     ; Form the constant via $6
  SW      $9, 32($9)
  ADDIU   $11, $0, 1024
  SW      $11, 36($9)
  LUI     $13, 0x8022
  ADDIU   $13, $13, 28672
  SW      $13, 40($9)
  LUI     $24, 0x8024
  ORI     $24, $24, 0x6000  ; $24 = 0x80227000 + 0x1F000 = 0x80246000;
  SW      $24, 44($9)
  LW      $8, -20364($4)    ; Load from 0x8033B074 (via $4)
  SW      $8, 48($9)
  LW      $10, 4($29)
  SLL     $11, $10, 3
  SW      $11, 52($9)
  ADDIU   $13, $6, 28160    ; Form the constant via $6
  SW      $13, 56($9)
  ADDIU   $15, $0, 2304
  SW      $15, 60($9)
  JR      $31
  ADDIU   $29, $29, 8
We can actually get rid of the stack adjustment and the store and reload of stack slot 4, but that's not something new; if you like, you can go to part 3 of subroutine 6CD8, about the stack to remind yourself of what to do.
 
No comments :
Post a Comment