Consider a pipelined MIPS processor in which load instructions
require one delay slot (i.e., the result of a load is not available to
the instruction immediately following the load) and branch
instructions also have one delay slot (i.e., the instruction immediately
following a branch is always executed).
Now consider the following code:
addi $4, $0, 1000
add $3, $0, $0
loop:
lw $2, 100($3)
nop
sw $2, 6000($3)
addi $3, $3, 4
bne $3, $4, loop
nop
a) What does this chunk of code do?
b) Assume that CPI = 1 (we're ignoring memory access delays). How many
cycles does this code take?
c) Optimize the code as much as possible. Both memory and all registers
must be identical after your optimized code. Put your optimized code
Doesn't this just do the same thing 250 times?
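
One way to check that intuition is to trace the code with a tiny simulator. The Python below is only a sketch under stated assumptions, not a reference MIPS model: the tuple encoding, the `run` and `initial_memory` helpers, and the test data in memory are all made up for illustration. It models the one branch delay slot and charges one cycle per instruction (CPI = 1), but it does not model the load delay, so each program has to respect that by construction. The second program is one possible rescheduling for part c), assuming the result of `addi` can be used by the `bne` right after it. It is not claimed to be the best possible answer.

```python
# Instructions are (op, a, b, c) tuples; this encoding is hypothetical:
#   addi rt, rs, imm -> ("addi", rt, rs, imm)
#   add  rd, rs, rt  -> ("add",  rd, rs, rt)
#   lw   rt, off(rs) -> ("lw",   rt, off, rs)
#   sw   rt, off(rs) -> ("sw",   rt, off, rs)
#   bne  rs, rt, L   -> ("bne",  rs, rt, index_of_L)
NOP = ("nop", 0, 0, 0)

ORIGINAL = [
    ("addi", 4, 0, 1000),
    ("add", 3, 0, 0),
    ("lw", 2, 100, 3),     # loop: index 2
    NOP,                   # load delay slot
    ("sw", 2, 6000, 3),
    ("addi", 3, 3, 4),
    ("bne", 3, 4, 2),
    NOP,                   # branch delay slot
]

RESCHEDULED = [
    ("addi", 4, 0, 1000),
    ("add", 3, 0, 0),
    ("lw", 2, 100, 3),     # loop: index 2
    ("addi", 3, 3, 4),     # fills the load delay slot, does not read $2
    ("bne", 3, 4, 2),
    ("sw", 2, 5996, 3),    # fills the branch delay slot; 5996 + ($3 + 4)
]

def initial_memory():
    # Assumed test data: words at addresses 100, 104, ..., 1096 hold 1..250.
    return {100 + 4 * i: i + 1 for i in range(250)}

def run(program, memory):
    regs = [0] * 32
    pc, cycles = 0, 0
    target = None          # set by a taken branch, used after its delay slot
    while pc < len(program):
        op, a, b, c = program[pc]
        jump_to, target = target, None
        if op == "addi":
            regs[a] = regs[b] + c
        elif op == "add":
            regs[a] = regs[b] + regs[c]
        elif op == "lw":
            regs[a] = memory.get(regs[c] + b, 0)
        elif op == "sw":
            memory[regs[c] + b] = regs[a]
        elif op == "bne" and regs[a] != regs[b]:
            target = c
        cycles += 1        # CPI = 1, nops included
        pc = pc + 1 if jump_to is None else jump_to
    return regs, memory, cycles

orig_regs, orig_mem, orig_cycles = run(ORIGINAL, initial_memory())
new_regs, new_mem, new_cycles = run(RESCHEDULED, initial_memory())
print(orig_cycles, new_cycles)                      # 1502 1002
print(orig_regs == new_regs, orig_mem == new_mem)   # True True
```

So yes, the loop body runs 250 times ($3 steps from 0 to 1000 in increments of 4), but each pass uses a different address: it copies the 250 words starting at address 100 to the 250 words starting at address 6000. In this model that costs 2 + 250 * 6 = 1502 cycles as written, and 2 + 250 * 4 = 1002 cycles once both delay slots do useful work.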