Author: Greg Wolffe

**Objective:** Continue the investigation of pipeline computing begun last
week. Specifically, examine ways in which compiler techniques
can improve the efficiency of a pipeline, and hence reduce the execution
time of a program.

**Deliverables:** Submit in hardcopy a lab report that includes answers to the numbered questions.

This lab should preferably be done in groups of 2. If there are sufficient lab computers available, you may work alone.

- There is an on-line user manual available for the visualization tool used in this lab.

Recall the discussion in last week's lab about the various types of hazards encountered in pipelined execution (structural hazards, data dependencies, control transfers). Let's review the effects of some of these hazards on pipelined code. Suppose I wish to execute the following code snippet:

`sum = item1 + item2;`

`total = sum + tax;`

sum2`= item3 + item4;`

Assume that all data objects are 32-bit integers stored in memory at
succeeding offsets from the address stored in `r1`

. For
example, `sum`

(the first data variable) is located at an offset
of `0`

from `r1`

, `item1`

is located at an offset
of 4 from `r1`

, `item2`

is located at an offset
of 8 from `r1`

, etc. If we wanted to load `sum`

into register `r2`

, the corresponding instruction would be:
** lw r2,0(r1)**. If we wanted to load

`item1`

into register `r3`

, the corresponding instruction would be:
`lw r3, 4(r1)`

Perform the following:

Write the straightforward MIPS assembly language code that implements the code snippet. This means that you should write all the assembly for line 1 (including any necessary

`sw`

instructions) before writing the assembly for line 2. Your code should not explicitly list any stalls. There should be 11 machine instructions in your code, not including the`noops`

and`trap #0`

used to flush the pipeline at the end of the program.Note: For this, and the next 4 problems, you may find it helpful to "run" your code through the

**dlxview**simulator. (Remember that you can choose Next Cycle to step through the code one cycle at a time if desired.) Using`dlxview`

is not required, but should you choose to, make note of the following differences between DLX assembly and MIPS assembly:- When writing code for
`dlxview`

, don't put dollar signs in front of the registers. - Registers are
named
`r1`

,`r2`

,`r3`

, etc. Notice the lower case "r". - Comments begin with semi-colons (;), not hashes (#).
- You must use spaces, not tabs.
- Store word is written like this:
`sw 6(r1), r6`

.

- When writing code for
- How many cycles does it take the "dlxview" machine to "execute"
the code that you have written (not including the
`trap #0`

at the end)? (Start counting with cycle 0, and stop counting when the final`sw`

instruction is in the "WB" state. Note that the total number of cycles is one more than the number DLXview gives to the last cycle.) - Show the locations of the stalls and branch delays in your code from problem 1.
- Pretend you are an optimizing compiler: Rearrange (not rewrite) the instruction sequence to eliminate the data hazards (and the subsequent stalls).
- How many cycles does your "optimized" code take to execute?

`nop`

`nop`

Consider a typical loop that adds two matrices:

`for (i = 0; i < MAX; i++) {`

`c[i] = a[i] + b[i];`

`}`

`c[MAX] = 0;`

The program `exampleCO-1.s`

gives the DLX
assembly code for the loop (with MAX set to 4).

- Run
`exampleCO-1`

through`dlxview`

then Identify*and explain*all data and control hazards. Look carefully, there is a data hazard you probably don't expect! - How many cycles does it take to complete the program in
`exampleCO-1.s`

when`MAX`

is 4? - Now, state the number of cycles as a function of
`MAX`

. - Re-arrange the code to eliminate any data hazards and branch delay slots.
- How many cycles does it take to complete your optimized program
when
`MAX = 4`

? - State the number of cycles of your re-arranged program as a function in terms of
`MAX`

. (**Important:**At this point, have the instructor double-check your answers to make sure you're on the right track.)

The "speedup" of the optimized code is the factor by which we would multiply the running time of the optimized code to get the running time of the original code. For example, if MAX is 1, the original code takes 18 cycles to run, whereas the optimized code takes only 15 cycles. This gives a speedup of 1.2, because 15*1.2 = 18.

Notice that the speedup depends on `MAX`

.

- What is the speedup of your optimized code when
`MAX = 4`

? Use your favorite plotting tool (I like

`gnuplot`

) to plot the speedup of your optimized code as a function of`MAX`

. Attach the graph to your lab submission.To run

`gnuplot`

, type`"gnuplot"`

in a Linux shell. The following commands will plot a graph and save it as a`.jpeg`

file:`set term jpeg`

`set output "myPlot.jpeg"`

`plot [1:100] (x + 1)/(3*x+10)`

. (The parameter`[1:100]`

tells gnuplot to plot values for`1 <= x <= 100`

only.)

- As
`MAX`

gets large, the speedup approaches an upper bound. What is this bound?

Notice that the main body of the loop has three `addi`

and
one `subi`

instruction. We could really improve the speed
of this loop if we could somehow load and store values from the
arrays `a`

, `b`

, and `c`

by
incrementing the offsets instead of incrementing the base pointers
(because all three arrays use the same offset). Unfortunately, this
is not possible because, in MIPS, the offset in an immediate value and
is fixed at compile time.

A smart compiler, however, can greatly reduce running time by "unrolling" the loop. In other words, a smart compiler can replicate the body of a loop and eliminate extra branch instructions and extra index value increments. When replicating the body of the loop, you will need to

- eliminate redundant computations,
- adjust the loop decrement and termination code,
- eliminate unnecessary branches, and
- adjust the addresses of loads and stores to compensate for the new "size" of an iteration.

You may also need to use additional registers.

- Assume that the value of
`MAX`

is known to be 4 at the time the program is compiled. "Unroll" the optimized loop from problem 9 and optimize the unrolled loop use as few cycles as possible. Where possible, modify the offset of`lw`

instructions in order to remove`addi`

instructions. (Hint: Your optimized code should have 17 instructions and no stalls. All branch-related instructions can be removed.) - How many cycles does your "unrolled" assembly take?
- What is the speedup of this unrolled code over the optimized program from problem 9?
- The tradeoff of unrolling loops is that your source code grows in length. By what percentage did your source
code grow when you unrolled the loop (as compared to the source code you wrote for problem 9)? Do not count the
trailing
`nops`

and`trap #0`

.

Unrolling loops is trickier when the number of iterations is not known
at compile time; but, it is still possible. The trick is to
re-organize the loop so that each iteration of the "unrolled" loop
handles `x`

iterations of the original loop. (For example,
the first iteration of the unrolled loop contains instructions to
complete iterations 1 through 4 of the original loop; the second iteration of the unrolled loop contains instructions
to complete iterations 5 through 8 of the original loop, etc.) This way, the
program suffers the "loop overhead" only once every `x`

loops.

- Assume that the value of
`MAX`

is not known at compile time, but is known at run-time before the loop begins. Unroll the optimized loop from problem 9 so that each iteration of the unrolled loop performs the work of 4 original loops. (In other words, iteration 1 of the new loop computes c[0] through c[3]; iteration 2 computes c[4] through c[7], etc.) For simplicity, assume that`MAX`

is a multiple of 4. - How could you modify your code so that it could handle any value
of
`MAX`

(as opposed to multiples of 4 only)? - Write a function that describes the number of cycles taken by
your optimized code in terms of
`MAX`

.*(Verify your answer with the instructor before continuing.)* - When
`MAX`

is large, what is the speedup of the unrolled version as compared to the optimized code from problem 9? - How much bigger (in terms of percentage) is the unrolled source code than the version in problem 9?

Thus far, we have been unrolling code in groups of 4; however, you
can unroll code using arbitrarily large groups. For the remaining
problems consider what would happen if you unrolled your code in
groups of `I`

iterations.

- Write a function that describes the number of cycles taken by
your unrolled loop in terms of both
`MAX`

, and`I`

(the number of iterations per unrolled loop.) Your formula will consider three types of instructions: (1) Instructions that appear once per original iteration (e.g., the`lw`

instructions), (2) instructions that appear once per unrolled iteration (e.g., the branch instruction), and (3) instructions that appear once per program (e.g., the final`sw`

). You can test your formula by making sure that it reduces to your answer for Problem 21 when`I = 4`

. - Write a function that describes the speedup of your unrolled loop
(as compared to the optimized code in problem 9) in terms of both
`MAX`

, and`I`

. - As
`MAX`

grows large, the effects of the constant terms (i.e., the "`+6`

") become negligible. If you ignore these terms,`MAX`

should cancel, leaving a formula in terms of`I`

. What is it? - What is the limit of this speedup as
`I`

grows? (Hint: There is a finite limit. If you think that the answer is "there is no limit", then you made a mistake somewhere.) - Calculate the size of your unrolled loop as a function of
`I`

. - What is the limit of this function as
`I`

grows? - What would you consider the optimal number of iterations per unrolled loop? Why? (This is an opinion question. There isn't a right answer, you just have to say something thoughtful.)