CIS 451 Lab 6: Pipeline Optimizations

Author: Greg Wolffe


Objective: Continue the investigation of pipeline computing begun last week. Specifically, examine ways in which compiler techniques can improve the efficiency of a pipeline, and hence reduce the execution time of a program.

Deliverables: Submit a hardcopy lab report that includes answers to the numbered questions.


This lab should preferably be done in groups of 2. If there are sufficient lab computers available, you may work alone.


Data Hazards and Compiler Optimization

Recall the discussion in last week's lab about the various types of hazards encountered in pipelined execution (structural hazards, data dependencies, control transfers). Let's review the effects of some of these hazards on pipelined code. Suppose I wish to execute the following code snippet:

sum = item1 + item2;
total = sum + tax;
sum2 = item3 + item4;

Assume that all data objects are 32-bit integers stored in memory at consecutive offsets from the address stored in r1. For example, sum (the first data variable) is located at an offset of 0 from r1, item1 is located at an offset of 4 from r1, item2 is located at an offset of 8 from r1, etc. If we wanted to load sum into register r2, the corresponding instruction would be: lw r2, 0(r1). If we wanted to load item1 into register r3, the corresponding instruction would be: lw r3, 4(r1).
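For example (a sketch of the pattern only, not a complete solution; the register choices below are arbitrary), the first statement of the snippet becomes a load/compute/store sequence:

lw  r3, 4(r1)    # load item1
lw  r4, 8(r1)    # load item2
add r2, r3, r4   # sum = item1 + item2
sw  r2, 0(r1)    # store sum back to memory

The other two statements follow the same pattern.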

Perform the following:

  1. Write the straightforward MIPS assembly language code that implements the code snippet. This means that you should write all the assembly for line 1 (including any necessary sw instructions) before writing the assembly for line 2. Your code should not explicitly list any stalls. There should be 11 machine instructions in your code, not including the nops and trap #0 used to flush the pipeline at the end of the program.

    Note: For this and the next 4 problems, you may find it helpful to "run" your code through the dlxview simulator. (Remember that you can choose Next Cycle to step through the code one cycle at a time if desired.) Using dlxview is not required, but if you do use it, make note of the following differences between DLX assembly and MIPS assembly:

  2. How many cycles does it take the "dlxview" machine to "execute" the code that you have written (not including the trap #0 at the end)? (Start counting with cycle 0, and stop counting when the final sw instruction is in the "WB" state. Note that the total number of cycles is one more than the number DLXview gives to the last cycle.)
  3. Show the locations of the stalls and branch delays in your code from problem 1.
  4. Pretend you are an optimizing compiler: Rearrange (not rewrite) the instruction sequence to eliminate the data hazards (and the subsequent stalls).
  5. How many cycles does your "optimized" code take to execute?

Control Hazards and Compiler Optimization

As described last week, the compiler (or an assembly language programmer) needs to insert a nop immediately following a conditional branch instruction. This branch delay slot is necessary to deal with the control hazard introduced by the branch. However, an optimizing compiler can mask the latency introduced by the branch by identifying useful work that must be done regardless of the branch outcome and does not affect the branch condition. This work is then placed in the delay slot in place of the nop.
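For instance (a schematic sketch only; these instructions are placeholders, not code from this lab), an instruction that updates a pointer and has no effect on the branch condition can be moved from above the branch into the delay slot:

addi r2, r2, 4     ; pointer update, independent of the branch
bnez r5, loop      ; conditional branch
nop                ; delay slot wasted on a nop

becomes

bnez r5, loop      ; conditional branch
addi r2, r2, 4     ; same update, now doing useful work in the delay slot

The move is safe because the addi executed unconditionally in the original code and neither reads nor writes r5.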

Consider a typical loop that adds two matrices:

for (i = 0; i < MAX; i++) {
   c[i] = a[i] + b[i];
}
c[MAX] = 0;

The program exampleCO-1.s gives the DLX assembly code for the loop (with MAX set to 4).

  6. Run exampleCO-1.s through dlxview, then identify and explain all data and control hazards. Look carefully: there is a data hazard you probably don't expect!
  7. How many cycles does it take to complete the program in exampleCO-1.s when MAX is 4?
  8. Now, state the number of cycles as a function of MAX.
  9. Re-arrange the code to eliminate the data hazards and to fill the branch delay slots.
  10. How many cycles does it take to complete your optimized program when MAX = 4?
  11. State the number of cycles of your re-arranged program as a function of MAX. (Important: At this point, have the instructor double-check your answers to make sure you're on the right track.)

The "speedup" of the optimized code is the factor by which we would multiply the running time of the optimized code to get the running time of the original code. For example, if MAX is 1, the original code takes 18 cycles to run, whereas the optimized code takes only 15 cycles. This gives a speedup of 1.2, because 15*1.2 = 18.

Notice that the speedup depends on MAX.

  12. What is the speedup of your optimized code when MAX = 4?
  13. Use your favorite plotting tool (I like gnuplot) to plot the speedup of your optimized code as a function of MAX. Attach the graph to your lab submission.

    To run gnuplot, type "gnuplot" in a Linux shell. Commands along the following lines will plot a graph and save it as a .jpeg file (the function f(x) and the output file name are placeholders; substitute your own speedup function of MAX):
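    set terminal jpeg
    set output "speedup.jpg"
    f(x) = x/(x+1)          # placeholder only -- replace with your speedup function of MAX
    plot [1:100] f(x) title "speedup vs. MAX"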

  14. As MAX gets large, the speedup approaches an upper bound. What is this bound?

Optimizing by Re-writing Code

Notice that the main body of the loop has three addi instructions and one subi instruction. We could greatly improve the speed of this loop if we could somehow load and store values from the arrays a, b, and c by incrementing the offsets instead of incrementing the base pointers (because all three arrays use the same offset). Unfortunately, this is not possible because, in MIPS, the offset is an immediate value and is fixed at compile time.

A smart compiler, however, can greatly reduce running time by "unrolling" the loop. In other words, a smart compiler can replicate the body of a loop and eliminate extra branch instructions and extra index value increments. When replicating the body of the loop, you will need to adjust the memory offsets in each copy so that each copy of the body accesses the correct array elements. You may also need to use additional registers.
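To see the idea on an unrelated fragment (a sketch only, not the loop from exampleCO-1.s), consider a DLX-style loop that adds r3 into r2, with the iteration count held in r4:

loop: add  r2, r2, r3    ; useful work
      subi r4, r4, 1     ; loop overhead: decrement the counter
      bnez r4, loop      ; loop overhead: conditional branch
      nop                ; loop overhead: branch delay slot

Unrolled by a factor of 2 (assuming the count in r4 is even), the overhead is paid once per two units of useful work:

loop: add  r2, r2, r3    ; useful work (original iteration i)
      add  r2, r2, r3    ; useful work (original iteration i+1)
      subi r4, r4, 2     ; one counter update per unrolled iteration
      bnez r4, loop
      nop                ; branch delay slot

The original loop executes four instructions per add; the unrolled version executes five instructions per two adds.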

  15. Assume that the value of MAX is known to be 4 at the time the program is compiled. "Unroll" the optimized loop from problem 9 and optimize the unrolled loop so that it uses as few cycles as possible. Where possible, modify the offset of lw instructions in order to remove addi instructions. (Hint: Your optimized code should have 17 instructions and no stalls. All branch-related instructions can be removed.)
  16. How many cycles does your "unrolled" assembly take?
  17. What is the speedup of this unrolled code over the optimized program from problem 9?
  18. The tradeoff of unrolling loops is that your source code grows in length. By what percentage did your source code grow when you unrolled the loop (as compared to the source code you wrote for problem 9)? Do not count the trailing nops and trap #0.

Unrolling loops is trickier when the number of iterations is not known at compile time, but it is still possible. The trick is to re-organize the loop so that each iteration of the "unrolled" loop handles x iterations of the original loop. (For example, the first iteration of the unrolled loop contains instructions to complete iterations 1 through 4 of the original loop; the second iteration of the unrolled loop contains instructions to complete iterations 5 through 8, etc.) This way, the program suffers the "loop overhead" only once every x original iterations.

  19. Assume that the value of MAX is not known at compile time, but is known at run-time before the loop begins. Unroll the optimized loop from problem 9 so that each iteration of the unrolled loop performs the work of 4 original iterations. (In other words, iteration 1 of the new loop computes c[0] through c[3]; iteration 2 computes c[4] through c[7], etc.) For simplicity, assume that MAX is a multiple of 4.
  20. How could you modify your code so that it could handle any value of MAX (as opposed to multiples of 4 only)?
  21. Write a function that describes the number of cycles taken by your optimized code in terms of MAX. (Verify your answer with the instructor before continuing.)
  22. When MAX is large, what is the speedup of the unrolled version as compared to the optimized code from problem 9?
  23. How much bigger (in terms of percentage) is the unrolled source code than the version in problem 9?

Thus far, we have been unrolling code in groups of 4; however, you can unroll code using arbitrarily large groups. For the remaining problems, consider what would happen if you unrolled your code in groups of I iterations.
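In general (a sketch of the shape only; the coefficients are yours to determine), the cycle count of such a program has the form cycles(MAX, I) = a*MAX + b*(MAX/I) + c, where a counts the cycles incurred once per original iteration, b the cycles incurred once per unrolled iteration, and c the one-time costs at the beginning and end of the program.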

  24. Write a function that describes the number of cycles taken by your unrolled loop in terms of both MAX and I (the number of iterations per unrolled loop). Your formula will consider three types of instructions: (1) instructions that appear once per original iteration (e.g., the lw instructions), (2) instructions that appear once per unrolled iteration (e.g., the branch instruction), and (3) instructions that appear once per program (e.g., the final sw). You can test your formula by making sure that it reduces to your answer for problem 21 when I = 4.
  25. Write a function that describes the speedup of your unrolled loop (as compared to the optimized code in problem 9) in terms of both MAX and I.
  26. As MAX grows large, the effects of the constant terms (i.e., the "+6") become negligible. If you ignore these terms, MAX should cancel, leaving a formula in terms of I. What is it?
  27. What is the limit of this speedup as I grows? (Hint: There is a finite limit. If you think that the answer is "there is no limit", then you made a mistake somewhere.)
  28. Calculate the size of your unrolled loop as a function of I.
  29. What is the limit of this function as I grows?
  30. What would you consider the optimal number of iterations per unrolled loop? Why? (This is an opinion question. There isn't a right answer; you just have to say something thoughtful.)