CIS 451 Lab 5: Understanding Pipeline Operations

Author: Greg Wolffe

Objective: To become more familiar with the concept of pipelining, as found in modern CPU architectures.

Deliverables: Submit in hardcopy a lab report that includes answers to the numbered questions.

This lab should preferably be done in groups of 2. If there are sufficient lab computers available, you may work alone.


DLXview introduction

The name of this simulation tool is DLXview. Its purpose is to graphically illustrate the operation of the pipeline implemented in the DLX architecture. DLX is the name Patterson and Hennessy gave to previous versions of the hypothetical machine and instruction set we have been studying. Although, this name is not used in the current edition of the Patterson and Hennessy or Harris and Harris textbooks.
To start DLXview, simply type dlxview at the command prompt.
Configuring the system
The first step is to configure the machine to simulate a specific pipelined architecture. Feel free to experiment with alternative configurations. However, this lab utilizes only the default mode with default parameters. See the user manual for a description of the various configuration parameters available.
Loading an executable
Once the system has been configured, a file can be loaded for execution.
Running a program
To execute a program:
  1. How many cycles does it take to fill the pipeline? (Notice that DLXview labels the cycles beginning with zero, but when answering how many, you generally would not count this way).
  2. How many cycles does it take for the computer to execute the first instruction completely?
  3. What pipeline stage is the instruction and r6, r7, r8 in during cycle number 5 (i.e., the cycle labeled 5)?
  4. During which cycle does the the processor begin computing the instruction and r12, r13, r14? (Make sure your answer clearly states "the xth cycle" or "the cycle labeled x").
  5. During which cycle does the the processor finish computing the instruction and r12, r13, r14? (Make sure your answer clearly states "the xth cycle" or "the cycle labeled x").
The top half of the display window is an instruction timing diagram, calibrated in clock cycles. Each instruction is uniquely colored to make it easier to follow its progress through the pipeline. In addition to the timing diagram, DLXview provides two views of the pipeline in operation. The button near the bottom of the screen allows you to toggle between the two viewing modes.

The DLXview implementation of the DLX architecture

DLXview is a fairly straightforward implementation of the DLX architecture. Recall the differences between a pipelined processor and the simpler processor we described in class:

A simple pipelined program

The assembly program loaded earlier (exampleDLX-1.s) is a very simple program with no data dependencies, branches, or other potential problems. The code is well-documented and should be easily understood. Take some time now to execute the program, familiarizing yourself with the different views and the operation of the simulator.

Pay particular attention to the register file. Notice how it indicates which registers are being read from and written to. Also pay close attention to the muxes. They help indicate which of the many values generated in the previous stage are actually being used in the current stage. (In other words, watching the lines through the muxes shows you where the inputs to the ALU are coming from.) Stepping backward and forward will help you understand how an individual instruction progresses through the pipeline.

Note: we are only concerned here with the operation and control of the pipeline. In other words, we don't care what data values are actually in the registers or what the results of the operations are. It is possible, however, to initialize registers to a desired value using a *.i file. See the User's manual for additional information.

Note: it is possible at any time to change the operation of the simulator by editing the source code (see the file selection window).

Let's trace through the execution of a single instruction. The first instruction in the example program loads a word from memory into a register. Recall that DLX data addressing uses the indexed method. The word to be read from memory is located at the address contained in the instruction itself (i.e. the constant value 0), offset (or indexed) by the value contained in register R2. The word read from memory is to be loaded into register R1.

Trace through the execution of the other instructions in the program until you are confident you understand the operation of the pipeline.
  1. Which registers are being read during cycle 5 (the cycle labeled 5)?
  2. Which registers are being written during cycle 5 (the cycle labeled 5)?
  3. Where does input to the main ALU come from during cycle 2 (the cycle labeled 2)?
  4. Where does input to the main ALU come from during cycle 3 (the cycle labeled 3)?
  5. What is the purpose of all of the nops at the end of the sample program?

Structural hazards

A structural hazard is another term for a resource conflict. Resource conflicts (we will discuss this in more detail in class) refer to contention for a specific functional unit. Take a moment to study the integer data path of the DLX pipeline.

It is possible that the Instruction Fetch stage might be accessing memory at the same time as the Memory Access stage. One stage is reading an instruction and the other is reading a data value. Only one word can be read from a memory unit at time. This is an example of a resource conflict.

  1. How does the DLX pipeline architecture resolve this conflict?
Locate the adder in the Instruction Decode stage. Trace its backward and forward connections to understand its operation well enough to answer the question:
  1. What purpose does this adder serve?

Data hazards

The next type of hazard is a data hazard, which always involves some type of data dependency. In a typical instance, the current instruction requires some operand that is being generated by the preceding instruction (e.g. a load or an ALU operation). Program exampleDLX-2.s contains such a dependency. Download the program and study the code. Load and run the program and study its execution.
  1. Identify all pairs of instructions that have a data dependency. In particular identify all pairs of instructions (not necessarily adjacent) where the result of the second instruction depends directly on the result of the first.
  2. Describe in your own words how the DLX hardware addresses this dependency problem. Your answer should be precise enough to convince me that you understand the mechanism used.
  3. Trace through the progress of the fourth instruction in a manner similar to that used in the section above ("A simple pipelined program"). There is a "visual bug" or discrepancy that you will notice as you trace through the execution of this program. Identify it.
  4. Step to cycle 4. Use KSnapshot (or another tool of your choice) to capture the main DLXview window. Print the snapshot and highlight the set of wires that shows how the result of sub r3, r4, r5 is routed directly to the main ALU.
Some data dependencies span several instructions. Program exampleDLX-3.s demonstrates this problem in cycle 5, when the first add instruction is writing register r3 during the same cycle that the second add instruction is reading r3.

  1. Which value of r3 will be read if the DLX processor used the register file provided for Project 2? If the wrong value would be read, describe how you would properly coordinate the reading and writing of the registers.

Not all data dependencies can be solved by forwarding. Program exampleDLX-4.s contains an example of a type of hazard that forwarding cannot eliminate. Download the program and study the code. Load and run the program and study its execution.

Examine all of the instructions in the example program and identify their dependencies. Notice that the add instruction immediately follows a load instruction. This is an example of a dependency immune to forwarding. Execute the program using cycles instead of stepping. Refer to the timing diagram and notice the presence of stalls in the pipeline. A stall is a delay cycle, or bubble, inserted into the pipeline. When this type of data dependency (called a load data hazard) is detected, hardware known as a load interlock will delay subsequent instructions to resolve the timing problem. Study the operation of this program until you understand both the problem and the solution.

  1. Why must the add operation be delayed one cycle? Your answer should consider timing issues and functional units. Be sure to explain why forwarding cannot solve the problem.
  2. Describe how hardware can detect a load data hazard.

Control hazards

The final type of hazard involves control transfer -- branches and jumps. Program exampleDLX-5.s implements control transfer in the form of a loop. Download, study, and execute the code until you understand its operation. The comments in the code should help you understand what the loop is doing.

The DLX hardware resolves branches in the Instruction Decode stage of the pipeline. Instead of a beq instruction that uses subtraction in the ALU to compare two registers, the DLX hardware has bez and bnz instructions that branch if the specified register is equal or not equal to 0 respectively. These new branch instructions allow the hardware to determine by the end of the second cycle whether to take the branch.

Notice that the instruction immediately following the branch instruction is a nop. This nop is called a branch delay slot and is typically inserted by the compiler.

  1. Why is the branch delay slot necessary? In other words, what would go wrong if we removed the nop?
  2. Suppose you had a smarter compiler. Explain what it could put in the branch delay slot instead of a nop.
  3. Write an optimized assembly program incorporating your code adjustments. Your solution must still contain a loop.
  4. How many cycles does the original program take to assign the final result to register R6? (In other words, during which cycle is the instruction add r6, r4, r5 in the "write-back" phase?)
  5. How many cycles does your optimized assembly program take to do the same work?