CIS 451 Lab 9: Cache

Author: Zack Kurmas, with some very minor modifications by Andrew Kalafut


Objective: The purpose of this lab is to explore the different behaviors of different CPUs that implement the Intel instruction set. To put it another way, we want to observe an AMD processor behaving differently from an Intel processor --- especially with respect to the implementation of their superpipelines.

Deliverables: Submit in hardcopy a lab report that includes answers to the numbered questions.


This lab should preferably be done in groups of 2. If there are sufficient lab computers available, you may work alone.


Overview

This lab uses the Simplescalar CPU simulator to investigate the effects of different cache configurations on performance.

Simplescalar

Simplescalar is a suite of programs that simulate the execution of programs compiled using a MIPS-like instruction set called PISA. You can simulate the execution of any program using Simplescalar by simply re-compiling the program using a version of gcc that knows how to generate PISA instructions as well as x86 instructions.

For this lab, you will be using the tool sim-cache. This tool takes as input a description of a machine's memory hierarchy (i.e., cache levels) and reports on the number of hits and misses in each cache. Section 4.2 of The Simplescalar Tech Report explains how to describe the cache setup you want to simulate. When you read through this section, take note of three things:

For example, to configure a machine with an 8KB, direct-mapped L1 data cache with 32 byte blocks, use this command: -cache:dl1 dl1:256:32:1:l. Notice that 256 blocks times 32 bytes per block equals 8192 bytes.

When running sim-cache, I recommend sending the output directly to a file using the command line parameters -redir:sim file1 and -redir:prog file2. file1 will contain the results of the simulation (i.e., cache hit and miss rates). file2 will contain the output produced by the program simulated. This data is generally not interesting.

Effects of block size

Your first task is to examine the effects of block size on a "toy" program. Look at blockSize1.c. This small program iterates through each byte in a large array. Begin by compiling this C program for simplescalar using the following command:

~kurmasz/public/Simplescalar/bin/ss_gcc blockSize1.c
(If ~kurmasz/public/CS451/bin is in your path, you can use ss_gcc instead of ~kurmasz/public/Simplescalar/bin/ss_gcc.) Make sure you are running the Simplescalar version. If you can run ./a.out from the command line, you used the wrong version.

Running this command will produce a file named a.out. (As with the normal version of gcc, you can specify the name of the executable generated using the -o flag.) This file will not run by itself. It will run only as input to one of the Simplescalar programs. If it does, you generated it using the wrong version of gcc.

  1. Use sim-cache to determine the miss-rates of an 8KB, direct-mapped cache with the following block sizes: 8 bytes, 16 bytes, 32 bytes, and 64 bytes. To do so, use commands that look like this:

    ~kurmasz/public/Simplescalar/bin/sim-cache -cache:dl1 dl1:line:block:1:l -redir:prog /dev/null -redir:sim output_block a.out

    Where block ranges from 8 to 64, and line is set such that product of block times line is 8192.

    Hints for running sim-cache:

    After you have run sim-cache for each block size, grep each output file (output_8, output_16, etc.) for the line "dl1.miss_rate". List the miss rate for each block size tested.

  2. Based on your observations, determine a formula for the miss rate in terms of block_size.
  3. Explain your results (i.e., what is happening in the cache during each memory access to produce the results you observed).
  4. Now, write a C program for which the miss rate is considerably higher for a 16 byte block than for an 8 byte block. (The easiest way to do this is to find two array locations that conflict with a 16-byte block, but not with an 8-byte block. If you do this, you will see the cache with the 8-byte blocks have a nearly 0% miss rate while the cache with the 16-byte blocks has nearly a 100% miss rate.) List your source code, all cache parameters used, and the resulting hit rates. Hint: You need not loop through the entire array. Instead, find two addresses that collide in the cache. Also, make NUM_LOOPS 1000000.
Hints for writing programs with a specified cache behavior:

Optimal block size

Your next task is to determine the optimal block size for qsort.
  1. Determine the optimal block size for qsort given a 1KB, 4KB, and 16KB cache. Present your results using a graph with block size on the x-axis and the miss rate on the y-axis. Please generate one graph with three lines: One each for 1KB, 4KB, and 16KB. Valid block sizes are powers of 2 from 8 to 64. Your graph should have a form similar to Figure 8.18 from the textbook.

Effects of associativity

  1. Write a simple C program for which a 4-way associative cache has a significantly lower miss rate than a equally sized 2-way set associative cache. You may choose any cache size and block size you wish, but they must remain the same for the entire problem. Submit the source code, all cache parameters, and resulting hit rates. Submit the source code, all cache parameters, and resulting hit rates.
  2. Write a simple C program for which an associativity of 2 has a higher miss rate than a direct-mapped cache. You may choose any cache size and block size you wish, but they must remain the same for the entire problem. Submit the source code, all cache parameters, and resulting hit rates.
  3. Choose qsort (or another program of your choice) and a cache size. Produce a graph showing miss rates as associativity ranges over 1, 2, 4, 8, 16, and fully associative. Your graph should have associativity on the x-axis, and miss-rate on the y-axis. It should also contain four lines: one for each block size. Be sure to clearly label your graph with the cache size. Your graph should have a form similar to Figure 5.30 in Patterson and Hennessey (4th edition, revised).

Effects of replacement policy

  1. Write a simple C program for which a random replacement policy results in a lower miss rate than LRU by at least 10%. (Use an 8-way set associative cache with a size and block size of your choosing.) Submit the source code, all cache parameters, and resulting hit rates.
  2. Plot the effects of replacement policy on qsort using your given cache size, and a block size and associativity of your choosing. Specify all cache parameters.