CIS 451 Lab 8: X86 (IA32) differences

Author: Zack Kurmas, with some very minor modifications by Andrew Kalafut


Objective: The purpose of this lab is to explore the different behaviors of different CPUs that implement the Intel instruction set. To put it another way, we want to observe an AMD processor behaving differently from an Intel processor --- especially with respect to the implementation of their superpipelines.

Deliverables: Submit in hardcopy a lab report that includes answers to the numbered questions.


This lab should preferably be done in groups of 2. If there are sufficient lab computers available, you may work alone.


Background

This lab will use a lot of Intel assembly language as well as many different UNIX tools. Before you begin, you may want to review the Intel Assembly Overview from the Intel Machine Language lab, as well as the following UNIX topics:

You may also want to set up a public/private key pair so that you can use ssh to run commands on different machines without having to type a password.

Timing your assembly code

We are going to time segments of assembly code and watch how the time for that code changes from CPU to CPU.

The Pentium II and higher Intel CPU architectures have a register called the time stamp counter or TSC. It is cleared on reset of the chip and increments its value on every clock cycle. Hence, it can be used to measure event durations at the machine cycle level. The Intel assembly instruction rdtsc (ReaD TimeStamp Counter) takes the current value of the time stamp counter and places the low-order 32 bits into register eax and the high-order 32 bits into register edx.

Begin by downloading timestamp_shell.c. This program contains a loop that calculates and prints the difference of two numbers 10000 times. Run gcc -S timestamp_shell.c to generate the corresponding assembly code. By examining timestamp_shell.s you should see how to

Remember that gcc's assembler puts the destination on the right. Also remember that %eax is the lower 32 bits of %rax.

  1. Copy timestamp_shell.s to time_rdtsc.s. Using time_rdtsc.s as a base (don't modify timestamp_shell.s), write an assembly program that calls rdtsc twice and prints the difference between the two timestamps. (Hint: put 4 lines of code immediately following the label at the top of the loop.) You'll need a movl between the calls to rdtsc so the second call doesn't wipe out the results of the first. You can use movl to move the results of calling rdtsc into the start and stop variables. Submit your code (just the main function).

Your program must contain a loop to repeat the timing many times. When the program begins running, "cold" instruction and data caches may make the program run slower. Also, at any time, things such as context switches and interrupts can interrupt the timing.

  1. Compile time_rdtsc.s then run the command ./a.out | tr -s ' ' | cut -f 8 -d ' ' | sort -n | uniq -c
    The output will contain two columns. The column on the right is a list of timing results. The column on the left indicates how often each time was observed. List the 5 most common and 5 longest timings (or all of them if there are fewer than 10) along with the name of the machine on which you ran it.

When running the command above, you should see that one (possibly two) values are much more common than the others. These values represent the approximate overhead of the timing operation. (In other words, if the time to call rdtsc is 80 cycles, then any measurement you make will be high by about 80 cycles.)

Differences, Part 1

Read the department's equipment page (slightly outdated) and note the different CPUs currently in use in the labs. Currently we have five different CPUS:

arch01, arch02, arch05, arch06
Quad core AMD Phenom II (no hyperthreading)
arch03, arch04, arch07, arch08, eos17-eos24
Quad core Intel i7-2600 with hyperthreading
arch09-arch10
Quad core Intel i7-3770 with hyperthreading
eos01-eos08
Quad core AMD Phenom 9600B (no hyperthreading)
eos09-eos16
Quad core Intel i7 860 with hyperthreading

They all use the IA32 instruction set; therefore, they all have an rdtsc instruction; however, each CPU implements the instruction differently.

  1. Run your timing program on each of the 5 different CPUs in EOS and Arch. (To avoid skewing the results, please try to spread the load. For example, some teams should use arch01 and others should use arch05.) For each CPU, list the typical number of cycles taken. Which CPUs commonly produce two different answers?

Hint: If you have set up a public/private key pair, than you can use a simple bash script to automatically run your program on different machines. Your script will look something like this:

for i in arch05 arch06 <other machines here>
do
        echo -n "$i "
        ssh $i command > file_$i
done

Remember that when you ssh to another machine, the working directory will be your home directory.

Serializing instructions

Modern CPUs can process more than one instruction at a time. In addition, these instructions are not necessarily processed in the same order they appear in code. (The control unit is very careful to ensure that the re-ordering does not affect the output.) Thus, simply placing instructions between two calls to rdtsc will not necessarily produce an accurate result: Some of the instructions being timed may still be in process when the second rdtsc begins. If there are enough (hundreds or thousands) of instructions being timed, then the effects of the overlap will be minimal.

Another approach is to place a "serializing" instruction before each call to rdtsc. "Serializing" instructions do not begin executing until every previous instruction is complete. To the best of my knowledge, cpuid is the only serializing instruction for IA32 that can be run in "user" mode. (The others must run in "protected" mode.) Two things to note when using cpuid:

For those who are curious: The primary purpose of cpuid is to provide programmers a means of querying the CPU for its configuration and extra functions (what is the cache size? Does it implement the MMX instructions? etc.) On Linux, you can run the cpuid command to see the different information this instruction makes available.

Timing One Instruction

It is nearly impossible to accurately time a single instruction: Not only is the overlap with the rdtsc and/or cpuid instructions problematic; but, the instructions are pipelined, with several instructions in progress at once. Thus, it is not possible to pinpoint when a particular instruction begins or ends. Instead, we will time larger blocks of code and watch for interesting behaviors.

Start by timing blocks of addl instructions. The idea is to time a group of 100 addls, then 200, then 300, etc. I have written a perl script, time_tester.pl to automate the process of generating, compiling, and running the test programs. (You will also find a copy of this program in /home/kurmasz/public/CS451/bin.)

  1. Run the previous test on all 5 different CPUs and plot the results together on the same graph. Attach a copy of your plot to your lab report. If using gnuplot, you can change your output format and save the plot to a file. First type "set term mode", where mode is one of many output formats, including "postscript", "jpeg", "png", and "pbm". (Type help term for the complete list.) After you have set the term, type "set output filename". Typing "set term x11" will return the output to the screen.
  2. Find the approximate slope of each line.
  3. Describe the differences you see between CPUs.
  4. You will notice that the slope of the line for the two Intel processors is less than 1. Explain how this is possible given the data dependency between each instruction. Hint: Read sections 2.2.1 and 2.2.2 in Volume 1 of the Intel Software Developer's Manual.
  5. Now, run the test again; but, this time add the constant 837 instead of 1. Describe and explain changes you see to the plots. (Hint: Think about things such as instruction caches and instruction sizes.)
  6. Now, create a test file test_2add.s with the test lines
    		#@ addl $1, %eax
    		#@ addl $1, %ecx
    		
    and run it on the different hardware. What are the slopes of the lines?
  7. The slopes of the line should be significantly less than 1. What does this tell you?
  8. test_1add.s produces code that looks like this:
    		addl $1, %eax
    		addl $1, %eax
    		addl $1, %eax
    		addl $1, %eax
    		...
    		
    whereas test_2add.s produces code that looks like this:
    		addl $1, %eax
    		addl $1, %ebx
    		addl $1, %eax
    		addl $1, %ebx
    		...
    		
    Why can the CPU run the second example faster?
  9. Figure out how many add instructions each different CPU can do in parallel. Attach graphs demonstrating this. (In other words, show me how you figured it out.) In general, each plot should contain more than one line. Look most closely at the slope of the line in the range [15000, 20000] (If you don't see at least one CPU that can handle more than 2 instructions at a time, let me know.)
  10. Thus far, we have been testing an add instruction with an immediate parameter. Now, try an add instruction that takes two register-direct parameters (e.g., addl %eax, %ebx). How many can run in parallel?
  11. Now, find out how many register-indirect instructions (e.g., addl -4(%rbp), %eax) can run in parallel?

Timing imull

Now, let's look at how the compiler handles multiplication:

Write some C code that includes the line x *= 7 where x is declared to be a register int.

  1. What assembly code is generated for this line? (It should be several lines.)
  2. Explain how this assembly code works.

Now find a constant value that causes the compiler to generate an imull instruction. Time a single mull instruction the same manner you tested timed the add.

  1. Find the slope of the lines for each CPU.
  2. What do these graphs tell you about time it takes to perform a multiply?
  3. Now repeat the experiment for larger groups of multiplies and find out how many multiplies can be executed in parallel.