Author: Zack Kurmas, with some very minor modifications by Andrew Kalafut
Objective: The purpose of this lab is to explore the different behaviors of different CPUs that implement the Intel instruction set. To put it another way, we want to observe an AMD processor behaving differently from an Intel processor --- especially with respect to the implementation of their superpipelines.
Deliverables: Submit in hardcopy a lab report that includes answers to the numbered questions.
This lab should preferably be done in groups of 2. If there are sufficient lab computers available, you may work alone.
This lab will use a lot of Intel assembly language as well as many different UNIX tools. Before you begin, you may want to review the Intel Assembly Overview from the Intel Machine Language lab, as well as the following UNIX topics:
You may also want to set up a public/private key pair so that you can
ssh to run commands on different machines without having to type a password.
We are going to time segments of assembly code and watch how the time for that code changes from CPU to CPU.
The Pentium II and higher Intel CPU architectures have a register
called the time stamp counter or
TSC. It is
cleared on reset of the chip and increments its value on every clock
cycle. Hence, it can be used to measure event durations at the
machine cycle level. The Intel assembly instruction
TimeStamp Counter) takes the current value of the time stamp counter and
places the low-order 32 bits into register
eax and the
high-order 32 bits into register
Begin by downloading
This program contains a loop that calculates and prints the difference
of two numbers 10000 times. Run
gcc -S timestamp_shell.c to
generate the corresponding assembly code. By examining
you should see how to
printf. (Notice that the parameters are stored in registers.)
gcc's assembler puts the destination on the right. Also remember that
%eax is the lower 32 bits of
time_rdtsc.sas a base (don't modify
timestamp_shell.s), write an assembly program that calls
rdtsctwice and prints the difference between the two timestamps. (Hint: put 4 lines of code immediately following the label at the top of the loop.) You'll need a
movlbetween the calls to
rdtscso the second call doesn't wipe out the results of the first. You can use
movlto move the results of calling
stopvariables. Submit your code (just the main function).
Your program must contain a loop to repeat the timing many times. When the program begins running, "cold" instruction and data caches may make the program run slower. Also, at any time, things such as context switches and interrupts can interrupt the timing.
time_rdtsc.sthen run the command
./a.out | tr -s ' ' | cut -f 8 -d ' ' | sort -n | uniq -c
When running the command above, you should see that one (possibly
two) values are much more common than the others. These values
represent the approximate overhead of the timing operation. (In other
words, if the time to call
rdtsc is 80 cycles, then any
measurement you make will be high by about 80 cycles.)
Read the department's equipment page (slightly outdated) and note the different CPUs currently in use in the labs. Currently we have five different CPUS:
They all use the IA32 instruction set; therefore, they all have an
rdtsc instruction; however, each CPU
implements the instruction differently.
arch01and others should use
arch05.) For each CPU, list the typical number of cycles taken. Which CPUs commonly produce two different answers?
Hint: If you have set up a public/private key pair, than you can use a simple bash script to automatically run your program on different machines. Your script will look something like this:
for i in arch05 arch06 <other machines here> do echo -n "$i " ssh $i command > file_$i done
Remember that when you ssh to another machine, the working directory will be your home directory.
Modern CPUs can process more than one instruction at a time. In
addition, these instructions are not necessarily processed in the same
order they appear in code. (The control unit is very careful to ensure
that the re-ordering does not affect the output.) Thus, simply placing
instructions between two calls to
rdtsc will not
necessarily produce an accurate result: Some of the instructions being
timed may still be in process when the second
If there are enough (hundreds or thousands) of instructions being timed,
then the effects of the overlap will be minimal.
Another approach is to place a "serializing" instruction before each
rdtsc. "Serializing" instructions do not begin
executing until every previous instruction is complete. To the best of
cpuid is the only serializing instruction for
IA32 that can be run in "user" mode. (The others must run in
"protected" mode.) Two things to note when using
cpuidplaces data in
%edx. Make sure your important data (including the loop counter) is elsewhere when calling
cpuidshould do. For consistency, make sure
%eaxcontains the same value each time it is called.
For those who are curious: The primary purpose of
is to provide programmers a means of querying the CPU for its
configuration and extra functions (what is the cache size? Does it
implement the MMX instructions? etc.) On Linux, you can run the
command to see the different information this instruction makes available.
It is nearly impossible to accurately time a single instruction: Not
only is the overlap with the
instructions problematic; but, the instructions are pipelined, with
several instructions in progress at once. Thus, it is not possible to
pinpoint when a particular instruction begins or ends. Instead, we will
time larger blocks of code and watch for interesting behaviors.
Start by timing blocks of
addl instructions. The idea is
to time a group of 100
etc. I have written a perl script,
to automate the process of generating, compiling, and running the test
programs. (You will also find a copy of this program in
rdtscadd the following line of code between the calls to
#@addl $1, %ecx.
test_1add.s. Note: The "#@" is important. Also, if
gccis currently using
%ecx, choose a different register.
cpuidimmediately before each call to
rdtsc. Your code should contain a block that looks something like this:
.L3: movl $0, %eax cpuid rdtsc movl %eax, %r12d #@ addl $1, %ecx movl $0, %eax cpuid rdtsc movl %eax, %r13d movl %r13d, %edx subl %r12d, %edx
%edx. If so, modify the assembly to use one of the registers in the range
%r15d. (The last few semesters, the compiler has been using
%ebxto store the loop counter. You will need to change this.)
Test; then, put a copy of
time_tester.pl test_1add.s --noCleanUp.
addlinstruction was repeated. (If the
addlinstruction wasn't repeated, then you prepared
addlinstructions overlap with the
rdtscinstruction. This is why we need to time blocks of hundreds of instructions.
time_tester.pl test_1add.s --start 0 --stop 25000 --step 200 > output_1add.
output_1add. One way is to run
gnuplotand give the following command:
gnuplot, you can change your output format and save the plot to a file. First type "
set term mode", where mode is one of many output formats, including "postscript", "jpeg", "png", and "pbm". (Type
help termfor the complete list.) After you have set the term, type "
set output filename". Typing "
set term x11" will return the output to the screen.
test_2add.swith the test lines
#@ addl $1, %eax #@ addl $1, %ecxand run it on the different hardware. What are the slopes of the lines?
test_1add.sproduces code that looks like this:
addl $1, %eax addl $1, %eax addl $1, %eax addl $1, %eax ...whereas
test_2add.sproduces code that looks like this:
addl $1, %eax addl $1, %ebx addl $1, %eax addl $1, %ebx ...Why can the CPU run the second example faster?
addinstructions each different CPU can do in parallel. Attach graphs demonstrating this. (In other words, show me how you figured it out.) In general, each plot should contain more than one line. Look most closely at the slope of the line in the range
[15000, 20000](If you don't see at least one CPU that can handle more than 2 instructions at a time, let me know.)
addinstruction with an immediate parameter. Now, try an
addinstruction that takes two register-direct parameters (e.g.,
addl %eax, %ebx). How many can run in parallel?
addl -4(%rbp), %eax) can run in parallel?
Now, let's look at how the compiler handles multiplication:
Write some C code that includes the line
x *= 7 where
x is declared to be a
Now find a constant value that causes the compiler to generate an
imull instruction. Time a single
mull instruction the same
manner you tested timed the