In my previous article on the Pentium 4 and G4e, I detailed the front ends of
each processor. I also gave a general overview of each processor's pipeline, paying
particular attention to the overall design philosophies embodied in their
respective designs. In this article, I want to look in greater detail at the
back end, or execution core, of both processors. I'll talk about the execution
resources that each processor uses for crunching code and data, and how those
resources contribute to overall performance on specific types
of applications.
Before I begin, I should note that I'm saving a look at the memory
subsystems for the next article in this series. This topic, which includes a
detailed discussion of bandwidth, caching, and performance, is so crucial to
understanding real-world performance that it deserves an entire article to
itself.
Preliminary remarks about the ISA: operand formats
Even though this series of articles is aimed primarily at comparing the
microarchitectures of the P4 and the G4e, a microarchitecture
implements an instruction set architecture (ISA), so it helps to understand a bit
about the differences in each CPU's ISA. Specifically, some background on the operand formats
of arithmetic instructions in the x86 and PowerPC ISAs will help you understand
some of the important design decisions that I'll cover later in the article.
If you've ever done any x86 assembly language programming, then you know that
x86 uses a two-operand format for both integer and floating-point instructions.
For instance, if you wanted to add two numbers in registers A and B then the
instruction would be laid out as follows:
add A, B
This command adds A to B, and places the result in A. Expressed
mathematically, this would look like:
A = A + B
The problem with using a two-operand format is that it can be inconvenient
for some sequences of operations. For instance, if you wanted to add A to B and
store the result in C then you'd need two operations to do so:
mov C, A
add C, B
The first command moves A into C in order to preserve the value of A, and the
second command adds the two numbers.
With a format of three or more operands, like that of many of the instructions in
the PPC ISA, you get a little more flexibility and control. For instance, the PPC
ISA has a three-operand add instruction of the format add destination, source 1,
source 2, so if you wanted to add A to B and store the result in C without erasing
the values in either A or B (i.e. "C = A + B") then you could just do:
add C, A, B
Some PPC instructions support even more than three operands, which can be a
real boon to programmers and compiler writers.
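To make the difference concrete, here's a minimal sketch in Python of a toy register machine that runs both versions of "C = A + B". Everything in it is invented for illustration: the run function, the tuple encoding of instructions, and the dictionary of registers are assumptions of this sketch, not real x86 or PPC encodings. It only shows that the two-operand sequence needs an extra instruction to reach the same result while leaving A and B intact.

```python
# Toy register machine (hypothetical, for illustration only).
# Registers live in a dict; each instruction is a tuple of (mnemonic, operands...).

def run(program, regs):
    """Execute a list of instruction tuples against a copy of regs."""
    regs = dict(regs)  # copy so the caller's initial state is untouched
    for op, *args in program:
        if op == "mov" and len(args) == 2:
            # mov dst, src  ->  dst = src
            dst, src = args
            regs[dst] = regs[src]
        elif op == "add" and len(args) == 2:
            # Two-operand add: the first operand is both a source and the destination.
            dst, src = args
            regs[dst] = regs[dst] + regs[src]
        elif op == "add" and len(args) == 3:
            # Three-operand add: the destination is separate from both sources.
            dst, src1, src2 = args
            regs[dst] = regs[src1] + regs[src2]
        else:
            raise ValueError(f"unsupported instruction: {op} {args}")
    return regs


initial = {"A": 2, "B": 3, "C": 0}

# Two-operand style: copy A into C first so that A is preserved.
two_operand = [("mov", "C", "A"), ("add", "C", "B")]

# Three-operand style: a single instruction does the whole job.
three_operand = [("add", "C", "A", "B")]

result_two = run(two_operand, initial)
result_three = run(three_operand, initial)

print(len(two_operand), "instructions:", result_two)
print(len(three_operand), "instruction: ", result_three)
```

Both programs leave 5 in C and keep A and B unchanged; the only difference is the instruction count, which is the cost the two-operand format imposes in this situation.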
The PPC ISA's variety of multiple-operand formats is obviously more flexible
than the one- and two-operand formats of the x86. Nonetheless, modern x86
compilers are very advanced, and between the compilers' various algorithms and
the processors' hidden microarchitectural rename registers, most of the
aforementioned problems can be overcome. The problems with a two-operand format don't really rear
their heads in integer applications; it's when we get to floating-point and
vector code that the x86's two-operand format gets to be a potential performance
liability. We'll talk more about this later, though.
Handling integer code: the Arithmetic Logic Units (ALUs)
Integer operations are the mainstay of business and office productivity
applications, like spreadsheets, word processors, scheduling and email
applications, and the like. Integer performance is essential to making these
everyday sorts of office apps run well. All the basic integer operations (ADD,
SUB, MUL, etc.) and logical operations (AND, OR, XOR, etc.) are handled in a
processor's arithmetic logic units, or ALUs.
Though the P4's double-pumped ALUs
have gotten quite a bit of press, you might be surprised to learn that both the
G4e and the P4 embody approaches to enhancing integer performance that are
really quite similar. As we'll see, this similarity arises from both
processors' application of that universal computing design dictum: make the
common case fast. For integer applications, the common case is easy to spot.
Integer instructions generally fall into two categories:
Simple integer instructions. Instructions like ADD and SUB require very
few steps to complete and are therefore very easy to implement with little overhead. These simple instructions make up the vast majority of the integer
instructions in an average program.
Complex integer instructions. While addition and subtraction are fairly
simple to implement, multiplication and division are complex to implement
and can take quite a few steps to complete. Such instructions involve a
series of additions and bit shifts, all of which can take a while. These
instructions represent only a fraction of the instruction mix for an average
program.
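As a rough illustration of why multiplication takes more steps than addition, here's a sketch of the textbook shift-and-add algorithm in Python. The function name shift_add_multiply and its step counter are assumptions made for this example; this is not how the P4 or G4e multipliers are actually built, and real hardware multipliers are considerably more elaborate. The point is only that one multiply decomposes into a series of additions and bit shifts.

```python
def shift_add_multiply(a, b):
    """Multiply two non-negative integers using only additions and shifts.

    Returns a tuple of (product, steps), where steps counts how many
    shift iterations were needed (one per bit of b, up to its top set bit).
    """
    if a < 0 or b < 0:
        raise ValueError("this sketch handles non-negative integers only")

    product = 0
    steps = 0
    while b:
        if b & 1:          # lowest bit of the multiplier is set,
            product += a   # so add the current shifted multiplicand
        a <<= 1            # shift the multiplicand left (multiply it by 2)
        b >>= 1            # shift the multiplier right (consume one bit)
        steps += 1
    return product, steps


product, steps = shift_add_multiply(13, 11)
print(product, steps)  # 143 4
```

A single add would finish in one step, while this loop runs once per bit of the multiplier, which gives a feel for why complex integer instructions are slower than the simple ones.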
Since simple integer instructions are by far the most common type of integer
instruction, both the P4 and G4e devote most of their integer resources to
executing these types of instructions very rapidly.