The Pentium 4 and the G4e: an Architectural Comparison

Part II: The Execution Core

by Jon "Hannibal" Stokes

In my previous article on the Pentium 4 and G4e, I detailed the front ends of each processor. I also gave a general overview of each processor's pipeline, paying particular attention to the overall design philosophies embodied in their respective designs. In this article, I want to look in greater detail at the back end, or execution core, of both processors. I'll talk about the execution resources that each processor uses for crunching code and data, and how those resources contribute to overall performance on specific types of applications.

Before I begin, I should note that I'm saving a look at the memory subsystems for the next article in this series. This topic, which includes a detailed discussion of bandwidth, caching, and performance, is so crucial to understanding real-world performance that it deserves an entire article to itself.

Preliminary remarks about the ISA: operand formats

Even though this series of articles is aimed primarily at comparing the microarchitectures of the P4 and the G4e, a microarchitecture implements an instruction set architecture (ISA), so it helps to understand a bit about the differences between the two CPUs' ISAs. Specifically, some background on the operand formats of arithmetic instructions in the x86 and PowerPC ISAs will help you understand some of the important design decisions that I'll cover later in the article.

If you've ever done any x86 assembly language programming, then you know that x86 uses a two-operand format for both integer and floating-point instructions. For instance, if you wanted to add two numbers in registers A and B then the instruction would be laid out as follows:

add A, B

This command adds A to B, and places the result in A. Expressed mathematically, this would look like:

A = A + B

The problem with using a two-operand format is that it can be inconvenient for some sequences of operations. For instance, if you wanted to add A to B and store the result in C then you'd need two operations to do so:

mov C, A
add C, B

The first command moves A into C in order to preserve the value of A, and the second command adds the two numbers.

With a format of three or more operands, like that of many instructions in the PPC ISA, you get a little more flexibility and control. For instance, the PPC ISA has a three-operand add instruction of the format "add destination, source 1, source 2", so if you wanted to add A to B and store the result in C without erasing the values in either A or B (i.e. "C = A + B") then you could just do:

add C, A, B

Some PPC instructions support even more than three operands, which can be a real boon to programmers and compiler writers.
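To make the difference concrete, here is a minimal sketch of a toy register machine in Python. It is purely illustrative: the opcode names (mov, add2, add3), the run function, and the register values are assumptions made up for this example, not real x86 or PowerPC encodings. It shows only that computing C = A + B while preserving A and B takes two instructions in the two-operand style and one in the three-operand style.

```python
# Toy register machine: a hypothetical model for illustration only,
# not real x86 or PowerPC semantics.

def run(program, regs):
    """Execute a list of (opcode, operands...) tuples on a copy of regs."""
    regs = dict(regs)  # copy so the caller's starting registers are untouched
    for op, *args in program:
        if op == "mov":        # mov dst, src     -> dst = src
            dst, src = args
            regs[dst] = regs[src]
        elif op == "add2":     # add dst, src     -> dst = dst + src
            dst, src = args
            regs[dst] = regs[dst] + regs[src]
        elif op == "add3":     # add dst, s1, s2  -> dst = s1 + s2
            dst, s1, s2 = args
            regs[dst] = regs[s1] + regs[s2]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return regs

start = {"A": 5, "B": 7, "C": 0}

# Two-operand style: copy A into C first so that A is not overwritten.
two_operand = [("mov", "C", "A"), ("add2", "C", "B")]

# Three-operand style: the destination is named separately from the sources.
three_operand = [("add3", "C", "A", "B")]

r2 = run(two_operand, start)
r3 = run(three_operand, start)
```

Both programs leave A and B intact and put the same sum in C; the only difference is the instruction count, which is the cost that the two-operand format imposes in this situation.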

The PPC ISA's variety of multiple-operand formats is obviously more flexible than the one- and two-operand formats of the x86, but nonetheless modern x86 compilers are very advanced and, with help from the processor's hidden microarchitectural rename registers and various algorithms, can overcome most of the aforementioned problems. The problems with a two-operand format don't really rear their heads in integer applications; it's when we get to floating-point and vector code that the x86's two-operand format gets to be a potential performance liability. We'll talk more about this later, though.

Handling integer code: the Arithmetic Logic Units (ALUs)

Integer operations are the mainstay of business and office productivity applications, like spreadsheets, word processors, scheduling and email applications, and the like. Integer performance is essential to making these everyday sorts of office apps run well. All the basic integer operations (ADD, SUB, MUL, etc.) and logical operations (AND, OR, XOR, etc.) are handled in a processor's arithmetic logic units, or ALUs.

Though the P4's double-pumped ALUs have gotten quite a bit of press, you might be surprised to learn that both the G4e and the P4 embody approaches to enhancing integer performance that are really quite similar. As we'll see, this similarity arises from both processors' application of that universal computing design dictum: make the common case fast. For integer applications, the common case is easy to spot. Integer instructions generally fall into two categories:

Simple integer instructions. Instructions like ADD and SUB require very few steps to complete and are therefore very easy to implement with little overhead. These simple instructions make up the vast majority of the integer instructions in an average program.
Complex integer instructions. While addition and subtraction are fairly simple to implement, multiplication and division are complex to implement and can take quite a few steps to complete. Such instructions involve a series of additions and bit shifts, all of which can take a while. These instructions represent only a fraction of the instruction mix for an average program.

Since simple integer instructions are by far the most common type of integer instruction, both the P4 and G4e devote most of their integer resources to executing these types of instructions very rapidly.
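To see why multiplication is costlier than addition, consider the textbook shift-and-add method, sketched below in Python. This is an illustration of the general idea of "a series of additions and bit shifts" only; the function name shift_add_multiply and the step counter are assumptions for this example, and real hardware multipliers (including whatever the P4 and G4e actually use) are typically built with faster, more elaborate arrangements.

```python
def shift_add_multiply(a, b):
    """Multiply two non-negative integers using only adds and shifts.

    Returns (product, steps), where steps counts loop iterations,
    one per bit of b that has to be examined.
    """
    if a < 0 or b < 0:
        raise ValueError("this sketch handles non-negative integers only")
    product = 0
    steps = 0
    while b:
        if b & 1:          # lowest bit of b set: add the shifted a
            product += a
        a <<= 1            # a doubles each round (shift left)
        b >>= 1            # move on to the next bit of b (shift right)
        steps += 1
    return product, steps

product, steps = shift_add_multiply(13, 11)
```

A single ADD is one step, whereas this loop needs one iteration per significant bit of the multiplier, which gives a rough feel for why complex integer instructions take longer to complete than simple ones.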