Ars Technica logo. Serving the PC enthusiast for over 5x10-2 centuries  

The Pentium 4 and the G4e: an Architectural Comparison

Part II: The Execution Core

   by Jon "Hannibal" Stokes


In my previous article on the Pentium 4 and G4e, I detailed the front ends of each processor. I also gave a general overview of each processor's pipeline, paying particular attention to the overall design philosophies embodied in their respective designs. In this article, I want to look in greater detail at the back end, or execution core, of both processors. I'll talk about the execution resources that each processor uses for crunching code and data, and how those resources contribute to overall performance on specific types of applications.

Before I begin, I should note that I'm saving a look at the memory subsystems for the next article in this series. This topic, which includes a detailed discussion of bandwidth, caching, and performance, is so crucial to understanding real-world performance that it deserves an entire article to itself.

Preliminary remarks about the ISA: operand formats

Even though this series of articles is aimed primarily at comparing the microarchitectures of the P4 and the G4e, a microarchitecture implements an instruction set architecture (ISA), so it helps to understand a bit about the differences between the two CPUs' ISAs. Specifically, some background on the operand formats of arithmetic instructions in the x86 and PowerPC ISAs will help you understand some of the important design decisions that I'll cover later in the article.

If you've ever done any x86 assembly language programming, then you know that x86 uses a two-operand format for both integer and floating-point instructions. For instance, if you wanted to add two numbers in registers A and B then the instruction would be laid out as follows:

add A, B

This command adds A to B, and places the result in A. Expressed mathematically, this would look like:

A = A + B

The problem with using a two-operand format is that it can be inconvenient for some sequences of operations. For instance, if you wanted to add A to B and store the result in C then you'd need two operations to do so:

mov C, A

add C, B

The first command moves A into C in order to preserve the value of A, and the second command adds the two numbers.

With a format of three or more operands, like that of many instructions in the PPC ISA, you get a little more flexibility and control. For instance, the PPC ISA has a three-operand add instruction of the format "add destination, source 1, source 2", so if you wanted to add A to B and store the result in C without erasing the values in either A or B (i.e. "C = A + B") then you could just do:

add C, A, B

Some PPC instructions support even more than three operands, which can be a real boon to programmers and compiler writers. 
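To make the difference between the two operand formats concrete, here is a minimal sketch of a toy register machine in Python. It models only the operand formats discussed above, not real x86 or PPC encodings or timings; the helper names (mov, add2, add3) and the two wrapper functions are made up for this illustration.

```python
# Toy register machine: models operand formats only, not real x86 or PPC behavior.
# All function names here are hypothetical and exist only for this illustration.

def mov(regs, dst, src):
    """Copy the value of register src into register dst."""
    regs[dst] = regs[src]

def add2(regs, dst, src):
    """Two-operand add (x86 style): dst = dst + src, so dst's old value is lost."""
    regs[dst] = regs[dst] + regs[src]

def add3(regs, dst, src1, src2):
    """Three-operand add (PPC style): dst = src1 + src2, both sources preserved."""
    regs[dst] = regs[src1] + regs[src2]

def two_operand_sum(a, b):
    """Compute C = A + B with two-operand instructions; returns (registers, instruction count)."""
    regs = {"A": a, "B": b, "C": 0}
    mov(regs, "C", "A")    # mov C, A  (preserve A by copying it first)
    add2(regs, "C", "B")   # add C, B
    return regs, 2

def three_operand_sum(a, b):
    """Compute C = A + B with one three-operand instruction; returns (registers, instruction count)."""
    regs = {"A": a, "B": b, "C": 0}
    add3(regs, "C", "A", "B")  # add C, A, B
    return regs, 1

if __name__ == "__main__":
    print(two_operand_sum(3, 4))
    print(three_operand_sum(3, 4))
```

Both versions leave A and B intact and put the sum in C; the only difference in this toy model is that the two-operand version needs an extra mov to get there.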

The PPC ISA's variety of multiple-operand formats is obviously more flexible than the one- and two-operand formats of the x86. Nonetheless, modern x86 compilers are very advanced, and between their algorithms and the hidden microarchitectural rename registers in the hardware, most of the aforementioned problems can be overcome. The problems with a two-operand format don't really rear their heads in integer applications; it's when we get to floating-point and vector code that the x86's two-operand format becomes a potential performance liability. We'll talk more about this later, though.

Handling integer code: the Arithmetic Logic Units (ALUs)

Integer operations are the mainstay of business and office productivity applications such as spreadsheets, word processors, and scheduling and email applications. Integer performance is essential to making these everyday sorts of office apps run well. All the basic integer operations (ADD, SUB, MUL, etc.) and logical operations (AND, OR, XOR, etc.) are handled in a processor's arithmetic logic units, or ALUs.

Though the P4's double-pumped ALUs have gotten quite a bit of press, you might be surprised to learn that both the G4e and the P4 embody approaches to enhancing integer performance that are really quite similar. As we'll see, this similarity arises from both processors' application of that universal computing design dictum: make the common case fast. For integer applications, the common case is easy to spot. Integer instructions generally fall into two categories:

  1. Simple integer instructions. Instructions like ADD and SUB require very few steps to complete and are therefore very easy to implement with little overhead. These simple instructions make up the vast majority of the integer instructions in an average program.
  2. Complex integer instructions. While addition and subtraction are fairly simple to implement, multiplication and division are complex to implement and can take quite a few steps to complete. Such instructions involve a series of additions and bit shifts, all of which can take a while. These instructions represent only a fraction of the instruction mix for an average program.

Since simple integer instructions are by far the most common type of integer instruction, both the P4 and G4e devote most of their integer resources to executing these types of instructions very rapidly.
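As a rough illustration of why a multiply counts as a complex integer instruction, the sketch below builds a product out of nothing but additions and bit shifts, one step per bit of the multiplier. This is a textbook shift-and-add scheme written in Python for clarity; it is not a description of how the P4's or the G4e's multiplier hardware actually works, and the function name is made up for this example.

```python
def shift_add_multiply(a, b):
    """Multiply two non-negative integers using only additions and bit shifts.

    Returns (product, steps), where steps is the number of loop iterations,
    i.e. one per bit of the multiplier b. Illustrative only; real hardware
    multipliers use faster schemes.
    """
    if a < 0 or b < 0:
        raise ValueError("this sketch handles non-negative integers only")
    product = 0
    steps = 0
    while b:
        if b & 1:          # lowest bit of the multiplier is set:
            product += a   # add the current shifted multiplicand
        a <<= 1            # shift the multiplicand left by one bit
        b >>= 1            # move on to the next bit of the multiplier
        steps += 1
    return product, steps

if __name__ == "__main__":
    print(shift_add_multiply(13, 11))
```

A single ADD, by contrast, is one addition with no loop at all, which is the sense in which simple integer instructions need very few steps.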

  


 



Copyright © 1998-2004 Ars Technica, LLC