Wednesday, September 22, 2010

Two billion-transistor beasts: POWER7 and Niagara 3

By Jon Stokes

A 300mm Power 7 processor wafer

In years past, an ISSCC presentation on a new processor would consist of detailed discussion of the chip's microarchitecture (pipeline, instruction fetch and decode, execution units, etc.), along with at least one shot of a floorplan that marked out the location of major functional blocks (the decoder, the floating-point unit, the load-store unit, etc.). This year's ISSCC is well into the many-core era, though, and with single-chip core counts ranging from six to 16, the only elements you're likely to see in a floorplan like the two below are cores, interfaces, and switches. Most of the discussion focuses on power-related arcana, but most folks are interested in the chips themselves.

In this short article, I'll walk you through the floorplans of two chips with similar transistor counts: Sun's Niagara 3 and IBM's POWER7. Most CPU geeks will already know a lot of the information I'll give below, but many readers will appreciate having it all together in one place.

Niagara 3: threads and I/O

Sun's Niagara 3

Sun's 1 billion-transistor, 16-core Niagara 3 processor is a great example of a modern multiprocessor-turned-SoC (system on a chip). Everything about this design is focused on pushing large numbers of parallel instruction streams and data streams through the processor socket at once. The shared cache is small, the shared pipes are wide, and the end result is a chip that's all about maintaining a high rate of flow, and not one that's aimed at collecting a large pile of data and chipping away at it with heavy equipment.

Each of the 16 individual SPARC cores that make up Niagara 3 supports up to eight simultaneous threads of execution, for a total of 128 threads per chip. Logically, the chip is laid out so that all of the cores communicate with a unified 6MB L2 cache via a crossbar switch that's placed in the middle of the chip. This combination of cores and L2 connected via a switch forms the basic compute architecture of the SoC.

So that the chip can talk to the outside world, the L2 caches are connected to a variety of I/O interfaces: memory, PCIe, 1G/10G Ethernet, and coherency links. All told, those links can push a total of 2.4Tb/s worth of data through a single Niagara 3 socket—that's a lot of bandwidth, but you need it to feed that many threads. Let's take a quick look at each of these I/O links in turn.

Coherence: Niagara 3's coherence links are the equivalent of the QuickPath Interconnect (QPI) on Intel's Nehalem parts, or of HyperTransport for AMD. These links can be used to connect up to four of the chips together without any additional routing chips (this is what's meant by saying Niagara 3 can be used in a four-socket glueless configuration). Each Niagara 3 chip has two 1.6GHz coherence controllers, which are connected to six coherence links. Each individual link consists of 14 unidirectional lanes that give the link a total bandwidth of 9.6Gb/s.

Memory: Also attached to the L2 are two DDR3 memory controllers, each of which hosts two memory channels, for a total of four channels of DDR3.

PCIe and Ethernet: A PCIe controller supports two 5Gb/s PCIe ports, and an Ethernet controller supports two 1G/10G Ethernet ports.

IBM's POWER7


At 1.2 billion transistors, IBM's new 45nm POWER7 processor is only a little bigger than Niagara 3, but it couldn't be more different. If Niagara 3 is an army of guys with shovels, POWER7 is a giant bulldozer.

POWER7 has only half the cores (eight) and one quarter of the threads (32) of Sun's chip, but that doesn't mean it falls short in the horsepower department. Each POWER7 core has a ton of very fast execution hardware, and the overall layout of the machine's very wide execution core is a straightforward evolution of the design that I first described in a series of articles on the PowerPC 970. (I talked more about POWER7's execution core in an earlier article comparing it to Intel's Tukwila Itanium.)
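The core and thread counts quoted so far can be checked with a few lines of arithmetic. What follows is only an illustrative sketch: the `Chip` class and its field names are made up for this example, and the four-threads-per-core figure for POWER7 is inferred from the totals of 32 threads and eight cores rather than stated outright.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Chip:
    """Hypothetical record holding only the counts quoted in the article."""
    name: str
    cores: int
    threads_per_core: int

    @property
    def total_threads(self) -> int:
        # Hardware threads per chip = cores x threads each core can run at once.
        return self.cores * self.threads_per_core


niagara3 = Chip("Niagara 3", cores=16, threads_per_core=8)
# Assumption: four threads per core, inferred from 32 threads across eight cores.
power7 = Chip("POWER7", cores=8, threads_per_core=4)

# Exact ratios of POWER7's resources to Niagara 3's.
core_ratio = Fraction(power7.cores, niagara3.cores)
thread_ratio = Fraction(power7.total_threads, niagara3.total_threads)

for chip in (niagara3, power7):
    print(f"{chip.name}: {chip.cores} cores, {chip.total_threads} threads")
print(f"POWER7 has {core_ratio} of the cores and {thread_ratio} of the threads")
```

Using `Fraction` keeps the "half" and "one quarter" comparisons exact instead of relying on floating-point division.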

Where Niagara 3 keeps a large number of relatively weak cores busy by moving data onto and off of the chip using ample I/O resources, POWER7's approach to feeding a smaller number of much more robust cores is to cache large amounts of data on-chip so that the cores can grind through it in batches. This being the case, POWER7 has the most remarkable on-chip cache hardware of any processor on the market.

First in the chain is the 32KB L1 data cache, which has seen its latency cut in half, from four cycles in the POWER6 to two cycles in POWER7. Then there's the 256KB L2, the latency of which has dropped from 26 cycles in POWER6 to eight cycles in POWER7—that's quite a reduction, and will help greatly to mitigate the impact of the shared L3's increased latency.
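To put those latency cuts in proportion, here is a small sketch that turns the cycle counts above into speedup factors and percentage reductions. The dictionary layout and function names are my own choices for illustration, and the calculation says nothing about clock speeds or hit rates, which aren't given here.

```python
# (POWER6 cycles, POWER7 cycles), taken from the figures quoted in the text.
LATENCY_CYCLES = {
    "L1 data": (4, 2),
    "L2": (26, 8),
}


def speedup(old_cycles: int, new_cycles: int) -> float:
    """How many times faster an access is, measured in cycles."""
    return old_cycles / new_cycles


def reduction_percent(old_cycles: int, new_cycles: int) -> float:
    """Percentage of the old latency that has been eliminated."""
    return 100 * (old_cycles - new_cycles) / old_cycles


for cache, (old, new) in LATENCY_CYCLES.items():
    print(
        f"{cache}: {old} -> {new} cycles, "
        f"{speedup(old, new):.2f}x faster, "
        f"{reduction_percent(old, new):.1f}% lower latency"
    )
```

By this measure the L2 improvement is proportionally larger than the L1's, which fits the point that a faster L2 helps hide the slower shared L3.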

The POWER7's L3 is its most distinctive feature, and, at 32MB, it's positively gigantic. IBM was able to cram such a large L3 onto the chip by making it out of embedded DRAM (eDRAM) instead of the usual SRAM. This decision cost the cache a few cycles of latency, but in exchange IBM got a 3.5x improvement in power efficiency and a 3x improvement in cache density. IBM has actually been talking up the use of eDRAM for on-chip cache since at least 2002, so in this regard POWER7 represents the fruition of years of work on this approach.

On the I/O side, POWER7 features two DDR3 memory controllers that can do up to 100GB/s total. The chip's SMP links (the equivalent of Niagara's coherence links) can do 360GB/s (or almost 2.9Tb/s) of bandwidth, but this amount appears to be divided between internal and external SMP links. The chip doesn't contain the other I/O options that Niagara has, namely PCIe and Ethernet.
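Because the Niagara 3 figure is quoted in terabits and the POWER7 figure in gigabytes, it helps to convert between the two. The sketch below assumes decimal prefixes (1Tb = 1,000Gb) and eight bits per byte; the function and constant names are purely illustrative.

```python
BITS_PER_BYTE = 8
GIGA_PER_TERA = 1000  # assumption: decimal (SI) prefixes, not powers of two


def gbytes_to_tbits(gbytes_per_s: float) -> float:
    """Convert gigabytes per second to terabits per second."""
    return gbytes_per_s * BITS_PER_BYTE / GIGA_PER_TERA


def tbits_to_gbytes(tbits_per_s: float) -> float:
    """Convert terabits per second to gigabytes per second."""
    return tbits_per_s * GIGA_PER_TERA / BITS_PER_BYTE


# POWER7's SMP links, quoted in the text as 360GB/s.
power7_smp_tbits = gbytes_to_tbits(360)
# Niagara 3's aggregate I/O, quoted earlier as 2.4Tb/s.
niagara3_io_gbytes = tbits_to_gbytes(2.4)

print(f"POWER7 SMP links: {power7_smp_tbits:.2f} Tb/s")
print(f"Niagara 3 total I/O: {niagara3_io_gbytes:.0f} GB/s")
```

Keep in mind that the two numbers measure different things: Niagara 3's covers all of its I/O links, while POWER7's covers only its SMP links, so the conversion makes the units match but not the scope.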

Ultimately, these two server-class processors show that there are two very different ways to spend a billion transistors, and each design will be good for different applications. Sun's Niagara is aimed at networked server operations where lots of simultaneous, lightweight requests have to be serviced—databases, Web servers, and the like. In contrast, POWER7 has the horsepower to grind through a smaller number of more compute-intensive tasks at a high rate of speed. Both parts have their place in the server ecosystem of 2010.

 

 
