IBM has taken the wraps off the first servers powered by its monstrously powerful Power8 CPUs. With more than 4 billion transistors packed into a stupidly large 650-square-millimeter die built on IBM’s new 22nm SOI process, the 12-core (96-thread) Power8 is one of the largest, and probably the most powerful, CPUs ever built. In a separate move, IBM is opening up the entire Power8 architecture and its technical documentation through the OpenPower Foundation. This allows third parties to make Power-based chips (much like ARM’s licensing model) and to build specialized coprocessors (GPUs, FPGAs, etc.) that link directly into the CPU’s memory space via IBM’s new CAPI interface. You will not be surprised to hear that Nvidia, Samsung, and Google, three huge players among hundreds more who are beholden to Intel’s server monopoly, are core members of the OpenPower Foundation. The Power8 CPU and the OpenPower Foundation are the cornerstones of a very big, well-orchestrated plan to finally end x86’s reign and place a fairer, more powerful architecture at the head of the server table.
First, we should talk about the new Power8 chip. There are 12 CPU cores, each with 512KB of L2 SRAM and 8MB of L3 eDRAM, for a total of 6MB of L2 and 96MB of L3 cache. There is also 230GB/sec of memory bandwidth to as much as 1TB of DRAM. Each Intel Xeon core is capable of two-way simultaneous multithreading (SMT), and Power7+ cores can run four threads, but Power8 ups the ante to eight simultaneous threads per core. As you’d expect, other parts of the chip have been expanded to cater for Power8’s massive parallelism. There are eight decoders (up from six), six dispatches per clock cycle, and twice as many load units (four). The data cache can now process four 128-bit transactions per cycle, and the bus between the L2 and the data cache is now 512 bits wide. Take a look at the block diagram below and be awed by its massive parallelism and throughput.
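The cache and thread totals above are simple per-core multiplication. Here is a minimal Python sketch that uses only the per-core figures quoted in this article and assumes 1MB = 1024KB. The constant names and the `chip_totals` function are ours, purely for illustration.

```python
# Per-core figures for Power8 as quoted in the article (names are illustrative).
CORES = 12
L2_PER_CORE_KB = 512   # L2 SRAM per core
L3_PER_CORE_MB = 8     # L3 eDRAM per core
THREADS_PER_CORE = 8   # eight-way SMT

def chip_totals(cores=CORES):
    """Return (total L2 in MB, total L3 in MB, hardware threads) for a chip.

    Assumes 1MB = 1024KB when converting the L2 figure.
    """
    l2_total_mb = cores * L2_PER_CORE_KB / 1024
    l3_total_mb = cores * L3_PER_CORE_MB
    threads = cores * THREADS_PER_CORE
    return l2_total_mb, l3_total_mb, threads

l2, l3, threads = chip_totals()
print(f"L2: {l2:g} MB, L3: {l3} MB, threads: {threads}")
# prints "L2: 6 MB, L3: 96 MB, threads: 96"
```

The same multiplication gives the 96-thread figure quoted at the top of the article. Passing a smaller core count, such as `chip_totals(6)`, scales every total linearly.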
We expect the Power8 to eventually reach clock speeds of around 4.5GHz, with a TDP in the region of 250 watts. At that speed, the Power8 should be around 60% faster than the Power7+ in single-threaded applications, and more than twice as fast in multithreaded tasks. In certain cases, IBM says the Power8 can analyze Big Data workloads between 50 and 1,000 times faster than comparable x86 systems (with the same amount of RAM and the same number of cores).
Compared to its competitors (Power7+, Oracle’s Sparc T5, and Intel’s Xeon), the Power8 offers between two and three times more processing power per socket. This is mostly due to its massive thread count (96 vs. 30 for the latest 15-core Xeon E7-8890 v2) and utterly insane memory bandwidth (230GB/sec vs. 85GB/sec). In terms of performance per watt, though, the Xeon (~150W TDP) is probably just ahead of the Power8. But when you’re talking servers, power consumption generally plays second fiddle to performance density (how many gigaflops you can squeeze out of a single server).
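The per-socket ratios behind that comparison can be checked with a few lines of Python. This back-of-the-envelope sketch uses only the peak figures quoted above. The dictionary layout and the `threads` helper are illustrative. These are raw resource ratios, not benchmark results, so they only loosely bracket the two-to-three-times figure.

```python
# Peak per-socket figures quoted in the article (layout is illustrative).
power8 = {"cores": 12, "smt": 8, "mem_bw_gbps": 230}
xeon_e7_8890_v2 = {"cores": 15, "smt": 2, "mem_bw_gbps": 85}

def threads(chip):
    """Hardware threads per socket = cores x SMT ways."""
    return chip["cores"] * chip["smt"]

# Ratios of raw resources, not measured performance.
thread_ratio = threads(power8) / threads(xeon_e7_8890_v2)
bw_ratio = power8["mem_bw_gbps"] / xeon_e7_8890_v2["mem_bw_gbps"]

print(f"threads: {threads(power8)} vs {threads(xeon_e7_8890_v2)} ({thread_ratio:.1f}x)")
print(f"memory bandwidth: {bw_ratio:.1f}x")
```

Running it prints `threads: 96 vs 30 (3.2x)` and `memory bandwidth: 2.7x`.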
Beyond raw SPECint and SPECfp performance, Power8 also introduces CAPI (Coherent Accelerator Processor Interface). CAPI is a direct link into the CPU that lets peripherals and coprocessors communicate with it while bypassing substantial operating system and driver overhead. CAPI is similar to Intel’s QPI, but where QPI is closed and proprietary, IBM is opening CAPI up to third parties. IBM’s Power Systems CTO, Satya Sharma, told me in an interview that for flash memory attached via CAPI, this overhead is reduced by a factor of 20. More importantly, CAPI can be used to attach coprocessors (GPUs, FPGAs) directly to the Power8 CPU for some truly insane workload-specific performance boosts. These CAPI-attached coprocessors are what allow IBM to claim that a Power8 system can be up to 1,000 times faster than a comparable x86 system.