After five years of work on CUDA for supplemental compute, nVIDIA is squaring off with the biggest chip-maker in the world: Intel. These two titans of processor design are both pushing PCI Express-based methods of expanding the compute power of desktops, but their respective approaches and solutions could hardly be more divergent.

The new Intel Xeon Phi coprocessor is quite different from an nVIDIA GPU. The Phi is a MIMD (Multiple Instruction, Multiple Data) machine, while the GPU is a SIMD (Single Instruction, Multiple Data) machine. The Phi is useful only for HPC, while the GPU can run games and graphical simulations as well. The Phi runs its own Linux and presents itself as a cluster, while GPUs are treated as devices attached to the host and managed alongside the CPU.

But beyond the technical differences, the two companies are already showing that they have completely different approaches to the growing HPC market. Intel has focused on bringing its existing compilers and tools to this new processing platform, while nVIDIA has spent the past five years building a community around its CUDA platform, and around GPU compute in general.

Ian Buck, general manager of GPU computing at nVIDIA, said that “Fundamentally, for HPC, it’s about expressing the parallelism. These accelerators are designed to process this stuff in parallel.” To that end, developers using an nVIDIA GPU can write the code for a single thread and have it replicated across the GPU’s cores into tens of thousands of threads. He said this parallelism is easy to get with CUDA.
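The model Buck describes can be sketched with a minimal CUDA kernel. This saxpy example is hypothetical (not from the interview) and uses unified memory for brevity, which requires a CUDA-capable GPU and nvcc to run; the point is that the kernel body is written as the work of one thread, and the launch replicates it across the grid.

```cuda
#include <cstdio>

// The kernel is written as the work of a single thread; the launch
// configuration replicates it across the GPU's cores.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;  // one million elements
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // 4096 blocks of 256 threads: roughly a million threads spawned
    // from one thread's worth of code.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // 2*1 + 2
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```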

“If you try that on 60 cores on a chip, it’s a little less clear how you program it,” said Buck, referring to Intel’s Phi. “You can treat it like an MPI (message-passing interface) cluster on a chip. The challenge of that programming model is each core is wimpy on its own: It’s a Pentium Pro-type processor.”

But James Reinders, director of software products and multicore evangelist at Intel, said this is a benefit of the Phi, not a hindrance. Using the MPI programming model to treat the Phi as if it were a Linux cluster makes this desktop HPC environment familiar to existing HPC developers, who have been using MPI for some time.

“I think of it as SMP [symmetric multiprocessing] on a chip,” said Reinders. “You’ve got a collection of processors, a cache-based architecture with vector units, all brought together with extremely high-performance interconnect. What we’ve done is put it on a single chip. The benefit you get from that is that applications that have been written in MPI or OpenMP should see a very similar environment with Xeon Phi.”

And as for the belief that Phi is a collection of Pentium Pros, Reinders put that to rest. “First of all, they’re not Pentium Pros. They are in-order execution cores,” he said. “The Pentium Pro was our first out-of-order execution core. We’ve said in the past that it’s essentially a Pentium core, but the problem with talking about it like that is that Pentium cores didn’t have vector units. We have very wide vector units: 16 floats wide. It also has four threads per core. It has 64-bit support, machine exception handling, and power states. The Pentium had none of those.

“The only reason we ever mentioned it was sort of like a Pentium was to get people thinking about the in-order execution. It’s not as high-performance per thread. But it’s a better solution for overall power consumption.”

These two companies are also elbowing each other over nVIDIA’s decision to branch out and build OpenACC rather than develop accelerator directives within the OpenMP standards process. These directives tell the compiler where a region of code should execute: on the SIMD GPU or the MIMD CPU. That is the problem OpenACC attempts to solve. But Reinders said the OpenMP team was already working on a CPU- and GPU-compatible solution to the problem when nVIDIA went off to create its own directives standard.

Said nVIDIA’s Buck, “I think Intel and nVIDIA agree that directives are a good way to get operations off a CPU and onto a GPU by expressing regions that can be loops and use parallelism. But OpenMP takes time. They’re a standard, they have many years of legacy work to do. Instead of waiting for that to happen, we’ve been pushing OpenACC as an alternative. It supports GPU, CPU, PGI, Cray.”

But Reinders expressed frustration at nVIDIA’s decision to create OpenACC. “I do not think OpenMP is moving slowly. I think I would question anyone who would assert that,” he said.

“Standards bodies should be very careful to standardize things that we can all survive with. OpenACC is a subset of what the OpenMP committee was working on, but it was a subset designed only to service GPUs. There was very deliberate slicing of the standard so it was only to service GPUs. The OpenMP team was working on standardizing in a general way. I don’t find OpenACC to be a standard. It’s a proprietary solution for nVIDIA.

“OpenMP’s solution was the product of many companies, including nVIDIA. It was a very thoughtful solution for directives that would work both for Xeon Phi, nVIDIA GPUs, AMD GPUs, and even Intel GPUs. I think by the time it makes it into 4.0, hopefully all the technology problems will be solved, and it will be a real solution, and we can stay as far away from politicking as possible. I think OpenMP has done a stellar job of making sure the technical concerns of users are addressed, and that we get a standard that users can use.”

But Buck claimed nVIDIA has always planned to contribute the OpenACC standard and code back into OpenMP.

The tale of the tape:
Intel Xeon Phi Coprocessor 5100
60-core CPU
Up to 16-channel GDDR5 memory interface with optional ECC
PCI Express x16 Gen2 interface with optional SMBus management interface
Node power and thermal management, including power-capping support
On-board flash device that loads the coprocessor OS at boot
Card-level RAS features and recovery capabilities
MIMD machine
Cost: around $1,200

nVIDIA GeForce GTX 690
CUDA cores: 3,072 (1,536 per GPU)
Memory speed: 6.0Gbps
Standard memory config: 4096MB (2048MB per GPU) GDDR5
Memory interface width: 512-bit (256-bit per GPU)
Memory bandwidth: 384GB/sec
SIMD machine
Cost: around $1,000