“A manycore processor is a device for turning a compute-bound problem into a memory-bound problem,” said Mark Murphy, a Ph.D. candidate in electrical engineering and computer science at the University of California, Berkeley, working and lecturing in the Intel-financed Parallel Computing Lab (Par Lab).

“This is a quote that I use in almost every talk I give. Kathy Yelick [director of the National Energy Research Scientific Computing Center at Lawrence Berkeley National Laboratory] was the first to say it…Essentially, the amount of on-chip computation is growing much faster than the amount of memory bandwidth that you have, so we are at the point where you need at least 10 times as many floating point operations as you have memory loads,” he continued.

“If you load a single cache line, you want to be sure to use all of it, and you want to be sure, if you can, to load single cache lines at a time. This is the same problem as we have with CPUs, but it’s exacerbated in GPUs because of the large amount of on-chip parallelism.”
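Yelick’s ratio and Murphy’s cache-line advice are two sides of the same arithmetic. As an illustration only, the Python sketch below (assuming a typical 64-byte cache line and 4-byte single-precision floats; the array size is arbitrary) counts how many distinct cache lines a read pattern pulls in:

```python
CACHE_LINE_BYTES = 64   # typical line size on current CPUs and GPUs
FLOAT_BYTES = 4         # single-precision float

def lines_touched(n_floats, stride):
    """Distinct cache lines pulled in when reading n_floats at a given stride."""
    return len({(i * stride * FLOAT_BYTES) // CACHE_LINE_BYTES
                for i in range(n_floats)})

# Unit stride: 16 floats share each 64-byte line, so every loaded byte is used.
print(lines_touched(1024, 1))    # 64 lines
# Stride of 16: each float lands on its own line, so the same 1,024 reads
# generate 16x the memory traffic for exactly the same arithmetic.
print(lines_touched(1024, 16))   # 1024 lines
```

On a GPU, the strided case shows up as uncoalesced access across the threads of a warp, which is why it hurts more there than on a CPU.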

Just a few years ago, Intel’s multicore CPU efforts were stealing headlines. C++ guru Herb Sutter first sounded the alarm to developers that ever-rising clock speeds could no longer be counted on with a Dr. Dobb’s Journal article entitled “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software.” (Disclosure: I was on retainer as a contract editor to help Intel evangelize concurrent programming to software developers from 2007 to 2009.)

A funny thing happened on the way to the multicore future, however: While CPU makers and developers alike were slamming into Amdahl’s “Wall,” struggling to realize performance increases for eight or 16 cores thanks to the sequential overhead of their legacy code, GPU makers were quietly cooking up a revolution.
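Amdahl’s Law is the arithmetic behind that wall: overall speedup is capped by the fraction of the program that stays sequential. A quick Python sketch, using a hypothetical 90%-parallel workload rather than any shop’s real numbers:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

# Even with 90% of the runtime parallelized, 16 cores deliver only ~6.4x...
print(round(amdahl_speedup(0.9, 16), 2))       # 6.4
# ...and no number of cores can ever beat 1 / (1 - 0.9) = 10x.
print(round(amdahl_speedup(0.9, 10**9), 2))    # 10.0
```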

It’s been a few decades in the making, granted. The concept of general-purpose graphics processing units (GPGPUs) dates to the late 1970s. But a number of factors have coalesced recently:

•    an aggressive marketing campaign for Nvidia’s CUDA programming model
•    an installed base of hundreds of millions of desktop GPUs
•    Intel’s failure to launch Larrabee, and its continued promotion of parallel programming paradigms
•    the success of the OpenGL cross-platform graphics API
•    AMD’s evolving vision for ATI Stream Technology
•    a growing market for parallel-programming productivity tools
•    Apple’s push for the OpenCL framework for portable parallel programming

Is multicore maxed out?
Talk to an IT shop experienced with concurrency and you’ll probably hear the same story: They made a fairly easy leap to threading for the “embarrassingly parallel” portions of their application, but they’re not sure how much rearchitecting it would take to run proportionately faster on eight or 16 cores.

We’ve all heard why multicore makes sense, but we may have prematurely reached the top of the multicore curve, much as we have with the decades-long run of Moore’s Law as it pertains to CPU architectures and clock speeds. Heterogeneous CPU-GPU platforms, however, are burgeoning, and we’ve only just begun to climb aboard.

Could this obviate the need for extensive concurrency training for software developers? Can they simply offload parallel computation to the GPU, which, unlike the CPU, has the potential to linearly scale performance the more cores it has? Can you just “fire it and forget it,” as Sanford Russell, general manager of CUDA and GPU Computing at Nvidia, puts it? Sorry, no.

“The goal is not to offload the CPU. Use CPUs for the things they’re best at and GPUs for the things they’re best at,” said Murphy. An example is a magnetic resonance imaging (MRI) reconstruction program that he found worked best on a six-core Westmere CPU.

“The per-task working set just happened to be 2MB, and the CPU had 12MB of cache per socket and six cores,” said Murphy. “So if you have a problem with 2MB or less per task, then it maps very beautifully to the L3 cache. Two L3 caches can actually supply data at a higher rate than the GPU can.”
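The arithmetic behind that observation is worth making explicit when deciding where a kernel belongs. A simplified Python sketch using the figures Murphy cites (it ignores shared data, cache associativity and anything else sharing the L3):

```python
L3_PER_SOCKET_MB = 12    # Westmere-class CPU: 12MB of L3 per socket
CORES_PER_SOCKET = 6
WORKING_SET_MB = 2       # per-task working set of the MRI reconstruction

# If each core runs one task, this is the share of L3 each task can claim.
cache_per_task_mb = L3_PER_SOCKET_MB / CORES_PER_SOCKET

# The task fits entirely in its share, so its reads are served on-chip
# instead of from DRAM -- the case where two L3s can outrun a GPU.
fits_in_l3 = WORKING_SET_MB <= cache_per_task_mb
print(cache_per_task_mb, fits_in_l3)   # 2.0 True
```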

“The CPU is designed to handle all these interrupts—from your mouse or your hard disk, for example,” said Russell. “The GPU is not good at that. What it wants is, ‘Give me a tremendous amount of work.’

“We were solving a problem that was literally next door to what the CPU was trying to solve. There were parallel machines and massively parallel machines for a long time, but we had to make it much simpler to program them.”

At a philosophical level, the fundamental difference between CPUs and GPUs is that “the way the resources are deployed and harnessed and the way the data is moved around on a CPU is under software control; the way the data and resources are deployed on a GPU is still significantly under non-software control,” explained Kurt Akeley, a founder of Silicon Graphics and an OpenGL expert, in a 2008 ACM Queue interview.

Interestingly, GPUs have basically been programmable by software developers all along, but the performance trajectory and the problem set meant that for “a variety of tactical reasons, the programmability wasn’t exposed,” said Akeley. Until now.

CUDA takes hold
In just five years, Nvidia’s proprietary (albeit free) CUDA C extensions have made GPGPU programming accessible (if not to the masses, then to some 100,000 developers, the company claims). The CUDA compiler turns C and C++ source code into binary executables that run on Nvidia’s GPUs. Language bindings are available for Fortran, Java, Python, Ruby and others, and library support includes CUBLAS (linear algebra), CUFFT (fast Fourier transforms) and CULAPACK (a CUDA version of the classic Fortran-90 routines for solving linear equations and other advanced computations, optimized for shared-memory parallel processors).

If a normal software development conference tends a little too much toward the usual enterprise fare when it comes to applications (the words “customer,” “transaction” and “account” abound), the CUDA talks at Nvidia’s recent GPU Technology Conference in San Jose were almost too exciting: robot cars, finite-element code, shock and blast simulations, and large-scale CCTV facial-recognition software, among others.

“Nvidia is trying to build CUDA expertise, but a lot of companies don’t want to have to retrain a staff of CUDA programmers,” said Brian Pierce, CEO of Rogue Wave, a vendor of high-performance computing tools. “That’s why it hasn’t taken off in the financial services space. In so many of the mainstream and supercomputing environments, when people want to change hardware, they’re told the software has to work on it or you’re not coming in.”

But Nvidia’s Russell disputes that notion, citing financial applications such as SciComp’s SciFinance, which has accelerated derivatives pricing and Monte Carlo models via CUDA and the company’s GPUs. “Bloomberg announced they were using GPUs for their fixed-income calculations. Now that’s a very public back-office application,” he said.

Still, only a handful of financial examples are listed on the company’s website, and some are purely experimental. Yet Nvidia’s installed base of 250 million CUDA-capable GPUs means that “it’s likely there’s one in your workplace or your house. There are SDKs and libraries available, so you can play with it in your own domain,” Russell said.

Why OpenCL is ugly
CUDA can count momentum, stability, abstraction and a modicum of mature elegance among its attributes. Unfortunately, according to OpenMP pioneer and Intel senior research scientist Timothy Mattson, beauty works against a programming paradigm. In his standard talk on OpenCL for UC Berkeley’s Par Lab, where he now works, he said, “History supports ugly programming models…With all the elegant abstractions for parallelism that have been created, what is actually used are MPI, explicit thread libraries and compiler directives.”

OpenCL, however, is a pragmatic standard, pushed by vendors at an opportune moment. That, Mattson crows, is exactly what’s needed.

How ugly is it? “CUDA has a much clearer picture of the hardware it’s going to execute on, so it doesn’t need to expose so much to you,” said Murphy. A simple “Hello World” involving vector addition is only about a dozen lines long in CUDA. The OpenCL kernel and host code are quite a bit more complex, with the host code explicitly defining the platform, queues and memory objects; creating the program and kernel; and finally executing the kernel and reading the results on the host.

“The main goal of OpenCL is to provide some sort of portability, and it does that, but it doesn’t provide performance portability,” said Murphy. “It may not be possible to write the same program and have it run in high performance on multiple architectures. OpenCL is not as mature as CUDA, but specific technical problems could be fixed in the next year.”

In contrast, a well-written CUDA program can scale nearly linearly with the number of cores at its disposal. Murphy has sped up the parallelizable portions of his MRI reconstruction code several hundred times using multiple GPUs, cutting total processing time from one hour to 30 seconds. Other portions of the same program turned out to be sequential but amenable to improvement via concurrency on the CPU. “This is a very important thing not to neglect: There is good library support for CPUs,” said Murphy.

Building an ecosystem
Library support, software development kits, APIs and tools are increasingly available as the concept of parallelism spreads and new platforms get ready for their close-ups. It’s not always about heterogeneous platforms, either.

“There is this convergence where desktop chip manufacturers are driving people to think about parallelism from the start,” said Rogue Wave’s Pierce. “But we also see traditional HPC and supercomputers bleeding into the mainstream. The supercomputer world is not going away, but it needs to be more productive. We bring together sets of development tools that support those trends.”

Since memory bandwidth is the biggest bottleneck, the company’s recent acquisition of Acumem makes sense. Acumem SlowSpotter and ThreadSpotter sniff out poor data locality, cache troubles and thread conflicts. Other tools include the TotalView scalable debugger for serial, parallel, multi-threaded, multi-process and remote applications; and IMSL Numerical Libraries, which are embeddable algorithms for complex problem-solving and predictive analytic applications in C, Fortran, Java and .NET.

A good understanding of the GPU and CPU memory models is key. But if you’re trying to rev up existing applications, first run a profiler such as Intel VTune to see where your hotspots are and to determine which portions might make an ideal GPU workload, said Russell.

The good news is that concurrent and parallel programming, while still in flux, will require much of the same conceptual thinking for the foreseeable future. Par Lab teaches that CUDA constructs, such as scalar threads, warps, blocks and grids, are conceptually applicable to other models, such as DirectX DirectCompute and OpenCL. According to Murphy, “The one thing I’m trying to have my students take away is this: Parallelism is not terribly difficult.”
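Those constructs map across models because they all describe the same hierarchy of work. As a toy illustration, the Python sketch below mimics CUDA’s flat global index (blockIdx.x * blockDim.x + threadIdx.x); the grid and block sizes are made up:

```python
def global_thread_id(block_idx, block_dim, thread_idx):
    """CUDA-style flat index: blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx

# A hypothetical grid of 4 blocks, each 256 threads wide, covers the
# indices 0..1023 exactly once: one scalar thread per array element.
ids = [global_thread_id(b, 256, t) for b in range(4) for t in range(256)]
print(ids[0], ids[-1], len(set(ids)))   # 0 1023 1024
```

The same decomposition carries over to OpenCL’s get_global_id and DirectCompute’s thread groups, which is the conceptual portability Par Lab teaches.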

Milestones in parallel programming
1967 – Amdahl’s Law published
1978 – Ikonas markets GPGPU-based cockpit display system
1988 – Gustafson’s Law published
1989 – Linda coordination language developed
1994 – MPI (message passing interface) proposed
1997 – OpenMP standard released; ASCI Red is the world’s first teraFLOPS supercomputer
2005 – Top 500 supercomputers are all at least 1 teraFLOP; Sutter’s concurrency manifesto published; Intel Pentium D dual-core processor available
2006 – Intel unveils C++ library of Threading Building Blocks and a quad-core processor; AMD buys GPU-maker ATI
2007 – Intel demonstrates its 80-core teraFLOPS research chip; Nvidia launches CUDA
2008 – OpenCL 1.0 released
2009 – AMD embraces OpenCL for its ATI Stream SDK; Apple implements OpenCL in Mac OS X Snow Leopard
2010 – Nvidia’s Fermi boasts 512 stream processors and 3 billion transistors; the company also announces 350 CUDA teaching centers and 100,000 developers; Intel releases Array Building Blocks