Like an amoeba, processor cores seem to be dividing. With 4-, 8- and 16-core chips out there now, and even more cores coming, developers have their work cut out for them.

CPUs aren’t the only place to find extra cores for your HPC applications. Graphics processing units (GPUs) are standard in most modern PCs, and if you learn how to tap into their power, you can get an extra burst of speed for those tough-to-compute algorithms.

David Intersimone, vice president of developer relations and chief evangelist at Embarcadero Technologies, said that GPUs are becoming as common as multicore processors. “We’re seeing GPUs appearing everywhere, not only on desktops, but on tablets and mobile devices,” he said.

“People are using them for graphics acceleration, of course. But the processors can do more. At nVidia, they’re saying, ‘There are gamers and then there are artists, and that’s fine.’ But the big energy is being put into GPU computing, and for nVidia, that’s CUDA, and for those of us that want to support multiple graphics processor vendors, it’s OpenCL.”

Neal Robinson, senior director of global content and application support at AMD, said that the ever-increasing power of processors is diminished for developers who don’t fully take advantage of multicore programming techniques.

“We’ve got that Moore’s law issue, where we can’t just depend on speedups in hardware to provide huge leaps forward in software performance anymore. That’s where really putting to use the compute resources we’re putting out there for developers makes all the difference in the world,” he said.

“Developers usually don’t know how many cores are in the machine; all they know is the software experience. That’s where we are concentrating. We’ve got a good chunk of folks working on the language, making sure the tools are available.”

nVidia has also been focusing heavily on CPU/GPU software development. The company has released numerous revisions of its CUDA tools, now on version 4.1. nVidia also supports OpenCL use and tooling on its cards, but it’s AMD that has taken on OpenCL as its primary HPC development framework.

Both companies have a lot riding on the success of HPC development on GPUs. But it’s the developers who win in the end, as all sides attempt to win over users by releasing better and easier-to-use tools.

Manju Hegde, corporate vice president of AMD’s products group, worked at nVidia before moving over to AMD last summer. Since then, he has been heading up the development of various new tools for OpenCL users.

These tools are venturing into new territory for most developers. Load balancing, for example, is typically something done on a network. In CPU/GPU development, load balancing needs to be done to keep workloads steady on both processors.

“Facial-recognition detection in the foreground is a very GPU-intensive task,” said Hegde, citing an example of where load balancing is needed. “In the background, it’s a CPU-intensive task. Load balancing allows you to seamlessly transfer the kernel from one processor to the other.”
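The mechanism behind that kind of hand-off is visible in OpenCL’s host API, where CPUs and GPUs are both simply devices. The sketch below is our own illustration, not vendor code: it assumes an installed OpenCL SDK, and the helper pick_device is hypothetical. It shows how the same program can ask the runtime for either kind of processor:

#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Hypothetical helper: return the first device of the requested type
// (CL_DEVICE_TYPE_CPU or CL_DEVICE_TYPE_GPU), or null if none exists.
cl_device_id pick_device(cl_device_type type) {
    cl_uint count = 0;
    clGetPlatformIDs(0, nullptr, &count);
    std::vector<cl_platform_id> platforms(count);
    clGetPlatformIDs(count, platforms.data(), nullptr);
    for (cl_platform_id p : platforms) {
        cl_device_id dev;
        if (clGetDeviceIDs(p, type, 1, &dev, nullptr) == CL_SUCCESS)
            return dev;
    }
    return nullptr;
}

int main() {
    // A load balancer might send foreground work here...
    cl_device_id gpu = pick_device(CL_DEVICE_TYPE_GPU);
    // ...and background work here, using the very same kernel source.
    cl_device_id cpu = pick_device(CL_DEVICE_TYPE_CPU);
    std::printf("GPU: %s, CPU: %s\n", gpu ? "found" : "none",
                cpu ? "found" : "none");
    return 0;
}

From there, clCreateContext and a command queue for either device can run identical kernel source, which is what lets a balancer move work between the two processors.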

It’s minutiae like these that can make developing and optimizing CPU/GPU code difficult for the newcomer. That’s an area that nVidia said it’s addressing in its CUDA tool set.

The real key to providing that simplicity, said Will Ramey, nVidia’s senior product manager for GPU computing, is distinguishing between the different kinds of developers who use nVidia’s CUDA toolkit.

“What we’ve found is that there’s a really big distinction between developers who are computer scientists and academically trained in engineering, and the domain experts on the other hand: the guys who have gone through a biology, chemistry or physics type education path,” said Ramey. “Unsurprisingly, they have a different set of skills and understanding of how to use the technology.

“What we’re finding is that many of these domain experts—gene sequencing is an example in biology, and oil and gas exploration is an example in geology—have learned just enough computer programming to express the problem so the computer can do the math for them. What we find is that the very large audience of domain experts is often much more comfortable working with things like Python or MATLAB or Mathematica.

“The great thing is it allows them to write portable code and take advantage of the CPU and GPU. MATLAB is a great example; their most recent release has built-in support for GPU when there’s a GPU in the system.”

Indeed, much of the focus at nVidia is on allowing existing applications to utilize the GPU without the actual user being required to modify code or learn new techniques, said Ramey. “If you use the MATLAB Parallel Computing Toolbox, you can use MATLAB to write an application, save it as an executable that runs without MATLAB, and scale it across a cluster,” he said.

OpenMP and friends
James Reinders, director and evangelist for Intel Software, said that Intel’s Threading Building Blocks (TBB) has been received as an alternative to OpenMP, in part because it offers a different approach to optimizing multithreaded code. Cilk Plus, a set of C and C++ language extensions, tackles the parallelization problem from yet another angle.

“Threading Building Blocks has built a community that is larger than OpenMP, and quite distinct from it because it’s a C++ crowd,” said Reinders.

“Threading Building Blocks has been a huge success story in the industry in terms of adoption. Adding to this, Cilk Plus uses compiler extensions. OpenMP is a set of library extensions; Threading Building Blocks is a set of libraries. There is often a debate as to which is the better approach. Cilk Plus takes a look at: If we were to expand C++ with compiler extensions, what could we do?”
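As a concrete illustration of the library side of that debate, here is a minimal TBB sketch; parallel_for and blocked_range ship with the library, while the scale function is our own example. It is standard C++, with no compiler changes required:

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <vector>

// Scale every element of a vector in parallel. TBB's scheduler splits the
// index range into chunks and runs them across the available cores.
void scale(std::vector<float>& data, float factor) {
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, data.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                data[i] *= factor;
        });
}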

OpenMP, once the standard for writing parallel HPC applications, has given way to newer tools such as CUDA, Intel’s TBB and OpenCL. But that doesn’t mean the techniques laid out in the OpenMP philosophy are no longer in use, particularly among C, C++ and Fortran developers.

“Another approach is using OpenMP-like directives,” said Ramey. “There are domain experts who are really good at writing simple single-core code, and with a couple instructions, you can parallelize that across CPU and GPU.

“The two examples in this area are PGI Accelerator for C and Fortran. PGI Accelerator directives are a really easy way for a domain expert to take serial code and dramatically accelerate it. In a matter of days, people have been able to get 500% improvements on code they’ve had for years by applying these directives.”

Directive solutions, such as OpenMP, are still very useful to HPC developers, said Ramey. “With CAPE [Checkpointing Aided Parallel Execution] directive solutions, companies have been building applications for years, OpenMP being the most notable and successful of these solutions. We see this as very exciting, especially within the domain expert scientist community, because it’s a way to get GPU acceleration without the deeper thinking that sometimes has to go into parallelizing code, even on your desktop,” he said.
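In its simplest OpenMP form, the directive approach amounts to a single pragma added on top of ordinary serial code. The sketch below uses the standard OpenMP directive rather than PGI’s own spelling, but the principle Ramey describes is the same:

// Serial SAXPY loop; the pragma alone asks the compiler to spread the
// iterations across every available core.
void saxpy(int n, float a, const float* x, float* y) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

Compiled with -fopenmp (GCC) or /openmp (Visual C++), the loop runs in parallel; without the flag, the same source still builds and runs serially.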

OpenMP continues to be used even though newer alternatives are available. But one of those alternatives, OpenCL, is becoming a point of contention between GPU manufacturers. While both AMD and nVidia GPUs support the use of OpenCL, each GPU manufacturer sees the standard differently.

Ramey said of OpenCL: “I think there are some people who were the early adopters who were able to complete their projects and start talking about them. We’re finding the developers who are choosing OpenCL tend to be more on the computer-science side, because they are more comfortable working in the low-level API, and working in C or C++.

“One of the things we’re happy about at nVidia is we’ve chosen to support all of these options. When developers have an nVidia GPU in their system, they have the freedom of choice.”
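That low-level flavor is easy to see: the kernel itself is a separate program written in OpenCL C, carried around as a string and compiled at runtime. A bare-bones example of our own:

// Vector addition written in OpenCL C. The host compiles this string at
// runtime with clCreateProgramWithSource and clBuildProgram, then launches
// it with clEnqueueNDRangeKernel, all managed by hand.
static const char* kVecAddSource =
    "__kernel void vec_add(__global const float* a,\n"
    "                      __global const float* b,\n"
    "                      __global float* c) {\n"
    "    size_t i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";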

AMD’s Hegde disagreed with that assessment. He said that standardizing on OpenCL was a no-brainer for AMD, and that because OpenCL is an open standard, it will eventually triumph due to the lack of barriers to adoption.

To that end, Hegde and AMD are pushing OpenCL tools and driving an effort to make them open source. The hope is that making more tools available will drive adoption, despite nVidia’s three-year head start with CUDA.

Said Hegde, “There will be lots of compiler work because we want to always stay true to the OpenCL promises of being cross-vendor and cross-platform. We have to go from an intermediate layer to target all the ISAs. There is tools work that needs to be done there. Our first reference compiler will be for C++ AMP, then for all remaining high-level compilers. LLVM will be first. Then the architectural specifications will be open, so we can do GCC too. We are prioritizing the consumer applications first. Then we’ll use the mass-market adoption to fund the more niche applications.”

Intel’s Reinders said that bringing compilers into the equation offers a great deal more options for TBB users. “When we go back to the compiler extensions, you can take care of C developers as well,” he said. “We’re able to help with task-based parallelism and data problems with Cilk Plus in a way we can’t with TBB, because we have the compiler helping us. The disadvantage is that you can only use this with a compiler that supports them.”

Compiling support
Reinders added that the Cilk Plus approach has broader applicability to compiler developers. “To help see other compilers adopt this, we’ve open-sourced the implementation. There’s a GNU branch looking at adding it to GCC. We do think adding parallelism and a few key concepts into the language would benefit everyone,” he said.

“We’re putting our actions behind that by sharing our implementation, so competing compilers can implement to the same level of performance and not have to do that work we’ve done already to support Cilk Plus.”
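To make the trade-off concrete, here is the classic Cilk fib sketch. Because cilk_spawn and cilk_sync are keywords, it builds only with a compiler that supports the Cilk Plus extensions, such as Intel’s, or GCC once that branch lands:

#include <cilk/cilk.h>

// cilk_spawn lets the first recursive call proceed in parallel with the
// caller; cilk_sync waits for spawned work before combining results.
long fib(long n) {
    if (n < 2) return n;
    long a = cilk_spawn fib(n - 1);
    long b = fib(n - 2);
    cilk_sync;
    return a + b;
}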

And the compilers are where the action is now. Embarcadero’s Intersimone said that his company will be supporting both CUDA and OpenCL via its compilers. “People like OpenCL because it supports lots of different parallel-processing systems, whether they’re CPU- or GPU-based. I think it’s just a matter of time before these technologies will either be embedded in the languages, or they’re just part of the same runtime everyone uses,” he said.

Language integrations may indeed be the next step for CPU/GPU computing. Already, various language additions for C, C++ and Fortran are available to help developers building HPC applications. Ramey said that “the directive-based approach, like the PGI Accelerator solution for C, [is effective].

“Earlier this year, someone figured out how to do something similar with Java as well. They’re basically instrumenting the JVM to detect or be informed of sections of the code that would benefit from parallelization, and then the JIT compiler turns the JVM bytecode into a form that can be run on the GPU, and manages transfers back and forth.”

But AMD and nVidia aren’t the only teams playing in the multicore solutions space. Intel’s TBB offers libraries that can be used to accelerate and optimize the inclusion of parallel code in applications. And TBB has been available for a few years now, so it is a mature product for enterprise use.

Reinders said that TBB is continuing to evolve and take advantage of the newest technologies. While GPU integration isn’t yet a part of TBB, the underlying parallel development paradigms it enables are useful for any multicore developer. Additionally, Intel has put great effort into supporting the newly minted C++11 standard.

“C++11 is ratified now,” said Reinders. “It’s an official standard, and we have very aggressively implemented a great deal of it. We’re not completely sure when we’ll be done with it. We think we’ve implemented the features most in demand.

“We were ahead of the standard with lambda support. The most anticipated feature other than lambdas is variadic templates: templates with a variable number of arguments. I think we’re leading the way on that. The good news for developers is that when they look to that standard, they’re going to have a number of options.”
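For readers who haven’t met those two features, a small sketch (our own example) shows both: a variadic template accepts any number of arguments, and a lambda packages code as a value:

#include <iostream>

// Variadic template: one definition handles any argument count and types.
template <typename... Args>
void log_all(const Args&... args) {
    // C++11 pack-expansion idiom that applies << to each argument in order.
    int dummy[] = {0, ((std::cout << args << ' '), 0)...};
    (void)dummy;
    std::cout << '\n';
}

int main() {
    auto square = [](int x) { return x * x; };  // lambda: an unnamed function object
    log_all("square of 7 is", square(7), "and pi is about", 3.14159);
    return 0;
}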

Intel also supports OpenMP 3.1 in its compilers. “I think here, we’re probably the first to implement OpenMP 3.1, which has a number of user-requested additions that make OpenMP easier,” said Reinders.

Intersimone said that languages and class libraries should be the focus for the future. “We need the major languages embracing—in their language and in their standard libraries—CUDA and OpenCL,” he said.

“Embarcadero has Prism, which takes advantage of .NET methods for parallel for loops and future variables. You put future on a variable and it generates the code under the covers for you. Or you declare a method as asynchronous.

“The other area is the world of functional languages. A lot of the work people are doing and kinds of things they’re working on there relate directly to high-performance systems. I think as functional language concepts get into languages like C, C++ and Java, then we sort of get the benefits of some of that as we build new applications.”

Across the board, addressing parallelism in languages seems to be pushing more and more of them to support closures, also known as lambdas. Java SE 8 will include closures, and will in turn add functionality to the standard Java libraries. These new functions will be focused on spreading data across clusters, then running iterative functions across them, an execution style that is perfect for parallel systems.

While Java and C++ add closures to deal with parallelism, other developers are simply turning to functional languages, such as Clojure, Erlang and Scala, to write their code for multicore machines.

It’s about the tools
But no matter how far and wide support for Intel’s TBB, nVidia’s CUDA and OpenCL is spread, at the end of the day developers still need tools to make their lives easier. And that’s where these companies are putting a great deal of effort.

AMD, in particular, has been pushing hard to develop tooling for OpenCL. Hegde said that AMD plans to release a suite of such tools later this year, including profilers, debuggers, load balancers and code analyzers.

And AMD is addressing another key area where OpenCL and multicore programming are concerned: “We’re offering a university kit for professors to teach a new course. We asked, ‘What are the stumbling blocks?’ We decided to offer a reference version of the course. The next problem is with solutions: You’ve got to grade them…We automated the grading process.”

Algorithm research is also an area of interest for all the involved companies. nVidia’s Ramey said, “The MAGMA project, led out of the University of Tennessee at Knoxville, is headed by Jack Dongarra. Instead of choosing either/or, they are researching algorithms that are able to take advantage of the CPU and GPU at the same time. They’d take two matrices and multiply part on each processor.”

And nVidia is continuing to improve its tools, added Ramey. “In CUDA 4.1, one of the really exciting things is a brand new visual profiler,” he said. “We redesigned the performance-analysis tool that comes for free in the toolkit to guide developers step-by-step through their applications to find the hot spots and integrate with the documentation to show them how to make those hot spots go away.

“We see projects like MAGMA and CUDA Tools getting used directly in research projects and integrated into higher-level applications like MATLAB. Our strategy is working with…computer scientists in order to build this platform up higher and higher so the larger audience of domain experts can take advantage of it.”

In the end, the GPU has come to be a compute resource for enterprises and academic institutions mostly by default. HPC projects have long latched onto new processors and hardware technologies, but ultimately it’s the software technologies that push the state of the art forward. While universities snatched up Power Mac G5s and PlayStation 3s in a desperate search for the next cheap way to get extreme performance out of a cluster, they should have been paying attention to the software.

“It’s pretty impressive, for a technology created only five years ago, that this rich and complete an ecosystem is in place,” said Ramey. “Compare this to the Cell processor, for example. The government had to fund development of that. With GPUs, because we have this massive installed base supporting the push into HPC, the ecosystem is growing by leaps and bounds.”