Harnessing the GPU for multicore processing

Published: November 14th, 2011

- Alex Handy

Like an amoeba, processor cores seem to be dividing. With 4-, 8- and 16-core chips out there now, and even more cores coming, developers have their work cut out for them.

CPUs aren’t the only place to find extra cores for your HPC applications. Graphics processing units (GPUs) are standard in most modern PCs, and if you learn how to tap into their power, you can get an extra burst of speed for those tough-to-compute algorithms.

David Intersimone, vice president of developer relations and chief evangelist at Embarcadero Technologies, said that GPUs are becoming as common as multicore processors. “We’re seeing GPUs appearing everywhere, not only on desktops, but on tablets and mobile devices,” he said.

“People are using them for graphics acceleration, of course. But the processors can do more. At nVidia, they’re saying, ‘There are gamers and then there are artists, and that’s fine.’ But the big energy is being put into GPU computing, and for nVidia, that’s CUDA, and for those of us that want to support multiple graphics processor vendors, it’s OpenCL.”

Neal Robinson, senior director of global content and application support at AMD, said that the ever-increasing power of processors is diminished for developers who don’t fully take advantage of multicore programming techniques.

“We’ve got that Moore’s law issue, where we can’t just depend on speedups in hardware to provide huge leaps forward in software performance anymore. That’s where really putting to use the compute resources we’re putting out there for developers makes all the difference in the world,” he said.

“Developers usually don’t know how many cores are in the machine, all they know is the software experience. That’s where we are concentrating. We’ve got a good chunk of folks working on the language, making sure the tools are available.”

nVidia has also been focusing heavily on CPU/GPU software development. The company has released numerous revisions of its CUDA tools, now on version 4.1. nVidia also supports OpenCL use and tooling on its cards, but it’s AMD that has taken on OpenCL as its primary HPC development framework.

Both companies have a lot riding on the success of HPC development on GPUs. But it’s the developers who win in the end, as all sides attempt to win over users by releasing better and easier-to-use tools.

Manju Hegde, corporate vice president of products group at AMD, was originally an employee at nVidia, but moved over to AMD last summer. Since that time, he has been heading up the development of various new tools for OpenCL users.

These tools are venturing into new territory for most developers. Load balancing, for example, is typically something done on a network. In CPU/GPU development, load balancing needs to be done to keep workloads steady on both processors.

“Facial-recognition detection in the foreground is a very GPU-intensive task,” said Hegde, citing an example of where load balancing is needed. “In the background, it’s a CPU-intensive task. Load balancing allows you to seamlessly transfer the kernel from one processor to the other.”

It’s these minutiae that can make developing and optimizing CPU/GPU code difficult for the newcomer. That’s an area that nVidia said it’s addressing in its CUDA tool set.

The real key to providing that simplicity, said Will Ramey, nVidia’s senior product manager for GPU computing, is distinguishing between the two kinds of developers who use nVidia’s CUDA toolkit.

“What we’ve found is that there’s a really big distinction between developers who are computer scientists and academically trained in engineering, and the domain experts on the other hand: the guys who have gone through a biology, chemistry or physics type education path,” said Ramey. “Unsurprisingly, they have a different set of skills and understanding of how to use the technology.

“What we’re finding is that many of these domain experts—gene sequencing is an example in biology, and oil and gas exploration is an example in geology—where they have learned just enough computer programming to express the problem so the computer can do the math for them. What we find is that for the very large audience of domain experts, they are often much more comfortable working with things like Python or MATLAB or Mathematica.

“The great thing is it allows them to write portable code and take advantage of the CPU and GPU. MATLAB is a great example; their most recent release has built-in support for GPU when there’s a GPU in the system.”

Indeed, much of the focus at nVidia is on allowing existing applications to utilize the GPU without the actual user being required to modify code or learn new techniques, said Ramey. “If you use the MATLAB parallel-computing tool box, you can use MATLAB to write an application, save it as an executable that runs without MATLAB, and scale it across a cluster,” he said.

OpenMP and friends
James Reinders, director and evangelist for Intel Software, said that Intel’s Threading Building Blocks (TBB) has been received as an alternative to OpenMP. Part of this is because TBB offers a different approach to optimizing multi-threaded code; Cilk Plus, a set of C and C++ language extensions, tackles the parallelization problem from a different angle than OpenMP.

“Threading Building Blocks has built a community that is larger than OpenMP, and quite distinct from it because it’s a C++ crowd,” said Reinders.

“Threading Building Blocks has been a huge success story in the industry in terms of adoption. Adding to this, Cilk Plus uses compiler extensions. OpenMP is a set of library extensions, Threading Building Blocks is a set of libraries. There is often a debate as to which is the better approach. Cilk Plus takes a look at if we were to expand C++ with compiler extensions, what could we do?”

OpenMP, while formerly the standard for writing HPC parallel applications, has given way to newer tools such as CUDA, Intel’s TBB, and OpenCL. But that doesn’t mean the techniques laid out in the OpenMP philosophy are not still in use, particularly by users of C, C++ and Fortran.

“Another approach is using OpenMP-like directives,” said Ramey. “There are domain experts who are really good at writing simple single-core code, and with a couple instructions, you can parallelize that across CPU and GPU.

“The two examples in this area are PGI Accelerator for C and Fortran. PGI Accelerator directives are a really easy way for a domain expert to take serial code and dramatically accelerate it. In a matter of days, people have been able to get 500% improvements on code they’ve had for years by applying these directives.”

Directive solutions, such as OpenMP, are still very useful to HPC developers, said Ramey. “With CAPE [Checkpointing Aided Parallel Execution] directive solutions, companies have been building applications for years, OpenMP being the most notable and successful of these solutions. We see this as very exciting, especially within the domain expert scientist community, because it’s a way to get GPU acceleration without the deeper thinking that sometimes has to go into parallelizing code, even on your desktop,” he said.
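As a rough illustration of the directive approach Ramey describes, the C++ sketch below takes an ordinary serial loop and marks it with a single OpenMP directive. It targets CPU threads only and is not the PGI Accelerator syntax; the function name `dot` is made up for this example. A compiler without OpenMP enabled simply ignores the pragma and runs the loop serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product of two equal-length vectors.
// The pragma asks an OpenMP-aware compiler to split the loop iterations
// across threads and to combine each thread's partial sum at the end.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    const long n = static_cast<long>(a.size());
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

With GCC or Clang, OpenMP is typically switched on with the `-fopenmp` flag; the loop body itself stays exactly as it was in the serial version.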

OpenMP continues to be used even though newer alternatives are available. But one of those alternatives, OpenCL, is becoming a point of contention between GPU manufacturers. While both AMD and nVidia GPUs support the use of OpenCL, each GPU manufacturer sees the standard differently.

Ramey said of OpenCL: “I think there are some people who were the early adopters who were able to complete their projects and start talking about them. We’re finding the developers who are choosing OpenCL tend to be more on the computer-science side, because they are more comfortable working in the low-level API, and working in C or C++.

“One of the things we’re happy about at nVidia is we’ve chosen to support all of these options. When developers have an nVidia GPU in their system, they have the freedom of choice.”

AMD’s Hegde disagreed. He said that standardizing on OpenCL was a no-brainer for AMD, and that because OpenCL is an open standard, it will eventually triumph due to the lack of barriers to adoption.

To that end, Hegde and AMD are pushing OpenCL tools and driving an effort to make them open source. The hope is that making more tools available will drive adoption, despite nVidia’s three-year head start with CUDA.

Said Hegde, “There will be lots of compiler work because we want to always stay true to the OpenCL promises of being cross-vendor and cross-platform. We have to go from an intermediate layer to target all the ISAs. There is tools-work that needs to be done there. Our first reference compiler will be for C++ MP, then for all remaining high-level compilers. LLVM will be first. Then the architectural specifications will be open, so we can do GCC too. We are prioritizing the consumer applications first. Then we’ll use the mass-market adoption to fund the more niche applications.”

Intel’s Reinders said that bringing compilers into the equation offers a great deal more options for TBB users. “When we go back to the compiler extensions, you can take care of C developers as well,” he said. “We’re able to help with task-based parallelism and data problems with Cilk Plus in a way we can’t with TBB, because we have the compiler helping us. The disadvantage is that you can only use this with a compiler that supports them.”

Compiling support
Reinders added that the Cilk Plus approach has broader applicability to compiler developers. “To help see other compilers adopt this, we’ve open-sourced the implementation. There’s a GNU branch looking at adding it to GCC. We do think adding parallelism and a few key concepts into the language would benefit everyone,” he said.

“We’re putting our actions behind that by sharing our implementation, so competing compilers can implement to the same level of performance and not have to do that work we’ve done already to support Cilk Plus.”

And the compilers are where the action is now. Embarcadero’s Intersimone said that his company will be supporting both CUDA and OpenCL via its compilers. “People like OpenCL because it supports lots of different parallel-processing systems, whether they’re CPU- or GPU-based. I think it’s just a matter of time before these technologies will either be embedded in the languages, or they’re just part of the same runtime everyone uses,” he said.

Language integrations may indeed be the next step for CPU/GPU computing. Already, various language additions for C, C++ and Fortran are available to help developers building HPC applications. Ramey said that “the directive-based approach, like the PGI Accelerator solution for C, [is effective].

“Earlier this year, someone figured out how to do something similar with Java as well. They’re basically instrumenting the JVM to detect or be informed of sections of the code that would benefit from parallelization, and then the JIT compiler turns the JVM bytecode into a form that can be run on the GPU, and manages transfers back and forth.”

But AMD and nVidia aren’t the only teams playing in the multicore solutions space. Intel’s TBB offers libraries that can be used to accelerate and optimize the inclusion of parallel code in applications. And TBB has been available for a few years now, so it is a mature product for enterprise use.

Reinders said that TBB is continuing to evolve and take advantage of the newest technologies. While GPU integration isn’t yet a part of TBB, the underlying parallel development paradigms it enables are useful for any multicore developer. Additionally, Intel has put great effort into supporting the newly minted C++11 standard.

“C++11 is ratified now,” said Reinders. “It’s an official standard, and we have very aggressively implemented a great deal of it. We’re not completely sure when we’ll be done with it. We think we’ve implemented the features most in demand.

“We were ahead of the standard with lambda support. The most anticipated feature other than lambdas is variadic templates: templates with a variable number of arguments. I think we’re leading the way on that. The good news for developers is that when they look to that standard, they’re going to have a number of options.”
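For readers who have not yet met the two C++11 features Reinders mentions, here is a minimal sketch. The names `append_all`, `concat` and `make_labeler` are invented for illustration and do not come from any Intel library.

```cpp
#include <cassert>
#include <functional>
#include <sstream>
#include <string>

// Base case of the recursion: no arguments left to append.
inline void append_all(std::ostringstream&) {}

// Variadic template: peel off the first argument, then recurse on the rest.
template <typename First, typename... Rest>
void append_all(std::ostringstream& out, const First& first, const Rest&... rest) {
    out << first;
    append_all(out, rest...);
}

// Accepts any number of streamable arguments and joins them into one string.
template <typename... Args>
std::string concat(const Args&... args) {
    std::ostringstream out;
    append_all(out, args...);
    return out.str();
}

// Lambda: returns a closure that remembers `prefix` (captured by value).
inline std::function<std::string(int)> make_labeler(const std::string& prefix) {
    return [prefix](int core) { return concat(prefix, "-", core); };
}
```

Calling `concat("cores=", 16)` mixes argument types freely, and `make_labeler("gpu")` hands back a small function object that carries its prefix with it, which is the property that makes lambdas convenient for passing work to parallel libraries.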

Intel also supports OpenMP 3.1 in its TBB suite. “I think here, we’re probably the first to implement OpenMP 3.1, which has a number of user-requested additions that make OpenMP easier,” said Reinders.

Intersimone said that languages and class libraries should be the focus for the future. “We need the major languages embracing, in their language and in their standard libraries, CUDA and OpenCL,” he said.

“Embarcadero has Prism, which takes advantage of .NET methods for parallel for loops and future variables. You put future on a variable and it generates the code under the covers for me. Or you declare a method as asynchronous.

“The other area is the world of functional languages. A lot of the work people are doing and kinds of things they’re working on there relate directly to high-performance systems. I think as functional language concepts get into languages like C, C++ and Java, then we sort of get the benefits of some of that as we build new applications.”

Across the board, addressing parallelism in languages seems to be pushing more and more of them to support closures, also known as lambdas. Java SE 8 will include closures, and will in turn add functionality to the standard Java libraries. These new functions will be focused on spreading data across clusters, then running iterative functions across them, an execution style that is perfect for parallel systems.

While Java and C++ add closures to deal with parallelism, other developers are simply turning to functional languages, such as Clojure, Erlang and Scala, to write their code for multicore machines.
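The closure-and-future style discussed above can be sketched in C++11, although the article's examples are .NET and Java. The function below is an assumption-laden illustration, not any vendor's API: `parallel_sum` is a hypothetical name, and each chunk of the input is handed to a lambda that runs as its own asynchronous task.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Sums `data` by splitting it into at most `tasks` chunks and
// summing each chunk in its own asynchronous task.
long long parallel_sum(const std::vector<int>& data, std::size_t tasks) {
    assert(tasks > 0);
    std::vector<std::future<long long>> parts;
    const std::size_t chunk = (data.size() + tasks - 1) / tasks;  // ceiling division
    for (std::size_t begin = 0; begin < data.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, data.size());
        // The closure captures its slice bounds by value and the data by reference.
        parts.push_back(std::async(std::launch::async, [&data, begin, end] {
            return std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
        }));
    }
    long long total = 0;
    for (auto& part : parts) {
        total += part.get();  // blocks until that chunk's result is ready
    }
    return total;
}
```

Each future is a placeholder for a value that is still being computed, so the caller only waits at the point where the result is actually needed.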

It’s about the tools
But no matter how far and wide support for Intel’s TBB, nVidia’s CUDA and OpenCL is spread, at the end of the day developers still need tools to make their lives easier. And that’s where these companies are putting a great deal of effort.

AMD, in particular, has been pushing hard to develop tooling for OpenCL. Hegde said that AMD plans to release a suite of such tooling later this year. That suite will include profilers, debuggers, load balancers and code analyzers.

And AMD is addressing another key area where OpenCL and multicore programming are concerned: “We’re offering a university kit for professors and teaching a new course. We asked, ‘What are the stumbling blocks?’ We decided to offer a reference version of the course. The next problem is with solutions: You’ve got to grade them…We automated the grading process.”

Algorithm research is also an area of interest for all the involved companies. nVidia’s Ramey said, “The MAGMA project, led out of the University of Tennessee at Knoxville, is headed by Jack Dongarra. The MAGMA project has taken the approach of, instead of choosing either/or, they are researching algorithms that are able to take advantage of CPU and GPU at the same time. They’d take two matrices and multiply part on each processor.”
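The idea of multiplying part of a matrix on each processor can be sketched as follows. This is not MAGMA's API or algorithm: a second CPU thread merely stands in for the GPU, and `multiply_rows`, `split_multiply` and the `device_share` parameter are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

using Matrix = std::vector<double>;  // square n x n matrix, row-major

// Computes rows [row_begin, row_end) of c = a * b.
void multiply_rows(const Matrix& a, const Matrix& b, Matrix& c,
                   std::size_t n, std::size_t row_begin, std::size_t row_end) {
    for (std::size_t i = row_begin; i < row_end; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                acc += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

// Gives the first `device_share` fraction of the rows to a second worker
// (a stand-in for the GPU) while the calling thread computes the rest.
Matrix split_multiply(const Matrix& a, const Matrix& b,
                      std::size_t n, double device_share) {
    assert(a.size() == n * n && b.size() == n * n);
    assert(device_share >= 0.0 && device_share <= 1.0);
    Matrix c(n * n, 0.0);
    const std::size_t split = static_cast<std::size_t>(n * device_share);
    // The two workers write disjoint rows of c, so no locking is needed.
    std::thread device(multiply_rows, std::cref(a), std::cref(b), std::ref(c),
                       n, std::size_t{0}, split);
    multiply_rows(a, b, c, n, split, n);
    device.join();
    return c;
}
```

In a real hybrid library the share given to each processor would presumably be tuned to their relative speeds; here it is simply a number the caller picks.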

And nVidia is continuing to improve its tools, added Ramey. “In CUDA 4.1, one of the really exciting things is a brand new visual profiler,” he said. “We redesigned the performance-analysis tool that comes for free in the toolkit to guide developers step-by-step through their applications to find the hot spots and integrate with the documentation to show them how to make those hot spots go away.

“We see projects like MAGMA and CUDA Tools getting used directly in research projects and integrated into higher-level applications like MATLAB. Our strategy is working with…computer scientists in order to build this platform up higher and higher so the larger audience of domain experts can take advantage of it.”

The GPU has come to be a compute resource for enterprises and academic institutions mostly by default. HPC projects have long clung to new processors and hardware technologies, but in the end, it’s the software technologies that push the state of the art forward. While universities snatched up Power Mac G5s and PlayStation 3s in a desperate search for the next, cheapest way to get extreme performance out of a cluster, they should have been paying attention to the software.

“This is pretty impressive for the last five years since this GPU/CPU technology has been created to have this rich and complete of an ecosystem in place,” said Ramey. “Compare this to the Cell processor, for example. The government had to fund development of that. With GPUs, because we have this massive installed base supporting the push into HPC, the ecosystem is growing by leaps and bounds.”

Article Tags

CUDA, HPC, Intel, OpenCL, parallel programming

About Alex Handy

Alex Handy is the Senior Editor of Software Development Times.

