Multicore processors, long the mainstay of servers, have made solid inroads on the desktop and even mobile devices, yet the development and QA tools and processes required to exploit them have historically lagged behind.

Fortunately, that’s beginning to change. And while they’re by no means on every developer’s desktop, there are tools available for multicore programming and debugging. That’s good news, because the challenges of multicore development require both training and software.

Erik Hagersten, CTO of Rogue Wave, which sells tools for multicore development, said that many of the parallel programming problems we see today were solved in the 1990s, but that those solutions required training, understanding and expertise on the part of the developer.

“Parallel processing isn’t new. We were building parallel servers at Sun Microsystems, but the difference now is multiprocessing is going to the masses,” he said.

“It’s not just the experts; it’s pretty much everybody. In that respect, we’re missing a lot. The experts are doing fine of course, but the mass market isn’t ready for this.”

And while doctoral theses on parallel programming and the potential for new tools and languages are all very compelling, they do little to help developers who need help today. “Many people are waiting for that magic thing to happen: a new language with new parallelization; improvements for the Javas and the Erlangs,” said Hagersten.

“That is happening slowly, but not fast enough, and people need to write their code here and now. The largest problem is non-determinism. How can you fix a bug if you can’t recreate it?”

The problem is even more profound in the embedded market, where non-determinism is unacceptable within systems upon which human lives depend, said Greg Rose, vice president of marketing and product management at DDC-I, which sells DO-178B certifiable embedded operating systems and tools for use in flight safety-critical avionics applications.

“To date, most [safety-critical embedded systems] customers using multicore processors are turning one of them off,” he said. “They’re using it in a single-core implementation due to all these effects. It’s really up to the software developers as to how you utilize [this] extra power granted to you with multicore.”

Embedded developers favor using a single core, said Rose, because non-determinism makes reliable testing almost impossible. Running an application twice under the exact same conditions can yield two entirely different behaviors due to the timing involved in passing processes to two processors that share memory and cache. Even attaching a debugger to the application can change the outcome of a process because the debugger itself alters the timing of execution.

“For determinism, you don’t want to have contention between resources. That could be memory, could be cache, or it could be the system itself,” said Rose.

“You don’t want a system acting non-deterministically. When you put in the multicore, it’s not like multiple discrete computers. It’s on a chip, with shared memory and shared cache, which increases the non-determinism. We have some software for cache partitioning, and being able to segment your cache is important. Also, there are these resource contentions that come along and cause you to have to budget more time. Even if you budgeted for worst-case scenario timing, your average execution time is still going to be nominally dependent on that.”

Thus, many solutions to the problem can cause other problems. It’s a tricky set of difficulties to navigate, especially when the underlying goal is one of improved developer productivity.

“Education is part of it,” said Hagersten. “This is not new; there were several techniques developed in the 1990s, but with those techniques you wouldn’t see the productivity.

“In the short term, I would find the best tools and environments. The problems we’re running into now are so subtle and hard: race conditions, deadlocks and non-determinism.”

The heart of the problem, said Eli Boling, Embarcadero’s manager of compiler development, is that “there’s no silver bullet for the general-purpose programmer on multicore, mostly because regardless of how many cores you put out there, it’s non-trivial to take your general-purpose application and make it parallel.

“There are too many things about them that are serial. There is a set of areas that are really susceptible to parallelization, like digital-signal processing, image processing, simulation, some physics problems where they’re doing large array calculations, or security operations where you’re looking at very large sets of data you have to work on.”

And that is where much of the current focus within enterprises has fallen: using multicore systems to process big data.

James Reinders, chief evangelist for Intel’s software products division, said, “I think people know we’re under a crush of data. We’re making things like audio, video, HD video, and we’re collecting lots of data we want to process. Well, it’s natural that people are going to want to compute on this data and find the hidden gems that make their business better.

“Parallelism is the only way we’re really going to be able to tackle this growing amount of data. We need to find parallel methods for that. Co-arrays in Fortran is a hint at a feature going for that. I feel like people are turning toward understanding the problems they need to solve, and parallelism is part of the solution, rather than just saying, ‘I want parallelism.’ ”

Tools for the job
That’s not to say that there are no tools for parallel developers. There are lots of tools. One of the best known is Intel Threading Building Blocks, a C++ template library that helps developers create and manage work spread across multiple cores.

Reinders said that the company has been searching for additional ways to help developers deal with multiple cores. “The tools have made a lot of strides. If you take a look at C and C++ programming, you’ll see Intel introduced some aggressive tools in May of 2009 and updated them last year in September, and actually took some of these things out further.

“The more-advanced HPC tools have benefited from the new ease-of-use tools that have been added. C and C++ programmers have new capabilities. Developers shouldn’t be very confused. There’s a promotion of what we call task-based programming, instead of thread-based programming, which frees up a developer to do more abstraction in C and C++, and helps with debugging.”

Hagersten said that Rogue Wave also has tools available now and in the works that can help ease the development process on multicore systems. “If you look in the debugging environment, one of the other tools we have in our portfolio is a tool that turns a multi-threaded execution into a deterministic environment,” he said.

“It’s a debugger where you can execute forwards as well as backwards. It’s a more intelligent way of working with today’s technologies rather than sitting and waiting for the magic language.”

DDC-I also has tools for embedded developers that can help smooth the rough edges of multicore software design. “What we’re pioneering here at DDC-I is a way to help characterize and minimize, in-bound, these resource contentions, which can lead to really inflated worst-case execution times,” said Rose.

“Cache partitioning is one of the technologies we have here. We’re using other techniques once you’ve got these building blocks in place, and we think the underlying operating system is the key to making this work. Then you can test the application software to make sure the time budgets have been set appropriately so you’re not going to run into scenarios where, because of resource contentions, we can’t get our job done.”

Threading Building Blocks is also evolving, said Reinders. “I get very excited about the things we’re updating. They make Threading Building Blocks easier and address deployment challenges.

“One of the really cool things that got added was full support for C++ lambdas. That’s the most exciting new feature of the new C++ standard. Lambdas give a very concise way to specify ‘here’s some code,’ without having to go off and create a whole function definition for it. This turns out to be very useful for parallelism. You can say, ‘Here’s the code I want running in parallel.’ Anytime code is easier to write, it’s less error-prone and easier for people to work on that code with you.”

Non-deterministic future
The future of multicore development isn’t going to change overnight. While the tools continue to improve, the ability to automatically turn a generalized program into a parallel one isn’t likely to appear on the horizon anytime soon.

“It’s definitely going in the right direction, but it’s still far away from the golden goal of automating that or making sure that hard-to-find problems won’t occur,” said Hagersten. “Tools don’t solve the hard problem of making sure the programmer understands how to get performance from the program. I still wouldn’t say we are even close to the productivity we had even 10 years ago. We are used to improving productivity all the time, but now we’re going backwards.”

Reinders said that one of the more interesting areas of research for parallelizing big data at Intel has been around Fortran. “Fortran has been making some equal strides,” he said. “It may surprise some people that have had solutions like OpenMP and MPI, but there’s been some interesting things going on in Fortran. Co-array, for Fortran developers, is really cool, but it hints at a trend we’re going to see spill over into other languages. It’s designed to handle big data. This is really a topic unto itself.”

“I think it’s unlikely it’ll ever turn out to be an automated process,” said Embarcadero’s Boling. “I think the various research projects will bear fruits about language constructs that turn out to be useful for helping people do functional programming, or building applications in a style that tends to be more susceptible to automatic compilation.”

DDC-I’s Rose said that the future of multicore in embedded systems could take an entirely different route than that seen on servers. “We’re looking at asymmetric multiprocessing, where you have a complete copy of your OS running on each core, and you hard-schedule each one of those different cores. What that does is aids in the predictability.

“Imagine if you’ve got cache issues in non-multicore. Imagine if there’re other [areas] where you’re trying to do multi-processes on multicores. The non-determinism can actually get worse. Where we’ve seen our customer’s interest is where they have rigid time and space partitioning, where they partition and say exactly this much time is available to each task, and the one task can’t corrupt the other task. They want hard control over what’s running on the cores to make sure they can completely characterize the effects of determinism. The more you constrain the device, the more you can characterize and put together your schedule.”

That future of running multiple operating systems in tandem on multicore embedded systems, however, is not here yet, said Rose. While servers have long been virtualizing operating systems and cramming multiple machines’ worth of applications onto a single box, the embedded world remains beholden to the reliability that only a single processor core can bring.