Like rabbits left to their own devices, processing cores are breeding. But for all the proliferation of cores on desktops, in data centers and inside handheld devices, the fundamental problems for developers who have to deal with all those cores haven’t changed much.

For years, tools and frameworks have offered solutions for multi-threading, concurrency and parallelism. But unlike some other programming problems, where clear solutions have already emerged, the multicore problem remains one that is better solved through knowledge, skill and experience than through dollars and prepackaged bits. That’s not to say there haven’t been some new developments in development.

Due to the complexity of the problems associated with multicore development, many vendors have had to be quite clever in how they offer their solutions. While there is no substitute for knowledge and experience, some companies, such as Microsoft, have found ways to build new multicore support into their tools for developers who might not be so skilled in multicore development.

Brandon Bray, principal group program manager for .NET at Microsoft, said that there are three types of programmers for whom Microsoft has built multicore solutions. The first type, which Bray said is quite small, knows how to develop multi-threaded applications and doesn’t need much help. The second is similar, but wants concurrency to be easier. The third type is the largest, and is made up of developers who want nothing to do with multicore, multi-threading or concurrency.

For each of these types, Microsoft has managed to craft a way to make multicore development a little bit easier. For the first two types, .NET 4.5’s release in September included a new background garbage collector, which improved application performance by reducing garbage-collection pauses in the runtime.

“Background garbage collection does all the looking through memory while the program is running,” said Bray. “It’s making use of extra cores, and it’s not something we could have done 10 years ago because most computers only had one processor in them. It’s a default feature when people upgrade to .NET 4.5.”

.NET 4.5 also includes multicore JIT, which uses data from previous application runs to optimize JIT compilation of .NET applications. It automatically uses extra cores to do the compilation in the background, which can improve application startup time.

While these two solutions are targeted at more experienced programmers, Bray said that developers uninterested in multicore can rely on .NET to offer them a faster way to do I/O and to take advantage of multicore for that: Async and Await.

“Async and Await are used with C#. The idea is that most developers really think about the code running on one processor,” said Bray. “If I write a loop, I understand how that works. They think of it synchronously, but most of the time when you’re doing network calls or file I/O, that I/O time is wasted time where another processor could do the work. Most of us experience that as the user interface being paused. That UI pausing is an opportunity for concurrency.

“The more I can use multicore in these cases, the better. It means I can keep the UI responsive. How do we get developers to fall into this pattern? The libraries you call that involve I/O, like calling a network or a REST API, you’d like that to be put into the background on another processor, and when the task is finished, it executes on the thread pool and the result is just passed back. I never have to pause the UI. This pattern we’ve built into the compiler, and it tells you where you need to write these keywords. If you call this API, it will force you to use this feature. Developers that don’t care are the ones that cause the pausing all the time. They still don’t have to care anymore; often, they don’t know they’re doing some of this concurrency.”
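Bray was describing C#, but the shape of the pattern is not specific to that language. As a rough illustration only, here is a minimal Java sketch (assuming Java 8 or later) of the same idea: slow I/O is handed to a thread pool, the caller gets a future back immediately, and a continuation runs when the result arrives. The class `AsyncSketch`, the methods `slowFetch` and `fetchAsync`, and the simulated 200-millisecond delay are hypothetical stand-ins, not anything Microsoft ships. Unlike C#'s keywords, Java's `CompletableFuture` is a library type, so the compiler does not rewrite the method for you.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncSketch {
    // Hypothetical stand-in for a slow network or file call.
    static String slowFetch(String resource) {
        try {
            TimeUnit.MILLISECONDS.sleep(200); // simulate I/O latency
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "payload for " + resource;
    }

    // Starts the fetch on a pool thread and returns to the caller immediately.
    static CompletableFuture<String> fetchAsync(String resource, ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> slowFetch(resource), pool);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // The continuation (thenApply) runs once the fetch has completed.
        CompletableFuture<Integer> length =
                fetchAsync("report", pool).thenApply(String::length);

        // The calling thread is not blocked while the fetch is in flight.
        System.out.println("caller is free to keep working");

        // join() waits here only because a console demo has nothing else to do.
        System.out.println("length = " + length.join());
        pool.shutdown();
    }
}
```

In a real UI application the final `join()` would be replaced by another continuation that updates the screen, so the UI thread never waits.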

The trouble with Java
Jason van Zyl is in a unique position to observe how Java developers are approaching multicore development in Java. As CTO and founder of Sonatype, and creator of Maven, he has a front-row seat at the world’s largest Java repository, Maven Central.

And yet, van Zyl remained convinced that multicore, concurrent and parallel programming aren’t being made any easier by frameworks and tools in the Java space.

“I don’t think it’s ever going to change, in that you need to find pretty smart people who can do the concurrency code,” he said. “Any syntax or changes in the language, like Scala, you need to be extremely smart to use it, and it’s just too hard for some people. In the core parts of your organization, you’re never going to get around the need for people who have a lot of experience with it and love it.”

But van Zyl points out one area where a new approach to concurrent programming has been fermenting for the past few years: languages as the solution. Specifically, developers have been considering moves to functional languages like Clojure, Erlang, Haskell and Scala.

“There are some advantages to different languages like Erlang,” said van Zyl. “But you always need people who know how to use them. I am not sure you could make a language that’s expressible to every developer. I think Scala needs to get their binary compatibility story a little better, but I hear lots of people like it.”

And that is the central problem of choosing a functional language as your solution to the concurrency problem: It’s basically just as hard to train or hire functional programmers as it is to train or hire multi-threading and concurrency specialists.

Of course, Intel has some incredibly talented concurrency and multicore people, and the company has long made its work available to the public through its Threading Building Blocks products. This year, however, Java has joined the party.

For many, Threading Building Blocks and Parallel Studio XE tools have been a lifesaver for concurrency and multicore development. In September, Intel updated its multicore tools with new capabilities.

James Reinders, director of marketing and chief evangelist for Intel Software, said that, traditionally, Parallel Studio tools have focused on Fortran and C++, but this year one new language has made it into the tool chain.

“Java, to us, is something that shows up in applications. It’s mixed in,” he said. “Users want us to be able to tell them a little about that, our tools that do some of the debugging and find memory leak errors, and our performance tools have been extended to use Java. The Java runtimes today have hooks in them for performance tools to get info back. We can tell the computer is running something in the Java runtime. The users want to see which Java application was doing what. We’re able to do that now. We’ve had some Java support in the past, but it was always limited to one JVM.”

Now, Reinders added, Intel’s Parallel Studio XE 2013 can attach to multiple JVMs running on the same system. That gives developers more flexibility to find problems that may exist across a Java application ecosystem.

But Parallel Studio XE 2013, and the new HPC-focused Intel Cluster Studio 2013, both offer new capabilities for standard C++ and Fortran applications as well. To begin with, both software suites support C++11 and Fortran 2008, though there are still areas where said support is being filled in.

“No one has implemented C++11 and Fortran 2008 completely,” said Reinders. “We’re working very hard to implement features. We’re implementing them in order of customer feedback. We’ve made great strides, and we have most of C++11 and most of Fortran 2008 done, but we aren’t ready to say everything is done.”

Embedding threading
While traditional languages get attention from Intel, some developers have already taken the drastic step of moving to a functional language in order to gain concurrency. The company behind Scala and concurrency framework Akka understands that functional programming is a big mental shift, and it has been working to find ways to ease the transition.

Mark Brewer, president and CEO of Typesafe, said that Scala is a big jump from Java, but that the benefits are worth the move. “If you look at the masses of Java developers, they’re looking for something cool, something new, to write an application that can’t get done today. We’ll continue to see people make the adoption of Scala because they need the functional performance capabilities, but that’s not the only way they’ll adopt Scala,” he said.

They could also come to Scala because of Akka, a concurrency framework similar in scope to Erlang’s OTP. While concurrency and parallelism are the end goals for many development teams, the Akka framework also adds fault tolerance to the mix.

“Where we’re seeing it being picked up is in embedded: tablets, cars, phones,” said Brewer of recent uptake of Akka. “One of the case studies we want to get out there is a vendor who’s taken Akka and uses it to run streaming video on your DVR. Akka is sitting there as the service to monitor what you’re watching. If you’re not connected directly to the server, it’ll go to your neighbor’s box and figure out if they have it downloaded already, and your DVR will bring it in from there. It’s also a way to collect local metrics on people who are watching local TV shows.”

And this is the promise of ubiquitous concurrency: When every machine on your network can take a self-contained workload off the central stack, perform the work, then pop the information back into the stack, a host of new possibilities arise. As Brewer said, with a concurrency framework running on the end devices, processing can take place on those multicore processors that now live in mobile devices, sensors and embedded machines.

“The Dutch Border Patrol, as you’re driving up, takes your picture with sensors,” said Brewer, describing another use case for Scala and Akka. “All those sensors are sitting on a small piece of code in Scala on Akka, quickly processing your car tag number. By the time you hit the border, there is a guard there, and all of that processing happens in real time. They chose it because it was very lightweight. Akka is running on each of those sensors.”

Same old story
Tobias Lindaaker is the concurrency expert at graph database company Neo Technology. He said that he’s been speaking on concurrency for years, but has had little success imparting the wisdom he’s gathered through conference talks alone. The problem is that concurrency is just too complicated to abstract or simplify, he said.

“The problem is getting much harder,” he said. “Mainly, people aren’t using the tools. I’ve heard the Intel tools are really good, but they’re expensive. I try to do a few talks every year where I get people to understand how concurrency works, but so far I don’t think I’ve managed to be successful in that. Part of it is that it’s easy to ignore. Developers write single-threaded programs, and they kind of work. There’s a huge number of developers who don’t have a need to care about concurrency.

“For those kinds of people, they’re working on frameworks for making it easier to make sure you can still write your single-threaded application code, and our freeware code will take care of it for you. You have many of those single-threaded worlds that only go so far. At some point you’re going to have shared state. You create lots of data from reading out of a database, and you need multiple copies of the same data, one for each thread. They’re pretty much immutable, which is a waste of space, but, even worse, you go to the database to do the same query over and over and over again, when you could have consolidated that.

“Lots of things you’re dealing with in concurrency are about state. If state doesn’t change, then it’s safe to do multi-threads. The problem is when state changes. What I’ve been trying to do recently is to work around blocking calls. It’s not so much about state, but more around flow: saying that you don’t get to decide when this code executes, I am going to decide, and when it executes, I am going to make sure what you can do is safe. It’s very close to the functional way of thinking, just slightly broader.”
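Lindaaker's point about unchanging state can be made concrete with a small Java sketch. It assumes a hypothetical class, `SharedImmutableSketch`, whose `parallelSum` method takes one read-only snapshot of some rows and lets several worker threads read it at once, rather than handing each thread its own copy. Because nothing writes to the snapshot after it is built, no locking is needed. The names and the summing workload are illustrative assumptions, not anything taken from Neo Technology's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SharedImmutableSketch {
    // Sums the rows using `workers` threads that all read the same snapshot.
    static long parallelSum(List<Integer> rows, int workers) {
        // One defensive copy, wrapped as unmodifiable: no thread can change it.
        final List<Integer> snapshot =
                Collections.unmodifiableList(new ArrayList<>(rows));
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            int chunk = (snapshot.size() + workers - 1) / workers; // ceiling division
            for (int w = 0; w < workers; w++) {
                final int from = Math.min(w * chunk, snapshot.size());
                final int to = Math.min(from + chunk, snapshot.size());
                // Each task only reads its own slice of the shared snapshot.
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (int i = from; i < to; i++) {
                        sum += snapshot.get(i);
                    }
                    return sum;
                };
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // wait for each slice and combine
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 3));
    }
}
```

The hard cases Lindaaker describes begin where this sketch ends: once a thread needs to change the shared data, some form of coordination has to be added.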

For Java developers, that functional way of thinking may soon be much more relevant. With Lambdas coming to OpenJDK 8, Java developers will have at least some functional language capabilities at their disposal.

“The things going on with parallel collections and Lambdas, those will make it easier for the people who already know what they’re doing,” said Lindaaker. “They won’t be as appalled by Java as they are today. I’m not so sure that will reach the broader masses.”
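For readers who have not seen the style Lindaaker is referring to, here is a minimal sketch, assuming Java 8 or later, where this work shipped as lambdas and the streams API. The class `ParallelStreamSketch` and the method `squaresOfEvens` are hypothetical names chosen for illustration. The lambdas describe what to do with each element, and `parallelStream()` leaves it to the runtime to split the work across cores.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelStreamSketch {
    // Keeps the even numbers and squares them, possibly on several cores.
    static List<Integer> squaresOfEvens(List<Integer> input) {
        return input.parallelStream()
                .filter(n -> n % 2 == 0)       // lambda: keep even values
                .map(n -> n * n)               // lambda: square each one
                .collect(Collectors.toList()); // encounter order is preserved
    }

    public static void main(String[] args) {
        System.out.println(squaresOfEvens(Arrays.asList(1, 2, 3, 4, 5, 6)));
    }
}
```

The lambdas here are free of side effects, which is what makes the parallel version safe. A lambda that wrote to shared mutable state would bring back exactly the problems described above.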

And that’s because at the core of this multicore problem is the difficulty of extracting state, coordinating threads, and consolidating memory, all without crashing processes into each other over shared information. It’s a high-stakes, extremely complicated juggling act that can’t be shoehorned into a development process.

But there is at least some advice Lindaaker can offer. His secret to writing better concurrent code in Java? “Mainly, read source. Since I develop a lot of Java, I read the implementations of the JDK. I’ve read the OpenJDK source a number of times, so I know what’s going on and how my use of the code will relate to what happens on the CPU. Read the implementations of the concurrency library in Java. Also read the compiler source and look at the method assembly code.”

CUDA 5
Nvidia updated its GPU compute SDK in October, adding a number of new capabilities and tools. You’ve already read about the new, simpler methods for invoking dynamic parallelism, so here’s info on the other big changes in this release.

Perhaps the change most related to dynamic parallelism is the new GPU-callable libraries. Until CUDA 5, libraries of GPU code had to be wrapped in CPU-understandable code. This also meant all library calls required the CPU and GPU to talk to one another. With GPU-callable libraries, the CPU isn’t needed to pull in routines and primitives at runtime, meaning less cross-chatter, thus better performance overall.

Another related change is the newly added GPU support for Remote Direct Memory Access (RDMA). This means GPUs can now communicate directly with the onboard network card, and use it to check out information held in RAM on another computer in the cluster, thus removing the CPU and its bus, cache, and memory from the RDMA process entirely.

Additionally, the 5.0 release of the CUDA SDK now includes an Eclipse-based development platform. This is the first time Nvidia has offered an Eclipse-based tool for CUDA developers, and the company said this version is based on existing Visual Studio plug-ins.