The SC16 conference in Salt Lake City this week highlighted the future of high-performance, highly scaled applications. That future, it would appear, involves at least the PCIe bus, if not explicitly GPUs. At the event, Cray, NVIDIA and PGI discussed the future of the OpenACC standard, which is beginning to turn toward Intel’s hardware: the Xeon Phi co-processor.
While NVIDIA invented the GPU compute market, Intel jumped into the fray with its own dedicated PCIe-based many-core add-on card to take advantage of what it saw as a gap in supercomputing: GPUs are better at vector workloads than at generalized computing tasks.
While that view of a diverging market was also reflected by OpenACC (a standard created by Cray, CAPS, NVIDIA and PGI to support GPUs in areas where OpenMP was not quite ready), it would appear that the project’s road map will now include Intel’s Phi co-processor.
Compiler companies PathScale and PGI already support OpenACC on multicore processors and ARM, but this year PGI will be adding support for the Power series. Next year, it will add support for the Xeon Phi. But that’s just one piece of the larger OpenACC 2.6 puzzle, which is aiming for a mid-2017 release.
Michael Wolfe, PGI compiler engineer and chair of the technical committee for OpenACC, said that version 2.6 of OpenACC and beyond will include many changes to make development easier. First up is manual deep copy.
“This has to do with how a device like a GPU does work with separated memory: Maybe my data structure is an array of Fortran-derived types or C++ classes, which itself may have pointers to other arrays,” said Wolfe. “That’s a dynamically allocated complex data structure. You need to move the data structures over and copy the pointers. That’s what we call deep copy. There are real applications that have needed this for a while, and we’ve had no clear coherent way to do it.
“We’ve been working on a directive-based mechanism to really do a true deep copy, but it’s quite complex to make it natural to specify; to make it simple and effective. Manual deep copy is a relatively small change to the OpenACC specification that allows users to coherently write a procedure that would do the deep copy. They write more code, but it really does do the deep copy. What we’re looking to do is the true deep copy. We were worrying about specifying something that would be too inefficient to use, or had the wrong behavior so we’re trying to get a prototype in there.”
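To make the pattern concrete, here is a minimal sketch of the manual approach Wolfe describes, written in C with standard OpenACC data directives. The struct and field names are illustrative rather than taken from any real application: the outer structure is copied to the device first, then the array it points to, so the runtime can translate the embedded pointer.

#include <stdlib.h>

/* Illustrative only: a dynamically allocated structure with a pointer
   member, the kind of layout Wolfe describes. */
typedef struct {
    int     n;
    double *values;   /* the device copy needs this pointer translated */
} column;

int main(void) {
    column col;
    col.n = 1000;
    col.values = malloc(col.n * sizeof(double));
    for (int i = 0; i < col.n; ++i) col.values[i] = (double)i;

    /* Manual deep copy: move the outer struct first, then the array it
       points to, so the runtime can attach the device-side pointer. */
    #pragma acc enter data copyin(col)
    #pragma acc enter data copyin(col.values[0:col.n])

    #pragma acc parallel loop present(col)
    for (int i = 0; i < col.n; ++i)
        col.values[i] *= 2.0;

    /* Bring the payload back and release device memory in reverse order. */
    #pragma acc exit data copyout(col.values[0:col.n])
    #pragma acc exit data delete(col)

    free(col.values);
    return 0;
}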
Additionally, the next version of OpenACC should include error callback routines to keep applications from crashing unexpectedly. These will give developers a way to compensate programmatically if things go sideways, and perhaps recover the application without a restart.
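OpenACC had not yet defined the interface for this at the time, so the following is an entirely hypothetical sketch, in plain C, of the general idea: the runtime hands a device failure to a user-supplied handler instead of terminating the process, and the application falls back to a host path without restarting.

#include <setjmp.h>
#include <stdio.h>

/* Hypothetical sketch -- none of these names come from the OpenACC
   specification. It only illustrates how an error callback could let an
   application recover rather than abort. */

static jmp_buf recovery_point;

/* The handler a runtime would invoke on a device failure. */
static void on_device_error(const char *msg)
{
    fprintf(stderr, "device error: %s -- switching to host fallback\n", msg);
    longjmp(recovery_point, 1);
}

/* Stand-in for an offloaded region that fails at run time. */
static void offloaded_work(void)
{
    on_device_error("out of device memory");  /* simulated failure */
}

int main(void)
{
    if (setjmp(recovery_point) == 0) {
        offloaded_work();   /* fast path on the accelerator */
    } else {
        puts("recovered on the host; no application restart needed");
    }
    return 0;
}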
Finally, OpenACC 2.6 should include array reduction. Said Wolfe, “OpenACC has a reduction clause if you need reductions like max and min. They are more or less copied from OpenMP. OpenMP allows for the reduction of a whole array of rows of a matrix into a vector. You implement [it] by allocating a copy of the array to each thread, and each does its partial accumulation. Then you do a final accumulation.”
Wolfe said that this is a much more complex process considering the scale at which OpenACC works. “OpenMP only has four, eight or 12 threads,” he said. “In OpenACC, we’re working with parallelism in the thousands of threads. We were concerned when we did the specification about the efficiency of that, but we decided we had to bite that bullet and implement it and work on the efficiency issue.”
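The OpenACC syntax for this was still being settled, so as a point of reference, here is a minimal sketch of the OpenMP 4.5 feature Wolfe refers to: a C array-section reduction that collapses the rows of a matrix into a vector. Each thread gets a private copy of the result array, accumulates its share of the rows, and the copies are combined at the end; that per-thread copy is precisely the storage cost that becomes a concern with thousands of GPU threads.

#include <stdio.h>

#define ROWS 8
#define COLS 4

int main(void) {
    double matrix[ROWS][COLS];
    double colsum[COLS] = {0.0};

    for (int i = 0; i < ROWS; ++i)
        for (int j = 0; j < COLS; ++j)
            matrix[i][j] = i + 0.1 * j;

    /* OpenMP 4.5 array-section reduction: each thread privately
       accumulates into its own colsum, then the copies are summed. */
    #pragma omp parallel for reduction(+:colsum[0:COLS])
    for (int i = 0; i < ROWS; ++i)
        for (int j = 0; j < COLS; ++j)
            colsum[j] += matrix[i][j];

    for (int j = 0; j < COLS; ++j)
        printf("column %d sum = %f\n", j, colsum[j]);
    return 0;
}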
It would appear that with version 2.6, the team is finally starting to bring those threads under control.