As the 2013 International Supercomputing Conference continues this week, product and technology announcements continue to trickle out of the show. NVIDIA is of course no stranger to the show, and coming off of their success with Titan last year they are further increasing their presence in a bid to capture a larger share of the lucrative HPC market. To that end NVIDIA is making several announcements this morning that we wanted to briefly cover.

The big news out of ISC 2013 for NVIDIA is that CUDA 5.5 has moved out of private beta and into its public release candidate. Though CUDA 5.5 is just a point release, it brings several significant changes for developers. The biggest change of course is that this is the first version of CUDA to offer ARM support, going hand in hand with the launch of the Kayla development platform and ahead of next year’s launch of NVIDIA’s Logan SoC.

CUDA on ARM is a significant point of interest for NVIDIA for a couple of reasons. On the consumer side of things NVIDIA is hoping to ultimately leverage CUDA for compute on SoC based devices, similar to what they have done in the PC space over the last half-decade. For the ISC crowd however the focus is on what this means for NVIDIA’s HPC ambitions, as an ARM based HPC environment is something that can be powered exclusively by NVIDIA processors, as opposed to today’s common scenario of pairing Tesla compute cards with x86 processors from AMD and Intel. If nothing else, in the more immediate future this is NVIDIA ensuring they aren’t left behind by the continuing growth of ARM device sales.

Along with bringing ARM support to CUDA, CUDA 5.5 also introduces cross compilation support to the toolkit, allowing ARM binaries to be built either natively on ARM systems or, much more quickly, on faster x86 systems. Other changes include several improvements to MPI and HyperQ, such as MPI workload prioritization and HyperQ gaining the ability to receive jobs from multiple MPI processes on Linux systems.
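
To give an idea of how that prioritization surfaces to developers, below is a minimal sketch using the stream priority API that ships alongside CUDA 5.5 (cudaDeviceGetStreamPriorityRange and cudaStreamCreateWithPriority). We're assuming stream priorities are the mechanism behind the MPI workload prioritization NVIDIA describes; the kernel, array size, and names here are placeholders.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel used as a stand-in for real per-rank work.
__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Query the priority range supported by the device.
    // Numerically lower values mean higher priority.
    int leastPrio, greatestPrio;
    cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);

    // Create one high-priority and one low-priority stream.
    cudaStream_t highPrio, lowPrio;
    cudaStreamCreateWithPriority(&highPrio, cudaStreamNonBlocking, greatestPrio);
    cudaStreamCreateWithPriority(&lowPrio, cudaStreamNonBlocking, leastPrio);

    // Blocks queued on the high-priority stream are scheduled ahead of
    // pending blocks from the low-priority stream.
    busyKernel<<<(n + 255) / 256, 256, 0, lowPrio>>>(d_data, n);
    busyKernel<<<(n + 255) / 256, 256, 0, highPrio>>>(d_data, n);

    cudaDeviceSynchronize();
    printf("Stream priority range: %d (lowest) to %d (highest)\n", leastPrio, greatestPrio);

    cudaStreamDestroy(highPrio);
    cudaStreamDestroy(lowPrio);
    cudaFree(d_data);
    return 0;
}
```

In a multi-process MPI job each rank would create its streams the same way, picking a priority to match the urgency of its work; on Linux, the HyperQ changes are what allow multiple ranks to feed the GPU concurrently in the first place.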

Finally, on a broader view CUDA 5.5 also brings some small but important changes that most developers will see in one way or another. On the development side of things NVIDIA is rolling out a new guided performance analysis tool for use with the Visual Profiler and the Nsight Eclipse Edition IDE, to help developers better identify and resolve performance bottlenecks. Meanwhile on the deployment side of things NVIDIA is finally rolling out a static compilation option, which should simplify the distribution of CUDA applications. The necessary CUDA libraries can now be statically linked into an application, rather than relying solely on dynamic linking, which requires either bundling the libraries with the application or installing the CUDA toolkit on the target computer.
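
As a rough illustration of what that means at build time, here is a hedged sketch of the two linking modes using nvcc's --cudart switch. The flag exists in current toolkits, though the exact spelling and defaults in the 5.5 release candidate may differ, and vectorAdd.cu is just a stand-in source file.

```sh
# Link the CUDA runtime statically: the resulting binary does not need
# libcudart.so to be present on the target machine.
nvcc -O2 --cudart=static vectorAdd.cu -o vectorAdd_static

# Link dynamically, as before: the target machine needs a matching
# libcudart.so bundled with the app or installed via the CUDA toolkit.
nvcc -O2 --cudart=shared vectorAdd.cu -o vectorAdd_shared
```

The static build trades a somewhat larger binary for not having to ship the runtime library alongside the application.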

Moving on, along with the CUDA 5.5 announcements NVIDIA is also using ISC to showcase some of the latest projects being developed with NVIDIA’s GPUs. NVIDIA’s major theme for ISC is neural net computing, with a pair of announcements relating to that.

On the academic front, Stanford has put together a new cluster to model neural networks for researching how the human brain learns. The 16 server cluster is capable of modeling an 11.2 billion parameter neural network, which is 6.5 times bigger than the next largest such network, a 1.7 billion parameter model put together by Google in 2012. The cluster is the basis of a separate paper being released for the International Conference on Machine Learning, which is simultaneously taking place this week in Atlanta.

Meanwhile on the business front, Nuance, the company behind speech recognition software such as the Dragon series, is being tapped for ISC as a neural network case study. Nuance has used neural networking techniques for years as the basis of the machine learning systems their software uses to train itself, and more recently the company has begun integrating GPUs into that work. Specifically, the company is now using NVIDIA’s GPUs to accelerate the training process, cutting down the amount of time needed to train a model from weeks to days, and in turn allowing the company to experiment with many more models in the same period of time. The ultimate result is that the company can test and refine models more frequently, with these refined models becoming the basis of future products.

Finally, although Titan is no longer the #1 supercomputer in the world – having been bumped down to #2 on the latest Top500 list – word comes from NVIDIA that Titan has finally passed all of its necessary acceptance tests. As is typical for supercomputers, Titan was unveiled and listed before undergoing full acceptance testing, which means final acceptance may not come until months later. In Titan’s case some unexpected issues were discovered with the PCIe connectors on its motherboards, with excess gold in the connectors leading to solder issues. The issue was repaired in April and Titan was resubmitted for another acceptance testing pass, which it passed last week, finally entering full production.

Source: NVIDIA

5 Comments

  • Senti - Tuesday, June 18, 2013 - link

    I would be way more happy if they released OpenCL 1.2 drivers. Seriously, even Intel supports 1.2 now.
  • Fallen Kell - Tuesday, June 18, 2013 - link

    Agreed.
  • p1esk - Tuesday, June 18, 2013 - link

    Regarding GPU powered Neural Network simulation - it's interesting that they used regular gaming GPU cards - not $3.5k Teslas, not $1k Titans, and not even the current top of the line $600 780GTX gaming cards - they used last generation 680GTX ($400) cards.
    Makes one wonder why anyone doing NN simulation would want to buy 10x more expensive Tesla K20 cards.
  • ShieTar - Thursday, June 20, 2013 - link

    It may depend on the exact usage scenario of the network whether you can get any additional value out of ECC memory and double precision calculations. When you can get a significant scientific project to run on about 100 GPUs, the price difference between $400 and $1k only comes to $60k; that's not very much if you consider that they probably pay at least $500k a year in salaries for the scientists working with the system.
    Also keep in mind, this kind of system is rarely planned and purchased on short notice. They probably started the planning before Titan and the 780 came out.
  • wumpus - Thursday, June 20, 2013 - link

    It's a neural net. It probably could get away with 16bit values but probably uses 32bit values to map to the hardware better. It has no use for 64bit double precision floating point (the big difference between the 780 and Titan). Also as a neural net, it should be highly resistant to data corruption on its own (you don't expect wetware to return the answer with 12 nines reliability, more like one or two).

    The only reason they were using a 780 instead of several 660s or similar boards was the need to do as many calculations on the same board as possible and limit communications between boards. I'm willing to bet that the next generation will simply buy the best single-precision TFLOP/$ and use algorithms that don't rely so heavily on inter-board communication, but they didn't want to make a bet that big on their first try. They knew in advance they didn't need the Titans or Teslas, however.
