High Performance AI
Transforming science with deep learning and HPC
Machine learning and deep learning have taken science and society by storm, offering new ways to make discoveries, predict behavior, and guide decision-making.
Like traditional high performance computing, training a deep neural network or developing a machine learning algorithm requires an enormous number of floating point operations (quintillions!), which makes these workloads a good fit for supercomputers and their thousands of parallel processors.
To help situate you in the fast-evolving world of AI, we've created a primer on the terms, tools, and technologies that TACC is applying to help researchers use machine learning and deep learning effectively.
Training a neural network can require a system to read data up to 180 million times and perform 2,000 input/output (I/O) operations per second. These demands can cause a shared parallel file system to shut down or become unresponsive.
A system developed at TACC intercepts I/O traffic and processes it on the compute node to reduce the data workload on the shared file system, achieving 71-99% of the performance of a solid state storage device.
Deep learning frameworks change frequently and rely on multiple dependencies from the operating system to the interface.
Divide and conquer:
TACC manages low-level systems and libraries; users manage instances of their preferred frameworks.
Using multiple GPUs or CPUs can speed up training and allow researchers to tackle bigger problems, but few know how to do so.
Create guidelines and training to help researchers scale their deep learning problems to many processors.
Science demands that a solution be explainable, but neural networks are frequently black boxes.
Support interpretable deep learning that can demonstrate that a model is capturing the correct information and show how the neural network came to its conclusion.
Not all science problems have the massive amounts of data needed for traditional training.
Employ refinement approaches like interpolation and cost function mitigation to overcome this data deficiency.
Neural networks that don't incorporate known physics or features of a physical system have the potential to produce implausible solutions.
Integrate Scientific Knowledge into Deep Learning
Develop methods that blend deep learning with physics-based constraints to advance domain science.
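One common way to blend a physics-based constraint into training is to add a penalty term to the loss that grows when predictions violate a known law. The sketch below is only an illustration under stated assumptions, not a description of TACC's methods: the falling-object law x'' = -g, the function names, and the `weight` parameter are all hypothetical choices made for the example.

```python
def data_loss(pred, observed):
    """Mean squared error between predicted and observed positions."""
    return sum((p - o) ** 2 for p, o in zip(pred, observed)) / len(pred)

def physics_penalty(pred, dt, g=9.81):
    """Mean squared violation of x'' = -g, using central second differences.

    Assumes pred holds positions sampled at a fixed time step dt.
    """
    residuals = [
        (pred[i + 1] - 2 * pred[i] + pred[i - 1]) / dt ** 2 + g
        for i in range(1, len(pred) - 1)
    ]
    return sum(r * r for r in residuals) / len(residuals)

def total_loss(pred, observed, dt, weight=1.0):
    """Data fit plus a weighted penalty for physically implausible trajectories."""
    return data_loss(pred, observed) + weight * physics_penalty(pred, dt)
```

A trajectory that obeys the law contributes almost nothing to the penalty, while one that fits a few noisy points but ignores gravity is pushed back toward plausible behavior during training.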
How Does Deep Learning Work?
Deep learning uses neural networks to make decisions. Data is supplied to the network, and “hidden layers” of “neurons” recognize characteristics of the data. A simple example might be a color-based network with “is white” and “is blue” as outputs. By itself, the network does nothing, but given a large set of images labeled “white” or “blue,” the network can “learn.”
Deep learning proceeds through an iterative, two-step procedure: a forward step and a backward step. The forward step applies the transformations defined in each unit of the neural network to sampled training data and computes a loss function. This loss (also referred to as training loss) measures how far the network's predictions are from the ground truth, so a lower loss means the predictions match the labels more closely. The backward step then calculates the gradient (steepness) of the loss function with respect to the weights and passes these gradients backward through the same network. Each neuron's weights are then updated so the network better infers ground truth from inputs. This process is repeated until a satisfactory solution is reached. In the simplistic case, the neural network would learn that clouds are white and the sky is blue (and would ignore data that are red).
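The forward and backward steps can be sketched for the white-versus-blue example with a single artificial neuron. This is a minimal illustration in plain Python rather than a real deep network: the six RGB samples, the learning rate, the number of epochs, and all function names are assumptions chosen for the example.

```python
import math
import random

# Hypothetical toy data: RGB colors in [0, 1]; label 1 = "is white", 0 = "is blue".
DATA = [
    ((1.0, 1.0, 1.0), 1), ((0.9, 0.95, 0.9), 1), ((0.95, 0.9, 1.0), 1),
    ((0.1, 0.2, 0.9), 0), ((0.0, 0.1, 1.0), 0), ((0.2, 0.3, 0.8), 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, bias, rgb):
    """Forward step: weighted sum of the inputs, squashed to a probability of 'white'."""
    return sigmoid(sum(w * x for w, x in zip(weights, rgb)) + bias)

def train(data, epochs=1000, lr=0.5, seed=0):
    rng = random.Random(seed)
    weights = [rng.uniform(-0.1, 0.1) for _ in range(3)]
    bias = 0.0
    losses = []
    n = len(data)
    for _ in range(epochs):
        grad_w = [0.0, 0.0, 0.0]
        grad_b = 0.0
        total = 0.0
        for rgb, label in data:
            p = forward(weights, bias, rgb)
            # Cross-entropy loss: small when the prediction agrees with the label.
            total += -(label * math.log(p) + (1 - label) * math.log(1 - p))
            # Backward step: for sigmoid plus cross-entropy, the gradient of the
            # loss with respect to the weighted sum is simply (p - label).
            err = p - label
            for i, x in enumerate(rgb):
                grad_w[i] += err * x
            grad_b += err
        # Update: move each weight a small step against its average gradient.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
        losses.append(total / n)
    return weights, bias, losses
```

Calling `train(DATA)` returns the learned weights, the bias, and the loss recorded at each epoch, so the decline in training loss can be inspected directly. A real network stacks many such units in layers and relies on a framework to compute the gradients automatically.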
There are many different characteristics that a network can learn. These networks can have many different neurons in a layer and multiple layers to enable better inference from input data, much the way our own brain recognizes the world around us.
Classes of Deep Learning
RNN: Recurrent Neural Networks
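Deep neural networks whose connections form a directed graph along a temporal sequence. A hidden state computed at each time step is fed into the next, and the same weights are reused at every step, which lets the network handle sequential data such as text or time series.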
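A recurrent layer can be sketched as a single update rule applied at each step of a sequence, with a hidden state carried from one step to the next. The one-unit layer below is a hypothetical, simplified illustration: the names `w_in`, `w_rec`, and `bias` and the tanh activation are assumptions for the example.

```python
import math

def rnn_forward(inputs, w_in, w_rec, bias, h0=0.0):
    """Run a one-unit recurrent layer over a sequence and return every hidden state."""
    h = h0
    states = []
    for x in inputs:
        # The same three parameters are reused at every time step;
        # only the hidden state h carries information forward in time.
        h = math.tanh(w_in * x + w_rec * h + bias)
        states.append(h)
    return states
```

With `w_rec` set to zero the layer forgets each input immediately. With a nonzero `w_rec`, an early input keeps influencing later states.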
CNN: Convolutional Neural Networks
Deep neural networks centered around convolution: a mathematical operation on two functions to produce a third function that expresses how the shape of one is modified by the other.
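For discrete data, the convolution described above reduces to sliding one list of numbers across another and summing the products. The function below is a minimal one-dimensional sketch with hypothetical names. Most deep learning frameworks apply the kernel without flipping it (technically cross-correlation), and they do so over two-dimensional images with kernels that are learned during training.

```python
def convolve(signal, kernel):
    """Full discrete convolution: out[n] = sum over k of signal[k] * kernel[n - k]."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            # Each input sample spreads a scaled copy of the kernel into the output.
            out[i + j] += s * k
    return out
```

For example, `convolve([1, 2, 3], [0, 1, 0.5])` returns `[0.0, 1.0, 2.5, 4.0, 1.5]`: the kernel delays the signal by one sample and adds a half-strength echo one sample after that.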
RL: Reinforcement Learning
An area of machine learning that trains an agent to interact with some environment with the goal of maximizing a reward.
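The reward-maximizing loop can be illustrated with a multi-armed bandit, a deliberately simplified setting in which the environment has no state and the agent only chooses which action to try. The sketch below uses an epsilon-greedy strategy. The win probabilities, `epsilon`, and the function name are assumptions for the example, not part of any particular library.

```python
import random

def run_bandit(win_probs, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best-known action, sometimes explore."""
    rng = random.Random(seed)
    counts = [0] * len(win_probs)    # how often each action was tried
    values = [0.0] * len(win_probs)  # running average reward per action
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(win_probs))  # explore
        else:
            action = max(range(len(win_probs)), key=lambda a: values[a])  # exploit
        # The environment responds with a reward of 1 or 0.
        reward = 1 if rng.random() < win_probs[action] else 0
        counts[action] += 1
        # Incremental mean: nudge the estimate toward the observed reward.
        values[action] += (reward - values[action]) / counts[action]
        total_reward += reward
    return values, counts, total_reward
```

Running `run_bandit([0.2, 0.8])` should show the agent trying the second action far more often once its reward estimates settle. Full reinforcement learning problems add states and long-term consequences on top of this loop.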
GAN: Generative Adversarial Networks
Deep learning algorithms with two neural networks acting as a generator and a discriminator, respectively. The generator produces synthetic data and the discriminator tries to differentiate the synthetic data from the true data.
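The tug-of-war between the two networks shows up in their loss functions. The sketch below computes only the losses, given the discriminator's outputs as probabilities that a sample is real. It assumes the commonly used non-saturating form of the generator loss, and the function names are hypothetical.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Low when real data is scored near 1 and synthetic data near 0.

    d_real: discriminator's probability that a true sample is real.
    d_fake: discriminator's probability that a synthetic sample is real.
    """
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Low when the discriminator is fooled into scoring synthetic data near 1."""
    return -math.log(d_fake)
```

The two objectives pull `d_fake` in opposite directions. When the discriminator can no longer tell the difference and outputs 0.5 for everything, its loss settles at 2 ln 2.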