Researchers accelerate fusion research with Argonne’s Groq AI platform


November 2, 2022 – Researchers are leveraging the Groq AI platform in the Argonne Leadership Computing Facility (ALCF) AI Testbed to accelerate deep learning-guided investigations that will inform the operation of future fusion energy devices. The ALCF is a US Department of Energy (DOE) Office of Science user facility at DOE’s Argonne National Laboratory.

To improve the predictive capabilities of fusion energy science, researchers are developing a workflow that integrates ALCF supercomputers for AI model training with ALCF AI Testbed’s Groq system for inference. Credit: Kyle Felker, Argonne.

Argonne computational scientist Kyle Felker oversees the deep learning fusion effort as part of a larger project supported through the ALCF’s Aurora Early Science Program (ESP), which prepares applications for the laboratory’s forthcoming exascale supercomputer. Led by William Tang of DOE’s Princeton Plasma Physics Laboratory, the ESP project aims to improve predictive capabilities and mitigate large-scale disruptions in burning plasmas in experimental tokamak systems such as ITER.

“A central problem with fusion has to do with control,” Felker explained. “The equations governing the behavior of a tokamak fusion reactor are extremely complex, and we need to understand how to control the device so that instabilities do not develop and prevent the fusion reaction from becoming self-sustaining. This is a huge engineering problem that requires a solution based on theory, simulation, machine learning, and human expertise.”

Real-time requirements

Felker and his colleagues turned to the ALCF AI Testbed to explore how new AI accelerators could help advance their research. They found that the Groq Tensor Streaming Processor (TSP) architecture enabled fixed, predictable computation times for a key phase of deep learning (inference), which would vary in duration if performed on conventional high-performance computing resources powered by CPUs (central processing units) and GPUs (graphics processing units).

“We wanted to incorporate real-time deep learning models,” Felker said. “To apply our trained models to try to get a prediction and then send that prediction back to other fusion reactor control systems, we only have a single millisecond.”

The research team’s tokamak application is unique among Aurora ESP deep learning projects for its real-time inference requirements. The team intends to extend the application to predicting instabilities in real plasma discharge experiments conducted as part of a pathway to viable fusion energy.

Argonne researchers envision the Groq platform, in general, as an edge computing device that works in concert with other emerging AI hardware as well as GPU-accelerated machines. Ultimately, the finalized workflow for the fusion application should connect Aurora to an AI machine: Aurora for AI training and the AI accelerator for inference.

Training versus inference

Different considerations inform different phases of the deep learning process. Although the two are easy to confuse, model training, in which a neural network learns to meaningfully analyze sets of data and use them to make predictions, is often done in a separate computational environment from the inference phase, when predictions are made with the trained model.

The team is using ALCF supercomputing resources to train models and, with the ESP allocation, is preparing to leverage Aurora for massively parallel training.

“The deployment of the 2-exaflops Aurora system will allow training of models that are too large for existing computer systems,” Felker said.

Inference poses a different problem for researchers. While training itself has no strict time limit (indeed, for the largest and most complex models the process can take up to several weeks), the time available for inference is often very short.

Consider the end goal of a smart fusion reactor.

“Even if you have a very accurate trained model that can predict any possible plasma instability before it happens, it’s only useful if it can give reactor operators and automated response systems the time to act. There can be no possibility that the AI machine will take a day or an hour or even a minute to generate a prediction,” Felker said.

Deterministic architecture

Considering the importance of the inference problem, the team chose Groq as their platform because of its TSP architecture. The TSP is designed to scale single-core performance across multiple chips, can perform inference tasks quickly, and is deterministic. It provides consistent, predictable, and repeatable performance without overhead for context switching. Unlike on a CPU- or GPU-based system, an inference task on Groq runs for exactly the same amount of time in each instance.

“The lack of overhead you would get with a GPU or CPU results in a stronger guarantee that no jitter is introduced, whether that jitter delays performance by 50 microseconds or several milliseconds due to some stochasticity of execution,” Felker explained.
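The jitter Felker describes can be made concrete with a small sketch. The Python below is an illustration only: the function name `summarize_latencies` and every latency sample are hypothetical and invented for this example, not measurements from Groq, GPU, or CPU hardware. The only figure taken from the article is the one-millisecond budget. It compares a series with a fixed run time against one with a single slow outlier.

```python
from statistics import mean

# The one-millisecond budget quoted in the article, in microseconds.
DEADLINE_US = 1000.0


def summarize_latencies(latencies_us, deadline_us=DEADLINE_US):
    """Summarize per-inference latencies (microseconds) against a deadline."""
    if not latencies_us:
        raise ValueError("need at least one latency sample")
    worst = max(latencies_us)
    best = min(latencies_us)
    return {
        "mean_us": mean(latencies_us),
        "worst_us": worst,
        # Jitter here is the spread between the slowest and fastest run.
        "jitter_us": worst - best,
        # Number of runs that finished after the deadline.
        "missed": sum(1 for t in latencies_us if t > deadline_us),
    }


# Hypothetical samples, invented for illustration (not real measurements).
fixed_time = [400.0] * 5                               # same run time every call
variable_time = [350.0, 360.0, 355.0, 1400.0, 352.0]   # one slow outlier

fixed_summary = summarize_latencies(fixed_time)
variable_summary = summarize_latencies(variable_time)

print("fixed:   ", fixed_summary)
print("variable:", variable_summary)
```

In this made-up data the variable series has a mean latency well under the deadline yet still misses it once, which is why a real-time control loop cares about the worst case rather than the average.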

The team envisions a smart fusion reactor supported by TSPs and GPUs, where time-critical deep learning models run on the TSPs (for example, when a disruption is imminent and unavoidable, and mitigation measures are required) and sophisticated reinforcement learning algorithms run on the GPUs (for example, when researchers need to move the plasma to a more favorable state).

Tackling the fusion control problem with ONNX

Last year, Groq produced benchmarks for a limited number of cases. These results, as presented at last year’s AI Summit, suggest that the TSP architecture could outperform GPUs when running the tokamak application.

The Groq compiler has come a long way since then, allowing the use of a portable format called Open Neural Network Exchange (ONNX) for the team’s deep learning model.

ONNX is common in the world of GPU- and CPU-based deep learning as a mechanism to train a model on supercomputers and then save it to disk in a portable, descriptive format that captures the architecture of the neural network used for training along with its stored weight values.

With the ability to run ONNX files on the Groq TSP via the Groq compiler, the team is currently looking at hundreds of different models to thoroughly investigate the fusion control problem. Given the rapid evolution of deep learning models, ONNX support is invaluable in facilitating rapid prototyping and deployment.

Source: Nils Heinonen, ALCF

