AMD-based ‘Frontier’ supercomputer plagued by failures


Building a supercomputer is always demanding, but building the industry’s first exascale-class machine is especially difficult and involves a great deal of hardware and software development. Unfortunately, this appears to be the case with the Frontier supercomputer at Oak Ridge National Laboratory, which can hardly last a day without encountering numerous hardware problems.

Powered by AMD’s 64-core EPYC Trento CPUs, Instinct MI250X compute GPUs, and HPE’s Slingshot interconnect, ORNL’s Frontier is the industry’s first system designed for a peak performance of up to 1.685 FP64 ExaFLOPS at 21 MW of power. The system was built by HPE using its Cray EX architecture, which is designed for scale-out applications, particularly exceptionally fast supercomputers.

Although the hardware for the Frontier supercomputer has been delivered and the machine has remarkable potential on paper, hardware issues appear to be preventing it from going live and becoming available to researchers who need performance of approximately 1 FP64 ExaFLOPS.

Justin Whitt, program director for the Oak Ridge Leadership Computing Facility (OLCF), commented on the situation:

We are working on hardware issues and making sure we understand (what they are). You are going to have failures on this scale. The MTBF on a system of this size is hours, not days.

There have been rumors of hardware failures with Frontier for some time. According to an InsideHPC article, some claimed that the Slingshot interconnect was causing system problems. According to other reports, AMD’s Instinct MI250X compute GPUs were not reliable enough this year. It’s worth keeping in mind that the X version, which has more stream processors and higher clocks, is available only to select customers.

Mr. Whitt acknowledged that the machine had numerous hardware problems, but he did not say that the system had any specific problems with Instinct or Slingshot.

Many of the challenges are centered around those [GPUs], but that’s not the majority of the challenges we face. It’s a pretty good spread among the common culprits of part failures, which have been a big part of it. I don’t think at this point we have a lot of concerns about AMD products.

Oak Ridge National Laboratory’s Frontier supercomputer is by no means alone in combining AMD’s EPYC processors, Slingshot interconnects and HPE’s Cray EX architecture. For example, the Lumi supercomputer in Finland, officially recognized as the third most powerful supercomputer in the world, has a peak performance of 550 petaFLOPS using similar components. It may be the sheer scale of Frontier, which uses a total of 60 million parts, that brings the problems to the surface.
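
To see why a machine with 60 million parts can have an MTBF (mean time between failures) measured in hours, a rough back-of-the-envelope sketch helps. It assumes, purely for illustration, that every part is identical, fails independently at a constant rate, and that any single failure counts as a system failure. Under those assumptions the system MTBF is simply the per-part MTBF divided by the part count. The function names and the 24-hour target below are hypothetical and are not figures from ORNL or HPE.

```python
HOURS_PER_YEAR = 24 * 365

# Part count quoted in the article; everything else is an illustrative assumption.
PART_COUNT = 60_000_000


def system_mtbf_hours(part_mtbf_hours: float, part_count: int) -> float:
    """MTBF of a system of identical, independent parts with constant failure rates.

    Failure rates add up, so the system MTBF is the per-part MTBF
    divided by the number of parts.
    """
    if part_mtbf_hours <= 0 or part_count <= 0:
        raise ValueError("MTBF and part count must be positive")
    return part_mtbf_hours / part_count


def required_part_mtbf_hours(target_system_mtbf_hours: float, part_count: int) -> float:
    """Per-part MTBF needed to reach a given system MTBF (inverse of the above)."""
    if target_system_mtbf_hours <= 0 or part_count <= 0:
        raise ValueError("MTBF and part count must be positive")
    return target_system_mtbf_hours * part_count


if __name__ == "__main__":
    # Hypothetical target: one full day between failures.
    needed_hours = required_part_mtbf_hours(24.0, PART_COUNT)
    print(f"Per-part MTBF needed: {needed_hours / HOURS_PER_YEAR:,.0f} years")
```

Under this simplified model, running for a full day between failures would require each part to average roughly 164,000 years between failures. Real systems are more forgiving, since many parts are passive and some failures are tolerated, but the arithmetic shows why failures at this scale are expected rather than exceptional.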

Since the Frontier supercomputer has not yet been officially deployed, it remains unclear whether it will be available to researchers from 2023 as planned, having originally been expected to come online in 2022.

