Nvidia used the latest round of MLPerf inference benchmarks to debut public scores for its latest flagship GPU, the H100. H100 is the first chip built on the company’s Hopper architecture, with its specially designed transformer engine. H100 outperformed Nvidia’s current flagship, the A100, by 1.5-2× across the board, except on BERT, where the advantage was more pronounced, with up to a 4.5× uplift.
With triple the raw performance of the A100, why are some of the H100 benchmark scores less than double?
“While the FLOPS and TOPS numbers are an initial set of useful benchmarks, they don’t necessarily predict application performance,” said Dave Salvator, director of AI inference, benchmarking and cloud at Nvidia, in an interview with EE Times. “There are other factors, [including] the nature of the network architecture you are using. Some networks are more I/O bound, some networks are more compute bound… it varies by network.”
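The gap between raw compute and delivered speedup can be illustrated with a simple roofline-style estimate, in which throughput is capped either by compute or by memory traffic. The sketch below is only an illustration under stated assumptions: the chip figures are normalized, hypothetical values (not real A100 or H100 specifications), chosen so that the newer chip has three times the compute but only twice the memory bandwidth, and the function names are invented for this example.

```python
def attainable_throughput(peak_compute, mem_bandwidth, intensity):
    """Roofline estimate: throughput is capped by compute or by memory traffic.

    peak_compute  -- peak operations per second
    mem_bandwidth -- bytes moved per second
    intensity     -- operations performed per byte moved (depends on the network)
    """
    return min(peak_compute, mem_bandwidth * intensity)


def speedup(old_chip, new_chip, intensity):
    """Ratio of new to old attainable throughput at a given intensity."""
    new = attainable_throughput(new_chip[0], new_chip[1], intensity)
    old = attainable_throughput(old_chip[0], old_chip[1], intensity)
    return new / old


# Hypothetical, normalized chips as (peak_compute, mem_bandwidth).
# Assumption: the newer chip has 3x the compute but only 2x the bandwidth.
OLD_CHIP = (100.0, 10.0)
NEW_CHIP = (300.0, 20.0)

if __name__ == "__main__":
    # Low intensity = I/O bound, high intensity = compute bound.
    for intensity in (2.0, 12.0, 50.0):
        ratio = speedup(OLD_CHIP, NEW_CHIP, intensity)
        print(f"intensity {intensity:5.1f}: speedup {ratio:.2f}x")
```

In this toy model, a network that is limited by memory traffic sees only the 2× bandwidth gain, a compute-limited network sees the full 3×, and networks in between land somewhere in the middle. Real results also depend on software maturity and other factors the model ignores.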
Salvator added that there is room for the H100’s scores to improve as its software stack matures.
“It’s a first demonstration for Hopper…there’s still gas in the tank,” he said.
Salvator pointed out that A100 results have improved 6× since that accelerator first appeared in MLPerf in July 2020. [Nvidia’s software portal] that developers can use.
The standout result for the H100 was on BERT-Large, where it performed up to 4.5× better than the A100. Among the H100’s new features is a hardware and software transformer engine that manages the precision of calculations during training for maximum throughput while maintaining accuracy. While this feature is more relevant to training, it also applies to inference, Salvator said.
“It’s largely the FP8 precision that comes into play here, but so do other architectural aspects of the H100. The fact that we have more compute capacity plays a role, more streaming processors, more tensor cores and more compute,” he said. H100 also has roughly double the memory bandwidth of the A100.
Some parts of the BERT 99.9 benchmark were run in FP16 and some in FP8. The secret sauce here is knowing when to switch to higher precision to preserve accuracy, which is part of what the transformer engine does.
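The idea of choosing between FP8 and FP16 per tensor can be sketched with a toy heuristic. This is not Nvidia’s algorithm, which the article does not describe. It assumes the E4M3 flavor of FP8 (largest finite value 448, smallest normal value 2^-6), scales each tensor so its largest magnitude maps to the FP8 maximum, and falls back to FP16 when too many values would land below the smallest normal FP8 value. The function name and the 1% threshold are hypothetical.

```python
FP8_E4M3_MAX = 448.0              # largest finite value in the E4M3 format
FP8_E4M3_MIN_NORMAL = 2.0 ** -6   # smallest normal value in the E4M3 format


def pick_precision(values, max_underflow_fraction=0.01):
    """Toy heuristic: return "fp8" if the scaled tensor fits FP8's range.

    values                 -- iterable of floats sampled from one tensor
    max_underflow_fraction -- hypothetical tolerance for values that would
                              fall below the smallest normal FP8 value
    """
    magnitudes = [abs(v) for v in values if v != 0.0]
    if not magnitudes:
        return "fp8"  # an all-zero tensor loses nothing in FP8

    # Per-tensor scaling: map the largest magnitude onto the FP8 maximum.
    scale = FP8_E4M3_MAX / max(magnitudes)

    # Count values that would underflow after scaling.
    underflow = sum(1 for m in magnitudes if m * scale < FP8_E4M3_MIN_NORMAL)

    if underflow / len(magnitudes) <= max_underflow_fraction:
        return "fp8"
    return "fp16"  # wider dynamic range preserves the small values


if __name__ == "__main__":
    print(pick_precision([0.5, 1.0, 2.0]))         # narrow dynamic range
    print(pick_precision([1e-9, 1e-8, 1.0, 1e6]))  # very wide dynamic range
```

A tensor with a narrow spread of magnitudes stays in FP8, while one spanning many orders of magnitude is promoted to FP16. A production system would weigh more than dynamic range, so this only conveys the general shape of the decision.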
Nvidia also showed about a 50% improvement in power efficiency for its state-of-the-art Orin SoC, which Salvator attributed to recent work to find an operational sweet spot for frequency and voltage (MaxQ).
Benchmark scores for the Grace CPU and Grace Hopper, along with power metrics for H100, should be available once the products hit the market in the first half of next year, Salvator said.
Nvidia’s main challenger, Qualcomm, has focused on power efficiency for its Cloud AI 100 accelerator. Qualcomm runs the same chip in different power envelopes for data center and edge use cases.
More than 200 Cloud AI 100 scores were submitted by Qualcomm and its partners, including Dell, HPE, Lenovo, Inventec and Thundercomm. Three new edge platforms based on Snapdragon processors with Cloud AI 100 were also benchmarked, including Foxconn Gloria systems.
Qualcomm entered the largest system (18 accelerators) in the available category of the closed data center division and took the crown for best offline and server ResNet-50 performance. The 8x Cloud AI 100 scores, however, were easily surpassed by Nvidia’s 8x A100 PCIe system. (Nvidia H100 is in the “preview” category because it is not commercially available yet.)
Qualcomm also claimed the best power efficiency across the board in the Closed Edge System and Closed Data Center System divisions.
Chinese GPU startup Biren offered its first set of MLPerf scores since emerging from stealth last month.
The Chinese startup showed off the scores of its BR104 single-chip accelerator in PCIe form factor alongside its BirenSupa software development platform. For ResNet-50 and BERT 99.9, Biren’s eight-accelerator system offered similar performance to Nvidia’s DGX-A100 in server mode, where there is a latency constraint, but comfortably outperformed the DGX-A100 in offline mode, which is a measure of raw throughput.
Biren’s BR100, which has a pair of the same chiplets used individually in the BR104, was not benchmarked.
Chinese server maker Inspur also submitted results for a commercially available system with 4 PCIe BR104 cards.
Another new entrant was Sapeon, a spin-out from Korean telecom giant SK Telecom. Before the spin-out, Sapeon had been working on its accelerator since 2017; the X220, a second-generation chip, has been on the market since 2020. The company said its chip is in smart speakers and security camera systems. The X220 beat Nvidia’s A2, an Ampere-generation part aimed at entry-level servers for 5G and industrial applications.
Sapeon showed scores for the X220-compact, a single-chip PCIe card consuming 65W, and the X220-enterprise, which has two X220 chips and consumes 135W. The company pointed out that the X220-compact beat Nvidia A2 by 2.3× in performance and was also 2.2× more energy efficient, based on maximum power consumption. This is despite the X220’s low-cost 28nm process technology (Nvidia A2 is on 7nm).
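The relationship between the performance and efficiency claims is simple arithmetic: a performance-per-watt ratio is the performance ratio divided by the power ratio. The sketch below uses only the rounded figures quoted above and the 65W rating of the X220-compact. The function names are hypothetical, and no power figure for the A2 is assumed, only what the quoted ratios imply.

```python
def relative_efficiency(perf_ratio, power_ratio):
    """Performance-per-watt ratio of chip X versus chip Y.

    perf_ratio  -- performance of X divided by performance of Y
    power_ratio -- power draw of X divided by power draw of Y
    """
    return perf_ratio / power_ratio


def implied_power_ratio(perf_ratio, efficiency_ratio):
    """Power ratio (X over Y) implied by quoted performance and efficiency ratios."""
    return perf_ratio / efficiency_ratio


if __name__ == "__main__":
    # Rounded figures quoted by Sapeon for the X220-compact versus Nvidia A2.
    perf = 2.3
    efficiency = 2.2

    ratio = implied_power_ratio(perf, efficiency)
    print(f"implied power ratio: {ratio:.3f}")

    # Baseline wattage implied for the comparison, given the 65W X220-compact.
    print(f"implied baseline power: {65.0 / ratio:.1f} W")
```

Because the quoted ratios are rounded to one decimal place, the implied power figures are only approximate. The takeaway is that the two cards were compared at broadly similar maximum power, so most of the efficiency gain tracks the performance gain.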
Sapeon is planning a third-generation chip, the X330, for the second half of 2023, which the company says will deliver greater accuracy and handle both inference and training workloads.
Intel submitted preview scores for its delayed Sapphire Rapids processor. This four-chiplet Xeon data center processor is the first to benefit from Intel’s Advanced Matrix Extensions (AMX), which Intel says enable 8× the operations per clock compared to previous generations.
Sapphire Rapids also offers more compute, more memory, and more memory bandwidth than previous generations. Intel said Sapphire Rapids’ scores were between 3.9× and 4.7× those of its previous-generation processors in offline mode, and between 3.7× and 7.8× in server mode.
Other notable results
Chinese company Moffett submitted scores in the open division for its platform, which includes its Antoum chips, software stack and the company’s own sparsity algorithms. The company has the S4 (75W) chip available, with the S10 and S30 (250W) still in the preview category. The Antoum architecture uses Moffett’s own sparse processing units for native sparse convolution alongside vector processing units, which add workload flexibility.
Startup Neural Magic has developed a sparsity-aware inference engine for CPUs. Combined with Neural Magic’s compression framework, which takes care of pruning and quantization, the inference engine allows neural networks to run efficiently on CPUs by changing the order of execution so that information can be kept in the processor’s cache (without having to access external memory). The company’s scores were submitted on Intel Xeon 8380 processors.
Israeli software startup Deci submitted results for its version of BERT in the open division, running on AMD Epyc processors. Deci’s software uses neural architecture search to tailor the architecture of the neural network to the particular processor, often reducing the network’s size in the process. The speedup was between 6.33× and 6.46× relative to the baseline.