Sponsored feature

In recent years, the limiting factors for large-scale AI/ML have been first hardware capabilities, then the scalability of complex software frameworks. The final hurdle is less obvious, but if not overcome, it could limit what is possible in the computational and algorithmic realms.
This last limitation has little to do with the compute components themselves and everything to do with cooling those processors, accelerators, and memory devices. The reason it is not more widely discussed is that data centers already have sufficient cooling capacity, most often with air conditioning units and the standard cold aisle/hot aisle layout.
Currently, it is still perfectly possible to manage with air-cooled server racks. In fact, for general business applications that require one or two processors, this remains an acceptable standard. For AI training in particular, however, with its heavy reliance on GPUs, the continued growth of AI capabilities demands a complete overhaul of how systems are cooled.
Outside of the biggest supercomputing sites, the world has never seen this kind of AI-specific, ultra-dense computing packed into a single node. Instead of two processors, AI training systems have at least two high-end processors plus four to eight GPUs. Power consumption ranges from 500 to 700 watts for a general enterprise server to between 2,500 and 4,500 watts for a single AI training node.
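A rough back-of-the-envelope sketch shows what those wattage ranges mean at rack scale. Only the per-node figures come from the text above; the rack size and node count are illustrative assumptions, not vendor specifications.

```python
# Back-of-the-envelope rack heat load, using the per-node wattage ranges
# quoted above. The node count per rack is an illustrative assumption
# (e.g. 2U nodes in a 42U rack), not a vendor specification.

ENTERPRISE_NODE_W = (500, 700)      # typical dual-processor enterprise server
AI_TRAINING_NODE_W = (2500, 4500)   # 2 CPUs plus 4-8 GPUs per node

def rack_heat_kw(node_watts, nodes_per_rack):
    """Total heat a fully populated rack must dissipate, in kW (low, high)."""
    low, high = node_watts
    return (low * nodes_per_rack / 1000, high * nodes_per_rack / 1000)

# Assume 20 nodes per rack in both cases.
print(rack_heat_kw(ENTERPRISE_NODE_W, 20))   # (10.0, 14.0) kW
print(rack_heat_kw(AI_TRAINING_NODE_W, 20))  # (50.0, 90.0) kW
```

Even with identical node counts, the AI rack must shed five to six times as much heat as the enterprise rack, which is the core of the cooling problem described here.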
Imagine the heat generated by this computing power, then picture an air conditioning unit trying to remove it with chilled air alone. One thing that becomes clear at this compute density and heat per rack is that there is no way to blow enough air to sufficiently cool some of the most expensive high-performance server gear on the planet. The result is thermal throttling of the compute elements or, in extreme cases, outright shutdowns.
This brings us to another factor: server rack density. With demand for data center real estate at an all-time high, the need to maximize density is driving new server innovations, but air cooling can only keep up by leaving empty slots in racks (where more systems could reside) to let air flow. Under these conditions, air cooling is not only insufficient for the task, it also yields less compute from each rack and therefore more wasted server room space.
For normal enterprise systems running routine tasks on dual-processor servers, problems may not escalate as quickly. But for dense AI training clusters, a huge amount of energy is needed to push cold air in, capture the heat at the back, and bring it back down to a reasonable temperature. This consumption goes well beyond what is needed to power the systems themselves.
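This overhead is commonly expressed as power usage effectiveness (PUE), the ratio of total facility energy to IT energy. A minimal sketch of the idea follows; the PUE values are illustrative assumptions for comparison, not measurements from the article.

```python
# Sketch of cooling/facility overhead via PUE (total facility power
# divided by IT power). The PUE values used below are assumed,
# illustrative figures, not measurements cited in the article.

def cooling_overhead_kw(it_load_kw, pue):
    """Energy spent on cooling and other overhead, beyond the IT load itself."""
    return it_load_kw * (pue - 1)

it_load = 100.0  # kW of servers; an assumed example figure

# Assumed PUEs: a heavily air-cooled room vs. a warm-water-cooled one.
print(cooling_overhead_kw(it_load, 1.6))  # roughly 60 kW of overhead
print(cooling_overhead_kw(it_load, 1.1))  # roughly 10 kW of overhead
```

The point of the comparison: the same IT load can carry several times more (or less) cooling energy depending on how the heat is removed.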
With liquid cooling, you dissipate heat much more efficiently. As Noam Rosen, EMEA Director for HPC and AI at Lenovo, explains, “When you use warm, room-temperature water to remove heat from components, you don’t need to chill anything; you do not invest energy in reducing the temperature of the water. This becomes very important when you reach the number of nodes found at national labs and data centers doing large-scale AI training.”
Rosen provides quantitative details comparing general enterprise rack-level power requirements against those of AI training, citing a life-cycle assessment of training several large, common AI models. The researchers examined the model-training process for natural language processing (NLP) and found that it can emit hundreds of tons of carbon dioxide equivalent, nearly five times the lifetime emissions of an average car.
“When training a new model from scratch or adapting a model to a new dataset, the process emits even more carbon due to the time and computational power required to tune an existing model.” As a result, the researchers recommend that industries and businesses make a concerted effort to use more efficient hardware that requires less energy to operate.
Rosen puts hot water cooling in concrete context by highlighting what one of Lenovo’s Neptune family of liquid-cooled servers can do compared with the traditional air-cooled route. “Today, it’s possible to take a rack and fill it with over a hundred Nvidia A100 GPUs, all in a single rack. The only way to do this is with hot water cooling. That same density would be impossible in an air-cooled rack because of all the empty slots needed to let the components air-cool, and even then it probably couldn’t handle the heat from that many GPUs.
Depending on the server configuration, hot water cooling can remove 85-95% of the heat. With allowable inlet water temperatures of up to 45°C, energy-intensive chillers are in many cases not required, meaning even greater savings, lower total cost of ownership, and lower carbon emissions,” says Rosen.
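Rosen's 85-95% figure can be read as a split of each node's heat between the water loop and the remaining room air. A small sketch using the article's numbers; the 4,000 W node is an assumed example drawn from the AI-node range given earlier.

```python
# Split a node's heat between the warm-water loop and the room air,
# using the 85-95% capture range quoted above. The 4,000 W node figure
# is an assumed example from the AI-node power range given earlier.

def heat_split_w(node_watts, water_pct):
    """Return (watts removed by the water loop, watts left for room air)."""
    to_water = node_watts * water_pct // 100
    return to_water, node_watts - to_water

# A 4,000 W AI node with 90% of its heat captured by the water loop
# leaves only about 400 W for the room's air handling to absorb.
print(heat_split_w(4000, 90))  # (3600, 400)
```

That residual few hundred watts per node is comparable to an ordinary server, which is why the remaining air-side load becomes manageable again.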
For customers who, for whatever reason, cannot add plumbing to their data center, Lenovo offers a system with a fully closed liquid-cooling loop that augments traditional air cooling, delivering the benefits of liquid cooling without facility-level plumbing.
At this point in AI training, with ultra-high densities and an ever-growing appetite for compute among some of the largest data center operators on the planet, the only path forward is liquid, and that is just from a data center and compute perspective. For companies running AI training at any scale, the biggest motivation should be controlling carbon emissions. Fortunately, with efficient liquid cooling, emissions stay under control, electricity costs fall, densities can be achieved, and, with the right models, AI/ML can continue to change the world.
Sponsored by Lenovo.