Underutilized hardware holds back AI compression


One of the most pressing challenges in deploying deep learning at scale, especially for social media giant Meta, is making full use of the hardware for inference as well as training.

Researchers have attacked this problem with various compression and pruning techniques, the most recent notable one being MetaPruning, which in 2019 represented the state of the art in pruning for maximum hardware efficiency. MetaPruning has been used at Meta (although, oddly enough given the name, the technique was developed by a set of universities in Asia and is unrelated to Facebook/Meta efforts).
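For context, the family of techniques MetaPruning belongs to works by removing entire channels from convolutional layers. Below is a minimal, hypothetical sketch of magnitude-based channel pruning in PyTorch; the L1 criterion and the keep_ratio parameter are illustrative simplifications, not the paper's method (MetaPruning learns per-layer pruning ratios with a meta-network rather than using a fixed rule).

```python
# A minimal sketch of magnitude-based channel pruning (illustrative only).
import torch
import torch.nn as nn

def prune_conv_channels(conv: nn.Conv2d, keep_ratio: float = 0.5) -> nn.Conv2d:
    """Keep the output channels whose filters have the largest L1 norm."""
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    # L1 norm of each output filter: shape (out_channels,)
    norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    keep = torch.topk(norms, n_keep).indices
    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(conv.weight[keep])
        if conv.bias is not None:
            pruned.bias.copy_(conv.bias[keep])
    return pruned

conv = nn.Conv2d(64, 128, 3, padding=1)
print(prune_conv_channels(conv))  # Conv2d(64, 64, kernel_size=(3, 3), ...)
```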

Despite those efficiency gains, there is still a long way to go, according to researchers at Meta and Rice University. The team takes a closer look at the hardware efficiency that traditional compression techniques for deep learning leave on the table, all without sacrificing accuracy.

There is a “dilemma between trends in efficient DNN design and advances in modern computing platforms. While modern computing platforms (GPUs and TPUs) have constantly evolved to favor a higher degree of parallel computing, existing efficient DNN models often adopt lightweight operations that suffer from low hardware utilization and thus lower achievable hardware efficiency,” the team explains.

Specifically, the computational patterns end up being irregular, which is especially difficult for embedded processors to manage, owing to “their reduced possibilities of data reuse, [which] limits existing efficient DNNs from unlocking their theoretical potential.”
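To make that utilization gap concrete, here is a rough microbenchmark sketch (ours, not from the paper): a depthwise convolution, the kind of lightweight operation efficient DNNs favor, performs far fewer FLOPs than a standard convolution over the same tensor, yet typically achieves a much smaller fraction of a GPU's peak throughput because each weight is reused far less. All shapes are illustrative; run on a GPU for meaningful numbers.

```python
# Rough utilization microbenchmark: dense conv vs. depthwise conv.
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(32, 256, 56, 56, device=device)

standard = nn.Conv2d(256, 256, 3, padding=1).to(device)
depthwise = nn.Conv2d(256, 256, 3, padding=1, groups=256).to(device)

@torch.no_grad()
def bench(layer, flops):
    for _ in range(3):                       # warm-up iterations
        layer(x)
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(20):
        layer(x)
    if device == "cuda":
        torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / 20
    print(f"{flops/1e9:8.2f} GFLOPs in {dt*1e3:6.2f} ms "
          f"-> {flops/dt/1e12:5.2f} TFLOP/s achieved")

n, h, w, c, k = 32, 56, 56, 256, 3
bench(standard, 2 * n * h * w * c * c * k * k)   # dense conv: high data reuse
bench(depthwise, 2 * n * h * w * c * k * k)      # depthwise: little reuse
```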

In short, the goal was to create a more hardware-centric DNN that could make better use of parallelism.

“How can we design efficient DNNs that can simultaneously take advantage of both the powerful expressiveness of state-of-the-art efficient DNN structures and the enhanced parallel computing capability of modern computing platforms?” the researchers ask.

The result is “DepthShrinker,” which focuses on super-compact, hardware-aware neural networks that can transform jagged computational patterns into tighter networks for higher throughput and accuracy. The team claims that their compression technique enables “3.06% higher accuracy and 1.53X throughput on [an Nvidia] Tesla V100 over state-of-the-art channel pruning method, MetaPruning.”

Instead of the nice, simple convolutional layers of yore, DepthShrinker takes the irregular computations that are now the norm and merges “consecutive compact layers, between which activation functions are learned to be unimportant for inference. DNNs derived from DepthShrinker can greatly take advantage of the high degree of parallelism of modern computing platforms and thereby increase hardware efficiency while maintaining the accuracy of the original models.”
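The enabling observation is that two linear operations with no nonlinearity between them compose into a single linear operation. Here is a minimal sketch of that fusion, using fully connected layers for clarity (the paper works with the 1x1 and depthwise convolutions of inverted residual blocks); all shapes are illustrative:

```python
# Once the activation between two layers is dropped, the pair collapses
# into one larger, more hardware-friendly layer.
import torch
import torch.nn as nn

a = nn.Linear(512, 128, bias=True)   # compact "squeeze" layer
b = nn.Linear(128, 512, bias=True)   # compact "expand" layer

# With no nonlinearity in between, b(a(x)) = (Wb Wa) x + (Wb ba + bb).
fused = nn.Linear(512, 512)
with torch.no_grad():
    fused.weight.copy_(b.weight @ a.weight)
    fused.bias.copy_(b.weight @ a.bias + b.bias)

x = torch.randn(4, 512)
print(torch.allclose(b(a(x)), fused(x), atol=1e-5))  # True
```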

Because the work is intended to be played on servers as well as inference devices, the team tested the method on an Nvidia Tesla V100 GPU and on the desktop and edge sides, an Nvidia RTX 2080Ti and a Jetson TX2 .

Although most of the team's benchmarking has focused on inference, the same concept can be applied to training. “The vanilla design of our DepthShrinker described above takes advantage of the idea that unimportant activation functions can be properly removed after training without affecting inference accuracy. Excitingly, this insight can also be exploited to improve DNN training. More specifically, we propose to train a given DNN via an Expand-then-Shrink strategy, and call it DepthShrinker+.”
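The paper's exact recipe for learning which activations are unimportant is not reproduced here, but the flavor can be sketched with a learnable gate that blends each nonlinearity with the identity and is pushed toward zero by a sparsity penalty. The GatedReLU module and the penalty below are our own illustration, not the authors' formulation:

```python
# A hedged sketch of learning which activations can be removed.
import torch
import torch.nn as nn

class GatedReLU(nn.Module):
    """Blend between ReLU (gate near 1) and identity (gate near 0).
    Gates that end up near 0 after training mark activations that can
    be dropped, letting the surrounding layers be fused."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(1))

    def forward(self, x):
        g = torch.sigmoid(self.gate)           # squash to (0, 1)
        return g * torch.relu(x) + (1 - g) * x

# During training, an L1 penalty pushes gates toward 0 so that only the
# activations that truly matter for accuracy keep their nonlinearity.
model = nn.Sequential(nn.Linear(64, 64), GatedReLU(), nn.Linear(64, 10))
gate_penalty = sum(torch.sigmoid(m.gate).sum()
                   for m in model.modules() if isinstance(m, GatedReLU))
```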

The team also extended their evaluation of DepthShrinker to edge processors, including mobile platforms like the Google Pixel 3 and the Raspberry Pi 4, using batch size 1 and achieving lower latency than standard approaches (PyTorch models exported to ONNX and then converted to TFLite).
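For readers who want to set up that kind of edge pipeline, a hypothetical sketch follows; exact tool versions and conversion flags vary, and the onnx-tf bridge used here is just one of several ways to get from ONNX to a TFLite flatbuffer:

```python
# Hypothetical PyTorch -> ONNX -> TFLite export pipeline (illustrative).
import torch

model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU()).eval()
dummy = torch.randn(1, 3, 224, 224)        # batch size 1, as benchmarked
torch.onnx.export(model, dummy, "model.onnx")

import onnx
from onnx_tf.backend import prepare        # pip install onnx-tf
prepare(onnx.load("model.onnx")).export_graph("saved_model")

import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model")
open("model.tflite", "wb").write(converter.convert())
```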

“Extensive experiments validate that our DepthShrinker gains both the high accuracy of channel-wise pruning and the decent efficiency of layer-wise pruning, opening up a cost-effective dimension for DNN compression.” Full references and more data can be found in the paper.
