Yang Zhou Optimizes AI Systems by Building a Better Backend

A portrait of Yang Zhou

In his first year at college, Yang Zhou intended to be a physics major. Through an introductory programming course, however, Zhou found a proclivity toward using code to create something new. 

“There was an assignment that had you print out stars — the first row had one star, the second row had two stars and so on,” he said. “I really like this process in which you can create something visually interesting by programming in different languages.” 

Zhou pursued that newfound interest, earning a Bachelor of Science degree in computer science from Peking University in China. He then earned a Master of Science degree and Ph.D. in computer science from Harvard University. Zhou conducted postdoctoral research at UC Berkeley and has now joined the University of California, Davis, as an assistant professor of computer science. 

Fast Lanes for GPUs

Zhou’s research will focus on building better software systems that can enhance performance and lower costs in technologies like machine learning and AI. 

One of the projects he initiated is UCCL, software that helps different types of GPUs coordinate better with each other, particularly in tasks required by machine learning and AI. GPUs, or graphics processing units, are designed to handle many simple tasks simultaneously, making them ideal for the complex algorithms required to train and run machine learning and AI. 

UCCL acts as a sort of switchboard operator for GPUs by managing certain tasks, like multipathing (spreading traffic over multiple routes), smart congestion control (making sure messages don’t jam up) and selective retransmission (resending only lost sections instead of the whole message).  
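To give a flavor of one of those tasks, here is a minimal sketch of selective retransmission in Python. This is not UCCL's actual implementation; the chunk-and-ack scheme, function names and sizes below are illustrative assumptions, showing only the core idea that just the missing pieces of a message are resent.

```python
# Illustrative sketch of selective retransmission (not UCCL's real code):
# the sender numbers each chunk, the receiver acknowledges what arrived,
# and only the unacknowledged chunks are resent.

def split_into_chunks(message: bytes, chunk_size: int) -> dict[int, bytes]:
    """Number each fixed-size chunk so the receiver can report gaps."""
    return {i: message[start:start + chunk_size]
            for i, start in enumerate(range(0, len(message), chunk_size))}

def missing_chunks(sent: dict[int, bytes], acked: set[int]) -> dict[int, bytes]:
    """Only the unacknowledged chunks need to cross the wire again."""
    return {i: data for i, data in sent.items() if i not in acked}

def reassemble(received: dict[int, bytes]) -> bytes:
    """Put the chunks back in order once everything has arrived."""
    return b"".join(received[i] for i in sorted(received))

# Example: chunks 1 and 3 are lost in transit.
chunks = split_into_chunks(b"hello, selective retransmission!", 8)
delivered = {i: d for i, d in chunks.items() if i not in {1, 3}}
resend = missing_chunks(chunks, set(delivered))  # just the two lost chunks
delivered.update(resend)
assert reassemble(delivered) == b"hello, selective retransmission!"
```

The payoff is that a single dropped packet costs one small resend rather than a full retransfer, which matters when GPUs are exchanging gigabytes of model data.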

When tested on cloud servers like Amazon Web Services, UCCL made GPU communication up to 3.3 times faster than older systems. On hardware from other GPU vendors, like Nvidia and Advanced Micro Devices, UCCL sped up select operations by as much as 2.5 times. 

“My research is trying to provide efficient communication support for these heterogeneous GPUs in these kinds of machine learning workloads,” Zhou said. “You can think of a healthy ecosystem, where different vendors play with each other, not just one vendor monopolizing the whole market. Different vendors have different prices, different focuses. With UCCL, you can combine the strengths of different vendors to build a better performance workload on lower-cost systems.” 

Zhou is also interested in exploring a potential byproduct of this work: how enhanced performance could improve energy efficiency. 

“You can easily translate performance per dollar into something like performance per watt,” he said. “If you have a fixed installation cost and you compute for an hour, you probably consume 200 watts. If you can compute faster, you save that energy, right?” 
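The arithmetic behind that quote can be made concrete with hypothetical numbers. The 200-watt draw comes from Zhou's example; applying UCCL's best-case 3.3x figure to a whole job is an illustrative assumption, since that speedup was measured for communication specifically.

```python
# Back-of-the-envelope energy estimate for a job at steady power draw:
# energy (Wh) = power (W) x time (h), so finishing faster saves energy
# in proportion to the speedup.
power_watts = 200        # assumed steady draw, from Zhou's example
baseline_hours = 1.0     # the one-hour job in the quote
speedup = 3.3            # illustrative: UCCL's best-case communication speedup

baseline_energy_wh = power_watts * baseline_hours
faster_energy_wh = power_watts * (baseline_hours / speedup)
print(baseline_energy_wh, round(faster_energy_wh, 1))  # 200.0 Wh vs. ~60.6 Wh
```

Under these assumptions the same computation drops from 200 watt-hours to roughly 61, which is the sense in which performance per dollar translates into performance per watt.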

All the Right Optics

At UC Davis, Zhou is excited by the prospect of bringing on eager computer science students to help him build systems and investigate his research problems. 

He also looks forward to the possibilities for collaboration, particularly with professors in the Department of Electrical and Computer Engineering conducting research in fiber optics for communication and computation. Light-based communication between GPUs is already used to a certain degree, said Zhou, but he wants to take it to the next level. 

“I want to push it further to get even faster and better performance in machine learning communications by aggressively leveraging their techniques,” he said. “They would develop the hardware techniques to provide that capability, and I will develop the software layer to let the applications better leverage the physical capabilities.” 

Zhou has presented research at over a dozen conferences, including ACM SIGCOMM, the USENIX Symposium on Networked Systems Design and Implementation, the USENIX Symposium on Operating Systems Design and Implementation, the Machine Learning and Systems, or MLSys, Conference and the Very Large Data Bases, or VLDB, Conference. 

His peer-reviewed papers have been published in such journals as IEEE Transactions on Knowledge and Data Engineering and the VLDB Journal. Zhou has also conducted research for companies, including Meta, Google Systems Research and NetInfra. 
