In 2014, Jim Gao was an engineer at Google. He was responsible for making the massive air-conditioning systems of the tech giant’s data centers (DCs) run as smoothly and efficiently as possible. With a background in mechanical engineering, Gao was following established best practices in the energy industry and obtaining great results.
After the most commonly adopted energy-saving measures were implemented, the performance of Google’s DCs started to plateau, uncovering the limitations of the traditional approach to energy saving. It was clear that a new approach was needed. Gao decided to pursue an unbeaten path, taking advantage of Google’s 20% policy—an initiative that allows employees to spend 20% of their time working on what they think will most benefit Google. Being a data center engineer, he was well aware of the sensors deployed in the DCs and of the large amount of data collected from them for operational purposes. Gao decided to study up on machine learning and tried to build models to predict and improve the DC performance.
The Data Center Energy Consumption Problem
A data center is a building that houses networked servers. In the case of Google, these machines serve Search and Maps queries, store photos and documents, and perform all the other tasks that Google needs in order to offer its services to users.
Energy consumption is a major driver of cost for DCs because of the large number of power-hungry computers they house. High costs are not the only factor to take into account when weighing a DC’s energy consumption; the environmental impact is also important. DCs nowadays consume 2% of the world’s electricity, a number that’s bound to increase as the need for networked services increases. The amount of energy used to power computers can’t readily be optimized by an operations team because it depends on the computing workload and the efficiency of the chips. For this reason, data center engineers strive to reduce all extra consumption. The efficiency of data centers is usually measured by tracking a metric called power usage effectiveness (PUE). This metric reflects how much energy is used on anything other than the actual computers that make up the data center:
PUE = (Total Facility Energy) / (IT Equipment Energy)
A perfect data center has a PUE of 1: all the energy is spent to power the computers. The higher the PUE, the more energy is spent on other systems, among which cooling is the most important. For example, a PUE of 1.5 means that for every kilowatt-hour (kWh) of energy consumed to power computers, an additional 0.5 kWh of energy is needed for cooling and other minor needs.
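The PUE formula is a simple ratio, which makes the example above easy to check. A minimal sketch in Python (the function name is our own, chosen for illustration):

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness: total facility energy divided by
    the energy consumed by the IT equipment alone."""
    return total_facility_kwh / it_equipment_kwh

# A facility drawing 1.5 kWh in total for every 1 kWh of IT load
# has a PUE of 1.5: 0.5 kWh goes to cooling and other overhead.
print(pue(1.5, 1.0))   # 1.5
print(pue(1120, 1000)) # 1.12, Google's 2013 figure
```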
Google has always been a leader in PUE efficiency. According to a 2018 survey of 900 DC operators by the Uptime Institute, the average PUE in the industry was 1.58. Google continuously improved its PUE until it reached 1.12 in 2013. Unfortunately, this value didn’t improve until 2017.
ML Approach to Data Center Efficiency
Gao realized that one of the obstacles to lowering the PUE further was that it’s extremely complex to predict it correctly in different scenarios using a traditional engineering approach, because of the complex interactions between factors (for instance, wind can help cool the plant and reduce the need for artificial cooling). On the other hand, he was well aware of the large datasets collected by his team as part of day-to-day operations, thanks to thousands of sensors deployed across components that collect millions of data points.
Gao foresaw the potential of using this data to train an ML model capable of overcoming the limitations of traditional thermodynamics. His first approach was to build a simple neural network (a classic algorithm used to build supervised learning models) that was trained to predict the PUE, given a list of features that affect it. Gao used a total of 19 features, including these:
- Total server IT load
- Total number of process water pumps running
- Mean cooling tower leaving-water-temperature set point
- Total number of chillers running
- Mean heat exchanger approach temperature
- Outside air wet-bulb and dry-bulb temperatures
- Outside air relative humidity, wind speed, and direction
The label of this supervised learning problem is the PUE. Gao used 184,435 time samples at five-minute resolution (approximately two years of operational data). The final model was able to predict DC PUE within 0.004 ± 0.0005, an approximately 0.4% error, for a PUE of 1.1. Gao was able to build the first proof-of-concept (POC) model quickly using open source coding frameworks.
The final resulting model has been used for three main applications:
- Automatic performance alerting and troubleshooting, by comparing actual versus predicted DC performance for any given set of conditions
- Evaluating PUE sensitivity to operational parameters
- Running digital simulations with different configurations without making physical changes
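The first application, alerting on actual-versus-predicted performance, can be reduced to a deviation check against the model's known error band. A hypothetical helper (the function and the choice of 0.004 as the tolerance, taken from the POC's reported error, are our own):

```python
def pue_alert(actual: float, predicted: float, tolerance: float = 0.004) -> bool:
    """Flag a measured PUE that deviates from the model's prediction by
    more than the tolerance (here, the POC model's ~0.004 error band)."""
    return abs(actual - predicted) > tolerance

print(pue_alert(1.118, 1.110))  # True: deviation of 0.008 exceeds the band
print(pue_alert(1.112, 1.110))  # False: within normal model error
```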
The results of Gao’s work had a large impact on the company and were recognized publicly by Joe Kava, Google’s VP of data centers, who described the work and results in an official Google blog post. “Better Data Centers Through Machine Learning” highlighted how Gao’s models were able to identify patterns in the data that are impossible for a person to spot, leading to a model capable of predicting the PUE with 99.6% accuracy.
Gao and his team started using the model to come up with new ways to improve efficiency. For example, taking servers offline was known to lower the data center’s performance. Thanks to Gao’s models, Google’s DC team was able to run simulations of the data center’s behavior and find new ways to contain that performance loss, saving energy and money.