Rice computer scientist Eugene Ng and his colleagues have introduced a new tool dubbed ShareBackup that will allow shared backup switches in data centers to take on network traffic within a fraction of a second after a software or hardware switch failure. In other words, it will keep data on the fast track when failures inevitably happen.
According to Ng, the tool will solve a common headache for data professionals, researchers and anyone else who depends on a system to deliver results all the livelong day.
Ng said, “A data network consists of servers and network switches. Switches move data packets to where they need to go. But things fail, especially in large-scale data centers with thousands of pieces of hardware.”
“The usual response to a failed switch is to shunt the flow of data to another line. Generally, the network has multiple paths for connecting servers so, just like if there’s a closure on the highway, we’d drive around it. This is a conventional, natural approach that makes a lot of sense: You reroute around the failure to get where you need to go.”
“But sometimes that other road is congested and everything slows down. Data centers aren’t the internet; they’re not about people surfing websites. They’re about supporting data-intensive applications like data mining or machine learning. And a lot of these applications have stringent performance deadlines, so blindly rerouting traffic could be the wrong thing to do in a data center.”
In such cases, ShareBackup puts fast switches and software in strategic locations, where they can pick up the traffic from a failed switch in a microsecond. Once the failure is resolved, the tool frees the backup switch to handle another failure.
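The share-and-release idea described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the ShareBackup implementation: the `BackupPool` class, its `fail` and `repair` methods, and the switch names are all hypothetical, and the real system's timing and control logic are not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BackupPool:
    """Toy model (hypothetical) of backup switches shared by a group of switches."""

    spares: List[str]
    # Maps each failed switch to the backup currently standing in for it.
    active: Dict[str, str] = field(default_factory=dict)

    def fail(self, switch: str) -> Optional[str]:
        """Assign a free backup to a failed switch, or return None if none is free."""
        if switch in self.active:
            return self.active[switch]  # already covered by a backup
        if not self.spares:
            return None  # pool exhausted; this sketch just reports it
        backup = self.spares.pop()
        self.active[switch] = backup
        return backup

    def repair(self, switch: str) -> None:
        """Release the backup to the pool once the failed switch is fixed."""
        backup = self.active.pop(switch, None)
        if backup is not None:
            self.spares.append(backup)


# One backup shared by several switches (names are made up for illustration).
pool = BackupPool(spares=["backup-1"])
first = pool.fail("edge-3")    # the backup takes over for edge-3
second = pool.fail("edge-7")   # no backup is free while edge-3 is down
pool.repair("edge-3")          # edge-3 is fixed, so the backup is released
third = pool.fail("edge-7")    # the same backup now covers edge-7
```

In this sketch a single backup serves two different failures in turn, which is the sense in which the backup is shared rather than dedicated to one switch.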
Ng said, “ShareBackup could save data centers time and money not only by maintaining full bandwidth but by also helping to analyze problems, including misconfigurations that commonly lead to network failure.”
“Part of our work is to help data centers figure out what went wrong in the network. Once the backup is activated, you can take the failed device out of the production network and test it to identify which component caused the problem.”
“Now, if we take two devices out and can’t figure out which went bad, both need to be replaced. It’s very likely only one of the devices is having the problem. Our software can diagnose these devices in a semiautomatic manner, and if one of the parts is good, it can be reinstated.”
Lead authors of the paper are Rice graduate student Dingming Wu and alumnus Yiting Xia, now a computer scientist at Facebook. Co-authors are Rice graduate students Xiaoye Steven Sun, Xin Sunny Huang and Simbarashe Dzinamarira.