Big Internet companies like Amazon, Facebook, and Google keep up with the growing demand for their services through massive parallelism, with their data centers routinely housing tens of thousands of individual computers, many of which might be working to serve just one end user. Supercomputer facilities are about as big and, if anything, run their equipment even more intensively.
In computing systems built on such huge scales, even low-probability failures take place relatively frequently. If an individual computer can be expected to crash, say, three times a year, then a data center with 10,000 computers will see about 30,000 crashes a year, or roughly 80 a day.
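The arithmetic behind that estimate can be checked with a short sketch. It assumes a 365-day year and crashes that are independent and spread evenly over time; the function name `expected_crashes_per_day` is made up for illustration and does not come from the article.

```python
def expected_crashes_per_day(crashes_per_machine_per_year: float,
                             machines: int,
                             days_per_year: float = 365.0) -> float:
    """Expected fleet-wide crashes per day.

    Assumes every machine fails at the same average rate and that
    failures are independent and uniform over the year (an
    illustrative simplification, not a claim about real fleets).
    """
    fleet_crashes_per_year = crashes_per_machine_per_year * machines
    return fleet_crashes_per_year / days_per_year


if __name__ == "__main__":
    # Figures from the text: 3 crashes per machine per year, 10,000 machines.
    rate = expected_crashes_per_day(3, 10_000)
    print(f"{rate:.1f} crashes per day")  # prints "82.2 crashes per day"
```

Under these assumptions the fleet sees a little over 80 crashes a day, the same order of magnitude as the figure quoted above. Doubling the fleet size or the per-machine failure rate doubles the daily count.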
Our group at the University of Toronto has been investigating ways to prevent such failures. We started with the simple premise that before we could hope to make these computers work more reliably, we needed to fully understand how real systems fail. While it didn’t surprise us that DRAM errors are a big part of the problem, exactly how those memory chips were malfunctioning proved unexpected.
Read more at IEEE.