A rogue piece of software triggered a cascading failure through Google’s data centres around the world, leading to the shutdown of its Gmail system last week.
The failure, during routine maintenance at one of the company’s European data centres, was an unforeseen side-effect of a software program that was written in-house, said Nelson Mattos, vice-president of engineering.
“We’re not perfect, we make mistakes,” he said.
But experts said that anticipating how software would behave in circumstances such as these was something that even corporate IT departments with far less engineering sophistication than Google were expected to master.
“That’s just not acceptable,” said Matt Cain, an analyst at Gartner, the IT consultants. “It was poor thinking-through of a code change. In a corporate environment, you can’t just tell your CEO it was bad luck.”
The glitch that led to the first global shutdown of Gmail since August began on Tuesday, during routine maintenance. Data were moved to a back-up centre while the work was being done.
However, the relocation triggered a software program that is designed to direct data to the centre nearest to where users are based, a measure that improves the response time for online applications.
As it unexpectedly set to work on the new mass of data, the code greatly increased the workload on the reserve data centre and triggered an overload, causing data to be pushed automatically into a third centre.
That in turn led to another overload, eventually triggering a series of failures that toppled Google’s data centres like falling dominoes.
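The domino sequence described above is a classic cascading-overload pattern: taking one centre offline shifts its load to a neighbour, which overloads and passes the whole burden onward. The sketch below is a toy model of that dynamic only; the centre names, capacities and loads are invented, and it bears no relation to Google’s actual routing code.

```python
# Toy model of a cascading overload across data centres, loosely
# following the failure sequence described in the article.
# All names, capacities and load figures are invented for illustration.

def cascade(capacity, load, order, start):
    """Simulate taking `start` offline: its load is pushed to the next
    centre in `order`. Any centre pushed over capacity fails and passes
    its entire load onward. Returns the list of failed centres, in order."""
    load = dict(load)            # don't mutate the caller's dict
    failed = [start]
    carry = load[start]          # traffic displaced by the maintenance
    load[start] = 0
    for centre in (c for c in order if c != start):
        load[centre] += carry
        if load[centre] <= capacity[centre]:
            return failed        # overflow absorbed; cascade stops here
        # Centre overloads: it fails too, and its whole load moves on.
        carry = load[centre]
        load[centre] = 0
        failed.append(centre)
    return failed                # every centre toppled, domino-style

capacity = {"eu": 100, "backup": 120, "us": 110}
load = {"eu": 90, "backup": 60, "us": 70}

# Maintenance takes "eu" offline; its 90 units overwhelm "backup"
# (60 + 90 > 120), whose combined load then overwhelms "us" as well.
print(cascade(capacity, load, ["eu", "backup", "us"], "eu"))
```

With these made-up numbers the reserve centre has spare headroom for normal traffic but not for an entire peer’s load at once, so each hand-off is larger than the last, which is why the failures accelerate rather than dampen.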
Despite the high-profile problem, Mr Cain said the overall reliability of Gmail was still superior to the in-house e-mail systems that most companies run.