Last Thursday evening I went to http://www.bing.com/ only to get a "page not found" error. I assumed I was having internet connectivity problems, since I would never expect Bing to be down. I could have gone to http://downforeveryoneorjustme.com/bing.com to verify, but instead I went to google.com and performed my search.
I later found that Bing was indeed down, as reported on the Bing team's blog. I had incorrectly assumed that something like this just couldn't happen to Bing, yet it suffered a 45-minute total outage during prime time. In recent memory, other big companies with a significant internet presence have had similar outages (Google's Gmail and Apple's MobileMe, to name the obvious ones).
This goes to show that significant disruptions in service quality still happen to large, mature organizations that have a lot riding on being perceived as trusted providers of services over the internet.
I'm still trying to decide whether I should be surprised by the issue Bing just encountered. One part of me says that as systems and organizations grow more complex, it's nearly impossible to achieve perfect operational excellence (at least not cost-effectively), and these types of outages are bound to happen. On the other hand, if I were the exec in charge of Bing, I'd have a hard time not being extremely unhappy with my team.
I'd think that as hard as I'm working to gain (or not lose) market share to the competition, I couldn't afford this kind of mistake. There don't seem to be any mitigating circumstances here. It's not as if the search site was missing some lesser-known features or returning slightly outdated results. Instead, I'd be fuming that my entire business was closed during business hours (this didn't happen in the middle of the night) during the holiday season (remember, Bing is not just search; it's shopping, maps, travel, etc.). If Bing were a brick-and-mortar store, this would not be like being out of stock on a couple of items; it would be like having the lights out and the doors closed at 6:30 PM PST.
According to that Bing blog post, "The cause of the outage was a configuration change during some internal testing that had unfortunate and unintended consequences." If this is accurate, it sounds like something that could fairly easily have been avoided. I'm sure some lessons learned will come out of this, but will the true root problem be resolved? Will this one procedure be adjusted, or will the organization do a deeper introspection and figure out what systemic changes could prevent this kind of outcome from ever recurring?
While it's easy to point a finger at the Bing team, every technology service organization can empathize with them on some level. Similar events have happened at least once to virtually all the big outfits in the business (Amazon, eBay, Walmart...), and even those who have so far dodged these headline-making events are at risk of falling into complacency.
In The Opposite of Luck, we discuss approaches to ensure that complacency doesn't set in. In particular, Chapter 9 covers "The Invisible Hand," where the entire organization can be rallied behind the same Service Quality objectives via goal setting, repeated broadcasting, getting skin in the game, and promoting a culture of performance. I don't know whether those who were doing that "internal testing" on Bing had any skin in the game before this event, but I bet they do now...