The World of Homis: What Went Wrong?

The outage that hit Research In Motion's BlackBerry wireless email network was a shock to those hooked on the wireless devices. The Waterloo, Ont., company hasn't issued much information about the cause of the outage. Carmi Levy, senior research analyst with InfoTech Research Group in London, Ont., answers a few questions on what happened.

Q. How does the system work?

A. Research In Motion manages messaging traffic through its Waterloo-based network operations centre. The technology uses a massive amount of computing power and network capacity to manage BlackBerry service for users across the Western hemisphere. Other regions have centralized operations centres as well.

Q. Why is everything channelled through Waterloo?

A. This is how Research In Motion originally built the architecture for its network, and as the subscriber base has grown, the organization has elected to continue to place the majority of its operational eggs in one basket for this part of the world.

Centralizing network operations is one way of reducing cost and improving the organization's ability to manage far-flung technology. Technicians can easily monitor the entire network from one place, while the company avoids the cost of building and staffing multiple operations centres.

Q. Isn't it risky to put everything through one location?

A. Given the scope of the current outage, it's a safe assumption that routing all traffic for the entire Western hemisphere through one central operations centre is somewhat risky. If that centre goes down, Research In Motion has a significantly limited ability to maintain service to its user base.

Although the initial driver of this strategy is usually efficiency and cost containment, it must be balanced against the risk of a significant failure that takes down that one site. Unlike the Internet, whose distributed architecture allows other segments to take over when things fail, a centralized approach offers no such redundancy. If it goes down, everyone goes down with it.

Q. Do you think RIM had a back-up plan or system that failed?

A. It is inconceivable that a company responsible for such a large customer base would not have comprehensive backups, failovers and alternative technologies and processes in place to maintain service in the event of an outage. In this day and age, not having a backup plan is equivalent to jumping out of a plane without a parachute: it's simply not done.

Thus, it's quite clear from the scope and depth of the outage that whatever backups or failovers RIM had in place failed to remain functional. This is a worst-case scenario for any company, especially one with the market profile and current agenda of RIM.

Q. Do you think the company might consider changing this approach now that it has had this failure, i.e. build a back-up system in a separate location and provide a better response for customers?

A. I would expect that a catastrophic, high-profile outage such as this would prompt RIM to reassess how it builds and manages its global network. Despite the day-to-day efficiencies of a centralized operations centre, the risk of a total service outage is simply too high if that one site is taken out for whatever reason.

The redundancies that were built into the network have obviously failed to contain the outage, and future events of this type will potentially do serious and permanent damage to RIM's reputation and future subscriber and revenue growth. In the coming days and weeks, customers will be pressing RIM for details on how it will fundamentally change its operational strategy to avoid widespread outages like this in future. Considering the degree to which BlackBerry service has infiltrated today's economy, customers have every right to ask these hard questions, and I expect RIM to devote significant engineering and media resources to this response.

More than any event in RIM's recent past, this outage holds the potential to change the direction of the company's support organization. It needs to do this if it hopes to continue to grow its subscriber base and fight for dominance in the mobile messaging space. It cannot afford to be seen as a weak manager of infrastructure, as enterprise customers will not accept weakness from a messaging vendor.

This was found at The Globe and Mail

Wednesday, April 18, 2007
