RIM's Excuse For BlackBerry Outage Finally Emerges
from the too-little-too-late? dept
Research In Motion has delivered an explanation of what caused the BlackBerry outage earlier this week -- sort of. It says an insufficiently tested software upgrade set off a series of errors at its network operations center, which processes all the emails for BlackBerry devices in North America, and then its "failover process", which is supposed to switch things to a backup system, didn't work properly. The company says that it has plenty of capacity and resources to deal with its volume of messages and growing user base, and that it will better test its upgrades in the future. However, that explanation -- and the long time it took to come out -- doesn't wash with some observers, who say there are enough holes in the story that it doesn't add up. In particular, RIM's contention that it was upgrading its software on a Tuesday night, rather than over a weekend, has raised some red flags. And if a scheduled upgrade was behind the problem, shouldn't that have been immediately obvious to the company, and shouldn't its PR team have spread the news quickly? The real damage from this episode won't be the outage itself, but rather the fallout from how RIM deals with it. On that front, things already aren't looking so good.
Reader Comments
It doesn't matter.
It is very possible that they released a minor patch to fix something and that caused a cascade failure when released into the wild. An unexpected and untestable side effect of the "minor" patch screwed up many other things. This is a very common issue when it comes to "minor" items that blow up in your face.
I'm not trying to say RIM doesn't have to properly answer "WHY" this happened; I'm just stating that at this "point" they could still be doing in-depth data analysis and simply posting the "general" result of what they have found so far. I've done live patches in the past and while I've never had a cascade failure, I've always known it was possible. (100K'ish subscriber level, not anywhere near the level of RIM.) I know that some day, some time, in some way I will miss some small detail and cause a cascade failure; it WILL happen.
So, understanding that there is never going to be 100% uptime, an outage of this type is deplorable yet a reality of large systems engineering. RIM "could" be a bit more upfront about what has gone wrong but on the other hand they very well could be scratching their heads over just what "really" caused the problem.
Now, personally, understanding such things, I would prefer that a company is up front about the reason for the down time. As a technically inclined person, and more importantly, one of the folks who would be asked to justify usage of XYZ system over others, I would not want this sort of generic response, which doesn't make a lot of sense, to become common. I would want straight answers about the problems and what they are doing to fix them; that's more important than denying there is a problem.
KB
anal retentive?
Just Saving a Few Bucks
I'm also sure RIM hires students fresh from school, supposedly because they are most up-to-date on the technology and of course, they work real, real cheap.
Of course, as soon as the students start figuring out what the hell they're doing, they want more money and of course, they get fired and a new flock of fresh-faced students gets hired.
This is a dirty little I.S. secret that's been true for many, many years.
From years of observation, my best guess about the outage is that it was caused by some new blockhead student who didn't know a bit from a byte but decided to "fix" something anyway.
re: post # 6 & 7
As for post 7... RIM is, in my experience, one of the more reliable technology companies out there. I don't manage any other systems that are as reliable and low maintenance as theirs.
And no, I don't work for them or have any particular investment. Just a very happy customer.
Re: re: post # 6 & 7
Well, your argument convinced me.
I guess RIM doesn't "downsize" the well-paid, experienced staff and hire a bunch of inexpensive bozos, fresh from college.
I guess their system didn't go down unexpectedly for no honestly explained reason.
I guess someone actually did provide for a reliable back-up system and someone actually did institute effective change controls and other basic I.S. common sense activities but, dammit, they just didn't work.
I hate devastating arguments like yours. They make me feel so uninformed and inexperienced.
Re: Re: re: post # 6 & 7
For clarification, in referring to "any other systems" I realize I was not being accurate. There are other systems that are as reliable and low maintenance... they are, however, few and far between.
Re: Re: Re: re: post # 6 & 7
I appreciate your response and I will admit I sometimes go berserk when I experience the cavalier attitude of corporations and other large business entities when it comes to Information Systems.
There are some pretty simple, well-known procedures to follow that will limit unscheduled downtime to something like 0.1 percent. That used to be the target for most mainframe shops and they met the target regularly - or they got fired.
Not so, these days. For instance, my cable ISP seems to simply turn off their service, accidentally or on purpose, any time they feel like it; no warning, no apologies, no refund. It happens a lot. Hooray for monopolies!
It is unfortunate when arrogance, greed and stupidity cause senior management to sacrifice dedication and professionalism on the altar of The Bottom Line.
Worse, they usually adversely affect The Bottom Line, then blame it on the I.S. staff.
Re:
Hmmmm
Re: Hmmmm
that's what i was thinking. haha
Blackberry Outage
Don't Rush QA!
Factor into that the fact that RIM charges for upgrades to BES, while Windows updates ActiveSync free of charge.
Also, syncing with a desktop is MUCH easier and error free with AS than with Desktop Manager.
RIM Architecture