How to Avoid a Facebook Style Outage Damaging Your Business

It was the biggest outage Meta (formerly Facebook) has ever had. In October 2021, its whole family of apps, which includes Instagram, WhatsApp, and Messenger, went offline across the world for around six hours, affecting 3.5 billion people.

The social media giant blamed several mistakes that caused a domino effect across its networks. It all began with routine maintenance, according to a blog post by the vice president of engineering at Meta.

Whatever the size of your business, if you rely on IT, an outage of any kind can hit you hard. So how can you try to avoid one?

 

Here at IT Naturally, we use a change management approach to control risk.

What is IT Change Management?

It’s a process that makes it easier for your company to roll out changes to your IT infrastructure. A “change” can be proactive or reactive, and is anything that would affect your live environment or your end users. Think of it a bit like a puppet: if you pull one string, you need to know what impact that will have on everything else.

Every change you make to your IT infrastructure has the potential to lead to a problem.

A change at Facebook spiraled and took its services offline, damaging its reputation and causing a huge financial loss.

The key thing about change management at IT Naturally is that it is simple, transparent, and accountable. 

We make it simple by documenting every step of the change and not moving forward until we know the results of the previous step. 

A technical person will always assess the change to see if it could have an impact anywhere. We also use peer review (two heads are better than one!) to double-check the change assessment.

Next, the Change Authority Board (CAB) reviews and approves the change. Considering every change in detail helps everyone understand the possible impact.
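The steps above can be sketched in a few lines of Python. This is only an illustration, not our actual tooling: the ChangeRecord class, the gate names, and the example steps are all hypothetical. It shows two ideas from the process: a change cannot run until the technical assessment, peer review, and CAB approval have been recorded in that order, and each step is logged and must succeed before the next one starts.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ChangeRecord:
    """Hypothetical change record: approvals plus a log of executed steps."""
    summary: str
    approvals: List[str] = field(default_factory=list)
    log: List[str] = field(default_factory=list)


# Assumed gate names, in the order the text describes them.
REQUIRED_APPROVALS = ["technical assessment", "peer review", "CAB approval"]


def approve(record: ChangeRecord, gate: str) -> None:
    """Record an approval, refusing any gate that arrives out of order."""
    if len(record.approvals) >= len(REQUIRED_APPROVALS):
        raise ValueError("change is already fully approved")
    expected = REQUIRED_APPROVALS[len(record.approvals)]
    if gate != expected:
        raise ValueError(f"expected '{expected}' before '{gate}'")
    record.approvals.append(gate)


def run_change(record: ChangeRecord,
               steps: List[Tuple[str, Callable[[], bool]]]) -> bool:
    """Run steps in order, logging each result and stopping at the first failure."""
    if record.approvals != REQUIRED_APPROVALS:
        raise PermissionError("change is not fully approved")
    for name, action in steps:
        ok = action()
        record.log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            return False  # later steps never run
    return True


# Illustrative usage with made-up steps; the second step simulates a failure.
record = ChangeRecord("Update router configuration")
for gate in REQUIRED_APPROVALS:
    approve(record, gate)

steps = [
    ("back up current config", lambda: True),
    ("apply new config", lambda: False),
    ("remove backup", lambda: True),
]
completed = run_change(record, steps)
```

Because the run stops at the failed step, the log shows exactly where the problem occurred, and the backup is never removed.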

We believe in transparency through communication and accountability. First, we tell everyone who needs to know what is going to happen and what the possible outcomes are, sharing any positive or negative issues along the way. Second, we follow strict protocols, using change records and a runbook, so you can see exactly where a problem lies.

But that doesn’t mean we run a blame culture. Instead, we make sure learning comes from every failure. It’s a bit like airline pilots, who constantly report issues while flying and strive for continuous improvement. It’s the same culture in IT.

Routine Maintenance

Routine maintenance seems to have been the initial cause of the Facebook outage, and it is easy to see how it could have escalated.

Initially, a change that goes on to become routine goes through the three full change procedures. If there are no problems, it is then classed as low risk, with no change procedure needed.

The Facebook outage shows that monitoring and a continued change procedure can be worth it even on low-risk jobs.

What Happens if there is a Major Incident? 

Here at IT Naturally, we always follow through with a major incident review. We would need to find out: what was the root cause of the problem?

What happened next? 

Was the change communicated? 

And many more questions. 

We would check the systems, policies, and communication and make changes so it doesn’t happen again.

It’s impossible to prepare yourself for every eventuality, but a good IT Change Management system can help you control risks and keep disruption to a minimum.  

Talk to IT Naturally today to find out how we can work with your IT department and minimise risk to your business.
