Sunday, February 07, 2010

Backup and Recovery Considerations

Or – Why Murphy was an Optimist

Every IT system is subject to failure. When things go wrong, we try to recover, usually by using a backup copy that we took just in case. This post deals with some of the considerations that one should keep in mind when planning for backup and recovery. It doesn't attempt to cover the details of the backups that are required for any specific IT technology. They are so diverse, and in many cases so complex, that it's impossible for a blog post to cover the ground. Instead, it deals with the broader question of how you should plan for backup and recovery.

Along the way, we also take a sidelong look at Murphy's Law, which as we all know states:
Anything that can go wrong, will go wrong.
You may also have heard the claim that Murphy was an optimist. That sounds absurd, given the totally bleak outlook of Murphy's law, but in fact it's perfectly true, for reasons that everyone who is responsible for planning backup and recovery needs to know. We'll deal with that issue too.

Planning Backup and Recovery

Backup and recovery doesn't just happen – especially the recovery part. If you want to be able to recover from your backups when things go wrong, you have to plan carefully. There are three major variations in the way that you can do the planning. I call them foresight, on-demand, and hindsight.

Foresight Planning

This is the hard way of doing the job. Before anything goes wrong, you sit down and think carefully about all the things that could possibly go wrong, and then you apply Murphy's rule and assume that if something can go wrong, it will go wrong sooner or later. You then work out what backups and other evasive action you will need to take in order to recover from each scenario when it goes wrong. This is doing the hard yards in advance, and there are several reasons why it's difficult and rather unrewarding to do:
  • Few other people in your organization will agree with you that your worst-case scenarios could come about. "Ah, don't worry about it, it will never happen!" is a phrase that you will get sick of hearing. It's hard to get the cooperation that you need from other folk with this mode of preparation.
  • If you do a really good job and handle the catastrophe when it happens without even breaking a sweat, no one will even know what a good job you did. They'll all think that your particular patch is a really easy one to handle.
  • In particular, your boss won't know what a valuable contribution you make to his peaceful sleep till you move on to better things and someone less well prepared than you takes over.

On-Demand Planning

This is a riskier way of handling the problem. You wait for something to go wrong, and when it does, you size up the situation and then make a brilliant, intuitive move to fix it all in one stroke. Well, maybe not one stroke – would you believe two strokes, or maybe three? This approach has a number of advantages, but also some potential downsides:
  • You don't have to spend so much time persuading other folks to cooperate with you on solving the problem, or even persuading them that it can happen, because you only spring into action once it has happened.
  • You don't have to worry that your hard work will go unappreciated. Your boss will probably arrive at your desk at some time, and maybe his boss too if you can't fix the problem quickly. They may offer you a lot of helpful suggestions on approaches that you could take, and perhaps a few precautions that you could and should have taken beforehand. They may also suggest a lot of interesting new career moves that you could make if things don't work out quickly.

Hindsight Planning

Hindsight planning is popularly carried out by folk between about 1AM and 4AM in a darkened room, staring at the ceiling. There are no distractions, so you can really focus on the job and do it right.

The big advantage of hindsight planning is that it is the most accurate and economical way of getting the job done. With the benefit of hindsight, you know exactly what backups you should have taken when, and exactly what recovery procedures you should have undertaken when things went wrong. Next time, you'll know exactly how to handle the situation. If there is a next time.

The big drawback of hindsight planning is that you may never get an opportunity to put your hard-won wisdom into practice. You may be following a new career, one of those recommended to you by your manager.

Which Planning Strategy is Right for You?

We all know one size doesn't really fit all, so which of the three strategies is right for you? Probably a mixture of all three. There is a limit to the patience and cooperation that you can expect from your co-workers, so if you go flat out for the foresight model, you will irritate your colleagues. They will warn you off, and if you persist and really get up their noses, they'll go out of their way to sabotage you.

The on-demand model is risky, but also offers rich returns if you pull it off. Everyone knows that you pulled it off, and they will all think that you're a hero. Well, most of them. There are always a few malcontents. Because of its high-risk elements, you may want to mix in a large element of the foresight method. Prepare for the problem, but when it happens, be sure to spend some time telling everyone how unlikely it was that this problem should ever occur. Let the tension build, and once you have your audience, tell them that there is one maneuver that might just save their bacon, then pull your ace from the bottom of the pack.

No matter how smart you are, there are going to be times when something slips through the gaps, and you will find yourself in hindsight mode. Provided you have more successes than failures, you will likely get away with it. Your management will know that no one else is going to achieve a 100% success rate, especially no one who's prepared to work for peanuts like you earn. Cover your back in advance by always asking (in writing) for a few more backups than you actually need, and keep track of the ones that don't get taken. Don't complain about the missed backups, just keep a note of them in your back pocket. Then when you find that you have totally run out of options and your back is to the wall, announce that you can save the situation by using one of the backups that you know isn't there. With horror, and an audience, you discover that it isn't there, and you pass the buck to some other poor sucker. Just don't pull this stunt too often, or your colleagues will get wise to you, and sabotage you out of a job in self-defense.

Murphy Revisited

So getting back to Murphy, just why was he an optimist, and what does it have to do with you? Look carefully at his law:
Anything that can go wrong, will go wrong.
There's an unspoken assumption in this statement – that if something can't go wrong, it won't. This turns out to be a really dangerous assumption, as anyone who has scars on their back from years of backup and attempted recovery will know. I would like to offer you my own rule, which is particularly relevant for folk who are responsible for planning and executing backup and recovery:
The most difficult problems to recover from are those that you knew could never happen.
There are two reasons why this is true.

Firstly, when they do happen, you simply can't bring yourself to accept that they really did happen. You waste precious minutes or hours staring at the forensic evidence in stunned disbelief. While you do, the few precious opportunities that you might have had to escape from your doom slip by, unnoticed.

Secondly, once you are able to accept the fact that the unthinkable has happened, you have absolutely no plan in mind to recover from it. You sit there staring at a blank page, with a ring of anxious faces surrounding you, looking at you with gradually fading hope.

The moral of this story is:

Always Have a Plan

Think through everything that could go wrong in advance. Brainstorm with other people and get their ideas too. If they're too busy and preoccupied with their own problems to help out, invest in a few drinks after work some time and, once they have started to chill, entertain them with a few horror stories about things that went wrong at other places. Chuckle with them over other people's misfortunes. Don't tell them that in some of these stories, you were the fall guy – keep it light, keep them laughing. After a while their creative juices will start to flow, and they'll reward you with a few horror scenarios of their own. Take discreet notes.

Once you have an inventory of potential problems, categorize them by probability and impact, as best you can. Remember, probability is never zero. If you don't believe me, ask Heisenberg. Once you have the list, start working on ways in which you could potentially recover should a given problem arise. You will not be able to develop a 100% recovery strategy for 100% of the problems, because life is too short, so you have to prioritize. Develop detailed plans for the more likely, bigger impact disasters on your list, and less detailed plans for the less likely or less impactful items. Your goal is not perfection, it's survival. For the raft of problems that seem too bizarre to happen, you can develop one simple, common solution – like keeping $2,000 and your passport in a plain brown envelope in your bottom drawer.
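
For those who like their triage written down, the ranking can be sketched in a few lines of Python. Treat it as an illustration only: the scenario names, the probability and impact figures, and the cut-off of two detailed plans are all invented for the example, and multiplying probability by impact is just one common way of scoring risks, not a rule laid down here.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    probability: float  # rough yearly likelihood, 0 < p <= 1 (hypothetical figures)
    impact: int         # rough damage, 1 (annoying) to 10 (career-ending)

    @property
    def score(self) -> float:
        # Assumption: a simple probability x impact product is good enough for ranking.
        return self.probability * self.impact


def prioritize(scenarios, detailed_slots=2):
    """Rank scenarios by score; the top few get detailed plans, the rest outlines."""
    for s in scenarios:
        if s.probability <= 0:
            raise ValueError(f"{s.name}: probability is never zero")
    ranked = sorted(scenarios, key=lambda s: s.score, reverse=True)
    return [
        (s.name, "detailed plan" if i < detailed_slots else "outline plan")
        for i, s in enumerate(ranked)
    ]


# Hypothetical inventory gathered over those after-work drinks.
inventory = [
    Scenario("disk failure on database server", 0.30, 6),
    Scenario("accidental table drop", 0.20, 8),
    Scenario("data centre flood", 0.01, 10),
    Scenario("backup tape unreadable", 0.10, 9),
]

plan = prioritize(inventory)
for name, level in plan:
    print(f"{level:13}  {name}")
```

The check on zero probability is deliberate: a scenario that "can't happen" still has to go on the list with some small number against it.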

How Many Backups Do You Need?

Just to pick up on one common problem that you are pretty sure to encounter – you can't do recovery without backups. The more problem scenarios you develop and solutions you create to handle them, the more backups you are likely to need.

Sooner or later a delegation will come to your desk and tell you that you're taking too many backups, burning too much valuable removable media and/or bandwidth to backup sites, and too much processor time taking backups. They may point out, for example, that you still take daily backups of a database that is no longer needed, because the company sold off the division that used that database two years ago. Don't admit that this small fact had slipped your notice. You need to do some homework in advance. Find out the major laws that apply to the retention of financial records in your country, such as the Sarbanes-Oxley Act in the USA, and quote some of the penalties that may be imposed if these laws are broken. Point out that you wouldn't be backing up the data if it weren't still online, occupying valuable disk space that is denied to you. Then mention that some guy who used to work with the system, but who has since left the company (and ideally the country too), once told you that other systems still in operation post updates to the supposedly discontinued system, and that these have to be journalled to comply with record retention legislation. By now your accusers will be shuffling their feet and glancing sideways at one another. At this stage, throw them a fig-leaf so they can beat a hasty retreat from your desk without feeling that they have come away completely empty-handed. Offer to take the backups only once a week instead of daily.

You must always take complaints of this sort seriously, and make time to have a detailed discussion with your accuser about the various backups that you are taking, why you're taking them (better keep notes, or you'll forget yourself), and whether the need is still as great as it seemed when the backup was first put into the roster.
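
Keeping those notes can be as simple as a small roster that records why each backup exists and when the reason was last checked. The sketch below is one hypothetical way to do it in Python. The backup names, reasons, dates, and the one-year review interval are all made up for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RosterEntry:
    backup: str          # what is being backed up, and how often
    reason: str          # why it went on the roster in the first place
    last_reviewed: date  # when someone last confirmed the reason still holds


def due_for_review(roster, today, max_age_days=365):
    """Return the backups whose justification has not been checked recently."""
    return [
        entry.backup
        for entry in roster
        if (today - entry.last_reviewed).days > max_age_days
    ]


# Hypothetical roster; the second entry is the kind a delegation will ask about.
roster = [
    RosterEntry("orders_db daily", "live order system", date(2009, 11, 2)),
    RosterEntry("old_division_db daily", "financial record retention", date(2007, 6, 15)),
]

stale = due_for_review(roster, today=date(2010, 2, 7))
print(stale)
```

Running a check like this before the delegation arrives means you find the forgotten database first.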

But as you sit down to have this conversation, be sure to kick off an extra, unscheduled backup to cover the time that you will be busy in the meeting and unable to monitor the systems. Just in case.