To run a successful data center, you need a disaster recovery plan. In other words: what do you do when something major occurs and you experience a site failure?
Everyone plans for high availability: the failure of a single component within the data center, making sure it doesn’t affect the user experience, and keeping everything running. But what about the loss of an entire site?
Building a disaster recovery plan is like buying life insurance. With the purchase of life insurance you’re betting you will pass away before you will have paid for the policy, and the insurance company is betting you will not.
Guess what? They win more often than not, but that risk is one you have to take to ensure your family’s security. A disaster recovery plan is the same bet: you are spending money on the chance that something major will occur in your data center, so that you are prepared to deal with it when it does.
There are costs involved, and you have to consider how quickly you must be back up and running. One possibility is to create two sites, both fully active and each fully capable of supporting the other in the event of a disaster. From a capital and operational standpoint, this can be very expensive: enough physical hardware at each site to run both sites, duplication of everything. For the future of your business you must create a solid disaster recovery plan, but what does that plan require?
What does it entail, and what must be considered? Which applications and associated data are mission critical to the business? Which users must be back up and functional as quickly as possible? A Web-based sales operation would want its web sites and databases online as quickly as possible; a call center would want its applications, databases, and users online first. In a medium-sized company, it would be a stretch to assume everyone needs to be back immediately.
There is no exact science behind building a disaster recovery plan; in my mind, there is no single “this is the best way” to accomplish one.
There are things to consider, processes to follow, and careful discussions to have with your storage vendor about how they move data between sites. When looking at disaster recovery, you must consider the need to move data between sites: how much data, and what the time requirements are for copying it.
If I have X amount of mission-critical data (user and application) that must transfer between the two sites, and the transfer must finish within a defined time, what size pipe do I need, and how much can the storage vendor actually move over that pipe? Can the storage meet the requirement? This, to me, is the storage vendor’s story: can they transfer the data, keep the changed data moving, and keep both sites up to date? How are the LUNs and volumes defined?
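As a back-of-the-envelope check, the pipe-sizing question above can be sketched in a few lines. The figures below (data volume, transfer window, link efficiency) are purely illustrative assumptions, not numbers from any vendor or from our project:

```python
# Rough sizing of a replication link: given a data volume and a
# transfer window, estimate the sustained bandwidth required.
# All inputs here are illustrative assumptions.

def required_mbps(data_gb: float, window_hours: float, efficiency: float = 0.7) -> float:
    """Sustained link speed (megabits/s) needed to move data_gb
    gigabytes within window_hours, assuming the replication
    protocol achieves only `efficiency` of the raw line rate."""
    megabits = data_gb * 8 * 1000          # gigabytes -> megabits (decimal units)
    usable_seconds = window_hours * efficiency * 3600
    return megabits / usable_seconds

# Example: 5 TB of changed data must replicate within an 8-hour window.
print(round(required_mbps(5000, 8), 1))    # -> 1984.1
```

Running the same math in reverse, against the throughput your storage vendor says they can sustain on that link, tells you whether the replication can actually keep the remote site current.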
As the manager of an engineering team focused on building solutions, I issued a challenge to a group of my engineers. I created a fictitious company, defined a set of requirements and applications, and asked them to build a disaster recovery plan, covering both HA and full-site failure, to meet my defined requirements. The result of this work is now published here.
Did they get it right? You tell me. Their solution did have a few issues that had to be dealt with after the fact, and we documented a few things they did not consider at the beginning but had to address at the end. We focused on the Citrix components, applications, and users. We partnered with EMC for the storage and let them tell us the best approach, from their perspective, to handle our requirements. The specific storage hardware configuration can be obtained from EMC, so we let them configure it according to their best practices.
We did not automate anything in the solution. My team and I believed human interaction was necessary to ensure a successful failover. Parts and pieces could be automated, but in our opinion, removing human intervention increases the potential for the failover itself to fail.
And what if we took the disaster recovery site to the cloud to reduce capital costs? Watch this site and docs.citrix.com for more.