Read Windows Server 2008 R2 Unleashed Online
Authors: Noel Morimoto
failure or trying to recover from a disaster, especially in a stressful situation, making
changes without getting approval can lead to costly mistakes. Following the proper
change-control and emergency change-control processes to inform and involve others,
getting approval from management, and following documented processes will provide
accountability and might even save the administrator’s job.
Disaster Recovery Delegation of Responsibilities
At this point, the organization might have a documented and functional backup and
recovery plan, a PMO, and a change-control committee, but the ownership and maintenance
of disaster recovery operations are not yet defined or assigned. Disaster recovery
roles, functions, or responsibilities might be wrapped up into an existing executive’s or
manager’s duties or a dedicated staff member might be required. Commonly, disaster
recovery responsibilities are owned by the chief information officer, operations manager,
chief information security officer, or a combination of these positions. Of course,
responsibilities for different aspects of the overall disaster recovery plan are delegated to
managers, departmental leads, and staff volunteers as necessary. An example of delegating disaster
recovery responsibilities is contained in the following list:
. The chief information officer is responsible for disaster recovery planning and
maintaining and executing disaster recovery-related tasks for the entire telecom, desktop
and server computer infrastructure, network infrastructure, and all other electronic
and fax-related communication.
. The manager of facilities or operations is responsible for planning alternate office
locations and offsite storage of originals or duplicates of all important paper documents,
such as leases, contracts, insurance policies, stock certificates, and so on, to
support disaster recovery operations at alternate sites or offices.
. The manager of human resources is responsible for creating and maintaining emergency
contact numbers for the entire company, storing this information offsite, and
communicating with employees to provide direction and information prior to disasters
striking and during a disaster recovery operation.
The list of responsibilities can be very granular and extensive, and disaster recovery
planning should not be taken lightly or put on the back burner. Although there are many
aspects of disaster recovery planning, the remainder of this chapter focuses only on the
disaster recovery responsibilities and tasks that should be assigned to qualified Windows
administrators who need to support a Windows Server 2008 R2 environment.
Achieving 99.999% Uptime Using Windows Server 2008 R2
When the topic of disaster recovery comes up, many people think of the phrase “five
nines” or “99.999% uptime.” Although understanding this concept is reasonably simple,
actually providing five nines for a server or a network can be quite a large and expensive
task. Achieving 99.999% uptime means that the server, application, network, or whatever
is supposed to have this amount of uptime can only be down for just over five minutes
per year. Having such success is quite a claim to make, so administrators should make it
with caution and document it, citing explicitly what this service depends on. For example,
if a power failure occurs and the battery backups will last only two hours, a dependency
for a server could be that if a power outage occurs, it can withstand up to two hours
without power.
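The "just over five minutes per year" figure follows directly from the percentage. The short Python sketch below works through the arithmetic for a few common uptime targets. It assumes a 365-day year, and the function name is purely illustrative.

```python
# Minutes in a non-leap year (assumption: 365 days).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows {minutes:.2f} minutes of downtime per year")
```

For 99.999%, the result is about 5.26 minutes per year, which is why any documented dependency, such as the two-hour battery limit in the preceding example, needs to be stated alongside the uptime claim.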
To provide 99.999% uptime for services available on Windows Server 2008 R2, administrators
can build in redundancy and replication on a data, service, server, or site level. Many
Windows Server 2008 R2 services outlined in other chapters of this book, including
Failover Clusters, Network Load Balancing, and the Distributed File System, can provide
redundancy for the specific services available.
When Disasters Strike
When a failure or disaster strikes, not only having but also following a disaster
recovery plan is most important. Having a procedure or checklist to follow allows all
involved parties to be on the same page and understand what steps are being taken to
rectify the situation. The following sections detail steps that can be followed to ensure
that no time is wasted and resources are not being led in the wrong direction.
Qualifying the Disaster or Failure
When a system failure occurs or a system is reported as failed, the information can come
from a number of different sources and should be verified. The reported issue can be caused by
user or operator error, network connectivity, or a problem with a specific user account
configuration or status. A reported system failure should be verified by repeating
the same steps described by the reporting party.
If the system is, in fact, in a failed state, the impact of the failure should be noted, and
this information should be escalated within the organization so that a formal recovery
plan can be created. This is known as qualifying the disaster or failure. An example of
qualifying a failure includes a short description of the failure, the steps used to validate
the failure, who is affected, how many end users are affected, which dependent applications
or systems are affected, which branch offices are affected, and who is responsible for
the maintenance and recovery of this system.
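The items listed above can be captured in a simple record so that nothing is forgotten before the failure is escalated. The following Python sketch is only one possible layout. The class name, field names, and sample values are all hypothetical and are not taken from any Windows tool.

```python
from dataclasses import dataclass, field, fields


@dataclass
class FailureQualification:
    """One record per reported failure; field names are illustrative only."""
    description: str
    validation_steps: list[str]
    affected_groups: list[str]
    affected_user_count: int
    dependent_systems: list[str] = field(default_factory=list)
    affected_branch_offices: list[str] = field(default_factory=list)
    recovery_owner: str = ""

    def missing_items(self) -> list[str]:
        """Return the names of text or list fields that are still empty."""
        return [f.name for f in fields(self) if getattr(self, f.name) in ("", [])]


# Hypothetical sample report, filled in only partially.
report = FailureQualification(
    description="Intranet web server does not respond",
    validation_steps=["Browsed to the intranet home page from two workstations"],
    affected_groups=["Finance"],
    affected_user_count=40,
)
print(report.missing_items())
```

A record that still reports missing items is a sign that the failure has not been fully qualified and that more information should be gathered before a formal recovery plan is drafted.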
CHAPTER 31
Recovering from a Disaster
Validating Priorities
When a disaster strikes that affects an entire server room or office location, the priority of
restoring systems and operations should already be determined. First and foremost are the
core infrastructure systems, such as networking and power, followed by authentication
systems, and the remaining core bare minimum services. In the event of a failure that
involves multiple systems (for example, the failure of a web server that supports 10 separate
applications), the priority of recovery should be presented to and approved by management.
If each of these 10 applications takes 30 minutes to recover, it could be 5 hours before the
system is fully functional, but if one particular application is critical to business operations,
this application should be recovered first. Always perform checkpoints and verification
to ensure that the priorities of the organization are in line with the recovery work
that is being performed.
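The effect of recovery order in the 10-application example can be shown with a few lines of Python. This is a simplified sketch that assumes the applications are restored one after another at 30 minutes each. The application names and the choice of which one is critical are made up for illustration.

```python
RECOVERY_MINUTES = 30  # per application, from the example above

# Hypothetical application names: app01 through app10.
applications = [f"app{n:02d}" for n in range(1, 11)]


def completion_times(order, minutes_each=RECOVERY_MINUTES):
    """Map each application to the minute at which its recovery finishes."""
    return {name: (position + 1) * minutes_each for position, name in enumerate(order)}


def prioritize(order, critical):
    """Move the business-critical application to the front of the queue."""
    return [critical] + [name for name in order if name != critical]


default_plan = completion_times(applications)
approved_plan = completion_times(prioritize(applications, "app10"))

print(default_plan["app10"])         # 300 minutes if restored last
print(approved_plan["app10"])        # 30 minutes if restored first
print(max(approved_plan.values()))   # total recovery time is still 300 minutes
```

Reordering does not shorten the overall recovery, but it changes how long the business waits for the application it needs most, which is why the order should be approved by management.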
Assume and Be Doomed
Disasters, system failures, and data corruption issues tend to create a lot of stress and havoc
among technical business personnel. Recovery administrators and managers should always
be on the same page regarding the priority of recovery and the process. Also, get this
communication in paper or electronic format because it might be required later to justify
why a choice was made. Those administrators who decide to move forward on resolving
an issue based on assumptions and not by first communicating with their managers might
find themselves in a very sticky situation, especially if the results of their actions prove to
be unsuccessful or end up causing more problems.
Synchronizing with Business Owners
Prioritizing the recovery of critical and bare minimum business systems is part of disaster
recovery planning. When a situation strikes that requires an entire data center or group of
systems to be restored or recovered, the steps that will be followed need to be put in
front of the business owners again. Please remember that between the time a disaster
recovery plan is created and the time the failure occurs, business priorities might have
shifted and the business owners might be the only ones aware of this change. During a
recovery situation, always take the time to stay calm and focused and communicate with
the managers, executives, and business owners so that they can be informed of the
progress. An informed business owner is less likely to stay in the server room or data
center if they feel that recovery efforts are in good hands.
Communicating with Vendors and Staff
When failures or disasters strike, communication is key. Regardless of whether customers,
vendors, employees, or executives are affected, some level of communication is required or
suggested. This is where the soft skills of an experienced manager, sales executive, technical
consultant, and possibly even lawyers can be most valuable. Providing too much information,
information that is too technical, or, worst of all, incorrect or no information, is a
mistake technical staff frequently make. My recommendation to technical staff is to
communicate only with your direct manager or, if your manager is not available, with his or
her boss. If the
CEO or an end user asks for an update, try to defer to the manager as best you can, so that
focus can be kept on restoring services.
Assigning Tasks and Scheduling Resources
The situation is that we have a failure, we have an approved plan, we have communicated
the situation, and we are ready to begin fixing the issue. The next step is to delegate the
specific tasks to the qualified staff members for execution. As stated previously, hand off
communication to a manager or spokesperson and only communicate through them if
possible. Determining who will restore a particular system is as important as, if not more
important than, assigning communication responsibilities. Only certain technical staff
members might be qualified to restore a system, so selecting the correct resource is essential.
When a serious failure has occurred, recovery efforts might require multiple technical
resources onsite for an extended period of time. Furthermore, there might be dependencies
that affect which systems can be restored, and, of course, the order or priority of
restore will advance or delay the recovery of a system. Mapping out the extended recovery
timeline and technical resource scheduling ensures that a technical resource is not onsite
until their skills and time are required. Also, rotating technical resources after six to eight
hours of time helps to keep progress moving forward.
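Dependencies between systems can be mapped out before staff are scheduled. The Python sketch below uses the standard library's graphlib module (Python 3.9 or later) to derive a restore order from a dependency map. The systems and their dependencies are assumptions chosen for illustration, loosely following the earlier guidance that power and networking come first, followed by authentication.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each system lists what must be restored before it.
dependencies = {
    "power": set(),
    "network": {"power"},
    "authentication": {"network"},
    "file services": {"authentication"},
    "web applications": {"authentication", "file services"},
}

# static_order() yields every system after all of the systems it depends on.
restore_order = list(TopologicalSorter(dependencies).static_order())

for step, system in enumerate(restore_order, start=1):
    print(f"{step}. {system}")
```

With an order like this in hand, the technical resource responsible for a later system does not need to be called onsite until the systems ahead of it are close to being restored.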
Keeping the Troops Happy
This section goes out to all technical leads, project managers, IT managers, business
owners, and executives. If you have technical resources working for you in an effort to
recover from a failure, you should do all you can to ensure that these technical resources
are kept happy and focused. For starters, try to keep the end users and any other business
owners or executives from bothering this staff. Regular communication will help with this
task tremendously. Next, and possibly more important, provide all the bottled water, soda,
coffee, snacks, food, breaks, and anything else that will keep these professionals happy,
healthy, and focused on the task at hand. Technical staff will work very hard during disaster
situations, so don’t forget to pat them on the back and let them know how much the
organization and you personally appreciate their time and commitment.
Recovering the Infrastructure
After the failure has been validated, the initial communications meetings have been held,
restore tasks have been confirmed and possibly reprioritized, and recovery task assignment
of resources has been completed, the recovery efforts can finally begin. Verify that each
technical resource has all the documentation, phone numbers, software, and hardware
they require to perform their task. Hold periodic checkpoint meetings, starting every 15
minutes and tapering off to every 30 or 60 minutes as recovery efforts continue.
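A tapering checkpoint schedule of this kind can be generated ahead of time and handed to the recovery team. The Python sketch below is only an example. The points at which the interval changes (after the first hour and after the third hour) are assumptions, because the text does not say when to taper.

```python
def checkpoint_offsets(total_minutes, first_phase_end=60, second_phase_end=180):
    """Return checkpoint times as minutes elapsed since recovery began.

    Assumed phases: every 15 minutes during the first hour, every 30 minutes
    until the end of the third hour, and every 60 minutes after that.
    """
    offsets = []
    elapsed = 0
    while True:
        if elapsed < first_phase_end:
            step = 15
        elif elapsed < second_phase_end:
            step = 30
        else:
            step = 60
        elapsed += step
        if elapsed > total_minutes:
            break
        offsets.append(elapsed)
    return offsets


# Schedule for a recovery effort expected to last five hours.
print(checkpoint_offsets(300))
```

The phase boundaries are parameters so that they can be adjusted to match how quickly a particular recovery effort stabilizes.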
Postmortem Meeting
After a system failure or disaster strikes, and the recovery has been completed, an organization
should hold a meeting to review the entire process. The meeting might just be an
event where individuals are recognized for their great work; however, the meeting will
most likely involve reviewing what went wrong and identifying how the process could be
improved in the future. A lot of interesting things will happen during disaster recovery
situations—both unplanned and simulated—and this meeting can provide the catalyst for
ongoing improvement of the processes and documentation.
Disaster Scenario Troubleshooting
This section of the chapter details the high-level steps that can be taken to recover from
particular types of disaster scenarios. Because this book and chapter focus on Windows Server
2008 R2 environments, so do the following sections.
Network Outage
When an organization is faced with a network outage, the impact can affect a small set of
users, an entire office, or the entire company. When a network outage occurs, the network
administrators should perform the following tasks:
. Test the reported outage to verify whether the issue is related to a wide area network (WAN)
connection between the organization and the Internet service provider (ISP), the
router, a network switch, a firewall, a physical fiber or copper network connection or
network port, or line power to any of the aforementioned devices.