Windows Server 2008 R2 Unleashed (251 page)

failure or trying to recover from a disaster, especially in a stressful situation, making

changes without getting approval can lead to costly mistakes. Following the proper

change-control and emergency change-control processes to inform and involve others,

getting approval from management, and following documented processes will provide

accountability and might even save the administrator’s job.

Disaster Recovery Delegation of Responsibilities

At this point, the organization might have a documented and functional backup and

recovery plan, a PMO, and a change-control committee, but the ownership and mainte-

nance of disaster recovery operations is not yet defined or assigned. Disaster recovery

roles, functions, or responsibilities might be wrapped up into an existing executive’s or

manager’s duties or a dedicated staff member might be required. Commonly, disaster

recovery responsibilities are owned by the chief information officer, operations manager,

chief information security officer, or a combination of these positions. Of course, responsi-

bilities for different aspects of the overall disaster recovery plan are delegated to managers,

departmental leads, and staff volunteers as necessary. An example of delegating disaster

recovery responsibilities is contained in the following list:

. The chief information officer is responsible for disaster recovery planning and main-

taining and executing disaster recovery-related tasks for the entire telecom, desktop

and server computer infrastructure, network infrastructure, and all other electronic

and fax-related communication.

. The manager of facilities or operations is responsible for planning alternate office

locations and offsite storage of originals or duplicates of all important paper

documents, such as leases, contracts, insurance policies, stock certificates, and so on, to

support disaster recovery operations to alternate sites or offices.

. The manager of human resources is responsible for creating and maintaining emer-

gency contact numbers for the entire company, storing this information offsite, and

communicating with employees to provide direction and information prior to disas-

ters striking and during a disaster recovery operation.

The list of responsibilities can be very granular and extensive, and disaster recovery

planning should not be taken lightly or put on the back burner. Although there are many

aspects of disaster recovery planning, the remainder of this chapter focuses only on the

disaster recovery responsibilities and tasks that should be assigned to qualified Windows

administrators who need to support a Windows Server 2008 R2 environment.

Achieving 99.999% Uptime Using Windows Server 2008 R2

When the topic of disaster recovery comes up, many people think of the phrase “five

nines” or “99.999% uptime.” Although understanding this concept is reasonably simple,

actually providing five nines for a server or a network can be quite a large and expensive

task. Achieving 99.999% uptime means that the server, application, network, or whatever

is supposed to have this amount of uptime can only be down for just over five minutes

per year. Having such success is quite a claim to make, so administrators should make it

with caution and document it, citing explicitly what this service depends on. For example,

if the battery backups will last only two hours, a documented dependency for a server

could be that it can withstand a power outage of up to two hours.
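
The "just over five minutes" figure can be checked with a little arithmetic. The following is a minimal sketch, assuming a 365-day year; the function name is illustrative and not part of any Windows tool.

```python
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent):
    """Minutes per year a service may be down and still meet the target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Compare common availability targets, from "two nines" to "five nines".
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.2f} minutes of downtime per year")
```

Each additional nine cuts the downtime budget by a factor of ten, which is why every dependency behind a five-nines claim should be documented.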

To provide 99.999% uptime for services available on Windows Server 2008 R2, administra-

tors can build in redundancy and replication on a data, service, server, or site level. Many

Windows Server 2008 R2 services outlined in other chapters of this book, including

Failover Clusters, Network Load Balancing, and the Distributed File System, can provide

redundancy for the specific services available.

When Disasters Strike

Having a disaster recovery plan matters most when a failure or disaster strikes, and following

it matters just as much. Having a procedure or checklist to follow allows all

involved parties to be on the same page and understand what steps are being taken to

rectify the situation. The following sections detail steps that can be followed to ensure

that no time is wasted and resources are not being led in the wrong direction.

Qualifying the Disaster or Failure

When a system failure occurs or is reported as failed, the information can come from a

number of different sources and should be verified. The reported issue can be caused by

user or operator error, network connectivity, or a problem with a specific user account

configuration or status. A reported system failure should be verified as failed by perform-

ing the same steps reported by the reporting party.

If the system is, in fact, in a failed state, the impact of the failure should be noted, and

this information should be escalated within the organization so that a formal recovery

plan can be created. This process is known as qualifying the disaster or failure. An example of

qualifying a failure includes a short description of the failure, the steps used to validate

the failure, who is affected, how many end users are affected, which dependent applica-

tions or systems are affected, which branch offices are affected, and who is responsible for

the maintenance and recovery of this system.
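
The items in a qualification can be captured in a simple record so that nothing is left out before escalation. The following is a minimal sketch; the class name, field names, and sample values are hypothetical and only mirror the items listed above.

```python
from dataclasses import dataclass, field


@dataclass
class FailureQualification:
    """Hypothetical record mirroring the qualification items in the text."""
    description: str
    validation_steps: list
    affected_groups: list
    affected_user_count: int
    dependent_systems: list = field(default_factory=list)
    affected_branch_offices: list = field(default_factory=list)
    recovery_owner: str = ""

    def missing_fields(self):
        """Names of fields that are still empty and should be filled in before escalating."""
        return [name for name, value in vars(self).items() if value in ("", [], None)]


# Illustrative values only.
report = FailureQualification(
    description="File server does not respond to share requests",
    validation_steps=["Repeated the reported steps from a second workstation"],
    affected_groups=["Accounting"],
    affected_user_count=40,
)
print(report.missing_fields())
```

A record with empty fields is a prompt to gather more information, not a reason to delay escalating a verified failure.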

CHAPTER 31

Recovering from a Disaster

Validating Priorities

When a disaster strikes that affects an entire server room or office location, the priority of

restoring systems and operations should already be determined. First and foremost are the

core infrastructure systems, such as networking and power, followed by authentication

systems, and the remaining core bare minimum services. In the event of a failure that

involves multiple systems (for example, the failure of a web server that supports 10 separate

applications), the priority of recovery should be presented and approved by management.

If each of these 10 applications takes 30 minutes to recover, it could be 5 hours before the

system is fully functional, but if one particular application is critical to business opera-

tions, this application should be recovered first. Always perform checkpoints and verifica-

tion to ensure that the priorities of the organization are in line with the recovery work

that is being performed.
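
The effect of recovery order on the critical application can be worked through in a few lines. The following is a minimal sketch, assuming applications are recovered one at a time and each takes the same 30 minutes; the application names and priority numbers are hypothetical.

```python
def recovery_schedule(apps, minutes_each=30):
    """Order apps by priority (1 = most critical) and return (name, minutes until restored).

    Assumes sequential recovery with the same duration for every application.
    """
    ordered = sorted(apps, key=lambda app: app[1])
    return [(name, (i + 1) * minutes_each) for i, (name, _priority) in enumerate(ordered)]


# Ten hypothetical applications; "order-entry" is the business-critical one.
apps = [(f"app-{n}", 2) for n in range(1, 10)] + [("order-entry", 1)]

for name, minutes in recovery_schedule(apps):
    print(f"{name}: restored after {minutes} minutes")
```

Under these assumptions the critical application is back after 30 minutes rather than waiting up to 5 hours at the end of the queue, while total recovery time is unchanged.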

Assume and Be Doomed

Disasters, system failures, and data corruption issues tend to create a lot of stress and havoc

among technical and business personnel. Recovery administrators and managers should always

be on the same page regarding the priority of recovery and the process. Also, get this

communication in paper or electronic format because it might be required later to justify

why a choice was made. Those administrators who decide to move forward on resolving

an issue based on assumptions and not by first communicating with their managers might

find themselves in a very sticky situation, especially if the results of their actions prove to

be unsuccessful or end up causing more problems.

Synchronizing with Business Owners

Prioritizing the recovery of critical and bare minimum business systems is part of disaster

recovery planning. When a situation arises that requires an entire data center or group of

systems to be restored or recovered, the planned recovery steps need to be reviewed with

the business owners again. Remember that between the time a disaster

recovery plan is created and the time the failure occurs, business priorities might have

shifted and the business owners might be the only ones aware of this change. During a

recovery situation, always take the time to stay calm and focused and communicate with

the managers, executives, and business owners so that they can be informed of the

progress. Business owners who are kept informed and feel that recovery efforts are in good

hands are less likely to linger in the server room or data center.

Communicating with Vendors and Staff

When failures or disasters strike, communication is key. Regardless of whether customers,

vendors, employees, or executives are affected, some level of communication is required or

suggested. This is where the soft skills of an experienced manager, sales executive, techni-

cal consultant, and possibly even lawyers can be most valuable. Providing too much infor-

mation, information that is too technical, or, worst of all, incorrect or no information, is a

mistake technical staff frequently make. My recommendation to technical staff is to

communicate only with your direct manager or, if the manager is not available, with his or her boss. If the

CEO or an end user asks for an update, try to defer to the manager as best you can, so that

focus can be kept on restoring services.

Assigning Tasks and Scheduling Resources

The situation is that we have a failure, we have an approved plan, we have communicated

the situation, and we are ready to begin fixing the issue. The next step is to delegate the

specific tasks to the qualified staff members for execution. As stated previously, hand off

communication to a manager or spokesperson and only communicate through them if

possible. Determining who will restore a particular system is as important as, if not more

important than, assigning communication responsibilities. Only certain technical staff

members might be qualified to restore a system, so selecting the correct resource is essential.

When a serious failure has occurred, recovery efforts might require multiple technical

resources onsite for an extended period of time. Furthermore, there might be dependen-

cies that affect which systems can be restored, and, of course, the order or priority of

restore will advance or delay the recovery of a system. Mapping out the extended recovery

timeline and technical resource scheduling ensures that a technical resource is not onsite

until their skills and time are required. Also, rotating technical resources after six to eight

hours of time helps to keep progress moving forward.

Keeping the Troops Happy

This section goes out to all technical leads, project managers, IT managers, business

owners, and executives. If you have technical resources working for you in an effort to

recover from a failure, you should do all you can to ensure that these technical resources

are kept happy and focused. For starters, try to keep the end users and any other business

owners or executives from bothering this staff. Regular communication will help with this

task tremendously. Next, and possibly more important, provide all the bottled water, soda,

coffee, snacks, food, breaks, and anything else that will keep these professionals happy,

healthy, and focused on the task at hand. Technical staff will work very hard during disas-

ter situations, so don’t forget to pat them on the back and let them know how much the

organization and you personally appreciate their time and commitment.

Recovering the Infrastructure

After the failure has been validated, the initial communications meetings have been held,

restore tasks have been confirmed and possibly reprioritized, and recovery task assignment

of resources has been completed, the recovery efforts can finally begin. Verify that each

technical resource has all the documentation, phone numbers, software, and hardware

they require to perform their task. Hold periodic checkpoint meetings, starting every 15

minutes and tapering off to every 30 or 60 minutes as recovery efforts continue.
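
A tapering checkpoint schedule can be laid out in advance so everyone knows when the next meeting is. The following is a minimal sketch; the text gives only the intervals (15, then 30 or 60 minutes), so the points at which the interval changes (after 60 and 180 minutes here) are assumptions.

```python
def checkpoint_offsets(total_minutes, first_phase=60, second_phase=180):
    """Return minutes-from-start for each checkpoint meeting.

    Meetings are every 15 minutes until first_phase, every 30 minutes until
    second_phase, and every 60 minutes after that. Both phase boundaries are
    assumed values, not taken from the text.
    """
    offsets, elapsed = [], 0
    while True:
        if elapsed < first_phase:
            interval = 15
        elif elapsed < second_phase:
            interval = 30
        else:
            interval = 60
        elapsed += interval
        if elapsed > total_minutes:
            return offsets
        offsets.append(elapsed)


# Checkpoints for a recovery effort expected to last five hours.
print(checkpoint_offsets(300))
```

The boundaries are parameters so that the schedule can be adjusted to the pace of a particular recovery.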

Postmortem Meeting

After a system failure or disaster strikes, and the recovery has been completed, an organi-

zation should hold a meeting to review the entire process. The meeting might just be an

event where individuals are recognized for their great work; however, the meeting will

most likely involve reviewing what went wrong and identifying how the process could be

improved in the future. A lot of interesting things will happen during disaster recovery

situations—both unplanned and simulated—and this meeting can provide the catalyst for

ongoing improvement of the processes and documentation.

Disaster Scenario Troubleshooting

This section of the chapter details the high-level steps that can be taken to recover from

particular types of disaster scenarios. Because this book and chapter focus on Windows Server

2008 R2 environments, so do the following sections.

Network Outage

When an organization is faced with a network outage, the impact can affect a small set of

users, an entire office, or the entire company. When a network outage occurs, the network

administrators should perform the following tasks:

. Test the reported outage to determine whether the issue is related to a wide area network (WAN)

connection between the organization and the Internet service provider (ISP), the

router, a network switch, a firewall, a physical fiber or copper network connection or

network port, or line power to any of the aforementioned devices.
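
One way to narrow down where an outage sits is to test reachability hop by hop, starting with the device nearest the administrator. The following is a minimal sketch using a plain TCP connection attempt; the hop labels, addresses, and ports are hypothetical placeholders, and a TCP test will not detect every kind of fault (for example, a device that filters the tested port).

```python
import socket


def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_unreachable(hops, timeout=2.0):
    """Check hops in order (nearest first) and return the label of the first failure, or None."""
    for label, host, port in hops:
        if not tcp_reachable(host, port, timeout):
            return label
    return None


if __name__ == "__main__":
    # Hypothetical hops; replace with real management addresses and ports.
    hops = [
        ("core switch", "192.0.2.2", 22),
        ("firewall", "192.0.2.1", 443),
        ("ISP router", "198.51.100.1", 443),
    ]
    print(first_unreachable(hops))
```

The first hop that fails is where closer inspection of cabling, ports, and line power should begin.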
