When a problem is encountered with a cluster resource, the failover cluster service
attempts to fix the problem by restarting the resource and any dependent resources. If that
doesn’t work, the Services and Applications group the resource is a member of is failed
over to another available node in the cluster, where it can then be restarted. Several
conditions can cause a Services and Applications group to failover to a different cluster
node. Failover can occur when an active node in the cluster loses power or network
connectivity or suffers a hardware or software failure. In most cases, the failover process is
either noticed by the clients as a short disruption of service or is not noticed at all. Of
course, if failback is configured on a particular Services and Applications group and the
group is simply not stable but all possible nodes are available, the group will be continually
moved back and forth between the nodes until the failover threshold is reached. When this
happens, the cluster service shuts the group down and leaves it offline.
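Failover and failback behavior is tuned per Services and Applications group. As a rough
sketch of how these settings can be reviewed and adjusted with the Failover Clustering
PowerShell module (the group name “FileServices” is only an example; substitute one of
your own groups):

# Load the failover clustering cmdlets (Windows Server 2008 R2)
Import-Module FailoverClusters

# Review the current failover/failback policy for a sample group
$group = Get-ClusterGroup -Name "FileServices"
$group | Format-List Name, OwnerNode, State, FailoverThreshold, FailoverPeriod, AutoFailbackType

# Allow at most two failovers within a six-hour window before the group is left offline,
# and prevent automatic failback (0 = prevent failback, 1 = allow failback)
$group.FailoverThreshold = 2
$group.FailoverPeriod = 6
$group.AutoFailbackType = 0

The same settings are exposed on the group’s property pages in the Failover Cluster
Manager console.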
To avoid unwanted failover, power management should be disabled on each of the cluster
nodes in the motherboard BIOS, on the network interface cards (NICs), and in the Power
applet in the operating system’s Control Panel. Power settings that allow a display to shut
off are okay, but the administrator must make sure that the disks, as well as each of the
network cards, are configured to never go into Standby mode.
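On Windows Server 2008 R2, most of these timeouts can also be set from the command
line with powercfg. The following is a minimal sketch, run in an elevated prompt on each
node; NIC power management still has to be disabled per adapter in Device Manager:

# Never spin down disks while on AC power
powercfg /change disk-timeout-ac 0
# Never enter standby or hibernation
powercfg /change standby-timeout-ac 0
powercfg /change hibernate-timeout-ac 0
# Allowing the display to turn off is fine
powercfg /change monitor-timeout-ac 15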
Cluster nodes can monitor the status of resources running on their local system, and they
can also keep track of other nodes in the cluster through private network communication
messages called heartbeats. Heartbeat communication is used to determine the status of a
node and send updates of cluster configuration changes and the state of each node to the
cluster quorum.
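The heartbeat interval and the number of missed heartbeats tolerated before a node is
considered unavailable are exposed as cluster common properties. A read-only sketch for
reviewing them (the defaults are typically 1,000ms and 5 missed heartbeats):

Import-Module FailoverClusters
# Delay = how often heartbeats are sent; Threshold = how many may be missed in a row
Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, CrossSubnetThreshold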
The cluster quorum contains the cluster configuration data necessary to restore a cluster to
a working state. Each node in the cluster needs to have access to the quorum resource,
regardless of which quorum model is chosen; otherwise, the node will not be able to
participate in the cluster. This requirement prevents something called “split-brain”
syndrome, in which two nodes in the same cluster both believe they are the active node
and try to control the shared resource at the same time or, worse, each node presents its
own set of data when separate data sets are available, causing changes in both data sets
and a cascade of subsequent issues. Windows Server 2008 R2 provides four different quorum models, which
are detailed in the section “Failover Cluster Quorum Models” later in this chapter.
Network Load Balancing
The second clustering technology provided with Windows Server 2008 R2 is Network
Load Balancing (NLB). NLB clusters provide high network performance, availability, and
redundancy by balancing client requests across several servers with replicated configura-
tions. When client load increases, NLB clusters can easily be scaled out by adding more
nodes to the cluster to maintain or provide better response time to client requests. One
important point to note now is that NLB does not itself replicate server configuration or
application data sets.
Two great features of NLB are that no proprietary hardware is needed and an NLB cluster
can be configured and up and running literally in minutes. One important point to
remember is that within NLB clusters, each server’s configuration must be updated
independently. The NLB administrator is responsible for making sure that application and
service configuration, versions, operating system security and updates, and data are
kept consistent across each NLB cluster node. For details on installing NLB, refer to the
“Deploying Network Load Balancing Clusters” section later in this chapter.
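For reference, an NLB cluster can also be built from PowerShell on Windows Server 2008
R2 using the NetworkLoadBalancingClusters module. The cluster name, virtual IP address,
interface name, and host names below are placeholders only; confirm the exact parameters
with Get-Help New-NlbCluster before using them:

# Run on the first host
Import-Module NetworkLoadBalancingClusters
New-NlbCluster -InterfaceName "Local Area Connection" -ClusterName "nlbweb" `
    -ClusterPrimaryIP 192.168.1.50 -SubnetMask 255.255.255.0 -OperationMode Multicast

# Join an additional host to the cluster
Add-NlbClusterNode -InterfaceName "Local Area Connection" `
    -NewNodeName "web02" -NewNodeInterface "Local Area Connection"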
Overview of Failover Clusters
After an organization decides to cluster an application or service using failover clusters, it
must then decide which cluster configuration model best suits the needs of the particular
deployment. Failover clusters can be deployed using four different configuration models
that accommodate most deployment scenarios and requirements. The four configuration
models are defined by the quorum model selected: the Node Majority Quorum, the Node
and Disk Majority Quorum, the Node and File Share Majority Quorum, and the No
Majority: Disk Only Quorum. The typical and most common cluster
deployment that includes two or more nodes in a single data center is the Node and Disk
Majority Quorum model. Another configuration model of failover clusters that utilizes one
of the previously mentioned quorum models is the geographically dispersed cluster, which
is deployed across multiple networks and geographic locations. Geographically dispersed
clusters or stretch clusters will be detailed later in this chapter in the “Deploying Multisite
or Stretch Geographically Dispersed Failover Clusters” section.
Failover Cluster Quorum Models
As previously stated, Windows Server 2008 R2 failover clusters support four different
cluster quorum models. Each of these four models is best suited for specific configurations,
but if all the nodes and shared storage are configured, specified, and available during the
installation of the failover cluster, the best-suited quorum model is automatically selected.
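After the cluster is created, the model that was selected (and the witness resource in use,
if any) can be confirmed from PowerShell; a minimal sketch:

Import-Module FailoverClusters
# Shows the quorum type (for example, NodeAndDiskMajority) and the witness resource
Get-ClusterQuorum | Format-List Cluster, QuorumType, QuorumResource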
Node Majority Quorum
The Node Majority Quorum model has been designed for failover cluster deployments
that contain an odd number of cluster nodes. When determining the quorum state of the
cluster, only the number of available nodes is counted. A cluster using the Node Majority
Quorum is called a Node Majority cluster. A Node Majority cluster remains up and
running if the number of available nodes exceeds the number of failed nodes. As an
example, in a five-node cluster, three nodes must be available for the cluster to remain
online. If three nodes fail in a five-node Node Majority cluster, the entire cluster is shut
down. Node Majority clusters have been designed and are well suited for geographically or
network dispersed cluster nodes, but for this configuration to be supported by Microsoft, it
takes serious effort, quality hardware, a third-party mechanism to replicate any back-end
data, and a very reliable network. Once again, this model works well for clusters with an
odd number of nodes.
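If the cluster creation wizard did not choose this model automatically, an existing cluster
can be switched to it from PowerShell; a minimal sketch:

Import-Module FailoverClusters
# Node Majority uses no witness disk or witness file share
Set-ClusterQuorum -NodeMajority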
Node and Disk Majority Quorum
The Node and Disk Majority Quorum model determines whether a cluster can continue to
function by counting the number of available nodes and the availability of the cluster
witness disk. Using this model, the cluster quorum is stored on a cluster disk that is acces-
sible and made available to all nodes in the cluster through a shared storage device using
Serial Attached SCSI (SAS), Fibre Channel, or iSCSI connections. This model is the closest
to the traditional single-quorum device cluster configuration model and is composed of
two or more server nodes that are all connected to a shared storage device. In this model,
only one copy of the quorum data is maintained on the witness disk. This model is well
suited for failover clusters using shared storage, all connected on the same network with
an even number of nodes. For example, on a 2-, 4-, 6-, 8-, or 16-node cluster using this
model, the cluster continues to function as long as half of the total nodes are available
and can contact the witness disk. In the case of a witness disk failure, a majority of the
nodes need to remain up and running for the cluster to continue to function. To calculate
this, take half of the total nodes and add one; this gives you the lowest number of
available nodes required to keep a cluster running when the witness disk fails or
goes offline. For example, on a 6-node cluster using this model, if the witness disk fails,
the cluster will remain up and running as long as 4 nodes are available, but on a 2-node
cluster, if the witness disk fails, both nodes will need to remain up and running for the
cluster to function.
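The arithmetic can be expressed as a short helper function. This is purely an illustration
of the rule just described, not a clustering API:

# Minimum nodes that must stay online in a Node and Disk Majority cluster
function Get-MinimumNodes {
    param([int]$TotalNodes, [bool]$WitnessOnline)
    if ($WitnessOnline) {
        return [math]::Ceiling($TotalNodes / 2)    # half of the nodes, plus the witness vote
    }
    return [math]::Floor($TotalNodes / 2) + 1      # half of the nodes plus one
}

Get-MinimumNodes -TotalNodes 6 -WitnessOnline $true    # 3
Get-MinimumNodes -TotalNodes 6 -WitnessOnline $false   # 4
Get-MinimumNodes -TotalNodes 2 -WitnessOnline $false   # 2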
Node and File Share Majority Quorum
The Node and File Share Majority Quorum model is very similar to the Node and Disk
Majority Quorum model, but instead of a witness disk, the quorum is stored on a file share.
The advantage of this model is that it can be deployed similarly to the Node Majority
Quorum model, but as long as the witness file share is available, this model can tolerate the
failure of half of the total nodes. This model is well suited for clusters with an even number
of nodes that do not utilize shared storage or clusters that span sites. This is the preferred
and recommended quorum configuration for geographically dispersed failover clusters.
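Configuring a file share witness can also be done from PowerShell. In the sketch below,
\\witness01\ClusterWitness is a placeholder share; it should reside on a server outside the
cluster and be accessible to every node:

Import-Module FailoverClusters
# Use a witness file share instead of a witness disk
Set-ClusterQuorum -NodeAndFileShareMajority "\\witness01\ClusterWitness"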
No Majority: Disk Only Quorum
The No Majority: Disk Only Quorum model is best suited for testing the process and
behavior of deploying built-in or custom services and/or applications on a Windows
Server 2008 R2 failover cluster. In this model, the cluster can sustain the failure of all
nodes except one, as long as the disk containing the quorum remains available. The limi-
tation of this model is that the disk containing the quorum becomes a single point of
failure, which is why this model is not well suited for production deployments of
failover clusters.
As a best practice, before deploying a failover cluster, determine if shared storage will be
used, verify that each node can communicate with each LUN presented by the shared
storage device, and when the cluster is created, add all nodes to the list. This ensures that
the correct recommended cluster quorum model is selected for the new failover cluster.
When the recommended model utilizes shared storage and a witness disk, the smallest
available LUN will be selected. This can be changed, if necessary, after the cluster is created.
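In practice, this means validating and then creating the cluster with the full node list in
one pass. A sketch follows; the node names, cluster name, and static address are
placeholders:

Import-Module FailoverClusters
# Validate the hardware and configuration first; the report flags storage and network issues
Test-Cluster -Node node1, node2

# Create the cluster with all nodes so the recommended quorum model is selected
New-Cluster -Name CLUSTER01 -Node node1, node2 -StaticAddress 192.168.1.100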
Choosing Applications for Failover Clusters
Many applications can run on failover clusters, but it is important to choose and test
those applications wisely. Although many can run on failover clusters, the application
might not be optimized for clustering or supported by the software vendor or Microsoft
when deployed on Windows Server 2008 R2 failover clusters. Work with the vendor to
determine requirements, functionality, and limitations (if any). Other major criteria that
should be met to ensure that an application can benefit and adapt to running on a cluster
are the following:
- Because clustering is IP-based, the cluster application or applications must use an IP-
based protocol.
- Applications that require access to local databases must have the option of configur-
ing where the data can be stored so a drive other than the system drive can be speci-
fied for data storage that is separate from the storage of the application core files.
- Some applications need to have access to data regardless of which cluster node they
are running on. With these types of applications, it is recommended that the data is
stored on a shared disk resource that will failover with the Services and Applications
group. If an application will run and store data only on the local system or boot
drive, the Node Majority Quorum or the Node and File Share Majority Quorum
model should be used, along with a separate file replication mechanism for the
application data.
- Client sessions must be able to reestablish connectivity if the application encounters
a network disruption or fails over to an alternate cluster node. During the failover
process, there is no client connectivity until an application is brought back online.
If the client software does not try to reconnect and simply times out when a net-
work connection is broken, this application might not be well suited for failover or
NLB clusters.
Cluster-aware applications that meet all of the preceding criteria are usually the best appli-
cations to deploy in a Windows Server 2008 R2 failover cluster. Many services built in to
Windows Server 2008 R2 can be clustered and will failover efficiently and properly. If a
particular application is not cluster-aware, be sure to investigate all the implications of the
application deployment on Windows Server 2008 R2 failover clusters before deploying or
spending any time prototyping the solution.
NOTE
If you’re purchasing a third-party software package to use for Windows Server 2008 R2
failover clustering, be sure that both Microsoft and the software manufacturer certify
that it will work on Windows Server 2008 R2 failover clusters; otherwise, support will
be limited or nonexistent when troubleshooting is necessary.
Shared Storage for Failover Clusters
Shared disk storage is a requirement for Windows Server 2008 R2 failover clusters using
the Node and Disk Majority Quorum and the Disk Only Quorum models. Shared storage
devices can be a part of any cluster configuration and when they are used, the disks, disk