Oracle Essentials Oracle Database 11g (50 page)

|

255

Causes of Unplanned Downtime

There are many different causes of unplanned downtime. You can prevent some very easily, while others require significant investments in site infrastructure, telecommunications, hardware, software, and skilled employees.
Figure 11-1
summarizes some of the more common causes of system failures.

Unplanned Downtime

Software Failures

Hardware Failures

Human Error

Disaster

Operating System

System

Operator Error

Fire

Database

CPU

User Error

Flood

Middleware

Memory

DBA Error

Earthquake

Application

Power Supply

System Administrator

Power Failure

Error

Network

Bus

Chemical Spill

Sabotage

Bombing

Peripherals

Disk

Controllers

Network

Power

Figure 11-1. Causes of unplanned downtime

When creating a plan to guarantee the availability of your application, you should consider all of the items shown in this chart as well as other potential causes of system interruption that are specific to your own circumstances. As with all planning, it’s much better to consider all options, even if you quickly dismiss them, than to be caught off guard when an unexpected event occurs.

System Availability Versus Component Availability

A complete system is composed of hardware, software, and networking components operating as a technology
stack
. Ensuring the availability of individual components doesn’t necessarily guarantee system availability. Different strategies and solutions exist for achieving high availability for each of the system components.
Figure 11-2

illustrates the technology stack used to deliver a potential system.

256

|

Chapter 11: Oracle and High Availability

THE SYSTEM STACK

Client

Availability and

Middleware

functionality

required at all

Network Hardware

levels

and Software

Application

Application Server Hardware

and Operating System

Database

Database Server Hardware and Software

Operating Sites and Facilities

Figure 11-2. Components of a system

As this figure shows, a variety of physical and logical layers must cooperate to deliver an application. Some systems may involve fewer components; for example, a two-tier client/server system would not have the additional application server components.

Failures in the components above the database can effectively prevent access to the database even though the database itself may be available. The database server and the database itself serve as the foundation for the stack. When a database fails, it immediately affects the higher levels of the stack. If the failure results in lost or corrupted data, the overall integrity of the application may be affected.

The potential threats to availability span all of the components involved in an application system, but in this chapter we’ll examine only availability issues relating specifically to the database.

System Failure

The abrupt failure of the server machine running the database is one of the most common causes of unplanned downtime. A server may crash because of hardware problems, such as the failure of a power supply, or because of software problems, such as a process that begins to consume all the machine’s CPU resources. Even if the underlying server platform is fine, the Oracle instance itself can fail. Whatever the cause of the crash, the effect on Oracle is the same—the instance cannot deliver
System Failure

|

257

its promised functionality. Remember that when an Oracle database crashes, it is the instance that crashes, not the database (as described in
Chapter 2).
Even if the system fails, the failure will not imperil any data that’s already safely stored within the disk files used by the Oracle database.

The impact of a crash will depend on the activity in progress at the time of the crash.

Any connected sessions will no longer have a server process to which to talk. All active queries and transactions will be abruptly terminated. The process of cleaning up the resulting mess is called
instance recovery
or
crash recovery
.

Telltale Error Messages

The following two error messages are often good indicators that an Oracle instance is down:

ORA-03113: End-of-file on communication channel

This message is usually received by clients that try to resubmit an operation that failed due to an instance failure. The message is somewhat cryptic but becomes clear if you interpret it very literally. Oracle works using a pipe to communicate between the client application and its associated server process in the Oracle instance. When the instance fails, the client’s server process ceases to exist, so there is no one listening on the other end of the pipe. The communication channel between the client and the server is no longer valid.

ORA-01034: Oracle not available

This terse message means that when the client requested a connection to the Oracle instance, the instance was not there. Clients that try to connect to a failed instance will typically get this message. The client can connect to the Listener, but when the Listener attempts to hand the client off to the requested Oracle instance, the ORA-01034 condition results.

What Is Instance Recovery?

When yourestart an Oracle instance after a failure, Oracle detects that a crash occurred using information in the control file and the headers of the database files.

Oracle then performs instance recovery automatically and uses the online redo logs to guarantee that the physical database is restored to a consistent state as it existed at the time of the crash. This includes two actions:

• All committed transactions will be recovered.

• In-flight transactions will be rolled back or undone.

Note that an in-flight transaction might be one that a user didn’t commit or one that was committed by the user but not confirmed by Oracle before the system failure. A transaction isn’t considered committed until Oracle has written the relevant details
258

|

Chapter 11: Oracle and High Availability

of the transaction to the current online redo log and has sent back a message to the client application confirming the committed transaction.

Phases of Instance Recovery

Instance recovery has two phases: rollforward and rollback.

Recovering an instance requires the use of the redo logs, described in
Chapter 2.
The redo logs contain a recording of all the physical changes made to the database as a result of transactional activity, both committed and uncommitted.

The introduction of delayed redo log writes for committed transactions as an option in Oracle Database 11
g
could create a scenario where a transaction appears to have been committed but could still not be recovered.

The checkpoint concept, also described in
Chapter 2,
is critical to understanding crash recovery. When a transaction is committed, Oracle writes all associated database block changes to the current online redo log. The actual database blocks may have already been flushed to disk, or may be flushed at some later point. This means that the online redo log can contain changes not yet reflected in the actual database blocks stored in the datafiles. Oracle periodically ensures that the data blocks in the datafiles on disk are synchronized with the redo log to reflect all the committed changes up to a point in time. Oracle does this by writing all the database blocks changed by those committed transactions to the datafiles on disk. This operation is called a
checkpoint
. Completed checkpoints are recorded in the control file, datafile headers, and redo log.

Rollforward

At any point in time, the online redo logs will be ahead of the datafiles by a certain amount of time or number of committed transactions. Instance recovery closes this gap and ensures that the datafiles reflect all committed transactions up to the time the instance crashed. Oracle performs instance recovery by rolling forward through the online redo log and replaying all the changes from the last completed checkpoint to the time of instance failure. This operation is called the
rollforward
phase of instance recovery.

While implementing rollforward recovery, Oracle reads the necessary database blocks into the System Global Area and reproduces the changes that were originally applied to the blocks. This process includes reproducing the undo or rollback information, in addition to the data changes. Rollback segments are composed of extents and data blocks just like tables, and all changes to rollback segment blocks are part of the redo for a given transaction. For example, suppose that a user changed an employee name from “John” to “Jonathan.” As Oracle applies the redo log, it will
System Failure

|

259

read the block containing the employee row into the cache and redo the name change. As part of recovering the transaction, Oracle will also write the old name

“John” to a rollback segment, as was done for the original transaction.

When the rollforward phase is finished, all the changes for committed and uncommitted transactions have been reproduced. The uncommitted transactions are in-flight once again, just as they were at the time the crash occurred. This leads to the next logical phase of instance recovery—rollback. But before we discuss rollbacks themselves, we need to look at how Oracle uses checkpoints and how the timing of checkpoints can affect recovery time.

Fast-start fault recovery and bounded recovery time

Checkpoints cause an increase in I/O since the database writer flushes all the database blocks to disk to bring the datafiles up to the time of the checkpoint. Prior to Oracle8, DBAs controlled checkpoint frequency by setting the initialization file parameters LOG_CHECKPOINT_INTERVAL (number of redo blocks between checkpoints) and LOG_CHECKPOINT_TIMEOUT (in seconds) and setting the size of the redo log files. In addition, Oracle always performs a checkpoint whenever a log file switch occurs.

Reducing the checkpoint interval or timeout would result in smaller amounts of data between checkpoints and lead to faster recovery times, but could also introduce the overhead of more frequent checkpoints and their associated disk activity. A common strategy for minimizing the number of checkpoints was to set the initialization file parameters so that checkpoints would occur only with log switches.

Oracle8
i
introduced an initialization file parameter to provide a simpler and more accurate way to control recovery times: FAST_START_IO_TARGET. The bulk of recovery activity involves performing I/O for reading database blocks into the cache so that redo can be applied to them. This parameter set a target ceiling on how many database blocks Oracle would have to read in applying redo information. Oracle would dynamically vary the checkpoint frequency in an attempt to limit the number of blocks that will need to be read for recovery to the value of this parameter.

Oracle9
i
further sped this recovery process. Beginning at the last checkpoint, the redo log was scanned for data blocks that contain unsaved changes and need to be recovered. In the subsequent scan, changes are applied only where needed. Because the subsequent scan is a sequential read and reading unnecessary blocks (random I/O) is eliminated, the recovery time is reduced.

Oracle9
i
introduced an important fast-start time-based recovery feature. DBAs specify a target for recovery time in seconds (in the FAST_START_MTTR_TARGET

initialization parameter, where MTTR stands for Mean Time to Recover) in order to meet Service Level Agreements and other requirements. The database automatically
260

|

Chapter 11: Oracle and High Availability

determines values for FAST_START_IO_TARGET and LOG_CHECKPOINT_

INTERVAL. Estimated MTTR values are calculated and placed in V$INSTANCE_

RECOVERY, thereby providing a means for real-world calibration and more accurate estimates over time.

Today, this is all much simpler through Fast-Start Fault Recovery. The Oracle database automatically bounds recovery time at startup using self-tuned checkpoint processing, first introduced in Oracle Database 10
g
.

Rollback improvements

The rollforward phase re-creates uncommitted transactions and their associated rollback information. These in-flight transactions must be rolled back to return to a consistent state.

Other books

Hatchet (9781442403321) by Paulsen, Gary
Whiskey & Charlie by Annabel Smith
No Apologies by Tracy Wolff
Shattering Inside by Lisa Ahne