|
Online Redo Logs |
Preventive Steps |
Detection Mechanisms |
|
Online redo logs are needed for recovery. |
Implement multiple archivers. |
|
|
Types of Outage |
Steps |
Early Detection |
Error Detection |
|
Loss of one and not all of the online redo log members of a particular group |
- Drop log file member - Add log file member |
No detection is possible until log switch is performed either exiting or entering the file. |
The alert log records the following error stack when switching out of the redo log group. ORA-00316, ORA-00312, ORA-00321, ORA-00312. In addition the LGWR process will generate the same error stack Error number 312 reports the file effected. The full syntax is; ORA-00312: online log 1 thread 1: ‘file name’ The view, V$LOG, should be monitored for statuses of ‘STALE’ or ‘INVALID’. |
|
Loss of inactive archived redo log group Notes: 1. All recovery time are quite static except for the crash recovery. 2. The addition of new redo log group can be done afterwards. |
- Shutdown abort - Startup mount - Drop the problem redo log - Alter database open (crash recovery) |
No detection is possible until the LGWR process write to the redo log |
The LGWR process will generate the error stack OPIRIP, the general one ORA-00447 followed by more specific ones. In this test case, they are ORA-00313, ORA-00312 and ORA-07366. The failure of the LGWR will cause other background processes fail with some general error message produced and finally alert log will report a background process failure. (may not be the LGWR process) |
|
Loss of an inactive redo log group that has not been archived Notes: 2. The addition of new redo log group can be done afterwards. 3. Cannot drop the unarchived redo log without setting noarchivelog mode.(ORA-00350) 4. Cannot set archivelog after dropping the problem redo log group since instance recovery required.(ORA-00265) |
- Shutdown abort - Startup mount - Alter database noarchivelog - Drop the problem redo log - Alter database open (crash recovery) - Shutdown normal - Startup mount - Alter database archivelog - Alter database open |
No detection until the ARCH process archive the redo log | The ARCH process will generate archival stoppage message - ARCH: Archival stopped, error occurred. Will continue retrying - followed by informative error messages (ORA-00255, ORA-00312) reporting the problem online redo log and more specific error message(s) telling the cause (in the test case, it is ORA-00286). The same set of error messages will also appear on the alert log file together with archival stoppage message. |
|
Other Online Redo Log Outages |
Detection |
Steps |
|
Loss of the current online redo log group. |
The LGWR process will generate the error stack OPIRIP, the general one ORA-00447 followed by more specific ones. In this test case, they are ORA-00313, ORA-00312 and ORA-07366. The failure of the LGWR will cause other background processes fail with some general error message produced and finally alert log will report a background process failure. (may not be the LGWR process). V$LOG and V$LOGFILE will indicate if this is the current log. If so, we must switch to CRF. |
1. Restore and commence incomplete recovery. |
|
Silent Redo Log Corruption |
Error (ORA 600[3020]) during application of archive log on the standby database site. |
1. Rebuild standby database if there is one
present. |
|
Internal Redo Corruption |
ORA 600 [3020] implies that this change in the redo log is corrupted or inconsistent with the changes in the data block. All ORA 600 errors during application or writing of redo logs may be evidence to a corruption in the online redo log. |
1. If it does not affect primary, then
refresh or rebuild standby database. |
|
System Tablespace |
Preventive Steps |
Detection Mechanisms |
|
System tablespace contains the data dictionary tables. |
Spare disks in case of disk failures. |
|
|
System Tablespace Outages |
Comments / Detection Mechanisms |
Steps |
|
Lack of space or system tablespace fragmentation. |
This should be monitored very closely and the monitor tool should ensure that space is sufficient in all their tablespaces. However, if this occurs, the customer needs to add a datafile. |
Add another data file |
|
Loss of system tablespace. |
Monitor tool should track disk failures and correlate with the corresponding data file. |
1. restore and recover. |
|
Corruption of data dictionary object. |
ORA-1578 on a data dictionary object, and ORA 600 may be an indication of data dictionary corruption. These errors should be parsed from the alert.log by the monitoring tool. |
1. restore and recover. |
|
Temporary Tablespace |
Preventive Steps |
Detection Mechanisms |
|
Temporary segments are used for sorting and temporary scratch work when the allotted space in memory is insufficient. |
Extra disks should be available. |
|
|
Temporary Tablespace Outages or Temporary tablespace related problems. |
Detection Mechanisms/ Comments |
Realistic/ Tested TTR |
Steps |
|
Cannot allocate extents or tablespace fragmentation. |
These errors do not appear in the alert.log; thus, it must be trapped within the application. ORA 1547, ORA 1562 during a sort operation indicates temp tablespace configuration problem. |
The customer should configure the TEMP tablespace to avoid problems like this. To fix this problem once it is detected encompasses adding another data file or creating another tablespace. |
1. Add another data file. . |
|
Losing temporary tablespace data file. |
|
|
1. Run alter user scripts to switch temporary tablespace usage for users. 2. Create another temp tablespace if possible. |
|
Types of Outage |
Steps |
Early Detection |
Error Detection |
|
Loss of temporary tablespace data file |
- Create another temporary tablespace. - Alter user to use to new temporary
tablespace. - Drop current temporary tablespace. |
Monitor tool needs to detect disk failure and correspond OS files to Oracle files. |
Monitor tool needs to detect disk failures and correspond OS files to Oracle files. |
|
Loss of Index Data file |
Steps |
Early Detection |
Error Detection |
|
Recreate Index Tablespace |
- Drop Index tablespace - Create another index tablespace - Create all indices in the tablespace in parallel. |
Monitor tool should detect disk failure. If the failure is an index tablespace, automated steps are needed to recreate the indices. |
Monitor tool should detect disk failures. |
|
Restore from Backup and Recover |
- Offline tablespace. - Restore from hot backup. - Set autorecovery on. - Bring up recovery processes. - Alter database recover automatic tablespace. - Alter tablespace online. |
Monitor tool should detect disk failure. If the failure is an index tablespace, automated steps are needed to be restore and recover. |
Monitor tool should detect disk failures. |
|
Switch to disaster recovery plan. |
Switch to standby database or replicated database. |
Monitor tool needs to detect disk failure and correspond OS files to Oracle files. |
Monitor tool needs to detect disk failures and correspond OS files to Oracle files. |
Case
2: Loss of application table datafiles
|
Loss of Data File |
Steps |
Early Detection |
Error Detection |
|
Recreate Table Tablespace |
Only applicable if exports or unloads of the tables are taken. Only applicable if there is an object level recovery plan. (snapshots, unloader, replicated database, etc) |
Monitor tool should detect disk failure. If the failure is an index tablespace, automated steps are needed to be restore and recover. |
IO errors or ORA 1100s errors. |
|
Restore from Backup and Recover |
- Offline tablespace. - Restore from hot backup - Set autorecovery on. - Bring up recovery processes. - Alter database recover automatic tablespace. - Alter tablespace online. |
Monitor tool should detect disk failure. If the failure is an index tablespace, automated steps are needed to be restore and recover. |
Monitor tool should detect disk failures. |
|
Commence disaster recovery plan |
Only applicable if there is a standby database or a replicated database. This can also be accomplished by breaking three-way mirrors to maintain hot backups on site. |
Monitor tool needs to detect disk failure and correspond OS files to Oracle files. |
Monitor tool needs to detect disk failures and correspond OS files to Oracle files. |
|
Types of Outage |
Steps |
Early Detection |
Error Detection |
|
Cannot allocate extents or tablespace fragmentation |
Add datafile |
Monitor Space Usage and fragmentation of tablespaces. |
ORA-1547 or other error traps need to be trapped within the application |
|
Loss of Index due to user error or corruption. |
Drop index. Create index. |
Monitor alert.log for ORA-1578 errors that affect index data. Periodic index validation with the analyze command. |
ORA-1578 while using or validating an index. |
|
Single Table Loss or Corruption of Single Table. |
Object level recovery plan. Restore and recover. |
Monitor alert.log for ORA-1578 errors that affect table data. Table can be analyze to check for block corruptions. |
ORA-1578 while accessing or analyzing table data. |
|
Reorganize Table - Scheduled Outage. |
Drop / Create Unload/ Direct Load Use of Parallelism would be helpful. |
Monitor and alarm when objects exceed more than 20 extents or some threshold. |
ORA-1556 max extent error should be trapped within the application. |
|
Reorganize Index - Scheduled Outage |
Drop/ Create |
Monitor and alarm when objects exceed more than 20 extents or some threshold. |
ORA-1556 max extent error should be trapped within the application. |
|
Read-Only Data |
Preventive Steps |
Detection Mechanisms |
|
Index segments should only hold indexes. |
1. Partition data according to functionality. 2. Hardware mirroring and redundant controllers |
1. Monitor disks and for IO errors |
|
Data segments should hold both tables and clusters. |
1. Partition data according to functionality. 2. Hardware mirroring and redundant controllers |
1. Monitor disks and for IO errors |
|
Read-Only Data |
Steps |
|
Accidental drop of read-only data. |
1. Restore from backup 2. commence Disaster Recovery Plan. |
|
Loss of a read-only data file. |
1. Restore from backup 2. commence Disaster Recovery Plan. |
|
Archive Log Files |
Preventive Steps |
Detection Mechanisms |
|
|
Archive log files are required for media recovery and for standby databases. |
|
|
|
|
Archive Log Files and Archivelog |
Detection Mechanisms |
Steps |
|
Archiver stuck due to lack of free space in the archive destination. |
Monitor free space in the archive destination and alarm when close to reasonable threshold (70%) |
1. Free up space in the archive destination 2. Switch to different archived destination. |
|
Loss of archive file(s) due to media failure or user error. |
Monitor tool should check for disk failures. Alerts should be available when media failure affects archive logs. When this occurs, automatic procedures should switch to another archive destination. |
1. Backup primary. 2. Refresh standby database if one exists. |
|
Archiver can not write to the archive destination due to media failure. |
Monitor tool should check for disk failures. Alerts should be available when media failure affects archive logs. When this occurs, automatic procedures should switch to another archive destination. |
1. Switch archive destination. |
|
Archive logs are not successfully backed up. |
Checksums must be in place during the backup. Investigate problem and retry. |
If backup is not successful because of a bad archive log file, one needs to backup Primary and recreate the standby site. |
|
Archiver hangs because someone deactivates archivelog or disable archiving. |
This should be checked before every open command. The archive log list command will |
shutdown database |