PDB media failure may cause the whole CDB to crash

Do you remember last year, when 12c arrived with multitenant, David Hueber warned us about the fact that a single PDB can, under certain conditions, generate a complete system downtime? We are beta testers and opened a SR for that. Now one year later the first patchset is out and obviously I checked if the issue was fixed. It’s a patchset afterall, which is expected to fix issues before than bringing new features.

So the issue was that when the SYSTEM tablespace is lost in a PDB, then we cannot restore it without shutting down the whole CDB. This is because we cannot take the SYSTEM tablespace offline, and we cannot close the PDB as a checkpoint cannot be done. There is no SHUTDOWN ABORT for a PDB that can force to it. Conclusion: if you loose one SYSTEM tablespace, either you accept to wait for a maintenance window before bring it back online, or you have to stop the whole CDB with a shutdown abort.

When I receive a new release, I like to check new parameters, even the undocumented ones. And in 12.1.0.2 there is a new underscore parameter _enable_pdb_close_abort which has the description ‘Enable PDB shutdown abort (close abort)’. Great. It has a default value of false but maybe this is how the bug has been addressed.

Before trying that parameter, let’s reproduce the case:

Here are my datafiles:

RMAN> report schema;

using target database control file instead of recovery catalog
Report of database schema for database with db_unique_name CDB_SITE1

List of Permanent Datafiles
===========================
File Size(MB) Tablespace           RB segs Datafile Name
---- -------- -------------------- ------- ------------------------
1    790      SYSTEM               YES     /u01/app/oracle/oradata/CDB/system01.dbf
3    680      SYSAUX               NO      /u01/app/oracle/oradata/CDB/sysaux01.dbf
4    215      UNDOTBS1             YES     /u01/app/oracle/oradata/CDB/undotbs01.dbf
5    250      PDB$SEED:SYSTEM      NO      /u01/app/oracle/oradata/CDB/pdbseed/system01.dbf
6    5        USERS                NO      /u01/app/oracle/oradata/CDB/users01.dbf
7    540      PDB$SEED:SYSAUX      NO      /u01/app/oracle/oradata/CDB/pdbseed/sysaux01.dbf
8    250      PDB1:SYSTEM          NO      /u01/app/oracle/oradata/CDB/PDB1/system01.dbf
9    570      PDB1:SYSAUX          NO      /u01/app/oracle/oradata/CDB/PDB1/sysaux01.dbf
10   5        PDB1:USERS           NO      /u01/app/oracle/oradata/CDB/PDB1/PDB1_users01.dbf
11   250      PDB2:SYSTEM          NO      /u01/app/oracle/oradata/CDB/PDB2/system01.dbf
12   570      PDB2:SYSAUX          NO      /u01/app/oracle/oradata/CDB/PDB2/sysaux01.dbf
13   5        PDB2:USERS           NO      /u01/app/oracle/oradata/CDB/PDB2/PDB2_users01.dbf

List of Temporary Files
=======================
File Size(MB) Tablespace           Maxsize(MB) Tempfile Name
---- -------- -------------------- ----------- --------------------
1    197      TEMP                 32767       /u01/app/oracle/oradata/CDB/temp01.dbf
2    100      PDB$SEED:TEMP        32767       /u01/app/oracle/oradata/CDB/pdbseed/pdbseed_temp012014-06-15_09-46-11-PM.dbf
3    20       PDB1:TEMP            32767       /u01/app/oracle/oradata/CDB/PDB1/temp012014-06-15_09-46-11-PM.dbf
4    20       PDB2:TEMP            32767       /u01/app/oracle/oradata/CDB/PDB2/temp012014-06-15_09-46-11-PM.dbf

then I just remove the PDB2 SYSTEM datafile:

rm /u01/app/oracle/oradata/CDB/PDB2/system01.dbf

And I go to sqlplus in order to check the state of my pdb. Remeber, I want to see if I can restore the datafile without doing a shutdown abort on my CDB instance.

SQL*Plus: Release 12.1.0.2.0 Production on Mon Jul 27 20:31:45 2014

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> select name,open_mode from v$pdbs;
select name,open_mode from v$pdbs
*
ERROR at line 1:
ORA-01034: ORACLE not available
Process ID: 0
Session ID: 0 Serial number: 0

Oh… that’s bad… Let’s look at the alert.log:

Errors in file /u01/app/oracle/diag/rdbms/cdb/CDB/trace/CDB_ckpt_21620.trc:
ORA-01243: system tablespace file suffered media failure
ORA-01116: error in opening database file 11
ORA-01110: data file 11: '/u01/app/oracle/oradata/CDB/PDB2/system01.dbf'
ORA-27041: unable to open file
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
USER (ospid: 21620): terminating the instance due to error 1243
System state dump requested by (instance=1, osid=21620 (CKPT)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/cdb/CDB/trace/CDB_diag_21608_20140727202844.trc
2014-07-27 20:28:49.596000 +02:00
Instance terminated by USER, pid = 21620

The CKPT process has terminated the instance. The whole CDB is down.

That’s worse. In 12.1.0.1 we had to bring down the instance, but at least we were able to choose the time and warn the users. Not here. In 12.1.0.2 it crashes immediately when a checkpoint occurs.

I’ve opened a bug for that (Bug 19001390 – PDB SYSTEM TABLESPACE MEDIA FAILURE CAUSES THE WHOLE CDB TO CRASH) which is expected to be fixed for the next release (12.2).

Ok the good news is that once the CDB is down, recovery is straightforward:

Recovery Manager: Release 12.1.0.2.0 - Production on Mon Jul 27 21:36:22 2014

Copyright (c) 1982, 2014, Oracle and/or its affiliates.  All rights reserved.

connected to target database (not started)

RMAN> startup

Oracle instance started
database mounted
database opened

Total System Global Area     838860800 bytes

Fixed Size                     2929936 bytes
Variable Size                616565488 bytes
Database Buffers             213909504 bytes
Redo Buffers                   5455872 bytes


RMAN> list failure;

using target database control file instead of recovery catalog
Database Role: PRIMARY

List of Database Failures
=========================

Failure ID Priority Status    Time Detected Summary
---------- -------- --------- ------------- -------
3353       CRITICAL OPEN      27-JUL-14     System datafile 11: '/u01/app/oracle/oradata/CDB/PDB2/system01.dbf' is missing
245        HIGH     OPEN      27-JUL-14     One or more non-system datafiles need media recovery


RMAN> advise failure;

Database Role: PRIMARY

List of Database Failures
=========================

Failure ID Priority Status    Time Detected Summary
---------- -------- --------- ------------- -------
3353       CRITICAL OPEN      27-JUL-14     System datafile 11: '/u01/app/oracle/oradata/CDB/PDB2/system01.dbf' is missing
245        HIGH     OPEN      27-JUL-14     One or more non-system datafiles need media recovery

analyzing automatic repair options; this may take some time
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=132 device type=DISK
analyzing automatic repair options complete

Mandatory Manual Actions
========================
no manual actions available

Optional Manual Actions
=======================
1. If file /u01/app/oracle/oradata/CDB/PDB2/system01.dbf was unintentionally renamed or moved, restore it
2. Automatic repairs may be available if you shutdown the database and restart it in mount mode
3. If you restored the wrong version of data file /u01/app/oracle/oradata/CDB/PDB2/sysaux01.dbf, then replace it with the correct one
4. If you restored the wrong version of data file /u01/app/oracle/oradata/CDB/PDB2/PDB2_users01.dbf, then replace it with the correct one

Automated Repair Options
========================
Option Repair Description
------ ------------------
1      Restore and recover datafile 11; Recover datafile 12; Recover datafile 13
  Strategy: The repair includes complete media recovery with no data loss
  Repair script: /u01/app/oracle/diag/rdbms/cdb/CDB/hm/reco_3711091289.hm


RMAN> repair failure;

Strategy: The repair includes complete media recovery with no data loss
Repair script: /u01/app/oracle/diag/rdbms/cdb/CDB/hm/reco_3711091289.hm

contents of repair script:
   # restore and recover datafile
   restore ( datafile 11 );
   recover datafile 11;
   # recover datafile
   recover datafile 12, 13;

Do you really want to execute the above repair (enter YES or NO)? YES
executing repair script

Starting restore at 27-JUL-14
using channel ORA_DISK_1

channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00011 to /u01/app/oracle/oradata/CDB/PDB2/system01.dbf
channel ORA_DISK_1: reading from backup piece /u01/app/oracle/fast_recovery_area/CDB/FECFFDC5F6D31F5FE043D74EA8C0715F/backupset/2014_07_28/o1_mf_nnndf_TAG20140728T150921_9xdlw21n_.bkp
channel ORA_DISK_1: piece handle=/u01/app/oracle/fast_recovery_area/CDB/FECFFDC5F6D31F5FE043D74EA8C0715F/backupset/2014_07_28/o1_mf_nnndf_TAG20140728T150921_9xdlw21n_.bkp tag=TAG20140728T150921
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:07
Finished restore at 27-JUL-14

Starting recover at 27-JUL-14
using channel ORA_DISK_1

starting media recovery
media recovery complete, elapsed time: 00:00:01

Finished recover at 27-JUL-14

Starting recover at 27-JUL-14
using channel ORA_DISK_1

starting media recovery
media recovery complete, elapsed time: 00:00:00

Finished recover at 27-JUL-14
repair failure complete


RMAN> alter pluggable database PDB2 open;

Statement processed

I whish that one day the PDB will have true isolation so that I can give DBA rights to the application owner on his PDB. But that means that, at least:

A PDB failure cannot crash the CDB instance.
A PDB admin cannot create datafiles anywhere on my server.
A PDB admin cannot run anything as the instance owner user (usually oracle)

PDB media failure may cause the whole CDB to crash

3 Comments