This was first published on https://blog.dbi-services.com/pdb-media-failure-may-case-the-whole-cdb-to-crash (2014-07-28)
Republishing here for new followers. The content is related to the the versions available at the publication date
Do you remember last year, when 12c arrived with multitenant, David Hueber warned us about the fact that a single PDB can, under certain conditions, generate a complete system downtime? We are beta testers and opened a SR for that. Now one year later the first patchset is out and obviously I checked if the issue was fixed. It’s a patchset afterall, which is expected to fix issues before than bringing new features.
So the issue was that when the SYSTEM tablespace is lost in a PDB, then we cannot restore it without shutting down the whole CDB. This is because we cannot take the SYSTEM tablespace offline, and we cannot close the PDB as a checkpoint cannot be done. There is no SHUTDOWN ABORT for a PDB that can force to it. Conclusion: if you loose one SYSTEM tablespace, either you accept to wait for a maintenance window before bring it back online, or you have to stop the whole CDB with a shutdown abort.
When I receive a new release, I like to check new parameters, even the undocumented ones. And in 12.1.0.2 there is a new underscore parameter _enable_pdb_close_abort which has the description ‘Enable PDB shutdown abort (close abort)’. Great. It has a default value of false but maybe this is how the bug has been addressed.
Before trying that parameter, let’s reproduce the case:
Here are my datafiles:
RMAN> report schema; using target database control file instead of recovery catalog Report of database schema for database with db_unique_name CDB_SITE1 List of Permanent Datafiles =========================== File Size(MB) Tablespace RB segs Datafile Name ---- -------- -------------------- ------- ------------------------ 1 790 SYSTEM YES /u01/app/oracle/oradata/CDB/system01.dbf 3 680 SYSAUX NO /u01/app/oracle/oradata/CDB/sysaux01.dbf 4 215 UNDOTBS1 YES /u01/app/oracle/oradata/CDB/undotbs01.dbf 5 250 PDB$SEED:SYSTEM NO /u01/app/oracle/oradata/CDB/pdbseed/system01.dbf 6 5 USERS NO /u01/app/oracle/oradata/CDB/users01.dbf 7 540 PDB$SEED:SYSAUX NO /u01/app/oracle/oradata/CDB/pdbseed/sysaux01.dbf 8 250 PDB1:SYSTEM NO /u01/app/oracle/oradata/CDB/PDB1/system01.dbf 9 570 PDB1:SYSAUX NO /u01/app/oracle/oradata/CDB/PDB1/sysaux01.dbf 10 5 PDB1:USERS NO /u01/app/oracle/oradata/CDB/PDB1/PDB1_users01.dbf 11 250 PDB2:SYSTEM NO /u01/app/oracle/oradata/CDB/PDB2/system01.dbf 12 570 PDB2:SYSAUX NO /u01/app/oracle/oradata/CDB/PDB2/sysaux01.dbf 13 5 PDB2:USERS NO /u01/app/oracle/oradata/CDB/PDB2/PDB2_users01.dbf List of Temporary Files ======================= File Size(MB) Tablespace Maxsize(MB) Tempfile Name ---- -------- -------------------- ----------- -------------------- 1 197 TEMP 32767 /u01/app/oracle/oradata/CDB/temp01.dbf 2 100 PDB$SEED:TEMP 32767 /u01/app/oracle/oradata/CDB/pdbseed/pdbseed_temp012014-06-15_09-46-11-PM.dbf 3 20 PDB1:TEMP 32767 /u01/app/oracle/oradata/CDB/PDB1/temp012014-06-15_09-46-11-PM.dbf 4 20 PDB2:TEMP 32767 /u01/app/oracle/oradata/CDB/PDB2/temp012014-06-15_09-46-11-PM.dbf
then I just remove the PDB2 SYSTEM datafile:
rm /u01/app/oracle/oradata/CDB/PDB2/system01.dbf
And I go to sqlplus in order to check the state of my pdb. Remeber, I want to see if I can restore the datafile without doing a shutdown abort on my CDB instance.
SQL*Plus: Release 12.1.0.2.0 Production on Mon Jul 27 20:31:45 2014 Copyright (c) 1982, 2014, Oracle. All rights reserved. Connected to an idle instance. SQL> select name,open_mode from v$pdbs; select name,open_mode from v$pdbs * ERROR at line 1: ORA-01034: ORACLE not available Process ID: 0 Session ID: 0 Serial number: 0
Oh… that’s bad… Let’s look at the alert.log:
Errors in file /u01/app/oracle/diag/rdbms/cdb/CDB/trace/CDB_ckpt_21620.trc: ORA-01243: system tablespace file suffered media failure ORA-01116: error in opening database file 11 ORA-01110: data file 11: '/u01/app/oracle/oradata/CDB/PDB2/system01.dbf' ORA-27041: unable to open file Linux-x86_64 Error: 2: No such file or directory Additional information: 3 USER (ospid: 21620): terminating the instance due to error 1243 System state dump requested by (instance=1, osid=21620 (CKPT)), summary=[abnormal instance termination]. System State dumped to trace file /u01/app/oracle/diag/rdbms/cdb/CDB/trace/CDB_diag_21608_20140727202844.trc 2014-07-27 20:28:49.596000 +02:00 Instance terminated by USER, pid = 21620
The CKPT process has terminated the instance. The whole CDB is down.
That’s worse. In 12.1.0.1 we had to bring down the instance, but at least we were able to choose the time and warn the users. Not here. In 12.1.0.2 it crashes immediately when a checkpoint occurs.
I’ve opened a bug for that (Bug 19001390 – PDB SYSTEM TABLESPACE MEDIA FAILURE CAUSES THE WHOLE CDB TO CRASH) which is expected to be fixed for the next release (12.2).
Ok the good news is that once the CDB is down, recovery is straightforward:
Recovery Manager: Release 12.1.0.2.0 - Production on Mon Jul 27 21:36:22 2014 Copyright (c) 1982, 2014, Oracle and/or its affiliates. All rights reserved. connected to target database (not started) RMAN> startup Oracle instance started database mounted database opened Total System Global Area 838860800 bytes Fixed Size 2929936 bytes Variable Size 616565488 bytes Database Buffers 213909504 bytes Redo Buffers 5455872 bytes RMAN> list failure; using target database control file instead of recovery catalog Database Role: PRIMARY List of Database Failures ========================= Failure ID Priority Status Time Detected Summary ---------- -------- --------- ------------- ------- 3353 CRITICAL OPEN 27-JUL-14 System datafile 11: '/u01/app/oracle/oradata/CDB/PDB2/system01.dbf' is missing 245 HIGH OPEN 27-JUL-14 One or more non-system datafiles need media recovery RMAN> advise failure; Database Role: PRIMARY List of Database Failures ========================= Failure ID Priority Status Time Detected Summary ---------- -------- --------- ------------- ------- 3353 CRITICAL OPEN 27-JUL-14 System datafile 11: '/u01/app/oracle/oradata/CDB/PDB2/system01.dbf' is missing 245 HIGH OPEN 27-JUL-14 One or more non-system datafiles need media recovery analyzing automatic repair options; this may take some time allocated channel: ORA_DISK_1 channel ORA_DISK_1: SID=132 device type=DISK analyzing automatic repair options complete Mandatory Manual Actions ======================== no manual actions available Optional Manual Actions ======================= 1. If file /u01/app/oracle/oradata/CDB/PDB2/system01.dbf was unintentionally renamed or moved, restore it 2. Automatic repairs may be available if you shutdown the database and restart it in mount mode 3. If you restored the wrong version of data file /u01/app/oracle/oradata/CDB/PDB2/sysaux01.dbf, then replace it with the correct one 4. If you restored the wrong version of data file /u01/app/oracle/oradata/CDB/PDB2/PDB2_users01.dbf, then replace it with the correct one Automated Repair Options ======================== Option Repair Description ------ ------------------ 1 Restore and recover datafile 11; Recover datafile 12; Recover datafile 13 Strategy: The repair includes complete media recovery with no data loss Repair script: /u01/app/oracle/diag/rdbms/cdb/CDB/hm/reco_3711091289.hm RMAN> repair failure; Strategy: The repair includes complete media recovery with no data loss Repair script: /u01/app/oracle/diag/rdbms/cdb/CDB/hm/reco_3711091289.hm contents of repair script: # restore and recover datafile restore ( datafile 11 ); recover datafile 11; # recover datafile recover datafile 12, 13; Do you really want to execute the above repair (enter YES or NO)? YES executing repair script Starting restore at 27-JUL-14 using channel ORA_DISK_1 channel ORA_DISK_1: starting datafile backup set restore channel ORA_DISK_1: specifying datafile(s) to restore from backup set channel ORA_DISK_1: restoring datafile 00011 to /u01/app/oracle/oradata/CDB/PDB2/system01.dbf channel ORA_DISK_1: reading from backup piece /u01/app/oracle/fast_recovery_area/CDB/FECFFDC5F6D31F5FE043D74EA8C0715F/backupset/2014_07_28/o1_mf_nnndf_TAG20140728T150921_9xdlw21n_.bkp channel ORA_DISK_1: piece handle=/u01/app/oracle/fast_recovery_area/CDB/FECFFDC5F6D31F5FE043D74EA8C0715F/backupset/2014_07_28/o1_mf_nnndf_TAG20140728T150921_9xdlw21n_.bkp tag=TAG20140728T150921 channel ORA_DISK_1: restored backup piece 1 channel ORA_DISK_1: restore complete, elapsed time: 00:00:07 Finished restore at 27-JUL-14 Starting recover at 27-JUL-14 using channel ORA_DISK_1 starting media recovery media recovery complete, elapsed time: 00:00:01 Finished recover at 27-JUL-14 Starting recover at 27-JUL-14 using channel ORA_DISK_1 starting media recovery media recovery complete, elapsed time: 00:00:00 Finished recover at 27-JUL-14 repair failure complete RMAN> alter pluggable database PDB2 open; Statement processed
I whish that one day the PDB will have true isolation so that I can give DBA rights to the application owner on his PDB. But that means that, at least:
Good news, the bug I opened is fixed in OCT14 PSU. Don’t even read the workaround, it’s not something we can do on production. Anyway, multitenant is new. You must apply latest patchset to stabilize it.
WARNING! I’ve written the previous comment when I’ve seen the bug status in MOS. But of course, I’ve tested it. It’s NOT FIXED. Details here: http://www.dbi-services.com/index.php/blog/entry/is-cdb-stable-now-after-one-patchset-and-two-psu
[…] You can read more about it at Franck Pachot’s blog here: http://blog.dbi-services.com/is-cdb-stable-now-after-one-patchset-and-two-psu/ http://blog.dbi-services.com/pdb-media-failure-may-case-the-whole-cdb-to-crash/ […]