This was first published on https://blog.dbi-services.com/oracle-cloud-service-my-first-outage (2015-10-18)
Republishing here for new followers. The content relates to the versions available at the publication date.
We experienced our first planned outage this weekend, so let’s see how it is announced and what happens.
Outage notification before:
So the notification was sent a bit less than 2 days before.
I actually had a session open when the system went down.
$ date
Sun Oct 18 21:03:27 CEST 2015
$ last -x -t 20151018170000 | head
oracle pts/0 178.197.234.212 Sun Oct 18 13:17 - 13:35 (00:17)
oracle pts/0 178.197.234.212 Sun Oct 18 13:12 - 13:14 (00:02)
oracle pts/0 178.197.234.212 Sun Oct 18 13:09 - 13:10 (00:01)
runlevel (to lvl 3) 2.6.39-400.109.1 Sun Oct 18 10:05 - 21:03 (10:57)
reboot system boot 2.6.39-400.109.1 Sun Oct 18 10:05 - 21:03 (10:57)
shutdown system down 2.6.39-400.109.1 Sat Oct 17 21:19 - 10:05 (12:46)
runlevel (to lvl 0) 2.6.39-400.109.1 Sat Oct 17 21:19 - 21:19 (00:00)
oracle pts/1 xdsl-188-154-161 Sat Oct 17 18:02 - down (03:16)
oracle pts/1 xdsl-188-154-161 Sat Oct 17 18:02 - 18:02 (00:00)
oracle pts/1 56.227.197.178.d Wed Oct 14 06:13 - 07:35 (01:21)
Remark: a reboot was not what I expected from the message ‘restarted to match original state’. I expected something like a ‘save state’ that includes the RAM.
Outage notification after:
Remark: the system was up at 10:05, but the notification that it was up came 8 hours later. So if I have something to restart manually, when am I expected to do it? At 10:05, when I see the system reboot? At 12:00, the planned end? Or at 18:16, when I receive the notification? When I’m responsible for an outage, I count its duration from the start of maintenance up to the availability notification.
Here is the summary from the ‘CLOUD My Service’:
There was an outage later on the US cloud. It was planned from 6:00:00 AM to 9:00:00 PM CEST and here is the summary:
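The problem is that the two outages overlap: according to the logs, both systems were down at the same time between 06:10 and 10:05 CEST on Sunday. So we can’t rely on a Data Guard setup between the two to ensure High Availability.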
$ date
Sun Oct 18 21:05:46 CEST 2015
$ last -x -t 20151018170000 | head
runlevel (to lvl 3) 2.6.39-400.109.1 Sun Oct 18 16:12 - 21:05 (04:53)
reboot system boot 2.6.39-400.109.1 Sun Oct 18 16:12 - 21:05 (04:53)
shutdown system down 2.6.39-400.109.1 Sun Oct 18 06:10 - 16:12 (10:02)
runlevel (to lvl 0) 2.6.39-400.109.1 Sun Oct 18 06:10 - 06:10 (00:00)
oracle pts/1 xdsl-188-154-161 Sat Oct 17 21:25 - 00:01 (02:35)
oracle pts/1 xdsl-188-154-161 Mon Oct 5 07:29 - 11:41 (04:12)
oracle pts/1 xdsl-188-154-161 Mon Oct 5 07:22 - 07:22 (00:00)
oracle pts/1 109.132.241.235 Sun Sep 20 10:57 - 10:57 (00:00)
oracle pts/1 109.132.241.235 Sun Sep 20 10:28 - 10:28 (00:00)
oracle pts/2 109.132.241.235 Sat Sep 19 18:03 - 19:07 (01:03)
During the 06:00 – 16:08 outage (the end was planned for 21:00), the system was stopped from 06:10 to 16:12, and the notification came 2 hours later.
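To put a number on the overlap, here is a small sketch that compares the two ‘system down’ intervals reported by last -x on each system. The helper names to_min and overlap_minutes are mine, and I assume both hosts report the same time zone (both date outputs show CEST). Times are expressed as minutes since Saturday Oct 17, 00:00.

```shell
#!/bin/sh
# Convert "day offset, hour, minute" to minutes since Sat Oct 17 00:00.
# Pass numbers without leading zeros (08 would be read as invalid octal).
to_min() {
  echo $(( $1 * 1440 + $2 * 60 + $3 ))
}

# Overlap in minutes of two intervals: start1 end1 start2 end2.
overlap_minutes() {
  s=$(( $1 > $3 ? $1 : $3 ))   # latest start
  e=$(( $2 < $4 ? $2 : $4 ))   # earliest end
  echo $(( e > s ? e - s : 0 ))
}

# First system: down Sat 21:19 -> Sun 10:05 (from the first last -x output)
a_start=$(to_min 0 21 19); a_end=$(to_min 1 10 5)
# US system: down Sun 06:10 -> Sun 16:12 (from the second last -x output)
b_start=$(to_min 1 6 10);  b_end=$(to_min 1 16 12)

echo "$(overlap_minutes "$a_start" "$a_end" "$b_start" "$b_end") minutes"
# prints "235 minutes", i.e. 3 hours 55 minutes with both sites down
```

This only measures the time the operating systems were actually stopped. Counting up to the availability notifications instead would make the overlap longer.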
On any server, you should be confident that all your services restart when the server reboots. Check your init.d scripts. Test them. Take care of dependencies when one service must start after another. In the cloud, because you’re not there when the system is brought up (and you are not notified immediately), you must be 100% sure that everything restarts.
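One way to gain that confidence is a check that runs after boot and reports what did not come back. The following is only a sketch under my own assumptions: the function name missing_services is hypothetical, and the patterns ora_pmon_ (instance background process) and tnslsnr (listener) are just examples of what you may expect to see on a database server. It reads a process listing on standard input so that it can be tested without a running database.

```shell
#!/bin/sh
# Print each expected pattern that is NOT found in the process list read from stdin.
# Patterns are treated as basic regular expressions by grep.
missing_services() {
  plist=$(cat)
  for pattern in "$@"; do
    if ! printf '%s\n' "$plist" | grep -q -- "$pattern"; then
      echo "$pattern"
    fi
  done
}

# Example with a fake process list: the listener is missing here.
printf '%s\n' '/sbin/init' 'ora_pmon_ORCL' | missing_services ora_pmon_ tnslsnr
# prints "tnslsnr"

# On a real server (assumption: ps supports -eo args, as on Linux):
#   ps -eo args | missing_services ora_pmon_ tnslsnr
```

Scheduled a few minutes after boot, and sending its output by e-mail, such a check tells you about a failed restart before the cloud provider’s notification does. It does not replace testing the init.d scripts themselves with a real reboot.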