OVHcloud Web Hosting Status

FS#5897 — xc11.mail.ovh.net
Incident Report for Web Cloud
Resolved
Many databases on this server appear to be corrupted. We are trying to repair the databases
and, in parallel, are retrieving the internal backup as a precaution.

Update(s):

Date: 2011-10-19 12:26:26 UTC
All of the accounts were recovered on Sunday, 16 October.

Date: 2011-10-19 12:24:29 UTC
20% of the data has been recovered.

Each affected customer has been or will be called by our team to get more precise feedback on the progress made.

Date: 2011-10-19 12:21:34 UTC
We will keep recovering the data and inserting it into the corresponding accounts.

15% of the affected domains have been recovered. The operation is slow because Exchange allows only one
database to be mounted in repair mode at a time.

Date: 2011-10-13 09:50:42 UTC
Of the 77 domains, 14 were still to be recovered. 7 of these have been recovered.
For the remaining 7, there is no hope of recovery.

The backup system only goes back a few days, and the Exchange databases
had been damaged for much longer. It seems that the databases started to
become corrupted 10 to 14 days ago. Raw backups going back 2 weeks would have
been needed in order to recover the 7 domains.
Having logical backups covering only a few days is not sufficient in this case.

The origin of the problem has been identified: it is the backup process itself
that put the databases into a faulty state. During the backups, the VM was frozen for
a few seconds, and it appears that on one particular day this caused a corruption
of the databases.
Since the full backup is the sum of all the backups,
and this corrupt backup is among them, the sum is no longer
usable.
Even so, we succeeded in going back in time and removing
the bad snapshots in order to recover some of the databases.
We had to remove the snapshots, then check the system and recover
database by database. 70 databases
were recovered. For the remaining 7, we would have needed to go back
further in time, that is, more than one week.

Therefore, even though we have the 10:30 backup, i.e. one hour before the crash,
this backup is not usable, since it is added on top of the set of backups that
we have been making every hour for the past few weeks.
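
The dependency described above can be illustrated with a minimal Python sketch. It assumes a simplified model in which each backup stores only the changes since the previous one, so a restore point is usable only if every earlier link in the chain is intact. The names `Snapshot` and `restorable_points`, and the labels in the sample chain, are hypothetical and do not correspond to any VMware or ZFS API.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    label: str    # illustrative label for the restore point
    intact: bool  # False if this link of the chain is corrupt


def restorable_points(chain):
    """Return the labels of the snapshots that can still be restored.

    Assumption: each snapshot depends on all earlier ones, so the first
    corrupt link makes itself and everything after it unusable.
    """
    usable = []
    for snap in chain:
        if not snap.intact:
            break
        usable.append(snap.label)
    return usable


# Hypothetical chain: one corrupt link, followed by a recent backup
# that was written correctly but is built on top of the corrupt one.
chain = [
    Snapshot("two-weeks-ago", True),
    Snapshot("one-week-ago", False),
    Snapshot("10:30", True),
]
print(restorable_points(chain))  # ['two-weeks-ago']
```

Under this model, the most recent backup is lost even though it is itself intact, which matches the situation with the 10:30 backup.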

Corrective measures in progress:
- 3 levels of backup:
  - we will set up an application-level backup: Exchange has its own
    backup system, and we will use it so that
    Exchange backs up its own databases.
  - we will set up a system-level backup: using
    VMware Data Recovery, we will back up the vmdk of every VM.
  - we will set up a storage-level backup: on ZFS,
    we will be able to go back one week instead of one hour. This backup
    is also saved on other media in another data centre.

- Automatic crash handling within a few minutes:
  - we will put in place an automatic procedure for when a
    VM crashes. As soon as an Exchange system has been down for a few minutes,
    we will start a new preconfigured Exchange system
    with all the email accounts and put it into production within 15 minutes of the crash.
    That way, customers can still receive new emails and continue to communicate.
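
As a rough illustration of the automatic procedure described above, the decision logic could look like the following Python sketch. The 5-minute threshold is an assumed value for "a few minutes", and the function names are hypothetical; the sketch only models the decision and does not start any real system.

```python
DOWNTIME_THRESHOLD_S = 5 * 60  # assumed value for "a few minutes"
PRODUCTION_TARGET_S = 15 * 60  # from the report: in production 15 minutes after the crash


def should_start_standby(down_since_s, now_s, standby_started):
    """Decide whether to launch the preconfigured standby Exchange system."""
    if down_since_s is None:  # primary system is up
        return False
    if standby_started:       # already launched, do not start another one
        return False
    return now_s - down_since_s >= DOWNTIME_THRESHOLD_S


def met_production_target(crash_s, in_production_s):
    """True if the standby went into production within the 15-minute target."""
    return in_production_s - crash_s <= PRODUCTION_TARGET_S


print(should_start_standby(down_since_s=0, now_s=120, standby_started=False))  # False
print(should_start_standby(down_since_s=0, now_s=300, standby_started=False))  # True
print(met_production_target(crash_s=0, in_production_s=840))                   # True
```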

In parallel, we will take care of recovering the old emails from the 1st, 2nd and 3rd backup levels.
The goal is to keep downtime for the current service to a minimum.

We will contact the customers affected by this crash
to explain what happened. Of the 77 domains, only 2 customers contacted
OVH support. It is therefore important to explain to the other 75 what happened and what we are
going to do from now on to prevent it from happening again.

The 7 domains will be free forever.
The 70 domains will be free till the end of the day.

Date: 2011-10-13 09:03:52 UTC
The databases are still being repaired.
The account data on the most important domains was successfully restored.

Work is continuing through the night, and we hope to be able to continue during the day.

Date: 2011-10-11 15:01:33 UTC
We are going to add an 'ovh-recover' directory to the affected accounts.
We are currently inserting the missing emails progressively.

Date: 2011-10-11 11:54:43 UTC
63 domains (with their emails) have been recovered
out of the 77 domains hosted on the VM.
We are checking whether the rest can be recovered.


Date: 2011-10-11 11:53:35 UTC
The VM xc11, which provides the Exchange server
for some customers, crashed at 11:10 for no apparent reason.
After attempting a restart, we found that
the data was corrupted and that it was not possible
to bring the VM back up as it was.


We looked through the backups in order to retrieve
a backup of the affected VM. We
take a backup of the VMware VM every hour.
We found that the data in the backups had been
corrupted over a span of 2 days.


We have a second backup system based
on ZFS, but it keeps backups for 24 hours
only. Since the data had been bad for 2 days,
the ZFS backup was of no use.
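
The reasoning above comes down to comparing the retention window with the age of the corruption. Here is a minimal Python sketch, assuming a clean restore point exists only if the oldest retained backup predates the start of the corruption. The function name and the exact timestamps are illustrative.

```python
from datetime import datetime, timedelta


def clean_restore_point_exists(now, retention, corruption_start):
    """True if the oldest retained backup predates the corruption."""
    oldest_backup = now - retention
    return oldest_backup < corruption_start


now = datetime(2011, 10, 10, 11, 10)        # illustrative time of the crash
corruption_start = now - timedelta(days=2)  # data bad for 2 days, per this update

# 24-hour retention: every retained backup is already corrupt.
print(clean_restore_point_exists(now, timedelta(hours=24), corruption_start))  # False
# One-week retention: a backup from before the corruption would still exist.
print(clean_restore_point_exists(now, timedelta(days=7), corruption_start))    # True
```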


So we worked in 2 directions: trying to restore
the VM to its working state of 5 days ago, and repairing
the state of the VM as of 11:15.


Each time, we have to copy 60Gb of data to remount the VM, and then
rework the data to make it bootable.
Each operation takes 1h30.
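
To give a sense of the cost of each attempt, the figures above can be turned into a quick estimate. This Python sketch assumes that "60Gb" means 60 gigabytes and that the 1h30 covers both the copy and the rework, so the throughput figure is an effective rate, not a measured transfer speed.

```python
ATTEMPT_HOURS = 1.5  # "1h30" per operation, from the report
DATA_GB = 60         # assuming "60Gb" means 60 gigabytes


def total_hours(attempts):
    """Total time spent if each restore attempt takes 1h30."""
    return attempts * ATTEMPT_HOURS


def effective_mb_per_s():
    """Effective rate over the whole operation (copy plus rework)."""
    return DATA_GB * 1000 / (ATTEMPT_HOURS * 3600)


print(total_hours(2))                  # 3.0
print(round(effective_mb_per_s(), 1))  # 11.1
```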


That is why we decided to start a VM
with all the customer accounts but with no emails.
This came very late; we should have done it yesterday at 13:00.
At least all new emails will now arrive without delay.
We would then only have had the old emails to recover.

At 3:30 we succeeded in remounting the 11:10 VM read-only (RO),
and we are retrieving the emails from this VM in order
to reinject them into the new VM.


This operation is still in progress.


Several consequences follow from this unusual problem:
- we are going to increase the retention period of the ZFS backup
so that a VM can be recovered very quickly in a working state
- set up an automatic procedure for starting a new, blank
VM in case of a serious problem.
- understand what could have happened at the VM level
and see how we can avoid it. It is an Exchange database corruption
and therefore a problem internal to the system.

Date: 2011-10-11 11:09:44 UTC
We succeeded in repairing the snapshot that was at the origin of the whole chain of problems.

So we could recover the original server without having to go back in time.

We are checking the status of the recovered server and all the Exchange options.

Date: 2011-10-11 10:45:26 UTC
We have set up the service on a new Exchange server so that new emails are received and can be retrieved.

In parallel, we are working on 2 versions of the backup: the first from yesterday at 20:00 and the other
from 5 days ago.

Once one of the backups has been recovered, we will reinject the emails into the new production Exchange server.

Date: 2011-10-11 10:42:06 UTC
We are setting up an Exchange server to which we are going to attach all the backup disks.
From there, we are going to recover the original database.

In parallel, we are checking the snapshots at the infrastructure level. A snapshot that did not complete on 8 October
seems to be the origin of the problem. We are trying to recover this snapshot.

Date: 2011-10-10 20:44:51 UTC
The recovery from the midday snapshot did not work.
We are now restoring the oldest backup.

Date: 2011-10-10 15:51:49 UTC
We are recovering the midday snapshot. The copy is in progress. It's at 60% now.
Posted Oct 10, 2011 - 12:02 UTC