Web Cloud Status - FS#10325

OVHcloud Web Hosting Status

FS#10325 — filerz55.240

Incident Report for Web Cloud

Resolved

The server is not responding.
We are rebooting it.

Update(s):

Date: 2014-03-27 14:51:38 UTC
All accounts have been migrated.

Date: 2014-03-05 08:28:22 UTC
99% of the accounts have been migrated.
We are checking the remaining accounts.

Date: 2014-03-03 16:06:07 UTC
80% of the websites have been moved to a new filer in read-write mode using the backup data. The migrations should be completed tonight.

We are also still working to recover the data from the original filer.

Date: 2014-02-27 14:01:46 UTC
We had a series of hardware problems on the server which
caused ZFS filesystem corruption. The data is readable, but
the server is unstable (the system crashes every 30 minutes).
We are trying to stabilise the system and begin recovering the
data onto a new filer. To do so, we need to find a way to block all
automatic ZFS operations and make the pool read-only, so
that it does not make everything crash again.

We are also retrieving the latest backup, which is stored in Roubaix.
That operation would take 24 hours, so to speed things up we have
collected the backup disks in Roubaix and are taking them
directly to Paris.

In 3-4 hours' time, the filer with the backup data should be up.
This will bring the 1209 websites affected by the failure back online.
We then hope to refresh this backup with the data from the unstable
filer, which we think we can retrieve in approximately 24 hours.
For that, we need to look into, or even patch, the ZFS code to make
the filer stable, at least in read-only mode.

We apologise for the trouble. The total failure of a filer is very
rare. In this instance the backup exists and is in our hands, so there is
nothing to worry about, and our engineers are working on recovering the
most recent data from the filer.

Date: 2014-02-27 06:36:46 UTC
The filer is unstable again. We are intervening.

Date: 2014-02-26 11:29:48 UTC
The service is still unstable for this filer, we have to disable it.
We are enabling a cluster which will be dedicated to filerz55.

Date: 2014-02-26 09:16:20 UTC
We have doubled the log disk redundancy and launched a verification of the whole data pool.

The service is up but remains disrupted by the operation in progress, which should take over 6 hours.

Date: 2014-02-26 09:14:27 UTC
The system is unstable.
We are changing the pool configuration.

Date: 2014-02-26 09:13:51 UTC
We continue to monitor the server to check that the problem does not recur.

Date: 2014-02-26 09:12:26 UTC
The server is up again.

Date: 2014-02-26 06:50:41 UTC
We are transferring the data disks to the new system.

Date: 2014-02-26 06:50:01 UTC
The whole cluster is affected by the filer.

Date: 2014-02-26 06:49:36 UTC
We are replacing the server with a spare.

Date: 2014-02-26 06:49:15 UTC
The server is back to normal.

Date: 2014-02-26 06:48:57 UTC
We have detected a failure on the server.
We are performing a hardware check.

Date: 2014-02-26 06:47:26 UTC
The server is back.

Posted Feb 26, 2014 - 06:46 UTC