OVHcloud Private Cloud Status

FS#10988 — hosts
Incident Report for Hosted Private Cloud
Resolved
The monitoring system has detected a large quantity of faulty hosts.
We will investigate.

Update(s):

Date: 2014-07-18 08:19:22 UTC
The bug impacted some of the remaining servers in the infrastructure.

Tomorrow, all remaining servers running this driver version will be rebooted to update the network driver and ensure that they are no longer affected.


Date: 2014-07-02 17:08:15 UTC
We are checking the entire infrastructure to see if there are any other hosts affected by this update.

Date: 2014-07-02 17:07:26 UTC
All host servers are up-to-date and the tickets concerning the impacted machines have been opened.

We now need to reboot the host servers to apply the driver update.

Date: 2014-07-02 13:09:11 UTC
The first host servers are up-to-date.

A reboot is necessary to apply the update.

We will open a ticket for the relevant host servers.

Date: 2014-07-02 13:07:44 UTC
The ESXi version is not the relevant factor: a few host servers still have the bugged version of the driver.

~ # vmware -lv
VMware ESXi 5.0.0 build-721882
VMware ESXi 5.0.0 Update 1
~ # esxcli software vib list |grep igb
net-igb 3.2.10-1OEM.500.0.0.472560 Intel VMwareCertified 2013-05-14

We will force an update of the drivers.
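
The check above can be automated across hosts. The following is a minimal sketch in Python, assuming the output of `esxcli software vib list` has already been collected as text; the function names are hypothetical, and treating the version string shown above as the affected build is an assumption drawn from this report, not a list published by VMware or Intel.

```python
# Version string copied from the esxcli output in this report.
# Assumption: this is the build that needs replacing.
AFFECTED_IGB_VERSION = "3.2.10-1OEM.500.0.0.472560"


def igb_version(vib_list_output):
    """Return the net-igb version found in `esxcli software vib list` text, or None."""
    for line in vib_list_output.splitlines():
        fields = line.split()
        # The first column is the VIB name, the second its version.
        if len(fields) >= 2 and fields[0] == "net-igb":
            return fields[1]
    return None


def needs_update(vib_list_output):
    """True when the host still reports the affected igb driver build."""
    return igb_version(vib_list_output) == AFFECTED_IGB_VERSION


sample = "net-igb 3.2.10-1OEM.500.0.0.472560 Intel VMwareCertified 2013-05-14"
print(needs_update(sample))
```

An exact string match is used rather than a version comparison because the report identifies only one affected build.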

Date: 2014-07-02 10:31:49 UTC
The same issue has just arisen.

We are currently checking all the hosts and verifying their drivers.

Date: 2014-06-27 08:28:37 UTC
VMware engineering found corrupted data in the headers of the network frames.
The exact reason for the corruption is unknown, but it originates from the Intel igb driver.
The firmware and driver versions currently installed are not the latest, so we will update the drivers.

Log analysis (Bug ID 1272069):
The PSOD occurs because the head pointer of (&(container->slabInfo[2].pktList))->csList is corrupted.

[esx-host3922.ovh.net-2014-06-18--09.04]

(gdb) f 4
#4 PktContainerGetPkt (slabType=PKT_SLAB_HIGH_MEM, container=0x410004c49f00, index=2) at bora/vmkernel/net/pkt.c:3733
3733 entry = PktList_PopHead(&(container->slabInfo[index].pktList));
(gdb) p container
$11 = (PktContainer *) 0x410004c49f00
(gdb) p &(container->slabInfo[index].pktList)
$12 = (PktList *) 0x410004c49fa8
(gdb) p ((PktList *) 0x410004c49fa8)->csList
$13 = {
slist = {
head = 0x61646e656974656c,
tail = 0x4100085e4980
},
numElements = 11
}
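
One way to see that the head pointer holds data rather than an address is to read its bytes as text. The sketch below is in Python and assumes a little-endian x86-64 host; the helper name `pointer_as_text` is hypothetical. It shows that the `head` value consists entirely of printable ASCII characters, while the `tail` value, which lies in the same address range as `container`, does not.

```python
def pointer_as_text(value, width=8):
    """Return the pointer's in-memory bytes as ASCII text, or None if not all printable.

    Assumption: the host is little-endian, so the lowest byte comes first in memory.
    """
    raw = value.to_bytes(width, "little")
    if all(0x20 <= b < 0x7F for b in raw):
        return raw.decode("ascii")
    return None


head = 0x61646E656974656C  # corrupted head pointer from the dump
tail = 0x4100085E4980      # tail pointer from the same list

print(pointer_as_text(head))  # letienda
print(pointer_as_text(tail))  # None
```

A pointer whose bytes all decode as text is consistent with payload data having been written over the list head, which matches the frame corruption described in this update.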


Date: 2014-06-26 15:56:57 UTC
The root of the problem has been found.

"Engineering have analyzed the dumps and found that the PSOD's were due to corruption which originated from the igb network driver."

We will escalate the SR in order to find the root of the corruption.

Date: 2014-06-18 11:32:51 UTC
We have opened an SR with VMware for the root cause analysis.

The diagnosis is in progress.

Date: 2014-06-18 11:32:18 UTC
All servers have been checked and rebooted.
We are checking on the monitoring system that everything is back to normal.


Date: 2014-06-18 11:32:07 UTC
Over half the affected hosts have been checked and rebooted.
The intervention is in progress.

Date: 2014-06-18 11:29:44 UTC
The affected hosts all appear to be running version 5.0 Update 1.
They are in a purple screen (PSOD) state.
They are being rebooted.
Posted Jun 18, 2014 - 11:28 UTC