OVHcloud Network Status

FS#4440 — VSS split
Scheduled Maintenance Report for Network & Infrastructure
Completed
Hi,
For the Roubaix 2 data centre, we decided to build the network
with 100% availability as the objective. We therefore used
Cisco 6509 switches in a VSS configuration: a system based on
2 chassis functioning as one. With 2 chassis, everything is
doubled, and so, in theory, we get 100% availability.
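
For context, the two chassis of a VSS pair present themselves as
a single logical router; the standard IOS commands below (a
sketch, output omitted) show the active/standby roles the two
chassis take on:

show switch virtual
show switch virtual role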

In the real world, we have had plenty of problems with the VSS,
which cause cuts in the service and therefore do not meet the
initial objective. Basically, we have a chronic problem with BGP:
at the least modification of the routing table, the CPU of the
router sits at 100% for at least 15 minutes. Never mind. But in
late 2009 we put in place strong protections on the internal
network, which consist of isolating each server from the others.
We do this with private VLANs and an ARP proxy. A fairly
standard solution: the router responds in place of all the
servers and ensures routing even within the same VLAN. Everything
is therefore very safe. However, the router must answer the ARP
requests for the MAC addresses of all the servers, and the
process that handles this on the VSS takes a lot of CPU.
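
As a rough sketch of this isolation scheme (standard IOS private
VLAN plus local proxy ARP; the VLAN numbers, addresses and
interface names are illustrative, not our actual config):

vlan 100
 private-vlan primary
 private-vlan association 101
vlan 101
 private-vlan isolated
!
interface GigabitEthernet1/1
 ! customer server port: isolated at L2 from the other servers
 switchport mode private-vlan host
 switchport private-vlan host-association 100 101
!
interface Vlan100
 ip address 192.0.2.1 255.255.255.0
 private-vlan mapping 101
 ! the router answers ARP in place of the isolated hosts, so even
 ! server-to-server traffic in the same vlan is routed through it
 ip local-proxy-arp

With this in place, every ARP request from every server lands on
the router, which is exactly the load that weighs on the VSS CPU.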

Normally, this works without any problem. But the network only
has to recalculate its routing tables for BGP to take 100% of
the CPU and prevent the ARP/MAC processes from running. The
result: the servers no longer get answers to their ARP requests
and there is a cut in the service of 1, 3 or 8 minutes,
depending on the size of the BGP table recalculation.
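
For illustration, this contention is visible directly in the
router's process accounting; a standard IOS check (a sketch,
output omitted):

! see which processes are eating the CPU (BGP Router, ARP Input)
show processes cpu sorted | include BGP|ARP Input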

We plan to fix the BGP problem with dedicated routers that will
do nothing but that: route reflectors. Normally, we should have
received the hardware this month, but the order was improperly
recorded between the distributor and the manufacturer ... and we
will receive it at the end of September ... We decided not to
wait for that delivery and to implement a solution this weekend.
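
The principle, as a minimal sketch of a reflector's config in
standard IOS BGP syntax (the AS number and addresses are
illustrative): the clients peer only with the reflector instead
of with every other router, so the heavy table recalculations
run on a dedicated box rather than on the VSS.

router bgp 64500
 ! each iBGP client is declared on the dedicated reflector
 neighbor 198.51.100.1 remote-as 64500
 neighbor 198.51.100.1 route-reflector-client
 neighbor 198.51.100.2 remote-as 64500
 neighbor 198.51.100.2 route-reflector-client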

But the MAC problem will always remain. We therefore decided to
break the VSS configurations and come back to what has always
worked well: the router in a single chassis. We have slightly
fewer than 30 routers in the mono-chassis configuration and they
pose no problem. It is only in the double-chassis configuration
that we have problems. So we will break up the chassis pairs.

So, starting last week, we have been making the modifications
on the VSS in order to move to a configuration based on a
single chassis.

We will carry this out in four steps:
- All the datacentre links connected to chassis 2 will be
re-connected to chassis 1. No cut in the service,
since everything will run on chassis 1.
- All the links to the Internet connected to chassis 2 will be
re-connected to chassis 1. No cut in the service,
since everything will run on chassis 1.
- Power-off of chassis 2. No cut in the service,
since chassis 2 will no longer be used.
- Change the configuration of chassis 1 to the mono-chassis
version, as sketched below. Then we will have to hard-reboot
the router, which will mean about a 15-minute break in the
service. We will do it at 4:00 am towards the end of next week
if all goes well.
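
For step 4, the Catalyst 6500 has a standard conversion command
(a sketch; as the updates below show, the delicate part is
removing every VSS leftover from the config before the reboot):

! convert the surviving chassis from VSS back to standalone mode;
! the supervisor rewrites its config and reloads, hence the break
vss-1-6k# switch convert mode stand-alone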

We will first attack vss-2, which is the one causing the most problems.

Normally, from step 4 onwards, we will have no more BGP
problems. Depending on the configuration across the 2 chassis,
the problem may even be resolved from step 3, or perhaps step 2,
because everything will already be operating on a single chassis.
But we are not sure. In any case, at the end of step 4 it will
be fixed.

Once BGP is fixed, we think it is likely that the MAC problem
will be fixed as well. If BGP does not work well in a double
chassis, then maybe other processes do not function well in a
double chassis either? We'll see that, too.

We regret all the small cuts that the customers of Roubaix 2
have experienced recently, which are mainly due to the problems
described here. It was the wrong hardware choice, and we take
full responsibility for it. We thought that the manufacturer
would solve the CPU problems, but according to him it's normal.
This hardware is therefore incompatible with our needs. We'll
change it. We managed the situation badly: instead of asking the
manufacturer for help, we should simply have acted and found
another solution. An error in problem management.

To continue in the spirit of transparency: you may have noticed
problems in London, Amsterdam and Frankfurt 14 days ago. In
fact, 14 days ago we added redundancy links between
London/Amsterdam and Paris/Frankfurt. Large and heavy
investments were decided on in order to make the backbone
completely secure and 100% available even in case of problems
on the optical fibre. Adding these links on the routers
saturated the routers' available RAM and crashed London, which
dragged Amsterdam and Frankfurt down with it for the same
reasons. A router crash means a BGP recalculation, and therefore
100% CPU on the vss ... that is why these crashes had
consequences on the service in Roubaix 2 :(
We fixed the problem by disabling MPLS, which we do not need but
which was taking 20% of the RAM. Since then it has been stable.
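
A sketch of that kind of change in standard IOS (the interface
name is illustrative): disabling MPLS forwarding releases the
label bindings and the memory they hold.

configure terminal
 no mpls ip                  ! disable MPLS forwarding globally
 interface TenGigabitEthernet1/1
  no mpls ip                 ! and on each core-facing interface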

We had planned to change all the routers during the holidays,
but the hardware we wanted to introduce is not available, and
what is available does not work. We have in fact received the
new Cisco Nexus 7000, and its BGP does not work; it just
generates error messages ... Brand-new hardware, and there you
go ... Yet another bad hardware choice. So there are big
reconsiderations in prospect ... This will also delay the
planned router changes. We are therefore talking right now with
all the manufacturers on the market to see what we are going to
deploy instead of what we had planned. Unexpected work which
will cause delays on other projects ...

In short ...
I think we cannot be more transparent about these recent events.

Best regards,
Octave


Update(s):

Date: 2010-09-21 23:35:38 UTC
We have finished the vss split :)


Date: 2010-09-21 22:50:44 UTC
The chassis is back online in standalone mode. Traffic has recovered in Roubaix3 (around a 10-minute blackout).


Date: 2010-09-21 22:48:08 UTC
go

Date: 2010-09-21 22:47:54 UTC
We are restarting chassis #1 in standalone mode. Traffic in Roubaix3 will be cut during the reboot.



Date: 2010-09-21 22:46:42 UTC
We are disconnecting the chassis #2.


Date: 2010-09-21 19:09:19 UTC
we will split the last vss on standalone
configuration. it will take 30 minutes from
midnight.



Date: 2010-09-21 17:03:36 UTC
Tonight, we will do the last operation of vss split on the vss-3. As for the interventions on vss-1 and 2, start up from
mid-night.

Date: 2010-09-08 13:30:08 UTC
We have started migrating the uplinks of the client switches to chassis #1.



Date: 2010-09-02 07:59:30 UTC
The card has booted.

The failover IPs (IP FO) of the HG are up as well.

Well, let's hope that's the end of the catastrophes.

Date: 2010-09-02 07:57:47 UTC
Sep 2 04:08:58 GMT: %L2_AGING-SP-4-ENTRY_DNLDFAIL: Slot 8: Download entries failed, reason EARL_ICC_ERR
Sep 2 04:08:58 GMT: %ONLINE-SP-6-LCC_CONFIG_FAIL: Module 8. LCC Client UNSOLICITED SCP failed to configure at 41636D60
Sep 2 04:08:58 GMT: %ONLINE-SP-6-INITFAIL: Module 8: Failed to configure forwarding
sm(cygnus_oir_bay slot8), running yes, state wait_til_online
Last transition recorded: (disabled)-> disabled2 (restart)-> wait_til_disabled (timer)-> may_be_occupied (timer)-> occupied (known)-> can_power_on (yes_power)-> powered_on (real_power_on)-> check_power_on (timer)-> check_power_on (power_on_ok)-> wait_til_online (reset_timer_online)-> wait_til_online
Sep 2 04:08:58 GMT: %C6KPWR-SP-4-DISABLED: power to module in slot 8 set off (Module Failed SCP dnld)

Date: 2010-09-02 07:57:21 UTC
We activated better L2 load balancing:

vss-1-6k(config)#port-channel load-balance src-dst-mixed-ip-port
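
(This hash mode balances traffic across the port-channel members
using the source and destination IP addresses together with the
L4 ports, which spreads flows more evenly than the default.)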

Date: 2010-09-02 07:56:42 UTC
Everything is up. It takes about 20 minutes for the router to
come back and bring everything up.

Date: 2010-09-02 07:54:17 UTC
It booted
Sep 2 02:34:19.707: %SYS-SP-6-BOOTTIME: Time taken to reboot after reload = 340 seconds

The router is detecting the cards and will
bring all the links back up.

Date: 2010-09-02 07:53:14 UTC
Here we go again.

Date: 2010-09-02 07:52:32 UTC
Well, we will restart.

Date: 2010-09-02 07:51:56 UTC
We are now running mono-chassis on vss-1, but therefore still in vss mode. No further downtime is planned for tonight. We will look into the config to determine why the router detects vss mode at start-up, and re-schedule an intervention for the switch.
This new intervention will probably take place tomorrow night, to be confirmed. We are sorry that this downtime took longer than expected.

Date: 2010-09-02 07:40:11 UTC
Impossible to restart the router in standalone mode. We are restarting in vss mode for the moment.

Date: 2010-09-02 07:38:30 UTC
The router did not restart in standalone mode due to residual vss commands in the config (the peer-link ports). It therefore booted in single-chassis vss mode, not standalone. We have completely cleared these ports from the config and reloaded the conf onto the CF card via another chassis. We will reboot the chassis again.
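
As an illustration of that cleanup (the interface numbers are
hypothetical; default interface is the standard IOS command to
wipe a port's configuration back to its defaults):

! wipe the residual VSL / peer-link port configuration
default interface TenGigabitEthernet5/4
default interface TenGigabitEthernet5/5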

Date: 2010-09-02 07:33:56 UTC
We are restarting chassis #1.

Date: 2010-09-02 07:33:15 UTC
We have cut the power to chassis #2 of vss-1-6k. We are preparing to restart chassis #1.

Date: 2010-09-02 07:31:54 UTC
We are starting the maintenance. The network traffic routed by vss-1 will be unavailable, as will that of the failover IPs pointing to the HG2010 (but for the HG2010, traffic flows normally to the main IP). We are now finishing the preparations. The cut will occur in approximately 30 minutes.

Date: 2010-09-01 14:47:54 UTC
Tonight at midnight, we have scheduled the same split operation on vss-1 as the one performed on vss-2 on 10 August. We expect about 30 minutes of traffic interruption for the networks routed via vss-1-6k.



Date: 2010-08-31 15:22:46 UTC
We are moving the backup link vss-1 <> bru-1 onto vss-2 in order to free up one of the 10G ports on vss-1.

Date: 2010-08-31 15:22:33 UTC
The change of the bay switch uplinks is complete. We must now rename the port-channels, since standalone mode only allows certain numbers.

Date: 2010-08-31 15:22:19 UTC
We have switched 1x10G to Frankfurt and 1x10G to Amsterdam from chassis #2 to chassis #1 in order to balance the traffic between the two chassis. We are continuing to switch the bay switch uplinks.


Date: 2010-08-31 15:22:06 UTC
We will perform the same operation on vss-1. We are preparing the ground by moving the bay switch uplinks from chassis #2 to chassis #1.

Date: 2010-08-31 15:21:54 UTC
The split of vss-2 has been performed. It is operating
better and uses less CPU.

http://status.ovh.net/?do=details&id=382

Date: 2010-08-31 15:21:38 UTC
We will permanently disconnect the "virtual link" connections between the two chassis of vss-2.

Date: 2010-08-31 15:21:25 UTC
Still the CPU problem with the ARP Input process:

vss-2-6k#sh proc cpu | i \ ARP Input
  11   130264040  1311514886         99  8.55%  9.33%  9.74%   0 ARP Input
  11   130264200  1311514965         99  8.77%  9.28%  9.73%   0 ARP Input
  11   130265580  1311515094         99 24.14% 10.47%  9.97%   0 ARP Input
  11   130266244  1311515330         99 10.87% 10.51%  9.98%   0 ARP Input
  11   130266764  1311515581         99  7.03% 10.23%  9.93%   0 ARP Input
  11   130267360  1311515785         99  8.47% 10.09%  9.91%   0 ARP Input
  11   130268056  1311516086         99 10.39% 10.11%  9.92%   0 ARP Input
  11   130268812  1311516406         99 12.03% 10.27%  9.95%   0 ARP Input
  11   130269608  1311516695         99  9.83% 10.23%  9.95%   0 ARP Input


Date: 2010-08-31 15:21:08 UTC
The migration of the 1G and 10G ports is complete. We are ready to shut down chassis 2.


Date: 2010-08-31 15:20:57 UTC
We have moved a little more than half of the uplinks of chassis #2 onto chassis #1, which unbalances the traffic on the uplinks. So we are starting to migrate some of the 10G links from chassis #2 to #1 (eng, ams, rbx-97).


Date: 2010-08-31 15:20:44 UTC
We are starting work on vss-2. The uplinks of each vlan are being migrated gradually from chassis #2 to chassis #1.
Posted Aug 31, 2010 - 14:52 UTC