Before these tasks, at 20:03, the London router crashed and did not come back up on its own. It had a boot problem, and we had to connect over a serial cable to finish booting it.
http://status.ovh.co.uk/?do=details&id=326
Right afterwards, the Frankfurt router crashed for lack of memory. Then the Amsterdam router crashed as well. That makes 3 routers in 1 hour.
After analysis, we think that with the safety features we have put in place and the additional BGP data to synchronize, the London, Amsterdam and Frankfurt routers ran out of unfragmented RAM. They had not been restarted for a long time and their RAM had become fragmented.
But it is also related to MPLS: the router works without it. Indeed, we disabled MPLS and brought Frankfurt back up link by link. The router is now stable with 200 MB free out of 1 GB.
One of the solutions was ordered three weeks ago and will arrive in 5 weeks: 2 ASR1000 to act as route collectors. Instead of setting up 1 BGP session between every pair of routers, we are going to set up only 2 BGP sessions per router, and the 2 collectors will calculate the routes and then simply propagate the information to the other routers.
It will also use less CPU and less RAM.
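To give an idea of the difference in scale, here is a rough sketch of the session count in both designs. The router counts are hypothetical (the real number of routers is not stated here), the function names are made up for the illustration, and any session between the 2 collectors themselves is left out.

```python
# Rough comparison of BGP session counts. Router counts are hypothetical.

def full_mesh_sessions(routers):
    """Total sessions when every router peers with every other router."""
    return routers * (routers - 1) // 2


def collector_sessions(routers, collectors=2):
    """Total sessions when each router peers only with the collectors.

    Assumption: sessions between the collectors themselves are not counted.
    """
    return routers * collectors


if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        print(
            f"{n} routers: full mesh = {full_mesh_sessions(n)} sessions "
            f"({n - 1} per router), collectors = {collector_sessions(n)} "
            f"sessions (2 per router)"
        )
```

In the full mesh the load on each router grows with the number of routers, while with the collectors it stays at 2 sessions per router whatever the size of the network.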
This matters all the more because the network is made redundant through loops: the same information from the same router arrives at each router by different paths at different times, and each router has to recalculate everything several times. That means a lot of permanent calculation. The current configuration has reached its limits and must be improved. It will be done. Another solution would be BGP confederations, so that changes are handled only within the confederation. We prefer the route collectors.
The second solution is to replace the Cisco 6509 with the Nexus 7016. We received one for the lab and it is now being tested. We are waiting until September to order 5 Nexus 7016 because ... the cards that we need are not yet available. Available in September ...
In addition, some FR and ES customers were also impacted by these problems, especially on vss-1 and vss-2. When the routing changes and the BGP tables have to be recalculated, the vss routers cannot keep up: they run at 100% CPU for a very long time and the ARP process stops answering ARP requests from the servers. OSPF drops, BGP too, customer servers expire the router's MAC address and get no reply to their requests, and pings fail. With the route collectors we reduce the CPU used by BGP, which will already fix many problems. A second solution is to move the router's proxy-ARP function to a server specially designed for it. That is going to be coded and put in place.