Network Outage Report, 09-20-2017

  • Wednesday, 20th September, 2017
  • 4:52pm

On Wednesday, September 20, 2017, the VIRTBIZ DAL1 datacenter experienced a network impairment that prevented many customers from accessing the Internet.  Although we work very hard to ensure your uptime and satisfaction with our service, adversity can strike at the most inopportune time.  For our customers impacted by the outage that resulted from today’s event, we offer our sincere apology.

This is a report on the event that details what happened and what we have done to address it.

At 6:53am, our BORDER2 router reported BGP loss with the carrier Zayo.  An event like this should go largely unnoticed, as the border routers adjust to direct traffic across alternate carriers.  Approximately 30 seconds later, the primary interface between BORDER1 and BORDER2 began to suffer errors at the fabric, which caused the interface to shut down.  In a typical situation this would be recoverable, as traffic would then traverse one of two alternate routes between the borders.  Unfortunately, the linecard crashed, causing a fabric error at the frame, at which point the router automatically issued a reload command at 6:54am.  Routing was lost while the router restarted.

When the router completed booting, it came online with a default configuration, as if it were reset to factory defaults.  This should not have been the case.  In a typical situation, the system should reboot into the saved configuration.   After quickly checking to be sure there were no lingering errors, technicians reloaded the router with its production configuration and re-established connectivity.  Customers were reconnected shortly thereafter.

WHAT WENT WRONG?   Several individual issues combined to make a “perfect storm” situation.  Specifically:

1. BGP failure from our carrier
2. Excessive errors followed by linecard crash at our router
3. Router reloaded without configuration
4. Failover routing through alternate borders did not occur

Our senior management and technicians have been working together since service was restored to thoroughly investigate each issue and examine how they may be related.    We found that:

1. Our connection with the Zayo network has encountered errors recently.  We have taken that connection out of service while the fiber-optic path between DAL1 and their termination point is inspected for defects.

2. We’ve run diagnostics and self-tests on all linecards.  After consulting with Cisco and a third-party consultant, we have replaced the linecard responsible for the fabric-level crash as a preventative measure.  Cisco advises that this error can occur from time to time and is typically cured with a reload, although the underlying behavior may be fixed in a forthcoming software update.

3. When the routing software was last updated, the config-register on the router was changed by the loader from 0x2102 (normal boot, loading the saved configuration) to 0x2142 (same boot behavior, but ignoring the saved configuration).  This went unnoticed at the time and was only discovered after this event.  We have corrected this entry.

4. Because of the state in which the linecards on BORDER2 had dropped, the physical connection remained intact.  Even though no traffic was passing between the devices, the gateway routers continued to push traffic to the affected border.  This was a flaw in the configuration, exposed by engineering changes that had accumulated on the network over time.  We have identified the single line of configuration responsible for this issue and have corrected it.
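
For readers curious about item 3, the two config-register values differ by a single bit.  On Cisco routers, bit 6 (0x0040) of the configuration register tells the system to ignore the saved configuration at boot.  The short Python sketch below is illustrative only and is not a tool we run in production; the function names are hypothetical and were chosen for this example.

```python
# Illustrative sketch: decode the two config-register values named in item 3.
# The function names here are hypothetical, chosen only for this example.

IGNORE_SAVED_CONFIG_BIT = 0x0040  # bit 6: skip the saved configuration at boot


def ignores_saved_config(config_register: int) -> bool:
    """Return True if bit 6 is set, meaning the saved configuration is skipped."""
    return bool(config_register & IGNORE_SAVED_CONFIG_BIT)


def changed_bits(before: int, after: int) -> int:
    """Return a mask of the bits that differ between two register values."""
    return before ^ after


if __name__ == "__main__":
    for value in (0x2102, 0x2142):
        print(f"{value:#06x}: ignores saved config = {ignores_saved_config(value)}")
    print(f"bits changed: {changed_bits(0x2102, 0x2142):#06x}")
```

A check of this kind, run against the register value after every software update, is one way a change like this could be caught before the next reload.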

Wednesday has been a long day for us here at VIRTBIZ.  My goal with this report is to be as transparent as possible in providing you a window into our operations throughout this event.  That’s what I would want as a customer.  I hope that I have achieved that goal.  From the ladies in the front office answering the phones to the techs in the trenches, please be assured that everyone here is working to make sure we are doing everything possible to deliver the service you expect from us.  We understand and share the frustration this type of event can bring.

We (corporately) and I (personally) appreciate your patience as we have worked to restore service, examine the causes for the incident, and take steps to prevent a future event.

As always, if you need any further assistance, please let us know.

Chris Gebhardt, President
VIRTBIZ Internet Services | (972) 485-4125
