Archive    Sept. 2006     #54


By William Flanagan, Publisher

If you, like the telcos, set a target for high availability at “five nines” or 99.999% uptime, that leaves about 5 minutes of downtime--per year.   So how can you ever hope to reach that level on an IP network if a reboot of your main router takes up to 15 minutes to reestablish internal routing tables, external routing adjacencies, and user sessions?
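The five-minute figure is easy to check.  Here is a minimal sketch of the arithmetic in Python; the function and variable names are made up for illustration, and the year is averaged at 365.25 days.

```python
# Downtime budget implied by an availability target.
# Assumption: an average year of 365.25 days (525,960 minutes).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# "Five nines" leaves a little over five minutes per year.
five_nines_budget = downtime_budget_minutes(99.999)

# How many years of that budget does one 15-minute reboot consume?
reboot_minutes = 15
years_consumed = reboot_minutes / five_nines_budget

if __name__ == "__main__":
    print(f"Five-nines budget: {five_nines_budget:.2f} minutes per year")
    print(f"One {reboot_minutes}-minute reboot uses {years_consumed:.1f} years of budget")
```

By this arithmetic, a single 15-minute reboot uses up almost three years' worth of the five-nines allowance.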

That's a trick question, to see if you share the common assumption that the router will be completely out of action, unable to forward packets, during the entire period of the reboot.  That condition has been "normal" and accepted up until very recently.  In fact, the necessity of taking a router down for a reboot has been so well established in the minds of technical operations people that many still don't consider this kind of interruption to service as "downtime."  Instead, there is a category of time, often called "scheduled maintenance," that allows the network to be dead as far as customers can tell, without incurring any attributable downtime.

Business managers, convinced that outages during maintenance were necessary, gave the boys (and girls) in the back room a free pass, so to speak.  Orders were strict about maintaining the network in operation, often with financial penalties for excessive outages beyond the service level agreement.  But just in case you needed it (which you always did), here was a way to let the network down without taking any blame:  call it scheduled maintenance.

Unfortunately, from the customer's or end user's point of view, when the network is down it doesn't matter why.  Admittedly, the reboot outages usually (but not always) take place during off hours, 'round midnight on weekends and holidays.  No matter the local time, however, someplace on earth it's high noon and people are working, people who want connectivity to your data center behind this router.

Fortunately, just as the pressure to increase availability is mounting, a helpful technology appears.   Depending on the vendor, the router feature is called high-availability routing, non-stop routing, graceful switchover, stateful failover, or something similar.   In broad terms, these features all work the same way.  Details vary by vendor, but not usually by model--most vendors run the same software in boxes of all sizes.

As the headline above hints, the software we're talking about is mainly in the very large routers:   Alcatel  77xx Service Router;  Cisco CRS, 12000, and 7500;  Juniper T-Series;  and Avici's QSR, SSR, and TSR data switches (distributed by Nortel).

The non-stop feature set focuses on two key functions to perform without interrupting traffic:
1.  upgrade of software or hardware, and
2.  failover to backup hardware.

These are the very procedures that historically were allowed to incur the most downtime.   Careful planning has reduced the need to reboot for configuration changes.  Redundant control hardware keeps two copies of all configuration and forwarding information--if the "live" module fails, the backup module assumes control without taking time to learn the router's status.  For upgrades, the backup hardware, while off-line, accepts new firmware, which learns the existing configuration and forwarding tables before taking over control.  In this way, the transition need not cause the loss of any packets nor interrupt forwarding.
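The idea can be sketched in a few lines of Python.  This is a toy model only, not any vendor's implementation: the class names, the `sync_from` and `upgrade` methods, and the firmware labels are all hypothetical, and a real router mirrors far more state (routing adjacencies, user sessions) than two dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class ControlModule:
    """Hypothetical control module holding its own copy of router state."""
    name: str
    firmware: str
    config: dict = field(default_factory=dict)
    forwarding_table: dict = field(default_factory=dict)

    def sync_from(self, other: "ControlModule") -> None:
        # Copy configuration and forwarding state from the other module.
        self.config = dict(other.config)
        self.forwarding_table = dict(other.forwarding_table)


class Router:
    """Toy router with a live module and a backup kept in step with it."""

    def __init__(self, active: ControlModule, standby: ControlModule):
        self.active = active
        self.standby = standby
        self.standby.sync_from(self.active)

    def learn_route(self, prefix: str, next_hop: str) -> None:
        # Every update lands on both modules, so the backup never has to relearn.
        self.active.forwarding_table[prefix] = next_hop
        self.standby.forwarding_table[prefix] = next_hop

    def lookup(self, prefix: str):
        return self.active.forwarding_table.get(prefix)

    def switchover(self) -> None:
        # The backup already holds current state; just swap roles.
        self.active, self.standby = self.standby, self.active

    def upgrade(self, new_firmware: str) -> None:
        # Load new firmware on the off-line backup, let it learn state, then swap.
        self.standby.firmware = new_firmware
        self.standby.sync_from(self.active)
        self.switchover()
        # The former live module is now off-line and can be upgraded in turn.
        self.standby.firmware = new_firmware
        self.standby.sync_from(self.active)


if __name__ == "__main__":
    router = Router(ControlModule("A", "v1"), ControlModule("B", "v1"))
    router.learn_route("10.0.0.0/8", "192.0.2.1")
    router.upgrade("v2")
    print(router.active.name, router.active.firmware, router.lookup("10.0.0.0/8"))
```

The point of the sketch is the ordering: the backup learns the state before it takes control, so lookups succeed on either side of the swap.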

The Big Box solution has proven that 99.999% availability is practicable.  The Big Price Tag may render this approach impractical because the capital budget can't cover the initial cost.  But in many situations where downtime is expensive (high value financial transactions, manufacturing or transportation operations, security) it could be a matter of "pay now or pay later."

For those on a tighter budget, Virtual Router Redundancy Protocol (VRRP) can approach high availability by linking redundant routers (and usually redundant switches) into a hardware cluster (four devices).  Dual hardware and paths can eliminate network downtime for hardware/firmware maintenance, but the process is more labor intensive and requires strict adherence to operating procedures--if you can pull it off, go for it.
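As a rough illustration of how a VRRP pair decides which router answers for the shared gateway address, here is a simplified sketch in Python.  In standard VRRP the router with the highest priority becomes master, and a tie goes to the higher IP address; everything else here (the `VrrpRouter` type, the `elect_master` function, the addresses and priorities) is made up for the example, and the protocol's advertisement timers and preemption settings are left out.

```python
from ipaddress import IPv4Address
from typing import List, NamedTuple, Optional


class VrrpRouter(NamedTuple):
    """Hypothetical record for one router in a VRRP group."""
    name: str
    priority: int          # higher wins
    address: IPv4Address   # tie-breaker: higher address wins
    up: bool = True


def elect_master(routers: List[VrrpRouter]) -> Optional[VrrpRouter]:
    """Pick the master among routers that are up, or None if all are down."""
    candidates = [r for r in routers if r.up]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r.priority, r.address))


if __name__ == "__main__":
    primary = VrrpRouter("rtr-a", 200, IPv4Address("192.0.2.2"))
    backup = VrrpRouter("rtr-b", 100, IPv4Address("192.0.2.3"))

    print(elect_master([primary, backup]).name)

    # Take the primary down for maintenance; the backup takes over.
    primary_down = primary._replace(up=False)
    print(elect_master([primary_down, backup]).name)
```

Taking one router out of service for maintenance is then a matter of letting its partner win the election first, which is where the strict operating procedures come in.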

--  http://www.alcatel.com/bnd/news/ip/ipv6.jhtml for a report from BT on the Alcatel HA feature;  detailed description of the tests and results.
--  http://www.cisco.com/en/US/products/ps6550/prod_presentation_list.html is a list of documents related to high availability.
--  http://www.juniper.net/company/presscenter/pr/2004/pr-040818a.html announces the BT certification testing.
--  http://www2.nortel.com/go/product_content.jsp?prod_id=48980 offers literature on Avici core router features, including HA.

