Wednesday, October 14, 2009

BGP router down

We have a Global Crossing line. It is a dual-router BGP setup, with HSRP between the local routers in my office and BGP fallback on the Global Crossing infrastructure.
This morning, a few minutes before the opening bell, the primary circuit died. The connection failed over to the backup line, which is what you'd expect. The trouble began when the primary line started to bounce. Whenever it came back, the connections bounced for a second and the users had to reconnect. Then it failed again, and they started to get irritated. They were totally right.
When I called Global Crossing support, I found a very efficient team that listened to my problem and was willing to help (it sounds obvious, but that is usually not the case with big vendors). They started by checking the logs and found that the circuit was bouncing every few minutes. I asked whether they could raise the backup router's HSRP priority above the primary router's, and they had no problem making the configuration change. Problem resolved!
Here is how HSRP priority works: the primary router gets the higher priority, and both routers get the preempt command, which lets a router take over the active state when it comes online with a higher priority than the current active router. By raising the priority on the backup router and switching preemption to manual, we ensured that even once the circuit is fixed and stable, it will not become the active line again unless we manually change it back. This ensures that users will not get kicked off when the line is fixed, or when the telco works on the circuit and bounces it repeatedly.
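To make the effect of priority and preempt concrete, here is a small Python sketch of the election logic. It is a toy model, not router code. The names (Router, elect, simulate) and the priority values 100, 110 and 120 are my own for illustration. It treats a bouncing circuit as the router simply becoming unavailable, and it ignores HSRP timers and tie-breaking.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Router:
    name: str
    priority: int       # hypothetical values; higher wins the election
    preempt: bool       # may this router take over from a live active router?
    up: bool = True     # simplification: circuit down == router unavailable


def elect(routers: List[Router], active: Optional[Router]) -> Optional[Router]:
    """Return the active router after one election round."""
    candidates = [r for r in routers if r.up]
    if not candidates:
        return None
    # No live active router: the highest-priority live router takes over.
    if active is None or not active.up:
        return max(candidates, key=lambda r: r.priority)
    # A live active router is only replaced by a higher-priority router
    # that has preempt enabled.
    challengers = [r for r in candidates
                   if r.preempt and r.priority > active.priority]
    if challengers:
        return max(challengers, key=lambda r: r.priority)
    return active


def simulate(routers: List[Router], flapping: Router,
             flaps: int = 3) -> Tuple[Optional[str], int]:
    """Bounce one router `flaps` times; return (final active, switch count)."""
    active = elect(routers, None)
    switches = 0
    for _ in range(flaps):
        for state in (False, True):      # circuit goes down, then comes back
            flapping.up = state
            new_active = elect(routers, active)
            if new_active is not active:
                switches += 1            # every switch is a user-visible blip
            active = new_active
    return (active.name if active else None), switches


if __name__ == "__main__":
    # Original setup: primary has the higher priority, both preempt.
    primary = Router("primary", 110, preempt=True)
    backup = Router("backup", 100, preempt=True)
    print(simulate([primary, backup], primary))

    # Workaround: backup priority raised above primary, no automatic preempt.
    primary = Router("primary", 110, preempt=False)
    backup = Router("backup", 120, preempt=False)
    print(simulate([primary, backup], primary))
```

In this model the original setup changes the active router on every bounce (six times for three bounces), while the workaround keeps the backup active the whole time, which matches what we saw once the change was made.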
The circuit was fixed a few hours later, and we switched back after hours. It is nice to work with a good, cooperative service for a change!

