
Hi all,

recently we had a session that went down without our quagga noticing it. The remote peer told us the session was down (while our quagga said it was up). We had to clear the session to restore it.

It seems that this was caused by our hold time of 0. If the peer just suddenly drops, quagga of course has no way of knowing this and will keep seeing the session as up. As the session is already up, it will refuse connection attempts from the remote peer.

I don't know how often this happens. I think quite a few of our peers do not closely monitor their peering sessions, so similar events will probably go unnoticed.

I think it would be best to set the normal hold time for all peerings to prevent this from occurring. I heard from Arife that we don't do that now, to prevent polluting our dump files with loads of keepalives. As far as I can see, there is no option to skip keepalives in the dumps, but I was thinking we should be able to write a patch to do so, or ask quagga-users whether someone did something like it already.

So my idea is:
- get quagga to stop dumping the keepalives
- set all peerings to a hold time of 180

Any comments?

cheers,
--
Erik Romijn
RIPE NCC software engineer
erik@ripe.net
http://www.ripe.net/
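To illustrate why a hold time of 0 hides a dead peer, here is a minimal Python sketch of the hold time rules as I understand them from RFC 4271: both sides use the smaller of the two proposed values, and a value of 0 means the hold timer is never started. The function names are made up for this example and are not part of quagga.

```python
def negotiate_hold_time(local_hold: int, remote_hold: int) -> int:
    """Return the hold time both sides use: the smaller of the two proposals."""
    for value in (local_hold, remote_hold):
        # RFC 4271 allows 0 or any value of at least 3 seconds.
        if value != 0 and value < 3:
            raise ValueError(f"invalid hold time: {value}")
    return min(local_hold, remote_hold)


def peer_considered_dead(hold_time: int, seconds_since_last_message: float) -> bool:
    """Return True if the hold timer would have expired by now."""
    if hold_time == 0:
        # The hold timer is never started, so silence is never treated as failure.
        return False
    return seconds_since_last_message > hold_time


# We propose 0 and the peer proposes 180: the session runs without timers.
no_timers = negotiate_hold_time(0, 180)
# Both sides propose 180: an hour of silence tears the session down.
with_timers = negotiate_hold_time(180, 180)

print(no_timers, peer_considered_dead(no_timers, 3600))      # 0 False
print(with_timers, peer_considered_dead(with_timers, 3600))  # 180 True
```

With a negotiated hold time of 0 the session is never declared dead, no matter how long the peer has been silent, which matches what we saw.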

Erik Romijn wrote:
I heard from Arife that we don't do that now, to prevent polluting our dump files with loads of keepalives.
My recollection is that running without keepalives was meant to remove the extra activity caused by underlying network instability on long-distance multi-hop peerings. Apart from rrc00 this shouldn't be an issue anyway, since all peers are a single hop away.

Research has suggested that single-/multi-hop peering and keepalive/no-keepalive settings make little difference to the collected data (see Alex Tudor's presentation from RIPE 46: http://www.ripe.net/ripe/meetings/ripe-46/presentations/ripe46-routing-bgp-m...).

Given the general level of BGP activity, each peer should be sending enough Update messages that the number of additional Keepalive messages needed to keep the session up should be low.
As far as I can see, there is no option to skip keepalives in the dumps, but I was thinking we should be able to write a patch to do so, or ask quagga-users whether someone did something like it already.
So my idea is:
- get quagga to stop dumping the keepalives
- set all peerings to a hold time of 180
Personally, I'd go for the latter. This would also remove the problems we have with the RRCs peering with Juniper routers, which appear not to accept the 0 hold time but bring the session up anyway (with their standard timings), only to flap the session when they don't receive the keepalives they were expecting. With some full feeds this has resulted in ENORMOUS dump files and very long database insertion times.

James

James Aldridge wrote:
Erik Romijn wrote:
So my idea is:
- get quagga to stop dumping the keepalives
- set all peerings to a hold time of 180
Personally, I'd go for the latter.
I meant doing both :-)

cheers,
--
Erik Romijn
RIPE NCC software engineer
erik@ripe.net
http://www.ripe.net/
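Until quagga itself can skip keepalives, one alternative would be to strip them from the dump files afterwards. The following is only a sketch of that post-processing approach, not the quagga patch discussed above. It assumes the dumps are in MRT format with BGP4MP message records laid out as described in RFC 6396; the function names are hypothetical.

```python
import struct

MRT_HEADER = struct.Struct("!IHHI")  # timestamp, type, subtype, body length
MRT_TYPE_BGP4MP = 16
BGP4MP_MESSAGE = 1       # records with 2-byte AS numbers
BGP4MP_MESSAGE_AS4 = 4   # records with 4-byte AS numbers
AFI_IPV4, AFI_IPV6 = 1, 2
BGP_KEEPALIVE = 4        # BGP message type code for KEEPALIVE


def is_keepalive(mrt_type: int, subtype: int, body: bytes) -> bool:
    """Return True if the MRT record body wraps a BGP KEEPALIVE message."""
    if mrt_type != MRT_TYPE_BGP4MP:
        return False
    if subtype not in (BGP4MP_MESSAGE, BGP4MP_MESSAGE_AS4):
        return False
    as_len = 2 if subtype == BGP4MP_MESSAGE else 4
    afi_offset = 2 * as_len + 2  # skip peer AS, local AS, interface index
    if len(body) < afi_offset + 2:
        return False
    (afi,) = struct.unpack_from("!H", body, afi_offset)
    if afi not in (AFI_IPV4, AFI_IPV6):
        return False
    addr_len = 4 if afi == AFI_IPV4 else 16
    bgp_offset = afi_offset + 2 + 2 * addr_len  # skip AFI, peer IP, local IP
    type_offset = bgp_offset + 16 + 2           # skip BGP marker and length
    return len(body) > type_offset and body[type_offset] == BGP_KEEPALIVE


def strip_keepalives(data: bytes) -> bytes:
    """Return the MRT stream with all KEEPALIVE records removed."""
    out = bytearray()
    pos = 0
    while pos + MRT_HEADER.size <= len(data):
        _ts, mrt_type, subtype, length = MRT_HEADER.unpack_from(data, pos)
        end = pos + MRT_HEADER.size + length
        if end > len(data):
            break  # truncated record: stop parsing here
        body = data[pos + MRT_HEADER.size:end]
        if not is_keepalive(mrt_type, subtype, body):
            out += data[pos:end]
        pos = end
    out += data[pos:]  # copy any unparsed trailing bytes unchanged
    return bytes(out)
```

A dump file could then be cleaned by reading it in binary mode, passing the bytes to `strip_keepalives`, and writing the result to a new file. Record types the sketch does not recognise are passed through untouched.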

Erik Romijn wrote:
I heard from Arife that we don't do that now, to prevent polluting our dump files with loads of keepalives.
If I remember correctly, one of the reasons was also the fear that when the RRC is heavily loaded (e.g. when performing a full RIB dump), the CPU might be tied up for long enough that Quagga would miss keepalives, eventually leading to dropped sessions. I don't know if this is still a problem; if it is, we might set the keepalive interval to a more conservative value such as 600 seconds, causing a session to go down only after 30 minutes of silence.

Cheers,
Lorenzo

--
Lorenzo Colitti                    lorenzo@ripe.net
Network Engineer                   +31-20-5354471
RIPE NCC                           www.ripe.net
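The 30 minute figure follows from the usual convention that the hold time is three times the keepalive interval. A small Python sketch of that arithmetic is below. The helper names and the example peer address 192.0.2.1 are made up for illustration, and the generated line follows the `neighbor PEER timers KEEPALIVE HOLDTIME` form of the bgpd configuration as I understand it, so please check it against the quagga documentation before use.

```python
from typing import Tuple


def timers_for_keepalive(keepalive: int, multiplier: int = 3) -> Tuple[int, int]:
    """Return (keepalive, hold time) in seconds, with hold = keepalive * multiplier."""
    if keepalive <= 0:
        raise ValueError("keepalive interval must be positive")
    hold = keepalive * multiplier
    if hold > 65535:
        # The hold time is carried in a 16-bit field.
        raise ValueError("hold time does not fit in 16 bits")
    return keepalive, hold


def quagga_timers_line(peer: str, keepalive: int, multiplier: int = 3) -> str:
    """Format a per-neighbor timers line (assumed bgpd syntax, see note above)."""
    ka, hold = timers_for_keepalive(keepalive, multiplier)
    return f"neighbor {peer} timers {ka} {hold}"


keepalive, hold = timers_for_keepalive(600)
print(hold // 60)                             # 30 (minutes of silence tolerated)
print(quagga_timers_line("192.0.2.1", 600))   # neighbor 192.0.2.1 timers 600 1800
print(quagga_timers_line("192.0.2.1", 60))    # neighbor 192.0.2.1 timers 60 180
```

The last line corresponds to the hold time of 180 proposed earlier in the thread, and the 600/1800 pair is the more conservative setting for a heavily loaded RRC.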
participants (3)
- Erik Romijn
- James Aldridge
- Lorenzo Colitti