New on RIPE Labs: BGP Zombies
Dear colleagues, When withdrawing an IP prefix from the Internet, an origin network sends BGP withdraw messages, which are expected to propagate to all BGP routers that hold an entry for that IP prefix in their routing table. Yet network operators occasionally report issues where routers maintain routes to IP prefixes withdrawn by their origin network - BGP zombies. Please find more details on RIPE Labs: https://labs.ripe.net/Members/romain_fontugne/bgp-zombies Kind regards, Mirjam Kühne RIPE NCC
Hi, On Tue, Apr 23, 2019 at 02:48:04PM +0200, Mirjam Kuehne wrote:
When withdrawing an IP prefix from the Internet, an origin network sends BGP withdraw messages, which are expected to propagate to all BGP routers that hold an entry for that IP prefix in their routing table. Yet network operators occasionally report issues where routers maintain routes to IP prefixes withdrawn by their origin network - BGP zombies.
These are "ghosts", not zombies :-) https://www.sixxs.net/tools/grh/ Gert Doering -- NetMaster -- have you enabled IPv6 on something today...? SpaceNet AG Vorstand: Sebastian v. Bomhard, Michael Emmer Joseph-Dollinger-Bogen 14 Aufsichtsratsvors.: A. Grundner-Culemann D-80807 Muenchen HRB: 136055 (AG Muenchen) Tel: +49 (0)89/32356-444 USt-IdNr.: DE813185279
These are "ghosts", not zombies :-)
Yep, just a new name for the same old thing :) Cheers! Sander
The adjective "volatile" (/their presence is highly volatile/) IMO fits a ghost better than a zombie, so I would vote for the "ghost" name :-D. On 23/04/2019 19:54, Sander Steffann wrote:
These are "ghosts", not zombies :-)
https://www.sixxs.net/tools/grh/ Yep, just a new name for the same old thing :)
Cheers! Sander
On 23/04/2019 19:54, Sander Steffann wrote:
These are "ghosts", not zombies :-)
Yep, just a new name for the same old thing :)
From my end i'll try harder to mention the multiple names the phenomenon has (my bad that we don't have the alternative name mentioned in the paper ( https://www.iij-ii.co.jp/en/members/romain/pdf/romain_pam2019.pdf ) ). I've made a reference to the alternative names in the RIPE Labs post.
I had a similar conversation with somebody insisting these things are called 'stuck routes'. 8) And while i agree naming confusion is not good, what i care about most is that we understand this phenomenon better. The GRH project ended years ago, and still we see the phenomenon, not only in IPv6 (which was GRH's focus), but also in IPv4! ciao, Emile
Hi, On Wed, Apr 24, 2019 at 09:06:08AM +0200, Emile Aben wrote:
On 23/04/2019 19:54, Sander Steffann wrote:
These are "ghosts", not zombies :-)
Yep, just a new name for the same old thing :)
I had a similar conversation with somebody insisting these things are called 'stuck routes'. 8)
Both "zombie" and "stuck routes" describe the phenomenon fairly well :-) I'd go for prior art, and claim "ghosts" is the oldest documented term (though, back then, when I was young and thought I found something new in IPv6 BGP, Randy Bush told me that this was something long known in the IPv4 world...) Gert Doering
And while i agree naming confusion is not good, what i care about most is that we understand this phenomenon better.
and i am still waiting. we've seen them for 30 years. and we are still no nearer understanding them than a conjecture that they are caused by vendor bugs. on a sibling mess, duplicate announcements, folk did real experiments and found some root causes. i am still waiting for that on stuck routes. randy
Hi, On Wed, Apr 24, 2019 at 07:43:15AM -0700, Randy Bush wrote:
And while i agree naming confusion is not good, what i care about most is that we understand this phenomenon better.
and i am still waiting. we've seen them for 30 years. and we are still no nearer understanding them than a conjecture that they are caused by vendor bugs.
on a sibling mess, duplicate announcements, folk did real experiments and found some root causes. i am still waiting for that on stuck routes.
One of the issues we found (Philip Smith and I) "back then" was indeed router bugs. The combination of "export policy is changed" with "an update is queued for this neighbour right then" led to control-plane confusion and missing withdraws. This was fixed. My conclusion then was that something along the following lines happens:
- router R1 remembers where an UPDATE was sent to
- export policy on R1 is changed, changing whether or not a given peer would receive an UPDATE for a given prefix
- R1 receives withdraw from his best (and only) path, prefix is gone
- R1 sends withdraw to "all peers it remembers"
- and something goes wrong if that list of peers is not reflecting the real set of peers, possibly due to "BGP internal state not fully in sync between 'export policy is changed' and 'withdraw comes in'", so R1 is no longer aware that one of his neighbours received the prefix originally.
Gert Doering
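The conjectured sequence can be sketched as a toy simulation. This is only an illustration of the hypothesis, not a model of any real BGP implementation: the `Router` class, its `advertised_to` bookkeeping, and the `change_export_policy` method are all hypothetical names, and the "bug" is modelled simply as the policy change dropping a peer from R1's memory without sending that peer a withdraw.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Toy router: illustrative only, not a real BGP speaker."""
    name: str
    rib: dict = field(default_factory=dict)            # prefix -> neighbour it was learned from
    peers: list = field(default_factory=list)          # downstream Router objects
    advertised_to: dict = field(default_factory=dict)  # prefix -> names of peers we remember updating

    def receive_update(self, prefix, from_name):
        # Install the route and advertise it to every peer, remembering who got it.
        self.rib[prefix] = from_name
        for peer in self.peers:
            peer.receive_update(prefix, self.name)
            self.advertised_to.setdefault(prefix, set()).add(peer.name)

    def change_export_policy(self, prefix, peer_name):
        # ASSUMED bug: the policy change removes the peer from the bookkeeping
        # but no withdraw is sent to that peer.
        self.advertised_to.get(prefix, set()).discard(peer_name)

    def receive_withdraw(self, prefix):
        # Remove the route and withdraw it only from the peers we remember.
        self.rib.pop(prefix, None)
        remembered = self.advertised_to.pop(prefix, set())
        for peer in self.peers:
            if peer.name in remembered:
                peer.receive_withdraw(prefix)


def simulate(prefix="2001:db8::/32"):
    r1, r2, r3 = Router("R1"), Router("R2"), Router("R3")
    r1.peers = [r2, r3]
    r1.receive_update(prefix, "origin")   # R1 learns the prefix and tells R2 and R3
    r1.change_export_policy(prefix, "R2") # bookkeeping for R2 goes out of sync
    r1.receive_withdraw(prefix)           # only R3 is told the prefix is gone
    return r1, r2, r3
```

After `simulate()`, R1 and R3 no longer hold the prefix while R2 still holds it and attributes it to R1, which is the stuck-route situation described in the list.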
One of the issues we found (Philip Smith and I) "back then" was indeed router bugs. The combination of "export policy is changed" with "an update is queued for this neighbour right then" led to control-plane confusion and missing withdraws. This was fixed.
cool
My conclusion then was that something along the following line happens
- router R1 remembers where an UPDATE was sent to
- export policy on R1 is changed, changing whether or not a given peer would receive an UPDATE for a given prefix
- R1 receives withdraw from his best (and only) path, prefix is gone
- R1 sends withdraw to "all peers it remembers"
- and something goes wrong if that list of peers is not reflecting the real set of peers, possibly due to "BGP internal state not fully in sync between 'export policy is changed' and 'withdraw comes in'", so R1 is no longer aware that one of his neighbours received the prefix originally.
believable conjecture. could and should be tested in lab. but does not explain the cases where we see stuck routes on devices which have no config changes for a loooong time (if you believe rancid). randy
Hi, On Wed, Apr 24, 2019 at 08:06:18AM -0700, Randy Bush wrote:
My conclusion then was that something along the following line happens
- router R1 remembers where an UPDATE was sent to
- export policy on R1 is changed, changing whether or not a given peer would receive an UPDATE for a given prefix
- R1 receives withdraw from his best (and only) path, prefix is gone
- R1 sends withdraw to "all peers it remembers"
- and something goes wrong if that list of peers is not reflecting the real set of peers, possibly due to "BGP internal state not fully in sync between 'export policy is changed' and 'withdraw comes in'", so R1 is no longer aware that one of his neighbours received the prefix originally.
believable conjecture. could and should be tested in lab.
Indeed.
but does not explain the cases where we see stuck routes on devices which have no config changes for a loooong time (if you believe rancid).
Well, in the scenario above, R1 would have the config change, but *on R1* the route would be gone. A downstream router R2 would have seen the initial UPDATE, but never received a withdraw - so R2 would claim "I have it, and I have it from R1!" while R1 would claim "no such prefix". So, no contradiction. Gert Doering
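The R1/R2 disagreement described here suggests a simple consistency check over routing table snapshots. The sketch below assumes a hypothetical snapshot format (a dict mapping each router name to a dict of prefix to the neighbour the route was learned from); it is not tied to any particular collector or router CLI, and the function name is made up for illustration.

```python
def find_stuck_candidates(ribs):
    """Return (holder, prefix, claimed_from) triples where the holder says it
    learned the prefix from a neighbour whose own table has no such prefix.

    `ribs` is an ASSUMED snapshot format: {router: {prefix: learned_from}}.
    Neighbours without a snapshot are skipped, since nothing can be concluded.
    """
    candidates = []
    for holder, table in ribs.items():
        for prefix, claimed_from in table.items():
            upstream = ribs.get(claimed_from)
            if upstream is None:
                continue  # no snapshot for that neighbour
            if prefix not in upstream:
                candidates.append((holder, prefix, claimed_from))
    return sorted(candidates)


# Hypothetical snapshot: R2 still holds a route that R1 no longer has.
snapshot = {
    "R1": {},
    "R2": {"192.0.2.0/24": "R1"},
}
stuck = find_stuck_candidates(snapshot)
```

A hit from such a check only flags a candidate; a route that is still converging would look the same in a single snapshot, so repeated snapshots over time would be needed before calling it stuck.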
but does not explain the cases where we see stuck routes on devices which have no config changes for a loooong time (if you believe rancid).
Well, in the scenario above, R1 would have the config change, but *on R1* the route would be gone. A downstream router R2 would have seen the initial UPDATE, but never received a withdraw - so R2 would claim "I have it, and I have it from R1!" while R1 would claim "no such prefix".
so, could we look for a router where emile et alia found a stuck route and we have the rancid for the upstreams? randy
participants (6)
- Claudio Ferronato
- Emile Aben
- Gert Doering
- Mirjam Kuehne
- Randy Bush
- Sander Steffann