One of the issues we found (Philip Smith and I) "back then" was indeed router bugs. The combination of "export policy is changed" with "an update is queued for this neighbour right then" led to control-plane confusion and missing withdraws. This was fixed.
cool
My conclusion then was that something along the following lines happens:
- router R1 remembers where an UPDATE was sent to
- export policy on R1 is changed, changing whether or not a given peer would receive an UPDATE for a given prefix
- R1 receives withdraw from his best (and only) path, prefix is gone
- R1 sends withdraw to "all peers it remembers"
- and something goes wrong if that list of peers is not reflecting the real set of peers, possibly due to "BGP internal state not fully in sync between 'export policy is changed' and 'withdraw comes in'", so R1 is no longer aware that one of his neighbours received the prefix originally.
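
For anyone who wants to reason about this before going to the lab, the sequence above can be sketched as a toy model. This is only one possible reading of the conjecture, not a description of any vendor's implementation: the class `Router`, the `buggy` flag, the peer names A/B/C and the idea that the "remembered" list is silently trimmed to match the new policy are all assumptions made up for illustration.

```python
class Router:
    """Toy model of R1 for a single prefix (hypothetical, not real BGP code)."""

    def __init__(self, peers):
        self.peers = set(peers)
        self.export_allowed = set(peers)  # peers the current export policy permits
        self.advertised_to = set()        # peers R1 "remembers" sending the UPDATE to

    def receive_update(self):
        """R1 learns the prefix; returns the peers it advertises it to."""
        sent = self.peers & self.export_allowed
        self.advertised_to = set(sent)
        return sent

    def change_export_policy(self, allowed, buggy):
        """Apply a new export policy; returns the peers that get a withdraw."""
        self.export_allowed = set(allowed)
        dropped = self.advertised_to - self.export_allowed
        self.advertised_to -= dropped
        if buggy:
            # Assumed failure mode: the bookkeeping is updated to match the
            # new policy, but no withdraw is sent to the dropped peers.
            return set()
        return dropped

    def receive_withdraw(self):
        """Best (and only) path is gone; withdraw goes to remembered peers."""
        notified = set(self.advertised_to)
        self.advertised_to.clear()
        return notified


def simulate(buggy):
    """Return the peers still holding the prefix after the full sequence."""
    r1 = Router(["A", "B", "C"])
    holding = set(r1.receive_update())
    holding -= r1.change_export_policy({"A", "B"}, buggy)
    holding -= r1.receive_withdraw()
    return holding


print("stuck with bug:   ", simulate(buggy=True))
print("stuck without bug:", simulate(buggy=False))
```

In this model peer C ends up with a stuck route only when the policy change and the bookkeeping get out of sync. It says nothing about timing, queued updates or MRAI, which is where the real bug presumably lived.
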
believable conjecture. could and should be tested in lab. but does not explain the cases where we see stuck routes on devices which have no config changes for a loooong time (if you believe rancid).

randy