RIPE European Internet Exchange (EIX) Working Group

Internet Exchange Point Switching "Wishlist"

Mike Hughes <mike@linx.net>
London Internet Exchange
April 2001

ABSTRACT:
---------

At the RIPE meeting held in Amsterdam in February 2000, a number of participants agreed that the group should produce a "wishlist" to guide equipment manufacturers when producing boxes aimed at the core switching market. Over the following months, ideas were collected from the EIXP community to form the basis of this document.

In Europe, most Internet Exchange Points use a shared switch fabric to which the participants connect. Organisations then arrange peering via bi-lateral peering agreements. It is not compulsory for every participant to peer with every other participant (an arrangement called multi-lateral peering). Once two participants agree to peer, they set up BGP4 sessions between their routers connected to the Exchange to exchange routes and traffic.

In the majority of cases, the Exchange Point operator does not become involved in the routing of any traffic across the Exchange; they choose to leave this to the participants. For this reason, switched Ethernet has become one of the most common choices for Exchange Point media. The main reasons behind this are:

* Cost effectiveness
* Simplicity of setup
* Can use standard CAT5 wiring - easy to implement and maintain
* Interfaces available across a wide range of platforms

With the growth of the Internet, more and more traffic is being routed to Internet Exchange Points, and the importance of IXPs has grown in line with this, especially in Europe where private peering is less common than in North America. The IXP operators feel that having the right tools and features implemented in the equipment they deploy will play an important part in scaling Ethernet technology to meet the demands placed upon Exchange Points.
This is an informational document to outline the various features which IXPs would like to see implemented in core Ethernet switching products.

SECURITY FEATURES:
------------------

a) Control of dynamic MAC learning
----------------------------------

Currently, switches offer two options: statically configured or dynamically learned forwarding information.

Exchange Points like to monitor and control how many MAC addresses are connected to a participant's port. The XP operator generally does not want ad-hoc extensions connected to their network. The common way of managing this is to enforce a "router-only" or "limited MAC address" rule.

This is currently either enforced by statically configuring forwarding information, or left unenforced and policed by counting the number of MAC addresses learned on each port and taking action against offenders.

Static configuration of forwarding information is a somewhat inelegant option, as it increases configuration overhead and decreases flexibility, especially in case of emergencies.

We propose a "maximum learning" limit, configurable on a per-port basis. In this way, operators can configure participants' ports according to their house rules, but retain the flexibility of dynamic learning.

This feature would also have to include a "first in-last out" lockdown facility, to avoid forwarding information for valid addresses being overwritten by addresses in excess of the exchange's house rules.

b) Disable acting on STP BPDU information
-----------------------------------------

Many exchange operators currently deploy Spanning Tree Protocol (STP) in networks which contain redundant links/full meshing.

There is, however, a danger presented by STP information leaked from a participant's network. The participant may have connected a poorly configured switch/router product, and may be leaking their STP information into that of the exchange.
We would wish to see a configurable option to allow STP information to be ignored, and filtered at the port, on a per-port basis.

c) Wire-speed ACL-type filtering based on L3 header info
--------------------------------------------------------

We would like the ability to look into the layer-3 information of a packet, and to selectively monitor or filter based on certain layer-3 criteria.

d) ARP/Broadcast snooping and control
-------------------------------------

Many exchange points insist on participants using IP addresses assigned by the exchange operator. It is desirable for the operator to be able to monitor/restrict "off-net" ARP.

As Ethernet is a broadcast medium, broadcast storms have been known to bring exchanges to their knees, affecting the forwarding abilities of both the exchange's switches and the participants' routers. Monitoring/rate limiting/control of Ethernet broadcast frames is desirable.

Most exchanges also forbid the speaking of interior routing protocols across their peering network. Since these take the form of broadcast frames on Ethernet, broadcast control would help monitor this type of incident.

e) Policy exception logging
---------------------------

In the above paragraphs, we have asked for some policy-based tools. Operators need to know when these policies have been breached. Good logging of policy exceptions needs to be implemented:

* SNMP-trap
* Configurable syslog (i.e. which syslog facility to write to)

f) Access to management interfaces
----------------------------------

In the past, security of management interfaces on Ethernet switching products has often been lacking.

CLI or web interfaces should support authentication using username/password pairs, to avoid the use of "password only" authentication which implies shared passwords.

CLI interfaces should also support SSH access, using either username/password or secure key authentication.
Web interfaces should be HTTPS/SSL enabled, to avoid passwords being passed in the clear over HTTP.

Management interfaces should be able to perform authentication from an external source, such as TACACS, RADIUS or LDAP services, as well as providing locally held accounts (which have to be retained for emergencies).

All management interfaces (CLI, web and SNMP) should be able to benefit from access-list control. The access lists should be able to support variable-length subnet masks.

It should be possible to disable management interfaces on a per-VLAN basis. Many XP operators choose to configure a "management" VLAN, so that all management is done out-of-band of the core peering traffic. It is desirable to have the management interfaces listen on the management networks only.

g) Port mirroring
-----------------

It is sometimes necessary to mirror participants' ports, either because a participant is suspected of some inappropriate activity, or to help obtain information to debug a problem.

Not all exchange points have staff on site 24x7, and port mirroring may need to be set up remotely, without hands-on intervention on-site.

The ability to allow any port to mirror any other port of a similar or lower speed within the chassis would allow the operator to connect a traffic collector/analyser device to a monitoring port, and simply configure the switch to mirror the desired port to the monitoring port.

SCALABILITY AND RESILIENCE
--------------------------

a) Spanning Tree
----------------

Spanning Tree is currently the only solution available to operators of exchange points for dynamically managing redundant links in their architecture.
There are a number of problems with Spanning Tree:

* Slow convergence - especially in cases of root bridge re-election
* Wasteful of resilient/redundant resources - redundant links are switched off - no traffic sharing
* Security concerns (highlighted above)

As the routes collected at an Exchange Point can be routed all over the world, any routing instability can act like dropping a pebble in a pond, and will spread around the Internet. It's desirable to maintain stable routing sessions across Exchange Point LANs to minimise these routing flaps, because of the load they place on routers and the effects of route dampening penalties.

We believe that being able to declare ports as "end-stations" should avoid them being counted in the STP calculation, enable these ports to start forwarding more rapidly, and speed up overall STP convergence time.

Rapid Spanning Tree (IEEE 802.1w) should be implemented (http://www.ieee802.org/1/pages/802.1w.html).

b) Resilient Packet Ring - IEEE 802.17
--------------------------------------

This is a standards-based version of the technology currently used by Cisco called DPT (Dynamic Packet Transport). This consists of a counter-rotating ring system, with spatial reuse and "ring wrapping" circuit protection.

The Cisco version is currently implemented over SONET/SDH media; however, the standardised version is being designed to be more media agnostic, and the IEEE working group has already elected to provide support for Gigabit Ethernet and 10 Gigabit Ethernet.

RPR shows promise of becoming an ideal backbone technology for use in a flat layer-2 network, such as an exchange point. It will allow for redundant self-healing backbones, with optimal use of all interswitch capacity, without the need for STP.
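
As an illustration only, the ring behaviour described above can be sketched in a few lines of Python. This is a toy model and is not taken from the IEEE 802.17 drafts or from DPT: the node numbering, the names ring_path and spans_used, and the simplification that a wrapped frame simply continues the other way round until it reaches its destination are all assumptions made for the example.

```python
def ring_path(n, src, dst, failed_span=None):
    """Return the nodes a frame visits on an n-node counter-rotating ring.

    Hypothetical model: nodes are numbered 0..n-1 and span i joins node i
    to node (i + 1) % n. failed_span is the index of a broken span, or None.
    """
    if src == dst:
        return [src]
    cw_hops = (dst - src) % n
    # Pick the ringlet with fewer hops; a tie goes to the clockwise ringlet.
    step = 1 if cw_hops <= n - cw_hops else -1
    path, node, wrapped = [src], src, False
    while node != dst:
        nxt = (node + step) % n
        span = node if step == 1 else nxt  # index of the span between node and nxt
        if span == failed_span:
            if wrapped:
                raise ValueError("destination unreachable")
            # "Ring wrapping": turn the frame round onto the other ringlet.
            step, wrapped = -step, True
            continue
        node = nxt
        path.append(node)
    return path


def spans_used(n, path):
    """Return the set of span indices crossed by a path."""
    return {a if (a + 1) % n == b else b for a, b in zip(path, path[1:])}


if __name__ == "__main__":
    print(ring_path(6, 0, 2))                 # [0, 1, 2]
    print(ring_path(6, 0, 2, failed_span=1))  # [0, 1, 0, 5, 4, 3, 2]
    # Spatial reuse: these two flows cross no common span, so both can
    # use the full span capacity at the same time.
    print(spans_used(6, ring_path(6, 0, 2)) & spans_used(6, ring_path(6, 3, 5)))
```

In this model a frame normally takes the shorter way round, so traffic between different pairs of switches can occupy different spans at once, and a single span failure lengthens the path but does not isolate any node.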
c) Trunking and Link-Aggregation
--------------------------------

It has become increasingly common for exchange points to span multiple switches and multiple sites, and many need to deploy link aggregation to handle the volume of interswitch traffic where it exceeds the maximum speed of a single link.

Most equipment implements load-sharing using either round-robin or address-based algorithms. The address-based system generally employs a hash of source/destination MAC address. Address-based load-sharing may be preferable to minimise jitter.

In exchange points, many pieces of equipment will have similar MAC addresses, especially the first and last bytes (corresponding to vendor and slot position on the router). If the hash is only based on part of the address, this can result in poor efficiency of load-sharing. Load-sharing algorithms should consider the whole address when calculating the hash used.

Load-sharing of broadcast and multicast traffic should be implemented. Behaviours such as forwarding all broadcast/multicast traffic out of the "primary" port in a trunk have been observed when load-sharing using destination MAC addresses has been implemented.

IEEE 802.3ad link-aggregation should be implemented.

d) Multicast Control and Containment
------------------------------------

Most switches are configured with IGMP snooping for multicast control. However, in an exchange point, with only routers attached, there is no IGMP present, only PIM and MSDP, and all multicast packets are flooded out of all ports.

An exchange point, however, is an ideal place for multicast peering to happen: inject the traffic once, and it comes out several times (as much as is needed, or in the current situation, as much as isn't needed!).

Cisco developed RGMP (Router Group Management Protocol). This is a proprietary technology whereby the router can communicate to the switch which multicast groups it wishes to see.
This is, however, a vendor-specific feature, and a wide range of routing platforms and switching platforms are present at many exchange points - both in equipment used by the operator, and by the participants. Therefore, this is not a workable solution for most exchange points, whose principles often include "equal treatment" of participants.

While it may not solve all potential issues with multicast peering, implementing PIM-SM snooping and pruning within the switches would meet the traffic containment requirements.

e) Intelligent Layer 2 Forwarding - MAC-SPF
-------------------------------------------

The majority of Ethernet switching products in use today are switch/router products and contain enough processing power and memory to handle sizeable routing tables.

Exchange points which do not involve multi-lateral peering currently need to provide a transparent layer-2 service to the participants, and Ethernet has been widely deployed to achieve this.

In a bi-laterally peered XP, the XP cannot become involved in the layer-3 routing decision process. There are a couple of reasons for this, but an overriding one is a matter of scalability. Every participant will have their own routing view. Involving the XP infrastructure in layer-3 routing decisions would need the XP to hold as many routing views as members (yes, MPLS could be employed, but how is not clear right now).

The growth in size of some exchange points has led to expansion to multiple sites and switches, and consequently to a more complex topology. This has shown up the weaknesses in Spanning Tree.

An option this group considers worth exploration is the development of a dynamic layer-2 routing protocol, to replace MAC learning and STP.
Some thought has been given to this, and the following ideas have been arrived at:

* Retain dynamic MAC learning on "end-station" ports - added to forwarding database as usual
* Replace dynamic learning on inter-switch links with a routing process - switches use some form of L2 neighbour discovery
* Use existing OSPF algorithms and similar LSAs to manage the routing information

This would need to interoperate between different switch vendors to allow for scalability and freedom.

Currently, one issue is where and how to pursue this. It's unclear whether this should lie within the IEEE 802 LMSC or within the IETF.

Many switch vendors like the sound of this idea, as it solves some problems being faced by service providers working in the MAN space, who need to provide transparent L2 transport services. However, they need to be convinced of demand and to ensure interoperability.

The IETF has in the past only been concerned with layer-3 and up, "IP-over-Foo". However, they are currently of the mindset that they do need to dig down and look at the interaction between IP and the Foo it is transmitted over. This is one possible avenue for pursuing this further.

f) MPLS
-------

Another option for improving scalability and resilience is MPLS (Multi Protocol Label Switching).

Currently, there is rather limited support for MPLS in switching products, mainly because customer demand is pushing vendors to implement this in their high-end router products first. Many high-end router products do not possess sufficient port-density for use in an IXP environment.

MPLS may enable an exchange to run a layer-3 network internally, and use an existing IGP to route traffic internally. By using MPLS tags to pass the traffic between the peers on the edge of the IXP cloud, the IXP does not need to be aware of the IP source and destination of the packets.

PHYSICAL WISHES
---------------

IXPs are high-uptime environments.
The equipment used in an IXP needs to be able to satisfy this requirement, in terms of redundancy and hot-swappable components.

* Hot swap of management/switch fabric cards with instantaneous failover to any installed redundancy (not rebooting onto the "backup")
* Full redundancy of PSUs, and hot-swap (i.e. box should run on 50% of PSUs)
* Rapid booting and card startup (after all, much functionality is implemented in the ASIC hardware)
* GBIC-optics for flexibility, easy replacement, and maximised port utilisation (freedom to choose SX/LX, etc)

ACKNOWLEDGEMENT AND THANKS
--------------------------

Thanks are due to the "usual suspects" in the RIPE EIX Working Group, but specifically Christian Panigl, Keith Mitchell and Rick Payne, for their contributions to this document.