IXP Switch Wishlist Draft

2 May 2001

      RIPE European Internet Exchange (EIX) Working Group

Internet Exchange Point Switching "Wishlist"

Mike Hughes <mike@linx.net>
London Internet Exchange
April 2001

ABSTRACT:
---------
At the RIPE meeting held in Amsterdam in February 2000, a number of 
participants agreed that the group should produce a "wishlist" to guide 
equipment manufacturers when producing boxes aimed at the core switching 
market. Over the following months, ideas were collected from the EIXP
community to form the basis of this document.

In Europe, most Internet Exchange Points use a shared switch fabric to 
which the participants connect. Organisations then arrange peering via 
bi-lateral peering agreements. It is not compulsory for every
participant to peer with every other participant (an arrangement known
as multi-lateral peering). Once two participants agree to peer, they
will set up BGP4 sessions between their routers connected to the
Exchange to exchange routes and traffic. In the majority of cases, the
Exchange Point operator does not become involved in the routing of any
traffic across the Exchange, choosing instead to leave this to the
participants.

For this reason, switched Ethernet has become one of the most common 
choices for Exchange Point media. The main reasons behind this are:
	* Cost effectiveness
	* Simplicity of setup
	* Can use standard CAT5 wiring - easy to implement and maintain
	* Interfaces available across a wide range of platforms

With the growth of the Internet, more and more traffic is being routed 
to Internet Exchange points, and the importance of IXPs has grown in 
line with this, especially in Europe where private peering is less
common than in North America.

The IXP operators feel that having the right tools and features 
implemented in the equipment they deploy will play an important part in
scaling Ethernet technology to meet the demands placed upon Exchange
Points.

This is an informational document to outline the various features which 
IXPs would like to see implemented in core Ethernet Switching products.

SECURITY FEATURES:
------------------

a) Control of dynamic MAC learning
----------------------------------
Currently, switches are provided with two options, either statically 
configured or dynamically learned forwarding information.

Exchange Points like to monitor and control how many MAC addresses are 
connected to a participant's port. The XP operator generally does not 
desire ad-hoc extensions connected to their network. The common way of 
managing this is to enforce a "router-only" or "limited MAC address" 
rule.

This is currently either controlled by statically configuring
forwarding information, or left uncontrolled but policed by counting
the number of MAC addresses learned on each port and taking action
against offenders.

Static configuration of forwarding information is a somewhat inelegant 
option, as this increases configuration overhead, and decreases 
flexibility, especially in case of emergencies.

We propose a "maximum learning" limit, configurable on a per-port
basis. In this way, operators can configure participants' ports
according to their house rules, but retain the flexibility of dynamic 
learning.

This feature would also have to include a "first in-last out" lockdown 
facility, to avoid forwarding information for valid addresses being 
overwritten by addresses in excess of the exchange's house rules.
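
As an illustration only, the behaviour we have in mind can be sketched
as follows. The class and method names are hypothetical and do not
correspond to any vendor's implementation; the sketch assumes that
addresses learned first are kept, and that later addresses in excess of
the limit are refused and recorded.

```python
class PortMacLimiter:
    """Per-port dynamic MAC learning with a hard limit (hypothetical sketch)."""

    def __init__(self, max_macs):
        self.max_macs = max_macs
        self.learned = []      # addresses in the order they were learned
        self.violations = []   # excess addresses seen after the limit was hit

    def learn(self, mac):
        """Return True if frames from `mac` may be forwarded on this port."""
        mac = mac.lower()
        if mac in self.learned:
            return True
        if len(self.learned) < self.max_macs:
            self.learned.append(mac)
            return True
        # Limit reached: existing entries are locked down and never
        # overwritten; the offending address is only recorded.
        if mac not in self.violations:
            self.violations.append(mac)
        return False


# A "router-only" port: one address allowed.
port = PortMacLimiter(max_macs=1)
port.learn("00:11:22:33:44:55")   # accepted and locked in
port.learn("00:AA:BB:CC:DD:EE")   # refused, recorded as a violation
```

The list of violations is what an operator would want reported through
the policy exception logging described later in this document.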

b) Disable acting on STP BPDU information
-----------------------------------------
Many exchange operators currently deploy Spanning Tree Protocol (STP) in 
networks which contain redundant links/full meshing.

There is, however, a danger presented by STP information leaked from a
participant's network. The participant may have connected a poorly 
configured switch/router product, and may be leaking their STP 
information into that of the exchange.

We would wish to see a configurable option to allow STP information to 
be ignored, and filtered at the port, on a per port basis.
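
A minimal sketch of such a per-port filter is shown below. It assumes
that BPDUs are recognised by the IEEE 802.1D bridge group address
(01:80:C2:00:00:00) in the destination field of the frame; the function
names are illustrative only.

```python
# Destination MAC address used by IEEE 802.1D Spanning Tree BPDUs.
STP_BRIDGE_GROUP = bytes.fromhex("0180c2000000")


def is_stp_bpdu(frame):
    """True if the Ethernet frame is addressed to the STP bridge group."""
    return len(frame) >= 6 and frame[:6] == STP_BRIDGE_GROUP


def accept_frame(frame, bpdu_filter_enabled):
    """Decide whether a frame received on a participant port is processed.

    With the filter enabled, BPDUs are dropped at the port and never
    reach the switch's own STP process.
    """
    return not (bpdu_filter_enabled and is_stp_bpdu(frame))
```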

c) Wire-speed ACL-type filtering based on L3 header info
--------------------------------------------------------
We would like the ability to look into the layer-3 information of a
packet, and selectively monitor or filter, at wire speed, based on
certain layer-3 criteria.
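
To make the request concrete, the following sketch shows first-match
evaluation of rules against source address, destination address and IP
protocol number. The rule format is an assumption made for this
example, and the software model says nothing about how a switch would
achieve this at wire speed in hardware.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    action: str                      # "permit" or "deny"
    src: ipaddress.IPv4Network
    dst: ipaddress.IPv4Network
    protocol: Optional[int] = None   # IP protocol number, None matches any


def evaluate(rules, src, dst, protocol, default="permit"):
    """Return the action of the first matching rule, else the default."""
    s = ipaddress.ip_address(src)
    d = ipaddress.ip_address(dst)
    for rule in rules:
        if (s in rule.src and d in rule.dst
                and (rule.protocol is None or rule.protocol == protocol)):
            return rule.action
    return default


ANY = ipaddress.ip_network("0.0.0.0/0")
# Example policy: drop OSPF (IP protocol 89) wherever it appears.
rules = [Rule("deny", ANY, ANY, protocol=89)]
```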

d) ARP/Broadcast snooping and control
-------------------------------------
Many exchange points insist on participants using IP addresses assigned
by the exchange operator. It is desirable for the operator to be able
to monitor/restrict "off-net" ARP.
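
As a rough sketch, an "off-net" check could compare the sender and
target protocol addresses of each ARP packet with the peering LAN
prefix. The example below assumes a standard 28-byte Ethernet/IPv4 ARP
payload; the prefix 192.0.2.0/24 is a documentation prefix used here as
a placeholder.

```python
import ipaddress
import struct


def arp_is_off_net(arp_payload, peering_lan):
    """True if the ARP sender or target IP lies outside the peering LAN."""
    if len(arp_payload) < 28:
        raise ValueError("truncated ARP payload")
    lan = ipaddress.ip_network(peering_lan)
    sender_ip = ipaddress.ip_address(arp_payload[14:18])
    target_ip = ipaddress.ip_address(arp_payload[24:28])
    return sender_ip not in lan or target_ip not in lan


def build_arp_request(sender_mac, sender_ip, target_ip):
    """Build an Ethernet/IPv4 ARP request payload (helper for the example)."""
    header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    return (header + sender_mac + ipaddress.ip_address(sender_ip).packed
            + bytes(6) + ipaddress.ip_address(target_ip).packed)


mac = bytes.fromhex("020000000001")
on_net = build_arp_request(mac, "192.0.2.10", "192.0.2.20")
off_net = build_arp_request(mac, "10.0.0.1", "192.0.2.20")
```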

As Ethernet is a broadcast medium, broadcast storms have been known to 
bring exchanges to their knees, affecting the forwarding abilities of 
both the exchange's switches, and the participants' routers. 
Monitoring/rate limiting/control of Ethernet broadcast frames is 
desirable.
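
One common way to rate limit is a token bucket. The sketch below is
only a model of the behaviour we would like to configure per port; the
parameter names are hypothetical, and time is passed in explicitly so
that the example is deterministic.

```python
class BroadcastLimiter:
    """Token-bucket limiter for broadcast frames on one port (sketch)."""

    def __init__(self, rate_fps, burst):
        self.rate = rate_fps        # sustained frames per second
        self.burst = burst          # maximum burst size in frames
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        """Return True if a broadcast frame arriving at `now` is forwarded."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False   # over the limit: drop, and ideally log the event
```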

Most exchanges also forbid the speaking of interior routing protocols
across their peering network. Since these take the form of broadcast
frames on Ethernet, broadcast control would help monitor this type of
incident.

e) Policy exception logging
---------------------------
In the above paragraphs, we have asked for some policy-based tools. 
Operators need to know when these policies have been breached.

Good logging of policy exceptions needs to be implemented:
	* SNMP-trap
	* Configurable syslog (i.e. which syslog facility to write to)
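
By way of example, a configurable syslog facility simply changes the
priority value at the start of each message, which in BSD syslog is the
facility code multiplied by 8 plus the severity. The message text below
is invented for illustration, and a real message would also carry a
timestamp and hostname.

```python
# Facility and severity codes as used by BSD syslog.
FACILITIES = {"local0": 16, "local1": 17, "local2": 18, "local3": 19,
              "local4": 20, "local5": 21, "local6": 22, "local7": 23}
SEVERITIES = {"error": 3, "warning": 4, "notice": 5, "info": 6}


def format_policy_event(facility, severity, port, event):
    """Format a simplified syslog line for a policy exception."""
    pri = FACILITIES[facility] * 8 + SEVERITIES[severity]
    return f"<{pri}>ixp-switch: port={port} policy-exception={event}"


line = format_policy_event("local0", "warning", "1/4", "mac-limit-exceeded")
```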

f) Access to management interfaces
----------------------------------
In the past, security of management interfaces on Ethernet switching
products has often been lacking.

CLI or web interfaces should support authentication using 
username/password pairs, to avoid the use of "password only" 
authentication which implies shared passwords.

CLI interfaces should also support SSH access, using either 
username/password or secure key authentication.

Web interfaces should be HTTPS/SSL enabled, to avoid passwords being 
passed in the clear over HTTP.

Management interfaces should be able to perform authentication from an
external source, such as TACACS, RADIUS or LDAP services, as well as
providing locally held accounts (these have to be retained for
emergencies).

All management interfaces, CLI, web and SNMP should be able to benefit 
from access-list control. The access lists should be able to support 
variable-length subnet masks.
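
The sketch below shows what we mean by variable-length subnet masks in
an access list: prefixes of any length, not just classful boundaries.
The prefixes are documentation addresses chosen for the example.

```python
import ipaddress

# Hypothetical management access list using prefixes of arbitrary length.
MGMT_ACL = [ipaddress.ip_network(p)
            for p in ("192.0.2.0/26", "198.51.100.16/28")]


def mgmt_access_allowed(source, acl=MGMT_ACL):
    """True if the source address falls inside any permitted prefix."""
    address = ipaddress.ip_address(source)
    return any(address in network for network in acl)
```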

It should be possible to disable management interfaces on a per-VLAN
basis. Many XP operators choose to configure a "management" VLAN, so
that all management is done out-of-band of the core peering traffic. It
is desirable to have the management interfaces listen on the management
networks only.

g) Port mirroring
-----------------
It is sometimes necessary to mirror participants' ports, either because 
a participant is suspected of some inappropriate activity, or to help 
obtain information to debug a problem.

Not all exchange points have staff on site 24x7, and port mirroring may 
need to be remotely set up, without hands-on intervention on-site.

The ability to allow any port to mirror any other port of a similar or
lower speed within the chassis would allow the operator to connect a
traffic collector/analyser device to a monitoring port, and simply
configure the switch to mirror a port as desired to the monitoring
port.

SCALABILITY AND RESILIENCE
--------------------------

a) Spanning Tree
----------------
Spanning Tree is currently the only solution available to operators of
exchange points for dynamically managing redundant links in their
architecture.

There are a number of problems with Spanning Tree:
	* Slow convergence
		- especially in cases of root bridge re-election
	* Wasteful of resilient/redundant resources
		- redundant links are switched off
		- no traffic sharing
	* Security concerns (highlighted above)

As the routes collected at an Exchange Point can be routed all over the 
world, any routing instability can act like dropping a pebble in a pond, 
and will spread around the Internet.

It's desirable to maintain stable routing sessions across Exchange Point
LANs to minimise these routing flaps, because of the load they place on
routers, and the effects of route dampening penalties.

We believe that being able to declare ports as "end-stations" should 
avoid them being counted in the STP calculation, enable these ports to 
start forwarding more rapidly, and speed overall STP convergence time.

Rapid spanning tree (IEEE 802.1w) should be implemented 
(http://www.ieee802.org/1/pages/802.1w.html).

b) Resilient Packet Ring - IEEE 802.17
--------------------------------------
This is a standards-based version of the technology currently used by 
Cisco called DPT (Dynamic Packet Transport). This consists of a
counter-rotating ring system, with spatial reuse and "ring wrapping"
circuit protection.

The Cisco version is currently implemented over SONET/SDH media, 
however, the standardised version is being designed to be more media 
agnostic, and the IEEE working group has already elected to provide 
support for Gigabit Ethernet and 10 Gigabit Ethernet.

RPR shows promise of becoming an ideal backbone technology for use in a 
flat layer-2 network, such as an exchange point. It will allow for 
redundant self-healing backbones, with optimal use of all interswitch
capacity, without the need for STP.

c) Trunking and Link-Aggregation
--------------------------------
It's become increasingly common for exchange points to become multiple 
switch and multiple site based, and many need to deploy link aggregation 
to handle the volume of interswitch traffic, where it exceeds the 
maximum speed of a single link.

Most equipment implements load-sharing using either round-robin or 
address-based algorithms. The address-based system generally employs a 
hash of source/destination MAC address.

Address-based load-sharing may be preferable to minimise jitter.

In exchange points, many pieces of equipment will have similar MAC 
addresses, especially the first and last bytes (corresponding to vendor 
and slot position on router).

If the hash is only based on part of the address, this can result in 
poor efficiency of load-sharing.

Load-sharing algorithms should consider the whole address when 
calculating the hash used.
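
The effect can be seen in a small example. Both hash functions below
are simplified stand-ins chosen for illustration, not the algorithm of
any particular product, and the MAC addresses are invented so that they
share their first and last bytes.

```python
def hash_last_byte(src, dst, n_links):
    """Partial hash: only the last byte of each address is considered."""
    return (src[-1] ^ dst[-1]) % n_links


def hash_whole_address(src, dst, n_links):
    """Whole-address hash: XOR-fold every byte of both addresses."""
    folded = 0
    for byte in src + dst:
        folded ^= byte
    return folded % n_links


# Eight routers whose addresses differ only in the fifth byte.
dst = bytes.fromhex("02000000ff01")
sources = [bytes([0x02, 0x00, 0x00, 0x00, i, 0x01]) for i in range(8)]

links_partial = {hash_last_byte(s, dst, 4) for s in sources}
links_whole = {hash_whole_address(s, dst, 4) for s in sources}
```

With these addresses the partial hash places every flow on the same
link of a four-link trunk, while the whole-address hash uses all four.
Either way, a given source/destination pair always maps to the same
link, which is what keeps frames of one conversation in order.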

Load-sharing of broadcasts and multicast traffic should be implemented. 
Behaviours such as forwarding all broadcast/multicast traffic out of the 
"primary" port in a trunk have been observed when load-sharing using 
destination MAC addresses has been implemented.

IEEE 802.3ad link-aggregation should be implemented.

d) Multicast Control and Containment
------------------------------------
Most switches are configured with IGMP snooping for multicast control. 
However, in an exchange point, with only routers attached, there is no
IGMP present, only PIM and MSDP, and all multicast packets are flooded 
out of all ports.

An exchange point, however, is an ideal place for multicast peering to
happen: inject the traffic once, and it comes out several times (as much
as is needed, or in the current situation, as much as isn't needed!).

Cisco developed RGMP (Router Group Management Protocol). This is a 
proprietary technology whereby the router can communicate to the switch 
which multicast groups it wishes to see. 

This is, however, a vendor specific feature, and a wide range of routing 
platforms and switching platforms are present at many exchange points - 
both in equipment used by the operator, and the participants.

Therefore, this is not a workable solution for most exchange points,
whose principles often include "equal treatment" of participants.

While it may not solve all potential issues with multicast peering, 
implementing PIM-SM snooping and pruning within the switches will 
achieve the traffic containment requirements.
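
A very simplified model of the state such a switch would keep is
sketched below. It assumes the switch has already decoded PIM
Join/Prune messages into (group, port) events, and it ignores details
such as holdtimes and the port towards the upstream router; the names
are hypothetical.

```python
from collections import defaultdict


class PimSnoopingTable:
    """Per-group egress state built from snooped PIM Join/Prune messages."""

    def __init__(self):
        self.joined = defaultdict(set)   # group address -> ports with a Join

    def on_join(self, group, port):
        self.joined[group].add(port)

    def on_prune(self, group, port):
        self.joined[group].discard(port)

    def egress_ports(self, group, ingress_port):
        """Ports to forward to, instead of flooding out of all ports."""
        return self.joined.get(group, set()) - {ingress_port}
```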

e) Intelligent Layer 2 Forwarding - MAC-SPF
-------------------------------------------
The majority of ethernet switching products in use today are 
switch/router products and contain enough processing power and memory to 
handle sizeable routing tables.

Exchange points which do not involve Multilateral peering currently need 
to provide a transparent layer-2 service to the participants, and 
Ethernet has been widely deployed to achieve this.

In a bi-laterally peered XP, the XP cannot become involved in the
layer-3 routing decision process. There are a couple of reasons for
this, but an overriding one is a matter of scalability. Every
participant will have their own routing view. Involving the XP
infrastructure in layer-3 routing decisions would need the XP to hold as
many routing views as members (MPLS could be employed for this, though
how is not yet clear).

With the growth in size of some exchange points, this has led to 
expansion to multiple sites and switches, and consequently a more 
complex topology. This has shown up the weaknesses in Spanning Tree.

An option this group considers worth exploring is the development of a
dynamic layer-2 routing protocol, to replace MAC learning and STP.

A number of thoughts have been given to this, and the following ideas 
have been arrived at:

* Retain dynamic MAC learning on "end-station" ports
	- added to forwarding database as usual
* Replace dynamic learning on inter-switch links with a routing process
	- switches use some form of layer-2 neighbour discovery
* Use existing OSPF algorithms and similar LSAs to manage the routing 
information
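
To give a feel for the third idea, the sketch below runs a
shortest-path-first (Dijkstra) calculation over a small inter-switch
topology and returns, for each remote switch, the neighbour to forward
towards. The topology format and function name are assumptions made for
the example; no protocol encoding is implied.

```python
import heapq


def spf_next_hops(topology, root):
    """Dijkstra over {switch: {neighbour: cost}}.

    Returns {destination switch: first-hop neighbour of root}.
    """
    settled = {}                      # switch -> (cost, first hop)
    heap = [(0, root, root)]
    while heap:
        cost, node, first_hop = heapq.heappop(heap)
        if node in settled:
            continue                  # already reached at lower or equal cost
        settled[node] = (cost, first_hop)
        for neighbour, link_cost in topology[node].items():
            if neighbour not in settled:
                hop = neighbour if node == root else first_hop
                heapq.heappush(heap, (cost + link_cost, neighbour, hop))
    return {sw: hop for sw, (_, hop) in settled.items() if sw != root}


# Three switches in a triangle; the direct A-C link is more expensive.
topology = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 1},
    "C": {"A": 5, "B": 1},
}
next_hops = spf_next_hops(topology, "A")
```

A MAC address learned on an "end-station" port of switch C would then
be reached from switch A via the next hop computed for C, and all
inter-switch links remain available for forwarding rather than being
blocked.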

This would need to interoperate between different switch vendors to 
allow for scalability and freedom.

Currently, one issue is where and how to pursue this. It's unclear 
whether this should lie within the IEEE 802 LMSC or within the IETF.

Many switch vendors like the sound of this idea, as it solves some 
problems being faced by service providers working in the MAN space, who 
need to provide transparent L2 transport services. However, they need to 
be convinced of demand and ensure interoperability.

The IETF has in the past only been concerned with layer-3 and up,
"IP-over-Foo". However, they are currently of the mindset that they do
need to dig down and look at interaction between IP and the Foo it is
transmitted over. This is one possible avenue for pursuing this further.

f) MPLS
-------
Another option for improving scalability and resilience is MPLS (Multi 
Protocol Label Switching). Currently, there is rather limited support 
for MPLS in switching products, mainly because customer demand is 
pushing vendors to implement this in their high-end router products 
first. Many high-end router products do not possess sufficient port
density for use in an IXP environment.

MPLS may enable an exchange to run a layer-3 network internally, and use 
an existing IGP to route traffic internally. By using MPLS tags to pass
the traffic between the peers on the edge of the IXP cloud, the IXP does 
not need to be aware of IP source and destination of the packets.

PHYSICAL WISHES
---------------
IXPs are high-uptime environments. The equipment used in an IXP needs to
be able to satisfy this requirement, in terms of redundancy and
hot-swappable components.

* Hot swap of management/switch fabric cards with instantaneous failover 
to any installed redundancy (not rebooting onto the "backup")
* Full redundancy of PSUs, and hot-swap (i.e. box should run on 50% of
PSUs)
* Rapid booting and card startup (after all, much functionality is 
implemented in the ASIC hardware)
* GBIC-optics for flexibility, easy replacement, and maximised port 
utilisation (freedom to choose SX/LX, etc)

ACKNOWLEDGEMENT AND THANKS
-------------------------- 
Thanks are due to the "usual suspects" in the RIPE EIX Working Group, but
specifically Christian Panigl, Keith Mitchell and Rick Payne, for their
contributions to this document.