Saturday, 21 November 2015

OSPF External (Type-5) metric calculation and path selection to multiple network exit points for the same prefix.

I encountered a strange issue the other day with respect to routing of prefixes redistributed into OSPF from an external routing protocol (in this case BGP), and thought I'd do some research on External routes in OSPF. It threw up some somewhat counter-intuitive behaviour and as a result I thought I'd write about it.

There is a lot of confusion around how E1 and E2 routes are handled in OSPF, especially in the area of path selection and the path chosen to the ASBR.

What is an External (Type-5) LSA?

Prefixes injected into OSPF by an Autonomous System Boundary Router (ASBR) in a non-stub area are represented by Type-5 LSAs in OSPF. The Type-5 LSA contains the prefix, the advertising ASBR and a redistributed metric value and type. These prefixes are considered to be "External" in OSPF and can be one of two sub-types, which differ only in the way that the ASBR is selected:
  • External Type 2 (E2) - the metric that is contained in the LSA (set at the time of injection) is the primary metric that is taken into account (but not the only one, as we will see).
  • External Type 1 (E1) - the metric that is contained in the LSA (set at the time of injection) is taken into account as well as the cost to reach the ASBR calculated by the OSPF process.
An important thing to remember here is that, unlike distance-vector or hybrid protocols, the metric in the Type-5 LSA is not updated by each router forwarding the LSA, regardless of whether it is E1 or E2, since the only router that can change the contents of an LSA is the originating router. Subsequent forwarding routers can only define the scope of flooding within the area (and between areas, in the case of an ABR). Therefore there is no update of the E1 Type-5 LSA as it propagates throughout the network; the only difference is how each router treats the LSA when selecting the "closest" ASBR.


Type-4 LSA


A Type-4 LSA is generated when an ABR floods a Type-5 LSA into another area. The Type-4 LSA contains the cost of reaching the ASBR via that ABR. Other routers in the area use Type-4 LSAs when calculating the path cost to the ASBRs and to know which ASBRs are available via which ABRs.


OSPF Path Selection Hierarchy


In terms of how OSPF makes a choice between different LSA types when it comes to path selection, it uses the following hierarchy:

  1. Intra-Area (O)
  2. Inter-Area (O IA)
  3. External Type 1 (E1)
  4. External Type 2 (E2)
  5. NSSA Type 1 (N1)
  6. NSSA Type 2 (N2)
The decision made by the router in evaluating E1 and E2 metrics is to determine which ASBR is the best forwarding router; in the case of a single ASBR, there is only one choice.


However, the forward metric to the ASBR is still calculated in the case of both E1 and E2 routes, since, even for an E2 route, the least cost path to the ASBR must still be determined just like any other network!

So you can see that an E1 route would be chosen over an E2 route. So far so good. But how would a router make a decision over two External routes of the same type for the same prefix? Which ASBR is chosen?

For E1 routes, this is where the cost to the ASBR (the forward metric) is added to the redistributed cost assigned at the ASBR. Lowest cost wins.

For E2 routes, the first consideration is the cost assigned at redistribution (i.e. the one contained in the LSA) - lowest cost wins. If these costs are equal, then the lowest cost path (lowest forward metric) to the nearest ASBR advertising the prefix is chosen.

Note that this can even lead to ECMP behaviour and per-flow load-sharing where the costs to an ASBR (or more than one ASBR) are equal, which might not be the behaviour you expect or even desire.
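
The selection rules above can be sketched in a few lines of Python. This is only an illustration of the logic described in this post, not how any router implements it; the `ExternalRoute` class, its field names and the ASBR names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExternalRoute:
    asbr: str            # advertising ASBR (hypothetical label)
    metric_type: int     # 1 for E1, 2 for E2
    lsa_metric: int      # metric carried in the Type-5 LSA
    forward_metric: int  # internal OSPF cost to reach the ASBR


def preference_key(route):
    """Lower key wins. E1 is always preferred over E2."""
    if route.metric_type == 1:
        # E1: redistributed metric plus the cost to reach the ASBR.
        return (1, route.lsa_metric + route.forward_metric, 0)
    # E2: redistributed metric first, forward metric only breaks ties.
    return (2, route.lsa_metric, route.forward_metric)


def best_external_routes(routes):
    """Return every route that shares the best key (more than one means ECMP)."""
    best = min(preference_key(r) for r in routes)
    return [r for r in routes if preference_key(r) == best]
```

For example, two E2 routes with LSA metrics 20 and 10 resolve to the metric-10 ASBR even if it is much further away, while two E2 routes with identical LSA and forward metrics are both returned, which is the ECMP case mentioned above.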

Where should I use E1 and where should I use E2?

Single exit point


In the situation where there is only one exit point for a given prefix (i.e. it's being injected by only one ASBR) then the design choice between E1 and E2 is somewhat academic. 

While it is true that E1 routes require more processing (since the forwarding metric is calculated), just using E2 routes (which are the Cisco default) because it is easiest not to think about it is not a good design philosophy. You need to consider the possibility of growth and network changes and whether your choice now will have implications later.


Multiple exit points


In this scenario, choosing type E1 needs careful selection of the metrics assigned at redistribution and an awareness of the potential changes in forward metric when the network topology changes. It leads to a less deterministic situation in terms of which ASBR is chosen as the exit point for traffic to the advertised network. It could mean that all ASBRs advertising a given destination network are used as exit points for traffic to that destination, with the watershed of which ASBR is chosen by any given set of routers determined, essentially, by the LSA metric + forwarding metric.

In the case of E2, the redistributed metric becomes the first criterion for ASBR selection (somewhat like local preference in BGP) and the lowest redistributed metric will become the preferred outbound path, regardless of the forward metric to the ASBR. This can be useful if there is a preference as to which ASBR should be used for a given prefix. In the case where the same metric is assigned, the forwarding decision will be made on the basis of the forward metric, which, as I mentioned above, results in less deterministic behaviour.


Summary


In summary:

  • Type-5 (External) LSA E1 and E2 metrics simply determine the selection of ASBR.
  • In the case of a tie between two E2 routes with the same metric, the internal path cost is used to break the tie.
  • Regardless of whether type E1 or E2 routes are used in redistribution, the internal cost of the route to the ASBR determines the path taken by traffic destined for that network.
  • In cases where there are multiple exit points (ASBRs) advertising the same prefix, be careful in the selection of External path type (E1 or E2). E2 can be used to your advantage if tight control over preferred outbound routing paths is required.

Please also see the excellent INE blog article on OSPF External Path selection, which clarifies many of the confusing issues around E1 and E2 routes.




Sunday, 6 September 2015

EIGRP RTP Reliable Multicast and Conditional Receive

EIGRP uses the Reliable Transport Protocol (RTP) to manage the delivery of packets that it wishes delivered reliably:
  • Update
  • Query
  • Reply
  • SIA-Query
  • SIA-Reply
Some of these packets are delivered via unicast, and some via multicast, and EIGRP runs directly on top of IP, so some method of ensuring delivery is required; to solve this Cisco has borrowed from TCP the concept of packet sequencing.

Reliable Multicast


This is Cisco's name for the method that it uses in RTP to ensure reliable, ordered delivery. As mentioned above, it is very similar to TCP in that a non-zero Sequence number is used in the packet header for all packets (multicast or unicast) it wishes delivered reliably. The sequence number is incremented by the EIGRP process on a router whenever a reliable RTP packet is sent.

For other packets, a zero is used in the Sequence field.

ACKnowledgement


Every reliable packet must be acknowledged by the neighboring router with an Acknowledgement (ACK), either:
  • in the ACK field of a returning reliable RTP / EIGRP packet, or
  • as a stand-alone ACK, which is essentially a Hello with the ACKed sequence number in the ACK field.
 Note, however, that in RTP the ACK contains the last received sequence number, and NOT the next expected sequence number as in TCP.

Conditional Receive


EIGRP RTP "Conditional Receive" allows a list of 'lagging' routers on a multi-access interface, i.e. routers that have not ACKed (within a timeout window) reliable multicast messages one or more times, to be tracked so that they are sent unicast messages instead. This timeout is determined by the "multicast flow timer", and the interval between unicast transmissions by the "retransmission timeout (RTO)"; both of these are calculated based on the smoothed round-trip time (SRTT), which is a measurement of the time between the sending of a reliable packet and the receipt of its ACK.

In order to ensure that these routers do not try to process BOTH the unicast and multicast messages, the 'lagging' router list is communicated in two specific TLVs: the Sequence TLV and the Next Multicast Sequence TLV.

These are included in a Hello packet (known as a "Sequenced Hello"). All routers receiving the packet examine the Sequence TLV to see if they are on the list: any non-lagging routers will place themselves in 'Conditional Receive mode' (CR-mode); any router receiving the Hello and finding itself in the TLV, or those not receiving the Hello, will not place themselves in CR-mode.

The sending router will then send the next multicast packet with the CR flag set. Those in CR-mode will pick up this packet and use it as usual, then exit CR-mode; others not in CR-mode will ignore it.
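
As a rough illustration of the behaviour described above, the following Python sketch works out which neighbours accept the CR-flagged multicast. The function name, the router names and the `missed_hello` parameter are hypothetical, and the model only covers what is described in this post.

```python
def conditional_receive(neighbors, lagging, missed_hello=frozenset()):
    """Map each neighbour to what it does with the next CR-flagged multicast.

    neighbors    -- all routers on the multi-access segment
    lagging      -- routers listed in the Sequence TLV of the Sequenced Hello
    missed_hello -- routers that never received the Sequenced Hello
    """
    outcome = {}
    for router in neighbors:
        # Only routers that saw the Hello and are NOT listed enter CR-mode.
        in_cr_mode = router not in lagging and router not in missed_hello
        if in_cr_mode:
            outcome[router] = "accepts CR-flagged multicast"
        elif router in lagging:
            outcome[router] = "ignores multicast, served by unicast"
        else:
            outcome[router] = "ignores multicast"
    return outcome
```

With neighbours R1 to R4, R3 listed as lagging and R4 having missed the Hello, only R1 and R2 process the multicast.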

Tuesday, 1 September 2015

Saturday, 29 August 2015

EIGRP Metric notes

Classic Metrics

EIGRP carries the following values in the EIGRP advertisements:
  • Bandwidth
  • Delay
  • Reliability = ratio (expressed as x/255) of frames successfully arriving / frames sent
  • Load = ratio (expressed as x/255) of interface load as measured by Txload
  • MTU
  • Hop Count (default max 100 but can be as high as 255)
MTU and Hop Count are NOT USED in metric calculations.

Metric calculation

In general, EIGRP takes the WORST CASE of the 'classic' metrics that go to make up the composite metric value. Each metric component is carried separately in the EIGRP messages, and the composite is calculated in each router according to the metric formula.
Worst case metrics mean:
  • Bandwidth = MIN (BW along the path)
  • Delay = SUM (Delay along the path)
  • Reliability = MIN (Reliability along the path)
  • Load = MAX (Txload along the path)
Reliability and Load DO NOT cause the metric to be re-advertised when they change; a snapshot of the value is used. However Txload is calculated as an average value. They are both relics of IGRP, retained for reasons of backward compatibility, and are not particularly useful.
Delay is also used to indicate an unavailable route as an INFINITE METRIC of 16,777,215 (24 bits of all 1's), used in Split Horizon w/ Poisoned Reverse and Route Poisoning. Delay is represented in 10s of microseconds by the metric value.
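
The worst-case rules above can be sketched as follows. The `LinkMetrics` class and its field names are made up for illustration, and bandwidth in kbps is an assumption about units rather than something stated in these notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkMetrics:
    bandwidth_kbps: int   # assumed unit: kbps
    delay_tens_usec: int  # delay in 10s of microseconds
    reliability: int      # x/255, higher is better
    load: int             # x/255, higher is worse


def path_metrics(links):
    """Combine per-link values into the worst-case values for the path."""
    return LinkMetrics(
        bandwidth_kbps=min(link.bandwidth_kbps for link in links),
        delay_tens_usec=sum(link.delay_tens_usec for link in links),
        reliability=min(link.reliability for link in links),
        load=max(link.load for link in links),
    )
```

A path over a 100 Mbps link and a 10 Mbps link therefore advertises the 10 Mbps bandwidth and the sum of both delays.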

Composite metric formula and K values

K values are constants used to weight the metric calculation and can take the values 0-255. Since it is imperative that all routers calculate the composite metric in the same way, these must match on every router in the EIGRP autonomous system. Any routers with mismatched K values cannot form an adjacency.

\[CM = (K_1 . BW_{inv} + \frac{K_2 . BW_{inv}}{256 - Load_{Max}} + 256 . K_3 . \sum Delay ). (\frac {K_5}{K_4 + Reliability_{Min}})\]

\[BW_{inv} = \frac{256 . 10^7}{BW_{min}}\]

N.B. The formula is conditional, and IF K5=0, then the entire final term \(\frac {K_5}{K_4 + Reliability_{Min}}\) is evaluated to 1. [See EIGRP RFC Draft 0.3 section 5.5.3]

Default K values of K1, K3 = 1 and K2, K4, K5 = 0 lead to a simplification of the formula to:

\[CM = BW_{inv} + 256\sum Delay\]

Delay and BW are multiplied by 256 to convert the IGRP 24 bit metric to an EIGRP 32 bit metric.
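
A quick worked example of the formula, as a Python sketch. The function and parameter names are my own; bandwidth in kbps and the use of integer (truncating) division are assumptions, and real implementations may truncate at a different step.

```python
def classic_metric(bw_min_kbps, delay_sum_tens_usec, load_max=1,
                   reliability_min=255, k1=1, k2=0, k3=1, k4=0, k5=0):
    """Classic composite metric following the formula in these notes."""
    # BW_inv = 256 * 10^7 / BW_min (bandwidth assumed to be in kbps)
    bw_inv = 256 * 10**7 // bw_min_kbps
    cm = (k1 * bw_inv
          + (k2 * bw_inv) // (256 - load_max)
          + 256 * k3 * delay_sum_tens_usec)
    # The final term is conditional: with K5 = 0 it evaluates to 1.
    if k5 != 0:
        cm = cm * k5 // (k4 + reliability_min)
    return cm
```

With default K values, a 100 Mbps path with 100 microseconds of total delay (a delay value of 10) gives 25600 + 2560 = 28160, and a 10 Mbps path with 1000 microseconds of delay gives 281600.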

Wide Metrics

EIGRP has found itself, like other protocols, in the position that its metrics have fallen behind the pace of technological advances.
The classic metrics are unable to make any distinction between interfaces with a BW of 10Gbps or more, or between delays of less than 10 microseconds (delay metric of 1). Also, rounding errors in successively de-scaling and then scaling the metric components for composite metric calculations lead to a loss of resolution.
Unfortunately, this has meant that the affected metrics, although doing the same function as before, have had to be re-named and the formulae for calculation modified in order to distinguish from the classic metrics.

Throughput [Bandwidth]

Throughput is the Wide metric replacing the Bandwidth metric (scaled by 256), with a new calculation of:
\[T_{min} = \frac{65536 . 10^7}{BW_{min}}\]

Latency [Delay]

Latency is the Wide metric replacing the Delay metric, and is calculated using the following formula:
\[La = \frac{65536 . IntDelay}{10^6}\] (where IntDelay is in picoseconds, i.e. units of \(10^{-12}\) s)
IntDelay is calculated differently based on whether or not bandwidth and delay are manually set, and on the native speed of the interface as follows:

1Gbps and lower without bandwidth and delay commands

IntDelay = The IOS default delay converted to picoseconds

Over 1Gbps without bandwidth and delay commands

IntDelay = \(10^{13}\) / BW

WITH bandwidth command

IntDelay = The IOS default delay converted to picoseconds

WITH delay command

IntDelay = configured delay value x \(10^7\) (i.e. configured delay value in picoseconds)
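
The Throughput and Latency calculations above can be sketched in Python as below. All names are hypothetical. Bandwidth in kbps and integer division are assumptions, and so is the precedence given to the delay command when both commands are configured, since these notes do not cover that case.

```python
def throughput(bw_min_kbps):
    """Wide-metric Throughput: 65536 * 10^7 / BW_min (BW assumed in kbps)."""
    return 65536 * 10**7 // bw_min_kbps


def latency(int_delay_ps):
    """Wide-metric Latency: 65536 * IntDelay / 10^6 (IntDelay in picoseconds)."""
    return 65536 * int_delay_ps // 10**6


def int_delay_ps(native_bw_kbps, default_delay_ps,
                 bandwidth_configured=False, configured_delay=None):
    """Pick IntDelay (picoseconds) according to the four cases listed above."""
    if configured_delay is not None:
        # Delay command: configured value x 10^7 (assumed to take precedence).
        return configured_delay * 10**7
    if bandwidth_configured or native_bw_kbps <= 10**6:
        # Bandwidth command, or 1Gbps and lower: use the default delay.
        return default_delay_ps
    # Over 1Gbps with no bandwidth or delay command.
    return 10**13 // native_bw_kbps
```

For a 10Gbps interface with neither command configured, IntDelay works out to \(10^6\) picoseconds, giving a Latency of 65536 and a Throughput of 65536.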

Extended metrics

Three extended metrics are defined for future use, but are not currently supported:
  • Jitter
  • Energy
  • Quiescent Energy
These are incorporated with a K6 constant.

The updated Wide Metric is as follows:
\[WM = (K_1 . T_{min} + \frac{K_2 . T_{min}}{256 - Load_{Max}} + K_3 . \sum La + K_6 . ExtM ). (\frac {K_5}{K_4 + Reliability_{Min}})\]

RIB compatibility

Since the wide metric can possibly result in a value wider than 32 bits, it must be downscaled before the route can be installed in the RIB, since the RIB can only support 32 bits. This does not influence EIGRP in any way; it is simply so that the RIB can have a valid metric value for the best path that is handed down to it.
This is done by dividing the wide metric by the value (default 128 with the possible values of 1-255) configured in the metric rib-scale EIGRP command.
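
Putting the wide metric formula and the RIB scaling together, a minimal Python sketch might look like this. The function names are hypothetical, and K6 defaulting to 0 and the use of integer division are assumptions.

```python
def wide_metric(t_min, latency_sum, load_max=1, reliability_min=255, ext=0,
                k1=1, k2=0, k3=1, k4=0, k5=0, k6=0):
    """Wide composite metric following the formula in these notes."""
    wm = (k1 * t_min
          + (k2 * t_min) // (256 - load_max)
          + k3 * latency_sum
          + k6 * ext)
    # As with the classic formula, the final term is 1 when K5 = 0.
    if k5 != 0:
        wm = wm * k5 // (k4 + reliability_min)
    return wm


def rib_metric(wm, rib_scale=128):
    """Downscale the wide metric for the RIB (rib-scale range 1-255)."""
    if not 1 <= rib_scale <= 255:
        raise ValueError("rib-scale must be between 1 and 255")
    return wm // rib_scale
```

For a Throughput of 655360 and a total Latency of 655360, the wide metric is 1310720 and the value handed to the RIB with the default scale of 128 is 10240.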

Metric 'tweaking'

Bandwidth should NEVER be modified in an attempt to modify path selection, since it is used in many other IOS functions (e.g. QoS); instead DELAY should be adjusted, as it has no other function in IOS than in EIGRP metric calculations, and it is additive so can be guaranteed to affect the composite metric and hence the best-path selection.

Friday, 7 August 2015

CCIE Routing and Switching Glossary

A Glossary of terms encountered throughout my CCIE journey that I find confusing or difficult to remember:



CoS - Class of Service - an Ethernet field used when 802.1q tagging is implemented to allow prioritisation of frames.

DF bit - "Do Not Fragment" bit - flag in IP header 'flags' field used to define whether or not the packet should be fragmented. Can be 'set' in a route map for policy routing.

DSCP - Differentiated Services Code Point - a standardised [RFC2474] classification coding for QoS.

DS field - Field in the IP header used to define the packet's traffic classification - aka DiffServ, Differentiated Services - in the past [RFC791 / RFC1349] was called the ToS (Type of Service) field. Now contains DSCP and ECN [RFC2474]. Used in QoS.

IP precedence - First three bits in the DS field used to classify IP traffic. Aligns with Ethernet CoS field values. Used in QoS. Can be 'set' in a route map for policy routing.

ToS - Type of Service
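
To make the relationship between the DS field, DSCP, ECN and IP precedence concrete, here is a small Python sketch that splits the 8-bit field into its parts. The function name is my own.

```python
def split_ds_field(ds_byte):
    """Return (dscp, ecn, ip_precedence) from the 8-bit DS / ToS byte."""
    if not 0 <= ds_byte <= 0xFF:
        raise ValueError("DS field is a single byte")
    dscp = ds_byte >> 2           # upper six bits [RFC2474]
    ecn = ds_byte & 0b11          # lower two bits
    ip_precedence = ds_byte >> 5  # first three bits of the field
    return dscp, ecn, ip_precedence
```

A byte value of 0xB8 decodes to DSCP 46 with IP precedence 5 and ECN 0.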

Saturday, 25 July 2015

IPv4 Header Game

I found a website that creates simple games, and I used it to make one [IPv4 Header game] to help with identification of fields in the IPv4 header.

Might follow this up with others for other headers / frame contents etc.

Sunday, 19 July 2015

Multiple Spanning Tree and Cisco Per-VLAN Spanning Tree interactions

MST and PVST+ interoperability

This confused me for quite some time, but turns out to be relatively simple, so I thought I would write a quick post about it.

The case of MST interoperating with CST and RSTP is straightforward, since both types of spanning tree will have a single instance (IST in the case of the MST process) with a single root etc. These can be used to interact and determine the root bridge for the entire network (an extended single spanning-tree instance).

PVST+ interaction is more complex, since each VLAN has its own instance, each with potentially a different root bridge and spanning tree topology (which is kind of the point of the technology!), and determining port roles for boundary ports (i.e. the ports interconnecting the MST and PVST+ regions) that are consistent for all VLANs is much more difficult.

First of all, VLAN 1's BPDUs are used to represent the entire PVST+ region, and IST (MST instance 0) represents the MST region side using PVST Simulation.

PVST Simulation

MST uses PVST+ BPDUs to speak to all PVST+ instances, each containing the same IST information. This allows PVST+ to make a consistent choice about a port's role and state. IST also needs to be sure that VLAN 1's BPDUs represent the state for all VLANs in the PVST+ region.

The port roles in MST - PVST+ boundary ports are: Designated, Root, and non-designated.

MST boundary Designated Port

An MST boundary port will become designated if its own BPDUs (carrying the IST information) are superior to the received PVST+ VLAN 1 BPDUs.

Also, to maintain PVST+ simulation consistency, all received BPDUs (i.e. for all VLANs) on an MST boundary DP must be inferior.

MST boundary Root Port

Keeping in mind that an MST region can be modeled as a single switch, it follows that for an MST boundary port to become a Root Port toward the CIST root bridge it must be receiving the best VLAN 1 BPDU seen on ANY of the MST region's boundary ports.

Also, to maintain PVST+ simulation consistency, all received BPDUs for VLANs other than VLAN1 on an MST boundary RP must be identical or superior to those of VLAN1.

PVST Simulation Inconsistency

An inconsistency arises if the root bridge region for non-VLAN 1 instances is different from that of VLAN 1; this is indicated to the switch by a failure of the consistency criteria above.

If the PVST Simulation consistency criteria are not met, then the port will be placed in a blocked state (designated PVST Simulation Inconsistent or Root Inconsistent) until the criteria are met.
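
The two consistency criteria can be expressed as a short Python check. This is a simplified sketch: it compares BPDUs by root bridge ID only, written as a (priority, MAC) tuple where the lower value is superior, and ignores the other BPDU fields. The function name and argument layout are hypothetical.

```python
def pvst_sim_consistent(role, ist_root_id, received):
    """Check the PVST Simulation criteria for an MST boundary port.

    role        -- "designated" or "root"
    ist_root_id -- (priority, mac) advertised by the MST region (IST)
    received    -- dict of VLAN number -> (priority, mac) received from PVST+
    """
    if role == "designated":
        # Every received BPDU, for every VLAN, must be inferior (higher).
        return all(bpdu > ist_root_id for bpdu in received.values())
    if role == "root":
        # Non-VLAN 1 BPDUs must be identical or superior to VLAN 1's.
        vlan1 = received[1]
        return all(bpdu <= vlan1
                   for vlan, bpdu in received.items() if vlan != 1)
    raise ValueError("role must be 'designated' or 'root'")
```

A designated boundary port whose region advertises priority 24576 is inconsistent as soon as a PVST+ switch sends a VLAN 10 BPDU with priority 4106, which matches the situation described below.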

In the diagram, the MST region is root for VLAN1 (on switch DLS1), and is therefore trying to become root for all VLANs on its boundary ports. However, PVST+ has been configured to consider ALS1 as root bridge for VLANs 10 and 20, and ALS2 for VLANs 30 and 40. In this case, they are sending superior BPDUs for these VLANs to the MST boundary ports, which are then protecting the network by placing those ports into blocking state until the inconsistency is resolved.



An example of an error message on the console of DLS1 (a 3750) is shown below:
%SPANTREE-2-PVSTSIM_FAIL: Superior PVST BPDU received on VLAN 10 port Fa0/1, claiming root 4106:001b.0ddc.e580. Invoking root guard to block the port.

 This can be resolved in one of two ways:

  1. Change the VLAN 1 root bridge to either of the PVST+ bridges.
  2. Change the priority of VLANs 10 - 40 to be higher than (i.e. inferior to) that of VLAN 1 on both the MST and PVST+ switches.