Saturday 21 November 2015

OSPF External (Type-5) metric calculation and path selection to multiple network exit points for the same prefix.

I encountered a strange issue the other day with respect to routing of prefixes redistributed into OSPF from an external routing protocol (in this case BGP), and thought I'd do some research on External routes in OSPF. It threw up some somewhat counter-intuitive behaviour and as a result I thought I'd write about it.

There is a lot of confusion around how E1 and E2 routes are handled in OSPF, especially in the area of path selection and the path chosen to the ASBR.

What is an External (Type-5) LSA?

Prefixes injected into OSPF by an Autonomous System Boundary Router (ASBR) in a non-stub area are represented by Type-5 LSAs in OSPF. The Type-5 LSA contains the prefix, the advertising ASBR and a redistributed metric value and type. These prefixes are considered to be "External" in OSPF and can be one of two sub-types, which vary only in the way that the best path (and hence the ASBR) is selected:
  • External Type 2 (E2) - the metric that is contained in the LSA (set at the time of injection) is the primary metric that is taken into account (but not the only one, as we will see).
  • External Type 1 (E1) - the metric that is contained in the LSA (set at the time of injection) is taken into account as well as the cost to reach the ASBR calculated by the OSPF process.
An important thing to remember here is that, unlike distance-vector or hybrid protocols, the metric in the Type-5 LSA is not updated by each router forwarding the LSA, regardless of whether it is E1 or E2, since the only router that can change the contents of an LSA is the originating router. Subsequent forwarding routers can only define the scope of flooding within the area (and between areas, in the case of an ABR). Therefore there is no update of the E1 Type-5 LSA as it propagates throughout the network; the only difference is how each router treats the LSA when selecting the "closest" ASBR.


Type-4 LSA


A Type-4 LSA is generated when an ABR floods a Type-5 LSA into another area. The Type-4 LSA contains the cost of reaching the ASBR via that ABR. Other routers in the area use Type-4 LSAs when calculating the path cost to the ASBRs and to know which ASBRs are available via which ABRs.


OSPF Path Selection Hierarchy


In terms of how OSPF makes a choice between different LSA types when it comes to path selection, it uses the following hierarchy:

  1. Intra-Area (O)
  2. Inter-Area (O IA)
  3. External Type 1 (E1) / NSSA Type 1 (N1)
  4. External Type 2 (E2) / NSSA Type 2 (N2)
(N1/N2 are the NSSA Type-7 equivalents; where an E and an N route of the same type tie for the same prefix, the Type-5 (E) route is preferred.)
The decision made by the router in evaluating E1 and E2 metrics is to determine which ASBR is the best forwarding router. In the case of a single ASBR, there is only one choice.


However, the forwarding metric to the ASBR is still calculated in the case of both E1 and E2 routes, since, even for an E2 route, the least-cost path to the ASBR must still be determined, just like for any other network!

So you can see that an E1 route would be chosen over an E2 route. So far so good. But how would a router make a decision over two External routes of the same type for the same prefix? Which ASBR is chosen?

For E1 routes, this is where the cost to the ASBR (the forward metric) is added to the redistributed cost assigned at the ASBR. Lowest cost wins.

For E2 routes, the first consideration is the cost assigned at redistribution (i.e. the one contained in the LSA) - lowest cost wins. If these costs are equal, then the lowest cost path (lowest forward metric) to the nearest ASBR advertising the prefix is chosen.
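As a toy illustration (this is not router code; the dictionaries, field names and values are invented), the selection rules above map neatly onto a single comparison key: type rank first (E1 beats E2), then the metrics in the order each type considers them.

```python
# Hypothetical sketch of OSPF external route selection for one prefix.
# Each candidate route is a dict with:
#   'type'           - 'E1' or 'E2'
#   'lsa_metric'     - the metric set at redistribution (carried in the LSA)
#   'forward_metric' - this router's cost to reach the advertising ASBR

def best_external(routes):
    def key(r):
        if r['type'] == 'E1':
            # E1: redistributed metric + cost to the ASBR, as one value
            return (1, r['lsa_metric'] + r['forward_metric'], 0)
        # E2: redistributed metric first, forward metric only as tie-breaker
        return (2, r['lsa_metric'], r['forward_metric'])
    return min(routes, key=key)

candidates = [
    {'type': 'E2', 'lsa_metric': 20, 'forward_metric': 5},
    {'type': 'E2', 'lsa_metric': 20, 'forward_metric': 3},  # wins the E2 tie
]
print(best_external(candidates))
```

Note that an E1 route always wins over an E2 route here regardless of the metric values, because the type rank is compared first.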

Note that this can even lead to ECMP behaviour and per-flow load-sharing where the costs to an ASBR (or via more than one ASBR) are equal, which might not be the behaviour you expect or even desire.

Where should I use E1 and where should I use E2?

Single exit point


In the situation where there is only one exit point for a given prefix (i.e. it's being injected by only one ASBR) then the design choice between E1 and E2 is somewhat academic. 

While it is true that E1 routes require more processing (since the forwarding metric is calculated), just using E2 routes (the Cisco default) because it is easier not to think about it is not a good design philosophy. You need to consider the possibility of growth and network changes, and whether your choice now will have implications later.


Multiple exit points


In this scenario, choosing type E1 requires careful selection of the metrics assigned at redistribution, and an awareness of how the forward metric may change when the network topology changes. It leads to a less deterministic situation in terms of which ASBR is chosen as the exit point for the traffic to the advertised network. It could mean that all ASBRs advertising a given destination network are used as exit points for traffic to that destination, with the watershed of which ASBR is chosen by any given set of routers determined, essentially, by the LSA metric + forwarding metric.

In the case of E2, the redistributed metric becomes the first criterion for ASBR selection (somewhat like local preference in BGP), and the lowest redistributed metric will become the preferred outbound path, regardless of the forward metric to the ASBR. This can be useful if there is a preference as to which ASBR should be used for a given prefix. In the case where the same metric is assigned, the forwarding decision will be made on the basis of the forward metric, which, as I mentioned above, results in less deterministic behaviour.


Summary


In summary:

  • Type-5 (External) LSA E1 and E2 metrics simply determine the selection of the ASBR.
  • In the case of a tie between two E2 routes with the same metric, the internal path cost to the ASBR is used to break the tie.
  • Regardless of whether type E1 or E2 routes are used in redistribution, the internal cost of the route to the ASBR determines the path taken by traffic destined for that network.
  • In cases where there are multiple exit points (ASBRs) advertising the same prefix, be careful in the selection of External path type (E1 or E2). E2 can be used to your advantage if tight control over preferred outbound routing paths is required.

Please also see the excellent INE blog article on OSPF External Path selection, which clarifies many of the confusing issues around E1 and E2 routes.




Sunday 6 September 2015

EIGRP RTP Reliable Multicast and Conditional Receive

EIGRP uses the Reliable Transport Protocol (RTP) to manage the delivery of packets that it wishes delivered reliably:
  • Update
  • Query
  • Reply
  • SIA-Query
  • SIA-Reply
Some of these packets are delivered via unicast, and some via multicast. Since EIGRP runs directly on top of IP, some method of ensuring delivery is required; to solve this, Cisco has borrowed the concept of packet sequencing from TCP.

Reliable Multicast


This is Cisco's name for the method that it uses in RTP to ensure reliable, ordered delivery. As mentioned above, it is very similar to TCP in that a non-zero Sequence number is used in the packet header for all packets (multicast or unicast) it wishes delivered reliably. The sequence number is incremented by the EIGRP process on a router whenever a reliable RTP packet is sent.

For other packets, a zero is used in the Sequence field.

ACKnowledgement


These packets are all required to be responded to by the neighboring router with an Acknowledgement (ACK), either:
  • in the ACK field of a returning reliable RTP / EIGRP packet, or
  • as a stand-alone ACK, which is essentially a Hello with the ACKed sequence number in the ACK field.
Note, however, that in RTP the ACK contains the last received sequence number, and NOT the next expected sequence number as in TCP.

Conditional Receive


EIGRP RTP "Conditional Receive" allows a list of 'lagging' routers on a multi-access interface (i.e. routers that have failed, one or more times, to ACK reliable multicast messages within a timeout window) to be tracked, so that they are sent unicast messages instead. This timeout is determined by the "multicast flow timer", and the interval between unicast retransmissions by the "retransmission timeout (RTO)"; both of these are calculated from the smoothed round-trip time (SRTT), which is a measurement of the time between the sending of a reliable packet and the receipt of its ACK.

In order to ensure that these routers do not try to process BOTH the unicast and multicast messages, the 'lagging' router list is communicated in two specific TLVs: the Sequence TLV and the Next Multicast Sequence TLV.

These are included in a Hello packet (known as a "Sequenced Hello"). All routers receiving the packet examine the Sequence TLV to see if they are on the list: any non-lagging routers will place themselves in 'Conditional Receive mode' (CR-mode); any router that finds itself listed in the TLV, or that did not receive the Hello at all, will not enter CR-mode.

The sending router will then send the next multicast packet with the CR flag set and those in CR-mode will pick up this packet and use it as usual,  then exit CR-mode; others not in CR-mode will ignore it.
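The decision each router makes on receiving (or missing) a Sequenced Hello can be sketched as follows (a toy model; the router IDs and function names are invented for illustration):

```python
# Toy model of the Conditional Receive decision. A router that enters CR-mode
# will process the next multicast packet sent with the CR flag set; a router
# listed as 'lagging' in the Sequence TLV (or one that missed the Sequenced
# Hello) stays out of CR-mode and relies on unicast retransmissions instead.

def enters_cr_mode(my_router_id, sequence_tlv_lagging_list, received_hello=True):
    if not received_hello:
        return False  # missed the Sequenced Hello entirely, so no CR-mode
    return my_router_id not in sequence_tlv_lagging_list

lagging = ['R3']  # R3 has not ACKed the last reliable multicast in time
print(enters_cr_mode('R2', lagging))  # True  - R2 processes the CR-flagged multicast
print(enters_cr_mode('R3', lagging))  # False - R3 waits for a unicast retransmission
```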

Tuesday 1 September 2015

Saturday 29 August 2015

EIGRP Metric notes

Classic Metrics

EIGRP carries the following values in the EIGRP advertisements:
  • Bandwidth
  • Delay
  • Reliability = ratio (expressed as x/255) of frames successfully arriving / frames sent
  • Load = ratio (expressed as x/255) of interface load as measured by Txload
  • MTU
  • Hop Count (default max 100 but can be as high as 255)
MTU and Hop Count are NOT USED in metric calculations.

Metric calculation

In general, EIGRP takes the WORST CASE of the 'classic' metrics that go to make up the composite metric value. Each metric component is carried separately in the EIGRP messages, and the composite is calculated in each router according to the metric formula.
Worst case metrics mean:
  • Bandwidth = MIN (BW along the path)
  • Delay = SUM (Delay along the path)
  • Reliability = MIN (Reliability along the path)
  • Load = MAX (Txload along the path)
Reliability and Load DO NOT cause the metric to be re-advertised when they change; a snapshot of the value is used (although Txload itself is calculated as an average). They are both relics of IGRP, retained for reasons of backward compatibility, and are not particularly useful.
Delay is also used to indicate an unavailable route as an INFINITE METRIC of 16,777,215 (24 bits of all 1's), used in Split Horizon with Poisoned Reverse and in Route Poisoning. Delay is represented in the metric value in tens of microseconds.

Composite metric formula and K values

K values are constants used to weight the metric calculation and can take the values 1-255. Since it is imperative that all routers calculate the composite metric in the same way, these must match on every router in the EIGRP autonomous system. Any routers with mis-matching K values cannot form an adjacency.

\[CM = (K_1 . BW_{inv} + \frac{K_2 . BW_{inv}}{256 - Load_{Max}} + 256 . K_3 . \sum Delay ). (\frac {K_5}{K_4 + Reliability_{Min}})\]

\[BW_{inv} = \frac{256 . 10^7}{BW_{min}}\]

N.B. The formula is conditional, and IF K5=0, then the entire final term \(\frac {K_5}{K_4 + Reliability_{Min}}\) is evaluated to 1. [See EIGRP RFC Draft 0.3 section 5.5.3]

Default K values of K1, K3 = 1 and K2, K4, K5 = 0 lead to a simplification of the formula to:

\[CM = BW_{inv} + 256\sum Delay\]

Delay and BW are multiplied by 256 to convert the IGRP 24 bit metric to an EIGRP 32 bit metric.
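With the default K values, the simplified calculation can be sketched in a few lines (illustrative only: IOS performs its own internal scaling and truncation, so treat the integer division here as an approximation of that behaviour):

```python
# Sketch of the default (K1=K3=1, K2=K4=K5=0) classic composite metric.
# bw_min_kbps is the lowest bandwidth along the path in kbps; the delays are
# the per-hop interface delays in tens of microseconds, as carried in updates.

def classic_metric(bw_min_kbps, delays_tens_of_usec):
    bw_inv = (10**7 // bw_min_kbps) * 256   # inverse bandwidth, scaled by 256
    return bw_inv + 256 * sum(delays_tens_of_usec)

# Path over two FastEthernet hops (100,000 kbps, delay 100 usec = 10 units each):
print(classic_metric(100_000, [10, 10]))  # 30720
```

A single FastEthernet hop gives the familiar value 28160 (25600 for bandwidth + 2560 for delay).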

Wide Metrics

EIGRP has found itself, like other protocols, in the position that its metrics have fallen behind the pace of technological advances.
The BW metric, in the classic form, is unable to make any distinction between interfaces with BW of 10Gbps or more, or with a delay of less than 10 microseconds (delay metric of 1). Also, rounding errors in successively de-scaling and then scaling the metric components for composite metric calculations lead to a loss of resolution.
Unfortunately, this has meant that the affected metrics, although performing the same function as before, have had to be renamed, and the formulae for their calculation modified, in order to distinguish them from the classic metrics.

Throughput [Bandwidth]

Throughput is the Wide metric replacing the Bandwidth metric (scaled by 256), with a new calculation of:
\[T_{min} = \frac{65536 . 10^7}{BW_{min}}\]

Latency [Delay]

Latency is the Wide metric replacing the Delay metric, and is calculated using the following formula:
\[La = \frac{65536 . IntDelay}{10^6}\] (where IntDelay is in picoseconds (1x\(10^{-12}\) s))
IntDelay is calculated differently based on whether or not bandwidth and delay are manually set, and on the native speed of the interface as follows:

1Gbps and lower without bandwidth and delay commands

IntDelay = The IOS default delay converted to picoseconds

Over 1Gbps without bandwidth and delay commands

IntDelay = \(10^{13}\) / BW

WITH bandwidth command

IntDelay = The IOS default delay converted to picoseconds

WITH delay command

IntDelay = configured delay value x \(10^7\) (i.e. configured delay value in picoseconds)
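The four cases above can be sketched as a single function (illustrative only; the default-delay and unit conventions are as described above, and the argument names are invented):

```python
# Sketch of the IntDelay (picoseconds) selection rules.
#   default_delay_usec - the IOS default interface delay, in microseconds
#   bw_kbps            - the interface's native bandwidth, in kbps
#   cfg_bw / cfg_delay - manually configured values, or None if not set
#                        (cfg_delay in the usual tens-of-microseconds units)

def int_delay_ps(default_delay_usec, bw_kbps, cfg_bw=None, cfg_delay=None):
    if cfg_delay is not None:
        return cfg_delay * 10**7           # configured delay converted to ps
    if cfg_bw is not None:
        return default_delay_usec * 10**6  # default delay converted to ps
    if bw_kbps > 1_000_000:                # natively faster than 1 Gbps
        return 10**13 // bw_kbps
    return default_delay_usec * 10**6      # 1 Gbps and below: default delay

# A 10 Gbps interface with nothing configured: 10^13 / 10^7 = 10^6 ps (1 usec)
print(int_delay_ps(10, 10_000_000))
```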

Extended metrics

Three extended metrics are defined for future use, but are not currently supported:
  • Jitter
  • Energy
  • Quiescent Energy
These are incorporated with a K6 constant.

The updated Wide Metric is as follows:
\[WM = (K_1 . T_{min} + \frac{K_2 . T_{min}}{256 - Load_{Max}} + K_3 . \sum La + K_6 . ExtM ). (\frac {K_5}{K_4 + Reliability_{Min}})\]
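Under the default K values (K1 = K3 = 1, the rest 0), WM reduces to Throughput plus the sum of Latency, which can be sketched as (illustrative only, using the same units as the formulas above):

```python
# Sketch of the default-K Wide Metric: WM = T_min + sum(La).
#   bw_min_kbps   - lowest bandwidth along the path, in kbps
#   int_delays_ps - per-hop IntDelay values, in picoseconds

def wide_metric(bw_min_kbps, int_delays_ps):
    t_min = (65536 * 10**7) // bw_min_kbps               # Throughput
    latency = sum(65536 * d // 10**6 for d in int_delays_ps)  # sum of Latency
    return t_min + latency

# 10 Gbps path, two hops each with IntDelay = 10^6 ps (1 usec):
print(wide_metric(10_000_000, [10**6, 10**6]))  # 196608
```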

RIB compatibility

Since the wide metric can possibly result in a value wider than 32 bits, this must be downscaled before the route can be installed in the RIB since the RIB can only support 32 bits. This does not influence EIGRP in any way, it is simply so that the RIB can have a valid metric value for the best path that is handed down to the RIB.
This is done by dividing the wide metric by the value (default 128 with the possible values of 1-255) configured in the metric rib-scale EIGRP command.
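For example, the downscaling is a simple integer division (the wide-metric value here is invented for illustration):

```python
wm = 196_608            # an example wide-metric value
rib_scale = 128         # the default for 'metric rib-scale'
print(wm // rib_scale)  # 1536 - the 32-bit value handed to the RIB
```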

Metric 'tweaking'

Bandwidth should NEVER be modified in an attempt to modify path selection, since it is used in many other IOS functions (e.g. QoS); instead DELAY should be adjusted, as it has no other function in IOS than in EIGRP metric calculations, and it is additive so can be guaranteed to affect the composite metric and hence the best-path selection.

Friday 7 August 2015

CCIE Routing and Switching Glossary

A Glossary of terms encountered throughout my CCIE journey that I find confusing or difficult to remember:



CoS - Class of Service - an Ethernet field used when 802.1q tagging is implemented to allow prioritisation of frames.

DF bit - "Do Not Fragment" bit - flag in IP header 'flags' field used to define whether or not the packet should be fragmented. Can be 'set' in a route map for policy routing.

DSCP - Differentiated Services Code Point - a standardised [RFC2474] classification coding for QoS.

DS field - Field in the IP header used to define the packet's traffic classification - aka DiffServ, Differentiated Services - in the past [RFC791 / RFC1349] was called the ToS (Type of Service) field. Now contains DSCP and ECN [RFC2474]. Used in QoS.

IP precedence - First three bits in the DS field used to classify IP traffic. Aligns with Ethernet CoS field values. Used in QoS. Can be 'set' in a route map for policy routing.

ToS - Type of Service

Saturday 25 July 2015

IPv4 Header Game

I found a website that creates simple games, and I used it to make one [IPv4 Header game] to help with identification of fields in the IPv4 header.

Might follow this up with others for other headers /  frame contents etc.

Sunday 19 July 2015

Multiple Spanning Tree and Cisco Per-VLAN Spanning Tree interactions

MST and PVST+ interoperability

This confused me for quite some time, but turns out to be relatively simple, so I thought I would write a quick post about it.

The case of MST interoperating with CST and RSTP is straightforward, since both types of spanning tree will have a single instance (the IST, in the case of the MST process) with a single root etc. These can be used to interact and determine the root bridge for the entire network (an extended single spanning-tree instance).

PVST+ interaction is more complex, since each VLAN has its own instance, each with potentially a different root bridge and spanning-tree topology (which is rather the point of the technology!), and determining port roles for boundary ports (i.e. the ports interconnecting the MST and PVST+ regions) in a way that is consistent for all VLANs is much more difficult.

First of all, VLAN 1's BPDUs are used to represent the entire PVST+ region, and the IST (MST instance 0) represents the MST region side using PVST Simulation.

PVST Simulation

MST uses PVST+ BPDUs to speak to all PVST+ instances, each containing the same IST information. This allows PVST+ to make a consistent choice about a port's role and state. IST also needs to be sure that VLAN 1's BPDUs represent the state for all VLANs in the PVST+ region.

The port roles in MST - PVST+ boundary ports are: Designated, Root, and non-designated.

MST boundary Designated Port

An MST boundary port will become designated if its VLAN 1 BPDUs are superior to the received PVST+ VLAN 1 BPDUs.

Also, to maintain PVST+ simulation consistency, all received BPDUs (i.e. for all VLANs) on an MST boundary DP must be inferior.

MST boundary Root Port

Keeping in mind that an MST region can be modeled as a single switch, it follows that for an MST boundary port to become a Root Port toward the CIST root bridge, it must be the boundary port receiving the superior VLAN 1 BPDU out of ALL of the MST region's boundary ports.

Also, to maintain PVST+ simulation consistency, all received BPDUs for VLANs other than VLAN1 on an MST boundary RP must be identical or superior to those of VLAN1.

PVST Simulation Inconsistency

An inconsistency arises if the root bridge region for non-VLAN 1 instances is different to that of VLAN 1, which are indicated to the switch by the consistency criteria above.

If the PVST Simulation consistency criteria are not met, then the port will be placed in a blocked state (designated PVST Simulation Inconsistent or Root Inconsistent) until the criteria are met.

In the diagram, the MST region is root for VLAN 1 (on switch DLS1), and is therefore trying to become root for all VLANs on its boundary ports. However, PVST+ has been configured to consider ALS1 as root bridge for VLANs 10 and 20, and ALS2 for VLANs 30 and 40. In this case, ALS1 and ALS2 are sending superior BPDUs for these VLANs to the MST boundary ports, which then protect the network by placing those ports into blocking state until the inconsistency is resolved.



An example of an error message on the console of DLS1 (a 3750) is shown below:
%SPANTREE-2-PVSTSIM_FAIL: Superior PVST BPDU received on VLAN 10 port Fa0/1, claiming root 4106:001b.0ddc.e580. Invoking root guard to block the port.

 This can be resolved in one of two ways:

  1. Change the VLAN 1 root bridge to either of the PVST+ bridges.
  2. Change the priority of VLANs 10 - 40 to be numerically higher (i.e. inferior) than VLAN 1's on both the MST and PVST+ switches.

Monday 6 July 2015

Spanning tree and superior BPDUs

SPANNING TREE SIMPLICITY

The bewilderment surrounding the Spanning Tree Protocol and root ports and designated ports (well it bewildered me anyway!) can be immensely simplified by one idea:
It's all about SUPERIOR BPDUs.

Superior BPDUs

So first of all, what is a superior BPDU? It's one that 'wins', i.e. is the LOWEST in the following ranking. If any criterion is a TIE, then the next one down is used to break that tie:
  1. Root Bridge ID (RBID)
  2. Root Path Cost (RPC)
  3. Sending Bridge ID (SBID)
  4. Sending Port ID (SPID)
  5. Receiving Port ID - only used in very rare cases; it is not carried in the BPDU but assigned locally.
All the information in 1-4 above is carried (along with the timers) in every BPDU that is sent by every switch running STP.
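Because superiority is decided field by field, lowest value first, it maps naturally onto tuple comparison (a toy illustration; the bridge and port ID values are invented):

```python
# Each BPDU is modelled as a tuple in ranking order:
#   (root_bridge_id, root_path_cost, sender_bridge_id, sender_port_id)
# The superior BPDU is simply the lexicographically smallest tuple.

def superior(bpdu_a, bpdu_b):
    return min(bpdu_a, bpdu_b)

a = (4096, 8, 32768, 1)   # same root, lower root path cost ...
b = (4096, 12, 4096, 1)   # ... beats a lower sender BID at a higher cost
print(superior(a, b))     # (4096, 8, 32768, 1)
```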
So how does this help? It explains almost everything about the STP process and convergence, and helps, in my mind, to very succinctly define root port and designated port!

Convergence steps

To recap on the three fundamental steps that need to occur for STP convergence:

  1. Elect a root bridge
  2. Determine root ports
  3. Determine designated ports

Elect a root bridge

Electing a root bridge is determined by the lowest RBID (i.e. the superior one) in any BPDU circulating the network. It is determined to be a SUPERIOR BPDU because it has the lowest value in the first superiority criterion. Since the superior RBID is placed into all forwarded BPDUs during the election, thereafter EVERY BPDU WILL HAVE THE SAME RBID. So you can discount it!

Determine root ports

Determining the root port (RP) for any switch is done on the basis of the lowest 'resulting' path cost (i.e. RPC in the BPDU + receiving port cost) to the root bridge, which is the SECOND SUPERIORITY CRITERION. It makes sense that there can only be one lowest-cost path to the RB from any other switch, and therefore that there can only be one RP per switch.

Now we already know that RBID is going to be the same in every BPDU, so what's next? Root Path Cost.

And the RP, therefore, can be very simply defined as the ONLY port on the switch RECEIVING the SUPERIOR BPDU. There can be only one such port, because there can only be one superior BPDU. If RPC is a tie, then go to the next criterion, and so on. You also know that BPDUs are not sent out of RPs, because there would be no point: you already know that the most superior BPDU on the segment ARRIVED on that port, and yours is sure to be ignored as inferior. Also, the BPDU stored on an RP is always the superior one of any sent on the segment.

Determine designated ports

Similarly, the designated port (DP) is the only port on the SEGMENT that is SENDING the SUPERIOR BPDU. RPCs in the sent and received BPDUs are simply compared against each other, without modification. How does it know? Because it doesn't hear any that are superior. If it does, it knows it's not the DP, and stops sending them!  Again, because there can only be one superior BPDU on the segment, only one port can be sending it.

This means that ports that are not disabled and, although not connected to another switch, are participating in STP are also designated ports; hence they do not get put into blocking state.

A port that uses 'portfast' setting is a special case since it does not send BPDUs and therefore cannot really be considered a DP, but it is immediately placed into Forwarding state.

Monday 29 June 2015

Some lab notes for Dynamic Trunking Protocol

Dynamic Trunking Protocol (DTP) Notes

Effect of 'switchport mode access' on DTP

After disabling DTP on all other ports, using 'switchport nonegotiate' and enabling 'debug dtp packets' I started investigating the effect of different port settings on DTP. I had been reading some discussion about whether an access port would still send out some DTP packets even after being turned into an access port using the 'switchport mode access' command.

So I put the port into dynamic desirable mode on both ends, successfully established a trunk, and then set one end as an access port.

Here are the results:

DLS2(config-if)#switchport mode access

DLS2(config-if)#

00:43:43: DTP-pkt:Fa0/5:Sending packet ../dyntrk/dyntrk_process.c:1241

00:43:43: DTP-pkt:Fa0/5: TOS/TAS = ACCESS/OFF ../dyntrk/dyntrk_process.c:1244

00:43:43: DTP-pkt:Fa0/5: TOT/TAT = ISL/NEGOTIATE ../dyntrk/dyntrk_process.c:1247

00:43:43: DTP-pkt:Fa0/5:datagramout ../dyntrk/dyntrkprocess.c:1279

00:43:43: DTP-pkt:Fa0/5:Invalid TLV (type 0, len 0) in received packet. ../dyntrk/dyntrk_core.c:1334

00:43:43: DTP-pkt:Fa0/5:Good DTP packet received: ../dyntrk/dyntrk_core.c:1500

00:43:43: DTP-pkt:Fa0/5: Domain: ../dyntrk/dyntrk_core.c:1503

00:43:43: DTP-pkt:Fa0/5: Status: TOS/TAS = ACCESS/DESIRABLE ../dyntrk/dyntrk_core.c:1506

00:43:43: DTP-pkt:Fa0/5: Type: TOT/TAT = ISL/NEGOTIATED ../dyntrk/dyntrk_core.c:1508

00:43:43: DTP-pkt:Fa0/5: ID: 000F90236585 ../dyntrk/dyntrk_core.c:1511

So we can see that only one final DTP packet is sent, and one received, advising that the port has been placed in Access mode. The port then ignores any further DTP packets, even though I can see them still being sent from the other end if I disable and re-enable DTP by putting the far-end port into access mode, then back to dynamic desirable.

'switchport nonegotiate' limitations

'switchport nonegotiate' cannot be configured on a port already configured as a DTP trunk i.e. dynamic desirable or dynamic auto. It doesn't just switch DTP off on the port; you would have to place the port into 'switchport mode access' or 'switchport mode trunk' first.

Trunk encapsulation negotiation

Manually setting encapsulation on one end of the link

When DTP is used to negotiate encapsulation ('switchport trunk encapsulation negotiate'), which is default, then the trunk will be negotiated, if both switches support it, as

  1. ISL, then
  2. 802.1q, if ISL is not supported by both switches.

However, even between two switches that support ISL, if encapsulation is set manually, using 'switchport trunk encapsulation isl | dot1q', at only one end, then DTP will negotiate that encapsulation on the link.

Limitations on the 'switchport mode trunk' command

The 'switchport mode trunk' command is used to manually set a link to always be a trunk. DTP packets are still sent out of the interface, so a trunk could still be formed with an 'active' DTP port.

However, the 'switchport mode trunk' command cannot be applied if encapsulation is negotiated. The encapsulation must be set manually.

DLS1(config-if)#switchport mode trunk

Command rejected: An interface whose trunk encapsulation is "Auto" can not be configured to "trunk" mode.

The error message is slightly misleading, referring to "Auto" encapsulation. This confused me the first time I saw it, until I realised it was referring to 'switchport trunk encapsulation negotiate' i.e. negotiated encapsulation. It would be great if Cisco kept their error messages consistent with their command syntax!

Sunday 28 June 2015

Multi-Layer Switch: routed port, switchport and SVIs

'switchport'

The 'switchport' command tells the switch (usually a Multi-Layer Switch or MLS) to treat the port as a layer 2 port, i.e. as a member of a VLAN and to allow it to switch frames and learn MAC addresses etc., as well as participating in all other layer 2 processes such as spanning-tree.

'no switchport'

The 'no switchport' command tells the switch to treat the port as a layer 3 interface, so that you can run a routing protocol, add an interface IP address (or other layer 3 address) and create sub-interfaces, none of which is possible on a layer 2 interface. If you try running this command on a layer 2 only switch (e.g. a 2950) it will not understand it and reject it as 'incomplete', as shown below:
ALS1#conf t
Enter configuration commands, one per line. End with CNTL/Z.
ALS1(config-if)#no switchport
% Incomplete command.
A routed port does not belong to a VLAN as far as the MLS is concerned, because it has no concept of VLANs at layer 3, just like a port on a router. However, on an MLS each VLAN also has a layer 3 interface: the VLAN interface, also known as an SVI. This is created on an MLS when the VLAN itself is created.

On a pure layer 2 switch, such as the 2950, there is only one layer 3 interface: this is the 'VLAN1' interface (an SVI) that you configure to allow management connectivity.
ALS1#show run int vlan 1
Building configuration...
Current configuration : 67 bytes
!
interface Vlan1
no ip address
no ip route-cache
shutdown
end

Thursday 25 June 2015

Private VLAN summary

Private VLANs

Allows for the separation of ports into private port groups, while still making use of the same subnet. This is more efficient in terms of IP address usage, reduces STP and ACL complexity, and is of particular use in some shared environments, such as Service Provider (SP) data centres, where access to common resources on a subnet is required in a secure way.
There are essentially three different port classifications in terms of function. Ports that need to communicate with:
  • all devices
  • each other and with shared devices (e.g. router or web server)
  • ONLY shared devices
Private VLANs are constructed so that there exists a primary VLAN, and one or more secondary VLANs. Each secondary VLAN is mapped to a primary VLAN.

Private VLANs are only supported by VTPv3, so "VTP transparent" mode should be configured if not using VTPv3.

Primary VLANs

Contains promiscuous ports i.e. can send and receive to any other port in the PVLAN including those assigned to secondary VLANs. Devices in this VLAN are likely to include the router L3 gateway, web servers, database servers etc.

Secondary VLANs

Are one of two types: Community or Isolated.

Community VLANs

  • Ports can talk to other ports in the community and to primary VLAN (promiscuous) ports
  • Each PVLAN has zero or more community VLANs associated with it.

Isolated VLANs

  • Ports can ONLY talk to primary VLAN (promiscuous) ports
  • Each PVLAN has AT MOST ONE isolated VLAN, since only one is required
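The communication rules for the three port classifications can be summarised in a toy model (not switch code; the port types and community IDs are invented for illustration):

```python
# Toy model of which private-VLAN port types may exchange frames.
# A port is modelled as (port_type, community_id); community_id is None
# for promiscuous and isolated ports.

def can_talk(a, b):
    type_a, community_a = a
    type_b, community_b = b
    if 'promiscuous' in (type_a, type_b):
        return True               # promiscuous ports talk to everything
    if type_a == type_b == 'community':
        return community_a == community_b  # same community only
    return False                  # isolated ports reach only promiscuous ports

print(can_talk(('isolated', None), ('promiscuous', None)))  # True
print(can_talk(('isolated', None), ('isolated', None)))     # False
print(can_talk(('community', 101), ('community', 102)))     # False
```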

Private VLAN trunks

Extending Private VLANs across multiple switches is straightforward: simply use the same VLAN IDs and trunk the VLANs as you would normally. Frames arriving from a port within a Private VLAN (primary or secondary) are tagged with the primary or secondary VLAN tag for transport between switches.

However, there are two special trunk types that are used with Private VLANs:

Promiscuous PVLAN Trunk

This is used when a trunk is carrying traffic for a Primary VLAN, as well as its associated secondaries, and needs to be considered a promiscuous port. It may also be carrying other normal VLANs. In this case, the device on the other end of the trunk is unaware of the relationship between the Private VLANs, and traffic from all secondary VLANs associated with a Primary VLAN is tagged with the Primary VLAN ID. A use case for this scenario is a "router on a stick" configuration, where the gateway interface of a Primary VLAN (on the router) is considered promiscuous and is allowed to be communicated with by all associated Secondary VLANs.

The Promiscuous PVLAN Trunk port re-writes secondary VLAN IDs of sent frames into the corresponding primary VLAN ID so that the external device always sees only the primary VLAN. It does not manipulate tags of incoming frames.

Isolated PVLAN Trunk

This is used to extend the isolated VLAN over a trunk carrying multiple VLANs to a switch that does not support Private VLANs but is capable of isolating its own ports e.g. with the port protection feature on entry-level Catalyst switches.

The Isolated PVLAN Trunk re-writes a primary VLAN ID of a sent frame to the ID of the isolated VLAN that is associated with the primary VLAN. It does not manipulate tags of incoming frames.

Tuesday 23 June 2015

Dangers of Tunnel vision

I gave myself a real fright today.

I was trying to troubleshoot an issue with a configuration that I've never done before: IPv6 ISIS. It was while I was on a training course; I was trying to do some route summarisation, and I struggled with the lab for about 45 minutes, taking out configuration, checking the routing table, putting it back in, checking it again. The topology was about as simple as you can get: two routers. And I checked and double-checked my configuration and found a little error, and fixing that made a difference, but not the one I wanted.

Over and over and over again I looked at the routing table.

Except...

That it wasn't the full routing table. I was just looking at the ISIS routes. And as soon as the instructor made one little suggestion I looked at the WHOLE routing table, and there was the route... advertised by another protocol with a lower AD! Hidden in the trees was the wood!

It turned my blood to ice to realise how easily I fell into the tunnel-vision trap - forgetting to take a pause and a step back; to think about alternatives. So that's why I'm writing about it today. If I read this again in 6-9 months' time I can remember today, and the fear I felt when I thought about how easily that could have happened to me in a lab exam, and how I would have totally blown it.

A long way to go, but food for thought.


RJ45 pinout diagram and copper Ethernet cabling

One of the most fundamental things in networking is the construction of a straight-through or cross-over copper cable.

Copper Ethernet cabling is typically constructed to Category-5 (a.k.a. Cat-5), Cat-5e or Cat-6 specifications, each of which is rated for higher bandwidth than the previous. The cables are made up of four pairs of copper wires, each insulated and then tightly and precisely twisted together, with no overall shield (hence the term Unshielded Twisted Pair, or UTP), so that the impedance characteristics (and hence the behaviour at very high frequencies) of the cable are very precisely known.

The pin positions of the pairs in the T568A and T568B connectors (RJ-45) are shown below, as well as the 'crossover' cable connections. I found it useful to remember that pairs 1 and 4 never change position, no matter what the cable type, and that pairs 2 and 3 swap position in a crossover. Regardless of the connector used at either end, this is easy to remember, and it leads to the well-known crossover connection pattern of 1-3, 2-6, 3-1, 6-2.



Straight-through is self-explanatory: the same connector type is used on both ends (either T568A or T568B), and the pin positions of pairs 1-4 do not change.

In terms of pair numbering within the connector, I try to remember that the pairs are numbered outward from the centre: reading from the centre toward the 'top' (i.e. the lower-numbered pins), the T568A pair sequence is 1,2,3,3, and reading from the centre to the 'bottom' it is 1,2,4,4.
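These pin/pair relationships are easy to sanity-check as lookup tables; a small sketch (pair assignments per the TIA-568 standards):

```python
# Pin -> pair number for the two TIA-568 wiring standards.
T568A = {1: 3, 2: 3, 3: 2, 4: 1, 5: 1, 6: 2, 7: 4, 8: 4}
T568B = {1: 2, 2: 2, 3: 3, 4: 1, 5: 1, 6: 3, 7: 4, 8: 4}

# Crossover connection pattern: pairs 2 and 3 swap (pins 1-3 and 2-6),
# while pairs 1 and 4 stay put.
CROSSOVER = {1: 3, 2: 6, 3: 1, 4: 4, 5: 5, 6: 2, 7: 7, 8: 8}

# T568B is just T568A with pairs 2 and 3 exchanged, which is why a cable
# terminated T568A on one end and T568B on the other is a crossover.
swap = {2: 3, 3: 2}
assert T568B == {pin: swap.get(pair, pair) for pin, pair in T568A.items()}
```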



Sunday 21 June 2015

Ethernet framing summary (taken from the 802.3-2012 standard)

I realised that I didn't really understand the construction and transmission of the Ethernet frame very well, even after reading the Cisco texts, so I thought I would dig into the 802.3 standard. It's actually quite readable! All emphasis is mine.

ETHERNET FRAMING SUMMARY

Taken from 802.3-2012 standard.

Preamble

4.2.5 Preamble generation
Upon request by TransmitLinkMgmt to transmit the first bit of a new frame, PhysicalSignalEncap shall first transmit the preamble, a bit sequence used for physical medium stabilization and synchronization, followed by the Start Frame Delimiter. If, while transmitting the preamble or Start Frame Delimiter, the collision detect variable becomes true, any remaining preamble and Start Frame Delimiter bits shall be sent. The preamble pattern is:
10101010 10101010 10101010 10101010 10101010 10101010 10101010
The bits are transmitted in order, from left to right. The nature of the pattern is such that, for Manchester encoding, it appears as a periodic waveform on the medium that enables bit synchronization. It should be noted that the preamble ends with a “0.”

Start Frame Delimiter

4.2.6 Start frame sequence
The receiveDataValid signal is the indication to the MAC that the frame reception process should begin. Upon reception of the sequence 10101011 following the assertion of receiveDataValid, PhysicalSignalDecap shall begin passing successive bits to ReceiveLinkMgmt for passing to the MAC client.
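A consequence of the least-significant-bit-first transmission order described later in the standard (3.2.3) is that the preamble bit pattern 10101010 corresponds to the octet value 0x55, and the SFD pattern 10101011 to 0xD5, which is why the preamble and SFD are often written as the byte sequence 55-55-55-55-55-55-55-D5. A quick sketch of the conversion:

```python
def wire_bits_to_octet(bits: str) -> int:
    """Convert a bit sequence written in transmission order (first bit
    transmitted on the left) to its octet value, given that Ethernet
    transmits each octet least significant bit first."""
    return int(bits[::-1], 2)

preamble_octet = wire_bits_to_octet("10101010")  # 0x55
sfd_octet = wire_bits_to_octet("10101011")       # 0xD5
```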

Address fields

3.2.3 Address fields
Each MAC frame shall contain two address fields: the Destination Address field and the Source Address field, in that order.
The Destination Address field shall specify the destination addressee(s) for which the MAC frame is intended.
The Source Address field shall identify the station from which the MAC frame was initiated.
The representation of each address field shall be as follows:
a) Each address field shall be 48 bits in length.
b) The first bit (LSB) shall be used in the Destination Address field as an address type designation bit [I/G bit] to identify the Destination Address either as an individual or as a group address. If this bit is 0, it shall indicate that the address field contains an individual address. If this bit is 1, it shall indicate that the address field contains a group address that identifies none, one or more, or all of the stations connected to the LAN. In the Source Address field, the first bit is reserved and set to 0.
c) The second bit shall be used to distinguish between locally or globally administered addresses [U/L bit]. For globally administered (or U, universal) addresses, the bit is set to 0. If an address is to be assigned locally, this bit shall be set to 1. Note that for the broadcast address, this bit is also a 1.
d) Each octet of each address field shall be transmitted least significant bit first.
3.2.3.1 Address designation
A MAC sublayer address is one of two types:
a) Individual Address. The address associated with a particular station on the network.
b) Group Address. A multidestination address, associated with one or more stations on a given network. There are two kinds of multicast addresses:
  1. Multicast-Group Address. An address associated by higher-level convention with a group of logically related stations.
  2. Broadcast Address. A distinguished, predefined multicast address that always denotes the set of all stations on a given LAN.
All 1’s in the Destination Address field shall be predefined to be the Broadcast Address.
This group shall be predefined for each communication medium to consist of all stations actively connected to that medium; it shall be used to broadcast to all the active stations on that medium. All stations shall be able to recognize the Broadcast Address. It is not necessary that a station be capable of generating the Broadcast Address.
The address space shall also be partitioned into locally administered and globally administered addresses. The nature of a body and the procedures by which it administers these global (U) addresses is beyond the scope of this standard. [IEEE]
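Since the I/G and U/L bits described in 3.2.3 b) and c) are the two least significant bits of the first octet of an address (and hence the first two bits transmitted), they can be checked with simple bit masks. A minimal Python sketch:

```python
def mac_flags(mac: str) -> dict:
    """Return the I/G and U/L flags of a MAC address written in the
    usual colon-separated form."""
    first_octet = int(mac.split(":")[0], 16)
    return {
        "group": bool(first_octet & 0x01),  # I/G bit: 1 = group (multicast/broadcast)
        "local": bool(first_octet & 0x02),  # U/L bit: 1 = locally administered
    }

mac_flags("ff:ff:ff:ff:ff:ff")  # broadcast: both bits set, as the standard notes
mac_flags("01:00:5e:00:00:01")  # IPv4 multicast: group set, globally administered
```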

Destination Address field

3.2.4 Destination Address field
The Destination Address field specifies the station(s) for which the MAC frame is intended. It may be an individual or multicast (including broadcast) address.

Source Address field

3.2.5 Source Address field
The Source Address field specifies the station sending the MAC frame. The Source Address field is not interpreted by the MAC sublayer.

Length / Type field

3.2.6 Length/Type field
This two-octet field takes one of two meanings, depending on its numeric value. For numerical evaluation, the first octet is the most significant octet of this field.
a) If the value of this field is less than or equal to 1500 decimal (05DC hexadecimal), then the Length/Type field indicates the number of MAC client data octets contained in the subsequent MAC Client Data field of the basic frame (Length interpretation).
b) If the value of this field is greater than or equal to 1536 decimal (0600 hexadecimal), then the Length/Type field indicates the Ethertype of the MAC client protocol (Type interpretation).[IEEE]
The Length and Type interpretations of this field are mutually exclusive.
When used as a Type field, it is the responsibility of the MAC client to ensure that the MAC client operates properly when the MAC sublayer pads the supplied MAC Client data, as discussed in 3.2.7. Regardless of the interpretation of the Length/Type field, if the length of the MAC Client Data field is less than the minimum required for proper operation of the protocol, a Pad field (a sequence of octets) will be added after the MAC Client Data field but prior to the FCS field, specified below. The procedure that determines the size of the Pad field is specified in 4.2.8.
The Length/Type field is transmitted and received with the high order octet first.
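Note that the rules in 3.2.6 a) and b) leave the values 1501 to 1535 (0x05DD to 0x05FF) with no defined interpretation. A small sketch of the decision:

```python
def interpret_length_type(value: int) -> str:
    """Classify the two-octet Length/Type field per 802.3 3.2.6."""
    if value <= 0x05DC:     # <= 1500: octet count of the MAC Client Data field
        return "length"
    if value >= 0x0600:     # >= 1536: EtherType of the MAC client protocol
        return "type"
    return "undefined"      # 1501-1535 have no defined interpretation

interpret_length_type(0x0800)  # 'type' (the IPv4 EtherType)
interpret_length_type(46)      # 'length'
```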

MAC Client Data field

3.2.7 MAC Client Data field
The MAC Client Data field contains a sequence of octets. Full data transparency is provided in the sense that any arbitrary sequence of octet values may appear in the MAC Client Data field up to a maximum field length determined by the particular implementation.
Ethernet implementations shall support at least one of three maximum MAC Client Data field sizes defined as follows:
a) 1500 decimal—basic frames (see 1.4.102)
b) 1504 decimal—Q-tagged frames (see 1.4.334)
c) 1982 decimal—envelope frames (see 1.4.184)
If layer management is implemented, frames with a MAC Client Data field larger than the supported maximum MAC Client Data field size are counted. It is recommended that new implementations support the transmission and reception of envelope frames, item c) above.
NOTE 1—The envelope frame is intended to allow inclusion of additional prefixes and suffixes required by higher layer encapsulation protocols (see 1.4.180) such as those defined by the IEEE 802.1 working group (such as Provider Bridges and MAC Security), ITU-T or IETF (such as MPLS). The original MAC Client Data field maximum remains 1500 octets while the encapsulation protocols may add up to an additional 482 octets. Use of these extra octets for other purposes is not recommended, and may result in MAC frames being dropped or corrupted as they may violate maximum MAC frame size restrictions if encapsulation protocols are required to operate on them.
NOTE 2—All IEEE 802.3 MAC frames share a common format. The processing of the three types of MAC frames is not differentiated within the IEEE 802.3 MAC, except for management. However, they may be distinguished within the MAC client.
NOTE 3—All Q-tagged frames are envelope frames, but not all envelope frames are Q-tagged frames.
See 4.4 for a discussion of MAC parameters; see 4.2.3.3 for a discussion of the minimum frame size and minFrameSize.

Pad field

3.2.8 Pad field
A minimum MAC frame size is required for correct CSMA/CD protocol operation (see 4.2.3.3 and 4.4). If necessary, a Pad field (in units of octets) is appended after the MAC Client Data field prior to calculating and appending the FCS field. The size of the Pad, if any, is determined by the size of the MAC Client Data field supplied by the MAC client and the minimum MAC frame size and address size MAC parameters (see 4.4).
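For the common untagged (basic) frame, the arithmetic works out simply: with a minFrameSize of 64 octets, two 6-octet addresses plus the 2-octet Length/Type field (14 octets of header) and a 4-octet FCS, the MAC Client Data field must be at least 46 octets. A sketch of the pad calculation under those assumptions:

```python
MIN_FRAME_SIZE = 64  # octets: addresses, Length/Type, data, pad and FCS
HEADER_SIZE = 14     # 6 + 6 + 2: two addresses plus Length/Type
FCS_SIZE = 4

def pad_size(client_data_octets: int) -> int:
    """Octets of padding appended after the MAC Client Data field, if any."""
    minimum_data = MIN_FRAME_SIZE - HEADER_SIZE - FCS_SIZE  # 46 octets
    return max(0, minimum_data - client_data_octets)

pad_size(18)   # a small payload needs 28 octets of pad
pad_size(100)  # no pad needed
```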

FCS field

3.2.9 Frame Check Sequence (FCS) field
A cyclic redundancy check (CRC) is used by the transmit and receive algorithms to generate a CRC value for the FCS field. The FCS field contains a 4-octet (32-bit) CRC value. This value is computed as a function of the contents of the protected fields of the MAC frame: the Destination Address, Source Address, Length/Type field, MAC Client Data, and Pad (that is, all fields except FCS).
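The CRC in question is the standard 32-bit CRC (polynomial 0x04C11DB7, bit-reflected, with initial value and final XOR of all ones), which is the same algorithm implemented by Python's zlib.crc32, so the FCS value over a frame's protected fields can be sketched as follows (the bit order in which the four FCS octets go on the wire is a separate detail):

```python
import zlib

def compute_fcs(protected_fields: bytes) -> int:
    """CRC-32 over Destination Address, Source Address, Length/Type,
    MAC Client Data and Pad, i.e. everything except the FCS itself."""
    return zlib.crc32(protected_fields) & 0xFFFFFFFF

# The well-known CRC-32 check value for the ASCII string "123456789":
compute_fcs(b"123456789")  # 0xCBF43926
```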

Preparing for the CCIE Routing and Switching Written Exam v5 (400-101)

After about 10 years in Cisco networking, the last three of which have been dedicated to network design and implementation, and having earned my CCDA and CCNA certifications, lapsed the CCDA, then re-earned and now lapsed again(!) the CCNA, I decided to skip the CCNP certification path and jump straight into CCIE.

This was prompted by the experience and advice of many of my colleagues who have already earned the coveted CCIE certification in either R&S or Service Provider. They said that the effort (and expense) involved in studying for and passing the three CCNP exams would be better spent studying for my CCIE which, strangely, requires no prerequisite qualifications to sit. When I began my networking journey I told myself that I wasn't really interested in the time and effort it would take to get my CCIE, but the longer I am in this field, the more I realise I want to prove to myself that I still have it in me to work really hard at a very difficult task and to overcome the inevitable issues that will arise along the way. In a way, I am viewing this as my PhD, or at least a pinnacle of achievement in my engineering career of which I can be proud, which will give me a better knowledge and understanding of the details of my chosen field, and which will bring a certain kudos with it.

I started last year, but was seriously derailed by the death of my father and the need to travel to my family's home in New Zealand for the funeral and to help out my mum. Now, with that substantially taken care of, and with my own emotional energy returning, I decided once again to set myself the task of climbing the "Everest" of Internet Protocol networking; the CCIE.

My goal is to be prepared for the written exam in February of 2016. That gives me 6 months to study and prepare, and then one month to revise and cram exam technique for the exam itself. After that, of course, it's on to the infamous lab exam, which has left many a broken engineer in its wake! It is, frankly, terrifying, and I don't like to fail. But this journey will teach me, I'm certain, that failure is just one of the stones on the path to success.

I am based in and around Reading, UK, and would welcome any local study groups to get in touch.

19 Jul 2015: Updated with reminders of why I'm doing this.

Why become a CCIE?

Rarity: Fewer than 1% of all networking professionals hold a CCIE
Knowledge: Passing the exam is a by-product of being an expert.
CCIE is a challenge: It's about the journey, not about passing the exam.