Routing protocols and architectures/Border Gateway Protocol

Border Gateway Protocol (BGP) is the inter-domain routing protocol commonly used in the Internet.


Overview
Main features of BGP:
 * it uses the Path Vector (PV) algorithm to record the sequence of ASes along the path without the risk of routing loops;
 * routers can aggregate the received routing information before propagating it;
 * it does not automatically discover new neighbor routers: peering sessions must be configured by hand;
 * it exchanges routing updates over reliable TCP connections;
 * it is an extensible protocol thanks to the Type-Length-Value (TLV) format of attributes;
 * it supports routing policies.

Routing information
BGP exchanges inter-domain routing information about external routes, which are expressed in the form network address/prefix length (rather than with a netmask).
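
As a small illustration of the prefix-length notation, the sketch below uses Python's standard ipaddress module to show the netmask equivalent to a prefix length. The route value is a hypothetical example, not taken from the text.

```python
import ipaddress

# Hypothetical route written in the "network address/prefix length" form.
route = ipaddress.ip_network("192.0.2.0/24")

prefix_length = route.prefixlen   # number of leading bits that identify the network
netmask = str(route.netmask)      # the equivalent dotted-decimal netmask
```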

Path Vector algorithm
As routing policies are defined based on paths, BGP cannot be based on the DV algorithm, because knowing only the cost of each path is not enough. BGP therefore adopts the Path Vector (PV) algorithm: every AS constitutes a single node, identified by a 2-byte (or 4-byte) number, and the additional piece of information carried with each route is the list of crossed ASes.

The PV algorithm is more stable than the DV algorithm because loops are easy to detect:
 * if a router receives a PV which already includes its own AS number, it discards the PV without propagating it, because accepting it would create a routing loop;
 * otherwise, the router inserts its own AS number into the PV and then propagates it to its neighbors.
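
The two rules above can be sketched as follows. This is a minimal illustration, assuming the PV is represented as a plain list of AS numbers with the most recently crossed AS first; the function name and the AS numbers are hypothetical.

```python
from typing import List, Optional

def process_path_vector(local_as: int, as_path: List[int]) -> Optional[List[int]]:
    """Return the PV to propagate to neighbors, or None if it must be discarded."""
    if local_as in as_path:
        # Our own AS number is already in the list: a loop would form.
        return None
    # Otherwise insert our AS number and propagate the extended list.
    return [local_as] + as_path

accepted = process_path_vector(65001, [65002, 65003])
discarded = process_path_vector(65001, [65002, 65001, 65003])
```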

BGP does not support an explicit cost metric: the cost associated with each route is simply the number of crossed ASes in the list → the least-cost route is not necessarily the optimal one. An explicit metric is not used because:
 * ASes may have different requirements, hence they may adopt different metrics from one another (e.g. bandwidth, transmission delay) → it is difficult to compute coherent costs for all ASes;
 * the announced cost may not match the actual network topology, because an ISP may want to hide the actual information about its own network from a competitor for economic reasons (see Routing protocols and architectures/Inter-domain routing: peering and transit in the Internet).

Route aggregation
When a border router propagates information about received routes, it can be manually configured to include aggregate routes in its advertisement messages to reduce their size: two routes can be aggregated into a single route identified by the common part of their network prefixes.

However, the routes collapsed into an aggregate route do not necessarily share the same sequence of crossed ASes: a more specific route may follow a different path. Two cases are possible:
 * overlapping route: the specific route, with its different list of crossed ASes, is also announced along with the aggregate route → information is complete, and the 'longest prefix matching' algorithm will select the more specific route in the routing table;
 * not precise route: only the aggregate route is announced → information is approximate, because the list of crossed ASes does not match the path actually followed for all the destination addresses within that address range.
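
The following sketch illustrates aggregation by common prefix and the overlapping-route case, using Python's standard ipaddress module. It assumes IPv4 prefixes; the function names, prefixes, and AS numbers are hypothetical.

```python
import ipaddress

def aggregate(a: str, b: str) -> ipaddress.IPv4Network:
    """Return the smallest network covering both prefixes (their common part)."""
    net_a = ipaddress.ip_network(a)
    net_b = ipaddress.ip_network(b)
    candidate = net_a
    # Shorten the prefix one bit at a time until it also covers net_b.
    while not net_b.subnet_of(candidate):
        candidate = candidate.supernet()
    return candidate

def longest_prefix_match(table: dict, destination: str):
    """Return the most specific network in the table containing the destination."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in table if addr in net]
    return max(matches, key=lambda net: net.prefixlen) if matches else None

summary = aggregate("10.1.0.0/24", "10.1.1.0/24")

# Overlapping route: the aggregate and a more specific route with a
# different list of crossed ASes are both present in the table.
table = {
    summary: [65010],
    ipaddress.ip_network("10.1.1.0/24"): [65020, 65030],
}
specific = longest_prefix_match(table, "10.1.1.5")
general = longest_prefix_match(table, "10.1.0.5")
```

In the not precise case the table would hold only the aggregate entry, so both lookups would return it together with a list of crossed ASes that is approximate for part of the range.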

Peering sessions
Two border routers exchanging BGP messages with each other are called peers, and the TCP-based session between them is called a peering session.

A key difference compared to other routing protocols is that peers are not able to discover each other automatically: manual configuration by the network administrator is required. The reason is that peers may not be connected through a direct link: other routers may lie between them, and for those routers BGP updates are just normal data packets to be forwarded to their destination.

TCP
Transmission of routing information is reliable because two peers establish a peering session by setting up a TCP connection through which all BGP messages are exchanged:
 * existing components are reused instead of redefining yet another protocol-specific mechanism;
 * BGP does not need to deal directly with retransmissions, lost messages, etc.

Using TCP as the transport protocol avoids sending updates periodically: an update is sent only when needed, including just the routes which have changed, and it is sent again only if the message was lost → the bandwidth consumed to send routes is reduced.

Since advertisements are not periodic, routes never expire → a router must explicitly inform its peers that a previously announced route has become unreachable, withdrawing routes when they are no longer valid (analogously to route poisoning in the DV algorithm).
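
The incremental behavior described above can be sketched as follows. This is only an illustration, assuming a table keyed by prefix and an update made of announced and withdrawn routes; the names and values are hypothetical and do not reflect the actual message encoding.

```python
from typing import Dict, Iterable, List

Rib = Dict[str, List[int]]  # prefix -> list of crossed ASes

def apply_update(rib: Rib, announced: Rib, withdrawn: Iterable[str]) -> None:
    """Apply one incremental update: entries never expire on their own."""
    for prefix in withdrawn:
        rib.pop(prefix, None)   # explicit withdrawal of a route no longer valid
    rib.update(announced)       # only the routes which have changed are carried

rib: Rib = {}
apply_update(rib, {"198.51.100.0/24": [65002], "203.0.113.0/24": [65002, 65003]}, [])
# Nothing is re-sent periodically; later, one route becomes unreachable.
apply_update(rib, {}, ["203.0.113.0/24"])
```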

However, TCP deprives the application of control over timing, because control packets may be delayed by TCP's own mechanisms: in case of congestion, TCP reduces its transmission rate, preventing their timely delivery → quality of service can be configured on the internal routers of the AS to give priority to BGP packets, since they are service packets needed for the network to operate.

TCP does not provide any information about whether the remote peer is still reachable → an explicit keepalive mechanism managed by BGP itself is required. Keepalive messages also rely on TCP mechanisms → reactivity to the disappearance of a peer or to a link fault is limited, but it is still acceptable considering that these events are rare (e.g. links between border routers are strongly redundant).

I-BGP and E-BGP
When two border routers set up a peering session between themselves, each one communicates, through an OPEN message, its AS number to the other party to determine the type of sub-protocol:
 * Exterior BGP (E-BGP): peers are border routers belonging to two different ASes, usually connected by a direct link;
 * Interior BGP (I-BGP): peers are border routers belonging to the same AS, usually connected through a series of internal routers.

The processing of BGP messages and the routes announced over a peering session may differ according to the ASes the peers belong to:
 * E-BGP: when a border router propagates a PV to an E-BGP peer, it prepends its own AS number to each list of crossed ASes:
    * external routes: they are propagated to other E-BGP peers, except peers whose AS is on the best path toward those destinations;
    * internal routes: they are propagated to other E-BGP peers;
 * I-BGP: when a border router propagates a PV to an I-BGP peer, it transmits the list as is, because the AS number remains unchanged:
    * external routes: they are propagated to other I-BGP peers in one of several ways (full mesh, route reflector, AS confederation);
    * internal routes: they are never propagated to other I-BGP peers; every border router learns them from an independent redistribution process.
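
The different handling of the list of crossed ASes on the two session types can be sketched as below. This is a minimal illustration that covers only the AS list, not the route filtering rules; the function names and AS numbers are hypothetical.

```python
from typing import List

def session_type(local_as: int, peer_as: int) -> str:
    """The AS numbers exchanged in the OPEN messages determine the sub-protocol."""
    return "I-BGP" if local_as == peer_as else "E-BGP"

def path_to_advertise(local_as: int, peer_as: int, as_path: List[int]) -> List[int]:
    """Return the list of crossed ASes to send to the given peer."""
    if session_type(local_as, peer_as) == "E-BGP":
        return [local_as] + as_path   # prepend the current AS number
    return list(as_path)              # same AS: the list is sent as is

to_external_peer = path_to_advertise(65001, 65002, [65003])
to_internal_peer = path_to_advertise(65001, 65001, [65003])
```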

I-BGP sessions are used to exchange external routes:
 * independently of routes exchanged by the interior protocol: the direct connection between peers avoids involving the IGP protocol when a change in an external route does not require recomputing internal routes → no transients, less processing;
 * independently of the interior protocol: if border routers, upon learning external routes from E-BGP, merely redistributed them into the IGP protocol and let the latter carry them to the other border routers, some important information needed by BGP would be lost → specific BGP messages, called UPDATE messages, are required, which carry this information in their attributes.

IGP-BGP synchronization


BGP routers in a transit AS learn external destinations from other BGP routers via I-BGP, but packet forwarding across the AS (toward the egress BGP router) relies on internal routers, whose routing tables are filled by the IGP protocol and not by BGP → external destinations can be announced to border routers in other ASes only after they have also been announced by the IGP protocol.

In the example in the side figure, router R4 learns destination D via E-BGP and announces it to router R3 via I-BGP, but R3 cannot in turn announce it to router R5 via E-BGP until the destination has been redistributed from the IGP protocol to R3. Otherwise, if R5 tried to send a packet toward D, R3 would forward it inside the AS, where internal routers would discard it.

It might be good to disable synchronization when:
 * the AS is not a transit one;
 * all routers in the AS use BGP.
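
The synchronization rule can be sketched as a simple check performed before announcing over E-BGP a route learnt via I-BGP. The function name, the parameters, and the representation of the IGP routes as a set of prefixes are assumptions made for illustration only.

```python
from typing import Set

def may_announce_externally(prefix: str, igp_routes: Set[str],
                            synchronization: bool = True) -> bool:
    """Decide whether a route learnt via I-BGP may be announced to E-BGP peers."""
    if not synchronization:
        # E.g. the AS is not a transit one, or all its routers use BGP.
        return True
    # With synchronization, wait until the IGP also knows the destination.
    return prefix in igp_routes

before_igp = may_announce_externally("203.0.113.0/24", set())
after_igp = may_announce_externally("203.0.113.0/24", {"203.0.113.0/24"})
```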

Routing loops


The lack of information about which border routers have been crossed within the same AS can cause routing loops: since all these routers share the same AS number, a border router can no longer rely on the list of crossed ASes to detect paths going twice through the same border router.

In the example in the side figure, a loop is created in advertising:
 * 1) router R4 learns the external route toward destination D;
 * 2) R4 propagates D to peer R3;
 * 3) R3 propagates D to peer R2;
 * 4) R2 propagates D to peer R4, which is the router which first learnt and announced D.

The resulting situation is similar to the one which triggers count to infinity in the Distance Vector algorithm: R4 cannot determine whether R2 reaches D by crossing R4 itself or whether a truly alternative path exists → if a link fault occurs between R4 and the border router of the AS where D is located, R4 will believe that D is still reachable through R2.

External routes can be announced to I-BGP peers in various ways: full mesh, route reflector, AS confederation.

Full mesh


Each border router has an I-BGP peering session with every other border router of its AS.

When a border router learns an external route from E-BGP, it propagates it to all other ones, which in turn propagate it to everyone, and so on.

In the presence of more than 2 border routers, routing loops can form due to loops in advertising, as in the side figure.

This solution is not flexible because all peering sessions must be configured by hand, although peering sessions do not change much over time because border routers are quite fixed and fault-tolerant.

Route reflector


One of the border routers is designated as the route reflector (RR), and all the other border routers set up peering sessions only with it, so that no closed paths are created.

When a border router learns an external route from E-BGP, it propagates it only to the RR, which is in charge of propagating the route in turn to the other border routers while avoiding routing loops.

The route reflector constitutes a single point of failure.
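
To give an idea of the configuration effort, the sketch below counts the I-BGP peering sessions needed with a full mesh and with a single route reflector. It assumes n border routers, one of which acts as the RR; the function names are hypothetical.

```python
def full_mesh_sessions(n: int) -> int:
    """Every border router peers with every other one: n(n-1)/2 sessions."""
    return n * (n - 1) // 2

def route_reflector_sessions(n: int) -> int:
    """Every border router except the RR peers only with the RR: n-1 sessions."""
    return max(n - 1, 0)

mesh = full_mesh_sessions(10)
star = route_reflector_sessions(10)
```

With 10 border routers the full mesh needs 45 manually configured sessions against 9 for the route reflector, at the price of the single point of failure noted above.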

AS confederation


Border routers have a full mesh of I-BGP peering sessions (as in the first solution), but the AS is split into mini-ASes, each one with a private AS number, and when a border router propagates the PV it prepends its private AS number to the list.

When an advertisement arrives, the border router can check whether its own private AS number is already in the list, and discard the advertisement if a routing loop is detected.

Path attributes
BGP information about announced routes (e.g. the list of crossed ASes) is included in path attributes inside UPDATE packets.

All attributes are encoded into the Type-Length-Value (TLV) format → BGP is an extensible protocol: extension RFCs can define new attributes without breaking compatibility with the existing world and, if the router does not support that attribute (unrecognized type code), it can ignore it and skip to the next one (thanks to the information about its length).

A BGP attribute can be:
 * well-known: it must be understood by all implementations, and can never be skipped:
    * mandatory: it must be present in all messages;
    * discretionary: it may not be present in all messages;
 * optional: it may not be understood by all implementations, and can be skipped if not supported:
    * transitive: if the router does not support the attribute, it must propagate it anyhow, setting the Partial (P) flag;
    * non-transitive: if the router does not support the attribute, it must not propagate it.
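
The classification above determines what a router does with an attribute whose type code it does not recognize. A minimal sketch follows; the names are hypothetical, and the ERROR outcome is an interpretation of the statement that a well-known attribute can never be skipped.

```python
from enum import Enum

class Action(Enum):
    ERROR = "well-known attributes must always be understood"
    FORWARD_PARTIAL = "propagate unchanged, with the Partial (P) flag set"
    DROP = "do not propagate"

def handle_unrecognized(optional: bool, transitive: bool) -> Action:
    """Decide how to treat an attribute with an unrecognized type code."""
    if not optional:
        return Action.ERROR
    return Action.FORWARD_PARTIAL if transitive else Action.DROP
```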

Each attribute is encoded in TLV format, with the following fields:
 * Optional (O) flag (1 bit): it specifies if the attribute is optional or well-known;
 * Transitive (T) flag (1 bit): it specifies if the attribute is transitive or non-transitive;
 * Partial (P) flag (1 bit): it specifies if at least one router along the path has encountered an optional transitive attribute which it did not support;
 * Extended Length (E) flag (1 bit): it specifies if the 'Length' field is encoded in one or two bytes;
 * Type field (1 byte): it includes the type code identifying the attribute → a router can determine if it supports that attribute without having to parse its value;
 * Length field (1 or 2 bytes): it includes the length of the attribute value → a router can skip an unsupported attribute and move on to the next one by advancing by the number of bytes indicated by this field;
 * Value field (variable length): it includes the attribute value.
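
The fields above can be walked with a short parser. The sketch below assumes the flag bit positions defined in RFC 4271 (O = 0x80, T = 0x40, P = 0x20, E = 0x10 in the first byte); the class name, function name, and sample bytes are hypothetical.

```python
import struct
from typing import Iterator, NamedTuple

class PathAttribute(NamedTuple):
    optional: bool
    transitive: bool
    partial: bool
    type_code: int
    value: bytes

def parse_path_attributes(data: bytes) -> Iterator[PathAttribute]:
    """Yield the attributes found in a buffer of concatenated TLVs."""
    offset = 0
    while offset < len(data):
        flags, type_code = struct.unpack_from("!BB", data, offset)
        offset += 2
        if flags & 0x10:                       # Extended Length: 2-byte Length
            (length,) = struct.unpack_from("!H", data, offset)
            offset += 2
        else:                                  # otherwise 1-byte Length
            length = data[offset]
            offset += 1
        value = data[offset:offset + length]
        offset += length                       # skip to the next attribute
        yield PathAttribute(bool(flags & 0x80), bool(flags & 0x40),
                            bool(flags & 0x20), type_code, value)

# ORIGIN (type 1), an unknown optional transitive attribute (type 200),
# and LOCAL_PREF (type 5) with value 100.
sample = bytes.fromhex("40010100" "c0c802abcd" "40050400000064")
attributes = list(parse_path_attributes(sample))
```

Because the Length field is read before the value is interpreted, the unknown attribute with type code 200 is stepped over without the parser needing to understand its content.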

Well-known attributes

 * ORIGIN attribute (type 1, mandatory): it defines the origin of the path information:
    * IGP: the route was manually specified as a static route (bgp network command);
    * EGP: the route was learnt from the EGP protocol;
    * INCOMPLETE: the route was learnt from an IGP protocol through a redistribution process (bgp redistribute command);
 * AS_PATH attribute (type 2, mandatory): it contains the list of crossed ASes split into path segments:
    * AS_SEQUENCE: the AS numbers in the path segment are in traversal order; if the first segment in the packet is of this type, a new AS number is added at the beginning of that segment;
    * AS_SET: the AS numbers in the path segment are not in traversal order; if the first segment in the packet is of this type, a new in-order segment containing the new AS number is added before that segment;
 * NEXT_HOP attribute (type 3, mandatory): it optimizes routing when multiple routers belong to the same LAN but to two different ASes, and therefore traffic from one AS to the other would always cross the border router → the border router can announce that traffic should be sent directly to the next-hop router in the other AS;
 * LOCAL_PREF attribute (type 5, discretionary): in I-BGP, when the external destination is reachable across two egress border routers, the route with the highest LOCAL_PREF is preferred;
 * ATOMIC_AGGREGATE attribute (type 6, discretionary): it indicates that the announced route is a not precise aggregate route.
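
The prepending rule for AS_PATH segments can be sketched as below. The segment type codes (AS_SET = 1, AS_SEQUENCE = 2) follow RFC 4271; the representation of a path as a list of (type, AS numbers) pairs and the function name are assumptions made for illustration.

```python
from typing import List, Tuple

AS_SET, AS_SEQUENCE = 1, 2
Segment = Tuple[int, List[int]]

def prepend_as(segments: List[Segment], local_as: int) -> List[Segment]:
    """Return a new AS_PATH with local_as added in front."""
    if segments and segments[0][0] == AS_SEQUENCE:
        # First segment is in order: add the AS number at its beginning.
        _, numbers = segments[0]
        return [(AS_SEQUENCE, [local_as] + numbers)] + segments[1:]
    # First segment is an AS_SET (or the path is empty): add a new
    # in-order segment before it.
    return [(AS_SEQUENCE, [local_as])] + segments

ordered = prepend_as([(AS_SEQUENCE, [65002, 65003])], 65001)
unordered = prepend_as([(AS_SET, [65002, 65003])], 65001)
```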

Optional attributes

 * MULTI_EXIT_DISC (MED) attribute (type 4, non-transitive): in E-BGP when two ASes are connected via multiple links, the link with lowest MED is preferred and links with higher MEDs are considered as backup links;
 * AGGREGATOR attribute (type 7, transitive): it contains the AS number and the IP address of the router which generated the not precise route;
 * COMMUNITIES attribute (type 8, transitive): it indicates which group of peers this route has to be announced to (e.g. to the entire Internet, only within the current AS, to no one);
 * MP_UNREACH_NLRI attribute (type 15, non-transitive): it informs that a previously announced route has become unreachable (routes never expire).

Decision process


The decision process running on every border router is responsible for:
 * selecting which routes are advertised to other BGP peers;
 * selecting which routes are used locally by the border router;
 * aggregating routes to reduce information.

The databases which BGP has to deal with are:
 * Routing Information Base (RIB): it consists of three distinct parts:
    * Adjacent RIB Incoming (Adj-RIB-In): it contains all the routes learnt from the advertisements received from a certain peer;
    * Local RIB (Loc-RIB): it contains the routes selected by the decision process with their degree of preference;
    * Adjacent RIB Outgoing (Adj-RIB-Out): it contains the routes which will be propagated in advertisements to a certain peer;
 * Policy Information Base (PIB): it contains the routing policies defined by manual configuration;
 * routing table: it contains the routes used by the packet forwarding process.

Very complex routing policies can be imposed to affect the decision process:
 * 1) a function which applies the policies defined on attributes is evaluated on each route in the Adj-RIBs-In and returns the degree of preference for that route. Policies are defined only based on the attributes of the current route: the computation of the degree of preference is never affected by the existence, the non-existence, or the attributes of other routes;
 * 2) for each destination, the route with the greatest degree of preference is selected and inserted into the Loc-RIB;
 * 3) other policies determine which routes are selected from the Loc-RIB to be inserted into the Adj-RIBs-Out.
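
The steps of the decision process can be sketched as follows. This is an illustration under stated assumptions, not an actual implementation: the preference function (higher LOCAL_PREF, then fewer crossed ASes), the export policy, and all names and values are hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple

class Route(NamedTuple):
    prefix: str
    as_path: Tuple[int, ...]
    local_pref: int = 100

def degree_of_preference(route: Route) -> Tuple[int, int]:
    """Hypothetical policy, computed only from the route's own attributes."""
    return (route.local_pref, -len(route.as_path))

def select_best(adj_ribs_in: Dict[str, List[Route]]) -> Dict[str, Route]:
    """For each destination keep the route with the greatest preference."""
    loc_rib: Dict[str, Route] = {}
    for routes in adj_ribs_in.values():          # one Adj-RIB-In per peer
        for route in routes:
            best = loc_rib.get(route.prefix)
            if best is None or degree_of_preference(route) > degree_of_preference(best):
                loc_rib[route.prefix] = route
    return loc_rib

def build_adj_rib_out(loc_rib: Dict[str, Route],
                      export_policy: Callable[[Route], bool]) -> Dict[str, Route]:
    """Select from the Loc-RIB the routes to advertise to a given peer."""
    return {prefix: r for prefix, r in loc_rib.items() if export_policy(r)}

adj_ribs_in = {
    "peerA": [Route("203.0.113.0/24", (65002, 65003))],
    "peerB": [Route("203.0.113.0/24", (65004,)),
              Route("198.51.100.0/24", (65004, 65005), local_pref=200)],
}
loc_rib = select_best(adj_ribs_in)
adj_rib_out = build_adj_rib_out(loc_rib, lambda r: len(r.as_path) <= 1)
```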