This post describes PortLand, an approach to building a scalable, fault-tolerant Layer 2 data center (DC) network fabric. The approach was published at SIGCOMM '09 (Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication).
PortLand Design
Fat Tree Topology:
The PortLand design is based on the Fat Tree topology (as shown below): a multi-rooted tree with switches organised into a hierarchy of core, aggregation and edge switches. It assumes that the baseline topology is relatively fixed, with most additions taking the form of "leaf" server nodes. Fat Tree also forms the basis of many existing data center topologies. A k-ary fat tree is defined as follows:
- There are k pods, each containing two layers of k/2 switches [aggregate and edge switches].
- Each k-port switch in the lower layer is directly connected to k/2 hosts [edge switches connect to the hosts]. Each physical host can further host multiple virtual machines.
- Each of the remaining k/2 ports (of the edge switches) is connected to k/2 of the k ports in the aggregation layer of the hierarchy.
- There are (k/2)^2 k-port core switches. Each core switch has one port connected to each of the k pods: the i-th port of any core switch is connected to pod i. Each aggregation switch in a pod connects to k/2 core switches.
- In general, a fat tree built with k-port switches supports k^3/4 hosts and consists of 5k^2/4 switches.
Figure: Fat tree using 4-port switches (a 4-ary fat tree)
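The counts in the definition above can be checked with a few lines of Python. This is only a sketch that restates the bullet points as arithmetic; the function name `fat_tree_sizes` and the dictionary keys are made up for illustration.

```python
def fat_tree_sizes(k: int) -> dict:
    """Count the elements of a k-ary fat tree (k must be even)."""
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be a positive even integer")
    half = k // 2
    edge = k * half          # k pods, k/2 edge switches per pod
    aggregation = k * half   # k pods, k/2 aggregation switches per pod
    core = half * half       # (k/2)^2 core switches
    hosts = edge * half      # each edge switch connects k/2 hosts
    return {
        "edge": edge,
        "aggregation": aggregation,
        "core": core,
        "switches": edge + aggregation + core,  # equals 5k^2/4
        "hosts": hosts,                         # equals k^3/4
    }


if __name__ == "__main__":
    for k in (4, 48):
        print(k, fat_tree_sizes(k))
```

For k = 4 this gives 16 hosts and 20 switches, which matches the 4-ary fat tree in the figure.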
Design:
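PortLand assigns every end host a 48-bit pseudo MAC (PMAC) address that encodes the host's location in the topology. The format of the PMAC is: | pod-num-16 | switch-pos-8 | port-8 | vmid-16 |, that is, a 16-bit pod number, an 8-bit position of the edge switch within the pod, an 8-bit port number on that switch, and a 16-bit ID that distinguishes virtual machines behind the same port. The edge switches assign PMACs and translate between PMACs and actual MACs (AMACs) in both directions. Within the network, the actual MACs are completely hidden by this translation, and the whole scheme is transparent to the hosts.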
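The field layout can be made concrete with a small encoder and decoder. This is a minimal sketch that assumes the four fields are packed most-significant first in the order given above; the names `PMAC`, `encode_pmac` and `decode_pmac` are illustrative and not taken from the paper.

```python
from typing import NamedTuple


class PMAC(NamedTuple):
    pod: int       # 16 bits
    position: int  # 8 bits, edge switch position inside the pod
    port: int      # 8 bits, port on the edge switch
    vmid: int      # 16 bits, VM behind that port


def encode_pmac(p: PMAC) -> str:
    """Pack the four fields into a 48-bit MAC string (assumed big-endian layout)."""
    limits = (1 << 16, 1 << 8, 1 << 8, 1 << 16)
    for value, limit in zip(p, limits):
        if not 0 <= value < limit:
            raise ValueError(f"field value {value} out of range")
    packed = (p.pod << 32) | (p.position << 24) | (p.port << 16) | p.vmid
    return ":".join(f"{byte:02x}" for byte in packed.to_bytes(6, "big"))


def decode_pmac(mac: str) -> PMAC:
    """Recover the location fields from a PMAC string."""
    packed = int(mac.replace(":", ""), 16)
    return PMAC(
        pod=packed >> 32,
        position=(packed >> 24) & 0xFF,
        port=(packed >> 16) & 0xFF,
        vmid=packed & 0xFFFF,
    )


if __name__ == "__main__":
    mac = encode_pmac(PMAC(pod=2, position=1, port=0, vmid=1))
    print(mac, decode_pmac(mac))
```

With this layout, a host in pod 2 behind edge switch 1, port 0, with vmid 1 gets the PMAC 00:02:01:00:00:01.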
PortLand also relies on a user process called the Fabric Manager. The Fabric Manager plays a role similar to that of the controller in a software-defined network (SDN), and it is key to minimizing broadcast traffic inside the DC.
Working:
Let's discuss Positional Pseudo MAC Addresses before coming back to the key role played by the Fabric Manager.
Positional PMAC addressing -
As discussed above, the edge switches perform the PMAC addressing based on each host's position in the DC network. For every packet arriving from the fabric for an attached host, the edge switch replaces the destination MAC (a PMAC) with the host's actual MAC (AMAC). Similarly, for every packet leaving a host, the edge switch replaces the source MAC address with the host's PMAC. This ensures that the rest of the network never sees actual MACs and is unaware of the PMAC <-> AMAC translation. The host continues unmodified, working with its actual flat MAC address.
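The rewriting at the edge can be sketched as a pair of lookup tables. The code below is a hedged illustration, not PortLand's implementation: frames are modelled as plain dicts with "src" and "dst" keys, the vmid is assumed to be a per-port counter starting at 1, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EdgeSwitch:
    pod: int
    position: int
    amac_to_pmac: dict = field(default_factory=dict)
    pmac_to_amac: dict = field(default_factory=dict)
    next_vmid: dict = field(default_factory=dict)  # per-port counter (assumption)

    def _assign_pmac(self, amac: str, port: int) -> str:
        """Build a PMAC from this switch's location and record both mappings."""
        vmid = self.next_vmid.get(port, 1)
        self.next_vmid[port] = vmid + 1
        packed = (self.pod << 32) | (self.position << 24) | (port << 16) | vmid
        pmac = ":".join(f"{b:02x}" for b in packed.to_bytes(6, "big"))
        self.amac_to_pmac[amac] = pmac
        self.pmac_to_amac[pmac] = amac
        return pmac

    def from_host(self, frame: dict, port: int) -> dict:
        """Frame leaving a host: replace the source AMAC with the host's PMAC."""
        amac = frame["src"]
        pmac = self.amac_to_pmac.get(amac) or self._assign_pmac(amac, port)
        return {**frame, "src": pmac}

    def to_host(self, frame: dict) -> dict:
        """Frame arriving for a host: replace the destination PMAC with the AMAC."""
        return {**frame, "dst": self.pmac_to_amac[frame["dst"]]}


if __name__ == "__main__":
    sw = EdgeSwitch(pod=1, position=0)
    out = sw.from_host({"src": "aa:bb:cc:dd:ee:01", "dst": "00:02:00:00:00:01"}, port=1)
    print(out)
    print(sw.to_host({"src": "00:02:00:00:00:01", "dst": out["src"]}))
```

The host only ever sees its own AMAC as the destination of frames it receives, while every switch beyond the edge sees only PMACs.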
Having a PMAC reduces the number of forwarding table entries in the switches: forwarding outside the pod requires examining only the first 16 bits of the destination MAC address (the pod number), while within the same pod, examining the switch position and then the remaining bits gives the exact location of the host. This is analogous to subnet (prefix) matching on IP addresses in routing tables. Hence, introducing the PMAC translation at the edge switches brings the benefits of a hierarchical addressing scheme, similar to Layer 3 forwarding, to a Layer 2 fabric.
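The prefix-style decision can be sketched per switch layer. This is an illustrative simplification that returns a textual description of the next hop; a real switch would express the same logic as prefix matches in its forwarding table. The function name `next_hop` and its parameters are assumptions.

```python
from typing import Optional, Tuple


def next_hop(
    layer: str,
    dst: Tuple[int, int, int, int],
    my_pod: Optional[int] = None,
    my_position: Optional[int] = None,
) -> str:
    """Decide where to send a frame from the PMAC fields (pod, position, port, vmid)."""
    pod, position, port, _vmid = dst
    if layer == "core":
        # A core switch only needs the pod field.
        return f"down to pod {pod}"
    if layer == "aggregation":
        if pod != my_pod:
            return "up to any core switch"
        # Same pod: the position field selects the edge switch.
        return f"down to edge switch {position}"
    if layer == "edge":
        if pod != my_pod or position != my_position:
            return "up to any aggregation switch"
        # Local host: the port field selects the output port.
        return f"out host port {port}"
    raise ValueError(f"unknown layer: {layer}")


if __name__ == "__main__":
    dst = (2, 1, 0, 1)
    print(next_hop("edge", dst, my_pod=0, my_position=0))
    print(next_hop("core", dst))
    print(next_hop("aggregation", dst, my_pod=2))
    print(next_hop("edge", dst, my_pod=2, my_position=1))
```

Each layer needs a number of entries proportional to k, rather than one entry per host as in flat MAC forwarding.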
Role of the Fabric Manager -
The Fabric Manager maintains a table of the IP -> PMAC mappings of all the hosts in the network. The table is updated whenever an edge switch assigns a PMAC address to a new host. The Fabric Manager uses this information to resolve ARP requests on behalf of hosts, thereby minimizing ARP broadcast traffic.
Location Discovery Protocol (LDP) - This is a custom protocol that PortLand switches use to establish their positions in the network and to monitor liveness in steady state. LDP packets are therefore sent periodically and carry the switch identifier, pod number, position and other useful information.
This hierarchical addressing is leveraged to reduce internal broadcast traffic. It reduces both kinds of broadcast -
- There is no switch-level broadcast, since a switch always knows where to forward a packet just by looking at the structure of the pseudo MAC.
- It also reduces ARP broadcast traffic: the edge switches intercept ARP packets and send them to the Fabric Manager. If the packet is a request and the manager has the {IP : pseudo MAC} mapping, it sends the pseudo MAC to the edge switch, which crafts an ARP reply and sends it back to the requesting host, avoiding a broadcast. If the {IP : pseudo MAC} mapping is not present, the Fabric Manager resorts to a broadcast; when the target host replies to the ARP request, the Fabric Manager also gets a copy of the ARP reply and can update its table.
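The ARP handling described above can be sketched as a small lookup service. This is a hedged illustration of the control flow only: the class `FabricManager`, its method names and the ("reply" / "broadcast") return values are hypothetical, and real messages between edge switches and the Fabric Manager are not modelled.

```python
from typing import Dict, Optional, Tuple


class FabricManager:
    """Keeps the IP -> PMAC table and answers ARP lookups from edge switches."""

    def __init__(self) -> None:
        self.ip_to_pmac: Dict[str, str] = {}

    def register(self, ip: str, pmac: str) -> None:
        # Called when an edge switch reports a newly assigned PMAC.
        self.ip_to_pmac[ip] = pmac

    def handle_arp_request(self, target_ip: str) -> Tuple[str, Optional[str]]:
        """Return ("reply", pmac) on a hit, or ("broadcast", None) on a miss."""
        pmac = self.ip_to_pmac.get(target_ip)
        if pmac is not None:
            # The edge switch can craft the ARP reply itself; no broadcast needed.
            return ("reply", pmac)
        return ("broadcast", None)

    def handle_arp_reply(self, sender_ip: str, sender_pmac: str) -> None:
        # A copy of the reply that followed a broadcast fills in the missing entry.
        self.register(sender_ip, sender_pmac)


if __name__ == "__main__":
    fm = FabricManager()
    fm.register("10.2.1.2", "00:02:01:00:00:01")
    print(fm.handle_arp_request("10.2.1.2"))   # hit: unicast reply
    print(fm.handle_arp_request("10.0.0.3"))   # miss: fall back to broadcast
    fm.handle_arp_reply("10.0.0.3", "00:00:00:01:00:01")
    print(fm.handle_arp_request("10.0.0.3"))   # now a hit
```

Only the first lookup for an unknown host triggers a broadcast; later requests for the same IP are answered from the table.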