Sunday, April 6, 2014

Literature Survey - PortLand Design


This blog post describes an approach to building a scalable, fault-tolerant Layer 2 data center network fabric. The approach was published at SIGCOMM '09 (Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication).

PortLand Design

Fat Tree Topology:

The PortLand design is based on the fat tree topology (shown below) - multi-rooted trees with switches organised into a hierarchy of core, aggregate and edge switches. It assumes that the baseline topology is relatively fixed, with most additions being "leaf" server nodes. The fat tree also forms the basis of many existing data center topologies. A k-ary fat tree is defined as follows:

  • There are k pods, each containing two layers of k/2 switches [aggregate and edge switches]. 
  • Each k-port switch in the lower layer is directly connected to k/2 hosts [edge switches connect to the hosts]. Each physical host can further host multiple virtual machines.
  • Each of the remaining k/2 ports (of the edge switches) is connected to k/2 of the k ports in the aggregation layer of the hierarchy.
  • There are (k/2)² k-port core switches. Each core switch has one port connected to each of the k pods: the i-th port of any core switch is connected to pod i, and each aggregate switch in a pod connects to k/2 core switches.
  • In general, a fat tree built with k-port switches supports k³/4 hosts and consists of (5/4)k² switches (see the short sketch below).
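
These counts are easy to sanity-check numerically. Below is a minimal Python sketch (ours, not from the paper) that computes them for a given k:

```python
def fat_tree_counts(k):
    """Return the component counts of a k-ary fat tree (k must be even)."""
    assert k % 2 == 0, "k-ary fat trees are defined for even k"
    pods = k
    edge_per_pod = k // 2                 # lower layer of each pod
    agg_per_pod = k // 2                  # upper layer of each pod
    core = (k // 2) ** 2                  # (k/2)^2 core switches
    hosts = k ** 3 // 4                   # each edge switch serves k/2 hosts
    switches = pods * (edge_per_pod + agg_per_pod) + core   # = (5/4) * k^2
    return {"pods": pods, "core": core, "hosts": hosts, "switches": switches}

# A 4-ary fat tree: 4 pods, 4 core switches, 16 hosts, 20 switches in total.
print(fat_tree_counts(4))
```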

[Figure: fat tree built using 4-port switches (a 4-ary fat tree)]
Design:


        The PortLand design caters to the data center requirements mentioned in the previous blog post by redesigning the Layer 2 network: it assigns a pseudo MAC address to every host in the network. This pseudo MAC (PMAC), as opposed to the actual MAC, is assigned hierarchically, based on the host's position in the data center network, by the edge switch the host is directly connected to.

          The format of the 48-bit pseudo MAC is: | pod-num: 16 bits | switch-pos: 8 bits | port: 8 bits | vmid: 16 bits |. The assignment of PMACs and the translation from PMAC to actual MAC and vice versa are done by the edge switches. Within the network, the actual MACs are completely abstracted away by this translation, and the whole scheme is transparent to the hosts.
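
As a rough illustration of this layout (the field widths come from the format above, but the helper names below are ours, not PortLand's), building and splitting a PMAC is plain bit arithmetic:

```python
def make_pmac(pod, position, port, vmid):
    """Pack a 48-bit PMAC as | pod:16 | switch-pos:8 | port:8 | vmid:16 |."""
    assert pod < 2**16 and position < 2**8 and port < 2**8 and vmid < 2**16
    return (pod << 32) | (position << 24) | (port << 16) | vmid

def split_pmac(pmac):
    """Recover (pod, position, port, vmid) from a 48-bit PMAC."""
    return (pmac >> 32) & 0xFFFF, (pmac >> 24) & 0xFF, (pmac >> 16) & 0xFF, pmac & 0xFFFF

pmac = make_pmac(pod=2, position=1, port=0, vmid=5)
print(hex(pmac), split_pmac(pmac))    # 0x201000005 (2, 1, 0, 5)
```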

          PortLand also involves a user-level process called the Fabric Manager. The Fabric Manager corresponds to the controller in an SDN and plays a key role in minimizing the internal broadcast traffic of the DC.

Working:


Let's discuss Positional Pseudo MAC Addresses before coming back to the key role played by the Fabric Manager.

Positional PMAC addressing -
       
          As discussed above, PMAC addressing based on the host's position in the DC network is carried out by the edge switches. For all incoming packets, the edge switch replaces the destination MAC (a PMAC) with the actual MAC of the host. Similarly, for all outgoing packets, the source MAC address is replaced by the PMAC of the host. This ensures that the rest of the network is abstracted from the PMAC <-> AMAC translations, while the host continues unmodified, working with its actual flat MAC address.

Having a PMAC reduces the number of forwarding table entries in the switches - forwarding to another pod requires examining only the pod field of the destination PMAC, while within the same pod, examining the switch position and then the remaining bits gives the exact location of the host. This is analogous to prefix matching on IP addresses in routing tables. Hence, introducing the PMAC translations at the edge switch level brings the benefits of a hierarchical addressing scheme, similar to Layer 3 forwarding, to the Layer 2 network.
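
A toy example of the resulting state reduction (our own sketch, not the paper's switch implementation): a core switch can keep a single entry per pod and decide the output port from the pod field alone.

```python
# Toy core-switch table: one entry per pod instead of one per host MAC.
# The port numbers are made up for illustration.
pod_to_port = {0: 1, 1: 2, 2: 3, 3: 4}

def core_output_port(dst_pmac):
    """Pick an output port by looking only at the pod field of the PMAC."""
    pod = (dst_pmac >> 32) & 0xFFFF
    return pod_to_port[pod]

print(core_output_port(0x000201000005))   # destination in pod 2 -> port 3
```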

Role of the Fabric Manager -

The Fabric Manager maintains a table of the IP -> PMAC mappings of all the hosts in the network. It is updated with these details whenever a new host is assigned a PMAC by its edge switch. The Fabric Manager uses this information to answer ARP requests, thereby minimizing ARP broadcast traffic.

Location Discovery Protocol (LDP) - a custom protocol used by PortLand switches to establish their positions in the network and to monitor liveness in the steady state. LDP packets are therefore sent periodically and contain the switch identifier, pod, position and other useful information.
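
A rough sketch of the information such a message might carry (the field names loosely follow the paper; the actual wire format differs):

```python
from dataclasses import dataclass

@dataclass
class LocationDiscoveryMessage:
    switch_id: int   # unique switch identifier
    pod: int         # pod number (not meaningful for core switches)
    position: int    # position within the pod
    level: int       # 0 = edge, 1 = aggregation, 2 = core

# Example: an edge switch at position 1 in pod 2 announcing itself.
print(LocationDiscoveryMessage(switch_id=0xA1, pod=2, position=1, level=0))
```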


          This hierarchical addressing is leveraged to reduce the internal broadcast traffic; it reduces both kinds of broadcast -
  • There is no switch-level broadcast now, since a switch always knows where to forward a packet just by looking at the structure of the pseudo MAC.
  • It also reduces the ARP broadcast traffic: the edge switches intercept any ARP packets and send them to the Fabric Manager. If it is a request and the manager has the {IP : pseudo MAC} mapping, it sends the pseudo MAC to the edge switch, which crafts an ARP reply and sends it back to the requesting host, avoiding a broadcast. If the {IP : pseudo MAC} mapping is not present, the system resorts to a broadcast; when the target host replies to the ARP request, the Fabric Manager also gets a copy of the ARP reply and can update its table (this flow is summarised in the sketch below).
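
That control flow can be summarised in a few lines of Python (a simplified sketch of ours; the real logic is split between the edge switches and the Fabric Manager):

```python
ip_to_pmac = {}   # Fabric Manager state, filled in as edge switches register hosts

def handle_arp_request(requested_ip):
    """Decide how an ARP request intercepted by an edge switch is answered."""
    pmac = ip_to_pmac.get(requested_ip)
    if pmac is not None:
        # Known host: the edge switch crafts a unicast ARP reply -> no broadcast.
        return ("unicast-reply", pmac)
    # Unknown host: fall back to a broadcast; the eventual ARP reply also
    # reaches the Fabric Manager, which then updates ip_to_pmac.
    return ("broadcast", None)

ip_to_pmac["10.0.1.5"] = 0x000201000005       # example mapping (addresses made up)
print(handle_arp_request("10.0.1.5"))         # ('unicast-reply', ...)
print(handle_arp_request("10.0.9.9"))         # ('broadcast', None)
```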

Saturday, March 1, 2014

Broadcast Traffic in Data Centers

This post talks in general about the problems and trade-offs specific to data center networks: the general requirements of data centers and the extent to which L2 and L3 networks meet them.

After going through a few papers and some interesting Coursera videos on SDN, we were able to outline our problem statement better.

Data Centers and their requirements - (A high level perspective)

To frame our problem statement better, it is important to clearly understand how data centers work and what they require.

It is common for user applications to span thousands of servers; a single user search request might access an inverted index spread over 1000+ servers. In addition, analytics involves constant querying of the data stored on these servers. These use cases generate a large amount of traffic within data centers, which makes up a fair fraction of total Internet communication. Hence, any mechanism that reduces unnecessary internal traffic, such as broadcast traffic, can lead to significant performance gains for DCs.

Because of the scale that DCs deal with, some of the requirements/constraints include:
1. VM migration - the VM's IP address shouldn't change, as existing TCP connections would break and application-level state would be lost. VM migration is a basic requirement for flexible and efficient resource usage.
2. Minimize configuration of the forwarding devices deployed in the DC network.
3. No forwarding loops - to prevent occurrences of broadcast storms and flooding

Let's see how L2 and L3 networks match up to these requirements:

- Layer 2 networks use MAC addressing - a flat addressing scheme.
- Layer 2 network switches have no configuration overhead - a switch is self-learning, plug and play.
- In data centers, it is usually desirable to have physical loops so that failover is easy and load can be distributed among the links. In an L2 network, however, the Spanning Tree Protocol has to be run to break those loops, which disables the redundant links and leaves them unutilized.
- To address the first requirement, VM migration should take place within the same Layer 2 network, so that the VM can keep the same IP address.
- Also, L2 networks are more efficient than L3 networks, for several reasons:
Switches don't have to modify packets - they just look up and forward.
Routers, on the other hand, have to rewrite the Layer 2 frame with their own MAC address.
They also have to parse one extra layer - Layer 3 - and modify it (mainly the TTL), which switches don't have to do. This adds computation overhead, because the IP checksum has to be recomputed after the packet is modified.

Also, a switch's logic is usually implemented in specialized hardware, which is much faster than the general-purpose microprocessors typically used in routers.
This efficiency is desirable in DCs, since low latency is a very high priority.
But broadcast traffic remains one of the biggest downsides of using an L2 network.

- Layer 3 networks use IP - a hierarchical addressing scheme.
- VM migration can only happen within the same subnet if the IP address has to be retained, which limits the choice of physical hosts the VM can migrate to.
- Layer 3 forwarding devices, or routers as they are called, require:
1. the hosts to be configured with the address of the gateway router for the subnet they belong to
2. each newly added router to be configured with its subnet information
  Also, DHCP servers have to be synchronized so that they distribute IP addresses based on the host's subnet.
  These tasks are overseen by network administrators, which makes management difficult.
- At the same time, using routers eliminates switch-level broadcast: since the addressing scheme is structured, routers know where to forward incoming packets based on their destination IP address.
- Layer 3 forwarding can tolerate loops, since they are handled by the TTL field, and can keep redundant entries for easy failover.

Note - An ideal system would combine the benefits of both: structured Layer 3 addressing, which avoids broadcast, and the simple Layer 2 forwarding mechanism, with no configuration overhead and easy VM migration.

As we can see, L2 networks meet many of the ideal requirements of a data center, but because broadcast traffic does not scale, they are unusable at that size. Our aim is to make them scalable.

Wednesday, February 12, 2014

Project Overview

Reduction of Broadcast Traffic using SDN in Data Centers 

Looking at the project title, a number of questions come to mind -

What is broadcast traffic?
What is SDN? 
What is its relevance in Data Centers? 
How does the "Reduction" translate into a gain in performance?

Probing a little more would lead to questions like -

What is the topology used in Data Centers? 
What are the typical requirements of a Data Center? 
How can we develop and test Software Defined Networks? 
Why is broadcast traffic a bottleneck in Data Centers? (and hence needs to be reduced for performance)


This blog will answer some of these questions and perhaps raise new ones. 

Let's first talk about network switches:

A switch is a device used on a computer network to physically connect devices together. It operates at the Link Layer, which resides at Layer 2 of the OSI model [Wiki: Link layer]. Switches use MAC addresses - a flat addressing scheme - to address and identify connected devices. They are self-learning and use MAC tables (mappings from a MAC address to the interface it belongs to), populated as packets flow through the switch, to do their job.

The switch's job can be split into two parts:

1. Learning (control plane): adding entries to the MAC table.
The switch looks at the incoming packet's source MAC address and the input port it arrived on. It then creates an entry mapping the two in the MAC table, if one is not already present.
2. Forwarding (data plane): moving the packet from the input port to an output port.
This is done based on the packet's destination MAC address. The destination MAC address is looked up in the MAC table and, if matched, the packet is sent out of the corresponding port obtained from the matching MAC table entry.
If the lookup fails, the switch sends the packet out of all ports (a broadcast) except the port it came in on.

One thing to note is that each entry in the MAC table has an associated timeout, after which it is removed.
This is done mainly to curb the size of the MAC table (see the sketch below).
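
Put together, a self-learning switch boils down to a few lines of logic. The sketch below mirrors the two steps above (timeouts are modelled with timestamps; everything else is simplified):

```python
import time

MAC_ENTRY_TIMEOUT = 300     # seconds; entries older than this are treated as expired

mac_table = {}              # MAC address -> (output port, time the entry was learned)

def handle_frame(src_mac, dst_mac, in_port, all_ports):
    """Return the list of ports the frame should be sent out of."""
    # 1. Learning (control plane): remember which port the source lives on.
    mac_table[src_mac] = (in_port, time.time())

    # 2. Forwarding (data plane): look up the destination.
    entry = mac_table.get(dst_mac)
    if entry is not None and time.time() - entry[1] < MAC_ENTRY_TIMEOUT:
        return [entry[0]]                        # known destination: one port
    # Unknown or expired destination: flood out of every port except the input.
    return [p for p in all_ports if p != in_port]

print(handle_frame("aa:aa", "bb:bb", in_port=1, all_ports=[1, 2, 3]))   # flood: [2, 3]
print(handle_frame("bb:bb", "aa:aa", in_port=2, all_ports=[1, 2, 3]))   # learned: [1]
```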

Takeaway: Layer 2 switches generate a lot of broadcast traffic when forwarding, either due to the addition of a new host or the expiry of MAC table entries.

Software Defined Networking - Conventionally, Layer 2 switches have been self-learning, with both the control logic and the forwarding action (sending packets to the appropriate output port) built into the switch itself. Any tweaking of the control logic to change the forwarding behaviour would therefore mean building custom switches, which is a large cost overhead for organisations that want switching based on custom logic. This is where SDN comes to the rescue! The basic principle of Software Defined Networking is to separate the control plane and the data plane in the switches. The control plane (the brain of the switch) is moved out to an external component, the controller, which can then be programmed to change the behaviour of the switches (see the sketch below).
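
To make the separation concrete, here is a deliberately stripped-down sketch (plain Python, no real OpenFlow library): the "controller" decides the forwarding policy and installs rules, while the "switch" only matches packets against the rules it has been given.

```python
class Switch:
    """Data plane: forwards purely by matching installed rules."""
    def __init__(self):
        self.flow_table = {}

    def forward(self, dst_mac):
        return self.flow_table.get(dst_mac, "send-to-controller")

class Controller:
    """Control plane: decides where traffic should go and installs rules."""
    def __init__(self, switches):
        self.switches = switches

    def install_rule(self, switch_id, dst_mac, out_port):
        # In a real SDN this would be an OpenFlow flow-mod message.
        self.switches[switch_id].flow_table[dst_mac] = out_port

s1 = Switch()
ctrl = Controller({"s1": s1})
ctrl.install_rule("s1", "00:00:00:00:00:02", out_port=3)
print(s1.forward("00:00:00:00:00:02"))   # 3
print(s1.forward("00:00:00:00:00:09"))   # 'send-to-controller'
```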

Note: SDN is applicable to any forwarding device, such as a hub or a switch. The controller defines the logic, and the devices forward packets based on that logic.

Data centers, as we know, are huge collections of physical servers that host many applications. There are tens of thousands of hosts (which can be physical or virtual machines) involved, all organised in some complex hierarchy. There is an enormous amount of packet traffic within data centers, and a significant part of it is broadcast traffic. This is one of the main reasons why huge Layer 2 networks cannot be used in DCs even though they offer several benefits such as plug-and-play operation, VM migration, etc. They just cannot scale: broadcast traffic explodes as the size of the network increases. Our work involves reducing this broadcast traffic by building applications over SDN.
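
As a hint of how such work can be prototyped, here is a minimal Mininet sketch (Mininet is one of the tools listed in our introduction post below). It assumes Mininet's Python Topo API and builds only a single pod of a 4-ary fat tree, not the full topology:

```python
from mininet.topo import Topo

class SinglePodTopo(Topo):
    """One pod of a 4-ary fat tree: 2 aggregation switches, 2 edge switches, 4 hosts."""
    def __init__(self, **opts):
        Topo.__init__(self, **opts)
        aggs = [self.addSwitch('s1'), self.addSwitch('s2')]    # aggregation layer
        edges = [self.addSwitch('s3'), self.addSwitch('s4')]   # edge layer
        for agg in aggs:                      # full mesh between the two layers
            for edge in edges:
                self.addLink(agg, edge)
        hosts = [self.addHost('h%d' % i) for i in range(1, 5)]
        self.addLink(edges[0], hosts[0])      # each edge switch serves k/2 = 2 hosts
        self.addLink(edges[0], hosts[1])
        self.addLink(edges[1], hosts[2])
        self.addLink(edges[1], hosts[3])

# Usable with:  sudo mn --custom this_file.py --topo singlepod
topos = {'singlepod': (lambda: SinglePodTopo())}
```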

Monday, February 10, 2014

Introduction

We are a team of three final year students from PES Institute of Technology, Bangalore. We are currently working on our final semester project and wish to use this blog as a way to share our research work as we learn, on a more or less weekly basis.

Reduction of Broadcast Traffic in Data Centers using SDN is the project that we've taken up under our faculty, Dr. Ram P Rustagi Sir. This project involves understanding a number of different technologies, protocols and tools. We've identified some of them - Mininet, the OpenFlow protocol, SDN, data center topologies, Layer 2 network broadcast, Layer 3 network configuration, and VLANs in data centers.

The list isn't exhaustive and we might have to touch upon some more related aspects as we proceed.

We will try our best to cater to the novice reader who has some basic understanding of computer networking.