Service chaining with NSH and Open vSwitch

This post originally appeared on https://blog.inmanta.com/2017/04/06/service-chaining-nsh-open-vswitch/

Service chaining is a hot topic in the SDN/NFV world. The concept is simple, but it requires many moving parts, and most SDN solutions use their own technology for chaining network services. The challenge is knowing where in the chain a function is located and where a packet should be sent next, without sending all traffic through a central node.

Network Service Headers

Most presentations and articles about service chaining position network service headers (NSH) as the solution for network service chaining. Although NSH is still a draft standard, the industry seems to agree that it is the way to go.

A service chain uses NSH in conjunction with an encapsulation protocol such as VXLAN. NSH adds additional metadata to the encapsulated packet. The most important metadata is the service graph in which the packet was classified and the index of its current position in that graph. The protocol also has room for four 4-byte metadata fields.

Open vSwitch

In the OpenStack ecosystem and with SDN controllers, Open vSwitch is an important component for virtual switching. All available demos and the OPNFV project use Open vSwitch for chaining. However, NSH support is not available in mainline Open vSwitch and the project has rejected the available patch sets. Currently two patch sets exist (one against a 2.3 development version and one against a 2.5 development version) and only the latter has DPDK support.

The lack of decent NSH support makes it rather complex, but doable, to set up demonstrators. However, it clearly shows that NSH in combination with Open vSwitch is not yet ready for pilot or production use, despite all the fuss at conferences.

Network functions

Network functions with NSH support are also an important step for NSH adoption. Last year we built a demonstrator based on NSH and were unable to find any VNF (virtual network function) with NSH support. This does not mean that none exist, but it does raise red flags.

Those who have read the NSH draft might wonder: why not use an NSH proxy with non-NSH-aware network functions?

We were not able to find a proxy either. The closest thing to a proxy we could find was this reference on the OpenDaylight mailing list.

Eventually we used the NSH python tools referenced in that thread. They are functional, but do not expect any performance.

Conclusion

It is clear that NSH is far from ready for prime time. NSH is promising but lacks real support, even for demo purposes! It makes one wonder if there is a chicken-or-egg problem here: there are no VNFs so SDN vendors are not inclined to add support, and because no SDN technology supports NSH, VNF vendors will not add NSH support.

OpenStack and provider networks

VM images in OpenStack (and by extension on any cloud provider) need to do two things to get configured after boot:

  1. Get an IP (v4) through DHCP
  2. Download its initial config by sending a request to 169.254.169.254. A tool such as cloud-init takes care of this. More information is available in the Amazon EC2 documentation.

In a “normal” OpenStack deployment with a network node that performs L3 functions (routing), requests to 169.254.169.254 are redirected with NAT to the metadata-proxy, which in turn proxies the requests to nova-api.

If you want to use OpenStack merely as a management layer on top of compute and storage with a little touch of networking (and skip self-service networks), you use provider networks. The OpenStack admin defines these networks statically: the NIC to use, VLAN ID, subnet, gateway, … This means that the default gateway is not controlled by OpenStack and therefore metadata requests cannot be redirected.
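
As an illustration, defining such a provider network could look like this (a sketch; the exact client syntax depends on your release, and the physical network name, VLAN ID and subnet are made up):

$ openstack network create --provider-network-type vlan \
      --provider-physical-network physnet1 --provider-segment 100 prov-net
$ openstack subnet create --network prov-net --subnet-range 192.0.2.0/24 \
      --gateway 192.0.2.1 prov-subnet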

OpenStack has a solution for that. You have to use the OpenStack DHCP service and set enable_isolated_metadata to true in /etc/neutron/dhcp_agent.ini. This will pass a host route for 169.254.169.254/32 with the DHCP offer (option 121, classless static routes). The metadata-proxy will also listen for requests to this IP in the DHCP server network namespace on the network node.
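
A minimal sketch of what this looks like (crudini is just a convenience for editing the ini file, and the agent service name depends on your distribution):

# on the network node
$ sudo crudini --set /etc/neutron/dhcp_agent.ini DEFAULT enable_isolated_metadata true
$ sudo systemctl restart neutron-dhcp-agent

# inside a VM on the provider network, the host route and the metadata service should now work
vm$ ip route get 169.254.169.254
vm$ curl http://169.254.169.254/openstack/latest/meta_data.json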

Jenkins: Jenkinsfile and procps

At Inmanta we recently switched from GitLab and GitLab CI to the new GitHub organizations (and their per seat pricing). At the same time, we also switched back from GitLab CI to Jenkins. Jenkins 2 now has the ability to add your CI config to your repo. We run all of our tests inside a container. This makes it a lot easier to use CentOS 7 on our CI servers but run our tests inside a Fedora container that has a normal python3, or to run tests on other operating systems.

It took us a while to get this up and running because of how the Jenkins pipeline plugin uses containers. Jenkins starts a container with “cat” as the command and uses docker exec for the actual CI steps. A stock fedora or centos container will not work, because Jenkins executes ps to check whether the process is still running. When this check fails, each CI process is killed after 10 seconds. By default, ps is not installed in the fedora container; installing procps-ng fixed this.

All examples from CloudBees use a container with Java, Maven and procps installed.
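
For reference, a minimal sketch of such a test image (the base image and package list are illustrative):

# Dockerfile for the CI image: python3 for the tests, procps-ng so that
# Jenkins can run ps inside the container to track the CI processes
FROM fedora
RUN dnf install -y python3 procps-ng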

Ceph, OpenStack and fstrim

Ceph is a popular storage backend for OpenStack. It allows you to use the same storage for images, ephemeral disks and volumes. Each Ceph image is also thinly provisioned (by default in blocks of 4M). Until recently, once storage was allocated, the filesystem in the virtual machine had no way to release it back. In recent Ceph and OpenStack releases it is possible to use fstrim inside the virtual machine to release disk blocks. fstrim was not supported in older operating systems such as Ubuntu 14.04, but Ubuntu 16.04 and CentOS 7 both support it.
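
Note that for the TRIM commands to reach Ceph at all, the virtual disk has to support discard. A hedged sketch of what that typically requires with the libvirt driver (the image name is illustrative):

# nova.conf on the compute nodes
[libvirt]
hw_disk_discard = unmap

# expose the disk to the VM through virtio-scsi, which supports discard
$ openstack image set --property hw_scsi_model=virtio-scsi \
      --property hw_disk_bus=scsi centos7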

Let's try it:

In the virtual machine, 5.4G is used. Before running df I ran sync and fstrim /:

vm$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G  5.4G   45G  11% /

The real usage of the VM disk (a diff against the stock CentOS 7 image):

$ rbd diff vms/ae445f4c-38e7-446d-9edd-618533632469_disk | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
7467.23 MB

This is already quite a bit more. This virtual machine has been in use for some time, so probably not all deleted data has been released by the filesystem yet.

Add a file of 1G and sync it to disk:

vm$ dd if=/dev/zero of=test.img bs=4M count=250
250+0 records in
250+0 records out
1048576000 bytes (1.0 GB) copied, 1.71766 s, 610 MB/s
vm$ sync
vm$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G  6.4G   44G  13% /

The real usage increased by 824M. This is not the full 1G, but close:

$ rbd diff vms/ae445f4c-38e7-446d-9edd-618533632469_disk | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
8291.23 MB

When we delete the file, only a few MB are released from Ceph storage:

vm$ rm -f test.img 
vm$ sync
vm$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G  5.4G   45G  11% /

Ceph usage:

$ rbd diff vms/ae445f4c-38e7-446d-9edd-618533632469_disk | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
8283.23 MB

After fstrim:

vm$ fstrim /

Ceph usage:

$ rbd diff vms/ae445f4c-38e7-446d-9edd-618533632469_disk | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
7107.23 MB

This is even less than before the 1G test file was created, probably because additional filesystem blocks were released.

Letsencrypt and SSL only Apache websites

Letsencrypt is a great initiative. It lets you create free signed SSL certificates. These certificates are only valid for 3 months, but that does not matter because the whole process is automated. It has become so easy to use SSL on your website that it makes you wonder why the existing CAs did not come up with this!

The majority of the websites I maintain are already SSL only, this blog for example. However, this can be tricky to set up with the automated cert deployment. You run their client on your webserver, request a cert for one or more domains and specify the webroot of the website where the site is hosted. The client places a challenge in the .well-known/acme-challenge/ directory to prove that you are the owner of the domain (or at least control the website that the domain points to). A daily cronjob then renews the certificates.
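
With the current certbot client that boils down to something like this (a sketch; the paths and the reload hook are assumptions):

# request a certificate by proving control over the webroot
$ certbot certonly --webroot -w /var/www/html/blog -d bart.vanbrabant.eu

# daily renewal from cron (/etc/cron.d syntax), reloading Apache when a cert was renewed
0 3 * * * root certbot renew --quiet --post-hook "systemctl reload httpd"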

If you move to SSL only, you will probably do something like this:

<virtualhost *:80>
    DocumentRoot    /var/www/html/blog
    ServerName      bart.vanbrabant.eu
    Redirect / https://bart.vanbrabant.eu/
</virtualhost>

However, the next time letsencrypt tries to verify the challenge it will be redirected to the https website instead of retrieving the challenge over http. This is easily solved by changing the redirect to:

RedirectMatch 301 ^(?!/\.well-known/acme-challenge/).* https://bart.vanbrabant.eu$0
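
The port 80 virtual host then becomes:

<virtualhost *:80>
    DocumentRoot    /var/www/html/blog
    ServerName      bart.vanbrabant.eu
    RedirectMatch 301 ^(?!/\.well-known/acme-challenge/).* https://bart.vanbrabant.eu$0
</virtualhost>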

MLAG with Cumulus

Cumulus Linux supports MLAG for dual-connected hosts. How to set this up is explained in great detail in their documentation. However, the role of the inter-switch link (ISL) and how it needs to be configured was not entirely clear to me. Both with traditional bridges and with VLAN-aware bridges, my initial configuration contained a mistake. This mistake caused the kernel to handle the ISL bond (called peerlink in the examples in the docs) as a normal link instead of the ISL. clagctl did not warn me about it. The only way to find out is by printing the STP port details with mstpctl.

The docs explain that the ISL is set up by defining an SVI (switch virtual interface) on an available VLAN (4094 in their docs) and adding additional stanzas that generate the configuration for clagd. This is correct, but a crucial piece of information was lost on me: this setup only makes clagd function and sync LACP bonding state between the switch members of a bond. The ISL also needs to be programmed with specific rules into the ASIC, so that it handles single-connected hosts and STP traffic correctly. This only became clear to me after reading this paper. The kernel needs to recognize the ISL bond as the ISL in a specific manner: first by adding the clag stanzas (see the sketch below) and second by adding the interface untagged (no subinterface) to a bridge. What this means depends on the bridge mode:
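
For reference, the clag stanzas on the ISL bond look roughly like this (a sketch based on the Cumulus docs; the member ports and peer addresses are illustrative):

auto peerlink
iface peerlink
    bond-slaves swp49 swp50

auto peerlink.4094
iface peerlink.4094
    address 169.254.1.1/30
    clagd-enable yes
    clagd-peer-ip 169.254.1.2
    clagd-sys-mac 44:38:39:FF:00:01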

Traditional mode
In traditional mode you have a bridge per VLAN that exists in the network. For the ISL to work correctly, the ISL link (peerlink for example) needs to be added to a bridge without a subinterface. For example:

auto br0
iface br0
    bridge-ports peerlink swp1 swp2 swp3 ...

When this is configured correctly, you should get similar output from mstpctl:

root@cumulus1:~# mstpctl showportdetail br0 peerlink 
br0:peerlink CIST info
  enabled            yes                     role                 Designated
  port id            8.002                   state                forwarding
  external port cost 1382                    admin external cost  0
  internal port cost 1382                    admin internal cost  0
  designated root    1.000.44:38:39:FF:00:01 dsgn external cost   0
  dsgn regional root 1.000.44:38:39:FF:00:01 dsgn internal cost   0
  designated bridge  1.000.44:38:39:FF:00:01 designated port      8.002
  admin edge port    no                      auto edge port       yes
  oper edge port     yes                     topology change ack  no
  point-to-point     yes                     admin point-to-point auto
  restricted role    no                      restricted TCN       no
  port hello time    2                       disputed             no
  bpdu guard port    no                      bpdu guard error     no
  network port       no                      BA inconsistent      no
  Num TX BPDU        3                       Num TX TCN           2
  Num RX BPDU        3                       Num RX TCN           2
  Num Transition FWD 2                       Num Transition BLK   1
  bpdufilter port    no                     
  clag ISL           yes                     clag ISL Oper UP     yes
  clag role          primary                 clag dual conn mac   00:00:00:00:00:00
  clag remote portID F.FFF                   clag system mac      44:38:39:FF:00:01

The “clag ISL: yes” line is crucial here.

In traditional bridge mode this mistake is easily made (at least in an isolated setup where the Cumulus switches only connect to servers), when none of the connected devices have their untagged traffic mapped to untagged traffic on the switches.

VLAN aware bridges
In VLAN-aware mode, only untagged interfaces are added to the bridge. This makes the mistake harder to make, unless you only add the interfaces of connected devices to the bridge and not peerlink (like I did). Again, no warning anywhere, except for “clag ISL: no” in mstpctl.
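
The corrected VLAN-aware configuration therefore includes peerlink in the bridge ports, roughly like this (port names and VLAN IDs are illustrative):

auto bridge
iface bridge
    bridge-vlan-aware yes
    bridge-ports peerlink swp1 swp2 swp3
    bridge-vids 10 20 30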

Ceph rbd image size

Another Ceph related command: getting the disk usage of an rbd-based image. There is no simple command to get the real disk usage of an rbd image, because in an OpenStack deployment with Ceph your rbd images are often snapshots of an OS image or another disk. On the Ceph blog I found the following:

$ rbd diff rbd/tota | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'

rbd/tota is the path to the disk, with rbd the pool name and tota the image name. In an OpenStack deployment you will probably see the pools vms, images and volumes. You can get a listing of the images in each pool with:

$ rbd -p $poolname ls

For example:

$ rbd -p volumes ls
volume-bb85c32b-a003-405d-8029-cf02b0e9eec9
volume-003e8ed8-1e9a-4344-9b97-34ee440a1f35
volume-006c6bed-0407-4334-9bfa-678fc3c2de90
volume-01652b4d-1d70-487e-9b8f-a5bcf6f7fc8f
volume-02489e8f-12fa-4a62-ada5-999fdae7c1f0
volume-036f1366-4df1-4154-8c21-968a29e09db1

In the volumes pool (managed by cinder) the default name is volume-${volumeid}. For virtual machine ephemeral disks (managed by nova) it is ${vm id}_disk, and for VM images (managed by glance) it is just the id of the image without any prefix or suffix.

Getting the real size of an rbd image is then, for example:

$ rbd diff volumes/volume-a5caf687-bbb0-45ca-a423-7a2ce2f5004d | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
52847.6 MB

Getting pool and OSD information out of Ceph

Whenever I’m doing things with Ceph I seem to forget the exact commands to get details and stats about ceph pools and OSDs. You have the ceph command with its huge number of subcommands, but you can also use the rados and rbd commands.

Pool properties
First of all there is the command to show all pool properties. This is useful when you have tiered storage in Ceph:

$ ceph osd pool ls detail

pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 190129 lfor 190129 flags hashpspool tiers 29 read_tier 29 write_tier 29 min_write_recency_for_promote 1 stripe_width 0
 removed_snaps [1~3]

pool 29 'rbd_ssd' replicated size 3 min_size 2 crush_ruleset 1 object_hash rjenkins pg_num 350 pgp_num 350 last_change 190129 flags hashpspool,incomplete_clones tier_of 0 cache_mode writeback target_bytes 429496729600 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 600s x4 decay_rate 0 search_last_n 0 min_read_recency_for_promote 2 min_write_recency_for_promote 2 stripe_width 0
 removed_snaps [1~3]

This command is the only command I am aware of that lists all properties of a pool, instead of having to query them one by one with ceph osd pool get.
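
For comparison, querying the properties one by one looks like this:

$ ceph osd pool get rbd size
size: 3
$ ceph osd pool get rbd pg_num
pg_num: 64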

Disk usage
Ceph has a plethora of commands to get the data/disk usage:

First of all you can use ceph df detail. It gives detailed information about the disk usage and takes specific crush rules into account when displaying the available space. In the example below, two crush rules with different roots have been defined, which allows us to place pools on SSD storage and on SAS disks.

$ ceph df detail


Next, you can get a listing of the disk usage per OSD. The documentation mostly mentions ceph osd tree to list all OSDs and where they are located in the crush tree. With ceph osd df you get a listing of the disk usage of each OSD and the data distribution.

$ ceph osd df

To be complete, a final command is rados df. Its output is similar to that of ceph df detail. I only recently found out about ceph df detail and have been using rados df for a few years already.

$ rados df

Tuur!

Friday at 01:50 our second son Tuur was born. He was born 6 weeks early at 34 weeks, but was already 49cm and 2.8kg. Although it is a preterm birth, everything is currently going according to the best-case scenario the doctors laid out upfront. That scenario does include wires and tubes and staying in an incubator. Gradually he has been losing the wires and tubes, and today he moved to a normal bed in the NICU.


In 2005 I really liked the album “Tourist” by Athlete. One of the best known songs on that album is “Wires”, which is about the preterm birth of the singer’s daughter. That gave the lyrics some meaning, but it never really got to me. Until now. Every word in that song is spot on about how it feels when your newborn child is born too early.