OpenStack and provider networks

VM images in OpenStack (and by extension all cloud providers) need to do two things to get configured after boot:

  1. Get an IP (v4) through DHCP
  2. Download its initial config by sending a request to the metadata service. A tool such as cloud-init takes care of this. More information is available in the Amazon EC2 documentation.

In a “normal” OpenStack deployment with a network node that performs L3 functions (routing), requests to the metadata address are redirected with NAT to the metadata-proxy, which in turn proxies the requests to nova-api.

If you want to use OpenStack merely as a management layer on top of compute and storage with a little touch of networking (and skip self-service networks), you use provider networks. The OpenStack admin defines these networks statically: the NIC to use, VLAN ID, subnet, gateway, … This means that the default gateway is not controlled by OpenStack and therefore metadata requests cannot be redirected.

OpenStack has a solution for that. You have to use the OpenStack DHCP service and set enable_isolated_metadata to true in /etc/neutron/dhcp_agent.ini. This will pass a host route for the metadata address with the DHCP offer (option 121). The metadata-proxy will also listen for requests to this IP in the DHCP server network namespace on the network node.
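
To make the option 121 part a bit more concrete, here is a minimal Python sketch of how a classless static route is laid out on the wire according to RFC 3442: one byte with the prefix length, the significant octets of the destination, then the four octets of the router. The function name and the 10.0.0.2 address of the DHCP port are made up for illustration, and 169.254.169.254 is my assumption for the metadata address (it is the usual EC2-style link-local address).

```python
import ipaddress


def encode_classless_route(destination: str, router: str) -> bytes:
    """Encode one route as used in DHCP option 121 (RFC 3442).

    destination: CIDR string such as "169.254.169.254/32" (example value)
    router:      next hop as a dotted-quad string (example value)
    """
    network = ipaddress.ip_network(destination)
    prefix_len = network.prefixlen
    # Only the octets covered by the prefix are sent: /32 -> 4, /8 -> 1, /0 -> 0
    significant_octets = (prefix_len + 7) // 8
    dest_bytes = network.network_address.packed[:significant_octets]
    router_bytes = ipaddress.ip_address(router).packed
    return bytes([prefix_len]) + dest_bytes + router_bytes
```

Calling encode_classless_route("169.254.169.254/32", "10.0.0.2").hex() gives 20a9fea9fe0a000002: a /32 host route for the metadata address with the DHCP port as next hop.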

Ceph, OpenStack and fstrim

Ceph is a popular storage backend for OpenStack. It allows you to use the same storage for images, ephemeral disks and volumes. Each Ceph image is also thinly provisioned (by default in blocks of 4M). Until recently, once storage was allocated, the filesystem in the virtual machine had no way to release it again. In recent Ceph and OpenStack releases it is possible to use fstrim inside the virtual machine to release disk blocks. fstrim was not supported in older operating systems such as Ubuntu 14.04, but Ubuntu 16.04 and CentOS 7 both support it.

Let's try it:

In the virtual machine, 5.4G is used. Before the df I ran sync and fstrim /.

vm$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G  5.4G   45G  11% /

The real usage of the VM disk (diff against the stock CentOS 7 image):

$ rbd diff vms/ae445f4c-38e7-446d-9edd-618533632469_disk | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
7467.23 MB

This is already quite a lot more. The virtual machine has been in use for some time, so probably not all deleted data has been released by the filesystem.

Add a file of 1G and sync it to disk:

vm$ dd if=/dev/zero of=test.img bs=4M count=250
250+0 records in
250+0 records out
1048576000 bytes (1.0 GB) copied, 1.71766 s, 610 MB/s
vm$ sync
vm$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G  6.4G   44G  13% /

The real usage increased by 824M. This is not the full 1G but close:

$ rbd diff vms/ae445f4c-38e7-446d-9edd-618533632469_disk | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
8291.23 MB

When we delete the file, only a few MB are released from Ceph storage:

vm$ rm -f test.img 
vm$ sync
vm$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G  5.4G   45G  11% /

Ceph usage:

$ rbd diff vms/ae445f4c-38e7-446d-9edd-618533632469_disk | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
8283.23 MB

After fstrim:

vm$ fstrim /

Ceph usage:

$ rbd diff vms/ae445f4c-38e7-446d-9edd-618533632469_disk | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
7107.23 MB

This is even less than before the 1G test file. This is probably due to additional filesystem blocks that were released.
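
For reference, the differences between the four measurements can be recomputed with a few lines of Python. The numbers are the ones printed by rbd diff above; the step names and the helper are only labels I made up for this sketch.

```python
# Ceph usage in MB as reported by the `rbd diff ... | awk` commands above.
usage_mb = {
    "baseline": 7467.23,      # before writing the test file
    "after_write": 8291.23,   # after dd + sync
    "after_delete": 8283.23,  # after rm + sync
    "after_fstrim": 7107.23,  # after fstrim /
}


def delta(before: str, after: str) -> float:
    """Change in Ceph usage between two steps, rounded to 2 decimals."""
    return round(usage_mb[after] - usage_mb[before], 2)
```

delta("baseline", "after_write") is 824.0, delta("after_write", "after_delete") is -8.0 and delta("after_delete", "after_fstrim") is -1176.0, so fstrim released 360 MB more than the test file had added.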

Letsencrypt and SSL only Apache websites

Letsencrypt is a great initiative. It lets you create free signed SSL certificates. These certificates are only valid for 3 months, but that does not matter because the renewal process is automated. It has become so easy to use SSL on your website that it makes you wonder why existing CAs did not come up with this!

The majority of the websites I maintain are already SSL only, this blog for example. However, this can be tricky to set up with the automated cert deployment. You run their client on your webserver, request a cert for one or more domains and specify the webroot of the website where this site is hosted. The client places a challenge in the .well-known/acme-challenge/ directory to prove that you are the owner of the domain (or at least control the website that the domain points to). You then run a daily cronjob that renews the certificates.

If you move to SSL only, you will probably do something like this:

<virtualhost *:80>
    DocumentRoot    /var/www/html/blog
    Redirect /

However, the next time letsencrypt tries to verify the challenge, it will be redirected to the https website instead of retrieving the challenge over http. This is easily solved by changing the redirect to one that skips the challenge directory:

RedirectMatch 301 ^(?!/\.well-known/acme-challenge/).*$0
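
The trick in that line is the negative lookahead (?!...): the pattern matches every path except those that start with /.well-known/acme-challenge/. Apache uses PCRE and not Python, but the lookahead behaves the same for this pattern, so a quick way to convince yourself is the following sketch (the function name and the example paths are made up):

```python
import re

# Same pattern as in the RedirectMatch line: any path that does NOT
# start with /.well-known/acme-challenge/
PATTERN = re.compile(r"^(?!/\.well-known/acme-challenge/).*")


def should_redirect(path: str) -> bool:
    """Return True when the request path would be redirected to https."""
    return PATTERN.match(path) is not None
```

should_redirect("/2016/some-post/") is True, while should_redirect("/.well-known/acme-challenge/some-token") is False, so the challenge stays reachable over plain http.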

MLAG with Cumulus

Cumulus Linux supports MLAG for dual-connected hosts. How to set this up is explained in great detail in their documentation. However, the role of the inter-switch link (ISL) and how it needs to be configured was not entirely clear to me. Both in a setup with normal bridges and with VLAN-aware bridges, my initial configuration contained a mistake. This mistake caused the kernel to handle the ISL bond (called peerlink in the examples in the docs) as a normal link instead of the ISL. clagctl did not warn me about it. Actually, the only way to find out is by printing the STP port details using mstpctl.

The docs explain that the ISL is set up by defining an SVI (switch virtual interface) on an available VLAN (4094 in their docs) and adding additional stanzas to generate the configuration for clagd. This is correct, but a crucial piece of information was lost on me. This setup is only used for clagd to function and to sync LACP bonding states between the switch members of a bond. The ISL also needs to be programmed into the ASIC with specific rules, so that it handles single-connected hosts and STP traffic correctly. This only became clear to me after reading this paper. The kernel needs two things to recognize the ISL bond as the ISL: first, the clag stanzas, and second, the untagged interface (no subinterface) added to an untagged bridge. What this means depends on the bridge mode:

Traditional mode
In traditional mode you have a bridge per VLAN that exists in the network. For the ISL to work correctly, the ISL link (peerlink for example) needs to be added to a bridge without a subinterface. So for example:
auto br0
iface br0
bridge-ports peerlink swp1 swp2 swp3 ...

When this is configured correctly, you should get similar output from mstpctl:

root@cumulus1:~# mstpctl showportdetail br0 peerlink 
br0:peerlink CIST info
  enabled            yes                     role                 Designated
  port id            8.002                   state                forwarding
  external port cost 1382                    admin external cost  0
  internal port cost 1382                    admin internal cost  0
  designated root    1.000.44:38:39:FF:00:01 dsgn external cost   0
  dsgn regional root 1.000.44:38:39:FF:00:01 dsgn internal cost   0
  designated bridge  1.000.44:38:39:FF:00:01 designated port      8.002
  admin edge port    no                      auto edge port       yes
  oper edge port     yes                     topology change ack  no
  point-to-point     yes                     admin point-to-point auto
  restricted role    no                      restricted TCN       no
  port hello time    2                       disputed             no
  bpdu guard port    no                      bpdu guard error     no
  network port       no                      BA inconsistent      no
  Num TX BPDU        3                       Num TX TCN           2
  Num RX BPDU        3                       Num RX TCN           2
  Num Transition FWD 2                       Num Transition BLK   1
  bpdufilter port    no                     
  clag ISL           yes                     clag ISL Oper UP     yes
  clag role          primary                 clag dual conn mac   00:00:00:00:00:00
  clag remote portID F.FFF                   clag system mac      44:38:39:FF:00:01

clag ISL: yes is crucial here.

In traditional bridge mode this mistake is easily made (at least in an isolated setup where the Cumulus switches only link to servers) when none of the connected devices has untagged traffic that maps to untagged traffic on the switches.

VLAN aware bridges
In VLAN-aware mode, only untagged interfaces are added to a bridge. This makes the mistake harder to make, unless you only add the interfaces of connected devices to the bridge and not peerlink (like I did). Again, there is no warning anywhere, except for “clag ISL: no” in mstpctl.
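
Since mstpctl is the only place where this shows up, it can be worth checking for it in a script. Below is a minimal Python sketch that looks for the clag ISL field in the text printed by mstpctl showportdetail. It assumes the output layout shown earlier in this post; the function name is my own.

```python
import re


def clag_isl_enabled(showportdetail_output: str) -> bool:
    """Return True when the 'clag ISL' field in the mstpctl output is 'yes'.

    The neighbouring 'clag ISL Oper UP' field is not matched, because the
    pattern requires yes/no directly after 'clag ISL' and whitespace.
    """
    match = re.search(r"\bclag ISL\s+(yes|no)\b", showportdetail_output)
    return match is not None and match.group(1) == "yes"
```

Feed it the captured output of mstpctl showportdetail br0 peerlink (for example through subprocess) and raise an alert when it returns False.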

Ceph rbd image size

Another Ceph related task: getting the real disk usage of an rbd based image. There is no simple command for this, because in an OpenStack deployment with Ceph your rbd images are often a snapshot of an OS image or another disk. On the Ceph blog I found the following:

$ rbd diff rbd/tota | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'

rbd/tota is the path to the disk, with rbd the pool name and tota the image name. In an OpenStack deployment you will probably see the pools vms, images and volumes. You can get a listing of the images in each pool with:

$ rbd -p $poolname ls

For example:

$ rbd -p volumes ls

In the volumes pool (managed by cinder) the default name is volume-${volumeid}. For the ephemeral disks of virtual machines (managed by nova) this is ${vm id}_disk, and for VM images (managed by glance) this is only the id of the image without any prefix or suffix.
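
Because I keep mixing up these naming rules, here is a small Python helper that builds the pool/image path from the OpenStack id. It is only a sketch: the pool names vms, images and volumes are the usual ones mentioned above but depend on your deployment, and the function name and the kind labels are made up.

```python
def rbd_image_path(kind: str, openstack_id: str) -> str:
    """Build the pool/image path for an OpenStack resource stored in Ceph.

    kind is one of "volume" (cinder), "vm" (nova ephemeral disk)
    or "image" (glance). Pool names are assumed defaults.
    """
    if kind == "volume":
        return f"volumes/volume-{openstack_id}"
    if kind == "vm":
        return f"vms/{openstack_id}_disk"
    if kind == "image":
        return f"images/{openstack_id}"
    raise ValueError(f"unknown kind: {kind!r}")
```

The result can be passed directly to rbd diff.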

Getting the real size of an rbd image is then, for example:

$ rbd diff volumes/volume-a5caf687-bbb0-45ca-a423-7a2ce2f5004d | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
52847.6 MB
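
If you want the same number inside a script, the awk one-liner translates to a few lines of Python. This sketch assumes the plain-text output of rbd diff with the columns offset, length and type; the function name is my own.

```python
def rbd_diff_usage_mb(diff_output: str) -> float:
    """Sum the length column (2nd field) of `rbd diff` output, in MB.

    Lines whose second field is not a number (such as a header line)
    are skipped, which mirrors what the awk one-liner effectively does.
    """
    total_bytes = 0
    for line in diff_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1].isdigit():
            total_bytes += int(fields[1])
    return total_bytes / 1024 / 1024
```

You can feed it the captured output of rbd diff, for example from subprocess.run(["rbd", "diff", path], capture_output=True, text=True).stdout.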

Getting pool and OSD information out of Ceph

Whenever I’m doing things with Ceph I seem to forget the exact commands to get details and stats about ceph pools and OSDs. You have the ceph command with its huge number of subcommands, but you can also use the rados and rbd commands.

Pool properties
First of all there is the command to show all pool properties. This is useful when having tiered storage in Ceph:

$ ceph osd pool ls detail

pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 190129 lfor 190129 flags hashpspool tiers 29 read_tier 29 write_tier 29 min_write_recency_for_promote 1 stripe_width 0
 removed_snaps [1~3]

pool 29 'rbd_ssd' replicated size 3 min_size 2 crush_ruleset 1 object_hash rjenkins pg_num 350 pgp_num 350 last_change 190129 flags hashpspool,incomplete_clones tier_of 0 cache_mode writeback target_bytes 429496729600 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 600s x4 decay_rate 0 search_last_n 0 min_read_recency_for_promote 2 min_write_recency_for_promote 2 stripe_width 0
 removed_snaps [1~3]

This command is the only command I am aware of that lists all properties of a pool, instead of having to query them one by one with ceph osd pool get.
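
The output is one long line per pool, which is not pleasant to read. A minimal Python sketch to pull the most common properties out of such a line could look like this. It only relies on the line layout shown above; the function name and the selection of keys are my own choice.

```python
import re

# "pool <id> '<name>' <type> <remaining properties>"
POOL_LINE = re.compile(r"^pool (?P<id>\d+) '(?P<name>[^']+)' (?P<type>\w+) (?P<rest>.*)$")


def parse_pool_line(line: str) -> dict:
    """Extract id, name, type and a few numeric properties from one pool line."""
    match = POOL_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a pool line")
    info = {
        "id": int(match.group("id")),
        "name": match.group("name"),
        "type": match.group("type"),
    }
    for key in ("size", "min_size", "pg_num", "pgp_num"):
        # \b keeps "size" from matching inside "min_size"
        found = re.search(rf"\b{key} (\d+)", match.group("rest"))
        if found:
            info[key] = int(found.group(1))
    return info
```

For the first pool in the listing above this returns id 0, name rbd, type replicated, size 3, min_size 2 and 64 for both pg_num and pgp_num.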

Disk usage
Ceph has a plethora of commands to get the data/disk usage:

First of all you can use ceph df detail. It gives detailed information about the disk usage. It also takes specific crush rules into account to display the available capacity. In the example below, two crush rules with different roots have been defined. This allows us to place pools on SSD storage and on SAS disks.

$ ceph df detail


Next, you can get a listing of the disk usage per OSD. The documentation mostly mentions ceph osd tree to list all OSDs and where they are located in the crush tree. With ceph osd df you get a listing of the disk usage of each OSD and the data distribution.

$ ceph osd df

For completeness, a final command is rados df. The output is similar to ceph df detail. I only recently found out about ceph df detail and have been using rados df for a few years already.

$ rados df


Friday at 01:50 our second son Tuur was born. He was born 6 weeks early at 34 weeks, but was already 49cm and 2.8kg. Although it is a preterm birth, everything is currently going according to the best-case scenario the doctors laid out upfront. This scenario does include wires and tubes and staying in an incubator. Gradually he has been losing wires and tubes and today he moved to a normal bed in the NICU.


In 2005 I really liked the album “Tourist” by Athlete. One of the best known songs on this album is “Wires”, which is about the preterm birth of the singer's daughter. This gave the lyrics some meaning but it never really got to me. Until now. Every word in this song is spot on about how it feels to have your newborn child born too early.

OpenStack neutron CLI

The OpenStack neutron CLI allows you to control almost all aspects of neutron that you can control with the REST API. For the more advanced command line operations, it is not always clear how to structure the command line arguments. Examples include clearing an attribute, setting a list of key-values, … You need this to set routes for a router, host routes on a subnet, the gateway of a subnet, … It took me quite some time to figure this out, so this might be helpful as a reference.

In complex network deployments you need to set routes on subnets (these get distributed through DHCP to the VMs) or additional routes on routers, in the host_routes and routes property respectively. Each route consists of a CIDR destination address and a next hop. The syntax for this is the following:
neutron router-update router1 --routes type=dict list=true destination=,nexthop= destination=,nexthop=

This command sets a default route and a route to another router connected to router1. The “magic” here is that you need to specify that it is a list of dictionaries. The CLI tool transforms this to the following JSON:
"routes": [{"nexthop": "", "destination": ""}, {"nexthop": "", "destination": ""}]

If you want to clear these routes you need to use the following command:
neutron router-update router1 --routes action=clear

You can use the action=clear syntax to clear other attributes as well, such as the gateway of a subnet.
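
To illustrate what type=dict list=true does with the key=value arguments, here is a rough Python sketch of that transformation. This is not the actual neutron client code, only my approximation of the behaviour described above, and the function names and the addresses in the usage example are made up.

```python
import json
from typing import Dict, List


def parse_dict_list(args: List[str]) -> List[Dict[str, str]]:
    """Turn ["k1=v1,k2=v2", ...] into a list of dictionaries."""
    result = []
    for arg in args:
        entry = {}
        for pair in arg.split(","):
            # partition splits on the first "=" only
            key, _, value = pair.partition("=")
            entry[key] = value
        result.append(entry)
    return result


def routes_body(args: List[str]) -> str:
    """Build the JSON fragment that holds the routes property."""
    return json.dumps({"routes": parse_dict_list(args)})
```

For example, parse_dict_list(["destination=0.0.0.0/0,nexthop=10.0.0.1"]) returns one dictionary with the keys destination and nexthop, which is the structure the REST API expects for each route.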