Ceph, OpenStack and fstrim

Ceph is a popular storage backend for OpenStack. It allows you to use the same storage cluster for images, ephemeral disks and volumes. Each Ceph RBD image is also thinly provisioned (by default in objects of 4M). Until recently, once storage was allocated, the filesystem in the virtual machine had no way to release it back to Ceph. In recent Ceph and OpenStack releases it is possible to use fstrim inside the virtual machine to release disk blocks. fstrim was not supported in older operating systems such as Ubuntu 14.04, but Ubuntu 16.04 and CentOS 7 both support it.
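For fstrim to reach Ceph, the hypervisor must pass discard requests from the guest through to the RBD image. As a rough sketch (verify the exact option names against your OpenStack release), this means enabling discard in Nova and exposing the disk via virtio-scsi through Glance image properties:

# /etc/nova/nova.conf on the compute nodes
[libvirt]
hw_disk_discard = unmap

# let instances booted from this image use virtio-scsi, which supports discard
# ("centos7" is just a placeholder image name)
$ openstack image set \
    --property hw_scsi_model=virtio-scsi \
    --property hw_disk_bus=scsi \
    centos7

Instances booted after this change get a SCSI disk with discard enabled, so fstrim inside the guest can actually release blocks in Ceph.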

Let's try it:

In the virtual machine, 5.4G is used. Before running df, I ran sync and fstrim /.

vm$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G  5.4G   45G  11% /

The real usage of the VM disk (a diff against the stock CentOS 7 image):

$ rbd diff vms/ae445f4c-38e7-446d-9edd-618533632469_disk | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
7467.23 MB

This is already quite a bit more than what the filesystem reports. This virtual machine has been in use for some time, so probably not all deleted data has been released back to Ceph by the filesystem.

Add a file of 1G and sync it to disk:

vm$ dd if=/dev/zero of=test.img bs=4M count=250
250+0 records in
250+0 records out
1048576000 bytes (1.0 GB) copied, 1.71766 s, 610 MB/s
vm$ sync
vm$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G  6.4G   44G  13% /

The real usage increased by 824 MB. This is not the full 1G, but close:

$ rbd diff vms/ae445f4c-38e7-446d-9edd-618533632469_disk | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
8291.23 MB

When we delete the file, only a few MB are released from Ceph storage:

vm$ rm -f test.img 
vm$ sync
vm$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G  5.4G   45G  11% /

Ceph usage:

$ rbd diff vms/ae445f4c-38e7-446d-9edd-618533632469_disk | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
8283.23 MB

After fstrim:

vm$ fstrim /

Ceph usage:

$ rbd diff vms/ae445f4c-38e7-446d-9edd-618533632469_disk | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
7107.23 MB

This is even less than before we created the 1G test file, probably because fstrim also released additional filesystem blocks that had been freed earlier.
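Running fstrim by hand does not scale to many virtual machines, so it makes sense to schedule it. A minimal sketch is a weekly cron job inside the guest (modern distributions may also ship an fstrim.timer systemd unit you can enable instead):

vm$ cat /etc/cron.d/fstrim
# trim the root filesystem every Sunday night; adjust the fstrim path to your distribution
0 3 * * 0 root /usr/sbin/fstrim /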

Ceph rbd image size

Another Ceph-related task: getting the disk usage of an rbd-based image. There is no simple command to get the real disk usage of an rbd image. That is because in an OpenStack deployment with Ceph, your rbd images are often a snapshot of an OS image or another disk. On the Ceph blog I found the following:

$ rbd diff rbd/tota | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'

rbd/tota is the path to the disk, with rbd the pool name and tota the image name. In an OpenStack deployment you will probably see the pools vms, images and volumes. You can get a listing of the images in each pool with:

$ rbd -p $poolname ls

For example:

$ rbd -p volumes ls
volume-bb85c32b-a003-405d-8029-cf02b0e9eec9
volume-003e8ed8-1e9a-4344-9b97-34ee440a1f35
volume-006c6bed-0407-4334-9bfa-678fc3c2de90
volume-01652b4d-1d70-487e-9b8f-a5bcf6f7fc8f
volume-02489e8f-12fa-4a62-ada5-999fdae7c1f0
volume-036f1366-4df1-4154-8c21-968a29e09db1

In the volumes pool (managed by cinder) the default name is volume-${volume_id}. For virtual machine ephemeral disks (managed by nova) it is ${vm_id}_disk, and for VM images (managed by glance) it is just the ID of the image, without any prefix or suffix.

Getting the real size of an rbd image then looks, for example, like this:

$ rbd diff volumes/volume-a5caf687-bbb0-45ca-a423-7a2ce2f5004d | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
52847.6 MB
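
To get this overview for every image in a pool, a small shell loop over rbd ls is enough (a sketch; walking all images of a large pool can take a while):

$ for img in $(rbd -p volumes ls); do
>   printf '%s: ' "$img"
>   rbd diff volumes/"$img" | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
> done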

Getting pool and OSD information out of Ceph

Whenever I’m doing things with Ceph I seem to forget the exact commands to get details and stats about Ceph pools and OSDs. You have the ceph command with its huge number of subcommands, but you can also use the rados and rbd commands.

Pool properties
First of all, there is a command that shows all properties of a pool. This is especially useful when using tiered storage in Ceph:

$ ceph osd pool ls detail

pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 190129 lfor 190129 flags hashpspool tiers 29 read_tier 29 write_tier 29 min_write_recency_for_promote 1 stripe_width 0
 removed_snaps [1~3]

pool 29 'rbd_ssd' replicated size 3 min_size 2 crush_ruleset 1 object_hash rjenkins pg_num 350 pgp_num 350 last_change 190129 flags hashpspool,incomplete_clones tier_of 0 cache_mode writeback target_bytes 429496729600 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 600s x4 decay_rate 0 search_last_n 0 min_read_recency_for_promote 2 min_write_recency_for_promote 2 stripe_width 0
 removed_snaps [1~3]

This is the only command I am aware of that lists all properties of a pool at once, instead of having to query them one by one with ceph osd pool get.
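For comparison, querying them one by one with ceph osd pool get looks like this (the values shown match the rbd pool from the listing above):

$ ceph osd pool get rbd size
size: 3
$ ceph osd pool get rbd pg_num
pg_num: 64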

Disk usage
Ceph has a plethora of commands to get the data/disk usage:

First of all you can use ceph df detail. It gives detailed information about the disk usage per pool. It also takes the specific CRUSH rules into account when displaying the available space. In the example below, two CRUSH rules with different roots have been defined, which allows us to place pools on SSD storage and on SAS disks.

$ ceph df detail

[screenshot of the ceph df detail output]

Next, you can get a listing of the disk usage per OSD. The documentation mostly mentions ceph osd tree to list all OSDs and where they are located in the CRUSH tree. With ceph osd df you get a listing of the disk usage of each OSD and of the data distribution.

$ ceph osd df
[screenshot of the ceph osd df output]

To be complete, a final command is rados df. Its output is similar to ceph df detail. I only recently found out about ceph df detail; I had been using rados df for a few years already.

$ rados df
[screenshot of the rados df output]