Cumulus Linux supports MLAG for dual-connected hosts. Their documentation explains in great detail how to set this up. However, the role of the inter-switch link (ISL) and how it needs to be configured was not entirely clear to me. Both in a setup with traditional bridges and in one with VLAN-aware bridges, my initial configuration contained a mistake. This mistake caused the kernel to handle the ISL bond (called peerlink in the examples in the docs) as a normal link instead of as the ISL.
clagctl did not warn me about it. Actually, the only way to find out is by printing out STP port details using mstpctl.
The docs explain that the ISL is set up by defining an SVI (switch virtual interface) on an available VLAN (4094 in their docs) and adding stanzas that generate the configuration clagd needs to work. This is correct, but a crucial piece of information was lost on me: this setup is only used for clagd to function and to sync LACP bonding states between the switches that are members of a bond. The ISL also needs to be programmed into the ASIC with specific rules, so that it handles single-connected hosts and STP traffic correctly. This only became clear to me after reading this paper. The kernel needs to recognize the ISL bond as the ISL, and that takes two things: first, adding the clag stanzas, and second, adding the untagged interface (not a subinterface) to an untagged bridge. What this means depends on the bridge mode:
In traditional mode you have a bridge per VLAN that exists in the network. For the ISL to work correctly, the ISL bond (peerlink, for example) needs to be added to a bridge as the bare interface, without a subinterface. So for example:
bridge-ports peerlink swp1 swp2 swp3 ...
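To put that line in context, here is a minimal sketch of what the relevant part of /etc/network/interfaces could look like in traditional mode. The bond members (swp49, swp50), the 169.254.255.x addresses, VLAN 100 and the bridge name br100 are assumptions made up for illustration. The clagd-sys-mac is the value that shows up as "clag system mac" in the mstpctl output in this post. Depending on your release, the bond may need additional attributes.

```text
# ISL bond; member ports are an assumption
auto peerlink
iface peerlink
    bond-slaves swp49 swp50

# SVI that clagd uses to talk to its peer (VLAN 4094 as in the docs)
auto peerlink.4094
iface peerlink.4094
    address 169.254.255.1/30
    clagd-peer-ip 169.254.255.2
    clagd-sys-mac 44:38:39:FF:00:01

# Untagged bridge: peerlink itself, NOT a subinterface
auto br0
iface br0
    bridge-ports peerlink swp1 swp2 swp3
    bridge-stp on

# Hypothetical per-VLAN bridge, subinterfaces only
auto br100
iface br100
    bridge-ports peerlink.100 swp1.100 swp2.100
    bridge-stp on
```

The clagd stanzas on peerlink.4094 only get clagd talking to its peer. Going by what I saw, the br0 stanza containing the bare peerlink is the part that gets the bond treated as the ISL.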
When this is configured correctly, you should get similar output from mstpctl:
root@cumulus1:~# mstpctl showportdetail br0 peerlink
br0:peerlink CIST info
  enabled            yes                      role                 Designated
  port id            8.002                    state                forwarding
  external port cost 1382                     admin external cost  0
  internal port cost 1382                     admin internal cost  0
  designated root    1.000.44:38:39:FF:00:01  dsgn external cost   0
  dsgn regional root 1.000.44:38:39:FF:00:01  dsgn internal cost   0
  designated bridge  1.000.44:38:39:FF:00:01  designated port      8.002
  admin edge port    no                       auto edge port       yes
  oper edge port     yes                      topology change ack  no
  point-to-point     yes                      admin point-to-point auto
  restricted role    no                       restricted TCN       no
  port hello time    2                        disputed             no
  bpdu guard port    no                       bpdu guard error     no
  network port       no                       BA inconsistent      no
  Num TX BPDU        3                        Num TX TCN           2
  Num RX BPDU        3                        Num RX TCN           2
  Num Transition FWD 2                        Num Transition BLK   1
  bpdufilter port    no
  clag ISL           yes                      clag ISL Oper UP     yes
  clag role          primary                  clag dual conn mac   00:00:00:00:00:00
  clag remote portID F.FFF                    clag system mac      44:38:39:FF:00:01
clag ISL: yes is crucial here.
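Since nothing else warns about this, it may be worth scripting the check. Below is a minimal sketch of a shell helper that pulls the "clag ISL" value out of the mstpctl output. The function name clag_isl_flag is my own invention, and it assumes the field names look like they do in the output above.

```shell
#!/bin/sh
# Hypothetical helper: reads `mstpctl showportdetail <bridge> <port>`
# output on stdin and prints the value of the "clag ISL" field (yes/no).
# It skips the separate "clag ISL Oper UP" field on the same line.
clag_isl_flag() {
    awk '{
        for (i = 1; i < NF - 1; i++) {
            if ($i == "clag" && $(i+1) == "ISL" && $(i+2) != "Oper") {
                print $(i+2)
                exit
            }
        }
    }'
}

# Usage on a switch (bridge and port names as in this post):
#   mstpctl showportdetail br0 peerlink | clag_isl_flag
```

If this prints "no" for your peerlink, or nothing at all, the bond is most likely not being handled as the ISL.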
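In traditional bridge mode this mistake is easily made (at least in an isolated setup where the Cumulus switches only link to servers) when none of the connected devices has its untagged traffic mapped to untagged traffic on the switches. In that case every bridge contains only subinterfaces, so nothing prompts you to add the bare peerlink interface to a bridge.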
VLAN aware bridges
In VLAN-aware mode, only untagged interfaces are added to a bridge. This makes the mistake harder to make, unless you only add the interfaces of connected devices to the bridge and not peerlink (like I did). Again, there is no warning anywhere, except for “clag ISL: no” in mstpctl.
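For comparison, a minimal sketch of a VLAN-aware bridge stanza with peerlink included. The bridge name, the host ports and the VLAN range are assumptions for illustration. The peerlink and peerlink.4094 stanzas with the clagd settings are still needed next to it.

```text
auto bridge
iface bridge
    bridge-vlan-aware yes
    # peerlink must be listed here next to the host-facing ports
    bridge-ports peerlink swp1 swp2 swp3
    # VLAN range is an assumption
    bridge-vids 100-200
    bridge-stp on
```

Leaving peerlink out of bridge-ports here is exactly the mistake I made. The configuration still loads without complaint, and only mstpctl gives it away.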