Upgrading Ceph cluster to v16.2.8

v16.2.7 -> v16.2.8

What we are going to upgrade

We have a small cluster of three nodes. I know it's not much, but we are just starting with Ceph. All three bare-metal nodes were deployed with MAAS and have these resources:

[Screenshot: node hardware resources]

We are using five SSDs as OSDs, with two kept as spares for now, and a single RBD pool for vSphere with replica 3. Because vSphere doesn't support RBD natively, we chose iSCSI as the connector.

iSCSI gateways are spread across the nodes. There are two RBD images, one for a public cloud and another for internal use, with two iSCSI targets. In total, we have 6 ESXi nodes connected.
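
If you want to see a similar layout on your own cluster, the pool, images, and iSCSI services can be listed from cephadm shell. This is just a sketch for orientation; the pool name vsphere below is a placeholder for whatever your RBD pool is actually called.

sudo cephadm shell -- ceph osd pool ls detail   # pool parameters, including the replica size
sudo cephadm shell -- rbd ls vsphere            # RBD images in the pool (pool name is an example)
sudo cephadm shell -- ceph orch ls iscsi        # deployed iSCSI gateway services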

We are in the pilot stage, where we are learning how Ceph works, so we are running only a few VMs within the cluster.

Here is how it looks in the Grafana dashboard:

[Screenshot: Grafana dashboard cluster overview]

Upgrade

All commands are run as the cephadmin user, which has passwordless root access, through cephadm shell.
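
Before starting anything, it's worth confirming that the cluster is healthy and noting which version the daemons are currently running. These are just the standard status commands, nothing specific to this upgrade:

sudo cephadm shell -- ceph -s        # overall cluster status, should be HEALTH_OK
sudo cephadm shell -- ceph orch ps   # all daemons with their current versions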

Run the following command to check whether a new version is available. This command also pulls the new image if one is available, so it won't need to be pulled during the upgrade itself.

sudo cephadm shell -- ceph orch upgrade check --ceph-version 16.2.8

If you want to pull the image on the other nodes beforehand, as I do, run the following command on each of them.

It's about 1.23 GB, so I believe it's better to pull it before starting the upgrade.

sudo docker image pull quay.io/ceph/ceph:v16.2.8
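
To double-check that the image really landed on a node, you can list it afterwards. This assumes Docker is the container engine, as on our nodes; with Podman it would be podman image ls instead.

sudo docker image ls quay.io/ceph/ceph   # the v16.2.8 tag should now be listed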

Run the following command to start the upgrade.

sudo cephadm shell -- ceph orch upgrade start --ceph-version 16.2.8
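
Once the upgrade is running, you can ask the orchestrator for its progress at any time:

sudo cephadm shell -- ceph orch upgrade status   # target version and current state
sudo cephadm shell -- ceph -s                    # shows an upgrade progress bar while it runs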

You can also watch the cephadm log by running the following command:

sudo cephadm shell -- ceph -W cephadm
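
If something looks wrong along the way, the upgrade doesn't have to run to completion; cephadm provides the usual controls to pause, resume, or abandon it:

sudo cephadm shell -- ceph orch upgrade pause    # pause, leaving the cluster partially upgraded
sudo cephadm shell -- ceph orch upgrade resume   # continue a paused upgrade
sudo cephadm shell -- ceph orch upgrade stop     # stop the upgrade entirely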

The upgrade in our environment completed successfully. There were only a few strange log messages during the upgrade, like:

2022-05-17T09:08:43.855151+0000 mgr.ceph-01.ufldue [WRN] Failed to set Dashboard config for iSCSI: dashboard iscsi-gateway-add failed: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1448, in _handle_command
    return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 414, in call
    return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/mgr_module.py", line 450, in check
    return func(*args, **kwargs)
  File "/usr/share/ceph/mgr/dashboard/services/iscsi_cli.py", line 36, in add_iscsi_gateway
    IscsiGatewaysConfig.add_gateway(name, service_url)
  File "/usr/share/ceph/mgr/dashboard/services/iscsi_config.py", line 87, in add_gateway
    config = cls.get_gateways_config()
  File "/usr/share/ceph/mgr/dashboard/services/iscsi_config.py", line 105, in get_gateways_config
    return cls._load_config_from_store()
  File "/usr/share/ceph/mgr/dashboard/services/iscsi_config.py", line 48, in _load_config_from_store
    cls.update_iscsi_config(config)
  File "/usr/share/ceph/mgr/dashboard/services/iscsi_config.py", line 58, in update_iscsi_config
    for gateway_name, gateway_config in config['gateways'].items():
RuntimeError: dictionary changed size during iteration
 retval: -22

Edit: Here is the bug report for the mentioned issue, raised after we reported it.

But it seems all fine. We can see the iSCSI targets on the dashboard and everything works as expected. We also ran a shell script on a VM stored on the Ceph datastore to see whether there were any gaps during the upgrade.

[Screenshot: test script output during the upgrade]

Here is the script:

#!/bin/bash
# Write the current timestamp to a file on the mounted datastore,
# read it back, and repeat every second. The target directory is
# passed as the first argument.
while true; do
    date > "$1"/readwrite.txt
    cat "$1"/readwrite.txt
    echo --
    sleep 1
done
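
To run it, pass the directory that sits on the Ceph-backed disk as the first argument. The file name readwrite.sh and the mount point /mnt/data below are just illustrative:

chmod +x readwrite.sh
./readwrite.sh /mnt/data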

This script prints the current timestamp every second, like:

Tue May 17 09:01:55 UTC 2022
--
Tue May 17 09:01:56 UTC 2022
--
Tue May 17 09:01:57 UTC 2022
--

During our upgrade there were only a few gaps in the timestamps, like:

Tue May 17 09:02:53 UTC 2022
--
Tue May 17 09:02:54 UTC 2022
--
Tue May 17 09:02:58 UTC 2022
--
Tue May 17 09:02:59 UTC 2022
--

and the datastore stayed connected to the iSCSI gateways the whole time without any problems.
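
After the upgrade finishes, a quick way to confirm that every daemon is really on the new release is to look at the version map; all daemon types should report 16.2.8. This is a generic check, not specific to our setup:

sudo cephadm shell -- ceph versions              # version summary per daemon type
sudo cephadm shell -- ceph orch upgrade status   # should report that no upgrade is in progress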

Conclusion

The whole Ceph upgrade with cephadm is fully automated. There are a few health alerts that can arise during the upgrade process, but you can ignore them. The upgrade of this cluster took less than 10 minutes.
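
If you prefer to inspect those alerts rather than just wait them out, the usual health command works during the upgrade as well; which warnings show up depends on the cluster:

sudo cephadm shell -- ceph health detail   # active health warnings with details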
