We start with Ceph

Reliable storage service out of unreliable components with no single point of failure.

What is Ceph?

As the official page says:

Ceph is highly reliable, easy to manage, and free. Ceph delivers extraordinary scalability: thousands of clients accessing exabytes of data.

But what is Ceph, really?

Ceph is reliable and scalable storage designed for any organisation. Yes, it really is, and what is even better, it's open source, so we don't even need to worry about vendor lock-in.

We can find a lot of posts about how to use Ceph in production, but most of them cover either large deployments or development setups. In this series I will try to guide you through starting with Ceph in production, using lessons learned over the past year.

Can we start with just a small cluster and expand it later, if ever?

The answer is yes, we can!

Of course, we need to start with 3 nodes, which is the minimum for running a Ceph cluster in production. Maybe you are asking why 3 nodes. The answer is simple: we want to achieve high availability, and 3 is the smallest number of nodes at which a majority survives the failure of any one of them. In modern distributed systems, 3 is the magic number.
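To make the "magic number" concrete, here is a minimal sketch of the majority arithmetic. It assumes only that the cluster's monitors need a strict majority to form a quorum; the function names are made up for this illustration and are not part of any Ceph tool.

```python
def quorum_size(monitors: int) -> int:
    """Smallest strict majority of the given number of monitors."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can fail while a majority is still reachable."""
    return monitors - quorum_size(monitors)


for n in (1, 2, 3, 4, 5):
    print(f"{n} monitors: quorum of {quorum_size(n)}, "
          f"can lose {tolerated_failures(n)}")
```

Two monitors tolerate no failures at all, so 3 is the first count that survives losing one. A fourth monitor adds nothing over three, which is why odd counts are the usual choice.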

What makes Ceph robust?

Essentially, Ceph provides object, block and file storage in a single, horizontally and vertically scalable cluster, with no single point of failure. A Ceph storage cluster can be easily scaled over time. It also has many enterprise features including snapshots, thin provisioning, tiering and self-healing capabilities.

Reliability

Ceph builds a reliable storage service out of unreliable components, with no single point of failure. Data durability comes from replication or erasure coding, and rolling upgrades and maintenance tasks do not interrupt service.
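Replication and erasure coding trade raw capacity for durability in different ways. The sketch below is back-of-the-envelope arithmetic only: the 120 TB figure and the 4+2 profile are hypothetical examples, and the functions ignore the free-space headroom a real cluster needs.

```python
def replicated_usable(raw_tb: float, size: int = 3) -> float:
    """Usable capacity when every object is stored `size` times."""
    return raw_tb / size


def erasure_coded_usable(raw_tb: float, k: int, m: int) -> float:
    """Usable capacity when objects are split into k data and m coding chunks."""
    return raw_tb * k / (k + m)


raw = 120.0  # hypothetical raw capacity in TB
print(f"3x replication: {replicated_usable(raw):.0f} TB usable")
print(f"erasure coding 4+2: {erasure_coded_usable(raw, k=4, m=2):.0f} TB usable")
```

With these example numbers, three-way replication leaves a third of the raw capacity usable and a 4+2 profile leaves two thirds. Keep in mind that an erasure-coded pool spreads its k+m chunks across separate failure domains, so it generally needs more nodes than a small starter cluster has.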

Scalability

Storage clusters may grow or shrink. You can add or remove hardware while the system is online and under load.

Vertically

  • You can add more CPU, memory and hard drives over time to expand storage capacity and performance.

Horizontally

  • You can add more nodes over time to expand the cluster.

Sometimes, less is not more

If your goal is to get more for less money, then naturally you might gravitate towards the servers that can take the largest number of disks, and the biggest disks you can get.

But this is not the Ceph way!

It is better to get more nodes with fewer hard drives each.

A few considerations

Each node holds a larger percentage of your cluster’s data

In a 3 node cluster, each node holds 33% of your data. In a 5 node cluster, it’s only 20% per node. Loss of a single node in a small cluster will result in substantially more data migration, particularly as the cluster starts to fill.
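The effect of cluster size on a node failure can be sketched with simple arithmetic. This is an illustration only: it assumes data is spread evenly and that the surviving nodes are allowed to take over the lost copies, which depends on the pool's replication settings and failure domain. The 60% fill level is a made-up example.

```python
def share_per_node(nodes: int) -> float:
    """Fraction of the cluster's data held by each node, assuming an even spread."""
    return 1.0 / nodes


def fill_after_node_loss(nodes: int, fill_before: float) -> float:
    """Fill level of the surviving nodes once the lost node's data is rebuilt on them."""
    if nodes < 2:
        raise ValueError("need at least two nodes to survive a loss")
    return fill_before * nodes / (nodes - 1)


for nodes in (3, 5):
    after = fill_after_node_loss(nodes, fill_before=0.60)
    print(f"{nodes} nodes: {share_per_node(nodes):.0%} of data per node, "
          f"survivors go from 60% to {after:.0%} full")
```

Under these assumptions the survivors of a 3 node cluster end up about 90% full, while a 5 node cluster lands at about 75%, which is why small clusters need more free headroom per node.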

For high disk counts per node, the disk controller may be a bottleneck

When the controllers reach 100% utilisation, they become a performance bottleneck because they don't have sufficient bandwidth to run all of the disks at full speed. Once this happens, adding disk shelves does not improve performance; it only adds capacity.
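A rough way to see where the controller becomes the limit is to compare the combined disk throughput with what the controller can carry. The figures below (250 MB/s per disk, 4000 MB/s per controller) are hypothetical placeholders; substitute the numbers from your own hardware's data sheets.

```python
def effective_throughput(disks: int, per_disk_mbps: float,
                         controller_mbps: float) -> float:
    """Aggregate throughput in MB/s, capped by the controller's bandwidth."""
    return min(disks * per_disk_mbps, controller_mbps)


PER_DISK = 250.0     # hypothetical sequential throughput per disk, MB/s
CONTROLLER = 4000.0  # hypothetical controller bandwidth, MB/s

for disks in (8, 12, 16, 24):
    total = effective_throughput(disks, PER_DISK, CONTROLLER)
    limited = "controller-bound" if disks * PER_DISK > CONTROLLER else "disk-bound"
    print(f"{disks} disks: {total:.0f} MB/s ({limited})")
```

With these example values, anything beyond 16 disks adds capacity but no throughput.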

Increased recovery and healing time

With the loss of an entire node, more data needs to be replicated and recovered.

Forcing more network traffic across fewer nodes

Ceph is a network-based storage system, so with fewer nodes you’re forcing a lot of client traffic over fewer NICs. This becomes worse as the number of disks per node increases, as you have more hard drives competing for limited bandwidth.

Here you can find a comparison of a 3 node vs 5 node cluster.

You can start with 3 nodes, but if you can afford to spread your disks across more nodes, it’s better to do so.

Hardware

CPU

Regardless of the core architecture, one rule is always true: a higher clock speed allows more work to be done in the same amount of time. This consideration matters most when working with higher-speed storage and network devices.

CPU selection is also an important consideration for specific services. Some services, such as metadata servers, NFS, Samba, and iSCSI gateways, benefit from a smaller number of much faster cores, while the OSD nodes need a more core-dense solution.

A second consideration is whether to use a single socket or multiple sockets. The answer to this will depend on the device density, type of network hardware being utilized, etc. In many nodes, a single socket will provide better performance as the processor interlink is a bottleneck, though this would most likely be noticed in an all NVMe based node type. The general recommendation is to use a single socket whenever possible.

Memory

Ceph is a heavy user of RAM, thus the more memory bandwidth available, the more performant the node can be. Even if the memory selected is fast, if all memory channels are not leveraged, performance is being left untapped. It is advantageous to ensure that RAM is distributed evenly across all channels.

Networking

Aim for 10GbE for production, at minimum, and better still look to have multiple 10Gb interfaces bonded with LACP for increased bandwidth and availability. Usage of VLANs on LACP-bonded Ethernet provides the best balance of bandwidth aggregation and fault tolerance. Network bandwidth should be at least the total bandwidth of all storage devices present in the storage node.
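The rule that network bandwidth should match the total device bandwidth is easy to check with a little arithmetic. The sketch below uses hypothetical figures (8 SSDs at 500 MB/s each) and treats bonded links as a simple sum, which is optimistic because LACP balances per flow rather than per packet.

```python
def required_gbps(devices: int, per_device_mbps: float) -> float:
    """Total storage device throughput, converted from MB/s to Gb/s."""
    return devices * per_device_mbps * 8 / 1000


def network_is_sufficient(links: int, link_gbps: float,
                          devices: int, per_device_mbps: float) -> bool:
    """True if the bonded links can carry every device at full speed."""
    return links * link_gbps >= required_gbps(devices, per_device_mbps)


need = required_gbps(devices=8, per_device_mbps=500.0)
print(f"8 SSDs at 500 MB/s need about {need:.0f} Gb/s")
print("2 x 10GbE sufficient:", network_is_sufficient(2, 10.0, 8, 500.0))
print("2 x 25GbE sufficient:", network_is_sufficient(2, 25.0, 8, 500.0))
```

In this example even two bonded 10Gb interfaces fall short of what the disks can deliver, which is a common reason to move to faster links as disk counts grow.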

Ceph is a network-based storage system, so one thing the cluster should not lack is network bandwidth. Always separate your public-facing network from your internal cluster network. The public network carries client traffic, while the internal network carries heartbeat, replication and recovery traffic. For redundancy, make sure you cable your nodes to redundant top-of-rack switches.
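As a small sketch of the public/cluster split, the helper below checks that two subnets do not overlap and renders them as the `public_network` and `cluster_network` settings for the `[global]` section of `ceph.conf`. The helper name and the example subnets are made up for this illustration; how you actually apply the settings depends on your deployment tooling.

```python
import ipaddress


def ceph_network_settings(public_cidr: str, cluster_cidr: str) -> str:
    """Render a [global] fragment after checking the two subnets are distinct."""
    public = ipaddress.ip_network(public_cidr)
    cluster = ipaddress.ip_network(cluster_cidr)
    if public.overlaps(cluster):
        raise ValueError(f"{public} and {cluster} overlap; use separate subnets")
    return (
        "[global]\n"
        f"public_network = {public}\n"
        f"cluster_network = {cluster}\n"
    )


# Hypothetical subnets: one for client traffic, one for replication and recovery.
print(ceph_network_settings("192.168.10.0/24", "192.168.20.0/24"))
```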

Storage devices

Storage device selection can dramatically affect the performance and reliability of a Ceph cluster. Your data should be on the fastest media you can get: SSD at minimum, and NVMe if possible. The recommendation is to make sure that systems use enterprise-class storage media.

It is also important to understand the impact of the storage bus and the hardware along the way. Clearly, 6 Gb/s is slower than 12 Gb/s, and 12 Gb/s is slower than a PCIe Gen3 NVMe device (8 GT/s per lane, typically across four lanes), but what about mixing SATA 3 Gb/s and SATA 6 Gb/s, or mixing 6 Gb/s and 12 Gb/s SAS?

The general rule is not to mix.
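The headline rates of the different buses are not directly comparable, because each one spends part of its line rate on encoding. A minimal sketch of the conversion, assuming the standard encodings (8b/10b for SATA 6 Gb/s and SAS 12 Gb/s, 128b/130b for PCIe Gen3) and ignoring protocol overhead above the link layer:

```python
def usable_mb_per_s(line_rate_gbps: float, payload_bits: int,
                    encoded_bits: int, lanes: int = 1) -> float:
    """Theoretical payload bandwidth in MB/s after line-encoding overhead."""
    return line_rate_gbps * 1000 / 8 * payload_bits / encoded_bits * lanes


buses = {
    "SATA 6 Gb/s": usable_mb_per_s(6, 8, 10),
    "SAS 12 Gb/s": usable_mb_per_s(12, 8, 10),
    "PCIe Gen3 x1": usable_mb_per_s(8, 128, 130),
    "PCIe Gen3 x4": usable_mb_per_s(8, 128, 130, lanes=4),
}

for name, mbps in buses.items():
    print(f"{name}: about {mbps:.0f} MB/s")
```

A single PCIe Gen3 lane carries a little less than a 12 Gb/s SAS link, but an NVMe drive normally uses four lanes, which is where its advantage comes from.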

To summarise

Ceph is an open source, software-defined storage system. It provides unified storage that is highly scalable and has no single point of failure. Because it is open, scalable and distributed, Ceph is becoming one of the best storage solutions for cloud computing technologies, but it is also a great choice for storing your data in a filesystem. In this blog post we covered some of the things to consider before you start using Ceph in production.

Support Jozef Rebjak by becoming a sponsor. Any amount is appreciated!