RBD Mirror is a feature of Ceph Block Storage (RBD) that enables asynchronous data replication between different Ceph clusters, providing cross-cluster Disaster Recovery (DR). Its core function is to synchronize data in a primary-backup mode, ensuring rapid service takeover by the backup cluster when the primary cluster fails.
| Term | Explanation |
|---|---|
| Primary Cluster | The cluster currently providing storage services. |
| Secondary Cluster | The standby cluster used for backup purposes. |
quay.io/csiaddons/k8s-controller:v0.12.0 -> <registry>/csiaddons/k8s-controller:v0.12.0
quay.io/csiaddons/k8s-sidecar:v0.12.0 -> <registry>/csiaddons/k8s-sidecar:v0.12.0
Primary <-> Secondary)
Enable Mirroring for Primary Cluster's Block Storage Pool
Execute the following command on both Primary and Secondary clusters' Control nodes:
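A minimal sketch of this step, assuming the pool is a Rook-managed CephBlockPool in the `rook-ceph` namespace and that image-mode mirroring is wanted. The helper name `enable_pool_mirroring`, the namespace default, and the mode are assumptions, not values taken from this guide.

```shell
# Sketch only: assumes a Rook-managed CephBlockPool.
# enable_pool_mirroring is a hypothetical helper name.
enable_pool_mirroring() {
  pool="$1"              # <block-pool-name>
  ns="${2:-rook-ceph}"   # assumed namespace of the Ceph cluster
  # Merge-patch the pool spec to turn on mirroring in image mode (assumed mode).
  kubectl -n "$ns" patch cephblockpool "$pool" --type merge \
    -p '{"spec":{"mirroring":{"enabled":true,"mode":"image"}}}'
}

# Usage: enable_pool_mirroring <block-pool-name>
```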
Parameters:
<block-pool-name>: Block storage pool name.
The peer token serves as the critical credential for establishing mirror connections between clusters.
Execute the following command on both Primary and Secondary clusters' Control nodes:
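One way to read the peer token, sketched under the assumption that Rook publishes the bootstrap peer secret name in the pool status and stores the token under the `token` key of that secret. The helper name `get_peer_token` and the `rook-ceph` namespace are assumptions.

```shell
# Sketch only: get_peer_token is a hypothetical helper name.
get_peer_token() {
  pool="$1"              # <block-pool-name>
  ns="${2:-rook-ceph}"   # assumed namespace
  # Look up the name of the bootstrap peer secret recorded in the pool status.
  secret=$(kubectl -n "$ns" get cephblockpool "$pool" \
    -o jsonpath='{.status.info.rbdMirrorBootstrapPeerSecretName}')
  # Secret data is base64-encoded by Kubernetes, so decode it once.
  kubectl -n "$ns" get secret "$secret" -o jsonpath='{.data.token}' | base64 -d
}

# Usage: get_peer_token <block-pool-name>
```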
Create Peer Token Secret in Peer Cluster
Execute the following command on both Primary and Secondary clusters' Control nodes:
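A sketch of creating the peer secret, assuming it carries the two keys `token` and `pool`. The secret name `rbd-mirror-peer-secret`, the helper name, and the namespace are placeholders chosen for illustration.

```shell
# Sketch only: create_peer_secret and the default secret name are hypothetical.
create_peer_secret() {
  token="$1"                                  # <token> from the peer cluster
  pool="$2"                                   # <block-pool-name>
  ns="${3:-rook-ceph}"                        # assumed namespace
  secret_name="${4:-rbd-mirror-peer-secret}"  # hypothetical secret name
  kubectl -n "$ns" create secret generic "$secret_name" \
    --from-literal=token="$token" --from-literal=pool="$pool"
}

# Usage: create_peer_secret <token> <block-pool-name>
```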
Parameters:
<token>: Token obtained from Step 2.
On the Primary cluster, configure this field using the token obtained from the Secondary cluster.
On the Secondary cluster, configure this field using the token obtained from the Primary cluster.
<block-pool-name>: Block storage pool name.
Patch Peer Secret for Block Storage Pool
Execute the following command on both Primary and Secondary clusters' Control nodes:
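A sketch of registering the peer secret on the pool, assuming the CephBlockPool spec accepts peer secrets under `spec.mirroring.peers.secretNames`. The helper name and the default secret name are hypothetical and should match whatever was used when the secret was created.

```shell
# Sketch only: patch_pool_peer_secret is a hypothetical helper name.
patch_pool_peer_secret() {
  pool="$1"                                   # <block-pool-name>
  secret_name="${2:-rbd-mirror-peer-secret}"  # hypothetical secret name
  ns="${3:-rook-ceph}"                        # assumed namespace
  kubectl -n "$ns" patch cephblockpool "$pool" --type merge \
    -p "{\"spec\":{\"mirroring\":{\"peers\":{\"secretNames\":[\"$secret_name\"]}}}}"
}

# Usage: patch_pool_peer_secret <block-pool-name>
```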
Parameters:
<block-pool-name>: Block storage pool name.
Deploy Mirror Daemon
This daemon is responsible for monitoring and managing RBD mirror synchronization processes, including data synchronization and error handling.
Execute the following command on both Primary and Secondary clusters' Control nodes:
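A sketch of deploying the daemon, assuming it is declared through a Rook `CephRBDMirror` resource with a single daemon instance. The resource name `my-rbd-mirror`, the helper name, and the namespace are illustrative assumptions.

```shell
# Sketch only: render_rbd_mirror is a hypothetical helper that prints a manifest.
render_rbd_mirror() {
  ns="${1:-rook-ceph}"   # assumed namespace
  cat <<EOF
apiVersion: ceph.rook.io/v1
kind: CephRBDMirror
metadata:
  name: my-rbd-mirror
  namespace: $ns
spec:
  count: 1
EOF
}

# Usage: render_rbd_mirror | kubectl apply -f -
```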
Verify Mirror Status
Execute the following command on both Primary and Secondary clusters' Control nodes:
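A sketch of the status check, assuming Rook reports the mirroring summary in the pool status. The helper name and the namespace default are assumptions.

```shell
# Sketch only: mirror_status is a hypothetical helper name.
mirror_status() {
  pool="$1"              # <block-pool-name>
  ns="${2:-rook-ceph}"   # assumed namespace
  # Print the mirroring summary that the operator records on the pool.
  kubectl -n "$ns" get cephblockpool "$pool" \
    -o jsonpath='{.status.mirroringStatus.summary}'
}

# Usage: mirror_status <block-pool-name>
```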
Parameters:
<block-pool-name>: Block storage pool name.
This feature enables efficient data replication and synchronization without interrupting primary application operations, enhancing system reliability and availability.
Setup CsiAddons Controller
Execute the following commands on both Primary and Secondary clusters' Control nodes:
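The image mapping listed earlier (quay.io/csiaddons/... -> <registry>/csiaddons/...) can be applied to a controller manifest with a small filter. This is only a sketch: `rewrite_csiaddons_images` is a hypothetical helper, and `setup-controller.yaml` in the usage line stands for a local copy of the CSI-Addons controller manifest, which is assumed to exist.

```shell
# Sketch only: rewrites CSI-Addons image references to the platform registry.
rewrite_csiaddons_images() {
  registry="$1"   # <registry>
  # Replace the quay.io prefix; everything after "csiaddons/" is kept as-is.
  sed "s#quay\.io/csiaddons/#${registry}/csiaddons/#g"
}

# Usage: rewrite_csiaddons_images <registry> < setup-controller.yaml | kubectl apply -f -
```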
Parameters:
<registry>: Registry address of the platform.
Enable CsiAddons sidecar
Execute the following commands on both Primary and Secondary clusters' Control nodes:
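A sketch of enabling the sidecar, assuming the CSI driver is managed by the Rook operator and configured through the `rook-ceph-operator-config` ConfigMap with the `CSI_ENABLE_CSIADDONS` and `ROOK_CSIADDONS_IMAGE` settings. The helper name and namespace are assumptions; the image tag follows the v0.12.0 sidecar image listed earlier.

```shell
# Sketch only: enable_csiaddons_sidecar is a hypothetical helper name.
enable_csiaddons_sidecar() {
  registry="$1"          # <registry>
  ns="${2:-rook-ceph}"   # assumed operator namespace
  # Turn on the CSI-Addons sidecar and point it at the mirrored image.
  kubectl -n "$ns" patch configmap rook-ceph-operator-config --type merge \
    -p "{\"data\":{\"CSI_ENABLE_CSIADDONS\":\"true\",\"ROOK_CSIADDONS_IMAGE\":\"$registry/csiaddons/k8s-sidecar:v0.12.0\"}}"
}

# Usage: enable_csiaddons_sidecar <registry>
```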
Wait for all CSI pods to restart successfully.
Create VolumeReplicationClass
Execute the following commands on both Primary and Secondary clusters' Control nodes:
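A sketch of a VolumeReplicationClass manifest, assuming snapshot-based mirroring, a Rook RBD provisioner named after the operator namespace, and the default `rook-csi-rbd-provisioner` secret. The class name `rbd-volumereplicationclass` and the helper name are illustrative assumptions.

```shell
# Sketch only: render_volume_replication_class is a hypothetical helper.
render_volume_replication_class() {
  interval="$1"          # <scheduling-interval>, for example 1h
  ns="${2:-rook-ceph}"   # assumed operator namespace
  cat <<EOF
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplicationClass
metadata:
  name: rbd-volumereplicationclass
spec:
  provisioner: ${ns}.rbd.csi.ceph.com
  parameters:
    mirroringMode: snapshot
    schedulingInterval: "$interval"
    replication.storage.openshift.io/replication-secret-name: rook-csi-rbd-provisioner
    replication.storage.openshift.io/replication-secret-namespace: $ns
EOF
}

# Usage: render_volume_replication_class 1h | kubectl apply -f -
```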
<scheduling-interval>: Scheduling interval (e.g., schedulingInterval: "1h" indicates execution every 1 hour).
Execute the following command on the Primary cluster's Control node:
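A sketch of a VolumeReplication manifest for one PVC, using the CSI-Addons `replication.storage.openshift.io/v1alpha1` API. The helper name is hypothetical, and the referenced class name `rbd-volumereplicationclass` is an assumption that must match the VolumeReplicationClass actually created.

```shell
# Sketch only: render_volume_replication is a hypothetical helper.
render_volume_replication() {
  vr_name="$1"           # <vr-name>
  namespace="$2"         # <namespace>, same as the PVC namespace
  pvc_name="$3"          # <pvc-name>
  state="${4:-primary}"  # desired replicationState
  cat <<EOF
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplication
metadata:
  name: $vr_name
  namespace: $namespace
spec:
  volumeReplicationClass: rbd-volumereplicationclass
  replicationState: $state
  dataSource:
    apiGroup: ""
    kind: PersistentVolumeClaim
    name: $pvc_name
EOF
}

# Usage: render_volume_replication <vr-name> <namespace> <pvc-name> | kubectl apply -f -
```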
<vr-name>: The name of the VolumeReplication object, recommended to be the same as the PVC name.
<namespace>: The namespace to which the VolumeReplication belongs, which must be the same as the PVC namespace.
<pvc-name>: The name of the PVC for which Mirror needs to be enabled.
Note: After enabling, the RBD image in the Secondary cluster becomes read-only.
Use cases: Datacenter maintenance, technology refresh, disaster avoidance, etc.
Relocation is the process of switching production to a backup facility (normally your recovery site) or vice versa.
For relocation, access to the image on the primary site must be stopped first. The image is then made primary on the secondary cluster so that access can resume there.
Follow the steps below for a planned migration of the workload from the Primary cluster to the Secondary cluster:
Scale down all the application pods which are using the mirrored PVC on the Primary cluster.
Update VolumeReplications for all the PVCs for which mirroring is enabled on the Primary cluster.
Set spec.replicationState to secondary.
Create VolumeReplications for all the PVCs for which mirroring is enabled on the Secondary cluster.
<vr-name>: The name of the VolumeReplication object, recommended to be the same as the PVC name.
<namespace>: The namespace to which the VolumeReplication belongs, which must be the same as the PVC namespace.
<pvc-name>: The name of the PVC for which Mirror needs to be enabled.
Check the VolumeReplication CR status to verify that the image is marked primary on the secondary site.
Once the image is marked as primary, the PVC is ready to use, and the applications can be scaled up to use it.
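The demotion and the status check in the steps above can be sketched as two small helpers. Both helper names are hypothetical. The `.status.state` field is assumed from the CSI-Addons VolumeReplication API; scaling the workload is application-specific and is left out.

```shell
# Sketch only: hypothetical helpers for the planned-migration steps.

# Change spec.replicationState on an existing VolumeReplication.
set_replication_state() {
  vr_name="$1"     # <vr-name>
  namespace="$2"   # <namespace>
  state="$3"       # primary or secondary
  kubectl -n "$namespace" patch volumereplication "$vr_name" --type merge \
    -p "{\"spec\":{\"replicationState\":\"$state\"}}"
}

# Print the replication state currently reported in the CR status.
get_replication_state() {
  vr_name="$1"
  namespace="$2"
  kubectl -n "$namespace" get volumereplication "$vr_name" \
    -o jsonpath='{.status.state}'
}

# On the Primary cluster:   set_replication_state <vr-name> <namespace> secondary
# On the Secondary cluster: get_replication_state <vr-name> <namespace>
```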
Use cases: Natural disasters, power failures, system failures and crashes, etc.
In case of disaster recovery, create the VolumeReplication CR at the Secondary site.
Since the connection to the Primary site is lost, the operator automatically sends a gRPC request down to the driver to forcefully mark the dataSource as primary on the Secondary site.
Create VolumeReplications for all the PVCs for which mirroring is enabled on the Secondary cluster.
<vr-name>: The name of the VolumeReplication object, recommended to be the same as the PVC name.
<namespace>: The namespace to which the VolumeReplication belongs, which must be the same as the PVC namespace.
<pvc-name>: The name of the PVC for which Mirror needs to be enabled.
Check the VolumeReplication CR status to verify that the image is marked primary on the secondary site.
Once the image is marked as primary, the PVC is ready to use, and the applications can be scaled up to use it.
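During a failover it can be convenient to block until the forced promotion has completed instead of polling by hand. The sketch below assumes kubectl 1.23 or newer (for `--for=jsonpath`) and that the CR reports `Primary` in `.status.state`, which is my reading of the CSI-Addons API rather than something stated in this guide. The helper name is hypothetical.

```shell
# Sketch only: wait_for_primary is a hypothetical helper name.
wait_for_primary() {
  vr_name="$1"           # <vr-name>
  namespace="$2"         # <namespace>
  timeout="${3:-300s}"   # how long to wait before giving up
  # Returns non-zero if the state does not become Primary within the timeout.
  kubectl -n "$namespace" wait "volumereplication/$vr_name" \
    --for=jsonpath='{.status.state}'=Primary --timeout="$timeout"
}

# Usage (on the Secondary cluster): wait_for_primary <vr-name> <namespace>
```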
Once the failed cluster is recovered on the primary site and you want to fail back from the secondary site, follow the steps below:
Scale down the running applications (if any) on the primary site. Ensure that all persistent volumes in use by the workload are no longer in use on the primary cluster.
Update the VolumeReplication CR replicationState from primary to secondary on the primary site.
Scale down the applications on the secondary site.
Update the VolumeReplication CR replicationState from primary to secondary on the secondary site.
On the primary site, verify that the VolumeReplication status marks the volume as ready to use.
Once the volume is marked as ready to use, change the replicationState from secondary to primary on the primary site.
Scale up the applications again on the primary site.
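The replication-state changes in the failback steps above can be sketched as follows, assuming both clusters are reachable from one workstation through kubectl contexts. The helper names and context arguments are hypothetical. Scaling the applications and checking that the volume is ready to use remain manual steps between the two calls.

```shell
# Sketch only: hypothetical helpers for the failback sequence.

# Demote the VolumeReplication on both sites (primary site first).
failback_demote_both() {
  vr_name="$1"; namespace="$2"
  primary_ctx="$3"     # kubectl context of the primary site (assumed)
  secondary_ctx="$4"   # kubectl context of the secondary site (assumed)
  demote='{"spec":{"replicationState":"secondary"}}'
  kubectl --context "$primary_ctx" -n "$namespace" \
    patch volumereplication "$vr_name" --type merge -p "$demote" &&
  kubectl --context "$secondary_ctx" -n "$namespace" \
    patch volumereplication "$vr_name" --type merge -p "$demote"
}

# Promote the VolumeReplication on the primary site once the volume is ready.
failback_promote_primary() {
  vr_name="$1"; namespace="$2"; primary_ctx="$3"
  kubectl --context "$primary_ctx" -n "$namespace" \
    patch volumereplication "$vr_name" --type merge \
    -p '{"spec":{"replicationState":"primary"}}'
}

# Usage:
#   failback_demote_both <vr-name> <namespace> <primary-context> <secondary-context>
#   (verify the status on the primary site)
#   failback_promote_primary <vr-name> <namespace> <primary-context>
```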