open source

Scaling Kubernetes ReadWriteMany Ceph Deployments

Christopher Simpson

May 12, 2019 • 10 min read

Run ReadWriteMany volumes on Google Kubernetes engine which allows your Deployments with persistent data to scale.

This allows us to scale a Kubernetes deployment with Persistent Volumes without the deployment getting stuck waiting for a realease on a volume.

This tutotial has a repo which has been pinned as a reference of a working tutotial for Kubernetes Ceph deployments.

Setup a Kubernetes Cluster in Google Console

Warning: The default options for creating a Kubernetes Cluster will not work for Ceph. See below for the settings needed

Begin creating cluster

You can follow the step by step below, or, use the command (you still need to create the project first to get project name):

gcloud beta container --project "<project-name>" \
clusters create "<cluster-name>" \
--zone "us-central1-a" \
--username "admin" \
--cluster-version "1.11.8-gke.6" \
--machine-type "n1-standard-1" \
--image-type "UBUNTU" \
--disk-type "pd-standard" \
--disk-size "100" \
--node-labels preemptive=false \
--metadata disable-legacy-endpoints=true \
--scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
--num-nodes "3" \
--enable-cloud-logging \
--enable-cloud-monitoring \
--no-enable-ip-alias \
--network "projects/<project-name>/global/networks/default" \
--subnetwork "projects/<project-name>/regions/us-central1/subnetworks/default" \
--addons HorizontalPodAutoscaling,HttpLoadBalancing \
--no-enable-autoupgrade \
--no-enable-autorepair

Note: you will likley hit quota limmits on your Google account preventing this from succeeding. You may need to request Google increases the number of ips and nodes allowed on your account. You can do this via the google web console.
The message "Insufficient regional quota to satisfy request: resource IN_USE_ADDRESSES: requires '3.0' and is show '1.0'. .. may appear. In which case, visit your quotas page and request an increase if appropriate.

Selection_059

The image type must be ubuntu because it contains drivers required by Ceph, the default (cos) will not work and you will see a modprobe error when ceph attempts to load.

The dafault CPU and RAM is sufficient for a working demo. You must start with 3 nodes, and each node will have:

1 vCPU
3.75 GB memory
100 GB Standard Persistent disk

Set image type to Ubuntu:
Selection_061

Uncheck (optional?):

Enable auto-upgrade
Enable auto-repair (because both these

Add the following labels:

preemptive=false
disable-legacy-endpoints=true

We use the label preemptive=false as a way to mark these nodes as non preemptive. Understand that the label has no effect (you still need to make sure "Enable pre-emptible nodes" is un-checked)

Add labels:
Selection_062

Press save to confirm your nodes then click 'Availability, networking and additional features:
Selection_063

Choose both the following:

Enable VPC-native
Disable basic autentication and client certificate issuance

Connect to your cluster

After your cluster is ready, connect to it:
Selection_066

gcloud container clusters get-credentials standard-cluster-1 --zone us-central1-a --project <project-name>

Become cluster admin:

kubectl create clusterrolebinding cluster-admin-binding   --clusterrole=cluster-admin   --user=$(gcloud config get-value core/account)

Note: The --user flag of clusterrolebinding is case sensitive. If you see Error from server (Forbidden): error when creating ... must have cluster-admin privileges to use the then check your real user email addres on the "IAM & admin" page in Google Console.

Clone Rook & Ceph for Kubernetes

Get the repo:

https://github.com/chrisjsimpson/rook.git

Checkout to the tag v1.0.0-ceph-GKE:

git checkout v1.0.0-ceph-GKE
cd rook/cluster/examples/kubernetes/ceph/

Deploy the manifests needed: (bases on the Rook Docs)

kubectl create -f common.yaml
kubectl create -f operator.yaml

Verify rook-ceph has come up successfully

Verify the rook-ceph-operator, rook-ceph-agent, and rook-discover pods are in the Running state before proceeding:

watch kubectl -n rook-ceph get pod

Output:
Selection_068

Create the Rook Ceph Cluster

Now that the rook-ceph-operator, rook-ceph-agents and rook-discover pods are up, it's safe to bring up a ceph cluster.

For the cluster to survive reboots, make sure you set the dataDirHostPath property that is valid for your hosts. For more settings, see the documentation on configuring the cluster.

See Rook Docs

Create the cluster:

kubectl create -f cluster.yaml

Whilst the cluster is being created, if you run:

kubectl -n rook-ceph get pod

You will see the beauty of the rook-ceph-operator doing it's thing; ensuring a valid ceph-cluster is brought up with all the components needed for a valid ceph cluster. This includes:

rook-ceph-mon-a, b and c
rook-ceph-mgr-a
rook-ceph-osd-0, 1 and 2

Selection_069

Check out the Ceph Documentation to understand what OSD's, Monitors, Managers and MDSs are.

Verify Ceph status using the toolkit

The toolkit is a helpful pod which contains programs to inspect the status of your ceph cluster status etc. To run it:

kubectl create -f toolbox.yaml
# Wait a litte, then exec into the toolkit
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash

Now you're inside the toolkit pod, run ceph status:

Selection_070
Remember we chose 100GB for each of the nodes, and we have 265GB available for use in our Ceph cluster! (The rest is taken up by the Ubuntu os and other tooling running on the nodes.

Remember to exit out of the toolkit if you're still inside it ;)

Create a Ceph Shared Filesystem on Kubernetes

It's finally time to create a shared filesystem capable of ReadWriteMany on our Kubernetes cluster, and shared it accross multiple pods!

The steps are:

Declare a Filesystem
Reference the Filesystem name in Deployment manifests

We've created a test-filesystem.yaml example for you in the repo:

apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: firstcephfs
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 2
  dataPools:
    - replicated:
        size: 2
  metadataServer:
    activeCount: 1
    activeStandby: true

To apply it:

kubectl create -f test-filesystem.yaml

Jump back into your toolkit, ceph status and you'll soon see some i/o activity on your ceph cluster and the filesystem you just created as active:

kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
watch ceph stauts

Output:
Selection_071

Use your ManyReadWrite filesystem

A Deployment example of using an Nginx deployment, with a PersistantVolume (using the Ceph Filesystem we just created).

The Deployment can be scaled, with the PersistantVolume
Writes to any of the pods are automatical handles by Ceph, so all pods see the same data
Isn't that awesome!?

Create Deployment using the Ceph Filesystem

This section of the tutorial is inspired by, SysEleven's excellent example on the topic.

First create a test namespace for this deployment:

kubectl create namespace read-write-many-tutorial

Now deploy the following manifest which defines a Deployment

An Nginx pod with
two replicates
one volume, which uses Ceph (the fsName must match the filesystem name we created earlier)
Although not explicitly written, this volume is ReadWriteMany (steps verify test this are below)

test-multi-write.yaml

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: read-write-many-tutorial
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: read-write-many
    spec:
      volumes:
      - name: nginx-data
        flexVolume:
          driver: ceph.rook.io/rook
          fsType: ceph
          options:
            fsName: firstcephfs
            clusterNamespace: rook-ceph
      containers:
      - image: nginx:latest
        name: nginx
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: "/data"
          name: nginx-data
          readOnly: false

Apply the Deployment:

kubectl apply -f test-multi-write.yaml
namespace/read-write-many-tutorial created

Verify both nginx replica pods are running:

kubectl get pods --namespace read-write-many-tutorial
NAME                               READY   STATUS    RESTARTS   AGE
nginx-deployment-df5b48897-cf5rf   1/1     Running   0          34s
nginx-deployment-df5b48897-w8hs5   1/1     Running   0          33s

Kubernetes writemany

Now the fun starts! We're going to write to one pod, and see the data also be accessible on the other pod. Yes, this is kubernetes persistent volume with multiple pods write.

Wrote some data to the volume from pod one:

kubectl exec --namespace read-write-many-tutorial $(kubectl get pods --namespace read-write-many-tutorial -o jsonpath='{.items[0].metadata.name}') -i -t -- bash -c 'echo "Inconceivable!" > /data/test.txt'

Now, from pod two, you will see the same data also accesible from the pod two:

kubectl exec --namespace read-write-many-tutorial $(kubectl get pods --namespace read-write-many-tutorial -o jsonpath='{.items[1].metadata.name}') -i -t -- bash -c 'cat /data/test.txt'

Output:

Inconceivable!

Scale the deployment

Now the real test. Can we scale the deployment sucessfully and read the same data from any pod in the deployment?

Scale the deployment:

kubectl scale deployment.v1.apps/nginx-deployment --replicas=5 --namespace=read-write-many-tutorial

Eventually you will see the 5 replicas:
Selection_072

Verify we can access the same data from any of these pods:

kubectl exec --namespace read-write-many-tutorial $(kubectl get pods --namespace read-write-many-tutorial -o jsonpath='{.items[3].metadata.name}') -i -t -- bash -c 'cat /data/test.txt'

Output:

Inconceivable!

Congratulations! You've completed a ManyReadWrite volume and rolled out a deployment using a Shared filesystem.

If you want to, create a service for the deployment, and view it in your browser using port-forward on your localhost machine:

kubectl apply -f test-service.yaml
kubectl port-forward service/nginx-read-write-many 8081:80 -n read-write-many-tutorial

Then visit: http://127.0.0.1:8081/ to view the service.
For example, you might want to edit the deployment and change the mountPath to "/usr/share/nginx/html". This is a really quick way to visualise the writed on the homepage for example, create an index.html in that directiory , edit it, and see the changes reflected regardless of pod being accessed. To do this, simply:

This post was copied from Karma Computing. For the source, see https://blog.karmacomputing.co.uk

Tear down - how to delete

Remove all resources created, then delete ceph deployment:

kubectl delete -f test-multi-write.yaml 
kubectl delete -f test-service.yaml 
kubectl delete -f test-filesystem.yaml 
kubectl -n rook-ceph delete cephcluster rook-ceph
kubectl -n rook-ceph get cephcluster
kubectl delete -f operator.yaml

Now delete the left over data on the nodes:

Get list of nodes:

gcloud compute instances list

SSH into each node like it's the 90s:

gcloud compute ssh <node-name> --zone us-central1-a # your zone may be different

Then for each of your nodes listed part of your project, remove /var/lib/rook:

sudo rm -rf /var/lib/rook

You should now have an empty kubernetes cluster agin. If you get stuck, and you see errors relating to "cannot delete x becaue namespace y is still terminating" you probably still have some dangling resources left behind. See https://rook.io/docs/rook/v1.0/ceph-teardown.html for more information or just delete the whole cluster in Google Console.

Multiple Filesystems with Ceph on Kubernetes

BETA Feature.

To enable multiple ceph filesystems to be created, edit operator.yaml and set ROOK_ALLOW_MULTIPLE_FILESYSTEMS to "true".

Delete your cluster, and start over, this time with your ROOK_ALLOW_MULTIPLE_FILESYSTEMS set. Everytime you create a new filesystem, for example below, give is a new name. e.g.:

Example seccond filesystem:

apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: seccondcephfs
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 2
  dataPools:
    - replicated:
        size: 2
  metadataServer:
    activeCount: 1
    activeStandby: true

Selection_073

Dashboard

You do not need the dashboard for any of the tutotrial, but it's pretty and very user friendly. If you want to see it:

On your local machine open a terminal and issue:

kubectl port-forward service/rook-ceph-mgr-dashboard 8443:84 -n rook-cep

Now visit: https://127.0.0.1:8443/

Get your password:

kubectl -n rook-ceph get secret rook-ceph-dashboard-password -o jsonpath="{['data']['password']}" | base64 --decode && echo

Username is 'admin', use the output from the above command as your password.

Internal dashboard pages work fine:
Selection_074

The homepage of the dashboard for v1 of this repo has a but which trows 500 erros, but you can navigate to the other pages just fine.

There is a work-around (which might cause more problems than its worth) to fix the dashboard homepage until its fixed: Thanks to noahdesu: https://github.com/rook/rook/issues/3106#issuecomment-490433510

To 'fix' the dashboard (at your own risk of causing problems):

kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
# Then issue:
ceph dashboard ac-role-create admin-no-iscsi
# Then: 
for scope in dashboard-settings log rgw prometheus grafana nfs-ganesha manager hosts rbd-image config-opt rbd-mirroring cephfs user osd pool monitor; do
    ceph dashboard ac-role-add-scope-perms admin-no-iscsi ${scope} create delete read update;
done
# Finally:
ceph dashboard ac-user-set-roles admin admin-no-iscsi

Ta-da! Stunning Ceph Dashboard running on Kubernetes with two filesystems. Just Look at that throughput..
Selection_075

Have fun!