How to deploy ceph with OpenStack Ocata

Vikram Hosakote · 10-04-2017

Ceph (see the Ceph homepage) is a great way to deploy persistent storage with OpenStack. Ceph can be used as the persistent storage backend for:

Volumes of nova VMs, through OpenStack Cinder (openstack/cinder on GitHub, the OpenStack Block Storage service).
Glance images.

Without a persistent backend such as ceph, the storage of a nova VM is ephemeral: it is deleted when we delete the VM. Hence, ceph is great for:

Persisting the data in the volume of a nova VM even after we delete the nova VM.
Live migration of a nova VM to a different compute host.
Data backup and replication of the data in a nova VM and glance images.
Data reliability in OpenStack (data will still be available if a nova VM crashes).

Here are some useful links about live migration in OpenStack using ceph:

https://docs.openstack.org/nova/pike/admin/configuring-migrations.html

https://docs.openstack.org/ha-guide/storage-ha-backend.html

https://docs.openstack.org/arch-design/design-storage/design-storage-concepts.html

This blog deploys ceph 0.94 (Hammer stable) with OpenStack Ocata. See "Ceph Releases" in the Ceph Documentation for the release history.

Below are the configurations needed for ceph, glance, cinder and nova in OpenStack Ocata:

On the OpenStack controller host:

/etc/ceph/ceph.conf:

[global]
osd pool default crush rule = 0
osd pool default size = 3
public_network = 172.18.0.0/16
err to syslog = true
mon host = 172.18.7.160,172.18.7.161,172.18.7.162
auth cluster required = none
osd pool default min size = 0 # 0 means no specific default; ceph will use (pool_default_size)-(pool_default_size/2) so 2 if pool_default_size=3
osd pool default pgp num = 128
auth service required = none
mon client hunt interval = 40
log to syslog = true
auth supported = none
auth client required = none
clog to syslog = true
mon_initial_members = controller1,controller2,controller3
cluster_network = 172.18.0.0/16
log file = /dev/null
max open files = 16229437
fsid = c29e5118-e0d3-c1c1-1d3d-fc32b8f011c5
osd pool default pg num = 128

[client]
rbd default map options = rw
rbd cache writethrough until flush = true
log file = /var/log/rbd-clients/qemu-guest-$pid.log # must be writable by QEMU and allowed by SELinux or AppArmor
rbd default features = 3 # sum of the feature bits: layering (1) + striping v2 (2)
admin socket = /var/run/ceph/rbd-clients/$cluster-$type.$id.$pid.$cctid.asok # must be writable by QEMU and allowed by SELinux or AppArmor
rbd concurrent management ops = 20
rbd default format = 2
rbd cache = true

[mon.controller1]
host = controller1

[mon.controller2]
host = controller2

[mon.controller3]
host = controller3

[mon]
mon osd full ratio = 0.95
mon clock drift warn backoff = 30
mon cluster log file = /dev/null
mon lease = 20
mon osd min down reporters = 7 # number of OSDs per host + 1
mon cluster log to syslog = true
mon osd nearfull ratio = 0.9
mon pg warn max object skew = 10 # set to 20 or higher to disable complaints about number of PGs being too low if some pools have very few objects bringing down the average number of objects per pool. This happens when running RadosGW. Ceph default is 10
mon osd down out interval = 600
mon pg warn max per osd = 0 # a non-positive value disables the warning about too many PGs per OSD
mon clock drift allowed = 0.15
mon lease ack timeout = 40
mon lease renew interval = 12
mon osd report timeout = 300
mon osd allow primary affinity = true
mon accept timeout = 40

[osd]
osd scrub sleep = 0.1
osd recovery threads = 1
osd scrub load threshold = 10.0
osd heartbeat grace = 30
filestore op threads = 2
osd scrub begin hour = 0
osd mon heartbeat interval = 30
osd disk thread ioprio priority = 7
osd mount options xfs = noatime,largeio,inode64,swalloc
osd max backfills = 1
osd objectstore = filestore
osd op threads = 2
osd scrub end hour = 24
osd max scrubs = 1
filestore merge threshold = 40
osd recovery max chunk = 1048576
osd mkfs options xfs = -f -i size=2048
osd recovery max active = 1
osd scrub chunk max = 5
osd deep scrub stride = 1048576
osd disk thread ioprio class = idle
filestore max sync interval = 5
filestore split multiple = 8
osd crush update on start = true
osd recovery op priority = 2
osd deep scrub interval = 2419200
osd mkfs type = xfs
osd journal size = 4096
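The comment on "osd pool default min size" above explains what ceph does when the option is left at 0. Here is a minimal sketch of that arithmetic, following the formula in the comment. The function name is only illustrative and is not a ceph API.

```python
def effective_min_size(pool_size, configured_min_size=0):
    """Return the min_size a new pool would get, per the ceph.conf comment."""
    if configured_min_size > 0:
        # An explicit non-zero value is used as-is.
        return configured_min_size
    # 0 means "no specific default": size - size/2 using integer division.
    return pool_size - pool_size // 2


print(effective_min_size(3))  # 2
```

With "osd pool default size = 3" this gives 2, which agrees with the "min_size 2" reported for each pool by "ceph osd pool ls detail" later in this blog.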

/etc/ceph/ceph.client.admin.keyring:

[client.admin]
  key = ABCDEMtZYzuLGBAAktFtABCDE/ABCDErOs2Nhg==

/etc/glance/glance-api.conf:

[glance_store]

default_store=rbd
stores=rbd, file, http
rbd_store_ceph_conf=/etc/ceph/ceph.conf
rbd_store_pool=nova-images1
rbd_store_chunk_size=8
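As a quick sanity check of the [glance_store] section above, here is a small sketch that parses it with Python's standard configparser and looks for two common mistakes. The helper name check_glance_store and the two checks are my own assumptions for illustration, not part of glance.

```python
import configparser

# The [glance_store] section from this blog, inlined so the sketch is self-contained.
GLANCE_API_CONF = """
[glance_store]
default_store=rbd
stores=rbd, file, http
rbd_store_ceph_conf=/etc/ceph/ceph.conf
rbd_store_pool=nova-images1
rbd_store_chunk_size=8
"""


def check_glance_store(conf_text):
    """Return a list of problems found in the [glance_store] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    store = parser["glance_store"]
    enabled = [name.strip() for name in store.get("stores", "").split(",")]
    problems = []
    if store.get("default_store") not in enabled:
        problems.append("default_store is not listed in stores")
    if store.get("default_store") == "rbd" and not store.get("rbd_store_pool"):
        problems.append("rbd_store_pool is missing")
    return problems


print(check_glance_store(GLANCE_API_CONF))  # []
```

An empty list means the section is at least self-consistent. It does not prove that the pool exists in the ceph cluster.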

/etc/glance/glance-scrubber.conf:

[glance_store]

default_store=rbd
stores=rbd, file, http
rbd_store_ceph_conf=/etc/ceph/ceph.conf
rbd_store_pool=nova-images1
rbd_store_chunk_size=8

/etc/cinder/cinder.conf:

[DEFAULT]
default_volume_type=ceph

/etc/cinder/cinder-volume.conf:

[DEFAULT]
enabled_backends=ceph-default

[ceph-default]
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_pool=nova-images1
rbd_ceph_conf=/etc/ceph/ceph.conf
rbd_flatten_volume_from_snapshot=false
rbd_max_clone_depth=5
rbd_store_chunk_size=4
volume_backend_name=ceph-default
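The rbd_store_chunk_size option (in megabytes) sets the size of the RADOS objects that an image or volume is split into: 8 for glance and 4 for cinder in the configurations above. A rough sketch of what that means for object counts follows. It assumes a fully written image, so treat the result as an upper bound, and note that rbd_object_count is a hypothetical helper name.

```python
import math


def rbd_object_count(size_bytes, chunk_size_mb):
    """Upper bound on the number of RADOS objects for a fully written image."""
    chunk_bytes = chunk_size_mb * 1024 * 1024
    return math.ceil(size_bytes / chunk_bytes)


one_gib = 1024 ** 3
print(rbd_object_count(one_gib, 8))  # 128 objects with the glance setting
print(rbd_object_count(one_gib, 4))  # 256 objects with the cinder setting
```

Smaller chunks mean more, smaller objects spread across the OSDs. Larger chunks mean fewer objects to track.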

/etc/rabbitmq/rabbitmq-monitor-queues.conf:

autogen.controller.cinder-scheduler 20 5

autogen.cinder-volume.cinder-volume-1.ceph-default 20 5

On the OpenStack compute host:

/etc/ceph/ceph.conf:

[global]
osd pool default crush rule = 0
osd pool default size = 3
public_network = 172.18.0.0/16
err to syslog = true
mon host = 172.18.7.160,172.18.7.161,172.18.7.162
auth cluster required = none
osd pool default min size = 0 # 0 means no specific default; ceph will use (pool_default_size)-(pool_default_size/2) so 2 if pool_default_size=3
osd pool default pgp num = 128
auth service required = none
mon client hunt interval = 40
log to syslog = true
auth supported = none
auth client required = none
clog to syslog = true
mon_initial_members = controller1,controller2,controller3
cluster_network = 172.18.0.0/16
log file = /dev/null
max open files = 16229437
fsid = c29e5118-e0d3-c1c1-1d3d-fc32b8f011c5
osd pool default pg num = 128

[client]
rbd default map options = rw
rbd cache writethrough until flush = true
log file = /var/log/rbd-clients/qemu-guest-$pid.log # must be writable by QEMU and allowed by SELinux or AppArmor
rbd default features = 3 # sum of the feature bits: layering (1) + striping v2 (2)
admin socket = /var/run/ceph/rbd-clients/$cluster-$type.$id.$pid.$cctid.asok # must be writable by QEMU and allowed by SELinux or AppArmor
rbd concurrent management ops = 20
rbd default format = 2
rbd cache = true

[mon.controller1]
host = controller1

[mon.controller2]
host = controller2

[mon.controller3]
host = controller3

[mon]
mon osd full ratio = 0.95
mon clock drift warn backoff = 30
mon cluster log file = /dev/null
mon lease = 20
mon osd min down reporters = 7 # number of OSDs per host + 1
mon cluster log to syslog = true
mon osd nearfull ratio = 0.9
mon pg warn max object skew = 10 # set to 20 or higher to disable complaints about number of PGs being too low if some pools have very few objects bringing down the average number of objects per pool. This happens when running RadosGW. Ceph default is 10
mon osd down out interval = 600
mon pg warn max per osd = 0 # a non-positive value disables the warning about too many PGs per OSD
mon clock drift allowed = 0.15
mon lease ack timeout = 40
mon lease renew interval = 12
mon osd report timeout = 300
mon osd allow primary affinity = true
mon accept timeout = 40

[osd]
osd scrub sleep = 0.1
osd recovery threads = 1
osd scrub load threshold = 10.0
osd heartbeat grace = 30
filestore op threads = 2
osd scrub begin hour = 0
osd mon heartbeat interval = 30
osd disk thread ioprio priority = 7
osd mount options xfs = noatime,largeio,inode64,swalloc
osd max backfills = 1
osd objectstore = filestore
osd op threads = 2
osd scrub end hour = 24
osd max scrubs = 1
filestore merge threshold = 40
osd recovery max chunk = 1048576
osd mkfs options xfs = -f -i size=2048
osd recovery max active = 1
osd scrub chunk max = 5
osd deep scrub stride = 1048576
osd disk thread ioprio class = idle
filestore max sync interval = 5
filestore split multiple = 8
osd crush update on start = true
osd recovery op priority = 2
osd deep scrub interval = 2419200
osd mkfs type = xfs
osd journal size = 4096
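The "rbd default features" value in the [client] section is a bitmask. Here is a small sketch that decodes it, using the standard RBD feature bit values. Some of the higher bits only exist in ceph releases newer than Hammer, and decode_rbd_features is just an illustrative name.

```python
# RBD feature bits. Only some of these are available in ceph 0.94 (Hammer).
RBD_FEATURE_BITS = {
    1: "layering",
    2: "striping",
    4: "exclusive-lock",
    8: "object-map",
    16: "fast-diff",
    32: "deep-flatten",
    64: "journaling",
}


def decode_rbd_features(value):
    """Return the feature names whose bits are set in value."""
    return [name for bit, name in sorted(RBD_FEATURE_BITS.items()) if value & bit]


print(decode_rbd_features(3))  # ['layering', 'striping']
```

So the value 3 used in this blog enables layering (needed for copy-on-write clones of images) and striping v2.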

/etc/nova/nova.conf:

[libvirt]
images_type=rbd
images_rbd_pool=nova-images1
images_rbd_ceph_conf=/etc/ceph/ceph.conf

The OpenStack controller hosts run ceph-mon, glance (glance-api, glance-registry) and cinder (cinder-api, cinder-scheduler) the following way:

ceph-mon:

/usr/bin/ceph-mon -i controller1 --pid-file /var/run/ceph/mon.controller1.pid -c /etc/ceph/ceph.conf --cluster ceph -f

glance (glance-api, glance-registry):

/usr/bin/python2 /usr/bin/glance-api

/usr/bin/python2 /usr/bin/glance-registry

cinder (cinder-api, cinder-scheduler):

/usr/bin/python2 /usr/bin/cinder-api --config-file /etc/cinder/cinder.conf --config-file /etc/cinder/cinder-api.conf

/usr/bin/python2 /usr/bin/cinder-scheduler --config-file /etc/cinder/cinder.conf --config-file /etc/cinder/cinder-scheduler.conf

The OpenStack controller hosts also run the nova services like nova-api, nova-cert, nova-conductor, nova-scheduler, nova-console, nova-consoleauth and nova-novncproxy.

The OpenStack compute hosts run ceph-osd and nova-compute the following way:

ceph-osd:

/usr/bin/ceph-osd -i 1 --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf --cluster ceph -f

/usr/bin/ceph-osd -i 5 --pid-file /var/run/ceph/osd.5.pid -c /etc/ceph/ceph.conf --cluster ceph -f

/usr/bin/ceph-osd -i 9 --pid-file /var/run/ceph/osd.9.pid -c /etc/ceph/ceph.conf --cluster ceph -f

/usr/bin/ceph-osd -i 13 --pid-file /var/run/ceph/osd.13.pid -c /etc/ceph/ceph.conf --cluster ceph -f
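Each compute host runs one ceph-osd process per OSD ID, and the four command lines above differ only in that ID. As a sketch, they can be generated like this. The helper ceph_osd_command is hypothetical; the flags are copied from the commands above.

```python
def ceph_osd_command(osd_id, cluster="ceph", conf="/etc/ceph/ceph.conf"):
    """Build the ceph-osd argument list for one OSD ID, as shown in this blog."""
    return [
        "/usr/bin/ceph-osd",
        "-i", str(osd_id),
        "--pid-file", f"/var/run/ceph/osd.{osd_id}.pid",
        "-c", conf,
        "--cluster", cluster,
        "-f",  # stay in the foreground
    ]


# OSD IDs taken from the compute host in this blog.
for osd_id in (1, 5, 9, 13):
    print(" ".join(ceph_osd_command(osd_id)))
```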

nova-compute:

/usr/bin/python2 /usr/bin/nova-compute --config-file /etc/nova/nova.conf --config-file /etc/nova/nova-compute.conf

There are two ways to deploy ceph in OpenStack:

Ceph-osd on the compute hosts (the way this blog describes).
Ceph-osd on separate storage hosts and not on compute hosts.

If we deploy ceph-osd on the compute hosts, a compute host crash also takes down the ceph OSDs running on it. Also, this approach increases the CPU and memory usage of the compute host, as ceph-osd competes with nova-compute for resources.

If we deploy ceph-osd on separate storage hosts, it is best to use fast 40 Gbps links between the compute hosts and the storage hosts so that ceph traffic moves quickly between them. Refer to "Jumbo Mumbo in OpenStack using Cisco's UCS servers and Nexus 9000" to configure jumbo frames in OpenStack, which increases the throughput of ceph traffic between the compute hosts and the storage hosts. This approach is more expensive, as we need separate storage hosts for ceph-osd.

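Below are the outputs of some useful ceph commands, run on the controller host, to check the health of the ceph cluster in OpenStack. In this sample, ceph reports HEALTH_WARN because of clock skew on mon.controller2 and because controller3 is out of the monitor quorum; all 320 placement groups are active+clean.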

# ceph version

ceph version 0.94.9-9.el7cp (b83334e01379f267fb2f9ce729d74a0a8fa1e92c)

# ceph status

cluster c29e5118-e0d3-c1c1-1d3d-fc32b8f011c5

health HEALTH_WARN

clock skew detected on mon.controller2

1 mons down, quorum 0,1 controller1,controller2

Monitor clock skew detected

monmap e1: 3 mons at {controller1=172.18.7.160:6789/0,controller2=172.18.7.161:6789/0,controller3=172.18.7.162:6789/0}

election epoch 8, quorum 0,1 controller1,controller2

osdmap e172: 16 osds: 16 up, 16 in

pgmap v201041: 320 pgs, 3 pools, 583 MB data, 8218 objects

2746 MB used, 4730 GB / 4733 GB avail

320 active+clean

# ceph osd pool stats

pool rbd id 0

nothing is going on

pool nova-images1 id 1

nothing is going on

pool backups id 2

nothing is going on

# ceph pg dump | head -8

dumped all in format plain

version 201067

stamp 2017-10-05 03:35:55.386523

last_osdmap_epoch 172

last_pg_scan 3

full_ratio 0.95

nearfull_ratio 0.9

pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp

2.7d 0 0 0 0 0 0 0 0 active+clean 2017-10-04 21:54:02.754318 0'0 172:45 [9,8,14] 9 [9,8,14] 9 0'0 2017-10-04 21:54:02.754202 0'0 2017-09-27 02:42:12.189613

# ceph mon dump

dumped monmap epoch 1

epoch 1

fsid c29e5118-e0d3-c1c1-1d3d-fc32b8f011c5

last_changed 0.000000

created 0.000000

0: 172.18.7.160:6789/0 mon.controller1

1: 172.18.7.161:6789/0 mon.controller2

2: 172.18.7.162:6789/0 mon.controller3

# ceph df

GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    4733G     4730G        2753M          0.06
POOLS:
    NAME             ID     USED     %USED     MAX AVAIL     OBJECTS
    rbd              0         0         0         1576G           0
    nova-images1     1      583M      0.04         1576G        8218
    backups          2         0         0         1576G           0
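The MAX AVAIL of 1576G per pool is much smaller than the 4730G of raw space because every object is stored three times. A back-of-the-envelope sketch follows. This is only an approximation: ceph's own calculation also accounts for how data is spread over the OSDs, and approx_usable_gb is an illustrative name.

```python
def approx_usable_gb(raw_avail_gb, replica_size):
    """Approximate usable space of a replicated pool: raw space / replica count."""
    return raw_avail_gb / replica_size


# 4730G raw available and "osd pool default size = 3", both from this blog.
print(int(approx_usable_gb(4730, 3)))  # 1576
```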

# ceph osd pool ls detail

pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0

pool 1 'nova-images1' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 172 flags hashpspool stripe_width 0

removed_snaps [1~6,b~8d]

pool 2 'backups' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 3 flags hashpspool stripe_width 0
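From the pg_num and size values above, we can estimate how many placement group replicas each OSD carries, which is the number that the PG-per-OSD warnings are about. A minimal sketch, assuming the PGs are spread evenly over the 16 OSDs (the function name is my own):

```python
# (pg_num, replica size) per pool, from `ceph osd pool ls detail` in this blog.
POOLS = {
    "rbd": (64, 3),
    "nova-images1": (128, 3),
    "backups": (128, 3),
}


def pg_replicas_per_osd(pools, osd_count):
    """Average number of PG replicas per OSD, assuming an even spread."""
    total_replicas = sum(pg_num * size for pg_num, size in pools.values())
    return total_replicas / osd_count


print(pg_replicas_per_osd(POOLS, 16))  # 60.0
```

The 320 PGs reported by "ceph status" are the sum of the three pg_num values (64 + 128 + 128).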

# ceph -w

cluster c29e5118-e0d3-c1c1-1d3d-fc32b8f011c5

health HEALTH_WARN

clock skew detected on mon.controller2

1 mons down, quorum 0,1 controller1,controller2

Monitor clock skew detected

monmap e1: 3 mons at {controller1=172.18.7.160:6789/0,controller2=172.18.7.161:6789/0,controller3=172.18.7.162:6789/0}

election epoch 8, quorum 0,1 controller1,controller2

osdmap e172: 16 osds: 16 up, 16 in

pgmap v201177: 320 pgs, 3 pools, 664 MB data, 9515 objects

3097 MB used, 4730 GB / 4733 GB avail

320 active+clean

client io 0 B/s wr, 90 op/s

2017-10-05 03:42:07.745506 mon.0 [INF] pgmap v201176: 320 pgs: 320 active+clean; 669 MB data, 3111 MB used, 4730 GB / 4733 GB avail; 377 kB/s wr, 96 op/s

2017-10-05 03:42:08.755441 mon.0 [INF] pgmap v201177: 320 pgs: 320 active+clean; 664 MB data, 3097 MB used, 4730 GB / 4733 GB avail; 0 B/s wr, 90 op/s

# ceph quorum_status

{"election_epoch":8,"quorum":[0,1],"quorum_names":["controller1","controller2"],"quorum_leader_name":"controller1","monmap":{"epoch":1,"fsid":"c29e5118-e0d3-c1c1-1d3d-fc32b8f011c5","modified":"0.000000","created":"0.000000","mons":[{"rank":0,"name":"controller1","addr":"172.18.7.160:6789\/0"},{"rank":1,"name":"controller2","addr":"172.18.7.161:6789\/0"},{"rank":2,"name":"controller3","addr":"172.18.7.162:6789\/0"}]}}
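The JSON from "ceph quorum_status" is easy to check in a script. The sketch below uses a trimmed copy of the output above (only the fields it needs) to list the monitors that are out of quorum and to confirm that a majority is still present. The function name quorum_report is hypothetical.

```python
import json

# Trimmed copy of the `ceph quorum_status` output shown in this blog.
QUORUM_STATUS = """
{"election_epoch": 8,
 "quorum": [0, 1],
 "quorum_names": ["controller1", "controller2"],
 "quorum_leader_name": "controller1",
 "monmap": {"mons": [{"rank": 0, "name": "controller1"},
                     {"rank": 1, "name": "controller2"},
                     {"rank": 2, "name": "controller3"}]}}
"""


def quorum_report(status_json):
    """Return the monitors missing from quorum and whether a majority remains."""
    status = json.loads(status_json)
    all_mons = [mon["name"] for mon in status["monmap"]["mons"]]
    in_quorum = set(status["quorum_names"])
    missing = [name for name in all_mons if name not in in_quorum]
    majority = len(all_mons) // 2 + 1  # 2 of 3 monitors
    return {"missing": missing, "has_quorum": len(in_quorum) >= majority}


print(quorum_report(QUORUM_STATUS))
# {'missing': ['controller3'], 'has_quorum': True}
```

This matches the "1 mons down" line in "ceph status": controller3 is out of quorum, but the remaining two monitors still form a majority.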

# ceph mon_status

{"name":"controller1","rank":0,"state":"leader","election_epoch":8,"quorum":[0,1],"outside_quorum":[],"extra_probe_peers":["172.18.7.161:6789\/0","172.18.7.162:6789\/0"],"sync_provider":[],"monmap":{"epoch":1,"fsid":"c29e5118-e0d3-c1c1-1d3d-fc32b8f011c5","modified":"0.000000","created":"0.000000","mons":[{"rank":0,"name":"controller1","addr":"172.18.7.160:6789\/0"},{"rank":1,"name":"controller2","addr":"172.18.7.161:6789\/0"},{"rank":2,"name":"controller3","addr":"172.18.7.162:6789\/0"}]}}

# ceph auth list | head -5

installed auth entries:

osd.0

key: AQCCF8tZf1O0HxAAXWSfmzMokX5QSPEHr00nvA==

caps: [mon] allow profile osd

caps: [osd] allow *

osd.1

# ceph mds stat

e1: 0/0/0 up

# ceph mon stat

e1: 3 mons at {controller1=172.18.7.160:6789/0,controller2=172.18.7.161:6789/0,controller3=172.18.7.162:6789/0}, election epoch 8, quorum 0,1 controller1,controller2

Refer to "How to attach cinder/ceph XFS volume to a nova instance in OpenStack horizon" for the steps to mount a ceph XFS volume inside a nova VM and read and write data on it.

Hope this blog is useful! Please let me know your comments below!
