1. Rook Overview

Rook website: https://rook.io

Rook uses Kubernetes primitives to run the Ceph storage system on Kubernetes. The architecture diagram in the Rook documentation illustrates how Ceph Rook integrates with Kubernetes.

With Rook running in the Kubernetes cluster, Kubernetes applications can mount block devices and filesystems managed by Rook, or use the S3/Swift API for object storage. The Rook operator automatically configures the storage components and monitors the cluster to keep the storage available and healthy.
The Rook operator is a simple container with everything needed to bootstrap and monitor the storage cluster. The operator starts and monitors the Ceph monitor pods and the OSD daemons, which provide the basic RADOS storage. It manages the CRDs for pools, object stores (S3/Swift), and filesystems by initializing the pods and other resources needed to run those services.

The operator watches the storage daemons to keep the cluster healthy: Ceph mons are started or failed over when necessary, and other adjustments are made as the cluster grows or shrinks. The operator also watches for desired-state changes requested through the API and applies them.
The Rook operator also creates the Rook agents, pods deployed on every Kubernetes node. Each agent configures a Flexvolume plugin that integrates with the Kubernetes volume controller and handles the storage operations required on the node, such as attaching network storage devices, mounting volumes, and formatting filesystems.

The rook container includes all the Ceph daemons and tools needed to manage and store the data; the data path is unchanged. Rook does not attempt to maintain full fidelity with Ceph: many Ceph concepts, such as placement groups and CRUSH maps, are hidden so that you do not have to worry about them. Instead, Rook offers administrators a simplified experience built around physical resources, pools, volumes, filesystems, and buckets, while advanced configuration can still be applied with the Ceph tools when needed.
Rook is implemented in Go; Ceph is implemented in C++ with a highly optimized data path. We believe this combination offers the best of both worlds.

2. Deploying the Cluster

Official documentation referenced by this article:

Rook official Helm deployment docs: https://www.rook.io/docs/rook/v1.6/helm.html

Rook GitHub repository: https://github.com/rook/rook

2.1 Environment information

Item                 Details
Node spec            8 cores / 32 GB RAM; 4 x 5500 GB HDD (OSD data); 2 x 40 GB SSD (one system disk, one for Ceph metadata); 3 nodes
Kubernetes cluster   Alibaba Cloud ACK v1.18.8, managed, three nodes
helm                 v3.5.2
Rook                 v1.6.7
Ceph                 v16.2.4
Mon daemons          3
Mgr daemons          2
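
Before deploying, it is worth confirming that the disk layout on each storage node matches this table. On my nodes the four OSD data HDDs appear as vdb-vde and the metadata SSD as vdf (device names are environment specific; this check is a convenience, not part of the official procedure):

$ lsblk -o NAME,SIZE,ROTA,TYPE,MOUNTPOINT
# ROTA=1 marks rotational disks (the HDDs used for OSD data), ROTA=0 the SSDs;
# the disks handed to Ceph must be raw: no partitions, no filesystem, no mountpoint.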

2.2 Installing helm

Skip this step if helm is already installed.

Download the helm binary package:

$ wget https://get.helm.sh/helm-v3.5.2-linux-amd64.tar.gz

Extract the archive and move the binary to a directory on the PATH:

$ tar xzvf helm-v3.5.2-linux-amd64.tar.gz 
linux-amd64/
linux-amd64/helm
linux-amd64/LICENSE
linux-amd64/README.md

$ chmod +x linux-amd64/helm
$ mv linux-amd64/helm /usr/bin/

Verify the installation:

$ helm  version 
version.BuildInfo{Version:"v3.5.2", GitCommit:"167aac70832d3a384f65f9745335e9fb40169dc2", GitTreeState:"dirty", GoVersion:"go1.15.7"}

2.3 Deploying the operator

Official documentation: https://www.rook.io/docs/rook/v1.6/helm-operator.html

Create the rook-ceph namespace:

$ kubectl create namespace rook-ceph
$ kubectl get ns |grep rook
rook-ceph Active 9s

Add the Helm repository for the Rook operator:

$ helm repo add rook-release https://charts.rook.io/release
$ helm repo list
NAME URL
rook-release https://charts.rook.io/release

Create the rook-operator.yaml values file for the operator:

$ vim rook-operator.yaml
# ----------------- The content below is what I changed; adjust further against values.yaml for special needs ------------------
csi:
  # CSI plugin image locations. The defaults point at the Google registry, which I replace with my own mirrors.
  # A Docker Hub repository such as putianhui/cephcsi:v3.3.1 (putianhui/imageName:tag) works,
  # and so does an Alibaba Cloud registry prefix, e.g. registry.cn-shanghai.aliyuncs.com/k8s-gle/imageName:tag
  cephcsi:
    image: registry.cn-shanghai.aliyuncs.com/k8s-gle/cephcsi:v3.3.1
  registrar:
    image: registry.cn-shanghai.aliyuncs.com/k8s-gle/csi-node-driver-registrar:v2.2.0
  provisioner:
    image: registry.cn-shanghai.aliyuncs.com/k8s-gle/csi-provisioner:v2.2.2
  snapshotter:
    image: registry.cn-shanghai.aliyuncs.com/k8s-gle/csi-snapshotter:v4.1.1
  attacher:
    image: registry.cn-shanghai.aliyuncs.com/k8s-gle/csi-attacher:v3.2.1
  resizer:
    image: registry.cn-shanghai.aliyuncs.com/k8s-gle/csi-resizer:v1.2.0

enableSelinuxRelabeling: false

# Ceph mon and osd pods need to write to a hostPath. On OpenShift, with its restrictive SELinux permissions,
# pods must run privileged to write to hostPath volumes; in that case this must be set to true.
hostpathRequiresPrivileged: true

Deploy the Rook operator with the rook-operator.yaml values file:

$ helm install -n rook-ceph rook-ceph rook-release/rook-ceph --version v1.6.7 -f rook-operator.yaml

Wait for the operator pod to come up:

# still being created
$ kubectl --namespace rook-ceph get pods
NAME READY STATUS RESTARTS AGE
rook-ceph-operator-fdb564699-rv9cg 0/1 ContainerCreating 0 46s

# started successfully
$ kubectl --namespace rook-ceph get pods
NAME READY STATUS RESTARTS AGE
rook-ceph-operator-fdb564699-rv9cg 1/1 Running 0 89s
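
Optionally, confirm that the chart also installed the Rook CRDs (a quick sanity check, not a required step):

$ kubectl get crd | grep ceph.rook.io
# expect entries such as cephclusters.ceph.rook.io, cephblockpools.ceph.rook.io and cephfilesystems.ceph.rook.io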

Below is my own annotated translation of the operator chart's values.yaml, for reference (the upstream defaults can be dumped with helm show values, as shown right after this file).

image:
prefix: rook
repository: rook/ceph
tag: v1.6.7
pullPolicy: IfNotPresent

crds:
# Whether the helm chart should create and update the CRDs. If false, the CRDs must be
# managed independently with cluster/examples/kubernetes/ceph/crds.yaml.
# **WARNING** Only set this during the first deployment. If disabled later, the cluster may be DESTROYED.
# If the CRDs are then deleted in this case, see the disaster recovery guide to restore them:
# https://rook.github.io/docs/rook/master/ceph-disaster-recovery.html#restoring-crds-after-deletion
enabled: true

resources:
limits:
cpu: 500m
memory: 256Mi
requests:
cpu: 100m
memory: 256Mi

nodeSelector: {}
# Constraint rook-ceph-operator Deployment to nodes with label `disktype: ssd`.
# For more info, see https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector
# disktype: ssd

# Tolerations for the rook-ceph-operator to allow it to run on nodes with particular taints
tolerations: []

# Delay to use in node.kubernetes.io/unreachable toleration
unreachableNodeTolerationSeconds: 5

# Whether rook watches only the current namespace for CRDs or the whole cluster; the default is false, i.e. watch the whole cluster
currentNamespaceOnly: false

## Annotations to be added to pod
annotations: {}

# Log level
logLevel: INFO

# If true, create and use RBAC resources
rbacEnable: true

# If true, create and use PSP resources
##
pspEnable: true

# Settings for whether to disable the drivers or other daemons if they are not
## needed
csi:
enableRbdDriver: true
enableCephfsDriver: true
enableGrpcMetrics: false
enableCephfsSnapshotter: true
enableRBDSnapshotter: true

rbdFSGroupPolicy: "ReadWriteOnceWithFSType"

cephFSFSGroupPolicy: "ReadWriteOnceWithFSType"
enableOMAPGenerator: false

allowUnsupportedVersion: false
# Resource requests/limits for the Ceph CSI RBD provisioner pods
#
# csiRBDProvisionerResource: |
# - name : csi-provisioner
# resource:
# requests:
# memory: 128Mi
# cpu: 100m
# limits:
# memory: 256Mi
# cpu: 200m
# - name : csi-resizer
# resource:
# requests:
# memory: 128Mi
# cpu: 100m
# limits:
# memory: 256Mi
# cpu: 200m
# - name : csi-attacher
# resource:
# requests:
# memory: 128Mi
# cpu: 100m
# limits:
# memory: 256Mi
# cpu: 200m
# - name : csi-snapshotter
# resource:
# requests:
# memory: 128Mi
# cpu: 100m
# limits:
# memory: 256Mi
# cpu: 200m
# - name : csi-rbdplugin
# resource:
# requests:
# memory: 512Mi
# cpu: 250m
# limits:
# memory: 1Gi
# cpu: 500m
# - name : liveness-prometheus
# resource:
# requests:
# memory: 128Mi
# cpu: 50m
# limits:
# memory: 256Mi
# cpu: 100m
#
# Resource requests/limits for the Ceph CSI RBD plugin pods
#
# csiRBDPluginResource: |
# - name : driver-registrar
# resource:
# requests:
# memory: 128Mi
# cpu: 50m
# limits:
# memory: 256Mi
# cpu: 100m
# - name : csi-rbdplugin
# resource:
# requests:
# memory: 512Mi
# cpu: 250m
# limits:
# memory: 1Gi
# cpu: 500m
# - name : liveness-prometheus
# resource:
# requests:
# memory: 128Mi
# cpu: 50m
# limits:
# memory: 256Mi
# cpu: 100m
#
# # Resource requests/limits for the Ceph CSI CephFS provisioner pods
#
# csiCephFSProvisionerResource: |
# - name : csi-provisioner
# resource:
# requests:
# memory: 128Mi
# cpu: 100m
# limits:
# memory: 256Mi
# cpu: 200m
# - name : csi-resizer
# resource:
# requests:
# memory: 128Mi
# cpu: 100m
# limits:
# memory: 256Mi
# cpu: 200m
# - name : csi-attacher
# resource:
# requests:
# memory: 128Mi
# cpu: 100m
# limits:
# memory: 256Mi
# cpu: 200m
# - name : csi-cephfsplugin
# resource:
# requests:
# memory: 512Mi
# cpu: 250m
# limits:
# memory: 1Gi
# cpu: 500m
# - name : liveness-prometheus
# resource:
# requests:
# memory: 128Mi
# cpu: 50m
# limits:
# memory: 256Mi
# cpu: 100m
# (Optional) CEPH CSI CephFS plugin resource requirement list, Put here list of resource
# requests and limits you want to apply for plugin pod
#
# Resource requests/limits for the Ceph CSI CephFS plugin pods
#
# csiCephFSPluginResource: |
# - name : driver-registrar
# resource:
# requests:
# memory: 128Mi
# cpu: 50m
# limits:
# memory: 256Mi
# cpu: 100m
# - name : csi-cephfsplugin
# resource:
# requests:
# memory: 512Mi
# cpu: 250m
# limits:
# memory: 1Gi
# cpu: 500m
# - name : liveness-prometheus
# resource:
# requests:
# memory: 128Mi
# cpu: 50m
# limits:
# memory: 256Mi
# cpu: 100m
#
# Set tolerations and node affinity for the provisioner pods.
# The CSI provisioners are best started on the same nodes as the other Ceph daemons.
#
# provisionerTolerations:
# - key: key
# operator: Exists
# effect: NoSchedule
# provisionerNodeAffinity: key1=value1,value2; key2=value3
#
# Set tolerations and node affinity for the CSI plugin DaemonSets.
# The CSI plugins need to run on every node where a client may need to mount the storage.
#
# pluginTolerations:
# - key: key
# operator: Exists
# effect: NoSchedule
# pluginNodeAffinity: key1=value1,value2; key2=value3
# Ceph CSI metrics ports
#cephfsGrpcMetricsPort: 9091
#cephfsLivenessMetricsPort: 9081
#rbdGrpcMetricsPort: 9090
#rbdLivenessMetricsPort: 9080
#
# Enable the Ceph kernel client on kernels < 4.17. If your kernel does not support CephFS quotas
# you may want to disable this setting. However, this will cause issues during upgrades with the
# FUSE client. See the upgrade guide: https://rook.io/docs/rook/v1.2/ceph-upgrade.html
#
forceCephFSKernelClient: true
# kubelet directory
#kubeletDirPath: /var/lib/kubelet
# CSI plugin image locations. The defaults point at the Google registry, which I replace with my own mirrors.
# A Docker Hub repository such as putianhui/cephcsi:v3.3.1 (putianhui/imageName:tag) works,
# and so does an Alibaba Cloud registry prefix, e.g. registry.cn-shanghai.aliyuncs.com/k8s-gle/imageName:tag
cephcsi:
image: registry.cn-shanghai.aliyuncs.com/k8s-gle/cephcsi:v3.3.1
registrar:
image: registry.cn-shanghai.aliyuncs.com/k8s-gle/csi-node-driver-registrar:v2.2.0
provisioner:
image: registry.cn-shanghai.aliyuncs.com/k8s-gle/csi-provisioner:v2.2.2
snapshotter:
image: registry.cn-shanghai.aliyuncs.com/k8s-gle/csi-snapshotter:v4.1.1
attacher:
image: registry.cn-shanghai.aliyuncs.com/k8s-gle/csi-attacher:v3.2.1
resizer:
image: registry.cn-shanghai.aliyuncs.com/k8s-gle/csi-resizer:v1.2.0
# Add custom labels to the CSI CephFS Deployments and DaemonSets
#cephfsPodLabels: "key1=value1,key2=value2"
# Add custom labels to the CSI RBD Deployments and DaemonSets
#rbdPodLabels: "key1=value1,key2=value2"
# Enable the volume replication controller
volumeReplication:
enabled: false
#image: "quay.io/csiaddons/volumereplication-operator:v0.1.0"

enableFlexDriver: false
enableDiscoveryDaemon: false

# Allow multiple Ceph filesystems in the same cluster; defaults to false
# WARNING: Experimental feature in Ceph Releases Octopus (v15) and Nautilus (v14)
# https://docs.ceph.com/en/octopus/cephfs/experimental-features/#multiple-file-systems-within-a-ceph-cluster
allowMultipleFilesystems: false

## If true, run the rook operator on the host network
# useOperatorHostNetwork: true

## Tolerations and node affinity for the Rook Agent
## toleration: NoSchedule, PreferNoSchedule or NoExecute
## tolerationKey: Set this to the specific key of the taint to tolerate
## tolerations: Array of tolerations in YAML format which will be added to agent deployment
## nodeAffinity: Set to labels of the node to match
## flexVolumeDirPath: The path where the Rook agent discovers the flex volume plugins
## libModulesDirPath: The path where the Rook agent can find kernel modules
# agent:
# toleration: NoSchedule
# tolerationKey: key
# tolerations:
# - key: key
# operator: Exists
# effect: NoSchedule
# nodeAffinity: key1=value1,value2; key2=value3
# mountSecurityMode: Any
## For information on FlexVolume path, please refer to https://rook.io/docs/rook/master/flexvolume.html
# flexVolumeDirPath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
# libModulesDirPath: /lib/modules
# mounts: mount1=/host/path:/container/path,/host/path2:/container/path2

## Tolerations and node affinity for Rook Discover
## toleration: NoSchedule, PreferNoSchedule or NoExecute
## tolerationKey: Set this to the specific key of the taint to tolerate
## tolerations: Array of tolerations in YAML format which will be added to agent deployment
## nodeAffinity: Set to labels of the node to match
# discover:
# toleration: NoSchedule
# tolerationKey: key
# tolerations:
# - key: key
# operator: Exists
# effect: NoSchedule
# nodeAffinity: key1=value1,value2; key2=value3
# podLabels: "key1=value1,key2=value2"

# In some situations SELinux relabelling breaks (times out) on large filesystems, and doesn't work with cephfs ReadWriteMany volumes (last relabel wins).
# Disable it here if you have similar issues.
# For more details see https://github.com/rook/rook/issues/2417
enableSelinuxRelabeling: false

# Ceph mon and osd pods need to write to a hostPath. On OpenShift, with its restrictive SELinux permissions,
## pods must run privileged to write to hostPath volumes; in that case this must be set to true.
hostpathRequiresPrivileged: true

# Disable automatic orchestration when new devices are discovered.
disableDeviceHotplug: false

# Blacklist certain disks according to the regex provided.
discoverDaemonUdev:

# Configure image pull secrets
# imagePullSecrets:
# - name: my-registry-secret

# Whether the OBC provisioner should watch on the operator namespace or not, if not the namespace of the cluster will be used
# i.e. have the OBC provisioner watch the operator namespace
enableOBCWatchOperatorNamespace: true

admissionController:
# Set tolerations and nodeAffinity for admission controller pod.
# The admission controller would be best to start on the same nodes as other ceph daemons.
# tolerations:
# - key: key
# operator: Exists
# effect: NoSchedule
# nodeAffinity: key1=value1,value2; key2=value3
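
To compare this annotated copy with the upstream defaults, the chart's original values.yaml can be dumped with helm (an optional convenience step; the output file name is arbitrary):

$ helm show values rook-release/rook-ceph --version v1.6.7 > rook-ceph-default-values.yaml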

2.4 Deploying the Ceph cluster

Official documentation: https://www.rook.io/docs/rook/v1.6/helm-ceph-cluster.html

Add the Helm repository for the rook-ceph-cluster chart:

$ helm repo add rook-master https://charts.rook.io/master
$ helm repo list
NAME URL
rook-release https://charts.rook.io/release
rook-master https://charts.rook.io/master

Create the rook-ceph-cluster.yaml values file:

$ vim rook-ceph-cluster.yaml
# ----------------- The content below is what I changed; adjust further against values.yaml for special needs ------------------
cephClusterSpec:
  mgr:
    count: 2
  dashboard:
    enabled: true
    port: 7001
    ssl: false
  storage:
    # Whether to use all nodes and all of their available disks; when set to false, node selection and device filters must be specified manually
    useAllNodes: false
    useAllDevices: false
    config:
      metadataDevice: "vdf" # my SSD for metadata; without an SSD this is noticeably slower. Used as the BlueStore block.db device.
      databaseSizeMB: "1024" # uncomment if the disks are smaller than 100 GB
      journalSizeMB: "1024" # uncomment if the disks are 20 GB or smaller

    # Individual nodes and their config can also be specified, but "useAllNodes" above must be set to false. Then only
    # the nodes below are used as storage resources. Each node's "name" field should match its "kubernetes.io/hostname" label.
    nodes:
      - name: "cn-zhangjiakou.172.16.1.155"
        devices: # the OSD data disks on this node
          - name: "vdb"
          - name: "vdc"
          - name: "vdd"
          - name: "vde"
      - name: "cn-zhangjiakou.172.16.1.156"
        devices:
          - name: "vdb"
          - name: "vdc"
          - name: "vdd"
          - name: "vde"
      - name: "cn-zhangjiakou.172.16.1.157"
        devices:
          - name: "vdb"
          - name: "vdc"
          - name: "vdd"
          - name: "vde"

Deploy rook-ceph-cluster:

$ helm  install -n rook-ceph rook-ceph-cluster rook-master/rook-ceph-cluster --version 0 -f rook-ceph-cluster.yaml

Watch the rook-ceph-operator pod log to follow the deployment progress:

$ kubectl logs -f -n rook-ceph rook-ceph-operator-fdb564699-rv9cg

If many pods fail to pull their images, check whether you changed the image repositories when deploying the operator; the defaults come from the Google registry, which is unreachable from inside mainland China. My personal registry only carries the images for Ceph/Rook v1.6.7, so if you deploy another version you will have to sort out the images yourself.
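
A quick way to find the affected pods and the exact image they are stuck on (a generic troubleshooting sketch; pod names will differ):

$ kubectl -n rook-ceph get pods | grep -Ei 'imagepullbackoff|errimagepull'
$ kubectl -n rook-ceph describe pod <pod-name> | grep -i image
# then either mirror the listed images to a reachable registry, or adjust the image values in rook-operator.yaml and run helm upgrade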

Once all of the OSD pods in the cluster are running, the cluster is ready and the log will show the following:

$ kubectl logs -f -n rook-ceph rook-ceph-operator-fdb564699-rv9cg
2021-07-16 02:54:46.417047 I | op-k8sutil: Reporting Event rook-ceph:rook-ceph Normal:ReconcileSucceeded:cluster has been configured successfully
I0716 02:54:46.417161 6 manager.go:118] objectbucket.io/provisioner-manager "msg"="starting provisioner" "name"="rook-ceph.ceph.rook.io/bucket"
2021-07-16 02:54:48.670028 I | ceph-cluster-controller: Disabling the insecure global ID as no legacy clients are currently connected. If you still require the insecure connections, see the CVE to suppress the health warning and re-enable the insecure connections. https://docs.ceph.com/en/latest/security/CVE-2021-20288/
2021-07-16 02:54:50.333134 I | ceph-cluster-controller: insecure global ID is now disabled
2021-07-16 02:55:32.278545 I | op-mon: checking if multiple mons are on the same node
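
Besides the operator log, the CephCluster resource itself reports progress, and there should be one OSD pod per configured disk. A quick check (the exact columns printed may differ between Rook versions):

$ kubectl -n rook-ceph get cephcluster
# PHASE should eventually be Ready and HEALTH should be HEALTH_OK
$ kubectl -n rook-ceph get pods -l app=rook-ceph-osd -o wide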

Below is my own annotated translation of cluster.yaml, for reference only.

# Namespace where the Rook operator runs
operatorNamespace: rook-ceph

# The metadata.name of the CephCluster CR. The default name is the same as the namespace
#clusterName: rook-ceph

# Custom configuration that overrides ceph.conf
#configOverride: |
# [global]
# mon_allow_pool_delete = true
#
# osd_pool_default_size = 3
# osd_pool_default_min_size = 2

# Deploy a toolbox debugging pod; we install it later in this article
toolbox:
enabled: false
image: rook/ceph:v1.6.7
tolerations: []
affinity: {}

monitoring:
# Prometheus needs to be installed beforehand
# Enabling this creates RBAC rules that allow the operator to create ServiceMonitors
enabled: false
rulesNamespaceOverride:

cephClusterSpec:
cephVersion:
image: ceph/ceph:v16.2.4
# Whether to allow unsupported Ceph versions. Currently "nautilus" and "octopus" are supported.
# Future versions such as "pacific" require this to be set to "true".
# Do not set to true in production.
allowUnsupported: false

# The path on the host where configuration files are persisted. Must be specified.
# Important: if you reinstall the cluster, make sure this directory is deleted from every host, or the mons will fail to start on the new cluster.
# In Minikube the '/data' directory is configured to persist across reboots; use "/data/rook" in a Minikube environment.
dataDirHostPath: /var/lib/rook
skipUpgradeChecks: false
# Whether to continue the upgrade even if the PGs are not clean
continueUpgradeAfterChecksEvenIfNotHealthy: false
waitTimeoutForHealthyOSDInMinutes: 10
mon:
# Number of mons to start. Must be odd; 3 is usually recommended
count: 3
# Whether to allow more than one mon pod per node. Mons should be on unique nodes, so at least 3 nodes are recommended for production.
# Allowing multiple mons on the same node should only be used in test environments where data loss is acceptable.
allowMultiplePerNode: false
mgr:
# Set count to 2 when mgr high availability is needed.
# In that case one mgr will be active and the other on standby. When Ceph updates which
# mgr is active, Rook updates the mgr service to match the active mgr.
count: 2
modules:
- name: pg_autoscaler
enabled: true
# Enable the Ceph dashboard to view the cluster status
dashboard:
enabled: true
# When ssl is true the dashboard defaults to port 8443 and port does not need to be set
# port: 8443
# When ssl is false, port must be set explicitly (e.g. 7000), otherwise the dashboard cannot be reached
port: 7001
# Whether to enable SSL for the dashboard
ssl: false
#network:
# use host networking
#provider: host
# EXPERIMENTAL: enable the Multus network provider
#provider: multus
#selectors:
#public: public-conf --> NetworkAttachmentDefinition object name in Multus
#cluster: cluster-conf --> NetworkAttachmentDefinition object name in Multus
# Provide internet protocol version. IPv6, IPv4 or empty string are valid options. Empty string would mean IPv4
#ipFamily: "IPv6"
# Ceph daemons to listen on both IPv4 and Ipv6 networks
#dualStack: false
# enable the crash collector for ceph daemon crash collection
crashCollector:
disable: false
# Uncomment daysToRetain to prune ceph crash entries older than the
# specified number of days.
#daysToRetain: 30
# enable log collector, daemons will log on files and rotate
# logCollector:
# enabled: true
# periodicity: 24h # SUFFIX may be 'h' for hours or 'd' for days.
# automate [data cleanup process](https://github.com/rook/rook/blob/master/Documentation/ceph-teardown.md#delete-the-data-on-hosts) in cluster destruction.
# Data cleanup policy
cleanupPolicy:
confirmation: ""
# sanitizeDisks controls how the OSD disks are sanitized when the cluster is deleted
sanitizeDisks:
method: quick
dataSource: zero
iteration: 1
allowUninstallWithVolumes: false
# To control where various services will be scheduled by kubernetes, use the placement configuration sections below.
# To dedicate specific Kubernetes nodes to storage, configure the node selectors and tolerations below.
# The example under 'all' would have all services scheduled on kubernetes nodes labeled with 'role=storage-node' and
# tolerate taints with a key of 'storage-node', i.e.
# 1. label the storage nodes with "role=storage-node"
# 2. add a toleration on the storage pods for taints with the key "storage-node"
# placement:
# all:
# nodeAffinity:
# requiredDuringSchedulingIgnoredDuringExecution:
# nodeSelectorTerms:
# - matchExpressions:
# - key: role
# operator: In
# values:
# - storage-node
# podAffinity:
# podAntiAffinity:
# topologySpreadConstraints:
# tolerations:
# - key: storage-node
# operator: Exists
# The above placement information can also be specified for mon, osd, and mgr components
# mon:
# osd:
# mgr:
# cleanup:
#annotations:
# all:
# mon:
# osd:
# cleanup:
# prepareosd:
# If no mgr annotations are set, prometheus scrape annotations will be set by default.
# mgr:
#labels:
# all:
# mon:
# osd:
# cleanup:
# mgr:
# prepareosd:
# monitoring is a list of key-value pairs. It is injected into all the monitoring resources created by operator.
# These labels can be passed as LabelSelector to Prometheus
# monitoring:
#
# Resource limits for the mgr, mon, and osd pods
#resources:
# The requests and limits set here, allow the mgr pod to use half of one CPU core and 1 gigabyte of memory
# mgr:
# limits:
# cpu: "500m"
# memory: "1024Mi"
# requests:
# cpu: "500m"
# memory: "1024Mi"
# The above example requests/limits can also be added to the other components
# mon:
# osd:
# prepareosd:
# mgr-sidecar:
# crashcollector:
# logcollector:
# cleanup:
# Option to automatically remove OSDs that are out and safe to destroy.
removeOSDsIfOutAndSafeToRemove: false
# priorityClassNames:
# all: rook-ceph-default-priority-class
# mon: rook-ceph-mon-priority-class
# osd: rook-ceph-osd-priority-class
# mgr: rook-ceph-mgr-priority-class

# Cluster-level storage configuration and selection
storage: # cluster level storage configuration and selection
# Whether to use all nodes and all available devices on them; when false, node selection and device filters must be specified manually
useAllNodes: false
useAllDevices: false
#deviceFilter: ""
config:
# crushRoot: "custom-root" # specify a non-default root label for the CRUSH map
metadataDevice: "vdf" # specify a non-rotational storage so ceph-volume will use it as block db device of bluestore.
databaseSizeMB: "1024" # uncomment if the disks are smaller than 100 GB
journalSizeMB: "1024" # uncomment if the disks are 20 GB or smaller
# osdsPerDevice: "1" # this value can be overridden at the node or device level
# encryptedDevice: "true" # the default value for this option is "false"

# Individual nodes and their config can also be specified, but "useAllNodes" above must be set to false. Then only
# the nodes below are used as storage resources. Each node's "name" field should match its "kubernetes.io/hostname" label.
nodes:
- name: "cn-zhangjiakou.172.16.1.149"
devices: # specific devices to use for storage can be specified for each node
- name: "vdb"
- name: "vdc" # multiple osds can be created on high performance devices
- name: "vdd" # multiple osds can be created on high performance devices
- name: "vde" # multiple osds can be created on high performance devices
- name: "cn-zhangjiakou.172.16.1.150"
devices: # specific devices to use for storage can be specified for each node
- name: "vdb"
- name: "vdc" # multiple osds can be created on high performance devices
- name: "vdd" # multiple osds can be created on high performance devices
- name: "vde" # multiple osds can be created on high performance devices
- name: "cn-zhangjiakou.172.16.1.151"
devices: # specific devices to use for storage can be specified for each node
- name: "vdb"
- name: "vdc" # multiple osds can be created on high performance devices
- name: "vdd" # multiple osds can be created on high performance devices
- name: "vde" # multiple osds can be created on high performance devices
# config:
# osdsPerDevice: "5"
# - name: "/dev/disk/by-id/ata-ST4000DM004-XXXX" # devices can be specified using full udev paths
# config: # configuration can be specified at the node level which overrides the cluster level config
# - name: "172.17.4.301"
# deviceFilter: "^sd."
# The section for configuring management of daemon disruptions during upgrade or fencing.
disruptionManagement:
# If true, the operator will create and manage PodDisruptionBudgets for OSD, Mon, RGW, and MDS daemons. OSD PDBs are managed dynamically
# via the strategy outlined in the [design](https://github.com/rook/rook/blob/master/design/ceph/ceph-managed-disruptionbudgets.md). The operator will
# block eviction of OSDs by default and unblock them safely when drains are detected.
managePodBudgets: true
# A duration in minutes that determines how long an entire failureDomain like `region/zone/host` will be held in `noout` (in addition to the
# default DOWN/OUT interval) when it is draining. This is only relevant when `managePodBudgets` is `true`. The default value is `30` minutes.
osdMaintenanceTimeout: 30
# A duration in minutes that the operator will wait for the placement groups to become healthy (active+clean) after a drain was completed and OSDs came back up.
# Operator will continue with the next drain if the timeout exceeds. It only works if `managePodBudgets` is `true`.
# No values or 0 means that the operator will wait until the placement groups are healthy before unblocking the next drain.
pgHealthCheckTimeout: 0
# If true, the operator will create and manage MachineDisruptionBudgets to ensure OSDs are only fenced when the cluster is healthy.
# Only available on OpenShift.
manageMachineDisruptionBudgets: false
# Namespace in which to watch for the MachineDisruptionBudgets.
machineDisruptionBudgetNamespace: openshift-machine-api

# healthChecks
# Valid values for daemons are 'mon', 'osd', 'status'
healthCheck:
daemonHealth:
mon:
disabled: false
interval: 45s
osd:
disabled: false
interval: 60s
status:
disabled: false
interval: 60s
# Change pod liveness probe, it works for all mon,mgr,osd daemons
livenessProbe:
mon:
disabled: false
mgr:
disabled: false
osd:
disabled: false

2.5 Accessing the Ceph dashboard

The dashboard has a small quirk: it was installed with port 7001, which does not replace the default. We need to tweak the rook-ceph-cluster.yaml manifest and run a helm upgrade; after that the dashboard can be accessed on port 7000.

$ vim rook-ceph-cluster.yaml
cephClusterSpec:
  dashboard:
    enabled: true
    port: 7000 # change 7001 to 7000
    ssl: false

Upgrade the Helm release:

$ helm  upgrade -n rook-ceph rook-ceph-cluster rook-master/rook-ceph-cluster --version 0 -f rook-ceph-cluster.yaml

# the dashboard has been enabled successfully once the operator log shows the following
2021-07-16 03:07:42.809666 I | op-mgr: restarting the mgr module
2021-07-16 03:07:44.841640 I | op-mgr: successful modules: dashboard

Configure a port-forward to reach the ceph-dashboard; an Ingress or a NodePort works just as well (a NodePort sketch follows the command below).

$ kubectl port-forward -n rook-ceph service/rook-ceph-mgr-dashboard 7000:7000
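
If you would rather not keep a port-forward running, the dashboard can also be exposed with a NodePort Service. The sketch below mirrors the dashboard-external-http example from the Rook docs, with an assumed nodePort of 30700 and targetPort 7000 to match the non-SSL port configured above:

apiVersion: v1
kind: Service
metadata:
  name: rook-ceph-mgr-dashboard-external-http
  namespace: rook-ceph
  labels:
    app: rook-ceph-mgr
    rook_cluster: rook-ceph
spec:
  type: NodePort
  selector:
    app: rook-ceph-mgr
    rook_cluster: rook-ceph
  ports:
    - name: dashboard
      port: 7000
      protocol: TCP
      targetPort: 7000
      nodePort: 30700

After applying it, the dashboard is reachable at http://<any-node-ip>:30700.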

Retrieve the dashboard password (the default user is admin):

$ kubectl get secrets  -n rook-ceph rook-ceph-dashboard-password --template={{.data.password}}|base64 -d
@sI3\&9@&CY=vBLBU0uL

Then open http://127.0.0.1:7000 in a browser to reach the ceph-dashboard.

3. Verifying RBD and CephFS Mounts from Kubernetes

3.1 RBD mount verification

Create the pool and StorageClass used for RBD.

Official documentation: https://www.rook.io/docs/rook/v1.6/ceph-block.html

$ vim rbd-storageclass.yaml
-------------------------------------------------------
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: k8s-rbd-test-pool
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 2
    requireSafeReplicaSize: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-ceph-rbs-storageclass
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph # namespace:cluster
  pool: k8s-rbd-test-pool

  # Format 1 - (deprecated) the original format for new rbd images. Understood by all versions of librbd and the kernel rbd module, but newer features such as cloning are not supported.
  # Format 2 - the second rbd format, supported by librbd since the Bobtail release and by the kernel rbd module since kernel 3.10 ("fancy" striping since kernel 4.17). Adds support for cloning and is more easily extended for future features.
  imageFormat: "2"
  imageFeatures: layering

  # The secrets contain Ceph admin credentials. These are generated automatically by the operator
  # in the same namespace as the cluster.
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph # namespace:cluster
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph # namespace:cluster
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph # namespace:cluster
  csi.storage.k8s.io/fstype: ext4
  #mounter: rbd-nbd
allowVolumeExpansion: true
reclaimPolicy: Delete

Create the RBD StorageClass and verify it:

$ kubectl apply -f rbd-storageclass.yaml 
cephblockpool.ceph.rook.io/k8s-rbd-test-pool created
storageclass.storage.k8s.io/local-ceph-rbs-storageclass created

$ kubectl get sc |grep rbd

Create a test Deployment to verify that a pod can successfully mount an RBD block volume:

$ vim rbd-deploy.yaml
-------------------------------------------------------
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-ceph-test-rbd-pvc
spec:
  storageClassName: "local-ceph-rbs-storageclass"
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-ceph-rdb-deploy
spec:
  selector:
    matchLabels:
      app: rbd
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: rbd
    spec:
      containers:
        - image: centos:7
          name: centos
          ports:
            - containerPort: 80
              name: centos
          volumeMounts:
            - name: test-rbd-pvc
              mountPath: /data/
          command: ["/bin/bash","-c","sleep 9999999"]
      volumes:
        - name: test-rbd-pvc
          persistentVolumeClaim:
            claimName: local-ceph-test-rbd-pvc

Create the RBD test Deployment:

$ kubectl apply -f rbd-deploy.yaml

# check the pod and the PVC
$ kubectl get pod,pvc
NAME READY STATUS RESTARTS AGE
pod/test-ceph-rdb-deploy-5b7cb7d44-k779p 1/1 Running 0 55s

NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/local-ceph-test-rbd-pvc Bound pvc-ad88bb0b-c848-4b82-a7f7-904b6e4223b9 20Gi RWO local-ceph-rbs-storageclass 55s

Exec into the pod and check the mount at /data:

$ kubectl exec -it test-ceph-rdb-deploy-5b7cb7d44-k779p -- bash
$ df -h /data
/dev/rbd0 20G 45M 20G 1% /data

# verify that creating and deleting files works
$ touch /data/test.txt
$ ls /data/
lost+found test.txt
$ rm -f /data/test.txt

3.2 CephFS mount verification

Official documentation: https://www.rook.io/docs/rook/v1.6/ceph-filesystem-crd.html

Create the CephFS pools and StorageClass.

$ vim cephfsf-storageclass.yaml
-------------------------------------------------------
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: myfs
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 2
  dataPools:
    - replicated:
        size: 2
  preserveFilesystemOnDelete: true
  metadataServer:
    activeCount: 1
    activeStandby: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph
  fsName: myfs
  pool: myfs-data0

  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph

reclaimPolicy: Delete

Create the CephFS StorageClass and verify it:

$ kubectl apply -f  cephfsf-storageclass.yaml
cephfilesystem.ceph.rook.io/myfs created
storageclass.storage.k8s.io/rook-cephfs created

$ kubectl get sc |grep cephfs

Create a test Deployment to verify that a pod can successfully mount a CephFS volume:

$ vim cephfs-deploy.yaml
-------------------------------------------------------
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-cephfs-pvc
spec:
  storageClassName: "rook-cephfs"
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-cephfs-deploy
spec:
  selector:
    matchLabels:
      app: cephfs
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: cephfs
    spec:
      containers:
        - image: centos:7
          name: centos
          ports:
            - containerPort: 80
              name: centos
          volumeMounts:
            - name: cephfs-test
              mountPath: /data/
          command: ["/bin/bash","-c","sleep 9999999"]
      volumes:
        - name: cephfs-test
          persistentVolumeClaim:
            claimName: test-cephfs-pvc

Create the CephFS test Deployment:

$ kubectl apply -f cephfs-deploy.yaml 
persistentvolumeclaim/test-cephfs-pvc created
deployment.apps/test-cephfs-deploy created

# check the pod and the PVC
$ kubectl get pod,pvc |grep cephfs
pod/test-cephfs-deploy-668bf75967-mrhwh 1/1 Running 0 36s
persistentvolumeclaim/test-cephfs-pvc Bound pvc-64c80ba9-4dac-4478-9207-bf3515e8b9a7 20Gi RWO rook-cephfs 36s

Exec into the pod and check the mount at /data:

$ kubectl exec -it test-cephfs-deploy-668bf75967-mrhwh -- bash

$ df -h|grep data
192.168.127.20:6789,192.168.221.191:6789,192.168.254.166:6789:/volumes/csi/csi-vol-ee996359-e5e7-11eb-867a-12321954c01b/089942c2-327d-4a57-b0da-ed91b1556535 20G 0 20G 0% /data

# verify that creating and deleting files works
$ touch /data/test.txt
$ ls /data/
lost+found test.txt
$ rm -f /data/test.txt

4. Ceph Performance Testing

For a detailed guide to fio performance testing, see the dedicated fio article elsewhere on this blog; the flags used below are briefly annotated right after this paragraph.
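
The flags shared by all of the fio commands below mean roughly the following (a short summary only, not a full fio reference):

-direct=1          bypass the page cache so the storage itself is measured rather than RAM
-iodepth=128       number of in-flight IOs kept queued by libaio
-rw=read|write     sequential read / sequential write (randread/randwrite would test random IO)
-ioengine=libaio   use Linux native asynchronous IO
-bs / -size        block size of each IO and total amount of data per job
-numjobs=1         number of parallel jobs
-runtime=600       upper bound on the test duration, in seconds
-filename          target file on the volume mounted at /data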

4.1 RBD performance from a pod

The pod used here is the RBD test pod started in section 3.1; fio is installed inside it to run the benchmarks.

Exec into the RBD test pod and install fio:

$ kubectl exec -it test-ceph-rdb-deploy-5b7cb7d44-k779p -- bash
$ yum install -y fio
$ touch /data/test.txt

4K sequential read against the RBD-backed mount:

$ fio -name=Seq_Read_IOPS_Test -group_reporting -direct=1 -iodepth=128 -rw=read -ioengine=libaio -refill_buffers -norandommap -randrepeat=0 -bs=4k -size=3G -numjobs=1 -runtime=600 -filename=/data/test.txt

# result
read: IOPS=7191, BW=28.1MiB/s (29.5MB/s)(3072MiB/109363msec)

4K sequential write against the RBD-backed mount:

$  fio -name=Seq_Read_IOPS_Test -group_reporting -direct=1 -iodepth=128 -rw=write -ioengine=libaio -refill_buffers -norandommap -randrepeat=0 -bs=4k -size=3G -numjobs=1 -runtime=600 -filename=/data/test.txt
# result
write: IOPS=5954, BW=23.3MiB/s (24.4MB/s)(3072MiB/132073msec)

4M sequential read against the RBD-backed mount:

$ fio -name=Seq_Read_IOPS_Test -group_reporting -direct=1 -iodepth=128 -rw=read -ioengine=libaio -refill_buffers -norandommap -randrepeat=0 -bs=4M -size=10G -numjobs=1 -runtime=600 -filename=/data/test.txt

# result
read: IOPS=247, BW=991MiB/s (1039MB/s)(10.0GiB/10333msec)

4M sequential write against the RBD-backed mount:

$  fio -name=Seq_Read_IOPS_Test -group_reporting -direct=1 -iodepth=128 -rw=write -ioengine=libaio -refill_buffers -norandommap -randrepeat=0 -bs=4M -size=3G -numjobs=1 -runtime=600 -filename=/data/test.txt
# result
write: IOPS=41, BW=167MiB/s (175MB/s)(3072MiB/18378msec)

4.2 CephFS performance from a pod

The pod used here is the CephFS test pod started in section 3.2; fio is installed inside it to run the benchmarks.

Exec into the CephFS test pod and install fio:

$ kubectl exec -it test-cephfs-deploy-668bf75967-mrhwh -- bash
$ yum install -y fio
$ touch /data/test.txt

4K sequential read against the CephFS mount:

$ fio -name=Seq_Read_IOPS_Test -group_reporting -direct=1 -iodepth=128 -rw=read -ioengine=libaio -refill_buffers -norandommap -randrepeat=0 -bs=4k -size=3G -numjobs=1 -runtime=600 -filename=/data/test.txt

# result
read: IOPS=6221, BW=24.3MiB/s (25.5MB/s)(3072MiB/126411msec)

4K sequential write against the CephFS mount:

$ fio -name=Seq_Read_IOPS_Test -group_reporting -direct=1 -iodepth=128 -rw=write -ioengine=libaio -refill_buffers -norandommap -randrepeat=0 -bs=4k -size=3G -numjobs=1 -runtime=600 -filename=/data/test.txt

# result
write: IOPS=5290, BW=20.7MiB/s (21.7MB/s)(3072MiB/148658msec)

4.3 Installing ceph-tools to benchmark pools

Official documentation: https://www.rook.io/docs/rook/v1.6/ceph-toolbox.html

apiVersion: apps/v1
kind: Deployment
metadata:
  name: rook-ceph-tools
  namespace: rook-ceph
  labels:
    app: rook-ceph-tools
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rook-ceph-tools
  template:
    metadata:
      labels:
        app: rook-ceph-tools
    spec:
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: rook-ceph-tools
          image: rook/ceph:v1.6.7
          command: ["/tini"]
          args: ["-g", "--", "/usr/local/bin/toolbox.sh"]
          imagePullPolicy: IfNotPresent
          env:
            - name: ROOK_CEPH_USERNAME
              valueFrom:
                secretKeyRef:
                  name: rook-ceph-mon
                  key: ceph-username
            - name: ROOK_CEPH_SECRET
              valueFrom:
                secretKeyRef:
                  name: rook-ceph-mon
                  key: ceph-secret
          volumeMounts:
            - mountPath: /etc/ceph
              name: ceph-config
            - name: mon-endpoint-volume
              mountPath: /etc/rook
      volumes:
        - name: mon-endpoint-volume
          configMap:
            name: rook-ceph-mon-endpoints
            items:
              - key: data
                path: mon-endpoints
        - name: ceph-config
          emptyDir: {}
      tolerations:
        - key: "node.kubernetes.io/unreachable"
          operator: "Exists"
          effect: "NoExecute"
          tolerationSeconds: 5
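
Save the manifest above (for example as toolbox.yaml; the file name is my own choice), apply it, and then Ceph's native tools can be used from inside the toolbox pod to inspect the cluster and benchmark a pool directly, bypassing the CSI layer. A minimal sketch, reusing the RBD pool created in section 3.1:

$ kubectl apply -f toolbox.yaml
$ kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- bash

# inside the toolbox pod
$ ceph status
$ ceph osd df
# 10-second write benchmark against the pool, keeping the objects for the read test
$ rados bench -p k8s-rbd-test-pool 10 write --no-cleanup
# sequential read benchmark, then clean up the benchmark objects
$ rados bench -p k8s-rbd-test-pool 10 seq
$ rados -p k8s-rbd-test-pool cleanup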