1. Rook Overview

Rook website: https://rook.io

Rook uses Kubernetes primitives to run the Ceph storage system on Kubernetes. The architecture diagram in the Rook documentation illustrates how Ceph Rook integrates with Kubernetes.

With Rook running in the Kubernetes cluster, Kubernetes applications can mount block devices and filesystems managed by Rook, or use the S3/Swift API for object storage. The Rook operator automatically configures the storage components and monitors the cluster to keep the storage available and healthy.
The Rook operator is a simple container with everything needed to bootstrap and monitor the storage cluster. The operator starts and monitors the Ceph monitor pods and the OSD daemons, which provide the basic RADOS storage. It manages the CRDs for pools, object stores (S3/Swift), and filesystems by initializing the pods and other resources needed to run those services.

The operator watches the storage daemons to keep the cluster healthy: Ceph mons are started or failed over when necessary, and other adjustments are made as the cluster grows or shrinks. The operator also watches for desired-state changes requested through the API and applies them.
The Rook operator also creates the Rook agents, pods deployed on every Kubernetes node. Each agent configures a Flexvolume plugin that integrates with the Kubernetes volume controller and handles the storage operations required on the node, such as attaching network storage devices, mounting volumes, and formatting filesystems.

The rook container includes all the Ceph daemons and tools needed to manage and store the data; the data path is unchanged. Rook does not attempt to maintain full fidelity with Ceph: many Ceph concepts, such as placement groups and CRUSH maps, are hidden so that you do not have to worry about them. Instead, Rook offers administrators a simplified experience built around physical resources, pools, volumes, filesystems, and buckets, while advanced configuration can still be applied with the Ceph tools when needed.
Rook is implemented in Go; Ceph is implemented in C++ with a highly optimized data path. We believe this combination offers the best of both worlds.

2. Deploying the Cluster

Official documentation referenced by this article:

Rook official Helm deployment docs: https://www.rook.io/docs/rook/v1.6/helm.html

Rook GitHub repository: https://github.com/rook/rook

2.1 Environment information

Item                 Details
Node spec            8 cores / 32 GB RAM; 4 x 5500 GB HDD (OSD data); 2 x 40 GB SSD (one system disk, one for Ceph metadata); 3 nodes
Kubernetes cluster   Alibaba Cloud ACK v1.18.8, managed, three nodes
helm                 v3.5.2
Rook                 v1.6.7
Ceph                 v16.2.4
Mon daemons          3
Mgr daemons          2
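
Before deploying, it is worth confirming that the disk layout on each storage node matches this table. On my nodes the four OSD data HDDs appear as vdb-vde and the metadata SSD as vdf (device names are environment specific; this check is a convenience, not part of the official procedure):

$ lsblk -o NAME,SIZE,ROTA,TYPE,MOUNTPOINT
# ROTA=1 marks rotational disks (the HDDs used for OSD data), ROTA=0 the SSDs;
# the disks handed to Ceph must be raw: no partitions, no filesystem, no mountpoint.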

2.2 Installing helm

Skip this step if helm is already installed.

Download the helm binary package:

$ wget https://get.helm.sh/helm-v3.5.2-linux-amd64.tar.gz

Extract the archive and move the binary to a directory on the PATH:

$ tar xzvf helm-v3.5.2-linux-amd64.tar.gz 
linux-amd64/
linux-amd64/helm
linux-amd64/LICENSE
linux-amd64/README.md

$ chmod +x linux-amd64/helm
$ mv linux-amd64/helm /usr/bin/

Verify the installation:

$ helm  version 
version.BuildInfo{Version:"v3.5.2", GitCommit:"167aac70832d3a384f65f9745335e9fb40169dc2", GitTreeState:"dirty", GoVersion:"go1.15.7"}

2.3 Deploying the operator

Official documentation: https://www.rook.io/docs/rook/v1.6/helm-operator.html

Create the rook-ceph namespace:

$ kubectl create namespace rook-ceph
$ kubectl get ns |grep rook
rook-ceph Active 9s

Add the Helm repository for the Rook operator:

$ helm repo add rook-release https://charts.rook.io/release
$ helm repo list
NAME URL
rook-release https://charts.rook.io/release

Create the rook-operator.yaml values file for the operator:

$ vim rook-operator.yaml
# ----------------- The content below is what I changed; adjust further against values.yaml for special needs ------------------
csi:
  # CSI plugin image locations. The defaults point at the Google registry, which I replace with my own mirrors.
  # A Docker Hub repository such as putianhui/cephcsi:v3.3.1 (putianhui/imageName:tag) works,
  # and so does an Alibaba Cloud registry prefix, e.g. registry.cn-shanghai.aliyuncs.com/k8s-gle/imageName:tag
  cephcsi:
    image: registry.cn-shanghai.aliyuncs.com/k8s-gle/cephcsi:v3.3.1
  registrar:
    image: registry.cn-shanghai.aliyuncs.com/k8s-gle/csi-node-driver-registrar:v2.2.0
  provisioner:
    image: registry.cn-shanghai.aliyuncs.com/k8s-gle/csi-provisioner:v2.2.2
  snapshotter:
    image: registry.cn-shanghai.aliyuncs.com/k8s-gle/csi-snapshotter:v4.1.1
  attacher:
    image: registry.cn-shanghai.aliyuncs.com/k8s-gle/csi-attacher:v3.2.1
  resizer:
    image: registry.cn-shanghai.aliyuncs.com/k8s-gle/csi-resizer:v1.2.0

enableSelinuxRelabeling: false

# Ceph mon and osd pods need to write to a hostPath. On OpenShift, with its restrictive SELinux permissions,
# pods must run privileged to write to hostPath volumes; in that case this must be set to true.
hostpathRequiresPrivileged: true

Deploy the Rook operator with the rook-operator.yaml values file:

$ helm install -n rook-ceph rook-ceph rook-release/rook-ceph --version v1.6.7 -f rook-operator.yaml

Wait for the operator pod to come up:

# still being created
$ kubectl --namespace rook-ceph get pods
NAME READY STATUS RESTARTS AGE
rook-ceph-operator-fdb564699-rv9cg 0/1 ContainerCreating 0 46s

# started successfully
$ kubectl --namespace rook-ceph get pods
NAME READY STATUS RESTARTS AGE
rook-ceph-operator-fdb564699-rv9cg 1/1 Running 0 89s
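
Optionally, confirm that the chart also installed the Rook CRDs (a quick sanity check, not a required step):

$ kubectl get crd | grep ceph.rook.io
# expect entries such as cephclusters.ceph.rook.io, cephblockpools.ceph.rook.io and cephfilesystems.ceph.rook.io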

Below is my own annotated translation of the operator chart's values.yaml, for reference (the upstream defaults can be dumped with helm show values, as shown right after this file).

image:
prefix: rook
repository: rook/ceph
tag: v1.6.7
pullPolicy: IfNotPresent

crds:
# Whether the helm chart should create and update the CRDs. If false, the CRDs must be
# managed independently with cluster/examples/kubernetes/ceph/crds.yaml.
# **WARNING** Only set this during the first deployment. If disabled later, the cluster may be DESTROYED.
# If the CRDs are then deleted in this case, see the disaster recovery guide to restore them:
# https://rook.github.io/docs/rook/master/ceph-disaster-recovery.html#restoring-crds-after-deletion
enabled: true

resources:
limits:
cpu: 500m
memory: 256Mi
requests:
cpu: 100m
memory: 256Mi

nodeSelector: {}
# Constraint rook-ceph-operator Deployment to nodes with label `disktype: ssd`.
# For more info, see https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#nodeselector
# disktype: ssd

# Tolerations for the rook-ceph-operator to allow it to run on nodes with particular taints
tolerations: []

# Delay to use in node.kubernetes.io/unreachable toleration
unreachableNodeTolerationSeconds: 5

# Whether rook watches only the current namespace for CRDs or the whole cluster; the default is false, i.e. watch the whole cluster
currentNamespaceOnly: false

## Annotations to be added to pod
annotations: {}

# Log level
logLevel: INFO

# If true, create and use RBAC resources
rbacEnable: true

# If true, create and use PSP resources
##
pspEnable: true

# Settings for whether to disable the drivers or other daemons if they are not
## needed
csi:
enableRbdDriver: true
enableCephfsDriver: true
enableGrpcMetrics: false
enableCephfsSnapshotter: true
enableRBDSnapshotter: true

rbdFSGroupPolicy: "ReadWriteOnceWithFSType"

cephFSFSGroupPolicy: "ReadWriteOnceWithFSType"
enableOMAPGenerator: false

allowUnsupportedVersion: false
# Resource requests/limits for the Ceph CSI RBD provisioner pods
#
# csiRBDProvisionerResource: |
# - name : csi-provisioner
# resource:
# requests:
# memory: 128Mi
# cpu: 100m
# limits:
# memory: 256Mi
# cpu: 200m
# - name : csi-resizer
# resource:
# requests:
# memory: 128Mi
# cpu: 100m
# limits:
# memory: 256Mi
# cpu: 200m
# - name : csi-attacher
# resource:
# requests:
# memory: 128Mi
# cpu: 100m
# limits:
# memory: 256Mi
# cpu: 200m
# - name : csi-snapshotter
# resource:
# requests:
# memory: 128Mi
# cpu: 100m
# limits:
# memory: 256Mi
# cpu: 200m
# - name : csi-rbdplugin
# resource:
# requests:
# memory: 512Mi
# cpu: 250m
# limits:
# memory: 1Gi
# cpu: 500m
# - name : liveness-prometheus
# resource:
# requests:
# memory: 128Mi
# cpu: 50m
# limits:
# memory: 256Mi
# cpu: 100m
#
# Resource requests/limits for the Ceph CSI RBD plugin pods
#
# csiRBDPluginResource: |
# - name : driver-registrar
# resource:
# requests:
# memory: 128Mi
# cpu: 50m
# limits:
# memory: 256Mi
# cpu: 100m
# - name : csi-rbdplugin
# resource:
# requests:
# memory: 512Mi
# cpu: 250m
# limits:
# memory: 1Gi
# cpu: 500m
# - name : liveness-prometheus
# resource:
# requests:
# memory: 128Mi
# cpu: 50m
# limits:
# memory: 256Mi
# cpu: 100m
#
# # Resource requests/limits for the Ceph CSI CephFS provisioner pods
#
# csiCephFSProvisionerResource: |
# - name : csi-provisioner
# resource:
# requests:
# memory: 128Mi
# cpu: 100m
# limits:
# memory: 256Mi
# cpu: 200m
# - name : csi-resizer
# resource:
# requests:
# memory: 128Mi
# cpu: 100m
# limits:
# memory: 256Mi
# cpu: 200m
# - name : csi-attacher
# resource:
# requests:
# memory: 128Mi
# cpu: 100m
# limits:
# memory: 256Mi
# cpu: 200m
# - name : csi-cephfsplugin
# resource:
# requests:
# memory: 512Mi
# cpu: 250m
# limits:
# memory: 1Gi
# cpu: 500m
# - name : liveness-prometheus
# resource:
# requests:
# memory: 128Mi
# cpu: 50m
# limits:
# memory: 256Mi
# cpu: 100m
# (Optional) CEPH CSI CephFS plugin resource requirement list, Put here list of resource
# requests and limits you want to apply for plugin pod
#
# Resource requests/limits for the Ceph CSI CephFS plugin pods
#
# csiCephFSPluginResource: |
# - name : driver-registrar
# resource:
# requests:
# memory: 128Mi
# cpu: 50m
# limits:
# memory: 256Mi
# cpu: 100m
# - name : csi-cephfsplugin
# resource:
# requests:
# memory: 512Mi
# cpu: 250m
# limits:
# memory: 1Gi
# cpu: 500m
# - name : liveness-prometheus
# resource:
# requests:
# memory: 128Mi
# cpu: 50m
# limits:
# memory: 256Mi
# cpu: 100m
#
# Set tolerations and node affinity for the provisioner pods.
# The CSI provisioners are best started on the same nodes as the other Ceph daemons.
#
# provisionerTolerations:
# - key: key
# operator: Exists
# effect: NoSchedule
# provisionerNodeAffinity: key1=value1,value2; key2=value3
#
# Set tolerations and node affinity for the CSI plugin DaemonSets.
# The CSI plugins need to run on every node where a client may need to mount the storage.
#
# pluginTolerations:
# - key: key
# operator: Exists
# effect: NoSchedule
# pluginNodeAffinity: key1=value1,value2; key2=value3
# Ceph CSI metrics ports
#cephfsGrpcMetricsPort: 9091
#cephfsLivenessMetricsPort: 9081
#rbdGrpcMetricsPort: 9090
#rbdLivenessMetricsPort: 9080
#
# Enable the Ceph kernel client on kernels < 4.17. If your kernel does not support CephFS quotas
# you may want to disable this setting. However, this will cause issues during upgrades with the
# FUSE client. See the upgrade guide: https://rook.io/docs/rook/v1.2/ceph-upgrade.html
#
forceCephFSKernelClient: true
# kubelet directory
#kubeletDirPath: /var/lib/kubelet
# CSI plugin image locations. The defaults point at the Google registry, which I replace with my own mirrors.
# A Docker Hub repository such as putianhui/cephcsi:v3.3.1 (putianhui/imageName:tag) works,
# and so does an Alibaba Cloud registry prefix, e.g. registry.cn-shanghai.aliyuncs.com/k8s-gle/imageName:tag
cephcsi:
image: registry.cn-shanghai.aliyuncs.com/k8s-gle/cephcsi:v3.3.1
registrar:
image: registry.cn-shanghai.aliyuncs.com/k8s-gle/csi-node-driver-registrar:v2.2.0
provisioner:
image: registry.cn-shanghai.aliyuncs.com/k8s-gle/csi-provisioner:v2.2.2
snapshotter:
image: registry.cn-shanghai.aliyuncs.com/k8s-gle/csi-snapshotter:v4.1.1
attacher:
image: registry.cn-shanghai.aliyuncs.com/k8s-gle/csi-attacher:v3.2.1
resizer:
image: registry.cn-shanghai.aliyuncs.com/k8s-gle/csi-resizer:v1.2.0
# Add custom labels to the CSI CephFS Deployments and DaemonSets
#cephfsPodLabels: "key1=value1,key2=value2"
# Add custom labels to the CSI RBD Deployments and DaemonSets
#rbdPodLabels: "key1=value1,key2=value2"
# Enable the volume replication controller
volumeReplication:
enabled: false
#image: "quay.io/csiaddons/volumereplication-operator:v0.1.0"

enableFlexDriver: false
enableDiscoveryDaemon: false

# Allow multiple Ceph filesystems in the same cluster; defaults to false
# WARNING: Experimental feature in Ceph Releases Octopus (v15) and Nautilus (v14)
# https://docs.ceph.com/en/octopus/cephfs/experimental-features/#multiple-file-systems-within-a-ceph-cluster
allowMultipleFilesystems: false

## If true, run the rook operator on the host network
# useOperatorHostNetwork: true

## Tolerations and node affinity for the Rook Agent
## toleration: NoSchedule, PreferNoSchedule or NoExecute
## tolerationKey: Set this to the specific key of the taint to tolerate
## tolerations: Array of tolerations in YAML format which will be added to agent deployment
## nodeAffinity: Set to labels of the node to match
## flexVolumeDirPath: The path where the Rook agent discovers the flex volume plugins
## libModulesDirPath: The path where the Rook agent can find kernel modules
# agent:
# toleration: NoSchedule
# tolerationKey: key
# tolerations:
# - key: key
# operator: Exists
# effect: NoSchedule
# nodeAffinity: key1=value1,value2; key2=value3
# mountSecurityMode: Any
## For information on FlexVolume path, please refer to https://rook.io/docs/rook/master/flexvolume.html
# flexVolumeDirPath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
# libModulesDirPath: /lib/modules
# mounts: mount1=/host/path:/container/path,/host/path2:/container/path2

## Tolerations and node affinity for Rook Discover
## toleration: NoSchedule, PreferNoSchedule or NoExecute
## tolerationKey: Set this to the specific key of the taint to tolerate
## tolerations: Array of tolerations in YAML format which will be added to agent deployment
## nodeAffinity: Set to labels of the node to match
# discover:
# toleration: NoSchedule
# tolerationKey: key
# tolerations:
# - key: key
# operator: Exists
# effect: NoSchedule
# nodeAffinity: key1=value1,value2; key2=value3
# podLabels: "key1=value1,key2=value2"

# In some situations SELinux relabelling breaks (times out) on large filesystems, and doesn't work with cephfs ReadWriteMany volumes (last relabel wins).
# Disable it here if you have similar issues.
# For more details see https://github.com/rook/rook/issues/2417
enableSelinuxRelabeling: false

# Ceph mon and osd pods need to write to a hostPath. On OpenShift, with its restrictive SELinux permissions,
## pods must run privileged to write to hostPath volumes; in that case this must be set to true.
hostpathRequiresPrivileged: true

# Disable automatic orchestration when new devices are discovered.
disableDeviceHotplug: false

# Blacklist certain disks according to the regex provided.
discoverDaemonUdev:

# Configure image pull secrets
# imagePullSecrets:
# - name: my-registry-secret

# Whether the OBC provisioner should watch on the operator namespace or not, if not the namespace of the cluster will be used
# i.e. have the OBC provisioner watch the operator namespace
enableOBCWatchOperatorNamespace: true

admissionController:
# Set tolerations and nodeAffinity for admission controller pod.
# The admission controller would be best to start on the same nodes as other ceph daemons.
# tolerations:
# - key: key
# operator: Exists
# effect: NoSchedule
# nodeAffinity: key1=value1,value2; key2=value3
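
To compare this annotated copy with the upstream defaults, the chart's original values.yaml can be dumped with helm (an optional convenience step; the output file name is arbitrary):

$ helm show values rook-release/rook-ceph --version v1.6.7 > rook-ceph-default-values.yaml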

2.4 Deploying the Ceph cluster

Official documentation: https://www.rook.io/docs/rook/v1.6/helm-ceph-cluster.html

Add the Helm repository for the rook-ceph-cluster chart:

$ helm repo add rook-master https://charts.rook.io/master
$ helm repo list
NAME URL
rook-release https://charts.rook.io/release
rook-master https://charts.rook.io/master

Create the rook-ceph-cluster.yaml values file:

$ vim rook-ceph-cluster.yaml
# ----------------- The content below is what I changed; adjust further against values.yaml for special needs ------------------
cephClusterSpec:
  mgr:
    count: 2
  dashboard:
    enabled: true
    port: 7001
    ssl: false
  storage:
    # Whether to use all nodes and all of their available disks; when set to false, node selection and device filters must be specified manually
    useAllNodes: false
    useAllDevices: false
    config:
      metadataDevice: "vdf" # my SSD for metadata; without an SSD this is noticeably slower. Used as the BlueStore block.db device.
      databaseSizeMB: "1024" # uncomment if the disks are smaller than 100 GB
      journalSizeMB: "1024" # uncomment if the disks are 20 GB or smaller

    # Individual nodes and their config can also be specified, but "useAllNodes" above must be set to false. Then only
    # the nodes below are used as storage resources. Each node's "name" field should match its "kubernetes.io/hostname" label.
    nodes:
      - name: "cn-zhangjiakou.172.16.1.155"
        devices: # the OSD data disks on this node
          - name: "vdb"
          - name: "vdc"
          - name: "vdd"
          - name: "vde"
      - name: "cn-zhangjiakou.172.16.1.156"
        devices:
          - name: "vdb"
          - name: "vdc"
          - name: "vdd"
          - name: "vde"
      - name: "cn-zhangjiakou.172.16.1.157"
        devices:
          - name: "vdb"
          - name: "vdc"
          - name: "vdd"
          - name: "vde"

Deploy rook-ceph-cluster:

$ helm  install -n rook-ceph rook-ceph-cluster rook-master/rook-ceph-cluster --version 0 -f rook-ceph-cluster.yaml

Watch the rook-ceph-operator pod log to follow the deployment progress:

$ kubectl logs -f -n rook-ceph rook-ceph-operator-fdb564699-rv9cg

If many pods fail to pull their images, check whether you changed the image repositories when deploying the operator; the defaults come from the Google registry, which is unreachable from inside mainland China. My personal registry only carries the images for Ceph/Rook v1.6.7, so if you deploy another version you will have to sort out the images yourself.
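
A quick way to find the affected pods and the exact image they are stuck on (a generic troubleshooting sketch; pod names will differ):

$ kubectl -n rook-ceph get pods | grep -Ei 'imagepullbackoff|errimagepull'
$ kubectl -n rook-ceph describe pod <pod-name> | grep -i image
# then either mirror the listed images to a reachable registry, or adjust the image values in rook-operator.yaml and run helm upgrade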

Once all of the OSD pods in the cluster are running, the cluster is ready and the log will show the following:

$ kubectl logs -f -n rook-ceph rook-ceph-operator-fdb564699-rv9cg
2021-07-16 02:54:46.417047 I | op-k8sutil: Reporting Event rook-ceph:rook-ceph Normal:ReconcileSucceeded:cluster has been configured successfully
I0716 02:54:46.417161 6 manager.go:118] objectbucket.io/provisioner-manager "msg"="starting provisioner" "name"="rook-ceph.ceph.rook.io/bucket"
2021-07-16 02:54:48.670028 I | ceph-cluster-controller: Disabling the insecure global ID as no legacy clients are currently connected. If you still require the insecure connections, see the CVE to suppress the health warning and re-enable the insecure connections. https://docs.ceph.com/en/latest/security/CVE-2021-20288/
2021-07-16 02:54:50.333134 I | ceph-cluster-controller: insecure global ID is now disabled
2021-07-16 02:55:32.278545 I | op-mon: checking if multiple mons are on the same node
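
Besides the operator log, the CephCluster resource itself reports progress, and there should be one OSD pod per configured disk. A quick check (the exact columns printed may differ between Rook versions):

$ kubectl -n rook-ceph get cephcluster
# PHASE should eventually be Ready and HEALTH should be HEALTH_OK
$ kubectl -n rook-ceph get pods -l app=rook-ceph-osd -o wide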

Below is my own annotated translation of cluster.yaml, for reference only.

# Namespace where the Rook operator runs
operatorNamespace: rook-ceph

# The metadata.name of the CephCluster CR. The default name is the same as the namespace
#clusterName: rook-ceph

# Custom configuration that overrides ceph.conf
#configOverride: |
# [global]
# mon_allow_pool_delete = true
#
# osd_pool_default_size = 3
# osd_pool_default_min_size = 2

# Deploy a toolbox debugging pod; we install it later in this article
toolbox:
enabled: false
image: rook/ceph:v1.6.7
tolerations: []
affinity: {}

monitoring:
# Prometheus needs to be installed beforehand
# Enabling this creates RBAC rules that allow the operator to create ServiceMonitors
enabled: false
rulesNamespaceOverride:

cephClusterSpec:
cephVersion:
image: ceph/ceph:v16.2.4
# Whether to allow unsupported Ceph versions. Currently "nautilus" and "octopus" are supported.
# Future versions such as "pacific" require this to be set to "true".
# Do not set to true in production.
allowUnsupported: false

# The path on the host where configuration files are persisted. Must be specified.
# Important: if you reinstall the cluster, make sure this directory is deleted from every host, or the mons will fail to start on the new cluster.
# In Minikube the '/data' directory is configured to persist across reboots; use "/data/rook" in a Minikube environment.
dataDirHostPath: /var/lib/rook
skipUpgradeChecks: false
# Whether to continue the upgrade even if the PGs are not clean
continueUpgradeAfterChecksEvenIfNotHealthy: false
waitTimeoutForHealthyOSDInMinutes: 10
mon:
# Number of mons to start. Must be odd; 3 is usually recommended
count: 3
# Whether to allow more than one mon pod per node. Mons should be on unique nodes, so at least 3 nodes are recommended for production.
# Allowing multiple mons on the same node should only be used in test environments where data loss is acceptable.
allowMultiplePerNode: false
mgr:
# Set count to 2 when mgr high availability is needed.
# In that case one mgr will be active and the other on standby. When Ceph updates which
# mgr is active, Rook updates the mgr service to match the active mgr.
count: 2
modules:
- name: pg_autoscaler
enabled: true
# Enable the Ceph dashboard to view the cluster status
dashboard:
enabled: true
# When ssl is true the dashboard defaults to port 8443 and port does not need to be set
# port: 8443
# When ssl is false, port must be set explicitly (e.g. 7000), otherwise the dashboard cannot be reached
port: 7001
# Whether to enable SSL for the dashboard
ssl: false
#network:
# use host networking
#provider: host
# EXPERIMENTAL: enable the Multus network provider
#provider: multus
#selectors:
#public: public-conf --> NetworkAttachmentDefinition object name in Multus
#cluster: cluster-conf --> NetworkAttachmentDefinition object name in Multus
# Provide internet protocol version. IPv6, IPv4 or empty string are valid options. Empty string would mean IPv4
#ipFamily: "IPv6"
# Ceph daemons to listen on both IPv4 and Ipv6 networks
#dualStack: false
# enable the crash collector for ceph daemon crash collection
crashCollector:
disable: false
# Uncomment daysToRetain to prune ceph crash entries older than the
# specified number of days.
#daysToRetain: 30
# enable log collector, daemons will log on files and rotate
# logCollector:
# enabled: true
# periodicity: 24h # SUFFIX may be 'h' for hours or 'd' for days.
# automate [data cleanup process](https://github.com/rook/rook/blob/master/Documentation/ceph-teardown.md#delete-the-data-on-hosts) in cluster destruction.
# Data cleanup policy
cleanupPolicy:
confirmation: ""
# sanitizeDisks controls how the OSD disks are sanitized when the cluster is deleted
sanitizeDisks:
method: quick
dataSource: zero
iteration: 1
allowUninstallWithVolumes: false
# To control where various services will be scheduled by kubernetes, use the placement configuration sections below.
# To dedicate specific Kubernetes nodes to storage, configure the node selectors and tolerations below.
# The example under 'all' would have all services scheduled on kubernetes nodes labeled with 'role=storage-node' and
# tolerate taints with a key of 'storage-node', i.e.
# 1. label the storage nodes with "role=storage-node"
# 2. add a toleration on the storage pods for taints with the key "storage-node"
# placement:
# all:
# nodeAffinity:
# requiredDuringSchedulingIgnoredDuringExecution:
# nodeSelectorTerms:
# - matchExpressions:
# - key: role
# operator: In
# values:
# - storage-node
# podAffinity:
# podAntiAffinity:
# topologySpreadConstraints:
# tolerations:
# - key: storage-node
# operator: Exists
# The above placement information can also be specified for mon, osd, and mgr components
# mon:
# osd:
# mgr:
# cleanup:
#annotations:
# all:
# mon:
# osd:
# cleanup:
# prepareosd:
# If no mgr annotations are set, prometheus scrape annotations will be set by default.
# mgr:
#labels:
# all:
# mon:
# osd:
# cleanup:
# mgr:
# prepareosd:
# monitoring is a list of key-value pairs. It is injected into all the monitoring resources created by operator.
# These labels can be passed as LabelSelector to Prometheus
# monitoring:
#
# Resource limits for the mgr, mon, and osd pods
#resources:
# The requests and limits set here, allow the mgr pod to use half of one CPU core and 1 gigabyte of memory
# mgr:
# limits:
# cpu: "500m"
# memory: "1024Mi"
# requests:
# cpu: "500m"
# memory: "1024Mi"
# The above example requests/limits can also be added to the other components
# mon:
# osd:
# prepareosd:
# mgr-sidecar:
# crashcollector:
# logcollector:
# cleanup:
# Option to automatically remove OSDs that are out and safe to destroy.
removeOSDsIfOutAndSafeToRemove: false
# priorityClassNames:
# all: rook-ceph-default-priority-class
# mon: rook-ceph-mon-priority-class
# osd: rook-ceph-osd-priority-class
# mgr: rook-ceph-mgr-priority-class

# Cluster-level storage configuration and selection
storage: # cluster level storage configuration and selection
# Whether to use all nodes and all available devices on them; when false, node selection and device filters must be specified manually
useAllNodes: false
useAllDevices: false
#deviceFilter: ""
config:
# crushRoot: "custom-root" # specify a non-default root label for the CRUSH map
metadataDevice: "vdf" # specify a non-rotational storage so ceph-volume will use it as block db device of bluestore.
databaseSizeMB: "1024" # uncomment if the disks are smaller than 100 GB
journalSizeMB: "1024" # uncomment if the disks are 20 GB or smaller
# osdsPerDevice: "1" # this value can be overridden at the node or device level
# encryptedDevice: "true" # the default value for this option is "false"

# Individual nodes and their config can also be specified, but "useAllNodes" above must be set to false. Then only
# the nodes below are used as storage resources. Each node's "name" field should match its "kubernetes.io/hostname" label.
nodes:
- name: "cn-zhangjiakou.172.16.1.149"
devices: # specific devices to use for storage can be specified for each node
- name: "vdb"
- name: "vdc" # multiple osds can be created on high performance devices
- name: "vdd" # multiple osds can be created on high performance devices
- name: "vde" # multiple osds can be created on high performance devices
- name: "cn-zhangjiakou.172.16.1.150"
devices: # specific devices to use for storage can be specified for each node
- name: "vdb"
- name: "vdc" # multiple osds can be created on high performance devices
- name: "vdd" # multiple osds can be created on high performance devices
- name: "vde" # multiple osds can be created on high performance devices
- name: "cn-zhangjiakou.172.16.1.151"
devices: # specific devices to use for storage can be specified for each node
- name: "vdb"
- name: "vdc" # multiple osds can be created on high performance devices
- name: "vdd" # multiple osds can be created on high performance devices
- name: "vde" # multiple osds can be created on high performance devices
# config:
# osdsPerDevice: "5"
# - name: "/dev/disk/by-id/ata-ST4000DM004-XXXX" # devices can be specified using full udev paths
# config: # configuration can be specified at the node level which overrides the cluster level config
# - name: "172.17.4.301"
# deviceFilter: "^sd."
# The section for configuring management of daemon disruptions during upgrade or fencing.
disruptionManagement:
# If true, the operator will create and manage PodDisruptionBudgets for OSD, Mon, RGW, and MDS daemons. OSD PDBs are managed dynamically
# via the strategy outlined in the [design](https://github.com/rook/rook/blob/master/design/ceph/ceph-managed-disruptionbudgets.md). The operator will
# block eviction of OSDs by default and unblock them safely when drains are detected.
managePodBudgets: true
# A duration in minutes that determines how long an entire failureDomain like `region/zone/host` will be held in `noout` (in addition to the
# default DOWN/OUT interval) when it is draining. This is only relevant when `managePodBudgets` is `true`. The default value is `30` minutes.
osdMaintenanceTimeout: 30
# A duration in minutes that the operator will wait for the placement groups to become healthy (active+clean) after a drain was completed and OSDs came back up.
# Operator will continue with the next drain if the timeout exceeds. It only works if `managePodBudgets` is `true`.
# No values or 0 means that the operator will wait until the placement groups are healthy before unblocking the next drain.
pgHealthCheckTimeout: 0
# If true, the operator will create and manage MachineDisruptionBudgets to ensure OSDs are only fenced when the cluster is healthy.
# Only available on OpenShift.
manageMachineDisruptionBudgets: false
# Namespace in which to watch for the MachineDisruptionBudgets.
machineDisruptionBudgetNamespace: openshift-machine-api

# healthChecks
# Valid values for daemons are 'mon', 'osd', 'status'
healthCheck:
daemonHealth:
mon:
disabled: false
interval: 45s
osd:
disabled: false
interval: 60s
status:
disabled: false
interval: 60s
# Change pod liveness probe, it works for all mon,mgr,osd daemons
livenessProbe:
mon:
disabled: false
mgr:
disabled: false
osd:
disabled: false

2.5 Accessing the Ceph dashboard

The dashboard has a small quirk: it was installed with port 7001, which does not replace the default. We need to tweak the rook-ceph-cluster.yaml manifest and run a helm upgrade; after that the dashboard can be accessed on port 7000.

$ vim rook-ceph-cluster.yaml
cephClusterSpec:
  dashboard:
    enabled: true
    port: 7000 # change 7001 to 7000
    ssl: false

Upgrade the Helm release:

$ helm  upgrade -n rook-ceph rook-ceph-cluster rook-master/rook-ceph-cluster --version 0 -f rook-ceph-cluster.yaml

# the dashboard has been enabled successfully once the operator log shows the following
2021-07-16 03:07:42.809666 I | op-mgr: restarting the mgr module
2021-07-16 03:07:44.841640 I | op-mgr: successful modules: dashboard

Configure a port-forward to reach the ceph-dashboard; an Ingress or a NodePort works just as well (a NodePort sketch follows the command below).

$ kubectl port-forward -n rook-ceph service/rook-ceph-mgr-dashboard 7000:7000
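
If you would rather not keep a port-forward running, the dashboard can also be exposed with a NodePort Service. The sketch below mirrors the dashboard-external-http example from the Rook docs, with an assumed nodePort of 30700 and targetPort 7000 to match the non-SSL port configured above:

apiVersion: v1
kind: Service
metadata:
  name: rook-ceph-mgr-dashboard-external-http
  namespace: rook-ceph
  labels:
    app: rook-ceph-mgr
    rook_cluster: rook-ceph
spec:
  type: NodePort
  selector:
    app: rook-ceph-mgr
    rook_cluster: rook-ceph
  ports:
    - name: dashboard
      port: 7000
      protocol: TCP
      targetPort: 7000
      nodePort: 30700

After applying it, the dashboard is reachable at http://<any-node-ip>:30700.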

Retrieve the dashboard password (the default user is admin):

$ kubectl get secrets  -n rook-ceph rook-ceph-dashboard-password --template={{.data.password}}|base64 -d
@sI3\&9@&CY=vBLBU0uL

Then open http://127.0.0.1:7000 in a browser to reach the ceph-dashboard.

3. Verifying RBD and CephFS Mounts from Kubernetes

3.1 RBD mount verification

Create the pool and StorageClass used for RBD.

Official documentation: https://www.rook.io/docs/rook/v1.6/ceph-block.html

$ vim rbd-storageclass.yaml
-------------------------------------------------------
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: k8s-rbd-test-pool
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 2
    requireSafeReplicaSize: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-ceph-rbs-storageclass
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph # namespace:cluster
  pool: k8s-rbd-test-pool

  # Format 1 - (deprecated) the original format for new rbd images. Understood by all versions of librbd and the kernel rbd module, but newer features such as cloning are not supported.
  # Format 2 - the second rbd format, supported by librbd since the Bobtail release and by the kernel rbd module since kernel 3.10 ("fancy" striping since kernel 4.17). Adds support for cloning and is more easily extended for future features.
  imageFormat: "2"
  imageFeatures: layering

  # The secrets contain Ceph admin credentials. These are generated automatically by the operator
  # in the same namespace as the cluster.
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph # namespace:cluster
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph # namespace:cluster
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph # namespace:cluster
  csi.storage.k8s.io/fstype: ext4
  #mounter: rbd-nbd
allowVolumeExpansion: true
reclaimPolicy: Delete

Create the RBD StorageClass and verify it:

$ kubectl apply -f rbd-storageclass.yaml 
cephblockpool.ceph.rook.io/k8s-rbd-test-pool created
storageclass.storage.k8s.io/local-ceph-rbs-storageclass created

$ kubectl get sc |grep rbd

Create a test Deployment to verify that a pod can successfully mount an RBD block volume:

$ vim rbd-deploy.yaml
-------------------------------------------------------
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-ceph-test-rbd-pvc
spec:
  storageClassName: "local-ceph-rbs-storageclass"
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-ceph-rdb-deploy
spec:
  selector:
    matchLabels:
      app: rbd
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: rbd
    spec:
      containers:
        - image: centos:7
          name: centos
          ports:
            - containerPort: 80
              name: centos
          volumeMounts:
            - name: test-rbd-pvc
              mountPath: /data/
          command: ["/bin/bash","-c","sleep 9999999"]
      volumes:
        - name: test-rbd-pvc
          persistentVolumeClaim:
            claimName: local-ceph-test-rbd-pvc

Create the RBD test Deployment:

$ kubectl apply -f rbd-deploy.yaml

# check the pod and the PVC
$ kubectl get pod,pvc
NAME READY STATUS RESTARTS AGE
pod/test-ceph-rdb-deploy-5b7cb7d44-k779p 1/1 Running 0 55s

NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/local-ceph-test-rbd-pvc Bound pvc-ad88bb0b-c848-4b82-a7f7-904b6e4223b9 20Gi RWO local-ceph-rbs-storageclass 55s

Exec into the pod and check the mount at /data:

$ kubectl exec -it test-ceph-rdb-deploy-5b7cb7d44-k779p -- bash
$ df -h /data
/dev/rbd0 20G 45M 20G 1% /data

# verify that creating and deleting files works
$ touch /data/test.txt
$ ls /data/
lost+found test.txt
$ rm -f /data/test.txt

3.2 CephFS mount verification

Official documentation: https://www.rook.io/docs/rook/v1.6/ceph-filesystem-crd.html

Create the CephFS pools and StorageClass.

$ vim cephfsf-storageclass.yaml
-------------------------------------------------------
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: myfs
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 2
  dataPools:
    - replicated:
        size: 2
  preserveFilesystemOnDelete: true
  metadataServer:
    activeCount: 1
    activeStandby: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph
  fsName: myfs
  pool: myfs-data0

  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph

reclaimPolicy: Delete

Create the CephFS StorageClass and verify it:

$ kubectl apply -f  cephfsf-storageclass.yaml
cephfilesystem.ceph.rook.io/myfs created
storageclass.storage.k8s.io/rook-cephfs created

$ kubectl get sc |grep cephfs

Create a test Deployment to verify that a pod can successfully mount a CephFS volume:

$ vim cephfs-deploy.yaml
-------------------------------------------------------
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-cephfs-pvc
spec:
  storageClassName: "rook-cephfs"
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-cephfs-deploy
spec:
  selector:
    matchLabels:
      app: cephfs
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: cephfs
    spec:
      containers:
        - image: centos:7
          name: centos
          ports:
            - containerPort: 80
              name: centos
          volumeMounts:
            - name: cephfs-test
              mountPath: /data/
          command: ["/bin/bash","-c","sleep 9999999"]
      volumes:
        - name: cephfs-test
          persistentVolumeClaim:
            claimName: test-cephfs-pvc

Create the CephFS test Deployment:

$ kubectl apply -f cephfs-deploy.yaml 
persistentvolumeclaim/test-cephfs-pvc created
deployment.apps/test-cephfs-deploy created

# check the pod and the PVC
$ kubectl get pod,pvc |grep cephfs
pod/test-cephfs-deploy-668bf75967-mrhwh 1/1 Running 0 36s
persistentvolumeclaim/test-cephfs-pvc Bound pvc-64c80ba9-4dac-4478-9207-bf3515e8b9a7 20Gi RWO rook-cephfs 36s

Exec into the pod and check the mount at /data:

$ kubectl exec -it test-cephfs-deploy-668bf75967-mrhwh -- bash

$ df -h|grep data
192.168.127.20:6789,192.168.221.191:6789,192.168.254.166:6789:/volumes/csi/csi-vol-ee996359-e5e7-11eb-867a-12321954c01b/089942c2-327d-4a57-b0da-ed91b1556535 20G 0 20G 0% /data

# verify that creating and deleting files works
$ touch /data/test.txt
$ ls /data/
lost+found test.txt
$ rm -f /data/test.txt

4. Ceph Performance Testing

For a detailed guide to fio performance testing, see the dedicated fio article elsewhere on this blog; the flags used below are briefly annotated right after this paragraph.
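
The flags shared by all of the fio commands below mean roughly the following (a short summary only, not a full fio reference):

-direct=1          bypass the page cache so the storage itself is measured rather than RAM
-iodepth=128       number of in-flight IOs kept queued by libaio
-rw=read|write     sequential read / sequential write (randread/randwrite would test random IO)
-ioengine=libaio   use Linux native asynchronous IO
-bs / -size        block size of each IO and total amount of data per job
-numjobs=1         number of parallel jobs
-runtime=600       upper bound on the test duration, in seconds
-filename          target file on the volume mounted at /data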

4.1 RBD performance from a pod

The pod used here is the RBD test pod started in section 3.1; fio is installed inside it to run the benchmarks.

Exec into the RBD test pod and install fio:

$ kubectl exec -it test-ceph-rdb-deploy-5b7cb7d44-k779p -- bash
$ yum install -y fio
$ touch /data/test.txt

4K sequential read against the RBD-backed mount:

$ fio -name=Seq_Read_IOPS_Test -group_reporting -direct=1 -iodepth=128 -rw=read -ioengine=libaio -refill_buffers -norandommap -randrepeat=0 -bs=4k -size=3G -numjobs=1 -runtime=600 -filename=/data/test.txt

# result
read: IOPS=7191, BW=28.1MiB/s (29.5MB/s)(3072MiB/109363msec)

4K sequential write against the RBD-backed mount:

$  fio -name=Seq_Read_IOPS_Test -group_reporting -direct=1 -iodepth=128 -rw=write -ioengine=libaio -refill_buffers -norandommap -randrepeat=0 -bs=4k -size=3G -numjobs=1 -runtime=600 -filename=/data/test.txt
# result
write: IOPS=5954, BW=23.3MiB/s (24.4MB/s)(3072MiB/132073msec)

4M sequential read against the RBD-backed mount:

$ fio -name=Seq_Read_IOPS_Test -group_reporting -direct=1 -iodepth=128 -rw=read -ioengine=libaio -refill_buffers -norandommap -randrepeat=0 -bs=4M -size=10G -numjobs=1 -runtime=600 -filename=/data/test.txt

# result
read: IOPS=247, BW=991MiB/s (1039MB/s)(10.0GiB/10333msec)

4M sequential write against the RBD-backed mount:

$  fio -name=Seq_Read_IOPS_Test -group_reporting -direct=1 -iodepth=128 -rw=write -ioengine=libaio -refill_buffers -norandommap -randrepeat=0 -bs=4M -size=3G -numjobs=1 -runtime=600 -filename=/data/test.txt
# result
write: IOPS=41, BW=167MiB/s (175MB/s)(3072MiB/18378msec)

4.2 CephFS performance from a pod

The pod used here is the CephFS test pod started in section 3.2; fio is installed inside it to run the benchmarks.

Exec into the CephFS test pod and install fio:

$ kubectl exec -it test-cephfs-deploy-668bf75967-mrhwh -- bash
$ yum install -y fio
$ touch /data/test.txt

4K sequential read against the CephFS mount:

$ fio -name=Seq_Read_IOPS_Test -group_reporting -direct=1 -iodepth=128 -rw=read -ioengine=libaio -refill_buffers -norandommap -randrepeat=0 -bs=4k -size=3G -numjobs=1 -runtime=600 -filename=/data/test.txt

# result
read: IOPS=6221, BW=24.3MiB/s (25.5MB/s)(3072MiB/126411msec)

4K sequential write against the CephFS mount:

$ fio -name=Seq_Read_IOPS_Test -group_reporting -direct=1 -iodepth=128 -rw=write -ioengine=libaio -refill_buffers -norandommap -randrepeat=0 -bs=4k -size=3G -numjobs=1 -runtime=600 -filename=/data/test.txt

# result
write: IOPS=5290, BW=20.7MiB/s (21.7MB/s)(3072MiB/148658msec)

4.3 Installing ceph-tools to benchmark pools

Official documentation: https://www.rook.io/docs/rook/v1.6/ceph-toolbox.html

apiVersion: apps/v1
kind: Deployment
metadata:
  name: rook-ceph-tools
  namespace: rook-ceph
  labels:
    app: rook-ceph-tools
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rook-ceph-tools
  template:
    metadata:
      labels:
        app: rook-ceph-tools
    spec:
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: rook-ceph-tools
          image: rook/ceph:v1.6.7
          command: ["/tini"]
          args: ["-g", "--", "/usr/local/bin/toolbox.sh"]
          imagePullPolicy: IfNotPresent
          env:
            - name: ROOK_CEPH_USERNAME
              valueFrom:
                secretKeyRef:
                  name: rook-ceph-mon
                  key: ceph-username
            - name: ROOK_CEPH_SECRET
              valueFrom:
                secretKeyRef:
                  name: rook-ceph-mon
                  key: ceph-secret
          volumeMounts:
            - mountPath: /etc/ceph
              name: ceph-config
            - name: mon-endpoint-volume
              mountPath: /etc/rook
      volumes:
        - name: mon-endpoint-volume
          configMap:
            name: rook-ceph-mon-endpoints
            items:
              - key: data
                path: mon-endpoints
        - name: ceph-config
          emptyDir: {}
      tolerations:
        - key: "node.kubernetes.io/unreachable"
          operator: "Exists"
          effect: "NoExecute"
          tolerationSeconds: 5
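
Save the manifest above (for example as toolbox.yaml; the file name is my own choice), apply it, and then Ceph's native tools can be used from inside the toolbox pod to inspect the cluster and benchmark a pool directly, bypassing the CSI layer. A minimal sketch, reusing the RBD pool created in section 3.1:

$ kubectl apply -f toolbox.yaml
$ kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- bash

# inside the toolbox pod
$ ceph status
$ ceph osd df
# 10-second write benchmark against the pool, keeping the objects for the read test
$ rados bench -p k8s-rbd-test-pool 10 write --no-cleanup
# sequential read benchmark, then clean up the benchmark objects
$ rados bench -p k8s-rbd-test-pool 10 seq
$ rados -p k8s-rbd-test-pool cleanup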