Building a More Stable, Highly Available TiDB Architecture with Chaos Mesh (Part 1)

2023-05-04 10:20:41

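I. Introduction

This article introduces Chaos Mesh: what the chaos engineering platform is, its core features, an architecture overview, and the experiment types it offers. It prepares the ground for building a containerized TiDB database in a later part.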

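1. About Chaos Mesh

Chaos Mesh is an open source, cloud-native chaos engineering platform. It offers a rich set of fault simulation types and strong fault scenario orchestration, so users can simulate the abnormal conditions that occur in the real world in development, test, and production environments and uncover latent problems in their systems. Chaos Mesh also provides a complete visual interface that lowers the barrier to chaos engineering: users can design their own chaos scenarios in the Web UI and monitor the status of running experiments.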

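2. Chaos Mesh core features

As an industry-leading chaos testing platform, Chaos Mesh has the following core strengths:

Solid core capabilities: Chaos Mesh originated as the core testing platform of TiDB, and from its first release it inherited a large amount of TiDB's existing testing experience.

Thoroughly validated: Chaos Mesh is used by Tencent, Meituan, and many other companies and organizations. It is also used in the test systems of well-known distributed systems such as Apache APISIX and RabbitMQ.

Easy to use: graphical operation and Kubernetes-based usage take full advantage of automation.

Cloud native: Chaos Mesh natively supports Kubernetes environments and provides powerful automation.

Rich fault simulation scenarios: Chaos Mesh covers the vast majority of basic fault simulation scenarios for testing distributed systems.

Flexible experiment orchestration: users can design their own chaos scenarios on the platform, combining multiple chaos experiments with application status checks.

High security: Chaos Mesh has a multi-layered security control design.

Active community: Chaos Mesh is a well-known open source chaos testing platform and a CNCF incubating project.

Strong extensibility: Chaos Mesh is designed so that new fault types and new functionality can be added.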

3. Architecture overview


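Chaos Mesh is built on Kubernetes CRDs (Custom Resource Definitions). It defines a separate CRD type for each kind of fault and implements a dedicated controller for each CRD object to manage the corresponding chaos experiments. Chaos Mesh consists of three main components:

Chaos Dashboard: the visualization component. It provides a user-friendly web interface through which users can operate and observe chaos experiments. Chaos Dashboard also provides an RBAC permission management mechanism.

Chaos Controller Manager: the core logic component, mainly responsible for scheduling and managing chaos experiments. It contains multiple CRD controllers, such as the Workflow Controller, the Scheduler Controller, and the controllers for each fault type.

Chaos Daemon: the main execution component. Chaos Daemon runs as a DaemonSet and has Privileged permission by default (this can be turned off). It interferes with specific network devices, file systems, kernels, and so on by entering the namespaces of the target Pod.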

II. Installation and deployment
1. Preparing the environment

1. Before installing, make sure Helm is installed.

    [root@k8s-master chaos-mesh]# helm version
    version.BuildInfo{Version:"v3.4.1", GitCommit:"c4e748886b2efe3218578e6db9be0a6e29", GitTreeState:"clean", GoVersion:"go1.14.11"}

2. Add the Chaos Mesh repository.

    helm repo add chaos-mesh https://charts.chaos-mesh.org

3. List the Chaos Mesh versions available for installation.

    helm search repo chaos-mesh
    or
    helm search repo chaos-mesh -l

4. Create the namespace.

    kubectl create ns chaos-testing

5. Install Chaos Mesh (Docker environment).

    helm install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-testing

6. Verify the installation.

    [root@k8s-master chaos-mesh]# kubectl get po -n chaos-testing
    NAME                                        READY   STATUS    RESTARTS   AGE
    chaos-controller-manager-856c96c68-6mppc    1/1     Running   0          6h49m
    chaos-controller-manager-856bc96c68-hk6nl   1/1     Running   0          6h50m
    chaos-controller-manager-86bcc96c68-q99vm   1/1     Running   0          6h50m
    chaos-daemon-ng4vx                          1/1     Running   0          6h49m
    chaos-daemon-w2w7h                          1/1     Running   0          6h50m
    chaos-dashboard-5fdf8b8bbb-nnnhz            1/1     Running   0          6h50m

Note: to ensure high availability, Chaos Mesh enables the leader-election feature by default. If you do not need it, turn it off manually with --set controllerManager.leaderElection.enabled=false.

7. Upgrade Chaos Mesh.

    helm upgrade chaos-mesh chaos-mesh/chaos-mesh -n chaos-testing

8. Uninstall Chaos Mesh.

    helm uninstall chaos-mesh -n chaos-testing

2. Managing user permissions
2.1 Logging in with a token

1. Create a user and bind permissions: open the dashboard and click the "click here to generate" link.

2. Use the token helper:
  2.1 Choose the permission scope
  2.2 Choose the role
  2.3 Generate the RBAC configuration
  2.4 Click Copy

3. Create the user and bind the permissions.

    [root@k8s-master chaos-mesh]# cat /chaosMesh/rbac.yml
    kind: ServiceAccount
    apiVersion: v1
    metadata:
      namespace: tidb
      name: account-tidb-manager-aypth
    ---
    kind: Role
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      namespace: tidb
      name: role-tidb-manager-aypth
    rules:
    - apiGroups: [""]
      resources: ["pods", "namespaces"]
      verbs: ["get", "watch", "list"]
    - apiGroups:
      - chaos-mesh.org
      resources: [ "*" ]
      verbs: ["get", "list", "watch", "create", "delete", "patch", "update"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: bind-tidb-manager-aypth
      namespace: tidb
    subjects:
    - kind: ServiceAccount
      name: account-tidb-manager-aypth
      namespace: tidb
    roleRef:
      kind: Role
      name: role-tidb-manager-aypth
      apiGroup: rbac.authorization.k8s.io

    kubectl apply -f rbac.yml

4. Generate the token and view it.

    kubectl describe -n tidb secrets account-tidb-manager-aypth
    Name:         account-tidb-manager-aypth-token-z4kvc
    Namespace:    tidb
    Labels:       <none>
    Annotations:  kubernetes.io/service-account.name: account-tidb-manager-aypth
                  kubernetes.io/service-account.uid: 9910f01-64b1-489c-be76-ab924c614a

    Type:  kubernetes.io/service-account-token

    Data
    ====
    token:      <token value omitted>
    ca.crt:     1066 bytes
    namespace:  4 bytes

5. Log in to the dashboard with the token.

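2.2 Disabling token login (insecure)

When Chaos Mesh is installed with Helm, permission authentication is enabled by default. For production and other environments with high security requirements, keep it enabled. If you only want to try Chaos Mesh and create chaos experiments quickly without authentication, set --set dashboard.securityMode=false in the Helm command:

    helm upgrade chaos-mesh chaos-mesh/chaos-mesh --namespace=chaos-testing --version 2.1.4 --set dashboard.securityMode=false

Note: to turn authentication back on, run the command again with --set dashboard.securityMode=true.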

III. Chaos engineering experiment types
(1) Preparing the experiment environment

Create the Deployment that provides the test pod.

1. Define the pod service through a Deployment.

    # cat web-show.yml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: webshow-deployment
      labels:
        app: webshow-deployment
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: webshow-deployment
      template:
        metadata:
          labels:
            app: webshow-deployment
        spec:
          containers:
            - name: webshow-deployment
              image: pingcap/web-show
              imagePullPolicy: Always
              command:
                - /usr/local/bin/web-show
                - --target-ip=${TARGET_IP}
              ports:
                - name: web-port
                  containerPort: 8081
                  hostPort: 8081

2. Create the service.

    # kubectl apply -f web-show.yml -n chaosmesh-test

3. Forward the service port through the master node.

    # nohup kubectl port-forward --address 0.0.0.0 deployment.apps/webshow-deployment 8081:8081 -n chaosmesh-test &

4. If the port misbehaves, kill the process holding it and repeat step 3.

    kill $(lsof -t -i:8081) > /dev/null 2>&1 || true

5. When everything works, the web-show page can be opened on port 8081.

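(2) Experiments
3.2.1 Pod faults: creating a POD FAILURE experiment
1. Click Experiments, then New experiment.
2. Select the experiment type in order: KUBERNETES, then POD FAULT.
3. Fill in the experiment information tab.

Note on the mode field:

mode specifies how the experiment picks its targets. The options are: one (randomly select one eligible Pod), all (select all eligible Pods), fixed (select a specified number of eligible Pods), fixed-percent (select a specified percentage of the eligible Pods), and random-max-percent (select at most a specified percentage of the eligible Pods).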
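As a rough illustration of the five modes, the sketch below narrows a list of candidate Pod names in plain Python. It is not Chaos Mesh's own selection code: the helper name `select_pods`, the use of `math.floor` for the percentage modes, and the Pod names are all assumptions made for the example.

```python
import math
import random


def select_pods(pods, mode, value=None, rng=None):
    """Pick target Pods from the eligible candidates according to `mode`.

    Hypothetical helper for illustration only; the rounding used for the
    percentage modes is an assumption, not taken from Chaos Mesh.
    """
    rng = rng or random.Random()
    if not pods:
        return []
    if mode == "one":
        # One eligible Pod, chosen at random.
        return [rng.choice(pods)]
    if mode == "all":
        # Every eligible Pod.
        return list(pods)
    if mode == "fixed":
        # A fixed number of Pods, capped at the number available.
        return rng.sample(pods, min(int(value), len(pods)))
    if mode == "fixed-percent":
        # A fixed percentage of the eligible Pods.
        return rng.sample(pods, math.floor(len(pods) * int(value) / 100))
    if mode == "random-max-percent":
        # A random number of Pods, at most the given percentage.
        max_count = math.floor(len(pods) * int(value) / 100)
        return rng.sample(pods, rng.randint(0, max_count))
    raise ValueError(f"unknown mode: {mode}")


candidates = [f"web-{i}" for i in range(10)]
print(select_pods(candidates, "one"))
print(select_pods(candidates, "fixed", 3))
print(select_pods(candidates, "fixed-percent", 50))
```

With ten candidates, `fixed` with value 3 returns three Pods and `fixed-percent` with value 50 returns five, which is why `value` only matters for the last three modes.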

4. Submit the experiment.
5. Watch the state of the pod from the Kubernetes master node.

    # watch kubectl get pod,PodChaos,StressChaos,NetworkChaos -n chaosmesh-test
    NAME                                          READY   STATUS    RESTARTS   AGE
    pod/webshow-deployment-6cbdc4cd4cd4cd-ljbtk   1/1     Running   7          6h43m

    NAME                                          AGE
    podchaos.chaos-mesh.org/pod-containers-kill   7h13m
    podchaos.chaos-mesh.org/pod-failure-01        20m
    podchaos.chaos-mesh.org/pod-kill              8h
    podchaos.chaos-mesh.org/pod-kill-all          6h43m
    podchaos.chaos-mesh.org/pod-kill03            8h

    NAME                                 DURATION
    stresschaos.chaos-mesh.org/pod-cpu   5m

    NAME                                              ACTION   DURATION
    networkchaos.chaos-mesh.org/network-delay         loss     5m
    networkchaos.chaos-mesh.org/network-delay-02      delay    5m
    networkchaos.chaos-mesh.org/pod-network-delay     delay    70s
    networkchaos.chaos-mesh.org/pod-network-loss      loss     120s
    networkchaos.chaos-mesh.org/pod-network-loss-01   loss     2m

6. While the experiment is running, the service shows the corresponding failure.
7. Check the experiment result with kubectl.

Use kubectl describe to view the Status and Events of the chaos experiment object and determine the result of the experiment.

    # kubectl describe networkchaos.chaos-mesh.org/network-delay -nchaosmesh-test
    Name:         network-delay
    Namespace:    chaosmesh-test
    Labels:       <none>
    Annotations:  experiment.chaos-mesh.org/pause: false
    API Version:  chaos-mesh.org/v1alpha1
    Kind:         NetworkChaos
    Metadata:
      Creation Timestamp:  2022-04-01T08:06:54Z
      Finalizers:
        chaos-mesh/records
      Generation:  24
      Managed Fields:
        API Version:  chaos-mesh.org/v1alpha1
        Fields Type:  FieldsV1
        fieldsV1:
          f:metadata:
            f:finalizers:
              .:
              v:"chaos-mesh/records":
          f:status:
            f:conditions:
            f:experiment:
              f:containerRecords:
              f:desiredPhase:
            f:instances:
              .:
              f:chaosmesh-test/webshow-deployment-6cbdc4cd4cd4cd-ljbtk:
        Manager:      chaos-controller-manager
        Operation:    Update
        Time:         2022-04-01T08:06:54Z
        API Version:  chaos-mesh.org/v1alpha1
        Fields Type:  FieldsV1
        fieldsV1:
          f:metadata:
            f:annotations:
              .:
              f:experiment.chaos-mesh.org/pause:
          f:spec:
            .:
            f:action:
            f:direction:
            f:duration:
            f:loss:
              .:
              f:loss:
            f:mode:
            f:selector:
              .:
              f:labelSelectors:
                .:
                f:app:
              f:namespaces:
          f:status:
            .:
            f:experiment:
        Manager:         chaos-dashboard
        Operation:       Update
        Time:            2022-04-01T08:11:11Z
      Resource Version:  1305926
      UID:               ee609703-aa48-4b5-9ff2-88b4aab967b5
    Spec:
      Action:     loss
      Direction:  to
      Duration:   5m
      Loss:
        Correlation:  0
        Loss:         80
      Mode:           all
      Selector:
        Label Selectors:
          App:  webshow-deployment
        Namespaces:
          chaosmesh-test
    Status:
      Conditions:
        Reason:
        Status:  False
        Type:    AllInjected
        Reason:
        Status:  True
        Type:    AllRecovered
        Reason:
        Status:  False
        Type:    Paused
        Reason:
        Status:  True
        Type:    Selected
      Experiment:
        Container Records:
          Id:            chaosmesh-test/webshow-deployment-6cbdc4cd4cd4cd-ljbtk
          Phase:         Not Injected
          Selector Key:  .
        Desired Phase:   Stop
      Instances:
        chaosmesh-test/webshow-deployment-6cbdc4cd4cd4cd-ljbtk:  11
    Events:                                                  <none>

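The output above has two main parts: Status and Events.

Status
Following the execution of a chaos experiment, Status provides four condition records:

Paused: the chaos experiment is currently paused.
Selected: the chaos experiment has correctly selected its targets.
AllInjected: the fault has been injected into all targets.
AllRecovered: the fault has been recovered on all targets.

The actual state of a running experiment can be inferred from these four records. For example:

When Paused, Selected, and AllRecovered are True and AllInjected is False, the experiment is paused.
When Paused is True the experiment is paused, but if Selected is False at the same time, you can further conclude that the experiment could not select any target.

Note

More information can be derived by combining the four records. The second example above is one such case: Paused alone says the experiment is paused, and Selected being False adds that no target could be selected.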
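The inference described above can be sketched as a small Python function over the four condition flags. This is only an illustration of the reasoning, not part of Chaos Mesh: the function name `describe_experiment` and the wording of the returned summaries are assumptions.

```python
def describe_experiment(conditions):
    """Summarize an experiment from its four Status condition records.

    `conditions` maps a condition type (Paused, Selected, AllInjected,
    AllRecovered) to True or False. Hypothetical helper; the summary
    strings are illustrative, not Chaos Mesh output.
    """
    paused = conditions.get("Paused", False)
    selected = conditions.get("Selected", False)
    injected = conditions.get("AllInjected", False)
    recovered = conditions.get("AllRecovered", False)

    if paused and not selected:
        return "paused, and no target could be selected"
    if paused:
        return "paused"
    if not selected:
        return "no target selected"
    if injected:
        return "running: fault injected into all targets"
    if recovered:
        return "not injecting: all targets recovered"
    return "in transition: injection or recovery is partial"


# Condition values taken from the kubectl describe output shown earlier.
print(describe_experiment({
    "AllInjected": False,
    "AllRecovered": True,
    "Paused": False,
    "Selected": True,
}))
```

For the network-delay object in the output above, this reports that the targets are recovered, which matches its `Phase: Not Injected` and `Desired Phase: Stop` fields.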
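Events
The event list records the operations performed on the chaos experiment over its whole life cycle. It helps determine the state of the experiment and troubleshoot problems.

8. Check the experiment in the dashboard.
9. When the experiment ends, check that the pod service is healthy again.
10. Archive the experiment.

To delete a chaos experiment in the Dashboard and move it to the archive history, click the Archive button of that experiment.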

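3.2.2 Simulating network faults

Make sure the connection between the Controller Manager and the Chaos Daemon stays open during network fault injection; otherwise the fault cannot be recovered.

If you use the Net Emulation features, make sure the Linux kernel has the NET_SCH_NETEM module. On CentOS it can be installed through the kernel-modules-extra package; most other distributions install the module by default.

(1) Simulating LOSS
1. Select New experiment, then Network attack, then LOSS.

loss: the probability of packet loss. Range: [0, 100].

correlation: how strongly the packet loss probability correlates with the outcome for the previous packet. Range: [0, 100].

direction: one of from, to, or both. It selects packets coming from target, packets sent to target, or both.

externalTargets: network targets outside Kubernetes, given as IPv4 addresses or domain names. It only works together with direction: to. Examples: 8.8.8.8, baidu.com.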
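To show how these fields fit together, here is a minimal Python sketch that assembles a NetworkChaos object for the loss action and prints it as JSON (which `kubectl apply -f` also accepts). The field layout follows the `kubectl describe` output and parameter descriptions in this article; the helper name `build_loss_chaos`, its defaults, and the object name `loss-demo` are assumptions for illustration.

```python
import json


def build_loss_chaos(name, namespace, app_label, loss, correlation=0,
                     direction="to", duration="5m", external_targets=None):
    """Build a NetworkChaos manifest for the loss action as a dict.

    Hypothetical helper for illustration; defaults are assumptions.
    """
    # loss and correlation are percentages in the range [0, 100].
    for field, number in (("loss", loss), ("correlation", correlation)):
        if not 0 <= float(number) <= 100:
            raise ValueError(f"{field} must be within [0, 100]")
    if direction not in ("from", "to", "both"):
        raise ValueError("direction must be from, to, or both")

    manifest = {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "loss",
            "mode": "all",
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "loss": {"loss": str(loss), "correlation": str(correlation)},
            "direction": direction,
            "duration": duration,
        },
    }
    if external_targets:
        # externalTargets only works together with direction: to.
        if direction != "to":
            raise ValueError("externalTargets requires direction 'to'")
        manifest["spec"]["externalTargets"] = list(external_targets)
    return manifest


chaos = build_loss_chaos("loss-demo", "chaosmesh-test", "webshow-deployment",
                         loss=80, external_targets=["8.8.8.8"])
print(json.dumps(chaos, indent=2))
```

Validating the ranges and the externalTargets/direction pairing before submitting catches the two mistakes the parameter descriptions warn about.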

2. Fill in the experiment information and submit it.
3. Enter the container and run ping; packets are now being lost.

    [root@k8s-master ~]# kubectl exec -it   pod/webshow-deployment-6cbdc4cd4cd4cd-ljbtk -nchaosmesh-test /bin/sh
    kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
    sh-4.2# ping 8.8.8.8
    PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
    64 bytes from 8.8.8.8: icmp_seq=54 ttl=108 time=53.5 ms
    ^C
    --- 8.8.8.8 ping statistics ---
    67 packets transmitted, 1 received, 98% packet loss, time 67604ms
    rtt min/avg/max/mdev = 53.598/53.598/53.598/0.000 ms
    sh-4.2# ping 8.8.8.8
    PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.

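(2) Simulating a delay scenario
1. Create the experiment configuration.
2. Review the experiment information and click Start.
3. Verify the result by running ping from inside the pod, from the master node, and from the worker node that hosts the pod.

  1. Enter the container and ping an external address.

    # kubectl exec -it   pod/webshow-deployment-6cbdc4cd4cd4cd-ljbtk -nchaosmesh-test /bin/sh
    # ping 8.8.8.8

  2. From the master node, ping the pod's IP address.
  3. From the worker node that hosts the pod, ping the pod's IP address.

Summary: the pings show added latency or packet loss.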

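Notes:

Field description

| Parameter | Type | Description | Default | Required | Example |
| action | string | The specific fault type. netem, delay, loss, duplicate, and corrupt correspond to the net emulation types; partition means a network partition; bandwidth means a bandwidth limit | None | Yes | partition |
| target | Selector | Used together with direction so that the chaos only affects some of the packets | None | No | |
| direction | enum | One of from, to, or both. Selects packets coming from target, packets sent to target, or both | | No | both |
| mode | string | How the experiment picks its targets: one (randomly select one eligible Pod), all (select all eligible Pods), fixed (select a specified number of eligible Pods), fixed-percent (select a specified percentage of the eligible Pods), random-max-percent (select at most a specified percentage of the eligible Pods) | None | Yes | one |
| value | string | Depends on mode and supplies its parameter. For example, when mode is fixed-percent, value is the percentage of Pods | None | No | |
| containerNames | []string | Names of the containers to inject into | None | No | ["nginx"] |
| selector | struct | The target Pods for fault injection; see the documentation on defining the experiment scope | None | Yes | |
| externalTargets | []string | Network targets outside Kubernetes, given as IPv4 addresses or domain names. Only works together with direction: to | None | No | 1.1.1.1, www.google.com |
| device | string | The network device to affect | None | No | "eth0" |

Parameters of the delay action:

| Parameter | Type | Description | Default | Required | Example |
| latency | string | The length of the delay | | No | 2ms |
| correlation | string | How strongly the current delay correlates with the previous delay. Range: [0, 100] | | No | |
| jitter | string | The range within which the delay varies | | No | 1ms |
| reorder | Reorder(#Reorder) | The state of network packet reordering | | No | |

For details, see https://chaos-mesh.org/zh/docs/simulate-network-chaos-on-kubernetes/#Loss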

3.2.3 Simulating stress scenarios
1. In the dashboard, select Experiments, then New experiment, then Stress test.
2. Check the CPU load from inside the pod and on the worker node that hosts the pod.

  1. Check the load inside the container.

    [root@k8s-master ~]# kubectl exec -it   pod/webshow-deployment-6cbdc4cd4cd4cd-ljbtk -nchaosmesh-test /bin/sh
    kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
    sh-4.2# top
    top - 03:17:58 up 7 days, 23:04,  0 users,  load average: 6.33, 1.99, 0.75
    Tasks:  16 total,   5 running,  11 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 93.8 us,  4.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  1.6 si,  0.0 st
    KiB Mem :  8154912 total,   240500 free,  2111328 used,  5803084 buff/cache
    KiB Swap:        0 total,        0 free,        0 used.  5718908 avail Mem

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
       46 root      20   0  291064 253740   1344 R 100.0  3.1   1:02.55 stress-ng-vm
       44 root      20   0  291064 253740   1344 R 100.0  3.1   1:03.47 stress-ng-vm
       43 root      20   0  291064 253740   1344 R  93.3  3.1   1:04.10 stress-ng-vm
       45 root      20   0  291064 253740   1344 R  60.0  3.1   1:02.40 stress-ng-vm
        1 root      20   0  112976  15316   6820 S   0.0  0.2   0:25.18 web-show
       34 root      20   0   41060   7840   5452 S   0.0  0.1   0:00.00 stress-ng
       35 root      20   0   41064   2448     52 S   0.0  0.0   0:00.00 stress-ng-vm
       36 root      20   0   41704   9252   3540 S   0.0  0.1   0:01.97 stress-ng-cpu
       37 root      20   0   41064   2448     52 S   0.0  0.0   0:00.00 stress-ng-vm
       38 root      20   0   41704   9192   3476 S   0.0  0.1   0:01.62 stress-ng-cpu
       39 root      20   0   41064   2448     52 S   0.0  0.0   0:00.00 stress-ng-vm
       40 root      20   0   41704   9252   3540 S   0.0  0.1   0:01.64 stress-ng-cpu
       41 root      20   0   41064   2448     52 S   0.0  0.0   0:00.00 stress-ng-vm
       42 root      20   0   41704   9192   3476 S   0.0  0.1   0:01.66 stress-ng-cpu
       47 root      20   0   11832   2684   2456 S   0.0  0.0   0:00.01 sh
    sh-4.2# top
    top - 03:19:13 up 7 days, 23:05,  0 users,  load average: 8.91, 3.79, 1.48
    Tasks:  16 total,   5 running,  11 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 98.6 us,  1.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
    KiB Mem :  8154912 total,   239884 free,  2111728 used,  5803300 buff/cache
    KiB Swap:        0 total,        0 free,        0 used.  5718512 avail Mem

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
       46 root      20   0  291064 253740   1344 R  94.3  3.1   2:11.74 stress-ng-vm
       45 root      20   0  291064 253740   1344 R  93.7  3.1   2:13.74 stress-ng-vm
       44 root      20   0  291064 253740   1344 R  93.3  3.1   2:13.47 stress-ng-vm
       43 root      20   0  291064 253740   1344 R  91.3  3.1   2:13.76 stress-ng-vm
       38 root      20   0   41704   9192   3476 S  11.7  0.1   0:03.40 stress-ng-cpu
        1 root      20   0  112976  15316   6820 S   0.0  0.2   0:25.21 web-show
       34 root      20   0   41060   7840   5452 S   0.0  0.1   0:00.00 stress-ng
       35 root      20   0   41064   2448     52 S   0.0  0.0   0:00.00 stress-ng-vm
       36 root      20   0   41704   9252   3540 S   0.0  0.1   0:03.71 stress-ng-cpu
       37 root      20   0   41064   2448     52 S   0.0  0.0   0:00.00 stress-ng-vm
       39 root      20   0   41064   2448     52 S   0.0  0.0   0:00.00 stress-ng-vm
       40 root      20   0   41704   9252   3540 S   0.0  0.1   0:02.83 stress-ng-cpu
       41 root      20   0   41064   2448     52 S   0.0  0.0   0:00.00 stress-ng-vm
       42 root      20   0   41704   9192   3476 S   0.0  0.1   0:02.85 stress-ng-cpu
       47 root      20   0   11832   2804   2440 S   0.0  0.0   0:00.01 sh

    [1]+  Stopped(SIGSTOP)        top
    sh-4.2# uptime
     03:19:22 up 7 days, 23:05,  0 users,  load average: 8.95, 3.97, 1.56

  2. Check the load on the worker node.

    [root@k8s-node1 ~]# top
    top - 11:19:55 up 7 days, 23:06,  1 user,  load average: 8.36, 4.30, 1.75
    Tasks: 189 total,   6 running, 115 sleeping,   1 stopped,   0 zombie
    %Cpu(s): 98.5 us,  1.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
    KiB Mem :  8154912 total,   144968 free,  2029056 used,  5980888 buff/cache
    KiB Swap:        0 total,        0 free,        0 used.  5623656 avail Mem

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     5484 root      20   0  291064 253740   1344 R  97.3  3.1   2:52.08 stress-ng-vm
     5485 root      20   0  322328 284904   1344 R  93.7  3.5   2:52.45 stress-ng-vm
     5486 root      20   0  322328 284860   1344 R  91.0  3.5   2:50.38 stress-ng-vm
     5483 root      20   0  322328 284792   1344 R  89.4  3.5   2:52.82 stress-ng-vm
     5476 root      20   0   41704   9252   3540 S   9.6  0.1   0:05.00 stress-ng-cpu
    13676 tidb      20   0   10.5g 209764  58128 S   9.3  2.6 432:52.32 pd-server
    22361 root      20   0 1986632 125576  70520 S   3.0  1.5 307:20.30 kubelet
    30096 root      20   0  752296  57752  35756 S   1.3  0.7  22:41.88 kube-scheduler
    31181 root      20   0  753256  62176  35336 S   1.0  0.8   5:01.58 chaos-controlle
     2340 root      20   0 1695896 108680  53292 S   0.7  1.3 107:31.90 dockerd
      873 root      20   0   21544   2704   2456 S   0.3  0.0   0:20.71 irqbalance
    25407 root      20   0  711016  14412   6096 S   0.3  0.2   0:54.73 containerd-shim
        1 root      20   0  191568   5648   3700 S   0.0  0.1   0:46.20 systemd
        2 root      20   0       0      0      0 S   0.0  0.0   0:00.38 kthreadd
        3 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_gp
    [root@k8s-node1 ~]# uptime
     11:19:58 up 7 days, 23:06,  1 user,  load average: 8.49, 4.40, 1.79

3. If the experiment is paused midway, the CPU load drops; when it is resumed, the load climbs again and stays high until the experiment ends.
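(3) Workflow

Simulating production-level failures usually means combining several faults. For this, Chaos Mesh provides Chaos Mesh Workflow, a built-in workflow engine. With it you can run different chaos experiments serially or in parallel.

Chaos Mesh Workflow currently supports the following features:

Serial orchestration
Parallel orchestration
Custom tasks
Conditional branches

Typical use cases:

Use parallel orchestration to inject several NetworkChaos faults at once and simulate a complex network environment
Run a health check inside a serial orchestration and use a conditional branch to decide whether to run the remaining steps

The design of Chaos Mesh Workflow borrows from Argo Workflow to some extent. If you are familiar with Argo Workflow, you can get started with Chaos Mesh Workflow quickly.

For details, see https://chaos-mesh.org/zh/docs/create-chaos-mesh-workflow/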

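(4) Schedule

In Kubernetes, Chaos Mesh uses a Schedule object to describe a scheduled task.

The name of a Schedule object must not exceed 57 characters, because the chaos experiments it creates get 6 extra random characters appended to the name. The name of a Schedule object that contains a Workflow must not exceed 51 characters, because the Workflow appends another 6 random characters to the names it creates.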
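The two limits can be checked before a Schedule is submitted. The sketch below is a hypothetical pre-flight check, not something Chaos Mesh provides; the constant and function names are assumptions. The numbers themselves come from the paragraph above and are consistent with a 63-character name limit minus 6 (or 12) appended characters, although the article does not state that derivation.

```python
# Limits quoted in the article; the names of these constants are assumptions.
MAX_SCHEDULE_NAME = 57           # experiments append 6 random characters
MAX_WORKFLOW_SCHEDULE_NAME = 51  # a Workflow appends another 6


def schedule_name_ok(name, contains_workflow=False):
    """Return True if `name` fits the Schedule name length limit."""
    limit = MAX_WORKFLOW_SCHEDULE_NAME if contains_workflow else MAX_SCHEDULE_NAME
    return 0 < len(name) <= limit


print(schedule_name_ok("schedule-01"))
print(schedule_name_ok("s" * 55, contains_workflow=True))
```

The second call reports a failure because a 55-character name is fine for a plain Schedule but too long once a Workflow is involved.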

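The schedule field

The schedule field specifies when the experiment runs.

    # ┌───────────── minute (0 - 59)
    # │ ┌───────────── hour (0 - 23)
    # │ │ ┌───────────── day of the month (1 - 31)
    # │ │ │ ┌───────────── month (1 - 12)
    # │ │ │ │ ┌───────────── day of the week (0 - 6) (Sunday to Saturday; on some systems 7 is also Sunday)
    # │ │ │ │ │
    # │ │ │ │ │
    # │ │ │ │ │
    # * * * * *

| Input | Description | Equivalent |
| @yearly (or @annually) | Run once a year at midnight on January 1 | 0 0 1 1 * |
| @monthly | Run once a month at midnight on the first day of the month | 0 0 1 * * |
| @weekly | Run once a week at midnight on Sunday | 0 0 * * 0 |
| @daily (or @midnight) | Run once a day at midnight | 0 0 * * * |
| @hourly | Run once an hour at the beginning of the hour | 0 * * * * |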
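The mapping between the predefined inputs and the five-field form can be written down directly. The Python sketch below expands a predefined input and labels the five fields; it does not evaluate the expression or compute run times, and the names `CRON_MACROS`, `FIELD_NAMES`, and `expand_schedule` are assumptions for the example.

```python
# Predefined inputs and their five-field equivalents, as listed above.
CRON_MACROS = {
    "@yearly": "0 0 1 1 *",
    "@annually": "0 0 1 1 *",
    "@monthly": "0 0 1 * *",
    "@weekly": "0 0 * * 0",
    "@daily": "0 0 * * *",
    "@midnight": "0 0 * * *",
    "@hourly": "0 * * * *",
}

# Field order of the five-field form, left to right.
FIELD_NAMES = ("minute", "hour", "day_of_month", "month", "day_of_week")


def expand_schedule(expression):
    """Expand a predefined input and label the five fields of a schedule."""
    expression = expression.strip()
    expression = CRON_MACROS.get(expression, expression)
    fields = expression.split()
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected 5 fields, got {len(fields)}: {expression!r}")
    return dict(zip(FIELD_NAMES, fields))


print(expand_schedule("@weekly"))
print(expand_schedule("*/2 * * * *"))
```

The second call uses `*/2 * * * *`, the every-two-minutes schedule that appears in the Schedule object later in this section.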

1. Create a new schedule.
2. Fill in the schedule period, the concurrency policy, and the other settings.
3. Submit the experiment.
4. The schedule is set to run every two minutes.

You can check the CPU load of the pod and of the worker node that hosts it, and view the schedule information from the master node.

  1. View the information on the master node.

    kubectl get pod,PodChaos,StressChaos,schedule -n chaosmesh-test -owide        Sat Apr  2 12:03:35 2022

    NAME                                          READY   STATUS    RESTARTS   AGE   IP            NODE        NOMINATED NODE   READINESS GATES
    pod/webshow-deployment-6cbdc4cd4cd4cd-ljbtk   1/1     Running   7          21h   10.244.3.39   k8s-node1   <none>           <none>

    NAME                                          AGE
    podchaos.chaos-mesh.org/pod-containers-kill   21h
    podchaos.chaos-mesh.org/pod-failure-01        15h
    podchaos.chaos-mesh.org/pod-kill              23h
    podchaos.chaos-mesh.org/pod-kill-all          21h
    podchaos.chaos-mesh.org/pod-kill03            22h

    NAME                                           DURATION
    stresschaos.chaos-mesh.org/cpu-test-01         10m
    stresschaos.chaos-mesh.org/pod-cpu             5m
    stresschaos.chaos-mesh.org/schedule-01-j9n5f   10m

    NAME                                  AGE
    schedule.chaos-mesh.org/schedule-01   33m

  View the details of the schedule:

    [root@k8s-master ~]# kubectl describe schedule.chaos-mesh.org/schedule-01 -nchaosmesh-test
    Name:         schedule-01
    Namespace:    chaosmesh-test
    Labels:       <none>
    Annotations:  experiment.chaos-mesh.org/pause: false
    API Version:  chaos-mesh.org/v1alpha1
    Kind:         Schedule
    Metadata:
      Creation Timestamp:  2022-04-02T03:30:07Z
      Generation:          23
      Managed Fields:
        API Version:  chaos-mesh.org/v1alpha1
        Fields Type:  FieldsV1
        fieldsV1:
          f:status:
            f:active:
            f:time:
        Manager:      chaos-controller-manager
        Operation:    Update
        Time:         2022-04-02T03:32:00Z
        API Version:  chaos-mesh.org/v1alpha1
        Fields Type:  FieldsV1
        fieldsV1:
          f:metadata:
            f:annotations:
              .:
              f:experiment.chaos-mesh.org/pause:
          f:spec:
            .:
            f:concurrencyPolicy:
            f:historyLimit:
            f:schedule:
            f:startingDeadlineSeconds:
            f:stressChaos:
              .:
              f:duration:
              f:mode:
              f:selector:
                .:
                f:namespaces:
              f:stressors:
                .:
                f:cpu:
                  .:
                  f:workers:
                f:memory:
                  .:
                  f:size:
                  f:workers:
            f:type:
          f:status:
        Manager:         chaos-dashboard
        Operation:       Update
        Time:            2022-04-02T03:36:49Z
      Resource Version:  1513442
      UID:               7a198cb5-feb6-4403-ab37-b3ceabeab1e954e
    Spec:
      Concurrency Policy:         Forbid
      History Limit:              1
      Schedule:                   */2 * * * *
      Starting Deadline Seconds:  600
      Stress Chaos:
        Duration:  10m
        Mode:      all
        Selector:
          Namespaces:
            chaosmesh-test
        Stressors:
          Cpu:
            Workers:  3
          Memory:
            Size:     1024m
            Workers:  3
      Type:           StressChaos
    Status:
      Active:
        API Version:       chaos-mesh.org/v1alpha1
        Kind:              StressChaos
        Name:              schedule-01-98lvp
        Namespace:         chaosmesh-test
        Resource Version:  1513440
        UID:               abcedc4b-1cb4-48ef-923e-f3cc9cb694
      Time:                2022-04-02T04:04:28Z
    Events:
      Type     Reason   Age   From           Message
      ----     ------   ----  ----           -------
      Normal   Spawned  35m   schedule-cron  Create new object: schedule-01-j9n5f
      Normal   Updated  35m   schedule-cron  Successfully update lastScheduleTime of resource
      Warning  Forbid   33m   schedule-cron  Forbid spawning new job because: schedule-01-j9n5f is still running
      Normal   Spawned  3m5s  schedule-cron  Create new object: schedule-01-98lvp
      Normal   Updated  3m5s  schedule-cron  Successfully update lastScheduleTime of resource
      Warning  Forbid   93s   schedule-cron  Forbid spawning new job because: schedule-01-98lvp is still running

  2. Check the load inside the pod.

    top - 04:06:14 up 7 days, 23:52,  0 users,  load average: 7.65, 2.91, 1.67
    Tasks:  14 total,   7 running,   6 sleeping,   1 stopped,   0 zombie
    %Cpu(s): 98.4 us,  1.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
    KiB Mem :  8154912 total,   276480 free,  2087400 used,  5791032 buff/cache
    KiB Swap:        0 total,        0 free,        0 used.  5742844 avail Mem

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
       76 root      20   0  374392 335024   1336 R  73.4  4.1   1:04.01 stress-ng-vm
       74 root      20   0   41704   8012   3544 R  68.1  0.1   1:07.17 stress-ng-cpu
       71 root      20   0   41704   8012   3544 R  65.8  0.1   1:07.89 stress-ng-cpu
       75 root      20   0  374392 335024   1336 R  63.5  4.1   1:08.54 stress-ng-vm
       72 root      20   0  374392 335024   1336 R  57.5  4.1   1:08.74 stress-ng-vm
       69 root      20   0   41704   8012   3544 R  51.5  0.1   1:05.62 stress-ng-cpu
        1 root      20   0  112976  15316   6820 S   0.0  0.2   0:26.62 web-show
       47 root      20   0   11832   2804   2440 S   0.0  0.0   0:00.01 sh
       54 root      20   0   56192   3776   3264 T   0.0  0.0   0:00.00 top
       56 root      20   0   56196   3720   3208 R   0.0  0.0   0:00.63 top
       67 root      20   0   41056   5628   5280 S   0.0  0.1   0:00.00 stress-ng
       68 root      20   0   41060    420     64 S   0.0  0.0   0:00.00 stress-ng-vm
       70 root      20   0   41060    420     64 S   0.0  0.0   0:00.00 stress-ng-vm
       73 root      20   0   41060    420     64 S   0.0  0.0   0:00.00 stress-ng-vm

  3. Check the load on the worker node.

    [root@k8s-node1 ~]# uptime
     12:06:51 up 7 days, 23:53,  1 user,  load average: 7.82, 3.46, 1.90

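5. Pausing a scheduled task

Unlike a CronJob, pausing a Schedule not only stops it from creating new experiments but also pauses the experiments it has already created.

  1. If you temporarily do not want the schedule to create chaos experiments, add the experiment.chaos-mesh.org/pause=true annotation to the Schedule object. You can add it with the kubectl command line tool:

    kubectl annotate -n $NAMESPACE schedule $NAME experiment.chaos-mesh.org/pause=true

  Result:

    schedule/$NAME annotated

  2. To resume, remove the annotation with the following command:

    kubectl annotate -n $NAMESPACE schedule $NAME experiment.chaos-mesh.org/pause-

  Result:

    schedule/$NAME annotated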
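If pausing and resuming is scripted, the two kubectl invocations differ only in the annotation suffix. The Python sketch below just builds and prints the command lines; it does not contact a cluster. The helper name `annotate_command` and the sample namespace and schedule name are assumptions taken from the examples in this article.

```python
import shlex

PAUSE_ANNOTATION = "experiment.chaos-mesh.org/pause"


def annotate_command(namespace, name, pause):
    """Build the kubectl command that pauses or resumes a Schedule.

    Hypothetical helper: "=true" adds the pause annotation, and a trailing
    "-" removes it again, as in the two commands shown above.
    """
    suffix = "=true" if pause else "-"
    return ["kubectl", "annotate", "-n", namespace,
            "schedule", name, PAUSE_ANNOTATION + suffix]


# Print the commands instead of running them, so the sketch has no side effects.
print(shlex.join(annotate_command("chaosmesh-test", "schedule-01", pause=True)))
print(shlex.join(annotate_command("chaosmesh-test", "schedule-01", pause=False)))
```

Returning the command as a list keeps it ready for `subprocess.run` without shell quoting issues, should you choose to execute it.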

Note: where the mode types are defined

https://github.com/chaos-mesh/chaos-mesh/blob/master/api/v1alpha1/selector.go

    const (
      // OneMode represents that the system will do the chaos action on one object selected randomly.
      OneMode SelectorMode = "one"
      // AllMode represents that the system will do the chaos action on all objects
      // regardless of status (not ready or not running pods includes).
      // Use this label carefully.
      AllMode SelectorMode = "all"
      // FixedMode represents that the system will do the chaos action on a specific number of running objects.
      FixedMode SelectorMode = "fixed"
      // FixedPercentMode to specify a fixed % that can be inject chaos action.
      FixedPercentMode SelectorMode = "fixed-percent"
      // RandomMaxPercentMode to specify a maximum % that can be inject chaos action.
      RandomMaxPercentMode SelectorMode = "random-max-percent"
    )
