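# Kubernetes High Availability and Disaster Recovery Configuration

Building a fault-tolerant cluster

## 1. High Availability Overview

Kubernetes high availability (HA) is the ability of a cluster to keep services running through node failures, network interruptions, and similar faults.

### 1.1 HA Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                 Control Plane High Availability                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
│  │   Master 1   │  │   Master 2   │  │   Master 3   │           │
│  │  API Server  │  │  API Server  │  │  API Server  │           │
│  │     etcd     │  │     etcd     │  │     etcd     │           │
│  │  Controller  │  │  Controller  │  │  Controller  │           │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘           │
└─────────┼─────────────────┼─────────────────┼───────────────────┘
          │                 │                 │
          ▼                 ▼                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                          Load Balancer                          │
└───────────────────────────┬─────────────────────────────────────┘
                            │
          ┌─────────────────┼─────────────────┐
          ▼                 ▼                 ▼
  ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
  │   Worker 1    │ │   Worker 2    │ │   Worker 3    │
  │ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
  │ │   Pod A   │ │ │ │   Pod B   │ │ │ │   Pod C   │ │
  │ │   Pod D   │ │ │ │   Pod E   │ │ │ │   Pod F   │ │
  │ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │
  └───────────────┘ └───────────────┘ └───────────────┘
```

### 1.2 HA Components

| Component | HA strategy |
|---|---|
| API Server | Multiple instances + load balancing |
| etcd | Distributed cluster |
| Controller Manager | Multiple instances with leader election |
| Scheduler | Multiple instances with leader election |

## 2. etcd High Availability Configuration

### 2.1 etcd Cluster Configuration

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: etcd-server
spec:
  containers:
    - name: etcd
      image: quay.io/coreos/etcd:v3.5.0
      command:
        - etcd
        - --name=etcd-0
        - --initial-advertise-peer-urls=http://etcd-0:2380
        - --listen-peer-urls=http://0.0.0.0:2380
        - --listen-client-urls=http://0.0.0.0:2379
        - --advertise-client-urls=http://etcd-0:2379
        - --initial-cluster=etcd-0=http://etcd-0:2380,etcd-1=http://etcd-1:2380,etcd-2=http://etcd-2:2380
        - --initial-cluster-state=new
        - --data-dir=/var/lib/etcd
```

### 2.2 etcd Backup Configuration

```bash
#!/bin/bash
set -euo pipefail

# Timestamped snapshot name so earlier backups are not overwritten
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR=/backup/etcd

mkdir -p "$BACKUP_DIR"

# Take the snapshot, then print its hash, revision, and size as a sanity check
etcdctl snapshot save "$BACKUP_DIR/snapshot_$TIMESTAMP.db"
etcdctl snapshot status "$BACKUP_DIR/snapshot_$TIMESTAMP.db"
```

## 3. Control Plane High Availability Configuration

### 3.1 API Server Configuration

```yaml
# Illustrative only: the built-in `kubernetes` Service in the default namespace
# is created and reconciled by the API server itself and normally has no selector.
apiVersion: v1
kind: Service
metadata:
  name: kubernetes
spec:
  type: ClusterIP
  clusterIP: 10.96.0.1
  ports:
    - port: 443
      targetPort: 6443
  selector:
    component: kube-apiserver
```

### 3.2 Controller Manager Configuration

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
spec:
  containers:
    - name: kube-controller-manager
      image: k8s.gcr.io/kube-controller-manager:v1.25.0
      command:
        - kube-controller-manager
        - --leader-elect=true   # only the elected leader is active at a time
        - --controllers=*
        - --cluster-name=my-cluster
```

## 4. Worker Node High Availability Configuration

### 4.1 Pod Replica Configuration

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: highly-available-app
spec:
  replicas: 3
  selector:            # required by apps/v1; must match the template labels
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          # One Pod per node. Combined with maxUnavailable: 0, a rollout needs
          # at least one node beyond the replica count to place the surge Pod.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - my-app
              topologyKey: kubernetes.io/hostname
      containers:
        - name: app
          image: my-app:1.0   # placeholder image, not part of the original example
```

### 4.2 Pod Disruption Budget

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
```

## 5. Disaster Recovery Configuration

### 5.1 Scheduled Backup Configuration

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
spec:
  schedule: "0 2 * * *"   # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: etcd-backup
              image: quay.io/coreos/etcd:v3.5.0
              command:
                - /bin/sh
                - -c
                # Endpoint and TLS flags are omitted here; each run overwrites
                # the previous /backup/snapshot.db.
                - etcdctl snapshot save /backup/snapshot.db
              volumeMounts:
                - name: backup-volume
                  mountPath: /backup
          volumes:
            - name: backup-volume
              persistentVolumeClaim:
                claimName: backup-pvc
          restartPolicy: OnFailure
```

### 5.2 Data Restore Configuration

```bash
#!/bin/bash
# Restore a snapshot into a fresh single-member cluster
etcdctl snapshot restore /backup/snapshot.db \
  --name=etcd-restore \
  --initial-advertise-peer-urls=http://etcd-restore:2380 \
  --initial-cluster=etcd-restore=http://etcd-restore:2380 \
  --data-dir=/var/lib/etcd
```

## 6. Network High Availability Configuration

### 6.1 Ingress High Availability

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ingress
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-ingress
  template:
    metadata:
      labels:
        app: nginx-ingress   # must match spec.selector
    spec:
      containers:
        - name: nginx-ingress
          image: nginx/nginx-ingress:latest
```

### 6.2 Service High Availability

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP
```

## 7. Monitoring and Alerting Configuration

### 7.1 HA Monitoring

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-monitor
spec:
  selector:
    matchLabels:
      app: etcd
  endpoints:
    - port: metrics
      interval: 30s
```

### 7.2 HA Alert Rules

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ha-alerts
spec:
  groups:
    - name: etcd.rules
      rules:
        - alert: EtcdMembersDown
          # Fires when at least one etcd target has been down for 5 minutes
          expr: count(up{job="etcd"} == 0) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "etcd member down"
```

## 8. High Availability Best Practices

### 8.1 Multi-Zone Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-zone-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          # "preferred" rather than "required": a required per-zone rule allows
          # only one Pod per zone, so 6 replicas would need 6 zones to schedule.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - my-app
                topologyKey: topology.kubernetes.io/zone
      containers:
        - name: app
          image: my-app:1.0   # placeholder image, not part of the original example
```

### 8.2 Regular Drills

```bash
#!/bin/bash
echo "Starting high availability drill"

# Mark the node unschedulable, evict its Pods, then return it to service
kubectl cordon node-1
kubectl drain node-1 --ignore-daemonsets
kubectl uncordon node-1

echo "High availability drill complete"
```

## 9. Summary

A high availability configuration needs to cover:

- Control plane HA: deploy multiple master nodes
- Data store HA: configure an etcd cluster
- Application HA: run multiple replicas with anti-affinity
- Disaster recovery: take regular backups and rehearse restores
- Monitoring and alerting: detect and respond to failures promptly

Build out a complete HA setup and run failure drills regularly.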
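
The three-member etcd layout used throughout this article follows from etcd's majority quorum rule: a cluster of N voting members needs floor(N/2) + 1 of them healthy to accept writes, so it tolerates N minus that many failures. A minimal sketch of the arithmetic is below; the function names `etcd_quorum` and `etcd_fault_tolerance` are hypothetical helpers for illustration and are not part of etcd or kubectl.

```bash
#!/usr/bin/env bash
# Sketch: quorum arithmetic for an etcd cluster of N voting members.
# The helper names below are illustrative, not part of any etcd tooling.

# Members that must be healthy for the cluster to accept writes.
etcd_quorum() {
  local members=$1
  echo $(( members / 2 + 1 ))
}

# Members that can fail while the cluster still has quorum.
etcd_fault_tolerance() {
  local members=$1
  echo $(( members - (members / 2 + 1) ))
}

# Print one row per cluster size from 1 to 5.
for n in 1 2 3 4 5; do
  printf 'members=%d quorum=%d tolerates=%d\n' \
    "$n" "$(etcd_quorum "$n")" "$(etcd_fault_tolerance "$n")"
done
```

Running the loop shows that going from 3 to 4 members does not raise the number of tolerated failures, which is why odd cluster sizes such as 3 or 5 are the usual choice.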