
Kubernetes Pod Scheduling Strategies in Depth
1. Overview
1.1 Background
Pod scheduling is one of Kubernetes' core mechanisms: it decides which node a Pod ultimately runs on. The default scheduler, kube-scheduler, makes its decisions through a series of Filtering and Scoring algorithms, but the default behavior is often not enough in production.
Problems that come up constantly in practice: a database Pod lands on a node without SSDs and IO performance suffers; two high-load services end up on the same node and fight over resources; GPU nodes fill up with ordinary business Pods while the jobs that actually need GPUs cannot be scheduled.
All of these are solved with scheduling policy. Kubernetes provides nodeSelector, nodeAffinity, podAffinity/podAntiAffinity, taints/tolerations, topologySpreadConstraints and other mechanisms; this article walks through each one and gives production-ready configuration.
1.2 Key Characteristics
- Layered scheduling control: from the simple nodeSelector up to fully custom schedulers, control at different granularities
- Hard and soft constraints combined: requiredDuringScheduling is a hard constraint (no match, no scheduling); preferredDuringScheduling is soft (best effort; the Pod still schedules if unmet)
- Topology awareness: topologySpreadConstraints spreads Pods across zones, nodes, racks and other topology domains for cross-failure-domain deployment
- Preemption: PriorityClass lets high-priority Pods preempt the resources of low-priority Pods
1.3 Typical Use Cases
- Scenario 1: schedule IO-intensive Pods (databases, caches) onto SSD nodes and compute-intensive Pods onto high-CPU nodes
- Scenario 2: spread the replicas of one service across nodes/zones so a single point of failure cannot take the whole service down
- Scenario 3: exclusive scheduling of special hardware such as GPUs and FPGAs, keeping ordinary Pods off dedicated resources
- Scenario 4: in a multi-tenant cluster, isolate each team's Pods onto its own node pool
1.4 Environment Requirements
2. Detailed Steps
2.1 Preparation
2.1.1 Node Label Planning
Scheduling policy is built on node labels, so plan the label scheme first:
# Show existing node labels
kubectl get nodes --show-labels
# Label by hardware type
kubectl label node k8s-worker-01 disktype=ssd
kubectl label node k8s-worker-02 disktype=ssd
kubectl label node k8s-worker-03 disktype=hdd
# Label by workload purpose
kubectl label node k8s-worker-01 workload-type=database
kubectl label node k8s-worker-02 workload-type=application
kubectl label node k8s-worker-03 workload-type=application
# Label by availability zone (for multi-datacenter deployments)
kubectl label node k8s-worker-01 topology.kubernetes.io/zone=zone-a
kubectl label node k8s-worker-02 topology.kubernetes.io/zone=zone-b
kubectl label node k8s-worker-03 topology.kubernetes.io/zone=zone-c
# GPU node labels
kubectl label node k8s-gpu-01 accelerator=nvidia-tesla-v100
kubectl label node k8s-gpu-02 accelerator=nvidia-tesla-a100
# Verify the labels
kubectl get nodes -L disktype,workload-type,topology.kubernetes.io/zone
Note: adopt a naming convention for label keys; the <domain>/<name> form, e.g. company.com/team=backend, is recommended. Kubernetes built-in labels use the kubernetes.io and k8s.io prefixes; do not use these prefixes for custom labels.
2.1.2 Understanding the Scheduling Flow
kube-scheduler schedules in two phases:
- Filtering: remove nodes that cannot host the Pod, e.g. insufficient resources, nodeSelector mismatch, untolerated taints
- Scoring: rank the nodes that passed filtering and pick the one with the highest score
# Inspect the scheduler's decisions (requires a higher log level)
# Edit /etc/kubernetes/manifests/kube-scheduler.yaml
# and add --v=4 to the command
# Then read the logs
kubectl logs -n kube-system kube-scheduler-k8s-master-01 --tail=50
2.1.3 Precedence of Scheduling Mechanisms
When several scheduling mechanisms are present at once, they take effect in this order:
- nodeName (highest precedence): names the target node directly and bypasses the scheduler (see the sketch after this list)
- taints/tolerations: node-side filtering; Pods without a matching toleration are excluded outright
- nodeSelector: simple label matching
- nodeAffinity: more flexible node affinity rules
- podAffinity/podAntiAffinity: affinity and anti-affinity between Pods
- topologySpreadConstraints: topology spread constraints
- resource requests: whether the node's remaining capacity covers the Pod's requests
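For illustration, a minimal Pod pinned with nodeName (a sketch; the node name comes from the labeling examples above). The scheduler never sees this Pod, so resource checks are skipped too:
apiVersion: v1
kind: Pod
metadata:
  name: pinned-pod
spec:
  nodeName: k8s-worker-01   # binds directly to this node, bypassing kube-scheduler
  containers:
  - name: app
    image: nginx:1.24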
2.2 Core Configuration
2.2.1 nodeSelector (the Simplest Scheduling Constraint)
nodeSelector is the most basic mechanism; it matches nodes by label key-value pairs:
# File: nginx-nodeselector.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ssd
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-ssd
  template:
    metadata:
      labels:
        app: nginx-ssd
    spec:
      nodeSelector:
        disktype: ssd
      containers:
      - name: nginx
        image: nginx:1.24
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
kubectl apply -f nginx-nodeselector.yaml
# Verify the Pods landed only on ssd nodes
kubectl get pods -l app=nginx-ssd -o wide
Note: nodeSelector is a hard constraint; if no node carries the label, the Pod stays Pending forever. For production, consider pairing it with (or replacing it by) a soft nodeAffinity preference, as sketched below.
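A sketch of that softer alternative, expressing the same ssd preference with the nodeAffinity syntax covered in 2.2.2:
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd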
2.2.2 nodeAffinity (Node Affinity)
nodeAffinity is more flexible than nodeSelector, with several operators and both hard and soft constraints:
# File: app-node-affinity.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-with-affinity
  namespace: default
spec:
  replicas: 6
  selector:
    matchLabels:
      app: app-affinity
  template:
    metadata:
      labels:
        app: app-affinity
    spec:
      affinity:
        nodeAffinity:
          # Hard constraint: must be scheduled to zone-a or zone-b
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - zone-a
                - zone-b
          # Soft constraints: prefer ssd nodes; weights range 1-100
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
          - weight: 20
            preference:
              matchExpressions:
              - key: workload-type
                operator: In
                values:
                - application
      containers:
      - name: app
        image: nginx:1.24
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
Operator reference (a combined example follows the list):
- In: the label value is in the list
- NotIn: the label value is not in the list
- Exists: the label exists (its value is ignored)
- DoesNotExist: the label does not exist
- Gt: the label value is greater than the given value (integers only)
- Lt: the label value is less than the given value (integers only)
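A short sketch combining Exists and Gt; the cpu-cores label is hypothetical, and Gt/Lt values are string-encoded integers:
nodeSelectorTerms:
- matchExpressions:
  - key: disktype
    operator: Exists        # the node just needs the label; any value matches
  - key: cpu-cores          # hypothetical numeric label set by your provisioning
    operator: Gt
    values: ["8"]           # match only nodes labeled with cpu-cores > 8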
Note: the IgnoredDuringExecution suffix in requiredDuringSchedulingIgnoredDuringExecution means that once a Pod is running, it is not evicted even if the node's labels change. Kubernetes has planned a RequiredDuringExecution variant, but it is not implemented yet.
2.2.3 podAffinity and podAntiAffinity (Inter-Pod Affinity and Anti-Affinity)
These control scheduling relationships between Pods. Typical uses: co-locating a web app with its cache on one node to cut network latency, and spreading the replicas of one service across different nodes.
# File: web-cache-affinity.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      affinity:
        # Pod affinity: co-locate with redis-cache on the same node
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - redis-cache
              topologyKey: kubernetes.io/hostname
        # Pod anti-affinity: spread this service's replicas across nodes
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-frontend
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: nginx:1.24
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
Notes:
- topologyKey: kubernetes.io/hostname uses the node as the topology domain; Pods on the same node are in the same domain
- topologyKey: topology.kubernetes.io/zone uses the availability zone as the topology domain
- a hard podAntiAffinity constraint caps the replica count at the node count; 3 replicas need at least 3 nodes
Warning: podAffinity/podAntiAffinity has O(N^2) computational complexity, where N is the number of Pods in the cluster. In clusters with more than 5000 Pods, heavy use of podAffinity can push scheduling latency from milliseconds to seconds.
2.2.4 Taints and Tolerations
Taints repel Pods from the node's side; tolerations accept taints from the Pod's side. Combined, they dedicate nodes to specific workloads.
# Taint the GPU nodes so only GPU workloads schedule there
kubectl taint nodes k8s-gpu-01 gpu=true:NoSchedule
kubectl taint nodes k8s-gpu-02 gpu=true:NoSchedule
# Add a NoExecute taint to a node under maintenance to evict its Pods
kubectl taint nodes k8s-worker-03 maintenance=true:NoExecute
# Show a node's taints
kubectl describe node k8s-gpu-01 | grep -A 5 Taints
# Remove a taint
kubectl taint nodes k8s-worker-03 maintenance=true:NoExecute-
Configuring tolerations on the Pod side:
# File: gpu-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-training-job
  namespace: ml-training
spec:
  template:
    spec:
      tolerations:
      # Tolerate the gpu taint
      - key: "gpu"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      nodeSelector:
        accelerator: nvidia-tesla-v100
      containers:
      - name: training
        image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            cpu: "4"
            memory: "16Gi"
      restartPolicy: Never
Taint effect reference:
- NoSchedule: new Pods are not scheduled onto the node; existing Pods are unaffected
- PreferNoSchedule: the scheduler tries to avoid the node but may still use it when resources are tight
- NoExecute: new Pods are not scheduled, and existing Pods that do not tolerate the taint are evicted; tolerationSeconds sets a grace period before eviction
# Tolerate the NoExecute taint, but accept eviction after at most 300 seconds
tolerations:
- key: "maintenance"
operator: "Equal"
value: "true"
effect: "NoExecute"
tolerationSeconds: 300
2.2.5 topologySpreadConstraints (Topology Spread Constraints)
GA since Kubernetes 1.19, this controls Pod distribution across topology domains with finer granularity than podAntiAffinity:
# File: app-topology-spread.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-spread
  namespace: default
spec:
  replicas: 9
  selector:
    matchLabels:
      app: app-spread
  template:
    metadata:
      labels:
        app: app-spread
    spec:
      topologySpreadConstraints:
      # Spread evenly across zones, max skew 1 (hard)
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: app-spread
      # Spread evenly across nodes, max skew 1 (soft)
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: app-spread
      containers:
      - name: app
        image: nginx:1.24
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
Parameter reference:
- maxSkew: the maximum difference in Pod count between topology domains; 1 means any two domains differ by at most one Pod
- topologyKey: the node label key that defines the topology domain
- whenUnsatisfiable: behavior when the constraint cannot be met, either DoNotSchedule (hard) or ScheduleAnyway (soft)
With 9 replicas across 3 zones the distribution is zone-a=3, zone-b=3, zone-c=3. If zone-c has only one node and it runs out of resources, DoNotSchedule leaves some Pods Pending, while ScheduleAnyway spreads as evenly as possible but tolerates the skew.
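On clusters running 1.25 or later (see 4.2.3), the beta minDomains field can additionally require a minimum number of domains; a sketch, assuming three zones must always be counted:
topologySpreadConstraints:
- maxSkew: 1
  minDomains: 3                  # only meaningful together with DoNotSchedule
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: app-spread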
2.2.6 PriorityClass (Priority and Preemption)
A high-priority Pod can preempt the resources of lower-priority Pods:
# Define the priority classes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-production
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Core production services; may preempt lower-priority Pods"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: normal-production
value: 500000
globalDefault: true
preemptionPolicy: PreemptLowerPriority
description: "Regular production services"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-job
value: 100000
globalDefault: false
preemptionPolicy: Never
description: "Batch jobs; never preempt other Pods"
Referencing it from a Pod:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: core-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: core-api
  template:
    metadata:
      labels:
        app: core-api
    spec:
      priorityClassName: critical-production
      containers:
      - name: api
        image: myapp:v1.0
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"
Warning: preemptionPolicy: PreemptLowerPriority evicts lower-priority Pods to free resources, and the evicted Pods receive SIGTERM. Make sure applications handle graceful shutdown correctly, or data will be lost. In production, give batch jobs preemptionPolicy: Never. A graceful-shutdown sketch follows.
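A minimal graceful-shutdown sketch for a preemptible workload; the sleep duration is illustrative, not a prescription:
spec:
  terminationGracePeriodSeconds: 60   # time allowed between SIGTERM and SIGKILL
  containers:
  - name: api
    image: myapp:v1.0
    lifecycle:
      preStop:
        exec:
          # give load balancers time to drop the Pod before the process exits
          command: ["sh", "-c", "sleep 10"]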
2.3 Launch and Verification
2.3.1 Verifying Scheduling Results
# Which node did each Pod land on?
kubectl get pods -o wide -l app=app-spread
# Scheduling events for one Pod
kubectl describe pod <pod-name> | grep -A 10 Events
# Reasons for scheduling failures
kubectl get events --field-selector reason=FailedScheduling -A
# Node resource allocation
kubectl describe node k8s-worker-01 | grep -A 20 "Allocated resources"
2.3.2 Scheduling Simulation Tests
# Create a server-side dry-run Pod to check whether it would schedule
kubectl run test-schedule --image=nginx:1.24 --dry-run=server -o yaml \
--overrides='{
"spec": {
"nodeSelector": {"disktype": "ssd"},
"containers": [{"name": "test", "image": "nginx:1.24", "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}}}]
}
}'
# Show allocatable resources per node
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
CPU_ALLOC:.status.allocatable.cpu,\
MEM_ALLOC:.status.allocatable.memory,\
PODS_ALLOC:.status.allocatable.pods
2.3.3 Verifying Topology Spread
# Rough view of Pod placement (the ZONE column below just repeats the node name)
kubectl get pods -l app=app-spread -o custom-columns=\
NAME:.metadata.name,\
NODE:.spec.nodeName,\
ZONE:.spec.nodeName
# More precise: resolve the zone from the node's label
for node in $(kubectl get pods -l app=app-spread -o jsonpath='{.items[*].spec.nodeName}'); do
  zone=$(kubectl get node $node -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}')
  echo "Node: $node, Zone: $zone"
done
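To quickly count Pods per node (handy when checking the hostname spread constraint), a one-liner sketch:
kubectl get pods -l app=app-spread \
  -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c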
3. Example Code and Configuration
3.1 Complete Configuration Examples
3.1.1 Multi-Tenant Node Pool Isolation
In production, several teams often share one cluster; taints plus nodeAffinity isolate per-team node pools:
# File: namespace-resource-setup.yaml
# Step 1: create the team namespaces
apiVersion: v1
kind: Namespace
metadata:
  name: team-backend
  labels:
    team: backend
---
apiVersion: v1
kind: Namespace
metadata:
  name: team-data
  labels:
    team: data
---
# Step 2: a ResourceQuota per team
apiVersion: v1
kind: ResourceQuota
metadata:
  name: backend-quota
  namespace: team-backend
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: data-quota
  namespace: team-data
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.cpu: "80"
    limits.memory: 160Gi
    pods: "200"
Label and taint the nodes:
# Backend team node pool
kubectl label node k8s-worker-{01..05} node-pool=backend
kubectl taint nodes k8s-worker-{01..05} node-pool=backend:NoSchedule
# Data team node pool
kubectl label node k8s-worker-{06..10} node-pool=data
kubectl taint nodes k8s-worker-{06..10} node-pool=data:NoSchedule
# Shared pool (no taint; any Pod may schedule here)
kubectl label node k8s-worker-{11..15} node-pool=shared
Team Deployment template:
# File: backend-app-template.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-api
  namespace: team-backend
spec:
  replicas: 5
  selector:
    matchLabels:
      app: backend-api
  template:
    metadata:
      labels:
        app: backend-api
    spec:
      tolerations:
      - key: "node-pool"
        operator: "Equal"
        value: "backend"
        effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-pool
                operator: In
                values:
                - backend
                - shared
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - backend-api
              topologyKey: kubernetes.io/hostname
      containers:
      - name: api
        image: backend-api:v2.1.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: "1"
            memory: 1Gi
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
3.1.2 Automatic Injection of Scheduling Rules
A Kyverno policy can inject scheduling rules into the Pods of specific namespaces automatically, so not every Deployment has to carry them:
# File: kyverno-scheduling-policy.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: inject-node-affinity-backend
spec:
  rules:
  - name: add-backend-scheduling
    match:
      any:
      - resources:
          kinds:
          - Pod
          namespaces:
          - team-backend
    mutate:
      patchStrategicMerge:
        spec:
          tolerations:
          - key: "node-pool"
            operator: "Equal"
            value: "backend"
            effect: "NoSchedule"
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: node-pool
                    operator: In
                    values:
                    - backend
                    - shared
Note: Kyverno must be installed separately (helm install kyverno kyverno/kyverno -n kyverno --create-namespace). This is easier to maintain than writing scheduling rules into every Deployment: teams only deal with business configuration while the platform team manages scheduling policy centrally.
3.2 Real-World Cases
Case 1: scheduling strategy for database Pods
Scenario: a MySQL primary/replica cluster runs on K8s. The primary needs an SSD node with plenty of memory; replicas can use ordinary nodes. Primary and replicas must not share a node, so one node failure cannot take out both.
Implementation:
# File: mysql-master-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql-master
  namespace: database
spec:
  serviceName: mysql-master
  replicas: 1
  selector:
    matchLabels:
      app: mysql
      role: master
  template:
    metadata:
      labels:
        app: mysql
        role: master
    spec:
      priorityClassName: critical-production
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
              - key: workload-type
                operator: In
                values:
                - database
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - mysql
            topologyKey: kubernetes.io/hostname
      containers:
      - name: mysql
        image: mysql:8.0.35
        ports:
        - containerPort: 3306
        env:
        - name: MYSQL_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysql-secret
              key: root-password
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
          limits:
            cpu: "4"
            memory: 8Gi
        volumeMounts:
        - name: mysql-data
          mountPath: /var/lib/mysql
  volumeClaimTemplates:
  - metadata:
      name: mysql-data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: local-ssd
      resources:
        requests:
          storage: 100Gi
---
# MySQL replicas
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql-slave
  namespace: database
spec:
  serviceName: mysql-slave
  replicas: 2
  selector:
    matchLabels:
      app: mysql
      role: slave
  template:
    metadata:
      labels:
        app: mysql
        role: slave
    spec:
      priorityClassName: normal-production
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - mysql
            topologyKey: kubernetes.io/hostname
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: mysql
            role: slave
      containers:
      - name: mysql
        image: mysql:8.0.35
        ports:
        - containerPort: 3306
        resources:
          requests:
            cpu: "1"
            memory: 2Gi
          limits:
            cpu: "2"
            memory: 4Gi
        volumeMounts:
        - name: mysql-data
          mountPath: /var/lib/mysql
  volumeClaimTemplates:
  - metadata:
      name: mysql-data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: standard
      resources:
        requests:
          storage: 100Gi
Result:
NAME             READY   STATUS    NODE            ZONE
mysql-master-0   1/1     Running   k8s-worker-01   zone-a   (SSD + database node)
mysql-slave-0    1/1     Running   k8s-worker-02   zone-b   (different node, different zone)
mysql-slave-1    1/1     Running   k8s-worker-03   zone-c   (different node, different zone)
Case 2: canary releases via mixed scheduling strategies
Scenario: during a canary release, the new version's Pods are first scheduled onto dedicated canary nodes; once validated, the rollout extends to all nodes. Labels and scheduling rules control the canary scope.
Implementation steps:
Label the canary nodes:
kubectl label node k8s-worker-01 canary=true
Canary Deployment:
# File: app-canary.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
      version: canary
  template:
    metadata:
      labels:
        app: myapp
        version: canary
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: canary
                operator: In
                values:
                - "true"
      containers:
      - name: myapp
        image: myapp:v2.0.0-rc1
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
After the canary passes validation, roll out fully:
# Lift the canary constraint by updating the stable Deployment's image
kubectl set image deployment/myapp-stable myapp=myapp:v2.0.0 -n production
# Scale the canary Deployment down
kubectl scale deployment/myapp-canary --replicas=0 -n production
# Remove the canary label
kubectl label node k8s-worker-01 canary-
4. Best Practices and Caveats
4.1 Best Practices
4.1.1 Performance Optimization
Limit the scope of podAffinity: podAffinity/podAntiAffinity is expensive to compute at scheduling time. In a 500+ node cluster, scheduling one Pod with podAffinity has been measured at 200ms instead of 5ms. Wherever topologySpreadConstraints can express the same intent, prefer it.
# Not recommended: spreading via podAntiAffinity
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchLabels:
        app: myapp
    topologyKey: kubernetes.io/hostname
# Recommended: topologySpreadConstraints instead
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: myapp
Set resource requests sensibly: the scheduler decides on requests, not limits. Requests set too high waste capacity (average CPU utilization of only 15% has been measured); set too low, nodes oversubscribe and Pods get OOMKilled. A good starting point is the P95 of actual usage.
# Check actual Pod resource usage as a reference for requests
kubectl top pods -n production --sort-by=cpu
kubectl top pods -n production --sort-by=memory
Rebalance with Descheduler: after node scale-out or Pod churn, cluster load drifts out of balance. Descheduler evicts Pods that no longer satisfy the current policy so the scheduler can place them afresh.
# Install Descheduler
helm repo add descheduler https://kubernetes-sigs.github.io/descheduler/
helm install descheduler descheduler/descheduler -n kube-system \
  --set schedule="*/5 * * * *"
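The strategies themselves are configured through a DeschedulerPolicy. A sketch enabling LowNodeUtilization under the v1alpha1 schema; the thresholds are illustrative and need tuning per cluster:
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:          # nodes below all of these count as underutilized
          cpu: 20
          memory: 20
          pods: 20
        targetThresholds:    # nodes above any of these count as overutilized
          cpu: 50
          memory: 50
          pods: 50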
4.1.2 Security Hardening
Restrict direct nodeName use: nodeName bypasses every scheduler check, including resource checks. In production, keep ordinary users from setting the nodeName field via admission policy.
# OPA/Gatekeeper constraint: forbid nodeName
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sDenyNodeName
metadata:
  name: deny-nodename
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
    excludedNamespaces: ["kube-system"]
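The K8sDenyNodeName kind above must be defined by a ConstraintTemplate installed beforehand; a minimal sketch of such a template (the Rego and naming are illustrative, not part of Gatekeeper's built-in library):
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sdenynodename
spec:
  crd:
    spec:
      names:
        kind: K8sDenyNodeName
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package k8sdenynodename
      violation[{"msg": msg}] {
        nodeName := input.review.object.spec.nodeName
        nodeName != ""
        msg := sprintf("setting spec.nodeName (%v) directly is not allowed", [nodeName])
      }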
Control access to PriorityClasses: restrict who can create and use high-priority PriorityClasses, so ordinary users cannot create high-priority Pods that preempt core services.
# RBAC: only admins may use the critical-production priority class
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: use-critical-priority
rules:
- apiGroups: ["scheduling.k8s.io"]
  resources: ["priorityclasses"]
  resourceNames: ["critical-production"]
  verbs: ["get", "list"]
Protect node taints from tampering: if the taint on a critical node (e.g. a GPU node) is removed by mistake, ordinary Pods flood in. Intercept taint modifications on designated nodes with an admission-control webhook.
4.1.3 High-Availability Configuration
HA option 1: run core services with at least 3 replicas, spread them across nodes with a hard podAntiAffinity, and across availability zones with topologySpreadConstraints
HA option 2: use a PodDisruptionBudget (PDB) to cap how many Pods may be unavailable at once, so node maintenance does not interrupt the service
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: myapp
Backup strategy: keep scheduling configuration (PriorityClasses, node labels, taints) under GitOps, synced with ArgoCD or FluxCD
4.2 Caveats
4.2.1 Configuration Caveats
Warning: a misconfigured scheduling policy can leave Pods unschedulable or place them on the wrong nodes; validate changes in a test environment first.
- Note: in nodeAffinity's requiredDuringSchedulingIgnoredDuringExecution, multiple nodeSelectorTerms are ORed, while the matchExpressions inside one nodeSelectorTerm are ANDed. Mixing these up yields unexpected placements (see the sketch after this list).
- Note: a hard podAntiAffinity caps the replica count. With topologyKey kubernetes.io/hostname, replicas cannot exceed the number of usable nodes; any surplus Pods stay Pending forever.
- Note: the labelSelector in topologySpreadConstraints must match the Pod's own labels, or the constraint silently does nothing. This failure is subtle: the Pods schedule fine but end up unevenly distributed.
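A sketch of the OR/AND semantics from the first note: term 1 requires ssd AND zone-a together, while term 2 (database) admits a node on its own:
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
  - matchExpressions:                  # term 1: disktype=ssd AND zone=zone-a
    - key: disktype
      operator: In
      values: ["ssd"]
    - key: topology.kubernetes.io/zone
      operator: In
      values: ["zone-a"]
  - matchExpressions:                  # OR term 2: workload-type=database
    - key: workload-type
      operator: In
      values: ["database"]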
4.2.2 Common Errors
Error                          | First diagnostic
0/6 nodes are available: ...   | kubectl get nodes --show-labels (check that nodes actually carry the labels your constraints expect)
4.2.3 Compatibility
- Version compatibility: topologySpreadConstraints was Beta in 1.18 and GA in 1.19; the minDomains field is Beta in 1.25. Confirm the cluster version before use.
- Platform compatibility: managed Kubernetes offerings usually set the topology.kubernetes.io/zone label automatically; self-hosted clusters must set it manually.
- Component dependencies: the Descheduler version must match the K8s version; 0.27.x supports K8s 1.25-1.28.
5. Troubleshooting and Monitoring
5.1 Troubleshooting
5.1.1 Viewing Logs
# kube-scheduler logs
kubectl logs -n kube-system -l component=kube-scheduler --tail=100
# Scheduling events
kubectl get events -A --field-selector reason=FailedScheduling --sort-by='.lastTimestamp'
# Scheduling events for one Pod
kubectl describe pod <pod-name> -n <namespace> | grep -A 20 Events
# Detailed node resource allocation
kubectl describe node <node-name> | grep -A 30 "Allocated resources"
5.1.2 Diagnosing Common Problems
Problem 1: Pod Pending with Insufficient cpu
# Diagnostics
kubectl describe pod <pod-name> | grep -A 5 "Events"
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU_REQ:.status.allocatable.cpu,MEM_REQ:.status.allocatable.memory
# Allocated resources per node
for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  echo "=== $node ==="
  kubectl describe node $node | grep -A 5 "Allocated resources"
done
Resolution:
- Check whether the Pod's resource requests are set too high
- Check whether nodes have enough allocatable headroom (allocatable minus already allocated)
- Consider adding nodes or tuning the resource settings of existing Pods
Problem 2: scheduling policy has no effect; Pods are not distributed as expected
# Diagnostics
kubectl get pod <pod-name> -o yaml | grep -A 50 "affinity"
kubectl get pod <pod-name> -o yaml | grep -A 20 "topologySpreadConstraints"
# Check whether the node labels are correct
kubectl get nodes --show-labels | grep <expected-label>
Resolution:
- Confirm the YAML indentation; the nesting under affinity is easy to get wrong
- Confirm the labelSelector matches the Pod's labels
- Validate the manifest with kubectl apply --dry-run=server
Problem 3: node drain hangs
Symptoms: the kubectl drain command runs for a long time without returning.
Diagnosis:
# Which Pods are blocking the drain?
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --dry-run=client
# Check the PDBs
kubectl get pdb -A
Resolution:
- DaemonSet Pods: skip them with --ignore-daemonsets
- Pods using emptyDir: add --delete-emptydir-data
- PDB limits: scale the affected workloads up before draining
- Pods with a very long terminationGracePeriodSeconds: cap the wait with --timeout=300s (see the combined invocation after this list)
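A combined drain invocation using the flags above; the node name and timeout are illustrative:
kubectl drain k8s-worker-03 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=300s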
5.1.3 Debug Mode
# Raise the scheduler log level
# Edit /etc/kubernetes/manifests/kube-scheduler.yaml
# and add --v=10 to the command (most verbose; debugging only)
# Follow the scheduler's decision process for a Pod
kubectl logs -n kube-system kube-scheduler-k8s-master-01 | grep "pod-name"
# Simulate scheduling with kube-scheduler-simulator (installed separately)
# https://github.com/kubernetes-sigs/kube-scheduler-simulator
# Node metrics from metrics-server (scheduler metrics live on the scheduler's own /metrics endpoint)
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
5.2 Performance Monitoring
5.2.1 Key Metrics
# Note: the scheduler_* metrics below are exposed by kube-scheduler's own metrics
# endpoint; kubectl get --raw /metrics reads the API server's metrics, so in most
# setups you should scrape the scheduler directly (e.g. with Prometheus) instead.
# Scheduling latency
kubectl get --raw /metrics | grep scheduler_scheduling_algorithm_duration_seconds
# Scheduling queue length
kubectl get --raw /metrics | grep scheduler_pending_pods
# Preemption victims
kubectl get --raw /metrics | grep scheduler_preemption_victims
# Node resource utilization
kubectl top nodes
5.2.2 Metric Descriptions
5.2.3 Prometheus Alerting Rules
# File: scheduler-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: scheduler-alerts
  namespace: monitoring
spec:
  groups:
  - name: kube-scheduler
    rules:
    - alert: SchedulerHighLatency
      expr: |
        histogram_quantile(0.99,
          sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket[5m])) by (le)
        ) > 0.5
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Scheduler P99 latency exceeds 500ms"
    - alert: PodsPendingTooLong
      expr: |
        sum(scheduler_pending_pods{queue="active"}) > 10
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "More than 10 pods pending for over 5 minutes"
    - alert: SchedulerUnhealthy
      expr: absent(up{job="kube-scheduler"} == 1)
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "kube-scheduler is not running"
    - alert: NodeHighAllocation
      expr: |
        (1 - sum(kube_node_status_allocatable{resource="cpu"} - kube_pod_container_resource_requests{resource="cpu"}) by (node)
          / sum(kube_node_status_allocatable{resource="cpu"}) by (node)) > 0.85
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Node {{ $labels.node }} CPU allocation exceeds 85%"
    - alert: FrequentPreemption
      expr: |
        increase(scheduler_preemption_victims[1h]) > 5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "More than 5 preemption events in the last hour"
5.3 Backup and Restore
5.3.1 Backup Strategy
#!/bin/bash
# Scheduling configuration backup script
# File: /opt/scripts/scheduling-config-backup.sh
set -euo pipefail
BACKUP_DIR="/data/scheduling-backup/$(date +%Y%m%d)"
mkdir -p "${BACKUP_DIR}"
# Back up PriorityClasses
kubectl get priorityclass -o yaml > "${BACKUP_DIR}/priorityclasses.yaml"
# Back up PDBs
kubectl get pdb -A -o yaml > "${BACKUP_DIR}/pdbs.yaml"
# Back up node labels and taints
for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  kubectl get node "$node" -o jsonpath='{.metadata.labels}' > "${BACKUP_DIR}/${node}-labels.json"
  kubectl get node "$node" -o jsonpath='{.spec.taints}' > "${BACKUP_DIR}/${node}-taints.json"
done
# Back up Kyverno policies (if in use)
kubectl get clusterpolicy -o yaml > "${BACKUP_DIR}/kyverno-policies.yaml" 2>/dev/null || true
echo "[$(date)] Scheduling config backup completed: ${BACKUP_DIR}"
5.3.2 Restore Procedure
1. Stop deployments: pause business rollouts to avoid scheduling conflicts during the restore
2. Restore the data: kubectl apply -f ${BACKUP_DIR}/priorityclasses.yaml
3. Verify integrity: confirm the PriorityClasses are back with kubectl get priorityclass
4. Restore node configuration: reapply labels and taints node by node (a sketch follows this list)
5. Verify scheduling: create a test Pod and confirm the policies take effect
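A label-restore sketch that works from the backup files written in 5.3.1; it assumes jq is installed and BACKUP_DIR points at the chosen backup directory:
for f in "${BACKUP_DIR}"/*-labels.json; do
  node=$(basename "$f" -labels.json)
  # re-apply every key=value pair from the saved label map
  jq -r 'to_entries[] | "\(.key)=\(.value)"' "$f" | while read -r kv; do
    kubectl label node "$node" "$kv" --overwrite
  done
done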
6. Summary
6.1 Key Takeaways
- Takeaway 1: nodeSelector fits simple cases; nodeAffinity fits cases needing hard and soft constraints together. Both rely on node labels, so label planning is the foundation.
- Takeaway 2: a hard podAntiAffinity caps replicas at the node count and is computationally heavy in large clusters; prefer topologySpreadConstraints where possible.
- Takeaway 3: taints/tolerations control scheduling from the node side and suit dedicated nodes (GPU nodes, database nodes); NoExecute evicts existing Pods.
- Takeaway 4: use PriorityClass preemption with care; give batch jobs preemptionPolicy: Never and core services high priorities.
- Takeaway 5: topologySpreadConstraints is the first choice for cross-failure-domain deployment in production; maxSkew: 1 with DoNotSchedule enforces strictly even distribution.
6.2 Further Learning
- Custom schedulers: when the built-in policies cannot express a requirement, build a custom scheduler on the Scheduling Framework extension points. Resource: Scheduling Framework. Tip: validate the logic with a scheduler Extender first, then consider a Framework plugin.
- Descheduler in depth: configure strategies such as LowNodeUtilization and RemoveDuplicates to rebalance cluster load automatically. Resource: Descheduler GitHub. Tip: test Descheduler policies outside production first to avoid evicting core services.
- Volcano batch scheduler: a batch scheduler for AI and big-data workloads, supporting gang scheduling (a group of Pods is scheduled in its entirety or not at all).
6.3 References
- Kubernetes scheduling documentation: comprehensive description of the scheduling machinery
- kube-scheduler source code: for understanding the scheduling algorithms
- Descheduler project: Pod rescheduling tool
- Volcano project: batch scheduler
Appendix
A. Command Cheat Sheet
# Node label management
kubectl label node <node> key=value                               # add a label
kubectl label node <node> key=value --overwrite                   # change a label
kubectl label node <node> key-                                    # remove a label
kubectl get nodes --show-labels                                   # show all labels
kubectl get nodes -L key1,key2                                    # show specific label columns
# Taint management
kubectl taint nodes <node> key=value:NoSchedule                   # add a taint
kubectl taint nodes <node> key=value:NoSchedule-                  # remove a taint
kubectl taint nodes <node> key-                                   # remove all taints with this key
kubectl describe node <node> | grep Taints                        # show taints
# Scheduling diagnostics
kubectl get events --field-selector reason=FailedScheduling -A    # failed-scheduling events
kubectl describe pod <pod> | grep -A 10 Events                    # Pod events
kubectl get pods -o wide                                          # Pod-to-node mapping
kubectl top nodes                                                 # node resource usage
kubectl describe node <node> | grep -A 20 "Allocated resources"   # allocated resources
# Node maintenance
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data   # drain a node
kubectl uncordon <node>                                           # re-enable scheduling
kubectl cordon <node>                                             # mark unschedulable
B. Configuration Parameter Details
nodeAffinity operators:
Operator     | Meaning                                     | Example
In           | label value is in the list                  | values: ["ssd", "nvme"]
NotIn        | label value is not in the list              | values: ["hdd"]
Exists       | label exists, value ignored                 | no values field
DoesNotExist | label does not exist                        | no values field
Gt           | label value greater than the given integer  | values: ["100"]
Lt           | label value less than the given integer     | values: ["50"]
Taint effect comparison:
Effect           | New Pods                                  | Existing Pods
NoSchedule       | not scheduled onto the node               | unaffected
PreferNoSchedule | avoided when possible, may still schedule | unaffected
NoExecute        | not scheduled                             | evicted unless tolerated (tolerationSeconds delays eviction)
topologySpreadConstraints parameters:
Parameter         | Description
maxSkew           | maximum Pod-count difference allowed between topology domains
topologyKey       | node label key that defines the topology domain
whenUnsatisfiable | DoNotSchedule (hard constraint) or ScheduleAnyway (soft constraint)
labelSelector     | selects the Pods counted for skew; must match the Pod's own labels
minDomains        | minimum number of domains (Beta since 1.25)
matchLabelKeys    | label keys whose values are read from the incoming Pod to narrow the matching set
C. Glossary