How can I tell whether a k8s OOMKilled happened because the node ran out of memory and killed the pod, or because the pod itself used more memory than its declared limit and was killed for that?

爱打瞌睡的三角龙

Sub-question: analyzing the cause of a k8s pod OOM. How can I tell whether a k8s OOMKilled happened because the node ran out of memory and killed the pod, or because the pod itself used more memory than the limit it declared? Can this be seen directly from `kubectl describe`?

> From the output of the describe subcommand below, I can only see OOMKilled; I cannot tell whether the OOMKilled was caused by the pod itself or by something external.

```
─➤ kb describe -n xxxxx pod image-vector-api-server-prod-5fffcd4884-j9447
Name:             image-vector-api-server-prod-5fffcd4884-j9447
Namespace:        mediawise
Priority:         0
Service Account:  default
Node:             cn-hangzhou.xxxxx/xxxx
Start Time:       Wed, 01 Nov 2023 17:25:54 +0800
Labels:           app=image-vector-api
                  pod-template-hash=5fffcd4884
Annotations:      kubernetes.io/psp: ack.privileged
Status:           Running
IP:               xxxxx
IPs:
  IP:  xxxxx
Controlled By:  ReplicaSet/image-vector-api-server-prod-5fffcd4884
Containers:
  image-vector-api:
    Container ID:   docker://78dc88a880d769d5cb4a553672d8a4b4a0b69b720fcbf9380096a77d279c5645
    Image:          registry-vpc.cn-xxxx.xxxx.com/xxx-cn/image-vector:master-xxxxxx
    Image ID:       docker-pullable://registry-vpc.cn-hangzhou.aliyuncs.com/xxx-cn/image-vector@sha256:058c43265845a975d7cc537911ddcc203fa26f608714fe8b388d5dfd1eb02d92
    Port:           9205/TCP
    Host Port:      0/TCP
    Command:
      python
      api.py
    State:          Running
      Started:      Wed, 01 Nov 2023 18:35:49 +0800
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 01 Nov 2023 18:25:34 +0800
      Finished:     Wed, 01 Nov 2023 18:35:47 +0800
    Ready:          True
    Restart Count:  8
    Limits:
      cpu:     2
      memory:  2000Mi
    Requests:
      cpu:     10m
      memory:  1000Mi
    Liveness:     http-get http://:9205/ delay=60s timeout=1s period=30s #success=1 #failure=3
    Readiness:    http-get http://:9205/ delay=60s timeout=1s period=30s #success=1 #failure=3
    Environment:
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2kwj9 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-2kwj9:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
```

> I would like to analyze this kind of problem without bringing in external tools (such as Prometheus).

* * *

I also asked this question on the Kubernetes forum:
[https://discuss.kubernetes.io/t/how-can-we-tell-if-the-oomkil...](https://link.segmentfault.com/?enc=rowgIdQgr9LuWqNIw1dV%2FA%3D%3D.8MxXXgtQPYPYlxg3TaT1vfJCUSBYULeveueD2FcgG7ytvIldhXAdlpgxn%2FCxV8x9Y8jpx%2FAg8oY6hlFWelne792%2BEdFUA%2BFOEhD0Di1V9MQI4hjxRBMXz8sW19WiO2pb0Nx%2BS6Dad4%2BjkAV6362bO9gSM1c6%2FnpzvLywz5qp%2B%2BSXlc4FztPp7ZeVZ6uiqZO2RO4dMBWzdrzUGj9Pwvy4ectiffbBMMbKTOE0%2F4lDcPcXmtSqLZmVenq06hzZDoo4kkxICgurKtIILQw9WW4oKwDsglRKF9yXLdMgvGM2eJod0JR7ES7oiQ1jJlev5X5BaqDg5cSrIyhoMuBwPAAfFg%3D%3D)
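One way to narrow this down without extra tooling is to compare what kubelet recorded with the node's kernel log: a limit-triggered kill usually appears in the kernel log as a memory-cgroup OOM, while a node-level OOM appears as a global "Out of memory" kill and often as a `SystemOOM` event on the Node object. A rough sketch of the checks, using the pod name and namespace from the describe output above (the exact kernel-log wording varies by kernel version, so treat the grep patterns as assumptions):

```bash
# 1) What kubelet recorded for the container itself (expect "OOMKilled", exit code 137)
kubectl -n mediawise get pod image-vector-api-server-prod-5fffcd4884-j9447 \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'

# 2) Node-level OOM events raised by kubelet; events are only kept for a short
#    time (1h by default), so this is only useful soon after the kill
kubectl get events -A --field-selector reason=SystemOOM

# 3) On the node itself (e.g. via ssh), the kernel log wording tells the two cases apart:
#    "Memory cgroup out of memory: Killed process ..." -> the container hit its own limit
#    "Out of memory: Killed process ..."               -> the node as a whole ran out of memory
dmesg -T | grep -iE 'memory cgroup out of memory|out of memory|oom-killer'
```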

An approach I used in the past: add monitoring to the k8s cluster so you can see the pod's memory usage over time as well as the node's total memory usage over time. After a crash, check roughly how much memory each was using at the moment of the kill and you will basically know which side caused it. You can also add alerting on top of the monitoring: when memory usage reaches a certain threshold, send an email, or an SMS (if you go through an API). *** I say "in the past" because I no longer have to manage k8s.
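If a full monitoring stack is off the table, even a crude poller around `kubectl top` can capture the two curves described above. A minimal sketch, assuming metrics-server is installed; the interval, log file name, and pod-name prefix below are made-up examples:

```bash
#!/usr/bin/env bash
# Sample pod and node memory usage every 30s so that, after the next OOMKilled
# restart, you can see whether the pod was approaching its limit or the node
# was running out of memory overall. Assumes metrics-server backs `kubectl top`.
NS=mediawise
POD_PREFIX=image-vector-api-server-prod   # hypothetical prefix of the pod of interest
LOG=mem-usage.log

while true; do
  ts=$(date '+%F %T')
  # pod memory usage (kubectl top pod columns: NAME CPU MEMORY)
  kubectl top pod -n "$NS" --no-headers | grep "$POD_PREFIX" \
    | awk -v t="$ts" '{print t, "pod", $1, $3}' >> "$LOG"
  # node memory usage and percentage (columns: NAME CPU CPU% MEMORY MEMORY%)
  kubectl top node --no-headers \
    | awk -v t="$ts" '{print t, "node", $1, $4, $5}' >> "$LOG"
  sleep 30
done
```

A threshold alert like the one mentioned above could be bolted onto the same loop, but even the raw log, read around the kill time, is usually enough to tell whether the pod was closing in on its 2000Mi limit or the node's overall memory was exhausted.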