一次服务异常退出问题排查
Published on: | Views: 150问题背景
统计服务最近一直异常退出,存活时间可能30min~3h左右 挂掉的时机通常在执行数据统计期间(每30min一次统计) 统计期间,gc会非常频繁
分析
我们服务使用了健康检查,容器运行时使用了参数 --restart unless-stopped, 如果三次健康检查不通过就会重启。 使用docker update --restart=no 关闭重启机制, 发现过一段时间服务依然挂掉了,说明不是健康检查引起的。
docker inspect 查看容器信息,发现:
[
{
"State": {
"Status": "exited",
"Running": false,
"Paused": false,
"Restarting": false,
"OOMKilled": true,
"Dead": false,
"Pid": 0,
"ExitCode": 137,
"Error": "",
"StartedAt": "2022-06-25T01:41:19.952775213Z",
"FinishedAt": "2022-06-25T04:35:04.300512699Z",
},
"Env": [
"JAR_ARGS= -DserviceName=xxx_service_name -Xmx1g -XX:NewSize=256m -XX:NewRatio=1 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=logs/java_1.2.0.1655084162.hprof -XX:ErrorFile=logs/java_<pid>.log ...",
"PATH=/usr/local/openjdk-8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"LANG=C.UTF-8",
"JAVA_HOME=/usr/local/openjdk-8",
"JAVA_VERSION=8u265",
"TZ=Asia/Shanghai",
"APP_HOME=/app"
],
}
]
退出代码137, OOMKilled为true, 说明是内存不够导致程序退出。
再用docker container logs 看一下gc日志:
2022-06-25T12:34:37.730+0800: [GC (Allocation Failure) 2022-06-25T12:34:37.731+0800: [ParNew: 235968K->21536K(235968K), 0.0616245 secs] 529197K->331224K(574380K), 0.0623098 secs] [Times: user=0.12 sys=0.00, real=0.06 secs]
2022-06-25T12:34:37.795+0800: [GC (CMS Initial Mark) [1 CMS-initial-mark: 309688K(338412K)] 331224K(574380K), 0.0091831 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2022-06-25T12:34:37.804+0800: [CMS-concurrent-mark-start]
2022-06-25T12:34:38.078+0800: [CMS-concurrent-mark: 0.273/0.273 secs] [Times: user=0.28 sys=0.02, real=0.28 secs]
2022-06-25T12:34:38.078+0800: [CMS-concurrent-preclean-start]
2022-06-25T12:34:38.082+0800: [CMS-concurrent-preclean: 0.004/0.004 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2022-06-25T12:34:38.084+0800: [GC (CMS Final Remark) [YG occupancy: 23337 K (235968 K)]2022-06-25T12:34:38.084+0800: [Rescan (parallel) , 0.0107235 secs]2022-06-25T12:34:38.095+0800: [weak refs processing, 0.0023168 secs]2022-06-25T12:34:38.097+0800: [class unloading, 0.0652499 secs]2022-06-25T12:34:38.162+0800: [scrub symbol table, 0.0293179 secs]2022-06-25T12:34:38.192+0800: [scrub string table, 0.0024302 secs][1 CMS-remark: 309688K(338412K)] 333025K(574380K), 0.1470139 secs] [Times: user=0.16 sys=0.00, real=0.15 secs]
2022-06-25T12:34:38.231+0800: [CMS-concurrent-sweep-start]
2022-06-25T12:34:38.432+0800: [CMS-concurrent-sweep: 0.201/0.201 secs] [Times: user=0.21 sys=0.02, real=0.20 secs]
2022-06-25T12:34:38.433+0800: [CMS-concurrent-reset-start]
2022-06-25T12:34:38.435+0800: [CMS-concurrent-reset: 0.002/0.002 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2022-06-25T12:35:02.496+0800: [GC (Allocation Failure) 2022-06-25T12:35:02.496+0800: [ParNew: 231328K->26176K(235968K), 0.4541069 secs] 405420K->284107K(574380K), 0.4547576 secs] [Times: user=0.80 sys=0.02, real=0.46 secs]
2022-06-25T12:35:02.954+0800: [GC (CMS Initial Mark) [1 CMS-initial-mark: 257931K(338412K)] 285237K(574380K), 0.0263418 secs] [Times: user=0.05 sys=0.00, real=0.03 secs]
2022-06-25T12:35:02.980+0800: [CMS-concurrent-mark-start]
2022-06-25T12:35:03.747+0800: [CMS-concurrent-mark: 0.749/0.766 secs] [Times: user=0.99 sys=0.03, real=0.76 secs]
2022-06-25T12:35:03.748+0800: [CMS-concurrent-preclean-start]
2022-06-25T12:35:03.781+0800: [CMS-concurrent-preclean: 0.032/0.033 secs] [Times: user=0.04 sys=0.00, real=0.04 secs]
2022-06-25T12:35:03.782+0800: [CMS-concurrent-abortable-preclean-start]
看起来没有什么异常,服务设置的内存xmx是1G, 挂掉的时候,新生代235968K,总大小574380K,未达到设定值1G,按理说,如果需要的话会继续扩容而不会引起OOM才对。
看了一下docker run 的参数, -m 设置了容器的内存大小是1G。 emm 好像发现问题了,服务继续扩大堆内存,虽然自己不会OOM,但是会导致容器memory占用超过1G, 然后容器以OOM KILL退出。
解决
调整容器内存限制大小为2G, 观察后发现问题得到解决。