Background
Following the Redis-induced GC problem we tracked down last time, another application started suffering from long GC pauses. Each GC caused roughly a 1s pause, during which all in-flight requests timed out, drawing frequent complaints from the business side. Because G1 offers better parallelism and shorter, more predictable pauses, the developer responsible migrated the application from CMS to G1, yet the problem persisted. This is a gateway-type application: it accepts RPC calls, forwards each request to a backend service over HTTP, and returns the wrapped response to the caller. So the chain is in fact:
Application ----rpc----> Gateway (the problem application) ----http client----> other services
Analysis
The way to tackle a GC problem is still to first understand the characteristics and phases of the collector in use, then comb through the GC logs to find out where things actually go wrong: is it frequent young GCs or old GCs? For the concrete symptom, dump the heap with jmap and analyze it, looking for a memory leak or for code or configuration that is unreasonable.
What we tried
- First, we switched from CMS to G1 and set the pause-time target to 200ms, but that did not solve the problem:
Garbage_collector="G1"
GC_OPTS="$GC_OPTS -XX:+UseG1GC \
    -XX:MaxGCPauseMillis=200 \
    -XX:+ParallelRefProcEnabled \
    -XX:ParallelGCThreads=4 \
    -XX:ConcGCThreads=4 \
    -XX:+HeapDumpOnOutOfMemoryError \
    -XX:+PrintClassHistogramBeforeFullGC \
    -XX:+PrintClassHistogramAfterFullGC \
    -XX:+PrintGCApplicationConcurrentTime \
    -XX:+PrintGCApplicationStoppedTime \
    -XX:+PrintTenuringDistribution \
    -XX:+PrintReferenceGC \
    -XX:+PrintHeapAtGC"
Every two or three days, another wave of timeouts would hit because of GC.
G1 log analysis
G1 basics
To read G1 logs you first need to understand how G1 works. It boils down to a few points. G1 divides the heap into Regions (the size is configurable) as its unit of memory, and it blurs the physical distinction between the young and old generations: they are only logical groupings of regions. As a result, a young GC can reclaim some old-generation regions along the way, instead of waiting until an occupancy threshold is crossed and then triggering an old GC that scans the entire old generation, as CMS does, which of course takes far longer.
The other core piece is the RSet (remembered set) design: each region records the cross-region references into it, which avoids having to scan the whole heap.
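As a toy illustration (this is not G1's actual data structure, just a sketch of the idea), a remembered set can be pictured as a per-region index of which other regions hold references into it, so a young GC only has to scan those regions instead of the whole old generation:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class RememberedSetToy {
    // One RSet per region: the set of (other) regions holding references into it.
    private final Map<Integer, Set<Integer>> rsets = new HashMap<>();

    // Conceptually invoked by the write barrier when a field in fromRegion
    // is updated to point into toRegion. Intra-region references are not recorded.
    void recordReference(int fromRegion, int toRegion) {
        if (fromRegion != toRegion) {
            rsets.computeIfAbsent(toRegion, r -> new TreeSet<>()).add(fromRegion);
        }
    }

    // At young GC time, only the regions listed in the young region's RSet
    // need scanning for incoming references, not the entire old generation.
    Set<Integer> regionsToScanFor(int youngRegion) {
        return rsets.getOrDefault(youngRegion, Set.of());
    }

    public static void main(String[] args) {
        RememberedSetToy rs = new RememberedSetToy();
        rs.recordReference(40, 3);  // old region 40 references young region 3
        rs.recordReference(41, 3);
        rs.recordReference(40, 40); // intra-region: not recorded
        System.out.println(rs.regionsToScanFor(3)); // prints [40, 41]
    }
}
```

The real RSet is far more elaborate (card-granular, with concurrent refinement), but the payoff is the same: collection cost proportional to incoming references, not heap size.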
A G1 collection falls into the following phases:
Young GC
Trigger: a young GC runs when the Eden space fills up; in G1 it still works by copying live objects into the survivor space. It is a stop-the-world pause, but the work is done in parallel by multiple GC worker threads. Note that a young GC does not need to scan the old generation: the RSet already tells it which young objects are referenced from the old generation. The log looks like this:
2020-07-28T18:29:37.235+0800: 78143.812: [SoftReference, 0 refs, 0.0005405 secs]2020-07-28T18:29:37.236+0800: 78143.813: [WeakReference, 650 refs, 0.0004000 secs]2020-07-28T18:29:37.236+0800: 78143.813: [FinalReference, 56137 refs, 0.0136734 secs]2020-07-28T18:29:37.250+0800: 78143.827: [PhantomReference, 0 refs, 10 refs, 0.0006262 secs]2020-07-28T18:29:37.251+0800: 78143.828: [JNI Weak Reference, 0.0000456 secs], 0.0872782 secs] [Parallel Time: 61.9 ms, GC Workers: 4] [GC Worker Start (ms): Min: 78143750.0, Avg: 78143750.0, Max: 78143750.0, Diff: 0.1] [Ext Root Scanning (ms): Min: 10.0, Avg: 10.2, Max: 10.3, Diff: 0.3, Sum: 40.9] [Update RS (ms): Min: 21.1, Avg: 21.2, Max: 21.2, Diff: 0.1, Sum: 84.8] [Processed Buffers: Min: 451, Avg: 484.8, Max: 516, Diff: 65, Sum: 1939] [Scan RS (ms): Min: 5.9, Avg: 6.0, Max: 6.0, Diff: 0.1, Sum: 23.8] [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1] [Object Copy (ms): Min: 24.0, Avg: 24.2, Max: 24.4, Diff: 0.4, Sum: 96.9] [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0] [Termination Attempts: Min: 1, Avg: 8.8, Max: 17, Diff: 16, Sum: 35] [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.3] [GC Worker Total (ms): Min: 61.6, Avg: 61.7, Max: 61.7, Diff: 0.1, Sum: 246.8] [GC Worker End (ms): Min: 78143811.7, Avg: 78143811.7, Max: 78143811.8, Diff: 0.1] [Code Root Fixup: 0.2 ms] [Code Root Purge: 0.0 ms] [Clear CT: 1.8 ms] [Other: 23.4 ms] [Choose CSet: 0.0 ms] [Ref Proc: 15.6 ms] [Ref Enq: 0.4 ms] [Redirty Cards: 0.5 ms] [Humongous Register: 0.1 ms] [Humongous Reclaim: 0.0 ms] [Free CSet: 2.7 ms] [Eden: 6084.0M(6084.0M)->0.0B(6080.0M) Survivors: 60.0M->64.0M Heap: 8179.6M(10.0G)->2102.0M(10.0G)] Heap after GC invocations=1261 (full 0): garbage-first heap total 10485760K, used 2152475K [0x0000000540000000, 0x0000000540405000, 0x00000007c0000000) region size 4096K, 16 young (65536K), 16 survivors (65536K) Metaspace used 82266K, capacity 85616K, committed 85760K, 
reserved 1124352K class space used 9546K, capacity 10192K, committed 10240K, reserved 1048576K } [Times: user=0.30 sys=0.00, real=0.08 secs]
The log also shows how execution alternates between the application and GC pauses: "Application time" entries are time spent running user code, and "Total time for which application threads were stopped" entries are the pauses in between. The interleaving is plain to see:
2020-07-28T18:29:37.259+0800: 78143.836: Total time for which application threads were stopped: 0.0960241 seconds, Stopping threads took: 0.0009531 seconds 2020-07-28T18:29:39.182+0800: 78145.759: Application time: 1.9228829 seconds 2020-07-28T18:29:39.190+0800: 78145.767: Total time for which application threads were stopped: 0.0079957 seconds, Stopping threads took: 0.0009701 seconds 2020-07-28T18:29:39.190+0800: 78145.767: Application time: 0.0002209 seconds 2020-07-28T18:29:39.195+0800: 78145.772: Total time for which application threads were stopped: 0.0047948 seconds, Stopping threads took: 0.0003450 seconds 2020-07-28T18:29:39.195+0800: 78145.772: Application time: 0.0003185 seconds 2020-07-28T18:29:39.199+0800: 78145.776: Total time for which application threads were stopped: 0.0038102 seconds, Stopping threads took: 0.0002901 seconds 2020-07-28T18:29:39.200+0800: 78145.776: Application time: 0.0004269 seconds 2020-07-28T18:29:39.203+0800: 78145.780: Total time for which application threads were stopped: 0.0036949 seconds, Stopping threads took: 0.0003017 seconds 2020-07-28T18:29:39.204+0800: 78145.780: Application time: 0.0003345 seconds 2020-07-28T18:29:39.207+0800: 78145.784: Total time for which application threads were stopped: 0.0037270 seconds, Stopping threads took: 0.0002750 seconds 2020-07-28T18:29:39.208+0800: 78145.784: Application time: 0.0002432 seconds 2020-07-28T18:29:39.211+0800: 78145.788: Total time for which application threads were stopped: 0.0034454 seconds, Stopping threads took: 0.0002974 seconds 2020-07-28T18:29:39.211+0800: 78145.788: Application time: 0.0002809 seconds 2020-07-28T18:29:39.215+0800: 78145.792: Total time for which application threads were stopped: 0.0034799 seconds, Stopping threads took: 0.0003126 seconds 2020-07-28T18:29:39.215+0800: 78145.792: Application time: 0.0001971 seconds 2020-07-28T18:29:39.218+0800: 78145.795: Total time for which application threads were stopped: 0.0033639 seconds, Stopping threads 
took: 0.0002907 seconds 2020-07-28T18:29:39.219+0800: 78145.795: Application time: 0.0002002 seconds 2020-07-28T18:29:39.222+0800: 78145.799: Total time for which application threads were stopped: 0.0034734 seconds, Stopping threads took: 0.0002560 seconds 2020-07-28T18:29:39.223+0800: 78145.799: Application time: 0.0004148 seconds 2020-07-28T18:29:39.230+0800: 78145.807: Total time for which application threads were stopped: 0.0078923 seconds, Stopping threads took: 0.0007479 seconds 2020-07-28T18:29:39.231+0800: 78145.807: Application time: 0.0002697 seconds 2020-07-28T18:29:39.239+0800: 78145.816: Total time for which application threads were stopped: 0.0080425 seconds, Stopping threads took: 0.0005797 seconds 2020-07-28T18:29:39.239+0800: 78145.816: Application time: 0.0004746 seconds 2020-07-28T18:29:39.247+0800: 78145.824: Total time for which application threads were stopped: 0.0081480 seconds, Stopping threads took: 0.0005586 seconds 2020-07-28T18:29:39.248+0800: 78145.824: Application time: 0.0002206 seconds 2020-07-28T18:29:39.255+0800: 78145.832: Total time for which application threads were stopped: 0.0077953 seconds, Stopping threads took: 0.0003357 seconds 2020-07-28T18:29:39.256+0800: 78145.833: Application time: 0.0004123 seconds 2020-07-28T18:29:39.263+0800: 78145.840: Total time for which application threads were stopped: 0.0075859 seconds, Stopping threads took: 0.0004447 seconds 2020-07-28T18:29:39.264+0800: 78145.840: Application time: 0.0001576 seconds 2020-07-28T18:29:39.271+0800: 78145.847: Total time for which application threads were stopped: 0.0070406 seconds, Stopping threads took: 0.0003025 seconds 2020-07-28T18:29:39.271+0800: 78145.848: Application time: 0.0003462 seconds 2020-07-28T18:29:39.278+0800: 78145.855: Total time for which application threads were stopped: 0.0071142 seconds, Stopping threads took: 0.0004311 seconds 2020-07-28T18:29:39.278+0800: 78145.855: Application time: 0.0001479 seconds 2020-07-28T18:29:39.285+0800: 
78145.862: Total time for which application threads were stopped: 0.0071285 seconds, Stopping threads took: 0.0003712 seconds 2020-07-28T18:29:39.287+0800: 78145.864: Application time: 0.0018179 seconds 2020-07-28T18:29:39.295+0800: 78145.872: Total time for which application threads were stopped: 0.0075329 seconds, Stopping threads took: 0.0005250 seconds 2020-07-28T18:29:39.296+0800: 78145.872: Application time: 0.0008192 seconds 2020-07-28T18:29:39.303+0800: 78145.880: Total time for which application threads were stopped: 0.0072368 seconds, Stopping threads took: 0.0002966 seconds 2020-07-28T18:29:39.303+0800: 78145.880: Application time: 0.0006276 seconds 2020-07-28T18:29:39.324+0800: 78145.901: Total time for which application threads were stopped: 0.0202434 seconds, Stopping threads took: 0.0002417 seconds 2020-07-28T18:30:04.065+0800: 78170.642: Application time: 24.7410815 seconds 2020-07-28T18:30:04.073+0800: 78170.650: Total time for which application threads were stopped: 0.0082734 seconds, Stopping threads took: 0.0009748 seconds 2020-07-28T18:30:14.002+0800: 78180.578: Application time: 9.9284404 seconds
Mixed GC
A mixed GC in G1 collects all young-generation regions, plus a number of old-generation regions chosen for high collection payoff according to the statistics gathered by global concurrent marking, picking as many high-payoff old regions as fit within the user's pause-time goal. So a mixed GC reclaims both young- and old-generation regions.
Full GC
G1 rarely falls back to a full GC; if it does, something has gone badly wrong.
Reference: https://zhuanlan.zhihu.com/p/52841787
Analyzing the G1 logs
Reference: G1 log format analysis: https://segmentfault.com/a/1190000018885417
https://matt33.com/2018/07/28/jvm-cms/
With the JVM flags above we enabled detailed GC logging, and the log at the time of the incident looks like this:
2020-07-28T18:30:14.107+0800: 78180.684: [GC concurrent-root-region-scan-start] 2020-07-28T18:30:14.108+0800: 78180.685: Total time for which application threads were stopped: 0.1065961 seconds, Stopping threads took: 0.0010147 seconds 2020-07-28T18:30:14.223+0800: 78180.800: [GC concurrent-root-region-scan-end, 0.1157296 secs] 2020-07-28T18:30:14.223+0800: 78180.800: [GC concurrent-mark-start] 2020-07-28T18:30:15.297+0800: 78181.873: [GC concurrent-mark-end, 1.0732893 secs] 2020-07-28T18:30:15.297+0800: 78181.873: Application time: 1.1884735 seconds 2020-07-28T18:30:15.304+0800: 78181.881: [GC remark 2020-07-28T18:30:15.304+0800: 78181.881: [Finalize Marking, 0.0042761 secs] 2020-07-28T18:30:15.309+0800: 78181.885: [GC ref-proc2020-07-28T18:30:15.309+0800: 78181.885: [SoftReference, 434351 refs, 0.0672266 secs]2020-07-28T18:30:15.376+0800: 78181.953: [WeakReference, 376 refs, 0.0004634 secs]2020-07-28T18:30:15.376+0800: 78181.953: [FinalReference, 435363 refs, 0.3004894 secs]2020-07-28T18:30:15.677+0800: 78182.254: [PhantomReference, 0 refs, 331 refs, 0.0009668 secs]2020-07-28T18:30:15.678+0800: 78182.255: [JNI Weak Reference, 0.0002237 secs], 0.4895046 secs] 2020-07-28T18:30:15.798+0800: 78182.375: [Unloading, 0.0347793 secs], 0.5384580 secs] [Times: user=1.65 sys=0.24, real=0.54 secs]
We can see that the GC ref-proc phase takes very long, and it is a stop-the-world phase. Look at FinalReference: over 430,000 of them, versus the usual 60,000 or so (which is already a lot).
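When eyeballing these logs gets tedious, a small parser can track the FinalReference counts over time. This is a minimal sketch assuming the -XX:+PrintReferenceGC entry format shown above; the class and method names are my own:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RefProcLogScan {
    // Matches entries like "[FinalReference, 435363 refs, 0.3004894 secs]"
    private static final Pattern FINAL_REF =
            Pattern.compile("\\[FinalReference, (\\d+) refs, ([0-9.]+) secs\\]");

    // Returns the FinalReference count on a GC log line, or -1 if none is present.
    static long finalRefCount(String gcLogLine) {
        Matcher m = FINAL_REF.matcher(gcLogLine);
        return m.find() ? Long.parseLong(m.group(1)) : -1L;
    }

    public static void main(String[] args) {
        String line = "2020-07-28T18:30:15.376+0800: 78181.953: "
                + "[FinalReference, 435363 refs, 0.3004894 secs]";
        System.out.println(finalRefCount(line)); // prints 435363
    }
}
```

Feeding every line of the GC log through this makes spikes like the one above stand out immediately.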
Let's take a closer look at what this concurrent marking actually is.
Understanding global concurrent marking
Global concurrent marking consists of five phases:
(1) Initial Mark (STW)
Initial marking is a stop-the-world event that marks the objects directly reachable from the GC roots and pushes their fields onto the marking stack for later scanning. G1 records mark information in external bitmaps rather than in the mark bits of each object header's mark word. Because it is STW, the initial mark is usually piggybacked on a young GC's pause, so that young GC is what kicks off global concurrent marking, even though the marking cycle is logically independent of young GC.
(2) Root Region Scanning
Root region scanning starts from the objects in the survivor regions, marks the old-generation objects they reference, and pushes their fields onto the marking stack for later scanning. Unlike the initial mark, root region scanning does not stop the world: it runs concurrently with the application. It must, however, complete before the next young GC starts.
(3) Concurrent Marking
No STW. References are continually popped off the marking stack and the heap is scanned recursively: each object reached is marked and its fields are pushed onto the stack, repeating until the stack is empty. Along the way the references recorded by the SATB write barrier are scanned as well. Concurrent marking can be interrupted by young GCs.
This corresponds to these log entries:
2020-07-28T18:30:14.107+0800: 78180.684: [GC concurrent-root-region-scan-start] 2020-07-28T18:30:14.108+0800: 78180.685: Total time for which application threads were stopped: 0.1065961 seconds, Stopping threads took: 0.0010147 seconds 2020-07-28T18:30:14.223+0800: 78180.800: [GC concurrent-root-region-scan-end, 0.1157296 secs] 2020-07-28T18:30:14.223+0800: 78180.800: [GC concurrent-mark-start] 2020-07-28T18:30:15.297+0800: 78181.873: [GC concurrent-mark-end, 1.0732893 secs]
(4) Remark (STW)
A stop-the-world operation. After concurrent marking finishes, each Java thread still has some leftover references recorded by the SATB write barrier that have not been processed; this phase drains them. Reference processing (soft, weak, final, and phantom references) also happens here. Note an essential difference from CMS's remark: this pause only needs to scan the SATB buffers, whereas CMS's remark must rescan the dirty cards in the mod-union table plus the entire root set, with the whole young generation (live or dead) treated as part of the roots, which is why a CMS remark can be extremely slow.
This is exactly where our problem is:
2020-07-28T18:30:15.304+0800: 78181.881: [GC remark 2020-07-28T18:30:15.304+0800: 78181.881: [Finalize Marking, 0.0042761 secs] 2020-07-28T18:30:15.309+0800: 78181.885: [GC ref-proc2020-07-28T18:30:15.309+0800: 78181.885: [SoftReference, 434351 refs, 0.0672266 secs]2020-07-28T18:30:15.376+0800: 78181.953: [WeakReference, 376 refs, 0.0004634 secs]2020-07-28T18:30:15.376+0800: 78181.953: [FinalReference, 435363 refs, 0.3004894 secs]2020-07-28T18:30:15.677+0800: 78182.254: [PhantomReference, 0 refs, 331 refs, 0.0009668 secs]2020-07-28T18:30:15.678+0800: 78182.255: [JNI Weak Reference, 0.0002237 secs], 0.4895046 secs] 2020-07-28T18:30:15.798+0800: 78182.375: [Unloading, 0.0347793 secs], 0.5384580 secs]
(5) Cleanup (STW and concurrent)
- STW: identify the regions that still contain live objects and the regions that are completely empty (empty regions)
- STW: update the RSets
From the analysis above we can basically conclude: there were simply too many FinalReferences, so processing them during marking took a long time, causing the application pause and jitter.
Having located the problem, the next step was to figure out why there are so many FinalReferences, and whether they can be reduced.
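Why do sockets produce FinalReferences in the first place? Any class that overrides finalize() (as the JDK's socket implementations do, to close leaked file descriptors) gets a FinalReference registered by the JVM for every instance at allocation time, and each one must later pass through GC reference processing. A minimal demonstration: the LeakySocket class below is a stand-in of my own, not JDK code, and finalization timing is never guaranteed, hence the retry loop.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class FinalReferenceDemo {
    static final CountDownLatch FINALIZED = new CountDownLatch(1);

    // Overriding finalize() causes the JVM to register a FinalReference
    // (via java.lang.ref.Finalizer) for every instance of this class.
    static class LeakySocket {
        @Override
        protected void finalize() {
            FINALIZED.countDown();
        }
    }

    static void allocate() {
        new LeakySocket(); // unreachable as soon as this method returns
    }

    public static void main(String[] args) throws Exception {
        allocate();
        // Nudge the collector and the finalizer thread; retry because
        // neither System.gc() nor finalization is guaranteed to run promptly.
        for (int i = 0; i < 10 && FINALIZED.getCount() > 0; i++) {
            System.gc();
            System.runFinalization();
            TimeUnit.MILLISECONDS.sleep(100);
        }
        System.out.println("finalizer ran: " + (FINALIZED.getCount() == 0));
    }
}
```

Multiply this by one object per socket at gateway request rates, and a large FinalReference population during marking is plausible.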
Heap analysis
We pulled the production heap down with jmap -dump:format=b,file=heapdump.hprof and analyzed it, but found nothing conclusive beyond confirming that these objects really are numerous, and that they still appear related to SocksSocketImpl.
Pinpointing the problem
Finding the source of new SocksSocketImpl()
Why are there so many java.net.SocksSocketImpl instances? Using the same approach as before, we turned to Arthas: by instrumenting the SocksSocketImpl constructor we confirmed that new connections were being created continuously.
Attempt 1: turn off JMX collection
We found that quite a few of the connections came from JMX metrics collection, so I turned that off first; it made no difference at all.
Attempt 2: is httpclient using short-lived connections?
Next I tried to find out why new connections kept being created: were connections failing to be reused? Many people online have hit the same symptom, often caused by mistakenly using short-lived connections.
I began to suspect the company's middleware. All of our HTTP calls go through a thin wrapper around httpclient provided by the middleware team, whose purpose is metrics and tracing. In local testing the connections were kept alive just fine; could it be that connections cannot be kept alive on the production network?
Understanding httpclient and tuning its connection-pool parameters
Internally we usually call HTTP services like this:
JSONObject obj = restTemplate.getForObject(realUrl, JSONObject.class);
This is a template method that Spring wraps for you, greatly reducing the cost of using httpclient directly. In the old days, anyone using httpclient would wrap it themselves: set up a connection pool, run an idle-eviction thread to drop connections no longer in use, and so on.
The company middleware uses the hooks that the template method exposes to inject its own instrumentation, and it surfaces the pool configuration in application.yml so you can tune it.
Analyzing PoolingHttpClientConnectionManager
I crawled through the source of PoolingHttpClientConnectionManager to understand the pool-management part and find out why connections might not be kept alive. The analysis goes as follows.
Classes involved:
- CPoolEntry: one unit of the pool, holding the route information and the underlying connection.
- CPool: its lease() method tries to lease a connection, returning a Future<CPoolEntry> that can hand back a pool entry without blocking. How does that Future obtain a connection? Calling its get() method returns the previously leased entry if there is one; otherwise it falls through to getPoolEntryBlocking(), which acquires a pool entry, in turn calling RouteSpecificPool.getFree() to fetch a usable connection.
- RouteSpecificPool: keeps two linked lists of connection entries at the per-route level. getFree() takes an element from the available list and moves it to the leased list, which represents a successfully leased connection.
// Excerpt from AbstractConnPool.getPoolEntryBlocking
// (exception handling and the wait-for-a-free-connection path elided)
final RouteSpecificPool<T, C, E> pool = getPool(route);
E entry;
for (;;) {
    Asserts.check(!this.isShutDown, "Connection pool shut down");
    for (;;) {
        // Take an entry off the available list and move it to the leased list
        entry = pool.getFree(state);
        if (entry == null) {
            break;
        }
        if (entry.isExpired(System.currentTimeMillis())) {
            // Expired connections are closed automatically...
            entry.close();
        }
        if (entry.isClosed()) {
            // ...and a closed connection is no longer usable: drop it from the available list
            this.available.remove(entry);
            // reusable == false, so it is NOT put back into the available list
            pool.free(entry, false);
        } else {
            break;
        }
    }
    // The inner loop is done: entry is either a usable connection or null
    if (entry != null) {
        // Successfully leased: remove from available, add to leased
        this.available.remove(entry);
        this.leased.add(entry);
        // Mark it as reused
        onReuse(entry);
        return entry;
    }
    // entry == null: a new connection has to be created
    final int maxPerRoute = getMax(route);
    // Shrink the pool prior to allocating a new connection;
    // excess > 0 when this route is already at its per-route limit
    final int excess = Math.max(0, pool.getAllocatedCount() + 1 - maxPerRoute);
    if (excess > 0) {
        for (int i = 0; i < excess; i++) {
            final E lastUsed = pool.getLastUsed();
            if (lastUsed == null) {
                break;
            }
            lastUsed.close();
            this.available.remove(lastUsed);
            pool.remove(lastUsed);
        }
    }
    // If there is still capacity, create a new connection
    if (pool.getAllocatedCount() < maxPerRoute) {
        final int totalUsed = this.leased.size();
        final int freeCapacity = Math.max(this.maxTotal - totalUsed, 0);
        if (freeCapacity > 0) {
            final int totalAvailable = this.available.size();
            // At the global limit: evict the least recently used available connection
            if (totalAvailable > freeCapacity - 1) {
                if (!this.available.isEmpty()) {
                    final E lastUsed = this.available.removeLast();
                    lastUsed.close();
                    final RouteSpecificPool<T, C, E> otherpool = getPool(lastUsed.getRoute());
                    otherpool.remove(lastUsed);
                }
            }
            final C conn = this.connFactory.create(route);
            entry = pool.add(conn);
            this.leased.add(entry);
            return entry;
        }
    }
    // ... otherwise wait for a leased connection to be released (elided)
}
This code makes the meaning of the configuration parameters clear:
- max-total: the maximum number of connections in the pool
- max-per-route: the maximum number of connections per route (host) in the pool
- default-max-per-route: the per-route limit used when a host has no explicit configuration
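To make the interplay of these limits concrete, here is a heavily simplified single-route model of the lease logic above. This is a sketch of my own, not the real implementation, which is generic, thread-safe, and tracks many routes at once:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RoutePoolSketch {
    private final int maxTotal;
    private final int maxPerRoute;
    private final Deque<String> available = new ArrayDeque<>();
    private int leasedTotal;        // connections currently leased (all routes)
    private int allocatedForRoute;  // connections ever created for this route

    RoutePoolSketch(int maxTotal, int maxPerRoute) {
        this.maxTotal = maxTotal;
        this.maxPerRoute = maxPerRoute;
    }

    String lease(String route) {
        if (!available.isEmpty()) {  // reuse, like RouteSpecificPool.getFree
            leasedTotal++;
            return available.removeFirst();
        }
        // A new connection is created only while BOTH limits allow it
        if (allocatedForRoute < maxPerRoute && leasedTotal < maxTotal) {
            allocatedForRoute++;
            leasedTotal++;
            return "conn-" + allocatedForRoute + "->" + route;
        }
        return null;  // the real pool would block here until a release
    }

    void release(String conn) {
        leasedTotal--;
        available.addLast(conn);  // keep-alive: back onto the available list
    }

    public static void main(String[] args) {
        RoutePoolSketch pool = new RoutePoolSketch(2, 1);
        String c1 = pool.lease("http://backend");
        System.out.println(c1);                            // a new connection
        System.out.println(pool.lease("http://backend"));  // null: per-route limit hit
        pool.release(c1);
        System.out.println(pool.lease("http://backend").equals(c1)); // true: reused
    }
}
```

The third lease returning the same connection object is exactly the keep-alive reuse we later confirmed in the production logs.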
The company middleware adds one more piece of logic: by default there is a single global pool using the configuration above, but it also lets you configure a dedicated pool per host. The format looks like this; the configuration below defines two pools, each with its own settings:
httpclient:
  default-pool:
    default-max-per-route: 20
    max-total: 100
    max-per-route: 50
    keep-alive-time: 60
    connect-timeout: 5000
    connection-request-timeout: 5000
    socket-timeout: 1000
    idle-connections-timeout: 30
    idle-connections-check-schedule: 5000
  pool-configs:
    "[http://10.69.3.58:9003]":
      max-total: 213
      max-per-route: 123
      keep-alive-time: 900
      connect-timeout: 450
      connection-request-timeout: 450
      socket-timeout: 450
      idle-connections-timeout: 30
      idle-connections-check-schedule: 5000
Locating the problem in production
With PoolingHttpClientConnectionManager understood, we could simply turn on its logging and see what was actually going on:
<Logger name="org.apache.http.impl.conn.PoolingHttpClientConnectionManager" level="DEBUG">
</Logger>
With DEBUG logging enabled for this class, the log looks like this:
2020-08-26 16:47:36 [36mDEBUG[m[pool-7-thread-149] [o.a.h.i.c.PoolingHttpClientConnectionManager.requestConnection:266] [154ee4edcd6d1152,154ee4edcd6d1152,0] Connection request: [route: {s}->https://aiengine-hellomap-route-inner.hellobike.cn:9003][total kept alive: 0; route allocated: 50 of 50; total allocated: 50 of 100] 2020-08-26 16:47:36 [36mDEBUG[m[pool-7-thread-219] [o.a.h.i.c.PoolingHttpClientConnectionManager.leaseConnection:310] [8f2ce76435c397c6,8f2ce76435c397c6,0] Connection leased: [id: 607][route: {s}->https://aiengine-hellomap-route-inner.hellobike.cn:9003][total kept alive: 0; route allocated: 50 of 50; total allocated: 50 of 100]
Read alongside the PoolingHttpClientConnectionManager source, this is fairly easy to interpret.
Why does it say 50 of 50 when my configuration says otherwise? It later turned out to be a bug in the middleware: for https routes, the user-configured max-per-route is silently ignored. But that's another story.
The log makes it clear that not only is the connection pool in use, connections are also being kept alive. So this hypothesis was wrong.
Is a traffic spike coming in?
This application also serves another, scheduled-job style application. Could its bursts of traffic be creating large numbers of connections and ultimately too many FinalReferences? We routed that traffic away, but the problem remained.
Is it an sslsession problem?
Around the same time we heard that another developer had also hit a GC problem, caused by a JDK bug:
https://bugs.openjdk.java.net/browse/JDK-8210985
-Djavax.net.ssl.sessionCacheSize=101 controls the length of the session cache queue; changing this parameter made no difference either.
An unconventional fix
The hunt for why FinalReferences pile up came up empty; perhaps it is simply a characteristic of this application. With high concurrency and a backend that is pure HTTP requests, a large number of connections is normal. Why FinalReference suddenly jumped from about 60,000 to over 430,000, bringing the GC pause with it, we still have not figured out.
But the production problem still had to be fixed, and fast. If you can't solve the problem, solve the person who raised it! Back to the essence: the pause happened because too many FinalReferences had accumulated by the time concurrent marking ran, and every incident was triggered by a marking cycle. So could we increase the frequency of concurrent marking and process the references proactively, instead of letting them accumulate until they blow up?
Triggering concurrent marking more often
No sooner said than done: we added -XX:InitiatingHeapOccupancyPercent=7 to trigger concurrent marking proactively. The default is 45, meaning a marking cycle starts when heap occupancy reaches 45%. At the same time we moved from 4-core to 8-core machines and raised the concurrent marking threads from 4 to 6: give the collector more compute and let it keep the FinalReference count down for us. Would it work?
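For a sense of scale on this 10 GB heap, the occupancy at which marking kicks in can be computed directly (simple arithmetic, not a JVM API):

```java
public class IhopThreshold {
    // Heap occupancy (bytes) at which G1 starts a concurrent marking cycle
    // for a given -XX:InitiatingHeapOccupancyPercent value.
    static long markingThreshold(long heapBytes, int ihopPercent) {
        return heapBytes * ihopPercent / 100;
    }

    public static void main(String[] args) {
        long heap = 10L * 1024 * 1024 * 1024; // 10 GB, as in this application
        System.out.println(markingThreshold(heap, 45) / (1024 * 1024) + " MB"); // default 45%: 4608 MB
        System.out.println(markingThreshold(heap, 7) / (1024 * 1024) + " MB");  // tuned 7%:    716 MB
    }
}
```

So instead of waiting for about 4.6 GB of occupancy, marking now starts at roughly 0.7 GB, which is why the cycles (and the FinalReference processing they do) run so much more often.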
After the change, concurrent marking indeed ran much more often, and after each cycle the FinalReference count dropped as expected. After a week of observation the GC problem was gone, with no more GC-induced timeouts.
From the log, the 70,000-odd FinalReferences fall to a few thousand after a concurrent marking cycle. Goal achieved.
2020-09-08T18:26:29.601+0800: 509725.355: [SoftReference, 0 refs, 0.0005530 secs]2020-09-08T18:26:29.602+0800: 509725.355: [WeakReference, 651 refs, 0.0003417 secs]2020-09-08T18:26:29.602+0800: 509725.356: [FinalReference, 59425 refs, 0.0084187 secs]2020-09-08T18:26:29.611+0800: 509725.364: [PhantomReference, 0 refs, 0 refs, 0.0007736 secs]2020-09-08T18:26:29.611+0800: 509725.365: [JNI Weak Reference, 0.0000406 secs], 0.0485472 secs] [Parallel Time: 31.2 ms, GC Workers: 6] [GC Worker Start (ms): Min: 509725323.3, Avg: 509725323.4, Max: 509725323.5, Diff: 0.1] [Ext Root Scanning (ms): Min: 3.8, Avg: 5.3, Max: 11.4, Diff: 7.6, Sum: 31.7] [Update RS (ms): Min: 3.6, Avg: 9.6, Max: 11.0, Diff: 7.3, Sum: 57.7] [Processed Buffers: Min: 256, Avg: 319.7, Max: 376, Diff: 120, Sum: 1918] [Scan RS (ms): Min: 0.1, Avg: 0.3, Max: 0.3, Diff: 0.2, Sum: 1.6] [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1] [Object Copy (ms): Min: 15.2, Avg: 15.5, Max: 15.8, Diff: 0.5, Sum: 93.2] [Termination (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.3] [Termination Attempts: Min: 1, Avg: 150.8, Max: 196, Diff: 195, Sum: 905] [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.5] [GC Worker Total (ms): Min: 30.8, Avg: 30.8, Max: 30.9, Diff: 0.1, Sum: 185.0] [GC Worker End (ms): Min: 509725354.2, Avg: 509725354.2, Max: 509725354.3, Diff: 0.1] [Code Root Fixup: 0.2 ms] [Code Root Purge: 0.0 ms] [Clear CT: 0.6 ms] [Other: 16.6 ms] [Choose CSet: 0.0 ms] [Ref Proc: 10.4 ms] [Ref Enq: 0.3 ms] [Redirty Cards: 0.1 ms] [Humongous Register: 0.0 ms] [Humongous Reclaim: 0.0 ms] [Free CSet: 1.4 ms] [Eden: 6084.0M(6084.0M)->0.0B(6084.0M) Survivors: 60.0M->60.0M Heap: 7811.1M(10.0G)->1725.1M(10.0G)] Heap after GC invocations=19624 (full 0): garbage-first heap total 10485760K, used 1766479K [0x0000000540000000, 0x0000000540405000, 0x00000007c0000000) region size 4096K, 15 young (61440K), 15 survivors (61440K) Metaspace used 96904K, capacity 100148K, 
committed 101504K, reserved 1138688K class space used 11478K, capacity 12106K, committed 12416K, reserved 1048576K } [Times: user=0.19 sys=0.06, real=0.05 secs] 2020-09-08T18:26:29.616+0800: 509725.370: [GC concurrent-root-region-scan-start] 2020-09-08T18:26:29.617+0800: 509725.370: Total time for which application threads were stopped: 0.0667356 seconds, Stopping threads took: 0.0006084 seconds 2020-09-08T18:26:29.624+0800: 509725.377: [GC concurrent-root-region-scan-end, 0.0075624 secs] 2020-09-08T18:26:29.624+0800: 509725.377: [GC concurrent-mark-start] 2020-09-08T18:26:29.745+0800: 509725.498: [GC concurrent-mark-end, 0.1212810 secs] 2020-09-08T18:26:29.745+0800: 509725.499: Application time: 0.1284912 seconds 2020-09-08T18:26:29.762+0800: 509725.516: [GC remark 2020-09-08T18:26:29.762+0800: 509725.516: [Finalize Marking, 0.0020710 secs] 2020-09-08T18:26:29.764+0800: 509725.518: [GC ref-proc2020-09-08T18:26:29.764+0800: 509725.518: [SoftReference, 244 refs, 0.0005201 secs]2020-09-08T18:26:29.765+0800: 509725.518: [WeakReference, 2549 refs, 0.0006340 secs]2020-09-08T18:26:29.766+0800: 509725.519: [FinalReference, 34 refs, 0.0003842 secs]2020-09-08T18:26:29.766+0800: 509725.519: [PhantomReference, 0 refs, 772 refs, 0.0008726 secs]2020-09-08T18:26:29.767+0800: 509725.520: [JNI Weak Reference, 0.0002677 secs], 0.0028920 secs] 2020-09-08T18:26:29.767+0800: 509725.521: [Unloading, 0.0209432 secs], 0.0292943 secs] [Times: user=0.10 sys=0.03, real=0.03 secs] 2020-09-08T18:26:29.792+0800: 509725.546: Total time for which application threads were stopped: 0.0473635 seconds, Stopping threads took: 0.0005978 seconds 2020-09-08T18:26:29.793+0800: 509725.546: Application time: 0.0001449 seconds 2020-09-08T18:26:29.809+0800: 509725.563: [GC cleanup 1797M->1797M(10G), 0.0051662 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
Summary
With that, a GC problem that had plagued us for over a month was resolved. Although we never found out why FinalReference suddenly spikes, we solved the production problem by other means and cured the stubborn GC pauses for good. From the whole process, we gained the following:
- A thorough understanding of how the G1 collector works, laying the groundwork for future troubleshooting
- Familiarity with G1's GC logs: dense as they look, they hide a lot of key information, and with the background knowledge above they are no longer hard to read; it just takes patience
- In the course of solving the problem, we read the company middleware's code and found a major bug for them; we also studied the internals and source code of Apache httpclient's connection pool
- Another round of practice with the heap-analysis tool MAT, even though it didn't pinpoint the problem this time
- Awareness of the -Djavax.net.ssl.sessionCacheSize=101 bug (JDK-8210985)