Yet Another GC Troubleshooting Story

Background

Following the Redis-induced GC problem we tracked down last time, another application started suffering from long GC pauses. Each GC caused roughly a 1 s stop-the-world pause, during which every in-flight request timed out, drawing constant complaints from the business teams. Because G1 offers better parallelism and shorter, more predictable pause times, the owning developer migrated the application from CMS to G1, but the problem persisted. This is a gateway-type application: it accepts RPC calls, forwards them to backend services over HTTP, and returns the wrapped responses to callers. So the call chain is:

Application --rpc--> Gateway (the problem app) --http client--> other services.

Analysis

The general approach to a GC problem is to first understand the characteristics and phases of the collector in use, then comb through the GC logs to see where the time actually goes: is it frequent young GCs, or old-generation GCs? With a concrete symptom in hand, use jmap to dump the heap and analyze it, looking for a memory leak or a spot where the code or configuration is unreasonable.

What we tried

  1. First we switched from CMS to G1 and set a pause-time target of 200 ms, but this did not solve the problem:
Garbage_collector="G1"
GC_OPTS="$GC_OPTS -XX:+UseG1GC \
-XX:MaxGCPauseMillis=200 \
-XX:+ParallelRefProcEnabled -XX:ParallelGCThreads=4 -XX:ConcGCThreads=4 \
-XX:+HeapDumpOnOutOfMemoryError \
-XX:+PrintClassHistogramBeforeFullGC \
-XX:+PrintClassHistogramAfterFullGC  \
-XX:+PrintGCApplicationConcurrentTime  \
-XX:+PrintGCApplicationStoppedTime \
-XX:+PrintTenuringDistribution \
-XX:+PrintReferenceGC \
-XX:+PrintHeapAtGC"

Every 2-3 days, another wave of GC-induced timeouts would hit.

G1 log analysis

A primer on G1

To read G1 logs you first need to understand how G1 works. It boils down to a few points. G1 divides the heap into fixed-size Regions (the size is configurable) and blurs the physical distinction between the young and old generations; they become purely logical groupings of regions. As a result, a young collection can reclaim some old-generation content along the way, instead of waiting, as CMS does, for occupancy to reach a threshold and then triggering an old-generation collection that scans the entire old space, which is naturally expensive.
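As a rough illustration of the region-sizing ergonomics (a sketch only: HotSpot targets roughly 2048 regions and rounds the result to a power of two between 1 MB and 32 MB, though the real code differs in details such as averaging the initial and max heap sizes), for our 10 GB heap this predicts exactly the 4 MB region size that shows up later in the logs:

```java
public class RegionSize {
    // Sketch of G1's default region-size choice: heap / ~2048 regions,
    // rounded down to a power of two, clamped to [1 MB, 32 MB].
    static long regionSizeBytes(long heapBytes) {
        long target = heapBytes / 2048;                        // aim for ~2048 regions
        long size = Long.highestOneBit(Math.max(target, 1));   // round down to a power of two
        long min = 1L << 20, max = 32L << 20;
        return Math.min(Math.max(size, min), max);
    }

    public static void main(String[] args) {
        // 10 GB heap -> 5 MB target -> rounds to 4 MB, matching "region size 4096K" in our logs
        System.out.println(regionSizeBytes(10L << 30) / (1 << 20) + "MB");
    }
}
```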

The other core idea is the RSet (remembered set) design, which records cross-region references so that a collection can avoid scanning the whole heap.
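The idea can be sketched with a toy data structure (purely illustrative, not HotSpot's actual card-table/RSet implementation): each region remembers which "cards" in other regions contain pointers into it, so collecting that region means scanning only those cards rather than every other region.

```java
import java.util.*;

// Toy remembered set: for each region, record which cards in OTHER
// regions hold references into it, so a young collection can scan just
// those cards instead of the whole old generation.
public class RememberedSet {
    // target region id -> set of (source region, card index) entries pointing into it
    private final Map<Integer, Set<Long>> rsets = new HashMap<>();
    private static final int CARD_SHIFT = 9;   // 512-byte cards, as in HotSpot

    // Called (conceptually) by the write barrier when a reference field is updated.
    public void recordReference(int targetRegion, int srcRegion, long srcOffset) {
        long card = ((long) srcRegion << 32) | (srcOffset >>> CARD_SHIFT);
        rsets.computeIfAbsent(targetRegion, k -> new HashSet<>()).add(card);
    }

    // How many cards a GC of targetRegion would need to scan for incoming refs.
    public int cardsToScan(int targetRegion) {
        return rsets.getOrDefault(targetRegion, Collections.emptySet()).size();
    }

    public static void main(String[] args) {
        RememberedSet rs = new RememberedSet();
        rs.recordReference(5, 1, 0);
        rs.recordReference(5, 1, 100);   // same 512-byte card: no new entry
        rs.recordReference(5, 1, 1000);  // different card
        System.out.println("cards to scan for region 5: " + rs.cardsToScan(5));
    }
}
```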

A G1 collection cycle consists of the following phases:

Young GC phase

Trigger: when the Eden space fills up, a young GC (YGC) fires. In G1, YGC still works by copying live objects into survivor space. It is stop-the-world, but the work is performed in parallel by multiple GC worker threads. The log looks like this:

A young GC does not need to scan the old generation: the RSet already tells it which young objects are referenced from old regions.

2020-07-28T18:29:37.235+0800: 78143.812: [SoftReference, 0 refs, 0.0005405 secs]2020-07-28T18:29:37.236+0800: 78143.813: [WeakReference, 650 refs, 0.0004000 secs]2020-07-28T18:29:37.236+0800: 78143.813: [FinalReference, 56137 refs, 0.0136734 secs]2020-07-28T18:29:37.250+0800: 78143.827: [PhantomReference, 0 refs, 10 refs, 0.0006262 secs]2020-07-28T18:29:37.251+0800: 78143.828: [JNI Weak Reference, 0.0000456 secs], 0.0872782 secs]
   [Parallel Time: 61.9 ms, GC Workers: 4]
      [GC Worker Start (ms): Min: 78143750.0, Avg: 78143750.0, Max: 78143750.0, Diff: 0.1]
      [Ext Root Scanning (ms): Min: 10.0, Avg: 10.2, Max: 10.3, Diff: 0.3, Sum: 40.9]
      [Update RS (ms): Min: 21.1, Avg: 21.2, Max: 21.2, Diff: 0.1, Sum: 84.8]
         [Processed Buffers: Min: 451, Avg: 484.8, Max: 516, Diff: 65, Sum: 1939]
      [Scan RS (ms): Min: 5.9, Avg: 6.0, Max: 6.0, Diff: 0.1, Sum: 23.8]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [Object Copy (ms): Min: 24.0, Avg: 24.2, Max: 24.4, Diff: 0.4, Sum: 96.9]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
         [Termination Attempts: Min: 1, Avg: 8.8, Max: 17, Diff: 16, Sum: 35]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.3]
      [GC Worker Total (ms): Min: 61.6, Avg: 61.7, Max: 61.7, Diff: 0.1, Sum: 246.8]
      [GC Worker End (ms): Min: 78143811.7, Avg: 78143811.7, Max: 78143811.8, Diff: 0.1]
   [Code Root Fixup: 0.2 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 1.8 ms]
   [Other: 23.4 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 15.6 ms]
      [Ref Enq: 0.4 ms]
      [Redirty Cards: 0.5 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 2.7 ms]
   [Eden: 6084.0M(6084.0M)->0.0B(6080.0M) Survivors: 60.0M->64.0M Heap: 8179.6M(10.0G)->2102.0M(10.0G)]
Heap after GC invocations=1261 (full 0):
 garbage-first heap   total 10485760K, used 2152475K [0x0000000540000000, 0x0000000540405000, 0x00000007c0000000)
  region size 4096K, 16 young (65536K), 16 survivors (65536K)
 Metaspace       used 82266K, capacity 85616K, committed 85760K, reserved 1124352K
  class space    used 9546K, capacity 10192K, committed 10240K, reserved 1048576K
}
 [Times: user=0.30 sys=0.00, real=0.08 secs] 

We can also see the alternation clearly: the JVM runs user code for a while, then pauses the threads for GC work, then resumes, over and over:

2020-07-28T18:29:37.259+0800: 78143.836: Total time for which application threads were stopped: 0.0960241 seconds, Stopping threads took: 0.0009531 seconds
2020-07-28T18:29:39.182+0800: 78145.759: Application time: 1.9228829 seconds
2020-07-28T18:29:39.190+0800: 78145.767: Total time for which application threads were stopped: 0.0079957 seconds, Stopping threads took: 0.0009701 seconds
2020-07-28T18:29:39.190+0800: 78145.767: Application time: 0.0002209 seconds
2020-07-28T18:29:39.195+0800: 78145.772: Total time for which application threads were stopped: 0.0047948 seconds, Stopping threads took: 0.0003450 seconds
2020-07-28T18:29:39.195+0800: 78145.772: Application time: 0.0003185 seconds
2020-07-28T18:29:39.199+0800: 78145.776: Total time for which application threads were stopped: 0.0038102 seconds, Stopping threads took: 0.0002901 seconds
2020-07-28T18:29:39.200+0800: 78145.776: Application time: 0.0004269 seconds
2020-07-28T18:29:39.203+0800: 78145.780: Total time for which application threads were stopped: 0.0036949 seconds, Stopping threads took: 0.0003017 seconds
2020-07-28T18:29:39.204+0800: 78145.780: Application time: 0.0003345 seconds
2020-07-28T18:29:39.207+0800: 78145.784: Total time for which application threads were stopped: 0.0037270 seconds, Stopping threads took: 0.0002750 seconds
2020-07-28T18:29:39.208+0800: 78145.784: Application time: 0.0002432 seconds
2020-07-28T18:29:39.211+0800: 78145.788: Total time for which application threads were stopped: 0.0034454 seconds, Stopping threads took: 0.0002974 seconds
2020-07-28T18:29:39.211+0800: 78145.788: Application time: 0.0002809 seconds
2020-07-28T18:29:39.215+0800: 78145.792: Total time for which application threads were stopped: 0.0034799 seconds, Stopping threads took: 0.0003126 seconds
2020-07-28T18:29:39.215+0800: 78145.792: Application time: 0.0001971 seconds
2020-07-28T18:29:39.218+0800: 78145.795: Total time for which application threads were stopped: 0.0033639 seconds, Stopping threads took: 0.0002907 seconds
2020-07-28T18:29:39.219+0800: 78145.795: Application time: 0.0002002 seconds
2020-07-28T18:29:39.222+0800: 78145.799: Total time for which application threads were stopped: 0.0034734 seconds, Stopping threads took: 0.0002560 seconds
2020-07-28T18:29:39.223+0800: 78145.799: Application time: 0.0004148 seconds
2020-07-28T18:29:39.230+0800: 78145.807: Total time for which application threads were stopped: 0.0078923 seconds, Stopping threads took: 0.0007479 seconds
2020-07-28T18:29:39.231+0800: 78145.807: Application time: 0.0002697 seconds
2020-07-28T18:29:39.239+0800: 78145.816: Total time for which application threads were stopped: 0.0080425 seconds, Stopping threads took: 0.0005797 seconds
2020-07-28T18:29:39.239+0800: 78145.816: Application time: 0.0004746 seconds
2020-07-28T18:29:39.247+0800: 78145.824: Total time for which application threads were stopped: 0.0081480 seconds, Stopping threads took: 0.0005586 seconds
2020-07-28T18:29:39.248+0800: 78145.824: Application time: 0.0002206 seconds
2020-07-28T18:29:39.255+0800: 78145.832: Total time for which application threads were stopped: 0.0077953 seconds, Stopping threads took: 0.0003357 seconds
2020-07-28T18:29:39.256+0800: 78145.833: Application time: 0.0004123 seconds
2020-07-28T18:29:39.263+0800: 78145.840: Total time for which application threads were stopped: 0.0075859 seconds, Stopping threads took: 0.0004447 seconds
2020-07-28T18:29:39.264+0800: 78145.840: Application time: 0.0001576 seconds
2020-07-28T18:29:39.271+0800: 78145.847: Total time for which application threads were stopped: 0.0070406 seconds, Stopping threads took: 0.0003025 seconds
2020-07-28T18:29:39.271+0800: 78145.848: Application time: 0.0003462 seconds
2020-07-28T18:29:39.278+0800: 78145.855: Total time for which application threads were stopped: 0.0071142 seconds, Stopping threads took: 0.0004311 seconds
2020-07-28T18:29:39.278+0800: 78145.855: Application time: 0.0001479 seconds
2020-07-28T18:29:39.285+0800: 78145.862: Total time for which application threads were stopped: 0.0071285 seconds, Stopping threads took: 0.0003712 seconds
2020-07-28T18:29:39.287+0800: 78145.864: Application time: 0.0018179 seconds
2020-07-28T18:29:39.295+0800: 78145.872: Total time for which application threads were stopped: 0.0075329 seconds, Stopping threads took: 0.0005250 seconds
2020-07-28T18:29:39.296+0800: 78145.872: Application time: 0.0008192 seconds
2020-07-28T18:29:39.303+0800: 78145.880: Total time for which application threads were stopped: 0.0072368 seconds, Stopping threads took: 0.0002966 seconds
2020-07-28T18:29:39.303+0800: 78145.880: Application time: 0.0006276 seconds
2020-07-28T18:29:39.324+0800: 78145.901: Total time for which application threads were stopped: 0.0202434 seconds, Stopping threads took: 0.0002417 seconds
2020-07-28T18:30:04.065+0800: 78170.642: Application time: 24.7410815 seconds
2020-07-28T18:30:04.073+0800: 78170.650: Total time for which application threads were stopped: 0.0082734 seconds, Stopping threads took: 0.0009748 seconds
2020-07-28T18:30:14.002+0800: 78180.578: Application time: 9.9284404 seconds
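A quick way to quantify these pauses is to sum the stopped times straight out of the log. A minimal sketch (the regex targets the -XX:+PrintGCApplicationStoppedTime line format shown above):

```java
import java.util.*;
import java.util.regex.*;

public class StoppedTimeStats {
    private static final Pattern STOPPED = Pattern.compile(
            "Total time for which application threads were stopped: ([0-9.]+) seconds");

    // Sums the STW time reported by -XX:+PrintGCApplicationStoppedTime lines.
    public static double totalStoppedSeconds(List<String> logLines) {
        double total = 0.0;
        for (String line : logLines) {
            Matcher m = STOPPED.matcher(line);
            if (m.find()) {
                total += Double.parseDouble(m.group(1));
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "78143.836: Total time for which application threads were stopped: 0.0960241 seconds, Stopping threads took: 0.0009531 seconds",
            "78145.759: Application time: 1.9228829 seconds",
            "78145.767: Total time for which application threads were stopped: 0.0079957 seconds, Stopping threads took: 0.0009701 seconds");
        System.out.printf("total STW: %.7f s%n", totalStoppedSeconds(sample));
    }
}
```

Feeding a whole log file through this (e.g. per minute of timestamps) makes pause spikes like our 1 s remark stand out immediately.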

Mixed GC phase

A mixed GC in G1 collects all young regions, plus a number of old regions that the global concurrent marking statistics identify as having high collection payoff, choosing as many high-payoff old regions as fit within the user-specified pause-time budget. So a mixed GC reclaims both young and old regions.

FullGC

G1 normally should not fall back to a Full GC; if it does, you basically have a disaster on your hands.

https://zhuanlan.zhihu.com/p/52841787

Reading our G1 logs

Reference: G1 log format analysis: https://segmentfault.com/a/1190000018885417

https://matt33.com/2018/07/28/jvm-cms/

With detailed GC logging enabled via JVM flags, the log at the time of the incident looks like this:

2020-07-28T18:30:14.107+0800: 78180.684: [GC concurrent-root-region-scan-start]
2020-07-28T18:30:14.108+0800: 78180.685: Total time for which application threads were stopped: 0.1065961 seconds, Stopping threads took: 0.0010147 seconds
2020-07-28T18:30:14.223+0800: 78180.800: [GC concurrent-root-region-scan-end, 0.1157296 secs]
2020-07-28T18:30:14.223+0800: 78180.800: [GC concurrent-mark-start]
2020-07-28T18:30:15.297+0800: 78181.873: [GC concurrent-mark-end, 1.0732893 secs]
2020-07-28T18:30:15.297+0800: 78181.873: Application time: 1.1884735 seconds
2020-07-28T18:30:15.304+0800: 78181.881: [GC remark 2020-07-28T18:30:15.304+0800: 78181.881: [Finalize Marking, 0.0042761 secs] 2020-07-28T18:30:15.309+0800: 78181.885: [GC ref-proc2020-07-28T18:30:15.309+0800: 78181.885: [SoftReference, 434351 refs, 0.0672266 secs]2020-07-28T18:30:15.376+0800: 78181.953: [WeakReference, 376 refs, 0.0004634 secs]2020-07-28T18:30:15.376+0800: 78181.953: [FinalReference, 435363 refs, 0.3004894 secs]2020-07-28T18:30:15.677+0800: 78182.254: [PhantomReference, 0 refs, 331 refs, 0.0009668 secs]2020-07-28T18:30:15.678+0800: 78182.255: [JNI Weak Reference, 0.0002237 secs], 0.4895046 secs] 2020-07-28T18:30:15.798+0800: 78182.375: [Unloading, 0.0347793 secs], 0.5384580 secs]
 [Times: user=1.65 sys=0.24, real=0.54 secs]

Notice that the GC ref-proc phase takes very long, and it is a stop-the-world step. Also look at FinalReference: more than 400,000 of them, versus the usual 60,000 or so (which is already a lot).

Let's review what this concurrent marking actually does.

Understanding global concurrent marking

Global concurrent marking consists of five phases:

(1) Initial Mark (STW)

Initial Mark is an STW event. It marks the objects directly reachable from the GC roots and pushes their fields onto the marking stack for later scanning. G1 records mark information in external bitmaps rather than in the mark bit of each object header's mark word. Because a pause is needed anyway, the Initial Mark is usually piggybacked on a young GC's STW pause, which is how a global concurrent marking cycle starts; logically, global concurrent marking is independent of YGC.

(2) Root Region Scanning

Root region scanning starts from the objects in the Survivor regions, marks the old-generation objects they reference, and pushes their fields onto the marking stack for later scanning. Unlike Initial Mark, root region scanning requires no STW pause and runs concurrently with the application, but it must complete before the next YGC starts.

(3) Concurrent Marking

No STW required. References are repeatedly popped off the marking stack and the whole heap is scanned recursively: each object reached is marked and its fields are pushed onto the stack, until the stack is empty. Along the way, references recorded by the SATB write barrier are also scanned. Concurrent marking can be interrupted by young GCs.

This corresponds to these log lines:

2020-07-28T18:30:14.107+0800: 78180.684: [GC concurrent-root-region-scan-start]
2020-07-28T18:30:14.108+0800: 78180.685: Total time for which application threads were stopped: 0.1065961 seconds, Stopping threads took: 0.0010147 seconds
2020-07-28T18:30:14.223+0800: 78180.800: [GC concurrent-root-region-scan-end, 0.1157296 secs]
2020-07-28T18:30:14.223+0800: 78180.800: [GC concurrent-mark-start]
2020-07-28T18:30:15.297+0800: 78181.873: [GC concurrent-mark-end, 1.0732893 secs]

(4) Remark (STW)

An STW operation. After concurrent marking completes, each Java thread still has some references recorded by the SATB write barrier left unprocessed; this phase drains them. It also performs reference processing. Note the essential difference from CMS's remark: this pause only needs to scan the SATB buffers, whereas CMS's remark must rescan the dirty cards in the mod-union table plus the entire root set, with the whole young generation (live objects and dead) counted as part of the roots, which is why a CMS remark can be extremely slow.

This is exactly where our problem shows up:

2020-07-28T18:30:15.304+0800: 78181.881: [GC remark 2020-07-28T18:30:15.304+0800: 78181.881: [Finalize Marking, 0.0042761 secs] 2020-07-28T18:30:15.309+0800: 78181.885: [GC ref-proc2020-07-28T18:30:15.309+0800: 78181.885: [SoftReference, 434351 refs, 0.0672266 secs]2020-07-28T18:30:15.376+0800: 78181.953: [WeakReference, 376 refs, 0.0004634 secs]2020-07-28T18:30:15.376+0800: 78181.953: [FinalReference, 435363 refs, 0.3004894 secs]2020-07-28T18:30:15.677+0800: 78182.254: [PhantomReference, 0 refs, 331 refs, 0.0009668 secs]2020-07-28T18:30:15.678+0800: 78182.255: [JNI Weak Reference, 0.0002237 secs], 0.4895046 secs] 2020-07-28T18:30:15.798+0800: 78182.375: [Unloading, 0.0347793 secs], 0.5384580 secs]

(5) Cleanup (STW and concurrent)

STW: identify which regions contain live objects and which are completely empty.

STW: update the RSets.

From the analysis above we can basically conclude: there were simply too many FinalReferences, so processing them during the marking cycle took a long time, causing the application pauses and latency jitter.

Having located the phase at fault, the next question was: why are there so many FinalReferences, and can we reduce them?
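For background on where FinalReferences come from: every instance of a class that overrides finalize() is tracked by the JVM through a FinalReference (java.lang.ref.Finalizer) and needs extra GC work before it can be reclaimed. SocksSocketImpl overrides finalize() (in JDK 8) to close leaked sockets, which is why each short-lived connection contributes one. A minimal illustrative sketch (the GC/finalizer nudging loop is best-effort; timing is not guaranteed by the spec):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class FinalizableDemo {
    public static final AtomicInteger FINALIZED = new AtomicInteger();

    static class Finalizable {
        // Overriding finalize() makes the JVM register a FinalReference
        // for every instance at allocation time.
        @Override protected void finalize() { FINALIZED.incrementAndGet(); }
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 10_000; i++) {
            new Finalizable();   // each allocation registers a FinalReference
        }
        // Nudge the collector and the finalizer thread; not guaranteed, so retry.
        for (int i = 0; i < 50 && FINALIZED.get() == 0; i++) {
            System.gc();
            System.runFinalization();
            Thread.sleep(20);
        }
        System.out.println("finalized so far: " + FINALIZED.get());
    }
}
```

Ten thousand throwaway instances here mean ten thousand FinalReferences for the next marking cycle to process, which is exactly the shape of our production symptom.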

Heap analysis

We pulled a production heap dump down with jmap -dump:format=b,file=heapdump.hprof and analyzed it, but found nothing conclusive beyond confirming, in the class histogram, that these objects were indeed extremely numerous and still tied to SocksSocketImpl.

Pinpointing the problem

Finding where new SocksSocketImpl() comes from

Why are there so many java.net.SocksSocketImpl instances? Using the same technique as before, we attached Arthas and monitored the SocksSocketImpl constructor, which showed new connections being created continuously.

Attempt 1: disable JMX collection

Quite a few of the connections came from JMX metrics collection, so I disabled it to see what would happen. It made no difference at all.

Attempt 2: is httpclient using short-lived connections?

Next I looked into why new connections kept being created. Were connections failing to be reused? Plenty of people online have hit the same problem, often caused by mistakenly using short-lived (non-keep-alive) connections.

I started to suspect the company middleware. Our HTTP calls all go through a thin wrapper the middleware team builds around httpclient, for metrics and tracing. Testing locally, connections were kept alive just fine. Could it be that connections could not be kept alive on the production network?

Understanding httpclient and tuning its connection-pool parameters

Internally we usually invoke HTTP services like this:

JSONObject obj = restTemplate.getForObject(realUrl, JSONObject.class);

This is a template method Spring wraps for you, which greatly lowers the cost of using httpclient. Back in the day, everyone who used httpclient directly wrote their own wrapper: a connection pool, an idle-connection eviction thread, and so on.

The company middleware uses the extension hooks the template method provides to add instrumentation, and exposes the pool configuration in application.yml.

A look at PoolingHttpClientConnectionManager

I crawled through the PoolingHttpClientConnectionManager source to understand the connection-pool management and figure out why connections might not be kept alive. The analysis went as follows.

Classes involved

CPoolEntry: one unit of the pool, holding the route information and the underlying connection.

CPool: lease attempts to rent a connection; this method returns a Future.

Future<CPoolEntry>: offers a non-blocking way to obtain a pool entry.

How does this Future obtain a connection? Calling its get method returns the previously obtained entry; if there is none, it calls getPoolEntryBlocking to acquire a pool entry.

That method in turn calls RouteSpecificPool.getFree to get a free, usable connection.

RouteSpecificPool: holds two linked lists of per-route connection entries. getFree takes an element from the available list and moves it onto the leased list, which represents a successful lease.

  try {
      final RouteSpecificPool<T, C, E> pool = getPool(route);
      E entry;
      for (;;) {
          Asserts.check(!this.isShutDown, "Connection pool shut down");
          for (;;) {
              // Take an entry off the route's available list, if any
              entry = pool.getFree(state);
              if (entry == null) {
                  break;
              }
              if (entry.isExpired(System.currentTimeMillis())) {
                  entry.close();
              }
              // An expired entry was closed above and is no longer usable
              if (entry.isClosed()) {
                  // Remove the dead entry from the pool-wide available list
                  this.available.remove(entry);
                  // Release it back to the route pool; "false" means it is
                  // NOT put back on the available list
                  pool.free(entry, false);
              } else {
                  break;
              }
          }
          // The inner loop ended with either a usable entry or null
          if (entry != null) {
              // Got one: move it from available to leased and return it
              this.available.remove(entry);
              this.leased.add(entry);
              // Mark it as reused
              onReuse(entry);
              return entry;
          }
          // null: a new connection needs to be created
          final int maxPerRoute = getMax(route);
          // Shrink the pool prior to allocating a new connection
          final int excess = Math.max(0, pool.getAllocatedCount() + 1 - maxPerRoute);
          // (Author's note: is this excess always >= 1 here?)
          if (excess > 0) {
              for (int i = 0; i < excess; i++) {
                  final E lastUsed = pool.getLastUsed();
                  if (lastUsed == null) {
                      break;
                  }
                  lastUsed.close();
                  this.available.remove(lastUsed);
                  pool.remove(lastUsed);
              }
          }
          // If there is still capacity, create a new connection
          if (pool.getAllocatedCount() < maxPerRoute) {
              final int totalUsed = this.leased.size();
              final int freeCapacity = Math.max(this.maxTotal - totalUsed, 0);
              if (freeCapacity > 0) {
                  final int totalAvailable = this.available.size();
                  if (totalAvailable > freeCapacity - 1) {
                      if (!this.available.isEmpty()) {
                          final E lastUsed = this.available.removeLast();
                          lastUsed.close();
                          final RouteSpecificPool<T, C, E> otherpool = getPool(lastUsed.getRoute());
                          otherpool.remove(lastUsed);
                      }
                  }
                  final C conn = this.connFactory.create(route);
                  entry = pool.add(conn);
                  this.leased.add(entry);
                  return entry;
              }
          }
          // ... (the rest of the loop waits on the pool condition until a
          //      connection is released or the request times out)
      }
  }

This walk-through makes the meaning of the configuration parameters clear:

max-total: the maximum number of connections in the pool

max-per-route: the maximum number of connections per host (route) in the pool

default-max-per-route: the per-route limit used when no route-specific configuration is found
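How these three limits interact can be sketched with a minimal stand-in pool (purely illustrative; the real Apache CPool shown above adds blocking, eviction, and reuse on top of the same bookkeeping):

```java
import java.util.*;

// Minimal sketch (not the real Apache CPool) of how max-total,
// max-per-route and default-max-per-route bound connection creation.
public class MiniPool {
    private final int maxTotal;
    private final int defaultMaxPerRoute;
    private final Map<String, Integer> maxPerRoute = new HashMap<>();
    private final Map<String, Integer> perRouteUsed = new HashMap<>();
    private int totalUsed = 0;

    public MiniPool(int maxTotal, int defaultMaxPerRoute) {
        this.maxTotal = maxTotal;
        this.defaultMaxPerRoute = defaultMaxPerRoute;
    }

    public void setMaxPerRoute(String route, int max) { maxPerRoute.put(route, max); }

    // Try to lease a connection; false means a cap was hit and the real
    // pool would block until connection-request-timeout expires.
    public boolean lease(String route) {
        int cap = maxPerRoute.getOrDefault(route, defaultMaxPerRoute);
        int used = perRouteUsed.getOrDefault(route, 0);
        if (used >= cap || totalUsed >= maxTotal) {
            return false;
        }
        perRouteUsed.put(route, used + 1);
        totalUsed++;
        return true;
    }
}
```

With maxTotal=3 and defaultMaxPerRoute=2, a third lease on the same route fails on the per-route cap, and once three connections are out overall, any route fails on the global cap, which is exactly the "route allocated: 50 of 50" saturation we later saw in the DEBUG logs.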

The company middleware adds one more layer of logic: by default there is a single global pool using the configuration above, but it also lets you configure one pool per host. The format looks like this; below there are two pools, each with its own settings.

httpclient:
  default-pool:
    default-max-per-route: 20
    max-total: 100
    max-per-route: 50
    keep-alive-time: 60
    connect-timeout: 5000
    connection-request-timeout: 5000
    socket-timeout: 1000
    idle-connections-timeout: 30
    idle-connections-check-schedule: 5000
  pool-configs:
      "[http://10.69.3.58:9003]":
        max-total: 213
        max-per-route: 123
        keep-alive-time: 900
        connect-timeout: 450
        connection-request-timeout: 450
        socket-timeout: 450
        idle-connections-timeout: 30
        idle-connections-check-schedule: 5000

Locating the problem in production

With PoolingHttpClientConnectionManager understood, we could simply turn on its logging and see what was actually going on:

  <Logger name="org.apache.http.impl.conn.PoolingHttpClientConnectionManager" level="DEBUG">
  </Logger>

With DEBUG logging enabled for this class, the log showed:

2020-08-26 16:47:36 DEBUG[pool-7-thread-149] [o.a.h.i.c.PoolingHttpClientConnectionManager.requestConnection:266] [154ee4edcd6d1152,154ee4edcd6d1152,0] Connection request: [route: {s}->https://aiengine-hellomap-route-inner.hellobike.cn:9003][total kept alive: 0; route allocated: 50 of 50; total allocated: 50 of 100]
2020-08-26 16:47:36 DEBUG[pool-7-thread-219] [o.a.h.i.c.PoolingHttpClientConnectionManager.leaseConnection:310] [8f2ce76435c397c6,8f2ce76435c397c6,0] Connection leased: [id: 607][route: {s}->https://aiengine-hellomap-route-inner.hellobike.cn:9003][total kept alive: 0; route allocated: 50 of 50; total allocated: 50 of 100]

Read alongside the PoolingHttpClientConnectionManager source, this is fairly easy to interpret.

Why does it say 50 of 50 when my configuration says otherwise? It later turned out to be a bug in the middleware: for https routes, the user-configured max-per-route is silently ignored. But that is another story.

The log makes it clear that not only is the connection pool in use, the connections are in fact being kept alive. So this hypothesis was wrong.

Could it be a traffic spike?

This application also serves another special, scheduled-job-style application. Could its bursts of traffic be creating lots of connections and, ultimately, the flood of FinalReferences? We routed that traffic away; the problem remained.

Could it be the SSL session cache?

Around the same time we heard that another developer had also hit a GC problem, caused by a JDK bug:

https://bugs.openjdk.java.net/browse/JDK-8210985

-Djavax.net.ssl.sessionCacheSize=101 bounds the length of the session cache queue. Tuning this parameter made no difference either.

An unconventional fix

We never did find the root cause of the excess FinalReferences; perhaps it is simply a characteristic of this application. Given the high concurrency and a backend that is pure HTTP requests, a large number of connections is normal. Why the FinalReference count suddenly jumped from ~60,000 to over 400,000 and brought GC pauses with it remains a mystery to this day.

But the production problem still existed and had to be fixed fast. If you can't solve the problem, "solve" the person who raised it! Back to the essence: the pauses came from having too many FinalReferences to process at concurrent-marking time, and every incident coincided with a marking cycle. So what if we ran concurrent marking more often, proactively, instead of letting FinalReferences pile up until they blew up?

Running concurrent marking more often

No sooner said than done. We added -XX:InitiatingHeapOccupancyPercent=7 to trigger concurrent marking proactively; the default is 45, meaning marking starts only when heap occupancy reaches 45%. We also moved from 4-core to 8-core machines and raised the concurrent-marking threads from 4 to 6: give the collector more compute and let it trim the FinalReferences for us. Would it work?
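For reference, the changed part of the launch options might look like this (a sketch following the GC_OPTS style shown earlier; the logging flags and the rest of the configuration stay as before):

```
GC_OPTS="$GC_OPTS -XX:+UseG1GC \
-XX:MaxGCPauseMillis=200 \
-XX:InitiatingHeapOccupancyPercent=7 \
-XX:+ParallelRefProcEnabled -XX:ConcGCThreads=6"
```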

After the change, concurrent marking indeed ran much more frequently, and after each cycle the FinalReference count dropped as hoped. After a week of observation the GC problem was gone: no more GC-induced timeouts.

In the logs, 70,000-odd FinalReferences dropped to a few thousand after a concurrent marking cycle. Goal achieved.

2020-09-08T18:26:29.601+0800: 509725.355: [SoftReference, 0 refs, 0.0005530 secs]2020-09-08T18:26:29.602+0800: 509725.355: [WeakReference, 651 refs, 0.0003417 secs]2020-09-08T18:26:29.602+0800: 509725.356: [FinalReference, 59425 refs, 0.0084187 secs]2020-09-08T18:26:29.611+0800: 509725.364: [PhantomReference, 0 refs, 0 refs, 0.0007736 secs]2020-09-08T18:26:29.611+0800: 509725.365: [JNI Weak Reference, 0.0000406 secs], 0.0485472 secs]
   [Parallel Time: 31.2 ms, GC Workers: 6]
      [GC Worker Start (ms): Min: 509725323.3, Avg: 509725323.4, Max: 509725323.5, Diff: 0.1]
      [Ext Root Scanning (ms): Min: 3.8, Avg: 5.3, Max: 11.4, Diff: 7.6, Sum: 31.7]
      [Update RS (ms): Min: 3.6, Avg: 9.6, Max: 11.0, Diff: 7.3, Sum: 57.7]
         [Processed Buffers: Min: 256, Avg: 319.7, Max: 376, Diff: 120, Sum: 1918]
      [Scan RS (ms): Min: 0.1, Avg: 0.3, Max: 0.3, Diff: 0.2, Sum: 1.6]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [Object Copy (ms): Min: 15.2, Avg: 15.5, Max: 15.8, Diff: 0.5, Sum: 93.2]
      [Termination (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.3]
         [Termination Attempts: Min: 1, Avg: 150.8, Max: 196, Diff: 195, Sum: 905]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.5]
      [GC Worker Total (ms): Min: 30.8, Avg: 30.8, Max: 30.9, Diff: 0.1, Sum: 185.0]
      [GC Worker End (ms): Min: 509725354.2, Avg: 509725354.2, Max: 509725354.3, Diff: 0.1]
   [Code Root Fixup: 0.2 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.6 ms]
   [Other: 16.6 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 10.4 ms]
      [Ref Enq: 0.3 ms]
      [Redirty Cards: 0.1 ms]
      [Humongous Register: 0.0 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 1.4 ms]
   [Eden: 6084.0M(6084.0M)->0.0B(6084.0M) Survivors: 60.0M->60.0M Heap: 7811.1M(10.0G)->1725.1M(10.0G)]
Heap after GC invocations=19624 (full 0):
 garbage-first heap   total 10485760K, used 1766479K [0x0000000540000000, 0x0000000540405000, 0x00000007c0000000)
  region size 4096K, 15 young (61440K), 15 survivors (61440K)
 Metaspace       used 96904K, capacity 100148K, committed 101504K, reserved 1138688K
  class space    used 11478K, capacity 12106K, committed 12416K, reserved 1048576K
}
 [Times: user=0.19 sys=0.06, real=0.05 secs]
2020-09-08T18:26:29.616+0800: 509725.370: [GC concurrent-root-region-scan-start]
2020-09-08T18:26:29.617+0800: 509725.370: Total time for which application threads were stopped: 0.0667356 seconds, Stopping threads took: 0.0006084 seconds
2020-09-08T18:26:29.624+0800: 509725.377: [GC concurrent-root-region-scan-end, 0.0075624 secs]
2020-09-08T18:26:29.624+0800: 509725.377: [GC concurrent-mark-start]
2020-09-08T18:26:29.745+0800: 509725.498: [GC concurrent-mark-end, 0.1212810 secs]
2020-09-08T18:26:29.745+0800: 509725.499: Application time: 0.1284912 seconds
2020-09-08T18:26:29.762+0800: 509725.516: [GC remark 2020-09-08T18:26:29.762+0800: 509725.516: [Finalize Marking, 0.0020710 secs] 2020-09-08T18:26:29.764+0800: 509725.518: [GC ref-proc2020-09-08T18:26:29.764+0800: 509725.518: [SoftReference, 244 refs, 0.0005201 secs]2020-09-08T18:26:29.765+0800: 509725.518: [WeakReference, 2549 refs, 0.0006340 secs]2020-09-08T18:26:29.766+0800: 509725.519: [FinalReference, 34 refs, 0.0003842 secs]2020-09-08T18:26:29.766+0800: 509725.519: [PhantomReference, 0 refs, 772 refs, 0.0008726 secs]2020-09-08T18:26:29.767+0800: 509725.520: [JNI Weak Reference, 0.0002677 secs], 0.0028920 secs] 2020-09-08T18:26:29.767+0800: 509725.521: [Unloading, 0.0209432 secs], 0.0292943 secs]
 [Times: user=0.10 sys=0.03, real=0.03 secs]
2020-09-08T18:26:29.792+0800: 509725.546: Total time for which application threads were stopped: 0.0473635 seconds, Stopping threads took: 0.0005978 seconds
2020-09-08T18:26:29.793+0800: 509725.546: Application time: 0.0001449 seconds
2020-09-08T18:26:29.809+0800: 509725.563: [GC cleanup 1797M->1797M(10G), 0.0051662 secs]
 [Times: user=0.02 sys=0.00, real=0.00 secs]

Summary

With that, a GC problem that had plagued us for over a month was resolved. Although we never found out why the FinalReferences suddenly multiplied, we solved the production problem by other means and cured the stubborn GC pauses. Along the way, we gained the following:

  1. A solid understanding of how the G1 collector works, a foundation for future troubleshooting
  2. Familiarity with G1's GC logs; dense as they look, they hide a lot of key information, and with the background above they are no longer hard to read, it just takes patience
  3. While debugging, we read the company middleware's code and found a significant bug for them; we also studied the design and source of the Apache httpclient connection pool
  4. More practice with the heap-analysis tool MAT, even though it did not pinpoint the problem
  5. Awareness of the -Djavax.net.ssl.sessionCacheSize bug (JDK-8210985)
