A Memory Spike Analysis of a .NET Backend Service at a Recruitment Website

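Part 1: Background
1. The story
A while ago a friend reached out to me on WeChat. His program's memory usage spikes periodically and he wanted to know how to fix it. After talking it through: memory normally sits at around 5 GB, but around certain points in time it shoots up to 10 GB or more. Sketched as a chart, it looks roughly like this.
[Figure: memory usage over time, about 5 GB at baseline with periodic spikes above 10 GB]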

So the next step is to find out what that mysterious extra 5-6 GB actually is. Let WinDbg do the talking.
Part 2: WinDbg analysis
1. Managed or unmanaged?
From the description this is most likely a managed-memory problem, but for the sake of completeness let's still confirm it with !address -summary and !eeheap -gc.

    0:000> !address -summary

    --- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
    Free                                   1164      7f5`58f12000 (   7.958 TB)           99.48%
    <unknown>                              6924        a`6de84000 (  41.717 GB)  97.90%    0.51%
    Stack                                  1123        0`16340000 ( 355.250 MB)   0.81%    0.00%
    Image                                  4063        0`1607d000 ( 352.488 MB)   0.81%    0.00%
    Heap                                     71        0`0c9ea000 ( 201.914 MB)   0.46%    0.00%
    TEB                                     374        0`002ec000 (   2.922 MB)   0.01%    0.00%
    Other                                    13        0`001c6000 (   1.773 MB)   0.00%    0.00%
    PEB                                       1        0`00001000 (   4.000 kB)   0.00%    0.00%

    --- Type Summary (for busy) ------ RgnCount ----------- Total Size -------- %ofBusy %ofTotal
    MEM_PRIVATE                            5423        a`87200000 (  42.111 GB)  98.83%    0.51%
    MEM_IMAGE                              7033        0`1e5d6000 ( 485.836 MB)   1.11%    0.01%
    MEM_MAPPED                              113        0`01908000 (  25.031 MB)   0.06%    0.00%

    --- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
    MEM_FREE                               1164      7f5`58f12000 (   7.958 TB)           99.48%
    MEM_RESERVE                            4165        8`1b873000 (  32.430 GB)  76.11%    0.40%
    MEM_COMMIT                             8404        2`8b86b000 (  10.180 GB)  23.89%    0.12%

    0:000> !eeheap -gc
    Number of GC Heaps: 32
    ------------------------------
    Heap 0 (00000000004106d0)
    generation 0 starts at 0x0000000082eb0e58
    generation 1 starts at 0x0000000082d79b20
    generation 2 starts at 0x000000007fff1000
    ephemeral segment allocation context: none
             segment             begin         allocated              size
    000000007fff0000  000000007fff1000  0000000083f80128  0x3f8f128(66646312)
    Large object heap starts at 0x000000087fff1000
             segment             begin         allocated              size
    000000087fff0000  000000087fff1000  0000000883fe4190  0x3ff3190(67056016)
    0000000927ff0000  0000000927ff1000  000000092bfe2430  0x3ff1430(67048496)
    0000000a81c50000  0000000a81c51000  0000000a8221c858  0x5cb858(6076504)
    Heap Size:       Size: 0xc53ef40 (206827328) bytes.
    ------------------------------
    ...
    Heap 31 (0000000019c84130)
    generation 0 starts at 0x0000000844fc5170
    generation 1 starts at 0x0000000844f851f8
    generation 2 starts at 0x000000083fff1000
    ephemeral segment allocation context: none
             segment             begin         allocated              size
    000000083fff0000  000000083fff1000  0000000845171ca0  0x5180ca0(85462176)
    Large object heap starts at 0x00000008fbff1000
             segment             begin         allocated              size
    00000008fbff0000  00000008fbff1000  00000008fffe2290  0x3ff1290(67048080)
    000000094bff0000  000000094bff1000  000000094ea2ebb8  0x2a3dbb8(44293048)
    000000096bff0000  000000096bff1000  000000096dbdec00  0x1bedc00(29285376)
    Heap Size:       Size: 0xd79d6e8 (226088680) bytes.
    ------------------------------
    GC Heap Size:    Size: 0x1f1986a88 (8348265096) bytes.
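A side note on the numbers in this output: the hex sizes are easy to convert yourself. The sketch below is plain JavaScript meant to run in Node.js, not inside WinDbg. The two constants are copied from the output; the helper name toGiB is made up for illustration. Keep in mind that WinDbg's "GB" means 2^30 bytes.

```javascript
"use strict";

// Values copied from the debugger output.
const memCommit = 0x28b86b000;   // MEM_COMMIT total size: 2`8b86b000
const gcHeapSize = 0x1f1986a88;  // GC Heap Size: 8348265096 bytes

// Hypothetical helper: convert bytes to GiB (2^30 bytes).
function toGiB(bytes) {
    return bytes / 2 ** 30;
}

// Share of committed memory that belongs to the managed heap.
const managedShare = gcHeapSize / memCommit;

console.log("commit  = " + toGiB(memCommit).toFixed(3) + " GiB");
console.log("gc heap = " + toGiB(gcHeapSize).toFixed(3) + " GiB");
console.log("managed share of commit = " + (managedShare * 100).toFixed(1) + "%");
```

By this estimate the managed heap makes up roughly three quarters of the committed memory, which is enough to justify looking at the managed side first.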

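The output shows that of the roughly 10 GB committed, the managed heap accounts for about 8.3 GB (8,348,265,096 bytes), so this is clearly a managed-side problem. With the general direction settled, the next step is to look inside the managed heap. Past experience says the program must have created a huge number of instances of some class, so let's run !dumpheap -stat.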
    0:000> !dumpheap -stat
    Statistics:
                  MT    Count    TotalSize Class Name
    ...
    000007fe9ddd5fc0   341280     30032640 System.ServiceModel.Description.MessagePartDescription
    000007fe9c4865a0   866349     41584752 System.Xml.XmlDictionaryString
    000007fe9defb098   937801     45014448 System.Xml.XmlDictionaryString
    000007fe9c66bd28   105052     45086880 System.Collections.Generic.Dictionary`2+Entry[[System.String, mscorlib],[System.Xml.XmlDictionaryString, System.Runtime.Serialization]][]
    000007fe9e0f4d20   113299     49050864 System.Collections.Generic.Dictionary`2+Entry[[System.String, mscorlib],[System.Xml.XmlDictionaryString, System.Runtime.Serialization]][]
    00000000003c9190    44573    618414438      Free
    000007fef8f6c168   428410   1209974642 System.Char[]
    000007fef8f4f1b8  2849758   1246912848 System.Object[]
    000007fef8f6f058   531963   1670620873 System.Byte[]
    000007fef8f6aee0  2368431   2382587716 System.String

As fate would have it, past experience did not apply this time. The biggest consumers are all basic types: Byte[], String, Char[] and Object[]. Basic types like these are hard to investigate. You either keep filtering with -min and -max, or you write a script that groups the objects by size and sorts the groups. My clumsy script is below:

    "use strict";

    /* Group the objects of a managed-heap type (identified by MT) by size. */
    let platform = 64;
    let mtlist = ["000007fef8f4f1b8"];
    let maxlimit = 100;

    function initializeScript() {
        return [new host.apiVersionSupport(1, 7)];
    }

    function log(str) {
        host.diagnostics.debugLog(str + "\n");
    }

    function exec(str) {
        log("\n" + str);
        return host.namespace.Debugger.Utility.Control.ExecuteCommand(str);
    }

    function invokeScript() {
        for (var mt of mtlist) {
            groupby_mtsize_inheap(mt);
        }
    }

    // Group all objects of one type by object size.
    function groupby_mtsize_inheap(mt) {
        var size_group = {};
        var commandText = "!dumpheap -mt " + mt;
        var output = exec(commandText);

        for (var line of output) {
            if (line == "" || line.indexOf("Address") > -1) continue;
            if (line.indexOf("Statistics") > -1) break;

            // Each row is "Address MT Size": skip the two hex columns
            // (platform / 4 digits each) and the space between them.
            var size = parseInt(line.substring(Math.ceil(platform / 2) + 1).trim());
            if (isNaN(size)) continue;

            if (!size_group[size]) size_group[size] = 0;
            size_group[size]++;
        }

        show_top10_format(mt, size_group);
    }

    function show_top10_format(mt, size_group) {
        var maparr = [];

        // Convert the size -> count map into an array so it can be sorted.
        for (var size in size_group) {
            maparr.push({ "size": size, "count": size_group[size], "totalsize": (size * size_group[size]) });
        }

        // Largest total size first.
        maparr.sort(function (a, b) { return b.totalsize - a.totalsize });

        var topTotalSize = 0;

        // Print one line per size group, up to maxlimit groups.
        for (var i = 0; i < Math.min(maparr.length, maxlimit); i++) {
            var size = maparr[i].size;
            var count = maparr[i].count;
            var totalsize = Math.round(maparr[i].totalsize / 1024 / 1024);

            topTotalSize += totalsize;
            log("size=" + size + ",count=" + count + ",totalsize=" + totalsize + "M");
        }

        log("Total:" + topTotalSize + "M");

        // List the addresses of the objects in the largest group.
        if (maparr.length > 0) {
            var size = maparr[0].size;
            var output = exec("!dumpheap -mt " + mt + " -min 0n" + size + " -max 0n" + size + " -short").Take(maxlimit);

            for (var line of output) {
                log(line);
            }
        }
    }

Next, pass in the method table address of string and look at the sorted result. The simplified output is as follows:
    !dumpheap -mt 000007fef8f6aee0
    size=29285946,count=2,totalsize=56M
    size=29285540,count=2,totalsize=56M
    size=29285502,count=2,totalsize=56M
    size=29285348,count=2,totalsize=56M
    size=27455186,count=2,totalsize=52M
    size=31116504,count=1,totalsize=30M
    size=31116490,count=1,totalsize=30M
    size=31116306,count=1,totalsize=30M
    size=31115934,count=1,totalsize=30M
    size=31115920,count=1,totalsize=30M
    size=31115718,count=1,totalsize=30M
    size=29286342,count=1,totalsize=28M
    size=29285898,count=1,totalsize=28M
    ...
    Total:1198M
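The grouping step behind this list can also be tried outside the debugger, for example on !dumpheap output saved to a text file. The following is a minimal Node.js sketch of the same idea, not the script that was run in WinDbg. The function name groupBySize is made up, and the sample rows only imitate the "Address MT Size" layout: the addresses are invented, and the small 26-byte entries are invented too.

```javascript
"use strict";

// Sample rows in the "Address MT Size" layout of !dumpheap -mt.
// Addresses are invented for illustration.
const sampleLines = [
    "         Address               MT     Size",
    "0000000a00001000 000007fef8f6aee0 29285946",
    "0000000a10001000 000007fef8f6aee0 31116490",
    "0000000a20001000 000007fef8f6aee0 29285946",
    "0000000a30001000 000007fef8f6aee0       26",
    "0000000a30001020 000007fef8f6aee0       26",
    "0000000a30001040 000007fef8f6aee0       26",
    "",
    "Statistics:",
];

// Count objects per size and sort the groups by total bytes, largest first.
function groupBySize(lines) {
    const groups = new Map();
    for (const line of lines) {
        if (line.includes("Statistics")) break;   // the summary table follows
        const parts = line.trim().split(/\s+/);
        if (parts.length < 3) continue;           // blank or malformed row
        const size = parseInt(parts[2], 10);
        if (Number.isNaN(size)) continue;         // header row
        groups.set(size, (groups.get(size) || 0) + 1);
    }
    return [...groups]
        .map(([size, count]) => ({ size, count, totalSize: size * count }))
        .sort((a, b) => b.totalSize - a.totalSize);
}

const result = groupBySize(sampleLines);
for (const g of result) {
    console.log("size=" + g.size + ",count=" + g.count + ",totalsize=" + g.totalSize);
}
```

Splitting on whitespace instead of using fixed column offsets is a design choice of this sketch: it avoids depending on the pointer width of the dump.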

As you can see, there are quite a few very large strings. So what are they? I picked a few at random and exported them to a txt file to take a look.
    0:000> !dumpheap -mt 000007fef8f6aee0 -min 0n31116490 -max 0n31116490 -short
    0000000a61c51000

    0:000> !do 0000000a61c51000
    Name:        System.String
    MethodTable: 000007fef8f6aee0
    EEClass:     000007fef88d3720
    Size:        31116490(0x1daccca) bytes
    File:        C:\Windows\Microsoft.Net\assembly\GAC_64\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll
    String:
    Fields:
                  MT    Field   Offset                 Type VT     Attr            Value Name
    000007fef8f6dc90  40000aa        8         System.Int32  1 instance         15558232 m_stringLength
    000007fef8f6c1c8  40000ab        c          System.Char  1 instance               50 m_firstChar
    000007fef8f6aee0  40000ac       18        System.String  0   shared           static Empty
                                     >> Domain:Value  00000000003fb620:NotInit  000000001ca30bd0:NotInit  000000001f7b21a0:NotInit  000000001f8940c0:NotInit  0000000027dc46b0:NotInit  00000000281bd720:NotInit  00000000282b7ee0:NotInit  <<

    0:000> .writemem D:\dumps\xxxx\string.txt 0000000a61c51000 L?0x1daccca
    Writing 1daccca bytes..........
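A file written this way is a raw copy of the object, header included, so a text editor shows it as UTF-16 with a few junk bytes in front. The Node.js sketch below shows one way to pull the text out and look for a known file signature. It is only a sketch: the offsets come from the !do output (m_stringLength at offset 8, m_firstChar at offset c), the function names are hypothetical, and the marker "JVBERi0" is what a file starting with "%PDF-1" looks like after base64 encoding.

```javascript
"use strict";

// Layout assumption, taken from the !do output (64-bit .NET Framework):
//   offset 0x0: MethodTable pointer (8 bytes)
//   offset 0x8: m_stringLength (Int32, count of UTF-16 chars)
//   offset 0xc: m_firstChar (start of the UTF-16LE character data)
function readDumpedString(buf) {
    const length = buf.readInt32LE(8);
    return buf.toString("utf16le", 12, 12 + length * 2);
}

// Returns the index of a base64-encoded PDF header, or -1 if none is found.
function findPdfBase64(text) {
    return text.indexOf("JVBERi0");
}

// Synthetic demo: build a fake dumped string that wraps a base64 PDF header.
const payload = "<doc>" + Buffer.from("%PDF-1.4\n").toString("base64") + "</doc>";
const header = Buffer.alloc(12);                 // fake MT pointer + length field
header.writeInt32LE(payload.length, 8);
const dump = Buffer.concat([header, Buffer.from(payload, "utf16le"), Buffer.alloc(2)]);

const text = readDumpedString(dump);
console.log("marker found at index " + findPdfBase64(text));

// Against a real export (the path is a placeholder):
//   const fs = require("fs");
//   const real = readDumpedString(fs.readFileSync("string.txt"));
//   console.log(findPdfBase64(real));
```

Searching for the marker anywhere in the text, rather than only at position 0, allows for the base64 payload being wrapped in some other envelope.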


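Judging from the content, it is the base64 encoding of a PDF. Examining the char[] and byte[] instances in the same way shows that most of them are PDFs as well. My guess is that while processing PDFs the program converts back and forth between byte[], char[] and string, so in theory most of these objects should be unrooted. In fact, !heapstat -iu also shows roughly 5.5 GB of unrooted objects waiting to be collected by the GC.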
    0:000> !heapstat -iu
    Heap             Gen0         Gen1         Gen2          LOH
    Heap0        17625808      1274680     47745824    140181016
    ...
    Total       357486256     28100616   2229673376   5733004848

    Free space:                                                 Percentage
    Heap0         3962240           24     11211224       298616 SOH: 22% LOH:  0%
    Heap1         5625856          144      9857168       302152 SOH: 27% LOH:  0%
    ...
    Heap31        1448576           24     19957312       218024 SOH: 25% LOH:  0%
    Total       181492784         1136    431825856      5183128

    Unrooted objects:                                           Percentage
    Heap0        12163928       243584        42872    137153536 SOH: 18% LOH: 97%
    ...
    Heap31         236832       239272      1435840    139770656 SOH:  2% LOH: 99%
    Total       164954952      7948448     29066480   5530423784

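Part 3: Summary
This episode of periodic memory spikes comes down mainly to the program receiving too many PDF files from upstream. These are all large objects, and on top of that they are converted between char[], string and byte[], which drives memory usage far too high within a short time.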
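To get a feel for why this conversion chain is expensive, here is a back-of-the-envelope estimate in JavaScript. The assumptions are mine: base64 produces 4 characters per 3 input bytes, a .NET char takes 2 bytes (UTF-16), every step makes a full copy, and the 15558232-character string seen in the dump is treated as if it were entirely base64 payload. The actual sequence of conversions in the service is not known, and the function name estimateCopies is made up.

```javascript
"use strict";

// Rough estimate of the buffers created while converting one PDF.
// This is an estimate under the stated assumptions, not a measurement.
function estimateCopies(pdfBytes) {
    const base64Chars = Math.ceil(pdfBytes / 3) * 4; // base64: 4 chars per 3 bytes
    const utf16Bytes = base64Chars * 2;              // UTF-16: 2 bytes per char
    return {
        byteArray: pdfBytes,     // the raw PDF as byte[]
        charArray: utf16Bytes,   // the base64 text as char[]
        string: utf16Bytes,      // the same text again as string
        total: pdfBytes + 2 * utf16Bytes,
    };
}

// Work backwards from m_stringLength of the string examined in the dump.
const chars = 15558232;
const approxPdfBytes = Math.floor(chars / 4) * 3;

const estimate = estimateCopies(approxPdfBytes);
console.log("approx PDF size: " + approxPdfBytes + " bytes");
console.log("estimated bytes across the three buffers: " + estimate.total);
```

Under these assumptions a single PDF of about 11 MB occupies several times its own size while all three buffers are alive, and since objects of 85,000 bytes or more are allocated on the large object heap, every one of these copies lands there.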
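Finally, my own suggestions:
  1. For the large volume of PDFs, could a third-party OSS (object storage) product be used to avoid some of the unnecessary memory usage?
  2. Could the cleansing service apply some rate limiting, or spread the load across several service instances?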
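One way to picture the throttling idea in suggestion 2 is an admission gate that caps how many bytes of PDF are being processed at any moment. The sketch below only illustrates the idea in JavaScript. The class name MemoryBudget, its methods and the 100 MB budget are all invented here, and a real .NET service would need a thread-safe equivalent.

```javascript
"use strict";

// Hypothetical admission gate: caps the total size of PDFs in flight.
class MemoryBudget {
    constructor(maxBytes) {
        this.maxBytes = maxBytes;
        this.inFlight = 0;
    }

    // Reserve the bytes and return true if they fit under the cap;
    // otherwise return false so the caller can retry later or route elsewhere.
    tryAcquire(bytes) {
        if (this.inFlight + bytes > this.maxBytes) return false;
        this.inFlight += bytes;
        return true;
    }

    // Give the bytes back once the file has been processed.
    release(bytes) {
        this.inFlight = Math.max(0, this.inFlight - bytes);
    }
}

const MB = 1024 * 1024;
const budget = new MemoryBudget(100 * MB);

const first = budget.tryAcquire(60 * MB);   // fits under the cap
const second = budget.tryAcquire(60 * MB);  // would exceed the cap, so it is refused
budget.release(60 * MB);                    // first file done
const third = budget.tryAcquire(60 * MB);   // fits again

console.log(first, second, third);
```

Budgeting by bytes rather than by request count is deliberate in this sketch, because the files differ a lot in size and it is the large ones that cause the spike.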
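I later heard from my friend that he solved the problem by adding filtering and optimizing some of the business workflows. I expect many readers have run into this kind of problem in practice, so feel free to leave a comment with your own solution.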
