Preface:
Bitten by Alibaba Cloud ECS yet again. First it was the so-called security groups, then the nc port hole problem, and this time it was Alibaba Cloud's default kernel tuning parameters. Why use Alibaba Cloud hosts instead of physical servers at all? Mainly because our company runs its own data center, and its availability is pretty poor: the network gets cut over constantly, which nobody can put up with.
Here is what happened. We run a large number of CDN injection clients, and every deployment triggers a burst of connection exceptions. The exceptions themselves would be harmless, since every service wraps them in try/except, but they cause a really annoying problem: the data popped from redis never reaches the client, so that data is simply lost. Below is the Python exception.
This article may be updated later; the original is at http://xiaorui.cc/?p=4890
2017-11-24 13:46:54,760 - 8 - ERROR: - Error while reading from socket: ('Connection closed by server.',)
Traceback (most recent call last):
  File "./inject_agent/agent.py", line 78, in execute
    self._execute()
  File "./inject_agent/agent.py", line 87, in _execute
    task = self.get_task()
  File "./inject_agent/agent.py", line 44, in get_task
    body = self.db.redis_con.lpop(redis_conf["proxy_task_queue"])
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 1329, in lpop
    return self.execute_command('LPOP', name)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 673, in execute_command
    connection.send_command(*args)
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 610, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 585, in send_packed_command
    self.connect()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 493, in connect
    self.on_connect()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 561, in on_connect
    if nativestr(self.read_response()) != 'OK':
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 624, in read_response
    response = self._parser.read_response()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 284, in read_response
    response = self._buffer.readline()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 216, in readline
    self._read_from_socket()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 191, in _read_from_socket
    (e.args,))
And this is the kernel error log from /var/log/message:
Nov 24 13:37:53 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: __ratelimit: 29 callbacks suppressed
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: __ratelimit: 467 callbacks suppressed
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
The timestamps in the logs at each layer line up, so we can infer the events are related. Running netstat -ant|grep TIME_WAIT|wc -l to count TCP connections in the TIME_WAIT state showed an extremely large number of them. I also said above that this parameter leads to redis data loss. Granted, the redis list model makes no reliability guarantees to begin with, but that does not mean data should vanish for no reason. Packet captures at each point showed that the client issued the pop request, redis received it and put the popped element into its output buffer, but the reply never made it back to the client.
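For a quick breakdown of every TCP state rather than just TIME_WAIT, a small awk pass over the same netstat output is handy. This is just a convenience sketch I'm adding here, not a command from the original troubleshooting:

netstat -ant | awk 'NR>2 {state[$6]++} END {for (s in state) print s, state[s]}'
# or use the kernel's own summary counters, which include the timewait count:
ss -s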
The kernel sysctl parameter net.ipv4.tcp_max_tw_buckets controls how many TIME_WAIT connections the kernel will track. When the number of connections already in TIME_WAIT plus those about to transition into TIME_WAIT exceeds net.ipv4.tcp_max_tw_buckets, the kernel prints "time wait bucket table overflow" to /var/log/message. The logging itself is harmless; the problem is that the kernel then forcibly destroys the TCP connections that exceed the limit instead of letting them sit in TIME_WAIT.
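To confirm you are actually hitting this limit, compare the configured bucket count against the live TIME_WAIT count and check the kernel ring buffer. Roughly along these lines (standard commands; the exact syslog file name varies by distro, so dmesg is used here):

sysctl net.ipv4.tcp_max_tw_buckets            # current limit
netstat -ant | grep -c TIME_WAIT              # how close the live count is to that limit
dmesg | grep "time wait bucket table overflow" | tail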
Solution:
Raise net.ipv4.tcp_max_tw_buckets. Alibaba Cloud's default is 5000, which is really on the small side, so it is worth increasing. This host runs several high-frequency scheduler services that already generate a lot of connections, and then a large batch of agents connecting to those schedulers generates even more, which is exactly what filled up the time wait buckets. On top of that, make sure to tune net.ipv4.tcp_max_syn_backlog as well.
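A minimal sketch of the change, assuming the settings live in /etc/sysctl.conf; the numbers below are placeholders I picked for illustration, not values from the original post, so size them against your own connection volume:

# /etc/sysctl.conf -- example values only
net.ipv4.tcp_max_tw_buckets = 50000     # raise from the 5000 default on this image
net.ipv4.tcp_max_syn_backlog = 8192     # larger SYN backlog for connection bursts
# apply without a reboot:
# sysctl -p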