A Service Crash Caused by a Kernel Parameter

Preface:

    Bitten by Alibaba Cloud ECS yet again... First it was the so-called security groups, then the nc port-hole problem, and this time it was Alibaba Cloud's default kernel tuning parameters. Why use Alibaba Cloud instead of physical servers? Mainly because our company's self-built datacenter has rather poor availability: the network gets cut over all the time, and nobody can put up with that.


    Here is what happened. We run a large number of CDN injection agents, and every deployment produces a flood of connection exceptions. The exceptions themselves are harmless, since the services wrap everything in try/except, but they cause a very annoying problem: data popped from Redis never reaches the client, so it is lost. Below is the Python exception.


This article may be updated later; the original is at http://xiaorui.cc/?p=4890


2017-11-24 13:46:54,760 - 8 - ERROR: - Error while reading from socket: ('Connection closed by server.',)
Traceback (most recent call last):
  File "./inject_agent/agent.py", line 78, in execute
    self._execute()
  File "./inject_agent/agent.py", line 87, in _execute
    task = self.get_task()
  File "./inject_agent/agent.py", line 44, in get_task
    body = self.db.redis_con.lpop(redis_conf["proxy_task_queue"])
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 1329, in lpop
    return self.execute_command('LPOP', name)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 673, in execute_command
    connection.send_command(*args)
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 610, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 585, in send_packed_command
    self.connect()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 493, in connect
    self.on_connect()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 561, in on_connect
    if nativestr(self.read_response()) != 'OK':
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 624, in read_response
    response = self._parser.read_response()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 284, in read_response
    response = self._buffer.readline()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 216, in readline
    self._read_from_socket()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 191, in _read_from_socket
    (e.args,))

And this is the kernel error log from /var/log/messages:

Nov 24 13:37:53 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: __ratelimit: 29 callbacks suppressed
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: __ratelimit: 467 callbacks suppressed
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow

The timestamps across the different logs line up, so the events are clearly related. Running netstat -ant|grep TIME_WAIT|wc -l to count TCP connections in the TIME_WAIT state shows a huge number of them. As for the Redis data loss I mentioned above: a Redis list offers no delivery guarantee to begin with, but that still does not excuse data silently disappearing like this. Packet captures at each hop show that the client issued the LPOP, Redis received it and put the popped element into its output buffer, but the reply back to the client failed, so the element was removed from the list yet never delivered.
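(A side note, not part of the original troubleshooting: if losing popped elements is unacceptable, a common client-side mitigation is the Redis "reliable queue" pattern built on RPOPLPUSH, which atomically parks the element in a processing list before it is handled, so a lost reply can be recovered later. A minimal sketch using redis-py; the queue names task_queue and processing_queue are made up for illustration.)

# Sketch of the RPOPLPUSH "reliable queue" pattern (illustrative only).
# Queue names are hypothetical; the argument order of lrem below is for
# redis-py >= 3.0 (older versions use lrem(name, value, num)).
import redis

r = redis.StrictRedis(host="127.0.0.1", port=6379)

def fetch_task():
    # Atomically move one element from task_queue into processing_queue.
    # If the reply is dropped, the element still exists in processing_queue
    # and can be re-queued by a recovery job instead of vanishing.
    return r.rpoplpush("task_queue", "processing_queue")

def ack_task(task):
    # Drop the element from processing_queue once it has been handled.
    r.lrem("processing_queue", 1, task)

task = fetch_task()
if task is not None:
    print("handling %r" % task)   # real processing would go here
    ack_task(task)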

The kernel sysctl parameter net.ipv4.tcp_max_tw_buckets caps how many TIME_WAIT sockets the kernel will keep. When the number of connections that are in, or are about to enter, TIME_WAIT exceeds net.ipv4.tcp_max_tw_buckets, the kernel prints "TCP: time wait bucket table overflow" in /var/log/messages. The log line itself would be harmless, except that the kernel also forcibly closes the TCP connections that exceed the limit.
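To see how close a host is to that limit, you can compare the configured value with the live TIME_WAIT count (generic commands, not taken from the incident logs above):

# current cap on TIME_WAIT sockets (the Alibaba Cloud image here shipped with 5000)
sysctl net.ipv4.tcp_max_tw_buckets

# sockets currently sitting in TIME_WAIT
netstat -ant | grep TIME_WAIT | wc -l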

Solution:

Increase net.ipv4.tcp_max_tw_buckets. The Alibaba Cloud default is 5000, which is really too small, so raise it to something appropriate. This host runs several high-frequency scheduler services that already generate lots of connections on their own, and when a large batch of agents connects to those schedulers the connection count grows even further, which is exactly what filled the time wait buckets. In addition, make sure to tune net.ipv4.tcp_max_syn_backlog as well.
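A minimal sketch of what the change could look like in /etc/sysctl.conf; the numbers are illustrative and should be sized to your own traffic, they are not the exact values used in this incident:

# /etc/sysctl.conf  (illustrative values, tune to your workload)
net.ipv4.tcp_max_tw_buckets = 262144
net.ipv4.tcp_max_syn_backlog = 8192

# apply without rebooting
sysctl -p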


Once again, the original blog address is xiaorui.cc