A Service Crash Caused by a Kernel Parameter

Preface:

    Bitten by Alibaba Cloud ECS yet again... First it was the so-called security groups, then the nc port-hole problem, and this time it was Alibaba Cloud's default kernel tuning parameters. Why use Alibaba Cloud instead of physical servers? Mainly because our company's self-built datacenter has rather poor availability: the network gets cut over all the time, and nobody can put up with that.


    Here is what happened. We run a large number of CDN injection agents, and every deployment produces a flood of connection exceptions. The exceptions themselves are harmless, since the services wrap everything in try/except, but they cause a very annoying problem: data popped from Redis never reaches the client, so it is lost. Below is the Python exception.


This article may be updated later; the original is at http://xiaorui.cc/?p=4890


2017-11-24 13:46:54,760 - 8 - ERROR: - Error while reading from socket: ('Connection closed by server.',)
Traceback (most recent call last):
  File "./inject_agent/agent.py", line 78, in execute
    self._execute()
  File "./inject_agent/agent.py", line 87, in _execute
    task = self.get_task()
  File "./inject_agent/agent.py", line 44, in get_task
    body = self.db.redis_con.lpop(redis_conf["proxy_task_queue"])
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 1329, in lpop
    return self.execute_command('LPOP', name)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 673, in execute_command
    connection.send_command(*args)
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 610, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 585, in send_packed_command
    self.connect()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 493, in connect
    self.on_connect()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 561, in on_connect
    if nativestr(self.read_response()) != 'OK':
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 624, in read_response
    response = self._parser.read_response()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 284, in read_response
    response = self._buffer.readline()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 216, in readline
    self._read_from_socket()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 191, in _read_from_socket
    (e.args,))

And this is the kernel error log from /var/log/messages:

Nov 24 13:37:53 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: __ratelimit: 29 callbacks suppressed
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:11 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: __ratelimit: 467 callbacks suppressed
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow
Nov 24 13:45:16 xxxx kernel: TCP: time wait bucket table overflow

The timestamps across the different logs line up, so the events are clearly related. Running netstat -ant|grep TIME_WAIT|wc -l to count TCP connections in the TIME_WAIT state shows a huge number of them. As for the Redis data loss I mentioned above: a Redis list offers no delivery guarantee to begin with, but that still does not excuse data silently disappearing like this. Packet captures at each hop show that the client issued the LPOP, Redis received it and put the popped element into its output buffer, but the reply back to the client failed, so the element was removed from the list yet never delivered.
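(A side note, not part of the original troubleshooting: if losing popped elements is unacceptable, a common client-side mitigation is the Redis "reliable queue" pattern built on RPOPLPUSH, which atomically parks the element in a processing list before it is handled, so a lost reply can be recovered later. A minimal sketch using redis-py; the queue names task_queue and processing_queue are made up for illustration.)

# Sketch of the RPOPLPUSH "reliable queue" pattern (illustrative only).
# Queue names are hypothetical; the argument order of lrem below is for
# redis-py >= 3.0 (older versions use lrem(name, value, num)).
import redis

r = redis.StrictRedis(host="127.0.0.1", port=6379)

def fetch_task():
    # Atomically move one element from task_queue into processing_queue.
    # If the reply is dropped, the element still exists in processing_queue
    # and can be re-queued by a recovery job instead of vanishing.
    return r.rpoplpush("task_queue", "processing_queue")

def ack_task(task):
    # Drop the element from processing_queue once it has been handled.
    r.lrem("processing_queue", 1, task)

task = fetch_task()
if task is not None:
    print("handling %r" % task)   # real processing would go here
    ack_task(task)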

The kernel sysctl parameter net.ipv4.tcp_max_tw_buckets caps how many TIME_WAIT sockets the kernel will keep. When the number of connections that are in, or are about to enter, TIME_WAIT exceeds net.ipv4.tcp_max_tw_buckets, the kernel prints "TCP: time wait bucket table overflow" in /var/log/messages. The log line itself would be harmless, except that the kernel also forcibly closes the TCP connections that exceed the limit.
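To see how close a host is to that limit, you can compare the configured value with the live TIME_WAIT count (generic commands, not taken from the incident logs above):

# current cap on TIME_WAIT sockets (the Alibaba Cloud image here shipped with 5000)
sysctl net.ipv4.tcp_max_tw_buckets

# sockets currently sitting in TIME_WAIT
netstat -ant | grep TIME_WAIT | wc -l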

Solution:

Increase net.ipv4.tcp_max_tw_buckets. The Alibaba Cloud default is 5000, which is really too small, so raise it to something appropriate. This host runs several high-frequency scheduler services that already generate lots of connections on their own, and when a large batch of agents connects to those schedulers the connection count grows even further, which is exactly what filled the time wait buckets. In addition, make sure to tune net.ipv4.tcp_max_syn_backlog as well.
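A minimal sketch of what the change could look like in /etc/sysctl.conf; the numbers are illustrative and should be sized to your own traffic, they are not the exact values used in this incident:

# /etc/sysctl.conf  (illustrative values, tune to your workload)
net.ipv4.tcp_max_tw_buckets = 262144
net.ipv4.tcp_max_syn_backlog = 8192

# apply without rebooting
sysctl -p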


Once again, the original blog address is xiaorui.cc