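Fixing the TransportError exception in Python elasticsearch
As usual, some small talk first: the smog has finally cleared today, the wind is strong, and I feel a bit run down. Time to start exercising.
I received a WeChat alert about delayed Elasticsearch data. The logs showed that the consumer process had hit an exception, yet ps aux f showed the process state as apparently normal. From the process tree, 41577 is clearly the main process and the rest were spawned by it. Time to bring out the heavy weapon: strace.
This post is not very rigorous and criticism is welcome. It may also be updated later, so please check the original address for the latest version.
strace -p pid shows that main process 41577 is blocked in wait4, waiting on child process 41591. The bulk_transfer process itself keeps cycling through futex FUTEX_WAIT_BITSET calls instead of doing any work. Separately, my code has a rule: once the internal queue reaches 2k items, work is paused. So why is the queue constantly full, and what is the consumer process doing?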
#blog: http://xiaorui.cc
[ruifengyun@wx-buzz-monitor01 shop_scripts]strace -p 41577
Process 41577 attached - interrupt to quit
wait4(41591, ^C <unfinished ...>
Process 41577 detached
[ruifengyun@wx-buzz-monitor01 shop_scripts] strace -p 41591
Process 41591 attached - interrupt to quit
select(0, NULL, NULL, NULL, {0, 398485}) = 0 (Timeout)
futex(0x7f9408cff000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1452049478, 527346000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout)
futex(0x7f9408cff000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1452049479, 38034000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout)
futex(0x7f9408cff000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1452049479, 548685000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout)
futex(0x7f9408cff000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1452049480, 59378000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
select(0, NULL, NULL, NULL, {0, 500000}^C <unfinished ...>
Process 41591 detached
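The pause rule mentioned earlier (stop working once the internal queue holds 2k items) is one plausible reason a process would sit in timed waits like the ones above: if the consumer stops draining the queue, the producer stays paused. Below is a minimal sketch of such a rule using the standard library queue module. The names QUEUE_LIMIT, try_enqueue and fill_until_paused are illustrative assumptions, not taken from bulk_transfer.py.

```python
import queue

QUEUE_LIMIT = 2000  # the "2k" threshold from the text; the rest is hypothetical


def try_enqueue(pending, item):
    """Add item unless the queue is at its limit; False means 'pause'."""
    try:
        pending.put_nowait(item)
        return True
    except queue.Full:
        return False


def fill_until_paused(pending, items):
    """Enqueue items in order; return how many were accepted before pausing."""
    accepted = 0
    for item in items:
        if not try_enqueue(pending, item):
            break
        accepted += 1
    return accepted


if __name__ == "__main__":
    pending = queue.Queue(maxsize=QUEUE_LIMIT)
    # With no consumer draining the queue, the producer stalls at the limit.
    print(fill_until_paused(pending, range(2500)))
```

Once the queue is full, nothing more is accepted until a consumer removes an item, which matches the symptom of a permanently full queue when the consumer side is stuck.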
Below is the detailed program log. TransportError, yet again. I hit this same annoying problem last time; back then the fix was to exit the process outright and let supervisord restart it.
2016-01-06 05:21:28,111 - mylogger - INFO - pack cost 120 q_task queue len 0
2016-01-06 05:21:28,112 - mylogger - INFO - q_res queue len 1
Traceback (most recent call last):
  File "bulk_transfer.py", line 163, in <module>
    handle_request()
  File "bulk_transfer.py", line 149, in handle_request
    es.bulk(Q_RES)
  File "/home/ruifengyun/shop_master/shop_scripts/utils.py", line 67, in bulk
    data = helpers.bulk(self.es_conn, actions, stats_only=False, chunk_size=csize)
  File "/usr/local/lib/python2.7/site-packages/elasticsearch/helpers/__init__.py", line 176, in bulk
    for ok, item in streaming_bulk(client, actions, **kwargs):
  File "/usr/local/lib/python2.7/site-packages/elasticsearch/helpers/__init__.py", line 118, in streaming_bulk
    raise e
elasticsearch.exceptions.TransportError: TransportError(503, u'ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2 /no master];]')
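A TransportError with status 503 and a ClusterBlockException message is generally caused by a problem in the Elasticsearch cluster itself; here the message says "no master", meaning the cluster had no master node available. When the cluster is in this state, every client sees the same error.
Note that simply catching the TransportError with try/except and ignoring it is not enough; it is better to create a new Elasticsearch connection after the exception.
Last time I only caught and ignored the TransportError, but after Elasticsearch recovered, the data still could not be written. Debugging with pdb suggested that the connection was left in a bad state. I opened an issue with the official project but did not get a useful answer. I am not sure whether this is a bug in elasticsearch-py.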
#blog: http://xiaorui.cc
import time

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import TransportError

es = Elasticsearch()


def search_with_reconnect():
    # Retry forever; after each TransportError, wait and rebuild the client.
    global es
    while True:
        try:
            return es.search(index="test-index", body={"query": {"match_all": {}}})
        except TransportError:
            time.sleep(5)
            # Replace the possibly broken connection with a fresh one.
            es = Elasticsearch()
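The retry-and-reconnect idea above can be pulled into a small reusable helper. The following is only a sketch under stated assumptions: call_with_reconnect and its parameters are hypothetical names, and FakeTransportError and FlakyClient are stand-ins so the snippet runs without an Elasticsearch cluster. With the real library you would presumably pass Elasticsearch as make_client and TransportError as retry_on.

```python
import time


def call_with_reconnect(make_client, operation, retry_on,
                        delay=5.0, max_attempts=None, sleep=time.sleep):
    """Run operation(client); on retry_on, wait, rebuild the client, retry.

    max_attempts=None retries forever, like the loop in the article.
    """
    client = make_client()
    attempt = 0
    while True:
        attempt += 1
        try:
            return operation(client)
        except retry_on:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(delay)
            # Discard the old client instead of reusing its connection.
            client = make_client()


class FakeTransportError(Exception):
    """Stand-in for elasticsearch.exceptions.TransportError (assumption)."""


class FlakyClient:
    """Hypothetical client: only the third instance created can search."""
    created = 0

    def __init__(self):
        FlakyClient.created += 1
        self.generation = FlakyClient.created

    def search(self):
        if self.generation < 3:
            raise FakeTransportError(503, "no master")
        return {"hits": []}


if __name__ == "__main__":
    result = call_with_reconnect(FlakyClient, lambda c: c.search(),
                                 FakeTransportError, delay=0)
    print(result, FlakyClient.created)
```

Passing the client factory rather than a client instance is the point of the design: the helper never reuses a connection that has already failed. A bounded max_attempts combined with a supervisor such as supervisord gives the same effect as the earlier exit-and-restart workaround.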
Below are other common exceptions in the Elasticsearch Python API.
class elasticsearch.ConnectionError(TransportError)
Error raised when there was an exception while talking to ES. Original exception from the underlying Connection implementation is available as .info.
class elasticsearch.ConnectionTimeout(ConnectionError)
A network timeout. Doesn’t cause a node retry by default.
class elasticsearch.SSLError(ConnectionError)
Error raised when encountering SSL errors.
class elasticsearch.NotFoundError(TransportError)
Exception representing a 404 status code.
class elasticsearch.ConflictError(TransportError)
Exception representing a 409 status code.
class elasticsearch.RequestError(TransportError)
Exception representing a 400 status code.
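Because these classes form a hierarchy under TransportError, the order of except clauses matters: the most specific class has to come first. The sketch below uses stand-in classes that mirror the hierarchy listed here so that it runs without the library installed (EsConnectionError stands in for elasticsearch.ConnectionError to avoid shadowing the Python builtin). The decide function and its retry/reconnect/give_up policy are illustrative assumptions, not a recommendation from the library.

```python
class TransportError(Exception):
    """Stand-in base class; the first argument is the status code."""

    @property
    def status_code(self):
        return self.args[0]


class EsConnectionError(TransportError):
    """Stand-in for elasticsearch.ConnectionError."""


class ConnectionTimeout(EsConnectionError):
    pass


class NotFoundError(TransportError):
    pass


class ConflictError(TransportError):
    pass


class RequestError(TransportError):
    pass


def decide(exc):
    """Map an exception to a coarse action (hypothetical policy)."""
    try:
        raise exc
    except ConnectionTimeout:
        # Must precede EsConnectionError, which would also match it.
        return "retry"
    except EsConnectionError:
        return "reconnect"
    except (NotFoundError, ConflictError, RequestError):
        # 404 / 409 / 400: the request itself is the problem.
        return "give_up"
    except TransportError as e:
        # Catch-all last; 503 is the cluster-level failure from the log.
        return "reconnect" if e.status_code == 503 else "give_up"


if __name__ == "__main__":
    print(decide(TransportError(503, "ClusterBlockException")))
```

If the generic TransportError clause were listed first, it would swallow every subclass and the more specific branches would never run.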