Big data: hitting ElasticSearch's ignore_above problem

    A guru named Zhu Wei once told me that every so-called "pit" exists only because you didn't read the source code or the documentation. That sounds reasonable, but the problem is that ElasticSearch's documentation is all over the place; newcomers get stuck just on the ES DSL, never mind the deeper docs. ElasticSearch releases fairly frequently, but I'm still running the old 1.4 in production, with 2.0 in the test environment. I've learned a lot from Zhu Wei (Wei ge); this afternoon we went through logging together and I got a great deal out of it.

Enough small talk; on to the main topic.


This article is a bit scattered; feel free to flame me! It will keep getting updates, so check the original post for the latest version: http://xiaorui.cc/?p=3151

The first problem:

#xiaorui.cc
<html>
<head><title>413 Request Entity Too Large</title></head>
<body bgcolor="white">
<center><h1>413 Request Entity Too Large</h1></center>
<hr><center>nginx/1.8.0</center>
</body>
</html>
, type: <class 'pyes.exceptions.ElasticSearchException'>

I'm using the Python pyes library here; when I bulk-write to elasticsearch, nginx answers with a 413. An nginx 413 normally shows up when uploading files; in my case it is clearly caused by a single bulk request being too large.

If you don't want to change the program, you can tune nginx to work around the 413:

#xiaorui.cc
vim /etc/nginx/nginx.conf

# inside the http {} (or the relevant server {}) block:
client_max_body_size 350m;

# then reload nginx for the change to take effect
nginx -s reload

Below are the nginx error log entries:

2016/04/07 14:34:04 [error] 7096#0: *257701697 client intended to send too large body: 376903231 bytes, client: 10.10.3.152, server: , request: "POST /_bulk HTTP/1.1", host: "xiaorui.cc:9000"
2016/04/07 14:34:18 [error] 7092#0: *257703097 client intended to send too large body: 376940599 bytes, client: 10.10.3.152, server: , request: "POST /_bulk HTTP/1.1", host: "xiaorui.cc:9000"
2016/04/07 14:34:33 [error] 7092#0: *257704727 client intended to send too large body: 376903231 bytes, client: 10.10.3.152, server: , request: "POST /_bulk HTTP/1.1", host: "xiaorui.cc:9000"
2016/04/07 14:34:51 [error] 7165#0: *257706617 client intended to send too large body: 376903231 bytes, client: 10.10.3.152, server: , request: "POST /_bulk HTTP/1.1", host: "xiaorui.cc:9000"
2016/04/07 14:34:59 [error] 7092#0: *257702178 client intended to send too large body: 368153553 bytes, client: 10.10.3.121, server: , request: "POST /_bulk HTTP/1.1", host: "xiaorui.cc:9000"
2016/04/07 14:35:19 [error] 7159#0: *257703466 client intended to send too large body: 376847541 bytes, client: 10.10.3.152, server: , request: "POST /_bulk HTTP/1.1", host: "xiaorui.cc:9000"
2016/04/07 14:35:33 [error] 7081#0: *257705268 client intended to send too large body: 376785777 bytes, client: 10.10.3.152, server: , request: "POST /_bulk HTTP/1.1", host: "xiaorui.cc:9000"
2016/04/07 14:35:54 [error] 7161#0: *257705592 client intended to send too large body: 377229290 bytes, client: 10.10.3.152, server: , request: "POST /_bulk HTTP/1.1", host: "xiaorui.cc:9000"
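By the way, if you would rather fix this on the client side than in nginx, shrinking each bulk request works just as well. A rough sketch with the elasticsearch-py helpers (the host, index/type names, and chunk size here are made-up examples, not my real config):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(['http://xiaorui.cc:9000'])

# the same kind of action stream the snippets below use
actions = [{'_index': 'xiaorui.cc', '_type': 'xxz', '_source': {'v': 'hello'}}]

# fewer actions per request keeps the serialized body far below
# nginx's client_max_body_size; tune it to your average document size
helpers.bulk(es, actions, chunk_size=500)

Newer versions of the client also accept a max_chunk_bytes argument on the bulk helpers, which caps the request body size directly.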

The second problem is also size-related. Above, the request body was too large (200MB+, seriously). This one, "whose UTF8 encoding is longer than the max length 32766", is caused by a single field value being too large to index.

Below is the Python-side error; the cause is explained further down.

#xiaorui.cc
import time
import logging
import traceback

from elasticsearch import helpers
from elasticsearch.helpers import BulkIndexError
from elasticsearch.exceptions import TransportError, ConnectionTimeout, ConnectionError

# ... snippet from a larger class: self.es_conn, actions, batch_num
# and the retry counter `mark` are defined elsewhere ...
while 1:
    try:
        helpers.bulk(self.es_conn, actions, stats_only=False, chunk_size=batch_num)
        return True
    except ConnectionTimeout:
        logging.error("this is ES ConnectionTimeout ERROR \n %s" % traceback.format_exc())
        logging.info('retry bulk es')
    except TransportError:
        mark += 1
        logging.error("this is ES TransportERROR \n %s" % traceback.format_exc())
        logging.info('retry bulk es')
    except ConnectionError:
        logging.error("this is ES ConnectionError ERROR \n %s" % traceback.format_exc())
        logging.info('retry bulk es')
        time.sleep(0.01)
    except BulkIndexError:
        # a document was rejected (e.g. the immense-term error below);
        # retrying the same data cannot succeed, so give up
        logging.error("this is ES BulkIndexError ERROR \n %s" % traceback.format_exc())
        return
    except Exception:
        logging.error("exception not match \n %s" % traceback.format_exc())
        time.sleep(0.01)
        return

Below is the exception on the Python side; note it actually comes from the elasticsearch-py helpers rather than pyes:

#xiaorui.cc
2016-04-07 16:42:10,482 - 482 - ERROR:    Traceback (most recent call last):
  File "p.py", line 96, in bulk
    data = helpers.bulk(self.es_conn, actions, stats_only=False, chunk_size=csize)
  File "/usr/local/lib/python2.7/site-packages/elasticsearch/helpers/__init__.py", line 188, in bulk
    for ok, item in streaming_bulk(client, actions, **kwargs):
  File "/usr/local/lib/python2.7/site-packages/elasticsearch/helpers/__init__.py", line 160, in streaming_bulk
    for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs):
  File "/usr/local/lib/python2.7/site-packages/elasticsearch/helpers/__init__.py", line 132, in _process_bulk_chunk
    raise BulkIndexError('%i document(s) failed to index.' % len(errors), errors)
BulkIndexError: (u'1 document(s) failed to index.', [{u'index': {u'status': 500, u'_type': u'xxz', u'_id': u'fengyun', u'error': u'IllegalArgumentException[Document contains at least one immense term in field="buzz_comment.v" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: \'[32, 124, 32, -26, -100, -128, -26, -106, -80, -27, -72, -106, -27, -83, -112, 32, 124, 32, -26, -97, -91, -25, -100, -117, -28, -67, -100, -24, -128, -123]...\', original message: bytes can be at most 32766 in length; got 55687]; nested: MaxBytesLengthExceededException[bytes can be at most 32766 in length; got 55687]; ', u'_index': u'xiaorui.cc'}}])

Below is the corresponding error on the ElasticSearch server side:

java.lang.IllegalArgumentException: Document contains at least one immense term in field="buzz_comment.v" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[32, 32, 32, -25, -67, -111, -27, -113, -117, -27, -101, -98, -27, -92, -115, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 106, 105, 97, 110]...', original message: bytes can be at most 32766 in length; got 47841
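A side note: the byte array in "The prefix of the first immense term" is printed as Java signed bytes, so you can decode it to see exactly which content blew past the limit. A quick sketch (the byte list is copied from the traceback above):

# -*- coding: utf-8 -*-
# Java prints bytes as signed ints; mask with 0xff to recover real bytes,
# then decode as UTF-8 to see the offending text.
prefix = [32, 124, 32, -26, -100, -128, -26, -106, -80, -27, -72, -106,
          -27, -83, -112, 32, 124, 32, -26, -97, -91, -25, -100, -117,
          -28, -67, -100, -24, -128, -123]
print(bytearray(b & 0xff for b in prefix).decode('utf-8'))
# prints roughly: " | 最新帖子 | 查看作者"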

A fix turned up on Stack Overflow: turn off analysis on the field and set ignore_above to 256; data longer than 32766 bytes can then still be stored.

set "index": "not_analyzed" and "ignore_above": 256

What is index? It has the following three options:

set "index": "no"   #不分词,不索引
set "index": "analyze"    #分词,索引
set "index": "not_analyzed" # 不去分词

And what is ignore_above?

ignore_above is a length cutoff for indexing, used on not_analyzed string fields: any value longer than the configured number of characters is simply skipped at index time. The value still lives in _source; it just never reaches the inverted index. It is not set by default, which is exactly why an oversized term can slam into the 32766-byte limit.

That gives it a second use, which is our case here: if you need to store data longer than 32766 bytes, ignore_above = 256 does the trick, because the oversized value is never handed to the term index.

We tested the ignore_above parameter on 1.7: with the field set to not_analyzed and ignore_above 256, data larger than 32766 bytes can be written in without error.
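For reference, that test can be reproduced with the Python client roughly like this (the index, type, and field names are made up for illustration):

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])

# ES 1.x string mapping: exact-value field with an ignore_above cutoff
es.indices.create(index='ignore_above_test', body={
    'mappings': {
        'doc': {
            'properties': {
                'raw': {
                    'type': 'string',
                    'index': 'not_analyzed',
                    'ignore_above': 256,
                },
            },
        },
    },
})

# 50000 bytes is far past the 32766-byte term limit; the value stays in
# _source but is skipped by the index, so no MaxBytesLengthExceededException
es.index(index='ignore_above_test', doc_type='doc', id=1,
         body={'raw': 'x' * 50000})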

Here is a reference mapping (in PHP client syntax):

'mapping'    => [
    'type'   => 'multi_field',
    'path'   => 'full',
    'fields' => [
        '{name}' => [
            'type'     => 'string',
            'index'    => 'analyzed',
            'analyzer' => 'standard',
        ],
        'raw' => [
            'type'         => 'string',
            'index'        => 'not_analyzed',
            'ignore_above' => 256,
        ],
    ],
]

And here is the official documentation for ignore_above:

https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html

END.

