Performance testing batch inserts into HBase from Python

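This May Day holiday was a bit of a letdown. We had planned a trip, but I misread the weather forecast, so in the end nobody made it to Qingliang Valley... At least I went out with friends yesterday to a bar at Gongti (the Workers' Stadium area). It cost me over 2,000 yuan: there were no open tables left, and making people wait around for one kills the mood, so we went straight for a booth... The girls at the Latte bar were nice too... hehe, great figures!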


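On to the real topic. Work threw another annoying problem at me, and I'm not happy about it! One of our data-extraction queues shot up to around 1.7 million items, damn it! I looked at it with strace and found the cause was the thrift server blocking... Last week we built a component that takes the time-consuming HBase and ES operations, serializes the data, and puts it into redis; multiple export programs then consume that queue. On top of that, we already use nginx in TCP mode to load-balance across 10 Thrift servers. Because the nginx tcp module can only health-check at the port level, our workaround was a small task that runs a scan against every backend thrift server, so that only healthy backends stay in rotation. Even so, things still block from time to time.

Reading the happybase docs, I found that you can put to HBase in batches. Changing the worker thread code to batch mode was easy, but the results did not improve much. Here are my test results...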
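
The scan-based health check mentioned above could look roughly like the sketch below. It is not the actual task from this project: the function name `healthy_backends`, the table name and the timeout are made up for illustration, and `connect` is assumed to be something like `happybase.Connection`, which accepts `host`, `port` and a `timeout` in milliseconds. The point is that a one-row scan exercises the whole Thrift-to-HBase path, which a plain port probe cannot do.

```python
def healthy_backends(backends, connect, table_name, timeout_ms=3000):
    """Return the (host, port) pairs whose Thrift server can serve a 1-row scan.

    `connect` is a connection factory (assumed: happybase.Connection).
    `table_name` is any small table that exists on the cluster (hypothetical).
    """
    healthy = []
    for host, port in backends:
        conn = None
        try:
            conn = connect(host=host, port=port, timeout=timeout_ms)
            # Fetch at most one row; a blocked server times out or raises here.
            for _row in conn.table(table_name).scan(limit=1):
                break
            healthy.append((host, port))
        except Exception:
            # Any failure (refused, timeout, Thrift error) marks the backend as bad.
            pass
        finally:
            if conn is not None:
                try:
                    conn.close()
                except Exception:
                    pass
    return healthy
```

With happybase installed, a call such as `healthy_backends(backends, happybase.Connection, "some_table")` would return the list that the load balancer config should keep.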

For the record, the original link for this post is  http://xiaorui.cc      http://xiaorui.cc/?p=1353

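You need a comparison to see the effect. Because of the VPN I can't view the charts built from the collected metrics, so the only option was the crude one: time each HBase operation, write it to the log, and read off how long it took. One extra note: each of my records is fairly large, even though it is compressed page source. First, the code for the batch insert: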
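
The "time it and log it" approach can be wrapped in a small helper so that the single-put and batch paths are measured the same way. This is only a sketch: the name `timed` is mine, and it uses `time.perf_counter()` rather than the `time.time()` calls in the code from this post.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("extractor")


@contextmanager
def timed(label, log=logger):
    """Log how many seconds the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        log.warning("%s time used %s", label, time.perf_counter() - start)
```

Usage would be `with timed("batch put hbase"): hb.put_batch(datalist)`.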

# Flush to HBase once more than 100 records have accumulated.
if len(self.datalist) > 100:
    start_time = time.time()
    hb.put_batch(self.datalist)
    self.logger.warning("batch put hbase time used %s Hbase_ip:%s" % (time.time() - start_time, hb.host))
    self.datalist = []
    at = 0

hadoop.py

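import time

def put_batch(self, datalist):
    # Write all records through one happybase batch and log the elapsed time.
    s = time.time()
    if self.conn is None:
        self._setup_conn()
    table = self.get_table()
    batch = table.batch()
    for data in datalist:
        # Row key is the md5 of the URL; `compress` is not shown in this post.
        batch.put(data['md5'], {"bz:url": data['url'], "src:html": compress(data['source'])})
    batch.send()
    # "%%s" prints a literal "%s" after the number, which is the suffix seen in the logs.
    self.logger.warning("In put_batch function , hbase batch cost %s%%s " % (time.time() - s))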

I'm using happybase here. The official documentation explains this clearly, so have a look.
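
For reference, happybase batches can also be used as a context manager and given a `batch_size`, so the library flushes on its own instead of the caller counting records. The sketch below is not the code used in these tests. `put_in_chunks` is a made-up name, `table` is assumed to be a `happybase.Table`, and `zlib.compress` stands in for the unspecified `compress` above. As far as I understand, happybase counts mutations rather than rows, so a put that writes two columns may count as two toward `batch_size`.

```python
import zlib


def put_in_chunks(table, records, chunk_size=20):
    """Write records through a batch that sends itself as it fills up.

    `table` is assumed to behave like happybase.Table. Each record is a dict
    with 'md5' (str), 'url' (str) and 'source' (bytes), as in this post.
    """
    # Leaving the `with` block sends whatever is still buffered.
    with table.batch(batch_size=chunk_size) as batch:
        for rec in records:
            batch.put(
                rec["md5"].encode("ascii"),
                {
                    b"bz:url": rec["url"].encode("utf-8"),
                    b"src:html": zlib.compress(rec["source"]),
                },
            )
```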


Below, every task does its own put. The time spent is very small, almost always in the 0.0x-second range.


[2015-05-02 11:25:21,685] WARNING extractor "put hbase time used 0.021054983139 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:23,759] WARNING extractor "put hbase time used 0.0045530796051 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:25,589] WARNING extractor "put hbase time used 0.0892169475555 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:25,736] WARNING extractor "put hbase time used 0.0320320129395 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:26,449] WARNING extractor "put hbase time used 0.0114378929138 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:38,015] WARNING extractor "put hbase time used 0.00867915153503 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:40,416] WARNING extractor "put hbase time used 0.00477409362793 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:40,510] WARNING extractor "put hbase time used 0.00833702087402 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:40,801] WARNING extractor "put hbase time used 0.0206489562988 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:45,599] WARNING extractor "put hbase time used 0.0165710449219 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:45,675] WARNING extractor "put hbase time used 0.0191271305084 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:46,238] WARNING extractor "put hbase time used 0.0264899730682 Hbase_ip:xx.xx.xx.xx"

Below, 20 records go into each put:

[2015-05-02 11:03:49,112] WARNING extractor "batch put hbase time used 0.51046705246 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:03:53,656] WARNING extractor "In put_batch function , hbase batch cost 0.158232927322%s "
[2015-05-02 11:03:53,656] WARNING extractor "batch put hbase time used 0.158485889435 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:03:54,913] WARNING extractor "In put_batch function , hbase batch cost 0.0812509059906%s "
[2015-05-02 11:03:54,914] WARNING extractor "batch put hbase time used 0.081521987915 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:03:54,973] WARNING extractor "In put_batch function , hbase batch cost 0.0716309547424%s "
[2015-05-02 11:03:54,973] WARNING extractor "batch put hbase time used 0.0719060897827 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:03:58,972] WARNING extractor "In put_batch function , hbase batch cost 0.325189828873%s "
[2015-05-02 11:03:58,972] WARNING extractor "batch put hbase time used 0.325511932373 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:04:03,998] WARNING extractor "In put_batch function , hbase batch cost 29.8570251465%s "
[2015-05-02 11:04:03,999] WARNING extractor "batch put hbase time used 29.8572819233 Hbase_ip:xx.xx.xx.xx"

50 records per batch:

[2015-05-02 11:16:01,463] WARNING extractor "In put_batch function , hbase batch cost 0.331127882004%s "
[2015-05-02 11:16:01,463] WARNING extractor "batch put hbase time used 0.331458806992 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:16:28,892] WARNING extractor "In put_batch function , hbase batch cost 0.182182073593%s "
[2015-05-02 11:16:28,893] WARNING extractor "batch put hbase time used 0.182493209839 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:16:43,419] WARNING extractor "In put_batch function , hbase batch cost 0.20866894722%s "
[2015-05-02 11:16:43,419] WARNING extractor "batch put hbase time used 0.208983898163 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:17:30,570] WARNING extractor "In put_batch function , hbase batch cost 0.221275806427%s "
[2015-05-02 11:17:30,571] WARNING extractor "batch put hbase time used 0.221582889557 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:17:31,309] WARNING extractor "In put_batch function , hbase batch cost 0.228880882263%s "

Now let's push the batch size up to 100... Performance turns out to be very unstable...


[2015-05-03 16:30:28,554] WARNING extractor "In put_batch function , hbase batch cost 3.95684981346%s "
[2015-05-03 16:30:28,555] WARNING extractor "batch put hbase time used 3.95731115341 Hbase_ip:xx.xx.xx.xx"
[2015-05-03 16:30:31,565] WARNING extractor "In put_batch function , hbase batch cost 3.03742599487%s "
[2015-05-03 16:30:31,566] WARNING extractor "batch put hbase time used 3.03776407242 Hbase_ip:xx.xx.xx.xx"
[2015-05-03 16:30:35,555] WARNING extractor "In put_batch function , hbase batch cost 15.3058838844%s "
[2015-05-03 16:30:35,555] WARNING extractor "batch put hbase time used 15.3061511517 Hbase_ip:xx.xx.xx.xx"
[2015-05-03 16:30:41,410] WARNING extractor "In put_batch function , hbase batch cost 12.1368830204%s "
[2015-05-03 16:30:41,411] WARNING extractor "batch put hbase time used 12.1372339725 Hbase_ip:xx.xx.xx.xx"
[2015-05-03 16:31:15,827] WARNING extractor "In put_batch function , hbase batch cost 87.7399909496%s "
[2015-05-03 16:31:15,827] WARNING extractor "batch put hbase time used 87.7403361797 Hbase_ip:xx.xx.xx.xx"
[2015-05-03 16:31:26,087] WARNING extractor "In put_batch function , hbase batch cost 1.96412801743%s "
[2015-05-03 16:31:26,088] WARNING extractor "batch put hbase time used 1.96462702751 Hbase_ip:xx.xx.xx.xx"

The benefit of batching is not that obvious...
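
To compare the runs by more than eyeballing, the costs can be pulled out of the log lines and summarized. The sketch below does that for the six "batch cost" lines from the 100-record run above. The regex relies on the literal `%s` suffix that the log format in this post produces, and the names `COST_RE` and `summarize` are just for this example.

```python
import re
import statistics

# Matches e.g. 'hbase batch cost 3.95684981346%s ' and captures the number.
COST_RE = re.compile(r"hbase batch cost ([0-9.]+)%s")


def summarize(log_lines):
    """Return count, median and max of the batch costs found in log_lines."""
    costs = []
    for line in log_lines:
        match = COST_RE.search(line)
        if match:
            costs.append(float(match.group(1)))
    if not costs:
        return None
    return {
        "count": len(costs),
        "median": statistics.median(costs),
        "max": max(costs),
    }


SAMPLE_100 = [
    '[2015-05-03 16:30:28,554] WARNING extractor "In put_batch function , hbase batch cost 3.95684981346%s "',
    '[2015-05-03 16:30:31,565] WARNING extractor "In put_batch function , hbase batch cost 3.03742599487%s "',
    '[2015-05-03 16:30:35,555] WARNING extractor "In put_batch function , hbase batch cost 15.3058838844%s "',
    '[2015-05-03 16:30:41,410] WARNING extractor "In put_batch function , hbase batch cost 12.1368830204%s "',
    '[2015-05-03 16:31:15,827] WARNING extractor "In put_batch function , hbase batch cost 87.7399909496%s "',
    '[2015-05-03 16:31:26,087] WARNING extractor "In put_batch function , hbase batch cost 1.96412801743%s "',
]

if __name__ == "__main__":
    print(summarize(SAMPLE_100))
```

The median is more useful than the mean here, because a single 87-second outlier dominates any average.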


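Follow-up: I compared notes with colleagues who also write in batches. Their rowkeys include a timestamp, and since time is ordered data, their inserts all land on a single server. Our rowkey is the md5 of the URL, a random hash, so rows get spread across different regionservers. That nicely avoids the problem of an HBase hotspot concentrated in one region, but it presumably also means that each of our batches fans out over many regionservers, which would explain why batching helps us so little.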
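
The difference between the two rowkey styles can be illustrated with a toy model. This is not how HBase actually assigns regions (real regions are sorted key ranges with their own split points); here I simply pretend the table is pre-split on the first character of the key. The URLs and the timestamp-based key format are made up for the example.

```python
import hashlib
from collections import Counter


def regions_touched(rowkeys):
    """Toy model: one 'region' per first character of the rowkey."""
    return Counter(key[0] for key in rowkeys)


# 100 hypothetical page URLs, keyed the way this post does it: md5 of the URL.
urls = ["http://example.com/page/%d" % i for i in range(100)]
md5_keys = [hashlib.md5(u.encode("utf-8")).hexdigest() for u in urls]

# 100 hypothetical timestamp-prefixed keys (epoch seconds, then a sequence number).
ts_keys = ["%d_%d" % (1430536000 + i, i) for i in range(100)]

if __name__ == "__main__":
    print("md5 keys touch", len(regions_touched(md5_keys)), "regions")
    print("timestamp keys touch", len(regions_touched(ts_keys)), "regions")
```

In this model a batch of timestamp keys goes to one region, so one batch is one bulk write to one server, while a batch of md5 keys is split into many small writes across servers.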









1 Response

  1. 往往 May 4, 2015 / 9:47 AM

    Looks like the performance is pretty mediocre, then
