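This May 1st holiday was a bit of a letdown. I didn't go traveling: we had planned a trip, but I misread the weather forecast, so none of us made it to Qingliang Valley. At least I went out with friends to a bar at Gongti yesterday, which cost me more than 2,000 yuan. There were no open tables left, and waiting around for one is no fun at times like that, so we went straight for a booth. The girls at the Latte bar were nice too, heh, great figures!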
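On to the main topic. Work threw another annoying problem at me, and I'm not happy about it! One of the data extraction queues shot up to around 1.7 million items, damn it! strace showed that the cause was a blocked Thrift server. Last week we built a component that takes the time-consuming HBase and ES operations, serializes the data, and puts it into Redis; several exporter processes then consume that queue. On top of that, we already use nginx in TCP mode to load-balance across 10 Thrift servers. Because the nginx TCP module can only health-check at the port level, we wrote a small task that scans all the backend Thrift servers, so that only backends in good shape stay in service. Even so, things still block from time to time. Reading the happybase docs, I found that it supports batched puts to HBase. Changing the worker thread code to batch mode was easy, but the result was not much of an improvement. Here are my test results.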
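To make the queue component described above a bit more concrete, here is a minimal sketch of the producer and consumer sides. It is only an illustration under assumptions: the function names, the JSON payload layout, and the queue name are hypothetical, and `client` is assumed to be a redis-py style object exposing `lpush` and `brpop`. Our real component is not shown in this post.

```python
import json


def enqueue_write(client, queue, row_key, columns):
    """Serialize one pending HBase write and push it onto a Redis list."""
    payload = json.dumps({"row": row_key, "columns": columns})
    client.lpush(queue, payload)


def drain(client, queue, handle, timeout=1):
    """Pop queued writes until the list stays empty for `timeout` seconds.

    `handle(row_key, columns)` is whatever performs the slow HBase/ES
    operation. Returns the number of items processed.
    """
    processed = 0
    while True:
        item = client.brpop(queue, timeout=timeout)
        if item is None:          # nothing arrived within `timeout` seconds
            return processed
        _name, payload = item     # brpop returns a (list_name, value) pair
        task = json.loads(payload)
        handle(task["row"], task["columns"])
        processed += 1
```

Pushing with `lpush` and popping with `brpop` gives first-in, first-out order, and several consumers can run `drain` against the same list at once, which is the point of putting the slow writes behind a queue.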
For the record, the original link for this article is http://xiaorui.cc http://xiaorui.cc/?p=1353
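You need a comparison to see the effect. Because of VPN issues I can't view the charts built from the collected metrics, so all I can do is log how long each HBase operation takes and read the time spent from the logs. One extra note: each of my records is fairly large, even though it is compressed page source. First, the code that does the batch insert: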
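if len(self.datalist) > 100:
    # Time only the HBase call itself.
    start_time = time.time()
    hb.put_batch(self.datalist)
    self.logger.warning("batch put hbase time used %s Hbase_ip:%s" % (time.time() - start_time, hb.host))
    # Start a fresh buffer for the next batch.
    self.datalist = []
    at = 0  # `at` belongs to the surrounding worker code (not shown)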
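One thing the snippet above leaves open is what happens to records that never reach the threshold. As a rough sketch only (the class name `BatchBuffer` and its `flush_fn` parameter are made up for illustration, not part of our code), a small buffer that flushes on size and can also be flushed on shutdown might look like this. Note that it flushes at exactly `size` items, whereas the snippet above waits for more than 100.

```python
import time


class BatchBuffer:
    """Collect items and hand them to `flush_fn` in groups of `size`."""

    def __init__(self, flush_fn, size=100):
        self.flush_fn = flush_fn
        self.size = size
        self.items = []
        self.timings = []  # seconds spent inside each flush

    def add(self, item):
        self.items.append(item)
        if len(self.items) >= self.size:
            self.flush()

    def flush(self):
        """Send whatever is buffered, even a partial batch."""
        if not self.items:
            return
        pending, self.items = self.items, []
        start = time.time()
        self.flush_fn(pending)
        self.timings.append(time.time() - start)
```

In the worker this would be used as `buf = BatchBuffer(hb.put_batch, size=100)`, calling `buf.add(data)` per record and `buf.flush()` once before exit.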
hadoop.py
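import time

def put_batch(self, datalist):
    # Method of the HBase wrapper class; `compress` is a helper defined elsewhere.
    s = time.time()
    if self.conn is None:
        self._setup_conn()
    table = self.get_table()
    batch = table.batch()
    for data in datalist:
        # Row key is the URL's md5; store the URL and the compressed page source.
        batch.put(data['md5'], {"bz:url": data['url'], "src:html": compress(data['source'])})
    # Send all queued puts together.
    batch.send()
    # "%%s" prints a literal "%s" after the number, as seen in the logs.
    self.logger.warning("In put_batch function , hbase batch cost %s%%s " % (time.time() - s))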
I'm using happybase here. The official documentation is very clear, so have a look.
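For comparison, happybase batches can also be used as context managers and given a `batch_size`, in which case the library flushes for you. The sketch below is not the code I ran for these tests: `put_rows` is a hypothetical helper, and `table` is assumed to be a `happybase.Table`. As far as I understand the docs, `batch_size` limits the number of pending mutations (cells), not rows, so treat the exact flush points as an assumption.

```python
def put_rows(table, rows, batch_size=50):
    """Write rows through a happybase batch that flushes itself.

    `table` is expected to be a happybase.Table; `rows` yields
    (row_key, {column: value}) pairs. Returns the number of rows queued.
    """
    count = 0
    # Leaving the `with` block sends any mutations that are still pending.
    with table.batch(batch_size=batch_size) as batch:
        for row_key, columns in rows:
            batch.put(row_key, columns)
            count += 1
    return count
```

This removes the manual `batch.send()` call and the length check in the worker, at the cost of less direct control over when each flush happens.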
Below, each task is its own put. You can see that it takes very little time, almost always 0.0x seconds.
[2015-05-02 11:25:21,685] WARNING extractor "put hbase time used 0.021054983139 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:23,759] WARNING extractor "put hbase time used 0.0045530796051 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:25,589] WARNING extractor "put hbase time used 0.0892169475555 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:25,736] WARNING extractor "put hbase time used 0.0320320129395 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:26,449] WARNING extractor "put hbase time used 0.0114378929138 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:38,015] WARNING extractor "put hbase time used 0.00867915153503 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:40,416] WARNING extractor "put hbase time used 0.00477409362793 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:40,510] WARNING extractor "put hbase time used 0.00833702087402 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:40,801] WARNING extractor "put hbase time used 0.0206489562988 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:45,599] WARNING extractor "put hbase time used 0.0165710449219 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:45,675] WARNING extractor "put hbase time used 0.0191271305084 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:25:46,238] WARNING extractor "put hbase time used 0.0264899730682 Hbase_ip:xx.xx.xx.xx"
Below, 20 records go into each put.
[2015-05-02 11:03:49,112] WARNING extractor "batch put hbase time used 0.51046705246 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:03:53,656] WARNING extractor "In put_batch function , hbase batch cost 0.158232927322%s "
[2015-05-02 11:03:53,656] WARNING extractor "In put_batch function , hbase batch cost 0.158232927322%s "
[2015-05-02 11:03:53,656] WARNING extractor "batch put hbase time used 0.158485889435 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:03:53,656] WARNING extractor "batch put hbase time used 0.158485889435 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:03:54,913] WARNING extractor "In put_batch function , hbase batch cost 0.0812509059906%s "
[2015-05-02 11:03:54,913] WARNING extractor "In put_batch function , hbase batch cost 0.0812509059906%s "
[2015-05-02 11:03:54,914] WARNING extractor "batch put hbase time used 0.081521987915 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:03:54,914] WARNING extractor "batch put hbase time used 0.081521987915 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:03:54,973] WARNING extractor "In put_batch function , hbase batch cost 0.0716309547424%s "
[2015-05-02 11:03:54,973] WARNING extractor "In put_batch function , hbase batch cost 0.0716309547424%s "
[2015-05-02 11:03:54,973] WARNING extractor "batch put hbase time used 0.0719060897827 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:03:54,973] WARNING extractor "batch put hbase time used 0.0719060897827 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:03:58,972] WARNING extractor "In put_batch function , hbase batch cost 0.325189828873%s "
[2015-05-02 11:03:58,972] WARNING extractor "In put_batch function , hbase batch cost 0.325189828873%s "
[2015-05-02 11:03:58,972] WARNING extractor "batch put hbase time used 0.325511932373 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:03:58,972] WARNING extractor "batch put hbase time used 0.325511932373 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:04:03,998] WARNING extractor "In put_batch function , hbase batch cost 29.8570251465%s "
[2015-05-02 11:04:03,998] WARNING extractor "In put_batch function , hbase batch cost 29.8570251465%s "
[2015-05-02 11:04:03,999] WARNING extractor "batch put hbase time used 29.8572819233 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:04:03,999] WARNING extractor "batch put hbase time used 29.8572819233 Hbase_ip:xx.xx.xx.xx"
With 50 tasks per batch:
[2015-05-02 11:16:01,463] WARNING extractor "In put_batch function , hbase batch cost 0.331127882004%s "
[2015-05-02 11:16:01,463] WARNING extractor "batch put hbase time used 0.331458806992 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:16:01,463] WARNING extractor "batch put hbase time used 0.331458806992 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:16:28,892] WARNING extractor "In put_batch function , hbase batch cost 0.182182073593%s "
[2015-05-02 11:16:28,892] WARNING extractor "In put_batch function , hbase batch cost 0.182182073593%s "
[2015-05-02 11:16:28,893] WARNING extractor "batch put hbase time used 0.182493209839 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:16:28,893] WARNING extractor "batch put hbase time used 0.182493209839 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:16:43,419] WARNING extractor "In put_batch function , hbase batch cost 0.20866894722%s "
[2015-05-02 11:16:43,419] WARNING extractor "In put_batch function , hbase batch cost 0.20866894722%s "
[2015-05-02 11:16:43,419] WARNING extractor "batch put hbase time used 0.208983898163 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:16:43,419] WARNING extractor "batch put hbase time used 0.208983898163 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:17:30,570] WARNING extractor "In put_batch function , hbase batch cost 0.221275806427%s "
[2015-05-02 11:17:30,570] WARNING extractor "In put_batch function , hbase batch cost 0.221275806427%s "
[2015-05-02 11:17:30,571] WARNING extractor "batch put hbase time used 0.221582889557 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:17:30,571] WARNING extractor "batch put hbase time used 0.221582889557 Hbase_ip:xx.xx.xx.xx"
[2015-05-02 11:17:31,309] WARNING extractor "In put_batch function , hbase batch cost 0.228880882263%s "
[2015-05-02 11:17:31,309] WARNING extractor "In put_batch function , hbase batch cost 0.228880882263%s "
Now let's raise the batch size to 100. Performance turns out to be very unstable.
[2015-05-03 16:30:28,554] WARNING extractor "In put_batch function , hbase batch cost 3.95684981346%s "
[2015-05-03 16:30:28,555] WARNING extractor "batch put hbase time used 3.95731115341 Hbase_ip:xx.xx.xx.xx"
[2015-05-03 16:30:31,565] WARNING extractor "In put_batch function , hbase batch cost 3.03742599487%s "
[2015-05-03 16:30:31,566] WARNING extractor "batch put hbase time used 3.03776407242 Hbase_ip:xx.xx.xx.xx"
[2015-05-03 16:30:35,555] WARNING extractor "In put_batch function , hbase batch cost 15.3058838844%s "
[2015-05-03 16:30:35,555] WARNING extractor "batch put hbase time used 15.3061511517 Hbase_ip:xx.xx.xx.xx"
[2015-05-03 16:30:41,410] WARNING extractor "In put_batch function , hbase batch cost 12.1368830204%s "
[2015-05-03 16:30:41,411] WARNING extractor "batch put hbase time used 12.1372339725 Hbase_ip:xx.xx.xx.xx"
[2015-05-03 16:31:15,827] WARNING extractor "In put_batch function , hbase batch cost 87.7399909496%s "
[2015-05-03 16:31:15,827] WARNING extractor "batch put hbase time used 87.7403361797 Hbase_ip:xx.xx.xx.xx"
[2015-05-03 16:31:26,087] WARNING extractor "In put_batch function , hbase batch cost 1.96412801743%s "
[2015-05-03 16:31:26,088] WARNING extractor "batch put hbase time used 1.96462702751 Hbase_ip:xx.xx.xx.xx"
The effect of batching is not that obvious.
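Eyeballing log lines only goes so far, so here is a small sketch for turning an excerpt like the ones above into numbers. It is an illustration, not something I ran as part of these tests: `summarize` and `LINE_RE` are names I made up, and the regular expression assumes exactly the "put hbase time used" and "batch put hbase time used" message formats shown in this post.

```python
import re
import statistics

# Matches the outer timing line written by the extractor, for both the
# single-put and the batch-put message.
LINE_RE = re.compile(r'"(batch )?put hbase time used ([0-9.]+) ')


def summarize(log_text, rows_per_put):
    """Return (count, median, worst, median_per_row) for one log excerpt.

    Raises statistics.StatisticsError if no timing line is found.
    """
    seen = set()
    times = []
    for line in log_text.splitlines():
        if line in seen:        # the excerpts contain doubled lines
            continue
        seen.add(line)
        match = LINE_RE.search(line)
        if match:
            times.append(float(match.group(2)))
    median = statistics.median(times)
    return len(times), median, max(times), median / rows_per_put
```

Dividing by `rows_per_put` is what makes the runs comparable: a single put in the 0.0x second range is one row, while a 100-row batch that takes several seconds is not obviously cheaper per row.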
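Follow-up: I compared notes with colleagues who run a batch workload of their own. Their rowkey design includes a timestamp, and since time is ordered data, their inserts all land on a single server. Our rowkey is the md5 of the URL, a random hash, so rows get spread across different regionservers. That neatly avoids the problem of an HBase hotspot in one region, but it presumably also means a single batch of ours fans out across several servers.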
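The difference between the two rowkey designs can be shown with a toy model. This is only a sketch: the split points in `SPLITS` are hypothetical (real region boundaries depend on how the table was split), the URLs are made up, and the timestamp-prefixed key format is my guess at what a time-ordered design looks like, not my colleagues' actual schema.

```python
import bisect
import hashlib
from collections import Counter

# Hypothetical split points: four regions over the sorted key space.
SPLITS = ["4", "8", "c"]


def region_for(row_key):
    """Index of the toy region whose key range contains row_key."""
    return bisect.bisect_right(SPLITS, row_key)


def spread(row_keys):
    """Count how many keys land in each toy region."""
    return Counter(region_for(key) for key in row_keys)


urls = ["http://example.com/page/%d" % i for i in range(1000)]
md5_keys = [hashlib.md5(u.encode("utf-8")).hexdigest() for u in urls]
# Keys that start with an epoch timestamp, as in a time-ordered design.
time_keys = ["%d_%d" % (1430536000 + i, i) for i in range(1000)]

print("md5 keys: ", sorted(spread(md5_keys).items()))
print("time keys:", sorted(spread(time_keys).items()))
```

In this model the timestamp keys all fall into one region while the md5 keys scatter over all four, which matches the trade-off above: hashed keys avoid a hotspot, but one batch touches many regions.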
Seen this way, the performance is pretty mediocre.