[MongoDB] MongoDB sharded deployment is extremely unstable. It can't really be this bad, can it?

#1 | Posted 2016-02-25 15:54

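I set up a sharded cluster on 3 servers:
5 replica-set shards, each made up of a primary, a secondary, and an arbiter
1 config server instance on each server
1 mongos instance on each server
All traffic goes over the internal network.

After a few days of use, MongoDB feels very unstable to me, and I can't see putting it into production like this. Has anyone run into similar problems?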
    # mongod --version
    db version v3.2.1
    git version: a14d55980c2cdc565d4704a7e3ad37e4e535c1b2
    OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
    allocator: tcmalloc
    modules: none
    build environment:
        distmod: rhel62
        distarch: x86_64
        target_arch: x86_64
    # uname -a
    Linux APGW02 2.6.32-220.60.2.el6.x86_64 #1 SMP Fri Feb 27 15:05:50 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
    # cat /etc/issue
    Red Hat Enterprise Linux Server release 6.2 (Santiago)
    Kernel \r on an \m
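Then a multi-process Python script on a fourth machine sent concurrent update requests with upsert=true to each mongos. The final data set was about 30 million documents, written at roughly 20,000 to 30,000 per second.
The mongos on one of the hosts frequently exited, leaving last words like these: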
    2016-02-25T11:45:38.291+0800 I NETWORK  [mongosMain] connection accepted from 139.122.10.181:57592 #29 (12 connections now open)
    2016-02-25T11:45:38.291+0800 I NETWORK  [mongosMain] connection accepted from 139.122.10.181:57591 #30 (13 connections now open)
    2016-02-25T11:45:38.291+0800 I NETWORK  [mongosMain] connection accepted from 139.122.10.181:57593 #31 (14 connections now open)
    2016-02-25T11:45:38.372+0800 I NETWORK  [mongosMain] pthread_create failed: errno:11 Resource temporarily unavailable
    2016-02-25T11:45:38.372+0800 I NETWORK  [mongosMain] failed to create thread after accepting new connection, closing connection
    2016-02-25T11:45:38.373+0800 I NETWORK  [conn26] end connection 139.122.10.181:57584 (12 connections now open)
    2016-02-25T11:45:38.374+0800 I NETWORK  [mongosMain] connection accepted from 139.122.10.181:57594 #32 (13 connections now open)
    2016-02-25T11:45:38.374+0800 I NETWORK  [mongosMain] connection accepted from 139.122.10.181:57597 #33 (14 connections now open)
    2016-02-25T11:45:38.374+0800 I NETWORK  [mongosMain] pthread_create failed: errno:11 Resource temporarily unavailable
    2016-02-25T11:45:38.374+0800 I NETWORK  [mongosMain] failed to create thread after accepting new connection, closing connection
    2016-02-25T11:45:38.376+0800 I NETWORK  [mongosMain] connection accepted from 139.122.10.181:57600 #34 (14 connections now open)
    2016-02-25T11:45:38.376+0800 I NETWORK  [mongosMain] pthread_create failed: errno:11 Resource temporarily unavailable
    2016-02-25T11:45:38.376+0800 I NETWORK  [mongosMain] failed to create thread after accepting new connection, closing connection
    2016-02-25T11:45:38.483+0800 E NETWORK  [conn32] Uncaught std::exception: , terminating
    2016-02-25T11:45:38.483+0800 E NETWORK  [conn30] Uncaught std::exception: , terminating
    2016-02-25T11:45:38.483+0800 E NETWORK  [conn29] Uncaught std::exception: , terminating
    2016-02-25T11:45:38.520+0800 I SHARDING [conn29] dbexit:  rc:100
    2016-02-25T11:45:38.520+0800 I SHARDING [conn32] dbexit:  rc:100
    2016-02-25T11:45:38.520+0800 I SHARDING [conn30] dbexit:  rc:100
    2016-02-25T11:45:38.521+0800 I NETWORK  [conn21] end connection 139.122.10.181:57575 (10 connections now open)
    2016-02-25T11:45:38.521+0800 I NETWORK  [conn22] end connection 139.122.10.181:57576 (10 connections now open)
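On Linux, errno 11 is EAGAIN, which pthread_create returns when the system will not create another thread, for example because a limit on the number of threads has been reached. In the log above the failures show up with only about 14 client connections open on this mongos, which hints that the ceiling being hit may not be set by this process's connection count alone. The following is a minimal Python sketch for tallying this from a mongos log, assuming the log format shown above. `summarize_log` is a hypothetical helper name, not part of any MongoDB tool.

```python
import re

# Matches the "(N connections now open)" suffix that mongos logs on accept/end.
OPEN_RE = re.compile(r"\((\d+) connections? now open\)")
FAIL_MARK = "pthread_create failed"


def summarize_log(lines):
    """Return (thread_creation_failures, max_open_connections) for log lines."""
    failures = 0
    max_open = 0
    for line in lines:
        if FAIL_MARK in line:
            failures += 1
        match = OPEN_RE.search(line)
        if match:
            max_open = max(max_open, int(match.group(1)))
    return failures, max_open


# Three lines copied from the log excerpt above.
sample = [
    "2016-02-25T11:45:38.372+0800 I NETWORK  [mongosMain] pthread_create failed: errno:11 Resource temporarily unavailable",
    "2016-02-25T11:45:38.374+0800 I NETWORK  [mongosMain] connection accepted from 139.122.10.181:57594 #32 (13 connections now open)",
    "2016-02-25T11:45:38.374+0800 I NETWORK  [mongosMain] connection accepted from 139.122.10.181:57597 #33 (14 connections now open)",
]
print(summarize_log(sample))
```

Run against a full log file with `summarize_log(open(path))`, where `path` is wherever the mongos log lives on the host. Many failures at a low connection count would be a reason to look at thread limits outside this one process.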
Then I ran an aggregation query and a count, and found that both instances of shard3 were unavailable:
    mongos> db.online.aggregate(
    ... [{ $match: {ts: 1456376400}},
    ...  {
    ...      $group:{
    ...          _id: {msisdn: "$msisdn", rg: "$rg"},
    ...          vol: { $sum: "$vol" },
    ...         }
    ...  },
    ...  {  $sort: {vol: -1}},
    ...  {  $limit: 10 }
    ... ],
    ... {allowDiskUse: true}
    ... )
    { "_id" : { "msisdn" : "XXXXXXXX", "rg" : 5 }, "vol" : 36825 }
    ...

    mongos> db.online.count()
    2016-02-25T15:08:18.994+0800 E QUERY    [thread1] Error: count failed: {
            "code" : 16340,
            "ok" : 0,
            "errmsg" : "No replica set monitor active and no cached seed found for set: shard3"
    } :
I checked the logs of the two shard instances and saw no errors:
    2016-02-25T14:16:26.039+0800 I SHARDING [LockPinger] cluster 139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018 pinged successfully at 2016-02-25T14:16:24.849+0800 by distributed lock pinger '139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018/APGW02:27021:1456298014:1883953971', sleeping for 30000ms
    2016-02-25T14:16:56.455+0800 I SHARDING [LockPinger] cluster 139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018 pinged successfully at 2016-02-25T14:16:56.291+0800 by distributed lock pinger '139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018/APGW02:27021:1456298014:1883953971', sleeping for 30000ms
    2016-02-25T14:17:05.690+0800 I NETWORK  [conn988] end connection 139.122.10.145:27348 (58 connections now open)
    2016-02-25T14:17:21.060+0800 I NETWORK  [conn989] end connection 139.122.10.145:27353 (57 connections now open)
    2016-02-25T14:17:26.635+0800 I SHARDING [LockPinger] cluster 139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018 pinged successfully at 2016-02-25T14:17:26.485+0800 by distributed lock pinger '139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018/APGW02:27021:1456298014:1883953971', sleeping for 30000ms
After restarting them, I ran a count on the collection from the mongos client again, and this time it brought mongos down too:
    2016-02-25T15:15:59.901+0800 I SHARDING [LockPinger] cluster 139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018 pinged successfully at 2016-02-25T15:15:59.775+0800 by distributed lock pinger '139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018/APGW01:27017:1456284050:1804289383', sleeping for 30000ms
    2016-02-25T15:16:03.636+0800 F ASIO     [NetworkInterfaceASIO-TaskExecutorPool-9-0] Uncaught exception in NetworkInterfaceASIO IO worker thread of type: UnknownError Caught std::exception of type std::system_error: thread: Resource temporarily unavailable
    2016-02-25T15:16:03.636+0800 I -        [NetworkInterfaceASIO-TaskExecutorPool-9-0] Fatal Assertion 28820
    2016-02-25T15:16:03.636+0800 I -        [NetworkInterfaceASIO-TaskExecutorPool-9-0]

    ***aborting after fassert() failure

    2016-02-25T15:16:04.199+0800 F -        [NetworkInterfaceASIO-TaskExecutorPool-9-0] Got signal: 6 (Aborted).

    0xc401d2 0xc3f119 0xc3f922 0x3553a0f4a0 0x3553632885 0x3553634065 0xbc6902 0x9e3b9d 0xe174b0 0x3553a077f1 0x35536e570d
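The "thread: Resource temporarily unavailable" text in this crash is the same error string as the errno 11 failures in the earlier mongos log, and it again appears while a thread is being started. On Linux, the per-user process limit (RLIMIT_NPROC, set with `ulimit -u`) counts threads across every process owned by the same real user ID. If the mongod, config server, and mongos processes on a host all run as one user (an assumption, the post does not say), their threads would draw on one shared budget. A minimal Python sketch for adding up threads per user from /proc follows. `parse_status` and `threads_for_uid` are hypothetical helper names.

```python
import os


def parse_status(text):
    """Extract (real_uid, thread_count) from the text of /proc/<pid>/status."""
    uid = None
    threads = None
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key == "Uid":
            uid = int(value.split()[0])  # first field is the real UID
        elif key == "Threads":
            threads = int(value.strip())
    return uid, threads


def threads_for_uid(target_uid, proc_root="/proc"):
    """Sum thread counts over every process whose real UID is target_uid (Linux only)."""
    total = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                uid, threads = parse_status(fh.read())
        except OSError:
            continue  # process exited while we were scanning
        if uid == target_uid and threads is not None:
            total += threads
    return total


# Illustrative status text; the UID and thread count here are made up.
SAMPLE_STATUS = "Name:\tmongos\nUid:\t498\t498\t498\t498\nThreads:\t37\n"
print(parse_status(SAMPLE_STATUS))
```

On one of the affected hosts, `threads_for_uid(uid)` with the UID that owns the MongoDB processes gives a number to compare against that user's `ulimit -u` value while the load test is running.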

#2 | Posted 2016-02-25 17:04
allocator: tcmalloc
With a configuration this good, that shouldn't be possible, should it?

#3 | Posted 2016-02-26 07:33
The OS version is a bit old. Have you tuned the kernel parameters?

#4 | Posted 2016-02-26 10:56
Last edited by PinkOrient on 2016-02-26 10:59

Reply to #3 lcstudio

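It is started with numactl --interleave=all.
ulimit has been set:
ULIMIT_CMD="ulimit -f unlimited;ulimit -t unlimited;ulimit -v unlimited;ulimit -n 64000;ulimit -u 32000;ulimit -m unlimited"
These two have been applied as well:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

Running it earlier as a standalone instance, and as a single unsharded replica set, seemed robust enough. I don't know why it falls apart as soon as sharding is involved.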
    NUMACTL="numactl --interleave=all "
    MONGO_PATH_OPTS="--dbpath $DATA_PATH --logpath $LOG_PATH --pidfilepath $PID_PATH --logappend"
    MONGO_OPTS="--fork --journal --directoryperdb"
    MONGO_OPTS2="--shardsvr --port $PORT --replSet ${SHARD_NAME}"
    MONGO_EXEC=/opt/mongodb/bin/mongod

    usage="Usage: mongo.sh [start|stop|status|restart] [shard_number]"

    check_status() {
            kill -0 `cat $PID_PATH` > /dev/null 2>&1
    }

    modify_env() {

            if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
              echo never > /sys/kernel/mm/transparent_hugepage/enabled
            fi

            if test -f /sys/kernel/mm/transparent_hugepage/defrag; then
               echo never > /sys/kernel/mm/transparent_hugepage/defrag
            fi
    }
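A ulimit command only affects the shell that runs it and the processes that shell starts afterwards. The excerpt defines ULIMIT_CMD as a string but does not show how it is invoked, so whether `ulimit -u 32000` actually reaches the running mongod and mongos processes cannot be told from the post. On Linux the limits a live process really has can be read from /proc/<pid>/limits. The sketch below parses that file, with `parse_limits` and `effective_limit` as hypothetical helper names. On RHEL 6 it may also be worth checking /etc/security/limits.d/90-nproc.conf, which by default caps the soft nproc limit at 1024 for non-root login sessions.

```python
import re


def parse_limits(text):
    """Parse /proc/<pid>/limits text into {name: (soft, hard)} with string values."""
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are padded with runs of spaces; limit names contain single spaces.
        parts = re.split(r"\s{2,}", line.strip())
        if len(parts) >= 3:
            limits[parts[0]] = (parts[1], parts[2])
    return limits


def effective_limit(pid, name="Max processes"):
    """Read the live (soft, hard) limit of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_limits(fh.read()).get(name)


# Illustrative text only; these values are made up, not taken from the poster's hosts.
SAMPLE = (
    "Limit                     Soft Limit           Hard Limit           Units     \n"
    "Max processes             1024                 32000                processes \n"
    "Max open files            64000                64000                files     \n"
)
print(parse_limits(SAMPLE)["Max processes"])
```

Calling `effective_limit(pid)` with the PID from the script's `$PID_PATH` file would show whether the soft "Max processes" value of the running process matches the 32000 that the startup script intends.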

#5 | Posted 2016-04-20 16:15
A friend of mine who used to work in game development told me that Mongo has its share of pitfalls.

[Attachment: 1.png (118.88 KB)]