Chinaunix

Title: MongoDB sharded deployment is very unstable. Surely it can't be this bad?

Author: PinkOrient    Time: 2016-02-25 15:54

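I set up a sharded cluster on 3 servers:
It contains 5 shards, each a primary + secondary + arbiter replica set
1 config server instance runs on each server
1 mongos instance runs on each server
All traffic goes over the internal network.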
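
As a rough tally of what this layout puts on each host, the arithmetic can be sketched as below. It assumes the 15 shard members (5 shards with 3 members each) are spread evenly over the 3 servers, which the post does not state, and the function name is made up for illustration.

```python
import math


def mongo_processes_per_server(shards, members_per_shard, servers,
                               config_per_server=1, mongos_per_server=1):
    """Estimate how many MongoDB processes run on each server.

    Assumption (not stated in the post): shard members are spread
    evenly across the servers.
    """
    shard_members = math.ceil(shards * members_per_shard / servers)
    return shard_members + config_per_server + mongos_per_server


# 5 shards x (primary + secondary + arbiter) on 3 servers,
# plus 1 config server and 1 mongos per server.
print(mongo_processes_per_server(shards=5, members_per_shard=3, servers=3))  # prints 7
```

If all of these processes run under the same OS user, they share that user's process limit (ulimit -u), because Linux counts threads against RLIMIT_NPROC per user rather than per process.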

After a few days of use, MongoDB feels very unstable and not usable in a production system. Has anyone run into similar problems?
    # mongod --version
    db version v3.2.1
    git version: a14d55980c2cdc565d4704a7e3ad37e4e535c1b2
    OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
    allocator: tcmalloc
    modules: none
    build environment:
        distmod: rhel62
        distarch: x86_64
        target_arch: x86_64
    # uname -a
    Linux APGW02 2.6.32-220.60.2.el6.x86_64 #1 SMP Fri Feb 27 15:05:50 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
    # cat /etc/issue
    Red Hat Enterprise Linux Server release 6.2 (Santiago)
    Kernel \r on an \m
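Then a multi-process Python script on a fourth machine sent concurrent update requests with upsert=true to each mongos. The final data set is about 30 million documents, written at roughly 20,000 to 30,000 per second.
The mongos on one of the hosts frequently exits after leaving last words like these in its log: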
    2016-02-25T11:45:38.291+0800 I NETWORK  [mongosMain] connection accepted from 139.122.10.181:57592 #29 (12 connections now open)
    2016-02-25T11:45:38.291+0800 I NETWORK  [mongosMain] connection accepted from 139.122.10.181:57591 #30 (13 connections now open)
    2016-02-25T11:45:38.291+0800 I NETWORK  [mongosMain] connection accepted from 139.122.10.181:57593 #31 (14 connections now open)
    2016-02-25T11:45:38.372+0800 I NETWORK  [mongosMain] pthread_create failed: errno:11 Resource temporarily unavailable
    2016-02-25T11:45:38.372+0800 I NETWORK  [mongosMain] failed to create thread after accepting new connection, closing connection
    2016-02-25T11:45:38.373+0800 I NETWORK  [conn26] end connection 139.122.10.181:57584 (12 connections now open)
    2016-02-25T11:45:38.374+0800 I NETWORK  [mongosMain] connection accepted from 139.122.10.181:57594 #32 (13 connections now open)
    2016-02-25T11:45:38.374+0800 I NETWORK  [mongosMain] connection accepted from 139.122.10.181:57597 #33 (14 connections now open)
    2016-02-25T11:45:38.374+0800 I NETWORK  [mongosMain] pthread_create failed: errno:11 Resource temporarily unavailable
    2016-02-25T11:45:38.374+0800 I NETWORK  [mongosMain] failed to create thread after accepting new connection, closing connection
    2016-02-25T11:45:38.376+0800 I NETWORK  [mongosMain] connection accepted from 139.122.10.181:57600 #34 (14 connections now open)
    2016-02-25T11:45:38.376+0800 I NETWORK  [mongosMain] pthread_create failed: errno:11 Resource temporarily unavailable
    2016-02-25T11:45:38.376+0800 I NETWORK  [mongosMain] failed to create thread after accepting new connection, closing connection
    2016-02-25T11:45:38.483+0800 E NETWORK  [conn32] Uncaught std::exception: , terminating
    2016-02-25T11:45:38.483+0800 E NETWORK  [conn30] Uncaught std::exception: , terminating
    2016-02-25T11:45:38.483+0800 E NETWORK  [conn29] Uncaught std::exception: , terminating
    2016-02-25T11:45:38.520+0800 I SHARDING [conn29] dbexit:  rc:100
    2016-02-25T11:45:38.520+0800 I SHARDING [conn32] dbexit:  rc:100
    2016-02-25T11:45:38.520+0800 I SHARDING [conn30] dbexit:  rc:100
    2016-02-25T11:45:38.521+0800 I NETWORK  [conn21] end connection 139.122.10.181:57575 (10 connections now open)
    2016-02-25T11:45:38.521+0800 I NETWORK  [conn22] end connection 139.122.10.181:57576 (10 connections now open)
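
In the excerpt above, thread creation fails while only about 14 client connections are open on this mongos, so the connection count on its own does not look like the trigger. A small script can pull those two numbers out of a longer log. This is only a sketch: the function name is hypothetical, and the patterns are based solely on the log lines quoted in this post.

```python
import re

# Matches the "(14 connections now open)" suffix seen in the quoted log.
CONN_OPEN = re.compile(r"\((\d+) connections? now open\)")
THREAD_FAIL = "pthread_create failed"


def summarize_mongos_log(lines):
    """Return (thread_creation_failures, peak_open_connections)."""
    failures = 0
    peak = 0
    for line in lines:
        if THREAD_FAIL in line:
            failures += 1
        match = CONN_OPEN.search(line)
        if match:
            peak = max(peak, int(match.group(1)))
    return failures, peak


# Three lines copied from the log excerpt in this post.
sample = [
    "2016-02-25T11:45:38.291+0800 I NETWORK  [mongosMain] connection accepted from 139.122.10.181:57593 #31 (14 connections now open)",
    "2016-02-25T11:45:38.372+0800 I NETWORK  [mongosMain] pthread_create failed: errno:11 Resource temporarily unavailable",
    "2016-02-25T11:45:38.373+0800 I NETWORK  [conn26] end connection 139.122.10.181:57584 (12 connections now open)",
]
print(summarize_mongos_log(sample))  # prints (1, 14)
```

To scan a whole log file, pass the open file object, for example `summarize_mongos_log(open(path))`, where `path` is wherever the mongos log lives on the host.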
Then I ran an aggregation query and a count, and found that both instances of shard3 had become unavailable:
    mongos> db.online.aggregate(
    ... [{ $match: {ts: 1456376400}},
    ...  {
    ...      $group:{
    ...          _id: {msisdn: "$msisdn", rg: "$rg"},
    ...          vol: { $sum: "$vol" },
    ...         }
    ...  },
    ...  {  $sort: {vol: -1}},
    ...  {  $limit: 10 }
    ... ],
    ... {allowDiskUse: true}
    ... )
    { "_id" : { "msisdn" : "XXXXXXXX", "rg" : 5 }, "vol" : 36825 }
    ...

    mongos> db.online.count()
    2016-02-25T15:08:18.994+0800 E QUERY    [thread1] Error: count failed: {
            "code" : 16340,
            "ok" : 0,
            "errmsg" : "No replica set monitor active and no cached seed found for set: shard3"
    } :
I checked the logs of the two shard instances and did not see any errors:
    2016-02-25T14:16:26.039+0800 I SHARDING [LockPinger] cluster 139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018 pinged successfully at 2016-02-25T14:16:24.849+0800 by distributed lock pinger '139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018/APGW02:27021:1456298014:1883953971', sleeping for 30000ms
    2016-02-25T14:16:56.455+0800 I SHARDING [LockPinger] cluster 139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018 pinged successfully at 2016-02-25T14:16:56.291+0800 by distributed lock pinger '139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018/APGW02:27021:1456298014:1883953971', sleeping for 30000ms
    2016-02-25T14:17:05.690+0800 I NETWORK  [conn988] end connection 139.122.10.145:27348 (58 connections now open)
    2016-02-25T14:17:21.060+0800 I NETWORK  [conn989] end connection 139.122.10.145:27353 (57 connections now open)
    2016-02-25T14:17:26.635+0800 I SHARDING [LockPinger] cluster 139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018 pinged successfully at 2016-02-25T14:17:26.485+0800 by distributed lock pinger '139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018/APGW02:27021:1456298014:1883953971', sleeping for 30000ms
After restarting them, I queried the collection count again from the mongos client, and this time it brought the mongos down as well:
    2016-02-25T15:15:59.901+0800 I SHARDING [LockPinger] cluster 139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018 pinged successfully at 2016-02-25T15:15:59.775+0800 by distributed lock pinger '139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018/APGW01:27017:1456284050:1804289383', sleeping for 30000ms
    2016-02-25T15:16:03.636+0800 F ASIO     [NetworkInterfaceASIO-TaskExecutorPool-9-0] Uncaught exception in NetworkInterfaceASIO IO worker thread of type: UnknownError Caught std::exception of type std::system_error: thread: Resource temporarily unavailable
    2016-02-25T15:16:03.636+0800 I -        [NetworkInterfaceASIO-TaskExecutorPool-9-0] Fatal Assertion 28820
    2016-02-25T15:16:03.636+0800 I -        [NetworkInterfaceASIO-TaskExecutorPool-9-0]

    ***aborting after fassert() failure


    2016-02-25T15:16:04.199+0800 F -        [NetworkInterfaceASIO-TaskExecutorPool-9-0] Got signal: 6 (Aborted).

    0xc401d2 0xc3f119 0xc3f922 0x3553a0f4a0 0x3553632885 0x3553634065 0xbc6902 0x9e3b9d 0xe174b0 0x3553a077f1 0x35536e570d
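
Both crashes report the same underlying error: "errno:11 Resource temporarily unavailable" from pthread_create in the first log, and "thread: Resource temporarily unavailable" here. On Linux, errno 11 is EAGAIN, which for thread creation generally means that resources ran out or a system-imposed limit on the number of threads was reached. A quick way to decode an errno value is sketched below; the helper name is made up for illustration.

```python
import errno
import os


def describe_errno(code):
    """Return 'NAME: message' for an errno value on the current platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# On Linux, EAGAIN has the value 11, the number shown in the mongos log.
print(describe_errno(errno.EAGAIN))
```

This does not say which limit was hit. Candidates worth checking include the per-user process limit (ulimit -u) and the system-wide thread and PID limits.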

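Author: lyhabc    Time: 2016-02-25 17:04
allocator: tcmalloc
With a setup this good, that shouldn't be happening, should it?
Author: lcstudio    Time: 2016-02-26 07:33
The OS version is a bit old. Have you tuned the kernel parameters?
Author: PinkOrient    Time: 2016-02-26 10:56
Last edited by PinkOrient on 2016-02-26 10:59

Reply to 3# lcstudio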

The instances are started with numactl --interleave=all.
ulimit is set:
ULIMIT_CMD="ulimit -f unlimited;ulimit -t unlimited;ulimit -v unlimited;ulimit -n 64000;ulimit -u 32000;ulimit -m unlimited"
These two are applied as well:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
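
A ulimit call only affects the shell that runs it and the processes started from that shell, so it can be worth confirming what a running mongod or mongos actually ended up with. On Linux the effective values are exposed in /proc/<pid>/limits. The sketch below parses that file. The helper names and the sample values are made up for illustration and are not taken from the poster's machines.

```python
import re


def parse_limit(limits_text, name):
    """Return (soft, hard) for one row of /proc/<pid>/limits, as strings."""
    for line in limits_text.splitlines():
        if line.startswith(name):
            # Columns are separated by runs of two or more spaces.
            fields = re.split(r"\s{2,}", line.strip())
            return fields[1], fields[2]
    return None


def read_process_limit(pid, name="Max processes"):
    """Read the limit that is actually in effect for a running process."""
    with open(f"/proc/{pid}/limits") as handle:
        return parse_limit(handle.read(), name)


# Hypothetical sample in the layout of /proc/<pid>/limits.
sample = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max processes             1024                 32000                processes\n"
    "Max open files            64000                64000                files\n"
)
print(parse_limit(sample, "Max processes"))  # prints ('1024', '32000')
```

Calling `read_process_limit(pid)` with the PID of a mongos or mongod shows whether the intended `ulimit -u 32000` really applies to it. The MongoDB documentation on ulimit settings notes that RHEL and CentOS 6 ship /etc/security/limits.d/90-nproc.conf, which may cap the soft nproc limit at 1024, so that file may be worth a look too.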

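Previously, when I ran a standalone instance and a single unsharded replica set, robustness seemed acceptable. I don't know why it falls apart as soon as sharding is involved.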
    NUMACTL="numactl --interleave=all "
    MONGO_PATH_OPTS="--dbpath $DATA_PATH --logpath $LOG_PATH --pidfilepath $PID_PATH --logappend"
    MONGO_OPTS="--fork --journal --directoryperdb"
    MONGO_OPTS2="--shardsvr --port $PORT --replSet ${SHARD_NAME}"
    MONGO_EXEC=/opt/mongodb/bin/mongod

    usage="Usage: mongo.sh [start|stop|status|restart] [shard_number]"

    # Succeeds if the process recorded in the PID file is still alive.
    check_status() {
            kill -0 "$(cat "$PID_PATH")" > /dev/null 2>&1
    }

    # Disable transparent huge pages if the control files exist.
    modify_env() {
            if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
              echo never > /sys/kernel/mm/transparent_hugepage/enabled
            fi

            if test -f /sys/kernel/mm/transparent_hugepage/defrag; then
               echo never > /sys/kernel/mm/transparent_hugepage/defrag
            fi
    }
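
The modify_env function in the script above writes "never" only if the control files exist, so it fails silently when they live elsewhere. On RHEL 6 era kernels the directory is reportedly named redhat_transparent_hugepage instead; that is an assumption to verify on the actual hosts. The sketch below reads back whichever files exist and reports the active value, which the kernel marks with square brackets. The helper names are hypothetical.

```python
import os
import re

# Candidate locations; the second is the name used on RHEL 6 era kernels
# (an assumption to verify on the host in question).
THP_DIRS = (
    "/sys/kernel/mm/transparent_hugepage",
    "/sys/kernel/mm/redhat_transparent_hugepage",
)


def active_thp_setting(text):
    """Return the bracketed value, e.g. 'never' from 'always madvise [never]'."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def read_thp_settings(dirs=THP_DIRS):
    """Return {path: active value} for every THP control file that exists."""
    result = {}
    for directory in dirs:
        for name in ("enabled", "defrag"):
            path = os.path.join(directory, name)
            if os.path.isfile(path):
                with open(path) as handle:
                    result[path] = active_thp_setting(handle.read())
    return result


print(active_thp_setting("always madvise [never]"))  # prints never
```

An empty result from `read_thp_settings()` would mean that neither directory exists on that host.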

Author: cjfeii    Time: 2016-04-20 16:15
A friend of mine who used to work in game development told me that mongo has its share of pitfalls.

[Attachment: 1.png (118.88 KB)]

Welcome to Chinaunix (http://bbs.chinaunix.net/) Powered by Discuz! X3.2