MongoDB sharded deployment is very unstable. Surely it can't be this bad?

PinkOrient posted on 2016-02-25 15:54

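I set up a sharding environment across 3 servers:
5 replica-set shards, each made up of a primary, a secondary, and an arbiter
1 configsrv instance running on each server
1 mongos instance running on each server
All traffic goes over the internal network.

After a few days of use, MongoDB feels very unstable, to the point where I can't use it in a production system. Has anyone else run into similar problems?

# mongod --version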
db version v3.2.1
git version: a14d55980c2cdc565d4704a7e3ad37e4e535c1b2
OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
allocator: tcmalloc
modules: none
build environment:
distmod: rhel62
distarch: x86_64
target_arch: x86_64
# uname -a
Linux APGW02 2.6.32-220.60.2.el6.x86_64 #1 SMP Fri Feb 27 15:05:50 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/issue
Red Hat Enterprise Linux Server release 6.2 (Santiago)
Kernel \r on an \m
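Then a multi-process Python script on a fourth machine sends concurrent update requests with upsert=true to each mongos. The final data set is about 30 million documents, written at roughly 20,000 to 30,000 per second.
The mongos on one of the hosts frequently exits after leaving last words like these:

2016-02-25T11:45:38.291+0800 I NETWORK connection accepted from 139.122.10.181:57592 #29 (12 connections now open)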
2016-02-25T11:45:38.291+0800 I NETWORK connection accepted from 139.122.10.181:57591 #30 (13 connections now open)
2016-02-25T11:45:38.291+0800 I NETWORK connection accepted from 139.122.10.181:57593 #31 (14 connections now open)
2016-02-25T11:45:38.372+0800 I NETWORK pthread_create failed: errno:11 Resource temporarily unavailable
2016-02-25T11:45:38.372+0800 I NETWORK failed to create thread after accepting new connection, closing connection
2016-02-25T11:45:38.373+0800 I NETWORK end connection 139.122.10.181:57584 (12 connections now open)
2016-02-25T11:45:38.374+0800 I NETWORK connection accepted from 139.122.10.181:57594 #32 (13 connections now open)
2016-02-25T11:45:38.374+0800 I NETWORK connection accepted from 139.122.10.181:57597 #33 (14 connections now open)
2016-02-25T11:45:38.374+0800 I NETWORK pthread_create failed: errno:11 Resource temporarily unavailable
2016-02-25T11:45:38.374+0800 I NETWORK failed to create thread after accepting new connection, closing connection
2016-02-25T11:45:38.376+0800 I NETWORK connection accepted from 139.122.10.181:57600 #34 (14 connections now open)
2016-02-25T11:45:38.376+0800 I NETWORK pthread_create failed: errno:11 Resource temporarily unavailable
2016-02-25T11:45:38.376+0800 I NETWORK failed to create thread after accepting new connection, closing connection
2016-02-25T11:45:38.483+0800 E NETWORK Uncaught std::exception: , terminating
2016-02-25T11:45:38.483+0800 E NETWORK Uncaught std::exception: , terminating
2016-02-25T11:45:38.483+0800 E NETWORK Uncaught std::exception: , terminating
2016-02-25T11:45:38.520+0800 I SHARDING dbexit:rc:100
2016-02-25T11:45:38.520+0800 I SHARDING dbexit:rc:100
2016-02-25T11:45:38.520+0800 I SHARDING dbexit:rc:100
2016-02-25T11:45:38.521+0800 I NETWORK end connection 139.122.10.181:57575 (10 connections now open)
2016-02-25T11:45:38.521+0800 I NETWORK end connection 139.122.10.181:57576 (10 connections now open)

Then I ran an aggregation query and a count, and found that both instances of shard3 had become unavailable:

mongos> db.online.aggregate(
... [{ $match: {ts: 1456376400}},
...{
...    $group:{
...       _id: {msisdn: "$msisdn", rg: "$rg"},
...       vol: { $sum: "$vol" },
...       }
...},
...{$sort: {vol: -1}},
...{$limit: 10 }
... ],
... {allowDiskUse: true}
... )
{ "_id" : { "msisdn" : "XXXXXXXX", "rg" : 5 }, "vol" : 36825 }
...

mongos> db.online.count()
2016-02-25T15:08:18.994+0800 E QUERY Error: count failed: {
   "code" : 16340,
   "ok" : 0,
   "errmsg" : "No replica set monitor active and no cached seed found for set: shard3"
} :

I checked the logs of the two shard instances and saw no errors there:

2016-02-25T14:16:26.039+0800 I SHARDING cluster 139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018 pinged successfully at 2016-02-25T14:16:24.849+0800 by distributed lock pinger '139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018/APGW02:27021:1456298014:1883953971', sleeping for 30000ms
2016-02-25T14:16:56.455+0800 I SHARDING cluster 139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018 pinged successfully at 2016-02-25T14:16:56.291+0800 by distributed lock pinger '139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018/APGW02:27021:1456298014:1883953971', sleeping for 30000ms
2016-02-25T14:17:05.690+0800 I NETWORK end connection 139.122.10.145:27348 (58 connections now open)
2016-02-25T14:17:21.060+0800 I NETWORK end connection 139.122.10.145:27353 (57 connections now open)
2016-02-25T14:17:26.635+0800 I SHARDING cluster 139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018 pinged successfully at 2016-02-25T14:17:26.485+0800 by distributed lock pinger '139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018/APGW02:27021:1456298014:1883953971', sleeping for 30000ms

Then I restarted them and queried the collection's count again from the mongos client, and this time it brought down mongos as well:

2016-02-25T15:15:59.901+0800 I SHARDING cluster 139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018 pinged successfully at 2016-02-25T15:15:59.775+0800 by distributed lock pinger '139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018/APGW01:27017:1456284050:1804289383', sleeping for 30000ms
2016-02-25T15:16:03.636+0800 F ASIO Uncaught exception in NetworkInterfaceASIO IO worker thread of type: UnknownError Caught std::exception of type std::system_error: thread: Resource temporarily unavailable
2016-02-25T15:16:03.636+0800 I -    Fatal Assertion 28820
2016-02-25T15:16:03.636+0800 I -

***aborting after fassert() failure

2016-02-25T15:16:04.199+0800 F -    Got signal: 6 (Aborted).

0xc401d2 0xc3f119 0xc3f922 0x3553a0f4a0 0x3553632885 0x3553634065 0xbc6902 0x9e3b9d 0xe174b0 0x3553a077f1 0x35536e570d
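
Both `pthread_create failed: errno:11` and `thread: Resource temporarily unavailable` in the logs above say that the process could not start another thread (errno 11 is EAGAIN on Linux). One possible cause, not confirmed anywhere in this thread, is the per-user process/thread limit (`ulimit -u`). Below is a minimal sketch for checking what a running process actually has. The function names `soft_nproc_limit` and `thread_count` are made up for illustration; they only parse the standard Linux `/proc/<pid>/limits` and `/proc/<pid>/status` files.

```shell
# Hypothetical helper: print the soft "Max processes" limit from a
# /proc/<pid>/limits-style file passed as the first argument.
soft_nproc_limit() {
    awk '/^Max processes/ { print $3 }' "$1"
}

# Hypothetical helper: print the current thread count from a
# /proc/<pid>/status-style file passed as the first argument.
thread_count() {
    awk '/^Threads:/ { print $2 }' "$1"
}
```

For example, with `pid=$(cat "$PID_PATH")`, compare `soft_nproc_limit "/proc/$pid/limits"` with `thread_count "/proc/$pid/status"`. The limit is counted across all processes and threads of the same user, so the thread count of one process is only part of the total when several mongod, configsrv, and mongos instances share a user on one server.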

lyhabc posted on 2016-02-25 17:04

allocator: tcmalloc
With a setup this good, that shouldn't be possible, should it?

lcstudio posted on 2016-02-26 07:33

The OS version is a bit old. Have you tuned the kernel parameters?

PinkOrient posted on 2016-02-26 10:56

Last edited by PinkOrient on 2016-02-26 10:59

Reply to #3 lcstudio

The instances are started with numactl --interleave=all.
ulimit is set:
ULIMIT_CMD="ulimit -f unlimited;ulimit -t unlimited;ulimit -v unlimited;ulimit -n 64000;ulimit -u 32000;ulimit -m unlimited"
These two are done as well:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

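Previously, running a standalone instance and running a single replica set without sharding both seemed reasonably robust. I don't know why it falls apart as soon as sharding is added.

NUMACTL="numactl --interleave=all "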
MONGO_PATH_OPTS="--dbpath $DATA_PATH --logpath $LOG_PATH --pidfilepath $PID_PATH --logappend"
MONGO_OPTS="--fork --journal --directoryperdb"
MONGO_OPTS2="--shardsvr --port $PORT --replSet ${SHARD_NAME}"
MONGO_EXEC=/opt/mongodb/bin/mongod

usage="Usage: mongo.sh "

# Succeeds if the process recorded in the PID file is still alive.
check_status() {
   kill -0 "$(cat "$PID_PATH")" > /dev/null 2>&1
}

modify_env() {

   if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
      echo never > /sys/kernel/mm/transparent_hugepage/enabled
   fi

   if test -f /sys/kernel/mm/transparent_hugepage/defrag; then
      echo never > /sys/kernel/mm/transparent_hugepage/defrag
   fi
}
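
`ulimit` is a shell builtin, so the `ULIMIT_CMD` string shown earlier only changes anything if it is evaluated in the shell that then launches the daemon. How the original script does this is not shown in the excerpt. The following is a minimal sketch under that assumption, and `run_with_limits` is a hypothetical helper name.

```shell
# Hypothetical helper: apply the limits in $ULIMIT_CMD, then run the given
# command in the same subshell so that it inherits them.
run_with_limits() {
    (
        # ulimit only affects the current shell and its children.
        # eval returns the status of the last command in the string.
        eval "$ULIMIT_CMD" || exit 1
        exec "$@"
    )
}
```

With the variables from the script, a start command might look like `run_with_limits $NUMACTL $MONGO_EXEC $MONGO_PATH_OPTS $MONGO_OPTS $MONGO_OPTS2`. The variables are left unquoted on purpose so that each option string splits into separate arguments.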

cjfeii posted on 2016-04-20 16:15

A friend of mine who used to work in game development told me that mongo has its share of pitfalls.
