MongoDB sharded deployment is terribly unstable. Surely it can't be this bad?
I set up a 3-server sharding environment:
5 shards, each a primary + secondary + arbiter replica set
1 config server (configsvr) instance on each server
1 mongos instance on each server
All traffic goes over the internal network.
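For context, that layout would amount to something like the following on each of the three servers. The ports and paths here are my guesses (the logs further down only reveal 27017 and 27021 for mongos instances and 27018 for the config servers); the IPs are the ones that appear in those logs. A sketch only, not the poster's actual commands:

```shell
# Per-server process layout (ports/paths assumed, IPs taken from the logs):
# one config server
mongod --configsvr --port 27018 --dbpath /data/configdb --fork --logpath /var/log/mongo/configsvr.log
# one mongos pointing at all three config servers
mongos --configdb 139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018 \
       --port 27017 --fork --logpath /var/log/mongo/mongos.log
# plus one member (data node or arbiter) of each of the five shard replica sets, e.g.:
mongod --shardsvr --replSet shard1 --port 27031 --dbpath /data/shard1 --fork --logpath /var/log/mongo/shard1.log
```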
After a few days of use, MongoDB feels very unstable; I can't see putting this into a production system. Has anyone run into similar problems?

# mongod --version
db version v3.2.1
git version: a14d55980c2cdc565d4704a7e3ad37e4e535c1b2
OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
allocator: tcmalloc
modules: none
build environment:
distmod: rhel62
distarch: x86_64
target_arch: x86_64
# uname -a
Linux APGW02 2.6.32-220.60.2.el6.x86_64 #1 SMP Fri Feb 27 15:05:50 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/issue
Red Hat Enterprise Linux Server release 6.2 (Santiago)
Kernel \r on an \m
Then, from a fourth machine, a Python script fired concurrent update requests with upsert=true from multiple processes at the various mongos instances. The final data volume was about 30 million documents, at roughly 20,000-30,000 requests per second.
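The load can be sketched like this. The field names come from the aggregation shown further down (msisdn, rg, ts, vol), but how they combine into an update is my assumption, and the real driver was a multi-process Python script; this is a single-process mongo-shell rendition against one assumed mongos host:

```shell
# Single-process rendition of the upsert load (field names from the
# aggregation below; host, port, database name and document shape assumed).
mongo --host 139.122.10.145 --port 27017 --eval '
  var d = db.getSiblingDB("test");
  for (var i = 0; i < 100000; i++) {
    d.online.update(
      { msisdn: "8600000" + (i % 10000), rg: i % 10, ts: 1456376400 },
      { $inc: { vol: 100 } },
      { upsert: true });
  }'
```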
I found that the mongos on one of the hosts would regularly exit after leaving last words like these:

2016-02-25T11:45:38.291+0800 I NETWORK connection accepted from 139.122.10.181:57592 #29 (12 connections now open)
2016-02-25T11:45:38.291+0800 I NETWORK connection accepted from 139.122.10.181:57591 #30 (13 connections now open)
2016-02-25T11:45:38.291+0800 I NETWORK connection accepted from 139.122.10.181:57593 #31 (14 connections now open)
2016-02-25T11:45:38.372+0800 I NETWORK pthread_create failed: errno:11 Resource temporarily unavailable
2016-02-25T11:45:38.372+0800 I NETWORK failed to create thread after accepting new connection, closing connection
2016-02-25T11:45:38.373+0800 I NETWORK end connection 139.122.10.181:57584 (12 connections now open)
2016-02-25T11:45:38.374+0800 I NETWORK connection accepted from 139.122.10.181:57594 #32 (13 connections now open)
2016-02-25T11:45:38.374+0800 I NETWORK connection accepted from 139.122.10.181:57597 #33 (14 connections now open)
2016-02-25T11:45:38.374+0800 I NETWORK pthread_create failed: errno:11 Resource temporarily unavailable
2016-02-25T11:45:38.374+0800 I NETWORK failed to create thread after accepting new connection, closing connection
2016-02-25T11:45:38.376+0800 I NETWORK connection accepted from 139.122.10.181:57600 #34 (14 connections now open)
2016-02-25T11:45:38.376+0800 I NETWORK pthread_create failed: errno:11 Resource temporarily unavailable
2016-02-25T11:45:38.376+0800 I NETWORK failed to create thread after accepting new connection, closing connection
2016-02-25T11:45:38.483+0800 E NETWORK Uncaught std::exception: , terminating
2016-02-25T11:45:38.483+0800 E NETWORK Uncaught std::exception: , terminating
2016-02-25T11:45:38.483+0800 E NETWORK Uncaught std::exception: , terminating
2016-02-25T11:45:38.520+0800 I SHARDING dbexit:rc:100
2016-02-25T11:45:38.520+0800 I SHARDING dbexit:rc:100
2016-02-25T11:45:38.520+0800 I SHARDING dbexit:rc:100
2016-02-25T11:45:38.521+0800 I NETWORK end connection 139.122.10.181:57575 (10 connections now open)
2016-02-25T11:45:38.521+0800 I NETWORK end connection 139.122.10.181:57576 (10 connections now open)

Then I ran an aggregation query and a count, and found that both instances of shard3 had become unavailable:

mongos> db.online.aggregate(
... [{ $match: {ts: 1456376400}},
...{
... $group:{
... _id: {msisdn: "$msisdn", rg: "$rg"},
... vol: { $sum: "$vol" },
... }
...},
...{$sort: {vol: -1}},
...{$limit: 10 }
... ],
... {allowDiskUse: true}
... )
{ "_id" : { "msisdn" : "XXXXXXXX", "rg" : 5 }, "vol" : 36825 }
...
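The aggregation still returned rows at this point. Before running further queries, one way to see which members the cluster still considers reachable is sh.status() on a mongos, and rs.status() against a surviving member of the suspect set directly; the hosts below are placeholders of mine, a sketch only:

```shell
# Which hosts does the cluster still see? (hosts/ports are placeholders)
mongo --host 139.122.10.145 --port 27017 --eval 'sh.status()'
# Ask a surviving member of the suspect replica set directly:
mongo --host <shard3-member> --port <shard3-port> --eval 'printjson(rs.status())'
```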
mongos> db.online.count()
2016-02-25T15:08:18.994+0800 E QUERY Error: count failed: {
"code" : 16340,
"ok" : 0,
"errmsg" : "No replica set monitor active and no cached seed found for set: shard3"
}

I went and looked at the logs of the two shard instances and saw no errors there:

2016-02-25T14:16:26.039+0800 I SHARDING cluster 139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018 pinged successfully at 2016-02-25T14:16:24.849+0800 by distributed lock pinger '139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018/APGW02:27021:1456298014:1883953971', sleeping for 30000ms
2016-02-25T14:16:56.455+0800 I SHARDING cluster 139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018 pinged successfully at 2016-02-25T14:16:56.291+0800 by distributed lock pinger '139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018/APGW02:27021:1456298014:1883953971', sleeping for 30000ms
2016-02-25T14:17:05.690+0800 I NETWORK end connection 139.122.10.145:27348 (58 connections now open)
2016-02-25T14:17:21.060+0800 I NETWORK end connection 139.122.10.145:27353 (57 connections now open)
2016-02-25T14:17:26.635+0800 I SHARDING cluster 139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018 pinged successfully at 2016-02-25T14:17:26.485+0800 by distributed lock pinger '139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018/APGW02:27021:1456298014:1883953971', sleeping for 30000ms

Then I restarted the shard instances and queried the collection's count from the mongos client again, and this time it brought the mongos down as well:

2016-02-25T15:15:59.901+0800 I SHARDING cluster 139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018 pinged successfully at 2016-02-25T15:15:59.775+0800 by distributed lock pinger '139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018/APGW01:27017:1456284050:1804289383', sleeping for 30000ms
2016-02-25T15:16:03.636+0800 F ASIO Uncaught exception in NetworkInterfaceASIO IO worker thread of type: UnknownError Caught std::exception of type std::system_error: thread: Resource temporarily unavailable
2016-02-25T15:16:03.636+0800 I - Fatal Assertion 28820
2016-02-25T15:16:03.636+0800 I -
***aborting after fassert() failure
2016-02-25T15:16:04.199+0800 F - Got signal: 6 (Aborted).
0xc401d2 0xc3f119 0xc3f922 0x3553a0f4a0 0x3553632885 0x3553634065 0xbc6902 0x9e3b9d 0xe174b0 0x3553a077f1 0x35536e570d
allocator: tcmalloc
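Both mongos crashes above trace back to thread creation failing with errno 11 (EAGAIN): pthread_create could not start another thread. In 3.2, mongos uses a thread per incoming connection, so under a burst of connections the ceiling can be the per-user process limit (RLIMIT_NPROC, which counts every thread of every mongod/mongos run by the same user), kernel.threads-max, kernel.pid_max, or available virtual memory. A quick sketch for checking what a running process actually inherited (the fallback to the current shell is only there so the snippet runs anywhere):

```shell
# Inspect the limits the running mongos really has: ulimit set in one
# shell does not reach a daemon that was started from somewhere else.
pid=$(pgrep -o mongos || echo $$)   # fall back to this shell for illustration
grep -E 'Max (processes|open files)' /proc/$pid/limits
ls /proc/$pid/task | wc -l          # threads currently alive in the process
cat /proc/sys/kernel/threads-max    # system-wide thread ceiling
cat /proc/sys/kernel/pid_max        # also caps thread IDs
```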
With hardware this good that really shouldn't happen. The OS version is a bit old, though. Did you tune the kernel parameters?

Last edited by PinkOrient on 2016-02-26 10:59
Reply to #3 lcstudio
Started the processes with numactl --interleave=all.
ulimits are set:
ULIMIT_CMD="ulimit -f unlimited;ulimit -t unlimited;ulimit -v unlimited;ulimit -n 64000;ulimit -u 32000;ulimit -m unlimited"
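One caveat worth double-checking, as a guess at the root cause: ulimit applied in a wrapper script only covers processes launched from that same shell, and RLIMIT_NPROC is counted per user across all five mongod instances plus the configsvr and mongos on a box, so 32000 can be exhausted before any single process looks busy. Persisting higher limits for the service account (the "mongod" user name below is an assumption) would look like:

```shell
# /etc/security/limits.conf entries (user name assumed)
mongod  soft  nproc   64000
mongod  hard  nproc   64000
mongod  soft  nofile  64000
mongod  hard  nofile  64000
```

These take effect for new PAM sessions of that user, not for already-running daemons.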
Did these two as well:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
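Those echo settings are lost on reboot and are sometimes reverted by init scripts, so it is worth verifying the live state. A small check, with a guard for kernels that lack the files:

```shell
# Print the live THP state; a bracketed "[never]" means it is disabled.
for f in /sys/kernel/mm/transparent_hugepage/enabled \
         /sys/kernel/mm/transparent_hugepage/defrag; do
    if [ -f "$f" ]; then
        echo "$f: $(cat "$f")"
    fi
done
```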
When I previously ran a single instance, and also a single replica set without sharding, robustness seemed fine; I have no idea why it falls apart as soon as sharding is involved.

NUMACTL="numactl --interleave=all "
MONGO_PATH_OPTS="--dbpath $DATA_PATH --logpath $LOG_PATH --pidfilepath $PID_PATH --logappend"
MONGO_OPTS="--fork --journal --directoryperdb"
MONGO_OPTS2="--shardsvr --port $PORT --replSet ${SHARD_NAME}"
MONGO_EXEC=/opt/mongodb/bin/mongod
usage="Usage: mongo.sh "
check_status() {
    kill -0 `cat $PID_PATH` > /dev/null 2>&1
}
modify_env() {
    if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
        echo never > /sys/kernel/mm/transparent_hugepage/enabled
    fi
    if test -f /sys/kernel/mm/transparent_hugepage/defrag; then
        echo never > /sys/kernel/mm/transparent_hugepage/defrag
    fi
}

A friend of mine who used to work in games tells me Mongo is a bit of a trap.