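I set up a sharded cluster across 3 servers.
It contains 5 replica-set shards, each made up of a primary, a secondary, and an arbiter.
Each server runs 1 config server instance,
and each server also runs 1 mongos instance.
All traffic goes over the internal network.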
After a few days of use, MongoDB feels very unstable to me and not fit for a production system. Has anyone run into similar problems?
- # mongod --version
- db version v3.2.1
- git version: a14d55980c2cdc565d4704a7e3ad37e4e535c1b2
- OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
- allocator: tcmalloc
- modules: none
- build environment:
- distmod: rhel62
- distarch: x86_64
- target_arch: x86_64
- # uname -a
- Linux APGW02 2.6.32-220.60.2.el6.x86_64 #1 SMP Fri Feb 27 15:05:50 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
- # cat /etc/issue
- Red Hat Enterprise Linux Server release 6.2 (Santiago)
- Kernel \r on an \m
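Then a multi-process Python script on a 4th machine concurrently sent update requests with upsert=true to each mongos. The final data set was about 30 million documents, written at roughly 20,000 to 30,000 per second.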
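For readers who want to reproduce this kind of load, here is a minimal sketch of such a script. It is not the script used in this post. The host names, the `test` database name, the choice of `(msisdn, rg, ts)` as the upsert key, and the use of `$inc` on `vol` are all assumptions made for illustration; only the `online` collection and the field names come from the queries shown later. It uses pymongo's `bulk_write` with `UpdateOne(..., upsert=True)` and keeps one `MongoClient` per worker process.

```python
from multiprocessing import Process

# Hypothetical mongos addresses; replace with the real router hosts.
MONGOS_HOSTS = ["mongos-a:27017", "mongos-b:27017", "mongos-c:27017"]


def pick_mongos(worker_id, hosts=MONGOS_HOSTS):
    """Spread worker processes over the mongos routers round-robin."""
    return hosts[worker_id % len(hosts)]


def build_upsert(record):
    """Build (filter, update) for one record.

    Assumption: documents are keyed by (msisdn, rg, ts) and vol is a counter.
    """
    flt = {"msisdn": record["msisdn"], "rg": record["rg"], "ts": record["ts"]}
    update = {"$inc": {"vol": record["vol"]}}
    return flt, update


def worker(worker_id, records, batch_size=1000):
    """Send upserts for `records` through one mongos, in unordered batches."""
    # pymongo is third-party; imported here so the helpers above need no driver.
    from pymongo import MongoClient, UpdateOne

    # One client per process, reused for every batch.
    client = MongoClient("mongodb://" + pick_mongos(worker_id))
    coll = client["test"]["online"]  # "test" is a placeholder database name
    batch = []
    for rec in records:
        flt, upd = build_upsert(rec)
        batch.append(UpdateOne(flt, upd, upsert=True))
        if len(batch) >= batch_size:
            coll.bulk_write(batch, ordered=False)
            batch = []
    if batch:
        coll.bulk_write(batch, ordered=False)
    client.close()


def run(record_chunks):
    """Start one worker process per chunk of records and wait for all of them."""
    procs = [Process(target=worker, args=(i, chunk))
             for i, chunk in enumerate(record_chunks)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
```

Calling `run([chunk0, chunk1, chunk2])` with three lists of record dicts would start three processes, one per mongos in this sketch.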
I found that the mongos on one of the hosts frequently exited after leaving last words like these in its log:
- 2016-02-25T11:45:38.291+0800 I NETWORK [mongosMain] connection accepted from 139.122.10.181:57592 #29 (12 connections now open)
- 2016-02-25T11:45:38.291+0800 I NETWORK [mongosMain] connection accepted from 139.122.10.181:57591 #30 (13 connections now open)
- 2016-02-25T11:45:38.291+0800 I NETWORK [mongosMain] connection accepted from 139.122.10.181:57593 #31 (14 connections now open)
- 2016-02-25T11:45:38.372+0800 I NETWORK [mongosMain] pthread_create failed: errno:11 Resource temporarily unavailable
- 2016-02-25T11:45:38.372+0800 I NETWORK [mongosMain] failed to create thread after accepting new connection, closing connection
- 2016-02-25T11:45:38.373+0800 I NETWORK [conn26] end connection 139.122.10.181:57584 (12 connections now open)
- 2016-02-25T11:45:38.374+0800 I NETWORK [mongosMain] connection accepted from 139.122.10.181:57594 #32 (13 connections now open)
- 2016-02-25T11:45:38.374+0800 I NETWORK [mongosMain] connection accepted from 139.122.10.181:57597 #33 (14 connections now open)
- 2016-02-25T11:45:38.374+0800 I NETWORK [mongosMain] pthread_create failed: errno:11 Resource temporarily unavailable
- 2016-02-25T11:45:38.374+0800 I NETWORK [mongosMain] failed to create thread after accepting new connection, closing connection
- 2016-02-25T11:45:38.376+0800 I NETWORK [mongosMain] connection accepted from 139.122.10.181:57600 #34 (14 connections now open)
- 2016-02-25T11:45:38.376+0800 I NETWORK [mongosMain] pthread_create failed: errno:11 Resource temporarily unavailable
- 2016-02-25T11:45:38.376+0800 I NETWORK [mongosMain] failed to create thread after accepting new connection, closing connection
- 2016-02-25T11:45:38.483+0800 E NETWORK [conn32] Uncaught std::exception: , terminating
- 2016-02-25T11:45:38.483+0800 E NETWORK [conn30] Uncaught std::exception: , terminating
- 2016-02-25T11:45:38.483+0800 E NETWORK [conn29] Uncaught std::exception: , terminating
- 2016-02-25T11:45:38.520+0800 I SHARDING [conn29] dbexit: rc:100
- 2016-02-25T11:45:38.520+0800 I SHARDING [conn32] dbexit: rc:100
- 2016-02-25T11:45:38.520+0800 I SHARDING [conn30] dbexit: rc:100
- 2016-02-25T11:45:38.521+0800 I NETWORK [conn21] end connection 139.122.10.181:57575 (10 connections now open)
- 2016-02-25T11:45:38.521+0800 I NETWORK [conn22] end connection 139.122.10.181:57576 (10 connections now open)
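The `pthread_create failed: errno:11` lines in the log above mean that thread creation returned EAGAIN. On Linux that generally points to a resource limit rather than to the request being served: a per-user process/thread limit (`ulimit -u`, RLIMIT_NPROC), a system-wide thread or PID cap, or a lack of memory for the new thread. The post does not show which of these applied. As a first check, the following minimal sketch prints the meaning of errno 11 and the RLIMIT_NPROC values of the current process. The helper names are made up for illustration, and the result is only meaningful when run as the same user that starts mongos.

```python
import errno
import os
import resource


def explain_errno(code):
    """Return (symbolic name, message) for an errno value."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def format_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def nproc_limits():
    """Soft and hard RLIMIT_NPROC of the current process."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


def report():
    """One-line summary: what errno 11 means and the current nproc limits."""
    name, message = explain_errno(11)  # 11 is the errno printed by mongos
    soft, hard = nproc_limits()
    return "errno 11 = %s (%s); nproc soft=%s hard=%s" % (
        name, message, format_limit(soft), format_limit(hard))


if __name__ == "__main__":
    print(report())
```

The limits of an already running mongos can also be read from `/proc/<pid>/limits`, which reflects the values that process actually started with.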
Then I ran an aggregation query followed by a count, and found that both instances of shard3 had become unavailable:
- mongos> db.online.aggregate(
- ... [{ $match: {ts: 1456376400}},
- ... {
- ... $group:{
- ... _id: {msisdn: "$msisdn", rg: "$rg"},
- ... vol: { $sum: "$vol" },
- ... }
- ... },
- ... { $sort: {vol: -1}},
- ... { $limit: 10 }
- ... ],
- ... {allowDiskUse: true}
- ... )
- { "_id" : { "msisdn" : "XXXXXXXX", "rg" : 5 }, "vol" : 36825 }
- ...
- mongos> db.online.count()
- 2016-02-25T15:08:18.994+0800 E QUERY [thread1] Error: count failed: {
- "code" : 16340,
- "ok" : 0,
- "errmsg" : "No replica set monitor active and no cached seed found for set: shard3"
- } :
I checked the logs of the two shard instances and did not see any errors:
- 2016-02-25T14:16:26.039+0800 I SHARDING [LockPinger] cluster 139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018 pinged successfully at 2016-02-25T14:16:24.849+0800 by distributed lock pinger '139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018/APGW02:27021:1456298014:1883953971', sleeping for 30000ms
- 2016-02-25T14:16:56.455+0800 I SHARDING [LockPinger] cluster 139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018 pinged successfully at 2016-02-25T14:16:56.291+0800 by distributed lock pinger '139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018/APGW02:27021:1456298014:1883953971', sleeping for 30000ms
- 2016-02-25T14:17:05.690+0800 I NETWORK [conn988] end connection 139.122.10.145:27348 (58 connections now open)
- 2016-02-25T14:17:21.060+0800 I NETWORK [conn989] end connection 139.122.10.145:27353 (57 connections now open)
- 2016-02-25T14:17:26.635+0800 I SHARDING [LockPinger] cluster 139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018 pinged successfully at 2016-02-25T14:17:26.485+0800 by distributed lock pinger '139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018/APGW02:27021:1456298014:1883953971', sleeping for 30000ms
After restarting them, I queried the collection's count again from the mongos shell, and this time it brought mongos down as well:
- 2016-02-25T15:15:59.901+0800 I SHARDING [LockPinger] cluster 139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018 pinged successfully at 2016-02-25T15:15:59.775+0800 by distributed lock pinger '139.122.10.145:27018,139.122.10.146:27018,139.122.10.23:27018/APGW01:27017:1456284050:1804289383', sleeping for 30000ms
- 2016-02-25T15:16:03.636+0800 F ASIO [NetworkInterfaceASIO-TaskExecutorPool-9-0] Uncaught exception in NetworkInterfaceASIO IO worker thread of type: UnknownError Caught std::exception of type std::system_error: thread: Resource temporarily unavailable
- 2016-02-25T15:16:03.636+0800 I - [NetworkInterfaceASIO-TaskExecutorPool-9-0] Fatal Assertion 28820
- 2016-02-25T15:16:03.636+0800 I - [NetworkInterfaceASIO-TaskExecutorPool-9-0]
- ***aborting after fassert() failure
- 2016-02-25T15:16:04.199+0800 F - [NetworkInterfaceASIO-TaskExecutorPool-9-0] Got signal: 6 (Aborted).
- 0xc401d2 0xc3f119 0xc3f922 0x3553a0f4a0 0x3553632885 0x3553634065 0xbc6902 0x9e3b9d 0xe174b0 0x3553a077f1 0x35536e570d
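The fatal message in this second crash, `std::system_error: thread: Resource temporarily unavailable`, is again a failed thread creation. To see whether a process is approaching a thread limit before it aborts, its thread count can be sampled from `/proc/<pid>/status`, which has a `Threads:` field on Linux. The sketch below is only an illustration with made-up helper names. It does not show that a thread limit was the cause here.

```python
import re


def parse_thread_count(status_text):
    """Extract the 'Threads:' value from the text of /proc/<pid>/status."""
    match = re.search(r"^Threads:\s+(\d+)\s*$", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def thread_count(pid):
    """Current number of threads of a running process (Linux only)."""
    with open("/proc/%d/status" % pid) as handle:
        return parse_thread_count(handle.read())
```

Calling `thread_count(pid)` periodically for the mongos and mongod processes while the load runs, and comparing the numbers with the limits in effect, would show whether thread growth precedes the crash.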