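Update, 2010.04.26 21:52:
Another one popped up at noon today... It was only T-level (tera) this time, but that is still unacceptable.
See the figure below.
Since traffic on any single one of our servers currently never exceeds 2 Gbps, I made a temporary change to part of the code (ganglia-3.1.7):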
- diff metrics.c metrics.c.ori
- 219d218
- < //TODO add log "%index %dev RxB TxB %dev RxB TxB"
- 231,232c230
- < //debug_msg("update_ifdata(%s) - Overflow in rbi: %lu -> %lu",caller,ns->rbi,rbi);
- < err_msg("update_ifdata(%s) - Overflow in rbi: %lu -> %lu",caller,ns->rbi,rbi);
- ---
- > debug_msg("update_ifdata(%s) - Overflow in rbi: %lu -> %lu",caller,ns->rbi,rbi);
- 241,242c239
- < //debug_msg("updata_ifdata(%s) - Overflow in rpi: %lu -> %lu",caller,ns->rpi,rpi);
- < err_msg("updata_ifdata(%s) - Overflow in rpi: %lu -> %lu",caller,ns->rpi,rpi);
- ---
- > debug_msg("updata_ifdata(%s) - Overflow in rpi: %lu -> %lu",caller,ns->rpi,rpi);
- 255,256c252
- < //debug_msg("update_ifdata(%s) - Overflow in rbo: %lu -> %lu",caller,ns->rbo,rbo);
- < err_msg("update_ifdata(%s) - Overflow in rbo: %lu -> %lu",caller,ns->rbo,rbo);
- ---
- > debug_msg("update_ifdata(%s) - Overflow in rbo: %lu -> %lu",caller,ns->rbo,rbo);
- 265,266c261
- < //debug_msg("update_ifdata(%s) - Overflow in rpo: %lu -> %lu",caller,ns->rpo,rpo);
- < err_msg("update_ifdata(%s) - Overflow in rpo: %lu -> %lu",caller,ns->rpo,rpo);
- ---
- > debug_msg("update_ifdata(%s) - Overflow in rpo: %lu -> %lu",caller,ns->rpo,rpo);
- 297,298c292
- < // if ((l_bin > 1.0e13) || (l_bout > 1.0e13) ||
- < if ((l_bin > 2.5e8) || (l_bout > 2.5e8) ||
- ---
- > if ((l_bin > 1.0e13) || (l_bout > 1.0e13) ||
- 300,304c294
- < err_msg("update_ifdata(%s): %lu %g, %lu %g, %lu %g, %lu %g / %g", caller,
- < l_bytes_in, l_bin,
- < l_bytes_out, l_bout,
- < l_pkts_in, l_pin,
- < l_pkts_out, l_pout,t);
- ---
- > err_msg("update_ifdata(%s): %g %g %g %g / %g",caller,l_bin,l_bout,l_pin,l_pout,t);
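The new 2.5e8 threshold in the diff is 2 Gbps expressed in bytes per second (2e9 / 8), assuming `l_bin` and `l_bout` are per-second byte rates. Below is a minimal standalone sketch of that sanity check. It is not ganglia code: the names `gbps_to_bytes_per_sec` and `rate_is_plausible` are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Convert a link capacity in Gbps to bytes per second.
 * Hypothetical helper, not part of ganglia. */
static double gbps_to_bytes_per_sec(double gbps)
{
    assert(gbps >= 0.0);
    return gbps * 1.0e9 / 8.0;
}

/* Return true when a byte-counter delta observed over `seconds`
 * stays at or below the ceiling (in bytes per second).
 * A non-positive interval is treated as implausible. */
static bool rate_is_plausible(unsigned long delta_bytes, double seconds,
                              double max_bytes_per_sec)
{
    if (seconds <= 0.0)
        return false;
    return ((double)delta_bytes / seconds) <= max_bytes_per_sec;
}
```

With a 2 Gbps ceiling, a sample of 100 MB in one second passes, while 4 GB in one second is rejected as a bad reading.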
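My original plan was to derive the upper bound at every sampling pass from the NICs currently active on the machine (for example, with two gigabit NICs, one running at full speed and the other plugged into a 100 Mbps link, the total is 1.1 Gbps). I extracted part of the ethtool source code to test this, but found that the approach requires root privileges. Running as root is out of the question, and spawning a separate process would only add complexity to the whole solution, so I had to give up on this tempting idea.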
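For the 1.1 Gbps example, the per-NIC ceiling could be computed roughly as follows. This is only a sketch under assumptions: it does not use the ethtool ioctl tested above, but reads a sysfs-style `speed` file (such as `/sys/class/net/eth0/speed`), which is only present on kernels and drivers that expose it and was not tried in this thread. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Read a link speed in Mbps from a sysfs-style file such as
 * /sys/class/net/eth0/speed (assumed to exist and to be readable
 * without root on the target system). Returns -1 on any failure. */
static long read_link_speed_mbps(const char *path)
{
    FILE *fp = fopen(path, "r");
    long mbps = -1;

    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%ld", &mbps) != 1)
        mbps = -1;
    fclose(fp);
    return mbps;
}

/* Sum the speeds of all links reporting a positive value and convert
 * the total from Mbps to bytes per second. Entries <= 0 (link down or
 * speed unknown) are skipped. */
static double traffic_ceiling_bytes_per_sec(const long *speeds_mbps, size_t n)
{
    double total_mbps = 0.0;
    size_t i;

    assert(speeds_mbps != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (speeds_mbps[i] > 0)
            total_mbps += (double)speeds_mbps[i];
    }
    return total_mbps * 1.0e6 / 8.0;
}
```

For one NIC at 1000 Mbps and one at 100 Mbps, the ceiling is 1.1e9 / 8 = 1.375e8 bytes per second, which would replace the fixed 2.5e8 in the check.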
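Simply discarding implausible data is not a real fix; the point is to find out how the bad data is produced in the first place. I will keep watching, and when I have time tomorrow I will modify the code again so that everything gets logged.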