平台论坛博客文库

› 论坛 › 程序设计 › 内核源码 › Linux路由缓存实现浅析

12 3 4 5 / 5 页下一页

Linux路由缓存实现浅析 [复制链接]

独孤九贱

富足长乐

论坛徽章:: 0

电梯直达

楼 [收藏(0)] [报告]

发表于 2010-12-09 09:59 |只看该作者 |正序浏览

本帖最后由独孤九贱于 2010-12-09 10:32 编辑

最近闲赋在家，在家里做了一个双ADSL负载均衡的东东，不过遗憾的是，流量始终在一条线路上，本着解决问题的态度，把Linux的路由缓存子系统看了一下，现在把笔记发上来。
原来好像也发过一篇，不过是老版本内核的，本贴对应的版本是2.6.31。

不保证内容都正确，仅供讨论学习之用。

转载请注明作者和出处。

一、什么是路由缓存
路由查询IP层最重要的工作，同时，它也是一件很耗时的工作，为了提高路由查询的效率。Linux内核引用了路由缓存，用于减少对路由表的查询。呵呵，在计算机世界里，cache是无处不在的。Linux的路由缓存（下文中可能会简称为DST）是被设计来与协议无关的独立子系统。一个典型的路由缓存如下：

root@kendo-ThinkpadT410:~# route -Cn
内核 IP 路由缓存
Source Destination Gateway Flags Metric Ref Use Iface
10.1.1.199 74.125.53.102 10.1.1.254 0 0 3 eth0
10.1.1.199 219.148.35.84 10.1.1.254 0 0 0 eth0
10.1.1.199 118.123.3.237 10.1.1.254 0 0 21 eth0
61.55.167.138 10.1.1.199 10.1.1.199 l 0 0 33 lo
10.1.1.199 203.208.37.22 10.1.1.254 0 0 3 eth0
10.1.1.183 10.1.1.255 10.1.1.255 ibl 0 0 1 lo
10.1.1.199 72.14.213.101 10.1.1.254 0 0 1 eth0
10.1.1.137 10.1.1.255 10.1.1.255 ibl 0 0 0 lo
10.1.1.199 61.139.2.69 10.1.1.254 0 0 53 eth0
10.1.1.199 8.8.8.8 10.1.1.254 0 0 45 eth0
10.1.1.199 220.166.65.249 10.1.1.254 0 0 21 eth0
10.1.1.199 207.46.193.178 10.1.1.254 0 0 3 eth0
219.148.35.84 10.1.1.199 10.1.1.199 l 0 0 2 lo
10.1.1.199 72.14.203.148 10.1.1.254 0 0 0 eth0
8.8.8.8 10.1.1.199 10.1.1.199 l 0 0 22 lo
10.1.1.199 207.46.193.178 10.1.1.254 0 0 1 eth0
10.1.1.199 219.232.243.91 10.1.1.254 0 0 2 eth0
10.1.1.199 118.123.3.236 10.1.1.254 0 0 21 eth0
……

复制代码

二、路由缓存初始化
2.1 ip_rt_init
路由缓存使用hash表存储，初始化工作，最重要的就是分配hash表和表项所使用的SLAB，这个工作是在ip_rt_init中完成的：

int __init ip_rt_init(void)
{
……
/* 初始化DST SLAB分配缓存器 */
ipv4_dst_ops.kmem_cachep =
kmem_cache_create("ip_dst_cache", sizeof(struct rtable), 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
ipv4_dst_blackhole_ops.kmem_cachep = ipv4_dst_ops.kmem_cachep;
/* 根据系统内存容量，分配路由缓存hash表 */
rt_hash_table = (struct rt_hash_bucket *)
alloc_large_system_hash("IP route cache",
sizeof(struct rt_hash_bucket),
rhash_entries,
(totalram_pages >= 128 * 1024) ?
15 : 17,
0,
&rt_hash_log,
&rt_hash_mask,
rhash_entries ? 0 : 512 * 1024);
/* 初始化hash表 */
memset(rt_hash_table, 0, (rt_hash_mask + 1) * sizeof(struct rt_hash_bucket));
rt_hash_lock_init();
/* gc_thresh和ip_rt_max_size用于垃圾回收 */
ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);
ip_rt_max_size = (rt_hash_mask + 1) * 16;
……

复制代码

指针rt_hash_table指向缓存hash表，表的每一个桶是结构struct rt_hash_bucket，桶下的链表的结构是struct rtable。

/*
* Route cache.
*/
/* The locking scheme is rather straight forward:
*
* 1) Read-Copy Update protects the buckets of the central route hash.
* 2) Only writers remove entries, and they hold the lock
* as they look at rtable reference counts.
* 3) Only readers acquire references to rtable entries,
* they do so with atomic increments and with the
* lock held.
*/
struct rt_hash_bucket {
struct rtable *chain;
};

复制代码

rt_hash_bucket只有一个struct rtable结构的成员，rtable用于描于一个缓存项（准确地讲，它是包括但不仅限于）：

struct fib_nh;
struct inet_peer;
struct rtable
{
union
{
struct dst_entry dst;
} u;
/* Cache lookup keys */
struct flowi fl;
struct in_device *idev;
int rt_genid;
unsigned rt_flags;
__u16 rt_type;
__be32 rt_dst; /* Path destination */
__be32 rt_src; /* Path source */
int rt_iif;
/* Info on neighbour */
__be32 rt_gateway;
/* Miscellaneous cached information */
__be32 rt_spec_dst; /* RFC1122 specific destination */
struct inet_peer *peer; /* long-living peer info */
};

复制代码

rtable包含一些具体的协议有关的项和一个与协议无关、类型为struct dst_entry的成员。路由缓存中，最为精华的部份就是DST的单独抽像，设计者将它设计成一个无协议无关的结构，无协议无关，意味着不论是IPV4，还是V6，亦或其它网络层协议，都可以使用它。值得注意的是，dst成员被设计成union，结构dst_entry与rtable有相同的地址，同一个指针可以方便地在两者之前进行强制类型转换。整个hash表如下图所示：

2.2 hash表的分配
DST缓存的hash表的分配，是通过调用系统API alloc_large_system_hash实现的：

rt_hash_table = (struct rt_hash_bucket *)
alloc_large_system_hash("IP route cache",
sizeof(struct rt_hash_bucket),
rhash_entries,
(totalram_pages >= 128 * 1024) ?
15 : 17,
0,
&rt_hash_log,
&rt_hash_mask,
rhash_entries ? 0 : 512 * 1024);

复制代码

对照以上代码，来分析alloc_large_system_hash的实现：

/*
* allocate a large system hash table from bootmem
* - it is assumed that the hash table must contain an exact power-of-2
* quantity of entries
* - limit is the number of hash buckets, not the total allocation size
*/
void *__init alloc_large_system_hash(const char *tablename, /* hash表名称 */
unsigned long bucketsize, /* hash表的每个桶的大小 */
unsigned long numentries, /* hash表的总的元素数目 */
int scale,
int flags,
unsigned int *_hash_shift,
unsigned int *_hash_mask,
unsigned long limit)
{

复制代码

函数前三个参数很清晰，后面几个参数在代码中逐步了解，一个有意思的是，实始分配的时候，并不需要指
明hash表的桶的数目。而这个数目，对于hash表来讲，是至关重要的。

unsigned long long max = limit;
unsigned long log2qty, size;
void *table = NULL;

复制代码

如果没有手动指定hash表的大小，则根据系统内存大小自动计算hash表的元素总数。对于DST子系统而言，其值是一个
内核名命行参数rhash_entries，用户可以在内核引导时指定其大小:

/* allow the kernel cmdline to have a say */
if (!numentries) {
/* round applicable memory size up to nearest megabyte */
/**
* numentries的计算基数是nr_kernel_pages，它表示内存的dma和normal页区的实际页数
*/
numentries = nr_kernel_pages;
/**
* 这部份的计算是让numentries的值自动校正为其对应的最接近以MB字节单位的页面数的值，
* 以x86，32位的情况，一个1MB包含的页面数为1UL << 20 - PAGE_SHIFT，后面直接以256来行文了。
* 例如，如果 numentries为100，则会自动调整为256，如果为257，则会调整为512，如果为1000，则
* 会调整为1024……(假定)
*/
/**
* 这里 "+= 1MB包含的页面数" 意味着向上对齐，即如果原始是2，则会变成257(当然，通过后面的位移运算，会把它变成256)，
* 而不是变成0(向下对齐)，而-1则是一个调整阀值，对于一些边界值，如0，会保证它还是0，256还是256（而不是向上靠成512了）
*/
numentries += (1UL << (20 - PAGE_SHIFT)) - 1;
/* 右移左移移得人头晕，其实就是以256为边界对齐 */
numentries >>= 20 - PAGE_SHIFT;
numentries <<= 20 - PAGE_SHIFT;
/* limit to 1 bucket per 2^scale bytes of low memory */
/* scale只是一个当numentries为0时，计算numentries的滚动标尺 */
if (scale > PAGE_SHIFT)
numentries >>= (scale - PAGE_SHIFT);
else
numentries <<= (PAGE_SHIFT - scale);
/* Make sure we've got at least a 0-order allocation.. */
if (unlikely((numentries * bucketsize) < PAGE_SIZE))
numentries = PAGE_SIZE / bucketsize;
}
/* 变为最接近的2的幂 */
numentries = roundup_pow_of_two(numentries);

复制代码

最后一个参数limit为hash表元素的总大小，如果没有指定，则自动计算一个，默认情况下，使用总共路由的1/16（右移四位），：

/* limit allocation size to 1/16 total memory by default */
if (max == 0) {
max = ((unsigned long long)nr_all_pages << PAGE_SHIFT) >> 4;
do_div(max, bucketsize);
}

复制代码

如果numentries超限，调整它：

if (numentries > max)
numentries = max;

复制代码

对numentries对对数：ilog2 - log of base 2 of 32-bit or a 64-bit unsigned value，即
hash表的总元素是2^log2qty。可以很方便地使用1 << log2qty来表示之。

log2qty = ilog2(numentries);

复制代码

另一方面，hash表的分配，采用了三种方式，其值主要是根据参数flags和另一个全局变量hashdist来决定的：

do {
size = bucketsize << log2qty;
/**
* 这里可以看到其第5个参数的作用，如果标志位设置有HASH_EARLY,表明在启动时分配,
* 在bootmem中分配，否则使用其它方式来分配。
*/
if (flags & HASH_EARLY)
table = alloc_bootmem_nopanic(size);
else if (hashdist)
table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
else {
/*
* If bucketsize is not a power-of-two, we may free
* some pages at the end of hash table which
* alloc_pages_exact() automatically does
*/
if (get_order(size) < MAX_ORDER) {
table = alloc_pages_exact(size, GFP_ATOMIC);
kmemleak_alloc(table, size, 1, GFP_ATOMIC);
}
}
} while (!table && size > PAGE_SIZE && --log2qty);

复制代码

/* 分配失败 */

if (!table)
panic("Failed to allocate %s hash table\n", tablename);

复制代码

/* 从成功分配的信息当中，可以了解一些重要的计算参数的含义，也可以在dmesg中，对照计算 */

printk(KERN_INFO "%s hash table entries: %d (order: %d, %lu bytes)\n",
tablename,
(1U << log2qty),
ilog2(size) - PAGE_SHIFT,
size);

复制代码

/**
* 第6个参数向用户返回log2qty的值，这个值的含义前文已有分析。
*/

if (_hash_shift)
*_hash_shift = log2qty;

复制代码

/**
   * 1U << log2qty是hash表桶的大小，第7个参数*_hash_mask是向调用者返回桶的大小，即桶大小为*_hash_mask + 1
   * 之所以在做减1的调整，应该是因为C语言的数组是从0开始的。
   */

if (_hash_mask)
*_hash_mask = (1 << log2qty) - 1;
return table;
}

复制代码

hashdist的dist，意指distribution：

#define HASH_EARLY 0x00000001 /* Allocating during early boot? */
/* Only NUMA needs hash distribution. 64bit NUMA architectures have
* sufficient vmalloc space.
*/
#if defined(CONFIG_NUMA) && defined(CONFIG_64BIT)
#define HASHDIST_DEFAULT 1
#else
#define HASHDIST_DEFAULT 0
#endif
extern int hashdist; /* Distribute hashes across NUMA nodes? */
int hashdist = HASHDIST_DEFAULT;
#ifdef CONFIG_NUMA
static int __init set_hashdist(char *str)
{
if (!str)
return 0;
hashdist = simple_strtoul(str, &str, 0);
return 1;
}
__setup("hashdist=", set_hashdist);
#endif

复制代码

可见，hashdist主要是为了支持NUMA，而这个distribution，应该就是对应vmalloc的特性吧：物理上非连续。

了解了alloc_large_system_hash函数的各个参数的作用，就可以完全理解DST的hash表的分配了。

三、缓存的查询
IP层收到数据报文的时候，ip_rcv_finish会调用 ip_route_input进行路由查询工作：

static int ip_rcv_finish(struct sk_buff *skb)
{
const struct iphdr *iph = ip_hdr(skb);
struct rtable *rt;
/*
* Initialise the virtual path cache for the packet. It describes
* how the packet travels inside Linux networking.
*/
if (skb_dst(skb) == NULL) {
int err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos,
skb->dev);
if (unlikely(err)) {
if (err == -EHOSTUNREACH)
IP_INC_STATS_BH(dev_net(skb->dev),
IPSTATS_MIB_INADDRERRORS);
else if (err == -ENETUNREACH)
IP_INC_STATS_BH(dev_net(skb->dev),
IPSTATS_MIB_INNOROUTES);
goto drop;
}
}

复制代码

ip_route_input会首先尝试进行缓存的查找，如果找不到，再查询路由表，这里仅分析缓存的查找：

int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
u8 tos, struct net_device *dev)
{
struct rtable * rth;
unsigned hash;
int iif = dev->ifindex;
struct net *net;
net = dev_net(dev);
if (!rt_caching(net))
goto skip_cache;
tos &= IPTOS_RT_MASK;
hash = rt_hash(daddr, saddr, iif, rt_genid(net));
rcu_read_lock();
for (rth = rcu_dereference(rt_hash_table[hash].chain); rth;
rth = rcu_dereference(rth->u.dst.rt_next)) {
if (((rth->fl.fl4_dst ^ daddr) |
(rth->fl.fl4_src ^ saddr) |
(rth->fl.iif ^ iif) |
rth->fl.oif |
(rth->fl.fl4_tos ^ tos)) == 0 &&
rth->fl.mark == skb->mark &&
net_eq(dev_net(rth->u.dst.dev), net) &&
!rt_is_expired(rth)) {
dst_use(&rth->u.dst, jiffies);
RT_CACHE_STAT_INC(in_hit);
rcu_read_unlock();
skb_dst_set(skb, &rth->u.dst);
return 0;
}
RT_CACHE_STAT_INC(in_hlist_search);
}
rcu_read_unlock();

复制代码

ip_route_input首先调用rt_hash_code函数计算hash值，以取得在rt_hash_table中的入口，然后使用for循环，遍历hash链中的每一个桶，进行缓存的匹备，匹备的要素包括：
目的地址
来源地址
输入接口
输出接口或ToS
netfilter mark
缓存设备
缓存是否过期

如果缓存查找命中，则使用dst_use更新使用计数器和时间戳:

static inline void dst_use(struct dst_entry *dst, unsigned long time)
{
dst_hold(dst);
dst->__use++;
dst->lastuse = time;
}

复制代码

RT_CACHE_STAT_INC宏用于累加查找命中计数器，skb_dst_set设置当前skb的dst：

static inline void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
{
skb->_skb_dst = (unsigned long)dst;
}

复制代码

有一个重要的问题是，在缓存创建的时候，封装了缓存下一步的发送函数output，这里设置了skb的dst，就意味着它可以继续处理和转发了，例如：

rth->u.dst.input = ip_forward;
rth->u.dst.output = ip_output;

复制代码

一点小改变：值得注意的是，查找匹备与老版本相比较，已经有了明显的变化：

for (rth = rcu_dereference(rt_hash_table[hash].chain); rth;
rth = rcu_dereference(rth->u.rt_next)) {
if (rth->fl.fl4_dst == daddr &&
rth->fl.fl4_src == saddr &&
rth->fl.iif == iif &&
rth->fl.oif == 0 &&
#ifdef CONFIG_IP_ROUTE_FWMARK
rth->fl.fl4_fwmark == skb->nfmark &&
#endif
rth->fl.fl4_tos == tos) {

复制代码

上面代码取自2.6.12，新版本中多增加了两项比较：

net_eq(dev_net(rth->u.dst.dev), net) &&
!rt_is_expired(rth)

复制代码

因为缓存是独立于协议的，所以net_eq比较当前缓存对应的协议是否匹备，例如是否都是ipv4。rt_is_expired用于检查缓存是否过期。
另一个变化是把(XX == XX) && (YY == YY)比较，变成了(XX ^ XX) | (YY ^ YY)，这样变的理由在于：
如果A == B, 则A ^ B = 0

文库|博客

实战分享：从技术角度谈机器学习入门| 【大话IT】RadonDB低门槛向MySQL集群下战书 | ChinaUnix打赏功能已上线！ | 新一代分布式关系型数据库RadonDB知多少？

匿名者7

白手起家

论坛徽章:: 0

44楼 [报告]

发表于 2018-11-12 18:06 |只看该作者

好帖！
问个问题：假如在路由器下有多个wan口(eth0 eth1 eth 2 eth3).
那我在调用ip_route_input时会设置哪一个为dst呢？
有没有什么办法可以手动去设定具体的路由呢？

wavespider

白手起家

论坛徽章:: 0

43楼 [报告]

发表于 2014-11-26 10:29 |只看该作者

回复 1# 独孤九贱

"dst成员被设计成union，结构dst_entry与rtable有相同的地址"
struct rtable
{
union
{
struct dst_entry dst;
} u;
/* Cache lookup keys */
struct flowi fl;
.....
}

问题是，如果dst不是一个union的话，地址也一样吧？！