
Posted on 2012-12-19 11:03

Last edited by 独孤九贱 on 2012-12-20 09:33

The date 2012.12.21 is almost here. To mark this special day, here is a casual post as a memento.

Author: 独孤九贱
Please credit the source when reposting.

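1. Overview
Compared with rwlock, RCU offers better concurrency when writers are rare and many readers run concurrently; the principles and comparative measurements are not repeated here. The kernel also provides an RCU-protected linked list, rculist, which is in fact one of the most important applications of RCU. Its extension rculist_nulls, although introduced quite a while ago, is still little known because few kernel subsystems use it. As it happens, connection tracking in the network stack's Netfilter is one of them. The kernel describes the interface as follows: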

Using hlist_nulls to protect read-mostly linked lists and objects using SLAB_DESTROY_BY_RCU allocations.
Please read the basics in Documentation/RCU/listRCU.txt


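This article combines Netfilter's conntrack table with the kernel documentation on rculist_nulls and analyzes its usage in detail from several angles: rculist_nulls itself, spinlocks, the reference counter, and timer synchronization. Reading it therefore calls for two files at hand: nf_conntrack_core.c and Documentation/RCU/rculist_nulls.txt.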

2. Data structures

struct nf_conn {
	/* Usage count in here is 1 for hash table/destruct timer, 1 per skb,
	   plus 1 for any connection(s) we are `master' for */
	struct nf_conntrack ct_general;
	spinlock_t lock;
	/* XXX should I move this to the tail ? - Y.K */
	/* These are my tuples; original and reply */
	struct nf_conntrack_tuple_hash tuplehash[IP_CT_DIR_MAX];
	……
};

ct_general contains the reference counter:

#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
struct nf_conntrack {
	atomic_t use;
};
#endif

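lock is the per-entry spinlock that provides mutual exclusion for the whole structure.

Each of the ct's two direction tuples contains an hlist_nulls node member:

/* Connections have two entries in the hash table: one for each way */
struct nf_conntrack_tuple_hash {
	struct hlist_nulls_node hnnode;
	……
};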

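In addition, the whole conntrack table has one global lock:

spinlock_t nf_conntrack_lock;

It is mainly used when adding to, modifying, or traversing the table (as opposed to looking entries up).

conntrack uses the structure members and variables above to implement the concurrency control it needs.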

3. Node allocation and initialization

Creation of the slab cache; note the use of SLAB_DESTROY_BY_RCU:

static int nf_conntrack_init_net(struct net *net)
{
	/* initialize the total node counter */
	atomic_set(&net->ct.count, 0);
	/* create the slab cache */
	net->ct.nf_conntrack_cachep = kmem_cache_create(net->ct.slabname,
							sizeof(struct nf_conn), 0,
							SLAB_DESTROY_BY_RCU, NULL);
	……
}

The init_conntrack path allocates and initializes a node:

/* allocate from the slab cache */
ct = kmem_cache_alloc(net->ct.nf_conntrack_cachep, gfp);
/* initialize the lock, the reference counter, and other members */
spin_lock_init(&ct->lock);
ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode.pprev = NULL;
/*
 * changes to lookup keys must be done before setting refcnt to 1
 */
smp_wmb();
atomic_set(&ct->ct_general.use, 1);

This matches what rculist_nulls.txt describes. The document also covers node insertion at the same point; conntrack's own insertion is analyzed later in this article:

2) Insert function :
--------------------
/*
 * Please note that new inserts are done at the head of list,
 * not in the middle or end.
 */
obj = kmem_cache_alloc(cachep);
lock_chain(); // typically a spin_lock()
obj->key = key;
/*
 * changes to obj->key must be visible before refcnt one
 */
smp_wmb();
atomic_set(&obj->refcnt, 1);
/*
 * insert obj in RCU way (readers might be traversing chain)
 */
hlist_nulls_add_head_rcu(&obj->obj_node, list);
unlock_chain(); // typically a spin_unlock()
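The ordering rule in the insert function above (the key must be visible before the reference count becomes nonzero) can be modelled in user space with C11 atomics. This is only a sketch under stated assumptions, not kernel code: `struct obj`, `obj_init`, and `obj_key_if_live` are hypothetical names, and a release store stands in for the `smp_wmb()` plus `atomic_set()` pair.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an object from a SLAB_DESTROY_BY_RCU cache. */
struct obj {
	int key;           /* lookup key, written before publication */
	atomic_int refcnt; /* 0 means "free or not yet initialized" */
};

/* Writer side: fill in the key first, then make the object live. */
static void obj_init(struct obj *o, int key)
{
	assert(o != NULL);
	o->key = key;
	/* Release ordering: the key store above is visible before refcnt == 1. */
	atomic_store_explicit(&o->refcnt, 1, memory_order_release);
}

/* Reader side: trust the key only after observing a nonzero refcount. */
static bool obj_key_if_live(struct obj *o, int *key_out)
{
	if (atomic_load_explicit(&o->refcnt, memory_order_acquire) == 0)
		return false;
	*key_out = o->key;
	return true;
}
```

A reader that sees a zero count simply treats the slot as free; that is why the writer sets the count last.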

4. Node insertion
__nf_conntrack_confirm performs the insertion:

/* Confirm a connection given skb; places it in hash table */
int
__nf_conntrack_confirm(struct sk_buff *skb)
{
	……
	/* take the global lock */
	spin_lock_bh(&nf_conntrack_lock);
	/* take a reference */
	atomic_inc(&ct->ct_general.use);
	/* Since the lookup is lockless, hash insertion must be done after
	 * starting the timer and setting the CONFIRMED bit. The RCU barriers
	 * guarantee that no other CPU can find the conntrack before the above
	 * stores are visible.
	 */
	__nf_conntrack_hash_insert(ct, hash, repl_hash);
	spin_unlock_bh(&nf_conntrack_lock);
	……
}

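The reference counter is raised by one both at allocation and at insertion. This follows from Netfilter's semantics. Setting it to 1 at allocation serves the same purpose as incrementing it on a successful lookup: in both cases the reference belongs to the skb that is about to use the conntrack. The increment here, by contrast, is taken on behalf of the hash table: it records that the hash table is now using the entry.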

/* the actual insertion function */
static void __nf_conntrack_hash_insert(struct nf_conn *ct,
				       unsigned int hash,
				       unsigned int repl_hash)
{
	struct net *net = nf_ct_net(ct);

	hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
				 &net->ct.hash[hash]);
	hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode,
				 &net->ct.hash[repl_hash]);
}

As you can see, the allocation and insertion here closely mirror the code in rculist_nulls.txt.

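5. Lookup
Lookup is the most important use of rculist_nulls; everything so far has been preparation for it. ____nf_conntrack_find searches the hash table and returns the matching node. Its structure is the standard hlist_nulls lookup, but the function itself does not take the RCU read lock: the author split the lookup in two and placed the locking in the calling function: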

/*
 * Warning :
 * - Caller must take a reference on returned object
 *   and recheck nf_ct_tuple_equal(tuple, &h->tuple)
 * OR
 * - Caller must lock nf_conntrack_lock before calling this function
 */
static struct nf_conntrack_tuple_hash *
____nf_conntrack_find(struct net *net, u16 zone,
		      const struct nf_conntrack_tuple *tuple, u32 hash)
{
	struct nf_conntrack_tuple_hash *h;
	struct hlist_nulls_node *n;
	unsigned int bucket = hash_bucket(hash, net);

	/* Disable BHs the entire time since we normally need to disable them
	 * at least once for the stats anyway.
	 */
	local_bh_disable();
begin:
	hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[bucket], hnnode) {
		if (nf_ct_tuple_equal(tuple, &h->tuple) &&
		    nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)) == zone) {
			NF_CT_STAT_INC(net, found);
			local_bh_enable();
			return h;
		}
		NF_CT_STAT_INC(net, searched);
	}
	/*
	 * if the nulls value we got at the end of this lookup is
	 * not the expected one, we must restart lookup.
	 * We probably met an item that was moved to another chain.
	 */
	if (get_nulls_value(n) != bucket) {
		NF_CT_STAT_INC(net, search_restart);
		goto begin;
	}
	local_bh_enable();
	return NULL;
}
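The restart check at the end of the lookup works because each chain ends in a "nulls" marker that encodes the bucket number instead of a plain NULL. The user-space sketch below illustrates that idea, modelled on the kernel's scheme of setting the low bit of the end-of-list pointer and storing the value in the remaining bits. `struct node`, `nulls_marker`, `is_nulls`, `nulls_value`, and `chain_end_bucket` are hypothetical names chosen for illustration, not the kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct node {
	struct node *next; /* a real node, or an encoded nulls marker */
	int key;
};

/* Encode a bucket number as an odd "pointer"; real nodes are aligned (even). */
static struct node *nulls_marker(uintptr_t bucket)
{
	return (struct node *)((uintptr_t)1 | (bucket << 1));
}

static bool is_nulls(const struct node *p)
{
	return ((uintptr_t)p & 1) != 0;
}

static uintptr_t nulls_value(const struct node *p)
{
	return (uintptr_t)p >> 1;
}

/*
 * Walk a chain to its nulls marker and return the bucket number stored there.
 * A lockless reader compares this with the bucket it started in; a mismatch
 * means it drifted onto another chain and must restart.
 */
static uintptr_t chain_end_bucket(const struct node *first)
{
	const struct node *p = first;

	while (!is_nulls(p))
		p = p->next;
	return nulls_value(p);
}
```

If a node is recycled into another bucket while a reader is standing on it, the reader finishes on that other bucket's marker, which is exactly the condition `get_nulls_value(n) != bucket` detects.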

The calling function invokes this lookup under RCU read-side protection and, on a hit, increments the reference counter:

/* Find a connection corresponding to a tuple. */
static struct nf_conntrack_tuple_hash *
__nf_conntrack_find_get(struct net *net, u16 zone,
			const struct nf_conntrack_tuple *tuple, u32 hash)
{
	struct nf_conntrack_tuple_hash *h;
	struct nf_conn *ct;

	rcu_read_lock();
begin:
	h = ____nf_conntrack_find(net, zone, tuple, hash);
	if (h) {
		ct = nf_ct_tuplehash_to_ctrack(h);
		if (unlikely(nf_ct_is_dying(ct) ||
			     !atomic_inc_not_zero(&ct->ct_general.use)))
			h = NULL;
		else {
			if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
				     nf_ct_zone(ct) != zone)) {
				nf_ct_put(ct);
				goto begin;
			}
		}
	}
	rcu_read_unlock();
	return h;
}

This too matches the description in rculist_nulls.txt; the only difference is that the implementation is split across two functions:

1) lookup algo
head = &table[slot];
rcu_read_lock();
begin:
hlist_nulls_for_each_entry_rcu(obj, node, head, member) {
	if (obj->key == key) {
		if (!try_get_ref(obj)) // might fail for free objects
			goto begin;
		if (obj->key != key) { // not the object we expected
			put_ref(obj);
			goto begin;
		}
		goto out;
	}
}
/*
 * if the nulls value we got at the end of this lookup is
 * not the expected one, we must restart lookup.
 * We probably met an item that was moved to another chain.
 */
if (get_nulls_value(node) != slot)
	goto begin;
obj = NULL;
out:
rcu_read_unlock();
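The "take a reference, then re-check the key" step in the lookup algorithm can be sketched in user space with C11 atomics. This is an illustrative model only: `struct obj`, `try_get_ref`, `put_ref`, and `pin_and_recheck` are hypothetical names, `try_get_ref` plays the role the kernel's `atomic_inc_not_zero()` plays in conntrack, and `put_ref` here only decrements (a real put would free the object on the last reference).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct obj {
	int key;
	atomic_int refcnt; /* 0 means the object has been freed */
};

/* Take a reference unless the count has already dropped to zero. */
static bool try_get_ref(struct obj *o)
{
	int old = atomic_load(&o->refcnt);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Simplified put: decrement only, freeing is not modelled here. */
static void put_ref(struct obj *o)
{
	atomic_fetch_sub(&o->refcnt, 1);
}

/*
 * Validate a candidate found by a lockless walk: pin it first, then check the
 * key again, because the slot may have been recycled for a different key
 * between the first comparison and the pin.
 * Returns the pinned object, or NULL if the candidate cannot be used.
 */
static struct obj *pin_and_recheck(struct obj *cand, int key)
{
	if (!try_get_ref(cand))
		return NULL;
	if (cand->key != key) {
		put_ref(cand);
		return NULL;
	}
	return cand;
}
```

The second key comparison is what makes reuse under SLAB_DESTROY_BY_RCU safe: a reference on a recycled object is dropped again instead of being handed to the caller.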

6. Update
When the members of a node returned by a lookup need to be modified, the per-node lock must be taken, as shown here:

static int tcp_packet(struct nf_conn *ct,
		      const struct sk_buff *skb,
		      unsigned int dataoff,
		      enum ip_conntrack_info ctinfo,
		      u_int8_t pf,
		      unsigned int hooknum)
{
	spin_lock_bh(&ct->lock);
	/* various operations on ct members */
	……
	spin_unlock_bh(&ct->lock);
}

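The key point to understand here is that the lookup needs no additional spinlock protection; rcu_read_lock alone is enough. The reason lies in how RCU works and is implemented. Since this article focuses on the usage of the rculist_nulls interface, RCU itself is not discussed further.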

7. Deletion
7.1 Release on timeout

When a node's timer expires, death_by_timeout is called:

static void death_by_timeout(unsigned long ul_conntrack)
{
	……
	nf_ct_delete_from_lists(ct);
	nf_ct_put(ct);
}

nf_ct_delete_from_lists removes the node from the lists:

void nf_ct_delete_from_lists(struct nf_conn *ct)
{
	……
	/* global lock */
	spin_lock_bh(&nf_conntrack_lock);
	/* Inside lock so preempt is disabled on module removal path.
	 * Otherwise we can get spurious warnings. */
	……
	clean_from_lists(ct);
	spin_unlock_bh(&nf_conntrack_lock);
}

Adding to and deleting from a list must be protected by a spinlock for that list; here it is Netfilter's global lock, nf_conntrack_lock.

static void
clean_from_lists(struct nf_conn *ct)
{
	pr_debug("clean_from_lists(%p)\n", ct);
	hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
	hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode);
	/* Destroy all pending expectations */
	nf_ct_remove_expectations(ct);
}

nf_ct_put drops a reference and, when that was the last one, frees the node back to the slab cache:

/* decrement reference count on a conntrack */
static inline void nf_ct_put(struct nf_conn *ct)
{
	NF_CT_ASSERT(ct);
	nf_conntrack_put(&ct->ct_general);
}

static inline void nf_conntrack_put(struct nf_conntrack *nfct)
{
	if (nfct && atomic_dec_and_test(&nfct->use))
		nf_conntrack_destroy(nfct);
}

nf_conntrack_destroy in turn calls destroy_conntrack -> nf_conntrack_free, which finally returns the object to the slab cache:

void nf_conntrack_free(struct nf_conn *ct)
{
	……
	/* decrement the node counter */
	atomic_dec(&net->ct.count);
	/* free the object back to the slab cache */
	kmem_cache_free(net->ct.nf_conntrack_cachep, ct);
}

This code is a good illustration of how the global lock and the per-node reference counter work together to carry out the whole release.
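The division of labour described here (unlinking is one step, freeing belongs to whoever drops the last reference) can be sketched with a small user-space model. All names below (`struct entry`, `entry_put`, `entry_timeout`) are hypothetical, the table lock is not modelled, and the `freed` flag merely marks the point where the real code would call `kmem_cache_free()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct entry {
	atomic_int use; /* one reference per holder */
	bool hashed;    /* still linked in the (imaginary) hash table */
	bool freed;     /* set where real code would free the object */
};

/* Drop one reference; the holder that drops the last one frees the entry. */
static bool entry_put(struct entry *e)
{
	assert(e != NULL);
	if (atomic_fetch_sub(&e->use, 1) == 1) {
		e->freed = true;
		return true;
	}
	return false;
}

/* Timeout path: unlink first (under the table lock in real code), then put. */
static bool entry_timeout(struct entry *e)
{
	e->hashed = false;
	return entry_put(e);
}
```

With two holders, for example the hash table and one packet, the entry is freed by whichever of `entry_timeout()` and `entry_put()` happens to run second; neither path needs to know about the other.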

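7.2 Release by users
After a successful conntrack lookup, the pointer to the ct node that was found is assigned to the skb:

skb->nfct = &ct->ct_general; /* in resolve_normal_ct() */

When the skb is freed, the nf_conntrack_put function shown above is called to drop that reference: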

static void skb_release_head_state(struct sk_buff *skb)
{
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
	nf_conntrack_put(skb->nfct);
#endif
}

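What is interesting here is that freeing the skb only drops a reference, which may free the object back to the slab cache, and never touches the lists. There are two scenarios:
a. The timeout function has not fired: this path merely increments and later decrements the reference counter, nothing more.
b. The timeout function has been called: the node has already been removed from the hash table and its reference count decremented, and the actual freeing is left to this path.

Likewise, this matches the description in the documentation: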

3) Remove algo
--------------
Nothing special here, we can use a standard RCU hlist deletion.
But thanks to SLAB_DESTROY_BY_RCU, beware a deleted object can be reused
very very fast (before the end of RCU grace period)

if (put_last_reference_on(obj)) {
	lock_chain(); // typically a spin_lock()
	hlist_del_init_rcu(&obj->obj_node);
	unlock_chain(); // typically a spin_unlock()
	kmem_cache_free(cachep, obj);
}

8. Flushing everything

Now let's look at how all nodes in the hash table are flushed:

static void nf_conntrack_cleanup_net(struct net *net)
{
i_see_dead_people:
	nf_ct_iterate_cleanup(net, kill_all, NULL);
	/*
	 * ct.count counts all nodes in the hash table. Loop until every node
	 * has been released, because during module removal Netfilter hook
	 * functions or network softirq handlers may still hold conntrack nodes.
	 */
	if (atomic_read(&net->ct.count) != 0) {
		schedule();
		goto i_see_dead_people;
	}
	……
	/* destroy the slab cache */
	kmem_cache_destroy(net->ct.nf_conntrack_cachep);
	……
}

nf_ct_iterate_cleanup walks the whole hash table and releases the nodes:

void nf_ct_iterate_cleanup(struct net *net,
			   int (*iter)(struct nf_conn *i, void *data),
			   void *data)
{
	struct nf_conn *ct;
	unsigned int bucket = 0;

	while ((ct = get_next_corpse(net, iter, data, &bucket)) != NULL) {
		/* Time to push up daises... */
		if (del_timer(&ct->timeout))
			death_by_timeout((unsigned long)ct);
		/* ... else the timer will get him soon. */
		nf_ct_put(ct);
	}
}

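What deserves attention here is the deletion of the timer, that is, the call to del_timer:
1. If the timer has already fired, del_timer simply returns 0 and only nf_ct_put is executed. This means the nf_ct_put that follows may run in parallel with the timeout handler death_by_timeout executing on another CPU. Mutual exclusion between the two is evidently provided by the reference counter, so there is no need to call del_timer_sync and wait for the handler to finish.

2. Otherwise the timer has not yet fired: del_timer removes it, and death_by_timeout is called directly to perform the deletion.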
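The two cases above amount to a "whoever claims the timer runs the death handler" rule. The single-threaded user-space sketch below models that rule under loud assumptions: `model_del_timer` imitates only the return-value convention of `del_timer` (nonzero when it deactivated a pending timer), the timer core is reduced to one atomic flag, and every name here is hypothetical rather than kernel API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct entry {
	atomic_bool timer_pending; /* true while the timer is armed, not fired */
	atomic_int use;            /* reference count */
	int deaths;                /* how many times the death handler ran */
};

/* Model of del_timer(): true only for the caller that disarms the timer. */
static bool model_del_timer(struct entry *e)
{
	return atomic_exchange(&e->timer_pending, false);
}

static void entry_put(struct entry *e)
{
	assert(e != NULL);
	atomic_fetch_sub(&e->use, 1);
}

/* Death handler: unlinking is not modelled; drop the table/timer reference. */
static void model_death(struct entry *e)
{
	e->deaths++;
	entry_put(e);
}

/* Model of the timer firing: claim the timer, then run the handler. */
static void model_timer_fire(struct entry *e)
{
	if (atomic_exchange(&e->timer_pending, false))
		model_death(e);
}

/* Cleanup of one entry on which the iterator already holds a reference. */
static void model_cleanup_one(struct entry *e)
{
	if (model_del_timer(e))
		model_death(e); /* timer was still pending: kill it ourselves */
	/* else: the timer has fired and its handler takes care of the kill */
	entry_put(e);           /* drop the iterator's own reference */
}
```

In both orders the death handler runs exactly once and the count ends at zero, which is why the cleanup loop can get by without waiting for a running handler.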

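While walking the hash table, get_next_corpse increments the reference counter of each node it finds, so the loop body ends by calling nf_ct_put(ct) to release it, rather than: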