Posted on 2009-01-06 18:47

------------------------------------------
This article is original to this site; reposting is welcome.
When reposting, please credit the source: http://ericxiao.cublog.cn/
------------------------------------------

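1: Preface
The cgroup framework was analyzed earlier; this article turns to the cpuset subsystem. A cpuset lets user space bind tasks to CPUs and to memory nodes by operating on the cgroup filesystem. For a detailed description of cpusets, see linux-2.6.28-rc7/Documentation/cpusets.txt. This article analyzes cpuset in detail from the source code. The analysis below is based on linux-2.6.28.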

2: cpuset data structures
Every cpuset is represented by a struct cpuset, shown below:
struct cpuset {
	/* used to convert from a cgroup to its cpuset */
	struct cgroup_subsys_state css;
	/* cpuset flags */
	unsigned long flags;		/* "unsigned long" so bitops work */
	/* CPUs bound to this cpuset */
	cpumask_t cpus_allowed;		/* CPUs allowed to tasks in cpuset */
	/* memory nodes bound to this cpuset */
	nodemask_t mems_allowed;	/* Memory Nodes allowed to tasks */
	/* parent of this cpuset */
	struct cpuset *parent;		/* my parent */

	/*
	 * Copy of global cpuset_mems_generation as of the most
	 * recent time this cpuset changed its mems_allowed.
	 */
	/*
	 * A copy of the current cpuset_mems_generation. Every update of
	 * mems_allowed increments cpuset_mems_generation by 1.
	 */
	int mems_generation;
	/* used for memory_pressure */
	struct fmeter fmeter;		/* memory_pressure filter */

	/* partition number for rebuild_sched_domains() */
	/* partition number of the matching sched domain */
	int pn;

	/* for custom sched domain */
	/* related to the sched domain */
	int relax_domain_level;

	/* used for walking a cpuset heirarchy */
	/* used to walk all cpusets */
	struct list_head stack_list;
};
There is no need to dig into each member now; they are explained as the code analysis reaches them. What matters here is that struct cpuset embeds a struct cgroup_subsys_state css, so the address of the struct cpuset can be derived from the address of its css member. The kernel therefore converts from a cgroup to a cpuset as follows:
static inline struct cpuset *cgroup_cs(struct cgroup *cont)
{
	return container_of(cgroup_subsys_state(cont, cpuset_subsys_id),
			    struct cpuset, css);
}
cgroup_subsys_state() is:
static inline struct cgroup_subsys_state *cgroup_subsys_state(
	struct cgroup *cgrp, int subsys_id)
{
	return cgrp->subsys[subsys_id];
}
That is, it fetches the subsystem's cgroup_subsys_state from the cgroup, and the container_of macro then recovers the cpuset from the member's address offset.
The kernel also has this function:
static inline struct cpuset *task_cs(struct task_struct *task)
{
	return container_of(task_subsys_state(task, cpuset_subsys_id),
			    struct cpuset, css);
}
Likewise, it obtains the cgroup_subsys_state through struct task_struct->cgroups and then recovers the cpuset.
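The container_of step can be tried in user space. The following is only a minimal sketch: the structure names mini_cpuset and subsys_state and the helper state_to_cpuset() are made up for illustration, and the macro is a simplified stand-in for the kernel's container_of (no type checking).

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-in for the kernel's container_of(): subtract the
 * member's offset from the member's address to get the enclosing object. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical miniature versions of the kernel structures. */
struct subsys_state { int refcnt; };

struct mini_cpuset {
	unsigned long flags;
	struct subsys_state css;	/* embedded, not a pointer */
	int mems_generation;
};

/* Same shape as cgroup_cs(): state pointer in, cpuset pointer out. */
static struct mini_cpuset *state_to_cpuset(struct subsys_state *css)
{
	assert(css != NULL);
	return container_of(css, struct mini_cpuset, css);
}
```

Because css is embedded rather than pointed to, no back pointer has to be stored: the offset is known at compile time.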

3: cpuset initialization
cpuset initialization has three parts, as shown below:
asmlinkage void __init start_kernel(void)
{
	……
	……
	cpuset_init_early();
	……
	cpuset_init();
	……
}
start_kernel() -> kernel_init() -> cpuset_init_smp()
These initialization functions are analyzed in turn below.

3.1: cpuset_init_early()
The code is:
int __init cpuset_init_early(void)
{
	top_cpuset.mems_generation = cpuset_mems_generation++;
	return 0;
}
This function is very simple: it initializes top_cpuset.mems_generation. Here we meet the global variable cpuset_mems_generation mentioned in the data-structure section. It is defined as follows:
/*
 * Increment this integer everytime any cpuset changes its
 * mems_allowed value.  Users of cpusets can track this generation
 * number, and avoid having to lock and reload mems_allowed unless
 * the cpuset they're using changes generation.
 *
 * A single, global generation is needed because cpuset_attach_task() could
 * reattach a task to a different cpuset, which must not have its
 * generation numbers aliased with those of that tasks previous cpuset.
 *
 * Generations are needed for mems_allowed because one task cannot
 * modify another's memory placement.  So we must enable every task,
 * on every visit to __alloc_pages(), to efficiently check whether
 * its current->cpuset->mems_allowed has changed, requiring an update
 * of its current->mems_allowed.
 *
 * Since writes to cpuset_mems_generation are guarded by the cgroup lock
 * there is no need to mark it atomic.
 */
static int cpuset_mems_generation;
The comment is thorough. In short, the global cpuset_mems_generation serves as a comparison stamp: it is incremented every time any cpuset's mems_allowed changes. When a task is associated with a cpuset, task->cpuset_mems_generation is set to that cpuset's cs->mems_generation. Every time a cpuset's mems_allowed changes, cs->mems_generation is set to the current value of cpuset_mems_generation. When the task allocates memory, it compares task->cpuset_mems_generation with cs->mems_generation. If they differ, the cpuset's mems_allowed has changed, so the task's mems_allowed must be updated before the allocation proceeds. For example:
alloc_pages() -> alloc_pages_current() -> cpuset_update_task_memory_state(). Let us follow cpuset_update_task_memory_state(). The code is:
void cpuset_update_task_memory_state(void)
{
	int my_cpusets_mem_gen;
	struct task_struct *tsk = current;
	struct cpuset *cs;

	/*
	 * Get the task's cpuset and read the mems_generation to compare.
	 * Note the difference between top_cpuset and other cpusets:
	 * top_cpuset is a static structure that is never freed, so it
	 * can be read safely at any time without holding the RCU lock.
	 */
	if (task_cs(tsk) == &top_cpuset) {
		/* Don't need rcu for top_cpuset.  It's never freed. */
		my_cpusets_mem_gen = top_cpuset.mems_generation;
	} else {
		rcu_read_lock();
		my_cpusets_mem_gen = task_cs(tsk)->mems_generation;
		rcu_read_unlock();
	}

	/*
	 * If the cpuset's mems_generation differs from the task's
	 * cpuset_mems_generation, the cpuset's mems_allowed has changed,
	 * so the task's mems_allowed must be updated.
	 */
	if (my_cpusets_mem_gen != tsk->cpuset_mems_generation) {
		mutex_lock(&callback_mutex);
		task_lock(tsk);
		cs = task_cs(tsk); /* Maybe changed when task not locked */
		/* update the task's mems_allowed */
		guarantee_online_mems(cs, &tsk->mems_allowed);
		/* update the task's cpuset_mems_generation */
		tsk->cpuset_mems_generation = cs->mems_generation;
		/* PF_SPREAD_PAGE and PF_SPREAD_SLAB */
		if (is_spread_page(cs))
			tsk->flags |= PF_SPREAD_PAGE;
		else
			tsk->flags &= ~PF_SPREAD_PAGE;
		if (is_spread_slab(cs))
			tsk->flags |= PF_SPREAD_SLAB;
		else
			tsk->flags &= ~PF_SPREAD_SLAB;
		task_unlock(tsk);
		mutex_unlock(&callback_mutex);
		/* rebind the task's memory policy to the allowed nodes */
		mpol_rebind_task(tsk, &tsk->mems_allowed);
	}
}
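This function checks, at memory-allocation time, whether the mems_allowed of the task's cpuset has changed. If it has, the task's related fields are updated, and finally the task is rebound to the allowed memory nodes.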
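The generation comparison can be sketched in a few lines of user-space C. This is only an illustration of the idea: mini_cpuset, mini_task, set_mems() and refresh_if_stale() are hypothetical names, masks are plain unsigned longs, and all locking is left out.

```c
#include <assert.h>

/* Hypothetical miniature types; only the fields needed for the idea. */
struct mini_cpuset {
	unsigned long mems_allowed;	/* one bit per memory node */
	int mems_generation;
};

struct mini_task {
	struct mini_cpuset *cs;
	unsigned long mems_allowed;	/* cached copy */
	int cpuset_mems_generation;	/* generation of the cached copy */
};

static int global_generation;	/* plays the role of cpuset_mems_generation */

/* Writer side: change the mask and stamp the cpuset with a fresh generation. */
static void set_mems(struct mini_cpuset *cs, unsigned long mask)
{
	cs->mems_allowed = mask;
	cs->mems_generation = global_generation++;
}

/* Reader side: returns 1 if the cached mask had to be refreshed, else 0. */
static int refresh_if_stale(struct mini_task *t)
{
	assert(t && t->cs);
	if (t->cpuset_mems_generation == t->cs->mems_generation)
		return 0;
	t->mems_allowed = t->cs->mems_allowed;
	t->cpuset_mems_generation = t->cs->mems_generation;
	return 1;
}
```

The point of the design is that the common case costs one integer comparison; the expensive refresh runs only after a writer has actually changed the mask.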
Here we meet two cpuset flags: CS_SPREAD_PAGE, tested by is_spread_page(), and CS_SPREAD_SLAB, tested by is_spread_slab(). What do they mean? As the code shows, they map to the task flags PF_SPREAD_PAGE and PF_SPREAD_SLAB. Their effect is that allocations for the page cache, or for slab objects such as inodes, are spread evenly over the memory nodes the task is allowed to use. For example:
__page_cache_alloc() -> cpuset_mem_spread_node():
int cpuset_mem_spread_node(void)
{
	int node;

	node = next_node(current->cpuset_mem_spread_rotor, current->mems_allowed);
	if (node == MAX_NUMNODES)
		node = first_node(current->mems_allowed);
	current->cpuset_mem_spread_rotor = node;
	return node;
}
See how the node is chosen? current->cpuset_mem_spread_rotor is the memory node used by the previous page cache allocation, so the task's allowed nodes are used in rotation.
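The rotation can be reproduced with a plain bitmask. The sketch below is an assumption-laden miniature: MAX_NODES, next_node_after() and spread_node() are invented names, and an unsigned int replaces nodemask_t.

```c
#include <assert.h>

#define MAX_NODES 8	/* hypothetical small node count for the sketch */

/* Next set bit strictly after 'node', or MAX_NODES if there is none. */
static int next_node_after(int node, unsigned mask)
{
	for (int n = node + 1; n < MAX_NODES; n++)
		if (mask & (1u << n))
			return n;
	return MAX_NODES;
}

/* Advance the rotor round-robin over the allowed nodes and return the pick. */
static int spread_node(int *rotor, unsigned allowed)
{
	assert(allowed != 0);
	int node = next_node_after(*rotor, allowed);
	if (node == MAX_NODES)			/* ran off the end: wrap around */
		node = next_node_after(-1, allowed);
	*rotor = node;
	return node;
}
```

With nodes 0, 2 and 3 allowed and the rotor starting at 0, successive calls pick 2, 3, 0, 2 and so on.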
Back in cpuset_update_task_memory_state(), let us look at the helper functions it calls.
guarantee_online_mems() updates the task's mems_allowed. The code is:
static void guarantee_online_mems(const struct cpuset *cs, nodemask_t *pmask)
{
	while (cs && !nodes_intersects(cs->mems_allowed,
					node_states[N_HIGH_MEMORY]))
		cs = cs->parent;
	if (cs)
		nodes_and(*pmask, cs->mems_allowed,
					node_states[N_HIGH_MEMORY]);
	else
		*pmask = node_states[N_HIGH_MEMORY];
	BUG_ON(!nodes_intersects(*pmask, node_states[N_HIGH_MEMORY]));
}
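In the kernel, all online memory nodes are recorded in node_states[N_HIGH_MEMORY]. This function finds the allowed memory nodes that are online, walking up to the parent cpusets when the cpuset itself has none. What is an "online" memory node? Think of hot-plugging: memory in a server can likewise be added and removed at run time.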
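The walk up the hierarchy is easy to model with bitmasks. This is a sketch under stated assumptions: mini_cpuset and online_mems() are hypothetical, an unsigned int stands in for nodemask_t, and the caller guarantees that at least one node is online.

```c
#include <assert.h>
#include <stddef.h>

struct mini_cpuset {
	unsigned mems_allowed;
	struct mini_cpuset *parent;	/* NULL for the root */
};

/* Walk towards the root until some allowed node is online; return the
 * intersection, or every online node if no ancestor qualifies. */
static unsigned online_mems(const struct mini_cpuset *cs, unsigned online)
{
	assert(online != 0);
	while (cs && !(cs->mems_allowed & online))
		cs = cs->parent;
	return cs ? (cs->mems_allowed & online) : online;
}
```

The fallback to the full online mask mirrors the else branch of guarantee_online_mems(): a task must always end up with at least one usable node.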

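Another important helper is mpol_rebind_task(), which rebinds the task to the allowed memory nodes, that is, it remaps the old node values onto the new nodes. This belongs to mempolicy and is not covered in detail here; the code is simple enough to trace on your own.
Now that the role of the global cpuset_mems_generation is clear, let us look at the next initialization function.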

3.2: cpuset_init()
cpuset_init() is:
int __init cpuset_init(void)
{
	int err = 0;

	/*
	 * Initialize top_cpuset's cpus_allowed and mems_allowed
	 * to all CPUs and all memory nodes in the system.
	 */
	cpus_setall(top_cpuset.cpus_allowed);
	nodes_setall(top_cpuset.mems_allowed);

	/* initialize top_cpuset.fmeter */
	fmeter_init(&top_cpuset.fmeter);

	/*
	 * top_cpuset.mems_allowed has changed,
	 * so cpuset_mems_generation must be updated.
	 */
	top_cpuset.mems_generation = cpuset_mems_generation++;
	/* set CS_SCHED_LOAD_BALANCE on top_cpuset */
	set_bit(CS_SCHED_LOAD_BALANCE, &top_cpuset.flags);
	/* set top_cpuset.relax_domain_level */
	top_cpuset.relax_domain_level = -1;

	/* register the cpuset filesystem */
	err = register_filesystem(&cpuset_fs_type);
	if (err < 0)
		return err;
	/* count of cpusets */
	number_of_cpusets = 1;
	return 0;
}
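This function mainly initializes the top-level cpuset. Several more flags appear here; they are explained one by one:
CS_SCHED_LOAD_BALANCE:
The load-balancing flag for the CPUs in a cpuset. If it is set, the scheduler balances load across the CPUs of that cpuset.
relax_domain_level:
A sched-domain attribute that controls how far the search for an idle CPU extends during load balancing, including across NUMA nodes. It takes the following values:
  -1 : no request. use system default or follow request of others.
   0 : no search.
   1 : search siblings (hyperthreads in a core).
   2 : search cores in a package.
   3 : search cpus in a node [= system wide on non-NUMA system]
 ( 4 : search nodes in a chunk of node [on NUMA system] )
 ( 5 : search system wide [on NUMA system] )

fmeter also appears in this function; it is analyzed later, when we reach it.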
In addition, cpuset has a filesystem type of its own, kept for compatibility with the cpuset interface that predates cgroup. Let us trace it:
static struct file_system_type cpuset_fs_type = {
	.name = "cpuset",
	.get_sb = cpuset_get_sb,
};
cpuset_get_sb() is:
static int cpuset_get_sb(struct file_system_type *fs_type,
			 int flags, const char *unused_dev_name,
			 void *data, struct vfsmount *mnt)
{
	struct file_system_type *cgroup_fs = get_fs_type("cgroup");
	int ret = -ENODEV;
	if (cgroup_fs) {
		char mountopts[] =
			"cpuset,noprefix,"
			"release_agent=/sbin/cpuset_release_agent";
		ret = cgroup_fs->get_sb(cgroup_fs, flags,
					unused_dev_name, mountopts, mnt);
		put_filesystem(cgroup_fs);
	}
	return ret;
}
So it simply mounts the cgroup filesystem with the options cpuset,noprefix,release_agent=/sbin/cpuset_release_agent.
This is equivalent to:
mount -t cgroup cgroup -o cpuset,noprefix,release_agent=/sbin/cpuset_release_agent mount_dir
where mount_dir is the mount point.

3.3: cpuset_init_smp()
The code is:
void __init cpuset_init_smp(void)
{
	top_cpuset.cpus_allowed = cpu_online_map;
	top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];

	hotcpu_notifier(cpuset_track_online_cpus, 0);
	hotplug_memory_notifier(cpuset_track_online_nodes, 10);
}
It updates cpus_allowed and mems_allowed to the online CPUs and online memory nodes, then registers hooks for CPU hotplug and memory hotplug. Let us look at them.
Before analyzing the two hooks, a reminder: some of the helpers they call are core cpuset functions, and many of the paths analyzed later call the same helpers. Understanding this code is therefore key to understanding the whole cpuset subsystem. Enough preamble; on to the code.
The CPU hotplug hook is cpuset_track_online_cpus. The code is:
static int cpuset_track_online_cpus(struct notifier_block *unused_nb,
				unsigned long phase, void *unused_cpu)
{
	struct sched_domain_attr *attr;
	cpumask_t *doms;
	int ndoms;

	/* only handle CPU_ONLINE, CPU_ONLINE_FROZEN, CPU_DEAD, CPU_DEAD_FROZEN */
	switch (phase) {
	case CPU_ONLINE:
	case CPU_ONLINE_FROZEN:
	case CPU_DEAD:
	case CPU_DEAD_FROZEN:
		break;

	default:
		return NOTIFY_DONE;
	}

	/* update top_cpuset.cpus_allowed */
	cgroup_lock();
	top_cpuset.cpus_allowed = cpu_online_map;
	scan_for_empty_cpusets(&top_cpuset);
	/* regenerate the cpuset sched domains */
	ndoms = generate_sched_domains(&doms, &attr);
	cgroup_unlock();

	/* Have scheduler rebuild the domains */
	/* update the scheduler's sched-domain information */
	partition_sched_domains(ndoms, doms, attr);

	return NOTIFY_OK;
}
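This function handles CPU hotplug. When the set of CPUs in the system changes, for example when a CPU is added or removed, the CPU information in the cpusets must be corrected. First, as analyzed earlier, top_cpuset contains all CPUs and memory nodes, so the CPU information in top_cpuset is corrected first. Second, the change may leave some cpusets with an empty CPU set, so the tasks under those empty cpusets must be dealt with. The sched-domain information must be updated as well. The helper functions involved are analyzed one by one below.

3.3.1: scan_for_empty_cpusets()
The first to analyze is scan_for_empty_cpusets(). It scans for empty cpusets and moves the tasks of an empty cpuset to its nearest non-empty ancestor. The code is: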
static void scan_for_empty_cpusets(struct cpuset *root)
{
	LIST_HEAD(queue);
	struct cpuset *cp;	/* scans cpusets being updated */
	struct cpuset *child;	/* scans child cpusets of cp */
	struct cgroup *cont;
	nodemask_t oldmems;

	list_add_tail((struct list_head *)&root->stack_list, &queue);

	/* walk all cpusets */
	while (!list_empty(&queue)) {
		cp = list_first_entry(&queue, struct cpuset, stack_list);
		list_del(queue.next);
		list_for_each_entry(cont, &cp->css.cgroup->children, sibling) {
			child = cgroup_cs(cont);
			list_add_tail(&child->stack_list, &queue);
		}

		/* Continue past cpusets with all cpus, mems online */
		/* all CPUs and memory nodes of this cpuset are still online */
		if (cpus_subset(cp->cpus_allowed, cpu_online_map) &&
		    nodes_subset(cp->mems_allowed, node_states[N_HIGH_MEMORY]))
			continue;
		/* remember the previous mems_allowed */
		oldmems = cp->mems_allowed;

		/* Remove offline cpus and mems from this cpuset. */
		/* drop the memory nodes and CPUs that have been removed */
		mutex_lock(&callback_mutex);
		cpus_and(cp->cpus_allowed, cp->cpus_allowed, cpu_online_map);
		nodes_and(cp->mems_allowed, cp->mems_allowed,
						node_states[N_HIGH_MEMORY]);
		mutex_unlock(&callback_mutex);

		/* Move tasks from the empty cpuset to a parent */
		/* the adjusted CPU or memory-node set is empty */
		if (cpus_empty(cp->cpus_allowed) ||
		     nodes_empty(cp->mems_allowed))
			remove_tasks_in_empty_cpuset(cp);
		/* otherwise update the CPU and memory-node masks of its tasks */
		else {
			update_tasks_cpumask(cp, NULL);
			update_tasks_nodemask(cp, &oldmems);
		}
	}
}
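To follow this function you need to understand the cgroup architecture; for that, see the other article on this site, 《linux cgroup机制分析之框架分析》 (an analysis of the Linux cgroup framework). The cpuset->stack_list member comes into play here: it links the cpuset into a temporary list. As the code shows, this is a level-by-level traversal downward from top_cpuset.
For each cpuset visited:
1: If the cpuset's CPU and memory information is still valid (subsets of cpu_online_map and node_states[N_HIGH_MEMORY] respectively), nothing needs updating.
2: Otherwise, drop the CPUs and memory nodes that have gone offline (that is, intersect with cpu_online_map and node_states[N_HIGH_MEMORY]).
3: If the adjusted cpuset has no CPUs or no memory left, the tasks attached to it must be dealt with. This is done in remove_tasks_in_empty_cpuset().
4: If the adjusted cpuset still has both CPUs and memory, its tasks still have resources to use, so only their mems_allowed and cpus_allowed masks need updating. This is done in update_tasks_cpumask() and update_tasks_nodemask().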
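The traversal and the four cases can be sketched in user space as follows. Everything here is a simplified assumption: mini_cpuset, scan_tree() and the fixed-size array queue are invented for illustration (the kernel links cpusets through stack_list instead), and "moving tasks away" is reduced to setting a flag.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SETS 16	/* hypothetical bound for the sketch */

struct mini_cpuset {
	unsigned cpus, mems;
	struct mini_cpuset *child[4];	/* unused slots are NULL */
	int emptied;			/* set when tasks would have to move away */
};

/* Visit every cpuset breadth-first, trim offline resources, and flag the
 * ones left without CPUs or memory. Returns how many were flagged. */
static int scan_tree(struct mini_cpuset *root, unsigned cpus_online,
		     unsigned mems_online)
{
	struct mini_cpuset *queue[MAX_SETS];
	int head = 0, tail = 0, flagged = 0;

	queue[tail++] = root;
	while (head < tail) {
		struct mini_cpuset *cp = queue[head++];

		for (int i = 0; i < 4 && cp->child[i]; i++) {
			assert(tail < MAX_SETS);
			queue[tail++] = cp->child[i];
		}
		/* case 1: everything still online, nothing to do */
		if (!(cp->cpus & ~cpus_online) && !(cp->mems & ~mems_online))
			continue;
		/* case 2: drop what went offline */
		cp->cpus &= cpus_online;
		cp->mems &= mems_online;
		/* case 3: nothing left, tasks must move to an ancestor */
		if (!cp->cpus || !cp->mems) {
			cp->emptied = 1;
			flagged++;
		}
		/* case 4: otherwise only the tasks' masks would be refreshed */
	}
	return flagged;
}
```

Children are queued before the parent is trimmed, so the whole tree is visited even when an ancestor turns out to be unaffected.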
The helper functions called from scan_for_empty_cpusets() are analyzed next.
3.3.1.1: remove_tasks_in_empty_cpuset()
The code is:
static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
{
	struct cpuset *parent;

	/*
	 * The cgroup's css_sets list is in use if there are tasks
	 * in the cpuset; the list is empty if there are none;
	 * the cs->css.refcnt seems always 0.
	 */
	/* no tasks are attached to this cpuset */
	if (list_empty(&cs->css.cgroup->css_sets))
		return;

	/*
	 * Find its next-highest non-empty parent, (top cpuset
	 * has online cpus, so can't be empty).
	 */
	/* walk up to a cpuset whose cpus and mems are both non-empty */
	parent = cs->parent;
	while (cpus_empty(parent->cpus_allowed) ||
			nodes_empty(parent->mems_allowed))
		parent = parent->parent;
	/* move the tasks of this cpuset to parent */
	move_member_tasks_to_cpuset(cs, parent);
}

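If a cpuset has tasks attached but its allowed resources are empty, the function walks up to a cpuset that does have resources and moves the attached tasks there. With the comments in the code this should be easy to follow, so it is not analyzed further.
move_member_tasks_to_cpuset() is: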
static void move_member_tasks_to_cpuset(struct cpuset *from, struct cpuset *to)
{
	struct cpuset_hotplug_scanner scan;

	scan.scan.cg = from->css.cgroup;
	scan.scan.test_task = NULL; /* select all tasks in cgroup */
	scan.scan.process_task = cpuset_do_move_task;
	scan.scan.heap = NULL;
	scan.to = to->css.cgroup;

	if (cgroup_scan_tasks(&scan.scan))
		printk(KERN_ERR "move_member_tasks_to_cpuset: "
				"cgroup_scan_tasks failed\n");
}
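This involves another cgroup interface, cgroup_scan_tasks(). It is analyzed in detail later; for now, it is an iterator over the tasks attached to a cgroup. For every task in the cgroup it calls the callback scan.scan.process_task, which in the code above is cpuset_do_move_task(). The code is: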
static void cpuset_do_move_task(struct task_struct *tsk,
				struct cgroup_scanner *scan)
{
	struct cpuset_hotplug_scanner *chsp;

	chsp = container_of(scan, struct cpuset_hotplug_scanner, scan);
	cgroup_attach_task(chsp->to, tsk);
}
This function calls cgroup_attach_task() to attach the task to chsp->to, which is the parent we saw in the code above.
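The trick of wrapping the generic scanner in a caller-specific structure, then recovering the wrapper inside the callback, works the same way outside the kernel. Below is a minimal sketch; struct scanner, struct sum_scanner and the functions are hypothetical and only loosely modelled on cgroup_scanner and cpuset_hotplug_scanner.

```c
#include <assert.h>
#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Generic iterator descriptor, loosely modelled on struct cgroup_scanner. */
struct scanner {
	const int *items;
	int count;
	void (*process)(int item, struct scanner *scan);
};

/* Caller-specific wrapper, in the spirit of struct cpuset_hotplug_scanner. */
struct sum_scanner {
	struct scanner scan;
	long total;	/* extra state the generic iterator knows nothing about */
};

/* The generic iterator: calls the callback once per item. */
static void scan_items(struct scanner *scan)
{
	for (int i = 0; i < scan->count; i++)
		scan->process(scan->items[i], scan);
}

/* The callback recovers its wrapper from the embedded scanner. */
static void add_item(int item, struct scanner *scan)
{
	struct sum_scanner *ss = container_of(scan, struct sum_scanner, scan);
	ss->total += item;
}

static long sum_with_scanner(const int *items, int count)
{
	struct sum_scanner ss = { { items, count, add_item }, 0 };
	assert(count >= 0);
	scan_items(&ss.scan);
	return ss.total;
}
```

This is how cpuset_do_move_task() gets at the destination cgroup without the generic iterator needing a void * context argument.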

3.3.1.2: update_tasks_cpumask()
This function updates the CPU information of every task in a cpuset. The code is:
static void update_tasks_cpumask(struct cpuset *cs, struct ptr_heap *heap)
{
	struct cgroup_scanner scan;

	/*
	 * Walk all tasks in the cpuset and call
	 * cpuset_change_cpumask() on each of them.
	 */
	scan.cg = cs->css.cgroup;
	scan.test_task = cpuset_test_cpumask;
	scan.process_task = cpuset_change_cpumask;
	scan.heap = heap;
	cgroup_scan_tasks(&scan);
}
cgroup_scan_tasks() was discussed above; here cpuset_change_cpumask() is called for every task in the cpuset. The code is:
static void cpuset_change_cpumask(struct task_struct *tsk,
				  struct cgroup_scanner *scan)
{
	set_cpus_allowed_ptr(tsk, &((cgroup_cs(scan->cg))->cpus_allowed));
}
This function is simple: it sets the task's cpus_allowed field, so the next time the task is scheduled it runs on one of the allowed CPUs.

3.3.1.3: update_tasks_nodemask()
This function updates the memory-node information of the tasks in a cpuset. The code is:
static int update_tasks_nodemask(struct cpuset *cs, const nodemask_t *oldmem)
{
	struct task_struct *p;
	struct mm_struct **mmarray;
	int i, n, ntasks;
	int migrate;
	int fudge;
	struct cgroup_iter it;
	int retval;

	cpuset_being_rebound = cs;		/* causes mpol_dup() rebind */

	/* fudge gives mmarray[] some spare slots */
	fudge = 10;				/* spare mmarray[] slots */
	fudge += cpus_weight(cs->cpus_allowed);	/* imagine one fork-bomb/cpu */
	retval = -ENOMEM;

	/*
	 * Allocate mmarray[] to hold mm reference for each task
	 * in cpuset cs.  Can't kmalloc GFP_KERNEL while holding
	 * tasklist_lock.  We could use GFP_ATOMIC, but with a
	 * few more lines of code, we can retry until we get a big
	 * enough mmarray[] w/o using GFP_ATOMIC.
	 */
	/*
	 * Get the number of tasks in the cpuset. fudge is added in case
	 * new tasks are forked meanwhile and the array would be too small.
	 */
	while (1) {
		ntasks = cgroup_task_count(cs->css.cgroup);  /* guess */
		ntasks += fudge;
		mmarray = kmalloc(ntasks * sizeof(*mmarray), GFP_KERNEL);
		if (!mmarray)
			goto done;
		read_lock(&tasklist_lock);		/* block fork */
		if (cgroup_task_count(cs->css.cgroup) <= ntasks)
			break;				/* got enough */
		read_unlock(&tasklist_lock);		/* try again */
		kfree(mmarray);
	}

	n = 0;

	/* Load up mmarray[] with mm reference for each task in cpuset. */
	/*
	 * Save the mm of every task in the cpuset into mmarray[];
	 * n counts the tasks collected.
	 */
	cgroup_iter_start(cs->css.cgroup, &it);
	while ((p = cgroup_iter_next(cs->css.cgroup, &it))) {
		struct mm_struct *mm;

		if (n >= ntasks) {
			printk(KERN_WARNING
				"Cpuset mempolicy rebind incomplete.\n");
			break;
		}
		mm = get_task_mm(p);
		if (!mm)
			continue;
		mmarray[n++] = mm;
	}
	cgroup_iter_end(cs->css.cgroup, &it);
	read_unlock(&tasklist_lock);

	/*
	 * Now that we've dropped the tasklist spinlock, we can
	 * rebind the vma mempolicies of each mm in mmarray[] to their
	 * new cpuset, and release that mm.  The mpol_rebind_mm()
	 * call takes mmap_sem, which we couldn't take while holding
	 * tasklist_lock.  Forks can happen again now - the mpol_dup()
	 * cpuset_being_rebound check will catch such forks, and rebind
	 * their vma mempolicies too.  Because we still hold the global
	 * cgroup_mutex, we know that no other rebind effort will
	 * be contending for the global variable cpuset_being_rebound.
	 * It's ok if we rebind the same mm twice; mpol_rebind_mm()
	 * is idempotent.  Also migrate pages in each mm to new nodes.
	 */

	/*
	 * Update the tasks' memory policies. If CS_MEMORY_MIGRATE is
	 * set, the tasks' memory must also be moved from the old nodes
	 * to the new nodes.
	 */
	migrate = is_memory_migrate(cs);
	for (i = 0; i < n; i++) {
		struct mm_struct *mm = mmarray[i];

		mpol_rebind_mm(mm, &cs->mems_allowed);
		if (migrate)
			cpuset_migrate_mm(mm, oldmem, &cs->mems_allowed);
		mmput(mm);
	}

	/* We're done rebinding vmas to this cpuset's new mems_allowed. */
	kfree(mmarray);
	cpuset_being_rebound = NULL;
	retval = 0;
done:
	return retval;
}
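With the comments in the code, this should be fairly easy to follow. One new item appears here: cgroup_iter. It is also the iterator used by cgroup_scan_tasks(), and it is analyzed in detail later together with the cgroup_scan_tasks() code.
Some mempolicy interfaces are involved as well, such as mpol_rebind_mm() and cpuset_migrate_mm() -> do_migrate_pages(). They are not analyzed here; interested readers can read the source.
The function also uses the global cpuset_being_rebound, which mpol_dup() consults when it copies the current task's memory allocation policy.

Back in cpuset_track_online_cpus(): scan_for_empty_cpusets() is now done, so let us analyze the remaining helper.

3.3.2: generate_sched_domains()
This function collects the sched-domain information of the cpusets and returns it through its two parameters, as shown below: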
static int generate_sched_domains(cpumask_t **domains,
			struct sched_domain_attr **attributes)
{
	LIST_HEAD(q);		/* queue of cpusets to be scanned */
	struct cpuset *cp;	/* scans q */
	struct cpuset **csa;	/* array of all cpuset ptrs */
	int csn;		/* how many cpuset ptrs in csa so far */
	int i, j, k;		/* indices for partition finding loops */
	cpumask_t *doms;	/* resulting partition; i.e. sched domains */
	struct sched_domain_attr *dattr;  /* attributes for custom domains */
	int ndoms = 0;		/* number of sched domains in result */
	int nslot;		/* next empty doms[] cpumask_t slot */

	doms = NULL;
	dattr = NULL;
	csa = NULL;

	/* Special case for the 99% of systems with one, full, sched domain */
	/*
	 * If top_cpuset has CS_SCHED_LOAD_BALANCE set, load is
	 * balanced across all CPUs in the system.
	 */
	if (is_sched_load_balance(&top_cpuset)) {
		doms = kmalloc(sizeof(cpumask_t), GFP_KERNEL);
		if (!doms)
			goto done;

		dattr = kmalloc(sizeof(struct sched_domain_attr), GFP_KERNEL);
		if (dattr) {
			*dattr = SD_ATTR_INIT;
			/* largest relax_domain_level of top_cpuset and its descendants */
			update_domain_attr_tree(dattr, &top_cpuset);
		}
		/* cpus_allowed of the top level */
		*doms = top_cpuset.cpus_allowed;

		ndoms = 1;
		goto done;
	}

	/* array of cpusets */
	csa = kmalloc(number_of_cpusets * sizeof(cp), GFP_KERNEL);
	if (!csa)
		goto done;
	csn = 0;

	/*
	 * Walk the whole cpuset tree and put every cpuset that has
	 * CS_SCHED_LOAD_BALANCE set into csa[]; csn counts the entries.
	 */
	list_add(&top_cpuset.stack_list, &q);
	while (!list_empty(&q)) {
		struct cgroup *cont;
		struct cpuset *child;	/* scans child cpusets of cp */

		cp = list_first_entry(&q, struct cpuset, stack_list);
		list_del(q.next);

		if (cpus_empty(cp->cpus_allowed))
			continue;

		/*
		 * All child cpusets contain a subset of the parent's cpus, so
		 * just skip them, and then we call update_domain_attr_tree()
		 * to calc relax_domain_level of the corresponding sched
		 * domain.
		 */
		if (is_sched_load_balance(cp)) {
			csa[csn++] = cp;
			continue;
		}

		list_for_each_entry(cont, &cp->css.cgroup->children, sibling) {
			child = cgroup_cs(cont);
			list_add_tail(&child->stack_list, &q);
		}
	}

	/* set each cpuset->pn in csa[] to its array index */
	for (i = 0; i < csn; i++)
		csa[i]->pn = i;
	ndoms = csn;

restart:
	/* Find the best partition (set of sched domains) */
	/*
	 * Walk the cpusets in csa[] and give overlapping cpusets the same
	 * pn; ndoms ends up as the number of non-overlapping groups.
	 */
	for (i = 0; i < csn; i++) {
		struct cpuset *a = csa[i];
		int apn = a->pn;

		for (j = 0; j < csn; j++) {
			struct cpuset *b = csa[j];
			int bpn = b->pn;

			if (apn != bpn && cpusets_overlap(a, b)) {
				for (k = 0; k < csn; k++) {
					struct cpuset *c = csa[k];

					if (c->pn == bpn)
						c->pn = apn;
				}
				ndoms--;	/* one less element */
				goto restart;
			}
		}
	}

	/*
	 * Now we know how many domains to create.
	 * Convert <csn, csa> to <ndoms, doms> and populate cpu masks.
	 */
	/*
	 * There are as many sched domains as there are non-overlapping
	 * groups of cpusets with CS_SCHED_LOAD_BALANCE set.
	 */
	doms = kmalloc(ndoms * sizeof(cpumask_t), GFP_KERNEL);
	if (!doms)
		goto done;

	/*
	 * The rest of the code, including the scheduler, can deal with
	 * dattr==NULL case. No need to abort if alloc fails.
	 */
	/* one attribute entry per sched domain */
	dattr = kmalloc(ndoms * sizeof(struct sched_domain_attr), GFP_KERNEL);

	/*
	 * Fill doms and dattr: the union of cpus_allowed within one
	 * partition, and the largest relax_domain_level found under
	 * the cpusets of that partition.
	 */
	for (nslot = 0, i = 0; i < csn; i++) {
		struct cpuset *a = csa[i];
		cpumask_t *dp;
		int apn = a->pn;

		if (apn < 0) {
			/* Skip completed partitions */
			continue;
		}

		dp = doms + nslot;

		/*
		 * In theory nslot can never reach ndoms here: ndoms is the
		 * number of sched domains, while nslot, which counts from 0,
		 * indexes the distinct pn values found in csa so far.
		 */
		if (nslot == ndoms) {
			static int warnings = 10;
			if (warnings) {
				printk(KERN_WARNING
				 "rebuild_sched_domains confused:"
				  " nslot %d, ndoms %d, csn %d, i %d,"
				  " apn %d\n",
				  nslot, ndoms, csn, i, apn);
				warnings--;
			}
			continue;
		}

		cpus_clear(*dp);
		if (dattr)
			*(dattr + nslot) = SD_ATTR_INIT;
		for (j = i; j < csn; j++) {
			struct cpuset *b = csa[j];

			if (apn == b->pn) {
				cpus_or(*dp, *dp, b->cpus_allowed);
				if (dattr)
					update_domain_attr_tree(dattr + nslot, b);

				/* Done with this partition */
				b->pn = -1;
			}
		}
		nslot++;
	}
	BUG_ON(nslot != ndoms);

done:
	kfree(csa);

	/*
	 * Fallback to the default domain if kmalloc() failed.
	 * See comments in partition_sched_domains().
	 */
	if (doms == NULL)
		ndoms = 1;

	*domains = doms;
	*attributes = dattr;
	return ndoms;
}
This function is fairly simple and is not analyzed in detail; please follow it with the help of the added comments.
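The heart of generate_sched_domains() is the merge of overlapping CPU sets into disjoint partitions. The sketch below isolates that step in user space. It is an illustration under assumptions: build_partitions() is a made-up name, masks are plain unsigned ints, at most 32 inputs are accepted, and the attribute handling is omitted.

```c
#include <assert.h>

/* Merge overlapping CPU masks into disjoint partitions.
 * masks[] holds n input masks; out[] receives one OR-ed mask per partition.
 * Returns the number of partitions. */
static int build_partitions(const unsigned *masks, int n, unsigned *out)
{
	int pn[32];
	int ndoms = n, nslot = 0;

	assert(n >= 0 && n <= 32);
	for (int i = 0; i < n; i++)
		pn[i] = i;		/* every set starts in its own partition */

restart:
	for (int i = 0; i < n; i++)
		for (int j = 0; j < n; j++)
			if (pn[i] != pn[j] && (masks[i] & masks[j])) {
				int from = pn[j], to = pn[i];

				/* relabel the whole partition of j */
				for (int k = 0; k < n; k++)
					if (pn[k] == from)
						pn[k] = to;
				ndoms--;	/* two partitions became one */
				goto restart;
			}

	for (int i = 0; i < n; i++) {
		int p = pn[i];

		if (p < 0)
			continue;	/* already folded into an earlier slot */
		out[nslot] = 0;
		for (int j = i; j < n; j++)
			if (pn[j] == p) {
				out[nslot] |= masks[j];
				pn[j] = -1;	/* done with this member */
			}
		nslot++;
	}
	assert(nslot == ndoms);
	return ndoms;
}
```

Restarting after every merge is quadratic at best, but it keeps the code short, and the number of load-balanced cpusets is normally small.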
This completes the analysis of cpuset initialization.
4: cpuset operations
The operations of cpuset are analyzed next.
The cpuset subsystem structure is:
struct cgroup_subsys cpuset_subsys = {
.name = "cpuset",
.create = cpuset_create,
.destroy = cpuset_destroy,
.can_attach = cpuset_can_attach,
.attach = cpuset_attach,
.populate = cpuset_populate,
.post_clone = cpuset_post_clone,
.subsys_id = cpuset_subsys_id,
.early_init = 1,
};
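From this structure, together with the cgroup framework analyzed earlier, the flow of each operation can be worked out.

4.1: When a cgroup is created
From the earlier analysis we know that creating a cgroup calls the subsystem's create interface. For cpuset that is cpuset_create(). The code is: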

static struct cgroup_subsys_state *cpuset_create(
	struct cgroup_subsys *ss,
	struct cgroup *cont)
{
	struct cpuset *cs;
	struct cpuset *parent;

	/* for the root directory, simply return top_cpuset */
	if (!cont->parent) {
		/* This is early initialization for the top cgroup */
		top_cpuset.mems_generation = cpuset_mems_generation++;
		return &top_cpuset.css;
	}

	/* get the parent's cpuset */
	parent = cgroup_cs(cont->parent);
	/* allocate and initialize a cpuset */
	cs = kmalloc(sizeof(*cs), GFP_KERNEL);
	if (!cs)
		return ERR_PTR(-ENOMEM);

	cpuset_update_task_memory_state();
	cs->flags = 0;
	if (is_spread_page(parent))
		set_bit(CS_SPREAD_PAGE, &cs->flags);
	if (is_spread_slab(parent))
		set_bit(CS_SPREAD_SLAB, &cs->flags);
	set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
	/* clear cpus_allowed and mems_allowed */
	cpus_clear(cs->cpus_allowed);
	nodes_clear(cs->mems_allowed);
	cs->mems_generation = cpuset_mems_generation++;
	fmeter_init(&cs->fmeter);
	cs->relax_domain_level = -1;
	/* set the parent */
	cs->parent = parent;
	number_of_cpusets++;
	return &cs->css ;
}
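The code above is fairly simple. It returns cpuset->css, so the owning cpuset can later be found from the cgroup_subsys_state structure.
We can also see that a newly created cpuset has empty mems_allowed and cpus_allowed, while relax_domain_level gets the default value -1.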

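4.2: When a task is attached
When a task is attached to a cgroup, subsys->can_attach() is called first to decide whether the task may be attached; a return value of 0 means it may. If so, subsys->attach() is then called to perform the attachment. The two interfaces are analyzed below.

4.2.1: cpuset_can_attach()
The code is: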
static int cpuset_can_attach(struct cgroup_subsys *ss,
			     struct cgroup *cont, struct task_struct *tsk)
{
	struct cpuset *cs = cgroup_cs(cont);

	/* a cpuset with no allowed resources cannot run tasks: refuse */
	if (cpus_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
		return -ENOSPC;

	/*
	 * The task is already bound to specific CPUs. If that CPU set
	 * differs from the cpuset's CPU set, refuse.
	 */
	if (tsk->flags & PF_THREAD_BOUND) {
		cpumask_t mask;

		mutex_lock(&callback_mutex);
		mask = cs->cpus_allowed;
		mutex_unlock(&callback_mutex);
		if (!cpus_equal(tsk->cpus_allowed, mask))
			return -EINVAL;
	}

	/* the usual security check */
	return security_task_setscheduler(tsk, 0, NULL);
}
This function is fairly simple and is not analyzed in detail.

4.2.2: cpuset_attach()
The code is:
static void cpuset_attach(struct cgroup_subsys *ss,
			  struct cgroup *cont, struct cgroup *oldcont,
			  struct task_struct *tsk)
{
	cpumask_t cpus;
	nodemask_t from, to;
	struct mm_struct *mm;
	struct cpuset *cs = cgroup_cs(cont);
	struct cpuset *oldcs = cgroup_cs(oldcont);
	int err;

	/* cs is the cpuset the task is moving to; oldcs is its previous cpuset */

	/* update the task's CPU mask */
	mutex_lock(&callback_mutex);
	guarantee_online_cpus(cs, &cpus);
	err = set_cpus_allowed_ptr(tsk, &cpus);
	mutex_unlock(&callback_mutex);
	if (err)
		return;

	/*
	 * Update the task's memory-node mask. If CS_MEMORY_MIGRATE is set,
	 * the task's memory must also be moved from the old nodes to the
	 * new nodes.
	 */
	from = oldcs->mems_allowed;
	to = cs->mems_allowed;
	mm = get_task_mm(tsk);
	if (mm) {
		mpol_rebind_mm(mm, &to);
		if (is_memory_migrate(cs))
			cpuset_migrate_mm(mm, &from, &to);
		mmput(mm);
	}
}
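This function is also fairly simple; please follow it with the help of the comments.

4.3: When the control files are created
When a cpuset is created, its control files are created in the filesystem, which calls subsys->populate(). The code is: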
static int cpuset_populate(struct cgroup_subsys *ss, struct cgroup *cont)
{
	int err;

	err = cgroup_add_files(cont, ss, files, ARRAY_SIZE(files));
	if (err)
		return err;
	/* memory_pressure_enabled is in root cpuset only */
	if (!cont->parent)
		err = cgroup_add_file(cont, ss,
				      &cft_memory_pressure_enabled);
	return err;
}
As the code shows, the top-level cpuset gets one extra file, described by the cftype cft_memory_pressure_enabled:
static struct cftype cft_memory_pressure_enabled = {
.name = "memory_pressure_enabled",
.read_u64 = cpuset_read_u64,
.write_u64 = cpuset_write_u64,
.private = FILE_MEMORY_PRESSURE_ENABLED,
};
That is, a file named "memory_pressure_enabled".
The files present in every cpuset directory are described by the files[] array of cftype entries:
static struct cftype files[] = {
{
      .name = "cpus",
      .read = cpuset_common_file_read,
      .write_string = cpuset_write_resmask,
      .max_write_len = (100U + 6 * NR_CPUS),
      .private = FILE_CPULIST,
},

{
      .name = "mems",
      .read = cpuset_common_file_read,
      .write_string = cpuset_write_resmask,
      .max_write_len = (100U + 6 * MAX_NUMNODES),
      .private = FILE_MEMLIST,
},

{
      .name = "cpu_exclusive",
      .read_u64 = cpuset_read_u64,
      .write_u64 = cpuset_write_u64,
      .private = FILE_CPU_EXCLUSIVE,
},

{
      .name = "mem_exclusive",
      .read_u64 = cpuset_read_u64,
      .write_u64 = cpuset_write_u64,
      .private = FILE_MEM_EXCLUSIVE,
},

{
      .name = "mem_hardwall",
      .read_u64 = cpuset_read_u64,
      .write_u64 = cpuset_write_u64,
      .private = FILE_MEM_HARDWALL,
},

{
      .name = "sched_load_balance",
      .read_u64 = cpuset_read_u64,
      .write_u64 = cpuset_write_u64,
      .private = FILE_SCHED_LOAD_BALANCE,
},

{
      .name = "sched_relax_domain_level",
      .read_s64 = cpuset_read_s64,
      .write_s64 = cpuset_write_s64,
      .private = FILE_SCHED_RELAX_DOMAIN_LEVEL,
},

{
      .name = "memory_migrate",
      .read_u64 = cpuset_read_u64,
      .write_u64 = cpuset_write_u64,
      .private = FILE_MEMORY_MIGRATE,
},

{
      .name = "memory_pressure",
      .read_u64 = cpuset_read_u64,
      .write_u64 = cpuset_write_u64,
      .private = FILE_MEMORY_PRESSURE,
},

{
      .name = "memory_spread_page",
      .read_u64 = cpuset_read_u64,
      .write_u64 = cpuset_write_u64,
      .private = FILE_SPREAD_PAGE,
},

{
      .name = "memory_spread_slab",
      .read_u64 = cpuset_read_u64,
      .write_u64 = cpuset_write_u64,
      .private = FILE_SPREAD_SLAB,
},
};
That is, the files named cpus, mems, cpu_exclusive, mem_exclusive, mem_hardwall, sched_load_balance, sched_relax_domain_level, memory_migrate, memory_pressure, memory_spread_page and memory_spread_slab.
The meaning of several of them was covered above: cpus, mems, sched_load_balance, sched_relax_domain_level, memory_migrate, memory_spread_page and memory_spread_slab. The analysis below concentrates on what the other files mean.

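5: cpuset file operations
5.1: The memory_pressure_enabled file
Starting from the top-level directory: the cpuset subsystem has one file that exists only at the top level, memory_pressure_enabled. It controls whether memory pressure is computed for cpusets. What is memory pressure? It is the rate at which memory allocation requests cannot be satisfied from the memory that is currently free. For the details of how it is computed, see the documentation shipped with the kernel.
The file's cftype is:
static struct cftype cft_memory_pressure_enabled = {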
.name = "memory_pressure_enabled",
.read_u64 = cpuset_read_u64,
.write_u64 = cpuset_write_u64,
.private = FILE_MEMORY_PRESSURE_ENABLED,
};
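As shown, the read interface is cpuset_read_u64 and the write interface is cpuset_write_u64. We will see later that most cpuset files use these two interfaces, which tell the files apart by the private member.
First the read operation:
static u64 cpuset_read_u64(struct cgroup *cont, struct cftype *cft)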
{
struct cpuset *cs = cgroup_cs(cont);
cpuset_filetype_t type = cft->private;
switch (type) {
case FILE_CPU_EXCLUSIVE:
      return is_cpu_exclusive(cs);
case FILE_MEM_EXCLUSIVE:
      return is_mem_exclusive(cs);
case FILE_MEM_HARDWALL:
      return is_mem_hardwall(cs);
case FILE_SCHED_LOAD_BALANCE:
      return is_sched_load_balance(cs);
case FILE_MEMORY_MIGRATE:
      return is_memory_migrate(cs);
case FILE_MEMORY_PRESSURE_ENABLED:
      return cpuset_memory_pressure_enabled;
case FILE_MEMORY_PRESSURE:
      return fmeter_getrate(&cs->fmeter);
case FILE_SPREAD_PAGE:
      return is_spread_page(cs);
case FILE_SPREAD_SLAB:
      return is_spread_slab(cs);
default:
      BUG();
}

/* Unreachable but makes gcc happy */
return 0;
}
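For the memory_pressure_enabled file the private field is FILE_MEMORY_PRESSURE_ENABLED, so the value of cpuset_memory_pressure_enabled is returned. This variable is defined as:
int cpuset_memory_pressure_enabled;
Although it is an int, it is used as a boolean and only ever holds 0 or 1, as the write operation shows.

The write interface is cpuset_write_u64(). The code is: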
static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
{
	int retval = 0;
	struct cpuset *cs = cgroup_cs(cgrp);
	cpuset_filetype_t type = cft->private;

	if (!cgroup_lock_live_group(cgrp))
		return -ENODEV;

	switch (type) {
	case FILE_CPU_EXCLUSIVE:
		retval = update_flag(CS_CPU_EXCLUSIVE, cs, val);
		break;
	case FILE_MEM_EXCLUSIVE:
		retval = update_flag(CS_MEM_EXCLUSIVE, cs, val);
		break;
	case FILE_MEM_HARDWALL:
		retval = update_flag(CS_MEM_HARDWALL, cs, val);
		break;
	case FILE_SCHED_LOAD_BALANCE:
		retval = update_flag(CS_SCHED_LOAD_BALANCE, cs, val);
		break;
	case FILE_MEMORY_MIGRATE:
		retval = update_flag(CS_MEMORY_MIGRATE, cs, val);
		break;
	case FILE_MEMORY_PRESSURE_ENABLED:
		cpuset_memory_pressure_enabled = !!val;
		break;
	case FILE_MEMORY_PRESSURE:
		retval = -EACCES;
		break;
	case FILE_SPREAD_PAGE:
		retval = update_flag(CS_SPREAD_PAGE, cs, val);
		cs->mems_generation = cpuset_mems_generation++;
		break;
	case FILE_SPREAD_SLAB:
		retval = update_flag(CS_SPREAD_SLAB, cs, val);
		cs->mems_generation = cpuset_mems_generation++;
		break;
	default:
		retval = -EINVAL;
		break;
	}
	cgroup_unlock();
	return retval;
}
For the memory_pressure_enabled file, the operation is:
cpuset_memory_pressure_enabled = !!val;
That is, it sets cpuset_memory_pressure_enabled: writing 0 stores 0, and writing any other number stores 1.
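The two ideas above, one pair of handlers shared by several files and the !!val normalization, can be sketched in user space. This is only an illustration under stated assumptions: every model_* name and the priv field are hypothetical stand-ins (priv plays the role of cftype.private), not kernel interfaces, and locking is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
enum model_filetype {
	MODEL_FILE_CPU_EXCLUSIVE,
	MODEL_FILE_MEMORY_PRESSURE_ENABLED,
	MODEL_FILE_MEMORY_PRESSURE,
};

struct model_cftype {
	const char *name;
	enum model_filetype priv;	/* plays the role of cftype.private */
};

struct model_cpuset {
	unsigned long flags;		/* bit 0 models CS_CPU_EXCLUSIVE */
	int pressure;			/* stands in for fmeter.val */
};

static int model_pressure_enabled;	/* global switch, always 0 or 1 */

/* One read handler for all files; priv selects the branch. */
static uint64_t model_read_u64(const struct model_cpuset *cs,
			       const struct model_cftype *cft)
{
	switch (cft->priv) {
	case MODEL_FILE_CPU_EXCLUSIVE:
		return cs->flags & 1UL;
	case MODEL_FILE_MEMORY_PRESSURE_ENABLED:
		return (uint64_t)model_pressure_enabled;
	case MODEL_FILE_MEMORY_PRESSURE:
		return (uint64_t)cs->pressure;
	}
	assert(0 && "unknown file type");
	return 0;
}

/* One write handler for all files; returns 0 or a negative errno. */
static int model_write_u64(struct model_cpuset *cs,
			   const struct model_cftype *cft, uint64_t val)
{
	switch (cft->priv) {
	case MODEL_FILE_CPU_EXCLUSIVE:
		cs->flags = val ? (cs->flags | 1UL) : (cs->flags & ~1UL);
		return 0;
	case MODEL_FILE_MEMORY_PRESSURE_ENABLED:
		model_pressure_enabled = !!val;	/* any nonzero value becomes 1 */
		return 0;
	case MODEL_FILE_MEMORY_PRESSURE:
		return -EACCES;			/* read-only file */
	}
	return -EINVAL;
}
```

Writing 7 through the memory_pressure_enabled entry and reading it back yields 1, which is exactly the effect of !!val; a write through the memory_pressure entry is rejected with -EACCES.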

To sum up, both handlers operate on cpuset_memory_pressure_enabled. What is this variable used for?
In __alloc_pages_internal(), when the currently free memory cannot satisfy an allocation request, cpuset_memory_pressure_bump() is called. It is defined as:
#define cpuset_memory_pressure_bump()			\
	do {						\
		if (cpuset_memory_pressure_enabled)	\
			__cpuset_memory_pressure_bump();	\
	} while (0)

It is just a macro. If memory pressure is enabled, that is, if cpuset_memory_pressure_enabled is 1, it calls __cpuset_memory_pressure_bump():
void __cpuset_memory_pressure_bump(void)
{
	task_lock(current);
	fmeter_markevent(&task_cs(current)->fmeter);
	task_unlock(current);
}
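This shows what the cpuset->fmeter member is for: it computes the memory pressure. We will not walk through fmeter_markevent(); it simply records one event, and the pressure value is derived from the rate of such events, that is, how often an allocation found memory short. The resulting value is kept in fmeter.val.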
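Since fmeter_markevent() is skipped in the text, here is a rough user-space sketch of how such a decaying event-rate meter can work. It is written from memory of kernel/cpuset.c, so the constants and the exact arithmetic should be treated as assumptions; the fmeter_model_* names are hypothetical, the spinlock is omitted, and the current time in seconds is passed in explicitly instead of being read from the clock.

```c
#include <assert.h>

/* Constants as recalled from kernel/cpuset.c; treat them as assumptions. */
#define FM_COEF     933		/* per-second decay factor, scaled by FM_SCALE */
#define FM_MAXTICKS 99		/* decaying for longer than this changes nothing */
#define FM_MAXCNT   1000000	/* cap on pending events to avoid overflow */
#define FM_SCALE    1000	/* fixed-point scale */

struct fmeter_model {
	int cnt;	/* events recorded since the last update, scaled */
	int val;	/* filtered rate, the number the file reports */
	long time;	/* second at which val was last updated */
};

/* Decay val once per elapsed second, then fold in the pending events. */
static void fmeter_model_update(struct fmeter_model *fmp, long now)
{
	long ticks;

	assert(fmp);
	ticks = now - fmp->time;
	if (ticks <= 0)
		return;
	if (ticks > FM_MAXTICKS)
		ticks = FM_MAXTICKS;
	while (ticks-- > 0)
		fmp->val = (FM_COEF * fmp->val) / FM_SCALE;
	fmp->time = now;

	fmp->val += ((FM_SCALE - FM_COEF) * fmp->cnt) / FM_SCALE;
	fmp->cnt = 0;
}

/* Record one "memory was short" event. */
static void fmeter_model_markevent(struct fmeter_model *fmp, long now)
{
	fmeter_model_update(fmp, now);
	fmp->cnt += FM_SCALE;
	if (fmp->cnt > FM_MAXCNT)
		fmp->cnt = FM_MAXCNT;
}

/* Bring the meter up to date and return the filtered rate. */
static int fmeter_model_getrate(struct fmeter_model *fmp, long now)
{
	fmeter_model_update(fmp, now);
	return fmp->val;
}
```

With these constants a single event raises the reported value and the value then decays by a factor of 0.933 per second, so old pressure fades out gradually rather than disappearing at once.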

5.2: The memory_pressure file
The memory_pressure file reports the current memory pressure of a cpuset node. Its cftype is:
{
      .name = "memory_pressure",
      .read_u64 = cpuset_read_u64,
      .write_u64 = cpuset_write_u64,
      .private = FILE_MEMORY_PRESSURE,
},
The handlers are the same ones analyzed above.

Read path:
static u64 cpuset_read_u64(struct cgroup *cont, struct cftype *cft)
{
	......
	......
	case FILE_MEMORY_PRESSURE:
		return fmeter_getrate(&cs->fmeter);
	......
}
The code of fmeter_getrate() is:
static int fmeter_getrate(struct fmeter *fmp)
{
	int val;

	spin_lock(&fmp->lock);
	fmeter_update(fmp);
	val = fmp->val;
	spin_unlock(&fmp->lock);
	return val;
}
It returns the current memory pressure value of this node.

Write path:
static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
{
	......
	......
	case FILE_MEMORY_PRESSURE:
		retval = -EACCES;
	......
}
So this file is not writable: any write fails with -EACCES.

5.3: The cpus file
The cpus file configures the CPUs bound to a cpuset. Its cftype is:
{
      .name = "cpus",
      .read = cpuset_common_file_read,
      .write_string = cpuset_write_resmask,
      .max_write_len = (100U + 6 * NR_CPUS),
      .private = FILE_CPULIST,
}

The read handler is cpuset_common_file_read():
static ssize_t cpuset_common_file_read(struct cgroup *cont,
				       struct cftype *cft,
				       struct file *file,
				       char __user *buf,
				       size_t nbytes, loff_t *ppos)
{
	struct cpuset *cs = cgroup_cs(cont);
	cpuset_filetype_t type = cft->private;
	char *page;
	ssize_t retval = 0;
	char *s;

	if (!(page = (char *)__get_free_page(GFP_TEMPORARY)))
		return -ENOMEM;

	s = page;

	switch (type) {
	case FILE_CPULIST:
		/* format cpuset->cpus_allowed as a string into s */
		s += cpuset_sprintf_cpulist(s, cs);
		break;
	case FILE_MEMLIST:
		/* format cpuset->mems_allowed as a string into s */
		s += cpuset_sprintf_memlist(s, cs);
		break;
	default:
		retval = -EINVAL;
		goto out;
	}
	/* terminate with a newline */
	*s++ = '\n';

	/* copy to user space */
	retval = simple_read_from_buffer(buf, nbytes, ppos, page, s - page);
out:
	free_page((unsigned long)page);
	return retval;
}
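This handler is shared with the mems file. The code is simple enough that it needs no detailed analysis: it formats cpuset->cpus_allowed as a list string and copies it to user space.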
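To make the output format concrete, the following user-space sketch turns a 64-bit CPU mask into the list form that the cpus file uses, such as "0-2,5", followed by a newline. model_format_cpulist() is a hypothetical helper written for this illustration; it is not the kernel's cpuset_sprintf_cpulist(), and it assumes at most 64 CPUs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format the set bits of mask as a range list such as "0-2,5" followed by a
 * newline. Returns the number of characters written (excluding the NUL), or
 * -1 if the buffer is too small.
 */
static int model_format_cpulist(char *buf, size_t len, uint64_t mask)
{
	size_t used = 0;
	int cpu = 0;
	int first = 1;

	assert(buf != NULL);
	while (cpu < 64) {
		int start, end, n;

		if (!(mask & (UINT64_C(1) << cpu))) {
			cpu++;
			continue;
		}
		/* extend the run while the next CPU is also set */
		start = cpu;
		while (cpu + 1 < 64 && (mask & (UINT64_C(1) << (cpu + 1))))
			cpu++;
		end = cpu++;

		if (start == end)
			n = snprintf(buf + used, len - used, "%s%d",
				     first ? "" : ",", start);
		else
			n = snprintf(buf + used, len - used, "%s%d-%d",
				     first ? "" : ",", start, end);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
		first = 0;
	}
	/* room for the trailing newline and the NUL */
	if (used + 2 > len)
		return -1;
	buf[used++] = '\n';
	buf[used] = '\0';
	return (int)used;
}
```

An empty mask produces just the newline, which mirrors the handler above appending '\n' unconditionally after the formatted list.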

The entry point for writes is:
static int cpuset_write_resmask(struct cgroup *cgrp, struct cftype *cft,
				const char *buf)
{
	int retval = 0;

	/* lock the cgroup we are about to modify */
	if (!cgroup_lock_live_group(cgrp))
		return -ENODEV;

	switch (cft->private) {
	case FILE_CPULIST:
		/* update the cpuset's cpus_allowed */
		retval = update_cpumask(cgroup_cs(cgrp), buf);
		break;
	case FILE_MEMLIST:
		/* update the cpuset's mems_allowed */
		retval = update_nodemask(cgroup_cs(cgrp), buf);
		break;
	default:
		retval = -EINVAL;
		break;
	}
	cgroup_unlock();
	return retval;
}
For the cpus file, control passes to update_cpumask():
static int update_cpumask(struct cpuset *cs, const char *buf)
{
	struct ptr_heap heap;
	struct cpuset trialcs;
	int retval;
	int is_load_balanced;

	/* top_cpuset.cpus_allowed tracks cpu_online_map; it's read-only */
	/* the top-level cpuset is read-only */
	if (cs == &top_cpuset)
		return -EACCES;

	/* trialcs is a copy of cs */
	trialcs = *cs;

	/*
	 * An empty cpus_allowed is ok only if the cpuset has no tasks.
	 * Since cpulist_parse() fails on an empty mask, we special case
	 * that parsing.  The validate_change() call ensures that cpusets
	 * with tasks have cpus.
	 */
	/* an empty string was written: clear cpus_allowed */
	if (!*buf) {
		cpus_clear(trialcs.cpus_allowed);
	} else {
		/* parse the list in buf and store it in the copy's cpus_allowed */
		retval = cpulist_parse(buf, trialcs.cpus_allowed);
		if (retval < 0)
			return retval;
		/* reject a cpus_allowed that is not a subset of cpu_online_map */
		if (!cpus_subset(trialcs.cpus_allowed, cpu_online_map))
			return -EINVAL;
	}
	/* check whether cs may be changed to the masks in trialcs */
	retval = validate_change(cs, &trialcs);
	if (retval < 0)
		return retval;

	/* Nothing to do if the cpus didn't change */
	/* the new cpus_allowed equals the old one, so there is nothing to update */
	if (cpus_equal(cs->cpus_allowed, trialcs.cpus_allowed))
		return 0;

	/* initialize the heap used for sorting */
	retval = heap_init(&heap, PAGE_SIZE, GFP_KERNEL, NULL);
	if (retval)
		return retval;
	/* is the CS_SCHED_LOAD_BALANCE flag set? */
	is_load_balanced = is_sched_load_balance(&trialcs);

	/* change cpuset->cpus_allowed */
	mutex_lock(&callback_mutex);
	cs->cpus_allowed = trialcs.cpus_allowed;
	mutex_unlock(&callback_mutex);

	/*
	 * Scan tasks in the cpuset, and update the cpumasks of any
	 * that need an update.
	 */
	/*
	 * The cpus_allowed of the cpuset has changed, so the cpus_allowed
	 * of every task in it has to be updated as well.
	 */
	update_tasks_cpumask(cs, &heap);

	/* release the heap */
	heap_free(&heap);

	/* if CS_SCHED_LOAD_BALANCE is set */
	if (is_load_balanced)
		async_rebuild_sched_domains();
	return 0;
}
The comments in the code describe each step, so no further analysis is given here; the important interfaces it relies on were already covered above.
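The pattern used by update_cpumask(), parse into a trial copy, validate, and only then commit, can be sketched in user space as follows. Everything here is a simplified assumption: the model_* names are hypothetical, a uint64_t stands in for cpumask_t, the single check on ntasks stands in for validate_change() (the -ENOSPC return value is recalled from the kernel, not shown in the text above), and task updates and scheduler domain rebuilds are left out. Unlike cpulist_parse(), this toy parser accepts an empty string, so no special case is needed.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a list such as "0-2,5" into a bitmask of CPUs 0..63. */
static int model_parse_cpulist(const char *buf, uint64_t *mask)
{
	uint64_t out = 0;

	while (*buf) {
		char *end;
		long a, b;

		if (*buf < '0' || *buf > '9')
			return -EINVAL;
		a = b = strtol(buf, &end, 10);
		if (*end == '-') {
			buf = end + 1;
			if (*buf < '0' || *buf > '9')
				return -EINVAL;
			b = strtol(buf, &end, 10);
		}
		if (a > b || b > 63)
			return -EINVAL;
		for (; a <= b; a++)
			out |= UINT64_C(1) << a;
		if (*end == ',')
			end++;
		else if (*end != '\0')
			return -EINVAL;
		buf = end;
	}
	*mask = out;
	return 0;
}

struct model_cpuset {
	uint64_t cpus_allowed;
	int ntasks;		/* number of tasks attached to this cpuset */
};

/* Returns 0 on success or a negative errno; cs is untouched on failure. */
static int model_update_cpumask(struct model_cpuset *cs, const char *buf,
				uint64_t online)
{
	struct model_cpuset trial;
	int ret;

	assert(cs && buf);
	trial = *cs;				/* work on a copy first */
	ret = model_parse_cpulist(buf, &trial.cpus_allowed);
	if (ret < 0)
		return ret;
	if (trial.cpus_allowed & ~online)	/* not a subset of online CPUs */
		return -EINVAL;
	/* stand-in for validate_change(): a cpuset with tasks needs CPUs */
	if (trial.ntasks > 0 && trial.cpus_allowed == 0)
		return -ENOSPC;
	if (trial.cpus_allowed == cs->cpus_allowed)
		return 0;			/* nothing changed */
	cs->cpus_allowed = trial.cpus_allowed;	/* commit */
	return 0;
}
```

Because all checks run against the copy, a rejected write leaves the live cpuset exactly as it was, which is the point of the trialcs variable in the kernel function.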

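5.4: Other files
The mems file is handled in much the same way as cpus, so it is not analyzed again. The remaining files only set flags. Those flags were all discussed earlier and the code is straightforward, so it is left for the reader.

6: Remaining issues
6.1: cgroup_scan_tasks()
To keep the earlier walkthrough flowing, we skipped cgroup_scan_tasks(). The function walks the tasks in a cgroup and calls a caller-supplied function on each of them. It relies on a data structure, struct cgroup_scanner: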
struct cgroup_scanner {
	/* the cgroup to scan */
	struct cgroup *cg;
	/* tests a task to decide whether it is one we want to process */
	int (*test_task)(struct task_struct *p, struct cgroup_scanner *scan);
	/* the function that processes a task */
	void (*process_task)(struct task_struct *p,
			struct cgroup_scanner *scan);
	/* heap used for sorting; supplied by the caller or built by default */
	struct ptr_heap *heap;
};
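For each task in the cgroup, struct cgroup_scanner->test_task() is called first. If it returns 1, the task is one we want to handle, so struct cgroup_scanner->process_task() is called next. The heap here has nothing to do with the heap in a process's address space (as in "stack and heap"). It is the heap of heap sort, essentially a binary tree stored in an array. Heap sort is described in detail in [book title missing].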
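The test_task/process_task split can be illustrated with a small user-space sketch. It is an assumption-laden model, not the kernel code: an array stands in for the cgroup's task list, there is no heap and no locking, all model_* names are hypothetical, and the data member used to pass the new mask is invented for the example.

```c
#include <assert.h>
#include <stddef.h>

struct model_task {
	int pid;
	unsigned long cpus_allowed;
};

struct model_scanner {
	struct model_task *tasks;	/* stands in for the cgroup's task list */
	size_t ntasks;
	/* returns nonzero if the task should be processed; may be NULL */
	int (*test_task)(struct model_task *p, struct model_scanner *scan);
	/* called for every task that passed the test; must not be NULL */
	void (*process_task)(struct model_task *p, struct model_scanner *scan);
	void *data;			/* hypothetical caller argument */
};

/* Returns the number of tasks that were processed. */
static size_t model_scan_tasks(struct model_scanner *scan)
{
	size_t i, processed = 0;

	assert(scan->process_task != NULL);
	for (i = 0; i < scan->ntasks; i++) {
		struct model_task *p = &scan->tasks[i];

		if (scan->test_task && !scan->test_task(p, scan))
			continue;
		scan->process_task(p, scan);
		processed++;
	}
	return processed;
}

/* Example callbacks: update only tasks whose mask differs from the new one. */
static int model_mask_differs(struct model_task *p, struct model_scanner *scan)
{
	return p->cpus_allowed != *(unsigned long *)scan->data;
}

static void model_apply_mask(struct model_task *p, struct model_scanner *scan)
{
	p->cpus_allowed = *(unsigned long *)scan->data;
}
```

Running the scan a second time with the same mask processes nothing, because the test callback filters out tasks that are already up to date.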

Now the code:
int cgroup_scan_tasks(struct cgroup_scanner *scan)
{
	int retval, i;
	struct cgroup_iter it;
	struct task_struct *p, *dropped;
	/* Never dereference latest_task, since it's not refcounted */
	struct task_struct *latest_task = NULL;
	struct ptr_heap tmp_heap;
	struct ptr_heap *heap;
	struct timespec latest_time = { 0, 0 };

	if (scan->heap) {
		/* The caller supplied our heap and pre-allocated its memory */
		heap = scan->heap;
		heap->gt = &started_after;
	} else {
		/* We need to allocate our own heap memory */
		heap = &tmp_heap;
		retval = heap_init(heap, PAGE_SIZE, GFP_KERNEL, &started_after);
		if (retval)
			/* cannot allocate the heap */
			return retval;
	}
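If scan->heap is not NULL, the caller has supplied its own heap, and only the element comparison function heap->gt() needs to be set. If scan->heap is NULL, the function allocates and initializes a heap of its own.
The helper heap_init() is simple: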
int heap_init(struct ptr_heap *heap, size_t size, gfp_t gfp_mask,
	      int (*gt)(void *, void *))
{
	heap->ptrs = kmalloc(size, gfp_mask);
	if (!heap->ptrs)
		return -ENOMEM;
	heap->size = 0;
	heap->max = size / sizeof(void *);
	heap->gt = gt;
	return 0;
}
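heap->ptrs is a pointer to pointer, which can also be viewed as an array of pointers. heap->size is the number of entries currently stored (0 right after initialization), and heap->max is the capacity, that is, the allocated space divided by the size of one pointer, sizeof(void *). heap->gt is the comparison function that determines where an element sits in the heap.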

again:
	/*
	 * Scan tasks in the cgroup, using the scanner's "test_task" callback
	 * to determine which are of interest, and using the scanner's
	 * "process_task" callback to process any of them that need an update.
	 * Since we don't want to hold any locks during the task updates,
	 * gather tasks to be processed in a heap structure.
	 * The heap is sorted by descending task start time.
	 * If the statically-sized heap fills up, we overflow tasks that
	 * started later, and in future iterations only consider tasks that
	 * started after the latest task in the previous pass. This
	 * guarantees forward progress and that we don't miss any tasks.
	 */
	heap->size = 0;

	/* use the cgroup iterator to visit every task in the cgroup */
	cgroup_iter_start(scan->cg, &it);
	while ((p = cgroup_iter_next(scan->cg, &it))) {
		/*
		 * Only affect tasks that qualify per the caller's callback,
		 * if he provided one
		 */
		/* call test_task to check whether this task needs an update */
		if (scan->test_task && !scan->test_task(p, scan))
			continue;
		/*
		 * Only process tasks that started after the last task
		 * we processed
		 */
		if (!started_after_time(p, &latest_time, latest_task))
			continue;
		/* add task p to the heap */
		dropped = heap_insert(heap, p);

		/* inserted successfully: take a reference on the task */
		if (dropped == NULL) {
			/*
			 * The new task was inserted; the heap wasn't
			 * previously full
			 */
			get_task_struct(p);
		}
		/* the heap was full and it pushed one entry out */
		/* if the evicted task is not the new one, fix up both reference counts */
		else if (dropped != p) {
			/*
			 * The new task was inserted, and pushed out a
			 * different task
			 */
			get_task_struct(p);
			put_task_struct(dropped);
		}
		/*
		 * Else the new task was newer than anything already in
		 * the heap and wasn't inserted
		 */
		/* the new task itself was rejected: nothing to do, move on to the next task */
	}
	/* done with the cgroup iterator */
	cgroup_iter_end(scan->cg, &it);
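cgroup_iter was covered in the analysis of the cgroup framework: it is an iterator over the tasks in a cgroup. From the code above, a task must meet two conditions to be added to the heap:
1: If scan->test_task is set, scan->test_task() must return nonzero.
2: started_after_time() must return nonzero. The function is declared as: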
static inline int started_after_time(struct task_struct *t1,
                  struct timespec *time,
                  struct task_struct *t2)
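It returns 1 if t1->start_time is later than time. If the two are equal, it returns 1 when the pointer t1 is greater than t2.
In the code above, latest_task starts as NULL and latest_time as {0, 0}, so on the first pass every task satisfies started_after_time().
Once the heap is full, each insertion discards the item that is largest according to heap->gt(); that is the case in which heap_insert() returns a non-NULL value.
In this function heap->gt is started_after():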
static inline int started_after(void *p1, void *p2)
{
	struct task_struct *t1 = p1;
	struct task_struct *t2 = p2;
	return started_after_time(t1, &t2->start_time, t2);
}
So what gets thrown out is the task in the heap with the largest task->start_time or, when start times tie, the largest task pointer.
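The ordering described above, start time first and pointer value as the tie-breaker, can be written out as a small user-space sketch. The body of started_after_time() is not shown in this article, so this is a reconstruction from the description and should be read as an assumption; the model_* names and the ref_time parameter name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct model_timespec {
	long tv_sec;
	long tv_nsec;
};

struct model_task {
	struct model_timespec start_time;
};

/* Returns -1, 0 or 1 as a is earlier than, equal to or later than b. */
static int model_timespec_compare(const struct model_timespec *a,
				  const struct model_timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec ? -1 : 1;
	if (a->tv_nsec != b->tv_nsec)
		return a->tv_nsec < b->tv_nsec ? -1 : 1;
	return 0;
}

/* 1 if t1 counts as "started after" the pair (ref_time, t2), else 0. */
static int model_started_after_time(const struct model_task *t1,
				    const struct model_timespec *ref_time,
				    const struct model_task *t2)
{
	int diff;

	assert(t1 && ref_time);
	diff = model_timespec_compare(&t1->start_time, ref_time);
	if (diff != 0)
		return diff > 0;
	/* same start time: compare addresses as an arbitrary tie-breaker */
	return (uintptr_t)t1 > (uintptr_t)t2;
}

/* Heap comparison: did t1 start after t2? */
static int model_started_after(const struct model_task *t1,
			       const struct model_task *t2)
{
	return model_started_after_time(t1, &t2->start_time, t2);
}
```

The tie-breaker matters because it makes the order strict: two tasks with identical start times are never both "after" each other, so a task is neither skipped nor visited twice across passes.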
Later in the function, latest_time and latest_task are updated as follows:
if (i == 0) {
	latest_time = q->start_time;
	latest_task = q;
}
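This sets them to the maximum currently in the heap (the root, heap->ptrs[0]).
Putting it together: a task discarded because the heap was full always has a task->start_time (or, on a tie, a task pointer) greater than the maximum left in the heap. The discarded tasks therefore pass the started_after_time() check on the next pass and are added to the heap, while tasks that were already processed fail it and are not added again.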
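The multi-pass scheme can be tried out in user space with a deliberately tiny heap. This is a sketch under simplifying assumptions: start times are plain non-negative long values that are all distinct (so no pointer tie-breaker is needed), the heap holds at most three entries, and all model_* names are hypothetical. The insert routine follows the behaviour described above: when full, it either rejects the new value or evicts the current maximum.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_HEAP_MAX 3	/* tiny capacity so that overflow is easy to see */
#define MODEL_NONE (-1L)	/* "nothing dropped"; start times are >= 0 */

struct model_heap {
	long ptrs[MODEL_HEAP_MAX];	/* start times, largest at index 0 */
	int size;
};

/* Restore the max-heap property below position pos. */
static void model_heapify(struct model_heap *h, int pos)
{
	for (;;) {
		int left = 2 * pos + 1, right = left + 1, largest = pos;
		long tmp;

		if (left < h->size && h->ptrs[left] > h->ptrs[largest])
			largest = left;
		if (right < h->size && h->ptrs[right] > h->ptrs[largest])
			largest = right;
		if (largest == pos)
			return;
		tmp = h->ptrs[pos];
		h->ptrs[pos] = h->ptrs[largest];
		h->ptrs[largest] = tmp;
		pos = largest;
	}
}

/*
 * Insert t. Returns MODEL_NONE if there was room, t itself if the heap is
 * full and t is later than everything in it, or the evicted maximum.
 */
static long model_heap_insert(struct model_heap *h, long t)
{
	long res;

	if (h->size < MODEL_HEAP_MAX) {
		int pos = h->size++;

		while (pos > 0 && t > h->ptrs[(pos - 1) / 2]) {
			h->ptrs[pos] = h->ptrs[(pos - 1) / 2];
			pos = (pos - 1) / 2;
		}
		h->ptrs[pos] = t;
		return MODEL_NONE;
	}
	if (t > h->ptrs[0])
		return t;
	res = h->ptrs[0];
	h->ptrs[0] = t;
	model_heapify(h, 0);
	return res;
}

/*
 * Visit every start time exactly once, MODEL_HEAP_MAX at a time.
 * seen[i] counts visits of start[i]; returns the number of non-empty passes.
 */
static int model_scan(const long *start, int *seen, size_t n)
{
	long latest = MODEL_NONE;
	int passes = 0;

	assert(n == 0 || (start && seen));
	for (;;) {
		struct model_heap h = { {0}, 0 };
		size_t i;
		int k;

		/* gather the earliest not-yet-handled entries */
		for (i = 0; i < n; i++)
			if (start[i] > latest)
				model_heap_insert(&h, start[i]);
		if (h.size == 0)
			return passes;
		latest = h.ptrs[0];	/* latest start time handled so far */
		/* "process" everything that ended up in the heap */
		for (k = 0; k < h.size; k++)
			for (i = 0; i < n; i++)
				if (start[i] == h.ptrs[k])
					seen[i]++;
		passes++;
	}
}
```

With seven entries and a capacity of three, the scan needs three passes, and each entry is visited exactly once: every pass keeps the earliest remaining start times and pushes the later ones to a following pass.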

	/* the tasks to handle are now all in the heap: update each of them */
	if (heap->size) {
		for (i = 0; i < heap->size; i++) {
			struct task_struct *q = heap->ptrs[i];
			if (i == 0) {
				latest_time = q->start_time;
				latest_task = q;
			}
			/* Process the task per the caller's callback */
			scan->process_task(q, scan);
			put_task_struct(q);
		}
		/*
		 * If we had to process any tasks at all, scan again
		 * in case some of them were in the middle of forking
		 * children that didn't get processed.
		 * Not the most efficient way to do it, but it avoids
		 * having to take callback_mutex in the fork path
		 */
		/*
		 * Some tasks may still be unprocessed for various reasons,
		 * so jump back and run another pass.
		 */
		goto again;
	}
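Now that the heap holds data, scan->process_task() is called on every task in it. The goto again at the end jumps back to the again label near the top of the function to handle the tasks that were squeezed out of the heap.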

	/* if the heap was allocated in this function, free it */
	if (heap == &tmp_heap)
		heap_free(&tmp_heap);
	return 0;
}
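When a pass leaves the heap empty, every task in the cgroup has been processed. If the heap was allocated by this function, its memory is released.

For reasons of space, the heap sort algorithm used by this code is not analyzed here. Readers who have trouble with the code can refer to Chapter 7 of [book title missing].

6.2: The CS_MEM_HARDWALL flag
The CS_MEM_HARDWALL flag receives some special handling that deserves a separate look.
The page allocator contains the following fragment: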
static struct page *
get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
{
	......
	......
	if ((alloc_flags & ALLOC_CPUSET) &&
		!cpuset_zone_allowed_softwall(zone, gfp_mask))
			goto try_next_zone;
	......
	......
}
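This fragment decides whether memory may be allocated from a zone. If the ALLOC_CPUSET allocation flag is set, the allocation is subject to cpuset restrictions. cpuset_zone_allowed_softwall() is defined as: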
static int inline cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask)
{
	return number_of_cpusets <= 1 ||
		__cpuset_zone_allowed_softwall(z, gfp_mask);
}
If the system has only one cpuset (top_cpuset), no further check is needed. If there is more than one, control passes to __cpuset_zone_allowed_softwall():
int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask)
{
	int node;			/* node that zone z is on */
	const struct cpuset *cs;	/* current cpuset ancestors */
	int allowed;			/* is allocation in zone z allowed? */

	/* allowed in interrupt context or when __GFP_THISNODE is set */
	if (in_interrupt() || (gfp_mask & __GFP_THISNODE))
		return 1;
	/* the node this zone is on */
	node = zone_to_nid(z);
	/* may sleep when __GFP_HARDWALL is not set */
	might_sleep_if(!(gfp_mask & __GFP_HARDWALL));
	/* allowed if the node is one the current task may use */
	if (node_isset(node, current->mems_allowed))
		return 1;
	/*
	 * Allow tasks that have access to memory reserves because they have
	 * been OOM killed to get memory anywhere.
	 */
	/* allowed if the current task has TIF_MEMDIE */
	if (unlikely(test_thread_flag(TIF_MEMDIE)))
		return 1;
	/*
	 * With __GFP_HARDWALL, memory may only come from the nodes of the
	 * task's own cpuset. Reaching this point means that cpuset does
	 * not contain this node, so the node is refused.
	 */
	if (gfp_mask & __GFP_HARDWALL)	/* If hardwall request, stop here */
		return 0;
	/* the task is exiting */
	if (current->flags & PF_EXITING) /* Let dying task have memory */
		return 1;

	/* Not hardwall and node outside mems_allowed: scan up cpusets */
	/*
	 * Reaching this point means the task's cpuset does not contain
	 * this node and __GFP_HARDWALL was not requested, so a node may
	 * be taken from an ancestor cpuset.
	 */
	mutex_lock(&callback_mutex);

	task_lock(current);
	cs = nearest_hardwall_ancestor(task_cs(current));
	task_unlock(current);

	/*
	 * Check whether the zone's node is in cs->mems_allowed:
	 * return 1 if it is, 0 otherwise.
	 */
	allowed = node_isset(node, cs->mems_allowed);
	mutex_unlock(&callback_mutex);
	return allowed;
}
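This function decides whether memory may be allocated from the zone. The cases are:
1: In interrupt context, or when the caller passed __GFP_THISNODE to request this specific node. This is easy to understand: allocation requests in interrupt context should be satisfied whenever possible.

2: If the node that the zone belongs to is one the cpuset allows, the allocation is clearly permitted. (The cpuset's mems_allowed is reflected in task->mems_allowed of its tasks.)

3: The task has the TIF_MEMDIE flag.
When memory is so tight that even system services cannot be satisfied, a process has to be chosen and terminated. The chosen process is marked with TIF_MEMDIE; such a process is about to be killed or is being killed.

4: The request carries __GFP_HARDWALL and none of the cases above applies: allocation on this node is refused, with no room for negotiation.

5: The task has the PF_EXITING flag, meaning it is exiting. Allowed.
6: In all other cases the decision goes to nearest_hardwall_ancestor():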
static const struct cpuset *nearest_hardwall_ancestor(const struct cpuset *cs)
{
	while (!(is_mem_exclusive(cs) || is_mem_hardwall(cs)) && cs->parent)
		cs = cs->parent;
	return cs;
}
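As the code shows, if the current cpuset has neither CS_MEM_EXCLUSIVE nor CS_MEM_HARDWALL set, the walk moves up through its ancestors and stops at the nearest one that has either flag set, or at the top cpuset if none does. If the requested node is in the mems_allowed of the cpuset found this way, the allocation is allowed.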
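The order of the six checks, and the ancestor walk at the end, can be condensed into a user-space sketch. It is only a model built on assumptions: plain int fields stand in for the kernel state (in_interrupt(), the gfp flags, TIF_MEMDIE, PF_EXITING), an unsigned long stands in for nodemask_t, locking is omitted, and every model_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct model_cpuset {
	unsigned long mems_allowed;	/* bit n set: node n is allowed */
	int mem_exclusive;		/* models CS_MEM_EXCLUSIVE */
	int mem_hardwall;		/* models CS_MEM_HARDWALL */
	const struct model_cpuset *parent;
};

struct model_request {
	int in_interrupt;	/* allocation from interrupt context */
	int gfp_thisnode;	/* models __GFP_THISNODE */
	int gfp_hardwall;	/* models __GFP_HARDWALL */
	int memdie;		/* models TIF_MEMDIE */
	int exiting;		/* models PF_EXITING */
	unsigned long task_mems_allowed;
};

/* Walk up until a cpuset with either flag set, or the top, is reached. */
static const struct model_cpuset *
model_nearest_hardwall_ancestor(const struct model_cpuset *cs)
{
	while (!(cs->mem_exclusive || cs->mem_hardwall) && cs->parent)
		cs = cs->parent;
	return cs;
}

/* Returns 1 if node may be used for this request, 0 otherwise. */
static int model_zone_allowed_softwall(int node,
				       const struct model_request *rq,
				       const struct model_cpuset *cs)
{
	unsigned long bit;

	assert(node >= 0 && node < (int)(8 * sizeof(unsigned long)));
	bit = 1UL << node;

	if (rq->in_interrupt || rq->gfp_thisnode)	/* case 1 */
		return 1;
	if (rq->task_mems_allowed & bit)		/* case 2 */
		return 1;
	if (rq->memdie)					/* case 3 */
		return 1;
	if (rq->gfp_hardwall)				/* case 4 */
		return 0;
	if (rq->exiting)				/* case 5 */
		return 1;
	/* case 6: fall back to the nearest hardwall ancestor */
	return (model_nearest_hardwall_ancestor(cs)->mems_allowed & bit) != 0;
}
```

In this model a cpuset with mem_hardwall set acts as a wall for its descendants: a soft request from a leaf below it can borrow nodes from the wall cpuset but not from anything above it, whereas a leaf with no flagged ancestor can fall back all the way to the top cpuset.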

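This is where CS_MEM_HARDWALL and CS_MEM_EXCLUSIVE take effect. They differ in that CS_MEM_EXCLUSIVE also gives the cpuset exclusive memory nodes, whereas CS_MEM_HARDWALL imposes no such restriction.

7: Summary
cpuset is a feature commonly used on large systems. It touches on process scheduling and memory allocation; readers unfamiliar with those areas can consult the other documents on this site.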

This article comes from the ChinaUnix blog. The original is at: http://blog.chinaunix.net/u1/51562/showart_1777937.html
