  ZERO PAGE removal influence
  Codec Performance Differences between the 2.6.19 and 2.6.24 Kernels
  Xinyu Chen, Freescale Semiconductor Ltd.
  25 Sep 2008
  
  
    Most Linux kernel improvements (and the occasional regression) aim to make applications run faster and more stably. It is therefore normal for the same application binary to show different performance or behavior on different kernels. On an embedded system, application performance and behavior are important, even critical, so such changes must be taken into account when upgrading the kernel.
  
  
  
    Background
  
  
The video codec team found that their standalone codec application performs worse under the 2.6.24 kernel than under 2.6.19. To find the root cause quickly, they ran a set of unit test cases as a performance test and reported that one case (the large-read case) shows a big performance difference between the two kernels, roughly 20%.

Codec performance is measured as the duration (with microsecond precision) from the beginning of decoding one frame to its end. This performance is critical on our embedded multimedia system and matters to potential customers. So I took the unit test code and started to analyze it, trying to find out why the difference exists and how to close the gap.
  
  
   
    Analysis
  
The unit test program reads a buffer of data as one frame and decodes it, then increases the read size and decodes again until it reaches the maximum buffer size, simulating the real video codec workload. So the test program does only two jobs: read from memory and compute.
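
A rough sketch of that shape (purely illustrative: the buffer size, the names and the empty decode body are my assumptions, not the original Freescale test code) would look like the following, timing each frame at microsecond precision with gettimeofday():

    #include <stddef.h>
    #include <stdio.h>
    #include <sys/time.h>

    #define MAX_FRAME_SIZE (4 * 1024 * 1024)

    static unsigned char frame_buf[MAX_FRAME_SIZE];   /* simulated frame data */

    static void decode_frame(const unsigned char *buf, size_t len)
    {
        /* read from buf and do the decode computation */
    }

    int main(void)
    {
        struct timeval start, end;
        size_t len;

        for (len = 4096; len <= MAX_FRAME_SIZE; len *= 2) {
            gettimeofday(&start, NULL);
            decode_frame(frame_buf, len);
            gettimeofday(&end, NULL);
            printf("size %zu: %ld us\n", len,
                   (end.tv_sec - start.tv_sec) * 1000000L +
                   (end.tv_usec - start.tv_usec));
        }
        return 0;
    }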
  Given this workload, the main factors to consider are:
  

  •       L1/L2 cache settings

L1/L2 cache settings and policy have a big impact on memory performance. The ARMv6 L1 cache policy is always write-back and is identical in these two kernels, so the L2 cache settings need to be checked.
  
  

  •       CPU core clock

  The CPU core clock affects compute speed. It is set by the bootloader (RedBoot).
  

  •       IRQ and softirq latency

The application can be interrupted by hardware or software interrupts, and the interrupt handler latency delays the read or compute path.
  

  •       Scheduler

A normal-priority application process does not run on the CPU all the time. After using up its time slice it is scheduled out so that other processes can take the CPU. The scheduling policy therefore decides how much CPU time the process gets and how it is preempted by higher-priority processes.
  

  •       Statistical method

The gettimeofday implementation used to calculate the duration of decoding one frame may have changed, or the clock source precision may have changed. Either would make the measured results differ.
  
  The main differences between kernels 2.6.19 and 2.6.24 that we care about in this case are:
  

  •       The kernel scheduler evolved to CFS (the Completely Fair Scheduler); the vanilla scheduler was dropped
       

  •       Memory management improvements
       

    •         Remove ZERO PAGE
            

    •         Add SLUB allocator support
            

  •       Add high-resolution timer and tickless support for the ARM architecture

  •       Add clocksource support for the ARM architecture

  •       ARM L2 cache code changes
       

  
  Because of the huge number of changes between these two kernels, I could only work through the suspects one by one.

  The first one eliminated was the CPU core clock: I made sure the same bootloader was used on both kernels to set the clock to 532MHz.
  
The second was the L2 cache setting. Kernel 2.6.19 does not export an L2CC configuration interface; all settings are done in the L2 cache driver probe, and the AUX control setting is left at its default. After making the L2CC configuration consistent between the two kernels, nothing changed.
  
The third suspect eliminated was the scheduler. The CFS scheduler, developed mostly for the desktop use case, replaced the vanilla 2.6 scheduler in kernel 2.6.23. 80% of CFS's design can be summed up in a single sentence: CFS basically models an "ideal, precise multi-tasking CPU" on real hardware. An "ideal multi-tasking CPU" is a CPU that has 100% physical power and can run each task at precisely equal speed, in parallel, each at 1/nr_running speed. For example, if two tasks are running, it runs each at 50% physical power, totally in parallel. So in our case, other processes might share the time slice (which the 2.6.19 scheduler would have given to our process) under CFS; regardless of our process's priority, the others get their fair share of CPU. Since we can neither go back to the vanilla 2.6 scheduler (dropped in 2.6.23) nor apply the CFS patch to 2.6.19, another way to show that CFS is not the root cause is to run the test program under the same scheduling policy on both kernels. The Linux kernel provides the SCHED_FIFO, SCHED_RR, SCHED_BATCH and SCHED_NORMAL scheduling policies; SCHED_NORMAL uses CFS under 2.6.24, but SCHED_FIFO and SCHED_RR use the realtime scheduling class. The SCHED_FIFO policy gives a high-priority process a realtime running environment: it keeps running until it exits or blocks, and no normal-priority process can preempt it. After running the program on both kernels with the SCHED_FIFO policy (set via the sched_setscheduler syscall), the difference was still there.
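
For reference, a minimal sketch of pinning the test process to the realtime class (the priority value and error handling are illustrative, not taken from the original test):

    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_param sp = { .sched_priority = 50 };

        /* pid 0 means the calling process */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
            perror("sched_setscheduler");
            return 1;
        }

        /* run the decode test loop here; SCHED_NORMAL tasks cannot preempt it */
        return 0;
    }

Running this requires root privileges, since SCHED_FIFO is a privileged policy.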
  
Next came the statistical method. Kernel 2.6.24 uses the high-resolution and tickless timer infrastructure, known for nanosecond-precision timers and a tickless idle loop, and the GPT clocksource driver was adapted to the new timer code. After disabling highres and nohz on the kernel command line under 2.6.24, the test result was identical to the result with them enabled, so the statistical method is not the root cause.
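
The switches used are the standard boot parameters, appended to the existing bootargs; for example (the console and root arguments below are placeholders, only highres=off and nohz=off matter here):

    console=ttymxc0,115200 root=/dev/nfs highres=off nohz=off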
  
Finally, I focused on IRQ and softirq latency. The analysis method is simple: put the test case into a kernel module and disable interrupts and kernel preemption before running it. The result was interesting. Since the 2.6.19 kernel still relies on the tick interrupt to update jiffies, timing statistics taken with IRQs disabled are not correct there, so this run proved nothing by itself. I then removed the interrupt-disabling code from the test kernel module to see what would happen. To my surprise, the test result on 2.6.24 was now aligned with the one on 2.6.19. It was time to find out how the running environment of a kernel module differs from that of a user process.
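
The in-kernel wrapper was conceptually like the sketch below (a simplified illustration; run_codec_case() stands in for the real test body, and the timing code is omitted):

    #include <linux/module.h>
    #include <linux/init.h>
    #include <linux/irqflags.h>
    #include <linux/preempt.h>

    static unsigned char frame_buf[1 << 20];   /* static buffer: pages exist at load time */

    static void run_codec_case(void)
    {
            /* read frame_buf and do the decode computation, measuring the duration */
    }

    static int __init codec_test_init(void)
    {
            unsigned long flags;

            preempt_disable();
            local_irq_save(flags);          /* the step removed later for the second run */
            run_codec_case();
            local_irq_restore(flags);
            preempt_enable();
            return 0;
    }

    static void __exit codec_test_exit(void)
    {
    }

    module_init(codec_test_init);
    module_exit(codec_test_exit);
    MODULE_LICENSE("GPL");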
  
After analyzing the test code in detail, I found that it uses an uninitialized static array as the buffer that simulates the frame, so the program always reads zeros from the frame and computes on zeros. In user space, such a static variable is placed in the .bss section of the ELF binary, and at process start-up the kernel allocates only a vma for it; real physical pages are not allocated until the application accesses the variable, where access means read or write. For a kernel module, however (whether on 2.6.19 or 2.6.24), all static variables get physical pages at module load time. The code below allocates a vma for the module core with size core_size, which includes the static data section, and then memsets it to zero.
  
    load_module(): kernel/module.c

        /* Do the allocs. */
        ptr = module_alloc(mod->core_size);
        if (!ptr) {
                err = -ENOMEM;
                goto free_percpu;
        }
        memset(ptr, 0, mod->core_size);
        mod->module_core = ptr;

        ptr = module_alloc(mod->init_size);
        if (!ptr && mod->init_size) {
                err = -ENOMEM;
                goto free_core;
        }
        memset(ptr, 0, mod->init_size);
        mod->module_init = ptr;
  
The memset's write access to this vma causes the page fault handler to allocate physical pages for it. So even though the test case never writes to the static variable, the kernel module already has physical pages allocated for it. Compared with user space, the "remove ZERO_PAGE" kernel change produces different behavior. Under the 2.6.19 kernel there is an optimization for read access to not-yet-allocated anonymous memory, called the ZERO_PAGE: on an anonymous page fault caused by a read access, the kernel creates a page table entry pointing to this zero page (there is only one zero page in the kernel). Since the test program never writes to the static buffer, every read hits that single zero page, and the L1 cache accelerates the reads. Because the ZERO_PAGE was removed in kernel 2.6.20, the 2.6.24 kernel allocates a cleared physical page on a page fault for both read and write access to an anonymous page. With no zero page, the buffer reads come from many different physical pages, and the L1 cache cannot help. The relevant snippet of the page fault handler patch is listed below:
@@ -2252,39 +2158,24 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
     spinlock_t *ptl;
     pte_t entry;

-    if (write_access) {
-        /* Allocate our own private page. */
-        pte_unmap(page_table);
-
-        if (unlikely(anon_vma_prepare(vma)))
-            goto oom;
-        page = alloc_zeroed_user_highpage_movable(vma, address);
-        if (!page)
-            goto oom;
-
-        entry = mk_pte(page, vma->vm_page_prot);
-        entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+    /* Allocate our own private page. */
+    pte_unmap(page_table);

-        page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-        if (!pte_none(*page_table))
-            goto release;
-        inc_mm_counter(mm, anon_rss);
-        lru_cache_add_active(page);
-        page_add_new_anon_rmap(page, vma, address);
-    } else {
-        /* Map the ZERO_PAGE - vm_page_prot is readonly */
-        page = ZERO_PAGE(address);
-        page_cache_get(page);
-        entry = mk_pte(page, vma->vm_page_prot);
+    if (unlikely(anon_vma_prepare(vma)))
+        goto oom;
+    page = alloc_zeroed_user_highpage_movable(vma, address);
+    if (!page)
+        goto oom;

-        ptl = pte_lockptr(mm, pmd);
-        spin_lock(ptl);
-        if (!pte_none(*page_table))
-            goto release;
-        inc_mm_counter(mm, file_rss);
-        page_add_file_rmap(page);
-    }
+    entry = mk_pte(page, vma->vm_page_prot);
+    entry = maybe_mkwrite(pte_mkdirty(entry), vma);

+    page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+    if (!pte_none(*page_table))
+        goto release;
+    inc_mm_counter(mm, anon_rss);
+    lru_cache_add_active(page);
+    page_add_new_anon_rmap(page, vma, address);
     set_pte_at(mm, address, page_table, entry);

     /* No need to invalidate - it was non-present before */
  From the patch we can see that the do_anonymous_page handler no longer treats write and read access separately; the zero page is gone.
After adding initialization code for the static variable to the test program (a write access) and running it in user space, the two results are aligned.
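
In other words, one write pass over the buffer before the timed loop (a minimal illustration, reusing the hypothetical frame_buf from the earlier sketch) is enough to force real anonymous pages on both kernels:

    /* write access: the fault handler now allocates real physical pages */
    memset(frame_buf, 0, sizeof(frame_buf));
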
    Conclusion
  The codec unit test program has a logical problem: reading uninitialized memory is bad practice and must be avoided in a real video codec program. On the other hand, under kernel 2.6.20 or later, writing to a buffer up front is a way to allocate physical memory before running performance-critical code, so that page faults do not keep interrupting the process.
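
A common pattern along those lines (a sketch, not from the original test) is to touch and lock the working buffers once at start-up, so the timed path never takes an anonymous page fault:

    #include <string.h>
    #include <sys/mman.h>

    #define WORK_BUF_SIZE (4 * 1024 * 1024)

    static unsigned char work_buf[WORK_BUF_SIZE];

    static int prepare_buffers(void)
    {
        /* write once so every page is backed by real memory ... */
        memset(work_buf, 0, sizeof(work_buf));
        /* ... and lock it so it stays resident (needs root or CAP_IPC_LOCK) */
        return mlock(work_buf, sizeof(work_buf));
    }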
Here is the commit that removed ZERO_PAGE:
  
   
  
  
    commit 557ed1fa2620dc119adb86b34c614e152a629a80
    Author: Nick Piggin
    Date: Tue Oct 16 01:24:40 2007 -0700
   
     remove ZERO_PAGE
   
     The commit b5810039a54e5babf428e9a1e89fc1940fabff11 contains the note
   
     A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
     (and thus mapcounted and count towards shared rss). These writes to
     the struct page could cause excessive cacheline bouncing on big
     systems. There are a number of ways this could be addressed if it is
     an issue.
   
     And indeed this cacheline bouncing has shown up on large SGI systems.
     There was a situation where an Altix system was essentially livelocked
     tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
     This situation can be avoided in userspace, but it does highlight the
     potential scalability problem with refcounting ZERO_PAGE, and corner
     cases where it can really hurt (we don't want the system to livelock!).
   
     There are several broad ways to fix this problem:
     1. add back some special casing to avoid refcounting ZERO_PAGE
     2. per-node or per-cpu ZERO_PAGES
     3. remove the ZERO_PAGE completely
   
     I will argue for 3. The others should also fix the problem, but they
     result in more complex code than does 3, with little or no real benefit
     that I can see.
   
     Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
     false optimisation: if an application is performance critical, it would
     not be doing many read faults of new memory, or at least it could be
     expected to write to that memory soon afterwards. If cache or memory use
     is critical, it should not be working with a significant number of
     ZERO_PAGEs anyway (a more compact representation of zeroes should be
     used).
   
     As a sanity check -- mesuring on my desktop system, there are never many
     mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
     increase much without it.
   
     When running a make -j4 kernel compile on my dual core system, there are
     about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
     ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
     is torn down without being COWed). So removing ZERO_PAGE will save 1,000
     page faults per second when running kbuild, while keeping it only saves
     less than 1 page clearing operation per second. 1 page clear is cheaper
     than a thousand faults, presumably, so there isn't an obvious loss.
   
     Neither the logical argument nor these basic tests give a guarantee of no
     regressions. However, this is a reasonable opportunity to try to remove
     the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
     we can reintroduce it and just avoid refcounting it.
   
     The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked. I don't see
     much use to them except on benchmarks. All other users of ZERO_PAGE are
     converted just to use ZERO_PAGE(0) for simplicity. We can look at
     replacing them all and maybe ripping out ZERO_PAGE completely when we are
     more satisfied with this solution.
   
     Signed-off-by: Nick Piggin
     Signed-off-by: Andrew Morton
     Signed-off-by: Linus "snif" Torvalds
  
  
     Resource
  
Kernel change log summary: http://kernelnewbies.org/Linux_2_6_xx
CFS design: Documentation/sched-design-CFS.txt
MM:

This article comes from the ChinaUnix blog; the original is at: http://blog.chinaunix.net/u/14459/showart_1226287.html