Two questions about the page cache
1. Is the page cache stored in user space or in kernel space? If it is in user space, do different processes see the same page cache? If it is in kernel space, how large is it typically?
2. Why does write() have to copy data from user space into the kernel? If the page cache lives in kernel space, is the copy there to update the page cache? Why can't the kernel just read the data directly from user space?
I hope the experts here can give a conclusion and also point me to some references.
I'm fairly new to this myself.
1) The page cache is indeed in kernel space. The page cache caches the data of files stored on slow disks; in other words, it is organized per file, not per process. A file is described in the file system by an inode, and its page cache is reached through inode->i_mapping (an address_space, which is what describes the page cache).
When a process opens a file, the chain is: file descriptor --> file --> inode. No matter how many processes open() the same file, they all end up at the same inode; only the fd and the struct file differ.
So different processes see the same page cache for the same file. As for its size, it is variable: the address_space manages the physical pages of the page cache through a radix tree, and that tree grows and shrinks.
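As a rough illustration (my own sketch, not from the original reply; the file path is made up), the following user-space program opens the same file twice. Because both file descriptors resolve to the same inode and therefore the same page cache, data written through one fd is visible through the other even before any fsync():

/* Two independent opens of one file share the same page cache:
 * data written through fd1 (still only dirty in the cache) is
 * immediately visible through fd2. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/pagecache_demo";   /* hypothetical test file */
    char buf[16] = {0};

    int fd1 = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    int fd2 = open(path, O_RDONLY);             /* second fd, same inode */
    if (fd1 < 0 || fd2 < 0)
        return 1;

    write(fd1, "hello", 5);   /* lands in the shared page cache, page marked dirty */
    /* no fsync(): nothing has necessarily reached the disk yet */

    read(fd2, buf, 5);        /* reads the very same cached page */
    printf("fd2 sees: %s\n", buf);

    close(fd1);
    close(fd2);
    return 0;
}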
2) A write() does not necessarily trigger actual I/O. Normally, when you write(), the data is copied from user space into the page cache in kernel space, the page is marked dirty, and the call returns right away; that is why write is sometimes faster than read.
The actual write to disk is handled by dedicated kernel writeback threads (bdflush in older kernels, the flusher threads today). They walk the page cache of the files, find the dirty pages, and issue the I/O one by one; only then is the file data on disk actually updated.
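A small sketch of this behaviour (again my own example, with a made-up path): the write() below normally returns as soon as the data has been copied into the page cache and the page marked dirty, while the subsequent fsync() has to wait for the real disk I/O, so it usually dominates the measured time.

/* Illustrative timing of a buffered write() vs. fsync(). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    char buf[1 << 20];                    /* 1 MB of data */
    memset(buf, 'x', sizeof(buf));

    int fd = open("/tmp/write_demo", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;

    double t0 = now();
    write(fd, buf, sizeof(buf));          /* usually just dirties the page cache */
    double t1 = now();
    fsync(fd);                            /* waits for the dirty pages to hit the disk */
    double t2 = now();

    printf("write: %.6f s, fsync: %.6f s\n", t1 - t0, t2 - t1);
    close(fd);
    return 0;
}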
References:
ULK (Understanding the Linux Kernel), Chapter 12 (files/VFS), Chapter 15 (the page cache), and Chapter 16.
The page cache is definitely in kernel space, and its size is not limited (recent kernels can limit it): as long as there is free memory, the kernel will use as much of it as possible for the cache. You can see the actual cache usage in the output of the free command.
At write() time, if the data were not copied into kernel space, it could not be written to the storage device at all. (Dedicated zero-copy schemes are the exception, of course.)
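One of the "dedicated zero-copy schemes" mentioned here is sendfile(), which copies data from the source file's page cache to the destination fd entirely inside the kernel. A minimal sketch of my own (file names invented; writing to a regular file as the destination needs a reasonably recent kernel):

/* sendfile() moves data in-kernel, with no round trip through a user buffer. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int in  = open("/tmp/src.dat", O_RDONLY);
    int out = open("/tmp/dst.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    struct stat st;

    if (in < 0 || out < 0 || fstat(in, &st) < 0)
        return 1;

    off_t off = 0;
    ssize_t n = sendfile(out, in, &off, st.st_size);  /* in-kernel copy */
    printf("sendfile moved %zd bytes\n", n);

    close(in);
    close(out);
    return 0;
}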
Reply to humjb_1983 (#3)
There have always been differing opinions about limiting the page cache size, so even though the patch has been around for a long time it has never made it into mainline. Many vendors have merged it themselves, SUSE for example.
瀚海书香 posted on 2014-04-11 15:37
Reply to humjb_1983 (#3)
There have always been differing opinions about limiting the page cache size, so even though the patch has been around for a long time it has never ...
Heh, I had simply noticed that SUSE already ships this feature; I haven't been following mainline. Thanks for the reminder.
At the moment we control page cache usage through the memory watermarks, but that approach is quite limited.
Reply to humjb_1983 (#5)
At the moment we control page cache usage through the memory watermarks, but that approach is quite limited.
You could leave that to the application developers :mrgreen:
posix_fadvise
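For example, an application that knows it will not reuse a file's data can drop its cached pages itself. A minimal sketch of the posix_fadvise() approach (file name invented) might look like this:

/* After flushing a file, hint that its page cache pages can be dropped. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/big_output.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;

    /* ... write the data ... */

    fsync(fd);   /* DONTNEED only drops clean pages, so flush dirty data first */
    int err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (err != 0)
        fprintf(stderr, "posix_fadvise failed: %d\n", err);

    close(fd);
    return 0;
}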
Thanks, I will look into it some more~~
瀚海书香 posted on 2014-04-11 15:37
Reply to humjb_1983 (#3)
There have always been differing opinions about limiting the page cache size, so even though the patch has been around for a long time it has never ...
瀚海兄, where can I find this patch? And what exactly are the "differing opinions" about? Is it unstable? Are there any links I could read? Thanks!
Reply to humjb_1983 (#8)
I think I saw it on LWN at the time, but I just looked and could not find it again. The patch is so old that it is no longer in my mailing-list archive either. The objection back then was that this option goes against Linux's policy of using memory as aggressively as possible: the opponents argued that the VM should use every bit of memory to speed up file access.
The stability of the patch is not a problem; here is the SUSE version of it:
From: Markus Guertler <mguertler@novell.com>
Subject: Introduce (optional) pagecache limit
References: FATE309111
Patch-mainline: Never
There are apps that consume lots of memory and touch some of their
pages very infrequently; yet those pages are very important for the
overall performance of the app and should not be paged out in favor
of pagecache. The kernel can't know this and takes the wrong decisions,
even with low swappiness values.
This sysctl allows to set a limit for the non-mapped page cache;
non-mapped meaning that it will not affect shared memory or files
that are mmap()ed -- just anonymous file system cache.
Above this limit, the kernel will always consider removing pages from
the page cache first.
The limit that ends up being enforced is dependent on free memory;
if we have lots of it, the effective limit is much higher -- only when
the free memory gets scarce, we'll become strict about anonymous
page cache. This should make the setting much more attractive to use.
Signed-off-by: Kurt Garloff <garloff@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Michal Hocko <mhocko@suse.cz>
Index: linux-3.0-SLE11-SP2-3.0/Documentation/vm/pagecache-limit
===================================================================
--- /dev/null
+++ linux-3.0-SLE11-SP2-3.0/Documentation/vm/pagecache-limit
@@ -0,0 +1,51 @@
+Functionality:
+-------------
+The patch introduces a new tunable in the proc filesystem:
+
+/proc/sys/vm/pagecache_limit_mb
+
+This tunable sets a limit to the unmapped pages in the pagecache in megabytes.
+If non-zero, it should not be set below 4 (4MB), or the system might behave erratically. In real-life, much larger limits (a few percent of system RAM / a hundred MBs) will be useful.
+
+Examples:
+echo 512 >/proc/sys/vm/pagecache_limit_mb
+
+This sets a baseline limits for the page cache (not the buffer cache!) of 0.5GiB.
+As we only consider pagecache pages that are unmapped, currently mapped pages (files that are mmap'ed such as e.g. binaries and libraries as well as SysV shared memory) are not limited by this.
+NOTE: The real limit depends on the amount of free memory. Every existing free page allows the page cache to grow 8x the amount of free memory above the set baseline. As soon as the free memory is needed, we free up page cache.
+
+
+How it works:
+------------
+The heart of this patch is a new function called shrink_page_cache(). It is called from balance_pgdat (which is the worker for kswapd) if the pagecache is above the limit.
+The function is also called in __alloc_pages_slowpath.
+
+shrink_page_cache() calculates the nr of pages the cache is over its limit. It reduces this number by a factor (so you have to call it several times to get down to the target) then shrinks the pagecache (using the Kernel LRUs).
+
+shrink_page_cache does several passes:
+- Just reclaiming from inactive pagecache memory.
+This is fast -- but it might not find enough free pages; if that happens,
+the second pass will happen
+- In the second pass, pages from active list will also be considered.
+- The third pass is just another round of the second pass
+
+In all passes, only unmapped pages will be considered.
+
+
+How it changes memory management:
+--------------------------------
+If the pagecache_limit_mb is set to zero (default), nothing changes.
+
+If set to a positive value, there will be three different operating modes:
+(1) If we still have plenty of free pages, the pagecache limit will NOT be enforced. Memory management decisions are taken as normally.
+(2) However, as soon someone consumes those free pages, we'll start freeing pagecache -- as those are returned to the free page pool, freeing a few pages from pagecache will return us to state (1) -- if however someone consumes these free pages quickly, we'll continue freeing up pages from the pagecache until we reach pagecache_limit_mb.
+(3) Once we are at or below the low watermark, pagecache_limit_mb, the pages in the page cache will be governed by normal paging memory management decisions; if it starts growing above the limit (corrected by the free pages), we'll free some up again.
+
+This feature is useful for machines that have large workloads, carefully sized to eat most of the memory. Depending on the applications page access pattern, the kernel may too easily swap the application memory out in favor of pagecache. This can happen even for low values of swappiness. With this feature, the admin can tell the kernel that only a certain amount of pagecache is really considered useful and that it otherwise should favor the applications memory.
+
+
+Foreground vs. background shrinking:
+-----------------------------------
+
+Usually, the Linux kernel reclaims its memory using the kernel thread kswapd. It reclaims memory in the background. If it can't reclaim memory fast enough, it retries with higher priority and if this still doesn't succeed it uses a direct reclaim path.
+
Index: linux-3.0-SLE11-SP2-3.0/include/linux/pagemap.h
===================================================================
--- linux-3.0-SLE11-SP2-3.0.orig/include/linux/pagemap.h
+++ linux-3.0-SLE11-SP2-3.0/include/linux/pagemap.h
@@ -12,6 +12,7 @@
#include <asm/uaccess.h>
#include <linux/gfp.h>
#include <linux/bitops.h>
+#include <linux/swap.h>
#include <linux/hardirq.h> /* for in_interrupt() */
#include <linux/hugetlb_inline.h>
Index: linux-3.0-SLE11-SP2-3.0/include/linux/swap.h
===================================================================
--- linux-3.0-SLE11-SP2-3.0.orig/include/linux/swap.h
+++ linux-3.0-SLE11-SP2-3.0/include/linux/swap.h
@@ -262,6 +262,10 @@ extern unsigned long mem_cgroup_shrink_n
extern int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
+#define FREE_TO_PAGECACHE_RATIO 8
+extern unsigned long pagecache_over_limit(void);
+extern void shrink_page_cache(gfp_t mask, struct page *page);
+extern unsigned int vm_pagecache_limit_mb;
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern long vm_total_pages;
Index: linux-3.0-SLE11-SP2-3.0/kernel/sysctl.c
===================================================================
--- linux-3.0-SLE11-SP2-3.0.orig/kernel/sysctl.c
+++ linux-3.0-SLE11-SP2-3.0/kernel/sysctl.c
@@ -1126,6 +1126,13 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &one_hundred,
},
+ {
+ .procname = "pagecache_limit_mb",
+ .data = &vm_pagecache_limit_mb,
+ .maxlen = sizeof(vm_pagecache_limit_mb),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
#ifdef CONFIG_HUGETLB_PAGE
{
.procname = "nr_hugepages",
Index: linux-3.0-SLE11-SP2-3.0/mm/filemap.c
===================================================================
--- linux-3.0-SLE11-SP2-3.0.orig/mm/filemap.c
+++ linux-3.0-SLE11-SP2-3.0/mm/filemap.c
@@ -507,6 +507,9 @@ int add_to_page_cache(struct page *page,
{
int error;
+ if (unlikely(vm_pagecache_limit_mb) && pagecache_over_limit() > 0)
+ shrink_page_cache(gfp_mask, page);
+
__set_page_locked(page);
error = add_to_page_cache_locked(page, mapping, offset, gfp_mask);
if (unlikely(error))
Index: linux-3.0-SLE11-SP2-3.0/mm/page_alloc.c
===================================================================
--- linux-3.0-SLE11-SP2-3.0.orig/mm/page_alloc.c
+++ linux-3.0-SLE11-SP2-3.0/mm/page_alloc.c
@@ -5604,6 +5604,25 @@ out:
spin_unlock_irqrestore(&zone->lock, flags);
}
+/* Returns a number that's positive if the pagecache is above
+ * the set limit. Note that we allow the pagecache to grow
+ * larger if there's plenty of free pages.
+ */
+unsigned long pagecache_over_limit()
+{
+ /* We only want to limit unmapped page cache pages */
+ unsigned long pgcache_pages = global_page_state(NR_FILE_PAGES)
+ - global_page_state(NR_FILE_MAPPED);
+ unsigned long free_pages = global_page_state(NR_FREE_PAGES);
+ unsigned long limit;
+
+ limit = vm_pagecache_limit_mb * ((1024*1024UL)/PAGE_SIZE) +
+ FREE_TO_PAGECACHE_RATIO * free_pages;
+ if (pgcache_pages > limit)
+ return pgcache_pages - limit;
+ return 0;
+}
+
#ifdef CONFIG_MEMORY_HOTREMOVE
/*
* All pages in the range must be isolated before calling this.
Index: linux-3.0-SLE11-SP2-3.0/mm/shmem.c
===================================================================
--- linux-3.0-SLE11-SP2-3.0.orig/mm/shmem.c
+++ linux-3.0-SLE11-SP2-3.0/mm/shmem.c
@@ -1036,6 +1036,10 @@ uncharge:
mem_cgroup_uncharge_cache_page(page);
if (found < 0)
error = found;
+ else if (found > 0) {
+ if (unlikely(vm_pagecache_limit_mb) && pagecache_over_limit() > 0)
+ shrink_page_cache(GFP_KERNEL, page);
+ }
out:
unlock_page(page);
page_cache_release(page);
Index: linux-3.0-SLE11-SP2-3.0/mm/vmscan.c
===================================================================
--- linux-3.0-SLE11-SP2-3.0.orig/mm/vmscan.c
+++ linux-3.0-SLE11-SP2-3.0/mm/vmscan.c
@@ -148,8 +148,9 @@ struct scan_control {
/*
* From 0 .. 100.Higher means more swappy.
*/
-int vm_swappiness = 60;
-long vm_total_pages; /* The total number of pages which the VM controls */
+int vm_swappiness __read_mostly = 60;
+unsigned int vm_pagecache_limit_mb __read_mostly = 0;
+long vm_total_pages __read_mostly; /* The total number of pages which the VM controls */
static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);
@@ -2363,6 +2364,8 @@ static bool sleeping_prematurely(pg_data
return !all_zones_ok;
}
+static void __shrink_page_cache(gfp_t mask);
+
/*
* For kswapd, balance_pgdat() will work across all this node's zones until
* they are all at high_wmark_pages(zone).
@@ -2418,6 +2421,10 @@ loop_again:
sc.may_writepage = !laptop_mode;
count_vm_event(PAGEOUTRUN);
+ /* this reclaims from all zones so don't count to sc.nr_reclaimed */
+ if (unlikely(vm_pagecache_limit_mb) && pagecache_over_limit() > 0)
+ __shrink_page_cache(GFP_KERNEL);
+
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
unsigned long lru_pages = 0;
int has_under_min_watermark_zone = 0;
@@ -2587,6 +2594,12 @@ loop_again:
}
out:
+ /* We do not need to loop_again if we have not achieved our
+ * pagecache target (i.e. && pagecache_over_limit(0) > 0) because
+ * the limit will be checked next time a page is added to the page
+ * cache. This might cause a short stall but we should rather not
+ * keep kswapd awake.
+ */
/*
* order-0: All zones must meet high watermark for a balanced node
* high-order: Balanced zones must make up at least 25% of the node
@@ -2900,6 +2913,160 @@ unsigned long shrink_all_memory(unsigned
}
#endif /* CONFIG_HIBERNATION */
+
+/*
+ * We had to resurect this function for __shrink_page_cache (upstream has
+ * removed it and reworked shrink_all_memory by 7b51755c).
+ *
+ * Tries to reclaim 'nr_pages' pages from LRU lists system-wide, for given
+ * pass and priority.
+ *
+ * For pass > 3 we also try to shrink the LRU lists that contain a few pages
+ */
+static void shrink_all_zones(unsigned long nr_pages, int prio,
+ int pass, struct scan_control *sc)
+{
+ struct zone *zone;
+ unsigned long nr_reclaimed = 0;
+
+ for_each_populated_zone(zone) {
+ enum lru_list l;
+
+ if (zone->all_unreclaimable && prio != DEF_PRIORITY)
+ continue;
+
+ for_each_evictable_lru(l) {
+ enum zone_stat_item ls = NR_LRU_BASE + l;
+ unsigned long lru_pages = zone_page_state(zone, ls);
+
+ /* For pass = 0, we don't shrink the active list */
+ if (pass == 0 && (l == LRU_ACTIVE_ANON ||
+ l == LRU_ACTIVE_FILE))
+ continue;
+
+ /* Original code relied on nr_saved_scan which is no
+ * longer present so we are just considering LRU pages.
+ * This means that the zone has to have quite large
+ * LRU list for default priority and minimum nr_pages
+ * size (8*SWAP_CLUSTER_MAX). In the end we will tend
+ * to reclaim more from large zones wrt. small.
+ * This should be OK because shrink_page_cache is called
+ * when we are getting to short memory condition so
+ * LRUs tend to be large.
+ */
+ if (((lru_pages >> prio) + 1) >= nr_pages || pass > 3) {
+ unsigned long nr_to_scan;
+
+ nr_to_scan = min(nr_pages, lru_pages);
+ /* shrink_list takes lru_lock with IRQ off so we
+ * should be careful about really huge nr_to_scan
+ */
+ nr_reclaimed += shrink_list(l, nr_to_scan, zone,
+ sc, prio);
+ if (nr_reclaimed >= nr_pages) {
+ sc->nr_reclaimed += nr_reclaimed;
+ return;
+ }
+ }
+ }
+ }
+ sc->nr_reclaimed += nr_reclaimed;
+}
+
+/*
+ * Function to shrink the page cache
+ *
+ * This function calculates the number of pages (nr_pages) the page
+ * cache is over its limit and shrinks the page cache accordingly.
+ *
+ * The maximum number of pages, the page cache shrinks in one call of
+ * this function is limited to SWAP_CLUSTER_MAX pages. Therefore it may
+ * require a number of calls to actually reach the vm_pagecache_limit_kb.
+ *
+ * This function is similar to shrink_all_memory, except that it may never
+ * swap out mapped pages and only does two passes.
+ */
+static void __shrink_page_cache(gfp_t mask)
+{
+ unsigned long ret = 0;
+ int pass;
+ struct reclaim_state reclaim_state;
+ struct scan_control sc = {
+ .gfp_mask = mask,
+ .may_swap = 0,
+ .may_unmap = 0,
+ .may_writepage = 0,
+ .swappiness = vm_swappiness,
+ };
+ struct shrink_control shrink = {
+ .gfp_mask = mask,
+ };
+ struct reclaim_state *old_rs = current->reclaim_state;
+ long nr_pages;
+
+ /* How many pages are we over the limit?
+ * But don't enforce limit if there's plenty of free mem */
+ nr_pages = pagecache_over_limit();
+
+ /* Don't need to go there in one step; as the freed
+ * pages are counted FREE_TO_PAGECACHE_RATIO times, this
+ * is still more than minimally needed. */
+ nr_pages /= 2;
+
+ /* Return early if there's no work to do */
+ if (nr_pages <= 0)
+ return;
+ /* But do a few at least */
+ nr_pages = max_t(unsigned long, nr_pages, 8*SWAP_CLUSTER_MAX);
+
+ current->reclaim_state = &reclaim_state;
+
+ /*
+ * Shrink the LRU in 2 passes:
+ * 0 = Reclaim from inactive_list only (fast)
+ * 1 = Reclaim from active list but don't reclaim mapped (not that fast)
+ * 2 = Reclaim from active list but don't reclaim mapped (2nd pass)
+ */
+ for (pass = 0; pass < 2; pass++) {
+ int prio;
+
+ for (prio = DEF_PRIORITY; prio >= 0; prio--) {
+ unsigned long nr_to_scan = nr_pages - ret;
+
+ sc.nr_scanned = 0;
+ /* sc.swap_cluster_max = nr_to_scan; */
+ shrink_all_zones(nr_to_scan, prio, pass, &sc);
+ ret += sc.nr_reclaimed;
+ if (ret >= nr_pages)
+ goto out;
+
+ reclaim_state.reclaimed_slab = 0;
+ shrink_slab(&shrink, sc.nr_scanned,
+ global_reclaimable_pages());
+ ret += reclaim_state.reclaimed_slab;
+
+ if (ret >= nr_pages)
+ goto out;
+
+ }
+ }
+
+out:
+ current->reclaim_state = old_rs;
+}
+
+void shrink_page_cache(gfp_t mask, struct page *page)
+{
+ /* FIXME: As we only want to get rid of non-mapped pagecache
+ * pages and we know we have too many of them, we should not
+ * need kswapd. */
+ /*
+ wakeup_kswapd(page_zone(page), 0);
+ */
+
+ __shrink_page_cache(mask);
+}
+
/* It's optimal to keep kswapds on the same CPUs as their memory, but
not required for correctness.So if the last cpu in a node goes
away, we get changed to run anywhere: as the first one comes back,
瀚海书香 posted on 2014-05-14 10:34
Reply to humjb_1983 (#8)
I think I saw it on LWN at the time, but I just looked and could not find it again. The patch is so old that it is no longer in my mailing ...
Thanks for the support, 瀚海兄~
From our experience, the current page cache behaviour (no limit) really does cause problems in certain scenarios, for example:
1. When the kernel allocates memory with the ATOMIC flag it cannot reclaim the cache, so if free memory is low the allocation fails even though the system actually holds plenty of cached memory. This is easy to hit precisely because the cache is unlimited.
2. When the workload is sensitive to allocation latency, an unlimited cache means the cache is often reclaimed only at the moment the memory is actually needed, and that reclaim is synchronous, which adds latency to the allocation and directly hurts performance. With the SUSE patch the cache is, as far as I can tell, reclaimed asynchronously, so this problem should not arise.
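For reference, here is a small sketch of my own showing how the tunable from the SUSE patch above could be exercised. It assumes a kernel that actually carries the patch, root privileges, and an example limit of 512 MB; it simply writes /proc/sys/vm/pagecache_limit_mb and then reads the Cached figure from /proc/meminfo to watch the effect.

/* Set the (patch-provided) page cache limit and report current cache usage. */
#include <stdio.h>

static long cached_kb(void)
{
    char line[256];
    long kb = -1;
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "Cached: %ld kB", &kb) == 1)
            break;
    fclose(f);
    return kb;
}

int main(void)
{
    FILE *f = fopen("/proc/sys/vm/pagecache_limit_mb", "w");
    if (!f) {
        perror("pagecache_limit_mb (needs the patched kernel and root)");
        return 1;
    }
    fprintf(f, "512\n");   /* example limit: 512 MB of unmapped page cache */
    fclose(f);

    printf("Cached now: %ld kB\n", cached_kb());
    return 0;
}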