A question about mmap and msync

Last edited by yangPSO on 2014-11-24 17:03.

In an earlier thread,
http://bbs.chinaunix.net/thread-4161518-1-1.html
which discussed mutual exclusion between file reads and writes, another question occurred to me. I would like to ask everyone here:
====================================
[The following is based on linux-2.6.32.]
Suppose an application maps a file into its virtual address space with mmap.
When the program first writes to that address range, a page fault occurs, triggering the following call chain:

handle_pte_fault
  do_linear_fault
    __do_fault
      vma->vm_ops->page_mkwrite    // taking xfs as an example; it implements this callback
        xfs_vm_page_mkwrite
          block_page_mkwrite
            __block_page_mkwrite
              set_page_dirty
                __set_page_dirty_buffers
                  __set_page_dirty
                    radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY);

This tags the newly allocated page as DIRTY in the radix tree.
If the flusher thread then writes this file back at some point, the call chain is:

bdi_start_fn
  wb_do_writeback
    wb_writeback
      __writeback_inodes_sb
        writeback_sb_inodes
          writeback_single_inode
            ret = do_writepages(mapping, wbc);
              ret = mapping->a_ops->writepages(mapping, wbc);
                xfs_vm_writepages
                  generic_writepages
                    write_cache_pages
                      ret = (*writepage)(page, wbc, data);
                        xfs_vm_writepage
                          xfs_start_page_writeback
                            set_page_writeback
                              test_set_page_writeback
                                radix_tree_tag_clear(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY);

which removes the page's DIRTY tag in the radix tree.
After that, the process writes the same page again via memcpy. No page fault occurs this time, yet the page still carries no DIRTY tag in the radix tree.
Finally, the process calls msync to flush the mapped address range, leading to:

sys_msync
  vfs_fsync
    vfs_fsync_range
      filemap_write_and_wait_range
        __filemap_fdatawrite_range
          ret = do_writepages(mapping, &wbc);
            ret = mapping->a_ops->writepages(mapping, wbc);
              xfs_vm_writepages
                generic_writepages
                  write_cache_pages
                    pagevec_lookup_tag

But pagevec_lookup_tag cannot find the page above, because it carries no DIRTY tag. So we would end up with a dirty page that msync fails to flush.
Is that actually what happens?

--------------------------------
镇水铁牛 (last edited 2014-11-24 22:52):
When the second memcpy happens, it is still writing the page cache; something like the following may be involved:

struct address_space *mapping = file->f_mapping;
const struct address_space_operations *a_ops = mapping->a_ops;

I think the flow may well be the same; the details are in the implementation of mark_buffer_dirty.
/*
* Mark the page dirty, and set it dirty in the radix tree, and mark the inode
* dirty.
*
* If warn is true, then emit a warning if the page is not uptodate and has
* not been truncated.
*/
static int __set_page_dirty(struct page *page, struct address_space *mapping, int warn)

--------------------------------
humjb_1983:

> yangPSO wrote on 2014-11-24 17:03:
> In an earlier thread,
> http://bbs.chinaunix.net/thread-4161518-1-1.html
> which discussed mutual exclusion between file reads and writes ...

Heh, I had the same question before; it was discussed back then, see:
http://bbs.chinaunix.net/forum.php?mod=viewthread&tid=4142500&highlight=
"When the kernel modifies a page, the MMU automatically sets the dirty bit in the page table entry. During reclaim and writeback, reverse mapping transfers that bit into the page descriptor's flags ... and then the kernel can tell whether a page is dirty."

--------------------------------
yangPSO, replying to #3 humjb_1983:
I read through the code, and my reasoning is as follows; I do not know whether it matches your earlier discussion:
================================
The mmap system call results in this call chain:

sys_mmap          // sys_x86_64.c (arch/x86/kernel)
  sys_mmap_pgoff  // util.c (mm)
    do_mmap_pgoff // mmap.c (mm)
      mmap_region

First, do_mmap_pgoff() sets the vm_flags. If the file was opened writable and mapped shared, vm_flags contains at least:

VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC | VM_SHARED
Next, look at mmap_region(), which contains the following fragment:

	vma->vm_flags = vm_flags;
	vma->vm_page_prot = vm_get_page_prot(vm_flags);
	...
	if (vma_wants_writenotify(vma))
		vma->vm_page_prot = vm_get_page_prot(vm_flags & ~VM_SHARED);
First, consider the function vma_wants_writenotify() called here; it begins as follows:

/*
 * Some shared mappings will want the pages marked read-only
 * to track write events. If so, we'll downgrade vm_page_prot
 * to the private version (using protection_map[] without the
 * VM_SHARED bit).
 */
int vma_wants_writenotify(struct vm_area_struct *vma)
{
	unsigned int vm_flags = vma->vm_flags;

	/* If it was private or non-writable, the write bit is already clear */
	if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
		return 0;

	/* The backer wishes to know when pages are first written to? */
	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
		return 1;

So for xfs, which implements page_mkwrite, vma_wants_writenotify() returns 1.
Now look at the definition of vm_get_page_prot():

/* description of effects of mapping type and prot in current implementation.
 * this is due to the limited x86 page protection hardware.  The expected
 * behavior is in parens:
 *
 * map_type    prot
 *             PROT_NONE     PROT_READ     PROT_WRITE     PROT_EXEC
 * MAP_SHARED  r: (no) no    r: (yes) yes  r: (no) yes    r: (no) yes
 *             w: (no) no    w: (no) no    w: (yes) yes   w: (no) no
 *             x: (no) no    x: (no) yes   x: (no) yes    x: (yes) yes
 *
 * MAP_PRIVATE r: (no) no    r: (yes) yes  r: (no) yes    r: (no) yes
 *             w: (no) no    w: (no) no    w: (copy) copy w: (no) no
 *             x: (no) no    x: (no) yes   x: (no) yes    x: (yes) yes
 */
pgprot_t protection_map[16] = {
	__P000, __P001, __P010, __P011, __P100, __P101, __P110, __P111,
	__S000, __S001, __S010, __S011, __S100, __S101, __S110, __S111
};

pgprot_t vm_get_page_prot(unsigned long vm_flags)
{
	return __pgprot(pgprot_val(protection_map[vm_flags &
				(VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]) |
			pgprot_val(arch_vm_get_page_prot(vm_flags)));
}

With VM_SHARED stripped as above, vm_get_page_prot() should return __P111 (assuming a readable, writable, executable mapping).
On x86, __P000 and friends are defined as follows:

					/* xwr */
#define __P000	PAGE_NONE
#define __P001	PAGE_READONLY
#define __P010	PAGE_COPY
#define __P011	PAGE_COPY
#define __P100	PAGE_READONLY_EXEC
#define __P101	PAGE_READONLY_EXEC
#define __P110	PAGE_COPY_EXEC
#define __P111	PAGE_COPY_EXEC

#define __S000	PAGE_NONE
#define __S001	PAGE_READONLY
#define __S010	PAGE_SHARED
#define __S011	PAGE_SHARED
#define __S100	PAGE_READONLY_EXEC
#define __S101	PAGE_READONLY_EXEC
#define __S110	PAGE_SHARED_EXEC
#define __S111	PAGE_SHARED_EXEC

So what is returned here is PAGE_COPY_EXEC:

#define PAGE_COPY_EXEC	__pgprot(_PAGE_PRESENT | _PAGE_USER | \
				 _PAGE_ACCESSED)

That is, the page is present but not writable (no _PAGE_RW). A write therefore triggers a write-protection fault, which ends up in do_wp_page().
do_wp_page() contains the following fragments:

	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
					(VM_WRITE|VM_SHARED))) {
		...
		tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
		...
		reuse = 1;
	}
	...
	if (reuse) {
reuse:
		flush_cache_page(vma, address, pte_pfn(orig_pte));
		entry = pte_mkyoung(orig_pte);
		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
		if (ptep_set_access_flags(vma, address, page_table, entry, 1))
			update_mmu_cache(vma, address, entry);
		ret |= VM_FAULT_WRITE;
		goto unlock;
	}

	/*
	 * Ok, we need to copy. Oh, well..
	 */
	...
	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
	...
unlock:
	pte_unmap_unlock(page_table, ptl);

Here the page is marked dirty again through page_mkwrite, and no new page is allocated. So the mapping remains write-protected, and any further memcpy would go through do_wp_page() again.
In short, write protection is what makes memcpy writes to the mmap'ed mapping trackable.
--------------------------------
humjb_1983:

> yangPSO wrote on 2014-11-25 21:06:
> Reply to #3 humjb_1983: I read through the code, and my reasoning is as follows; I do not know whether it matches your earlier discussion ...

The flow you describe differs from what we discussed earlier. I suggest adding some trace points so you can confirm the actual flow.

--------------------------------
yangPSO (last edited 2014-11-26 23:10):
Reply to #5 humjb_1983:
What I said above is indeed wrong; the page_mkwrite callback is only meant to catch the first write.
In fact, the code in __do_fault() sets the _PAGE_RW flag on the pte, so later writes do not hit write protection:
	if (likely(pte_same(*page_table, orig_pte))) {
		flush_icache_page(vma, page);
		entry = mk_pte(page, vma->vm_page_prot);
		if (flags & FAULT_FLAG_WRITE)
			entry = maybe_mkwrite(pte_mkdirty(entry), vma);	/* <== note the maybe_mkwrite here */
		if (anon) {
			inc_mm_counter(mm, anon_rss);
			page_add_new_anon_rmap(page, vma, address);
		} else {
			inc_mm_counter(mm, file_rss);
			page_add_file_rmap(page);
			if (flags & FAULT_FLAG_WRITE) {
				dirty_page = page;
				get_page(dirty_page);
			}
		}
		set_pte_at(mm, address, page_table, entry);

		/* no need to invalidate: a not-present page won't be cached */
		update_mmu_cache(vma, address, entry);
	} else {
		...

When munmap is executed, the call chain is:
sys_munmap
  vm_munmap
    do_munmap
      unmap_region
        unmap_vmas
          unmap_single_vma
            unmap_page_range
              zap_pud_range
                zap_pmd_range
                  zap_pte_range
                    page_remove_rmap
                    if (pte_dirty(ptent))
                        set_page_dirty(page);
But before munmap is executed, how does the dirty bit in the page table entry get propagated to the page?
If it is not propagated before munmap, wouldn't the msync system call be pointless?

--------------------------------
镇水铁牛:

Going by the posts above, I would guess that msync might be a SCSI sync operation, i.e. SCSI command 0x35, which forces the data to be synchronized immediately. As long as the device does not suddenly lose power, it is probably fine even without issuing it.

--------------------------------
yangPSO, replying to #7 镇水铁牛:
The msync I am talking about here is the ordinary system call; it has no direct relation to SCSI.
The msync system call writes back the page cache within the vma range mapped by mmap, but only on the premise that those pages are dirty.
--------------------------------
humjb_1983:

> yangPSO wrote on 2014-11-27 08:55:
> Reply to #7 镇水铁牛: the msync I am talking about here is the ordinary system call; it has no direct relation to SCSI. The msync system call writes back the page cache within the vma range mapped by mmap ...

I added trace points with systemtap (my environment uses ext4). It confirms that the writeback path calls set_page_dirty based on the pte's dirty bit. The flow is:
Returning from:0xffffffff811292b0 : set_page_dirty+0x0/0x70
Returning to:0xffffffff81129498 : clear_page_dirty_for_io+0x108/0x130
0xffffffffa02306af : write_cache_pages_da+0x1cf/0x490
0xffffffffa0230c53 : ext4_da_writepages+0x2e3/0x670
0xffffffff8112aa31 : do_writepages+0x21/0x40
0xffffffff81114c0b : __filemap_fdatawrite_range+0x5b/0x60
0xffffffff81114c6a : filemap_write_and_wait_range+0x5a/0x90
0xffffffff811aa50e : vfs_fsync_range+0x7e/0xe0
0xffffffff811aa5dd : vfs_fsync+0x1d/0x20
0xffffffff811aa61e : do_fsync+0x3e/0x60
0xffffffff811aa670 : sys_fsync+0x10/0x20
0xffffffff8100b0f2 : system_call_fastpath+0x16/0x1b
The key is in clear_page_dirty_for_io. It looks as though it is about to clear the dirty state, but it actually sets it again first. The logic looks a bit convoluted, but it is there to avoid races; see the comment:
int clear_page_dirty_for_io(struct page *page)
{
	struct address_space *mapping = page_mapping(page);

	BUG_ON(!PageLocked(page));

	if (mapping && mapping_cap_account_dirty(mapping)) {
		/*
		 * Yes, Virginia, this is indeed insane.
		 *
		 * We use this sequence to make sure that
		 *  (a) we account for dirty stats properly
		 *  (b) we tell the low-level filesystem to
		 *      mark the whole page dirty if it was
		 *      dirty in a pagetable. Only to then
		 *  (c) clean the page again and return 1 to
		 *      cause the writeback.
		 *
		 * This way we avoid all nasty races with the
		 * dirty bit in multiple places and clearing
		 * them concurrently from different threads.
		 *
		 * Note! Normally the "set_page_dirty(page)"
		 * has no effect on the actual dirty bit - since
		 * that will already usually be set. But we
		 * need the side effects, and it can help us
		 * avoid races.
		 *
		 * We basically use the page "master dirty bit"
		 * as a serialization point for all the different
		 * threads doing their things.
		 */
		if (page_mkclean(page))
			set_page_dirty(page);

--------------------------------
yangPSO (last edited 2014-11-29 18:14):
Reply to #9 humjb_1983:
Thanks. I verified it as well, and it is indeed the case.
I am just not sure how "avoiding races" should be understood here. Also, there is the following call chain:

clear_page_dirty_for_io
  page_mkclean
    page_mkclean_file
      page_mkclean_one
        entry = pte_wrprotect(entry);

which sets write protection again. What is the intent of that?