
Post #1

Posted 2010-04-17 13:28

Last edited by kouu on 2010-04-17 13:30

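I recently noticed the following behavior: after a file is mapped with mmap, if another process shrinks the file, then accessing memory that lies inside the mapping but beyond the new end of the file makes the process receive SIGBUS and exit.

Consider two processes, A and B.
Process A maps a regular file with mmap. Let the start address of the mapping be p and its size be a (in units of the page size; the same unit is used below).
Process B shrinks the file to size b (b < a).
Now, when process A reads the memory at p+n (b <= n < a), an address whose vma still exists but which lies beyond the end of the file, the kernel raises SIGBUS and process A exits.
(Some code is attached below.)

This shows that accessing a file through mmap is very dangerous. As soon as another process modifies the file (by editing it, overwriting it with cp, and so on), a process that accesses the file through mmap may exit unexpectedly because of SIGBUS.

There are two questions I would like to discuss:

1. Can this behavior be avoided?
The only approach I can think of is to put a mandatory lock on the file so that other processes cannot modify it. Are there other ways?

2. When the process reads the memory at p+n, why does the kernel send SIGBUS? Given that p+n lies inside a valid mapping, what would go wrong if the kernel mapped a zero page (or something similar) for the process and let it read some meaningless data?

Any advice is welcome. Thanks a lot.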
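On question 1, one commonly used mitigation (not proposed in this thread, so treat it as an assumption) is to catch SIGBUS around each access to the mapping and turn it into an error return. The sketch below does this with sigsetjmp()/siglongjmp(). The names guarded_copy, on_sigbus and demo_truncated_read are hypothetical, the code is not thread-safe, and it assumes /tmp is writable.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* All names below are hypothetical helpers, not from the thread.
 * Not thread-safe: one global jump buffer is shared. */
static sigjmp_buf sigbus_jmp;

static void on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(sigbus_jmp, 1);      /* resume inside guarded_copy() */
}

/* Copy len bytes out of a mapping; return 0 on success, -1 if the
 * access raised SIGBUS (for example, the file was truncated). */
int guarded_copy(void *dst, const void *src, size_t len)
{
    struct sigaction sa, old;
    int rc = 0;

    assert(dst != NULL && src != NULL);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, &old) != 0)
        return -1;

    if (sigsetjmp(sigbus_jmp, 1) == 0) {
        const volatile unsigned char *s = src;
        unsigned char *d = dst;
        for (size_t i = 0; i < len; i++)
            d[i] = s[i];            /* may fault beyond the end of the file */
    } else {
        rc = -1;                    /* reached through siglongjmp() */
    }
    sigaction(SIGBUS, &old, NULL);  /* restore the previous handler */
    return rc;
}

/* Demo: map two pages, shrink the file to 0, then read the second page.
 * Returns what guarded_copy() reports, or 1 on a setup error. */
int demo_truncated_read(void)
{
    char path[] = "/tmp/sigbus_guard_XXXXXX";
    long pg = sysconf(_SC_PAGESIZE);
    int fd = mkstemp(path);
    unsigned char c;
    int rc;

    if (fd < 0)
        return 1;
    unlink(path);                   /* the open fd keeps the file alive */
    if (ftruncate(fd, 2 * pg) != 0) { close(fd); return 1; }
    unsigned char *p = mmap(NULL, 2 * pg, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return 1; }
    if (guarded_copy(&c, p + pg, 1) != 0 || ftruncate(fd, 0) != 0)
        rc = 1;
    else
        rc = guarded_copy(&c, p + pg, 1);
    munmap(p, 2 * pg);
    close(fd);
    return rc;
}
```

The process no longer dies, but the caller still has to decide what to do with the failed read, so this limits the damage rather than removing the underlying race.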

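Appendix: user-space test program:

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>

#define FILESIZE 8192

/* SIGBUS handler: report and exit. write() is used because printf()
 * is not async-signal-safe. */
void handle_sigbus(int sig)
{
    static const char msg[] = "SIGBUS!\n";
    (void)sig;
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
    _exit(0);
}

int main(void)
{
    int i;
    volatile char tmp;
    char *p;
    int fd = open("tmp.ttt", O_RDWR);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    p = (char *)mmap(NULL, FILESIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    signal(SIGBUS, handle_sigbus);
    getchar();              /* pause so that another process can shrink the file */
    for (i = 0; i < FILESIZE; i++) {
        tmp = p[i];         /* faults once i reaches a page beyond the end of the file */
    }
    printf("ok\n");
    return 0;
}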

Before running the program:

kouu@kouu-one:~/test$ stat tmp.ttt
File: "tmp.ttt"
Size: 239104 Blocks: 480 IO Block: 4096 regular file

Start the program. Clearly, 8192 bytes can be mapped. The program then stops at getchar().

kouu@kouu-one:~/test$ echo "" > tmp.ttt
kouu@kouu-one:~/test$ stat tmp.ttt
File: "tmp.ttt"
Size: 1 Blocks: 8 IO Block: 4096 regular file

tmp.ttt is now 1 byte long. Next, give the program some input so that it returns from getchar().

kouu@kouu-one:~/test$ ./a.out
SIGBUS!


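The program receives SIGBUS right away.

Appendix: a walk through the kernel code:

First, the mmap call path. In the most common case, a vma is allocated and associated with the corresponding file.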

mmap_region()

    ......
    vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
    ......
    if (file) {
        ......
        vma->vm_file = file;
        get_file(file);
        error = file->f_op->mmap(file, vma);
        ......
    } else if (vm_flags & VM_SHARED) {
        ......

The association is established through the file->f_op->mmap function, which in the common case is generic_file_mmap.

generic_file_mmap()

    ......
    vma->vm_ops = &generic_file_vm_ops;
    vma->vm_flags |= VM_CAN_NONLINEAR;
    ......

where:

struct vm_operations_struct generic_file_vm_ops = {
    .fault = filemap_fault,
};

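Next, when the corresponding virtual memory is accessed, a page fault is triggered. The kernel catches the fault and then takes care of allocating memory and reading the file.
do_page_fault is the kernel function that handles page faults. The kernel first checks that the faulting address is valid by looking up the vma that covers it (if none is found, the address is invalid). It then allocates memory and sets up the page tables. In the case described here, where mmap maps a file, the kernel must also read the data at the corresponding file position into the newly allocated memory. This is mainly done by vma->vm_ops->fault. We saw above how vma->vm_ops is assigned, so vma->vm_ops->fault is filemap_fault.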

filemap_fault()

    ......
    size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
    if (vmf->pgoff >= size)
        return VM_FAULT_SIGBUS;
    ......

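The first thing this function does is check whether the offset being accessed (relative to the file) lies beyond the end of the file. If it does, the function returns VM_FAULT_SIGBUS, which causes SIGBUS to be sent to the process. Note that the file size is rounded up to whole pages, so the check works at page granularity.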
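The page-granular check can be mirrored in user space to predict which offsets will fault. This is only a sketch of the arithmetic quoted above; the function names and the explicit page_size parameter are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages needed to cover file_size bytes, rounded up,
 * like the "size" computed in the excerpt above. */
uint64_t pages_covering(uint64_t file_size, uint64_t page_size)
{
    assert(page_size > 0);
    return (file_size + page_size - 1) / page_size;
}

/* Returns 1 if a fault at byte offset byte_off (measured from the start
 * of the file) would fail the "pgoff >= size" check, 0 otherwise. */
int fault_would_sigbus(uint64_t file_size, uint64_t byte_off,
                       uint64_t page_size)
{
    uint64_t pgoff = byte_off / page_size;
    return pgoff >= pages_covering(file_size, page_size);
}
```

With the 1-byte tmp.ttt from the test and 4096-byte pages, offsets 0 to 4095 still pass the check (the rest of the first page reads as zeros), and the first offset that fails is 4096.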

kouu

Post #23

Posted 2010-04-24 21:17

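VM_DENYWRITE is now effective only for executables. For a regular file, mmap does not prevent other processes from opening the file for writing (the OP can try it: using cp to overwrite an executable that is currently running will certainly fail with a "Text file busy" error).
augustusqing, posted 2010-04-24 13:10

I had noticed that behavior for a long time and meant to look into the reason, but kept forgetting... Now that augustusqing has pointed it out, it all makes sense. Thanks a lot.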

augustusqing

Post #22

Posted 2010-04-24 13:13

I traced O_TRUNC some time ago.

O_TRUNC trace:
Start from fs/open.c->filp_open(), with O_TRUNC set in flags:
struct file *filp_open(const char * filename, int flags, int mode) ->
open_namei(filename, namei_flags, mode, &nd)                   ->
   may_open(nd, acc_mode, flag)                               -> for O_TRUNC, this function makes one extra call compared with the other cases: do_truncate()
      do_truncate(dentry, 0)                                  -> the second argument is 0
      notify_change(dentry, &newattrs)                         -> the length 0 is stored in newattrs
         inode_setattr(inode, attr)                         -> sets the relevant inode attributes
            vmtruncate(inode, attr->ia_size)                -> truncates the vm mappings
vmtruncate() is the key function here. Its comment and its three most important statements are:
/*
* Handle all mappings that got truncated by a "truncate()"
* system call.
*
* NOTE! We have to be ready to update the memory sharing
* between the file and the memory map for a potential last
* incomplete page.  Ugly, but necessary.
*/
int vmtruncate(struct inode * inode, loff_t offset) /* offset is 0 here: truncating from 0 removes everything, i.e. the inode's whole page cache is emptied */
{
    struct address_space *mapping = inode->i_mapping; /* the inode's page cache structure; the next two functions are well commented */
    unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
    truncate_inode_pages(mapping, offset);
}

Comment for unmap_mapping_range():
/**
* unmap_mapping_range - unmap the portion of all mmaps
* in the specified address_space corresponding to the specified
* page range in the underlying file.
* @address_space: the address space containing mmaps to be unmapped.
* @holebegin: byte in first page to unmap, relative to the start of
* the underlying file.  This will be rounded down to a PAGE_SIZE
* boundary.  Note that this is different from vmtruncate(), which
* must keep the partial page.  In contrast, we must get rid of
* partial pages.
* @holelen: size of prospective hole in bytes.  This will be rounded
* up to a PAGE_SIZE boundary.  A holelen of zero truncates to the
* end of the file.
* @even_cows: 1 when truncating a file, unmap even private COWed pages;
* but 0 when invalidating pagecache, don't throw away private data.
*/
void unmap_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen, int even_cows)
{
}

Comment for truncate_inode_pages():
/**
* truncate_inode_pages - truncate *all* the pages from an offset
* @mapping: mapping to truncate
* @lstart: offset from which to truncate
*
* Truncate the page cache at a set offset, removing the pages that are beyond
* that offset (and zeroing out partial pages).
*
* Truncate takes two passes - the first pass is nonblocking.  It will not
* block on page locks and it will not block on writeback.  The second pass
* will wait.  This is to prevent as much IO as possible in the affected region.
* The first pass will remove most pages, so the search cost of the second pass
* is low.
*
* When looking at page->index outside the page lock we need to be careful to
* copy it into a local to avoid races (it could change at any time).
*
* We pass down the cache-hot hint to the page freeing code.  Even if the
* mapping is large, it is probably the case that the final pages are the most
* recently touched, and freeing happens in ascending file offset order.
*
* Called under (and serialised by) inode->i_sem.
*/
void truncate_inode_pages(struct address_space *mapping, loff_t lstart)
{
}
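The effect of these two calls can be observed from user space: a page that was already faulted in does not survive a truncation. The sketch below is a hypothetical demo (the function name and the /tmp path are assumptions) that maps a file full of 'A', reads it, truncates the file to 0, grows it back to one page, and reads again through the same mapping.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical demo. Returns the byte seen through the mapping after
 * the file was truncated to 0 and grown back, or -1 on a setup error. */
int demo_truncate_discards_pages(void)
{
    char path[] = "/tmp/trunc_demo_XXXXXX";
    long pg = sysconf(_SC_PAGESIZE);
    int fd = mkstemp(path);
    int result = -1;

    assert(pg > 0);
    if (fd < 0)
        return -1;
    unlink(path);                       /* the open fd keeps the file alive */

    char *buf = malloc((size_t)pg);
    if (buf == NULL) { close(fd); return -1; }
    memset(buf, 'A', (size_t)pg);
    ssize_t n = write(fd, buf, (size_t)pg);
    free(buf);
    if (n != (ssize_t)pg) { close(fd); return -1; }

    volatile const char *p = mmap(NULL, (size_t)pg, PROT_READ,
                                  MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }

    if (p[0] == 'A'                     /* fault the page in */
        && ftruncate(fd, 0) == 0        /* unmaps it and drops the cache */
        && ftruncate(fd, pg) == 0)      /* grow back: the page is now a hole */
        result = p[0];                  /* faults again and reads the hole */

    munmap((void *)p, (size_t)pg);
    close(fd);
    return result;
}
```

The second read returns 0 rather than 'A': the old page was unmapped and dropped from the page cache, and the new fault is served from the zero-filled hole.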

augustusqing

Post #21

Posted 2010-04-24 13:10

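Sorry, this is embarrassing: after running the experiment I found that I was indeed wrong.

As the OP said:
"In this example, since nothing is written, there is no difference between MAP_PRIVATE and MAP_SHARED."

Indeed, when the mapping is only read, MAP_PRIVATE and MAP_SHARED make no difference.

What I had in mind was that, starting from some kernel version, VM_DENYWRITE is no longer supported for regular files.
In later versions VM_DENYWRITE is effective only for executables. For a regular file, mmap does not prevent other processes from opening the file for writing (the OP can try it: using cp to overwrite an executable that is currently running will certainly fail with a "Text file busy" error).

The overwrites done by the OP and by Godbach both involve O_TRUNC. The kernel's implementation of O_TRUNC drops the whole page cache of the file and unmaps every mmap that corresponds to it.
The OP can try this: open tmp.ttt without O_TRUNC and write data into it without shrinking the file. That poses no threat to a process that is reading the file through mmap.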
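The suggested experiment can be sketched as follows. The helper update_in_place and the demo function are hypothetical names, and the demo assumes /tmp is writable; the point is only that the writer opens the file without O_TRUNC and never shrinks it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: overwrite len bytes at offset off.
 * The file is opened WITHOUT O_TRUNC, so its size never shrinks. */
int update_in_place(const char *path, const void *buf, size_t len, off_t off)
{
    assert(path != NULL && buf != NULL);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pwrite(fd, buf, len, off);
    int rc = (n == (ssize_t)len) ? 0 : -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}

/* Demo: a reader maps two pages, a writer updates the second page in
 * place, and the reader then touches it. Returns the byte read, or -1. */
int demo_update_visible(void)
{
    char path[] = "/tmp/inplace_demo_XXXXXX";
    long pg = sysconf(_SC_PAGESIZE);
    int fd = mkstemp(path);
    int result = -1;

    if (fd < 0)
        return -1;
    if (ftruncate(fd, 2 * pg) == 0) {
        volatile const char *p = mmap(NULL, 2 * pg, PROT_READ,
                                      MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            if (update_in_place(path, "hello", 5, pg) == 0)
                result = p[pg];     /* no SIGBUS: the file was not shrunk */
            munmap((void *)p, 2 * pg);
        }
    }
    close(fd);
    unlink(path);
    return result;
}
```

The reader sees the new data ('h' at the start of the second page) through its existing mapping and never receives SIGBUS.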

kouu

Post #20

Posted 2010-04-23 14:00

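What is the difference between MAP_PRIVATE and MAP_SHARED?

As I understand it, they differ only after the mapping has been written to.
With MAP_SHARED, the page table entries keep pointing at the file's page cache, so reads and writes still act on the file's cache.
With MAP_PRIVATE, a write allocates new memory and the page table entries are switched to it. From then on, reads and writes at that location have nothing to do with the file's cache.

If this understanding is right, then in this example, since nothing is written, there is no difference between MAP_PRIVATE and MAP_SHARED.

If I have misunderstood something, please correct me. Thanks a lot.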
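The difference after a write, as described in this post, can be checked with a small experiment. This is a sketch under assumptions: the function name and the /tmp path are made up, and the file contents are read back with pread() to see what actually reached the file.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical demo. The file starts as "AA". Byte 0 is written through
 * a MAP_SHARED mapping and byte 1 through a MAP_PRIVATE mapping; both
 * bytes are then read back from the file. Returns 0 on success, -1 on
 * a setup error. */
int demo_private_vs_shared(char *after_shared, char *after_private)
{
    char path[] = "/tmp/map_demo_XXXXXX";
    int fd = mkstemp(path);
    int rc = -1;

    assert(after_shared != NULL && after_private != NULL);
    if (fd < 0)
        return -1;
    unlink(path);                   /* the open fd keeps the file alive */
    if (write(fd, "AA", 2) != 2) { close(fd); return -1; }

    char *sh = mmap(NULL, 2, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *pr = mmap(NULL, 2, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (sh != MAP_FAILED && pr != MAP_FAILED) {
        sh[0] = 'S';    /* goes to the page cache, hence to the file */
        pr[1] = 'P';    /* copy-on-write: lands in a private page */
        if (pread(fd, after_shared, 1, 0) == 1 &&
            pread(fd, after_private, 1, 1) == 1)
            rc = 0;
    }
    if (sh != MAP_FAILED) munmap(sh, 2);
    if (pr != MAP_FAILED) munmap(pr, 2);
    close(fd);
    return rc;
}
```

Reading the file back gives 'S' for byte 0 and still 'A' for byte 1: the shared write reached the file, the private one did not.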

kouu

Post #19

Posted 2010-04-23 13:50

Yes, I tried it, on two kernel versions: 2.6.9 and 2.6.29.

You actually tested MAP_PRIVATE with code?
I'm shocked...... too

augustusqing

Post #18

Posted 2010-04-23 13:18

1. MAP_PRIVATE and MAP_SHARED have the same effect

You actually tested MAP_PRIVATE with code?
I'm shocked......

Which kernel version are you running?

kouu

Post #17

Posted 2010-04-22 15:29

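1. Why did the OP use MAP_SHARED in the mmap call? Try MAP_PRIVATE and it won't happen.
Since the mapping is shared, both sides have to follow some rules.
...
augustusqing, posted 2010-04-22 13:37

Thanks for your reply.

1. MAP_PRIVATE and MAP_SHARED have the same effect. This appears to be specified by POSIX: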

The mmap() function can be used to map a region of memory that is larger than the current size of the object. Memory access within the mapping but beyond the current end of the underlying objects may result in SIGBUS signals being sent to the process. The reason for this is that the size of the object can be manipulated by other processes and can change at any moment. The implementation should tell the application that a memory reference is outside the object where this can be detected; otherwise, written data may be lost and read data may not reflect actual data in the object.

See: http://www.opengroup.org/onlinepubs/000095399/functions/mmap.html

2. My two questions are not really the same question, are they?