Chapter 15 The Process Address Space
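Address Spaces
The process address space consists of the virtual memory addressable by a process and the addresses within that virtual memory that the process is allowed to use. Each process has a 32-bit or 64-bit address space, depending on the architecture. A process can elect to share its address space with other processes; such processes are called threads. Take a 4GB (32-bit) address space as an example: although the process can address that much memory, it does not have permission to access all of it. The intervals of addresses that the process may access are called memory areas, and the process can, through the kernel, dynamically add and remove memory areas. Each memory area has associated permissions, such as readable, writable, and executable, that the associated process must respect. Memory areas can contain, for example, the executable's code, its data, the heap, the user-space stack, and memory-mapped files.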

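The Memory Descriptor
The kernel represents a process's address space with a data structure called the memory descriptor. This structure, struct mm_struct, contains all the information related to the process address space.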
// <linux/mm_types.h>
struct mm_struct {
struct vm_area_struct *mmap; /* list of memory areas */
struct rb_root mm_rb; /* red-black tree of VMAs */
struct vm_area_struct *mmap_cache; /* last used memory area */
unsigned long free_area_cache; /* 1st address space hole */
pgd_t *pgd; /* page global directory */
atomic_t mm_users; /* address space users */
atomic_t mm_count; /* primary usage counter */
int map_count; /* number of memory areas */
struct rw_semaphore mmap_sem; /* memory area semaphore */
spinlock_t page_table_lock; /* page table lock */
struct list_head mmlist; /* list of all mm_structs */
unsigned long start_code; /* start address of code */
unsigned long end_code; /* final address of code */
unsigned long start_data; /* start address of data */
unsigned long end_data; /* final address of data */
unsigned long start_brk; /* start address of heap */
unsigned long brk; /* final address of heap */
unsigned long start_stack; /* start address of stack */
unsigned long arg_start; /* start of arguments */
unsigned long arg_end; /* end of arguments */
unsigned long env_start; /* start of environment */
unsigned long env_end; /* end of environment */
unsigned long rss; /* pages allocated */
unsigned long total_vm; /* total number of pages */
unsigned long locked_vm; /* number of locked pages */
unsigned long saved_auxv[AT_VECTOR_SIZE]; /* saved auxv */
cpumask_t cpu_vm_mask; /* lazy TLB switch mask */
mm_context_t context; /* arch-specific data */
unsigned long flags; /* status flags */
int core_waiters; /* thread core dump waiters */
struct core_state *core_state; /* core dump support */
spinlock_t ioctx_lock; /* AIO I/O list lock */
struct hlist_head ioctx_list; /* AIO I/O list */
};
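- mm_users is the number of processes using this address space. For example, if two threads share the address space, mm_users is two.
- mm_count is the primary reference count for the mm_struct. All mm_users together equate to one increment of mm_count. Thus, in the previous example, mm_count is only one. If nine threads shared an address space, mm_users would be nine, but again mm_count would be only one.
- The mmap and mm_rb fields are different data structures that contain the same thing: all the memory areas in this address space. The former stores them in a linked list, the latter in a red-black tree. The list is convenient for traversing every area; the tree is convenient for searching for a given area.
- All mm_struct structures are strung together in a doubly linked list via the mmlist field.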
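Allocating a Memory Descriptor
The memory descriptor of a process is stored in the mm field of its process descriptor. During fork(), the copy_mm() function copies the parent's memory descriptor to its child. The mm_struct itself is allocated from the mm_cachep slab cache by allocate_mm(). The significant difference between normal processes and kernel threads is that kernel threads have no address space. If the CLONE_VM flag is passed to clone(), allocate_mm() is not called; instead, copy_mm() points the child's mm field at the parent's memory descriptor: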
if (clone_flags & CLONE_VM) {
/*
* current is the parent process and
* tsk is the child process during a fork()
*/
atomic_inc(&current->mm->mm_users);
tsk->mm = current->mm;
}
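Destroying a Memory Descriptor
When the process associated with an address space exits, the exit_mm() function, defined in kernel/exit.c, is invoked to perform some housekeeping.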
It then calls mmput(), which decrements the memory descriptor’s mm_users user counter. If the user count reaches zero, mmdrop() is called to decrement the mm_count usage counter. If that counter is finally zero, the free_mm() macro is invoked to return the mm_struct to the mm_cachep slab cache via kmem_cache_free(), because the memory descriptor does not have any users.
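The interplay of the two counters can be sketched in user space. The following is a simplified model, not kernel code: struct mm_model and the mm_model_*() helpers are hypothetical names, plain ints stand in for atomic_t, and a boolean marks the point where the kernel would call free_mm().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; only the two counters are kept. */
struct mm_model {
    int  mm_users;  /* tasks sharing the address space */
    int  mm_count;  /* primary reference count */
    bool freed;     /* set where the kernel would call free_mm() */
};

/* A fresh descriptor has one user; all users together hold one mm_count. */
static void mm_model_init(struct mm_model *mm)
{
    mm->mm_users = 1;
    mm->mm_count = 1;
    mm->freed = false;
}

/* Models the CLONE_VM path: one more task shares the address space. */
static void mm_model_share(struct mm_model *mm)
{
    mm->mm_users++;
}

/* Models mmdrop(): drop one primary reference, free at zero. */
static void mm_model_drop(struct mm_model *mm)
{
    assert(mm->mm_count > 0);
    if (--mm->mm_count == 0)
        mm->freed = true;
}

/* Models mmput(): drop one user; the last user drops the shared reference. */
static void mm_model_put(struct mm_model *mm)
{
    assert(mm->mm_users > 0);
    if (--mm->mm_users == 0)
        mm_model_drop(mm);
}
```

With two threads sharing the descriptor, the first mm_model_put() only lowers mm_users; the second one brings both counters to zero and marks the descriptor as freed.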
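The mm_struct and Kernel Threads
Kernel threads have no process address space, so the mm field of their process descriptor is NULL. When the scheduler switches to a kernel thread and finds that mm is NULL, it keeps the previous process's address space loaded and points the kernel thread's active_mm field at the previous process's memory descriptor. This works because kernel threads do not access user-space memory; they use only the part of the address space that describes kernel memory, which is the same for all processes.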
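Virtual Memory Areas
A virtual memory area (VMA) is represented by struct vm_area_struct, defined in <linux/mm_types.h>. The vm_area_struct structure describes a single memory area over a contiguous interval in a given address space. Each memory area has certain properties, such as permissions and a set of associated operations.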
struct vm_area_struct {
struct mm_struct *vm_mm; /* associated mm_struct */
unsigned long vm_start; /* VMA start, inclusive */
unsigned long vm_end; /* VMA end , exclusive */
struct vm_area_struct *vm_next; /* list of VMA’s */
pgprot_t vm_page_prot; /* access permissions */
unsigned long vm_flags; /* flags */
struct rb_node vm_rb; /* VMA’s node in the tree */
union { /* links to address_space->i_mmap or i_mmap_nonlinear */
struct {
struct list_head list;
void *parent;
struct vm_area_struct *head;
} vm_set;
struct prio_tree_node prio_tree_node;
} shared;
struct list_head anon_vma_node; /* anon_vma entry */
struct anon_vma *anon_vma; /* anonymous VMA object */
struct vm_operations_struct *vm_ops; /* associated ops */
unsigned long vm_pgoff; /* offset within file */
struct file *vm_file; /* mapped file, if any */
void *vm_private_data; /* private data */
};
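- The start and end addresses of a VMA are given by vm_start and vm_end. The area covers the half-open interval [vm_start, vm_end), so its length in bytes is vm_end - vm_start. Intervals of different memory areas in the same address space cannot overlap.
- vm_mm points to the mm_struct that this VMA belongs to.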
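The half-open interval convention can be illustrated with a few helpers. This is a minimal user-space sketch: struct vma_range and the vma_*() functions are hypothetical names that copy only the two address fields of vm_area_struct.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two address fields of vm_area_struct. */
struct vma_range {
    unsigned long vm_start; /* first byte inside the area (inclusive) */
    unsigned long vm_end;   /* first byte after the area (exclusive) */
};

/* Length in bytes of the interval [vm_start, vm_end). */
static unsigned long vma_len(const struct vma_range *v)
{
    assert(v->vm_start < v->vm_end);
    return v->vm_end - v->vm_start;
}

/* True if addr lies inside the area; vm_end itself is outside. */
static bool vma_contains(const struct vma_range *v, unsigned long addr)
{
    return v->vm_start <= addr && addr < v->vm_end;
}

/* True if two half-open intervals share at least one byte. */
static bool vma_overlaps(const struct vma_range *a, const struct vma_range *b)
{
    return a->vm_start < b->vm_end && b->vm_start < a->vm_end;
}
```

Because vm_end is exclusive, an area ending at 0x3000 and another starting at 0x3000 are adjacent but do not overlap.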
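VMA Flags
The vm_flags field holds bit flags, defined in <linux/mm.h>, that specify the behavior of the pages contained in the memory area and provide information about them. Unlike the permissions associated with a specific physical page, VMA flags specify behavior for which the kernel, not the hardware, is responsible. Furthermore, vm_flags contains information that relates to every page in the memory area, or to the memory area as a whole, rather than to specific individual pages.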

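- The VM_READ, VM_WRITE, and VM_EXEC flags specify the usual read, write, and execute permissions for the pages in this particular memory area. They can be combined as needed.
- The VM_SHARED flag specifies whether the memory area contains a mapping that is shared among multiple processes. If the flag is not set, the mapping is private.
- The VM_IO flag specifies that this memory area is a mapping of a device's I/O space. This flag is typically set by a device driver when mmap() is called on its I/O space.
- The VM_RESERVED flag specifies that the memory area must not be swapped out. It is also used by device driver mappings.
- The VM_SEQ_READ flag gives the kernel a hint that the application is performing sequential (that is, linear and contiguous) reads in this mapping, so the kernel can opt to increase the read-ahead performed on the backing file.
- The VM_RAND_READ flag specifies the exact opposite: the kernel can then opt to decrease or altogether disable read-ahead on the backing file.
These last two flags are set via the madvise() system call with the MADV_SEQUENTIAL and MADV_RANDOM flags, respectively.
What is read-ahead?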
Read-ahead is the act of reading sequentially ahead of requested data, in hopes that the additional data will be needed soon. Such behavior is beneficial if applications are reading data sequentially. If data access patterns are random, however, read-ahead is not effective.
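From user space, the two hints are passed with madvise(). The sketch below is only an illustration of the calls: advise_demo() is a hypothetical helper, and it uses an anonymous mapping to stay self-contained, so there is no backing file whose read-ahead could actually change. On a file-backed mapping the same calls would apply to that file's pages.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Maps four pages, applies both hints in turn, and unmaps the region.
 * Returns 0 if every call succeeded, -1 otherwise. */
static int advise_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    size_t len = 4 * (size_t)page;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    int rc = 0;
    /* Hint: the range will be read sequentially. */
    if (madvise(p, len, MADV_SEQUENTIAL) != 0)
        rc = -1;
    /* Hint: the range will be accessed in random order. */
    if (madvise(p, len, MADV_RANDOM) != 0)
        rc = -1;

    if (munmap(p, len) != 0)
        rc = -1;
    return rc;
}
```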
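VMA Operations
The vm_ops field in the vm_area_struct structure points to the table of operations associated with a given memory area, which the kernel can invoke to manipulate the VMA.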
#include <linux/mm.h>
struct vm_operations_struct {
void (*open) (struct vm_area_struct *);
void (*close) (struct vm_area_struct *);
int (*fault) (struct vm_area_struct *, struct vm_fault *);
int (*page_mkwrite) (struct vm_area_struct *vma, struct vm_fault *vmf);
int (*access) (struct vm_area_struct *, unsigned long ,
void *, int, int);
};
- void open(struct vm_area_struct *area): invoked when the given memory area is added to an address space.
- void close(struct vm_area_struct *area): invoked when the given memory area is removed from an address space.
- int fault(struct vm_area_struct *area, struct vm_fault *vmf): invoked by the page fault handler when a page that is not present in physical memory is accessed.
- int page_mkwrite(struct vm_area_struct *area, struct vm_fault *vmf): invoked by the page fault handler when a page that was read-only is being made writable.
- int access(struct vm_area_struct *vma, unsigned long address, void *buf, int len, int write): invoked by access_process_vm() when get_user_pages() fails.
Lists and Trees of Memory Areas
As discussed, memory areas are accessed via both the mmap and the mm_rb fields of the memory descriptor.
The first field, mmap, links together all the memory area objects in a singly linked list. Each vm_area_struct structure is linked into the list via its vm_next field. The areas are sorted by ascending address. The second field, mm_rb, links together all the memory area objects in a red-black tree. The root of the red-black tree is mm_rb, and each vm_area_struct structure in this address space is linked to the tree via its vm_rb field.
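Manipulating Memory Areas
The kernel often has to perform operations on a memory area, such as checking whether a given address exists in a given VMA.
find_vma()
find_vma() searches for the VMA in which a given memory address resides.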
// mm/mmap.c
struct vm_area_struct * find_vma(struct mm_struct *mm, unsigned long addr);
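This function finds the first memory area that either contains addr or begins at an address greater than addr. Note that because the returned VMA may start at an address greater than addr, the given address does not necessarily lie inside the returned VMA. The result of find_vma() is cached in the mmap_cache field of the memory descriptor. The listing below comes from a later kernel version in which that single field was replaced by a VMA cache accessed through vmacache_find() and vmacache_update().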
/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
struct rb_node *rb_node;
struct vm_area_struct *vma;
/* Check the cache first. */
vma = vmacache_find(mm, addr);
if (likely(vma))
return vma;
rb_node = mm->mm_rb.rb_node;
while (rb_node) {
struct vm_area_struct *tmp;
tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
if (tmp->vm_end > addr) {
vma = tmp;
if (tmp->vm_start <= addr)
break;
rb_node = rb_node->rb_left;
} else
rb_node = rb_node->rb_right;
}
if (vma)
vmacache_update(addr, vma);
return vma;
}
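The lookup rule (first area with addr < vm_end) can also be expressed as a walk over the address-sorted list instead of the red-black tree. The sketch below is a user-space model under that assumption: struct vma_node, model_find_vma(), and model_addr_in_vma() are hypothetical names, and the linear walk is a substitute for the tree descent, not the kernel's method.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a VMA linked through vm_next, sorted by address. */
struct vma_node {
    unsigned long vm_start;     /* inclusive */
    unsigned long vm_end;       /* exclusive */
    struct vma_node *vm_next;   /* next area at a higher address */
};

/* Returns the first area that satisfies addr < vm_end, or NULL if none.
 * Because the list is sorted, this is either the area containing addr
 * or the first area that starts above addr. */
static struct vma_node *model_find_vma(struct vma_node *head,
                                       unsigned long addr)
{
    for (struct vma_node *v = head; v != NULL; v = v->vm_next)
        if (addr < v->vm_end)
            return v;
    return NULL;
}

/* The extra check a caller needs before treating addr as mapped. */
static bool model_addr_in_vma(const struct vma_node *v, unsigned long addr)
{
    return v != NULL && v->vm_start <= addr && addr < v->vm_end;
}
```

For two areas [0x1000, 0x2000) and [0x5000, 0x6000), looking up 0x3000 returns the second area even though 0x3000 lies in the hole between them, which is why the caller has to compare against vm_start itself.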
find_vma_prev()
The find_vma_prev() function works the same way as find_vma(), but it also returns the last VMA before addr.
struct vm_area_struct * find_vma_prev(struct mm_struct *mm, unsigned long addr,
struct vm_area_struct **pprev)
The pprev argument stores a pointer to the VMA preceding addr.
find_vma_intersection()
The find_vma_intersection() function returns the first VMA that overlaps a given address interval.
/* Look up the first VMA which intersects the interval start_addr..end_addr-1,
NULL if none. Assume start_addr < end_addr. */
static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * mm, unsigned long start_addr, unsigned long end_addr)
{
struct vm_area_struct * vma = find_vma(mm,start_addr);
if (vma && end_addr <= vma->vm_start)
vma = NULL;
return vma;
}
mmap() & do_mmap()
The kernel uses the do_mmap() function to create a new linear address interval.
// linux/mm.h
unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long offset)
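This function maps the file specified by file, starting at offset offset, for a length of len bytes. The file parameter can be NULL and offset can be zero, in which case the mapping is not backed by a file and is called an anonymous mapping. The addr parameter optionally specifies the initial address from which to start searching for a free interval. The prot parameter specifies the access permissions for the pages in the memory area; the permission flags are defined in <asm/mman.h>.
Mapping attributes:
The flags parameter specifies flags that correspond to the remaining VMA flags.
If possible, the new interval is merged with an adjacent memory area. Otherwise, a new vm_area_struct structure is allocated from the vm_area_cachep slab cache, and the new memory area is added to the address space's linked list and red-black tree of memory areas via the vma_link() function. Finally, the total_vm field in the memory descriptor is updated.
The do_mmap() functionality is exported to user space via the mmap() system call, which is defined as follows: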
void * mmap2(void *start,
size_t length,
int prot,
int flags,
int fd,
off_t pgoff)
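The system call is named mmap2() because it is the second variant of mmap(). The original mmap() took an offset in bytes as its last parameter; the current mmap2() receives the offset in pages, which makes it possible to map larger files at larger offsets.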
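An anonymous mapping, the case where no file backs the interval, can be created from user space through the C library's mmap() wrapper. This is a minimal sketch: anon_map_demo() is a hypothetical helper, and the 4096-byte length is an arbitrary choice for the example.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Creates a private anonymous mapping, uses it, and removes it.
 * Returns 0 on success, -1 on any failure. */
static int anon_map_demo(void)
{
    const size_t len = 4096;

    /* No file (fd = -1, offset 0): the kernel picks the address. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Anonymous pages start out zero-filled. */
    int ok = (p[0] == 0);

    /* PROT_WRITE allows storing into the new memory area. */
    strcpy(p, "hello");
    ok = ok && strcmp(p, "hello") == 0;

    if (munmap(p, len) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```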
munmap & do_munmap()
The do_munmap() function removes an address interval from a specified process address space.
// <linux/mm.h>
int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
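The first parameter specifies the address space from which the interval starting at address start and of length len bytes is removed. On success, zero is returned. Otherwise, a negative error code is returned.
munmap() is the corresponding system call.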
int munmap(void *start, size_t length)
The system call is implemented as follows:
asmlinkage long sys_munmap(unsigned long addr, size_t len)
{
int ret;
struct mm_struct *mm;
mm = current->mm;
down_write(&mm->mmap_sem);
ret = do_munmap(mm, addr, len);
up_write(&mm->mmap_sem);
return ret;
}
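Because munmap() removes an address interval rather than a whole mapping, it can be applied to part of a region. The following user-space sketch shows this under the assumption of a two-page anonymous mapping; partial_unmap_demo() is a hypothetical helper.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Maps two pages, unmaps only the first, and checks that the second
 * page is still usable. Returns 0 on success, -1 on any failure. */
static int partial_unmap_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    size_t pg = (size_t)page;
    char *p = mmap(NULL, 2 * pg, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    p[pg] = 'x';                    /* write into the second page */

    /* Remove only the interval [p, p + pg). */
    if (munmap(p, pg) != 0) {
        munmap(p, 2 * pg);
        return -1;
    }

    /* The second page still belongs to the address space. */
    int ok = (p[pg] == 'x');

    if (munmap(p + pg, pg) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```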
Page Tables
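Although applications operate on virtual memory that is mapped to physical addresses, processors operate directly on those physical addresses. Consequently, when an application accesses a virtual memory address, it must first be converted to a physical address before the processor can resolve the request. This lookup is performed via page tables.
Each process has its own page tables (threads, of course, share them). The pgd field of the memory descriptor points to the process's page global directory. The translation lookaside buffer (TLB) caches virtual-to-physical translations to speed up this lookup.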
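To make the lookup concrete, the sketch below splits a virtual address into page table indices. The layout is an assumption chosen for illustration, not something stated above: a 32-bit address, 4KB pages, and two levels with 10 index bits each (as on classic 32-bit x86 without PAE). Other architectures use more levels and different widths. struct va_parts and split_va() are hypothetical names.

```c
#include <stdint.h>

/* Assumed layout: | 10-bit pgd index | 10-bit pte index | 12-bit offset | */
struct va_parts {
    uint32_t pgd_index;  /* entry in the page global directory */
    uint32_t pte_index;  /* entry in the page table it points to */
    uint32_t offset;     /* byte offset inside the 4KB page */
};

/* Splits a 32-bit virtual address according to the assumed layout. */
static struct va_parts split_va(uint32_t va)
{
    struct va_parts p;
    p.pgd_index = va >> 22;             /* top 10 bits */
    p.pte_index = (va >> 12) & 0x3ffu;  /* middle 10 bits */
    p.offset    = va & 0xfffu;          /* low 12 bits */
    return p;
}
```

Under this layout, the address 0xC0123456 selects entry 768 of the page global directory, entry 291 of the page table, and byte 0x456 within the page.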