    [PATCH] cpuset: rebind vma mempolicies fix · 4225399a
    Paul Jackson authored
    
    
    Fix more of a longstanding bug in the cpuset/mempolicy interaction.
    
    NUMA mempolicies (mm/mempolicy.c) are constrained by the current task's cpuset
    to just the Memory Nodes allowed by that cpuset.  The kernel maintains
    internal state for each mempolicy, tracking what nodes are used for the
    MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.
    
    When a task's cpuset memory placement changes, whether because the cpuset
    changed or because the task was attached to a different cpuset, the
    task's mempolicies have to be rebound to the new cpuset placement, so as to
    preserve the cpuset-relative numbering of the nodes in that policy.
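
    As an illustration of "cpuset-relative numbering", the following userspace
    sketch remaps a policy's nodes by position: the n-th allowed node in the old
    cpuset becomes the n-th allowed node in the new one, wrapping if the new
    cpuset has fewer nodes.  It is not the kernel code.  The 64-bit 'nodemask'
    type and the helpers mask_weight(), nth_node() and remap_nodes() are
    hypothetical names made up for this example, and dropping policy nodes that
    lie outside the old cpuset is a simplifying assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a kernel node mask: one bit per memory node. */
typedef uint64_t nodemask;

/* Count the set bits in a mask. */
static int mask_weight(nodemask m)
{
    int n = 0;

    for (; m; m &= m - 1)
        n++;
    return n;
}

/* Return the node number of the ord-th (0-based) set bit in m, or -1. */
static int nth_node(nodemask m, int ord)
{
    for (int node = 0; node < 64; node++) {
        if (!(m & ((nodemask)1 << node)))
            continue;
        if (ord-- == 0)
            return node;
    }
    return -1;
}

/*
 * Map each policy node from its position within old_allowed to the node at
 * the same position within new_allowed, wrapping around when new_allowed
 * has fewer nodes.  Policy nodes outside old_allowed are dropped (an
 * assumption of this sketch).
 */
static nodemask remap_nodes(nodemask policy, nodemask old_allowed,
                            nodemask new_allowed)
{
    int new_weight = mask_weight(new_allowed);
    nodemask result = 0;
    int ord = 0;

    if (new_weight == 0)
        return 0;

    for (int node = 0; node < 64; node++) {
        nodemask bit = (nodemask)1 << node;

        if (!(old_allowed & bit))
            continue;
        if (policy & bit)
            result |= (nodemask)1 << nth_node(new_allowed, ord % new_weight);
        ord++;
    }
    return result;
}
```

    For example, a policy on the 2nd and 4th nodes of a cpuset holding nodes
    0-3 would land on the 2nd and 4th nodes of a cpuset holding nodes 4-7.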
    
    An earlier fix handled such mempolicy rebinding for mempolicies attached to a
    task.
    
    This fix rebinds mempolicies attached to vma's (address ranges in a task's
    address space).  Due to the need to hold the task->mm->mmap_sem semaphore while
    updating vma's, the rebinding of vma mempolicies has to be done when the
    cpuset memory placement is changed, at which time mmap_sem can be safely
    acquired.  The task's mempolicy is rebound later, when the task next attempts
    to allocate memory and notices that its task->cpuset_mems_generation is
    out-of-date with its cpuset's mems_generation.
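
    The lazy, generation-based rebind of the task's own mempolicy can be
    pictured with a small userspace sketch.  The field names mems_generation
    and cpuset_mems_generation follow the text above.  Everything else (the
    struct layouts, the plain per-cpuset counter, the 'rebinds' counter and
    both function names) is assumed for illustration and is not the kernel
    implementation.

```c
#include <assert.h>

/* Simplified stand-ins; only the generation fields are named in the text. */
struct cpuset {
    int mems_generation;         /* bumped on every placement change */
    unsigned long mems_allowed;  /* one bit per allowed memory node */
};

struct task {
    struct cpuset *cpuset;
    int cpuset_mems_generation;  /* generation this task last synced to */
    unsigned long mems_allowed;  /* task's cached copy of the placement */
    int rebinds;                 /* hypothetical counter, for the example */
};

/* Change a cpuset's memory placement and advance its generation. */
static void cpuset_change_mems(struct cpuset *cs, unsigned long new_mems)
{
    cs->mems_allowed = new_mems;
    cs->mems_generation++;
}

/*
 * Called before a task allocates memory: if the task's generation is stale,
 * pick up the new placement (the task mempolicy rebind would happen here).
 */
static void task_refresh_before_alloc(struct task *t)
{
    if (t->cpuset_mems_generation == t->cpuset->mems_generation)
        return;  /* up to date, nothing to do */

    t->mems_allowed = t->cpuset->mems_allowed;
    t->cpuset_mems_generation = t->cpuset->mems_generation;
    t->rebinds++;
}
```

    In this sketch a placement change costs the task nothing until its next
    allocation attempt, and repeated attempts after that do no further work.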
    
    Because walking the tasklist to find all tasks attached to a changing cpuset
    requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
    affected tasks while doing the tasklist scan.  In general, one cannot acquire
    a semaphore (which can sleep) while already holding a spinlock (such as
    tasklist_lock).  So a list of mm references has to be built up during the
    tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
    acquired, and the vma's in that mm rebound.
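
    The two-phase pattern described above (collect references under the
    spinlock, drop it, then take each sleeping lock) might look roughly like
    the single-threaded model below.  The locks are modelled as plain flags so
    that the "no sleeping lock under a spinlock" rule can be checked with an
    assertion.  All struct layouts, MAX_TASKS, and the function names are
    assumptions for this sketch, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 8  /* arbitrary bound for this sketch */

struct mm {
    int refcount;  /* references held on this mm */
    int locked;    /* models mmap_sem being held */
    int rebound;   /* how many times the vma policies were rebound */
};

struct task {
    struct mm *mm;
    int cpuset_id;  /* hypothetical: which cpuset the task is attached to */
};

static int tasklist_lock_held;  /* models the tasklist_lock spinlock */

/* Taking the sleeping lock while the spinlock is held is a bug. */
static void mm_lock(struct mm *mm)
{
    assert(!tasklist_lock_held);
    mm->locked = 1;
}

static void mm_unlock(struct mm *mm)
{
    mm->locked = 0;
}

/* Rebind every mm used by a task in the given cpuset; returns the count. */
static int rebind_cpuset_mms(struct task *tasks, size_t ntasks, int cpuset_id)
{
    struct mm *mms[MAX_TASKS];
    size_t n = 0;

    /* Phase 1: scan under the spinlock, only collecting mm references. */
    tasklist_lock_held = 1;
    for (size_t i = 0; i < ntasks && n < MAX_TASKS; i++) {
        if (tasks[i].cpuset_id != cpuset_id || !tasks[i].mm)
            continue;
        tasks[i].mm->refcount++;  /* keep the mm alive after unlock */
        mms[n++] = tasks[i].mm;
    }
    tasklist_lock_held = 0;

    /* Phase 2: spinlock dropped, so the sleeping lock is now safe. */
    for (size_t i = 0; i < n; i++) {
        mm_lock(mms[i]);
        mms[i]->rebound++;  /* stands in for rebinding each vma policy */
        mm_unlock(mms[i]);
        mms[i]->refcount--;  /* drop the reference taken in phase 1 */
    }
    return (int)n;
}
```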
    
    Once the tasklist lock is dropped, affected tasks may fork new tasks, before
    their mm's are rebound.  A kernel global 'cpuset_being_rebound' is set to
    point to the cpuset being rebound (there can only be one; cpuset modifications
    are done under a global 'manage_sem' semaphore), and the mpol_copy code that
    is used to copy a task's mempolicies during fork catches such forking tasks,
    and ensures their children are also rebound.
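
    A rough userspace model of that fork-time check is sketched below.  Only
    the global name cpuset_being_rebound comes from the text.  The struct
    layouts, copy_policy(), rebind_placeholder() and the 'relative_to' field
    are hypothetical, and the placeholder simply adopts the new node mask
    instead of doing a real position-preserving remap.

```c
#include <assert.h>
#include <stddef.h>

struct cpuset {
    unsigned long mems_allowed;  /* one bit per allowed memory node */
};

struct mempolicy {
    unsigned long nodes;        /* nodes the policy uses */
    unsigned long relative_to;  /* placement the nodes are relative to */
};

/* Set while a cpuset's placement change is in progress, NULL otherwise. */
static struct cpuset *cpuset_being_rebound;

/* Placeholder rebind: a real one would remap nodes position by position. */
static void rebind_placeholder(struct mempolicy *pol, unsigned long new_mems)
{
    pol->nodes = new_mems;
    pol->relative_to = new_mems;
}

/*
 * Copy a parent's policy for a forking child.  If the parent's cpuset is
 * the one currently being rebound, rebind the copy right away so the child
 * is not missed by the scan that already ran.
 */
static struct mempolicy copy_policy(const struct mempolicy *old,
                                    struct cpuset *parent_cpuset)
{
    struct mempolicy copy = *old;

    if (cpuset_being_rebound != NULL &&
        cpuset_being_rebound == parent_cpuset)
        rebind_placeholder(&copy, parent_cpuset->mems_allowed);
    return copy;
}
```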
    
    The case where a task is moved to a different cpuset is easier, as there is
    only one task involved.  Its mm->vma's are scanned, using the same
    mpol_rebind_policy() as used above.
    
    It may happen that both the mpol_copy hook and the update done via the
    tasklist scan update the same mm twice.  This is ok, as the mempolicies of
    each vma in an mm keep track of what mems_allowed they are relative to, and
    safely no-op a second request to rebind to the same nodes.
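
    Why a double update is harmless can be shown with a minimal sketch in
    which the policy remembers the placement it was last bound to.  The field
    name cpuset_mems_allowed, the 'remaps' counter and rebind_policy() are
    assumptions for this example, and the remap itself is reduced to a
    placeholder that adopts the new mask.

```c
#include <assert.h>
#include <stdint.h>

struct mempolicy {
    uint64_t nodes;                /* nodes the policy uses */
    uint64_t cpuset_mems_allowed;  /* placement the nodes are relative to */
    int remaps;                    /* hypothetical counter of real remaps */
};

/* Rebind a policy to a new placement; a repeat request is a no-op. */
static void rebind_policy(struct mempolicy *pol, uint64_t new_allowed)
{
    if (pol->cpuset_mems_allowed == new_allowed)
        return;  /* already relative to these nodes */

    /* Placeholder for the real position-preserving remap. */
    pol->nodes = new_allowed;
    pol->cpuset_mems_allowed = new_allowed;
    pol->remaps++;
}
```

    Calling rebind_policy() twice with the same mask, once from the fork hook
    and once from the scan, performs only one remap in this model.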
    
    Signed-off-by: Paul Jackson <pj@sgi.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>