    sched: Fix ancient race in do_exit() · b5740f4b
    Yasunori Goto authored
    
    
    try_to_wake_up() can race with do_exit() and change a task's state from
    TASK_DEAD back to TASK_RUNNING when an SMI occurs or when the kernel runs
    as a guest of a virtual machine. As a result, the exited task is
    scheduled again and a panic occurs.
    
    Here is the sequence in which it occurs:
    
     ----------------------------------+-----------------------------
                                       |
                CPU A                  |             CPU B
     ----------------------------------+-----------------------------
    
    TASK A calls exit()....
    
    do_exit()
    
      exit_mm()
        down_read(mm->mmap_sem);
    
        rwsem_down_failed_common()
    
          set TASK_UNINTERRUPTIBLE
          set waiter.task <= task A
          list_add to sem->wait_list
               :
          raw_spin_unlock_irq()
          (I/O interrupt occurred)
    
                                          __rwsem_do_wake(mmap_sem)
    
                                            list_del(&waiter->list);
                                            waiter->task = NULL
                                            wake_up_process(task A)
                                              try_to_wake_up()
                                                 (task is still
                                                   TASK_UNINTERRUPTIBLE,
                                                   p->on_rq is still 1.)
    
                                                  ttwu_do_wakeup()
                                                     (*A)
                                                       :
         (I/O interrupt handler finished)
    
          if (!waiter.task)
              schedule() is not called
              because waiter.task is NULL.
    
          tsk->state = TASK_RUNNING
    
              :
                                                  check_preempt_curr();
                                                      :
      task->state = TASK_DEAD
                                                  (*B)
                                            <---    set TASK_RUNNING (*C)
    
         schedule()
         (exited task is running again)
         BUG_ON() is called!
     --------------------------------------------------------
    
    The execution time between (*A) and (*B) is usually very short
    because interrupts are disabled, so setting TASK_RUNNING at (*C)
    normally happens before TASK_DEAD is set.
    
    HOWEVER, if an SMI interrupts CPU B between (*A) and (*B),
    (*C) can execute AFTER TASK_DEAD is set!
    Then the exited task is scheduled again, and BUG_ON() is called....
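
The interleaving in the diagram can be replayed deterministically in userspace. The following is only a minimal sketch, not kernel code: the struct, the enum values, and the helpers waker_check(), waker_set_running() and replay_race() are hypothetical stand-ins for the steps marked (*A) and (*C), and a flag takes the place of the SMI.

```c
#include <assert.h>

/* Hypothetical userspace stand-ins for the kernel task states. */
enum task_state { TASK_RUNNING, TASK_UNINTERRUPTIBLE, TASK_DEAD };

struct task {
    enum task_state state;
};

/* (*A): CPU B checks that the task still looks like a sleeper.
 * Returns 1 if the wakeup will go ahead. */
static int waker_check(const struct task *t)
{
    return t->state == TASK_UNINTERRUPTIBLE;
}

/* (*C): CPU B stores TASK_RUNNING without rechecking the state. */
static void waker_set_running(struct task *t)
{
    t->state = TASK_RUNNING;
}

/* Replays the sequence from the diagram. smi_delays_waker selects whether
 * the window between (*A) and (*C) is stretched past the TASK_DEAD store. */
static enum task_state replay_race(int smi_delays_waker)
{
    struct task a = { TASK_UNINTERRUPTIBLE };  /* CPU A: waiting on the rwsem */

    assert(waker_check(&a));                   /* CPU B: (*A) */

    if (!smi_delays_waker)
        waker_set_running(&a);                 /* CPU B: (*C), usual case */

    a.state = TASK_RUNNING;                    /* CPU A: waiter.task is NULL */
    a.state = TASK_DEAD;                       /* CPU A: end of do_exit() */

    if (smi_delays_waker)
        waker_set_running(&a);                 /* CPU B: (*C), too late */

    return a.state;
}
```

In this model replay_race(0) leaves the task in TASK_DEAD, while replay_race(1) leaves it in TASK_RUNNING, which corresponds to the exited task being scheduled again.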
    
    If the system runs as a guest of a virtual machine, the time
    between (*A) and (*B) may also be long due to hypervisor scheduling,
    and the same phenomenon can occur.
    
    With this patch, do_exit() waits until task->pi_lock, which
    try_to_wake_up() holds, has been released. This guarantees that the
    task becomes TASK_DEAD only after the wakeup has completed.
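
The effect of waiting on pi_lock can be illustrated with a single-threaded replay. This is a sketch under stated assumptions, not the patch itself: pi_lock is modeled as a plain integer flag, and waker_begin(), waker_finish(), exiter_try_set_dead() and replay_fixed() are hypothetical names, with exiter_try_set_dead() standing for one polling iteration of the wait added to do_exit().

```c
#include <assert.h>

/* Hypothetical userspace stand-ins for the kernel task states. */
enum task_state { TASK_RUNNING, TASK_UNINTERRUPTIBLE, TASK_DEAD };

struct task {
    enum task_state state;
    int pi_lock_held;   /* models task->pi_lock: 1 while a waker holds it */
};

/* CPU B, first half of the wakeup: take pi_lock and check the state (*A). */
static void waker_begin(struct task *t)
{
    assert(!t->pi_lock_held);
    t->pi_lock_held = 1;
    assert(t->state == TASK_UNINTERRUPTIBLE);
}

/* CPU B, second half: store TASK_RUNNING (*C), then drop pi_lock. */
static void waker_finish(struct task *t)
{
    t->state = TASK_RUNNING;
    t->pi_lock_held = 0;
}

/* CPU A: one polling iteration of the wait in do_exit(). Returns 1 once
 * TASK_DEAD has been stored, 0 while a waker still holds pi_lock. */
static int exiter_try_set_dead(struct task *t)
{
    if (t->pi_lock_held)
        return 0;
    t->state = TASK_DEAD;
    return 1;
}

/* Replays the delayed-waker interleaving with the wait in place. */
static enum task_state replay_fixed(void)
{
    struct task a = { TASK_UNINTERRUPTIBLE, 0 };
    int done;

    waker_begin(&a);                 /* CPU B enters the (*A)..(*C) window */
    a.state = TASK_RUNNING;          /* CPU A: rwsem slow path returns */
    done = exiter_try_set_dead(&a);  /* CPU A: must wait, waker is delayed */
    assert(!done);
    waker_finish(&a);                /* CPU B: (*C), then unlock */
    done = exiter_try_set_dead(&a);  /* CPU A: lock is free, set TASK_DEAD */
    assert(done);
    return a.state;
}
```

Because the store of TASK_DEAD is ordered after the release of the lock, the late TASK_RUNNING store can no longer overwrite it in this model, and replay_fixed() ends in TASK_DEAD.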
    
    Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
    Acked-by: Oleg Nesterov <oleg@redhat.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Link: http://lkml.kernel.org/r/20120117174031.3118.E1E9C6FF@jp.fujitsu.com
    
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>