1. 05 Oct, 2014 1 commit
    • Oleg Nesterov's avatar
      introduce for_each_thread() to replace the buggy while_each_thread() · 641bc58d
      Oleg Nesterov authored
      commit 0c740d0afc3bff0a097ad03a1c8df92757516f5c upstream.
      while_each_thread() and next_thread() should die, almost every lockless
      usage is wrong.
      1. Unless g == current, the lockless while_each_thread() is not safe.
         while_each_thread(g, t) can loop forever if g exits, next_thread()
         can't reach the unhashed thread in this case. Note that this can
         happen even if g is the group leader, it can exec.
      2. Even if while_each_thread() itself was correct, people often use
         it wrongly.
         It was never safe to just take rcu_read_lock() and loop unless
         you verify that pid_alive(g) == T, even the first next_thread()
         can point to the already freed/reused memory.
      This patch adds signal_struct->thread_head and task->thread_node to
      create the normal rcu-safe list with the stable head.  The new
      for_each_thread(g, t) helper is always safe under rcu_read_lock() as
      long as this task_struct can't go away.
      Note: of course it is ugly to have both task_struct->thread_node and the
      old task_struct->thread_group, we will kill it later, after we change
      the users of while_each_thread() to use for_each_thread().
      Perhaps we can kill it even before we convert all users, we can
      reimplement next_thread(t) using the new thread_head/thread_node.  But
      we can't do this right now because this will lead to subtle behavioural
      changes.  For example, do/while_each_thread() always sees at least one
      task, while for_each_thread() can do nothing if the whole thread group
      has died.  Or thread_group_empty(), currently its semantics is not clear
      unless thread_group_leader(p) and we need to audit the callers before we
      can change it.
      So this patch adds the new interface which has to coexist with the old
      one for some time, hopefully the next changes will be more or less
      straightforward and the old one will go away soon.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reviewed-by: default avatarSergey Dyasly <dserrg@gmail.com>
      Tested-by: default avatarSergey Dyasly <dserrg@gmail.com>
      Reviewed-by: default avatarSameer Nanda <snanda@chromium.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mandeep Singh Baines <msb@chromium.org>
      Cc: "Ma, Xindong" <xindong.ma@intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: "Tu, Xiaobing" <xiaobing.tu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 07 Jun, 2014 1 commit
  3. 27 Apr, 2014 2 commits
    • Oleg Nesterov's avatar
      exit: call disassociate_ctty() before exit_task_namespaces() · 16739781
      Oleg Nesterov authored
      commit c39df5fa37b0623589508c95515b4aa1531c524e upstream.
      Commit 8aac6270 ("move exit_task_namespaces() outside of
      exit_notify()") breaks pppd and the exiting service crashes the kernel:
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
          IP: ppp_register_channel+0x13/0x20 [ppp_generic]
          Call Trace:
            ppp_asynctty_open+0x12b/0x170 [ppp_async]
      ppp_register_channel() needs ->net_ns and current->nsproxy == NULL.
      Move disassociate_ctty() before exit_task_namespaces(), it doesn't make
      sense to delay it after perf_event_exit_task() or cgroup_exit().
      This also allows to use task_work_add() inside the (nontrivial) code
      paths in disassociate_ctty().
      Investigated by Peter Hurley.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reported-by: default avatarSree Harsha Totakura <sreeharsha@totakura.in>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Sree Harsha Totakura <sreeharsha@totakura.in>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Oleg Nesterov's avatar
      wait: fix reparent_leader() vs EXIT_DEAD->EXIT_ZOMBIE race · a756e3d4
      Oleg Nesterov authored
      commit dfccbb5e49a621c1b21a62527d61fc4305617aca upstream.
      wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
      drops tasklist_lock.  If this task is not the natural child and it is
      traced, we change its state back to EXIT_ZOMBIE for ->real_parent.
      The last transition is racy, this is even documented in 50b8d257
      "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
      race".  wait_consider_task() tries to detect this transition and clear
      ->notask_error but we can't rely on ptrace_reparented(), debugger can
      exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.
      And there is another problem which were missed before: this transition
      can also race with reparent_leader() which doesn't reset >exit_signal if
      EXIT_DEAD, assuming that this task must be reaped by someone else.  So
      the tracee can be re-parented with ->exit_signal != SIGCHLD, and if
      /sbin/init doesn't use __WALL it becomes unreapable.
      Change reparent_leader() to update ->exit_signal even if EXIT_DEAD.
      Note: this is the simple temporary hack for -stable, it doesn't try to
      solve all problems, it will be reverted by the next changes.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reported-by: default avatarJan Kratochvil <jan.kratochvil@redhat.com>
      Reported-by: default avatarMichal Schmidt <mschmidt@redhat.com>
      Tested-by: default avatarMichal Schmidt <mschmidt@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  4. 15 Jun, 2013 1 commit
    • Oleg Nesterov's avatar
      move exit_task_namespaces() outside of exit_notify() · 8aac6270
      Oleg Nesterov authored
      exit_notify() does exit_task_namespaces() after
      forget_original_parent(). This was needed to ensure that ->nsproxy
      can't be cleared prematurely, an exiting child we are going to
      reparent can do do_notify_parent() and use the parent's (ours) pid_ns.
      However, after 32084504 "pidns: use task_active_pid_ns in
      do_notify_parent" ->nsproxy != NULL is no longer needed, we rely
      on task_active_pid_ns().
      Move exit_task_namespaces() from exit_notify() to do_exit(), after
      exit_fs() and before exit_task_work().
      This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy()
      does fput() which needs task_work_add().
      Note: this particular problem can be fixed if we change fput(), and
      that change makes sense anyway. But there is another reason to move
      the callsite. The original reason for exit_task_namespaces() from
      the middle of exit_notify() was subtle and it has already gone away,
      now this looks confusing. And this allows us do simplify exit_notify(),
      we can avoid unlock/lock(tasklist) and we can use ->exit_state instead
      of PF_EXITING in forget_original_parent().
      Reported-by: default avatarAndrey Vagin <avagin@openvz.org>
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Acked-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: default avatarAndrey Vagin <avagin@openvz.org>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  5. 09 Apr, 2013 1 commit
  6. 31 Mar, 2013 1 commit
    • Paul Walmsley's avatar
      Revert "lockdep: check that no locks held at freeze time" · dbf520a9
      Paul Walmsley authored
      This reverts commit 6aa97070.
      Commit 6aa97070 ("lockdep: check that no locks held at freeze time")
      causes problems with NFS root filesystems.  The failures were noticed on
      OMAP2 and 3 boards during kernel init:
        [ BUG: swapper/0/1 still has locks held! ]
        3.9.0-rc3-00344-ga937536b #1 Not tainted
        1 lock held by swapper/0/1:
         #0:  (&type->s_umount_key#13/1){+.+.+.}, at: [<c011e84c>] sget+0x248/0x574
        stack backtrace:
      Although the rootfs mounts, the system is unstable.  Here's a transcript
      from a PM test:
      Here's what the test log should look like:
      Mailing list discussion is here:
      Deal with this for v3.9 by reverting the problem commit, until folks can
      figure out the right long-term course of action.
      Signed-off-by: default avatarPaul Walmsley <paul@pwsan.com>
      Cc: Mandeep Singh Baines <msb@chromium.org>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: Shawn Guo <shawn.guo@linaro.org>
      Cc: <maciej.rutecki@gmail.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ben Chan <benchan@chromium.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  7. 04 Mar, 2013 1 commit
  8. 28 Feb, 2013 2 commits
  9. 27 Jan, 2013 1 commit
    • Frederic Weisbecker's avatar
      cputime: Use accessors to read task cputime stats · 6fac4829
      Frederic Weisbecker authored
      This is in preparation for the full dynticks feature. While
      remotely reading the cputime of a task running in a full
      dynticks CPU, we'll need to do some extra-computation. This
      way we can account the time it spent tickless in userspace
      since its last cputime snapshot.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
  10. 29 Nov, 2012 1 commit
  11. 28 Nov, 2012 1 commit
    • Frederic Weisbecker's avatar
      cputime: Rename thread_group_times to thread_group_cputime_adjusted · e80d0a1a
      Frederic Weisbecker authored
      We have thread_group_cputime() and thread_group_times(). The naming
      doesn't provide enough information about the difference between
      these two APIs.
      To lower the confusion, rename thread_group_times() to
      thread_group_cputime_adjusted(). This name better suggests that
      it's a version of thread_group_cputime() that does some stabilization
      on the raw cputime values. ie here: scale on top of CFS runtime
      stats and bound lower value for monotonicity.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
  12. 19 Nov, 2012 1 commit
  13. 27 Sep, 2012 2 commits
  14. 24 Sep, 2012 1 commit
    • Eric Dumazet's avatar
      net: use a per task frag allocator · 5640f768
      Eric Dumazet authored
      We currently use a per socket order-0 page cache for tcp_sendmsg()
      This page is used to build fragments for skbs.
      Its done to increase probability of coalescing small write() into
      single segments in skbs still in write queue (not yet sent)
      But it wastes a lot of memory for applications handling many mostly
      idle sockets, since each socket holds one page in sk->sk_sndmsg_page
      Its also quite inefficient to build TSO 64KB packets, because we need
      about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
      page allocator more than wanted.
      This patch adds a per task frag allocator and uses bigger pages,
      if available. An automatic fallback is done in case of memory pressure.
      (up to 32768 bytes per frag, thats order-3 pages on x86)
      This increases TCP stream performance by 20% on loopback device,
      but also benefits on other network devices, since 8x less frags are
      mapped on transmit and unmapped on tx completion. Alexander Duyck
      mentioned a probable performance win on systems with IOMMU enabled.
      Its possible some SG enabled hardware cant cope with bigger fragments,
      but their ndo_start_xmit() should already handle this, splitting a
      fragment in sub fragments, since some arches have PAGE_SIZE=65536
      Successfully tested on various ethernet devices.
      (ixgbe, igb, bnx2x, tg3, mellanox mlx4)
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: default avatarVijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  15. 26 Jul, 2012 1 commit
    • Josh Boyer's avatar
      posix_types.h: Cleanup stale __NFDBITS and related definitions · 8ded2bbc
      Josh Boyer authored
      Recently, glibc made a change to suppress sign-conversion warnings in
      FD_SET (glibc commit ceb9e56b3d1).  This uncovered an issue with the
      kernel's definition of __NFDBITS if applications #include
      <linux/types.h> after including <sys/select.h>.  A build failure would
      be seen when passing the -Werror=sign-compare and -D_FORTIFY_SOURCE=2
      flags to gcc.
      It was suggested that the kernel should either match the glibc
      definition of __NFDBITS or remove that entirely.  The current in-kernel
      uses of __NFDBITS can be replaced with BITS_PER_LONG, and there are no
      uses of the related __FDELT and __FDMASK defines.  Given that, we'll
      continue the cleanup that was started with commit 8b3d1cda
      ("posix_types: Remove fd_set macros") and drop the remaining unused
      Additionally, linux/time.h has similar macros defined that expand to
      nothing so we'll remove those at the same time.
      Reported-by: default avatarJeff Law <law@redhat.com>
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      CC: <stable@vger.kernel.org>
      Signed-off-by: default avatarJosh Boyer <jwboyer@redhat.com>
      [ .. and fix up whitespace as per akpm ]
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  16. 22 Jul, 2012 1 commit
  17. 20 Jun, 2012 3 commits
  18. 08 Jun, 2012 1 commit
    • Linus Torvalds's avatar
      Revert "mm: correctly synchronize rss-counters at exit/exec" · 48d212a2
      Linus Torvalds authored
      This reverts commit 40af1bbd.
      It's horribly and utterly broken for at least the following reasons:
       - calling sync_mm_rss() from mmput() is fundamentally wrong, because
         there's absolutely no reason to believe that the task that does the
         mmput() always does it on its own VM.  Example: fork, ptrace, /proc -
         you name it.
       - calling it *after* having done mmdrop() on it is doubly insane, since
         the mm struct may well be gone now.
       - testing mm against NULL before you call it is insane too, since a
      NULL mm there would have caused oopses long before.
      .. and those are just the three bugs I found before I decided to give up
      looking for me and revert it asap.  I should have caught it before I
      even took it, but I trusted Andrew too much.
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  19. 07 Jun, 2012 1 commit
    • Konstantin Khlebnikov's avatar
      mm: correctly synchronize rss-counters at exit/exec · 40af1bbd
      Konstantin Khlebnikov authored
      mm->rss_stat counters have per-task delta: task->rss_stat.  Before
      changing task->mm pointer the kernel must flush this delta with
      do_exit() already calls sync_mm_rss() to flush the rss-counters before
      committing the rss statistics into task->signal->maxrss, taskstats,
      audit and other stuff.  Unfortunately the kernel does this before
      calling mm_release(), which can call put_user() for processing
      task->clear_child_tid.  So at this point we can trigger page-faults and
      task->rss_stat becomes non-zero again.  As a result mm->rss_stat becomes
      inconsistent and check_mm() will print something like this:
      | BUG: Bad rss-counter state mm:ffff88020813c380 idx:1 val:-1
      | BUG: Bad rss-counter state mm:ffff88020813c380 idx:2 val:1
      This patch moves sync_mm_rss() into mm_release(), and moves mm_release()
      out of do_exit() and calls it earlier.  After mm_release() there should
      be no pagefaults.
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@openvz.org>
      Reported-by: default avatarMarkus Trippelsdorf <markus@trippelsdorf.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@vger.kernel.org>		[3.4.x]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  20. 01 Jun, 2012 2 commits
  21. 24 May, 2012 2 commits
    • Oleg Nesterov's avatar
      genirq: reimplement exit_irq_thread() hook via task_work_add() · 4d1d61a6
      Oleg Nesterov authored
      exit_irq_thread() and task->irq_thread are needed to handle the unexpected
      (and unlikely) exit of irq-thread.
      We can use task_work instead and make this all private to
      kernel/irq/manage.c, cleanup plus micro-optimization.
      1. rename exit_irq_thread() to irq_thread_dtor(), make it
         static, and move it up before irq_thread().
      2. change irq_thread() to do task_work_add(irq_thread_dtor)
         at the start and task_work_cancel() before return.
         tracehook_notify_resume() can never play with kthreads,
         only do_exit()->exit_task_work() can call the callback
         and this is what we want.
      3. remove task_struct->irq_thread and the special hook
         in do_exit().
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alexander Gordeev <agordeev@redhat.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Smith <dsmith@redhat.com>
      Cc: "Frank Ch. Eigler" <fche@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
    • Oleg Nesterov's avatar
      task_work_add: generic process-context callbacks · e73f8959
      Oleg Nesterov authored
      Provide a simple mechanism that allows running code in the (nonatomic)
      context of the arbitrary task.
      The caller does task_work_add(task, task_work) and this task executes
      task_work->func() either from do_notify_resume() or from do_exit().  The
      callback can rely on PF_EXITING to detect the latter case.
      "struct task_work" can be embedded in another struct, still it has "void
      *data" to handle the most common/simple case.
      This allows us to kill the ->replacement_session_keyring hack, and
      potentially this can have more users.
      Performance-wise, this adds 2 "unlikely(!hlist_empty())" checks into
      tracehook_notify_resume() and do_exit().  But at the same time we can
      remove the "replacement_session_keyring != NULL" checks from
      arch/*/signal.c and exit_creds().
      Note: task_work_add/task_work_run abuses ->pi_lock.  This is only because
      this lock is already used by lookup_pi_state() to synchronize with
      do_exit() setting PF_EXITING.  Fortunately the scope of this lock in
      task_work.c is really tiny, and the code is unlikely anyway.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Acked-by: default avatarDavid Howells <dhowells@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alexander Gordeev <agordeev@redhat.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Smith <dsmith@redhat.com>
      Cc: "Frank Ch. Eigler" <fche@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  22. 17 May, 2012 1 commit
  23. 03 May, 2012 1 commit
  24. 23 Mar, 2012 2 commits
    • Denys Vlasenko's avatar
      kernel/exit.c: if init dies, log a signal which killed it, if any · 397a21f2
      Denys Vlasenko authored
      I just received another user's pleas for help when their init
      mysteriously died.  I again explained that they need to check whether it
      died because of bad instruction, a segv, or something else.  Which was
      an annoying detour into writing a trivial C program to spawn his init
      and print its exit code:
      I hear you saying "just test it under /bin/sh".  Well, the crashing init
      _was_ /bin/sh.
      Which prompted me to make kernel do this first step automatically.  We can
      print exit code, which makes it possible to see that death was from e.g.
      SIGILL without writing test programs.
      [akpm@linux-foundation.org: add 0x to hex number output]
      Signed-off-by: default avatarDenys Vlasenko <vda.linux@googlemail.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Lennart Poettering's avatar
      prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision · ebec18a6
      Lennart Poettering authored
      Userspace service managers/supervisors need to track their started
      services.  Many services daemonize by double-forking and get implicitly
      re-parented to PID 1.  The service manager will no longer be able to
      receive the SIGCHLD signals for them, and is no longer in charge of
      reaping the children with wait().  All information about the children is
      lost at the moment PID 1 cleans up the re-parented processes.
      With this prctl, a service manager process can mark itself as a sort of
      'sub-init', able to stay as the parent for all orphaned processes
      created by the started services.  All SIGCHLD signals will be delivered
      to the service manager.
      Receiving SIGCHLD and doing wait() is in cases of a service-manager much
      preferred over any possible asynchronous notification about specific
      PIDs, because the service manager has full access to the child process
      data in /proc and the PID can not be re-used until the wait(), the
      service-manager itself is in charge of, has happened.
      As a side effect, the relevant parent PID information does not get lost
      by a double-fork, which results in a more elaborate process tree and
      'ps' output:
        # ps afx
        253 ?        Ss     0:00 /bin/dbus-daemon --system --nofork
        294 ?        Sl     0:00 /usr/libexec/polkit-1/polkitd
        328 ?        S      0:00 /usr/sbin/modem-manager
        608 ?        Sl     0:00 /usr/libexec/colord
        658 ?        Sl     0:00 /usr/libexec/upowerd
        819 ?        Sl     0:00 /usr/libexec/imsettings-daemon
        916 ?        Sl     0:00 /usr/libexec/udisks-daemon
        917 ?        S      0:00  \_ udisks-daemon: not polling any devices
        # ps afx
        294 ?        Ss     0:00 /bin/dbus-daemon --system --nofork
        426 ?        Sl     0:00  \_ /usr/libexec/polkit-1/polkitd
        449 ?        S      0:00  \_ /usr/sbin/modem-manager
        635 ?        Sl     0:00  \_ /usr/libexec/colord
        705 ?        Sl     0:00  \_ /usr/libexec/upowerd
        959 ?        Sl     0:00  \_ /usr/libexec/udisks-daemon
        960 ?        S      0:00  |   \_ udisks-daemon: not polling any devices
        977 ?        Sl     0:00  \_ /usr/libexec/packagekitd
      This prctl is orthogonal to PID namespaces.  PID namespaces are isolated
      from each other, while a service management process usually requires the
      services to live in the same namespace, to be able to talk to each
      Users of this will be the systemd per-user instance, which provides
      init-like functionality for the user's login session and D-Bus, which
      activates bus services on-demand.  Both need init-like capabilities to
      be able to properly keep track of the services they start.
      Many thanks to Oleg for several rounds of review and insights.
      [akpm@linux-foundation.org: fix comment layout and spelling]
      [akpm@linux-foundation.org: add lengthy code comment from Oleg]
      Reviewed-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarLennart Poettering <lennart@poettering.net>
      Signed-off-by: default avatarKay Sievers <kay.sievers@vrfy.org>
      Acked-by: default avatarValdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  25. 22 Mar, 2012 1 commit
  26. 20 Mar, 2012 2 commits
    • Oleg Nesterov's avatar
      exit_signal: fix the "parent has changed security domain" logic · b6e238dc
      Oleg Nesterov authored
      exit_notify() changes ->exit_signal if the parent already did exec.
      This doesn't really work, we are not going to send the signal now
      if there is another live thread or the exiting task is traced. The
      parent can exec before the last dies or the tracer detaches.
      Move this check into do_notify_parent() which actually sends the
      The user-visible change is that we do not change ->exit_signal,
      and thus the exiting task is still "clone children" for
      do_wait()->eligible_child(__WCLONE). Hopefully this is fine, the
      current logic is racy anyway.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Oleg Nesterov's avatar
      exit_signal: simplify the "we have changed execution domain" logic · e6368253
      Oleg Nesterov authored
      exit_notify() checks "tsk->self_exec_id != tsk->parent_exec_id"
      to handle the "we have changed execution domain" case.
      We can change do_thread() to always set ->exit_signal = SIGCHLD
      and remove this check to simplify the code.
      We could change setup_new_exec() instead, this looks more logical
      because it increments ->self_exec_id. But note that de_thread()
      already resets ->exit_signal if it changes the leader, let's keep
      both changes close to each other.
      Note that we change ->exit_signal lockless, this changes the rules.
      Thereafter ->exit_signal is not stable under tasklist but this is
      fine, the only possible change is OLDSIG -> SIGCHLD. This can race
      with eligible_child() but the race is harmless. We can race with
      reparent_leader() which changes our ->exit_signal in parallel, but
      it does the same change to SIGCHLD.
      The noticeable user-visible change is that the execing task is not
      "visible" to do_wait()->eligible_child(__WCLONE) right after exec.
      To me this looks more logical, and this is consistent with mt case.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  27. 09 Mar, 2012 1 commit
  28. 04 Mar, 2012 1 commit
  29. 19 Feb, 2012 1 commit
    • David Howells's avatar
      Replace the fd_sets in struct fdtable with an array of unsigned longs · 1fd36adc
      David Howells authored
      Replace the fd_sets in struct fdtable with an array of unsigned longs and then
      use the standard non-atomic bit operations rather than the FD_* macros.
       (1) Removes the abuses of struct fd_set:
           (a) Since we don't want to allocate a full fd_set the vast majority of the
           	 time, we actually, in effect, just allocate a just-big-enough array of
           	 unsigned longs and cast it to an fd_set type - so why bother with the
           	 fd_set at all?
           (b) Some places outside of the core fdtable handling code (such as
           	 SELinux) want to look inside the array of unsigned longs hidden inside
           	 the fd_set struct for more efficient iteration over the entire set.
       (2) Eliminates the use of FD_*() macros in the kernel completely.
       (3) Permits the __FD_*() macros to be deleted entirely where not exposed to
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Link: http://lkml.kernel.org/r/20120216174954.23314.48147.stgit@warthog.procyon.org.ukSigned-off-by: default avatarH. Peter Anvin <hpa@zytor.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
  30. 13 Feb, 2012 1 commit
  31. 27 Jan, 2012 1 commit
    • Yasunori Goto's avatar
      sched: Fix ancient race in do_exit() · b5740f4b
      Yasunori Goto authored
      try_to_wake_up() has a problem which may change status from TASK_DEAD to
      TASK_RUNNING in race condition with SMI or guest environment of virtual
      machine. As a result, exited task is scheduled() again and panic occurs.
      Here is the sequence how it occurs:
                  CPU A                  |             CPU B
      TASK A calls exit()....
            set waiter.task <= task A
            list_add to sem->wait_list
            (I/O interruption occured)
                                              waiter->task = NULL
                                              wake_up_process(task A)
                                                   (task is still
                                                    p->on_rq is still 1.)
           (I/O interruption handler finished)
            if (!waiter.task)
                schedule() is not called
                due to waiter.task is NULL.
            tsk->state = TASK_RUNNING
        task->state = TASK_DEAD
                                              <---    set TASK_RUNNING (*C)
           (exit task is running again)
           BUG_ON() is called!
      The execution time between (*A) and (*B) is usually very short,
      because the interruption is disabled, and setting TASK_RUNNING at (*C)
      must be executed before setting TASK_DEAD.
      HOWEVER, if SMI is interrupted between (*A) and (*B),
      (*C) is able to execute AFTER setting TASK_DEAD!
      Then, exited task is scheduled again, and BUG_ON() is called....
      If the system works on guest system of virtual machine, the time
      between (*A) and (*B) may be also long due to scheduling of hypervisor,
      and same phenomenon can occur.
      By this patch, do_exit() waits for releasing task->pi_lock which is used
      in try_to_wake_up(). It guarantees the task becomes TASK_DEAD after
      waking up.
      Signed-off-by: default avatarYasunori Goto <y-goto@jp.fujitsu.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20120117174031.3118.E1E9C6FF@jp.fujitsu.comSigned-off-by: default avatarIngo Molnar <mingo@elte.hu>