1. 15 Dec, 2015 2 commits
  2. 24 Oct, 2015 2 commits
  3. 01 Oct, 2015 1 commit
    • unshare: Unsharing a thread does not require unsharing a vm · 6b7d2f5b
      Eric W. Biederman authored
      commit 12c641ab8270f787dfcce08b5f20ce8b65008096 upstream.
      The logic in the initial commit of unshare made creating a new
      thread group for a process contingent upon creating a new memory
      address space for that process.  That is wrong.  Two separate
      processes in different thread groups can share a memory address space,
      and clone allows creation of such processes.
      This is significant because it was observed that mm_users > 1 does not
      mean that a process is multi-threaded, as reading /proc/PID/maps
      temporarily increments mm_users, which allows other processes to
      (accidentally) interfere with unshare() calls.
      Correct the checks in check_unshare_flags() to test for:
      * !thread_group_empty() for CLONE_THREAD, CLONE_SIGHAND, and CLONE_VM.
      * sighand->count > 1 for CLONE_SIGHAND and CLONE_VM.
      * !current_is_single_threaded() instead of mm_users > 1 for CLONE_VM.
      By using the correct checks in unshare this removes the possibility of
      an accidental denial of service attack.
      Additionally using the correct checks in unshare ensures that only an
      explicit unshare(CLONE_VM) can possibly trigger the slow path of
      current_is_single_threaded().  As an explicit unshare(CLONE_VM) is
      pointless, it is not expected that many applications make
      that call.
      Fixes: b2e0d987 userns: Implement unshare of the user namespace
      Reported-by: Ricky Zhou <rickyz@chromium.org>
      Reported-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
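The three checks described in this commit can be sketched as a small userspace model. This is not the kernel source: `struct task_model`, the `M_CLONE_*` bit values, and `check_unshare_flags_model()` are hypothetical names invented here, and the three fields stand in for `thread_group_empty()`, `sighand->count`, and `current_is_single_threaded()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for this model; the real CLONE_* values differ. */
#define M_CLONE_VM      0x1u
#define M_CLONE_SIGHAND 0x2u
#define M_CLONE_THREAD  0x4u

/* Userspace stand-in for the state the kernel reads from the current task. */
struct task_model {
    bool thread_group_empty; /* no other thread in the thread group */
    int  sighand_count;      /* number of users of the signal handler table */
    bool single_threaded;    /* no other task shares the address space */
};

/* Returns 0 when there is nothing to unshare, -EINVAL otherwise. */
static int check_unshare_flags_model(const struct task_model *t, unsigned flags)
{
    assert(t != NULL);

    /* Any of the three flags requires being alone in the thread group. */
    if ((flags & (M_CLONE_THREAD | M_CLONE_SIGHAND | M_CLONE_VM)) &&
        !t->thread_group_empty)
        return -EINVAL;

    /* Unsharing signal handlers or the vm requires an unshared sighand. */
    if ((flags & (M_CLONE_SIGHAND | M_CLONE_VM)) && t->sighand_count > 1)
        return -EINVAL;

    /* Only an explicit CLONE_VM reaches the single-threaded test. */
    if ((flags & M_CLONE_VM) && !t->single_threaded)
        return -EINVAL;

    return 0;
}
```

In this model a process that shares its address space with another thread group (`single_threaded == false`, `thread_group_empty == true`) may still pass `M_CLONE_THREAD`, which is the case the original check rejected.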
  4. 02 Apr, 2015 1 commit
  5. 09 Oct, 2014 1 commit
  6. 05 Oct, 2014 2 commits
    • introduce for_each_thread() to replace the buggy while_each_thread() · 641bc58d
      Oleg Nesterov authored
      commit 0c740d0afc3bff0a097ad03a1c8df92757516f5c upstream.
      while_each_thread() and next_thread() should die, almost every lockless
      usage is wrong.
      1. Unless g == current, the lockless while_each_thread() is not safe.
         while_each_thread(g, t) can loop forever if g exits, next_thread()
         can't reach the unhashed thread in this case. Note that this can
         happen even if g is the group leader, it can exec.
      2. Even if while_each_thread() itself was correct, people often use
         it wrongly.
         It was never safe to just take rcu_read_lock() and loop unless
         you verify that pid_alive(g) == T, even the first next_thread()
         can point to the already freed/reused memory.
      This patch adds signal_struct->thread_head and task->thread_node to
      create the normal rcu-safe list with the stable head.  The new
      for_each_thread(g, t) helper is always safe under rcu_read_lock() as
      long as this task_struct can't go away.
      Note: of course it is ugly to have both task_struct->thread_node and the
      old task_struct->thread_group, we will kill it later, after we change
      the users of while_each_thread() to use for_each_thread().
      Perhaps we can kill it even before we convert all users, we can
      reimplement next_thread(t) using the new thread_head/thread_node.  But
      we can't do this right now because this will lead to subtle behavioural
      changes.  For example, do/while_each_thread() always sees at least one
      task, while for_each_thread() can do nothing if the whole thread group
      has died.  Or thread_group_empty(), currently its semantics is not clear
      unless thread_group_leader(p) and we need to audit the callers before we
      can change it.
      So this patch adds the new interface which has to coexist with the old
      one for some time, hopefully the next changes will be more or less
      straightforward and the old one will go away soon.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Sergey Dyasly <dserrg@gmail.com>
      Tested-by: Sergey Dyasly <dserrg@gmail.com>
      Reviewed-by: Sameer Nanda <snanda@chromium.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mandeep Singh Baines <msb@chromium.org>
      Cc: "Ma, Xindong" <xindong.ma@intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: "Tu, Xiaobing" <xiaobing.tu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
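The idea of a list with a stable head can be illustrated with a single-threaded userspace model. It leaves out RCU entirely and only shows two properties named in the message: iteration starts from the head rather than from a thread that may already be unlinked, and an empty group yields zero iterations. All names (`group_model`, `thread_model`, `for_each_thread_model`) are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, in the style of a list head. */
struct node { struct node *next, *prev; };

struct group_model  { struct node thread_head; };          /* stable head */
struct thread_model { int tid; struct node thread_node; }; /* one per thread */

static void group_init(struct group_model *g)
{
    g->thread_head.next = g->thread_head.prev = &g->thread_head;
}

/* Append a thread at the tail of the group's list. */
static void thread_add(struct group_model *g, struct thread_model *t)
{
    struct node *tail = g->thread_head.prev;

    t->thread_node.next = &g->thread_head;
    t->thread_node.prev = tail;
    tail->next = &t->thread_node;
    g->thread_head.prev = &t->thread_node;
}

/* Unlink a thread; the head stays valid whichever thread exits. */
static void thread_del(struct thread_model *t)
{
    t->thread_node.prev->next = t->thread_node.next;
    t->thread_node.next->prev = t->thread_node.prev;
}

#define node_to_thread(n) \
    ((struct thread_model *)((char *)(n) - offsetof(struct thread_model, thread_node)))

/* Walk from the head; terminates when the walk returns to the head. */
#define for_each_thread_model(g, t)                                \
    for (struct node *n_ = (g)->thread_head.next;                  \
         n_ != &(g)->thread_head && ((t) = node_to_thread(n_), 1); \
         n_ = n_->next)

static int count_threads(struct group_model *g)
{
    struct thread_model *t;
    int n = 0;

    assert(g != NULL);
    for_each_thread_model(g, t) {
        (void)t;
        n++;
    }
    return n;
}
```

Because the loop condition compares against the head, removing the first thread (the model's "leader") does not affect termination, unlike a walk that waits to come back to its starting thread.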
    • kernel/fork.c:copy_process(): unify CLONE_THREAD-or-thread_group_leader code · beed6106
      Oleg Nesterov authored
      commit 80628ca06c5d42929de6bc22c0a41589a834d151 upstream.
      Cleanup and preparation for the next changes.
      Move the "if (clone_flags & CLONE_THREAD)" code down under "if
      (likely(p->pid))" and turn it into the "else" branch.  This makes the
      process/thread initialization more symmetrical and removes one check.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sergey Dyasly <dserrg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  7. 07 Jul, 2014 1 commit
  8. 01 Jul, 2014 1 commit
  9. 07 Jun, 2014 1 commit
  10. 09 Jan, 2014 1 commit
    • mm: fix TLB flush race between migration, and change_protection_range · d303cf46
      Rik van Riel authored
      commit 20841405940e7be0617612d521e206e4b6b325db upstream.
      There are a few subtle races, between change_protection_range (used by
      mprotect and change_prot_numa) on one side, and NUMA page migration and
      compaction on the other side.
      The basic race is that there is a time window between when the PTE gets
      made non-present (PROT_NONE or NUMA), and the TLB is flushed.
      During that time, a CPU may continue writing to the page.
      This is fine most of the time, however compaction or the NUMA migration
      code may come in, and migrate the page away.
      When that happens, the CPU may continue writing, through the cached
      translation, to what is no longer the current memory location of the
      process.
      This only affects x86, which has a somewhat optimistic pte_accessible.
      All other architectures appear to be safe, and will either always flush,
      or flush whenever there is a valid mapping, even with no permissions.
      The basic race looks like this:
      CPU A			CPU B			CPU C
      						load TLB entry
      make entry PTE/PMD_NUMA
      			fault on entry
      						read/write old page
      			start migrating page
      			change PTE/PMD to new page
      						read/write old page [*]
      flush TLB
      						reload TLB from new entry
      						read/write new page
      						lose data
      [*] the old page may belong to a new user at this point!
      The obvious fix is to flush remote TLB entries, by making
      pte_accessible aware of the fact that PROT_NONE and PROT_NUMA memory may
      still be accessible if there is a TLB flush pending for the mm.
      This should fix both NUMA migration and compaction.
      [mgorman@suse.de: fix build]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
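The fix described above can be sketched as a predicate in a userspace model. The flag bits, `struct tlb_mm_model`, and `pte_accessible_model()` are hypothetical; the real x86 page-table bit layout and mm structure are not reproduced here. The point is only that a non-present PROT_NONE or NUMA entry still counts as accessible while a TLB flush is pending for the mm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; not the real x86 page-table bit layout. */
#define M_PAGE_PRESENT  0x1u
#define M_PAGE_PROTNONE 0x2u
#define M_PAGE_NUMA     0x4u

/* Stand-in for the per-mm "a TLB flush is pending" state. */
struct tlb_mm_model { bool tlb_flush_pending; };

/* True if some CPU may still reach the page through this entry. */
static bool pte_accessible_model(const struct tlb_mm_model *mm, unsigned pte_flags)
{
    assert(mm != NULL);

    if (pte_flags & M_PAGE_PRESENT)
        return true;

    /* Non-present entry, but a stale TLB translation may still exist. */
    if ((pte_flags & (M_PAGE_PROTNONE | M_PAGE_NUMA)) && mm->tlb_flush_pending)
        return true;

    return false;
}
```

A caller that decides whether a remote TLB flush is needed would, under this model, flush for a NUMA or PROT_NONE entry only while `tlb_flush_pending` is set.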
  11. 27 Sep, 2013 1 commit
  12. 20 Aug, 2013 1 commit
  13. 08 May, 2013 1 commit
  14. 23 Mar, 2013 1 commit
  15. 13 Mar, 2013 1 commit
    • userns: Don't allow CLONE_NEWUSER | CLONE_FS · e66eded8
      Eric W. Biederman authored
      Don't allow sharing the root directory with processes in a
      different user namespace.  There doesn't seem to be any point, and to
      allow it would require the overhead of putting a user namespace
      reference in fs_struct (for permission checks) and incrementing that
      reference count on practically every call to fork.
      So just perform the inexpensive test of forbidding sharing fs_struct
      across processes in different user namespaces.  We already disallow
      other forms of threading when unsharing a user namespace so this
      should be no real burden in practice.
      This updates setns, clone, and unshare to disallow multiple user
      namespaces sharing an fs_struct.
      Cc: stable@vger.kernel.org
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
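The "inexpensive test" amounts to rejecting a flag combination. A minimal sketch, assuming hypothetical bit values `M_CLONE_FS` and `M_CLONE_NEWUSER` (the real CLONE_* constants differ) and a hypothetical helper name:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag bits for this model; the real CLONE_* values differ. */
#define M_CLONE_FS      0x1u
#define M_CLONE_NEWUSER 0x2u

/* Reject a request that would both create a new user namespace and keep
 * sharing the fs_struct with the parent. */
static int check_userns_fs_model(unsigned flags)
{
    const unsigned both = M_CLONE_NEWUSER | M_CLONE_FS;

    assert((M_CLONE_FS & M_CLONE_NEWUSER) == 0); /* model bits are distinct */
    return (flags & both) == both ? -EINVAL : 0;
}
```

Either flag alone passes; only the combination is refused, so no reference to a user namespace has to be kept in the shared fs_struct.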
  16. 07 Mar, 2013 1 commit
    • cputime: Dynamically scale cputime for full dynticks accounting · 9fbc42ea
      Frederic Weisbecker authored
      The full dynticks cputime accounting is able to account either
      using the tick or the context tracking subsystem. This way
      the housekeeping CPU can keep the low overhead tick based
      solution.
      This latter mode has a low jiffies resolution granularity and
      needs to be scaled against CFS precise runtime accounting to
      improve its result. We are doing this for CONFIG_TICK_CPU_ACCOUNTING,
      now we also need to expand it to the full dynticks accounting dynamic
      off-case as well.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Mats Liljegren <mats.liljegren@enea.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  17. 04 Mar, 2013 1 commit
  18. 28 Feb, 2013 1 commit
  19. 23 Feb, 2013 1 commit
  20. 27 Jan, 2013 1 commit
    • cputime: Safely read cputime of full dynticks CPUs · 6a61671b
      Frederic Weisbecker authored
      While remotely reading the cputime of a task running in a
      full dynticks CPU, the values stored in utime/stime fields
      of struct task_struct may be stale. Its values may be those
      of the last kernel <-> user transition time snapshot and
      we need to add the tickless time spent since this snapshot.
      To fix this, flush the cputime of the dynticks CPUs on
      kernel <-> user transition and record the time / context
      where we did this. Then on top of this snapshot and the current
      time, perform the fixup on the reader side from task_times().
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      [fixed kvm module related build errors]
      Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
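The snapshot-plus-fixup scheme can be sketched in a userspace model. The writer flushes elapsed time into the stored fields on each kernel <-> user transition and records when and in which context it did so; the reader adds the time elapsed since that snapshot. All names here (`vtime_task_model`, `vtime_transition_model`, `task_times_model`) are hypothetical, time is a plain counter, and the synchronization a real reader and writer would need is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum vtime_model_state { VT_SLEEPING, VT_USER, VT_SYS };

struct vtime_task_model {
    uint64_t utime, stime;        /* values flushed at the last transition */
    uint64_t snap;                /* time of the last transition */
    enum vtime_model_state state; /* context entered at that transition */
};

/* Writer side: called on a kernel <-> user transition at time `now`. */
static void vtime_transition_model(struct vtime_task_model *t, uint64_t now,
                                   enum vtime_model_state next)
{
    uint64_t delta;

    assert(t != NULL && now >= t->snap);
    delta = now - t->snap;
    if (t->state == VT_USER)
        t->utime += delta;
    else if (t->state == VT_SYS)
        t->stime += delta;
    t->snap = now;
    t->state = next;
}

/* Reader side: stored values plus the tickless time since the snapshot. */
static void task_times_model(const struct vtime_task_model *t, uint64_t now,
                             uint64_t *utime, uint64_t *stime)
{
    uint64_t delta;

    assert(t != NULL && utime != NULL && stime != NULL && now >= t->snap);
    delta = now - t->snap;
    *utime = t->utime;
    *stime = t->stime;
    if (t->state == VT_USER)
        *utime += delta;
    else if (t->state == VT_SYS)
        *stime += delta;
}
```

The reader never modifies the task, so a remote CPU can compute an up-to-date value without forcing the tickless CPU to take a tick.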
  21. 20 Jan, 2013 1 commit
  22. 25 Dec, 2012 1 commit
  23. 19 Dec, 2012 1 commit
  24. 18 Dec, 2012 1 commit
    • fork: protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs · 2ad306b1
      Glauber Costa authored
      Because those architectures will draw their stacks directly from the page
      allocator, rather than the slab cache, we can directly pass __GFP_KMEMCG
      flag, and issue the corresponding free_pages.
      This code path is taken when the architecture doesn't define
      CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
      THREAD_SIZE >= PAGE_SIZE.  Luckily, most - if not all - of the remaining
      architectures fall in this category.
      This will guarantee that every stack page is accounted to the memcg the
      process currently lives on, and will cause the allocations to fail if they
      go over limit.
      For the time being, I am defining a new variant of THREADINFO_GFP, not to
      mess with the other path.  Once the slab is also tracked by memcg, we can
      get rid of that flag.
      Tested to successfully protect against :(){ :|:& };:
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
      Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: JoonSoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  25. 11 Dec, 2012 1 commit
    • mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node · 5bca2303
      Mel Gorman authored
      Due to the fact that migrations are driven by the CPU a task is running
      on there is no point tracking NUMA faults until one task runs on a new
      node. This patch tracks the first node used by an address space. Until
      it changes, PTE scanning is disabled and no NUMA hinting faults are
      trapped. This should help workloads that are short-lived, do not care
      about NUMA placement or have bound themselves to a single node.
      This takes advantage of the logic in "mm: sched: numa: Implement slow
      start for working set sampling" to delay when the checks are made. This
      will take advantage of processes that set their CPU and node bindings
      early in their lifetime. It will also potentially allow any initial load
      balancing to take place.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
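The "first node" tracking can be sketched as a small state machine in a userspace model. The sentinel values, `struct numa_mm_model`, and `numa_scan_enabled_model()` are hypothetical names chosen for this sketch; the only behaviour taken from the message is that scanning stays disabled until a task of the address space runs on a node other than the first one recorded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sentinels; real node ids are assumed to be >= 0. */
#define M_SCAN_INIT   (-1) /* no node recorded yet */
#define M_SCAN_ACTIVE (-2) /* a second node was seen, scanning is on */

struct numa_mm_model { int first_nid; };

/* Returns true if PTE scanning should proceed for this address space,
 * given the node the calling task currently runs on. */
static bool numa_scan_enabled_model(struct numa_mm_model *mm, int current_nid)
{
    assert(mm != NULL && current_nid >= 0);

    if (mm->first_nid == M_SCAN_INIT)
        mm->first_nid = current_nid; /* remember the first node used */

    if (mm->first_nid != M_SCAN_ACTIVE) {
        if (current_nid == mm->first_nid)
            return false;            /* still on the first node: skip */
        mm->first_nid = M_SCAN_ACTIVE;
    }
    return true;
}
```

Once the state reaches `M_SCAN_ACTIVE` it never goes back, so a task that returns to its original node keeps being scanned.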
  26. 29 Nov, 2012 5 commits
  27. 28 Nov, 2012 1 commit
    • cputime: Consolidate cputime adjustment code · d37f761d
      Frederic Weisbecker authored
      task_cputime_adjusted() and thread_group_cputime_adjusted()
      essentially share the same code. They just don't use the same
      source:
      * The first function uses the cputime in the task struct and the
      previous adjusted snapshot that ensures monotonicity.
      * The second adds the cputime of all tasks in the group and the
      previous adjusted snapshot of the whole group from the signal
      structure.
      Just consolidate the common code that does the adjustment. These
      functions just need to fetch the values from the appropriate
      source.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
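The consolidation can be sketched as one shared helper with two thin callers that differ only in where the raw values and the previous snapshot come from. This is a userspace model with hypothetical names (`cputime_pair`, `cputime_adjust_model`, and so on). The scaling formula, which splits the precise runtime in the ratio of the tick-sampled user and system time, is an assumption of this sketch, as is the choice to attribute everything to user time when no ticks were sampled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cputime_pair { uint64_t utime, stime; };

/* Shared adjustment: scale tick-sampled utime against the precise runtime
 * and clamp against the previous snapshot so values never go backwards. */
static void cputime_adjust_model(const struct cputime_pair *curr, uint64_t rtime,
                                 struct cputime_pair *prev,
                                 struct cputime_pair *out)
{
    uint64_t total, utime, stime;

    assert(curr != NULL && prev != NULL && out != NULL);
    total = curr->utime + curr->stime;
    /* Overflow of rtime * utime is ignored in this sketch. */
    utime = total ? rtime * curr->utime / total : rtime;

    if (utime > prev->utime)
        prev->utime = utime;
    stime = rtime > prev->utime ? rtime - prev->utime : 0;
    if (stime > prev->stime)
        prev->stime = stime;
    *out = *prev;
}

struct task_times_src {
    struct cputime_pair raw;  /* tick-sampled utime/stime */
    uint64_t rtime;           /* precise runtime */
    struct cputime_pair prev; /* per-task adjusted snapshot */
};

/* Per-task variant: source is the task itself. */
static void task_adjusted_model(struct task_times_src *t, struct cputime_pair *out)
{
    cputime_adjust_model(&t->raw, t->rtime, &t->prev, out);
}

/* Group variant: source is the sum over all threads plus a group snapshot. */
static void group_adjusted_model(const struct task_times_src *threads, size_t n,
                                 struct cputime_pair *group_prev,
                                 struct cputime_pair *out)
{
    struct cputime_pair sum = { 0, 0 };
    uint64_t rtime = 0;

    for (size_t i = 0; i < n; i++) {
        sum.utime += threads[i].raw.utime;
        sum.stime += threads[i].raw.stime;
        rtime += threads[i].rtime;
    }
    cputime_adjust_model(&sum, rtime, group_prev, out);
}
```

Both callers reduce to "gather the inputs, then call the helper", which is the shape the commit message describes.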
  28. 20 Nov, 2012 1 commit
  29. 19 Nov, 2012 5 commits
    • userns: Allow unprivileged users to create user namespaces. · 5eaf563e
      Eric W. Biederman authored
      Now that we have been through every permission check in the kernel
      having uid == 0 and gid == 0 in your local user namespace no
      longer adds any special privileges.  Even having a full set
      of caps in your local user namespace is safe because capabilities
      are relative to your local user namespace, and do not confer
      unexpected privileges.
      Over the long term this should allow much more of the kernel's
      functionality to be safely used by non-root users.  For example,
      unsharing the mount namespace is only unsafe because
      it can fool applications whose privileges are raised when they
      are executed.  Since those applications have no privileges in
      a user namespace it becomes safe to spoof and confuse those
      applications all you want.
      Those capabilities will still need to be enabled carefully because
      we may still need things like rlimits on the number of unprivileged
      mounts but that is to avoid DOS attacks not to avoid fooling root
      owned processes.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • pidns: Support unsharing the pid namespace. · 50804fe3
      Eric W. Biederman authored
      Unsharing of the pid namespace, unlike unsharing of other namespaces,
      does not take effect immediately.  Instead it affects the children
      created with fork and clone.  The first of these children becomes the init
      process of the new pid namespace, the rest become oddball children
      of pid 0.  From the point of view of the new pid namespace the process
      that created it is pid 0, as its pid does not map.
      A couple of different semantics were considered but this one was
      settled on because it is easy to implement and it is usable from
      pam modules, the core reason for the existence of unshare.
      I took a survey of the callers of pam modules and the following
      appears to be a representative sample of their logic.
      	setup stuff include pam
      	child = fork();
      	if (!child) {
      		setuid()
      		exec /bin/bash
      	}
      	waitpid(child);
      	pam and other cleanup
      As you can see there is a fork to create the unprivileged user
      space process, which means that the unprivileged user space
      process will appear as pid 1 in the new pid namespace.  Further,
      most login processes do not cope with extraneous children, which
      means shifting the duty of reaping extraneous child processes to
      the creator of those extraneous children makes the system more
      robust.
      The practical reason for this set of pid namespace semantics is
      that it is simple to implement and verify they work correctly.
      Whereas an implementation that requires changing the struct
      pid on a process comes with a lot more races and pain.  Not
      the least of which is that glibc caches getpid().
      These semantics are implemented by having two notions
      of the pid namespace of a process.  There is task_active_pid_ns,
      which is the pid namespace the process was created with
      and the pid namespace that all pids are presented to
      that process in.  The task_active_pid_ns is stored
      in the struct pid of the task.
      Then there is the pid namespace that will be used for children;
      that pid namespace is stored in task->nsproxy->pid_ns.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
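The two notions of a process's pid namespace can be illustrated with a userspace model. Every name here (`pidns_model`, `proc_model`, `unshare_pidns_model`, `fork_model`) is hypothetical. `active_ns` plays the role of task_active_pid_ns and `ns_for_children` the role of the namespace stored in the nsproxy. Pid allocation is reduced to a counter, and the mapping of pids into ancestor namespaces is left out.

```c
#include <assert.h>
#include <stddef.h>

struct pidns_model { int next_pid; };

struct proc_model {
    struct pidns_model *active_ns;       /* namespace the process lives in */
    struct pidns_model *ns_for_children; /* namespace its children will get */
    int pid;                             /* pid as seen in active_ns */
};

/* Unshare only redirects future children; the caller keeps its own
 * active namespace and pid. */
static void unshare_pidns_model(struct proc_model *p, struct pidns_model *fresh)
{
    assert(p != NULL && fresh != NULL);
    fresh->next_pid = 1;
    p->ns_for_children = fresh;
}

/* A child is created in the parent's namespace-for-children. */
static struct proc_model fork_model(struct proc_model *parent)
{
    struct proc_model child;

    assert(parent != NULL);
    child.active_ns = parent->ns_for_children;
    child.ns_for_children = parent->ns_for_children;
    child.pid = child.active_ns->next_pid++;
    return child;
}
```

In this model the first child forked after the unshare gets pid 1 in the new namespace, while the caller's own pid and active namespace are untouched, so a cached getpid() value in the caller stays valid.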
    • pidns: Consolidate initialzation of special init task state · 1c4042c2
      Eric W. Biederman authored
      Instead of setting child_reaper and SIGNAL_UNKILLABLE one way
      for the system init process, and another way for pid namespace
      init processes, test pid->nr == 1 and use the same code for both.
      For the global init this results in SIGNAL_UNKILLABLE being set
      much earlier in the initialization process.
      This is a small cleanup and it paves the way for allowing unshare and
      enter of the pid namespace as that path like our global init also will
      not set CLONE_NEWPID.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    • pidns: Make the pidns proc mount/umount logic obvious. · 0a01f2cc
      Eric W. Biederman authored
      Track the number of pids in the proc hash table.  When the number of
      pids goes to 0 schedule work to unmount the kernel mount of proc.
      Move the mount of proc into alloc_pid when we allocate the pid for
      init.
      Remove the surprising calls of pid_ns_release_proc in fork and
      proc_flush_task.  Those code paths really shouldn't know about proc
      namespace implementation details and people have demonstrated several
      times that finding and understanding those code paths is difficult and
      error prone.
      Because of the call path, detach_pid is always called with the
      rtnl_lock held, so free_pid is not allowed to sleep and the work to
      unmount proc is moved to a work queue.  This has the side benefit
      of not blocking the entire world waiting for the unnecessary
      rcu_barrier in deactivate_locked_super.
      In the process of making the code clear and obvious this fixes a bug
      reported by Gao feng <gaofeng@cn.fujitsu.com> where we would leak a
      mount of proc during clone(CLONE_NEWPID|CLONE_NEWNET) if copy_pid_ns
      succeeded and copy_net_ns failed.
      Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • pidns: Use task_active_pid_ns where appropriate · 17cf22c3
      Eric W. Biederman authored
      The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
      aka ns_of_pid(task_pid(tsk)) should have the same number of
      cache line misses with the practical difference that
      ns_of_pid(task_pid(tsk)) is released later in a process's life.
      Furthermore by using task_active_pid_ns it becomes trivial
      to write an unshare implementation for the pid namespace.
      So I have used task_active_pid_ns everywhere I can.
      In fork since the pid has not yet been attached to the
      process I use ns_of_pid, to achieve the same effect.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>