1. 01 Aug, 2015 1 commit
  2. 11 Jun, 2015 1 commit
  3. 06 Jun, 2015 2 commits
  4. 17 May, 2015 3 commits
  5. 13 May, 2015 1 commit
  6. 06 May, 2015 3 commits
    • Geert Uytterhoeven
      nosave: consolidate __nosave_{begin,end} in <asm/sections.h> · e034445e
      Geert Uytterhoeven authored
      commit 7f8998c7aef3ac9c5f3f2943e083dfa6302e90d0 upstream.
      The different architectures used their own (and different) declarations:
          extern __visible const void __nosave_begin, __nosave_end;
          extern const void __nosave_begin, __nosave_end;
          extern long __nosave_begin, __nosave_end;
      Consolidate them using the first variant in <asm/sections.h>.
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Lv Zheng
      ACPICA: Utilities: split IO address types from data type models. · 27b22d01
      Lv Zheng authored
      commit 2b8760100e1de69b6ff004c986328a82947db4ad upstream.
      ACPICA commit aacf863cfffd46338e268b7415f7435cae93b451
      It is reported that on a physically 64-bit addressed machine, 32-bit kernel
      can trigger crashes in accessing the memory regions that are beyond the
      32-bit boundary. The region field's start address should still be 32-bit
      compliant, but after a calculation (adding some offsets), it may exceed the
      32-bit boundary. This case is rare and buggy, but there are real BIOSes
      leaked with such issues (see References below).
      This patch fixes this gap by always defining IO addresses as 64-bit, and
      allows OSPMs to optimize it for a real 32-bit machine to reduce the size of
      the internal objects.
      Internal acpi_physical_address usages in the structures that can be fixed
      by this change include:
       1. struct acpi_object_region:
          acpi_physical_address		address;
       2. struct acpi_address_range:
          acpi_physical_address		start_address;
          acpi_physical_address		end_address;
        3. struct acpi_mem_space_context:
          acpi_physical_address		address;
       4. struct acpi_table_desc
          acpi_physical_address		address;
      See known issues 1 for other usages.
      Note that acpi_io_address which is used for ACPI_PROCESSOR may also suffer
      from same problem, so this patch changes it accordingly.
      For iasl, it will enforce acpi_physical_address as 32-bit to generate
      32-bit OSPM compatible tables on 32-bit platforms, we need to define
      ACPI_32BIT_PHYSICAL_ADDRESS for it in acenv.h.
      Known issues:
       1. Cleanup of mapped virtual address
         In struct acpi_mem_space_context, acpi_physical_address is used as a virtual
         address:
          acpi_physical_address                   mapped_physical_address;
         It is better to introduce acpi_virtual_address or use acpi_size instead.
         This patch doesn't make such a change. Because this should be done along
         with a change to acpi_os_map_memory()/acpi_os_unmap_memory().
         There should be no functional problem to leave this unchanged except
         that only this structure is enlarged unexpectedly.
      Link: https://github.com/acpica/acpica/commit/aacf863c
      Reference: https://bugzilla.kernel.org/show_bug.cgi?id=87971
      Reference: https://bugzilla.kernel.org/show_bug.cgi?id=79501
      Reported-and-tested-by: Paul Menzel <paulepanter@users.sourceforge.net>
      Reported-and-tested-by: Sial Nije <sialnije@gmail.com>
      Signed-off-by: Lv Zheng <lv.zheng@intel.com>
      Signed-off-by: Bob Moore <robert.moore@intel.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Felipe Balbi
      usb: define a generic USB_RESUME_TIMEOUT macro · 7a2d2855
      Felipe Balbi authored
      commit 62f0342de1f012f3e90607d39e20fce811391169 upstream.
      Every USB Host controller should use this new
      macro to define for how long resume signalling
      should be driven on the bus.
      Currently, almost every single USB controller
      is using a 20ms timeout for resume signalling.
      That's problematic for two reasons:
      a) sometimes that 20ms timer expires a little
      before 20ms, which makes us fail certification
      b) some (many) devices actually need more than
      20ms resume signalling.
      Sure, in case of (b) we can state that the device
      is against the USB spec, but the fact is that
      we have no control over which device the certification
      lab will use. We also have no control over which host
      they will use. Most likely they'll be using a Windows
      PC which, again, we have no control over how that
      USB stack is written and how long resume signalling
      they are using.
      At the end of the day, we must make sure Linux passes
      electrical compliance when working as Host or as Device
      and currently we don't pass compliance as host because
      we're driving resume signalling for exactly 20ms and
      that confuses certification test setup resulting in
      certification failure.
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: Peter Chen <peter.chen@freescale.com>
      Signed-off-by: Felipe Balbi <balbi@ti.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  7. 29 Apr, 2015 5 commits
    • Linus Torvalds
      vm: add VM_FAULT_SIGSEGV handling support · 0c42d1fb
      Linus Torvalds authored
      commit 33692f27597fcab536d7cbbcc8f52905133e4aa7 upstream.
      The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
      "you should SIGSEGV" error, because the SIGSEGV case was generally
      handled by the caller - usually the architecture fault handler.
      That results in lots of duplication - all the architecture fault
      handlers end up doing very similar "look up vma, check permissions, do
      retries etc" - but it generally works.  However, there are cases where
      the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.
      In particular, when accessing the stack guard page, libsigsegv expects a
      SIGSEGV.  And it usually got one, because the stack growth is handled by
      that duplicated architecture fault handler.
      However, when the generic VM layer started propagating the error return
      from the stack expansion in commit fee7e49d4514 ("mm: propagate error
      from stack expansion even for guard page"), that now exposed the
      existing VM_FAULT_SIGBUS result to user space.  And user space really
      expected SIGSEGV, not SIGBUS.
      To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
      duplicate architecture fault handlers about it.  They all already have
      the code to handle SIGSEGV, so it's about just tying that new return
      value to the existing code, but it's all a bit annoying.
      This is the mindless minimal patch to do this.  A more extensive patch
      would be to try to gather up the mostly shared fault handling logic into
      one generic helper routine, and long-term we really should do that.
      Just from this patch, you can generally see that most architectures just
      copied (directly or indirectly) the old x86 way of doing things, but in
      the meantime that original x86 model has been improved to hold the VM
      semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
      "newer" things, so it would be a good idea to bring all those
      improvements to the generic case and teach other architectures about
      them too.
      Reported-and-tested-by: Takashi Iwai <tiwai@suse.de>
      Tested-by: Jan Engelhardt <jengelh@inai.de>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
      Cc: linux-arch@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [shengyong: Backport to 3.10
       - adjust context
       - ignore modification for arch nios2, because 3.10 does not support it
       - ignore modification for driver lustre, because 3.10 does not support it
       - ignore VM_FAULT_FALLBACK in VM_FAULT_ERROR, because 3.10 does not support
         this flag
       - add SIGSEGV handling to powerpc/cell spu_fault.c, because 3.10 does not
         separate it to copro_fault.c
       - add SIGSEGV handling in mm/memory.c, because 3.10 does not separate it
         to gup.c]
      Signed-off-by: Sheng Yong <shengyong1@huawei.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Al Viro
      move d_rcu from overlapping d_child to overlapping d_alias · 6637ecd3
      Al Viro authored
      commit 946e51f2bf37f1656916eb75bd0742ba33983c28 upstream.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      [hujianyang: Backported to 3.10 refer to the work of Ben Hutchings in 3.2:
       - Apply name changes in all the different places we use d_alias and d_child
       - Move the WARN_ON() in __d_free() to d_free() as we don't have dentry_free()]
      Signed-off-by: hujianyang <hujianyang@huawei.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Kirill A. Shutemov
      mm: Fix NULL pointer dereference in madvise(MADV_WILLNEED) support · 23f1538b
      Kirill A. Shutemov authored
      commit ee53664bda169f519ce3c6a22d378f0b946c8178 upstream.
      Sasha Levin found a NULL pointer dereference that is due to a missing
      page table lock, which in turn is due to the pmd entry in question being
      a transparent huge-table entry.
      The code - introduced in commit 1998cc04 ("mm: make
      madvise(MADV_WILLNEED) support swap file prefetch") - correctly checks
      for this situation using pmd_none_or_trans_huge_or_clear_bad(), but it
      turns out that that function doesn't work correctly.
      pmd_none_or_trans_huge_or_clear_bad() expected that pmd_bad() would
      trigger if the transparent hugepage bit was set, but it doesn't do that
      if pmd_numa() is also set. Note that the NUMA bit only gets set on real
      NUMA machines, so people trying to reproduce this on most normal
      development systems would never actually trigger this.
      Fix it by removing the very subtle (and subtly incorrect) expectation,
      and instead just checking pmd_trans_huge() explicitly.
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: Andrea Arcangeli <aarcange@redhat.com>
      [ Additionally remove the now stale test for pmd_trans_huge() inside the
        pmd_bad() case - Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Wang Long <long.wanglong@huawei.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Alex Elder
      remove extra definitions of U32_MAX · 1554b19c
      Alex Elder authored
      commit 04f9b74e4d96d349de12fdd4e6626af4a9f75e09 upstream.
      Now that the definition is centralized in <linux/kernel.h>, the
      definitions of U32_MAX (and related) elsewhere in the kernel can be
      removed.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Acked-by: Sage Weil <sage@inktank.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Alex Elder
      conditionally define U32_MAX · b81036aa
      Alex Elder authored
      commit 77719536dc00f8fd8f5abe6dadbde5331c37f996 upstream.
      The symbol U32_MAX is defined in several spots.  Change these
      definitions to be conditional.  This is in preparation for the next
      patch, which centralizes the definition in <linux/kernel.h>.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Cc: Sage Weil <sage@inktank.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  8. 19 Apr, 2015 2 commits
    • Alex Elder
      kernel.h: define u8, s8, u32, etc. limits · 0121b8bf
      Alex Elder authored
      commit 89a0714106aac7309c7dfa0f004b39e1e89d2942 upstream.
      Create constants that define the maximum and minimum values
      representable by the kernel types u8, s8, u16, s16, and so on.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Cc: Sage Weil <sage@inktank.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Bart Van Assche
      Defer processing of REQ_PREEMPT requests for blocked devices · 3e01cca3
      Bart Van Assche authored
      commit bba0bdd7ad4713d82338bcd9b72d57e9335a664b upstream.
      SCSI transport drivers and SCSI LLDs block a SCSI device if the
      transport layer is not operational. This means that in this state
      no requests should be processed, even if the REQ_PREEMPT flag has
      been set. This patch avoids that a rescan shortly after a cable
      pull sporadically triggers the following kernel oops:
      BUG: unable to handle kernel paging request at ffffc9001a6bc084
      IP: [<ffffffffa04e08f2>] mlx4_ib_post_send+0xd2/0xb30 [mlx4_ib]
      Process rescan-scsi-bus (pid: 9241, threadinfo ffff88053484a000, task ffff880534aae100)
      Call Trace:
       [<ffffffffa0718135>] srp_post_send+0x65/0x70 [ib_srp]
       [<ffffffffa071b9df>] srp_queuecommand+0x1cf/0x3e0 [ib_srp]
       [<ffffffffa0001ff1>] scsi_dispatch_cmd+0x101/0x280 [scsi_mod]
       [<ffffffffa0009ad1>] scsi_request_fn+0x411/0x4d0 [scsi_mod]
       [<ffffffff81223b37>] __blk_run_queue+0x27/0x30
       [<ffffffff8122a8d2>] blk_execute_rq_nowait+0x82/0x110
       [<ffffffff8122a9c2>] blk_execute_rq+0x62/0xf0
       [<ffffffffa000b0e8>] scsi_execute+0xe8/0x190 [scsi_mod]
       [<ffffffffa000b2f3>] scsi_execute_req+0xa3/0x130 [scsi_mod]
       [<ffffffffa000c1aa>] scsi_probe_lun+0x17a/0x450 [scsi_mod]
       [<ffffffffa000ce86>] scsi_probe_and_add_lun+0x156/0x480 [scsi_mod]
       [<ffffffffa000dc2f>] __scsi_scan_target+0xdf/0x1f0 [scsi_mod]
       [<ffffffffa000dfa3>] scsi_scan_host_selected+0x183/0x1c0 [scsi_mod]
       [<ffffffffa000edfb>] scsi_scan+0xdb/0xe0 [scsi_mod]
       [<ffffffffa000ee13>] store_scan+0x13/0x20 [scsi_mod]
       [<ffffffff811c8d9b>] sysfs_write_file+0xcb/0x160
       [<ffffffff811589de>] vfs_write+0xce/0x140
       [<ffffffff81158b53>] sys_write+0x53/0xa0
       [<ffffffff81464592>] system_call_fastpath+0x16/0x1b
       [<00007f611c9d9300>] 0x7f611c9d92ff
      Reported-by: Max Gurtuvoy <maxg@mellanox.com>
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
      Signed-off-by: James Bottomley <JBottomley@Odin.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  9. 02 Apr, 2015 1 commit
  10. 26 Mar, 2015 1 commit
    • Tejun Heo
      workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE · d8bee0e3
      Tejun Heo authored
      commit 8603e1b30027f943cc9c1eef2b291d42c3347af1 upstream.
      cancel[_delayed]_work_sync() are implemented using
      __cancel_work_timer() which grabs the PENDING bit using
      try_to_grab_pending() and then flushes the work item with PENDING set
      to prevent the on-going execution of the work item from requeueing
      itself.
      try_to_grab_pending() can always grab PENDING bit without blocking
      except when someone else is doing the above flushing during
      cancelation.  In that case, try_to_grab_pending() returns -ENOENT.  In
      this case, __cancel_work_timer() currently invokes flush_work().  The
      assumption is that the completion of the work item is what the other
      canceling task would be waiting for too and thus waiting for the same
      condition and retrying should allow forward progress without excessive
      busy looping.
      Unfortunately, this doesn't work if preemption is disabled or the
      latter task has real time priority.  Let's say task A just got woken
      up from flush_work() by the completion of the target work item.  If,
      before task A starts executing, task B gets scheduled and invokes
      __cancel_work_timer() on the same work item, its try_to_grab_pending()
      will return -ENOENT as the work item is still being canceled by task A
      and flush_work() will also immediately return false as the work item
      is no longer executing.  This puts task B in a busy loop possibly
      preventing task A from executing and clearing the canceling state on
      the work item leading to a hang.
      task A			task B			worker
      						executing work
      __cancel_work_timer()
        try_to_grab_pending()
        set work CANCELING
        flush_work()
          block for work completion
      						completion, wakes up A
      			__cancel_work_timer()
      			while (forever) {
      			    try_to_grab_pending()
      			    -ENOENT as work is being canceled
      			    flush_work()
      			    false as work is no longer executing
      			}
      This patch removes the possible hang by updating __cancel_work_timer()
      to explicitly wait for clearing of CANCELING rather than invoking
      flush_work() after try_to_grab_pending() fails with -ENOENT.
      Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com
      v3: bit_waitqueue() can't be used for work items defined in vmalloc
          area.  Switched to custom wake function which matches the target
          work item and exclusive wait and wakeup.
      v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if
          the target bit waitqueue has wait_bit_queue's on it.  Use
          DEFINE_WAIT_BIT() and __wake_up_bit() instead.  Reported by Tomeu
          Vizoso.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Rabin Vincent <rabin.vincent@axis.com>
      Cc: Tomeu Vizoso <tomeu.vizoso@gmail.com>
      Tested-by: Jesper Nilsson <jesper.nilsson@axis.com>
      Tested-by: Rabin Vincent <rabin.vincent@axis.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  11. 06 Mar, 2015 2 commits
    • Sebastian Andrzej Siewior
      usb: core: buffer: smallest buffer should start at ARCH_DMA_MINALIGN · 8b1d57fd
      Sebastian Andrzej Siewior authored
      commit 5efd2ea8c9f4f12916ffc8ba636792ce052f6911 upstream.
      the following error pops up during "testusb -a -t 10"
      | musb-hdrc musb-hdrc.1.auto: dma_pool_free buffer-128,	f134e000/be842000 (bad dma)
      hcd_buffer_create() creates a few buffers, the smallest has 32 bytes of
      size. ARCH_KMALLOC_MINALIGN is set to 64 bytes. This combo results in
      hcd_buffer_alloc() returning memory which is 32 bytes aligned and it
      might by identified by buffer_offset() as another buffer. This means the
      buffer which is on a 32 byte boundary will not get freed, instead it
      tries to free another buffer with the error message.
      This patch fixes the issue by creating the smallest DMA buffer with the
      size of ARCH_KMALLOC_MINALIGN (or 32 in case ARCH_KMALLOC_MINALIGN is
      smaller). This might be 32, 64 or even 128 bytes. The next three pools
      will have the size 128, 512 and 2048.
      In case the smallest pool is 128 bytes then we have only three pools
      instead of four (and zero the first entry in the array).
      The last pool size is always 2048 bytes which is the assumed PAGE_SIZE /
      2 of 4096. I doubt it makes sense to continue using PAGE_SIZE / 2 where
      we would end up with 8KiB buffer in case we have 16KiB pages.
      Instead I think it makes sense to have a common size(s) and extend them
      if there is need to.
      There is a BUILD_BUG_ON() now in case someone has a minalign of more than
      128 bytes.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: Alan Stern <stern@rowland.harvard.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Jan Kara
      fsnotify: fix handling of renames in audit · 65c62025
      Jan Kara authored
      commit 6ee8e25fc3e916193bce4ebb43d5439e1e2144ab upstream.
      Commit e9fd702a ("audit: convert audit watches to use fsnotify
      instead of inotify") broke handling of renames in audit.  Audit code
      wants to update inode number of an inode corresponding to watched name
      in a directory.  When something gets renamed into a directory to a
      watched name, inotify previously passed moved inode to audit code
      however new fsnotify code passes directory inode where the change
      happened.  That confuses audit and it starts watching parent directory
      instead of a file in a directory.
      This can be observed for example by doing:
        cd /tmp
        touch foo bar
        auditctl -w /tmp/foo
        touch foo
        mv bar foo
        touch foo
      In audit log we see events like:
        type=CONFIG_CHANGE msg=audit(1423563584.155:90): auid=1000 ses=2 op="updated rules" path="/tmp/foo" key=(null) list=4 res=1
        type=PATH msg=audit(1423563584.155:91): item=2 name="bar" inode=1046884 dev=08:02 mode=0100644 ouid=0 ogid=0 rdev=00:00 nametype=DELETE
        type=PATH msg=audit(1423563584.155:91): item=3 name="foo" inode=1046842 dev=08:02 mode=0100644 ouid=0 ogid=0 rdev=00:00 nametype=DELETE
        type=PATH msg=audit(1423563584.155:91): item=4 name="foo" inode=1046884 dev=08:02 mode=0100644 ouid=0 ogid=0 rdev=00:00 nametype=CREATE
      and that's it - we see event for the first touch after creating the
      audit rule, we see events for rename but we don't see any event for the
      last touch.  However we start seeing events for unrelated stuff
      happening in /tmp.
      Fix the problem by passing moved inode as data in the FS_MOVED_FROM and
      FS_MOVED_TO events instead of the directory where the change happens.
      This doesn't introduce any new problems because no one besides
      audit_watch.c cares about the passed value:
        fs/notify/fanotify/fanotify.c cares only about FSNOTIFY_EVENT_PATH events.
        fs/notify/dnotify/dnotify.c doesn't care about passed 'data' value at all.
        fs/notify/inotify/inotify_fsnotify.c uses 'data' only for FSNOTIFY_EVENT_PATH.
        kernel/audit_tree.c doesn't care about passed 'data' at all.
        kernel/audit_watch.c expects moved inode as 'data'.
      Fixes: e9fd702a ("audit: convert audit watches to use fsnotify instead of inotify")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: Eric Paris <eparis@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  12. 27 Feb, 2015 2 commits
    • Eric Dumazet
      ipv4: tcp: get rid of ugly unicast_sock · 6bed3166
      Eric Dumazet authored
      [ Upstream commit bdbbb8527b6f6a358dbcb70dac247034d665b8e4 ]
      In commit be9f4a44 ("ipv4: tcp: remove per net tcp_sock")
      I tried to address contention on a socket lock, but the solution
      I chose was horrible:
      commit 3a7c384f ("ipv4: tcp: unicast_sock should not land outside
      of TCP stack") addressed a selinux regression.
      commit 0980e56e ("ipv4: tcp: set unicast_sock uc_ttl to -1")
      took care of another regression.
      commit b5ec8eea ("ipv4: fix ip_send_skb()") fixed another regression.
      commit 811230cd85 ("tcp: ipv4: initialize unicast_sock sk_pacing_rate")
      was another shot in the dark.
      Really, just use a proper socket per cpu, and remove the skb_orphan()
      call, to re-enable flow control.
      This solves a serious problem with FQ packet scheduler when used in
      hostile environments, as we do not want to allocate a flow structure
      for every RST packet sent in response to a spoofed packet.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Hannes Frederic Sowa
      ipv4: try to cache dst_entries which would cause a redirect · 8c6dafeb
      Hannes Frederic Sowa authored
      [ Upstream commit df4d92549f23e1c037e83323aff58a21b3de7fe0 ]
      Not caching dst_entries which cause redirects could be exploited by hosts
      on the same subnet, causing a severe DoS attack. This effect has been
      aggravated since commit f88649721268999 ("ipv4: fix dst race in sk_dst_get()").
      Lookups causing redirects will be allocated with DST_NOCACHE set which
      will force dst_release to free them via RCU.  Unfortunately waiting for
      RCU grace period just takes too long, we can end up with >1M dst_entries
      waiting to be released and the system will run OOM. rcuos threads cannot
      catch up under high softirq load.
      Attaching the flag to emit a redirect later on to the specific skb allows
      us to cache those dst_entries thus reducing the pressure on allocation
      and deallocation.
      This issue was discovered by Marcelo Leitner.
      Cc: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Marcelo Leitner <mleitner@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  13. 11 Feb, 2015 1 commit
  14. 30 Jan, 2015 3 commits
  15. 16 Jan, 2015 2 commits
    • Linus Torvalds
      mm: propagate error from stack expansion even for guard page · 88b5d12c
      Linus Torvalds authored
      commit fee7e49d45149fba60156f5b59014f764d3e3728 upstream.
      Jay Foad reports that the address sanitizer test (asan) sometimes gets
      confused by a stack pointer that ends up being outside the stack vma
      that is reported by /proc/maps.
      This happens due to an interaction between RLIMIT_STACK and the guard
      page: when we do the guard page check, we ignore the potential error
      from the stack expansion, which effectively results in a missing guard
      page, since the expected stack expansion won't have been done.
      And since /proc/maps explicitly ignores the guard page (commit
      d7824370: "mm: fix up some user-visible effects of the stack guard
      page"), the stack pointer ends up being outside the reported stack area.
      This is the minimal patch: it just propagates the error.  It also
      effectively makes the guard page part of the stack limit, which in turn
      means that the actual real stack is one page less than the stack limit.
      Let's see if anybody notices.  We could teach acct_stack_growth() to
      allow an extra page for a grow-up/grow-down stack in the rlimit test,
      but I don't want to add more complexity if it isn't needed.
      Reported-and-tested-by: Jay Foad <jay.foad@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Tony Lindgren
      pstore-ram: Allow optional mapping with pgprot_noncached · c0d9d658
      Tony Lindgren authored
      commit 027bc8b08242c59e19356b4b2c189f2d849ab660 upstream.
      On some ARMs the memory can be mapped pgprot_noncached() and still
      be working for atomic operations. As pointed out by Colin Cross
      <ccross@android.com>, in some cases you do want to use
      pgprot_noncached() if the SoC supports it to see a debug printk
      just before a write hanging the system.
      On ARMs, the atomic operations on strongly ordered memory are
      implementation defined. So let's provide an optional kernel parameter
      for configuring pgprot_noncached(), and use pgprot_writecombine() by
      default.
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rob Herring <robherring2@gmail.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Olof Johansson <olof@lixom.net>
      Cc: Russell King <linux@arm.linux.org.uk>
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Tony Lindgren <tony@atomide.com>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  16. 08 Jan, 2015 3 commits
    • Eric W. Biederman
      userns: Add a knob to disable setgroups on a per user namespace basis · 1c587ee5
      Eric W. Biederman authored
      commit 9cc46516ddf497ea16e8d7cb986ae03a0f6b92f8 upstream.
      - Expose the knob to user space through a proc file /proc/<pid>/setgroups
        A value of "deny" means the setgroups system call is disabled in the
        current process's user namespace and can not be enabled in the
        future in this user namespace.
        A value of "allow" means the setgroups system call is enabled.
      - Descendant user namespaces inherit the value of setgroups from
        their parents.
      - A proc file is used (instead of a sysctl) as sysctls currently do
        not allow checking the permissions at open time.
      - Writing to the proc file is restricted to before the gid_map
        for the user namespace is set.
        This ensures that disabling setgroups at a user namespace
        level will never remove the ability to call setgroups
        from a process that already has that ability.
        A process may opt in to the setgroups disable for itself by
        creating, entering and configuring a user namespace or by calling
        setns on an existing user namespace with setgroups disabled.
        Processes without privileges already can not call setgroups so this
        is a noop.  Processes with privilege become processes without
        privilege when entering a user namespace and as with any other path
        to dropping privilege they would not have the ability to call
        setgroups.  So this remains within the bounds of what is possible
        without a knob to disable setgroups permanently in a user namespace.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Eric W. Biederman
      userns: Don't allow setgroups until a gid mapping has been established · fc9b65e3
      Eric W. Biederman authored
      commit 273d2c67c3e179adb1e74f403d1e9a06e3f841b5 upstream.
      setgroups is unique in not needing a valid mapping before it can be called,
      in the case of setgroups(0, NULL) which drops all supplemental groups.
      The design of the user namespace assumes that CAP_SETGID can not actually
      be used until a gid mapping is established.  Therefore add a helper function
      to see if the user namespace gid mapping has been established and call
      that function in the setgroups permission check.
      This is part of the fix for CVE-2014-8989, being able to drop groups
      without privilege using user namespaces.
      Reviewed-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Eric W. Biederman
      groups: Consolidate the setgroups permission checks · 4accc8c8
      Eric W. Biederman authored
      commit 7ff4d90b4c24a03666f296c3d4878cd39001e81e upstream.
      Today there are 3 instances of setgroups and due to an oversight their
      permission checking has diverged.  Add a common function so that
      they may all share the same permission checking code.
      This corrects the oversight in the current permission checks
      and adds a helper to avoid this in the future.
      A user namespace security fix will update this new helper, shortly.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  17. 06 Dec, 2014 3 commits
  18. 21 Nov, 2014 4 commits
    • Johannes Weiner
      mm: memcg: handle non-error OOM situations more gracefully · f8a51179
      Johannes Weiner authored
      commit 4942642080ea82d99ab5b653abb9a12b7ba31f4a upstream.
      Commit 3812c8c8f395 ("mm: memcg: do not trap chargers with full
      callstack on OOM") assumed that only a few places that can trigger a
      memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
      readahead.  But there are many more and it's impractical to annotate
      them all.
      First of all, we don't want to invoke the OOM killer when the failed
      allocation is gracefully handled, so defer the actual kill to the end of
      the fault handling as well.  This simplifies the code quite a bit as an
      added bonus.
      Second, since a failed allocation might not be the abrupt end of the
      fault, the memcg OOM handler needs to be re-entrant until the fault
      finishes for subsequent allocation attempts.  If an allocation is
      attempted after the task already OOMed, allow it to bypass the limit so
      that it can quickly finish the fault and invoke the OOM killer.
      Reported-by: azurIt <azurit@pobox.sk>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Johannes Weiner
      mm: memcg: do not trap chargers with full callstack on OOM · f79d6a46
      Johannes Weiner authored
      commit 3812c8c8f3953921ef18544110dafc3505c1ac62 upstream.
      The memcg OOM handling is incredibly fragile and can deadlock.  When a
      task fails to charge memory, it invokes the OOM killer and loops right
      there in the charge code until it succeeds.  Comparably, any other task
      that enters the charge path at this point will go to a waitqueue right
      then and there and sleep until the OOM situation is resolved.  The problem
      is that these tasks may hold filesystem locks and the mmap_sem; locks that
      the selected OOM victim may need to exit.
      For example, in one reported case, the task invoking the OOM killer was
      about to charge a page cache page during a write(), which holds the
      i_mutex.  The OOM killer selected a task that was just entering truncate()
      and trying to acquire the i_mutex:
      OOM invoking task:
        generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
      OOM kill victim:
        do_truncate+0x58/0xa0              # takes i_mutex
      The OOM handling task will retry the charge indefinitely while the OOM
      killed task is not releasing any resources.
      A similar scenario can happen when the kernel OOM killer for a memcg is
      disabled and a userspace task is in charge of resolving OOM situations.
      In this case, ALL tasks that enter the OOM path will be made to sleep on
      the OOM waitqueue and wait for userspace to free resources or increase
      the group's limit.  But a userspace OOM handler is prone to deadlock
      itself on the locks held by the waiting tasks.  For example one of the
      sleeping tasks may be stuck in a brk() call with the mmap_sem held for
      writing but the userspace handler, in order to pick an optimal victim,
      may need to read files from /proc/<pid>, which tries to acquire the same
      mmap_sem for reading and deadlocks.
      This patch changes the way tasks behave after detecting a memcg OOM and
      makes sure nobody loops or sleeps with locks held:
      1. When OOMing in a user fault, invoke the OOM killer and restart the
         fault instead of looping on the charge attempt.  This way, the OOM
         victim can not get stuck on locks the looping task may hold.
      2. When OOMing in a user fault but somebody else is handling it
         (either the kernel OOM killer or a userspace handler), don't go to
         sleep in the charge context.  Instead, remember the OOMing memcg in
         the task struct and then fully unwind the page fault stack with
         -ENOMEM.  pagefault_out_of_memory() will then call back into the
         memcg code to check if the -ENOMEM came from the memcg, and then
         either put the task to sleep on the memcg's OOM waitqueue or just
         restart the fault.  The OOM victim can no longer get stuck on any
         lock a sleeping task may hold.
      Debugged by Michal Hocko.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: azurIt <azurit@pobox.sk>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Johannes Weiner
      mm: memcg: enable memcg OOM killer only for user faults · 11f34787
      Johannes Weiner authored
      commit 519e52473ebe9db5cdef44670d5a97f1fd53d721 upstream.
      System calls and kernel faults (uaccess, gup) can handle an out of memory
      situation gracefully and just return -ENOMEM.
      Enable the memcg OOM killer only for user faults, where it's really the
      only option available.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Johannes Weiner
      arch: mm: pass userspace fault flag to generic fault handler · e2ec2c2b
      Johannes Weiner authored
      commit 759496ba6407c6994d6a5ce3a5e74937d7816208 upstream.
      Unlike global OOM handling, memory cgroup code will invoke the OOM killer
      in any OOM situation because it has no way of telling faults occurring in
      kernel context - which could be handled more gracefully - from
      user-triggered faults.
      Pass a flag that identifies faults originating in user space from the
      architecture-specific fault handlers to generic code so that memcg OOM
      handling can be improved.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>