1. 17 Sep, 2014 40 commits
    • Sage Weil's avatar
      libceph: gracefully handle large reply messages from the mon · 12477ec8
      Sage Weil authored
      commit 73c3d4812b4c755efeca0140f606f83772a39ce4 upstream.
      We preallocate a few of the message types we get back from the mon.  If we
      get a larger message than we are expecting, fall back to trying to allocate
      a new one instead of blindly using the one we have.
      Signed-off-by: default avatarSage Weil <sage@redhat.com>
      Reviewed-by: default avatarIlya Dryomov <ilya.dryomov@inktank.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Ilya Dryomov's avatar
      libceph: rename ceph_msg::front_max to front_alloc_len · 842a5780
      Ilya Dryomov authored
      commit 3cea4c3071d4e55e9d7356efe9d0ebf92f0c2204 upstream.
      Rename front_max field of struct ceph_msg to front_alloc_len to make
      its purpose more clear.
      Signed-off-by: default avatarIlya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: default avatarSage Weil <sage@inktank.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Jason Gunthorpe's avatar
      tpm: Provide a generic means to override the chip returned timeouts · d64269e3
      Jason Gunthorpe authored
      commit 8e54caf407b98efa05409e1fee0e5381abd2b088 upstream.
      Some Atmel TPMs provide completely wrong timeouts from their
      TPM_CAP_PROP_TIS_TIMEOUT query. This patch detects that and returns
      new correct values via a DID/VID table in the TIS driver.
      Tested on ARM using an AT97SC3204T FW version 37.16
      [PHuewe: without this fix these 'broken' Atmel TPMs won't function on
      older kernels]
      Signed-off-by: default avatar"Berg, Christopher" <Christopher.Berg@atmel.com>
      Signed-off-by: default avatarJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: default avatarPeter Huewe <peterhuewe@gmx.de>
      [bwh: Backported to 3.10:
       - Adjust filename, context
       - s/chip->ops->/chip->vendor./]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Linus Torvalds's avatar
      vfs: fix bad hashing of dentries · d4c96061
      Linus Torvalds authored
      commit 99d263d4c5b2f541dfacb5391e22e8c91ea982a6 upstream.
      Josef Bacik found a performance regression between 3.2 and 3.10 and
      narrowed it down to commit bfcfaa77 ("vfs: use 'unsigned long'
      accesses for dcache name comparison and hashing"). He reports:
       "The test case is essentially
            for (i = 0; i < 1000000; i++)
        On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k
        dir/sec with 3.10.  This is because we spend waaaaay more time in
        __d_lookup on 3.10 than in 3.2.
        The new hashing function for strings is suboptimal for <
        sizeof(unsigned long) string names (and hell even > sizeof(unsigned
        long) string names that I've tested).  I broke out the old hashing
        function and the new one into a userspace helper to get real numbers
        and this is what I'm getting:
            Old hash table had 1000000 entries, 0 dupes, 0 max dupes
            New hash table had 12628 entries, 987372 dupes, 900 max dupes
            We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash
        My test does the hash, and then does the d_hash into a integer pointer
        array the same size as the dentry hash table on my system, and then
        just increments the value at the address we got to see how many
        entries we overlap with.
        As you can see the old hash function ended up with all 1 million
        entries in their own bucket, whereas the new one they are only
        distributed among ~12.5k buckets, which is why we're using so much
        more CPU in __d_lookup".
      The reason for this hash regression is two-fold:
       - On 64-bit architectures the down-mixing of the original 64-bit
         word-at-a-time hash into the final 32-bit hash value is very
         simplistic and suboptimal, and just adds the two 32-bit parts
         In particular, because there is no bit shuffling and the mixing
         boundary is also a byte boundary, similar character patterns in the
         low and high word easily end up just canceling each other out.
       - the old byte-at-a-time hash mixed each byte into the final hash as it
         hashed the path component name, resulting in the low bits of the hash
         generally being a good source of hash data.  That is not true for the
         word-at-a-time case, and the hash data is distributed among all the
      The fix is the same in both cases: do a better job of mixing the bits up
      and using as much of the hash data as possible.  We already have the
      "hash_32|64()" functions to do that.
      Reported-by: default avatarJosef Bacik <jbacik@fb.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Al Viro's avatar
      dcache.c: get rid of pointless macros · a6c56468
      Al Viro authored
      commit 482db9066199813d6b999b65a3171afdbec040b6 upstream.
      D_HASH{MASK,BITS} are used once each, both in the same function (d_hash()).
      At this point they are actively misguiding - they imply that values are
      compiler constants, which is no longer true.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Bart Van Assche's avatar
      IB/srp: Fix deadlock between host removal and multipathd · 70efec16
      Bart Van Assche authored
      commit bcc05910359183b431da92713e98eed478edf83a upstream.
      If scsi_remove_host() is invoked after a SCSI device has been blocked,
      if the fast_io_fail_tmo or dev_loss_tmo work gets scheduled on the
      workqueue executing srp_remove_work() and if an I/O request is
      scheduled after the SCSI device had been blocked by e.g. multipathd
      then the following deadlock can occur:
          kworker/6:1     D ffff880831f3c460     0   195      2 0x00000000
          Call Trace:
           [<ffffffff814aafd9>] schedule+0x29/0x70
           [<ffffffff814aa0ef>] schedule_timeout+0x10f/0x2a0
           [<ffffffff8105af6f>] msleep+0x2f/0x40
           [<ffffffff8123b0ae>] __blk_drain_queue+0x4e/0x180
           [<ffffffff8123d2d5>] blk_cleanup_queue+0x225/0x230
           [<ffffffffa0010732>] __scsi_remove_device+0x62/0xe0 [scsi_mod]
           [<ffffffffa000ed2f>] scsi_forget_host+0x6f/0x80 [scsi_mod]
           [<ffffffffa0002eba>] scsi_remove_host+0x7a/0x130 [scsi_mod]
           [<ffffffffa07cf5c5>] srp_remove_work+0x95/0x180 [ib_srp]
           [<ffffffff8106d7aa>] process_one_work+0x1ea/0x6c0
           [<ffffffff8106dd9b>] worker_thread+0x11b/0x3a0
           [<ffffffff810758bd>] kthread+0xed/0x110
           [<ffffffff814b972c>] ret_from_fork+0x7c/0xb0
          multipathd      D ffff880096acc460     0  5340      1 0x00000000
          Call Trace:
           [<ffffffff814aafd9>] schedule+0x29/0x70
           [<ffffffff814aa0ef>] schedule_timeout+0x10f/0x2a0
           [<ffffffff814ab79b>] io_schedule_timeout+0x9b/0xf0
           [<ffffffff814abe1c>] wait_for_completion_io_timeout+0xdc/0x110
           [<ffffffff81244b9b>] blk_execute_rq+0x9b/0x100
           [<ffffffff8124f665>] sg_io+0x1a5/0x450
           [<ffffffff8124fd21>] scsi_cmd_ioctl+0x2a1/0x430
           [<ffffffff8124fef2>] scsi_cmd_blk_ioctl+0x42/0x50
           [<ffffffffa00ec97e>] sd_ioctl+0xbe/0x140 [sd_mod]
           [<ffffffff8124bd04>] blkdev_ioctl+0x234/0x840
           [<ffffffff811cb491>] block_ioctl+0x41/0x50
           [<ffffffff811a0df0>] do_vfs_ioctl+0x300/0x520
           [<ffffffff811a1051>] SyS_ioctl+0x41/0x80
           [<ffffffff814b9962>] tracesys+0xd0/0xd5
      Fix this by scheduling removal work on another workqueue than the
      transport layer timers.
      Signed-off-by: default avatarBart Van Assche <bvanassche@acm.org>
      Reviewed-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Reviewed-by: default avatarDavid Dillow <dave@thedillows.org>
      Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
      Signed-off-by: default avatarRoland Dreier <roland@purestorage.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Tejun Heo's avatar
      blkcg: don't call into policy draining if root_blkg is already gone · f5b48b7a
      Tejun Heo authored
      commit 2a1b4cf2331d92bc009bf94fa02a24604cdaf24c upstream.
      While a queue is being destroyed, all the blkgs are destroyed and its
      ->root_blkg pointer is set to NULL.  If someone else starts to drain
      while the queue is in this state, the following oops happens.
        NULL pointer dereference at 0000000000000028
        IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
        PGD e4a1067 PUD b773067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched]
        CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000
        RIP: 0010:[<ffffffff8144e944>]  [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230
        RSP: 0018:ffff88000efd7bf0  EFLAGS: 00010046
        RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
        RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001
        R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450
        R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28
        FS:  00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0
         ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80
         ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58
         ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450
        Call Trace:
         [<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60
         [<ffffffff81427641>] __blk_drain_queue+0x71/0x180
         [<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0
         [<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120
         [<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50
         [<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40
         [<ffffffff8142d476>] blk_release_queue+0x26/0xd0
         [<ffffffff81454968>] kobject_cleanup+0x38/0x70
         [<ffffffff81454848>] kobject_put+0x28/0x60
         [<ffffffff81427505>] blk_put_queue+0x15/0x20
         [<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0
         [<ffffffff810bc339>] execute_in_process_context+0x89/0xa0
         [<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20
         [<ffffffff817930e2>] device_release+0x32/0xa0
         [<ffffffff81454968>] kobject_cleanup+0x38/0x70
         [<ffffffff81454848>] kobject_put+0x28/0x60
         [<ffffffff817934d7>] put_device+0x17/0x20
         [<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0
         [<ffffffff817d121b>] scsi_remove_device+0x2b/0x40
         [<ffffffff817d1257>] sdev_store_delete+0x27/0x30
         [<ffffffff81792ca8>] dev_attr_store+0x18/0x30
         [<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50
         [<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170
         [<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0
         [<ffffffff811f69bd>] SyS_write+0x4d/0xc0
         [<ffffffff81d24692>] system_call_fastpath+0x16/0x1b
      776687bce42b ("block, blk-mq: draining can't be skipped even if
      bypass_depth was non-zero") made it easier to trigger this bug by
      making blk_queue_bypass_start() drain even when it loses the first
      bypass test to blk_cleanup_queue(); however, the bug has always been
      there even before the commit as blk_queue_bypass_start() could race
      against queue destruction, win the initial bypass test but perform the
      actual draining after blk_cleanup_queue() already destroyed all blkgs.
      Fix it by skippping calling into policy draining if all the blkgs are
      already gone.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarShirish Pargaonkar <spargaonkar@suse.com>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Reported-by: default avatarJet Chen <jet.chen@intel.com>
      Tested-by: default avatarShirish Pargaonkar <spargaonkar@suse.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Roger Quadros's avatar
      mtd: nand: omap: Fix 1-bit Hamming code scheme, omap_calculate_ecc() · 6562c0cc
      Roger Quadros authored
      commit 40ddbf5069bd4e11447c0088fc75318e0aac53f0 upstream.
      commit 65b97cf6 introduced in v3.7 caused a regression
      by using a reversed CS_MASK thus causing omap_calculate_ecc to
      always fail. As the NAND base driver never checks for .calculate()'s
      return value, the zeroed ECC values are used as is without showing
      any error to the user. However, this won't work and the NAND device
      won't be guarded by any error code.
      Fix the issue by using the correct mask.
      Code was tested on omap3beagle using the following procedure
      - flash the primary bootloader (MLO) from the kernel to the first
      NAND partition using nandwrite.
      - boot the board from NAND. This utilizes OMAP ROM loader that
      relies on 1-bit Hamming code ECC.
      Fixes: 65b97cf6 (mtd: nand: omap2: handle nand on gpmc)
      Signed-off-by: default avatarRoger Quadros <rogerq@ti.com>
      Signed-off-by: default avatarTony Lindgren <tony@atomide.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Kevin Hao's avatar
      mtd/ftl: fix the double free of the buffers allocated in build_maps() · a9d28db6
      Kevin Hao authored
      commit a152056c912db82860a8b4c23d0bd3a5aa89e363 upstream.
      I got the following panic on my fsl p5020ds board.
        Unable to handle kernel paging request for data at address 0x7375627379737465
        Faulting instruction address: 0xc000000000100778
        Oops: Kernel access of bad area, sig: 11 [#1]
        SMP NR_CPUS=24 CoreNet Generic
        Modules linked in:
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.15.0-next-20140613 #145
        task: c0000000fe080000 ti: c0000000fe088000 task.ti: c0000000fe088000
        NIP: c000000000100778 LR: c00000000010073c CTR: 0000000000000000
        REGS: c0000000fe08aa00 TRAP: 0300   Not tainted  (3.15.0-next-20140613)
        MSR: 0000000080029000 <CE,EE,ME>  CR: 24ad2e24  XER: 00000000
        DEAR: 7375627379737465 ESR: 0000000000000000 SOFTE: 1
        GPR00: c0000000000c99b0 c0000000fe08ac80 c0000000009598e0 c0000000fe001d80
        GPR04: 00000000000000d0 0000000000000913 c000000007902b20 0000000000000000
        GPR08: c0000000feaae888 0000000000000000 0000000007091000 0000000000200200
        GPR12: 0000000028ad2e28 c00000000fff4000 c0000000007abe08 0000000000000000
        GPR16: c0000000007ab160 c0000000007aaf98 c00000000060ba68 c0000000007abda8
        GPR20: c0000000007abde8 c0000000feaea6f8 c0000000feaea708 c0000000007abd10
        GPR24: c000000000989370 c0000000008c6228 00000000000041ed c0000000fe00a400
        GPR28: c00000000017c1cc 00000000000000d0 7375627379737465 c0000000fe001d80
        NIP [c000000000100778] .__kmalloc_track_caller+0x70/0x168
        LR [c00000000010073c] .__kmalloc_track_caller+0x34/0x168
        Call Trace:
        [c0000000fe08ac80] [c00000000087e6b8] uevent_sock_list+0x0/0x10 (unreliable)
        [c0000000fe08ad20] [c0000000000c99b0] .kstrdup+0x44/0x90
        [c0000000fe08adc0] [c00000000017c1cc] .__kernfs_new_node+0x4c/0x130
        [c0000000fe08ae70] [c00000000017d7e4] .kernfs_new_node+0x2c/0x64
        [c0000000fe08aef0] [c00000000017db00] .kernfs_create_dir_ns+0x34/0xc8
        [c0000000fe08af80] [c00000000018067c] .sysfs_create_dir_ns+0x58/0xcc
        [c0000000fe08b010] [c0000000002c711c] .kobject_add_internal+0xc8/0x384
        [c0000000fe08b0b0] [c0000000002c7644] .kobject_add+0x64/0xc8
        [c0000000fe08b140] [c000000000355ebc] .device_add+0x11c/0x654
        [c0000000fe08b200] [c0000000002b5988] .add_disk+0x20c/0x4b4
        [c0000000fe08b2c0] [c0000000003a21d4] .add_mtd_blktrans_dev+0x340/0x514
        [c0000000fe08b350] [c0000000003a3410] .mtdblock_add_mtd+0x74/0xb4
        [c0000000fe08b3e0] [c0000000003a32cc] .blktrans_notify_add+0x64/0x94
        [c0000000fe08b470] [c00000000039b5b4] .add_mtd_device+0x1d4/0x368
        [c0000000fe08b520] [c00000000039b830] .mtd_device_parse_register+0xe8/0x104
        [c0000000fe08b5c0] [c0000000003b8408] .of_flash_probe+0x72c/0x734
        [c0000000fe08b750] [c00000000035ba40] .platform_drv_probe+0x38/0x84
        [c0000000fe08b7d0] [c0000000003599a4] .really_probe+0xa4/0x29c
        [c0000000fe08b870] [c000000000359d3c] .__driver_attach+0x100/0x104
        [c0000000fe08b900] [c00000000035746c] .bus_for_each_dev+0x84/0xe4
        [c0000000fe08b9a0] [c0000000003593c0] .driver_attach+0x24/0x38
        [c0000000fe08ba10] [c000000000358f24] .bus_add_driver+0x1c8/0x2ac
        [c0000000fe08bab0] [c00000000035a3a4] .driver_register+0x8c/0x158
        [c0000000fe08bb30] [c00000000035b9f4] .__platform_driver_register+0x6c/0x80
        [c0000000fe08bba0] [c00000000084e080] .of_flash_driver_init+0x1c/0x30
        [c0000000fe08bc10] [c000000000001864] .do_one_initcall+0xbc/0x238
        [c0000000fe08bd00] [c00000000082cdc0] .kernel_init_freeable+0x188/0x268
        [c0000000fe08bdb0] [c0000000000020a0] .kernel_init+0x1c/0xf7c
        [c0000000fe08be30] [c000000000000884] .ret_from_kernel_thread+0x58/0xd4
        Instruction dump:
        41bd0010 480000c8 4bf04eb5 60000000 e94d0028 e93f0000 7cc95214 e8a60008
        7fc9502a 2fbe0000 419e00c8 e93f0022 <7f7e482a> 39200000 88ed06b2 992d06b2
        ---[ end trace b4c9a94804a42d40 ]---
      It seems that the corrupted partition header on my mtd device triggers
      a bug in the ftl. In function build_maps() it will allocate the buffers
      needed by the mtd partition, but if something goes wrong such as kmalloc
      failure, mtd read error or invalid partition header parameter, it will
      free all allocated buffers and then return non-zero. In my case, it
      seems that partition header parameter 'NumTransferUnits' is invalid.
      And the ftl_freepart() is a function which free all the partition
      buffers allocated by build_maps(). Given the build_maps() is a self
      cleaning function, so there is no need to invoke this function even
      if build_maps() return with error. Otherwise it will causes the
      buffers to be freed twice and then weird things would happen.
      Signed-off-by: default avatarKevin Hao <haokexin@gmail.com>
      Signed-off-by: default avatarBrian Norris <computersforpeace@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Pavel Shilovsky's avatar
      CIFS: Fix wrong restart readdir for SMB1 · 659c6399
      Pavel Shilovsky authored
      commit f736906a7669a77cf8cabdcbcf1dc8cb694e12ef upstream.
      The existing code calls server->ops->close() that is not
      right. This causes XFS test generic/310 to fail. Fix this
      by using server->ops->closedir() function.
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarPavel Shilovsky <pshilovsky@samba.org>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Pavel Shilovsky's avatar
      CIFS: Fix wrong filename length for SMB2 · 0c17ceb6
      Pavel Shilovsky authored
      commit 1bbe4997b13de903c421c1cc78440e544b5f9064 upstream.
      The existing code uses the old MAX_NAME constant. This causes
      XFS test generic/013 to fail. Fix it by replacing MAX_NAME with
      PATH_MAX that SMB1 uses. Also remove an unused MAX_NAME constant
      Signed-off-by: default avatarPavel Shilovsky <pshilovsky@samba.org>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Pavel Shilovsky's avatar
      CIFS: Fix wrong directory attributes after rename · 4cf2ef68
      Pavel Shilovsky authored
      commit b46799a8f28c43c5264ac8d8ffa28b311b557e03 upstream.
      When we requests rename we also need to update attributes
      of both source and target parent directories. Not doing it
      causes generic/309 xfstest to fail on SMB2 mounts. Fix this
      by marking these directories for force revalidating.
      Signed-off-by: default avatarPavel Shilovsky <pshilovsky@samba.org>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Steve French's avatar
      CIFS: Possible null ptr deref in SMB2_tcon · c6bef3b6
      Steve French authored
      commit 18f39e7be0121317550d03e267e3ebd4dbfbb3ce upstream.
      As Raphael Geissert pointed out, tcon_error_exit can dereference tcon
      and there is one path in which tcon can be null.
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Reported-by: default avatarRaphael Geissert <geissert@debian.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Pavel Shilovsky's avatar
      CIFS: Fix async reading on reconnects · 8f516091
      Pavel Shilovsky authored
      commit 038bc961c31b070269ecd07349a7ee2e839d4fec upstream.
      If we get into read_into_pages() from cifs_readv_receive() and then
      loose a network, we issue cifs_reconnect that moves all mids to
      a private list and issue their callbacks. The callback of the async
      read request sets a mid to retry, frees it and wakes up a process
      that waits on the rdata completion.
      After the connection is established we return from read_into_pages()
      with a short read, use the mid that was freed before and try to read
      the remaining data from the a newly created socket. Both actions are
      not what we want to do. In reconnect cases (-EAGAIN) we should not
      mask off the error with a short read but should return the error
      code instead.
      Acked-by: default avatarJeff Layton <jlayton@samba.org>
      Signed-off-by: default avatarPavel Shilovsky <pshilovsky@samba.org>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Pavel Shilovsky's avatar
      CIFS: Fix STATUS_CANNOT_DELETE error mapping for SMB2 · 9b1eceeb
      Pavel Shilovsky authored
      commit 21496687a79424572f46a84c690d331055f4866f upstream.
      The existing mapping causes unlink() call to return error after delete
      operation. Changing the mapping to -EACCES makes the client process
      the call like CIFS protocol does - reset dos attributes with ATTR_READONLY
      flag masked off and retry the operation.
      Signed-off-by: default avatarPavel Shilovsky <pshilovsky@samba.org>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Ilya Dryomov's avatar
      libceph: do not hard code max auth ticket len · 9c38ff70
      Ilya Dryomov authored
      commit c27a3e4d667fdcad3db7b104f75659478e0c68d8 upstream.
      We hard code cephx auth ticket buffer size to 256 bytes.  This isn't
      enough for any moderate setups and, in case tickets themselves are not
      encrypted, leads to buffer overflows (ceph_x_decrypt() errors out, but
      ceph_decode_copy() doesn't - it's just a memcpy() wrapper).  Since the
      buffer is allocated dynamically anyway, allocated it a bit later, at
      the point where we know how much is going to be needed.
      Fixes: http://tracker.ceph.com/issues/8979Signed-off-by: default avatarIlya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: default avatarSage Weil <sage@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Ilya Dryomov's avatar
      libceph: add process_one_ticket() helper · 2e1dbf27
      Ilya Dryomov authored
      commit 597cda357716a3cf8d994cb11927af917c8d71fa upstream.
      Add a helper for processing individual cephx auth tickets.  Needed for
      the next commit, which deals with allocating ticket buffers.  (Most of
      the diff here is whitespace - view with git diff -b).
      Signed-off-by: default avatarIlya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: default avatarSage Weil <sage@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Ilya Dryomov's avatar
      libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly · a6489727
      Ilya Dryomov authored
      commit 5f740d7e1531099b888410e6bab13f68da9b1a4d upstream.
      Determining ->last_piece based on the value of ->page_offset + length
      is incorrect because length here is the length of the entire message.
      ->last_piece set to false even if page array data item length is <=
      PAGE_SIZE, which results in invalid length passed to
      ceph_tcp_{send,recv}page() and causes various asserts to fire.
          # cat pages-cursor-init.sh
          rbd create --size 10 --image-format 2 foo
          FOO_DEV=$(rbd map foo)
          dd if=/dev/urandom of=$FOO_DEV bs=1M &>/dev/null
          rbd snap create foo@snap
          rbd snap protect foo@snap
          rbd clone foo@snap bar
          # rbd_resize calls librbd rbd_resize(), size is in bytes
          ./rbd_resize bar $(((4 << 20) + 512))
          rbd resize --size 10 bar
          BAR_DEV=$(rbd map bar)
          # trigger a 512-byte copyup -- 512-byte page array data item
          dd if=/dev/urandom of=$BAR_DEV bs=1M count=1 seek=5
      The problem exists only in ceph_msg_data_pages_cursor_init(),
      ceph_msg_data_pages_advance() does the right thing.  The size_t cast is
      Signed-off-by: default avatarIlya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: default avatarSage Weil <sage@redhat.com>
      Reviewed-by: default avatarAlex Elder <elder@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • NeilBrown's avatar
      md/raid1,raid10: always abort recover on write error. · b08633de
      NeilBrown authored
      commit 2446dba03f9dabe0b477a126cbeb377854785b47 upstream.
      Currently we don't abort recovery on a write error if the write error
      to the recovering device was triggerd by normal IO (as opposed to
      recovery IO).
      This means that for one bitmap region, the recovery might write to the
      recovering device for a few sectors, then not bother for subsequent
      sectors (as it never writes to failed devices).  In this case
      the bitmap bit will be cleared, but it really shouldn't.
      The result is that if the recovering device fails and is then re-added
      (after fixing whatever hardware problem triggerred the failure),
      the second recovery won't redo the region it was in the middle of,
      so some of the device will not be recovered properly.
      If we abort the recovery, the region being processes will be cancelled
      (bit not cleared) and the whole region will be retried.
      As the bug can result in data corruption the patch is suitable for
      -stable.  For kernels prior to 3.11 there is a conflict in raid10.c
      which will require care.
      Original-from: jiao hui <jiaohui@bwstor.com.cn>
      Reported-and-tested-by: default avatarjiao hui <jiaohui@bwstor.com.cn>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Chris Mason's avatar
      xfs: don't zero partial page cache pages during O_DIRECT writes · d96dbb06
      Chris Mason authored
      commit 85e584da3212140ee80fd047f9058bbee0bc00d5 upstream.
      xfs is using truncate_pagecache_range to invalidate the page cache
      during DIO reads.  This is different from the other filesystems who
      only invalidate pages during DIO writes.
      truncate_pagecache_range is meant to be used when we are freeing the
      underlying data structs from disk, so it will zero any partial
      ranges in the page.  This means a DIO read can zero out part of the
      page cache page, and it is possible the page will stay in cache.
      buffered reads will find an up to date page with zeros instead of
      the data actually on disk.
      This patch fixes things by using invalidate_inode_pages2_range
      instead.  It preserves the page cache invalidation, but won't zero
      any pages.
      [dchinner: catch error and warn if it fails. Comment.]
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Dave Chinner's avatar
      xfs: don't zero partial page cache pages during O_DIRECT writes · 1025b461
      Dave Chinner authored
      commit 834ffca6f7e345a79f6f2e2d131b0dfba8a4b67a upstream.
      Similar to direct IO reads, direct IO writes are using
      truncate_pagecache_range to invalidate the page cache. This is
      incorrect due to the sub-block zeroing in the page cache that
      truncate_pagecache_range() triggers.
      This patch fixes things by using invalidate_inode_pages2_range
      instead.  It preserves the page cache invalidation, but won't zero
      any pages.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Dave Chinner's avatar
      xfs: don't dirty buffers beyond EOF · 3430681f
      Dave Chinner authored
      commit 22e757a49cf010703fcb9c9b4ef793248c39b0c2 upstream.
      generic/263 is failing fsx at this point with a page spanning
      EOF that cannot be invalidated. The operations are:
      1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
      1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
      1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)
      where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO
      write attempts to invalidate the cached page over this range, it
      fails with -EBUSY and so any attempt to do page invalidation fails.
      The real question is this: Why can't that page be invalidated after
      it has been written to disk and cleaned?
      Well, there's data on the first two buffers in the page (1k block
      size, 4k page), but the third buffer on the page (i.e. beyond EOF)
      is failing drop_buffers because it's bh->b_state == 0x3, which is
      BH_Uptodate | BH_Dirty.  IOWs, there's dirty buffers beyond EOF. Say
      OK, set_buffer_dirty() is called on all buffers from
      __set_page_buffers_dirty(), regardless of whether the buffer is
      beyond EOF or not, which means that when we get to ->writepage,
      we have buffers marked dirty beyond EOF that we need to clean.
      So, we need to implement our own .set_page_dirty method that
      doesn't dirty buffers beyond EOF.
      This is messy because the buffer code is not meant to be shared
      and it has interesting locking issues on the buffer dirty bits.
      So just copy and paste it and then modify it to suit what we need.
      Note: the solutions the other filesystems and generic block code use
      of marking the buffers clean in ->writepage does not work for XFS.
      It still leaves dirty buffers beyond EOF and invalidations still
      fail. Hence rather than play whack-a-mole, this patch simply
      prevents those buffers from being dirtied in the first place.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Dave Chinner's avatar
      xfs: quotacheck leaves dquot buffers without verifiers · 9a9237c9
      Dave Chinner authored
      commit 5fd364fee81a7888af806e42ed8a91c845894f2d upstream.
      When running xfs/305, I noticed that quotacheck was flushing dquot
      buffers that did not have the xfs_dquot_buf_ops verifiers attached:
      XFS (vdb): _xfs_buf_ioapply: no ops on block 0x1dc8/0x1dc8
      ffff880052489000: 44 51 01 04 00 00 65 b8 00 00 00 00 00 00 00 00  DQ....e.........
      ffff880052489010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      ffff880052489020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      ffff880052489030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      CPU: 1 PID: 2376 Comm: mount Not tainted 3.16.0-rc2-dgc+ #306
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
       ffff88006fe38000 ffff88004a0ffae8 ffffffff81cf1cca 0000000000000001
       ffff88004a0ffb88 ffffffff814d50ca 000010004a0ffc70 0000000000000000
       ffff88006be56dc4 0000000000000021 0000000000001dc8 ffff88007c773d80
      Call Trace:
       [<ffffffff81cf1cca>] dump_stack+0x45/0x56
       [<ffffffff814d50ca>] _xfs_buf_ioapply+0x3ca/0x3d0
       [<ffffffff810db520>] ? wake_up_state+0x20/0x20
       [<ffffffff814d51f5>] ? xfs_bdstrat_cb+0x55/0xb0
       [<ffffffff814d513b>] xfs_buf_iorequest+0x6b/0xd0
       [<ffffffff814d51f5>] xfs_bdstrat_cb+0x55/0xb0
       [<ffffffff814d53ab>] __xfs_buf_delwri_submit+0x15b/0x220
       [<ffffffff814d6040>] ? xfs_buf_delwri_submit+0x30/0x90
       [<ffffffff814d6040>] xfs_buf_delwri_submit+0x30/0x90
       [<ffffffff8150f89d>] xfs_qm_quotacheck+0x17d/0x3c0
       [<ffffffff81510591>] xfs_qm_mount_quotas+0x151/0x1e0
       [<ffffffff814ed01c>] xfs_mountfs+0x56c/0x7d0
       [<ffffffff814f0f12>] xfs_fs_fill_super+0x2c2/0x340
       [<ffffffff811c9fe4>] mount_bdev+0x194/0x1d0
       [<ffffffff814f0c50>] ? xfs_finish_flags+0x170/0x170
       [<ffffffff814ef0f5>] xfs_fs_mount+0x15/0x20
       [<ffffffff811ca8c9>] mount_fs+0x39/0x1b0
       [<ffffffff811e4d67>] vfs_kern_mount+0x67/0x120
       [<ffffffff811e757e>] do_mount+0x23e/0xad0
       [<ffffffff8117abde>] ? __get_free_pages+0xe/0x50
       [<ffffffff811e71e6>] ? copy_mount_options+0x36/0x150
       [<ffffffff811e8103>] SyS_mount+0x83/0xc0
       [<ffffffff81cfd40b>] tracesys+0xdd/0xe2
      This was caused by dquot buffer readahead not attaching a verifier
      structure to the buffer when readahead was issued, resulting in the
      followup read of the buffer finding a valid buffer and so not
      attaching new verifiers to the buffer as part of the read.
      Also, when a verifier failure occurs, we then read the buffer
      without verifiers. Attach the verifiers manually after this read so
      that if the buffer is then written it will be verified that the
      corruption has been repaired.
      Further, when flushing a dquot we don't ask for a verifier when
      reading in the dquot buffer the dquot belongs to. Most of the time
      this isn't an issue because the buffer is still cached, but when it
      is not cached it will result in writing the dquot buffer without
      having the verfier attached.
      Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Steve Wise's avatar
      RDMA/iwcm: Use a default listen backlog if needed · 433d80d6
      Steve Wise authored
      commit 2f0304d21867476394cd51a54e97f7273d112261 upstream.
      If the user creates a listening cm_id with backlog of 0 the IWCM ends
      up not allowing any connection requests at all.  The correct behavior
      is for the IWCM to pick a default value if the user backlog parameter
      is zero.
      Lustre from version 1.8.8 onward uses a backlog of 0, which breaks
      iwarp support without this fix.
      Signed-off-by: default avatarSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: default avatarRoland Dreier <roland@purestorage.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • NeilBrown's avatar
      md/raid10: Fix memory leak when raid10 reshape completes. · 26584e18
      NeilBrown authored
      commit b39685526f46976bcd13aa08c82480092befa46c upstream.
      When a raid10 commences a resync/recovery/reshape it allocates
      some buffer space.
      When a resync/recovery completes the buffer space is freed.  But not
      when the reshape completes.
      This can result in a small memory leak.
      There is a subtle side-effect of this bug.  When a RAID10 is reshaped
      to a larger array (more devices), the reshape is immediately followed
      by a "resync" of the new space.  This "resync" will use the buffer
      space which was allocated for "reshape".  This can cause problems
      including a "BUG" in the SCSI layer.  So this is suitable for -stable.
      Fixes: 3ea7daa5Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • NeilBrown's avatar
      md/raid10: fix memory leak when reshaping a RAID10. · 1075d2bd
      NeilBrown authored
      commit ce0b0a46955d1bb389684a2605dbcaa990ba0154 upstream.
      raid10 reshape clears unwanted bits from a bio->bi_flags using
      a method which, while clumsy, worked until 3.10 when BIO_OWNS_VEC
      was added.
      Since then it clears that bit but shouldn't.  This results in a
      memory leak.
      So change to used the approved method of clearing unwanted bits.
      As this causes a memory leak which can consume all of memory
      the fix is suitable for -stable.
      Fixes: a38352e0
      Reported-by: mdraid.pkoch@dfgh.net (Peter Koch)
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • NeilBrown's avatar
      md/raid6: avoid data corruption during recovery of double-degraded RAID6 · 318a3d59
      NeilBrown authored
      commit 9c4bdf697c39805078392d5ddbbba5ae5680e0dd upstream.
      During recovery of a double-degraded RAID6 it is possible for
      some blocks not to be recovered properly, leading to corruption.
      If a write happens to one block in a stripe that would be written to a
      missing device, and at the same time that stripe is recovering data
      to the other missing device, then that recovered data may not be written.
      This patch skips, in the double-degraded case, an optimisation that is
      only safe for single-degraded arrays.
      Bug was introduced in 2.6.32 and fix is suitable for any kernel since
      then.  In an older kernel with separate handle_stripe5() and
      handle_stripe6() functions the patch must change handle_stripe6().
      Fixes: 6c0069c0
      Cc: Yuri Tikhonov <yur@emcraft.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Reported-by: default avatar"Manibalan P" <pmanibalan@amiindia.co.in>
      Tested-by: default avatar"Manibalan P" <pmanibalan@amiindia.co.in>
      Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Acked-by: default avatarDan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Vignesh Raman's avatar
      Bluetooth: Avoid use of session socket after the session gets freed · 07b41b34
      Vignesh Raman authored
      commit 32333edb82fb2009980eefc5518100068147ab82 upstream.
      The commits 08c30aca "Bluetooth: Remove
      RFCOMM session refcnt" and 8ff52f7d
      "Bluetooth: Return RFCOMM session ptrs to avoid freed session"
      allow rfcomm_recv_ua and rfcomm_session_close to delete the session
      (and free the corresponding socket) and propagate NULL session pointer
      to the upper callers.
      Additional fix is required to terminate the loop in rfcomm_process_rx
      function to avoid use of freed 'sk' memory.
      The issue is only reproducible with kernel option CONFIG_PAGE_POISONING
      enabled making freed memory being changed and filled up with fixed char
      value used to unmask use-after-free issues.
      Signed-off-by: default avatarVignesh Raman <Vignesh_Raman@mentor.com>
      Signed-off-by: default avatarVitaly Kuzmichev <Vitaly_Kuzmichev@mentor.com>
      Acked-by: default avatarDean Jenkins <Dean_Jenkins@mentor.com>
      Signed-off-by: default avatarMarcel Holtmann <marcel@holtmann.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Vladimir Davydov's avatar
      Bluetooth: never linger on process exit · 819f3e7a
      Vladimir Davydov authored
      commit 093facf3634da1b0c2cc7ed106f1983da901bbab upstream.
      If the current process is exiting, lingering on socket close will make
      it unkillable, so we should avoid it.
        #include <sys/types.h>
        #include <sys/socket.h>
        #define BTPROTO_L2CAP   0
        #define BTPROTO_SCO     2
        #define BTPROTO_RFCOMM  3
        int main()
                int fd;
                struct linger ling;
                fd = socket(PF_BLUETOOTH, SOCK_STREAM, BTPROTO_RFCOMM);
                //or: fd = socket(PF_BLUETOOTH, SOCK_DGRAM, BTPROTO_L2CAP);
                //or: fd = socket(PF_BLUETOOTH, SOCK_SEQPACKET, BTPROTO_SCO);
                ling.l_onoff = 1;
                ling.l_linger = 1000000000;
                setsockopt(fd, SOL_SOCKET, SO_LINGER, &ling, sizeof(ling));
                return 0;
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: default avatarMarcel Holtmann <marcel@holtmann.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Eric W. Biederman's avatar
      mnt: Add tests for unprivileged remount cases that have found to be faulty · bbeed681
      Eric W. Biederman authored
      commit db181ce011e3c033328608299cd6fac06ea50130 upstream.
      Kenton Varda <kenton@sandstorm.io> discovered that by remounting a
      read-only bind mount read-only in a user namespace the
      MNT_LOCK_READONLY bit would be cleared, allowing an unprivileged user
      to the remount a read-only mount read-write.
      Upon review of the code in remount it was discovered that the code allowed
      nosuid, noexec, and nodev to be cleared.  It was also discovered that
      the code was allowing the per mount atime flags to be changed.
      The first naive patch to fix these issues contained the flaw that using
      default atime settings when remounting a filesystem could be disallowed.
      To avoid this problems in the future add tests to ensure unprivileged
      remounts are succeeding and failing at the appropriate times.
      Acked-by: default avatarSerge E. Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Eric W. Biederman's avatar
      mnt: Change the default remount atime from relatime to the existing value · 99dd97b8
      Eric W. Biederman authored
      commit ffbc6f0ead47fa5a1dc9642b0331cb75c20a640e upstream.
      Since March 2009 the kernel has treated the state that if no
      MS_..ATIME flags are passed then the kernel defaults to relatime.
      Defaulting to relatime instead of the existing atime state during a
      remount is silly, and causes problems in practice for people who don't
      specify any MS_...ATIME flags and to get the default filesystem atime
      setting.  Those users may encounter a permission error because the
      default atime setting does not work.
      A default that does not work and causes permission problems is
      ridiculous, so preserve the existing value to have a default
      atime setting that is always guaranteed to work.
      Using the default atime setting in this way is particularly
      interesting for applications built to run in restricted userspace
      environments without /proc mounted, as the existing atime mount
      options of a filesystem can not be read from /proc/mounts.
      In practice this fixes user space that uses the default atime
      setting on remount that are broken by the permission checks
      keeping less privileged users from changing more privileged users
      atime settings.
      Acked-by: default avatarSerge E. Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Eric W. Biederman's avatar
      mnt: Correct permission checks in do_remount · 187985d9
      Eric W. Biederman authored
      commit 9566d6742852c527bf5af38af5cbb878dad75705 upstream.
      While invesgiating the issue where in "mount --bind -oremount,ro ..."
      would result in later "mount --bind -oremount,rw" succeeding even if
      the mount started off locked I realized that there are several
      additional mount flags that should be locked and are not.
      In particular MNT_NOSUID, MNT_NODEV, MNT_NOEXEC, and the atime
      flags in addition to MNT_READONLY should all be locked.  These
      flags are all per superblock, can all be changed with MS_BIND,
      and should not be changable if set by a more privileged user.
      The following additions to the current logic are added in this patch.
      - nosuid may not be clearable by a less privileged user.
      - nodev  may not be clearable by a less privielged user.
      - noexec may not be clearable by a less privileged user.
      - atime flags may not be changeable by a less privileged user.
      The logic with atime is that always setting atime on access is a
      global policy and backup software and auditing software could break if
      atime bits are not updated (when they are configured to be updated),
      and serious performance degradation could result (DOS attack) if atime
      updates happen when they have been explicitly disabled.  Therefore an
      unprivileged user should not be able to mess with the atime bits set
      by a more privileged user.
      The additional restrictions are implemented with the addition of
      mnt flags.
      Taken together these changes and the fixes for MNT_LOCK_READONLY
      should make it safe for an unprivileged user to create a user
      namespace and to call "mount --bind -o remount,... ..." without
      the danger of mount flags being changed maliciously.
      Acked-by: default avatarSerge E. Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Eric W. Biederman's avatar
      mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount · 81d4c13e
      Eric W. Biederman authored
      commit 07b645589dcda8b7a5249e096fece2a67556f0f4 upstream.
      There are no races as locked mount flags are guaranteed to never change.
      Moving the test into do_remount makes it more visible, and ensures all
      filesystem remounts pass the MNT_LOCK_READONLY permission check.  This
      second case is not an issue today as filesystem remounts are guarded
      by capable(CAP_DAC_ADMIN) and thus will always fail in less privileged
      mount namespaces, but it could become an issue in the future.
      Acked-by: default avatarSerge E. Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Eric W. Biederman's avatar
      mnt: Only change user settable mount flags in remount · 8c30f227
      Eric W. Biederman authored
      commit a6138db815df5ee542d848318e5dae681590fccd upstream.
      Kenton Varda <kenton@sandstorm.io> discovered that by remounting a
      read-only bind mount read-only in a user namespace the
      MNT_LOCK_READONLY bit would be cleared, allowing an unprivileged user
      to the remount a read-only mount read-write.
      Correct this by replacing the mask of mount flags to preserve
      with a mask of mount flags that may be changed, and preserve
      all others.   This ensures that any future bugs with this mask and
      remount will fail in an easy to detect way where new mount flags
      simply won't change.
      Acked-by: default avatarSerge E. Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Up rb_iter_peek() loop count to 3 · 7f70b62e
      Steven Rostedt (Red Hat) authored
      commit 021de3d904b88b1771a3a2cfc5b75023c391e646 upstream.
      After writting a test to try to trigger the bug that caused the
      ring buffer iterator to become corrupted, I hit another bug:
       WARNING: CPU: 1 PID: 5281 at kernel/trace/ring_buffer.c:3766 rb_iter_peek+0x113/0x238()
       Modules linked in: ipt_MASQUERADE sunrpc [...]
       CPU: 1 PID: 5281 Comm: grep Tainted: G        W     3.16.0-rc3-test+ #143
       Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
        0000000000000000 ffffffff81809a80 ffffffff81503fb0 0000000000000000
        ffffffff81040ca1 ffff8800796d6010 ffffffff810c138d ffff8800796d6010
        ffff880077438c80 ffff8800796d6010 ffff88007abbe600 0000000000000003
       Call Trace:
        [<ffffffff81503fb0>] ? dump_stack+0x4a/0x75
        [<ffffffff81040ca1>] ? warn_slowpath_common+0x7e/0x97
        [<ffffffff810c138d>] ? rb_iter_peek+0x113/0x238
        [<ffffffff810c138d>] ? rb_iter_peek+0x113/0x238
        [<ffffffff810c14df>] ? ring_buffer_iter_peek+0x2d/0x5c
        [<ffffffff810c6f73>] ? tracing_iter_reset+0x6e/0x96
        [<ffffffff810c74a3>] ? s_start+0xd7/0x17b
        [<ffffffff8112b13e>] ? kmem_cache_alloc_trace+0xda/0xea
        [<ffffffff8114cf94>] ? seq_read+0x148/0x361
        [<ffffffff81132d98>] ? vfs_read+0x93/0xf1
        [<ffffffff81132f1b>] ? SyS_read+0x60/0x8e
        [<ffffffff8150bf9f>] ? tracesys+0xdd/0xe2
      Debugging this bug, which triggers when the rb_iter_peek() loops too
      many times (more than 2 times), I discovered there's a case that can
      cause that function to legitimately loop 3 times!
      rb_iter_peek() is different than rb_buffer_peek() as the rb_buffer_peek()
      only deals with the reader page (it's for consuming reads). The
      rb_iter_peek() is for traversing the buffer without consuming it, and as
      such, it can loop for one more reason. That is, if we hit the end of
      the reader page or any page, it will go to the next page and try again.
      That is, we have this:
       1. iter->head > iter->head_page->page->commit
          (rb_inc_iter() which moves the iter to the next page)
          try again
       2. event = rb_iter_head_event()
          event->type_len == RINGBUF_TYPE_TIME_EXTEND
          try again
       3. read the event.
      But we never get to 3, because the count is greater than 2 and we
      cause the WARNING and return NULL.
      Up the counter to 3.
      Fixes: 69d1b839 "ring-buffer: Bind time extend and data events together"
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Always reset iterator to reader page · 814aa5ad
      Steven Rostedt (Red Hat) authored
      commit 651e22f2701b4113989237c3048d17337dd2185c upstream.
      When performing a consuming read, the ring buffer swaps out a
      page from the ring buffer with a empty page and this page that
      was swapped out becomes the new reader page. The reader page
      is owned by the reader and since it was swapped out of the ring
      buffer, writers do not have access to it (there's an exception
      to that rule, but it's out of scope for this commit).
      When reading the "trace" file, it is a non consuming read, which
      means that the data in the ring buffer will not be modified.
      When the trace file is opened, a ring buffer iterator is allocated
      and writes to the ring buffer are disabled, such that the iterator
      will not have issues iterating over the data.
      Although the ring buffer disabled writes, it does not disable other
      reads, or even consuming reads. If a consuming read happens, then
      the iterator is reset and starts reading from the beginning again.
      My tests would sometimes trigger this bug on my i386 box:
      WARNING: CPU: 0 PID: 5175 at kernel/trace/trace.c:1527 __trace_find_cmdline+0x66/0xaa()
      Modules linked in:
      CPU: 0 PID: 5175 Comm: grep Not tainted 3.16.0-rc3-test+ #8
      Hardware name:                  /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
       00000000 00000000 f09c9e1c c18796b3 c1b5d74c f09c9e4c c103a0e3 c1b5154b
       f09c9e78 00001437 c1b5d74c 000005f7 c10bd85a c10bd85a c1cac57c f09c9eb0
       ed0e0000 f09c9e64 c103a185 00000009 f09c9e5c c1b5154b f09c9e78 f09c9e80^M
      Call Trace:
       [<c18796b3>] dump_stack+0x4b/0x75
       [<c103a0e3>] warn_slowpath_common+0x7e/0x95
       [<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
       [<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
       [<c103a185>] warn_slowpath_fmt+0x33/0x35
       [<c10bd85a>] __trace_find_cmdline+0x66/0xaa^M
       [<c10bed04>] trace_find_cmdline+0x40/0x64
       [<c10c3c16>] trace_print_context+0x27/0xec
       [<c10c4360>] ? trace_seq_printf+0x37/0x5b
       [<c10c0b15>] print_trace_line+0x319/0x39b
       [<c10ba3fb>] ? ring_buffer_read+0x47/0x50
       [<c10c13b1>] s_show+0x192/0x1ab
       [<c10bfd9a>] ? s_next+0x5a/0x7c
       [<c112e76e>] seq_read+0x267/0x34c
       [<c1115a25>] vfs_read+0x8c/0xef
       [<c112e507>] ? seq_lseek+0x154/0x154
       [<c1115ba2>] SyS_read+0x54/0x7f
       [<c188488e>] syscall_call+0x7/0xb
      ---[ end trace 3f507febd6b4cc83 ]---
      >>>> ##### CPU 1 buffer started ####
      Which was the __trace_find_cmdline() function complaining about the pid
      in the event record being negative.
      After adding more test cases, this would trigger more often. Strangely
      enough, it would never trigger on a single test, but instead would trigger
      only when running all the tests. I believe that was the case because it
      required one of the tests to be shutting down via delayed instances while
      a new test started up.
      After spending several days debugging this, I found that it was caused by
      the iterator becoming corrupted. Debugging further, I found out why
      the iterator became corrupted. It happened with the rb_iter_reset().
      As consuming reads may not read the full reader page, and only part
      of it, there's a "read" field to know where the last read took place.
      The iterator, must also start at the read position. In the rb_iter_reset()
      code, if the reader page was disconnected from the ring buffer, the iterator
      would start at the head page within the ring buffer (where writes still
      happen). But the mistake there was that it still used the "read" field
      to start the iterator on the head page, where it should always start
      at zero because readers never read from within the ring buffer where
      writes occur.
      I originally wrote a patch to have it set the iter->head to 0 instead
      of iter->head_page->read, but then I questioned why it wasn't always
      setting the iter to point to the reader page, as the reader page is
      still valid.  The list_empty(reader_page->list) just means that it was
      successful in swapping out. But the reader_page may still have data.
      There was a bug report a long time ago that was not reproducible that
      had something about trace_pipe (consuming read) not matching trace
      (iterator read). This may explain why that happened.
      Anyway, the correct answer to this bug is to always use the reader page
      an not reset the iterator to inside the writable ring buffer.
      Fixes: d769041f "ring_buffer: implement new locking"
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Jiri Kosina's avatar
      ACPI / cpuidle: fix deadlock between cpuidle_lock and cpu_hotplug.lock · 4f6a1e62
      Jiri Kosina authored
      commit 6726655dfdd2dc60c035c690d9f10cb69d7ea075 upstream.
      There is a following AB-BA dependency between cpu_hotplug.lock and
      1) cpu_hotplug.lock -> cpuidle_lock
      2) cpuidle_lock -> cpu_hotplug.lock
      acpi_os_execute_deferred() workqueue
      Fix this by reversing the order acpi_processor_cst_has_changed() does
      thigs -- let it first execute the protection against CPU hotplug by
      calling get_online_cpus() and obtain the cpuidle lock only after that (and
      perform the symmentric change when allowing CPUs hotplug again and
      dropping cpuidle lock).
      Spotted by lockdep.
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Lan Tianyu's avatar
      ACPI: Run fixed event device notifications in process context · c55d35d2
      Lan Tianyu authored
      commit 236105db632c6279a020f78c83e22eaef746006b upstream.
      Currently, notify callbacks for fixed button events are run from
      interrupt context.  That is not necessary and after commit 0bf6368ee8f2
      (ACPI / button: Add ACPI Button event via netlink routine) it causes
      netlink routines to be called from interrupt context which is not
      Also, that is different from non-fixed device events (including
      non-fixed button events) whose notify callbacks are all executed from
      process context.
      For the above reasons, make fixed button device notify callbacks run
      in process context which will avoid the deadlock when using netlink
      to report button events to user space.
      Fixes: 0bf6368ee8f2 (ACPI / button: Add ACPI Button event via netlink routine)
      Link: https://lkml.org/lkml/2014/8/21/606Reported-by: default avatarBenjamin Block <bebl@mageta.org>
      Reported-by: default avatarKnut Petersen <Knut_Petersen@t-online.de>
      Signed-off-by: default avatarLan Tianyu <tianyu.lan@intel.com>
      [rjw: Function names, subject and changelog.]
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • David E. Box's avatar
      ACPICA: Utilities: Fix memory leak in acpi_ut_copy_iobject_to_iobject · b3e98f0c
      David E. Box authored
      commit 8aa5e56eeb61a099ea6519eb30ee399e1bc043ce upstream.
      Adds return status check on copy routines to delete the allocated destination
      object if either copy fails. Reported by Colin Ian King on bugs.acpica.org,
      Bug 1087.
      The last applicable commit:
       Commit: 3371c19c
       Subject: ACPICA: Remove ACPI_GET_OBJECT_TYPE macro
      Link: https://bugs.acpica.org/show_bug.cgi?id=1087Reported-by: default avatarColin Ian King <colin.king@canonical.com>
      Signed-off-by: default avatarDavid E. Box <david.e.box@linux.intel.com>
      Signed-off-by: default avatarBob Moore <robert.moore@intel.com>
      Signed-off-by: default avatarLv Zheng <lv.zheng@intel.com>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Ben Hutchings's avatar
      bfa: Fix undefined bit shift on big-endian architectures with 32-bit DMA address · db065663
      Ben Hutchings authored
      commit 03a6c3ff3282ee9fa893089304d951e0be93a144 upstream.
      bfa_swap_words() shifts its argument (assumed to be 64-bit) by 32 bits
      each way.  In two places the argument type is dma_addr_t, which may be
      32-bit, in which case the effect of the bit shift is undefined:
      drivers/scsi/bfa/bfa_fcpim.c: In function 'bfa_ioim_send_ioreq':
      drivers/scsi/bfa/bfa_fcpim.c:2497:4: warning: left shift count >= width of type [enabled by default]
          addr = bfa_sgaddr_le(sg_dma_address(sg));
      drivers/scsi/bfa/bfa_fcpim.c:2497:4: warning: right shift count >= width of type [enabled by default]
      drivers/scsi/bfa/bfa_fcpim.c:2509:4: warning: left shift count >= width of type [enabled by default]
          addr = bfa_sgaddr_le(sg_dma_address(sg));
      drivers/scsi/bfa/bfa_fcpim.c:2509:4: warning: right shift count >= width of type [enabled by default]
      Avoid this by adding casts to u64 in bfa_swap_words().
      Compile-tested only.
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      Reviewed-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Acked-by: default avatarAnil Gurumurthy <anil.gurumurthy@qlogic.com>
      Fixes: f16a1750 ('[SCSI] bfa: remove all OS wrappers')
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>