1. 12 Dec, 2013 1 commit
  2. 08 Dec, 2013 2 commits
    • Tomoki Sekiyama's avatar
      elevator: acquire q->sysfs_lock in elevator_change() · 72b9401c
      Tomoki Sekiyama authored
      commit 7c8a3679e3d8e9d92d58f282161760a0e247df97 upstream.
      
      Add locking of q->sysfs_lock into elevator_change() (an exported function)
      to ensure it is held to protect q->elevator from elevator_init(), even if
      elevator_change() is called from non-sysfs paths.
      sysfs path (elv_iosched_store) uses __elevator_change(), non-locking
      version, as the lock is already taken by elv_iosched_store().
      Signed-off-by: default avatarTomoki Sekiyama <tomoki.sekiyama@hds.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      72b9401c
    • Tomoki Sekiyama's avatar
      elevator: Fix a race in elevator switching and md device initialization · 6d53d392
      Tomoki Sekiyama authored
      commit eb1c160b22655fd4ec44be732d6594fd1b1e44f4 upstream.
      
      The soft lockup below happens at the boot time of the system using dm
      multipath and the udev rules to switch scheduler.
      
      [  356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
      [  356.127001] RIP: 0010:[<ffffffff81072a7d>]  [<ffffffff81072a7d>] lock_timer_base.isra.35+0x1d/0x50
      ...
      [  356.127001] Call Trace:
      [  356.127001]  [<ffffffff81073810>] try_to_del_timer_sync+0x20/0x70
      [  356.127001]  [<ffffffff8118b08a>] ? kmem_cache_alloc_node_trace+0x20a/0x230
      [  356.127001]  [<ffffffff810738b2>] del_timer_sync+0x52/0x60
      [  356.127001]  [<ffffffff812ece22>] cfq_exit_queue+0x32/0xf0
      [  356.127001]  [<ffffffff812c98df>] elevator_exit+0x2f/0x50
      [  356.127001]  [<ffffffff812c9f21>] elevator_change+0xf1/0x1c0
      [  356.127001]  [<ffffffff812caa50>] elv_iosched_store+0x20/0x50
      [  356.127001]  [<ffffffff812d1d09>] queue_attr_store+0x59/0xb0
      [  356.127001]  [<ffffffff812143f6>] sysfs_write_file+0xc6/0x140
      [  356.127001]  [<ffffffff811a326d>] vfs_write+0xbd/0x1e0
      [  356.127001]  [<ffffffff811a3ca9>] SyS_write+0x49/0xa0
      [  356.127001]  [<ffffffff8164e899>] system_call_fastpath+0x16/0x1b
      
      This is caused by a race between md device initialization by multipathd and
      shell script to switch the scheduler using sysfs.
      
       - multipathd:
         SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load
         -> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init
          q->elevator = elevator_alloc(q, e); // not yet initialized
      
       - sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler':
         elevator_switch (in the call trace above)
          struct elevator_queue *old = q->elevator;
          q->elevator = elevator_alloc(q, new_e);
          elevator_exit(old);                 // lockup! (*)
      
       - multipathd: (cont.)
          err = e->ops.elevator_init_fn(q);   // init fails; q->elevator is modified
      
      (*) When del_timer_sync() is called, lock_timer_base() will loop infinitely
      while timer->base == NULL. In this case, as timer will never initialized,
      it results in lockup.
      
      This patch introduces acquisition of q->sysfs_lock around elevator_init()
      into blk_init_allocated_queue(), to provide mutual exclusion between
      initialization of the q->scheduler and switching of the scheduler.
      
      This should fix this bugzilla:
      https://bugzilla.redhat.com/show_bug.cgi?id=902012Signed-off-by: default avatarTomoki Sekiyama <tomoki.sekiyama@hds.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6d53d392
  3. 04 Dec, 2013 1 commit
    • Mikulas Patocka's avatar
      blk-core: Fix memory corruption if blkcg_init_queue fails · d8db1a5f
      Mikulas Patocka authored
      commit fff4996b7db7955414ac74386efa5e07fd766b50 upstream.
      
      If blkcg_init_queue fails, blk_alloc_queue_node doesn't call bdi_destroy
      to clean up structures allocated by the backing dev.
      
      ------------[ cut here ]------------
      WARNING: at lib/debugobjects.c:260 debug_print_object+0x85/0xa0()
      ODEBUG: free active (active state 0) object type: percpu_counter hint:           (null)
      Modules linked in: dm_loop dm_mod ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev ipt_MASQUERADE iptable_nat nf_nat_ipv4 msr nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand cpufreq_conservative spadfs fuse hid_generic usbhid hid raid0 md_mod dmi_sysfs nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lm85 hwmon_vid snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib snd_rawmidi snd soundcore acpi_cpufreq freq_table mperf sata_svw serverworks kvm_amd ide_core ehci_pci ohci_hcd libata ehci_hcd kvm usbcore tg3 usb_common libphy k10temp pcspkr ptp i2c_piix4 i2c_core evdev microcode hwmon rtc_cmos pps_core e100 skge floppy mii processor button unix
      CPU: 0 PID: 2739 Comm: lvchange Tainted: G        W
      3.10.15-devel #14
      Hardware name: empty empty/S3992-E, BIOS 'V1.06   ' 06/09/2009
       0000000000000009 ffff88023c3c1ae8 ffffffff813c8fd4 ffff88023c3c1b20
       ffffffff810399eb ffff88043d35cd58 ffffffff81651940 ffff88023c3c1bf8
       ffffffff82479d90 0000000000000005 ffff88023c3c1b80 ffffffff81039a67
      Call Trace:
       [<ffffffff813c8fd4>] dump_stack+0x19/0x1b
       [<ffffffff810399eb>] warn_slowpath_common+0x6b/0xa0
       [<ffffffff81039a67>] warn_slowpath_fmt+0x47/0x50
       [<ffffffff8122aaaf>] ? debug_check_no_obj_freed+0xcf/0x250
       [<ffffffff81229a15>] debug_print_object+0x85/0xa0
       [<ffffffff8122abe3>] debug_check_no_obj_freed+0x203/0x250
       [<ffffffff8113c4ac>] kmem_cache_free+0x20c/0x3a0
       [<ffffffff811f6709>] blk_alloc_queue_node+0x2a9/0x2c0
       [<ffffffff811f672e>] blk_alloc_queue+0xe/0x10
       [<ffffffffa04c0093>] dm_create+0x1a3/0x530 [dm_mod]
       [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod]
       [<ffffffffa04c6c07>] dev_create+0x57/0x2b0 [dm_mod]
       [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod]
       [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod]
       [<ffffffffa04c6528>] ctl_ioctl+0x268/0x500 [dm_mod]
       [<ffffffff81097662>] ? get_lock_stats+0x22/0x70
       [<ffffffffa04c67ce>] dm_ctl_ioctl+0xe/0x20 [dm_mod]
       [<ffffffff81161aad>] do_vfs_ioctl+0x2ed/0x520
       [<ffffffff8116cfc7>] ? fget_light+0x377/0x4e0
       [<ffffffff81161d2b>] SyS_ioctl+0x4b/0x90
       [<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f
      ---[ end trace 4b5ff0d55673d986 ]---
      ------------[ cut here ]------------
      
      This fix should be backported to stable kernels starting with 2.6.37. Note
      that in the kernels prior to 3.5 the affected code is different, but the
      bug is still there - bdi_init is called and bdi_destroy isn't.
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d8db1a5f
  4. 29 Nov, 2013 2 commits
    • Mike Snitzer's avatar
      block: properly stack underlying max_segment_size to DM device · 0deb6f9c
      Mike Snitzer authored
      commit d82ae52e68892338068e7559a0c0657193341ce4 upstream.
      
      Without this patch all DM devices will default to BLK_MAX_SEGMENT_SIZE
      (65536) even if the underlying device(s) have a larger value -- this is
      due to blk_stack_limits() using min_not_zero() when stacking the
      max_segment_size limit.
      
      1073741824
      
      before patch:
      65536
      
      after patch:
      1073741824
      Reported-by: default avatarLukasz Flis <l.flis@cyfronet.pl>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0deb6f9c
    • Jeff Moyer's avatar
      block: fix race between request completion and timeout handling · 869d4e7f
      Jeff Moyer authored
      commit 4912aa6c11e6a5d910264deedbec2075c6f1bb73 upstream.
      
      crocode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      
      Pid: 491, comm: scsi_eh_0 Tainted: G        W  ----------------   2.6.32-220.13.1.el6.x86_64 #1 IBM  -[8722PAX]-/00D1461
      RIP: 0010:[<ffffffff8124e424>]  [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0
      RSP: 0018:ffff881057eefd60  EFLAGS: 00010012
      RAX: ffff881d99e3e8a8 RBX: ffff881d99e3e780 RCX: ffff881d99e3e8a8
      RDX: ffff881d99e3e8a8 RSI: ffff881d99e3e780 RDI: ffff881d99e3e780
      RBP: ffff881057eefd80 R08: ffff881057eefe90 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff881057f92338
      R13: 0000000000000000 R14: ffff881057f92338 R15: ffff883058188000
      FS:  0000000000000000(0000) GS:ffff880040200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      CR2: 00000000006d3ec0 CR3: 000000302cd7d000 CR4: 00000000000406b0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process scsi_eh_0 (pid: 491, threadinfo ffff881057eee000, task ffff881057e29540)
      Stack:
       0000000000001057 0000000000000286 ffff8810275efdc0 ffff881057f16000
      <0> ffff881057eefdd0 ffffffff81362323 ffff881057eefe20 ffffffff8135f393
      <0> ffff881057e29af8 ffff8810275efdc0 ffff881057eefe78 ffff881057eefe90
      Call Trace:
       [<ffffffff81362323>] __scsi_queue_insert+0xa3/0x150
       [<ffffffff8135f393>] ? scsi_eh_ready_devs+0x5e3/0x850
       [<ffffffff81362a23>] scsi_queue_insert+0x13/0x20
       [<ffffffff8135e4d4>] scsi_eh_flush_done_q+0x104/0x160
       [<ffffffff8135fb6b>] scsi_error_handler+0x35b/0x660
       [<ffffffff8135f810>] ? scsi_error_handler+0x0/0x660
       [<ffffffff810908c6>] kthread+0x96/0xa0
       [<ffffffff8100c14a>] child_rip+0xa/0x20
       [<ffffffff81090830>] ? kthread+0x0/0xa0
       [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 <0f> 0b eb fe 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00
      RIP  [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0
       RSP <ffff881057eefd60>
      
      The RIP is this line:
              BUG_ON(blk_queued_rq(rq));
      
      After digging through the code, I think there may be a race between the
      request completion and the timer handler running.
      
      A timer is started for each request put on the device's queue (see
      blk_start_request->blk_add_timer).  If the request does not complete
      before the timer expires, the timer handler (blk_rq_timed_out_timer)
      will mark the request complete atomically:
      
      static inline int blk_mark_rq_complete(struct request *rq)
      {
              return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
      }
      
      and then call blk_rq_timed_out.  The latter function will call
      scsi_times_out, which will return one of BLK_EH_HANDLED,
      BLK_EH_RESET_TIMER or BLK_EH_NOT_HANDLED.  If BLK_EH_RESET_TIMER is
      returned, blk_clear_rq_complete is called, and blk_add_timer is again
      called to simply wait longer for the request to complete.
      
      Now, if the request happens to complete while this is going on, what
      happens?  Given that we know the completion handler will bail if it
      finds the REQ_ATOM_COMPLETE bit set, we need to focus on the completion
      handler running after that bit is cleared.  So, from the above
      paragraph, after the call to blk_clear_rq_complete.  If the completion
      sets REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom
      there (I haven't seen this in the cores).  Next, if we get the
      completion before the call to list_add_tail, then the timer will
      eventually fire for an old req, which may either be freed or reallocated
      (there is evidence that this might be the case).  Finally, if the
      completion comes in *after* the addition to the timeout list, I think
      it's harmless.  The request will be removed from the timeout list,
      req_atom_complete will be set, and all will be well.
      
      This will only actually explain the coredumps *IF* the request
      structure was freed, reallocated *and* queued before the error handler
      thread had a chance to process it.  That is possible, but it may make
      sense to keep digging for another race.  I think that if this is what
      was happening, we would see other instances of this problem showing up
      as null pointer or garbage pointer dereferences, for example when the
      request structure was not re-used.  It looks like we actually do run
      into that situation in other reports.
      
      This patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE,
      &req->atomic_flags)); from blk_add_timer to the only caller that could
      trip over it (blk_start_request).  It then inverts the calls to
      blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address
      the race.  I've boot tested this patch, but nothing more.
      Signed-off-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Acked-by: default avatarHannes Reinecke <hare@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      869d4e7f
  5. 01 Oct, 2013 1 commit
  6. 20 Aug, 2013 1 commit
    • Jianpeng Ma's avatar
      elevator: Fix a race in elevator switching · a6ad83fc
      Jianpeng Ma authored
      commit d50235b7bc3ee0a0427984d763ea7534149531b4 upstream.
      
      There's a race between elevator switching and normal io operation.
          Because the allocation of struct elevator_queue and struct elevator_data
          don't in a atomic operation.So there are have chance to use NULL
          ->elevator_data.
          For example:
              Thread A:                               Thread B
              blk_queu_bio                            elevator_switch
              spin_lock_irq(q->queue_block)           elevator_alloc
              elv_merge                               elevator_init_fn
      
          Because call elevator_alloc, it can't hold queue_lock and the
          ->elevator_data is NULL.So at the same time, threadA call elv_merge and
          nedd some info of elevator_data.So the crash happened.
      
          Move the elevator_alloc into func elevator_init_fn, it make the
          operations in a atomic operation.
      
          Using the follow method can easy reproduce this bug
          1:dd if=/dev/sdb of=/dev/null
          2:while true;do echo noop > scheduler;echo deadline > scheduler;done
      
          The test method also use this method.
      Signed-off-by: default avatarJianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Cc: Jonghwan Choi <jhbird.choi@samsung.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a6ad83fc
  7. 13 Jul, 2013 1 commit
  8. 17 May, 2013 1 commit
  9. 08 May, 2013 1 commit
  10. 30 Apr, 2013 1 commit
  11. 18 Apr, 2013 1 commit
  12. 11 Apr, 2013 1 commit
  13. 09 Apr, 2013 1 commit
    • Jun'ichi Nomura's avatar
      blkcg: fix "scheduling while atomic" in blk_queue_bypass_start · e5072664
      Jun'ichi Nomura authored
      Since 749fefe6 in v3.7 ("block: lift the initial queue bypass mode
      on blk_register_queue() instead of blk_init_allocated_queue()"),
      the following warning appears when multipath is used with CONFIG_PREEMPT=y.
      
      This patch moves blk_queue_bypass_start() before radix_tree_preload()
      to avoid the sleeping call while preemption is disabled.
      
        BUG: scheduling while atomic: multipath/2460/0x00000002
        1 lock held by multipath/2460:
         #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
        Modules linked in: ...
        Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
        Call Trace:
         [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
         [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
         [<ffffffff814291e6>] schedule+0x64/0x66
         [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
         [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
         [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
         [<ffffffff814289e3>] wait_for_common+0x9d/0xee
         [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
         [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
         [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
         [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
         [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
         [<ffffffff8106fb18>] ? complete+0x21/0x53
         [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
         [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
         [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
         [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
         [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
         [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
         [<ffffffff811d8c41>] elevator_init+0xe1/0x115
         [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
         [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
         [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
         [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
         [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
         [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
         [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
         [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
         [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
         [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
         [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
         [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b
      Signed-off-by: default avatarJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Acked-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Alasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e5072664
  14. 08 Apr, 2013 2 commits
    • Kay Sievers's avatar
      driver core: add uid and gid to devtmpfs · 3c2670e6
      Kay Sievers authored
      Some drivers want to tell userspace what uid and gid should be used for
      their device nodes, so allow that information to percolate through the
      driver core to userspace in order to make this happen.  This means that
      some systems (i.e.  Android and friends) will not need to even run a
      udev-like daemon for their device node manager and can just rely in
      devtmpfs fully, reducing their footprint even more.
      Signed-off-by: default avatarKay Sievers <kay@vrfy.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3c2670e6
    • Jens Axboe's avatar
      Revert "loop: cleanup partitions when detaching loop device" · c2fccc1c
      Jens Axboe authored
      This reverts commit 8761a3dc.
      
      There are situations where the destruction path is called
      with the bdev->bd_mutex already held, which then deadlocks in
      loop_clr_fd(). The normal partition cleanup does a trylock()
      on the mutex, but it'd be nice to have a more bullet proof
      method in loop. So punt this more involved fix to the next
      merge window, and just back out this buggy fix for now.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c2fccc1c
  15. 03 Apr, 2013 1 commit
    • Arnd Bergmann's avatar
      block: avoid using uninitialized value in from queue_var_store · c678ef52
      Arnd Bergmann authored
      As found by gcc-4.8, the QUEUE_SYSFS_BIT_FNS macro creates functions
      that use a value generated by queue_var_store independent of whether
      that value was set or not.
      
      block/blk-sysfs.c: In function 'queue_store_nonrot':
      block/blk-sysfs.c:244:385: warning: 'val' may be used uninitialized in this function [-Wmaybe-uninitialized]
      
      Unlike most other such warnings, this one is not a false positive,
      writing any non-number string into the sysfs files indeed has
      an undefined result, rather than returning an error.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c678ef52
  16. 23 Mar, 2013 4 commits
    • Kent Overstreet's avatar
      block: Add bio_end_sector() · f73a1c7d
      Kent Overstreet authored
      Just a little convenience macro - main reason to add it now is preparing
      for immutable bio vecs, it'll reduce the size of the patch that puts
      bi_sector/bi_size/bi_idx into a struct bvec_iter.
      Signed-off-by: default avatarKent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
      CC: Jiri Kosina <jkosina@suse.cz>
      CC: Alasdair Kergon <agk@redhat.com>
      CC: dm-devel@redhat.com
      CC: Neil Brown <neilb@suse.de>
      CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
      CC: Heiko Carstens <heiko.carstens@de.ibm.com>
      CC: linux-s390@vger.kernel.org
      CC: Chris Mason <chris.mason@fusionio.com>
      CC: Steven Whitehouse <swhiteho@redhat.com>
      Acked-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      f73a1c7d
    • Kent Overstreet's avatar
      block: Refactor blk_update_request() · f79ea416
      Kent Overstreet authored
      Converts it to use bio_advance(), simplifying it quite a bit in the
      process.
      
      Note that req_bio_endio() now always calls bio_advance() - which means
      it always loops over the biovec, not just on partial completions. Don't
      expect it to affect performance, but worth noting.
      
      Tested it by forcing partial updates, and dumping before and after on
      various bio/bvec fields when doing a partial update.
      Signed-off-by: default avatarKent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      f79ea416
    • Lin Ming's avatar
      block: implement runtime pm strategy · c8158819
      Lin Ming authored
      When a request is added:
          If device is suspended or is suspending and the request is not a
          PM request, resume the device.
      
      When the last request finishes:
          Call pm_runtime_mark_last_busy().
      
      When pick a request:
          If device is resuming/suspending, then only PM request is allowed
          to go.
      
      The idea and API is designed by Alan Stern and described here:
      http://marc.info/?l=linux-scsi&m=133727953625963&w=2Signed-off-by: default avatarLin Ming <ming.m.lin@intel.com>
      Signed-off-by: default avatarAaron Lu <aaron.lu@intel.com>
      Acked-by: default avatarAlan Stern <stern@rowland.harvard.edu>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c8158819
    • Lin Ming's avatar
      block: add runtime pm helpers · 6c954667
      Lin Ming authored
      Add runtime pm helper functions:
      
      void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
        - Initialization function for drivers to call.
      
      int blk_pre_runtime_suspend(struct request_queue *q)
        - If any requests are in the queue, mark last busy and return -EBUSY.
          Otherwise set q->rpm_status to RPM_SUSPENDING and return 0.
      
      void blk_post_runtime_suspend(struct request_queue *q, int err)
        - If the suspend succeeded then set q->rpm_status to RPM_SUSPENDED.
          Otherwise set it to RPM_ACTIVE and mark last busy.
      
      void blk_pre_runtime_resume(struct request_queue *q)
        - Set q->rpm_status to RPM_RESUMING.
      
      void blk_post_runtime_resume(struct request_queue *q, int err)
        - If the resume succeeded then set q->rpm_status to RPM_ACTIVE
          and call __blk_run_queue, then mark last busy and autosuspend.
          Otherwise set q->rpm_status to RPM_SUSPENDED.
      
      The idea and API is designed by Alan Stern and described here:
      http://marc.info/?l=linux-scsi&m=133727953625963&w=2Signed-off-by: default avatarLin Ming <ming.m.lin@intel.com>
      Signed-off-by: default avatarAaron Lu <aaron.lu@intel.com>
      Acked-by: default avatarAlan Stern <stern@rowland.harvard.edu>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      6c954667
  17. 22 Mar, 2013 2 commits
  18. 20 Mar, 2013 1 commit
  19. 04 Mar, 2013 1 commit
    • Li Zefan's avatar
      cgroup: fix cgroup_path() vs rename() race · 65dff759
      Li Zefan authored
      rename() will change dentry->d_name. The result of this race can
      be worse than seeing partially rewritten name, but we might access
      a stale pointer because rename() will re-allocate memory to hold
      a longer name.
      
      As accessing dentry->name must be protected by dentry->d_lock or
      parent inode's i_mutex, while on the other hand cgroup-path() can
      be called with some irq-safe spinlocks held, we can't generate
      cgroup path using dentry->d_name.
      
      Alternatively we make a copy of dentry->d_name and save it in
      cgrp->name when a cgroup is created, and update cgrp->name at
      rename().
      
      v5: use flexible array instead of zero-size array.
      v4: - allocate root_cgroup_name and all root_cgroup->name points to it.
          - add cgroup_name() wrapper.
      v3: use kfree_rcu() instead of synchronize_rcu() in user-visible path.
      v2: make cgrp->name RCU safe.
      Signed-off-by: default avatarLi Zefan <lizefan@huawei.com>
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      65dff759
  20. 28 Feb, 2013 8 commits
    • Sasha Levin's avatar
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin authored
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: default avatarPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
    • Ming Lei's avatar
      block/partitions: optimize memory allocation in check_partition() · ac2e5327
      Ming Lei authored
      Currently, sizeof(struct parsed_partitions) may be 64KB in 32bit arch, so
      it is easy to trigger page allocation failure by check_partition,
      especially in hotplug block device situation(such as, USB mass storage,
      MMC card, ...), and Felipe Balbi has observed the failure.
      
      This patch does below optimizations on the allocation of struct
      parsed_partitions to try to address the issue:
      
      - make parsed_partitions.parts as pointer so that the pointed memory can
        fit in 32KB buffer, then approximate 32KB memory can be saved
      
      - vmalloc the buffer pointed by parsed_partitions.parts because 32KB is
        still a bit big for kmalloc
      
      - given that many devices have the partition count limit, so only
        allocate disk_max_parts() partitions instead of 256 partitions always
      Signed-off-by: default avatarMing Lei <ming.lei@canonical.com>
      Reported-by: default avatarFelipe Balbi <balbi@ti.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ac2e5327
    • Ming Lei's avatar
      block/partitions/mac.c: obey the state->limit constraint · 06004e6e
      Ming Lei authored
      It isn't necessary to read the information of partitions whose number is
      equal and more than state->limit since only maximum state->limit
      partitions will be added inside rescan_partitions().
      
      That is also what other kind of partitions are doing.
      Signed-off-by: default avatarMing Lei <ming.lei@canonical.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      06004e6e
    • Peter Jones's avatar
      block/partitions/efi.c: ensure that the GPT header is at least the size of the structure. · 8b8a6e18
      Peter Jones authored
      UEFI 2.3.1D will include a change to the spec language mandating that a
      GPT header must be greater than *or equal to* the size of the defined
      structure.  While verifying that this would work on Linux, I discovered
      that we're not actually checking the minimum bound at all.
      
      The result of this is that when we verify the checksum, it's possible that
      on a malformed header (with header_size of 0), we won't actually verify
      any data.
      
      [akpm@linux-foundation.org: fix printk warning]
      Signed-off-by: default avatarPeter Jones <pjones@redhat.com>
      Acked-by: default avatarMatt Fleming <matt.fleming@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Stephen Warren <swarren@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8b8a6e18
    • Philippe De Muyter's avatar
      block/partition/msdos: detect AIX formatted disks even without 55aa · 86ee8ba6
      Philippe De Muyter authored
      AIX formatted disks do not always have the MSDOS 55aa signature.
      This happens e.g. for unbootable AIX disks.
      
      Up to now, such disks were not recognized as AIX disks, because of the
      missing 55aa.  Fix that by inverting the two tests.  Let's first
      check for the AIX magic strings, and only if that fails check for
      the MSDOS magic word.
      Signed-off-by: default avatarPhilippe De Muyter <phdm@macqel.be>
      Cc: Andreas Mohr <andi@lisas.de>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Olaf Hering <olh@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      86ee8ba6
    • Tejun Heo's avatar
      block: convert to idr_alloc() · bab998d6
      Tejun Heo authored
      Convert to the much saner new idr interface.  Both bsg and genhd
      protect idr w/ mutex making preloading unnecessary.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bab998d6
    • Tejun Heo's avatar
      block: fix synchronization and limit check in blk_alloc_devt() · ce23bba8
      Tejun Heo authored
      idr allocation in blk_alloc_devt() wasn't synchronized against lookup
      and removal, and its limit check was off by one - 1 << MINORBITS is
      the number of minors allowed, not the maximum allowed minor.
      
      Add locking and rename MAX_EXT_DEVT to NR_EXT_DEVT and fix limit
      checking.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarJens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ce23bba8
    • Tomas Henzl's avatar
      block: fix ext_devt_idr handling · 7b74e912
      Tomas Henzl authored
      While adding and removing a lot of disks disks and partitions this
      sometimes shows up:
      
        WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xc9/0x130() (Not tainted)
        Hardware name:
        sysfs: cannot create duplicate filename '/dev/block/259:751'
        Modules linked in: raid1 autofs4 bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc scsi_tgt garp stp llc sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 dm_mirror dm_region_hash dm_log power_meter microcode dcdbas serio_raw amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core k10temp bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 dm_round_robin sr_mod cdrom sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas dm_multipath dm_mod [last unloaded: scsi_wait_scan]
        Pid: 44103, comm: async/16 Not tainted 2.6.32-195.el6.x86_64 #1
        Call Trace:
          warn_slowpath_common+0x87/0xc0
          warn_slowpath_fmt+0x46/0x50
          sysfs_add_one+0xc9/0x130
          sysfs_do_create_link+0x12b/0x170
          sysfs_create_link+0x13/0x20
          device_add+0x317/0x650
          idr_get_new+0x13/0x50
          add_partition+0x21c/0x390
          rescan_partitions+0x32b/0x470
          sd_open+0x81/0x1f0 [sd_mod]
          __blkdev_get+0x1b6/0x3c0
          blkdev_get+0x10/0x20
          register_disk+0x155/0x170
          add_disk+0xa6/0x160
          sd_probe_async+0x13b/0x210 [sd_mod]
          add_wait_queue+0x46/0x60
          async_thread+0x102/0x250
          default_wake_function+0x0/0x20
          async_thread+0x0/0x250
          kthread+0x96/0xa0
          child_rip+0xa/0x20
          kthread+0x0/0xa0
          child_rip+0x0/0x20
      
      This most likely happens because dev_t is freed while the number is
      still used and idr_get_new() is not protected on every use.  The fix
      adds a mutex where it wasn't before and moves the dev_t free function so
      it is called after device del.
      Signed-off-by: default avatarTomas Henzl <thenzl@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7b74e912
  21. 24 Feb, 2013 1 commit
    • Ming Lei's avatar
      block/genhd.c: apply pm_runtime_set_memalloc_noio on block devices · 25e823c8
      Ming Lei authored
      Apply the introduced pm_runtime_set_memalloc_noio on block device so
      that PM core will teach mm to not allocate memory with GFP_IOFS when
      calling the runtime_resume and runtime_suspend callback for block
      devices and its ancestors.
      Signed-off-by: default avatarMing Lei <ming.lei@canonical.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Oliver Neukum <oneukum@suse.de>
      Cc: Jiri Kosina <jiri.kosina@suse.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Decotigny <david.decotigny@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      25e823c8
  22. 22 Feb, 2013 4 commits
    • Glauber Costa's avatar
      cfq: fix lock imbalance with failed allocations · a3cc86c2
      Glauber Costa authored
      While stress-running very-small container scenarios with the Kernel Memory
      Controller, I've run into a lockdep-detected lock imbalance in
      cfq-iosched.c.
      
      I'll apologize beforehand for not posting a backlog: I didn't anticipate
      it would be so hard to reproduce, so I didn't save my serial output and
      went directly on debugging.  Turns out that it did not happen again in
      more than 20 runs, making it a quite rare pattern.
      
      But here is my analysis:
      
      When we are in very low-memory situations, we will arrive at
      cfq_find_alloc_queue and may not find a queue, having to resort to the oom
      queue, in an rcu-locked condition:
      
        if (!cfqq || cfqq == &cfqd->oom_cfqq)
            [ ... ]
      
      Next, we will release the rcu lock, and try to allocate a queue, retrying
      if we succeed:
      
        rcu_read_unlock();
        spin_unlock_irq(cfqd->queue->queue_lock);
        new_cfqq = kmem_cache_alloc_node(cfq_pool,
                        gfp_mask | __GFP_ZERO,
                        cfqd->queue->node);
         spin_lock_irq(cfqd->queue->queue_lock);
         if (new_cfqq)
             goto retry;
      
      We are unlocked at this point, but it should be fine, since we will
      reacquire the rcu_read_lock when we retry.
      
      Except of course, that we may not retry: the allocation may very well fail
      and we'll keep on going through the flow:
      
      The next branch is:
      
          if (cfqq) {
      	[ ... ]
          } else
              cfqq = &cfqd->oom_cfqq;
      
      And right before exiting, we'll issue rcu_read_unlock().
      
      Being already unlocked, this is the likely source of our imbalance.  Since
      cfqq is either already NULL or made NULL in the first statement of the
      outter branch, the only viable alternative here seems to be to return the
      oom queue right away in case of allocation failure.
      
      Please review the following patch and apply if you agree with my analysis.
      Signed-off-by: default avatarGlauber Costa <glommer@parallels.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a3cc86c2
    • Mikulas Patocka's avatar
      block: don't select PERCPU_RWSEM · 79d0b7f0
      Mikulas Patocka authored
      The block device doesn't use percpu rw-semaphore anymore, so don't select
      it for compilation.
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      79d0b7f0
    • Darrick J. Wong's avatar
      block: optionally snapshot page contents to provide stable pages during write · ffecfd1a
      Darrick J. Wong authored
      This provides a band-aid to provide stable page writes on jbd without
      needing to backport the fixed locking and page writeback bit handling
      schemes of jbd2.  The band-aid works by using bounce buffers to snapshot
      page contents instead of waiting.
      
      For those wondering about the ext3 bandage -- fixing the jbd locking
      (which was done as part of ext4dev years ago) is a lot of surgery, and
      setting PG_writeback on data pages when we actually hold the page lock
      dropped ext3 performance by nearly an order of magnitude.  If we're
      going to migrate iscsi and raid to use stable page writes, the
      complaints about high latency will likely return.  We might as well
      centralize their page snapshotting thing to one place.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Tested-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ffecfd1a
    • Darrick J. Wong's avatar
      bdi: allow block devices to say that they require stable page writes · 7d311cda
      Darrick J. Wong authored
      This patchset ("stable page writes, part 2") makes some key
      modifications to the original 'stable page writes' patchset.  First, it
      provides creators (devices and filesystems) of a backing_dev_info a flag
      that declares whether or not it is necessary to ensure that page
      contents cannot change during writeout.  It is no longer assumed that
      this is true of all devices (which was never true anyway).  Second, the
      flag is used to relaxed the wait_on_page_writeback calls so that wait
      only occurs if the device needs it.  Third, it fixes up the remaining
      disk-backed filesystems to use this improved conditional-wait logic to
      provide stable page writes on those filesystems.
      
      It is hoped that (for people not using checksumming devices, anyway)
      this patchset will give back unnecessary performance decreases since the
      original stable page write patchset went into 3.0.  Sorry about not
      fixing it sooner.
      
      Complaints were registered by several people about the long write
      latencies introduced by the original stable page write patchset.
      Generally speaking, the kernel ought to allocate as little extra memory
      as possible to facilitate writeout, but for people who simply cannot
      wait, a second page stability strategy is (re)introduced: snapshotting
      page contents.  The waiting behavior is still the default strategy; to
      enable page snapshotting, a superblock flag (MS_SNAP_STABLE) must be
      set.  This flag is used to bandaid^Henable stable page writeback on
      ext3[1], and is not used anywhere else.
      
      Given that there are already a few storage devices and network FSes that
      have rolled their own page stability wait/page snapshot code, it would
      be nice to move towards consolidating all of these.  It seems possible
      that iscsi and raid5 may wish to use the new stable page write support
      to enable zero-copy writeout.
      
      Thank you to Jan Kara for helping fix a couple more filesystems.
      
      Per Andrew Morton's request, here are the result of using dbench to measure
      latencies on ext2:
      
      3.8.0-rc3:
         Operation      Count    AvgLat    MaxLat
         ----------------------------------------
         WriteX        109347     0.028    59.817
         ReadX         347180     0.004     3.391
         Flush          15514    29.828   287.283
      
        Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms
      
      3.8.0-rc3 + patches:
         WriteX        105556     0.029     4.273
         ReadX         335004     0.005     4.112
         Flush          14982    30.540   298.634
      
        Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms
      
      As you can see, for ext2 the maximum write latency decreases from ~60ms
      on a laptop hard disk to ~4ms.  I'm not sure why the flush latencies
      increase, though I suspect that being able to dirty pages faster gives
      the flusher more work to do.
      
      On ext4, the average write latency decreases as well as all the maximum
      latencies:
      
      3.8.0-rc3:
         WriteX         85624     0.152    33.078
         ReadX         272090     0.010    61.210
         Flush          12129    36.219   168.260
      
        Throughput 44.8618 MB/sec  4 clients  4 procs  max_latency=168.276 ms
      
      3.8.0-rc3 + patches:
         WriteX         86082     0.141    30.928
         ReadX         273358     0.010    36.124
         Flush          12214    34.800   165.689
      
        Throughput 44.9941 MB/sec  4 clients  4 procs  max_latency=165.722 ms
      
      XFS seems to exhibit similar latency improvements as ext2:
      
      3.8.0-rc3:
         WriteX        125739     0.028   104.343
         ReadX         399070     0.005     4.115
         Flush          17851    25.004   131.390
      
        Throughput 66.0024 MB/sec  4 clients  4 procs  max_latency=131.406 ms
      
      3.8.0-rc3 + patches:
         WriteX        123529     0.028     6.299
         ReadX         392434     0.005     4.287
         Flush          17549    25.120   188.687
      
        Throughput 64.9113 MB/sec  4 clients  4 procs  max_latency=188.704 ms
      
      ...and btrfs, just to round things out, also shows some latency
      decreases:
      
      3.8.0-rc3:
         WriteX         67122     0.083    82.355
         ReadX         212719     0.005     2.828
         Flush           9547    47.561   147.418
      
        Throughput 35.3391 MB/sec  4 clients  4 procs  max_latency=147.433 ms
      
      3.8.0-rc3 + patches:
         WriteX         64898     0.101    71.631
         ReadX         206673     0.005     7.123
         Flush           9190    47.963   219.034
      
        Throughput 34.0795 MB/sec  4 clients  4 procs  max_latency=219.044 ms
      
      Before this patchset, all filesystems would block, regardless of whether
      or not it was necessary.  ext3 would wait, but still generate occasional
      checksum errors.  The network filesystems were left to do their own
      thing, so they'd wait too.
      
      After this patchset, all the disk filesystems except ext3 and btrfs will
      wait only if the hardware requires it.  ext3 (if necessary) snapshots
      pages instead of blocking, and btrfs provides its own bdi so the mm will
      never wait.  Network filesystems haven't been touched, so either they
      provide their own wait code, or they don't block at all.  The blocking
      behavior is back to what it was before 3.0 if you don't have a disk
      requiring stable page writes.
      
      This patchset has been tested on 3.8.0-rc3 on x64 with ext3, ext4, and
      xfs.  I've spot-checked 3.8.0-rc4 and seem to be getting the same
      results as -rc3.
      
      [1] The alternative fixes to ext3 include fixing the locking order and
      page bit handling like we did for ext4 (but then why not just use
      ext4?), or setting PG_writeback so early that ext3 becomes extremely
      slow.  I tried that, but the number of write()s I could initiate dropped
      by nearly an order of magnitude.  That was a bit much even for the
      author of the stable page series! :)
      
      This patch:
      
      Creates a per-backing-device flag that tracks whether or not pages must
      be held immutable during writeout.  Eventually it will be used to waive
      wait_for_page_writeback() if nothing requires stable pages.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7d311cda
  23. 15 Feb, 2013 1 commit
    • Vladimir Davydov's avatar
      block: account iowait time when waiting for completion of IO request · 5577022f
      Vladimir Davydov authored
      Using wait_for_completion() for waiting for a IO request to be executed
      results in wrong iowait time accounting. For example, a system having
      the only task doing write() and fdatasync() on a block device can be
      reported being idle instead of iowaiting as it should because
      blkdev_issue_flush() calls wait_for_completion() which in turn calls
      schedule() that does not increment the iowait proc counter and thus does
      not turn on iowait time accounting.
      
      The patch makes block layer use wait_for_completion_io() instead of
      wait_for_completion() where appropriate to account iowait time
      correctly.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5577022f