1. 22 Feb, 2014 1 commit
    • Oleg Nesterov's avatar
      md/raid5: Fix CPU hotplug callback registration · 4d4ef86d
      Oleg Nesterov authored
      commit 789b5e0315284463617e106baad360cb9e8db3ac upstream.
      Subsystems that want to register CPU hotplug callbacks, as well as perform
      initialization for the CPUs that are already online, often do it as shown
      This is wrong, since it is prone to ABBA deadlocks involving the
      cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
      with CPU hotplug operations).
      Interestingly, the raid5 code can actually prevent double initialization and
      hence can use the following simplified form of callback registration:
      A hotplug operation that occurs between registering the notifier and calling
      get_online_cpus(), won't disrupt anything, because the code takes care to
      perform the memory allocations only once.
      So reorganize the code in raid5 this way to fix the deadlock with callback
      Cc: linux-raid@vger.kernel.org
      Fixes: 36d1c647Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      [Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the
      free_scratch_buffer() helper to condense code further and wrote the changelog.]
      Signed-off-by: default avatarSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 06 Feb, 2014 1 commit
    • NeilBrown's avatar
      md/raid5: fix long-standing problem with bitmap handling on write failure. · 6ba854e9
      NeilBrown authored
      commit 9f97e4b128d2ea90a5f5063ea0ee3b0911f4c669 upstream.
      Before a write starts we set a bit in the write-intent bitmap.
      When the write completes we clear that bit if the write was successful
      to all devices.  However if the write wasn't fully successful we
      should not clear the bit.  If the faulty drive is subsequently
      re-added, the fact that the bit is still set ensure that we will
      re-write the data that is missing.
      This logic is mediated by the STRIPE_DEGRADED flag - we only clear the
      bitmap bit when this flag is not set.
      Currently we correctly set the flag if a write starts when some
      devices are failed or missing.  But we do *not* set the flag if some
      device failed during the write attempt.
      This is wrong and can result in clearing the bit inappropriately.
      So: set the flag when a write fails.
      This bug has been present since bitmaps were introduces, so the fix is
      suitable for any -stable kernel.
      Reported-by: default avatarEthan Wilson <ethan.wilson@shiftmail.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  3. 25 Jan, 2014 1 commit
  4. 13 Nov, 2013 2 commits
  5. 04 Aug, 2013 1 commit
    • NeilBrown's avatar
      md/raid5: fix interaction of 'replace' and 'recovery'. · c1dadcc1
      NeilBrown authored
      commit f94c0b6658c7edea8bc19d13be321e3860a3fa54 upstream.
      If a device in a RAID4/5/6 is being replaced while another is being
      recovered, then the writes to the replacement device currently don't
      happen, resulting in corruption when the replacement completes and the
      new drive takes over.
      This is because the replacement writes are only triggered when
      's.replacing' is set and not when the similar 's.sync' is set (which
      is the case during resync and recovery - it means all devices need to
      be read).
      So schedule those writes when s.replacing is set as well.
      In this case we cannot use "STRIPE_INSYNC" to record that the
      replacement has happened as that is needed for recording that any
      parity calculation is complete.  So introduce STRIPE_REPLACED to
      record if the replacement has happened.
      For safety we should also check that STRIPE_COMPUTE_RUN is not set.
      This has a similar effect to the "s.locked == 0" test.  The latter
      ensure that now IO has been flagged but not started.  The former
      checks if any parity calculation has been flagged by not started.
      We must wait for both of these to complete before triggering the
      Add a similar test to the subsequent check for "are we finished yet".
      This possibly isn't needed (is subsumed in the STRIPE_INSYNC test),
      but it makes it more obvious that the REPLACE will happen before we
      think we are finished.
      Finally if a NeedReplace device is not UPTODATE then that is an
      error.  We really must trigger a warning.
      This bug was introduced in commit 9a3e1101
      (md/raid5:  detect and handle replacements during recovery.)
      which introduced replacement for raid5.
      That was in 3.3-rc3, so any stable kernel since then would benefit
      from this fix.
      Reported-by: default avatarqindehua <13691222965@163.com>
      Tested-by: default avatarqindehua <qindehua@163.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  6. 13 Jun, 2013 1 commit
    • H. Peter Anvin's avatar
      md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place · 5026d7a9
      H. Peter Anvin authored
      There are cases where the kernel will believe that the WRITE SAME
      command is supported by a block device which does not, in fact,
      support WRITE SAME.  This currently happens for SATA drivers behind a
      SAS controller, but there are probably a hundred other ways that can
      happen, including drive firmware bugs.
      After receiving an error for WRITE SAME the block layer will retry the
      request as a plain write of zeroes, but mdraid will consider the
      failure as fatal and consider the drive failed.  This has the effect
      that all the mirrors containing a specific set of data are each
      offlined in very rapid succession resulting in data loss.
      However, just bouncing the request back up to the block layer isn't
      ideal either, because the whole initial request-retry sequence should
      be inside the write bitmap fence, which probably means that md needs
      to do its own conversion of WRITE SAME to write zero.
      Until the failure scenario has been sorted out, disable WRITE SAME for
      raid1, raid5, and raid10.
      [neilb: added raid5]
      This patch is appropriate for any -stable since 3.7 when write_same
      support was added.
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
  7. 30 May, 2013 1 commit
  8. 24 Apr, 2013 2 commits
  9. 18 Apr, 2013 1 commit
  10. 23 Mar, 2013 3 commits
    • Kent Overstreet's avatar
      raid5: use bio_reset() · 2f6db2a7
      Kent Overstreet authored
      Had to shuffle the code around a bit (where bi_rw and bi_end_io were
      set), but shouldn't really be anything tricky here
      Signed-off-by: default avatarKent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: NeilBrown <neilb@suse.de>
    • Kent Overstreet's avatar
      block: Use bio_sectors() more consistently · aa8b57aa
      Kent Overstreet authored
      Bunch of places in the code weren't using it where they could be -
      this'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx
      into a struct bvec_iter.
      Signed-off-by: default avatarKent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: "Ed L. Cashin" <ecashin@coraid.com>
      CC: Nick Piggin <npiggin@kernel.dk>
      CC: Jiri Kosina <jkosina@suse.cz>
      CC: Jim Paris <jim@jtan.com>
      CC: Geoff Levand <geoff@infradead.org>
      CC: Alasdair Kergon <agk@redhat.com>
      CC: dm-devel@redhat.com
      CC: Neil Brown <neilb@suse.de>
      CC: Steven Rostedt <rostedt@goodmis.org>
      Acked-by: default avatarEd Cashin <ecashin@coraid.com>
    • Kent Overstreet's avatar
      block: Add bio_end_sector() · f73a1c7d
      Kent Overstreet authored
      Just a little convenience macro - main reason to add it now is preparing
      for immutable bio vecs, it'll reduce the size of the patch that puts
      bi_sector/bi_size/bi_idx into a struct bvec_iter.
      Signed-off-by: default avatarKent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
      CC: Jiri Kosina <jkosina@suse.cz>
      CC: Alasdair Kergon <agk@redhat.com>
      CC: dm-devel@redhat.com
      CC: Neil Brown <neilb@suse.de>
      CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
      CC: Heiko Carstens <heiko.carstens@de.ibm.com>
      CC: linux-s390@vger.kernel.org
      CC: Chris Mason <chris.mason@fusionio.com>
      CC: Steven Whitehouse <swhiteho@redhat.com>
      Acked-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
  11. 20 Mar, 2013 3 commits
    • NeilBrown's avatar
      md/raid5: ensure sync and DISCARD don't happen at the same time. · f8dfcffd
      NeilBrown authored
      A number of problems can occur due to races between
      resync/recovery and discard.
      - if sync_request calls handle_stripe() while a discard is
        happening on the stripe, it might call handle_stripe_clean_event
        before all of the individual discard requests have completed
        (so some devices are still locked, but not all).
        Since commit ca64cae9
           md/raid5: Make sure we clear R5_Discard when discard is finished.
        this will cause R5_Discard to be cleared for the parity device,
        so handle_stripe_clean_event() will not be called when the other
        devices do become unlocked, so their ->written will not be cleared.
        This ultimately leads to a WARN_ON in init_stripe and a lock-up.
      - If handle_stripe_clean_event() does clear R5_UPTODATE at an awkward
        time for resync, it can lead to s->uptodate being less than disks
        in handle_parity_checks5(), which triggers a BUG (because it is
       - keep R5_Discard on the parity device until all other devices have
         completed their discard request
       - make sure we don't try to have a 'discard' and a 'sync' action at
         the same time.
         This involves a new stripe flag to we know when a 'discard' is
         happening, and the use of R5_Overlap on the parity disk so when a
         discard is wanted while a sync is active, so we know to wake up
         the discard at the appropriate time.
      Discard support for RAID5 was added in 3.7, so this is suitable for
      any -stable kernel since 3.7.
      Cc: stable@vger.kernel.org (v3.7+)
      Reported-by: default avatarJes Sorensen <Jes.Sorensen@redhat.com>
      Tested-by: default avatarJes Sorensen <Jes.Sorensen@redhat.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • Jonathan Brassow's avatar
      MD RAID5: Avoid accessing gendisk or queue structs when not available · e3620a3a
      Jonathan Brassow authored
      MD RAID5:  Fix kernel oops when RAID4/5/6 is used via device-mapper
      Commit a9add5d9 (v3.8-rc1) added blktrace calls to the RAID4/5/6 driver.
      However, when device-mapper is used to create RAID4/5/6 arrays, the
      mddev->gendisk and mddev->queue fields are not setup.  Therefore, calling
      things like trace_block_bio_remap will cause a kernel oops.  This patch
      conditionalizes those calls on whether the proper fields exist to make
      the calls.  (Device-mapper will call trace_block_bio_remap on its own.)
      This patch is suitable for the 3.8.y stable kernel.
      Cc: stable@vger.kernel.org (v3.8+)
      Signed-off-by: default avatarJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • NeilBrown's avatar
      md/raid5: schedule_construction should abort if nothing to do. · ce7d363a
      NeilBrown authored
      Since commit 1ed850f3
          md/raid5: make sure to_read and to_write never go negative.
      It has been possible for handle_stripe_dirtying to be called
      when there isn't actually any work to do.
      It then calls schedule_reconstruction() which will set R5_LOCKED
      on the parity block(s) even when nothing else is happening.
      This then causes problems in do_release_stripe().
      So add checks to schedule_reconstruction() so that if it doesn't
      find anything to do, it just aborts.
      This bug was introduced in v3.7, so the patch is suitable
      for -stable kernels since then.
      Cc: stable@vger.kernel.org (v3.7+)
      Reported-by: default avatarmajianpeng <majianpeng@gmail.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
  12. 28 Feb, 2013 1 commit
    • Sasha Levin's avatar
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin authored
      I'm not sure why, but the hlist for each entry iterators were conceived
              list_for_each_entry(pos, head, member)
      The hlist ones were greedy and wanted an extra parameter:
              hlist_for_each_entry(tpos, pos, head, member)
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      Besides the semantic patch, there was some manual work required:
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      -T b;
          <+... when != b
      - b,
      c, d) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c, d) S
      - b,
      c, d) S
      - b,
      c) S
      for_each_busy_worker(a, c,
      - b,
      d) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c) S
      -(a, b)
      + sk_for_each_from(a) S
      - b,
      c, d) S
      - b,
      c) S
      - b,
      c, d, e) S
      - b,
      c) S
      - b,
      c) S
      - b,
      c, d) S
      - b,
      c) S
      - b,
      c, d) S
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      - b,
      c) S
      - b,
      c, d) S
      - b,
      c, d) S
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: default avatarPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  13. 27 Feb, 2013 1 commit
    • NeilBrown's avatar
      md: remove CONFIG_MULTICORE_RAID456 · 51acbcec
      NeilBrown authored
      This doesn't seem to actually help and we have an alternate
      multi-threading approach waiting in the wings, so just get
      rid of this config option and associated code.
      As a bonus, we remove one use of CONFIG_EXPERIMENTAL
      Cc: Dan Williams <djbw@fb.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
  14. 14 Jan, 2013 1 commit
    • Tejun Heo's avatar
      block: add missing block_bio_complete() tracepoint · 3a366e61
      Tejun Heo authored
      bio completion didn't kick block_bio_complete TP.  Only dm was
      explicitly triggering the TP on IO completion.  This makes
      block_bio_complete TP useless for tracers which want to know about
      bios, and all other bio based drivers skip generating blktrace
      completion events.
      This patch makes all bio completions via bio_endio() generate
      block_bio_complete TP.
      * Explicit trace_block_bio_complete() invocation removed from dm and
        the trace point is unexported.
      * @rq dropped from trace_block_bio_complete().  bios may fly around
        w/o queue associated.  Verifying and accessing the assocaited queue
        belongs to TP probes.
      * blktrace now gets both request and bio completions.  Make it ignore
        bio completions if request completion path is happening.
      This makes all bio based drivers generate blktrace completion events
      properly and makes the block_bio_complete TP actually useful.
      v2: With this change, block_bio_complete TP could be invoked on sg
          commands which have bio's with %NULL bi_bdev.  Update TP
          assignment code to check whether bio->bi_bdev is %NULL before
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Original-patch-by: default avatarNamhyung Kim <namhyung@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Neil Brown <neilb@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  15. 17 Dec, 2012 1 commit
  16. 13 Dec, 2012 1 commit
  17. 30 Nov, 2012 1 commit
    • Lukas Czerner's avatar
      wait: add wait_event_lock_irq() interface · eed8c02e
      Lukas Czerner authored
      New wait_event{_interruptible}_lock_irq{_cmd} macros added. This commit
      moves the private wait_event_lock_irq() macro from MD to regular wait
      includes, introduces new macro wait_event_lock_irq_cmd() instead of using
      the old method with omitting cmd parameter which is ugly and makes a use
      of new macros in the MD. It also introduces the _interruptible_ variant.
      The use of new interface is when one have a special lock to protect data
      structures used in the condition, or one also needs to invoke "cmd"
      before putting it to sleep.
      All new macros are expected to be called with the lock taken. The lock
      is released before sleep and is reacquired afterwards. We will leave the
      macro with the lock held.
      Note to DM: IMO this should also fix theoretical race on waitqueue while
      using simultaneously wait_event_lock_irq() and wait_event() because of
      lack of locking around current state setting and wait queue removal.
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  18. 21 Nov, 2012 2 commits
    • NeilBrown's avatar
      md/raid5: Make sure we clear R5_Discard when discard is finished. · ca64cae9
      NeilBrown authored
      commit 9e444768
          MD: raid5 avoid unnecessary zero page for trim
      change raid5 to clear R5_Discard when the complete request is
      handled rather than when submitting the per-device discard request.
      However it did not clear R5_Discard for the parity device.
      This means that if the stripe_head was reused before it expired from
      the cache, the setting would be wrong and a hang would result.
      Also if the R5_Uptodate bit happens to be set, R5_Discard again
      won't be cleared.  But R5_Uptodate really should be clear at this point.
      So make sure R5_Discard is cleared in all cases, and clear
      R5_Uptodate when a 'discard' completes.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • NeilBrown's avatar
      md/raid5: move resolving of reconstruct_state earlier in · ef5b7c69
      NeilBrown authored
      The chunk of code in stripe_handle which responds to a
      *_result value in reconstruct_state is really the completion
      of some processing that happened outside of handle_stripe
      (possibly asynchronously) and so should be one of the first
      things done in handle_stripe().
      After the next patch it will be important that it happens before
      handle_stripe_clean_event(), as that will clear some dev->flags
      bit that this code tests.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
  19. 20 Nov, 2012 1 commit
  20. 29 Oct, 2012 1 commit
  21. 11 Oct, 2012 9 commits
    • NeilBrown's avatar
      md/raid5: be careful not to resize_stripes too big. · e56108d6
      NeilBrown authored
      When a RAID5 is reshaping, conf->raid_disks is increased
      before mddev->delta_disks becomes zero.
      This can result in check_reshape calling resize_stripes with a
      number that is too large.  This particularly happens
      when md_check_recovery calls ->check_reshape().
      If we use ->previous_raid_disks, we don't risk this.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • Jianpeng Ma's avatar
      Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races · 7f7583d4
      Jianpeng Ma authored
      Now that multiple threads can handle stripes, it is safer to
      use an atomic64_t for resync_mismatches, to avoid update races.
      Signed-off-by: default avatarJianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • NeilBrown's avatar
      md/raid5: make sure to_read and to_write never go negative. · 1ed850f3
      NeilBrown authored
      to_read and to_write are part of the result of analysing
      a stripe before handling it.
      Their use is to avoid some loops and tests if the values are
      known to be zero.  Thus it is not a problem if they are a
      little bit larger than they should be.
      So decrementing them in handle_failed_stripe serves little value, and
      due to races it could cause some loops to be skipped incorrectly.
      So remove those decrements.
      Reported-by: default avatar"Jianpeng Ma" <majianpeng@gmail.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • Alexander Lyakas's avatar
    • NeilBrown's avatar
      md/raid5: protect debug message against NULL derefernce. · b97390ae
      NeilBrown authored
      The pr_debug in add_stripe_bio could race with something
      changing *bip, so it is best to hold the lock until
      after the pr_debug.
      Reported-by: default avatar"Jianpeng Ma" <majianpeng@gmail.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • NeilBrown's avatar
      md/raid5: add some missing locking in handle_failed_stripe. · 143c4d05
      NeilBrown authored
      We really should hold the stripe_lock while accessing
      'toread' else we could race with add_stripe_bio and corrupt
      a list.
      Reported-by: default avatar"Jianpeng Ma" <majianpeng@gmail.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • Shaohua Li's avatar
      MD: raid5 avoid unnecessary zero page for trim · 9e444768
      Shaohua Li authored
      We want to avoid zero discarded dev page, because it's useless for discard.
      But if we don't zero it, another read/write hit such page in the cache and will
      get inconsistent data.
      To avoid zero the page, we don't set R5_UPTODATE flag after construction is
      done. In this way, discard write request is still issued and finished, but read
      will not hit the page. If the stripe gets accessed soon, we need reread the
      stripe, but since the chance is low, the reread isn't a big deal.
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • Shaohua Li's avatar
      MD: raid5 trim support · 620125f2
      Shaohua Li authored
      Discard for raid4/5/6 has limitation. If discard request size is
      small, we do discard for one disk, but we need calculate parity and
      write parity disk.  To correctly calculate parity, zero_after_discard
      must be guaranteed. Even it's true, we need do discard for one disk
      but write another disks, which makes the parity disks wear out
      fast. This doesn't make sense. So an efficient discard for raid4/5/6
      should discard all data disks and parity disks, which requires the
      write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
      is smaller than chunk_size, such pattern is almost impossible in
      practice. So in this patch, I only handle the case that A's size
      equals to chunk_size. That is discard request should be aligned to
      stripe size and its size is multiple of stripe size.
      Since we can only handle request with specific alignment and size (or
      part of the request fitting stripes), we can't guarantee
      zero_after_discard even zero_after_discard is true in low level
      The block layer doesn't send down correctly aligned requests even
      correct discard alignment is set, so I must filter out.
      For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
      zero_after_discard is true for all disks, data is consistent after
      discard.  Otherwise, data might be lost. Let's consider a scenario:
      discard a stripe, write data to one disk and write parity disk. The
      stripe could be still inconsistent till then depending on using data
      from other data disks or parity disks to calculate new parity. If the
      disk is broken, we can't restore it. So in this patch, we only enable
      discard support if all disks have zero_after_discard.
      If discard fails in one disk, we face the similar inconsistent issue
      above. The patch will make discard follow the same path as normal
      write request. If discard fails, a resync will be scheduled to make
      the data consistent. This isn't good to have extra writes, but data
      consistency is important.
      If a subsequent read/write request hits raid5 cache of a discarded
      stripe, the discarded dev page should have zero filled, so the data is
      consistent. This patch will always zero dev page for discarded request
      stripe. This isn't optimal because discard request doesn't need such
      payload. Next patch will avoid it.
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
    • Shaohua Li's avatar
      MD: change the parameter of md thread · 4ed8731d
      Shaohua Li authored
      Change the thread parameter, so the thread can carry extra info. Next patch
      will use it.
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
  22. 24 Sep, 2012 1 commit
    • NeilBrown's avatar
      md/raid5: add missing spin_lock_init. · cb13ff69
      NeilBrown authored
      commit b17459c0
         raid5: add a per-stripe lock
      added a spin_lock to the 'stripe_head' struct.
      Unfortunately there are two places where this struct is allocated
      but the spin lock was only initialised in one of them.
      So add the missing spin_lock_init.
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
  23. 19 Sep, 2012 2 commits
  24. 01 Aug, 2012 1 commit