1. 16 Jan, 2015 1 commit
    • writeback: fix a subtle race condition in I_DIRTY clearing · 21fe2674
      Tejun Heo authored
      commit 9c6ac78eb3521c5937b2dd8a7d1b300f41092f45 upstream.
      
      After invoking ->dirty_inode(), __mark_inode_dirty() does smp_mb() and
      tests inode->i_state locklessly to see whether it already has all the
      necessary I_DIRTY bits set.  The comment above the barrier doesn't
      contain any useful information - memory barriers can't ensure "changes
      are seen by all cpus" by themselves.
      
      And it sure enough was broken.  Please consider the following
      scenario.
      
       CPU 0					CPU 1
       -------------------------------------------------------------------------------
      
      					enters __writeback_single_inode()
      					grabs inode->i_lock
      					tests PAGECACHE_TAG_DIRTY which is clear
       enters __set_page_dirty()
       grabs mapping->tree_lock
       sets PAGECACHE_TAG_DIRTY
       releases mapping->tree_lock
       leaves __set_page_dirty()
      
       enters __mark_inode_dirty()
       smp_mb()
       sees I_DIRTY_PAGES set
       leaves __mark_inode_dirty()
      					clears I_DIRTY_PAGES
      					releases inode->i_lock
      
      Now @inode has dirty pages w/ I_DIRTY_PAGES clear.  This doesn't seem
      to lead to an immediately critical problem because requeue_inode()
      later checks PAGECACHE_TAG_DIRTY instead of I_DIRTY_PAGES when
      deciding whether the inode needs to be requeued for IO and there are
      enough unintentional memory barriers in between, so while the inode
      ends up with inconsistent I_DIRTY_PAGES flag, it doesn't fall off the
      IO list.
      
      The lack of explicit barrier may also theoretically affect the other
      I_DIRTY bits which deal with metadata dirtiness.  There is no
      guarantee that a strong enough barrier exists between
      I_DIRTY_[DATA]SYNC clearing and write_inode() writing out the dirtied
      inode.  Filesystem inode writeout path likely has enough stuff which
      can behave as full barrier but it's theoretically possible that the
      writeout may not see all the updates from ->dirty_inode().
      
      Fix it by adding an explicit smp_mb() after I_DIRTY clearing.  Note
      that I_DIRTY_PAGES needs a special treatment as it always needs to be
      cleared to be interlocked with the lockless test on
      __mark_inode_dirty() side.  It's cleared unconditionally and
      reinstated after smp_mb() if the mapping still has dirty pages.
      
      Also add comments explaining how and why the barriers are paired.
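
      The clear, barrier, re-check ordering described above can be modelled
      in userspace.  This is only a sketch under stated assumptions: the
      toy_* names and the bit value are hypothetical, C11 seq_cst fences
      stand in for smp_mb(), and the i_lock / tree_lock locking of the real
      code is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define I_DIRTY_PAGES 0x1u  /* hypothetical bit value for this sketch */

struct toy_inode {
    atomic_uint i_state;        /* stands in for inode->i_state */
    atomic_bool mapping_dirty;  /* stands in for PAGECACHE_TAG_DIRTY */
};

/* Dirtier side: tag the mapping first, then test i_state locklessly. */
static void toy_mark_dirty(struct toy_inode *inode)
{
    atomic_store_explicit(&inode->mapping_dirty, true, memory_order_relaxed);
    /* Full barrier; pairs with the fence in toy_clear_dirty(). */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&inode->i_state, memory_order_relaxed) &
        I_DIRTY_PAGES)
        return;  /* flag already set, nothing more to do */
    atomic_fetch_or_explicit(&inode->i_state, I_DIRTY_PAGES,
                             memory_order_relaxed);
}

/*
 * Writeback side: clear the flag unconditionally, issue a full barrier,
 * then reinstate the flag if the mapping still has dirty pages.
 * Returns the I_DIRTY_PAGES bit as it was before clearing.
 */
static unsigned toy_clear_dirty(struct toy_inode *inode)
{
    unsigned old = atomic_fetch_and_explicit(&inode->i_state, ~I_DIRTY_PAGES,
                                             memory_order_relaxed);
    /* Full barrier; pairs with the fence in toy_mark_dirty(). */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&inode->mapping_dirty, memory_order_relaxed))
        atomic_fetch_or_explicit(&inode->i_state, I_DIRTY_PAGES,
                                 memory_order_relaxed);
    return old & I_DIRTY_PAGES;
}
```

      With both fences in place, either the dirtier sees the flag already
      cleared and sets it again, or the writeback side sees the dirty tag
      and reinstates the flag; the state "dirty pages with the flag clear"
      from the scenario above is excluded in this model.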
      
      Lightly tested.

      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 07 Jun, 2014 1 commit
  3. 27 Apr, 2014 2 commits
    • bdi: avoid oops on device removal · bf097203
      Jan Kara authored
      commit 5acda9d12dcf1ad0d9a5a2a7c646de3472fa7555 upstream.
      
      After commit 839a8e86 ("writeback: replace custom worker pool
      implementation with unbound workqueue"), when a device is removed while
      we are writing to it, we crash in bdi_writeback_workfn() ->
      set_worker_desc() because bdi->dev is NULL.
      
      This can happen because even though bdi_unregister() cancels all pending
      flushing work, nothing really prevents new ones from being queued from
      balance_dirty_pages() or other places.
      
      Fix the problem by clearing BDI_registered bit in bdi_unregister() and
      checking it before scheduling of any flushing work.
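
      The check-before-queue idea can be sketched as follows.  The toy_*
      names are hypothetical and this single-threaded model ignores the
      locking the real code needs to make the test and the queueing atomic
      with respect to unregistration.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct toy_bdi {
    atomic_bool registered;  /* stands in for the BDI_registered bit */
    int queued;              /* flush work items queued so far */
};

/* Queue flush work only while the bdi is still registered. */
static bool toy_queue_flush_work(struct toy_bdi *bdi)
{
    if (!atomic_load(&bdi->registered))
        return false;  /* device is going away: do not queue */
    bdi->queued++;
    return true;
}

/* Unregistration clears the bit first, so later queue attempts are refused. */
static void toy_bdi_unregister(struct toy_bdi *bdi)
{
    atomic_store(&bdi->registered, false);
    /* the real code would cancel already pending work after this point */
}
```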
      
      Fixes: 839a8e86
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Derek Basehore <dbasehore@chromium.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • backing_dev: fix hung task on sync · 39305a6a
      Derek Basehore authored
      commit 6ca738d60c563d5c6cf6253ee4b8e76fa77b2b9e upstream.
      
      bdi_wakeup_thread_delayed() used the mod_delayed_work() function to
      schedule work to write back dirty inodes.  The problem with this is that
      it can delay work that is scheduled for immediate execution, such as the
      work from sync_inodes_sb().  This can happen since mod_delayed_work()
      can now steal work from a workqueue.  This fixes the problem by using
      queue_delayed_work() instead.  This is a regression caused by commit
      839a8e86 ("writeback: replace custom worker pool implementation with
      unbound workqueue").
      
      The reason that this causes a problem is that laptop-mode will change
      the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default.
      In the case that bdi_wakeup_thread_delayed() races with
      sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung
      task.  Even if dirty_writeback_centisecs is not long enough to cause a
      hung task, we still don't want to delay sync for that long.
      
      We fix the problem by using queue_delayed_work() when we want to
      schedule writeback sometime in future.  This function doesn't change the
      timer if it is already armed.
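
      The difference between the two calls can be modelled with a toy
      delayed-work item.  The toy_* names and the tick values are
      hypothetical; only the "queue leaves an armed timer alone, mod always
      re-arms it" behaviour is taken from the text above.

```c
#include <stdbool.h>

struct toy_dwork {
    bool pending;           /* is the work item already queued? */
    unsigned long expires;  /* absolute expiry time, in arbitrary ticks */
};

/* Like queue_delayed_work(): does nothing if the item is already pending. */
static bool toy_queue_delayed(struct toy_dwork *w, unsigned long now,
                              unsigned long delay)
{
    if (w->pending)
        return false;  /* timer stays as it was */
    w->pending = true;
    w->expires = now + delay;
    return true;
}

/* Like mod_delayed_work(): always re-arms, even if the item was pending. */
static bool toy_mod_delayed(struct toy_dwork *w, unsigned long now,
                            unsigned long delay)
{
    bool was_pending = w->pending;

    w->pending = true;
    w->expires = now + delay;  /* may push back work meant to run now */
    return was_pending;
}
```

      If sync queues the item with zero delay and a delayed wakeup then
      arrives, the queue variant keeps the immediate expiry while the mod
      variant postpones it by the full delay.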
      
      For the same reason, we also change bdi_writeback_workfn() to
      immediately queue the work again in the case that the work_list is not
      empty.  The same problem can happen if the sync work is run on the
      rescue worker.
      
      [jack@suse.cz: update changelog, add comment, use bdi_wakeup_thread_delayed()]
      Signed-off-by: Derek Basehore <dbasehore@chromium.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zento.linux.org.uk>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Derek Basehore <dbasehore@chromium.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Benson Leung <bleung@chromium.org>
      Cc: Sonny Rao <sonnyrao@chromium.org>
      Cc: Luigi Semenzato <semenzato@chromium.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  4. 25 Jan, 2014 1 commit
    • writeback: Fix data corruption on NFS · cc46cb33
      Jan Kara authored
      commit f9b0e058cbd04ada76b13afffa7e1df830543c24 upstream.
      
      Commit 4f8ad655 "writeback: Refactor writeback_single_inode()" added
      a condition to skip clean inodes. However, this is wrong in WB_SYNC_ALL
      mode because there we also want to wait for outstanding writeback on
      a possibly clean inode. This was causing occasional data corruption
      on NFS because it uses sync_inode() to make sure all outstanding writes
      are flushed to the server before truncating the inode, and with
      sync_inode() returning prematurely, the file was sometimes extended back
      by an outstanding write after it was truncated.
      
      So modify the test to also check for pages under writeback in
      WB_SYNC_ALL mode.
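
      The corrected skip test can be written as a small predicate.  This is
      a sketch, not the kernel code: the toy_* names are hypothetical and
      the two booleans stand in for the inode dirty state and for the
      mapping having pages under writeback.

```c
#include <stdbool.h>

enum toy_sync_mode { TOY_WB_SYNC_NONE, TOY_WB_SYNC_ALL };

/*
 * Return true if writeback of this inode may be skipped.
 * A clean inode is skipped, except in WB_SYNC_ALL mode when it still
 * has pages under writeback that the caller must wait for.
 */
static bool toy_can_skip_inode(bool inode_dirty, bool pages_under_writeback,
                               enum toy_sync_mode mode)
{
    if (inode_dirty)
        return false;
    if (mode == TOY_WB_SYNC_ALL && pages_under_writeback)
        return false;
    return true;
}
```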
      
      Fixes: 4f8ad655
      Reported-and-tested-by: Dan Duval <dan.duval@oracle.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 01 May, 2013 1 commit
    • writeback: set worker desc to identify writeback workers in task dumps · ef3b1019
      Tejun Heo authored
      Writeback has been recently converted to use workqueue instead of its
      private thread pool implementation.  One negative side effect of this
      conversion is that there's no easy way to tell which backing device a
      writeback work item was working on at the time of task dump, be it
      sysrq-t, BUG, WARN or whatever, which, according to our writeback
      brethren, is important in tracking down issues with a lot of mounted
      file systems on a lot of different devices.
      
      This patch restores that information using the new worker description
      facility.  bdi_writeback_workfn() calls set_worker_desc() to identify
      which bdi it's working on.  The description is printed out together with
      the workqueue name and worker function as in the following example dump.
      
       WARNING: at fs/fs-writeback.c:1015 bdi_writeback_workfn+0x2b4/0x3c0()
       Modules linked in:
       Pid: 28, comm: kworker/u18:0 Not tainted 3.9.0-rc1-work+ #24 empty empty/S3992
       Workqueue: writeback bdi_writeback_workfn (flush-8:16)
        ffffffff820a3a98 ffff88015b927cb8 ffffffff81c61855 ffff88015b927cf8
        ffffffff8108f500 0000000000000000 ffff88007a171948 ffff88007a1716b0
        ffff88015b49df00 ffff88015b8d3940 0000000000000000 ffff88015b927d08
       Call Trace:
        [<ffffffff81c61855>] dump_stack+0x19/0x1b
        [<ffffffff8108f500>] warn_slowpath_common+0x70/0xa0
        [<ffffffff8108f54a>] warn_slowpath_null+0x1a/0x20
        [<ffffffff81200144>] bdi_writeback_workfn+0x2b4/0x3c0
        [<ffffffff810b4c87>] process_one_work+0x1d7/0x660
        [<ffffffff810b5c72>] worker_thread+0x122/0x380
        [<ffffffff810bdfea>] kthread+0xea/0xf0
        [<ffffffff81c6cedc>] ret_from_fork+0x7c/0xb0

      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 02 Apr, 2013 1 commit
    • writeback: replace custom worker pool implementation with unbound workqueue · 839a8e86
      Tejun Heo authored
      Writeback implements its own worker pool - each bdi can be associated
      with a worker thread which is created and destroyed dynamically.  The
      worker thread for the default bdi is always present and serves as the
      "forker" thread which forks off worker threads for other bdis.
      
      There's no reason for writeback to implement its own worker pool when
      using unbound workqueue instead is much simpler and more efficient.
      This patch replaces custom worker pool implementation in writeback
      with an unbound workqueue.
      
      The conversion isn't too complicated but the following points are worth
      mentioning.
      
      * bdi_writeback->last_active, task and wakeup_timer are removed.
        delayed_work ->dwork is added instead.  Explicit timer handling is
        no longer necessary.  Everything works by either queueing / modding
        / flushing / canceling the delayed_work item.
      
      * bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off
        bdi_writeback->dwork.  On each execution, it processes
        bdi->work_list and reschedules itself if there are more things to
        do.
      
        The function also handles low-mem condition, which used to be
        handled by the forker thread.  If the function is running off a
        rescuer thread, it only writes out a limited number of pages so that
        the rescuer can serve other bdis too.  This preserves the flusher
        creation failure behavior of the forker thread.
      
      * INIT_LIST_HEAD(&bdi->bdi_list) is used to tell
        bdi_writeback_workfn() about on-going bdi unregistration so that it
        always drains work_list even if it's running off the rescuer.  Note
        that the original code was broken in this regard.  Under memory
        pressure, a bdi could finish unregistration with non-empty
        work_list.
      
      * The default bdi is no longer special.  It now is treated the same as
        any other bdi and bdi_cap_flush_forker() is removed.
      
      * BDI_pending is no longer used.  Removed.
      
      * Some tracepoints become non-applicable.  The following TPs are
        removed - writeback_nothread, writeback_wake_thread,
        writeback_wake_forker_thread, writeback_thread_start,
        writeback_thread_stop.
      
      Everything, including devices coming and going away and rescuer
      operation under simulated memory pressure, seems to work fine in my
      test setup.

      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
  7. 14 Jan, 2013 1 commit
    • writeback: add more tracepoints · 9fb0a7da
      Tejun Heo authored
      Add tracepoints for page dirtying, writeback_single_inode start, inode
      dirtying and writeback.  For the latter two inode events, a pair of
      events are defined to denote start and end of the operations (the
      starting one has _start suffix and the one w/o suffix happens after
      the operation is complete).  These inode ops are FS specific and can
      be non-trivial and having enclosing tracepoints is useful for external
      tracers.
      
      This is part of tracepoint additions to improve visibility into
      dirtying / writeback operations for io tracer and userland.
      
      v2: writeback_dirty_inode[_start] TPs may be called for files on
          pseudo FSes w/ unregistered bdi.  Check whether bdi->dev is %NULL
          before dereferencing.
      
      v3: buffer dirtying moved to a block TP.

      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 12 Jan, 2013 1 commit
    • vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them · 10ee27a0
      Miao Xie authored
      writeback_inodes_sb(_nr)_if_idle() is re-implemented by replacing down_read()
      with down_read_trylock() because
      
      - If ->s_umount is write locked, then the sb is not idle. That is
        writeback_inodes_sb(_nr)_if_idle() needn't wait for the lock.
      
      - writeback_inodes_sb(_nr)_if_idle() grabs the s_umount lock when it wants
        to start writeback, which can deadlock during umount. To avoid the
        problem, ext4 and btrfs implemented their own writeback functions
        instead of using writeback_inodes_sb(_nr)_if_idle(), which introduced
        redundant code. It is better to re-implement
        writeback_inodes_sb(_nr)_if_idle() itself.
      
      The names of these two functions are cumbersome, so rename them to
      try_to_writeback_inodes_sb(_nr).
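
      The trylock approach can be illustrated with a single-threaded toy
      model.  Everything here is hypothetical (the toy_* names, the
      simplified rwsem, the kicked counter); it only shows that a
      write-locked s_umount makes the helper back off instead of blocking.

```c
#include <stdbool.h>

/* Very simplified stand-in for a reader/writer semaphore. */
struct toy_rwsem {
    bool write_locked;
    unsigned readers;
};

/* Take a read lock only if that is possible without waiting. */
static bool toy_down_read_trylock(struct toy_rwsem *sem)
{
    if (sem->write_locked)
        return false;
    sem->readers++;
    return true;
}

static void toy_up_read(struct toy_rwsem *sem)
{
    sem->readers--;
}

/*
 * Start writeback only if s_umount can be read-locked immediately.
 * *kicked counts how often writeback was started in this model.
 */
static bool toy_try_to_writeback(struct toy_rwsem *s_umount, unsigned *kicked)
{
    if (!toy_down_read_trylock(s_umount))
        return false;  /* sb is busy (e.g. being unmounted): skip */
    (*kicked)++;
    toy_up_read(s_umount);
    return true;
}
```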
      
      This idea came from Christoph Hellwig.
      Some code is from the patch of Kamal Mostafa.

      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
  9. 13 Dec, 2012 1 commit
  10. 27 Nov, 2012 1 commit
    • writeback: put unused inodes to LRU after writeback completion · 4eff96dd
      Jan Kara authored
      Commit 169ebd90 ("writeback: Avoid iput() from flusher thread")
      removed iget-iput pair from inode writeback.  As a side effect, inodes
      that are dirty during the iput_final() call will never be added to the
      inode LRU (iput_final() doesn't add dirty inodes to the LRU, and later
      when the inode is cleaned there's no one to add the inode there).  Thus
      inodes are effectively unreclaimable until someone looks them up again.
      
      The practical effect of this bug is limited by the fact that inodes are
      pinned by a dentry for long enough that the inode gets cleaned.  But
      still the bug can have nasty consequences leading up to OOM conditions
      under certain circumstances.  The following script can easily reproduce
      the problem:
      
        for (( i = 0; i < 1000; i++ )); do
          mkdir $i
          for (( j = 0; j < 1000; j++ )); do
            touch $i/$j
            echo 2 > /proc/sys/vm/drop_caches
          done
        done
      
      Afterwards, one needs to run 'sync; ls -lR' to make the inodes
      reclaimable again.
      
      We fix the issue by inserting unused clean inodes into the LRU after
      writeback finishes in inode_sync_complete().

      Signed-off-by: Jan Kara <jack@suse.cz>
      Reported-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: <stable@vger.kernel.org>		[3.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 09 Oct, 2012 1 commit
  12. 21 Sep, 2012 1 commit
  13. 20 Sep, 2012 1 commit
    • ext4: fix potential deadlock in ext4_nonda_switch() · 00d4e736
      Theodore Ts'o authored
      In ext4_nonda_switch(), if the file system is getting full we used to
      call writeback_inodes_sb_if_idle().  The problem is that we can be
      holding i_mutex already, and this causes a potential deadlock when
      writeback_inodes_sb_if_idle() tries to take s_umount.  (See
      lockdep output below).
      
      As it turns out we don't need to hold s_umount; the fact that we
      are in the middle of the write(2) system call will keep the superblock
      pinned.  Unfortunately writeback_inodes_sb() checks to make sure
      s_umount is taken, and the VFS uses a different mechanism for making
      sure the file system doesn't get unmounted out from under us.  The
      simplest way of dealing with this is to just simply grab s_umount
      using a trylock, and skip kicking the writeback flusher thread in the
      very unlikely case that we can't take a read lock on s_umount without
      blocking.
      
      Also, we now check the criteria for kicking the writeback thread
      before we decide whether to fall back to non-delayed writeback, so
      if there are any outstanding delayed allocation writes, we try to get
      them resolved as soon as possible.
      
         [ INFO: possible circular locking dependency detected ]
         3.6.0-rc1-00042-gce894ca #367 Not tainted
         -------------------------------------------------------
         dd/8298 is trying to acquire lock:
          (&type->s_umount_key#18){++++..}, at: [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46
      
         but task is already holding lock:
          (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3
      
         which lock already depends on the new lock.
      
         2 locks held by dd/8298:
          #0:  (sb_writers#2){.+.+.+}, at: [<c01ddcc5>] generic_file_aio_write+0x56/0xd3
          #1:  (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3
      
         stack backtrace:
         Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca #367
         Call Trace:
          [<c015b79c>] ? console_unlock+0x345/0x372
          [<c06d62a1>] print_circular_bug+0x190/0x19d
          [<c019906c>] __lock_acquire+0x86d/0xb6c
          [<c01999db>] ? mark_held_locks+0x5c/0x7b
          [<c0199724>] lock_acquire+0x66/0xb9
          [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
          [<c06db935>] down_read+0x28/0x58
          [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
          [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46
          [<c026f3b2>] ext4_nonda_switch+0xe1/0xf4
          [<c0271ece>] ext4_da_write_begin+0x27/0x193
          [<c01dcdb0>] generic_file_buffered_write+0xc8/0x1bb
          [<c01ddc47>] __generic_file_aio_write+0x1dd/0x205
          [<c01ddce7>] generic_file_aio_write+0x78/0xd3
          [<c026d336>] ext4_file_write+0x480/0x4a6
          [<c0198c1d>] ? __lock_acquire+0x41e/0xb6c
          [<c0180944>] ? sched_clock_cpu+0x11a/0x13e
          [<c01967e9>] ? trace_hardirqs_off+0xb/0xd
          [<c018099f>] ? local_clock+0x37/0x4e
          [<c0209f2c>] do_sync_write+0x67/0x9d
          [<c0209ec5>] ? wait_on_retry_sync_kiocb+0x44/0x44
          [<c020a7b9>] vfs_write+0x7b/0xe6
          [<c020a9a6>] sys_write+0x3b/0x64
          [<c06dd4bd>] syscall_call+0x7/0xb

      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
  14. 11 Sep, 2012 1 commit
  15. 01 Aug, 2012 1 commit
  16. 22 Jul, 2012 1 commit
  17. 09 Jun, 2012 1 commit
  18. 08 Jun, 2012 1 commit
    • writeback: Fix lock imbalance in writeback_sb_inodes() · ead188f9
      Jan Kara authored
      Fix bug introduced by 169ebd90.  We have to have wb->list_lock locked when
      restarting the writeback loop after having waited for inode writeback.
      
      Bug description by Ted Tso:
      
        I can reproduce this fairly easily by using ext4 w/o a journal, running
        under KVM with 1024megs memory, with fsstress (xfstests #13):
      
        [   45.153294] =====================================
        [   45.154784] [ BUG: bad unlock balance detected! ]
        [   45.155591] 3.5.0-rc1-00002-gb22b1f17 #124 Not tainted
        [   45.155591] -------------------------------------
        [   45.155591] flush-254:16/2499 is trying to release lock (&(&wb->list_lock)->rlock) at:
        [   45.155591] [<c022c3da>] writeback_sb_inodes+0x160/0x327
        [   45.155591] but there are no more locks to release!

      Reported-by: Theodore Ts'o <tytso@mit.edu>
      Tested-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
  19. 06 May, 2012 7 commits
    • writeback: Avoid iput() from flusher thread · 169ebd90
      Jan Kara authored
      Doing iput() from flusher thread (writeback_sb_inodes()) can create problems
      because iput() can do a lot of work - for example truncate the inode if it's
      the last iput on unlinked file. Some filesystems depend on flusher thread
      progressing (e.g. because they need to flush delayed-allocation blocks to reduce
      allocation uncertainty) and so flusher thread doing truncate creates
      interesting dependencies and possibilities for deadlocks.
      
      We get rid of iput() in flusher thread by using the fact that I_SYNC inode
      flag effectively pins the inode in memory. So if we take care to either hold
      i_lock or have I_SYNC set, we can get away without taking inode reference
      in writeback_sb_inodes().
      
      As a side effect of these changes, we also fix possible use-after-free in
      wb_writeback() because inode_wait_for_writeback() call could try to reacquire
      i_lock on an inode that was already freed.

      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
    • writeback: Refactor writeback_single_inode() · 4f8ad655
      Jan Kara authored
      The code in writeback_single_inode() is relatively complex. The list requeueing
      logic makes sense only for flusher thread but not really for sync_inode() or
      write_inode_now() callers. Also when we want to get rid of inode references
      held by flusher thread, we will need a special I_SYNC handling there.
      
      So separate part of writeback_single_inode() which does the real writeback work
      into __writeback_single_inode() and make writeback_single_inode() do only stuff
      necessary for callers writing only one inode, moving the special list handling
      into writeback_sb_inodes(). As a side effect this fixes a possible race where we
      could skip some inode during sync(2) because other writer refiled it from b_io
      to b_dirty list. Also I_SYNC handling is moved into the callers of
      __writeback_single_inode() to make locking easier.

      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
    • writeback: Remove wb->list_lock from writeback_single_inode() · f0d07b7f
      Jan Kara authored
      writeback_single_inode() doesn't need wb->list_lock for anything on entry now.
      So remove the requirement. This makes locking of writeback_single_inode()
      temporarily awkward (entering with i_lock, returning with i_lock and
      wb->list_lock) but it will be sanitized in the next patch.
      
      Also inode_wait_for_writeback() doesn't need wb->list_lock for anything. It was
      just taking it to make usage convenient for callers but with
      writeback_single_inode() changing it's not very convenient anymore. So remove
      the lock from that function.

      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
    • writeback: Separate inode requeueing after writeback · ccb26b5a
      Jan Kara authored
      Move inode requeueing after inode has been written out into a separate
      function.

      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
    • writeback: Move I_DIRTY_PAGES handling · 6290be1c
      Jan Kara authored
      Instead of clearing I_DIRTY_PAGES and resetting it when we didn't succeed in
      writing them all, just clear the bit only when we succeeded writing all the
      pages. We also move the clearing of the bit close to other i_state handling to
      separate it from writeback list handling. This is desirable because list
      handling will differ for flusher thread and other writeback_single_inode()
      callers in future. No filesystem plays any tricks with I_DIRTY_PAGES (like
      checking it in ->writepages or ->write_inode implementation) so this movement
      is safe.

      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
    • writeback: Move requeueing when I_SYNC set to writeback_sb_inodes() · cc1676d9
      Jan Kara authored
      When writeback_single_inode() is called on inode which has I_SYNC already
      set while doing WB_SYNC_NONE, inode is moved to b_more_io list. However
      this makes sense only if the caller is flusher thread. For other callers of
      writeback_single_inode() it doesn't really make sense and may be even wrong
      - flusher thread may be doing WB_SYNC_ALL writeback in parallel.
      
      So we move requeueing from writeback_single_inode() to writeback_sb_inodes().

      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
    • writeback: Move clearing of I_SYNC into inode_sync_complete() · 365b94ae
      Jan Kara authored
      Move clearing of I_SYNC into inode_sync_complete().  It is more logical to have
      clearing of I_SYNC bit and waking of waiters in one place. Also later we will
      have two places needing to clear I_SYNC and wake up waiters so this allows them
      to use the common helper. Moving of I_SYNC clearing to a later stage of
      writeback_single_inode() is safe since we hold i_lock all the time.

      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
  20. 21 Mar, 2012 2 commits
  21. 07 Mar, 2012 1 commit
  22. 29 Feb, 2012 1 commit
  23. 01 Feb, 2012 1 commit
  24. 08 Jan, 2012 1 commit
  25. 04 Jan, 2012 1 commit
  26. 18 Dec, 2011 2 commits
  27. 29 Nov, 2011 1 commit
  28. 21 Nov, 2011 1 commit
    • freezer: implement and use kthread_freezable_should_stop() · 8a32c441
      Tejun Heo authored
      Writeback and thinkpad_acpi have been using thaw_process() to prevent
      deadlock between the freezer and kthread_stop(); unfortunately, this
      is inherently racy - nothing prevents freezing from happening between
      thaw_process() and kthread_stop().
      
      This patch implements kthread_freezable_should_stop() which enters
      refrigerator if necessary but is guaranteed to return if
      kthread_stop() is invoked.  Both thaw_process() users are converted to
      use the new function.
      
      Note that this deadlock condition exists for many of freezable
      kthreads.  They need to be converted to use the new should_stop or
      freezable workqueue.
      
      Tested with synthetic test case.

      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Henrique de Moraes Holschuh <ibm-acpi@hmh.eng.br>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
  29. 30 Oct, 2011 2 commits
  30. 03 Oct, 2011 1 commit
    • writeback: per-bdi background threshold · b00949aa
      Wu Fengguang authored
      One thing that puzzled me is that in the JBOD case, the per-disk writeout
      performance is lower than in the corresponding single-disk case even
      when they have comparable bdi_thresh. Tracing shows that in the single
      disk case, bdi_writeback is always kept high while in the JBOD case, it
      could drop low from time to time and correspondingly bdi_reclaimable
      could sometimes rush high.
      
      The fix is to watch bdi_reclaimable and kick background writeback as
      soon as it goes high. This resembles the global background threshold
      but in per-bdi manner. The trick is, as long as bdi_reclaimable does
      not go high, bdi_writeback naturally won't go low because
      bdi_reclaimable+bdi_writeback ~= bdi_thresh.
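
      The extra per-bdi check can be sketched as a predicate.  The function
      name and parameters are hypothetical, and how the per-bdi background
      threshold is derived from the global one is deliberately left to the
      caller; only the "kick background writeback when bdi_reclaimable goes
      high" rule comes from the description above.

```c
#include <stdbool.h>

/*
 * Decide whether background writeback should run for one bdi.
 * All values are page counts.  bdi_bg_thresh is assumed to be the
 * bdi's share of the global background threshold.
 */
static bool toy_over_bground_thresh(unsigned long global_reclaimable,
                                    unsigned long global_bg_thresh,
                                    unsigned long bdi_reclaimable,
                                    unsigned long bdi_bg_thresh)
{
    /* Existing global rule. */
    if (global_reclaimable > global_bg_thresh)
        return true;
    /* Per-bdi rule: react as soon as this bdi's reclaimable pages pile up. */
    return bdi_reclaimable > bdi_bg_thresh;
}
```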
      
      With less fluctuation in writeback pages, JBOD performance is observed to
      increase noticeably in various cases.
      
      vmstat:nr_written values before/after patch:
      
        3.1.0-rc4-wo-underrun+      3.1.0-rc4-bgthresh3+  
      ------------------------  ------------------------  
                     125596480       +25.9%    158179363  JBOD-10HDD-16G/ext4-100dd-1M-24p-16384M-20:10-X
                      61790815      +110.4%    130032231  JBOD-10HDD-16G/ext4-10dd-1M-24p-16384M-20:10-X
                      58853546        -0.1%     58823828  JBOD-10HDD-16G/ext4-1dd-1M-24p-16384M-20:10-X
                     110159811       +24.7%    137355377  JBOD-10HDD-16G/xfs-100dd-1M-24p-16384M-20:10-X
                      69544762       +10.8%     77080047  JBOD-10HDD-16G/xfs-10dd-1M-24p-16384M-20:10-X
                      50644862        +0.5%     50890006  JBOD-10HDD-16G/xfs-1dd-1M-24p-16384M-20:10-X
                      42677090       +28.0%     54643527  JBOD-10HDD-thresh=100M/ext4-100dd-1M-24p-16384M-100M:10-X
                      47491324       +13.3%     53785605  JBOD-10HDD-thresh=100M/ext4-10dd-1M-24p-16384M-100M:10-X
                      52548986        +0.9%     53001031  JBOD-10HDD-thresh=100M/ext4-1dd-1M-24p-16384M-100M:10-X
                      26783091       +36.8%     36650248  JBOD-10HDD-thresh=100M/xfs-100dd-1M-24p-16384M-100M:10-X
                      35526347       +14.0%     40492312  JBOD-10HDD-thresh=100M/xfs-10dd-1M-24p-16384M-100M:10-X
                      44670723        -1.1%     44177606  JBOD-10HDD-thresh=100M/xfs-1dd-1M-24p-16384M-100M:10-X
                     127996037       +22.4%    156719990  JBOD-10HDD-thresh=2G/ext4-100dd-1M-24p-16384M-2048M:10-X
                      57518856        +3.8%     59677625  JBOD-10HDD-thresh=2G/ext4-10dd-1M-24p-16384M-2048M:10-X
                      51919909       +12.2%     58269894  JBOD-10HDD-thresh=2G/ext4-1dd-1M-24p-16384M-2048M:10-X
                      86410514       +79.0%    154660433  JBOD-10HDD-thresh=2G/xfs-100dd-1M-24p-16384M-2048M:10-X
                      40132519       +38.6%     55617893  JBOD-10HDD-thresh=2G/xfs-10dd-1M-24p-16384M-2048M:10-X
                      48423248        +7.5%     52042927  JBOD-10HDD-thresh=2G/xfs-1dd-1M-24p-16384M-2048M:10-X
                     206041046       +44.1%    296846536  JBOD-10HDD-thresh=4G/xfs-100dd-1M-24p-16384M-4096M:10-X
                      72312903       -19.4%     58272885  JBOD-10HDD-thresh=4G/xfs-10dd-1M-24p-16384M-4096M:10-X
                      50635672        -0.5%     50384787  JBOD-10HDD-thresh=4G/xfs-1dd-1M-24p-16384M-4096M:10-X
                      68308534      +115.7%    147324758  JBOD-10HDD-thresh=800M/ext4-100dd-1M-24p-16384M-800M:10-X
                      57882933       +14.5%     66269621  JBOD-10HDD-thresh=800M/ext4-10dd-1M-24p-16384M-800M:10-X
                      52183472       +12.8%     58855181  JBOD-10HDD-thresh=800M/ext4-1dd-1M-24p-16384M-800M:10-X
                      53788956       +94.2%    104460352  JBOD-10HDD-thresh=800M/xfs-100dd-1M-24p-16384M-800M:10-X
                      44493342       +35.5%     60298210  JBOD-10HDD-thresh=800M/xfs-10dd-1M-24p-16384M-800M:10-X
                      42641209       +18.9%     50681038  JBOD-10HDD-thresh=800M/xfs-1dd-1M-24p-16384M-800M:10-X

      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>