1. 17 May, 2015 1 commit
  2. 26 Mar, 2015 1 commit
    • Ryusuke Konishi's avatar
      nilfs2: fix deadlock of segment constructor during recovery · f37b5d42
      Ryusuke Konishi authored
      commit 283ee1482f349d6c0c09dfb725db5880afc56813 upstream.
      
      According to a report from Yuxuan Shui, nilfs2 in kernel 3.19 got stuck
      during recovery at mount time.  The code path that caused the deadlock was
      as follows:
      
        nilfs_fill_super()
          load_nilfs()
            nilfs_salvage_orphan_logs()
              * Do roll-forwarding, attach segment constructor for recovery,
                and kick it.
      
              nilfs_segctor_thread()
                nilfs_segctor_thread_construct()
                 * A lock is held with nilfs_transaction_lock()
                   nilfs_segctor_do_construct()
                     nilfs_segctor_drop_written_files()
                       iput()
                         iput_final()
                           write_inode_now()
                             writeback_single_inode()
                               __writeback_single_inode()
                                 do_writepages()
                                   nilfs_writepage()
                                     nilfs_construct_dsync_segment()
                                       nilfs_transaction_lock() --> deadlock
      
      This can happen if commit 7ef3ff2fea8b ("nilfs2: fix deadlock of segment
      constructor over I_SYNC flag") is applied and roll-forward recovery was
      performed at mount time.  The roll-forward recovery can happen if datasync
      write is done and the file system crashes immediately after that.  For
      instance, we can reproduce the issue with the following steps:
      
       < nilfs2 is mounted on /nilfs (device: /dev/sdb1) >
       # dd if=/dev/zero of=/nilfs/test bs=4k count=1 && sync
       # dd if=/dev/zero of=/nilfs/test conv=notrunc oflag=dsync bs=4k
       count=1 && reboot -nfh
       < the system will immediately reboot >
       # mount -t nilfs2 /dev/sdb1 /nilfs
      
      The deadlock occurs because iput() can run segment constructor through
      writeback_single_inode() if MS_ACTIVE flag is not set on sb->s_flags.  The
      above commit changed segment constructor so that it calls iput()
      asynchronously for inodes with i_nlink == 0, but that change was
      imperfect.
      
      This fixes the another deadlock by deferring iput() in segment constructor
      even for the case that mount is not finished, that is, for the case that
      MS_ACTIVE flag is not set.
      Signed-off-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Reported-by: default avatarYuxuan Shui <yshuiv7@gmail.com>
      Tested-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f37b5d42
  3. 18 Mar, 2015 1 commit
    • Ryusuke Konishi's avatar
      nilfs2: fix potential memory overrun on inode · b6b14e98
      Ryusuke Konishi authored
      commit 957ed60b53b519064a54988c4e31e0087e47d091 upstream.
      
      Each inode of nilfs2 stores a root node of a b-tree, and it turned out to
      have a memory overrun issue:
      
      Each b-tree node of nilfs2 stores a set of key-value pairs and the number
      of them (in "bn_nchildren" member of nilfs_btree_node struct), as well as
      a few other "bn_*" members.
      
      Since the value of "bn_nchildren" is used for operations on the key-values
      within the b-tree node, it can cause memory access overrun if a large
      number is incorrectly set to "bn_nchildren".
      
      For instance, nilfs_btree_node_lookup() function determines the range of
      binary search with it, and too large "bn_nchildren" leads
      nilfs_btree_node_get_key() in that function to overrun.
      
      As for intermediate b-tree nodes, this is prevented by a sanity check
      performed when each node is read from a drive, however, no sanity check
      has been done for root nodes stored in inodes.
      
      This patch fixes the issue by adding missing sanity check against b-tree
      root nodes so that it's called when on-memory inodes are read from ifile,
      inode metadata file.
      Signed-off-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b6b14e98
  4. 11 Feb, 2015 1 commit
    • Ryusuke Konishi's avatar
      nilfs2: fix deadlock of segment constructor over I_SYNC flag · ec7cae16
      Ryusuke Konishi authored
      commit 7ef3ff2fea8bf5e4a21cef47ad87710a3d0fdb52 upstream.
      
      Nilfs2 eventually hangs in a stress test with fsstress program.  This
      issue was caused by the following deadlock over I_SYNC flag between
      nilfs_segctor_thread() and writeback_sb_inodes():
      
        nilfs_segctor_thread()
          nilfs_segctor_thread_construct()
            nilfs_segctor_unlock()
              nilfs_dispose_list()
                iput()
                  iput_final()
                    evict()
                      inode_wait_for_writeback()  * wait for I_SYNC flag
      
        writeback_sb_inodes()
           * set I_SYNC flag on inode->i_state
          __writeback_single_inode()
            do_writepages()
              nilfs_writepages()
                nilfs_construct_dsync_segment()
                  nilfs_segctor_sync()
                     * wait for completion of segment constructor
          inode_sync_complete()
             * clear I_SYNC flag after __writeback_single_inode() completed
      
      writeback_sb_inodes() calls do_writepages() for dirty inodes after
      setting I_SYNC flag on inode->i_state.  do_writepages() in turn calls
      nilfs_writepages(), which can run segment constructor and wait for its
      completion.  On the other hand, segment constructor calls iput(), which
      can call evict() and wait for the I_SYNC flag on
      inode_wait_for_writeback().
      
      Since segment constructor doesn't know when I_SYNC will be set, it
      cannot know whether iput() will block or not unless inode->i_nlink has a
      non-zero count.  We can prevent evict() from being called in iput() by
      implementing sop->drop_inode(), but it's not preferable to leave inodes
      with i_nlink == 0 for long periods because it even defers file
      truncation and inode deallocation.  So, this instead resolves the
      deadlock by calling iput() asynchronously with a workqueue for inodes
      with i_nlink == 0.
      Signed-off-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Tested-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ec7cae16
  5. 16 Jan, 2015 1 commit
    • Ryusuke Konishi's avatar
      nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races · 344e833f
      Ryusuke Konishi authored
      commit 705304a863cc41585508c0f476f6d3ec28cf7e00 upstream.
      
      Same story as in commit 41080b5a ("nfsd race fixes: ext2") (similar
      ext2 fix) except that nilfs2 needs to use insert_inode_locked4() instead
      of insert_inode_locked() and a bug of a check for dead inodes needs to
      be fixed.
      
      If nilfs_iget() is called from nfsd after nilfs_new_inode() calls
      insert_inode_locked4(), nilfs_iget() will wait for unlock_new_inode() at
      the end of nilfs_mkdir()/nilfs_create()/etc to unlock the inode.
      
      If nilfs_iget() is called before nilfs_new_inode() calls
      insert_inode_locked4(), it will create an in-core inode and read its
      data from the on-disk inode.  But, nilfs_iget() will find i_nlink equals
      zero and fail at nilfs_read_inode_common(), which will lead it to call
      iget_failed() and cleanly fail.
      
      However, this sanity check doesn't work as expected for reused on-disk
      inodes because they leave a non-zero value in i_mode field and it
      hinders the test of i_nlink.  This patch also fixes the issue by
      removing the test on i_mode that nilfs2 doesn't need.
      Signed-off-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      344e833f
  6. 05 Oct, 2014 1 commit
    • Andreas Rohner's avatar
      nilfs2: fix data loss with mmap() · a0778f70
      Andreas Rohner authored
      commit 56d7acc792c0d98f38f22058671ee715ff197023 upstream.
      
      This bug leads to reproducible silent data loss, despite the use of
      msync(), sync() and a clean unmount of the file system.  It is easily
      reproducible with the following script:
      
        ----------------[BEGIN SCRIPT]--------------------
        mkfs.nilfs2 -f /dev/sdb
        mount /dev/sdb /mnt
      
        dd if=/dev/zero bs=1M count=30 of=/mnt/testfile
      
        umount /mnt
        mount /dev/sdb /mnt
        CHECKSUM_BEFORE="$(md5sum /mnt/testfile)"
      
        /root/mmaptest/mmaptest /mnt/testfile 30 10 5
      
        sync
        CHECKSUM_AFTER="$(md5sum /mnt/testfile)"
        umount /mnt
        mount /dev/sdb /mnt
        CHECKSUM_AFTER_REMOUNT="$(md5sum /mnt/testfile)"
        umount /mnt
      
        echo "BEFORE MMAP:\t$CHECKSUM_BEFORE"
        echo "AFTER MMAP:\t$CHECKSUM_AFTER"
        echo "AFTER REMOUNT:\t$CHECKSUM_AFTER_REMOUNT"
        ----------------[END SCRIPT]--------------------
      
      The mmaptest tool looks something like this (very simplified, with
      error checking removed):
      
        ----------------[BEGIN mmaptest]--------------------
        data = mmap(NULL, file_size - file_offset, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, file_offset);
      
        for (i = 0; i < write_count; ++i) {
              memcpy(data + i * 4096, buf, sizeof(buf));
              msync(data, file_size - file_offset, MS_SYNC))
        }
        ----------------[END mmaptest]--------------------
      
      The output of the script looks something like this:
      
        BEFORE MMAP:    281ed1d5ae50e8419f9b978aab16de83  /mnt/testfile
        AFTER MMAP:     6604a1c31f10780331a6850371b3a313  /mnt/testfile
        AFTER REMOUNT:  281ed1d5ae50e8419f9b978aab16de83  /mnt/testfile
      
      So it is clear, that the changes done using mmap() do not survive a
      remount.  This can be reproduced a 100% of the time.  The problem was
      introduced in commit 136e8770 ("nilfs2: fix issue of
      nilfs_set_page_dirty() for page at EOF boundary").
      
      If the page was read with mpage_readpage() or mpage_readpages() for
      example, then it has no buffers attached to it.  In that case
      page_has_buffers(page) in nilfs_set_page_dirty() will be false.
      Therefore nilfs_set_file_dirty() is never called and the pages are never
      collected and never written to disk.
      
      This patch fixes the problem by also calling nilfs_set_file_dirty() if the
      page has no buffers attached to it.
      
      [akpm@linux-foundation.org: s/PAGE_SHIFT/PAGE_CACHE_SHIFT/]
      Signed-off-by: default avatarAndreas Rohner <andreas.rohner@gmx.net>
      Tested-by: default avatarAndreas Rohner <andreas.rohner@gmx.net>
      Signed-off-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a0778f70
  7. 25 Jan, 2014 1 commit
    • Andreas Rohner's avatar
      nilfs2: fix segctor bug that causes file system corruption · 4cb1e59f
      Andreas Rohner authored
      commit 70f2fe3a26248724d8a5019681a869abdaf3e89a upstream.
      
      There is a bug in the function nilfs_segctor_collect, which results in
      active data being written to a segment, that is marked as clean.  It is
      possible, that this segment is selected for a later segment
      construction, whereby the old data is overwritten.
      
      The problem shows itself with the following kernel log message:
      
        nilfs_sufile_do_cancel_free: segment 6533 must be clean
      
      Usually a few hours later the file system gets corrupted:
      
        NILFS: bad btree node (blocknr=8748107): level = 0, flags = 0x0, nchildren = 0
        NILFS error (device sdc1): nilfs_bmap_last_key: broken bmap (inode number=114660)
      
      The issue can be reproduced with a file system that is nearly full and
      with the cleaner running, while some IO intensive task is running.
      Although it is quite hard to reproduce.
      
      This is what happens:
      
       1. The cleaner starts the segment construction
       2. nilfs_segctor_collect is called
       3. sc_stage is on NILFS_ST_SUFILE and segments are freed
       4. sc_stage is on NILFS_ST_DAT current segment is full
       5. nilfs_segctor_extend_segments is called, which
          allocates a new segment
       6. The new segment is one of the segments freed in step 3
       7. nilfs_sufile_cancel_freev is called and produces an error message
       8. Loop around and the collection starts again
       9. sc_stage is on NILFS_ST_SUFILE and segments are freed
          including the newly allocated segment, which will contain active
          data and can be allocated at a later time
      10. A few hours later another segment construction allocates the
          segment and causes file system corruption
      
      This can be prevented by simply reordering the statements.  If
      nilfs_sufile_cancel_freev is called before nilfs_segctor_extend_segments
      the freed segments are marked as dirty and cannot be allocated any more.
      Signed-off-by: default avatarAndreas Rohner <andreas.rohner@gmx.net>
      Reviewed-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Tested-by: default avatarAndreas Rohner <andreas.rohner@gmx.net>
      Signed-off-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4cb1e59f
  8. 13 Oct, 2013 1 commit
    • Vyacheslav Dubeyko's avatar
      nilfs2: fix issue with race condition of competition between segments for dirty blocks · d8974c7f
      Vyacheslav Dubeyko authored
      commit 7f42ec3941560f0902fe3671e36f2c20ffd3af0a upstream.
      
      Many NILFS2 users were reported about strange file system corruption
      (for example):
      
         NILFS: bad btree node (blocknr=185027): level = 0, flags = 0x0, nchildren = 768
         NILFS error (device sda4): nilfs_bmap_last_key: broken bmap (inode number=11540)
      
      But such error messages are consequence of file system's issue that takes
      place more earlier.  Fortunately, Jerome Poulin <jeromepoulin@gmail.com>
      and Anton Eliasson <devel@antoneliasson.se> were reported about another
      issue not so recently.  These reports describe the issue with segctor
      thread's crash:
      
        BUG: unable to handle kernel paging request at 0000000000004c83
        IP: nilfs_end_page_io+0x12/0xd0 [nilfs2]
      
        Call Trace:
         nilfs_segctor_do_construct+0xf25/0x1b20 [nilfs2]
         nilfs_segctor_construct+0x17b/0x290 [nilfs2]
         nilfs_segctor_thread+0x122/0x3b0 [nilfs2]
         kthread+0xc0/0xd0
         ret_from_fork+0x7c/0xb0
      
      These two issues have one reason.  This reason can raise third issue
      too.  Third issue results in hanging of segctor thread with eating of
      100% CPU.
      
      REPRODUCING PATH:
      
      One of the possible way or the issue reproducing was described by
      Jermoe me Poulin <jeromepoulin@gmail.com>:
      
      1. init S to get to single user mode.
      2. sysrq+E to make sure only my shell is running
      3. start network-manager to get my wifi connection up
      4. login as root and launch "screen"
      5. cd /boot/log/nilfs which is a ext3 mount point and can log when NILFS dies.
      6. lscp | xz -9e > lscp.txt.xz
      7. mount my snapshot using mount -o cp=3360839,ro /dev/vgUbuntu/root /mnt/nilfs
      8. start a screen to dump /proc/kmsg to text file since rsyslog is killed
      9. start a screen and launch strace -f -o find-cat.log -t find
      /mnt/nilfs -type f -exec cat {} > /dev/null \;
      10. start a screen and launch strace -f -o apt-get.log -t apt-get update
      11. launch the last command again as it did not crash the first time
      12. apt-get crashes
      13. ps aux > ps-aux-crashed.log
      13. sysrq+W
      14. sysrq+E  wait for everything to terminate
      15. sysrq+SUSB
      
      Simplified way of the issue reproducing is starting kernel compilation
      task and "apt-get update" in parallel.
      
      REPRODUCIBILITY:
      
      The issue is reproduced not stable [60% - 80%].  It is very important to
      have proper environment for the issue reproducing.  The critical
      conditions for successful reproducing:
      
      (1) It should have big modified file by mmap() way.
      
      (2) This file should have the count of dirty blocks are greater that
          several segments in size (for example, two or three) from time to time
          during processing.
      
      (3) It should be intensive background activity of files modification
          in another thread.
      
      INVESTIGATION:
      
      First of all, it is possible to see that the reason of crash is not valid
      page address:
      
        NILFS [nilfs_segctor_complete_write]:2100 bh->b_count 0, bh->b_blocknr 13895680, bh->b_size 13897727, bh->b_page 0000000000001a82
        NILFS [nilfs_segctor_complete_write]:2101 segbuf->sb_segnum 6783
      
      Moreover, value of b_page (0x1a82) is 6786.  This value looks like segment
      number.  And b_blocknr with b_size values look like block numbers.  So,
      buffer_head's pointer points on not proper address value.
      
      Detailed investigation of the issue is discovered such picture:
      
        [-----------------------------SEGMENT 6783-------------------------------]
        NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
        NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
        NILFS [nilfs_segctor_do_construct]:2336 nilfs_segctor_assign
        NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
        NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
        NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
        NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
        NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111149024, segbuf->sb_segnum 6783
      
        [-----------------------------SEGMENT 6784-------------------------------]
        NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
        NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
        NILFS [nilfs_lookup_dirty_data_buffers]:782 bh->b_count 1, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
        NILFS [nilfs_lookup_dirty_data_buffers]:783 bh->b_assoc_buffers.next ffff8802174a6798, bh->b_assoc_buffers.prev ffff880221cffee8
        NILFS [nilfs_segctor_do_construct]:2336 nilfs_segctor_assign
        NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
        NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
        NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
        NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
        NILFS [nilfs_segbuf_submit_bh]:575 bh->b_count 1, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
        NILFS [nilfs_segbuf_submit_bh]:576 segbuf->sb_segnum 6784
        NILFS [nilfs_segbuf_submit_bh]:577 bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880218bcdf50
        NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111150080, segbuf->sb_segnum 6784, segbuf->sb_nbio 0
        [----------] ditto
        NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111164416, segbuf->sb_segnum 6784, segbuf->sb_nbio 15
      
        [-----------------------------SEGMENT 6785-------------------------------]
        NILFS [nilfs_segctor_do_construct]:2310 nilfs_segctor_begin_construction
        NILFS [nilfs_segctor_do_construct]:2321 nilfs_segctor_collect
        NILFS [nilfs_lookup_dirty_data_buffers]:782 bh->b_count 2, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
        NILFS [nilfs_lookup_dirty_data_buffers]:783 bh->b_assoc_buffers.next ffff880219277e80, bh->b_assoc_buffers.prev ffff880221cffc88
        NILFS [nilfs_segctor_do_construct]:2367 nilfs_segctor_update_segusage
        NILFS [nilfs_segctor_do_construct]:2371 nilfs_segctor_prepare_write
        NILFS [nilfs_segctor_do_construct]:2376 nilfs_add_checksums_on_logs
        NILFS [nilfs_segctor_do_construct]:2381 nilfs_segctor_write
        NILFS [nilfs_segbuf_submit_bh]:575 bh->b_count 2, bh->b_page ffffea000709b000, page->index 0, i_ino 1033103, i_size 25165824
        NILFS [nilfs_segbuf_submit_bh]:576 segbuf->sb_segnum 6785
        NILFS [nilfs_segbuf_submit_bh]:577 bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880222cc7ee8
        NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111165440, segbuf->sb_segnum 6785, segbuf->sb_nbio 0
        [----------] ditto
        NILFS [nilfs_segbuf_submit_bio]:464 bio->bi_sector 111177728, segbuf->sb_segnum 6785, segbuf->sb_nbio 12
      
        NILFS [nilfs_segctor_do_construct]:2399 nilfs_segctor_wait
        NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6783
        NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6784
        NILFS [nilfs_segbuf_wait]:676 segbuf->sb_segnum 6785
      
        NILFS [nilfs_segctor_complete_write]:2100 bh->b_count 0, bh->b_blocknr 13895680, bh->b_size 13897727, bh->b_page 0000000000001a82
      
        BUG: unable to handle kernel paging request at 0000000000001a82
        IP: [<ffffffffa024d0f2>] nilfs_end_page_io+0x12/0xd0 [nilfs2]
      
      Usually, for every segment we collect dirty files in list.  Then, dirty
      blocks are gathered for every dirty file, prepared for write and
      submitted by means of nilfs_segbuf_submit_bh() call.  Finally, it takes
      place complete write phase after calling nilfs_end_bio_write() on the
      block layer.  Buffers/pages are marked as not dirty on final phase and
      processed files removed from the list of dirty files.
      
      It is possible to see that we had three prepare_write and submit_bio
      phases before segbuf_wait and complete_write phase.  Moreover, segments
      compete between each other for dirty blocks because on every iteration
      of segments processing dirty buffer_heads are added in several lists of
      payload_buffers:
      
        [SEGMENT 6784]: bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880218bcdf50
        [SEGMENT 6785]: bh->b_assoc_buffers.next ffff880218a0d5f8, bh->b_assoc_buffers.prev ffff880222cc7ee8
      
      The next pointer is the same but prev pointer has changed.  It means
      that buffer_head has next pointer from one list but prev pointer from
      another.  Such modification can be made several times.  And, finally, it
      can be resulted in various issues: (1) segctor hanging, (2) segctor
      crashing, (3) file system metadata corruption.
      
      FIX:
      This patch adds:
      
      (1) setting of BH_Async_Write flag in nilfs_segctor_prepare_write()
          for every proccessed dirty block;
      
      (2) checking of BH_Async_Write flag in
          nilfs_lookup_dirty_data_buffers() and
          nilfs_lookup_dirty_node_buffers();
      
      (3) clearing of BH_Async_Write flag in nilfs_segctor_complete_write(),
          nilfs_abort_logs(), nilfs_forget_buffer(), nilfs_clear_dirty_page().
      Reported-by: default avatarJerome Poulin <jeromepoulin@gmail.com>
      Reported-by: default avatarAnton Eliasson <devel@antoneliasson.se>
      Cc: Paul Fertser <fercerpav@gmail.com>
      Cc: ARAI Shun-ichi <hermes@ceres.dti.ne.jp>
      Cc: Piotr Szymaniak <szarpaj@grubelek.pl>
      Cc: Juan Barry Manuel Canham <Linux@riotingpacifist.net>
      Cc: Zahid Chowdhury <zahid.chowdhury@starsolutions.com>
      Cc: Elmer Zhang <freeboy6716@gmail.com>
      Cc: Kenneth Langga <klangga@gmail.com>
      Signed-off-by: default avatarVyacheslav Dubeyko <slava@dubeyko.com>
      Acked-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d8974c7f
  9. 29 Aug, 2013 2 commits
  10. 24 May, 2013 1 commit
    • Ryusuke Konishi's avatar
      nilfs2: fix issue of nilfs_set_page_dirty() for page at EOF boundary · 136e8770
      Ryusuke Konishi authored
      nilfs2: fix issue of nilfs_set_page_dirty for page at EOF boundary
      
      DESCRIPTION:
       There are use-cases when NILFS2 file system (formatted with block size
      lesser than 4 KB) can be remounted in RO mode because of encountering of
      "broken bmap" issue.
      
      The issue was reported by Anthony Doggett <Anthony2486@interfaces.org.uk>:
       "The machine I've been trialling nilfs on is running Debian Testing,
        Linux version 3.2.0-4-686-pae (debian-kernel@lists.debian.org) (gcc
        version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.35-2), but I've
        also reproduced it (identically) with Debian Unstable amd64 and Debian
        Experimental (using the 3.8-trunk kernel).  The problematic partitions
        were formatted with "mkfs.nilfs2 -b 1024 -B 8192"."
      
      SYMPTOMS:
      (1) System log contains error messages likewise:
      
          [63102.496756] nilfs_direct_assign: invalid pointer: 0
          [63102.496786] NILFS error (device dm-17): nilfs_bmap_assign: broken bmap (inode number=28)
          [63102.496798]
          [63102.524403] Remounting filesystem read-only
      
      (2) The NILFS2 file system is remounted in RO mode.
      
      REPRODUSING PATH:
      (1) Create volume group with name "unencrypted" by means of vgcreate utility.
      (2) Run script (prepared by Anthony Doggett <Anthony2486@interfaces.org.uk>):
      
      ----------------[BEGIN SCRIPT]--------------------
      
      VG=unencrypted
      lvcreate --size 2G --name ntest $VG
      mkfs.nilfs2 -b 1024 -B 8192 /dev/mapper/$VG-ntest
      mkdir /var/tmp/n
      mkdir /var/tmp/n/ntest
      mount /dev/mapper/$VG-ntest /var/tmp/n/ntest
      mkdir /var/tmp/n/ntest/thedir
      cd /var/tmp/n/ntest/thedir
      sleep 2
      date
      darcs init
      sleep 2
      dmesg|tail -n 5
      date
      darcs whatsnew || true
      date
      sleep 2
      dmesg|tail -n 5
      ----------------[END SCRIPT]--------------------
      
      REPRODUCIBILITY: 100%
      
      INVESTIGATION:
      As it was discovered, the issue takes place during segment
      construction after executing such sequence of user-space operations:
      
        open("_darcs/index", O_RDWR|O_CREAT|O_NOCTTY, 0666) = 7
        fstat(7, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
        ftruncate(7, 60)
      
      The error message "NILFS error (device dm-17): nilfs_bmap_assign: broken
      bmap (inode number=28)" takes place because of trying to get block
      number for third block of the file with logical offset #3072 bytes.  As
      it is possible to see from above output, the file has 60 bytes of the
      whole size.  So, it is enough one block (1 KB in size) allocation for
      the whole file.  Trying to operate with several blocks instead of one
      takes place because of discovering several dirty buffers for this file
      in nilfs_segctor_scan_file() method.
      
      The root cause of this issue is in nilfs_set_page_dirty function which
      is called just before writing to an mmapped page.
      
      When nilfs_page_mkwrite function handles a page at EOF boundary, it
      fills hole blocks only inside EOF through __block_page_mkwrite().
      
      The __block_page_mkwrite() function calls set_page_dirty() after filling
      hole blocks, thus nilfs_set_page_dirty function (=
      a_ops->set_page_dirty) is called.  However, the current implementation
      of nilfs_set_page_dirty() wrongly marks all buffers dirty even for page
      at EOF boundary.
      
      As a result, buffers outside EOF are inconsistently marked dirty and
      queued for write even though they are not mapped with nilfs_get_block
      function.
      
      FIX:
      This modifies nilfs_set_page_dirty() not to mark hole blocks dirty.
      
      Thanks to Vyacheslav Dubeyko for his effort on analysis and proposals
      for this issue.
      Signed-off-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Reported-by: default avatarAnthony Doggett <Anthony2486@interfaces.org.uk>
      Reported-by: default avatarVyacheslav Dubeyko <slava@dubeyko.com>
      Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
      Tested-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      136e8770
  11. 08 May, 2013 1 commit
  12. 01 May, 2013 3 commits
    • Vyacheslav Dubeyko's avatar
      nilfs2: remove unneeded test in nilfs_writepage() · eb53b6db
      Vyacheslav Dubeyko authored
      page->mapping->host cannot be NULL in nilfs_writepage(), so remove the
      unneeded test.
      
      The fixes the smatch warning: "fs/nilfs2/inode.c:211 nilfs_writepage()
      error: we previously assumed 'inode' could be null (see line 195)".
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarVyacheslav Dubeyko <slava@dubeyko.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      eb53b6db
    • Vyacheslav Dubeyko's avatar
      nilfs2: fix using of PageLocked() in nilfs_clear_dirty_page() · dc33f5f3
      Vyacheslav Dubeyko authored
      Change test_bit(PG_locked, &page->flags) to PageLocked().
      Signed-off-by: default avatarVyacheslav Dubeyko <slava@dubeyko.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dc33f5f3
    • Vyacheslav Dubeyko's avatar
      nilfs2: fix issue with flush kernel thread after remount in RO mode because of... · 8c26c4e2
      Vyacheslav Dubeyko authored
      nilfs2: fix issue with flush kernel thread after remount in RO mode because of driver's internal error or metadata corruption
      
      The NILFS2 driver remounts itself in RO mode in the case of discovering
      metadata corruption (for example, discovering a broken bmap).  But
      usually, this takes place when there have been file system operations
      before remounting in RO mode.
      
      Thereby, NILFS2 driver can be in RO mode with presence of dirty pages in
      modified inodes' address spaces.  It results in flush kernel thread's
      infinite trying to flush dirty pages in RO mode.  As a result, it is
      possible to see such side effects as: (1) flush kernel thread occupies
      50% - 99% of CPU time; (2) system can't be shutdowned without manual
      power switch off.
      
      SYMPTOMS:
      (1) System log contains error message: "Remounting filesystem read-only".
      (2) The flush kernel thread occupies 50% - 99% of CPU time.
      (3) The system can't be shutdowned without manual power switch off.
      
      REPRODUCTION PATH:
      (1) Create volume group with name "unencrypted" by means of vgcreate utility.
      (2) Run script (prepared by Anthony Doggett <Anthony2486@interfaces.org.uk>):
      
        ----------------[BEGIN SCRIPT]--------------------
        #!/bin/bash
      
        VG=unencrypted
        #apt-get install nilfs-tools darcs
        lvcreate --size 2G --name ntest $VG
        mkfs.nilfs2 -b 1024 -B 8192 /dev/mapper/$VG-ntest
        mkdir /var/tmp/n
        mkdir /var/tmp/n/ntest
        mount /dev/mapper/$VG-ntest /var/tmp/n/ntest
        mkdir /var/tmp/n/ntest/thedir
        cd /var/tmp/n/ntest/thedir
        sleep 2
        date
        darcs init
        sleep 2
        dmesg|tail -n 5
        date
        darcs whatsnew || true
        date
        sleep 2
        dmesg|tail -n 5
        ----------------[END SCRIPT]--------------------
      
      (3) Try to shutdown the system.
      
      REPRODUCIBILITY: 100%
      
      FIX:
      
      This patch implements checking mount state of NILFS2 driver in
      nilfs_writepage(), nilfs_writepages() and nilfs_mdt_write_page()
      methods.  If it is detected the RO mount state then all dirty pages are
      simply discarded with warning messages is written in system log.
      
      [akpm@linux-foundation.org: fix printk warning]
      Signed-off-by: default avatarVyacheslav Dubeyko <slava@dubeyko.com>
      Acked-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Anthony Doggett <Anthony2486@interfaces.org.uk>
      Cc: ARAI Shun-ichi <hermes@ceres.dti.ne.jp>
      Cc: Piotr Szymaniak <szarpaj@grubelek.pl>
      Cc: Zahid Chowdhury <zahid.chowdhury@starsolutions.com>
      Cc: Elmer Zhang <freeboy6716@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8c26c4e2
  13. 04 Mar, 2013 1 commit
    • Eric W. Biederman's avatar
      fs: Limit sys_mount to only request filesystem modules. · 7f78e035
      Eric W. Biederman authored
      Modify the request_module to prefix the file system type with "fs-"
      and add aliases to all of the filesystems that can be built as modules
      to match.
      
      A common practice is to build all of the kernel code and leave code
      that is not commonly needed as modules, with the result that many
      users are exposed to any bug anywhere in the kernel.
      
      Looking for filesystems with a fs- prefix limits the pool of possible
      modules that can be loaded by mount to just filesystems trivially
      making things safer with no real cost.
      
      Using aliases means user space can control the policy of which
      filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
      with blacklist and alias directives.  Allowing simple, safe,
      well understood work-arounds to known problematic software.
      
      This also addresses a rare but unfortunate problem where the filesystem
      name is not the same as it's module name and module auto-loading
      would not work.  While writing this patch I saw a handful of such
      cases.  The most significant being autofs that lives in the module
      autofs4.
      
      This is relevant to user namespaces because we can reach the request
      module in get_fs_type() without having any special permissions, and
      people get uncomfortable when a user specified string (in this case
      the filesystem type) goes all of the way to request_module.
      
      After having looked at this issue I don't think there is any
      particular reason to perform any filtering or permission checks beyond
      making it clear in the module request that we want a filesystem
      module.  The common pattern in the kernel is to call request_module()
      without regards to the users permissions.  In general all a filesystem
      module does once loaded is call register_filesystem() and go to sleep.
      Which means there is not much attack surface exposed by loading a
      filesytem module unless the filesystem is mounted.  In a user
      namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
      which most filesystems do not set today.
      Acked-by: default avatarSerge Hallyn <serge.hallyn@canonical.com>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Reported-by: default avatarKees Cook <keescook@google.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      7f78e035
  14. 26 Feb, 2013 1 commit
  15. 23 Feb, 2013 1 commit
  16. 22 Feb, 2013 1 commit
    • Darrick J. Wong's avatar
      mm: only enforce stable page writes if the backing device requires it · 1d1d1a76
      Darrick J. Wong authored
      Create a helper function to check if a backing device requires stable
      page writes and, if so, performs the necessary wait.  Then, make it so
      that all points in the memory manager that handle making pages writable
      use the helper function.  This should provide stable page write support
      to most filesystems, while eliminating unnecessary waiting for devices
      that don't require the feature.
      
      Before this patchset, all filesystems would block, regardless of whether
      or not it was necessary.  ext3 would wait, but still generate occasional
      checksum errors.  The network filesystems were left to do their own
      thing, so they'd wait too.
      
      After this patchset, all the disk filesystems except ext3 and btrfs will
      wait only if the hardware requires it.  ext3 (if necessary) snapshots
      pages instead of blocking, and btrfs provides its own bdi so the mm will
      never wait.  Network filesystems haven't been touched, so either they
      provide their own stable page guarantees or they don't block at all.
      The blocking behavior is back to what it was before 3.0 if you don't
      have a disk requiring stable page writes.
      
      Here's the result of using dbench to test latency on ext2:
      
      3.8.0-rc3:
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       WriteX        109347     0.028    59.817
       ReadX         347180     0.004     3.391
       Flush          15514    29.828   287.283
      
      Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms
      
      3.8.0-rc3 + patches:
       WriteX        105556     0.029     4.273
       ReadX         335004     0.005     4.112
       Flush          14982    30.540   298.634
      
      Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms
      
      As you can see, the maximum write latency drops considerably with this
      patch enabled.  The other filesystems (ext3/ext4/xfs/btrfs) behave
      similarly, but see the cover letter for those results.
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Acked-by: default avatarSteven Whitehouse <swhiteho@redhat.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1d1d1a76
  17. 05 Feb, 2013 1 commit
    • Vyacheslav Dubeyko's avatar
      nilfs2: fix fix very long mount time issue · a9bae189
      Vyacheslav Dubeyko authored
      There exists a situation when GC can work in background alone without
      any other filesystem activity during significant time.
      
      The nilfs_clean_segments() method calls nilfs_segctor_construct() that
      updates superblocks in the case of NILFS_SC_SUPER_ROOT and
      THE_NILFS_DISCONTINUED flags are set.  But when GC is working alone the
      nilfs_clean_segments() is called with unset THE_NILFS_DISCONTINUED flag.
      As a result, the update of superblocks doesn't occurred all this time
      and in the case of SPOR superblocks keep very old values of last super
      root placement.
      
      SYMPTOMS:
      
      Trying to mount a NILFS2 volume after SPOR in such environment ends with
      very long mounting time (it can achieve about several hours in some
      cases).
      
      REPRODUCING PATH:
      
      1. It needs to use external USB HDD, disable automount and doesn't
         make any additional filesystem activity on the NILFS2 volume.
      
      2. Generate temporary file with size about 100 - 500 GB (for example,
         dd if=/dev/zero of=<file_name> bs=1073741824 count=200).  The size of
         file defines duration of GC working.
      
      3. Then it needs to delete file.
      
      4. Start GC manually by means of command "nilfs-clean -p 0".  When you
         start GC by means of such way then, at the end, superblocks is updated
         by once.  So, for simulation of SPOR, it needs to wait sometime (15 -
         40 minutes) and simply switch off USB HDD manually.
      
      5. Switch on USB HDD again and try to mount NILFS2 volume.  As a
         result, NILFS2 volume will mount during very long time.
      
      REPRODUCIBILITY: 100%
      
      FIX:
      
      This patch adds checking that superblocks need to update and set
      THE_NILFS_DISCONTINUED flag before nilfs_clean_segments() call.
      Reported-by: default avatarSergey Alexandrov <splavgm@gmail.com>
      Signed-off-by: default avatarVyacheslav Dubeyko <slava@dubeyko.com>
      Tested-by: default avatarVyacheslav Dubeyko <slava@dubeyko.com>
      Acked-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Tested-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a9bae189
  18. 11 Jan, 2013 1 commit
  19. 20 Dec, 2012 1 commit
  20. 12 Dec, 2012 1 commit
    • Rafael Aquini's avatar
      mm: redefine address_space.assoc_mapping · 252aa6f5
      Rafael Aquini authored
      Overhaul struct address_space.assoc_mapping renaming it to
      address_space.private_data and its type is redefined to void*.  By this
      approach we consistently name the .private_* elements from struct
      address_space as well as allow extended usage for address_space
      association with other data structures through ->private_data.
      
      Also, all users of old ->assoc_mapping element are converted to reflect
      its new name and type change (->private_data).
      Signed-off-by: default avatarRafael Aquini <aquini@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      252aa6f5
  21. 09 Oct, 2012 1 commit
    • Konstantin Khlebnikov's avatar
      mm: kill vma flag VM_CAN_NONLINEAR · 0b173bc4
      Konstantin Khlebnikov authored
      Move actual pte filling for non-linear file mappings into the new special
      vma operation: ->remap_pages().
      
      Filesystems must implement this method to get non-linear mapping support,
      if it uses filemap_fault() then generic_file_remap_pages() can be used.
      
      Now device drivers can implement this method and obtain nonlinear vma support.
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>	#arch/tile
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0b173bc4
  22. 03 Oct, 2012 1 commit
  23. 01 Oct, 2012 1 commit
    • Theodore Ts'o's avatar
      ext4: fix mtime update in nodelalloc mode · 041bbb6d
      Theodore Ts'o authored
      Commits 5e8830dc and 41c4d25f introduced a regression into
      v3.6-rc1 for ext4 in nodealloc mode, such that mtime updates would not
      take place for files modified via mmap if the page was already in the
      page cache.  This would also affect ext3 file systems mounted using
      the ext4 file system driver.
      
      The problem was that ext4_page_mkwrite() had a shortcut which would
      avoid calling __block_page_mkwrite() under some circumstances, and the
      above two commit transferred the responsibility of calling
      file_update_time() to __block_page_mkwrite --- which woudln't get
      called in some circumstances.
      
      Since __block_page_mkwrite() only has three callers,
      block_page_mkwrite(), ext4_page_mkwrite, and nilfs_page_mkwrite(), the
      best way to solve this is to move the responsibility for calling
      file_update_time() to its caller.
      
      This problem was found via xfstests #215 with a file system mounted
      with -o nodelalloc.
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: stable@vger.kernel.org
      041bbb6d
  24. 21 Sep, 2012 1 commit
  25. 04 Aug, 2012 1 commit
  26. 31 Jul, 2012 6 commits
  27. 14 Jul, 2012 3 commits
  28. 20 Jun, 2012 1 commit
    • Ryusuke Konishi's avatar
      nilfs2: ensure proper cache clearing for gc-inodes · fbb24a3a
      Ryusuke Konishi authored
      A gc-inode is a pseudo inode used to buffer the blocks to be moved by
      garbage collection.
      
      Block caches of gc-inodes must be cleared every time a garbage collection
      function (nilfs_clean_segments) completes.  Otherwise, stale blocks
      buffered in the caches may be wrongly reused in successive calls of the GC
      function.
      
      For user files, this is not a problem because their gc-inodes are
      distinguished by a checkpoint number as well as an inode number.  They
      never buffer different blocks if either an inode number, a checkpoint
      number, or a block offset differs.
      
      However, gc-inodes of sufile, cpfile and DAT file can store different data
      for the same block offset.  Thus, the nilfs_clean_segments function can
      move incorrect block for these meta-data files if an old block is cached.
      I found this is really causing meta-data corruption in nilfs.
      
      This fixes the issue by ensuring cache clear of gc-inodes and resolves
      reported GC problems including checkpoint file corruption, b-tree
      corruption, and the following warning during GC.
      
        nilfs_palloc_freev: entry number 307234 already freed.
        ...
      Signed-off-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Tested-by: default avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: <stable@vger.kernel.org>	[2.6.37+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fbb24a3a
  29. 01 Jun, 2012 1 commit
  30. 30 May, 2012 1 commit