1. 29 Jun, 2015 1 commit
    • Ben Hutchings's avatar
      pipe: iovec: Fix memory corruption when retrying atomic copy as non-atomic · 14f81062
      Ben Hutchings authored
      pipe_iov_copy_{from,to}_user() may be tried twice with the same iovec,
      the first time atomically and the second time not.  The second attempt
      needs to continue from the iovec position, pipe buffer offset and
      remaining length where the first attempt failed, but currently the
      pipe buffer offset and remaining length are reset.  This will corrupt
      the piped data (possibly also leading to an information leak between
      processes) and may also corrupt kernel memory.
      This was fixed upstream by commits f0d1bec9d58d ("new helper:
      copy_page_from_iter()") and 637b58c2887e ("switch pipe_read() to
      copy_page_to_iter()"), but those aren't suitable for stable.  This fix
      for older kernel versions was made by Seth Jennings for RHEL and I
      have extracted it from their update.
      References: https://bugzilla.redhat.com/show_bug.cgi?id=1202855Signed-off-by: 's avatarBen Hutchings <ben@decadent.org.uk>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 25 Jun, 2015 2 commits
  3. 22 Jun, 2015 2 commits
  4. 06 Jun, 2015 6 commits
  5. 17 May, 2015 2 commits
  6. 13 May, 2015 1 commit
    • Lukas Czerner's avatar
      ext4: fix data corruption caused by unwritten and delayed extents · 5a7d1e16
      Lukas Czerner authored
      commit d2dc317d564a46dfc683978a2e5a4f91434e9711 upstream.
      Currently it is possible to lose whole file system block worth of data
      when we hit the specific interaction with unwritten and delayed extents
      in status extent tree.
      The problem is that when we insert delayed extent into extent status
      tree the only way to get rid of it is when we write out delayed buffer.
      However there is a limitation in the extent status tree implementation
      so that when inserting unwritten extent should there be even a single
      delayed block the whole unwritten extent would be marked as delayed.
      At this point, there is no way to get rid of the delayed extents,
      because there are no delayed buffers to write out. So when a we write
      into said unwritten extent we will convert it to written, but it still
      remains delayed.
      When we try to write into that block later ext4_da_map_blocks() will set
      the buffer new and delayed and map it to invalid block which causes
      the rest of the block to be zeroed loosing already written data.
      For now we can fix this by simply not allowing to set delayed status on
      written extent in the extent status tree. Also add WARN_ON() to make
      sure that we notice if this happens in the future.
      This problem can be easily reproduced by running the following xfs_io.
      xfs_io -f -c "pwrite -S 0xaa 4096 2048" \
                -c "falloc 0 131072" \
                -c "pwrite -S 0xbb 65536 2048" \
                -c "fsync" /mnt/test/fff
      echo 3 > /proc/sys/vm/drop_caches
      xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff
      This can be theoretically also reproduced by at random by running fsx,
      but it's not very reliable, though on machines with bigger page size
      (like ppc) this can be seen more often (especially xfstest generic/127)
      Signed-off-by: 's avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: 's avatarTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  7. 06 May, 2015 5 commits
    • Al Viro's avatar
      RCU pathwalk breakage when running into a symlink overmounting something · 3b388f33
      Al Viro authored
      commit 3cab989afd8d8d1bc3d99fef0e7ed87c31e7b647 upstream.
      Calling unlazy_walk() in walk_component() and do_last() when we find
      a symlink that needs to be followed doesn't acquire a reference to vfsmount.
      That's fine when the symlink is on the same vfsmount as the parent directory
      (which is almost always the case), but it's not always true - one _can_
      manage to bind a symlink on top of something.  And in such cases we end up
      with excessive mntput().
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Lukas Czerner's avatar
      ext4: make fsync to sync parent dir in no-journal for real this time · 72a2c039
      Lukas Czerner authored
      commit e12fb97222fc41e8442896934f76d39ef99b590a upstream.
      Previously commit 14ece102 added a
      support for for syncing parent directory of newly created inodes to
      make sure that the inode is not lost after a power failure in
      no-journal mode.
      However this does not work in majority of cases, namely:
       - if the directory has inline data
       - if the directory is already indexed
       - if the directory already has at least one block and:
      	- the new entry fits into it
      	- or we've successfully converted it to indexed
      So in those cases we might lose the inode entirely even after fsync in
      the no-journal mode. This also includes ext2 default mode obviously.
      I've noticed this while running xfstest generic/321 and even though the
      test should fail (we need to run fsck after a crash in no-journal mode)
      I could not find a newly created entries even when if it was fsynced
      Fix this by adjusting the ext4_add_entry() successful exit paths to set
      the inode EXT4_STATE_NEWENTRY so that fsync has the chance to fsync the
      parent directory as well.
      Signed-off-by: 's avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: 's avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: 's avatarJan Kara <jack@suse.cz>
      Cc: Frank Mayhar <fmayhar@google.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Michael Davidson's avatar
      fs/binfmt_elf.c: fix bug in loading of PIE binaries · 00c65c8a
      Michael Davidson authored
      commit a87938b2e246b81b4fb713edb371a9fa3c5c3c86 upstream.
      With CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE enabled, and a normal top-down
      address allocation strategy, load_elf_binary() will attempt to map a PIE
      binary into an address range immediately below mm->mmap_base.
      Unfortunately, load_elf_ binary() does not take account of the need to
      allocate sufficient space for the entire binary which means that, while
      the first PT_LOAD segment is mapped below mm->mmap_base, the subsequent
      PT_LOAD segment(s) end up being mapped above mm->mmap_base into the are
      that is supposed to be the "gap" between the stack and the binary.
      Since the size of the "gap" on x86_64 is only guaranteed to be 128MB this
      means that binaries with large data segments > 128MB can end up mapping
      part of their data segment over their stack resulting in corruption of the
      stack (and the data segment once the binary starts to run).
      Any PIE binary with a data segment > 128MB is vulnerable to this although
      address randomization means that the actual gap between the stack and the
      end of the binary is normally greater than 128MB.  The larger the data
      segment of the binary the higher the probability of failure.
      Fix this by calculating the total size of the binary in the same way as
      Signed-off-by: 's avatarMichael Davidson <md@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Filipe Manana's avatar
      Btrfs: fix inode eviction infinite loop after cloning into it · 6073c416
      Filipe Manana authored
      commit ccccf3d67294714af2d72a6fd6fd7d73b01c9329 upstream.
      If we attempt to clone a 0 length region into a file we can end up
      inserting a range in the inode's extent_io tree with a start offset
      that is greater then the end offset, which triggers immediately the
      following warning:
      [ 3914.619057] WARNING: CPU: 17 PID: 4199 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
      [ 3914.620886] BTRFS: end < start 4095 4096
      [ 3914.638093] Call Trace:
      [ 3914.638636]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
      [ 3914.639620]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
      [ 3914.640789]  [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs]
      [ 3914.642041]  [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48
      [ 3914.643236]  [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs]
      [ 3914.644441]  [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs]
      [ 3914.645711]  [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs]
      [ 3914.646914]  [<ffffffff8142b2fb>] ? _raw_spin_unlock+0x28/0x33
      [ 3914.648058]  [<ffffffffa03cbac4>] ? test_range_bit+0xcc/0xde [btrfs]
      [ 3914.650105]  [<ffffffffa03cb3c3>] lock_extent+0x13/0x15 [btrfs]
      [ 3914.651361]  [<ffffffffa03db39e>] lock_extent_range+0x3d/0xcd [btrfs]
      [ 3914.652761]  [<ffffffffa03de1fe>] btrfs_ioctl_clone+0x278/0x388 [btrfs]
      [ 3914.654128]  [<ffffffff811226dd>] ? might_fault+0x58/0xb5
      [ 3914.655320]  [<ffffffffa03e0909>] btrfs_ioctl+0xb51/0x2195 [btrfs]
      [ 3914.669271] ---[ end trace 14843d3e2e622fc1 ]---
      This later makes the inode eviction handler enter an infinite loop that
      keeps dumping the following warning over and over:
      [ 3915.117629] WARNING: CPU: 22 PID: 4228 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
      [ 3915.119913] BTRFS: end < start 4095 4096
      [ 3915.137394] Call Trace:
      [ 3915.137913]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
      [ 3915.139154]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
      [ 3915.140316]  [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs]
      [ 3915.141505]  [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48
      [ 3915.142709]  [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs]
      [ 3915.143849]  [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs]
      [ 3915.145120]  [<ffffffffa038c1e3>] ? btrfs_kill_super+0x17/0x23 [btrfs]
      [ 3915.146352]  [<ffffffff811548f6>] ? deactivate_locked_super+0x3b/0x50
      [ 3915.147565]  [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs]
      [ 3915.148785]  [<ffffffff8142b7e2>] ? _raw_write_unlock+0x28/0x33
      [ 3915.149931]  [<ffffffffa03bc325>] btrfs_evict_inode+0x196/0x482 [btrfs]
      [ 3915.151154]  [<ffffffff81168904>] evict+0xa0/0x148
      [ 3915.152094]  [<ffffffff811689e5>] dispose_list+0x39/0x43
      [ 3915.153081]  [<ffffffff81169564>] evict_inodes+0xdc/0xeb
      [ 3915.154062]  [<ffffffff81154418>] generic_shutdown_super+0x49/0xef
      [ 3915.155193]  [<ffffffff811546d1>] kill_anon_super+0x13/0x1e
      [ 3915.156274]  [<ffffffffa038c1e3>] btrfs_kill_super+0x17/0x23 [btrfs]
      [ 3915.167404] ---[ end trace 14843d3e2e622fc2 ]---
      So just bail out of the clone ioctl if the length of the region to clone
      is zero, without locking any extent range, in order to prevent this issue
      (same behaviour as a pwrite with a 0 length for example).
      This is trivial to reproduce. For example, the steps for the test I just
      made for fstests:
        mkfs.btrfs -f SCRATCH_DEV
        touch $SCRATCH_MNT/foo
        touch $SCRATCH_MNT/bar
        $CLONER_PROG -s 0 -d 4096 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar
        umount $SCRATCH_MNT
      A test case for fstests follows soon.
      Signed-off-by: 's avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: 's avatarOmar Sandoval <osandov@osandov.com>
      Signed-off-by: 's avatarChris Mason <clm@fb.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Filipe Manana's avatar
      Btrfs: fix log tree corruption when fs mounted with -o discard · bf339141
      Filipe Manana authored
      commit dcc82f4783ad91d4ab654f89f37ae9291cdc846a upstream.
      While committing a transaction we free the log roots before we write the
      new super block. Freeing the log roots implies marking the disk location
      of every node/leaf (metadata extent) as pinned before the new super block
      is written. This is to prevent the disk location of log metadata extents
      from being reused before the new super block is written, otherwise we
      would have a corrupted log tree if before the new super block is written
      a crash/reboot happens and the location of any log tree metadata extent
      ended up being reused and rewritten.
      Even though we pinned the log tree's metadata extents, we were issuing a
      discard against them if the fs was mounted with the -o discard option,
      resulting in corruption of the log tree if a crash/reboot happened before
      writing the new super block - the next time the fs was mounted, during
      the log replay process we would find nodes/leafs of the log btree with
      a content full of zeroes, causing the process to fail and require the
      use of the tool btrfs-zero-log to wipeout the log tree (and all data
      previously fsynced becoming lost forever).
      Fix this by not doing a discard when pinning an extent. The discard will
      be done later when it's safe (after the new super block is committed) at
      Fixes: e688b725 (Btrfs: fix extent pinning bugs in the tree log)
      Signed-off-by: 's avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: 's avatarChris Mason <clm@fb.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  8. 29 Apr, 2015 7 commits
  9. 19 Apr, 2015 3 commits
  10. 13 Apr, 2015 1 commit
    • Sergei Antonov's avatar
      hfsplus: fix B-tree corruption after insertion at position 0 · 23b443f7
      Sergei Antonov authored
      commit 98cf21c61a7f5419d82f847c4d77bf6e96a76f5f upstream.
      Fix B-tree corruption when a new record is inserted at position 0 in the
      node in hfs_brec_insert().  In this case a hfs_brec_update_parent() is
      called to update the parent index node (if exists) and it is passed
      hfs_find_data with a search_key containing a newly inserted key instead
      of the key to be updated.  This results in an inconsistent index node.
      The bug reproduces on my machine after an extents overflow record for
      the catalog file (CNID=4) is inserted into the extents overflow B-tree.
      Because of a low (reserved) value of CNID=4, it has to become the first
      record in the first leaf node.
      The resulting first leaf node is correct:
        | key0.CNID=4 | key1.CNID=123 | key2.CNID=456, ... |
      But the parent index key0 still contains the previous key CNID=123:
        | key0.CNID=123 | ... |
      A change in hfs_brec_insert() makes hfs_brec_update_parent() work
      correctly by preventing it from getting fd->record=-1 value from
      Along the way, I removed duplicate code with unification of the if
      condition.  The resulting code is equivalent to the original code
      because node is never 0.
      Also hfs_brec_update_parent() will now return an error after getting a
      negative fd->record value.  However, the return value of
      hfs_brec_update_parent() is not checked anywhere in the file and I'm
      leaving it unchanged by this patch.  brec.c lacks error checking after
      some other calls too, but this issue is of less importance than the one
      being fixed by this patch.
      Signed-off-by: 's avatarSergei Antonov <saproj@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Reviewed-by: 's avatarVyacheslav Dubeyko <slava@dubeyko.com>
      Acked-by: 's avatarHin-Tak Leung <htl10@users.sourceforge.net>
      Cc: Anton Altaparmakov <aia21@cam.ac.uk>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  11. 02 Apr, 2015 1 commit
  12. 26 Mar, 2015 3 commits
    • Ryusuke Konishi's avatar
      nilfs2: fix deadlock of segment constructor during recovery · f37b5d42
      Ryusuke Konishi authored
      commit 283ee1482f349d6c0c09dfb725db5880afc56813 upstream.
      According to a report from Yuxuan Shui, nilfs2 in kernel 3.19 got stuck
      during recovery at mount time.  The code path that caused the deadlock was
      as follows:
              * Do roll-forwarding, attach segment constructor for recovery,
                and kick it.
                 * A lock is held with nilfs_transaction_lock()
                                       nilfs_transaction_lock() --> deadlock
      This can happen if commit 7ef3ff2fea8b ("nilfs2: fix deadlock of segment
      constructor over I_SYNC flag") is applied and roll-forward recovery was
      performed at mount time.  The roll-forward recovery can happen if datasync
      write is done and the file system crashes immediately after that.  For
      instance, we can reproduce the issue with the following steps:
       < nilfs2 is mounted on /nilfs (device: /dev/sdb1) >
       # dd if=/dev/zero of=/nilfs/test bs=4k count=1 && sync
       # dd if=/dev/zero of=/nilfs/test conv=notrunc oflag=dsync bs=4k
       count=1 && reboot -nfh
       < the system will immediately reboot >
       # mount -t nilfs2 /dev/sdb1 /nilfs
      The deadlock occurs because iput() can run segment constructor through
      writeback_single_inode() if MS_ACTIVE flag is not set on sb->s_flags.  The
      above commit changed segment constructor so that it calls iput()
      asynchronously for inodes with i_nlink == 0, but that change was
      This fixes the another deadlock by deferring iput() in segment constructor
      even for the case that mount is not finished, that is, for the case that
      MS_ACTIVE flag is not set.
      Signed-off-by: 's avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Reported-by: 's avatarYuxuan Shui <yshuiv7@gmail.com>
      Tested-by: 's avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Miklos Szeredi's avatar
      fuse: notify: don't move pages · 12bda52e
      Miklos Szeredi authored
      commit 0d2783626a53d4c922f82d51fa675cb5d13f0d36 upstream.
      fuse_try_move_page() is not prepared for replacing pages that have already
      been read.
      Reported-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: 's avatarMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Miklos Szeredi's avatar
      fuse: set stolen page uptodate · ce6f8834
      Miklos Szeredi authored
      commit aa991b3b267e24f578bac7b09cc57579b660304b upstream.
      Regular pipe buffers' ->steal method (generic_pipe_buf_steal()) doesn't set
      Don't warn on this condition, just set the uptodate flag.
      Signed-off-by: 's avatarMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
  13. 18 Mar, 2015 6 commits
    • Ryusuke Konishi's avatar
      nilfs2: fix potential memory overrun on inode · b6b14e98
      Ryusuke Konishi authored
      commit 957ed60b53b519064a54988c4e31e0087e47d091 upstream.
      Each inode of nilfs2 stores a root node of a b-tree, and it turned out to
      have a memory overrun issue:
      Each b-tree node of nilfs2 stores a set of key-value pairs and the number
      of them (in "bn_nchildren" member of nilfs_btree_node struct), as well as
      a few other "bn_*" members.
      Since the value of "bn_nchildren" is used for operations on the key-values
      within the b-tree node, it can cause memory access overrun if a large
      number is incorrectly set to "bn_nchildren".
      For instance, nilfs_btree_node_lookup() function determines the range of
      binary search with it, and too large "bn_nchildren" leads
      nilfs_btree_node_get_key() in that function to overrun.
      As for intermediate b-tree nodes, this is prevented by a sanity check
      performed when each node is read from a drive, however, no sanity check
      has been done for root nodes stored in inodes.
      This patch fixes the issue by adding missing sanity check against b-tree
      root nodes so that it's called when on-memory inodes are read from ifile,
      inode metadata file.
      Signed-off-by: 's avatarRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Al Viro's avatar
      procfs: fix race between symlink removals and traversals · cf6c05a7
      Al Viro authored
      commit 7e0e953bb0cf649f93277ac8fb67ecbb7f7b04a9 upstream.
      use_pde()/unuse_pde() in ->follow_link()/->put_link() resp.
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Al Viro's avatar
      debugfs: leave freeing a symlink body until inode eviction · db32c774
      Al Viro authored
      commit 0db59e59299f0b67450c5db21f7f316c8fb04e84 upstream.
      As it is, we have debugfs_remove() racing with symlink traversals.
      Supply ->evict_inode() and do freeing there - inode will remain
      pinned until we are done with the symlink body.
      And rip the idiocy with checking if dentry is positive right after
      we'd verified debugfs_positive(), which is a stronger check...
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Al Viro's avatar
      autofs4 copy_dev_ioctl(): keep the value of ->size we'd used for allocation · d91c5de5
      Al Viro authored
      commit 0a280962dc6e117e0e4baa668453f753579265d9 upstream.
      X-Coverup: just ask spender
      Signed-off-by: 's avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Quentin Casasnovas's avatar
      Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref. · edf2ec99
      Quentin Casasnovas authored
      commit dd9ef135e3542ffc621c4eb7f0091870ec7a1504 upstream.
      Improper arithmetics when calculting the address of the extended ref could
      lead to an out of bounds memory read and kernel panic.
      Signed-off-by: 's avatarQuentin Casasnovas <quentin.casasnovas@oracle.com>
      Reviewed-by: 's avatarDavid Sterba <dsterba@suse.cz>
      Signed-off-by: 's avatarChris Mason <clm@fb.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Filipe Manana's avatar
      Btrfs: fix data loss in the fast fsync path · fa41700e
      Filipe Manana authored
      commit 3a8b36f378060d20062a0918e99fae39ff077bf0 upstream.
      When using the fast file fsync code path we can miss the fact that new
      writes happened since the last file fsync and therefore return without
      waiting for the IO to finish and write the new extents to the fsync log.
      Here's an example scenario where the fsync will miss the fact that new
      file data exists that wasn't yet durably persisted:
      1. fs_info->last_trans_committed == N - 1 and current transaction is
         transaction N (fs_info->generation == N);
      2. do a buffered write;
      3. fsync our inode, this clears our inode's full sync flag, starts
         an ordered extent and waits for it to complete - when it completes
         at btrfs_finish_ordered_io(), the inode's last_trans is set to the
         value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
      4. transaction N is committed, so fs_info->last_trans_committed is now
         set to the value N and fs_info->generation remains with the value N;
      5. do another buffered write, when this happens btrfs_file_write_iter
         sets our inode's last_trans to the value N + 1 (that is
         fs_info->generation + 1 == N + 1);
      6. transaction N + 1 is started and fs_info->generation now has the
         value N + 1;
      7. transaction N + 1 is committed, so fs_info->last_trans_committed
         is set to the value N + 1;
      8. fsync our inode - because it doesn't have the full sync flag set,
         we only start the ordered extent, we don't wait for it to complete
         (only in a later phase) therefore its last_trans field has the
         value N + 1 set previously by btrfs_file_write_iter(), and so we
             inode->last_trans <= fs_info->last_trans_committed
                 (N + 1)              (N + 1)
         Which made us not log the last buffered write and exit the fsync
         handler immediately, returning success (0) to user space and resulting
         in data loss after a crash.
      This can actually be triggered deterministically and the following excerpt
      from a testcase I made for xfstests triggers the issue. It moves a dummy
      file across directories and then fsyncs the old parent directory - this
      is just to trigger a transaction commit, so moving files around isn't
      directly related to the issue but it was chosen because running 'sync' for
      example does more than just committing the current transaction, as it
      flushes/waits for all file data to be persisted. The issue can also happen
      at random periods, since the transaction kthread periodicaly commits the
      current transaction (about every 30 seconds by default).
      The body of the test is:
        _scratch_mkfs >> $seqres.full 2>&1
        # Create our main test file 'foo', the one we check for data loss.
        # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
        # bit from its flags (btrfs inode specific flags).
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
                        -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io
        # Now create one other file and 2 directories. We will move this second file
        # from one directory to the other later because it forces btrfs to commit its
        # currently open transaction if we fsync the old parent directory. This is
        # necessary to trigger the data loss bug that affected btrfs.
        mkdir $SCRATCH_MNT/testdir_1
        touch $SCRATCH_MNT/testdir_1/bar
        mkdir $SCRATCH_MNT/testdir_2
        # Make sure everything is durably persisted.
        # Write more 8Kb of data to our file.
        $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io
        # Move our 'bar' file into a new directory.
        mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar
        # Fsync our first directory. Because it had a file moved into some other
        # directory, this made btrfs commit the currently open transaction. This is
        # a condition necessary to trigger the data loss bug.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1
        # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
        # data we wrote previously to be persisted and available if a crash happens.
        # This did not happen with btrfs, because of the transaction commit that
        # happened when we fsynced the parent directory.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        # Now check that all data we wrote before are available.
        echo "File content after log replay:"
        od -t x1 $SCRATCH_MNT/foo
      The expected golden output for the test, which is what we get with this
      fix applied (or when running against ext3/4 and xfs), is:
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 8192/8192 bytes at offset 8192
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
      Without this fix applied, the output shows the test file does not have
      the second 8Kb extent that we successfully fsynced:
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 8192/8192 bytes at offset 8192
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
      So fix this by skipping the fsync only if we're doing a full sync and
      if the inode's last_trans is <= fs_info->last_trans_committed, or if
      the inode is already in the log. Also remove setting the inode's
      last_trans in btrfs_file_write_iter since it's useless/unreliable.
      Also because btrfs_file_write_iter no longer sets inode->last_trans to
      fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
      bail out if last_trans is 0, otherwise something as simple as the following
      example wouldn't log the second write on the last fsync:
        1. write to file
        2. fsync file
        3. fsync file
             |--> btrfs_inode_in_log() returns true and it set last_trans to 0
        4. write to file
             |--> btrfs_file_write_iter() no longers sets last_trans, so it
                  remained with a value of 0
        5. fsync
             |--> inode->last_trans == 0, so it bails out without logging the
                  second write
      A test case for xfstests will be sent soon.
      Signed-off-by: 's avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: 's avatarChris Mason <clm@fb.com>
      Signed-off-by: 's avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>