Skip to content
  • Tejun Heo's avatar
    writeback: fix a subtle race condition in I_DIRTY clearing · 21fe2674
    Tejun Heo authored
    
    
    commit 9c6ac78eb3521c5937b2dd8a7d1b300f41092f45 upstream.
    
    After invoking ->dirty_inode(), __mark_inode_dirty() does smp_mb() and
    tests inode->i_state locklessly to see whether it already has all the
    necessary I_DIRTY bits set.  The comment above the barrier doesn't
    contain any useful information - memory barriers can't ensure "changes
    are seen by all cpus" by itself.
    
    And it sure enough was broken.  Please consider the following
    scenario.
    
     CPU 0					CPU 1
     -------------------------------------------------------------------------------
    
    					enters __writeback_single_inode()
    					grabs inode->i_lock
    					tests PAGECACHE_TAG_DIRTY which is clear
     enters __set_page_dirty()
     grabs mapping->tree_lock
     sets PAGECACHE_TAG_DIRTY
     releases mapping->tree_lock
     leaves __set_page_dirty()
    
     enters __mark_inode_dirty()
     smp_mb()
     sees I_DIRTY_PAGES set
     leaves __mark_inode_dirty()
    					clears I_DIRTY_PAGES
    					releases inode->i_lock
    
    Now @inode has dirty pages w/ I_DIRTY_PAGES clear.  This doesn't seem
    to lead to an immediately critical problem because requeue_inode()
    later checks PAGECACHE_TAG_DIRTY instead of I_DIRTY_PAGES when
    deciding whether the inode needs to be requeued for IO and there are
    enough unintentional memory barriers inbetween, so while the inode
    ends up with inconsistent I_DIRTY_PAGES flag, it doesn't fall off the
    IO list.
    
    The lack of explicit barrier may also theoretically affect the other
    I_DIRTY bits which deal with metadata dirtiness.  There is no
    guarantee that a strong enough barrier exists between
    I_DIRTY_[DATA]SYNC clearing and write_inode() writing out the dirtied
    inode.  Filesystem inode writeout path likely has enough stuff which
    can behave as full barrier but it's theoretically possible that the
    writeout may not see all the updates from ->dirty_inode().
    
    Fix it by adding an explicit smp_mb() after I_DIRTY clearing.  Note
    that I_DIRTY_PAGES needs a special treatment as it always needs to be
    cleared to be interlocked with the lockless test on
    __mark_inode_dirty() side.  It's cleared unconditionally and
    reinstated after smp_mb() if the mapping still has dirty pages.
    
    Also add comments explaining how and why the barriers are paired.
    
    Lightly tested.
    
    Signed-off-by: default avatarTejun Heo <tj@kernel.org>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Mikulas Patocka <mpatocka@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Reviewed-by: default avatarJan Kara <jack@suse.cz>
    Signed-off-by: default avatarJens Axboe <axboe@fb.com>
    Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
    21fe2674