1. 20 Jun, 2013 1 commit
  2. 09 Apr, 2013 1 commit
  3. 21 Mar, 2013 1 commit
  4. 02 Mar, 2013 1 commit
  5. 27 Nov, 2012 1 commit
    • writeback: put unused inodes to LRU after writeback completion · 4eff96dd
      Jan Kara authored
      Commit 169ebd90 ("writeback: Avoid iput() from flusher thread")
      removed iget-iput pair from inode writeback.  As a side effect, inodes
      that are dirty during the iput_final() call will never be added to the inode
      LRU (iput_final() doesn't add dirty inodes to the LRU, and later, when the
      inode is cleaned, there is no one to add it there).  Thus inodes are
      effectively unreclaimable until someone looks them up again.
      
      The practical effect of this bug is limited by the fact that inodes are
      pinned by a dentry for long enough that the inode gets cleaned.  But
      still the bug can have nasty consequences leading up to OOM conditions
      under certain circumstances.  The following can easily reproduce the
      problem:
      
        for (( i = 0; i < 1000; i++ )); do
          mkdir $i
          for (( j = 0; j < 1000; j++ )); do
            touch $i/$j
            echo 2 > /proc/sys/vm/drop_caches
          done
        done
      
      then one needs to run 'sync; ls -lR' to make inodes reclaimable again.
      
      We fix the issue by inserting unused clean inodes into the LRU after
      writeback finishes in inode_sync_complete().
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reported-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: <stable@vger.kernel.org>		[3.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
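The behaviour described in this commit message can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct toy_inode`, `toy_iput_final()` and `toy_sync_complete()` are hypothetical stand-ins, not the kernel's data structures or functions, and locking is ignored entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the inode fields that matter here; not the kernel's struct inode. */
struct toy_inode {
    int  refcount;  /* active references */
    bool dirty;     /* still waiting for writeback */
    bool on_lru;    /* on the LRU, so it can be reclaimed */
};

/* Last reference dropped: only a clean inode is put on the LRU. */
static void toy_iput_final(struct toy_inode *inode)
{
    assert(inode->refcount == 1);
    inode->refcount = 0;
    if (!inode->dirty)
        inode->on_lru = true;
}

/*
 * Writeback finished. With `fixed` set, an unused and now clean inode is
 * inserted into the LRU, which is the behaviour the commit message describes.
 * With `fixed` clear, nobody ever adds it.
 */
static void toy_sync_complete(struct toy_inode *inode, bool fixed)
{
    inode->dirty = false;
    if (fixed && inode->refcount == 0)
        inode->on_lru = true;
}
```

In this model, an inode that is dirty when its last reference is dropped stays off the LRU forever unless `fixed` is set, which mirrors the "unreclaimable until someone looks them up again" symptom.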
  6. 13 Oct, 2012 1 commit
  7. 31 Jul, 2012 1 commit
  8. 14 Jul, 2012 6 commits
    • VFS: Split inode_permission() · 0bdaea90
      David Howells authored
      Split inode_permission() into inode- and superblock-dependent parts.
      
      This is aimed at unionmounts where the superblock from the upper layer has to
      be checked rather than the superblock from the lower layer as the upper layer
      may be writable, thus allowing an unwritable file from the lower layer to be
      copied up and modified.
      
      Original-author: Valerie Aurora <vaurora@redhat.com>
      Signed-off-by: David Howells <dhowells@redhat.com> (Further development)
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • kill struct opendata · 30d90494
      Al Viro authored
      Just pass struct file *.  Methods are happier that way...
      There's no need to return struct file * from finish_open() now,
      so let it return int.  Next: saner prototypes for parts in
      namei.c
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • kill opendata->{mnt,dentry} · a4a3bdd7
      Al Viro authored
      ->filp->f_path is there for purpose...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • vfs: remove open intents from nameidata · 015c3bbc
      Miklos Szeredi authored
      All users of open intents have been converted to use ->atomic_{open,create}.
      
      This patch gets rid of nd->intent.open and related infrastructure.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • vfs: add i_op->atomic_open() · d18e9008
      Miklos Szeredi authored
      Add a new inode operation which is called on the last component of an open.
      Using this the filesystem can look up, possibly create and open the file in one
      atomic operation.  If it cannot perform this (e.g. the file type turned out to
      be wrong) it may signal this by returning NULL instead of an open struct file
      pointer.
      
      i_op->atomic_open() is only called if the last component is negative or needs
      lookup.  Handling cached positive dentries here doesn't add much value: these
      can be opened using f_op->open().  If the cached file turns out to be invalid,
      the open can be retried, this time using ->atomic_open() with a fresh dentry.
      
      For now leave the old way of using open intents in lookup and revalidate in
      place.  This will be removed once all the users are converted.
      
      David Howells noticed that if ->atomic_open() opens the file but does not create
      it, handle_truncate() will be called on it even if it is not a regular file.
      Fix this by checking the file type in this case too.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • get rid of ->mnt_longterm · f7a99c5b
      Al Viro authored
      it's enough to set ->mnt_ns of internal vfsmounts to something
      distinct from all struct mnt_namespace out there; then we can
      just use the check for ->mnt_ns != NULL in the fast path of
      mntput_no_expire()
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  9. 01 Jun, 2012 1 commit
  10. 30 May, 2012 1 commit
    • brlocks/lglocks: turn into functions · eea62f83
      Andi Kleen authored
      lglocks and brlocks are currently generated with some complicated macros
      in lglock.h.  But there's no reason to not just use common utility
      functions and put all the data into a common data structure.
      
      Since there are at least two users it makes sense to share this code in a
      library.  This is also easier to maintain than a macro forest.
      
      This will also make it possible later to allocate lglocks dynamically and
      to use them in modules (both would still need some additional, but now
      straightforward, code).
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  11. 07 Jan, 2012 1 commit
    • vfs: protect remounting superblock read-only · 4ed5e82f
      Miklos Szeredi authored
      Currently remounting a superblock read-only is racy in a major way.
      
      With the per mount read-only infrastructure it is now possible to
      prevent most races, which this patch attempts.
      
      Before starting the remount read-only, iterate through all mounts
      belonging to the superblock and if none of them have any pending
      writes, set sb->s_readonly_remount.  This indicates that remount is in
      progress and no further write requests are allowed.  If the remount
      succeeds set MS_RDONLY and reset s_readonly_remount.
      
      If the remount is unsuccessful, just reset s_readonly_remount.
      This can result in transient EROFS errors, despite the fact that the
      remount failed.  Unfortunately, holding off writes is difficult as
      remount itself may touch the filesystem (e.g. through load_nls()),
      which would deadlock.
      
      A later patch deals with delayed writes due to nlink going to zero.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
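The remount protocol described in this commit message can be sketched as a single-threaded userspace model. Everything here is an assumption made for illustration: the `toy_*` names are hypothetical, the `readonly` field stands in for MS_RDONLY, the `-EBUSY` and `-EIO` return values are chosen arbitrarily, and no real locking or memory ordering is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_mount { int writers; };   /* pending write requests on this mount */

struct toy_sb {
    bool readonly;                   /* stands in for MS_RDONLY */
    bool readonly_remount;           /* stands in for sb->s_readonly_remount */
    struct toy_mount *mounts;        /* all mounts belonging to the superblock */
    size_t nr_mounts;
};

/* A new writer is refused while a remount is in progress or once read-only. */
static int toy_want_write(struct toy_sb *sb, struct toy_mount *m)
{
    assert(sb != NULL && m != NULL);
    if (sb->readonly || sb->readonly_remount)
        return -EROFS;
    m->writers++;
    return 0;
}

/* fs_remount_ok models whether the filesystem's own remount step succeeds. */
static int toy_remount_ro(struct toy_sb *sb, bool fs_remount_ok)
{
    /* Refuse to start if any mount of this superblock has pending writes. */
    for (size_t i = 0; i < sb->nr_mounts; i++)
        if (sb->mounts[i].writers > 0)
            return -EBUSY;

    sb->readonly_remount = true;     /* from here on, new writers get EROFS */
    if (fs_remount_ok)
        sb->readonly = true;
    sb->readonly_remount = false;    /* reset on success and on failure alike */
    return fs_remount_ok ? 0 : -EIO;
}
```

Because the model is single-threaded, the transient EROFS window between setting and resetting the flag cannot be observed here; in a concurrent setting a writer arriving inside that window would be refused even if the remount later fails.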
  12. 04 Jan, 2012 4 commits
  13. 20 Jul, 2011 2 commits
    • superblock: move pin_sb_for_writeback() to fs/super.c · 12ad3ab6
      Dave Chinner authored
      The per-sb shrinker has the same requirement as the writeback
      threads of ensuring that the superblock is usable and pinned for the
      time it takes to run the work. Both need to take a passive reference
      to the sb, take a read lock on the s_umount lock and then only
      continue if an unmount is not in progress.
      
      pin_sb_for_writeback() does exactly this, so move it to fs/super.c,
      rename it to grab_super_passive() and export it via fs/internal.h
      so that all the VFS code can use it.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
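The three steps named in this commit message (take a passive reference, take s_umount for read, continue only if no unmount is in progress) can be sketched as follows. This is a toy model under assumptions: the read lock is represented by plain counters rather than a real rwsem, the lock attempt is modelled as a non-blocking try, and "unmount in progress" is modelled as a missing root. All `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_sb {
    int  passive_refs;    /* passive reference count */
    bool umount_writer;   /* someone holds the umount lock for write */
    int  umount_readers;  /* holders of the umount lock for read */
    bool has_root;        /* false once an unmount has torn down the root */
};

/*
 * Returns true with a passive reference and the read lock held, or false
 * with nothing held, so the caller never has to clean up after a failure.
 */
static bool toy_grab_super_passive(struct toy_sb *sb)
{
    assert(sb != NULL);
    sb->passive_refs++;                 /* step 1: passive reference */
    if (!sb->umount_writer) {           /* step 2: try to take the lock for read */
        sb->umount_readers++;
        if (sb->has_root)               /* step 3: no unmount in progress */
            return true;
        sb->umount_readers--;           /* unlock again */
    }
    sb->passive_refs--;                 /* drop the passive reference */
    return false;
}
```

The all-or-nothing return convention is the design point worth noting: both the writeback threads and the per-sb shrinker can bail out early on `false` without tracking which step failed.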
    • Make ->d_sb assign-once and always non-NULL · a4464dbc
      Al Viro authored
      New helper (non-exported, fs/internal.h-only): __d_alloc(sb, name).
      Allocates dentry, sets its ->d_sb to given superblock and sets
      ->d_op accordingly.  Old d_alloc(NULL, name) callers are converted
      to that (all of them know what superblock they want).  d_alloc()
      itself is left only for the parent != NULL case; it uses __d_alloc() and
      inserts the result into the list of the parent's children.
      
      Note that now ->d_sb is assign-once and never NULL *and*
      ->d_parent is never NULL either.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  14. 25 Mar, 2011 2 commits
  15. 21 Mar, 2011 1 commit
  16. 18 Mar, 2011 1 commit
    • vfs: split off vfsmount-related parts of vfs_kern_mount() · 9d412a43
      Al Viro authored
      new function: mount_fs().  Does all work done by vfs_kern_mount()
      except the allocation and filling of vfsmount; returns root dentry
      or ERR_PTR().
      
      vfs_kern_mount() switched to using it and taken to fs/namespace.c,
      along with its wrappers.
      
      alloc_vfsmnt()/free_vfsmnt() made static.
      
      functions in namespace.c slightly reordered.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
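The "returns root dentry or ERR_PTR()" convention mentioned in this commit message encodes a negative errno value in the pointer itself. A minimal userspace sketch of that idiom follows; the `toy_*` helpers imitate the idea only and are not the kernel's definitions, and `toy_mount_fs()` with its `-ENODEV` failure case is a hypothetical example rather than the real function's behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer in the top page of the address space. */
static inline void *toy_err_ptr(long error)
{
    assert(error < 0 && error >= -TOY_MAX_ERRNO);
    return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
static inline long toy_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True if the pointer value lies in the range reserved for error codes. */
static inline int toy_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_dentry { const char *name; };

static struct toy_dentry toy_root = { "/" };

/* Hypothetical mount helper: returns the root dentry or an encoded error. */
static struct toy_dentry *toy_mount_fs(const char *fstype)
{
    if (fstype == NULL)
        return toy_err_ptr(-ENODEV);
    return &toy_root;
}
```

A caller checks the result with `toy_is_err()` before dereferencing it and extracts the error with `toy_ptr_err()`, so a single return value carries both the success and the failure case.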
  17. 15 Mar, 2011 1 commit
  18. 14 Mar, 2011 2 commits
    • open-style analog of vfs_path_lookup() · 73d049a4
      Al Viro authored
      new function: file_open_root(dentry, mnt, name, flags) opens the file
      vfs_path_lookup() would arrive at.
      
      Note that name can be empty; in that case the usual requirement that
      dentry should be a directory is lifted.
      
      open-coded equivalents switched to it, may_open() got down to exactly
      one caller and became static.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • switch do_filp_open() to struct open_flags · 47c805dc
      Al Viro authored
      take calculation of open_flags by open(2) arguments into new helper
      in fs/open.c, move filp_open() over there, have it and do_sys_open()
      use that helper, switch exec.c callers of do_filp_open() to explicit
      (and constant) struct open_flags.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  19. 24 Feb, 2011 1 commit
    • Fix over-zealous flush_disk when changing device size. · 93b270f7
      NeilBrown authored
      There are two cases when we call flush_disk.
      In one, the device has disappeared (check_disk_change) so any
      data we hold becomes irrelevant.
      In the other, the device has changed size (check_disk_size_change)
      so data we hold may be irrelevant.
      
      In both cases it makes sense to discard any 'clean' buffers,
      so they will be read back from the device if needed.
      
      In the former case it makes sense to discard 'dirty' buffers
      as there will never be anywhere safe to write the data.  In the
      second case it *does*not* make sense to discard dirty buffers
      as that will lead to file system corruption when you simply enlarge
      the containing devices.
      
      flush_disk calls __invalidate_device.
      __invalidate_device calls both invalidate_inodes and invalidate_bdev.
      
      invalidate_inodes *does* discard I_DIRTY inodes and this does lead
      to fs corruption.
      
      invalidate_bdev *does*not* discard dirty pages, but I don't really care
      about that at present.
      
      So this patch adds a flag to __invalidate_device (calling it
      __invalidate_device2) to indicate whether dirty buffers should be
      killed, and this is passed to invalidate_inodes which can choose to
      skip dirty inodes.
      
      flush_disk then passes true from check_disk_change and false from
      check_disk_size_change.
      
      dm avoids tripping over this problem by calling i_size_write directly
      rather than using check_disk_size_change.
      
      md does use check_disk_size_change and so is affected.
      
      This regression was introduced by commit 608aeef1 which causes
      check_disk_size_change to call flush_disk, so it is suitable for any
      kernel since 2.6.27.
      
      Cc: stable@kernel.org
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: Andrew Patterson <andrew.patterson@hp.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NeilBrown <neilb@suse.de>
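The flag described in this commit message can be illustrated with a small userspace model: the same invalidation routine is called from both paths, and a boolean decides whether dirty inodes are discarded or skipped. The `toy_*` names and the two-field inode are hypothetical; only the true/false choice per caller comes from the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_inode {
    bool dirty;   /* holds data not yet written to the device */
    bool cached;  /* still present in the in-memory cache */
};

/*
 * Drop cached inodes. Dirty inodes are dropped only when kill_dirty is set.
 * Returns the number of inodes discarded.
 */
static size_t toy_invalidate_inodes(struct toy_inode *inodes, size_t n,
                                    bool kill_dirty)
{
    size_t dropped = 0;

    assert(inodes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!inodes[i].cached)
            continue;
        if (inodes[i].dirty && !kill_dirty)
            continue;               /* keep dirty data so it can be written */
        inodes[i].cached = false;
        dropped++;
    }
    return dropped;
}

/* Device disappeared: dirty data can never be written, so discard it too. */
static size_t toy_check_disk_change(struct toy_inode *inodes, size_t n)
{
    return toy_invalidate_inodes(inodes, n, true);
}

/* Device changed size: dirty data is still valid and must be kept. */
static size_t toy_check_disk_size_change(struct toy_inode *inodes, size_t n)
{
    return toy_invalidate_inodes(inodes, n, false);
}
```

With three cached inodes of which one is dirty, the size-change path drops the two clean ones and leaves the dirty one in place, while the disappearance path drops whatever remains.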
  20. 17 Jan, 2011 2 commits
  21. 16 Jan, 2011 2 commits
    • sanitize vfsmount refcounting changes · f03c6599
      Al Viro authored
      Instead of splitting refcount between (per-cpu) mnt_count
      and (SMP-only) mnt_longrefs, make all references contribute
      to mnt_count again and keep track of how many are longterm
      ones.
      
      Accounting rules for longterm count:
      	* 1 for each fs_struct.root.mnt
      	* 1 for each fs_struct.pwd.mnt
      	* 1 for having non-NULL ->mnt_ns
      	* decrement to 0 happens only under vfsmount lock exclusive
      
      That allows nice common case for mntput() - since we can't drop the
      final reference until after mnt_longterm has reached 0 due to the rules
      above, mntput() can grab vfsmount lock shared and check mnt_longterm.
      If it turns out to be non-zero (which is the common case), we know
      that this is not the final mntput() and can just blindly decrement
      percpu mnt_count.  Otherwise we grab vfsmount lock exclusive and
      do usual decrement-and-check of percpu mnt_count.
      
      For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
      namespace.c uses the latter in places where we don't already hold
      vfsmount lock exclusive and opencodes a few remaining spots where
      we need to manipulate mnt_longterm.
      
      Note that we mostly revert the code outside of fs/namespace.c back
      to what we used to have; in particular, normal code doesn't need
      to care about two kinds of references, etc.  And we get to keep
      the optimization Nick's variant had bought us...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • Unexport do_add_mount() and add in follow_automount(), not ->d_automount() · ea5b778a
      David Howells authored
      Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
      added rather than calling do_add_mount() itself.  follow_automount() will then
      do the addition.
      
      This slightly complicates things as ->d_automount() normally wants to add the
      new vfsmount to an expiration list and start an expiration timer.  The problem
      with that is that the vfsmount will be deleted if it has a refcount of 1 and
      the timer will not repeat if the expiration list is empty.
      
      To this end, we require the vfsmount to be returned from d_automount() with a
      refcount of (at least) 2.  One of these refs will be dropped unconditionally.
      In addition, follow_automount() must get a 3rd ref around the call to
      do_add_mount() lest it eat a ref and return an error, leaving the mount we
      have open to being expired as we would otherwise have only 1 ref on it.
      
      d_automount() should also add the vfsmount to the expiration list (by
      calling mnt_set_expiry()) and start the expiration timer before returning, if
      this mechanism is to be used.  The vfsmount will be unlinked from the
      expiration list by follow_automount() if do_add_mount() fails.
      
      This patch also fixes the call to do_add_mount() for AFS to propagate the mount
      flags from the parent vfsmount.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  22. 07 Jan, 2011 1 commit
    • fs: scale mntget/mntput · b3e19d92
      Nick Piggin authored
      The problem that this patch aims to fix is vfsmount refcounting scalability.
      We need to take a reference on the vfsmount for every successful path lookup,
      and these lookups often go to the same mount point.
      
      The fundamental difficulty is that a "simple" reference count can never be made
      scalable, because any time a reference is dropped, we must check whether that
      was the last reference. To do that requires communication with all other CPUs
      that may have taken a reference count.
      
      We can make refcounts more scalable in a couple of ways, involving keeping
      distributed counters, and checking for the global-zero condition less
      frequently.
      
      - check the global sum once every interval (this will delay zero detection
        for some interval, so it's probably a showstopper for vfsmounts).
      
      - keep a local count and only take the global sum when local reaches 0 (this
        is difficult for vfsmounts, because we can't hold preempt off for the life of
        a reference, so a counter would need to be per-thread or tied strongly to a
        particular CPU which requires more locking).
      
      - keep a local difference of increments and decrements, which allows us to sum
        the total difference and hence find the refcount when summing all CPUs. Then,
        keep a single integer "long" refcount for slow and long lasting references,
        and only take the global sum of local counters when the long refcount is 0.
      
      This last scheme is what I implemented here. Attached mounts and process root
      and working directory references are "long" references, and everything else is
      a short reference.
      
      This allows scalable vfsmount references during path walking over mounted
      subtrees and unattached (lazy umounted) mounts with processes still running
      in them.
      
      This results in one fewer atomic op in the fastpath: mntget is now just a
      per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
      and non-atomic decrement in the common case. However code is otherwise bigger
      and heavier, so single threaded performance is basically a wash.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
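The third scheme in this commit message (per-CPU differences plus a single "long" refcount) can be sketched in userspace as below. This is a single-threaded toy under assumptions: the CPU is passed in explicitly, there is no locking or preemption control, and the `toy_*` names and `TOY_NR_CPUS` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4

struct toy_mnt {
    int percpu_delta[TOY_NR_CPUS]; /* increments minus decrements seen per CPU */
    int longrefs;                  /* slow, long-lasting references */
};

/* Taking a short reference only touches the local CPU's counter. */
static void toy_mntget(struct toy_mnt *m, int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    m->percpu_delta[cpu]++;
}

/* Returns true when this put dropped the last reference. */
static bool toy_mntput(struct toy_mnt *m, int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    m->percpu_delta[cpu]--;
    if (m->longrefs > 0)
        return false;   /* fast path: a long reference still pins the mount */

    /* Slow path: only now is the global sum over all CPUs needed. */
    int sum = 0;
    for (int i = 0; i < TOY_NR_CPUS; i++)
        sum += m->percpu_delta[i];
    return sum == 0;
}
```

A reference taken on one CPU and dropped on another leaves a positive delta on the first and a negative delta on the second; only the sum is meaningful, which is why the sum is taken solely when the long refcount is 0.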
  23. 29 Oct, 2010 1 commit
  24. 26 Oct, 2010 3 commits
  25. 18 Aug, 2010 1 commit
    • fs: brlock vfsmount_lock · 99b7db7b
      Nick Piggin authored
      Use a brlock for the vfsmount lock. It must be taken for write whenever
      modifying the mount hash or associated fields, and may be taken for read when
      performing mount hash lookups.
      
      A new lock is added for the mnt-id allocator, so it doesn't need to take
      the heavy vfsmount write-lock.
      
      The number of atomics should remain the same for fastpath rlock cases, though
      code would be slightly slower due to per-cpu access. Scalability will not be
      much improved in common cases yet, due to other locks (i.e. dcache_lock) getting
      in the way. However path lookups crossing mountpoints should be one case where
      scalability is improved (currently requiring the global lock).
      
      The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node
      Altix system (high latency to remote nodes), a simple umount microbenchmark
      (mount --bind mnt mnt2 ; umount mnt2, looped 1000 times) took 6.8s before this
      patch and 7.1s afterwards, about 5% slower.
      
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>