1. 13 Feb, 2014 1 commit
  2. 20 Jun, 2013 1 commit
  3. 08 May, 2013 2 commits
    • Kent Overstreet's avatar
      aio: don't include aio.h in sched.h · a27bb332
      Kent Overstreet authored
      Faster kernel compiles by way of fewer unnecessary includes.
      [akpm@linux-foundation.org: fix fallout]
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: default avatarKent Overstreet <koverstreet@google.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    • Zach Brown's avatar
      aio: remove retry-based AIO · 41003a7b
      Zach Brown authored
      This removes the retry-based AIO infrastructure now that nothing in tree
      is using it.
      We want to remove retry-based AIO because it is fundemantally unsafe.
      It retries IO submission from a kernel thread that has only assumed the
      mm of the submitting task.  All other task_struct references in the IO
      submission path will see the kernel thread, not the submitting task.
      This design flaw means that nothing of any meaningful complexity can use
      retry-based AIO.
      This removes all the code and data associated with the retry machinery.
      The most significant benefit of this is the removal of the locking
      around the unused run list in the submission path.
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarKent Overstreet <koverstreet@google.com>
      Signed-off-by: default avatarZach Brown <zab@redhat.com>
      Cc: Zach Brown <zab@redhat.com>
      Cc: Felipe Balbi <balbi@ti.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Acked-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Reviewed-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  4. 04 May, 2013 1 commit
  5. 29 Apr, 2013 1 commit
  6. 09 Apr, 2013 3 commits
  7. 27 Mar, 2013 1 commit
  8. 21 Mar, 2013 1 commit
  9. 04 Mar, 2013 2 commits
  10. 24 Feb, 2013 1 commit
  11. 23 Feb, 2013 1 commit
  12. 21 Dec, 2012 1 commit
  13. 18 Dec, 2012 1 commit
  14. 03 Oct, 2012 1 commit
    • Catalin Marinas's avatar
      compat: fs: Generic compat_sys_sendfile implementation · 8f9c0119
      Catalin Marinas authored
      This function is used by sparc, powerpc and arm64 for compat support.
      The patch adds a generic implementation which calls do_sendfile()
      directly and avoids set_fs().
      The sparc architecture has wrappers for the sign extensions while
      powerpc relies on the compiler to do the this. The patch adds wrappers
      for powerpc to handle the u32->int type conversion.
      compat_sys_sendfile64() can be replaced by a sys_sendfile() call since
      compat_loff_t has the same size as off_t on a 64-bit system.
      On powerpc, the patch also changes the 64-bit sendfile call from
      sys_sendile64 to sys_sendfile.
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  15. 27 Sep, 2012 1 commit
  16. 22 Jul, 2012 1 commit
    • Eric Sandeen's avatar
      vfs: allow custom EOF in generic_file_llseek code · e8b96eb5
      Eric Sandeen authored
      For ext3/4 htree directories, using the vfs llseek function with
      SEEK_END goes to i_size like for any other file, but in reality
      we want the maximum possible hash value.  Recent changes
      in ext4 have cut & pasted generic_file_llseek() back into fs/ext4/dir.c,
      but replicating this core code seems like a bad idea, especially
      since the copy has already diverged from the vfs.
      This patch updates generic_file_llseek_size to accept
      both a custom maximum offset, and a custom EOF position.  With this
      in place, ext4_dir_llseek can pass in the appropriate maximum hash
      position for both maxsize and eof, and get what it wants.
      As far as I know, this does not fix any bugs - nfs in the kernel
      doesn't use SEEK_END, and I don't know of any user who does.  But
      some ext4 folks seem keen on doing the right thing here, and I can't
      really argue.
      (Patch also fixes up some comments slightly)
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  17. 01 Jun, 2012 1 commit
  18. 29 Feb, 2012 1 commit
  19. 01 Nov, 2011 1 commit
    • Christopher Yeoh's avatar
      Cross Memory Attach · fcf63409
      Christopher Yeoh authored
      The basic idea behind cross memory attach is to allow MPI programs doing
      intra-node communication to do a single copy of the message rather than a
      double copy of the message via shared memory.
      The following patch attempts to achieve this by allowing a destination
      process, given an address and size from a source process, to copy memory
      directly from the source process into its own address space via a system
      call.  There is also a symmetrical ability to copy from the current
      process's address space into a destination process's address space.
      - Use of /proc/pid/mem has been considered, but there are issues with
        using it:
        - Does not allow for specifying iovecs for both src and dest, assuming
          preadv or pwritev was implemented either the area read from or
        written to would need to be contiguous.
        - Currently mem_read allows only processes who are currently
        ptrace'ing the target and are still able to ptrace the target to read
        from the target. This check could possibly be moved to the open call,
        but its not clear exactly what race this restriction is stopping
        (reason  appears to have been lost)
        - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
        domain socket is a bit ugly from a userspace point of view,
        especially when you may have hundreds if not (eventually) thousands
        of processes  that all need to do this with each other
        - Doesn't allow for some future use of the interface we would like to
        consider adding in the future (see below)
        - Interestingly reading from /proc/pid/mem currently actually
        involves two copies! (But this could be fixed pretty easily)
      As mentioned previously use of vmsplice instead was considered, but has
      problems.  Since you need the reader and writer working co-operatively if
      the pipe is not drained then you block.  Which requires some wrapping to
      do non blocking on the send side or polling on the receive.  In all to all
      communication it requires ordering otherwise you can deadlock.  And in the
      example of many MPI tasks writing to one MPI task vmsplice serialises the
      There are some cases of MPI collectives where even a single copy interface
      does not get us the performance gain we could.  For example in an
      MPI_Reduce rather than copy the data from the source we would like to
      instead use it directly in a mathops (say the reduce is doing a sum) as
      this would save us doing a copy.  We don't need to keep a copy of the data
      from the source.  I haven't implemented this, but I think this interface
      could in the future do all this through the use of the flags - eg could
      specify the math operation and type and the kernel rather than just
      copying the data would apply the specified operation between the source
      and destination and store it in the destination.
      Although we don't have a "second user" of the interface (though I've had
      some nibbles from people who may be interested in using it for intra
      process messaging which is not MPI).  This interface is something which
      hardware vendors are already doing for their custom drivers to implement
      fast local communication.  And so in addition to this being useful for
      OpenMPI it would mean the driver maintainers don't have to fix things up
      when the mm changes.
      There was some discussion about how much faster a true zero copy would
      go. Here's a link back to the email with some testing I did on that:
      There is a basic man page for the proposed interface here:
      This has been implemented for x86 and powerpc, other architecture should
      mainly (I think) just need to add syscall numbers for the process_vm_readv
      and process_vm_writev. There are 32 bit compatibility versions for
      64-bit kernels.
      For arch maintainers there are some simple tests to be able to quickly
      verify that the syscalls are working correctly here:
      http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgzSigned-off-by: default avatarChris Yeoh <yeohc@au1.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: <linux-man@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  20. 28 Oct, 2011 2 commits
    • Andi Kleen's avatar
      vfs: add generic_file_llseek_size · 5760495a
      Andi Kleen authored
      Add a generic_file_llseek variant to the VFS that allows passing in
      the maximum file size of the file system, instead of always
      using maxbytes from the superblock.
      This can be used to eliminate some cut'n'paste seek code in ext4.
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
    • Andi Kleen's avatar
      vfs: do (nearly) lockless generic_file_llseek · ef3d0fd2
      Andi Kleen authored
      The i_mutex lock use of generic _file_llseek hurts.  Independent processes
      accessing the same file synchronize over a single lock, even though
      they have no need for synchronization at all.
      Under high utilization this can cause llseek to scale very poorly on larger
      This patch does some rethinking of the llseek locking model:
      First the 64bit f_pos is not necessarily atomic without locks
      on 32bit systems. This can already cause races with read() today.
      This was discussed on linux-kernel in the past and deemed acceptable.
      The patch does not change that.
      Let's look at the different seek variants:
      SEEK_SET: Doesn't really need any locking.
      If there's a race one writer wins, the other loses.
      For 32bit the non atomic update races against read()
      stay the same. Without a lock they can also happen
      against write() now.  The read() race was deemed
      acceptable in past discussions, and I think if it's
      ok for read it's ok for write too.
      => Don't need a lock.
      SEEK_END: This behaves like SEEK_SET plus it reads
      the maximum size too. Reading the maximum size would have the
      32bit atomic problem. But luckily we already have a way to read
      the maximum size without locking (i_size_read), so we
      can just use that instead.
      Without i_mutex there is no synchronization with write() anymore,
      however since the write() update is atomic on 64bit it just behaves
      like another racy SEEK_SET.  On non atomic 32bit it's the same
      as SEEK_SET.
      => Don't need a lock, but need to use i_size_read()
      SEEK_CUR: This has a read-modify-write race window
      on the same file. One could argue that any application
      doing unsynchronized seeks on the same file is already broken.
      But for the sake of not adding a regression here I'm
      using the file->f_lock to synchronize this. Using this
      lock is much better than the inode mutex because it doesn't
      synchronize between processes.
      => So still need a lock, but can use a f_lock.
      This patch implements this new scheme in generic_file_llseek.
      I dropped generic_file_llseek_unlocked and changed all callers.
      Signed-off-by: default avatarAndi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
  21. 26 Jul, 2011 1 commit
  22. 21 Jul, 2011 1 commit
    • Josef Bacik's avatar
      fs: add SEEK_HOLE and SEEK_DATA flags · 982d8165
      Josef Bacik authored
      This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags.  Turns out
      using fiemap in things like cp cause more problems than it solves, so lets try
      and give userspace an interface that doesn't suck.  We need to match solaris
      here, and the definitions are
      *o* If /whence/ is SEEK_HOLE, the offset of the start of the
      next hole greater than or equal to the supplied offset
      is returned. The definition of a hole is provided near
      the end of the DESCRIPTION.
      *o* If /whence/ is SEEK_DATA, the file pointer is set to the
      start of the next non-hole file region greater than or
      equal to the supplied offset.
      So in the generic case the entire file is data and there is a virtual hole at
      the end.  That means we will just return i_size for SEEK_HOLE and will return
      the same offset for SEEK_DATA.  This is how Solaris does it so we have to do it
      the same way.
      Signed-off-by: default avatarJosef Bacik <josef@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  23. 13 Jan, 2011 1 commit
  24. 17 Nov, 2010 1 commit
  25. 29 Oct, 2010 1 commit
  26. 26 Oct, 2010 1 commit
  27. 15 Oct, 2010 2 commits
    • Arnd Bergmann's avatar
      vfs: make no_llseek the default · 776c163b
      Arnd Bergmann authored
      All file operations now have an explicit .llseek
      operation pointer, so we can change the default
      action for future code.
      This makes changes the default from default_llseek
      to no_llseek, which always returns -ESPIPE if
      a user tries to seek on a file without a .llseek
      The name of the default_llseek function remains
      unchanged, if anyone thinks we should change it,
      please speak up.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
    • Arnd Bergmann's avatar
      vfs: don't use BKL in default_llseek · ab91261f
      Arnd Bergmann authored
      There are currently 191 users of default_llseek.
      Nine of these are in device drivers that use the
      big kernel lock. None of these ever touch
      file->f_pos outside of llseek or file_pos_write.
      Consequently, we never rely on the BKL
      in the default_llseek function and can
      replace that with i_mutex, which is also
      used in generic_file_llseek.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
  28. 28 Jul, 2010 1 commit
    • Eric Paris's avatar
      fsnotify: pass a file instead of an inode to open, read, and write · 2a12a9d7
      Eric Paris authored
      fanotify, the upcoming notification system actually needs a struct path so it can
      do opens in the context of listeners, and it needs a file so it can get f_flags
      from the original process.  Close was the only operation that already was passing
      a struct file to the notification hook.  This patch passes a file for access,
      modify, and open as well as they are easily available to these hooks.
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
  29. 27 May, 2010 1 commit
  30. 24 Mar, 2010 1 commit
  31. 04 Nov, 2009 1 commit
  32. 24 Sep, 2009 1 commit
  33. 11 May, 2009 1 commit
  34. 04 Apr, 2009 1 commit
    • Linus Torvalds's avatar
      Make non-compat preadv/pwritev use native register size · 601cc11d
      Linus Torvalds authored
      Instead of always splitting the file offset into 32-bit 'high' and 'low'
      parts, just split them into the largest natural word-size - which in C
      terms is 'unsigned long'.
      This allows 64-bit architectures to avoid the unnecessary 32-bit
      shifting and masking for native format (while the compat interfaces will
      obviously always have to do it).
      This also changes the order of 'high' and 'low' to be "low first".  Why?
      Because when we have it like this, the 64-bit system calls now don't use
      the "pos_high" argument at all, and it makes more sense for the native
      system call to simply match the user-mode prototype.
      This results in a much more natural calling convention, and allows the
      compiler to generate much more straightforward code.  On x86-64, we now
              testq   %rcx, %rcx      # pos_l
              js      .L122   #,
              movq    %rcx, -48(%rbp) # pos_l, pos
      from the C source
              loff_t pos = pos_from_hilo(pos_h, pos_l);
              if (pos < 0)
                      return -EINVAL;
      and the 'pos_h' register isn't even touched.  It used to generate code
              mov     %r8d, %r8d      # pos_low, pos_low
              salq    $32, %rcx       #, tmp71
              movq    %r8, %rax       # pos_low, pos.386
              orq     %rcx, %rax      # tmp71, pos.386
              js      .L122   #,
              movq    %rax, -48(%rbp) # pos.386, pos
      which isn't _that_ horrible, but it does show how the natural word size
      is just a more sensible interface (same arguments will hold in the user
      level glibc wrapper function, of course, so the kernel side is just half
      of the equation!)
      Note: in all cases the user code wrapper can again be the same. You can
      just do
      	#define HALF_BITS (sizeof(unsigned long)*4)
      	__syscall(PWRITEV, fd, iov, count, offset, (offset >> HALF_BITS) >> HALF_BITS);
      or something like that.  That way the user mode wrapper will also be
      nicely passing in a zero (it won't actually have to do the shifts, the
      compiler will understand what is going on) for the last argument.
      And that is a good idea, even if nobody will necessarily ever care: if
      we ever do move to a 128-bit lloff_t, this particular system call might
      be left alone.  Of course, that will be the least of our worries if we
      really ever need to care, so this may not be worth really caring about.
      [ Fixed for lost 'loff_t' cast noticed by Andrew Morton ]
      Acked-by: default avatarGerd Hoffmann <kraxel@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Ralf Baechle <ralf@linux-mips.org>>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>