1. 07 Jun, 2014 1 commit
  2. 18 Oct, 2013 1 commit
  3. 08 May, 2013 1 commit
  4. 17 Apr, 2013 1 commit
  5. 04 Mar, 2013 1 commit
    • fs: Limit sys_mount to only request filesystem modules. · 7f78e035
      Eric W. Biederman authored
      Modify the request_module to prefix the file system type with "fs-"
      and add aliases to all of the filesystems that can be built as modules
      to match.
      
      A common practice is to build all of the kernel code and leave code
      that is not commonly needed as modules, with the result that many
      users are exposed to any bug anywhere in the kernel.
      
      Looking for filesystems with a fs- prefix limits the pool of possible
      modules that can be loaded by mount to just filesystems, trivially
      making things safer with no real cost.
      
      Using aliases means user space can control the policy of which
      filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
      with blacklist and alias directives, allowing simple, safe,
      well-understood workarounds to known problematic software.
      
      This also addresses a rare but unfortunate problem where the filesystem
      name is not the same as its module name and module auto-loading
      would not work.  While writing this patch I saw a handful of such
      cases, the most significant being autofs, which lives in the module
      autofs4.
      
      This is relevant to user namespaces because we can reach
      request_module() in get_fs_type() without having any special
      permissions, and people get uncomfortable when a user-specified string
      (in this case the filesystem type) goes all of the way to request_module.
      
      After having looked at this issue I don't think there is any
      particular reason to perform any filtering or permission checks beyond
      making it clear in the module request that we want a filesystem
      module.  The common pattern in the kernel is to call request_module()
      without regard to the user's permissions.  In general, all a filesystem
      module does once loaded is call register_filesystem() and go to sleep,
      which means there is not much attack surface exposed by loading a
      filesystem module unless the filesystem is mounted.  In a user
      namespace, filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
      which most filesystems do not set today.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Reported-by: Kees Cook <keescook@google.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      7f78e035
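The fs- prefix described in the commit above can be illustrated with a small userspace sketch. It only models how the requested module name could be built from the filesystem type string; the function name `fs_module_alias` is hypothetical, and the rule that a subtype suffix (as in "fuse.sshfs") is cut at the first '.' is an assumption of this sketch, not something the commit message states.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Userspace model of the module name that mount would request.
 * Assumption: a subtype suffix such as "fuse.sshfs" is cut at the
 * first '.', so only the base filesystem type is used.
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int fs_module_alias(const char *fstype, char *buf, size_t buflen)
{
    const char *dot;
    int len, n;

    assert(fstype != NULL && buf != NULL);
    dot = strchr(fstype, '.');
    len = dot ? (int)(dot - fstype) : (int)strlen(fstype);

    /* Prefix the type with "fs-" so only filesystem aliases can match. */
    n = snprintf(buf, buflen, "fs-%.*s", len, fstype);
    return (n < 0 || (size_t)n >= buflen) ? -1 : n;
}
```

With this model, a mount of type "autofs" asks for "fs-autofs", so a module that lives under another name (autofs4 in the commit's example) can still be found through an alias, and a string that is not a filesystem alias never selects an arbitrary module.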
  6. 26 Feb, 2013 1 commit
  7. 23 Feb, 2013 2 commits
  8. 12 Dec, 2012 3 commits
    • mm: adjust address_space_operations.migratepage() return code · 78bd5209
      Rafael Aquini authored
      Memory fragmentation introduced by ballooning might significantly reduce
      the number of 2MB contiguous memory blocks that can be used within a
      guest, thus imposing performance penalties associated with the reduced
      number of transparent huge pages that could be used by the guest workload.
      
      This patch set follows the main idea discussed at the 2012 LSFMMS session
      "Ballooning for transparent huge pages" (http://lwn.net/Articles/490114/)
      and introduces the required changes to the virtio_balloon driver, as well
      as the changes to the core compaction and migration bits, in order to make
      those subsystems aware of ballooned pages and allow memory balloon pages
      to become movable within a guest, thus avoiding the aforementioned
      fragmentation issue.
      
      The following numbers show how this patch set helps compaction to be
      more effective in guests with ballooned memory.

      Results for the STRESS-HIGHALLOC benchmark, from Mel Gorman's mmtests suite,
      running on a 4GB RAM KVM guest which was ballooning 512MB RAM in 64MB
      chunks every minute (inflating/deflating) while the test was running:
      
      ===BEGIN stress-highalloc
      
      STRESS-HIGHALLOC
                       highalloc-3.7     highalloc-3.7
                           rc4-clean         rc4-patch
      Pass 1          55.00 ( 0.00%)    62.00 ( 7.00%)
      Pass 2          54.00 ( 0.00%)    62.00 ( 8.00%)
      while Rested    75.00 ( 0.00%)    80.00 ( 5.00%)
      
      MMTests Statistics: duration
                       3.7         3.7
                 rc4-clean   rc4-patch
      User         1207.59     1207.46
      System       1300.55     1299.61
      Elapsed      2273.72     2157.06
      
      MMTests Statistics: vmstat
                                      3.7         3.7
                                rc4-clean   rc4-patch
      Page Ins                    3581516     2374368
      Page Outs                  11148692    10410332
      Swap Ins                         80          47
      Swap Outs                      3641         476
      Direct pages scanned          37978       33826
      Kswapd pages scanned        1828245     1342869
      Kswapd pages reclaimed      1710236     1304099
      Direct pages reclaimed        32207       31005
      Kswapd efficiency               93%         97%
      Kswapd velocity             804.077     622.546
      Direct efficiency               84%         91%
      Direct velocity              16.703      15.682
      Percentage direct scans          2%          2%
      Page writes by reclaim        79252        9704
      Page writes file              75611        9228
      Page writes anon               3641         476
      Page reclaim immediate        16764       11014
      Page rescued immediate            0           0
      Slabs scanned               2171904     2152448
      Direct inode steals             385        2261
      Kswapd inode steals          659137      609670
      Kswapd skipped wait               1          69
      THP fault alloc                 546         631
      THP collapse alloc              361         339
      THP splits                      259         263
      THP fault fallback               98          50
      THP collapse fail                20          17
      Compaction stalls               747         499
      Compaction success              244         145
      Compaction failures             503         354
      Compaction pages moved       370888      474837
      Compaction move failure       77378       65259
      
      ===END stress-highalloc
      
      This patch:
      
      Introduce MIGRATEPAGE_SUCCESS as the default return code for the
      address_space_operations.migratepage() method and document the expected
      return code for the same method in failure cases.
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      78bd5209
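The return-code convention introduced by the commit above can be sketched in userspace C. This is a model under stated assumptions: `MIGRATEPAGE_SUCCESS` is assumed to be 0, `struct page_model` and both functions are hypothetical stand-ins, and the choice of `-EAGAIN` as the failure code is illustrative only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MIGRATEPAGE_SUCCESS 0      /* assumed value of the named success code */

struct page_model { int locked; int data; };  /* hypothetical stand-in for struct page */

/* Hypothetical callback: named success code on success, negative errno on failure. */
static int demo_migratepage(struct page_model *newpage, struct page_model *page)
{
    assert(newpage != NULL && page != NULL);
    if (page->locked)
        return -EAGAIN;            /* page cannot be migrated right now */
    *newpage = *page;              /* "migrate" by copying the contents */
    return MIGRATEPAGE_SUCCESS;
}

/* The caller compares against the named constant instead of a bare 0. */
static int migrate_one(struct page_model *newpage, struct page_model *page)
{
    return demo_migratepage(newpage, page) == MIGRATEPAGE_SUCCESS;
}
```

Naming the success value leaves room for a callback to report other non-error outcomes later without every caller having to be audited for `rc == 0` checks.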
    • mm: use vm_unmapped_area() in hugetlbfs · 08659355
      Michel Lespinasse authored
      Update the hugetlb_get_unmapped_area function to make use of
      vm_unmapped_area() instead of implementing a brute force search.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08659355
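The kind of request that a caller hands to a generic "find an unmapped area" helper can be sketched as follows. Everything here is an assumption made for illustration: `struct gap_request` and `find_gap` are hypothetical names, and the linear walk over a sorted array is a deliberately simple placeholder for whatever search the kernel helper performs internally. The point is only that the caller describes length, limits and alignment, and leaves the search to shared code.

```c
#include <assert.h>
#include <stddef.h>

struct range { unsigned long start, end; };   /* existing mapping [start, end) */

struct gap_request {                          /* hypothetical request descriptor */
    unsigned long length;                     /* size of the area wanted */
    unsigned long low_limit, high_limit;      /* search window */
    unsigned long align_mask;                 /* e.g. huge page size - 1 */
};

/* Round addr up so that (addr & align_mask) == 0. */
static unsigned long gap_align(unsigned long addr, unsigned long align_mask)
{
    return (addr + align_mask) & ~align_mask;
}

/* Lowest suitable address, or 0 if none. maps[] must be sorted and disjoint. */
static unsigned long find_gap(const struct range *maps, size_t n,
                              const struct gap_request *req)
{
    unsigned long addr = gap_align(req->low_limit, req->align_mask);
    size_t i;

    assert(req->length > 0);
    for (i = 0; i < n; i++) {
        if (maps[i].end <= addr)
            continue;                         /* mapping lies below the candidate */
        if (addr + req->length <= maps[i].start)
            break;                            /* candidate fits before this mapping */
        addr = gap_align(maps[i].end, req->align_mask);  /* skip past it */
    }
    return (addr + req->length <= req->high_limit) ? addr : 0;
}
```

For hugetlbfs the interesting field is the alignment mask, since every candidate address has to be aligned to the huge page size rather than the base page size.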
    • mm: support more pagesizes for MAP_HUGETLB/SHM_HUGETLB · 42d7395f
      Andi Kleen authored
      There was some desire in large applications using MAP_HUGETLB or
      SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on
      others.  This is useful together with NUMA policy: use 2MB interleaving
      on some mappings, but 1GB on local mappings.
      
      This patch extends the IPC/SHM syscall interfaces slightly to allow
      specifying the page size.
      
      It borrows some upper bits in the existing flag arguments and allows
      encoding the log of the desired page size in addition to the *_HUGETLB
      flag.  When 0 is specified the default size is used, which makes the
      change fully compatible.
      
      Extending the internal hugetlb code to handle this is straightforward.
      Instead of a single mount it just keeps an array of them and selects the
      right mount based on the specified page size.  When no page size is
      specified it uses the mount of the default page size.
      
      The change is not visible in /proc/mounts because internal mounts don't
      appear there.  It also has very little overhead: the additional mounts
      just consume a super block, but not more memory when not used.
      
      I also exported the new flags to the user headers (they were previously
      under __KERNEL__).  Right now only symbols for x86 and some other
      architecture for 1GB and 2MB are defined.  The interface should already
      work for all other architectures though.  Only architectures that define
      multiple hugetlb sizes actually need it (that is currently x86, tile,
      powerpc).  However, tile and powerpc have user-configurable hugetlb
      sizes, so it's not easy to add defines.  A program on those
      architectures would need to query sysfs and use the appropriate log2.
      
      [akpm@linux-foundation.org: cleanups]
      [rientjes@google.com: fix build]
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      42d7395f
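The flag encoding described in the commit above (log2 of the page size stored in otherwise unused upper flag bits, with 0 meaning "default size") can be sketched like this. The shift of 26 and the 6-bit field width are assumptions of this sketch, chosen to match what Linux headers commonly expose as MAP_HUGE_SHIFT and MAP_HUGE_MASK; the helper names are hypothetical.

```c
#include <assert.h>

#define HUGE_SHIFT 26      /* assumed position of the size field */
#define HUGE_MASK  0x3fU   /* assumed 6-bit field width */

/* Store log2(page size) in the upper flag bits; 0 selects the default size. */
static unsigned int huge_encode(unsigned int flags, unsigned int log2_size)
{
    assert(log2_size <= HUGE_MASK);
    return (flags & ~(HUGE_MASK << HUGE_SHIFT)) | (log2_size << HUGE_SHIFT);
}

/* Recover the encoded log2(page size); 0 means "use the default". */
static unsigned int huge_decode(unsigned int flags)
{
    return (flags >> HUGE_SHIFT) & HUGE_MASK;
}
```

Under these assumptions a 2MB request encodes 21 and a 1GB request encodes 30, while existing callers that pass no size bits decode to 0 and keep the default size, which is why the change stays compatible.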
  9. 06 Dec, 2012 1 commit
  10. 09 Oct, 2012 2 commits
    • mm: replace vma prio_tree with an interval tree · 6b2dbba8
      Michel Lespinasse authored
      Implement an interval tree as a replacement for the VMA prio_tree.  The
      algorithms are similar to lib/interval_tree.c; however that code can't be
      directly reused as the interval endpoints are not explicitly stored in the
      VMA.  So instead, the common algorithm is moved into a template and the
      details (node type, how to get interval endpoints from the node, etc) are
      filled in using the C preprocessor.
      
      Once the interval tree functions are available, using them as a
      replacement to the VMA prio tree is a relatively simple, mechanical job.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b2dbba8
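The "template filled in by the C preprocessor" idea from the commit above can be sketched with a macro that generates a search function for any node type, given accessor macros for the interval endpoints. This is a simplified illustration: `INTERVAL_SEARCH_DEFINE`, `struct vma_model` and the accessors are hypothetical, and the generated function does a linear scan rather than walking a balanced tree.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Template: generates PREFIX_first_overlap() for any node type, given
 * macros that extract the closed interval [START(n), LAST(n)].
 * The linear scan is a placeholder for a real tree walk.
 */
#define INTERVAL_SEARCH_DEFINE(TYPE, START, LAST, PREFIX)                    \
static TYPE *PREFIX##_first_overlap(TYPE *nodes, size_t n,                   \
                                    unsigned long start, unsigned long last) \
{                                                                            \
    size_t i;                                                                \
    assert(start <= last);                                                   \
    for (i = 0; i < n; i++)                                                  \
        if (START(&nodes[i]) <= last && start <= LAST(&nodes[i]))            \
            return &nodes[i];                                                \
    return NULL;                                                             \
}

/* A VMA-like node that stores a start page and a size, not the last page. */
struct vma_model { unsigned long pgoff, npages; };

#define VMA_START(v) ((v)->pgoff)
#define VMA_LAST(v)  ((v)->pgoff + (v)->npages - 1)

/* Instantiate the template: defines vma_first_overlap(). */
INTERVAL_SEARCH_DEFINE(struct vma_model, VMA_START, VMA_LAST, vma)
```

Because the endpoints are computed by the accessor macros, the node type does not need to store them explicitly, which is the constraint the commit message gives for not reusing the existing generic code directly.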
    • mm: kill vma flag VM_RESERVED and mm->reserved_vm counter · 314e51b9
      Konstantin Khlebnikov authored
      A long time ago, in v2.4, VM_RESERVED kept the swapout process off a VMA.
      It has since lost its original meaning but still has some effects:
      
       | effect                 | alternative flags
      -+------------------------+---------------------------------------------
      1| account as reserved_vm | VM_IO
      2| skip in core dump      | VM_IO, VM_DONTDUMP
      3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      
      This patch removes the reserved_vm counter from mm_struct.  Nobody seems
      to care about it: it is not exported to userspace directly, and it only
      reduces the total_vm shown in proc.
      
      Thus VM_RESERVED can be replaced with VM_IO or the pair
      VM_DONTEXPAND | VM_DONTDUMP.

      remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
      remap_vmalloc_range() sets VM_DONTEXPAND | VM_DONTDUMP.
      
      [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      314e51b9
  11. 03 Oct, 2012 1 commit
  12. 21 Sep, 2012 1 commit
  13. 01 Aug, 2012 1 commit
  14. 14 Jul, 2012 1 commit
  15. 06 May, 2012 1 commit
  16. 26 Apr, 2012 1 commit
    • hugetlbfs: lockdep annotate root inode properly · 65ed7601
      Aneesh Kumar K.V authored
      This fixes the below reported false lockdep warning.  e096d0c7
      ("lockdep: Add helper function for dir vs file i_mutex annotation") added
      a similar annotation for every other inode in hugetlbfs but missed the
      root inode because it was allocated by a separate function.
      
      For HugeTLB fs we allow taking i_mutex in mmap.  HugeTLB fs doesn't
      support file write and its file read callback is modified in a05b0855
      ("hugetlbfs: avoid taking i_mutex from hugetlbfs_read()") to not take
      i_mutex.  Hence for HugeTLB fs with regular files we really don't take
      i_mutex with mmap_sem held.
      
       ======================================================
       [ INFO: possible circular locking dependency detected ]
       3.4.0-rc1+ #322 Not tainted
       -------------------------------------------------------
       bash/1572 is trying to acquire lock:
        (&mm->mmap_sem){++++++}, at: [<ffffffff810f1618>] might_fault+0x40/0x90
      
       but task is already holding lock:
        (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<ffffffff81125f88>] vfs_readdir+0x56/0xa8
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #1 (&sb->s_type->i_mutex_key#12){+.+.+.}:
              [<ffffffff810a09e5>] lock_acquire+0xd5/0xfa
              [<ffffffff816a2f5e>] __mutex_lock_common+0x48/0x350
              [<ffffffff816a3325>] mutex_lock_nested+0x2a/0x31
              [<ffffffff811fb8e1>] hugetlbfs_file_mmap+0x7d/0x104
              [<ffffffff810f859a>] mmap_region+0x272/0x47d
              [<ffffffff810f8a39>] do_mmap_pgoff+0x294/0x2ee
              [<ffffffff810f8b65>] sys_mmap_pgoff+0xd2/0x10e
              [<ffffffff8103d19e>] sys_mmap+0x1d/0x1f
              [<ffffffff816a5922>] system_call_fastpath+0x16/0x1b
      
       -> #0 (&mm->mmap_sem){++++++}:
              [<ffffffff810a0256>] __lock_acquire+0xa81/0xd75
              [<ffffffff810a09e5>] lock_acquire+0xd5/0xfa
              [<ffffffff810f1645>] might_fault+0x6d/0x90
              [<ffffffff81125d62>] filldir+0x6a/0xc2
              [<ffffffff81133a83>] dcache_readdir+0x5c/0x222
              [<ffffffff81125fa8>] vfs_readdir+0x76/0xa8
              [<ffffffff811260b6>] sys_getdents+0x79/0xc9
              [<ffffffff816a5922>] system_call_fastpath+0x16/0x1b
      
       other info that might help us debug this:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(&sb->s_type->i_mutex_key#12);
                                      lock(&mm->mmap_sem);
                                      lock(&sb->s_type->i_mutex_key#12);
         lock(&mm->mmap_sem);
      
        *** DEADLOCK ***
      
       1 lock held by bash/1572:
        #0:  (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<ffffffff81125f88>] vfs_readdir+0x56/0xa8
      
       stack backtrace:
       Pid: 1572, comm: bash Not tainted 3.4.0-rc1+ #322
       Call Trace:
        [<ffffffff81699a3c>] print_circular_bug+0x1f8/0x209
        [<ffffffff810a0256>] __lock_acquire+0xa81/0xd75
        [<ffffffff810f38aa>] ? handle_pte_fault+0x5ff/0x614
        [<ffffffff8109e622>] ? mark_lock+0x2d/0x258
        [<ffffffff810f1618>] ? might_fault+0x40/0x90
        [<ffffffff810a09e5>] lock_acquire+0xd5/0xfa
        [<ffffffff810f1618>] ? might_fault+0x40/0x90
        [<ffffffff816a3249>] ? __mutex_lock_common+0x333/0x350
        [<ffffffff810f1645>] might_fault+0x6d/0x90
        [<ffffffff810f1618>] ? might_fault+0x40/0x90
        [<ffffffff81125d62>] filldir+0x6a/0xc2
        [<ffffffff81133a83>] dcache_readdir+0x5c/0x222
        [<ffffffff81125cf8>] ? sys_ioctl+0x74/0x74
        [<ffffffff81125cf8>] ? sys_ioctl+0x74/0x74
        [<ffffffff81125cf8>] ? sys_ioctl+0x74/0x74
        [<ffffffff81125fa8>] vfs_readdir+0x76/0xa8
        [<ffffffff811260b6>] sys_getdents+0x79/0xc9
        [<ffffffff816a5922>] system_call_fastpath+0x16/0x1b
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      65ed7601
  17. 05 Apr, 2012 1 commit
    • hugetlbfs: remove unregister_filesystem() when initializing module · 7563ec4c
      Hillf Danton authored
      It was introduced by d1d5e05f ("hugetlbfs: return error code when
      initializing module") but as Al pointed out, is a bad idea.
      
      Quoted comments from Al:
       "Note that unregister_filesystem() in module init is *always* wrong;
        it's not an issue here (it's done too early to care about and
        realistically the box is not going anywhere - it'll panic when attempt
        to exec /sbin/init fails, if not earlier), but it's a damn bad
        example.
      
        Consider a normal fs module.  Somebody loads it and in parallel with
        that we get a mount attempt on that fs type.  It comes between
        register and failure exits that causes unregister; at that point we
        are screwed since grabbing a reference to module as done by mount is
        enough to prevent exit, but not to prevent the failure of init.  As
        the result, module will get freed when init fails, mounted fs of that
        type be damned."
      
      So remove it.
      Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7563ec4c
  18. 22 Mar, 2012 7 commits
    • hugetlbfs: return error code when initializing module · d1d5e05f
      Hillf Danton authored
      Return an errno upon failure to create inode kmem cache, and unregister
      the FS upon failure to mount.
      
      [akpm@linux-foundation.org: remove unneeded test of `error']
      Signed-off-by: Hillf Danton <dhillf@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d1d5e05f
    • hugetlbfs: fix alignment of huge page requests · 40716e29
      Steven Truelove authored
      When calling shmget() with SHM_HUGETLB, shmget aligns the request size to
      PAGE_SIZE, but this is not sufficient.
      
      Modify hugetlb_file_setup() to align requests to the huge page size, and
      to accept an address argument so that all alignment checks can be
      performed in hugetlb_file_setup(), rather than in its callers.  Change
      newseg() and mmap_pgoff() to match the new prototype and eliminate a now
      redundant alignment check.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Steven Truelove <steven.truelove@utoronto.ca>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      40716e29
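The alignment problem fixed by the commit above comes down to simple arithmetic: rounding a size up to the base page size does not make it a multiple of the huge page size. A minimal sketch, assuming a 4KB base page and a 2MB huge page (both sizes and the helper name `round_up_pow2` are assumptions for illustration):

```c
#include <assert.h>

#define PAGE_SZ      4096UL                 /* assumed base page size */
#define HUGE_PAGE_SZ (2UL * 1024 * 1024)    /* assumed 2MB huge page size */

/* Round size up to a multiple of align, which must be a power of two. */
static unsigned long round_up_pow2(unsigned long size, unsigned long align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}
```

A 5000-byte request rounded to `PAGE_SZ` gives 8192 bytes, which is still not a whole number of huge pages; rounded to `HUGE_PAGE_SZ` it becomes exactly one 2MB huge page.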
    • mm, hugetlb: add thread name and pid to SHM_HUGETLB mlock rlimit warning · 21a3c273
      David Rientjes authored
      Add the thread name and pid of the application that is allocating shm
      segments with MAP_HUGETLB without being a part of
      /proc/sys/vm/hugetlb_shm_group or having CAP_IPC_LOCK.
      
      This identifies the application so it may be fixed by avoiding using the
      deprecated exception (see Documentation/feature-removal-schedule.txt).
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Dave Jones <davej@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21a3c273
    • hugepages: fix use after free bug in "quota" handling · 90481622
      David Gibson authored
      hugetlbfs_{get,put}_quota() are badly named.  They don't interact with the
      general quota handling code, and they don't much resemble its behaviour.
      Rather than being about maintaining limits on on-disk block usage by
      particular users, they are instead about maintaining limits on in-memory
      page usage (including anonymous MAP_PRIVATE copied-on-write pages)
      associated with a particular hugetlbfs filesystem instance.
      
      Worse, they work by having callbacks to the hugetlbfs filesystem code from
      the low-level page handling code, in particular from free_huge_page().
      This is a layering violation in itself, but more importantly, if the
      kernel does a get_user_pages() on hugepages (which can happen from KVM
      amongst others), then the free_huge_page() can be delayed until after the
      associated inode has already been freed.  If an unmount occurs at the
      wrong time, even the hugetlbfs superblock where the "quota" limits are
      stored may have been freed.
      
      Andrew Barry proposed a patch to fix this by having hugepages store
      pointers directly to the superblock, instead of storing a pointer to
      their address_space and reaching the superblock from there, and bumping
      the reference count as appropriate to avoid it being freed.
      Andrew Morton rejected that version, however, on the grounds that it made
      the existing layering violation worse.
      
      This is a reworked version of Andrew's patch, which removes the extra, and
      some of the existing, layering violation.  It works by introducing the
      concept of a hugepage "subpool" at the lower hugepage mm layer - that is a
      finite logical pool of hugepages to allocate from.  hugetlbfs now creates
      a subpool for each filesystem instance with a page limit set, and a
      pointer to the subpool gets added to each allocated hugepage, instead of
      the address_space pointer used now.  The subpool has its own lifetime and
      is only freed once all pages in it _and_ all other references to it (i.e.
      superblocks) are gone.
      
      subpools are optional - a NULL subpool pointer is taken by the code to
      mean that no subpool limits are in effect.
      
      Previous discussion of this bug found in:  "Fix refcounting in hugetlbfs
      quota handling.". See:  https://lkml.org/lkml/2011/8/11/28 or
      http://marc.info/?l=linux-mm&m=126928970510627&w=1
      
      v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
      alloc_huge_page() - since it already takes the vma, it is not necessary.
      Signed-off-by: Andrew Barry <abarry@cray.com>
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90481622
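The subpool lifetime rule described in the commit above (freed only once all pages and all other references are gone, with NULL meaning "no limit") can be modelled in a few lines of userspace C. This is a sketch under assumptions: the struct layout and function names are hypothetical, and the locking a real implementation would need is left out.

```c
#include <assert.h>
#include <stdlib.h>

struct subpool {
    long count;        /* references held by owners such as a superblock */
    long max_pages;    /* page limit for this pool */
    long used_pages;   /* pages currently charged to the pool */
};

static struct subpool *subpool_new(long max_pages)
{
    struct subpool *sp = malloc(sizeof(*sp));

    if (sp) {
        sp->count = 1;
        sp->max_pages = max_pages;
        sp->used_pages = 0;
    }
    return sp;
}

/* Free only when no references and no pages remain. Returns 1 if freed. */
static int subpool_maybe_free(struct subpool *sp)
{
    if (sp->count == 0 && sp->used_pages == 0) {
        free(sp);
        return 1;
    }
    return 0;
}

/* Charge pages; a NULL subpool means no limit applies. Returns 0 or -1. */
static int subpool_get_pages(struct subpool *sp, long delta)
{
    if (!sp)
        return 0;
    if (sp->used_pages + delta > sp->max_pages)
        return -1;
    sp->used_pages += delta;
    return 0;
}

/* Uncharge pages; may free a pool whose owner is already gone. */
static int subpool_put_pages(struct subpool *sp, long delta)
{
    if (!sp)
        return 0;
    assert(delta <= sp->used_pages);
    sp->used_pages -= delta;
    return subpool_maybe_free(sp);
}

/* Drop the owner's reference (for example at unmount). Returns 1 if freed. */
static int subpool_release(struct subpool *sp)
{
    assert(sp->count > 0);
    sp->count--;
    return subpool_maybe_free(sp);
}
```

The key property is that dropping the owner's reference while pages are still outstanding does not free the pool; the last page to be returned does, so a late page free never touches freed memory.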
    • hugetlb: cleanup hugetlb.h · a1d776ee
      David Gibson authored
      Make a couple of small cleanups to include/linux/hugetlb.h.  The
      set_file_hugepages() function, which was not used anywhere, is removed,
      and the hugetlbfs_config and hugetlbfs_inode_info structures, along with
      the HUGETLBFS_I helper function, are moved into inode.c, the only place
      they were used.

      These structures are really linked to the hugetlbfs filesystem
      specifically, not to hugepage mm handling in general, so they belong in
      the filesystem code, not in a generally available header.
      
      It would be nice to move the hugetlbfs_sb_info (superblock) structure in
      there as well, but it's currently needed in a number of places via the
      hstate_vma() and hstate_inode().
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Andrew Barry <abarry@cray.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a1d776ee
    • hugetlbfs: avoid taking i_mutex from hugetlbfs_read() · a05b0855
      Aneesh Kumar K.V authored
      Taking i_mutex in hugetlbfs_read() can result in deadlock with mmap as
      explained below:
      
       Thread A:
        read() on hugetlbfs
         hugetlbfs_read() called
          i_mutex grabbed
           hugetlbfs_read_actor() called
            __copy_to_user() called
             page fault is triggered
       Thread B, sharing address space with A:
        mmap() the same file
         ->mmap_sem is grabbed on task_B->mm->mmap_sem
          hugetlbfs_file_mmap() is called
           attempt to grab ->i_mutex and block waiting for A to give it up
       Thread A:
        page fault handling is blocked on the attempt to grab task_A->mm->mmap_sem,
       which happens to be the same thing as task_B->mm->mmap_sem.  Block waiting
       for B to give it up.
      
      AFAIU the i_mutex locking was added to hugetlbfs_read() as per
      http://lkml.indiana.edu/hypermail/linux/kernel/0707.2/3066.html to take
      care of the race between truncate and read.  This patch fixes this by
      looking at page->mapping under lock_page() (find_lock_page()) to ensure
      that the inode didn't get truncated in the range during a parallel read.
      
      Ideally we can extend the patch to make sure we don't increase i_size in
      mmap.  But that will break userspace, because applications will now have
      to use truncate(2) to increase i_size in hugetlbfs.
      
      Based on the original patch from Hillf Danton.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@kernel.org>		[everything after 2007 :)]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a05b0855
    • hugetlbfs: fix hugetlb_get_unmapped_area() · 4bfc130d
      Xiao Guangrong authored
      Use/update cached_hole_size and free_area_cache properly to speed up
      finding a free region.
      Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4bfc130d
  19. 21 Mar, 2012 1 commit
  20. 13 Jan, 2012 2 commits
    • mm: compaction: introduce sync-light migration for use by compaction · a6bc32b8
      Mel Gorman authored
      This patch adds a lightweight sync migration mode, MIGRATE_SYNC_LIGHT,
      that avoids writing back pages to backing storage.  Async compaction
      maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
      For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
      used.
      
      This avoids sync compaction stalling for an excessive length of time,
      particularly when copying files to a USB stick where there might be a
      large number of dirty pages backed by a filesystem that does not support
      ->writepages.
      
      [aarcange@redhat.com: This patch is heavily based on Andrea's work]
      [akpm@linux-foundation.org: fix fs/nfs/write.c build]
      [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: determine if dirty pages can be migrated without blocking within ->migratepage · b969c4ab
      Mel Gorman authored
      Asynchronous compaction is used when allocating transparent hugepages to
      avoid blocking for long periods of time.  Due to reports of stalling,
      there was a debate on disabling synchronous compaction but this severely
      impacted allocation success rates.  Part of the reason was that many dirty
      pages are skipped in asynchronous compaction by the following check:
      
      	if (PageDirty(page) && !sync &&
      		mapping->a_ops->migratepage != migrate_page)
      			rc = -EBUSY;
      
      This skips over all mapping aops using buffer_migrate_page() even though
      it is possible to migrate some of these pages without blocking.  This
      patch updates the ->migratepage callback with a "sync" parameter.  It is
      the responsibility of the callback to fail gracefully if migration would
      block.
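
The idea of moving the decision into the callback can be sketched as follows. This is a toy userspace model under stated assumptions: `struct toy_page`, its fields, and `toy_migratepage()` are hypothetical and only mimic the behaviour described above, where the caller no longer rejects dirty pages up front and the callback returns -EBUSY when it would have to wait.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a page whose buffers may be locked by someone else. */
struct toy_page {
	bool dirty;
	bool buffers_busy; /* taking the buffer lock would have to wait */
	bool migrated;
};

/* Hypothetical callback: a dirty page is no longer refused outright.
 * Migration fails only if completing it would block and the caller
 * asked for asynchronous behaviour. */
int toy_migratepage(struct toy_page *page, bool sync)
{
	if (page->buffers_busy) {
		if (!sync)
			return -EBUSY;          /* async: give up rather than wait */
		page->buffers_busy = false; /* sync: model waiting for the lock */
	}
	page->migrated = true;
	return 0;
}
```

In this model a dirty page with idle buffers migrates in async mode, which is the case the old up-front check skipped.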
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. 04 Jan, 2012 6 commits
  22. 02 Nov, 2011 1 commit
  23. 25 Aug, 2011 1 commit
    • lockdep: Add helper function for dir vs file i_mutex annotation · e096d0c7
      Josh Boyer authored
      Purely in-memory filesystems do not use the inode hash as the dcache
      tells us if an entry already exists.  As a result, they do not call
      unlock_new_inode, and thus directory inodes do not get put into a
      different lockdep class for i_mutex.
      
      We need the different lockdep classes, because the locking order for
      i_mutex is different for directory inodes and regular inodes.  Directory
      inodes can do "readdir()", which takes i_mutex *before* possibly taking
      mm->mmap_sem (due to a page fault while copying the directory entry to
      user space).
      
      In contrast, regular inodes can be mmap'ed, which takes mm->mmap_sem
      before accessing i_mutex.
      
      The two cases can never happen for the same inode, so no real deadlock
      can occur, but without the different lockdep classes, lockdep cannot
      understand that.  As a result, if CONFIG_DEBUG_LOCK_ALLOC is set, this
      can lead to false positives from lockdep like below:
      
          find/645 is trying to acquire lock:
           (&mm->mmap_sem){++++++}, at: [<ffffffff81109514>] might_fault+0x5c/0xac
      
          but task is already holding lock:
           (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffff81149f34>]
          vfs_readdir+0x5b/0xb4
      
          which lock already depends on the new lock.
      
          the existing dependency chain (in reverse order) is:
      
          -> #1 (&sb->s_type->i_mutex_key#15){+.+.+.}:
                [<ffffffff8108ac26>] lock_acquire+0xbf/0x103
                [<ffffffff814db822>] __mutex_lock_common+0x4c/0x361
                [<ffffffff814dbc46>] mutex_lock_nested+0x40/0x45
                [<ffffffff811daa87>] hugetlbfs_file_mmap+0x82/0x110
                [<ffffffff81111557>] mmap_region+0x258/0x432
                [<ffffffff811119dd>] do_mmap_pgoff+0x2ac/0x306
                [<ffffffff81111b4f>] sys_mmap_pgoff+0x118/0x16a
                [<ffffffff8100c858>] sys_mmap+0x22/0x24
                [<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b
      
          -> #0 (&mm->mmap_sem){++++++}:
                [<ffffffff8108a4bc>] __lock_acquire+0xa1a/0xcf7
                [<ffffffff8108ac26>] lock_acquire+0xbf/0x103
                [<ffffffff81109541>] might_fault+0x89/0xac
                [<ffffffff81149cff>] filldir+0x6f/0xc7
                [<ffffffff811586ea>] dcache_readdir+0x67/0x205
                [<ffffffff81149f54>] vfs_readdir+0x7b/0xb4
                [<ffffffff8114a073>] sys_getdents+0x7e/0xd1
                [<ffffffff814e3ec2>] system_call_fastpath+0x16/0x1b
      
      This patch moves the directory vs file lockdep annotation into a helper
      function that can be called by in-memory filesystems and has hugetlbfs
      call it.
      Signed-off-by: Josh Boyer <jwboyer@redhat.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  24. 26 Jul, 2011 1 commit