    mm: fix aio performance regression for database caused by THP
    Khalid Aziz authored
    commit 7cb2ef56e6a8b7b368b2e883a0a47d02fed66911 upstream.
    
    I am working with a tool that simulates an Oracle database I/O workload.
    This tool (orion to be specific -
    <http://docs.oracle.com/cd/E11882_01/server.112/e16638/iodesign.htm#autoId24>)
    allocates hugetlbfs pages using shmget() with the SHM_HUGETLB flag.  It
    then does aio into these pages from flash disks using various common
    block sizes used by databases.  I am looking at performance with two of the most
    common block sizes - 1M and 64K.  aio performance with these two block
    sizes plunged after Transparent HugePages was introduced in the kernel.
    Here are performance numbers:
    
    		pre-THP		2.6.39		3.11-rc5
    1M read		8384 MB/s	5629 MB/s	6501 MB/s
    64K read	7867 MB/s	4576 MB/s	4251 MB/s
    
    I have narrowed the performance impact down to the overheads introduced by
    THP in __get_page_tail() and put_compound_page() routines.  perf top shows
    >40% of cycles being spent in these two routines.  Every time direct I/O
    to hugetlbfs pages starts, the kernel calls get_page() to grab a reference
    to the pages and calls put_page() when the I/O completes to drop the
    reference.  THP introduced a significant amount of locking overhead to get_page()
    and put_page() when dealing with compound pages because hugepages can be
    split underneath get_page() and put_page().  It added this overhead
    irrespective of whether it is dealing with hugetlbfs pages or transparent
    hugepages.  This resulted in a 20%-45% drop in aio performance when using
    hugetlbfs pages.
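
    To make the pattern concrete, here is a rough sketch (this is not the
    actual fs/direct-io.c code, and the function names are made up for
    illustration) of how each direct I/O pins every page it touches before
    the bios are issued and unpins them on completion; with THP enabled,
    every one of these get_page()/put_page() calls on a compound page goes
    through the locking slow path:

    #include <linux/mm.h>

    /* Hypothetical illustration of the refcounting pattern described above. */
    static void dio_start_sketch(struct page **pages, int nr_pages)
    {
            int i;

            /* Pin each page before the I/O is issued. */
            for (i = 0; i < nr_pages; i++)
                    get_page(pages[i]);
    }

    static void dio_complete_sketch(struct page **pages, int nr_pages)
    {
            int i;

            /* Drop the references once the I/O has completed. */
            for (i = 0; i < nr_pages; i++)
                    put_page(pages[i]);
    }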
    
    Since hugetlbfs pages cannot be split, there is no reason to go through
    all the locking overhead for these pages from what I can see.  I added
    code to __get_page_tail() and put_compound_page() to bypass all the
    locking code when working with hugetlbfs pages.  This improved performance
    significantly.  Performance numbers with this patch:
    
    		pre-THP		3.11-rc5	3.11-rc5 + Patch
    1M read		8384 MB/s	6501 MB/s	8371 MB/s
    64K read	7867 MB/s	4251 MB/s	6510 MB/s
    
    Performance with 64K reads is still lower than it was before THP, but
    this is a 53% improvement over unpatched 3.11-rc5.  It does mean there is
    more work to be done, but I will take a 53% improvement for now.
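
    For reference, the core idea looks roughly like this (a simplified
    conceptual sketch, not the actual mm/swap.c diff; the *_placeholder
    helper name is made up).  PageHuge() is true only for hugetlbfs compound
    pages, which can never be split, so the reference can be taken directly
    without the compound_lock protocol:

    /* Conceptual sketch only -- the real change is more involved. */
    static bool get_page_tail_sketch(struct page *page)
    {
            struct page *page_head = compound_head(page);

            if (PageHuge(page_head)) {
                    /* hugetlbfs: no split possible, take the reference directly */
                    get_huge_page_tail(page);
                    atomic_inc(&page_head->_count);
                    return true;
            }

            /* THP: keep the existing compound_lock-protected slow path */
            return get_page_tail_locked_placeholder(page);
    }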
    
    Please take a look at the following patch and let me know if it looks
    reasonable.
    
    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: Pravin B Shelar <pshelar@nicira.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Mel Gorman <mel@csn.ul.ie>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Andi Kleen <andi@firstfloor.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>