android_kernel_google_msm/arch
Michal Hocko c3a0afddee mm: hugetlbfs: correctly populate shared pmd
commit eb48c07146 upstream.

Each page mapped in a process's address space must be correctly
accounted for in _mapcount.  Normally the rules for this are
straightforward but hugetlbfs page table sharing is different.  The page
table pages at the PMD level are reference counted while the mapcount
remains the same.

If this accounting is wrong, it causes bugs like this one reported by
Larry Woodman:

  kernel BUG at mm/filemap.c:135!
  invalid opcode: 0000 [#1] SMP
  CPU 22
  Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
  Pid: 18001, comm: mpitest Tainted: G        W    3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
  RIP: 0010:[<ffffffff8112cfed>]  [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170
  Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
  Call Trace:
    delete_from_page_cache+0x40/0x80
    truncate_hugepages+0x115/0x1f0
    hugetlbfs_evict_inode+0x18/0x30
    evict+0x9f/0x1b0
    iput_final+0xe3/0x1e0
    iput+0x3e/0x50
    d_kill+0xf8/0x110
    dput+0xe2/0x1b0
    __fput+0x162/0x240

During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
shared page tables with the check dst_pte == src_pte.  The logic is if
the PMD page is the same, they must be shared.  This assumes that the
sharing is between the parent and child.  However, if the sharing is
with a different process entirely then this check fails as in this
diagram:

  parent
    |
    ------------>pmd
                 src_pte----------> data page
                                        ^
  other--------->pmd--------------------|
                  ^
  child-----------|
                 dst_pte

For this situation to occur, it must be possible for Parent and Other to
have faulted and failed to share page tables with each other.  This is
possible due to the following style of race.

  PROC A                                          PROC B
  copy_hugetlb_page_range                         copy_hugetlb_page_range
    src_pte == huge_pte_offset                      src_pte == huge_pte_offset
    !src_pte so no sharing                          !src_pte so no sharing

  (time passes)

  hugetlb_fault                                   hugetlb_fault
    huge_pte_alloc                                  huge_pte_alloc
      huge_pmd_share                                 huge_pmd_share
        LOCK(i_mmap_mutex)
        find nothing, no sharing
        UNLOCK(i_mmap_mutex)
                                                      LOCK(i_mmap_mutex)
                                                      find nothing, no sharing
                                                      UNLOCK(i_mmap_mutex)
      pmd_alloc                                       pmd_alloc
      LOCK(instantiation_mutex)
      fault
      UNLOCK(instantiation_mutex)
                                                  LOCK(instantiation_mutex)
                                                  fault
                                                  UNLOCK(instantiation_mutex)

These two processes are not poing to the same data page but are not
sharing page tables because the opportunity was missed.  When either
process later forks, the src_pte == dst pte is potentially insufficient.
As the check falls through, the wrong PTE information is copied in
(harmless but wrong) and the mapcount is bumped for a page mapped by a
shared page table leading to the BUG_ON.

This patch addresses the issue by moving pmd_alloc into huge_pmd_share
which guarantees that the shared pud is populated in the same critical
section as pmd.  This also means that huge_pte_offset test in
huge_pmd_share is serialized correctly now which in turn means that the
success of the sharing will be higher as the racing tasks see the pud
and pmd populated together.

Race identified and changelog written mostly by Mel Gorman.

{akpm@linux-foundation.org: attempt to make the huge_pmd_share() comment comprehensible, clean up coding style]
Reported-by: Larry Woodman <lwoodman@redhat.com>
Tested-by: Larry Woodman <lwoodman@redhat.com>
Reviewed-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Ken Chen <kenchen@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-09-14 10:00:06 -07:00
..
alpha alpha: Don't export SOCK_NONBLOCK to user space. 2012-09-14 10:00:06 -07:00
arm ARM: imx: select CPU_FREQ_TABLE when needed 2012-09-14 10:00:05 -07:00
avr32 avr32: fix nop compile fails from system.h split up 2012-04-04 08:23:44 -07:00
blackfin blackfin: fix ifdef fustercluck in mach-bf538/boards/ezkit.c 2012-04-26 14:46:51 -04:00
c6x irq: Kill pointless irqd_to_hw export 2012-04-10 22:39:17 -06:00
cris Merge branch 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2012-03-29 18:12:23 -07:00
frv frv: delete incorrect task prototypes causing compile fail 2012-05-17 18:00:51 -07:00
h8300 Merge branch 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2012-03-29 18:12:23 -07:00
hexagon hexagon: add missing cpu.h include 2012-04-23 12:57:24 -05:00
ia64 random: remove rand_initialize_irq() 2012-08-15 08:10:29 -07:00
m32r Merge branch 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2012-03-29 18:12:23 -07:00
m68k m68k: Correct the Atari ALLOWINT definition 2012-08-09 08:31:53 -07:00
microblaze microblaze: Do not select GENERIC_GPIO by default 2012-06-10 00:36:05 +09:00
mips posix_types.h: Cleanup stale __NFDBITS and related definitions 2012-08-09 08:31:39 -07:00
mn10300 mn10300/CPU hotplug: Add missing call to notify_cpu_starting() 2012-05-15 18:16:57 -07:00
openrisc Disintegrate and delete asm/system.h 2012-03-28 15:58:21 -07:00
parisc PARISC: fix TLB fault path on PA2.0 narrow systems 2012-06-10 00:36:07 +09:00
powerpc powerpc/85xx: use the BRx registers to enable indirect mode on the P1022DS 2012-08-09 08:31:27 -07:00
s390 s390/compat: fix mmap compat system calls 2012-08-26 15:00:39 -07:00
score Delete all instances of asm/system.h 2012-03-28 18:30:03 +01:00
sh sh: Fix up tracepoint build fallout from static key introduction. 2012-04-27 11:12:38 +09:30
sparc KEYS: Use the compat keyctl() syscall wrapper on Sparc64 for Sparc32 compat 2012-06-01 15:18:16 +08:00
tile tile: fix bug where fls(0) was not returning 0 2012-06-01 15:18:27 +08:00
um um: Implement a custom pte_same() function 2012-06-01 15:18:18 +08:00
unicore32 Merge branch 'for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping 2012-04-04 17:13:43 -07:00
x86 mm: hugetlbfs: correctly populate shared pmd 2012-09-14 10:00:06 -07:00
xtensa xtensa: fix build fail on undefined ack_bad_irq 2012-04-26 18:35:32 -04:00
.gitignore
Kconfig Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile 2012-03-29 14:49:45 -07:00