android_kernel_google_msm/mm
Michal Hocko 6c80480d01 mm, gup: close FOLL MAP_PRIVATE race
commit 19be0eaffa3ac7d8eb6784ad9bdbc7d67ed8e619 upstream.

faultin_page drops FOLL_WRITE after the page fault handler did the CoW
and then we retry follow_page_mask to get our CoWed page. This is racy,
however, because the page might have been unmapped by that time, so
we would have to do a page fault again, this time without CoW. This
would cause page cache corruption for FOLL_FORCE on MAP_PRIVATE
read-only mappings, with obvious consequences.
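
The racy sequence described above can be sketched as a small userspace
model. This is not the kernel code: the enum, the helper names
(fault_model, follow_model, old_gup_model) and the flag value are
hypothetical and exist only to show how dropping FOLL_WRITE lets the
retry end up on the page cache page when an unmap lands between the CoW
fault and the retry.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag value for this model only. */
#define FOLL_WRITE 0x01u

/* Hypothetical states of one PTE in a read-only MAP_PRIVATE file mapping. */
enum pte_state { PTE_NONE, PTE_PAGECACHE_RO, PTE_COW_ANON_RO };

/* Model of the fault handler: a write fault breaks CoW, a read fault
 * maps the shared page cache page if nothing is mapped. */
static enum pte_state fault_model(enum pte_state pte, unsigned int flags)
{
    if (flags & FOLL_WRITE)
        return PTE_COW_ANON_RO;
    return pte == PTE_NONE ? PTE_PAGECACHE_RO : pte;
}

/* Model of the lookup: fails if nothing is mapped, or if write access is
 * requested (the PTE is never writable in a read-only mapping). */
static bool follow_model(enum pte_state pte, unsigned int flags)
{
    if (pte == PTE_NONE)
        return false;
    return !(flags & FOLL_WRITE);
}

/* Model of the old retry loop. race_unmap simulates a concurrent
 * madvise-style unmap right after the CoW fault. Returns the page the
 * caller ends up writing to. */
static enum pte_state old_gup_model(bool race_unmap)
{
    enum pte_state pte = PTE_PAGECACHE_RO;
    unsigned int flags = FOLL_WRITE;

    while (!follow_model(pte, flags)) {
        pte = fault_model(pte, flags);
        if (pte == PTE_COW_ANON_RO)
            flags &= ~FOLL_WRITE;   /* old behaviour: forget the write intent */
        if (race_unmap) {
            pte = PTE_NONE;         /* the fresh CoWed page is zapped */
            race_unmap = false;
        }
    }
    return pte;
}

/* Self-check of both outcomes. */
static void old_gup_model_check(void)
{
    assert(old_gup_model(false) == PTE_COW_ANON_RO);  /* private copy */
    assert(old_gup_model(true) == PTE_PAGECACHE_RO);  /* shared page */
}
```

Without the race the caller gets the private CoWed copy; with it, the
second fault is a read fault and the caller is handed the page cache
page.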

This is an ancient bug that was actually already fixed once by Linus
eleven years ago in commit 4ceb5db975 ("Fix get_user_pages() race
for write access") but that was then undone due to problems on s390
by commit f33ea7f404 ("fix get_user_pages bug") because s390 didn't
have proper dirty pte tracking until abf09bed3c ("s390/mm: implement
software dirty bits"). As Hugh Dickins pointed out, this wasn't a problem
at the time because madvise relied on mmap_sem for write up until
0a27a14a62 ("mm: madvise avoid exclusive mmap_sem"), but since then we
can race with madvise, which can unmap the fresh CoWed page, or with KSM,
and corrupt the content of the shared page.

This patch is based on Linus' approach of not clearing FOLL_WRITE after
the CoW page fault (aka VM_FAULT_WRITE); instead it introduces FOLL_COW
to note this fact. The flag is then rechecked during follow_pfn_pte to
force the page fault again if we do not see the CoWed page. Linus
suggested checking pte_dirty again, as s390 is OK now, but that would
make backporting to some old kernels harder. So instead let's just make
sure that vm_normal_page sees a pure anonymous page.

This would guarantee we are seeing a real CoW page. Introduce
can_follow_write_pte, which checks pte_write and falls back to
PageAnon on forced write faults which have already passed CoW. Thanks to
Hugh for pointing out that special care has to be taken for KSM pages,
because our CoWed page might have been merged with a KSM one and keep its
PageAnon flag.
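
As a rough illustration of the check described above, here is a
userspace sketch of the decision can_follow_write_pte is said to make.
It is an assumption-laden model, not the patched kernel function: struct
pte_model, the _model suffix and the flag values are hypothetical, and
the boolean fields stand in for pte_write, PageAnon and the KSM test.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values for this model only. */
#define FOLL_WRITE 0x01u
#define FOLL_FORCE 0x10u
#define FOLL_COW   0x4000u

/* Hypothetical summary of what the lookup sees for one mapped page. */
struct pte_model {
    bool writable;  /* stands in for pte_write() */
    bool anon;      /* stands in for PageAnon() */
    bool ksm;       /* page was merged by KSM (still flagged anon) */
};

/* A write lookup may use the page if the PTE is writable, or if this is
 * a forced write that has already gone through CoW and the page is a
 * plain anonymous page that KSM has not merged. */
static bool can_follow_write_pte_model(struct pte_model pte, unsigned int flags)
{
    if (pte.writable)
        return true;
    return (flags & FOLL_FORCE) && (flags & FOLL_COW) &&
           pte.anon && !pte.ksm;
}

/* Self-check of the cases named in the description. */
static void can_follow_write_pte_model_check(void)
{
    unsigned int forced = FOLL_WRITE | FOLL_FORCE;
    struct pte_model pagecache = { false, false, false };
    struct pte_model cowed     = { false, true,  false };
    struct pte_model merged    = { false, true,  true  };

    assert(!can_follow_write_pte_model(pagecache, forced | FOLL_COW));
    assert(!can_follow_write_pte_model(cowed, forced));  /* no CoW yet */
    assert(can_follow_write_pte_model(cowed, forced | FOLL_COW));
    assert(!can_follow_write_pte_model(merged, forced | FOLL_COW));
}
```

In this model a page cache page is rejected even with FOLL_COW set, so a
lookup that lost the race faults again with write intent rather than
returning the shared page.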

Change-Id: I164802be6d757c7a49b57416dfc9f4605ce0e1fb
Fixes: 0a27a14a62 ("mm: madvise avoid exclusive mmap_sem")
Reported-by: Phil "not Paul" Oester <kernel@linuxace.com>
Disclosed-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
[bwh: Backported to 3.2:
 - Adjust filename, context, indentation
 - The 'no_page' exit path in follow_page() is different, so open-code the
   cleanup
 - Delete a now-unused label]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Zefan Li <lizefan@huawei.com>
2016-11-12 18:53:37 -07:00
backing-dev.c bdi: use deferable timer for sync_supers task 2013-02-27 18:16:50 -08:00
bootmem.c mm: sparse: fix usemap allocation above node descriptor section 2016-10-29 23:12:12 +08:00
bounce.c
cleancache.c
compaction.c cma: fix watermark checking 2013-03-15 17:06:38 -07:00
debug-pagealloc.c
dmapool.c
fadvise.c
failslab.c
filemap.c fs: introduce inode operation ->update_time 2015-07-13 11:17:49 -07:00
filemap_xip.c fs: introduce inode operation ->update_time 2015-07-13 11:17:49 -07:00
fremap.c
highmem.c
huge_memory.c
hugetlb.c
hwpoison-inject.c
init-mm.c
internal.h cma: fix watermark checking 2013-03-15 17:06:38 -07:00
Kconfig mm: mmzone: MIGRATE_CMA migration type added 2013-02-27 18:14:01 -08:00
Kconfig.debug
kmemcheck.c
kmemleak-test.c
kmemleak.c
ksm.c ksm: Provide support to use deferred timers for scanner thread 2016-10-29 23:12:17 +08:00
maccess.c
madvise.c mm: add a field to store names for private anonymous memory 2013-10-11 10:02:06 -07:00
Makefile mm: compaction: export some of the functions 2013-02-27 18:13:58 -08:00
memblock.c
memcontrol.c
memory-failure.c mm: page_isolation: MIGRATE_CMA isolation functions added 2013-02-27 18:14:02 -08:00
memory.c mm, gup: close FOLL MAP_PRIVATE race 2016-11-12 18:53:37 -07:00
memory_hotplug.c mm: page_isolation: MIGRATE_CMA isolation functions added 2013-02-27 18:14:02 -08:00
mempolicy.c mm: fix anon vma naming 2016-10-29 23:12:35 +08:00
mempool.c
migrate.c
mincore.c
mlock.c mm: reorder can_do_mlock to fix audit denial 2015-06-16 23:08:46 -07:00
mm_init.c
mmap.c FROMLIST: mm: mmap: Add new /proc tunable for mmap_base ASLR. 2016-10-29 23:12:40 +08:00
mmu_context.c
mmu_notifier.c
mmzone.c
mprotect.c mm: add a field to store names for private anonymous memory 2013-10-11 10:02:06 -07:00
mremap.c
msync.c
nobootmem.c
nommu.c
oom_kill.c mm, oom: make dump_tasks public 2014-11-18 15:13:25 -08:00
page-writeback.c mm: fix calculation of dirtyable memory 2016-10-29 23:12:16 +08:00
page_alloc.c mm: workaround for widevine playback failed 2013-05-22 07:57:36 +00:00
page_cgroup.c
page_io.c
page_isolation.c mm: page_isolation: MIGRATE_CMA isolation functions added 2013-02-27 18:14:02 -08:00
pagewalk.c
percpu-km.c
percpu-vm.c
percpu.c
pgtable-generic.c
prio_tree.c
process_vm_access.c
quicklist.c
readahead.c mm: change initial readahead window size calculation 2016-10-29 23:12:18 +08:00
rmap.c
shmem.c
slab.c
slob.c
slub.c slub: fix a memory leak in get_partial_node() 2013-03-15 17:09:26 -07:00
sparse-vmemmap.c
sparse.c
swap.c
swap_state.c
swapfile.c
thrash.c
truncate.c
util.c nick kvfree() from apparmor 2014-11-18 15:13:23 -08:00
vmalloc.c
vmscan.c mm: vmscan: clear kswapd's special reclaim powers before exiting 2016-10-29 23:12:33 +08:00
vmstat.c mm: make counts of CMA free pages correct 2013-03-07 15:23:58 -08:00