android_kernel_samsung_msm8226/mm
Mel Gorman 88f5acf88a mm: page allocator: adjust the per-cpu counter threshold when memory is low
Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory
is low") noted that watermarks were based on the vmstat NR_FREE_PAGES.  To
avoid synchronization overhead, these counters are maintained on a per-cpu
basis and drained both periodically and when a threshold is above a
threshold.  On large CPU systems, the difference between the estimate and
real value of NR_FREE_PAGES can be very high.  The system can get into a
case where pages are allocated far below the min watermark potentially
causing livelock issues.  The commit solved the problem by taking a better
reading of NR_FREE_PAGES when memory was low.

Unfortately, as reported by Shaohua Li this accurate reading can consume a
large amount of CPU time on systems with many sockets due to cache line
bouncing.  This patch takes a different approach.  For large machines
where counter drift might be unsafe and while kswapd is awake, the per-cpu
thresholds for the target pgdat are reduced to limit the level of drift to
what should be a safe level.  This incurs a performance penalty in heavy
memory pressure by a factor that depends on the workload and the machine
but the machine should function correctly without accidentally exhausting
all memory on a node.  There is an additional cost when kswapd wakes and
sleeps but the event is not expected to be frequent - in Shaohua's test
case, there was one recorded sleep and wake event at least.

To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
introduced that takes a more accurate reading of NR_FREE_PAGES when called
from wakeup_kswapd, when deciding whether it is really safe to go back to
sleep in sleeping_prematurely() and when deciding if a zone is really
balanced or not in balance_pgdat().  We are still using an expensive
function but limiting how often it is called.

When the test case is reproduced, the time spent in the watermark
functions is reduced.  The following report is on the percentage of time
spent cumulatively spent in the functions zone_nr_free_pages(),
zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
zone_page_state_snapshot(), zone_page_state().

vanilla                      11.6615%
disable-threshold            0.2584%

David said:

: We had to pull aa454840 "mm: page allocator: calculate a better estimate
: of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
: internally because tests showed that it would cause the machine to stall
: as the result of heavy kswapd activity.  I merged it back with this fix as
: it is pending in the -mm tree and it solves the issue we were seeing, so I
: definitely think this should be pushed to -stable (and I would seriously
: consider it for 2.6.37 inclusion even at this late date).

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reported-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Tested-by: Nicolas Bareil <nico@chdir.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: <stable@kernel.org>		[2.6.37.1, 2.6.36.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-13 17:32:31 -08:00
..
backing-dev.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-10-26 17:58:44 -07:00
bootmem.c x86, memblock: Replace e820_/_early string with memblock_ 2010-08-27 11:13:47 -07:00
bounce.c bounce: call flush_dcache_page() after bounce_copy_vec() 2010-09-09 18:57:25 -07:00
compaction.c mm/compaction.c: avoid double mem_cgroup_del_lru() 2010-12-22 19:43:33 -08:00
debug-pagealloc.c
dmapool.c mm: add a might_sleep_if() to dma_pool_alloc() 2010-10-26 16:52:08 -07:00
fadvise.c readahead: introduce FMODE_RANDOM for POSIX_FADV_RANDOM 2010-03-06 11:26:25 -08:00
failslab.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
filemap.c fs: dcache remove dcache_lock 2011-01-07 17:50:23 +11:00
filemap_xip.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
fremap.c Avoid pgoff overflow in remap_file_pages 2010-09-25 09:34:58 -07:00
highmem.c mm,x86: fix kmap_atomic_push vs ioremap_32.c 2010-10-27 18:03:05 -07:00
hugetlb.c mm/hugetlb.c: avoid double unlock_page() in hugetlb_fault() 2010-12-02 14:51:14 -08:00
hwpoison-inject.c HWPOISON, hugetlb: support hwpoison injection for hugepage 2010-08-11 09:23:11 +02:00
init-mm.c mm: provide init_mm mm_context initializer 2010-08-09 20:44:54 -07:00
internal.h mm: fix is_mem_section_removable() page_order BUG_ON check 2010-10-26 16:52:11 -07:00
Kconfig Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2010-10-22 17:31:36 -07:00
Kconfig.debug trivial: improve help text for mm debug config options 2009-09-21 15:14:57 +02:00
kmemcheck.c kmemcheck: Fix build errors due to missing slab.h 2010-03-30 22:02:32 +09:00
kmemleak-test.c
kmemleak.c kmemleak: Fix typo in the comment 2010-08-08 21:57:23 +01:00
ksm.c ksm: annotate ksm_thread_mutex is no deadlock source 2010-12-02 14:51:15 -08:00
maccess.c MN10300: Save frame pointer in thread_info struct rather than global var 2010-10-27 17:29:01 +01:00
madvise.c HWPOISON: Add a madvise() injector for soft page offlining 2009-12-16 12:20:00 +01:00
Makefile percpu: use percpu allocator on UP too 2010-10-02 10:26:05 +03:00
memblock.c memblock: Annotate memblock functions with __init_memblock 2010-10-11 16:00:52 -07:00
memcontrol.c memcg: fix wrong VM_BUG_ON() in try_charge()'s mm->owner check 2010-12-30 10:07:06 -08:00
memory-failure.c mem-hotplug: introduce {un}lock_memory_hotplug() 2010-12-02 14:51:15 -08:00
memory.c use clear_page()/copy_page() in favor of memset()/memcpy() on whole pages 2010-10-26 16:52:13 -07:00
memory_hotplug.c mem-hotplug: introduce {un}lock_memory_hotplug() 2010-12-02 14:51:15 -08:00
mempolicy.c mm/mempolicy.c: add rcu read lock to protect pid structure 2010-12-02 14:51:14 -08:00
mempool.c mm: remove broken 'kzalloc' mempool 2009-09-22 07:17:35 -07:00
migrate.c mm/migrate.c: fix compilation error 2010-12-22 19:43:34 -08:00
mincore.c mincore: do nested page table walks 2010-05-25 08:06:58 -07:00
mlock.c mm: Move vma_stack_continue into mm.h 2010-09-09 09:05:06 -07:00
mm_init.c
mmap.c install_special_mapping skips security_file_mmap check. 2010-12-15 12:30:36 -08:00
mmu_context.c exit: fix oops in sync_mm_rss 2010-03-24 16:31:21 -07:00
mmu_notifier.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
mmzone.c mm: page allocator: adjust the per-cpu counter threshold when memory is low 2011-01-13 17:32:31 -08:00
mprotect.c perf_events: Fix perf_counter_mmap() hook in mprotect() 2010-11-09 10:19:38 -08:00
mremap.c mm: remove pte_*map_nested() 2010-10-26 16:52:08 -07:00
msync.c sanitize vfs_fsync calling conventions 2010-05-21 18:31:21 -04:00
nommu.c nommu: Provide stubbed alloc/free_vm_area() implementation. 2010-12-24 12:08:30 +09:00
oom_kill.c oom: kill all threads sharing oom killed task's mm 2010-10-26 16:52:05 -07:00
page-writeback.c Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2011-01-13 10:05:56 -08:00
page_alloc.c mm: page allocator: adjust the per-cpu counter threshold when memory is low 2011-01-13 17:32:31 -08:00
page_cgroup.c kmemleak: Annotate false positive in init_section_page_cgroup() 2010-07-19 11:54:14 +01:00
page_io.c block: unify flags for struct bio and struct request 2010-08-07 18:20:39 +02:00
page_isolation.c mm: page_isolation: codeclean fix comment and rm unneeded val init 2010-10-26 16:52:11 -07:00
pagewalk.c mm: remove call to find_vma in pagewalk for non-hugetlbfs 2010-11-25 06:50:46 +09:00
percpu-km.c percpu: clear memory allocated with the km allocator 2010-10-02 10:28:42 +03:00
percpu-vm.c percpu: move vmalloc based chunk management into percpu-vm.c 2010-05-01 08:30:50 +02:00
percpu.c Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2011-01-13 10:05:56 -08:00
prio_tree.c
quicklist.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
readahead.c readahead.c: fix comment 2010-05-25 08:07:00 -07:00
rmap.c mm/rmap.c: fix comment 2010-12-27 15:14:06 +01:00
shmem.c fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
slab.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 2011-01-10 08:38:01 -08:00
slob.c kernel: kmem_ptr_validate considered harmful 2011-01-07 17:50:16 +11:00
slub.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 2011-01-10 08:38:01 -08:00
sparse-vmemmap.c tree-wide: fix comment/printk typos 2010-11-01 15:38:34 -04:00
sparse.c sparsemem: on no vmemmap path put mem_map on node high too 2010-05-25 08:06:56 -07:00
swap.c fuse: use release_pages() 2010-10-27 18:03:17 -07:00
swap_state.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
swapfile.c block: make blkdev_get/put() handle exclusive access 2010-11-13 11:55:17 +01:00
thrash.c
truncate.c Call the filesystem back whenever a page is removed from the page cache 2010-12-02 09:55:21 -05:00
util.c kernel: kmem_ptr_validate considered harmful 2011-01-07 17:50:16 +11:00
vmalloc.c vmalloc: eagerly clear ptes on vunmap 2010-12-02 14:51:15 -08:00
vmscan.c mm: page allocator: adjust the per-cpu counter threshold when memory is low 2011-01-13 17:32:31 -08:00
vmstat.c mm: page allocator: adjust the per-cpu counter threshold when memory is low 2011-01-13 17:32:31 -08:00