Commit graph

5907 commits

Author SHA1 Message Date
Hugh Dickins
ec5924204b tmpfs: change final i_blocks BUG to WARNING
commit 0f3c42f522 upstream.

Under a particular load on one machine, I have hit shmem_evict_inode()'s
BUG_ON(inode->i_blocks), enough times to narrow it down to a particular
race between swapout and eviction.

It comes from the "if (freed > 0)" asymmetry in shmem_recalc_inode(),
and the lack of coherent locking between mapping's nrpages and shmem's
swapped count.  There's a window in shmem_writepage(), between lowering
nrpages in shmem_delete_from_page_cache() and then raising swapped
count, when the freed count appears to be +1 when it should be 0, and
then the asymmetry stops it from being corrected with -1 before hitting
the BUG.

One answer is coherent locking: using tree_lock throughout, without
info->lock; reasonable, but the raw_spin_lock in percpu_counter_add() on
used_blocks makes that messier than expected.  Another answer may be a
further effort to eliminate the weird shmem_recalc_inode() altogether,
but previous attempts at that failed.

So far undecided, but for now change the BUG_ON to WARN_ON: in usual
circumstances it remains a useful consistency check.
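
A minimal sketch of the resulting check; only the assertion in
shmem_evict_inode() changes, the surrounding eviction code is elided:

	/* in shmem_evict_inode(); context elided */
	WARN_ON(inode->i_blocks);	/* was: BUG_ON(inode->i_blocks); */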

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26 11:37:47 -08:00
Michal Hocko
e5c4ee6a08 memcg: oom: fix totalpages calculation for memory.swappiness==0
commit 9a5a8f19b4 upstream.

oom_badness() takes a totalpages argument which says how many pages are
available and it uses it as a base for the score calculation.  The value
is calculated by mem_cgroup_get_limit(), which considers both the limit and
total_swap_pages (or, respectively, the memsw portion of it).

This is usually correct but since fe35004fbf ("mm: avoid swapping out
with swappiness==0") we do not swap when swappiness is 0, which means
that we cannot really use up all of the totalpages pages.  This in turn
confuses the oom score calculation if the memcg limit is much smaller
than the available swap, because the used memory (capped by the limit)
is negligible compared to totalpages, so the resulting score is too small
if adj!=0 (typically a task with CAP_SYS_ADMIN or a non-zero
oom_score_adj).  The wrong process might be selected as a result.

The problem can be worked around by checking mem_cgroup_swappiness==0
and not considering swap at all in such a case.
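
A sketch of that workaround in mem_cgroup_get_limit() terms; it follows the
description above and 3.x-era memcg helpers, but is illustrative rather than
the exact patch (declarations elided):

	limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
	if (mem_cgroup_swappiness(memcg)) {
		u64 memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);

		/* only count swap when this memcg is actually allowed to swap */
		limit += total_swap_pages << PAGE_SHIFT;
		limit = min(limit, memsw);	/* memsw limit caps memory+swap */
	}
	return limit;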

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26 11:37:45 -08:00
Takamori Yamaguchi
3f874ecc49 mm: bugfix: set current->reclaim_state to NULL while returning from kswapd()
commit b0a8cc58e6 upstream.

In kswapd(), set current->reclaim_state to NULL before returning, as
current->reclaim_state holds reference to variable on kswapd()'s stack.

In rare cases, while returning from kswapd() during memory offlining,
__free_slab() and free_pages() can access the dangling pointer of
current->reclaim_state.
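
A minimal sketch of the fix, assuming only that reclaim_state lives on
kswapd()'s stack frame as described:

	/* at the end of kswapd(), before the kernel thread returns */
	current->reclaim_state = NULL;	/* don't leave a dangling pointer to the stack */
	return 0;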

Signed-off-by: Takamori Yamaguchi <takamori.yamaguchi@jp.sony.com>
Signed-off-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26 11:37:19 -08:00
Jan Kara
71a36b53c8 mm: fix XFS oops due to dirty pages without buffers on s390
commit ef5d437f71 upstream.

On s390 any write to a page (even from kernel itself) sets architecture
specific page dirty bit.  Thus when a page is written to via buffered
write, HW dirty bit gets set and when we later map and unmap the page,
page_remove_rmap() finds the dirty bit and calls set_page_dirty().

Dirtying of a page which shouldn't be dirty can cause all sorts of
problems to filesystems.  The bug we observed in practice is that
buffers from the page get freed, so when the page gets later marked as
dirty and writeback writes it, XFS crashes due to an assertion
BUG_ON(!PagePrivate(page)) in page_buffers() called from
xfs_count_page_state().

A similar problem can also happen when a zero_user_segment() call from
xfs_vm_writepage() (or block_write_full_page() for that matter) sets the
hardware dirty bit during writeback; the buffers are later freed, and the
page is then unmapped.

Fix the issue by ignoring s390 HW dirty bit for page cache pages of
mappings with mapping_cap_account_dirty().  This is safe because for
such mappings when a page gets marked as writeable in PTE it is also
marked dirty in do_wp_page() or do_page_fault().  When the dirty bit is
cleared by clear_page_dirty_for_io(), the page gets writeprotected in
page_mkclean().  So pagecache page is writeable if and only if it is
dirty.

Thanks to Hugh Dickins for pointing out mapping has to have
mapping_cap_account_dirty() for things to work and proposing a cleaned
up variant of the patch.

The patch has survived about two hours of running fsx-linux on tmpfs
while heavily swapping and several days of running on our build machines
where the original problem was triggered.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-31 10:02:56 -07:00
Yinghai Lu
368845fde9 x86, mm: Trim memory in memblock to be page aligned
commit 6ede1fd3cb upstream.

We will not map partial pages, so we need to make sure memblock
allocations will not hand those bytes out.

We will also use for_each_mem_pfn_range() to loop over and map the
memory ranges, to keep the two consistent.
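
A rough sketch of the trimming, per memory region; the shape is
illustrative rather than the exact upstream helper:

	/* keep only the whole pages of each memblock memory region */
	start = round_up(r->base, PAGE_SIZE);
	end   = round_down(r->base + r->size, PAGE_SIZE);
	if (start < end) {
		r->base = start;
		r->size = end - start;
	} else {
		/* the region does not cover a full page: drop it */
		memblock_remove(r->base, r->size);
	}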

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/CAE9FiQVZirvaBMFYRfXMmWEcHbKSicQEHz4VAwUv0xFCk51ZNw@mail.gmail.com
Acked-by: Jacob Shin <jacob.shin@amd.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-31 10:02:56 -07:00
Hugh Dickins
530258fcea tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking
commit 35c2a7f490 upstream.

Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(),
	u64 inum = fid->raw[2];
which is unhelpfully reported as at the end of shmem_alloc_inode():

BUG: unable to handle kernel paging request at ffff880061cd3000
IP: [<ffffffff812190d0>] shmem_alloc_inode+0x40/0x40
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Call Trace:
 [<ffffffff81488649>] ? exportfs_decode_fh+0x79/0x2d0
 [<ffffffff812d77c3>] do_handle_open+0x163/0x2c0
 [<ffffffff812d792c>] sys_open_by_handle_at+0xc/0x10
 [<ffffffff83a5f3f8>] tracesys+0xe1/0xe6

Right, tmpfs is being stupid to access fid->raw[2] before validating that
fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may
fall at the end of a page, and the next page not be present.

But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being
careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and
could oops in the same way: add the missing fh_len checks to those.
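
Taking tmpfs as the example, the added check is of this shape (a sketch;
the other filesystems get equivalent guards for the fields they
dereference):

	/* sketch: shmem_fh_to_dentry(), before touching fid->raw[2] */
	if (fh_len < 3)		/* fid->raw[2] must actually be part of the handle */
		return NULL;

	inum = fid->raw[2];
	inum = (inum << 32) | fid->raw[1];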

Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Sage Weil <sage@inktank.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-21 09:27:57 -07:00
Mel Gorman
d9302b128e mempolicy: fix a memory corruption by refcount imbalance in alloc_pages_vma()
commit 00442ad04a upstream.

Commit cc9a6c8776 ("cpuset: mm: reduce large amounts of memory barrier
related damage v3") introduced a potential memory corruption.
shmem_alloc_page() uses a pseudo vma and it has one significant unique
combination, vma->vm_ops=NULL and vma->policy->flags & MPOL_F_SHARED.

get_vma_policy() does NOT increase a policy ref when vma->vm_ops=NULL
and mpol_cond_put() DOES decrease a policy ref when a policy has
MPOL_F_SHARED.  Therefore, when a cpuset update race occurs,
alloc_pages_vma() falls into the 'goto retry_cpuset' path, decrements the
reference count and frees the policy prematurely.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Christoph Lameter <cl@linux.com>
Cc: Josh Boyer <jwboyer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-13 05:38:57 +09:00
KOSAKI Motohiro
04ea8a8388 mempolicy: fix refcount leak in mpol_set_shared_policy()
commit 63f74ca21f upstream.

When shared_policy_replace() fails to allocate, new->policy is not freed
correctly by mpol_set_shared_policy().  The problem is that the shared
mempolicy code directly calls kmem_cache_free() in multiple places, where
it is easy to make a mistake.

This patch creates an sp_free() wrapper function and uses it.  The bug was
introduced in the pre-git age (IOW, before 2.6.12-rc2).
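
A sketch of the wrapper; this mirrors the description above, with sn_cache
being the existing slab cache for sp_node:

static void sp_free(struct sp_node *n)
{
	mpol_put(n->policy);
	kmem_cache_free(sn_cache, n);
}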

[mgorman@suse.de: Edited changelog]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Christoph Lameter <cl@linux.com>
Cc: Josh Boyer <jwboyer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-13 05:38:57 +09:00
Mel Gorman
04a30bd9dc mempolicy: fix a race in shared_policy_replace()
commit b22d127a39 upstream.

shared_policy_replace()'s use of sp_alloc() is unsafe.  1) sp_node cannot
be dereferenced if sp->lock is not held and 2) another thread can modify
sp_node between the spin_unlock for allocating a new sp node and the next
spin_lock.  The bug was introduced before 2.6.12-rc2.

Kosaki's original patch for this problem was to allocate an sp node and
policy within shared_policy_replace and initialise it when the lock is
reacquired.  I was not keen on this approach because it partially
duplicates sp_alloc().  As the paths where sp->lock is taken are not that
performance critical, this patch converts sp->lock to sp->mutex so it can
sleep when calling sp_alloc().

[kosaki.motohiro@jp.fujitsu.com: Original patch]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Cc: Josh Boyer <jwboyer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-13 05:38:56 +09:00
KOSAKI Motohiro
33efe2910b mempolicy: remove mempolicy sharing
commit 869833f2c5 upstream.

Dave Jones' system call fuzz testing tool "trinity" triggered the
following bug error with slab debugging enabled

    =============================================================================
    BUG numa_policy (Not tainted): Poison overwritten
    -----------------------------------------------------------------------------

    INFO: 0xffff880146498250-0xffff880146498250. First byte 0x6a instead of 0x6b
    INFO: Allocated in mpol_new+0xa3/0x140 age=46310 cpu=6 pid=32154
     __slab_alloc+0x3d3/0x445
     kmem_cache_alloc+0x29d/0x2b0
     mpol_new+0xa3/0x140
     sys_mbind+0x142/0x620
     system_call_fastpath+0x16/0x1b

    INFO: Freed in __mpol_put+0x27/0x30 age=46268 cpu=6 pid=32154
     __slab_free+0x2e/0x1de
     kmem_cache_free+0x25a/0x260
     __mpol_put+0x27/0x30
     remove_vma+0x68/0x90
     exit_mmap+0x118/0x140
     mmput+0x73/0x110
     exit_mm+0x108/0x130
     do_exit+0x162/0xb90
     do_group_exit+0x4f/0xc0
     sys_exit_group+0x17/0x20
     system_call_fastpath+0x16/0x1b

    INFO: Slab 0xffffea0005192600 objects=27 used=27 fp=0x          (null) flags=0x20000000004080
    INFO: Object 0xffff880146498250 @offset=592 fp=0xffff88014649b9d0

The problem is that the structure is being prematurely freed due to a
reference count imbalance.  In the following case mbind(addr, len) should
replace the memory policies of both vma1 and vma2, and thus they will
come to share the same mempolicy, and the new mempolicy will have the
MPOL_F_SHARED flag.

  +-------------------+-------------------+
  |     vma1          |     vma2(shmem)   |
  +-------------------+-------------------+
  |                                       |
 addr                                 addr+len

alloc_pages_vma() uses the get_vma_policy() and mpol_cond_put() pair for
maintaining the mempolicy reference count.  The current rule is that
get_vma_policy() only increments the refcount for a shmem VMA and
mpol_cond_put() only decrements the refcount if the policy has
MPOL_F_SHARED.

In the above case, vma1 is not a shmem vma, yet vma->policy has
MPOL_F_SHARED!  The reference count will therefore be decreased even
though it was not increased whenever alloc_pages_vma() is called.  This
has been broken since commit [52cd3b07: mempolicy: rework mempolicy
Reference Counting] in 2008.

There is another serious bug with the sharing of memory policies.
Currently, the mempolicy rebind logic (called from cpuset rebinding)
ignores the refcount of the mempolicy and overrides it forcibly.  Thus,
any mempolicy sharing may cause mempolicy corruption.  The bug was
introduced by commit [68860ec1: cpusets: automatic numa mempolicy
rebinding].

Ideally, the shared policy handling would be rewritten to either
properly handle COW of the policy structures or at least reference count
MPOL_F_SHARED based exclusively on information within the policy.
However, this patch takes the easier approach of disabling any policy
sharing between VMAs.  Each new range allocated with sp_alloc will
allocate a new policy, set the reference count to 1 and drop the
reference count of the old policy.  This increases the memory footprint
but is not expected to be a major problem as mbind() is unlikely to be
used for fine-grained ranges.  It is also inefficient because it means
we allocate a new policy even in cases where mbind_range() could use the
new_policy passed to it.  However, it is more straightforward and the
change should be invisible to the user.

[mgorman@suse.de: Edited changelog]
Reported-by: Dave Jones <davej@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Josh Boyer <jwboyer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-13 05:38:56 +09:00
KOSAKI Motohiro
14cfd9f986 revert "mm: mempolicy: Let vma_merge and vma_split handle vma->vm_policy linkages"
commit 8d34694c1a upstream.

Commit 05f144a0d5 ("mm: mempolicy: Let vma_merge and vma_split handle
vma->vm_policy linkages") removed the vma->vm_policy update code, but that
is the purpose of mbind_range().  Now mbind_range() is virtually a no-op,
and while it does not allow memory corruption it is not the right fix.
This patch is a revert.

[mgorman@suse.de: Edited changelog]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Josh Boyer <jwboyer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-13 05:38:56 +09:00
Hugh Dickins
6a91e161a7 mm: fix invalidate_complete_page2() lock ordering
commit ec4d9f626d upstream.

In fuzzing with trinity, lockdep protested "possible irq lock inversion
dependency detected" when isolate_lru_page() reenabled interrupts while
still holding the supposedly irq-safe tree_lock:

invalidate_inode_pages2
  invalidate_complete_page2
    spin_lock_irq(&mapping->tree_lock)
    clear_page_mlock
      isolate_lru_page
        spin_unlock_irq(&zone->lru_lock)

isolate_lru_page() is correct to enable interrupts unconditionally:
invalidate_complete_page2() is incorrect to call clear_page_mlock() while
holding tree_lock, which is supposed to nest inside lru_lock.

Both truncate_complete_page() and invalidate_complete_page() call
clear_page_mlock() before taking tree_lock to remove page from radix_tree.
 I guess invalidate_complete_page2() preferred to test PageDirty (again)
under tree_lock before committing to the munlock; but since the page has
already been unmapped, its state is already somewhat inconsistent, and no
worse if clear_page_mlock() moved up.
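
A sketch of the reordering in invalidate_complete_page2(); only the
relative order of the munlock and the tree_lock changes:

	/* do the munlock before taking the irq-safe tree_lock */
	clear_page_mlock(page);			/* may take/release zone->lru_lock */
	spin_lock_irq(&mapping->tree_lock);
	if (PageDirty(page))
		goto failed;
	/* ... remove the page from the radix tree, as before ... */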

Reported-by: Sasha Levin <levinsasha928@gmail.com>
Deciphered-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-13 05:38:51 +09:00
Michal Hocko
26fc07d8da hugetlb: do not use vma_hugecache_offset() for vma_prio_tree_foreach
commit 36e4f20af8 upstream.

Commit 0c176d52b0 ("mm: hugetlb: fix pgoff computation when unmapping
page from vma") fixed the pgoff calculation but replaced it with
vma_hugecache_offset(), which is not appropriate for the offsets used by
vma_prio_tree_foreach(), because that one expects an index in page units
rather than in huge_page_shift units.

Johannes said:

: The resulting index may not be too big, but it can be too small: assume
: hpage size of 2M and the address to unmap to be 0x200000.  This is regular
: page index 512 and hpage index 1.  If you have a VMA that maps the file
: only starting at the second huge page, that VMAs vm_pgoff will be 512 but
: you ask for offset 1 and miss it even though it does map the page of
: interest.  hugetlb_cow() will try to unmap, miss the vma, and retry the
: cow until the allocation succeeds or the skipped vma(s) go away.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-13 05:38:50 +09:00
Michael Wang
750025ab19 slab: fix the DEADLOCK issue on l3 alien lock
commit 947ca1856a upstream.

A DEADLOCK will be reported while running a kernel with NUMA and LOCKDEP
enabled; the sequence behind this false report is:

	   kmem_cache_free()	//free obj in cachep
	-> cache_free_alien()	//acquire cachep's l3 alien lock
	-> __drain_alien_cache()
	-> free_block()
	-> slab_destroy()
	-> kmem_cache_free()	//free slab in cachep->slabp_cache
	-> cache_free_alien()	//acquire cachep->slabp_cache's l3 alien lock

Since cachep's and cachep->slabp_cache's l3 alien locks are in the same
lock class, a false report is generated.

This should not happen, since we already have init_lock_keys(), which
reassigns the lock classes for both the l3 list and l3 alien locks.

However, init_lock_keys() was invoked at the wrong position, namely before
we invoke enable_cpucache() on each cache.

Until slab_state is set to FULL we do not invoke enable_cpucache() on
caches while creating them, so their l3 alien caches are not built yet.
Therefore, although we invoked init_lock_keys(), the l3 alien lock classes
do not change, because the alien caches do not exist until
enable_cpucache() is invoked later.

This patch invokes init_lock_keys() after we have done enable_cpucache(),
instead of before, to avoid the false DEADLOCK report.

Michael traced the problem back to a commit in release 3.0.0:

commit 30765b92ad
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Thu Jul 28 23:22:56 2011 +0200

    slab, lockdep: Annotate the locks before using them

    Fernando found we hit the regular OFF_SLAB 'recursion' before we
    annotate the locks, cure this.

    The relevant portion of the stack-trace:

    > [    0.000000]  [<c085e24f>] rt_spin_lock+0x50/0x56
    > [    0.000000]  [<c04fb406>] __cache_free+0x43/0xc3
    > [    0.000000]  [<c04fb23f>] kmem_cache_free+0x6c/0xdc
    > [    0.000000]  [<c04fb2fe>] slab_destroy+0x4f/0x53
    > [    0.000000]  [<c04fb396>] free_block+0x94/0xc1
    > [    0.000000]  [<c04fc551>] do_tune_cpucache+0x10b/0x2bb
    > [    0.000000]  [<c04fc8dc>] enable_cpucache+0x7b/0xa7
    > [    0.000000]  [<c0bd9d3c>] kmem_cache_init_late+0x1f/0x61
    > [    0.000000]  [<c0bba687>] start_kernel+0x24c/0x363
    > [    0.000000]  [<c0bba0ba>] i386_start_kernel+0xa9/0xaf

    Reported-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
    Acked-by: Pekka Enberg <penberg@kernel.org>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptop
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

The commit moved init_lock_keys() before we build up the alien caches, so
we failed to reclassify them.

Acked-by: Christoph Lameter <cl@linux.com>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-13 05:38:37 +09:00
Satoru Moriya
49194d4e7d mm: avoid swapping out with swappiness==0
commit fe35004fbf upstream.

Sometimes we'd like to avoid swapping out anonymous memory.  In
particular, we'd like to avoid swapping out the pages of an important
process or process group while there is a reasonable amount of pagecache
in RAM, so that we can satisfy our customers' requirements.

OTOH, we can control how aggressively the kernel swaps memory pages with
/proc/sys/vm/swappiness for global reclaim and
/sys/fs/cgroup/memory/memory.swappiness for each memcg.

But with the current reclaim implementation, the kernel may swap out even
if we set swappiness=0 and there is pagecache in RAM.

This patch changes the behavior with swappiness==0.  If we set
swappiness==0, the kernel does not swap out at all (for global reclaim,
until the amount of free pages and file-backed pages in a zone has been
reduced to something very, very small: nr_free + nr_filebacked < high
watermark).
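
Conceptually, the reclaim decision becomes something like the sketch
below; the identifiers nr_free, nr_file and scan_anon are placeholders
loosely modeled on the bookkeeping in get_scan_count(), not the exact
code:

	/* placeholder sketch of the swappiness==0 rule */
	if (global_reclaim(sc) && nr_free + nr_file <= high_wmark_pages(zone))
		scan_anon = true;	/* almost no page cache left: must scan anon */
	else if (vm_swappiness == 0)
		scan_anon = false;	/* swappiness==0: leave anon alone */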

Signed-off-by: Satoru Moriya <satoru.moriya@hds.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-02 10:30:38 -07:00
Yinghai Lu
391c314928 mm: sparse: fix usemap allocation above node descriptor section
commit 99ab7b1944 upstream.

After commit f5bf18fa22 ("bootmem/sparsemem: remove limit constraint
in alloc_bootmem_section"), usemap allocations may easily be placed
outside the optimal section that holds the node descriptor, even if
there is space available in that section.  This results in unnecessary
hotplug dependencies that need to have the node unplugged before the
section holding the usemap.

The reason is that the bootmem allocator doesn't guarantee a linear
search starting from the passed allocation goal but may start out at a
much higher address absent an upper limit.

Fix this by trying the allocation with the limit at the section end,
then retry without if that fails.  This keeps the fix from f5bf18fa22
of not panicking if the allocation does not fit in the section, but
still makes sure to try to stay within the section at first.
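
A sketch of the retry, using a hypothetical alloc_bootmem_limited()
helper to stand in for the bootmem call that accepts an upper limit:

	/* first try to stay below the end of the section that holds the
	 * node descriptor, then fall back to an unrestricted allocation */
	usemap = alloc_bootmem_limited(pgdat, size, goal, section_end);
	if (!usemap)
		usemap = alloc_bootmem_limited(pgdat, size, goal, 0 /* no limit */);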

[rewritten massively by Johannes to apply to 3.4]

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-02 10:30:36 -07:00
qiuxishi
967c0e4ab9 memory hotplug: fix section info double registration bug
commit f14851af0e upstream.

There may be a bug when registering section info.  For example, on my
Itanium platform, the pfn range of node0 includes the other nodes, so the
other nodes' section info will be registered twice, and the memmap's page
count will equal 3.

  node0: start_pfn=0x100,    spanned_pfn=0x20fb00, present_pfn=0x7f8a3, => 0x000100-0x20fc00
  node1: start_pfn=0x80000,  spanned_pfn=0x80000,  present_pfn=0x80000, => 0x080000-0x100000
  node2: start_pfn=0x100000, spanned_pfn=0x80000,  present_pfn=0x80000, => 0x100000-0x180000
  node3: start_pfn=0x180000, spanned_pfn=0x80000,  present_pfn=0x80000, => 0x180000-0x200000

  free_all_bootmem_node()
	register_page_bootmem_info_node()
		register_page_bootmem_info_section()

When hot-removing memory, we can't free the memmap's page because
page_count() is 2 after put_page_bootmem().

  sparse_remove_one_section()
	free_section_usemap()
		free_map_bootmem()
			put_page_bootmem()

[akpm@linux-foundation.org: add code comment]
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-02 10:30:06 -07:00
Li Haifeng
aa7994f281 mm/page_alloc: fix the page address of higher page's buddy calculation
commit 0ba8f2d593 upstream.

The heuristic method for finding the buddy was introduced by commit
43506fad21 ("mm/page_alloc.c: simplify calculation of combined index
of adjacent buddy lists").  But the page address of the higher page's
buddy was wrongly calculated, which leads page_is_buddy() to fail forever.
IOW, the heuristic method is effectively disabled by the wrong page
address of the higher page's buddy.

Calculating the page address of the higher page's buddy should be based on
higher_page plus the offset between the index of the higher page and the
index of the higher page's buddy.
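
A sketch of the corrected arithmetic in __free_one_page() (3.x naming,
based on my reading of the fix); the last line previously used page
rather than higher_page as the base:

	combined_idx = buddy_idx & page_idx;
	higher_page  = page + (combined_idx - page_idx);
	buddy_idx    = __find_buddy_index(combined_idx, order + 1);
	higher_buddy = higher_page + (buddy_idx - combined_idx);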

Signed-off-by: Haifeng Li <omycle@gmail.com>
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KyongHo Cho <pullip.cho@samsung.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-02 10:30:05 -07:00
Dave Jones
05d71a5a25 Remove user-triggerable BUG from mpol_to_str
commit 80de7c3138 upstream.

Trivially triggerable, found by trinity:

  kernel BUG at mm/mempolicy.c:2546!
  Process trinity-child2 (pid: 23988, threadinfo ffff88010197e000, task ffff88007821a670)
  Call Trace:
    show_numa_map+0xd5/0x450
    show_pid_numa_map+0x13/0x20
    traverse+0xf2/0x230
    seq_read+0x34b/0x3e0
    vfs_read+0xac/0x180
    sys_pread64+0xa2/0xc0
    system_call_fastpath+0x1a/0x1f
  RIP: mpol_to_str+0x156/0x360

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-09-14 10:00:22 -07:00
Mel Gorman
49ca240411 mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
commit d833352a43 upstream.

If a process creates a large hugetlbfs mapping that is eligible for page
table sharing and forks heavily with children some of whom fault and
others which destroy the mapping then it is possible for page tables to
get corrupted.  Some teardowns of the mapping encounter a "bad pmd" and
output a message to the kernel log.  The final teardown will trigger a
BUG_ON in mm/filemap.c.

This was reproduced in 3.4 but is known to have existed for a long time
and goes back at least as far as 2.6.37.  It was probably introduced in
2.6.20 by [39dde65c: shared page table for hugetlb page].  The messages
look like this:

[  ..........] Lots of bad pmd messages followed by this
[  127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
[  127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
[  127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
[  127.186778] ------------[ cut here ]------------
[  127.186781] kernel BUG at mm/filemap.c:134!
[  127.186782] invalid opcode: 0000 [#1] SMP
[  127.186783] CPU 7
[  127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
[  127.186801]
[  127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
[  127.186804] RIP: 0010:[<ffffffff810ed6ce>]  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[  127.186809] RSP: 0000:ffff8804144b5c08  EFLAGS: 00010002
[  127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
[  127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
[  127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
[  127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
[  127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
[  127.186815] FS:  00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
[  127.186816] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
[  127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
[  127.186821] Stack:
[  127.186822]  ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
[  127.186824]  ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
[  127.186825]  ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
[  127.186827] Call Trace:
[  127.186829]  [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
[  127.186832]  [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
[  127.186834]  [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
[  127.186837]  [<ffffffff811655c7>] evict+0xa7/0x1b0
[  127.186839]  [<ffffffff811657a3>] iput_final+0xd3/0x1f0
[  127.186840]  [<ffffffff811658f9>] iput+0x39/0x50
[  127.186842]  [<ffffffff81162708>] d_kill+0xf8/0x130
[  127.186843]  [<ffffffff81162812>] dput+0xd2/0x1a0
[  127.186845]  [<ffffffff8114e2d0>] __fput+0x170/0x230
[  127.186848]  [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
[  127.186849]  [<ffffffff8114e3ad>] fput+0x1d/0x30
[  127.186851]  [<ffffffff81117db7>] remove_vma+0x37/0x80
[  127.186853]  [<ffffffff81119182>] do_munmap+0x2d2/0x360
[  127.186855]  [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
[  127.186857]  [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
[  127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
[  127.186868] RIP  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
[  127.186870]  RSP <ffff8804144b5c08>
[  127.186871] ---[ end trace 7cbac5d1db69f426 ]---

The bug is a race and not always easy to reproduce.  To reproduce it I was
doing the following on a single socket I7-based machine with 16G of RAM.

$ hugeadm --pool-pages-max DEFAULT:13G
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
$ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
$ for i in `seq 1 9000`; do ./hugetlbfs-test; done

On my particular machine, it usually triggers within 10 minutes but
enabling debug options can change the timing such that it never hits.
Once the bug is triggered, the machine is in trouble and needs to be
rebooted.  The machine will respond but processes accessing proc like "ps
aux" will hang due to the BUG_ON.  shutdown will also hang and needs a
hard reset or a sysrq-b.

The basic problem is a race between page table sharing and teardown.  For
the most part page table sharing depends on i_mmap_mutex.  In some cases,
it is also taking the mm->page_table_lock for the PTE updates but with
shared page tables, it is the i_mmap_mutex that is more important.

Unfortunately it appears to also be insufficient.  Consider the following
situation:

Process A					Process B
---------					---------
hugetlb_fault					shmdt
  						LockWrite(mmap_sem)
    						  do_munmap
						    unmap_region
						      unmap_vmas
						        unmap_single_vma
						          unmap_hugepage_range
      						            Lock(i_mmap_mutex)
							    Lock(mm->page_table_lock)
							    huge_pmd_unshare/unmap tables <--- (1)
							    Unlock(mm->page_table_lock)
      						            Unlock(i_mmap_mutex)
  huge_pte_alloc				      ...
    Lock(i_mmap_mutex)				      ...
    vma_prio_walk, find svma, spte		      ...
    Lock(mm->page_table_lock)			      ...
    share spte					      ...
    Unlock(mm->page_table_lock)			      ...
    Unlock(i_mmap_mutex)			      ...
  hugetlb_no_page									  <--- (2)
						      free_pgtables
						        unlink_file_vma
							hugetlb_free_pgd_range
						    remove_vma_list

In this scenario, it is possible for Process A to share page tables with
Process B that is trying to tear them down.  The i_mmap_mutex on its own
does not prevent Process A walking Process B's page tables.  At (1) above,
the page tables are not shared yet so it unmaps the PMDs.  Process A sets
up page table sharing and at (2) faults a new entry.  Process B then trips
up on it in free_pgtables.

This patch fixes the problem by adding a new function
__unmap_hugepage_range_final that is only called when the VMA is about to
be destroyed.  This function clears VM_MAYSHARE during
unmap_hugepage_range() under the i_mmap_mutex.  This makes the VMA
ineligible for sharing and avoids the race.  Superficially this looks like
it would then be vulnerable to truncate and madvise issues but hugetlbfs
has its own truncate handlers so does not use unmap_mapping_range() and
does not support madvise(DONTNEED).
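
A sketch of the new entry point, assuming the unmap_hugepage_range()
signature used by this stable series:

void __unmap_hugepage_range_final(struct vm_area_struct *vma,
		unsigned long start, unsigned long end, struct page *ref_page)
{
	__unmap_hugepage_range(vma, start, end, ref_page);

	/*
	 * Clear VM_MAYSHARE under i_mmap_mutex so that no new page table
	 * sharing can be set up against this dying VMA.
	 */
	vma->vm_flags &= ~VM_MAYSHARE;
}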

This should be treated as a -stable candidate if it is merged.

Test program is as follows. The test case was mostly written by Michal
Hocko with a few minor changes to reproduce this bug.

==== CUT HERE ====

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>

static size_t huge_page_size = (2UL << 20);
static size_t nr_huge_page_A = 512;
static size_t nr_huge_page_B = 5632;

unsigned int get_random(unsigned int max)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	srandom(tv.tv_usec);
	return random() % max;
}

static void play(void *addr, size_t size)
{
	unsigned char *start = addr,
		      *end = start + size,
		      *a;
	start += get_random(size/2);

	/* we could iterate on huge pages but let's give it more time. */
	for (a = start; a < end; a += 4096)
		*a = 0;
}

int main(int argc, char **argv)
{
	key_t key = IPC_PRIVATE;
	size_t sizeA = nr_huge_page_A * huge_page_size;
	size_t sizeB = nr_huge_page_B * huge_page_size;
	int shmidA, shmidB;
	void *addrA = NULL, *addrB = NULL;
	int nr_children = 300, n = 0;

	if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
		perror("shmget:");
		return 1;
	}

	if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
		perror("shmat");
		return 1;
	}
	if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
		perror("shmget:");
		return 1;
	}

	if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
		perror("shmat");
		return 1;
	}

fork_child:
	switch(fork()) {
		case 0:
			switch (n%3) {
			case 0:
				play(addrA, sizeA);
				break;
			case 1:
				play(addrB, sizeB);
				break;
			case 2:
				break;
			}
			break;
		case -1:
			perror("fork:");
			break;
		default:
			if (++n < nr_children)
				goto fork_child;
			play(addrA, sizeA);
			break;
	}
	shmdt(addrA);
	shmdt(addrB);
	do {
		wait(NULL);
	} while (--n > 0);
	shmctl(shmidA, IPC_RMID, NULL);
	shmctl(shmidB, IPC_RMID, NULL);
	return 0;
}

[akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-15 08:10:31 -07:00
Xiao Guangrong
cc9fdb9cdd mm: mmu_notifier: fix freed page still mapped in secondary MMU
commit 3ad3d901bb upstream.

mmu_notifier_release() is called when the process is exiting.  It will
delete all the mmu notifiers.  But at this time the page belonging to the
process is still present in page tables and is present on the LRU list, so
this race will happen:

      CPU 0                 CPU 1
mmu_notifier_release:    try_to_unmap:
   hlist_del_init_rcu(&mn->hlist);
                            ptep_clear_flush_notify:
                                  mmu notifier not found
                            free page  !!!!!!
                            /*
                             * At the point, the page has been
                             * freed, but it is still mapped in
                             * the secondary MMU.
                             */

  mn->ops->release(mn, mm);

Then the box is not stable and sometimes we can get this bug:

[  738.075923] BUG: Bad page state in process migrate-perf  pfn:03bec
[  738.075931] page:ffffea00000efb00 count:0 mapcount:0 mapping:          (null) index:0x8076
[  738.075936] page flags: 0x20000000000014(referenced|dirty)

The same issue is present in mmu_notifier_unregister().

We can call ->release before deleting the notifier to ensure the page has
been unmapped from the secondary MMU before it is freed.
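
A rough sketch of the reordering in mmu_notifier_release(); locking and
the surrounding loop are elided:

	/* invoke ->release while the notifier is still on the list ... */
	if (mn->ops->release)
		mn->ops->release(mn, mm);
	/* ... and only afterwards unhook it, so a racing try_to_unmap()
	 * still sees the notifier and flushes the secondary MMU */
	hlist_del_init_rcu(&mn->hlist);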

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-15 08:10:08 -07:00
Joonsoo Kim
ce6f0ff8d1 mm: fix wrong argument of migrate_huge_pages() in soft_offline_huge_page()
commit dc32f63453 upstream.

Commit a6bc32b899 ("mm: compaction: introduce sync-light migration for
use by compaction") changed the declaration of migrate_pages() and
migrate_huge_pages().

But it missed changing the argument of migrate_huge_pages() in
soft_offline_huge_page().  In this case, we should call
migrate_huge_pages() with MIGRATE_SYNC.

Additionally, there is a mismatch between the type of the argument and the
function declaration for migrate_pages().

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-15 08:10:05 -07:00
Tony Luck
7b689c5d93 x86/mce: Fix siginfo_t->si_addr value for non-recoverable memory faults
commit 6751ed65dc upstream.

In commit dad1743e59 ("x86/mce: Only restart instruction after machine
check recovery if it is safe") we fixed mce_notify_process() to force a
signal to the current process if it was not restartable (RIPV bit not
set in MCG_STATUS). But doing it here means that the process doesn't
get told the virtual address of the fault via siginfo_t->si_addr. This
would prevent application level recovery from the fault.

Make a new MF_MUST_KILL flag bit for memory_failure() et al. to use so
that we will provide the right information with the signal.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09 08:31:29 -07:00
Aaditya Kumar
07aa70120e mm: fix lost kswapd wakeup in kswapd_stop()
commit 1c7e7f6c07 upstream.

Offlining memory may block forever, waiting for kswapd() to wake up
because kswapd() does not check the event kthread->should_stop before
sleeping.

The proper pattern, from Documentation/memory-barriers.txt, is:

   ---  waker  ---
   event_indicated = 1;
   wake_up_process(event_daemon);

   ---  sleeper  ---
   for (;;) {
      set_current_state(TASK_UNINTERRUPTIBLE);
      if (event_indicated)
         break;
      schedule();
   }

   set_current_state() may be wrapped by:
      prepare_to_wait();

In the kswapd() case, event_indicated is kthread->should_stop.

  === offlining memory (waker) ===
   kswapd_stop()
      kthread_stop()
         kthread->should_stop = 1
         wake_up_process()
         wait_for_completion()

  ===  kswapd_try_to_sleep (sleeper) ===
   kswapd_try_to_sleep()
      prepare_to_wait()
           .
           .
      schedule()
           .
           .
      finish_wait()

The schedule() needs to be protected by a test of kthread->should_stop,
which is wrapped by kthread_should_stop().
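
A sketch of the guarded sleep in kswapd_try_to_sleep(), following the
pattern above:

	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
	if (!kthread_should_stop())	/* kswapd_stop() may already be waiting */
		schedule();
	finish_wait(&pgdat->kswapd_wait, &wait);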

Reproducer:
   Do heavy file I/O in background.
   Do a memory offline/online in a tight loop

Signed-off-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-29 08:04:19 -07:00
Yinghai Lu
7ad71f960f memblock: free allocated memblock_reserved_regions later
commit 29f6738609 upstream.

memblock_free_reserved_regions() calls memblock_free(), but
memblock_free() could double reserved.regions too, so we could end up
freeing the old range that backs reserved.regions.

Also tj said there is another bug which could be related to this.

| I don't think we're saving any noticeable
| amount by doing this "free - give it to page allocator - reserve
| again" dancing.  We should just allocate regions aligned to page
| boundaries and free them later when memblock is no longer in use.

In that case, with DEBUG_PAGEALLOC enabled, we get a panic:

     memblock_free: [0x0000102febc080-0x0000102febf080] memblock_free_reserved_regions+0x37/0x39
  BUG: unable to handle kernel paging request at ffff88102febd948
  IP: [<ffffffff836a5774>] __next_free_mem_range+0x9b/0x155
  PGD 4826063 PUD cf67a067 PMD cf7fa067 PTE 800000102febd160
  Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  CPU 0
  Pid: 0, comm: swapper Not tainted 3.5.0-rc2-next-20120614-sasha #447
  RIP: 0010:[<ffffffff836a5774>]  [<ffffffff836a5774>] __next_free_mem_range+0x9b/0x155

See the discussion at https://lkml.org/lkml/2012/6/13/469

So try to allocate with PAGE_SIZE alignment and free it later.

Reported-by: Sasha Levin <levinsasha928@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-16 09:04:45 -07:00
David Rientjes
7a08b440fa mm, thp: abort compaction if migration page cannot be charged to memcg
commit 4bf2bba375 upstream.

If page migration cannot charge the temporary page to the memcg,
migrate_pages() will return -ENOMEM.  This isn't considered in memory
compaction however, and the loop continues to iterate over all
pageblocks trying to isolate and migrate pages.  If a small number of
very large memcgs happen to be oom, however, these attempts will mostly
be futile, leading to an enormous amount of cpu consumption due to the
page migration failures.

This patch will short circuit and fail memory compaction if
migrate_pages() returns -ENOMEM.  COMPACT_PARTIAL is returned in case
some migrations were successful so that the page allocator will retry.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-16 09:04:44 -07:00
Jiang Liu
0e343dbe08 memory hotplug: fix invalid memory access caused by stale kswapd pointer
commit d8adde17e5 upstream.

kswapd_stop() is called to destroy the kswapd work thread when all memory
of a NUMA node has been offlined.  But kswapd_stop() only terminates the
work thread without resetting NODE_DATA(nid)->kswapd to NULL.  The stale
pointer will prevent kswapd_run() from creating a new work thread when
adding memory to the memory-less NUMA node again.  Eventually the stale
pointer may cause invalid memory access.
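
A sketch of the fix in kswapd_stop(); resetting the per-node pointer lets
kswapd_run() create a new thread once the node gains memory again:

void kswapd_stop(int nid)
{
	struct task_struct *kswapd = NODE_DATA(nid)->kswapd;

	if (kswapd) {
		kthread_stop(kswapd);
		NODE_DATA(nid)->kswapd = NULL;	/* clear the stale pointer */
	}
}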

An example stack dump is below.  It was reproduced with 2.6.32, but the
latest kernel has the same issue.

  BUG: unable to handle kernel NULL pointer dereference at (null)
  IP: [<ffffffff81051a94>] exit_creds+0x12/0x78
  PGD 0
  Oops: 0000 [#1] SMP
  last sysfs file: /sys/devices/system/memory/memory391/state
  CPU 11
  Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod
  Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285
  RIP: 0010:exit_creds+0x12/0x78
  RSP: 0018:ffff8806044f1d78  EFLAGS: 00010202
  RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502
  RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
  RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10
  R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000
  R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001
  FS:  00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600)
  Stack:
   ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e
   ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1
   0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003
  Call Trace:
    __put_task_struct+0x5d/0x97
    kthread_stop+0x50/0x58
    offline_pages+0x324/0x3da
    memory_block_change_state+0x179/0x1db
    store_mem_state+0x9e/0xbb
    sysfs_write_file+0xd0/0x107
    vfs_write+0xad/0x169
    sys_write+0x45/0x6e
    system_call_fastpath+0x16/0x1b
  Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 <8b> 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0
  RIP  exit_creds+0x12/0x78
   RSP <ffff8806044f1d78>
  CR2: 0000000000000000

[akpm@linux-foundation.org: add pglist_data.kswapd locking comments]
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-16 09:04:44 -07:00
Andy Lutomirski
0c4ad5cc8c mm: Hold a file reference in madvise_remove
commit 9ab4233dd0 upstream.

Otherwise the code races with munmap (causing a use-after-free
of the vma) or with close (causing a use-after-free of the struct
file).

The bug was introduced by commit 90ed52ebe4 ("[PATCH] holepunch: fix
mmap_sem i_mutex deadlock")
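
A sketch of the fix as it looks in this backport (madvise_remove()
calling vmtruncate_range()); setup of mapping, offset and endoff is
elided:

	struct file *f = vma->vm_file;

	get_file(f);				/* pin the struct file ... */
	up_read(&current->mm->mmap_sem);	/* ... before dropping mmap_sem */
	error = vmtruncate_range(mapping->host, offset, endoff);
	fput(f);
	down_read(&current->mm->mmap_sem);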

Cc: Hugh Dickins <hugh@veritas.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2:
 - Adjust context
 - madvise_remove() calls vmtruncate_range(), not do_fallocate()]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-16 09:04:43 -07:00
Eric Dumazet
2c07f25ea7 splice: fix racy pipe->buffers uses
commit 047fe36052 upstream.

Dave Jones reported a kernel BUG at mm/slub.c:3474! triggered
by splice_shrink_spd() called from vmsplice_to_pipe()

commit 35f3d14dbb (pipe: add support for shrinking and growing pipes)
added capability to adjust pipe->buffers.

The problem is that some paths don't hold the pipe mutex and assume
pipe->buffers doesn't change for their duration.

Fix this by adding nr_pages_max field in struct splice_pipe_desc, and
use it in place of pipe->buffers where appropriate.

splice_shrink_spd() loses its struct pipe_inode_info argument.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Tom Herbert <therbert@google.com>
Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[bwh: Backported to 3.2:
 - Adjust context in vmsplice_to_pipe()
 - Update one more call to splice_shrink_spd(), from skb_splice_bits()]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-16 09:04:42 -07:00
Greg Pearson
757fcf2b6d mm/memblock: fix overlapping allocation when doubling reserved array
commit 48c3b583bb upstream.

__alloc_memory_core_early() asks memblock for a range of memory, then
tries to reserve it.  If the reserved region array lacks space for the new
range, memblock_double_array() is called to allocate more space for the
array.  If memblock is used to allocate memory for the new array it can
end up using a range that overlaps with the range originally allocated in
__alloc_memory_core_early(), leading to possible data corruption.

With this patch memblock_double_array() now calls memblock_find_in_range()
with a narrowed candidate range (in cases where the reserved.regions array
is being doubled) so any memory allocated will not overlap with the
original range that was being reserved.  The range is narrowed by passing
in the starting address and size of the previously allocated range.  Then
the range above the ending address is searched and if a candidate is not
found, the range below the starting address is searched.

Signed-off-by: Greg Pearson <greg.pearson@hp.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-16 09:04:27 -07:00
Gavin Shan
659586507a mm/memblock: fix memory leak on extending regions
commit 181eb39425 upstream.

The overall memblock has been organized into the memory regions and
reserved regions.  Initially, the memory regions and reserved regions are
stored in the predetermined arrays of "struct memblock _region".  It's
possible for the arrays to be enlarged when we have newly added regions
but no free space left there.  The policy here is to create a double-sized
array, either by the slab allocator or the memblock allocator.
Unfortunately, we didn't free the old array, which might have been
allocated through the slab allocator before.  That would cause a memory
leak.

The patch introduces 2 variables to trace where (slab or memblock) the
memory and reserved regions come from.  The memory for the memory or
reserved regions will be deallocated by kfree() if it was allocated by the
slab allocator.  This fixes the memory leak.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-16 09:04:27 -07:00
Gavin Shan
40e0484473 mm/memblock: cleanup on duplicate VA/PA conversion
commit 4e2f07750d upstream.

The overall memblock has been organized into the memory regions and
reserved regions.  Initially, the memory regions and reserved regions are
stored in the predetermined arrays of "struct memblock _region".  It's
possible for the arrays to be enlarged when we have newly added regions
for them but not enough space there.  In that situation, we create a
double-sized array to meet the requirement.  However, the original
implementation converted the VA (Virtual Address) of the newly allocated
array of regions to PA (Physical Address), then translated it back when
allocating the new array from slab.  That's actually unnecessary.

The patch removes the duplicate VA/PA conversion.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-16 09:04:27 -07:00
Hugh Dickins
78ac34ad31 swap: fix shmem swapping when more than 8 areas
commit 9b15b817f3 upstream.

Minchan Kim reports that when a system has many swap areas, and tmpfs
swaps out to the ninth or more, shmem_getpage_gfp()'s attempts to read
back the page cannot locate it, and the read fails with -ENOMEM.

Whoops.  Yes, I blindly followed read_swap_header()'s pte_to_swp_entry(
swp_entry_to_pte()) technique for determining maximum usable swap
offset, without stopping to realize that that actually depends upon the
pte swap encoding shifting swap offset to the higher bits and truncating
it there.  Whereas our radix_tree swap encoding leaves offset in the
lower bits: it's swap "type" (that is, index of swap area) that was
truncated.

Fix it by reducing the SWP_TYPE_SHIFT() in swapops.h, and removing the
broken radix_to_swp_entry(swp_to_radix_entry()) from read_swap_header().

This does not reduce the usable size of a swap area any further, it
leaves it as claimed when making the original commit: no change from 3.0
on x86_64, nor on i386 without PAE; but 3.0's 512GB is reduced to 128GB
per swapfile on i386 with PAE.  It's not a change I would have risked
five years ago, but with x86_64 supported for ten years, I believe it's
appropriate now.

Hmm, and what if some architecture implements its swap pte with offset
encoded below type? That would equally break the maximum usable swap
offset check.  Happily, they all follow the same tradition of encoding
offset above type, but I'll prepare a check on that for next.
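
As a toy illustration of the truncation (made-up field widths, nothing
like the kernel's real constants): with the offset kept in the low bits,
it is an over-large swap "type" that silently loses its high bits, so the
entry read back no longer names the area the page was written to.

    #include <stdio.h>

    #define ENTRY_BITS  32         /* toy entry width */
    #define OFFSET_BITS 29         /* offset lives in the low bits ... */
    /* ... leaving only 3 bits for the swap "type": areas 0..7 fit, 8+ do not */

    static unsigned int encode(unsigned int type, unsigned int offset)
    {
            unsigned long long e = ((unsigned long long)type << OFFSET_BITS) | offset;

            return (unsigned int)(e & ((1ULL << ENTRY_BITS) - 1));  /* truncation here */
    }

    static unsigned int entry_type(unsigned int e)
    {
            return e >> OFFSET_BITS;
    }

    int main(void)
    {
            unsigned int e = encode(9, 0x1234);  /* swapped out to the ninth-plus area */

            /* Prints "read back type 1": the lost high bit means the page can
             * no longer be located in swap area 9. */
            printf("stored type 9, read back type %u\n", entry_type(e));
            return 0;
    }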

Reported-and-Reviewed-and-Tested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-22 11:36:55 -07:00
Joonsoo Kim
75df3ae206 slub: fix a memory leak in get_partial_node()
commit 02d7633fa5 upstream.

In the case below,

1. acquire slab for the cpu partial list
2. an object is freed to it by a remote cpu
3. page->freelist = t

a memory leak occurs.

Change acquire_slab() not to zap the freelist when it acquires a slab for
the cpu partial list.  I think this is a sufficient fix for the memory leak.
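
A rough sketch of the shape of that change (toy types and names, not the
actual SLUB code): only take ownership of the page's freelist when the
slab is acquired as the cpu slab, and leave it alone when it is being
added to the cpu partial list.

    #include <stddef.h>

    struct toy_page {
            void *freelist;          /* chain of free objects in this slab page */
    };

    /* for_cpu_slab != 0: the cpu slab takes the objects, so zap the freelist.
     * for_cpu_slab == 0: cpu partial list; keep page->freelist so that objects
     * freed to the page by a remote cpu are not lost (the leak described above). */
    static void *toy_acquire_slab(struct toy_page *page, int for_cpu_slab)
    {
            void *freelist = page->freelist;

            if (for_cpu_slab)
                    page->freelist = NULL;
            return freelist;
    }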

Below is output of 'slabinfo -r kmalloc-256'
when './perf stat -r 30 hackbench 50 process 4000 > /dev/null' is done.

***Vanilla***
Sizes (bytes)     Slabs              Debug                Memory
------------------------------------------------------------------------
Object :     256  Total  :     468   Sanity Checks : Off  Total: 3833856
SlabObj:     256  Full   :     111   Redzoning     : Off  Used : 2004992
SlabSiz:    8192  Partial:     302   Poisoning     : Off  Loss : 1828864
Loss   :       0  CpuSlab:      55   Tracking      : Off  Lalig:       0
Align  :       8  Objects:      32   Tracing       : Off  Lpadd:       0

***Patched***
Sizes (bytes)     Slabs              Debug                Memory
------------------------------------------------------------------------
Object :     256  Total  :     300   Sanity Checks : Off  Total: 2457600
SlabObj:     256  Full   :     204   Redzoning     : Off  Used : 2348800
SlabSiz:    8192  Partial:      33   Poisoning     : Off  Loss :  108800
Loss   :       0  CpuSlab:      63   Tracking      : Off  Lalig:       0
Align  :       8  Objects:      32   Tracing       : Off  Lpadd:       0

The Total and Loss numbers show the impact of this patch.

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-10 00:36:11 +09:00
Dave Hansen
58f9c97642 mm: fix vma_resv_map() NULL pointer
commit 4523e14585 upstream.

hugetlb_reserve_pages() can be used either for normal file-backed
hugetlbfs mappings or for MAP_HUGETLB.  In the MAP_HUGETLB (semi-anonymous)
mode, there is no VMA around.  The new call to resv_map_put() assumed that
there was one, and resulted in a NULL pointer dereference:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
  IP: vma_resv_map+0x9/0x30
  PGD 141453067 PUD 1421e1067 PMD 0
  Oops: 0000 [#1] PREEMPT SMP
  ...
  Pid: 14006, comm: trinity-child6 Not tainted 3.4.0+ #36
  RIP: vma_resv_map+0x9/0x30
  ...
  Process trinity-child6 (pid: 14006, threadinfo ffff8801414e0000, task ffff8801414f26b0)
  Call Trace:
    resv_map_put+0xe/0x40
    hugetlb_reserve_pages+0xa6/0x1d0
    hugetlb_file_setup+0x102/0x2c0
    newseg+0x115/0x360
    ipcget+0x1ce/0x310
    sys_shmget+0x5a/0x60
    system_call_fastpath+0x16/0x1b

This was reported by Dave Jones, but was reproducible with the
libhugetlbfs test cases, so shame on me for not running them in the
first place.

With this, the oops is gone, and the output of libhugetlbfs's
run_tests.py is identical to plain 3.4 again.

[ Marked for stable, since this was introduced by commit c50ac05081
  ("hugetlb: fix resv_map leak in error path") which was also marked for
  stable ]

Reported-by: Dave Jones <davej@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-10 00:36:11 +09:00
Dave Hansen
7a0786dcc3 hugetlb: fix resv_map leak in error path
commit c50ac05081 upstream.

When called for anonymous (non-shared) mappings, hugetlb_reserve_pages()
does a resv_map_alloc().  It depends on code in hugetlbfs's
vm_ops->close() to release that allocation.

However, in the mmap() failure path, we do a plain unmap_region() without
the remove_vma() which actually calls vm_ops->close().

This is a decent fix.  This leak could get reintroduced if new code (say,
after hugetlb_reserve_pages() in hugetlbfs_file_mmap()) decides to return
an error.  But, I think it would have to unroll the reservation anyway.

Christoph's test case:

	http://marc.info/?l=linux-mm&m=133728900729735

This patch applies to 3.4 and later.  A version for earlier kernels is at
https://lkml.org/lkml/2012/5/22/418.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: Christoph Lameter <cl@linux.com>
Tested-by: Christoph Lameter <cl@linux.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-10 00:36:09 +09:00
KyongHo
e42adb30f4 mm: fix faulty initialization in vmalloc_init()
commit dbda591d92 upstream.

The transfer of ->flags causes some of the static mapping virtual
addresses to be prematurely freed (before the mapping is removed) because
VM_LAZY_FREE gets "set" if tmp->flags has VM_IOREMAP set.  This might
cause subsequent vmalloc/ioremap calls to fail because they might allocate
one of the freed virtual address ranges that isn't yet unmapped.

va->flags has different types of flags from tmp->flags.  If a region with
VM_IOREMAP set is registered with vm_area_add_early(), it will be removed
by __purge_vmap_area_lazy().

Fix vmalloc_init() to correctly initialize vmap_area for the given
vm_struct.

Also initialise va->vm.  If it is not set, find_vm_area() for the early
vm regions will always fail.

Signed-off-by: KyongHo Cho <pullip.cho@samsung.com>
Cc: "Olav Haugan" <ohaugan@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-10 00:36:06 +09:00
Michal Hocko
319b564166 mm: consider all swapped back pages in used-once logic
commit e48982734e upstream.

Commit 6457474624 ("vmscan: detect mapped file pages used only once")
made mapped pages take another round in the inactive list because they
might be just short lived and so we could consider them again next time.
This heuristic helps to reduce pressure on the active list with streaming
IO workloads.

This patch fixes a regression introduced by this commit for heavy shmem
based workloads because unlike Anon pages, which are excluded from this
heuristic because they are usually long lived, shmem pages are handled
as a regular page cache.

This doesn't work quite well, unfortunately, if the workload is mostly
backed by shmem (in memory database sitting on 80% of memory) with a
streaming IO in the background (backup - up to 20% of memory).  Anon
inactive list is full of (dirty) shmem pages when watermarks are hit.
Shmem pages are kept in the inactive list (they are referenced) in the
first round and it is hard to reclaim anything else so we reach lower
scanning priorities very quickly which leads to an excessive swap out.

Let's fix this by excluding all swap backed pages (they tend to be long
lived wrt. the regular page cache anyway) from the used-once heuristic
and instead activate them if they are referenced.
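
As a simplified sketch of the resulting decision (not the vmscan code,
which weighs several more factors): referenced swap-backed pages are
activated directly, instead of getting the used-once second pass through
the inactive list that regular page cache gets.

    enum reclaim_action { RECLAIM, KEEP_INACTIVE, ACTIVATE };

    static enum reclaim_action check_references(int referenced,
                                                int swap_backed,
                                                int seen_before)
    {
            if (!referenced)
                    return RECLAIM;
            if (swap_backed)
                    return ACTIVATE;     /* shmem/anon: assume long lived */
            /* used-once heuristic for the regular page cache */
            return seen_before ? ACTIVATE : KEEP_INACTIVE;
    }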

The customer's workload is shmem backed database (80% of RAM) and they
are measuring transactions/s with an IO in the background (20%).
Transactions touch more or less random rows in the table.  The
transaction rate fell by a factor of 3 (in the worst case) because of
commit 64574746.  This patch restores the previous numbers.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-10 00:36:06 +09:00
Mel Gorman
ab9094fa03 mm: mempolicy: Let vma_merge and vma_split handle vma->vm_policy linkages
commit 05f144a0d5 upstream.

Dave Jones' system call fuzz testing tool "trinity" triggered the
following bug error with slab debugging enabled

    =============================================================================
    BUG numa_policy (Not tainted): Poison overwritten
    -----------------------------------------------------------------------------

    INFO: 0xffff880146498250-0xffff880146498250. First byte 0x6a instead of 0x6b
    INFO: Allocated in mpol_new+0xa3/0x140 age=46310 cpu=6 pid=32154
     __slab_alloc+0x3d3/0x445
     kmem_cache_alloc+0x29d/0x2b0
     mpol_new+0xa3/0x140
     sys_mbind+0x142/0x620
     system_call_fastpath+0x16/0x1b
    INFO: Freed in __mpol_put+0x27/0x30 age=46268 cpu=6 pid=32154
     __slab_free+0x2e/0x1de
     kmem_cache_free+0x25a/0x260
     __mpol_put+0x27/0x30
     remove_vma+0x68/0x90
     exit_mmap+0x118/0x140
     mmput+0x73/0x110
     exit_mm+0x108/0x130
     do_exit+0x162/0xb90
     do_group_exit+0x4f/0xc0
     sys_exit_group+0x17/0x20
     system_call_fastpath+0x16/0x1b
    INFO: Slab 0xffffea0005192600 objects=27 used=27 fp=0x          (null) flags=0x20000000004080
    INFO: Object 0xffff880146498250 @offset=592 fp=0xffff88014649b9d0

This implied a reference counting bug and the problem happened during
mbind().

mbind() applies a new memory policy to a range and uses mbind_range() to
merge existing VMAs or split them as necessary.  In the event of splits,
mpol_dup() will allocate a new struct mempolicy and maintain existing
reference counts whose rules are documented in
Documentation/vm/numa_memory_policy.txt .

The problem occurs with shared memory policies.  The vm_op->set_policy
increments the reference count if necessary and split_vma() and
vma_merge() have already handled the existing reference counts.
However, policy_vma() screws it up by replacing an existing
vma->vm_policy with one that potentially has the wrong reference count
leading to a premature free.  This patch removes the damage caused by
policy_vma().

With this patch applied Dave's trinity tool runs an mbind test for 5
minutes without error.  /proc/slabinfo reported that there are no
numa_policy or shared_policy_node objects allocated after the test
completed and the shared memory region was deleted.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Dave Jones <davej@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Stephen Wilson <wilsons@start.ca>
Cc: Christoph Lameter <cl@linux.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-01 15:18:19 +08:00
Hugh Dickins
62ade86ab6 memcg,thp: fix res_counter:96 regression
Occasionally, testing memcg's move_charge_at_immigrate on rc7 shows
a flurry of hundreds of warnings at kernel/res_counter.c:96, where
res_counter_uncharge_locked() does WARN_ON(counter->usage < val).

The first trace of each flurry implicates __mem_cgroup_cancel_charge()
of mc.precharge, and an audit of mc.precharge handling points to
mem_cgroup_move_charge_pte_range()'s THP handling in commit 12724850e8
("memcg: avoid THP split in task migration").

Checking !mc.precharge is good everywhere else, when a single page is to
be charged; but here the "mc.precharge -= HPAGE_PMD_NR" that is likely to
follow is liable to result in underflow (a lot can change since the
precharge was estimated).

Simply check against HPAGE_PMD_NR: there's probably a better
alternative, trying precharge for more, splitting if unsuccessful; but
this one-liner is safer for now - no kernel/res_counter.c:96 warnings
seen in 26 hours.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-19 10:10:27 -07:00
majianpeng
02e1a9cd1e slub: missing test for partial pages flush work in flush_all()
I found some kernel messages such as:

    SLUB raid5-md127: kmem_cache_destroy called for cache that still has objects.
    Pid: 6143, comm: mdadm Tainted: G           O 3.4.0-rc6+        #75
    Call Trace:
    kmem_cache_destroy+0x328/0x400
    free_conf+0x2d/0xf0 [raid456]
    stop+0x41/0x60 [raid456]
    md_stop+0x1a/0x60 [md_mod]
    do_md_stop+0x74/0x470 [md_mod]
    md_ioctl+0xff/0x11f0 [md_mod]
    blkdev_ioctl+0xd8/0x7a0
    block_ioctl+0x3b/0x40
    do_vfs_ioctl+0x96/0x560
    sys_ioctl+0x91/0xa0
    system_call_fastpath+0x16/0x1b

Then using kmemleak I found these messages:

    unreferenced object 0xffff8800b6db7380 (size 112):
      comm "mdadm", pid 5783, jiffies 4294810749 (age 90.589s)
      hex dump (first 32 bytes):
        01 01 db b6 ad 4e ad de ff ff ff ff ff ff ff ff  .....N..........
        ff ff ff ff ff ff ff ff 98 40 4a 82 ff ff ff ff  .........@J.....
      backtrace:
        kmemleak_alloc+0x21/0x50
        kmem_cache_alloc+0xeb/0x1b0
        kmem_cache_open+0x2f1/0x430
        kmem_cache_create+0x158/0x320
        setup_conf+0x649/0x770 [raid456]
        run+0x68b/0x840 [raid456]
        md_run+0x529/0x940 [md_mod]
        do_md_run+0x18/0xc0 [md_mod]
        md_ioctl+0xba8/0x11f0 [md_mod]
        blkdev_ioctl+0xd8/0x7a0
        block_ioctl+0x3b/0x40
        do_vfs_ioctl+0x96/0x560
        sys_ioctl+0x91/0xa0
        system_call_fastpath+0x16/0x1b

This bug was introduced by commit a8364d5555 ("slub: only IPI CPUs that
have per cpu obj to flush"), which did not include checks for per cpu
partial pages being present on a cpu.

Signed-off-by: majianpeng <majianpeng@gmail.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Tested-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-17 18:00:51 -07:00
Hugh Dickins
1b76b02f15 mm: raise MemFree by reverting percpu_pagelist_fraction to 0
Why is there less MemFree than there used to be?  It perturbed a test,
so I've just been bisecting linux-next, and now find the offender went
upstream yesterday.

Commit 93278814d3 "mm: fix division by 0 in percpu_pagelist_fraction()"
mistakenly initialized percpu_pagelist_fraction to the sysctl's minimum 8,
which leaves 1/8th of memory on percpu lists (on each cpu??); but most of
us expect it to be left unset at 0 (and it's not then used as a divisor).

  MemTotal: 8061476kB  8061476kB  8061476kB  8061476kB  8061476kB  8061476kB
  Repetitive test with percpu_pagelist_fraction 8:
  MemFree:  6948420kB  6237172kB  6949696kB  6840692kB  6949048kB  6862984kB
  Same test with percpu_pagelist_fraction back to 0:
  MemFree:  7945000kB  7944908kB  7948568kB  7949060kB  7948796kB  7948812kB

Signed-off-by: Hugh Dickins <hughd@google.com>
[ We really should fix the crazy sysctl interface too, but that's a
  separate thing - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-11 09:23:39 -07:00
Linus Torvalds
7c283324da Merge branch 'akpm' (Andrew's patch-bomb)
Merge misc fixes from Andrew Morton.

* emailed from Andrew Morton <akpm@linux-foundation.org>: (8 patches)
  MAINTAINERS: add maintainer for LED subsystem
  mm: nobootmem: fix sign extend problem in __free_pages_memory()
  drivers/leds: correct __devexit annotations
  memcg: free spare array to avoid memory leak
  namespaces, pid_ns: fix leakage on fork() failure
  hugetlb: prevent BUG_ON in hugetlb_fault() -> hugetlb_cow()
  mm: fix division by 0 in percpu_pagelist_fraction()
  proc/pid/pagemap: correctly report non-present ptes and holes between vmas
2012-05-10 15:17:24 -07:00
Russ Anderson
6bc2e853c6 mm: nobootmem: fix sign extend problem in __free_pages_memory()
Systems with 8 TBytes of memory or greater can hit a problem where only
the first 8 TB of memory shows up.  This is due to "int i" being
smaller than "unsigned long start_aligned", causing the high bits to be
dropped.

The fix is to change `i' to unsigned long to match start_aligned
and end_aligned.
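
A tiny user-space demonstration of the width problem (assumes a 64-bit
unsigned long; nothing here is the kernel code):

    #include <stdio.h>

    int main(void)
    {
            unsigned long start_aligned = 9UL << 40;  /* an address beyond 8 TB */
            int narrow = start_aligned;               /* like the old "int i": high bits dropped */
            unsigned long wide = start_aligned;       /* like the fix: matching width */

            printf("narrow counter sees %#lx\n", (unsigned long)narrow);
            printf("wide counter sees   %#lx\n", wide);
            return 0;
    }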

Thanks to Jack Steiner for assistance tracking this down.

Signed-off-by: Russ Anderson <rja@sgi.com>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10 15:06:44 -07:00
Sha Zhengju
8c7577637c memcg: free spare array to avoid memory leak
When the last event is unregistered, there is no need to keep the spare
array anymore.  So free it to avoid memory leak.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10 15:06:44 -07:00
Chris Metcalf
4998a6c0ed hugetlb: prevent BUG_ON in hugetlb_fault() -> hugetlb_cow()
Commit 66aebce747 ("hugetlb: fix race condition in hugetlb_fault()")
added code to avoid a race condition by elevating the page refcount in
hugetlb_fault() while calling hugetlb_cow().

However, one code path in hugetlb_cow() includes an assertion that the
page count is 1, whereas it may now also have the value 2 in this path.

The consensus is that this BUG_ON has served its purpose, so rather than
extending it to cover both cases, we just remove it.

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Hillf Danton <dhillf@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>	[3.0.29+, 3.2.16+, 3.3.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10 15:06:44 -07:00
Sasha Levin
93278814d3 mm: fix division by 0 in percpu_pagelist_fraction()
percpu_pagelist_fraction_sysctl_handler() has only considered -EINVAL as
a possible error from proc_dointvec_minmax().

If any other error is returned, it would proceed to divide by zero since
percpu_pagelist_fraction wasn't getting initialized at any point.  For
example, writing 0 bytes into the proc file would trigger the issue.
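
A minimal sketch of the guard (user-space stand-ins for the sysctl
plumbing; the names are made up): bail out on any error from the parser,
not just -EINVAL, and never use the fraction as a divisor while it is
still zero.

    /* 0 means "never set"; dividing by it is exactly the bug described above. */
    static int pagelist_fraction;

    static int fraction_handler(int parse_result, int parsed_value, int zone_pages)
    {
            if (parse_result)                   /* any error, not only -EINVAL */
                    return parse_result;
            pagelist_fraction = parsed_value;
            if (!pagelist_fraction)             /* e.g. a zero-length write left it unset */
                    return 0;
            return zone_pages / pagelist_fraction;  /* safe: known non-zero here */
    }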

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10 15:06:44 -07:00
Catalin Marinas
100d13c3b5 kmemleak: Fix the kmemleak tracking of the percpu areas with !SMP
Kmemleak tracks the percpu allocations via a specific API and the
originally allocated areas must be removed from kmemleak (via
kmemleak_free). The code was already doing this for SMP systems; this
makes the !SMP case do the same.

Reported-by: Sami Liedes <sami.liedes@iki.fi>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-05-09 10:13:29 -07:00
Tejun Heo
42b6428145 percpu: pcpu_embed_first_chunk() should free unused parts after all allocs are complete
pcpu_embed_first_chunk() allocates memory for each node, copies percpu
data and frees unused portions of it before proceeding to the next
group.  This assumes that allocations for different nodes doesn't
overlap; however, depending on memory topology, the bootmem allocator
may end up allocating memory from a different node than the requested
one which may overlap with the portion freed from one of the previous
percpu areas.  This leads to percpu groups for different nodes
overlapping which is a serious bug.

This patch separates out copy & partial free from the allocation loop
such that all allocations are complete before partial frees happen.

This also fixes overlapping frees which could happen on allocation
failure path - out_free_areas path frees whole groups but the groups
could have portions freed at that point.
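
The restructuring boils down to two passes, roughly like this (a runnable
toy with stub helpers, not the percpu code):

    #include <stdio.h>

    #define NGROUPS 4

    /* Stubs standing in for the real per-group work. */
    static void *alloc_area(int g)               { printf("alloc group %d\n", g); return (void *)0; }
    static void copy_percpu_data(void *a, int g) { (void)a; printf("copy  group %d\n", g); }
    static void free_unused_tail(void *a, int g) { (void)a; printf("trim  group %d\n", g); }

    int main(void)
    {
            void *areas[NGROUPS];
            int g;

            /* Pass 1: complete every group's allocation first. */
            for (g = 0; g < NGROUPS; g++)
                    areas[g] = alloc_area(g);

            /* Pass 2: only now copy the data and give back the unused tails,
             * so no later allocation can overlap a portion already freed. */
            for (g = 0; g < NGROUPS; g++) {
                    copy_percpu_data(areas[g], g);
                    free_unused_tail(areas[g], g);
            }
            return 0;
    }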

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Reported-by: "Pavel V. Panteleev" <pp_84@mail.ru>
Tested-by: "Pavel V. Panteleev" <pp_84@mail.ru>
LKML-Reference: <E1SNhwY-0007ui-V7.pp_84-mail-ru@f220.mail.ru>
2012-05-09 10:08:16 -07:00
Linus Torvalds
e9b19cd43f Merge branch 'for-3.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull two percpu fixes from Tejun Heo:
 "One adds missing KERN_CONT on split printk()s and the other makes
  the percpu allocator avoid using PMD_SIZE as atom_size on x86_32.

  Using PMD_SIZE led to vmalloc area exhaustion on certain
  configurations (x86_32 android) and the only cost of using PAGE_SIZE
  instead is static percpu area not being aligned to large page
  mapping."

* 'for-3.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  percpu, x86: don't use PMD_SIZE as embedded atom_size on 32bit
  percpu: use KERN_CONT in pcpu_dump_alloc_info()
2012-05-08 11:06:07 -07:00