Commit Graph

7618 Commits

Author SHA1 Message Date
Michal Hocko 5f57b3de29 mm: do not bug_on on incorrect length in __mm_populate()
commit bb177a732c4369bb58a1fe1df8f552b6f0f7db5f upstream.

syzbot has noticed that a specially crafted library can easily hit
VM_BUG_ON in __mm_populate

  kernel BUG at mm/gup.c:1242!
  invalid opcode: 0000 [#1] SMP
  CPU: 2 PID: 9667 Comm: a.out Not tainted 4.18.0-rc3 #644
  Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
  RIP: 0010:__mm_populate+0x1e2/0x1f0
  Code: 55 d0 65 48 33 14 25 28 00 00 00 89 d8 75 21 48 83 c4 20 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 75 18 f1 ff 0f 0b e8 6e 18 f1 ff <0f> 0b 31 db eb c9 e8 93 06 e0 ff 0f 1f 00 55 48 89 e5 53 48 89 fb
  Call Trace:
     vm_brk_flags+0xc3/0x100
     vm_brk+0x1f/0x30
     load_elf_library+0x281/0x2e0
     __ia32_sys_uselib+0x170/0x1e0
     do_fast_syscall_32+0xca/0x420
     entry_SYSENTER_compat+0x70/0x7f

The reason is that the length of the new brk is not page aligned when we
try to populate the it.  There is no reason to bug on that though.
do_brk_flags already aligns the length properly so the mapping is
expanded as it should.  All we need is to tell mm_populate about it.
Besides that there is absolutely no reason to to bug_on in the first
place.  The worst thing that could happen is that the last page wouldn't
get populated and that is far from putting system into an inconsistent
state.

Fix the issue by moving the length sanitization code from do_brk_flags
up to vm_brk_flags.  The only other caller of do_brk_flags is brk
syscall entry and it makes sure to provide the proper length so t here
is no need for sanitation and so we can use do_brk_flags without it.

Also remove the bogus BUG_ONs.

[osalvador@techadventures.net: fix up vm_brk_flags s@request@len@]
Link: http://lkml.kernel.org/r/20180706090217.GI32658@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: syzbot <syzbot+5dcb560fe12aa5091c06@syzkaller.appspotmail.com>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.16:
 - There is no do_brk_flags() function; update do_brk()
 - do_brk(), vm_brk() return the address on success
 - Adjust filename, context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 22:08:08 +02:00
Kees Cook d5a9bd2989 mm: refuse wrapped vm_brk requests
commit ba093a6d9397da8eafcfbaa7d95bd34255da39a0 upstream.

The vm_brk() alignment calculations should refuse to overflow.  The ELF
loader depending on this, but it has been fixed now.  No other unsafe
callers have been found.

Link: http://lkml.kernel.org/r/1468014494-25291-3-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Ismael Ripoll Ripoll <iripoll@upv.es>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 22:08:08 +02:00
Steven Miao f05d0b90c3 mm: nommu: per-thread vma cache fix
mm could be removed from current task struct, using previous vma->vm_mm

It will crash on blackfin after updated to Linux 3.15.  The commit "mm:
per-thread vma caching" caused the crash.  mm could be removed from
current task struct before

  mmput()->
    exit_mmap()->
      delete_vma_from_mm()

the detailed fault information:

    NULL pointer access
    Kernel OOPS in progress
    Deferred Exception context
    CURRENT PROCESS:
    COMM=modprobe PID=278  CPU=0
    invalid mm
    return address: [0x000531de]; contents of:
    0x000531b0:  c727  acea  0c42  181d  0000  0000  0000  a0a8
    0x000531c0:  b090  acaa  0c42  1806  0000  0000  0000  a0e8
    0x000531d0:  b0d0  e801  0000  05b3  0010  e522  0046 [a090]
    0x000531e0:  6408  b090  0c00  17cc  3042  e3ff  f37b  2fc8

    CPU: 0 PID: 278 Comm: modprobe Not tainted 3.15.0-ADI-2014R1-pre-00345-gea9f446 #25
    task: 0572b720 ti: 0569e000 task.ti: 0569e000
    Compiled for cpu family 0x27fe (Rev 0), but running on:0x0000 (Rev 0)
    ADSP-BF609-0.0 500(MHz CCLK) 125(MHz SCLK) (mpu off)
    Linux version 3.15.0-ADI-2014R1-pre-00345-gea9f446 (steven@steven-OptiPlex-390) (gcc version 4.3.5 (ADI-trunk/svn-5962) ) #25 Tue Jun 10 17:47:46 CST 2014

    SEQUENCER STATUS:		Not tainted
     SEQSTAT: 00000027  IPEND: 8008  IMASK: ffff  SYSCFG: 2806
      EXCAUSE   : 0x27
      physical IVG3 asserted : <0xffa00744> { _trap + 0x0 }
      physical IVG15 asserted : <0xffa00d68> { _evt_system_call + 0x0 }
      logical irq   6 mapped  : <0xffa003bc> { _bfin_coretmr_interrupt + 0x0 }
      logical irq   7 mapped  : <0x00008828> { _bfin_fault_routine + 0x0 }
      logical irq  11 mapped  : <0x00007724> { _l2_ecc_err + 0x0 }
      logical irq  13 mapped  : <0x00008828> { _bfin_fault_routine + 0x0 }
      logical irq  39 mapped  : <0x00150788> { _bfin_twi_interrupt_entry + 0x0 }
      logical irq  40 mapped  : <0x00150788> { _bfin_twi_interrupt_entry + 0x0 }
     RETE: <0x00000000> /* Maybe null pointer? */
     RETN: <0x0569fe50> /* kernel dynamic memory (maybe user-space) */
     RETX: <0x00000480> /* Maybe fixed code section */
     RETS: <0x00053384> { _exit_mmap + 0x28 }
     PC  : <0x000531de> { _delete_vma_from_mm + 0x92 }
    DCPLB_FAULT_ADDR: <0x00000008> /* Maybe null pointer? */
    ICPLB_FAULT_ADDR: <0x000531de> { _delete_vma_from_mm + 0x92 }
    PROCESSOR STATE:
     R0 : 00000004    R1 : 0569e000    R2 : 00bf3db4    R3 : 00000000
     R4 : 057f9800    R5 : 00000001    R6 : 0569ddd0    R7 : 0572b720
     P0 : 0572b854    P1 : 00000004    P2 : 00000000    P3 : 0569dda0
     P4 : 0572b720    P5 : 0566c368    FP : 0569fe5c    SP : 0569fd74
     LB0: 057f523f    LT0: 057f523e    LC0: 00000000
     LB1: 0005317c    LT1: 00053172    LC1: 00000002
     B0 : 00000000    L0 : 00000000    M0 : 0566f5bc    I0 : 00000000
     B1 : 00000000    L1 : 00000000    M1 : 00000000    I1 : ffffffff
     B2 : 00000001    L2 : 00000000    M2 : 00000000    I2 : 00000000
     B3 : 00000000    L3 : 00000000    M3 : 00000000    I3 : 057f8000
    A0.w: 00000000   A0.x: 00000000   A1.w: 00000000   A1.x: 00000000
    USP : 056ffcf8  ASTAT: 02003024

    Hardware Trace:
       0 Target : <0x00003fb8> { _trap_c + 0x0 }
         Source : <0xffa006d8> { _exception_to_level5 + 0xa0 } JUMP.L
       1 Target : <0xffa00638> { _exception_to_level5 + 0x0 }
         Source : <0xffa004f2> { _bfin_return_from_exception + 0x6 } RTX
       2 Target : <0xffa004ec> { _bfin_return_from_exception + 0x0 }
         Source : <0xffa00590> { _ex_trap_c + 0x70 } JUMP.S
       3 Target : <0xffa00520> { _ex_trap_c + 0x0 }
         Source : <0xffa0076e> { _trap + 0x2a } JUMP (P4)
       4 Target : <0xffa00744> { _trap + 0x0 }
          FAULT : <0x000531de> { _delete_vma_from_mm + 0x92 } P0 = W[P2 + 2]
         Source : <0x000531da> { _delete_vma_from_mm + 0x8e } P2 = [P4 + 0x18]
       5 Target : <0x000531da> { _delete_vma_from_mm + 0x8e }
         Source : <0x00053176> { _delete_vma_from_mm + 0x2a } IF CC JUMP pcrel
       6 Target : <0x0005314c> { _delete_vma_from_mm + 0x0 }
         Source : <0x00053380> { _exit_mmap + 0x24 } JUMP.L
       7 Target : <0x00053378> { _exit_mmap + 0x1c }
         Source : <0x00053394> { _exit_mmap + 0x38 } IF !CC JUMP pcrel (BP)
       8 Target : <0x00053390> { _exit_mmap + 0x34 }
         Source : <0xffa020e0> { __cond_resched + 0x20 } RTS
       9 Target : <0xffa020c0> { __cond_resched + 0x0 }
         Source : <0x0005338c> { _exit_mmap + 0x30 } JUMP.L
      10 Target : <0x0005338c> { _exit_mmap + 0x30 }
         Source : <0x0005333a> { _delete_vma + 0xb2 } RTS
      11 Target : <0x00053334> { _delete_vma + 0xac }
         Source : <0x0005507a> { _kmem_cache_free + 0xba } RTS
      12 Target : <0x00055068> { _kmem_cache_free + 0xa8 }
         Source : <0x0005505e> { _kmem_cache_free + 0x9e } IF !CC JUMP pcrel (BP)
      13 Target : <0x00055052> { _kmem_cache_free + 0x92 }
         Source : <0x0005501a> { _kmem_cache_free + 0x5a } IF CC JUMP pcrel
      14 Target : <0x00054ff4> { _kmem_cache_free + 0x34 }
         Source : <0x00054fce> { _kmem_cache_free + 0xe } IF CC JUMP pcrel (BP)
      15 Target : <0x00054fc0> { _kmem_cache_free + 0x0 }
         Source : <0x00053330> { _delete_vma + 0xa8 } JUMP.L
    Kernel Stack
    Stack info:
     SP: [0x0569ff24] <0x0569ff24> /* kernel dynamic memory (maybe user-space) */
     Memory from 0x0569ff20 to 056a0000
    0569ff20: 00000001 [04e8da5a] 00008000  00000000  00000000  056a0000  04e8da5a  04e8da5a
    0569ff40: 04eb9eea  ffa00dce  02003025  04ea09c5  057f523f  04ea09c4  057f523e  00000000
    0569ff60: 00000000  00000000  00000000  00000000  00000000  00000000  00000001  00000000
    0569ff80: 00000000  00000000  00000000  00000000  00000000  00000000  00000000  00000000
    0569ffa0: 0566f5bc  057f8000  057f8000  00000001  04ec0170  056ffcf8  056ffd04  057f9800
    0569ffc0: 04d1d498  057f9800  057f8fe4  057f8ef0  00000001  057f928c  00000001  00000001
    0569ffe0: 057f9800  00000000  00000008  00000007  00000001  00000001  00000001 <00002806>
    Return addresses in stack:
        address : <0x00002806> { _show_cpuinfo + 0x2d2 }
    Modules linked in:
    Kernel panic - not syncing: Kernel exception
    [ end Kernel panic - not syncing: Kernel exception

Signed-off-by: Steven Miao <realmz6@gmail.com>
Acked-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>	[3.15.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-27 22:08:08 +02:00
Davidlohr Bueso 27b9adad69 mm,vmacache: optimize overflow system-wide flushing
For single threaded workloads, we can avoid flushing and iterating through
the entire list of tasks, making the whole function a lot faster,
requiring only a single atomic read for the mm_users.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-27 22:08:07 +02:00
Davidlohr Bueso b6d5307265 mm,vmacache: add debug data
Introduce a CONFIG_DEBUG_VM_VMACACHE option to enable counting the cache
hit rate -- exported in /proc/vmstat.

Any updates to the caching scheme needs this kind of data, thus it can
save some work re-implementing the counting all the time.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-27 22:08:07 +02:00
Linus Torvalds 3dbd0d29e6 mm: don't pointlessly use BUG_ON() for sanity check
BUG_ON() is a big hammer, and should be used _only_ if there is some
major corruption that you cannot possibly recover from, making it
imperative that the current process (and possibly the whole machine) be
terminated with extreme prejudice.

The trivial sanity check in the vmacache code is *not* such a fatal
error.  Recovering from it is absolutely trivial, and using BUG_ON()
just makes it harder to debug for no actual advantage.

To make matters worse, the placement of the BUG_ON() (only if the range
check matched) actually makes it harder to hit the sanity check to begin
with, so _if_ there is a bug (and we just got a report from Srivatsa
Bhat that this can indeed trigger), it is harder to debug not just
because the machine is possibly dead, but because we don't have better
coverage.

BUG_ON() must *die*.  Maybe we should add a checkpatch warning for it,
because it is simply just about the worst thing you can ever do if you
hit some "this cannot happen" situation.

Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-27 22:08:06 +02:00
Davidlohr Bueso 7c1a95e0ae mm: per-thread vma caching
This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
thus further comparison with other approaches were needed.  There are
two things to consider when dealing with this, the cache hit rate and
the latency of find_vma().  Improving the hit-rate does not necessarily
translate in finding the vma any faster, as the overhead of any fancy
caching schemes can be too high to consider.

We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality.  On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.

The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number.  The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed.  Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question.  Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:

1) System bootup: Most programs are single threaded, so the per-thread
   scheme does improve ~50% hit rate by just adding a few more slots to
   the cache.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 50.61%   | 19.90            |
| patched        | 73.45%   | 13.58            |
+----------------+----------+------------------+

2) Kernel build: This one is already pretty good with the current
   approach as we're dealing with good locality.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 75.28%   | 11.03            |
| patched        | 88.09%   | 9.31             |
+----------------+----------+------------------+

3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 70.66%   | 17.14            |
| patched        | 91.15%   | 12.57            |
+----------------+----------+------------------+

4) Ebizzy: There's a fair amount of variation from run to run, but this
   approach always shows nearly perfect hit rates, while baseline is just
   about non-existent.  The amounts of cycles can fluctuate between
   anywhere from ~60 to ~116 for the baseline scheme, but this approach
   reduces it considerably.  For instance, with 80 threads:

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 1.06%    | 91.54            |
| patched        | 99.97%   | 14.18            |
+----------------+----------+------------------+

[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-27 22:08:06 +02:00
Shakeel Butt 8ef718e452 mm, oom: fix use-after-free in oom_kill_process
commit cefc7ef3c87d02fc9307835868ff721ea12cc597 upstream.

Syzbot instance running on upstream kernel found a use-after-free bug in
oom_kill_process.  On further inspection it seems like the process
selected to be oom-killed has exited even before reaching
read_lock(&tasklist_lock) in oom_kill_process().  More specifically the
tsk->usage is 1 which is due to get_task_struct() in oom_evaluate_task()
and the put_task_struct within for_each_thread() frees the tsk and
for_each_thread() tries to access the tsk.  The easiest fix is to do
get/put across the for_each_thread() on the selected task.

Now the next question is should we continue with the oom-kill as the
previously selected task has exited? However before adding more
complexity and heuristics, let's answer why we even look at the children
of oom-kill selected task? The select_bad_process() has already selected
the worst process in the system/memcg.  Due to race, the selected
process might not be the worst at the kill time but does that matter?
The userspace can use the oom_score_adj interface to prefer children to
be killed before the parent.  I looked at the history but it seems like
this is there before git history.

Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com
Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com
Fixes: 6b0c81b3be ("mm, oom: reduce dependency on tasklist_lock")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-27 22:05:56 +02:00
Linus Torvalds f1bcd09be5 mremap: properly flush TLB before releasing the page
commit eb66ae030829605d61fbef1909ce310e29f78821 upstream.

Jann Horn points out that our TLB flushing was subtly wrong for the
mremap() case.  What makes mremap() special is that we don't follow the
usual "add page to list of pages to be freed, then flush tlb, and then
free pages".  No, mremap() obviously just _moves_ the page from one page
table location to another.

That matters, because mremap() thus doesn't directly control the
lifetime of the moved page with a freelist: instead, the lifetime of the
page is controlled by the page table locking, that serializes access to
the entry.

As a result, we need to flush the TLB not just before releasing the lock
for the source location (to avoid any concurrent accesses to the entry),
but also before we release the destination page table lock (to avoid the
TLB being flushed after somebody else has already done something to that
page).

This also makes the whole "need_flush" logic unnecessary, since we now
always end up flushing the TLB for every valid entry.

Reported-and-tested-by: Jann Horn <jannh@google.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Tested-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[will: backport to 4.4 stable]
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:53:28 +02:00
syphyr 1f5472e491 mm: Reduce Samsung's verbose logging when mapping memory 2019-07-27 21:53:25 +02:00
Joel Fernandes (Google) 4642f0d0c9 mm: shmem.c: Correctly annotate new inodes for lockdep
commit b45d71fb89ab8adfe727b9d0ee188ed58582a647 upstream.

Directories and inodes don't necessarily need to be in the same lockdep
class.  For ex, hugetlbfs splits them out too to prevent false positives
in lockdep.  Annotate correctly after new inode creation.  If its a
directory inode, it will be put into a different class.

This should fix a lockdep splat reported by syzbot:

> ======================================================
> WARNING: possible circular locking dependency detected
> 4.18.0-rc8-next-20180810+ #36 Not tainted
> ------------------------------------------------------
> syz-executor900/4483 is trying to acquire lock:
> 00000000d2bfc8fe (&sb->s_type->i_mutex_key#9){++++}, at: inode_lock
> include/linux/fs.h:765 [inline]
> 00000000d2bfc8fe (&sb->s_type->i_mutex_key#9){++++}, at:
> shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602
>
> but task is already holding lock:
> 0000000025208078 (ashmem_mutex){+.+.}, at: ashmem_shrink_scan+0xb4/0x630
> drivers/staging/android/ashmem.c:448
>
> which lock already depends on the new lock.
>
> -> #2 (ashmem_mutex){+.+.}:
>        __mutex_lock_common kernel/locking/mutex.c:925 [inline]
>        __mutex_lock+0x171/0x1700 kernel/locking/mutex.c:1073
>        mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1088
>        ashmem_mmap+0x55/0x520 drivers/staging/android/ashmem.c:361
>        call_mmap include/linux/fs.h:1844 [inline]
>        mmap_region+0xf27/0x1c50 mm/mmap.c:1762
>        do_mmap+0xa10/0x1220 mm/mmap.c:1535
>        do_mmap_pgoff include/linux/mm.h:2298 [inline]
>        vm_mmap_pgoff+0x213/0x2c0 mm/util.c:357
>        ksys_mmap_pgoff+0x4da/0x660 mm/mmap.c:1585
>        __do_sys_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
>        __se_sys_mmap arch/x86/kernel/sys_x86_64.c:91 [inline]
>        __x64_sys_mmap+0xe9/0x1b0 arch/x86/kernel/sys_x86_64.c:91
>        do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> -> #1 (&mm->mmap_sem){++++}:
>        __might_fault+0x155/0x1e0 mm/memory.c:4568
>        _copy_to_user+0x30/0x110 lib/usercopy.c:25
>        copy_to_user include/linux/uaccess.h:155 [inline]
>        filldir+0x1ea/0x3a0 fs/readdir.c:196
>        dir_emit_dot include/linux/fs.h:3464 [inline]
>        dir_emit_dots include/linux/fs.h:3475 [inline]
>        dcache_readdir+0x13a/0x620 fs/libfs.c:193
>        iterate_dir+0x48b/0x5d0 fs/readdir.c:51
>        __do_sys_getdents fs/readdir.c:231 [inline]
>        __se_sys_getdents fs/readdir.c:212 [inline]
>        __x64_sys_getdents+0x29f/0x510 fs/readdir.c:212
>        do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> -> #0 (&sb->s_type->i_mutex_key#9){++++}:
>        lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924
>        down_write+0x8f/0x130 kernel/locking/rwsem.c:70
>        inode_lock include/linux/fs.h:765 [inline]
>        shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602
>        ashmem_shrink_scan+0x236/0x630 drivers/staging/android/ashmem.c:455
>        ashmem_ioctl+0x3ae/0x13a0 drivers/staging/android/ashmem.c:797
>        vfs_ioctl fs/ioctl.c:46 [inline]
>        file_ioctl fs/ioctl.c:501 [inline]
>        do_vfs_ioctl+0x1de/0x1720 fs/ioctl.c:685
>        ksys_ioctl+0xa9/0xd0 fs/ioctl.c:702
>        __do_sys_ioctl fs/ioctl.c:709 [inline]
>        __se_sys_ioctl fs/ioctl.c:707 [inline]
>        __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:707
>        do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> other info that might help us debug this:
>
> Chain exists of:
>   &sb->s_type->i_mutex_key#9 --> &mm->mmap_sem --> ashmem_mutex
>
>  Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(ashmem_mutex);
>                                lock(&mm->mmap_sem);
>                                lock(ashmem_mutex);
>   lock(&sb->s_type->i_mutex_key#9);
>
>  *** DEADLOCK ***
>
> 1 lock held by syz-executor900/4483:
>  #0: 0000000025208078 (ashmem_mutex){+.+.}, at:
> ashmem_shrink_scan+0xb4/0x630 drivers/staging/android/ashmem.c:448

Link: http://lkml.kernel.org/r/20180821231835.166639-1-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Suggested-by: NeilBrown <neilb@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:53:13 +02:00
Vlastimil Babka bacd1cdaee mm, page_alloc: do not break __GFP_THISNODE by zonelist reset
commit 7810e6781e0fcbca78b91cf65053f895bf59e85f upstream.

In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
allocations that can ignore memory policies.  The zonelist is obtained
from current CPU's node.  This is a problem for __GFP_THISNODE
allocations that want to allocate on a different node, e.g.  because the
allocating thread has been migrated to a different CPU.

This has been observed to break SLAB in our 4.4-based kernel, because
there it relies on __GFP_THISNODE working as intended.  If a slab page
is put on wrong node's list, then further list manipulations may corrupt
the list because page_to_nid() is used to determine which node's
list_lock should be locked and thus we may take a wrong lock and race.

Current SLAB implementation seems to be immune by luck thanks to commit
511e3a058812 ("mm/slab: make cache_grow() handle the page allocated on
arbitrary node") but there may be others assuming that __GFP_THISNODE
works as promised.

We can fix it by simply removing the zonelist reset completely.  There
is actually no reason to reset it, because memory policies and cpusets
don't affect the zonelist choice in the first place.  This was different
when commit 183f6371aa ("mm: ignore mempolicies when using
ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their
own restricted zonelists.

We might consider this for 4.17 although I don't know if there's
anything currently broken.

SLAB is currently not affected, but in kernels older than 4.7 that don't
yet have 511e3a058812 ("mm/slab: make cache_grow() handle the page
allocated on arbitrary node") it is.  That's at least 4.4 LTS.  Older
ones I'll have to check.

So stable backports should be more important, but will have to be
reviewed carefully, as the code went through many changes.  BTW I think
that also the ac->preferred_zoneref reset is currently useless if we
don't also reset ac->nodemask from a mempolicy to NULL first (which we
probably should for the OOM victims etc?), but I would leave that for a
separate patch.

Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Fixes: 183f6371aa ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.16: Resetting the zonelist may still be useful here,
 so keep doing it if __GFP_THISNODE is not used.]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:52:50 +02:00
Linus Torvalds 444fda6e37 mmap: relax file size limit for regular files
commit 423913ad4ae5b3e8fb8983f70969fb522261ba26 upstream.

Commit be83bbf80682 ("mmap: introduce sane default mmap limits") was
introduced to catch problems in various ad-hoc character device drivers
doing mmap and getting the size limits wrong.  In the process, it used
"known good" limits for the normal cases of mapping regular files and
block device drivers.

It turns out that the "s_maxbytes" limit was less "known good" than I
thought.  In particular, /proc doesn't set it, but exposes one regular
file to mmap: /proc/vmcore.  As a result, that file got limited to the
default MAX_INT s_maxbytes value.

This went unnoticed for a while, because apparently the only thing that
needs it is the s390 kernel zfcpdump, but there might be other tools
that use this too.

Vasily suggested just changing s_maxbytes for all of /proc, which isn't
wrong, but makes me nervous at this stage.  So instead, just make the
new mmap limit always be MAX_LFS_FILESIZE for regular files, which won't
affect anything else.  It wasn't the regular file case I was worried
about.

I'd really prefer for maxsize to have been per-inode, but that is not
how things are today.

Fixes: be83bbf80682 ("mmap: introduce sane default mmap limits")
Reported-by: Vasily Gorbik <gor@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:52:12 +02:00
Linus Torvalds c5e0b7057d mmap: introduce sane default mmap limits
commit be83bbf806822b1b89e0a0f23cd87cddc409e429 upstream.

The internal VM "mmap()" interfaces are based on the mmap target doing
everything using page indexes rather than byte offsets, because
traditionally (ie 32-bit) we had the situation that the byte offset
didn't fit in a register.  So while the mmap virtual address was limited
by the word size of the architecture, the backing store was not.

So we're basically passing "pgoff" around as a page index, in order to
be able to describe backing store locations that are much bigger than
the word size (think files larger than 4GB etc).

But while this all makes a ton of sense conceptually, we've been dogged
by various drivers that don't really understand this, and internally
work with byte offsets, and then try to work with the page index by
turning it into a byte offset with "pgoff << PAGE_SHIFT".

Which obviously can overflow.

Adding the size of the mapping to it to get the byte offset of the end
of the backing store just exacerbates the problem, and if you then use
this overflow-prone value to check various limits of your device driver
mmap capability, you're just setting yourself up for problems.

The correct thing for drivers to do is to do their limit math in page
indices, the way the interface is designed.  Because the generic mmap
code _does_ test that the index doesn't overflow, since that's what the
mmap code really cares about.

HOWEVER.

Finding and fixing various random drivers is a sisyphean task, so let's
just see if we can just make the core mmap() code do the limiting for
us.  Realistically, the only "big" backing stores we need to care about
are regular files and block devices, both of which are known to do this
properly, and which have nice well-defined limits for how much data they
can access.

So let's special-case just those two known cases, and then limit other
random mmap users to a backing store that still fits in "unsigned long".
Realistically, that's not much of a limit at all on 64-bit, and on
32-bit architectures the only worry might be the GPU drivers, which can
have big physical address spaces.

To make it possible for drivers like that to say that they are 64-bit
clean, this patch does repurpose the "FMODE_UNSIGNED_OFFSET" bit in the
file flags to allow drivers to mark their file descriptors as safe in
the full 64-bit mmap address space.

[ The timing for doing this is less than optimal, and this should really
  go in a merge window. But realistically, this needs wide testing more
  than it needs anything else, and being main-line is the only way to do
  that.

  So the earlier the better, even if it's outside the proper development
  cycle        - Linus ]

Change-Id: I6886a262e0a7156aa1d99af301f47bbce8301800
Cc: Kees Cook <keescook@chromium.org>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Dave Airlie <airlied@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:52:11 +02:00
Oleg Nesterov a078112827 mm: do_mmap_pgoff: cleanup the usage of file_inode()
Simple cleanup.  Move "struct inode *inode" variable into "if (file)"
block to simplify the code and avoid the unnecessary check.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Colin Cross <ccross@android.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-27 21:52:11 +02:00
Naoya Horiguchi 669c5b32b5 pagewalk: improve vma handling
commit fafaa4264eba49fd10695c193a82760558d093f4 upstream.

Current implementation of page table walker has a fundamental problem in
vma handling, which started when we tried to handle vma(VM_HUGETLB).
Because it's done in pgd loop, considering vma boundary makes code
complicated and bug-prone.

From the users viewpoint, some user checks some vma-related condition to
determine whether the user really does page walk over the vma.

In order to solve these, this patch moves vma check outside pgd loop and
introduce a new callback ->test_walk().

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.16 as dependency of L1TF mitigation]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:51:51 +02:00
Naoya Horiguchi c4c4d02a51 mm/pagewalk: remove pgd_entry() and pud_entry()
commit 0b1fbfe50006c41014cc25660c0e735d21c34939 upstream.

Currently no user of page table walker sets ->pgd_entry() or
->pud_entry(), so checking their existence in each loop is just wasting
CPU cycle.  So let's remove it to reduce overhead.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.16 as dependency of L1TF mitigation]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:51:50 +02:00
Andrey Ryabinin 8b69e7ec54 mm/fadvise.c: fix signed overflow UBSAN complaint
[ Upstream commit a718e28f538441a3b6612da9ff226973376cdf0f ]

Signed integer overflow is undefined according to the C standard.  The
overflow in ksys_fadvise64_64() is deliberate, but since it is signed
overflow, UBSAN complains:

	UBSAN: Undefined behaviour in mm/fadvise.c:76:10
	signed integer overflow:
	4 + 9223372036854775805 cannot be represented in type 'long long int'

Use unsigned types to do math.  Unsigned overflow is defined so UBSAN
will not complain about it.  This patch doesn't change generated code.

[akpm@linux-foundation.org: add comment explaining the casts]
Link: http://lkml.kernel.org/r/20180629184453.7614-1-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reported-by: <icytxw@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-27 21:51:37 +02:00
jie@chenjie6@huwei.com 4532163963 mm/memory.c: check return value of ioremap_prot
[ Upstream commit 24eee1e4c47977bdfb71d6f15f6011e7b6188d04 ]

ioremap_prot() can return NULL which could lead to an oops.

Link: http://lkml.kernel.org/r/1533195441-58594-1-git-send-email-chenjie6@huawei.com
Signed-off-by: chen jie <chenjie6@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: chenjie <chenjie6@huawei.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-27 21:51:23 +02:00
Vinayak Menon 15899ebfbb mm: fix cma accounting in zone_watermark_ok
Some cases were reported where atomic unmovable allocations of order 2
fails, but kswapd does not wakeup. And in such cases it was seen that,
when zone_watermark_ok check is  performed to decide whether to wake up
kswapd, there were lot of CMA pages of order 2 and above. This makes
the watermark check succeed resulting in kswapd not being woken up. But
since these atomic unmovable allocations can't come from CMA region,
further atomic allocations keeps failing, without kswapd trying to
reclaim. Usually concurrent movable allocations result in reclaim and
improves the situtation, but the case reported was from a network test
which was resulting in only atomic skb allocations being attempted.

Change-Id: If953b8a8cfb0a5caa1fb63d3c032b194942f8091
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Prakash Gupta <guptap@codeaurora.org>
2019-07-27 21:51:09 +02:00
Vinayak Menon 2ba917cdbc mm: add zone counter for cma pages
Add per free area nr_free_cma counter. The idea is
to also track the number of cma pages present in
free pages. This will be used in later patches to
fix issues with zone_watermark_ok.

Change-Id: I97da9d2f3642db56fc541c48ab56a7ce78e2333c
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Prakash Gupta <guptap@codeaurora.org>
2019-07-27 21:51:09 +02:00
Vlastimil Babka ebe70ab3e6 mm/page_alloc: prevent merging between isolated and other pageblocks
Hanjun Guo has reported that a CMA stress test causes broken accounting of
CMA and free pages:

> Before the test, I got:
> -bash-4.3# cat /proc/meminfo | grep Cma
> CmaTotal:         204800 kB
> CmaFree:          195044 kB
>
>
> After running the test:
> -bash-4.3# cat /proc/meminfo | grep Cma
> CmaTotal:         204800 kB
> CmaFree:         6602584 kB
>
> So the freed CMA memory is more than total..
>
> Also the the MemFree is more than mem total:
>
> -bash-4.3# cat /proc/meminfo
> MemTotal:       16342016 kB
> MemFree:        22367268 kB
> MemAvailable:   22370528 kB

Laura Abbott has confirmed the issue and suspected the freepage accounting
rewrite around 3.18/4.0 by Joonsoo Kim. Joonsoo had a theory that this is
caused by unexpected merging between MIGRATE_ISOLATE and MIGRATE_CMA
pageblocks:

> CMA isolates MAX_ORDER aligned blocks, but, during the process,
> partialy isolated block exists. If MAX_ORDER is 11 and
> pageblock_order is 9, two pageblocks make up MAX_ORDER
> aligned block and I can think following scenario because pageblock
> (un)isolation would be done one by one.
>
> (each character means one pageblock. 'C', 'I' means MIGRATE_CMA,
> MIGRATE_ISOLATE, respectively.
>
> CC -> IC -> II (Isolation)
> II -> CI -> CC (Un-isolation)
>
> If some pages are freed at this intermediate state such as IC or CI,
> that page could be merged to the other page that is resident on
> different type of pageblock and it will cause wrong freepage count.

This was supposed to be prevented by CMA operating on MAX_ORDER blocks, but
since it doesn't hold the zone->lock between pageblocks, a race window does
exist.

It's also likely that unexpected merging can occur between MIGRATE_ISOLATE
and non-CMA pageblocks. This should be prevented in __free_one_page() since
commit 3c605096d315 ("mm/page_alloc: restrict max order of merging on isolated
pageblock"). However, we only check the migratetype of the pageblock where
buddy merging has been initiated, not the migratetype of the buddy pageblock
(or group of pageblocks) which can be MIGRATE_ISOLATE.

Joonsoo has suggested checking for buddy migratetype as part of
page_is_buddy(), but that would add extra checks in allocator hotpath and
bloat-o-meter has shown significant code bloat (the function is inline).

This patch reduces the bloat at some expense of more complicated code. The
buddy-merging while-loop in __free_one_page() is initially bounded to
pageblock_border and without any migratetype checks. The checks are placed
outside, bumping the max_order if merging is allowed, and returning to the
while-loop with a statement which can't be possibly considered harmful.

This fixes the accounting bug and also removes the arguably weird state in the
original commit 3c605096d315 where buddies could be left unmerged.

Fixes: 3c605096d315 ("mm/page_alloc: restrict max order of merging on isolated pageblock")
Link: https://lkml.org/lkml/2016/3/2/280
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Hanjun Guo <guohanjun@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Debugged-by: Laura Abbott <labbott@redhat.com>
Debugged-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>	[3.18+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change-Id: I768a9c4886aa3fe2e827aba682f67bac2dba6f71
Git-commit: d9dddbf556674bf125ecd925b24e43a5cf2a568a
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
[vinemenon@codeaurora.org: fix trivial merge conflicts]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Shiraz Hashim <shashim@codeaurora.org>
Signed-off-by: Prakash Gupta <guptap@codeaurora.org>
2019-07-27 21:51:08 +02:00
Fengguang Wu 9064b0d310 readahead: make context readahead more conservative
This helps performance on moderately dense random reads on SSD.

Transaction-Per-Second numbers provided by Taobao:

		QPS	case
		-------------------------------------------------------
		7536	disable context readahead totally
w/ patch:	7129	slower size rampup and start RA on the 3rd read
		6717	slower size rampup
w/o patch:	5581	unmodified context readahead

Before, readahead will be started whenever reading page N+1 when it happen
to read N recently.  After patch, we'll only start readahead when *three*
random reads happen to access pages N, N+1, N+2.  The probability of this
happening is extremely low for pure random reads, unless they are very
dense, which actually deserves some readahead.

Also start with a smaller readahead window.  The impact to interleaved
sequential reads should be small, because for a long run stream, the the
small readahead window rampup phase is negletable.

The context readahead actually benefits clustered random reads on HDD
whose seek cost is pretty high.  However as SSD is increasingly used for
random read workloads it's better for the context readahead to concentrate
on interleaved sequential reads.

Another SSD rand read test from Miao

        # file size:        2GB
        # read IO amount: 625MB
        sysbench --test=fileio          \
                --max-requests=10000    \
                --num-threads=1         \
                --file-num=1            \
                --file-block-size=64K   \
                --file-test-mode=rndrd  \
                --file-fsync-freq=0     \
                --file-fsync-end=off    run

shows the performance of btrfs grows up from 69MB/s to 121MB/s, ext4 from
104MB/s to 121MB/s.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Tested-by: Tao Ma <tm@tao.ma>
Tested-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-27 21:49:54 +02:00
Hugh Dickins 688adefcae mm: fix the NULL mapping case in __isolate_lru_page()
commit 145e1a71e090575c74969e3daa8136d1e5b99fc8 upstream.

George Boole would have noticed a slight error in 4.16 commit
69d763fc6d3a ("mm: pin address_space before dereferencing it while
isolating an LRU page").  Fix it, to match both the comment above it,
and the original behaviour.

Although anonymous pages are not marked PageDirty at first, we have an
old habit of calling SetPageDirty when a page is removed from swap
cache: so there's a category of ex-swap pages that are easily
migratable, but were inadvertently excluded from compaction's async
migration in 4.16.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805302014001.12558@eggly.anvils
Fixes: 69d763fc6d3a ("mm: pin address_space before dereferencing it while isolating an LRU page")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by:  Ivan Kalvachev <ikalvachev@gmail.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-27 21:49:37 +02:00
Yisheng Xie 31e0724ee2 mm/mempolicy.c: avoid use uninitialized preferred_node
commit 8970a63e965b43288c4f5f40efbc2bbf80de7f16 upstream.

Alexander reported a use of uninitialized memory in __mpol_equal(),
which is caused by incorrect use of preferred_node.

When mempolicy in mode MPOL_PREFERRED with flags MPOL_F_LOCAL, it uses
numa_node_id() instead of preferred_node, however, __mpol_equal() uses
preferred_node without checking whether it is MPOL_F_LOCAL or not.

[akpm@linux-foundation.org: slight comment tweak]
Link: http://lkml.kernel.org/r/4ebee1c2-57f6-bcb8-0e2d-1833d1ee0bb7@huawei.com
Fixes: fc36b8d3d8 ("mempolicy: use MPOL_F_LOCAL to Indicate Preferred Local Policy")
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Reported-by: Alexander Potapenko <glider@google.com>
Tested-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:27 +02:00
Mel Gorman f468d060ee mm: pin address_space before dereferencing it while isolating an LRU page
commit 69d763fc6d3aee787a3e8c8c35092b4f4960fa5d upstream.

Minchan Kim asked the following question -- what locks protects
address_space destroying when race happens between inode trauncation and
__isolate_lru_page? Jan Kara clarified by describing the race as follows

CPU1                                            CPU2

truncate(inode)                                 __isolate_lru_page()
  ...
  truncate_inode_page(mapping, page);
    delete_from_page_cache(page)
      spin_lock_irqsave(&mapping->tree_lock, flags);
        __delete_from_page_cache(page, NULL)
          page_cache_tree_delete(..)
            ...                                   mapping = page_mapping(page);
            page->mapping = NULL;
            ...
      spin_unlock_irqrestore(&mapping->tree_lock, flags);
      page_cache_free_page(mapping, page)
        put_page(page)
          if (put_page_testzero(page)) -> false
- inode now has no pages and can be freed including embedded address_space

                                                  if (mapping && !mapping->a_ops->migratepage)
- we've dereferenced mapping which is potentially already free.

The race is theoretically possible but unlikely.  Before the
delete_from_page_cache, truncate_cleanup_page is called so the page is
likely to be !PageDirty or PageWriteback which gets skipped by the only
caller that checks the mappping in __isolate_lru_page.  Even if the race
occurs, a substantial amount of work has to happen during a tiny window
with no preemption but it could potentially be done using a virtual
machine to artifically slow one CPU or halt it during the critical
window.

This patch should eliminate the race with truncation by try-locking the
page before derefencing mapping and aborting if the lock was not
acquired.  There was a suggestion from Huang Ying to use RCU as a
side-effect to prevent mapping being freed.  However, I do not like the
solution as it's an unconventional means of preserving a mapping and
it's not a context where rcu_read_lock is obviously protecting rcu data.

Link: http://lkml.kernel.org/r/20180104102512.2qos3h5vqzeisrek@techsingularity.net
Fixes: c824493528 ("mm: compaction: make isolate_lru_page() filter-aware again")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:27 +02:00
Anshuman Khandual 73b87bd20a mm/mprotect: add a cond_resched() inside change_pmd_range()
commit 4991c09c7c812dba13ea9be79a68b4565bb1fa4e upstream.

While testing on a large CPU system, detected the following RCU stall
many times over the span of the workload.  This problem is solved by
adding a cond_resched() in the change_pmd_range() function.

  INFO: rcu_sched detected stalls on CPUs/tasks:
   154-....: (670 ticks this GP) idle=022/140000000000000/0 softirq=2825/2825 fqs=612
   (detected by 955, t=6002 jiffies, g=4486, c=4485, q=90864)
  Sending NMI from CPU 955 to CPUs 154:
  NMI backtrace for cpu 154
  CPU: 154 PID: 147071 Comm: workload Not tainted 4.15.0-rc3+ #3
  NIP:  c0000000000b3f64 LR: c0000000000b33d4 CTR: 000000000000aa18
  REGS: 00000000a4b0fb44 TRAP: 0501   Not tainted  (4.15.0-rc3+)
  MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 22422082  XER: 00000000
  CFAR: 00000000006cf8f0 SOFTE: 1
  GPR00: 0010000000000000 c00003ef9b1cb8c0 c0000000010cc600 0000000000000000
  GPR04: 8e0000018c32b200 40017b3858fd6e00 8e0000018c32b208 40017b3858fd6e00
  GPR08: 8e0000018c32b210 40017b3858fd6e00 8e0000018c32b218 40017b3858fd6e00
  GPR12: ffffffffffffffff c00000000fb25100
  NIP [c0000000000b3f64] plpar_hcall9+0x44/0x7c
  LR [c0000000000b33d4] pSeries_lpar_flush_hash_range+0x384/0x420
  Call Trace:
    flush_hash_range+0x48/0x100
    __flush_tlb_pending+0x44/0xd0
    hpte_need_flush+0x408/0x470
    change_protection_range+0xaac/0xf10
    change_prot_numa+0x30/0xb0
    task_numa_work+0x2d0/0x3e0
    task_work_run+0x130/0x190
    do_notify_resume+0x118/0x120
    ret_from_except_lite+0x70/0x74
  Instruction dump:
  60000000 f8810028 7ca42b78 7cc53378 7ce63b78 7d074378 7d284b78 7d495378
  e9410060 e9610068 e9810070 44000022 <7d806378> e9810028 f88c0000 f8ac0008

Change-Id: Id2a7a0b449991301de8b0871fdcdec6b16c5ec0c
Link: http://lkml.kernel.org/r/20171214140551.5794-1-khandual@linux.vnet.ibm.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Suggested-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:46:25 +02:00
chenjie 38a4a0bdb6 mm/madvise.c: fix madvise() infinite loop under special circumstances
commit 6ea8d958a2c95a1d514015d4e29ba21a8c0a1a91 upstream.

MADVISE_WILLNEED has always been a noop for DAX (formerly XIP) mappings.
Unfortunately madvise_willneed() doesn't communicate this information
properly to the generic madvise syscall implementation.  The calling
convention is quite subtle there.  madvise_vma() is supposed to either
return an error or update &prev otherwise the main loop will never
advance to the next vma and it will keep looping for ever without a way
to get out of the kernel.

It seems this has been broken since introduction.  Nobody has noticed
because nobody seems to be using MADVISE_WILLNEED on these DAX mappings.

[mhocko@suse.com: rewrite changelog]
Link: http://lkml.kernel.org/r/20171127115318.911-1-guoxuenan@huawei.com
Fixes: fe77ba6f4f ("[PATCH] xip: madvice/fadvice: execute in place")
Signed-off-by: chenjie <chenjie6@huawei.com>
Signed-off-by: guoxuenan <guoxuenan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: zhangyi (F) <yi.zhang@huawei.com>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-27 21:45:21 +02:00
Nadav Amit 3291f75152 mm: migrate: prevent racy access to tlb_flush_pending
commit 16af97dc5a8975371a83d9e30a64038b48f40a2d upstream.

Patch series "fixes of TLB batching races", v6.

It turns out that Linux TLB batching mechanism suffers from various
races.  Races that are caused due to batching during reclamation were
recently handled by Mel and this patch-set deals with others.  The more
fundamental issue is that concurrent updates of the page-tables allow
for TLB flushes to be batched on one core, while another core changes
the page-tables.  This other core may assume a PTE change does not
require a flush based on the updated PTE value, while it is unaware that
TLB flushes are still pending.

This behavior affects KSM (which may result in memory corruption) and
MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior).  A
proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
Memory corruption in KSM is harder to produce in practice, but was
observed by hacking the kernel and adding a delay before flushing and
replacing the KSM page.

Finally, there is also one memory barrier missing, which may affect
architectures with weak memory model.

This patch (of 7):

Setting and clearing mm->tlb_flush_pending can be performed by multiple
threads, since mmap_sem may only be acquired for read in
task_numa_work().  If this happens, tlb_flush_pending might be cleared
while one of the threads still changes PTEs and batches TLB flushes.

This can lead to the same race between migration and
change_protection_range() that led to the introduction of
tlb_flush_pending.  The result of this race was data corruption, which
means that this patch also addresses a theoretically possible data
corruption.

An actual data corruption was not observed, yet the race was was
confirmed by adding assertion to check tlb_flush_pending is not set by
two threads, adding artificial latency in change_protection_range() and
using sysctl to reduce kernel.numa_balancing_scan_delay_ms.

Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
Fixes: 20841405940e ("mm: fix TLB flush race between migration, and
change_protection_range")
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.16:
 - Drop change to dump_mm()
 - Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:45:21 +02:00
zhong jiang aebb380322 mm/mempolicy: fix use after free when calling get_mempolicy
commit 73223e4e2e3867ebf033a5a8eb2e5df0158ccc99 upstream.

I hit a use after free issue when executing trinity and repoduced it
with KASAN enabled.  The related call trace is as follows.

  BUG: KASan: use after free in SyS_get_mempolicy+0x3c8/0x960 at addr ffff8801f582d766
  Read of size 2 by task syz-executor1/798

  INFO: Allocated in mpol_new.part.2+0x74/0x160 age=3 cpu=1 pid=799
     __slab_alloc+0x768/0x970
     kmem_cache_alloc+0x2e7/0x450
     mpol_new.part.2+0x74/0x160
     mpol_new+0x66/0x80
     SyS_mbind+0x267/0x9f0
     system_call_fastpath+0x16/0x1b
  INFO: Freed in __mpol_put+0x2b/0x40 age=4 cpu=1 pid=799
     __slab_free+0x495/0x8e0
     kmem_cache_free+0x2f3/0x4c0
     __mpol_put+0x2b/0x40
     SyS_mbind+0x383/0x9f0
     system_call_fastpath+0x16/0x1b
  INFO: Slab 0xffffea0009cb8dc0 objects=23 used=8 fp=0xffff8801f582de40 flags=0x200000000004080
  INFO: Object 0xffff8801f582d760 @offset=5984 fp=0xffff8801f582d600

  Bytes b4 ffff8801f582d750: ae 01 ff ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a  ........ZZZZZZZZ
  Object ffff8801f582d760: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
  Object ffff8801f582d770: 6b 6b 6b 6b 6b 6b 6b a5                          kkkkkkk.
  Redzone ffff8801f582d778: bb bb bb bb bb bb bb bb                          ........
  Padding ffff8801f582d8b8: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ
  Memory state around the buggy address:
  ffff8801f582d600: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff8801f582d680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  >ffff8801f582d700: fc fc fc fc fc fc fc fc fc fc fc fc fb fb fb fc

!shared memory policy is not protected against parallel removal by other
thread which is normally protected by the mmap_sem.  do_get_mempolicy,
however, drops the lock midway while we can still access it later.

Early premature up_read is a historical artifact from times when
put_user was called in this path see https://lwn.net/Articles/124754/
but that is gone since 8bccd85ffb ("[PATCH] Implement sys_* do_*
layering in the memory policy layer.").  but when we have the the
current mempolicy ref count model.  The issue was introduced
accordingly.

Fix the issue by removing the premature release.

Link: http://lkml.kernel.org/r/1502950924-27521-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:45:02 +02:00
Zefan Li bbc314da7d cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags
commit 2ad654bc5e2b211e92f66da1d819e47d79a866f0 upstream.

When we change cpuset.memory_spread_{page,slab}, cpuset will flip
PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
This should be done using atomic bitops, but currently we don't,
which is broken.

Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
when one thread tried to clear PF_USED_MATH while at the same time another
thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
the same task.

Here's the full report:
https://lkml.org/lkml/2014/9/19/230

To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.

v4:
- updated mm/slab.c. (Fengguang Wu)
- updated Documentation.

Change-Id: I4eef89f5d91eaa2e7e677e662133a7856909a13f
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: Kees Cook <keescook@chromium.org>
Fixes: 950592f7b9 ("cpusets: update tasks' page/slab spread flags in time")
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:44:59 +02:00
Michal Hocko da34ab051a mm/mmap.c: do not blow on PROT_NONE MAP_FIXED holes in the stack
commit 561b5e0709e4a248c67d024d4d94b6e31e3edf2f upstream.

Commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas") has
introduced a regression in some rust and Java environments which are
trying to implement their own stack guard page.  They are punching a new
MAP_FIXED mapping inside the existing stack Vma.

This will confuse expand_{downwards,upwards} into thinking that the
stack expansion would in fact get us too close to an existing non-stack
vma which is a correct behavior wrt safety.  It is a real regression on
the other hand.

Let's work around the problem by considering PROT_NONE mapping as a part
of the stack.  This is a gros hack but overflowing to such a mapping
would trap anyway an we only can hope that usespace knows what it is
doing and handle it propely.

Fixes: 1be7107fbe18 ("mm: larger stack guard gap, between vmas")
Link: http://lkml.kernel.org/r/20170705182849.GA18027@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Debugged-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:44:55 +02:00
Linus Torvalds d284770333 Sanitize 'move_pages()' permission checks
commit 197e7e521384a23b9e585178f3f11c9fa08274b9 upstream.

The 'move_paghes()' system call was introduced long long ago with the
same permission checks as for sending a signal (except using
CAP_SYS_NICE instead of CAP_SYS_KILL for the overriding capability).

That turns out to not be a great choice - while the system call really
only moves physical page allocations around (and you need other
capabilities to do a lot of it), you can check the return value to map
out some the virtual address choices and defeat ASLR of a binary that
still shares your uid.

So change the access checks to the more common 'ptrace_may_access()'
model instead.

This tightens the access checks for the uid, and also effectively
changes the CAP_SYS_NICE check to CAP_SYS_PTRACE, but it's unlikely that
anybody really _uses_ this legacy system call any more (we hav ebetter
NUMA placement models these days), so I expect nobody to notice.

Famous last words.

Change-Id: I033691ba22e3bd606e61e94d54c4f6def301b8a8
Reported-by: Otto Ebeling <otto.ebeling@iki.fi>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:44:50 +02:00
Helge Deller d8c638914b mm: fix overflow check in expand_upwards()
commit 37511fb5c91db93d8bd6e3f52f86e5a7ff7cfcdf upstream.

Jörn Engel noticed that the expand_upwards() function might not return
-ENOMEM in case the requested address is (unsigned long)-PAGE_SIZE and
if the architecture didn't defined TASK_SIZE as multiple of PAGE_SIZE.

Affected architectures are arm, frv, m68k, blackfin, h8300 and xtensa
which all define TASK_SIZE as 0xffffffff, but since none of those have
an upwards-growing stack we currently have no actual issue.

Nevertheless let's fix this just in case any of the architectures with
an upward-growing stack (currently parisc, metag and partly ia64) define
TASK_SIZE similar.

Link: http://lkml.kernel.org/r/20170702192452.GA11868@p100.box
Fixes: bd726c90b6b8 ("Allow stack to grow up to address space limit")
Signed-off-by: Helge Deller <deller@gmx.de>
Reported-by: Jörn Engel <joern@purestorage.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2019-07-27 21:44:38 +02:00
Josh Poimboeuf 893438334a mm/page_alloc: Remove kernel address exposure in free_reserved_area()
commit adb1fe9ae2ee6ef6bc10f3d5a588020e7664dfa7 upstream.

Linus suggested we try to remove some of the low-hanging fruit related
to kernel address exposure in dmesg.  The only leaks I see on my local
system are:

  Freeing SMP alternatives memory: 32K (ffffffff9e309000 - ffffffff9e311000)
  Freeing initrd memory: 10588K (ffffa0b736b42000 - ffffa0b737599000)
  Freeing unused kernel memory: 3592K (ffffffff9df87000 - ffffffff9e309000)
  Freeing unused kernel memory: 1352K (ffffa0b7288ae000 - ffffa0b728a00000)
  Freeing unused kernel memory: 632K (ffffa0b728d62000 - ffffa0b728e00000)

Linus says:

  "I suspect we should just remove [the addresses in the 'Freeing'
   messages]. I'm sure they are useful in theory, but I suspect they
   were more useful back when the whole "free init memory" was
   originally done.

   These days, if we have a use-after-free, I suspect the init-mem
   situation is the easiest situation by far. Compared to all the dynamic
   allocations which are much more likely to show it anyway. So having
   debug output for that case is likely not all that productive."

With this patch the freeing messages now look like this:

  Freeing SMP alternatives memory: 32K
  Freeing initrd memory: 10588K
  Freeing unused kernel memory: 3592K
  Freeing unused kernel memory: 1352K
  Freeing unused kernel memory: 632K

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/6836ff90c45b71d38e5d4405aec56fa9e5d1d4b2.1477405374.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2019-07-27 21:44:12 +02:00
Vinayak Menon b32eb3e65f mm: vmpressure: fix sending wrong events on underflow
commit e1587a4945408faa58d0485002c110eb2454740c upstream.

At the end of a window period, if the reclaimed pages is greater than
scanned, an unsigned underflow can result in a huge pressure value and
thus a critical event.  Reclaimed pages is found to go higher than
scanned because of the addition of reclaimed slab pages to reclaimed in
shrink_node without a corresponding increment to scanned pages.

Minchan Kim mentioned that this can also happen in the case of a THP
page where the scanned is 1 and reclaimed could be 512.

Link: http://lkml.kernel.org/r/1486641577-11685-1-git-send-email-vinmenon@codeaurora.org
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
Cc: Shiraz Hashim <shashim@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2019-07-27 21:43:56 +02:00
Michal Hocko e48604e79a mm, fs: check for fatal signals in do_generic_file_read()
commit 5abf186a30a89d5b9c18a6bf93a2c192c9fd52f6 upstream.

do_generic_file_read() can be told to perform a large request from
userspace.  If the system is under OOM and the reading task is the OOM
victim then it has an access to memory reserves and finishing the full
request can lead to the full memory depletion which is dangerous.  Make
sure we rather go with a short read and allow the killed task to
terminate.

Link: http://lkml.kernel.org/r/20170201092706.9966-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2019-07-27 21:43:51 +02:00
Toshi Kani 7fb0c52e66 mm/memory_hotplug.c: check start_pfn in test_pages_in_a_zone()
commit deb88a2a19e85842d79ba96b05031739ec327ff4 upstream.

Patch series "fix a kernel oops when reading sysfs valid_zones", v2.

A sysfs memory file is created for each 2GiB memory block on x86-64 when
the system has 64GiB or more memory.  [1] When the start address of a
memory block is not backed by struct page, i.e.  a memory range is not
aligned by 2GiB, reading its 'valid_zones' attribute file leads to a
kernel oops.  This issue was observed on multiple x86-64 systems with
more than 64GiB of memory.  This patch-set fixes this issue.

Patch 1 first fixes an issue in test_pages_in_a_zone(), which does not
test the start section.

Patch 2 then fixes the kernel oops by extending test_pages_in_a_zone()
to return valid [start, end).

Note for stable kernels: The memory block size change was made by commit
bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64
systems"), which was accepted to 3.9.  However, this patch-set depends
on (and fixes) the change to test_pages_in_a_zone() made by commit
5f0f2887f4de ("mm/memory_hotplug.c: check for missing sections in
test_pages_in_a_zone()"), which was accepted to 4.4.

So, I recommend that we backport it up to 4.4.

[1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
    large-memory x86-64 systems")'

This patch (of 2):

test_pages_in_a_zone() does not check 'start_pfn' when it is aligned by
section since 'sec_end_pfn' is set equal to 'pfn'.  Since this function
is called for testing the range of a sysfs memory file, 'start_pfn' is
always aligned by section.

Fix it by properly setting 'sec_end_pfn' to the next section pfn.

Also make sure that this function returns 1 only when the range belongs
to a zone.

Link: http://lkml.kernel.org/r/20170127222149.30893-2-toshi.kani@hpe.com
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2019-07-27 21:43:51 +02:00
Keno Fischer f999d3a31a mm/huge_memory.c: respect FOLL_FORCE/FOLL_COW for thp
commit 8310d48b125d19fcd9521d83b8293e63eb1646aa upstream.

In commit 19be0eaffa3a ("mm: remove gup_flags FOLL_WRITE games from
__get_user_pages()"), the mm code was changed from unsetting FOLL_WRITE
after a COW was resolved to setting the (newly introduced) FOLL_COW
instead.  Simultaneously, the check in gup.c was updated to still allow
writes with FOLL_FORCE set if FOLL_COW had also been set.

However, a similar check in huge_memory.c was forgotten.  As a result,
remote memory writes to ro regions of memory backed by transparent huge
pages cause an infinite loop in the kernel (handle_mm_fault sets
FOLL_COW and returns 0 causing a retry, but follow_trans_huge_pmd bails
out immidiately because `(flags & FOLL_WRITE) && !pmd_write(*pmd)` is
true.

While in this state the process is stil SIGKILLable, but little else
works (e.g.  no ptrace attach, no other signals).  This is easily
reproduced with the following code (assuming thp are set to always):

    #include <assert.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define TEST_SIZE 5 * 1024 * 1024

    int main(void) {
      int status;
      pid_t child;
      int fd = open("/proc/self/mem", O_RDWR);
      void *addr = mmap(NULL, TEST_SIZE, PROT_READ,
                        MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
      assert(addr != MAP_FAILED);
      pid_t parent_pid = getpid();
      if ((child = fork()) == 0) {
        void *addr2 = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE,
                           MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
        assert(addr2 != MAP_FAILED);
        memset(addr2, 'a', TEST_SIZE);
        pwrite(fd, addr2, TEST_SIZE, (uintptr_t)addr);
        return 0;
      }
      assert(child == waitpid(child, &status, 0));
      assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
      return 0;
    }

Fix this by updating follow_trans_huge_pmd in huge_memory.c analogously
to the update in gup.c in the original commit.  The same pattern exists
in follow_devmap_pmd.  However, we should not be able to reach that
check with FOLL_COW set, so add WARN_ONCE to make sure we notice if we
ever do.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20170106015025.GA38411@juliacomputing.com
Signed-off-by: Keno Fischer <keno@juliacomputing.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2:
 - Drop change to follow_devmap_pmd()
 - pmd_dirty() is not available; check the page flags as in
   can_follow_write_pte()
 - Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
[mhocko:
  This has been forward ported from the 3.2 stable tree.
  And fixed to return NULL.]
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2019-07-27 21:43:25 +02:00
Oliver O'Halloran 24db5e0996 mm/init: fix zone boundary creation
commit 90cae1fe1c3540f791d5b8e025985fa5e699b2bb upstream.

As a part of memory initialisation the architecture passes an array to
free_area_init_nodes() which specifies the max PFN of each memory zone.
This array is not necessarily monotonic (due to unused zones) so this
array is parsed to build monotonic lists of the min and max PFN for each
zone.  ZONE_MOVABLE is special cased here as its limits are managed by
the mm subsystem rather than the architecture.  Unfortunately, this
special casing is broken when ZONE_MOVABLE is the not the last zone in
the zone list.  The core of the issue is:

	if (i == ZONE_MOVABLE)
		continue;
	arch_zone_lowest_possible_pfn[i] =
		arch_zone_highest_possible_pfn[i-1];

As ZONE_MOVABLE is skipped the lowest_possible_pfn of the next zone will
be set to zero.  This patch fixes this bug by adding explicitly tracking
where the next zone should start rather than relying on the contents
arch_zone_highest_possible_pfn[].

Thie is low priority.  To get bitten by this you need to enable a zone
that appears after ZONE_MOVABLE in the zone_type enum.  As far as I can
tell this means running a kernel with ZONE_DEVICE or ZONE_CMA enabled,
so I can't see this affecting too many people.

I only noticed this because I've been fiddling with ZONE_DEVICE on
powerpc and 4.6 broke my test kernel.  This bug, in conjunction with the
changes in Taku Izumi's kernelcore=mirror patch (d91749c1dda71) and
powerpc being the odd architecture which initialises max_zone_pfn[] to
~0ul instead of 0 caused all of system memory to be placed into
ZONE_DEVICE at boot, followed a panic since device memory cannot be used
for kernel allocations.  I've already submitted a patch to fix the
powerpc specific bits, but I figured this should be fixed too.

Link: http://lkml.kernel.org/r/1462435033-15601-1-git-send-email-oohall@gmail.com
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2019-07-27 21:43:22 +02:00
zhong jiang ef5181ed77 mm,ksm: fix endless looping in allocating memory when ksm enable
commit 5b398e416e880159fe55eefd93c6588fa072cd66 upstream.

I hit the following hung task when runing a OOM LTP test case with 4.1
kernel.

Call trace:
[<ffffffc000086a88>] __switch_to+0x74/0x8c
[<ffffffc000a1bae0>] __schedule+0x23c/0x7bc
[<ffffffc000a1c09c>] schedule+0x3c/0x94
[<ffffffc000a1eb84>] rwsem_down_write_failed+0x214/0x350
[<ffffffc000a1e32c>] down_write+0x64/0x80
[<ffffffc00021f794>] __ksm_exit+0x90/0x19c
[<ffffffc0000be650>] mmput+0x118/0x11c
[<ffffffc0000c3ec4>] do_exit+0x2dc/0xa74
[<ffffffc0000c46f8>] do_group_exit+0x4c/0xe4
[<ffffffc0000d0f34>] get_signal+0x444/0x5e0
[<ffffffc000089fcc>] do_signal+0x1d8/0x450
[<ffffffc00008a35c>] do_notify_resume+0x70/0x78

The oom victim cannot terminate because it needs to take mmap_sem for
write while the lock is held by ksmd for read which loops in the page
allocator

ksm_do_scan
	scan_get_next_rmap_item
		down_read
		get_next_rmap_item
			alloc_rmap_item   #ksmd will loop permanently.

There is no way forward because the oom victim cannot release any memory
in 4.1 based kernel.  Since 4.6 we have the oom reaper which would solve
this problem because it would release the memory asynchronously.
Nevertheless we can relax alloc_rmap_item requirements and use
__GFP_NORETRY because the allocation failure is acceptable as ksm_do_scan
would just retry later after the lock got dropped.

Such a patch would be also easy to backport to older stable kernels which
do not have oom_reaper.

While we are at it add GFP_NOWARN so the admin doesn't have to be alarmed
by the allocation failure.

Link: http://lkml.kernel.org/r/1474165570-44398-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Suggested-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2019-07-27 21:42:51 +02:00
Jann Horn 4f43178d45 swapfile: fix memory corruption via malformed swapfile
commit dd111be69114cc867f8e826284559bfbc1c40e37 upstream.

When root activates a swap partition whose header has the wrong
endianness, nr_badpages elements of badpages are swabbed before
nr_badpages has been checked, leading to a buffer overrun of up to 8GB.

This normally is not a security issue because it can only be exploited
by root (more specifically, a process with CAP_SYS_ADMIN or the ability
to modify a swap file/partition), and such a process can already e.g.
modify swapped-out memory of any other userspace process on the system.

Link: http://lkml.kernel.org/r/1477949533-2509-1-git-send-email-jann@thejh.net
Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2019-07-27 21:42:14 +02:00
Andrea Arcangeli a57dd9f235 mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED
commit ad33bb04b2a6cee6c1f99fabb15cddbf93ff0433 upstream.

pmd_trans_unstable()/pmd_none_or_trans_huge_or_clear_bad() were
introduced to locklessy (but atomically) detect when a pmd is a regular
(stable) pmd or when the pmd is unstable and can infinitely transition
from pmd_none() and pmd_trans_huge() from under us, while only holding
the mmap_sem for reading (for writing not).

While holding the mmap_sem only for reading, MADV_DONTNEED can run from
under us and so before we can assume the pmd to be a regular stable pmd
we need to compare it against pmd_none() and pmd_trans_huge() in an
atomic way, with pmd_trans_unstable().  The old pmd_trans_huge() left a
tiny window for a race.

Useful applications are unlikely to notice the difference as doing
MADV_DONTNEED concurrently with a page fault would lead to undefined
behavior.

[js] 3.12 backport: no pmd_devmap in 3.12 yet.

[akpm@linux-foundation.org: tidy up comment grammar/layout]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2019-07-27 21:42:02 +02:00
Willy Tarreau 9766f1788d squash mm: Export migrate_page_... : also make it non-static
commit ce16887b69e94a8c0305e88c918989f8bc1bd6b7 upstream.

Signed-off-by: Willy Tarreau <w@1wt.eu>
2019-07-27 21:42:00 +02:00
Richard Weinberger 74d3d77865 mm: Export migrate_page_move_mapping and migrate_page_copy
commit 1118dce773d84f39ebd51a9fe7261f9169cb056e upstream.

Export these symbols such that UBIFS can implement
->migratepage.

Signed-off-by: Richard Weinberger <richard@nod.at>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[wt: also add the prototype to include/linux/migrate.h]

Signed-off-by: Willy Tarreau <w@1wt.eu>
2019-07-27 21:41:54 +02:00
Hugh Dickins 26e8852774 tmpfs: fix regression hang in fallocate undo
commit 7f556567036cb7f89aabe2f0954b08566b4efb53 upstream.

The well-spotted fallocate undo fix is good in most cases, but not when
fallocate failed on the very first page.  index 0 then passes lend -1
to shmem_undo_range(), and that has two bad effects: (a) that it will
undo every fallocation throughout the file, unrestricted by the current
range; but more importantly (b) it can cause the undo to hang, because
lend -1 is treated as truncation, which makes it keep on retrying until
every page has gone, but those already fully instantiated will never go
away.  Big thank you to xfstests generic/269 which demonstrates this.

Fixes: b9b4bb26af01 ("tmpfs: don't undo fallocate past its last page")
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2019-07-27 21:41:47 +02:00
Anthony Romano 03b824ddbc tmpfs: don't undo fallocate past its last page
commit b9b4bb26af017dbe930cd4df7f9b2fc3a0497bfe upstream.

When fallocate is interrupted it will undo a range that extends one byte
past its range of allocated pages.  This can corrupt an in-use page by
zeroing out its first byte.  Instead, undo using the inclusive byte
range.

Fixes: 1635f6a741 ("tmpfs: undo fallocation on failure")
Link: http://lkml.kernel.org/r/1462713387-16724-1-git-send-email-anthony.romano@coreos.com
Signed-off-by: Anthony Romano <anthony.romano@coreos.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Brandon Philips <brandon@ifup.co>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2019-07-27 21:41:47 +02:00
Al Viro dd8e7e379c it's still short a few helpers, but infrastructure should be OK now...
Change-Id: I0adb8fe9c5029bad3ac52629003c3b78e9442936
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-03 11:52:03 +01:00
LuK1337 65f8423215 Import T813XXS2BRC2 kernel source changes
Change-Id: I90bb6c013287c1edbf8ca607d1666cc4c62d504e
2018-05-26 00:39:42 +02:00
Marissa Wall 88344f35ca BACKPORT: Sanitize 'move_pages()' permission checks
The 'move_paghes()' system call was introduced long long ago with the
same permission checks as for sending a signal (except using
CAP_SYS_NICE instead of CAP_SYS_KILL for the overriding capability).

That turns out to not be a great choice - while the system call really
only moves physical page allocations around (and you need other
capabilities to do a lot of it), you can check the return value to map
out some the virtual address choices and defeat ASLR of a binary that
still shares your uid.

So change the access checks to the more common 'ptrace_may_access()'
model instead.

This tightens the access checks for the uid, and also effectively
changes the CAP_SYS_NICE check to CAP_SYS_PTRACE, but it's unlikely that
anybody really _uses_ this legacy system call any more (we hav ebetter
NUMA placement models these days), so I expect nobody to notice.

Famous last words.

Reported-by: Otto Ebeling <otto.ebeling@iki.fi>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cherry-picked from: 197e7e521384a23b9e585178f3f11c9fa08274b9

This branch does not have the PTRACE_MODE_REALCREDS flag but its
default behavior is the same as PTRACE_MODE_REALCREDS. So use
PTRACE_MODE_READ instead of PTRACE_MODE_READ_REALCREDS.

Change-Id: I75364561d91155c01f78dd62cdd41c5f0f418854
2018-05-26 00:39:35 +02:00
dookiedude ec57310b8e fs: Remove Samsung implementation of sdcardfs
Remove Samsung version of sdcardfs before we use AOSP source

Change-Id: I33710450b91d8cfde38a27967b0527e6a72fb440
2018-02-06 13:12:17 +01:00
Helge Deller 3d8ad8845f Allow stack to grow up to address space limit
commit bd726c90b6b8ce87602208701b208a208e6d5600 upstream.

Fix expand_upwards() on architectures with an upward-growing stack (parisc,
metag and partly IA-64) to allow the stack to reliably grow exactly up to
the address space limit given by TASK_SIZE.

Change-Id: I911e49b27d519aae257bf57cadff303e25872a14
Signed-off-by: Helge Deller <deller@gmx.de>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2017-07-11 00:01:09 +00:00
Hugh Dickins 07916a320e mm: fix new crash in unmapped_area_topdown()
commit f4cb767d76cf7ee72f97dd76f6cfa6c76a5edc89 upstream.

Trinity gets kernel BUG at mm/mmap.c:1963! in about 3 minutes of
mmap testing.  That's the VM_BUG_ON(gap_end < gap_start) at the
end of unmapped_area_topdown().  Linus points out how MAP_FIXED
(which does not have to respect our stack guard gap intentions)
could result in gap_end below gap_start there.  Fix that, and
the similar case in its alternative, unmapped_area().

Fixes: 1be7107fbe18 ("mm: larger stack guard gap, between vmas")
Change-Id: I4403e032a62f034df7991a3aa08f56ae7f7a20a6
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Debugged-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-11 00:01:09 +00:00
Hugh Dickins 1448dc70cd mm: larger stack guard gap, between vmas
commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream.

Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.

This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.

Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.

One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications.  For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).

Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.

Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.

Change-Id: I899511079c5057ee5299ef1aff5ab8f0c77c740d
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
[wt: backport to 4.11: adjust context]
[wt: backport to 4.9: adjust context ; kernel doc was not in admin-guide]
[wt: backport to 4.4: adjust context ; drop ppc hugetlb_radix changes]
[wt: backport to 3.18: adjust context ; no FOLL_POPULATE ;
     s390 uses generic arch_get_unmapped_area()]
[wt: backport to 3.16: adjust context]
[wt: backport to 3.10: adjust context ; code logic in PARISC's
     arch_get_unmapped_area() wasn't found ; code inserted into
     expand_upwards() and expand_downwards() runs under anon_vma lock;
     changes for gup.c:faultin_page go to memory.c:__get_user_pages();
     included Hugh Dickins' fixes]
Signed-off-by: Willy Tarreau <w@1wt.eu>
2017-07-11 00:00:39 +00:00
Hugh Dickins 925ea8fa69 mm: migrate dirty page without clear_page_dirty_for_io etc
commit 42cb14b110a5698ccf26ce59c4441722605a3743 upstream.

clear_page_dirty_for_io() has accumulated writeback and memcg subtleties
since v2.6.16 first introduced page migration; and the set_page_dirty()
which completed its migration of PageDirty, later had to be moderated to
__set_page_dirty_nobuffers(); then PageSwapBacked had to skip that too.

No actual problems seen with this procedure recently, but if you look into
what the clear_page_dirty_for_io(page)+set_page_dirty(newpage) is actually
achieving, it turns out to be nothing more than moving the PageDirty flag,
and its NR_FILE_DIRTY stat from one zone to another.

It would be good to avoid a pile of irrelevant decrementations and
incrementations, and improper event counting, and unnecessary descent of
the radix_tree under tree_lock (to set the PAGECACHE_TAG_DIRTY which
radix_tree_replace_slot() left in place anyway).

Do the NR_FILE_DIRTY movement, like the other stats movements, while
interrupts still disabled in migrate_page_move_mapping(); and don't even
bother if the zone is the same.  Do the PageDirty movement there under
tree_lock too, where old page is frozen and newpage not yet visible:
bearing in mind that as soon as newpage becomes visible in radix_tree, an
un-page-locked set_page_dirty() might interfere (or perhaps that's just
not possible: anything doing so should already hold an additional
reference to the old page, preventing its migration; but play safe).

But we do still need to transfer PageDirty in migrate_page_copy(), for
those who don't go the mapping route through migrate_page_move_mapping().

CVE-2016-3070

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ciwillia@brocade.com: backported to 3.10: adjusted context]
Signed-off-by: Charles (Chas) Williams <ciwillia@brocade.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>

Change-Id: I3ae67539b3a0ee9157a2e7d4ce8fce1cf8cacf31
2017-05-05 19:20:22 +00:00
Chris Salls b6e7473ffd mm/mempolicy.c: fix error handling in set_mempolicy and mbind.
In the case that compat_get_bitmap fails we do not want to copy the
bitmap to the user as it will contain uninitialized stack data and leak
sensitive data.

Signed-off-by: Chris Salls <salls@cs.ucsb.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-22 23:02:48 +02:00
Luca Stefani f0bb324d50 This is the 3.10.98 stable release
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWz1zgAAoJEDjbvchgkmk+yU8P/10DITNzrhCfz5wbhvvn9Uvo
 7H1DziOora3u9h8/rz6xqgFEz2/9cZ03KoLcpGha7kEFBsvgVhN3uSI0YFpVV2mT
 8/oh1ADdkky3Pld0f7gDGydDvrmgqx83/69SQ8hDQ8Mr2QTaKNvK05QGC2/EO9kI
 OcUAXjdAGglmf5rfhNhXodG/F2DtsA55uCzeyuBhcPE3bM7d4/48pwr1b2tW2CR8
 hsprRvSz+kGgHXQy8jYdxKEI66OC/i22xVnxEc8PZmPZ0fFfmszzc9nzhcseWfpe
 0JGgfwAtM8Va+bX4kfvqPpc2qR0r8Z2iEKNnAHnGutOvSWvow0l1OEedsb/+s1J6
 /AYlPIkgTxwLDAwBIymPgowkEMOPVZzPL0tkoZI8wjB+eqUxxLlIa2dNByCyUs/U
 1xTy+0UDMMDXG911mJl+yZFvd4R7lQUavIEStmMQ+A/Go2KrATaqIM8WETBlm7oH
 s3hZ3E+RBWmfD/6JQwsJNkwv6yWeaRXNE+bj8C1r/uBdPyGqX9T22OaIOlio+I71
 XBNEM5mrTlNeNVIUIKW29qmLBxBrH2LLwpv/dRyfOfzfhi1B+dl9+3sJauvrSmWi
 jrR1khGmmaZcfOT2DVmpwlDQCQcyMcy8S8RTTAHhhuNmWtSjdc3TcfRlHXvP0sOu
 ruXBufxernb94E7sqsvF
 =LW9r
 -----END PGP SIGNATURE-----

Merge tag 'v3.10.98' into HEAD

This is the 3.10.98 stable release
2017-04-18 17:17:24 +02:00
Luca Stefani 82b37d9f2f Merge remote-tracking branch 'f2fs/linux-3.10.y' into HEAD
Change-Id: Ic2fe24529f029909ddd96490bd6d885d60f88be2
2017-04-18 17:02:28 +02:00
LuK1337 4e71469c73 Merge tag 'LA.BR.1.3.6-03510-8976.0' into HEAD
Change-Id: Ie506850703bf9550ede802c13ba5f8c2ce723fa3
2017-04-18 12:11:50 +02:00
LuK1337 fc9499e55a Import latest Samsung release
* Package version: T713XXU2BQCO

Change-Id: I293d9e7f2df458c512d59b7a06f8ca6add610c99
2017-04-18 03:43:52 +02:00
Linus Torvalds ccbfd2121c mm: remove gup_flags FOLL_WRITE games from __get_user_pages()
commit 19be0eaffa3ac7d8eb6784ad9bdbc7d67ed8e619 upstream.

This is an ancient bug that was actually attempted to be fixed once
(badly) by me eleven years ago in commit 4ceb5db975 ("Fix
get_user_pages() race for write access") but that was then undone due to
problems on s390 by commit f33ea7f404 ("fix get_user_pages bug").

In the meantime, the s390 situation has long been fixed, and we can now
fix it by checking the pte_dirty() bit properly (and do it better).  The
s390 dirty bit was implemented in abf09bed3c ("s390/mm: implement
software dirty bits") which made it into v3.9.  Earlier kernels will
have to look at the page state itself.

Also, the VM has become more scalable, and what used a purely
theoretical race back then has become easier to trigger.

To fix it, we introduce a new internal FOLL_COW flag to mark the "yes,
we already did a COW" rather than play racy games with FOLL_WRITE that
is very fundamental, and then use the pte dirty flag to validate that
the FOLL_COW flag is still valid.

Change-Id: I597644627c24d95c3d2b15e825737b35c236a047
Reported-and-tested-by: Phil "not Paul" Oester <kernel@linuxace.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[wt: s/gup.c/memory.c; s/follow_page_pte/follow_page_mask;
     s/faultin_page/__get_user_page]
Signed-off-by: Willy Tarreau <w@1wt.eu>
Git-repo: http://git.kernel.org/cgit/linux/kernel/git/wtarreau/linux-stable.git
Git-commit: 9691eac5593ff1e2f82391ad327f21d90322aec1
Signed-off-by: Ravi Kumar Siddojigari <rsiddoji@codeaurora.org>
2016-11-01 03:25:55 -07:00
Suyog Sarda b0c67828b5 lowmemorykiller: Introduce sysfs node for ALMK and PPR adj threshold
The grouping of tasks based on oom_score_adj values change from
one framework to another. This requires corresponding changes in
the threshold values set for almk and per process reclaim.
Introduce sysfs nodes to set threshold adj for process reclaim
and adaptive LMK dynamically.

Change-Id: Ib7565bfd5d2e93aa4ff8fdd20414cac0a0f38bf7
Signed-off-by: Suyog Sarda <ssarda@codeaurora.org>
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2016-07-06 23:07:02 -07:00
Martijn Coenen ba88acf8c2 UPSTREAM: memcg: Only free spare array when readers are done
A spare array holding mem cgroup threshold events is kept around
to make sure we can always safely deregister an event and have an
array to store the new set of events in.

In the scenario where we're going from 1 to 0 registered events, the
pointer to the primary array containing 1 event is copied to the spare
slot, and then the spare slot is freed because no events are left.
However, it is freed before calling synchronize_rcu(), which means
readers may still be accessing threshold->primary after it is freed.

Fixed by only freeing after synchronize_rcu().

Change-Id: Iee3ad8eb400612ec24898832eb19ff34eb2aecb4
Signed-off-by: Martijn Coenen <maco@google.com>
2016-05-18 14:36:06 +05:30
dcashman d2de0d753c FROMLIST: mm: mmap: Add new /proc tunable for mmap_base ASLR.
(cherry picked from commit https://lkml.org/lkml/2015/12/21/337)

ASLR  only uses as few as 8 bits to generate the random offset for the
mmap base address on 32 bit architectures. This value was chosen to
prevent a poorly chosen value from dividing the address space in such
a way as to prevent large allocations. This may not be an issue on all
platforms. Allow the specification of a minimum number of bits so that
platforms desiring greater ASLR protection may determine where to place
the trade-off.

Bug: 24047224
Signed-off-by: Daniel Cashman <dcashman@android.com>
Signed-off-by: Daniel Cashman <dcashman@google.com>
Change-Id: I66ac01c6f4f2c8dcfc84d1f1e99490b8385b3ed4
2016-05-18 14:36:00 +05:30
Minchan Kim 2fcdbe3b1b BACKPORT: mm: /proc/pid/smaps:: show proportional swap share of the mapping
We want to know per-process workingset size for smart memory management
on userland and we use swap(ex, zram) heavily to maximize memory
efficiency so workingset includes swap as well as RSS.

On such system, if there are lots of shared anonymous pages, it's really
hard to figure out exactly how many each process consumes memory(ie, rss
+ wap) if the system has lots of shared anonymous memory(e.g, android).

This patch introduces SwapPss field on /proc/<pid>/smaps so we can get
more exact workingset size per process.

Bongkyu tested it. Result is below.

1. 50M used swap
SwapTotal: 461976 kB
SwapFree: 411192 kB

$ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
48236
$ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
141184

2. 240M used swap
SwapTotal: 461976 kB
SwapFree: 216808 kB

$ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
230315
$ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
1387744

[akpm@linux-foundation.org: simplify kunmap_atomic() call]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Bongkyu Kim <bongkyu.kim@lge.com>
Tested-by: Bongkyu Kim <bongkyu.kim@lge.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Bug: 26190646
Change-Id: Idf92d682fdef432bdd66e530a7e7cdff8f375db1
Signed-off-by: Thierry Strudel <tstrudel@google.com>
2016-05-18 14:35:57 +05:30
Sergey Senozhatsky 9d7ed54d83 UPSTREAM: zsmalloc: fix a null pointer dereference in destroy_handle_cache()
(cherry-pick from commit 02f7b4145da113683ad64c74bf64605e16b71789)

If zs_create_pool()->create_handle_cache()->kmem_cache_create() or
pool->name allocation fails, zs_create_pool()->destroy_handle_cache()
will dereference the NULL pool->handle_cachep.

Modify destroy_handle_cache() to avoid this.

Bug: 25951511

Change-Id: Ie4ea82fe34ac02e6e2548e6ed47257366d7b92f5
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:56 +05:30
Sergey Senozhatsky f609c3c2bc UPSTREAM: zsmalloc: remove extra cond_resched() in __zs_compact
(cherry-pick from commit 160a117f0864871ae1bab26554a985a1d2861afd)

Do not perform cond_resched() before the busy compaction loop in
__zs_compact(), because this loop does it when needed.

Bug: 25951511

Change-Id: I3b20b46f3a4fb44a2bf6ccb17264acf30deb7111
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:56 +05:30
Heesub Shin 9594f03013 UPSTREAM: zsmalloc: fix fatal corruption due to wrong size class selection
(cherry-pick from commit 81da9b13f73653bf5f38c63af8029fc459198ac0)

There is no point in overriding the size class below.  It causes fatal
corruption on the next chunk on the 3264-bytes size class, which is the
last size class that is not huge.

For example, if the requested size was exactly 3264 bytes, current
zsmalloc allocates and returns a chunk from the size class of 3264 bytes,
not 4096.  User access to this chunk may overwrite head of the next
adjacent chunk.

Here is the panic log captured when freelist was corrupted due to this:

    Kernel BUG at ffffffc00030659c [verbose debug info unavailable]
    Internal error: Oops - BUG: 96000006 [#1] PREEMPT SMP
    Modules linked in:
    exynos-snapshot: core register saved(CPU:5)
    CPUMERRSR: 0000000000000000, L2MERRSR: 0000000000000000
    exynos-snapshot: context saved(CPU:5)
    exynos-snapshot: item - log_kevents is disabled
    CPU: 5 PID: 898 Comm: kswapd0 Not tainted 3.10.61-4497415-eng #1
    task: ffffffc0b8783d80 ti: ffffffc0b71e8000 task.ti: ffffffc0b71e8000
    PC is at obj_idx_to_offset+0x0/0x1c
    LR is at obj_malloc+0x44/0xe8
    pc : [<ffffffc00030659c>] lr : [<ffffffc000306604>] pstate: a0000045
    sp : ffffffc0b71eb790
    x29: ffffffc0b71eb790 x28: ffffffc00204c000
    x27: 000000000001d96f x26: 0000000000000000
    x25: ffffffc098cc3500 x24: ffffffc0a13f2810
    x23: ffffffc098cc3501 x22: ffffffc0a13f2800
    x21: 000011e1a02006e3 x20: ffffffc0a13f2800
    x19: ffffffbc02a7e000 x18: 0000000000000000
    x17: 0000000000000000 x16: 0000000000000feb
    x15: 0000000000000000 x14: 00000000a01003e3
    x13: 0000000000000020 x12: fffffffffffffff0
    x11: ffffffc08b264000 x10: 00000000e3a01004
    x9 : ffffffc08b263fea x8 : ffffffc0b1e611c0
    x7 : ffffffc000307d24 x6 : 0000000000000000
    x5 : 0000000000000038 x4 : 000000000000011e
    x3 : ffffffbc00003e90 x2 : 0000000000000cc0
    x1 : 00000000d0100371 x0 : ffffffbc00003e90

Bug: 25951511

Change-Id: I0c82f61aa779ddf906212ab6e47e16c088fe683c
Reported-by: Sooyong Suk <s.suk@samsung.com>
Signed-off-by: Heesub Shin <heesub.shin@samsung.com>
Tested-by: Sooyong Suk <s.suk@samsung.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:55 +05:30
Minchan Kim 3610a81ab2 UPSTREAM: zsmalloc: remove unnecessary insertion/removal of zspage in compaction
(cherry-pick from commit 839373e645d12613308d9148041c4bd967bce8d5)

In putback_zspage, we don't need to insert a zspage into list of zspage
in size_class again to just fix fullness group. We could do directly
without reinsertion so we could save some instuctions.

Bug: 25951511

Change-Id: I07ad8bac6d2f5dc90ac0d492626e067a02699979
Reported-by: Heesub Shin <heesub.shin@samsung.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Ganesh Mahendran <opensource.ganesh@gmail.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Gunho Lee <gunho.lee@lge.com>
Cc: Juneho Choi <juno.choi@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:55 +05:30
Sergey Senozhatsky f3dbbaaa41 UPSTREAM: zsmalloc: micro-optimize zs_object_copy()
(cherry-pick from commit 495819ead5ad02174208994ca610852a7791a2f2)

A micro-optimization.  Avoid additional branching and reduce (a bit)
registry pressure (f.e.  s_off += size; d_off += size; may be calculated
twise: first for >= PAGE_SIZE check and later for offset update in "else"
clause).

scripts/bloat-o-meter shows some improvement

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
function                          old     new   delta
zs_object_copy                    550     540     -10

Bug: 25951511

Change-Id: Ie3255d79246493fc755e6256f12082e692c0fc3c
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:55 +05:30
Sergey Senozhatsky 4c117f8df8 UPSTREAM: zsmalloc: remove synchronize_rcu from zs_compact()
(cherry-pick from commit 1ec7cfb13acb8047ae5baafb43d2cd6b64ac85b9)

Do not synchronize rcu in zs_compact(). Neither zsmalloc not
zram use rcu.

Bug: 25951511

Change-Id: I2f2d1a81dac561ddfabb861bedcbb1ba773f207f
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:55 +05:30
Yinghao Xie f3ff102b86 UPSTREAM: mm/zsmalloc.c: fix comment for get_pages_per_zspage
(cherry-pick from commit 888fa374e625f3ee8f3cc2efebf52939d3bb6da1)

Bug: 25951511

Change-Id: I94fc95aff6dcab3999c3dcb275ca65b8cd4b3565
Signed-off-by: Yinghao Xie <yinghao.xie@sumsung.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:53 +05:30
Minchan Kim 588deb99ca UPSTREAM: zsmalloc: zsmalloc documentation
(cherry-pick from commit d02be50dba649b4246e0c1c4b7cb5d8a8d49de9a)

Create zsmalloc doc which explains design concept and stat information.

Bug: 25951511

Change-Id: Ie1e5ac914186558feabaf8fbea29154395267dda
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Gunho Lee <gunho.lee@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:53 +05:30
Minchan Kim e2d2259128 UPSTREAM: zsmalloc: add fullness into stat
(cherry-pick from commit 248ca1b053c82fa22427d22b33ac51a24c88a86d)

During investigating compaction, fullness information of each class is
helpful for investigating how the compaction works well.  With that, we
could know how compaction works well more clear on each size class.

Bug: 25951511

Change-Id: Idc07b265d005b680abb55b7dc61341a3de43a62c
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Gunho Lee <gunho.lee@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:53 +05:30
Minchan Kim 2a0ff415c7 UPSTREAM: zsmalloc: record handle in page->private for huge object
(cherry-pick from commit 7b60a68529b0d827d26ea3426c2addd071bff789)

We store handle on header of each allocated object so it increases the
size of each object by sizeof(unsigned long).

If zram stores 4096 bytes to zsmalloc(ie, bad compression), zsmalloc needs
4104B-class to add handle.

However, 4104B-class has 1-pages_per_zspage so wasted size by internal
fragment is 8192 - 4104, which is terrible.

So this patch records the handle in page->private on such huge object(ie,
pages_per_zspage == 1 && maxobj_per_zspage == 1) instead of header of each
object so we could use 4096B-class, not 4104B-class.

Bug: 25951511

Change-Id: I392eed4a0e0db5a940bc8a97ef56c26a7397b0f9
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Gunho Lee <gunho.lee@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:52 +05:30
Minchan Kim bcddd9e649 UPSTREAM: zsmalloc: adjust ZS_ALMOST_FULL
(cherry-pick from commit d3d07c92ff69f784bb8c3279fa87678bfa2f7f6f)

Curretly, zsmalloc regards a zspage as ZS_ALMOST_EMPTY if the zspage has
under 1/4 used objects(ie, fullness_threshold_frac).  It could make result
in loose packing since zsmalloc migrates only ZS_ALMOST_EMPTY zspage out.

This patch changes the rule so that zsmalloc makes zspage which has above
3/4 used object ZS_ALMOST_FULL so it could make tight packing.

Bug: 25951511

Change-Id: I9283cd6e8ce9916ea7213b724946664e2a6f32cb
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Gunho Lee <gunho.lee@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:52 +05:30
Minchan Kim 50767af2e9 UPSTREAM: zsmalloc: support compaction
(cherry-pick from commit 312fcae227037619dc858c9ccd362c7b847730a2)

This patch provides core functions for migration of zsmalloc.  Migraion
policy is simple as follows.

for each size class {
        while {
                src_page = get zs_page from ZS_ALMOST_EMPTY
                if (!src_page)
                        break;
                dst_page = get zs_page from ZS_ALMOST_FULL
                if (!dst_page)
                        dst_page = get zs_page from ZS_ALMOST_EMPTY
                if (!dst_page)
                        break;
                migrate(from src_page, to dst_page);
        }
}

For migration, we need to identify which objects in zspage are allocated
to migrate them out.  We could know it by iterating of freed objects in a
zspage because first_page of zspage keeps free objects singly-linked list
but it's not efficient.  Instead, this patch adds a tag(ie,
OBJ_ALLOCATED_TAG) in header of each object(ie, handle) so we could check
whether the object is allocated easily.

This patch adds another status bit in handle to synchronize between user
access through zs_map_object and migration.  During migration, we cannot
move objects user are using due to data coherency between old object and
new object.

Bug: 25951511

Change-Id: Ideb5295570cc1f6c4fcb18a8f8609c63a38c86e4
[akpm@linux-foundation.org: zsmalloc.c needs sched.h for cond_resched()]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Gunho Lee <gunho.lee@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:52 +05:30
Minchan Kim fdc42f6133 UPSTREAM: zsmalloc: factor out obj_[malloc|free]
(cherry-pick from commit c78062612fb525430b775a0bef4d3cc07e512da0)

In later patch, migration needs some part of functions in zs_malloc and
zs_free so this patch factor out them.

Bug: 25951511

Change-Id: I6079cbc1d3d107bc39f9dbb3412d9eb9039875ad
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Gunho Lee <gunho.lee@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:51 +05:30
Minchan Kim f22049e8eb UPSTREAM: zsmalloc: decouple handle and object
(cherry-pick from commit 2e40e163a25af3bd35d128d3e2e005916de5cce6)

Recently, we started to use zram heavily and some of issues
popped.

1) external fragmentation

I got a report from Juneho Choi that fork failed although there are plenty
of free pages in the system.  His investigation revealed zram is one of
the culprit to make heavy fragmentation so there was no more contiguous
16K page for pgd to fork in the ARM.

2) non-movable pages

Other problem of zram now is that inherently, user want to use zram as
swap in small memory system so they use zRAM with CMA to use memory
efficiently.  However, unfortunately, it doesn't work well because zRAM
cannot use CMA's movable pages unless it doesn't support compaction.  I
got several reports about that OOM happened with zram although there are
lots of swap space and free space in CMA area.

3) internal fragmentation

zRAM has started support memory limitation feature to limit memory usage
and I sent a patchset(https://lkml.org/lkml/2014/9/21/148) for VM to be
harmonized with zram-swap to stop anonymous page reclaim if zram consumed
memory up to the limit although there are free space on the swap.  One
problem for that direction is zram has no way to know any hole in memory
space zsmalloc allocated by internal fragmentation so zram would regard
swap is full although there are free space in zsmalloc.  For solving the
issue, zram want to trigger compaction of zsmalloc before it decides full
or not.

This patchset is first step to support above issues.  For that, it adds
indirect layer between handle and object location and supports manual
compaction to solve 3th problem first of all.

After this patchset got merged, next step is to make VM aware of zsmalloc
compaction so that generic compaction will move zsmalloced-pages
automatically in runtime.

In my imaginary experiment(ie, high compress ratio data with heavy swap
in/out on 8G zram-swap), data is as follows,

Before =
zram allocated object :      60212066 bytes
zram total used:     140103680 bytes
ratio:         42.98 percent
MemFree:          840192 kB

Compaction

After =
frag ratio after compaction
zram allocated object :      60212066 bytes
zram total used:      76185600 bytes
ratio:         79.03 percent
MemFree:          901932 kB

Juneho reported below in his real platform with small aging.
So, I think the benefit would be bigger in real aging system
for a long time.

- frag_ratio increased 3% (ie, higher is better)
- memfree increased about 6MB
- In buddy info, Normal 2^3: 4, 2^2: 1: 2^1 increased, Highmem: 2^1 21 increased

frag ratio after swap fragment
used :        156677 kbytes
total:        166092 kbytes
frag_ratio :  94
meminfo before compaction
MemFree:           83724 kB
Node 0, zone   Normal  13642   1364     57     10     61     17      9      5      4      0      0
Node 0, zone  HighMem    425     29      1      0      0      0      0      0      0      0      0

num_migrated :  23630
compaction done

frag ratio after compaction
used :        156673 kbytes
total:        160564 kbytes
frag_ratio :  97
meminfo after compaction
MemFree:           89060 kB
Node 0, zone   Normal  14076   1544     67     14     61     17      9      5      4      0      0
Node 0, zone  HighMem    863     50      1      0      0      0      0      0      0      0      0

This patchset adds more logics(about 480 lines) in zsmalloc but when I
tested heavy swapin/out program, the regression for swapin/out speed is
marginal because most of overheads were caused by compress/decompress and
other MM reclaim stuff.

This patch (of 7):

Currently, handle of zsmalloc encodes object's location directly so it
makes support of migration hard.

This patch decouples handle and object via adding indirect layer.  For
that, it allocates handle dynamically and returns it to user.  The handle
is the address allocated by slab allocation so it's unique and we could
keep object's location in the memory space allocated for handle.

With it, we can change object's position without changing handle itself.

Bug: 25951511

Change-Id: Id50a98341f63c4e1bb39589ca992661486469dca
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Gunho Lee <gunho.lee@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:51 +05:30
Ganesh Mahendran 9b71df804c BACKPORT: mm/zsmalloc: add statistics support
(cherry-pick from commit 0f050d997e275cf0e47ddc7006284eaa3c6fe049)

Keeping fragmentation of zsmalloc in a low level is our target.  But now
we still need to add the debug code in zsmalloc to get the quantitative
data.

This patch adds a new configuration CONFIG_ZSMALLOC_STAT to enable the
statistics collection for developers.  Currently only the objects
statatitics in each class are collected.  User can get the information via
debugfs.

     cat /sys/kernel/debug/zsmalloc/zram0/...

For example:

After I copied "jdk-8u25-linux-x64.tar.gz" to zram with ext4 filesystem:
 class  size obj_allocated   obj_used pages_used
     0    32             0          0          0
     1    48           256         12          3
     2    64            64         14          1
     3    80            51          7          1
     4    96           128          5          3
     5   112            73          5          2
     6   128            32          4          1
     7   144             0          0          0
     8   160             0          0          0
     9   176             0          0          0
    10   192             0          0          0
    11   208             0          0          0
    12   224             0          0          0
    13   240             0          0          0
    14   256            16          1          1
    15   272            15          9          1
    16   288             0          0          0
    17   304             0          0          0
    18   320             0          0          0
    19   336             0          0          0
    20   352             0          0          0
    21   368             0          0          0
    22   384             0          0          0
    23   400             0          0          0
    24   416             0          0          0
    25   432             0          0          0
    26   448             0          0          0
    27   464             0          0          0
    28   480             0          0          0
    29   496            33          1          4
    30   512             0          0          0
    31   528             0          0          0
    32   544             0          0          0
    33   560             0          0          0
    34   576             0          0          0
    35   592             0          0          0
    36   608             0          0          0
    37   624             0          0          0
    38   640             0          0          0
    40   672             0          0          0
    42   704             0          0          0
    43   720            17          1          3
    44   736             0          0          0
    46   768             0          0          0
    49   816             0          0          0
    51   848             0          0          0
    52   864            14          1          3
    54   896             0          0          0
    57   944            13          1          3
    58   960             0          0          0
    62  1024             4          1          1
    66  1088            15          2          4
    67  1104             0          0          0
    71  1168             0          0          0
    74  1216             0          0          0
    76  1248             0          0          0
    83  1360             3          1          1
    91  1488            11          1          4
    94  1536             0          0          0
   100  1632             5          1          2
   107  1744             0          0          0
   111  1808             9          1          4
   126  2048             4          4          2
   144  2336             7          3          4
   151  2448             0          0          0
   168  2720            15         15         10
   190  3072            28         27         21
   202  3264             0          0          0
   254  4096         36209      36209      36209

 Total               37022      36326      36288

We can calculate the overall fragentation by the last line:
    Total               37022      36326      36288
    (37022 - 36326) / 37022 = 1.87%

Also by analysing objects alocated in every class we know why we got so
low fragmentation: Most of the allocated objects is in <class 254>.  And
there is only 1 page in class 254 zspage.  So, No fragmentation will be
introduced by allocating objs in class 254.

And in future, we can collect other zsmalloc statistics as we need and
analyse them.

Bug: 25951511

Change-Id: I117604ad5fec8be5c10810ea93049140fd51f80b
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:51 +05:30
Ganesh Mahendran 45797ad2eb BACKPORT: mm/zpool: add name argument to create zpool
(cherry-pick from commit 3eba0c6a56c04f2b017b43641a821f1ebfb7fb4c)

Currently the underlay of zpool: zsmalloc/zbud, do not know who creates
them.  There is not a method to let zsmalloc/zbud find which caller they
belong to.

Now we want to add statistics collection in zsmalloc.  We need to name the
debugfs dir for each pool created.  The way suggested by Minchan Kim is to
use a name passed by caller(such as zram) to create the zsmalloc pool.

    /sys/kernel/debug/zsmalloc/zram0

This patch adds an argument `name' to zs_create_pool() and other related
functions.

Bug: 25951511

Change-Id: Ib71e8e63c71e808795073bd08c0aab14b43b4c35
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Conflicts:
	drivers/block/zram/zram_drv.c
2016-05-18 14:35:51 +05:30
Ganesh Mahendran bc7a7b9717 BACKPORT: mm/zsmalloc: adjust order of functions
(cherry-pick from commit 66cdef663cd7a97aff6bbbf41a81a0205dc81ba2)

Currently functions in zsmalloc.c does not arranged in a readable and
reasonable sequence.  With the more and more functions added, we may
meet below inconvenience.  For example:

Current functions:

    void zs_init()
    {
    }

    static void get_maxobj_per_zspage()
    {
    }

Then I want to add a func_1() which is called from zs_init(), and this
new added function func_1() will used get_maxobj_per_zspage() which is
defined below zs_init().

    void func_1()
    {
        get_maxobj_per_zspage()
    }

    void zs_init()
    {
        func_1()
    }

    static void get_maxobj_per_zspage()
    {
    }

This will cause compiling issue. So we must add a declaration:

    static void get_maxobj_per_zspage();

before func_1() if we do not put get_maxobj_per_zspage() before
func_1().

In addition, puting module_[init|exit] functions at the bottom of the
file conforms to our habit.

So, this patch ajusts function sequence as:

    /* helper functions */
    ...
    obj_location_to_handle()
    ...

    /* Some exported functions */
    ...

    zs_map_object()
    zs_unmap_object()

    zs_malloc()
    zs_free()

    zs_init()
    zs_exit()

Bug: 25951511

Change-Id: I68377a213ade041b34e99a4280ebd57a933dfa83
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:49 +05:30
Ganesh Mahendran caeae3dbac UPSTREAM: mm/zsmalloc: allocate exactly size of struct zs_pool
(cherry-pick from commit 181366561ac1e1a7bc3b91dbe45e7614a2f758b9)

In zs_create_pool(), we allocate memory more then sizeof(struct zs_pool)
  ovhd_size = roundup(sizeof(*pool), PAGE_SIZE);

This patch allocate memory of exactly needed size.

Bug: 25951511

Change-Id: Iae1216f7ab304164ea7d687e0037574d6738c94e
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:48 +05:30
Ganesh Mahendran 0349162dc3 UPSTREAM: mm/zsmalloc: avoid duplicate assignment of prev_class
(cherry-pick from commit df8b5bb998f10cfc040ad30300f9a9ea4592ff82)

In zs_create_pool(), prev_class is assigned (ZS_SIZE_CLASSES - 1) times.
And the prev_class only references to the previous size_class.  So we do
not need unnecessary assignement.

This patch assigns *prev_class* when a new size_class structure is
allocated and uses prev_class to check whether the first class has been
allocated.

Bug: 25951511

Change-Id: Ie5e4be867976af0e9ce786a58d1ee0147b7fb0ad
[akpm@linux-foundation.org: remove now-unused ZS_SIZE_CLASSES]
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Reviewed-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:48 +05:30
Mahendran Ganesh 3685436a0c UPSTREAM: mm/zsmalloc: support allocating obj with size of ZS_MAX_ALLOC_SIZE
(cherry-pick from commit 40f9fb8cffc6a20ae269e3b43dfba7a4f65d7f50)

I sent a patch [1] for unnecessary check in zsmalloc.  And Minchan Kim
found zsmalloc even does not support allocating an obj with the size of
ZS_MAX_ALLOC_SIZE in some situations.

For example:
   In system with 64KB PAGE_SIZE and 32 bit of physical addr. Then:
   ZS_MIN_ALLOC_SIZE is 32 bytes which is calculated by:
      MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
   ZS_MAX_ALLOC_SIZE is 64KB(in current code, is PAGE_SIZE)
   ZS_SIZE_CLASS_DELTA is 256 bytes
   So, ZS_SIZE_CLASSES = (ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) /
                          ZS_SIZE_CLASS_DELTA + 1
                       = 256

   In zs_create_pool(), the max size obj which can be allocated will be:
      ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA = 32 + 255*256 = 65312

   We can see that 65312 < 65536 (ZS_MAX_ALLOC_SIZE). So we can NOT
   allocate objs with size ZS_MAX_ALLOC_SIZE(65536) which we promise upper
   users we can do.

 [1]  http://lkml.iu.edu/hypermail/linux/kernel/1411.2/03835.html
 [2]  http://lkml.iu.edu/hypermail/linux/kernel/1411.2/04534.html

This patch fixes this issue by dynamiclly calculating zs_size_classes when
module is loaded, allocates buffer with size ZS_MAX_ALLOC_SIZE.  Then the
max obj(size is ZS_MAX_ALLOC_SIZE) can be stored in it.

Bug: 25951511

Change-Id: Ia35e3456e94ebaf14c65a13dde8b471ebe1095ab
[akpm@linux-foundation.org: restore ZS_SIZE_CLASSES to fix bisectability]
Signed-off-by: Mahendran Ganesh <opensource.ganesh@gmail.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:48 +05:30
Minchan Kim 7062763e70 UPSTREAM: zsmalloc: correct fragile [kmap|kunmap]_atomic use
(cherry-pick from commit af4ee5e977acb150371c28bd85cb7e34cac48b13)

The kunmap_atomic should use virtual address getting by kmap_atomic.
However, some pieces of code in zsmalloc uses modified address, not the
one got by kmap_atomic for kunmap_atomic.

It's okay for working because zsmalloc modifies the address inner
PAGE_SIZE bounday so it works with current kmap_atomic's implementation.
But it's still fragile with potential changing of kmap_atomic so let's
correct it.

I got a subtle bug when I implemented a new feature of zsmalloc
(compaction) due to a link's mishandling (the link was over page
boundary).  Although it was totally my mistake, it took a while to find
the cause because an unpredictable kmapped address was unmapped causing an
almost random crash.

Bug: 25951511

Change-Id: I9337684d102af93ec600077bf4c9658a942c8d09
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:47 +05:30
Sergey Senozhatsky 2f57d4ef53 UPSTREAM: zsmalloc: fix zs_init cpu notifier error handling
(cherry-pick from commit b1b00a5b8a6cf32e3973507decf1216709b55072)

Mahendran Ganesh reported that zpool-enabled zsmalloc should not call
zpool_unregister_driver() from zs_init() if cpu notifier registration has
failed, because error handling is performed before we register the driver
via zpool_register_driver() call.

Factor out cpu notifier registration and unregistration code and fix
zs_init() error handling.

Bug: 25951511

Change-Id: I9311d16de84accd9c5d3f2a333b30fe189a37222
link: http://lkml.iu.edu//hypermail/linux/kernel/1411.1/04156.html
[akpm@linux-foundation.org: squash bogus gcc warning]
[akpm@linux-foundation.org: use __init and __exit]
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Mahendran Ganesh <opensource.ganesh@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:47 +05:30
Joonsoo Kim 0bd808f18f UPSTREAM: zsmalloc: merge size_class to reduce fragmentation
(cherry-pick from commit 9eec4cd53f9865b733dc78cf5f6465871beed014)

zsmalloc has many size_classes to reduce fragmentation and they are in 16
bytes unit, for example, 16, 32, 48, etc., if PAGE_SIZE is 4096.  And,
zsmalloc has constraint that each zspage has 4 pages at maximum.

In this situation, we can see interesting aspect.  Let's think about
size_class for 1488, 1472, ..., 1376.  To prevent external fragmentation,
they uses 4 pages per zspage and so all they can contain 11 objects at
maximum.

16384 (4096 * 4) = 1488 * 11 + remains
16384 (4096 * 4) = 1472 * 11 + remains
16384 (4096 * 4) = ...
16384 (4096 * 4) = 1376 * 11 + remains

It means that they have same characteristics and classification between
them isn't needed.  If we use one size_class for them, we can reduce
fragementation and save some memory since both the 1488 and 1472 sized
classes can only fit 11 objects into 4 pages, and an object that's 1472
bytes can fit into an object that's 1488 bytes, merging these classes to
always use objects that are 1488 bytes will reduce the total number of
size classes.  And reducing the total number of size classes reduces
overall fragmentation, because a wider range of compressed pages can fit
into a single size class, leaving less unused objects in each size class.

For this purpose, this patch implement size_class merging.  If there is
size_class that have same pages_per_zspage and same number of objects per
zspage with previous size_class, we don't create new size_class.  Instead,
we use previous, same characteristic size_class.  With this way, above
example sizes (1488, 1472, ..., 1376) use just one size_class so we can
get much more memory utilization.

Below is result of my simple test.

TEST ENV: EXT4 on zram, mount with discard option WORKLOAD: untar kernel
source code, remove directory in descending order in size.  (drivers arch
fs sound include net Documentation firmware kernel tools)

Each line represents orig_data_size, compr_data_size, mem_used_total,
fragmentation overhead (mem_used - compr_data_size) and overhead ratio
(overhead to compr_data_size), respectively, after untar and remove
operation is executed.

* untar-nomerge.out

orig_size compr_size used_size overhead overhead_ratio
525.88MB 199.16MB 210.23MB  11.08MB 5.56%
288.32MB  97.43MB 105.63MB   8.20MB 8.41%
177.32MB  61.12MB  69.40MB   8.28MB 13.55%
146.47MB  47.32MB  56.10MB   8.78MB 18.55%
124.16MB  38.85MB  48.41MB   9.55MB 24.58%
103.93MB  31.68MB  40.93MB   9.25MB 29.21%
 84.34MB  22.86MB  32.72MB   9.86MB 43.13%
 66.87MB  14.83MB  23.83MB   9.00MB 60.70%
 60.67MB  11.11MB  18.60MB   7.49MB 67.48%
 55.86MB   8.83MB  16.61MB   7.77MB 88.03%
 53.32MB   8.01MB  15.32MB   7.31MB 91.24%

* untar-merge.out

orig_size compr_size used_size overhead overhead_ratio
526.23MB 199.18MB 209.81MB  10.64MB 5.34%
288.68MB  97.45MB 104.08MB   6.63MB 6.80%
177.68MB  61.14MB  66.93MB   5.79MB 9.47%
146.83MB  47.34MB  52.79MB   5.45MB 11.51%
124.52MB  38.87MB  44.30MB   5.43MB 13.96%
104.29MB  31.70MB  36.83MB   5.13MB 16.19%
 84.70MB  22.88MB  27.92MB   5.04MB 22.04%
 67.11MB  14.83MB  19.26MB   4.43MB 29.86%
 60.82MB  11.10MB  14.90MB   3.79MB 34.17%
 55.90MB   8.82MB  12.61MB   3.79MB 42.97%
 53.32MB   8.01MB  11.73MB   3.73MB 46.53%

As you can see above result, merged one has better utilization (overhead
ratio, 5th column) and uses less memory (mem_used_total, 3rd column).

Bug: 25951511

Change-Id: I00825d2b8de666abb7a0d8b47348b89e8af80571
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Dan Streetman <ddstreet@ieee.org>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: <juno.choi@lge.com>
Cc: "seungho1.park" <seungho1.park@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:46 +05:30
Dan Streetman 40ea37de5f UPSTREAM: zsmalloc: simplify init_zspage free obj linking
(cherry-pick from commit 5538c562377580947916b3366898f1eb5f53768e)

Change zsmalloc init_zspage() logic to iterate through each object on each
of its pages, checking the offset to verify the object is on the current
page before linking it into the zspage.

The current zsmalloc init_zspage free object linking code has logic that
relies on there only being one page per zspage when PAGE_SIZE is a
multiple of class->size.  It calculates the number of objects for the
current page, and iterates through all of them plus one, to account for
the assumed partial object at the end of the page.  While this currently
works, the logic can be simplified to just link the object at each
successive offset until the offset is larger than PAGE_SIZE, which does
not rely on PAGE_SIZE being a multiple of class->size.

Bug: 25951511

Change-Id: I89e562a18b083f24f4697b4154d5b238becb36e6
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:27 +05:30
Wang Sheng-Hui 3ab34848e6 UPSTREAM: mm/zsmalloc.c: correct comment for fullness group computation
(cherry-pick from commit 6dd9737e31504f9377a8a19810ea4922e88516c1)

The letter 'f' in "n <= N/f" stands for fullness_threshold_frac, not
1/fullness_threshold_frac.

Bug: 25951511

Change-Id: I3d3f090fab39fca1011999ea12e9aab187504e39
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:27 +05:30
Minchan Kim 2e9865bcb3 UPSTREAM: zsmalloc: change return value unit of zs_get_total_size_bytes
(cherry-pick from commit 722cdc17232f0f684011407f7cf3c40d39457971)

zs_get_total_size_bytes returns a amount of memory zsmalloc consumed with
*byte unit* but zsmalloc operates *page unit* rather than byte unit so
let's change the API so benefit we could get is that reduce unnecessary
overhead (ie, change page unit with byte unit) in zsmalloc.

Since return type is pages, "zs_get_total_pages" is better than
"zs_get_total_size_bytes".

Bug: 25951511

Change-Id: I2cbd9426483ae31c846923594e2cc3a8028e6cc2
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Dan Streetman <ddstreet@ieee.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: <juno.choi@lge.com>
Cc: <seungho1.park@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: David Horner <ds2horner@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:26 +05:30
Minchan Kim 81265e0553 UPSTREAM: zsmalloc: move pages_allocated to zs_pool
(cherry-pick from commit 13de8933c96b4557f667c337676f05274e017f83)

Currently, zram has no feature to limit memory so theoretically zram can
deplete system memory.  Users have asked for a limit several times as even
without exhaustion zram makes it hard to control memory usage of the
platform.  This patchset adds the feature.

Patch 1 makes zs_get_total_size_bytes faster because it would be used
frequently in later patches for the new feature.

Patch 2 changes zs_get_total_size_bytes's return unit from bytes to page
so that zsmalloc doesn't need unnecessary operation(ie, << PAGE_SHIFT).

Patch 3 adds new feature.  I added the feature into zram layer, not
zsmalloc because limiation is zram's requirement, not zsmalloc so any
other user using zsmalloc(ie, zpool) shouldn't affected by unnecessary
branch of zsmalloc.  In future, if every users of zsmalloc want the
feature, then, we could move the feature from client side to zsmalloc
easily but vice versa would be painful.

Patch 4 adds news facility to report maximum memory usage of zram so that
this avoids user polling frequently via /sys/block/zram0/ mem_used_total
and ensures transient max are not missed.

This patch (of 4):

pages_allocated has counted in size_class structure and when user of
zsmalloc want to see total_size_bytes, it should gather all of count from
each size_class to report the sum.

It's not bad if user don't see the value often but if user start to see
the value frequently, it would be not a good deal for performance pov.

This patch moves the count from size_class to zs_pool so it could reduce
memory footprint (from [255 * 8byte] to [sizeof(atomic_long_t)]).

Bug: 25951511

Change-Id: I05526575b81c95a12a7f8f0ef05040ed18b5fa6f
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Dan Streetman <ddstreet@ieee.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: <juno.choi@lge.com>
Cc: <seungho1.park@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Reviewed-by: David Horner <ds2horner@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:04 +05:30
Kees Cook 13406a675a BACKPORT: mm/zpool: use prefixed module loading
(cherry-pick from commit 137f8cff505ace6251dc442c7aa973d60c801a79)

To avoid potential format string expansion via module parameters, do not
use the zpool type directly in request_module() without a format string.
Additionally, to avoid arbitrary modules being loaded via zpool API
(e.g.  via the zswap_zpool_type module parameter) add a "zpool-" prefix
to the requested module, as well as module aliases for the existing
zpool types (zbud and zsmalloc).

Bug: 25951511

Change-Id: Id04e543f6e12e73e72bf79bdde4b1b13c35d7cae
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:04 +05:30
Dan Streetman f5321e895b BACKPORT: mm/zpool: zbud/zsmalloc implement zpool
(cherry-pick from commit c795779df29e180738568d2a5eb3a42f3b5e47f0)

Update zbud and zsmalloc to implement the zpool api.

Bug: 25951511

Change-Id: Ib58729c1efeb4834d566d29f9abf33fec1f7f79d
[fengguang.wu@intel.com: make functions static]
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Tested-by: Seth Jennings <sjennings@variantweb.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Weijie Yang <weijie.yang@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:35:04 +05:30
Dan Streetman 5c7d1805de BACKPORT: mm/zpool: implement common zpool api to zbud/zsmalloc
(cherry-pick from commit af8d417a04564bca0348e7e3c749ab12a3e837ad)

Add zpool api.

zpool provides an interface for memory storage, typically of compressed
memory.  Users can select what backend to use; currently the only
implementations are zbud, a low density implementation with up to two
compressed pages per storage page, and zsmalloc, a higher density
implementation with multiple compressed pages per storage page.

Bug: 25951511

Change-Id: I25da4c5454ad97c35e7f666df936d4c199f656a4
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Tested-by: Seth Jennings <sjennings@variantweb.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Weijie Yang <weijie.yang@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Conflicts:
	mm/Kconfig
	mm/Makefile
2016-05-18 14:35:03 +05:30
Weijie Yang 5b923aed0d UPSTREAM: zsmalloc: fixup trivial zs size classes value in comments
(cherry-pick from commit 7eb52512a977854eca51d9b692c2f3be8a0e5eeb)

According to calculation, ZS_SIZE_CLASSES value is 255 on systems with 4K
page size, not 254.  The old value may forget count the ZS_MIN_ALLOC_SIZE
in.

This patch fixes this trivial issue in the comments.

Bug: 25951511

Change-Id: I7f3039f14a6813bc2e97972b6968ac09d87202ed
Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:34:45 +05:30
Ben Hutchings 5e928e6fc1 UPSTREAM: mm/Kconfig: fix URL for zsmalloc benchmark
(cherry-pick from commit 2216ee853017f9c9371106c5c02d4fe42f61cbfa)

The help text for CONFIG_PGTABLE_MAPPING has an incorrect URL.  While
we're at it, remove the unnecessary footnote notation.

Bug: 25951511

Change-Id: Ia2eb06b2a5d29960b51f0b6558ef5041fd9c03fa
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-18 14:34:44 +05:30
Nitin Cupta 567f10f8c1 BACKPORT: zsmalloc: add more comment
(cherry-pick from commit c3e3e88adccb3119b69484c56798ec616307a94f)

This patch adds lots of comments and it will help others
to review and enhance.

Bug: 25951511

Change-Id: I2c1edf24e917c2d51ef68a9987d81f9b6a4a2bd2
Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Signed-off-by: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-18 14:34:43 +05:30
Minchan Kim 21048f2545 UPSTREAM: zsmalloc: add Kconfig for enabling page table method
(cherry-pick from commit 1b945aeef0b9cb5e98d682c310272b08198e54b5)

Zsmalloc has two methods 1) copy-based and 2) pte based to
access objects that span two pages.
You can see history why we supported two approach from [1].

But it was bad choice that adding hard coding to select arch
which want to use pte based method because there are lots of
SoC in an architecure and they can have different cache size,
CPU speed and so on so it would be better to expose it to user
as selectable Kconfig option like Andrew Morton suggested.

[1] https://lkml.org/lkml/2012/7/11/58

Bug: 25951511

Change-Id: Ic6855e8fefc7a0f36db896e8b03869c143e982d6
Acked-by: Nitin Gupta <ngupta@vflare.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-18 14:34:43 +05:30
Srivatsa S. Bhat 20b00693b9 zsmalloc: Fix CPU hotplug callback registration
(cherry pick from commit f0e71fcd0fa6f3f5495cd9ad3f1e4acd94446a55)

Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).

Instead, the correct and race-free way of performing the callback
registration is:

	cpu_notifier_register_begin();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	/* Note the use of the double underscored version of the API */
	__register_cpu_notifier(&foobar_cpu_notifier);

	cpu_notifier_register_done();

Fix the zsmalloc code by using this latter form of callback registration.

Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Bug: 24810447
Change-Id: Idda192da0c2d7cb3ca581ba2916fe9b4befe312e
2016-05-18 14:33:56 +05:30