(cherry-pick from commit 2e40e163a25af3bd35d128d3e2e005916de5cce6)
Recently, we started to use zram heavily and some of issues
popped.
1) external fragmentation
I got a report from Juneho Choi that fork failed although there are plenty
of free pages in the system. His investigation revealed zram is one of
the culprit to make heavy fragmentation so there was no more contiguous
16K page for pgd to fork in the ARM.
2) non-movable pages
Other problem of zram now is that inherently, user want to use zram as
swap in small memory system so they use zRAM with CMA to use memory
efficiently. However, unfortunately, it doesn't work well because zRAM
cannot use CMA's movable pages unless it doesn't support compaction. I
got several reports about that OOM happened with zram although there are
lots of swap space and free space in CMA area.
3) internal fragmentation
zRAM has started support memory limitation feature to limit memory usage
and I sent a patchset(https://lkml.org/lkml/2014/9/21/148) for VM to be
harmonized with zram-swap to stop anonymous page reclaim if zram consumed
memory up to the limit although there are free space on the swap. One
problem for that direction is zram has no way to know any hole in memory
space zsmalloc allocated by internal fragmentation so zram would regard
swap is full although there are free space in zsmalloc. For solving the
issue, zram want to trigger compaction of zsmalloc before it decides full
or not.
This patchset is first step to support above issues. For that, it adds
indirect layer between handle and object location and supports manual
compaction to solve 3th problem first of all.
After this patchset got merged, next step is to make VM aware of zsmalloc
compaction so that generic compaction will move zsmalloced-pages
automatically in runtime.
In my imaginary experiment(ie, high compress ratio data with heavy swap
in/out on 8G zram-swap), data is as follows,
Before =
zram allocated object : 60212066 bytes
zram total used: 140103680 bytes
ratio: 42.98 percent
MemFree: 840192 kB
Compaction
After =
frag ratio after compaction
zram allocated object : 60212066 bytes
zram total used: 76185600 bytes
ratio: 79.03 percent
MemFree: 901932 kB
Juneho reported below in his real platform with small aging.
So, I think the benefit would be bigger in real aging system
for a long time.
- frag_ratio increased 3% (ie, higher is better)
- memfree increased about 6MB
- In buddy info, Normal 2^3: 4, 2^2: 1: 2^1 increased, Highmem: 2^1 21 increased
frag ratio after swap fragment
used : 156677 kbytes
total: 166092 kbytes
frag_ratio : 94
meminfo before compaction
MemFree: 83724 kB
Node 0, zone Normal 13642 1364 57 10 61 17 9 5 4 0 0
Node 0, zone HighMem 425 29 1 0 0 0 0 0 0 0 0
num_migrated : 23630
compaction done
frag ratio after compaction
used : 156673 kbytes
total: 160564 kbytes
frag_ratio : 97
meminfo after compaction
MemFree: 89060 kB
Node 0, zone Normal 14076 1544 67 14 61 17 9 5 4 0 0
Node 0, zone HighMem 863 50 1 0 0 0 0 0 0 0 0
This patchset adds more logics(about 480 lines) in zsmalloc but when I
tested heavy swapin/out program, the regression for swapin/out speed is
marginal because most of overheads were caused by compress/decompress and
other MM reclaim stuff.
This patch (of 7):
Currently, handle of zsmalloc encodes object's location directly so it
makes support of migration hard.
This patch decouples handle and object via adding indirect layer. For
that, it allocates handle dynamically and returns it to user. The handle
is the address allocated by slab allocation so it's unique and we could
keep object's location in the memory space allocated for handle.
With it, we can change object's position without changing handle itself.
Bug: 25951511
Change-Id: Id50a98341f63c4e1bb39589ca992661486469dca
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Gunho Lee <gunho.lee@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry-pick from commit 3eba0c6a56c04f2b017b43641a821f1ebfb7fb4c)
Currently the underlay of zpool: zsmalloc/zbud, do not know who creates
them. There is not a method to let zsmalloc/zbud find which caller they
belong to.
Now we want to add statistics collection in zsmalloc. We need to name the
debugfs dir for each pool created. The way suggested by Minchan Kim is to
use a name passed by caller(such as zram) to create the zsmalloc pool.
/sys/kernel/debug/zsmalloc/zram0
This patch adds an argument `name' to zs_create_pool() and other related
functions.
Bug: 25951511
Change-Id: Ib71e8e63c71e808795073bd08c0aab14b43b4c35
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry-pick from commit 66cdef663cd7a97aff6bbbf41a81a0205dc81ba2)
Currently functions in zsmalloc.c does not arranged in a readable and
reasonable sequence. With the more and more functions added, we may
meet below inconvenience. For example:
Current functions:
void zs_init()
{
}
static void get_maxobj_per_zspage()
{
}
Then I want to add a func_1() which is called from zs_init(), and this
new added function func_1() will used get_maxobj_per_zspage() which is
defined below zs_init().
void func_1()
{
get_maxobj_per_zspage()
}
void zs_init()
{
func_1()
}
static void get_maxobj_per_zspage()
{
}
This will cause compiling issue. So we must add a declaration:
static void get_maxobj_per_zspage();
before func_1() if we do not put get_maxobj_per_zspage() before
func_1().
In addition, puting module_[init|exit] functions at the bottom of the
file conforms to our habit.
So, this patch ajusts function sequence as:
/* helper functions */
...
obj_location_to_handle()
...
/* Some exported functions */
...
zs_map_object()
zs_unmap_object()
zs_malloc()
zs_free()
zs_init()
zs_exit()
Bug: 25951511
Change-Id: I68377a213ade041b34e99a4280ebd57a933dfa83
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry-pick from commit df8b5bb998f10cfc040ad30300f9a9ea4592ff82)
In zs_create_pool(), prev_class is assigned (ZS_SIZE_CLASSES - 1) times.
And the prev_class only references to the previous size_class. So we do
not need unnecessary assignement.
This patch assigns *prev_class* when a new size_class structure is
allocated and uses prev_class to check whether the first class has been
allocated.
Bug: 25951511
Change-Id: Ie5e4be867976af0e9ce786a58d1ee0147b7fb0ad
[akpm@linux-foundation.org: remove now-unused ZS_SIZE_CLASSES]
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Reviewed-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry-pick from commit 40f9fb8cffc6a20ae269e3b43dfba7a4f65d7f50)
I sent a patch [1] for unnecessary check in zsmalloc. And Minchan Kim
found zsmalloc even does not support allocating an obj with the size of
ZS_MAX_ALLOC_SIZE in some situations.
For example:
In system with 64KB PAGE_SIZE and 32 bit of physical addr. Then:
ZS_MIN_ALLOC_SIZE is 32 bytes which is calculated by:
MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
ZS_MAX_ALLOC_SIZE is 64KB(in current code, is PAGE_SIZE)
ZS_SIZE_CLASS_DELTA is 256 bytes
So, ZS_SIZE_CLASSES = (ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) /
ZS_SIZE_CLASS_DELTA + 1
= 256
In zs_create_pool(), the max size obj which can be allocated will be:
ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA = 32 + 255*256 = 65312
We can see that 65312 < 65536 (ZS_MAX_ALLOC_SIZE). So we can NOT
allocate objs with size ZS_MAX_ALLOC_SIZE(65536) which we promise upper
users we can do.
[1] http://lkml.iu.edu/hypermail/linux/kernel/1411.2/03835.html
[2] http://lkml.iu.edu/hypermail/linux/kernel/1411.2/04534.html
This patch fixes this issue by dynamiclly calculating zs_size_classes when
module is loaded, allocates buffer with size ZS_MAX_ALLOC_SIZE. Then the
max obj(size is ZS_MAX_ALLOC_SIZE) can be stored in it.
Bug: 25951511
Change-Id: Ia35e3456e94ebaf14c65a13dde8b471ebe1095ab
[akpm@linux-foundation.org: restore ZS_SIZE_CLASSES to fix bisectability]
Signed-off-by: Mahendran Ganesh <opensource.ganesh@gmail.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry-pick from commit af4ee5e977acb150371c28bd85cb7e34cac48b13)
The kunmap_atomic should use virtual address getting by kmap_atomic.
However, some pieces of code in zsmalloc uses modified address, not the
one got by kmap_atomic for kunmap_atomic.
It's okay for working because zsmalloc modifies the address inner
PAGE_SIZE bounday so it works with current kmap_atomic's implementation.
But it's still fragile with potential changing of kmap_atomic so let's
correct it.
I got a subtle bug when I implemented a new feature of zsmalloc
(compaction) due to a link's mishandling (the link was over page
boundary). Although it was totally my mistake, it took a while to find
the cause because an unpredictable kmapped address was unmapped causing an
almost random crash.
Bug: 25951511
Change-Id: I9337684d102af93ec600077bf4c9658a942c8d09
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry-pick from commit b1b00a5b8a6cf32e3973507decf1216709b55072)
Mahendran Ganesh reported that zpool-enabled zsmalloc should not call
zpool_unregister_driver() from zs_init() if cpu notifier registration has
failed, because error handling is performed before we register the driver
via zpool_register_driver() call.
Factor out cpu notifier registration and unregistration code and fix
zs_init() error handling.
Bug: 25951511
Change-Id: I9311d16de84accd9c5d3f2a333b30fe189a37222
link: http://lkml.iu.edu//hypermail/linux/kernel/1411.1/04156.html
[akpm@linux-foundation.org: squash bogus gcc warning]
[akpm@linux-foundation.org: use __init and __exit]
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Mahendran Ganesh <opensource.ganesh@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry-pick from commit 9eec4cd53f9865b733dc78cf5f6465871beed014)
zsmalloc has many size_classes to reduce fragmentation and they are in 16
bytes unit, for example, 16, 32, 48, etc., if PAGE_SIZE is 4096. And,
zsmalloc has constraint that each zspage has 4 pages at maximum.
In this situation, we can see interesting aspect. Let's think about
size_class for 1488, 1472, ..., 1376. To prevent external fragmentation,
they uses 4 pages per zspage and so all they can contain 11 objects at
maximum.
16384 (4096 * 4) = 1488 * 11 + remains
16384 (4096 * 4) = 1472 * 11 + remains
16384 (4096 * 4) = ...
16384 (4096 * 4) = 1376 * 11 + remains
It means that they have same characteristics and classification between
them isn't needed. If we use one size_class for them, we can reduce
fragementation and save some memory since both the 1488 and 1472 sized
classes can only fit 11 objects into 4 pages, and an object that's 1472
bytes can fit into an object that's 1488 bytes, merging these classes to
always use objects that are 1488 bytes will reduce the total number of
size classes. And reducing the total number of size classes reduces
overall fragmentation, because a wider range of compressed pages can fit
into a single size class, leaving less unused objects in each size class.
For this purpose, this patch implement size_class merging. If there is
size_class that have same pages_per_zspage and same number of objects per
zspage with previous size_class, we don't create new size_class. Instead,
we use previous, same characteristic size_class. With this way, above
example sizes (1488, 1472, ..., 1376) use just one size_class so we can
get much more memory utilization.
Below is result of my simple test.
TEST ENV: EXT4 on zram, mount with discard option WORKLOAD: untar kernel
source code, remove directory in descending order in size. (drivers arch
fs sound include net Documentation firmware kernel tools)
Each line represents orig_data_size, compr_data_size, mem_used_total,
fragmentation overhead (mem_used - compr_data_size) and overhead ratio
(overhead to compr_data_size), respectively, after untar and remove
operation is executed.
* untar-nomerge.out
orig_size compr_size used_size overhead overhead_ratio
525.88MB 199.16MB 210.23MB 11.08MB 5.56%
288.32MB 97.43MB 105.63MB 8.20MB 8.41%
177.32MB 61.12MB 69.40MB 8.28MB 13.55%
146.47MB 47.32MB 56.10MB 8.78MB 18.55%
124.16MB 38.85MB 48.41MB 9.55MB 24.58%
103.93MB 31.68MB 40.93MB 9.25MB 29.21%
84.34MB 22.86MB 32.72MB 9.86MB 43.13%
66.87MB 14.83MB 23.83MB 9.00MB 60.70%
60.67MB 11.11MB 18.60MB 7.49MB 67.48%
55.86MB 8.83MB 16.61MB 7.77MB 88.03%
53.32MB 8.01MB 15.32MB 7.31MB 91.24%
* untar-merge.out
orig_size compr_size used_size overhead overhead_ratio
526.23MB 199.18MB 209.81MB 10.64MB 5.34%
288.68MB 97.45MB 104.08MB 6.63MB 6.80%
177.68MB 61.14MB 66.93MB 5.79MB 9.47%
146.83MB 47.34MB 52.79MB 5.45MB 11.51%
124.52MB 38.87MB 44.30MB 5.43MB 13.96%
104.29MB 31.70MB 36.83MB 5.13MB 16.19%
84.70MB 22.88MB 27.92MB 5.04MB 22.04%
67.11MB 14.83MB 19.26MB 4.43MB 29.86%
60.82MB 11.10MB 14.90MB 3.79MB 34.17%
55.90MB 8.82MB 12.61MB 3.79MB 42.97%
53.32MB 8.01MB 11.73MB 3.73MB 46.53%
As you can see above result, merged one has better utilization (overhead
ratio, 5th column) and uses less memory (mem_used_total, 3rd column).
Bug: 25951511
Change-Id: I00825d2b8de666abb7a0d8b47348b89e8af80571
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Dan Streetman <ddstreet@ieee.org>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: <juno.choi@lge.com>
Cc: "seungho1.park" <seungho1.park@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry-pick from commit 5538c56237)
Change zsmalloc init_zspage() logic to iterate through each object on each
of its pages, checking the offset to verify the object is on the current
page before linking it into the zspage.
The current zsmalloc init_zspage free object linking code has logic that
relies on there only being one page per zspage when PAGE_SIZE is a
multiple of class->size. It calculates the number of objects for the
current page, and iterates through all of them plus one, to account for
the assumed partial object at the end of the page. While this currently
works, the logic can be simplified to just link the object at each
successive offset until the offset is larger than PAGE_SIZE, which does
not rely on PAGE_SIZE being a multiple of class->size.
Bug: 25951511
Change-Id: I89e562a18b083f24f4697b4154d5b238becb36e6
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry-pick from commit 6dd9737e31)
The letter 'f' in "n <= N/f" stands for fullness_threshold_frac, not
1/fullness_threshold_frac.
Bug: 25951511
Change-Id: I3d3f090fab39fca1011999ea12e9aab187504e39
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry-pick from commit 722cdc1723)
zs_get_total_size_bytes returns a amount of memory zsmalloc consumed with
*byte unit* but zsmalloc operates *page unit* rather than byte unit so
let's change the API so benefit we could get is that reduce unnecessary
overhead (ie, change page unit with byte unit) in zsmalloc.
Since return type is pages, "zs_get_total_pages" is better than
"zs_get_total_size_bytes".
Bug: 25951511
Change-Id: I2cbd9426483ae31c846923594e2cc3a8028e6cc2
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Dan Streetman <ddstreet@ieee.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: <juno.choi@lge.com>
Cc: <seungho1.park@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: David Horner <ds2horner@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry-pick from commit 13de8933c9)
Currently, zram has no feature to limit memory so theoretically zram can
deplete system memory. Users have asked for a limit several times as even
without exhaustion zram makes it hard to control memory usage of the
platform. This patchset adds the feature.
Patch 1 makes zs_get_total_size_bytes faster because it would be used
frequently in later patches for the new feature.
Patch 2 changes zs_get_total_size_bytes's return unit from bytes to page
so that zsmalloc doesn't need unnecessary operation(ie, << PAGE_SHIFT).
Patch 3 adds new feature. I added the feature into zram layer, not
zsmalloc because limiation is zram's requirement, not zsmalloc so any
other user using zsmalloc(ie, zpool) shouldn't affected by unnecessary
branch of zsmalloc. In future, if every users of zsmalloc want the
feature, then, we could move the feature from client side to zsmalloc
easily but vice versa would be painful.
Patch 4 adds news facility to report maximum memory usage of zram so that
this avoids user polling frequently via /sys/block/zram0/ mem_used_total
and ensures transient max are not missed.
This patch (of 4):
pages_allocated has counted in size_class structure and when user of
zsmalloc want to see total_size_bytes, it should gather all of count from
each size_class to report the sum.
It's not bad if user don't see the value often but if user start to see
the value frequently, it would be not a good deal for performance pov.
This patch moves the count from size_class to zs_pool so it could reduce
memory footprint (from [255 * 8byte] to [sizeof(atomic_long_t)]).
Bug: 25951511
Change-Id: I05526575b81c95a12a7f8f0ef05040ed18b5fa6f
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Dan Streetman <ddstreet@ieee.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: <juno.choi@lge.com>
Cc: <seungho1.park@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Reviewed-by: David Horner <ds2horner@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry-pick from commit 137f8cff50)
To avoid potential format string expansion via module parameters, do not
use the zpool type directly in request_module() without a format string.
Additionally, to avoid arbitrary modules being loaded via zpool API
(e.g. via the zswap_zpool_type module parameter) add a "zpool-" prefix
to the requested module, as well as module aliases for the existing
zpool types (zbud and zsmalloc).
Bug: 25951511
Change-Id: Id04e543f6e12e73e72bf79bdde4b1b13c35d7cae
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry-pick from commit af8d417a04)
Add zpool api.
zpool provides an interface for memory storage, typically of compressed
memory. Users can select what backend to use; currently the only
implementations are zbud, a low density implementation with up to two
compressed pages per storage page, and zsmalloc, a higher density
implementation with multiple compressed pages per storage page.
Bug: 25951511
Change-Id: I25da4c5454ad97c35e7f666df936d4c199f656a4
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Tested-by: Seth Jennings <sjennings@variantweb.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Weijie Yang <weijie.yang@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry-pick from commit 7eb52512a9)
According to calculation, ZS_SIZE_CLASSES value is 255 on systems with 4K
page size, not 254. The old value may forget count the ZS_MIN_ALLOC_SIZE
in.
This patch fixes this trivial issue in the comments.
Bug: 25951511
Change-Id: I7f3039f14a6813bc2e97972b6968ac09d87202ed
Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry-pick from commit 2216ee8530)
The help text for CONFIG_PGTABLE_MAPPING has an incorrect URL. While
we're at it, remove the unnecessary footnote notation.
Bug: 25951511
Change-Id: Ia2eb06b2a5d29960b51f0b6558ef5041fd9c03fa
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry-pick from commit c3e3e88adc)
This patch adds lots of comments and it will help others
to review and enhance.
Bug: 25951511
Change-Id: I2c1edf24e917c2d51ef68a9987d81f9b6a4a2bd2
Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Signed-off-by: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry-pick from commit 1b945aeef0)
Zsmalloc has two methods 1) copy-based and 2) pte based to
access objects that span two pages.
You can see history why we supported two approach from [1].
But it was bad choice that adding hard coding to select arch
which want to use pte based method because there are lots of
SoC in an architecure and they can have different cache size,
CPU speed and so on so it would be better to expose it to user
as selectable Kconfig option like Andrew Morton suggested.
[1] https://lkml.org/lkml/2012/7/11/58
Bug: 25951511
Change-Id: Ic6855e8fefc7a0f36db896e8b03869c143e982d6
Acked-by: Nitin Gupta <ngupta@vflare.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry-pick from commit 67296874eb)
zsmalloc encodes a handle using the pfn and an object
index. On hardware platforms with physical memory starting
at 0x0 the pfn can be 0. This causes the encoded handle to be
0 and is incorrectly interpreted as an allocation failure.
This issue affects all current and future SoCs with physical
memory starting at 0x0. All MSM8974 SoCs which includes
Google Nexus 5 devices are affected.
To prevent this false error we ensure that the encoded handle
will not be 0 when allocation succeeds.
Bug: 25951511
Change-Id: I6d3c18ba4963c89a673fe633ce1e7a29d767fefa
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry-pick from commit 396b7fd6f9)
The existing comments are using an odd style. Fixed them up to adhere
to the StyleGuide. No code changes.
Bug: 25951511
Change-Id: Ica40c276260a4a4fb573185929d804fa3685e1b0
Signed-off-by: Sara Bird <sara.bird.iar@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry pick from commit f0e71fcd0f)
Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:
get_online_cpus();
for_each_online_cpu(cpu)
init_cpu(cpu);
register_cpu_notifier(&foobar_cpu_notifier);
put_online_cpus();
This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).
Instead, the correct and race-free way of performing the callback
registration is:
cpu_notifier_register_begin();
for_each_online_cpu(cpu)
init_cpu(cpu);
/* Note the use of the double underscored version of the API */
__register_cpu_notifier(&foobar_cpu_notifier);
cpu_notifier_register_done();
Fix the zsmalloc code by using this latter form of callback registration.
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Bug: 24810447
Change-Id: Idda192da0c2d7cb3ca581ba2916fe9b4befe312e
(cherry pick from commit 31fc00bb78)
Add my copyright to the zsmalloc source code which I maintain.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bug: 24810447
Change-Id: Ic4137129666be7a6a383ed8b9c929ee97b6cc9fc
(cherry pick from bcf1647d08)
This patch moves zsmalloc under mm directory.
Before that, description will explain why we have needed custom
allocator.
Zsmalloc is a new slab-based memory allocator for storing compressed
pages. It is designed for low fragmentation and high allocation success
rate on large object, but <= PAGE_SIZE allocations.
zsmalloc differs from the kernel slab allocator in two primary ways to
achieve these design goals.
zsmalloc never requires high order page allocations to back slabs, or
"size classes" in zsmalloc terms. Instead it allows multiple
single-order pages to be stitched together into a "zspage" which backs
the slab. This allows for higher allocation success rate under memory
pressure.
Also, zsmalloc allows objects to span page boundaries within the zspage.
This allows for lower fragmentation than could be had with the kernel
slab allocator for objects between PAGE_SIZE/2 and PAGE_SIZE. With the
kernel slab allocator, if a page compresses to 60% of it original size,
the memory savings gained through compression is lost in fragmentation
because another object of the same size can't be stored in the leftover
space.
This ability to span pages results in zsmalloc allocations not being
directly addressable by the user. The user is given an
non-dereferencable handle in response to an allocation request. That
handle must be mapped, using zs_map_object(), which returns a pointer to
the mapped region that can be used. The mapping is necessary since the
object data may reside in two different noncontigious pages.
The zsmalloc fulfills the allocation needs for zram perfectly
[sjenning@linux.vnet.ibm.com: borrow Seth's quote]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Nitin Gupta <ngupta@vflare.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bug: 24810447
Change-Id: I7b7923baeb9989e002523c66696e4a98fb357c46
We want to know per-process workingset size for smart memory management
on userland and we use swap(ex, zram) heavily to maximize memory
efficiency so workingset includes swap as well as RSS.
On such system, if there are lots of shared anonymous pages, it's really
hard to figure out exactly how many each process consumes memory(ie, rss
+ wap) if the system has lots of shared anonymous memory(e.g, android).
This patch introduces SwapPss field on /proc/<pid>/smaps so we can get
more exact workingset size per process.
Bongkyu tested it. Result is below.
1. 50M used swap
SwapTotal: 461976 kB
SwapFree: 411192 kB
$ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
48236
$ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
141184
2. 240M used swap
SwapTotal: 461976 kB
SwapFree: 216808 kB
$ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
230315
$ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
1387744
[akpm@linux-foundation.org: simplify kunmap_atomic() call]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Bongkyu Kim <bongkyu.kim@lge.com>
Tested-by: Bongkyu Kim <bongkyu.kim@lge.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bug: 26190646
Change-Id: Idf92d682fdef432bdd66e530a7e7cdff8f375db1
Signed-off-by: Thierry Strudel <tstrudel@google.com>
set_pageblock_order() may be called when memory hotplug, so need use
'__paginginit' instead of '__init'.
The related warning:
The function __meminit .free_area_init_node() references
a function __init .set_pageblock_order().
If .set_pageblock_order is only used by .free_area_init_node then
annotate .set_pageblock_order with a matching annotation.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the case that compat_get_bitmap fails we do not want to copy the
bitmap to the user as it will contain uninitialized stack data and leak
sensitive data.
Change-Id: Ia02cc50f336357469af11d8b3135e48be294f7e0
Signed-off-by: Chris Salls <salls@cs.ucsb.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit bd726c90b6b8ce87602208701b208a208e6d5600 upstream.
Fix expand_upwards() on architectures with an upward-growing stack (parisc,
metag and partly IA-64) to allow the stack to reliably grow exactly up to
the address space limit given by TASK_SIZE.
Change-Id: I5a480cdafc4fc2728b7a5dbe48155eaf8796f46e
Signed-off-by: Helge Deller <deller@gmx.de>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream.
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.
Change-Id: I611023b0bfe1cab7b3e5da13e331a7baaaaf6eb0
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
[wt: backport to 4.11: adjust context]
[wt: backport to 4.9: adjust context ; kernel doc was not in admin-guide]
[wt: backport to 4.4: adjust context ; drop ppc hugetlb_radix changes]
[wt: backport to 3.18: adjust context ; no FOLL_POPULATE ;
s390 uses generic arch_get_unmapped_area()]
[wt: backport to 3.16: adjust context]
[wt: backport to 3.10: adjust context ; code logic in PARISC's
arch_get_unmapped_area() wasn't found ; code inserted into
expand_upwards() and expand_downwards() runs under anon_vma lock;
changes for gup.c:faultin_page go to memory.c:__get_user_pages();
included Hugh Dickins' fixes]
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Flex1911 <dedsa2002@gmail.com>
commit 42cb14b110a5698ccf26ce59c4441722605a3743 upstream.
clear_page_dirty_for_io() has accumulated writeback and memcg subtleties
since v2.6.16 first introduced page migration; and the set_page_dirty()
which completed its migration of PageDirty, later had to be moderated to
__set_page_dirty_nobuffers(); then PageSwapBacked had to skip that too.
No actual problems seen with this procedure recently, but if you look into
what the clear_page_dirty_for_io(page)+set_page_dirty(newpage) is actually
achieving, it turns out to be nothing more than moving the PageDirty flag,
and its NR_FILE_DIRTY stat from one zone to another.
It would be good to avoid a pile of irrelevant decrementations and
incrementations, and improper event counting, and unnecessary descent of
the radix_tree under tree_lock (to set the PAGECACHE_TAG_DIRTY which
radix_tree_replace_slot() left in place anyway).
Do the NR_FILE_DIRTY movement, like the other stats movements, while
interrupts still disabled in migrate_page_move_mapping(); and don't even
bother if the zone is the same. Do the PageDirty movement there under
tree_lock too, where old page is frozen and newpage not yet visible:
bearing in mind that as soon as newpage becomes visible in radix_tree, an
un-page-locked set_page_dirty() might interfere (or perhaps that's just
not possible: anything doing so should already hold an additional
reference to the old page, preventing its migration; but play safe).
But we do still need to transfer PageDirty in migrate_page_copy(), for
those who don't go the mapping route through migrate_page_move_mapping().
CVE-2016-3070
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ciwillia@brocade.com: backported to 3.10: adjusted context]
Signed-off-by: Charles (Chas) Williams <ciwillia@brocade.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Change-Id: Ifee7c4f76e277763ef2dd63c810590cd237575a4
Reading page fault handler code I've noticed that under right
circumstances kernel would map anonymous pages into file mappings: if
the VMA doesn't have vm_ops->fault() and the VMA wasn't fully populated
on ->mmap(), kernel would handle page fault to not populated pte with
do_anonymous_page().
Let's change page fault handler to use do_anonymous_page() only on
anonymous VMA (->vm_ops == NULL) and make sure that the VMA is not
shared.
For file mappings without vm_ops->fault() or shred VMA without vm_ops,
page fault on pte_none() entry would lead to SIGBUS.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Willy Tarreau <w@1wt.eu>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 6b7339f4c31ad69c8e9c0b2859276e22cf72176d)
Change-Id: I1bb2bd41b5c4686c50172aabe38bcd2adb3744f4
commit 19be0eaffa3ac7d8eb6784ad9bdbc7d67ed8e619 upstream.
faultin_page drops FOLL_WRITE after the page fault handler did the CoW
and then we retry follow_page_mask to get our CoWed page. This is racy,
however because the page might have been unmapped by that time and so
we would have to do a page fault again, this time without CoW. This
would cause the page cache corruption for FOLL_FORCE on MAP_PRIVATE
read only mappings with obvious consequences.
This is an ancient bug that was actually already fixed once by Linus
eleven years ago in commit 4ceb5db975 ("Fix get_user_pages() race
for write access") but that was then undone due to problems on s390
by commit f33ea7f404 ("fix get_user_pages bug") because s390 didn't
have proper dirty pte tracking until abf09bed3c ("s390/mm: implement
software dirty bits"). This wasn't a problem at the time as pointed out
by Hugh Dickins because madvise relied on mmap_sem for write up until
0a27a14a62 ("mm: madvise avoid exclusive mmap_sem") but since then we
can race with madvise which can unmap the fresh COWed page or with KSM
and corrupt the content of the shared page.
This patch is based on the Linus' approach to not clear FOLL_WRITE after
the CoW page fault (aka VM_FAULT_WRITE) but instead introduces FOLL_COW
to note this fact. The flag is then rechecked during follow_pfn_pte to
enforce the page fault again if we do not see the CoWed page. Linus was
suggesting to check pte_dirty again as s390 is OK now. But that would
make backporting to some old kernels harder. So instead let's just make
sure that vm_normal_page sees a pure anonymous page.
This would guarantee we are seeing a real CoW page. Introduce
can_follow_write_pte which checks both pte_write and falls back to
PageAnon on forced write faults which passed CoW already. Thanks to Hugh
to point out that a special care has to be taken for KSM pages because
our COWed page might have been merged with a KSM one and keep its
PageAnon flag.
Change-Id: I164802be6d757c7a49b57416dfc9f4605ce0e1fb
Fixes: 0a27a14a62 ("mm: madvise avoid exclusive mmap_sem")
Reported-by: Phil "not Paul" Oester <kernel@linuxace.com>
Disclosed-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
[bwh: Backported to 3.2:
- Adjust filename, context, indentation
- The 'no_page' exit path in follow_page() is different, so open-code the
cleanup
- Delete a now-unused label]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Zefan Li <lizefan@huawei.com>
The patch was incorrectly backported.
Revert it to bring in the correct backport.
This reverts commit 82a3f4741a.
Change-Id: I0e502701d6653ce6f004cbcf154b1ccc57e216e7
commit 19be0eaffa3ac7d8eb6784ad9bdbc7d67ed8e619 upstream.
This is an ancient bug that was actually attempted to be fixed once
(badly) by me eleven years ago in commit 4ceb5db975 ("Fix
get_user_pages() race for write access") but that was then undone due to
problems on s390 by commit f33ea7f404 ("fix get_user_pages bug").
In the meantime, the s390 situation has long been fixed, and we can now
fix it by checking the pte_dirty() bit properly (and do it better). The
s390 dirty bit was implemented in abf09bed3c ("s390/mm: implement
software dirty bits") which made it into v3.9. Earlier kernels will
have to look at the page state itself.
Also, the VM has become more scalable, and what used a purely
theoretical race back then has become easier to trigger.
To fix it, we introduce a new internal FOLL_COW flag to mark the "yes,
we already did a COW" rather than play racy games with FOLL_WRITE that
is very fundamental, and then use the pte dirty flag to validate that
the FOLL_COW flag is still valid.
Change-Id: Ifcb16e37ceb6b8845d7a77a97f5fda6670d08378
Reported-and-tested-by: Phil "not Paul" Oester <kernel@linuxace.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[wt: s/gup.c/memory.c; s/follow_page_pte/follow_page_mask;
s/faultin_page/__get_user_page]
Signed-off-by: Willy Tarreau <w@1wt.eu>
(cherry picked from commit 59747d5d21)
(cherry picked from commit https://lkml.org/lkml/2015/12/21/337)
ASLR only uses as few as 8 bits to generate the random offset for the
mmap base address on 32 bit architectures. This value was chosen to
prevent a poorly chosen value from dividing the address space in such
a way as to prevent large allocations. This may not be an issue on all
platforms. Allow the specification of a minimum number of bits so that
platforms desiring greater ASLR protection may determine where to place
the trade-off.
Bug: 24047224
Signed-off-by: Daniel Cashman <dcashman@android.com>
Signed-off-by: Daniel Cashman <dcashman@google.com>
Change-Id: Ic74424e07710cd9ccb4a02871a829d14ef0cc4bc
Fix two bugs caused by merging anon vma_naming, a typo in
mempolicy.c and a bad merge in sys.c.
Change-Id: Ia4ced447d50573e68195e95ea2f2b4d9456b8a90
Signed-off-by: Colin Cross <ccross@android.com>
commit 71abdc15ad upstream.
When kswapd exits, it can end up taking locks that were previously held
by allocating tasks while they waited for reclaim. Lockdep currently
warns about this:
On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
> inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
> kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
> (&sig->group_rwsem){+++++?}, at: exit_signals+0x24/0x130
> {RECLAIM_FS-ON-W} state was registered at:
> mark_held_locks+0xb9/0x140
> lockdep_trace_alloc+0x7a/0xe0
> kmem_cache_alloc_trace+0x37/0x240
> flex_array_alloc+0x99/0x1a0
> cgroup_attach_task+0x63/0x430
> attach_task_by_pid+0x210/0x280
> cgroup_procs_write+0x16/0x20
> cgroup_file_write+0x120/0x2c0
> vfs_write+0xc0/0x1f0
> SyS_write+0x4c/0xa0
> tracesys+0xdd/0xe2
> irq event stamp: 49
> hardirqs last enabled at (49): _raw_spin_unlock_irqrestore+0x36/0x70
> hardirqs last disabled at (48): _raw_spin_lock_irqsave+0x2b/0xa0
> softirqs last enabled at (0): copy_process.part.24+0x627/0x15f0
> softirqs last disabled at (0): (null)
>
> other info that might help us debug this:
> Possible unsafe locking scenario:
>
> CPU0
> ----
> lock(&sig->group_rwsem);
> <Interrupt>
> lock(&sig->group_rwsem);
>
> *** DEADLOCK ***
>
> no locks held by kswapd2/1151.
>
> stack backtrace:
> CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
> Call Trace:
> dump_stack+0x19/0x1b
> print_usage_bug+0x1f7/0x208
> mark_lock+0x21d/0x2a0
> __lock_acquire+0x52a/0xb60
> lock_acquire+0xa2/0x140
> down_read+0x51/0xa0
> exit_signals+0x24/0x130
> do_exit+0xb5/0xa50
> kthread+0xdb/0x100
> ret_from_fork+0x7c/0xb0
This is because the kswapd thread is still marked as a reclaimer at the
time of exit. But because it is exiting, nobody is actually waiting on
it to make reclaim progress anymore, and it's nothing but a regular
thread at this point. Be tidy and strip it of all its powers
(PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state)
before returning from the thread function.
Change-Id: I51cc36b913919dbaeb56322d7afe71e0c36f36bc
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 60cefed485 upstream.
Kswapd does not in all places have the same criteria for a balanced
zone. Zones are only being reclaimed when their high watermark is
breached, but compaction checks loop over the zonelist again when the
zone does not meet the low watermark plus two times the size of the
allocation. This gets kswapd stuck in an endless loop over a small
zone, like the DMA zone, where the high watermark is smaller than the
compaction requirement.
Add a function, zone_balanced(), that checks the watermark, and, for
higher order allocations, if compaction has enough free memory. Then
use it uniformly to check for balanced zones.
This makes sure that when the compaction watermark is not met, at least
reclaim happens and progress is made - or the zone is declared
unreclaimable at some point and skipped entirely.
Change-Id: I28fd1de8628aa7e799cbc11e313e8d8f3e2df977
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: George Spelvin <linux@horizon.com>
Reported-by: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
Reported-by: Tomas Racek <tracek@redhat.com>
Tested-by: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[hq: Backported to 3.4: adjust context]
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b0a8cc58e6 upstream.
In kswapd(), set current->reclaim_state to NULL before returning, as
current->reclaim_state holds reference to variable on kswapd()'s stack.
In rare cases, while returning from kswapd() during memory offlining,
__free_slab() and freepages() can access the dangling pointer of
current->reclaim_state.
Change-Id: I49f69977895442b3a8328c9c39877369b4554ba5
Signed-off-by: Takamori Yamaguchi <takamori.yamaguchi@jp.sony.com>
Signed-off-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fe35004fbf upstream.
Sometimes we'd like to avoid swapping out anonymous memory. In
particular, avoid swapping out pages of important process or process
groups while there is a reasonable amount of pagecache on RAM so that we
can satisfy our customers' requirements.
OTOH, we can control how aggressive the kernel will swap memory pages with
/proc/sys/vm/swappiness for global and
/sys/fs/cgroup/memory/memory.swappiness for each memcg.
But with current reclaim implementation, the kernel may swap out even if
we set swappiness=0 and there is pagecache in RAM.
This patch changes the behavior with swappiness==0. If we set
swappiness==0, the kernel does not swap out completely (for global reclaim
until the amount of free pages and filebacked pages in a zone has been
reduced to something very very small (nr_free + nr_filebacked < high
watermark)).
Change-Id: Ibb357f534711e73838bca037bed7532bbf4d2728
Signed-off-by: Satoru Moriya <satoru.moriya@hds.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1c7e7f6c07 upstream.
Offlining memory may block forever, waiting for kswapd() to wake up
because kswapd() does not check the event kthread->should_stop before
sleeping.
The proper pattern, from Documentation/memory-barriers.txt, is:
--- waker ---
event_indicated = 1;
wake_up_process(event_daemon);
--- sleeper ---
for (;;) {
set_current_state(TASK_UNINTERRUPTIBLE);
if (event_indicated)
break;
schedule();
}
set_current_state() may be wrapped by:
prepare_to_wait();
In the kswapd() case, event_indicated is kthread->should_stop.
=== offlining memory (waker) ===
kswapd_stop()
kthread_stop()
kthread->should_stop = 1
wake_up_process()
wait_for_completion()
=== kswapd_try_to_sleep (sleeper) ===
kswapd_try_to_sleep()
prepare_to_wait()
.
.
schedule()
.
.
finish_wait()
The schedule() needs to be protected by a test of kthread->should_stop,
which is wrapped by kthread_should_stop().
Reproducer:
Do heavy file I/O in background.
Do a memory offline/online in a tight loop
Change-Id: I3fcff189649427eb33ebf6c8753c1fed2cf0d687
Signed-off-by: Aaditya Kumar <aaditya.kumar@ap.sony.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e48982734e upstream.
Commit 6457474624 ("vmscan: detect mapped file pages used only once")
made mapped pages have another round in inactive list because they might
be just short lived and so we could consider them again next time. This
heuristic helps to reduce pressure on the active list with a streaming
IO worklods.
This patch fixes a regression introduced by this commit for heavy shmem
based workloads because unlike Anon pages, which are excluded from this
heuristic because they are usually long lived, shmem pages are handled
as a regular page cache.
This doesn't work quite well, unfortunately, if the workload is mostly
backed by shmem (in memory database sitting on 80% of memory) with a
streaming IO in the background (backup - up to 20% of memory). Anon
inactive list is full of (dirty) shmem pages when watermarks are hit.
Shmem pages are kept in the inactive list (they are referenced) in the
first round and it is hard to reclaim anything else so we reach lower
scanning priorities very quickly which leads to an excessive swap out.
Let's fix this by excluding all swap backed pages (they tend to be long
lived wrt. the regular page cache anyway) from used-once heuristic and
rather activate them if they are referenced.
The customer's workload is shmem backed database (80% of RAM) and they
are measuring transactions/s with an IO in the background (20%).
Transactions touch more or less random rows in the table. The
transaction rate fell by a factor of 3 (in the worst case) because of
commit 64574746. This patch restores the previous numbers.
Change-Id: Id6696030f349217d40d22cd8c17b3ffad1c4253f
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change the logic which determines the initial readahead window size
such that for small requests (one page) the initial window size
will be x4 the size of the original request, regardless of the
VM_MAX_READAHEAD value. This prevents a rapid ramp-up
that could be caused due to increasing VM_MAX_READAHEAD.
Change-Id: I93d59c515d7e6c6d62348790980ff7bd4f434997
Signed-off-by: Lee Susman <lsusman@codeaurora.org>
KSM thread to scan pages is getting schedule on definite timeout.
That wakes up CPU from idle state and hence may affect the power
consumption. Provide an optional support to use deferred timer
which suites low-power use-cases.
To enable deferred timers,
$ echo 1 > /sys/kernel/mm/ksm/deferred_timer
Change-Id: I07fe199f97fe1f72f9a9e1b0b757a3ac533719e8
Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>