Commit graph

307114 commits

Author SHA1 Message Date
Chao Yu
7a5dfa9285 f2fs: introduce __remove_dirty_inode
Introduce __remove_dirty_inode to clean up the code in remove_dirty_dir_inode.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:21 +08:00
Chao Yu
807a2aa41d f2fs: introduce dirty list node in inode info
Add a new dirty list node member in the inode info for linking the inode
to the global dirty list in the superblock, instead of the old
implementation which allocated slab cache memory as an entry for the inode.

This avoids memory pressure due to slab cache allocation, and also makes
the code cleaner.
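
A minimal sketch of the idea (field and function names here are
illustrative, not necessarily those of the actual patch): the inode info
embeds the list node, so linking an inode into the superblock's dirty list
is a plain list_add_tail with no extra allocation.

	/* sketch only: embed the list node instead of allocating an entry */
	struct f2fs_inode_info {
		/* ... existing fields ... */
		struct list_head dirty_list;	/* linked into sbi->dir_inode_list */
	};

	static void add_dirty_inode(struct f2fs_sb_info *sbi,
					struct f2fs_inode_info *fi)
	{
		if (list_empty(&fi->dirty_list))
			list_add_tail(&fi->dirty_list, &sbi->dir_inode_list);
	}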

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:21 +08:00
Chao Yu
5cbea47f54 f2fs: rename {add,remove,release}_dirty_inode to {add,remove,release}_ino_entry
remove_dirty_dir_inode will be renamed to remove_dirty_inode as a generic
function in a following patch, for removing directory/regular/symlink
inodes from the global dirty list.

Here, rename the ino-management-related functions for readability and to
avoid a name conflict.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:21 +08:00
Jaegeuk Kim
b339fb1ce7 f2fs: add symbol to avoid any confusion with tools
This patch adds MAX_VOLUME_NAME to sync with f2fs-tools.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:21 +08:00
Chao Yu
e116a07e4e f2fs: do more integrity verification for superblock
Do more sanity checks on the superblock during ->mount.
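
For illustration, the checks are of this general shape (a simplified
sketch in the spirit of f2fs's sanity_check_raw_super(); the actual patch
verifies more fields and takes additional arguments):

	static int sanity_check_raw_super(struct f2fs_super_block *raw_super)
	{
		/* magic must match, otherwise this is not an f2fs image */
		if (le32_to_cpu(raw_super->magic) != F2FS_SUPER_MAGIC)
			return 1;
		/* f2fs only supports a 4KB block size */
		if ((1 << le32_to_cpu(raw_super->log_blocksize)) != F2FS_BLKSIZE)
			return 1;
		/* 512 blocks per segment, i.e. log2(512) == 9 */
		if (le32_to_cpu(raw_super->log_blocks_per_seg) != 9)
			return 1;
		return 0;
	}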

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:21 +08:00
Fan Li
bc4e9c22df f2fs: fix to update variable correctly when skip a unmapped block
map.m_len should be reduced after skipping a block.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:21 +08:00
Fan Li
db88dd99f0 f2fs: write only the pages in range during defragment
@lend of filemap_write_and_wait_range is supposed to be an "offset
in bytes where the range ends (inclusive)". Subtract 1 to avoid
writing an extra page.
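
In other words, for a range starting at pos and spanning len bytes the
call presumably looks like this (sketch):

	/* @lend is inclusive, so the last byte of the range is pos + len - 1 */
	err = filemap_write_and_wait_range(inode->i_mapping, pos, pos + len - 1);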

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:20 +08:00
Chao Yu
732a06e287 f2fs: clean up node page updating flow
If read_node_page returns LOCKED_PAGE, it's better for its caller to
a) skip the unneeded 'Update' flag and mapping info verification; and
b) check the nid value stored in the footer structure of the node page.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

 Conflicts:
	fs/f2fs/node.c
2016-10-29 23:12:20 +08:00
Jaegeuk Kim
ae654a8d46 f2fs: use lock_buffer when changing superblock
When modifying sb contents, we need to lock its buffer.
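
A sketch of the intended pattern when updating the on-disk superblock
through its buffer head (not the literal diff; 'offset' and 'super' stand
in for the surrounding context):

	lock_buffer(bh);	/* serialize with any other writer of this buffer */
	memcpy(bh->b_data + offset, super, sizeof(*super));
	set_buffer_dirty(bh);
	unlock_buffer(bh);
	err = sync_dirty_buffer(bh);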

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:20 +08:00
Jaegeuk Kim
672830367a f2fs: refactor f2fs_commit_super
Previously, f2fs_commit_super hacked bh->blocknr to write the broken
alternate superblock.
Instead, we should use the correct logic to retrieve its buffer head and
lock it appropriately.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:20 +08:00
Jaegeuk Kim
e8531acc83 f2fs: enhance the bit operation for SSR
This patch enhances the existing bit operation when f2fs allocates SSR
blocks.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:20 +08:00
Chao Yu
652c783647 f2fs: fix to convert inline inode in ->setattr
In commit 3c4541452748 ("f2fs: do not trim preallocated blocks when
truncating after i_size"), in order to follow the rule "truncate(x)
where x > i_size will not trim all blocks past i_size." like other file
systems, we invoked truncate_setsize instead of f2fs_truncate in ->setattr
to avoid unneeded block trimming in that case, but forgot to call
f2fs_convert_inline_inode to keep the inline data conversion rule consistent.

This patch fixes to convert inline data if necessary.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:20 +08:00
Chao Yu
2f87117e0d f2fs: use sbi->blocks_per_seg to avoid unnecessary calculation
Use sbi->blocks_per_seg directly to avoid unnecessary calculation when using
1 << sbi->log_blocks_per_seg.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:20 +08:00
Chao Yu
bca718dfe2 f2fs: kill f2fs_drop_largest_extent
For direct IO, f2fs only allocates a new address for a block which did not
exist on the disk before, so its mapping info should not previously exist
in the extent cache, and we do not need to call f2fs_drop_largest_extent
here to drop the related cache.

Since f2fs_drop_largest_extent now has no more callers, kill it.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:20 +08:00
Chao Yu
341459e176 f2fs: clean up argument of recover_data
In recover_data, the value of the 'type' argument is always
CURSEG_WARM_NODE; remove it for cleanup.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:20 +08:00
Chao Yu
46baa0b8ab f2fs: clean up code with __has_cursum_space
Clean up the code in lookup_journal_in_cursum() with __has_cursum_space().

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:20 +08:00
Chao Yu
da3dca97d2 f2fs: clean up error path in f2fs_readdir
No logic changes, just clean up the error path.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

 Conflicts:
	fs/f2fs/dir.c
2016-10-29 23:12:20 +08:00
Jaegeuk Kim
11ce26140d f2fs: do not recover from previous remained wrong dnodes
If the device does not support discard, some obsolete dnodes can be recovered
by roll-forward. This patch enhances the recovery flow.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:20 +08:00
Jaegeuk Kim
2aac3891c8 f2fs: avoid deadlock in f2fs_shrink_extent_tree
While handling extent trees, we can enter a reclaiming path at any time.
If it tries to release some extent nodes in the same extent tree,
write_lock(&et->lock) would hang.
In order to avoid the deadlock, we can just skip it.

Note that, if it is an unreferenced tree, we should acquire write_lock(&et->lock)
successfully and release all of the nodes therein.
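
Conceptually, the shrinker opts out of contended trees with a try-lock
instead of blocking (a sketch; names and the exact control flow in
f2fs_shrink_extent_tree are illustrative):

	/* under reclaim we must not block on a tree the current path may hold */
	if (!write_trylock(&et->lock))
		continue;	/* skip this extent tree and move on */

	/* got the lock: safe to free the extent nodes of this tree */
	node_cnt += __free_extent_tree(sbi, et, false);
	write_unlock(&et->lock);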

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:20 +08:00
Chao Yu
6646d6f871 f2fs: fix to report error in f2fs_readdir
get_lock_data_page in f2fs_readdir can fail for a lot of reasons (e.g.
no memory or an IO error...); it's better to report this kind of error to
the user rather than ignoring it.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:20 +08:00
Chao Yu
df01cbe690 f2fs: clear page uptodate when dropping cache for atomic write
We should clear the uptodate flag for all atomically written pages when we
drop them; otherwise, before these cached pages are eventually reclaimed or
invalidated, we will see invalid data when hitting them again.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:19 +08:00
Fan Li
3341a3dd70 f2fs: optimize __find_rev_next_bit
1. Skip __reverse_ulong if the bitmap is empty.
2. Reduce branches and code.
According to my test, the performance of this new version is 5% higher on
an empty bitmap of 64 bytes, and remains about the same in the worst case.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:19 +08:00
Chao Yu
89fbc8e0dd f2fs: fix to remove directory inode from dirty list
If the last dirty dentry page was written back in the reclaim path, we
should remove its directory inode from the global dirty list to avoid an
unnecessary flush of this inode when doing checkpoint.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:19 +08:00
Chao Yu
9e503744a3 f2fs: fix to enable missing ioctl interfaces in ->compat_ioctl
In a 64-bit kernel, f2fs can support 32-bit ioctl system calls by
identifying the encoded command, which is converted from the 32-bit one to
the 64-bit one in ->compat_ioctl.

When we introduced new interfaces in ->ioctl, we forgot to enable them in
->compat_ioctl, so enable them now as a fix.
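
The usual shape of such an enablement is simply adding the new command
values to the ->compat_ioctl dispatch so they reach f2fs_ioctl (a sketch
with an abbreviated command list, not the complete patch):

	static long f2fs_compat_ioctl(struct file *file, unsigned int cmd,
							unsigned long arg)
	{
		switch (cmd) {
		case F2FS_IOC32_GETFLAGS:
			cmd = F2FS_IOC_GETFLAGS;
			break;
		case F2FS_IOC32_SETFLAGS:
			cmd = F2FS_IOC_SETFLAGS;
			break;
		/* newly enabled commands keep the same encoding on 32/64-bit */
		case F2FS_IOC_START_ATOMIC_WRITE:
		case F2FS_IOC_DEFRAGMENT:
			break;
		default:
			return -ENOIOCTLCMD;
		}
		return f2fs_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
	}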

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: fix wrongly added spaces together]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:19 +08:00
Chao Yu
8aca27c7ff f2fs: fix memory leak of kobject in error path of fill_super
f2fs_sb_info::s_kobj should be released in the error path of fill_super,
otherwise it leads to a memory leak.
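
The fix presumably amounts to tearing the kobject down on the failure path
as well, something like this (label name is illustrative):

	free_kobj:
		kobject_del(&sbi->s_kobj);
		kobject_put(&sbi->s_kobj);
		wait_for_completion(&sbi->s_kobj_unregister);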

This bug was found by kmemleak:

dmesg:
kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff8800838dc358 (size 8):
  comm "mount", pid 4154, jiffies 4297482839 (age 1911.412s)
  hex dump (first 8 bytes):
    7a 72 61 6d 31 00 ff ff                          zram1...
  backtrace:
    [<ffffffff817a3668>] kmemleak_alloc+0x28/0x50
    [<ffffffff811dc99f>] __kmalloc_track_caller+0xef/0x1c0
    [<ffffffff8119d1c5>] kstrdup+0x45/0x80
    [<ffffffff8119d228>] kstrdup_const+0x28/0x30
    [<ffffffff813d2333>] kvasprintf_const+0x63/0xa0
    [<ffffffff813c59fc>] kobject_set_name_vargs+0x3c/0xa0
    [<ffffffff813c5a85>] kobject_add_varg+0x25/0x60
    [<ffffffff813c5b13>] kobject_init_and_add+0x53/0x70
    [<ffffffffa07ced19>] f2fs_fill_super+0x9d9/0xc40 [f2fs]
    [<ffffffff811ff0b2>] mount_bdev+0x192/0x1d0
    [<ffffffffa07cd3e5>] f2fs_mount+0x15/0x20 [f2fs]
    [<ffffffff811fec93>] mount_fs+0x43/0x170
    [<ffffffff81220256>] vfs_kern_mount+0x76/0x160
    [<ffffffff812211c8>] do_mount+0x258/0xdc0
    [<ffffffff81221dab>] SyS_mount+0x7b/0xc0
    [<ffffffff817aecd7>] entry_SYSCALL_64_fastpath+0x12/0x6f
...

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:19 +08:00
Chao Yu
eb1bccd7fc f2fs: support file defragment
This patch introduces a new ioctl, F2FS_IOC_DEFRAGMENT, to support file
defragmentation in a specified range of a regular file.

This ioctl is useful for a very limited workload: if the user expects high
sequential read performance from a randomly written file, this interface
can be used for defragmentation, after which the file is laid out as
contiguously as possible on the device.

Meanwhile, it has a side effect: it will leave holes in the segments where
the blocks were originally located, so it's better to trigger GC to
eliminate the fragmentation in those segments.
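
For illustration, a userspace caller would presumably invoke it roughly
like this (the struct layout and ioctl number below are assumptions; check
the kernel's f2fs headers for the authoritative definitions):

	#include <stdint.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>

	/* assumed to mirror the kernel's struct f2fs_defragment */
	struct f2fs_defragment {
		uint64_t start;	/* byte offset where defragmentation begins */
		uint64_t len;	/* number of bytes to defragment */
	};
	#define F2FS_IOCTL_MAGIC	0xf5
	#define F2FS_IOC_DEFRAGMENT	_IOW(F2FS_IOCTL_MAGIC, 8, struct f2fs_defragment)

	int defrag_range(const char *path, uint64_t start, uint64_t len)
	{
		struct f2fs_defragment df = { .start = start, .len = len };
		int fd = open(path, O_RDWR);
		int ret;

		if (fd < 0)
			return -1;
		ret = ioctl(fd, F2FS_IOC_DEFRAGMENT, &df);
		close(fd);
		return ret;
	}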

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:19 +08:00
Chao Yu
94b23d0bf6 f2fs: commit atomic written page in LFS mode
We should always commit atomically written pages in LFS mode, otherwise
data will become corrupted if we encounter a sudden power cut after only
some of the pages were committed in IPU mode.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:19 +08:00
Chao Yu
246e22992c f2fs: report error of f2fs_create_root_stats
f2fs_create_root_stats can fail due to lack of memory; report it to the user.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:19 +08:00
Yunlei He
3bdb893816 f2fs: Fix a system panic caused by f2fs_follow_link
In Linux 3.10, we cannot be sure that the return value of the nd_get_link
function is valid, so this patch adds a check before using it.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Shuoran Liu <liushuoran@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:19 +08:00
Matt Wagantall
9a7dcc468f msm: cpufreq: Relax constraints on "msm-cpufreq" workqueue
This workqueue is not used in memory reclaim paths, so the
WQ_MEM_RECLAIM flag is not needed. Additionally, there is no
need to restrict the queue to having one in-flight work item.
Remove these constraints.

Change-Id: I9edde40917d3ec885ce061133de20680634321d0
Signed-off-by: Matt Wagantall <mattw@codeaurora.org>
2016-10-29 23:12:19 +08:00
Mahesh Sivasubramanian
1538aee4f7 msm: cpufreq: Configure WQ for higer priority
When the workqueue runs at a lower priority, it gets starved by
higher-priority threads. Bump the WQ priority to high to ensure the cpufreq
workqueue gets a fair scheduling chance.
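
A minimal sketch of what this change, together with the constraint
relaxation in the previous commit, typically looks like (the variable name
is illustrative; the flags are the point):

	/* high-priority workqueue; no WQ_MEM_RECLAIM, no max_active limit */
	msm_cpufreq_wq = alloc_workqueue("msm-cpufreq", WQ_HIGHPRI, 0);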

Change-Id: I9e994da94a347dceb884e72ec3dd3da468a4471d
Signed-off-by: Mahesh Sivasubramanian <msivasub@codeaurora.org>
2016-10-29 23:12:19 +08:00
Narayanan Gopalakrishnan
498b49280a msm: cpufreq: increase priority of thread that increases frequencies
When the cpufreq governors request an increase in frequency from kworker
threads, there is a chance that the kworker thread is CPU-starved by
other high-priority threads, leading to an undesirable increase in frequency
ramp-up latency. Move the calling thread to the SCHED_FIFO policy when
requesting an increase in frequency and restore the original policy
after the ramp-up is completed.
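
A sketch of the described pattern (function and variable names are
illustrative, not the actual msm-cpufreq code):

	static int set_cpu_freq(struct cpufreq_policy *policy, unsigned int new_freq)
	{
		struct sched_param fifo = { .sched_priority = MAX_RT_PRIO - 1 };
		struct sched_param normal = { .sched_priority = 0 };
		bool boost = new_freq > policy->cur;	/* only boost on ramp-up */
		int ret;

		if (boost)
			sched_setscheduler_nocheck(current, SCHED_FIFO, &fifo);

		ret = cpufreq_driver_target(policy, new_freq, CPUFREQ_RELATION_L);

		if (boost)
			sched_setscheduler_nocheck(current, SCHED_NORMAL, &normal);
		return ret;
	}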

Change-Id: Ie4fa199b7f9087717cb94dfcc2eddea27cc0012b
Signed-off-by: Narayanan Gopalakrishnan <nargop@codeaurora.org>
2016-10-29 23:12:19 +08:00
Mahesh Sivasubramanian
ba057921bf msm: rpm-smd: Configure WQ for higer priority
When the workqueue runs at a lower priority, it gets starved by
higher-priority threads. Bump the WQ priority to high to ensure the rpm-smd
workqueue gets a fair chance.

Change-Id: I06b864611cf45afe6931d6030327806032894663
Signed-off-by: Mahesh Sivasubramanian <msivasub@codeaurora.org>
2016-10-29 23:12:19 +08:00
Srivatsa Vaddagiri
05b6d32aeb sched: Set MC (multi-core) sched domain's busy_factor attribute to 1
The busy_factor attribute of a scheduler domain causes busy CPUs (CPUs
that are not idle) to load balance less frequently in that domain,
which could impact performance by increasing scheduling latency for
tasks.

As an example, consider MC scheduler domain's attribute values of
max_interval = 4ms and busy_factor = 64. Further consider
max_load_balance_interval = 100 (HZ/10). In this case, a non-idle CPU
could put off load balance check in MC domain by 100ms. This
effectively means that a CPU running a single task in its queue could
fail to notice increased load on another CPU (that is in same MC
domain) for up to 100ms, before picking up load from the overloaded
CPU. Needless to say, this leads to increased scheduling latency for
tasks, affecting performance adversely.

By setting the MC domain's busy_factor value to 1, we limit the maximum
interval for which a busy CPU can put off load balance checks to 4ms
(effectively 10ms, given a HZ value of 100).

Change-Id: Id45869d06f5556ea8eec602b65c2ffd2143fe060
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2016-10-29 23:12:19 +08:00
Hong-Mei Li
2ace2c4bbc arm: lib: Fix makefile issue
Fix the typo introduced by commit fb7051a183cba3925f1f00c9bdadc2adcc682020.

Change-Id: Ia23c80f903ed431e60e4ba17c9c6d527e6febd87
Signed-off-by: Hong-Mei Li <a21834@motorola.com>
Reviewed-on: http://gerrit.pcs.mot.com/527915
SLT-Approved: Gerrit Code Review <gerrit-blurdev@motorola.com>
Tested-by: Jira Key <jirakey@motorola.com>
Reviewed-by: Christopher Fries <qcf001@motorola.com>
Submit-Approved: Jira Key <jirakey@motorola.com>
2016-10-29 23:12:18 +08:00
Chris Fries
60135ec5f7 msm: memutils: memcpy, memmove, copy_page optimization
Preload farther to take advantage of the memory bus, and assume
64-byte cache lines.  Unroll some pairs of ldm/stm as well, for
unexplainable reasons.

Future enhancements should include:

- #define for how far to preload, possibly defined separately for
  memcpy, copy_*_user
- Tuning for misaligned buffers
- Tuning for memmove
- Tuning for small buffers
- Understanding mechanism behind ldm/stm unroll causing some gains
  in copy_to_user

BASELINE (msm8960pro):
======================================================================
memcpy 1000MB at 5MB       : took 808850 usec, bandwidth 1236.236 MB/s
copy_to_user 1000MB at 5MB : took 810071 usec, bandwidth 1234.234 MB/s
copy_from_user 1000MB at 5M: took 942926 usec, bandwidth 1060.060 MB/s
memmove 1000GB at 5MB      : took 848588 usec, bandwidth 1178.178 MB/s
copy_to_user 1000GB at 4kB : took 847916 usec, bandwidth 1179.179 MB/s
copy_from_user 1000GB at 4k: took 935113 usec, bandwidth 1069.069 MB/s
copy_page 1000GB at 4kB    : took 779459 usec, bandwidth 1282.282 MB/s

THIS PATCH:
======================================================================
memcpy 1000MB at 5MB       : took 346223 usec, bandwidth 2888.888 MB/s
copy_to_user 1000MB at 5MB : took 348084 usec, bandwidth 2872.872 MB/s
copy_from_user 1000MB at 5M: took 348176 usec, bandwidth 2872.872 MB/s
memmove 1000GB at 5MB      : took 348267 usec, bandwidth 2871.871 MB/s
copy_to_user 1000GB at 4kB : took 377018 usec, bandwidth 2652.652 MB/s
copy_from_user 1000GB at 4k: took 371829 usec, bandwidth 2689.689 MB/s
copy_page 1000GB at 4kB    : took 383763 usec, bandwidth 2605.605 MB/s

Change-Id: I5e7605d4fb5c8492cfe7a73629d0d0ce7afd1f37
Signed-off-by: Chris Fries <C.Fries@motorola.com>
Reviewed-on: http://gerrit.pcs.mot.com/526804
SLT-Approved: Gerrit Code Review <gerrit-blurdev@motorola.com>
Tested-by: Jira Key <jirakey@motorola.com>
Reviewed-by: Igor Kovalenko <cik009@motorola.com>
Submit-Approved: Jira Key <jirakey@motorola.com>
2016-10-29 23:12:18 +08:00
Eric Dumazet
40df832fc0 softirq: reduce latencies
commit c10d73671a upstream.

In various network workloads, __do_softirq() latencies can be up
to 20 ms if HZ=1000, and 200 ms if HZ=100.

This is because we iterate 10 times in the softirq dispatcher,
and some actions can consume a lot of cycles.

This patch changes the fallback-to-ksoftirqd condition to:

- A time limit of 2 ms.
- need_resched() being set on current task

When one of these conditions is met, we wake up ksoftirqd for further
softirq processing if we still have pending softirqs.

Using need_resched() as the only condition can trigger RCU stalls,
as we can keep BH disabled for too long.
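
The resulting handoff logic in __do_softirq() is roughly as follows
(paraphrased from the description above, not a verbatim copy of the
upstream diff):

	/* sketch: 'end' is set at function entry as
	 * end = jiffies + MAX_SOFTIRQ_TIME, with
	 * MAX_SOFTIRQ_TIME defined as msecs_to_jiffies(2) */
	pending = local_softirq_pending();
	if (pending) {
		if (time_before(jiffies, end) && !need_resched())
			goto restart;	/* keep processing in this context */

		/* 2 ms budget used up, or a task wants the CPU: hand off */
		wakeup_softirqd();
	}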

I ran several benchmarks and got no significant difference in
throughput, but a very significant reduction of latencies (one order
of magnitude) :

In following bench, 200 antagonist "netperf -t TCP_RR" are started in
background, using all available cpus.

Then we start one "netperf -t TCP_RR", bound to the cpu handling the NIC
IRQ (hard+soft)

Before patch :

RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=550110.424
MIN_LATENCY=146858
MAX_LATENCY=997109
P50_LATENCY=305000
P90_LATENCY=550000
P99_LATENCY=710000
MEAN_LATENCY=376989.12
STDDEV_LATENCY=184046.92

After patch :

RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=40545.492
MIN_LATENCY=9834
MAX_LATENCY=78366
P50_LATENCY=33583
P90_LATENCY=59000
P99_LATENCY=69000
MEAN_LATENCY=38364.67
STDDEV_LATENCY=12865.26

Change-Id: I94f96a9040a018644d3e2150f54acfd9a080992d
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Miller <davem@davemloft.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[xr: Backported to 3.4: Adjust context]
Signed-off-by: Rui Xiang <rui.xiang@huawei.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
2016-10-29 23:12:18 +08:00
Dave Chinner
a8322514e2 sync: don't block the flusher thread waiting on IO
When sync does its WB_SYNC_ALL writeback, it issues data IO and
then immediately waits for IO completion. This is done in the
context of the flusher thread, and hence completely ties up the
flusher thread for the backing device until all the dirty inodes
have been synced. On filesystems that are dirtying inodes constantly
and quickly, this means the flusher thread can be tied up for
minutes per sync call and hence badly affects system-level write IO
performance, as the page cache cannot be cleaned quickly.

We already have a wait loop for IO completion for sync(2), so cut
this out of the flusher thread and delegate it to wait_sb_inodes().
Hence we can do rapid IO submission, and then wait for it all to
complete.
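
Conceptually, the split looks something like this (a simplified sketch of
the flow, not the actual fs/fs-writeback.c changes; the first helper is
hypothetical):

	/* sync(2) path, running in the caller's context rather than the flusher's */
	void sync_inodes_sb(struct super_block *sb)
	{
		/* ask the flusher thread to submit WB_SYNC_ALL IO and wait only
		 * for the submission work item itself to finish... */
		queue_and_wait_for_flusher_work(sb);	/* hypothetical helper */

		/* ...then wait for the IO to complete here, so the flusher is
		 * free to service other writeback in the meantime */
		wait_sb_inodes(sb);
	}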

Effect of sync on fsmark before the patch:

FSUse%        Count         Size    Files/sec     App Overhead
.....
     0       640000         4096      35154.6          1026984
     0       720000         4096      36740.3          1023844
     0       800000         4096      36184.6           916599
     0       880000         4096       1282.7          1054367
     0       960000         4096       3951.3           918773
     0      1040000         4096      40646.2           996448
     0      1120000         4096      43610.1           895647
     0      1200000         4096      40333.1           921048

And a single sync pass took:

  real    0m52.407s
  user    0m0.000s
  sys     0m0.090s

After the patch, there is no impact on fsmark results, and each
individual sync(2) operation run concurrently with the same fsmark
workload takes roughly 7s:

  real    0m6.930s
  user    0m0.000s
  sys     0m0.039s

IOWs, sync is 7-8x faster on a busy filesystem and does not have an
adverse impact on ongoing async data write operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Change-Id: I9e55d65f5ecb2305497711d4688f0647d9346035
2016-10-29 23:12:18 +08:00
Lee Susman
b91dba25fc mm: change initial readahead window size calculation
Change the logic which determines the initial readahead window size
such that for small requests (one page) the initial window size
will be x4 the size of the original request, regardless of the
VM_MAX_READAHEAD value. This prevents a rapid ramp-up that could
otherwise be caused by increasing VM_MAX_READAHEAD.
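
The described change presumably lands in get_init_ra_size(), roughly like
this (a sketch; the single-page branch is the new behavior, the rest
mirrors the existing scaling against max):

	static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
	{
		unsigned long newsize = roundup_pow_of_two(size);

		/* single-page request: start with a 4-page window regardless
		 * of how large VM_MAX_READAHEAD makes 'max' */
		if (size <= 1)
			return newsize * 4;

		if (newsize <= max / 32)
			newsize = newsize * 4;
		else if (newsize <= max / 4)
			newsize = newsize * 2;
		else
			newsize = max;

		return newsize;
	}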

Change-Id: I93d59c515d7e6c6d62348790980ff7bd4f434997
Signed-off-by: Lee Susman <lsusman@codeaurora.org>
2016-10-29 23:12:18 +08:00
Thomas Gleixner
c8ad3a7446 tick: Cleanup NOHZ per cpu data on cpu down
commit 4b0c0f294f upstream.

Prarit reported a crash on CPU offline/online. The reason is that on
CPU down the NOHZ related per cpu data of the dead cpu is not cleaned
up. If at cpu online an interrupt happens before the per cpu tick
device is registered, the irq_enter() check potentially sees stale data
and dereferences a NULL pointer.

Cleanup the data after the cpu is dead.
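
The cleanup itself is essentially a reset of the dead CPU's tick_sched
state (a sketch of the idea; the function name and the hook it lives in
are illustrative and differ by kernel version):

	void tick_cleanup_dead_cpu(int cpu)	/* illustrative name */
	{
		struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);

		/* wipe stale NOHZ bookkeeping so irq_enter() at the next online
		 * never sees leftover state from the previous incarnation */
		memset(ts, 0, sizeof(*ts));
	}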

Change-Id: I5f2be7afd398bc97b997ab2143f9d71230c44dd5
Reported-by: Prarit Bhargava <prarit@redhat.com>
Cc: Mike Galbraith <bitbucket@online.de>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1305031451561.2886@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-29 23:12:18 +08:00
Frederic Weisbecker
68a644ad50 nohz: Make tick_nohz_irq_exit() irq safe
commit e5ab012c32 upstream.

As it stands, irq_exit() may or may not be called with
irqs disabled, depending on __ARCH_IRQ_EXIT_IRQS_DISABLED
that the arch can define.

This makes tick_nohz_irq_exit() unsafe. For example, two
interrupts can race in tick_nohz_stop_sched_tick(): the innermost
one computes the expiry time on top of the timer list,
then it's interrupted right before reprogramming the
clock. The new interrupt enqueues a new timer list timer,
reprograms the clock to take it into account, and exits.
The CPU resumes the innermost interrupt and performs the clock
reprogramming without considering the new timer list timer.

This regression has been introduced by:
     280f06774a
     ("nohz: Separate out irq exit and idle loop dyntick logic")

Let's fix it right now with the appropriate protections.

A saner long term solution will be to remove
__ARCH_IRQ_EXIT_IRQS_DISABLED and mandate that irq_exit() is called
with interrupts disabled.
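
Until then, the immediate protection presumably looks like this (sketch):

	void tick_nohz_irq_exit(void)
	{
		unsigned long flags;

		local_irq_save(flags);	/* make the exit path safe even when the
					 * arch calls irq_exit() with irqs enabled */

		/* ... existing idle stop-tick bookkeeping runs here ... */

		local_irq_restore(flags);
	}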

Change-Id: Ib453b14a1bdaac53b6c685ba3eac92dbd0985b2a
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Link: http://lkml.kernel.org/r/1361373336-11337-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Lingzhu Xiang <lxiang@redhat.com>
Reviewed-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-29 23:12:18 +08:00
Paul E. McKenney
ed4bddb779 rcu: Stop rcu_do_batch() from multiplexing the "count" variable
Commit b1420f1c (Make rcu_barrier() less disruptive) rearranged the
code in rcu_do_batch(), moving the ->qlen manipulation to follow
the requeueing of the callbacks.  Unfortunately, this rearrangement
clobbered the value of the "count" local variable before the value
of rdp->qlen was adjusted, resulting in the value of rdp->qlen being
inaccurate.  This commit therefore introduces an index variable "i",
avoiding the inadvertent multiplexing.

CRs-fixed: 657837
Change-Id: I1d7eab79a8e4105b407be87be3300129c32172ae
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Git-commit: b41772abeb
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
2016-10-29 23:12:18 +08:00
Paul E. McKenney
7727c23657 timer: Fix mod_timer_pinned() header comment
The mod_timer_pinned() header comment states that it prevents timers
from being migrated to a different CPU.  This is not the case, instead,
it ensures that the timer is posted to the current CPU, but does nothing
to prevent CPU-hotplug operations from migrating the timer.

This commit therefore brings the comment header into alignment with
reality.

CRs-fixed: 657837
Change-Id: I244c4c385cd1c47df216feda7b580b2876fab723
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Git-commit: 048a0e8f5e
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
2016-10-29 23:12:18 +08:00
Paul E. McKenney
e3538e0325 rcu: Make rcu_barrier() less disruptive
The rcu_barrier() primitive interrupts each and every CPU, registering
a callback on every CPU.  Once all of these callbacks have been invoked,
rcu_barrier() knows that every callback that was registered before
the call to rcu_barrier() has also been invoked.

However, there is no point in registering a callback on a CPU that
currently has no callbacks, most especially if that CPU is in a
deep idle state.  This commit therefore makes rcu_barrier() avoid
interrupting CPUs that have no callbacks.  Doing this requires reworking
the handling of orphaned callbacks, otherwise callbacks could slip through
rcu_barrier()'s net by being orphaned from a CPU that rcu_barrier() had
not yet interrupted to a CPU that rcu_barrier() had already interrupted.
This reworking was needed anyway to take a first step towards weaning
RCU from the CPU_DYING notifier's use of stop_cpu().

CRs-fixed: 657837
Change-Id: Icb4392d9dc2cd25d9c9ea05b93ea9e2a99d24bcb
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Git-commit: b1420f1c8b
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
2016-10-29 23:12:18 +08:00
Paul E. McKenney
10dd2ef4aa rcu: Precompute RCU_FAST_NO_HZ timer offsets
When a CPU is entering dyntick-idle mode, tick_nohz_stop_sched_tick()
calls rcu_needs_cpu() to see if RCU needs that CPU and, if not, computes the
next wakeup time based on the timer wheels.  Only later, when actually
entering the idle loop, rcu_prepare_for_idle() will be invoked.  In some
cases, rcu_prepare_for_idle() will post timers to wake the CPU back up.
But all for naught: The next wakeup time for the CPU has already been
computed, and posting a timer afterwards does not force that wakeup
time to be recomputed.  This means that rcu_prepare_for_idle()'s timers
have no effect.

This is not a problem on a busy system because something else will wake
up the CPU soon enough.  However, on lightly loaded systems, the CPU
might stay asleep for a considerable length of time.  If that CPU has
a callback that the rest of the system is waiting on, the system might
run very slowly or (in theory) even hang.

This commit avoids this problem by having rcu_needs_cpu() give
tick_nohz_stop_sched_tick() an estimate of when RCU will need the CPU
to wake back up, which tick_nohz_stop_sched_tick() takes into account
when programming the CPU's wakeup time.  An alternative approach is
for rcu_prepare_for_idle() to use hrtimers instead of normal timers,
but timers are much more efficient than are hrtimers for frequently
and repeatedly posting and cancelling a given timer, which is exactly
what RCU_FAST_NO_HZ does.

CRs-fixed: 657837
Change-Id: I45c9163678240ca0b6fcac06b025d4cdb140907c
Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Git-commit: aa9b16306e
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
2016-10-29 23:12:18 +08:00
Paul E. McKenney
71ef5ad0f7 rcu: Move RCU_FAST_NO_HZ per-CPU variables to rcu_dynticks structure
The RCU_FAST_NO_HZ code relies on a number of per-CPU variables.
This works, but is hidden from someone scanning the data structures
in rcutree.h.  This commit therefore converts these per-CPU variables
to fields in the per-CPU rcu_dynticks structures.

CRs-fixed: 657837
Change-Id: I03e65c434b0156d4ab035421591aafb697ecd86b
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Git-commit: 5955f7eecd
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
2016-10-29 23:12:18 +08:00
Paul E. McKenney
33dc5bfbcc rcu: Update RCU_FAST_NO_HZ tracing for lazy callbacks
In the current code, a short dyntick-idle interval (where there is
at least one non-lazy callback on the CPU) and a long dyntick-idle
interval (where there are only lazy callbacks on the CPU) are traced
identically, which can be less than helpful.  This commit therefore
emits different event traces in these two cases.

CRs-fixed: 657837
Change-Id: I0392bf0b9ffe3e319bdf54e7ff77f86ce5a8f212
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Git-commit: fd4b352687
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
2016-10-29 23:12:18 +08:00
Paul E. McKenney
a674b5f9ba rcu: Explicitly initialize RCU_FAST_NO_HZ per-CPU variables
The current initialization of the RCU_FAST_NO_HZ per-CPU variables makes
needless and fragile assumptions about the initial value of things like
the jiffies counter.  This commit therefore explicitly initializes all of
them that are better started with a non-zero value.  It also adds some
comments describing the per-CPU state variables.

CRs-fixed: 657837
Change-Id: Ia82aae8f5441a73be7e427f5fb74ec107144fa6e
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Git-commit: 98248a0e24
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
2016-10-29 23:12:18 +08:00
Paul E. McKenney
9920854d06 rcu: Make RCU_FAST_NO_HZ handle timer migration
The current RCU_FAST_NO_HZ assumes that timers do not migrate unless a
CPU goes offline, in which case it assumes that the CPU will have to come
out of dyntick-idle mode (cancelling the timer) in order to go offline.
This is important because when RCU_FAST_NO_HZ permits a CPU to enter
dyntick-idle mode despite having RCU callbacks pending, it posts a timer
on that CPU to force a wakeup on that CPU.  This wakeup ensures that the
CPU will eventually handle the end of the grace period, including invoking
its RCU callbacks.

However, Pascal Chapperon's test setup shows that the timer handler
rcu_idle_gp_timer_func() really does get invoked in some cases.  This is
problematic because this can cause the CPU that entered dyntick-idle
mode despite still having RCU callbacks pending to remain in
dyntick-idle mode indefinitely, which means that its RCU callbacks might
never be invoked.  This situation can result in grace-period delays or
even system hangs, which matches Pascal's observations of slow boot-up
and shutdown (https://lkml.org/lkml/2012/4/5/142).  See also the bugzilla:

	https://bugzilla.redhat.com/show_bug.cgi?id=806548

This commit therefore causes the "should never be invoked" timer handler
rcu_idle_gp_timer_func() to use smp_call_function_single() to wake up
the CPU for which the timer was intended, allowing that CPU to invoke
its RCU callbacks in a timely manner.

CRs-fixed: 657837
Change-Id: I39cc0c8bf36d2e3aa9c136e3cd399ab939daad2d
Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Git-commit: 21e52e1566
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
2016-10-29 23:12:17 +08:00
Paul E. McKenney
aa39813d91 rcu: Make exit_rcu() more precise and consolidate
When running preemptible RCU, if a task exits in an RCU read-side
critical section having blocked within that same RCU read-side critical
section, the task must be removed from the list of tasks blocking a
grace period (perhaps the current grace period, perhaps the next grace
period, depending on timing).  The exit() path invokes exit_rcu() to
do this cleanup.

However, the current implementation of exit_rcu() needlessly does the
cleanup even if the task did not block within the current RCU read-side
critical section, which wastes time and needlessly increases the size
of the state space.  Fix this by only doing the cleanup if the current
task is actually on the list of tasks blocking some grace period.

While we are at it, consolidate the two identical exit_rcu() functions
into a single function.
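
The consolidated helper presumably reduces to something like this (a
sketch; constants and task_struct field names as used by preemptible RCU
of that era):

	void exit_rcu(void)
	{
		struct task_struct *t = current;

		/* fast path: the task never blocked inside its current RCU
		 * read-side critical section, so there is nothing to clean up */
		if (likely(list_empty(&t->rcu_node_entry)))
			return;

		/* pretend we are still in the critical section and let the
		 * regular unlock path dequeue us from the blocked-tasks list */
		t->rcu_read_lock_nesting = 1;
		barrier();
		t->rcu_read_unlock_special = RCU_READ_UNLOCK_BLOCKED;
		__rcu_read_unlock();
	}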

CRs-fixed: 657837
Change-Id: I59504aedb12064fcd6ce03973d441b547749076a
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: 9dd8fb16c3
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[rggupt@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
2016-10-29 23:12:17 +08:00