f2fs_create_root_stats can fail due to lack of memory; report the failure to the user.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In Linux 3.10, we cannot make sure the return value of the nd_get_link function
is valid, so this patch adds a check before using it.
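Roughly, the added guard has this shape (a sketch only, with an illustrative call site; the actual f2fs symlink path may differ):

static void f2fs_put_link(struct dentry *dentry, struct nameidata *nd,
			  void *cookie)
{
	const char *link = nd_get_link(nd);

	/* nd_get_link() may hand back an ERR_PTR() or NULL here */
	if (!IS_ERR_OR_NULL(link))
		kfree(link);
}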
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Shuoran Liu <liushuoran@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This workqueue is not used in memory reclaim paths, so the
WQ_MEM_RECLAIM flag is not needed. Additionally, there is no
need to restrict the queue to having one in-flight work item.
Remove these constraints.
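For illustration only (the workqueue name is a placeholder), the allocation change has this shape:

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/workqueue.h>

static struct workqueue_struct *example_wq;	/* placeholder name */

static int __init example_wq_init(void)
{
	/* Before: alloc_workqueue("example_wq", WQ_MEM_RECLAIM, 1);
	 * After: no reclaim guarantee, no one-in-flight limit;
	 * max_active of 0 selects the default.
	 */
	example_wq = alloc_workqueue("example_wq", 0, 0);
	return example_wq ? 0 : -ENOMEM;
}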
Change-Id: I9edde40917d3ec885ce061133de20680634321d0
Signed-off-by: Matt Wagantall <mattw@codeaurora.org>
When the workqueue runs at a lower priority, it gets starved by higher
priority threads. Bump the workqueue priority to high to ensure the cpufreq
workqueue gets a fair scheduling chance.
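For illustration only (the queue name is a placeholder), the high-priority queue is requested with WQ_HIGHPRI:

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/workqueue.h>

/* WQ_HIGHPRI workers run at an elevated nice level, so queued cpufreq
 * work is far less likely to be starved by busy SCHED_OTHER threads.
 */
static struct workqueue_struct *cpufreq_wq;	/* placeholder name */

static int __init cpufreq_wq_init(void)
{
	cpufreq_wq = alloc_workqueue("cpufreq_wq", WQ_HIGHPRI, 0);
	return cpufreq_wq ? 0 : -ENOMEM;
}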
Change-Id: I9e994da94a347dceb884e72ec3dd3da468a4471d
Signed-off-by: Mahesh Sivasubramanian <msivasub@codeaurora.org>
When the cpufreq governors request a frequency increase from kworker
threads, there is a chance that the kworker thread is CPU-starved by
other high-priority threads, leading to an undesirable increase in frequency
ramp-up latency. Switch the calling thread to the SCHED_FIFO policy when
requesting a frequency increase and restore the original policy
after the ramp-up is completed.
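A sketch of the idea (the function name and priority choice are illustrative, not taken from the actual patch):

#include <linux/cpufreq.h>
#include <linux/sched.h>

static void cpufreq_ramp_up_fifo(struct cpufreq_policy *policy,
				 unsigned int target_freq)
{
	struct sched_param fifo_param = { .sched_priority = MAX_RT_PRIO - 1 };
	struct sched_param old_param = { .sched_priority = current->rt_priority };
	int old_policy = current->policy;

	/* Run the ramp-up as SCHED_FIFO so it is not starved ... */
	sched_setscheduler_nocheck(current, SCHED_FIFO, &fifo_param);
	__cpufreq_driver_target(policy, target_freq, CPUFREQ_RELATION_H);
	/* ... then drop back to the caller's original policy/priority. */
	sched_setscheduler_nocheck(current, old_policy, &old_param);
}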
Change-Id: Ie4fa199b7f9087717cb94dfcc2eddea27cc0012b
Signed-off-by: Narayanan Gopalakrishnan <nargop@codeaurora.org>
When the workqueue runs at a lower priority, it gets starved by higher
priority threads. Bump the workqueue priority to high to ensure the rpm smd
workqueue gets a fair chance.
Change-Id: I06b864611cf45afe6931d6030327806032894663
Signed-off-by: Mahesh Sivasubramanian <msivasub@codeaurora.org>
The busy_factor attribute of a scheduler domain causes busy CPUs (CPUs
that are not idle) to load balance less frequently in that domain,
which can hurt performance by increasing scheduling latency for
tasks.
As an example, consider the MC scheduler domain's attribute values of
max_interval = 4ms and busy_factor = 64. Further consider
max_load_balance_interval = 100 (HZ/10). In this case, a non-idle CPU
could put off load-balance checks in the MC domain by 100ms. This
effectively means that a CPU running a single task in its queue could
fail to notice increased load on another CPU (in the same MC
domain) for up to 100ms before picking up load from the overloaded
CPU. Needless to say, this leads to increased scheduling latency for
tasks, affecting performance adversely.
By setting the MC domain's busy_factor value to 1, we limit the maximum
interval by which a busy CPU can put off load-balance checks to 4ms
(effectively 10ms, given a HZ value of 100).
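A simplified sketch of how the balancer derives its interval (not a verbatim excerpt of the fair-class code of that era): with busy_factor = 1, the MC interval stays at its 4ms base instead of being scaled up and then clamped.

#include <linux/jiffies.h>
#include <linux/kernel.h>

static unsigned long balance_interval_sketch(unsigned int base_ms,
					     unsigned int busy_factor,
					     bool cpu_busy,
					     unsigned long max_interval)
{
	unsigned long interval = base_ms;

	if (cpu_busy)
		interval *= busy_factor;	/* e.g. 4ms * 64 = 256ms */

	interval = msecs_to_jiffies(interval);
	return clamp(interval, 1UL, max_interval);	/* capped by the global limit */
}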
Change-Id: Id45869d06f5556ea8eec602b65c2ffd2143fe060
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Preload farther to take advantage of the memory bus, and assume
64-byte cache lines. Unroll some pairs of ldm/stm as well, for
unexplainable reasons.
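The idea, sketched in C rather than the actual ARM assembly (preload distance and names are illustrative):

#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE	64
#define PLD_OFFSET	(4 * CACHE_LINE)	/* illustrative preload distance */

/* Prefetch well ahead of the copy cursor so cache-line fills overlap with
 * the load/store stream; the real routines do this with pld plus unrolled
 * ldm/stm pairs.
 */
static void copy_with_preload(void *dst, const void *src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;
	size_t i;

	while (n >= CACHE_LINE) {
		__builtin_prefetch(s + PLD_OFFSET, 0, 0);
		for (i = 0; i < CACHE_LINE; i++)
			d[i] = s[i];
		d += CACHE_LINE;
		s += CACHE_LINE;
		n -= CACHE_LINE;
	}
	while (n--)
		*d++ = *s++;
}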
Future enhancements should include,
- #define for how far to preload, possibly defined separately for
memcpy, copy_*_user
- Tuning for misaligned buffers
- Tuning for memmove
- Tuning for small buffers
- Understanding mechanism behind ldm/stm unroll causing some gains
in copy_to_user
BASELINE (msm8960pro):
======================================================================
memcpy 1000MB at 5MB : took 808850 usec, bandwidth 1236.236 MB/s
copy_to_user 1000MB at 5MB : took 810071 usec, bandwidth 1234.234 MB/s
copy_from_user 1000MB at 5M: took 942926 usec, bandwidth 1060.060 MB/s
memmove 1000GB at 5MB : took 848588 usec, bandwidth 1178.178 MB/s
copy_to_user 1000GB at 4kB : took 847916 usec, bandwidth 1179.179 MB/s
copy_from_user 1000GB at 4k: took 935113 usec, bandwidth 1069.069 MB/s
copy_page 1000GB at 4kB : took 779459 usec, bandwidth 1282.282 MB/s
THIS PATCH:
======================================================================
memcpy 1000MB at 5MB : took 346223 usec, bandwidth 2888.888 MB/s
copy_to_user 1000MB at 5MB : took 348084 usec, bandwidth 2872.872 MB/s
copy_from_user 1000MB at 5M: took 348176 usec, bandwidth 2872.872 MB/s
memmove 1000GB at 5MB : took 348267 usec, bandwidth 2871.871 MB/s
copy_to_user 1000GB at 4kB : took 377018 usec, bandwidth 2652.652 MB/s
copy_from_user 1000GB at 4k: took 371829 usec, bandwidth 2689.689 MB/s
copy_page 1000GB at 4kB : took 383763 usec, bandwidth 2605.605 MB/s
Change-Id: I5e7605d4fb5c8492cfe7a73629d0d0ce7afd1f37
Signed-off-by: Chris Fries <C.Fries@motorola.com>
Reviewed-on: http://gerrit.pcs.mot.com/526804
SLT-Approved: Gerrit Code Review <gerrit-blurdev@motorola.com>
Tested-by: Jira Key <jirakey@motorola.com>
Reviewed-by: Igor Kovalenko <cik009@motorola.com>
Submit-Approved: Jira Key <jirakey@motorola.com>
commit c10d73671a upstream.
In various network workloads, __do_softirq() latencies can be up
to 20 ms if HZ=1000, and 200 ms if HZ=100.
This is because we iterate 10 times in the softirq dispatcher,
and some actions can consume a lot of cycles.
This patch changes the fallback to ksoftirqd condition to :
- A time limit of 2 ms.
- need_resched() being set on current task
When one of these conditions is met, we wake up ksoftirqd for further
softirq processing if we still have pending softirqs.
Using need_resched() as the only condition can trigger RCU stalls,
as we can keep BH disabled for too long.
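Condensed, the new exit test has this shape (simplified from the patch):

#include <linux/jiffies.h>
#include <linux/sched.h>

#define MAX_SOFTIRQ_TIME	msecs_to_jiffies(2)

/* __do_softirq() notes "end = jiffies + MAX_SOFTIRQ_TIME" on entry and keeps
 * restarting only while both limits hold; otherwise the remaining pending
 * softirqs are deferred to ksoftirqd.
 */
static bool keep_in_softirq(unsigned long end)
{
	return time_before(jiffies, end) && !need_resched();
}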
I ran several benchmarks and got no significant difference in
throughput, but a very significant reduction of latencies (one order
of magnitude) :
In the following benchmark, 200 antagonist "netperf -t TCP_RR" instances are
started in the background, using all available cpus.
Then we start one "netperf -t TCP_RR", bound to the cpu handling the NIC
IRQ (hard+soft)
Before patch :
RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=550110.424
MIN_LATENCY=146858
MAX_LATENCY=997109
P50_LATENCY=305000
P90_LATENCY=550000
P99_LATENCY=710000
MEAN_LATENCY=376989.12
STDDEV_LATENCY=184046.92
After patch :
RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
RT_LATENCY=40545.492
MIN_LATENCY=9834
MAX_LATENCY=78366
P50_LATENCY=33583
P90_LATENCY=59000
P99_LATENCY=69000
MEAN_LATENCY=38364.67
STDDEV_LATENCY=12865.26
Change-Id: I94f96a9040a018644d3e2150f54acfd9a080992d
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Miller <davem@davemloft.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[xr: Backported to 3.4: Adjust context]
Signed-off-by: Rui Xiang <rui.xiang@huawei.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
When sync does its WB_SYNC_ALL writeback, it issues data IO and
then immediately waits for IO completion. This is done in the
context of the flusher thread, and hence completely ties up the
flusher thread for the backing device until all the dirty inodes
have been synced. On filesystems that are dirtying inodes constantly
and quickly, this means the flusher thread can be tied up for
minutes per sync call and hence badly affect system level write IO
performance as the page cache cannot be cleaned quickly.
We already have a wait loop for IO completion for sync(2), so cut
this out of the flusher thread and delegate it to wait_sb_inodes().
Hence we can do rapid IO submission, and then wait for it all to
complete.
Effect of sync on fsmark before the patch:
FSUse% Count Size Files/sec App Overhead
.....
0 640000 4096 35154.6 1026984
0 720000 4096 36740.3 1023844
0 800000 4096 36184.6 916599
0 880000 4096 1282.7 1054367
0 960000 4096 3951.3 918773
0 1040000 4096 40646.2 996448
0 1120000 4096 43610.1 895647
0 1200000 4096 40333.1 921048
And a single sync pass took:
real 0m52.407s
user 0m0.000s
sys 0m0.090s
After the patch, there is no impact on fsmark results, and each
individual sync(2) operation run concurrently with the same fsmark
workload takes roughly 7s:
real 0m6.930s
user 0m0.000s
sys 0m0.039s
IOWs, sync is 7-8x faster on a busy filesystem and does not have an
adverse impact on ongoing async data write operations.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change-Id: I9e55d65f5ecb2305497711d4688f0647d9346035
Change the logic which determines the initial readahead window size
such that for small requests (one page) the initial window size
will be x4 the size of the original request, regardless of the
VM_MAX_READAHEAD value. This prevents the rapid ramp-up
that could otherwise be caused by increasing VM_MAX_READAHEAD.
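A sketch of the described policy (not necessarily the exact patch): a one-page request always starts with a four-page window, independent of the readahead maximum, while larger requests keep the max-relative scaling.

#include <linux/log2.h>

static unsigned long init_ra_size_sketch(unsigned long size, unsigned long max)
{
	unsigned long newsize = roundup_pow_of_two(size);

	if (size == 1)
		newsize = 4;			/* 4x the original request */
	else if (newsize <= max / 4)
		newsize *= 2;
	else
		newsize = max;

	return newsize;
}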
Change-Id: I93d59c515d7e6c6d62348790980ff7bd4f434997
Signed-off-by: Lee Susman <lsusman@codeaurora.org>
commit 4b0c0f294f upstream.
Prarit reported a crash on CPU offline/online. The reason is that on
CPU down the NOHZ related per cpu data of the dead cpu is not cleaned
up. If at cpu online an interrupt happens before the per cpu tick
device is registered the irq_enter() check potentially sees stale data
and dereferences a NULL pointer.
Cleanup the data after the cpu is dead.
Change-Id: I5f2be7afd398bc97b997ab2143f9d71230c44dd5
Reported-by: Prarit Bhargava <prarit@redhat.com>
Cc: Mike Galbraith <bitbucket@online.de>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1305031451561.2886@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e5ab012c32 upstream.
As it stands, irq_exit() may or may not be called with
irqs disabled, depending on __ARCH_IRQ_EXIT_IRQS_DISABLED
that the arch can define.
It makes tick_nohz_irq_exit() unsafe. For example, two
interrupts can race in tick_nohz_stop_sched_tick(): the innermost
one computes the expiry time on top of the timer list,
then it is interrupted right before reprogramming the
clock. The new interrupt enqueues a new timer-list timer,
reprograms the clock to take it into account, and exits.
The CPU then resumes the innermost interrupt and performs the clock
reprogramming without considering the new timer-list timer.
This regression has been introduced by:
280f06774a
("nohz: Separate out irq exit and idle loop dyntick logic")
Let's fix it right now with the appropriate protections.
A saner long term solution will be to remove
__ARCH_IRQ_EXIT_IRQS_DISABLED and mandate that irq_exit() is called
with interrupts disabled.
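A sketch of the interim protection (not a verbatim excerpt): run the nohz irq-exit work with interrupts disabled so a nested interrupt cannot interleave its own clock reprogramming.

#include <linux/irqflags.h>

static void tick_nohz_irq_exit_sketch(void)
{
	unsigned long flags;

	local_irq_save(flags);
	/* ... recompute the next expiry and reprogram the clock event device ... */
	local_irq_restore(flags);
}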
Change-Id: Ib453b14a1bdaac53b6c685ba3eac92dbd0985b2a
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Link: http://lkml.kernel.org/r/1361373336-11337-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Lingzhu Xiang <lxiang@redhat.com>
Reviewed-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit b1420f1c (Make rcu_barrier() less disruptive) rearranged the
code in rcu_do_batch(), moving the ->qlen manipulation to follow
the requeueing of the callbacks. Unfortunately, this rearrangement
clobbered the value of the "count" local variable before the value
of rdp->qlen was adjusted, resulting in the value of rdp->qlen being
inaccurate. This commit therefore introduces an index variable "i",
avoiding the inadvertent multiplexing.
CRs-fixed: 657837
Change-Id: I1d7eab79a8e4105b407be87be3300129c32172ae
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Git-commit: b41772abeb
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
The mod_timer_pinned() header comment states that it prevents timers
from being migrated to a different CPU. This is not the case, instead,
it ensures that the timer is posted to the current CPU, but does nothing
to prevent CPU-hotplug operations from migrating the timer.
This commit therefore brings the comment header into alignment with
reality.
CRs-fixed: 657837
Change-Id: I244c4c385cd1c47df216feda7b580b2876fab723
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Git-commit: 048a0e8f5e
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
The rcu_barrier() primitive interrupts each and every CPU, registering
a callback on every CPU. Once all of these callbacks have been invoked,
rcu_barrier() knows that every callback that was registered before
the call to rcu_barrier() has also been invoked.
However, there is no point in registering a callback on a CPU that
currently has no callbacks, most especially if that CPU is in a
deep idle state. This commit therefore makes rcu_barrier() avoid
interrupting CPUs that have no callbacks. Doing this requires reworking
the handling of orphaned callbacks, otherwise callbacks could slip through
rcu_barrier()'s net by being orphaned from a CPU that rcu_barrier() had
not yet interrupted to a CPU that rcu_barrier() had already interrupted.
This reworking was needed anyway to take a first step towards weaning
RCU from the CPU_DYING notifier's use of stop_cpu().
CRs-fixed: 657837
Change-Id: Icb4392d9dc2cd25d9c9ea05b93ea9e2a99d24bcb
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Git-commit: b1420f1c8b
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
When a CPU is entering dyntick-idle mode, tick_nohz_stop_sched_tick()
calls rcu_needs_cpu() to see if RCU needs that CPU and, if not, computes the
next wakeup time based on the timer wheels. Only later, when actually
entering the idle loop, rcu_prepare_for_idle() will be invoked. In some
cases, rcu_prepare_for_idle() will post timers to wake the CPU back up.
But all for naught: The next wakeup time for the CPU has already been
computed, and posting a timer afterwards does not force that wakeup
time to be recomputed. This means that rcu_prepare_for_idle()'s timers
have no effect.
This is not a problem on a busy system because something else will wake
up the CPU soon enough. However, on lightly loaded systems, the CPU
might stay asleep for a considerable length of time. If that CPU has
a callback that the rest of the system is waiting on, the system might
run very slowly or (in theory) even hang.
This commit avoids this problem by having rcu_needs_cpu() give
tick_nohz_stop_sched_tick() an estimate of when RCU will need the CPU
to wake back up, which tick_nohz_stop_sched_tick() takes into account
when programming the CPU's wakeup time. An alternative approach is
for rcu_prepare_for_idle() to use hrtimers instead of normal timers,
but timers are much more efficient than are hrtimers for frequently
and repeatedly posting and cancelling a given timer, which is exactly
what RCU_FAST_NO_HZ does.
CRs-fixed: 657837
Change-Id: I45c9163678240ca0b6fcac06b025d4cdb140907c
Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Git-commit: aa9b16306e
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
The RCU_FAST_NO_HZ code relies on a number of per-CPU variables.
This works, but is hidden from someone scanning the data structures
in rcutree.h. This commit therefore converts these per-CPU variables
to fields in the per-CPU rcu_dynticks structures.
CRs-fixed: 657837
Change-Id: I03e65c434b0156d4ab035421591aafb697ecd86b
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Git-commit: 5955f7eecd
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
In the current code, a short dyntick-idle interval (where there is
at least one non-lazy callback on the CPU) and a long dyntick-idle
interval (where there are only lazy callbacks on the CPU) are traced
identically, which can be less than helpful. This commit therefore
emits different event traces in these two cases.
CRs-fixed: 657837
Change-Id: I0392bf0b9ffe3e319bdf54e7ff77f86ce5a8f212
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Git-commit: fd4b352687
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
The current initialization of the RCU_FAST_NO_HZ per-CPU variables makes
needless and fragile assumptions about the initial value of things like
the jiffies counter. This commit therefore explicitly initializes all of
them that are better started with a non-zero value. It also adds some
comments describing the per-CPU state variables.
CRs-fixed: 657837
Change-Id: Ia82aae8f5441a73be7e427f5fb74ec107144fa6e
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Git-commit: 98248a0e24
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
The current RCU_FAST_NO_HZ assumes that timers do not migrate unless a
CPU goes offline, in which case it assumes that the CPU will have to come
out of dyntick-idle mode (cancelling the timer) in order to go offline.
This is important because when RCU_FAST_NO_HZ permits a CPU to enter
dyntick-idle mode despite having RCU callbacks pending, it posts a timer
on that CPU to force a wakeup on that CPU. This wakeup ensures that the
CPU will eventually handle the end of the grace period, including invoking
its RCU callbacks.
However, Pascal Chapperon's test setup shows that the timer handler
rcu_idle_gp_timer_func() really does get invoked in some cases. This is
problematic because this can cause the CPU that entered dyntick-idle
mode despite still having RCU callbacks pending to remain in
dyntick-idle mode indefinitely, which means that its RCU callbacks might
never be invoked. This situation can result in grace-period delays or
even system hangs, which matches Pascal's observations of slow boot-up
and shutdown (https://lkml.org/lkml/2012/4/5/142). See also the bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=806548
This commit therefore causes the "should never be invoked" timer handler
rcu_idle_gp_timer_func() to use smp_call_function_single() to wake up
the CPU for which the timer was intended, allowing that CPU to invoke
its RCU callbacks in a timely manner.
CRs-fixed: 657837
Change-Id: I39cc0c8bf36d2e3aa9c136e3cd399ab939daad2d
Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Git-commit: 21e52e1566
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
When running preemptible RCU, if a task exits in an RCU read-side
critical section having blocked within that same RCU read-side critical
section, the task must be removed from the list of tasks blocking a
grace period (perhaps the current grace period, perhaps the next grace
period, depending on timing). The exit() path invokes exit_rcu() to
do this cleanup.
However, the current implementation of exit_rcu() needlessly does the
cleanup even if the task did not block within the current RCU read-side
critical section, which wastes time and needlessly increases the size
of the state space. Fix this by only doing the cleanup if the current
task is actually on the list of tasks blocking some grace period.
While we are at it, consolidate the two identical exit_rcu() functions
into a single function.
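A condensed sketch of the consolidated function (details follow that era's tree and are not guaranteed verbatim): bail out early unless the exiting task is actually queued on an rcu_node blocked-tasks list.

#include <linux/list.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>

void exit_rcu_sketch(void)
{
	struct task_struct *t = current;

	if (likely(list_empty(&t->rcu_node_entry)))
		return;		/* never blocked inside a read-side section */

	t->rcu_read_lock_nesting = 1;
	barrier();
	t->rcu_read_unlock_special = RCU_READ_UNLOCK_BLOCKED;
	__rcu_read_unlock();	/* performs the dequeue and cleanup */
}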
CRs-fixed: 657837
Change-Id: I59504aedb12064fcd6ce03973d441b547749076a
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: 9dd8fb16c3
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[rggupt@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
Timers are subject to migration, which can lead to the following
system-hang scenario when CONFIG_RCU_FAST_NO_HZ=y:
1. CPU 0 executes synchronize_rcu(), which posts an RCU callback.
2. CPU 0 then goes idle. It cannot immediately invoke the callback,
but there is nothing RCU needs from it, so it enters dyntick-idle
mode after posting a timer.
3. The timer gets migrated to CPU 1.
4. CPU 0 never wakes up, so the synchronize_rcu() never returns, so
the system hangs.
This commit fixes this problem by using mod_timer_pinned(), as suggested
by Peter Zijlstra, to ensure that the timer is actually posted on the
running CPU.
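Illustrative use (the timer and delay are placeholders): mod_timer_pinned() posts the wakeup timer on the current CPU instead of letting the timer core pick a possibly different one.

#include <linux/jiffies.h>
#include <linux/timer.h>

static struct timer_list idle_gp_timer;		/* placeholder */

static void arm_idle_gp_timer(unsigned long delay)
{
	mod_timer_pinned(&idle_gp_timer, jiffies + delay);
}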
CRs-fixed: 657837
Change-Id: Id4b3d586dc08db9b9a80739c3a05163b3be69e72
Reported-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Git-commit: f511fc6246
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
RCU_FAST_NO_HZ uses a timer to limit the time that a CPU with callbacks
can remain in dyntick-idle mode. This timer is cancelled when the CPU
exits idle, and therefore should never fire. However, if the timer
were migrated to some other CPU for whatever reason (1) the timer could
actually fire and (2) firing on some other CPU would fail to wake up the
CPU with callbacks, possibly resulting in sluggishness or a system hang.
This commit therefore adds a WARN_ON_ONCE() to the timer handler in order
to detect this condition.
CRs-fixed: 657837
Change-Id: Ie667bd7ee157668b2f0fca7eaf67cb746be1d674
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Git-commit: 79b9a75fb7
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
Both Steven Rostedt's new idle-capable trace macros and the RCU_NONIDLE()
macro can cause RCU to momentarily pause out of idle without the rest
of the system being involved. This can cause rcu_prepare_for_idle()
to run through its state machine too quickly, which can in turn result
in needless scheduling-clock interrupts.
This commit therefore adds code to enable rcu_prepare_for_idle() to
distinguish between an initial entry to idle on the one hand (which needs
to advance the rcu_prepare_for_idle() state machine) and an idle reentry
due to idle-capable trace macros and RCU_NONIDLE() on the other hand
(which should avoid advancing the rcu_prepare_for_idle() state machine).
Additional state is maintained to allow the timer to be correctly reposted
when returning after a momentary pause out of idle, and even more state
is maintained to detect when new non-lazy callbacks have been enqueued
(which may require re-evaluation of the approach to idleness).
CRs-fixed: 657837
Change-Id: I7c5ef8183d42982876a50b500da080f15c61b848
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Git-commit: c57afe80db
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
The RCU_FAST_NO_HZ facility uses an hrtimer to wake up a CPU when
it is allowed to go into dyntick-idle mode, which is almost always
cancelled soon after. This is not what hrtimers are good at, so
this commit switches to the timer wheel.
CRs-fixed: 657837
Change-Id: Ic528055544c929251f239c729c221cfcee539cfa
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Git-commit: 2ee3dc8066
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
Traces of rcu_prep_idle events can be confusing because
rcu_cleanup_after_idle() does no tracing. This commit therefore adds
this tracing.
CRs-fixed: 657837
Change-Id: I8ce6251d06e25f3ff26fe016aae456a33c40a466
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Git-commit: 2fdbb31b66
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ramesh Gupta Guntha <rggupt@codeaurora.org>
There have been some embedded applications that would benefit from
use of expedited grace-period primitives. In some ways, this is
similar to synchronize_net() doing either a normal or an expedited
grace period depending on lock state, but with control outside of
the kernel.
This commit therefore adds rcu_expedited boot and sysfs parameters
that cause the kernel to substitute expedited primitives for the
normal grace-period primitives.
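For example, expedited grace periods can then be requested at runtime through the sysfs knob added by this commit:
$ echo 1 > /sys/kernel/rcu_expedited
$ echo 0 > /sys/kernel/rcu_expedited
with the boot-time parameter providing the same control before user space is up.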
[ paulmck: Add trace/event/rcu.h to kernel/srcu.c to avoid build error.
Get rid of infinite loop through contention path.]
CRs-fixed: 634363
Change-Id: I45addcb532fdaa47df3019ada3283e293ed40249
Signed-off-by: Antti P Miettinen <amiettinen@nvidia.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Git-commit: 3705b88db0
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[ joonwoop@codeaurora.org: resolved merge conflicts.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
When the system contains no dirty pages, wakeup_flusher_threads()
will submit WB_SYNC_NONE writeback for 0 pages, so wb_writeback() exits
immediately without doing anything. Thus sync(1) will write all the
dirty inodes from a WB_SYNC_ALL writeback pass, which is slow.
Fix the problem by using get_nr_dirty_pages() in
wakeup_flusher_threads() instead of calculating the number of dirty pages
manually. That function also takes the number of dirty inodes into account.
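Condensed, the change has this shape (not a verbatim excerpt):

#include <linux/writeback.h>

void wakeup_flusher_threads_sketch(long nr_pages, enum wb_reason reason)
{
	/* get_nr_dirty_pages() also folds in an estimate for dirty inodes */
	if (!nr_pages)
		nr_pages = get_nr_dirty_pages();

	/* ... queue WB_SYNC_NONE work for nr_pages against each bdi ... */
}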
Change-Id: I458027ae08d9a5a93202a7b97ace1f8da7a18a07
CC: stable@vger.kernel.org
Reported-by: Paul Taysom <taysom@chromium.org>
Signed-off-by: Jan Kara <jack@suse.cz>
There is a race between marking an inode dirty and the writeback thread;
see the following scenario. In this case, the writeback thread will not run
even though there is dirty_io.
__mark_inode_dirty()                            bdi_writeback_workfn()
    ...                                             ...
    spin_lock(&inode->i_lock);
    ...
    if (bdi_cap_writeback_dirty(bdi)) {
        <<< assume wb has dirty_io, so wakeup_bdi is false.
        <<< the following inode_dirty also have wakeup_bdi false.
        if (!wb_has_dirty_io(&bdi->wb))
            wakeup_bdi = true;
    }
    spin_unlock(&inode->i_lock);
                                                <<< assume last dirty_io is removed here.
                                                pages_written = wb_do_writeback(wb);
                                                ...
                                                <<< work_list empty and wb has no dirty_io,
                                                <<< delayed_work will not be queued.
                                                if (!list_empty(&bdi->work_list) ||
                                                    (wb_has_dirty_io(wb) && dirty_writeback_interval))
                                                    queue_delayed_work(bdi_wq, &wb->dwork,
                                                        msecs_to_jiffies(dirty_writeback_interval * 10));
    spin_lock(&bdi->wb.list_lock);
    inode->dirtied_when = jiffies;
    <<< new dirty_io is added.
    list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
    spin_unlock(&bdi->wb.list_lock);
    <<< though there is dirty_io, but wakeup_bdi is false,
    <<< so writeback thread will not be waked up and
    <<< the new dirty_io will not be flushed.
    if (wakeup_bdi)
        bdi_wakeup_thread_delayed(bdi);
Writeback will not run until a new flush work is queued. This may cause
a lot of dirty pages to stay in memory for a long time.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>
Change-Id: I973fcba5381881a003a035ffff48f64348660079
Return an error if the video driver initialization fails.
Change-Id: I54c516808587fb7d94c3d517f8c72c6c9aa9d91d
CRs-fixed: 542535
Signed-off-by: Pushkaraj Patil <ppatil@codeaurora.org>
The KSM thread that scans pages is scheduled on a fixed timeout.
That wakes the CPU up from idle and hence may affect power
consumption. Provide optional support for using a deferred timer,
which suits low-power use cases.
To enable deferred timers,
$ echo 1 > /sys/kernel/mm/ksm/deferred_timer
Change-Id: I07fe199f97fe1f72f9a9e1b0b757a3ac533719e8
Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
There is a small race between when the cycle count is read from
the hardware and when the epoch stabilizes. Consider this
scenario:
CPU0                                    CPU1
----                                    ----
cyc = read_sched_clock()
cyc_to_sched_clock()
                                        update_sched_clock()
                                        ...
                                        cd.epoch_cyc = cyc;
  epoch_cyc = cd.epoch_cyc;
  ...
  epoch_ns + cyc_to_ns((cyc - epoch_cyc)
The cyc on cpu0 was read before the epoch changed. But we
calculate the nanoseconds based on the new epoch by subtracting
the new epoch from the old cycle count. Since epoch is most likely
larger than the old cycle count we calculate a large number that
will be converted to nanoseconds and added to epoch_ns, causing
time to jump forward too much.
Fix this problem by reading the hardware after the epoch has
stabilized.
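Condensed, the reader path after the fix looks like this (simplified; variable declarations omitted):

do {
	epoch_cyc = cd.epoch_cyc;
	smp_rmb();
	epoch_ns = cd.epoch_ns;
	smp_rmb();
} while (epoch_cyc != cd.epoch_cyc_copy);

cyc = read_sched_clock();	/* sampled only after the epoch is consistent */
return epoch_ns + cyc_to_ns((cyc - epoch_cyc) & sched_clock_mask,
			    cd.mult, cd.shift);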
Change-Id: I995133b229b2c2fedd5091406d1dc366d8bfff7b
Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Git-commit: 336ae1180d
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[sboyd: reworked for file movement kernel/time -> arm/kernel]
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
If we want to load epoch_cyc and epoch_ns atomically,
we should update epoch_cyc_copy first of all.
This notifies the reader that an update is in progress.
If we update epoch_cyc first, as in the current implementation,
there is a subtle error case.
Look at the example below.
<Initial Condition>
cyc = 9
ns = 900
cyc_copy = 9
== CASE 1 ==
<CPU A = reader>                        <CPU B = updater>
                                        write cyc = 10
read cyc = 10
read ns = 900
                                        write ns = 1000
                                        write cyc_copy = 10
read cyc_copy = 10
output = (10, 900)
== CASE 2 ==
<CPU A = reader>                        <CPU B = updater>
read cyc = 9
                                        write cyc = 10
                                        write ns = 1000
read ns = 1000
read cyc_copy = 9
                                        write cyc_copy = 10
output = (9, 1000)
If an atomic read were ensured, the output would be (9, 900) or (10, 1000),
but the outputs in the example cases are not.
So, change the update sequence in order to correct this problem.
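Condensed, the corrected update order is (simplified from the fix; a reader that samples in the middle now sees epoch_cyc != epoch_cyc_copy and retries):

raw_local_irq_save(flags);
cd.epoch_cyc_copy = cyc;	/* announce "update in progress" first */
smp_wmb();
cd.epoch_ns = ns;
smp_wmb();
cd.epoch_cyc = cyc;		/* publish last, once ns is in place */
raw_local_irq_restore(flags);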
Change-Id: Ia9196dd50a519f516f70c3138233624b669ef96a
Cc: <stable@vger.kernel.org>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
CRs-Fixed: 497236
Git-commit: 7c4e9ced42
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
The scheduler imposes a requirement on sched_clock():
the clock must be stopped during suspend. If we don't
do that, any RT thread will be rescheduled far in the future,
which might cause all sorts of problems.
This became an issue on OMAP when we converted omap-i2c.c
to use threaded IRQs, it turned out that depending on how
much time we spent on suspend, the I2C IRQ thread would
end up being rescheduled so far in the future that I2C
transfers would timeout and, because omap_hsmmc depends
on an I2C-connected device to detect if an MMC card is
inserted in the slot, our rootfs would just vanish.
arch/arm/kernel/sched_clock.c already had an optional
implementation (sched_clock_needs_suspend()) which would
handle the scheduler's requirement properly; what this patch
does is simply make that implementation non-optional.
Note that this has the side-effect that printk timings
won't reflect the actual time spent on suspend so other
methods to measure that will have to be used.
This has been tested with beagleboard XM (OMAP3630) and
pandaboard rev A3 (OMAP4430). Suspend to RAM is now working
after this patch.
Thanks to Kevin Hilman for helping out with debugging.
Change-Id: Ie2f9e3b22eb3d1f3806cf8c598f22e2fa1b8651f
Acked-by: Kevin Hilman <khilman@ti.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
CRs-Fixed: 497236
Git-commit: 6a4dae5e13
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Many clocks that are used to provide sched_clock will reset during
suspend. If read_sched_clock returns 0 after suspend, sched_clock will
appear to jump forward. This patch resets cd.epoch_cyc to the current
value of read_sched_clock during resume, which causes sched_clock() just
after suspend to return the same value as sched_clock() just before
suspend.
In addition, during the window where epoch_ns has been updated before
suspend, but epoch_cyc has not been updated after suspend, it is unknown
whether the clock has reset or not, and sched_clock() could return a
bogus value. Add a suspended flag, and return the pre-suspend epoch_ns
value during this period.
The new behavior is triggered by calling setup_sched_clock_needs_suspend
instead of setup_sched_clock.
Change-Id: I7441ef74dc6802c00eea61f3b8c0a25ac00a724d
Signed-off-by: Colin Cross <ccross@android.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
CRs-Fixed: 497236
Git-commit: 237ec6f2e5
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
F2FS has been shown to provide all-around performance improvements.
Change-Id: Ife324169b292c6dfa2e70fc40306cf453289127e
Signed-off-by: Zhao Wei Liew <zhaoweiliew@gmail.com>
The last patch is:
commit beaa57dd986d4f398728c060692fc2452895cfd8
Author: Chao Yu <chao2.yu@samsung.com>
Date: Thu Oct 22 18:24:12 2015 +0800
f2fs: fix to skip shrinking extent nodes
In f2fs_shrink_extent_tree we should stop the shrink flow if we have already
shrunk enough nodes in the extent cache.
Change-Id: I7bc76a98ce99412c59435f4573ace38fca604694
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The recovery image is too large when built with GCC 4.9.
Switch to XZ compression to reduce the image size.
Change-Id: I33e04f0daacc2bfac2212721921450f6a7927eb7
Signed-off-by: Zhao Wei Liew <zhaoweiliew@gmail.com>
I got a report about an unkillable task eating CPU. Further
investigation shows that the problem is in the fuse_fill_write_pages()
function. If the iov's first segment has zero length, we get an infinite
loop, because we never reach the iov_iter_advance() call.
Fix this by calling iov_iter_advance() before repeating an attempt to
copy data from userspace.
A similar problem is described in 124d3b7041 ("fix writev regression:
pan hanging unkillable and un-straceable"). If a zero-length segment
is followed by a segment with an invalid address,
iov_iter_fault_in_readable() checks only the first (zero-length) segment,
iov_iter_copy_from_user_atomic() skips it, fails at the second and
returns zero -> goto again without skipping the zero-length segment.
The patch calls iov_iter_advance() before the goto again: we'll skip the
zero-length segment on the second iteration and iov_iter_fault_in_readable()
will detect the invalid address.
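Condensed, the flow after the fix looks like this (not a verbatim excerpt of fuse_fill_write_pages()):

again:
	if (iov_iter_fault_in_readable(ii, bytes))
		return -EFAULT;

	tmp = iov_iter_copy_from_user_atomic(page, ii, offset, bytes);
	flush_dcache_page(page);

	iov_iter_advance(ii, tmp);	/* now runs even when tmp == 0 */
	if (!tmp)
		goto again;		/* the empty segment has been skipped */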
Special thanks to Konstantin Khlebnikov, who helped a lot with the commit
description.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Maxim Patlasov <mpatlasov@parallels.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Fixes: ea9b9907b8 ("fuse: implement perform_write")
Cc: <stable@vger.kernel.org>
Change-Id: Id37193373294dd43191469389cfe68ca1736a54b
This change initializes kernel space stack variables
that are passed between kernel space and user space
using ioctls.
Leaving these variables uninitialized may lead to
leakage of memory values from the kernel stack to
user space.
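An illustrative pattern (structure and names are placeholders, not taken from the actual driver):

#include <linux/errno.h>
#include <linux/string.h>
#include <linux/types.h>
#include <linux/uaccess.h>

struct example_info {			/* placeholder structure */
	__u32 version;
	__u32 reserved[3];
};

static long example_ioctl_get_info(void __user *arg)
{
	struct example_info info;

	/* Zero the whole on-stack structure so padding and unused fields
	 * cannot carry stale kernel stack contents to user space.
	 */
	memset(&info, 0, sizeof(info));
	info.version = 1;

	if (copy_to_user(arg, &info, sizeof(info)))
		return -EFAULT;
	return 0;
}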
Change-Id: Icb195470545ee48b55671ac09798610178e833e1
CRs-fixed: 556771,563420
Signed-off-by: Deepak Verma <dverma@codeaurora.org>
[zhaoweiliew: Remove unported LTR and GET_PERF_LEVEL features]
Signed-off-by: Zhao Wei Liew <zhaoweiliew@gmail.com>
As the dynamic fps feature is supported only for MIPI video panels,
return -EINVAL if the sysfs node is written for panels other than
MIPI video. This change also removes world-writable permissions
from the sysfs node.
Change-Id: I72260e6032cb8a758b0457c36a808263a981260f
Signed-off-by: Vishnuvardhan Prodduturi <vproddut@codeaurora.org>
This change fixes possible memory corruption
by increasing the local array sizes in the video
encoder.
Change-Id: If03cee2f428c3cef863178da70bc083159f203f5
Signed-off-by: Maheshwar Ajja <majja@codeaurora.org>
The system uses global_dirtyable_memory() to calculate the number of
dirtyable pages, i.e. pages that can be allocated to the page cache. A bug
causes an underflow, making the page count look like a huge unsigned
number. This in turn confuses the dirty writeback throttling into
aggressively writing back pages as they become dirty (usually one page at a
time).
time). This generally only affects systems with highmem because the
underflowed count gets subtracted from the global count of dirtyable
memory.
The problem was introduced with v3.2-4896-gab8fabd
The fix is to ensure we don't get an underflowed total of either highmem or
global dirtyable memory.
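A sketch of the described guard (not a verbatim excerpt): clamp the amount subtracted so the unsigned total can never wrap, and never report more highmem-dirtyable memory than actually exists.

#include <linux/kernel.h>

static unsigned long dirtyable_sketch(unsigned long x, unsigned long reserve,
				      unsigned long total)
{
	x -= min(x, reserve);	/* avoid unsigned underflow */
	return min(x, total);
}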
Change-Id: Ib0c4a8f99870edc5ded863b88a472f2836c690d2
Signed-off-by: Sonny Rao <sonnyrao@chromium.org>
Signed-off-by: Puneet Kumar <puneetster@chromium.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Damien Wyart <damien.wyart@free.fr>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: c8b74c2f66
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Commit 21111be8 "ARM: Fix negative idle stats for offline cpu" moved call
to cpu_die() to occur outside of rcu_idle_enter()/rcu_idle_exit() section.
As a result, the RCU_NONIDLE() wrapper to complete() call in
arch/arm/kernel/process.c:cpu_die() is no longer required (and is
technically incorrect to have). Removing RCU_NONIDLE() wrapper also removes
this warning seen during CPU offline:
[Note: Below message has been edited to fit 75-char per line limit.
Insignificant portions of warning message has been removed in each line.]
------------[ cut here ]------------
WARNING: at kernel/rcutree.c:456 rcu_idle_exit_common+0x4c/0xe0()
Modules linked in:
(unwind_backtrace+0x0/0x120) from (warn_slowpath_common+0x4c/0x64)
(warn_slowpath_common+0x4c/0x64) from (warn_slowpath_null+0x18/0x1c)
(warn_slowpath_null+0x18/0x1c) from (rcu_idle_exit_common+0x4c/0xe0)
(rcu_idle_exit_common+0x4c/0xe0) from (rcu_idle_exit+0xa8/0xc0)
(rcu_idle_exit+0xa8/0xc0) from (cpu_die+0x24/0x5c)
(cpu_die+0x24/0x5c) from (cpu_idle+0xdc/0xf0)
(cpu_idle+0xdc/0xf0) from (0x8160)
---[ end trace 61bf21937a496a37 ]---
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Change-Id: I721e6b20651674e6f6f584cf8d814af00b688c91
We see negative idle stats because a CPU can die without
cleaning up its idle-entry statistics. When a CPU goes offline,
the most immediate thing you'd want to do is just call
into cpu_die() and not waste time calling notifier(IDLE_START).
CRs-Fixed: 414554
Change-Id: Iadc6a3ca39997e0ccf65d2a29b004e24b1b211a1
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Taniya Das <tdas@codeaurora.org>
The pagetable list is protected by ptlock. A few functions
take the same ptlock while iterating over the pagetable list,
and kgsl_destroy_pagetable() also needs that lock. This causes
spinlock recursion if kgsl_destroy_pagetable() is called while
iterating over the list. Create two versions of the function:
one that takes the lock and one that does not.
CRs-Fixed: 621172
Change-Id: I61440f99022fce8629a57bb5661e2eef9613187b
Signed-off-by: Prakash Kamliya <pkamliya@codeaurora.org>
Currently we add a DMA command to the staged list within a
spinlock and then add it to the workqueue using queue_work() after unlocking
the spinlock. With this, there is a chance of executing DMA commands out
of order in the concurrency case below.
Thread 1                                Thread 2
__msm_dmov_enqueue_cmd_ext
  spin_lock_irqsave(..)
  list_add_tail(..)
  spin_unlock_irqrestore(..)
  --PREEMPT--
                                        __msm_dmov_enqueue_cmd_ext
                                          spin_lock_irqsave(..)
                                          list_add_tail(..)
                                          spin_unlock_irqrestore(..)
                                          queue_work()
  ..
  queue_work()
So calling queue_work() within the spinlock makes sure that the
work added to the workqueue is processed in the same order as the
commands are added to staged_commands.
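Illustrative shape of the fix (names are placeholders):

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

static void enqueue_cmd_sketch(struct list_head *staged, spinlock_t *lock,
			       struct workqueue_struct *wq,
			       struct work_struct *work,
			       struct list_head *cmd)
{
	unsigned long flags;

	spin_lock_irqsave(lock, flags);
	list_add_tail(cmd, staged);
	queue_work(wq, work);	/* inside the lock: preserves list order */
	spin_unlock_irqrestore(lock, flags);
}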
CRs-Fixed: 423190
Change-Id: I2ffd1327fb5f0cd1f06db7de9c026d1c4997fe4d
Acked-by: Gopi Krishna Nedanuri <gnedanur@qti.qualcomm.com>
Signed-off-by: Utsab Bose <ubose@codeaurora.org>