commit aac74dc495 upstream.
After learning we'll need some sort of deferred printk functionality in
the timekeeping core, Peter suggested we rename the printk_sched function
so it can be reused by needed subsystems.
This only changes the function name. No logic changes.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6acbfb9697 upstream.
Lai found that:
WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b()
...
migration_cpu_stop+0x1d/0x22
was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is
always a sub-set of cpu_online_mask.
This isn't true since 5fbd036b55 ("sched: Cleanup cpu_active madness").
So set active and online at the same time to avoid this particular
problem.
Fixes: 5fbd036b55 ("sched: Cleanup cpu_active madness")
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael wang <wangyun@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Link: http://lkml.kernel.org/r/53758B12.8060609@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6227cb00cc upstream.
The check at the beginning of cpupri_find() makes sure that the task_pri
variable does not exceed the cp->pri_to_cpu array length. But that length
is CPUPRI_NR_PRIORITIES not MAX_RT_PRIO, where it will miss the last two
priorities in that array.
As task_pri is computed from convert_prio() which should never be bigger
than CPUPRI_NR_PRIORITIES, if the check should cause a panic if it is
hit.
Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1397015410.5212.13.camel@marge.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 201c373e8e upstream.
Various sd->*_idx's are used for refering the rq's load average table
when selecting a cpu to run. However they can be set to any number
with sysctl knobs so that it can crash the kernel if something bad is
given. Fix it by limiting them into the actual range.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1345104204-8317-1-git-send-email-namhyung@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Rui Xiang <rui.xiang@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 791c9e0292 upstream.
dequeue_entity() is called when p->on_rq and sets se->on_rq = 0
which appears to guarentee that the !se->on_rq condition is met.
If the task has done set_current_state(TASK_INTERRUPTIBLE) without
schedule() the second condition will be met and vruntime will be
incorrectly adjusted twice.
In certain cases this can result in the task's vruntime never increasing
past the vruntime of other tasks on the CFS' run queue, starving them of
CPU time.
This patch changes switched_from_fair() to use !p->on_rq instead of
!se->on_rq.
I'm able to cause a task with a priority of 120 to starve all other
tasks with the same priority on an ARM platform running 3.2.51-rt72
PREEMPT RT by writing one character at time to a serial tty (16550 UART)
in a tight loop. I'm also able to verify making this change corrects the
problem on that platform and kernel version.
Signed-off-by: George McCollister <george.mccollister@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1392767811-28916-1-git-send-email-george.mccollister@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5aaa0b7a2e upstream.
Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[]
calculations") since while that fixed the busy case it regressed the
mostly idle case.
Add a callback from the nohz exit to also age the rq->cpu_load[]
array. This closes the hole where either there was no nohz load
balance pass during the nohz, or there was a 'significant' amount of
idle time between the last nohz balance and the nohz exit.
So we'll update unconditionally from the tick to not insert any
accidental 0 load periods while busy, and we try and catch up from
nohz idle balance and nohz exit. Both these are still prone to missing
a jiffy, but that has always been the case.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: pjt@google.com
Cc: Venkatesh Pallipadi <venki@google.com>
Link: http://lkml.kernel.org/n/tip-kt0trz0apodbf84ucjfdbr1a@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 556061b00c upstream.
While investigating why the load-balancer did funny I found that the
rq->cpu_load[] tables were completely screwy.. a bit more digging
revealed that the updates that got through were missing ticks followed
by a catchup of 2 ticks.
The catchup assumes the cpu was idle during that time (since only nohz
can cause missed ticks and the machine is idle etc..) this means that
esp. the higher indices were significantly lower than they ought to
be.
The reason for this is that its not correct to compare against jiffies
on every jiffy on any other cpu than the cpu that updates jiffies.
This patch cludges around it by only doing the catch-up stuff from
nohz_idle_balance() and doing the regular stuff unconditionally from
the tick.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: pjt@google.com
Cc: Venkatesh Pallipadi <venki@google.com>
Link: http://lkml.kernel.org/n/tip-tp4kj18xdd5aj4vvj0qg55s2@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 57d2aa00dc upstream.
The issue below was found in 2.6.34-rt rather than mainline rt
kernel, but the issue still exists upstream as well.
So please let me describe how it was noticed on 2.6.34-rt:
On this version, each softirq has its own thread, it means there
is at least one RT FIFO task per cpu. The priority of these
tasks is set to 49 by default. If user launches an RT FIFO task
with priority lower than 49 of softirq RT tasks, it's possible
there are two RT FIFO tasks enqueued one cpu runqueue at one
moment. By current strategy of balancing RT tasks, when it comes
to RT tasks, we really need to put them off to a CPU that they
can run on as soon as possible. Even if it means a bit of cache
line flushing, we want RT tasks to be run with the least latency.
When the user RT FIFO task which just launched before is
running, the sched timer tick of the current cpu happens. In this
tick period, the timeout value of the user RT task will be
updated once. Subsequently, we try to wake up one softirq RT
task on its local cpu. As the priority of current user RT task
is lower than the softirq RT task, the current task will be
preempted by the higher priority softirq RT task. Before
preemption, we check to see if current can readily move to a
different cpu. If so, we will reschedule to allow the RT push logic
to try to move current somewhere else. Whenever the woken
softirq RT task runs, it first tries to migrate the user FIFO RT
task over to a cpu that is running a task of lesser priority. If
migration is done, it will send a reschedule request to the found
cpu by IPI interrupt. Once the target cpu responds the IPI
interrupt, it will pick the migrated user RT task to preempt its
current task. When the user RT task is running on the new cpu,
the sched timer tick of the cpu fires. So it will tick the user
RT task again. This also means the RT task timeout value will be
updated again. As the migration may be done in one tick period,
it means the user RT task timeout value will be updated twice
within one tick.
If we set a limit on the amount of cpu time for the user RT task
by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted
upon reaching the soft limit.
But exactly when the SIGXCPU signal should be sent depends on the
RT task timeout value. In fact the timeout mechanism of sending
the SIGXCPU signal assumes the RT task timeout is increased once
every tick.
However, currently the timeout value may be added twice per
tick. So it results in the SIGXCPU signal being sent earlier
than expected.
To solve this issue, we prevent the timeout value from increasing
twice within one tick time by remembering the jiffies value of
last updating the timeout. As long as the RT task's jiffies is
different with the global jiffies value, we allow its timeout to
be updated.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Fan Du <fan.du@windriver.com>
Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1342508623-2887-1-git-send-email-ying.xue@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[ lizf: backported to 3.4: adjust context ]
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a4c96ae319 upstream.
migrate_tasks() uses _pick_next_task_rt() to get tasks from the
real-time runqueues to be migrated. When rt_rq is throttled
_pick_next_task_rt() won't return anything, in which case
migrate_tasks() can't move all threads over and gets stuck in an
infinite loop.
Instead unthrottle rt runqueues before migrating tasks.
Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair()
Signed-off-by: Peter Boonstoppel <pboonstoppel@nvidia.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/5FBF8E85CA34454794F0F7ECBA79798F379D3648B7@HQMAIL04.nvidia.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[ lizf: backported to 3.4: adjust context ]
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e221d028bb upstream.
Root task group bandwidth replenishment must service all CPUs, regardless of
where the timer was last started, and regardless of the isolation mechanism,
lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344326558.6968.25.camel@marge.simpson.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 454c79999f upstream.
task_tick_rt() has an optimization to only reschedule SCHED_RR tasks
if they were the only element on their rq. However, with cgroups
a SCHED_RR task could be the only element on its per-cgroup rq but
still be competing with other SCHED_RR tasks in its parent's
cgroup. In this case, the SCHED_RR task in the child cgroup would
never yield at the end of its timeslice. If the child cgroup
rt_runtime_us was the same as the parent cgroup rt_runtime_us,
the task in the parent cgroup would starve completely.
Modify task_tick_rt() to check that the task is the only task on its
rq, and that the each of the scheduling entities of its ancestors
is also the only entity on its rq.
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1337229266-15798-1-git-send-email-ccross@android.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0ac9b1c218 upstream.
Currently, group entity load-weights are initialized to zero. This
admits some races with respect to the first time they are re-weighted in
earlty use. ( Let g[x] denote the se for "g" on cpu "x". )
Suppose that we have root->a and that a enters a throttled state,
immediately followed by a[0]->t1 (the only task running on cpu[0])
blocking:
put_prev_task(group_cfs_rq(a[0]), t1)
put_prev_entity(..., t1)
check_cfs_rq_runtime(group_cfs_rq(a[0]))
throttle_cfs_rq(group_cfs_rq(a[0]))
Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
time:
enqueue_task_fair(rq[0], t2)
enqueue_entity(group_cfs_rq(b[0]), t2)
enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
account_entity_enqueue(group_cfs_ra(b[0]), t2)
update_cfs_shares(group_cfs_rq(b[0]))
< skipped because b is part of a throttled hierarchy >
enqueue_entity(group_cfs_rq(a[0]), b[0])
...
We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
which violates invariants in several code-paths. Eliminate the
possibility of this by initializing group entity weight.
Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 927b54fccb upstream.
__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
waiting for the hrtimer to finish. However, if sched_cfs_period_timer
runs for another loop iteration, the hrtimer can attempt to take
rq->lock, resulting in deadlock.
Fix this by ensuring that cfs_b->timer_active is cleared only if the
_latest_ call to do_sched_cfs_period_timer is returning as idle. Then
__start_cfs_bandwidth can just call hrtimer_try_to_cancel and wait for
that to succeed or timer_active == 1.
Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/20131016181622.22647.16643.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit db06e78cc1 upstream.
hrtimer_expires_remaining does not take internal hrtimer locks and thus
must be guarded against concurrent __hrtimer_start_range_ns (but
returning HRTIMER_RESTART is safe). Use cfs_b->lock to make it safe.
Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/20131016181617.22647.73829.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1ee14e6c8c upstream.
When we transition cfs_bandwidth_used to false, any currently
throttled groups will incorrectly return false from cfs_rq_throttled.
While tg_set_cfs_bandwidth will unthrottle them eventually, currently
running code (including at least dequeue_task_fair and
distribute_cfs_runtime) will cause errors.
Fix this by turning off cfs_bandwidth_used only after unthrottling all
cfs_rqs.
Tested: toggle bandwidth back and forth on a loaded cgroup. Caused
crashes in minutes without the patch, hasn't crashed with it.
Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/20131016181611.22647.80365.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f9f9ffc237 upstream.
throttle_cfs_rq() doesn't check to make sure that period_timer is running,
and while update_curr/assign_cfs_runtime does, a concurrently running
period_timer on another cpu could cancel itself between this cpu's
update_curr and throttle_cfs_rq(). If there are no other cfs_rqs running
in the tg to restart the timer, this causes the cfs_rq to be stranded
forever.
Fix this by calling __start_cfs_bandwidth() in throttle if the timer is
inactive.
(Also add some sched_debug lines for cfs_bandwidth.)
Tested: make a run/sleep task in a cgroup, loop switching the cgroup
between 1ms/100ms quota and unlimited, checking for timer_active=0 and
throttled=1 as a failure. With the throttle_cfs_rq() change commented out
this fails, with the full patch it passes.
Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/20131016181632.22647.84174.stgit@sword-of-the-dawn.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 757dfcaa41 upstream.
This patch touches the RT group scheduling case.
Functions inc_rt_prio_smp() and dec_rt_prio_smp() change (global) rq's
priority, while rt_rq passed to them may be not the top-level rt_rq.
This is wrong, because changing of priority on a child level does not
guarantee that the priority is the highest all over the rq. So, this
leak makes RT balancing unusable.
The short example: the task having the highest priority among all rq's
RT tasks (no one other task has the same priority) are waking on a
throttle rt_rq. The rq's cpupri is set to the task's priority
equivalent, but real rq->rt.highest_prio.curr is less.
The patch below fixes the problem.
Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
CC: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/49231385567953@web4m.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6c9a27f5da upstream.
There is a small race between copy_process() and cgroup_attach_task()
where child->se.parent,cfs_rq points to invalid (old) ones.
parent doing fork() | someone moving the parent to another cgroup
-------------------------------+---------------------------------------------
copy_process()
+ dup_task_struct()
-> parent->se is copied to child->se.
se.parent,cfs_rq of them point to old ones.
cgroup_attach_task()
+ cgroup_task_migrate()
-> parent->cgroup is updated.
+ cpu_cgroup_attach()
+ sched_move_task()
+ task_move_group_fair()
+- set_task_rq()
-> se.parent,cfs_rq of parent
are updated.
+ cgroup_fork()
-> parent->cgroup is copied to child->cgroup. (*1)
+ sched_fork()
+ task_fork_fair()
-> se.parent,cfs_rq of child are accessed
while they point to old ones. (*2)
In the worst case, this bug can lead to "use-after-free" and cause a panic,
because it's new cgroup's refcount that is incremented at (*1),
so the old cgroup(and related data) can be freed before (*2).
In fact, a panic caused by this bug was originally caught in RHEL6.4.
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0
[...]
Call Trace:
[<ffffffff81051f25>] place_entity+0x75/0xa0
[<ffffffff81056a3a>] task_fork_fair+0xaa/0x160
[<ffffffff81063c0b>] sched_fork+0x6b/0x140
[<ffffffff8106c3c2>] copy_process+0x5b2/0x1450
[<ffffffff81063b49>] ? wake_up_new_task+0xd9/0x130
[<ffffffff8106d2f4>] do_fork+0x94/0x460
[<ffffffff81072a9e>] ? sys_wait4+0xae/0x100
[<ffffffff81009598>] sys_clone+0x28/0x30
[<ffffffff8100b393>] stub_clone+0x13/0x20
[<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/039601ceae06$733d3130$59b79390$@mxp.nes.nec.co.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a59f4e079d upstream.
The caller of sched_sliced() should pass se.cfs_rq and se as the
arguments, however in sched_rr_get_interval() we gave it
rq.cfs_rq and se, which made the following computation obviously
wrong.
The change was introduced by commit:
77034937dc sched: fix crash in sys_sched_rr_get_interval()
... 5 years ago, while it had been the correct 'cfs_rq_of' before
the commit. The change seems to be irrelevant to the commit
msg, which was to return a 0 timeslice for tasks that are on an
idle runqueue. So I believe that was just a plain typo.
Signed-off-by: Zhu Yanhai <gaoyang.zyh@taobao.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1357621012-15039-1-git-send-email-gaoyang.zyh@taobao.com
[ Since this is an ABI and an old bug, we'll test this via a
slow upstream route, to hopefully discover any app breakage. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 383efcd000 upstream.
try_to_wake_up_local() should only be invoked to wake up another
task in the same runqueue and BUG_ON()s are used to enforce the
rule. Missing try_to_wake_up_local() can stall workqueue
execution but such stalls are likely to be finite either by
another work item being queued or the one blocked getting
unblocked. There's no reason to trigger BUG while holding rq
lock crashing the whole system.
Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130318192234.GD3042@htj.dyndns.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a1cbcaa9ea upstream.
The sched_clock_remote() implementation has the following inatomicity
problem on 32bit systems when accessing the remote scd->clock, which
is a 64bit value.
CPU0 CPU1
sched_clock_local() sched_clock_remote(CPU0)
...
remote_clock = scd[CPU0]->clock
read_low32bit(scd[CPU0]->clock)
cmpxchg64(scd->clock,...)
read_high32bit(scd[CPU0]->clock)
While the update of scd->clock is using an atomic64 mechanism, the
readout on the remote cpu is not, which can cause completely bogus
readouts.
It is a quite rare problem, because it requires the update to hit the
narrow race window between the low/high readout and the update must go
across the 32bit boundary.
The resulting misbehaviour is, that CPU1 will see the sched_clock on
CPU1 ~4 seconds ahead of it's own and update CPU1s sched_clock value
to this bogus timestamp. This stays that way due to the clamping
implementation for about 4 seconds until the synchronization with
CLOCK_MONOTONIC undoes the problem.
The issue is hard to observe, because it might only result in a less
accurate SCHED_OTHER timeslicing behaviour. To create observable
damage on realtime scheduling classes, it is necessary that the bogus
update of CPU1 sched_clock happens in the context of an realtime
thread, which then gets charged 4 seconds of RT runtime, which results
in the RT throttler mechanism to trigger and prevent scheduling of RT
tasks for a little less than 4 seconds. So this is quite unlikely as
well.
The issue was quite hard to decode as the reproduction time is between
2 days and 3 weeks and intrusive tracing makes it less likely, but the
following trace recorded with trace_clock=global, which uses
sched_clock_local(), gave the final hint:
<idle>-0 0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
<idle>-0 0d..30 400269.477151: hrtimer_start: hrtimer=0xf7061e80 ...
irq/20-S-587 1d..32 400273.772118: sched_wakeup: comm= ... target_cpu=0
<idle>-0 0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80
What happens is that CPU0 goes idle and invokes
sched_clock_idle_sleep_event() which invokes sched_clock_local() and
CPU1 runs a remote wakeup for CPU0 at the same time, which invokes
sched_remote_clock(). The time jump gets propagated to CPU0 via
sched_remote_clock() and stays stale on both cores for ~4 seconds.
There are only two other possibilities, which could cause a stale
sched clock:
1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic
wrong value.
2) sched_clock() which reads the TSC returns a sporadic wrong value.
#1 can be excluded because sched_clock would continue to increase for
one jiffy and then go stale.
#2 can be excluded because it would not make the clock jump
forward. It would just result in a stale sched_clock for one jiffy.
After quite some brain twisting and finding the same pattern on other
traces, sched_clock_remote() remained the only place which could cause
such a problem and as explained above it's indeed racy on 32bit
systems.
So while on 64bit systems the readout is atomic, we need to verify the
remote readout on 32bit machines. We need to protect the local->clock
readout in sched_clock_remote() on 32bit as well because an NMI could
hit between the low and the high readout, call sched_clock_local() and
modify local->clock.
Thanks to Siegfried Wulsch for bearing with my debug requests and
going through the tedious tasks of running a bunch of reproducer
systems to generate the debug information which let me decode the
issue.
Reported-by: Siegfried Wulsch <Siegfried.Wulsch@rovema.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit aa7f67304d upstream.
When the system has multiple domains do_sched_rt_period_timer()
can run on any CPU and may iterate over all rt_rq in
cpu_online_mask. This means when balance_runtime() is run for a
given rt_rq that rt_rq may be in a different rd than the current
processor. Thus if we use smp_processor_id() to get rd in
do_balance_runtime() we may borrow runtime from a rt_rq that is
not part of our rd.
This changes do_balance_runtime to get the rd from the passed in
rt_rq ensuring that we borrow runtime only from the correct rd
for the given rt_rq.
This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime
when we try reclaim runtime lent to other rt_rq but runtime has
been lent to a rt_rq in another rd.
Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Mike Galbraith <bitbucket@online.de>
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1358186131-29494-1-git-send-email-sbohrer@rgmadvisors.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9067ac85d5 upstream.
wake_up_process() should never wakeup a TASK_STOPPED/TRACED task.
Change it to use TASK_NORMAL and add the WARN_ON().
TASK_ALL has no other users, probably can be killed.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8f6189684e upstream.
Make stop scheduler class do the same accounting as other classes,
Migration threads can be caught in the act while doing exec balancing,
leading to the below due to use of unmaintained ->se.exec_start. The
load that triggered this particular instance was an apparently out of
control heavily threaded application that does system monitoring in
what equated to an exec bomb, with one of the VERY frequently migrated
tasks being ps.
%CPU PID USER CMD
99.3 45 root [migration/10]
97.7 53 root [migration/12]
97.0 57 root [migration/13]
90.1 49 root [migration/11]
89.6 65 root [migration/15]
88.7 17 root [migration/3]
80.4 37 root [migration/8]
78.1 41 root [migration/9]
44.2 13 root [migration/2]
Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344051854.6739.19.camel@marge.simpson.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d35be8bab9 upstream.
In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
masks as and when necessary to ensure that the tasks belonging to the cpusets
have some place (online CPUs) to run on. And regular CPU hotplug is
destructive in the sense that the kernel doesn't remember the original cpuset
configurations set by the user, across hotplug operations.
However, suspend/resume (which uses CPU hotplug) is a special case in which
the kernel has the responsibility to restore the system (during resume), to
exactly the same state it was in before suspend.
In order to achieve that, do the following:
1. Don't modify cpusets during suspend/resume. At all.
In particular, don't move the tasks from one cpuset to another, and
don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
during the CPU hotplug operations that are carried out in the
suspend/resume path.
2. However, cpusets and sched domains are related. We just want to avoid
altering cpusets alone. So, to keep the sched domains updated, build
a single sched domain (containing all active cpus) during each of the
CPU hotplug operations carried out in s/r path, effectively ignoring
the cpusets' cpus_allowed masks.
(Since userspace is frozen while doing all this, it will go unnoticed.)
3. During the last CPU online operation during resume, build the sched
domains by looking up the (unaltered) cpusets' cpus_allowed masks.
That will bring back the system to the same original state as it was in
before suspend.
Ultimately, this will not only solve the cpuset problem related to suspend
resume (ie., restores the cpusets to exactly what it was before suspend, by
not touching it at all) but also speeds up suspend/resume because we avoid
running cpuset update code for every CPU being offlined/onlined.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120524141611.3692.20155.stgit@srivatsabhat.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8323f26ce3 upstream.
Stefan reported a crash on a kernel before a3e5d1091c ("sched:
Don't call task_group() too many times in set_task_rq()"), he
found the reason to be that the multiple task_group()
invocations in set_task_rq() returned different values.
Looking at all that I found a lack of serialization and plain
wrong comments.
The below tries to fix it using an extra pointer which is
updated under the appropriate scheduler locks. Its not pretty,
but I can't really see another way given how all the cgroup
stuff works.
Reported-and-tested-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit bea6832cc8 upstream.
On architectures where cputime_t is 64 bit type, is possible to trigger
divide by zero on do_div(temp, (__force u32) total) line, if total is a
non zero number but has lower 32 bit's zeroed. Removing casting is not
a good solution since some do_div() implementations do cast to u32
internally.
This problem can be triggered in practice on very long lived processes:
PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin"
#0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
#1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
#2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
#3 [ffff880472a51cd0] die at ffffffff8100f26b
#4 [ffff880472a51d00] do_trap at ffffffff814f03f4
#5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
#6 [ffff880472a51e00] divide_error at ffffffff8100be7b
[exception RIP: thread_group_times+0x56]
RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046
RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad
RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc
RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800
R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08
R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
#8 [ffff880472a51f40] sys_times at ffffffff81088524
#9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202
RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000
RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0
RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b
R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940
R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940
ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 35cf4e50b1 upstream.
With multiple instances of task_groups, for_each_rt_rq() is a noop,
no task groups having been added to the rt.c list instance. This
renders __enable/disable_runtime() and print_rt_stats() noop, the
user (non) visible effect being that rt task groups are missing in
/proc/sched_debug.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344308413.6846.7.camel@marge.simpson.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5167e8d541 upstream.
Thanks to Charles Wang for spotting the defects in the current code:
- If we go idle during the sample window -- after sampling, we get a
negative bias because we can negate our own sample.
- If we wake up during the sample window we get a positive bias
because we push the sample to a known active period.
So rewrite the entire nohz load-avg muck once again, now adding
copious documentation to the code.
Reported-and-tested-by: Doug Smythies <dsmythies@telus.net>
Reported-and-tested-by: Charles Wang <muming.wq@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins
[ minor edits ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a841f8cef4 upstream.
It does not get processed because sched_domain_level_max is 0 at the
time that setup_relax_domain_level() is run.
Simply accept the value as it is, as we don't know the value of
sched_domain_level_max until sched domain construction is completed.
Fix sched_relax_domain_level in cpuset. The build_sched_domain() routine calls
the set_domain_attribute() routine prior to setting the sd->level, however,
the set_domain_attribute() routine relies on the sd->level to decide whether
idle load balancing will be off/on.
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120605184436.GA15668@sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If we have one cpu that failed to boot and boot cpu gave up on
waiting for it and then another cpu is being booted, kernel
might crash with following OOPS:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: [<ffffffff812c3630>] __bitmap_weight+0x30/0x80
Call Trace:
[<ffffffff8108b9b6>] build_sched_domains+0x7b6/0xa50
The crash happens in init_sched_groups_power() that expects
sched_groups to be circular linked list. However it is not
always true, since sched_groups preallocated in __sdt_alloc are
initialized in build_sched_groups and it may exit early
if (cpu != cpumask_first(sched_domain_span(sd)))
return 0;
without initializing sd->groups->next field.
Fix bug by initializing next field right after sched_group was
allocated.
Also-Reported-by: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Cc: a.p.zijlstra@chello.nl
Cc: pjt@google.com
Cc: seto.hidetoshi@jp.fujitsu.com
Link: http://lkml.kernel.org/r/1336559908-32533-1-git-send-email-imammedo@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Under extreme memory used up situations, percpu allocation
might fail. We hit it when system goes to suspend-to-ram,
causing a kworker panic:
EIP: [<c124411a>] build_sched_domains+0x23a/0xad0
Kernel panic - not syncing: Fatal exception
Pid: 3026, comm: kworker/u:3
3.0.8-137473-gf42fbef #1
Call Trace:
[<c18cc4f2>] panic+0x66/0x16c
[...]
[<c1244c37>] partition_sched_domains+0x287/0x4b0
[<c12a77be>] cpuset_update_active_cpus+0x1fe/0x210
[<c123712d>] cpuset_cpu_inactive+0x1d/0x30
[...]
With this fix applied build_sched_domains() will return -ENOMEM and
the suspend attempt fails.
Signed-off-by: he, bo <bo.he@intel.com>
Reviewed-by: Zhang, Yanmin <yanmin.zhang@intel.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/1335355161.5892.17.camel@hebo
[ So, we fail to deallocate a CPU because we cannot allocate RAM :-/
I don't like that kind of sad behavior but nevertheless it should
not crash under high memory load. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commits 367456c756 ("sched: Ditch per cgroup task lists for
load-balancing") and 5d6523ebd ("sched: Fix load-balance wreckage")
left some more wreckage.
By setting loop_max unconditionally to ->nr_running load-balancing
could take a lot of time on very long runqueues (hackbench!). So keep
the sysctl as max limit of the amount of tasks we'll iterate.
Furthermore, the min load filter for migration completely fails with
cgroups since inequality in per-cpu state can easily lead to such
small loads :/
Furthermore the change to add new tasks to the tail of the queue
instead of the head seems to have some effect.. not quite sure I
understand why.
Combined these fixes solve the huge hackbench regression reported by
Tim when hackbench is ran in a cgroup.
Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1335365763.28150.267.camel@twins
[ got rid of the CONFIG_PREEMPT tuning and made small readability edits ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull scheduler fixes from Ingo Molnar.
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix incorrect usage of for_each_cpu_mask() in select_fallback_rq()
sched: Fix __schedule_bug() output when called from an interrupt
sched/arch: Introduce the finish_arch_post_lock_switch() scheduler callback
The function for_each_cpu_mask() expects a *pointer* to struct
cpumask as its second argument, whereas select_fallback_rq()
passes the value itself.
And moreover, for_each_cpu_mask() has been marked as obselete
in include/linux/cpumask.h. So move to the more appropriate
for_each_cpu() variant.
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Dave Jones <davej@redhat.com>
Cc: Liu Chuansheng <chuansheng.liu@intel.com>
Cc: vapier@gentoo.org
Cc: rusty@rustcorp.com.au
Link: http://lkml.kernel.org/r/4F75BED4.9050005@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull scheduler fixes from Ingo Molnar.
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpusets: Remove an unused variable
sched/rt: Improve pick_next_highest_task_rt()
sched: Fix select_fallback_rq() vs cpu_active/cpu_online
sched/x86/smp: Do not enable IRQs over calibrate_delay()
sched: Fix compiler warning about declared inline after use
MAINTAINERS: Update email address for SCHEDULER and PERF EVENTS
If schedule is called from an interrupt handler __schedule_bug()
will call show_regs() with the registers saved during the
interrupt handling done in do_IRQ(). This means we'll see the
registers and the backtrace for the process that was interrupted
and not the full backtrace explaining who called schedule().
This is due to 838225b ("sched: use show_regs() to improve
__schedule_bug() output", 2007-10-24) which improperly assumed
that get_irq_regs() would return the registers for the current
stack because it is being called from within an interrupt
handler. Simply remove the show_reg() code so that we dump a
backtrace for the interrupt handler that called schedule().
[ I ran across this when I was presented with a scheduling while
atomic log with a stacktrace pointing at spin_unlock_irqrestore().
It made no sense and I had to guess what interrupt handler could
be called and poke around for someone calling schedule() in an
interrupt handler. A simple test of putting an msleep() in
an interrupt handler works better with this patch because you
can actually see the msleep() call in the backtrace. ]
Also-reported-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Satyam Sharma <satyam@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1332979847-27102-1-git-send-email-sboyd@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
asm/system.h is a cause of circular dependency problems because it contains
commonly used primitive stuff like barrier definitions and uncommonly used
stuff like switch_to() that might require MMU definitions.
asm/system.h has been disintegrated by this point on all arches into the
following common segments:
(1) asm/barrier.h
Moved memory barrier definitions here.
(2) asm/cmpxchg.h
Moved xchg() and cmpxchg() here. #included in asm/atomic.h.
(3) asm/bug.h
Moved die() and similar here.
(4) asm/exec.h
Moved arch_align_stack() here.
(5) asm/elf.h
Moved AT_VECTOR_SIZE_ARCH here.
(6) asm/switch_to.h
Moved switch_to() here.
Signed-off-by: David Howells <dhowells@redhat.com>
Avoid extra work by continuing on to the next rt_rq if the highest
prio task in current rt_rq is the same priority as our candidate
task.
More detailed explanation: if next is not NULL, then we have found a
candidate task, and its priority is next->prio. Now we are looking
for an even higher priority task in the other rt_rq's. idx is the
highest priority in the current candidate rt_rq. In the current 3.3
code, if idx is equal to next->prio, we would start scanning the tasks
in that rt_rq and replace the current candidate task with a task from
that rt_rq. But the new task would only have a priority that is equal
to our previous candidate task, so we have not advanced our goal of
finding a higher prio task. So we should avoid the extra work by
continuing on to the next rt_rq if idx is equal to next->prio.
Signed-off-by: Michael J Wang <mjwang@broadcom.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/2EF88150C0EF2C43A218742ED384C1BC0FC83D6B@IRVEXCHMB08.corp.ad.broadcom.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 5fbd036b55 ("sched: Cleanup cpu_active madness"), which was
supposed to finally sort the cpu_active mess, instead uncovered more.
Since CPU_STARTING is ran before setting the cpu online, there's a
(small) window where the cpu has active,!online.
If during this time there's a wakeup of a task that used to reside on
that cpu select_task_rq() will use select_fallback_rq() to compute an
alternative cpu to run on since we find !online.
select_fallback_rq() however will compute the new cpu against
cpu_active, this means that it can return the same cpu it started out
with, the !online one, since that cpu is in fact marked active.
This results in us trying to scheduling a task on an offline cpu and
triggering a WARN in the IPI code.
The solution proposed by Chuansheng Liu of setting cpu_active in
set_cpu_online() is buggy, firstly not all archs actually use
set_cpu_online(), secondly, not all archs call set_cpu_online() with
IRQs disabled, this means we would introduce either the same race or
the race from fd8a7de17 ("x86: cpu-hotplug: Prevent softirq wakeup on
wrong CPU") -- albeit much narrower.
[ By setting online first and active later we have a window of
online,!active, fresh and bound kthreads have task_cpu() of 0 and
since cpu0 isn't in tsk_cpus_allowed() we end up in
select_fallback_rq() which excludes !active, resulting in a reset
of ->cpus_allowed and the thread running all over the place. ]
The solution is to re-work select_fallback_rq() to require active
_and_ online. This makes the active,!online case work as expected,
OTOH archs running CPU_STARTING after setting online are now
vulnerable to the issue from fd8a7de17 -- these are alpha and
blackfin.
Reported-by: Chuansheng Liu <chuansheng.liu@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: linux-alpha@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-hubqk1i10o4dpvlm06gq7v6j@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
kernel/sched/fair.c:420: warning: 'account_cfs_rq_runtime' declared inline after being called
kernel/sched/fair.c:420: warning: previous declaration of 'account_cfs_rq_runtime' was here
kernel/sched/fair.c:1165: warning: 'return_cfs_rq_runtime' declared inlineafter being called
kernel/sched/fair.c:1165: warning: previous declaration of 'return_cfs_rq_runtime' was here
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120321200717.49BB4A024E@akpm.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull security subsystem updates for 3.4 from James Morris:
"The main addition here is the new Yama security module from Kees Cook,
which was discussed at the Linux Security Summit last year. Its
purpose is to collect miscellaneous DAC security enhancements in one
place. This also marks a departure in policy for LSM modules, which
were previously limited to being standalone access control systems.
Chromium OS is using Yama, and I believe there are plans for Ubuntu,
at least.
This patchset also includes maintenance updates for AppArmor, TOMOYO
and others."
Fix trivial conflict in <net/sock.h> due to the jumo_label->static_key
rename.
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (38 commits)
AppArmor: Fix location of const qualifier on generated string tables
TOMOYO: Return error if fails to delete a domain
AppArmor: add const qualifiers to string arrays
AppArmor: Add ability to load extended policy
TOMOYO: Return appropriate value to poll().
AppArmor: Move path failure information into aa_get_name and rename
AppArmor: Update dfa matching routines.
AppArmor: Minor cleanup of d_namespace_path to consolidate error handling
AppArmor: Retrieve the dentry_path for error reporting when path lookup fails
AppArmor: Add const qualifiers to generated string tables
AppArmor: Fix oops in policy unpack auditing
AppArmor: Fix error returned when a path lookup is disconnected
KEYS: testing wrong bit for KEY_FLAG_REVOKED
TOMOYO: Fix mount flags checking order.
security: fix ima kconfig warning
AppArmor: Fix the error case for chroot relative path name lookup
AppArmor: fix mapping of META_READ to audit and quiet flags
AppArmor: Fix underflow in xindex calculation
AppArmor: Fix dropping of allowed operations that are force audited
AppArmor: Add mising end of structure test to caps unpacking
...
Pull cgroup changes from Tejun Heo:
"Out of the 8 commits, one fixes a long-standing locking issue around
tasklist walking and others are cleanups."
* 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list
cgroup: Remove wrong comment on cgroup_enable_task_cg_list()
cgroup: remove cgroup_subsys argument from callbacks
cgroup: remove extra calls to find_existing_css_set
cgroup: replace tasklist_lock with rcu_read_lock
cgroup: simplify double-check locking in cgroup_attach_proc
cgroup: move struct cgroup_pidlist out from the header file
cgroup: remove cgroup_attach_task_current_cg()
Pull scheduler changes for v3.4 from Ingo Molnar
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
printk: Make it compile with !CONFIG_PRINTK
sched/x86: Fix overflow in cyc2ns_offset
sched: Fix nohz load accounting -- again!
sched: Update yield() docs
printk/sched: Introduce special printk_sched() for those awkward moments
sched/nohz: Correctly initialize 'next_balance' in 'nohz' idle balancer
sched: Cleanup cpu_active madness
sched: Fix load-balance wreckage
sched: Clean up parameter passing of proc_sched_autogroup_set_nice()
sched: Ditch per cgroup task lists for load-balancing
sched: Rename load-balancing fields
sched: Move load-balancing arguments into helper struct
sched/rt: Do not submit new work when PI-blocked
sched/rt: Prevent idle task boosting
sched/wait: Add __wake_up_all_locked() API
sched/rt: Document scheduler related skip-resched-check sites
sched/rt: Use schedule_preempt_disabled()
sched/rt: Add schedule_preempt_disabled()
sched/rt: Do not throttle when PI boosting
sched/rt: Keep period timer ticking when rt throttling is active
...
This callback is called by the scheduler after rq->lock has been released
and interrupts enabled. It will be used in subsequent patches on the ARM
architecture.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Tested-by: Will Deacon <will.deacon@arm.com>
Tested-by: Marc Zyngier <Marc.Zyngier@arm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/20120313110840.7b444deb6b1bb902c15f3cdf@canb.auug.org.au
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Various people reported nohz load tracking still being wrecked, but Doug
spotted the actual problem. We fold the nohz remainder in too soon,
causing us to loose samples and under-account.
So instead of playing catch-up up-front, always do a single load-fold
with whatever state we encounter and only then fold the nohz remainder
and play catch-up.
Reported-by: Doug Smythies <dsmythies@telus.net>
Reported-by: LesÅ=82aw Kope=C4=87 <leslaw.kopec@nasza-klasa.pl>
Reported-by: Aman Gupta <aman@tmm1.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-4v31etnhgg9kwd6ocgx3rxl8@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>