sched: Provide knob to prefer mostly_idle over idle cpus

sysctl_sched_prefer_idle lets the scheduler bias selection of
idle cpus over mostly idle cpus for tasks. This knob could be
useful to control balance between power and performance.

Change-Id: Ide6eef684ef94ac8b9927f53c220ccf94976fe67
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Srivatsa Vaddagiri 2014-10-31 16:04:00 -07:00 committed by Steve Muckle
parent 75d1c94217
commit 6e778f0cdc
4 changed files with 118 additions and 196 deletions


@@ -641,192 +641,63 @@ select_best_cpu(), represents the heart of the HMP scheduling
algorithm described in this document.
The behavior of select_best_cpu() differs depending on whether the
task being placed is a small task or not.
task being placed is a small task or not and the value of the sched_prefer_idle
tunable.
--- Wakeup Logic for a Non-Small Task "p"
The following is evaluated for every online CPU i which task p may run on:
The order of CPU preference for a non-small task when sched_prefer_idle = 1 is
the following:
|
| task doesn't fit, but
| is this CPU a good
V fallback candidate?
+---------------+ +-------------+ +--------+
| does task fit |------------>| is CPU |----------->| ignore |
| on CPU | no | mostly idle | no | cpu |
+---------------+ +-------------+ +--------+
| |
| yes | yes
| | +--------------------------+
| --------->| load < min_fallback_load |
| +--------------------------+
| | |
| | yes | no
| V V
| +-----------------------+ +------------+
| | fallback_idle_cpu = i | | ignore cpu |
| task fits, prefer +-----------------------+ +------------+
| mostly idle CPUs | |
| or non-max capacity V V
| CPUs that won't hit next CPU next CPU
| spill threshold
V
+---------------------+ task does not meet load requirements
| CPU mostly idle || | no +------------+
| (!max_capacity && |---------->| ignore cpu |----> next CPU
| !(p causes spill)) | +------------+
+---------------------+
|
| yes
|
|
|
| is CPU in a lower power band
V than previously seen min cost CPU CPU in a lower power band
+---------------------+ than previously seen min,
| cost(p, i) is | yes +----------------------------+ override
| > band_limit % less |---------->| best_cpu = i | previously
| than current min | | min_cost = cost(p,i) | seen min_load
+---------------------+ | min_load = load(i) | CPU
| +----------------------------+
| no |
| ---------> next CPU
|
|
|
| does CPU have lower load than CPU has lower load than
V previously seen min_load CPU previously seen lowest load
+--------------------+ yes +-----------------+
| load(i) < min_load |------------------------->| best_cpu = i |
+--------------------+ | min_load = load |
| +-----------------+
| no |
| |
| if load is tied with lowest previously |
| seen lowest load, is power cost less |
V |
+------------------------+ |
| load(i) == min_load && | yes +--------------+ |
| cost(p, i) < min_cost |-------->| best_cpu = i | |
+------------------------+ +--------------+ |
| | |
| no | /
\_____________________________ | __________/
\ | /
| | |
V V V if power cost of this
+----------------------+ CPU is lower than
| cost(p,i) < min_cost | current min, update
+----------------------+ min_cost
| |
| yes | no
| ----------> next CPU
V
+----------------------+
| min_cost = cost(p,i) |-------> next CPU
+----------------------+
1. The shallowest-cstate idle CPU in the lowest-power cluster which can fit
the task. Where there is a tie of two CPUs with the same load, the CPU with
the lowest power cost is chosen.
Once this flow chart has been evaluated for every online CPU the task
may run on, if a "best_cpu" was found, it is returned. If a best_cpu
was not found but a fallback_idle_cpu was found, then the
fallback_idle_cpu is returned. Finally, if no best_cpu or
fallback_idle cpu was found, then the task's previous CPU is returned.
2. The least-loaded CPU the task is allowed to run on in the lowest power band
where the task will fit and where the placement will not result in cpu
exceeding spill level. When there is a tie of two CPUs at same load, the
CPU with the lowest power cost is chosen.
Phew! Fortunately, all of that can be summarized relatively easily. The
order of CPU preference for a non-small task is the following:
3. The least-loaded mostly idle CPU that the task is allowed to run on where
the task won't fit (since there was no CPU where the task would fit).
1. The least-loaded CPU the task is allowed to run on in the lowest
power band where the task will fit and where the placement will
not result in cpu exceeding spill level. When there is a tie of
two cpus at same load, their CPU with the lowest power cost is
chosen.
4. The CPU which the task last ran on.
2. The least-loaded mostly idle CPU that the task is allowed to run
on where the task won't fit (since there was no CPU where the
task would fit).
The order of CPU preference for a non-small task when sched_prefer_idle = 0
is the following:
3. The CPU which the task last ran on.
1. The least-loaded non-idle mostly idle CPU the task is allowed to run on in
the lowest power band where the task will fit. When there is a tie of two
CPUs at same load, the CPU with the lowest power cost is chosen.
2. The shallowest-cstate idle CPU in the lowest-power cluster which can fit
the task. Where there is a tie of two CPUs with the same load, the CPU with
the lowest power cost is chosen.
3. The least-loaded CPU the task is allowed to run on in the lowest power band
where the task will fit and where the placement will not result in the CPU
exceeding spill level. When there is a tie of two CPUs at the same load,
the CPU with the lowest power cost is chosen.
4. The least-loaded mostly idle CPU that the task is allowed to run on where
the task won't fit (since there was no CPU where the task would fit).
5. The CPU which the task last ran on.
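As an illustration only (it is not part of this patch), the final choice
between the best idle candidate and the best busy candidate under the two
sched_prefer_idle settings can be modeled in a few lines of standalone C. The
function name pick_wakeup_cpu() and its boolean parameter are hypothetical. In
the kernel, the mostly-idle test is a call to mostly_idle_cpu() and the logic
lives inside select_best_cpu().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the last step of CPU selection.
 *
 * busy_cpu:             best non-idle candidate, or -1 if none was found
 * busy_cpu_mostly_idle: whether that candidate counts as mostly idle
 * idle_cpu:             shallowest C-state idle candidate, or -1 if none
 * prefer_idle:          value of the sched_prefer_idle tunable (0 or 1)
 *
 * Returns the chosen CPU, or -1 when neither candidate exists (the caller
 * would then fall back to other choices, such as the previous CPU).
 */
static int pick_wakeup_cpu(int busy_cpu, bool busy_cpu_mostly_idle,
			   int idle_cpu, bool prefer_idle)
{
	/* A mostly idle candidate only counts if a busy candidate exists. */
	bool have_mostly_idle = busy_cpu >= 0 && busy_cpu_mostly_idle;

	assert(busy_cpu >= -1 && idle_cpu >= -1);

	/*
	 * Take the idle CPU when the tunable asks for it, or when there is
	 * no mostly idle busy CPU to pack the task onto.
	 */
	if (idle_cpu >= 0 && (prefer_idle || !have_mostly_idle))
		return idle_cpu;

	return busy_cpu;
}
```

With these assumptions, pick_wakeup_cpu(2, true, 5, false) returns 2 (the task
is packed onto the mostly idle CPU), while pick_wakeup_cpu(2, true, 5, true)
returns 5 (the idle CPU wins).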
--- Wakeup Logic for a Small Task "p"
The online CPUs the task is allowed to run on are scanned and the
lowest power CPU is found. This is marked as the min_cost_cpu.
If the minimum cost CPU is mostly idle but not idle, that CPU is
immediately chosen.
If the minimum cost CPU is idle or not mostly idle, then the following
will be evaluated for every online CPU i the task is allowed to run
on:
| is CPU i in higher power band is this CPU lower power than
V than min_cost_cpu? best fallback CPU seen
+---------------------+ +-----------------------+
| cost(p, i) is | yes | cost(p,i) < | no +--------+
| > band_limit % more |--------------->| min_fallback_cpu_cost |----->| ignore |
| than min_cost_cpu | +-----------------------+ | cpu |
+---------------------+ | +--------+
| | yes |
| no | V
| | next cpu
| is this CPU V
V idle +-----------------------------------+
+-----------------+ yes | best_fallback_cpu = i |
| cpu cstate > 0? |----------- | min_fallback_cpu_cost = cost(p,i) |
+-----------------+ | +-----------------------------------+
| | |
| no | \------> next CPU
| | is this CPU
| is this CPU | the shallowest
V mostly idle | idle CPU seen
+--------------+ +----------------------+ no +--------+
| cpu i | | cstate < min_cstate? |----->| ignore |
| mostly idle? | +----------------------+ | cpu |--> next cpu
+--------------+ | +--------+
| | | yes
| no | yes |
| | +--------------+ | +---------------------+
| \------>| return cpu i | ----->| min_cstate_cpu = i |
| +--------------+ | min_cstate = cstate |
| +---------------------+
| |
| will task not cross spill |
| threshold, and is this the |
V least loaded busy CPU we've seen |
+-------------------------+ \-----> next cpu
| !(p causes spill) && | no +--------+
| load(i) < min_busy_load |------>| ignore |---> next cpu
+-------------------------+ | cpu |
| +--------+
| yes
V
+----------------------+
| best_busy_cpu = i |
| min_busy_load = load |--------> next cpu
+----------------------+
Note that the process of evaluating the flow chart for every online
CPU the task may run on could be interrupted if a mostly idle CPU is
found in the lowest power band. Such a CPU will be selected
immediately by the algorithm. Otherwise, once the flow chart has been
evaluated for every online CPU the task is allowed to run on, a CPU is
selected from the candidates. If one or more idle CPUs exist in the
lowest power band then the one in the shallowest C-state is
returned. If not, then the least loaded CPU in the lowest power band
which would not exceed its spill threshold by accepting the task is
selected, assuming it exists. If none of the former possibilities
exist, the most power-efficient CPU outside the lowest power band is
selected.
Phew! But once again this can all be summarized. The order of CPU
preference for a small task is the following:
The order of CPU preference for a small task is the following:
1. The lowest-power CPU, if it is not idle but is mostly idle.
2. A non-idle CPU in the lowest power band which is mostly idle. The
first such CPU found is selected.
3. An idle CPU in the lowest power band that is in the least shallow
C-state.
4. The least busy CPU in the lowest power band where adding the task
will not result in exceeding the spill threshold.
2. A non-idle CPU in the lowest power band which is mostly idle. The first
such CPU found is selected.
3. An idle CPU in the lowest power band that is in the least shallow C-state.
4. The least busy CPU in the lowest power band where adding the task will not
result in exceeding the spill threshold.
5. The most power-efficient CPU outside of the lowest power band.
*** 5.3 Scheduler Tick
@@ -1369,6 +1240,16 @@ longer eligible to be seen as mostly idle. This will affect the task placement
logic described above, causing the scheduler to try and steer tasks away from
the CPU.
** 7.23 sched_prefer_idle
Appears at: /proc/sys/kernel/sched_prefer_idle
Default value: 1
Non-small tasks will prefer to wake up on idle CPUs if this tunable is set to 1.
If the tunable is set to 0, non-small tasks will prefer to wake up on mostly
idle CPUs which are not completely idle, increasing task packing behavior.
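A minimal userspace sketch for inspecting the tunable, assuming the kernel
exposes it at the path given above. The helper name read_uint_tunable() is
hypothetical, not an existing API. It only reads a single unsigned integer
from a file using standard C I/O.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: read one unsigned integer from a sysctl-style file.
 * Returns 0 on success and -1 if the file cannot be opened or parsed.
 */
static int read_uint_tunable(const char *path, unsigned int *out)
{
	FILE *f;
	int ok;

	assert(path != NULL && out != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;

	/* The file is expected to hold a single decimal value. */
	ok = (fscanf(f, "%u", out) == 1);
	fclose(f);

	return ok ? 0 : -1;
}
```

On a kernel carrying this patch, calling
read_uint_tunable("/proc/sys/kernel/sched_prefer_idle", &val) should yield 1
unless the default has been changed. On kernels without the tunable the call
returns -1 because the file does not exist.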
=========================
8. HMP SCHEDULER TRACE POINTS
=========================


@@ -61,6 +61,7 @@ extern unsigned int sysctl_sched_small_task_pct;
extern unsigned int sysctl_sched_upmigrate_pct;
extern unsigned int sysctl_sched_downmigrate_pct;
extern int sysctl_sched_upmigrate_min_nice;
extern unsigned int sysctl_sched_prefer_idle;
extern unsigned int sysctl_sched_powerband_limit_pct;
extern unsigned int sysctl_sched_boost;


@@ -1319,6 +1319,13 @@ unsigned int __read_mostly sysctl_sched_downmigrate_pct = 60;
*/
int __read_mostly sysctl_sched_upmigrate_min_nice = 15;
/*
* Tunable to govern scheduler wakeup placement CPU selection
* preference. If set, the scheduler chooses to wake up a task
* on an idle CPU.
*/
unsigned int __read_mostly sysctl_sched_prefer_idle = 1;
/*
* Scheduler boost is a mechanism to temporarily place tasks on CPUs
* with higher capacity than those where a task would have normally
@@ -1852,13 +1859,15 @@ static int select_packing_target(struct task_struct *p, int best_cpu)
/* return cheapest cpu that can fit this task */
static int select_best_cpu(struct task_struct *p, int target, int reason)
{
int i, best_cpu = -1, fallback_idle_cpu = -1;
int i, best_cpu = -1, fallback_idle_cpu = -1, min_cstate_cpu = -1;
int prev_cpu = task_cpu(p);
int cpu_cost, min_cost = INT_MAX;
int min_idle_cost = INT_MAX, min_busy_cost = INT_MAX;
u64 load, min_load = ULLONG_MAX, min_fallback_load = ULLONG_MAX;
int small_task = is_small_task(p);
int boost = sched_boost();
int cstate, min_cstate = INT_MAX;
int prefer_idle = reason ? 1 : sysctl_sched_prefer_idle;
trace_sched_task_load(p, small_task, boost, reason);
@@ -1911,43 +1920,67 @@ static int select_best_cpu(struct task_struct *p, int target, int reason)
* overrides load and C-state.
*/
if (power_delta_exceeded(cpu_cost, min_cost)) {
if (cpu_cost < min_cost) {
min_load = load;
min_cost = cpu_cost;
if (cpu_cost > min_cost)
continue;
min_cost = cpu_cost;
min_load = ULLONG_MAX;
min_cstate = INT_MAX;
min_cstate_cpu = -1;
best_cpu = -1;
}
/*
* Partition CPUs based on whether they are completely idle
* or not. For completely idle CPUs we choose the one in
* the lowest C-state and then break ties with power cost
*/
if (idle_cpu(i)) {
if (cstate > min_cstate)
continue;
if (cstate < min_cstate) {
min_idle_cost = cpu_cost;
min_cstate = cstate;
best_cpu = i;
min_cstate_cpu = i;
continue;
}
if (cpu_cost < min_idle_cost) {
min_idle_cost = cpu_cost;
min_cstate_cpu = i;
}
continue;
}
/* After power band, load is prioritized next. */
if (load < min_load) {
min_load = load;
min_cost = cpu_cost;
min_cstate = cstate;
best_cpu = i;
continue;
}
/*
* For CPUs that are not completely idle, pick one with the
* lowest load and break ties with power cost
*/
if (load > min_load)
continue;
/*
* The load is equal to the previous selected CPU.
* This will most often occur when deciding between
* idle CPUs. Power cost is prioritized after load,
* followed by cstate.
*/
if (cpu_cost < min_cost) {
min_cost = cpu_cost;
min_cstate = cstate;
if (load < min_load) {
min_load = load;
min_busy_cost = cpu_cost;
best_cpu = i;
continue;
}
if (cpu_cost == min_cost && cstate < min_cstate) {
min_cstate = cstate;
/*
* The load is equal to the previous selected CPU.
* This is rare but when it does happen opt for the
* more power efficient CPU option.
*/
if (cpu_cost < min_busy_cost) {
min_busy_cost = cpu_cost;
best_cpu = i;
}
}
if (min_cstate_cpu >= 0 && (prefer_idle ||
!(best_cpu >= 0 && mostly_idle_cpu(best_cpu))))
best_cpu = min_cstate_cpu;
done:
if (best_cpu < 0) {
if (unlikely(fallback_idle_cpu < 0))


@@ -409,6 +409,13 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
{
.procname = "sched_prefer_idle",
.data = &sysctl_sched_prefer_idle,
.maxlen = sizeof(unsigned int),
.mode = 0644,
.proc_handler = proc_dointvec,
},
{
.procname = "sched_init_task_load",
.data = &sysctl_sched_init_task_load_pct,