sched: Provide knob to prefer mostly_idle over idle cpus

sysctl_sched_prefer_idle lets the scheduler bias selection of idle cpus over
mostly idle cpus for tasks. This knob could be useful to control balance
between power and performance.

Change-Id: Ide6eef684ef94ac8b9927f53c220ccf94976fe67
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
parent 75d1c94217
commit 6e778f0cdc
@@ -641,192 +641,63 @@ select_best_cpu(), represents the heart of the HMP scheduling
algorithm described in this document.

The behavior of select_best_cpu() differs depending on whether the
task being placed is a small task or not.
task being placed is a small task or not and the value of the sched_prefer_idle
tunable.

--- Wakeup Logic a Non-Small Task "p"

The following is evaluated for every online CPU i which task p may run on:
The order of CPU preference for a non-small task when sched_prefer_idle = 1 is
the following:

|
| task doesn't fit, but
| is this CPU a good
V fallback candidate?
+---------------+ +-------------+ +--------+
| does task fit |------------>| is CPU |----------->| ignore |
| on CPU | no | mostly idle | no | cpu |
+---------------+ +-------------+ +--------+
| |
| yes | yes
| | +--------------------------+
| --------->| load < min_fallback_load |
| +--------------------------+
| | |
| | yes | no
| V V
| +-----------------------+ +------------+
| | fallback_idle_cpu = i | | ignore cpu |
| task fits, prefer +-----------------------+ +------------+
| mostly idle CPUs | |
| or non-max capacity V V
| CPUs that won't hit next CPU next CPU
| spill threshold
V
+---------------------+ task does not meet load requirements
| CPU mostly idle || | no +------------+
| (!max_capacity && |---------->| ignore cpu |----> next CPU
| !(p causes spill)) | +------------+
+---------------------+
|
| yes
|
|
|
| is CPU in a lower power band
V than previously seen min cost CPU CPU in a lower power band
+---------------------+ than previously seen min,
| cost(p, i) is | yes +----------------------------+ override
| > band_limit % less |---------->| best_cpu = i | previously
| than current min | | min_cost = cost(p,i) | seen min_load
+---------------------+ | min_load = load(i) | CPU
| +----------------------------+
| no |
| ---------> next CPU
|
|
|
| does CPU have lower load than CPU has lower load than
V previously seen min_load CPU previously seen lowest load
+--------------------+ yes +-----------------+
| load(i) < min_load |------------------------->| best_cpu = i |
+--------------------+ | min_load = load |
| +-----------------+
| no |
| |
| if load is tied with lowest previously |
| seen lowest load, is power cost less |
V |
+------------------------+ |
| load(i) == min_load && | yes +--------------+ |
| cost(p, i) < min_cost |-------->| best_cpu = i | |
+------------------------+ +--------------+ |
| | |
| no | /
\_____________________________ | __________/
\ | /
| | |
V V V if power cost of this
+----------------------+ CPU is lower than
| cost(p,i) < min_cost | current min, update
+----------------------+ min_cost
| |
| yes | no
| ----------> next CPU
V
+----------------------+
| min_cost = cost(p,i) |-------> next CPU
+----------------------+
1. The shallowest-cstate idle CPU in the lowest-power cluster which can fit
the task. Where there is a tie of two CPUs with the same load, the CPU with
the lowest power cost is chosen.

Once this flow chart has been evaluated for every online CPU the task
may run on, if a "best_cpu" was found, it is returned. If a best_cpu
was not found but a fallback_idle_cpu was found, then the
fallback_idle_cpu is returned. Finally, if no best_cpu or
fallback_idle cpu was found, then the task's previous CPU is returned.
2. The least-loaded CPU the task is allowed to run on in the lowest power band
where the task will fit and where the placement will not result in cpu
exceeding spill level. When there is a tie of two CPUs at same load, the
CPU with the lowest power cost is chosen.

Phew! Fortunately, all of that can be summarized relatively easily. The
order of CPU preference for a non-small task is the following:
3. The least-loaded mostly idle CPU that the task is allowed to run on where
the task won't fit (since there was no CPU where the task would fit).

1. The least-loaded CPU the task is allowed to run on in the lowest
power band where the task will fit and where the placement will
not result in cpu exceeding spill level. When there is a tie of
two cpus at same load, their CPU with the lowest power cost is
chosen.
4. The CPU which the task last ran on.

2. The least-loaded mostly idle CPU that the task is allowed to run
on where the task won't fit (since there was no CPU where the
task would fit).
The order of CPU preference for a non-small task when sched_prefer_idle = 0
is the following:

3. The CPU which the task last ran on.
1. The least-loaded non-idle mostly idle CPU the task is allowed to run on in
the lowest power band where the task will fit. When there is a tie of two
CPUs at same load, the CPU with the lowest power cost is chosen.

2. The shallowest-cstate idle CPU in the lowest-power cluster which can fit
the task. Where there is a tie of two CPUs with the same load, the CPU with
the lowest power cost is chosen.

3. The least-loaded CPU the task is allowed to run on in the lowest power band
where the task will fit and where the placement will not result in the CPU
exceeding spill level. When there is a tie of two CPUs at the same load,
the CPU with the lowest power cost is chosen.

4. The least-loaded mostly idle CPU that the task is allowed to run on where
the task won't fit (since there was no CPU where the task would fit).

5. The CPU which the task last ran on.

--- Wakeup Logic a Small Task "p"

The online CPUs the task is allowed to run on are scanned and the
lowest power CPU is found. This is marked as the min_cost_cpu.

If the minimum cost CPU is mostly idle but not idle, that CPU is
immediately chosen.

If the minimum cost CPU is idle or not mostly idle, then the following
will be evaluated for every online CPU i the task is allowed to run
on:
| is CPU i in higher power band is this CPU lower power than
V than min_cost_cpu? best fallback CPU seen
+---------------------+ +-----------------------+
| cost(p, i) is | yes | cost(p,i) < | no +--------+
| > band_limit % more |--------------->| min_fallback_cpu_cost |----->| ignore |
| than min_cost_cpu | +-----------------------+ | cpu |
+---------------------+ | +--------+
| | yes |
| no | V
| | next cpu
| is this CPU V
V idle +-----------------------------------+
+-----------------+ yes | best_fallback_cpu = i |
| cpu cstate > 0? |----------- | min_fallback_cpu_cost = cost(p,i) |
+-----------------+ | +-----------------------------------+
| | |
| no | \------> next CPU
| | is this CPU
| is this CPU | the shallowest
V mostly idle | idle CPU seen
+--------------+ +----------------------+ no +--------+
| cpu i | | cstate < min_cstate? |----->| ignore |
| mostly idle? | +----------------------+ | cpu |--> next cpu
+--------------+ | +--------+
| | | yes
| no | yes |
| | +--------------+ | +---------------------+
| \------>| return cpu i | ----->| min_cstate_cpu = i |
| +--------------+ | min_cstate = cstate |
| +---------------------+
| |
| will task not cross spill |
| threshold, and is this the |
V least loaded busy CPU we've seen |
+-------------------------+ \-----> next cpu
| !(p causes spill) && | no +--------+
| load(i) < min_busy_load |------>| ignore |---> next cpu
+-------------------------+ | cpu |
| +--------+
| yes
V
+----------------------+
| best_busy_cpu = i |
| min_busy_load = load |--------> next cpu
+----------------------+

Note that the process of evaluating the flow chart for every online
CPU the task may run on could be interrupted if a mostly idle CPU is
found in the lowest power band. Such a CPU will be selected
immediately by the algorithm. Otherwise, once the flow chart has been
evaluated for every online CPU the task is allowed to run on, a CPU is
selected from the candidates. If one or more idle CPUs exist in the
lowest power band then the one in the shallowest C-state is
returned. If not, then the least loaded CPU in the lowest power band
which would not exceed its spill threshold by accepting the task is
selected, assuming it exists. If none of the former possibilities
exist, the most power-efficient CPU outside the lowest power band is
selected.

Phew! But once again this can all be summarized. The order of CPU
preference for a small task is the following:
The order of CPU preference for a small task is the following:

1. The lowest-power CPU, if it is not idle but is mostly idle.
2. A non-idle CPU in the lowest power band which is mostly idle. The
first such CPU found is selected.
3. An idle CPU in the lowest power band that is in the least shallow
C-state.
4. The least busy CPU in the lowest power band where adding the task
will not result in exceeding the spill threshold.

2. A non-idle CPU in the lowest power band which is mostly idle. The first
such CPU found is selected.

3. An idle CPU in the lowest power band that is in the least shallow C-state.

4. The least busy CPU in the lowest power band where adding the task will not
result in exceeding the spill threshold.

5. The most power-efficient CPU outside of the lowest power band.

*** 5.3 Scheduler Tick
@@ -1369,6 +1240,16 @@ longer eligible to be seen as mostly idle. This will affect the task placement
logic described above, causing the scheduler to try and steer tasks away from
the CPU.

** 7.23 sched_prefer_idle

Appears at: /proc/sys/kernel/sched_prefer_idle

Default value: 1

Non-small tasks will prefer to wake up on idle CPUs if this tunable is set to 1.
If the tunable is set to 0, non-small tasks will prefer to wake up on mostly
idle CPUs which are not completely idle, increasing task packing behavior.

=========================
8. HMP SCHEDULER TRACE POINTS
=========================
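The effect of the tunable on the final choice can be illustrated with a small standalone model. This is only a sketch, not kernel code: choose_wakeup_cpu() and its parameters are hypothetical names, and it assumes the scan has already produced a best non-idle candidate and a best idle candidate, with -1 meaning "none found".

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not a kernel function). Decide between the best
 * non-idle candidate and the best idle candidate found by a CPU scan.
 * A CPU number of -1 means no such candidate exists.
 */
static int choose_wakeup_cpu(int best_busy_cpu, bool busy_cpu_mostly_idle,
			     int best_idle_cpu, bool prefer_idle)
{
	/* No idle candidate: the busy candidate (possibly -1) stands. */
	if (best_idle_cpu < 0)
		return best_busy_cpu;

	/* sched_prefer_idle = 1: an idle candidate always wins. */
	if (prefer_idle)
		return best_idle_cpu;

	/* sched_prefer_idle = 0: pack onto a mostly idle busy CPU if any. */
	if (best_busy_cpu >= 0 && busy_cpu_mostly_idle)
		return best_busy_cpu;

	/* Otherwise fall back to the idle candidate. */
	return best_idle_cpu;
}
```

With the tunable set to 0, the idle candidate is still used when the busy candidate is missing or is not mostly idle, which matches the ordering given in the documentation text.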
@@ -61,6 +61,7 @@ extern unsigned int sysctl_sched_small_task_pct;
extern unsigned int sysctl_sched_upmigrate_pct;
extern unsigned int sysctl_sched_downmigrate_pct;
extern int sysctl_sched_upmigrate_min_nice;
extern unsigned int sysctl_sched_prefer_idle;
extern unsigned int sysctl_sched_powerband_limit_pct;
extern unsigned int sysctl_sched_boost;
@@ -1319,6 +1319,13 @@ unsigned int __read_mostly sysctl_sched_downmigrate_pct = 60;
 */
int __read_mostly sysctl_sched_upmigrate_min_nice = 15;

/*
 * Tunable to govern scheduler wakeup placement CPU selection
 * preference. If set, the scheduler chooses to wake up a task
 * on an idle CPU.
 */
unsigned int __read_mostly sysctl_sched_prefer_idle = 1;

/*
 * Scheduler boost is a mechanism to temporarily place tasks on CPUs
 * with higher capacity than those where a task would have normally
@@ -1852,13 +1859,15 @@ static int select_packing_target(struct task_struct *p, int best_cpu)
/* return cheapest cpu that can fit this task */
static int select_best_cpu(struct task_struct *p, int target, int reason)
{
	int i, best_cpu = -1, fallback_idle_cpu = -1;
	int i, best_cpu = -1, fallback_idle_cpu = -1, min_cstate_cpu = -1;
	int prev_cpu = task_cpu(p);
	int cpu_cost, min_cost = INT_MAX;
	int min_idle_cost = INT_MAX, min_busy_cost = INT_MAX;
	u64 load, min_load = ULLONG_MAX, min_fallback_load = ULLONG_MAX;
	int small_task = is_small_task(p);
	int boost = sched_boost();
	int cstate, min_cstate = INT_MAX;
	int prefer_idle = reason ? 1 : sysctl_sched_prefer_idle;

	trace_sched_task_load(p, small_task, boost, reason);

@@ -1911,43 +1920,67 @@ static int select_best_cpu(struct task_struct *p, int target, int reason)
		 * overrides load and C-state.
		 */
		if (power_delta_exceeded(cpu_cost, min_cost)) {
			if (cpu_cost < min_cost) {
				min_load = load;
				min_cost = cpu_cost;
			if (cpu_cost > min_cost)
				continue;

			min_cost = cpu_cost;
			min_load = ULLONG_MAX;
			min_cstate = INT_MAX;
			min_cstate_cpu = -1;
			best_cpu = -1;
		}

		/*
		 * Partition CPUs based on whether they are completely idle
		 * or not. For completely idle CPUs we choose the one in
		 * the lowest C-state and then break ties with power cost
		 */
		if (idle_cpu(i)) {
			if (cstate > min_cstate)
				continue;

			if (cstate < min_cstate) {
				min_idle_cost = cpu_cost;
				min_cstate = cstate;
				best_cpu = i;
				min_cstate_cpu = i;
				continue;
			}

			if (cpu_cost < min_idle_cost) {
				min_idle_cost = cpu_cost;
				min_cstate_cpu = i;
			}
			continue;
		}

		/* After power band, load is prioritized next. */
		if (load < min_load) {
			min_load = load;
			min_cost = cpu_cost;
			min_cstate = cstate;
			best_cpu = i;
			continue;
		}
		/*
		 * For CPUs that are not completely idle, pick one with the
		 * lowest load and break ties with power cost
		 */
		if (load > min_load)
			continue;

		/*
		 * The load is equal to the previous selected CPU.
		 * This will most often occur when deciding between
		 * idle CPUs. Power cost is prioritized after load,
		 * followed by cstate.
		 */
		if (cpu_cost < min_cost) {
			min_cost = cpu_cost;
			min_cstate = cstate;
		if (load < min_load) {
			min_load = load;
			min_busy_cost = cpu_cost;
			best_cpu = i;
			continue;
		}
		if (cpu_cost == min_cost && cstate < min_cstate) {
			min_cstate = cstate;

		/*
		 * The load is equal to the previous selected CPU.
		 * This is rare but when it does happen opt for the
		 * more power efficient CPU option.
		 */
		if (cpu_cost < min_busy_cost) {
			min_busy_cost = cpu_cost;
			best_cpu = i;
		}
	}

	if (min_cstate_cpu >= 0 && (prefer_idle ||
			!(best_cpu >= 0 && mostly_idle_cpu(best_cpu))))
		best_cpu = min_cstate_cpu;
done:
	if (best_cpu < 0) {
		if (unlikely(fallback_idle_cpu < 0))
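The partitioning that the hunk above introduces (idle CPUs ranked by C-state then power cost, non-idle CPUs ranked by load then power cost) can be modeled outside the kernel. The following is a simplified sketch under stated assumptions: struct cpu_info, struct candidates and scan_cpus() are hypothetical names, and the model deliberately leaves out the task-fit, power-band and spill checks that the real loop performs before reaching this point.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU snapshot; field names are illustrative only. */
struct cpu_info {
	bool idle;               /* completely idle */
	int cstate;              /* lower value = shallower C-state */
	unsigned long long load; /* current load */
	int cost;                /* power cost of running the task here */
};

/* Result of a scan: index of each winner, or -1 if none was seen. */
struct candidates {
	int idle_cpu;
	int busy_cpu;
};

static struct candidates scan_cpus(const struct cpu_info *cpus, size_t n)
{
	struct candidates c = { -1, -1 };
	int min_cstate = INT_MAX, min_idle_cost = INT_MAX;
	int min_busy_cost = INT_MAX;
	unsigned long long min_load = ULLONG_MAX;

	for (size_t i = 0; i < n; i++) {
		const struct cpu_info *cpu = &cpus[i];

		if (cpu->idle) {
			/* Idle CPUs: shallowest C-state, then lowest cost. */
			if (cpu->cstate > min_cstate)
				continue;
			if (cpu->cstate < min_cstate ||
			    cpu->cost < min_idle_cost) {
				min_cstate = cpu->cstate;
				min_idle_cost = cpu->cost;
				c.idle_cpu = (int)i;
			}
			continue;
		}

		/* Non-idle CPUs: lowest load, then lowest cost. */
		if (cpu->load > min_load)
			continue;
		if (cpu->load < min_load || cpu->cost < min_busy_cost) {
			min_load = cpu->load;
			min_busy_cost = cpu->cost;
			c.busy_cpu = (int)i;
		}
	}
	return c;
}
```

Keeping the two winners separate until the end of the scan is what lets a single flag decide between them afterwards, instead of folding idle CPUs into the load comparison as the removed code did.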
@@ -409,6 +409,13 @@ static struct ctl_table kern_table[] = {
		.mode = 0644,
		.proc_handler = proc_dointvec,
	},
	{
		.procname = "sched_prefer_idle",
		.data = &sysctl_sched_prefer_idle,
		.maxlen = sizeof(unsigned int),
		.mode = 0644,
		.proc_handler = proc_dointvec,
	},
	{
		.procname = "sched_init_task_load",
		.data = &sysctl_sched_init_task_load_pct,
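Since the table entry above exposes the knob as a procfs file, it can be read and written from userspace like any other integer sysctl. The sketch below shows one way to do that in C. The helpers read_knob() and write_knob() are hypothetical and take the path as a parameter so they can be tried against an ordinary file; the path in the macro comes from the documentation hunk of this commit and only exists on a kernel carrying this patch.

```c
#include <stdio.h>

/* Path as given in the documentation hunk of this commit. */
#define SCHED_PREFER_IDLE_PATH "/proc/sys/kernel/sched_prefer_idle"

/* Hypothetical helper: read an unsigned integer from a sysctl-style file. */
static int read_knob(const char *path, unsigned int *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%u", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Hypothetical helper: write an unsigned integer to a sysctl-style file. */
static int write_knob(const char *path, unsigned int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = (fprintf(f, "%u\n", value) > 0);
	/* fclose() flushes; a failed flush means the write did not land. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A call such as write_knob(SCHED_PREFER_IDLE_PATH, 0) would request the packing behavior. Because the entry is registered with mode 0644, writing is expected to require root privileges.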