Commit Graph

502 Commits

Author SHA1 Message Date
Zev Weiss f6cec2a49d kernel/sysctl.c: add missing range check in do_proc_dointvec_minmax_conv
commit 8cf7630b29701d364f8df4a50e4f1f5e752b2778 upstream.

This bug has apparently existed since the introduction of this function
in the pre-git era (4500e91754d3 in Thomas Gleixner's history.git,
"[NET]: Add proc_dointvec_userhz_jiffies, use it for proper handling of
neighbour sysctls.").

As a minimal fix we can simply duplicate the corresponding check in
do_proc_dointvec_conv().

Link: http://lkml.kernel.org/r/20190207123426.9202-3-zev@bewilderbeest.net
Signed-off-by: Zev Weiss <zev@bewilderbeest.net>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 22:11:27 +02:00
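
The range check described in entry f6cec2a49d above can be sketched in userspace C as follows. This is a minimal illustration and not the kernel code: the struct `minmax_param` and the function `conv_int_minmax` are hypothetical names, and the sketch assumes the parsed input arrives as a sign flag plus an unsigned magnitude.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for the min/max parameter block of the handler. */
struct minmax_param {
    const int *min;   /* null pointer means "no lower bound" */
    const int *max;   /* null pointer means "no upper bound" */
};

/*
 * Convert a parsed magnitude plus sign flag into an int.
 * Returns 0 on success, or -EINVAL if the value does not fit in an int
 * or violates one of the optional bounds.
 */
static int conv_int_minmax(bool neg, unsigned long mag,
                           const struct minmax_param *p, int *out)
{
    long long val;

    assert(p && out);

    /* The range check: refuse magnitudes that cannot be stored in an int. */
    if (neg) {
        if (mag > (unsigned long)INT_MAX + 1)
            return -EINVAL;
        val = -(long long)mag;
    } else {
        if (mag > (unsigned long)INT_MAX)
            return -EINVAL;
        val = (long long)mag;
    }

    /* Only now compare against the configured bounds. */
    if ((p->min && val < *p->min) || (p->max && val > *p->max))
        return -EINVAL;

    *out = (int)val;
    return 0;
}
```

Without the first check, a magnitude just above INT_MAX would be narrowed to a different int before the bounds were consulted, so the bounds comparison would be made on the wrong value.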
Will Deacon 271a597570 kernel/sysctl.c: fix out-of-bounds access when setting file-max
commit 9002b21465fa4d829edfc94a5a441005cffaa972 upstream.

Commit 32a5ad9c2285 ("sysctl: handle overflow for file-max") hooked up
min/max values for the file-max sysctl parameter via the .extra1 and
.extra2 fields in the corresponding struct ctl_table entry.

Unfortunately, the minimum value points at the global 'zero' variable,
which is an int.  This results in a KASAN splat when accessed as a long
by proc_doulongvec_minmax on 64-bit architectures:

  | BUG: KASAN: global-out-of-bounds in __do_proc_doulongvec_minmax+0x5d8/0x6a0
  | Read of size 8 at addr ffff2000133d1c20 by task systemd/1
  |
  | CPU: 0 PID: 1 Comm: systemd Not tainted 5.1.0-rc3-00012-g40b114779944 #2
  | Hardware name: linux,dummy-virt (DT)
  | Call trace:
  |  dump_backtrace+0x0/0x228
  |  show_stack+0x14/0x20
  |  dump_stack+0xe8/0x124
  |  print_address_description+0x60/0x258
  |  kasan_report+0x140/0x1a0
  |  __asan_report_load8_noabort+0x18/0x20
  |  __do_proc_doulongvec_minmax+0x5d8/0x6a0
  |  proc_doulongvec_minmax+0x4c/0x78
  |  proc_sys_call_handler.isra.19+0x144/0x1d8
  |  proc_sys_write+0x34/0x58
  |  __vfs_write+0x54/0xe8
  |  vfs_write+0x124/0x3c0
  |  ksys_write+0xbc/0x168
  |  __arm64_sys_write+0x68/0x98
  |  el0_svc_common+0x100/0x258
  |  el0_svc_handler+0x48/0xc0
  |  el0_svc+0x8/0xc
  |
  | The buggy address belongs to the variable:
  |  zero+0x0/0x40
  |
  | Memory state around the buggy address:
  |  ffff2000133d1b00: 00 00 00 00 00 00 00 00 fa fa fa fa 04 fa fa fa
  |  ffff2000133d1b80: fa fa fa fa 04 fa fa fa fa fa fa fa 04 fa fa fa
  | >ffff2000133d1c00: fa fa fa fa 04 fa fa fa fa fa fa fa 00 00 00 00
  |                                ^
  |  ffff2000133d1c80: fa fa fa fa 00 fa fa fa fa fa fa fa 00 00 00 00
  |  ffff2000133d1d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Fix the splat by introducing an unsigned long 'zero_ul' and using that
instead.

Link: http://lkml.kernel.org/r/20190403153409.17307-1-will.deacon@arm.com
Fixes: 32a5ad9c2285 ("sysctl: handle overflow for file-max")
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Christian Brauner <christian@brauner.io>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-27 22:10:11 +02:00
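
The type mismatch behind entry 271a597570 above can be illustrated with a small sketch. It assumes a handler that receives its bounds through untyped pointers (as the .extra1 and .extra2 fields do) and reads them as unsigned long. The names `store_ulong_minmax` and `long_max` are hypothetical, and only `zero_ul` is taken from the commit message.

```c
#include <assert.h>
#include <limits.h>

/* Bounds declared with the same type the handler will read them as. */
static unsigned long zero_ul = 0;
static unsigned long long_max = LONG_MAX;

/*
 * Store val into *data if it lies within [*extra1, *extra2].
 * Both bounds are dereferenced as unsigned long, so each pointer must
 * refer to an object of at least that size. Pointing extra1 at an int
 * would read past the end of the int on LP64 targets.
 * Returns 1 if the value was stored, 0 if it was out of range (the old
 * value is kept and no error is reported).
 */
static int store_ulong_minmax(unsigned long val, const void *extra1,
                              const void *extra2, unsigned long *data)
{
    const unsigned long *min = extra1;
    const unsigned long *max = extra2;

    assert(data);
    if ((min && val < *min) || (max && val > *max))
        return 0;
    *data = val;
    return 1;
}
```

A caller would pass `&zero_ul` and `&long_max` as the two bounds, so that every access made by the handler stays inside the object it points at.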
Christian Brauner 40a828a92e sysctl: handle overflow for file-max
[ Upstream commit 32a5ad9c22852e6bd9e74bdec5934ef9d1480bc5 ]

Currently, when writing

  echo 18446744073709551616 > /proc/sys/fs/file-max

/proc/sys/fs/file-max will overflow and be set to 0.  That quickly
crashes the system.

This commit sets the max and min value for file-max.  The max value is
set to the largest value a long int can hold.  Any higher value cannot
currently be used as the percpu counters are long ints and not unsigned
integers.

Note that the file-max value is ultimately parsed via
__do_proc_doulongvec_minmax().  This function does not report an error
when min or max are exceeded, which means that if a value larger than a
long int can hold is written, userspace will not receive an error;
instead the old value will be kept.  There is an argument to be made that
this should be changed and that __do_proc_doulongvec_minmax() should
return an error when a dedicated min or max value is exceeded.  However,
this has the potential to break userspace, so let's defer this to an RFC
patch.

Link: http://lkml.kernel.org/r/20190107222700.15954-3-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Waiman Long <longman@redhat.com>
[christian@brauner.io: v4]
  Link: http://lkml.kernel.org/r/20190210203943.8227-3-christian@brauner.io
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-07-27 22:10:02 +02:00
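
The wrap to 0 mentioned in entry 40a828a92e above is plain modular arithmetic. The sketch below assumes a parser that accumulates decimal digits in a 64-bit unsigned integer without overflow detection. `parse_u64_wrapping` is a hypothetical helper written for this illustration, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical digit-accumulating parser with no overflow detection.
 * Unsigned arithmetic wraps modulo 2^64, so the result is the decimal
 * value of the string reduced modulo 2^64.
 */
static uint64_t parse_u64_wrapping(const char *s)
{
    uint64_t acc = 0;

    assert(s);
    for (; *s >= '0' && *s <= '9'; s++)
        acc = acc * 10 + (uint64_t)(*s - '0');
    return acc;
}
```

Since 18446744073709551616 is exactly 2^64, feeding that string to such a parser yields 0, while 18446744073709551615 still yields the largest representable value.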
Eric Biggers a02b779d53 pipe: reject F_SETPIPE_SZ with size over UINT_MAX
commit 96e99be40e4cff870a83233731121ec0f7f95075 upstream.

A pipe's size is represented as an 'unsigned int'.  As expected, writing a
value greater than UINT_MAX to /proc/sys/fs/pipe-max-size fails with
EINVAL.  However, the F_SETPIPE_SZ fcntl silently truncates such values to
32 bits, rather than failing with EINVAL as expected.  (It *does* fail
with EINVAL for values above (1 << 31) but <= UINT_MAX.)

Fix this by moving the check against UINT_MAX into round_pipe_size() which
is called in both cases.

Link: http://lkml.kernel.org/r/20180111052902.14409-6-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:46 +02:00
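
The idea in entry a02b779d53 above, checking against UINT_MAX inside the rounding helper so every caller gets the same behaviour, can be sketched as follows. This is not the kernel's round_pipe_size(): the function name, the assumed 4096-byte page size, the assumption of a 32-bit unsigned int, and the convention of returning 0 for "invalid" are all choices made for this illustration.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_PAGE_SIZE 4096ULL   /* assumed page size */

/*
 * Round a requested size up to a power of two, with a floor of one page.
 * Returns 0 if the request or the rounded result does not fit in an
 * unsigned int. Assumes unsigned int is 32 bits wide.
 */
static unsigned int round_pipe_size_sketch(unsigned long long size)
{
    unsigned long long rounded = SKETCH_PAGE_SIZE;

    /* Reject oversized requests before any narrowing can happen. */
    if (size > UINT_MAX)
        return 0;

    while (rounded < size)
        rounded <<= 1;

    /* size <= UINT_MAX, so rounding can reach at most 2^32. */
    assert(rounded <= (1ULL << 32));
    if (rounded > UINT_MAX)
        return 0;

    return (unsigned int)rounded;
}
```

For comparison, narrowing first would turn a request of UINT_MAX + 1 + 4096 into 4096, which is the silent truncation the commit message describes.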
Eric Biggers f32040ce1c pipe, sysctl: remove pipe_proc_fn()
commit 319e0a21bb7823abbb4818fe2724e572bbac77a2 upstream.

pipe_proc_fn() is no longer needed, as it only calls through to
proc_dopipe_max_size().  Just put proc_dopipe_max_size() in the ctl_table
entry directly, and remove the unneeded EXPORT_SYMBOL() and the ENOSYS
stub for it.

(The reason the ENOSYS stub isn't needed is that the pipe-max-size
ctl_table entry is located directly in 'kern_table' rather than being
registered separately.  Therefore, the entry is already only defined when
the kernel is built with sysctl support.)

Link: http://lkml.kernel.org/r/20180111052902.14409-3-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:45 +02:00
Eric Biggers bae0262aef pipe, sysctl: drop 'min' parameter from pipe-max-size converter
commit 4c2e4befb3cc9ce42d506aa537c9ab504723e98c upstream.

Patch series "pipe: buffer limits fixes and cleanups", v2.

This series simplifies the sysctl handler for pipe-max-size and fixes
another set of bugs related to the pipe buffer limits:

- The root user wasn't allowed to exceed the limits when creating new
  pipes.

- There was an off-by-one error when checking the limits, so a limit of
  N was actually treated as N - 1.

- F_SETPIPE_SZ accepted values over UINT_MAX.

- Reading the pipe buffer limits could be racy.

This patch (of 7):

Before validating the given value against pipe_min_size,
do_proc_dopipe_max_size_conv() calls round_pipe_size(), which rounds the
value up to pipe_min_size.  Therefore, the second check against
pipe_min_size is redundant.  Remove it.

Link: http://lkml.kernel.org/r/20180111052902.14409-2-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:44 +02:00
Joe Lawrence 63e56f81d2 sysctl: check for UINT_MAX before unsigned int min/max
commit fb910c42ccebf853c29296185c45c11164a56098 upstream.

Mikulas noticed that the existing do_proc_douintvec_minmax_conv() and the
do_proc_dopipe_max_size_conv() introduced in this patchset handle
overflow and min/max range inputs inconsistently.

For example:

  0 ... param->min - 1 ---> ERANGE
  param->min ... param->max ---> the value is accepted
  param->max + 1 ... 0x100000000L + param->min - 1 ---> ERANGE
  0x100000000L + param->min ... 0x100000000L + param->max ---> EINVAL
  0x100000000L + param->max + 1 ... 0x200000000L + param->min - 1 ---> ERANGE
  0x200000000L + param->min ... 0x200000000L + param->max ---> EINVAL
  0x200000000L + param->max + 1 ... 0x300000000L + param->min - 1 ---> ERANGE

In do_proc_do*() routines which store values into unsigned int variables
(4 bytes wide for 64-bit builds), first validate that the input unsigned
long value (8 bytes wide for 64-bit builds) will fit inside the smaller
unsigned int variable.  Then check that the unsigned int value falls
inside the specified parameter min, max range.  Otherwise the unsigned
long -> unsigned int conversion drops leading bits from the input value,
leading to the inconsistent pattern Mikulas documented above.

Link: http://lkml.kernel.org/r/1507658689-11669-5-git-send-email-joe.lawrence@redhat.com
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.16:
 - Drop changes in do_proc_douintvec_minmax_conv()
 - Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:44 +02:00
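
The ordering described in entry 63e56f81d2 above (first check that the wide input fits the narrow destination, then apply min/max) can be sketched like this. The struct `uint_range` and the function `conv_uint_minmax` are hypothetical, the choice of -EINVAL for overflow and -ERANGE for a bounds violation is an assumption made to match the error names in the table, and a 32-bit unsigned int is assumed.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/* Hypothetical inclusive bounds for an unsigned int parameter. */
struct uint_range {
    unsigned int min;
    unsigned int max;
};

/*
 * Validate a wide input and store it as an unsigned int.
 * Step 1: reject anything that would lose bits when narrowed.
 * Step 2: range-check the value that will actually be stored.
 */
static int conv_uint_minmax(unsigned long long in,
                            const struct uint_range *r, unsigned int *out)
{
    unsigned int val;

    assert(r && out);
    if (in > UINT_MAX)
        return -EINVAL;

    val = (unsigned int)in;   /* lossless after the check above */
    if (val < r->min || val > r->max)
        return -ERANGE;

    *out = val;
    return 0;
}
```

With bounds of 16 and 1024, an input of 0x100000000 + 100 is rejected by step 1. If the narrowing happened first, the same input would become 100 and look like an in-range value.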
Joe Lawrence 7eecc8b01c pipe: add proc_dopipe_max_size() to safely assign pipe_max_size
commit 7a8d181949fb2c16be00f8cdb354794a30e46b39 upstream.

pipe_max_size is assigned directly via procfs sysctl:

  static struct ctl_table fs_table[] = {
          ...
          {
                  .procname       = "pipe-max-size",
                  .data           = &pipe_max_size,
                  .maxlen         = sizeof(int),
                  .mode           = 0644,
                  .proc_handler   = &pipe_proc_fn,
                  .extra1         = &pipe_min_size,
          },
          ...

  int pipe_proc_fn(struct ctl_table *table, int write, void __user *buf,
                   size_t *lenp, loff_t *ppos)
  {
          ...
          ret = proc_dointvec_minmax(table, write, buf, lenp, ppos);
          ...

and then rounded in-place a few statements later:

          ...
          pipe_max_size = round_pipe_size(pipe_max_size);
          ...

This leaves a window of time between initial assignment and rounding
that may be visible to other threads.  (For example, one thread sets a
non-rounded value to pipe_max_size while another reads its value.)

Similar reads of pipe_max_size are potentially racy:

  pipe.c :: alloc_pipe_info()
  pipe.c :: pipe_set_size()

Add a new proc_dopipe_max_size() that consolidates reading the new value
from the user buffer, verifying bounds, and calling round_pipe_size()
with a single assignment to pipe_max_size.

Change-Id: I635ecef3cdbbc0edf5158bf92bd4c17eab390080
Link: http://lkml.kernel.org/r/1507658689-11669-4-git-send-email-joe.lawrence@redhat.com
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.16: Continue using int sysctl functions because we don't
 have proper unsigned int support]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:44 +02:00
Ethan Zhao 7fa1f79396 sched/sysctl: Check user input value of sysctl_sched_time_avg
commit 5ccba44ba118a5000cccc50076b0344632459779 upstream.

The system will hang if the user sets sysctl_sched_time_avg to 0:

  [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0

  Stack traceback for pid 0
  0xffff883f6406c600 0 0 1 3 R 0xffff883f6406cf50 *swapper/3
  ffff883f7ccc3ae8 0000000000000018 ffffffff810c4dd0 0000000000000000
  0000000000017800 ffff883f7ccc3d78 0000000000000003 ffff883f7ccc3bf8
  ffffffff810c4fc9 ffff883f7ccc3c08 00000000810c5043 ffff883f7ccc3c08
  Call Trace:
  <IRQ> [<ffffffff810c4dd0>] ? update_group_capacity+0x110/0x200
  [<ffffffff810c4fc9>] ? update_sd_lb_stats+0x109/0x600
  [<ffffffff810c5507>] ? find_busiest_group+0x47/0x530
  [<ffffffff810c5b84>] ? load_balance+0x194/0x900
  [<ffffffff810ad5ca>] ? update_rq_clock.part.83+0x1a/0xe0
  [<ffffffff810c6d42>] ? rebalance_domains+0x152/0x290
  [<ffffffff810c6f5c>] ? run_rebalance_domains+0xdc/0x1d0
  [<ffffffff8108a75b>] ? __do_softirq+0xfb/0x320
  [<ffffffff8108ac85>] ? irq_exit+0x125/0x130
  [<ffffffff810b3a17>] ? scheduler_ipi+0x97/0x160
  [<ffffffff81052709>] ? smp_reschedule_interrupt+0x29/0x30
  [<ffffffff8173a1be>] ? reschedule_interrupt+0x6e/0x80
   <EOI> [<ffffffff815bc83c>] ? cpuidle_enter_state+0xcc/0x230
  [<ffffffff815bc80c>] ? cpuidle_enter_state+0x9c/0x230
  [<ffffffff815bc9d7>] ? cpuidle_enter+0x17/0x20
  [<ffffffff810cd6dc>] ? cpu_startup_entry+0x38c/0x420
  [<ffffffff81053373>] ? start_secondary+0x173/0x1e0

This is because a divide-by-zero error happens in scale_rt_capacity():

update_group_capacity()
  update_cpu_capacity()
    scale_rt_capacity()
     {
          ...
          total = sched_avg_period() + delta;
          used = div_u64(avg, total);
          ...
     }

To fix this issue, check user input value of sysctl_sched_time_avg, keep
it unchanged when hitting invalid input, and set the minimum limit of
sysctl_sched_time_avg to 1 ms.

Reported-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: ethan.kernel@gmail.com
Cc: keescook@chromium.org
Cc: mcgrof@kernel.org
Link: http://lkml.kernel.org/r/1504504774-18253-1-git-send-email-ethan.zhao@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:45:34 +02:00
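
The fix in entry 7fa1f79396 above amounts to refusing values below 1 ms and leaving the stored setting unchanged. A minimal userspace sketch follows, with hypothetical names throughout. It assumes the averaging period is half the configured window expressed in nanoseconds and that the default setting is 1000 ms.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_MSEC 1000000ULL

/* Hypothetical stand-in for the tunable, in milliseconds (assumed default). */
static unsigned int time_avg_ms = 1000;

/* Accept only values of at least 1 ms; on rejection keep the old value. */
static int set_time_avg_ms(long requested)
{
    if (requested < 1 || requested > INT_MAX)
        return -EINVAL;
    time_avg_ms = (unsigned int)requested;
    return 0;
}

/* Averaging period in nanoseconds (assumed to be half the window). */
static uint64_t avg_period_ns(void)
{
    return (uint64_t)time_avg_ms * SKETCH_NSEC_PER_MSEC / 2;
}

/* The division that would fault if the period and delta were both 0. */
static uint64_t used_ratio(uint64_t avg, uint64_t delta)
{
    uint64_t total = avg_period_ns() + delta;

    assert(total != 0);   /* holds for every setting the setter accepts */
    return avg / total;
}
```

Because the smallest accepted setting is 1 ms, the period is never smaller than 500000 ns, and the divisor cannot be zero even when delta is 0.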
Eric Dumazet 022fbaee3b sysctl: fix proc_doulongvec_ms_jiffies_minmax()
commit ff9f8a7cf935468a94d9927c68b00daae701667e upstream.

We perform the conversion between kernel jiffies and ms only when
exporting kernel value to user space.

We need to do the opposite operation when value is written by user.

This only matters when HZ != 1000.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2019-07-27 21:43:49 +02:00
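
The symmetry that entry 022fbaee3b above restores (scale one way when reading, the opposite way when writing) can be sketched as two helpers. The tick rate of 250 is an assumed example value, and both function names are hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_HZ 250UL   /* assumed tick rate, chosen so that HZ != 1000 */

/* Read path: kernel value in jiffies, exported to userspace in ms. */
static unsigned long jiffies_to_ms_sketch(unsigned long j)
{
    assert(j <= ULONG_MAX / 1000UL);        /* keep the multiply in range */
    return j * 1000UL / SKETCH_HZ;
}

/* Write path: userspace value in ms, stored in the kernel as jiffies. */
static unsigned long ms_to_jiffies_sketch(unsigned long ms)
{
    assert(ms <= ULONG_MAX / SKETCH_HZ);    /* keep the multiply in range */
    return ms * SKETCH_HZ / 1000UL;
}
```

With a tick rate of 250, writing 1000 ms stores 250 jiffies and reading it back reports 1000 ms again. If the write path skipped the conversion, the stored value would be 1000 jiffies and a later read would report 4000 ms.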
LuK1337 fc9499e55a Import latest Samsung release
* Package version: T713XXU2BQCO

Change-Id: I293d9e7f2df458c512d59b7a06f8ca6add610c99
2017-04-18 03:43:52 +02:00
dcashman d2de0d753c FROMLIST: mm: mmap: Add new /proc tunable for mmap_base ASLR.
(cherry picked from commit https://lkml.org/lkml/2015/12/21/337)

ASLR uses as few as 8 bits to generate the random offset for the
mmap base address on 32 bit architectures. This value was chosen to
prevent a poorly chosen value from dividing the address space in such
a way as to prevent large allocations. This may not be an issue on all
platforms. Allow the specification of a minimum number of bits so that
platforms desiring greater ASLR protection may determine where to place
the trade-off.

Bug: 24047224
Signed-off-by: Daniel Cashman <dcashman@android.com>
Signed-off-by: Daniel Cashman <dcashman@google.com>
Change-Id: I66ac01c6f4f2c8dcfc84d1f1e99490b8385b3ed4
2016-05-18 14:36:00 +05:30
dcashman e83a83bed5 Revert "mm: mmap: Add new /proc tunable for mmap_base ASLR."
This reverts commit 3d269ec1afe3ac9508ebf53113d877e689fbb888.

Bug: 25973686
Signed-off-by: Daniel Cashman <dcashman@google.com>
Change-Id: Iadc67de7359d8e861030a69520776342c15aa70b

Conflicts:
	kernel/sysctl.c
2016-05-18 14:36:00 +05:30
dcashman d92221e8a8 mm: mmap: Add new /proc tunable for mmap_base ASLR.
ASLR currently only uses 8 bits to generate the random offset for the
mmap base address on 32 bit architectures. This value was chosen to
prevent a poorly chosen value from dividing the address space in such
a way as to prevent large allocations. This may not be an issue on all
platforms. Allow the specification of a minimum number of bits so that
platforms desiring greater ASLR protection may determine where to place
the trade-off.

BUG=24047224
Signed-off-by: Daniel Cashman <dcashman@google.com>
Change-Id: I54556fecca4bd2ee2b80be7f305769363c32bb9a

Conflicts:
	kernel/sysctl.c
2016-05-18 14:34:39 +05:30
Srivatsa Vaddagiri a330f3d5bc sched: colocate related threads
Provide userspace interface for tasks to be grouped together as
"related" threads. For example, all threads involved in updating
display buffer could be tagged as related.

Scheduler will attempt to provide special treatment for a group of
related threads such as:

1) Colocation of related threads in same "preferred" cluster
2) Aggregation of demand towards determination of cluster frequency

This patch extends scheduler to provide best-effort colocation support
for a group of related threads.

Change-Id: Ic2cd769faf5da4d03a8f3cb0ada6224d0101a5f5
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2015-11-25 21:43:34 -08:00
Srivatsa Vaddagiri 241c4c9ed5 timer: Queue timers on least power cpu
There is potential power benefit by offloading timer activity to cpus
of lesser power cost (power cluster). Both high-res and low-res timers
that are not pinned to one cpu are now enqueued on first online CPU
found in least shallow C-state in power-cluster.

CRs-Fixed: 764251
Change-Id: I2cea26c76972b566dfbfed084e377811a8784172
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2015-05-29 15:59:20 +05:30
Joonwoo Park 4251f58faa sched: check HMP scheduler tunables validity
Check tunables validity to take valid values only.

CRs-fixed: 812443
Change-Id: Ibb9ec0d6946247068174ab7abe775a6389412d5b
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2015-03-31 20:52:13 -07:00
Joonwoo Park 32851d8550 sched: add scheduling latency tracking procfs node
Add a new procfs node /proc/sys/kernel/sched_max_latency_us to track the
worst scheduling latency.  It provides an easier way to identify maximum
scheduling latency seen across the CPUs.

Change-Id: I6e435bbf825c0a4dff2eded4a1256fb93f108d0e
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2015-03-03 09:56:02 -08:00
Joonwoo Park 4653e0549d sched: warn/panic upon excessive scheduling latency
Add new tunables /proc/sys/kernel/sched_latency_warn_threshold_us and
/proc/sys/kernel/sched_latency_panic_threshold_us to warn or panic when
tasks are runnable but not scheduled for more than the configured time.

This helps to find out unacceptably high scheduling latency more easily.

Change-Id: If077aba6211062cf26ee289970c5abcd1c218c82
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2015-03-03 09:55:42 -08:00
Srivatsa Vaddagiri 930bca74d2 sched: Remove sched_wake_to_idle for HMP scheduler
sched_wake_to_idle tunable is obsoleted by sched_prefer_idle tunable
in HMP scheduler. Remove the same when CONFIG_SCHED_HMP is defined

Change-Id: I7bcf12cc3c50df5ef09261f097711c9f29ec63a4
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2015-01-28 14:22:12 +05:30
Linux Build Service Account 380cadc7f3 Merge "sched: Per-cpu prefer_idle flag" 2014-12-29 17:31:47 -08:00
Srivatsa Vaddagiri 599bfc7503 sched: Per-cpu prefer_idle flag
Remove the global sysctl_sched_prefer_idle flag and replace it with a
per-cpu prefer_idle flag. The per-cpu flag is expected to be the same for all
cpus in a cluster. It thus provides convenient means to disable
packing in one cluster while allowing packing in another cluster.

Change-Id: Ie4cc73bb1a55b4eac5697be38e558546161faca1
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-12-23 09:52:43 +05:30
Olav Haugan 7e13b27b8b sched: Add sysctl to enable power aware scheduling
Add sysctl to enable energy awareness at runtime. This is useful for
performance/power tuning/measurements and debugging. In addition this
will match up with the Documentation/scheduler/sched-hmp.txt documentation.

Change-Id: I0a9185498640d66917b38bf5d55f6c59fc60ad5c
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
2014-12-22 14:37:33 -08:00
Joonwoo Park 2cec55a2e2 sched: Prevent race conditions where upmigrate_min_nice changes
When upmigrate_min_nice is changed, dec_nr_big_small_task() can trigger
BUG_ON(rq->nr_big_tasks < 0).  This happens when a task was considered a
non-big task because its nice > upmigrate_min_nice, and upmigrate_min_nice
is later changed to a higher value so that the task becomes a big task.
In this case the runqueue still incorrectly has nr_big_tasks = 0 with the
current implementation.  Consequently the next scheduler tick sees a big
task to schedule and tries to decrease nr_big_tasks, which is already 0.

Introduce sched_upmigrate_min_nice, which is updated atomically, and
re-count the number of big and small tasks to fix the BUG_ON() triggering.

Change-Id: I6f5fc62ed22bbe5c52ec71613082a6e64f406e58
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2014-12-16 11:05:22 -08:00
Olav Haugan 4beca1fd4d sched: Avoid frequent task migration due to EA in lb
A new tunable exists that allows task migration to be throttled when the
scheduler tries to do task migrations due to Energy Awareness (EA). This
tunable is only taken into account when migrations occur in the tick
path. Extend the usage of the tunable to take into account the load
balancer (lb) path also.

In addition ensure that the start of task execution on a CPU is updated
correctly. If a task is preempted but still runnable on the same CPU the
start of execution should not be updated. Only update the start of
execution when a task wakes up after sleep or moves to a new CPU.

Change-Id: I6b2a8e06d8d2df8e0f9f62b7aba3b4ee4b2c1c4d
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
2014-12-13 06:43:49 -08:00
Srivatsa Vaddagiri 32c6ac7c62 sched: Avoid frequent migration of running task
Power values for cpus can drop quite considerably when they go idle.
As a result, the best choice for running a single task in a cluster
can vary quite rapidly. As the task keeps hopping cpus, other cpus go
idle and start being seen as a more favorable target for running a task,
leading to the task migrating almost every scheduler tick!

Prevent this by keeping track of when a task started running on a cpu
and allowing task migration in tick path (migration_needed()) on
account of energy efficiency reasons only if the task has run
sufficiently long (as determined by sysctl_sched_min_runtime
variable).

Note that currently sysctl_sched_min_runtime setting is considered
only in scheduler_tick()->migration_needed() path and not in
idle_balance() path. In other words, a task could be migrated to
another cpu which did an idle_balance(). This limitation should not
affect high-frequency migrations seen typically (when a single
high-demand task runs on high-performance cpu).

CRs-Fixed: 756570
Change-Id: I96413b7a81b623193c3bbcec6f3fa9dfec367d99
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-12-13 06:34:55 -08:00
Srivatsa Vaddagiri 6e778f0cdc sched: Provide knob to prefer mostly_idle over idle cpus
sysctl_sched_prefer_idle lets the scheduler bias selection of
idle cpus over mostly idle cpus for tasks. This knob could be
useful to control balance between power and performance.

Change-Id: Ide6eef684ef94ac8b9927f53c220ccf94976fe67
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-12-10 23:53:54 -08:00
Steve Muckle 75d1c94217 sched: make sched_cpu_high_irqload a runtime tunable
It may be desirable to be able to alter the sched_cpu_high_irqload
setting easily, so make it a runtime tunable value.

Change-Id: I832030eec2aafa101f0f435a4fd2d401d447880d
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2014-12-10 23:53:53 -08:00
Srivatsa Vaddagiri ed7d7749e9 sched: per-cpu mostly_idle threshold
sched_mostly_idle_load and sched_mostly_idle_nr_run knobs help pack
tasks on cpus to some extent. In some cases, it may be desirable to
have different packing limits for different cpus. For example, pack to
a higher limit on high-performance cpus compared to power-efficient
cpus.

This patch removes the global mostly_idle tunables and makes them
per-cpu, thus letting task packing behavior be controlled in a
fine-grained manner.

Change-Id: Ifc254cda34b928eae9d6c342ce4c0f64e531e6c2
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-11-06 15:27:00 +05:30
Srivatsa Vaddagiri f3386c7cfb sched: update governor notification logic
Make the criteria for notifying the governor per-cpu. Governor is
notified of any large change in cpu's busy time statistics
(rq->prev_runnable_sum) since the last reported value.

Change-Id: I727354d994d909b166d093b94d3dade7c7dddc0d
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-10-15 14:57:18 -07:00
Srivatsa Vaddagiri 19b3f3f871 sched: Use absolute scale for notifying governor
Make the tunables used for deciding the need for notification to be on
absolute scale. The earlier scale (in percent terms relative to
cur_freq) does not work well with available range of frequencies. For
example, 100% tunable value would work well for lower range of
frequencies and not for higher range. Having the tunable to be on
absolute scale makes tuning more realistic.

Change-Id: I35a8c4e2f2e9da57f4ca4462072276d06ad386f1
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-10-03 14:03:56 -07:00
Srivatsa Vaddagiri 2568673dd6 sched: window-stats: Enhance cpu busy time accounting
rq->curr/prev_runnable_sum counters represent cpu demand from various
tasks that have run on a cpu. Any task that runs on a cpu will have a
representation in rq->curr_runnable_sum. Their partial_demand value
will be included in rq->curr_runnable_sum. Since partial_demand is
derived from historical load samples for a task, rq->curr_runnable_sum
could represent "inflated/un-realistic" cpu usage. As an example, let's
say that a task with partial_demand of 10ms runs for only 1ms on a cpu.
What is included in rq->curr_runnable_sum is 10ms (and not the actual
execution time of 1ms). This leads to cpu busy time being reported on
the upside causing frequency to stay higher than necessary.

This patch fixes cpu busy accounting scheme to strictly represent
actual usage. It also provides for conditional fixup of busy time upon
migration and upon heavy-task wakeup.

CRs-Fixed: 691443
Change-Id: Ic4092627668053934049af4dfef65d9b6b901e6b
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-10-03 14:03:51 -07:00
Srivatsa Vaddagiri 86df733742 sched: improve logic for alerting governor
Currently we send notification to governor not taking note of cpus
that are synchronized with regard to their frequency. As a result,
scheduler could send pointless notifications (notification spam!).

Avoid this by considering synchronized cpus and alerting governor only
when the highest demand of any cpu within cluster far exceeds or falls
behind current frequency.

Change-Id: I74908b5a212404ca56b38eb94548f9b1fbcca33d
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-10-03 13:46:18 -07:00
Linux Build Service Account 5b1594145e Merge "sched: window-stats: legacy mode" 2014-08-24 20:01:39 -07:00
Linux Build Service Account 273f377789 Merge "sched: window-stats: Code cleanup" 2014-08-24 20:01:37 -07:00
Srivatsa Vaddagiri 85ed6be992 sched: window-stats: legacy mode
Support legacy mode, which results in busy time being seen by governor
that is close to what it would have seen via existing APIs, i.e.
get_cpu_idle_time_us(), get_cpu_iowait_time_us() and
get_cpu_idle_time_jiffy(). In particular, legacy mode means that only
task execution time is counted in rq->curr_runnable_sum and
rq->prev_runnable_sum. Also task migration does not result in
adjustment of those counters.

Change-Id: If374ccc084aa73f77374b6b3ab4cd0a4ca7b8c90
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-22 14:43:14 -07:00
Srivatsa Vaddagiri dafe791457 sched: window-stats: Code cleanup
Remove code duplication associated with update of various window-stats
related sysctl tunables

Change-Id: I64e29ac065172464ba371a03758937999c42a71f
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-22 14:43:12 -07:00
Ian Maund 6440f462f9 Merge upstream tag 'v3.10.49' into msm-3.10
* commit 'v3.10.49': (529 commits)
  Linux 3.10.49
  ACPI / battery: Retry to get battery information if failed during probing
  x86, ioremap: Speed up check for RAM pages
  Score: Modify the Makefile of Score, remove -mlong-calls for compiling
  Score: The commit is for compiling successfully.
  Score: Implement the function csum_ipv6_magic
  score: normalize global variables exported by vmlinux.lds
  rtmutex: Plug slow unlock race
  rtmutex: Handle deadlock detection smarter
  rtmutex: Detect changes in the pi lock chain
  rtmutex: Fix deadlock detector for real
  ring-buffer: Check if buffer exists before polling
  drm/radeon: stop poisoning the GART TLB
  drm/radeon: fix typo in golden register setup on evergreen
  ext4: disable synchronous transaction batching if max_batch_time==0
  ext4: clarify error count warning messages
  ext4: fix unjournalled bg descriptor while initializing inode bitmap
  dm io: fix a race condition in the wake up code for sync_io
  Drivers: hv: vmbus: Fix a bug in the channel callback dispatch code
  clk: spear3xx: Use proper control register offset
  ...

In addition to bringing in upstream commits, this merge also makes minor
changes to maintain compatibility with upstream:

The definition of list_next_entry in qcrypto.c and ipa_dp.c has been
removed, as upstream has moved the definition to list.h. The implementation
of list_next_entry was identical between the two.

irq.c, for both arm and arm64 architecture, has had its calls to
__irq_set_affinity_locked updated to reflect changes to the API upstream.

Finally, as we have removed the sleep_length member variable of the
tick_sched struct, all changes made by upstream commit ec804bd do not
apply to our tree and have been removed from this merge. Only
kernel/time/tick-sched.c is impacted.

Change-Id: I63b7e0c1354812921c94804e1f3b33d1ad6ee3f1
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2014-08-20 13:23:09 -07:00
Olav Haugan df91ad278c sched: Make RAVG_HIST_SIZE tunable
Make RAVG_HIST_SIZE available from /proc/sys/kernel/sched_ravg_hist_size
to allow tuning of the size of the history that is used in computation
of task demand.

CRs-fixed: 706138
Change-Id: Id54c1e4b6e974a62d787070a0af1b4e8ce3b4be6
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
2014-08-12 11:15:20 -07:00
Srivatsa Vaddagiri 5e8f14fbbc sched: window-stats: Allow acct_wait_time to be tuned
Add sysctl interface to tune sched_acct_wait_time variable at runtime.

Change-Id: I38339cdb388a507019e429709a7c28e80b5b3585
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-12 10:51:30 -07:00
Srivatsa Vaddagiri a57fe9b6df sched: window-stats: Handle policy change properly
sched_window_stat_policy influences task demand and thus various
statistics maintained per-cpu like curr_runnable_sum. Changing policy
non-atomically would lead to improper accounting. For example, when a
task is enqueued on a cpu's runqueue, its demand that is added to
rq->cumulative_runnable_avg could be based on AVG policy and when it is
dequeued its demand that is removed can be based on MAX, leading to
erroneous accounting.

This change causes policy change to be "atomic", i.e. all cpus' rq->lock
are held and all tasks' window-stats are reset before policy is changed.
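
The mismatch described above can be illustrated with a small userspace
sketch. It is not the kernel code: task_demand(),
residual_after_policy_flip() and the enum names are hypothetical, and the
only thing taken from this commit is that demand may be computed under an
AVG or a MAX policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy names; the commit text only mentions AVG and MAX. */
enum demand_policy { POLICY_AVG, POLICY_MAX };

/* Demand of a task derived from its per-window history under a policy. */
static uint32_t task_demand(const uint32_t *hist, size_t n,
                            enum demand_policy policy)
{
    uint64_t sum = 0;
    uint32_t max = 0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        sum += hist[i];
        if (hist[i] > max)
            max = hist[i];
    }
    return policy == POLICY_MAX ? max : (uint32_t)(sum / n);
}

/*
 * Add the task's demand to a runqueue-wide sum under one policy and
 * remove it under another. A non-zero return value is the accounting
 * error left behind when the policy changes between the two steps.
 */
static int64_t residual_after_policy_flip(const uint32_t *hist, size_t n,
                                          enum demand_policy at_enqueue,
                                          enum demand_policy at_dequeue)
{
    int64_t cumulative = 0;

    cumulative += task_demand(hist, n, at_enqueue);
    cumulative -= task_demand(hist, n, at_dequeue);
    return cumulative;
}
```

With a history of {10, 20, 30}, enqueueing under AVG and dequeueing under
MAX leaves a residual of -10 in the sum, while using the same policy for
both steps leaves 0.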

Change-Id: I6a3e4fb7bc299dfc5c367693b5717a1ef518c32d
CRs-Fixed: 687409
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-08-06 15:36:59 +05:30
Srivatsa Vaddagiri fefafa08b7 sched: remove sysctl control for HMP and power-aware task placement
There is no real need to control HMP and power-aware task placement at
runtime after kernel has booted. Boot-time control should be
sufficient. Not allowing for runtime (sysctl) support simplifies the
code quite a bit.

Also rename sysctl_sched_enable_hmp_task_placement to be shorter.

Change-Id: I60cae51a173c6f73b79cbf90c50ddd41a27604aa
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 16:07:58 -07:00
Syed Rameez Mustafa 65eab4a6f5 sched/fair: Introduce scheduler boost for low latency workloads
Certain low latency bursty workloads require immediate use of highest
capacity CPUs in HMP systems. Existing load tracking mechanisms may be
unable to respond to the sudden surge in the system load within the
latency requirements. Introduce the scheduler boost feature for such
workloads. While boost is in effect the scheduler bypasses regular load
based task placement and prefers highest capacity CPUs in the system
for all non-small fair sched class tasks. Provide both a kernel and
userspace API for software that may have a priori knowledge about the
system workload.

Change-Id: I783f585d1f8c97219e629d9c54f712318821922f
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:23:02 -07:00
Steve Muckle 80753c7e5e sched: notify cpufreq on over/underprovisioned CPUs
After a migration occurs the source and destination CPUs may
not be running at frequencies which match the new task load on
those CPUs.

Previously, the scheduler was notifying cpufreq any time a task
greater than a certain size migrated. This is suboptimal, however,
since it does not take into account the CPU's current
frequency and other task activity that may be present.

Change-Id: I5092bda3a517e1343f97e5a455957c25ee19b549
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2014-07-22 14:22:58 -07:00
Syed Rameez Mustafa dbd2db2471 sched: Introduce spill threshold tunables to manage overcommitment
When the number of tasks intended for a cluster exceeds the number of
mostly idle CPUs in that cluster, the scheduler currently freely uses
CPUs in other clusters if possible. While this is optimal for
performance, the power trade-off can be quite significant. Introduce
spill threshold tunables that govern the extent to which the scheduler
should attempt to contain tasks within a cluster.

Change-Id: I797e6c6b2aa0c3a376dad93758abe1d587663624
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2014-07-22 14:22:58 -07:00
Steve Muckle bb3f4aae22 sched: add migration load change notifier for frequency guidance
When a task moves between CPUs in two different frequency domains
the cpufreq governor may wish to immediately modify the frequency
of both the source and destination CPUs of the migrating task.

A tunable is provided to establish what size task is considered
"significant" enough to warrant notifying cpufreq.

Also fix a bug that would cause load to not be accounted properly
during wakeup migrations.

Change-Id: Ie8f6b1cc4d43a602840dac18590b42a81327c95a
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
2014-07-22 14:22:57 -07:00
Steve Muckle bf1afbbbcd sched: add power aware scheduling sysctl
The sched_enable_power_aware sysctl will control whether
or not scheduling decisions are influenced by the power
consumption of individual CPUs.

Change-Id: I312f892cf76a3fccc4ecc8aa6703908b205267f0
Signed-off-by: Steve Muckle <smuckle@codeaurora.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:22:55 -07:00
Srivatsa Vaddagiri 259de62b7f sched: Basic task placement support for HMP systems
HMP systems have cpus with different power and performance
characteristics. Some cpus could offer better power at cost of lower
performance while other cpus could offer better performance at cost of
higher power. As a result, bandwidth consumed by a task to do some
"fixed" amount of work could vary across cpus.

Optimal task placement on HMP would involve placing a task on a cpu
where it can meet its performance goals at lowest power cost. Since
kernel has little to no awareness of performance goals of
applications, we guesstimate whether task is meeting its performance
goals or not by looking at its cpu bandwidth consumption. High
bandwidth consumption could imply that task's performance can improve
by running on cpus with better capacity/performance-characteristics.

This patch makes the basic changes to support HMP. It provides a
configurable threshold and any task consuming bandwidth in excess of
threshold will be placed on a cpu with better capacity.
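
A minimal userspace sketch of the threshold rule, for illustration only.
The struct cpu_info type, the pick_cpu() helper and the percentage units
are assumptions, not names from this patch. The sketch also assumes that
a task at or below the threshold simply stays on its current cpu, which
the text above does not specify.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-cpu descriptor; only relative capacity matters here. */
struct cpu_info {
    unsigned int capacity;
};

/*
 * Return the index of the cpu a task should run on. A task whose
 * bandwidth consumption exceeds the threshold is moved to the
 * highest-capacity cpu that is better than its current one; otherwise
 * it stays on the current cpu (an assumption of this sketch).
 */
static size_t pick_cpu(const struct cpu_info *cpus, size_t ncpus, size_t cur,
                       unsigned int bandwidth_pct, unsigned int threshold_pct)
{
    size_t best = cur;

    assert(cur < ncpus);
    if (bandwidth_pct <= threshold_pct)
        return cur;

    for (size_t i = 0; i < ncpus; i++) {
        if (cpus[i].capacity > cpus[best].capacity)
            best = i;
    }
    return best;
}
```

With two cpus of capacity 512 and 1024 and a threshold of 80, a task on
cpu 0 consuming 90 is moved to cpu 1, while a task consuming 50 stays on
cpu 0.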

Change-Id: I3fd98edd430f73342fbef06411e8b2d1cf2f56fa
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:32 -07:00
Srivatsa Vaddagiri c1f998027e sched: Add CONFIG_SCHED_HMP Kconfig option
Add a compile-time flag to enable or disable scheduler features for
HMP (heterogeneous multi-processor) systems. Main feature deals with
optimizing task placement for best power/performance tradeoff.

Also extend features currently dependent on CONFIG_SCHED_FREQ_INPUT to
be enabled for CONFIG_HMP as well.

Change-Id: I03b3942709a80cc19f7b934a8089e1d84c14d72d
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:31 -07:00
Srivatsa Vaddagiri dc2e1a4383 sched: Introduce CONFIG_SCHED_FREQ_INPUT
Introduce a compile time flag to enable scheduler guidance of
frequency selection. This flag is also used to turn on or off
window-based load stats feature.

Having a compile time flag will let some platforms avoid any
overhead that may be present with this scheduler feature.

Change-Id: Id8dec9839f90dcac82f58ef7e2bd0ccd0b6bd16c
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-07-22 14:20:30 -07:00