Commit graph

13848 commits

Author SHA1 Message Date
Stephen Hemminger
39b326bec0 tcp: remove Appropriate Byte Count support
TCP Appropriate Byte Count was added by me, but later disabled.
There is no point in maintaining it since it is a potential source
of bugs and Linux already implements other better window protection
heuristics.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change-Id: Ifc93fa043256c275580a9f93611ad1d013889bc4
2020-11-30 19:26:36 +03:00
David Herrmann
0309fda2fe shm: add memfd_create() syscall
memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
that you can pass to mmap().  It can support sealing and avoids any
connection to user-visible mount-points.  Thus, it's not subject to quotas
on mounted file-systems, but can be used like malloc()'ed memory, but with
a file-descriptor to it.

memfd_create() returns the raw shmem file, so calls like ftruncate() can
be used to modify the underlying inode.  Also calls like fstat() will
return proper information and mark the file as regular file.  If you want
sealing, you can specify MFD_ALLOW_SEALING.  Otherwise, sealing is not
supported (like on all other regular files).

Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
subject to a filesystem size limit.  It is still properly accounted to
memcg limits, though, and to the same overcommit or no-overcommit
accounting as all user memory.

Change-Id: Iaf959293e2c490523aeb46d56cc45b0e7bbe7bf5
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ryan Lortie <desrt@desrt.ca>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <zonque@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Angelo G. Del Regno <kholk11@gmail.com>
2020-10-25 02:37:54 -04:00
Daniel Micay
bacad4b4cb add toggle for disabling newly added USB devices
Based on the public grsecurity patches.

Change-Id: I2cbea91b351cda7d098f4e1aa73dff1acbd23cce
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Kevin F. Haggerty <haggertk@lineageos.org>
2020-10-25 00:03:27 -04:00
Jeff Layton
dde1c69f9b vfs: make path_openat take a struct filename pointer
...and fix up the callers. For do_file_open_root, just declare a
struct filename on the stack and fill out the .name field. For
do_filp_open, make it also take a struct filename pointer, and fix up its
callers to call it appropriately.

For filp_open, add a variant that takes a struct filename pointer and turn
filp_open into a wrapper around it.

Change-Id: Ibeb0479a22019e78b22990406d54c4ebed76a567
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-07 22:28:48 +04:00
Jeff Layton
3df0a6646d vfs: define struct filename and have getname() return it
getname() is intended to copy pathname strings from userspace into a
kernel buffer. The result is just a string in kernel space. It would
however be quite helpful to be able to attach some ancillary info to
the string.

For instance, we could attach some audit-related info to reduce the
amount of audit-related processing needed. When auditing is enabled,
we could also call getname() on the string more than once and not
need to recopy it from userspace.

This patchset converts the getname()/putname() interfaces to return
a struct instead of a string. For now, the struct just tracks the
string in kernel space and the original userland pointer for it.

Later, we'll add other information to the struct as it becomes
convenient.

Change-Id: Ib690c3dd4d56624f0ddb081e1c1d4f23c2dd0cd1
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-07 22:28:48 +04:00
Kees Cook
2f549f9575 fs: add link restriction audit reporting
Adds audit messages for unexpected link restriction violations so that
system owners will have some sort of potentially actionable information
about misbehaving processes.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Change-Id: I4a6ef885b0680e1d554e32b7cc3506f8e0ba0b8a
2018-12-07 22:28:48 +04:00
Kees Cook
ec7215ac09 fs: add link restrictions
This adds symlink and hardlink restrictions to the Linux VFS.

Symlinks:

A long-standing class of security issues is the symlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
directories like /tmp. The common method of exploitation of this flaw
is to cross privilege boundaries when following a given symlink (i.e. a
root process follows a symlink belonging to another user). For a likely
incomplete list of hundreds of examples across the years, please see:
http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp

The solution is to permit symlinks to only be followed when outside
a sticky world-writable directory, or when the uid of the symlink and
follower match, or when the directory owner matches the symlink's owner.

Some pointers to the history of earlier discussion that I could find:

 1996 Aug, Zygo Blaxell
  http://marc.info/?l=bugtraq&m=87602167419830&w=2
 1996 Oct, Andrew Tridgell
  http://lkml.indiana.edu/hypermail/linux/kernel/9610.2/0086.html
 1997 Dec, Albert D Cahalan
  http://lkml.org/lkml/1997/12/16/4
 2005 Feb, Lorenzo Hernández García-Hierro
  http://lkml.indiana.edu/hypermail/linux/kernel/0502.0/1896.html
 2010 May, Kees Cook
  https://lkml.org/lkml/2010/5/30/144

Past objections and rebuttals could be summarized as:

 - Violates POSIX.
   - POSIX didn't consider this situation and it's not useful to follow
     a broken specification at the cost of security.
 - Might break unknown applications that use this feature.
   - Applications that break because of the change are easy to spot and
     fix. Applications that are vulnerable to symlink ToCToU by not having
     the change aren't. Additionally, no applications have yet been found
     that rely on this behavior.
 - Applications should just use mkstemp() or O_CREATE|O_EXCL.
   - True, but applications are not perfect, and new software is written
     all the time that makes these mistakes; blocking this flaw at the
     kernel is a single solution to the entire class of vulnerability.
 - This should live in the core VFS.
   - This should live in an LSM. (https://lkml.org/lkml/2010/5/31/135)
 - This should live in an LSM.
   - This should live in the core VFS. (https://lkml.org/lkml/2010/8/2/188)

Hardlinks:

On systems that have user-writable directories on the same partition
as system files, a long-standing class of security issues is the
hardlink-based time-of-check-time-of-use race, most commonly seen in
world-writable directories like /tmp. The common method of exploitation
of this flaw is to cross privilege boundaries when following a given
hardlink (i.e. a root process follows a hardlink created by another
user). Additionally, an issue exists where users can "pin" a potentially
vulnerable setuid/setgid file so that an administrator will not actually
upgrade a system fully.

The solution is to permit hardlinks to only be created when the user is
already the existing file's owner, or if they already have read/write
access to the existing file.

Many Linux users are surprised when they learn they can link to files
they have no access to, so this change appears to follow the doctrine
of "least surprise". Additionally, this change does not violate POSIX,
which states "the implementation may require that the calling process
has permission to access the existing file"[1].

This change is known to break some implementations of the "at" daemon,
though the version used by Fedora and Ubuntu has been fixed[2] for
a while. Otherwise, the change has been undisruptive while in use in
Ubuntu for the last 1.5 years.

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/linkat.html
[2] http://anonscm.debian.org/gitweb/?p=collab-maint/at.git;a=commitdiff;h=f4114656c3a6c6f6070e315ffdf940a49eda3279

This patch is based on the patches in Openwall and grsecurity, along with
suggestions from Al Viro. I have added a sysctl to enable the protected
behavior, and documentation.

Change-Id: Ic4872c58e8a0672147c73b13175ea143e19915ba
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-07 22:28:48 +04:00
Al Viro
66c4da2876 stop passing nameidata to ->lookup()
Just the flags; only NFS cares even about that, but there are
legitimate uses for such argument.  And getting rid of that
completely would require splitting ->lookup() into a couple
of methods (at least), so let's leave that alone for now...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Change-Id: Id5a9a96c3202f724156c32fb266190334e7dbe48
2018-12-07 22:26:28 +04:00
Artem Borisov
e2c600a1f3 sched_clock: Squashed revert of the latest updates
Revert "sched_clock: Avoid corrupting hrtimer tree during suspend"

This reverts commit 8aad725c70.

Revert "sched_clock: Add support for >32 bit sched_clock"

This reverts commit 657eb100e4.

Revert "sched_clock: Use an hrtimer instead of timer"

This reverts commit b2ee62ec51.

Revert "sched_clock: Use seqcount instead of rolling our own"

This reverts commit 538b187b6e.

Revert "ARM: sched_clock: Load cycle count after epoch stabilizes"

This reverts commit 8c7175ba39.

Revert "sched_clock: Make ARM's sched_clock generic for all architectures"

This reverts commit ebb97da74a.

Revert "ARM: 7699/1: sched_clock: Add more notrace to prevent recursion"

This reverts commit 086da6a6c4.

Revert "ARM: make sched_clock just call a function pointer"

This reverts commit 0dd4fad6c9.

Revert "ARM: sched_clock: allow changing to higher frequency counter"

This reverts commit 4a3cf85432.

Change-Id: I98aaec7b554a2e11be4c551a864d952e0d8c3e22
2018-02-20 21:56:17 +03:00
Peter Zijlstra
3cffdb884f perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race
commit 321027c1fe77f892f4ea07846aeae08cefbbb290 upstream.
commit fe525a280e8b5f04c7666fe22d1a4ef592f7b953 in 3.16.40
bug: 37901413

Di Shen reported a race between two concurrent sys_perf_event_open()
calls where both try and move the same pre-existing software group
into a hardware context.

The problem is exactly that described in commit:

  f63a8daa5812 ("perf: Fix event->ctx locking")

... where, while we wait for a ctx->mutex acquisition, the event->ctx
relation can have changed under us.

That very same commit failed to recognise sys_perf_event_context() as an
external access vector to the events and thereby didn't apply the
established locking rules correctly.

So while one sys_perf_event_open() call is stuck waiting on
mutex_lock_double(), the other (which owns said locks) moves the group
about. So by the time the former sys_perf_event_open() acquires the
locks, the context we've acquired is stale (and possibly dead).

Apply the established locking rules as per perf_event_ctx_lock_nested()
to the mutex_lock_double() for the 'move_group' case. This obviously means
we need to validate state after we acquire the locks.

Change-Id: I816a317dff3ce999c94d22b7e51152ad1dcc30a2
Reported-by: Di Shen (Keen Lab)
Tested-by: John Dias <joaodias@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Min Chong <mchong@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: f63a8daa5812 ("perf: Fix event->ctx locking")
Link: http://lkml.kernel.org/r/20170106131444.GZ3174@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.16:
 - Use ACCESS_ONCE() instead of READ_ONCE()
 - Test perf_event::group_flags instead of group_caps
 - Add the err_locked cleanup block, which we didn't need before
 - Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2018-01-13 17:13:44 +03:00
Peter Zijlstra
898386b287 BACKPORT: perf: Fix event->ctx locking
There have been a few reported issues wrt. the lack of locking around
changing event->ctx. This patch tries to address those.

It avoids the whole rwsem thing; and while it appears to work, please
give it some thought in review.

What I did fail at is sensible runtime checks on the use of
event->ctx, the RCU use makes it very hard.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20150123125834.209535886@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit f63a8daa5812afef4f06c962351687e1ff9ccb2b)

Bug: 30955111
Bug: 31095224
Change-Id: I8dfc0aae8d1206c177454e0093dacd82b6129c55
Signed-off-by: Joao Dias <joaodias@google.com>
2018-01-13 17:13:44 +03:00
Yan, Zheng
b23629d405 perf: Introduce perf_pmu_migrate_context()
Originally from Peter Zijlstra. The helper migrates perf events
from one cpu to another cpu.

Change-Id: I4d3c45b4594f3d26bbe7cc9e3fb79675ffac8b5e
Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1339741902-8449-5-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-13 17:13:43 +03:00
Yan, Zheng
75e8341254 perf: Allow the PMU driver to choose the CPU on which to install events
Allow the pmu->event_init callback to change event->cpu, so the PMU driver
can choose the CPU on which to install events.

Change-Id: Ie1f67c8b9fac650002f059081fe325eb799690c1
Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1339741902-8449-4-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-13 17:13:43 +03:00
Peter Zijlstra
29484ea618 UPSTREAM: perf: Fix race in swevent hash
(cherry picked from commit 12ca6ad2e3a896256f086497a7c7406a547ee373)

There's a race on CPU unplug where we free the swevent hash array
while it can still have events on. This will result in a
use-after-free which is BAD.

Simply do not free the hash array on unplug. This leaves the thing
around and no use-after-free takes place.

When the last swevent dies, we do a for_each_possible_cpu() iteration
anyway to clean these up, at which time we'll free it, so no leakage
will occur.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change-Id: I14c0679a2934dccdbb052805e6430fe54b5978f0
Bug: 30952077
2018-01-13 17:13:43 +03:00
Oleg Nesterov
4d448bba95 BACKPORT: FROMLIST: pids: make task_tgid_nr_ns() safe
This was reported many times, and this was even mentioned in commit
52ee2dfdd4 "pids: refactor vnr/nr_ns helpers to make them safe" but
somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns()
is not safe because task->group_leader points to nowhere after the
exiting task passes exit_notify(), rcu_read_lock() can not help.

We really need to change __unhash_process() to nullify group_leader,
parent, and real_parent, but this needs some cleanups. Until then we
can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and
fix the problem.

Reported-by: Troy Kensinger <tkensinger@google.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

(url: https://patchwork.kernel.org/patch/9913055/)
Bug: 31495866

Change-Id: I5e67b02a77e805f71fa3a787249f13c1310f02e2
2018-01-13 17:13:41 +03:00
Stephen Boyd
8aad725c70 sched_clock: Avoid corrupting hrtimer tree during suspend
During suspend we call sched_clock_poll() to update the epoch and
accumulated time and reprogram the sched_clock_timer to fire
before the next wrap-around time. Unfortunately,
sched_clock_poll() doesn't restart the timer, instead it relies
on the hrtimer layer to do that and during suspend we aren't
calling that function from the hrtimer layer. Instead, we're
reprogramming the expires time while the hrtimer is enqueued,
which can cause the hrtimer tree to be corrupted. Furthermore, we
restart the timer during suspend but we update the epoch during
resume which seems counter-intuitive.

Let's fix this by saving the accumulated state and canceling the
timer during suspend. On resume we can update the epoch and
restart the timer similar to what we would do if we were starting
the clock for the first time.

Change-Id: Iee2a1cca42e5b681347ea0607e9af420a63892d7
CRs-Fixed: 696826
Fixes: a08ca5d108 "sched_clock: Use an hrtimer instead of timer"
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
2018-01-02 22:36:45 +03:00
Stephen Boyd
657eb100e4 sched_clock: Add support for >32 bit sched_clock
The ARM architected system counter has at least 56 usable bits.
Add support for counters with more than 32 bits to the generic
sched_clock implementation so we can increase the time between
wakeups due to dealing with wrap-around on these devices while
benefiting from the irqtime accounting and suspend/resume
handling that the generic sched_clock code already has. On my
system using 56 bits over 32 bits changes the wraparound time
from a few minutes to an hour. For faster running counters (GHz
range) this is even more important because we may not be able to
execute the timer in time to deal with the wraparound if only 32
bits are used.

We choose a maxsec value of 3600 seconds because we assume no
system will go idle for more than an hour. In the future we may
need to increase this value.

Note: All users should switch over to the 64-bit read function so
we can remove setup_sched_clock() in favor of sched_clock_register().

Change-Id: Ieea67ada51e6fce2637520bf5f60e789762d8694
Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Git-commit: e7e3ff1bfe
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2018-01-02 22:36:43 +03:00
Stephen Boyd
b2ee62ec51 sched_clock: Use an hrtimer instead of timer
In the next patch we're going to increase the number of bits that
the generic sched_clock can handle to be greater than 32. With
more than 32 bits the wraparound time can be larger than what can
fit into the units that msecs_to_jiffies takes (unsigned int).
Luckily, the wraparound is initially calculated in nanoseconds
which we can easily use with hrtimers, so switch to using an
hrtimer.

Change-Id: Ia052d3d5d9c1ec7b59c440ceae852c12c61477e1
Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
[jstultz: Fixup hrtimer intitialization order issue]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Git-commit: a08ca5d108
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2018-01-02 22:36:42 +03:00
Stephen Boyd
538b187b6e sched_clock: Use seqcount instead of rolling our own
We're going to increase the cyc value to 64 bits in the near
future. Doing that is going to break the custom seqcount
implementation in the sched_clock code because 64 bit numbers
aren't guaranteed to be atomic. Replace the cyc_copy with a
seqcount to avoid this problem.

Cc: Russell King <linux@arm.linux.org.uk>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Git-commit: 85c3d2dd15
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[imaund@codeaurora.org: resolve merge conflicts]
Signed-off-by: Ian Maund <imaund@codeaurora.org>

Change-Id: If7fae3228ff425c9f1763145cbd5c851141c7755
2018-01-02 22:36:39 +03:00
Stephen Boyd
8c7175ba39 ARM: sched_clock: Load cycle count after epoch stabilizes
There is a small race between when the cycle count is read from
the hardware and when the epoch stabilizes. Consider this
scenario:

 CPU0                           CPU1
 ----                           ----
 cyc = read_sched_clock()
 cyc_to_sched_clock()
                                 update_sched_clock()
                                  ...
                                  cd.epoch_cyc = cyc;
  epoch_cyc = cd.epoch_cyc;
  ...
  epoch_ns + cyc_to_ns((cyc - epoch_cyc)

The cyc on cpu0 was read before the epoch changed. But we
calculate the nanoseconds based on the new epoch by subtracting
the new epoch from the old cycle count. Since epoch is most likely
larger than the old cycle count we calculate a large number that
will be converted to nanoseconds and added to epoch_ns, causing
time to jump forward too much.

Fix this problem by reading the hardware after the epoch has
stabilized.

Change-Id: I2d81b8422814f209ca1e03a45eb370c9a0696c31
Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Git-commit: 336ae1180d
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2018-01-02 22:35:58 +03:00
Stephen Boyd
ebb97da74a sched_clock: Make ARM's sched_clock generic for all architectures
Nothing about the sched_clock implementation in the ARM port is
specific to the architecture. Generalize the code so that other
architectures can use it by selecting GENERIC_SCHED_CLOCK.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
[jstultz: Merge minor collisions with other patches in my tree]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Git-commit: 38ff87f77a
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[imaund@codeaurora.org: resolve merge conflicts]
Signed-off-by: Ian Maund <imaund@codeaurora.org>
[flex1911: backport to 3.4]
Signed-off-by: Artem Borisov <dedsa2002@gmail.com>

Change-Id: I798c7c58dc9f476b07e60958a970aff7ceb4b797
2018-01-02 22:35:27 +03:00
Srivatsa S. Bhat
b96d7e4f12 CPU hotplug: Provide lockless versions of callback registration functions
(cherry pick from commit 93ae4f978c)

The following method of CPU hotplug callback registration is not safe
due to the possibility of an ABBA deadlock involving the cpu_add_remove_lock
and the cpu_hotplug.lock.

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

The deadlock is shown below:

          CPU 0                                         CPU 1
          -----                                         -----

   Acquire cpu_hotplug.lock
   [via get_online_cpus()]

                                              CPU online/offline operation
                                              takes cpu_add_remove_lock
                                              [via cpu_maps_update_begin()]

   Try to acquire
   cpu_add_remove_lock
   [via register_cpu_notifier()]

                                              CPU online/offline operation
                                              tries to acquire cpu_hotplug.lock
                                              [via cpu_hotplug_begin()]

                            *** DEADLOCK! ***

The problem here is that callback registration takes the locks in one order
whereas the CPU hotplug operations take the same locks in the opposite order.
To avoid this issue and to provide a race-free method to register CPU hotplug
callbacks (along with initialization of already online CPUs), introduce new
variants of the callback registration APIs that simply register the callbacks
without holding the cpu_add_remove_lock during the registration. That way,
we can avoid the ABBA scenario. However, we will need to hold the
cpu_add_remove_lock throughout the entire critical section, to protect updates
to the callback/notifier chain.

This can be achieved by writing the callback registration code as follows:

	cpu_maps_update_begin(); [ or cpu_notifier_register_begin(); see below ]

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	/* This doesn't take the cpu_add_remove_lock */
	__register_cpu_notifier(&foobar_cpu_notifier);

	cpu_maps_update_done();  [ or cpu_notifier_register_done(); see below ]

Note that we can't use get_online_cpus() here instead of cpu_maps_update_begin()
because the cpu_hotplug.lock is dropped during the invocation of CPU_POST_DEAD
notifiers, and hence get_online_cpus() cannot provide the necessary
synchronization to protect the callback/notifier chains against concurrent
reads and writes. On the other hand, since the cpu_add_remove_lock protects
the entire hotplug operation (including CPU_POST_DEAD), we can use
cpu_maps_update_begin/done() to guarantee proper synchronization.

Also, since cpu_maps_update_begin/done() is like a super-set of
get/put_online_cpus(), the former naturally protects the critical sections
from concurrent hotplug operations.

Since the names cpu_maps_update_begin/done() don't make much sense in CPU
hotplug callback registration scenarios, we'll introduce new APIs named
cpu_notifier_register_begin/done() and map them to cpu_maps_update_begin/done().

In summary, introduce the lockless variants of un/register_cpu_notifier() and
also export the cpu_notifier_register_begin/done() APIs for use by modules.
This way, we provide a race-free way to register hotplug callbacks as well as
perform initialization for the CPUs that are already online.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Bug: 24810447
Change-Id: I5f85fcb5cfaa5f5f04a29eefc361851e9c345a99
2018-01-01 21:26:45 +03:00
Andrey Vagin
5f83d0f802 BACKPORT: signal: allow to send any siginfo to itself
(cherry picked from commit 66dd34ad31)

The idea is simple.  We need to get the siginfo for each signal on
checkpointing dump, and then return it back on restore.

The first problem is that the kernel doesn't report complete siginfos to
userspace.  In a signal handler the kernel strips SI_CODE from siginfo.
When a siginfo is received from signalfd, it has a different format with
fixed sizes of fields.  The interface of signalfd was extended.  If a
signalfd is created with the flag SFD_RAW, it returns siginfo in a raw
format.

rt_sigqueueinfo looks suitable for restoring signals, but it can't send
siginfo with a positive si_code, because these codes are reserved for
the kernel.  In the real world each person has right to do anything with
himself, so I think a process should able to send any siginfo to itself.

This patch:

The kernel prevents sending of siginfo with positive si_code, because
these codes are reserved for kernel.  I think we can allow a task to
send such a siginfo to itself.  This operation should not be dangerous.

This functionality is required for restoring signals in
checkpoint/restart.

Change-Id: I40101d87eeb53ae05cfa0949439577a8f3f58f94
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-27 22:49:08 +03:00
John Stultz
ff9ff2f4b9 ANDROID: exec_domains: Disable request_module() call for personalities
(cherry pick from commit a9ac1262ce)

With Android M, Android environments use a separate execution
domain for 32bit processes.
See:
https://android-review.googlesource.com/#/c/122131/

This results in systems that use kernel modules to see selinux
audit noise like:
  type=1400 audit(28.989:15): avc: denied { module_request } for
  pid=1622 comm="app_process32" kmod="personality-8"
  scontext=u:r:zygote:s0 tcontext=u:r:kernel:s0 tclass=system

While using kernel modules is unadvised, some systems do require
them.

Thus to avoid developers adding sepolicy exceptions to allow for
request_module calls, this patch disables the logic which tries
to call request_module for the 32bit personality (ie:
personality-8), which doesn't actually exist.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Change-Id: I9cb90bd1291f0a858befa7d347c85464346702db
2017-12-27 22:46:26 +03:00
Guenter Roeck
f57b91255d seccomp: Replace BUG(!spin_is_locked()) with assert_spin_lock
Current upstream kernel hangs with mips and powerpc targets in
uniprocessor mode if SECCOMP is configured.

Bisect points to commit dbd952127d ("seccomp: introduce writer locking").
Turns out that code such as
	BUG_ON(!spin_is_locked(&list_lock));
can not be used in uniprocessor mode because spin_is_locked() always
returns false in this configuration, and that assert_spin_locked()
exists for that very purpose and must be used instead.

Fixes: dbd952127d ("seccomp: introduce writer locking")
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-12-27 22:42:09 +03:00
Artem Borisov
d7992e6feb Merge remote-tracking branch 'stable/linux-3.4.y' into lineage-15.1
All bluetooth-related changes were omitted because of our ancient incompatible bt stack.

Change-Id: I96440b7be9342a9c1adc9476066272b827776e64
2017-12-27 17:13:15 +03:00
Liu ShuoX
3da71738d2 PM / Sleep: avoid 'autosleep' in shutdown progress
commit e5248a111b upstream.

Prevent automatic system suspend from happening during system
shutdown by making try_to_suspend() check system_state and return
immediately if it is not SYSTEM_RUNNING.

This prevents the following breakage from happening (scenario from
Zhang Yanmin):

 Kernel starts shutdown and calls all device driver's shutdown
 callback.  When a driver's shutdown is called, the last wakelock is
 released and suspend-to-ram starts.  However, as some driver's shut
 down callbacks already shut down devices and disabled runtime pm,
 the suspend-to-ram calls driver's suspend callback without noticing
 that device is already off and causes crash.

Change-Id: I09261fe136713cb6bdd66e061a9e886d077324c5
[rjw: Changelog]
Signed-off-by: Liu ShuoX <shuox.liu@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 426b7d5074424aab388af948ba75a5e1c8b9a702)
2017-10-15 17:05:14 +03:00
Dmitry Shmidt
36f258a110 PM: Check dpm_suspend_start() return code during partial resume
Bug: 24986869

Change-Id: Iea3e0f84e43827b365b96d34bc647e310523bd40
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
Signed-off-by: Thierry Strudel <tstrudel@google.com>
2017-10-15 17:05:13 +03:00
Ruchi Kandoi
0ff10ad812 wakeup_reason: use vsnprintf instead of snprintf for vargs.
Bug: 22368519
Change-Id: I38f6f1ac6eaf9490bdc195c59e045b33ad154a72
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
2017-10-15 17:05:12 +03:00
Dmitry Shmidt
f6d4e6286e Power: Add wakeup reasons counters from boot in suspend_since_boot
From left to right:
        1. Amount of no-wait cycles
        2. Amount of timeout cycles
        3. Max waiting time in ms

Change-Id: Ibc0bb1c4ea591d005cdbb095b6d21c0734d2eb8b
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
2017-10-15 17:05:12 +03:00
Dmitry Shmidt
dc58d3c933 PM: Reduce waiting for wakeup reasons to 100 ms
In 80% cases there is no need to wait, and in case
of timeout we continue to resume.

Change-Id: I6ae44e0ef6f7aa497f57fcd5f6e6bc83dc781852
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
2017-10-15 17:05:12 +03:00
Ruchi Kandoi
18bbd5eed4 suspend: Return error when pending wakeup source is found.
Suspend is aborted if the wakeup_source is pending. These wakeup sources
are checked multiple times before going to suspend. If it is found to be
pending then suspend is aborted and -EBUSY is returned. This happens at
all the places except the last time they are checked. In this case
suspend is aborted but the error is not set. Since the error is not
propogated the suspend accounting considers this as a sucessful suspend
instead of suspend abort.

Change-Id: Ib63b4ead755127eaf03e3b303aab3c782ad02ed1
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
2017-10-15 17:05:11 +03:00
Iliyan Malchev
a50130f1be PM: wakeup_reasons: disable wakeup-reason deduction by default
Introduce a config item, CONFIG_DEDUCE_WAKEUP_REASONS, disabled by default.
Make CONFIG_PARTUALRESUME select it.

Change-Id: I7d831ff0a9dfe0a504824f4bc65ba55c4d92546b
Signed-off-by: Iliyan Malchev <malchev@google.com>
2017-10-15 17:05:11 +03:00
Dmitry Shmidt
9ff16b88b8 PM: Replace WARN_ON on timeout with one line print
Change-Id: Ia8b32b8ee225b7b62a327fecb10e9284ee4116df
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
2017-10-15 17:05:10 +03:00
Ruchi Kandoi
124acbcfe3 power: Avoids bogus error messages for the suspend aborts.
Avoids printing bogus error message "tasks refusing to freeze", in cases
where pending wakeup source caused the suspend abort.

Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
Change-Id: I913ad290f501b31cd536d039834c8d24c6f16928
2017-10-15 17:05:10 +03:00
Iliyan Malchev
84612c4595 PM: wakeup_reasons: fix race condition
log_possible_wakeup_reason() and stop_logging_wakeup_reasons() can race, as the
latter can be called from process context, and both can run on separate cores.

Change-Id: I306441d0be46dd4fe58c55cdc162f9d61a28c27d
Signed-off-by: Iliyan Malchev <malchev@google.com>
2017-10-15 17:05:09 +03:00
Dmitry Shmidt
563e031bd3 Power: Report total suspend times from boot in suspend_since_boot
This node exports five values separated by space.
        From left to right:
        1. Amount of suspend/resume cycles
        2. Amount of suspend abort cycles
        3. Total time spent in suspend/resume process
        4. Total time in suspend abort process
        5. Total time spent sleep in suspend state

Change-Id: Ife188fd8386dce35f95fa7ba09fbc9d7e152db62
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
2017-10-15 17:05:09 +03:00
Iliyan Malchev
953a4840dd PM: extend suspend_again mechanism to use partialresume
The old platform suspend_again callback overrides drivers' votes, such that if
it implemented and returns false, then we do not call the partialresume
handlers.  When it doesn't exists or returns true, then we also query the
registered drivers for consensus.

When a device resumes from suspend, the suspend/resume code invokes
partialresume to check to see if the set of wakeup interrupts all have matching
handlers. If this is not the case, the PM subsystem can continue to resume as
before.  If all of the wakeup sources have matching handlers, then those are
invoked in turn (and can block), and if all of them reach consensus that the
reason for the wakeup can be ignored, they say so to the PM subsystem, which
goes right back into suspend.

Signed-off-by: Iliyan Malchev <malchev@google.com>
Change-Id: Iaeb9ed78c4b5fb815c6e9c701233e703f481f962
2017-10-15 17:05:08 +03:00
Iliyan Malchev
4e0c8780cc power: add partial-resume framework
Partial resume refers to the concept of not waking up userspace when the kernel
comes out of suspend for certain types of events that we wish to discard.  An
example is a network packet that can be disacarded in the kernel, or spurious
wakeup event that we wish to ignore.  Partial resume allows drivers to register
callbacks, one one hand, and provides hooks into the PM's suspend/resume
mechanism, on the other.

When a device resumes from suspend, the core suspend/resume code invokes
partialresume to check to see if the set of wakeup interrupts all have
matching handlers. If this is not the case, the PM subsystem can continue to
resume as before.  If all of the wakeup sources have matching handlers, then
those are invoked in turn (and can block), and if all of them reach consensus
that the reason for the wakeup can be ignored, they say so to the PM subsystem,
which goes right back into suspend.  This latter support is implemented in a
separate change.

Signed-off-by: Iliyan Malchev <malchev@google.com>
Change-Id: Id50940bb22a550b413412264508d259f7121d442
2017-10-15 17:05:08 +03:00
Iliyan Malchev
7e87a4dc87 PM: wakeup_reason: correctly deduce wakeup interrupts
The wakeup_reason driver works by having a callback log_wakeup_reason(), be
called by the resume path once for each wakeup interrupt, with the irq number
as argument.  It then saves this interrupt in an array, and reports it when
requested (via /sys/kernel/wakeup_reasons/last_resume_reason) and also prints
the information out in kmsg.

This approach works, but it has the deficiency that often the reported wakeup
interrupt, while correct, is not the interrupt truly responsible for the
wakeup.  The reason for this is due to chained interrupt controllers (whether
in hardware or simulated in software).  It could be, for example, that the
power button is wired to a GPIO handled by a single interrupt for all GPIOs,
which interrupt then determines the GPIO and maps this to a software interrupt.
Whether this is done in software, or by chaining interrupt controllers, the end
result is that the wakeup reason will show not the interrupt associated with
the power button, but the base-GPIO interrupt instead.

This patch reworks the wakeup_sources driver such that it reports those final
interrupts we are interested in, and not the intermediate (and not the base)
ones.  It does so as follows:

-- The assumption is that generic_handle_irq() is called to dispatch all
   interrupts; due to this, chained interrupts result in recursive calls of
   generic_handle_irq().
-- We reconstruct the chains of interrupts that originate with the base wakeup
   interrupt and terminate with the interrupt we are interested in by tracing
   the calls to generic_handle_irq()
-- The tracing works by maitaining a per-cpu counter that is incremented with
   each call to generic_handle_irq(); that counter is reported to the
   wakeup_sources driver by a pair of functions, called
   log_possible_wakeup_reason_start() and log_possible_wakeup_reason_complete().
   The former is called before generic_handle_irq() handles the interrupt
   (thereby potentially calling itself recusively) and the latter afterward.
-- The two functions mentioned above are complemented by log_base_wake_reason()
   (renamed from log_wakeup_reason()), which is used to report the base wakeup
   interrupts to the wakeup_reason driver.
-- The three functions work together to build a set of trees, one per base
   wakeup reason, the leaves of which correspond to the interrupts we are
   interesed in; these trees can be arbitratily complex, though in reality they
   most often are a single node, or a chain of two nodes.  The complexity
   supports arbitrarily involved interrupt dispatch.
-- On resume, we build the tree; once the tree is completed, we walk it
   recursively, and print out to kmesg the (more useful) list of wakeup
   sources; simiarly, we walk the tree and print the leaves when
   /sys/kernel/wakeup_reasons/last_resume_reason is read.

Signed-off-by: Iliyan Malchev <malchev@google.com>
Change-Id: If8acb2951b61d2c6bcf4d011fe04d7f91057d139
2017-10-15 17:05:07 +03:00
Iliyan Malchev
4dbec3e7db irq_flow_handler_t now returns bool
Alter the signature of irq_flow_handler_t to return true for those interrupts
whose handlers were invoked, and false otherwise.  Also rework the actual
handlers, handle_.*_irq, to support the new signature.

Change-Id: I8a50410c477692bbcd39a0fefdac14253602d1f5
Signed-off-by: Iliyan Malchev <malchev@google.com>
2017-10-15 17:05:07 +03:00
Dmitry Shmidt
13e2b3277d PM: wakeup_reason: add check_wakeup_reason() to verify wakeup source irq
Wakeup reason is set before driver resume handlers are called.
It is cleared before driver suspend handlers are called, on
PM_SUSPEND_PREPARE.

Change-Id: I04218c9b0c115a7877e8029c73e6679ff82e0aa4
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
Signed-off-by: Iliyan Malchev <malchev@google.com>
2017-10-15 16:24:04 +03:00
Ruchi Kandoi
e532467dab power: log the last suspend abort reason.
Extends the last_resume_reason to log suspend abort reason. The abort
reasons will have "Abort:" appended at the start to distinguish itself
from the resume reason.

Change-Id: Id3c62fc0cb86ca2e05a69e40de040b94f32be389
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
Signed-off-by: Iliyan Malchev <malchev@google.com>
2017-10-15 16:23:56 +03:00
Ruchi Kandoi
1154a48192 PM: wakeup_reason: add functionality to log the last suspend-abort reason.
Extends the last_resume_reason to log suspend abort reason. The abort
reasons will have "Abort:" appended at the start to distinguish itself
from the resume reason.

Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
Signed-off-by: Iliyan Malchev <malchev@google.com>
Change-Id: I3207f1844e3d87c706dfc298fb10e1c648814c5f
2017-10-15 16:17:14 +03:00
Iliyan Malchev
c9816de694 PM: wakeup_reason: add functions to query and clear wakeup reasons
The query results are valid until the next PM_SUSPEND_PREPARE.

Change-Id: I6bc2bd47c830262319576a001d39ac9a994916cf
Signed-off-by: Iliyan Malchev <malchev@google.com>
2017-10-15 16:17:14 +03:00
jinqian
ec32c90b65 Power: Report suspend times from last_suspend_time
This node epxorts two values separated by space.
From left to right:
1. time spent in suspend/resume process
2. time spent sleep in suspend state

Change-Id: I2cb9a9408a5fd12166aaec11b935a0fd6a408c63
2017-10-15 16:17:13 +03:00
Ruchi Kandoi
92b3e919c4 Power: Add guard condition for maximum wakeup reasons
Ensure the array for the wakeup reason IRQs does not overflow.

Change-Id: Iddc57a3aeb1888f39d4e7b004164611803a4d37c
Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
2017-10-15 16:17:13 +03:00
Stepan Moskovchenko
6365c6e304 PM / Sleep: Clean up remnants of workqueue-based sync
When legacy wakelock code was removed in commit
f85607a715a74c65db812cd3901022888257f966, some of the code
for moving calls to sys_sync() from suspend paths into a
workqueue item had not been properly removed. Specifically,
one of the call sites to suspend_sys_sync_wait() has been
mistakenly replaced with a call to sys_sync(), which is not
necessary because the corresponding instance of
suspend_sys_sync_queue() was already replaced with
sys_sync(). Clean up the remnants of the legacy wakelock
code by removing the extraneous call to sys_sync() and
restoring some of the surrounding printk statements that
had been moved to suspend_sys_sync_queue() and subsequently
lost.

CRs-Fixed: 498669
Change-Id: Ifb2ede7808560f456c824d3d6359a4541c51b73f
Signed-off-by: Stepan Moskovchenko <stepanm@codeaurora.org>
2017-10-15 15:46:58 +03:00
Arve Hjønnevåg
516049d328 PM / Sleep: Fix a mistake in a conditional in autosleep_store()
The condition check in autosleep_store() is incorrect and prevents
/sys/power/autosleep from working as advertised.  Fix that.

[rjw: Added the changelog.]

Change-Id: I231cc24fc3f245003dcf5053ff6a71eb69ffa273
Signed-off-by: Arve Hjønnevåg <arve@android.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Git-commit: 040e5bf65e
Git-repo: git://codeaurora.org/kernel/msm.git
Signed-off-by: Anurag Singh <anursing@codeaurora.org>
2017-10-15 15:46:58 +03:00
Amar Singhal
9801869142 power: main: Add conditional compilation for touch nodes
Add conditional compilation for touch event sysfs nodes. Otherwise,
if CONFIG_PM_SLEEP is not defined, there could be compilation errors.

Change-Id: I1ac7f284ec35eae2cfa076ef8e71c29ddc24817c
Signed-off-by: Amar Singhal <asinghal@codeaurora.org>
Signed-off-by: Anurag Singh <anursing@codeaurora.org>
2017-10-15 15:46:57 +03:00