Commit Graph

27882 Commits

Author SHA1 Message Date
Thomas Gleixner 81cbb3b1dc locking: Various static lock initializer fixes
The static lock initializers want to be fed the proper name of the
lock and not some random string. In mainline random strings are
obfuscating the readability of debug output, but for RT they prevent
the spinlock substitution. Fix it up.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2023-03-24 20:53:04 +01:00
Al Viro d219028583 take rlimit check to callers of expand_files()
... except for one in android, where the check is different
and already done in caller.  No need to recalculate rlimit
many times in alloc_fd() either.

Change-Id: Ia6eb7e1af1047f4d4f188d89deb70d708fa9110a
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-11-26 22:02:15 +01:00
Al Viro 85efe69668 take descriptor-related part of close() to file.c
Change-Id: I939d86833db0108094a9552a9e6e41ac1d092d87
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-11-26 22:02:15 +01:00
Al Viro 998d75d211 take fget() and friends to fs/file.c
Change-Id: I53ad2cab96dc6f64e7ea212ecc04487cc0f06988
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-11-26 22:02:14 +01:00
Al Viro 64e99e7330 expose a low-level variant of fd_install() for binder
Similar situation to that of __alloc_fd(); do not use unless you
really have to.  You should not touch any descriptor table other
than your own; it's a sure sign of a really bad API design.

As with __alloc_fd(), you *must* use a first-class reference to
struct files_struct; something obtained by get_files_struct(some task)
(let alone direct task->files) will not do.  It must be either
current->files, or obtained by get_files_struct(current) by the
owner of that sucker and given to you.

Change-Id: Ia326b598ba7e1b315188ecea21250064433ae620
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-11-26 22:02:14 +01:00
Al Viro e25ec45dc3 move put_unused_fd() and fd_install() to fs/file.c
Change-Id: I38181db167e8c6222c84f62d8d0658e260b7ceb8
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-11-26 22:02:14 +01:00
Al Viro b66d20fae3 new helper: __alloc_fd()
Essentially, alloc_fd() in a files_struct we own a reference to.
Most of the time wanting to use it is a sign of lousy API
design (such as android/binder).  It's *not* a general-purpose
interface; better that than open-coding its guts, but again,
playing with other process' descriptor table is a sign of bad
design.

Change-Id: I0a62c0a1a9162d6e5961878d7dc7ff8ffcf82b56
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-11-26 22:02:13 +01:00
Al Viro af0847bda2 make get_unused_fd_flags() a function
... and get_unused_fd() a macro around it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Change-Id: Id0975a8e07fa48cbb1baa30c17996dcc4b6df9ea
2021-11-26 22:02:13 +01:00
followmsi e7e8f34f94 Merge branch 'lineage-18.1' of https://github.com/LineageOS/android_kernel_google_msm into followmsi-11 2021-11-24 13:34:59 +01:00
Daniel Rosenberg f35f655694 BACKPORT: ANDROID: mnt: Propagate remount correctly
This switches over to propagation_next to respect
namepsace semantics.

Test: Remounting to change the options of a fs with mount based
      options should propagate to all shared copies of that mount,
      and the slaves/indirect slaves of those.
Bug: 122428178
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: Ic35cd2782a646435689f5bedfa1f218fe4ab8254
2021-09-16 18:33:29 -04:00
Eric W. Biederman c48074f579 BACKPORT: propogate_mnt: Handle the first propogated copy being a slave
commit 5ec0811d30378ae104f250bfc9b3640242d81e3f upstream.

When the first propgated copy was a slave the following oops would result:
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
> IP: [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
> PGD bacd4067 PUD bac66067 PMD 0
> Oops: 0000 [#1] SMP
> Modules linked in:
> CPU: 1 PID: 824 Comm: mount Not tainted 4.6.0-rc5userns+ #1523
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> task: ffff8800bb0a8000 ti: ffff8800bac3c000 task.ti: ffff8800bac3c000
> RIP: 0010:[<ffffffff811fba4e>]  [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
> RSP: 0018:ffff8800bac3fd38  EFLAGS: 00010283
> RAX: 0000000000000000 RBX: ffff8800bb77ec00 RCX: 0000000000000010
> RDX: 0000000000000000 RSI: ffff8800bb58c000 RDI: ffff8800bb58c480
> RBP: ffff8800bac3fd48 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000001ca1 R11: 0000000000001c9d R12: 0000000000000000
> R13: ffff8800ba713800 R14: ffff8800bac3fda0 R15: ffff8800bb77ec00
> FS:  00007f3c0cd9b7e0(0000) GS:ffff8800bfb00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000010 CR3: 00000000bb79d000 CR4: 00000000000006e0
> Stack:
>  ffff8800bb77ec00 0000000000000000 ffff8800bac3fd88 ffffffff811fbf85
>  ffff8800bac3fd98 ffff8800bb77f080 ffff8800ba713800 ffff8800bb262b40
>  0000000000000000 0000000000000000 ffff8800bac3fdd8 ffffffff811f1da0
> Call Trace:
>  [<ffffffff811fbf85>] propagate_mnt+0x105/0x140
>  [<ffffffff811f1da0>] attach_recursive_mnt+0x120/0x1e0
>  [<ffffffff811f1ec3>] graft_tree+0x63/0x70
>  [<ffffffff811f1f6b>] do_add_mount+0x9b/0x100
>  [<ffffffff811f2c1a>] do_mount+0x2aa/0xdf0
>  [<ffffffff8117efbe>] ? strndup_user+0x4e/0x70
>  [<ffffffff811f3a45>] SyS_mount+0x75/0xc0
>  [<ffffffff8100242b>] do_syscall_64+0x4b/0xa0
>  [<ffffffff81988f3c>] entry_SYSCALL64_slow_path+0x25/0x25
> Code: 00 00 75 ec 48 89 0d 02 22 22 01 8b 89 10 01 00 00 48 89 05 fd 21 22 01 39 8e 10 01 00 00 0f 84 e0 00 00 00 48 8b 80 d8 00 00 00 <48> 8b 50 10 48 89 05 df 21 22 01 48 89 15 d0 21 22 01 8b 53 30
> RIP  [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
>  RSP <ffff8800bac3fd38>
> CR2: 0000000000000010
> ---[ end trace 2725ecd95164f217 ]---

This oops happens with the namespace_sem held and can be triggered by
non-root users.  An all around not pleasant experience.

To avoid this scenario when finding the appropriate source mount to
copy stop the walk up the mnt_master chain when the first source mount
is encountered.

Further rewrite the walk up the last_source mnt_master chain so that
it is clear what is going on.

The reason why the first source mount is special is that it it's
mnt_parent is not a mount in the dest_mnt propagation tree, and as
such termination conditions based up on the dest_mnt mount propgation
tree do not make sense.

To avoid other kinds of confusion last_dest is not changed when
computing last_source.  last_dest is only used once in propagate_one
and that is above the point of the code being modified, so changing
the global variable is meaningless and confusing.

fixes: f2ebb3a921 ("smarter propagate_mnt()")
Reported-by: Tycho Andersen <tycho.andersen@canonical.com>
Reviewed-by: Seth Forshee <seth.forshee@canonical.com>
Tested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: Ie55a2c52db9773b461acc6ebe427221acb7093f0
2021-09-16 14:14:38 -04:00
Maxim Patlasov 01609cc4af BACKPORT: fs/pnode.c: treat zero mnt_group_id-s as unequal
commit 7ae8fd0351f912b075149a1e03a017be8b903b9a upstream.

propagate_one(m) calculates "type" argument for copy_tree() like this:

>    if (m->mnt_group_id == last_dest->mnt_group_id) {
>        type = CL_MAKE_SHARED;
>    } else {
>        type = CL_SLAVE;
>        if (IS_MNT_SHARED(m))
>           type |= CL_MAKE_SHARED;
>   }

The "type" argument then governs clone_mnt() behavior with respect to flags
and mnt_master of new mount. When we iterate through a slave group, it is
possible that both current "m" and "last_dest" are not shared (although,
both are slaves, i.e. have non-NULL mnt_master-s). Then the comparison
above erroneously makes new mount shared and sets its mnt_master to
last_source->mnt_master. The patch fixes the problem by handling zero
mnt_group_id-s as though they are unequal.

The similar problem exists in the implementation of "else" clause above
when we have to ascend upward in the master/slave tree by calling:

>    last_source = last_source->mnt_master;
>    last_dest = last_source->mnt_parent;

proper number of times. The last step is governed by
"n->mnt_group_id != last_dest->mnt_group_id" condition that may lie if
both are zero. The patch fixes this case in the same way as the former one.

[AV: don't open-code an obvious helper...]

Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: I78454c89b1f672e49c8ffcf63d1339a3e371aa87
2021-09-16 14:14:33 -04:00
followmsi 51290e763f Merge branch 'lineage-18.1' into followmsi-11 2021-08-23 14:34:45 +02:00
Daniel Rosenberg 5c4b88269c ANDROID: sdcardfs: Add option to not link obb
Add mount option unshared_obb to not link the obb
folders of multiple users together.

Bug: 27915347
Test: mount with option. Check if altering one obb
      alters the other
Signed-off-by: Daniel Rosenberg <drosen@google.com>

Change-Id: I3956e06bd0a222b0bbb2768c9a8a8372ada85e1e
2021-05-08 17:13:15 -04:00
Daniel Rosenberg 5e5a0b5125 fs: sdcardfs: Add missing option to show_options
unshared_obb was missing from show_options

bug: 133257717
Change-Id: I1bc49d1b4098052382a518540e5965e037aa39f1
Signed-off-by: Kevin F. Haggerty <haggertk@lineageos.org>
2021-05-08 17:11:24 -04:00
Amir Goldstein 42110ed780 locks: print unsigned ino in /proc/locks
commit 98ca480a8f22fdbd768e3dad07024c8d4856576c upstream.

An ino is unsigned, so display it as such in /proc/locks.

Cc: stable@vger.kernel.org
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Change-Id: I250a495fe3fc809e880535347f462fe552644edf
2021-01-24 09:56:22 +00:00
Jeff Layton 5e483034fd locks: rename FL_FILE_PVT and IS_FILE_PVT to use "*_OFDLCK" instead
File-private locks have been re-christened as "open file description"
locks.  Finish the symbol name cleanup in the internal implementation.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Change-Id: Iee48047540a7d8fefb5078cc005ae9ea8994f521
2021-01-24 09:56:22 +00:00
Jeff Layton 059e6ee3a1 locks: rename file-private locks to "open file description locks"
File-private locks have been merged into Linux for v3.15, and *now*
people are commenting that the name and macro definitions for the new
file-private locks suck.

...and I can't even disagree. The names and command macros do suck.

We're going to have to live with these for a long time, so it's
important that we be happy with the names before we're stuck with them.
The consensus on the lists so far is that they should be rechristened as
"open file description locks".

The name isn't a big deal for the kernel, but the command macros are not
visually distinct enough from the traditional POSIX lock macros. The
glibc and documentation folks are recommending that we change them to
look like F_OFD_{GETLK|SETLK|SETLKW}. That lessens the chance that a
programmer will typo one of the commands wrong, and also makes it easier
to spot this difference when reading code.

This patch makes the following changes that I think are necessary before
v3.15 ships:

1) rename the command macros to their new names. These end up in the uapi
   headers and so are part of the external-facing API. It turns out that
   glibc doesn't actually use the fcntl.h uapi header, but it's hard to
   be sure that something else won't. Changing it now is safest.

2) make the the /proc/locks output display these as type "OFDLCK"

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Carlos O'Donell <carlos@redhat.com>
Cc: Stefan Metzmacher <metze@samba.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Frank Filz <ffilzlnx@mindspring.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Change-Id: Ia975197281d4c80a4ad420d7621896d2f369cef6
2021-01-24 09:56:22 +00:00
Jeff Layton 45ab3fb2a2 locks: add new fcntl cmd values for handling file private locks
Due to some unfortunate history, POSIX locks have very strange and
unhelpful semantics. The thing that usually catches people by surprise
is that they are dropped whenever the process closes any file descriptor
associated with the inode.

This is extremely problematic for people developing file servers that
need to implement byte-range locks. Developers often need a "lock
management" facility to ensure that file descriptors are not closed
until all of the locks associated with the inode are finished.

Additionally, "classic" POSIX locks are owned by the process. Locks
taken between threads within the same process won't conflict with one
another, which renders them useless for synchronization between threads.

This patchset adds a new type of lock that attempts to address these
issues. These locks conflict with classic POSIX read/write locks, but
have semantics that are more like BSD locks with respect to inheritance
and behavior on close.

This is implemented primarily by changing how fl_owner field is set for
these locks. Instead of having them owned by the files_struct of the
process, they are instead owned by the filp on which they were acquired.
Thus, they are inherited across fork() and are only released when the
last reference to a filp is put.

These new semantics prevent them from being merged with classic POSIX
locks, even if they are acquired by the same process. These locks will
also conflict with classic POSIX locks even if they are acquired by
the same process or on the same file descriptor.

The new locks are managed using a new set of cmd values to the fcntl()
syscall. The initial implementation of this converts these values to
"classic" cmd values at a fairly high level, and the details are not
exposed to the underlying filesystem. We may eventually want to push
this handing out to the lower filesystem code but for now I don't
see any need for it.

Also, note that with this implementation the new cmd values are only
available via fcntl64() on 32-bit arches. There's little need to
add support for legacy apps on a new interface like this.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Change-Id: I35691bdfed9cadcbbcb6ff6804d9eea1db661ddc
2021-01-24 09:56:22 +00:00
Jeff Layton ef04cb49df locks: pass the cmd value to fcntl_getlk/getlk64
Once we introduce file private locks, we'll need to know what cmd value
was used, as that affects the ownership and whether a conflict would
arise.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Change-Id: Iaeb8233ae25bde5ef0049118ff94e4a9e0f02214
2021-01-24 09:56:22 +00:00
Jeff Layton 9c43335d3c locks: report l_pid as -1 for FL_FILE_PVT locks
FL_FILE_PVT locks are no longer tied to a particular pid, and are
instead inheritable by child processes. Report a l_pid of '-1' for
these sorts of locks since the pid is somewhat meaningless for them.

This precedent comes from FreeBSD. There, POSIX and flock() locks can
conflict with one another. If fcntl(F_GETLK, ...) returns a lock set
with flock() then the l_pid member cannot be a process ID because the
lock is not held by a process as such.

Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Change-Id: I7d702fcaaaf8592356926d51b60e53ee217ca747
2021-01-24 09:56:22 +00:00
Jeff Layton a84df81cdf locks: make /proc/locks show IS_FILE_PVT locks as type "FLPVT"
In a later patch, we'll be adding a new type of lock that's owned by
the struct file instead of the files_struct. Those sorts of locks
will be flagged with a new FL_FILE_PVT flag.

Report these types of locks as "FLPVT" in /proc/locks to distinguish
them from "classic" POSIX locks.

Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Change-Id: Id0b6d9c7a947b512e5683ad3b6188d73582c2de9
2021-01-24 09:56:22 +00:00
Jeff Layton e3691f6a9e locks: rename locks_remove_flock to locks_remove_file
This function currently removes leases in addition to flock locks and in
a later patch we'll have it deal with file-private locks too. Rename it
to locks_remove_file to indicate that it removes locks that are
associated with a particular struct file, and not just flock locks.

Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Change-Id: I1289cfbc02eb778532e984a29adffb02a9370cc1
2021-01-24 09:56:22 +00:00
followmsi 8e1649a3cc Merge branch 'lineage-18.1' into followmsi-11 2020-12-19 13:42:14 +01:00
Eric W. Biederman 3e34c77b60 fs: Limit sys_mount to only request filesystem modules.
Modify the request_module to prefix the file system type with "fs-"
and add aliases to all of the filesystems that can be built as modules
to match.

A common practice is to build all of the kernel code and leave code
that is not commonly needed as modules, with the result that many
users are exposed to any bug anywhere in the kernel.

Looking for filesystems with a fs- prefix limits the pool of possible
modules that can be loaded by mount to just filesystems trivially
making things safer with no real cost.

Using aliases means user space can control the policy of which
filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
with blacklist and alias directives.  Allowing simple, safe,
well understood work-arounds to known problematic software.

This also addresses a rare but unfortunate problem where the filesystem
name is not the same as it's module name and module auto-loading
would not work.  While writing this patch I saw a handful of such
cases.  The most significant being autofs that lives in the module
autofs4.

This is relevant to user namespaces because we can reach the request
module in get_fs_type() without having any special permissions, and
people get uncomfortable when a user specified string (in this case
the filesystem type) goes all of the way to request_module.

After having looked at this issue I don't think there is any
particular reason to perform any filtering or permission checks beyond
making it clear in the module request that we want a filesystem
module.  The common pattern in the kernel is to call request_module()
without regards to the users permissions.  In general all a filesystem
module does once loaded is call register_filesystem() and go to sleep.
Which means there is not much attack surface exposed by loading a
filesytem module unless the filesystem is mounted.  In a user
namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
which most filesystems do not set today.

Change-Id: I623b13dbdb44bb9ba7481f29575e1ca4ad8102f4
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Kees Cook <keescook@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Kevin F. Haggerty <haggertk@lineageos.org>
2020-12-14 20:34:05 +01:00
followmsi 5aa785e6e7 Merge branch 'lineage-18.0' into followmsi-11 2020-12-08 14:20:33 +01:00
David Herrmann 68d3e84f05 shm: add sealing API
If two processes share a common memory region, they usually want some
guarantees to allow safe access. This often includes:
  - one side cannot overwrite data while the other reads it
  - one side cannot shrink the buffer while the other accesses it
  - one side cannot grow the buffer beyond previously set boundaries

If there is a trust-relationship between both parties, there is no need
for policy enforcement.  However, if there's no trust relationship (eg.,
for general-purpose IPC) sharing memory-regions is highly fragile and
often not possible without local copies.  Look at the following two
use-cases:

  1) A graphics client wants to share its rendering-buffer with a
     graphics-server. The memory-region is allocated by the client for
     read/write access and a second FD is passed to the server. While
     scanning out from the memory region, the server has no guarantee that
     the client doesn't shrink the buffer at any time, requiring rather
     cumbersome SIGBUS handling.
  2) A process wants to perform an RPC on another process. To avoid huge
     bandwidth consumption, zero-copy is preferred. After a message is
     assembled in-memory and a FD is passed to the remote side, both sides
     want to be sure that neither modifies this shared copy, anymore. The
     source may have put sensible data into the message without a separate
     copy and the target may want to parse the message inline, to avoid a
     local copy.

While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
ways to achieve most of this, the first one is unproportionally ugly to
use in libraries and the latter two are broken/racy or even disabled due
to denial of service attacks.

This patch introduces the concept of SEALING.  If you seal a file, a
specific set of operations is blocked on that file forever.  Unlike locks,
seals can only be set, never removed.  Hence, once you verified a specific
set of seals is set, you're guaranteed that no-one can perform the blocked
operations on this file, anymore.

An initial set of SEALS is introduced by this patch:
  - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
            in size. This affects ftruncate() and open(O_TRUNC).
  - GROW: If SEAL_GROW is set, the file in question cannot be increased
          in size. This affects ftruncate(), fallocate() and write().
  - WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
           are possible. This affects fallocate(PUNCH_HOLE), mmap() and
           write().
  - SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
          This basically prevents the F_ADD_SEAL operation on a file and
          can be set to prevent others from adding further seals that you
          don't want.

The described use-cases can easily use these seals to provide safe use
without any trust-relationship:

  1) The graphics server can verify that a passed file-descriptor has
     SEAL_SHRINK set. This allows safe scanout, while the client is
     allowed to increase buffer size for window-resizing on-the-fly.
     Concurrent writes are explicitly allowed.
  2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
     SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
     process can modify the data while the other side parses it.
     Furthermore, it guarantees that even with writable FDs passed to the
     peer, it cannot increase the size to hit memory-limits of the source
     process (in case the file-storage is accounted to the source).

The new API is an extension to fcntl(), adding two new commands:
  F_GET_SEALS: Return a bitset describing the seals on the file. This
               can be called on any FD if the underlying file supports
               sealing.
  F_ADD_SEALS: Change the seals of a given file. This requires WRITE
               access to the file and F_SEAL_SEAL may not already be set.
               Furthermore, the underlying file must support sealing and
               there may not be any existing shared mapping of that file.
               Otherwise, EBADF/EPERM is returned.
               The given seals are _added_ to the existing set of seals
               on the file. You cannot remove seals again.

The fcntl() handler is currently specific to shmem and disabled on all
files. A file needs to explicitly support sealing for this interface to
work. A separate syscall is added in a follow-up, which creates files that
support sealing. There is no intention to support this on other
file-systems. Semantics are unclear for non-volatile files and we lack any
use-case right now. Therefore, the implementation is specific to shmem.

Change-Id: I2d6247d3287c61dbe6bafabf56554e80b414f938
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ryan Lortie <desrt@desrt.ca>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <zonque@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-07 22:15:12 +03:00
David Herrmann c54f92b735 mm: allow drivers to prevent new writable mappings
This patch (of 6):

The i_mmap_writable field counts existing writable mappings of an
address_space.  To allow drivers to prevent new writable mappings, make
this counter signed and prevent new writable mappings if it is negative.
This is modelled after i_writecount and DENYWRITE.

This will be required by the shmem-sealing infrastructure to prevent any
new writable mappings after the WRITE seal has been set.  In case there
exists a writable mapping, this operation will fail with EBUSY.

Note that we rely on the fact that iff you already own a writable mapping,
you can increase the counter without using the helpers.  This is the same
that we do for i_writecount.

Change-Id: Id16c5b650e451956a4f6df004483cb63197c613c
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ryan Lortie <desrt@desrt.ca>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <zonque@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-07 21:08:09 +03:00
Dan Carpenter a3a7bc82f0 fs: NULL dereference in posix_acl_to_xattr()
commit 47ba973440 upstream.

This patch moves the dereference of "buffer" after the check for NULL.
The only place which passes a NULL parameter is gfs2_set_acl().

Change-Id: I7ede500c05e646e4c07238d159b8f182a1fbf80d
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-07 21:05:11 +03:00
Trond Myklebust b8e7518373 fs: get_acl() must be allowed to return EOPNOTSUPP
posix_acl_xattr_get requires get_acl() to return EOPNOTSUPP if the
filesystem cannot support acls. This is needed for NFS, which can't
know whether or not the server supports acls until it tries to get/set
one.
This patch converts posix_acl_chmod and posix_acl_create to deal with
EOPNOTSUPP return values from get_acl().

Change-Id: I931423ae763a4950056c7b20938be6f1f4536e24
Reported-by: Russell King <linux@arm.linux.org.uk>
Link: http://lkml.kernel.org/r/20140130140834.GW15937@n2100.arm.linux.org.uk
Cc: Al Viro viro@zeniv.linux.org.uk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2020-12-07 21:05:08 +03:00
Christoph Hellwig 029e019446 fs: remove generic_acl
And instead convert tmpfs to use the new generic ACL code, with two stub
methods provided for in-memory filesystems.

Change-Id: Ide27840378dbbc062c32ded2a6420c1b6c28f57e
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-12-07 21:02:53 +03:00
Christoph Hellwig 17391fcb76 fs: make posix_acl_create more useful
Rename the current posix_acl_created to __posix_acl_create and add
a fully featured helper to set up the ACLs on file creation that
uses get_acl().

Change-Id: I7d8de350fe89ef3d2f9ff6eaa2c198b5403d33fc
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-12-07 21:02:49 +03:00
Christoph Hellwig 46d436597e fs: make posix_acl_chmod more useful
Rename the current posix_acl_chmod to __posix_acl_chmod and add
a fully featured ACL chmod helper that uses the ->set_acl inode
operation.

Change-Id: I503ed1049a28ad01d32fe3fa85d8fc9b7e12610f
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-12-07 21:02:46 +03:00
Christoph Hellwig c9c8b0a4c3 fs: add generic xattr_acl handlers
With the ->set_acl inode operation we can implement the Posix ACL
xattr handlers in generic code instead of duplicating them all
over the tree.

Change-Id: I473c270b801d7faf3338d68a29c749ff929fc575
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-12-07 21:02:43 +03:00
Christoph Hellwig 2da5f19f1d fs: add get_acl helper
Factor out the code to get an ACL either from the inode or disk from
check_acl, so that it can be used elsewhere later on.

Change-Id: I81fab0da8228eaee0ba14b9b9942071caa12aeae
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-12-07 21:02:40 +03:00
Christoph Hellwig cebd9bb7bd fs: merge xattr_acl.c into posix_acl.c
Change-Id: I13a2b882e8ccd34cf858155d3a40beb3643ef216
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-12-07 21:02:37 +03:00
Andreas Gruenbacher 560a564a11 userns: relax the posix_acl_valid() checks
So far, POSIX ACLs are using a canonical representation that keeps all ACL
entries in a strict order; the ACL_USER and ACL_GROUP entries for specific
users and groups are ordered by user and group identifier, respectively.
The user-space code provides ACL entries in this order; the kernel
verifies that the ACL entry order is correct in posix_acl_valid().

User namespaces allow to arbitrary map user and group identifiers which
can cause the ACL_USER and ACL_GROUP entry order to differ between user
space and the kernel; posix_acl_valid() would then fail.

Work around this by allowing ACL_USER and ACL_GROUP entries to be in any
order in the kernel.  The effect is only minor: file permission checks
will pick the first matching ACL_USER entry, and check all matching
ACL_GROUP entries.

(The libacl user-space library and getfacl / setfacl tools will not create
ACLs with duplicate user or group idenfifiers; they will handle ACLs with
entries in an arbitrary order correctly.)

Change-Id: Ib73a93c56fb8029102ba2aec8ea3b56a7467fb86
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-07 21:02:33 +03:00
David Rientjes 67cc7b4e9a fs, xattr: fix bug when removing a name not in xattr list
Commit 38f3865744 ("xattr: extract simple_xattr code from tmpfs") moved
some code from tmpfs but introduced a subtle bug along the way.

If the name passed to simple_xattr_remove() does not exist in the list of
xattrs, then it is possible to call kfree(new_xattr) when new_xattr is
actually initialized to itself on the stack via uninitialized_var().

This causes a BUG() since the memory was not allocated via the slab
allocator and was not bypassed through to the page allocator because it
was too large.

Initialize the local variable to NULL so the kfree() never takes place.

Change-Id: I0f090df631e871657fb31914dce57c13e81e25c2
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-07 21:02:30 +03:00
Eric W. Biederman dbb94d545d userns: Fix posix_acl_file_xattr_userns gid conversion
The code needs to be from_kgid(make_kgid(...)...) not
from_kuid(make_kgid(...)...). Doh!

Change-Id: Ib3e4a257e76ba85e92bb4fe272c6af7c4dc60b5b
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-12-07 21:02:27 +03:00
Mimi Zohar bbd891e70c vfs: extend vfs_removexattr locking
This patch takes the i_mutex lock before security_inode_removexattr(),
instead of after, in preparation of calling ima_inode_removexattr().

Change-Id: I969fa5536599242a3403930379c49d733b1e1295
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com>
2020-12-07 21:02:24 +03:00
Eric W. Biederman 9a183cbdb0 userns: Pass a userns parameter into posix_acl_to_xattr and posix_acl_from_xattr
- Pass the user namespace the uid and gid values in the xattr are stored
   in into posix_acl_from_xattr.

 - Pass the user namespace kuid and kgid values should be converted into
   when storing uid and gid values in an xattr in posix_acl_to_xattr.

- Modify all callers of posix_acl_from_xattr and posix_acl_to_xattr to
  pass in &init_user_ns.

In the short term this change is not strictly needed but it makes the
code clearer.  In the longer term this change is necessary to be able to
mount filesystems outside of the initial user namespace that natively
store posix acls in the linux xattr format.

Change-Id: I7c2b18f16ec9d7ded49135cedc2e91a71e078087
Cc: Theodore Tso <tytso@mit.edu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-12-07 21:02:21 +03:00
Eric W. Biederman 7c4459717a userns: Convert vfs posix_acl support to use kuids and kgids
- In setxattr if we are setting a posix acl convert uids and gids from
  the current user namespace into the initial user namespace, before
  the xattrs are passed to the underlying filesystem.

  Untranslatable uids and gids are represented as -1 which
  posix_acl_from_xattr will represent as INVALID_UID or INVALID_GID.
  posix_acl_valid will fail if an acl from userspace has any
  INVALID_UID or INVALID_GID values.  In net this guarantees that
  untranslatable posix acls will not be stored by filesystems.

- In getxattr if we are reading a posix acl convert uids and gids from
  the initial user namespace into the current user namespace.

  Uids and gids that can not be tranlsated into the current user namespace
  will be represented as -1.

- Replace e_id in struct posix_acl_entry with an anymouns union of
  e_uid and e_gid.  For the short term retain the e_id field
  until all of the users are converted.

- Don't set struct posix_acl.e_id in the cases where the acl type
  does not use e_id.  Greatly reducing the use of ACL_UNDEFINED_ID.

- Rework the ordering checks in posix_acl_valid so that I use kuid_t
  and kgid_t types throughout the code, and so that I don't need
  arithmetic on uid and gid types.

Change-Id: If5ca5f58195e29b3c8cb51bd7769c980dc7da3a4
Cc: Theodore Tso <tytso@mit.edu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-07 21:02:17 +03:00
Aristeu Rozanski 01a31ec067 xattr: mark variable as uninitialized to make both gcc and smatch happy
new_xattr in __simple_xattr_set() is only initialized with a valid
pointer if value is not NULL, which only happens if this function is
called directly with the intention to remove an existing extended
attribute. Even being safe to be this way, smatch warns about possible
NULL dereference. Dan Carpenter suggested using uninitialized_var()
which will make both gcc and smatch happy.

Change-Id: I0a7ff73cd9bda419961e8421b1472ed14a64f71c
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-12-07 21:02:14 +03:00
Aristeu Rozanski eb39b11c9b fs: add missing documentation to simple_xattr functions
v2: add function documentation instead of adding a separate file under
    Documentation/

tj: Updated comment a bit and rolled in Randy's suggestions.

Change-Id: I2be6be5f4987127c354e2cca490876085ceaea45
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-12-07 21:02:11 +03:00
Jan Kara 44882b1b6f lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interrupt
With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
one such possible user), the following race can happen:

radix_tree_preload()
...
radix_tree_insert()
  radix_tree_node_alloc()
    if (rtp->nr) {
      ret = rtp->nodes[rtp->nr - 1];
<interrupt>
...
radix_tree_preload()
...
radix_tree_insert()
  radix_tree_node_alloc()
    if (rtp->nr) {
      ret = rtp->nodes[rtp->nr - 1];

And we give out one radix tree node twice.  That clearly results in radix
tree corruption with different results (usually OOPS) depending on which
two users of radix tree race.

We fix the problem by making radix_tree_node_alloc() always allocate fresh
radix tree nodes when in interrupt.  Using preloading when in interrupt
doesn't make sense since all the allocations have to be atomic anyway and
we cannot steal nodes from process-context users because some users rely
on radix_tree_insert() succeeding after radix_tree_preload().
in_interrupt() check is somewhat ugly but we cannot simply key off passed
gfp_mask as that is acquired from root_gfp_mask() and thus the same for
all preload users.

Another part of the fix is to avoid node preallocation in
radix_tree_preload() when passed gfp_mask doesn't allow waiting.  Again,
preallocation in such case doesn't make sense and when preallocation would
happen in interrupt we could possibly leak some allocated nodes.  However,
some users of radix_tree_preload() require following radix_tree_insert()
to succeed.  To avoid unexpected effects for these users,
radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
and we provide a new function radix_tree_maybe_preload() for those users
which get different gfp mask from different call sites and which are
prepared to handle radix_tree_insert() failure.

Change-Id: Iab513ed95e8a98cd890e68a820181a8a915da9aa
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-07 21:02:05 +03:00
Aristeu Rozanski d8542baf2a xattr: extract simple_xattr code from tmpfs
Extract in-memory xattr APIs from tmpfs. Will be used by cgroup.

$ size vmlinux.o
   text    data     bss     dec     hex filename
4658782  880729 5195032 10734543         a3cbcf vmlinux.o
$ size vmlinux.o
   text    data     bss     dec     hex filename
4658957  880729 5195032 10734718         a3cc7e vmlinux.o

v7:
- checkpatch warnings fixed
- Implement the changes requested by Hugh Dickins:
	- make simple_xattrs_init and simple_xattrs_free inline
	- get rid of locking and list reinitialization in simple_xattrs_free,
	  they're not needed
v6:
- no changes
v5:
- no changes
v4:
- move simple_xattrs_free() to fs/xattr.c
v3:
- in kmem_xattrs_free(), reinitialize the list
- use simple_xattr_* prefix
- introduce simple_xattr_add() to prevent direct list usage

Change-Id: Id683f37a6190127866f75c5b4bb8fafc6491dcd8
Original-patch-by: Li Zefan <lizefan@huawei.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-12-07 21:01:31 +03:00
Al Viro ecd0ea0fb7 switch xattr syscalls to fget_light/fput_light
Change-Id: Iaccab81ce6951d3250c8f2d1d21669ae6ca08848
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-12-07 21:01:27 +03:00
Hugh Dickins 1a139cbc38 mm/fs: remove truncate_range
Remove vmtruncate_range(), and remove the truncate_range method from
struct inode_operations: only tmpfs ever supported it, and tmpfs has now
converted over to using the fallocate method of file_operations.

Update Documentation accordingly, adding (setlease and) fallocate lines.
And while we're in mm.h, remove duplicate declarations of shmem_lock() and
shmem_file_setup(): everyone is now using the ones in shmem_fs.h.

Change-Id: I1452e3356333d19b778ef60d6cf7625c31b77bf8
Based-on-patch-by: Cong Wang <amwang@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Cong Wang <amwang@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-07 20:57:30 +03:00
Rafael Aquini 29af476bb5 mm: export NR_SHMEM via sysinfo(2) / si_meminfo() interfaces
Historically, we exported shared pages to userspace via sysinfo(2)
sharedram and /proc/meminfo's "MemShared" fields.  With the advent of
tmpfs, from kernel v2.4 onward, that old way for accounting shared mem
was deemed inaccurate and we started to export a hard-coded 0 for
sysinfo.sharedram.  Later on, during the 2.6 timeframe, "MemShared" got
re-introduced to /proc/meminfo re-branded as "Shmem", but we're still
reporting sysinfo.sharedmem as that old hard-coded zero, which makes the
"shared memory" report inconsistent across interfaces.

This patch leverages the addition of explicit accounting for pages used
by shmem/tmpfs -- "4b02108 mm: oom analysis: add shmem vmstat" -- in
order to make the users of sysinfo(2) and si_meminfo*() friends aware of
that vmstat entry and make them report it consistently across the
interfaces, as well to make sysinfo(2) returned data consistent with our
current API documentation states.

Change-Id: I51474973cc267ee368352e96e7229e0101f38aca
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-01 19:08:36 +01:00
Igor Redko 37e62218c3 mm/page_alloc.c: calculate 'available' memory in a separate function
commit d02bd27bd33dd7e8d22594cd568b81be0cb584cd upstream.

Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
statistics protocol, corresponding to 'Available' in /proc/meminfo.

It indicates to the hypervisor how big the balloon can be inflated
without pushing the guest system to swap.  This metric would be very
useful in VM orchestration software to improve memory management of
different VMs under overcommit.

This patch (of 2):

Factor out calculation of the available memory counter into a separate
exportable function, in order to be able to use it in other parts of the
kernel.

In particular, it appears a relevant metric to report to the hypervisor
via virtio-balloon statistics interface (in a followup patch).

Signed-off-by: Igor Redko <redkoi@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 4.4 as dependency of commit a1078e821b60
 "xen: let alloc_xenballooned_pages() fail if not enough memory free"]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Change-Id: Ib8cae94ddd9742752ecc795129be642c57f8cc15
2020-12-01 19:08:20 +01:00