Essentially, alloc_fd() in a files_struct we own a reference to.
Most of the time wanting to use it is a sign of lousy API
design (such as android/binder). It's *not* a general-purpose
interface; better that than open-coding its guts, but again,
playing with other process' descriptor table is a sign of bad
design.
Change-Id: I0a62c0a1a9162d6e5961878d7dc7ff8ffcf82b56
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This switches over to propagation_next to respect
namepsace semantics.
Test: Remounting to change the options of a fs with mount based
options should propagate to all shared copies of that mount,
and the slaves/indirect slaves of those.
Bug: 122428178
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: Ic35cd2782a646435689f5bedfa1f218fe4ab8254
commit 5ec0811d30378ae104f250bfc9b3640242d81e3f upstream.
When the first propgated copy was a slave the following oops would result:
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
> IP: [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
> PGD bacd4067 PUD bac66067 PMD 0
> Oops: 0000 [#1] SMP
> Modules linked in:
> CPU: 1 PID: 824 Comm: mount Not tainted 4.6.0-rc5userns+ #1523
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> task: ffff8800bb0a8000 ti: ffff8800bac3c000 task.ti: ffff8800bac3c000
> RIP: 0010:[<ffffffff811fba4e>] [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
> RSP: 0018:ffff8800bac3fd38 EFLAGS: 00010283
> RAX: 0000000000000000 RBX: ffff8800bb77ec00 RCX: 0000000000000010
> RDX: 0000000000000000 RSI: ffff8800bb58c000 RDI: ffff8800bb58c480
> RBP: ffff8800bac3fd48 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000001ca1 R11: 0000000000001c9d R12: 0000000000000000
> R13: ffff8800ba713800 R14: ffff8800bac3fda0 R15: ffff8800bb77ec00
> FS: 00007f3c0cd9b7e0(0000) GS:ffff8800bfb00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000010 CR3: 00000000bb79d000 CR4: 00000000000006e0
> Stack:
> ffff8800bb77ec00 0000000000000000 ffff8800bac3fd88 ffffffff811fbf85
> ffff8800bac3fd98 ffff8800bb77f080 ffff8800ba713800 ffff8800bb262b40
> 0000000000000000 0000000000000000 ffff8800bac3fdd8 ffffffff811f1da0
> Call Trace:
> [<ffffffff811fbf85>] propagate_mnt+0x105/0x140
> [<ffffffff811f1da0>] attach_recursive_mnt+0x120/0x1e0
> [<ffffffff811f1ec3>] graft_tree+0x63/0x70
> [<ffffffff811f1f6b>] do_add_mount+0x9b/0x100
> [<ffffffff811f2c1a>] do_mount+0x2aa/0xdf0
> [<ffffffff8117efbe>] ? strndup_user+0x4e/0x70
> [<ffffffff811f3a45>] SyS_mount+0x75/0xc0
> [<ffffffff8100242b>] do_syscall_64+0x4b/0xa0
> [<ffffffff81988f3c>] entry_SYSCALL64_slow_path+0x25/0x25
> Code: 00 00 75 ec 48 89 0d 02 22 22 01 8b 89 10 01 00 00 48 89 05 fd 21 22 01 39 8e 10 01 00 00 0f 84 e0 00 00 00 48 8b 80 d8 00 00 00 <48> 8b 50 10 48 89 05 df 21 22 01 48 89 15 d0 21 22 01 8b 53 30
> RIP [<ffffffff811fba4e>] propagate_one+0xbe/0x1c0
> RSP <ffff8800bac3fd38>
> CR2: 0000000000000010
> ---[ end trace 2725ecd95164f217 ]---
This oops happens with the namespace_sem held and can be triggered by
non-root users. An all around not pleasant experience.
To avoid this scenario when finding the appropriate source mount to
copy stop the walk up the mnt_master chain when the first source mount
is encountered.
Further rewrite the walk up the last_source mnt_master chain so that
it is clear what is going on.
The reason why the first source mount is special is that it it's
mnt_parent is not a mount in the dest_mnt propagation tree, and as
such termination conditions based up on the dest_mnt mount propgation
tree do not make sense.
To avoid other kinds of confusion last_dest is not changed when
computing last_source. last_dest is only used once in propagate_one
and that is above the point of the code being modified, so changing
the global variable is meaningless and confusing.
fixes: f2ebb3a921 ("smarter propagate_mnt()")
Reported-by: Tycho Andersen <tycho.andersen@canonical.com>
Reviewed-by: Seth Forshee <seth.forshee@canonical.com>
Tested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: Ie55a2c52db9773b461acc6ebe427221acb7093f0
commit 7ae8fd0351f912b075149a1e03a017be8b903b9a upstream.
propagate_one(m) calculates "type" argument for copy_tree() like this:
> if (m->mnt_group_id == last_dest->mnt_group_id) {
> type = CL_MAKE_SHARED;
> } else {
> type = CL_SLAVE;
> if (IS_MNT_SHARED(m))
> type |= CL_MAKE_SHARED;
> }
The "type" argument then governs clone_mnt() behavior with respect to flags
and mnt_master of new mount. When we iterate through a slave group, it is
possible that both current "m" and "last_dest" are not shared (although,
both are slaves, i.e. have non-NULL mnt_master-s). Then the comparison
above erroneously makes new mount shared and sets its mnt_master to
last_source->mnt_master. The patch fixes the problem by handling zero
mnt_group_id-s as though they are unequal.
The similar problem exists in the implementation of "else" clause above
when we have to ascend upward in the master/slave tree by calling:
> last_source = last_source->mnt_master;
> last_dest = last_source->mnt_parent;
proper number of times. The last step is governed by
"n->mnt_group_id != last_dest->mnt_group_id" condition that may lie if
both are zero. The patch fixes this case in the same way as the former one.
[AV: don't open-code an obvious helper...]
Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: I78454c89b1f672e49c8ffcf63d1339a3e371aa87
Add mount option unshared_obb to not link the obb
folders of multiple users together.
Bug: 27915347
Test: mount with option. Check if altering one obb
alters the other
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: I3956e06bd0a222b0bbb2768c9a8a8372ada85e1e
unshared_obb was missing from show_options
bug: 133257717
Change-Id: I1bc49d1b4098052382a518540e5965e037aa39f1
Signed-off-by: Kevin F. Haggerty <haggertk@lineageos.org>
commit 98ca480a8f22fdbd768e3dad07024c8d4856576c upstream.
An ino is unsigned, so display it as such in /proc/locks.
Cc: stable@vger.kernel.org
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Change-Id: I250a495fe3fc809e880535347f462fe552644edf
File-private locks have been re-christened as "open file description"
locks. Finish the symbol name cleanup in the internal implementation.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Change-Id: Iee48047540a7d8fefb5078cc005ae9ea8994f521
File-private locks have been merged into Linux for v3.15, and *now*
people are commenting that the name and macro definitions for the new
file-private locks suck.
...and I can't even disagree. The names and command macros do suck.
We're going to have to live with these for a long time, so it's
important that we be happy with the names before we're stuck with them.
The consensus on the lists so far is that they should be rechristened as
"open file description locks".
The name isn't a big deal for the kernel, but the command macros are not
visually distinct enough from the traditional POSIX lock macros. The
glibc and documentation folks are recommending that we change them to
look like F_OFD_{GETLK|SETLK|SETLKW}. That lessens the chance that a
programmer will typo one of the commands wrong, and also makes it easier
to spot this difference when reading code.
This patch makes the following changes that I think are necessary before
v3.15 ships:
1) rename the command macros to their new names. These end up in the uapi
headers and so are part of the external-facing API. It turns out that
glibc doesn't actually use the fcntl.h uapi header, but it's hard to
be sure that something else won't. Changing it now is safest.
2) make the the /proc/locks output display these as type "OFDLCK"
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Carlos O'Donell <carlos@redhat.com>
Cc: Stefan Metzmacher <metze@samba.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Frank Filz <ffilzlnx@mindspring.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Change-Id: Ia975197281d4c80a4ad420d7621896d2f369cef6
Due to some unfortunate history, POSIX locks have very strange and
unhelpful semantics. The thing that usually catches people by surprise
is that they are dropped whenever the process closes any file descriptor
associated with the inode.
This is extremely problematic for people developing file servers that
need to implement byte-range locks. Developers often need a "lock
management" facility to ensure that file descriptors are not closed
until all of the locks associated with the inode are finished.
Additionally, "classic" POSIX locks are owned by the process. Locks
taken between threads within the same process won't conflict with one
another, which renders them useless for synchronization between threads.
This patchset adds a new type of lock that attempts to address these
issues. These locks conflict with classic POSIX read/write locks, but
have semantics that are more like BSD locks with respect to inheritance
and behavior on close.
This is implemented primarily by changing how fl_owner field is set for
these locks. Instead of having them owned by the files_struct of the
process, they are instead owned by the filp on which they were acquired.
Thus, they are inherited across fork() and are only released when the
last reference to a filp is put.
These new semantics prevent them from being merged with classic POSIX
locks, even if they are acquired by the same process. These locks will
also conflict with classic POSIX locks even if they are acquired by
the same process or on the same file descriptor.
The new locks are managed using a new set of cmd values to the fcntl()
syscall. The initial implementation of this converts these values to
"classic" cmd values at a fairly high level, and the details are not
exposed to the underlying filesystem. We may eventually want to push
this handing out to the lower filesystem code but for now I don't
see any need for it.
Also, note that with this implementation the new cmd values are only
available via fcntl64() on 32-bit arches. There's little need to
add support for legacy apps on a new interface like this.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Change-Id: I35691bdfed9cadcbbcb6ff6804d9eea1db661ddc
Once we introduce file private locks, we'll need to know what cmd value
was used, as that affects the ownership and whether a conflict would
arise.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Change-Id: Iaeb8233ae25bde5ef0049118ff94e4a9e0f02214
FL_FILE_PVT locks are no longer tied to a particular pid, and are
instead inheritable by child processes. Report a l_pid of '-1' for
these sorts of locks since the pid is somewhat meaningless for them.
This precedent comes from FreeBSD. There, POSIX and flock() locks can
conflict with one another. If fcntl(F_GETLK, ...) returns a lock set
with flock() then the l_pid member cannot be a process ID because the
lock is not held by a process as such.
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Change-Id: I7d702fcaaaf8592356926d51b60e53ee217ca747
In a later patch, we'll be adding a new type of lock that's owned by
the struct file instead of the files_struct. Those sorts of locks
will be flagged with a new FL_FILE_PVT flag.
Report these types of locks as "FLPVT" in /proc/locks to distinguish
them from "classic" POSIX locks.
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Change-Id: Id0b6d9c7a947b512e5683ad3b6188d73582c2de9
This function currently removes leases in addition to flock locks and in
a later patch we'll have it deal with file-private locks too. Rename it
to locks_remove_file to indicate that it removes locks that are
associated with a particular struct file, and not just flock locks.
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Change-Id: I1289cfbc02eb778532e984a29adffb02a9370cc1
Modify the request_module to prefix the file system type with "fs-"
and add aliases to all of the filesystems that can be built as modules
to match.
A common practice is to build all of the kernel code and leave code
that is not commonly needed as modules, with the result that many
users are exposed to any bug anywhere in the kernel.
Looking for filesystems with a fs- prefix limits the pool of possible
modules that can be loaded by mount to just filesystems trivially
making things safer with no real cost.
Using aliases means user space can control the policy of which
filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
with blacklist and alias directives. Allowing simple, safe,
well understood work-arounds to known problematic software.
This also addresses a rare but unfortunate problem where the filesystem
name is not the same as it's module name and module auto-loading
would not work. While writing this patch I saw a handful of such
cases. The most significant being autofs that lives in the module
autofs4.
This is relevant to user namespaces because we can reach the request
module in get_fs_type() without having any special permissions, and
people get uncomfortable when a user specified string (in this case
the filesystem type) goes all of the way to request_module.
After having looked at this issue I don't think there is any
particular reason to perform any filtering or permission checks beyond
making it clear in the module request that we want a filesystem
module. The common pattern in the kernel is to call request_module()
without regards to the users permissions. In general all a filesystem
module does once loaded is call register_filesystem() and go to sleep.
Which means there is not much attack surface exposed by loading a
filesytem module unless the filesystem is mounted. In a user
namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
which most filesystems do not set today.
Change-Id: I623b13dbdb44bb9ba7481f29575e1ca4ad8102f4
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Kees Cook <keescook@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Kevin F. Haggerty <haggertk@lineageos.org>
If two processes share a common memory region, they usually want some
guarantees to allow safe access. This often includes:
- one side cannot overwrite data while the other reads it
- one side cannot shrink the buffer while the other accesses it
- one side cannot grow the buffer beyond previously set boundaries
If there is a trust-relationship between both parties, there is no need
for policy enforcement. However, if there's no trust relationship (eg.,
for general-purpose IPC) sharing memory-regions is highly fragile and
often not possible without local copies. Look at the following two
use-cases:
1) A graphics client wants to share its rendering-buffer with a
graphics-server. The memory-region is allocated by the client for
read/write access and a second FD is passed to the server. While
scanning out from the memory region, the server has no guarantee that
the client doesn't shrink the buffer at any time, requiring rather
cumbersome SIGBUS handling.
2) A process wants to perform an RPC on another process. To avoid huge
bandwidth consumption, zero-copy is preferred. After a message is
assembled in-memory and a FD is passed to the remote side, both sides
want to be sure that neither modifies this shared copy, anymore. The
source may have put sensible data into the message without a separate
copy and the target may want to parse the message inline, to avoid a
local copy.
While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
ways to achieve most of this, the first one is unproportionally ugly to
use in libraries and the latter two are broken/racy or even disabled due
to denial of service attacks.
This patch introduces the concept of SEALING. If you seal a file, a
specific set of operations is blocked on that file forever. Unlike locks,
seals can only be set, never removed. Hence, once you verified a specific
set of seals is set, you're guaranteed that no-one can perform the blocked
operations on this file, anymore.
An initial set of SEALS is introduced by this patch:
- SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
in size. This affects ftruncate() and open(O_TRUNC).
- GROW: If SEAL_GROW is set, the file in question cannot be increased
in size. This affects ftruncate(), fallocate() and write().
- WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
are possible. This affects fallocate(PUNCH_HOLE), mmap() and
write().
- SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
This basically prevents the F_ADD_SEAL operation on a file and
can be set to prevent others from adding further seals that you
don't want.
The described use-cases can easily use these seals to provide safe use
without any trust-relationship:
1) The graphics server can verify that a passed file-descriptor has
SEAL_SHRINK set. This allows safe scanout, while the client is
allowed to increase buffer size for window-resizing on-the-fly.
Concurrent writes are explicitly allowed.
2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
process can modify the data while the other side parses it.
Furthermore, it guarantees that even with writable FDs passed to the
peer, it cannot increase the size to hit memory-limits of the source
process (in case the file-storage is accounted to the source).
The new API is an extension to fcntl(), adding two new commands:
F_GET_SEALS: Return a bitset describing the seals on the file. This
can be called on any FD if the underlying file supports
sealing.
F_ADD_SEALS: Change the seals of a given file. This requires WRITE
access to the file and F_SEAL_SEAL may not already be set.
Furthermore, the underlying file must support sealing and
there may not be any existing shared mapping of that file.
Otherwise, EBADF/EPERM is returned.
The given seals are _added_ to the existing set of seals
on the file. You cannot remove seals again.
The fcntl() handler is currently specific to shmem and disabled on all
files. A file needs to explicitly support sealing for this interface to
work. A separate syscall is added in a follow-up, which creates files that
support sealing. There is no intention to support this on other
file-systems. Semantics are unclear for non-volatile files and we lack any
use-case right now. Therefore, the implementation is specific to shmem.
Change-Id: I2d6247d3287c61dbe6bafabf56554e80b414f938
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ryan Lortie <desrt@desrt.ca>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <zonque@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch (of 6):
The i_mmap_writable field counts existing writable mappings of an
address_space. To allow drivers to prevent new writable mappings, make
this counter signed and prevent new writable mappings if it is negative.
This is modelled after i_writecount and DENYWRITE.
This will be required by the shmem-sealing infrastructure to prevent any
new writable mappings after the WRITE seal has been set. In case there
exists a writable mapping, this operation will fail with EBUSY.
Note that we rely on the fact that iff you already own a writable mapping,
you can increase the counter without using the helpers. This is the same
that we do for i_writecount.
Change-Id: Id16c5b650e451956a4f6df004483cb63197c613c
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ryan Lortie <desrt@desrt.ca>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <zonque@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit 47ba973440 upstream.
This patch moves the dereference of "buffer" after the check for NULL.
The only place which passes a NULL parameter is gfs2_set_acl().
Change-Id: I7ede500c05e646e4c07238d159b8f182a1fbf80d
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
posix_acl_xattr_get requires get_acl() to return EOPNOTSUPP if the
filesystem cannot support acls. This is needed for NFS, which can't
know whether or not the server supports acls until it tries to get/set
one.
This patch converts posix_acl_chmod and posix_acl_create to deal with
EOPNOTSUPP return values from get_acl().
Change-Id: I931423ae763a4950056c7b20938be6f1f4536e24
Reported-by: Russell King <linux@arm.linux.org.uk>
Link: http://lkml.kernel.org/r/20140130140834.GW15937@n2100.arm.linux.org.uk
Cc: Al Viro viro@zeniv.linux.org.uk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
And instead convert tmpfs to use the new generic ACL code, with two stub
methods provided for in-memory filesystems.
Change-Id: Ide27840378dbbc062c32ded2a6420c1b6c28f57e
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rename the current posix_acl_created to __posix_acl_create and add
a fully featured helper to set up the ACLs on file creation that
uses get_acl().
Change-Id: I7d8de350fe89ef3d2f9ff6eaa2c198b5403d33fc
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rename the current posix_acl_chmod to __posix_acl_chmod and add
a fully featured ACL chmod helper that uses the ->set_acl inode
operation.
Change-Id: I503ed1049a28ad01d32fe3fa85d8fc9b7e12610f
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With the ->set_acl inode operation we can implement the Posix ACL
xattr handlers in generic code instead of duplicating them all
over the tree.
Change-Id: I473c270b801d7faf3338d68a29c749ff929fc575
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Factor out the code to get an ACL either from the inode or disk from
check_acl, so that it can be used elsewhere later on.
Change-Id: I81fab0da8228eaee0ba14b9b9942071caa12aeae
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
So far, POSIX ACLs are using a canonical representation that keeps all ACL
entries in a strict order; the ACL_USER and ACL_GROUP entries for specific
users and groups are ordered by user and group identifier, respectively.
The user-space code provides ACL entries in this order; the kernel
verifies that the ACL entry order is correct in posix_acl_valid().
User namespaces allow to arbitrary map user and group identifiers which
can cause the ACL_USER and ACL_GROUP entry order to differ between user
space and the kernel; posix_acl_valid() would then fail.
Work around this by allowing ACL_USER and ACL_GROUP entries to be in any
order in the kernel. The effect is only minor: file permission checks
will pick the first matching ACL_USER entry, and check all matching
ACL_GROUP entries.
(The libacl user-space library and getfacl / setfacl tools will not create
ACLs with duplicate user or group idenfifiers; they will handle ACLs with
entries in an arbitrary order correctly.)
Change-Id: Ib73a93c56fb8029102ba2aec8ea3b56a7467fb86
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 38f3865744 ("xattr: extract simple_xattr code from tmpfs") moved
some code from tmpfs but introduced a subtle bug along the way.
If the name passed to simple_xattr_remove() does not exist in the list of
xattrs, then it is possible to call kfree(new_xattr) when new_xattr is
actually initialized to itself on the stack via uninitialized_var().
This causes a BUG() since the memory was not allocated via the slab
allocator and was not bypassed through to the page allocator because it
was too large.
Initialize the local variable to NULL so the kfree() never takes place.
Change-Id: I0f090df631e871657fb31914dce57c13e81e25c2
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The code needs to be from_kgid(make_kgid(...)...) not
from_kuid(make_kgid(...)...). Doh!
Change-Id: Ib3e4a257e76ba85e92bb4fe272c6af7c4dc60b5b
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This patch takes the i_mutex lock before security_inode_removexattr(),
instead of after, in preparation of calling ima_inode_removexattr().
Change-Id: I969fa5536599242a3403930379c49d733b1e1295
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com>
- Pass the user namespace the uid and gid values in the xattr are stored
in into posix_acl_from_xattr.
- Pass the user namespace kuid and kgid values should be converted into
when storing uid and gid values in an xattr in posix_acl_to_xattr.
- Modify all callers of posix_acl_from_xattr and posix_acl_to_xattr to
pass in &init_user_ns.
In the short term this change is not strictly needed but it makes the
code clearer. In the longer term this change is necessary to be able to
mount filesystems outside of the initial user namespace that natively
store posix acls in the linux xattr format.
Change-Id: I7c2b18f16ec9d7ded49135cedc2e91a71e078087
Cc: Theodore Tso <tytso@mit.edu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
- In setxattr if we are setting a posix acl convert uids and gids from
the current user namespace into the initial user namespace, before
the xattrs are passed to the underlying filesystem.
Untranslatable uids and gids are represented as -1 which
posix_acl_from_xattr will represent as INVALID_UID or INVALID_GID.
posix_acl_valid will fail if an acl from userspace has any
INVALID_UID or INVALID_GID values. In net this guarantees that
untranslatable posix acls will not be stored by filesystems.
- In getxattr if we are reading a posix acl convert uids and gids from
the initial user namespace into the current user namespace.
Uids and gids that can not be tranlsated into the current user namespace
will be represented as -1.
- Replace e_id in struct posix_acl_entry with an anymouns union of
e_uid and e_gid. For the short term retain the e_id field
until all of the users are converted.
- Don't set struct posix_acl.e_id in the cases where the acl type
does not use e_id. Greatly reducing the use of ACL_UNDEFINED_ID.
- Rework the ordering checks in posix_acl_valid so that I use kuid_t
and kgid_t types throughout the code, and so that I don't need
arithmetic on uid and gid types.
Change-Id: If5ca5f58195e29b3c8cb51bd7769c980dc7da3a4
Cc: Theodore Tso <tytso@mit.edu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
new_xattr in __simple_xattr_set() is only initialized with a valid
pointer if value is not NULL, which only happens if this function is
called directly with the intention to remove an existing extended
attribute. Even being safe to be this way, smatch warns about possible
NULL dereference. Dan Carpenter suggested using uninitialized_var()
which will make both gcc and smatch happy.
Change-Id: I0a7ff73cd9bda419961e8421b1472ed14a64f71c
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
v2: add function documentation instead of adding a separate file under
Documentation/
tj: Updated comment a bit and rolled in Randy's suggestions.
Change-Id: I2be6be5f4987127c354e2cca490876085ceaea45
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
one such possible user), the following race can happen:
radix_tree_preload()
...
radix_tree_insert()
radix_tree_node_alloc()
if (rtp->nr) {
ret = rtp->nodes[rtp->nr - 1];
<interrupt>
...
radix_tree_preload()
...
radix_tree_insert()
radix_tree_node_alloc()
if (rtp->nr) {
ret = rtp->nodes[rtp->nr - 1];
And we give out one radix tree node twice. That clearly results in radix
tree corruption with different results (usually OOPS) depending on which
two users of radix tree race.
We fix the problem by making radix_tree_node_alloc() always allocate fresh
radix tree nodes when in interrupt. Using preloading when in interrupt
doesn't make sense since all the allocations have to be atomic anyway and
we cannot steal nodes from process-context users because some users rely
on radix_tree_insert() succeeding after radix_tree_preload().
in_interrupt() check is somewhat ugly but we cannot simply key off passed
gfp_mask as that is acquired from root_gfp_mask() and thus the same for
all preload users.
Another part of the fix is to avoid node preallocation in
radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again,
preallocation in such case doesn't make sense and when preallocation would
happen in interrupt we could possibly leak some allocated nodes. However,
some users of radix_tree_preload() require following radix_tree_insert()
to succeed. To avoid unexpected effects for these users,
radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
and we provide a new function radix_tree_maybe_preload() for those users
which get different gfp mask from different call sites and which are
prepared to handle radix_tree_insert() failure.
Change-Id: Iab513ed95e8a98cd890e68a820181a8a915da9aa
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Extract in-memory xattr APIs from tmpfs. Will be used by cgroup.
$ size vmlinux.o
text data bss dec hex filename
4658782 880729 5195032 10734543 a3cbcf vmlinux.o
$ size vmlinux.o
text data bss dec hex filename
4658957 880729 5195032 10734718 a3cc7e vmlinux.o
v7:
- checkpatch warnings fixed
- Implement the changes requested by Hugh Dickins:
- make simple_xattrs_init and simple_xattrs_free inline
- get rid of locking and list reinitialization in simple_xattrs_free,
they're not needed
v6:
- no changes
v5:
- no changes
v4:
- move simple_xattrs_free() to fs/xattr.c
v3:
- in kmem_xattrs_free(), reinitialize the list
- use simple_xattr_* prefix
- introduce simple_xattr_add() to prevent direct list usage
Change-Id: Id683f37a6190127866f75c5b4bb8fafc6491dcd8
Original-patch-by: Li Zefan <lizefan@huawei.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Remove vmtruncate_range(), and remove the truncate_range method from
struct inode_operations: only tmpfs ever supported it, and tmpfs has now
converted over to using the fallocate method of file_operations.
Update Documentation accordingly, adding (setlease and) fallocate lines.
And while we're in mm.h, remove duplicate declarations of shmem_lock() and
shmem_file_setup(): everyone is now using the ones in shmem_fs.h.
Change-Id: I1452e3356333d19b778ef60d6cf7625c31b77bf8
Based-on-patch-by: Cong Wang <amwang@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Cong Wang <amwang@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Historically, we exported shared pages to userspace via sysinfo(2)
sharedram and /proc/meminfo's "MemShared" fields. With the advent of
tmpfs, from kernel v2.4 onward, that old way for accounting shared mem
was deemed inaccurate and we started to export a hard-coded 0 for
sysinfo.sharedram. Later on, during the 2.6 timeframe, "MemShared" got
re-introduced to /proc/meminfo re-branded as "Shmem", but we're still
reporting sysinfo.sharedmem as that old hard-coded zero, which makes the
"shared memory" report inconsistent across interfaces.
This patch leverages the addition of explicit accounting for pages used
by shmem/tmpfs -- "4b02108 mm: oom analysis: add shmem vmstat" -- in
order to make the users of sysinfo(2) and si_meminfo*() friends aware of
that vmstat entry and make them report it consistently across the
interfaces, as well to make sysinfo(2) returned data consistent with our
current API documentation states.
Change-Id: I51474973cc267ee368352e96e7229e0101f38aca
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit d02bd27bd33dd7e8d22594cd568b81be0cb584cd upstream.
Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
statistics protocol, corresponding to 'Available' in /proc/meminfo.
It indicates to the hypervisor how big the balloon can be inflated
without pushing the guest system to swap. This metric would be very
useful in VM orchestration software to improve memory management of
different VMs under overcommit.
This patch (of 2):
Factor out calculation of the available memory counter into a separate
exportable function, in order to be able to use it in other parts of the
kernel.
In particular, it appears a relevant metric to report to the hypervisor
via virtio-balloon statistics interface (in a followup patch).
Signed-off-by: Igor Redko <redkoi@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 4.4 as dependency of commit a1078e821b60
"xen: let alloc_xenballooned_pages() fail if not enough memory free"]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Change-Id: Ib8cae94ddd9742752ecc795129be642c57f8cc15
Move actual pte filling for non-linear file mappings into the new special
vma operation: ->remap_pages().
Filesystems must implement this method to get non-linear mapping support,
if it uses filemap_fault() then generic_file_remap_pages() can be used.
Now device drivers can implement this method and obtain nonlinear vma support.
Change-Id: Ifbbbdfcdf871a8173856a13087400885357f95ee
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com> #arch/tile
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page allocated in fuse_dentry_canonical_path to be handled in
fuse_dev_do_write is allocated using __get_free_pages(GFP_KERNEL).
This may not return a page with data filled with 0. Now this
page may not have a null terminator at all.
If this happens and userspace fuse daemon screws up by passing a string
to kernel which is not NULL terminated (or did not fill anything),
then inside fuse driver in kernel when we try to do
strlen(fuse_dev_write->kern_path->getname_kernel)
on that page data -> it may give us issue with kernel paging request.
Unable to handle kernel paging request at virtual address
------------[ cut here ]------------
<..>
PC is at strlen+0x10/0x90
LR is at getname_kernel+0x2c/0xf4
<..>
strlen+0x10/0x90
kern_path+0x28/0x4c
fuse_dev_do_write+0x5b8/0x694
fuse_dev_write+0x74/0x94
do_iter_readv_writev+0x80/0xb8
do_readv_writev+0xec/0x1cc
vfs_writev+0x54/0x64
SyS_writev+0x64/0xe4
el0_svc_naked+0x24/0x28
To avoid this we should ensure in case of FUSE_CANONICAL_PATH,
the page is null terminated.
Change-Id: I33ca7cc76b4472eaa982c67bb20685df451121f5
Signed-off-by: Ritesh Harjani <riteshh@codeaurora.org>
Bug: 75984715
[Daniel - small edit, using args size ]
Signed-off-by: Daniel Rosenberg <drosen@google.com>
09f363c7 ("vmscan: fix shrinker callback bug in fs/super.c") fixed a
shrinker callback which was returning -1 when nr_to_scan is zero, which
caused excessive slab scanning. But 635697c6 ("vmscan: fix initial
shrinker size handling") fixed the problem, again so we can freely return
-1 although nr_to_scan is zero. So let's revert 09f363c7 because the
comment added in 09f363c7 made an unnecessary rule.
Change-Id: Ic57b698b97406b980e06bd213afa283868a779a2
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We already use them for openat() and friends, but fstat() also wants to
be able to use O_PATH file descriptors. This should make it more
directly comparable to the O_SEARCH of Solaris.
Note that you could already do the same thing with "fstatat()" and an
empty path, but just doing "fstat()" directly is simpler and faster, so
there is no reason not to just allow it directly.
See also commit 332a2e1244, which did the same thing for fchdir, for
the same reasons.
Reported-by: ольга крыжановская <olga.kryzhanovska@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org # O_PATH introduced in 3.0+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change-Id: I98db77b7f3fcadb1f0ecdb5c50ff4a7ad9db8bb3
We already use them for openat() and friends, but fstat() also wants to
be able to use O_PATH file descriptors. This should make it more
directly comparable to the O_SEARCH of Solaris.
Note that you could already do the same thing with "fstatat()" and an
empty path, but just doing "fstat()" directly is simpler and faster, so
there is no reason not to just allow it directly.
See also commit 332a2e1244, which did the same thing for fchdir, for
the same reasons.
Reported-by: ольга крыжановская <olga.kryzhanovska@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org # O_PATH introduced in 3.0+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>