android_kernel_google_msm

mirror of https://github.com/followmsi/android_kernel_google_msm.git synced 2024-11-06 23:17:41 +00:00

Author	SHA1	Message	Date
Miklos Szeredi	25bd1f6a36	vfs: do_last(): common slow lookup Make the slow lookup part of O_CREAT and non-O_CREAT opens common. This allows atomic_open to be hooked into the slow lookup part. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Change-Id: I1e3c97737246f53fb4e33d2df5586e0b422aa30e	2018-12-07 22:12:51 +04:00
Miklos Szeredi	ef9b34cffd	vfs: do_last(): separate O_CREAT specific code Check O_CREAT on the slow lookup paths where necessary. This allows the rest to be shared with plain open. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Change-Id: I498c7daabc652b940d2618f95580a97ed16fe129	2018-12-07 22:12:51 +04:00
Miklos Szeredi	8b85f578e2	vfs: do_last(): inline lookup_slow() Copy lookup_slow() into do_last(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Change-Id: I3612576e8f753f9482e7e4cb294c146ddd94462d	2018-12-07 22:12:51 +04:00
Al Viro	c79698cc6c	namei.c: let follow_link() do put_link() on failure no need for kludgy "set cookie to ERR_PTR(...) because we failed before we did actual ->follow_link() and want to suppress put_link()", no pointless check in put_link() itself. Callers checked if follow_link() has failed anyway; might as well break out of their loops if that happened, without bothering to call put_link() first. [AV: folded fixes from hch] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Change-Id: I240bcdcbf8a89b925716e69610def8c341dd2419	2018-12-07 22:12:51 +04:00
Miklos Szeredi	d1070d4ee7	vfs: retry last component if opening stale dentry NFS optimizes away d_revalidates for last component of open. This means that open itself can find the dentry stale. This patch allows the filesystem to return EOPENSTALE and the VFS will retry the lookup on just the last component if possible. If the lookup was done using RCU mode, including the last component, then this is not possible since the parent dentry is lost. In this case fall back to non-RCU lookup. Currently this is not used since NFS will always leave RCU mode. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Change-Id: Ie1f7465757dc9086ef0ffefe22c969ef3c6ddedb	2018-12-07 22:12:51 +04:00
Miklos Szeredi	9d86708f6c	vfs: do_last() common post lookup Now the post lookup code can be shared between O_CREAT and plain opens since they are essentially the same. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Change-Id: Ia366d92c0e4edc183bfa23218676083872ef10f0	2018-12-07 22:12:51 +04:00
Miklos Szeredi	70497cfc0f	vfs: do_last(): add audit_inode before open This allows this code to be shared between O_CREAT and plain opens. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Change-Id: I07469a80d749749e2b161ef500afd98895b33e6a	2018-12-07 22:12:51 +04:00
Miklos Szeredi	17c45ceea1	vfs: do_last(): only return EISDIR for O_CREAT This allows this code to be shared between O_CREAT and plain opens. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Change-Id: I74a1b2a53fe009245bfce52dfc17f7628ac6d9c0	2018-12-07 22:12:51 +04:00
Miklos Szeredi	5f53e97907	vfs: do_last(): check LOOKUP_DIRECTORY Check for ENOTDIR before finishing open. This allows this code to be shared between O_CREAT and plain opens. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Change-Id: Ib5bcfc2548d45b8c632bb11aecb3f5618887d3d1	2018-12-07 22:12:51 +04:00
Miklos Szeredi	228e8884b4	vfs: do_last(): make ENOENT exit RCU safe This will allow this code to be used in RCU mode. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Change-Id: I72145c85cd20934d00bc9cbcb20efe42a636d592	2018-12-07 22:12:51 +04:00
Miklos Szeredi	725142ea33	vfs: make follow_link check RCU safe This will allow this code to be used in RCU mode. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Change-Id: I16990161f7e02dcdfd8979cb5594fe4a8207fca1	2018-12-07 22:12:51 +04:00
Miklos Szeredi	ae607d85d3	vfs: do_last(): use inode variable Use helper variable instead of path->dentry->d_inode before complete_walk(). This will allow this code to be used in RCU mode. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Change-Id: I8fd1525e9d0efd992451664c0fb545bbd4bb40d3	2018-12-07 22:12:51 +04:00
Miklos Szeredi	4447ee76a6	vfs: do_last(): inline walk_component() Copy walk_component() into do_lookup(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Change-Id: I1568bb6b976521d4d01df892689e8e977b84af8f	2018-12-07 22:12:51 +04:00
Miklos Szeredi	d45ea2d979	vfs: do_last(): make exit RCU safe Allow returning from do_last() with LOOKUP_RCU still set on the "out:" and "exit:" labels. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Change-Id: Ic63718547582cf09a2c0bbc86dcc9084bd080ffc	2018-12-07 22:12:51 +04:00
Miklos Szeredi	12f1e7dad9	vfs: split do_lookup() Split do_lookup() into two functions: lookup_fast() - does cached lookup without i_mutex lookup_slow() - does lookup with i_mutex Both follow managed dentries. The new functions are needed by atomic_open. Change-Id: Ic32255a88ff82a92622a20135eb034ad3fa1d5a7 Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2018-12-07 22:12:51 +04:00
Linus Torvalds	d671fd6a09	vfs: clean up __d_lookup_rcu() and dentry_cmp() interfaces The calling conventions for __d_lookup_rcu() and dentry_cmp() are annoying in different ways, and there is actually one single underlying reason for both of the annoyances. The fundamental reason is that we do the returned dentry sequence number check inside __d_lookup_rcu() instead of doing it in the caller. This results in two annoyances: - __d_lookup_rcu() now not only needs to return the dentry and the sequence number that goes along with the lookup, it also needs to return the inode pointer that was validated by that sequence number check. - and because we did the sequence number check early (to validate the name pointer and length) we also couldn't just pass the dentry itself to dentry_cmp(), we had to pass the counted string that contained the name. So that sequence number decision caused two separate ugly calling conventions. Both of these problems would be solved if we just did the sequence number check in the caller instead. There's only one caller, and that caller already has to do the sequence number check for the parent anyway, so just do that. That allows us to stop returning the dentry->d_inode in that in-out argument (pointer-to-pointer-to-inode), so we can make the inode argument just a regular input inode pointer. The caller can just load the inode from dentry->d_inode, and then do the sequence number check after that to make sure that it's synchronized with the name we looked up. And it allows us to just pass in the dentry to dentry_cmp(), which is what all the callers really wanted. Sure, dentry_cmp() has to be a bit careful about the dentry (which is not stable during RCU lookup), but that's actually very simple. And now that dentry_cmp() can clearly see that the first string argument is a dentry, we can use the direct word access for that, instead of the careful unaligned zero-padding. The dentry name is always properly aligned, since it is a single path component that is either embedded into the dentry itself, or was allocated with kmalloc() (see __d_alloc). Finally, this also uninlines the nasty slow-case for dentry comparisons: that one does need to do a sequence number check, since it will call in to the low-level filesystems, and we want to give those a stable inode pointer and path component length/start arguments. Doing an extra sequence check for that slow case is not a problem, though. Change-Id: If75355452c45e51c72838fa50e914df017b42531 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-12-07 22:12:51 +04:00
Linus Torvalds	51f831d729	VFS: clean up and simplify getname_flags() This removes a number of silly games around strncpy_from_user() in do_getname(), and removes that helper function entirely. We instead make getname_flags() just use strncpy_from_user() properly directly. Removing the wrapper function simplifies things noticeably, mostly because we no longer play the unnecessary games with segments (x86 strncpy_from_user() no longer needs the hack), but also because the empty path handling is just much more obvious. The return value of "strncpy_to_user()" is much more obvious than checking an odd error return case from do_getname(). [ non-x86 architectures were notified of this change several weeks ago, since it is possible that they have copied the old broken x86 strncpy_from_user. But nobody reacted, so .. See http://www.spinics.net/lists/linux-arch/msg17313.html for details ] Cc: linux-arch@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: Ic049b87071c10330b1cd3f864dd0c4c5d98464df	2018-12-07 22:12:51 +04:00
Eric W. Biederman	068f467ec8	vfs: Don't allow a user namespace root to make device nodes Safely making device nodes in a container is solvable but simply having the capability in a user namespace is not sufficient to make this work. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Change-Id: I4eb0afd78bb4a8b106dca3002c11ae81caae9e1d	2018-12-07 22:12:51 +04:00
Miklos Szeredi	86e4719bc3	vfs: nameidata_to_filp(): don't throw away file on error If open fails, don't put the file. This allows it to be reused if open needs to be retried. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Change-Id: I78695bdf3ab30e2f313f1fb5ec79c9cd572f4c55	2018-12-07 22:12:51 +04:00
Miklos Szeredi	9355cee272	vfs: nameidata_to_filp(): inline __dentry_open() Copy __dentry_open() into nameidata_to_filp(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Change-Id: If1d3ff701de987e426c3a4c813efff3d1d0db181	2018-12-07 22:12:51 +04:00
Miklos Szeredi	1e3571fc93	vfs: do_dentry_open(): don't put filp Move put_filp() out to __dentry_open(), the only caller now. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Change-Id: Ica706ac4611a66cd6e637650bd8143148fa95e44	2018-12-07 22:12:51 +04:00
Miklos Szeredi	9b645714fb	vfs: split __dentry_open() Split __dentry_open() into two functions: do_dentry_open() - does most of the actual work, doesn't put file on failure open_check_o_direct() - after a successful open, checks direct_IO method This will allow i_op->atomic_open to do just the file initialization and leave the direct_IO checking to the VFS. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Change-Id: Ieb8a4e7c4ec9a12636c184d757a5f7a49c5e75df	2018-12-07 22:12:51 +04:00
Andrea Arcangeli	1fd1850bf6	fs/exec: fix use after free in execve "file" can be already freed if bprm->file is NULL after search_binary_handler() return. binfmt_script will do exactly that for example. If the VM reuses the file after fput run(), this will result in a use ater free. So obtain d_is_su before search_binary_handler() runs. This should explain this crash: [25333.009554] Unable to handle kernel NULL pointer dereference at virtual address 00000185 [..] [25333.009918] [2: am:21861] PC is at do_execve+0x354/0x474 Change-Id: I2a8a814d1c0aa75625be83cb30432cf13f1a0681 Signed-off-by: Kevin F. Haggerty <haggertk@lineageos.org>	2018-02-16 20:15:06 -07:00
Daniel Rosenberg	b660e27533	ANDROID: sdcardfs: Fix missing break on default_normal Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 64672411 Change-Id: I98796df95dc9846adb77a11f49a1a254fb1618b1	2018-01-13 17:25:53 +03:00
Daniel Rosenberg	783ca29469	ANDROID: sdcardfs: Add default_normal option The default_normal option causes mounts with the gid set to AID_SDCARD_RW to have user specific gids, as in the normal case. Signed-off-by: Daniel Rosenberg <drosen@google.com> Change-Id: I9619b8ac55f41415df943484dc8db1ea986cef6f Bug: 64672411	2018-01-13 17:25:30 +03:00
Daniel Rosenberg	4ac97e3645	ANDROID: sdcardfs: notify lower file of opens fsnotify_open is not called within dentry_open, so we need to call it ourselves. Change-Id: Ia7f323b3d615e6ca5574e114e8a5d7973fb4c119 Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 70706497	2018-01-13 17:25:26 +03:00
Al Viro	b7f0468bb9	BACKPORT: dentry name snapshots commit 49d31c2f389acfe83417083e1208422b4091cd9e upstream. take_dentry_name_snapshot() takes a safe snapshot of dentry name; if the name is a short one, it gets copied into caller-supplied structure, otherwise an extra reference to external name is grabbed (those are never modified). In either case the pointer to stable string is stored into the same structure. dentry must be held by the caller of take_dentry_name_snapshot(), but may be freely dropped afterwards - the snapshot will stay until destroyed by release_dentry_name_snapshot(). Intended use: struct name_snapshot s; take_dentry_name_snapshot(&s, dentry); ... access s.name ... release_dentry_name_snapshot(&s); Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name to pass down with event. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> [carnil: backport 4.9: adjust context] [bwh: Backported to 3.16: - External names are not ref-counted, so copy them - Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> [ghackmann@google.com: backported to 3.10: adjust context] Signed-off-by: Greg Hackmann <ghackmann@google.com> Change-Id: I612e687cbffa1a03107331a6b3f00911ffbebd8e Bug: 63689921	2018-01-13 17:13:38 +03:00
Shaohua Li	df680e4101	swap: make each swap partition have one address_space When I use several fast SSD to do swap, swapper_space.tree_lock is heavily contended. This makes each swap partition have one address_space to reduce the lock contention. There is an array of address_space for swap. The swap entry type is the index to the array. In my test with 3 SSD, this increases the swapout throughput 20%. [akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache] Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: I8503ace83342398bf7be3d2216616868cca65311	2018-01-01 22:02:05 +03:00
Jan Kara	24f88dea2a	fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE writeback sync_file_range(2) is documented to issue writeback only for pages that are not currently being written. After all the system call has been created for userspace to be able to issue background writeout and so waiting for in-flight IO is undesirable there. However commit `ee53a891f4` ("mm: do_sync_mapping_range integrity fix") switched do_sync_mapping_range() and thus sync_file_range() to issue writeback in WB_SYNC_ALL mode since do_sync_mapping_range() was used by other code relying on WB_SYNC_ALL semantics. These days do_sync_mapping_range() went away and we can switch sync_file_range(2) back to issuing WB_SYNC_NONE writeback. That should help PostgreSQL avoid large latency spikes when flushing data in the background. Andres measured a 20% increase in transactions per second on an SSD disk. Signed-off-by: Jan Kara <jack@suse.com> Reported-by: Andres Freund <andres@anarazel.de> Tested-By: Andres Freund <andres@anarazel.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-12-31 13:02:49 +03:00
Minchan Kim	cb7925d8da	BACKPORT: mm: /proc/pid/smaps:: show proportional swap share of the mapping We want to know per-process workingset size for smart memory management on userland and we use swap(ex, zram) heavily to maximize memory efficiency so workingset includes swap as well as RSS. On such system, if there are lots of shared anonymous pages, it's really hard to figure out exactly how many each process consumes memory(ie, rss + wap) if the system has lots of shared anonymous memory(e.g, android). This patch introduces SwapPss field on /proc/<pid>/smaps so we can get more exact workingset size per process. Bongkyu tested it. Result is below. 1. 50M used swap SwapTotal: 461976 kB SwapFree: 411192 kB $ adb shell cat /proc//smaps \| grep "SwapPss:" \| awk '{sum += $2} END {print sum}'; 48236 $ adb shell cat /proc//smaps \| grep "Swap:" \| awk '{sum += $2} END {print sum}'; 141184 2. 240M used swap SwapTotal: 461976 kB SwapFree: 216808 kB $ adb shell cat /proc//smaps \| grep "SwapPss:" \| awk '{sum += $2} END {print sum}'; 230315 $ adb shell cat /proc//smaps \| grep "Swap:" \| awk '{sum += $2} END {print sum}'; 1387744 [akpm@linux-foundation.org: simplify kunmap_atomic() call] Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Bongkyu Kim <bongkyu.kim@lge.com> Tested-by: Bongkyu Kim <bongkyu.kim@lge.com> Cc: Hugh Dickins <hughd@google.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Bug: 26190646 Change-Id: Idf92d682fdef432bdd66e530a7e7cdff8f375db1 Signed-off-by: Thierry Strudel <tstrudel@google.com>	2017-12-27 22:48:40 +03:00
JP Abgrall	16c096a3f5	ext4: Add support for FIDTRIM, a best-effort ioctl for deep discard trim * What This provides an interface for issuing an FITRIM which uses the secure discard instead of just a discard. Only the eMMC command is "secure", and not how the FS uses it: due to the fact that the FS might reassign a region somewhere else, the original deleted data will not be affected by the "trim" which only handles un-used regions. So we'll just call it "deep discard", and note that this is a "best effort" cleanup. * Why Once in a while, We want to be able to cleanup most of the unused blocks after erasing a bunch of files. We don't want to constantly secure-discard via a mount option. From an eMMC spec perspective, it tells the device to really get rid of all the data for the specified blocks and not just put them back into the pool of free ones (unlike the normal TRIM). The eMMC spec says the secure trim handling must make sure the data (and metadata) is not available anymore. A simple TRIM doesn't clear the data, it just puts blocks in the free pool. JEDEC Standard No. 84-A441 7.6.9 Secure Erase 7.6.10 Secure Trim From an FS perspective, it is acceptable to leave some data behind. - directory entries related to deleted files - databases entries related to deleted files - small-file data stored in inode extents - blocks held by the FS waiting to be re-used (mitigated by sync). - blocks reassigned by the FS prior to FIDTRIM. Change-Id: I676a1404a80130d93930c84898360f2e6fb2f81e Signed-off-by: Geremy Condra <gcondra@google.com> Signed-off-by: JP Abgrall <jpa@google.com>	2017-12-27 22:40:01 +03:00
Artem Borisov	d7992e6feb	Merge remote-tracking branch 'stable/linux-3.4.y' into lineage-15.1 All bluetooth-related changes were omitted because of our ancient incompatible bt stack. Change-Id: I96440b7be9342a9c1adc9476066272b827776e64	2017-12-27 17:13:15 +03:00
Michael Kerrisk	a467c2b9c8	PM: Rename CAP_EPOLLWAKEUP to CAP_BLOCK_SUSPEND As discussed in http://thread.gmane.org/gmane.linux.kernel/1249726/focus=1288990, the capability introduced in `4d7e30d989` to govern EPOLLWAKEUP seems misnamed: this capability is about governing the ability to suspend the system, not using a particular API flag (EPOLLWAKEUP). We should make the name of the capability more general to encourage reuse in related cases. (Whether or not this capability should also be used to govern the use of /sys/power/wake_lock is a question that needs to be separately resolved.) This patch renames the capability to CAP_BLOCK_SUSPEND. In order to ensure that the old capability name doesn't make it out into the wild, could you please apply and push up the tree to ensure that it is incorporated for the 3.5 release. Change-Id: Id7abf9a14f0a4b21c02eee057aff48687326c750 Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2017-12-15 16:47:14 +03:00
Rafael J. Wysocki	60bf2e03ee	epoll: Fix user space breakage related to EPOLLWAKEUP Commit `4d7e30d` (epoll: Add a flag, EPOLLWAKEUP, to prevent suspend while epoll events are ready) caused some applications to malfunction, because they set the bit corresponding to the new EPOLLWAKEUP flag in their eventpoll flags and they don't have the new CAP_EPOLLWAKEUP capability. To prevent that from happening, change epoll_ctl() to clear EPOLLWAKEUP in epds.events if the caller doesn't have the CAP_EPOLLWAKEUP capability instead of failing and returning an error code, which allows the affected applications to function normally. Change-Id: I266634be5e16d3390fd1c62686a215af215c8d51 Reported-and-tested-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2017-12-15 16:47:02 +03:00
Arve Hjønnevåg	bf3ef86fe9	epoll: Add a flag, EPOLLWAKEUP, to prevent suspend while epoll events are ready When an epoll_event, that has the EPOLLWAKEUP flag set, is ready, a wakeup_source will be active to prevent suspend. This can be used to handle wakeup events from a driver that support poll, e.g. input, if that driver wakes up the waitqueue passed to epoll before allowing suspend. Change-Id: I522bfcf488bb4817336e682b22bdfb1e0beaf3e4 Signed-off-by: Arve Hjønnevåg <arve@android.com> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>	2017-12-15 16:46:49 +03:00
Paul Keith	e675a50f40	sdcardfs: Backport and use some 3.10 hlist/hash macros * Fixes NPD when accessing /config/sdcardfs/packages_gid.list Change-Id: I4b628ffab5e8a83642439661f97f720946f31daf Signed-off-by: Paul Keith <javelinanddart@gmail.com>	2017-10-06 10:28:58 +03:00
Andrew Ruder	07b074b39e	fs/super.c: sync ro remount after blocking writers Move sync_filesystem() after sb_prepare_remount_readonly(). If writers sneak in anywhere from sync_filesystem() to sb_prepare_remount_readonly() it can cause inodes to be dirtied and writeback to occur well after sys_mount() has completely successfully. This was spotted by corrupted ubifs filesystems on reboot, but appears that it can cause issues with any filesystem using writeback. CRs-Fixed: 627559 Change-Id: Ib417b59d39210aab2de4e5ae48b18129e8bc3e26 Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> CC: Richard Weinberger <richard@nod.at> Co-authored-by: Richard Weinberger <richard@nod.at> Signed-off-by: Andrew Ruder <andrew.ruder@elecsyscorp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Git-commit: `807612db2f` Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Subhash Jadavani <subhashj@codeaurora.org>	2017-10-06 10:28:36 +03:00
Daniel Rosenberg	f0226e8e93	ANDROID: sdcardfs: Add missing break Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 63245673 Change-Id: I5fc596420301045895e5a9a7e297fd05434babf9	2017-09-22 19:12:39 +03:00
Daniel Rosenberg	b58229a421	ANDROID: Sdcardfs: Move gid derivation under flag This moves the code to adjust the gid/uid of lower filesystem files under the mount flag derive_gid. Signed-off-by: Daniel Rosenberg <drosen@google.com> Change-Id: I44eaad4ef67c7fcfda3b6ea3502afab94442610c Bug: 63245673	2017-09-22 19:12:39 +03:00
Daniel Rosenberg	7a97e952ea	ANDROID: mnt: Fix freeing of mount data Fix double free on error paths Signed-off-by: Daniel Rosenberg <drosen@google.com> Change-Id: I1c25a175e87e5dd5cafcdcf9d78bf4c0dc3f88ef Bug: 65386954 Fixes: aa6d3ace42f9 ("mnt: Add filesystem private data to mount points")	2017-09-22 19:12:38 +03:00
Jaegeuk Kim	fc4f69b077	ANDROID: sdcardfs: override credential for ioctl to lower fs Otherwise, lower_fs->ioctl() fails due to inode_owner_or_capable(). Signed-off-by: Jaegeuk Kim <jaegeuk@google.com> Bug: 63260873 Change-Id: I623a6c7c5f8a3cbd7ec73ef89e18ddb093c43805	2017-09-22 19:12:38 +03:00
Andrew Chant	09d9a821ee	sdcardfs: limit stacking depth Limit filesystem stacking to prevent stack overflow. Bug: 32761463 Change-Id: I8b1462b9c0d6c7f00cf110724ffb17e7f307c51e Signed-off-by: Andrew Chant <achant@google.com> CVE-2014-9922	2017-09-22 19:12:37 +03:00
Miklos Szeredi	511ae95677	BACKPORT: fs: limit filesystem stacking depth Add a simple read-only counter to super_block that indicates how deep this is in the stack of filesystems. Previously ecryptfs was the only stackable filesystem and it explicitly disallowed multiple layers of itself. Overlayfs, however, can be stacked recursively and also may be stacked on top of ecryptfs or vice versa. To limit the kernel stack usage we must limit the depth of the filesystem stack. Initially the limit is set to 2. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> (cherry picked from commit `69c433ed2e`) Bug: 32761463 Change-Id: I69b2fba2112db2ece09a1bf61a44f8fc4db00820 CVE-2014-9922	2017-09-22 19:12:37 +03:00
Daniel Rosenberg	b4b6b9d276	ANDROID: sdcardfs: Remove unnecessary lock The mmap_sem lock does not appear to be protecting anything, and has been removed in Samsung's more recent versions of sdcardfs. Signed-off-by: Daniel Rosenberg <drosen@google.com> Change-Id: I76ff3e33002716b8384fc8be368028ed63dffe4e Bug: 63785372	2017-09-22 19:12:37 +03:00
Gao Xiang	aa7652a14a	ANDROID: sdcardfs: use mount_nodev and fix a issue in sdcardfs_kill_sb Use the VFS mount_nodev instead of customized mount_nodev_with_options and fix generic_shutdown_super to kill_anon_super because of set_anon_super Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Change-Id: Ibe46647aa2ce49d79291aa9d0295e9625cfccd80	2017-09-22 19:12:36 +03:00
Greg Hackmann	b429963b29	ANDROID: sdcardfs: remove dead function open_flags_to_access_mode() smatch warns about the suspicious formatting in the last line of open_flags_to_access_mode(). It turns out the only caller was deleted over a year ago by "ANDROID: sdcardfs: Bring up to date with Android M permissions:", so we can "fix" the function's formatting by deleting it. Change-Id: Id85946f3eb01722eef35b1815f405a6fda3aa4ff Signed-off-by: Greg Hackmann <ghackmann@google.com>	2017-09-22 19:12:36 +03:00
Daniel Rosenberg	72c554da0c	ANDROID: sdcardfs: d_splice_alias can return error values We must check that d_splice_alias was successful before using its output. Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 62390017 Change-Id: Ifda0a052fb3f67e35c635a4e5e907876c5400978	2017-09-22 19:12:35 +03:00
Daniel Rosenberg	89a658a855	ANDROID: mnt: Fix next_descendent next_descendent did not properly handle the case where the initial mount had no slaves. In this case, we would look for the next slave, but since don't have a master, the check for wrapping around to the start of the list will always fail. Instead, we check for this case, and ensure that we end the iteration when we come back to the root. Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 62094374 Change-Id: I43dfcee041aa3730cb4b9a1161418974ef84812e	2017-09-22 19:12:35 +03:00
Daniel Rosenberg	443df072f0	ANDROID: sdcardfs: Check for NULL in revalidate If the inode is in the process of being evicted, the top value may be NULL. Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 38502532 Change-Id: I0b9d04aab621e0398d44d1c5dc53293106aa5f89	2017-09-22 19:12:35 +03:00
Dmitry Shmidt	684ee76f0e	ANDROID: sdcardfs: Add linux/kref.h include Change-Id: I8be0f6fc7aa6dc1d639d2d22b230783c68574389 Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>	2017-09-22 19:12:34 +03:00
Daniel Rosenberg	8472923d21	ANDROID: sdcardfs: Move top to its own struct Move top, and the associated data, to its own struct. This way, we can properly track refcounts on top without interfering with the inode's accounting. Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 38045152 Change-Id: I1968e480d966c3f234800b72e43670ca11e1d3fd	2017-09-22 19:12:34 +03:00
Gao Xiang	1f52ba5673	ANDROID: sdcardfs: fix sdcardfs_destroy_inode for the inode RCU approach According to the following commits, fs: icache RCU free inodes vfs: fix the stupidity with i_dentry in inode destructors sdcardfs_destroy_inode should be fixed for the fast path safety. Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Change-Id: I84f43c599209d23737c7e28b499dd121cb43636d	2017-09-22 19:12:33 +03:00
Daniel Roseberg	62b650471f	ANDROID: sdcardfs: Don't iput if we didn't igrab If we fail to get top, top is either NULL, or igrab found that we're in the process of freeing that inode, and did not grab it. Either way, we didn't grab it, and have no business putting it. Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 38117720 Change-Id: Ie2f587483b9abb5144263156a443e89bc69b767b	2017-09-22 19:12:33 +03:00
Daniel Rosenberg	e43a6b01eb	ANDROID: sdcardfs: Call lower fs's revalidate We should be calling the lower filesystem's revalidate inside of sdcardfs's revalidate, as wrapfs does. Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 35766959 Change-Id: I939d1c4192fafc1e21678aeab43fe3d588b8e2f4	2017-09-22 19:12:33 +03:00
Daniel Rosenberg	88598fa8ae	ANDROID: sdcardfs: Avoid setting GIDs outside of valid ranges When setting up the ownership of files on the lower filesystem, ensure that these values are in reasonable ranges for apps. If they aren't, default to AID_MEDIA_RW Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 37516160 Change-Id: I0bec76a61ac72aff0b993ab1ad04be8382178a00	2017-09-22 19:12:32 +03:00
Daniel Rosenberg	e293feb047	Revert "Revert "Android: sdcardfs: Don't do d_add for lower fs"" This reverts commit ffa75fdb9c408f49b9622b6d55752ed99ff61488. Turns out we just needed the right hash. Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 37231161 Change-Id: I6a6de7f7df99ad42b20fa062913b219f64020c31	2017-09-22 19:12:32 +03:00
Daniel Rosenberg	24cce439d5	ANDROID: sdcardfs: Use filesystem specific hash We weren't accounting for FS specific hash functions, causing us to miss negative dentries for any FS that had one. Similar to a patch from esdfs commit 75bd25a9476d ("esdfs: support lower's own hash") Signed-off-by: Daniel Rosenberg <drosen@google.com> Change-Id: I32d1ba304d728e0ca2648cacfb4c2e441ae63608	2017-09-22 19:12:32 +03:00
Daniel Rosenberg	b380b6ee1c	Revert "Android: sdcardfs: Don't do d_add for lower fs" This reverts commit 60df9f12992bc067216078ae756066c5d7c74d87. This change caused issues for sdcardfs on top of vfat Signed-off-by: Daniel Rosenberg <drosen@google.com> Change-Id: Ie56a91fda582af27921cc1a9de7ae19a9a988f2a	2017-09-22 19:12:31 +03:00
Daniel Rosenberg	d3c7b5afda	Android: sdcardfs: Don't complain in fixup_lower_ownership Not all filesystems support changing the owner of a file. We shouldn't complain if it doesn't happen. Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 37488099 Change-Id: I403e44ab7230f176e6df82f6adb4e5c82ce57f33	2017-09-22 19:12:31 +03:00
Daniel Rosenberg	cbfba1be5a	Android: sdcardfs: Don't do d_add for lower fs For file based encryption, ext4 explicitly does not create negative dentries for encrypted files. If you force one over it, the decrypted file will be hidden until the cache is cleared. Instead, just fail out. Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 37231161 Change-Id: Id2a9708dfa75e1c22f89915c529789caadd2ca4b	2017-09-22 19:12:30 +03:00
Daniel Rosenberg	7727a9e52e	ANDROID: sdcardfs: ->iget fixes Adapted from wrapfs commit 8c49eaa0sb9c ("Wrapfs: ->iget fixes") Change where we igrab/iput to ensure we always hold a valid lower_inode. Return ENOMEM (not EACCES) if iget5_locked returns NULL. Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 35766959 Change-Id: Id8d4e0c0cbc685a0a77685ce73c923e9a3ddc094	2017-09-22 19:12:30 +03:00
Daniel Rosenberg	d2f7577f20	Android: sdcardfs: Change cache GID value Change-Id: Ieb955dd26493da26a458bc20fbbe75bca32b094f Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 37193650	2017-09-22 19:12:29 +03:00
Daniel Rosenberg	c3da3c511e	ANDROID: sdcardfs: Directly pass lower file for mmap Instead of relying on a copy hack, pass the lower file as private data. This lets the kernel find the vma mapping for pages used by the file, allowing pages used by mapping to be reclaimed. This is adapted from following esdfs patches commit 0647e638d: ("esdfs: store lower file in vm_file for mmap") commit 064850866: ("esdfs: keep a counter for mmaped file") Change-Id: I75b74d1e5061db1b8c13be38d184e118c0851a1a Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:29 +03:00
Daniel Rosenberg	7fc65bd919	ANDROID: sdcardfs: update module info Signed-off-by: Daniel Rosenberg <drosen@google.com> Change-Id: I958c7c226d4e9265fea8996803e5b004fb33d8ad	2017-09-22 19:12:29 +03:00
Daniel Rosenberg	5dc3989ffd	ANDROID: sdcardfs: use d_splice_alias adapted from wrapfs commit 9671770ff8b9 ("Wrapfs: use d_splice_alias") Refactor interpose code to allow lookup to use d_splice_alias. Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 35766959 Change-Id: Icf51db8658202c48456724275b03dc77f73f585b	2017-09-22 19:12:28 +03:00
Daniel Rosenberg	dde08eb9d7	ANDROID: sdcardfs: fix ->llseek to update upper and lower offset Adapted from wrapfs commit 1d1d23a47baa ("Wrapfs: fix ->llseek to update upper and lower offsets") Fixes bug: xfstests generic/257. f_pos consistently is required by and only by dir_ops->wrapfs_readdir, main_ops is not affected. Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Mengyang Li <li.mengyang@stonybrook.edu> Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 35766959 Change-Id: I360a1368ac37ea8966910a58972b81504031d437	2017-09-22 19:12:28 +03:00
Daniel Rosenberg	4923777bb0	ANDROID: sdcardfs: copy lower inode attributes in ->ioctl Adapted from wrapfs commit fbc9c6f83ea6 ("Wrapfs: copy lower inode attributes in ->ioctl") commit e97d8e26cc9e ("Wrapfs: use file_inode helper") Some ioctls (e.g., EXT2_IOC_SETFLAGS) can change inode attributes, so copy them from lower inode. Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 35766959 Change-Id: I0f12684b9dbd4088b4a622c7ea9c03087f40e572	2017-09-22 19:12:28 +03:00
Daniel Rosenberg	21af28ae07	ANDROID: sdcardfs: remove unnecessary call to do_munmap Adapted from wrapfs commit 5be6de9ecf02 ("Wrapfs: use vm_munmap in ->mmap") commit 2c9f6014a8bb ("Wrapfs: remove unnecessary call to vm_unmap in ->mmap") Code is unnecessary and causes deadlocks in newer kernels. Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 35766959 Change-Id: Ia252d60c60799d7e28fc5f1f0f5b5ec2430a2379	2017-09-22 19:12:27 +03:00
Daniel Rosenberg	2761faa286	ANDROID: sdcardfs: Fix style issues in macros Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 35331000 Change-Id: I89c4035029dc2236081a7685c55cac595d9e7ebf	2017-09-22 19:12:27 +03:00
Daniel Rosenberg	b3fd5e6086	ANDROID: sdcardfs: Use seq_puts over seq_printf Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 35331000 Change-Id: I3795ec61ce61e324738815b1ce3b0e09b25d723f	2017-09-22 19:12:26 +03:00
Daniel Rosenberg	3a6d093272	ANDROID: sdcardfs: Use to kstrout Switch from deprecated simple_strtoul to kstrout Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 35331000 Change-Id: If18bd133b4d2877f71e58b58fc31371ff6613ed5	2017-09-22 19:12:26 +03:00
Daniel Rosenberg	a1f2d9d927	ANDROID: sdcardfs: Use pr_[...] instead of printk Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 35331000 Change-Id: Ibc635ec865750530d32b87067779f681fe58a003	2017-09-22 19:12:26 +03:00
Daniel Rosenberg	def7e34e8b	ANDROID: sdcardfs: remove unneeded null check As pointed out by checkpatch, these functions already handle null inputs, so the checks are not needed. Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 35331000 Change-Id: I189342f032dfcefee36b27648bb512488ad61d20	2017-09-22 19:12:24 +03:00
Daniel Rosenberg	1ff2bf9007	ANDROID: sdcardfs: Fix style issues with comments Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 35331000 Change-Id: I8791ef7eac527645ecb9407908e7e5ece35b8f80	2017-09-22 19:12:23 +03:00
Daniel Rosenberg	1fb0168abb	ANDROID: sdcardfs: Fix formatting This fixes various spacing and bracket related issues pointed out by checkpatch. Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 35331000 Change-Id: I6e248833a7a04e3899f3ae9462d765cfcaa70c96	2017-09-22 19:12:23 +03:00
Daniel Rosenberg	9a544b4fd9	ANDROID: sdcardfs: correct order of descriptors Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 35331000 Change-Id: Ia6d16b19c8c911f41231d2a12be0740057edfacf	2017-09-22 19:12:23 +03:00
Daniel Rosenberg	cc90c372fa	ANDROID: sdcardfs: Fix gid issue We were already calculating most of these values, and erroring out because the check was confused by this. Instead of recalculating, adjust it as needed. Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 36160015 Change-Id: I9caf3e2fd32ca2e37ff8ed71b1d392f1761bc9a9	2017-09-22 19:12:22 +03:00
Daniel Rosenberg	183d00676a	ANDROID: sdcardfs: Use tabs instead of spaces in multiuser.h Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 35331000 Change-Id: Ic7801914a7dd377e270647f81070020e1f0bab9b	2017-09-22 19:12:22 +03:00
Daniel Rosenberg	a79d11e62a	ANDROID: sdcardfs: Remove uninformative prints At best these prints do not provide useful information, and at worst, some allow userspace to abuse the kernel log. Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 36138424 Change-Id: I812c57cc6a22b37262935ab77f48f3af4c36827e	2017-09-22 19:12:21 +03:00
Daniel Rosenberg	527954f2c9	ANDROID: sdcardfs: move path_put outside of spinlock Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 35643557 Change-Id: Ib279ebd7dd4e5884d184d67696a93e34993bc1ef	2017-09-22 19:12:21 +03:00
Daniel Rosenberg	9fc6dbe472	ANDROID: sdcardfs: Use case insensitive hash function Case insensitive comparisons don't help us much if we hash to different buckets... Signed-off-by: Daniel Rosenberg <drosen@google.com> bug: 36004503 Change-Id: I91e00dbcd860a709cbd4f7fd7fc6d855779f3285	2017-09-22 19:12:21 +03:00
Daniel Rosenberg	63cf0c300a	ANDROID: sdcardfs: declare MODULE_ALIAS_FS From commit ee616b78aa87 ("Wrapfs: declare MODULE_ALIAS_FS") Signed-off-by: Daniel Rosenberg <drosen@google.com> bug: 35766959 Change-Id: Ia4728ab49d065b1d2eb27825046f14b97c328cba	2017-09-22 19:12:20 +03:00
Eric W. Biederman	5c1997410b	fs: Limit sys_mount to only request filesystem modules. Modify the request_module to prefix the file system type with "fs-" and add aliases to all of the filesystems that can be built as modules to match. A common practice is to build all of the kernel code and leave code that is not commonly needed as modules, with the result that many users are exposed to any bug anywhere in the kernel. Looking for filesystems with a fs- prefix limits the pool of possible modules that can be loaded by mount to just filesystems trivially making things safer with no real cost. Using aliases means user space can control the policy of which filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf with blacklist and alias directives. Allowing simple, safe, well understood work-arounds to known problematic software. This also addresses a rare but unfortunate problem where the filesystem name is not the same as it's module name and module auto-loading would not work. While writing this patch I saw a handful of such cases. The most significant being autofs that lives in the module autofs4. This is relevant to user namespaces because we can reach the request module in get_fs_type() without having any special permissions, and people get uncomfortable when a user specified string (in this case the filesystem type) goes all of the way to request_module. After having looked at this issue I don't think there is any particular reason to perform any filtering or permission checks beyond making it clear in the module request that we want a filesystem module. The common pattern in the kernel is to call request_module() without regards to the users permissions. In general all a filesystem module does once loaded is call register_filesystem() and go to sleep. Which means there is not much attack surface exposed by loading a filesytem module unless the filesystem is mounted. In a user namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT, which most filesystems do not set today. Change-Id: I623b13dbdb44bb9ba7481f29575e1ca4ad8102f4 Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Kees Cook <keescook@chromium.org> Reported-by: Kees Cook <keescook@google.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Kevin F. Haggerty <haggertk@lineageos.org>	2017-09-22 19:12:20 +03:00
Daniel Rosenberg	628b9661d7	ANDROID: sdcardfs: Get the blocksize from the lower fs This changes sdcardfs to be more in line with the getattr in wrapfs, which calls the lower fs's getattr to get the block size Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 34723223 Change-Id: I1c9e16604ba580a8cdefa17f02dcc489d7351aed	2017-09-22 19:12:19 +03:00
Daniel Rosenberg	30cc539e4c	ANDROID: sdcardfs: Use d_invalidate instead of drop_recurisve drop_recursive did not properly remove stale dentries. Instead, we use the vfs's d_invalidate, which does the proper cleanup. Additionally, remove the no longer used drop_recursive, and fixup_top_recursive that that are no longer used. Signed-off-by: Daniel Rosenberg <drosen@google.com> Change-Id: Ibff61b0c34b725b024a050169047a415bc90f0d8	2017-09-22 19:12:19 +03:00
Daniel Rosenberg	6b5418751d	ANDROID: sdcardfs: Switch to internal case insensitive compare There were still a few places where we called into a case insensitive lookup that was not defined by sdcardfs. Moving them all to the same place will allow us to switch the implementation in the future. Additionally, the check in fixup_perms_recursive did not take into account the length of both strings, causing extraneous matches when the name we were looking for was a prefix of the child name. Signed-off-by: Daniel Rosenberg <drosen@google.com> Change-Id: I45ce768cd782cb4ea1ae183772781387c590ecc2	2017-09-22 19:12:18 +03:00
Daniel Rosenberg	280a7d21b7	ANDROID: sdcardfs: Use spin_lock_nested Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 36007653 Change-Id: I805d5afec797669679853fb2bb993ee38e6276e4	2017-09-22 19:12:18 +03:00
Daniel Rosenberg	9917de6f14	ANDROID: sdcardfs: Replace get/put with d_lock dput cannot be called with a spin_lock. Instead, we protect our accesses by holding the d_lock. Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 35643557 Change-Id: I22cf30856d75b5616cbb0c223724f5ab866b5114	2017-09-22 19:12:17 +03:00
Daniel Rosenberg	c3be309216	ANDROID: sdcardfs: rate limit warning print Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 35848445 Change-Id: Ida72ea0ece191b2ae4a8babae096b2451eb563f6	2017-09-22 19:12:17 +03:00
Daniel Rosenberg	74b45b5a54	ANDROID: sdcardfs: support direct-IO (DIO) operations This comes from the wrapfs patch 2e346c83b26e Wrapfs: support direct-IO (DIO) operations Signed-off-by: Li Mengyang <li.mengyang@stonybrook.edu> Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 34133558 Change-Id: I3fd779c510ab70d56b1d918f99c20421b524cdc4	2017-09-22 19:12:17 +03:00
Daniel Rosenberg	7bc8a0524c	ANDROID: sdcardfs: implement vm_ops->page_mkwrite This comes from the wrapfs patch 3dfec0ffe5e2 Wrapfs: implement vm_ops->page_mkwrite Some file systems (e.g., ext4) require it. Reported by Ted Ts'o. Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu> Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 34133558 Change-Id: I1a389b2422c654a6d3046bb8ec3e20511aebfa8e	2017-09-22 19:12:16 +03:00
Daniel Rosenberg	d782165c3b	ANDROID: sdcardfs: Don't bother deleting freelist There is no point deleting entries from dlist, as that is a temporary list on the stack from which contains only entries that are being deleted. Not all code paths set up dlist, so those that don't were performing invalid accesses in hash_del_rcu. As an additional means to prevent any other issue, we null out the list entries when we allocate from the cache. Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 35666680 Change-Id: Ibb1e28c08c3a600c29418d39ba1c0f3db3bf31e5	2017-09-22 19:12:16 +03:00
Daniel Rosenberg	4a7fc6483f	ANDROID: sdcardfs: Add missing path_put "ANDROID: sdcardfs: Add GID Derivation to sdcardfs" introduced an unbalanced pat_get, leading to storage space not being freed after deleting a file until rebooting. This adds the missing path_put. Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 34691169 Change-Id: Ia7ef97ec2eca2c555cc06b235715635afc87940e	2017-09-22 19:12:15 +03:00
Daniel Rosenberg	2f8e9489d3	ANDROID: sdcardfs: Fix incorrect hash This adds back the hash calculation removed as part of the previous patch, as it is in fact necessary. Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 35307857 Change-Id: Ie607332bcf2c5d2efdf924e4060ef3f576bf25dc	2017-09-22 19:12:15 +03:00
Daniel Rosenberg	f9c56b73bd	ANDROID: sdcardfs: Switch strcasecmp for internal call This moves our uses of strcasecmp over to an internal call so we can easily change implementations later if we so desire. Additionally, we leverage qstr's where appropriate to save time on comparisons. Change-Id: I32fdc4fd0cd3b7b735dcfd82f60a2516fd8272a5 Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:15 +03:00
Al Viro	acf4f74449	constify d_lookup() arguments Change-Id: I48ac8c9d7a63530b753b9d7b316e9222edeb5876 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2017-09-22 19:12:14 +03:00
Al Viro	44fc4d2f9a	constify __d_lookup() arguments Change-Id: I74c489fee16eb019d9d32572d867d6b54bf6cc91 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2017-09-22 19:12:14 +03:00
Daniel Rosenberg	2659b94b05	ANDROID: sdcardfs: switch to full_name_hash and qstr Use the kernel's string hash function instead of rolling our own. Additionally, save a bit of calculation by using the qstr struct in place of strings. Change-Id: I0bbeb5ec2a9233f40135ad632e6f22c30ffa95c1 Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:13 +03:00
Daniel Rosenberg	b134627d5d	ANDROID: sdcardfs: Add GID Derivation to sdcardfs This changes sdcardfs to modify the user and group in the underlying filesystem depending on its usage. Ownership is set by Android user, and package, as well as if the file is under obb or cache. Other files can be labeled by extension. Those values are set via the configfs interace. To add an entry, mkdir -p [configfs root]/sdcardfs/extensions/[gid]/[ext] Bug: 34262585 Change-Id: I4e030ce84f094a678376349b1a96923e5076a0f4 Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:13 +03:00
Daniel Rosenberg	ce26b95540	ANDROID: sdcardfs: Remove redundant operation We call get_derived_permission_new unconditionally, so we don't need to call update_derived_permission_lock, which does the same thing. Change-Id: I0748100828c6af806da807241a33bf42be614935 Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:13 +03:00
Daniel Rosenberg	230896c581	ANDROID: sdcardfs: add support for user permission isolation This allows you to hide the existence of a package from a user by adding them to an exclude list. If a user creates that package's folder and is on the exclude list, they will not see that package's id. Bug: 34542611 Change-Id: I9eb82e0bf2457d7eb81ee56153b9c7d2f6646323 Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:12 +03:00
Daniel Rosenberg	9e78e3b970	ANDROID: sdcardfs: Refactor configfs interface This refactors the configfs code to be more easily extended. It will allow additional files to be added easily. Bug: 34542611 Bug: 34262585 Change-Id: I73c9b0ae5ca7eb27f4ebef3e6807f088b512d539 Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:12 +03:00
Daniel Rosenberg	26117e4ce4	ANDROID: sdcardfs: Allow non-owners to touch This modifies the permission checks in setattr to allow for non-owners to modify the timestamp of files to things other than the current time. This still requires write access, as enforced by the permission call, but relaxes the requirement that the caller must be the owner, allowing those with group permissions to change it as well. Bug: 11118565 Change-Id: Ied31f0cce2797675c7ef179eeb4e088185adcbad Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:11 +03:00
Daniel Rosenberg	fdfefc2e98	ANDROID: mnt: remount should propagate to slaves of slaves propagate_remount was not accounting for the slave mounts of other slave mounts, leading to some namespaces not recieving the remount information. Signed-off-by: Daniel Rosenberg <drosen@google.com> Bug: 33731928 Change-Id: Idc9e8c2ed126a4143229fc23f10a959c2d0a3854	2017-09-22 19:12:11 +03:00
Daniel Rosenberg	cd769ec65b	ANDROID: sdcardfs: Fix locking issue with permision fix up Don't use lookup_one_len so we can grab the spinlock that protects d_subdirs. Bug: 30954918 Change-Id: I0c6a393252db7beb467e0d563739a3a14e1b5115 Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:11 +03:00
Daniel Rosenberg	ac0146e438	ANDROID: vfs: Missed updating truncate to truncate2 Bug: 30954918 Change-Id: I8163d3f86dd7aadb2ab3fc11816754f331986f05 Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:10 +03:00
Al Viro	c4d2a199dd	BACKPORT: smarter propagate_mnt() The current mainline has copies propagated to all nodes, then tears down the copies we made for nodes that do not contain counterparts of the desired mountpoint. That sets the right propagation graph for the copies (at teardown time we move the slaves of removed node to a surviving peer or directly to master), but we end up paying a fairly steep price in useless allocations. It's fairly easy to create a situation where N calls of mount(2) create exactly N bindings, with O(N^2) vfsmounts allocated and freed in process. Fortunately, it is possible to avoid those allocations/freeings. The trick is to create copies in the right order and find which one would've eventually become a master with the current algorithm. It turns out to be possible in O(nodes getting propagation) time and with no extra allocations at all. One part is that we need to make sure that eventual master will be created before its slaves, so we need to walk the propagation tree in a different order - by peer groups. And iterate through the peers before dealing with the next group. Another thing is finding the (earlier) copy that will be a master of one we are about to create; to do that we are (temporary) marking the masters of mountpoints we are attaching the copies to. Either we are in a peer of the last mountpoint we'd dealt with, or we have the following situation: we are attaching to mountpoint M, the last copy S_0 had been attached to M_0 and there are sequences S_0...S_n, M_0...M_n such that S_{i+1} is a master of S_{i}, S_{i} mounted on M{i} and we need to create a slave of the first S_{k} such that M is getting propagation from M_{k}. It means that the master of M_{k} will be among the sequence of masters of M. On the other hand, the nearest marked node in that sequence will either be the master of M_{k} or the master of M_{k-1} (the latter - in the case if M_{k-1} is a slave of something M gets propagation from, but in a wrong peer group). So we go through the sequence of masters of M until we find a marked one (P). Let N be the one before it. Then we go through the sequence of masters of S_0 until we find one (say, S) mounted on a node D that has P as master and check if D is a peer of N. If it is, S will be the master of new copy, if not - the master of S will be. That's it for the hard part; the rest is fairly simple. Iterator is in next_group(), handling of one prospective mountpoint is propagate_one(). It seems to survive all tests and gives a noticably better performance than the current mainline for setups that are seriously using shared subtrees. Change-Id: I45648e8a405544f768c5956711bdbdf509e2705a Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2017-09-22 19:12:10 +03:00
Al Viro	bd0f854d88	BACKPORT: don't bother with propagate_mnt() unless the target is shared If the dest_mnt is not shared, propagate_mnt() does nothing - there's no mounts to propagate to and thus no copies to create. Might as well don't bother calling it in that case. Change-Id: Id94af8ad288bf9bfc6ffb5570562bbc2dc2e0d87 Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2017-09-22 19:12:09 +03:00
Daniel Rosenberg	0ccedbc84f	sdcardfs: Use per mount permissions This switches sdcardfs over to using permission2. Instead of mounting several sdcardfs instances onto the same underlaying directory, you bind mount a single mount several times, and remount with the options you want. These are stored in the private mount data, allowing you to maintain the same tree, but have different permissions for different mount points. Warning functions have been added for permission, as it should never be called, and the correct behavior is unclear. Change-Id: I841b1d70ec60cf2b866fa48edeb74a0b0f8334f5 Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:08 +03:00
Daniel Rosenberg	3edd0f78ef	sdcardfs: Add gid and mask to private mount data Adds support for mount2, remount2, and the functions to allocate/clone/copy the private data The next patch will switch over to actually using it. Change-Id: I8a43da26021d33401f655f0b2784ead161c575e3 Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:08 +03:00
Daniel Rosenberg	8967b8cde8	sdcardfs: User new permission2 functions Change-Id: Ic7e0fb8fdcebb31e657b079fe02ac834c4a50db9 Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:08 +03:00
Daniel Rosenberg	1620d1d7d4	vfs: Add setattr2 for filesystems with per mount permissions This allows filesystems to use their mount private data to influence the permssions they use in setattr2. It has been separated into a new call to avoid disrupting current setattr users. Change-Id: I19959038309284448f1b7f232d579674ef546385 Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:07 +03:00
Daniel Rosenberg	19a3f7c232	vfs: Add permission2 for filesystems with per mount permissions This allows filesystems to use their mount private data to influence the permssions they return in permission2. It has been separated into a new call to avoid disrupting current permission users. Change-Id: I9d416e3b8b6eca84ef3e336bd2af89ddd51df6ca Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:07 +03:00
Daniel Rosenberg	2686aceaa7	vfs: Allow filesystems to access their private mount data Now we pass the vfsmount when mounting and remounting. This allows the filesystem to actually set up the mount specific data, although we can't quite do anything with it yet. show_options is expanded to include data that lives with the mount. To avoid changing existing filesystems, these have been added as new vfs functions. Change-Id: If80670bfad9f287abb8ac22457e1b034c9697097 Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:06 +03:00
Daniel Rosenberg	ccbd24c7a0	mnt: Add filesystem private data to mount points This starts to add private data associated directly to mount points. The intent is to give filesystems a sense of where they have come from, as a means of letting a filesystem take different actions based on this information. Change-Id: Ie769d7b3bb2f5972afe05c1bf16cf88c91647ab2 Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:06 +03:00
Daniel Rosenberg	313cdb2651	ANDROID: sdcardfs: Fix backport issue for 3.10 Don't use make_kuid/make_guid Bug: 30954918 Change-Id: I56de640771872aeeae5a69c42bf2ce8a5cfa413f Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:05 +03:00
Daniel Rosenberg	87b619e0cb	sdcardfs: Move directory unlock before touch This removes a deadlock under low memory conditions. filp_open can call lookup_slow, which will attempt to lock the parent. Change-Id: I940643d0793f5051d1e79a56f4da2fa8ca3d8ff7 Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:05 +03:00
alvin_liang	b94d192e3e	sdcardfs: fix external storage exporting incorrect uid Symptom: App cannot write into per-app folder Root Cause: sdcardfs exports incorrect uid Solution: fix uid Project: All Note: Test done by RD: passed Change-Id: Iff64f6f40ba4c679f07f4426d3db6e6d0db7e3ca	2017-09-22 19:12:04 +03:00
Daniel Rosenberg	00fb55c68d	sdcardfs: Added top to sdcardfs_inode_info Adding packages to the package list and moving files takes a large amount of locks, and is currently a heavy operation. This adds a 'top' field to the inode_info, which points to the inode for the top most directory whose owner you would like to match. On permission checks and get_attr, we look up the owner based on the information at top. When we change a package mapping, we need only modify the information in the corresponding top inode_info's. When renaming, we must ensure top is set correctly in all children. This happens when an app specific folder gets moved outside of the folder for that app. Change-Id: Ib749c60b568e9a45a46f8ceed985c1338246ec6c Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:04 +03:00
Daniel Rosenberg	853f8e0523	sdcardfs: Switch package list to RCU Switched the package id hashmap to use RCU. Change-Id: I9fdcab279009005bf28536247d11e13babab0b93 Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:03 +03:00
Daniel Rosenberg	b7ae873970	sdcardfs: Fix locking for permission fix up Iterating over d_subdirs requires taking d_lock. Removed several unneeded locks. Change-Id: I5b1588e54c7e6ee19b756d6705171c7f829e2650 Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:03 +03:00
Daniel Rosenberg	82f20f9741	sdcardfs: Check for other cases on path lookup This fixes a bug where the first lookup of a file or folder created under a different view would not be case insensitive. It will now search through for a case insensitive match if the initial lookup fails. Bug:28024488 Change-Id: I4ff9ce297b9f2f9864b47540e740fd491c545229 Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:03 +03:00
Daniel Rosenberg	feccf66542	sdcardfs: override umask on mkdir and create The mode on files created on the lower fs should not be affected by the umask of the calling task's fs_struct. Instead, we create a copy and modify it as needed. This also lets us avoid the string shenanigans around .nomedia files. Bug: 27992761 Change-Id: Ia3a6e56c24c6e19b3b01c1827e46403bb71c2f4c Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:02 +03:00
Julia Lawall	b46375c331	ANDROID: sdcardfs: fix itnull.cocci warnings List_for_each_entry has the property that the first argument is always bound to a real list element, never NULL, so testing dentry is not needed. Generated by: scripts/coccinelle/iterators/itnull.cocci Change-Id: I51033a2649eb39451862b35b6358fe5cfe25c5f5 Cc: Daniel Rosenberg <drosen@google.com> Signed-off-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Guenter Roeck <groeck@chromium.org>	2017-09-22 19:12:02 +03:00
Daniel Rosenberg	ccf7c04945	sdcardfs: Truncate packages_gid.list on overflow packages_gid.list was improperly returning the wrong count. Use scnprintf instead, and inform the user that the list was truncated if it is. Bug: 30013843 Change-Id: Ida2b2ef7cd86dd87300bfb4c2cdb6bfe2ee1650d Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:01 +03:00
Daniel Rosenberg	611237fca8	vfs: change d_canonical_path to take two paths bug: 23904372 Change-Id: I4a686d64b6de37decf60019be1718e1d820193e6 Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:01 +03:00
Daniel Rosenberg	f5fea6a938	fuse: Add support for d_canonical_path Allows FUSE to report to inotify that it is acting as a layered filesystem. The userspace component returns a string representing the location of the underlying file. If the string cannot be resolved into a path, the top level path is returned instead. bug: 23904372 Change-Id: Iabdca0bbedfbff59e9c820c58636a68ef9683d9f Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:12:00 +03:00
Daniel Rosenberg	10db9d59bf	sdcardfs: remove unneeded __init and __exit Change-Id: I2a2d45d52f891332174c3000e8681c5167c1564f	2017-09-22 19:12:00 +03:00
Daniel Rosenberg	679046d6e7	sdcardfs: Remove unused code Change-Id: Ie97cba27ce44818ac56cfe40954f164ad44eccf6	2017-09-22 19:12:00 +03:00
Daniel Rosenberg	158b6be502	sdcardfs: remove effectless config option CONFIG_SDCARD_FS_CI_SEARCH only guards a define for LOOKUP_CASE_INSENSITIVE, which is never used in the kernel. Remove both, along with the option matching that supports it. Change-Id: I363a8f31de8ee7a7a934d75300cc9ba8176e2edf Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:11:59 +03:00
Daniel Rosenberg	49e4c3ad85	inotify: Fix erroneous update of bit count Patch "vfs: add d_canonical_path for stacked filesystem support" erroneously updated the ALL_INOTIFY_BITS count. This changes it back Change-Id: Idb04edc736da276159d30f04c40cff9d6b1e070f	2017-09-22 19:11:59 +03:00
Daniel Rosenberg	3bf0ab5c8d	sdcardfs: Add support for d_canonicalize Change-Id: I5d6f0e71b8ca99aec4b0894412f1dfd1cfe12add Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:11:58 +03:00
Daniel Rosenberg	9f8d208e0e	vfs: add d_canonical_path for stacked filesystem support Inotify does not currently know when a filesystem is acting as a wrapper around another fs. This means that inotify watchers will miss any modifications to the base file, as well as any made in a separate stacked fs that points to the same file. d_canonical_path solves this problem by allowing the fs to map a dentry to a path in the lower fs. Inotify can use it to find the appropriate place to watch to be informed of all changes to a file. Change-Id: I09563baffad1711a045e45c1bd0bd8713c2cc0b6 Signed-off-by: Daniel Rosenberg <drosen@google.com>	2017-09-22 19:11:58 +03:00
Daniel Rosenberg	8e6b0f6c8c	sdcardfs: Bring up to date with Android M permissions: In M, the workings of sdcardfs were changed significantly. This brings sdcardfs into line with the changes. Change-Id: I10e91a84a884c838feef7aa26c0a2b21f02e052e	2017-09-22 19:11:57 +03:00
Daniel Campello	1dbc72eb35	sdcardfs: Changed type-cast in packagelist management Change-Id: Ic8842de2d7274b7a5438938d2febf5d8da867148	2017-09-22 19:11:57 +03:00
fluxi	fd2464db10	sdcardfs: Port to 3.4 Analog port to 3.10 by Daniel Campello <campello@google.com>. Change-Id: I0b05890cdd4332c5cfc2ffdf66a3f3a7890cce35	2017-09-22 19:11:56 +03:00
Daniel Campello	6b980874d5	Included sdcardfs source code for kernel 3.0 Only included the source code as is for kernel 3.0. Following patches take care of porting this file system to version 3.10. Change-Id: I09e76db77cd98a059053ba5b6fd88572a4b75b5b Signed-off-by: Daniel Campello <campello@google.com>	2017-09-22 19:11:56 +03:00
Al Viro	be084ce9f1	move d_rcu from overlapping d_child to overlapping d_alias commit 946e51f2bf37f1656916eb75bd0742ba33983c28 upstream. Change-Id: I85366e6ce0423ec9620bcc9cd3e7695e81aa1171 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> [bwh: Backported to 3.2: - Apply name changes in all the different places we use d_alias and d_child - Move the WARN_ON() in __d_free() to d_free() as we don't have dentry_free()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> [lizf: Backported to 3.4: - adjust context - need one more name change in debugfs]	2017-09-22 19:11:55 +03:00
Al Viro	f787204b2f	get rid of kern_path_parent() all callers want the same thing, actually - a kinda-sorta analog of kern_path_create(). I.e. they want parent vfsmount/dentry (with ->i_mutex held, to make sure the child dentry is still their child) + the child dentry. Signed-off-by Al Viro <viro@zeniv.linux.org.uk> Change-Id: I58cc7b0a087646516db9af69962447d27fb3ee8b	2017-09-22 19:11:54 +03:00
Andy Lutomirski	b6dd4f6779	UPSTREAM: capabilities: ambient capabilities Credit where credit is due: this idea comes from Christoph Lameter with a lot of valuable input from Serge Hallyn. This patch is heavily based on Christoph's patch. ===== The status quo ===== On Linux, there are a number of capabilities defined by the kernel. To perform various privileged tasks, processes can wield capabilities that they hold. Each task has four capability masks: effective (pE), permitted (pP), inheritable (pI), and a bounding set (X). When the kernel checks for a capability, it checks pE. The other capability masks serve to modify what capabilities can be in pE. Any task can remove capabilities from pE, pP, or pI at any time. If a task has a capability in pP, it can add that capability to pE and/or pI. If a task has CAP_SETPCAP, then it can add any capability to pI, and it can remove capabilities from X. Tasks are not the only things that can have capabilities; files can also have capabilities. A file can have no capabilty information at all [1]. If a file has capability information, then it has a permitted mask (fP) and an inheritable mask (fI) as well as a single effective bit (fE) [2]. File capabilities modify the capabilities of tasks that execve(2) them. A task that successfully calls execve has its capabilities modified for the file ultimately being excecuted (i.e. the binary itself if that binary is ELF or for the interpreter if the binary is a script.) [3] In the capability evolution rules, for each mask Z, pZ represents the old value and pZ' represents the new value. The rules are: pP' = (X & fP) \| (pI & fI) pI' = pI pE' = (fE ? pP' : 0) X is unchanged For setuid binaries, fP, fI, and fE are modified by a moderately complicated set of rules that emulate POSIX behavior. Similarly, if euid == 0 or ruid == 0, then fP, fI, and fE are modified differently (primary, fP and fI usually end up being the full set). For nonroot users executing binaries with neither setuid nor file caps, fI and fP are empty and fE is false. As an extra complication, if you execute a process as nonroot and fE is set, then the "secure exec" rules are in effect: AT_SECURE gets set, LD_PRELOAD doesn't work, etc. This is rather messy. We've learned that making any changes is dangerous, though: if a new kernel version allows an unprivileged program to change its security state in a way that persists cross execution of a setuid program or a program with file caps, this persistent state is surprisingly likely to allow setuid or file-capped programs to be exploited for privilege escalation. ===== The problem ===== Capability inheritance is basically useless. If you aren't root and you execute an ordinary binary, fI is zero, so your capabilities have no effect whatsoever on pP'. This means that you can't usefully execute a helper process or a shell command with elevated capabilities if you aren't root. On current kernels, you can sort of work around this by setting fI to the full set for most or all non-setuid executable files. This causes pP' = pI for nonroot, and inheritance works. No one does this because it's a PITA and it isn't even supported on most filesystems. If you try this, you'll discover that every nonroot program ends up with secure exec rules, breaking many things. This is a problem that has bitten many people who have tried to use capabilities for anything useful. ===== The proposed change ===== This patch adds a fifth capability mask called the ambient mask (pA). pA does what most people expect pI to do. pA obeys the invariant that no bit can ever be set in pA if it is not set in both pP and pI. Dropping a bit from pP or pI drops that bit from pA. This ensures that existing programs that try to drop capabilities still do so, with a complication. Because capability inheritance is so broken, setting KEEPCAPS, using setresuid to switch to nonroot uids, and then calling execve effectively drops capabilities. Therefore, setresuid from root to nonroot conditionally clears pA unless SECBIT_NO_SETUID_FIXUP is set. Processes that don't like this can re-add bits to pA afterwards. The capability evolution rules are changed: pA' = (file caps or setuid or setgid ? 0 : pA) pP' = (X & fP) \| (pI & fI) \| pA' pI' = pI pE' = (fE ? pP' : pA') X is unchanged If you are nonroot but you have a capability, you can add it to pA. If you do so, your children get that capability in pA, pP, and pE. For example, you can set pA = CAP_NET_BIND_SERVICE, and your children can automatically bind low-numbered ports. Hallelujah! Unprivileged users can create user namespaces, map themselves to a nonzero uid, and create both privileged (relative to their namespace) and unprivileged process trees. This is currently more or less impossible. Hallelujah! You cannot use pA to try to subvert a setuid, setgid, or file-capped program: if you execute any such program, pA gets cleared and the resulting evolution rules are unchanged by this patch. Users with nonzero pA are unlikely to unintentionally leak that capability. If they run programs that try to drop privileges, dropping privileges will still work. It's worth noting that the degree of paranoia in this patch could possibly be reduced without causing serious problems. Specifically, if we allowed pA to persist across executing non-pA-aware setuid binaries and across setresuid, then, naively, the only capabilities that could leak as a result would be the capabilities in pA, and any attacker already has those capabilities. This would make me nervous, though -- setuid binaries that tried to privilege-separate might fail to do so, and putting CAP_DAC_READ_SEARCH or CAP_DAC_OVERRIDE into pA could have unexpected side effects. (Whether these unexpected side effects would be exploitable is an open question.) I've therefore taken the more paranoid route. We can revisit this later. An alternative would be to require PR_SET_NO_NEW_PRIVS before setting ambient capabilities. I think that this would be annoying and would make granting otherwise unprivileged users minor ambient capabilities (CAP_NET_BIND_SERVICE or CAP_NET_RAW for example) much less useful than it is with this patch. ===== Footnotes ===== [1] Files that are missing the "security.capability" xattr or that have unrecognized values for that xattr end up with has_cap set to false. The code that does that appears to be complicated for no good reason. [2] The libcap capability mask parsers and formatters are dangerously misleading and the documentation is flat-out wrong. fE is not a mask; it's a single bit. This has probably confused every single person who has tried to use file capabilities. [3] Linux very confusingly processes both the script and the interpreter if applicable, for reasons that elude me. The results from thinking about a script's file capabilities and/or setuid bits are mostly discarded. Preliminary userspace code is here, but it needs updating: https://git.kernel.org/cgit/linux/kernel/git/luto/util-linux-playground.git/commit/?h=cap_ambient&id=7f5afbd175d2 Here is a test program that can be used to verify the functionality (from Christoph): /* * Test program for the ambient capabilities. This program spawns a shell * that allows running processes with a defined set of capabilities. * * (C) 2015 Christoph Lameter <cl@linux.com> * Released under: GPL v3 or later. * * * Compile using: * * gcc -o ambient_test ambient_test.o -lcap-ng * * This program must have the following capabilities to run properly: * Permissions for CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE * * A command to equip the binary with the right caps is: * * setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test * * * To get a shell with additional caps that can be inherited by other processes: * * ./ambient_test /bin/bash * * * Verifying that it works: * * From the bash spawed by ambient_test run * * cat /proc/$$/status * * and have a look at the capabilities. / / * Definitions from the kernel header files. These are going to be removed * when the /usr/include files have these defined. / static void set_ambient_cap(int cap) { int rc; capng_get_caps_process(); rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap); if (rc) { printf("Cannot add inheritable cap\n"); exit(2); } capng_apply(CAPNG_SELECT_CAPS); / Note the two 0s at the end. Kernel checks for these / if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) { perror("Cannot set cap"); exit(1); } } int main(int argc, char *argv) { int rc; set_ambient_cap(CAP_NET_RAW); set_ambient_cap(CAP_NET_ADMIN); set_ambient_cap(CAP_SYS_NICE); printf("Ambient_test forking shell\n"); if (execv(argv[1], argv + 1)) perror("Cannot exec"); return 0; } Signed-off-by: Christoph Lameter <cl@linux.com> # Original author Signed-off-by: Andy Lutomirski <luto@kernel.org> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Aaron Jones <aaronmdjones@gmail.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Andrew G. Morgan <morgan@kernel.org> Cc: Mimi Zohar <zohar@linux.vnet.ibm.com> Cc: Austin S Hemmelgarn <ahferroin7@gmail.com> Cc: Markku Savela <msa@moth.iki.fi> Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: James Morris <james.l.morris@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 58319057b7847667f0c9585b9de0e8932b0fdb08) Bug: 31038224 Change-Id: I88bc5caa782dc6be23dc7e839ff8e11b9a903f8c Signed-off-by: Jorge Lucangeli Obes <jorgelo@google.com>	2017-09-01 13:38:08 +03:00
Greg Hackmann	84a377cc19	timerfd: support CLOCK_BOOTTIME clock Add CLOCK_BOOTTIME support to timerfd Change-Id: I14dee6d1104f15a05f463a632268ac4564753faf Signed-off-by: Greg Hackmann <ghackmann@google.com>	2017-08-27 19:07:23 +03:00
Todd Poynor	9ae9589197	timerfd: add alarm timers Add support for clocks CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM. Change-Id: Iafc8445d3d7ffb35110c860f1607bf03f1edb895 Signed-off-by: Todd Poynor <toddpoynor@google.com>	2017-08-27 19:07:22 +03:00
John Stultz	1bc537874e	alarmtimers: Squash upstream changes staging: android-alarm: Switch from wakelocks to wakeup sources In their current AOSP tree, the Android in-kernel wakelock infrastructure has been reimplemented in terms of wakeup sources: http://git.linaro.org/gitweb?p=people/jstultz/android.git;a=commitdiff;h=e9911f4efdc55af703b8b3bb8c839e6f5dd173bb The Android alarm driver currently has stubbed out calls to wakelock functionality. So this patch simply converts the stubbed out wakelock calls to wakeup source calls, and removes the empty wakelock macros Greg, would you mind queuing this in staging-next? CC: Colin Cross <ccross@android.com> CC: Arve Hjønnevåg <arve@android.com> CC: Greg KH <gregkh@linuxfoundation.org> CC: Android Kernel Team <kernel-team@android.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Staging: android: alarm: Rename pr_alarm to alarm_dbg Rename a macro to make it explicit it's for debugging. Use %s: __func__ instead of embedding function names. Coalesce formats, align arguments. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> staging: Android: Fix NULL pointer related warning in alarm-dev.c file Fixes the following sparse warning: drivers/staging/android/alarm-dev.c:259:35: warning: Using plain integer as NULL pointer Cc: Brian Swetland <swetland@google.com> Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> staging: android: alarm: remove unnecessary goto statement Signed-off-by: Devendra Naga <devendra.aaru@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Staging: android: Alarm driver cleanups Little cleanups. Enum value ANDROID_ALARM_TYPE_COUNT was treated as an alarm type within a switch statement. That condition was unreachable though. Signed-off-by: Dae S. Kim <dae@velatum.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> staging: alarm-dev: Drop pre Android 1.0 _OLD ioctls Per Colin's comment: "The "support old userspace code" comment for those two ioctls has been there since pre-Android 1.0. Those apis are not exposed to Android apps, I don't see any problem deleting them." Thus this patch removes the ANDROID_ALARM_SET_OLD and ANDROID_ALARM_SET_AND_WAIT_OLD ioctl compatability logic. Change-Id: I5138aaa3cdbfb758aaef2cc7591cb5340b3640a0 Cc: Serban Constantinescu <serban.constantinescu@arm.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Colin Cross <ccross@google.com> Cc: Android Kernel Team <kernel-team@android.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> staging: alarm-dev: Refactor alarm-dev ioctl code in prep for compat_ioctl Cleanup the Android alarm-dev driver's ioctl code to refactor it in preparation for compat_ioctl support. Cc: Serban Constantinescu <serban.constantinescu@arm.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Colin Cross <ccross@google.com> Cc: Android Kernel Team <kernel-team@android.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> staging: alarm-dev: Implement compat_ioctl support Implement compat_ioctl support for the alarm-dev ioctl. Cc: Serban Constantinescu <serban.constantinescu@arm.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Colin Cross <ccross@google.com> Cc: Android Kernel Team <kernel-team@android.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> staging: alarm-dev: information leak in alarm_ioctl() Smatch complains that if we pass an invalid clock type then "ts" is never set. We need to check for errors earlier, otherwise we end up passing uninitialized stack data to userspace. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> staging: alarm-dev: information leak in alarm_compat_ioctl() If we pass an invalid clock type then "ts" is never set. We need to check for errors earlier, otherwise we end up passing uninitialized stack data to userspace. Reported-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> rtc: alarm: Add power-on alarm feature Android does not support powering up of phone through alarm. Adding shutdown hook in alarm driver which will set alarm while phone is going down so as to power-up the phone after alarm expiration. Change-Id: Ic2611e33ae9c1f8e83f21efdb93e26ac9f9499de Signed-off-by: Matthew Qin <yqin@codeaurora.org> qpnp-rtc: clear alarm register when rtc irq is disabled The rtc alarm register should be cleared when the rtc irq is disabled Change-Id: I97a8bf989ff610093240a6b308a297702da6cb89 Signed-off-by: Xiaocheng Li <lix@codeaurora.org> Signed-off-by: Matthew Qin <yqin@codeaurora.org> alarm : Fix the race conditions in alarm-dev.c There will be race conditions between alarm set and alarm clear if set_power_on_alarm is out of the lock.But set_power_on_alarm can't be put into spin_lock. So add it into mutex_lock. CRs-Fixed: 639115 Change-Id: I4226a95e499211c0d50ff7ce269467a57a410dc7 Signed-off-by: Mao Jinlong <c_jmao@codeaurora.org> switch timerfd_[sg]ettime(2) to fget_light() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> time: Enable alarmtimers Change-Id: I4d1ea553aa707ab0467e00dea86cedc8f6797b78	2017-08-25 20:00:02 +03:00
Jin Qian	9e20025f8b	f2fs: sanity check checkpoint segno and blkoff Make sure segno and blkoff read from raw image are valid. Cc: stable@vger.kernel.org Signed-off-by: Jin Qian <jinqian@google.com> [Jaegeuk Kim: adjust minor coding style] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Change-Id: Ie2505c071233c1a9dec2729fe1ad467689a1b7a2 (cherry picked from commit 15d3042a937c13f5d9244241c7a9c8416ff6e82a)	2017-08-07 18:11:20 -06:00
Jin Qian	46e0dfc447	f2fs: sanity check segment count F2FS uses 4 bytes to represent block address. As a result, supported size of disk is 16 TB and it equals to 16 * 1024 * 1024 / 2 segments. Signed-off-by: Jin Qian <jinqian@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Change-Id: I16b3cd6279bff1a221781a80b9b34744c9e7098f (cherry picked from commit b9dd46188edc2f0d1f37328637860bb65a771124)	2017-08-07 18:11:13 -06:00
Thomas Gleixner	5a34ec804c	timerfd: Protect the might cancel mechanism proper The handling of the might_cancel queueing is not properly protected, so parallel operations on the file descriptor can race with each other and lead to list corruptions or use after free. Protect the context for these operations with a seperate lock. The wait queue lock cannot be reused for this because that would create a lock inversion scenario vs. the cancel lock. Replacing might_cancel with an atomic (atomic_t or atomic bit) does not help either because it still can race vs. the actual list operation. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "linux-fsdevel@vger.kernel.org" Cc: syzkaller <syzkaller@googlegroups.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701311521430.3457@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Change-Id: I1f2d38a919ceb1ca1c7c9471dece0c1126383912 (cherry picked from commit 1e38da300e1e395a15048b0af1e5305bd91402f6)	2017-08-07 18:11:00 -06:00
Jan Kara	6f25be195e	udf: Check path length when reading symlink Symlink reading code does not check whether the resulting path fits into the page provided by the generic code. This isn't as easy as just checking the symlink size because of various encoding conversions we perform on path. So we have to check whether there is still enough space in the buffer on the fly. Change-Id: Id56d129029eaf2e651cf7236103fb73aa540ae1f CC: stable@vger.kernel.org Reported-by: Carl Henrik Lunde <chlunde@ping.uio.no> Signed-off-by: Jan Kara <jack@suse.cz>	2017-07-10 01:48:57 +03:00
Kees Cook	b48a26fd3d	fs/exec.c: account for argv/envp pointers commit 98da7d08850fb8bdeb395d6368ed15753304aa0c upstream. When limiting the argv/envp strings during exec to 1/4 of the stack limit, the storage of the pointers to the strings was not included. This means that an exec with huge numbers of tiny strings could eat 1/4 of the stack limit in strings and then additional space would be later used by the pointers to the strings. For example, on 32-bit with a 8MB stack rlimit, an exec with 1677721 single-byte strings would consume less than 2MB of stack, the max (8MB / 4) amount allowed, but the pointers to the strings would consume the remaining additional stack space (1677721 * 4 == 6710884). The result (1677721 + 6710884 == 8388605) would exhaust stack space entirely. Controlling this stack exhaustion could result in pathological behavior in setuid binaries (CVE-2017-1000365). [akpm@linux-foundation.org: additional commenting from Kees] Fixes: `b6a2fea393` ("mm: variable length argument support") Link: http://lkml.kernel.org/r/20170622001720.GA32173@beast Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Qualys Security Advisory <qsa@qualys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: I2e01d7be2d52415264ff48c632bfe307008c4e03	2017-07-04 01:21:47 +03:00
Hugh Dickins	45c60e5957	mm: larger stack guard gap, between vmas commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream. Stack guard page is a useful feature to reduce a risk of stack smashing into a different mapping. We have been using a single page gap which is sufficient to prevent having stack adjacent to a different mapping. But this seems to be insufficient in the light of the stack usage in userspace. E.g. glibc uses as large as 64kB alloca() in many commonly used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX] which is 256kB or stack strings with MAX_ARG_STRLEN. This will become especially dangerous for suid binaries and the default no limit for the stack size limit because those applications can be tricked to consume a large portion of the stack and a single glibc call could jump over the guard page. These attacks are not theoretical, unfortunatelly. Make those attacks less probable by increasing the stack guard gap to 1MB (on systems with 4k pages; but make it depend on the page size because systems with larger base pages might cap stack allocations in the PAGE_SIZE units) which should cover larger alloca() and VLA stack allocations. It is obviously not a full fix because the problem is somehow inherent, but it should reduce attack space a lot. One could argue that the gap size should be configurable from userspace, but that can be done later when somebody finds that the new 1MB is wrong for some special case applications. For now, add a kernel command line option (stack_guard_gap) to specify the stack gap size (in page units). Implementation wise, first delete all the old code for stack guard page: because although we could get away with accounting one extra page in a stack vma, accounting a larger gap can break userspace - case in point, a program run with "ulimit -S -v 20000" failed when the 1MB gap was counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK and strict non-overcommit mode. Instead of keeping gap inside the stack vma, maintain the stack guard gap as a gap between vmas: using vm_start_gap() in place of vm_start (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few places which need to respect the gap - mainly arch_get_unmapped_area(), and and the vma tree's subtree_gap support for that. Change-Id: I611023b0bfe1cab7b3e5da13e331a7baaaaf6eb0 Original-patch-by: Oleg Nesterov <oleg@redhat.com> Original-patch-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Hugh Dickins <hughd@google.com> [wt: backport to 4.11: adjust context] [wt: backport to 4.9: adjust context ; kernel doc was not in admin-guide] [wt: backport to 4.4: adjust context ; drop ppc hugetlb_radix changes] [wt: backport to 3.18: adjust context ; no FOLL_POPULATE ; s390 uses generic arch_get_unmapped_area()] [wt: backport to 3.16: adjust context] [wt: backport to 3.10: adjust context ; code logic in PARISC's arch_get_unmapped_area() wasn't found ; code inserted into expand_upwards() and expand_downwards() runs under anon_vma lock; changes for gup.c:faultin_page go to memory.c:__get_user_pages(); included Hugh Dickins' fixes] Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Flex1911 <dedsa2002@gmail.com>	2017-07-02 13:03:27 +03:00
Linus Torvalds	939b2740ab	splice: introduce FMODE_SPLICE_READ and FMODE_SPLICE_WRITE Introduce FMODE_SPLICE_READ and FMODE_SPLICE_WRITE. These modes check whether it is legal to read or write a file using splice. Both get automatically set on regular files and are not checked when a 'struct fileoperations' includes the splice_{read,write} methods. Change-Id: Icb601a7db12d4e07a62d790edaa8a9a5aed3ba2a Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>	2017-06-26 21:30:22 +03:00
Laura Abbott	5d5a14df85	fs: fuse: Add replacment for CMA pages into the LRU cache CMA pages are currently replaced in the FUSE file system since FUSE may hold on to CMA pages for a long time, preventing migration. The replacement page is added to the file cache but not the LRU cache. This may prevent the page from being properly aged and dropped, creating poor performance under tight memory condition. Fix this by adding the new page to the LRU cache after creation. Change-Id: Ib349abf1024d48386b835335f3fbacae040b6241 CRs-Fixed: 586855 Signed-off-by: Laura Abbott <lauraa@codeaurora.org>	2017-06-26 21:01:44 +03:00
Jan Kara	b1a8c88774	BACKPORT: posix_acl: Clear SGID bit when setting file permissions [Partially applied during f2fs inclusion, changes now aligned to upstream] (cherry pick from commit 073931017b49d9458aa351605b43a7e34598caef) When file permissions are modified via chmod(2) and the user is not in the owning group or capable of CAP_FSETID, the setgid bit is cleared in inode_change_ok(). Setting a POSIX ACL via setxattr(2) sets the file permissions as well as the new ACL, but doesn't clear the setgid bit in a similar way; this allows to bypass the check in chmod(2). Fix that. NB: conflicts resolution included extending the change to all visible users of the near deprecated function posix_acl_equiv_mode replaced with posix_acl_update_mode. We did not resolve the ACL leak in this CL, require additional upstream fixes. References: CVE-2016-7097 Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Bug: 32458736 [haggertk]: Backport to 3.4/msm8974 * convert use of capable_wrt_inode_uidgid to capable Change-Id: I19591ad452cc825ac282b3cfd2daaa72aa9a1ac1	2017-06-26 20:26:17 +03:00
Adrian Salido	21e685af37	fs/proc/array.c: make safe access to group_leader As mentioned in commit `52ee2dfdd4` ("pids: refactor vnr/nr_ns helpers to make them safe"). _nr_ns helpers used to be buggy. The commit addresses most of the helpers but is missing task_tgid_xxx() Without this protection there is a possible use after free reported by kasan instrumented kernel: ================================================================== BUG: KASAN: use-after-free in task_tgid_nr_ns+0x2c/0x44 at addr Read of size 8 by task cat/2472 CPU: 1 PID: 2472 Comm: cat Tainted: ** Hardware name: Google Tegra210 Smaug Rev 1,3+ (DT) Call trace: [<ffffffc00020ad2c>] dump_backtrace+0x0/0x17c [<ffffffc00020aec0>] show_stack+0x18/0x24 [<ffffffc0011573d0>] dump_stack+0x94/0x100 [<ffffffc0003c7dc0>] kasan_report+0x308/0x554 [<ffffffc0003c7518>] __asan_load8+0x20/0x7c [<ffffffc00025a54c>] task_tgid_nr_ns+0x28/0x44 [<ffffffc00046951c>] proc_pid_status+0x444/0x1080 [<ffffffc000460f60>] proc_single_show+0x8c/0xdc [<ffffffc0004081b0>] seq_read+0x2e8/0x6f0 [<ffffffc0003d1420>] vfs_read+0xd8/0x1e0 [<ffffffc0003d1b98>] SyS_read+0x68/0xd4 Accessing group_leader while holding rcu_lock and using the now safe helpers introduced in the commit mentioned, this race condition is addressed. Signed-off-by: Adrian Salido <salidoa@google.com> Change-Id: I4315217922dda375a30a3581c0c1740dda7b531b Bug: 31495866	2017-06-07 13:26:13 -06:00
Tom Marshall	903457eb2c	kernel: Fix potential refcount leak in su check Change-Id: I3d241ae805ba708c18bccfd5e5d6cdcc8a5bc1c8	2017-05-19 18:41:42 -06:00
Tom Marshall	75ec7fa33f	kernel: Only expose su when daemon is running Note: this is for the 3.4 kernel It has been claimed that the PG implementation of 'su' has security vulnerabilities even when disabled. Unfortunately, the people that find these vulnerabilities often like to keep them private so they can profit from exploits while leaving users exposed to malicious hackers. In order to reduce the attack surface for vulnerabilites, it is therefore necessary to make 'su' completely inaccessible when it is not in use (except by the root and system users). Change-Id: Ia7d50ba46c3d932c2b0ca5fc8e9ec69ec9045f85	2017-05-19 18:41:25 -06:00
Ben Hutchings	ade551c944	splice: Apply generic position and size checks to each write We need to check the position and size of file writes against various limits, using generic_write_check(). This was not being done for the splice write path. It was fixed upstream by commit `8d0207652c` ("->splice_write() via ->write_iter()") but we can't apply that. CVE-2014-7822 Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Change-Id: Ic7e7bd78d4594c993c9684d32a0ddeaf70165bce (cherry picked from commit 894c6350eaad7e613ae267504014a456e00a3e2a)	2017-04-17 15:41:36 -06:00
Jeff Mahoney	dae4a6db45	ecryptfs: don't allow mmap when the lower fs doesn't support it There are legitimate reasons to disallow mmap on certain files, notably in sysfs or procfs. We shouldn't emulate mmap support on file systems that don't offer support natively. CVE-2016-1583 Signed-off-by: Jeff Mahoney <jeffm@suse.com> Cc: stable@vger.kernel.org [tyhicks: clean up f_op check by using ecryptfs_file_to_lower()] Signed-off-by: Tyler Hicks <tyhicks@canonical.com> (adapted from commit f0fe970df3838c202ef6c07a4c2b36838ef0a88b) Change-Id: I3eb979e9476847834eeea0ecbaf07a53329a7219	2017-03-03 16:49:07 -07:00
Eryu Guan	e0dd30eb33	ext4: validate s_first_meta_bg at mount time Ralf Spenneberg reported that he hit a kernel crash when mounting a modified ext4 image. And it turns out that kernel crashed when calculating fs overhead (ext4_calculate_overhead()), this is because the image has very large s_first_meta_bg (debug code shows it's 842150400), and ext4 overruns the memory in count_overhead() when setting bitmap buffer, which is PAGE_SIZE. ext4_calculate_overhead(): buf = get_zeroed_page(GFP_NOFS); <=== PAGE_SIZE buffer blks = count_overhead(sb, i, buf); count_overhead(): for (j = ext4_bg_num_gdb(sb, grp); j > 0; j--) { <=== j = 842150400 ext4_set_bit(EXT4_B2C(sbi, s++), buf); <=== buffer overrun count++; } This can be reproduced easily for me by this script: #!/bin/bash rm -f fs.img mkdir -p /mnt/ext4 fallocate -l 16M fs.img mke2fs -t ext4 -O bigalloc,meta_bg,^resize_inode -F fs.img debugfs -w -R "ssv first_meta_bg 842150400" fs.img mount -o loop fs.img /mnt/ext4 Fix it by validating s_first_meta_bg first at mount time, and refusing to mount if its value exceeds the largest possible meta_bg number. Reported-by: Ralf Spenneberg <ralf@os-t.de> Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> (cherry picked from commit 3a4b77cd47bb837b8557595ec7425f281f2ca1fe) (minor backport adapted from cf851ad35fd1e9c7b8ed00741eca613bc1a9c8c8) Change-Id: If183ad4a873705c9a0312087577705298b3586fe	2017-03-03 13:40:24 -07:00
Nick Desaulniers	a8c9068848	BACKPORT: aio: mark AIO pseudo-fs noexec This ensures that do_mmap() won't implicitly make AIO memory mappings executable if the READ_IMPLIES_EXEC personality flag is set. Such behavior is problematic because the security_mmap_file LSM hook doesn't catch this case, potentially permitting an attacker to bypass a W^X policy enforced by SELinux. I have tested the patch on my machine. To test the behavior, compile and run this: #define _GNU_SOURCE #include <unistd.h> #include <sys/personality.h> #include <linux/aio_abi.h> #include <err.h> #include <stdlib.h> #include <stdio.h> #include <sys/syscall.h> int main(void) { personality(READ_IMPLIES_EXEC); aio_context_t ctx = 0; if (syscall(__NR_io_setup, 1, &ctx)) err(1, "io_setup"); char cmd[1000]; sprintf(cmd, "cat /proc/%d/maps \| grep -F '/[aio]'", (int)getpid()); system(cmd); return 0; } In the output, "rw-s" is good, "rwxs" is bad. Signed-off-by: Jann Horn <jann@thejh.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 22f6b4d34fcf039c63a94e7670e0da24f8575a5a) (cherry picked from googlesource commit `bc02d1d9f5`) Bug: 31711619 Change-Id: I9f2872703bef240d6b82320c744529459bb076dc	2017-03-03 13:15:25 -07:00
Jan Kara	bab8f030b0	isofs: Fix infinite looping over CE entries Rock Ridge extensions define so called Continuation Entries (CE) which define where is further space with Rock Ridge data. Corrupted isofs image can contain arbitrarily long chain of these, including a one containing loop and thus causing kernel to end in an infinite loop when traversing these entries. Limit the traversal to 32 entries which should be more than enough space to store all the Rock Ridge data. Reported-by: P J P <ppandit@redhat.com> CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> (cherry picked from commit f54e18f1b831c92f6512d2eedb224cd63d607d3d) Change-Id: I62cd59b27ac11fbf0a04b0d02874df7f390338bb	2017-03-03 12:44:58 -07:00
Omar Sandoval	dbe124efba	block: fix use-after-free in sys_ioprio_get() get_task_ioprio() accesses the task->io_context without holding the task lock and thus can race with exit_io_context(), leading to a use-after-free. The reproducer below hits this within a few seconds on my 4-core QEMU VM: int main(int argc, char *argv) { pid_t pid, child; long nproc, i; / ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)); / syscall(SYS_ioprio_set, 1, 0, 0x6000); nproc = sysconf(_SC_NPROCESSORS_ONLN); for (i = 0; i < nproc; i++) { pid = fork(); assert(pid != -1); if (pid == 0) { for (;;) { pid = fork(); assert(pid != -1); if (pid == 0) { _exit(0); } else { child = wait(NULL); assert(child == pid); } } } pid = fork(); assert(pid != -1); if (pid == 0) { for (;;) { / ioprio_get(IOPRIO_WHO_PGRP, 0); / syscall(SYS_ioprio_get, 2, 0); } } } for (;;) { / ioprio_get(IOPRIO_WHO_PGRP, 0); */ syscall(SYS_ioprio_get, 2, 0); } return 0; } This gets us KASAN dumps like this: [ 35.526914] ================================================================== [ 35.530009] BUG: KASAN: out-of-bounds in get_task_ioprio+0x7b/0x90 at addr ffff880066f34e6c [ 35.530009] Read of size 2 by task ioprio-gpf/363 [ 35.530009] ============================================================================= [ 35.530009] BUG blkdev_ioc (Not tainted): kasan: bad access detected [ 35.530009] ----------------------------------------------------------------------------- [ 35.530009] Disabling lock debugging due to kernel taint [ 35.530009] INFO: Allocated in create_task_io_context+0x2b/0x370 age=0 cpu=0 pid=360 [ 35.530009] ___slab_alloc+0x55d/0x5a0 [ 35.530009] __slab_alloc.isra.20+0x2b/0x40 [ 35.530009] kmem_cache_alloc_node+0x84/0x200 [ 35.530009] create_task_io_context+0x2b/0x370 [ 35.530009] get_task_io_context+0x92/0xb0 [ 35.530009] copy_process.part.8+0x5029/0x5660 [ 35.530009] _do_fork+0x155/0x7e0 [ 35.530009] SyS_clone+0x19/0x20 [ 35.530009] do_syscall_64+0x195/0x3a0 [ 35.530009] return_from_SYSCALL_64+0x0/0x6a [ 35.530009] INFO: Freed in put_io_context+0xe7/0x120 age=0 cpu=0 pid=1060 [ 35.530009] __slab_free+0x27b/0x3d0 [ 35.530009] kmem_cache_free+0x1fb/0x220 [ 35.530009] put_io_context+0xe7/0x120 [ 35.530009] put_io_context_active+0x238/0x380 [ 35.530009] exit_io_context+0x66/0x80 [ 35.530009] do_exit+0x158e/0x2b90 [ 35.530009] do_group_exit+0xe5/0x2b0 [ 35.530009] SyS_exit_group+0x1d/0x20 [ 35.530009] entry_SYSCALL_64_fastpath+0x1a/0xa4 [ 35.530009] INFO: Slab 0xffffea00019bcd00 objects=20 used=4 fp=0xffff880066f34ff0 flags=0x1fffe0000004080 [ 35.530009] INFO: Object 0xffff880066f34e58 @offset=3672 fp=0x0000000000000001 [ 35.530009] ================================================================== Fix it by grabbing the task lock while we poke at the io_context. Change-Id: I4261aaf076fab943a80a45b0a77e023aa4ecbbd8 Cc: stable@vger.kernel.org Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-11-11 13:35:57 +11:00
Nick Desaulniers	c3f8d15467	fs: ext4: disable support for fallocate FALLOC_FL_PUNCH_HOLE Bug: 28760453 Change-Id: I019c2de559db9e4b95860ab852211b456d78c4ca Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>	2016-10-31 23:29:10 +11:00
Eric W. Biederman	61f1c9ad34	mnt: Fail collect_mounts when applied to unmounted mounts The only users of collect_mounts are in audit_tree.c In audit_trim_trees and audit_add_tree_rule the path passed into collect_mounts is generated from kern_path passed an audit_tree pathname which is guaranteed to be an absolute path. In those cases collect_mounts is obviously intended to work on mounted paths and if a race results in paths that are unmounted when collect_mounts it is reasonable to fail early. The paths passed into audit_tag_tree don't have the absolute path check. But are used to play with fsnotify and otherwise interact with the audit_trees, so again operating only on mounted paths appears reasonable. Avoid having to worry about what happens when we try and audit unmounted filesystems by restricting collect_mounts to mounts that appear in the mount tree. Change-Id: I2edfee6d6951a2179ce8f53785b65ddb1eb95629 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2016-10-31 22:51:16 +11:00
Thomas Gleixner	34ba3c370d	kthread: Prevent unpark race which puts threads on the wrong cpu The smpboot threads rely on the park/unpark mechanism which binds per cpu threads on a particular core. Though the functionality is racy: CPU0 CPU1 CPU2 unpark(T) wake_up_process(T) clear(SHOULD_PARK) T runs leave parkme() due to !SHOULD_PARK bind_to(CPU2) BUG_ON(wrong CPU) We cannot let the tasks move themself to the target CPU as one of those tasks is actually the migration thread itself, which requires that it starts running on the target cpu right away. The solution to this problem is to prevent wakeups in park mode which are not from unpark(). That way we can guarantee that the association of the task to the target cpu is working correctly. Add a new task state (TASK_PARKED) which prevents other wakeups and use this state explicitly for the unpark wakeup. Peter noticed: Also, since the task state is visible to userspace and all the parked tasks are still in the PID space, its a good hint in ps and friends that these tasks aren't really there for the moment. The migration thread has another related issue. CPU0 CPU1 Bring up CPU2 create_thread(T) park(T) wait_for_completion() parkme() complete() sched_set_stop_task() schedule(TASK_PARKED) The sched_set_stop_task() call is issued while the task is on the runqueue of CPU1 and that confuses the hell out of the stop_task class on that cpu. So we need the same synchronizaion before sched_set_stop_task(). Change-Id: I9ad6fbe65992ad5b5cb9a252470a56ec51a4ff4f Reported-by: Dave Jones <davej@redhat.com> Reported-and-tested-by: Dave Hansen <dave@sr71.net> Reported-and-tested-by: Borislav Petkov <bp@alien8.de> Acked-by: Peter Ziljstra <peterz@infradead.org> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: dhillf@gmail.com Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2016-10-29 23:12:39 +08:00
Jaegeuk Kim	e787ff9965	f2fs: set fsync mark only for the last dnode In order to give atomic writes, we should consider power failure during sync_node_pages in fsync. So, this patch marks fsync flag only in the last dnode block. Change-Id: Ib44a91bf820f6631fe359a8ac430ede77ceda403 Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:38 +08:00
Jaegeuk Kim	69baed249c	f2fs: report unwritten status in fsync_node_pages The fsync_node_pages should return pass or failure so that user could know fsync is completed or not. Change-Id: I3d588c44ad7452e66d3d6a795f2060de75fd5d0f Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:38 +08:00
Jaegeuk Kim	2d6e13b5bb	f2fs: flush dirty pages before starting atomic writes If somebody wrote some data before atomic writes, we should flush them in order to handle atomic data in a right period. Change-Id: I35611d9016330ef837554cff263bcbb10b4cc810 Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:38 +08:00
Jaegeuk Kim	2e84eaaee8	f2fs: unset atomic/volatile flag in f2fs_release_file The atomic/volatile operation should be done in pair of start and commit ioctl. For example, if a killed process remains open-ended atomic operation, we should drop its flag as well as its atomic data. Otherwise, if sqlite initiates another operation which doesn't require atomic writes, it will lose every data, since f2fs still treats with them as atomic writes; nobody will trigger its commit. Change-Id: Ic97f7d88a1158e2f21f4bd5447870ff578641fb3 Reported-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:38 +08:00
Jaegeuk Kim	1282b71690	f2fs: fix dropping inmemory pages in a wrong time When one reader closes its file while the other writer is doing atomic writes, f2fs_release_file drops atomic data resulting in an empty commit. This patch fixes this wrong commit problem by checking openess of the file. Process0 Process1 open file start atomic write write data read data close file f2fs_release_file() clear atomic data commit atomic write Change-Id: I99b90b569a56cb53bccf8758f870e0f49849c6fd Reported-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:38 +08:00
Jaegeuk Kim	4ab3c246ba	f2fs: split sync_node_pages with fsync_node_pages This patch splits the existing sync_node_pages into (f)sync_node_pages. The fsync_node_pages is used for f2fs_sync_file only. Change-Id: I207b087a54f1a0c2e994a78cd6ed475578d7044e Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:38 +08:00
Jaegeuk Kim	3a5683358f	f2fs: avoid writing 0'th page in volatile writes The first page of volatile writes usually contains a sort of header information which will be used for recovery. (e.g., journal header of sqlite) If this is written without other journal data, user needs to handle the stale journal information. Change-Id: I85f4cfe4cbef32ed43b0f52d7328b42d411dd2da Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:38 +08:00
Jaegeuk Kim	7c65b74491	f2fs: avoid needless lock for node pages when fsyncing a file When fsync is called, sync_node_pages finds a proper direct node pages to flush. But, it locks unrelated direct node pages together unnecessarily. Change-Id: I6adc83f2e6592aea707851ee6e365afcc0e36f92 Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:37 +08:00
Chao Yu	36490001bb	f2fs: fix deadlock when flush inline data Below backtrace info was reported by Yunlei He: Call Trace: [<ffffffff817a9395>] schedule+0x35/0x80 [<ffffffff817abb7d>] rwsem_down_read_failed+0xed/0x130 [<ffffffff813c12a8>] call_rwsem_down_read_failed+0x18/0x [<ffffffff817ab1d0>] down_read+0x20/0x30 [<ffffffffa02a1a12>] f2fs_evict_inode+0x242/0x3a0 [f2fs] [<ffffffff81217057>] evict+0xc7/0x1a0 [<ffffffff81217cd6>] iput+0x196/0x200 [<ffffffff812134f9>] __dentry_kill+0x179/0x1e0 [<ffffffff812136f9>] dput+0x199/0x1f0 [<ffffffff811fe77b>] __fput+0x18b/0x220 [<ffffffff811fe84e>] ____fput+0xe/0x10 [<ffffffff81097427>] task_work_run+0x77/0x90 [<ffffffff81074d62>] exit_to_usermode_loop+0x73/0xa2 [<ffffffff81003b7a>] do_syscall_64+0xfa/0x110 [<ffffffff817acf65>] entry_SYSCALL64_slow_path+0x25/0x25 Call Trace: [<ffffffff817a9395>] schedule+0x35/0x80 [<ffffffff81216dc3>] __wait_on_freeing_inode+0xa3/0xd0 [<ffffffff810bc300>] ? autoremove_wake_function+0x40/0x4 [<ffffffff8121771d>] find_inode_fast+0x7d/0xb0 [<ffffffff8121794a>] ilookup+0x6a/0xd0 [<ffffffffa02bc740>] sync_node_pages+0x210/0x650 [f2fs] [<ffffffff8122e690>] ? do_fsync+0x70/0x70 [<ffffffffa02b085e>] block_operations+0x9e/0xf0 [f2fs] [<ffffffff8137b795>] ? bio_endio+0x55/0x60 [<ffffffffa02b0942>] write_checkpoint+0x92/0xba0 [f2fs] [<ffffffff8117da57>] ? mempool_free_slab+0x17/0x20 [<ffffffff8117de8b>] ? mempool_free+0x2b/0x80 [<ffffffff8122e690>] ? do_fsync+0x70/0x70 [<ffffffffa02a53e3>] f2fs_sync_fs+0x63/0xd0 [f2fs] [<ffffffff8129630f>] ? ext4_sync_fs+0xbf/0x190 [<ffffffff8122e6b0>] sync_fs_one_sb+0x20/0x30 [<ffffffff812002e9>] iterate_supers+0xb9/0x110 [<ffffffff8122e7b5>] sys_sync+0x55/0x90 [<ffffffff81003ae9>] do_syscall_64+0x69/0x110 [<ffffffff817acf65>] entry_SYSCALL64_slow_path+0x25/0x25 With following excuting serials, we will set inline_node in inode page after inode was unlinked, result in a deadloop described as below: 1. open file 2. write file 3. unlink file 4. write file 5. close file Thread A Thread B - dput - iput_final - inode->i_state \|= I_FREEING - evict - f2fs_evict_inode - f2fs_sync_fs - write_checkpoint - block_operations - f2fs_lock_all (down_write(cp_rwsem)) - f2fs_lock_op (down_read(cp_rwsem)) - sync_node_pages - ilookup - find_inode_fast - __wait_on_freeing_inode (wait on I_FREEING clear) Here, we change to set inline_node flag only for linked inode for fixing. Change-Id: Ibf4326ecb4ba68e45e4e964092e1d2955341bc56 Reported-by: Yunlei He <heyunlei@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Tested-by: Jaegeuk Kim <jaegeuk@kernel.org> Cc: stable@vger.kernel.org # v4.6 Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:37 +08:00
Chao Yu	2f63209408	f2fs: fix to update dirty page count correctly Once we failed to merge inline data into inode page during flushing inline inode, we will skip invoking inode_dec_dirty_pages, which makes dirty page count incorrect, result in panic in ->evict_inode, Fix it. ------------[ cut here ]------------ kernel BUG at /home/yuchao/git/devf2fs/inode.c:336! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 3 PID: 10004 Comm: umount Tainted: G O 4.6.0-rc5+ #17 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 task: f0c33000 ti: c5212000 task.ti: c5212000 EIP: 0060:[<f89aacb5>] EFLAGS: 00010202 CPU: 3 EIP is at f2fs_evict_inode+0x85/0x490 [f2fs] EAX: 00000001 EBX: c4529ea0 ECX: 00000001 EDX: 00000000 ESI: c0131000 EDI: f89dd0a0 EBP: c5213e9c ESP: `c5213e78` DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 CR0: 80050033 CR2: b75878c0 CR3: 1a36a700 CR4: 000406f0 Stack: c4529ea0 c4529ef4 c5213e8c c176d45c c4529ef4 00000000 c4529ea0 c4529fac f89dd0a0 c5213eb0 c1204a68 c5213ed8 c452a2b4 c6680930 c5213ec0 c1204b64 c6680d44 c6680620 c5213eec c120588d ee84b000 ee84b5c0 c5214000 ee84b5e0 Call Trace: [<c176d45c>] ? _raw_spin_unlock+0x2c/0x50 [<c1204a68>] evict+0xa8/0x170 [<c1204b64>] dispose_list+0x34/0x50 [<c120588d>] evict_inodes+0x10d/0x130 [<c11ea941>] generic_shutdown_super+0x41/0xe0 [<c1185190>] ? unregister_shrinker+0x40/0x50 [<c1185190>] ? unregister_shrinker+0x40/0x50 [<c11eac52>] kill_block_super+0x22/0x70 [<f89af23e>] kill_f2fs_super+0x1e/0x20 [f2fs] [<c11eae1d>] deactivate_locked_super+0x3d/0x70 [<c11eb383>] deactivate_super+0x43/0x60 [<c1208ec9>] cleanup_mnt+0x39/0x80 [<c1208f50>] __cleanup_mnt+0x10/0x20 [<c107d091>] task_work_run+0x71/0x90 [<c105725a>] exit_to_usermode_loop+0x72/0x9e [<c1001c7c>] do_fast_syscall_32+0x19c/0x1c0 [<c176dd48>] sysenter_past_esp+0x45/0x74 EIP: [<f89aacb5>] f2fs_evict_inode+0x85/0x490 [f2fs] SS:ESP 0068:c5213e78 ---[ end trace d30536330b7fdc58 ]--- Change-Id: I68907a13e6ac726e54f5c2bbe219bc2c8400a558 Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:37 +08:00
Jaegeuk Kim	0222e52820	ext4/fscrypto: avoid RCU lookup in d_revalidate As Al pointed, d_revalidate should return RCU lookup before using d_inode. This was originally introduced by: commit `34286d6662` ("fs: rcu-walk aware d_revalidate method"). Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Theodore Ts'o <tytso@mit.edu> Cc: stable <stable@vger.kernel.org> Conflicts: fs/ext4/crypto.c	2016-10-29 23:12:37 +08:00
Jaegeuk Kim	4f963da44b	fscrypto: don't let data integrity writebacks fail with ENOMEM This patch fixes the issue introduced by the ext4 crypto fix in a same manner. For F2FS, however, we flush the pending IOs and wait for a while to acquire free memory. Fixes: c9af28fdd4492 ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM") Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Conflicts: fs/crypto/crypto.c	2016-10-29 23:12:37 +08:00
Jaegeuk Kim	92d54d00b5	f2fs: use dget_parent and file_dentry in f2fs_file_open This patch synced with the below two ext4 crypto fixes together. In 4.6-rc1, f2fs newly introduced accessing f_path.dentry which crashes overlayfs. To fix, now we need to use file_dentry() to access that field. [Backport NOTE] - Over 4.2, it should use file_dentry Fixes: c0a37d487884 ("ext4: use file_dentry()") Fixes: 9dd78d8c9a7b ("ext4: use dget_parent() in ext4_file_open()") Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:36 +08:00
Jaegeuk Kim	c5c8bb5165	fscrypto: use dget_parent() in fscrypt_d_revalidate() This patch updates fscrypto along with the below ext4 crypto change. Fixes: 3d43bcfef5f0 ("ext4 crypto: use dget_parent() in ext4_d_revalidate()") Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Conflicts: fs/crypto/crypto.c	2016-10-29 23:12:36 +08:00
Shuoran Liu	5e468690df	f2fs: retrieve IO write stat from the right place In the following patch, f2fs: split journal cache from curseg cache journal cache is split from curseg cache. So IO write statistics should be retrived from journal cache but not curseg->sum_blk. Otherwise, it will get 0, and the stat is lost. Signed-off-by: Shuoran Liu <liushuoran@huawei.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:36 +08:00
Jaegeuk Kim	df869c0814	f2fs crypto: fix corrupted symlink in encrypted case In the encrypted symlink case, we should check its corrupted symname after decrypting it. Otherwise, we can report -ENOENT incorrectly, if encrypted symname starts with '\0'. Cc: stable 4.5+ <stable@vger.kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:36 +08:00
Jaegeuk Kim	b54ac0bf29	f2fs: cover large section in sanity check of super This patch fixes the bug which does not cover a large section case when checking the sanity of superblock. If f2fs detects misalignment, it will fix the superblock during the mount time, so it doesn't need to trigger fsck.f2fs further. Reported-by: Matthias Prager <linux@matthiasprager.de> Reported-by: David Gnedt <david.gnedt@davizone.at> Cc: stable 4.5+ <stable@vger.kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:36 +08:00
Linus Torvalds	1db29fa179	f2fs/crypto: fix xts_tweak initialization Commit 0b81d07790726 ("fs crypto: move per-file encryption from f2fs tree to fs/crypto") moved the f2fs crypto files to fs/crypto/ and renamed the symbol prefixes from "f2fs_" to "fscrypt_" (and from "F2FS_" to just "FS" for preprocessor symbols). Because of the symbol renaming, it's a bit hard to see it as a file move: use git show -M30 0b81d07790726 to lower the rename detection to just 30% similarity and make git show the files as renamed (the header file won't be shown as a rename even then - since all it contains is symbol definitions, it looks almost completely different). Even with the renames showing as renames, the diffs are not all that easy to read, since so much is just the renames. But Eric Biggers noticed that it's not just all renames: the initialization of the xts_tweak had been broken too, using the inode number rather than the page offset. That's not right - it makes the xfs_tweak the same for all pages of each inode. It _might_ make sense to make the xfs_tweak contain both the offset _and_ the inode number, but not just the inode number. Reported-by: Eric Biggers <ebiggers3@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-29 23:12:36 +08:00
Jaegeuk Kim	841b7347fa	f2fs: submit node page write bios when really required If many threads calls fsync with data writes, we don't need to flush every bios having node page writes. The f2fs_wait_on_page_writeback will flush its bios when the page is really needed. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:36 +08:00
Arnd Bergmann	f9b55fda32	f2fs: add missing argument to f2fs_setxattr stub The f2fs_setxattr() prototype for CONFIG_F2FS_FS_XATTR=n has been wrong for a long time, since `8ae8f1627f` ("f2fs: support xattr security labels"), but there have never been any callers, so it did not matter. Now, the function gets called from f2fs_ioc_keyctl(), which causes a build failure: fs/f2fs/file.c: In function 'f2fs_ioc_keyctl': include/linux/stddef.h:7:14: error: passing argument 6 of 'f2fs_setxattr' makes integer from pointer without a cast [-Werror=int-conversion] #define NULL ((void )0) ^ fs/f2fs/file.c:1599:27: note: in expansion of macro 'NULL' value, F2FS_KEY_SIZE, NULL, type); ^ In file included from ../fs/f2fs/file.c:29:0: fs/f2fs/xattr.h:129:19: note: expected 'int' but argument is of type 'void ' static inline int f2fs_setxattr(struct inode inode, int index, ^ fs/f2fs/file.c:1597:9: error: too many arguments to function 'f2fs_setxattr' return f2fs_setxattr(inode, F2FS_XATTR_INDEX_KEY, ^ In file included from ../fs/f2fs/file.c:29:0: fs/f2fs/xattr.h:129:19: note: declared here static inline int f2fs_setxattr(struct inode inode, int index, Thsi changes the prototype of the empty stub function to match that of the actual implementation. This will not make the key management work when F2FS_FS_XATTR is disabled, but it gets it to build at least. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:36 +08:00
Chao Yu	135b6ab7ff	f2fs: fix to avoid unneeded unlock_new_inode During ->lookup, I_NEW state of inode was been cleared in f2fs_iget, so in error path, we don't need to clear it again. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:36 +08:00
Chao Yu	0487130e26	f2fs: clean up opened code with f2fs_update_dentry Just clean up opened code with existing function, no logic change. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:36 +08:00
Jaegeuk Kim	1e100874fc	f2fs: declare static functions Just to avoid sparse warnings. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:36 +08:00
Fan Li	6be6530f86	f2fs: modify the readahead method in ra_node_page() ra_node_page() is used to read ahead one node page. Comparing to regular read, it's faster because it doesn't wait for IO completion. But if it is called twice for reading the same block, and the IO request from the first call hasn't been completed before the second call, the second call will have to wait until the read is over. Here use the code in __do_page_cache_readahead() to solve this problem. It does nothing when someone else already puts the page in mapping. The status of page should be assured by whoever puts it there. This implement also prevents alteration of page reference count. Signed-off-by: Fan li <fanofcode.li@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:36 +08:00
Jaegeuk Kim	6f86d87cf6	f2fs crypto: sync ext4_lookup and ext4_file_open This patch tries to catch up with lookup and open policies in ext4. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:36 +08:00
Jaegeuk Kim	ea38719073	f2fs: define not-set fallocate flags This fixes building bugs in aosp. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:35 +08:00
Jaegeuk Kim	47393901fb	fs crypto: move per-file encryption from f2fs tree to fs/crypto This patch adds the renamed functions moved from the f2fs crypto files. [Backporting to 3.10] - Removed d_is_negative() in fscrypt_d_revalidate(). 1. definitions for per-file encryption used by ext4 and f2fs. 2. crypto.c for encrypt/decrypt functions a. IO preparation: - fscrypt_get_ctx / fscrypt_release_ctx b. before IOs: - fscrypt_encrypt_page - fscrypt_decrypt_page - fscrypt_zeroout_range c. after IOs: - fscrypt_decrypt_bio_pages - fscrypt_pullback_bio_page - fscrypt_restore_control_page 3. policy.c supporting context management. a. For ioctls: - fscrypt_process_policy - fscrypt_get_policy b. For context permission - fscrypt_has_permitted_context - fscrypt_inherit_context 4. keyinfo.c to handle permissions - fscrypt_get_encryption_info - fscrypt_free_encryption_info 5. fname.c to support filename encryption a. general wrapper functions - fscrypt_fname_disk_to_usr - fscrypt_fname_usr_to_disk - fscrypt_setup_filename - fscrypt_free_filename b. specific filename handling functions - fscrypt_fname_alloc_buffer - fscrypt_fname_free_buffer 6. Makefile and Kconfig Cc: Al Viro <viro@ftp.linux.org.uk> Signed-off-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Ildar Muslukhov <ildarm@google.com> Signed-off-by: Uday Savagaonkar <savagaon@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Conflicts: fs/f2fs/f2fs.h fs/f2fs/inode.c fs/f2fs/super.c include/linux/fs.h Change-Id: I162f3562aed66e7e377ec38714fae96c651fbdd7	2016-10-29 23:12:35 +08:00
Yang Shi	2b9dcf81e6	f2fs: mutex can't be used by down_write_nest_lock() f2fs_lock_all() calls down_write_nest_lock() to acquire a rw_sem and check a mutex, but down_write_nest_lock() is designed for two rw_sem accoring to the comment in include/linux/rwsem.h. And, other than f2fs, it is just called in mm/mmap.c with two rwsem. So, it looks it is used wrongly by f2fs. And, it causes the below compile warning on -rt kernel too. In file included from fs/f2fs/xattr.c:25:0: fs/f2fs/f2fs.h: In function 'f2fs_lock_all': fs/f2fs/f2fs.h:962:34: warning: passing argument 2 of 'down_write_nest_lock' from incompatible pointer type [-Wincompatible-pointer-types] f2fs_down_write(&sbi->cp_rwsem, &sbi->cp_mutex); ^ fs/f2fs/f2fs.h:27:55: note: in definition of macro 'f2fs_down_write' #define f2fs_down_write(x, y) down_write_nest_lock(x, y) ^ In file included from include/linux/rwsem.h:22:0, from fs/f2fs/xattr.c:21: include/linux/rwsem_rt.h:138:20: note: expected 'struct rw_semaphore ' but argument is of type 'struct mutex ' static inline void down_write_nest_lock(struct rw_semaphore *sem, Signed-off-by: Yang Shi <yang.shi@linaro.org> Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:35 +08:00
Liu Xue	55340ae5aa	f2fs: recovery missing dot dentries in root directory If f2fs was corrupted with missing dot dentries in root dirctory, it needs to recover them after fsck.f2fs set F2FS_INLINE_DOTS flag in directory inode when fsck.f2fs detects missing dot dentries. Signed-off-by: Xue Liu <liuxueliu.liu@huawei.com> Signed-off-by: Yong Sheng <shengyong1@huawei.com> Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:35 +08:00
Willy Tarreau	dcaffd537f	pipe: limit the per-user amount of pages allocated in pipes On no-so-small systems, it is possible for a single process to cause an OOM condition by filling large pipes with data that are never read. A typical process filling 4000 pipes with 1 MB of data will use 4 GB of memory. On small systems it may be tricky to set the pipe max size to prevent this from happening. This patch makes it possible to enforce a per-user soft limit above which new pipes will be limited to a single page, effectively limiting them to 4 kB each, as well as a hard limit above which no new pipes may be created for this user. This has the effect of protecting the system against memory abuse without hurting other users, and still allowing pipes to work correctly though with less data at once. The limit are controlled by two new sysctls : pipe-user-pages-soft, and pipe-user-pages-hard. Both may be disabled by setting them to zero. The default soft limit allows the default number of FDs per process (1024) to create pipes of the default size (64kB), thus reaching a limit of 64MB before starting to create only smaller pipes. With 256 processes limited to 1024 FDs each, this results in 102464kB + (2561024 - 1024) * 4kB = 1084 MB of memory allocated for a user. The hard limit is disabled by default to avoid breaking existing applications that make intensive use of pipes (eg: for splicing). Reported-by: socketpair@gmail.com Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Mitigates: CVE-2013-4312 (Linux 2.0+) Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Conflicts: Documentation/sysctl/fs.txt fs/pipe.c include/linux/sched.h Change-Id: Ic7c678af18129943e16715fdaa64a97a7f0854be	2016-10-29 23:12:35 +08:00
Theodore Ts'o	2da70f4e34	ext4: avoid hang when mounting non-journal filesystems with orphan list When trying to mount a file system which does not contain a journal, but which does have a orphan list containing an inode which needs to be truncated, the mount call with hang forever in ext4_orphan_cleanup() because ext4_orphan_del() will return immediately without removing the inode from the orphan list, leading to an uninterruptible loop in kernel code which will busy out one of the CPU's on the system. This can be trivially reproduced by trying to mount the file system found in tests/f_orphan_extents_inode/image.gz from the e2fsprogs source tree. If a malicious user were to put this on a USB stick, and mount it on a Linux desktop which has automatic mounts enabled, this could be considered a potential denial of service attack. (Not a big deal in practice, but professional paranoids worry about such things, and have even been known to allocate CVE numbers for such problems.) Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Cc: stable@vger.kernel.org	2016-10-29 23:12:34 +08:00
Anatol Pomozov	012f862b13	ext4: make orphan functions be no-op in no-journal mode Instead of checking whether the handle is valid, we check if journal is enabled. This avoids taking the s_orphan_lock mutex in all cases when there is no journal in use, including the error paths where ext4_orphan_del() is called with a handle set to NULL. Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2016-10-29 23:12:34 +08:00
Chao Yu	b86bad0997	f2fs: fix to avoid deadlock when merging inline data When testing with fsstress, kworker and user threads were both blocked: INFO: task kworker/u16:1:16580 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kworker/u16:1 D ffff8803f2595390 0 16580 2 0x00000000 Workqueue: writeback bdi_writeback_workfn (flush-251:0) ffff8802730e5760 0000000000000046 ffff880274729fc0 0000000000012440 ffff8802730e5fd8 ffff8802730e4010 0000000000012440 0000000000012440 ffff8802730e5fd8 0000000000012440 ffff880274729fc0 ffff88026eb50000 Call Trace: [<ffffffff816fe9d9>] schedule+0x29/0x70 [<ffffffff816ff895>] rwsem_down_read_failed+0xa5/0xf9 [<ffffffff81378584>] call_rwsem_down_read_failed+0x14/0x30 [<ffffffffa0694feb>] f2fs_write_data_page+0x31b/0x420 [f2fs] [<ffffffffa0690f1a>] __f2fs_writepage+0x1a/0x50 [f2fs] [<ffffffffa06922a0>] f2fs_write_data_pages+0xe0/0x290 [f2fs] [<ffffffff811473b3>] do_writepages+0x23/0x40 [<ffffffff811cc3ee>] __writeback_single_inode+0x4e/0x250 [<ffffffff811cd4f1>] writeback_sb_inodes+0x2c1/0x470 [<ffffffff811cd73e>] __writeback_inodes_wb+0x9e/0xd0 [<ffffffff811cda0b>] wb_writeback+0x1fb/0x2d0 [<ffffffff811cdb7c>] wb_do_writeback+0x9c/0x220 [<ffffffff811ce232>] bdi_writeback_workfn+0x72/0x1c0 [<ffffffff8106b74e>] process_one_work+0x1de/0x5b0 [<ffffffff8106e78f>] worker_thread+0x11f/0x3e0 [<ffffffff810750ce>] kthread+0xde/0xf0 [<ffffffff817093f8>] ret_from_fork+0x58/0x90 fsstress thread stack: [<ffffffff81139f0e>] sleep_on_page+0xe/0x20 [<ffffffff81139ef7>] __lock_page+0x67/0x70 [<ffffffff8113b100>] find_lock_page+0x50/0x80 [<ffffffff8113b24f>] find_or_create_page+0x3f/0xb0 [<ffffffffa06983a9>] sync_node_pages+0x259/0x810 [f2fs] [<ffffffffa068d874>] write_checkpoint+0x1a4/0xce0 [f2fs] [<ffffffffa0686b0c>] f2fs_sync_fs+0x7c/0xd0 [f2fs] [<ffffffffa067c813>] f2fs_sync_file+0x143/0x5f0 [f2fs] [<ffffffff811d301b>] vfs_fsync_range+0x2b/0x40 [<ffffffff811d304c>] vfs_fsync+0x1c/0x20 [<ffffffff811d3291>] do_fsync+0x41/0x70 [<ffffffff811d32d3>] SyS_fdatasync+0x13/0x20 [<ffffffff817094a2>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff The reason of this issue is: CPU0: CPU1: - f2fs_write_data_pages - f2fs_sync_fs - write_checkpoint - block_operations - f2fs_lock_all - down_write(sbi->cp_rwsem) - lock_page(page) - f2fs_write_data_page - sync_node_pages - flush_inline_data - pagecache_get_page(page, GFP_LOCK) - f2fs_lock_op - down_read(sbi->cp_rwsem) This patch alters to use trylock_page in flush_inline_data to fix this ABBA deadlock issue. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:32 +08:00
Chao Yu	09f4658771	f2fs: introduce f2fs_flush_merged_bios for cleanup Add a new helper f2fs_flush_merged_bios to clean up redundant codes. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:32 +08:00
Chao Yu	cc7e1e5fe8	f2fs: introduce f2fs_update_data_blkaddr for cleanup Add a new help f2fs_update_data_blkaddr to clean up redundant codes. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2016-10-29 23:12:32 +08:00
Chao Yu	e90d738a5e	f2fs crypto: fix incorrect positioning for GCing encrypted data page For now, flow of GCing an encrypted data page: 1) try to grab meta page in meta inode's mapping with index of old block address of that data page 2) load data of ciphertext into meta page 3) allocate new block address 4) write the meta page into new block address 5) update block address pointer in direct node page. Other reader/writer will use f2fs_wait_on_encrypted_page_writeback to check and wait on GCed encrypted data cached in meta page writebacked in order to avoid inconsistence among data page cache, meta page cache and data on-disk when updating. However, we will use new block address updated in step 5) as an index to lookup meta page in inner bio buffer. That would be wrong, and we will never find the GCing meta page, since we use the old block address as index of that page in step 1). This patch fixes the issue by adjust the order of step 1) and step 3), and in step 1) grab page with index generated in step 3). Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Conflicts: fs/f2fs/gc.c	2016-10-29 23:12:32 +08:00

... 2 3 4 5 6 ...

27882 commits