Commit graph

27882 commits

Author SHA1 Message Date
Daniel Rosenberg
8472923d21 ANDROID: sdcardfs: Move top to its own struct
Move top, and the associated data, to its own struct.
This way, we can properly track refcounts on top
without interfering with the inode's accounting.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 38045152
Change-Id: I1968e480d966c3f234800b72e43670ca11e1d3fd
2017-09-22 19:12:34 +03:00
Gao Xiang
1f52ba5673 ANDROID: sdcardfs: fix sdcardfs_destroy_inode for the inode RCU approach
According to the following commits,
fs: icache RCU free inodes
vfs: fix the stupidity with i_dentry in inode destructors

sdcardfs_destroy_inode should be fixed for the fast path safety.

Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Change-Id: I84f43c599209d23737c7e28b499dd121cb43636d
2017-09-22 19:12:33 +03:00
Daniel Roseberg
62b650471f ANDROID: sdcardfs: Don't iput if we didn't igrab
If we fail to get top, top is either NULL, or igrab found
that we're in the process of freeing that inode, and did
not grab it. Either way, we didn't grab it, and have no
business putting it.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 38117720
Change-Id: Ie2f587483b9abb5144263156a443e89bc69b767b
2017-09-22 19:12:33 +03:00
Daniel Rosenberg
e43a6b01eb ANDROID: sdcardfs: Call lower fs's revalidate
We should be calling the lower filesystem's revalidate
inside of sdcardfs's revalidate, as wrapfs does.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35766959
Change-Id: I939d1c4192fafc1e21678aeab43fe3d588b8e2f4
2017-09-22 19:12:33 +03:00
Daniel Rosenberg
88598fa8ae ANDROID: sdcardfs: Avoid setting GIDs outside of valid ranges
When setting up the ownership of files on the lower filesystem,
ensure that these values are in reasonable ranges for apps. If
they aren't, default to AID_MEDIA_RW

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 37516160
Change-Id: I0bec76a61ac72aff0b993ab1ad04be8382178a00
2017-09-22 19:12:32 +03:00
Daniel Rosenberg
e293feb047 Revert "Revert "Android: sdcardfs: Don't do d_add for lower fs""
This reverts commit ffa75fdb9c408f49b9622b6d55752ed99ff61488.

Turns out we just needed the right hash.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 37231161
Change-Id: I6a6de7f7df99ad42b20fa062913b219f64020c31
2017-09-22 19:12:32 +03:00
Daniel Rosenberg
24cce439d5 ANDROID: sdcardfs: Use filesystem specific hash
We weren't accounting for FS specific hash functions,
causing us to miss negative dentries for any FS that
had one.

Similar to a patch from esdfs
commit 75bd25a9476d ("esdfs: support lower's own hash")

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: I32d1ba304d728e0ca2648cacfb4c2e441ae63608
2017-09-22 19:12:32 +03:00
Daniel Rosenberg
b380b6ee1c Revert "Android: sdcardfs: Don't do d_add for lower fs"
This reverts commit 60df9f12992bc067216078ae756066c5d7c74d87.

This change caused issues for sdcardfs on top of vfat

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: Ie56a91fda582af27921cc1a9de7ae19a9a988f2a
2017-09-22 19:12:31 +03:00
Daniel Rosenberg
d3c7b5afda Android: sdcardfs: Don't complain in fixup_lower_ownership
Not all filesystems support changing the owner of a file.
We shouldn't complain if it doesn't happen.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 37488099
Change-Id: I403e44ab7230f176e6df82f6adb4e5c82ce57f33
2017-09-22 19:12:31 +03:00
Daniel Rosenberg
cbfba1be5a Android: sdcardfs: Don't do d_add for lower fs
For file based encryption, ext4 explicitly does not
create negative dentries for encrypted files. If you
force one over it, the decrypted file will be hidden
until the cache is cleared. Instead, just fail out.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 37231161
Change-Id: Id2a9708dfa75e1c22f89915c529789caadd2ca4b
2017-09-22 19:12:30 +03:00
Daniel Rosenberg
7727a9e52e ANDROID: sdcardfs: ->iget fixes
Adapted from wrapfs
commit 8c49eaa0sb9c ("Wrapfs: ->iget fixes")

Change where we igrab/iput to ensure we always hold a valid lower_inode.
Return ENOMEM (not EACCES) if iget5_locked returns NULL.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35766959

Change-Id: Id8d4e0c0cbc685a0a77685ce73c923e9a3ddc094
2017-09-22 19:12:30 +03:00
Daniel Rosenberg
d2f7577f20 Android: sdcardfs: Change cache GID value
Change-Id: Ieb955dd26493da26a458bc20fbbe75bca32b094f
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 37193650
2017-09-22 19:12:29 +03:00
Daniel Rosenberg
c3da3c511e ANDROID: sdcardfs: Directly pass lower file for mmap
Instead of relying on a copy hack, pass the lower file
as private data. This lets the kernel find the vma
mapping for pages used by the file, allowing pages
used by mapping to be reclaimed.

This is adapted from following esdfs patches
commit 0647e638d: ("esdfs: store lower file in vm_file for mmap")
commit 064850866: ("esdfs: keep a counter for mmaped file")

Change-Id: I75b74d1e5061db1b8c13be38d184e118c0851a1a
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:29 +03:00
Daniel Rosenberg
7fc65bd919 ANDROID: sdcardfs: update module info
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: I958c7c226d4e9265fea8996803e5b004fb33d8ad
2017-09-22 19:12:29 +03:00
Daniel Rosenberg
5dc3989ffd ANDROID: sdcardfs: use d_splice_alias
adapted from wrapfs
commit 9671770ff8b9 ("Wrapfs: use d_splice_alias")

Refactor interpose code to allow lookup to use d_splice_alias.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35766959
Change-Id: Icf51db8658202c48456724275b03dc77f73f585b
2017-09-22 19:12:28 +03:00
Daniel Rosenberg
dde08eb9d7 ANDROID: sdcardfs: fix ->llseek to update upper and lower offset
Adapted from wrapfs
commit 1d1d23a47baa ("Wrapfs: fix ->llseek to update upper and lower
offsets")

Fixes bug: xfstests generic/257. f_pos consistently is required by and
only by dir_ops->wrapfs_readdir, main_ops is not affected.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Signed-off-by: Mengyang Li <li.mengyang@stonybrook.edu>
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35766959
Change-Id: I360a1368ac37ea8966910a58972b81504031d437
2017-09-22 19:12:28 +03:00
Daniel Rosenberg
4923777bb0 ANDROID: sdcardfs: copy lower inode attributes in ->ioctl
Adapted from wrapfs
commit fbc9c6f83ea6 ("Wrapfs: copy lower inode attributes in ->ioctl")
commit e97d8e26cc9e ("Wrapfs: use file_inode helper")

Some ioctls (e.g., EXT2_IOC_SETFLAGS) can change inode attributes, so copy
them from lower inode.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35766959
Change-Id: I0f12684b9dbd4088b4a622c7ea9c03087f40e572
2017-09-22 19:12:28 +03:00
Daniel Rosenberg
21af28ae07 ANDROID: sdcardfs: remove unnecessary call to do_munmap
Adapted from wrapfs
commit 5be6de9ecf02 ("Wrapfs: use vm_munmap in ->mmap")
commit 2c9f6014a8bb ("Wrapfs: remove unnecessary call
to vm_unmap in ->mmap")

Code is unnecessary and causes deadlocks in newer kernels.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35766959
Change-Id: Ia252d60c60799d7e28fc5f1f0f5b5ec2430a2379
2017-09-22 19:12:27 +03:00
Daniel Rosenberg
2761faa286 ANDROID: sdcardfs: Fix style issues in macros
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35331000
Change-Id: I89c4035029dc2236081a7685c55cac595d9e7ebf
2017-09-22 19:12:27 +03:00
Daniel Rosenberg
b3fd5e6086 ANDROID: sdcardfs: Use seq_puts over seq_printf
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35331000
Change-Id: I3795ec61ce61e324738815b1ce3b0e09b25d723f
2017-09-22 19:12:26 +03:00
Daniel Rosenberg
3a6d093272 ANDROID: sdcardfs: Use to kstrout
Switch from deprecated simple_strtoul to kstrout

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35331000
Change-Id: If18bd133b4d2877f71e58b58fc31371ff6613ed5
2017-09-22 19:12:26 +03:00
Daniel Rosenberg
a1f2d9d927 ANDROID: sdcardfs: Use pr_[...] instead of printk
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35331000
Change-Id: Ibc635ec865750530d32b87067779f681fe58a003
2017-09-22 19:12:26 +03:00
Daniel Rosenberg
def7e34e8b ANDROID: sdcardfs: remove unneeded null check
As pointed out by checkpatch, these functions already
handle null inputs, so the checks are not needed.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35331000
Change-Id: I189342f032dfcefee36b27648bb512488ad61d20
2017-09-22 19:12:24 +03:00
Daniel Rosenberg
1ff2bf9007 ANDROID: sdcardfs: Fix style issues with comments
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35331000
Change-Id: I8791ef7eac527645ecb9407908e7e5ece35b8f80
2017-09-22 19:12:23 +03:00
Daniel Rosenberg
1fb0168abb ANDROID: sdcardfs: Fix formatting
This fixes various spacing and bracket related issues
pointed out by checkpatch.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35331000
Change-Id: I6e248833a7a04e3899f3ae9462d765cfcaa70c96
2017-09-22 19:12:23 +03:00
Daniel Rosenberg
9a544b4fd9 ANDROID: sdcardfs: correct order of descriptors
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35331000
Change-Id: Ia6d16b19c8c911f41231d2a12be0740057edfacf
2017-09-22 19:12:23 +03:00
Daniel Rosenberg
cc90c372fa ANDROID: sdcardfs: Fix gid issue
We were already calculating most of these values,
and erroring out because the check was confused by this.
Instead of recalculating, adjust it as needed.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 36160015
Change-Id: I9caf3e2fd32ca2e37ff8ed71b1d392f1761bc9a9
2017-09-22 19:12:22 +03:00
Daniel Rosenberg
183d00676a ANDROID: sdcardfs: Use tabs instead of spaces in multiuser.h
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35331000
Change-Id: Ic7801914a7dd377e270647f81070020e1f0bab9b
2017-09-22 19:12:22 +03:00
Daniel Rosenberg
a79d11e62a ANDROID: sdcardfs: Remove uninformative prints
At best these prints do not provide useful information, and
at worst, some allow userspace to abuse the kernel log.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 36138424
Change-Id: I812c57cc6a22b37262935ab77f48f3af4c36827e
2017-09-22 19:12:21 +03:00
Daniel Rosenberg
527954f2c9 ANDROID: sdcardfs: move path_put outside of spinlock
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35643557
Change-Id: Ib279ebd7dd4e5884d184d67696a93e34993bc1ef
2017-09-22 19:12:21 +03:00
Daniel Rosenberg
9fc6dbe472 ANDROID: sdcardfs: Use case insensitive hash function
Case insensitive comparisons don't help us much if
we hash to different buckets...

Signed-off-by: Daniel Rosenberg <drosen@google.com>
bug: 36004503
Change-Id: I91e00dbcd860a709cbd4f7fd7fc6d855779f3285
2017-09-22 19:12:21 +03:00
Daniel Rosenberg
63cf0c300a ANDROID: sdcardfs: declare MODULE_ALIAS_FS
From commit ee616b78aa87 ("Wrapfs: declare MODULE_ALIAS_FS")

Signed-off-by: Daniel Rosenberg <drosen@google.com>
bug: 35766959
Change-Id: Ia4728ab49d065b1d2eb27825046f14b97c328cba
2017-09-22 19:12:20 +03:00
Eric W. Biederman
5c1997410b fs: Limit sys_mount to only request filesystem modules.
Modify the request_module to prefix the file system type with "fs-"
and add aliases to all of the filesystems that can be built as modules
to match.

A common practice is to build all of the kernel code and leave code
that is not commonly needed as modules, with the result that many
users are exposed to any bug anywhere in the kernel.

Looking for filesystems with a fs- prefix limits the pool of possible
modules that can be loaded by mount to just filesystems trivially
making things safer with no real cost.

Using aliases means user space can control the policy of which
filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
with blacklist and alias directives.  Allowing simple, safe,
well understood work-arounds to known problematic software.

This also addresses a rare but unfortunate problem where the filesystem
name is not the same as it's module name and module auto-loading
would not work.  While writing this patch I saw a handful of such
cases.  The most significant being autofs that lives in the module
autofs4.

This is relevant to user namespaces because we can reach the request
module in get_fs_type() without having any special permissions, and
people get uncomfortable when a user specified string (in this case
the filesystem type) goes all of the way to request_module.

After having looked at this issue I don't think there is any
particular reason to perform any filtering or permission checks beyond
making it clear in the module request that we want a filesystem
module.  The common pattern in the kernel is to call request_module()
without regards to the users permissions.  In general all a filesystem
module does once loaded is call register_filesystem() and go to sleep.
Which means there is not much attack surface exposed by loading a
filesytem module unless the filesystem is mounted.  In a user
namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
which most filesystems do not set today.

Change-Id: I623b13dbdb44bb9ba7481f29575e1ca4ad8102f4
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Kees Cook <keescook@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Kevin F. Haggerty <haggertk@lineageos.org>
2017-09-22 19:12:20 +03:00
Daniel Rosenberg
628b9661d7 ANDROID: sdcardfs: Get the blocksize from the lower fs
This changes sdcardfs to be more in line with the
getattr in wrapfs, which calls the lower fs's getattr
to get the block size

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 34723223
Change-Id: I1c9e16604ba580a8cdefa17f02dcc489d7351aed
2017-09-22 19:12:19 +03:00
Daniel Rosenberg
30cc539e4c ANDROID: sdcardfs: Use d_invalidate instead of drop_recurisve
drop_recursive did not properly remove stale dentries.
Instead, we use the vfs's d_invalidate, which does the proper cleanup.

Additionally, remove the no longer used drop_recursive, and
fixup_top_recursive that that are no longer used.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: Ibff61b0c34b725b024a050169047a415bc90f0d8
2017-09-22 19:12:19 +03:00
Daniel Rosenberg
6b5418751d ANDROID: sdcardfs: Switch to internal case insensitive compare
There were still a few places where we called into a case
insensitive lookup that was not defined by sdcardfs.
Moving them all to the same place will allow us to switch
the implementation in the future.

Additionally, the check in fixup_perms_recursive did not
take into account the length of both strings, causing
extraneous matches when the name we were looking for was
a prefix of the child name.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Change-Id: I45ce768cd782cb4ea1ae183772781387c590ecc2
2017-09-22 19:12:18 +03:00
Daniel Rosenberg
280a7d21b7 ANDROID: sdcardfs: Use spin_lock_nested
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 36007653
Change-Id: I805d5afec797669679853fb2bb993ee38e6276e4
2017-09-22 19:12:18 +03:00
Daniel Rosenberg
9917de6f14 ANDROID: sdcardfs: Replace get/put with d_lock
dput cannot be called with a spin_lock. Instead,
we protect our accesses by holding the d_lock.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35643557
Change-Id: I22cf30856d75b5616cbb0c223724f5ab866b5114
2017-09-22 19:12:17 +03:00
Daniel Rosenberg
c3be309216 ANDROID: sdcardfs: rate limit warning print
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35848445
Change-Id: Ida72ea0ece191b2ae4a8babae096b2451eb563f6
2017-09-22 19:12:17 +03:00
Daniel Rosenberg
74b45b5a54 ANDROID: sdcardfs: support direct-IO (DIO) operations
This comes from the wrapfs patch
2e346c83b26e Wrapfs: support direct-IO (DIO) operations

Signed-off-by: Li Mengyang <li.mengyang@stonybrook.edu>
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 34133558
Change-Id: I3fd779c510ab70d56b1d918f99c20421b524cdc4
2017-09-22 19:12:17 +03:00
Daniel Rosenberg
7bc8a0524c ANDROID: sdcardfs: implement vm_ops->page_mkwrite
This comes from the wrapfs patch
3dfec0ffe5e2 Wrapfs: implement vm_ops->page_mkwrite

Some file systems (e.g., ext4) require it.  Reported by Ted Ts'o.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 34133558
Change-Id: I1a389b2422c654a6d3046bb8ec3e20511aebfa8e
2017-09-22 19:12:16 +03:00
Daniel Rosenberg
d782165c3b ANDROID: sdcardfs: Don't bother deleting freelist
There is no point deleting entries from dlist, as
that is a temporary list on the stack from which
contains only entries that are being deleted.

Not all code paths set up dlist, so those that
don't were performing invalid accesses in
hash_del_rcu. As an additional means to prevent
any other issue, we null out the list entries when
we allocate from the cache.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35666680
Change-Id: Ibb1e28c08c3a600c29418d39ba1c0f3db3bf31e5
2017-09-22 19:12:16 +03:00
Daniel Rosenberg
4a7fc6483f ANDROID: sdcardfs: Add missing path_put
"ANDROID: sdcardfs: Add GID Derivation to sdcardfs" introduced
an unbalanced pat_get, leading to storage space not being freed
after deleting a file until rebooting. This adds the missing path_put.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 34691169
Change-Id: Ia7ef97ec2eca2c555cc06b235715635afc87940e
2017-09-22 19:12:15 +03:00
Daniel Rosenberg
2f8e9489d3 ANDROID: sdcardfs: Fix incorrect hash
This adds back the hash calculation removed as part of
the previous patch, as it is in fact necessary.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 35307857
Change-Id: Ie607332bcf2c5d2efdf924e4060ef3f576bf25dc
2017-09-22 19:12:15 +03:00
Daniel Rosenberg
f9c56b73bd ANDROID: sdcardfs: Switch strcasecmp for internal call
This moves our uses of strcasecmp over to an internal call so we can
easily change implementations later if we so desire. Additionally,
we leverage qstr's where appropriate to save time on comparisons.

Change-Id: I32fdc4fd0cd3b7b735dcfd82f60a2516fd8272a5
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:15 +03:00
Al Viro
acf4f74449 constify d_lookup() arguments
Change-Id: I48ac8c9d7a63530b753b9d7b316e9222edeb5876
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-22 19:12:14 +03:00
Al Viro
44fc4d2f9a constify __d_lookup() arguments
Change-Id: I74c489fee16eb019d9d32572d867d6b54bf6cc91
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-22 19:12:14 +03:00
Daniel Rosenberg
2659b94b05 ANDROID: sdcardfs: switch to full_name_hash and qstr
Use the kernel's string hash function instead of rolling
our own. Additionally, save a bit of calculation by using
the qstr struct in place of strings.

Change-Id: I0bbeb5ec2a9233f40135ad632e6f22c30ffa95c1
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:13 +03:00
Daniel Rosenberg
b134627d5d ANDROID: sdcardfs: Add GID Derivation to sdcardfs
This changes sdcardfs to modify the user and group in the
underlying filesystem depending on its usage. Ownership is
set by Android user, and package, as well as if the file is
under obb or cache. Other files can be labeled by extension.
Those values are set via the configfs interace.

To add an entry,
mkdir -p [configfs root]/sdcardfs/extensions/[gid]/[ext]

Bug: 34262585
Change-Id: I4e030ce84f094a678376349b1a96923e5076a0f4
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:13 +03:00
Daniel Rosenberg
ce26b95540 ANDROID: sdcardfs: Remove redundant operation
We call get_derived_permission_new unconditionally, so we don't need
to call update_derived_permission_lock, which does the same thing.

Change-Id: I0748100828c6af806da807241a33bf42be614935
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:13 +03:00
Daniel Rosenberg
230896c581 ANDROID: sdcardfs: add support for user permission isolation
This allows you to hide the existence of a package from
a user by adding them to an exclude list. If a user
creates that package's folder and is on the exclude list,
they will not see that package's id.

Bug: 34542611
Change-Id: I9eb82e0bf2457d7eb81ee56153b9c7d2f6646323
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:12 +03:00
Daniel Rosenberg
9e78e3b970 ANDROID: sdcardfs: Refactor configfs interface
This refactors the configfs code to be more easily extended.
It will allow additional files to be added easily.

Bug: 34542611
Bug: 34262585
Change-Id: I73c9b0ae5ca7eb27f4ebef3e6807f088b512d539
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:12 +03:00
Daniel Rosenberg
26117e4ce4 ANDROID: sdcardfs: Allow non-owners to touch
This modifies the permission checks in setattr to
allow for non-owners to modify the timestamp of
files to things other than the current time.
This still requires write access, as enforced by
the permission call, but relaxes the requirement
that the caller must be the owner, allowing those
with group permissions to change it as well.

Bug: 11118565
Change-Id: Ied31f0cce2797675c7ef179eeb4e088185adcbad
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:11 +03:00
Daniel Rosenberg
fdfefc2e98 ANDROID: mnt: remount should propagate to slaves of slaves
propagate_remount was not accounting for the slave mounts
of other slave mounts, leading to some namespaces not
recieving the remount information.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Bug: 33731928
Change-Id: Idc9e8c2ed126a4143229fc23f10a959c2d0a3854
2017-09-22 19:12:11 +03:00
Daniel Rosenberg
cd769ec65b ANDROID: sdcardfs: Fix locking issue with permision fix up
Don't use lookup_one_len so we can grab the spinlock that
protects d_subdirs.

Bug: 30954918
Change-Id: I0c6a393252db7beb467e0d563739a3a14e1b5115
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:11 +03:00
Daniel Rosenberg
ac0146e438 ANDROID: vfs: Missed updating truncate to truncate2
Bug: 30954918
Change-Id: I8163d3f86dd7aadb2ab3fc11816754f331986f05
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:10 +03:00
Al Viro
c4d2a199dd BACKPORT: smarter propagate_mnt()
The current mainline has copies propagated to *all* nodes, then
tears down the copies we made for nodes that do not contain
counterparts of the desired mountpoint.  That sets the right
propagation graph for the copies (at teardown time we move
the slaves of removed node to a surviving peer or directly
to master), but we end up paying a fairly steep price in
useless allocations.  It's fairly easy to create a situation
where N calls of mount(2) create exactly N bindings, with
O(N^2) vfsmounts allocated and freed in process.

Fortunately, it is possible to avoid those allocations/freeings.
The trick is to create copies in the right order and find which
one would've eventually become a master with the current algorithm.
It turns out to be possible in O(nodes getting propagation) time
and with no extra allocations at all.

One part is that we need to make sure that eventual master will be
created before its slaves, so we need to walk the propagation
tree in a different order - by peer groups.  And iterate through
the peers before dealing with the next group.

Another thing is finding the (earlier) copy that will be a master
of one we are about to create; to do that we are (temporary) marking
the masters of mountpoints we are attaching the copies to.

Either we are in a peer of the last mountpoint we'd dealt with,
or we have the following situation: we are attaching to mountpoint M,
the last copy S_0 had been attached to M_0 and there are sequences
S_0...S_n, M_0...M_n such that S_{i+1} is a master of S_{i},
S_{i} mounted on M{i} and we need to create a slave of the first S_{k}
such that M is getting propagation from M_{k}.  It means that the master
of M_{k} will be among the sequence of masters of M.  On the
other hand, the nearest marked node in that sequence will either
be the master of M_{k} or the master of M_{k-1} (the latter -
in the case if M_{k-1} is a slave of something M gets propagation
from, but in a wrong peer group).

So we go through the sequence of masters of M until we find
a marked one (P).  Let N be the one before it.  Then we go through
the sequence of masters of S_0 until we find one (say, S) mounted
on a node D that has P as master and check if D is a peer of N.
If it is, S will be the master of new copy, if not - the master of S
will be.

That's it for the hard part; the rest is fairly simple.  Iterator
is in next_group(), handling of one prospective mountpoint is
propagate_one().

It seems to survive all tests and gives a noticably better performance
than the current mainline for setups that are seriously using shared
subtrees.

Change-Id: I45648e8a405544f768c5956711bdbdf509e2705a
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-22 19:12:10 +03:00
Al Viro
bd0f854d88 BACKPORT: don't bother with propagate_mnt() unless the target is shared
If the dest_mnt is not shared, propagate_mnt() does nothing -
there's no mounts to propagate to and thus no copies to create.
Might as well don't bother calling it in that case.

Change-Id: Id94af8ad288bf9bfc6ffb5570562bbc2dc2e0d87
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-22 19:12:09 +03:00
Daniel Rosenberg
0ccedbc84f sdcardfs: Use per mount permissions
This switches sdcardfs over to using permission2.
Instead of mounting several sdcardfs instances onto
the same underlaying directory, you bind mount a
single mount several times, and remount with the
options you want. These are stored in the private
mount data, allowing you to maintain the same tree,
but have different permissions for different mount
points.

Warning functions have been added for permission,
as it should never be called, and the correct
behavior is unclear.

Change-Id: I841b1d70ec60cf2b866fa48edeb74a0b0f8334f5
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:08 +03:00
Daniel Rosenberg
3edd0f78ef sdcardfs: Add gid and mask to private mount data
Adds support for mount2, remount2, and the functions
to allocate/clone/copy the private data

The next patch will switch over to actually using it.

Change-Id: I8a43da26021d33401f655f0b2784ead161c575e3
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:08 +03:00
Daniel Rosenberg
8967b8cde8 sdcardfs: User new permission2 functions
Change-Id: Ic7e0fb8fdcebb31e657b079fe02ac834c4a50db9
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:08 +03:00
Daniel Rosenberg
1620d1d7d4 vfs: Add setattr2 for filesystems with per mount permissions
This allows filesystems to use their mount private data to
influence the permssions they use in setattr2. It has
been separated into a new call to avoid disrupting current
setattr users.

Change-Id: I19959038309284448f1b7f232d579674ef546385
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:07 +03:00
Daniel Rosenberg
19a3f7c232 vfs: Add permission2 for filesystems with per mount permissions
This allows filesystems to use their mount private data to
influence the permssions they return in permission2. It has
been separated into a new call to avoid disrupting current
permission users.

Change-Id: I9d416e3b8b6eca84ef3e336bd2af89ddd51df6ca
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:07 +03:00
Daniel Rosenberg
2686aceaa7 vfs: Allow filesystems to access their private mount data
Now we pass the vfsmount when mounting and remounting.
This allows the filesystem to actually set up the mount
specific data, although we can't quite do anything with
it yet. show_options is expanded to include data that
lives with the mount.

To avoid changing existing filesystems, these have
been added as new vfs functions.

Change-Id: If80670bfad9f287abb8ac22457e1b034c9697097
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:06 +03:00
Daniel Rosenberg
ccbd24c7a0 mnt: Add filesystem private data to mount points
This starts to add private data associated directly
to mount points. The intent is to give filesystems
a sense of where they have come from, as a means of
letting a filesystem take different actions based on
this information.

Change-Id: Ie769d7b3bb2f5972afe05c1bf16cf88c91647ab2
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:06 +03:00
Daniel Rosenberg
313cdb2651 ANDROID: sdcardfs: Fix backport issue for 3.10
Don't use make_kuid/make_guid

Bug: 30954918
Change-Id: I56de640771872aeeae5a69c42bf2ce8a5cfa413f
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:05 +03:00
Daniel Rosenberg
87b619e0cb sdcardfs: Move directory unlock before touch
This removes a deadlock under low memory conditions.
filp_open can call lookup_slow, which will attempt to
lock the parent.

Change-Id: I940643d0793f5051d1e79a56f4da2fa8ca3d8ff7
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:05 +03:00
alvin_liang
b94d192e3e sdcardfs: fix external storage exporting incorrect uid
Symptom: App cannot write into per-app folder
Root Cause: sdcardfs exports incorrect uid
Solution: fix uid
Project: All
Note:
Test done by RD: passed

Change-Id: Iff64f6f40ba4c679f07f4426d3db6e6d0db7e3ca
2017-09-22 19:12:04 +03:00
Daniel Rosenberg
00fb55c68d sdcardfs: Added top to sdcardfs_inode_info
Adding packages to the package list and moving files
takes a large amount of locks, and is currently a
heavy operation. This adds a 'top' field to the
inode_info, which points to the inode for the top
most directory whose owner you would like to match.

On permission checks and get_attr, we look up the
owner based on the information at top. When we change
a package mapping, we need only modify the information
in the corresponding top inode_info's. When renaming,
we must ensure top is set correctly in all children.
This happens when an app specific folder gets moved
outside of the folder for that app.

Change-Id: Ib749c60b568e9a45a46f8ceed985c1338246ec6c
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:04 +03:00
Daniel Rosenberg
853f8e0523 sdcardfs: Switch package list to RCU
Switched the package id hashmap to use RCU.

Change-Id: I9fdcab279009005bf28536247d11e13babab0b93
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:03 +03:00
Daniel Rosenberg
b7ae873970 sdcardfs: Fix locking for permission fix up
Iterating over d_subdirs requires taking d_lock.
Removed several unneeded locks.

Change-Id: I5b1588e54c7e6ee19b756d6705171c7f829e2650
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:03 +03:00
Daniel Rosenberg
82f20f9741 sdcardfs: Check for other cases on path lookup
This fixes a bug where the first lookup of a
file or folder created under a different view
would not be case insensitive. It will now
search through for a case insensitive match
if the initial lookup fails.

Bug:28024488
Change-Id: I4ff9ce297b9f2f9864b47540e740fd491c545229
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:03 +03:00
Daniel Rosenberg
feccf66542 sdcardfs: override umask on mkdir and create
The mode on files created on the lower fs should
not be affected by the umask of the calling
task's fs_struct. Instead, we create a copy
and modify it as needed. This also lets us avoid
the string shenanigans around .nomedia files.

Bug: 27992761
Change-Id: Ia3a6e56c24c6e19b3b01c1827e46403bb71c2f4c
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:02 +03:00
Julia Lawall
b46375c331 ANDROID: sdcardfs: fix itnull.cocci warnings
List_for_each_entry has the property that the first argument is always
bound to a real list element, never NULL, so testing dentry is not needed.

Generated by: scripts/coccinelle/iterators/itnull.cocci

Change-Id: I51033a2649eb39451862b35b6358fe5cfe25c5f5
Cc: Daniel Rosenberg <drosen@google.com>
Signed-off-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Guenter Roeck <groeck@chromium.org>
2017-09-22 19:12:02 +03:00
Daniel Rosenberg
ccf7c04945 sdcardfs: Truncate packages_gid.list on overflow
packages_gid.list was improperly returning the wrong
count. Use scnprintf instead, and inform the user that
the list was truncated if it is.

Bug: 30013843
Change-Id: Ida2b2ef7cd86dd87300bfb4c2cdb6bfe2ee1650d
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:01 +03:00
Daniel Rosenberg
611237fca8 vfs: change d_canonical_path to take two paths
bug: 23904372
Change-Id: I4a686d64b6de37decf60019be1718e1d820193e6
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:01 +03:00
Daniel Rosenberg
f5fea6a938 fuse: Add support for d_canonical_path
Allows FUSE to report to inotify that it is acting
as a layered filesystem. The userspace component
returns a string representing the location of the
underlying file. If the string cannot be resolved
into a path, the top level path is returned instead.

bug: 23904372
Change-Id: Iabdca0bbedfbff59e9c820c58636a68ef9683d9f
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:12:00 +03:00
Daniel Rosenberg
10db9d59bf sdcardfs: remove unneeded __init and __exit
Change-Id: I2a2d45d52f891332174c3000e8681c5167c1564f
2017-09-22 19:12:00 +03:00
Daniel Rosenberg
679046d6e7 sdcardfs: Remove unused code
Change-Id: Ie97cba27ce44818ac56cfe40954f164ad44eccf6
2017-09-22 19:12:00 +03:00
Daniel Rosenberg
158b6be502 sdcardfs: remove effectless config option
CONFIG_SDCARD_FS_CI_SEARCH only guards a define for
LOOKUP_CASE_INSENSITIVE, which is never used in the
kernel. Remove both, along with the option matching
that supports it.

Change-Id: I363a8f31de8ee7a7a934d75300cc9ba8176e2edf
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:11:59 +03:00
Daniel Rosenberg
49e4c3ad85 inotify: Fix erroneous update of bit count
Patch "vfs: add d_canonical_path for stacked filesystem support"
erroneously updated the ALL_INOTIFY_BITS count. This changes it back

Change-Id: Idb04edc736da276159d30f04c40cff9d6b1e070f
2017-09-22 19:11:59 +03:00
Daniel Rosenberg
3bf0ab5c8d sdcardfs: Add support for d_canonicalize
Change-Id: I5d6f0e71b8ca99aec4b0894412f1dfd1cfe12add
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:11:58 +03:00
Daniel Rosenberg
9f8d208e0e vfs: add d_canonical_path for stacked filesystem support
Inotify does not currently know when a filesystem
is acting as a wrapper around another fs. This means
that inotify watchers will miss any modifications to
the base file, as well as any made in a separate
stacked fs that points to the same file.
d_canonical_path solves this problem by allowing the fs
to map a dentry to a path in the lower fs. Inotify
can use it to find the appropriate place to watch to
be informed of all changes to a file.

Change-Id: I09563baffad1711a045e45c1bd0bd8713c2cc0b6
Signed-off-by: Daniel Rosenberg <drosen@google.com>
2017-09-22 19:11:58 +03:00
Daniel Rosenberg
8e6b0f6c8c sdcardfs: Bring up to date with Android M permissions:
In M, the workings of sdcardfs were changed significantly.
This brings sdcardfs into line with the changes.

Change-Id: I10e91a84a884c838feef7aa26c0a2b21f02e052e
2017-09-22 19:11:57 +03:00
Daniel Campello
1dbc72eb35 sdcardfs: Changed type-cast in packagelist management
Change-Id: Ic8842de2d7274b7a5438938d2febf5d8da867148
2017-09-22 19:11:57 +03:00
fluxi
fd2464db10 sdcardfs: Port to 3.4
Analog port to 3.10 by Daniel Campello <campello@google.com>.

Change-Id: I0b05890cdd4332c5cfc2ffdf66a3f3a7890cce35
2017-09-22 19:11:56 +03:00
Daniel Campello
6b980874d5 Included sdcardfs source code for kernel 3.0
Only included the source code as is for kernel 3.0. Following patches
take care of porting this file system to version 3.10.

Change-Id: I09e76db77cd98a059053ba5b6fd88572a4b75b5b
Signed-off-by: Daniel Campello <campello@google.com>
2017-09-22 19:11:56 +03:00
Al Viro
be084ce9f1 move d_rcu from overlapping d_child to overlapping d_alias
commit 946e51f2bf37f1656916eb75bd0742ba33983c28 upstream.

Change-Id: I85366e6ce0423ec9620bcc9cd3e7695e81aa1171
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[bwh: Backported to 3.2:
 - Apply name changes in all the different places we use d_alias and d_child
 - Move the WARN_ON() in __d_free() to d_free() as we don't have dentry_free()]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
[lizf: Backported to 3.4:
 - adjust context
 - need one more name change in debugfs]
2017-09-22 19:11:55 +03:00
Al Viro
f787204b2f get rid of kern_path_parent()
all callers want the same thing, actually - a kinda-sorta analog of
kern_path_create().  I.e. they want parent vfsmount/dentry (with
->i_mutex held, to make sure the child dentry is still their child)
+ the child dentry.

Signed-off-by Al Viro <viro@zeniv.linux.org.uk>

Change-Id: I58cc7b0a087646516db9af69962447d27fb3ee8b
2017-09-22 19:11:54 +03:00
Andy Lutomirski
b6dd4f6779 UPSTREAM: capabilities: ambient capabilities
Credit where credit is due: this idea comes from Christoph Lameter with
a lot of valuable input from Serge Hallyn.  This patch is heavily based
on Christoph's patch.

===== The status quo =====

On Linux, there are a number of capabilities defined by the kernel.  To
perform various privileged tasks, processes can wield capabilities that
they hold.

Each task has four capability masks: effective (pE), permitted (pP),
inheritable (pI), and a bounding set (X).  When the kernel checks for a
capability, it checks pE.  The other capability masks serve to modify
what capabilities can be in pE.

Any task can remove capabilities from pE, pP, or pI at any time.  If a
task has a capability in pP, it can add that capability to pE and/or pI.
If a task has CAP_SETPCAP, then it can add any capability to pI, and it
can remove capabilities from X.

Tasks are not the only things that can have capabilities; files can also
have capabilities.  A file can have no capabilty information at all [1].
If a file has capability information, then it has a permitted mask (fP)
and an inheritable mask (fI) as well as a single effective bit (fE) [2].
File capabilities modify the capabilities of tasks that execve(2) them.

A task that successfully calls execve has its capabilities modified for
the file ultimately being excecuted (i.e.  the binary itself if that
binary is ELF or for the interpreter if the binary is a script.) [3] In
the capability evolution rules, for each mask Z, pZ represents the old
value and pZ' represents the new value.  The rules are:

  pP' = (X & fP) | (pI & fI)
  pI' = pI
  pE' = (fE ? pP' : 0)
  X is unchanged

For setuid binaries, fP, fI, and fE are modified by a moderately
complicated set of rules that emulate POSIX behavior.  Similarly, if
euid == 0 or ruid == 0, then fP, fI, and fE are modified differently
(primary, fP and fI usually end up being the full set).  For nonroot
users executing binaries with neither setuid nor file caps, fI and fP
are empty and fE is false.

As an extra complication, if you execute a process as nonroot and fE is
set, then the "secure exec" rules are in effect: AT_SECURE gets set,
LD_PRELOAD doesn't work, etc.

This is rather messy.  We've learned that making any changes is
dangerous, though: if a new kernel version allows an unprivileged
program to change its security state in a way that persists cross
execution of a setuid program or a program with file caps, this
persistent state is surprisingly likely to allow setuid or file-capped
programs to be exploited for privilege escalation.

===== The problem =====

Capability inheritance is basically useless.

If you aren't root and you execute an ordinary binary, fI is zero, so
your capabilities have no effect whatsoever on pP'.  This means that you
can't usefully execute a helper process or a shell command with elevated
capabilities if you aren't root.

On current kernels, you can sort of work around this by setting fI to
the full set for most or all non-setuid executable files.  This causes
pP' = pI for nonroot, and inheritance works.  No one does this because
it's a PITA and it isn't even supported on most filesystems.

If you try this, you'll discover that every nonroot program ends up with
secure exec rules, breaking many things.

This is a problem that has bitten many people who have tried to use
capabilities for anything useful.

===== The proposed change =====

This patch adds a fifth capability mask called the ambient mask (pA).
pA does what most people expect pI to do.

pA obeys the invariant that no bit can ever be set in pA if it is not
set in both pP and pI.  Dropping a bit from pP or pI drops that bit from
pA.  This ensures that existing programs that try to drop capabilities
still do so, with a complication.  Because capability inheritance is so
broken, setting KEEPCAPS, using setresuid to switch to nonroot uids, and
then calling execve effectively drops capabilities.  Therefore,
setresuid from root to nonroot conditionally clears pA unless
SECBIT_NO_SETUID_FIXUP is set.  Processes that don't like this can
re-add bits to pA afterwards.

The capability evolution rules are changed:

  pA' = (file caps or setuid or setgid ? 0 : pA)
  pP' = (X & fP) | (pI & fI) | pA'
  pI' = pI
  pE' = (fE ? pP' : pA')
  X is unchanged

If you are nonroot but you have a capability, you can add it to pA.  If
you do so, your children get that capability in pA, pP, and pE.  For
example, you can set pA = CAP_NET_BIND_SERVICE, and your children can
automatically bind low-numbered ports.  Hallelujah!

Unprivileged users can create user namespaces, map themselves to a
nonzero uid, and create both privileged (relative to their namespace)
and unprivileged process trees.  This is currently more or less
impossible.  Hallelujah!

You cannot use pA to try to subvert a setuid, setgid, or file-capped
program: if you execute any such program, pA gets cleared and the
resulting evolution rules are unchanged by this patch.

Users with nonzero pA are unlikely to unintentionally leak that
capability.  If they run programs that try to drop privileges, dropping
privileges will still work.

It's worth noting that the degree of paranoia in this patch could
possibly be reduced without causing serious problems.  Specifically, if
we allowed pA to persist across executing non-pA-aware setuid binaries
and across setresuid, then, naively, the only capabilities that could
leak as a result would be the capabilities in pA, and any attacker
*already* has those capabilities.  This would make me nervous, though --
setuid binaries that tried to privilege-separate might fail to do so,
and putting CAP_DAC_READ_SEARCH or CAP_DAC_OVERRIDE into pA could have
unexpected side effects.  (Whether these unexpected side effects would
be exploitable is an open question.) I've therefore taken the more
paranoid route.  We can revisit this later.

An alternative would be to require PR_SET_NO_NEW_PRIVS before setting
ambient capabilities.  I think that this would be annoying and would
make granting otherwise unprivileged users minor ambient capabilities
(CAP_NET_BIND_SERVICE or CAP_NET_RAW for example) much less useful than
it is with this patch.

===== Footnotes =====

[1] Files that are missing the "security.capability" xattr or that have
unrecognized values for that xattr end up with has_cap set to false.
The code that does that appears to be complicated for no good reason.

[2] The libcap capability mask parsers and formatters are dangerously
misleading and the documentation is flat-out wrong.  fE is *not* a mask;
it's a single bit.  This has probably confused every single person who
has tried to use file capabilities.

[3] Linux very confusingly processes both the script and the interpreter
if applicable, for reasons that elude me.  The results from thinking
about a script's file capabilities and/or setuid bits are mostly
discarded.

Preliminary userspace code is here, but it needs updating:
https://git.kernel.org/cgit/linux/kernel/git/luto/util-linux-playground.git/commit/?h=cap_ambient&id=7f5afbd175d2

Here is a test program that can be used to verify the functionality
(from Christoph):

/*
 * Test program for the ambient capabilities. This program spawns a shell
 * that allows running processes with a defined set of capabilities.
 *
 * (C) 2015 Christoph Lameter <cl@linux.com>
 * Released under: GPL v3 or later.
 *
 *
 * Compile using:
 *
 *	gcc -o ambient_test ambient_test.o -lcap-ng
 *
 * This program must have the following capabilities to run properly:
 * Permissions for CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE
 *
 * A command to equip the binary with the right caps is:
 *
 *	setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test
 *
 *
 * To get a shell with additional caps that can be inherited by other processes:
 *
 *	./ambient_test /bin/bash
 *
 *
 * Verifying that it works:
 *
 * From the bash spawed by ambient_test run
 *
 *	cat /proc/$$/status
 *
 * and have a look at the capabilities.
 */

/*
 * Definitions from the kernel header files. These are going to be removed
 * when the /usr/include files have these defined.
 */

static void set_ambient_cap(int cap)
{
	int rc;

	capng_get_caps_process();
	rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap);
	if (rc) {
		printf("Cannot add inheritable cap\n");
		exit(2);
	}
	capng_apply(CAPNG_SELECT_CAPS);

	/* Note the two 0s at the end. Kernel checks for these */
	if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) {
		perror("Cannot set cap");
		exit(1);
	}
}

int main(int argc, char **argv)
{
	int rc;

	set_ambient_cap(CAP_NET_RAW);
	set_ambient_cap(CAP_NET_ADMIN);
	set_ambient_cap(CAP_SYS_NICE);

	printf("Ambient_test forking shell\n");
	if (execv(argv[1], argv + 1))
		perror("Cannot exec");

	return 0;
}

Signed-off-by: Christoph Lameter <cl@linux.com> # Original author
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Aaron Jones <aaronmdjones@gmail.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Andrew G. Morgan <morgan@kernel.org>
Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: Austin S Hemmelgarn <ahferroin7@gmail.com>
Cc: Markku Savela <msa@moth.iki.fi>
Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: James Morris <james.l.morris@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 58319057b7847667f0c9585b9de0e8932b0fdb08)

Bug: 31038224
Change-Id: I88bc5caa782dc6be23dc7e839ff8e11b9a903f8c
Signed-off-by: Jorge Lucangeli Obes <jorgelo@google.com>
2017-09-01 13:38:08 +03:00
Greg Hackmann
84a377cc19 timerfd: support CLOCK_BOOTTIME clock
Add CLOCK_BOOTTIME support to timerfd

Change-Id: I14dee6d1104f15a05f463a632268ac4564753faf
Signed-off-by: Greg Hackmann <ghackmann@google.com>
2017-08-27 19:07:23 +03:00
Todd Poynor
9ae9589197 timerfd: add alarm timers
Add support for clocks CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM.

Change-Id: Iafc8445d3d7ffb35110c860f1607bf03f1edb895
Signed-off-by: Todd Poynor <toddpoynor@google.com>
2017-08-27 19:07:22 +03:00
John Stultz
1bc537874e alarmtimers: Squash upstream changes
staging: android-alarm: Switch from wakelocks to wakeup sources

In their current AOSP tree, the Android in-kernel wakelock
infrastructure has been reimplemented in terms of wakeup
sources:
http://git.linaro.org/gitweb?p=people/jstultz/android.git;a=commitdiff;h=e9911f4efdc55af703b8b3bb8c839e6f5dd173bb

The Android alarm driver currently has stubbed out calls
to wakelock functionality. So this patch simply converts
the stubbed out wakelock calls to wakeup source calls, and
removes the empty wakelock macros

Greg, would you mind queuing this in staging-next?

CC: Colin Cross <ccross@android.com>
CC: Arve Hjønnevåg <arve@android.com>
CC: Greg KH <gregkh@linuxfoundation.org>
CC: Android Kernel Team <kernel-team@android.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Staging: android: alarm: Rename pr_alarm to alarm_dbg

Rename a macro to make it explicit it's for debugging.

Use %s: __func__ instead of embedding function names.
Coalesce formats, align arguments.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

staging: Android: Fix NULL pointer related warning in alarm-dev.c file

Fixes the following sparse warning:
drivers/staging/android/alarm-dev.c:259:35: warning: Using plain integer as NULL pointer

Cc: Brian Swetland <swetland@google.com>
Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

staging: android: alarm: remove unnecessary goto statement

Signed-off-by: Devendra Naga <devendra.aaru@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Staging: android: Alarm driver cleanups

Little cleanups. Enum value ANDROID_ALARM_TYPE_COUNT was treated as
an alarm type within a switch statement. That condition was unreachable
though.

Signed-off-by: Dae S. Kim <dae@velatum.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

staging: alarm-dev: Drop pre Android 1.0 _OLD ioctls

Per Colin's comment:
"The "support old userspace code" comment for those two ioctls has
been there since pre-Android 1.0.  Those apis are not exposed to
Android apps, I don't see any problem deleting them."

Thus this patch removes the ANDROID_ALARM_SET_OLD and
ANDROID_ALARM_SET_AND_WAIT_OLD ioctl compatability
logic.

Change-Id: I5138aaa3cdbfb758aaef2cc7591cb5340b3640a0
Cc: Serban Constantinescu <serban.constantinescu@arm.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Colin Cross <ccross@google.com>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

staging: alarm-dev: Refactor alarm-dev ioctl code in prep for compat_ioctl

Cleanup the Android alarm-dev driver's ioctl code to refactor it
in preparation for compat_ioctl support.

Cc: Serban Constantinescu <serban.constantinescu@arm.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Colin Cross <ccross@google.com>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

staging: alarm-dev: Implement compat_ioctl support

Implement compat_ioctl support for the alarm-dev ioctl.

Cc: Serban Constantinescu <serban.constantinescu@arm.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Colin Cross <ccross@google.com>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

staging: alarm-dev: information leak in alarm_ioctl()

Smatch complains that if we pass an invalid clock type then "ts" is
never set.  We need to check for errors earlier, otherwise we end up
passing uninitialized stack data to userspace.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

staging: alarm-dev: information leak in alarm_compat_ioctl()

If we pass an invalid clock type then "ts" is never set.  We need to
check for errors earlier, otherwise we end up passing uninitialized
stack data to userspace.

Reported-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

rtc: alarm: Add power-on alarm feature

Android does not support powering up of phone through alarm.
Adding shutdown hook in alarm driver which will set alarm while phone
is going down so as to power-up the phone after alarm expiration.

Change-Id: Ic2611e33ae9c1f8e83f21efdb93e26ac9f9499de
Signed-off-by: Matthew Qin <yqin@codeaurora.org>

qpnp-rtc: clear alarm register when rtc irq is disabled

The rtc alarm register should be cleared when the rtc irq is
disabled

Change-Id: I97a8bf989ff610093240a6b308a297702da6cb89
Signed-off-by: Xiaocheng Li <lix@codeaurora.org>
Signed-off-by: Matthew Qin <yqin@codeaurora.org>

alarm : Fix the race conditions in alarm-dev.c

There will be race conditions between alarm set and alarm clear if
set_power_on_alarm is out of the lock.But set_power_on_alarm can't
be put into spin_lock. So add it into mutex_lock.

CRs-Fixed: 639115
Change-Id: I4226a95e499211c0d50ff7ce269467a57a410dc7
Signed-off-by: Mao Jinlong <c_jmao@codeaurora.org>

switch timerfd_[sg]ettime(2) to fget_light()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

time: Enable alarmtimers

Change-Id: I4d1ea553aa707ab0467e00dea86cedc8f6797b78
2017-08-25 20:00:02 +03:00
Jin Qian
9e20025f8b f2fs: sanity check checkpoint segno and blkoff
Make sure segno and blkoff read from raw image are valid.

Cc: stable@vger.kernel.org
Signed-off-by: Jin Qian <jinqian@google.com>
[Jaegeuk Kim: adjust minor coding style]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

Change-Id: Ie2505c071233c1a9dec2729fe1ad467689a1b7a2
(cherry picked from commit 15d3042a937c13f5d9244241c7a9c8416ff6e82a)
2017-08-07 18:11:20 -06:00
Jin Qian
46e0dfc447 f2fs: sanity check segment count
F2FS uses 4 bytes to represent block address. As a result, supported
size of disk is 16 TB and it equals to 16 * 1024 * 1024 / 2 segments.

Signed-off-by: Jin Qian <jinqian@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

Change-Id: I16b3cd6279bff1a221781a80b9b34744c9e7098f
(cherry picked from commit b9dd46188edc2f0d1f37328637860bb65a771124)
2017-08-07 18:11:13 -06:00
Thomas Gleixner
5a34ec804c timerfd: Protect the might cancel mechanism proper
The handling of the might_cancel queueing is not properly protected, so
parallel operations on the file descriptor can race with each other and
lead to list corruptions or use after free.

Protect the context for these operations with a seperate lock.

The wait queue lock cannot be reused for this because that would create a
lock inversion scenario vs. the cancel lock. Replacing might_cancel with an
atomic (atomic_t or atomic bit) does not help either because it still can
race vs. the actual list operation.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "linux-fsdevel@vger.kernel.org"
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701311521430.3457@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Change-Id: I1f2d38a919ceb1ca1c7c9471dece0c1126383912
(cherry picked from commit 1e38da300e1e395a15048b0af1e5305bd91402f6)
2017-08-07 18:11:00 -06:00
Jan Kara
6f25be195e udf: Check path length when reading symlink
Symlink reading code does not check whether the resulting path fits into
the page provided by the generic code. This isn't as easy as just
checking the symlink size because of various encoding conversions we
perform on path. So we have to check whether there is still enough space
in the buffer on the fly.

Change-Id: Id56d129029eaf2e651cf7236103fb73aa540ae1f
CC: stable@vger.kernel.org
Reported-by: Carl Henrik Lunde <chlunde@ping.uio.no>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-07-10 01:48:57 +03:00
Kees Cook
b48a26fd3d fs/exec.c: account for argv/envp pointers
commit 98da7d08850fb8bdeb395d6368ed15753304aa0c upstream.

When limiting the argv/envp strings during exec to 1/4 of the stack limit,
the storage of the pointers to the strings was not included.  This means
that an exec with huge numbers of tiny strings could eat 1/4 of the stack
limit in strings and then additional space would be later used by the
pointers to the strings.

For example, on 32-bit with a 8MB stack rlimit, an exec with 1677721
single-byte strings would consume less than 2MB of stack, the max (8MB /
4) amount allowed, but the pointers to the strings would consume the
remaining additional stack space (1677721 * 4 == 6710884).

The result (1677721 + 6710884 == 8388605) would exhaust stack space
entirely.  Controlling this stack exhaustion could result in
pathological behavior in setuid binaries (CVE-2017-1000365).

[akpm@linux-foundation.org: additional commenting from Kees]
Fixes: b6a2fea393 ("mm: variable length argument support")
Link: http://lkml.kernel.org/r/20170622001720.GA32173@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Qualys Security Advisory <qsa@qualys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Change-Id: I2e01d7be2d52415264ff48c632bfe307008c4e03
2017-07-04 01:21:47 +03:00
Hugh Dickins
45c60e5957 mm: larger stack guard gap, between vmas
commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream.

Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.

This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.

Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.

One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications.  For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).

Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.

Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.

Change-Id: I611023b0bfe1cab7b3e5da13e331a7baaaaf6eb0
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
[wt: backport to 4.11: adjust context]
[wt: backport to 4.9: adjust context ; kernel doc was not in admin-guide]
[wt: backport to 4.4: adjust context ; drop ppc hugetlb_radix changes]
[wt: backport to 3.18: adjust context ; no FOLL_POPULATE ;
     s390 uses generic arch_get_unmapped_area()]
[wt: backport to 3.16: adjust context]
[wt: backport to 3.10: adjust context ; code logic in PARISC's
     arch_get_unmapped_area() wasn't found ; code inserted into
     expand_upwards() and expand_downwards() runs under anon_vma lock;
     changes for gup.c:faultin_page go to memory.c:__get_user_pages();
     included Hugh Dickins' fixes]
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Flex1911 <dedsa2002@gmail.com>
2017-07-02 13:03:27 +03:00
Linus Torvalds
939b2740ab splice: introduce FMODE_SPLICE_READ and FMODE_SPLICE_WRITE
Introduce FMODE_SPLICE_READ and FMODE_SPLICE_WRITE. These modes check
whether it is legal to read or write a file using splice. Both get
automatically set on regular files and are not checked when a 'struct
fileoperations' includes the splice_{read,write} methods.

Change-Id: Icb601a7db12d4e07a62d790edaa8a9a5aed3ba2a
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
2017-06-26 21:30:22 +03:00
Laura Abbott
5d5a14df85 fs: fuse: Add replacment for CMA pages into the LRU cache
CMA pages are currently replaced in the FUSE file system since
FUSE may hold on to CMA pages for a long time, preventing migration.
The replacement page is added to the file cache but not the LRU
cache. This may prevent the page from being properly aged and dropped,
creating poor performance under tight memory condition. Fix this by
adding the new page to the LRU cache after creation.

Change-Id: Ib349abf1024d48386b835335f3fbacae040b6241
CRs-Fixed: 586855
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
2017-06-26 21:01:44 +03:00
Jan Kara
b1a8c88774 BACKPORT: posix_acl: Clear SGID bit when setting file permissions
[Partially applied during f2fs inclusion, changes now aligned to upstream]

(cherry pick from commit 073931017b49d9458aa351605b43a7e34598caef)

When file permissions are modified via chmod(2) and the user is not in
the owning group or capable of CAP_FSETID, the setgid bit is cleared in
inode_change_ok().  Setting a POSIX ACL via setxattr(2) sets the file
permissions as well as the new ACL, but doesn't clear the setgid bit in
a similar way; this allows to bypass the check in chmod(2).  Fix that.

NB: conflicts resolution included extending the change to all visible
    users of the near deprecated function posix_acl_equiv_mode
    replaced with posix_acl_update_mode. We did not resolve the ACL
    leak in this CL, require additional upstream fixes.

References: CVE-2016-7097
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Bug: 32458736
[haggertk]: Backport to 3.4/msm8974
  * convert use of capable_wrt_inode_uidgid to capable
Change-Id: I19591ad452cc825ac282b3cfd2daaa72aa9a1ac1
2017-06-26 20:26:17 +03:00
Adrian Salido
21e685af37 fs/proc/array.c: make safe access to group_leader
As mentioned in commit 52ee2dfdd4
("pids: refactor vnr/nr_ns helpers to make them safe"). *_nr_ns
helpers used to be buggy. The commit addresses most of the helpers but
is missing task_tgid_xxx()

Without this protection there is a possible use after free reported by
kasan instrumented kernel:

==================================================================
BUG: KASAN: use-after-free in task_tgid_nr_ns+0x2c/0x44 at addr ***
Read of size 8 by task cat/2472
CPU: 1 PID: 2472 Comm: cat Tainted: ****
Hardware name: Google Tegra210 Smaug Rev 1,3+ (DT)
Call trace:
[<ffffffc00020ad2c>] dump_backtrace+0x0/0x17c
[<ffffffc00020aec0>] show_stack+0x18/0x24
[<ffffffc0011573d0>] dump_stack+0x94/0x100
[<ffffffc0003c7dc0>] kasan_report+0x308/0x554
[<ffffffc0003c7518>] __asan_load8+0x20/0x7c
[<ffffffc00025a54c>] task_tgid_nr_ns+0x28/0x44
[<ffffffc00046951c>] proc_pid_status+0x444/0x1080
[<ffffffc000460f60>] proc_single_show+0x8c/0xdc
[<ffffffc0004081b0>] seq_read+0x2e8/0x6f0
[<ffffffc0003d1420>] vfs_read+0xd8/0x1e0
[<ffffffc0003d1b98>] SyS_read+0x68/0xd4

Accessing group_leader while holding rcu_lock and using the now safe
helpers introduced in the commit mentioned, this race condition is
addressed.

Signed-off-by: Adrian Salido <salidoa@google.com>
Change-Id: I4315217922dda375a30a3581c0c1740dda7b531b
Bug: 31495866
2017-06-07 13:26:13 -06:00
Tom Marshall
903457eb2c kernel: Fix potential refcount leak in su check
Change-Id: I3d241ae805ba708c18bccfd5e5d6cdcc8a5bc1c8
2017-05-19 18:41:42 -06:00
Tom Marshall
75ec7fa33f kernel: Only expose su when daemon is running
Note: this is for the 3.4 kernel

It has been claimed that the PG implementation of 'su' has security
vulnerabilities even when disabled.  Unfortunately, the people that
find these vulnerabilities often like to keep them private so they
can profit from exploits while leaving users exposed to malicious
hackers.

In order to reduce the attack surface for vulnerabilites, it is
therefore necessary to make 'su' completely inaccessible when it
is not in use (except by the root and system users).

Change-Id: Ia7d50ba46c3d932c2b0ca5fc8e9ec69ec9045f85
2017-05-19 18:41:25 -06:00
Ben Hutchings
ade551c944 splice: Apply generic position and size checks to each write
We need to check the position and size of file writes against various
limits, using generic_write_check().  This was not being done for
the splice write path.  It was fixed upstream by commit 8d0207652c
("->splice_write() via ->write_iter()") but we can't apply that.

CVE-2014-7822

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>

Change-Id: Ic7e7bd78d4594c993c9684d32a0ddeaf70165bce
(cherry picked from commit 894c6350eaad7e613ae267504014a456e00a3e2a)
2017-04-17 15:41:36 -06:00
Jeff Mahoney
dae4a6db45 ecryptfs: don't allow mmap when the lower fs doesn't support it
There are legitimate reasons to disallow mmap on certain files, notably
in sysfs or procfs.  We shouldn't emulate mmap support on file systems
that don't offer support natively.

CVE-2016-1583

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: stable@vger.kernel.org
[tyhicks: clean up f_op check by using ecryptfs_file_to_lower()]
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
(adapted from commit f0fe970df3838c202ef6c07a4c2b36838ef0a88b)

Change-Id: I3eb979e9476847834eeea0ecbaf07a53329a7219
2017-03-03 16:49:07 -07:00
Eryu Guan
e0dd30eb33 ext4: validate s_first_meta_bg at mount time
Ralf Spenneberg reported that he hit a kernel crash when mounting a
modified ext4 image. And it turns out that kernel crashed when
calculating fs overhead (ext4_calculate_overhead()), this is because
the image has very large s_first_meta_bg (debug code shows it's
842150400), and ext4 overruns the memory in count_overhead() when
setting bitmap buffer, which is PAGE_SIZE.

ext4_calculate_overhead():
  buf = get_zeroed_page(GFP_NOFS);  <=== PAGE_SIZE buffer
  blks = count_overhead(sb, i, buf);

count_overhead():
  for (j = ext4_bg_num_gdb(sb, grp); j > 0; j--) { <=== j = 842150400
          ext4_set_bit(EXT4_B2C(sbi, s++), buf);   <=== buffer overrun
          count++;
  }

This can be reproduced easily for me by this script:

  #!/bin/bash
  rm -f fs.img
  mkdir -p /mnt/ext4
  fallocate -l 16M fs.img
  mke2fs -t ext4 -O bigalloc,meta_bg,^resize_inode -F fs.img
  debugfs -w -R "ssv first_meta_bg 842150400" fs.img
  mount -o loop fs.img /mnt/ext4

Fix it by validating s_first_meta_bg first at mount time, and
refusing to mount if its value exceeds the largest possible meta_bg
number.

Reported-by: Ralf Spenneberg <ralf@os-t.de>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
(cherry picked from commit 3a4b77cd47bb837b8557595ec7425f281f2ca1fe)
(minor backport adapted from cf851ad35fd1e9c7b8ed00741eca613bc1a9c8c8)

Change-Id: If183ad4a873705c9a0312087577705298b3586fe
2017-03-03 13:40:24 -07:00
Nick Desaulniers
a8c9068848 BACKPORT: aio: mark AIO pseudo-fs noexec
This ensures that do_mmap() won't implicitly make AIO memory mappings
executable if the READ_IMPLIES_EXEC personality flag is set.  Such
behavior is problematic because the security_mmap_file LSM hook doesn't
catch this case, potentially permitting an attacker to bypass a W^X
policy enforced by SELinux.

I have tested the patch on my machine.

To test the behavior, compile and run this:

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/personality.h>
    #include <linux/aio_abi.h>
    #include <err.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <sys/syscall.h>

    int main(void) {
        personality(READ_IMPLIES_EXEC);
        aio_context_t ctx = 0;
        if (syscall(__NR_io_setup, 1, &ctx))
            err(1, "io_setup");

        char cmd[1000];
        sprintf(cmd, "cat /proc/%d/maps | grep -F '/[aio]'",
            (int)getpid());
        system(cmd);
        return 0;
    }

In the output, "rw-s" is good, "rwxs" is bad.

Signed-off-by: Jann Horn <jann@thejh.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 22f6b4d34fcf039c63a94e7670e0da24f8575a5a)
(cherry picked from googlesource commit bc02d1d9f5)

Bug: 31711619
Change-Id: I9f2872703bef240d6b82320c744529459bb076dc
2017-03-03 13:15:25 -07:00
Jan Kara
bab8f030b0 isofs: Fix infinite looping over CE entries
Rock Ridge extensions define so called Continuation Entries (CE) which
define where is further space with Rock Ridge data. Corrupted isofs
image can contain arbitrarily long chain of these, including a one
containing loop and thus causing kernel to end in an infinite loop when
traversing these entries.

Limit the traversal to 32 entries which should be more than enough space
to store all the Rock Ridge data.

Reported-by: P J P <ppandit@redhat.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
(cherry picked from commit f54e18f1b831c92f6512d2eedb224cd63d607d3d)

Change-Id: I62cd59b27ac11fbf0a04b0d02874df7f390338bb
2017-03-03 12:44:58 -07:00
Omar Sandoval
dbe124efba block: fix use-after-free in sys_ioprio_get()
get_task_ioprio() accesses the task->io_context without holding the task
lock and thus can race with exit_io_context(), leading to a
use-after-free. The reproducer below hits this within a few seconds on
my 4-core QEMU VM:

int main(int argc, char **argv)
{
	pid_t pid, child;
	long nproc, i;

	/* ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)); */
	syscall(SYS_ioprio_set, 1, 0, 0x6000);

	nproc = sysconf(_SC_NPROCESSORS_ONLN);

	for (i = 0; i < nproc; i++) {
		pid = fork();
		assert(pid != -1);
		if (pid == 0) {
			for (;;) {
				pid = fork();
				assert(pid != -1);
				if (pid == 0) {
					_exit(0);
				} else {
					child = wait(NULL);
					assert(child == pid);
				}
			}
		}

		pid = fork();
		assert(pid != -1);
		if (pid == 0) {
			for (;;) {
				/* ioprio_get(IOPRIO_WHO_PGRP, 0); */
				syscall(SYS_ioprio_get, 2, 0);
			}
		}
	}

	for (;;) {
		/* ioprio_get(IOPRIO_WHO_PGRP, 0); */
		syscall(SYS_ioprio_get, 2, 0);
	}

	return 0;
}

This gets us KASAN dumps like this:

[   35.526914] ==================================================================
[   35.530009] BUG: KASAN: out-of-bounds in get_task_ioprio+0x7b/0x90 at addr ffff880066f34e6c
[   35.530009] Read of size 2 by task ioprio-gpf/363
[   35.530009] =============================================================================
[   35.530009] BUG blkdev_ioc (Not tainted): kasan: bad access detected
[   35.530009] -----------------------------------------------------------------------------

[   35.530009] Disabling lock debugging due to kernel taint
[   35.530009] INFO: Allocated in create_task_io_context+0x2b/0x370 age=0 cpu=0 pid=360
[   35.530009] 	___slab_alloc+0x55d/0x5a0
[   35.530009] 	__slab_alloc.isra.20+0x2b/0x40
[   35.530009] 	kmem_cache_alloc_node+0x84/0x200
[   35.530009] 	create_task_io_context+0x2b/0x370
[   35.530009] 	get_task_io_context+0x92/0xb0
[   35.530009] 	copy_process.part.8+0x5029/0x5660
[   35.530009] 	_do_fork+0x155/0x7e0
[   35.530009] 	SyS_clone+0x19/0x20
[   35.530009] 	do_syscall_64+0x195/0x3a0
[   35.530009] 	return_from_SYSCALL_64+0x0/0x6a
[   35.530009] INFO: Freed in put_io_context+0xe7/0x120 age=0 cpu=0 pid=1060
[   35.530009] 	__slab_free+0x27b/0x3d0
[   35.530009] 	kmem_cache_free+0x1fb/0x220
[   35.530009] 	put_io_context+0xe7/0x120
[   35.530009] 	put_io_context_active+0x238/0x380
[   35.530009] 	exit_io_context+0x66/0x80
[   35.530009] 	do_exit+0x158e/0x2b90
[   35.530009] 	do_group_exit+0xe5/0x2b0
[   35.530009] 	SyS_exit_group+0x1d/0x20
[   35.530009] 	entry_SYSCALL_64_fastpath+0x1a/0xa4
[   35.530009] INFO: Slab 0xffffea00019bcd00 objects=20 used=4 fp=0xffff880066f34ff0 flags=0x1fffe0000004080
[   35.530009] INFO: Object 0xffff880066f34e58 @offset=3672 fp=0x0000000000000001
[   35.530009] ==================================================================

Fix it by grabbing the task lock while we poke at the io_context.

Change-Id: I4261aaf076fab943a80a45b0a77e023aa4ecbbd8
Cc: stable@vger.kernel.org
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-11 13:35:57 +11:00
Nick Desaulniers
c3f8d15467 fs: ext4: disable support for fallocate FALLOC_FL_PUNCH_HOLE
Bug: 28760453
Change-Id: I019c2de559db9e4b95860ab852211b456d78c4ca
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
2016-10-31 23:29:10 +11:00
Eric W. Biederman
61f1c9ad34 mnt: Fail collect_mounts when applied to unmounted mounts
The only users of collect_mounts are in audit_tree.c

In audit_trim_trees and audit_add_tree_rule the path passed into
collect_mounts is generated from kern_path passed an audit_tree
pathname which is guaranteed to be an absolute path.   In those cases
collect_mounts is obviously intended to work on mounted paths and
if a race results in paths that are unmounted when collect_mounts
it is reasonable to fail early.

The paths passed into audit_tag_tree don't have the absolute path
check.  But are used to play with fsnotify and otherwise interact with
the audit_trees, so again operating only on mounted paths appears
reasonable.

Avoid having to worry about what happens when we try and audit
unmounted filesystems by restricting collect_mounts to mounts
that appear in the mount tree.

Change-Id: I2edfee6d6951a2179ce8f53785b65ddb1eb95629
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-10-31 22:51:16 +11:00
Thomas Gleixner
34ba3c370d kthread: Prevent unpark race which puts threads on the wrong cpu
The smpboot threads rely on the park/unpark mechanism which binds per
cpu threads on a particular core. Though the functionality is racy:

CPU0	       	 	CPU1  	     	    CPU2
unpark(T)				    wake_up_process(T)
  clear(SHOULD_PARK)	T runs
			leave parkme() due to !SHOULD_PARK
  bind_to(CPU2)		BUG_ON(wrong CPU)

We cannot let the tasks move themself to the target CPU as one of
those tasks is actually the migration thread itself, which requires
that it starts running on the target cpu right away.

The solution to this problem is to prevent wakeups in park mode which
are not from unpark(). That way we can guarantee that the association
of the task to the target cpu is working correctly.

Add a new task state (TASK_PARKED) which prevents other wakeups and
use this state explicitly for the unpark wakeup.

Peter noticed: Also, since the task state is visible to userspace and
all the parked tasks are still in the PID space, its a good hint in ps
and friends that these tasks aren't really there for the moment.

The migration thread has another related issue.

CPU0	      	     	 CPU1
Bring up CPU2
create_thread(T)
park(T)
 wait_for_completion()
			 parkme()
			 complete()
sched_set_stop_task()
			 schedule(TASK_PARKED)

The sched_set_stop_task() call is issued while the task is on the
runqueue of CPU1 and that confuses the hell out of the stop_task class
on that cpu. So we need the same synchronizaion before
sched_set_stop_task().

Change-Id: I9ad6fbe65992ad5b5cb9a252470a56ec51a4ff4f
Reported-by: Dave Jones <davej@redhat.com>
Reported-and-tested-by: Dave Hansen <dave@sr71.net>
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Acked-by: Peter Ziljstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: dhillf@gmail.com
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-29 23:12:39 +08:00
Jaegeuk Kim
e787ff9965 f2fs: set fsync mark only for the last dnode
In order to give atomic writes, we should consider power failure during
sync_node_pages in fsync.
So, this patch marks fsync flag only in the last dnode block.

Change-Id: Ib44a91bf820f6631fe359a8ac430ede77ceda403
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:38 +08:00
Jaegeuk Kim
69baed249c f2fs: report unwritten status in fsync_node_pages
The fsync_node_pages should return pass or failure so that user could know
fsync is completed or not.

Change-Id: I3d588c44ad7452e66d3d6a795f2060de75fd5d0f
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:38 +08:00
Jaegeuk Kim
2d6e13b5bb f2fs: flush dirty pages before starting atomic writes
If somebody wrote some data before atomic writes, we should flush them in order
to handle atomic data in a right period.

Change-Id: I35611d9016330ef837554cff263bcbb10b4cc810
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:38 +08:00
Jaegeuk Kim
2e84eaaee8 f2fs: unset atomic/volatile flag in f2fs_release_file
The atomic/volatile operation should be done in pair of start and commit
ioctl.
For example, if a killed process remains open-ended atomic operation, we should
drop its flag as well as its atomic data. Otherwise, if sqlite initiates another
operation which doesn't require atomic writes, it will lose every data, since
f2fs still treats with them as atomic writes; nobody will trigger its commit.

Change-Id: Ic97f7d88a1158e2f21f4bd5447870ff578641fb3
Reported-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:38 +08:00
Jaegeuk Kim
1282b71690 f2fs: fix dropping inmemory pages in a wrong time
When one reader closes its file while the other writer is doing atomic writes,
f2fs_release_file drops atomic data resulting in an empty commit.
This patch fixes this wrong commit problem by checking openess of the file.

 Process0                       Process1
 				open file
 start atomic write
 write data
 read data
				close file
				f2fs_release_file()
				clear atomic data
 commit atomic write

Change-Id: I99b90b569a56cb53bccf8758f870e0f49849c6fd
Reported-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:38 +08:00
Jaegeuk Kim
4ab3c246ba f2fs: split sync_node_pages with fsync_node_pages
This patch splits the existing sync_node_pages into (f)sync_node_pages.
The fsync_node_pages is used for f2fs_sync_file only.

Change-Id: I207b087a54f1a0c2e994a78cd6ed475578d7044e
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:38 +08:00
Jaegeuk Kim
3a5683358f f2fs: avoid writing 0'th page in volatile writes
The first page of volatile writes usually contains a sort of header information
which will be used for recovery.
(e.g., journal header of sqlite)

If this is written without other journal data, user needs to handle the stale
journal information.

Change-Id: I85f4cfe4cbef32ed43b0f52d7328b42d411dd2da
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:38 +08:00
Jaegeuk Kim
7c65b74491 f2fs: avoid needless lock for node pages when fsyncing a file
When fsync is called, sync_node_pages finds a proper direct node pages to flush.
But, it locks unrelated direct node pages together unnecessarily.

Change-Id: I6adc83f2e6592aea707851ee6e365afcc0e36f92
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:37 +08:00
Chao Yu
36490001bb f2fs: fix deadlock when flush inline data
Below backtrace info was reported by Yunlei He:

Call Trace:
 [<ffffffff817a9395>] schedule+0x35/0x80
 [<ffffffff817abb7d>] rwsem_down_read_failed+0xed/0x130
 [<ffffffff813c12a8>] call_rwsem_down_read_failed+0x18/0x
 [<ffffffff817ab1d0>] down_read+0x20/0x30
 [<ffffffffa02a1a12>] f2fs_evict_inode+0x242/0x3a0 [f2fs]
 [<ffffffff81217057>] evict+0xc7/0x1a0
 [<ffffffff81217cd6>] iput+0x196/0x200
 [<ffffffff812134f9>] __dentry_kill+0x179/0x1e0
 [<ffffffff812136f9>] dput+0x199/0x1f0
 [<ffffffff811fe77b>] __fput+0x18b/0x220
 [<ffffffff811fe84e>] ____fput+0xe/0x10
 [<ffffffff81097427>] task_work_run+0x77/0x90
 [<ffffffff81074d62>] exit_to_usermode_loop+0x73/0xa2
 [<ffffffff81003b7a>] do_syscall_64+0xfa/0x110
 [<ffffffff817acf65>] entry_SYSCALL64_slow_path+0x25/0x25

Call Trace:
 [<ffffffff817a9395>] schedule+0x35/0x80
 [<ffffffff81216dc3>] __wait_on_freeing_inode+0xa3/0xd0
 [<ffffffff810bc300>] ? autoremove_wake_function+0x40/0x4
 [<ffffffff8121771d>] find_inode_fast+0x7d/0xb0
 [<ffffffff8121794a>] ilookup+0x6a/0xd0
 [<ffffffffa02bc740>] sync_node_pages+0x210/0x650 [f2fs]
 [<ffffffff8122e690>] ? do_fsync+0x70/0x70
 [<ffffffffa02b085e>] block_operations+0x9e/0xf0 [f2fs]
 [<ffffffff8137b795>] ? bio_endio+0x55/0x60
 [<ffffffffa02b0942>] write_checkpoint+0x92/0xba0 [f2fs]
 [<ffffffff8117da57>] ? mempool_free_slab+0x17/0x20
 [<ffffffff8117de8b>] ? mempool_free+0x2b/0x80
 [<ffffffff8122e690>] ? do_fsync+0x70/0x70
 [<ffffffffa02a53e3>] f2fs_sync_fs+0x63/0xd0 [f2fs]
 [<ffffffff8129630f>] ? ext4_sync_fs+0xbf/0x190
 [<ffffffff8122e6b0>] sync_fs_one_sb+0x20/0x30
 [<ffffffff812002e9>] iterate_supers+0xb9/0x110
 [<ffffffff8122e7b5>] sys_sync+0x55/0x90
 [<ffffffff81003ae9>] do_syscall_64+0x69/0x110
 [<ffffffff817acf65>] entry_SYSCALL64_slow_path+0x25/0x25

With following excuting serials, we will set inline_node in inode page
after inode was unlinked, result in a deadloop described as below:
1. open file
2. write file
3. unlink file
4. write file
5. close file

Thread A				Thread B
 - dput
  - iput_final
   - inode->i_state |= I_FREEING
   - evict
    - f2fs_evict_inode
					 - f2fs_sync_fs
					  - write_checkpoint
					   - block_operations
					    - f2fs_lock_all (down_write(cp_rwsem))
     - f2fs_lock_op (down_read(cp_rwsem))
					    - sync_node_pages
					     - ilookup
					      - find_inode_fast
					       - __wait_on_freeing_inode
					         (wait on I_FREEING clear)

Here, we change to set inline_node flag only for linked inode for fixing.

Change-Id: Ibf4326ecb4ba68e45e4e964092e1d2955341bc56
Reported-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Tested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: stable@vger.kernel.org # v4.6
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:37 +08:00
Chao Yu
2f63209408 f2fs: fix to update dirty page count correctly
Once we failed to merge inline data into inode page during flushing inline
inode, we will skip invoking inode_dec_dirty_pages, which makes dirty page
count incorrect, result in panic in ->evict_inode, Fix it.

------------[ cut here ]------------
kernel BUG at /home/yuchao/git/devf2fs/inode.c:336!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 3 PID: 10004 Comm: umount Tainted: G           O    4.6.0-rc5+ #17
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
task: f0c33000 ti: c5212000 task.ti: c5212000
EIP: 0060:[<f89aacb5>] EFLAGS: 00010202 CPU: 3
EIP is at f2fs_evict_inode+0x85/0x490 [f2fs]
EAX: 00000001 EBX: c4529ea0 ECX: 00000001 EDX: 00000000
ESI: c0131000 EDI: f89dd0a0 EBP: c5213e9c ESP: c5213e78
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 80050033 CR2: b75878c0 CR3: 1a36a700 CR4: 000406f0
Stack:
 c4529ea0 c4529ef4 c5213e8c c176d45c c4529ef4 00000000 c4529ea0 c4529fac
 f89dd0a0 c5213eb0 c1204a68 c5213ed8 c452a2b4 c6680930 c5213ec0 c1204b64
 c6680d44 c6680620 c5213eec c120588d ee84b000 ee84b5c0 c5214000 ee84b5e0
Call Trace:
 [<c176d45c>] ? _raw_spin_unlock+0x2c/0x50
 [<c1204a68>] evict+0xa8/0x170
 [<c1204b64>] dispose_list+0x34/0x50
 [<c120588d>] evict_inodes+0x10d/0x130
 [<c11ea941>] generic_shutdown_super+0x41/0xe0
 [<c1185190>] ? unregister_shrinker+0x40/0x50
 [<c1185190>] ? unregister_shrinker+0x40/0x50
 [<c11eac52>] kill_block_super+0x22/0x70
 [<f89af23e>] kill_f2fs_super+0x1e/0x20 [f2fs]
 [<c11eae1d>] deactivate_locked_super+0x3d/0x70
 [<c11eb383>] deactivate_super+0x43/0x60
 [<c1208ec9>] cleanup_mnt+0x39/0x80
 [<c1208f50>] __cleanup_mnt+0x10/0x20
 [<c107d091>] task_work_run+0x71/0x90
 [<c105725a>] exit_to_usermode_loop+0x72/0x9e
 [<c1001c7c>] do_fast_syscall_32+0x19c/0x1c0
 [<c176dd48>] sysenter_past_esp+0x45/0x74
EIP: [<f89aacb5>] f2fs_evict_inode+0x85/0x490 [f2fs] SS:ESP 0068:c5213e78
---[ end trace d30536330b7fdc58 ]---

Change-Id: I68907a13e6ac726e54f5c2bbe219bc2c8400a558
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:37 +08:00
Jaegeuk Kim
0222e52820 ext4/fscrypto: avoid RCU lookup in d_revalidate
As Al pointed, d_revalidate should return RCU lookup before using d_inode.
This was originally introduced by:
commit 34286d6662 ("fs: rcu-walk aware d_revalidate method").

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: stable <stable@vger.kernel.org>

 Conflicts:
	fs/ext4/crypto.c
2016-10-29 23:12:37 +08:00
Jaegeuk Kim
4f963da44b fscrypto: don't let data integrity writebacks fail with ENOMEM
This patch fixes the issue introduced by the ext4 crypto fix in a same manner.
For F2FS, however, we flush the pending IOs and wait for a while to acquire free
memory.

Fixes: c9af28fdd4492 ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM")
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

 Conflicts:
	fs/crypto/crypto.c
2016-10-29 23:12:37 +08:00
Jaegeuk Kim
92d54d00b5 f2fs: use dget_parent and file_dentry in f2fs_file_open
This patch synced with the below two ext4 crypto fixes together.

In 4.6-rc1, f2fs newly introduced accessing f_path.dentry which crashes
overlayfs. To fix, now we need to use file_dentry() to access that field.

[Backport NOTE]
 - Over 4.2, it should use file_dentry

Fixes: c0a37d487884 ("ext4: use file_dentry()")
Fixes: 9dd78d8c9a7b ("ext4: use dget_parent() in ext4_file_open()")
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:36 +08:00
Jaegeuk Kim
c5c8bb5165 fscrypto: use dget_parent() in fscrypt_d_revalidate()
This patch updates fscrypto along with the below ext4 crypto change.

Fixes: 3d43bcfef5f0 ("ext4 crypto: use dget_parent() in ext4_d_revalidate()")
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

 Conflicts:
	fs/crypto/crypto.c
2016-10-29 23:12:36 +08:00
Shuoran Liu
5e468690df f2fs: retrieve IO write stat from the right place
In the following patch,

    f2fs: split journal cache from curseg cache

journal cache is split from curseg cache. So IO write statistics should be
retrived from journal cache but not curseg->sum_blk. Otherwise, it will
get 0, and the stat is lost.

Signed-off-by: Shuoran Liu <liushuoran@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:36 +08:00
Jaegeuk Kim
df869c0814 f2fs crypto: fix corrupted symlink in encrypted case
In the encrypted symlink case, we should check its corrupted symname after
decrypting it.
Otherwise, we can report -ENOENT incorrectly, if encrypted symname starts with
'\0'.

Cc: stable 4.5+ <stable@vger.kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:36 +08:00
Jaegeuk Kim
b54ac0bf29 f2fs: cover large section in sanity check of super
This patch fixes the bug which does not cover a large section case when checking
the sanity of superblock.
If f2fs detects misalignment, it will fix the superblock during the mount time,
so it doesn't need to trigger fsck.f2fs further.

Reported-by: Matthias Prager <linux@matthiasprager.de>
Reported-by: David Gnedt <david.gnedt@davizone.at>
Cc: stable 4.5+ <stable@vger.kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:36 +08:00
Linus Torvalds
1db29fa179 f2fs/crypto: fix xts_tweak initialization
Commit 0b81d07790726 ("fs crypto: move per-file encryption from f2fs
tree to fs/crypto") moved the f2fs crypto files to fs/crypto/ and
renamed the symbol prefixes from "f2fs_" to "fscrypt_" (and from "F2FS_"
to just "FS" for preprocessor symbols).

Because of the symbol renaming, it's a bit hard to see it as a file
move: use

    git show -M30 0b81d07790726

to lower the rename detection to just 30% similarity and make git show
the files as renamed (the header file won't be shown as a rename even
then - since all it contains is symbol definitions, it looks almost
completely different).

Even with the renames showing as renames, the diffs are not all that
easy to read, since so much is just the renames.  But Eric Biggers
noticed that it's not just all renames: the initialization of the
xts_tweak had been broken too, using the inode number rather than the
page offset.

That's not right - it makes the xfs_tweak the same for all pages of each
inode.  It _might_ make sense to make the xfs_tweak contain both the
offset _and_ the inode number, but not just the inode number.

Reported-by: Eric Biggers <ebiggers3@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-29 23:12:36 +08:00
Jaegeuk Kim
841b7347fa f2fs: submit node page write bios when really required
If many threads calls fsync with data writes, we don't need to flush every
bios having node page writes.
The f2fs_wait_on_page_writeback will flush its bios when the page is really
needed.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:36 +08:00
Arnd Bergmann
f9b55fda32 f2fs: add missing argument to f2fs_setxattr stub
The f2fs_setxattr() prototype for CONFIG_F2FS_FS_XATTR=n has
been wrong for a long time, since 8ae8f1627f ("f2fs: support
xattr security labels"), but there have never been any callers,
so it did not matter.

Now, the function gets called from f2fs_ioc_keyctl(), which
causes a build failure:

fs/f2fs/file.c: In function 'f2fs_ioc_keyctl':
include/linux/stddef.h:7:14: error: passing argument 6 of 'f2fs_setxattr' makes integer from pointer without a cast [-Werror=int-conversion]
 #define NULL ((void *)0)
              ^
fs/f2fs/file.c:1599:27: note: in expansion of macro 'NULL'
     value, F2FS_KEY_SIZE, NULL, type);
                           ^
In file included from ../fs/f2fs/file.c:29:0:
fs/f2fs/xattr.h:129:19: note: expected 'int' but argument is of type 'void *'
 static inline int f2fs_setxattr(struct inode *inode, int index,
                   ^
fs/f2fs/file.c:1597:9: error: too many arguments to function 'f2fs_setxattr'
  return f2fs_setxattr(inode, F2FS_XATTR_INDEX_KEY,
         ^
In file included from ../fs/f2fs/file.c:29:0:
fs/f2fs/xattr.h:129:19: note: declared here
 static inline int f2fs_setxattr(struct inode *inode, int index,

Thsi changes the prototype of the empty stub function to match
that of the actual implementation. This will not make the key
management work when F2FS_FS_XATTR is disabled, but it gets it
to build at least.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:36 +08:00
Chao Yu
135b6ab7ff f2fs: fix to avoid unneeded unlock_new_inode
During ->lookup, I_NEW state of inode was been cleared in f2fs_iget,
so in error path, we don't need to clear it again.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:36 +08:00
Chao Yu
0487130e26 f2fs: clean up opened code with f2fs_update_dentry
Just clean up opened code with existing function, no logic change.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:36 +08:00
Jaegeuk Kim
1e100874fc f2fs: declare static functions
Just to avoid sparse warnings.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:36 +08:00
Fan Li
6be6530f86 f2fs: modify the readahead method in ra_node_page()
ra_node_page() is used to read ahead one node page. Comparing to regular
read, it's faster because it doesn't wait for IO completion.
But if it is called twice for reading the same block, and the IO request
from the first call hasn't been completed before the second call, the second
call will have to wait until the read is over.

Here use the code in __do_page_cache_readahead() to solve this problem.
It does nothing when someone else already puts the page in mapping. The
status of page should be assured by whoever puts it there.
This implement also prevents alteration of page reference count.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:36 +08:00
Jaegeuk Kim
6f86d87cf6 f2fs crypto: sync ext4_lookup and ext4_file_open
This patch tries to catch up with lookup and open policies in ext4.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:36 +08:00
Jaegeuk Kim
ea38719073 f2fs: define not-set fallocate flags
This fixes building bugs in aosp.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:35 +08:00
Jaegeuk Kim
47393901fb fs crypto: move per-file encryption from f2fs tree to fs/crypto
This patch adds the renamed functions moved from the f2fs crypto files.

[Backporting to 3.10]
 - Removed d_is_negative() in fscrypt_d_revalidate().

1. definitions for per-file encryption used by ext4 and f2fs.

2. crypto.c for encrypt/decrypt functions
 a. IO preparation:
  - fscrypt_get_ctx / fscrypt_release_ctx
 b. before IOs:
  - fscrypt_encrypt_page
  - fscrypt_decrypt_page
  - fscrypt_zeroout_range
 c. after IOs:
  - fscrypt_decrypt_bio_pages
  - fscrypt_pullback_bio_page
  - fscrypt_restore_control_page

3. policy.c supporting context management.
 a. For ioctls:
  - fscrypt_process_policy
  - fscrypt_get_policy
 b. For context permission
  - fscrypt_has_permitted_context
  - fscrypt_inherit_context

4. keyinfo.c to handle permissions
  - fscrypt_get_encryption_info
  - fscrypt_free_encryption_info

5. fname.c to support filename encryption
 a. general wrapper functions
  - fscrypt_fname_disk_to_usr
  - fscrypt_fname_usr_to_disk
  - fscrypt_setup_filename
  - fscrypt_free_filename

 b. specific filename handling functions
  - fscrypt_fname_alloc_buffer
  - fscrypt_fname_free_buffer

6. Makefile and Kconfig

Cc: Al Viro <viro@ftp.linux.org.uk>
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Uday Savagaonkar <savagaon@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

 Conflicts:
	fs/f2fs/f2fs.h
	fs/f2fs/inode.c
	fs/f2fs/super.c
	include/linux/fs.h

Change-Id: I162f3562aed66e7e377ec38714fae96c651fbdd7
2016-10-29 23:12:35 +08:00
Yang Shi
2b9dcf81e6 f2fs: mutex can't be used by down_write_nest_lock()
f2fs_lock_all() calls down_write_nest_lock() to acquire a rw_sem and check
a mutex, but down_write_nest_lock() is designed for two rw_sem accoring to the
comment in include/linux/rwsem.h. And, other than f2fs, it is just called in
mm/mmap.c with two rwsem.

So, it looks it is used wrongly by f2fs. And, it causes the below compile
warning on -rt kernel too.

In file included from fs/f2fs/xattr.c:25:0:
fs/f2fs/f2fs.h: In function 'f2fs_lock_all':
fs/f2fs/f2fs.h:962:34: warning: passing argument 2 of 'down_write_nest_lock' from incompatible pointer type [-Wincompatible-pointer-types]
  f2fs_down_write(&sbi->cp_rwsem, &sbi->cp_mutex);
                                  ^
fs/f2fs/f2fs.h:27:55: note: in definition of macro 'f2fs_down_write'
 #define f2fs_down_write(x, y) down_write_nest_lock(x, y)
                                                       ^
In file included from include/linux/rwsem.h:22:0,
                 from fs/f2fs/xattr.c:21:
include/linux/rwsem_rt.h:138:20: note: expected 'struct rw_semaphore *' but argument is of type 'struct mutex *'
 static inline void down_write_nest_lock(struct rw_semaphore *sem,

Signed-off-by: Yang Shi <yang.shi@linaro.org>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:35 +08:00
Liu Xue
55340ae5aa f2fs: recovery missing dot dentries in root directory
If f2fs was corrupted with missing dot dentries in root dirctory,
it needs to recover them after fsck.f2fs set F2FS_INLINE_DOTS flag
in directory inode when fsck.f2fs detects missing dot dentries.

Signed-off-by: Xue Liu <liuxueliu.liu@huawei.com>
Signed-off-by: Yong Sheng <shengyong1@huawei.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:35 +08:00
Willy Tarreau
dcaffd537f pipe: limit the per-user amount of pages allocated in pipes
On no-so-small systems, it is possible for a single process to cause an
OOM condition by filling large pipes with data that are never read. A
typical process filling 4000 pipes with 1 MB of data will use 4 GB of
memory. On small systems it may be tricky to set the pipe max size to
prevent this from happening.

This patch makes it possible to enforce a per-user soft limit above
which new pipes will be limited to a single page, effectively limiting
them to 4 kB each, as well as a hard limit above which no new pipes may
be created for this user. This has the effect of protecting the system
against memory abuse without hurting other users, and still allowing
pipes to work correctly though with less data at once.

The limit are controlled by two new sysctls : pipe-user-pages-soft, and
pipe-user-pages-hard. Both may be disabled by setting them to zero. The
default soft limit allows the default number of FDs per process (1024)
to create pipes of the default size (64kB), thus reaching a limit of 64MB
before starting to create only smaller pipes. With 256 processes limited
to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB =
1084 MB of memory allocated for a user. The hard limit is disabled by
default to avoid breaking existing applications that make intensive use
of pipes (eg: for splicing).

Reported-by: socketpair@gmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Mitigates: CVE-2013-4312 (Linux 2.0+)
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Conflicts:
	Documentation/sysctl/fs.txt
	fs/pipe.c
	include/linux/sched.h

Change-Id: Ic7c678af18129943e16715fdaa64a97a7f0854be
2016-10-29 23:12:35 +08:00
Theodore Ts'o
2da70f4e34 ext4: avoid hang when mounting non-journal filesystems with orphan list
When trying to mount a file system which does not contain a journal,
but which does have a orphan list containing an inode which needs to
be truncated, the mount call with hang forever in
ext4_orphan_cleanup() because ext4_orphan_del() will return
immediately without removing the inode from the orphan list, leading
to an uninterruptible loop in kernel code which will busy out one of
the CPU's on the system.

This can be trivially reproduced by trying to mount the file system
found in tests/f_orphan_extents_inode/image.gz from the e2fsprogs
source tree.  If a malicious user were to put this on a USB stick, and
mount it on a Linux desktop which has automatic mounts enabled, this
could be considered a potential denial of service attack.  (Not a big
deal in practice, but professional paranoids worry about such things,
and have even been known to allocate CVE numbers for such problems.)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: stable@vger.kernel.org
2016-10-29 23:12:34 +08:00
Anatol Pomozov
012f862b13 ext4: make orphan functions be no-op in no-journal mode
Instead of checking whether the handle is valid, we check if journal
is enabled. This avoids taking the s_orphan_lock mutex in all cases
when there is no journal in use, including the error paths where
ext4_orphan_del() is called with a handle set to NULL.

Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2016-10-29 23:12:34 +08:00
Chao Yu
b86bad0997 f2fs: fix to avoid deadlock when merging inline data
When testing with fsstress, kworker and user threads were both blocked:

INFO: task kworker/u16:1:16580 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u16:1   D ffff8803f2595390     0 16580      2 0x00000000
Workqueue: writeback bdi_writeback_workfn (flush-251:0)
 ffff8802730e5760 0000000000000046 ffff880274729fc0 0000000000012440
 ffff8802730e5fd8 ffff8802730e4010 0000000000012440 0000000000012440
 ffff8802730e5fd8 0000000000012440 ffff880274729fc0 ffff88026eb50000
Call Trace:
 [<ffffffff816fe9d9>] schedule+0x29/0x70
 [<ffffffff816ff895>] rwsem_down_read_failed+0xa5/0xf9
 [<ffffffff81378584>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffffa0694feb>] f2fs_write_data_page+0x31b/0x420 [f2fs]
 [<ffffffffa0690f1a>] __f2fs_writepage+0x1a/0x50 [f2fs]
 [<ffffffffa06922a0>] f2fs_write_data_pages+0xe0/0x290 [f2fs]
 [<ffffffff811473b3>] do_writepages+0x23/0x40
 [<ffffffff811cc3ee>] __writeback_single_inode+0x4e/0x250
 [<ffffffff811cd4f1>] writeback_sb_inodes+0x2c1/0x470
 [<ffffffff811cd73e>] __writeback_inodes_wb+0x9e/0xd0
 [<ffffffff811cda0b>] wb_writeback+0x1fb/0x2d0
 [<ffffffff811cdb7c>] wb_do_writeback+0x9c/0x220
 [<ffffffff811ce232>] bdi_writeback_workfn+0x72/0x1c0
 [<ffffffff8106b74e>] process_one_work+0x1de/0x5b0
 [<ffffffff8106e78f>] worker_thread+0x11f/0x3e0
 [<ffffffff810750ce>] kthread+0xde/0xf0
 [<ffffffff817093f8>] ret_from_fork+0x58/0x90

fsstress thread stack:
 [<ffffffff81139f0e>] sleep_on_page+0xe/0x20
 [<ffffffff81139ef7>] __lock_page+0x67/0x70
 [<ffffffff8113b100>] find_lock_page+0x50/0x80
 [<ffffffff8113b24f>] find_or_create_page+0x3f/0xb0
 [<ffffffffa06983a9>] sync_node_pages+0x259/0x810 [f2fs]
 [<ffffffffa068d874>] write_checkpoint+0x1a4/0xce0 [f2fs]
 [<ffffffffa0686b0c>] f2fs_sync_fs+0x7c/0xd0 [f2fs]
 [<ffffffffa067c813>] f2fs_sync_file+0x143/0x5f0 [f2fs]
 [<ffffffff811d301b>] vfs_fsync_range+0x2b/0x40
 [<ffffffff811d304c>] vfs_fsync+0x1c/0x20
 [<ffffffff811d3291>] do_fsync+0x41/0x70
 [<ffffffff811d32d3>] SyS_fdatasync+0x13/0x20
 [<ffffffff817094a2>] system_call_fastpath+0x16/0x1b
 [<ffffffffffffffff>] 0xffffffffffffffff

The reason of this issue is:
CPU0:					CPU1:
 - f2fs_write_data_pages
					 - f2fs_sync_fs
					  - write_checkpoint
					   - block_operations
					    - f2fs_lock_all
					     - down_write(sbi->cp_rwsem)
  - lock_page(page)
  - f2fs_write_data_page
					    - sync_node_pages
					     - flush_inline_data
					      - pagecache_get_page(page, GFP_LOCK)
   - f2fs_lock_op
    - down_read(sbi->cp_rwsem)

This patch alters to use trylock_page in flush_inline_data to fix this ABBA
deadlock issue.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:32 +08:00
Chao Yu
09f4658771 f2fs: introduce f2fs_flush_merged_bios for cleanup
Add a new helper f2fs_flush_merged_bios to clean up redundant codes.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:32 +08:00
Chao Yu
cc7e1e5fe8 f2fs: introduce f2fs_update_data_blkaddr for cleanup
Add a new help f2fs_update_data_blkaddr to clean up redundant codes.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:32 +08:00
Chao Yu
e90d738a5e f2fs crypto: fix incorrect positioning for GCing encrypted data page
For now, flow of GCing an encrypted data page:
1) try to grab meta page in meta inode's mapping with index of old block
address of that data page
2) load data of ciphertext into meta page
3) allocate new block address
4) write the meta page into new block address
5) update block address pointer in direct node page.

Other reader/writer will use f2fs_wait_on_encrypted_page_writeback to
check and wait on GCed encrypted data cached in meta page writebacked
in order to avoid inconsistence among data page cache, meta page cache
and data on-disk when updating.

However, we will use new block address updated in step 5) as an index to
lookup meta page in inner bio buffer. That would be wrong, and we will
never find the GCing meta page, since we use the old block address as
index of that page in step 1).

This patch fixes the issue by adjust the order of step 1) and step 3),
and in step 1) grab page with index generated in step 3).

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

 Conflicts:
	fs/f2fs/gc.c
2016-10-29 23:12:32 +08:00
Chao Yu
b22711753f f2fs: fix incorrect upper bound when iterating inode mapping tree
1. Inode mapping tree can index page in range of [0, ULONG_MAX], however,
in some places, f2fs only search or iterate page in ragne of [0, LONG_MAX],
result in miss hitting in page cache.

2. filemap_fdatawait_range accepts range parameters in unit of bytes, so
the max range it covers should be [0, LLONG_MAX], if we use [0, LONG_MAX]
as range for waiting on writeback, big number of pages will not be covered.

This patch corrects above two issues.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:32 +08:00
Yunlei He
76fa243e45 f2fs: avoid hungtask problem caused by losing wake_up
The D state of wait_on_all_pages_writeback should be waken by
function f2fs_write_end_io when all writeback pages have been
succesfully written to device. It's possible that wake_up comes
between get_pages and io_schedule. Maybe in this case it will
lost wake_up and still in D state even if all pages have been
write back to device, and finally, the whole system will be into
the hungtask state.

                if (!get_pages(sbi, F2FS_WRITEBACK))
                         break;
					<---------  wake_up
                io_schedule();

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Biao He <hebiao6@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:32 +08:00
Chao Yu
d37aaf04b7 f2fs: trace old block address for CoWed page
This patch enables to trace old block address of CoWed page for better
debugging.

f2fs_submit_page_mbio: dev = (1,0), ino = 1, page_index = 0x1d4f0, oldaddr = 0xfe8ab, newaddr = 0xfee90 rw = WRITE_SYNC, type = NODE
f2fs_submit_page_mbio: dev = (1,0), ino = 1, page_index = 0x1d4f8, oldaddr = 0xfe8b0, newaddr = 0xfee91 rw = WRITE_SYNC, type = NODE
f2fs_submit_page_mbio: dev = (1,0), ino = 1, page_index = 0x1d4fa, oldaddr = 0xfe8ae, newaddr = 0xfee92 rw = WRITE_SYNC, type = NODE

f2fs_submit_page_mbio: dev = (1,0), ino = 134824, page_index = 0x96, oldaddr = 0xf049b, newaddr = 0x2bbe rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 134824, page_index = 0x97, oldaddr = 0xf049c, newaddr = 0x2bbf rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 134824, page_index = 0x98, oldaddr = 0xf049d, newaddr = 0x2bc0 rw = WRITE, type = DATA

f2fs_submit_page_mbio: dev = (1,0), ino = 135260, page_index = 0x47, oldaddr = 0xffffffff, newaddr = 0xf2631 rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 135260, page_index = 0x48, oldaddr = 0xffffffff, newaddr = 0xf2632 rw = WRITE, type = DATA
f2fs_submit_page_mbio: dev = (1,0), ino = 135260, page_index = 0x49, oldaddr = 0xffffffff, newaddr = 0xf2633 rw = WRITE, type = DATA

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:32 +08:00
Chao Yu
3ccd1c10f5 f2fs: try to flush inode after merging inline data
When flushing node pages, if current node page is an inline inode page, we
will try to merge inline data from data page into inline inode page, then
skip flushing current node page, it will decrease the number of nodes to
be flushed in batch in this round, which may lead to worse performance.

This patch gives a chance to flush just merged inline inode pages for
performance.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:32 +08:00
Chao Yu
7ad7e4c310 f2fs: show more info about superblock recovery
This patch changes to show more info in message log about the recovery
of the corrupted superblock during ->mount, e.g. the index of corrupted
superblock and the result of recovery.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:32 +08:00
Chao Yu
5c16430b87 f2fs: fix the wrong stat count of calling gc
With a partition which was formated as multi segments in one section,
we stated incorrectly for count of gc operation.

e.g., for a partition with segs_per_sec = 4

cat /sys/kernel/debug/f2fs/status

GC calls: 208 (BG: 7)
  - data segments : 104 (52)
  - node segments : 104 (24)

GC called count should be (104 (data segs) + 104 (node segs)) / 4 = 52,
rather than 208. Fix it.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:32 +08:00
Jaegeuk Kim
60f5accac7 f2fs: remain last victim segment number ascending order
This patch avoids to remain inefficient victim segment number selected by
a victim.

For example, if all the dirty segments has same valid blocks, we can get
the victim segments descending order due to keeping wrong last segment number.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:32 +08:00
Shawn Lin
e8d08bb60c f2fs: reuse read_inline_data for f2fs_convert_inline_page
f2fs_convert_inline_page introduce what read_inline_data
already does for copying out the inline data from inode_page.
We can use read_inline_data instead to simplify the code.

Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:32 +08:00
Chao Yu
dd44751411 f2fs: fix to delete old dirent in converted inline directory in ->rename
When doing test with fstests/generic/068 in inline_dentry enabled f2fs,
following oops dmesg will be reported:

 ------------[ cut here ]------------
 WARNING: CPU: 5 PID: 11841 at fs/inode.c:273 drop_nlink+0x49/0x50()
 Modules linked in: f2fs(O) ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state
 CPU: 5 PID: 11841 Comm: fsstress Tainted: G           O    4.5.0-rc1 #45
 Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.61 05/16/2013
  0000000000000111 ffff88009cdf7ae8 ffffffff813e5944 0000000000002e41
  0000000000000000 0000000000000111 0000000000000000 ffff88009cdf7b28
  ffffffff8106a587 ffff88009cdf7b58 ffff8804078fe180 ffff880374a64e00
 Call Trace:
  [<ffffffff813e5944>] dump_stack+0x48/0x64
  [<ffffffff8106a587>] warn_slowpath_common+0x97/0xe0
  [<ffffffff8106a5ea>] warn_slowpath_null+0x1a/0x20
  [<ffffffff81231039>] drop_nlink+0x49/0x50
  [<ffffffffa07b95b4>] f2fs_rename2+0xe04/0x10c0 [f2fs]
  [<ffffffff81231ff1>] ? lock_two_nondirectories+0x81/0x90
  [<ffffffff813f454d>] ? lockref_get+0x1d/0x30
  [<ffffffff81220f70>] vfs_rename+0x2e0/0x640
  [<ffffffff8121f9db>] ? lookup_dcache+0x3b/0xd0
  [<ffffffff810b8e41>] ? update_fast_ctr+0x21/0x40
  [<ffffffff8134ff12>] ? security_path_rename+0xa2/0xd0
  [<ffffffff81224af6>] SYSC_renameat2+0x4b6/0x540
  [<ffffffff810ba8ed>] ? trace_hardirqs_off+0xd/0x10
  [<ffffffff810022ba>] ? exit_to_usermode_loop+0x7a/0xd0
  [<ffffffff817e0ade>] ? int_ret_from_sys_call+0x52/0x9f
  [<ffffffff810bdc90>] ? trace_hardirqs_on_caller+0x100/0x1c0
  [<ffffffff81224b8e>] SyS_renameat2+0xe/0x10
  [<ffffffff8121f08e>] SyS_rename+0x1e/0x20
  [<ffffffff817e0957>] entry_SYSCALL_64_fastpath+0x12/0x6f
 ---[ end trace 2b31e17995404e42 ]---

This is because: in the same inline directory, when we renaming one file
from source name to target name which is not existed, once space of inline
dentry is not enough, inline conversion will be triggered, after that all
data in inline dentry will be moved to normal dentry page.

After attaching the new entry in coverted dentry page, still we try to
remove old entry in original inline dentry, since old entry has been
moved, so it obviously doesn't make any effect, result in remaining old
entry in converted dentry page.

Now, we have two valid dentries pointed to the same inode which has nlink
value of 1, deleting them both, above warning appears.

This issue can be reproduced easily as below steps:
1. mount f2fs with inline_dentry option
2. mkdir dir
3. touch 180 files named [001-180] in dir
4. rename dir/180 dir/181
5. rm dir/180 dir/181

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:31 +08:00
Chao Yu
7af1656411 f2fs: detect error of update_dent_inode in ->rename
Should check and show correct return value of update_dent_inode in
->rename.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:31 +08:00
Shawn Lin
a03b195403 f2fs: move sanity checking of cp into get_valid_checkpoint
>From the function name of get_valid_checkpoint, it seems to return
the valid cp or NULL for caller to check. If no valid one is found,
f2fs_fill_super will print the err log. But if get_valid_checkpoint
get one valid(the return value indicate that it's valid, however actually
it is invalid after sanity checking), then print another similar err
log. That seems strange. Let's keep sanity checking inside the procedure
of geting valid cp. Another improvement we gained from this move is
that even the large volume is supported, we check the cp in advanced
to skip the following procedure if failing the sanity checking.

Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:31 +08:00
Shawn Lin
1d4eb788ac f2fs: slightly reorganize read_raw_super_block
read_raw_super_block was introduced to help find the
first valid superblock. Commit da554e48caab ("f2fs:
recovering broken superblock during mount") changed the
behaviour to read both of them and check whether need
the recovery flag or not. So the comment before this
function isn't consistent with what it actually does.
Also, the origin code use two tags to round the err
cases, which isn't so readable. So this patch amend
the comment and slightly reorganize it.

Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:31 +08:00
Chao Yu
33336cb167 f2fs: reorder nat cache lock in cache_nat_entry
When lookuping nat entry in cache_nat_entry, if we fail to hit nat cache,
we try to load nat entries a) from journal of current segment cache or b)
from NAT pages for updating, during the process, write lock of
nat_tree_lock will be held to avoid inconsistent condition in between
nid cache and nat cache caused by racing among nat entry shrinker,
checkpointer, nat entry updater.

But this way may cause low efficient when updating nat cache, because it
serializes accessing in journal cache or reading NAT pages.

Here, we reorder lock and update flow as below to enhance accessing
concurrency:

 - get_node_info
  - down_read(nat_tree_lock)
  - lookup nat cache --- hit -> unlock & return
  - lookup journal cache --- hit -> unlock & goto update
  - up_read(nat_tree_lock)
update:
  - down_write(nat_tree_lock)
  - cache_nat_entry
   - lookup nat cache --- nohit -> update
  - up_write(nat_tree_lock)

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:31 +08:00
Chao Yu
626dcbd8f5 f2fs: split journal cache from curseg cache
In curseg cache, f2fs caches two different parts:
 - datas of current summay block, i.e. summary entries, footer info.
 - journal info, i.e. sparse nat/sit entries or io stat info.

With this approach, 1) it may cause higher lock contention when we access
or update both of the parts of cache since we use the same mutex lock
curseg_mutex to protect the cache. 2) current summary block with last
journal info will be writebacked into device as a normal summary block
when flushing, however, we treat journal info as valid one only in current
summary, so most normal summary blocks contain junk journal data, it wastes
remaining space of summary block.

So, in order to fix above issues, we split curseg cache into two parts:
a) current summary block, protected by original mutex lock curseg_mutex
b) journal cache, protected by newly introduced r/w semaphore journal_rwsem

When loading curseg cache during ->mount, we store summary info and
journal info into different caches; When doing checkpoint, we combine
datas of two cache into current summary block for persisting.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:31 +08:00
Chao Yu
80f42a5919 f2fs: enhance IO path with block plug
Try to use block plug in more place as below to let process cache bios
as much as possbile, in order to reduce lock overhead of queue in IO
scheduler.
1) sync_meta_pages
2) ra_meta_pages
3) f2fs_balance_fs_bg

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:31 +08:00
Chao Yu
0f9485738f f2fs: introduce f2fs_journal struct to wrap journal info
Introduce a new structure f2fs_journal to wrap journal info in struct
f2fs_summary_block for readability.

struct f2fs_journal {
	union {
		__le16 n_nats;
		__le16 n_sits;
	};
	union {
		struct nat_journal nat_j;
		struct sit_journal sit_j;
		struct f2fs_extra_info info;
	};
} __packed;

struct f2fs_summary_block {
	struct f2fs_summary entries[ENTRIES_IN_SUM];
	struct f2fs_journal journal;
	struct summary_footer footer;
} __packed;

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:31 +08:00
Chao Yu
06f3d93e20 f2fs crypto: avoid unneeded memory allocation when {en/de}crypting symlink
This patch adopts f2fs with codes of ext4, it removes unneeded memory
allocation in creating/accessing path of symlink.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

 Conflicts:
	fs/f2fs/crypto_fname.c
2016-10-29 23:12:31 +08:00
Chao Yu
3b92fcd38b f2fs crypto: handle unexpected lack of encryption keys
This patch syncs f2fs with commit abdd438b26b4 ("ext4 crypto: handle
unexpected lack of encryption keys") from ext4.

Fix up attempts by users to try to write to a file when they don't
have access to the encryption key.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:31 +08:00
Chao Yu
a18a1be377 f2fs crypto: make sure the encryption info is initialized on opendir(2)
This patch syncs f2fs with commit 6bc445e0ff44 ("ext4 crypto: make
sure the encryption info is initialized on opendir(2)") from ext4.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:31 +08:00
Chao Yu
3d95525a4a f2fs: support revoking atomic written pages
f2fs support atomic write with following semantics:
1. open db file
2. ioctl start atomic write
3. (write db file) * n
4. ioctl commit atomic write
5. close db file

With this flow we can avoid file becoming corrupted when abnormal power
cut, because we hold data of transaction in referenced pages linked in
inmem_pages list of inode, but without setting them dirty, so these data
won't be persisted unless we commit them in step 4.

But we should still hold journal db file in memory by using volatile
write, because our semantics of 'atomic write support' is incomplete, in
step 4, we could fail to submit all dirty data of transaction, once
partial dirty data was committed in storage, then after a checkpoint &
abnormal power-cut, db file will be corrupted forever.

So this patch tries to improve atomic write flow by adding a revoking flow,
once inner error occurs in committing, this gives another chance to try to
revoke these partial submitted data of current transaction, it makes
committing operation more like aotmical one.

If we're not lucky, once revoking operation was failed, EAGAIN will be
reported to user for suggesting doing the recovery with held journal file,
or retrying current transaction again.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:31 +08:00
Chao Yu
6399d80f86 f2fs: split drop_inmem_pages from commit_inmem_pages
Split drop_inmem_pages from commit_inmem_pages for code readability,
and prepare for the following modification.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:31 +08:00
Jaegeuk Kim
de89056ac2 f2fs: avoid garbage lenghs in dentries
This patch fixes to eliminate garbage name lengths in dentries in order
to provide correct answers of readdir.

For example, if a valid dentry consists of:
 bitmap : 1   1 1 1
 len    : 32  0 x 0,

readdir can start with second bit_pos having len = 0.
Or, it can start with third bit_pos having garbage.

In both of cases, we should avoid to try filling dentries.
So, this patch not only removes any garbage length, but also avoid entering
zero length case in readdir.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:30 +08:00
Jaegeuk Kim
9be179b0be f2fs: use correct errno
This patch is to fix misused error number.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:30 +08:00
Jaegeuk Kim
f87ff6041f f2fs crypto: add missing locking for keyring_key access
This patch adopts:
	ext4 crypto: add missing locking for keyring_key access

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

 Conflicts:
	fs/f2fs/crypto_key.c
2016-10-29 23:12:30 +08:00
Jaegeuk Kim
84d0f07b55 f2fs crypto: check for too-short encrypted file names
This patch adopts:
	ext4 crypto: check for too-short encrypted file names

An encrypted file name should never be shorter than an 16 bytes, the
AES block size.  The 3.10 crypto layer will oops and crash the kernel
if ciphertext shorter than the block size is passed to it.

Fortunately, in modern kernels the crypto layer will not crash the
kernel in this scenario, but nevertheless, it represents a corrupted
directory, and we should detect it and mark the file system as
corrupted so that e2fsck can fix this.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:30 +08:00
Jaegeuk Kim
1513d8ceda f2fs crypto: f2fs_page_crypto() doesn't need a encryption context
This patch adopts:
	ext4 crypto: ext4_page_crypto() doesn't need a encryption context

Since ext4_page_crypto() doesn't need an encryption context (at least
not any more), this allows us to simplify a number function signature
and also allows us to avoid needing to allocate a context in
ext4_block_write_begin().  It also means we no longer need a separate
ext4_decrypt_one() function.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:30 +08:00
Jaegeuk Kim
d88d6ef8de f2fs crypto: fix spelling typo in comment
This patch adopts:
	ext4 crypto: fix spelling typo in comment

Signed-off-by: Laurent Navet <laurent.navet@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:30 +08:00
Jaegeuk Kim
a940126577 f2fs crypto: replace some BUG_ON()'s with error checks
This patch adopts:
	ext4 crypto: replace some BUG_ON()'s with error checks

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

 Conflicts:
	fs/f2fs/crypto_key.c
2016-10-29 23:12:30 +08:00
Jaegeuk Kim
309f065641 f2fs: increase i_size to avoid missing data
When finsert is doing with dirting pages, we should increase i_size right away.
Otherwise, the moved page is able to be dropped by the following
filemap_write_and_wait_range before updating i_size.
Especially, it can be done by
	if ((page->index >= end_index + 1) || !offset)
		goto out;
in f2fs_write_data_page.

This should resolve the below xfstests/091 failure reported by Dave.

$ diff -u tests/generic/091.out /home/dave/src/xfstests-dev/results//f2fs/generic/091.out.bad
--- tests/generic/091.out       2014-01-20 16:57:33.000000000 +1100
+++ /home/dave/src/xfstests-dev/results//f2fs/generic/091.out.bad       2016-02-08 15:21:02.701375087 +1100
@@ -1,7 +1,18 @@
 QA output created by 091
 fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W
+mapped writes DISABLED
+skipping insert range behind EOF
+skipping insert range behind EOF
+truncating to largest ever: 0x11e00
+dowrite: write: Invalid argument
+LOG DUMP (7 total operations):
+1(  1 mod 256): SKIPPED (no operation)
+2(  2 mod 256): SKIPPED (no operation)
+3(  3 mod 256): FALLOC   0x2e0f2 thru 0x3134a  (0x3258 bytes) PAST_EOF
+4(  4 mod 256): SKIPPED (no operation)
+5(  5 mod 256): SKIPPED (no operation)
+6(  6 mod 256): TRUNCATE UP    from 0x0 to 0x11e00
+7(  7 mod 256): WRITE    0x73400 thru 0x79fff  (0x6c00 bytes) HOLE
+Log of operations saved to "/mnt/test/junk.fsxops"; replay with --replay-ops
+Correct content saved for comparison
+(maybe hexdump "/mnt/test/junk" vs "/mnt/test/junk.fsxgood")

Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:30 +08:00
Jaegeuk Kim
df014cd170 f2fs: preallocate blocks for buffered aio writes
This patch preallocates data blocks for buffered aio writes.
With this patch, we can avoid redundant locking and unlocking of node pages
given consecutive aio request.

[For 3.10]
 - Add preallocationg for generic_splice_write(sendfile) for xfstests/249, 285

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

 Conflicts:
	fs/f2fs/data.c
2016-10-29 23:12:30 +08:00
Jaegeuk Kim
a206ff8ff8 f2fs: move dio preallocation into f2fs_file_write_iter
This patch moves preallocation code for direct IOs into f2fs_file_write_iter.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

 Conflicts:
	fs/f2fs/data.c
	fs/f2fs/file.c
2016-10-29 23:12:30 +08:00
Yunlei He
6356b2f4f9 f2fs: fix missing skip pages info
fix missing skip pages info in f2fs_writepages trace event.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:30 +08:00
Chao Yu
9a1fd07149 f2fs: introduce f2fs_submit_merged_bio_cond
f2fs use single bio buffer per type data (META/NODE/DATA) for caching
writes locating in continuous block address as many as possible, after
submitting, these writes may be still cached in bio buffer, so we have
to flush cached writes in bio buffer by calling f2fs_submit_merged_bio.

Unfortunately, in the scenario of high concurrency, bio buffer could be
flushed by someone else before we submit it as below reasons:
a) there is no space in bio buffer.
b) add a request of different type (SYNC, ASYNC).
c) add a discontinuous block address.

For this condition, f2fs_submit_merged_bio will be devastating, because
it could break the following merging of writes in bio buffer, split one
big bio into two smaller one.

This patch introduces f2fs_submit_merged_bio_cond which can do a
conditional submitting with bio buffer, before submitting it will judge
whether:
 - page in DATA type bio buffer is matching with specified page;
 - page in DATA type bio buffer is belong to specified inode;
 - page in NODE type bio buffer is belong to specified inode;
If there is no eligible page in bio buffer, we will skip submitting step,
result in gaining more chance to merge consecutive block IOs in bio cache.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:30 +08:00
Jaegeuk Kim
af7ffe2704 f2fs: fix conflict on page->private usage
This patch fixes confilct on page->private value between f2fs_trace_pid and
atomic page.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:30 +08:00
Jaegeuk Kim
7f78a775da f2fs: flush bios to handle cp_error in put_super
Sometimes, if cp_error is set, there remains under-writeback pages, resulting in
kernel hang in put_super.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:29 +08:00
Jaegeuk Kim
568ca9d039 f2fs: wait on page's writeback in writepages path
Likewise f2fs_write_cache_pages, let's do for node and meta pages too.
Especially, for node blocks, we should do this before marking its fsync
and dentry flags.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:29 +08:00
Chao Yu
c943a5d1aa f2fs: speed up handling holes in fiemap
This patch makes f2fs_map_blocks supporting returning next potential
page offset which skips hole region in indirect tree of inode, and
use it to speed up fiemap in handling big hole case.

Test method:
xfs_io -f /mnt/f2fs/file  -c "pwrite 1099511627776 4096"
time xfs_io -f /mnt/f2fs/file -c "fiemap -v"

Before:
time xfs_io -f /mnt/f2fs/file -c "fiemap -v"
/mnt/f2fs/file:
 EXT: FILE-OFFSET              BLOCK-RANGE      TOTAL FLAGS
   0: [0..2147483647]:         hole             2147483648
   1: [2147483648..2147483655]: 81920..81927         8   0x1

real    3m3.518s
user    0m0.000s
sys     3m3.456s

After:
time xfs_io -f /mnt/f2fs/file -c "fiemap -v"
/mnt/f2fs/file:
 EXT: FILE-OFFSET              BLOCK-RANGE      TOTAL FLAGS
   0: [0..2147483647]:         hole             2147483648
   1: [2147483648..2147483655]: 81920..81927         8   0x1

real    0m0.008s
user    0m0.000s
sys     0m0.008s

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:29 +08:00
Chao Yu
99300aead6 f2fs: introduce get_next_page_offset to speed up SEEK_DATA
When seeking data in ->llseek, if we encounter a big hole which covers
several dnode pages, we will try to seek data from index of page which
is the first page of next dnode page, at most we could skip searching
(ADDRS_PER_BLOCK - 1) pages.

However it's still not efficient, because if our indirect/double-indirect
pointer are NULL, there are no dnode page locate in the tree indirect/
double-indirect pointer point to, it's not necessary to search the whole
region.

This patch introduces get_next_page_offset to calculate next page offset
based on current searching level and max searching level returned from
get_dnode_of_data, with this, we could skip searching the entire area
indirect or double-indirect node block is not exist.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:29 +08:00
Chao Yu
e8e6b247e9 f2fs: remove unneeded pointer conversion
There are redundant pointer conversion in following call stack:
 - at position a, inode was been converted to f2fs_file_info.
 - at position b, f2fs_file_info was been converted to inode again.

 - truncate_blocks(inode,..)
  - fi = F2FS_I(inode)		---a
  - ADDRS_PER_PAGE(node_page, fi)
   - addrs_per_inode(fi)
    - inode = &fi->vfs_inode	---b
    - f2fs_has_inline_xattr(inode)
     - fi = F2FS_I(inode)
     - is_inode_flag_set(fi,..)

In order to avoid unneeded conversion, alter ADDRS_PER_PAGE and
addrs_per_inode to acept parameter with type of inode pointer.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:29 +08:00
Chao Yu
ea2c43f60e f2fs: simplify __allocate_data_blocks
This patch uses existing function f2fs_map_block to simplify implementation
of __allocate_data_blocks.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:29 +08:00
Chao Yu
d65b55a172 f2fs: simplify f2fs_map_blocks
In f2fs_map_blocks, we use duplicated codes to handle first block mapping
and the following blocks mapping, it's unnecessary. This patch simplifies
f2fs_map_blocks to avoid using copied codes.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:29 +08:00
Shuoran Liu
9876399dc3 f2fs: introduce lifetime write IO statistics
This patch introduces lifetime IO write statistics exposed to the sysfs interface.
The write IO amount is obtained from block layer, accumulated in the file system and
stored in the hot node summary of checkpoint.

Signed-off-by: Shuoran Liu <liushuoran@huawei.com>
Signed-off-by: Pengyang Hou <houpengyang@huawei.com>
[Jaegeuk Kim: add sysfs documentation]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:29 +08:00
Jaegeuk Kim
0007854c9f f2fs: give scheduling point in shrinking path
It needs to give a chance to be rescheduled while shrinking slab entries.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:29 +08:00
Hou Pengyang
faf1258aa7 f2fs: improve shrink performance of extent nodes
On the worst case, we need to scan the whole radix tree and every rb-tree to
free the victimed extent_nodes when shrinking.

Pengyang initially introduced a victim_list to record the victimed extent_nodes,
and free these extent_nodes by just scanning a list.

Later, Chao Yu enhances the original patch to improve memory footprint by
removing victim list.

The policy of lru list shrinking becomes:
1) lock lru list's lock
2) trylock extent tree's lock
3) remove extent node from lru list
4) unlock lru list's lock
5) do shrink
6) repeat 1) to 5)

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:29 +08:00
Jaegeuk Kim
64a2858911 f2fs: don't set cached_en if it will be freed
If en has empty list pointer, it will be freed sooner, so we don't need to
set cached_en with it.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:29 +08:00
Jaegeuk Kim
2f6563b2ef f2fs: move extent_node list operations being coupled with rbtree operation
This patch moves extent_node list operations to be handled together with
its rbtree operations.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:28 +08:00
Hou Pengyang
285b5811ff f2fs: reconstruct the code to free an extent_node
There are three steps to free an extent node:
1) list_del_init, 2)__detach_extent_node, 3) kmem_cache_free

In path f2fs_destroy_extent_tree, 1->2->3 to free a node,
But in path f2fs_update_extent_tree_range, it is 2->1->3.

This patch makes all the order to be: 1->2->3
It makes sense, since in the next patch, we import a victim list in the
path shrink_extent_tree, we could check if the extent_node is in the victim
list by checking the list_empty(). So it is necessary to put 1) first.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:28 +08:00
Jaegeuk Kim
bfe1d80660 f2fs: use wq_has_sleeper for cp_wait wait_queue
We need to use wq_has_sleeper including smp_mb to consider cp_wait concurrency.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:28 +08:00
Fan Li
81e6758a09 f2fs: avoid unnecessary search while finding victim in gc
variable nsearched in get_victim_by_default() indicates the number of
dirty segments we already checked. There are 2 problems about the way
it updates:
1. When p.ofs_unit is greater than 1, the victim we find consists
   of multiple segments, possibly more than 1 dirty segment.
   But nsearched always increases by 1.
2. If segments have been found but not been chosen, nsearched won't
   increase. So even we have checked all dirty segments, nsearched
   may still less than p.max_search.
All these problems could cause unnecessary search after all dirty
segments have already been checked.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:28 +08:00
Yunlei He
5c863d9060 f2fs: delete unnecessary wait for page writeback
no need to wait inline file page writeback for no one
use it, so this patch delete unnecessary wait.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:28 +08:00
Jaegeuk Kim
fa43b1083f f2fs: use wait_for_stable_page to avoid contention
In write_begin, if storage supports stable_page, we don't need to wait for
writeback to update its contents.
This patch introduces to use wait_for_stable_page instead of
wait_on_page_writeback.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:28 +08:00
Chao Yu
c8d6e41c9a f2fs: enhance foreground GC
If we configure section consist of multiple segments, foreground GC will
do the garbage collection with following approach:

	for each segment in victim section
		blk_start_plug
		for each valid block in segment
			write out by OPU method
		submit bio cache   <---
		blk_finish_plug   <---

There are two issue:
1) for most of the time, 'submit bio cache' will break the merging in
current bio buffer from writes of next segments, making a smaller bio
submitting.
2) block plug only cover IO submitting in one segment, which reduce
opportunity of merging IOs in plug with multiple segments.

So refactor the code as below structure to strive for biggest
opportunity of merging IOs:

	blk_start_plug
	for each segment in victim section
		for each valid block in segment
			write out by OPU method
	submit bio cache
	blk_finish_plug

Test method:
1. mkfs.f2fs -s 8 /dev/sdX
2. touch 32 files
3. write 2M data into each file
4. punch 1.5M data from offset 0 for each file
5. trigger foreground gc through ioctl

Before patch, there are totoally 40 bios submitted.
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 65536, size = 122880
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 65776, size = 122880
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 66016, size = 122880
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 66256, size = 122880
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 66496, size = 32768
----repeat for 8 times

After patch, there are totally 35 bios submitted.
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 65536, size = 122880
----repeat 34 times
f2fs_submit_write_bio: dev = (8,32), WRITE_SYNC, DATA, sector = 73696, size = 16384

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:28 +08:00
Jaegeuk Kim
1924b1d13d f2fs: don't need to call set_page_dirty for io error
If end_io gets an error, we don't need to set the page as dirty, since we
already set f2fs_stop_checkpoint which will not flush any data.

This will resolve the following warning.

======================================================
[ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
4.4.0+ #9 Tainted: G           O
------------------------------------------------------
xfs_io/26773 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
 (&(&sbi->inode_lock[i])->rlock){+.+...}, at: [<ffffffffc025483f>] update_dirty_page+0x6f/0xd0 [f2fs]

and this task is already holding:
 (&(&q->__queue_lock)->rlock){-.-.-.}, at: [<ffffffff81396ea2>] blk_queue_bio+0x422/0x490
which would create a new lock dependency:
 (&(&q->__queue_lock)->rlock){-.-.-.} -> (&(&sbi->inode_lock[i])->rlock){+.+...}

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

 Conflicts:
	fs/f2fs/data.c
2016-10-29 23:12:28 +08:00
Jaegeuk Kim
e038208489 f2fs: avoid needless sync_inode_page when reading inline_data
In write_begin, if there is an inline_data, f2fs loads it into 0'th data page.
Since it's the read path, we don't need to sync its inode page.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:28 +08:00
Jaegeuk Kim
87abb094a6 f2fs: don't need to sync node page at every time
In write_end, we don't need to sync inode page at every time.
Instead, we can expect f2fs_write_inode will update later.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:28 +08:00
Jaegeuk Kim
c060dfefbb f2fs: avoid multiple node page writes due to inline_data
The sceanrio is:
1. create fully node blocks
2. flush node blocks
3. write inline_data for all the node blocks again
4. flush node blocks redundantly

So, this patch tries to flush inline_data when flushing node blocks.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:28 +08:00
Jaegeuk Kim
6130193402 f2fs: do f2fs_balance_fs when block is allocated
We should consider data block allocation to trigger f2fs_balance_fs.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:28 +08:00
Jaegeuk Kim
54d7b9f302 f2fs: fix to overcome inline_data floods
The scenario is:
1. create lots of node blocks
2. sync
3. write lots of inline_data
-> got panic due to no free space

In that case, we should flush node blocks when writing inline_data in #3,
and trigger gc as well.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:28 +08:00
Jaegeuk Kim
1b914a30c4 f2fs: use writepages->lock for WB_SYNC_ALL
If there are many writepages calls by multiple threads in background, we don't
need to serialize to merge all the bios, since it's background.
In such the case, it'd better to run writepages concurrently.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:27 +08:00
Jaegeuk Kim
bfa33c6b5a f2fs: remove needless condition check
This patch removes needless condition variable.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:27 +08:00
Chao Yu
58d92819d5 f2fs: correct search area in get_new_segment
get_new_segment starts from current segment position, tries to search a
free segment among its right neighbors locate in same section.

But previously our search area was set as [current segment, max segment],
which means we have to search to more bits in free_segmap bitmap for some
worse cases. So here we correct the search area to [current segment, last
segment in section] to avoid unnecessary searching.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:27 +08:00
Chao Yu
619df17d64 f2fs: export dirty_nats_ratio in sysfs
This patch exports a new sysfs entry 'dirty_nat_ratio' to control threshold
of dirty nat entries, if current ratio exceeds configured threshold,
checkpoint will be triggered in f2fs_balance_fs_bg for flushing dirty nats.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

 Conflicts:
	Documentation/ABI/testing/sysfs-fs-f2fs
2016-10-29 23:12:27 +08:00
Chao Yu
496bb6b28a f2fs: flush dirty nat entries when exceeding threshold
When testing f2fs with xfstest, generic/251 is stuck for long time,
the case uses below serials to obtain fresh released space in device,
in order to prepare for following fstrim test.

1. rm -rf /mnt/dir
2. mkdir /mnt/dir/
3. cp -axT `pwd`/ /mnt/dir/
4. goto 1

During preparing step, all nat entries will be cached in nat cache,
most of them are dirty entries with invalid blkaddr, which means
nodes related to these entries have been truncated, and they could
be reused after the dirty entries been checkpointed.

However, there was no checkpoint been triggered, so nid allocators
(e.g. mkdir, creat) will run into long journey of iterating all NAT
pages, looking for free nids in alloc_nid->build_free_nids.

Here, in f2fs_balance_fs_bg we give another chance to do checkpoint
to flush nat entries for reusing them in free nid cache when dirty
entry count exceeds 10% of max count.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:27 +08:00
Chao Yu
5d38e01ab0 f2fs: relocate is_merged_page
Operations in is_merged_page is related to inner bio cache, move it to
data.c.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

 Conflicts:
	fs/f2fs/segment.c
2016-10-29 23:12:27 +08:00
Eric Sandeen
52b81d3536 ext4: fix unjournaled inode bitmap modification
commit 119c0d4460 changed
ext4_new_inode() such that the inode bitmap was being modified
outside a transaction, which could lead to corruption, and was
discovered when journal_checksum found a bad checksum in the
journal during log replay.

Nix ran into this when using the journal_async_commit mount
option, which enables journal checksumming.  The ensuing
journal replay failures due to the bad checksums led to
filesystem corruption reported as the now infamous
"Apparent serious progressive ext4 data corruption bug"

[ Changed by tytso to only call ext4_journal_get_write_access() only
  when we're fairly certain that we're going to allocate the inode. ]

I've tested this by mounting with journal_checksum and
running fsstress then dropping power; I've also tested by
hacking DM to create snapshots w/o first quiescing, which
allows me to test journal replay repeatedly w/o actually
power-cycling the box.  Without the patch I hit a journal
checksum error every time.  With this fix it survives
many iterations.

Change-Id: Id679b89e870c62f83476f836650ac9068908a84e
Reported-by: Nix <nix@esperi.org.uk>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-10-29 23:12:26 +08:00
Jaegeuk Kim
db748062ad f2fs: should unset atomic flag after successful commit
If there is an error during commit, we should keep the flag in order to
abort it.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:25 +08:00
Jaegeuk Kim
7a2dc25c86 f2fs: fix wrong memory condition check
This patch fixes wrong decision for avaliable_free_memory.
The return valus is already set as false, so we should consider true condition
below only.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

 Conflicts:
	fs/f2fs/node.c
2016-10-29 23:12:25 +08:00
Jaegeuk Kim
85abb2f17e f2fs: monitor the number of background checkpoint
This patch adds to show the number of background checkpoint.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:25 +08:00
Jaegeuk Kim
389d2eece6 f2fs: detect idle time depending on user behavior
This patch adds last time that user requested filesystem operations.
This information is used to detect whether system is idle or not later.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

 Conflicts:
	Documentation/ABI/testing/sysfs-fs-f2fs
2016-10-29 23:12:25 +08:00
Jaegeuk Kim
b074a0de9c f2fs: introduce time and interval facility
This patch adds time and interval arrays to store some timing variables.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:25 +08:00
Chao Yu
f1cbf37ac2 f2fs: skip releasing nodes in chindless extent tree
If there are no nodes in extent tree, let's skip releasing step to avoid
any overhead of grabbing/releasing extent tree lock.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:25 +08:00
Chao Yu
157a5e1f6a f2fs: use atomic type for node count in extent tree
1. rename field in struct extent_tree from count to node_cnt for
   readability.
2. alter to use atomic type for node_cnt.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:25 +08:00
Chao Yu
4f11a62b68 f2fs: recognize encrypted data in f2fs_fiemap
This patch fixes to teach f2fs_fiemap to recognize encrypted data.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:25 +08:00
Jaegeuk Kim
baac2dde6b f2fs: clean up f2fs_balance_fs
This patch adds one parameter to clean up all the callers of f2fs_balance_fs.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

 Conflicts:
	fs/f2fs/namei.c
2016-10-29 23:12:25 +08:00
Jaegeuk Kim
b5199c6522 f2fs: remove redundant calls
This patch removes redundant calls.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:25 +08:00
Jaegeuk Kim
d5dd3ee188 f2fs: avoid unnecessary f2fs_balance_fs calls
Only when node page is newly dirtied, it needs to check whether we need to do
f2fs_gc.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:25 +08:00
Jaegeuk Kim
0788a58411 f2fs: check the page status filled from disk
After reading a page, we need to check whether there is any error.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:24 +08:00
Chao Yu
c25b54f508 f2fs: introduce __get_node_page to reuse common code
There are duplicated code in between get_node_page and get_node_page_ra,
introduce __get_node_page to includes common parts of these two, and
export get_node_page and get_node_page_ra by reusing __get_node_page.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

 Conflicts:
	fs/f2fs/node.c
2016-10-29 23:12:24 +08:00
Chao Yu
01fbe1c8b3 f2fs: check node id earily when readaheading node page
Add node id check in ra_node_page and get_node_page_ra like get_node_page.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:24 +08:00
Fan Li
3cb1aba93a f2fs: read isize while holding i_mutex in fiemap
make sure the isize we read doesn't change during the process.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:24 +08:00
Jaegeuk Kim
83ef8a9fcd Revert "f2fs: check the node block address of newly allocated nid"
Original issue is fixed by:

  f2fs: cover more area with nat_tree_lock

This reverts commit 24928634f81b1592e83b37dcd89ed45c28f12feb.
2016-10-29 23:12:24 +08:00
Jaegeuk Kim
e694adb9e1 f2fs: cover more area with nat_tree_lock
There was a subtle bug on nat cache management which incurs wrong nid allocation
or wrong block addresses when try_to_free_nats is triggered heavily.
This patch enlarges the previous coverage of nat_tree_lock to avoid data race.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:24 +08:00
Chao Yu
ed22a357c2 f2fs: introduce max_file_blocks in sbi
Introduce max_file_blocks in sbi to store max block index of file in f2fs,
it could be used to avoid unneeded calculation of max block index in
runtime.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: fix overflow of sbi->max_file_blocks]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:24 +08:00
Chao Yu
1b4b58672a f2fs crypto: check CONFIG_F2FS_FS_XATTR for encrypted symlink
Add missed CONFIG_F2FS_FS_XATTR for encrypted symlink inode in order
to avoid unneeded registry of ->{get,set,remove}xattr.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:24 +08:00
Jaegeuk Kim
b167c6ac4c f2fs: introduce zombie list for fast shrinking extent trees
This patch removes refcount, and instead, adds zombie_list to shrink directly
without radix tree traverse.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:24 +08:00
Jaegeuk Kim
2dfff0e691 f2fs: monitor zombie_tree count
This patch adds an entry to show the number of zombie extent_tree.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:24 +08:00
Jaegeuk Kim
ad7ad27045 f2fs: use IPU for fdatasync
This patch fixes missing IPU condition when fdatasync is called.
With this patch, fdatasync is able to avoid additional node writes for recovery.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:24 +08:00
Jaegeuk Kim
a4aa034e8f f2fs: write pending bios when cp_error is set
When testing ioc_shutdown, put_super is able to be hanged by waiting for
writebacking pages as follows.

INFO: task umount:2723 blocked for more than 120 seconds.
      Tainted: G           O    4.4.0-rc3+ #8
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount          D ffff88000859f9d8     0  2723   2110 0x00000000
 ffff88000859f9d8 0000000000000000 0000000000000000 ffffffff81e11540
 ffff880078c225c0 ffff8800085a0000 ffff88007fc17440 7fffffffffffffff
 ffffffff818239f0 ffff88000859fb48 ffff88000859f9f0 ffffffff8182310c
Call Trace:
 [<ffffffff818239f0>] ? bit_wait+0x50/0x50
 [<ffffffff8182310c>] schedule+0x3c/0x90
 [<ffffffff81827fb9>] schedule_timeout+0x2d9/0x430
 [<ffffffff810e0f8f>] ? mark_held_locks+0x6f/0xa0
 [<ffffffff8111614d>] ? ktime_get+0x7d/0x140
 [<ffffffff818239f0>] ? bit_wait+0x50/0x50
 [<ffffffff8106a655>] ? kvm_clock_get_cycles+0x25/0x30
 [<ffffffff8111617c>] ? ktime_get+0xac/0x140
 [<ffffffff818239f0>] ? bit_wait+0x50/0x50
 [<ffffffff81822564>] io_schedule_timeout+0xa4/0x110
 [<ffffffff81823a25>] bit_wait_io+0x35/0x50
 [<ffffffff818235bd>] __wait_on_bit+0x5d/0x90
 [<ffffffff811b9e8b>] wait_on_page_bit+0xcb/0xf0
 [<ffffffff810d5f90>] ? autoremove_wake_function+0x40/0x40
 [<ffffffff811cf84c>] truncate_inode_pages_range+0x4bc/0x840
 [<ffffffff811cfc3d>] truncate_inode_pages_final+0x4d/0x60
 [<ffffffffc023ced5>] f2fs_evict_inode+0x75/0x400 [f2fs]
 [<ffffffff812639bc>] evict+0xbc/0x190
 [<ffffffff81263d19>] iput+0x229/0x2c0
 [<ffffffffc0241885>] f2fs_put_super+0x105/0x1a0 [f2fs]
 [<ffffffff8124756a>] generic_shutdown_super+0x6a/0xf0
 [<ffffffff812478f7>] kill_block_super+0x27/0x70
 [<ffffffffc0241290>] kill_f2fs_super+0x20/0x30 [f2fs]
 [<ffffffff81247b03>] deactivate_locked_super+0x43/0x70
 [<ffffffff81247f4c>] deactivate_super+0x5c/0x60
 [<ffffffff81268d2f>] cleanup_mnt+0x3f/0x90
 [<ffffffff81268dc2>] __cleanup_mnt+0x12/0x20
 [<ffffffff810ac463>] task_work_run+0x73/0xa0
 [<ffffffff810032ac>] exit_to_usermode_loop+0xcc/0xd0
 [<ffffffff81003e7c>] syscall_return_slowpath+0xcc/0xe0
 [<ffffffff81829ea2>] int_ret_from_sys_call+0x25/0x9f

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:24 +08:00
Jaegeuk Kim
e1496a4648 f2fs: remove f2fs_bug_on in terms of max_depth
There is no report on this bug_on case, but if malicious attacker changed this
field intentionally, we can just reset it as a MAX value.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:24 +08:00
Jaegeuk Kim
30a9582a1d f2fs: fix f2fs_ioc_abort_volatile_write
There are two rules to handle aborting volatile or atomic writes.

1. drop atomic writes
 - we don't need to keep any stale db data.

2. write journal data
 - we should keep the journal data with fsync for db recovery.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:23 +08:00
Chao Yu
899b32588e f2fs: fix to skip recovering dot dentries in a readonly fs
If filesystem is readonly, leave user message info instead of recovering
inline dot inode.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:23 +08:00
Jaegeuk Kim
3b07fd693d f2fs: load largest extent all the time
Otherwise, we can get mismatched largest extent information.

One example is:
1. mount f2fs w/ extent_cache
2. make a small extent
3. umount
4. mount f2fs w/o extent_cache
5. update the largest extent
6. umount
7. mount f2fs w/ extent_cache
8. get the old extent made by #2

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:23 +08:00
Jaegeuk Kim
a15f311013 f2fs: use i_size_read to get i_size
We need to use i_size_read() to get inode->i_size.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

 Conflicts:
	fs/f2fs/data.c
2016-10-29 23:12:23 +08:00
Jaegeuk Kim
0978d2a646 f2fs: early check broken symlink length in the encrypted case
If link is broken, its len is zero, and we don't need to move forward.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:23 +08:00
Chao Yu
71568f2f05 f2fs: clean up f2fs_ioc_write_checkpoint
Use f2fs_sync_fs to clean up codes in f2fs_ioc_write_checkpoint.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: remove unused err variable]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:23 +08:00
Yunlei He
e40bd22303 f2fs: add a max block check for get_data_block_bmap
This patch adds a max block check for get_data_block_bmap.

Trinity test program will send a block number as parameter into
ioctl_fibmap, which will be used in get_node_path(), when the block
number large than f2fs max blocks, it will trigger kernel bug.

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Xue Liu <liuxueliu.liu@huawei.com>
[Jaegeuk Kim: fix missing condition, pointed by Chao Yu]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:23 +08:00
Fan Li
e465ea8aa2 f2fs: fix bugs and simplify codes of f2fs_fiemap
fix bugs:
1. len could be updated incorrectly when start+len is beyond isize.
2. If there is a hole consisting of more than two blocks, it could
   fail to add FIEMAP_EXTENT_LAST flag for the last extent.
3. If there is an extent beyond isize, when we search extents in a range
   that ends at isize, it will also return the extent beyond isize,
   which is outside the range.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:23 +08:00
Chao Yu
562f92afb9 f2fs: let user being aware of IO error
Sometimes we keep dumb when IO error occur in lower layer device, so user
will not receive any error return value for some operation, but actually,
the operation did not succeed.

This sould be avoided, so this patch reports such kind of error to user.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

 Conflicts:
	fs/f2fs/data.c
2016-10-29 23:12:23 +08:00
Chao Yu
ce1ff1416d f2fs: add missing f2fs_balance_fs in __recover_dot_dentries
__recover_do_dentries will try to grab free space in storage, so fix to
add missing f2fs_balance_fs here.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:23 +08:00
Jaegeuk Kim
73efe696d4 f2fs: declare static function
The __f2fs_commit_super is static.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-10-29 23:12:23 +08:00