Use sbi->blocks_per_seg directly to avoid unnecessary calculation when using
1 << sbi->log_blocks_per_seg.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For direct IO, f2fs only allocate new address for the block which is not
exist in the disk before, its mapping info should not exist in extent
cache previously, so here we do not need to call f2fs_drop_largest_extent
to drop related cache.
Due to no more callers for f2fs_drop_largest_extent now, kill it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In recover_data, value of argument 'type' will be CURSEG_WARM_NODE all
the time, remove it for cleanup.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Clean up codes in lookup_journal_in_cursum() with __has_cursum_space().
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
No logic changes, just clean up the error path.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Conflicts:
fs/f2fs/dir.c
If device does not support discard, some obsolete dnodes can be recovered
by roll-forward. This patch enhances the recovery flow.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
While handling extent trees, we can enter into a reclaiming path anytime.
If it tries to release some extent nodes in the same extent tree,
write_lock(&et->lock) would be hanged.
In order to avoid the deadlock, we can just skip it.
Note that, if it is an unreferenced tree, we should get write_lock(&et->lock)
successfully and release all of therein nodes.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
get_lock_data_page in f2fs_readdir can fail due to a lot of reasons (i.e.
no memory or IO error...), it's better to report this kind of error to
user rather than ignoring it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should clear uptodate flag for all pages atomic written when we drop
them, otherwise before these cached pages were reclaimed or invalidated
eventually, we will see invalid data when hitting them again.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
1. Skip __reverse_ulong if the bitmap is empty.
2. Reduce branches and codes.
According to my test, the performance of this new version is 5% higher on
an empty bitmap of 64bytes, and remains about the same in the worst scenario.
Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If last dirty dentry page was writebacked in reclaim path, we should
remove its directory inode from global dirty list to avoid unnecessary
flush for this inode when doing checkpoint.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In 64-bit kernel f2fs can supports 32-bit ioctl system call by identifying
encoded code which is converted from 32-bit one to 64-bit one in
->compat_ioctl.
When we introduced new interfaces in ->ioctl, we forgot to enable them in
->compat_ioctl, so enable them for fixing.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: fix wrongly added spaces together]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a new ioctl F2FS_IOC_DEFRAGMENT to support file
defragment in a specified range of regular file.
This ioctl can be used in very limited workload: if user expects high
sequential read performance in randomly written file, this interface
can be used for defragmentation, after that file can be written as
continuous as possible in the device.
Meanwhile, it has side-effect, it will make holes in segments where
blocks located originally, so it's better to trigger GC to eliminate
fragment in segments.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should always commit atomic written pages in LFS mode, otherwise data
will become corrupted if we encounter suddent power cut after partial
pages committed in IPU mode.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_create_root_stats can fail due to no memory, report it to user.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In linux 3.10, we can not make sure the return value of nd_get_link function
is valid. So this patch add a check before use it.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Shuoran Liu <liushuoran@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When sync does it's WB_SYNC_ALL writeback, it issues data Io and
then immediately waits for IO completion. This is done in the
context of the flusher thread, and hence completely ties up the
flusher thread for the backing device until all the dirty inodes
have been synced. On filesystems that are dirtying inodes constantly
and quickly, this means the flusher thread can be tied up for
minutes per sync call and hence badly affect system level write IO
performance as the page cache cannot be cleaned quickly.
We already have a wait loop for IO completion for sync(2), so cut
this out of the flusher thread and delegate it to wait_sb_inodes().
Hence we can do rapid IO submission, and then wait for it all to
complete.
Effect of sync on fsmark before the patch:
FSUse% Count Size Files/sec App Overhead
.....
0 640000 4096 35154.6 1026984
0 720000 4096 36740.3 1023844
0 800000 4096 36184.6 916599
0 880000 4096 1282.7 1054367
0 960000 4096 3951.3 918773
0 1040000 4096 40646.2 996448
0 1120000 4096 43610.1 895647
0 1200000 4096 40333.1 921048
And a single sync pass took:
real 0m52.407s
user 0m0.000s
sys 0m0.090s
After the patch, there is no impact on fsmark results, and each
individual sync(2) operation run concurrently with the same fsmark
workload takes roughly 7s:
real 0m6.930s
user 0m0.000s
sys 0m0.039s
IOWs, sync is 7-8x faster on a busy filesystem and does not have an
adverse impact on ongoing async data write operations.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change-Id: I9e55d65f5ecb2305497711d4688f0647d9346035
In case when system contains no dirty pages, wakeup_flusher_threads()
will submit WB_SYNC_NONE writeback for 0 pages so wb_writeback() exits
immediately without doing anything. Thus sync(1) will write all the
dirty inodes from a WB_SYNC_ALL writeback pass which is slow.
Fix the problem by using get_nr_dirty_pages() in
wakeup_flusher_threads() instead of calculating number of dirty pages
manually. That function also takes number of dirty inodes into account.
Change-Id: I458027ae08d9a5a93202a7b97ace1f8da7a18a07
CC: stable@vger.kernel.org
Reported-by: Paul Taysom <taysom@chromium.org>
Signed-off-by: Jan Kara <jack@suse.cz>
There is a race between mark inode dirty and writeback thread, see the
following scenario. In this case, writeback thread will not run though
there is dirty_io.
__mark_inode_dirty() bdi_writeback_workfn()
... ...
spin_lock(&inode->i_lock);
...
if (bdi_cap_writeback_dirty(bdi)) {
<<< assume wb has dirty_io, so wakeup_bdi is false.
<<< the following inode_dirty also have wakeup_bdi false.
if (!wb_has_dirty_io(&bdi->wb))
wakeup_bdi = true;
}
spin_unlock(&inode->i_lock);
<<< assume last dirty_io is removed here.
pages_written = wb_do_writeback(wb);
...
<<< work_list empty and wb has no dirty_io,
<<< delayed_work will not be queued.
if (!list_empty(&bdi->work_list) ||
(wb_has_dirty_io(wb) && dirty_writeback_interval))
queue_delayed_work(bdi_wq, &wb->dwork,
msecs_to_jiffies(dirty_writeback_interval * 10));
spin_lock(&bdi->wb.list_lock);
inode->dirtied_when = jiffies;
<<< new dirty_io is added.
list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
spin_unlock(&bdi->wb.list_lock);
<<< though there is dirty_io, but wakeup_bdi is false,
<<< so writeback thread will not be waked up and
<<< the new dirty_io will not be flushed.
if (wakeup_bdi)
bdi_wakeup_thread_delayed(bdi);
Writeback will run until there is a new flush work queued. This may cause
a lot of dirty pages stay in memory for a long time.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>
Change-Id: I973fcba5381881a003a035ffff48f64348660079
The last patch is:
commit beaa57dd986d4f398728c060692fc2452895cfd8
Author: Chao Yu <chao2.yu@samsung.com>
Date: Thu Oct 22 18:24:12 2015 +0800
f2fs: fix to skip shrinking extent nodes
In f2fs_shrink_extent_tree we should stop shrink flow if we have already
shrunk enough nodes in extent cache.
Change-Id: I7bc76a98ce99412c59435f4573ace38fca604694
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
I got a report about unkillable task eating CPU. Further
investigation shows, that the problem is in the fuse_fill_write_pages()
function. If iov's first segment has zero length, we get an infinite
loop, because we never reach iov_iter_advance() call.
Fix this by calling iov_iter_advance() before repeating an attempt to
copy data from userspace.
A similar problem is described in 124d3b7041 ("fix writev regression:
pan hanging unkillable and un-straceable"). If zero-length segmend
is followed by segment with invalid address,
iov_iter_fault_in_readable() checks only first segment (zero-length),
iov_iter_copy_from_user_atomic() skips it, fails at second and
returns zero -> goto again without skipping zero-length segment.
Patch calls iov_iter_advance() before goto again: we'll skip zero-length
segment at second iteraction and iov_iter_fault_in_readable() will detect
invalid address.
Special thanks to Konstantin Khlebnikov, who helped a lot with the commit
description.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Maxim Patlasov <mpatlasov@parallels.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Fixes: ea9b9907b8 ("fuse: implement perform_write")
Cc: <stable@vger.kernel.org>
Change-Id: Id37193373294dd43191469389cfe68ca1736a54b
(cherry pick from commit 54708d2858e79a2bdda10bf8a20c80eb96c20613)
The commit 96d0df79f2 ("proc: make proc_fd_permission() thread-friendly")
fixed the access to /proc/self/fd from sub-threads, but introduced another
problem: a sub-thread can't access /proc/<tid>/fd/ or /proc/thread-self/fd
if generic_permission() fails.
Change proc_fd_permission() to check same_thread_group(pid_task(), current).
Fixes: 96d0df79f2 ("proc: make proc_fd_permission() thread-friendly")
Reported-by: "Jin, Yihua" <yihua.jin@intel.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bug: 26016905
Change-Id: Id91ef67770ab09fb2023338f4d0ace5fd3a60b1f
(cherry pick from commit 96d0df79f2)
proc_fd_permission() says "process can still access /proc/self/fd after it
has executed a setuid()", but the "task_pid() = proc_pid() check only
helps if the task is group leader, /proc/self points to
/proc/<leader-pid>.
Change this check to use task_tgid() so that the whole thread group can
access its /proc/self/fd or /proc/<tid-of-sub-thread>/fd.
Notes:
- CLONE_THREAD does not require CLONE_FILES so task->files
can differ, but I don't think this can lead to any security
problem. And this matches same_thread_group() in
__ptrace_may_access().
- /proc/self should probably point to /proc/<thread-tid>, but
it is too late to change the rules. Perhaps it makes sense
to add /proc/thread though.
Test-case:
void *tfunc(void *arg)
{
assert(opendir("/proc/self/fd"));
return NULL;
}
int main(void)
{
pthread_t t;
pthread_create(&t, NULL, tfunc, NULL);
pthread_join(t, NULL);
return 0;
}
fails if, say, this executable is not readable and suid_dumpable = 0.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bug: 26016905
Change-Id: I5c171211749b9a372560f941f6f66fbdc369e046
commit a6138db815 upstream.
Kenton Varda <kenton@sandstorm.io> discovered that by remounting a
read-only bind mount read-only in a user namespace the
MNT_LOCK_READONLY bit would be cleared, allowing an unprivileged user
to the remount a read-only mount read-write.
Correct this by replacing the mask of mount flags to preserve
with a mask of mount flags that may be changed, and preserve
all others. This ensures that any future bugs with this mask and
remount will fail in an easy to detect way where new mount flags
simply won't change.
Change-Id: I42178b32592b2ccc688d096b420304e93abeaba0
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Francis Moreau <francis.moro@gmail.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
commit 0d0826019e upstream.
Andy Lutomirski recently demonstrated that when chroot is used to set
the root path below the path for the new ``root'' passed to pivot_root
the pivot_root system call succeeds and leaks mounts.
In examining the code I see that starting with a new root that is
below the current root in the mount tree will result in a loop in the
mount tree after the mounts are detached and then reattached to one
another. Resulting in all kinds of ugliness including a leak of that
mounts involved in the leak of the mount loop.
Prevent this problem by ensuring that the new mount is reachable from
the current root of the mount tree.
[Added stable cc. Fixes CVE-2014-7970. --Andy]
Change-Id: I77908c81d43a2e5542f8ae27ca898dd26003b0e4
Reported-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/87bnpmihks.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
We used to read file_handle twice. Once to get the amount of extra bytes, and
once to fetch the entire structure.
This may be problematic since we do size verifications only after the first
read, so if the number of extra bytes changes in userspace between the first
and second calls, we'll have an incoherent view of file_handle.
Instead, read the constant size once, and copy that over to the final
structure without having to re-read it again.
Change-Id: Ib05e5129629e27d5a05953098c5bc470fae40d2a
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
This prevents a race between chown() and execve(), where chowning a
setuid-user binary to root would momentarily make the binary setuid
root.
This patch was mostly written by Linus Torvalds.
Signed-off-by: Jann Horn <jann@thejh.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change-Id: Iecebf23d07e299689e4ba4fd74ea8821ef96e72b
Dmitry Chernenkov used KASAN to discover that eCryptfs writes past the
end of the allocated buffer during encrypted filename decoding. This
fix corrects the issue by getting rid of the unnecessary 0 write when
the current bit offset is 2.
Change-Id: I2e139f816b9ce0ad6d207c6f454d6f25061383ee
Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Reported-by: Dmitry Chernenkov <dmitryc@google.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org # v2.6.29+: 51ca58d eCryptfs: Filename Encryption: Encoding and encryption functions
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Augment the compat ioctl table with entries for
PM control of TTY devices. These compat entries
were not present since other TTY/serial core drivers
were not using them.
Backported from msm-3.18.
Change-Id: I96a0e54c001d780a2a427380655f1fbb0091aef7
Signed-off-by: Naveen Kaje <nkaje@codeaurora.org>
We had for some reason overlooked the AIO interface, and it didn't use
the proper rw_verify_area() helper function that checks (for example)
mandatory locking on the file, and that the size of the access doesn't
cause us to overflow the provided offset limits etc.
Instead, AIO did just the security_file_permission() thing (that
rw_verify_area() also does) directly.
This fixes it to do all the proper helper functions, which not only
means that now mandatory file locking works with AIO too, we can
actually remove lines of code.
Bug: 28939037
Reported-by: Manish Honap <manish_honap_vit@yahoo.co.in>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit a70b52ec1a)
Change-Id: I2e182e973b44ba97c45c80d52d8a0b7c32a72750
Previous upstream *stable* fix 14f81062 was incomplete.
A local process can trigger a system crash with an OOB read on buf.
This occurs when the state of buf gets out of sync. After an error in
pipe_iov_copy_to_user() read_pipe may exit having updated buf->offset
but not buf->len. Upon retrying pipe_read() while in
pipe_iov_copy_to_user() *remaining will be larger than the space left
after buf->offset e.g. *remaing = PAGE_SIZE, buf->len = PAGE_SIZE,
buf->offset = 0x300.
This is fixed by not updating the state of buf->offset until after the
full copy is completed, similar to how pipe_write() is implemented.
For stable kernels < 3.16.
Bug: 27721803
Change-Id: Iefffbcc6cfd159dba69c31bcd98c6d5c1f21ff2e
Signed-off-by: Jeff Vander Stoep <jeffv@google.com>
pipe_iov_copy_{from,to}_user() may be tried twice with the same iovec,
the first time atomically and the second time not. The second attempt
needs to continue from the iovec position, pipe buffer offset and
remaining length where the first attempt failed, but currently the
pipe buffer offset and remaining length are reset. This will corrupt
the piped data (possibly also leading to an information leak between
processes) and may also corrupt kernel memory.
This was fixed upstream by commits f0d1bec9d5 ("new helper:
copy_page_from_iter()") and 637b58c288 ("switch pipe_read() to
copy_page_to_iter()"), but those aren't suitable for stable. This fix
for older kernel versions was made by Seth Jennings for RHEL and I
have extracted it from their update.
CVE-2015-1805
Bug: 27275324
Change-Id: I459adb9076fcd50ff1f1c557089c4e421b036ec4
References: https://bugzilla.redhat.com/show_bug.cgi?id=1202855
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 85c34d007116f8a8aafb173966a605fb03532f45)
(cherry pick from commit ab676b7d6fbf4b294bf198fb27ade5b0e865c7ce)
As pointed by recent post[1] on exploiting DRAM physical imperfection,
/proc/PID/pagemap exposes sensitive information which can be used to do
attacks.
This disallows anybody without CAP_SYS_ADMIN to read the pagemap.
[1] http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html
[ Eventually we might want to do anything more finegrained, but for now
this is the simple model. - Linus ]
Change-Id: I26bcfef2a0d4d3d02da4c522b29615d8ec694600
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Seaborn <mseaborn@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mark Salyzyn <salyzyn@google.com>
Bug: 26038496
Assign a unique proc inode to each namespace, and use that
inode number to ensure we only allocate at most one proc
inode for every namespace in proc.
A single proc inode per namespace allows userspace to test
to see if two processes are in the same namespace.
This has been a long requested feature and only blocked because
a naive implementation would put the id in a global space and
would ultimately require having a namespace for the names of
namespaces, making migration and certain virtualization tricks
impossible.
We still don't have per superblock inode numbers for proc, which
appears necessary for application unaware checkpoint/restart and
migrations (if the application is using namespace file descriptors)
but that is now allowd by the design if it becomes important.
I have preallocated the ipc and uts initial proc inode numbers so
their structures can be statically initialized.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
(cherry picked from commit 98f842e675)
Change the proc namespace files into symlinks so that
we won't cache the dentries for the namespace files
which can bypass the ptrace_may_access checks.
To support the symlinks create an additional namespace
inode with it's own set of operations distinct from the
proc pid inode and dentry methods as those no longer
make sense.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
(cherry picked from commit bf056bfa80)
Generalize the proc inode allocation so that it can be
used without having to having to create a proc_dir_entry.
This will allow namespace file descriptors to remain light
weight entitities but still have the same inode number
when the backing namespace is the same.
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
(cherry picked from commit 33d6dce607)
- Add a filesystem flag to mark filesystems that are safe to mount as
an unprivileged user.
- Add a filesystem flag to mark filesystems that don't need MNT_NODEV
when mounted by an unprivileged user.
- Relax the permission checks to allow unprivileged users that have
CAP_SYS_ADMIN permissions in the user namespace referred to by the
current mount namespace to be allowed to mount, unmount, and move
filesystems.
Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
(cherry picked from commit 0c55cfc416)
Sharing mount subtress with mount namespaces created by unprivileged
users allows unprivileged mounts created by unprivileged users to
propagate to mount namespaces controlled by privileged users.
Prevent nasty consequences by changing shared subtrees to slave
subtress when an unprivileged users creates a new mount namespace.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
(cherry picked from commit 7a472ef4be)
This will allow for support for unprivileged mounts in a new user namespace.
Acked-by: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
(cherry picked from commit 771b137168)
setns support for the mount namespace is a little tricky as an
arbitrary decision must be made about what to set fs->root and
fs->pwd to, as there is no expectation of a relationship between
the two mount namespaces. Therefore I arbitrarily find the root
mount point, and follow every mount on top of it to find the top
of the mount stack. Then I set fs->root and fs->pwd to that
location. The topmost root of the mount stack seems like a
reasonable place to be.
Bind mount support for the mount namespace inodes has the
possibility of creating circular dependencies between mount
namespaces. Circular dependencies can result in loops that
prevent mount namespaces from every being freed. I avoid
creating those circular dependencies by adding a sequence number
to the mount namespace and require all bind mounts be of a
younger mount namespace into an older mount namespace.
Add a helper function proc_ns_inode so it is possible to
detect when we are attempting to bind mound a namespace inode.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
(cherry picked from commit 8823c079ba)
normally we deal with lock_mount()/umount races by checking that
mountpoint to be is still in our namespace after lock_mount() has
been done. However, do_add_mount() skips that check when called
with MNT_SHRINKABLE in flags (i.e. from finish_automount()). The
reason is that ->mnt_ns may be a temporary namespace created exactly
to contain automounts a-la NFS4 referral handling. It's not the
namespace of the caller, though, so check_mnt() would fail here.
We still need to check that ->mnt_ns is non-NULL in that case,
though.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 156cacb1d0)
Btrfs has to make sure we have space to allocate new blocks in order to modify
the inode, so updating time can fail. We've gotten around this by having our
own file_update_time but this is kind of a pain, and Christoph has indicated he
would like to make xfs do something different with atime updates. So introduce
->update_time, where we will deal with i_version an a/m/c time updates and
indicate which changes need to be made. The normal version just does what it
has always done, updates the time and marks the inode dirty, and then
filesystems can choose to do something different.
I've gone through all of the users of file_update_time and made them check for
errors with the exception of the fault code since it's complicated and I wasn't
quite sure what to do there, also Jan is going to be pushing the file time
updates into page_mkwrite for those who have it so that should satisfy btrfs and
make it not a big deal to check the file_update_time() return code in the
generic fault path. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
(cherry picked from commit c3b2da3148)
Add comments describing what the directions "up" and "down" mean and ref count
handling to the VFS mount following family of functions.
Signed-off-by: Valerie Aurora <vaurora@redhat.com> (Original author)
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit f015f1267b)
copy_tree() can theoretically fail in a case other than ENOMEM, but always
returns NULL which is interpreted by callers as -ENOMEM. Change it to return
an explicit error.
Also change clone_mnt() for consistency and because union mounts will add new
error cases.
Thanks to Andreas Gruenbacher <agruen@suse.de> for a bug fix.
[AV: folded braino fix by Dan Carpenter]
Original-author: Valerie Aurora <vaurora@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Valerie Aurora <valerie.aurora@gmail.com>
Cc: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit be34d1a3bc)
don't rely on proc_mounts->m being the first field; container_of()
is there for purpose. No need to bother with ->private, while
we are at it - the same container_of will do nicely.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 6ce6e24e72)
it's enough to set ->mnt_ns of internal vfsmounts to something
distinct from all struct mnt_namespace out there; then we can
just use the check for ->mnt_ns != NULL in the fast path of
mntput_no_expire()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit f7a99c5b7c)
__mnt_make_shortterm() in there undoes the effect of __mnt_make_longterm()
we'd done back when we set ->mnt_ns non-NULL; it should not be done to
vfsmounts that had never gone through commit_tree() and friends. Kudos to
lczerner for catching that one...
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 63d37a84ab)
lglocks and brlocks are currently generated with some complicated macros
in lglock.h. But there's no reason to not just use common utility
functions and put all the data into a common data structure.
In preparation, this patch changes the API to look more like normal
function calls with pointers, not magic macros.
The patch is rather large because I move over all users in one go to keep
it bisectable. This impacts the VFS somewhat in terms of lines changed.
But no actual behaviour change.
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 962830df36)
lglocks and brlocks are currently generated with some complicated macros
in lglock.h. But there's no reason to not just use common utility
functions and put all the data into a common data structure.
Since there are at least two users it makes sense to share this code in a
library. This is also easier maintainable than a macro forest.
This will also make it later possible to dynamically allocate lglocks and
also use them in modules (this would both still need some additional, but
now straightforward, code)
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit eea62f831b)
it's enough to set ->mnt_ns of internal vfsmounts to something
distinct from all struct mnt_namespace out there; then we can
just use the check for ->mnt_ns != NULL in the fast path of
mntput_no_expire()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
__mnt_make_shortterm() in there undoes the effect of __mnt_make_longterm()
we'd done back when we set ->mnt_ns non-NULL; it should not be done to
vfsmounts that had never gone through commit_tree() and friends. Kudos to
lczerner for catching that one...
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Peak resident size of a process can be reset back to the process's
current rss value by writing "5" to /proc/pid/clear_refs. The driving
use-case for this would be getting the peak RSS value, which can be
retrieved from the VmHWM field in /proc/pid/status, per benchmark
iteration or test scenario.
Origin:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=695f055936938c674473ea071ca7359a863551e7
[akpm@linux-foundation.org: clarify behaviour in documentation]
Signed-off-by: Petr Cermak <petrcermak@chromium.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Primiano Tucci <primiano@chromium.org>
Cc: Petr Cermak <petrcermak@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change-Id: I06f83ce2d3d003ff67aced16d2710e4f88eb3af4
Make oom_adj and oom_score_adj user read-only.
Bug: 19636629
Conflicts:
fs/proc/base.c
Signed-off-by: Rom Lemarchand <romlem@google.com>
Signed-off-by: Patrick Tjin <pattjin@google.com>
Change-Id: I02bda099b3884105a0291b84b26bf9270bf48e22
filp_close is using !file_count(filp) to check if f_count is 0. if it is
0, filp_close think it is a closed file then will return. However, for a
closed file, f_count could be reduced to -1, then !file_count(filp) is
false, filp_close will proceed to handle this file then could panic.
This change will check if f_count is 0 or negative instead of only
checking 0 to avoid panic.
b/18200219 LRX21M: kernel_panic
Change-Id: I5117853dcbebec399021abf34338b1f6aff6ad14
Signed-off-by: Shengzhe Zhao <a18689@motorola.com>
Reviewed-by: Yi-Wei Zhao <gbjc64@motorola.com>
Signed-off-by: Iliyan Malchev <malchev@google.com>
Some OOM implementations are pretty trigger happy when it comes to
releasing memory for kmalloc() allocations. We might as well head
straight to vmalloc for allocations over PAGE_SIZE.
Bug: 17871993
Change-Id: Ic0dedbadc8bf551d34cc5d77c8073938d4adef80
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Naveen Ramaraj <nramaraj@codeaurora.org>
There are a couple of seq_files which use the single_open() interface.
This interface requires that the whole output must fit into a single
buffer.
E.g. for /proc/stat allocation failures have been observed because an
order-4 memory allocation failed due to memory fragmentation. In such
situations reading /proc/stat is not possible anymore.
Therefore change the seq_file code to fallback to vmalloc allocations
which will usually result in a couple of order-0 allocations and hence
also work if memory is fragmented.
For reference a call trace where reading from /proc/stat failed:
sadc: page allocation failure: order:4, mode:0x1040d0
CPU: 1 PID: 192063 Comm: sadc Not tainted 3.10.0-123.el7.s390x #1
[...]
Call Trace:
show_stack+0x6c/0xe8
warn_alloc_failed+0xd6/0x138
__alloc_pages_nodemask+0x9da/0xb68
__get_free_pages+0x2e/0x58
kmalloc_order_trace+0x44/0xc0
stat_open+0x5a/0xd8
proc_reg_open+0x8a/0x140
do_dentry_open+0x1bc/0x2c8
finish_open+0x46/0x60
do_last+0x382/0x10d0
path_openat+0xc8/0x4f8
do_filp_open+0x46/0xa8
do_sys_open+0x114/0x1f0
sysc_tracego+0x14/0x1a
Conflicts:
fs/seq_file.c
Bug: 17871993
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: David Rientjes <rientjes@google.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Thorsten Diehl <thorsten.diehl@de.ibm.com>
Cc: Andrea Righi <andrea@betterlinux.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Stefan Bader <stefan.bader@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: 058504edd0
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: Iad795a92fee1983c300568429a6283c48625bd9a
Signed-off-by: Jeremy Gebben <jgebben@codeaurora.org>
Signed-off-by: Naveen Ramaraj <nramaraj@codeaurora.org>
Applying restrictive seccomp filter programs to large or diverse
codebases often requires handling threads which may be started early in
the process lifetime (e.g., by code that is linked in). While it is
possible to apply permissive programs prior to process start up, it is
difficult to further restrict the kernel ABI to those threads after that
point.
This change adds a new seccomp syscall flag to SECCOMP_SET_MODE_FILTER for
synchronizing thread group seccomp filters at filter installation time.
When calling seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC,
filter) an attempt will be made to synchronize all threads in current's
threadgroup to its new seccomp filter program. This is possible iff all
threads are using a filter that is an ancestor to the filter current is
attempting to synchronize to. NULL filters (where the task is running as
SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be
transitioned into SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS,
...) has been set on the calling thread, no_new_privs will be set for
all synchronized threads too. On success, 0 is returned. On failure,
the pid of one of the failing threads will be returned and no filters
will have been applied.
The race conditions against another thread are:
- requesting TSYNC (already handled by sighand lock)
- performing a clone (already handled by sighand lock)
- changing its filter (already handled by sighand lock)
- calling exec (handled by cred_guard_mutex)
The clone case is assisted by the fact that new threads will have their
seccomp state duplicated from their parent before appearing on the tasklist.
Holding cred_guard_mutex means that seccomp filters cannot be assigned
while in the middle of another thread's exec (potentially bypassing
no_new_privs or similar). The call to de_thread() may kill threads waiting
for the mutex.
Changes across threads to the filter pointer includes a barrier.
Based on patches by Will Drewry.
Suggested-by: Julien Tinnes <jln@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Conflicts:
include/linux/seccomp.h
include/uapi/linux/seccomp.h
Since seccomp transitions between threads requires updates to the
no_new_privs flag to be atomic, the flag must be part of an atomic flag
set. This moves the nnp flag into a separate task field, and introduces
accessors.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Conflicts:
fs/exec.c
include/linux/sched.h
kernel/sys.c
With this change, calling
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
disables privilege granting operations at execve-time. For example, a
process will not be able to execute a setuid binary to change their uid
or gid if this bit is set. The same is true for file capabilities.
Additionally, LSM_UNSAFE_NO_NEW_PRIVS is defined to ensure that
LSMs respect the requested behavior.
To determine if the NO_NEW_PRIVS bit is set, a task may call
prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
It returns 1 if set and 0 if it is not set. If any of the arguments are
non-zero, it will return -1 and set errno to -EINVAL.
(PR_SET_NO_NEW_PRIVS behaves similarly.)
This functionality is desired for the proposed seccomp filter patch
series. By using PR_SET_NO_NEW_PRIVS, it allows a task to modify the
system call behavior for itself and its child tasks without being
able to impact the behavior of a more privileged task.
Another potential use is making certain privileged operations
unprivileged. For example, chroot may be considered "safe" if it cannot
affect privileged tasks.
Note, this patch causes execve to fail when PR_SET_NO_NEW_PRIVS is
set and AppArmor is in use. It is fixed in a subsequent patch.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Eric Paris <eparis@redhat.com>
v18: updated change desc
v17: using new define values as per 3.4
Conflicts:
include/linux/prctl.h
kernel/sys.c
Once we'd freed m->buf, m->count should become zero - we have no valid
contents reachable via m->buf.
Reported-by: Charley (Hao Chuan) Chu <charley.chu@broadcom.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ed Tam <etam@google.com>
This issue was first pointed out by Jiaxing Wang several months ago, but no
further comments:
https://lkml.org/lkml/2013/6/29/41
As we know pread() does not change f_pos, so after pread(), file->f_pos
and m->read_pos become different. And seq_lseek() does not update file->f_pos
if offset equals to m->read_pos, so after pread() and seq_lseek()(lseek to
m->read_pos), then a subsequent read may read from a wrong position, the
following program produces the problem:
char str1[32] = { 0 };
char str2[32] = { 0 };
int poffset = 10;
int count = 20;
/*open any seq file*/
int fd = open("/proc/modules", O_RDONLY);
pread(fd, str1, count, poffset);
printf("pread:%s\n", str1);
/*seek to where m->read_pos is*/
lseek(fd, poffset+count, SEEK_SET);
/*supposed to read from poffset+count, but this read from position 0*/
read(fd, str2, count);
printf("read:%s\n", str2);
out put:
pread:
ck_netbios_ns 12665
read:
nf_conntrack_netbios
/proc/modules:
nf_conntrack_netbios_ns 12665 0 - Live 0xffffffffa038b000
nf_conntrack_broadcast 12589 1 nf_conntrack_netbios_ns, Live 0xffffffffa0386000
So we always update file->f_pos to offset in seq_lseek() to fix this issue.
Conflicts:
fs/seq_file.c
Signed-off-by: Jiaxing Wang <hello.wjx@gmail.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ed Tam <etam@google.com>
Userspace processes often have multiple allocators that each do
anonymous mmaps to get memory. When examining memory usage of
individual processes or systems as a whole, it is useful to be
able to break down the various heaps that were allocated by
each layer and examine their size, RSS, and physical memory
usage.
This patch adds a user pointer to the shared union in
vm_area_struct that points to a null terminated string inside
the user process containing a name for the vma. vmas that
point to the same address will be merged, but vmas that
point to equivalent strings at different addresses will
not be merged.
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
Setting the name to NULL clears it.
The names of named anonymous vmas are shown in /proc/pid/maps
as [anon:<name>] and in /proc/pid/smaps in a new "Name" field
that is only present for named vmas. If the userspace pointer
is no longer valid all or part of the name will be replaced
with "<fault>".
The idea to store a userspace pointer to reduce the complexity
within mm (at the expense of the complexity of reading
/proc/pid/mem) came from Dave Hansen. This results in no
runtime overhead in the mm subsystem other than comparing
the anon_name pointers when considering vma merging. The pointer
is stored in a union with fieds that are only used on file-backed
mappings, so it does not increase memory usage.
Change-Id: Ie2ffc0967d4ffe7ee4c70781313c7b00cf7e3092
Signed-off-by: Colin Cross <ccross@android.com>
Avoid waking up every thread sleeping in a select call during
suspend and resume by calling a freezable blocking call. Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.
This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.
Change-Id: I0d7565ec0b6bc5d44cb55f958589c56e6bd16348
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Avoid waking up every thread sleeping in an epoll_wait call during
suspend and resume by calling a freezable blocking call. Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.
This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.
Change-Id: I848d08d28c89302fd42bbbdfa76489a474ab27bf
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Iliyan Malchev <malchev@google.com>
Conflicts:
fs/eventpoll.c
CIFS calls wait_event_freezekillable_unsafe with a VFS lock held,
which is unsafe and will cause lockdep warnings when 6aa9707
"lockdep: check that no locks held at freeze time" is reapplied
(it was reverted in dbf520a). CIFS shouldn't be doing this, but
it has long-running syscalls that must hold a lock but also
shouldn't block suspend. Until CIFS freeze handling is rewritten
to use a signal to exit out of the critical section, add a new
wait_event_freezekillable_unsafe helper that will not run the
lockdep test when 6aa9707 is reapplied, and call it from CIFS.
In practice the likley result of holding the lock while freezing
is that a second task blocked on the lock will never freeze,
aborting suspend, but it is possible to manufacture a case using
the cgroup freezer, the lock, and the suspend freezer to create
a deadlock. Silencing the lockdep warning here will allow
problems to be found in other drivers that may have a more
serious deadlock risk, and prevent new problems from being added.
Change-Id: I420c5392bacf68e58e268293b2b36068ad4df753
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
NFS calls the freezable helpers with locks held, which is unsafe
and will cause lockdep warnings when 6aa9707 "lockdep: check
that no locks held at freeze time" is reapplied (it was reverted
in dbf520a). NFS shouldn't be doing this, but it has
long-running syscalls that must hold a lock but also shouldn't
block suspend. Until NFS freeze handling is rewritten to use a
signal to exit out of the critical section, add new *_unsafe
versions of the helpers that will not run the lockdep test when
6aa9707 is reapplied, and call them from NFS.
In practice the likley result of holding the lock while freezing
is that a second task blocked on the lock will never freeze,
aborting suspend, but it is possible to manufacture a case using
the cgroup freezer, the lock, and the suspend freezer to create
a deadlock. Silencing the lockdep warning here will allow
problems to be found in other drivers that may have a more
serious deadlock risk, and prevent new problems from being added.
Change-Id: Ia17d32cdd013a6517bdd5759da900970a4427170
Signed-off-by: Colin Cross <ccross@android.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Patch from Ken that can solve fsstress failing issue
commit b9fa7bb8ff207eeb27d2e0ed45b8c3acf1a7af8c
Author: Tao Ma <boyu.mt@taobao.com>
Date: Mon May 28 18:20:59 2012 -0400
ext4: protect group inode free counting with group lock
Now when we set the group inode free count, we don't have a proper
group lock so that multiple threads may decrease the inode free
count at the same time. And e2fsck will complain something like:
Free inodes count wrong for group #1 (1, counted=0).
Fix? no
Free inodes count wrong for group #2 (3, counted=0).
Fix? no
Directories count wrong for group #2 (780, counted=779).
Fix? no
Free inodes count wrong for group #3 (2272, counted=2273).
Fix? no
So this patch try to protect it with the ext4_lock_group.
btw, it is found by xfstests test case 269 and the volume is
mkfsed with the parameter
"-O ^resize_inode,^uninit_bg,extent,meta_bg,flex_bg,ext_attr"
and I have run it 100 times and the error in e2fsck doesn't
show up again.
Change-Id: Iba773843728759e1d64d4ff57765288eb5977665
Reviewed-on: http://mcrd1-5.corpnet.asus/code-review/master/67871
Reviewed-by: Lin Johnny1 <Johnny1_Lin@asus.com>
Tested-by: Lin Johnny1 <Johnny1_Lin@asus.com>
Reviewed-by: Sam hblee <Sam_hblee@asus.com>
In ext4_nonda_switch(), if the file system is getting full we used to
call writeback_inodes_sb_if_idle(). The problem is that we can be
holding i_mutex already, and this causes a potential deadlock when
writeback_inodes_sb_if_idle() when it tries to take s_umount. (See
lockdep output below).
As it turns out we don't need need to hold s_umount; the fact that we
are in the middle of the write(2) system call will keep the superblock
pinned. Unfortunately writeback_inodes_sb() checks to make sure
s_umount is taken, and the VFS uses a different mechanism for making
sure the file system doesn't get unmounted out from under us. The
simplest way of dealing with this is to just simply grab s_umount
using a trylock, and skip kicking the writeback flusher thread in the
very unlikely case that we can't take a read lock on s_umount without
blocking.
Also, we now check the cirteria for kicking the writeback thread
before we decide to whether to fall back to non-delayed writeback, so
if there are any outstanding delayed allocation writes, we try to get
them resolved as soon as possible.
[ INFO: possible circular locking dependency detected ]
3.6.0-rc1-00042-gce894ca #367 Not tainted
-------------------------------------------------------
dd/8298 is trying to acquire lock:
(&type->s_umount_key#18){++++..}, at: [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46
but task is already holding lock:
(&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3
which lock already depends on the new lock.
2 locks held by dd/8298:
#0: (sb_writers#2){.+.+.+}, at: [<c01ddcc5>] generic_file_aio_write+0x56/0xd3
#1: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3
stack backtrace:
Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca #367
Call Trace:
[<c015b79c>] ? console_unlock+0x345/0x372
[<c06d62a1>] print_circular_bug+0x190/0x19d
[<c019906c>] __lock_acquire+0x86d/0xb6c
[<c01999db>] ? mark_held_locks+0x5c/0x7b
[<c0199724>] lock_acquire+0x66/0xb9
[<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
[<c06db935>] down_read+0x28/0x58
[<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
[<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46
[<c026f3b2>] ext4_nonda_switch+0xe1/0xf4
[<c0271ece>] ext4_da_write_begin+0x27/0x193
[<c01dcdb0>] generic_file_buffered_write+0xc8/0x1bb
[<c01ddc47>] __generic_file_aio_write+0x1dd/0x205
[<c01ddce7>] generic_file_aio_write+0x78/0xd3
[<c026d336>] ext4_file_write+0x480/0x4a6
[<c0198c1d>] ? __lock_acquire+0x41e/0xb6c
[<c0180944>] ? sched_clock_cpu+0x11a/0x13e
[<c01967e9>] ? trace_hardirqs_off+0xb/0xd
[<c018099f>] ? local_clock+0x37/0x4e
[<c0209f2c>] do_sync_write+0x67/0x9d
[<c0209ec5>] ? wait_on_retry_sync_kiocb+0x44/0x44
[<c020a7b9>] vfs_write+0x7b/0xe6
[<c020a9a6>] sys_write+0x3b/0x64
[<c06dd4bd>] syscall_call+0x7/0xb
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
Commit 080399aaaf ("block: don't mark buffers beyond end of disk as
mapped") exposed a bug in __getblk_slow that causes mount to hang as it
loops infinitely waiting for a buffer that lies beyond the end of the
disk to become uptodate.
The problem was initially reported by Torsten Hilbrich here:
https://lkml.org/lkml/2012/6/18/54
and also reported independently here:
http://www.sysresccd.org/forums/viewtopic.php?f=13&t=4511
and then Richard W.M. Jones and Marcos Mello noted a few separate
bugzillas also associated with the same issue. This patch has been
confirmed to fix:
https://bugzilla.redhat.com/show_bug.cgi?id=835019
The main problem is here, in __getblk_slow:
for (;;) {
struct buffer_head * bh;
int ret;
bh = __find_get_block(bdev, block, size);
if (bh)
return bh;
ret = grow_buffers(bdev, block, size);
if (ret < 0)
return NULL;
if (ret == 0)
free_more_memory();
}
__find_get_block does not find the block, since it will not be marked as
mapped, and so grow_buffers is called to fill in the buffers for the
associated page. I believe the for (;;) loop is there primarily to
retry in the case of memory pressure keeping grow_buffers from
succeeding. However, we also continue to loop for other cases, like the
block lying beond the end of the disk. So, the fix I came up with is to
only loop when grow_buffers fails due to memory allocation issues
(return value of 0).
The attached patch was tested by myself, Torsten, and Rich, and was
found to resolve the problem in call cases.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Reported-and-Tested-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Josh Boyer <jwboyer@redhat.com>
Cc: Stable <stable@vger.kernel.org> # 3.0+
[ Jens is on vacation, taking this directly - Linus ]
--
Stable Notes: this patch requires backport to 3.0, 3.2 and 3.3.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
IBM reported a deadlock in select_parent(). This was found to be caused
by taking rename_lock when already locked when restarting the tree
traversal.
There are two cases when the traversal needs to be restarted:
1) concurrent d_move(); this can only happen when not already locked,
since taking rename_lock protects against concurrent d_move().
2) racing with final d_put() on child just at the moment of ascending
to parent; rename_lock doesn't protect against this rare race, so it
can happen when already locked.
Because of case 2, we need to be able to handle restarting the traversal
when rename_lock is already held. This patch fixes all three callers of
try_to_ascend().
IBM reported that the deadlock is gone with this patch.
[ I rewrote the patch to be smaller and just do the "goto again" if the
lock was already held, but credit goes to Miklos for the real work.
- Linus ]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CMA features may ifdef out parts of the code with
CONFIG_CMA. Older code uses CONFIG_DMA_CMA. Switch
to using the newer CONFIG_CMA to ensure the code gets
compiled when needed.
Change-Id: I3cae639797787b4926a6c5e057de973b66196707
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: Neha Pandey <nehap@codeaurora.org>
The FUSE file system may hold references to pages for long
periods of time, preventing migration from occuring. If a CMA
page is used here, CMA allocations may fail. Work around this
by swapping out a CMA page for a non-CMA page when working with
the FUSE file system.
Change-Id: Id763ea833ee125c8732ae3759ec9e20d94aa8424
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: Mitchel Humpherys <mitchelh@codeaurora.org>
When a buffer is added to the LRU list, a reference is taken which is
not dropped until the buffer is evicted from the LRU list. This is the
correct behavior, however this LRU reference will prevent the buffer
from being dropped. This means that the buffer can't actually be dropped
until it is selected for eviction. There's no bound on the time spent
on the LRU list, which means that the buffer may be undroppable for
very long periods of time. Given that migration involves dropping
buffers, the associated page is now unmigratible for long periods of
time as well. CMA relies on being able to migrate a specific range
of pages, so these these types of failures make CMA significantly
less reliable, especially under high filesystem usage.
Rather than waiting for the LRU algorithm to eventually kick out
the buffer, explicitly remove the buffer from the LRU list when trying
to drop it. There is still the possibility that the buffer
could be added back on the list, but that indicates the buffer is
still in use and would probably have other 'in use' indicates to
prevent dropping.
Change-Id: I253f4ee2069e190c1115afc421dadd27a7fa87dc
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: Mitchel Humpherys <mitchelh@codeaurora.org>
Git commit 09a1d34f85 "nohz: Make idle/iowait counter update
conditional" introduced a bug in regard to cpu hotplug. The effect is
that the number of idle ticks in the cpu summary line in /proc/stat is
still counting ticks for offline cpus.
Reproduction is easy, just start a workload that keeps all cpus busy,
switch off one or more cpus and then watch the idle field in top.
On a dual-core with one cpu 100% busy and one offline cpu you will get
something like this:
%Cpu(s): 48.7 us, 1.3 sy, 0.0 ni, 50.0 id, 0.0 wa, 0.0 hi, 0.0 si,
%0.0 st
The problem is that an offline cpu still has ts->idle_active == 1.
To fix this we should make sure that the cpu is online when calling
get_cpu_idle_time_us and get_cpu_iowait_time_us.
[Srivatsa: Rebased to current mainline]
Change-Id: I53cd02fef784647e45abb4c99ff641e5e69a9d3e
Reported-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20121010061820.8999.57245.stgit@srivatsabhat.in.ibm.com
Cc: deepthi@linux.vnet.ibm.com
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In yaffs_rename(), d_entry->d_inode can be NULL if the
target directory doesn't exist or if there is any race
condition such as target directory being deleted while
renaming another directory to target directory name.
Avoid dereferencing d_inode in such cases.
CRs-Fixed: 360748
Change-Id: If95b4992f1056fea78f2e1bd54253cd5c8aac93d
Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
Some userspace applications use /proc/stat to determine how many CPUs
the system has. CPU hotplug can offline a CPU at runtime and causing the
offline CPU not present in /proc/stat if we only show online cpu in
/proc/stat.
Change-Id: I4fd0cfcdb174244044634389da2fbdef77744c19
Signed-off-by: Ajay Dudani <adudani@codeaurora.org>
The compiler warns that 'saved_nlink' declared in line 980
may be used uninitialized.
Change-Id: Ie725c0efe949213b1281a70c779a3167b1961327
Acked-by: Kaushik Sikdar <ksikdar@qualcomm.com>
Signed-off-by: Rohit Vaswani <rvaswani@codeaurora.org>
According to the current distributed security logic between
kernel-userspace, the kernel is not aware of the level of
security that a link-key provides when userspace responds
to the link key request. Adding a ioctl entry which will
update the kernel space auth_key's level of security as soon
as userspace responds to the link key request.
CRs-fixed: 264601
Change-Id: I6765cce92a6f8b761742d57ea94e81502f6e7fcf
Signed-off-by: NaveenKumar <naveenr@codeaurora.org>
Fix NR_IPI to be 7 instead of 6 because both googly and core add
an IPI.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Conflicts:
arch/arm/Kconfig
arch/arm/common/Makefile
arch/arm/include/asm/hardware/cache-l2x0.h
arch/arm/mm/cache-l2x0.c
arch/arm/mm/mmu.c
include/linux/wakelock.h
kernel/power/Kconfig
kernel/power/Makefile
kernel/power/main.c
kernel/power/power.h
Suspend attempts can abort when the FUSE daemon is already frozen
and a client is waiting uninterruptibly for a response, causing
freezing of tasks to fail.
Use the freeze-friendly wait API, but disregard other signals.
Change-Id: Icefb7e4bbc718ccb76bf3c04daaa5eeea7e0e63c
Signed-off-by: Todd Poynor <toddpoynor@google.com>
(cherry picked from commit d893d660b7)
If FAT formatted SD card gets removed without unmounting,
FAT file system may throw many kernel error messages which
could too much traffic for console driver and can sometimes
even cause the system to trigger watchdog timeout.
This patch converts the printk to printk_ratelimited to rate
limit the error messages from FAT fs.
Change-Id: I58b942f6714a8d3353478eb21139b8046ee3f875
Signed-off-by: Subhash Jadavani <subhashj@codeaurora.org>
(cherry picked from commit a872b71069d717ced1a2de642afa0693d1bb9448)
Cleancache requires a small amount of code to be added
to a filesystem's implementation so that clean page
cache pages from a filesystem of that type may be
recognized and stored in/retrieved from cleancache.
Change-Id: I94c3fc8817ab66e2c54f7b2c6c474dd2321d9806
Signed-off-by: Larry Bassel <lbassel@codeaurora.org>
(cherry picked from commit eb93161d9746b2aa0ac534d1da88a33480d21905)
Use deferable timer in background operations thread,
so that it won't cause unnecessary wakeups. Typically,
wakeups are seen in the range 60ms to 2secs (for HZ=100)
after the thread is scheduled out. In general, during this
delay the processor can go into sleep, if there is no other
activity. Since, the work done in this background operation
is not critical and can be handled as soon as when timer
expires and processor wakes up for other critical events,
we mark the timer responsible for wakeup of this thread
as deferable timer. Otherwise, the processor wakesup
unnecessarily to handle the background operations causing
higher power consumption in idle state.
Change-Id: Ic168525c6b33600ad23017d00ea9723cf8a738d2
Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
(cherry picked from commit 4e239c0cbfc9070ff0c006a92eaafd52243b47a0)
Pull block layer fixes from Jens Axboe:
"A few small, but important fixes. Most of them are marked for stable
as well
- Fix failure to release a semaphore on error path in mtip32xx.
- Fix crashable condition in bio_get_nr_vecs().
- Don't mark end-of-disk buffers as mapped, limit it to i_size.
- Fix for build problem with CONFIG_BLOCK=n on arm at least.
- Fix for a buffer overlow on UUID partition printing.
- Trivial removal of unused variables in dac960."
* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix buffer overflow when printing partition UUIDs
Fix blkdev.h build errors when BLOCK=n
bio allocation failure due to bio_get_nr_vecs()
block: don't mark buffers beyond end of disk as mapped
mtip32xx: release the semaphore on an error path
dac960: Remove unused variables from DAC960_CreateProcEntries()
Merge misc fixes from Andrew Morton.
* emailed from Andrew Morton <akpm@linux-foundation.org>: (4 patches)
frv: delete incorrect task prototypes causing compile fail
slub: missing test for partial pages flush work in flush_all()
fs, proc: fix ABBA deadlock in case of execution attempt of map_files/ entries
drivers/rtc/rtc-pl031.c: configure correct wday for 2000-01-01
Instead of doing the i_mode calculations at proc_fd_instantiate() time,
move them into tid_fd_revalidate(), which is where the other inode state
(notably uid/gid information) is updated too.
Otherwise we'll end up with stale i_mode information if an fd is re-used
while the dentry still hangs around. Not that anything really *cares*
(symlink permissions don't really matter), but Tetsuo Handa noticed that
the owner read/write bits don't always match the state of the
readability of the file descriptor, and we _used_ to get this right a
long time ago in a galaxy far, far away.
Besides, aside from fixing an ugly detail (that has apparently been this
way since commit 61a2878402: "proc: Remove the hard coded inode
numbers" in 2006), this removes more lines of code than it adds. And it
just makes sense to update i_mode in the same place we update i_uid/gid.
Al Viro correctly points out that we could just do the inode fill in the
inode iops ->getattr() function instead. However, that does require
somewhat slightly more invasive changes, and adds yet *another* lookup
of the file descriptor. We need to do the revalidate() for other
reasons anyway, and have the file descriptor handy, so we might as well
fill in the information at this point.
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
map_files/ entries are never supposed to be executed, still curious
minds might try to run them, which leads to the following deadlock
======================================================
[ INFO: possible circular locking dependency detected ]
3.4.0-rc4-24406-g841e6a6 #121 Not tainted
-------------------------------------------------------
bash/1556 is trying to acquire lock:
(&sb->s_type->i_mutex_key#8){+.+.+.}, at: do_lookup+0x267/0x2b1
but task is already holding lock:
(&sig->cred_guard_mutex){+.+.+.}, at: prepare_bprm_creds+0x2d/0x69
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&sig->cred_guard_mutex){+.+.+.}:
validate_chain+0x444/0x4f4
__lock_acquire+0x387/0x3f8
lock_acquire+0x12b/0x158
__mutex_lock_common+0x56/0x3a9
mutex_lock_killable_nested+0x40/0x45
lock_trace+0x24/0x59
proc_map_files_lookup+0x5a/0x165
__lookup_hash+0x52/0x73
do_lookup+0x276/0x2b1
walk_component+0x3d/0x114
do_last+0xfc/0x540
path_openat+0xd3/0x306
do_filp_open+0x3d/0x89
do_sys_open+0x74/0x106
sys_open+0x21/0x23
tracesys+0xdd/0xe2
-> #0 (&sb->s_type->i_mutex_key#8){+.+.+.}:
check_prev_add+0x6a/0x1ef
validate_chain+0x444/0x4f4
__lock_acquire+0x387/0x3f8
lock_acquire+0x12b/0x158
__mutex_lock_common+0x56/0x3a9
mutex_lock_nested+0x40/0x45
do_lookup+0x267/0x2b1
walk_component+0x3d/0x114
link_path_walk+0x1f9/0x48f
path_openat+0xb6/0x306
do_filp_open+0x3d/0x89
open_exec+0x25/0xa0
do_execve_common+0xea/0x2f9
do_execve+0x43/0x45
sys_execve+0x43/0x5a
stub_execve+0x6c/0xc0
This is because prepare_bprm_creds grabs task->signal->cred_guard_mutex
and when do_lookup happens we try to grab task->signal->cred_guard_mutex
again in lock_trace.
Fix it using plain ptrace_may_access() helper in proc_map_files_lookup()
and in proc_map_files_readdir() instead of lock_trace(), the caller must
be CAP_SYS_ADMIN granted anyway.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>