lglocks and brlocks are currently generated with some complicated macros
in lglock.h. But there's no reason to not just use common utility
functions and put all the data into a common data structure.
In preparation, this patch changes the API to look more like normal
function calls with pointers, not magic macros.
The patch is rather large because I move over all users in one go to keep
it bisectable. This impacts the VFS somewhat in terms of lines changed.
But no actual behaviour change.
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 962830df36)
lglocks and brlocks are currently generated with some complicated macros
in lglock.h. But there's no reason to not just use common utility
functions and put all the data into a common data structure.
Since there are at least two users it makes sense to share this code in a
library. This is also easier maintainable than a macro forest.
This will also make it later possible to dynamically allocate lglocks and
also use them in modules (this would both still need some additional, but
now straightforward, code)
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit eea62f831b)
it's enough to set ->mnt_ns of internal vfsmounts to something
distinct from all struct mnt_namespace out there; then we can
just use the check for ->mnt_ns != NULL in the fast path of
mntput_no_expire()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
__mnt_make_shortterm() in there undoes the effect of __mnt_make_longterm()
we'd done back when we set ->mnt_ns non-NULL; it should not be done to
vfsmounts that had never gone through commit_tree() and friends. Kudos to
lczerner for catching that one...
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Peak resident size of a process can be reset back to the process's
current rss value by writing "5" to /proc/pid/clear_refs. The driving
use-case for this would be getting the peak RSS value, which can be
retrieved from the VmHWM field in /proc/pid/status, per benchmark
iteration or test scenario.
Origin:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=695f055936938c674473ea071ca7359a863551e7
[akpm@linux-foundation.org: clarify behaviour in documentation]
Signed-off-by: Petr Cermak <petrcermak@chromium.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Primiano Tucci <primiano@chromium.org>
Cc: Petr Cermak <petrcermak@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change-Id: I06f83ce2d3d003ff67aced16d2710e4f88eb3af4
Make oom_adj and oom_score_adj user read-only.
Bug: 19636629
Conflicts:
fs/proc/base.c
Signed-off-by: Rom Lemarchand <romlem@google.com>
Signed-off-by: Patrick Tjin <pattjin@google.com>
Change-Id: I02bda099b3884105a0291b84b26bf9270bf48e22
filp_close is using !file_count(filp) to check if f_count is 0. if it is
0, filp_close think it is a closed file then will return. However, for a
closed file, f_count could be reduced to -1, then !file_count(filp) is
false, filp_close will proceed to handle this file then could panic.
This change will check if f_count is 0 or negative instead of only
checking 0 to avoid panic.
b/18200219 LRX21M: kernel_panic
Change-Id: I5117853dcbebec399021abf34338b1f6aff6ad14
Signed-off-by: Shengzhe Zhao <a18689@motorola.com>
Reviewed-by: Yi-Wei Zhao <gbjc64@motorola.com>
Signed-off-by: Iliyan Malchev <malchev@google.com>
Some OOM implementations are pretty trigger happy when it comes to
releasing memory for kmalloc() allocations. We might as well head
straight to vmalloc for allocations over PAGE_SIZE.
Bug: 17871993
Change-Id: Ic0dedbadc8bf551d34cc5d77c8073938d4adef80
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Naveen Ramaraj <nramaraj@codeaurora.org>
There are a couple of seq_files which use the single_open() interface.
This interface requires that the whole output must fit into a single
buffer.
E.g. for /proc/stat allocation failures have been observed because an
order-4 memory allocation failed due to memory fragmentation. In such
situations reading /proc/stat is not possible anymore.
Therefore change the seq_file code to fallback to vmalloc allocations
which will usually result in a couple of order-0 allocations and hence
also work if memory is fragmented.
For reference a call trace where reading from /proc/stat failed:
sadc: page allocation failure: order:4, mode:0x1040d0
CPU: 1 PID: 192063 Comm: sadc Not tainted 3.10.0-123.el7.s390x #1
[...]
Call Trace:
show_stack+0x6c/0xe8
warn_alloc_failed+0xd6/0x138
__alloc_pages_nodemask+0x9da/0xb68
__get_free_pages+0x2e/0x58
kmalloc_order_trace+0x44/0xc0
stat_open+0x5a/0xd8
proc_reg_open+0x8a/0x140
do_dentry_open+0x1bc/0x2c8
finish_open+0x46/0x60
do_last+0x382/0x10d0
path_openat+0xc8/0x4f8
do_filp_open+0x46/0xa8
do_sys_open+0x114/0x1f0
sysc_tracego+0x14/0x1a
Conflicts:
fs/seq_file.c
Bug: 17871993
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: David Rientjes <rientjes@google.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Thorsten Diehl <thorsten.diehl@de.ibm.com>
Cc: Andrea Righi <andrea@betterlinux.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Stefan Bader <stefan.bader@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: 058504edd0
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: Iad795a92fee1983c300568429a6283c48625bd9a
Signed-off-by: Jeremy Gebben <jgebben@codeaurora.org>
Signed-off-by: Naveen Ramaraj <nramaraj@codeaurora.org>
Applying restrictive seccomp filter programs to large or diverse
codebases often requires handling threads which may be started early in
the process lifetime (e.g., by code that is linked in). While it is
possible to apply permissive programs prior to process start up, it is
difficult to further restrict the kernel ABI to those threads after that
point.
This change adds a new seccomp syscall flag to SECCOMP_SET_MODE_FILTER for
synchronizing thread group seccomp filters at filter installation time.
When calling seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC,
filter) an attempt will be made to synchronize all threads in current's
threadgroup to its new seccomp filter program. This is possible iff all
threads are using a filter that is an ancestor to the filter current is
attempting to synchronize to. NULL filters (where the task is running as
SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be
transitioned into SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS,
...) has been set on the calling thread, no_new_privs will be set for
all synchronized threads too. On success, 0 is returned. On failure,
the pid of one of the failing threads will be returned and no filters
will have been applied.
The race conditions against another thread are:
- requesting TSYNC (already handled by sighand lock)
- performing a clone (already handled by sighand lock)
- changing its filter (already handled by sighand lock)
- calling exec (handled by cred_guard_mutex)
The clone case is assisted by the fact that new threads will have their
seccomp state duplicated from their parent before appearing on the tasklist.
Holding cred_guard_mutex means that seccomp filters cannot be assigned
while in the middle of another thread's exec (potentially bypassing
no_new_privs or similar). The call to de_thread() may kill threads waiting
for the mutex.
Changes across threads to the filter pointer includes a barrier.
Based on patches by Will Drewry.
Suggested-by: Julien Tinnes <jln@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Conflicts:
include/linux/seccomp.h
include/uapi/linux/seccomp.h
Since seccomp transitions between threads requires updates to the
no_new_privs flag to be atomic, the flag must be part of an atomic flag
set. This moves the nnp flag into a separate task field, and introduces
accessors.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Conflicts:
fs/exec.c
include/linux/sched.h
kernel/sys.c
With this change, calling
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
disables privilege granting operations at execve-time. For example, a
process will not be able to execute a setuid binary to change their uid
or gid if this bit is set. The same is true for file capabilities.
Additionally, LSM_UNSAFE_NO_NEW_PRIVS is defined to ensure that
LSMs respect the requested behavior.
To determine if the NO_NEW_PRIVS bit is set, a task may call
prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
It returns 1 if set and 0 if it is not set. If any of the arguments are
non-zero, it will return -1 and set errno to -EINVAL.
(PR_SET_NO_NEW_PRIVS behaves similarly.)
This functionality is desired for the proposed seccomp filter patch
series. By using PR_SET_NO_NEW_PRIVS, it allows a task to modify the
system call behavior for itself and its child tasks without being
able to impact the behavior of a more privileged task.
Another potential use is making certain privileged operations
unprivileged. For example, chroot may be considered "safe" if it cannot
affect privileged tasks.
Note, this patch causes execve to fail when PR_SET_NO_NEW_PRIVS is
set and AppArmor is in use. It is fixed in a subsequent patch.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Eric Paris <eparis@redhat.com>
v18: updated change desc
v17: using new define values as per 3.4
Conflicts:
include/linux/prctl.h
kernel/sys.c
Once we'd freed m->buf, m->count should become zero - we have no valid
contents reachable via m->buf.
Reported-by: Charley (Hao Chuan) Chu <charley.chu@broadcom.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ed Tam <etam@google.com>
This issue was first pointed out by Jiaxing Wang several months ago, but no
further comments:
https://lkml.org/lkml/2013/6/29/41
As we know pread() does not change f_pos, so after pread(), file->f_pos
and m->read_pos become different. And seq_lseek() does not update file->f_pos
if offset equals to m->read_pos, so after pread() and seq_lseek()(lseek to
m->read_pos), then a subsequent read may read from a wrong position, the
following program produces the problem:
char str1[32] = { 0 };
char str2[32] = { 0 };
int poffset = 10;
int count = 20;
/*open any seq file*/
int fd = open("/proc/modules", O_RDONLY);
pread(fd, str1, count, poffset);
printf("pread:%s\n", str1);
/*seek to where m->read_pos is*/
lseek(fd, poffset+count, SEEK_SET);
/*supposed to read from poffset+count, but this read from position 0*/
read(fd, str2, count);
printf("read:%s\n", str2);
out put:
pread:
ck_netbios_ns 12665
read:
nf_conntrack_netbios
/proc/modules:
nf_conntrack_netbios_ns 12665 0 - Live 0xffffffffa038b000
nf_conntrack_broadcast 12589 1 nf_conntrack_netbios_ns, Live 0xffffffffa0386000
So we always update file->f_pos to offset in seq_lseek() to fix this issue.
Conflicts:
fs/seq_file.c
Signed-off-by: Jiaxing Wang <hello.wjx@gmail.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Ed Tam <etam@google.com>
Userspace processes often have multiple allocators that each do
anonymous mmaps to get memory. When examining memory usage of
individual processes or systems as a whole, it is useful to be
able to break down the various heaps that were allocated by
each layer and examine their size, RSS, and physical memory
usage.
This patch adds a user pointer to the shared union in
vm_area_struct that points to a null terminated string inside
the user process containing a name for the vma. vmas that
point to the same address will be merged, but vmas that
point to equivalent strings at different addresses will
not be merged.
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
Setting the name to NULL clears it.
The names of named anonymous vmas are shown in /proc/pid/maps
as [anon:<name>] and in /proc/pid/smaps in a new "Name" field
that is only present for named vmas. If the userspace pointer
is no longer valid all or part of the name will be replaced
with "<fault>".
The idea to store a userspace pointer to reduce the complexity
within mm (at the expense of the complexity of reading
/proc/pid/mem) came from Dave Hansen. This results in no
runtime overhead in the mm subsystem other than comparing
the anon_name pointers when considering vma merging. The pointer
is stored in a union with fieds that are only used on file-backed
mappings, so it does not increase memory usage.
Change-Id: Ie2ffc0967d4ffe7ee4c70781313c7b00cf7e3092
Signed-off-by: Colin Cross <ccross@android.com>
Avoid waking up every thread sleeping in a select call during
suspend and resume by calling a freezable blocking call. Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.
This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.
Change-Id: I0d7565ec0b6bc5d44cb55f958589c56e6bd16348
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Avoid waking up every thread sleeping in an epoll_wait call during
suspend and resume by calling a freezable blocking call. Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.
This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.
Change-Id: I848d08d28c89302fd42bbbdfa76489a474ab27bf
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Iliyan Malchev <malchev@google.com>
Conflicts:
fs/eventpoll.c
CIFS calls wait_event_freezekillable_unsafe with a VFS lock held,
which is unsafe and will cause lockdep warnings when 6aa9707
"lockdep: check that no locks held at freeze time" is reapplied
(it was reverted in dbf520a). CIFS shouldn't be doing this, but
it has long-running syscalls that must hold a lock but also
shouldn't block suspend. Until CIFS freeze handling is rewritten
to use a signal to exit out of the critical section, add a new
wait_event_freezekillable_unsafe helper that will not run the
lockdep test when 6aa9707 is reapplied, and call it from CIFS.
In practice the likley result of holding the lock while freezing
is that a second task blocked on the lock will never freeze,
aborting suspend, but it is possible to manufacture a case using
the cgroup freezer, the lock, and the suspend freezer to create
a deadlock. Silencing the lockdep warning here will allow
problems to be found in other drivers that may have a more
serious deadlock risk, and prevent new problems from being added.
Change-Id: I420c5392bacf68e58e268293b2b36068ad4df753
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
NFS calls the freezable helpers with locks held, which is unsafe
and will cause lockdep warnings when 6aa9707 "lockdep: check
that no locks held at freeze time" is reapplied (it was reverted
in dbf520a). NFS shouldn't be doing this, but it has
long-running syscalls that must hold a lock but also shouldn't
block suspend. Until NFS freeze handling is rewritten to use a
signal to exit out of the critical section, add new *_unsafe
versions of the helpers that will not run the lockdep test when
6aa9707 is reapplied, and call them from NFS.
In practice the likley result of holding the lock while freezing
is that a second task blocked on the lock will never freeze,
aborting suspend, but it is possible to manufacture a case using
the cgroup freezer, the lock, and the suspend freezer to create
a deadlock. Silencing the lockdep warning here will allow
problems to be found in other drivers that may have a more
serious deadlock risk, and prevent new problems from being added.
Change-Id: Ia17d32cdd013a6517bdd5759da900970a4427170
Signed-off-by: Colin Cross <ccross@android.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Patch from Ken that can solve fsstress failing issue
commit b9fa7bb8ff207eeb27d2e0ed45b8c3acf1a7af8c
Author: Tao Ma <boyu.mt@taobao.com>
Date: Mon May 28 18:20:59 2012 -0400
ext4: protect group inode free counting with group lock
Now when we set the group inode free count, we don't have a proper
group lock so that multiple threads may decrease the inode free
count at the same time. And e2fsck will complain something like:
Free inodes count wrong for group #1 (1, counted=0).
Fix? no
Free inodes count wrong for group #2 (3, counted=0).
Fix? no
Directories count wrong for group #2 (780, counted=779).
Fix? no
Free inodes count wrong for group #3 (2272, counted=2273).
Fix? no
So this patch try to protect it with the ext4_lock_group.
btw, it is found by xfstests test case 269 and the volume is
mkfsed with the parameter
"-O ^resize_inode,^uninit_bg,extent,meta_bg,flex_bg,ext_attr"
and I have run it 100 times and the error in e2fsck doesn't
show up again.
Change-Id: Iba773843728759e1d64d4ff57765288eb5977665
Reviewed-on: http://mcrd1-5.corpnet.asus/code-review/master/67871
Reviewed-by: Lin Johnny1 <Johnny1_Lin@asus.com>
Tested-by: Lin Johnny1 <Johnny1_Lin@asus.com>
Reviewed-by: Sam hblee <Sam_hblee@asus.com>
In ext4_nonda_switch(), if the file system is getting full we used to
call writeback_inodes_sb_if_idle(). The problem is that we can be
holding i_mutex already, and this causes a potential deadlock when
writeback_inodes_sb_if_idle() when it tries to take s_umount. (See
lockdep output below).
As it turns out we don't need need to hold s_umount; the fact that we
are in the middle of the write(2) system call will keep the superblock
pinned. Unfortunately writeback_inodes_sb() checks to make sure
s_umount is taken, and the VFS uses a different mechanism for making
sure the file system doesn't get unmounted out from under us. The
simplest way of dealing with this is to just simply grab s_umount
using a trylock, and skip kicking the writeback flusher thread in the
very unlikely case that we can't take a read lock on s_umount without
blocking.
Also, we now check the cirteria for kicking the writeback thread
before we decide to whether to fall back to non-delayed writeback, so
if there are any outstanding delayed allocation writes, we try to get
them resolved as soon as possible.
[ INFO: possible circular locking dependency detected ]
3.6.0-rc1-00042-gce894ca #367 Not tainted
-------------------------------------------------------
dd/8298 is trying to acquire lock:
(&type->s_umount_key#18){++++..}, at: [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46
but task is already holding lock:
(&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3
which lock already depends on the new lock.
2 locks held by dd/8298:
#0: (sb_writers#2){.+.+.+}, at: [<c01ddcc5>] generic_file_aio_write+0x56/0xd3
#1: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3
stack backtrace:
Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca #367
Call Trace:
[<c015b79c>] ? console_unlock+0x345/0x372
[<c06d62a1>] print_circular_bug+0x190/0x19d
[<c019906c>] __lock_acquire+0x86d/0xb6c
[<c01999db>] ? mark_held_locks+0x5c/0x7b
[<c0199724>] lock_acquire+0x66/0xb9
[<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
[<c06db935>] down_read+0x28/0x58
[<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46
[<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46
[<c026f3b2>] ext4_nonda_switch+0xe1/0xf4
[<c0271ece>] ext4_da_write_begin+0x27/0x193
[<c01dcdb0>] generic_file_buffered_write+0xc8/0x1bb
[<c01ddc47>] __generic_file_aio_write+0x1dd/0x205
[<c01ddce7>] generic_file_aio_write+0x78/0xd3
[<c026d336>] ext4_file_write+0x480/0x4a6
[<c0198c1d>] ? __lock_acquire+0x41e/0xb6c
[<c0180944>] ? sched_clock_cpu+0x11a/0x13e
[<c01967e9>] ? trace_hardirqs_off+0xb/0xd
[<c018099f>] ? local_clock+0x37/0x4e
[<c0209f2c>] do_sync_write+0x67/0x9d
[<c0209ec5>] ? wait_on_retry_sync_kiocb+0x44/0x44
[<c020a7b9>] vfs_write+0x7b/0xe6
[<c020a9a6>] sys_write+0x3b/0x64
[<c06dd4bd>] syscall_call+0x7/0xb
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
Commit 080399aaaf ("block: don't mark buffers beyond end of disk as
mapped") exposed a bug in __getblk_slow that causes mount to hang as it
loops infinitely waiting for a buffer that lies beyond the end of the
disk to become uptodate.
The problem was initially reported by Torsten Hilbrich here:
https://lkml.org/lkml/2012/6/18/54
and also reported independently here:
http://www.sysresccd.org/forums/viewtopic.php?f=13&t=4511
and then Richard W.M. Jones and Marcos Mello noted a few separate
bugzillas also associated with the same issue. This patch has been
confirmed to fix:
https://bugzilla.redhat.com/show_bug.cgi?id=835019
The main problem is here, in __getblk_slow:
for (;;) {
struct buffer_head * bh;
int ret;
bh = __find_get_block(bdev, block, size);
if (bh)
return bh;
ret = grow_buffers(bdev, block, size);
if (ret < 0)
return NULL;
if (ret == 0)
free_more_memory();
}
__find_get_block does not find the block, since it will not be marked as
mapped, and so grow_buffers is called to fill in the buffers for the
associated page. I believe the for (;;) loop is there primarily to
retry in the case of memory pressure keeping grow_buffers from
succeeding. However, we also continue to loop for other cases, like the
block lying beond the end of the disk. So, the fix I came up with is to
only loop when grow_buffers fails due to memory allocation issues
(return value of 0).
The attached patch was tested by myself, Torsten, and Rich, and was
found to resolve the problem in call cases.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Reported-and-Tested-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Josh Boyer <jwboyer@redhat.com>
Cc: Stable <stable@vger.kernel.org> # 3.0+
[ Jens is on vacation, taking this directly - Linus ]
--
Stable Notes: this patch requires backport to 3.0, 3.2 and 3.3.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
IBM reported a deadlock in select_parent(). This was found to be caused
by taking rename_lock when already locked when restarting the tree
traversal.
There are two cases when the traversal needs to be restarted:
1) concurrent d_move(); this can only happen when not already locked,
since taking rename_lock protects against concurrent d_move().
2) racing with final d_put() on child just at the moment of ascending
to parent; rename_lock doesn't protect against this rare race, so it
can happen when already locked.
Because of case 2, we need to be able to handle restarting the traversal
when rename_lock is already held. This patch fixes all three callers of
try_to_ascend().
IBM reported that the deadlock is gone with this patch.
[ I rewrote the patch to be smaller and just do the "goto again" if the
lock was already held, but credit goes to Miklos for the real work.
- Linus ]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CMA features may ifdef out parts of the code with
CONFIG_CMA. Older code uses CONFIG_DMA_CMA. Switch
to using the newer CONFIG_CMA to ensure the code gets
compiled when needed.
Change-Id: I3cae639797787b4926a6c5e057de973b66196707
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: Neha Pandey <nehap@codeaurora.org>
The FUSE file system may hold references to pages for long
periods of time, preventing migration from occuring. If a CMA
page is used here, CMA allocations may fail. Work around this
by swapping out a CMA page for a non-CMA page when working with
the FUSE file system.
Change-Id: Id763ea833ee125c8732ae3759ec9e20d94aa8424
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: Mitchel Humpherys <mitchelh@codeaurora.org>
When a buffer is added to the LRU list, a reference is taken which is
not dropped until the buffer is evicted from the LRU list. This is the
correct behavior, however this LRU reference will prevent the buffer
from being dropped. This means that the buffer can't actually be dropped
until it is selected for eviction. There's no bound on the time spent
on the LRU list, which means that the buffer may be undroppable for
very long periods of time. Given that migration involves dropping
buffers, the associated page is now unmigratible for long periods of
time as well. CMA relies on being able to migrate a specific range
of pages, so these these types of failures make CMA significantly
less reliable, especially under high filesystem usage.
Rather than waiting for the LRU algorithm to eventually kick out
the buffer, explicitly remove the buffer from the LRU list when trying
to drop it. There is still the possibility that the buffer
could be added back on the list, but that indicates the buffer is
still in use and would probably have other 'in use' indicates to
prevent dropping.
Change-Id: I253f4ee2069e190c1115afc421dadd27a7fa87dc
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: Mitchel Humpherys <mitchelh@codeaurora.org>
Git commit 09a1d34f85 "nohz: Make idle/iowait counter update
conditional" introduced a bug in regard to cpu hotplug. The effect is
that the number of idle ticks in the cpu summary line in /proc/stat is
still counting ticks for offline cpus.
Reproduction is easy, just start a workload that keeps all cpus busy,
switch off one or more cpus and then watch the idle field in top.
On a dual-core with one cpu 100% busy and one offline cpu you will get
something like this:
%Cpu(s): 48.7 us, 1.3 sy, 0.0 ni, 50.0 id, 0.0 wa, 0.0 hi, 0.0 si,
%0.0 st
The problem is that an offline cpu still has ts->idle_active == 1.
To fix this we should make sure that the cpu is online when calling
get_cpu_idle_time_us and get_cpu_iowait_time_us.
[Srivatsa: Rebased to current mainline]
Change-Id: I53cd02fef784647e45abb4c99ff641e5e69a9d3e
Reported-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20121010061820.8999.57245.stgit@srivatsabhat.in.ibm.com
Cc: deepthi@linux.vnet.ibm.com
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In yaffs_rename(), d_entry->d_inode can be NULL if the
target directory doesn't exist or if there is any race
condition such as target directory being deleted while
renaming another directory to target directory name.
Avoid dereferencing d_inode in such cases.
CRs-Fixed: 360748
Change-Id: If95b4992f1056fea78f2e1bd54253cd5c8aac93d
Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
Some userspace applications use /proc/stat to determine how many CPUs
the system has. CPU hotplug can offline a CPU at runtime and causing the
offline CPU not present in /proc/stat if we only show online cpu in
/proc/stat.
Change-Id: I4fd0cfcdb174244044634389da2fbdef77744c19
Signed-off-by: Ajay Dudani <adudani@codeaurora.org>
The compiler warns that 'saved_nlink' declared in line 980
may be used uninitialized.
Change-Id: Ie725c0efe949213b1281a70c779a3167b1961327
Acked-by: Kaushik Sikdar <ksikdar@qualcomm.com>
Signed-off-by: Rohit Vaswani <rvaswani@codeaurora.org>
According to the current distributed security logic between
kernel-userspace, the kernel is not aware of the level of
security that a link-key provides when userspace responds
to the link key request. Adding a ioctl entry which will
update the kernel space auth_key's level of security as soon
as userspace responds to the link key request.
CRs-fixed: 264601
Change-Id: I6765cce92a6f8b761742d57ea94e81502f6e7fcf
Signed-off-by: NaveenKumar <naveenr@codeaurora.org>
Fix NR_IPI to be 7 instead of 6 because both googly and core add
an IPI.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Conflicts:
arch/arm/Kconfig
arch/arm/common/Makefile
arch/arm/include/asm/hardware/cache-l2x0.h
arch/arm/mm/cache-l2x0.c
arch/arm/mm/mmu.c
include/linux/wakelock.h
kernel/power/Kconfig
kernel/power/Makefile
kernel/power/main.c
kernel/power/power.h
Suspend attempts can abort when the FUSE daemon is already frozen
and a client is waiting uninterruptibly for a response, causing
freezing of tasks to fail.
Use the freeze-friendly wait API, but disregard other signals.
Change-Id: Icefb7e4bbc718ccb76bf3c04daaa5eeea7e0e63c
Signed-off-by: Todd Poynor <toddpoynor@google.com>
(cherry picked from commit d893d660b7)
If FAT formatted SD card gets removed without unmounting,
FAT file system may throw many kernel error messages which
could too much traffic for console driver and can sometimes
even cause the system to trigger watchdog timeout.
This patch converts the printk to printk_ratelimited to rate
limit the error messages from FAT fs.
Change-Id: I58b942f6714a8d3353478eb21139b8046ee3f875
Signed-off-by: Subhash Jadavani <subhashj@codeaurora.org>
(cherry picked from commit a872b71069d717ced1a2de642afa0693d1bb9448)
Cleancache requires a small amount of code to be added
to a filesystem's implementation so that clean page
cache pages from a filesystem of that type may be
recognized and stored in/retrieved from cleancache.
Change-Id: I94c3fc8817ab66e2c54f7b2c6c474dd2321d9806
Signed-off-by: Larry Bassel <lbassel@codeaurora.org>
(cherry picked from commit eb93161d9746b2aa0ac534d1da88a33480d21905)
Use deferable timer in background operations thread,
so that it won't cause unnecessary wakeups. Typically,
wakeups are seen in the range 60ms to 2secs (for HZ=100)
after the thread is scheduled out. In general, during this
delay the processor can go into sleep, if there is no other
activity. Since, the work done in this background operation
is not critical and can be handled as soon as when timer
expires and processor wakes up for other critical events,
we mark the timer responsible for wakeup of this thread
as deferable timer. Otherwise, the processor wakesup
unnecessarily to handle the background operations causing
higher power consumption in idle state.
Change-Id: Ic168525c6b33600ad23017d00ea9723cf8a738d2
Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
(cherry picked from commit 4e239c0cbfc9070ff0c006a92eaafd52243b47a0)
Pull block layer fixes from Jens Axboe:
"A few small, but important fixes. Most of them are marked for stable
as well
- Fix failure to release a semaphore on error path in mtip32xx.
- Fix crashable condition in bio_get_nr_vecs().
- Don't mark end-of-disk buffers as mapped, limit it to i_size.
- Fix for build problem with CONFIG_BLOCK=n on arm at least.
- Fix for a buffer overlow on UUID partition printing.
- Trivial removal of unused variables in dac960."
* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix buffer overflow when printing partition UUIDs
Fix blkdev.h build errors when BLOCK=n
bio allocation failure due to bio_get_nr_vecs()
block: don't mark buffers beyond end of disk as mapped
mtip32xx: release the semaphore on an error path
dac960: Remove unused variables from DAC960_CreateProcEntries()
Merge misc fixes from Andrew Morton.
* emailed from Andrew Morton <akpm@linux-foundation.org>: (4 patches)
frv: delete incorrect task prototypes causing compile fail
slub: missing test for partial pages flush work in flush_all()
fs, proc: fix ABBA deadlock in case of execution attempt of map_files/ entries
drivers/rtc/rtc-pl031.c: configure correct wday for 2000-01-01
Instead of doing the i_mode calculations at proc_fd_instantiate() time,
move them into tid_fd_revalidate(), which is where the other inode state
(notably uid/gid information) is updated too.
Otherwise we'll end up with stale i_mode information if an fd is re-used
while the dentry still hangs around. Not that anything really *cares*
(symlink permissions don't really matter), but Tetsuo Handa noticed that
the owner read/write bits don't always match the state of the
readability of the file descriptor, and we _used_ to get this right a
long time ago in a galaxy far, far away.
Besides, aside from fixing an ugly detail (that has apparently been this
way since commit 61a2878402: "proc: Remove the hard coded inode
numbers" in 2006), this removes more lines of code than it adds. And it
just makes sense to update i_mode in the same place we update i_uid/gid.
Al Viro correctly points out that we could just do the inode fill in the
inode iops ->getattr() function instead. However, that does require
somewhat slightly more invasive changes, and adds yet *another* lookup
of the file descriptor. We need to do the revalidate() for other
reasons anyway, and have the file descriptor handy, so we might as well
fill in the information at this point.
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
map_files/ entries are never supposed to be executed, still curious
minds might try to run them, which leads to the following deadlock
======================================================
[ INFO: possible circular locking dependency detected ]
3.4.0-rc4-24406-g841e6a6 #121 Not tainted
-------------------------------------------------------
bash/1556 is trying to acquire lock:
(&sb->s_type->i_mutex_key#8){+.+.+.}, at: do_lookup+0x267/0x2b1
but task is already holding lock:
(&sig->cred_guard_mutex){+.+.+.}, at: prepare_bprm_creds+0x2d/0x69
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&sig->cred_guard_mutex){+.+.+.}:
validate_chain+0x444/0x4f4
__lock_acquire+0x387/0x3f8
lock_acquire+0x12b/0x158
__mutex_lock_common+0x56/0x3a9
mutex_lock_killable_nested+0x40/0x45
lock_trace+0x24/0x59
proc_map_files_lookup+0x5a/0x165
__lookup_hash+0x52/0x73
do_lookup+0x276/0x2b1
walk_component+0x3d/0x114
do_last+0xfc/0x540
path_openat+0xd3/0x306
do_filp_open+0x3d/0x89
do_sys_open+0x74/0x106
sys_open+0x21/0x23
tracesys+0xdd/0xe2
-> #0 (&sb->s_type->i_mutex_key#8){+.+.+.}:
check_prev_add+0x6a/0x1ef
validate_chain+0x444/0x4f4
__lock_acquire+0x387/0x3f8
lock_acquire+0x12b/0x158
__mutex_lock_common+0x56/0x3a9
mutex_lock_nested+0x40/0x45
do_lookup+0x267/0x2b1
walk_component+0x3d/0x114
link_path_walk+0x1f9/0x48f
path_openat+0xb6/0x306
do_filp_open+0x3d/0x89
open_exec+0x25/0xa0
do_execve_common+0xea/0x2f9
do_execve+0x43/0x45
sys_execve+0x43/0x5a
stub_execve+0x6c/0xc0
This is because prepare_bprm_creds grabs task->signal->cred_guard_mutex
and when do_lookup happens we try to grab task->signal->cred_guard_mutex
again in lock_trace.
Fix it using plain ptrace_may_access() helper in proc_map_files_lookup()
and in proc_map_files_readdir() instead of lock_trace(), the caller must
be CAP_SYS_ADMIN granted anyway.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
...and add a "directio" synonym since that's what the manpage has
always advertised.
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
- Fix a lock ordering deadlock in JFFS2
- Fix an oops in the dataflash driver, triggered by a dummy call to test
whether it has OTP functionality.
- Fix request_mem_region() failure on amsdelta NAND driver.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iEYEABECAAYFAk+vekgACgkQdwG7hYl686N8bQCfdizsFrliKbDW20R/pO66NoAV
aloAn0ln+mwe3rIdNt8qKynW8e8dbudF
=R7XS
-----END PGP SIGNATURE-----
Merge tag 'for-linus-3.4-20120513' of git://git.infradead.org/linux-mtd
Pull three MTD fixes from David Woodhouse:
- Fix a lock ordering deadlock in JFFS2
- Fix an oops in the dataflash driver, triggered by a dummy call to test
whether it has OTP functionality.
- Fix request_mem_region() failure on amsdelta NAND driver.
* tag 'for-linus-3.4-20120513' of git://git.infradead.org/linux-mtd:
mtd: ams-delta: fix request_mem_region() failure
jffs2: Fix lock acquisition order bug in gc path
mtd: fix oops in dataflash driver
The number of bio_get_nr_vecs() is passed down via bio_alloc() to
bvec_alloc_bs(), which fails the bio allocation if
nr_iovecs > BIO_MAX_PAGES. For the underlying caller this causes an
unexpected bio allocation failure.
Limiting to queue_max_segments() is not sufficient, as max_segments
also might be very large.
bvec_alloc_bs(gfp_mask, nr_iovecs, ) => NULL when nr_iovecs > BIO_MAX_PAGES
bio_alloc_bioset(gfp_mask, nr_iovecs, ...)
bio_alloc(GFP_NOIO, nvecs)
xfs_alloc_ioend_bio()
Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hi,
We have a bug report open where a squashfs image mounted on ppc64 would
exhibit errors due to trying to read beyond the end of the disk. It can
easily be reproduced by doing the following:
[root@ibm-p750e-02-lp3 ~]# ls -l install.img
-rw-r--r-- 1 root root 142032896 Apr 30 16:46 install.img
[root@ibm-p750e-02-lp3 ~]# mount -o loop ./install.img /mnt/test
[root@ibm-p750e-02-lp3 ~]# dd if=/dev/loop0 of=/dev/null
dd: reading `/dev/loop0': Input/output error
277376+0 records in
277376+0 records out
142016512 bytes (142 MB) copied, 0.9465 s, 150 MB/s
In dmesg, you'll find the following:
squashfs: version 4.0 (2009/01/31) Phillip Lougher
[ 43.106012] attempt to access beyond end of device
[ 43.106029] loop0: rw=0, want=277410, limit=277408
[ 43.106039] Buffer I/O error on device loop0, logical block 138704
[ 43.106053] attempt to access beyond end of device
[ 43.106057] loop0: rw=0, want=277412, limit=277408
[ 43.106061] Buffer I/O error on device loop0, logical block 138705
[ 43.106066] attempt to access beyond end of device
[ 43.106070] loop0: rw=0, want=277414, limit=277408
[ 43.106073] Buffer I/O error on device loop0, logical block 138706
[ 43.106078] attempt to access beyond end of device
[ 43.106081] loop0: rw=0, want=277416, limit=277408
[ 43.106085] Buffer I/O error on device loop0, logical block 138707
[ 43.106089] attempt to access beyond end of device
[ 43.106093] loop0: rw=0, want=277418, limit=277408
[ 43.106096] Buffer I/O error on device loop0, logical block 138708
[ 43.106101] attempt to access beyond end of device
[ 43.106104] loop0: rw=0, want=277420, limit=277408
[ 43.106108] Buffer I/O error on device loop0, logical block 138709
[ 43.106112] attempt to access beyond end of device
[ 43.106116] loop0: rw=0, want=277422, limit=277408
[ 43.106120] Buffer I/O error on device loop0, logical block 138710
[ 43.106124] attempt to access beyond end of device
[ 43.106128] loop0: rw=0, want=277424, limit=277408
[ 43.106131] Buffer I/O error on device loop0, logical block 138711
[ 43.106135] attempt to access beyond end of device
[ 43.106139] loop0: rw=0, want=277426, limit=277408
[ 43.106143] Buffer I/O error on device loop0, logical block 138712
[ 43.106147] attempt to access beyond end of device
[ 43.106151] loop0: rw=0, want=277428, limit=277408
[ 43.106154] Buffer I/O error on device loop0, logical block 138713
[ 43.106158] attempt to access beyond end of device
[ 43.106162] loop0: rw=0, want=277430, limit=277408
[ 43.106166] attempt to access beyond end of device
[ 43.106169] loop0: rw=0, want=277432, limit=277408
...
[ 43.106307] attempt to access beyond end of device
[ 43.106311] loop0: rw=0, want=277470, limit=2774
Squashfs manages to read in the end block(s) of the disk during the
mount operation. Then, when dd reads the block device, it leads to
block_read_full_page being called with buffers that are beyond end of
disk, but are marked as mapped. Thus, it would end up submitting read
I/O against them, resulting in the errors mentioned above. I fixed the
problem by modifying init_page_buffers to only set the buffer mapped if
it fell inside of i_size.
Cheers,
Jeff
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Nick Piggin <npiggin@kernel.dk>
--
Changes from v1->v2: re-used max_block, as suggested by Nick Piggin.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge misc fixes from Andrew Morton.
* emailed from Andrew Morton <akpm@linux-foundation.org>: (8 patches)
MAINTAINERS: add maintainer for LED subsystem
mm: nobootmem: fix sign extend problem in __free_pages_memory()
drivers/leds: correct __devexit annotations
memcg: free spare array to avoid memory leak
namespaces, pid_ns: fix leakage on fork() failure
hugetlb: prevent BUG_ON in hugetlb_fault() -> hugetlb_cow()
mm: fix division by 0 in percpu_pagelist_fraction()
proc/pid/pagemap: correctly report non-present ptes and holes between vmas
Reset the current pagemap-entry if the current pte isn't present, or if
current vma is over. Otherwise pagemap reports last entry again and
again.
Non-present pte reporting was broken in commit 092b50bacd ("pagemap:
introduce data structure for pagemap entry")
Reporting for holes was broken in commit 5aaabe831e ("pagemap: avoid
splitting thp when reading /proc/pid/pagemap")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reported-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This test is always true so it means we revalidate the length every
time, which generates more network traffic. When it is SEEK_SET or
SEEK_CUR, then we don't need to revalidate.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The locking policy is such that the erase_complete_block spinlock is
nested within the alloc_sem mutex. This fixes a case in which the
acquisition order was erroneously reversed. This issue was caught by
the following lockdep splat:
=======================================================
[ INFO: possible circular locking dependency detected ]
3.0.5 #1
-------------------------------------------------------
jffs2_gcd_mtd6/299 is trying to acquire lock:
(&c->alloc_sem){+.+.+.}, at: [<c01f7714>] jffs2_garbage_collect_pass+0x314/0x890
but task is already holding lock:
(&(&c->erase_completion_lock)->rlock){+.+...}, at: [<c01f7708>] jffs2_garbage_collect_pass+0x308/0x890
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&(&c->erase_completion_lock)->rlock){+.+...}:
[<c008bec4>] validate_chain+0xe6c/0x10bc
[<c008c660>] __lock_acquire+0x54c/0xba4
[<c008d240>] lock_acquire+0xa4/0x114
[<c046780c>] _raw_spin_lock+0x3c/0x4c
[<c01f744c>] jffs2_garbage_collect_pass+0x4c/0x890
[<c01f937c>] jffs2_garbage_collect_thread+0x1b4/0x1cc
[<c0071a68>] kthread+0x98/0xa0
[<c000f264>] kernel_thread_exit+0x0/0x8
-> #0 (&c->alloc_sem){+.+.+.}:
[<c008ad2c>] print_circular_bug+0x70/0x2c4
[<c008c08c>] validate_chain+0x1034/0x10bc
[<c008c660>] __lock_acquire+0x54c/0xba4
[<c008d240>] lock_acquire+0xa4/0x114
[<c0466628>] mutex_lock_nested+0x74/0x33c
[<c01f7714>] jffs2_garbage_collect_pass+0x314/0x890
[<c01f937c>] jffs2_garbage_collect_thread+0x1b4/0x1cc
[<c0071a68>] kthread+0x98/0xa0
[<c000f264>] kernel_thread_exit+0x0/0x8
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&(&c->erase_completion_lock)->rlock);
lock(&c->alloc_sem);
lock(&(&c->erase_completion_lock)->rlock);
lock(&c->alloc_sem);
*** DEADLOCK ***
1 lock held by jffs2_gcd_mtd6/299:
#0: (&(&c->erase_completion_lock)->rlock){+.+...}, at: [<c01f7708>] jffs2_garbage_collect_pass+0x308/0x890
stack backtrace:
[<c00155dc>] (unwind_backtrace+0x0/0x100) from [<c0463dc0>] (dump_stack+0x20/0x24)
[<c0463dc0>] (dump_stack+0x20/0x24) from [<c008ae84>] (print_circular_bug+0x1c8/0x2c4)
[<c008ae84>] (print_circular_bug+0x1c8/0x2c4) from [<c008c08c>] (validate_chain+0x1034/0x10bc)
[<c008c08c>] (validate_chain+0x1034/0x10bc) from [<c008c660>] (__lock_acquire+0x54c/0xba4)
[<c008c660>] (__lock_acquire+0x54c/0xba4) from [<c008d240>] (lock_acquire+0xa4/0x114)
[<c008d240>] (lock_acquire+0xa4/0x114) from [<c0466628>] (mutex_lock_nested+0x74/0x33c)
[<c0466628>] (mutex_lock_nested+0x74/0x33c) from [<c01f7714>] (jffs2_garbage_collect_pass+0x314/0x890)
[<c01f7714>] (jffs2_garbage_collect_pass+0x314/0x890) from [<c01f937c>] (jffs2_garbage_collect_thread+0x1b4/0x1cc)
[<c01f937c>] (jffs2_garbage_collect_thread+0x1b4/0x1cc) from [<c0071a68>] (kthread+0x98/0xa0)
[<c0071a68>] (kthread+0x98/0xa0) from [<c000f264>] (kernel_thread_exit+0x0/0x8)
This was introduce in '81cfc9f jffs2: Fix serious write stall due to erase'.
Cc: stable@kernel.org [2.6.37+]
Signed-off-by: Josh Cartwright <joshc@linux.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Pull btrfs fixes from Chris Mason:
"The big ones here are a memory leak we introduced in rc1, and a
scheduling while atomic if the transid on disk doesn't match the
transid we expected. This happens for corrupt blocks, or out of date
disks.
It also fixes up the ioctl definition for our ioctl to resolve logical
inode numbers. The __u32 was a merging error and doesn't match what
we ship in the progs."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: avoid sleeping in verify_parent_transid while atomic
Btrfs: fix crash in scrub repair code when device is missing
btrfs: Fix mismatching struct members in ioctl.h
Btrfs: fix page leak when allocing extent buffers
Btrfs: Add properly locking around add_root_to_dirty_list
verify_parent_transid needs to lock the extent range to make
sure no IO is underway, and so it can safely clear the
uptodate bits if our checks fail.
But, a few callers are using it with spinlocks held. Most
of the time, the generation numbers are going to match, and
we don't want to switch to a blocking lock just for the error
case. This adds an atomic flag to verify_parent_transid,
and changes it to return EAGAIN if it needs to block to
properly verifiy things.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Commit ec81aecb29 ("hfs: fix a potential buffer overflow") fixed a few
potential buffer overflows in the hfs filesystem. But as Timo Warns
pointed out, these changes also need to be made on the hfsplus
filesystem as well.
Reported-by: Timo Warns <warns@pre-sense.de>
Acked-by: WANG Cong <amwang@redhat.com>
Cc: Alexey Khoroshilov <khoroshilov@ispras.ru>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Sage Weil <sage@newdream.net>
Cc: Eugene Teo <eteo@redhat.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: stable <stable@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull CIFS fixes from Steve French.
* git://git.samba.org/sfrench/cifs-2.6:
fs/cifs: fix parsing of dfs referrals
cifs: make sure we ignore the credentials= and cred= options
[CIFS] Update cifs version to 1.78
cifs - check S_AUTOMOUNT in revalidate
cifs: add missing initialization of server->req_lock
cifs: don't cap ra_pages at the same level as default_backing_dev_info
CIFS: Fix indentation in cifs_show_options
Fix that when scrub tries to repair an I/O or checksum error and one of
the devices containing the mirror is missing, it crashes in bio_add_page
because the bdev is a NULL pointer for missing devices.
Reported-by: Marco L. Crociani <marco.crociani@gmail.com>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Fix the size members of btrfs_ioctl_ino_path_args and
btrfs_ioctl_logical_ino_args. The user space btrfs-progs utilities used
__u64 and the kernel headers used __u32 before.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we happen to alloc a extent buffer and then alloc a page and notice that
page is already attached to an extent buffer, we will only unlock it and
free our existing eb. Any pages currently attached to that eb will be
properly freed, but we don't do the page_cache_release() on the page where
we noticed the other extent buffer which can cause us to leak pages and I
hope cause the weird issues we've been seeing in this area. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
add_root_to_dirty_list happens once at the very beginning of the
transaction, but it is still racey.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The problem was that the first referral was parsed more than once
and so the caller tried the same referrals multiple times.
The problem was introduced partly by commit
066ce68994,
where 'ref += le16_to_cpu(ref->Size);' got lost,
but that was also wrong...
Cc: <stable@vger.kernel.org>
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Tested-by: Björn Jacke <bj@sernet.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
It turns out that there are more cases than CONFIG_DEBUG_PAGEALLOC that
can have holes in the kernel address space: it seems to happen easily
with Xen, and it looks like the AMD gart64 code will also punch holes
dynamically.
Actually hitting that case is still very unlikely, so just do the
access, and take an exception and fix it up for the very unlikely case
of it being a page-crosser with no next page.
And hey, this abstraction might even help other architectures that have
other issues with unaligned word accesses than the possible missing next
page. IOW, this could do the byte order magic too.
Peter Anvin fixed a thinko in the shifting for the exception case.
Reported-and-tested-by: Jana Saout <jana@saout.de>
Cc: Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Older mount.cifs programs passed this on to the kernel after parsing
the file. Make sure the kernel ignores that option.
Should fix:
https://bugzilla.kernel.org/show_bug.cgi?id=43195
Cc: Sachin Prabhu <sprabhu@redhat.com>
Reported-by: Ronald <ronald645@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
When revalidating a dentry, if the inode wasn't known to be a dfs
entry when the dentry was instantiated, such as when created via
->readdir(), the DCACHE_NEED_AUTOMOUNT flag needs to be set on the
dentry in ->d_revalidate().
The false return from cifs_d_revalidate(), due to the inode now
being marked with the S_AUTOMOUNT flag, might not invalidate the
dentry if there is a concurrent unlazy path walk. This is because
the dentry reference count will be at least 2 in this case causing
d_invalidate() to return EBUSY. So the asumption that the dentry
will be discarded then correctly instantiated via ->lookup() might
not hold.
Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Cc: Steve French <smfrench@gmail.com>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <sfrench@us.ibm.com>
Highlights include:
- Fixes for the NFSv4 security negotiation
- Use the correct hostname when mounting from a private namespace
- NFS net namespace bugfixes for the pipefs filesystem
- NFSv4 GETACL bugfixes
- IPv6 bugfix for NFSv4 referrals
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJPoK8MAAoJEGcL54qWCgDyr4AP/1cSY4ZjaZwZm1l9M1l1RBtx
zBBE6RfM+4eKqwAzFNaIjLjslLMMkTV0TsARYG/CQrJ4DuonHDkdGMwXdTgWFYNN
AuVO50QKTy+8j2PqY5t84/d6WrFrxbCckKyhixb4/uHtl6mB2jdICA7xLWa4hndS
kPhRYZQt4zs+Db7Y66nXCLnpWaWoR34ZNxbpoCTLLyYIiUOTplfSfJ21bVZWN3Pt
M5BYUdKDfgDV15V1/UqULL9j3xnrgFsOK9DjiHEXppXZYfEqfwmEMg9ZQw2AfAm1
HcrcVv3YTa0I4ag3s/IeZ7wot8PJPOMQzVnzvD2FIO8FX+9vkkYQ3BwoQSVv21Ar
hgywkT/MMlz9mCDqpjJQVgTaNq4AOoFBF5MXQz9KLWSdummjZs3ILMkpV7Ze3qpj
Q6GEgii5Xr+Pj/D5D5W3gvkcztDhn3ziSv7fuL5fEADfrP6tYxNmLlP1MKPzrtJn
SP7WnkmcuWXdvfnKAeOeqAsrvDuaNoHRjtNmfe1PAajUWcvVuLidYhi84dtRYvBe
N4ukQGqerBoHN3nYhQHl0p9arXA6mAdb2Y9Pt9FY3nraA7e+oJWaEfq1vuFEgF8s
et8mDrGYpVN155qUvCBGNIwyQXgGt6LLhBZVF9OJa59JfRPDkagaIaTVPlhKJm/Q
Mbx7dfpGgDU+aLipyv2Q
=vLBv
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.4-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
- Fixes for the NFSv4 security negotiation
- Use the correct hostname when mounting from a private namespace
- NFS net namespace bugfixes for the pipefs filesystem
- NFSv4 GETACL bugfixes
- IPv6 bugfix for NFSv4 referrals
* tag 'nfs-for-3.4-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4.1: Use the correct hostname in the client identifier string
SUNRPC: RPC client must use the current utsname hostname string
NFS: get module in idmap PipeFS notifier callback
NFS: Remove unused function nfs_lookup_with_sec()
NFS: Honor the authflavor set in the clone mount data
NFS: Fix following referral mount points with different security
NFS: Do secinfo as part of lookup
NFS: Handle exceptions coming out of nfs4_proc_fs_locations()
NFS: Fix SECINFO_NO_NAME
SUNRPC: traverse clients tree on PipeFS event
SUNRPC: set per-net PipeFS superblock before notification
SUNRPC: skip clients with program without PipeFS entries
SUNRPC: skip dead but not buried clients on PipeFS events
Avoid beyond bounds copy while caching ACL
Avoid reading past buffer when calling GETACL
fix page number calculation bug for block layout decode buffer
NFSv4.1 fix page number calculation bug for filelayout decode buffers
pnfs-obj: Remove unused variable from objlayout_get_deviceinfo()
nfs4: fix referrals on mounts that use IPv6 addrs
While testing, I've found that even when we are able to negotiate a
much larger rsize with the server, on-the-wire reads often end up being
capped at 128k because of ra_pages being capped at that level.
Lifting this restriction gave almost a twofold increase in sequential
read performance on my craptactular KVM test rig with a 1M rsize.
I think this is safe since the actual ra_pages that the VM requests
is run through max_sane_readahead() prior to submitting the I/O. Under
memory pressure we should end up with large readahead requests being
suppressed anyway.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Trivial patch which fixes a misplaced tab in cifs_show_options().
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Fix printk format warnings -- both items are size_t,
so use %zu to print them.
fs/nfsd/nfs4recover.c:580:3: warning: format '%lu' expects type 'long unsigned int', but argument 3 has type 'size_t'
fs/nfsd/nfs4recover.c:580:3: warning: format '%lu' expects type 'long unsigned int', but argument 4 has type 'unsigned int'
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to use the hostname of the process that created the nfs_client.
That hostname is now stored in the rpc_client->cl_nodename.
Also remove the utsname()->domainname component. There is no reason
to include the NIS/YP domainname in a client identifier string.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The autofs packet size has had a very unfortunate size problem on x86:
because the alignment of 'u64' differs in 32-bit and 64-bit modes, and
because the packet data was not 8-byte aligned, the size of the autofsv5
packet structure differed between 32-bit and 64-bit modes despite
looking otherwise identical (300 vs 304 bytes respectively).
We first fixed that up by making the 64-bit compat mode know about this
problem in commit a32744d4ab ("autofs: work around unhappy compat
problem on x86-64"), and that made a 32-bit 'systemd' work happily on a
64-bit kernel because everything then worked the same way as on a 32-bit
kernel.
But it turned out that 'automount' had actually known and worked around
this problem in user space, so fixing the kernel to do the proper 32-bit
compatibility handling actually *broke* 32-bit automount on a 64-bit
kernel, because it knew that the packet sizes were wrong and expected
those incorrect sizes.
As a result, we ended up reverting that compatibility mode fix, and
thus breaking systemd again, in commit fcbf94b9de.
With both automount and systemd doing a single read() system call, and
verifying that they get *exactly* the size they expect but using
different sizes, it seemed that fixing one of them inevitably seemed to
break the other. At one point, a patch I seriously considered applying
from Michael Tokarev did a "strcmp()" to see if it was automount that
was doing the operation. Ugly, ugly.
However, a prettier solution exists now thanks to the packetized pipe
mode. By marking the communication pipe as being packetized (by simply
setting the O_DIRECT flag), we can always just write the bigger packet
size, and if user-space does a smaller read, it will just get that
partial end result and the extra alignment padding will simply be thrown
away.
This makes both automount and systemd happy, since they now get the size
they asked for, and the kernel side of autofs simply no longer needs to
care - it could pad out the packet arbitrarily.
Of course, if there is some *other* user of autofs (please, please,
please tell me it ain't so - and we haven't heard of any) that tries to
read the packets with multiple writes, that other user will now be
broken - the whole point of the packetized mode is that one system call
gets exactly one packet, and you cannot read a packet in pieces.
Tested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Miller <davem@davemloft.net>
Cc: Ian Kent <raven@themaw.net>
Cc: Thomas Meyer <thomas@m3y3r.de>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The actual internal pipe implementation is already really about
individual packets (called "pipe buffers"), and this simply exposes that
as a special packetized mode.
When we are in the packetized mode (marked by O_DIRECT as suggested by
Alan Cox), a write() on a pipe will not merge the new data with previous
writes, so each write will get a pipe buffer of its own. The pipe
buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn
will tell the reader side to break the read at that boundary (and throw
away any partial packet contents that do not fit in the read buffer).
End result: as long as you do writes less than PIPE_BUF in size (so that
the pipe doesn't have to split them up), you can now treat the pipe as a
packet interface, where each read() system call will read one packet at
a time. You can just use a sufficiently big read buffer (PIPE_BUF is
sufficient, since bigger than that doesn't guarantee atomicity anyway),
and the return value of the read() will naturally give you the size of
the packet.
NOTE! We do not support zero-sized packets, and zero-sized reads and
writes to a pipe continue to be no-ops. Also note that big packets will
currently be split at write time, but that the size at which that
happens is not really specified (except that it's bigger than PIPE_BUF).
Currently that limit is the system page size, but we might want to
explicitly support bigger packets some day.
The main user for this is going to be the autofs packet interface,
allowing us to stop having to care so deeply about exact packet sizes
(which have had bugs with 32/64-bit compatibility modes). But user
space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will
fail with an EINVAL on kernels that do not support this interface.
Tested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Miller <davem@davemloft.net>
Cc: Ian Kent <raven@themaw.net>
Cc: Thomas Meyer <thomas@m3y3r.de>
Cc: stable@kernel.org # needed for systemd/autofs interaction fix
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is bug fix.
Notifier callback is called from SUNRPC module. So before dereferencing NFS
module we have to make sure, that it's alive.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pull btrfs fixes from Chris Mason:
"This has our collection of bug fixes. I missed the last rc because I
thought our patches were making NFS crash during my xfs test runs.
Turns out it was an NFS client bug fixed by someone else while I tried
to bisect it.
All of these fixes are small, but some are fairly high impact. The
biggest are fixes for our mount -o remount handling, a deadlock due to
GFP_KERNEL allocations in readdir, and a RAID10 error handling bug.
This was tested against both 3.3 and Linus' master as of this morning."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (26 commits)
Btrfs: reduce lock contention during extent insertion
Btrfs: avoid deadlocks from GFP_KERNEL allocations during btrfs_real_readdir
Btrfs: Fix space checking during fs resize
Btrfs: fix block_rsv and space_info lock ordering
Btrfs: Prevent root_list corruption
Btrfs: fix repair code for RAID10
Btrfs: do not start delalloc inodes during sync
Btrfs: fix that check_int_data mount option was ignored
Btrfs: don't count CRC or header errors twice while scrubbing
Btrfs: fix btrfs_ioctl_dev_info() crash on missing device
btrfs: don't return EINTR
Btrfs: double unlock bug in error handling
Btrfs: always store the mirror we read the eb from
fs/btrfs/volumes.c: add missing free_fs_devices
btrfs: fix early abort in 'remount'
Btrfs: fix max chunk size check in chunk allocator
Btrfs: add missing read locks in backref.c
Btrfs: don't call free_extent_buffer twice in iterate_irefs
Btrfs: Make free_ipath() deal gracefully with NULL pointers
Btrfs: avoid possible use-after-free in clear_extent_bit()
...
This reverts commit a32744d4ab.
While that commit was technically the right thing to do, and made the
x86-64 compat mode work identically to native 32-bit mode (and thus
fixing the problem with a 32-bit systemd install on a 64-bit kernel), it
turns out that the automount binaries had workarounds for this compat
problem.
Now, the workarounds are disgusting: doing an "uname()" to find out the
architecture of the kernel, and then comparing it for the 64-bit cases
and fixing up the size of the read() in automount for those. And they
were confused: it's not actually a generic 64-bit issue at all, it's
very much tied to just x86-64, which has different alignment for an
'u64' in 64-bit mode than in 32-bit mode.
But the end result is that fixing the compat layer actually breaks the
case of a 32-bit automount on a x86-64 kernel.
There are various approaches to fix this (including just doing a
"strcmp()" on current->comm and comparing it to "automount"), but I
think that I will do the one that teaches pipes about a special "packet
mode", which will allow user space to not have to care too deeply about
the padding at the end of the autofs packet.
That change will make the compat workaround unnecessary, so let's revert
it first, and get automount working again in compat mode. The
packetized pipes will then fix autofs for systemd.
Reported-and-requested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: Ian Kent <raven@themaw.net>
Cc: stable@kernel.org # for 3.3
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull CIFS fixes from Steve French.
* git://git.samba.org/sfrench/cifs-2.6:
Use correct conversion specifiers in cifs_show_options
CIFS: Show backupuid/gid in /proc/mounts
cifs: fix offset handling in cifs_iovec_write
We're spending huge amounts of time on lock contention during
end_io processing because we unconditionally assume we are overwriting
an existing extent in the file for each IO.
This checks to see if we are outside i_size, and if so, it uses a
less expensive readonly search of the btree to look for existing
extents.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs has an optimization where it will preallocate dentries during
readdir to fill in enough information to open the inode without an extra
lookup.
But, we're calling d_alloc, which is doing GFP_KERNEL allocations, and
that leads to deadlocks because our readdir code has tree locks held.
For now, disable this optimization. We'll fix the gfp mask in the next
merge window.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The authflavor is set in an nfs_clone_mount structure and passed to the
xdev_mount() functions where it was promptly ignored. Instead, use it
to initialize an rpc_clnt for the cloned server.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I create a new proc_lookup_mountpoint() to use when submounting an NFS
v4 share. This function returns an rpc_clnt to use for performing an
fs_locations() call on a referral's mountpoint.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Whenever lookup sees wrongsec do a secinfo and retry the lookup to find
attributes of the file or directory, such as "is this a referral
mountpoint?". This also allows me to remove handling -NFS4ERR_WRONSEC
as part of getattr xdr decoding.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We don't want to return -NFS4ERR_WRONGSEC to the VFS because it could
cause the kernel to oops.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
I was using the same decoder function for SECINFO and SECINFO_NO_NAME,
so it was returning an error when it tried to decode an OP_SECINFO_NO_NAME
header as OP_SECINFO.
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When attempting to cache ACLs returned from the server, if the bitmap
size + the ACL size is greater than a PAGE_SIZE but the ACL size itself
is smaller than a PAGE_SIZE, we can read past the buffer page boundary.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reported-by: Jian Li <jiali@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix out-of-space checking, addressing a warning and potential resource
leak when resizing the filesystem down while allocating blocks.
Signed-off-by: Daniel J Blueman <daniel@quora.org>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
may_commit_transaction() calls
spin_lock(&space_info->lock);
spin_lock(&delayed_rsv->lock);
and update_global_block_rsv() calls
spin_lock(&block_rsv->lock);
spin_lock(&sinfo->lock);
Lockdep complains about this at run time.
Everywhere except in update_global_block_rsv(), the space_info lock is
the outer lock, therefore the locking order in update_global_block_rsv()
is changed.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I was seeing root_list corruption on unmount during fs resize in 3.4-rc4; add
correct locking to address this.
Signed-off-by: Daniel J Blueman <daniel@quora.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_map_block sets mirror_num, so that the repair code knows eventually
which device gave us the read error. For RAID10, mirror_num must be 1 or 2.
Before this fix mirror_num was incorrectly related to our stripe index.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_start_delalloc_inodes will just walk the list of delalloc inodes and
start writing them out, but it doesn't splice the list or anything so as
long as somebody is doing work on the box you could end up in this section
_forever_. So just remove it, it's not needed anyway since sync will start
writeback on all inodes anyway, all we need to do is wait for ordered
extents and then we can commit the transaction. In my horrible torture test
sync goes from taking 4 minutes to about 1.5 minutes. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Bug noticed in commit
bf118a342f
When calling GETACL, if the size of the bitmap array, the length
attribute and the acl returned by the server is greater than the
allocated buffer(args.acl_len), we can Oops with a General Protection
fault at _copy_from_pages() when we attempt to read past the pages
allocated.
This patch allocates an extra PAGE for the bitmap and checks to see that
the bitmap + attribute_length + ACLs don't exceed the buffer space
allocated to it.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reported-by: Jian Li <jiali@redhat.com>
[Trond: Fixed a size_t vs unsigned int printk() warning]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Merge fixes from Andrew Morton:
"13 fixes. The acerhdf patches aren't (really) fixes. But they've
been stuck in my tree for up to two years, sent to Matthew multiple
times and the developers are unhappy."
* emailed from Andrew Morton <akpm@linux-foundation.org>: (13 patches)
mm: fix NULL ptr dereference in move_pages
mm: fix NULL ptr dereference in migrate_pages
revert "proc: clear_refs: do not clear reserved pages"
drivers/rtc/rtc-ds1307.c: fix BUG shown with lock debugging enabled
arch/arm/mach-ux500/mbox-db5500.c: world-writable sysfs fifo file
hugetlbfs: lockdep annotate root inode properly
acerhdf: lowered default temp fanon/fanoff values
acerhdf: add support for new hardware
acerhdf: add support for Aspire 1410 BIOS v1.3314
fs/buffer.c: remove BUG() in possible but rare condition
mm: fix up the vmscan stat in vmstat
epoll: clear the tfile_check_list on -ELOOP
mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vma
Local variable 'sb' was not being used in objlayout_get_deviceinfo().
Signed-off-by: Sachin Bhamare <sbhamare@panasas.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
All referrals (IPv4 addr, IPv6 addr, and DNS) are broken on mounts of
IPv6 addresses, because validation code uses a path that is parsed
from the dev_name ("<server>:<path>") by splitting on the first colon and
colons are used in IPv6 addrs.
This patch ignores colons within IPv6 addresses that are escaped by '[' and ']'.
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Highlights include:
- Fix NFSv4 infinite loops on open(O_TRUNC)
- Fix an Oops and an infinite loop in the NFSv4 flock code
- Don't register the PipeFS filesystem until it has been set up
- Fix an Oops in nfs_try_to_update_request
- Don't reuse NFSv4 open owners: fixes a bad sequence id storm.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJPlbzwAAoJEGcL54qWCgDy24oQALZE67vBft7M2j0BiWhVbV15
YLbCf6x/h+0BJAkKWdrBaw7N6GX6OYBOX2SsmrBkzYf5mgHeju5+dH0CmRAR5xib
5d+Lwxif1l+rABfdzzJf8gY1L1THyJCnfmarKKyYEJ5OC1pJyulKLanXSPzPfzlm
APV5Jf6NM2WRgkCqzP6zf61NG0HbDSR7C//HQ3k21Sdt9XDLf5qLHBSuPIQ+BlZY
EvpbERTtJgp7rPJsLQv1F2dgasDUQNg8G+tmZatGcqEiNxVyQ2YqwshaldOVqftv
3Kocs6OW5C1ESj1dFJZmeMZ/+GSHjRJx8fpqHJjmCsh4kPGgFviQDdYwu4FDhhPI
FZslC5nVi8JMTPNJAFmfvbwPQId/TSRPCWYO5PtW1LSfRT/+25b6M5duro1eGIbJ
/FDoOCYQmepNOfobU9Q3roDWyNSLYFaUaMJUrccRcAuS3S2NEXisTAT49kmqa1Vm
ZArOJBnXTgmGi30nKhqqLJ43P61ekhX0AQ6PycZAXkjeRlkQs7AAQbMJZMB2X0r9
KtRCDPiH2NuR0FwxNMkMP4BXdsaY7Sz/xiSZXLOUf1SeWBiBtYoDdrQ3z67SGOeG
qxI3qXXl0KC2+l2jnezcWhBf4CDpxftGIBi+rKWJt8stoYzbemB/M1lkoTCwrVzq
8Gwyy0QTVzE9VkY77oVW
=hQAK
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.4-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
- Fix NFSv4 infinite loops on open(O_TRUNC)
- Fix an Oops and an infinite loop in the NFSv4 flock code
- Don't register the PipeFS filesystem until it has been set up
- Fix an Oops in nfs_try_to_update_request
- Don't reuse NFSv4 open owners: fixes a bad sequence id storm.
* tag 'nfs-for-3.4-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Keep dropped state owners on the LRU list for a while
NFSv4: Ensure that we don't drop a state owner more than once
NFSv4: Ensure we do not reuse open owner names
nfs: Enclose hostname in brackets when needed in nfs_do_root_mount
NFS: put open context on error in nfs_flush_multi
NFS: put open context on error in nfs_pagein_multi
NFSv4: Fix open(O_TRUNC) and ftruncate() error handling
NFSv4: Ensure that we check lock exclusive/shared type against open modes
NFSv4: Ensure that the LOCK code sets exception->inode
NFS: check for req==NULL in nfs_try_to_update_request cleanup
SUNRPC: register PipeFS file system after pernet sybsystem
Revert commit 85e72aa538 ("proc: clear_refs: do not clear reserved
pages"), which was a quick fix suitable for -stable until ARM had been
moved over to the gate_vma mechanism:
https://lkml.org/lkml/2012/1/14/55
With commit f9d4861f ("ARM: 7294/1: vectors: use gate_vma for vectors user
mapping"), ARM does now use the gate_vma, so the PageReserved check can be
removed from the proc code.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Nicolas Pitre <nico@linaro.org>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes the below reported false lockdep warning. e096d0c7e2
("lockdep: Add helper function for dir vs file i_mutex annotation") added
a similar annotation for every other inode in hugetlbfs but missed the
root inode because it was allocated by a separate function.
For HugeTLB fs we allow taking i_mutex in mmap. HugeTLB fs doesn't
support file write and its file read callback is modified in a05b0855fd
("hugetlbfs: avoid taking i_mutex from hugetlbfs_read()") to not take
i_mutex. Hence for HugeTLB fs with regular files we really don't take
i_mutex with mmap_sem held.
======================================================
[ INFO: possible circular locking dependency detected ]
3.4.0-rc1+ #322 Not tainted
-------------------------------------------------------
bash/1572 is trying to acquire lock:
(&mm->mmap_sem){++++++}, at: [<ffffffff810f1618>] might_fault+0x40/0x90
but task is already holding lock:
(&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<ffffffff81125f88>] vfs_readdir+0x56/0xa8
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&sb->s_type->i_mutex_key#12){+.+.+.}:
[<ffffffff810a09e5>] lock_acquire+0xd5/0xfa
[<ffffffff816a2f5e>] __mutex_lock_common+0x48/0x350
[<ffffffff816a3325>] mutex_lock_nested+0x2a/0x31
[<ffffffff811fb8e1>] hugetlbfs_file_mmap+0x7d/0x104
[<ffffffff810f859a>] mmap_region+0x272/0x47d
[<ffffffff810f8a39>] do_mmap_pgoff+0x294/0x2ee
[<ffffffff810f8b65>] sys_mmap_pgoff+0xd2/0x10e
[<ffffffff8103d19e>] sys_mmap+0x1d/0x1f
[<ffffffff816a5922>] system_call_fastpath+0x16/0x1b
-> #0 (&mm->mmap_sem){++++++}:
[<ffffffff810a0256>] __lock_acquire+0xa81/0xd75
[<ffffffff810a09e5>] lock_acquire+0xd5/0xfa
[<ffffffff810f1645>] might_fault+0x6d/0x90
[<ffffffff81125d62>] filldir+0x6a/0xc2
[<ffffffff81133a83>] dcache_readdir+0x5c/0x222
[<ffffffff81125fa8>] vfs_readdir+0x76/0xa8
[<ffffffff811260b6>] sys_getdents+0x79/0xc9
[<ffffffff816a5922>] system_call_fastpath+0x16/0x1b
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&sb->s_type->i_mutex_key#12);
lock(&mm->mmap_sem);
lock(&sb->s_type->i_mutex_key#12);
lock(&mm->mmap_sem);
*** DEADLOCK ***
1 lock held by bash/1572:
#0: (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<ffffffff81125f88>] vfs_readdir+0x56/0xa8
stack backtrace:
Pid: 1572, comm: bash Not tainted 3.4.0-rc1+ #322
Call Trace:
[<ffffffff81699a3c>] print_circular_bug+0x1f8/0x209
[<ffffffff810a0256>] __lock_acquire+0xa81/0xd75
[<ffffffff810f38aa>] ? handle_pte_fault+0x5ff/0x614
[<ffffffff8109e622>] ? mark_lock+0x2d/0x258
[<ffffffff810f1618>] ? might_fault+0x40/0x90
[<ffffffff810a09e5>] lock_acquire+0xd5/0xfa
[<ffffffff810f1618>] ? might_fault+0x40/0x90
[<ffffffff816a3249>] ? __mutex_lock_common+0x333/0x350
[<ffffffff810f1645>] might_fault+0x6d/0x90
[<ffffffff810f1618>] ? might_fault+0x40/0x90
[<ffffffff81125d62>] filldir+0x6a/0xc2
[<ffffffff81133a83>] dcache_readdir+0x5c/0x222
[<ffffffff81125cf8>] ? sys_ioctl+0x74/0x74
[<ffffffff81125cf8>] ? sys_ioctl+0x74/0x74
[<ffffffff81125cf8>] ? sys_ioctl+0x74/0x74
[<ffffffff81125fa8>] vfs_readdir+0x76/0xa8
[<ffffffff811260b6>] sys_getdents+0x79/0xc9
[<ffffffff816a5922>] system_call_fastpath+0x16/0x1b
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While stressing the kernel with with failing allocations today, I hit the
following chain of events:
alloc_page_buffers():
bh = alloc_buffer_head(GFP_NOFS);
if (!bh)
goto no_grow; <= path taken
grow_dev_page():
bh = alloc_page_buffers(page, size, 0);
if (!bh)
goto failed; <= taken, consequence of the above
and then the failed path BUG()s the kernel.
The failure is inserted a litte bit artificially, but even then, I see no
reason why it should be deemed impossible in a real box.
Even though this is not a condition that we expect to see around every
time, failed allocations are expected to be handled, and BUG() sounds just
too much. As a matter of fact, grow_dev_page() can return NULL just fine
in other circumstances, so I propose we just remove it, then.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An epoll_ctl(,EPOLL_CTL_ADD,,) operation can return '-ELOOP' to prevent
circular epoll dependencies from being created. However, in that case we
do not properly clear the 'tfile_check_list'. Thus, add a call to
clear_tfile_check_list() for the -ELOOP case.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Reported-by: Yurij M. Plotnikov <Yurij.Plotnikov@oktetlabs.ru>
Cc: Nelson Elhage <nelhage@nelhage.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Tested-by: Alexandra N. Kossovsky <Alexandra.Kossovsky@oktetlabs.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cifs_show_options uses the wrong conversion specifier for uid, gid,
rsize & wsize. Correct this to %u to match it to the variable type
'unsigned integer'.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Show backupuid/backupgid in /proc/mounts for cifs shares mounted with
the backupuid/backupgid feature.
Also consolidate the two separate checks for
pvolume_info->backupuid_specified into a single if condition in
cifs_setup_cifs_sb().
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This patch instructs DLM to prevent an "in place" conversion, where the
lock just stays on the granted queue, and instead forces the conversion to
the back of the convert queue. This is done on upward conversions only.
This is useful in cases where, for example, a lock is frequently needed in
PR on one node, but another node needs it temporarily in EX to update it.
This may happen, for example, when the rindex is being updated by gfs2_grow.
The gfs2_grow needs to have the lock in EX, but the other nodes need to
re-read it to retrieve the updates. The glock is already granted in PR on
the non-growing nodes, so this prevents them from continually re-granting
the lock in PR, and forces the EX from gfs2_grow to go through.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
These are two low-risk bug fixes for ext4, fixing a compile warning
and a potential deadlock.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAABCAAGBQJPlgZ8AAoJENNvdpvBGATwewkP/ioo2U05O4tzmt05+HICw1ZK
vh1x6oaO3bUMa21pKBzS60rDc+EDu61E+bjVrsasOmom8DZyOP92SiwaDnIsKn6p
JBSNwzIOPmuPflEY3tnOsnOZ1umZcB16uhki1Rk1HE0nRPdKiyKJKZnbSzmUGWUW
gJwHbHddxZKTmDrEy4CxfbwwKKVm2SQUO5crLohFst4JsXc1h6muEfkcAZvCfZ68
1PQIkTkJUXArQuTuxzP89r7L8tqHJv4iOz+PT0FlluGWvgJUWIOVvjdJfPuQTmLi
UNzvtoQxuxjdZuCK/D16kNTkOEPzOhMlNW1djAntdCQohHIJG0Hd5bFju9bybSLz
838sTCEFxRS7rdBEXiksWsPCVDz/QVnPft0RG9jqXd6dRPFr/XJ1rAeDTjW2vmWw
ZO28p99aolA5At02AlSf9IgMIME0gKejnvpRo703UW456BlFIXPK3e/nbtE7Eb5A
HcZhvIwncWE4cbq2/AboielPSnyx6Z3SJS0hBIQ2wG40xcL/jxYL7K2/trkUr2KH
H3/4RsrSlLDXqHRJ4cVW75zKMgyNvc+60HDlAxE62LqKFR7K93hdlHpnkySy/1St
FaIiipH8Tmt+u6tqn6rlR82vRxd/dkLgQMpCWm4Et4THXlvisZkbxaDXrEGx79qg
v8eEdmHeJuLcQesm9TrS
=Ygid
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bug fixes from Ted Ts'o:
"These are two low-risk bug fixes for ext4, fixing a compile warning
and a potential deadlock."
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
super.c: unused variable warning without CONFIG_QUOTA
jbd2: use GFP_NOFS for blkdev_issue_flush
sb info is only checked with quota support.
fs/ext4/super.c: In function ‘parse_options’:
fs/ext4/super.c:1600:23: warning: unused variable ‘sbi’ [-Wunused-variable]
Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
flush request is issued in transaction commit code path, so looks using
GFP_KERNEL to allocate memory for flush request bio falls into the classic
deadlock issue. I saw btrfs and dm get it right, but ext4, xfs and md are
using GFP.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
This includes one short patch fixing the behavior of
the QUECVT flag, which the gfs2 folks are waiting on.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJPlYXZAAoJEDgbc8f8gGmqpzYP/RFkCn8mC5y5cM8lWBk2JQAJ
u7khyqowm3TWxjIpX85n7Uxq1vEX4RxFiRzCeiZj3ZoWE3PEQim8Tqrw8SFs8lcT
y7oYL6TBkgCbM1ROuKDqXRiw8oRAfRud3cqtRvQzxuds3AoaoyYvE6N+to2y9XlR
5DuUBJEtrpKOEdW1ZeXeUmCnvDwrUyEFuIlACoyochzbk6ug1EF926dgSaViE4ZG
OFcGMy8ELNqVYibVcJof2ZfztTvrMcXPIpsJrkK5tIW6w6q+2+eN4Xc2/xMZ4OYc
5AHHXxrqbK1ZABLrqsK/lUQi0Z241kAnqIi33i2nl3mhWSDF3K5CNXmrF9rvGsN7
wEqsfdGOnwFQucF1VU95neo+jYMnom9VGodpvSop7Xy5r+i59MPcfMDfz/I1KqX7
vBDuM5rwisYNfOb6wsfFNcBhkf1ktgo2h2iH5UdIaWfHApF1Lnls7D2j/o7r2uxF
tRd4sPhRt2eIn68XRggbWOVxMfdUKtaW50ZhKzW9osMItYX748O8XfQdk0sQUbD9
ZXbFEfbfsfRgMKhMSyNFcGDh6ePsT/cmZL/zR5VKVEHuprL3hEDPhCui5GT0Sm1G
9sXpLu9p51r0d4OIJpScOFMv8aD64w/mwLJ3r5nrGZz2APK9SwWJOqX82fyqivQc
uvO42yNGkwSGnBjXKiM6
=KDNZ
-----END PGP SIGNATURE-----
Merge tag 'dlm-fixes-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm fixes from David Teigland:
"This includes one short patch fixing the behavior of the QUECVT flag,
which the gfs2 folks are waiting on."
* tag 'dlm-fixes-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: fix QUECVT when convert queue is empty
The QUECVT flag should not prevent conversions from
being granted immediately when the convert queue is
empty.
Signed-off-by: David Teigland <teigland@redhat.com>
Retest the RB_EMPTY_NODE() condition under the spin lock
to ensure that we don't call rb_erase() more than once on the
same state owner.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
... since exit_mmap() is coming and it will munmap() everything anyway.
In all other cases aio_free_ring() has ctx->mm == current->mm; moreover,
all other callers of vm_munmap() have mm == current->mm, so this will
allow us to get rid of mm argument of vm_munmap().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The NFSv4 spec is ambiguous about whether or not it is permissible
to reuse open owner names, so play it safe. This patch adds a timestamp
to the state_owner structure, and combines that with the IDA based
uniquifier.
Fixes a regression whereby the Linux server returns NFS4ERR_BAD_SEQID.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This continues the theme started with vm_brk() and vm_munmap():
vm_mmap() does the same thing as do_mmap(), but additionally does the
required VM locking.
This uninlines (and rewrites it to be clearer) do_mmap(), which sadly
duplicates it in mm/mmap.c and mm/nommu.c. But that way we don't have
to export our internal do_mmap_pgoff() function.
Some day we hopefully don't have to export do_mmap() either, if all
modular users can become the simpler vm_mmap() instead. We're actually
very close to that already, with the notable exception of the (broken)
use in i810, and a couple of stragglers in binfmt_elf.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Like the vm_brk() function, this is the same as "do_munmap()", except it
does the VM locking for the caller.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It does the same thing as "do_brk()", except it handles the VM locking
too.
It turns out that all external callers want that anyway, so we can make
do_brk() static to just mm/mmap.c while at it.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When hostname contains colon (e.g. when it is an IPv6 address) it needs
to be enclosed in brackets to make parsing of NFS device string possible.
Fix nfs_do_root_mount() to enclose hostname properly when needed. NFS code
actually does not need this as it does not parse the string passed by
nfs_do_root_mount() but the device string is exposed to userspace in
/proc/mounts.
CC: Josh Boyer <jwboyer@redhat.com>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In the recent update of the cifs_iovec_write code to use async writes,
the handling of the file position was broken. That patch added a local
"offset" variable to handle the offset, and then only updated the
original "*poffset" before exiting.
Unfortunately, it copied off the original offset from the beginning,
instead of doing so after generic_write_checks had been called. Fix
this by moving the initialization of "offset" after that in the
function.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Pull nfsd bugfixes from J. Bruce Fields:
"One bugfix, and one minor header fix from Jeff Layton while we're
here"
* 'for-3.4' of git://linux-nfs.org/~bfields/linux:
nfsd: include cld.h in the headers_install target
nfsd: don't fail unchecked creates of non-special files
If the file wasn't opened for writing, then truncate and ftruncate
need to report the appropriate errors.
Reported-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
Since we may be simulating flock() locks using NFS byte range locks,
we can't rely on the VFS having checked the file open mode for us.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
All callers of nfs4_handle_exception() that need to handle
NFS4ERR_OPENMODE correctly should set exception->inode
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
Pull fuse updates from Miklos Szeredi.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: use flexible array in fuse.h
fuse: allow nanosecond granularity
fuse: O_DIRECT support for files
fuse: fix nlink after unlink
The bitfield member mount_opt was too small by one bit to hold the mount
option that enabled to include data extents in the integrity checker.
Since the same issue happened when the BTRFS_MOUNT_PANIC_ON_FATAL_ERROR
option was added (git rebase silently merges so that the increase of the
size of the bitfield member is lost), the bit limit was removed entirely.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
When a filesystem is mounted with the degraded option, it is
possible that some of the devices are not there.
btrfs_ioctl_dev_info() crashs in this case because the device
name is a NULL pointer. This ioctl was only used for scrub.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
It is basically a good thing if we are interruptible when waiting for
free space, but the generality in which it is implemented currently
leads to system calls being interruptible that are not documented this
way. For example git can't handle interrupted unlink(), leading to
corrupt repos under space pressure.
Instead we raise the bar to only be interruptible by SIGKILL.
Thanks to David Sterba for suggesting this.
Signed-off-by: Arne Jansen <sensille@gmx.net>
The caller expects this function to return with the lock held and
releases it immediately on error.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
A user reported a panic where we were trying to fix a bad mirror but the
mirror number we were giving was 0, which is invalid. This is because we
don't do the transid verification until after the read, so as far as the
read code is concerned the read was a success. So instead store the mirror
we read from so that if there is some failure post read we know which mirror
to try next and which mirror needs to be fixed if we find a good copy of the
block. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Josef Bacik <josef@redhat.com>
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Fix a bug, where in case we need to adjust stripe_size so that the
length of the resulting chunk is less than or equal to max_chunk_size,
DUP chunks turn out to be only half as big as they could be.
Cc: Arne Jansen <sensille@gmx.net>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
iref_to_path and iterate_irefs both increment the eb's refcount to use it
after releasing the path. Both depend on consistent data remaining in the
extent buffer and need a read lock to protect it.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Avoid calling free_extent_buffer more than once when the iterator function
returns non-zero. The only code that uses this is scrub repair for corrupted
nodatasum blocks.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>