Commit graph

32212 commits

Author SHA1 Message Date
Maxim Patlasov c2bf56d4a3 mm/page-writeback.c: add strictlimit feature
The feature prevents mistrusted filesystems (i.e. FUSE mounts created by
unprivileged users) from growing a large number of dirty pages before
throttling.  For such filesystems balance_dirty_pages always checks bdi
counters against bdi limits.  I.e.  even if global "nr_dirty" is under
"freerun", it's not allowed to skip bdi checks.  The only use case for now
is fuse: it sets bdi max_ratio to 1% by default and system administrators
are supposed to expect that this limit won't be exceeded.

The feature is on if a BDI is marked with the BDI_CAP_STRICTLIMIT flag.  A
filesystem may set the flag when it initializes its BDI.
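
For example, a filesystem could opt in when it sets up its BDI, roughly
like this (a sketch; the fuse-style field names are illustrative):

	err = bdi_init(&fc->bdi);
	if (err)
		return err;
	/* enforce the bdi limits even while globally under "freerun" */
	fc->bdi.capabilities |= BDI_CAP_STRICTLIMIT;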

The problematic scenario comes from the fact that nobody pays attention to
the NR_WRITEBACK_TEMP counter (i.e.  the number of pages under fuse
writeback).  The implementation of fuse writeback releases the original
page (by calling end_page_writeback) almost immediately.  A fuse request
queued for real processing bears a copy of the original page.  Hence, if
the userspace fuse daemon doesn't finalize write requests in a timely
manner, an aggressive mmap writer can pollute virtually all memory with
those temporary fuse page copies.  They are carefully accounted in
NR_WRITEBACK_TEMP, but nobody cares.

To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
problem" as a shortcut for "the possibility of uncontrolled growth of the
amount of RAM consumed by temporary pages allocated by kernel fuse to
process writeback".

The problem was very easy to reproduce.  There is a trivial example
filesystem implementation in the fuse userspace distribution:
fusexmp_fh.c.  I added "sleep(1);" to the write methods, then recompiled
and mounted it.  Then I created a huge file on the mount point and ran a
simple program which mmap-ed the file to a memory region and wrote data to
the region.  An hour later I observed almost all RAM consumed by fuse
writeback.  Since then some unrelated changes in kernel fuse have made it
more difficult to reproduce, but it is still possible now.
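
A sketch of such an mmap writer (the path and size LEN are illustrative
and error handling is omitted):

	/* needs <fcntl.h> and <sys/mman.h> */
	int fd = open("/mnt/fuse/hugefile", O_RDWR);
	char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	for (off_t i = 0; i < LEN; i += 4096)
		p[i] = 1;	/* dirty every page; each writeback through fuse makes a temp copy */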

Putting this theoretical happens-in-the-lab thing aside, there is another
thing that really hurts real-world (FUSE) users.  This is the
write-through page cache policy FUSE currently uses.  I.e.  when handling
write(2), kernel fuse populates the page cache and flushes user data to
the server synchronously.  This is excessively suboptimal.  Pavel
Emelyanov's patches ("writeback cache policy") solve the problem, but they
also make resolving the NR_WRITEBACK_TEMP problem absolutely necessary.
Otherwise, simply copying a huge file to a fuse mount would result in
memory starvation.  Miklos, the maintainer of FUSE, believes the
strictlimit feature is the way to go.

And finally, putting FUSE topics aside, there is one more use case for the
strictlimit feature.  Using a slow USB stick (mass storage) in a machine
with a huge amount of RAM installed is a well-known pain.  Let's do some
simple arithmetic.  Assuming 64GB of RAM installed, the existing
implementation of balance_dirty_pages will start throttling only after
9.6GB of RAM becomes dirty (freerun == 15% of total RAM).  So, the command
"cp 9GB_file /media/my-usb-storage/" may return in a few seconds, but a
subsequent "umount /media/my-usb-storage/" will take more than two hours
if the effective throughput of the storage is, say, 1MB/sec.
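(For the record: 64GB * 15% ~= 9.6GB of dirty data allowed before
throttling, and 9GB at 1MB/sec is about 9216 seconds, i.e. roughly 2.6
hours of writeback left to drain at umount time.)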

After inclusion of the strictlimit feature, it will be trivial to add a
knob (e.g.  /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on
demand, either manually or via a udev rule.  Maybe I'm wrong, but it seems
quite natural to want to limit the amount of dirty memory for devices we
do not fully trust (in the sense of sustainable throughput).

[akpm@linux-foundation.org: fix warning in page-writeback.c]
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: 5a53748568f79641eaf40e41081a2f4987f005c2
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: I2def00e492ab04b4938d11e35dfc87656b2acf20
Signed-off-by: Tarun Gupta <tarung@codeaurora.org>
2014-11-25 20:59:45 -08:00
Linux Build Service Account 62b1d26801 Merge "sched: Add API to set task's initial task load" 2014-11-20 15:36:29 -08:00
Vignesh Radhakrishnan d4f8b279d0 fs: fuse: lock the new non-CMA page before replace_page_cache_page()
While swapping a page in the FUSE filesystem from a CMA to a non-CMA page,
we call replace_page_cache_page() to do the swap, and this function
expects both the old and the new page to be locked.  Otherwise we will hit
a VM_BUG_ON in replace_page_cache_page().

Hence, lock the new page before calling replace_page_cache_page() to
satisfy its requirement and prevent hitting the VM_BUG_ON condition.
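
A minimal sketch of the required ordering (variable names are
illustrative):

	/* both pages must be locked across the replacement */
	lock_page(newpage);
	err = replace_page_cache_page(oldpage, newpage, GFP_KERNEL);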

CRs-Fixed: 751088
Change-Id: I1dbb90dbaa9f056f211754bad15ae76c9d7171a5
Signed-off-by: Vignesh Radhakrishnan <vigneshr@codeaurora.org>
2014-11-06 16:24:02 +05:30
Srivatsa Vaddagiri f0e281597c sched: Add API to set task's initial task load
Add a per-task attribute, init_load_pct, that is used to initialize
newly created children's initial task load. This helps important
applications launch their child tasks on CPUs with the highest capacity.

Change-Id: Ie9665fd2aeb15203f95fd7f211c50bebbaa18727
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
2014-11-05 14:26:59 +05:30
Venkat Gopalakrishnan b015fc41eb block/fs: make tracking dirty task debug only
Adding a new element "tsk_dirty" to struct page increases the size
of mem_map/vmemmap, so restrict this to debug-only functionality to
save a few MB of memory.

Considering a system with 1GB of RAM, there will be nearly 262144
pages and thus that many page structures in mem_map/vmemmap.
With a pointer size of 8 bytes on a 64-bit system, adding this
pointer to "struct page" means a 2MB increase in mem_map.

CRs-Fixed: 738692
Change-Id: Idf3217dcbe17cf1ab4d462d2aa8d39da1ffd8b13
Signed-off-by: Venkat Gopalakrishnan <venkatg@codeaurora.org>
2014-10-28 17:13:01 -07:00
Jordan Crouse e67b7ce309 fs/seq_file: Use vmalloc by default for allocations > PAGE_SIZE
Some OOM implementations are pretty trigger happy when it comes to
releasing memory for kmalloc() allocations.  We might as well head
straight to vmalloc for allocations over PAGE_SIZE.

Change-Id: Ic0dedbadc8bf551d34cc5d77c8073938d4adef80
Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
2014-09-23 10:37:58 -06:00
Heiko Carstens 53f0b6187f fs/seq_file: fallback to vmalloc allocation
There are a couple of seq_files which use the single_open() interface.
This interface requires that the whole output must fit into a single
buffer.

E.g.  for /proc/stat allocation failures have been observed because an
order-4 memory allocation failed due to memory fragmentation.  In such
situations reading /proc/stat is not possible anymore.

Therefore change the seq_file code to fall back to vmalloc allocations,
which will usually result in a couple of order-0 allocations and hence
also work if memory is fragmented.
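
A minimal sketch of such a fallback allocator (close to, but not
necessarily identical to, the upstream helper):

	static void *seq_buf_alloc(unsigned long size)
	{
		void *buf;

		/* try a physically contiguous allocation first, quietly */
		buf = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
		if (!buf && size > PAGE_SIZE)
			buf = vmalloc(size);	/* order-0 pages; works with fragmented memory */
		return buf;
	}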

For reference a call trace where reading from /proc/stat failed:

  sadc: page allocation failure: order:4, mode:0x1040d0
  CPU: 1 PID: 192063 Comm: sadc Not tainted 3.10.0-123.el7.s390x #1
  [...]
  Call Trace:
    show_stack+0x6c/0xe8
    warn_alloc_failed+0xd6/0x138
    __alloc_pages_nodemask+0x9da/0xb68
    __get_free_pages+0x2e/0x58
    kmalloc_order_trace+0x44/0xc0
    stat_open+0x5a/0xd8
    proc_reg_open+0x8a/0x140
    do_dentry_open+0x1bc/0x2c8
    finish_open+0x46/0x60
    do_last+0x382/0x10d0
    path_openat+0xc8/0x4f8
    do_filp_open+0x46/0xa8
    do_sys_open+0x114/0x1f0
    sysc_tracego+0x14/0x1a

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: David Rientjes <rientjes@google.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Thorsten Diehl <thorsten.diehl@de.ibm.com>
Cc: Andrea Righi <andrea@betterlinux.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Stefan Bader <stefan.bader@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: 058504edd02667eef8fac9be27ab3ea74332e9b4
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: Iad795a92fee1983c300568429a6283c48625bd9a
Signed-off-by: Jeremy Gebben <jgebben@codeaurora.org>
2014-09-23 10:37:58 -06:00
Rafael J. Wysocki 7ba8c45847 Revert "epoll: use freezable blocking call"
This reverts commit 1c441e921201 (epoll: use freezable blocking call)
which is reported to cause user space memory corruption to happen
after suspend to RAM.

Since it appears to be extremely difficult to root cause this
problem, it is best to revert the offending commit and try to address
the original issue in a better way later.

Change-Id: Ib57a2bda6c190454ae3bd3ae7eeb1456ea2156a6
References: https://bugzilla.kernel.org/show_bug.cgi?id=61781
Reported-by: Natrio <natrio@list.ru>
Reported-by: Jeff Pohlmeyer <yetanothergeek@gmail.com>
Bisected-by: Leo Wolf <jclw@ymail.com>
Fixes: 1c441e921201 (epoll: use freezable blocking call)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: 3.11+ <stable@vger.kernel.org> # 3.11+
Git-commit: c511851de162e8ec03d62e7d7feecbdf590d881d
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-stable.git
Signed-off-by: Osvaldo Banuelos <osvaldob@codeaurora.org>
2014-08-29 14:20:41 -07:00
Ian Maund 6440f462f9 Merge upstream tag 'v3.10.49' into msm-3.10
* commit 'v3.10.49': (529 commits)
  Linux 3.10.49
  ACPI / battery: Retry to get battery information if failed during probing
  x86, ioremap: Speed up check for RAM pages
  Score: Modify the Makefile of Score, remove -mlong-calls for compiling
  Score: The commit is for compiling successfully.
  Score: Implement the function csum_ipv6_magic
  score: normalize global variables exported by vmlinux.lds
  rtmutex: Plug slow unlock race
  rtmutex: Handle deadlock detection smarter
  rtmutex: Detect changes in the pi lock chain
  rtmutex: Fix deadlock detector for real
  ring-buffer: Check if buffer exists before polling
  drm/radeon: stop poisoning the GART TLB
  drm/radeon: fix typo in golden register setup on evergreen
  ext4: disable synchronous transaction batching if max_batch_time==0
  ext4: clarify error count warning messages
  ext4: fix unjournalled bg descriptor while initializing inode bitmap
  dm io: fix a race condition in the wake up code for sync_io
  Drivers: hv: vmbus: Fix a bug in the channel callback dispatch code
  clk: spear3xx: Use proper control register offset
  ...

In addition to bringing in upstream commits, this merge also makes minor
changes to maintain compatibility with upstream:

The definition of list_next_entry in qcrypto.c and ipa_dp.c has been
removed, as upstream has moved the definition to list.h. The implementation
of list_next_entry was identical between the two.
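
For reference, the definition that both drivers duplicated (and that now
lives in list.h upstream) is essentially:

	#define list_next_entry(pos, member) \
		list_entry((pos)->member.next, typeof(*(pos)), member)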

irq.c, for both arm and arm64 architecture, has had its calls to
__irq_set_affinity_locked updated to reflect changes to the API upstream.

Finally, as we have removed the sleep_length member variable of the
tick_sched struct, all changes made by upstream commit ec804bd do not
apply to our tree and have been removed from this merge. Only
kernel/time/tick-sched.c is impacted.

Change-Id: I63b7e0c1354812921c94804e1f3b33d1ad6ee3f1
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2014-08-20 13:23:09 -07:00
Peter Zijlstra c5ac12693f arch: Mass conversion of smp_mb__*()
Mostly scripted conversion of the smp_mb__* barriers.
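
The conversion is mechanical; for example (the bit name is illustrative):

	/* before */
	smp_mb__before_clear_bit();
	clear_bit(SOME_FLAG_BIT, &word);

	/* after */
	smp_mb__before_atomic();
	clear_bit(SOME_FLAG_BIT, &word);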

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Git-commit: 4e857c58efeb99393cba5a5d0d8ec7117183137c
[joonwoop@codeaurora.org: fixed trivial merge conflict.]
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2014-08-15 11:45:28 -07:00
Heiko Carstens e89d22ffdc compat: let architectures define __ARCH_WANT_COMPAT_SYS_GETDENTS64
For architecture-dependent compat syscalls in common code, an architecture
must define something like __ARCH_WANT_<WHATEVER> if it wants to use the
code.
This, however, is not true for compat_sys_getdents64, for which
architectures must define __ARCH_OMIT_COMPAT_SYS_GETDENTS64 if they do not
want the code.

This leads to the situation where all architectures, except mips, get the
compat code but only x86_64, arm64 and the generic syscall architectures
actually use it.

So invert the logic, so that architectures must actively do something to
get the compat code.

This way a couple of architectures get rid of otherwise dead code.
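
A rough sketch of the inverted logic (the header location is
illustrative):

	/* arch header, e.g. arch/x86/include/asm/unistd.h: opt in explicitly */
	#define __ARCH_WANT_COMPAT_SYS_GETDENTS64

	/* fs/compat.c: the guard flips from opt-out to opt-in */
	#ifdef __ARCH_WANT_COMPAT_SYS_GETDENTS64
	/* ... compat_sys_getdents64() implementation ... */
	#endif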

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Git-commit: 0473c9b5f05948df780bbc7b996dd7aefc4ec41d
Signed-off-by: Joonwoo Park <joonwoop@codeaurora.org>
2014-08-15 11:41:28 -07:00
Linux Build Service Account e90565f9c2 Merge "qmp: sphinx: Add Qualcomm Malware Protection kernel instrumentation" 2014-07-31 21:36:52 -07:00
Dinesh K Garg 3c77a990d0 qmp: sphinx: Add Qualcomm Malware Protection kernel instrumentation
The kernel instrumentation for Sphinx is added as a separate commit.

Change-Id: I533e3f3f694165358d6756f1b8e18cbb09232c1c
Signed-off-by: Dinesh K Garg <dineshg@codeaurora.org>
2014-07-30 10:39:50 -07:00
Naveen Kaje 94303953c8 fs: Add TTY PM IOCTLs to compat table
Augment the compat ioctl table with entries for
PM control of TTY devices. These compat entries
were not present since other TTY/serial core drivers
were not using them.

Change-Id: I96a0e54c001d780a2a427380655f1fbb0091aef7
Signed-off-by: Naveen Kaje <nkaje@codeaurora.org>
2014-07-30 10:25:00 -06:00
Eric Sandeen 39f9c0e3dd ext4: disable synchronous transaction batching if max_batch_time==0
commit 5dd214248f94d430d70e9230bda72f2654ac88a8 upstream.

The mount manpage says of the max_batch_time option,

	This optimization can be turned off entirely
	by setting max_batch_time to 0.

But the code doesn't do that.  So fix the code to do
that.
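
Illustratively, the mount-option handler simply stops rewriting 0 to the
default (a sketch, not the exact upstream hunk):

	} else if (token == Opt_max_batch_time) {
		/* no longer: if (arg == 0) arg = EXT4_DEF_MAX_BATCH_TIME; */
		sbi->s_max_batch_time = arg;
	}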

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-17 15:58:03 -07:00
Theodore Ts'o 9625fe1e2e ext4: clarify error count warning messages
commit ae0f78de2c43b6fadd007c231a352b13b5be8ed2 upstream.

Make it clear that the values printed are times, and that they count
errors since the last fsck. Also add a note about the fsck version
required.

Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-17 15:58:02 -07:00
Theodore Ts'o 7cefa2c68d ext4: fix unjournalled bg descriptor while initializing inode bitmap
commit 61c219f5814277ecb71d64cb30297028d6665979 upstream.

The first time that we allocate from an uninitialized inode allocation
bitmap, if the block allocation bitmap is also uninitialized, we need
to get write access to the block group descriptor before we start
modifying the block group descriptor flags and updating the free block
count, etc.  Otherwise, there is the potential of a bad journal
checksum (if journal checksums are enabled), and of the file system
becoming inconsistent if we crash at exactly the wrong time.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-17 15:58:02 -07:00
Linux Build Service Account 38636483f7 Merge "Merge google-common commits into msm-3.10" 2014-07-09 20:08:44 -07:00
J. Bruce Fields 9c252e3881 nfsd: fix rare symlink decoding bug
commit 76f47128f9b33af1e96819746550d789054c9664 upstream.

An NFS operation that creates a new symlink includes the symlink data,
which is xdr-encoded as a length followed by the data plus 0 to 3 bytes
of zero-padding as required to reach a 4-byte boundary.

The vfs, on the other hand, wants null-terminated data.

The simple way to handle this would be by copying the data into a newly
allocated buffer with space for the final null.

The current nfsd_symlink code tries to be more clever by skipping that
step in the (likely) case where the byte following the string is already
0.

But that assumes that the byte following the string is ours to look at.
In fact, it might be the first byte of a page that we can't read, or of
some object that another task might modify.

Worse, the NFSv4 code tries to fix the problem by actually writing to
that byte.

In the NFSv2/v3 cases this actually appears to be safe:

	- nfs3svc_decode_symlinkargs explicitly null-terminates the data
	  (after first checking its length and copying it to a new
	  page).
	- NFSv2 limits symlinks to 1k.  The buffer holding the rpc
	  request is always at least a page, and the link data (and
	  previous fields) have maximum lengths that prevent the request
	  from reaching the end of a page.

In the NFSv4 case the CREATE op is potentially just one part of a long
compound so can end up on the end of a page if you're unlucky.

The minimal fix here is to copy and null-terminate in the NFSv4 case.
The nfsd_symlink() interface here seems too fragile, though.  It should
really either do the copy itself every time or just require a
null-terminated string.
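
A minimal sketch of the "copy and null-terminate" approach described
above (buffer names are hypothetical and error handling is trimmed):

	/* copy the xdr-encoded link data and null-terminate it */
	char *path = kmalloc(linklen + 1, GFP_KERNEL);
	if (!path)
		return nfserr_jukebox;
	memcpy(path, linkdata, linklen);
	path[linklen] = '\0';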

Reported-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09 11:14:02 -07:00
Jan Kara 82252eaccd ext4: Fix hole punching for files with indirect blocks
commit a93cd4cf86466caa49cfe64607bea7f0bde3f916 upstream.

Hole punching code for files with indirect blocks wrongly computed the
number of blocks which need to be cleared when traversing the indirect
block tree. That could result in punching more blocks than actually
requested and thus effectively cause data loss. For example:

fallocate -n -p 10240000 4096

will punch the range 10240000 - 12632064 instead of the range 10240000 -
10244096. Fix the calculation.

Fixes: 8bad6fc813
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09 11:14:01 -07:00
Jan Kara ead37447d6 ext4: Fix buffer double free in ext4_alloc_branch()
commit c5c7b8ddfbf8cb3b2291e515a34ab1b8982f5a2d upstream.

Error recovery in ext4_alloc_branch() calls ext4_forget() even for the
buffer corresponding to an indirect block it did not allocate. This leads
to brelse() being called twice for that buffer (once from ext4_forget()
and once from the cleanup in ext4_ind_map_blocks()), leading to buffer use
count misaccounting. Eventually (but often much later, because there
are other users of the buffer) we will see messages like:
VFS: brelse: Trying to free free buffer

Another manifestation of this problem is an error:
JBD2 unexpected failure: jbd2_journal_revoke: !buffer_revoked(bh);
inconsistent data on disk

The fix is easy - don't forget the buffer we did not allocate. Also add an
explanatory comment, because the indexing in ext4_alloc_branch() is
somewhat subtle.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09 11:14:01 -07:00
Steve French 780ae119ce CIFS: fix mount failure with broken pathnames when smb3 mount with mapchars option
commit ce36d9ab3bab06b7b5522f5c8b68fac231b76ffb upstream.

When we mounted SMB3 with mapchars (to allow reserved characters : \ / > < * ?
via the Unicode Windows-to-POSIX remap range), empty paths
(e.g. when we open "" to query the root of the SMB3 directory on mount) were
not null terminated, so we sent garbage as a path name on empty paths, which
caused SMB2/SMB2.1/SMB3 mounts to fail when mapchars was specified.  mapchars
is particularly important since Unix Extensions for SMB3 are not supported
(yet)

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09 11:14:01 -07:00
Jeff Mahoney 4660efb843 reiserfs: call truncate_setsize under tailpack mutex
commit 22e7478ddbcb670e33fab72d0bbe7c394c3a2c84 upstream.

Prior to commit 0e4f6a791b (Fix reiserfs_file_release()), reiserfs
truncates serialized on i_mutex. They mostly still do, with the exception
of reiserfs_file_release. That blocks out other writers via the tailpack
mutex and the inode openers counter adjusted in reiserfs_file_open.

However, NFS will call reiserfs_setattr without having called ->open, so
we end up with a race when nfs is calling ->setattr while another
process is releasing the file. Ultimately, it triggers the
BUG_ON(inode->i_size != new_file_size) check in maybe_indirect_to_direct.

The solution is to pull the lock into reiserfs_setattr to encompass the
truncate_setsize call as well.
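
A rough sketch of what that looks like in reiserfs_setattr() (an
illustrative fragment, with error handling omitted):

	mutex_lock(&REISERFS_I(inode)->tailpack);
	truncate_setsize(inode, attr->ia_size);
	mutex_unlock(&REISERFS_I(inode)->tailpack);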

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06 18:54:15 -07:00
Jeff Layton 548e6c6e20 nfsd: don't halt scanning the DRC LRU list when there's an RC_INPROG entry
commit 1b19453d1c6abcfa7c312ba6c9f11a277568fc94 upstream.

Currently, the DRC cache pruner will stop scanning the list when it
hits an entry that is RC_INPROG. It's possible however for a call to
take a *very* long time. In that case, we don't want it to block other
entries from being pruned if they are expired or we need to trim the
cache to get back under the limit.

Fix the DRC cache pruner to just ignore RC_INPROG entries.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06 18:54:14 -07:00
Jeff Layton 03e9dd7847 nfsd: don't try to reuse an expired DRC entry off the list
commit a0ef5e19684f0447da9ff0654a12019c484f57ca upstream.

Currently when we are processing a request, we try to scrape an expired
or over-limit entry off the list in preference to allocating a new one
from the slab.

This is unnecessarily complicated. Just use the slab layer.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06 18:54:14 -07:00
Trond Myklebust 4b331d38a2 NFS: Don't declare inode uptodate unless all attributes were checked
commit 43b6535e717d2f656f71d9bd16022136b781c934 upstream.

Fix a bug whereby nfs_update_inode() was declaring the inode to be
up to date despite not having checked all the attributes.
The bug occurs because the temporary variable in which we cache
the validity information is 'sanitised' before being reapplied to
nfsi->cache_validity.

Reported-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06 18:54:14 -07:00
Christoph Hellwig 32cf2fba27 nfsd: getattr for FATTR4_WORD0_FILES_AVAIL needs the statfs buffer
commit 12337901d654415d9f764b5f5ba50052e9700f37 upstream.

Note nobody's ever noticed because the typical client probably never
requests FILES_AVAIL without also requesting something else on the list.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06 18:54:14 -07:00
J. Bruce Fields 1dd6ffc628 nfsd4: fix FREE_STATEID lockowner leak
commit 48385408b45523d9a432c66292d47ef43efcbb94 upstream.

27b11428b7de ("nfsd4: remove lockowner when removing lock stateid")
introduced a memory leak.

Reported-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06 18:54:14 -07:00
Trond Myklebust f1631672d6 pNFS: Handle allocation errors correctly in filelayout_alloc_layout_hdr()
commit 6df200f5d5191bdde4d2e408215383890f956781 upstream.

Return the NULL pointer when the allocation fails.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06 18:54:14 -07:00
hujianyang 6f02490b96 UBIFS: Remove incorrect assertion in shrink_tnc()
commit 72abc8f4b4e8574318189886de627a2bfe6cd0da upstream.

I hit the same assertion failure that Dolev Raviv reported against kernel
v3.10; it looks like this:

[ 9641.164028] UBIFS assert failed in shrink_tnc at 131 (pid 13297)
[ 9641.234078] CPU: 1 PID: 13297 Comm: mmap.test Tainted: G           O 3.10.40 #1
[ 9641.234116] [<c0011a6c>] (unwind_backtrace+0x0/0x12c) from [<c000d0b0>] (show_stack+0x20/0x24)
[ 9641.234137] [<c000d0b0>] (show_stack+0x20/0x24) from [<c0311134>] (dump_stack+0x20/0x28)
[ 9641.234188] [<c0311134>] (dump_stack+0x20/0x28) from [<bf22425c>] (shrink_tnc_trees+0x25c/0x350 [ubifs])
[ 9641.234265] [<bf22425c>] (shrink_tnc_trees+0x25c/0x350 [ubifs]) from [<bf2245ac>] (ubifs_shrinker+0x25c/0x310 [ubifs])
[ 9641.234307] [<bf2245ac>] (ubifs_shrinker+0x25c/0x310 [ubifs]) from [<c00cdad8>] (shrink_slab+0x1d4/0x2f8)
[ 9641.234327] [<c00cdad8>] (shrink_slab+0x1d4/0x2f8) from [<c00d03d0>] (do_try_to_free_pages+0x300/0x544)
[ 9641.234344] [<c00d03d0>] (do_try_to_free_pages+0x300/0x544) from [<c00d0a44>] (try_to_free_pages+0x2d0/0x398)
[ 9641.234363] [<c00d0a44>] (try_to_free_pages+0x2d0/0x398) from [<c00c6a60>] (__alloc_pages_nodemask+0x494/0x7e8)
[ 9641.234382] [<c00c6a60>] (__alloc_pages_nodemask+0x494/0x7e8) from [<c00f62d8>] (new_slab+0x78/0x238)
[ 9641.234400] [<c00f62d8>] (new_slab+0x78/0x238) from [<c031081c>] (__slab_alloc.constprop.42+0x1a4/0x50c)
[ 9641.234419] [<c031081c>] (__slab_alloc.constprop.42+0x1a4/0x50c) from [<c00f80e8>] (kmem_cache_alloc_trace+0x54/0x188)
[ 9641.234459] [<c00f80e8>] (kmem_cache_alloc_trace+0x54/0x188) from [<bf227908>] (do_readpage+0x168/0x468 [ubifs])
[ 9641.234553] [<bf227908>] (do_readpage+0x168/0x468 [ubifs]) from [<bf2296a0>] (ubifs_readpage+0x424/0x464 [ubifs])
[ 9641.234606] [<bf2296a0>] (ubifs_readpage+0x424/0x464 [ubifs]) from [<c00c17c0>] (filemap_fault+0x304/0x418)
[ 9641.234638] [<c00c17c0>] (filemap_fault+0x304/0x418) from [<c00de694>] (__do_fault+0xd4/0x530)
[ 9641.234665] [<c00de694>] (__do_fault+0xd4/0x530) from [<c00e10c0>] (handle_pte_fault+0x480/0xf54)
[ 9641.234690] [<c00e10c0>] (handle_pte_fault+0x480/0xf54) from [<c00e2bf8>] (handle_mm_fault+0x140/0x184)
[ 9641.234716] [<c00e2bf8>] (handle_mm_fault+0x140/0x184) from [<c0316688>] (do_page_fault+0x150/0x3ac)
[ 9641.234737] [<c0316688>] (do_page_fault+0x150/0x3ac) from [<c000842c>] (do_DataAbort+0x3c/0xa0)
[ 9641.234759] [<c000842c>] (do_DataAbort+0x3c/0xa0) from [<c0314e38>] (__dabt_usr+0x38/0x40)

After analyzing the code, I found a condition that may cause this failure
during correct operation. Thus, I think this assertion is wrong and should
be removed.

Suppose there are two clean znodes and one dirty znode in the TNC, so the
per-filesystem atomic_t @clean_zn_cnt is (2). If a commit starts, dirty_znode
is set to COW_ZNODE in get_znodes_to_commit() in case of potential ops
on this znode. We clear the COW bit and the DIRTY bit in write_index() without
@tnc_mutex locked. We don't increase @clean_zn_cnt at this point. As the
comments in write_index() show, if another process holds @tnc_mutex and
dirties this znode after we clean it, @clean_zn_cnt would be decreased to (1).
We will increase @clean_zn_cnt back to (2) with @tnc_mutex locked in
free_obsolete_znodes() to keep it right.

If shrink_tnc() runs between the decrease and the increase, it will release
the other 2 clean znodes it holds and find @clean_zn_cnt is less than zero
(1 - 2 = -1), then hit the assertion. Because free_obsolete_znodes() will
soon correct @clean_zn_cnt and there is no harm to the fs in this case, I
think this assertion can be removed.

2 clean znodes and 1 dirty znode, @clean_zn_cnt == 2

Thread A (commit)         Thread B (write or others)       Thread C (shrinker)
->write_index
   ->clear_bit(DIRTY_NODE)
   ->clear_bit(COW_ZNODE)

            @clean_zn_cnt == 2
                          ->mutex_locked(&tnc_mutex)
                          ->dirty_cow_znode
                              ->!ubifs_zn_cow(znode)
                              ->!test_and_set_bit(DIRTY_NODE)
                              ->atomic_dec(&clean_zn_cnt)
                          ->mutex_unlocked(&tnc_mutex)

            @clean_zn_cnt == 1
                                                           ->mutex_locked(&tnc_mutex)
                                                           ->shrink_tnc
                                                             ->destroy_tnc_subtree
                                                             ->atomic_sub(&clean_zn_cnt, 2)
                                                             ->ubifs_assert  <- hit
                                                           ->mutex_unlocked(&tnc_mutex)

            @clean_zn_cnt == -1
->mutex_lock(&tnc_mutex)
->free_obsolete_znodes
   ->atomic_inc(&clean_zn_cnt)
->mutex_unlock(&tnc_mutex)

            @clean_zn_cnt == 0 (correct after shrink)

Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06 18:54:13 -07:00
hujianyang ac8df9ec7b UBIFS: fix an mmap and fsync race condition
commit 691a7c6f28ac90cccd0dbcf81348ea90b211bdd0 upstream.

There is a race condition in UBIFS:

Thread A (mmap)                        Thread B (fsync)

->__do_fault                           ->write_cache_pages
   -> ubifs_vm_page_mkwrite
       -> budget_space
       -> lock_page
       -> release/convert_page_budget
       -> SetPagePrivate
       -> TestSetPageDirty
       -> unlock_page
                                       -> lock_page
                                           -> TestClearPageDirty
                                           -> ubifs_writepage
                                               -> do_writepage
                                                   -> release_budget
                                                   -> ClearPagePrivate
                                                   -> unlock_page
   -> !(ret & VM_FAULT_LOCKED)
   -> lock_page
   -> set_page_dirty
       -> ubifs_set_page_dirty
           -> TestSetPageDirty (set page dirty without budgeting)
   -> unlock_page

This leads to a situation where we have a dirty page but no budget allocated
for this page, so further write-back may fail with -ENOSPC.

In this fix we return from page_mkwrite without performing unlock_page; we
return VM_FAULT_LOCKED instead. After doing this, the race above will not
happen.
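
A sketch of the tail of ubifs_vm_page_mkwrite() after the fix:

	/* keep the page locked and tell the fault handler so */
	return VM_FAULT_LOCKED;	/* previously: unlock_page(page); return 0; */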

Signed-off-by: hujianyang <hujianyang@huawei.com>
Tested-by: Laurence Withers <lwithers@guralp.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06 18:54:13 -07:00
Eric Sandeen 26401cbceb btrfs: fix use of uninit "ret" in end_extent_writepage()
commit 3e2426bd0eb980648449e7a2f5a23e3cd3c7725c upstream.

If this condition in end_extent_writepage() is false:

	if (tree->ops && tree->ops->writepage_end_io_hook)

we will then test an uninitialized "ret" at:

	ret = ret < 0 ? ret : -EIO;

The test for ret is for the case where ->writepage_end_io_hook
failed, and we'd choose that ret as the error; but if
there is no ->writepage_end_io_hook, nothing sets ret.

Initializing ret to 0 should be sufficient; if
writepage_end_io_hook wasn't set, (!uptodate) means
non-zero err was passed in, so we choose -EIO in that case.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>

Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30 20:09:46 -07:00
Liu Bo b2ac302362 Btrfs: fix scrub_print_warning to handle skinny metadata extents
commit 6eda71d0c030af0fc2f68aaa676e6d445600855b upstream.

The skinny extents are interpreted incorrectly in scrub_print_warning(),
and end up hitting the BUG() in btrfs_extent_inline_ref_size.

Reported-by: Konstantinos Skarlatos <k.skarlatos@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30 20:09:46 -07:00
Liu Bo d6f5d5fd0c Btrfs: use right type to get real comparison
commit cd857dd6bc2ae9ecea14e75a34e8a8fdc158e307 upstream.

We want to make sure the pointer is still within the extent item, not to
verify the memory it's pointing to.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30 20:09:46 -07:00
Rickard Strandqvist 2346e1e345 fs: btrfs: volumes.c: Fix for possible null pointer dereference
commit 8321cf2596d283821acc466377c2b85bcd3422b7 upstream.

There is otherwise a risk of a null pointer dereference.

This was largely found by using a static code analysis program called cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30 20:09:46 -07:00
Filipe Manana 70510742ba Btrfs: send, don't error in the presence of subvols/snapshots
commit 1af56070e3ef9477dbc7eba3b9ad7446979c7974 upstream.

If we are doing an incremental send and the base snapshot has a
directory with name X that doesn't exist anymore in the second
snapshot, and a new subvolume/snapshot exists in the second snapshot
that has the same name as the directory (name X), the incremental
send would fail with an -ENOENT error. This is because it attempts
to look up an inode with a number matching the objectid of a
root, which doesn't exist.

Steps to reproduce:

    mkfs.btrfs -f /dev/sdd
    mount /dev/sdd /mnt

    mkdir /mnt/testdir
    btrfs subvolume snapshot -r /mnt /mnt/mysnap1

    rmdir /mnt/testdir
    btrfs subvolume create /mnt/testdir
    btrfs subvolume snapshot -r /mnt /mnt/mysnap2

    btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/send.data

A test case for xfstests follows.

Reported-by: Robert White <rwhite@pobox.com>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30 20:09:46 -07:00
Wang Shilong 2d857bc05e Btrfs: set right total device count for seeding support
commit 298658414a2f0bea1f05a81876a45c1cd96aa2e0 upstream.

Seeding device support allows us to create a new filesystem
based on an existing filesystem.

However, the newly created filesystem's @total_devices should include the
seed devices. This patch fixes the following problem:

 # mkfs.btrfs -f /dev/sdb
 # btrfstune -S 1 /dev/sdb
 # mount /dev/sdb /mnt
 # btrfs device add -f /dev/sdc /mnt --->fs_devices->total_devices = 1
 # umount /mnt
 # mount /dev/sdc /mnt               --->fs_devices->total_devices = 2

This is because we record the right @total_devices in the superblock, but
@fs_devices->total_devices is reset to 0 in btrfs_prepare_sprout().

Fix this problem by not resetting @fs_devices->total_devices.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30 20:09:46 -07:00
Liu Bo 372fad07b9 Btrfs: mark mapping with error flag to report errors to userspace
commit 5dca6eea91653e9949ce6eb9e9acab6277e2f2c4 upstream.

According to commit 865ffef379
(fs: fix fsync() error reporting),
it's not reliable to just check error pages, because pages can be
truncated or invalidated; we should also mark the mapping with the error
flag so that a later fsync can catch the error.
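
A minimal sketch of recording the error on the mapping as well
(illustrative):

	SetPageError(page);
	mapping_set_error(page->mapping, -EIO);	/* lets a later fsync() return the error */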

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30 20:09:46 -07:00
Wang Shilong 791a1cc32d Btrfs: make sure there are not any read requests before stopping workers
commit de348ee022175401e77d7662b7ca6e231a94e3fd upstream.

In close_ctree(), after we have stopped all workers, there may still be
some read requests (for example readahead) to submit, and this *may*
trigger an oops that a user reported before:

kernel BUG at fs/btrfs/async-thread.c:619!

By hacking the code, I can reproduce this problem with one CPU available.
We fix this potential problem by invalidating all btree inode pages before
stopping all workers.
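
A sketch of the invalidation step in close_ctree() (illustrative):

	/* drop cached btree pages so no further reads reach the stopped workers */
	invalidate_inode_pages2(fs_info->btree_inode->i_mapping);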

Thanks to Miao for pointing out this problem.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30 20:09:46 -07:00
Miao Xie 87d7177149 Btrfs: output warning instead of error when loading free space cache failed
commit 32d6b47fe6fc1714d5f1bba1b9f38e0ab0ad58a8 upstream.

If we fail to load a free space cache, we can rebuild it from the extent
tree, so it is not a serious error, and we should not output an error
message that would make users uncomfortable. This patch uses a warning
message instead.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30 20:09:45 -07:00
Qu Wenruo de09143391 btrfs: Add ctime/mtime update for btrfs device add/remove.
commit 5a1972bd9fd4b2fb1bac8b7a0b636d633d8717e3 upstream.

Btrfs sends a uevent to udev to announce the device change, but
ctime/mtime for the block device inode is not updated. This leaves
libblkid, used by btrfs-progs, unable to detect the device change, so it
uses its old cache, and 'btrfs dev scan; btrfs dev remove; btrfs dev scan'
gives an error message.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Cc: Karel Zak <kzak@redhat.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30 20:09:45 -07:00
Chris Mason 8e3ddf4c9b Btrfs: fix double free in find_lock_delalloc_range
commit 7d78874273463a784759916fc3e0b4e2eb141c70 upstream.

We need to NULL the cached_state after freeing it, otherwise
we might free it again if find_delalloc_range doesn't find anything.
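
In other words, roughly:

	free_extent_state(cached_state);
	cached_state = NULL;	/* avoid a second free if find_delalloc_range() finds nothing */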

Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30 20:09:45 -07:00
Benjamin LaHaise d36db46c2c aio: fix kernel memory disclosure in io_getevents() introduced in v3.10
commit edfbbf388f293d70bf4b7c0bc38774d05e6f711a upstream.

A kernel memory disclosure was introduced in aio_read_events_ring() in v3.10
by commit a31ad380be.  The changes made to
aio_read_events_ring() failed to correctly limit the index into
ctx->ring_pages[], allowing an attacker to cause the subsequent kmap() of
an arbitrary page with a copy_to_user() to copy the contents into userspace.
This vulnerability has been assigned CVE-2014-0206.  Thanks to Mateusz and
Petr for disclosing this issue.

This patch applies to v3.12+.  A separate backport is needed for 3.10/3.11.

[jmoyer@redhat.com: backported to 3.10]
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30 20:09:45 -07:00
Benjamin LaHaise 6745cb91b5 aio: fix aio request leak when events are reaped by userspace
commit f8567a3845ac05bb28f3c1b478ef752762bd39ef upstream.

The aio cleanups and optimizations by kmo that were merged into the 3.10
tree added a regression for userspace event reaping.  Specifically, the
reference counts are not decremented if the event is reaped in userspace,
leading to the application being unable to submit further aio requests.
This patch applies to 3.12+.  A separate backport is required for 3.10/3.11.
This issue was uncovered as part of CVE-2014-0206.

[jmoyer@redhat.com: backported to 3.10]
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30 20:09:45 -07:00
Maurizio Lombardi dc10f332a7 ext4: fix wrong assert in ext4_mb_normalize_request()
commit b5b60778558cafad17bbcbf63e0310bd3c68eb17 upstream.

The variable "size" is expressed as number of blocks and not as
number of clusters, this could trigger a kernel panic when using
ext4 with the size of a cluster different from the size of a block.

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30 20:09:42 -07:00
Jan Kara 327d2822e7 ext4: fix zeroing of page during writeback
commit eeece469dedadf3918bad50ad80f4616a0064e90 upstream.

Tail of a page straddling inode size must be zeroed when being written
out due to POSIX requirement that modifications of mmaped page beyond
inode size must not be written to the file. ext4_bio_write_page() did
this only for blocks fully beyond inode size but didn't properly zero
blocks partially beyond inode size. Fix this.

The problem has been uncovered by mmap_11-4 test in openposix test suite
(part of LTP).

Reported-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Fixes: 5a0dc7365c
Fixes: bd2d0210cf
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30 20:09:42 -07:00
Arve Hjønnevåg 7132a7f02d pstore/ram: Add ramoops_console_write_buf api
Allow writing into the ramoops console buffer.
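
The new entry point presumably has a signature along these lines (an
assumption, not verified against the Android common tree):

	/* assumed prototype of the new API */
	void ramoops_console_write_buf(const char *buf, size_t size);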

Change-Id: Iff0d69b562e4dae33ea7f8d19412227bebb17e47
Signed-off-by: Arve Hjønnevåg <arve@android.com>
Git-commit: 3e966cff406d9b43a16e07291a918cb2a098e32f
Git-repo: https://android.googlesource.com/kernel/common.git
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2014-06-23 14:59:06 -07:00
Amir Samuelov 8307869921 platform: msm: fix PFT when using direct-io
When reading or writing a file using direct I/O (O_DIRECT),
we can't get the inode from a bio by walking the path
bio->bi_io_vec->bv_page->mapping->host,
since the page is anonymous and not mapped.
In the other, more typical cases, when not using O_DIRECT,
the page cache is used and the bv_page is available.
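
The page-cache walk described above, as a sketch (it fails for anonymous
O_DIRECT pages):

	struct page *page = bio->bi_io_vec[0].bv_page;
	struct inode *inode = (page && page->mapping) ? page->mapping->host : NULL;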

Change-Id: I349cad54a978ed9919f960d55f0f95c1e53262e5
Signed-off-by: Amir Samuelov <amirs@codeaurora.org>
2014-06-23 21:38:40 +03:00
Linux Build Service Account 67a383249d Merge "Merge v3.10.40 and related reverts into msm-3.10" 2014-06-20 00:09:33 -07:00
Linux Build Service Account 464c84b97a Merge "fs: ubifs: fix shrink_tnc assertion warning" 2014-06-19 12:13:43 -07:00