android_kernel_samsung_msm8976/fs
Maxim Patlasov c2bf56d4a3 mm/page-writeback.c: add strictlimit feature
The feature prevents mistrusted filesystems (ie: FUSE mounts created by
unprivileged users) to grow a large number of dirty pages before
throttling.  For such filesystems balance_dirty_pages always check bdi
counters against bdi limits.  I.e.  even if global "nr_dirty" is under
"freerun", it's not allowed to skip bdi checks.  The only use case for now
is fuse: it sets bdi max_ratio to 1% by default and system administrators
are supposed to expect that this limit won't be exceeded.

The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag.  A
filesystem may set the flag when it initializes its BDI.

The problematic scenario comes from the fact that nobody pays attention to
the NR_WRITEBACK_TEMP counter (i.e.  number of pages under fuse
writeback).  The implementation of fuse writeback releases original page
(by calling end_page_writeback) almost immediately.  A fuse request queued
for real processing bears a copy of original page.  Hence, if userspace
fuse daemon doesn't finalize write requests in timely manner, an
aggressive mmap writer can pollute virtually all memory by those temporary
fuse page copies.  They are carefully accounted in NR_WRITEBACK_TEMP, but
nobody cares.

To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
problem" as a shortcut for "a possibility of uncontrolled grow of amount
of RAM consumed by temporary pages allocated by kernel fuse to process
writeback".

The problem was very easy to reproduce.  There is a trivial example
filesystem implementation in fuse userspace distribution: fusexmp_fh.c.  I
added "sleep(1);" to the write methods, then recompiled and mounted it.
Then created a huge file on the mount point and run a simple program which
mmap-ed the file to a memory region, then wrote a data to the region.  An
hour later I observed almost all RAM consumed by fuse writeback.  Since
then some unrelated changes in kernel fuse made it more difficult to
reproduce, but it is still possible now.

Putting this theoretical happens-in-the-lab thing aside, there is another
thing that really hurts real world (FUSE) users.  This is write-through
page cache policy FUSE currently uses.  I.e.  handling write(2), kernel
fuse populates page cache and flushes user data to the server
synchronously.  This is excessively suboptimal.  Pavel Emelyanov's patches
("writeback cache policy") solve the problem, but they also make resolving
NR_WRITEBACK_TEMP problem absolutely necessary.  Otherwise, simply copying
a huge file to a fuse mount would result in memory starvation.  Miklos,
the maintainer of FUSE, believes strictlimit feature the way to go.

And eventually putting FUSE topics aside, there is one more use-case for
strictlimit feature.  Using a slow USB stick (mass storage) in a machine
with huge amount of RAM installed is a well-known pain.  Let's make simple
computations.  Assuming 64GB of RAM installed, existing implementation of
balance_dirty_pages will start throttling only after 9.6GB of RAM becomes
dirty (freerun == 15% of total RAM).  So, the command "cp 9GB_file
/media/my-usb-storage/" may return in a few seconds, but subsequent
"umount /media/my-usb-storage/" will take more than two hours if effective
throughput of the storage is, to say, 1MB/sec.

After inclusion of strictlimit feature, it will be trivial to add a knob
(e.g.  /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on demand.
Manually or via udev rule.  May be I'm wrong, but it seems to be quite a
natural desire to limit the amount of dirty memory for some devices we are
not fully trust (in the sense of sustainable throughput).

[akpm@linux-foundation.org: fix warning in page-writeback.c]
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: 5a53748568f79641eaf40e41081a2f4987f005c2
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: I2def00e492ab04b4938d11e35dfc87656b2acf20
Signed-off-by: Tarun Gupta <tarung@codeaurora.org>
2014-11-25 20:59:45 -08:00
..
9p aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
adfs
affs
afs aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
autofs4 autofs - remove autofs dentry mount check 2013-05-06 13:06:59 -07:00
befs befs_readdir(): do not increment ->f_pos if filldir tells us to stop 2013-05-31 15:17:56 -04:00
bfs
btrfs Merge upstream tag 'v3.10.49' into msm-3.10 2014-08-20 13:23:09 -07:00
cachefiles
ceph ceph: allow sync_read/write return partial successed size of read/write. 2014-01-09 12:24:25 -08:00
cifs Merge upstream tag 'v3.10.49' into msm-3.10 2014-08-20 13:23:09 -07:00
coda
configfs configfs: fix race between dentry put and lookup 2013-11-29 11:11:53 -08:00
cramfs
debugfs Merge upstream linux-stable v3.10.28 into msm-3.10 2014-03-24 14:28:34 -07:00
devpts devpts: plug the memory leak in kill_sb 2013-12-04 10:55:49 -08:00
dlm
ecryptfs ecryptfs: Fix memory leakage in keystore.c 2013-11-13 12:05:31 +09:00
efivarfs efivarfs: Never return ENOENT from firmware again 2013-05-13 20:12:10 +01:00
efs
exofs ore: Fix wrong math in allocation of per device BIO 2014-02-13 13:48:00 -08:00
exportfs
ext2 aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
ext3 ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree() 2013-07-21 18:21:23 -07:00
ext4 Merge upstream tag 'v3.10.49' into msm-3.10 2014-08-20 13:23:09 -07:00
f2fs f2fs updates for v3.10 2013-05-08 15:11:48 -07:00
fat Add compat_ioctl support for VFAT_IOCTL_GET_VOLUME_ID 2014-06-13 12:05:26 -07:00
freevxfs
fscache
fuse mm/page-writeback.c: add strictlimit feature 2014-11-25 20:59:45 -08:00
gfs2 arch: Mass conversion of smp_mb__*() 2014-08-15 11:45:28 -07:00
hfs hfs: avoid crash in hfs_bnode_create 2013-05-24 16:22:51 -07:00
hfsplus aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
hostfs hostfs: use kmalloc instead of kzalloc 2013-05-04 15:48:45 -04:00
hpfs hpfs: better test for errors 2013-07-13 11:42:26 -07:00
hppfs
hugetlbfs cope with potentially long ->d_dname() output for shmem/hugetlb 2014-01-09 16:35:41 -08:00
isofs isofs: Refuse RW mount of the filesystem instead of making it RO 2013-09-26 17:18:28 -07:00
jbd Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2013-05-03 09:56:25 -07:00
jbd2 Merge upstream tag 'v3.10.49' into msm-3.10 2014-08-20 13:23:09 -07:00
jffs2 jffs2: remove from wait queue after schedule() 2014-04-26 17:15:36 -07:00
jfs jfs: fix error path in ialloc 2013-11-13 12:05:31 +09:00
lockd lockd: ensure we tear down any live sockets when socket creation fails during lockd_up 2014-05-13 13:59:46 +02:00
logfs
minix
ncpfs ncpfs: fix rmdir returns Device or resource busy 2013-06-07 12:15:38 -04:00
nfs Merge upstream tag 'v3.10.49' into msm-3.10 2014-08-20 13:23:09 -07:00
nfs_common
nfsd nfsd: fix rare symlink decoding bug 2014-07-09 11:14:02 -07:00
nilfs2 nilfs2: fix segctor bug that causes file system corruption 2014-01-25 08:27:12 -08:00
nls
notify compat: fix sys_fanotify_mark 2014-02-13 13:48:00 -08:00
ntfs aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
ocfs2 ocfs2: do not put bh when buffer_uptodate failed 2014-05-06 07:55:32 -07:00
omfs
openpromfs
proc sched: Add API to set task's initial task load 2014-11-05 14:26:59 +05:30
pstore pstore/ram: Add ramoops_console_write_buf api 2014-06-23 14:59:06 -07:00
qnx4
qnx6 qnx6: qnx6_readdir() has a braino in pos calculation 2013-05-31 15:17:31 -04:00
quota quota: Fix race between dqput() and dquot_scan_active() 2014-03-06 21:30:12 -08:00
ramfs
reiserfs reiserfs: call truncate_setsize under tailpack mutex 2014-07-06 18:54:15 -07:00
romfs
squashfs
sysfs
sysv sysv: Add forgotten superblock lock init for v7 fs 2013-10-05 07:13:09 -07:00
ubifs Merge upstream tag 'v3.10.49' into msm-3.10 2014-08-20 13:23:09 -07:00
udf udf: Refuse RW mount of the filesystem instead of making it RO 2013-10-01 09:17:48 -07:00
ufs
xfs xfs: fix directory hash ordering bug 2014-05-06 07:55:28 -07:00
yaffs2 fs: yaffs2: initialize delayed workqueue before using it 2013-11-26 23:29:26 -08:00
aio.c aio: fix kernel memory disclosure in io_getevents() introduced in v3.10 2014-06-30 20:09:45 -07:00
anon_inodes.c
attr.c fs,userns: Change inode_capable to capable_wrt_inode_uidgid 2014-06-16 13:42:52 -07:00
bad_inode.c
binfmt_aout.c mm: remove free_area_cache 2014-02-07 13:49:41 -08:00
binfmt_elf.c Merge upstream linux-stable v3.10.28 into msm-3.10 2014-03-24 14:28:34 -07:00
binfmt_elf_fdpic.c Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc 2013-05-02 10:16:16 -07:00
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c
binfmt_script.c
binfmt_som.c
bio-integrity.c bio-integrity: Fix bio_integrity_verify segment start bug 2014-03-23 21:38:21 -07:00
bio.c platform: msm: fix PFT when using direct-io 2014-06-23 21:38:40 +03:00
block_dev.c writeback: Fix periodic writeback after fs mount 2013-07-28 16:29:40 -07:00
buffer.c block/fs: make tracking dirty task debug only 2014-10-28 17:13:01 -07:00
char_dev.c
compat.c compat: let architectures define __ARCH_WANT_COMPAT_SYS_GETDENTS64 2014-08-15 11:41:28 -07:00
compat_binfmt_elf.c
compat_ioctl.c fs: Add TTY PM IOCTLs to compat table 2014-07-30 10:25:00 -06:00
coredump.c do_coredump(): don't wait for thaw if coredump has already been interrupted 2013-05-04 14:45:54 -04:00
coredump.h
dcache.c vfs: In d_path don't call d_dname on a mount point 2014-01-25 08:27:11 -08:00
dcookies.c fs/compat: fix lookup_dcookie() parameter handling 2014-02-13 13:48:00 -08:00
direct-io.c platform: msm: fix PFT when using direct-io 2014-06-23 21:38:40 +03:00
drop_caches.c
eventfd.c
eventpoll.c Revert "epoll: use freezable blocking call" 2014-08-29 14:20:41 -07:00
exec.c metag: Reduce maximum stack size to 256MB 2014-06-07 13:25:38 -07:00
fcntl.c
fhandle.c
file.c Merge upstream linux-stable v3.10.36 into msm-3.10 2014-04-23 16:23:49 -07:00
file_table.c don't bother with {get,put}_write_access() on non-regular files 2014-05-30 21:52:12 -07:00
filesystems.c
fs-writeback.c Merge upstream tag 'v3.10.40' into msm-3.10 2014-06-18 13:10:54 -07:00
fs_struct.c
generic_acl.c
inode.c fs,userns: Change inode_capable to capable_wrt_inode_uidgid 2014-06-16 13:42:52 -07:00
internal.h splice: don't pass the address of ->f_pos to methods 2013-06-20 19:02:45 +04:00
ioctl.c
ioprio.c
Kconfig fs: yaffs: Import yaffs from Thu Dec 23 13:31:37 2010 +1300 2013-09-04 17:24:15 -07:00
Kconfig.binfmt
libfs.c
locks.c locks: allow __break_lease to sleep even when break_time is 0 2014-05-13 13:59:44 +02:00
Makefile fs: yaffs: Import yaffs from Thu Dec 23 13:31:37 2010 +1300 2013-09-04 17:24:15 -07:00
mbcache.c
mount.h vfs: Is mounted should be testing mnt_ns for NULL or error. 2014-02-06 11:08:16 -08:00
mpage.c
namei.c Merge upstream tag 'v3.10.49' into msm-3.10 2014-08-20 13:23:09 -07:00
namespace.c VFS: collect_mounts() should return an ERR_PTR 2013-08-29 09:47:35 -07:00
no-block.c
open.c Merge upstream tag 'v3.10.49' into msm-3.10 2014-08-20 13:23:09 -07:00
pipe.c vfs: fix subtle use-after-free of pipe_inode_info 2013-12-11 22:36:26 -08:00
pnode.c vfs: Fix invalid ida_remove() call 2013-05-31 15:16:33 -04:00
pnode.h
posix_acl.c posix_acl: handle NULL ACL in posix_acl_equiv_mode 2014-06-07 13:25:33 -07:00
proc_namespace.c
read_write.c Merge upstream linux-stable v3.10.36 into msm-3.10 2014-04-23 16:23:49 -07:00
readdir.c
select.c select: use freezable blocking call 2013-07-01 15:45:28 -07:00
seq_file.c fs/seq_file: Use vmalloc by default for allocations > PAGE_SIZE 2014-09-23 10:37:58 -06:00
signalfd.c
splice.c fuse: fix pipe_buf_operations 2014-02-13 13:47:59 -08:00
stack.c
stat.c
statfs.c vfs: allow O_PATH file descriptors for fstatfs() 2013-10-18 07:45:44 -07:00
super.c Merge "fs/super.c: sync ro remount after blocking writers" 2014-04-16 04:43:09 -07:00
sync.c
timerfd.c timerfd: support CLOCK_BOOTTIME clock 2014-06-13 12:06:03 -07:00
utimes.c
xattr.c
xattr_acl.c