printk is meant to be used with an associated log level. There are some
instances of printk scattered around the mm code where the log level is
missing. Add a log level and adhere to suggestions by
scripts/checkpatch.pl by moving to the pr_* macros.
Also add the typical pr_fmt definition so that print statements can be
easily traced back to the modules where they occur, correlated one with
another, etc. This will require the removal of some (now redundant)
prefixes on a few print statements.
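As a rough illustration of how a pr_fmt definition prefixes every message, here is a minimal userspace sketch. The kernel normally uses KBUILD_MODNAME and its own log-level markers; MODNAME, the "<4>"/"<6>" strings and the pr_*_to() helpers below are hypothetical stand-ins so that the example builds outside the kernel.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical module name, standing in for the kernel's KBUILD_MODNAME. */
#define MODNAME "vmscan"

/* pr_fmt must be defined before the pr_* macros are used. */
#define pr_fmt(fmt) MODNAME ": " fmt

/* Simplified stand-ins for the kernel's log-level prefixes. */
#define LVL_WARNING "<4>"
#define LVL_INFO    "<6>"

/*
 * Userspace stand-ins for pr_warn()/pr_info(): they format into a buffer
 * instead of the kernel log so the result can be inspected.
 */
#define pr_warn_to(buf, len, fmt, ...) \
	snprintf(buf, len, LVL_WARNING pr_fmt(fmt), ##__VA_ARGS__)
#define pr_info_to(buf, len, fmt, ...) \
	snprintf(buf, len, LVL_INFO pr_fmt(fmt), ##__VA_ARGS__)
```

Because the prefix comes from pr_fmt, a message such as "vmscan: order 3 allocation failed" no longer needs a hand-written "vmscan: " in its format string.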
Signed-off-by: Mitchel Humpherys <mitchelh@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit 145e1a71e090575c74969e3daa8136d1e5b99fc8 upstream.
George Boole would have noticed a slight error in 4.16 commit
69d763fc6d3a ("mm: pin address_space before dereferencing it while
isolating an LRU page"). Fix it, to match both the comment above it,
and the original behaviour.
Although anonymous pages are not marked PageDirty at first, we have an
old habit of calling SetPageDirty when a page is removed from swap
cache: so there's a category of ex-swap pages that are easily
migratable, but were inadvertently excluded from compaction's async
migration in 4.16.
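The difference between the two conditions can be sketched in userspace as follows. This is only an illustration under assumptions: the structures are minimal stand-ins, not the kernel definitions, and the helper names (migrate_dirty_4_16, migrate_dirty_fixed, dummy_migratepage) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel structures; not the real definitions. */
struct a_ops { int (*migratepage)(void); };
struct address_space { const struct a_ops *a_ops; };

/* Hypothetical callback, used only to populate a_ops in examples. */
static int dummy_migratepage(void) { return 0; }

/* Assumed 4.16 shape: a dirty page with no mapping is refused. */
static bool migrate_dirty_4_16(const struct address_space *mapping)
{
	return mapping && mapping->a_ops->migratepage;
}

/* Intended shape: accept no mapping, or a mapping that can migrate. */
static bool migrate_dirty_fixed(const struct address_space *mapping)
{
	return !mapping || mapping->a_ops->migratepage;
}
```

The two helpers agree whenever a mapping is present and differ only for a NULL mapping, which is the ex-swap anonymous page case described above.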
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805302014001.12558@eggly.anvils
Fixes: 69d763fc6d3a ("mm: pin address_space before dereferencing it while isolating an LRU page")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Ivan Kalvachev <ikalvachev@gmail.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 69d763fc6d3aee787a3e8c8c35092b4f4960fa5d upstream.
Minchan Kim asked the following question -- what lock protects the
address_space from being destroyed when a race happens between inode
truncation and __isolate_lru_page? Jan Kara clarified by describing the
race as follows:
CPU1                                                CPU2

truncate(inode)                                     __isolate_lru_page()
  ...
  truncate_inode_page(mapping, page);
    delete_from_page_cache(page)
      spin_lock_irqsave(&mapping->tree_lock, flags);
        __delete_from_page_cache(page, NULL)
          page_cache_tree_delete(..)
            ...                                     mapping = page_mapping(page);
            page->mapping = NULL;
            ...
      spin_unlock_irqrestore(&mapping->tree_lock, flags);
      page_cache_free_page(mapping, page)
        put_page(page)
          if (put_page_testzero(page)) -> false
- inode now has no pages and can be freed including embedded address_space

                                                    if (mapping && !mapping->a_ops->migratepage)
- we've dereferenced mapping which is potentially already free.
The race is theoretically possible but unlikely. Before the
delete_from_page_cache, truncate_cleanup_page is called so the page is
likely to be !PageDirty or PageWriteback which gets skipped by the only
caller that checks the mapping in __isolate_lru_page. Even if the race
occurs, a substantial amount of work has to happen during a tiny window
with no preemption but it could potentially be done using a virtual
machine to artificially slow one CPU or halt it during the critical
window.
This patch should eliminate the race with truncation by try-locking the
page before dereferencing mapping and aborting if the lock was not
acquired. There was a suggestion from Huang Ying to use RCU as a
side-effect to prevent mapping being freed. However, I do not like the
solution as it's an unconventional means of preserving a mapping and
it's not a context where rcu_read_lock is obviously protecting rcu data.
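The try-lock approach can be sketched in userspace as below. This is a single-threaded model under assumptions, not the kernel code: struct page, trylock_page() and unlock_page() are simplified stand-ins for the kernel names, and can_isolate_dirty() is a hypothetical helper. The check mirrors the condition quoted in the race description.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel structures; not the real definitions. */
struct a_ops { int (*migratepage)(void); };
struct address_space { const struct a_ops *a_ops; };
struct page {
	bool locked;                    /* models the page lock */
	struct address_space *mapping;
};

/* Simplified, non-atomic model of the page lock helpers. */
static bool trylock_page(struct page *page)
{
	if (page->locked)
		return false;
	page->locked = true;
	return true;
}

static void unlock_page(struct page *page)
{
	page->locked = false;
}

/* Hypothetical helper: may this dirty page be isolated for migration? */
static bool can_isolate_dirty(struct page *page)
{
	bool ok;

	/* Truncation holds the page lock, so give up rather than wait. */
	if (!trylock_page(page))
		return false;

	/* With the lock held, the mapping cannot be freed under us. */
	ok = !page->mapping || page->mapping->a_ops->migratepage;
	unlock_page(page);
	return ok;
}
```

Aborting on a failed try-lock keeps the isolation path non-blocking, at the cost of occasionally skipping a page that could have been isolated.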
Link: http://lkml.kernel.org/r/20180104102512.2qos3h5vqzeisrek@techsingularity.net
Fixes: c824493528 ("mm: compaction: make isolate_lru_page() filter-aware again")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
reclaim_clean_pages_from_list() assumes that shrink_page_list() returns
the number of pages removed from the candidate list. But shrink_page_list()
puts mlocked pages back without passing them to the caller and without
counting them as nr_reclaimed. As a result, nr_isolated keeps growing.
To fix this, this patch changes shrink_page_list() to pass unevictable
pages back to the caller. The caller will take care of those pages.
Minchan said:
It fixes two issues.
1. With unevictable page, cma_alloc will be successful.
Exactly speaking, cma_alloc of current kernel will fail due to
unevictable pages.
2. fix leaking of NR_ISOLATED counter of vmstat
With it, too_many_isolated works. Otherwise, it could make hang until
the process get SIGKILL.
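The accounting problem can be modelled with a small userspace sketch. It rests on assumptions: the caller is taken to subtract from the isolated count only the pages it can see (those reclaimed, plus those handed back on the list), and enum kind, shrink() and leaked_isolated() are hypothetical names, not kernel functions.

```c
#include <stdbool.h>

enum kind { CLEAN, MLOCKED };

/*
 * Model of shrink_page_list(): returns the number of reclaimed pages and
 * reports in *returned how many pages were handed back to the caller.
 */
static int shrink(const enum kind *pages, int n, bool pass_back, int *returned)
{
	int reclaimed = 0;
	int i;

	*returned = 0;
	for (i = 0; i < n; i++) {
		if (pages[i] == CLEAN)
			reclaimed++;
		else if (pass_back)
			(*returned)++;
		/* else: put back on the LRU directly, unseen by the caller */
	}
	return reclaimed;
}

/* Isolated count left over once the caller has accounted for what it saw. */
static int leaked_isolated(const enum kind *pages, int n, bool pass_back)
{
	int returned;
	int isolated = n;

	isolated -= shrink(pages, n, pass_back, &returned);
	isolated -= returned;   /* caller migrates or puts these back */
	return isolated;
}
```

With one mlocked page among the candidates, the old behaviour (pass_back false) leaves one page counted as isolated forever, while passing it back brings the count to zero.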
Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Git-commit: 99e564148e202d817163a10af873a81bc33d532e
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
CRs-Fixed: 885312
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Change-Id: Icbd26a41d49ae33a43cbeac9d59d7be939192b5a
Merge commit 'v3.10.67' into msm-3.10
This merge brings us up to date with upstream kernel.org tag v3.10.67.
It also contains changes to allow forbidden warnings introduced in
the commit 'core, nfqueue, openvswitch: Orphan frags in skb_zerocopy
and handle errors'. Once upstream has corrected these warnings, the
changes to scripts/gcc-wrapper.py, in this commit, can be reverted.
* commit 'v3.10.67' (915 commits)
Linux 3.10.67
md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants.
ext4: fix warning in ext4_da_update_reserve_space()
quota: provide interface for readding allocated space into reserved space
crypto: add missing crypto module aliases
crypto: include crypto- module prefix in template
crypto: prefix module autoloading with "crypto-"
drbd: merge_bvec_fn: properly remap bvm->bi_bdev
Revert "swiotlb-xen: pass dev_addr to swiotlb_tbl_unmap_single"
ipvs: uninitialized data with IP_VS_IPV6
KEYS: close race between key lookup and freeing
sata_dwc_460ex: fix resource leak on error path
x86/asm/traps: Disable tracing and kprobes in fixup_bad_iret and sync_regs
x86, tls: Interpret an all-zero struct user_desc as "no segment"
x86, tls, ldt: Stop checking lm in LDT_empty
x86/tsc: Change Fast TSC calibration failed from error to info
x86, hyperv: Mark the Hyper-V clocksource as being continuous
clocksource: exynos_mct: Fix bitmask regression for exynos4_mct_write
can: dev: fix crtlmode_supported check
bus: mvebu-mbus: fix support of MBus window 13
ARM: dts: imx25: Fix PWM "per" clocks
time: adjtimex: Validate the ADJ_FREQUENCY values
time: settimeofday: Validate the values of tv from user
dm cache: share cache-metadata object across inactive and active DM tables
ipr: wait for aborted command responses
drm/i915: Fix mutex->owner inspection race under DEBUG_MUTEXES
scripts/recordmcount.pl: There is no -m32 gcc option on Super-H anymore
ALSA: usb-audio: Add mic volume fix quirk for Logitech Webcam C210
libata: prevent HSM state change race between ISR and PIO
pinctrl: Fix two deadlocks
gpio: sysfs: fix gpio device-attribute leak
gpio: sysfs: fix gpio-chip device-attribute leak
Linux 3.10.66
s390/3215: fix tty output containing tabs
s390/3215: fix hanging console issue
fsnotify: next_i is freed during fsnotify_unmount_inodes.
netfilter: ipset: small potential read beyond the end of buffer
mmc: sdhci: Fix sleep in atomic after inserting SD card
LOCKD: Fix a race when initialising nlmsvc_timeout
x86, um: actually mark system call tables readonly
um: Skip futex_atomic_cmpxchg_inatomic() test
decompress_bunzip2: off by one in get_next_block()
ARM: shmobile: sh73a0 legacy: Set .control_parent for all irqpin instances
ARM: omap5/dra7xx: Fix frequency typos
ARM: clk-imx6q: fix video divider for rev T0 1.0
ARM: imx6q: drop unnecessary semicolon
ARM: dts: imx25: Fix the SPI1 clocks
Input: I8042 - add Acer Aspire 7738 to the nomux list
Input: i8042 - reset keyboard to fix Elantech touchpad detection
can: kvaser_usb: Don't send a RESET_CHIP for non-existing channels
can: kvaser_usb: Reset all URB tx contexts upon channel close
can: kvaser_usb: Don't free packets when tight on URBs
USB: keyspan: fix null-deref at probe
USB: cp210x: add IDs for CEL USB sticks and MeshWorks devices
USB: cp210x: fix ID for production CEL MeshConnect USB Stick
usb: dwc3: gadget: Stop TRB preparation after limit is reached
usb: dwc3: gadget: Fix TRB preparation during SG
OHCI: add a quirk for ULi M5237 blocking on reset
gpiolib: of: Correct error handling in of_get_named_gpiod_flags
NFSv4.1: Fix client id trunking on Linux
ftrace/jprobes/x86: Fix conflict between jprobes and function graph tracing
vfio-pci: Fix the check on pci device type in vfio_pci_probe()
uvcvideo: Fix destruction order in uvc_delete()
smiapp: Take mutex during PLL update in sensor initialisation
af9005: fix kernel panic on init if compiled without IR
smiapp-pll: Correct clock debug prints
video/logo: prevent use of logos after they have been freed
storvsc: ring buffer failures may result in I/O freeze
iscsi-target: Fail connection on short sendmsg writes
hp_accel: Add support for HP ZBook 15
cfg80211: Fix 160 MHz channels with 80+80 and 160 MHz drivers
ARC: [nsimosci] move peripherals to match model to FPGA
drm/i915: Force the CS stall for invalidate flushes
drm/i915: Invalidate media caches on gen7
drm/radeon: properly filter DP1.2 4k modes on non-DP1.2 hw
drm/radeon: check the right ring in radeon_evict_flags()
drm/vmwgfx: Fix fence event code
enic: fix rx skb checksum
alx: fix alx_poll()
tcp: Do not apply TSO segment limit to non-TSO packets
tg3: tg3_disable_ints using uninitialized mailbox value to disable interrupts
netlink: Don't reorder loads/stores before marking mmap netlink frame as available
netlink: Always copy on mmap TX.
Linux 3.10.65
mm: Don't count the stack guard page towards RLIMIT_STACK
mm: propagate error from stack expansion even for guard page
mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed
perf session: Do not fail on processing out of order event
perf: Fix events installation during moving group
perf/x86/intel/uncore: Make sure only uncore events are collected
Btrfs: don't delay inode ref updates during log replay
ARM: mvebu: disable I/O coherency on non-SMP situations on Armada 370/375/38x/XP
scripts/kernel-doc: don't eat struct members with __aligned
nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races
nfsd4: fix xdr4 inclusion of escaped char
fs: nfsd: Fix signedness bug in compare_blob
serial: samsung: wait for transfer completion before clock disable
writeback: fix a subtle race condition in I_DIRTY clearing
cdc-acm: memory leak in error case
genhd: check for int overflow in disk_expand_part_tbl()
USB: cdc-acm: check for valid interfaces
ALSA: hda - Fix wrong gpio_dir & gpio_mask hint setups for IDT/STAC codecs
ALSA: hda - using uninitialized data
ALSA: usb-audio: extend KEF X300A FU 10 tweak to Arcam rPAC
driver core: Fix unbalanced device reference in drivers_probe
x86, vdso: Use asm volatile in __getcpu
x86_64, vdso: Fix the vdso address randomization algorithm
HID: Add a new id 0x501a for Genius MousePen i608X
HID: add battery quirk for USB_DEVICE_ID_APPLE_ALU_WIRELESS_2011_ISO keyboard
HID: roccat: potential out of bounds in pyra_sysfs_write_settings()
HID: i2c-hid: prevent buffer overflow in early IRQ
HID: i2c-hid: fix race condition reading reports
iommu/vt-d: Fix an off-by-one bug in __domain_mapping()
UBI: Fix double free after do_sync_erase()
UBI: Fix invalid vfree()
pstore-ram: Allow optional mapping with pgprot_noncached
pstore-ram: Fix hangs by using write-combine mappings
PCI: Restore detection of read-only BARs
ASoC: dwc: Ensure FIFOs are flushed to prevent channel swap
ASoC: max98090: Fix ill-defined sidetone route
ASoC: sigmadsp: Refuse to load firmware files with a non-supported version
ath5k: fix hardware queue index assignment
swiotlb-xen: pass dev_addr to swiotlb_tbl_unmap_single
can: peak_usb: fix memset() usage
can: peak_usb: fix cleanup sequence order in case of error during init
ath9k: fix BE/BK queue order
ath9k_hw: fix hardware queue allocation
ocfs2: fix journal commit deadlock
Linux 3.10.64
Btrfs: fix fs corruption on transaction abort if device supports discard
Btrfs: do not move em to modified list when unpinning
eCryptfs: Remove buggy and unnecessary write in file name decode routine
eCryptfs: Force RO mount when encrypted view is enabled
udf: Verify symlink size before loading it
exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
ncpfs: return proper error from NCP_IOC_SETROOT ioctl
crypto: af_alg - fix backlog handling
userns: Unbreak the unprivileged remount tests
userns: Allow setting gid_maps without privilege when setgroups is disabled
userns: Add a knob to disable setgroups on a per user namespace basis
userns: Rename id_map_mutex to userns_state_mutex
userns: Only allow the creator of the userns unprivileged mappings
userns: Check euid no fsuid when establishing an unprivileged uid mapping
userns: Don't allow unprivileged creation of gid mappings
userns: Don't allow setgroups until a gid mapping has been setablished
userns: Document what the invariant required for safe unprivileged mappings.
groups: Consolidate the setgroups permission checks
umount: Disallow unprivileged mount force
mnt: Update unprivileged remount test
mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount
mac80211: free management frame keys when removing station
mac80211: fix multicast LED blinking and counter
KEYS: Fix stale key registration at error path
isofs: Fix unchecked printing of ER records
x86/tls: Don't validate lm in set_thread_area() after all
dm space map metadata: fix sm_bootstrap_get_nr_blocks()
dm bufio: fix memleak when using a dm_buffer's inline bio
nfs41: fix nfs4_proc_layoutget error handling
megaraid_sas: corrected return of wait_event from abort frame path
mmc: block: add newline to sysfs display of force_ro
mfd: tc6393xb: Fail ohci suspend if full state restore is required
md/bitmap: always wait for writes on unplug.
x86, kvm: Clear paravirt_enabled on KVM guests for espfix32's benefit
x86_64, switch_to(): Load TLS descriptors before switching DS and ES
x86/tls: Disallow unusual TLS segments
x86/tls: Validate TLS entries to protect espfix
isofs: Fix infinite looping over CE entries
Linux 3.10.63
ALSA: usb-audio: Don't resubmit pending URBs at MIDI error recovery
powerpc: 32 bit getcpu VDSO function uses 64 bit instructions
ARM: sched_clock: Load cycle count after epoch stabilizes
igb: bring link up when PHY is powered up
ext2: Fix oops in ext2_get_block() called from ext2_quota_write()
nEPT: Nested INVEPT
net: sctp: use MAX_HEADER for headroom reserve in output path
net: mvneta: fix Tx interrupt delay
rtnetlink: release net refcnt on error in do_setlink()
net/mlx4_core: Limit count field to 24 bits in qp_alloc_res
tg3: fix ring init when there are more TX than RX channels
ipv6: gre: fix wrong skb->protocol in WCCP
sata_fsl: fix error handling of irq_of_parse_and_map
ahci: disable MSI on SAMSUNG 0xa800 SSD
AHCI: Add DeviceIDs for Sunrise Point-LP SATA controller
media: smiapp: Only some selection targets are settable
drm/i915: Unlock panel even when LVDS is disabled
drm/radeon: kernel panic in drm_calc_vbltimestamp_from_scanoutpos with 3.18.0-rc6
i2c: davinci: generate STP always when NACK is received
i2c: omap: fix i207 errata handling
i2c: omap: fix NACK and Arbitration Lost irq handling
xen-netfront: Remove BUGs on paged skb data which crosses a page boundary
mm: fix swapoff hang after page migration and fork
mm: frontswap: invalidate expired data on a dup-store failure
Linux 3.10.62
nfsd: Fix ACL null pointer deref
powerpc/powernv: Honor the generic "no_64bit_msi" flag
bnx2fc: do not add shared skbs to the fcoe_rx_list
nfsd4: fix leak of inode reference on delegation failure
nfsd: Fix slot wake up race in the nfsv4.1 callback code
rt2x00: do not align payload on modern H/W
can: dev: avoid calling kfree_skb() from interrupt context
spi: dw: Fix dynamic speed change.
iser-target: Handle DEVICE_REMOVAL event on network portal listener correctly
target: Don't call TFO->write_pending if data_length == 0
srp-target: Retry when QP creation fails with ENOMEM
Input: xpad - use proper endpoint type
ARM: 8222/1: mvebu: enable strex backoff delay
ARM: 8216/1: xscale: correct auxiliary register in suspend/resume
ALSA: usb-audio: Add ctrl message delay quirk for Marantz/Denon devices
can: esd_usb2: fix memory leak on disconnect
USB: xhci: don't start a halted endpoint before its new dequeue is set
usb-quirks: Add reset-resume quirk for MS Wireless Laser Mouse 6000
usb: serial: ftdi_sio: add PIDs for Matrix Orbital products
USB: serial: cp210x: add IDs for CEL MeshConnect USB Stick
USB: keyspan: fix tty line-status reporting
USB: keyspan: fix overrun-error reporting
USB: ssu100: fix overrun-error reporting
iio: Fix IIO_EVENT_CODE_EXTRACT_DIR bit mask
powerpc/pseries: Fix endiannes issue in RTAS call from xmon
powerpc/pseries: Honor the generic "no_64bit_msi" flag
of/base: Fix PowerPC address parsing hack
ASoC: wm_adsp: Avoid attempt to free buffers that might still be in use
ASoC: sgtl5000: Fix SMALL_POP bit definition
PCI/MSI: Add device flag indicating that 64-bit MSIs don't work
ipx: fix locking regression in ipx_sendmsg and ipx_recvmsg
pptp: fix stack info leak in pptp_getname()
qmi_wwan: Add support for HP lt4112 LTE/HSPA+ Gobi 4G Modem
ieee802154: fix error handling in ieee802154fake_probe()
ipv4: Fix incorrect error code when adding an unreachable route
inetdevice: fixed signed integer overflow
sparc64: Fix constraints on swab helpers.
uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME
x86, mm: Set NX across entire PMD at boot
x86: Require exact match for 'noxsave' command line option
x86_64, traps: Rework bad_iret
x86_64, traps: Stop using IST for #SS
x86_64, traps: Fix the espfix64 #DF fixup and rewrite it in C
MIPS: Loongson: Make platform serial setup always built-in.
MIPS: oprofile: Fix backtrace on 64-bit kernel
Linux 3.10.61
mm: memcg: handle non-error OOM situations more gracefully
mm: memcg: do not trap chargers with full callstack on OOM
mm: memcg: rework and document OOM waiting and wakeup
mm: memcg: enable memcg OOM killer only for user faults
x86: finish user fault error path with fatal signal
arch: mm: pass userspace fault flag to generic fault handler
arch: mm: do not invoke OOM killer on kernel fault OOM
arch: mm: remove obsolete init OOM protection
mm: invoke oom-killer from remaining unconverted page fault handlers
net: sctp: fix skb_over_panic when receiving malformed ASCONF chunks
net: sctp: fix panic on duplicate ASCONF chunks
net: sctp: fix remote memory pressure from excessive queueing
KVM: x86: Don't report guest userspace emulation error to userspace
SCSI: hpsa: fix a race in cmd_free/scsi_done
net/mlx4_en: Fix BlueFlame race
ARM: Correct BUG() assembly to ensure it is endian-agnostic
perf/x86/intel: Use proper dTLB-load-misses event on IvyBridge
mei: bus: fix possible boundaries violation
perf: Handle compat ioctl
MIPS: Fix forgotten preempt_enable() when CPU has inclusive pcaches
dell-wmi: Fix access out of memory
ARM: probes: fix instruction fetch order with <asm/opcodes.h>
br: fix use of ->rx_handler_data in code executed on non-rx_handler path
netfilter: nf_nat: fix oops on netns removal
netfilter: xt_bpf: add mising opaque struct sk_filter definition
netfilter: nf_log: release skbuff on nlmsg put failure
netfilter: nfnetlink_log: fix maximum packet length logged to userspace
netfilter: nf_log: account for size of NLMSG_DONE attribute
ipc: always handle a new value of auto_msgmni
clocksource: Remove "weak" from clocksource_default_clock() declaration
kgdb: Remove "weak" from kgdb_arch_pc() declaration
media: ttusb-dec: buffer overflow in ioctl
NFSv4: Fix races between nfs_remove_bad_delegation() and delegation return
nfs: Fix use of uninitialized variable in nfs_getattr()
NFS: Don't try to reclaim delegation open state if recovery failed
NFSv4: Ensure that we remove NFSv4.0 delegations when state has expired
Input: alps - allow up to 2 invalid packets without resetting device
Input: alps - ignore potential bare packets when device is out of sync
dm raid: ensure superblock's size matches device's logical block size
dm btree: fix a recursion depth bug in btree walking code
block: Fix computation of merged request priority
parisc: Use compat layer for msgctl, shmat, shmctl and semtimedop syscalls
scsi: only re-lock door after EH on devices that were reset
nfs: fix pnfs direct write memory leak
firewire: cdev: prevent kernel stack leaking into ioctl arguments
arm64: __clear_user: handle exceptions on strb
ARM: 8198/1: make kuser helpers depend on MMU
drm/radeon: add missing crtc unlock when setting up the MC
mac80211: fix use-after-free in defragmentation
macvtap: Fix csum_start when VLAN tags are present
iwlwifi: configure the LTR
libceph: do not crash on large auth tickets
xtensa: re-wire umount syscall to sys_oldumount
ALSA: usb-audio: Fix memory leak in FTU quirk
ahci: disable MSI instead of NCQ on Samsung pci-e SSDs on macbooks
ahci: Add Device IDs for Intel Sunrise Point PCH
audit: keep inode pinned
x86, x32, audit: Fix x32's AUDIT_ARCH wrt audit
sparc32: Implement xchg and atomic_xchg using ATOMIC_HASH locks
sparc64: Do irq_{enter,exit}() around generic_smp_call_function*().
sparc64: Fix crashes in schizo_pcierr_intr_other().
sunvdc: don't call VD_OP_GET_VTOC
vio: fix reuse of vio_dring slot
sunvdc: limit each sg segment to a page
sunvdc: compute vdisk geometry from capacity
sunvdc: add cdrom and v1.1 protocol support
net: sctp: fix memory leak in auth key management
net: sctp: fix NULL pointer dereference in af->from_addr_param on malformed packet
gre6: Move the setting of dev->iflink into the ndo_init functions.
ip6_tunnel: Use ip6_tnl_dev_init as the ndo_init function.
Linux 3.10.60
libceph: ceph-msgr workqueue needs a resque worker
Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup
of: Fix overflow bug in string property parsing functions
sysfs: driver core: Fix glue dir race condition by gdp_mutex
i2c: at91: don't account as iowait
acer-wmi: Add acpi_backlight=video quirk for the Acer KAV80
rbd: Fix error recovery in rbd_obj_read_sync()
drm/radeon: remove invalid pci id
usb: gadget: udc: core: fix kernel oops with soft-connect
usb: gadget: function: acm: make f_acm pass USB20CV Chapter9
usb: dwc3: gadget: fix set_halt() bug with pending transfers
crypto: algif - avoid excessive use of socket buffer in skcipher
mm: Remove false WARN_ON from pagecache_isize_extended()
x86, apic: Handle a bad TSC more gracefully
posix-timers: Fix stack info leak in timer_create()
mac80211: fix typo in starting baserate for rts_cts_rate_idx
PM / Sleep: fix recovery during resuming from hibernation
tty: Fix high cpu load if tty is unreleaseable
quota: Properly return errors from dquot_writeback_dquots()
ext3: Don't check quota format when there are no quota files
nfsd4: fix crash on unknown operation number
cpc925_edac: Report UE events properly
e7xxx_edac: Report CE events properly
i3200_edac: Report CE events properly
i82860_edac: Report CE events properly
scsi: Fix error handling in SCSI_IOCTL_SEND_COMMAND
lib/bitmap.c: fix undefined shift in __bitmap_shift_{left|right}()
cgroup/kmemleak: add kmemleak_free() for cgroup deallocations.
usb: Do not allow usb_alloc_streams on unconfigured devices
USB: opticon: fix non-atomic allocation in write path
usb-storage: handle a skipped data phase
spi: pxa2xx: toggle clocks on suspend if not disabled by runtime PM
spi: pl022: Fix incorrect dma_unmap_sg
usb: dwc3: gadget: Properly initialize LINK TRB
wireless: rt2x00: add new rt2800usb device
USB: option: add Haier CE81B CDMA modem
usb: option: add support for Telit LE910
USB: cdc-acm: only raise DTR on transitions from B0
USB: cdc-acm: add device id for GW Instek AFG-2225
usb: serial: ftdi_sio: add "bricked" FTDI device PID
usb: serial: ftdi_sio: add Awinda Station and Dongle products
USB: serial: cp210x: add Silicon Labs 358x VID and PID
serial: Fix divide-by-zero fault in uart_get_divisor()
staging:iio:ade7758: Remove "raw" from channel name
staging:iio:ade7758: Fix check if channels are enabled in prenable
staging:iio:ade7758: Fix NULL pointer deref when enabling buffer
staging:iio:ad5933: Drop "raw" from channel names
staging:iio:ad5933: Fix NULL pointer deref when enabling buffer
OOM, PM: OOM killed task shouldn't escape PM suspend
freezer: Do not freeze tasks killed by OOM killer
ext4: fix oops when loading block bitmap failed
cpufreq: intel_pstate: Fix setting max_perf_pct in performance policy
ext4: fix overflow when updating superblock backups after resize
ext4: check s_chksum_driver when looking for bg csum presence
ext4: fix reservation overflow in ext4_da_write_begin
ext4: add ext4_iget_normal() which is to be used for dir tree lookups
ext4: grab missed write_count for EXT4_IOC_SWAP_BOOT
ext4: don't check quota format when there are no quota files
ext4: check EA value offset when loading
jbd2: free bh when descriptor block checksum fails
MIPS: tlbex: Properly fix HUGE TLB Refill exception handler
target: Fix APTPL metadata handling for dynamic MappedLUNs
target: Fix queue full status NULL pointer for SCF_TRANSPORT_TASK_SENSE
qla_target: don't delete changed nacls
ARC: Update order of registers in KGDB to match GDB 7.5
ARC: [nsimosci] Allow "headless" models to boot
KVM: x86: Emulator fixes for eip canonical checks on near branches
KVM: x86: Fix wrong masking on relative jump/call
kvm: x86: don't kill guest on unknown exit reason
KVM: x86: Check non-canonical addresses upon WRMSR
KVM: x86: Improve thread safety in pit
KVM: x86: Prevent host from panicking on shared MSR writes.
kvm: fix excessive pages un-pinning in kvm_iommu_map error path.
media: tda7432: Fix setting TDA7432_MUTE bit for TDA7432_RF register
media: ds3000: fix LNB supply voltage on Tevii S480 on initialization
media: em28xx-v4l: give back all active video buffers to the vb2 core properly on streaming stop
media: v4l2-common: fix overflow in v4l_bound_align_image()
drm/nouveau/bios: memset dcb struct to zero before parsing
drm/tilcdc: Fix the error path in tilcdc_load()
drm/ast: Fix HW cursor image
Input: i8042 - quirks for Fujitsu Lifebook A544 and Lifebook AH544
Input: i8042 - add noloop quirk for Asus X750LN
framebuffer: fix border color
modules, lock around setting of MODULE_STATE_UNFORMED
dm log userspace: fix memory leak in dm_ulog_tfr_init failure path
block: fix alignment_offset math that assumes io_min is a power-of-2
drbd: compute the end before rb_insert_augmented()
dm bufio: update last_accessed when relinking a buffer
virtio_pci: fix virtio spec compliance on restore
selinux: fix inode security list corruption
pstore: Fix duplicate {console,ftrace}-efi entries
mfd: rtsx_pcr: Fix MSI enable error handling
mnt: Prevent pivot_root from creating a loop in the mount tree
UBI: add missing kmem_cache_free() in process_pool_aeb error path
random: add and use memzero_explicit() for clearing data
crypto: more robust crypto_memneq
fix misuses of f_count() in ppp and netlink
kill wbuf_queued/wbuf_dwork_lock
ALSA: pcm: Zero-clear reserved fields of PCM status ioctl in compat mode
evm: check xattr value length and type in evm_inode_setxattr()
x86, pageattr: Prevent overflow in slow_virt_to_phys() for X86_PAE
x86_64, entry: Fix out of bounds read on sysenter
x86_64, entry: Filter RFLAGS.NT on entry from userspace
x86, flags: Rename X86_EFLAGS_BIT1 to X86_EFLAGS_FIXED
x86, fpu: shift drop_init_fpu() from save_xstate_sig() to handle_signal()
x86, fpu: __restore_xstate_sig()->math_state_restore() needs preempt_disable()
x86: Reject x32 executables if x32 ABI not supported
vfs: fix data corruption when blocksize < pagesize for mmaped data
UBIFS: fix free log space calculation
UBIFS: fix a race condition
UBIFS: remove mst_mutex
fs: Fix theoretical division by 0 in super_cache_scan().
fs: make cont_expand_zero interruptible
mmc: rtsx_pci_sdmmc: fix incorrect last byte in R2 response
libata-sff: Fix controllers with no ctl port
pata_serverworks: disable 64-KB DMA transfers on Broadcom OSB4 IDE Controller
Revert "percpu: free percpu allocation info for uniprocessor system"
lockd: Try to reconnect if statd has moved
drivers/net: macvtap and tun depend on INET
ipv4: dst_entry leak in ip_send_unicast_reply()
ax88179_178a: fix bonding failure
ipv4: fix nexthop attlen check in fib_nh_match
tracing/syscalls: Ignore numbers outside NR_syscalls' range
Linux 3.10.59
ecryptfs: avoid to access NULL pointer when write metadata in xattr
ARM: at91/PMC: don't forget to write PMC_PCDR register to disable clocks
ALSA: usb-audio: Add support for Steinberg UR22 USB interface
ALSA: emu10k1: Fix deadlock in synth voice lookup
ALSA: pcm: use the same dma mmap codepath both for arm and arm64
arm64: compat: fix compat types affecting struct compat_elf_prpsinfo
spi: dw-mid: terminate ongoing transfers at exit
kernel: add support for gcc 5
fanotify: enable close-on-exec on events' fd when requested in fanotify_init()
mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set
Bluetooth: Fix issue with USB suspend in btusb driver
Bluetooth: Fix HCI H5 corrupted ack value
rt2800: correct BBP1_TX_POWER_CTRL mask
PCI: Generate uppercase hex for modalias interface class
PCI: Increase IBM ipr SAS Crocodile BARs to at least system page size
iwlwifi: Add missing PCI IDs for the 7260 series
NFSv4.1: Fix an NFSv4.1 state renewal regression
NFSv4: fix open/lock state recovery error handling
NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
lzo: check for length overrun in variable length encoding.
Revert "lzo: properly check for overruns"
Documentation: lzo: document part of the encoding
m68k: Disable/restore interrupts in hwreg_present()/hwreg_write()
Drivers: hv: vmbus: Fix a bug in vmbus_open()
Drivers: hv: vmbus: Cleanup vmbus_establish_gpadl()
Drivers: hv: vmbus: Cleanup vmbus_teardown_gpadl()
Drivers: hv: vmbus: Cleanup vmbus_post_msg()
firmware_class: make sure fw requests contain a name
qla2xxx: Use correct offset to req-q-out for reserve calculation
mptfusion: enable no_write_same for vmware scsi disks
be2iscsi: check ip buffer before copying
regmap: fix NULL pointer dereference in _regmap_write/read
regmap: debugfs: fix possbile NULL pointer dereference
spi: dw-mid: check that DMA was inited before exit
spi: dw-mid: respect 8 bit mode
x86/intel/quark: Switch off CR4.PGE so TLB flush uses CR3 instead
kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
KVM: s390: unintended fallthrough for external call
kvm: x86: fix stale mmio cache bug
fs: Add a missing permission check to do_umount
Btrfs: fix race in WAIT_SYNC ioctl
Btrfs: fix build_backref_tree issue with multiple shared blocks
Btrfs: try not to ENOSPC on log replay
Linux 3.10.58
USB: cp210x: add support for Seluxit USB dongle
USB: serial: cp210x: added Ketra N1 wireless interface support
USB: Add device quirk for ASUS T100 Base Station keyboard
ipv6: reallocate addrconf router for ipv6 address when lo device up
tcp: fixing TLP's FIN recovery
sctp: handle association restarts when the socket is closed.
ip6_gre: fix flowi6_proto value in xmit path
hyperv: Fix a bug in netvsc_start_xmit()
tg3: Allow for recieve of full-size 8021AD frames
tg3: Work around HW/FW limitations with vlan encapsulated frames
l2tp: fix race while getting PMTU on PPP pseudo-wire
openvswitch: fix panic with multiple vlan headers
packet: handle too big packets for PACKET_V3
tcp: fix tcp_release_cb() to dispatch via address family for mtu_reduced()
sit: Fix ipip6_tunnel_lookup device matching criteria
myri10ge: check for DMA mapping errors
Linux 3.10.57
cpufreq: ondemand: Change the calculation of target frequency
cpufreq: Fix wrong time unit conversion
nl80211: clear skb cb before passing to netlink
drbd: fix regression 'out of mem, failed to invoke fence-peer helper'
jiffies: Fix timeval conversion to jiffies
md/raid5: disable 'DISCARD' by default due to safety concerns.
media: vb2: fix VBI/poll regression
mm: numa: Do not mark PTEs pte_numa when splitting huge pages
mm, thp: move invariant bug check out of loop in __split_huge_page_map
ring-buffer: Fix infinite spin in reading buffer
init/Kconfig: Fix HAVE_FUTEX_CMPXCHG to not break up the EXPERT menu
perf: fix perf bug in fork()
udf: Avoid infinite loop when processing indirect ICBs
Linux 3.10.56
vm_is_stack: use for_each_thread() rather then buggy while_each_thread()
oom_kill: add rcu_read_lock() into find_lock_task_mm()
oom_kill: has_intersects_mems_allowed() needs rcu_read_lock()
oom_kill: change oom_kill.c to use for_each_thread()
introduce for_each_thread() to replace the buggy while_each_thread()
kernel/fork.c:copy_process(): unify CLONE_THREAD-or-thread_group_leader code
arm: multi_v7_defconfig: Enable Zynq UART driver
ext2: Fix fs corruption in ext2_get_xip_mem()
serial: 8250_dma: check the result of TX buffer mapping
ARM: 7748/1: oabi: handle faults when loading swi instruction from userspace
netfilter: nf_conntrack: avoid large timeout for mid-stream pickup
PM / sleep: Use valid_state() for platform-dependent sleep states only
PM / sleep: Add state field to pm_states[] entries
ipvs: fix ipv6 hook registration for local replies
ipvs: Maintain all DSCP and ECN bits for ipv6 tun forwarding
ipvs: avoid netns exit crash on ip_vs_conn_drop_conntrack
md/raid1: fix_read_error should act on all non-faulty devices.
media: cx18: fix kernel oops with tda8290 tuner
Fix nasty 32-bit overflow bug in buffer i/o code.
perf kmem: Make it work again on non NUMA machines
perf: Fix a race condition in perf_remove_from_context()
alarmtimer: Lock k_itimer during timer callback
alarmtimer: Do not signal SIGEV_NONE timers
parisc: Only use -mfast-indirect-calls option for 32-bit kernel builds
powerpc/perf: Fix ABIv2 kernel backtraces
sched: Fix unreleased llc_shared_mask bit during CPU hotplug
ocfs2/dlm: do not get resource spinlock if lockres is new
nilfs2: fix data loss with mmap()
fs/notify: don't show f_handle if exportfs_encode_inode_fh failed
fsnotify/fdinfo: use named constants instead of hardcoded values
kcmp: fix standard comparison bug
Revert "mac80211: disable uAPSD if all ACs are under ACM"
usb: dwc3: core: fix ordering for PHY suspend
usb: dwc3: core: fix order of PM runtime calls
usb: host: xhci: fix compliance mode workaround
genhd: fix leftover might_sleep() in blk_free_devt()
lockd: fix rpcbind crash on lockd startup failure
rtlwifi: rtl8192cu: Add new ID
percpu: perform tlb flush after pcpu_map_pages() failure
percpu: fix pcpu_alloc_pages() failure path
percpu: free percpu allocation info for uniprocessor system
ata_piix: Add Device IDs for Intel 9 Series PCH
Input: i8042 - add nomux quirk for Avatar AVIU-145A6
Input: i8042 - add Fujitsu U574 to no_timeout dmi table
Input: atkbd - do not try 'deactivate' keyboard on any LG laptops
Input: elantech - fix detection of touchpad on ASUS s301l
Input: synaptics - add support for ForcePads
Input: serport - add compat handling for SPIOCSTYPE ioctl
dm crypt: fix access beyond the end of allocated space
block: Fix dev_t minor allocation lifetime
workqueue: apply __WQ_ORDERED to create_singlethread_workqueue()
Revert "iwlwifi: dvm: don't enable CTS to self"
SCSI: libiscsi: fix potential buffer overrun in __iscsi_conn_send_pdu
NFC: microread: Potential overflows in microread_target_discovered()
iscsi-target: Fix memory corruption in iscsit_logout_post_handler_diffcid
iscsi-target: avoid NULL pointer in iscsi_copy_param_list failure
Target/iser: Don't put isert_conn inside disconnected handler
Target/iser: Get isert_conn reference once got to connected_handler
iio:inkern: fix overwritten -EPROBE_DEFER in of_iio_channel_get_by_name
iio:magnetometer: bugfix magnetometers gain values
iio: adc: ad_sigma_delta: Fix indio_dev->trig assignment
iio: st_sensors: Fix indio_dev->trig assignment
iio: meter: ade7758: Fix indio_dev->trig assignment
iio: inv_mpu6050: Fix indio_dev->trig assignment
iio: gyro: itg3200: Fix indio_dev->trig assignment
iio:trigger: modify return value for iio_trigger_get
CIFS: Fix SMB2 readdir error handling
CIFS: Fix directory rename error
ASoC: davinci-mcasp: Correct rx format unit configuration
shmem: fix nlink for rename overwrite directory
x86 early_ioremap: Increase FIX_BTMAPS_SLOTS to 8
KVM: x86: handle idiv overflow at kvm_write_tsc
regmap: Fix handling of volatile registers for format_write() chips
ACPICA: Update to GPIO region handler interface.
MIPS: mcount: Adjust stack pointer for static trace in MIPS32
MIPS: ZBOOT: add missing <linux/string.h> include
ARM: 8165/1: alignment: don't break misaligned NEON load/store
ARM: 7897/1: kexec: Use the right ISA for relocate_new_kernel
ARM: 8133/1: use irq_set_affinity with force=false when migrating irqs
ARM: 8128/1: abort: don't clear the exclusive monitors
NFSv4: Fix another bug in the close/open_downgrade code
NFSv4: nfs4_state_manager() vs. nfs_server_remove_lists()
usb:hub set hub->change_bits when over-current happens
usb: dwc3: omap: fix ordering for runtime pm calls
USB: EHCI: unlink QHs even after the controller has stopped
USB: storage: Add quirks for Entrega/Xircom USB to SCSI converters
USB: storage: Add quirk for Ariston Technologies iConnect USB to SCSI adapter
USB: storage: Add quirk for Adaptec USBConnect 2000 USB-to-SCSI Adapter
storage: Add single-LUN quirk for Jaz USB Adapter
usb: hub: take hub->hdev reference when processing from eventlist
xhci: fix oops when xhci resumes from hibernate with hw lpm capable devices
xhci: Fix null pointer dereference if xhci initialization fails
USB: zte_ev: fix removed PIDs
USB: ftdi_sio: add support for NOVITUS Bono E thermal printer
USB: sierra: add 1199:68AA device ID
USB: sierra: avoid CDC class functions on "68A3" devices
USB: zte_ev: remove duplicate Qualcom PID
USB: zte_ev: remove duplicate Gobi PID
Revert "USB: option,zte_ev: move most ZTE CDMA devices to zte_ev"
USB: option: add VIA Telecom CDS7 chipset device id
USB: option: reduce interrupt-urb logging verbosity
USB: serial: fix potential heap buffer overflow
USB: sisusb: add device id for Magic Control USB video
USB: serial: fix potential stack buffer overflow
USB: serial: pl2303: add device id for ztek device
xtensa: fix a6 and a7 handling in fast_syscall_xtensa
xtensa: fix TLBTEMP_BASE_2 region handling in fast_second_level_miss
xtensa: fix access to THREAD_RA/THREAD_SP/THREAD_DS
xtensa: fix address checks in dma_{alloc,free}_coherent
xtensa: replace IOCTL code definitions with constants
drm/radeon: add connector quirk for fujitsu board
drm/vmwgfx: Fix a potential infinite spin waiting for fifo idle
drm/ast: AST2000 cannot be detected correctly
drm/i915: Wait for vblank before enabling the TV encoder
drm/i915: Remove bogus __init annotation from DMI callbacks
HID: logitech-dj: prevent false errors to be shown
HID: magicmouse: sanity check report size in raw_event() callback
HID: picolcd: sanity check report size in raw_event() callback
cfq-iosched: Fix wrong children_weight calculation
ALSA: pcm: fix fifo_size frame calculation
ALSA: hda - Fix invalid pin powermap without jack detection
ALSA: hda - Fix COEF setups for ALC1150 codec
ALSA: core: fix buffer overflow in snd_info_get_line()
arm64: ptrace: fix compat hardware watchpoint reporting
trace: Fix epoll hang when we race with new entries
i2c: at91: Fix a race condition during signal handling in at91_do_twi_xfer.
i2c: at91: add bound checking on SMBus block length bytes
arm64: flush TLS registers during exec
ibmveth: Fix endian issues with rx_no_buffer statistic
ahci: add pcid for Marvel 0x9182 controller
ahci: Add Device IDs for Intel 9 Series PCH
pata_scc: propagate return value of scc_wait_after_reset
drm/i915: read HEAD register back in init_ring_common() to enforce ordering
drm/radeon: load the lm63 driver for an lm64 thermal chip.
drm/ttm: Choose a pool to shrink correctly in ttm_dma_pool_shrink_scan().
drm/ttm: Fix possible division by 0 in ttm_dma_pool_shrink_scan().
drm/tilcdc: fix double kfree
drm/tilcdc: fix release order on exit
drm/tilcdc: panel: fix leak when unloading the module
drm/tilcdc: tfp410: fix dangling sysfs connector node
drm/tilcdc: slave: fix dangling sysfs connector node
drm/tilcdc: panel: fix dangling sysfs connector node
carl9170: fix sending URBs with wrong type when using full-speed
Linux 3.10.55
libceph: gracefully handle large reply messages from the mon
libceph: rename ceph_msg::front_max to front_alloc_len
tpm: Provide a generic means to override the chip returned timeouts
vfs: fix bad hashing of dentries
dcache.c: get rid of pointless macros
IB/srp: Fix deadlock between host removal and multipathd
blkcg: don't call into policy draining if root_blkg is already gone
mtd: nand: omap: Fix 1-bit Hamming code scheme, omap_calculate_ecc()
mtd/ftl: fix the double free of the buffers allocated in build_maps()
CIFS: Fix wrong restart readdir for SMB1
CIFS: Fix wrong filename length for SMB2
CIFS: Fix wrong directory attributes after rename
CIFS: Possible null ptr deref in SMB2_tcon
CIFS: Fix async reading on reconnects
CIFS: Fix STATUS_CANNOT_DELETE error mapping for SMB2
libceph: do not hard code max auth ticket len
libceph: add process_one_ticket() helper
libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly
md/raid1,raid10: always abort recover on write error.
xfs: don't zero partial page cache pages during O_DIRECT writes
xfs: don't zero partial page cache pages during O_DIRECT writes
xfs: don't dirty buffers beyond EOF
xfs: quotacheck leaves dquot buffers without verifiers
RDMA/iwcm: Use a default listen backlog if needed
md/raid10: Fix memory leak when raid10 reshape completes.
md/raid10: fix memory leak when reshaping a RAID10.
md/raid6: avoid data corruption during recovery of double-degraded RAID6
Bluetooth: Avoid use of session socket after the session gets freed
Bluetooth: never linger on process exit
mnt: Add tests for unprivileged remount cases that have found to be faulty
mnt: Change the default remount atime from relatime to the existing value
mnt: Correct permission checks in do_remount
mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount
mnt: Only change user settable mount flags in remount
ring-buffer: Up rb_iter_peek() loop count to 3
ring-buffer: Always reset iterator to reader page
ACPI / cpuidle: fix deadlock between cpuidle_lock and cpu_hotplug.lock
ACPI: Run fixed event device notifications in process context
ACPICA: Utilities: Fix memory leak in acpi_ut_copy_iobject_to_iobject
bfa: Fix undefined bit shift on big-endian architectures with 32-bit DMA address
ASoC: pxa-ssp: drop SNDRV_PCM_FMTBIT_S24_LE
ASoC: max98090: Fix missing free_irq
ASoC: samsung: Correct I2S DAI suspend/resume ops
ASoC: wm_adsp: Add missing MODULE_LICENSE
ASoC: pcm: fix dpcm_path_put in dpcm runtime update
openrisc: Rework signal handling
MIPS: Fix accessing to per-cpu data when flushing the cache
MIPS: OCTEON: make get_system_type() thread-safe
MIPS: asm: thread_info: Add _TIF_SECCOMP flag
MIPS: Cleanup flags in syscall flags handlers.
MIPS: asm/reg.h: Make 32- and 64-bit definitions available at the same time
MIPS: Remove BUG_ON(!is_fpu_owner()) in do_ade()
MIPS: tlbex: Fix a missing statement for HUGETLB
MIPS: Prevent user from setting FCSR cause bits
MIPS: GIC: Prevent array overrun
drivers: scsi: storvsc: Correctly handle TEST_UNIT_READY failure
Drivers: scsi: storvsc: Implement a eh_timed_out handler
powerpc/pseries: Failure on removing device node
powerpc/mm: Use read barrier when creating real_pte
powerpc/mm/numa: Fix break placement
regulator: arizona-ldo1: remove bypass functionality
mfd: omap-usb-host: Fix improper mask use.
kernel/smp.c:on_each_cpu_cond(): fix warning in fallback path
CAPABILITIES: remove undefined caps from all processes
tpm: missing tpm_chip_put in tpm_get_random()
firmware: Do not use WARN_ON(!spin_is_locked())
spi: omap2-mcspi: Configure hardware when slave driver changes mode
spi: orion: fix incorrect handling of cell-index DT property
iommu/amd: Fix cleanup_domain for mass device removal
media: media-device: Remove duplicated memset() in media_enum_entities()
media: au0828: Only alt setting logic when needed
media: xc4000: Fix get_frequency()
media: xc5000: Fix get_frequency()
Linux 3.10.54
USB: fix build error with CONFIG_PM_RUNTIME disabled
NFSv4: Fix problems with close in the presence of a delegation
NFSv3: Fix another acl regression
svcrdma: Select NFSv4.1 backchannel transport based on forward channel
NFSD: Decrease nfsd_users in nfsd_startup_generic fail
usb: hub: Prevent hub autosuspend if usbcore.autosuspend is -1
USB: whiteheat: Added bounds checking for bulk command response
USB: ftdi_sio: Added PID for new ekey device
USB: ftdi_sio: add Basic Micro ATOM Nano USB2Serial PID
ARM: OMAP2+: hwmod: Rearm wake-up interrupts for DT when MUSB is idled
usb: xhci: amd chipset also needs short TX quirk
xhci: Treat not finding the event_seg on COMP_STOP the same as COMP_STOP_INVAL
Staging: speakup: Update __speakup_paste_selection() tty (ab)usage to match vt
jbd2: fix infinite loop when recovering corrupt journal blocks
mei: nfc: fix memory leak in error path
mei: reset client state on queued connect request
Btrfs: fix csum tree corruption, duplicate and outdated checksums
hpsa: fix bad -ENOMEM return value in hpsa_big_passthru_ioctl
x86/efi: Enforce CONFIG_RELOCATABLE for EFI boot stub
x86_64/vsyscall: Fix warn_bad_vsyscall log output
x86: don't exclude low BIOS area when allocating address space for non-PCI cards
drm/radeon: add additional SI pci ids
ext4: fix BUG_ON in mb_free_blocks()
kvm: iommu: fix the third parameter of kvm_iommu_put_pages (CVE-2014-3601)
Revert "KVM: x86: Increase the number of fixed MTRR regs to 10"
KVM: nVMX: fix "acknowledge interrupt on exit" when APICv is in use
KVM: x86: always exit on EOIs for interrupts listed in the IOAPIC redir table
KVM: x86: Inter-privilege level ret emulation is not implemeneted
crypto: ux500 - make interrupt mode plausible
serial: core: Preserve termios c_cflag for console resume
ext4: fix ext4_discard_allocated_blocks() if we can't allocate the pa struct
drivers/i2c/busses: use correct type for dma_map/unmap
hwmon: (dme1737) Prevent overflow problem when writing large limits
hwmon: (ads1015) Fix out-of-bounds array access
hwmon: (lm85) Fix various errors on attribute writes
hwmon: (ads1015) Fix off-by-one for valid channel index checking
hwmon: (gpio-fan) Prevent overflow problem when writing large limits
hwmon: (lm78) Fix overflow problems seen when writing large temperature limits
hwmon: (sis5595) Prevent overflow problem when writing large limits
drm: omapdrm: fix compiler errors
ARM: OMAP3: Fix choice of omap3_restore_es function in OMAP34XX rev3.1.2 case.
mei: start disconnect request timer consistently
ALSA: hda/realtek - Avoid setting wrong COEF on ALC269 & co
ALSA: hda/ca0132 - Don't try loading firmware at resume when already failed
ALSA: virtuoso: add Xonar Essence STX II support
ALSA: hda - fix an external mic jack problem on a HP machine
USB: Fix persist resume of some SS USB devices
USB: ehci-pci: USB host controller support for Intel Quark X1000
USB: serial: ftdi_sio: Add support for new Xsens devices
USB: serial: ftdi_sio: Annotate the current Xsens PID assignments
USB: OHCI: don't lose track of EDs when a controller dies
isofs: Fix unbounded recursion when processing relocated directories
HID: fix a couple of off-by-ones
HID: logitech: perform bounds checking on device_id early enough
stable_kernel_rules: Add pointer to netdev-FAQ for network patches
Linux 3.10.53
arch/sparc/math-emu/math_32.c: drop stray break operator
sparc64: ldc_connect() should not return EINVAL when handshake is in progress.
sunsab: Fix detection of BREAK on sunsab serial console
bbc-i2c: Fix BBC I2C envctrl on SunBlade 2000
sparc64: Guard against flushing openfirmware mappings.
sparc64: Do not insert non-valid PTEs into the TSB hash table.
sparc64: Add membar to Niagara2 memcpy code.
sparc64: Fix huge TSB mapping on pre-UltraSPARC-III cpus.
sparc64: Don't bark so loudly about 32-bit tasks generating 64-bit fault addresses.
sparc64: Fix top-level fault handling bugs.
sparc64: Handle 32-bit tasks properly in compute_effective_address().
sparc64: Make itc_sync_lock raw
sparc64: Fix argument sign extension for compat_sys_futex().
sctp: fix possible seqlock seadlock in sctp_packet_transmit()
iovec: make sure the caller actually wants anything in memcpy_fromiovecend
net: Correctly set segment mac_len in skb_segment().
macvlan: Initialize vlan_features to turn on offload support.
net: sctp: inherit auth_capable on INIT collisions
tcp: Fix integer-overflow in TCP vegas
tcp: Fix integer-overflows in TCP veno
net: sendmsg: fix NULL pointer dereference
ip: make IP identifiers less predictable
inetpeer: get rid of ip_id_count
bnx2x: fix crash during TSO tunneling
Linux 3.10.52
x86/espfix/xen: Fix allocation of pages for paravirt page tables
lib/btree.c: fix leak of whole btree nodes
net/l2tp: don't fall back on UDP [get|set]sockopt
net: mvneta: replace Tx timer with a real interrupt
net: mvneta: add missing bit descriptions for interrupt masks and causes
net: mvneta: do not schedule in mvneta_tx_timeout
net: mvneta: use per_cpu stats to fix an SMP lock up
net: mvneta: increase the 64-bit rx/tx stats out of the hot path
Revert "mac80211: move "bufferable MMPDU" check to fix AP mode scan"
staging: vt6655: Fix Warning on boot handle_irq_event_percpu.
x86_64/entry/xen: Do not invoke espfix64 on Xen
x86, espfix: Make it possible to disable 16-bit support
x86, espfix: Make espfix64 a Kconfig option, fix UML
x86, espfix: Fix broken header guard
x86, espfix: Move espfix definitions into a separate header file
x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stack
Revert "x86-64, modify_ldt: Make support for 16-bit segments a runtime option"
timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks
printk: rename printk_sched to printk_deferred
iio: buffer: Fix demux table creation
staging: vt6655: Fix disassociated messages every 10 seconds
mm, thp: do not allow thp faults to avoid cpuset restrictions
scsi: handle flush errors properly
rapidio/tsi721_dma: fix failure to obtain transaction descriptor
cfg80211: fix mic_failure tracing
ARM: 8115/1: LPAE: reduce damage caused by idmap to virtual memory layout
crypto: af_alg - properly label AF_ALG socket
Linux 3.10.51
core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
x86/efi: Include a .bss section within the PE/COFF headers
s390/ptrace: fix PSW mask check
Fix gcc-4.9.0 miscompilation of load_balance() in scheduler
mm: hugetlb: fix copy_hugetlb_page_range()
x86_32, entry: Store badsys error code in %eax
hwmon: (smsc47m192) Fix temperature limit and vrm write operations
parisc: Remove SA_RESTORER define
coredump: fix the setting of PF_DUMPCORE
Input: fix defuzzing logic
slab_common: fix the check for duplicate slab names
slab_common: Do not check for duplicate slab names
tracing: Fix wraparound problems in "uptime" trace clock
blkcg: don't call into policy draining if root_blkg is already gone
ahci: add support for the Promise FastTrak TX8660 SATA HBA (ahci mode)
libata: introduce ata_host->n_tags to avoid oops on SAS controllers
libata: support the ata host which implements a queue depth less than 32
block: don't assume last put of shared tags is for the host
block: provide compat ioctl for BLKZEROOUT
media: tda10071: force modulation to QPSK on DVB-S
media: hdpvr: fix two audio bugs
Linux 3.10.50
ARC: Implement ptrace(PTRACE_GET_THREAD_AREA)
sched: Fix possible divide by zero in avg_atom() calculation
locking/mutex: Disable optimistic spinning on some architectures
PM / sleep: Fix request_firmware() error at resume
dm cache metadata: do not allow the data block size to change
dm thin metadata: do not allow the data block size to change
alarmtimer: Fix bug where relative alarm timers were treated as absolute
drm/radeon: avoid leaking edid data
drm/qxl: return IRQ_NONE if it was not our irq
drm/radeon: set default bl level to something reasonable
irqchip: gic: Fix core ID calculation when topology is read from DT
irqchip: gic: Add support for cortex a7 compatible string
ring-buffer: Fix polling on trace_pipe
mwifiex: fix Tx timeout issue
perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
ipv4: fix buffer overflow in ip_options_compile()
dns_resolver: Null-terminate the right string
dns_resolver: assure that dns_query() result is null-terminated
sunvnet: clean up objects created in vnet_new() on vnet_exit()
net: pppoe: use correct channel MTU when using Multilink PPP
net: sctp: fix information leaks in ulpevent layer
tipc: clear 'next'-pointer of message fragments before reassembly
be2net: set EQ DB clear-intr bit in be_open()
netlink: Fix handling of error from netlink_dump().
net: mvneta: Fix big endian issue in mvneta_txq_desc_csum()
net: mvneta: fix operation in 10 Mbit/s mode
appletalk: Fix socket referencing in skb
tcp: fix false undo corner cases
igmp: fix the problem when mc leave group
net: qmi_wwan: add two Sierra Wireless/Netgear devices
net: qmi_wwan: Add ID for Telewell TW-LTE 4G v2
ipv4: icmp: Fix pMTU handling for rare case
tcp: Fix divide by zero when pushing during tcp-repair
bnx2x: fix possible panic under memory stress
net: fix sparse warning in sk_dst_set()
ipv4: irq safe sk_dst_[re]set() and ipv4_sk_update_pmtu() fix
ipv4: fix dst race in sk_dst_get()
8021q: fix a potential memory leak
net: sctp: check proc_dointvec result in proc_sctp_do_auth
tcp: fix tcp_match_skb_to_sack() for unaligned SACK at end of an skb
ip_tunnel: fix ip_tunnel_lookup
shmem: fix splicing from a hole while it's punched
shmem: fix faulting into a hole, not taking i_mutex
shmem: fix faulting into a hole while it's punched
iwlwifi: dvm: don't enable CTS to self
igb: do a reset on SR-IOV re-init if device is down
hwmon: (adt7470) Fix writes to temperature limit registers
hwmon: (da9052) Don't use dash in the name attribute
hwmon: (da9055) Don't use dash in the name attribute
tracing: Add ftrace_trace_stack into __trace_puts/__trace_bputs
tracing: Fix graph tracer with stack tracer on other archs
fuse: handle large user and group ID
Bluetooth: Ignore H5 non-link packets in non-active state
Drivers: hv: util: Fix a bug in the KVP code
media: gspca_pac7302: Add new usb-id for Genius i-Look 317
usb: Check if port status is equal to RxDetect
Signed-off-by: Ian Maund <imaund@codeaurora.org>
Some pages may be shared by several processes (e.g. libc).
In that case, it is wasteful to reclaim them at the first attempt.
This patch makes the VM keep them in memory until the last task
tries to reclaim them, so shared pages are reclaimed only once
every task mapping them has asked for them to be swapped out.
This feature doesn't handle non-linear mappings on ramfs because
doing so is very time-consuming, does not guarantee reclaim, and
is not a common case.
Change-Id: I7e5f34f2e947f5db6d405867fe2ad34863ca40f7
Signed-off-by: Sangseok Lee <sangseok.lee@lge.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Patch-mainline: linux-mm @ 9 May 2013 16:21:27
[vinmenon@codeaurora.org: trivial merge conflict fixes]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
shrink_page_list() expects all pages to come from the same zone,
which is too restrictive.
This patch removes that dependency so that the next patch can use
shrink_page_list() with pages from multiple zones.
Change-Id: I34469b7f0a79f2b79e30e40033ba8b3e1dd5f2d0
Signed-off-by: Minchan Kim <minchan@kernel.org>
Patch-mainline: linux-mm @ 9 May 2013 16:21:25
[vinmenon@codeaurora.org: trivial merge conflict fixes]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
These days, there are many platforms available in the embedded market,
and they are smarter than the kernel, which has very limited information
about the working set, so they want to be involved in memory management
more heavily, as with Android's lowmemory killer and ashmem, or the many
recent lowmemory notifiers (there have been several attempts from various
companies: NOKIA, SAMSUNG, Linaro, Google ChromeOS, Redhat).
One simple scenario for userspace intelligence is that the platform
can manage tasks as foreground and background, so it would be
better to reclaim a background task's pages for the sake of end-user
*responsiveness*, even if those pages are frequently referenced.
This patch adds a new knob, "reclaim", under /proc/<pid>/ so that a task
manager can reclaim any target process anytime, anywhere. It gives the
platform another method for using memory efficiently.
It can avoid killing processes to free memory, which was a really
terrible experience: I lost the best game score I had ever achieved
after I switched to a phone call while I was enjoying the game.
Reclaim file-backed pages only.
    echo file > /proc/PID/reclaim
Reclaim anonymous pages only.
    echo anon > /proc/PID/reclaim
Reclaim all pages.
    echo all > /proc/PID/reclaim
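For a task manager written in C, the echo commands above could be wrapped roughly as follows. This is a minimal sketch, not part of the patch: the helper names reclaim_mode_valid(), reclaim_path() and reclaim_process() are hypothetical, and the only interface assumed is the /proc/PID/reclaim file with the three mode strings listed above.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Modes accepted by the knob, as listed in the commit message. */
static int reclaim_mode_valid(const char *mode)
{
    return strcmp(mode, "file") == 0 ||
           strcmp(mode, "anon") == 0 ||
           strcmp(mode, "all") == 0;
}

/* Build "/proc/<pid>/reclaim" into buf.
 * Returns 0 on success, -1 if the buffer is too small. */
static int reclaim_path(pid_t pid, char *buf, size_t len)
{
    int n = snprintf(buf, len, "/proc/%ld/reclaim", (long)pid);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: ask the kernel to reclaim pages of one process.
 * Returns 0 on success, -1 on a bad mode, a missing knob, or a failed write. */
static int reclaim_process(pid_t pid, const char *mode)
{
    char path[64];
    FILE *f;
    int ok;

    if (!reclaim_mode_valid(mode) ||
        reclaim_path(pid, path, sizeof(path)) != 0)
        return -1;

    f = fopen(path, "w");
    if (!f)
        return -1;      /* kernel without this patch, or no permission */

    ok = fputs(mode, f) >= 0;
    if (fclose(f) != 0) /* the write may only be reported at close time */
        ok = 0;
    return ok ? 0 : -1;
}
```

A caller would use, for example, reclaim_process(pid, "anon") when a task moves to the background. On a kernel that does not carry this patch the fopen() fails and the helper returns -1.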
Change-Id: Iabdb7bc2ef3dc4d94e3ea005fbe18f4cd06739ab
Signed-off-by: Minchan Kim <minchan@kernel.org>
Patch-mainline: linux-mm @ 9 May 2013 16:21:24
[vinmenon@codeaurora.org: trivial merge conflict fixes]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Currently, the local variable "references" in shrink_page_list defaults to
PAGEREF_RECLAIM_CLEAN. This is meant to prevent dirty pages from being
reclaimed when CMA tries to migrate pages.
Strictly speaking, we don't need it, because CMA already disallows
writeout via .may_writepage = 0 in reclaim_clean_pages_from_list.
Moreover, it has the problem of preventing anonymous pages from being
swapped out when we use force_reclaim = true in shrink_page_list
(e.g. per-process reclaim can do this).
So this patch changes the default value of references to PAGEREF_RECLAIM
and declares .may_writepage = 0 in the scan_control of the CMA part to
make the code clearer.
Change-Id: I5edc3c955d106ecebc4949ce27daf5b7b7a18089
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mgorman@suse.de>
Reported-by: Minkyung Kim <minkyung88@lge.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Patch-mainline: linux-mm @ 9 May 2013 16:21:23
[vinmenon@codeaurora.org: trivial merge conflict fixes]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
There are a couple of issues with swapcache usage when ZRAM is used
as a swap device.
1) The kernel does swap readahead, which can be around 6 to 8 pages
depending on total RAM, and which is not required for zram since
accesses are fast.
2) The kernel delays freeing the swapcache in expectation of a later hit,
which again is useless in the case of zram.
3) This is not related to swapcache, but to zram usage itself.
As mentioned in (2), the kernel delays freeing the swapcache, but along
with that it also delays freeing the zram compressed page, i.e. there can
be 2 copies, though one is compressed.
This patch addresses these issues using two new flags,
QUEUE_FLAG_FAST and SWP_FAST, which indicate that accesses to the device
will be fast and cheap, and instruct the swap layer to free up
swap space aggressively and not to do readahead.
Change-Id: I5d2d5176a5f9420300bb2f843f6ecbdb25ea80e4
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
commit 9e5e3661727eaf960d3480213f8e87c8d67b6956 upstream.
Charles Shirron and Paul Cassella from Cray Inc have reported kswapd
stuck in a busy loop with nothing left to balance, but
kswapd_try_to_sleep() failing to sleep. Their analysis found the cause
to be a combination of several factors:
1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait
2. The process has been killed (by OOM in this case), but has not yet been
scheduled to remove itself from the waitqueue and die.
3. kswapd checks for throttled processes in prepare_kswapd_sleep():
    if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
        wake_up(&pgdat->pfmemalloc_wait);
        return false; // kswapd will not go to sleep
    }
However, for a process that was already killed, wake_up() does not remove
the process from the waitqueue, since try_to_wake_up() checks its state
first and returns false when the process is no longer waiting.
4. kswapd is running on the same CPU as the only CPU that the process is
allowed to run on (through cpus_allowed, or possibly single-cpu system).
5. CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd
encounters no voluntary preemption points and repeatedly fails
prepare_kswapd_sleep(), blocking the process from running and removing
itself from the waitqueue, which would let kswapd sleep.
So, the source of the problem is that we prevent kswapd from going to
sleep as long as there are processes waiting on the pfmemalloc_wait queue,
and a process waiting on a queue is guaranteed to be removed from the
queue only when it gets scheduled. This was done to make sure that no
process is left sleeping on pfmemalloc_wait when kswapd itself goes to
sleep.
However, it isn't necessary to postpone kswapd sleep until the
pfmemalloc_wait queue actually empties. To prevent processes from being
left sleeping, it's actually enough to guarantee that all processes
waiting on pfmemalloc_wait queue have been woken up by the time we put
kswapd to sleep.
This patch therefore fixes this issue by substituting 'wake_up' with
'wake_up_all' and removing 'return false' in the code snippet from
prepare_kswapd_sleep() above. Note that if any process puts itself in
the queue after this waitqueue_active() check, or after the wake up
itself, it means that the process will also wake up kswapd - and since
we are under prepare_to_wait(), the wake up won't be missed. Also we
update the comment in prepare_kswapd_sleep() to hopefully more clearly
describe the races it is preventing.
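The difference between the old and new behaviour can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct waiter and the *_model functions are hypothetical stand-ins for the kernel waitqueue, the node is assumed to be balanced already, and the remaining checks in prepare_kswapd_sleep() are left out.

```c
#include <stddef.h>

/* Toy waitqueue entry: a task is removed from the queue by a wake-up only
 * if it was actually sleeping; otherwise it must run to dequeue itself. */
struct waiter {
    int on_queue;   /* still linked into pfmemalloc_wait */
    int sleeping;   /* 0 if already runnable, e.g. killed by OOM */
};

static int waitqueue_active_model(const struct waiter *w, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (w[i].on_queue)
            return 1;
    return 0;
}

/* Wake every sleeping waiter; a waiter that is already runnable is left
 * on the queue, matching the behaviour described in point 3 above. */
static void wake_up_all_model(struct waiter *w, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (w[i].on_queue && w[i].sleeping) {
            w[i].sleeping = 0;
            w[i].on_queue = 0;
        }
    }
}

/* Old logic: refuse to sleep while anything is on the queue.
 * Returns 1 if kswapd may sleep, 0 otherwise. */
static int prepare_sleep_old_model(struct waiter *w, size_t n)
{
    if (waitqueue_active_model(w, n)) {
        wake_up_all_model(w, n);
        return 0;
    }
    return 1;
}

/* New logic: wake everyone, then carry on regardless of whether the
 * queue has emptied yet. */
static int prepare_sleep_new_model(struct waiter *w, size_t n)
{
    if (waitqueue_active_model(w, n))
        wake_up_all_model(w, n);
    return 1;   /* balance checks elided: node assumed balanced */
}
```

With a single waiter that has been killed but has not run yet (on_queue = 1, sleeping = 0), the old model returns 0 on every call because the entry never leaves the queue, while the new model returns 1 and lets kswapd sleep so the killed task can be scheduled.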
Fixes: 5515061d22 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It is observed that sometimes multiple tasks get blocked in
the congestion_wait loop below, in shrink_inactive_list.
(__schedule) from [<c0a03328>]
(schedule_timeout) from [<c0a04940>]
(io_schedule_timeout) from [<c01d585c>]
(congestion_wait) from [<c01cc9d8>]
(shrink_inactive_list) from [<c01cd034>]
(shrink_zone) from [<c01cdd08>]
(try_to_free_pages) from [<c01c442c>]
(__alloc_pages_nodemask) from [<c01f1884>]
(new_slab) from [<c09fcf60>]
(__slab_alloc) from [<c01f1a6c>]
In one such instance, zone_page_state(zone, NR_ISOLATED_FILE)
had returned 14, zone_page_state(zone, NR_INACTIVE_FILE)
returned 92, and the gfp_flag was GFP_KERNEL, which resulted
in too_many_isolated returning true. But one of the CPU pageset
vmstat diffs had NR_ISOLATED_FILE as -14. As there weren't any more
updates to the per-cpu pageset, the threshold wasn't met, and the
tasks were blocked in the congestion wait.
This patch uses zone_page_state_snapshot instead, but restricts
its usage to avoid performance penalty.
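The arithmetic behind the stall can be reproduced with a small user-space
model. This is only a sketch: struct zone_model, MODEL_NR_CPUS and the
model_* helpers are hypothetical names, and the ">> 3" scaling applied to
GFP_KERNEL callers is an assumption about what too_many_isolated does on
this kernel.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4	/* hypothetical CPU count for the model */

/* Hypothetical stand-in for one vmstat counter of a zone. */
struct zone_model {
	long global;			/* folded, zone-wide value */
	long cpu_diff[MODEL_NR_CPUS];	/* per-cpu deltas not yet folded in */
};

/* Cheap read: looks at the folded value only. */
static long model_page_state(const struct zone_model *z)
{
	return z->global < 0 ? 0 : z->global;
}

/* Accurate read: also sums the pending per-cpu deltas. */
static long model_page_state_snapshot(const struct zone_model *z)
{
	long x = z->global;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		x += z->cpu_diff[cpu];
	return x < 0 ? 0 : x;
}

/* Assumed check: callers that may do IO and FS work compare against 1/8 of inactive. */
static bool model_too_many_isolated(long isolated, long inactive, bool gfp_io_fs)
{
	assert(isolated >= 0 && inactive >= 0);
	if (gfp_io_fs)
		inactive >>= 3;
	return isolated > inactive;
}
```

With a folded value of 14, one per-cpu delta of -14 and 92 inactive pages,
the cheap read yields 14 > 11 and the task keeps waiting, while the
snapshot read yields 0 and the task proceeds.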
Change-Id: Iec767a548e524729c7ed79a92fe4718cdd08ce69
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
shrink_inactive_list() used to wait 0.1s to avoid congestion when all
the pages that were isolated from the inactive list were dirty but not
under active writeback. That makes no real sense, and apparently causes
major interactivity issues under some loads since 3.11.
The ostensible reason for it was to wait for kswapd to start writing
pages, but that seems questionable as well, since the congestion wait
code seems to trigger for kswapd itself as well. Also, the logic behind
delaying anything when we haven't actually started writeback is not
clear - it only delays actually starting that writeback.
We'll still trigger the congestion waiting if
(a) the process is kswapd, and we hit pages flagged for immediate
reclaim
(b) the process is not kswapd, and the zone backing dev writeback is
actually congested.
This probably needs to be revisited, but as it is this fixes a reported
regression.
Reported-by: Felipe Contreras <felipe.contreras@gmail.com>
Pinpointed-by: Hillf Danton <dhillf@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: b738d764652dc5aab1c8939f637112981fce9e0e
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: I4fbcbb10d7ba242caf80da06bd8ed11770571cff
[vinmenon@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Commit "mm: vmscan: obey proportional scanning requirements for kswapd"
ensured that file/anon lists were scanned proportionally for reclaim from
kswapd but ignored it for direct reclaim. The intent was to minimise
direct reclaim latency but Yuanhan Liu pointed out that it substitutes one
long stall for many small stalls and distorts aging for normal workloads
like streaming readers/writers. Hugh Dickins pointed out that a
side-effect of the same commit was that when one LRU list dropped to zero
the entirety of the other list was shrunk, leading to excessive
reclaim in memcgs. This patch scans the file/anon lists proportionally
for direct reclaim to similarly age pages whether reclaimed by kswapd or
direct reclaim but takes care to abort reclaim if one LRU drops to zero
after reclaiming the requested number of pages.
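One way to picture "proportional" scanning is the rescaling step below. It
is a simplified sketch, not the code from this patch: the helper name and
its parameters are hypothetical, and the rounding details (including the
+1 in the divisor) are assumptions.

```c
#include <assert.h>

/*
 * Hypothetical helper: once the reclaim target is met, the smaller LRU
 * stops being scanned and the larger LRU keeps only the share of its
 * original target that matches the progress made on the smaller one.
 * Returns the number of pages still to scan on the larger LRU.
 */
static unsigned long model_rescale_remaining(unsigned long small_target,
					     unsigned long small_left,
					     unsigned long big_target,
					     unsigned long big_left)
{
	unsigned long percent_left, scanned, left;

	assert(small_left <= small_target && big_left <= big_target);

	/* Share of the smaller list's target still unscanned (+1 avoids /0). */
	percent_left = small_left * 100 / (small_target + 1);

	/* Scale the larger list's target by the progress on the smaller list. */
	left = big_target * (100 - percent_left) / 100;

	/* Credit what was already scanned on the larger list. */
	scanned = big_target - big_left;
	left -= (left < scanned) ? left : scanned;
	return left;
}
```

For example, with a smaller list half scanned (50 of 100 left) and a larger
list at 950 of 1000 left, the larger list is cut back to 460 remaining pages
instead of being scanned in full.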
Based on ext4 and using the Intel VM scalability test
3.15.0-rc5 3.15.0-rc5
shrinker proportion
Unit lru-file-readonce elapsed 5.3500 ( 0.00%) 5.4200 ( -1.31%)
Unit lru-file-readonce time_range 0.2700 ( 0.00%) 0.1400 ( 48.15%)
Unit lru-file-readonce time_stddv 0.1148 ( 0.00%) 0.0536 ( 53.33%)
Unit lru-file-readtwice elapsed 8.1700 ( 0.00%) 8.1700 ( 0.00%)
Unit lru-file-readtwice time_range 0.4300 ( 0.00%) 0.2300 ( 46.51%)
Unit lru-file-readtwice time_stddv 0.1650 ( 0.00%) 0.0971 ( 41.16%)
The test cases are running multiple dd instances reading sparse files. The results are within
the noise for the small test machine. The impact of the patch is more noticeable from the vmstats:
3.15.0-rc5 3.15.0-rc5
shrinker proportion
Minor Faults 35154 36784
Major Faults 611 1305
Swap Ins 394 1651
Swap Outs 4394 5891
Allocation stalls 118616 44781
Direct pages scanned 4935171 4602313
Kswapd pages scanned 15921292 16258483
Kswapd pages reclaimed 15913301 16248305
Direct pages reclaimed 4933368 4601133
Kswapd efficiency 99% 99%
Kswapd velocity 670088.047 682555.961
Direct efficiency 99% 99%
Direct velocity 207709.217 193212.133
Percentage direct scans 23% 22%
Page writes by reclaim 4858.000 6232.000
Page writes file 464 341
Page writes anon 4394 5891
Note that there are fewer allocation stalls even though the amount
of direct reclaim scanning is very approximately the same.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: 1a501907bbea8e6ebb0b16cf6db9e9cbf1d2c813
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: I93acb1ea93d90afca35f3db2a350f2e6589e7c64
[vinmenon@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
The VM is currently heavily tuned to avoid swapping. Whether that is
good or bad is a separate discussion, but as long as the VM won't swap
to make room for dirty cache, we can not consider anonymous pages when
calculating the amount of dirtyable memory, the baseline to which
dirty_background_ratio and dirty_ratio are applied.
A simple workload that occupies a significant size (40+%, depending on
memory layout, storage speeds etc.) of memory with anon/tmpfs pages and
uses the remainder for a streaming writer demonstrates this problem. In
that case, the actual cache pages are a small fraction of what is
considered dirtyable overall, which results in a relatively large
portion of the cache pages being dirtied. As kswapd starts rotating
these, random tasks enter direct reclaim and stall on IO.
Only consider free pages and file pages dirtyable.
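A worked example makes the effect on the dirty limit concrete. The sketch
below is a simplified model: struct mem_model and the model_* helpers are
hypothetical, and reserves and highmem handling are deliberately ignored.

```c
#include <assert.h>

/* Hypothetical page counts for a machine; not a kernel structure. */
struct mem_model {
	unsigned long free_pages;
	unsigned long file_pages;
	unsigned long anon_pages;
};

/* Simplified baseline before the change: anon pages count as dirtyable. */
static unsigned long model_dirtyable_old(const struct mem_model *m)
{
	return m->free_pages + m->file_pages + m->anon_pages;
}

/* Baseline after the change: only free and file pages. */
static unsigned long model_dirtyable_new(const struct mem_model *m)
{
	return m->free_pages + m->file_pages;
}

/* Dirty limit in pages for a ratio given in percent, as dirty_ratio is. */
static unsigned long model_dirty_limit(unsigned long dirtyable,
				       unsigned int ratio)
{
	assert(ratio <= 100);
	return dirtyable * ratio / 100;
}
```

With 100 free, 500 file and 400 anon pages and a ratio of 20, the old
baseline allows 200 dirty pages (40% of the file cache) while the new one
allows 120 (24% of the file cache).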
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tejun Heo <tj@kernel.org>
Tested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: a1c3bfb2f67ef766de03f1f56bdfff9c8595ab14
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: I35ae9cfbcccbf3329e6f15158cc7bb72905cb7ce
[vinmenon@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
After the patch "mm: vmscan: Flatten kswapd priority loop" was merged
the scanning priority of kswapd changed.
The priority now rises until it is scanning enough pages to meet the
high watermark. shrink_inactive_list sets ZONE_WRITEBACK if a number of
pages were encountered under writeback but this value is scaled based on
the priority. As kswapd frequently scans with a higher priority now it
is relatively easy to set ZONE_WRITEBACK. This patch removes the
scaling and treats writeback pages similar to how it treats unqueued
dirty pages and congested pages. The user-visible effect should be that
kswapd will writeback fewer pages from reclaim context.
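The effect of the scaling can be illustrated as follows. Both conditions
are assumptions reconstructed from the description above rather than
copied from the patch, and MODEL_DEF_PRIORITY and the function names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_DEF_PRIORITY 12	/* assumed to match the kernel's DEF_PRIORITY */

/* Assumed old rule: the threshold shrinks as the priority value drops. */
static bool model_flag_writeback_scaled(unsigned long nr_writeback,
					unsigned long nr_taken, int priority)
{
	assert(priority >= 0 && priority <= MODEL_DEF_PRIORITY);
	return nr_writeback &&
	       nr_writeback >= (nr_taken >> (MODEL_DEF_PRIORITY - priority));
}

/* Assumed new rule: every isolated page must be under writeback. */
static bool model_flag_writeback_unscaled(unsigned long nr_writeback,
					  unsigned long nr_taken)
{
	return nr_writeback && nr_writeback == nr_taken;
}
```

Under the scaled rule, 2 writeback pages out of 32 isolated are enough to
set the flag at priority 8 (32 >> 4 == 2); under the unscaled rule all 32
pages must be under writeback.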
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: 918fc718c5922520c499ad60f61b8df86b998ae9
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: I5f75351d845ab0de4ca1c22ffba10e06ea45d111
[vinmenon@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Direct reclaim is not aborting to allow compaction to go ahead properly.
do_try_to_free_pages is told to abort reclaim, which it happily ignores,
and instead increases priority until it reaches 0 and starts
shrinking file/anon equally. This patch corrects the situation by
aborting reclaim when requested instead of raising priority.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: 5a1c9cbc1550f93335d7c03eb6c271e642deff04
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: I1e3fc6b2fea5d5a06edf5c682caffa3a7907a7ad
[vinmenon@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Page reclaim keeps track of dirty and under writeback pages and uses it
to determine if wait_iff_congested() should stall or if kswapd should
begin writing back pages. This fails to account for buffer pages that
can be under writeback but not PageWriteback which is the case for
filesystems like ext3 ordered mode. Furthermore, PageDirty buffer pages
can have all the buffers clean and writepage does no IO so it should not
be accounted as congested.
This patch adds an address_space operation that filesystems may
optionally use to check if a page is really dirty or really under
writeback. An implementation for buffer_heads is added
and used for block operations and ext3 in ordered mode. By default the
page flags are obeyed.
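The "optional operation with a flag-based default" pattern can be sketched
in user space as below. All types and names here are hypothetical models
(including the callback name, which is an assumption), and the buffer state
is reduced to two booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct page_model;

/* Hypothetical operations table with one optional callback. */
struct aops_model {
	void (*is_dirty_writeback)(const struct page_model *page,
				   bool *dirty, bool *writeback);
};

struct page_model {
	bool flag_dirty;	/* models PageDirty */
	bool flag_writeback;	/* models PageWriteback */
	bool buffers_dirty;	/* any buffer actually dirty */
	bool buffers_locked;	/* stands in for buffers under IO */
	const struct aops_model *aops;
};

/* Buffer-based implementation: trust the buffers, not the page flags. */
static void model_buffer_check(const struct page_model *page,
			       bool *dirty, bool *writeback)
{
	*dirty = page->buffers_dirty;
	*writeback = page->buffers_locked;
}

/* Caller side: obey the page flags unless the callback is provided. */
static void model_check_dirty_writeback(const struct page_model *page,
					bool *dirty, bool *writeback)
{
	assert(page && dirty && writeback);
	*dirty = page->flag_dirty;
	*writeback = page->flag_writeback;
	if (page->aops && page->aops->is_dirty_writeback)
		page->aops->is_dirty_writeback(page, dirty, writeback);
}
```

A page whose flags say "dirty, not under writeback" is reported as such when
no callback is set, but as "clean, under writeback" when the buffer-based
callback sees clean buffers that are locked for IO.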
Credit goes to Jan Kara for identifying that the page flags alone are
not sufficient for ext3 and sanity checking a number of ideas on how the
problem could be addressed.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: b45972265f823ed01eae0867a176320071665787
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: Idabea6f388eddcf5acf4725975d51119169da211
[vinmenon@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Currently a zone will only be marked congested if the underlying BDI is
congested but if dirty pages are spread across zones it is possible that
an individual zone is full of dirty pages without being congested. The
impact is that the zone gets scanned very quickly, potentially reclaiming
really clean pages. This patch treats pages marked for immediate
reclaim as congested for the purposes of marking a zone ZONE_CONGESTED
and stalling in wait_iff_congested.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: d04e8acd03e5c3421ef18e3da7bc88d56179ca42
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: I321615bb32c4efe5889df9ce6482c825d7a816e6
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
shrink_inactive_list makes decisions on whether to stall based on the
number of dirty pages encountered. The wait_iff_congested() call in
shrink_page_list does no such thing and it's arbitrary.
This patch moves the decision on whether to set ZONE_CONGESTED and the
wait_iff_congested call into shrink_page_list. This keeps all the
decisions on whether to stall or not in the one place.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: 8e950282804558e4605401b9c79c1d34f0d73507
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: Ie73206306ff0589877cab6d1a4ec510d88088403
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
In shrink_page_list a decision may be made to stall and flag a zone as
ZONE_WRITEBACK so that if a large number of unqueued dirty pages are
encountered later then the reclaimer will stall. Set ZONE_WRITEBACK
before potentially going to sleep so it is noticed sooner.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: f7ab8db791a8692f5ed4201dbae25722c1732a8d
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: I32b015f56fb76c2c2f15163659eda478f63e4b5e
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Commit "mm: vmscan: Block kswapd if it is encountering pages under
writeback" blocks page reclaim if it encounters pages under writeback
marked for immediate reclaim. It blocks while pages are still isolated
from the LRU which is unnecessary. This patch defers the blocking until
after the isolated pages have been processed and tidies up some of the
comments.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: b1a6f21e3b2315d46ae8af88a8f4eb8ea2763107
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: Ia6da0949d7bf81cd7c8d3951a7f9c723131b9037
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Further testing of the "Reduce system disruption due to kswapd" series
discovered a few problems. First and foremost, it's possible for pages
under writeback to be freed which will lead to badness. Second, as
pages were not being swapped the file LRU was being scanned faster and
clean file pages were being reclaimed. In some cases this results in
increased read IO to re-read data from disk. Third, more pages were
being written from kswapd context which can adversely affect IO
performance. Lastly, it was observed that PageDirty pages are not
necessarily dirty on all filesystems (buffers can be clean while
PageDirty is set and ->writepage generates no IO) and not all
filesystems set PageWriteback when the page is being written (e.g.
ext3). This disconnect confuses the reclaim stalling logic. This
follow-up series is aimed at these problems.
The tests were based on three kernels
vanilla: kernel 3.9 as that is what the current mmotm uses as a baseline
mmotm-20130522 is mmotm as of 22nd May with "Reduce system disruption due to
kswapd" applied on top as per what should be in Andrew's tree
right now
lessdisrupt-v7r10 is this follow-up series on top of the mmotm kernel
The first test used memcached+memcachetest while some background IO was
in progress as implemented by the parallel IO tests in MM
Tests. memcachetest benchmarks how many operations/second memcached can
service. It starts with no background IO on a freshly created ext4
filesystem and then re-runs the test with larger amounts of IO in the
background to roughly simulate a large copy in progress. The
expectation is that the IO should have little or no impact on
memcachetest which is running entirely in memory.
parallelio
3.9.0 3.9.0 3.9.0
vanilla mm1-mmotm-20130522 mm1-lessdisrupt-v7r10
Ops memcachetest-0M 23117.00 ( 0.00%) 22780.00 ( -1.46%) 22763.00 ( -1.53%)
Ops memcachetest-715M 23774.00 ( 0.00%) 23299.00 ( -2.00%) 22934.00 ( -3.53%)
Ops memcachetest-2385M 4208.00 ( 0.00%) 24154.00 (474.00%) 23765.00 (464.76%)
Ops memcachetest-4055M 4104.00 ( 0.00%) 25130.00 (512.33%) 24614.00 (499.76%)
Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops io-duration-715M 12.00 ( 0.00%) 7.00 ( 41.67%) 6.00 ( 50.00%)
Ops io-duration-2385M 116.00 ( 0.00%) 21.00 ( 81.90%) 21.00 ( 81.90%)
Ops io-duration-4055M 160.00 ( 0.00%) 36.00 ( 77.50%) 35.00 ( 78.12%)
Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-715M 140138.00 ( 0.00%) 18.00 ( 99.99%) 18.00 ( 99.99%)
Ops swaptotal-2385M 385682.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-4055M 418029.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-715M 144.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-2385M 134227.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-4055M 125618.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops minorfaults-0M 1536429.00 ( 0.00%) 1531632.00 ( 0.31%) 1533541.00 ( 0.19%)
Ops minorfaults-715M 1786996.00 ( 0.00%) 1612148.00 ( 9.78%) 1608832.00 ( 9.97%)
Ops minorfaults-2385M 1757952.00 ( 0.00%) 1614874.00 ( 8.14%) 1613541.00 ( 8.21%)
Ops minorfaults-4055M 1774460.00 ( 0.00%) 1633400.00 ( 7.95%) 1630881.00 ( 8.09%)
Ops majorfaults-0M 1.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops majorfaults-715M 184.00 ( 0.00%) 167.00 ( 9.24%) 166.00 ( 9.78%)
Ops majorfaults-2385M 24444.00 ( 0.00%) 155.00 ( 99.37%) 93.00 ( 99.62%)
Ops majorfaults-4055M 21357.00 ( 0.00%) 147.00 ( 99.31%) 134.00 ( 99.37%)
memcachetest is the transactions/second reported by memcachetest. In
the vanilla kernel note that performance drops from around
23K/sec to just over 4K/second when there is 2385M of IO going
on in the background. With current mmotm, there is no collapse
in performance and with this follow-up series there is little
change.
swaptotal is the total amount of swap traffic. With mmotm and the follow-up
series, the total amount of swapping is much reduced.
3.9.0 3.9.0 3.9.0
vanilla mm1-mmotm-20130522 mm1-lessdisrupt-v7r10
Minor Faults 11160152 10706748 10622316
Major Faults 46305 755 678
Swap Ins 260249 0 0
Swap Outs 683860 18 18
Direct pages scanned 0 678 2520
Kswapd pages scanned 6046108 8814900 1639279
Kswapd pages reclaimed 1081954 1172267 1094635
Direct pages reclaimed 0 566 2304
Kswapd efficiency 17% 13% 66%
Kswapd velocity 5217.560 7618.953 1414.879
Direct efficiency 100% 83% 91%
Direct velocity 0.000 0.586 2.175
Percentage direct scans 0% 0% 0%
Zone normal velocity 5105.086 6824.681 671.158
Zone dma32 velocity 112.473 794.858 745.896
Zone dma velocity 0.000 0.000 0.000
Page writes by reclaim 1929612.000 6861768.000 32821.000
Page writes file 1245752 6861750 32803
Page writes anon 683860 18 18
Page reclaim immediate 7484 40 239
Sector Reads 1130320 93996 86900
Sector Writes 13508052 10823500 11804436
Page rescued immediate 0 0 0
Slabs scanned 33536 27136 18560
Direct inode steals 0 0 0
Kswapd inode steals 8641 1035 0
Kswapd skipped wait 0 0 0
THP fault alloc 8 37 33
THP collapse alloc 508 552 515
THP splits 24 1 1
THP fault fallback 0 0 0
THP collapse fail 0 0 0
There are a number of observations to make here
1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
pages swapped were really unused anonymous pages. Related to that,
major faults are much reduced.
2. kswapd efficiency was impacted by the initial series but with these
follow-up patches, the efficiency is now at 66% indicating that far
fewer pages were skipped during scanning due to dirty or writeback
pages.
3. kswapd velocity is reduced indicating that fewer pages are being scanned
with the follow-up series as kswapd now stalls when the tail of the
LRU queue is full of unqueued dirty pages. The stall gives flushers a
chance to catch up so kswapd can reclaim clean pages when it wakes.
4. In light of Zlatko's recent reports about zone scanning imbalances,
mmtests now reports scanning velocity on a per-zone basis. With mainline,
you can see that the scanning activity is dominated by the Normal
zone with over 45 times more scanning in Normal than the DMA32 zone.
With the series currently in mmotm, the ratio is slightly better but it
is still the case that the bulk of scanning is in the highest zone. With
this follow-up series, the ratio of scanning between the Normal and
DMA32 zone is roughly equal.
5. As Dave Chinner observed, the current patches in mmotm increased the
number of pages written from kswapd context which is expected to adversely
impact IO performance. With the follow-up patches, far fewer pages are
written from kswapd context than the mainline kernel.
6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
the follow-up series, there is less slab shrinking activity and no inodes
were reclaimed.
7. Note that "Sectors Read" is drastically reduced implying that the source
data being used for the IO is not being aggressively discarded due to
page reclaim skipping over dirty pages and reclaiming clean pages. Note
that the reduction in reads could also be due to inode data not being
re-read from disk after a slab shrink.
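The efficiency figures in the table above can be reproduced from the
scanned and reclaimed counts. The definition used here (pages reclaimed per
100 pages scanned, truncated, with 100% reported when nothing was scanned)
is an assumption that happens to match the reported numbers; the helper
name is hypothetical.

```c
#include <assert.h>

/* Assumed definition: reclaimed pages per 100 scanned pages, truncated. */
static unsigned int model_efficiency(unsigned long long reclaimed,
				     unsigned long long scanned)
{
	assert(reclaimed <= scanned);
	if (scanned == 0)
		return 100;	/* the table shows 100% when nothing was scanned */
	return (unsigned int)(reclaimed * 100 / scanned);
}
```

Feeding in the kswapd rows gives 17, 13 and 66 for the three kernels, and
the direct reclaim rows give 100, 83 and 91.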
3.9.0 3.9.0 3.9.0
vanilla mm1-mmotm-20130522 mm1-lessdisrupt-v7r10
Mean sda-avgqz 166.99 32.09 33.44
Mean sda-await 853.64 192.76 185.43
Mean sda-r_await 6.31 9.24 5.97
Mean sda-w_await 2992.81 202.65 192.43
Max sda-avgqz 1409.91 718.75 698.98
Max sda-await 6665.74 3538.00 3124.23
Max sda-r_await 58.96 111.95 58.00
Max sda-w_await 28458.94 3977.29 3148.61
In light of the changes in writes from reclaim context, the number of
reads and Dave Chinner's concerns about IO performance I took a closer
look at the IO stats for the test disk. A few observations:
1. The average queue size is reduced by the initial series and roughly
the same with this follow up.
2. Average wait times for writes are reduced and as the IO
is completing faster it at least implies that the gain is because
flushers are writing the files efficiently instead of page reclaim
getting in the way.
3. The reduction in maximum write latency is staggering. 28 seconds down
to 3 seconds.
Jan Kara asked how NFS is affected by all of this. Unstable pages can
be taken into account as one of the patches in the series shows but it
is still the case that filesystems with unusual handling of dirty or
writeback could still be treated better.
Tests like postmark, fsmark and largedd showed up nothing useful. On my test
setup, pages are simply not being written back from reclaim context with or
without the patches and there are no changes in performance. My test setup
probably is just not strong enough network-wise to be really interesting.
I ran a longer-lived memcached test with IO going to NFS instead of a local disk
parallelio
3.9.0 3.9.0 3.9.0
vanilla mm1-mmotm-20130522 mm1-lessdisrupt-v7r10
Ops memcachetest-0M 23323.00 ( 0.00%) 23241.00 ( -0.35%) 23321.00 ( -0.01%)
Ops memcachetest-715M 25526.00 ( 0.00%) 24763.00 ( -2.99%) 23242.00 ( -8.95%)
Ops memcachetest-2385M 8814.00 ( 0.00%) 26924.00 (205.47%) 23521.00 (166.86%)
Ops memcachetest-4055M 5835.00 ( 0.00%) 26827.00 (359.76%) 25560.00 (338.05%)
Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops io-duration-715M 65.00 ( 0.00%) 71.00 ( -9.23%) 11.00 ( 83.08%)
Ops io-duration-2385M 129.00 ( 0.00%) 94.00 ( 27.13%) 53.00 ( 58.91%)
Ops io-duration-4055M 301.00 ( 0.00%) 100.00 ( 66.78%) 108.00 ( 64.12%)
Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swaptotal-715M 14394.00 ( 0.00%) 949.00 ( 93.41%) 63.00 ( 99.56%)
Ops swaptotal-2385M 401483.00 ( 0.00%) 24437.00 ( 93.91%) 30118.00 ( 92.50%)
Ops swaptotal-4055M 554123.00 ( 0.00%) 35688.00 ( 93.56%) 63082.00 ( 88.62%)
Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Ops swapin-715M 4522.00 ( 0.00%) 560.00 ( 87.62%) 63.00 ( 98.61%)
Ops swapin-2385M 169861.00 ( 0.00%) 5026.00 ( 97.04%) 13917.00 ( 91.81%)
Ops swapin-4055M 192374.00 ( 0.00%) 10056.00 ( 94.77%) 25729.00 ( 86.63%)
Ops minorfaults-0M 1445969.00 ( 0.00%) 1520878.00 ( -5.18%) 1454024.00 ( -0.56%)
Ops minorfaults-715M 1557288.00 ( 0.00%) 1528482.00 ( 1.85%) 1535776.00 ( 1.38%)
Ops minorfaults-2385M 1692896.00 ( 0.00%) 1570523.00 ( 7.23%) 1559622.00 ( 7.87%)
Ops minorfaults-4055M 1654985.00 ( 0.00%) 1581456.00 ( 4.44%) 1596713.00 ( 3.52%)
Ops majorfaults-0M 0.00 ( 0.00%) 1.00 (-99.00%) 0.00 ( 0.00%)
Ops majorfaults-715M 763.00 ( 0.00%) 265.00 ( 65.27%) 75.00 ( 90.17%)
Ops majorfaults-2385M 23861.00 ( 0.00%) 894.00 ( 96.25%) 2189.00 ( 90.83%)
Ops majorfaults-4055M 27210.00 ( 0.00%) 1569.00 ( 94.23%) 4088.00 ( 84.98%)
1. Performance does not collapse due to IO which is good. IO is also completing
faster. Note with mmotm, IO completes in a third of the time and faster again
with this series applied
2. Swapping is reduced, although not eliminated. The figures for the follow-up
look bad but it does vary a bit as the stalling is not perfect for nfs
or filesystems like ext3 with unusual handling of dirty and writeback
pages
3. There are swapins, particularly with larger amounts of IO indicating
that active pages are being reclaimed. However, the number is much
reduced.
3.9.0 3.9.0 3.9.0
vanilla mm1-mmotm-20130522 mm1-lessdisrupt-v7r10
Minor Faults 36339175 35025445 35219699
Major Faults 310964 27108 51887
Swap Ins 2176399 173069 333316
Swap Outs 3344050 357228 504824
Direct pages scanned 8972 77283 43242
Kswapd pages scanned 20899983 8939566 14772851
Kswapd pages reclaimed 6193156 5172605 5231026
Direct pages reclaimed 8450 73802 39514
Kswapd efficiency 29% 57% 35%
Kswapd velocity 3929.743 1847.499 3058.840
Direct efficiency 94% 95% 91%
Direct velocity 1.687 15.972 8.954
Percentage direct scans 0% 0% 0%
Zone normal velocity 3721.907 939.103 2185.142
Zone dma32 velocity 209.522 924.368 882.651
Zone dma velocity 0.000 0.000 0.000
Page writes by reclaim 4082185.000 526319.000 537114.000
Page writes file 738135 169091 32290
Page writes anon 3344050 357228 504824
Page reclaim immediate 9524 170 5595843
Sector Reads 8909900 861192 1483680
Sector Writes 13428980 1488744 2076800
Page rescued immediate 0 0 0
Slabs scanned 38016 31744 28672
Direct inode steals 0 0 0
Kswapd inode steals 424 0 0
Kswapd skipped wait 0 0 0
THP fault alloc 14 15 119
THP collapse alloc 1767 1569 1618
THP splits 30 29 25
THP fault fallback 0 0 0
THP collapse fail 8 5 0
Compaction stalls 17 41 100
Compaction success 7 31 95
Compaction failures 10 10 5
Page migrate success 7083 22157 62217
Page migrate failure 0 0 0
Compaction pages isolated 14847 48758 135830
Compaction migrate scanned 18328 48398 138929
Compaction free scanned 2000255 355827 1720269
Compaction cost 7 24 68
I guess the main takeaway again is the much reduced page writes
from reclaim context and reduced reads.
3.9.0 3.9.0 3.9.0
vanilla mm1-mmotm-20130522 mm1-lessdisrupt-v7r10
Mean sda-avgqz 23.58 0.35 0.44
Mean sda-await 133.47 15.72 15.46
Mean sda-r_await 4.72 4.69 3.95
Mean sda-w_await 507.69 28.40 33.68
Max sda-avgqz 680.60 12.25 23.14
Max sda-await 3958.89 221.83 286.22
Max sda-r_await 63.86 61.23 67.29
Max sda-w_await 11710.38 883.57 1767.28
And as before, write wait times are much reduced.
This patch:
The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages
encountered, not priority" decides whether to writeback pages from reclaim
context based on the number of dirty pages encountered. This situation is
flagged too easily and flushers are not given the chance to catch up
resulting in more pages being written from reclaim context and potentially
impacting IO performance. The check for PageWriteback is also misplaced
as it happens within a PageDirty check which is nonsense as the dirty flag may
have been cleared for IO. The accounting is updated very late and pages
that are already under writeback, were reactivated, could not be unmapped or
could not be released are all missed. Similarly, a page is considered
congested for reasons other than being congested and pages that cannot be
written out in the correct context are skipped. Finally, it considers
stalling and writing back filesystem pages due to encountering dirty
anonymous pages at the tail of the LRU which is dumb.
This patch causes kswapd to begin writing filesystem pages from reclaim
context only if page reclaim found that all filesystem pages at the tail
of the LRU were unqueued dirty pages. Before it starts writing filesystem
pages, it will stall to give flushers a chance to catch up. The decision
on whether to call wait_iff_congested is also now determined by dirty filesystem
pages only. Congested pages are based on whether the underlying BDI is
congested regardless of the context of the reclaiming process.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: e2be15f6c3eecedfbe1550cca8d72c5057abbbd2
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: I2c8aee00da5e3e9562984e792d16f9e11bd4a435
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
balance_pgdat() is very long and some of the logic can and should be
internal to kswapd_shrink_zone(). Move it so the flow of
balance_pgdat() is marginally easier to follow.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: 7c954f6de6b630de30f265a079aad359f159ebe9
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: I6c4e76e6e132c5982c228863c99195d7ad7768bc
[vinmenon@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Currently kswapd checks if it should start writepage as it shrinks each
zone without taking into consideration if the zone is balanced or not.
This is not wrong as such but it does not make much sense either. This
patch checks once per pgdat scan if kswapd should be writing pages.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: b7ea3c417b6c2e74ca1cb051568f60377908928d
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[vinmenon@codeaurora.org: resolve trivial merge conflicts]
Change-Id: I7cb0fb685f8346f07d0fc4810f6c593334cd1590
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Historically, kswapd used to congestion_wait() at higher priorities if
it was not making forward progress. This made no sense as the failure
to make progress could be completely independent of IO. It was later
replaced by wait_iff_congested() and removed entirely by commit 258401a6
(mm: don't wait on congested zones in balance_pgdat()) as it was
duplicating logic in shrink_inactive_list().
This is problematic. If kswapd encounters many pages under writeback
and it continues to scan until it reaches the high watermark then it
will quickly skip over the pages under writeback and reclaim clean young
pages or push applications out to swap.
The use of wait_iff_congested() is not suited to kswapd as it will only
stall if the underlying BDI is really congested or a direct reclaimer
was unable to write to the underlying BDI. kswapd bypasses the BDI
congestion as it sets PF_SWAPWRITE but even if this was taken into
account then it would cause direct reclaimers to stall on writeback
which is not desirable.
This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
encountering too many pages under writeback. If this flag is set and
kswapd encounters a PageReclaim page under writeback then it'll assume
that the LRU lists are being recycled too quickly before IO can complete
and block waiting for some IO to complete.
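The stall decision described above can be modelled in user space as a rough Python sketch. This is not the kernel code: the `Zone` class and both function names are hypothetical, and the threshold used to set the flag (a share of the isolated batch that halves each time the priority is raised) is an assumption made for illustration.

```python
DEF_PRIORITY = 12  # starting reclaim priority in the kernel's vmscan code


class Zone:
    """Hypothetical stand-in for a zone; only the writeback flag is modelled."""

    def __init__(self):
        self.writeback = False  # models the ZONE_WRITEBACK flag


def note_writeback(zone, nr_taken, nr_writeback, priority):
    """Flag the zone when a large share of an isolated batch is under writeback.

    Assumed threshold: nr_taken >> (DEF_PRIORITY - priority).
    """
    if nr_writeback and nr_writeback >= nr_taken >> (DEF_PRIORITY - priority):
        zone.writeback = True


def kswapd_should_stall(zone, is_kswapd, page_reclaim, page_writeback):
    """kswapd blocks only for a PageReclaim page under writeback in a flagged zone."""
    return bool(is_kswapd and zone.writeback and page_reclaim and page_writeback)
```

In this model a direct reclaimer never stalls on the flag, which matches the stated goal of not making direct reclaimers wait on writeback.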
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: 283aba9f9e0e4882bf09bd37a2983379a6fae805
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: Ib34f1959c0e5265242152f98cc52c62ab7015993
[vinmenon@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Currently kswapd queues dirty pages for writeback if scanning at an
elevated priority but the priority kswapd scans at is not related to the
number of unqueued dirty pages encountered. Since commit "mm: vmscan: Flatten
kswapd priority loop", the priority is related to the size of the LRU
and the zone watermark which is no indication as to whether kswapd
should write pages or not.
This patch tracks if an excessive number of unqueued dirty pages are
being encountered at the end of the LRU. If so, it indicates that dirty
pages are being recycled before flusher threads can clean them and flags
the zone so that kswapd will start writing pages until the zone is
balanced.
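The life cycle of the flag described above might look like the following Python sketch. It is a toy model under stated assumptions: `ZoneState` and the three helper names are hypothetical, and the definition of "excessive" is passed in as a parameter because the message does not state the threshold.

```python
class ZoneState:
    """Hypothetical per-zone state; models only the dirty-tail flag."""

    def __init__(self):
        self.tail_lru_dirty = False  # models ZONE_TAIL_LRU_DIRTY


def note_isolated_batch(zone, nr_unqueued_dirty, threshold):
    """Flag the zone when too many unqueued dirty pages sit at the LRU tail.

    `threshold` is an assumed parameter, not a kernel constant.
    """
    if nr_unqueued_dirty and nr_unqueued_dirty >= threshold:
        zone.tail_lru_dirty = True


def kswapd_may_writepage(zone):
    """kswapd writes pages only while the zone is flagged."""
    return zone.tail_lru_dirty


def zone_became_balanced(zone):
    """Once the zone is balanced the flag is cleared and writing stops."""
    zone.tail_lru_dirty = False
```

The point of the design is that the write decision follows what was observed at the tail of the LRU rather than the scanning priority.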
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: d43006d503ac921c7df4f94d13c17db6f13c9d26
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: I565caf3aef9f3e5f59cda1adc70207412719a2ed
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Page reclaim at priority 0 will scan the entire LRU as priority 0 is
considered to be a near OOM condition. Kswapd can reach priority 0
quite easily if it is encountering a large number of pages it cannot
reclaim such as pages under writeback. When this happens, kswapd
reclaims very aggressively even though there may be no real risk of
allocation failure or OOM.
This patch prevents kswapd reaching priority 0 and trying to reclaim the
world. Direct reclaimers will still reach priority 0 in the event of an
OOM situation.
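To see why priority 0 amounts to "reclaiming the world", consider this small Python sketch. It assumes the usual relationship where one pass scans the LRU size shifted right by the priority; the helper names are hypothetical and the priority floor for kswapd is taken from the description above.

```python
DEF_PRIORITY = 12  # starting reclaim priority in the kernel's vmscan code


def scan_target(lru_size, priority):
    """Pages scanned in one pass: the LRU size shifted right by the priority."""
    return lru_size >> priority


def priority_sequence(is_kswapd):
    """Priorities a reclaimer may walk through, from DEF_PRIORITY downwards.

    kswapd stops at 1; direct reclaim may still reach 0.
    """
    floor = 1 if is_kswapd else 0
    return list(range(DEF_PRIORITY, floor - 1, -1))
```

For an LRU of 1048576 pages, priority 1 scans at most half the list in one pass, while priority 0 scans all of it.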
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: 9aa41348a8d11427feec350b21dcdd4330fd20c4
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: I6bd5891e9f2b670b3c495cfad26d69af92e6d856
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
In the past, kswapd made a decision on whether to compact memory after
the pgdat was considered balanced. This more or less worked but it is
late to make such a decision and does not fit well now that kswapd makes
a decision whether to exit the zone scanning loop depending on reclaim
progress.
This patch will compact a pgdat if at least the requested number of
pages were reclaimed from unbalanced zones for a given priority. If any
zone is currently balanced, kswapd will not call compaction as it is
expected the necessary pages are already available.
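Read literally, the rule above could be sketched as follows in Python. The function and parameter names are hypothetical, and the comparison follows the wording of this message ("at least"); the actual kernel condition may differ in detail.

```python
def should_compact(order, zone_is_balanced, nr_reclaimed, nr_requested):
    """Decide whether kswapd should compact the pgdat after a reclaim pass.

    order            -- allocation order kswapd was woken for (0 = no compaction)
    zone_is_balanced -- iterable of booleans, one per zone in the pgdat
    nr_reclaimed     -- pages reclaimed from unbalanced zones at this priority
    nr_requested     -- pages kswapd set out to reclaim at this priority
    """
    if order == 0:
        return False
    if any(zone_is_balanced):
        # A balanced zone is expected to already hold the necessary pages.
        return False
    return nr_reclaimed >= nr_requested
```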
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: 2ab44f434586b8ccb11f781b4c2730492e6628f5
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: Ie490e6df9576de1de1bc0c3c1b634618394dcf8e
[vinmenon@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
kswapd stops raising the scanning priority when at least
SWAP_CLUSTER_MAX pages have been reclaimed or the pgdat is considered
balanced. It then rechecks if it needs to restart at DEF_PRIORITY and
whether high-order reclaim needs to be reset. This is not wrong per se
but it is confusing to follow and forcing kswapd to stay at DEF_PRIORITY
may require several restarts before it has scanned enough pages to meet
the high watermark even at 100% efficiency. This patch irons out the
logic a bit by controlling when priority is raised and removing the
"goto loop_again".
This patch has kswapd raise the scanning priority until it is scanning
enough pages that it could meet the high watermark in one shrink of the
LRU lists if it is able to reclaim at 100% efficiency. It will not
raise the scanning priority higher unless it is failing to reclaim any
pages.
To avoid infinite looping for high-order allocation requests kswapd will
not reclaim for high-order allocations when it has reclaimed at least
twice the number of pages as the allocation request.
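The two rules above can be written down as a short Python sketch. The function names are hypothetical and the model ignores everything else balance_pgdat() does; it only encodes when the priority is raised and when a high-order request falls back to order 0.

```python
def should_raise_priority(nr_scanned, nr_reclaimed, high_wmark_pages):
    """Raise the priority if too few pages were scanned or nothing was reclaimed."""
    scanned_enough = nr_scanned >= high_wmark_pages
    return (not scanned_enough) or nr_reclaimed == 0


def effective_order(order, nr_reclaimed):
    """Fall back to order 0 once twice the requested pages have been reclaimed.

    An order-N request covers 1 << N pages, so twice that is 2 << N.
    """
    if order > 0 and nr_reclaimed >= (2 << order):
        return 0
    return order
```

For example, an order-3 request (8 pages) stops being treated as high-order once 16 pages have been reclaimed.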
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: b8e83b942a16eb73e63406592d3178207a4f07a1
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: I93ee675006800f2805408f2865150182bfd4b22b
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Simplistically, the anon and file LRU lists are scanned proportionally
depending on the value of vm.swappiness although there are other factors
taken into account by get_scan_count(). The patch "mm: vmscan: Limit
the number of pages kswapd reclaims" limits the number of pages kswapd
reclaims but it breaks this proportional scanning and may evenly shrink
anon/file LRUs regardless of vm.swappiness.
This patch preserves the proportional scanning and reclaim. It does
mean that kswapd will reclaim more than requested but the number of
pages will be related to the high watermark.
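One way to preserve the proportion once the reclaim goal has been met is sketched below in Python. This is a simplified two-list model, not the kernel implementation: the function name is hypothetical, the active/inactive split is ignored, and the arithmetic is only one plausible reading of "recalculate scan based on target".

```python
def rebalance_scan(targets, remaining):
    """Recompute remaining scan counts so both lists end at the same proportion.

    targets   -- original scan targets, e.g. {"anon": 100, "file": 1000}
    remaining -- pages still to scan on each list when the goal was met
    """
    small = min(remaining, key=remaining.get)
    large = "file" if small == "anon" else "anon"
    # Share of the smaller list's target that has not been scanned yet.
    pct_left = remaining[small] * 100 // (targets[small] + 1)
    already_scanned = targets[large] - remaining[large]
    # Let the larger list reach the same scanned share, then stop.
    new_large = targets[large] * (100 - pct_left) // 100
    new_large -= min(new_large, already_scanned)
    return {small: 0, large: new_large}
```

With targets of 100 anon and 1000 file pages, and 50 anon and 900 file pages still unscanned, the anon list stops and the file list is left with 410 pages, so roughly half of each target ends up scanned.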
[mhocko@suse.cz: Correct proportional reclaim for memcg and simplify]
[kamezawa.hiroyu@jp.fujitsu.com: Recalculate scan based on target]
[hannes@cmpxchg.org: Account for already scanned pages properly]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: e82e0561dae9f3ae5a21fc2d3d3ccbe69d90be46
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: I9dc9b73c0d73c27cda72181b4eb3f625e491f114
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
This series does not fix all the current known problems with reclaim but
it addresses one important swapping bug when there is background IO.
Changelog since V3
- Drop the slab shrink changes in light of Glauber's series and
  discussions highlighting that there were a number of potential
  problems with the patch. (mel)
- Rebased to 3.10-rc1
Changelog since V2
- Preserve ratio properly for proportional scanning (kamezawa)
Changelog since V1
- Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY (andi)
- Reformat comment in shrink_page_list (andi)
- Clarify some comments (dhillf)
- Rework how the proportional scanning is preserved
- Add PageReclaim check before kswapd starts writeback
- Reset sc.nr_reclaimed on every full zone scan
Kswapd and page reclaim behaviour has been screwy in one way or the
other for a long time. Very broadly speaking it worked in the far past
because machines were limited in memory so it did not have that many
pages to scan and it stalled in congestion_wait() frequently to prevent it
going completely nuts. In recent times it has behaved very
unsatisfactorily with some of the problems compounded by the removal of
stall logic and the introduction of transparent hugepage support with
high-order reclaims.
There are many variations of bugs that are rooted in this area. One
example is reports of large copy operations or backups causing the
machine to grind to a halt or applications pushed to swap. Sometimes in
low memory situations a large percentage of memory suddenly gets
reclaimed. In other cases an application starts and kswapd hits 100%
CPU usage for prolonged periods of time and so on. There is now talk of
introducing features like an extra free kbytes tunable to work around
aspects of the problem instead of trying to deal with it. It's
compounded by the problem that it can be very workload and machine
specific.
This series aims at addressing some of the worst of these problems
without attempting to fundamentally alter how page reclaim works.
Patches 1-2 limit the number of pages kswapd reclaims while still obeying
            the anon/file proportion of the LRUs it should be scanning.
Patches 3-4 control how and when kswapd raises its scanning priority and
            delete the scanning restart logic which is tricky to follow.
Patch 5     notes that it is too easy for kswapd to reach priority 0 when
            scanning and then reclaim the world. Down with that sort of thing.
Patch 6     notes that kswapd starts writeback based on scanning priority which
            is not necessarily related to dirty pages. It will have kswapd
            write back pages if a number of unqueued dirty pages have been
            recently encountered at the tail of the LRU.
Patch 7     notes that sometimes kswapd should stall waiting on IO to complete
            to reduce LRU churn and the likelihood that it'll reclaim young
            clean pages or push applications to swap. It will cause kswapd
            to block on IO if it detects that pages being reclaimed under
            writeback are recycling through the LRU before the IO completes.
Patches 8-9 are cosmetic but balance_pgdat() is easier to follow after they
            are applied.
This was tested using memcached+memcachetest while some background IO
was in progress, as implemented by the parallel IO tests in MM Tests.
memcachetest benchmarks how many operations/second memcached can service
and it is run multiple times. It starts with no background IO and then
re-runs the test with larger amounts of IO in the background to roughly
simulate a large copy in progress. The expectation is that the IO
should have little or no impact on memcachetest which is running
entirely in memory.
                                   3.10.0-rc1             3.10.0-rc1
                                      vanilla         lessdisrupt-v4
Ops memcachetest-0M        22155.00 (  0.00%)     22180.00 (  0.11%)
Ops memcachetest-715M      22720.00 (  0.00%)     22355.00 ( -1.61%)
Ops memcachetest-2385M      3939.00 (  0.00%)     23450.00 (495.33%)
Ops memcachetest-4055M      3628.00 (  0.00%)     24341.00 (570.92%)
Ops io-duration-0M             0.00 (  0.00%)         0.00 (  0.00%)
Ops io-duration-715M          12.00 (  0.00%)         7.00 ( 41.67%)
Ops io-duration-2385M        118.00 (  0.00%)        21.00 ( 82.20%)
Ops io-duration-4055M        162.00 (  0.00%)        36.00 ( 77.78%)
Ops swaptotal-0M               0.00 (  0.00%)         0.00 (  0.00%)
Ops swaptotal-715M        140134.00 (  0.00%)        18.00 ( 99.99%)
Ops swaptotal-2385M       392438.00 (  0.00%)         0.00 (  0.00%)
Ops swaptotal-4055M       449037.00 (  0.00%)     27864.00 ( 93.79%)
Ops swapin-0M                  0.00 (  0.00%)         0.00 (  0.00%)
Ops swapin-715M                0.00 (  0.00%)         0.00 (  0.00%)
Ops swapin-2385M          148031.00 (  0.00%)         0.00 (  0.00%)
Ops swapin-4055M          135109.00 (  0.00%)         0.00 (  0.00%)
Ops minorfaults-0M       1529984.00 (  0.00%)   1530235.00 ( -0.02%)
Ops minorfaults-715M     1794168.00 (  0.00%)   1613750.00 ( 10.06%)
Ops minorfaults-2385M    1739813.00 (  0.00%)   1609396.00 (  7.50%)
Ops minorfaults-4055M    1754460.00 (  0.00%)   1614810.00 (  7.96%)
Ops majorfaults-0M             0.00 (  0.00%)         0.00 (  0.00%)
Ops majorfaults-715M         185.00 (  0.00%)       180.00 (  2.70%)
Ops majorfaults-2385M      24472.00 (  0.00%)       101.00 ( 99.59%)
Ops majorfaults-4055M      22302.00 (  0.00%)       229.00 ( 98.97%)
Note how the vanilla kernel's performance collapses when there is enough
IO taking place in the background. This drop in performance is part of
what users complain of when they start backups. Note how the swapin and
major fault figures indicate that processes were being pushed to swap
prematurely. With the series applied, there is no noticeable performance
drop and while there is still some swap activity, it's tiny.
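Most of the bracketed percentages in the table above appear consistent with a simple relative gain against the vanilla column, with the sign flipped for metrics where lower is better. The Python sketch below reproduces a few of them; the function name and the `higher_is_better` parameter are hypothetical, and this is an inference from the figures rather than a description of how MM Tests computes them.

```python
def gain_pct(baseline, patched, higher_is_better=True):
    """Relative gain of `patched` over `baseline`, as a percentage."""
    if baseline == 0:
        return 0.0  # nothing to compare against
    diff = patched - baseline if higher_is_better else baseline - patched
    return round(diff * 100.0 / baseline, 2)
```

For memcachetest-2385M this gives gain_pct(3939, 23450) = 495.33, and for io-duration-715M, where a shorter duration is better, gain_pct(12, 7, higher_is_better=False) = 41.67.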
20 iterations of this test were run in total and averaged. Every 5
iterations, additional IO was generated in the background using dd to
measure how the workload was impacted. The 0M, 715M, 2385M and 4055M
subblocks refer to the amount of IO going on in the background at each
iteration. So memcachetest-2385M is reporting how many
transactions/second memcachetest recorded on average over 5 iterations
while there was 2385M of IO going on in the background. There are six
blocks of information reported here:
memcachetest is the transactions/second reported by memcachetest. In
        the vanilla kernel note that performance drops from around
        22K/sec to just under 4K/second when there is 2385M of IO going
        on in the background. This is one type of performance collapse
        users complain about if a large cp or backup starts in the
        background.
io-duration refers to how long it takes for the background IO to
        complete. It shows that with the patched kernel the IO
        completes faster while not interfering with the memcache
        workload.
swaptotal is the total amount of swap traffic. With the patched kernel,
        the total amount of swapping is much reduced although it is
        still not zero.
swapin in this case is an indication as to whether we are swap thrashing.
        The closer the swapin/swapout ratio is to 1, the worse the
        thrashing is. Note with the patched kernel that there is no swapin
        activity, indicating that all the pages swapped were really inactive
        unused pages.
minorfaults are just minor faults. An increased number of minor faults
        can indicate that page reclaim is unmapping the pages but not
        swapping them out before they are faulted back in. With the
        patched kernel, there is only a small change in minor faults.
majorfaults are just major faults in the target workload and a high
        number can indicate that a workload is being prematurely
        swapped. With the patched kernel, major faults are much reduced. As
        there are no swapins recorded, the workload is not being swapped. The
        likely explanation is that libraries or configuration files used by
        the workload during startup get paged out by the background IO.
Overall with the series applied, there is no noticeable performance drop
due to background IO and while there is still some swap activity, it's
tiny and the lack of swapins implies that the swapped pages were inactive
and unused.
                            3.10.0-rc1      3.10.0-rc1
                               vanilla  lessdisrupt-v4
Page Ins                       1234608          101892
Page Outs                     12446272        11810468
Swap Ins                        283406               0
Swap Outs                       698469           27882
Direct pages scanned                 0          136480
Kswapd pages scanned           6266537         5369364
Kswapd pages reclaimed         1088989          930832
Direct pages reclaimed               0          120901
Kswapd efficiency                  17%             17%
Kswapd velocity               5398.371        4635.115
Direct efficiency                 100%             88%
Direct velocity                  0.000         117.817
Percentage direct scans             0%              2%
Page writes by reclaim         1655843         4009929
Page writes file                957374         3982047
Page writes anon                698469           27882
Page reclaim immediate            5245            1745
Page rescued immediate               0               0
Slabs scanned                    33664           25216
Direct inode steals                  0               0
Kswapd inode steals              19409             778
Kswapd skipped wait                  0               0
THP fault alloc                     35              30
THP collapse alloc                 472             401
THP splits                          27              22
THP fault fallback                   0               0
THP collapse fail                    0               1
Compaction stalls                    0               4
Compaction success                   0               0
Compaction failures                  0               4
Page migrate success                 0               0
Page migrate failure                 0               0
Compaction pages isolated            0               0
Compaction migrate scanned           0               0
Compaction free scanned              0               0
Compaction cost                      0               0
NUMA PTE updates                     0               0
NUMA hint faults                     0               0
NUMA hint local faults               0               0
NUMA pages migrated                  0               0
AutoNUMA cost                        0               0
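The derived rows in the table above (efficiency and percentage of direct scans) can be recomputed from the raw counters. The Python sketch below assumes integer truncation and treats "nothing scanned" as 100% efficiency, which matches the figures shown; the function names are hypothetical.

```python
def efficiency_pct(reclaimed, scanned):
    """Pages reclaimed per page scanned, as a truncated percentage."""
    if scanned == 0:
        return 100  # assumed convention when nothing was scanned
    return reclaimed * 100 // scanned


def direct_scan_pct(direct_scanned, kswapd_scanned):
    """Share of all scanning that was done by direct reclaim."""
    total = direct_scanned + kswapd_scanned
    if total == 0:
        return 0
    return direct_scanned * 100 // total
```

For the patched kernel, efficiency_pct(930832, 5369364) gives 17 and direct_scan_pct(136480, 5369364) gives 2, in line with the table.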
Unfortunately, note that there is a small amount of direct reclaim due to
kswapd no longer reclaiming the world. ftrace indicates that the direct
reclaim stalls are mostly harmless with the vast bulk of the stalls
incurred by dd:
     23 tclsh-3367
     38 memcachetest-13733
     49 memcachetest-12443
     57 tee-3368
   1541 dd-13826
   1981 dd-12539
A consequence of the direct reclaim for dd is that the processes for the
IO workload may show a higher system CPU usage. There is also a risk that
kswapd not reclaiming the world may mean that it stays awake balancing
zones, does not stall on the appropriate events and continually scans
pages it cannot reclaim consuming CPU. This will be visible as continued
high CPU usage but in my own tests I only saw a single spike lasting less
than a second and I did not observe any problems related to reclaim while
running the series on my desktop.
This patch:
The number of pages kswapd can reclaim is bound by the number of pages it
scans which is related to the size of the zone and the scanning priority.
In many cases the priority remains low because it's reset every
SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large
number of pages it cannot reclaim, it will raise the priority and
potentially discard a large percentage of the zone as sc->nr_to_reclaim is
ULONG_MAX. The user-visible effect is a reclaim "spike" where a large
percentage of memory is suddenly freed. It would be bad enough if this
was just unused memory but because of how anon/file pages are balanced it
is possible that applications get pushed to swap unnecessarily.
This patch limits the number of pages kswapd will reclaim to the high
watermark. Reclaim will still overshoot due to it not being a hard limit
as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
prevents kswapd reclaiming the world at higher priorities. The number of
pages it reclaims is not adjusted for high-order allocations as kswapd
will reclaim excessively if it is to balance zones for high-order
allocations.
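The limit and its soft nature can be summarised in a few lines of Python. This is a sketch, not the kernel code: the function names are hypothetical, and clamping the target to at least SWAP_CLUSTER_MAX is an assumption that is not stated in this message.

```python
DEF_PRIORITY = 12      # starting reclaim priority in the kernel's vmscan code
SWAP_CLUSTER_MAX = 32  # usual reclaim batch size


def kswapd_reclaim_target(high_wmark_pages):
    """Pages kswapd aims to reclaim from a zone, instead of ULONG_MAX.

    The lower bound of SWAP_CLUSTER_MAX is an assumption.
    """
    return max(SWAP_CLUSTER_MAX, high_wmark_pages)


def may_stop_reclaim(nr_reclaimed, nr_to_reclaim, priority):
    """The target is honoured only once the priority has been raised.

    At DEF_PRIORITY the target is ignored, which is why reclaim can overshoot.
    """
    return nr_reclaimed >= nr_to_reclaim and priority < DEF_PRIORITY
```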
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: 75485363ce8552698bfb9970d901f755d5713cca
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: Idfce2d7ebe6a809f47ce88344a4954a634e9470e
[vinmenon@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Allow the kswapd cpu affinity to be configured.
There can be power benefits on certain targets when limiting kswapd
to run only on certain cores.
CRs-fixed: 752344
Change-Id: I8a83337ff313a7e0324361140398226a09f8be0f
Signed-off-by: Liam Mark <lmark@codeaurora.org>
A workaround was added earlier to move a page to active
list if swapping to devices like zram fails. But this
can result in try_to_free_swap being called from
shrink_page_list, without a properly locked page.
Lock the page when we indicate to activate a page
in pageout().
Add a check to ensure that error is on swap, and
clear the error flag before moving the page to
active list.
CRs-fixed: 760049
Change-Id: I77a8bbd6ed13efdec943298fe9448412feeac176
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
* commit 'v3.10.49': (529 commits)
Linux 3.10.49
ACPI / battery: Retry to get battery information if failed during probing
x86, ioremap: Speed up check for RAM pages
Score: Modify the Makefile of Score, remove -mlong-calls for compiling
Score: The commit is for compiling successfully.
Score: Implement the function csum_ipv6_magic
score: normalize global variables exported by vmlinux.lds
rtmutex: Plug slow unlock race
rtmutex: Handle deadlock detection smarter
rtmutex: Detect changes in the pi lock chain
rtmutex: Fix deadlock detector for real
ring-buffer: Check if buffer exists before polling
drm/radeon: stop poisoning the GART TLB
drm/radeon: fix typo in golden register setup on evergreen
ext4: disable synchronous transaction batching if max_batch_time==0
ext4: clarify error count warning messages
ext4: fix unjournalled bg descriptor while initializing inode bitmap
dm io: fix a race condition in the wake up code for sync_io
Drivers: hv: vmbus: Fix a bug in the channel callback dispatch code
clk: spear3xx: Use proper control register offset
...
In addition to bringing in upstream commits, this merge also makes minor
changes to maintain compatibility with upstream:
The definition of list_next_entry in qcrypto.c and ipa_dp.c has been
removed, as upstream has moved the definition to list.h. The implementation
of list_next_entry was identical between the two.
irq.c, for both arm and arm64 architecture, has had its calls to
__irq_set_affinity_locked updated to reflect changes to the API upstream.
Finally, as we have removed the sleep_length member variable of the
tick_sched struct, all changes made by upstream commit ec804bd do not
apply to our tree and have been removed from this merge. Only
kernel/time/tick-sched.c is impacted.
Change-Id: I63b7e0c1354812921c94804e1f3b33d1ad6ee3f1
Signed-off-by: Ian Maund <imaund@codeaurora.org>
Ensure that shrinkers are given the option to completely drop
their caches even when their caches are smaller than the batch size.
This change helps improve memory headroom by ensuring that under
significant memory pressure shrinkers can drop all of their caches.
This change only attempts to more aggressively call the shrinkers
during background memory reclaim in order to avoid hurting the
performance of direct memory reclaim.
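The behaviour described above might be modelled as follows in Python. The message does not show the implementation, so this is only one plausible reading: the function name is hypothetical, and the assumption is that background reclaim is allowed one final, smaller batch while direct reclaim keeps skipping anything below the batch size.

```python
def shrink_batches(total_scan, batch_size, background):
    """Return the sizes of the shrinker calls made for one cache.

    total_scan -- number of objects the shrinker is asked to scan
    batch_size -- the shrinker's normal batch size
    background -- True for background reclaim (kswapd), False for direct reclaim
    """
    floor = 1 if background else batch_size
    batches = []
    while total_scan >= floor:
        n = min(batch_size, total_scan)
        batches.append(n)
        total_scan -= n
    return batches
```

With a batch size of 128, a cache of 100 objects is skipped entirely by direct reclaim but can be dropped in a single call during background reclaim.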
Change-Id: I8dbc29c054add639e4810e36fd2c8a063e5c52f3
Signed-off-by: Liam Mark <lmark@codeaurora.org>
commit 71abdc15adf8c702a1dd535f8e30df50758848d2 upstream.
When kswapd exits, it can end up taking locks that were previously held
by allocating tasks while they waited for reclaim. Lockdep currently
warns about this:
On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
> inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
> kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
> (&sig->group_rwsem){+++++?}, at: exit_signals+0x24/0x130
> {RECLAIM_FS-ON-W} state was registered at:
> mark_held_locks+0xb9/0x140
> lockdep_trace_alloc+0x7a/0xe0
> kmem_cache_alloc_trace+0x37/0x240
> flex_array_alloc+0x99/0x1a0
> cgroup_attach_task+0x63/0x430
> attach_task_by_pid+0x210/0x280
> cgroup_procs_write+0x16/0x20
> cgroup_file_write+0x120/0x2c0
> vfs_write+0xc0/0x1f0
> SyS_write+0x4c/0xa0
> tracesys+0xdd/0xe2
> irq event stamp: 49
> hardirqs last enabled at (49): _raw_spin_unlock_irqrestore+0x36/0x70
> hardirqs last disabled at (48): _raw_spin_lock_irqsave+0x2b/0xa0
> softirqs last enabled at (0): copy_process.part.24+0x627/0x15f0
> softirqs last disabled at (0): (null)
>
> other info that might help us debug this:
> Possible unsafe locking scenario:
>
> CPU0
> ----
> lock(&sig->group_rwsem);
> <Interrupt>
> lock(&sig->group_rwsem);
>
> *** DEADLOCK ***
>
> no locks held by kswapd2/1151.
>
> stack backtrace:
> CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
> Call Trace:
> dump_stack+0x19/0x1b
> print_usage_bug+0x1f7/0x208
> mark_lock+0x21d/0x2a0
> __lock_acquire+0x52a/0xb60
> lock_acquire+0xa2/0x140
> down_read+0x51/0xa0
> exit_signals+0x24/0x130
> do_exit+0xb5/0xa50
> kthread+0xdb/0x100
> ret_from_fork+0x7c/0xb0
This is because the kswapd thread is still marked as a reclaimer at the
time of exit. But because it is exiting, nobody is actually waiting on
it to make reclaim progress anymore, and it's nothing but a regular
thread at this point. Be tidy and strip it of all its powers
(PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state)
before returning from the thread function.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 675becce15f320337499bc1a9356260409a5ba29 upstream.
throttle_direct_reclaim() is meant to trigger during swap-over-network
during which the min watermark is treated as a pfmemalloc reserve. It
throttles on the first node in the zonelist but this is flawed.
The user-visible impact is that a process running on a CPU whose local
memory node has no ZONE_NORMAL will stall for prolonged periods of time,
possibly indefinitely. This is due to throttle_direct_reclaim thinking the
pfmemalloc reserves are depleted when in fact they don't exist on that
node.
On a NUMA machine running a 32-bit kernel (I know) allocation requests
from CPUs on node 1 would detect no pfmemalloc reserves and the process
gets throttled. This patch adjusts throttling of direct reclaim to
throttle based on the first node in the zonelist that has a usable
ZONE_NORMAL or lower zone.
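The node selection described above can be illustrated with a small Python sketch. The zone constants are a simplified, assumed ordering and the zonelist is represented as plain (node, zone index) pairs; neither is the kernel's data structure, and the function name is hypothetical.

```python
# Simplified zone ordering, assumed for illustration only.
ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM = 0, 1, 2


def first_throttle_node(zonelist):
    """Return the first node in the zonelist with a ZONE_NORMAL or lower zone.

    zonelist -- list of (node_id, zone_index) pairs in allocation order.
    Returns None when no node has a usable zone, in which case there is
    nothing to throttle on.
    """
    for node_id, zone_index in zonelist:
        if zone_index <= ZONE_NORMAL:
            return node_id
    return None
```

On the 32-bit NUMA example, a zonelist for a CPU on node 1 such as [(1, ZONE_HIGHMEM), (0, ZONE_NORMAL), (0, ZONE_DMA)] selects node 0, whose pfmemalloc reserves actually exist.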
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When performing memory reclaim, support treating anonymous and
file backed pages equally.
Swapping anonymous pages out to memory can be efficient enough
to justify treating anonymous and file backed pages equally.
CRs-Fixed: 648984
Change-Id: I6315b8557020d1e27a34225bb9cefbef1fb43266
Signed-off-by: Liam Mark <lmark@codeaurora.org>
The following commits have been reverted from this merge, as they are
known to introduce new bugs and are currently incompatible with our
audio implementation. Investigation of these commits is ongoing, and
they are expected to be brought in at a later time:
86e6de7 ALSA: compress: fix drain calls blocking other compress functions (v6)
16442d4 ALSA: compress: fix drain calls blocking other compress functions
This merge commit also includes a change in block, necessary for
compilation. Upstream has modified elevator_init_fn to prevent race
conditions, requiring updates to row_init_queue and test_init_queue.
* commit 'v3.10.28': (1964 commits)
Linux 3.10.28
ARM: 7938/1: OMAP4/highbank: Flush L2 cache before disabling
drm/i915: Don't grab crtc mutexes in intel_modeset_gem_init()
serial: amba-pl011: use port lock to guard control register access
mm: Make {,set}page_address() static inline if WANT_PAGE_VIRTUAL
md/raid5: Fix possible confusion when multiple write errors occur.
md/raid10: fix two bugs in handling of known-bad-blocks.
md/raid10: fix bug when raid10 recovery fails to recover a block.
md: fix problem when adding device to read-only array with bitmap.
drm/i915: fix DDI PLLs HW state readout code
nilfs2: fix segctor bug that causes file system corruption
thp: fix copy_page_rep GPF by testing is_huge_zero_pmd once only
ftrace/x86: Load ftrace_ops in parameter not the variable holding it
SELinux: Fix possible NULL pointer dereference in selinux_inode_permission()
writeback: Fix data corruption on NFS
hwmon: (coretemp) Fix truncated name of alarm attributes
vfs: In d_path don't call d_dname on a mount point
staging: comedi: adl_pci9111: fix incorrect irq passed to request_irq()
staging: comedi: addi_apci_1032: fix subdevice type/flags bug
mm/memory-failure.c: recheck PageHuge() after hugetlb page migrate successfully
GFS2: Increase i_writecount during gfs2_setattr_chown
perf/x86/amd/ibs: Fix waking up from S3 for AMD family 10h
perf scripting perl: Fix build error on Fedora 12
ARM: 7815/1: kexec: offline non panic CPUs on Kdump panic
Linux 3.10.27
sched: Guarantee new group-entities always have weight
sched: Fix hrtimer_cancel()/rq->lock deadlock
sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining
sched: Fix race on toggling cfs_bandwidth_used
x86, fpu, amd: Clear exceptions in AMD FXSAVE workaround
netfilter: nf_nat: fix access to uninitialized buffer in IRC NAT helper
SCSI: sd: Reduce buffer size for vpd request
intel_pstate: Add X86_FEATURE_APERFMPERF to cpu match parameters.
mac80211: move "bufferable MMPDU" check to fix AP mode scan
ACPI / Battery: Add a _BIX quirk for NEC LZ750/LS
ACPI / TPM: fix memory leak when walking ACPI namespace
mfd: rtsx_pcr: Disable interrupts before cancelling delayed works
clk: exynos5250: fix sysmmu_mfc{l,r} gate clocks
clk: samsung: exynos5250: Add CLK_IGNORE_UNUSED flag for the sysreg clock
clk: samsung: exynos4: Correct SRC_MFC register
clk: clk-divider: fix divisor > 255 bug
ahci: add PCI ID for Marvell 88SE9170 SATA controller
parisc: Ensure full cache coherency for kmap/kunmap
drm/nouveau/bios: make jump conditional
ARM: shmobile: mackerel: Fix coherent DMA mask
ARM: shmobile: armadillo: Fix coherent DMA mask
ARM: shmobile: kzm9g: Fix coherent DMA mask
ARM: dts: exynos5250: Fix MDMA0 clock number
ARM: fix "bad mode in ... handler" message for undefined instructions
ARM: fix footbridge clockevent device
net: Loosen constraints for recalculating checksum in skb_segment()
bridge: use spin_lock_bh() in br_multicast_set_hash_max
netpoll: Fix missing TXQ unlock and and OOPS.
net: llc: fix use after free in llc_ui_recvmsg
virtio-net: fix refill races during restore
virtio_net: don't leak memory or block when too many frags
virtio-net: make all RX paths handle errors consistently
virtio_net: fix error handling for mergeable buffers
vlan: Fix header ops passthru when doing TX VLAN offload.
net: rose: restore old recvmsg behavior
rds: prevent dereference of a NULL device
ipv6: always set the new created dst's from in ip6_rt_copy
net: fec: fix potential use after free
hamradio/yam: fix info leak in ioctl
drivers/net/hamradio: Integer overflow in hdlcdrv_ioctl()
net: inet_diag: zero out uninitialized idiag_{src,dst} fields
ip_gre: fix msg_name parsing for recvfrom/recvmsg
net: unix: allow bind to fail on mutex lock
ipv6: fix illegal mac_header comparison on 32bit
netvsc: don't flush peers notifying work during setting mtu
tg3: Initialize REG_BASE_ADDR at PCI config offset 120 to 0
net: unix: allow set_peek_off to fail
net: drop_monitor: fix the value of maxattr
ipv6: don't count addrconf generated routes against gc limit
packet: fix send path when running with proto == 0
virtio: delete napi structures from netdev before releasing memory
macvtap: signal truncated packets
tun: update file current position
macvtap: update file current position
macvtap: Do not double-count received packets
rds: prevent BUG_ON triggered on congestion update to loopback
net: do not pretend FRAGLIST support
IPv6: Fixed support for blackhole and prohibit routes
HID: Revert "Revert "HID: Fix logitech-dj: missing Unifying device issue""
gpio-rcar: R-Car GPIO IRQ share interrupt
clocksource: em_sti: Set cpu_possible_mask to fix SMP broadcast
irqchip: renesas-irqc: Fix irqc_probe error handling
Linux 3.10.26
sh: add EXPORT_SYMBOL(min_low_pfn) and EXPORT_SYMBOL(max_low_pfn) to sh_ksyms_32.c
ext4: fix bigalloc regression
arm64: Use Normal NonCacheable memory for writecombine
arm64: Do not flush the D-cache for anonymous pages
arm64: Avoid cache flushing in flush_dcache_page()
ARM: KVM: arch_timers: zero CNTVOFF upon return to host
ARM: hyp: initialize CNTVOFF to zero
clocksource: arch_timer: use virtual counters
arm64: Remove unused cpu_name ascii in arch/arm64/mm/proc.S
arm64: dts: Reserve the memory used for secondary CPU release address
arm64: check for number of arguments in syscall_get/set_arguments()
arm64: fix possible invalid FPSIMD initialization state
...
Change-Id: Ia0e5d71b536ab49ec3a1179d59238c05bdd03106
Signed-off-by: Ian Maund <imaund@codeaurora.org>
commit a1c3bfb2f67ef766de03f1f56bdfff9c8595ab14 upstream.
The VM is currently heavily tuned to avoid swapping. Whether that is
good or bad is a separate discussion, but as long as the VM won't swap
to make room for dirty cache, we can not consider anonymous pages when
calculating the amount of dirtyable memory, the baseline to which
dirty_background_ratio and dirty_ratio are applied.
A simple workload that occupies a significant size (40+%, depending on
memory layout, storage speeds etc.) of memory with anon/tmpfs pages and
uses the remainder for a streaming writer demonstrates this problem. In
that case, the actual cache pages are a small fraction of what is
considered dirtyable overall, which results in a relatively large
portion of the cache pages being dirtied. As kswapd starts rotating
these, random tasks enter direct reclaim and stall on IO.
Only consider free pages and file pages dirtyable.
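The effect on the dirty limits can be illustrated with a small
userspace model. This is only a sketch: struct mem_counters and the
function names below are hypothetical and are not the kernel's code,
the old baseline is simplified to "free + file + anon", and reserves
are ignored.

```c
/* Hypothetical page counters for illustration; not kernel structures. */
struct mem_counters {
	unsigned long free_pages;
	unsigned long file_pages;	/* file-backed cache pages */
	unsigned long anon_pages;	/* anon/tmpfs pages */
};

/* Simplified old baseline: anon pages count as dirtyable. */
unsigned long dirtyable_old(const struct mem_counters *m)
{
	return m->free_pages + m->file_pages + m->anon_pages;
}

/* New baseline: only free pages and file pages are dirtyable. */
unsigned long dirtyable_new(const struct mem_counters *m)
{
	return m->free_pages + m->file_pages;
}

/* Apply a percentage such as dirty_ratio to a dirtyable baseline. */
unsigned long dirty_limit(unsigned long dirtyable, unsigned int ratio)
{
	return dirtyable * ratio / 100;
}
```

With 50 free, 150 file and 800 anon pages and a ratio of 20, the old
baseline allows 200 dirty pages, which is more than the 150 file pages
present, while the new baseline allows 40.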
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tejun Heo <tj@kernel.org>
Tested-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Move pages that fail swapout to the LRU active list to reduce
pressure on the swap device when swapping out is already failing.
This helps when using a pseudo swap device such as zram, which
starts failing when memory is low.
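A rough userspace model of the idea follows. It is a sketch under
stated assumptions, not the vmscan code: struct swap_model, struct
lru_model and shrink_inactive() are hypothetical names, and the swap
device is reduced to a counter of free slots.

```c
#include <stdbool.h>

/* Hypothetical swap device: swapout fails once the slots run out. */
struct swap_model {
	unsigned long free_slots;
};

/* Hypothetical LRU bookkeeping, counted in pages. */
struct lru_model {
	unsigned long active;
	unsigned long inactive;
	unsigned long reclaimed;
};

bool try_swapout(struct swap_model *dev)
{
	if (dev->free_slots == 0)
		return false;
	dev->free_slots--;
	return true;
}

/* Scan up to nr inactive pages; a failed swapout activates the page. */
void shrink_inactive(struct lru_model *lru, struct swap_model *dev,
		     unsigned long nr)
{
	unsigned long i;

	for (i = 0; i < nr && lru->inactive > 0; i++) {
		lru->inactive--;
		if (try_swapout(dev))
			lru->reclaimed++;
		else
			lru->active++;	/* off the inactive list: not retried at once */
	}
}
```

Because the failed pages leave the inactive list, the next scan does
not immediately send them to the failing device again.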
Change-Id: Ib136cd0a744378aa93d837a24b9143ee818c80b3
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
This patch is based on KOSAKI's work, with a little more description
added; please refer to https://lkml.org/lkml/2012/6/14/74.
Currently, the system can enter a state in which a zone has lots of free
pages, but only order-0 and order-1 ones, which means the zone is heavily
fragmented. A high-order allocation can then stall the direct reclaim
path for a long time (e.g. 60 seconds), especially in an environment with
no swap and no compaction. This problem happened on v3.4, but the issue
still seems to exist in the current tree. The reason is that
do_try_to_free_pages enters a livelock:
kswapd will go to sleep if the zones have been fully scanned and are still
not balanced, as kswapd thinks there's little point in trying all over
again and wants to avoid an infinite loop. Instead it changes order from
high-order to 0-order because kswapd thinks order-0 is the most important.
See 73ce02e9 for details. If watermarks are ok, kswapd will go back to
sleep and may leave zone->all_unreclaimable = 0. It assumes high-order
users can still perform direct reclaim if they wish.
Direct reclaim continues to reclaim for a high order that is not a
COSTLY_ORDER, without invoking the oom-killer, until kswapd turns on
zone->all_unreclaimable. This is to avoid a too-early oom-kill.
So direct reclaim depends on kswapd to break this loop.
In the worst case, direct reclaim may continue to reclaim pages forever
while kswapd sleeps forever, until something like a watchdog detects this
and finally kills the process. As described in:
http://thread.gmane.org/gmane.linux.kernel.mm/103737
We can't turn on zone->all_unreclaimable from the direct reclaim path
because the direct reclaim path doesn't take any lock, so doing it this
way is racy. Thus this patch removes the zone->all_unreclaimable field
completely and recalculates the zone's reclaimable state every time.
Note: we can't have direct reclaim look at zone->pages_scanned directly
while kswapd continues to use zone->all_unreclaimable, because that is
racy. Commit 929bea7c71 (vmscan: all_unreclaimable() use
zone->all_unreclaimable as a name) describes the details.
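Recalculating the state on every check, instead of caching it in a
flag, can be sketched in userspace as follows. struct zone_model is a
hypothetical stand-in for struct zone, and the factor of 6 is an
assumption about the heuristic, not something stated in this message.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant struct zone state. */
struct zone_model {
	unsigned long pages_scanned;		/* scanned since pages were last freed */
	unsigned long reclaimable_pages;	/* LRU pages eligible for reclaim */
};

/*
 * Derived from the current counters on every call, so there is no
 * cached flag that only kswapd is able to set. The factor 6 is an
 * assumed heuristic.
 */
bool zone_reclaimable(const struct zone_model *z)
{
	return z->pages_scanned < z->reclaimable_pages * 6;
}
```

Since the answer depends only on the counters, a direct reclaimer can
see that a zone has become unreclaimable even while kswapd is asleep.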
Change-Id: I28cffd677bc9c2d8521849b1a16e211ed24b6d3f
[akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
Cc: Ying Han <yinghan@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Neil Zhang <zhangwm@marvell.com>
Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Lisa Du <cldu@marvell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[lauraa@codeaurora.org: Minor context fixup in mm/vmscan.c]
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Git-commit: 6e543d5780e36ff5ee56c44d7e2e30db3457a7ed
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>