commit 5b398e416e880159fe55eefd93c6588fa072cd66 upstream.
I hit the following hung task when runing a OOM LTP test case with 4.1
kernel.
Call trace:
[<ffffffc000086a88>] __switch_to+0x74/0x8c
[<ffffffc000a1bae0>] __schedule+0x23c/0x7bc
[<ffffffc000a1c09c>] schedule+0x3c/0x94
[<ffffffc000a1eb84>] rwsem_down_write_failed+0x214/0x350
[<ffffffc000a1e32c>] down_write+0x64/0x80
[<ffffffc00021f794>] __ksm_exit+0x90/0x19c
[<ffffffc0000be650>] mmput+0x118/0x11c
[<ffffffc0000c3ec4>] do_exit+0x2dc/0xa74
[<ffffffc0000c46f8>] do_group_exit+0x4c/0xe4
[<ffffffc0000d0f34>] get_signal+0x444/0x5e0
[<ffffffc000089fcc>] do_signal+0x1d8/0x450
[<ffffffc00008a35c>] do_notify_resume+0x70/0x78
The oom victim cannot terminate because it needs to take mmap_sem for
write while the lock is held by ksmd for read which loops in the page
allocator
ksm_do_scan
scan_get_next_rmap_item
down_read
get_next_rmap_item
alloc_rmap_item #ksmd will loop permanently.
There is no way forward because the oom victim cannot release any memory
in 4.1 based kernel. Since 4.6 we have the oom reaper which would solve
this problem because it would release the memory asynchronously.
Nevertheless we can relax alloc_rmap_item requirements and use
__GFP_NORETRY because the allocation failure is acceptable as ksm_do_scan
would just retry later after the lock got dropped.
Such a patch would be also easy to backport to older stable kernels which
do not have oom_reaper.
While we are at it add GFP_NOWARN so the admin doesn't have to be alarmed
by the allocation failure.
Link: http://lkml.kernel.org/r/1474165570-44398-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Suggested-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJVoAOcAAoJEDjbvchgkmk+UhcP/1EOwnsJDcZ/sZkkclNgRmrJ
yLBCW65caLAI2E3SmIdKvHQwIx7lHzX5gmWRBrvx+fIl4KhaNKEQ0NCOf1ATaVuQ
MkYMdkicXWpLiFNdKokezryevGS8T1RME+2QlPFv3++Rby1Gy90YD5tu7YlIrEn7
sPRJQHEPCzVAQ7Lqhd66yHICM6/QvdefXj4pjh7vV8IMb2YwnY4vqYt7RxnJCUfP
tqljxrT274kzpA2awzALNh+o3B3/Y4W9ROmlDWviw3JBc9gEqFXYwbDf8KDwA5c0
sp9GPGed/dV5DFuqRcAHksJenFnE3E4gZjo/R5hluHQU27peBuRfXev2hZyBfZqG
796eUOky8fb0OiyxHfT2vhfGeD7CHI/asvIAORjDBVUqzJy9nkkby3XJ0U4tW+pz
VkcilD2oHw1uRIFH3JoBWTJ9W6CYSNFG1qxw+brgfKT5otJG/dBiI8kBABx+aTq7
V+A2cvf11oVwDEb93dnVypMGsfCywqzJUwEIRli9fTFjK7Fg9CBSGX38nwVGUaRv
M2/NeloTyWqUQE41Nd11gCu+hKQRtUU77nxpZcSeKn1XsbpO9/7dHTwcELRuKnTD
9XDksqPznXmC9KXGj7XMcRkLyWyB//JHjay0FCS6b4S6v7R5nrEIRjcpdB+H1WLd
zMOXRH4ZlcOAS/Yt2QMd
=8AB3
-----END PGP SIGNATURE-----
Merge upstream tag 'v3.10.84' into LA.BR.1.3.3
This merge brings us up-to-date as of upstream tag v3.10.84
* tag 'v3.10.84' (317 commits):
Linux 3.10.84
fs: Fix S_NOSEC handling
KVM: x86: make vapics_in_nmi_mode atomic
MIPS: Fix KVM guest fixmap address
x86/PCI: Use host bridge _CRS info on Foxconn K8M890-8237A
powerpc/perf: Fix book3s kernel to userspace backtraces
arm: KVM: force execution of HCPTR access on VM exit
Revert "crypto: talitos - convert to use be16_add_cpu()"
crypto: talitos - avoid memleak in talitos_alg_alloc()
sctp: Fix race between OOTB responce and route removal
packet: avoid out of bounds read in round robin fanout
packet: read num_members once in packet_rcv_fanout()
bridge: fix br_stp_set_bridge_priority race conditions
bridge: fix multicast router rlist endless loop
sparc: Use GFP_ATOMIC in ldc_alloc_exp_dring() as it can be called in softirq context
Linux 3.10.83
bus: mvebu: pass the coherency availability information at init time
KVM: nSVM: Check for NRIPS support before updating control field
ARM: clk-imx6q: refine sata's parent
d_walk() might skip too much
ipv6: update ip6_rt_last_gc every time GC is run
ipv6: prevent fib6_run_gc() contention
xfrm: Increase the garbage collector threshold
Btrfs: make xattr replace operations atomic
x86/microcode/intel: Guard against stack overflow in the loader
fs: take i_mutex during prepare_binprm for set[ug]id executables
hpsa: add missing pci_set_master in kdump path
hpsa: refine the pci enable/disable handling
sb_edac: Fix erroneous bytes->gigabytes conversion
ACPICA: Utilities: Cleanup to remove useless ACPI_PRINTF/FORMAT_xxx helpers.
ACPICA: Utilities: Cleanup to convert physical address printing formats.
__ptrace_may_access() should not deny sub-threads
include/linux/sched.h: don't use task->pid/tgid in same_thread_group/has_group_leader_pid
netfilter: Zero the tuple in nfnl_cthelper_parse_tuple()
netfilter: nfnetlink_cthelper: Remove 'const' and '&' to avoid warnings
config: Enable NEED_DMA_MAP_STATE by default when SWIOTLB is selected
get rid of s_files and files_lock
fput: turn "list_head delayed_fput_list" into llist_head
Linux 3.10.82
lpfc: Add iotag memory barrier
pipe: iovec: Fix memory corruption when retrying atomic copy as non-atomic
drm/mgag200: Reject non-character-cell-aligned mode widths
tracing: Have filter check for balanced ops
crypto: caam - fix RNG buffer cache alignment
Linux 3.10.81
btrfs: cleanup orphans while looking up default subvolume
btrfs: incorrect handling for fiemap_fill_next_extent return
cfg80211: wext: clear sinfo struct before calling driver
mm/memory_hotplug.c: set zone->wait_table to null after freeing it
drm/i915: Fix DDC probe for passive adapters
pata_octeon_cf: fix broken build
ozwpan: unchecked signed subtraction leads to DoS
ozwpan: divide-by-zero leading to panic
ozwpan: Use proper check to prevent heap overflow
MIPS: Fix enabling of DEBUG_STACKOVERFLOW
ring-buffer-benchmark: Fix the wrong sched_priority of producer
USB: serial: ftdi_sio: Add support for a Motion Tracker Development Board
USB: cp210x: add ID for HubZ dual ZigBee and Z-Wave dongle
block: fix ext_dev_lock lockdep report
Input: elantech - fix detection of touchpads where the revision matches a known rate
ALSA: usb-audio: add MAYA44 USB+ mixer control names
ALSA: usb-audio: Add mic volume fix quirk for Logitech Quickcam Fusion
ALSA: hda/realtek - Add a fixup for another Acer Aspire 9420
iio: adis16400: Compute the scan mask from channel indices
iio: adis16400: Use != channel indices for the two voltage channels
iio: adis16400: Report pressure channel scale
xen: netback: read hotplug script once at start of day.
udp: fix behavior of wrong checksums
net_sched: invoke ->attach() after setting dev->qdisc
unix/caif: sk_socket can disappear when state is unlocked
net: dp83640: fix broken calibration routine.
bridge: fix parsing of MLDv2 reports
ipv4: Avoid crashing in ip_error
net: phy: Allow EEE for all RGMII variants
Linux 3.10.80
fs/binfmt_elf.c:load_elf_binary(): return -EINVAL on zero-length mappings
vfs: read file_handle only once in handle_to_path
ACPI / init: Fix the ordering of acpi_reserve_resources()
Input: elantech - fix semi-mt protocol for v3 HW
rtlwifi: rtl8192cu: Fix kernel deadlock
md/raid5: don't record new size if resize_stripes fails.
svcrpc: fix potential GSSX_ACCEPT_SEC_CONTEXT decoding failures
ARM: fix missing syscall trace exit
ARM: dts: imx27: only map 4 Kbyte for fec registers
crypto: s390/ghash - Fix incorrect ghash icv buffer handling.
rt2x00: add new rt2800usb device DWA 130
libata: Ignore spurious PHY event on LPM policy change
libata: Add helper to determine when PHY events should be ignored
ext4: check for zero length extent explicitly
ext4: convert write_begin methods to stable_page_writes semantics
mmc: atmel-mci: fix bad variable type for clkdiv
powerpc: Align TOC to 256 bytes
usb: gadget: configfs: Fix interfaces array NULL-termination
usb-storage: Add NO_WP_DETECT quirk for Lacie 059f:0651 devices
USB: cp210x: add ID for KCF Technologies PRN device
USB: pl2303: Remove support for Samsung I330
USB: visor: Match I330 phone more precisely
xhci: gracefully handle xhci_irq dead device
xhci: Solve full event ring by increasing TRBS_PER_SEGMENT to 256
xhci: fix isoc endpoint dequeue from advancing too far on transaction error
target/pscsi: Don't leak scsi_host if hba is VIRTUAL_HOST
ASoC: wm8994: correct BCLK DIV 348 to 384
ASoC: wm8960: fix "RINPUT3" audio route error
ASoC: mc13783: Fix wrong mask value used in mc13xxx_reg_rmw() calls
ALSA: hda - Add headphone quirk for Lifebook E752
ALSA: hda - Add Conexant codecs CX20721, CX20722, CX20723 and CX20724
d_walk() might skip too much
lib: Fix strnlen_user() to not touch memory after specified maximum
hwmon: (ntc_thermistor) Ensure iio channel is of type IIO_VOLTAGE
libceph: request a new osdmap if lingering request maps to no osd
lguest: fix out-by-one error in address checking.
fs, omfs: add NULL terminator in the end up the token list
KVM: MMU: fix CR4.SMEP=1, CR0.WP=0 with shadow pages
net: socket: Fix the wrong returns for recvmsg and sendmsg
kernel: use the gnu89 standard explicitly
staging, rtl8192e, LLVMLinux: Remove unused inline prototype
staging: rtl8712, rtl8712: avoid lots of build warnings
staging, rtl8192e, LLVMLinux: Change extern inline to static inline
drm/i915: Fix declaration of intel_gmbus_{is_forced_bit/is_port_falid}
staging: wlags49_h2: fix extern inline functions
Linux 3.10.79
ACPICA: Utilities: Cleanup to enforce ACPI_PHYSADDR_TO_PTR()/ACPI_PTR_TO_PHYSADDR().
ACPICA: Tables: Change acpi_find_root_pointer() to use acpi_physical_address.
revert "softirq: Add support for triggering softirq work on softirqs"
sound/oss: fix deadlock in sequencer_ioctl(SNDCTL_SEQ_OUTOFBAND)
mmc: card: Don't access RPMB partitions for normal read/write
pinctrl: Don't just pretend to protect pinctrl_maps, do it for real
drm/i915: Add missing MacBook Pro models with dual channel LVDS
ARM: mvebu: armada-xp-openblocks-ax3-4: Disable internal RTC
ARM: dts: imx23-olinuxino: Fix dr_mode of usb0
ARM: dts: imx28: Fix AUART4 TX-DMA interrupt name
ARM: dts: imx25: Add #pwm-cells to pwm4
gpio: sysfs: fix memory leaks and device hotplug
gpio: unregister gpiochip device before removing it
xen/console: Update console event channel on resume
mm/memory-failure: call shake_page() when error hits thp tail page
nilfs2: fix sanity check of btree level in nilfs_btree_root_broken()
ocfs2: dlm: fix race between purge and get lock resource
Linux 3.10.78
ARC: signal handling robustify
UBI: fix soft lockup in ubi_check_volume()
Drivers: hv: vmbus: Don't wait after requesting offers
ARM: dts: dove: Fix uart[23] reg property
staging: panel: fix lcd type
usb: gadget: printer: enqueue printer's response for setup request
usb: host: oxu210hp: use new USB_RESUME_TIMEOUT
3w-sas: fix command completion race
3w-9xxx: fix command completion race
3w-xxxx: fix command completion race
ext4: fix data corruption caused by unwritten and delayed extents
rbd: end I/O the entire obj_request on error
serial: of-serial: Remove device_type = "serial" registration
ALSA: hda - Fix mute-LED fixed mode
ALSA: emu10k1: Emu10k2 32 bit DMA mode
ALSA: emu10k1: Fix card shortname string buffer overflow
ALSA: emux: Fix mutex deadlock in OSS emulation
ALSA: emux: Fix mutex deadlock at unloading
ipv4: Missing sk_nulls_node_init() in ping_unhash().
Linux 3.10.77
s390: Fix build error
nosave: consolidate __nosave_{begin,end} in <asm/sections.h>
memstick: mspro_block: add missing curly braces
C6x: time: Ensure consistency in __init
wl18xx: show rx_frames_per_rates as an array as it really is
lib: memzero_explicit: use barrier instead of OPTIMIZER_HIDE_VAR
e1000: add dummy allocator to fix race condition between mtu change and netpoll
ksoftirqd: Enable IRQs and call cond_resched() before poking RCU
RCU pathwalk breakage when running into a symlink overmounting something
drm/i915: cope with large i2c transfers
drm/radeon: fix doublescan modes (v2)
i2c: core: Export bus recovery functions
IB/mlx4: Fix WQE LSO segment calculation
IB/core: don't disallow registering region starting at 0x0
IB/core: disallow registering 0-sized memory region
stk1160: Make sure current buffer is released
mvsas: fix panic on expander attached SATA devices
Drivers: hv: vmbus: Fix a bug in the error path in vmbus_open()
xtensa: provide __NR_sync_file_range2 instead of __NR_sync_file_range
xtensa: xtfpga: fix hardware lockup caused by LCD driver
ACPICA: Utilities: split IO address types from data type models.
drivers: parport: Kconfig: exclude arm64 for PARPORT_PC
scsi: storvsc: Fix a bug in copy_from_bounce_buffer()
UBI: fix check for "too many bytes"
UBI: initialize LEB number variable
UBI: fix out of bounds write
UBI: account for bitflips in both the VID header and data
tools/power turbostat: Use $(CURDIR) instead of $(PWD) and add support for O= option in Makefile
powerpc/perf: Cap 64bit userspace backtraces to PERF_MAX_STACK_DEPTH
ext4: make fsync to sync parent dir in no-journal for real this time
arm64: kernel: compiling issue, need delete read_current_timer()
video: vgacon: Don't build on arm64
console: Disable VGA text console support on cris
drivers: parport: Kconfig: exclude h8300 for PARPORT_PC
parport: disable PC-style parallel port support on cris
rtlwifi: rtl8192cu: Add new device ID
rtlwifi: rtl8192cu: Add new USB ID
ptrace: fix race between ptrace_resume() and wait_task_stopped()
fs/binfmt_elf.c: fix bug in loading of PIE binaries
Input: elantech - fix absolute mode setting on some ASUS laptops
ALSA: emu10k1: don't deadlock in proc-functions
usb: core: hub: use new USB_RESUME_TIMEOUT
usb: host: sl811: use new USB_RESUME_TIMEOUT
usb: host: xhci: use new USB_RESUME_TIMEOUT
usb: host: isp116x: use new USB_RESUME_TIMEOUT
usb: host: r8a66597: use new USB_RESUME_TIMEOUT
usb: define a generic USB_RESUME_TIMEOUT macro
usb: phy: Find the right match in devm_usb_phy_match
ARM: S3C64XX: Use fixed IRQ bases to avoid conflicts on Cragganmore
ARM: 8320/1: fix integer overflow in ELF_ET_DYN_BASE
power_supply: lp8788-charger: Fix leaked power supply on probe fail
ring-buffer: Replace this_cpu_*() with __this_cpu_*()
spi: spidev: fix possible arithmetic overflow for multi-transfer message
cdc-wdm: fix endianness bug in debug statements
MIPS: Hibernate: flush TLB entries earlier
KVM: use slowpath for cross page cached accesses
s390/hibernate: fix save and restore of kernel text section
KVM: s390: Zero out current VMDB of STSI before including level3 data.
usb: gadget: composite: enable BESL support
Btrfs: fix inode eviction infinite loop after cloning into it
Btrfs: fix log tree corruption when fs mounted with -o discard
tcp: avoid looping in tcp_send_fin()
tcp: fix possible deadlock in tcp_send_fin()
ip_forward: Drop frames with attached skb->sk
Linux 3.10.76
dcache: Fix locking bugs in backported "deal with deadlock in d_walk()"
arc: mm: Fix build failure
sb_edac: avoid INTERNAL ERROR message in EDAC with unspecified channel
x86: mm: move mmap_sem unlock from mm_fault_error() to caller
vm: make stack guard page errors return VM_FAULT_SIGSEGV rather than SIGBUS
vm: add VM_FAULT_SIGSEGV handling support
deal with deadlock in d_walk()
move d_rcu from overlapping d_child to overlapping d_alias
kconfig: Fix warning "‘jump’ may be used uninitialized"
KVM: x86: SYSENTER emulation is broken
netfilter: conntrack: disable generic tracking for known protocols
Bluetooth: Ignore isochronous endpoints for Intel USB bootloader
Bluetooth: Add support for Intel bootloader devices
Bluetooth: btusb: Add IMC Networks (Broadcom based)
Bluetooth: Add firmware update for Atheros 0cf3:311f
Bluetooth: Enable Atheros 0cf3:311e for firmware upload
mm: Fix NULL pointer dereference in madvise(MADV_WILLNEED) support
splice: Apply generic position and size checks to each write
jfs: fix readdir regression
serial: 8250_dw: Fix deadlock in LCR workaround
benet: Call dev_kfree_skby_any instead of kfree_skb.
ixgb: Call dev_kfree_skby_any instead of dev_kfree_skb.
tg3: Call dev_kfree_skby_any instead of dev_kfree_skb.
bnx2: Call dev_kfree_skby_any instead of dev_kfree_skb.
r8169: Call dev_kfree_skby_any instead of dev_kfree_skb.
8139too: Call dev_kfree_skby_any instead of dev_kfree_skb.
8139cp: Call dev_kfree_skby_any instead of kfree_skb.
tcp: tcp_make_synack() should clear skb->tstamp
tcp: fix FRTO undo on cumulative ACK of SACKed range
ipv6: Don't reduce hop limit for an interface
tcp: prevent fetching dst twice in early demux code
remove extra definitions of U32_MAX
conditionally define U32_MAX
Linux 3.10.75
pagemap: do not leak physical addresses to non-privileged userspace
console: Fix console name size mismatch
IB/mlx4: Saturate RoCE port PMA counters in case of overflow
kernel.h: define u8, s8, u32, etc. limits
net: llc: use correct size for sysctl timeout entries
net: rds: use correct size for max unacked packets and bytes
ipc: fix compat msgrcv with negative msgtyp
core, nfqueue, openvswitch: fix compilation warning
media: s5p-mfc: fix mmap support for 64bit arch
iscsi target: fix oops when adding reject pdu
ocfs2: _really_ sync the right range
be2iscsi: Fix kernel panic when device initialization fails
cifs: fix use-after-free bug in find_writable_file
usb: xhci: apply XHCI_AVOID_BEI quirk to all Intel xHCI controllers
cpuidle: ACPI: do not overwrite name and description of C0
dmaengine: omap-dma: Fix memory leak when terminating running transfer
iio: imu: Use iio_trigger_get for indio_dev->trig assignment
iio: inv_mpu6050: Clear timestamps fifo while resetting hardware fifo
Defer processing of REQ_PREEMPT requests for blocked devices
USB: ftdi_sio: Use jtag quirk for SNAP Connect E10
USB: ftdi_sio: Added custom PID for Synapse Wireless product
radeon: Do not directly dereference pointers to BIOS area.
writeback: fix possible underflow in write bandwidth calculation
writeback: add missing INITIAL_JIFFIES init in global_update_bandwidth()
mm/memory hotplug: postpone the reset of obsolete pgdat
nbd: fix possible memory leak
iwlwifi: dvm: run INIT firmware again upon .start()
IB/uverbs: Prevent integer overflow in ib_umem_get address arithmetic
IB/core: Avoid leakage from kernel to user space
tcp: Fix crash in TCP Fast Open
selinux: fix sel_write_enforce broken return value
ALSA: hda - Fix headphone pin config for Lifebook T731
ALSA: usb - Creative USB X-Fi Pro SB1095 volume knob support
ALSA: hda - Add one more node in the EAPD supporting candidate list
Linux 3.10.74
net: ethernet: pcnet32: Setup the SRAM and NOUFLO on Am79C97{3, 5}
powerpc/mpc85xx: Add ranges to etsec2 nodes
hfsplus: fix B-tree corruption after insertion at position 0
dm: hold suspend_lock while suspending device during device deletion
vt6655: RFbSetPower fix missing rate RATE_12M
perf: Fix irq_work 'tail' recursion
Revert "iwlwifi: mvm: fix failure path when power_update fails in add_interface"
mac80211: drop unencrypted frames in mesh fwding
mac80211: disable u-APSD queues by default
nl80211: ignore HT/VHT capabilities without QoS/WMM
tcm_qla2xxx: Fix incorrect use of __transport_register_session
tcm_fc: missing curly braces in ft_invl_hw_context()
ASoC: wm8955: Fix wrong value references for boolean kctl
ASoC: adav80x: Fix wrong value references for boolean kctl
ASoC: ak4641: Fix wrong value references for boolean kctl
ASoC: wm8904: Fix wrong value references for boolean kctl
ASoC: wm8903: Fix wrong value references for boolean kctl
ASoC: wm2000: Fix wrong value references for boolean kctl
ASoC: wm8731: Fix wrong value references for boolean kctl
ASoC: tas5086: Fix wrong value references for boolean kctl
ASoC: wm8960: Fix wrong value references for boolean kctl
ASoC: cs4271: Fix wrong value references for boolean kctl
ASoC: sgtl5000: remove useless register write clearing CHRGPUMP_POWERUP
Change-Id: Ib7976ee2c7224e39074157e28db4158db40b00db
Signed-off-by: Kaushal Kumar <kaushalk@codeaurora.org>
commit 33692f27597fcab536d7cbbcc8f52905133e4aa7 upstream.
The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
"you should SIGSEGV" error, because the SIGSEGV case was generally
handled by the caller - usually the architecture fault handler.
That results in lots of duplication - all the architecture fault
handlers end up doing very similar "look up vma, check permissions, do
retries etc" - but it generally works. However, there are cases where
the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.
In particular, when accessing the stack guard page, libsigsegv expects a
SIGSEGV. And it usually got one, because the stack growth is handled by
that duplicated architecture fault handler.
However, when the generic VM layer started propagating the error return
from the stack expansion in commit fee7e49d4514 ("mm: propagate error
from stack expansion even for guard page"), that now exposed the
existing VM_FAULT_SIGBUS result to user space. And user space really
expected SIGSEGV, not SIGBUS.
To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
duplicate architecture fault handlers about it. They all already have
the code to handle SIGSEGV, so it's about just tying that new return
value to the existing code, but it's all a bit annoying.
This is the mindless minimal patch to do this. A more extensive patch
would be to try to gather up the mostly shared fault handling logic into
one generic helper routine, and long-term we really should do that
cleanup.
Just from this patch, you can generally see that most architectures just
copied (directly or indirectly) the old x86 way of doing things, but in
the meantime that original x86 model has been improved to hold the VM
semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
"newer" things, so it would be a good idea to bring all those
improvements to the generic case and teach other architectures about
them too.
Reported-and-tested-by: Takashi Iwai <tiwai@suse.de>
Tested-by: Jan Engelhardt <jengelh@inai.de>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
Cc: linux-arch@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[shengyong: Backport to 3.10
- adjust context
- ignore modification for arch nios2, because 3.10 does not support it
- ignore modification for driver lustre, because 3.10 does not support it
- ignore VM_FAULT_FALLBACK in VM_FAULT_ERROR, becase 3.10 does not support
this flag
- add SIGSEGV handling to powerpc/cell spu_fault.c, because 3.10 does not
separate it to copro_fault.c
- add SIGSEGV handling in mm/memory.c, because 3.10 does not separate it
to gup.c
]
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some pages could be shared by several processes. (ex, libc)
In case of that, it's too bad to reclaim them from the beginnig.
This patch causes VM to keep them on memory until last task
try to reclaim them so shared pages will be reclaimed only if
all of task has gone swapping out.
This feature doesn't handle non-linear mapping on ramfs because
it's very time-consuming and doesn't make sure of reclaiming and
not common.
Change-Id: I7e5f34f2e947f5db6d405867fe2ad34863ca40f7
Signed-off-by: Sangseok Lee <sangseok.lee@lge.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Patch-mainline: linux-mm @ 9 May 2013 16:21:27
[vinmenon@codeaurora.org: trivial merge conflict fixes]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
KSM is yet another framework which may obfuscate some memory
problems. Use the showmem notifier to show how KSM is being
used to give some insight into potential issues or non-issues.
Change-Id: If82405dc33f212d085e6847f7c511fd4d0a32a10
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
commit 668f9abbd4334e6c29fa8acd71635c4f9101caa7 upstream.
Commit bf6bddf192 ("mm: introduce compaction and migration for
ballooned pages") introduces page_count(page) into memory compaction
which dereferences page->first_page if PageTail(page).
This results in a very rare NULL pointer dereference on the
aforementioned page_count(page). Indeed, anything that does
compound_head(), including page_count() is susceptible to racing with
prep_compound_page() and seeing a NULL or dangling page->first_page
pointer.
This patch uses Andrea's implementation of compound_trans_head() that
deals with such a race and makes it the default compound_head()
implementation. This includes a read memory barrier that ensures that
if PageTail(head) is true that we return a head page that is neither
NULL nor dangling. The patch then adds a store memory barrier to
prep_compound_page() to ensure page->first_page is set.
This is the safest way to ensure we see the head page that we are
expecting, PageTail(page) is already in the unlikely() path and the
memory barriers are unfortunately required.
Hugetlbfs is the exception, we don't enforce a store memory barrier
during init since no race is possible.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
KSM thread to scan pages is getting schedule on definite timeout.
That wakes up CPU from idle state and hence may affect the power
consumption. Provide an optional support to use deferred timer
which suites low-power use-cases.
To enable deferred timers,
$ echo 1 > /sys/kernel/mm/ksm/deferred_timer
Change-Id: I07fe199f97fe1f72f9a9e1b0b757a3ac533719e8
Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
A CONFIG_DISCONTIGMEM=y m68k config gave
mm/ksm.c: In function `get_kpfn_nid':
mm/ksm.c:492: error: implicit declaration of function `pfn_to_nid'
linux/mmzone.h declares it for CONFIG_SPARSEMEM and CONFIG_FLATMEM, but
expects the arch's asm/mmzone.h to declare it for CONFIG_DISCONTIGMEM
(see arch/mips/include/asm/mmzone.h for example).
Or perhaps it is only expected when CONFIG_NUMA=y: too much of a maze,
and m68k got away without it so far, so fix the build in mm/ksm.c.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Petr Holasek <pholasek@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is a pity to have MAX_NUMNODES+MAX_NUMNODES tree roots statically
allocated, particularly when very few users will ever actually tune
merge_across_nodes 0 to use more than 1+1 of those trees. Not a big
deal (only 16kB wasted on each machine with CONFIG_MAXSMP), but a pity.
Start off with 1+1 statically allocated, then if merge_across_nodes is
ever tuned, allocate for nr_node_ids+nr_node_ids. Do not attempt to
free up the extra if it's tuned back, that would be a waste of effort.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In "ksm: remove old stable nodes more thoroughly" I said that I'd never
seen its WARN_ON_ONCE(page_mapped(page)). True at the time of writing,
but it soon appeared once I tried fuller tests on the whole series.
It turned out to be due to the KSM page migration itself: unmerge_and_
remove_all_rmap_items() failed to locate and replace all the KSM pages,
because of that hiatus in page migration when old pte has been replaced
by migration entry, but not yet by new pte. follow_page() finds no page
at that instant, but a KSM page reappears shortly after, without a
fault.
Add FOLL_MIGRATION flag, so follow_page() can do migration_entry_wait()
for KSM's break_cow(). I'd have preferred to avoid another flag, and do
it every time, in case someone else makes the same easy mistake; but did
not find another transgressor (the common get_user_pages() is of course
safe), and cannot be sure that every follow_page() caller is prepared to
sleep - ia64's xencomm_vtop()? Now, THP's wait_split_huge_page() can
already sleep there, since anon_vma locking was changed to mutex, but
maybe that's somehow excluded.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Think of struct rmap_item as an extension of struct page (restricted to
MADV_MERGEABLE areas): there may be a lot of them, we need to keep them
small, especially on 32-bit architectures of limited lowmem.
Siting "int nid" after "unsigned int checksum" works nicely on 64-bit,
making no change to its 64-byte struct rmap_item; but bloats the 32-bit
struct rmap_item from (nicely cache-aligned) 32 bytes to 36 bytes, which
rounds up to 40 bytes once allocated from slab. We'd better avoid that.
Hey, I only just remembered that the anon_vma pointer in struct
rmap_item has no purpose until the rmap_item is hung from a stable tree
node (which has its own nid field); and rmap_item's nid field no purpose
than to say which tree root to tell rb_erase() when unlinking from an
unstable tree.
Double them up in a union. There's just one place where we set anon_vma
early (when we already hold mmap_sem): now we must remove tree_rmap_item
from its unstable tree there, before overwriting nid. No need to
spatter BUG()s around: we'd be seeing oopses if this were wrong.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An inconsistency emerged in reviewing the NUMA node changes to KSM: when
meeting a page from the wrong NUMA node in a stable tree, we say that
it's okay for comparisons, but not as a leaf for merging; whereas when
meeting a page from the wrong NUMA node in an unstable tree, we bail out
immediately.
Now, it might be that a wrong NUMA node in an unstable tree is more
likely to correlate with instablility (different content, with rbnode
now misplaced) than page migration; but even so, we are accustomed to
instablility in the unstable tree.
Without strong evidence for which strategy is generally better, I'd
rather be consistent with what's done in the stable tree: accept a page
from the wrong NUMA node for comparison, but not as a leaf for merging.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Added slightly more detail to the Documentation of merge_across_nodes, a
few comments in areas indicated by review, and renamed get_ksm_page()'s
argument from "locked" to "lock_it". No functional change.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Complaints are rare, but lockdep still does not understand the way
ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and holds
it until the ksm_memory_callback(MEM_OFFLINE): that appears to be a
problem because notifier callbacks are made under down_read of
blocking_notifier_head->rwsem (so first the mutex is taken while holding
the rwsem, then later the rwsem is taken while still holding the mutex);
but is not in fact a problem because mem_hotplug_mutex is held
throughout the dance.
There was an attempt to fix this with mutex_lock_nested(); but if that
happened to fool lockdep two years ago, apparently it does so no longer.
I had hoped to eradicate this issue in extending KSM page migration not
to need the ksm_thread_mutex. But then realized that although the page
migration itself is safe, we do still need to lock out ksmd and other
users of get_ksm_page() while offlining memory - at some point between
MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages themselves may
vanish, and get_ksm_page()'s accesses to them become a violation.
So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE to
MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and wait_while_offlining()
checks, to achieve the same lockout without being caught by lockdep.
This is less elegant for KSM, but it's more important to keep lockdep
useful to other users - and I apologize for how long it took to fix.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The new KSM NUMA merge_across_nodes knob introduces a problem, when it's
set to non-default 0: if a KSM page is migrated to a different NUMA node,
how do we migrate its stable node to the right tree? And what if that
collides with an existing stable node?
ksm_migrate_page() can do no more than it's already doing, updating
stable_node->kpfn: the stable tree itself cannot be manipulated without
holding ksm_thread_mutex. So accept that a stable tree may temporarily
indicate a page belonging to the wrong NUMA node, leave updating until the
next pass of ksmd, just be careful not to merge other pages on to a
misplaced page. Note nid of holding tree in stable_node, and recognize
that it will not always match nid of kpfn.
A misplaced KSM page is discovered, either when ksm_do_scan() next comes
around to one of its rmap_items (we now have to go to cmp_and_merge_page
even on pages in a stable tree), or when stable_tree_search() arrives at a
matching node for another page, and this node page is found misplaced.
In each case, move the misplaced stable_node to a list of migrate_nodes
(and use the address of migrate_nodes as magic by which to identify them):
we don't need them in a tree. If stable_tree_search() finds no match for
a page, but it's currently exiled to this list, then slot its stable_node
right there into the tree, bringing all of its mappings with it; otherwise
they get migrated one by one to the original page of the colliding node.
stable_tree_search() is now modelled more like stable_tree_insert(), in
order to handle these insertions of migrated nodes.
remove_node_from_stable_tree(), remove_all_stable_nodes() and
ksm_check_stable_tree() have to handle the migrate_nodes list as well as
the stable tree itself. Less obviously, we do need to prune the list of
stale entries from time to time (scan_get_next_rmap_item() does it once
each full scan): whereas stale nodes in the stable tree get naturally
pruned as searches try to brush past them, these migrate_nodes may get
forgotten and accumulate.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KSM page migration is already supported in the case of memory hotremove,
which takes the ksm_thread_mutex across all its migrations to keep life
simple.
But the new KSM NUMA merge_across_nodes knob introduces a problem, when
it's set to non-default 0: if a KSM page is migrated to a different NUMA
node, how do we migrate its stable node to the right tree? And what if
that collides with an existing stable node?
So far there's no provision for that, and this patch does not attempt to
deal with it either. But how will I test a solution, when I don't know
how to hotremove memory? The best answer is to enable KSM page migration
in all cases now, and test more common cases. With THP and compaction
added since KSM came in, page migration is now mainstream, and it's a
shame that a KSM page can frustrate freeing a page block.
Without worrying about merge_across_nodes 0 for now, this patch gets KSM
page migration working reliably for default merge_across_nodes 1 (but
leave the patch enabling it until near the end of the series).
It's much simpler than I'd originally imagined, and does not require an
additional tier of locking: page migration relies on the page lock, KSM
page reclaim relies on the page lock, the page lock is enough for KSM page
migration too.
Almost all the care has to be in get_ksm_page(): that's the function which
worries about when a stable node is stale and should be freed, now it also
has to worry about the KSM page being migrated.
The only new overhead is an additional put/get/lock/unlock_page when
stable_tree_search() arrives at a matching node: to make sure migration
respects the raised page count, and so does not migrate the page while
we're busy with it here. That's probably avoidable, either by changing
internal interfaces from using kpage to stable_node, or by moving the
ksm_migrate_page() callsite into a page_freeze_refs() section (even if not
swapcache); but this works well, I've no urge to pull it apart now.
(Descents of the stable tree may pass through nodes whose KSM pages are
under migration: being unlocked, the raised page count does not prevent
that, nor need it: it's safe to memcmp against either old or new page.)
You might worry about mremap, and whether page migration's rmap_walk to
remove migration entries will find all the KSM locations where it inserted
earlier: that should already be handled, by the satisfyingly heavy hammer
of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,).
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Switching merge_across_nodes after running KSM is liable to oops on stale
nodes still left over from the previous stable tree. It's not something
that people will often want to do, but it would be lame to demand a reboot
when they're trying to determine which merge_across_nodes setting is best.
How can this happen? We only permit switching merge_across_nodes when
pages_shared is 0, and usually set run 2 to force that beforehand, which
ought to unmerge everything: yet oopses still occur when you then run 1.
Three causes:
1. The old stable tree (built according to the inverse
merge_across_nodes) has not been fully torn down. A stable node
lingers until get_ksm_page() notices that the page it references no
longer references it: but the page is not necessarily freed as soon as
expected, particularly when swapcache.
Fix this with a pass through the old stable tree, applying
get_ksm_page() to each of the remaining nodes (most found stale and
removed immediately), with forced removal of any left over. Unless the
page is still mapped: I've not seen that case, it shouldn't occur, but
better to WARN_ON_ONCE and EBUSY than BUG.
2. __ksm_enter() has a nice little optimization, to insert the new mm
just behind ksmd's cursor, so there's a full pass for it to stabilize
(or be removed) before ksmd addresses it. Nice when ksmd is running,
but not so nice when we're trying to unmerge all mms: we were missing
those mms forked and inserted behind the unmerge cursor. Easily fixed
by inserting at the end when KSM_RUN_UNMERGE.
3. It is possible for a KSM page to be faulted back from swapcache
into an mm, just after unmerge_and_remove_all_rmap_items() scanned past
it. Fix this by copying on fault when KSM_RUN_UNMERGE: but that is
private to ksm.c, so dissolve the distinction between
ksm_might_need_to_copy() and ksm_does_need_to_copy(), doing it all in
the one call into ksm.c.
A long outstanding, unrelated bugfix sneaks in with that third fix:
ksm_does_need_to_copy() would copy from a !PageUptodate page (implying I/O
error when read in from swap) to a page which it then marks Uptodate. Fix
this case by not copying, letting do_swap_page() discover the error.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In some places where get_ksm_page() is used, we need the page to be locked.
When KSM migration is fully enabled, we shall want that to make sure that
the page just acquired cannot be migrated beneath us (raised page count is
only effective when there is serialization to make sure migration
notices). Whereas when navigating through the stable tree, we certainly
do not want to lock each node (raised page count is enough to guarantee
the memcmps, even if page is migrated to another node).
Since we're about to add another use case, add the locked argument to
get_ksm_page() now.
Hmm, what's that rcu_read_lock() about? Complete misunderstanding, I
really got the wrong end of the stick on that! There's a configuration in
which page_cache_get_speculative() can do something cheaper than
get_page_unless_zero(), relying on its caller's rcu_read_lock() to have
disabled preemption for it. There's no need for rcu_read_lock() around
get_page_unless_zero() (and mapping checks) here. Cut out that silliness
before making this any harder to understand.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memory hotremove's ksm_check_stable_tree() is pitifully inefficient
(restarting whenever it finds a stale node to remove), but rearrange so
that at least it does not needlessly restart from nid 0 each time. And
add a couple of comments: here is why we keep pfn instead of page.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add NUMA() and DO_NUMA() macros to minimize blight of #ifdef
CONFIG_NUMAs (but indeed we don't want to expand struct rmap_item by nid
when not NUMA). Add comment, remove "unsigned" from rmap_item->nid, as
"int nid" elsewhere. Define ksm_merge_across_nodes 1U when #ifndef NUMA
to help optimizing out. Use ?: in get_kpfn_nid(). Adjust a few
comments noticed in ongoing work.
Leave stable_tree_insert()'s rb_linkage until after the node has been
set up, as unstable_tree_search_insert() does: ksm_thread_mutex and page
lock make either way safe, but we're going to copy and I prefer this
precedent.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Here's a KSM series, based on mmotm 2013-01-23-17-04: starting with
Petr's v7 "KSM: numa awareness sysfs knob"; then fixing the two issues
we had with that, fully enabling KSM page migration on the way.
(A different kind of KSM/NUMA issue which I've certainly not begun to
address here: when KSM pages are unmerged, there's usually no sense in
preferring to allocate the new pages local to the caller's node.)
This patch:
Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
which control merging pages across different numa nodes. When it is set
to zero only pages from the same node are merged, otherwise pages from
all nodes can be merged together (default behavior).
Typical use-case could be a lot of KVM guests on NUMA machine and cpus
from more distant nodes would have significant increase of access
latency to the merged ksm page. Sysfs knob was choosen for higher
variability when some users still prefers higher amount of saved
physical memory regardless of access latency.
Every numa node has its own stable & unstable trees because of faster
searching and inserting. Changing of merge_across_nodes value is
possible only when there are not any ksm shared pages in system.
I've tested this patch on numa machines with 2, 4 and 8 nodes and
measured speed of memory access inside of KVM guests with memory pinned
to one of nodes with this benchmark:
http://pholasek.fedorapeople.org/alloc_pg.c
Population standard deviations of access times in percentage of average
were following:
merge_across_nodes=1
2 nodes 1.4%
4 nodes 1.6%
8 nodes 1.7%
merge_across_nodes=0
2 nodes 1%
4 nodes 0.32%
8 nodes 0.018%
RFC: https://lkml.org/lkml/2011/11/30/91
v1: https://lkml.org/lkml/2012/1/23/46
v2: https://lkml.org/lkml/2012/6/29/105
v3: https://lkml.org/lkml/2012/9/14/550
v4: https://lkml.org/lkml/2012/9/23/137
v5: https://lkml.org/lkml/2012/12/10/540
v6: https://lkml.org/lkml/2012/12/23/154
v7: https://lkml.org/lkml/2012/12/27/225
Hugh notes that this patch brings two problems, whose solution needs
further support in mm/ksm.c, which follows in subsequent patches:
1) switching merge_across_nodes after running KSM is liable to oops
on stale nodes still left over from the previous stable tree;
2) memory hotremove may migrate KSM pages, but there is no provision
here for !merge_across_nodes to migrate nodes to the proper tree.
Signed-off-by: Petr Holasek <pholasek@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Switch ksm to use the new hashtable implementation. This reduces the
amount of generic unrelated code in the ksm module.
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When ex-KSM pages are faulted from swap cache, the fault handler is not
capable of re-establishing anon_vma-spanning KSM pages. In this case, a
copy of the page is created instead, just like during a COW break.
These freshly made copies are known to be exclusive to the faulting VMA
and there is no reason to go look for this page in parent and sibling
processes during rmap operations.
Use page_add_new_anon_rmap() for these copies. This also puts them on
the proper LRU lists and marks them SwapBacked, so we can get rid of
doing this ad-hoc in the KSM copy code.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The rmap walks in ksm.c are like those in rmap.c: they can safely be
done with anon_vma_lock_read().
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIcBAABAgAGBQJQx0kQAAoJEHzG/DNEskfi4fQP/R5PRovayroZALBMLnVJDaLD
Ttr9p40VNXbiJ+MfRgatJjSSJZ4Jl+fC3NEqBhcwVZhckZZb9R2s0WtrSQo5+ZbB
vdRfiuKoCaKM4cSZ08C12uTvsF6xjhjd27CTUlMkyOcDoKxMEFKelv0hocSxe4Wo
xqlv3eF+VsY7kE1BNbgBP06SX4tDpIHRxXfqJPMHaSKQmre+cU0xG2GcEu3QGbHT
DEDTI788YSaWLmBfMC+kWoaQl1+bV/FYvavIAS8/o4K9IKvgR42VzrXmaFaqrbgb
72ksa6xfAi57yTmZHqyGmts06qYeBbPpKI+yIhCMInxA9CY3lPbvHppRf0RQOyzj
YOi4hovGEMJKE+BCILukhJcZ9jCTtS3zut6v1rdvR88f4y7uhR9RfmRfsxuW7PNj
3Rmh191+n0lVWDmhOs2psXuCLJr3LEiA0dFffN1z8REUTtTAZMsj8Rz+SvBNAZDR
hsJhERVeXB6X5uQ5rkLDzbn1Zic60LjVw7LIp6SF2OYf/YKaF8vhyWOA8dyCEu8W
CGo7AoG0BO8tIIr8+LvFe8CweypysZImx4AjCfIs4u9pu/v11zmBvO9NO5yfuObF
BreEERYgTes/UITxn1qdIW4/q+Nr0iKO3CTqsmu6L1GfCz3/XzPGs3U26fUhllqi
Ka0JKgnWvsa6ez6FSzKI
=ivQa
-----END PGP SIGNATURE-----
Merge tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma
Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
"There are three implementations for NUMA balancing, this tree
(balancenuma), numacore which has been developed in tip/master and
autonuma which is in aa.git.
In almost all respects balancenuma is the dumbest of the three because
its main impact is on the VM side with no attempt to be smart about
scheduling. In the interest of getting the ball rolling, it would be
desirable to see this much merged for 3.8 with the view to building
scheduler smarts on top and adapting the VM where required for 3.9.
The most recent set of comparisons available from different people are
mel: https://lkml.org/lkml/2012/12/9/108
mingo: https://lkml.org/lkml/2012/12/7/331
tglx: https://lkml.org/lkml/2012/12/10/437
srikar: https://lkml.org/lkml/2012/12/10/397
The results are a mixed bag. In my own tests, balancenuma does
reasonably well. It's dumb as rocks and does not regress against
mainline. On the other hand, Ingo's tests shows that balancenuma is
incapable of converging for this workloads driven by perf which is bad
but is potentially explained by the lack of scheduler smarts. Thomas'
results show balancenuma improves on mainline but falls far short of
numacore or autonuma. Srikar's results indicate we all suffer on a
large machine with imbalanced node sizes.
My own testing showed that recent numacore results have improved
dramatically, particularly in the last week but not universally.
We've butted heads heavily on system CPU usage and high levels of
migration even when it shows that overall performance is better.
There are also cases where it regresses. Of interest is that for
specjbb in some configurations it will regress for lower numbers of
warehouses and show gains for higher numbers which is not reported by
the tool by default and sometimes missed in treports. Recently I
reported for numacore that the JVM was crashing with
NullPointerExceptions but currently it's unclear what the source of
this problem is. Initially I thought it was in how numacore batch
handles PTEs but I'm no longer think this is the case. It's possible
numacore is just able to trigger it due to higher rates of migration.
These reports were quite late in the cycle so I/we would like to start
with this tree as it contains much of the code we can agree on and has
not changed significantly over the last 2-3 weeks."
* tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
mm/rmap: Convert the struct anon_vma::mutex to an rwsem
mm: migrate: Account a transhuge page properly when rate limiting
mm: numa: Account for failed allocations and isolations as migration failures
mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
mm: numa: Add THP migration for the NUMA working set scanning fault case.
mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
mm: sched: numa: Control enabling and disabling of NUMA balancing
mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
mm: numa: migrate: Set last_nid on newly allocated page
mm: numa: split_huge_page: Transfer last_nid on tail page
mm: numa: Introduce last_nid to the page frame
sched: numa: Slowly increase the scanning period as NUMA faults are handled
mm: numa: Rate limit setting of pte_numa if node is saturated
mm: numa: Rate limit the amount of memory that is migrated between nodes
mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
mm: numa: Migrate pages handled during a pmd_numa hinting fault
mm: numa: Migrate on reference policy
...
test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to
specify that current should be killed first if an oom condition occurs in
between the two calls.
The usage is
short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);
...
compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj);
to store the thread's oom_score_adj, temporarily change it to the maximum
score possible, and then restore the old value if it is still the same.
This happens to still be racy, however, if the user writes
OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls.
The compare_swap_oom_score_adj() will then incorrectly reset the old value
prior to the write of OOM_SCORE_ADJ_MAX.
To fix this, introduce a new oom_flags_t member in struct signal_struct
that will be used for per-thread oom killer flags. KSM and swapoff can
now use a bit in this member to specify that threads should be killed
first in oom conditions without playing around with oom_score_adj.
This also allows the correct oom_score_adj to always be shown when reading
/proc/pid/oom_score.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000,
so this range can be represented by the signed short type with no
functional change. The extra space this frees up in struct signal_struct
will be used for per-thread oom kill flags in the next patch.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Several place need to find the pmd by(mm_struct, address), so introduce a
function to simplify it.
[akpm@linux-foundation.org: fix warning]
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Ni zhan Chen <nizhan.chen@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rmap_walk_anon() and try_to_unmap_anon() appears to be too
careful about locking the anon vma: while it needs protection
against anon vma list modifications, it does not need exclusive
access to the list itself.
Transforming this exclusive lock to a read-locked rwsem removes
a global lock from the hot path of page-migration intense
threaded workloads which can cause pathological performance like
this:
96.43% process 0 [kernel.kallsyms] [k] perf_trace_sched_switch
|
--- perf_trace_sched_switch
__schedule
schedule
schedule_preempt_disabled
__mutex_lock_common.isra.6
__mutex_lock_slowpath
mutex_lock
|
|--50.61%-- rmap_walk
| move_to_new_page
| migrate_pages
| migrate_misplaced_page
| __do_numa_page.isra.69
| handle_pte_fault
| handle_mm_fault
| __do_page_fault
| do_page_fault
| page_fault
| __memset_sse2
| |
| --100.00%-- worker_thread
| |
| --100.00%-- start_thread
|
--49.39%-- page_lock_anon_vma
try_to_unmap_anon
try_to_unmap
migrate_pages
migrate_misplaced_page
__do_numa_page.isra.69
handle_pte_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault
__memset_sse2
|
--100.00%-- worker_thread
start_thread
With this change applied the profile is now nicely flat
and there's no anon-vma related scheduling/blocking.
Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
to make it clearer that it's an exclusive write-lock in
that case - suggested by Rik van Riel.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
In order to allow sleeping during invalidate_page mmu notifier calls, we
need to avoid calling when holding the PT lock. In addition to its direct
calls, invalidate_page can also be called as a substitute for a change_pte
call, in case the notifier client hasn't implemented change_pte.
This patch drops the invalidate_page call from change_pte, and instead
wraps all calls to change_pte with invalidate_range_start and
invalidate_range_end calls.
Note that change_pte still cannot sleep after this patch, and that clients
implementing change_pte should not take action on it in case the number of
outstanding invalidate_range_start calls is larger than one, otherwise
they might miss a later invalidation.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Cc: Andrea Arcangeli <andrea@qumranet.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Shachar Raindel <raindel@mellanox.com>
Cc: Liran Liss <liranl@mellanox.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page_evictable(page, vma) is an irritant: almost all its callers pass
NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page)
explicitly in the couple of places it's needed. But in those places we
don't even need page_evictable() itself! They're dealing with a freshly
allocated anonymous page, which has no "mapping" and cannot be mlocked yet.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a large VMA (anon or private file mapping) is first touched, which
will populate its anon_vma field, and then split into many regions through
the use of mprotect(), the original anon_vma ends up linking all of the
vmas on a linked list. This can cause rmap to become inefficient, as we
have to walk potentially thousands of irrelevent vmas before finding the
one a given anon page might fall into.
By replacing the same_anon_vma linked list with an interval tree (where
each avc's interval is determined by its vma's start and last pgoffs), we
can make rmap efficient for this use case again.
While the change is large, all of its pieces are fairly simple.
Most places that were walking the same_anon_vma list were looking for a
known pgoff, so they can just use the anon_vma_interval_tree_foreach()
interval tree iterator instead. The exception here is ksm, where the
page's index is not known. It would probably be possible to rework ksm so
that the index would be known, but for now I have decided to keep things
simple and just walk the entirety of the interval tree there.
When updating vma's that already have an anon_vma assigned, we must take
care to re-index the corresponding avc's on their interval tree. This is
done through the use of anon_vma_interval_tree_pre_update_vma() and
anon_vma_interval_tree_post_update_vma(), which remove the avc's from
their interval tree before the update and re-insert them after the update.
The anon_vma stays locked during the update, so there is no chance that
rmap would miss the vmas that are being updated.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Daniel Santos <daniel.santos@pobox.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
currently it lost original meaning but still has some effects:
| effect | alternative flags
-+------------------------+---------------------------------------------
1| account as reserved_vm | VM_IO
2| skip in core dump | VM_IO, VM_DONTDUMP
3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
4| do not mlock | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
This patch removes reserved_vm counter from mm_struct. Seems like nobody
cares about it, it does not exported into userspace directly, it only
reduces total_vm showed in proc.
Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.
remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.
[akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge VM_INSERTPAGE into VM_MIXEDMAP. VM_MIXEDMAP VMA can mix pure-pfn
ptes, special ptes and normal ptes.
Now copy_page_range() always copies VM_MIXEDMAP VMA on fork like
VM_PFNMAP. If driver populates whole VMA at mmap() it probably not
expects page-faults.
This patch removes special check from vma_wants_writenotify() which
disables pages write tracking for VMA populated via vm_instert_page().
BDI below mapped file should not use dirty-accounting, moreover
do_wp_page() can handle this.
vm_insert_page() still marks vma after first usage. Usually it is called
from f_op->mmap() handler under mm->mmap_sem write-lock, so it able to
change vma->vm_flags. Caller must set VM_MIXEDMAP at mmap time if it
wants to call this function from other places, for example from page-fault
handler.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are multiple places which perform the same check. Add a new
find_mergeable_vma() to handle this.
Signed-off-by: Bob Liu <lliubbo@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When moving tasks from old memcg (with move_charge_at_immigrate on new
memcg), followed by removal of old memcg, hit General Protection Fault in
mem_cgroup_lru_del_list() (called from release_pages called from
free_pages_and_swap_cache from tlb_flush_mmu from tlb_finish_mmu from
exit_mmap from mmput from exit_mm from do_exit).
Somewhat reproducible, takes a few hours: the old struct mem_cgroup has
been freed and poisoned by SLAB_DEBUG, but mem_cgroup_lru_del_list() is
still trying to update its stats, and take page off lru before freeing.
A task, or a charge, or a page on lru: each secures a memcg against
removal. In this case, the last task has been moved out of the old memcg,
and it is exiting: anonymous pages are uncharged one by one from the
memcg, as they are zapped from its pagetables, so the charge gets down to
0; but the pages themselves are queued in an mmu_gather for freeing.
Most of those pages will be on lru (and force_empty is careful to
lru_add_drain_all, to add pages from pagevec to lru first), but not
necessarily all: perhaps some have been isolated for page reclaim, perhaps
some isolated for other reasons. So, force_empty may find no task, no
charge and no page on lru, and let the removal proceed.
There would still be no problem if these pages were immediately freed; but
typically (and the put_page_testzero protocol demands it) they have to be
added back to lru before they are found freeable, then removed from lru
and freed. We don't see the issue when adding, because the
mem_cgroup_iter() loops keep their own reference to the memcg being
scanned; but when it comes to mem_cgroup_lru_del_list().
I believe this was not an issue in v3.2: there, PageCgroupAcctLRU and
PageCgroupUsed flags were used (like a trick with mirrors) to deflect view
of pc->mem_cgroup to the stable root_mem_cgroup when neither set.
38c5d72f3e ("memcg: simplify LRU handling by new rule") mercifully
removed those convolutions, but left this General Protection Fault.
But it's surprisingly easy to restore the old behaviour: just check
PageCgroupUsed in mem_cgroup_lru_add_list() (which decides on which lruvec
to add), and reset pc to root_mem_cgroup if page is uncharged. A risky
change? just going back to how it worked before; testing, and an audit of
uses of pc->mem_cgroup, show no problem.
And there's a nice bonus: with mem_cgroup_lru_add_list() itself making
sure that an uncharged page goes to root lru, mem_cgroup_reset_owner() no
longer has any purpose, and we can safely revert 4e5f01c2b9 ("memcg:
clear pc->mem_cgroup if necessary").
Calling update_page_reclaim_stat() after add_page_to_lru_list() in swap.c
is not strictly necessary: the lru_lock there, with RCU before memcg
structures are freed, makes mem_cgroup_get_reclaim_stat_from_page safe
without that; but it seems cleaner to rely on one dependency less.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a preparation before removing a flag PCG_ACCT_LRU in page_cgroup
and reducing atomic ops/complexity in memcg LRU handling.
In some cases, pages are added to lru before charge to memcg and pages
are not classfied to memory cgroup at lru addtion. Now, the lru where
the page should be added is determined a bit in page_cgroup->flags and
pc->mem_cgroup. I'd like to remove the check of flag.
To handle the case pc->mem_cgroup may contain stale pointers if pages
are added to LRU before classification. This patch resets
pc->mem_cgroup to root_mem_cgroup before lru additions.
[akpm@linux-foundation.org: fix CONFIG_CGROUP_MEM_CONT=n build]
[hughd@google.com: fix CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_CGROUP_MEM_RES_CTLR_SWAP=n build]
[akpm@linux-foundation.org: ksm.c needs memcontrol.h, per Michal]
[hughd@google.com: stop oops in mem_cgroup_reset_owner()]
[hughd@google.com: fix page migration to reset_owner]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
test_set_oom_score_adj() was introduced in 72788c3856 ("oom: replace
PF_OOM_ORIGIN with toggling oom_score_adj") to temporarily elevate
current's oom_score_adj for ksm and swapoff without requiring an
additional per-process flag.
Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and
then reinstate the previous value is racy since it's possible that
userspace can set the value to something else itself before the old value
is reinstated. That results in userspace setting current's oom_score_adj
to a different value and then the kernel immediately setting it back to
its previous value without notification.
To fix this, a new compare_swap_oom_score_adj() function is introduced
with the same semantics as the compare and swap CAS instruction, or
CMPXCHG on x86. It is used to reinstate the previous value of
oom_score_adj if and only if the present value is the same as the old
value.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Righi reported a case where an exiting task can race against
ksmd::scan_get_next_rmap_item (http://lkml.org/lkml/2011/6/1/742) easily
triggering a NULL pointer dereference in ksmd.
ksm_scan.mm_slot == &ksm_mm_head with only one registered mm
CPU 1 (__ksm_exit) CPU 2 (scan_get_next_rmap_item)
list_empty() is false
lock slot == &ksm_mm_head
list_del(slot->mm_list)
(list now empty)
unlock
lock
slot = list_entry(slot->mm_list.next)
(list is empty, so slot is still ksm_mm_head)
unlock
slot->mm == NULL ... Oops
Close this race by revalidating that the new slot is not simply the list
head again.
Andrea's test case:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#define BUFSIZE getpagesize()
int main(int argc, char **argv)
{
void *ptr;
if (posix_memalign(&ptr, getpagesize(), BUFSIZE) < 0) {
perror("posix_memalign");
exit(1);
}
if (madvise(ptr, BUFSIZE, MADV_MERGEABLE) < 0) {
perror("madvise");
exit(1);
}
*(char *)NULL = 0;
return 0;
}
Reported-by: Andrea Righi <andrea@betterlinux.com>
Tested-by: Andrea Righi <andrea@betterlinux.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a kernel-wide shortage of per-process flags, so it's always
helpful to trim one when possible without incurring a significant penalty.
It's even more important when you're planning on adding a per- process
flag yourself, which I plan to do shortly for transparent hugepages.
PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a
tendency to allocate large amounts of memory and should be preferred for
killing over other tasks. We'd rather immediately kill the task making
the errant syscall rather than penalizing an innocent task.
This patch removes PF_OOM_ORIGIN since its behavior is equivalent to
setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX.
The process's old oom_score_adj is stored and then set to
OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old
value is then reinstated when the process should no longer be considered a
high priority for oom killing.
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The normal code pattern used in the kernel is: get/put.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It was hard to explain the page counts which were causing new LTP tests
of KSM to fail: we need to drain the per-cpu pagevecs to LRU occasionally.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: CAI Qian <caiqian@redhat.com>
Cc:Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cleanup some code with common compound_trans_head helper.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>