A spare array holding mem cgroup threshold events is kept around
to make sure we can always safely deregister an event and have an
array to store the new set of events in.
In the scenario where we're going from 1 to 0 registered events, the
pointer to the primary array containing 1 event is copied to the spare
slot, and then the spare slot is freed because no events are left.
However, it is freed before calling synchronize_rcu(), which means
readers may still be accessing threshold->primary after it is freed.
Fixed by only freeing after synchronize_rcu().
Change-Id: Iee3ad8eb400612ec24898832eb19ff34eb2aecb4
Signed-off-by: Martijn Coenen <maco@google.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJUyuGRAAoJEDjbvchgkmk+7EwQALYPOeh+AManQFB1MQvFuOgZ
/4ulpjhGXw/RPTKHMeyHo8vRfUhMOx8UPF62uql+g1l9b/Zt2bs6qXu4QcxRRsQc
trSTUpi+U14y1hkgqOVOcFYP2ZaTjNEBQgLJ4eGn46CliLqme+rfoyRYm2GXzcR4
6cbSAr3mufdFIpi9/8Dn62Gv0aws5lIv3qkHJXznyuux3tisPT5y6Ux2KJoivPn/
SqADtRpwo+7lTjl15fE++9AqNsGMorV6toT2OO/7nXP+824psInKLmREAT2qC99b
BG61vcYdxOuHtzmwrvCf1jSRjxhvZT0j2xhBr/vCKcxy08AT0vDv68zrV1r6TIuu
U7/CKXtFBY95cjfnkTLJuswBSuIA/+sQHV6DaddH0V8fcZ6rQMLrblQ9ZcFFFkmT
2SG6lmlXqZvcEKYGMnL/Dcow1rkRhB5stiGgTkYxjiRSRpzAHISRJ/GGpsT+rRqK
HpBs5p9JshvRl7RWKwAu+DNGaEK1X/WYxc4/jw6dZFWX7lEWSMIPlr9zXgZCZ39y
V6lV1VVlT9/CSs1swKHUyhHHehlFsnIlQ6Fkiycr/KkuqBLs92Hyb7WhpVa819yX
osXdxSm6J54skiOLKYpBWHpnY09Tc+p28VEfMpErTExgp2oE8F34K7kdhoQPQb97
2mHiXNa+J4CLUNQ+sRmw
=HDBo
-----END PGP SIGNATURE-----
Merge commit 'v3.10.67' into msm-3.10
This merge brings us up to date with upstream kernel.org tag v3.10.67.
It also contains changes to allow forbidden warnings introduced in
the commit 'core, nfqueue, openvswitch: Orphan frags in skb_zerocopy
and handle errors'. Once upstream has corrected these warnings, the
changes to scripts/gcc-wrapper.py, in this commit, can be reverted.
* commit 'v3.10.67' (915 commits)
Linux 3.10.67
md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants.
ext4: fix warning in ext4_da_update_reserve_space()
quota: provide interface for readding allocated space into reserved space
crypto: add missing crypto module aliases
crypto: include crypto- module prefix in template
crypto: prefix module autoloading with "crypto-"
drbd: merge_bvec_fn: properly remap bvm->bi_bdev
Revert "swiotlb-xen: pass dev_addr to swiotlb_tbl_unmap_single"
ipvs: uninitialized data with IP_VS_IPV6
KEYS: close race between key lookup and freeing
sata_dwc_460ex: fix resource leak on error path
x86/asm/traps: Disable tracing and kprobes in fixup_bad_iret and sync_regs
x86, tls: Interpret an all-zero struct user_desc as "no segment"
x86, tls, ldt: Stop checking lm in LDT_empty
x86/tsc: Change Fast TSC calibration failed from error to info
x86, hyperv: Mark the Hyper-V clocksource as being continuous
clocksource: exynos_mct: Fix bitmask regression for exynos4_mct_write
can: dev: fix crtlmode_supported check
bus: mvebu-mbus: fix support of MBus window 13
ARM: dts: imx25: Fix PWM "per" clocks
time: adjtimex: Validate the ADJ_FREQUENCY values
time: settimeofday: Validate the values of tv from user
dm cache: share cache-metadata object across inactive and active DM tables
ipr: wait for aborted command responses
drm/i915: Fix mutex->owner inspection race under DEBUG_MUTEXES
scripts/recordmcount.pl: There is no -m32 gcc option on Super-H anymore
ALSA: usb-audio: Add mic volume fix quirk for Logitech Webcam C210
libata: prevent HSM state change race between ISR and PIO
pinctrl: Fix two deadlocks
gpio: sysfs: fix gpio device-attribute leak
gpio: sysfs: fix gpio-chip device-attribute leak
Linux 3.10.66
s390/3215: fix tty output containing tabs
s390/3215: fix hanging console issue
fsnotify: next_i is freed during fsnotify_unmount_inodes.
netfilter: ipset: small potential read beyond the end of buffer
mmc: sdhci: Fix sleep in atomic after inserting SD card
LOCKD: Fix a race when initialising nlmsvc_timeout
x86, um: actually mark system call tables readonly
um: Skip futex_atomic_cmpxchg_inatomic() test
decompress_bunzip2: off by one in get_next_block()
ARM: shmobile: sh73a0 legacy: Set .control_parent for all irqpin instances
ARM: omap5/dra7xx: Fix frequency typos
ARM: clk-imx6q: fix video divider for rev T0 1.0
ARM: imx6q: drop unnecessary semicolon
ARM: dts: imx25: Fix the SPI1 clocks
Input: I8042 - add Acer Aspire 7738 to the nomux list
Input: i8042 - reset keyboard to fix Elantech touchpad detection
can: kvaser_usb: Don't send a RESET_CHIP for non-existing channels
can: kvaser_usb: Reset all URB tx contexts upon channel close
can: kvaser_usb: Don't free packets when tight on URBs
USB: keyspan: fix null-deref at probe
USB: cp210x: add IDs for CEL USB sticks and MeshWorks devices
USB: cp210x: fix ID for production CEL MeshConnect USB Stick
usb: dwc3: gadget: Stop TRB preparation after limit is reached
usb: dwc3: gadget: Fix TRB preparation during SG
OHCI: add a quirk for ULi M5237 blocking on reset
gpiolib: of: Correct error handling in of_get_named_gpiod_flags
NFSv4.1: Fix client id trunking on Linux
ftrace/jprobes/x86: Fix conflict between jprobes and function graph tracing
vfio-pci: Fix the check on pci device type in vfio_pci_probe()
uvcvideo: Fix destruction order in uvc_delete()
smiapp: Take mutex during PLL update in sensor initialisation
af9005: fix kernel panic on init if compiled without IR
smiapp-pll: Correct clock debug prints
video/logo: prevent use of logos after they have been freed
storvsc: ring buffer failures may result in I/O freeze
iscsi-target: Fail connection on short sendmsg writes
hp_accel: Add support for HP ZBook 15
cfg80211: Fix 160 MHz channels with 80+80 and 160 MHz drivers
ARC: [nsimosci] move peripherals to match model to FPGA
drm/i915: Force the CS stall for invalidate flushes
drm/i915: Invalidate media caches on gen7
drm/radeon: properly filter DP1.2 4k modes on non-DP1.2 hw
drm/radeon: check the right ring in radeon_evict_flags()
drm/vmwgfx: Fix fence event code
enic: fix rx skb checksum
alx: fix alx_poll()
tcp: Do not apply TSO segment limit to non-TSO packets
tg3: tg3_disable_ints using uninitialized mailbox value to disable interrupts
netlink: Don't reorder loads/stores before marking mmap netlink frame as available
netlink: Always copy on mmap TX.
Linux 3.10.65
mm: Don't count the stack guard page towards RLIMIT_STACK
mm: propagate error from stack expansion even for guard page
mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed
perf session: Do not fail on processing out of order event
perf: Fix events installation during moving group
perf/x86/intel/uncore: Make sure only uncore events are collected
Btrfs: don't delay inode ref updates during log replay
ARM: mvebu: disable I/O coherency on non-SMP situations on Armada 370/375/38x/XP
scripts/kernel-doc: don't eat struct members with __aligned
nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races
nfsd4: fix xdr4 inclusion of escaped char
fs: nfsd: Fix signedness bug in compare_blob
serial: samsung: wait for transfer completion before clock disable
writeback: fix a subtle race condition in I_DIRTY clearing
cdc-acm: memory leak in error case
genhd: check for int overflow in disk_expand_part_tbl()
USB: cdc-acm: check for valid interfaces
ALSA: hda - Fix wrong gpio_dir & gpio_mask hint setups for IDT/STAC codecs
ALSA: hda - using uninitialized data
ALSA: usb-audio: extend KEF X300A FU 10 tweak to Arcam rPAC
driver core: Fix unbalanced device reference in drivers_probe
x86, vdso: Use asm volatile in __getcpu
x86_64, vdso: Fix the vdso address randomization algorithm
HID: Add a new id 0x501a for Genius MousePen i608X
HID: add battery quirk for USB_DEVICE_ID_APPLE_ALU_WIRELESS_2011_ISO keyboard
HID: roccat: potential out of bounds in pyra_sysfs_write_settings()
HID: i2c-hid: prevent buffer overflow in early IRQ
HID: i2c-hid: fix race condition reading reports
iommu/vt-d: Fix an off-by-one bug in __domain_mapping()
UBI: Fix double free after do_sync_erase()
UBI: Fix invalid vfree()
pstore-ram: Allow optional mapping with pgprot_noncached
pstore-ram: Fix hangs by using write-combine mappings
PCI: Restore detection of read-only BARs
ASoC: dwc: Ensure FIFOs are flushed to prevent channel swap
ASoC: max98090: Fix ill-defined sidetone route
ASoC: sigmadsp: Refuse to load firmware files with a non-supported version
ath5k: fix hardware queue index assignment
swiotlb-xen: pass dev_addr to swiotlb_tbl_unmap_single
can: peak_usb: fix memset() usage
can: peak_usb: fix cleanup sequence order in case of error during init
ath9k: fix BE/BK queue order
ath9k_hw: fix hardware queue allocation
ocfs2: fix journal commit deadlock
Linux 3.10.64
Btrfs: fix fs corruption on transaction abort if device supports discard
Btrfs: do not move em to modified list when unpinning
eCryptfs: Remove buggy and unnecessary write in file name decode routine
eCryptfs: Force RO mount when encrypted view is enabled
udf: Verify symlink size before loading it
exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
ncpfs: return proper error from NCP_IOC_SETROOT ioctl
crypto: af_alg - fix backlog handling
userns: Unbreak the unprivileged remount tests
userns: Allow setting gid_maps without privilege when setgroups is disabled
userns: Add a knob to disable setgroups on a per user namespace basis
userns: Rename id_map_mutex to userns_state_mutex
userns: Only allow the creator of the userns unprivileged mappings
userns: Check euid no fsuid when establishing an unprivileged uid mapping
userns: Don't allow unprivileged creation of gid mappings
userns: Don't allow setgroups until a gid mapping has been setablished
userns: Document what the invariant required for safe unprivileged mappings.
groups: Consolidate the setgroups permission checks
umount: Disallow unprivileged mount force
mnt: Update unprivileged remount test
mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount
mac80211: free management frame keys when removing station
mac80211: fix multicast LED blinking and counter
KEYS: Fix stale key registration at error path
isofs: Fix unchecked printing of ER records
x86/tls: Don't validate lm in set_thread_area() after all
dm space map metadata: fix sm_bootstrap_get_nr_blocks()
dm bufio: fix memleak when using a dm_buffer's inline bio
nfs41: fix nfs4_proc_layoutget error handling
megaraid_sas: corrected return of wait_event from abort frame path
mmc: block: add newline to sysfs display of force_ro
mfd: tc6393xb: Fail ohci suspend if full state restore is required
md/bitmap: always wait for writes on unplug.
x86, kvm: Clear paravirt_enabled on KVM guests for espfix32's benefit
x86_64, switch_to(): Load TLS descriptors before switching DS and ES
x86/tls: Disallow unusual TLS segments
x86/tls: Validate TLS entries to protect espfix
isofs: Fix infinite looping over CE entries
Linux 3.10.63
ALSA: usb-audio: Don't resubmit pending URBs at MIDI error recovery
powerpc: 32 bit getcpu VDSO function uses 64 bit instructions
ARM: sched_clock: Load cycle count after epoch stabilizes
igb: bring link up when PHY is powered up
ext2: Fix oops in ext2_get_block() called from ext2_quota_write()
nEPT: Nested INVEPT
net: sctp: use MAX_HEADER for headroom reserve in output path
net: mvneta: fix Tx interrupt delay
rtnetlink: release net refcnt on error in do_setlink()
net/mlx4_core: Limit count field to 24 bits in qp_alloc_res
tg3: fix ring init when there are more TX than RX channels
ipv6: gre: fix wrong skb->protocol in WCCP
sata_fsl: fix error handling of irq_of_parse_and_map
ahci: disable MSI on SAMSUNG 0xa800 SSD
AHCI: Add DeviceIDs for Sunrise Point-LP SATA controller
media: smiapp: Only some selection targets are settable
drm/i915: Unlock panel even when LVDS is disabled
drm/radeon: kernel panic in drm_calc_vbltimestamp_from_scanoutpos with 3.18.0-rc6
i2c: davinci: generate STP always when NACK is received
i2c: omap: fix i207 errata handling
i2c: omap: fix NACK and Arbitration Lost irq handling
xen-netfront: Remove BUGs on paged skb data which crosses a page boundary
mm: fix swapoff hang after page migration and fork
mm: frontswap: invalidate expired data on a dup-store failure
Linux 3.10.62
nfsd: Fix ACL null pointer deref
powerpc/powernv: Honor the generic "no_64bit_msi" flag
bnx2fc: do not add shared skbs to the fcoe_rx_list
nfsd4: fix leak of inode reference on delegation failure
nfsd: Fix slot wake up race in the nfsv4.1 callback code
rt2x00: do not align payload on modern H/W
can: dev: avoid calling kfree_skb() from interrupt context
spi: dw: Fix dynamic speed change.
iser-target: Handle DEVICE_REMOVAL event on network portal listener correctly
target: Don't call TFO->write_pending if data_length == 0
srp-target: Retry when QP creation fails with ENOMEM
Input: xpad - use proper endpoint type
ARM: 8222/1: mvebu: enable strex backoff delay
ARM: 8216/1: xscale: correct auxiliary register in suspend/resume
ALSA: usb-audio: Add ctrl message delay quirk for Marantz/Denon devices
can: esd_usb2: fix memory leak on disconnect
USB: xhci: don't start a halted endpoint before its new dequeue is set
usb-quirks: Add reset-resume quirk for MS Wireless Laser Mouse 6000
usb: serial: ftdi_sio: add PIDs for Matrix Orbital products
USB: serial: cp210x: add IDs for CEL MeshConnect USB Stick
USB: keyspan: fix tty line-status reporting
USB: keyspan: fix overrun-error reporting
USB: ssu100: fix overrun-error reporting
iio: Fix IIO_EVENT_CODE_EXTRACT_DIR bit mask
powerpc/pseries: Fix endiannes issue in RTAS call from xmon
powerpc/pseries: Honor the generic "no_64bit_msi" flag
of/base: Fix PowerPC address parsing hack
ASoC: wm_adsp: Avoid attempt to free buffers that might still be in use
ASoC: sgtl5000: Fix SMALL_POP bit definition
PCI/MSI: Add device flag indicating that 64-bit MSIs don't work
ipx: fix locking regression in ipx_sendmsg and ipx_recvmsg
pptp: fix stack info leak in pptp_getname()
qmi_wwan: Add support for HP lt4112 LTE/HSPA+ Gobi 4G Modem
ieee802154: fix error handling in ieee802154fake_probe()
ipv4: Fix incorrect error code when adding an unreachable route
inetdevice: fixed signed integer overflow
sparc64: Fix constraints on swab helpers.
uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME
x86, mm: Set NX across entire PMD at boot
x86: Require exact match for 'noxsave' command line option
x86_64, traps: Rework bad_iret
x86_64, traps: Stop using IST for #SS
x86_64, traps: Fix the espfix64 #DF fixup and rewrite it in C
MIPS: Loongson: Make platform serial setup always built-in.
MIPS: oprofile: Fix backtrace on 64-bit kernel
Linux 3.10.61
mm: memcg: handle non-error OOM situations more gracefully
mm: memcg: do not trap chargers with full callstack on OOM
mm: memcg: rework and document OOM waiting and wakeup
mm: memcg: enable memcg OOM killer only for user faults
x86: finish user fault error path with fatal signal
arch: mm: pass userspace fault flag to generic fault handler
arch: mm: do not invoke OOM killer on kernel fault OOM
arch: mm: remove obsolete init OOM protection
mm: invoke oom-killer from remaining unconverted page fault handlers
net: sctp: fix skb_over_panic when receiving malformed ASCONF chunks
net: sctp: fix panic on duplicate ASCONF chunks
net: sctp: fix remote memory pressure from excessive queueing
KVM: x86: Don't report guest userspace emulation error to userspace
SCSI: hpsa: fix a race in cmd_free/scsi_done
net/mlx4_en: Fix BlueFlame race
ARM: Correct BUG() assembly to ensure it is endian-agnostic
perf/x86/intel: Use proper dTLB-load-misses event on IvyBridge
mei: bus: fix possible boundaries violation
perf: Handle compat ioctl
MIPS: Fix forgotten preempt_enable() when CPU has inclusive pcaches
dell-wmi: Fix access out of memory
ARM: probes: fix instruction fetch order with <asm/opcodes.h>
br: fix use of ->rx_handler_data in code executed on non-rx_handler path
netfilter: nf_nat: fix oops on netns removal
netfilter: xt_bpf: add mising opaque struct sk_filter definition
netfilter: nf_log: release skbuff on nlmsg put failure
netfilter: nfnetlink_log: fix maximum packet length logged to userspace
netfilter: nf_log: account for size of NLMSG_DONE attribute
ipc: always handle a new value of auto_msgmni
clocksource: Remove "weak" from clocksource_default_clock() declaration
kgdb: Remove "weak" from kgdb_arch_pc() declaration
media: ttusb-dec: buffer overflow in ioctl
NFSv4: Fix races between nfs_remove_bad_delegation() and delegation return
nfs: Fix use of uninitialized variable in nfs_getattr()
NFS: Don't try to reclaim delegation open state if recovery failed
NFSv4: Ensure that we remove NFSv4.0 delegations when state has expired
Input: alps - allow up to 2 invalid packets without resetting device
Input: alps - ignore potential bare packets when device is out of sync
dm raid: ensure superblock's size matches device's logical block size
dm btree: fix a recursion depth bug in btree walking code
block: Fix computation of merged request priority
parisc: Use compat layer for msgctl, shmat, shmctl and semtimedop syscalls
scsi: only re-lock door after EH on devices that were reset
nfs: fix pnfs direct write memory leak
firewire: cdev: prevent kernel stack leaking into ioctl arguments
arm64: __clear_user: handle exceptions on strb
ARM: 8198/1: make kuser helpers depend on MMU
drm/radeon: add missing crtc unlock when setting up the MC
mac80211: fix use-after-free in defragmentation
macvtap: Fix csum_start when VLAN tags are present
iwlwifi: configure the LTR
libceph: do not crash on large auth tickets
xtensa: re-wire umount syscall to sys_oldumount
ALSA: usb-audio: Fix memory leak in FTU quirk
ahci: disable MSI instead of NCQ on Samsung pci-e SSDs on macbooks
ahci: Add Device IDs for Intel Sunrise Point PCH
audit: keep inode pinned
x86, x32, audit: Fix x32's AUDIT_ARCH wrt audit
sparc32: Implement xchg and atomic_xchg using ATOMIC_HASH locks
sparc64: Do irq_{enter,exit}() around generic_smp_call_function*().
sparc64: Fix crashes in schizo_pcierr_intr_other().
sunvdc: don't call VD_OP_GET_VTOC
vio: fix reuse of vio_dring slot
sunvdc: limit each sg segment to a page
sunvdc: compute vdisk geometry from capacity
sunvdc: add cdrom and v1.1 protocol support
net: sctp: fix memory leak in auth key management
net: sctp: fix NULL pointer dereference in af->from_addr_param on malformed packet
gre6: Move the setting of dev->iflink into the ndo_init functions.
ip6_tunnel: Use ip6_tnl_dev_init as the ndo_init function.
Linux 3.10.60
libceph: ceph-msgr workqueue needs a resque worker
Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup
of: Fix overflow bug in string property parsing functions
sysfs: driver core: Fix glue dir race condition by gdp_mutex
i2c: at91: don't account as iowait
acer-wmi: Add acpi_backlight=video quirk for the Acer KAV80
rbd: Fix error recovery in rbd_obj_read_sync()
drm/radeon: remove invalid pci id
usb: gadget: udc: core: fix kernel oops with soft-connect
usb: gadget: function: acm: make f_acm pass USB20CV Chapter9
usb: dwc3: gadget: fix set_halt() bug with pending transfers
crypto: algif - avoid excessive use of socket buffer in skcipher
mm: Remove false WARN_ON from pagecache_isize_extended()
x86, apic: Handle a bad TSC more gracefully
posix-timers: Fix stack info leak in timer_create()
mac80211: fix typo in starting baserate for rts_cts_rate_idx
PM / Sleep: fix recovery during resuming from hibernation
tty: Fix high cpu load if tty is unreleaseable
quota: Properly return errors from dquot_writeback_dquots()
ext3: Don't check quota format when there are no quota files
nfsd4: fix crash on unknown operation number
cpc925_edac: Report UE events properly
e7xxx_edac: Report CE events properly
i3200_edac: Report CE events properly
i82860_edac: Report CE events properly
scsi: Fix error handling in SCSI_IOCTL_SEND_COMMAND
lib/bitmap.c: fix undefined shift in __bitmap_shift_{left|right}()
cgroup/kmemleak: add kmemleak_free() for cgroup deallocations.
usb: Do not allow usb_alloc_streams on unconfigured devices
USB: opticon: fix non-atomic allocation in write path
usb-storage: handle a skipped data phase
spi: pxa2xx: toggle clocks on suspend if not disabled by runtime PM
spi: pl022: Fix incorrect dma_unmap_sg
usb: dwc3: gadget: Properly initialize LINK TRB
wireless: rt2x00: add new rt2800usb device
USB: option: add Haier CE81B CDMA modem
usb: option: add support for Telit LE910
USB: cdc-acm: only raise DTR on transitions from B0
USB: cdc-acm: add device id for GW Instek AFG-2225
usb: serial: ftdi_sio: add "bricked" FTDI device PID
usb: serial: ftdi_sio: add Awinda Station and Dongle products
USB: serial: cp210x: add Silicon Labs 358x VID and PID
serial: Fix divide-by-zero fault in uart_get_divisor()
staging:iio:ade7758: Remove "raw" from channel name
staging:iio:ade7758: Fix check if channels are enabled in prenable
staging:iio:ade7758: Fix NULL pointer deref when enabling buffer
staging:iio:ad5933: Drop "raw" from channel names
staging:iio:ad5933: Fix NULL pointer deref when enabling buffer
OOM, PM: OOM killed task shouldn't escape PM suspend
freezer: Do not freeze tasks killed by OOM killer
ext4: fix oops when loading block bitmap failed
cpufreq: intel_pstate: Fix setting max_perf_pct in performance policy
ext4: fix overflow when updating superblock backups after resize
ext4: check s_chksum_driver when looking for bg csum presence
ext4: fix reservation overflow in ext4_da_write_begin
ext4: add ext4_iget_normal() which is to be used for dir tree lookups
ext4: grab missed write_count for EXT4_IOC_SWAP_BOOT
ext4: don't check quota format when there are no quota files
ext4: check EA value offset when loading
jbd2: free bh when descriptor block checksum fails
MIPS: tlbex: Properly fix HUGE TLB Refill exception handler
target: Fix APTPL metadata handling for dynamic MappedLUNs
target: Fix queue full status NULL pointer for SCF_TRANSPORT_TASK_SENSE
qla_target: don't delete changed nacls
ARC: Update order of registers in KGDB to match GDB 7.5
ARC: [nsimosci] Allow "headless" models to boot
KVM: x86: Emulator fixes for eip canonical checks on near branches
KVM: x86: Fix wrong masking on relative jump/call
kvm: x86: don't kill guest on unknown exit reason
KVM: x86: Check non-canonical addresses upon WRMSR
KVM: x86: Improve thread safety in pit
KVM: x86: Prevent host from panicking on shared MSR writes.
kvm: fix excessive pages un-pinning in kvm_iommu_map error path.
media: tda7432: Fix setting TDA7432_MUTE bit for TDA7432_RF register
media: ds3000: fix LNB supply voltage on Tevii S480 on initialization
media: em28xx-v4l: give back all active video buffers to the vb2 core properly on streaming stop
media: v4l2-common: fix overflow in v4l_bound_align_image()
drm/nouveau/bios: memset dcb struct to zero before parsing
drm/tilcdc: Fix the error path in tilcdc_load()
drm/ast: Fix HW cursor image
Input: i8042 - quirks for Fujitsu Lifebook A544 and Lifebook AH544
Input: i8042 - add noloop quirk for Asus X750LN
framebuffer: fix border color
modules, lock around setting of MODULE_STATE_UNFORMED
dm log userspace: fix memory leak in dm_ulog_tfr_init failure path
block: fix alignment_offset math that assumes io_min is a power-of-2
drbd: compute the end before rb_insert_augmented()
dm bufio: update last_accessed when relinking a buffer
virtio_pci: fix virtio spec compliance on restore
selinux: fix inode security list corruption
pstore: Fix duplicate {console,ftrace}-efi entries
mfd: rtsx_pcr: Fix MSI enable error handling
mnt: Prevent pivot_root from creating a loop in the mount tree
UBI: add missing kmem_cache_free() in process_pool_aeb error path
random: add and use memzero_explicit() for clearing data
crypto: more robust crypto_memneq
fix misuses of f_count() in ppp and netlink
kill wbuf_queued/wbuf_dwork_lock
ALSA: pcm: Zero-clear reserved fields of PCM status ioctl in compat mode
evm: check xattr value length and type in evm_inode_setxattr()
x86, pageattr: Prevent overflow in slow_virt_to_phys() for X86_PAE
x86_64, entry: Fix out of bounds read on sysenter
x86_64, entry: Filter RFLAGS.NT on entry from userspace
x86, flags: Rename X86_EFLAGS_BIT1 to X86_EFLAGS_FIXED
x86, fpu: shift drop_init_fpu() from save_xstate_sig() to handle_signal()
x86, fpu: __restore_xstate_sig()->math_state_restore() needs preempt_disable()
x86: Reject x32 executables if x32 ABI not supported
vfs: fix data corruption when blocksize < pagesize for mmaped data
UBIFS: fix free log space calculation
UBIFS: fix a race condition
UBIFS: remove mst_mutex
fs: Fix theoretical division by 0 in super_cache_scan().
fs: make cont_expand_zero interruptible
mmc: rtsx_pci_sdmmc: fix incorrect last byte in R2 response
libata-sff: Fix controllers with no ctl port
pata_serverworks: disable 64-KB DMA transfers on Broadcom OSB4 IDE Controller
Revert "percpu: free percpu allocation info for uniprocessor system"
lockd: Try to reconnect if statd has moved
drivers/net: macvtap and tun depend on INET
ipv4: dst_entry leak in ip_send_unicast_reply()
ax88179_178a: fix bonding failure
ipv4: fix nexthop attlen check in fib_nh_match
tracing/syscalls: Ignore numbers outside NR_syscalls' range
Linux 3.10.59
ecryptfs: avoid to access NULL pointer when write metadata in xattr
ARM: at91/PMC: don't forget to write PMC_PCDR register to disable clocks
ALSA: usb-audio: Add support for Steinberg UR22 USB interface
ALSA: emu10k1: Fix deadlock in synth voice lookup
ALSA: pcm: use the same dma mmap codepath both for arm and arm64
arm64: compat: fix compat types affecting struct compat_elf_prpsinfo
spi: dw-mid: terminate ongoing transfers at exit
kernel: add support for gcc 5
fanotify: enable close-on-exec on events' fd when requested in fanotify_init()
mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set
Bluetooth: Fix issue with USB suspend in btusb driver
Bluetooth: Fix HCI H5 corrupted ack value
rt2800: correct BBP1_TX_POWER_CTRL mask
PCI: Generate uppercase hex for modalias interface class
PCI: Increase IBM ipr SAS Crocodile BARs to at least system page size
iwlwifi: Add missing PCI IDs for the 7260 series
NFSv4.1: Fix an NFSv4.1 state renewal regression
NFSv4: fix open/lock state recovery error handling
NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
lzo: check for length overrun in variable length encoding.
Revert "lzo: properly check for overruns"
Documentation: lzo: document part of the encoding
m68k: Disable/restore interrupts in hwreg_present()/hwreg_write()
Drivers: hv: vmbus: Fix a bug in vmbus_open()
Drivers: hv: vmbus: Cleanup vmbus_establish_gpadl()
Drivers: hv: vmbus: Cleanup vmbus_teardown_gpadl()
Drivers: hv: vmbus: Cleanup vmbus_post_msg()
firmware_class: make sure fw requests contain a name
qla2xxx: Use correct offset to req-q-out for reserve calculation
mptfusion: enable no_write_same for vmware scsi disks
be2iscsi: check ip buffer before copying
regmap: fix NULL pointer dereference in _regmap_write/read
regmap: debugfs: fix possbile NULL pointer dereference
spi: dw-mid: check that DMA was inited before exit
spi: dw-mid: respect 8 bit mode
x86/intel/quark: Switch off CR4.PGE so TLB flush uses CR3 instead
kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
KVM: s390: unintended fallthrough for external call
kvm: x86: fix stale mmio cache bug
fs: Add a missing permission check to do_umount
Btrfs: fix race in WAIT_SYNC ioctl
Btrfs: fix build_backref_tree issue with multiple shared blocks
Btrfs: try not to ENOSPC on log replay
Linux 3.10.58
USB: cp210x: add support for Seluxit USB dongle
USB: serial: cp210x: added Ketra N1 wireless interface support
USB: Add device quirk for ASUS T100 Base Station keyboard
ipv6: reallocate addrconf router for ipv6 address when lo device up
tcp: fixing TLP's FIN recovery
sctp: handle association restarts when the socket is closed.
ip6_gre: fix flowi6_proto value in xmit path
hyperv: Fix a bug in netvsc_start_xmit()
tg3: Allow for recieve of full-size 8021AD frames
tg3: Work around HW/FW limitations with vlan encapsulated frames
l2tp: fix race while getting PMTU on PPP pseudo-wire
openvswitch: fix panic with multiple vlan headers
packet: handle too big packets for PACKET_V3
tcp: fix tcp_release_cb() to dispatch via address family for mtu_reduced()
sit: Fix ipip6_tunnel_lookup device matching criteria
myri10ge: check for DMA mapping errors
Linux 3.10.57
cpufreq: ondemand: Change the calculation of target frequency
cpufreq: Fix wrong time unit conversion
nl80211: clear skb cb before passing to netlink
drbd: fix regression 'out of mem, failed to invoke fence-peer helper'
jiffies: Fix timeval conversion to jiffies
md/raid5: disable 'DISCARD' by default due to safety concerns.
media: vb2: fix VBI/poll regression
mm: numa: Do not mark PTEs pte_numa when splitting huge pages
mm, thp: move invariant bug check out of loop in __split_huge_page_map
ring-buffer: Fix infinite spin in reading buffer
init/Kconfig: Fix HAVE_FUTEX_CMPXCHG to not break up the EXPERT menu
perf: fix perf bug in fork()
udf: Avoid infinite loop when processing indirect ICBs
Linux 3.10.56
vm_is_stack: use for_each_thread() rather then buggy while_each_thread()
oom_kill: add rcu_read_lock() into find_lock_task_mm()
oom_kill: has_intersects_mems_allowed() needs rcu_read_lock()
oom_kill: change oom_kill.c to use for_each_thread()
introduce for_each_thread() to replace the buggy while_each_thread()
kernel/fork.c:copy_process(): unify CLONE_THREAD-or-thread_group_leader code
arm: multi_v7_defconfig: Enable Zynq UART driver
ext2: Fix fs corruption in ext2_get_xip_mem()
serial: 8250_dma: check the result of TX buffer mapping
ARM: 7748/1: oabi: handle faults when loading swi instruction from userspace
netfilter: nf_conntrack: avoid large timeout for mid-stream pickup
PM / sleep: Use valid_state() for platform-dependent sleep states only
PM / sleep: Add state field to pm_states[] entries
ipvs: fix ipv6 hook registration for local replies
ipvs: Maintain all DSCP and ECN bits for ipv6 tun forwarding
ipvs: avoid netns exit crash on ip_vs_conn_drop_conntrack
md/raid1: fix_read_error should act on all non-faulty devices.
media: cx18: fix kernel oops with tda8290 tuner
Fix nasty 32-bit overflow bug in buffer i/o code.
perf kmem: Make it work again on non NUMA machines
perf: Fix a race condition in perf_remove_from_context()
alarmtimer: Lock k_itimer during timer callback
alarmtimer: Do not signal SIGEV_NONE timers
parisc: Only use -mfast-indirect-calls option for 32-bit kernel builds
powerpc/perf: Fix ABIv2 kernel backtraces
sched: Fix unreleased llc_shared_mask bit during CPU hotplug
ocfs2/dlm: do not get resource spinlock if lockres is new
nilfs2: fix data loss with mmap()
fs/notify: don't show f_handle if exportfs_encode_inode_fh failed
fsnotify/fdinfo: use named constants instead of hardcoded values
kcmp: fix standard comparison bug
Revert "mac80211: disable uAPSD if all ACs are under ACM"
usb: dwc3: core: fix ordering for PHY suspend
usb: dwc3: core: fix order of PM runtime calls
usb: host: xhci: fix compliance mode workaround
genhd: fix leftover might_sleep() in blk_free_devt()
lockd: fix rpcbind crash on lockd startup failure
rtlwifi: rtl8192cu: Add new ID
percpu: perform tlb flush after pcpu_map_pages() failure
percpu: fix pcpu_alloc_pages() failure path
percpu: free percpu allocation info for uniprocessor system
ata_piix: Add Device IDs for Intel 9 Series PCH
Input: i8042 - add nomux quirk for Avatar AVIU-145A6
Input: i8042 - add Fujitsu U574 to no_timeout dmi table
Input: atkbd - do not try 'deactivate' keyboard on any LG laptops
Input: elantech - fix detection of touchpad on ASUS s301l
Input: synaptics - add support for ForcePads
Input: serport - add compat handling for SPIOCSTYPE ioctl
dm crypt: fix access beyond the end of allocated space
block: Fix dev_t minor allocation lifetime
workqueue: apply __WQ_ORDERED to create_singlethread_workqueue()
Revert "iwlwifi: dvm: don't enable CTS to self"
SCSI: libiscsi: fix potential buffer overrun in __iscsi_conn_send_pdu
NFC: microread: Potential overflows in microread_target_discovered()
iscsi-target: Fix memory corruption in iscsit_logout_post_handler_diffcid
iscsi-target: avoid NULL pointer in iscsi_copy_param_list failure
Target/iser: Don't put isert_conn inside disconnected handler
Target/iser: Get isert_conn reference once got to connected_handler
iio:inkern: fix overwritten -EPROBE_DEFER in of_iio_channel_get_by_name
iio:magnetometer: bugfix magnetometers gain values
iio: adc: ad_sigma_delta: Fix indio_dev->trig assignment
iio: st_sensors: Fix indio_dev->trig assignment
iio: meter: ade7758: Fix indio_dev->trig assignment
iio: inv_mpu6050: Fix indio_dev->trig assignment
iio: gyro: itg3200: Fix indio_dev->trig assignment
iio:trigger: modify return value for iio_trigger_get
CIFS: Fix SMB2 readdir error handling
CIFS: Fix directory rename error
ASoC: davinci-mcasp: Correct rx format unit configuration
shmem: fix nlink for rename overwrite directory
x86 early_ioremap: Increase FIX_BTMAPS_SLOTS to 8
KVM: x86: handle idiv overflow at kvm_write_tsc
regmap: Fix handling of volatile registers for format_write() chips
ACPICA: Update to GPIO region handler interface.
MIPS: mcount: Adjust stack pointer for static trace in MIPS32
MIPS: ZBOOT: add missing <linux/string.h> include
ARM: 8165/1: alignment: don't break misaligned NEON load/store
ARM: 7897/1: kexec: Use the right ISA for relocate_new_kernel
ARM: 8133/1: use irq_set_affinity with force=false when migrating irqs
ARM: 8128/1: abort: don't clear the exclusive monitors
NFSv4: Fix another bug in the close/open_downgrade code
NFSv4: nfs4_state_manager() vs. nfs_server_remove_lists()
usb:hub set hub->change_bits when over-current happens
usb: dwc3: omap: fix ordering for runtime pm calls
USB: EHCI: unlink QHs even after the controller has stopped
USB: storage: Add quirks for Entrega/Xircom USB to SCSI converters
USB: storage: Add quirk for Ariston Technologies iConnect USB to SCSI adapter
USB: storage: Add quirk for Adaptec USBConnect 2000 USB-to-SCSI Adapter
storage: Add single-LUN quirk for Jaz USB Adapter
usb: hub: take hub->hdev reference when processing from eventlist
xhci: fix oops when xhci resumes from hibernate with hw lpm capable devices
xhci: Fix null pointer dereference if xhci initialization fails
USB: zte_ev: fix removed PIDs
USB: ftdi_sio: add support for NOVITUS Bono E thermal printer
USB: sierra: add 1199:68AA device ID
USB: sierra: avoid CDC class functions on "68A3" devices
USB: zte_ev: remove duplicate Qualcom PID
USB: zte_ev: remove duplicate Gobi PID
Revert "USB: option,zte_ev: move most ZTE CDMA devices to zte_ev"
USB: option: add VIA Telecom CDS7 chipset device id
USB: option: reduce interrupt-urb logging verbosity
USB: serial: fix potential heap buffer overflow
USB: sisusb: add device id for Magic Control USB video
USB: serial: fix potential stack buffer overflow
USB: serial: pl2303: add device id for ztek device
xtensa: fix a6 and a7 handling in fast_syscall_xtensa
xtensa: fix TLBTEMP_BASE_2 region handling in fast_second_level_miss
xtensa: fix access to THREAD_RA/THREAD_SP/THREAD_DS
xtensa: fix address checks in dma_{alloc,free}_coherent
xtensa: replace IOCTL code definitions with constants
drm/radeon: add connector quirk for fujitsu board
drm/vmwgfx: Fix a potential infinite spin waiting for fifo idle
drm/ast: AST2000 cannot be detected correctly
drm/i915: Wait for vblank before enabling the TV encoder
drm/i915: Remove bogus __init annotation from DMI callbacks
HID: logitech-dj: prevent false errors to be shown
HID: magicmouse: sanity check report size in raw_event() callback
HID: picolcd: sanity check report size in raw_event() callback
cfq-iosched: Fix wrong children_weight calculation
ALSA: pcm: fix fifo_size frame calculation
ALSA: hda - Fix invalid pin powermap without jack detection
ALSA: hda - Fix COEF setups for ALC1150 codec
ALSA: core: fix buffer overflow in snd_info_get_line()
arm64: ptrace: fix compat hardware watchpoint reporting
trace: Fix epoll hang when we race with new entries
i2c: at91: Fix a race condition during signal handling in at91_do_twi_xfer.
i2c: at91: add bound checking on SMBus block length bytes
arm64: flush TLS registers during exec
ibmveth: Fix endian issues with rx_no_buffer statistic
ahci: add pcid for Marvel 0x9182 controller
ahci: Add Device IDs for Intel 9 Series PCH
pata_scc: propagate return value of scc_wait_after_reset
drm/i915: read HEAD register back in init_ring_common() to enforce ordering
drm/radeon: load the lm63 driver for an lm64 thermal chip.
drm/ttm: Choose a pool to shrink correctly in ttm_dma_pool_shrink_scan().
drm/ttm: Fix possible division by 0 in ttm_dma_pool_shrink_scan().
drm/tilcdc: fix double kfree
drm/tilcdc: fix release order on exit
drm/tilcdc: panel: fix leak when unloading the module
drm/tilcdc: tfp410: fix dangling sysfs connector node
drm/tilcdc: slave: fix dangling sysfs connector node
drm/tilcdc: panel: fix dangling sysfs connector node
carl9170: fix sending URBs with wrong type when using full-speed
Linux 3.10.55
libceph: gracefully handle large reply messages from the mon
libceph: rename ceph_msg::front_max to front_alloc_len
tpm: Provide a generic means to override the chip returned timeouts
vfs: fix bad hashing of dentries
dcache.c: get rid of pointless macros
IB/srp: Fix deadlock between host removal and multipathd
blkcg: don't call into policy draining if root_blkg is already gone
mtd: nand: omap: Fix 1-bit Hamming code scheme, omap_calculate_ecc()
mtd/ftl: fix the double free of the buffers allocated in build_maps()
CIFS: Fix wrong restart readdir for SMB1
CIFS: Fix wrong filename length for SMB2
CIFS: Fix wrong directory attributes after rename
CIFS: Possible null ptr deref in SMB2_tcon
CIFS: Fix async reading on reconnects
CIFS: Fix STATUS_CANNOT_DELETE error mapping for SMB2
libceph: do not hard code max auth ticket len
libceph: add process_one_ticket() helper
libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly
md/raid1,raid10: always abort recover on write error.
xfs: don't zero partial page cache pages during O_DIRECT writes
xfs: don't zero partial page cache pages during O_DIRECT writes
xfs: don't dirty buffers beyond EOF
xfs: quotacheck leaves dquot buffers without verifiers
RDMA/iwcm: Use a default listen backlog if needed
md/raid10: Fix memory leak when raid10 reshape completes.
md/raid10: fix memory leak when reshaping a RAID10.
md/raid6: avoid data corruption during recovery of double-degraded RAID6
Bluetooth: Avoid use of session socket after the session gets freed
Bluetooth: never linger on process exit
mnt: Add tests for unprivileged remount cases that have found to be faulty
mnt: Change the default remount atime from relatime to the existing value
mnt: Correct permission checks in do_remount
mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount
mnt: Only change user settable mount flags in remount
ring-buffer: Up rb_iter_peek() loop count to 3
ring-buffer: Always reset iterator to reader page
ACPI / cpuidle: fix deadlock between cpuidle_lock and cpu_hotplug.lock
ACPI: Run fixed event device notifications in process context
ACPICA: Utilities: Fix memory leak in acpi_ut_copy_iobject_to_iobject
bfa: Fix undefined bit shift on big-endian architectures with 32-bit DMA address
ASoC: pxa-ssp: drop SNDRV_PCM_FMTBIT_S24_LE
ASoC: max98090: Fix missing free_irq
ASoC: samsung: Correct I2S DAI suspend/resume ops
ASoC: wm_adsp: Add missing MODULE_LICENSE
ASoC: pcm: fix dpcm_path_put in dpcm runtime update
openrisc: Rework signal handling
MIPS: Fix accessing to per-cpu data when flushing the cache
MIPS: OCTEON: make get_system_type() thread-safe
MIPS: asm: thread_info: Add _TIF_SECCOMP flag
MIPS: Cleanup flags in syscall flags handlers.
MIPS: asm/reg.h: Make 32- and 64-bit definitions available at the same time
MIPS: Remove BUG_ON(!is_fpu_owner()) in do_ade()
MIPS: tlbex: Fix a missing statement for HUGETLB
MIPS: Prevent user from setting FCSR cause bits
MIPS: GIC: Prevent array overrun
drivers: scsi: storvsc: Correctly handle TEST_UNIT_READY failure
Drivers: scsi: storvsc: Implement a eh_timed_out handler
powerpc/pseries: Failure on removing device node
powerpc/mm: Use read barrier when creating real_pte
powerpc/mm/numa: Fix break placement
regulator: arizona-ldo1: remove bypass functionality
mfd: omap-usb-host: Fix improper mask use.
kernel/smp.c:on_each_cpu_cond(): fix warning in fallback path
CAPABILITIES: remove undefined caps from all processes
tpm: missing tpm_chip_put in tpm_get_random()
firmware: Do not use WARN_ON(!spin_is_locked())
spi: omap2-mcspi: Configure hardware when slave driver changes mode
spi: orion: fix incorrect handling of cell-index DT property
iommu/amd: Fix cleanup_domain for mass device removal
media: media-device: Remove duplicated memset() in media_enum_entities()
media: au0828: Only alt setting logic when needed
media: xc4000: Fix get_frequency()
media: xc5000: Fix get_frequency()
Linux 3.10.54
USB: fix build error with CONFIG_PM_RUNTIME disabled
NFSv4: Fix problems with close in the presence of a delegation
NFSv3: Fix another acl regression
svcrdma: Select NFSv4.1 backchannel transport based on forward channel
NFSD: Decrease nfsd_users in nfsd_startup_generic fail
usb: hub: Prevent hub autosuspend if usbcore.autosuspend is -1
USB: whiteheat: Added bounds checking for bulk command response
USB: ftdi_sio: Added PID for new ekey device
USB: ftdi_sio: add Basic Micro ATOM Nano USB2Serial PID
ARM: OMAP2+: hwmod: Rearm wake-up interrupts for DT when MUSB is idled
usb: xhci: amd chipset also needs short TX quirk
xhci: Treat not finding the event_seg on COMP_STOP the same as COMP_STOP_INVAL
Staging: speakup: Update __speakup_paste_selection() tty (ab)usage to match vt
jbd2: fix infinite loop when recovering corrupt journal blocks
mei: nfc: fix memory leak in error path
mei: reset client state on queued connect request
Btrfs: fix csum tree corruption, duplicate and outdated checksums
hpsa: fix bad -ENOMEM return value in hpsa_big_passthru_ioctl
x86/efi: Enforce CONFIG_RELOCATABLE for EFI boot stub
x86_64/vsyscall: Fix warn_bad_vsyscall log output
x86: don't exclude low BIOS area when allocating address space for non-PCI cards
drm/radeon: add additional SI pci ids
ext4: fix BUG_ON in mb_free_blocks()
kvm: iommu: fix the third parameter of kvm_iommu_put_pages (CVE-2014-3601)
Revert "KVM: x86: Increase the number of fixed MTRR regs to 10"
KVM: nVMX: fix "acknowledge interrupt on exit" when APICv is in use
KVM: x86: always exit on EOIs for interrupts listed in the IOAPIC redir table
KVM: x86: Inter-privilege level ret emulation is not implemeneted
crypto: ux500 - make interrupt mode plausible
serial: core: Preserve termios c_cflag for console resume
ext4: fix ext4_discard_allocated_blocks() if we can't allocate the pa struct
drivers/i2c/busses: use correct type for dma_map/unmap
hwmon: (dme1737) Prevent overflow problem when writing large limits
hwmon: (ads1015) Fix out-of-bounds array access
hwmon: (lm85) Fix various errors on attribute writes
hwmon: (ads1015) Fix off-by-one for valid channel index checking
hwmon: (gpio-fan) Prevent overflow problem when writing large limits
hwmon: (lm78) Fix overflow problems seen when writing large temperature limits
hwmon: (sis5595) Prevent overflow problem when writing large limits
drm: omapdrm: fix compiler errors
ARM: OMAP3: Fix choice of omap3_restore_es function in OMAP34XX rev3.1.2 case.
mei: start disconnect request timer consistently
ALSA: hda/realtek - Avoid setting wrong COEF on ALC269 & co
ALSA: hda/ca0132 - Don't try loading firmware at resume when already failed
ALSA: virtuoso: add Xonar Essence STX II support
ALSA: hda - fix an external mic jack problem on a HP machine
USB: Fix persist resume of some SS USB devices
USB: ehci-pci: USB host controller support for Intel Quark X1000
USB: serial: ftdi_sio: Add support for new Xsens devices
USB: serial: ftdi_sio: Annotate the current Xsens PID assignments
USB: OHCI: don't lose track of EDs when a controller dies
isofs: Fix unbounded recursion when processing relocated directories
HID: fix a couple of off-by-ones
HID: logitech: perform bounds checking on device_id early enough
stable_kernel_rules: Add pointer to netdev-FAQ for network patches
Linux 3.10.53
arch/sparc/math-emu/math_32.c: drop stray break operator
sparc64: ldc_connect() should not return EINVAL when handshake is in progress.
sunsab: Fix detection of BREAK on sunsab serial console
bbc-i2c: Fix BBC I2C envctrl on SunBlade 2000
sparc64: Guard against flushing openfirmware mappings.
sparc64: Do not insert non-valid PTEs into the TSB hash table.
sparc64: Add membar to Niagara2 memcpy code.
sparc64: Fix huge TSB mapping on pre-UltraSPARC-III cpus.
sparc64: Don't bark so loudly about 32-bit tasks generating 64-bit fault addresses.
sparc64: Fix top-level fault handling bugs.
sparc64: Handle 32-bit tasks properly in compute_effective_address().
sparc64: Make itc_sync_lock raw
sparc64: Fix argument sign extension for compat_sys_futex().
sctp: fix possible seqlock seadlock in sctp_packet_transmit()
iovec: make sure the caller actually wants anything in memcpy_fromiovecend
net: Correctly set segment mac_len in skb_segment().
macvlan: Initialize vlan_features to turn on offload support.
net: sctp: inherit auth_capable on INIT collisions
tcp: Fix integer-overflow in TCP vegas
tcp: Fix integer-overflows in TCP veno
net: sendmsg: fix NULL pointer dereference
ip: make IP identifiers less predictable
inetpeer: get rid of ip_id_count
bnx2x: fix crash during TSO tunneling
Linux 3.10.52
x86/espfix/xen: Fix allocation of pages for paravirt page tables
lib/btree.c: fix leak of whole btree nodes
net/l2tp: don't fall back on UDP [get|set]sockopt
net: mvneta: replace Tx timer with a real interrupt
net: mvneta: add missing bit descriptions for interrupt masks and causes
net: mvneta: do not schedule in mvneta_tx_timeout
net: mvneta: use per_cpu stats to fix an SMP lock up
net: mvneta: increase the 64-bit rx/tx stats out of the hot path
Revert "mac80211: move "bufferable MMPDU" check to fix AP mode scan"
staging: vt6655: Fix Warning on boot handle_irq_event_percpu.
x86_64/entry/xen: Do not invoke espfix64 on Xen
x86, espfix: Make it possible to disable 16-bit support
x86, espfix: Make espfix64 a Kconfig option, fix UML
x86, espfix: Fix broken header guard
x86, espfix: Move espfix definitions into a separate header file
x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stack
Revert "x86-64, modify_ldt: Make support for 16-bit segments a runtime option"
timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks
printk: rename printk_sched to printk_deferred
iio: buffer: Fix demux table creation
staging: vt6655: Fix disassociated messages every 10 seconds
mm, thp: do not allow thp faults to avoid cpuset restrictions
scsi: handle flush errors properly
rapidio/tsi721_dma: fix failure to obtain transaction descriptor
cfg80211: fix mic_failure tracing
ARM: 8115/1: LPAE: reduce damage caused by idmap to virtual memory layout
crypto: af_alg - properly label AF_ALG socket
Linux 3.10.51
core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
x86/efi: Include a .bss section within the PE/COFF headers
s390/ptrace: fix PSW mask check
Fix gcc-4.9.0 miscompilation of load_balance() in scheduler
mm: hugetlb: fix copy_hugetlb_page_range()
x86_32, entry: Store badsys error code in %eax
hwmon: (smsc47m192) Fix temperature limit and vrm write operations
parisc: Remove SA_RESTORER define
coredump: fix the setting of PF_DUMPCORE
Input: fix defuzzing logic
slab_common: fix the check for duplicate slab names
slab_common: Do not check for duplicate slab names
tracing: Fix wraparound problems in "uptime" trace clock
blkcg: don't call into policy draining if root_blkg is already gone
ahci: add support for the Promise FastTrak TX8660 SATA HBA (ahci mode)
libata: introduce ata_host->n_tags to avoid oops on SAS controllers
libata: support the ata host which implements a queue depth less than 32
block: don't assume last put of shared tags is for the host
block: provide compat ioctl for BLKZEROOUT
media: tda10071: force modulation to QPSK on DVB-S
media: hdpvr: fix two audio bugs
Linux 3.10.50
ARC: Implement ptrace(PTRACE_GET_THREAD_AREA)
sched: Fix possible divide by zero in avg_atom() calculation
locking/mutex: Disable optimistic spinning on some architectures
PM / sleep: Fix request_firmware() error at resume
dm cache metadata: do not allow the data block size to change
dm thin metadata: do not allow the data block size to change
alarmtimer: Fix bug where relative alarm timers were treated as absolute
drm/radeon: avoid leaking edid data
drm/qxl: return IRQ_NONE if it was not our irq
drm/radeon: set default bl level to something reasonable
irqchip: gic: Fix core ID calculation when topology is read from DT
irqchip: gic: Add support for cortex a7 compatible string
ring-buffer: Fix polling on trace_pipe
mwifiex: fix Tx timeout issue
perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
ipv4: fix buffer overflow in ip_options_compile()
dns_resolver: Null-terminate the right string
dns_resolver: assure that dns_query() result is null-terminated
sunvnet: clean up objects created in vnet_new() on vnet_exit()
net: pppoe: use correct channel MTU when using Multilink PPP
net: sctp: fix information leaks in ulpevent layer
tipc: clear 'next'-pointer of message fragments before reassembly
be2net: set EQ DB clear-intr bit in be_open()
netlink: Fix handling of error from netlink_dump().
net: mvneta: Fix big endian issue in mvneta_txq_desc_csum()
net: mvneta: fix operation in 10 Mbit/s mode
appletalk: Fix socket referencing in skb
tcp: fix false undo corner cases
igmp: fix the problem when mc leave group
net: qmi_wwan: add two Sierra Wireless/Netgear devices
net: qmi_wwan: Add ID for Telewell TW-LTE 4G v2
ipv4: icmp: Fix pMTU handling for rare case
tcp: Fix divide by zero when pushing during tcp-repair
bnx2x: fix possible panic under memory stress
net: fix sparse warning in sk_dst_set()
ipv4: irq safe sk_dst_[re]set() and ipv4_sk_update_pmtu() fix
ipv4: fix dst race in sk_dst_get()
8021q: fix a potential memory leak
net: sctp: check proc_dointvec result in proc_sctp_do_auth
tcp: fix tcp_match_skb_to_sack() for unaligned SACK at end of an skb
ip_tunnel: fix ip_tunnel_lookup
shmem: fix splicing from a hole while it's punched
shmem: fix faulting into a hole, not taking i_mutex
shmem: fix faulting into a hole while it's punched
iwlwifi: dvm: don't enable CTS to self
igb: do a reset on SR-IOV re-init if device is down
hwmon: (adt7470) Fix writes to temperature limit registers
hwmon: (da9052) Don't use dash in the name attribute
hwmon: (da9055) Don't use dash in the name attribute
tracing: Add ftrace_trace_stack into __trace_puts/__trace_bputs
tracing: Fix graph tracer with stack tracer on other archs
fuse: handle large user and group ID
Bluetooth: Ignore H5 non-link packets in non-active state
Drivers: hv: util: Fix a bug in the KVP code
media: gspca_pac7302: Add new usb-id for Genius i-Look 317
usb: Check if port status is equal to RxDetect
Signed-off-by: Ian Maund <imaund@codeaurora.org>
Use the 'allow_attach' handler for the 'mem' cgroup to allow
non-root processes to add arbitrary processes to a 'mem' cgroup
if it has the CAP_SYS_NICE capability set.
Bug: 18260435
Change-Id: If7d37bf90c1544024c4db53351adba6a64966250
Signed-off-by: Rom Lemarchand <romlem@android.com>
Git-commit: cce78bc02ff0ea2d21e88e3438d65272b898aa35
Git-repo: https://android.googlesource.com/kernel/common.git
Signed-off-by: Ian Maund <imaund@codeaurora.org>
In a system like Android, a process with SYS_ADMIN rights
controls the system for things like moving process from
one cgroup to another. The native cgroup capabilities
are only allowed to execute by root user and not system.
While adding a new cgroup sub-system, one may override
and relax the permission so that 'system' can also control
cgroup. Here, memcg is one such cgroup sub system which
requires system level control for that.
Allow non-root processes to add arbitrary into 'memory'
cgroups if it has 'CAP_SYS_ADMIN' capability set.
Change-Id: I43d4468186f142c176cb5b5f060751bb1b160344
Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
commit 4942642080ea82d99ab5b653abb9a12b7ba31f4a upstream.
Commit 3812c8c8f395 ("mm: memcg: do not trap chargers with full
callstack on OOM") assumed that only a few places that can trigger a
memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
readahead. But there are many more and it's impractical to annotate
them all.
First of all, we don't want to invoke the OOM killer when the failed
allocation is gracefully handled, so defer the actual kill to the end of
the fault handling as well. This simplifies the code quite a bit for
added bonus.
Second, since a failed allocation might not be the abrupt end of the
fault, the memcg OOM handler needs to be re-entrant until the fault
finishes for subsequent allocation attempts. If an allocation is
attempted after the task already OOMed, allow it to bypass the limit so
that it can quickly finish the fault and invoke the OOM killer.
Reported-by: azurIt <azurit@pobox.sk>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3812c8c8f3953921ef18544110dafc3505c1ac62 upstream.
The memcg OOM handling is incredibly fragile and can deadlock. When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds. Comparably, any other task
that enters the charge path at this point will go to a waitqueue right
then and there and sleep until the OOM situation is resolved. The problem
is that these tasks may hold filesystem locks and the mmap_sem; locks that
the selected OOM victim may need to exit.
For example, in one reported case, the task invoking the OOM killer was
about to charge a page cache page during a write(), which holds the
i_mutex. The OOM killer selected a task that was just entering truncate()
and trying to acquire the i_mutex:
OOM invoking task:
mem_cgroup_handle_oom+0x241/0x3b0
mem_cgroup_cache_charge+0xbe/0xe0
add_to_page_cache_locked+0x4c/0x140
add_to_page_cache_lru+0x22/0x50
grab_cache_page_write_begin+0x8b/0xe0
ext3_write_begin+0x88/0x270
generic_file_buffered_write+0x116/0x290
__generic_file_aio_write+0x27c/0x480
generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
do_sync_write+0xea/0x130
vfs_write+0xf3/0x1f0
sys_write+0x51/0x90
system_call_fastpath+0x18/0x1d
OOM kill victim:
do_truncate+0x58/0xa0 # takes i_mutex
do_last+0x250/0xa30
path_openat+0xd7/0x440
do_filp_open+0x49/0xa0
do_sys_open+0x106/0x240
sys_open+0x20/0x30
system_call_fastpath+0x18/0x1d
The OOM handling task will retry the charge indefinitely while the OOM
killed task is not releasing any resources.
A similar scenario can happen when the kernel OOM killer for a memcg is
disabled and a userspace task is in charge of resolving OOM situations.
In this case, ALL tasks that enter the OOM path will be made to sleep on
the OOM waitqueue and wait for userspace to free resources or increase
the group's limit. But a userspace OOM handler is prone to deadlock
itself on the locks held by the waiting tasks. For example one of the
sleeping tasks may be stuck in a brk() call with the mmap_sem held for
writing but the userspace handler, in order to pick an optimal victim,
may need to read files from /proc/<pid>, which tries to acquire the same
mmap_sem for reading and deadlocks.
This patch changes the way tasks behave after detecting a memcg OOM and
makes sure nobody loops or sleeps with locks held:
1. When OOMing in a user fault, invoke the OOM killer and restart the
fault instead of looping on the charge attempt. This way, the OOM
victim can not get stuck on locks the looping task may hold.
2. When OOMing in a user fault but somebody else is handling it
(either the kernel OOM killer or a userspace handler), don't go to
sleep in the charge context. Instead, remember the OOMing memcg in
the task struct and then fully unwind the page fault stack with
-ENOMEM. pagefault_out_of_memory() will then call back into the
memcg code to check if the -ENOMEM came from the memcg, and then
either put the task to sleep on the memcg's OOM waitqueue or just
restart the fault. The OOM victim can no longer get stuck on any
lock a sleeping task may hold.
Debugged by Michal Hocko.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: azurIt <azurit@pobox.sk>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit fb2a6fc56be66c169f8b80e07ed999ba453a2db2 upstream.
The memcg OOM handler open-codes a sleeping lock for OOM serialization
(trylock, wait, repeat) because the required locking is so specific to
memcg hierarchies. However, it would be nice if this construct would be
clearly recognizable and not be as obfuscated as it is right now. Clean
up as follows:
1. Remove the return value of mem_cgroup_oom_unlock()
2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock().
3. Pull the prepare_to_wait() out of the memcg_oom_lock scope. This
makes it more obvious that the task has to be on the waitqueue
before attempting to OOM-trylock the hierarchy, to not miss any
wakeups before going to sleep. It just didn't matter until now
because it was all lumped together into the global memcg_oom_lock
spinlock section.
4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope.
It is proctected by the hierarchical OOM-lock.
5. The memcg_oom_lock spinlock is only required to propagate the OOM
lock in any given hierarchy atomically. Restrict its scope to
mem_cgroup_oom_(trylock|unlock).
6. Do not wake up the waitqueue unconditionally at the end of the
function. Only the lockholder has to wake up the next in line
after releasing the lock.
Note that the lockholder kicks off the OOM-killer, which in turn
leads to wakeups from the uncharges of the exiting task. But a
contender is not guaranteed to see them if it enters the OOM path
after the OOM kills but before the lockholder releases the lock.
Thus there has to be an explicit wakeup after releasing the lock.
7. Put the OOM task on the waitqueue before marking the hierarchy as
under OOM as that is the point where we start to receive wakeups.
No point in listening before being on the waitqueue.
8. Likewise, unmark the hierarchy before finishing the sleep, for
symmetry.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: azurIt <azurit@pobox.sk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 519e52473ebe9db5cdef44670d5a97f1fd53d721 upstream.
System calls and kernel faults (uaccess, gup) can handle an out of memory
situation gracefully and just return -ENOMEM.
Enable the memcg OOM killer only for user faults, where it's really the
only option available.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: azurIt <azurit@pobox.sk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4fb1a86fb5e4209a7d4426d4e586c58e9edc74ac upstream.
Sometimes the cleanup after memcg hierarchy testing gets stuck in
mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
There may turn out to be several causes, but a major cause is this: the
workitem to offline parent can get run before workitem to offline child;
parent's mem_cgroup_reparent_charges() circles around waiting for the
child's pages to be reparented to its lrus, but it's holding
cgroup_mutex which prevents the child from reaching its
mem_cgroup_reparent_charges().
Further testing showed that an ordered workqueue for cgroup_destroy_wq
is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
stage on the way can mess up the order before reaching the workqueue.
Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on
all its children (and grandchildren, in the correct order) to have their
charges reparented first.
[The version for 3.10.34 (or perhaps now 3.10.35) is this below.
Yes, more differences, and the old mem_cgroup_reparent_charges line
is intentionally left in for 3.10 whereas it was removed for 3.12+:
that's because the css/cgroup iterator changed in between, it used
not to supply the root of the subtree, but nowadays it does - Hugh]
Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ecc736fc3c71c411a9d201d8588c9e7e049e5d8c upstream.
Hugh has reported an endless loop when the hardlimit reclaim sees the
same group all the time. This might happen when the reclaim races with
the memcg removal.
shrink_zone
[rmdir root]
mem_cgroup_iter(root, NULL, reclaim)
// prev = NULL
rcu_read_lock()
mem_cgroup_iter_load
last_visited = iter->last_visited // gets root || NULL
css_tryget(last_visited) // failed
last_visited = NULL [1]
memcg = root = __mem_cgroup_iter_next(root, NULL)
mem_cgroup_iter_update
iter->last_visited = root;
reclaim->generation = iter->generation
mem_cgroup_iter(root, root, reclaim)
// prev = root
rcu_read_lock
mem_cgroup_iter_load
last_visited = iter->last_visited // gets root
css_tryget(last_visited) // failed
[1]
The issue seemed to be introduced by commit 5f57816197 ("memcg: relax
memcg iter caching") which has replaced unconditional css_get/css_put by
css_tryget/css_put for the cached iterator.
This patch fixes the issue by skipping css_tryget on the root of the
tree walk in mem_cgroup_iter_load and symmetrically doesn't release it
in mem_cgroup_iter_update.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Hugh Dickins <hughd@google.com>
Tested-by: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: <stable@vger.kernel.org> [3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 2bff24a3707093c435ab3241c47dcdb5f16e432b upstream.
A memory cgroup with (1) multiple threshold notifications and (2) at least
one threshold >=2G was not reliable. Specifically the notifications would
either not fire or would not fire in the proper order.
The __mem_cgroup_threshold() signaling logic depends on keeping 64 bit
thresholds in sorted order. mem_cgroup_usage_register_event() sorts them
with compare_thresholds(), which returns the difference of two 64 bit
thresholds as an int. If the difference is positive but has bit[31] set,
then sort() treats the difference as negative and breaks sort order.
This fix compares the two arbitrary 64 bit thresholds returning the
classic -1, 0, 1 result.
The test below sets two notifications (at 0x1000 and 0x81001000):
cd /sys/fs/cgroup/memory
mkdir x
for x in 4096 2164264960; do
cgroup_event_listener x/memory.usage_in_bytes $x | sed "s/^/$x listener:/" &
done
echo $$ > x/cgroup.procs
anon_leaker 500M
v3.11-rc7 fails to signal the 4096 event listener:
Leaking...
Done leaking pages.
Patched v3.11-rc7 properly notifies:
Leaking...
4096 listener:2013:8:31:14:13:36
Done leaking pages.
The fixed bug is old. It appears to date back to the introduction of
memcg threshold notifications in v2.6.34-rc1-116-g2e72b6347c94 "memcg:
implement memory thresholds"
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3e6b11df245180949938734bc192eaf32f3a06b3 upstream.
struct memcg_cache_params has a union. Different parts of this union
are used for root and non-root caches. A part with destroying work is
used only for non-root caches.
I fixed the same problem in another place v3.9-rc1-16204-gf101a94, but
didn't notice this one.
This patch fixes the kernel panic:
[ 46.848187] BUG: unable to handle kernel paging request at 000000fffffffeb8
[ 46.849026] IP: [<ffffffff811a484c>] kmem_cache_destroy_memcg_children+0x6c/0xc0
[ 46.849092] PGD 0
[ 46.849092] Oops: 0000 [#1] SMP
...
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit f37a96914d1aea10fed8d9af10251f0b9caea31b upstream.
mem_cgroup_css_online calls mem_cgroup_put if memcg_init_kmem fails.
This is not correct because only memcg_propagate_kmem takes an
additional reference while mem_cgroup_sockets_init is allowed to fail as
well (although no current implementation fails) but it doesn't take any
reference. This all suggests that it should be memcg_propagate_kmem
that should clean up after itself so this patch moves mem_cgroup_put
over there.
Unfortunately this is not that easy (as pointed out by Li Zefan) because
memcg_kmem_mark_dead marks the group dead (KMEM_ACCOUNTED_DEAD) if it is
marked active (KMEM_ACCOUNTED_ACTIVE) which is the case even if
memcg_propagate_kmem fails so the additional reference is dropped in
that case in kmem_cgroup_destroy which means that the reference would be
dropped two times.
The easiest way then would be to simply remove mem_cgrroup_put from
mem_cgroup_css_online and rely on kmem_cgroup_destroy doing the right
thing.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The lockless reclaim hierarchy iterator currently has a misplaced
barrier that can lead to use-after-free crashes.
The reclaim hierarchy iterator consist of a sequence count and a
position pointer that are read and written locklessly, with memory
barriers enforcing ordering.
The write side sets the position pointer first, then updates the
sequence count to "publish" the new position. Likewise, the read side
must read the sequence count first, then the position. If the sequence
count is up to date, it's guaranteed that the position is up to date as
well:
writer: reader:
iter->position = position if iter->sequence == expected:
smp_wmb() smp_rmb()
iter->sequence = sequence position = iter->position
However, the read side barrier is currently misplaced, which can lead to
dereferencing stale position pointers that no longer point to valid
memory. Fix this.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: <stable@kernel.org> [3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 0c59b89c81 ("mm: memcg: push down PageSwapCache check into
uncharge entry functions") added a VM_BUG_ON() on PageSwapCache in the
uncharge path after checking that page flag once, assuming that the
state is stable in all paths, but this is not the case and the condition
triggers in user environments. An uncharge after the last page table
reference to the page goes away can race with reclaim adding the page to
swap cache.
Swap cache pages are usually uncharged when they are freed after
swapout, from a path that also handles swap usage accounting and memcg
lifetime management. However, since the last page table reference is
gone and thus no references to the swap slot left, the swap slot will be
freed shortly when reclaim attempts to write the page to disk. The
whole swap accounting is not even necessary.
So while the race condition for which this VM_BUG_ON was added is real
and actually existed all along, there are no negative effects. Remove
the VM_BUG_ON again.
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reported-by: Lingzhu Xiang <lxiang@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This exports the amount of anonymous transparent hugepages for each
memcg via the new "rss_huge" stat in memory.stat. The units are in
bytes.
This is helpful to determine the hugepage utilization for individual
jobs on the system in comparison to rss and opportunities where
MADV_HUGEPAGE may be helpful.
The amount of anonymous transparent hugepages is also included in "rss"
for backwards compatibility.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull cgroup updates from Tejun Heo:
- Fixes and a lot of cleanups. Locking cleanup is finally complete.
cgroup_mutex is no longer exposed to individual controlelrs which
used to cause nasty deadlock issues. Li fixed and cleaned up quite a
bit including long standing ones like racy cgroup_path().
- device cgroup now supports proper hierarchy thanks to Aristeu.
- perf_event cgroup now supports proper hierarchy.
- A new mount option "__DEVEL__sane_behavior" is added. As indicated
by the name, this option is to be used for development only at this
point and generates a warning message when used. Unfortunately,
cgroup interface currently has too many brekages and inconsistencies
to implement a consistent and unified hierarchy on top. The new flag
is used to collect the behavior changes which are necessary to
implement consistent unified hierarchy. It's likely that this flag
won't be used verbatim when it becomes ready but will be enabled
implicitly along with unified hierarchy.
The option currently disables some of broken behaviors in cgroup core
and also .use_hierarchy switch in memcg (will be routed through -mm),
which can be used to make very unusual hierarchy where nesting is
partially honored. It will also be used to implement hierarchy
support for blk-throttle which would be impossible otherwise without
introducing a full separate set of control knobs.
This is essentially versioning of interface which isn't very nice but
at this point I can't see any other options which would allow keeping
the interface the same while moving towards hierarchy behavior which
is at least somewhat sane. The planned unified hierarchy is likely
to require some level of adaptation from userland anyway, so I think
it'd be best to take the chance and update the interface such that
it's supportable in the long term.
Maintaining the existing interface does complicate cgroup core but
shouldn't put too much strain on individual controllers and I think
it'd be manageable for the foreseeable future. Maybe we'll be able
to drop it in a decade.
Fix up conflicts (including a semantic one adding a new #include to ppc
that was uncovered by header the file changes) as per Tejun.
* 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits)
cpuset: fix compile warning when CONFIG_SMP=n
cpuset: fix cpu hotplug vs rebuild_sched_domains() race
cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
cgroup: restore the call to eventfd->poll()
cgroup: fix use-after-free when umounting cgroupfs
cgroup: fix broken file xattrs
devcg: remove parent_cgroup.
memcg: force use_hierarchy if sane_behavior
cgroup: remove cgrp->top_cgroup
cgroup: introduce sane_behavior mount option
move cgroupfs_root to include/linux/cgroup.h
cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
cgroup: make cgroup_path() not print double slashes
Revert "cgroup: remove bind() method from cgroup_subsys."
perf: make perf_event cgroup hierarchical
cgroup: implement cgroup_is_descendant()
cgroup: make sure parent won't be destroyed before its children
cgroup: remove bind() method from cgroup_subsys.
devcg: remove broken_hierarchy tag
cgroup: remove cgroup_lock_is_held()
...
The memcg is not referenced, so it can be destroyed at anytime right
after we exit rcu read section, so it's not safe to access it.
To fix this, we call css_tryget() to get a reference while we're still
in rcu read section.
This also removes a bogus comment above __memcg_create_cache_enqueue().
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A memcg may livelock when oom if the process that grabs the hierarchy's
oom lock is never the first process with PF_EXITING set in the memcg's
task iteration.
The oom killer, both global and memcg, will defer if it finds an
eligible process that is in the process of exiting and it is not being
ptraced. The idea is to allow it to exit without using memory reserves
before needlessly killing another process.
This normally works fine except in the memcg case with a large number of
threads attached to the oom memcg. In this case, the memcg oom killer
only gets called for the process that grabs the hierarchy's oom lock;
all others end up blocked on the memcg's oom waitqueue. Thus, if the
process that grabs the hierarchy's oom lock is never the first
PF_EXITING process in the memcg's task iteration, the oom killer is
constantly deferred without anything making progress.
The fix is to give PF_EXITING processes access to memory reserves so
that we've marked them as oom killed without any iteration. This allows
__mem_cgroup_try_charge() to succeed so that the process may exit. This
makes the memcg oom killer exemption for TIF_MEMDIE tasks, now
immediately granted for processes with pending SIGKILLs and those in the
exit path, to be equivalent to what is done for the global oom killer.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This might cause a use-after-free bug.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With this patch userland applications that want to maintain the
interactivity/memory allocation cost can use the pressure level
notifications. The levels are defined like this:
The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache level. Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shutdown unimportant services).
The "medium" level means that the system is experiencing medium memory
pressure, the system might be making swap, paging out active file
caches, etc. Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.
The "critical" level means that the system is actively thrashing, it is
about to out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.
The events are propagated upward until the event is handled, i.e. the
events are not pass-through. Here is what this means: for example you
have three cgroups: A->B->C. Now you set up an event listener on
cgroups A, B and C, and suppose group C experiences some pressure. In
this situation, only group C will receive the notification, i.e. groups
A and B will not receive it. This is done to avoid excessive
"broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing. So, organize the
cgroups wisely, or propagate the events manually (or, ask us to
implement the pass-through events, explaining why would you need them.)
Performance wise, the memory pressure notifications feature itself is
lightweight and does not require much of bookkeeping, in contrast to the
rest of memcg features. Unfortunately, as of current memcg
implementation, pages accounting is an inseparable part and cannot be
turned off. The good news is that there are some efforts[1] to improve
the situation; plus, implementing the same, fully API-compatible[2]
interface for CONFIG_MEMCG=n case (e.g. embedded) is also a viable
option, so it will not require any changes on the userland side.
[1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
[2] http://lkml.org/lkml/2013/2/21/454
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_CGROPUPS=n warnings]
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Just a trivial issue I stumbled on while doing something else...
Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 2d11085e40 ("memcg: do not create memsw files if swap
accounting is disabled") memsw files are created only if memcg swap
accounting is enabled so it doesn't make any sense to check for it
explicitly in mem_cgroup_read(), mem_cgroup_write() and
mem_cgroup_reset().
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_iter basically does two things currently. It takes care of
the house keeping (reference counting, raclaim cookie) and it iterates
through a hierarchy tree (by using cgroup generic tree walk). The code
would be much more easier to follow if we move the iteration outside of
the function (to __mem_cgrou_iter_next) so the distinction is more
clear. This patch doesn't introduce any functional changes.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Ying Han <yinghan@google.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current implementation of mem_cgroup_iter has to consider both css
and memcg to find out whether no group has been found (css==NULL - aka
the loop is completed) and that no memcg is associated with the found
node (!memcg - aka css_tryget failed because the group is no longer
alive). This leads to awkward tweaks like tests for css && !memcg to
skip the current node.
It will be much easier if we got rid off css variable altogether and
only rely on memcg. In order to do that the iteration part has to skip
dead nodes. This sounds natural to me and as a nice side effect we will
get a simple invariant that memcg is always alive when non-NULL and all
nodes have been visited otherwise.
We could get rid of the surrounding while loop but keep it in for now to
make review easier. It will go away in the following patch.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Ying Han <yinghan@google.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that the per-node-zone-priority iterator caches memory cgroups
rather than their css ids we have to be careful and remove them from the
iterator when they are on the way out otherwise they might live for
unbounded amount of time even though their group is already gone (until
the global/targeted reclaim triggers the zone under priority to find out
the group is dead and let it to find the final rest).
We can fix this issue by relaxing rules for the last_visited memcg.
Instead of taking a reference to the css before it is stored into
iter->last_visited we can just store its pointer and track the number of
removed groups from each memcg's subhierarchy.
This number would be stored into iterator everytime when a memcg is
cached. If the iter count doesn't match the curent walker root's one we
will start from the root again. The group counter is incremented
upwards the hierarchy every time a group is removed.
The iter_lock can be dropped because racing iterators cannot leak the
reference anymore as the reference count is not elevated for
last_visited when it is cached.
Locking rules got a bit complicated by this change though. The iterator
primarily relies on rcu read lock which makes sure that once we see a
valid last_visited pointer then it will be valid for the whole RCU walk.
smp_rmb makes sure that dead_count is read before last_visited and
last_dead_count while smp_wmb makes sure that last_visited is updated
before last_dead_count so the up-to-date last_dead_count cannot point to
an outdated last_visited. css_tryget then makes sure that the
last_visited is still alive in case the iteration races with the cached
group removal (css is invalidated before mem_cgroup_css_offline
increments dead_count).
In short:
mem_cgroup_iter
rcu_read_lock()
dead_count = atomic_read(parent->dead_count)
smp_rmb()
if (dead_count != iter->last_dead_count)
last_visited POSSIBLY INVALID -> last_visited = NULL
if (!css_tryget(iter->last_visited))
last_visited DEAD -> last_visited = NULL
next = find_next(last_visited)
css_tryget(next)
css_put(last_visited) // css would be invalidated and parent->dead_count
// incremented if this was the last reference
iter->last_visited = next
smp_wmb()
iter->last_dead_count = dead_count
rcu_read_unlock()
cgroup_rmdir
cgroup_destroy_locked
atomic_add(CSS_DEACT_BIAS, &css->refcnt) // subsequent css_tryget fail
mem_cgroup_css_offline
mem_cgroup_invalidate_reclaim_iterators
while(parent = parent_mem_cgroup)
atomic_inc(parent->dead_count)
css_put(css) // last reference held by cgroup core
Spotted by Ying Han.
Original idea from Johannes Weiner.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ying Han <yinghan@google.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_iter curently relies on css->id when walking down a group
hierarchy tree. This is really awkward because the tree walk depends on
the groups creation ordering. The only guarantee is that a parent node is
visited before its children.
Example:
1) mkdir -p a a/d a/b/c
2) mkdir -a a/b/c a/d
Will create the same trees but the tree walks will be different:
1) a, d, b, c
2) a, b, c, d
Commit 574bd9f7c7 ("cgroup: implement generic child / descendant walk
macros") has introduced generic cgroup tree walkers which provide either
pre-order or post-order tree walk. This patch converts css->id based
iteration to pre-order tree walk to keep the semantic with the original
iterator where parent is always visited before its subtree.
cgroup_for_each_descendant_pre suggests using post_create and
pre_destroy for proper synchronization with groups addidition resp.
removal. This implementation doesn't use those because a new memory
cgroup is initialized sufficiently for iteration in mem_cgroup_css_alloc
already and css reference counting enforces that the group is alive for
both the last seen cgroup and the found one resp. it signals that the
group is dead and it should be skipped.
If the reclaim cookie is used we need to store the last visited group
into the iterator so we have to be careful that it doesn't disappear in
the mean time. Elevated reference count on the css keeps it alive even
though the group have been removed (parked waiting for the last dput so
that it can be freed).
Per node-zone-prio iter_lock has been introduced to ensure that
css_tryget and iter->last_visited is set atomically. Otherwise two
racing walkers could both take a references and only one release it
leading to a css leak (which pins cgroup dentry).
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Ying Han <yinghan@google.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patchset tries to make mem_cgroup_iter saner in the way how it walks
hierarchies. css->id based traversal is far from being ideal as it is not
deterministic because it depends on the creation ordering. Additional to
that css_id is considered a burden for cgroup maintainers because it is
quite some code and memcg is the last user of it. After this series only
the swap accounting uses css_id but that one will follow up later.
Diffstat (if we exclude removed/added comments) looks quite
promising. We got rid of some code:
$ git diff mmotm... | grep -v "^[+-][[:space:]]*[/ ]\*" | diffstat
b/include/linux/cgroup.h | 3 ---
kernel/cgroup.c | 33 ---------------------------------
mm/memcontrol.c | 4 +++-
3 files changed, 3 insertions(+), 37 deletions(-)
The first patch is just preparatory and it changes when we release css of
the previously returned memcg. Nothing controlversial.
The second patch is the core of the patchset and it replaces css_get_next
based on css_id by the generic cgroup pre-order. This brings some
chalanges for the last visited group caching during the reclaim
(mem_cgroup_per_zone::reclaim_iter). We have to use memcg pointers
directly now which means that we have to keep a reference to those groups'
css to keep them alive.
I also folded iter_lock introduced by https://lkml.org/lkml/2013/1/3/295
in the previous version into this patch. Johannes felt the race I was
describing should be mostly harmless and I haven't been able to trigger it
so the lock doesn't deserve its own patch. It is still needed
temporarily, though, because the reference counting on iter->last_visited
depends on it. It will go away with the next patch.
The next patch fixups an unbounded cgroup removal holdoff caused by the
elevated css refcount. The issue has been observed by Ying Han. Johannes
wasn't impressed by the previous version of the fix
(https://lkml.org/lkml/2013/2/8/379) which cleaned up pending references
during mem_cgroup_css_offline when a group is removed. He has suggested a
different way when the iterator checks whether a cached memcg is still
valid or no. More on that in the patch but the basic idea is that every
memcg tracks the number removed subgroups and iterator records this number
when a group is cached. These numbers are checked before
iter->last_visited is about to be used and the iteration is restarted if
it is invalid.
The fourth and fifth patches are an attempt for simplification of the
mem_cgroup_iter. css juggling is removed and the iteration logic is moved
to a helper so that the reference counting and iteration are separated.
The last patch just removes css_get_next as there is no user for it any
longer.
My testing looked as follows:
A (use_hierarchy=1, limit_in_bytes=150M)
/|\
1 2 3
Children groups were created so that the number is never higher than 3 and
their limits were random between 50-100M. Each group hosts a kernel build
(starting with tar -xf so the tree is not shared and make -jNUM_CPUs/3)
and terminated after random time - up to 5 minutes) and then it is
removed.
This should exercise both leaf and hierarchical reclaim as well as races
with cgroup removals and debugging messages I added on top proved that.
100 groups were created during the test.
This patch:
css reference counting keeps the cgroup alive even though it has been
already removed. mem_cgroup_iter relies on this fact and takes a
reference to the returned group. The reference is then released on the
next iteration or mem_cgroup_iter_break. mem_cgroup_iter currently
releases the reference right after it gets the last css_id.
This is correct because neither prev's memcg nor cgroup are accessed after
then. This will change in the next patch so we need to hold the group
alive a bit longer so let's move the css_put at the end of the function.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Ying Han <yinghan@google.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Turn on use_hierarchy by default if sane_behavior is specified and
don't create .use_hierarchy file.
It is debatable whether to remove .use_hierarchy file or make it ro as
the former could make transition easier in certain cases; however, the
behavior changes which will be gated by sane_behavior are intensive
including changing basic meaning of certain control knobs in a few
controllers and I don't really think keeping this piece would make
things easier in any noticeable way, so let's remove it.
v2: Explain that mem_cgroup_bind() doesn't have to worry about
children as suggested by Michal Hocko.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
As cgroup supports rename, it's unsafe to dereference dentry->d_name
without proper vfs locks. Fix this by using cgroup_name() rather than
dentry directly.
Also open code memcg_cache_name because it is called only from
kmem_cache_dup which frees the returned name right after
kmem_cache_create_memcg makes a copy of it. Such a short-lived
allocation doesn't make too much sense. So replace it by a static
buffer as kmem_cache_dup is called with memcg_cache_mutex.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Whilst I run the risk of a flogging for disloyalty to the Lord of Sealand,
I do have CONFIG_MEMCG=y CONFIG_MEMCG_KMEM not set, and grow tired of the
"mm/memcontrol.c:4972:12: warning: `memcg_propagate_kmem' defined but not
used [-Wunused-function]" seen in 3.8-rc: move the #ifdef outwards.
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should encourage all memcg controller initialization independent on a
specific mem_cgroup to be done here rather than exploit css_alloc
callback and assume that nothing happens before root cgroup is created.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memcg_stock are currently initialized during the root cgroup allocation
which is OK but it pointlessly pollutes memcg allocation code with
something that can be called when the memcg subsystem is initialized by
mem_cgroup_init along with other controller specific parts.
This patch wraps the current memcg_stock initialization code into a
helper calls it from the controller subsystem initialization code.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Per-node-zone soft limit tree is currently initialized when the root
cgroup is created which is OK but it pointlessly pollutes memcg
allocation code with something that can be called when the memcg
subsystem is initialized by mem_cgroup_init along with other controller
specific parts.
While we are at it let's make mem_cgroup_soft_limit_tree_init void
because it doesn't make much sense to report memory failure because if
we fail to allocate memory that early during the boot then we are
screwed anyway (this saves some code).
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
An inactive file list is considered low when its active counterpart is
bigger, regardless of whether it is a global zone LRU list or a memcg
zone LRU list. The only difference is in how the LRU size is assessed.
get_lru_size() does the right thing for both global and memcg reclaim
situations.
Get rid of inactive_file_is_low_global() and
mem_cgroup_inactive_file_is_low() by using get_lru_size() and compare
the numbers in common code.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When use_hierarchy is enabled, we acquire an extra reference count in
our parent during cgroup creation. We don't release it, though, if any
failure exist in the creation process.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Reported-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We were deferring the kmemcg static branch increment to a later time,
due to a nasty dependency between the cpu_hotplug lock, taken by the
jump label update, and the cgroup_lock.
Now we no longer take the cgroup lock, and we can save ourselves the
trouble.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After the preparation work done in earlier patches, the cgroup_lock can
be trivially replaced with a memcg-specific lock. This is an automatic
translation at every site where the values involved were queried.
The sites where values are written, however, used to be naturally called
under cgroup_lock. This is the case for instance in the css_online
callback. For those, we now need to explicitly add the memcg lock.
With this, all the calls to cgroup_lock outside cgroup core are gone.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, we use cgroups' provided list of children to verify if it is
safe to proceed with any value change that is dependent on the cgroup
being empty.
This is less than ideal, because it enforces a dependency over cgroup
core that we would be better off without. The solution proposed here is
to iterate over the child cgroups and if any is found that is already
online, we bounce and return: we don't really care how many children we
have, only if we have any.
This is also made to be hierarchy aware. IOW, cgroups with hierarchy
disabled, while they still exist, will be considered for the purpose of
this interface as having no children.
[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is a preparatory work for later locking rework to get rid of
big cgroup lock from memory controller code.
The memory controller uses some tunables to adjust its operation. Those
tunables are inherited from parent to children upon children
intialization. For most of them, the value cannot be changed after the
parent has a new children.
cgroup core splits initialization in two phases: css_alloc and css_online.
After css_alloc, the memory allocation and basic initialization are done.
But the new group is not yet visible anywhere, not even for cgroup core
code. It is only somewhere between css_alloc and css_online that it is
inserted into the internal children lists. Copying tunable values in
css_alloc will lead to inconsistent values: the children will copy the old
parent values, that can change between the copy and the moment in which
the groups is linked to any data structure that can indicate the presence
of children.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In memcg, we use the cgroup_lock basically to synchronize against
attaching new children to a cgroup. We do this because we rely on
cgroup core to provide us with this information.
We need to guarantee that upon child creation, our tunables are
consistent. For those, the calls to cgroup_lock() all live in handlers
like mem_cgroup_hierarchy_write(), where we change a tunable in the
group that is hierarchy-related. For instance, the use_hierarchy flag
cannot be changed if the cgroup already have children.
Furthermore, those values are propagated from the parent to the child
when a new child is created. So if we don't lock like this, we can end
up with the following situation:
A B
memcg_css_alloc() mem_cgroup_hierarchy_write()
copy use hierarchy from parent change use hierarchy in parent
finish creation.
This is mainly because during create, we are still not fully connected
to the css tree. So all iterators and the such that we could use, will
fail to show that the group has children.
My observation is that all of creation can proceed in parallel with
those tasks, except value assignment. So what this patch series does is
to first move all value assignment that is dependent on parent values
from css_alloc to css_online, where the iterators all work, and then we
lock only the value assignment. This will guarantee that parent and
children always have consistent values. Together with an online test,
that can be derived from the observation that the refcount of an online
memcg can be made to be always positive, we should be able to
synchronize our side without the cgroup lock.
This patch:
Currently, we rely on the cgroup_lock() to prevent changes to
move_charge_at_immigrate during task migration. However, this is only
needed because the current strategy keeps checking this value throughout
the whole process. Since all we need is serialization, one needs only
to guarantee that whatever decision we made in the beginning of a
specific migration is respected throughout the process.
We can achieve this by just saving it in mc. By doing this, no kind of
locking is needed.
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to maintain all the memcg bookkeeping, we need per-node
descriptors, which will in turn contain a per-zone descriptor.
Because we want to statically allocate those, this array ends up being
very big. Part of the reason is that we allocate something large enough
to hold MAX_NUMNODES, the compile time constant that holds the maximum
number of nodes we would ever consider.
However, we can do better in some cases if the firmware help us. This
is true for modern x86 machines; coincidentally one of the architectures
in which MAX_NUMNODES tends to be very big.
By using the firmware-provided maximum number of nodes instead of
MAX_NUMNODES, we can reduce the memory footprint of struct memcg
considerably. In the extreme case in which we have only one node, this
reduces the size of the structure from ~ 64k to ~2k. This is
particularly important because it means that we will no longer resort to
the vmalloc area for the struct memcg on defconfigs. We also have
enough room for an extra node and still be outside vmalloc.
One also has to keep in mind that with the industry's ability to fit
more processors in a die as fast as the FED prints money, a nodes = 2
configuration is already respectably big.
[akpm@linux-foundation.org: add check for invalid nid, remove inline]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ying Han <yinghan@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memcg swap accounting is currently enabled by enable_swap_cgroup when
the root cgroup is created. mem_cgroup_init acts as a memcg subsystem
initializer which sounds like a much better place for enable_swap_cgroup
as well. We already register memsw files from there so it makes a lot
of sense to merge those two into a single enable_swap_cgroup function.
This patch doesn't introduce any semantic changes.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Zhouping Liu <zliu@redhat.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: CAI Qian <caiqian@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zhouping Liu has reported that memsw files are exported even though swap
accounting is runtime disabled if MEMCG_SWAP is enabled. This behavior
has been introduced by commit af36f906c0 ("memcg: always create memsw
files if CGROUP_MEM_RES_CTLR_SWAP") and it causes any attempt to open
the file to return EOPNOTSUPP. Although EOPNOTSUPP should say be clear
that memsw operations are not supported in the given configuration it is
fair to say that this behavior could be quite confusing.
Let's tear memsw files out of default cgroup files and add them only if
the swap accounting is really enabled (either by MEMCG_SWAP_ENABLED or
swapaccount=1 boot parameter). We can hook into mem_cgroup_init which
is called when the memcg subsystem is initialized and which happens
after boot command line is processed.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Zhouping Liu <zliu@redhat.com>
Tested-by: Zhouping Liu <zliu@redhat.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: CAI Qian <caiqian@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When I use several fast SSD to do swap, swapper_space.tree_lock is
heavily contended. This makes each swap partition have one
address_space to reduce the lock contention. There is an array of
address_space for swap. The swap entry type is the index to the array.
In my test with 3 SSD, this increases the swapout throughput 20%.
[akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache]
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>