Commit Graph

574 Commits

Author SHA1 Message Date
Martijn Coenen ba88acf8c2 UPSTREAM: memcg: Only free spare array when readers are done
A spare array holding mem cgroup threshold events is kept around
to make sure we can always safely deregister an event and have an
array to store the new set of events in.

In the scenario where we're going from 1 to 0 registered events, the
pointer to the primary array containing 1 event is copied to the spare
slot, and then the spare slot is freed because no events are left.
However, it is freed before calling synchronize_rcu(), which means
readers may still be accessing threshold->primary after it is freed.

Fixed by only freeing after synchronize_rcu().

Change-Id: Iee3ad8eb400612ec24898832eb19ff34eb2aecb4
Signed-off-by: Martijn Coenen <maco@google.com>
2016-05-18 14:36:06 +05:30
Ian Maund 8b08aa9e75 This is the 3.10.67 stable release
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJUyuGRAAoJEDjbvchgkmk+7EwQALYPOeh+AManQFB1MQvFuOgZ
 /4ulpjhGXw/RPTKHMeyHo8vRfUhMOx8UPF62uql+g1l9b/Zt2bs6qXu4QcxRRsQc
 trSTUpi+U14y1hkgqOVOcFYP2ZaTjNEBQgLJ4eGn46CliLqme+rfoyRYm2GXzcR4
 6cbSAr3mufdFIpi9/8Dn62Gv0aws5lIv3qkHJXznyuux3tisPT5y6Ux2KJoivPn/
 SqADtRpwo+7lTjl15fE++9AqNsGMorV6toT2OO/7nXP+824psInKLmREAT2qC99b
 BG61vcYdxOuHtzmwrvCf1jSRjxhvZT0j2xhBr/vCKcxy08AT0vDv68zrV1r6TIuu
 U7/CKXtFBY95cjfnkTLJuswBSuIA/+sQHV6DaddH0V8fcZ6rQMLrblQ9ZcFFFkmT
 2SG6lmlXqZvcEKYGMnL/Dcow1rkRhB5stiGgTkYxjiRSRpzAHISRJ/GGpsT+rRqK
 HpBs5p9JshvRl7RWKwAu+DNGaEK1X/WYxc4/jw6dZFWX7lEWSMIPlr9zXgZCZ39y
 V6lV1VVlT9/CSs1swKHUyhHHehlFsnIlQ6Fkiycr/KkuqBLs92Hyb7WhpVa819yX
 osXdxSm6J54skiOLKYpBWHpnY09Tc+p28VEfMpErTExgp2oE8F34K7kdhoQPQb97
 2mHiXNa+J4CLUNQ+sRmw
 =HDBo
 -----END PGP SIGNATURE-----

Merge commit 'v3.10.67' into msm-3.10

This merge brings us up to date with upstream kernel.org tag v3.10.67.
It also contains changes to allow forbidden warnings introduced in
the commit 'core, nfqueue, openvswitch: Orphan frags in skb_zerocopy
and handle errors'. Once upstream has corrected these warnings, the
changes to scripts/gcc-wrapper.py, in this commit, can be reverted.

* commit 'v3.10.67' (915 commits)
  Linux 3.10.67
  md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants.
  ext4: fix warning in ext4_da_update_reserve_space()
  quota: provide interface for readding allocated space into reserved space
  crypto: add missing crypto module aliases
  crypto: include crypto- module prefix in template
  crypto: prefix module autoloading with "crypto-"
  drbd: merge_bvec_fn: properly remap bvm->bi_bdev
  Revert "swiotlb-xen: pass dev_addr to swiotlb_tbl_unmap_single"
  ipvs: uninitialized data with IP_VS_IPV6
  KEYS: close race between key lookup and freeing
  sata_dwc_460ex: fix resource leak on error path
  x86/asm/traps: Disable tracing and kprobes in fixup_bad_iret and sync_regs
  x86, tls: Interpret an all-zero struct user_desc as "no segment"
  x86, tls, ldt: Stop checking lm in LDT_empty
  x86/tsc: Change Fast TSC calibration failed from error to info
  x86, hyperv: Mark the Hyper-V clocksource as being continuous
  clocksource: exynos_mct: Fix bitmask regression for exynos4_mct_write
  can: dev: fix crtlmode_supported check
  bus: mvebu-mbus: fix support of MBus window 13
  ARM: dts: imx25: Fix PWM "per" clocks
  time: adjtimex: Validate the ADJ_FREQUENCY values
  time: settimeofday: Validate the values of tv from user
  dm cache: share cache-metadata object across inactive and active DM tables
  ipr: wait for aborted command responses
  drm/i915: Fix mutex->owner inspection race under DEBUG_MUTEXES
  scripts/recordmcount.pl: There is no -m32 gcc option on Super-H anymore
  ALSA: usb-audio: Add mic volume fix quirk for Logitech Webcam C210
  libata: prevent HSM state change race between ISR and PIO
  pinctrl: Fix two deadlocks
  gpio: sysfs: fix gpio device-attribute leak
  gpio: sysfs: fix gpio-chip device-attribute leak
  Linux 3.10.66
  s390/3215: fix tty output containing tabs
  s390/3215: fix hanging console issue
  fsnotify: next_i is freed during fsnotify_unmount_inodes.
  netfilter: ipset: small potential read beyond the end of buffer
  mmc: sdhci: Fix sleep in atomic after inserting SD card
  LOCKD: Fix a race when initialising nlmsvc_timeout
  x86, um: actually mark system call tables readonly
  um: Skip futex_atomic_cmpxchg_inatomic() test
  decompress_bunzip2: off by one in get_next_block()
  ARM: shmobile: sh73a0 legacy: Set .control_parent for all irqpin instances
  ARM: omap5/dra7xx: Fix frequency typos
  ARM: clk-imx6q: fix video divider for rev T0 1.0
  ARM: imx6q: drop unnecessary semicolon
  ARM: dts: imx25: Fix the SPI1 clocks
  Input: I8042 - add Acer Aspire 7738 to the nomux list
  Input: i8042 - reset keyboard to fix Elantech touchpad detection
  can: kvaser_usb: Don't send a RESET_CHIP for non-existing channels
  can: kvaser_usb: Reset all URB tx contexts upon channel close
  can: kvaser_usb: Don't free packets when tight on URBs
  USB: keyspan: fix null-deref at probe
  USB: cp210x: add IDs for CEL USB sticks and MeshWorks devices
  USB: cp210x: fix ID for production CEL MeshConnect USB Stick
  usb: dwc3: gadget: Stop TRB preparation after limit is reached
  usb: dwc3: gadget: Fix TRB preparation during SG
  OHCI: add a quirk for ULi M5237 blocking on reset
  gpiolib: of: Correct error handling in of_get_named_gpiod_flags
  NFSv4.1: Fix client id trunking on Linux
  ftrace/jprobes/x86: Fix conflict between jprobes and function graph tracing
  vfio-pci: Fix the check on pci device type in vfio_pci_probe()
  uvcvideo: Fix destruction order in uvc_delete()
  smiapp: Take mutex during PLL update in sensor initialisation
  af9005: fix kernel panic on init if compiled without IR
  smiapp-pll: Correct clock debug prints
  video/logo: prevent use of logos after they have been freed
  storvsc: ring buffer failures may result in I/O freeze
  iscsi-target: Fail connection on short sendmsg writes
  hp_accel: Add support for HP ZBook 15
  cfg80211: Fix 160 MHz channels with 80+80 and 160 MHz drivers
  ARC: [nsimosci] move peripherals to match model to FPGA
  drm/i915: Force the CS stall for invalidate flushes
  drm/i915: Invalidate media caches on gen7
  drm/radeon: properly filter DP1.2 4k modes on non-DP1.2 hw
  drm/radeon: check the right ring in radeon_evict_flags()
  drm/vmwgfx: Fix fence event code
  enic: fix rx skb checksum
  alx: fix alx_poll()
  tcp: Do not apply TSO segment limit to non-TSO packets
  tg3: tg3_disable_ints using uninitialized mailbox value to disable interrupts
  netlink: Don't reorder loads/stores before marking mmap netlink frame as available
  netlink: Always copy on mmap TX.
  Linux 3.10.65
  mm: Don't count the stack guard page towards RLIMIT_STACK
  mm: propagate error from stack expansion even for guard page
  mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed
  perf session: Do not fail on processing out of order event
  perf: Fix events installation during moving group
  perf/x86/intel/uncore: Make sure only uncore events are collected
  Btrfs: don't delay inode ref updates during log replay
  ARM: mvebu: disable I/O coherency on non-SMP situations on Armada 370/375/38x/XP
  scripts/kernel-doc: don't eat struct members with __aligned
  nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races
  nfsd4: fix xdr4 inclusion of escaped char
  fs: nfsd: Fix signedness bug in compare_blob
  serial: samsung: wait for transfer completion before clock disable
  writeback: fix a subtle race condition in I_DIRTY clearing
  cdc-acm: memory leak in error case
  genhd: check for int overflow in disk_expand_part_tbl()
  USB: cdc-acm: check for valid interfaces
  ALSA: hda - Fix wrong gpio_dir & gpio_mask hint setups for IDT/STAC codecs
  ALSA: hda - using uninitialized data
  ALSA: usb-audio: extend KEF X300A FU 10 tweak to Arcam rPAC
  driver core: Fix unbalanced device reference in drivers_probe
  x86, vdso: Use asm volatile in __getcpu
  x86_64, vdso: Fix the vdso address randomization algorithm
  HID: Add a new id 0x501a for Genius MousePen i608X
  HID: add battery quirk for USB_DEVICE_ID_APPLE_ALU_WIRELESS_2011_ISO keyboard
  HID: roccat: potential out of bounds in pyra_sysfs_write_settings()
  HID: i2c-hid: prevent buffer overflow in early IRQ
  HID: i2c-hid: fix race condition reading reports
  iommu/vt-d: Fix an off-by-one bug in __domain_mapping()
  UBI: Fix double free after do_sync_erase()
  UBI: Fix invalid vfree()
  pstore-ram: Allow optional mapping with pgprot_noncached
  pstore-ram: Fix hangs by using write-combine mappings
  PCI: Restore detection of read-only BARs
  ASoC: dwc: Ensure FIFOs are flushed to prevent channel swap
  ASoC: max98090: Fix ill-defined sidetone route
  ASoC: sigmadsp: Refuse to load firmware files with a non-supported version
  ath5k: fix hardware queue index assignment
  swiotlb-xen: pass dev_addr to swiotlb_tbl_unmap_single
  can: peak_usb: fix memset() usage
  can: peak_usb: fix cleanup sequence order in case of error during init
  ath9k: fix BE/BK queue order
  ath9k_hw: fix hardware queue allocation
  ocfs2: fix journal commit deadlock
  Linux 3.10.64
  Btrfs: fix fs corruption on transaction abort if device supports discard
  Btrfs: do not move em to modified list when unpinning
  eCryptfs: Remove buggy and unnecessary write in file name decode routine
  eCryptfs: Force RO mount when encrypted view is enabled
  udf: Verify symlink size before loading it
  exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
  ncpfs: return proper error from NCP_IOC_SETROOT ioctl
  crypto: af_alg - fix backlog handling
  userns: Unbreak the unprivileged remount tests
  userns: Allow setting gid_maps without privilege when setgroups is disabled
  userns: Add a knob to disable setgroups on a per user namespace basis
  userns: Rename id_map_mutex to userns_state_mutex
  userns: Only allow the creator of the userns unprivileged mappings
  userns: Check euid no fsuid when establishing an unprivileged uid mapping
  userns: Don't allow unprivileged creation of gid mappings
  userns: Don't allow setgroups until a gid mapping has been setablished
  userns: Document what the invariant required for safe unprivileged mappings.
  groups: Consolidate the setgroups permission checks
  umount: Disallow unprivileged mount force
  mnt: Update unprivileged remount test
  mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount
  mac80211: free management frame keys when removing station
  mac80211: fix multicast LED blinking and counter
  KEYS: Fix stale key registration at error path
  isofs: Fix unchecked printing of ER records
  x86/tls: Don't validate lm in set_thread_area() after all
  dm space map metadata: fix sm_bootstrap_get_nr_blocks()
  dm bufio: fix memleak when using a dm_buffer's inline bio
  nfs41: fix nfs4_proc_layoutget error handling
  megaraid_sas: corrected return of wait_event from abort frame path
  mmc: block: add newline to sysfs display of force_ro
  mfd: tc6393xb: Fail ohci suspend if full state restore is required
  md/bitmap: always wait for writes on unplug.
  x86, kvm: Clear paravirt_enabled on KVM guests for espfix32's benefit
  x86_64, switch_to(): Load TLS descriptors before switching DS and ES
  x86/tls: Disallow unusual TLS segments
  x86/tls: Validate TLS entries to protect espfix
  isofs: Fix infinite looping over CE entries
  Linux 3.10.63
  ALSA: usb-audio: Don't resubmit pending URBs at MIDI error recovery
  powerpc: 32 bit getcpu VDSO function uses 64 bit instructions
  ARM: sched_clock: Load cycle count after epoch stabilizes
  igb: bring link up when PHY is powered up
  ext2: Fix oops in ext2_get_block() called from ext2_quota_write()
  nEPT: Nested INVEPT
  net: sctp: use MAX_HEADER for headroom reserve in output path
  net: mvneta: fix Tx interrupt delay
  rtnetlink: release net refcnt on error in do_setlink()
  net/mlx4_core: Limit count field to 24 bits in qp_alloc_res
  tg3: fix ring init when there are more TX than RX channels
  ipv6: gre: fix wrong skb->protocol in WCCP
  sata_fsl: fix error handling of irq_of_parse_and_map
  ahci: disable MSI on SAMSUNG 0xa800 SSD
  AHCI: Add DeviceIDs for Sunrise Point-LP SATA controller
  media: smiapp: Only some selection targets are settable
  drm/i915: Unlock panel even when LVDS is disabled
  drm/radeon: kernel panic in drm_calc_vbltimestamp_from_scanoutpos with 3.18.0-rc6
  i2c: davinci: generate STP always when NACK is received
  i2c: omap: fix i207 errata handling
  i2c: omap: fix NACK and Arbitration Lost irq handling
  xen-netfront: Remove BUGs on paged skb data which crosses a page boundary
  mm: fix swapoff hang after page migration and fork
  mm: frontswap: invalidate expired data on a dup-store failure
  Linux 3.10.62
  nfsd: Fix ACL null pointer deref
  powerpc/powernv: Honor the generic "no_64bit_msi" flag
  bnx2fc: do not add shared skbs to the fcoe_rx_list
  nfsd4: fix leak of inode reference on delegation failure
  nfsd: Fix slot wake up race in the nfsv4.1 callback code
  rt2x00: do not align payload on modern H/W
  can: dev: avoid calling kfree_skb() from interrupt context
  spi: dw: Fix dynamic speed change.
  iser-target: Handle DEVICE_REMOVAL event on network portal listener correctly
  target: Don't call TFO->write_pending if data_length == 0
  srp-target: Retry when QP creation fails with ENOMEM
  Input: xpad - use proper endpoint type
  ARM: 8222/1: mvebu: enable strex backoff delay
  ARM: 8216/1: xscale: correct auxiliary register in suspend/resume
  ALSA: usb-audio: Add ctrl message delay quirk for Marantz/Denon devices
  can: esd_usb2: fix memory leak on disconnect
  USB: xhci: don't start a halted endpoint before its new dequeue is set
  usb-quirks: Add reset-resume quirk for MS Wireless Laser Mouse 6000
  usb: serial: ftdi_sio: add PIDs for Matrix Orbital products
  USB: serial: cp210x: add IDs for CEL MeshConnect USB Stick
  USB: keyspan: fix tty line-status reporting
  USB: keyspan: fix overrun-error reporting
  USB: ssu100: fix overrun-error reporting
  iio: Fix IIO_EVENT_CODE_EXTRACT_DIR bit mask
  powerpc/pseries: Fix endiannes issue in RTAS call from xmon
  powerpc/pseries: Honor the generic "no_64bit_msi" flag
  of/base: Fix PowerPC address parsing hack
  ASoC: wm_adsp: Avoid attempt to free buffers that might still be in use
  ASoC: sgtl5000: Fix SMALL_POP bit definition
  PCI/MSI: Add device flag indicating that 64-bit MSIs don't work
  ipx: fix locking regression in ipx_sendmsg and ipx_recvmsg
  pptp: fix stack info leak in pptp_getname()
  qmi_wwan: Add support for HP lt4112 LTE/HSPA+ Gobi 4G Modem
  ieee802154: fix error handling in ieee802154fake_probe()
  ipv4: Fix incorrect error code when adding an unreachable route
  inetdevice: fixed signed integer overflow
  sparc64: Fix constraints on swab helpers.
  uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME
  x86, mm: Set NX across entire PMD at boot
  x86: Require exact match for 'noxsave' command line option
  x86_64, traps: Rework bad_iret
  x86_64, traps: Stop using IST for #SS
  x86_64, traps: Fix the espfix64 #DF fixup and rewrite it in C
  MIPS: Loongson: Make platform serial setup always built-in.
  MIPS: oprofile: Fix backtrace on 64-bit kernel
  Linux 3.10.61
  mm: memcg: handle non-error OOM situations more gracefully
  mm: memcg: do not trap chargers with full callstack on OOM
  mm: memcg: rework and document OOM waiting and wakeup
  mm: memcg: enable memcg OOM killer only for user faults
  x86: finish user fault error path with fatal signal
  arch: mm: pass userspace fault flag to generic fault handler
  arch: mm: do not invoke OOM killer on kernel fault OOM
  arch: mm: remove obsolete init OOM protection
  mm: invoke oom-killer from remaining unconverted page fault handlers
  net: sctp: fix skb_over_panic when receiving malformed ASCONF chunks
  net: sctp: fix panic on duplicate ASCONF chunks
  net: sctp: fix remote memory pressure from excessive queueing
  KVM: x86: Don't report guest userspace emulation error to userspace
  SCSI: hpsa: fix a race in cmd_free/scsi_done
  net/mlx4_en: Fix BlueFlame race
  ARM: Correct BUG() assembly to ensure it is endian-agnostic
  perf/x86/intel: Use proper dTLB-load-misses event on IvyBridge
  mei: bus: fix possible boundaries violation
  perf: Handle compat ioctl
  MIPS: Fix forgotten preempt_enable() when CPU has inclusive pcaches
  dell-wmi: Fix access out of memory
  ARM: probes: fix instruction fetch order with <asm/opcodes.h>
  br: fix use of ->rx_handler_data in code executed on non-rx_handler path
  netfilter: nf_nat: fix oops on netns removal
  netfilter: xt_bpf: add mising opaque struct sk_filter definition
  netfilter: nf_log: release skbuff on nlmsg put failure
  netfilter: nfnetlink_log: fix maximum packet length logged to userspace
  netfilter: nf_log: account for size of NLMSG_DONE attribute
  ipc: always handle a new value of auto_msgmni
  clocksource: Remove "weak" from clocksource_default_clock() declaration
  kgdb: Remove "weak" from kgdb_arch_pc() declaration
  media: ttusb-dec: buffer overflow in ioctl
  NFSv4: Fix races between nfs_remove_bad_delegation() and delegation return
  nfs: Fix use of uninitialized variable in nfs_getattr()
  NFS: Don't try to reclaim delegation open state if recovery failed
  NFSv4: Ensure that we remove NFSv4.0 delegations when state has expired
  Input: alps - allow up to 2 invalid packets without resetting device
  Input: alps - ignore potential bare packets when device is out of sync
  dm raid: ensure superblock's size matches device's logical block size
  dm btree: fix a recursion depth bug in btree walking code
  block: Fix computation of merged request priority
  parisc: Use compat layer for msgctl, shmat, shmctl and semtimedop syscalls
  scsi: only re-lock door after EH on devices that were reset
  nfs: fix pnfs direct write memory leak
  firewire: cdev: prevent kernel stack leaking into ioctl arguments
  arm64: __clear_user: handle exceptions on strb
  ARM: 8198/1: make kuser helpers depend on MMU
  drm/radeon: add missing crtc unlock when setting up the MC
  mac80211: fix use-after-free in defragmentation
  macvtap: Fix csum_start when VLAN tags are present
  iwlwifi: configure the LTR
  libceph: do not crash on large auth tickets
  xtensa: re-wire umount syscall to sys_oldumount
  ALSA: usb-audio: Fix memory leak in FTU quirk
  ahci: disable MSI instead of NCQ on Samsung pci-e SSDs on macbooks
  ahci: Add Device IDs for Intel Sunrise Point PCH
  audit: keep inode pinned
  x86, x32, audit: Fix x32's AUDIT_ARCH wrt audit
  sparc32: Implement xchg and atomic_xchg using ATOMIC_HASH locks
  sparc64: Do irq_{enter,exit}() around generic_smp_call_function*().
  sparc64: Fix crashes in schizo_pcierr_intr_other().
  sunvdc: don't call VD_OP_GET_VTOC
  vio: fix reuse of vio_dring slot
  sunvdc: limit each sg segment to a page
  sunvdc: compute vdisk geometry from capacity
  sunvdc: add cdrom and v1.1 protocol support
  net: sctp: fix memory leak in auth key management
  net: sctp: fix NULL pointer dereference in af->from_addr_param on malformed packet
  gre6: Move the setting of dev->iflink into the ndo_init functions.
  ip6_tunnel: Use ip6_tnl_dev_init as the ndo_init function.
  Linux 3.10.60
  libceph: ceph-msgr workqueue needs a resque worker
  Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup
  of: Fix overflow bug in string property parsing functions
  sysfs: driver core: Fix glue dir race condition by gdp_mutex
  i2c: at91: don't account as iowait
  acer-wmi: Add acpi_backlight=video quirk for the Acer KAV80
  rbd: Fix error recovery in rbd_obj_read_sync()
  drm/radeon: remove invalid pci id
  usb: gadget: udc: core: fix kernel oops with soft-connect
  usb: gadget: function: acm: make f_acm pass USB20CV Chapter9
  usb: dwc3: gadget: fix set_halt() bug with pending transfers
  crypto: algif - avoid excessive use of socket buffer in skcipher
  mm: Remove false WARN_ON from pagecache_isize_extended()
  x86, apic: Handle a bad TSC more gracefully
  posix-timers: Fix stack info leak in timer_create()
  mac80211: fix typo in starting baserate for rts_cts_rate_idx
  PM / Sleep: fix recovery during resuming from hibernation
  tty: Fix high cpu load if tty is unreleaseable
  quota: Properly return errors from dquot_writeback_dquots()
  ext3: Don't check quota format when there are no quota files
  nfsd4: fix crash on unknown operation number
  cpc925_edac: Report UE events properly
  e7xxx_edac: Report CE events properly
  i3200_edac: Report CE events properly
  i82860_edac: Report CE events properly
  scsi: Fix error handling in SCSI_IOCTL_SEND_COMMAND
  lib/bitmap.c: fix undefined shift in __bitmap_shift_{left|right}()
  cgroup/kmemleak: add kmemleak_free() for cgroup deallocations.
  usb: Do not allow usb_alloc_streams on unconfigured devices
  USB: opticon: fix non-atomic allocation in write path
  usb-storage: handle a skipped data phase
  spi: pxa2xx: toggle clocks on suspend if not disabled by runtime PM
  spi: pl022: Fix incorrect dma_unmap_sg
  usb: dwc3: gadget: Properly initialize LINK TRB
  wireless: rt2x00: add new rt2800usb device
  USB: option: add Haier CE81B CDMA modem
  usb: option: add support for Telit LE910
  USB: cdc-acm: only raise DTR on transitions from B0
  USB: cdc-acm: add device id for GW Instek AFG-2225
  usb: serial: ftdi_sio: add "bricked" FTDI device PID
  usb: serial: ftdi_sio: add Awinda Station and Dongle products
  USB: serial: cp210x: add Silicon Labs 358x VID and PID
  serial: Fix divide-by-zero fault in uart_get_divisor()
  staging:iio:ade7758: Remove "raw" from channel name
  staging:iio:ade7758: Fix check if channels are enabled in prenable
  staging:iio:ade7758: Fix NULL pointer deref when enabling buffer
  staging:iio:ad5933: Drop "raw" from channel names
  staging:iio:ad5933: Fix NULL pointer deref when enabling buffer
  OOM, PM: OOM killed task shouldn't escape PM suspend
  freezer: Do not freeze tasks killed by OOM killer
  ext4: fix oops when loading block bitmap failed
  cpufreq: intel_pstate: Fix setting max_perf_pct in performance policy
  ext4: fix overflow when updating superblock backups after resize
  ext4: check s_chksum_driver when looking for bg csum presence
  ext4: fix reservation overflow in ext4_da_write_begin
  ext4: add ext4_iget_normal() which is to be used for dir tree lookups
  ext4: grab missed write_count for EXT4_IOC_SWAP_BOOT
  ext4: don't check quota format when there are no quota files
  ext4: check EA value offset when loading
  jbd2: free bh when descriptor block checksum fails
  MIPS: tlbex: Properly fix HUGE TLB Refill exception handler
  target: Fix APTPL metadata handling for dynamic MappedLUNs
  target: Fix queue full status NULL pointer for SCF_TRANSPORT_TASK_SENSE
  qla_target: don't delete changed nacls
  ARC: Update order of registers in KGDB to match GDB 7.5
  ARC: [nsimosci] Allow "headless" models to boot
  KVM: x86: Emulator fixes for eip canonical checks on near branches
  KVM: x86: Fix wrong masking on relative jump/call
  kvm: x86: don't kill guest on unknown exit reason
  KVM: x86: Check non-canonical addresses upon WRMSR
  KVM: x86: Improve thread safety in pit
  KVM: x86: Prevent host from panicking on shared MSR writes.
  kvm: fix excessive pages un-pinning in kvm_iommu_map error path.
  media: tda7432: Fix setting TDA7432_MUTE bit for TDA7432_RF register
  media: ds3000: fix LNB supply voltage on Tevii S480 on initialization
  media: em28xx-v4l: give back all active video buffers to the vb2 core properly on streaming stop
  media: v4l2-common: fix overflow in v4l_bound_align_image()
  drm/nouveau/bios: memset dcb struct to zero before parsing
  drm/tilcdc: Fix the error path in tilcdc_load()
  drm/ast: Fix HW cursor image
  Input: i8042 - quirks for Fujitsu Lifebook A544 and Lifebook AH544
  Input: i8042 - add noloop quirk for Asus X750LN
  framebuffer: fix border color
  modules, lock around setting of MODULE_STATE_UNFORMED
  dm log userspace: fix memory leak in dm_ulog_tfr_init failure path
  block: fix alignment_offset math that assumes io_min is a power-of-2
  drbd: compute the end before rb_insert_augmented()
  dm bufio: update last_accessed when relinking a buffer
  virtio_pci: fix virtio spec compliance on restore
  selinux: fix inode security list corruption
  pstore: Fix duplicate {console,ftrace}-efi entries
  mfd: rtsx_pcr: Fix MSI enable error handling
  mnt: Prevent pivot_root from creating a loop in the mount tree
  UBI: add missing kmem_cache_free() in process_pool_aeb error path
  random: add and use memzero_explicit() for clearing data
  crypto: more robust crypto_memneq
  fix misuses of f_count() in ppp and netlink
  kill wbuf_queued/wbuf_dwork_lock
  ALSA: pcm: Zero-clear reserved fields of PCM status ioctl in compat mode
  evm: check xattr value length and type in evm_inode_setxattr()
  x86, pageattr: Prevent overflow in slow_virt_to_phys() for X86_PAE
  x86_64, entry: Fix out of bounds read on sysenter
  x86_64, entry: Filter RFLAGS.NT on entry from userspace
  x86, flags: Rename X86_EFLAGS_BIT1 to X86_EFLAGS_FIXED
  x86, fpu: shift drop_init_fpu() from save_xstate_sig() to handle_signal()
  x86, fpu: __restore_xstate_sig()->math_state_restore() needs preempt_disable()
  x86: Reject x32 executables if x32 ABI not supported
  vfs: fix data corruption when blocksize < pagesize for mmaped data
  UBIFS: fix free log space calculation
  UBIFS: fix a race condition
  UBIFS: remove mst_mutex
  fs: Fix theoretical division by 0 in super_cache_scan().
  fs: make cont_expand_zero interruptible
  mmc: rtsx_pci_sdmmc: fix incorrect last byte in R2 response
  libata-sff: Fix controllers with no ctl port
  pata_serverworks: disable 64-KB DMA transfers on Broadcom OSB4 IDE Controller
  Revert "percpu: free percpu allocation info for uniprocessor system"
  lockd: Try to reconnect if statd has moved
  drivers/net: macvtap and tun depend on INET
  ipv4: dst_entry leak in ip_send_unicast_reply()
  ax88179_178a: fix bonding failure
  ipv4: fix nexthop attlen check in fib_nh_match
  tracing/syscalls: Ignore numbers outside NR_syscalls' range
  Linux 3.10.59
  ecryptfs: avoid to access NULL pointer when write metadata in xattr
  ARM: at91/PMC: don't forget to write PMC_PCDR register to disable clocks
  ALSA: usb-audio: Add support for Steinberg UR22 USB interface
  ALSA: emu10k1: Fix deadlock in synth voice lookup
  ALSA: pcm: use the same dma mmap codepath both for arm and arm64
  arm64: compat: fix compat types affecting struct compat_elf_prpsinfo
  spi: dw-mid: terminate ongoing transfers at exit
  kernel: add support for gcc 5
  fanotify: enable close-on-exec on events' fd when requested in fanotify_init()
  mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set
  Bluetooth: Fix issue with USB suspend in btusb driver
  Bluetooth: Fix HCI H5 corrupted ack value
  rt2800: correct BBP1_TX_POWER_CTRL mask
  PCI: Generate uppercase hex for modalias interface class
  PCI: Increase IBM ipr SAS Crocodile BARs to at least system page size
  iwlwifi: Add missing PCI IDs for the 7260 series
  NFSv4.1: Fix an NFSv4.1 state renewal regression
  NFSv4: fix open/lock state recovery error handling
  NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
  lzo: check for length overrun in variable length encoding.
  Revert "lzo: properly check for overruns"
  Documentation: lzo: document part of the encoding
  m68k: Disable/restore interrupts in hwreg_present()/hwreg_write()
  Drivers: hv: vmbus: Fix a bug in vmbus_open()
  Drivers: hv: vmbus: Cleanup vmbus_establish_gpadl()
  Drivers: hv: vmbus: Cleanup vmbus_teardown_gpadl()
  Drivers: hv: vmbus: Cleanup vmbus_post_msg()
  firmware_class: make sure fw requests contain a name
  qla2xxx: Use correct offset to req-q-out for reserve calculation
  mptfusion: enable no_write_same for vmware scsi disks
  be2iscsi: check ip buffer before copying
  regmap: fix NULL pointer dereference in _regmap_write/read
  regmap: debugfs: fix possbile NULL pointer dereference
  spi: dw-mid: check that DMA was inited before exit
  spi: dw-mid: respect 8 bit mode
  x86/intel/quark: Switch off CR4.PGE so TLB flush uses CR3 instead
  kvm: don't take vcpu mutex for obviously invalid vcpu ioctls
  KVM: s390: unintended fallthrough for external call
  kvm: x86: fix stale mmio cache bug
  fs: Add a missing permission check to do_umount
  Btrfs: fix race in WAIT_SYNC ioctl
  Btrfs: fix build_backref_tree issue with multiple shared blocks
  Btrfs: try not to ENOSPC on log replay
  Linux 3.10.58
  USB: cp210x: add support for Seluxit USB dongle
  USB: serial: cp210x: added Ketra N1 wireless interface support
  USB: Add device quirk for ASUS T100 Base Station keyboard
  ipv6: reallocate addrconf router for ipv6 address when lo device up
  tcp: fixing TLP's FIN recovery
  sctp: handle association restarts when the socket is closed.
  ip6_gre: fix flowi6_proto value in xmit path
  hyperv: Fix a bug in netvsc_start_xmit()
  tg3: Allow for recieve of full-size 8021AD frames
  tg3: Work around HW/FW limitations with vlan encapsulated frames
  l2tp: fix race while getting PMTU on PPP pseudo-wire
  openvswitch: fix panic with multiple vlan headers
  packet: handle too big packets for PACKET_V3
  tcp: fix tcp_release_cb() to dispatch via address family for mtu_reduced()
  sit: Fix ipip6_tunnel_lookup device matching criteria
  myri10ge: check for DMA mapping errors
  Linux 3.10.57
  cpufreq: ondemand: Change the calculation of target frequency
  cpufreq: Fix wrong time unit conversion
  nl80211: clear skb cb before passing to netlink
  drbd: fix regression 'out of mem, failed to invoke fence-peer helper'
  jiffies: Fix timeval conversion to jiffies
  md/raid5: disable 'DISCARD' by default due to safety concerns.
  media: vb2: fix VBI/poll regression
  mm: numa: Do not mark PTEs pte_numa when splitting huge pages
  mm, thp: move invariant bug check out of loop in __split_huge_page_map
  ring-buffer: Fix infinite spin in reading buffer
  init/Kconfig: Fix HAVE_FUTEX_CMPXCHG to not break up the EXPERT menu
  perf: fix perf bug in fork()
  udf: Avoid infinite loop when processing indirect ICBs
  Linux 3.10.56
  vm_is_stack: use for_each_thread() rather then buggy while_each_thread()
  oom_kill: add rcu_read_lock() into find_lock_task_mm()
  oom_kill: has_intersects_mems_allowed() needs rcu_read_lock()
  oom_kill: change oom_kill.c to use for_each_thread()
  introduce for_each_thread() to replace the buggy while_each_thread()
  kernel/fork.c:copy_process(): unify CLONE_THREAD-or-thread_group_leader code
  arm: multi_v7_defconfig: Enable Zynq UART driver
  ext2: Fix fs corruption in ext2_get_xip_mem()
  serial: 8250_dma: check the result of TX buffer mapping
  ARM: 7748/1: oabi: handle faults when loading swi instruction from userspace
  netfilter: nf_conntrack: avoid large timeout for mid-stream pickup
  PM / sleep: Use valid_state() for platform-dependent sleep states only
  PM / sleep: Add state field to pm_states[] entries
  ipvs: fix ipv6 hook registration for local replies
  ipvs: Maintain all DSCP and ECN bits for ipv6 tun forwarding
  ipvs: avoid netns exit crash on ip_vs_conn_drop_conntrack
  md/raid1: fix_read_error should act on all non-faulty devices.
  media: cx18: fix kernel oops with tda8290 tuner
  Fix nasty 32-bit overflow bug in buffer i/o code.
  perf kmem: Make it work again on non NUMA machines
  perf: Fix a race condition in perf_remove_from_context()
  alarmtimer: Lock k_itimer during timer callback
  alarmtimer: Do not signal SIGEV_NONE timers
  parisc: Only use -mfast-indirect-calls option for 32-bit kernel builds
  powerpc/perf: Fix ABIv2 kernel backtraces
  sched: Fix unreleased llc_shared_mask bit during CPU hotplug
  ocfs2/dlm: do not get resource spinlock if lockres is new
  nilfs2: fix data loss with mmap()
  fs/notify: don't show f_handle if exportfs_encode_inode_fh failed
  fsnotify/fdinfo: use named constants instead of hardcoded values
  kcmp: fix standard comparison bug
  Revert "mac80211: disable uAPSD if all ACs are under ACM"
  usb: dwc3: core: fix ordering for PHY suspend
  usb: dwc3: core: fix order of PM runtime calls
  usb: host: xhci: fix compliance mode workaround
  genhd: fix leftover might_sleep() in blk_free_devt()
  lockd: fix rpcbind crash on lockd startup failure
  rtlwifi: rtl8192cu: Add new ID
  percpu: perform tlb flush after pcpu_map_pages() failure
  percpu: fix pcpu_alloc_pages() failure path
  percpu: free percpu allocation info for uniprocessor system
  ata_piix: Add Device IDs for Intel 9 Series PCH
  Input: i8042 - add nomux quirk for Avatar AVIU-145A6
  Input: i8042 - add Fujitsu U574 to no_timeout dmi table
  Input: atkbd - do not try 'deactivate' keyboard on any LG laptops
  Input: elantech - fix detection of touchpad on ASUS s301l
  Input: synaptics - add support for ForcePads
  Input: serport - add compat handling for SPIOCSTYPE ioctl
  dm crypt: fix access beyond the end of allocated space
  block: Fix dev_t minor allocation lifetime
  workqueue: apply __WQ_ORDERED to create_singlethread_workqueue()
  Revert "iwlwifi: dvm: don't enable CTS to self"
  SCSI: libiscsi: fix potential buffer overrun in __iscsi_conn_send_pdu
  NFC: microread: Potential overflows in microread_target_discovered()
  iscsi-target: Fix memory corruption in iscsit_logout_post_handler_diffcid
  iscsi-target: avoid NULL pointer in iscsi_copy_param_list failure
  Target/iser: Don't put isert_conn inside disconnected handler
  Target/iser: Get isert_conn reference once got to connected_handler
  iio:inkern: fix overwritten -EPROBE_DEFER in of_iio_channel_get_by_name
  iio:magnetometer: bugfix magnetometers gain values
  iio: adc: ad_sigma_delta: Fix indio_dev->trig assignment
  iio: st_sensors: Fix indio_dev->trig assignment
  iio: meter: ade7758: Fix indio_dev->trig assignment
  iio: inv_mpu6050: Fix indio_dev->trig assignment
  iio: gyro: itg3200: Fix indio_dev->trig assignment
  iio:trigger: modify return value for iio_trigger_get
  CIFS: Fix SMB2 readdir error handling
  CIFS: Fix directory rename error
  ASoC: davinci-mcasp: Correct rx format unit configuration
  shmem: fix nlink for rename overwrite directory
  x86 early_ioremap: Increase FIX_BTMAPS_SLOTS to 8
  KVM: x86: handle idiv overflow at kvm_write_tsc
  regmap: Fix handling of volatile registers for format_write() chips
  ACPICA: Update to GPIO region handler interface.
  MIPS: mcount: Adjust stack pointer for static trace in MIPS32
  MIPS: ZBOOT: add missing <linux/string.h> include
  ARM: 8165/1: alignment: don't break misaligned NEON load/store
  ARM: 7897/1: kexec: Use the right ISA for relocate_new_kernel
  ARM: 8133/1: use irq_set_affinity with force=false when migrating irqs
  ARM: 8128/1: abort: don't clear the exclusive monitors
  NFSv4: Fix another bug in the close/open_downgrade code
  NFSv4: nfs4_state_manager() vs. nfs_server_remove_lists()
  usb:hub set hub->change_bits when over-current happens
  usb: dwc3: omap: fix ordering for runtime pm calls
  USB: EHCI: unlink QHs even after the controller has stopped
  USB: storage: Add quirks for Entrega/Xircom USB to SCSI converters
  USB: storage: Add quirk for Ariston Technologies iConnect USB to SCSI adapter
  USB: storage: Add quirk for Adaptec USBConnect 2000 USB-to-SCSI Adapter
  storage: Add single-LUN quirk for Jaz USB Adapter
  usb: hub: take hub->hdev reference when processing from eventlist
  xhci: fix oops when xhci resumes from hibernate with hw lpm capable devices
  xhci: Fix null pointer dereference if xhci initialization fails
  USB: zte_ev: fix removed PIDs
  USB: ftdi_sio: add support for NOVITUS Bono E thermal printer
  USB: sierra: add 1199:68AA device ID
  USB: sierra: avoid CDC class functions on "68A3" devices
  USB: zte_ev: remove duplicate Qualcom PID
  USB: zte_ev: remove duplicate Gobi PID
  Revert "USB: option,zte_ev: move most ZTE CDMA devices to zte_ev"
  USB: option: add VIA Telecom CDS7 chipset device id
  USB: option: reduce interrupt-urb logging verbosity
  USB: serial: fix potential heap buffer overflow
  USB: sisusb: add device id for Magic Control USB video
  USB: serial: fix potential stack buffer overflow
  USB: serial: pl2303: add device id for ztek device
  xtensa: fix a6 and a7 handling in fast_syscall_xtensa
  xtensa: fix TLBTEMP_BASE_2 region handling in fast_second_level_miss
  xtensa: fix access to THREAD_RA/THREAD_SP/THREAD_DS
  xtensa: fix address checks in dma_{alloc,free}_coherent
  xtensa: replace IOCTL code definitions with constants
  drm/radeon: add connector quirk for fujitsu board
  drm/vmwgfx: Fix a potential infinite spin waiting for fifo idle
  drm/ast: AST2000 cannot be detected correctly
  drm/i915: Wait for vblank before enabling the TV encoder
  drm/i915: Remove bogus __init annotation from DMI callbacks
  HID: logitech-dj: prevent false errors to be shown
  HID: magicmouse: sanity check report size in raw_event() callback
  HID: picolcd: sanity check report size in raw_event() callback
  cfq-iosched: Fix wrong children_weight calculation
  ALSA: pcm: fix fifo_size frame calculation
  ALSA: hda - Fix invalid pin powermap without jack detection
  ALSA: hda - Fix COEF setups for ALC1150 codec
  ALSA: core: fix buffer overflow in snd_info_get_line()
  arm64: ptrace: fix compat hardware watchpoint reporting
  trace: Fix epoll hang when we race with new entries
  i2c: at91: Fix a race condition during signal handling in at91_do_twi_xfer.
  i2c: at91: add bound checking on SMBus block length bytes
  arm64: flush TLS registers during exec
  ibmveth: Fix endian issues with rx_no_buffer statistic
  ahci: add pcid for Marvel 0x9182 controller
  ahci: Add Device IDs for Intel 9 Series PCH
  pata_scc: propagate return value of scc_wait_after_reset
  drm/i915: read HEAD register back in init_ring_common() to enforce ordering
  drm/radeon: load the lm63 driver for an lm64 thermal chip.
  drm/ttm: Choose a pool to shrink correctly in ttm_dma_pool_shrink_scan().
  drm/ttm: Fix possible division by 0 in ttm_dma_pool_shrink_scan().
  drm/tilcdc: fix double kfree
  drm/tilcdc: fix release order on exit
  drm/tilcdc: panel: fix leak when unloading the module
  drm/tilcdc: tfp410: fix dangling sysfs connector node
  drm/tilcdc: slave: fix dangling sysfs connector node
  drm/tilcdc: panel: fix dangling sysfs connector node
  carl9170: fix sending URBs with wrong type when using full-speed
  Linux 3.10.55
  libceph: gracefully handle large reply messages from the mon
  libceph: rename ceph_msg::front_max to front_alloc_len
  tpm: Provide a generic means to override the chip returned timeouts
  vfs: fix bad hashing of dentries
  dcache.c: get rid of pointless macros
  IB/srp: Fix deadlock between host removal and multipathd
  blkcg: don't call into policy draining if root_blkg is already gone
  mtd: nand: omap: Fix 1-bit Hamming code scheme, omap_calculate_ecc()
  mtd/ftl: fix the double free of the buffers allocated in build_maps()
  CIFS: Fix wrong restart readdir for SMB1
  CIFS: Fix wrong filename length for SMB2
  CIFS: Fix wrong directory attributes after rename
  CIFS: Possible null ptr deref in SMB2_tcon
  CIFS: Fix async reading on reconnects
  CIFS: Fix STATUS_CANNOT_DELETE error mapping for SMB2
  libceph: do not hard code max auth ticket len
  libceph: add process_one_ticket() helper
  libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly
  md/raid1,raid10: always abort recover on write error.
  xfs: don't zero partial page cache pages during O_DIRECT writes
  xfs: don't zero partial page cache pages during O_DIRECT writes
  xfs: don't dirty buffers beyond EOF
  xfs: quotacheck leaves dquot buffers without verifiers
  RDMA/iwcm: Use a default listen backlog if needed
  md/raid10: Fix memory leak when raid10 reshape completes.
  md/raid10: fix memory leak when reshaping a RAID10.
  md/raid6: avoid data corruption during recovery of double-degraded RAID6
  Bluetooth: Avoid use of session socket after the session gets freed
  Bluetooth: never linger on process exit
  mnt: Add tests for unprivileged remount cases that have found to be faulty
  mnt: Change the default remount atime from relatime to the existing value
  mnt: Correct permission checks in do_remount
  mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount
  mnt: Only change user settable mount flags in remount
  ring-buffer: Up rb_iter_peek() loop count to 3
  ring-buffer: Always reset iterator to reader page
  ACPI / cpuidle: fix deadlock between cpuidle_lock and cpu_hotplug.lock
  ACPI: Run fixed event device notifications in process context
  ACPICA: Utilities: Fix memory leak in acpi_ut_copy_iobject_to_iobject
  bfa: Fix undefined bit shift on big-endian architectures with 32-bit DMA address
  ASoC: pxa-ssp: drop SNDRV_PCM_FMTBIT_S24_LE
  ASoC: max98090: Fix missing free_irq
  ASoC: samsung: Correct I2S DAI suspend/resume ops
  ASoC: wm_adsp: Add missing MODULE_LICENSE
  ASoC: pcm: fix dpcm_path_put in dpcm runtime update
  openrisc: Rework signal handling
  MIPS: Fix accessing to per-cpu data when flushing the cache
  MIPS: OCTEON: make get_system_type() thread-safe
  MIPS: asm: thread_info: Add _TIF_SECCOMP flag
  MIPS: Cleanup flags in syscall flags handlers.
  MIPS: asm/reg.h: Make 32- and 64-bit definitions available at the same time
  MIPS: Remove BUG_ON(!is_fpu_owner()) in do_ade()
  MIPS: tlbex: Fix a missing statement for HUGETLB
  MIPS: Prevent user from setting FCSR cause bits
  MIPS: GIC: Prevent array overrun
  drivers: scsi: storvsc: Correctly handle TEST_UNIT_READY failure
  Drivers: scsi: storvsc: Implement a eh_timed_out handler
  powerpc/pseries: Failure on removing device node
  powerpc/mm: Use read barrier when creating real_pte
  powerpc/mm/numa: Fix break placement
  regulator: arizona-ldo1: remove bypass functionality
  mfd: omap-usb-host: Fix improper mask use.
  kernel/smp.c:on_each_cpu_cond(): fix warning in fallback path
  CAPABILITIES: remove undefined caps from all processes
  tpm: missing tpm_chip_put in tpm_get_random()
  firmware: Do not use WARN_ON(!spin_is_locked())
  spi: omap2-mcspi: Configure hardware when slave driver changes mode
  spi: orion: fix incorrect handling of cell-index DT property
  iommu/amd: Fix cleanup_domain for mass device removal
  media: media-device: Remove duplicated memset() in media_enum_entities()
  media: au0828: Only alt setting logic when needed
  media: xc4000: Fix get_frequency()
  media: xc5000: Fix get_frequency()
  Linux 3.10.54
  USB: fix build error with CONFIG_PM_RUNTIME disabled
  NFSv4: Fix problems with close in the presence of a delegation
  NFSv3: Fix another acl regression
  svcrdma: Select NFSv4.1 backchannel transport based on forward channel
  NFSD: Decrease nfsd_users in nfsd_startup_generic fail
  usb: hub: Prevent hub autosuspend if usbcore.autosuspend is -1
  USB: whiteheat: Added bounds checking for bulk command response
  USB: ftdi_sio: Added PID for new ekey device
  USB: ftdi_sio: add Basic Micro ATOM Nano USB2Serial PID
  ARM: OMAP2+: hwmod: Rearm wake-up interrupts for DT when MUSB is idled
  usb: xhci: amd chipset also needs short TX quirk
  xhci: Treat not finding the event_seg on COMP_STOP the same as COMP_STOP_INVAL
  Staging: speakup: Update __speakup_paste_selection() tty (ab)usage to match vt
  jbd2: fix infinite loop when recovering corrupt journal blocks
  mei: nfc: fix memory leak in error path
  mei: reset client state on queued connect request
  Btrfs: fix csum tree corruption, duplicate and outdated checksums
  hpsa: fix bad -ENOMEM return value in hpsa_big_passthru_ioctl
  x86/efi: Enforce CONFIG_RELOCATABLE for EFI boot stub
  x86_64/vsyscall: Fix warn_bad_vsyscall log output
  x86: don't exclude low BIOS area when allocating address space for non-PCI cards
  drm/radeon: add additional SI pci ids
  ext4: fix BUG_ON in mb_free_blocks()
  kvm: iommu: fix the third parameter of kvm_iommu_put_pages (CVE-2014-3601)
  Revert "KVM: x86: Increase the number of fixed MTRR regs to 10"
  KVM: nVMX: fix "acknowledge interrupt on exit" when APICv is in use
  KVM: x86: always exit on EOIs for interrupts listed in the IOAPIC redir table
  KVM: x86: Inter-privilege level ret emulation is not implemeneted
  crypto: ux500 - make interrupt mode plausible
  serial: core: Preserve termios c_cflag for console resume
  ext4: fix ext4_discard_allocated_blocks() if we can't allocate the pa struct
  drivers/i2c/busses: use correct type for dma_map/unmap
  hwmon: (dme1737) Prevent overflow problem when writing large limits
  hwmon: (ads1015) Fix out-of-bounds array access
  hwmon: (lm85) Fix various errors on attribute writes
  hwmon: (ads1015) Fix off-by-one for valid channel index checking
  hwmon: (gpio-fan) Prevent overflow problem when writing large limits
  hwmon: (lm78) Fix overflow problems seen when writing large temperature limits
  hwmon: (sis5595) Prevent overflow problem when writing large limits
  drm: omapdrm: fix compiler errors
  ARM: OMAP3: Fix choice of omap3_restore_es function in OMAP34XX rev3.1.2 case.
  mei: start disconnect request timer consistently
  ALSA: hda/realtek - Avoid setting wrong COEF on ALC269 & co
  ALSA: hda/ca0132 - Don't try loading firmware at resume when already failed
  ALSA: virtuoso: add Xonar Essence STX II support
  ALSA: hda - fix an external mic jack problem on a HP machine
  USB: Fix persist resume of some SS USB devices
  USB: ehci-pci: USB host controller support for Intel Quark X1000
  USB: serial: ftdi_sio: Add support for new Xsens devices
  USB: serial: ftdi_sio: Annotate the current Xsens PID assignments
  USB: OHCI: don't lose track of EDs when a controller dies
  isofs: Fix unbounded recursion when processing relocated directories
  HID: fix a couple of off-by-ones
  HID: logitech: perform bounds checking on device_id early enough
  stable_kernel_rules: Add pointer to netdev-FAQ for network patches
  Linux 3.10.53
  arch/sparc/math-emu/math_32.c: drop stray break operator
  sparc64: ldc_connect() should not return EINVAL when handshake is in progress.
  sunsab: Fix detection of BREAK on sunsab serial console
  bbc-i2c: Fix BBC I2C envctrl on SunBlade 2000
  sparc64: Guard against flushing openfirmware mappings.
  sparc64: Do not insert non-valid PTEs into the TSB hash table.
  sparc64: Add membar to Niagara2 memcpy code.
  sparc64: Fix huge TSB mapping on pre-UltraSPARC-III cpus.
  sparc64: Don't bark so loudly about 32-bit tasks generating 64-bit fault addresses.
  sparc64: Fix top-level fault handling bugs.
  sparc64: Handle 32-bit tasks properly in compute_effective_address().
  sparc64: Make itc_sync_lock raw
  sparc64: Fix argument sign extension for compat_sys_futex().
  sctp: fix possible seqlock seadlock in sctp_packet_transmit()
  iovec: make sure the caller actually wants anything in memcpy_fromiovecend
  net: Correctly set segment mac_len in skb_segment().
  macvlan: Initialize vlan_features to turn on offload support.
  net: sctp: inherit auth_capable on INIT collisions
  tcp: Fix integer-overflow in TCP vegas
  tcp: Fix integer-overflows in TCP veno
  net: sendmsg: fix NULL pointer dereference
  ip: make IP identifiers less predictable
  inetpeer: get rid of ip_id_count
  bnx2x: fix crash during TSO tunneling
  Linux 3.10.52
  x86/espfix/xen: Fix allocation of pages for paravirt page tables
  lib/btree.c: fix leak of whole btree nodes
  net/l2tp: don't fall back on UDP [get|set]sockopt
  net: mvneta: replace Tx timer with a real interrupt
  net: mvneta: add missing bit descriptions for interrupt masks and causes
  net: mvneta: do not schedule in mvneta_tx_timeout
  net: mvneta: use per_cpu stats to fix an SMP lock up
  net: mvneta: increase the 64-bit rx/tx stats out of the hot path
  Revert "mac80211: move "bufferable MMPDU" check to fix AP mode scan"
  staging: vt6655: Fix Warning on boot handle_irq_event_percpu.
  x86_64/entry/xen: Do not invoke espfix64 on Xen
  x86, espfix: Make it possible to disable 16-bit support
  x86, espfix: Make espfix64 a Kconfig option, fix UML
  x86, espfix: Fix broken header guard
  x86, espfix: Move espfix definitions into a separate header file
  x86-64, espfix: Don't leak bits 31:16 of %esp returning to 16-bit stack
  Revert "x86-64, modify_ldt: Make support for 16-bit segments a runtime option"
  timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks
  printk: rename printk_sched to printk_deferred
  iio: buffer: Fix demux table creation
  staging: vt6655: Fix disassociated messages every 10 seconds
  mm, thp: do not allow thp faults to avoid cpuset restrictions
  scsi: handle flush errors properly
  rapidio/tsi721_dma: fix failure to obtain transaction descriptor
  cfg80211: fix mic_failure tracing
  ARM: 8115/1: LPAE: reduce damage caused by idmap to virtual memory layout
  crypto: af_alg - properly label AF_ALG socket
  Linux 3.10.51
  core, nfqueue, openvswitch: Orphan frags in skb_zerocopy and handle errors
  x86/efi: Include a .bss section within the PE/COFF headers
  s390/ptrace: fix PSW mask check
  Fix gcc-4.9.0 miscompilation of load_balance() in scheduler
  mm: hugetlb: fix copy_hugetlb_page_range()
  x86_32, entry: Store badsys error code in %eax
  hwmon: (smsc47m192) Fix temperature limit and vrm write operations
  parisc: Remove SA_RESTORER define
  coredump: fix the setting of PF_DUMPCORE
  Input: fix defuzzing logic
  slab_common: fix the check for duplicate slab names
  slab_common: Do not check for duplicate slab names
  tracing: Fix wraparound problems in "uptime" trace clock
  blkcg: don't call into policy draining if root_blkg is already gone
  ahci: add support for the Promise FastTrak TX8660 SATA HBA (ahci mode)
  libata: introduce ata_host->n_tags to avoid oops on SAS controllers
  libata: support the ata host which implements a queue depth less than 32
  block: don't assume last put of shared tags is for the host
  block: provide compat ioctl for BLKZEROOUT
  media: tda10071: force modulation to QPSK on DVB-S
  media: hdpvr: fix two audio bugs
  Linux 3.10.50
  ARC: Implement ptrace(PTRACE_GET_THREAD_AREA)
  sched: Fix possible divide by zero in avg_atom() calculation
  locking/mutex: Disable optimistic spinning on some architectures
  PM / sleep: Fix request_firmware() error at resume
  dm cache metadata: do not allow the data block size to change
  dm thin metadata: do not allow the data block size to change
  alarmtimer: Fix bug where relative alarm timers were treated as absolute
  drm/radeon: avoid leaking edid data
  drm/qxl: return IRQ_NONE if it was not our irq
  drm/radeon: set default bl level to something reasonable
  irqchip: gic: Fix core ID calculation when topology is read from DT
  irqchip: gic: Add support for cortex a7 compatible string
  ring-buffer: Fix polling on trace_pipe
  mwifiex: fix Tx timeout issue
  perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
  ipv4: fix buffer overflow in ip_options_compile()
  dns_resolver: Null-terminate the right string
  dns_resolver: assure that dns_query() result is null-terminated
  sunvnet: clean up objects created in vnet_new() on vnet_exit()
  net: pppoe: use correct channel MTU when using Multilink PPP
  net: sctp: fix information leaks in ulpevent layer
  tipc: clear 'next'-pointer of message fragments before reassembly
  be2net: set EQ DB clear-intr bit in be_open()
  netlink: Fix handling of error from netlink_dump().
  net: mvneta: Fix big endian issue in mvneta_txq_desc_csum()
  net: mvneta: fix operation in 10 Mbit/s mode
  appletalk: Fix socket referencing in skb
  tcp: fix false undo corner cases
  igmp: fix the problem when mc leave group
  net: qmi_wwan: add two Sierra Wireless/Netgear devices
  net: qmi_wwan: Add ID for Telewell TW-LTE 4G v2
  ipv4: icmp: Fix pMTU handling for rare case
  tcp: Fix divide by zero when pushing during tcp-repair
  bnx2x: fix possible panic under memory stress
  net: fix sparse warning in sk_dst_set()
  ipv4: irq safe sk_dst_[re]set() and ipv4_sk_update_pmtu() fix
  ipv4: fix dst race in sk_dst_get()
  8021q: fix a potential memory leak
  net: sctp: check proc_dointvec result in proc_sctp_do_auth
  tcp: fix tcp_match_skb_to_sack() for unaligned SACK at end of an skb
  ip_tunnel: fix ip_tunnel_lookup
  shmem: fix splicing from a hole while it's punched
  shmem: fix faulting into a hole, not taking i_mutex
  shmem: fix faulting into a hole while it's punched
  iwlwifi: dvm: don't enable CTS to self
  igb: do a reset on SR-IOV re-init if device is down
  hwmon: (adt7470) Fix writes to temperature limit registers
  hwmon: (da9052) Don't use dash in the name attribute
  hwmon: (da9055) Don't use dash in the name attribute
  tracing: Add ftrace_trace_stack into __trace_puts/__trace_bputs
  tracing: Fix graph tracer with stack tracer on other archs
  fuse: handle large user and group ID
  Bluetooth: Ignore H5 non-link packets in non-active state
  Drivers: hv: util: Fix a bug in the KVP code
  media: gspca_pac7302: Add new usb-id for Genius i-Look 317
  usb: Check if port status is equal to RxDetect

Signed-off-by: Ian Maund <imaund@codeaurora.org>
2015-04-24 18:04:40 -07:00
Rom Lemarchand 1da76042a5 memcg: add permission check
Use the 'allow_attach' handler for the 'mem' cgroup to allow
non-root processes to add arbitrary processes to a 'mem' cgroup
if it has the CAP_SYS_NICE capability set.

Bug: 18260435
Change-Id: If7d37bf90c1544024c4db53351adba6a64966250
Signed-off-by: Rom Lemarchand <romlem@android.com>
Git-commit: cce78bc02ff0ea2d21e88e3438d65272b898aa35
Git-repo: https://android.googlesource.com/kernel/common.git
Signed-off-by: Ian Maund <imaund@codeaurora.org>
2015-03-19 14:59:18 -07:00
Chintan Pandya d68f06d491 memcg: Allow non-root users permission to control memory
In a system like Android, a process with SYS_ADMIN rights
controls the system for things like moving process from
one cgroup to another. The native cgroup capabilities
are only allowed to execute by root user and not system.
While adding a new cgroup sub-system, one may override
and relax the permission so that 'system' can also control
cgroup. Here, memcg is one such cgroup sub system which
requires system level control for that.

Allow non-root processes to add arbitrary into 'memory'
cgroups if it has 'CAP_SYS_ADMIN' capability set.

Change-Id: I43d4468186f142c176cb5b5f060751bb1b160344
Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
2014-12-15 13:13:02 +05:30
Johannes Weiner f8a5117916 mm: memcg: handle non-error OOM situations more gracefully
commit 4942642080ea82d99ab5b653abb9a12b7ba31f4a upstream.

Commit 3812c8c8f395 ("mm: memcg: do not trap chargers with full
callstack on OOM") assumed that only a few places that can trigger a
memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
readahead.  But there are many more and it's impractical to annotate
them all.

First of all, we don't want to invoke the OOM killer when the failed
allocation is gracefully handled, so defer the actual kill to the end of
the fault handling as well.  This simplifies the code quite a bit for
added bonus.

Second, since a failed allocation might not be the abrupt end of the
fault, the memcg OOM handler needs to be re-entrant until the fault
finishes for subsequent allocation attempts.  If an allocation is
attempted after the task already OOMed, allow it to bypass the limit so
that it can quickly finish the fault and invoke the OOM killer.

Reported-by: azurIt <azurit@pobox.sk>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21 09:22:56 -08:00
Johannes Weiner f79d6a4689 mm: memcg: do not trap chargers with full callstack on OOM
commit 3812c8c8f3953921ef18544110dafc3505c1ac62 upstream.

The memcg OOM handling is incredibly fragile and can deadlock.  When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds.  Comparably, any other task
that enters the charge path at this point will go to a waitqueue right
then and there and sleep until the OOM situation is resolved.  The problem
is that these tasks may hold filesystem locks and the mmap_sem; locks that
the selected OOM victim may need to exit.

For example, in one reported case, the task invoking the OOM killer was
about to charge a page cache page during a write(), which holds the
i_mutex.  The OOM killer selected a task that was just entering truncate()
and trying to acquire the i_mutex:

OOM invoking task:
  mem_cgroup_handle_oom+0x241/0x3b0
  mem_cgroup_cache_charge+0xbe/0xe0
  add_to_page_cache_locked+0x4c/0x140
  add_to_page_cache_lru+0x22/0x50
  grab_cache_page_write_begin+0x8b/0xe0
  ext3_write_begin+0x88/0x270
  generic_file_buffered_write+0x116/0x290
  __generic_file_aio_write+0x27c/0x480
  generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
  do_sync_write+0xea/0x130
  vfs_write+0xf3/0x1f0
  sys_write+0x51/0x90
  system_call_fastpath+0x18/0x1d

OOM kill victim:
  do_truncate+0x58/0xa0              # takes i_mutex
  do_last+0x250/0xa30
  path_openat+0xd7/0x440
  do_filp_open+0x49/0xa0
  do_sys_open+0x106/0x240
  sys_open+0x20/0x30
  system_call_fastpath+0x18/0x1d

The OOM handling task will retry the charge indefinitely while the OOM
killed task is not releasing any resources.

A similar scenario can happen when the kernel OOM killer for a memcg is
disabled and a userspace task is in charge of resolving OOM situations.
In this case, ALL tasks that enter the OOM path will be made to sleep on
the OOM waitqueue and wait for userspace to free resources or increase
the group's limit.  But a userspace OOM handler is prone to deadlock
itself on the locks held by the waiting tasks.  For example one of the
sleeping tasks may be stuck in a brk() call with the mmap_sem held for
writing but the userspace handler, in order to pick an optimal victim,
may need to read files from /proc/<pid>, which tries to acquire the same
mmap_sem for reading and deadlocks.

This patch changes the way tasks behave after detecting a memcg OOM and
makes sure nobody loops or sleeps with locks held:

1. When OOMing in a user fault, invoke the OOM killer and restart the
   fault instead of looping on the charge attempt.  This way, the OOM
   victim can not get stuck on locks the looping task may hold.

2. When OOMing in a user fault but somebody else is handling it
   (either the kernel OOM killer or a userspace handler), don't go to
   sleep in the charge context.  Instead, remember the OOMing memcg in
   the task struct and then fully unwind the page fault stack with
   -ENOMEM.  pagefault_out_of_memory() will then call back into the
   memcg code to check if the -ENOMEM came from the memcg, and then
   either put the task to sleep on the memcg's OOM waitqueue or just
   restart the fault.  The OOM victim can no longer get stuck on any
   lock a sleeping task may hold.

Debugged by Michal Hocko.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: azurIt <azurit@pobox.sk>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21 09:22:56 -08:00
Johannes Weiner 7a147e0c45 mm: memcg: rework and document OOM waiting and wakeup
commit fb2a6fc56be66c169f8b80e07ed999ba453a2db2 upstream.

The memcg OOM handler open-codes a sleeping lock for OOM serialization
(trylock, wait, repeat) because the required locking is so specific to
memcg hierarchies.  However, it would be nice if this construct would be
clearly recognizable and not be as obfuscated as it is right now.  Clean
up as follows:

1. Remove the return value of mem_cgroup_oom_unlock()

2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock().

3. Pull the prepare_to_wait() out of the memcg_oom_lock scope.  This
   makes it more obvious that the task has to be on the waitqueue
   before attempting to OOM-trylock the hierarchy, to not miss any
   wakeups before going to sleep.  It just didn't matter until now
   because it was all lumped together into the global memcg_oom_lock
   spinlock section.

4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope.
   It is proctected by the hierarchical OOM-lock.

5. The memcg_oom_lock spinlock is only required to propagate the OOM
   lock in any given hierarchy atomically.  Restrict its scope to
   mem_cgroup_oom_(trylock|unlock).

6. Do not wake up the waitqueue unconditionally at the end of the
   function.  Only the lockholder has to wake up the next in line
   after releasing the lock.

   Note that the lockholder kicks off the OOM-killer, which in turn
   leads to wakeups from the uncharges of the exiting task.  But a
   contender is not guaranteed to see them if it enters the OOM path
   after the OOM kills but before the lockholder releases the lock.
   Thus there has to be an explicit wakeup after releasing the lock.

7. Put the OOM task on the waitqueue before marking the hierarchy as
   under OOM as that is the point where we start to receive wakeups.
   No point in listening before being on the waitqueue.

8. Likewise, unmark the hierarchy before finishing the sleep, for
   symmetry.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: azurIt <azurit@pobox.sk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21 09:22:56 -08:00
Johannes Weiner 11f34787b5 mm: memcg: enable memcg OOM killer only for user faults
commit 519e52473ebe9db5cdef44670d5a97f1fd53d721 upstream.

System calls and kernel faults (uaccess, gup) can handle an out of memory
situation gracefully and just return -ENOMEM.

Enable the memcg OOM killer only for user faults, where it's really the
only option available.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: azurIt <azurit@pobox.sk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21 09:22:56 -08:00
Filipe Brandenburger ba9a72974f memcg: reparent charges of children before processing parent
commit 4fb1a86fb5e4209a7d4426d4e586c58e9edc74ac upstream.

Sometimes the cleanup after memcg hierarchy testing gets stuck in
mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.

There may turn out to be several causes, but a major cause is this: the
workitem to offline parent can get run before workitem to offline child;
parent's mem_cgroup_reparent_charges() circles around waiting for the
child's pages to be reparented to its lrus, but it's holding
cgroup_mutex which prevents the child from reaching its
mem_cgroup_reparent_charges().

Further testing showed that an ordered workqueue for cgroup_destroy_wq
is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
stage on the way can mess up the order before reaching the workqueue.

Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on
all its children (and grandchildren, in the correct order) to have their
charges reparented first.

[The version for 3.10.34 (or perhaps now 3.10.35) is this below.
Yes, more differences, and the old mem_cgroup_reparent_charges line
is intentionally left in for 3.10 whereas it was removed for 3.12+:
that's because the css/cgroup iterator changed in between, it used
not to supply the root of the subtree, but nowadays it does - Hugh]


Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-23 21:38:20 -07:00
Michal Hocko 5fb67b91df memcg: fix endless loop caused by mem_cgroup_iter
commit ecc736fc3c71c411a9d201d8588c9e7e049e5d8c upstream.

Hugh has reported an endless loop when the hardlimit reclaim sees the
same group all the time.  This might happen when the reclaim races with
the memcg removal.

shrink_zone
                                                [rmdir root]
  mem_cgroup_iter(root, NULL, reclaim)
    // prev = NULL
    rcu_read_lock()
    mem_cgroup_iter_load
      last_visited = iter->last_visited   // gets root || NULL
      css_tryget(last_visited)            // failed
      last_visited = NULL                 [1]
    memcg = root = __mem_cgroup_iter_next(root, NULL)
    mem_cgroup_iter_update
      iter->last_visited = root;
    reclaim->generation = iter->generation

 mem_cgroup_iter(root, root, reclaim)
   // prev = root
   rcu_read_lock
    mem_cgroup_iter_load
      last_visited = iter->last_visited   // gets root
      css_tryget(last_visited)            // failed
    [1]

The issue seemed to be introduced by commit 5f57816197 ("memcg: relax
memcg iter caching") which has replaced unconditional css_get/css_put by
css_tryget/css_put for the cached iterator.

This patch fixes the issue by skipping css_tryget on the root of the
tree walk in mem_cgroup_iter_load and symmetrically doesn't release it
in mem_cgroup_iter_update.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Hugh Dickins <hughd@google.com>
Tested-by: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: <stable@vger.kernel.org>	[3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-06 21:30:05 -08:00
Vladimir Davydov 76fca2297a memcg: fix memcg_size() calculation
commit 695c60830764945cf61a2cc623eb1392d137223e upstream.

The mem_cgroup structure contains nr_node_ids pointers to
mem_cgroup_per_node objects, not the objects themselves.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09 12:24:24 -08:00
Greg Thelen d96fa17975 memcg: fix multiple large threshold notifications
commit 2bff24a3707093c435ab3241c47dcdb5f16e432b upstream.

A memory cgroup with (1) multiple threshold notifications and (2) at least
one threshold >=2G was not reliable.  Specifically the notifications would
either not fire or would not fire in the proper order.

The __mem_cgroup_threshold() signaling logic depends on keeping 64 bit
thresholds in sorted order.  mem_cgroup_usage_register_event() sorts them
with compare_thresholds(), which returns the difference of two 64 bit
thresholds as an int.  If the difference is positive but has bit[31] set,
then sort() treats the difference as negative and breaks sort order.

This fix compares the two arbitrary 64 bit thresholds returning the
classic -1, 0, 1 result.

The test below sets two notifications (at 0x1000 and 0x81001000):
  cd /sys/fs/cgroup/memory
  mkdir x
  for x in 4096 2164264960; do
    cgroup_event_listener x/memory.usage_in_bytes $x | sed "s/^/$x listener:/" &
  done
  echo $$ > x/cgroup.procs
  anon_leaker 500M

v3.11-rc7 fails to signal the 4096 event listener:
  Leaking...
  Done leaking pages.

Patched v3.11-rc7 properly notifies:
  Leaking...
  4096 listener:2013:8:31:14:13:36
  Done leaking pages.

The fixed bug is old.  It appears to date back to the introduction of
memcg threshold notifications in v2.6.34-rc1-116-g2e72b6347c94 "memcg:
implement memory thresholds"

Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26 17:18:28 -07:00
Andrey Vagin 0dcf19b4fb memcg: don't initialize kmem-cache destroying work for root caches
commit 3e6b11df245180949938734bc192eaf32f3a06b3 upstream.

struct memcg_cache_params has a union.  Different parts of this union
are used for root and non-root caches.  A part with destroying work is
used only for non-root caches.

I fixed the same problem in another place v3.9-rc1-16204-gf101a94, but
didn't notice this one.

This patch fixes the kernel panic:

[   46.848187] BUG: unable to handle kernel paging request at 000000fffffffeb8
[   46.849026] IP: [<ffffffff811a484c>] kmem_cache_destroy_memcg_children+0x6c/0xc0
[   46.849092] PGD 0
[   46.849092] Oops: 0000 [#1] SMP
...

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-20 08:43:02 -07:00
Michal Hocko 71a429cf00 memcg, kmem: fix reference count handling on the error path
commit f37a96914d1aea10fed8d9af10251f0b9caea31b upstream.

mem_cgroup_css_online calls mem_cgroup_put if memcg_init_kmem fails.
This is not correct because only memcg_propagate_kmem takes an
additional reference while mem_cgroup_sockets_init is allowed to fail as
well (although no current implementation fails) but it doesn't take any
reference.  This all suggests that it should be memcg_propagate_kmem
that should clean up after itself so this patch moves mem_cgroup_put
over there.

Unfortunately this is not that easy (as pointed out by Li Zefan) because
memcg_kmem_mark_dead marks the group dead (KMEM_ACCOUNTED_DEAD) if it is
marked active (KMEM_ACCOUNTED_ACTIVE) which is the case even if
memcg_propagate_kmem fails so the additional reference is dropped in
that case in kmem_cgroup_destroy which means that the reference would be
dropped two times.

The easiest way then would be to simply remove mem_cgrroup_put from
mem_cgroup_css_online and rely on kmem_cgroup_destroy doing the right
thing.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-21 18:21:36 -07:00
Michal Hocko 9cafb55474 Revert "memcg: avoid dangling reference count in creation failure"
commit fa460c2d37870e0a6f94c70e8b76d05ca11b6db0 upstream.

This reverts commit e4715f01be.

mem_cgroup_put is hierarchy aware so mem_cgroup_put(memcg) already drops
an additional reference from all parents so the additional
mem_cgrroup_put(parent) potentially causes use-after-free.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-13 11:42:27 -07:00
Johannes Weiner 89dc991f0f mm: memcontrol: fix lockless reclaim hierarchy iterator
The lockless reclaim hierarchy iterator currently has a misplaced
barrier that can lead to use-after-free crashes.

The reclaim hierarchy iterator consist of a sequence count and a
position pointer that are read and written locklessly, with memory
barriers enforcing ordering.

The write side sets the position pointer first, then updates the
sequence count to "publish" the new position.  Likewise, the read side
must read the sequence count first, then the position.  If the sequence
count is up to date, it's guaranteed that the position is up to date as
well:

  writer:                         reader:
  iter->position = position       if iter->sequence == expected:
  smp_wmb()                           smp_rmb()
  iter->sequence = sequence           position = iter->position

However, the read side barrier is currently misplaced, which can lead to
dereferencing stale position pointers that no longer point to valid
memory.  Fix this.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: <stable@kernel.org>		[3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12 16:29:46 -07:00
Andrey Vagin f101a9464b memcg: don't initialize kmem-cache destroying work for root caches
struct memcg_cache_params has a union.  Different parts of this union
are used for root and non-root caches.  A part with destroying work is
used only for non-root caches.

  BUG: unable to handle kernel paging request at 0000000fffffffe0
  IP: kmem_cache_alloc+0x41/0x1f0
  Modules linked in: netlink_diag af_packet_diag udp_diag tcp_diag inet_diag unix_diag ip6table_filter ip6_tables i2c_piix4 virtio_net virtio_balloon microcode i2c_core pcspkr floppy
  CPU: 0 PID: 1929 Comm: lt-vzctl Tainted: G      D      3.10.0-rc1+ #2
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  RIP: kmem_cache_alloc+0x41/0x1f0
  Call Trace:
   getname_flags.part.34+0x30/0x140
   getname+0x38/0x60
   do_sys_open+0xc5/0x1e0
   SyS_open+0x22/0x30
   system_call_fastpath+0x16/0x1b
  Code: f4 53 48 83 ec 18 8b 05 8e 53 b7 00 4c 8b 4d 08 21 f0 a8 10 74 0d 4c 89 4d c0 e8 1b 76 4a 00 4c 8b 4d c0 e9 92 00 00 00 4d 89 f5 <4d> 8b 45 00 65 4c 03 04 25 48 cd 00 00 49 8b 50 08 4d 8b 38 49
  RIP  [<ffffffff8116b641>] kmem_cache_alloc+0x41/0x1f0

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Li Zefan <lizefan@huawei.com>
Cc: <stable@vger.kernel.org>	[3.9.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12 16:29:45 -07:00
Johannes Weiner 28ccddf795 mm: memcg: remove incorrect VM_BUG_ON for swap cache pages in uncharge
Commit 0c59b89c81 ("mm: memcg: push down PageSwapCache check into
uncharge entry functions") added a VM_BUG_ON() on PageSwapCache in the
uncharge path after checking that page flag once, assuming that the
state is stable in all paths, but this is not the case and the condition
triggers in user environments.  An uncharge after the last page table
reference to the page goes away can race with reclaim adding the page to
swap cache.

Swap cache pages are usually uncharged when they are freed after
swapout, from a path that also handles swap usage accounting and memcg
lifetime management.  However, since the last page table reference is
gone and thus no references to the swap slot left, the swap slot will be
freed shortly when reclaim attempts to write the page to disk.  The
whole swap accounting is not even necessary.

So while the race condition for which this VM_BUG_ON was added is real
and actually existed all along, there are no negative effects.  Remove
the VM_BUG_ON again.

Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reported-by: Lingzhu Xiang <lxiang@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-24 16:22:51 -07:00
David Rientjes b070e65c0b mm, memcg: add rss_huge stat to memory.stat
This exports the amount of anonymous transparent hugepages for each
memcg via the new "rss_huge" stat in memory.stat.  The units are in
bytes.

This is helpful to determine the hugepage utilization for individual
jobs on the system in comparison to rss and opportunities where
MADV_HUGEPAGE may be helpful.

The amount of anonymous transparent hugepages is also included in "rss"
for backwards compatibility.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-07 18:38:26 -07:00
Linus Torvalds 191a712090 Merge branch 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - Fixes and a lot of cleanups.  Locking cleanup is finally complete.
   cgroup_mutex is no longer exposed to individual controlelrs which
   used to cause nasty deadlock issues.  Li fixed and cleaned up quite a
   bit including long standing ones like racy cgroup_path().

 - device cgroup now supports proper hierarchy thanks to Aristeu.

 - perf_event cgroup now supports proper hierarchy.

 - A new mount option "__DEVEL__sane_behavior" is added.  As indicated
   by the name, this option is to be used for development only at this
   point and generates a warning message when used.  Unfortunately,
   cgroup interface currently has too many brekages and inconsistencies
   to implement a consistent and unified hierarchy on top.  The new flag
   is used to collect the behavior changes which are necessary to
   implement consistent unified hierarchy.  It's likely that this flag
   won't be used verbatim when it becomes ready but will be enabled
   implicitly along with unified hierarchy.

   The option currently disables some of broken behaviors in cgroup core
   and also .use_hierarchy switch in memcg (will be routed through -mm),
   which can be used to make very unusual hierarchy where nesting is
   partially honored.  It will also be used to implement hierarchy
   support for blk-throttle which would be impossible otherwise without
   introducing a full separate set of control knobs.

   This is essentially versioning of interface which isn't very nice but
   at this point I can't see any other options which would allow keeping
   the interface the same while moving towards hierarchy behavior which
   is at least somewhat sane.  The planned unified hierarchy is likely
   to require some level of adaptation from userland anyway, so I think
   it'd be best to take the chance and update the interface such that
   it's supportable in the long term.

   Maintaining the existing interface does complicate cgroup core but
   shouldn't put too much strain on individual controllers and I think
   it'd be manageable for the foreseeable future.  Maybe we'll be able
   to drop it in a decade.

Fix up conflicts (including a semantic one adding a new #include to ppc
that was uncovered by header the file changes) as per Tejun.

* 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits)
  cpuset: fix compile warning when CONFIG_SMP=n
  cpuset: fix cpu hotplug vs rebuild_sched_domains() race
  cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
  cgroup: restore the call to eventfd->poll()
  cgroup: fix use-after-free when umounting cgroupfs
  cgroup: fix broken file xattrs
  devcg: remove parent_cgroup.
  memcg: force use_hierarchy if sane_behavior
  cgroup: remove cgrp->top_cgroup
  cgroup: introduce sane_behavior mount option
  move cgroupfs_root to include/linux/cgroup.h
  cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
  cgroup: make cgroup_path() not print double slashes
  Revert "cgroup: remove bind() method from cgroup_subsys."
  perf: make perf_event cgroup hierarchical
  cgroup: implement cgroup_is_descendant()
  cgroup: make sure parent won't be destroyed before its children
  cgroup: remove bind() method from cgroup_subsys.
  devcg: remove broken_hierarchy tag
  cgroup: remove cgroup_lock_is_held()
  ...
2013-04-29 19:14:20 -07:00
Li Zefan ca0dde9717 memcg: take reference before releasing rcu_read_lock
The memcg is not referenced, so it can be destroyed at anytime right
after we exit rcu read section, so it's not safe to access it.

To fix this, we call css_tryget() to get a reference while we're still
in rcu read section.

This also removes a bogus comment above __memcg_create_cache_enqueue().

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:40 -07:00
David Rientjes 465adcf1ea mm, memcg: give exiting processes access to memory reserves
A memcg may livelock when oom if the process that grabs the hierarchy's
oom lock is never the first process with PF_EXITING set in the memcg's
task iteration.

The oom killer, both global and memcg, will defer if it finds an
eligible process that is in the process of exiting and it is not being
ptraced.  The idea is to allow it to exit without using memory reserves
before needlessly killing another process.

This normally works fine except in the memcg case with a large number of
threads attached to the oom memcg.  In this case, the memcg oom killer
only gets called for the process that grabs the hierarchy's oom lock;
all others end up blocked on the memcg's oom waitqueue.  Thus, if the
process that grabs the hierarchy's oom lock is never the first
PF_EXITING process in the memcg's task iteration, the oom killer is
constantly deferred without anything making progress.

The fix is to give PF_EXITING processes access to memory reserves so
that we've marked them as oom killed without any iteration.  This allows
__mem_cgroup_try_charge() to succeed so that the process may exit.  This
makes the memcg oom killer exemption for TIF_MEMDIE tasks, now
immediately granted for processes with pending SIGKILLs and those in the
exit path, to be equivalent to what is done for the global oom killer.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:39 -07:00
Li Zefan fd0ccaf2bd memcg: avoid accessing memcg after releasing reference
This might cause a use-after-free bug.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:39 -07:00
Anton Vorontsov 70ddf637ee memcg: add memory.pressure_level events
With this patch userland applications that want to maintain the
interactivity/memory allocation cost can use the pressure level
notifications.  The levels are defined like this:

The "low" level means that the system is reclaiming memory for new
allocations.  Monitoring this reclaiming activity might be useful for
maintaining cache level.  Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shutdown unimportant services).

The "medium" level means that the system is experiencing medium memory
pressure, the system might be making swap, paging out active file
caches, etc.  Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.

The "critical" level means that the system is actively thrashing, it is
about to out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger.  Applications should do whatever they can to help the
system.  It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.

The events are propagated upward until the event is handled, i.e.  the
events are not pass-through.  Here is what this means: for example you
have three cgroups: A->B->C.  Now you set up an event listener on
cgroups A, B and C, and suppose group C experiences some pressure.  In
this situation, only group C will receive the notification, i.e.  groups
A and B will not receive it.  This is done to avoid excessive
"broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing.  So, organize the
cgroups wisely, or propagate the events manually (or, ask us to
implement the pass-through events, explaining why would you need them.)

Performance wise, the memory pressure notifications feature itself is
lightweight and does not require much of bookkeeping, in contrast to the
rest of memcg features.  Unfortunately, as of current memcg
implementation, pages accounting is an inseparable part and cannot be
turned off.  The good news is that there are some efforts[1] to improve
the situation; plus, implementing the same, fully API-compatible[2]
interface for CONFIG_MEMCG=n case (e.g.  embedded) is also a viable
option, so it will not require any changes on the userland side.

[1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
[2] http://lkml.org/lkml/2013/2/21/454

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_CGROPUPS=n warnings]
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:38 -07:00
Michel Lespinasse 573b400d01 mm/memcontrol.c: remove unnecessary ;
Just a trivial issue I stumbled on while doing something else...

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:37 -07:00
Michal Hocko acb6d558f4 memcg: do not check for do_swap_account in mem_cgroup_{read,write,reset}
Since commit 2d11085e40 ("memcg: do not create memsw files if swap
accounting is disabled") memsw files are created only if memcg swap
accounting is enabled so it doesn't make any sense to check for it
explicitly in mem_cgroup_read(), mem_cgroup_write() and
mem_cgroup_reset().

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:34 -07:00
Michal Hocko 16248d8fe6 memcg: further simplify mem_cgroup_iter
mem_cgroup_iter basically does two things currently.  It takes care of
the house keeping (reference counting, raclaim cookie) and it iterates
through a hierarchy tree (by using cgroup generic tree walk).  The code
would be much more easier to follow if we move the iteration outside of
the function (to __mem_cgrou_iter_next) so the distinction is more
clear.  This patch doesn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Ying Han <yinghan@google.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:33 -07:00
Michal Hocko 19f3940286 memcg: simplify mem_cgroup_iter
The current implementation of mem_cgroup_iter has to consider both css
and memcg to find out whether no group has been found (css==NULL - aka
the loop is completed) and that no memcg is associated with the found
node (!memcg - aka css_tryget failed because the group is no longer
alive).  This leads to awkward tweaks like tests for css && !memcg to
skip the current node.

It will be much easier if we got rid off css variable altogether and
only rely on memcg.  In order to do that the iteration part has to skip
dead nodes.  This sounds natural to me and as a nice side effect we will
get a simple invariant that memcg is always alive when non-NULL and all
nodes have been visited otherwise.

We could get rid of the surrounding while loop but keep it in for now to
make review easier.  It will go away in the following patch.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Ying Han <yinghan@google.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:33 -07:00
Michal Hocko 5f57816197 memcg: relax memcg iter caching
Now that the per-node-zone-priority iterator caches memory cgroups
rather than their css ids we have to be careful and remove them from the
iterator when they are on the way out otherwise they might live for
unbounded amount of time even though their group is already gone (until
the global/targeted reclaim triggers the zone under priority to find out
the group is dead and let it to find the final rest).

We can fix this issue by relaxing rules for the last_visited memcg.
Instead of taking a reference to the css before it is stored into
iter->last_visited we can just store its pointer and track the number of
removed groups from each memcg's subhierarchy.

This number would be stored into iterator everytime when a memcg is
cached.  If the iter count doesn't match the curent walker root's one we
will start from the root again.  The group counter is incremented
upwards the hierarchy every time a group is removed.

The iter_lock can be dropped because racing iterators cannot leak the
reference anymore as the reference count is not elevated for
last_visited when it is cached.

Locking rules got a bit complicated by this change though.  The iterator
primarily relies on rcu read lock which makes sure that once we see a
valid last_visited pointer then it will be valid for the whole RCU walk.
smp_rmb makes sure that dead_count is read before last_visited and
last_dead_count while smp_wmb makes sure that last_visited is updated
before last_dead_count so the up-to-date last_dead_count cannot point to
an outdated last_visited.  css_tryget then makes sure that the
last_visited is still alive in case the iteration races with the cached
group removal (css is invalidated before mem_cgroup_css_offline
increments dead_count).

In short:
mem_cgroup_iter
 rcu_read_lock()
 dead_count = atomic_read(parent->dead_count)
 smp_rmb()
 if (dead_count != iter->last_dead_count)
 	last_visited POSSIBLY INVALID -> last_visited = NULL
 if (!css_tryget(iter->last_visited))
 	last_visited DEAD -> last_visited = NULL
 next = find_next(last_visited)
 css_tryget(next)
 css_put(last_visited) 	// css would be invalidated and parent->dead_count
 			// incremented if this was the last reference
 iter->last_visited = next
 smp_wmb()
 iter->last_dead_count = dead_count
 rcu_read_unlock()

cgroup_rmdir
 cgroup_destroy_locked
  atomic_add(CSS_DEACT_BIAS, &css->refcnt) // subsequent css_tryget fail
   mem_cgroup_css_offline
    mem_cgroup_invalidate_reclaim_iterators
     while(parent = parent_mem_cgroup)
     	atomic_inc(parent->dead_count)
  css_put(css) // last reference held by cgroup core

Spotted by Ying Han.

Original idea from Johannes Weiner.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ying Han <yinghan@google.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:32 -07:00
Michal Hocko 542f85f9ae memcg: rework mem_cgroup_iter to use cgroup iterators
mem_cgroup_iter curently relies on css->id when walking down a group
hierarchy tree.  This is really awkward because the tree walk depends on
the groups creation ordering.  The only guarantee is that a parent node is
visited before its children.

Example:

 1) mkdir -p a a/d a/b/c
 2) mkdir -a a/b/c a/d

Will create the same trees but the tree walks will be different:

 1) a, d, b, c
 2) a, b, c, d

Commit 574bd9f7c7 ("cgroup: implement generic child / descendant walk
macros") has introduced generic cgroup tree walkers which provide either
pre-order or post-order tree walk.  This patch converts css->id based
iteration to pre-order tree walk to keep the semantic with the original
iterator where parent is always visited before its subtree.

cgroup_for_each_descendant_pre suggests using post_create and
pre_destroy for proper synchronization with groups addidition resp.
removal.  This implementation doesn't use those because a new memory
cgroup is initialized sufficiently for iteration in mem_cgroup_css_alloc
already and css reference counting enforces that the group is alive for
both the last seen cgroup and the found one resp.  it signals that the
group is dead and it should be skipped.

If the reclaim cookie is used we need to store the last visited group
into the iterator so we have to be careful that it doesn't disappear in
the mean time.  Elevated reference count on the css keeps it alive even
though the group have been removed (parked waiting for the last dput so
that it can be freed).

Per node-zone-prio iter_lock has been introduced to ensure that
css_tryget and iter->last_visited is set atomically.  Otherwise two
racing walkers could both take a references and only one release it
leading to a css leak (which pins cgroup dentry).

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Ying Han <yinghan@google.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:32 -07:00
Michal Hocko c40046f3ad memcg: keep prev's css alive for the whole mem_cgroup_iter
The patchset tries to make mem_cgroup_iter saner in the way how it walks
hierarchies.  css->id based traversal is far from being ideal as it is not
deterministic because it depends on the creation ordering.  Additional to
that css_id is considered a burden for cgroup maintainers because it is
quite some code and memcg is the last user of it.  After this series only
the swap accounting uses css_id but that one will follow up later.

Diffstat (if we exclude removed/added comments) looks quite
promising. We got rid of some code:

  $ git diff mmotm... | grep -v "^[+-][[:space:]]*[/ ]\*" | diffstat
   b/include/linux/cgroup.h |    3 ---
   kernel/cgroup.c          |   33 ---------------------------------
   mm/memcontrol.c          |    4 +++-
   3 files changed, 3 insertions(+), 37 deletions(-)

The first patch is just preparatory and it changes when we release css of
the previously returned memcg.  Nothing controlversial.

The second patch is the core of the patchset and it replaces css_get_next
based on css_id by the generic cgroup pre-order.  This brings some
chalanges for the last visited group caching during the reclaim
(mem_cgroup_per_zone::reclaim_iter).  We have to use memcg pointers
directly now which means that we have to keep a reference to those groups'
css to keep them alive.

I also folded iter_lock introduced by https://lkml.org/lkml/2013/1/3/295
in the previous version into this patch.  Johannes felt the race I was
describing should be mostly harmless and I haven't been able to trigger it
so the lock doesn't deserve its own patch.  It is still needed
temporarily, though, because the reference counting on iter->last_visited
depends on it.  It will go away with the next patch.

The next patch fixups an unbounded cgroup removal holdoff caused by the
elevated css refcount.  The issue has been observed by Ying Han.  Johannes
wasn't impressed by the previous version of the fix
(https://lkml.org/lkml/2013/2/8/379) which cleaned up pending references
during mem_cgroup_css_offline when a group is removed.  He has suggested a
different way when the iterator checks whether a cached memcg is still
valid or no.  More on that in the patch but the basic idea is that every
memcg tracks the number removed subgroups and iterator records this number
when a group is cached.  These numbers are checked before
iter->last_visited is about to be used and the iteration is restarted if
it is invalid.

The fourth and fifth patches are an attempt for simplification of the
mem_cgroup_iter.  css juggling is removed and the iteration logic is moved
to a helper so that the reference counting and iteration are separated.

The last patch just removes css_get_next as there is no user for it any
longer.

My testing looked as follows:
        A (use_hierarchy=1, limit_in_bytes=150M)
       /|\
      1 2 3

Children groups were created so that the number is never higher than 3 and
their limits were random between 50-100M.  Each group hosts a kernel build
(starting with tar -xf so the tree is not shared and make -jNUM_CPUs/3)
and terminated after random time - up to 5 minutes) and then it is
removed.

This should exercise both leaf and hierarchical reclaim as well as races
with cgroup removals and debugging messages I added on top proved that.
100 groups were created during the test.

This patch:

css reference counting keeps the cgroup alive even though it has been
already removed.  mem_cgroup_iter relies on this fact and takes a
reference to the returned group.  The reference is then released on the
next iteration or mem_cgroup_iter_break.  mem_cgroup_iter currently
releases the reference right after it gets the last css_id.

This is correct because neither prev's memcg nor cgroup are accessed after
then.  This will change in the next patch so we need to hold the group
alive a bit longer so let's move the css_put at the end of the function.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Ying Han <yinghan@google.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:32 -07:00
Tejun Heo f00baae7ad memcg: force use_hierarchy if sane_behavior
Turn on use_hierarchy by default if sane_behavior is specified and
don't create .use_hierarchy file.

It is debatable whether to remove .use_hierarchy file or make it ro as
the former could make transition easier in certain cases; however, the
behavior changes which will be gated by sane_behavior are intensive
including changing basic meaning of certain control knobs in a few
controllers and I don't really think keeping this piece would make
things easier in any noticeable way, so let's remove it.

v2: Explain that mem_cgroup_bind() doesn't have to worry about
    children as suggested by Michal Hocko.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2013-04-15 13:46:27 -07:00
Michal Hocko d9c10ddddc memcg: fix memcg_cache_name() to use cgroup_name()
As cgroup supports rename, it's unsafe to dereference dentry->d_name
without proper vfs locks. Fix this by using cgroup_name() rather than
dentry directly.

Also open code memcg_cache_name because it is called only from
kmem_cache_dup which frees the returned name right after
kmem_cache_create_memcg makes a copy of it. Such a short-lived
allocation doesn't make too much sense. So replace it by a static
buffer as kmem_cache_dup is called with memcg_cache_mutex.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-07 09:28:23 -07:00
Konstantin Khlebnikov 15cf17d26e memcg: initialize kmem-cache destroying work earlier
Fix a warning from lockdep caused by calling cancel_work_sync() for
uninitialized struct work.  This path has been triggered by destructon
kmem-cache hierarchy via destroying its root kmem-cache.

  cache ffff88003c072d80
  obj ffff88003b410000 cache ffff88003c072d80
  obj ffff88003b924000 cache ffff88003c20bd40
  INFO: trying to register non-static key.
  the code is fine but needs lockdep annotation.
  turning off the locking correctness validator.
  Pid: 2825, comm: insmod Tainted: G           O 3.9.0-rc1-next-20130307+ #611
  Call Trace:
    __lock_acquire+0x16a2/0x1cb0
    lock_acquire+0x8a/0x120
    flush_work+0x38/0x2a0
    __cancel_work_timer+0x89/0xf0
    cancel_work_sync+0xb/0x10
    kmem_cache_destroy_memcg_children+0x81/0xb0
    kmem_cache_destroy+0xf/0xe0
    init_module+0xcb/0x1000 [kmem_test]
    do_one_initcall+0x11a/0x170
    load_module+0x19b0/0x2320
    SyS_init_module+0xc6/0xf0
    system_call_fastpath+0x16/0x1b

Example module to demonstrate:

  #include <linux/module.h>
  #include <linux/slab.h>
  #include <linux/mm.h>
  #include <linux/workqueue.h>

  int __init mod_init(void)
  {
  	int size = 256;
  	struct kmem_cache *cache;
  	void *obj;
  	struct page *page;

  	cache = kmem_cache_create("kmem_cache_test", size, size, 0, NULL);
  	if (!cache)
  		return -ENOMEM;

  	printk("cache %p\n", cache);

  	obj = kmem_cache_alloc(cache, GFP_KERNEL);
  	if (obj) {
  		page = virt_to_head_page(obj);
  		printk("obj %p cache %p\n", obj, page->slab_cache);
  		kmem_cache_free(cache, obj);
  	}

  	flush_scheduled_work();

  	obj = kmem_cache_alloc(cache, GFP_KERNEL);
  	if (obj) {
  		page = virt_to_head_page(obj);
  		printk("obj %p cache %p\n", obj, page->slab_cache);
  		kmem_cache_free(cache, obj);
  	}

  	kmem_cache_destroy(cache);

  	return -EBUSY;
  }

  module_init(mod_init);
  MODULE_LICENSE("GPL");

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-08 15:05:34 -08:00
Hugh Dickins 6d04399040 memcg: stop warning on memcg_propagate_kmem
Whilst I run the risk of a flogging for disloyalty to the Lord of Sealand,
I do have CONFIG_MEMCG=y CONFIG_MEMCG_KMEM not set, and grow tired of the
"mm/memcontrol.c:4972:12: warning: `memcg_propagate_kmem' defined but not
used [-Wunused-function]" seen in 3.8-rc: move the #ifdef outwards.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:22 -08:00
Michal Hocko 1081312f95 memcg: cleanup mem_cgroup_init comment
We should encourage all memcg controller initialization independent on a
specific mem_cgroup to be done here rather than exploit css_alloc
callback and assume that nothing happens before root cgroup is created.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:21 -08:00
Michal Hocko e477749624 memcg: move memcg_stock initialization to mem_cgroup_init
memcg_stock are currently initialized during the root cgroup allocation
which is OK but it pointlessly pollutes memcg allocation code with
something that can be called when the memcg subsystem is initialized by
mem_cgroup_init along with other controller specific parts.

This patch wraps the current memcg_stock initialization code into a
helper calls it from the controller subsystem initialization code.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:21 -08:00
Michal Hocko 8787a1df30 memcg: move mem_cgroup_soft_limit_tree_init to mem_cgroup_init
Per-node-zone soft limit tree is currently initialized when the root
cgroup is created which is OK but it pointlessly pollutes memcg
allocation code with something that can be called when the memcg
subsystem is initialized by mem_cgroup_init along with other controller
specific parts.

While we are at it let's make mem_cgroup_soft_limit_tree_init void
because it doesn't make much sense to report memory failure because if
we fail to allocate memory that early during the boot then we are
screwed anyway (this saves some code).

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:21 -08:00
Johannes Weiner e3790144c9 mm: refactor inactive_file_is_low() to use get_lru_size()
An inactive file list is considered low when its active counterpart is
bigger, regardless of whether it is a global zone LRU list or a memcg
zone LRU list.  The only difference is in how the LRU size is assessed.

get_lru_size() does the right thing for both global and memcg reclaim
situations.

Get rid of inactive_file_is_low_global() and
mem_cgroup_inactive_file_is_low() by using get_lru_size() and compare
the numbers in common code.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:20 -08:00
Glauber Costa e4715f01be memcg: avoid dangling reference count in creation failure.
When use_hierarchy is enabled, we acquire an extra reference count in
our parent during cgroup creation.  We don't release it, though, if any
failure exist in the creation process.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reported-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:18 -08:00
Glauber Costa 692e89abd1 memcg: increment static branch right after limit set
We were deferring the kmemcg static branch increment to a later time,
due to a nasty dependency between the cpu_hotplug lock, taken by the
jump label update, and the cgroup_lock.

Now we no longer take the cgroup lock, and we can save ourselves the
trouble.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:18 -08:00
Glauber Costa 0999821b1d memcg: replace cgroup_lock with memcg specific memcg_lock
After the preparation work done in earlier patches, the cgroup_lock can
be trivially replaced with a memcg-specific lock.  This is an automatic
translation at every site where the values involved were queried.

The sites where values are written, however, used to be naturally called
under cgroup_lock.  This is the case for instance in the css_online
callback.  For those, we now need to explicitly add the memcg lock.

With this, all the calls to cgroup_lock outside cgroup core are gone.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:18 -08:00
Glauber Costa b5f99b537d memcg: fast hierarchy-aware child test
Currently, we use cgroups' provided list of children to verify if it is
safe to proceed with any value change that is dependent on the cgroup
being empty.

This is less than ideal, because it enforces a dependency over cgroup
core that we would be better off without.  The solution proposed here is
to iterate over the child cgroups and if any is found that is already
online, we bounce and return: we don't really care how many children we
have, only if we have any.

This is also made to be hierarchy aware.  IOW, cgroups with hierarchy
disabled, while they still exist, will be considered for the purpose of
this interface as having no children.

[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:18 -08:00
Glauber Costa d142e3e667 memcg: split part of memcg creation to css_online
This patch is a preparatory work for later locking rework to get rid of
big cgroup lock from memory controller code.

The memory controller uses some tunables to adjust its operation.  Those
tunables are inherited from parent to children upon children
intialization.  For most of them, the value cannot be changed after the
parent has a new children.

cgroup core splits initialization in two phases: css_alloc and css_online.
After css_alloc, the memory allocation and basic initialization are done.
But the new group is not yet visible anywhere, not even for cgroup core
code.  It is only somewhere between css_alloc and css_online that it is
inserted into the internal children lists.  Copying tunable values in
css_alloc will lead to inconsistent values: the children will copy the old
parent values, that can change between the copy and the moment in which
the groups is linked to any data structure that can indicate the presence
of children.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:18 -08:00
Glauber Costa ee5e8472b8 memcg: prevent changes to move_charge_at_immigrate during task attach
In memcg, we use the cgroup_lock basically to synchronize against
attaching new children to a cgroup.  We do this because we rely on
cgroup core to provide us with this information.

We need to guarantee that upon child creation, our tunables are
consistent.  For those, the calls to cgroup_lock() all live in handlers
like mem_cgroup_hierarchy_write(), where we change a tunable in the
group that is hierarchy-related.  For instance, the use_hierarchy flag
cannot be changed if the cgroup already have children.

Furthermore, those values are propagated from the parent to the child
when a new child is created.  So if we don't lock like this, we can end
up with the following situation:

A                                   B
 memcg_css_alloc()                       mem_cgroup_hierarchy_write()
 copy use hierarchy from parent          change use hierarchy in parent
 finish creation.

This is mainly because during create, we are still not fully connected
to the css tree.  So all iterators and the such that we could use, will
fail to show that the group has children.

My observation is that all of creation can proceed in parallel with
those tasks, except value assignment.  So what this patch series does is
to first move all value assignment that is dependent on parent values
from css_alloc to css_online, where the iterators all work, and then we
lock only the value assignment.  This will guarantee that parent and
children always have consistent values.  Together with an online test,
that can be derived from the observation that the refcount of an online
memcg can be made to be always positive, we should be able to
synchronize our side without the cgroup lock.

This patch:

Currently, we rely on the cgroup_lock() to prevent changes to
move_charge_at_immigrate during task migration.  However, this is only
needed because the current strategy keeps checking this value throughout
the whole process.  Since all we need is serialization, one needs only
to guarantee that whatever decision we made in the beginning of a
specific migration is respected throughout the process.

We can achieve this by just saving it in mc.  By doing this, no kind of
locking is needed.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:18 -08:00
Glauber Costa 45cf7ebd5a memcg: reduce the size of struct memcg 244-fold.
In order to maintain all the memcg bookkeeping, we need per-node
descriptors, which will in turn contain a per-zone descriptor.

Because we want to statically allocate those, this array ends up being
very big.  Part of the reason is that we allocate something large enough
to hold MAX_NUMNODES, the compile time constant that holds the maximum
number of nodes we would ever consider.

However, we can do better in some cases if the firmware help us.  This
is true for modern x86 machines; coincidentally one of the architectures
in which MAX_NUMNODES tends to be very big.

By using the firmware-provided maximum number of nodes instead of
MAX_NUMNODES, we can reduce the memory footprint of struct memcg
considerably.  In the extreme case in which we have only one node, this
reduces the size of the structure from ~ 64k to ~2k.  This is
particularly important because it means that we will no longer resort to
the vmalloc area for the struct memcg on defconfigs.  We also have
enough room for an extra node and still be outside vmalloc.

One also has to keep in mind that with the industry's ability to fit
more processors in a die as fast as the FED prints money, a nodes = 2
configuration is already respectably big.

[akpm@linux-foundation.org: add check for invalid nid, remove inline]
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ying Han <yinghan@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:18 -08:00
Michal Hocko 6acc8b0251 memcg: clean up swap accounting initialization code
Memcg swap accounting is currently enabled by enable_swap_cgroup when
the root cgroup is created.  mem_cgroup_init acts as a memcg subsystem
initializer which sounds like a much better place for enable_swap_cgroup
as well.  We already register memsw files from there so it makes a lot
of sense to merge those two into a single enable_swap_cgroup function.

This patch doesn't introduce any semantic changes.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Zhouping Liu <zliu@redhat.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: CAI Qian <caiqian@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:17 -08:00
Michal Hocko 2d11085e40 memcg: do not create memsw files if swap accounting is disabled
Zhouping Liu has reported that memsw files are exported even though swap
accounting is runtime disabled if MEMCG_SWAP is enabled.  This behavior
has been introduced by commit af36f906c0 ("memcg: always create memsw
files if CGROUP_MEM_RES_CTLR_SWAP") and it causes any attempt to open
the file to return EOPNOTSUPP.  Although EOPNOTSUPP should say be clear
that memsw operations are not supported in the given configuration it is
fair to say that this behavior could be quite confusing.

Let's tear memsw files out of default cgroup files and add them only if
the swap accounting is really enabled (either by MEMCG_SWAP_ENABLED or
swapaccount=1 boot parameter).  We can hook into mem_cgroup_init which
is called when the memcg subsystem is initialized and which happens
after boot command line is processed.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Zhouping Liu <zliu@redhat.com>
Tested-by: Zhouping Liu <zliu@redhat.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: CAI Qian <caiqian@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:17 -08:00
Shaohua Li 33806f06da swap: make each swap partition have one address_space
When I use several fast SSD to do swap, swapper_space.tree_lock is
heavily contended.  This makes each swap partition have one
address_space to reduce the lock contention.  There is an array of
address_space for swap.  The swap entry type is the index to the array.

In my test with 3 SSD, this increases the swapout throughput 20%.

[akpm@linux-foundation.org: revert unneeded change to  __add_to_swap_cache]
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:17 -08:00
Andrew Morton d045197ff9 mm/memcontrol.c: convert printk(KERN_FOO) to pr_foo()
Acked-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:09 -08:00