Commit Graph

94 Commits

Author SHA1 Message Date
LuK1337 fc9499e55a Import latest Samsung release
* Package version: T713XXU2BQCO

Change-Id: I293d9e7f2df458c512d59b7a06f8ca6add610c99
2017-04-18 03:43:52 +02:00
Andrey Ryabinin 3c9cc62b4f mm: page_alloc: add kasan hooks on alloc and free paths
Add kernel address sanitizer hooks to mark allocated page's addresses as
accessible in corresponding shadow region.  Mark freed pages as
inaccessible.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
Cc: Yuri Gribov <tetra2005@gmail.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: b8c73fc2493d42517be95cf2c89659fc6c6f4d02
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Change-Id: I222d51af2117b92007d028b62a65f236ef267db0
Signed-off-by: David Keitel <dkeitel@codeaurora.org>
2015-05-04 14:03:54 -07:00
Ian Maund 068b0551a9 This is the 3.10.73 stable release
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJVFBE+AAoJEDjbvchgkmk+oTkP/j2ipSvgXghFEipZbOJUQkqC
 fa8elfoF7riTKpKOuDtDU2WI1ttCGYs5gmTNpd4KaEt23eJOQgVqIpV8GhAkW5Af
 NVyGhjF3dXNqpBkxnyuIkk5OLrNKGRNS2xpz1U254iGObYrK+tr62IzGPxEcPAhX
 Y+58xPVSjLtNdTJW3YLT3DohUbnbHG6Br9geI1IHtlxg1oDiTxtnX2FmOFzzDpP5
 qu8gnPIekg/+1EE46nEiq0C59AwC3aCzNxwlYe1Kd41SY3LUFF1eZMzmOnnwyI5K
 3FslAzT6x/sOmGJFTYrKjFA4GKsW67xHVkB/hp/Mu768RqxiQCxV4kgmPsAFLbXb
 D5qbNwr3i0iQ/9AaD7h8HJkxC/KHmszMux00L/mgZ3SGdGMEIBxHg+oP8+nP8V6C
 WfXKSWA94dpdRyULEfWdnKnUnp2860C7kt7ASTkOl8rIgU8HgaRqeu+U/KPM2ovD
 ZJtXPVB5UXCRuVAhZwbvvrLOY8UMZTnv2auAaeLYG8YptcvGeN5Z398/8qdV/z7c
 A9kOsgebs74X+lR3rbVgSDPQaq2AEiuIvtX77SfmrWXBXGmc99i9+PikuFggRprz
 cJm5bCM9DaHu/3b77X9Fwl7vnpReB0zPHiwTdH/p7OPMf5m1uQt7SqegC6btLPHs
 iYgjLd4oW+6uiV/2X1Vx
 =L+mC
 -----END PGP SIGNATURE-----

Merge commit 'v3.10.73' into msm-3.10

This merge brings us up to date with upstream kernel.org tag v3.10.73.
As part of the conflict resolution, changes introduced by commit 72684eae7
("arm64: Fix up /proc/cpuinfo") have been intentionally dropped, as they
conflict with Android changes msm-3.10 kernel to solve the problems
in a different way. Since userspace readers of this file may depend on
the existing msm-3.10 implementation, it's left as-is for now. The
commit may later be introduced if it is found to not impact userspaces
paired with this kernel.

* commit 'v3.10.73' (264 commits):
  Linux 3.10.73
  target: Allow Write Exclusive non-reservation holders to READ
  target: Allow AllRegistrants to re-RESERVE existing reservation
  target: Fix R_HOLDER bit usage for AllRegistrants
  target/pscsi: Fix NULL pointer dereference in get_device_type
  iscsi-target: Avoid early conn_logout_comp for iser connections
  target: Fix reference leak in target_get_sess_cmd() error path
  ARM: at91: pm: fix at91rm9200 standby
  ipvs: rerouting to local clients is not needed anymore
  ipvs: add missing ip_vs_pe_put in sync code
  powerpc/smp: Wait until secondaries are active & online
  x86/vdso: Fix the build on GCC5
  x86/fpu: Drop_fpu() should not assume that tsk equals current
  x86/fpu: Avoid math_state_restore() without used_math() in __restore_xstate_sig()
  crypto: aesni - fix memory usage in GCM decryption
  libsas: Fix Kernel Crash in smp_execute_task
  xen-pciback: limit guest control of command register
  nilfs2: fix deadlock of segment constructor during recovery
  regulator: core: Fix enable GPIO reference counting
  regulator: Only enable disabled regulators on resume
  ALSA: hda - Treat stereo-to-mono mix properly
  ALSA: hda - Add workaround for MacBook Air 5,2 built-in mic
  ALSA: hda - Set single_adc_amp flag for CS420x codecs
  ALSA: hda - Don't access stereo amps for mono channel widgets
  ALSA: hda - Fix built-in mic on Compaq Presario CQ60
  ALSA: control: Add sanity checks for user ctl id name string
  spi: pl022: Fix race in giveback() leading to driver lock-up
  tpm/ibmvtpm: Additional LE support for tpm_ibmvtpm_send
  workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE
  can: add missing initialisations in CAN related skbuffs
  Change email address for 8250_pci
  virtio_console: init work unconditionally
  fuse: notify: don't move pages
  fuse: set stolen page uptodate
  drm/radeon: drop setting UPLL to sleep mode
  drm/radeon: do a posting read in rs600_set_irq
  drm/radeon: do a posting read in si_set_irq
  drm/radeon: do a posting read in r600_set_irq
  drm/radeon: do a posting read in r100_set_irq
  drm/radeon: do a posting read in evergreen_set_irq
  drm/radeon: fix DRM_IOCTL_RADEON_CS oops
  tcp: make connect() mem charging friendly
  net: compat: Update get_compat_msghdr() to match copy_msghdr_from_user() behaviour
  tcp: fix tcp fin memory accounting
  Revert "net: cx82310_eth: use common match macro"
  rxrpc: bogus MSG_PEEK test in rxrpc_recvmsg()
  caif: fix MSG_OOB test in caif_seqpkt_recvmsg()
  inet_diag: fix possible overflow in inet_diag_dump_one_icsk()
  rds: avoid potential stack overflow
  net: sysctl_net_core: check SNDBUF and RCVBUF for min length
  sparc64: Fix several bugs in memmove().
  sparc: Touch NMI watchdog when walking cpus and calling printk
  sparc: perf: Make counting mode actually work
  sparc: perf: Remove redundant perf_pmu_{en|dis}able calls
  sparc: semtimedop() unreachable due to comparison error
  sparc32: destroy_context() and switch_mm() needs to disable interrupts.
  Linux 3.10.72
  ath5k: fix spontaneus AR5312 freezes
  ACPI / video: Load the module even if ACPI is disabled
  drm/radeon: fix 1 RB harvest config setup for TN/RL
  Drivers: hv: vmbus: incorrect device name is printed when child device is unregistered
  HID: fixup the conflicting keyboard mappings quirk
  HID: input: fix confusion on conflicting mappings
  staging: comedi: cb_pcidas64: fix incorrect AI range code handling
  dm snapshot: fix a possible invalid memory access on unload
  dm: fix a race condition in dm_get_md
  dm io: reject unsupported DISCARD requests with EOPNOTSUPP
  dm mirror: do not degrade the mirror on discard error
  staging: comedi: comedi_compat32.c: fix COMEDI_CMD copy back
  clk: sunxi: Support factor clocks with N factor starting not from 0
  fixed invalid assignment of 64bit mask to host dma_boundary for scatter gather segment boundary limit.
  nilfs2: fix potential memory overrun on inode
  IB/qib: Do not write EEPROM
  sg: fix read() error reporting
  ALSA: hda - Add pin configs for ASUS mobo with IDT 92HD73XX codec
  ALSA: pcm: Don't leave PREPARED state after draining
  tty: fix up atime/mtime mess, take four
  sunrpc: fix braino in ->poll()
  procfs: fix race between symlink removals and traversals
  debugfs: leave freeing a symlink body until inode eviction
  autofs4 copy_dev_ioctl(): keep the value of ->size we'd used for allocation
  USB: serial: fix potential use-after-free after failed probe
  TTY: fix tty_wait_until_sent on 64-bit machines
  USB: serial: fix infinite wait_until_sent timeout
  net: irda: fix wait_until_sent poll timeout
  xhci: fix reporting of 0-sized URBs in control endpoint
  xhci: Allocate correct amount of scratchpad buffers
  usb: ftdi_sio: Add jtag quirk support for Cyber Cortex AV boards
  USB: usbfs: don't leak kernel data in siginfo
  USB: serial: cp210x: Adding Seletek device id's
  KVM: MIPS: Fix trace event to save PC directly
  KVM: emulate: fix CMPXCHG8B on 32-bit hosts
  Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref.
  Btrfs: fix data loss in the fast fsync path
  btrfs: fix lost return value due to variable shadowing
  iio: imu: adis16400: Fix sign extension
  x86/asm/entry/64: Remove a bogus 'ret_from_fork' optimization
  PM / QoS: remove duplicate call to pm_qos_update_target
  target: Check for LBA + sectors wrap-around in sbc_parse_cdb
  mm/memory.c: actually remap enough memory
  mm/compaction: fix wrong order check in compact_finished()
  mm/nommu.c: fix arithmetic overflow in __vm_enough_memory()
  mm/mmap.c: fix arithmetic overflow in __vm_enough_memory()
  mm/hugetlb: add migration entry check in __unmap_hugepage_range
  team: don't traverse port list using rcu in team_set_mac_address
  udp: only allow UFO for packets from SOCK_DGRAM sockets
  usb: plusb: Add support for National Instruments host-to-host cable
  macvtap: make sure neighbour code can push ethernet header
  net: compat: Ignore MSG_CMSG_COMPAT in compat_sys_{send, recv}msg
  team: fix possible null pointer dereference in team_handle_frame
  net: reject creation of netdev names with colons
  ematch: Fix auto-loading of ematch modules.
  net: phy: Fix verification of EEE support in phy_init_eee
  ipv4: ip_check_defrag should not assume that skb_network_offset is zero
  ipv4: ip_check_defrag should correctly check return value of skb_copy_bits
  gen_stats.c: Duplicate xstats buffer for later use
  rtnetlink: call ->dellink on failure when ->newlink exists
  ipv6: fix ipv6_cow_metrics for non DST_HOST case
  rtnetlink: ifla_vf_policy: fix misuses of NLA_BINARY
  Linux 3.10.71
  libceph: fix double __remove_osd() problem
  libceph: change from BUG to WARN for __remove_osd() asserts
  libceph: assert both regular and lingering lists in __remove_osd()
  MIPS: Export FP functions used by lose_fpu(1) for KVM
  x86, mm/ASLR: Fix stack randomization on 64-bit systems
  blk-throttle: check stats_cpu before reading it from sysfs
  jffs2: fix handling of corrupted summary length
  md/raid1: fix read balance when a drive is write-mostly.
  md/raid5: Fix livelock when array is both resyncing and degraded.
  metag: Fix KSTK_EIP() and KSTK_ESP() macros
  gpio: tps65912: fix wrong container_of arguments
  arm64: compat Fix siginfo_t -> compat_siginfo_t conversion on big endian
  hx4700: regulator: declare full constraints
  KVM: x86: update masterclock values on TSC writes
  KVM: MIPS: Don't leak FPU/DSP to guest
  ARC: fix page address calculation if PAGE_OFFSET != LINUX_LINK_BASE
  ntp: Fixup adjtimex freq validation on 32-bit systems
  kdb: fix incorrect counts in KDB summary command output
  ARM: pxa: add regulator_has_full_constraints to poodle board file
  ARM: pxa: add regulator_has_full_constraints to corgi board file
  vt: provide notifications on selection changes
  usb: core: buffer: smallest buffer should start at ARCH_DMA_MINALIGN
  USB: fix use-after-free bug in usb_hcd_unlink_urb()
  USB: cp210x: add ID for RUGGEDCOM USB Serial Console
  tty: Prevent untrappable signals from malicious program
  axonram: Fix bug in direct_access
  cfq-iosched: fix incorrect filing of rt async cfqq
  cfq-iosched: handle failure of cfq group allocation
  iscsi-target: Drop problematic active_ts_list usage
  NFSv4.1: Fix a kfree() of uninitialised pointers in decode_cb_sequence_args
  Added Little Endian support to vtpm module
  tpm/tpm_i2c_stm_st33: Fix potential bug in tpm_stm_i2c_send
  tpm: Fix NULL return in tpm_ibmvtpm_get_desired_dma
  tpm_tis: verify interrupt during init
  ARM: 8284/1: sa1100: clear RCSR_SMR on resume
  tracing: Fix unmapping loop in tracing_mark_write
  MIPS: KVM: Deliver guest interrupts after local_irq_disable()
  nfs: don't call blocking operations while !TASK_RUNNING
  mmc: sdhci-pxav3: fix setting of pdata->clk_delay_cycles
  power_supply: 88pm860x: Fix leaked power supply on probe fail
  ALSA: hdspm - Constrain periods to 2 on older cards
  ALSA: off by one bug in snd_riptide_joystick_probe()
  lmedm04: Fix usb_submit_urb BOGUS urb xfer, pipe 1 != type 3 in interrupt urb
  cpufreq: speedstep-smi: enable interrupts when waiting
  PCI: Fix infinite loop with ROM image of size 0
  PCI: Generate uppercase hex for modalias var in uevent
  HID: i2c-hid: Limit reads to wMaxInputLength bytes for input events
  iwlwifi: mvm: always use mac color zero
  iwlwifi: mvm: fix failure path when power_update fails in add_interface
  iwlwifi: mvm: validate tid and sta_id in ba_notif
  iwlwifi: pcie: disable the SCD_BASE_ADDR when we resume from WoWLAN
  fsnotify: fix handling of renames in audit
  xfs: set superblock buffer type correctly
  xfs: inode unlink does not set AGI buffer type
  xfs: ensure buffer types are set correctly
  Bluetooth: ath3k: workaround the compatibility issue with xHCI controller
  Linux 3.10.70
  rbd: drop an unsafe assertion
  media/rc: Send sync space information on the lirc device
  net: sctp: fix passing wrong parameter header to param_type2af in sctp_process_param
  ppp: deflate: never return len larger than output buffer
  ipv4: tcp: get rid of ugly unicast_sock
  tcp: ipv4: initialize unicast_sock sk_pacing_rate
  bridge: dont send notification when skb->len == 0 in rtnl_bridge_notify
  ipv6: replacing a rt6_info needs to purge possible propagated rt6_infos too
  ping: Fix race in free in receive path
  udp_diag: Fix socket skipping within chain
  ipv4: try to cache dst_entries which would cause a redirect
  net: sctp: fix slab corruption from use after free on INIT collisions
  netxen: fix netxen_nic_poll() logic
  ipv6: stop sending PTB packets for MTU < 1280
  net: rps: fix cpu unplug
  ip: zero sockaddr returned on error queue
  Linux 3.10.69
  crypto: crc32c - add missing crypto module alias
  x86,kvm,vmx: Preserve CR4 across VM entry
  kvm: vmx: handle invvpid vm exit gracefully
  smpboot: Add missing get_online_cpus() in smpboot_register_percpu_thread()
  ALSA: ak411x: Fix stall in work callback
  ASoC: sgtl5000: add delay before first I2C access
  ASoC: atmel_ssc_dai: fix start event for I2S mode
  lib/checksum.c: fix build for generic csum_tcpudp_nofold
  ext4: prevent bugon on race between write/fcntl
  arm64: Fix up /proc/cpuinfo
  nilfs2: fix deadlock of segment constructor over I_SYNC flag
  lib/checksum.c: fix carry in csum_tcpudp_nofold
  mm: pagewalk: call pte_hole() for VM_PFNMAP during walk_page_range
  MIPS: Fix kernel lockup or crash after CPU offline/online
  MIPS: IRQ: Fix disable_irq on CPU IRQs
  PCI: Add NEC variants to Stratus ftServer PCIe DMI check
  gpio: sysfs: fix memory leak in gpiod_sysfs_set_active_low
  gpio: sysfs: fix memory leak in gpiod_export_link
  Linux 3.10.68
  target: Drop arbitrary maximum I/O size limit
  iser-target: Fix implicit termination of connections
  iser-target: Handle ADDR_CHANGE event for listener cm_id
  iser-target: Fix connected_handler + teardown flow race
  iser-target: Parallelize CM connection establishment
  iser-target: Fix flush + disconnect completion handling
  iscsi,iser-target: Initiate termination only once
  vhost-scsi: Add missing virtio-scsi -> TCM attribute conversion
  tcm_loop: Fix wrong I_T nexus association
  vhost-scsi: Take configfs group dependency during VHOST_SCSI_SET_ENDPOINT
  ib_isert: Add max_send_sge=2 minimum for control PDU responses
  IB/isert: Adjust CQ size to HW limits
  workqueue: fix subtle pool management issue which can stall whole worker_pool
  gpio: squelch a compiler warning
  efi-pstore: Make efi-pstore return a unique id
  pstore/ram: avoid atomic accesses for ioremapped regions
  pstore: Fix NULL pointer fault if get NULL prz in ramoops_get_next_prz
  pstore: skip zero size persistent ram buffer in traverse
  pstore: clarify clearing of _read_cnt in ramoops_context
  pstore: d_alloc_name() doesn't return an ERR_PTR
  pstore: Fail to unlink if a driver has not defined pstore_erase
  ARM: 8109/1: mm: Modify pte_write and pmd_write logic for LPAE
  ARM: 8108/1: mm: Introduce {pte,pmd}_isset and {pte,pmd}_isclear
  ARM: DMA: ensure that old section mappings are flushed from the TLB
  ARM: 7931/1: Correct virt_addr_valid
  ARM: fix asm/memory.h build error
  ARM: 7867/1: include: asm: use 'int' instead of 'unsigned long' for 'oldval' in atomic_cmpxchg().
  ARM: 7866/1: include: asm: use 'long long' instead of 'u64' within atomic.h
  ARM: lpae: fix definition of PTE_HWTABLE_PTRS
  ARM: fix type of PHYS_PFN_OFFSET to unsigned long
  ARM: LPAE: use phys_addr_t in alloc_init_pud()
  ARM: LPAE: use signed arithmetic for mask definitions
  ARM: mm: correct pte_same behaviour for LPAE.
  ARM: 7829/1: Add ".text.unlikely" and ".text.hot" to arm unwind tables
  drivers: net: cpsw: discard dual emac default vlan configuration
  regulator: core: fix race condition in regulator_put()
  spi/pxa2xx: Clear cur_chip pointer before starting next message
  dm cache: fix missing ERR_PTR returns and handling
  dm thin: don't allow messages to be sent to a pool target in READ_ONLY or FAIL mode
  nl80211: fix per-station group key get/del and memory leak
  NFSv4.1: Fix an Oops in nfs41_walk_client_list
  nfs: fix dio deadlock when O_DIRECT flag is flipped
  Input: i8042 - add noloop quirk for Medion Akoya E7225 (MD98857)
  ALSA: seq-dummy: remove deadlock-causing events on close
  powerpc/xmon: Fix another endiannes issue in RTAS call from xmon
  can: kvaser_usb: Fix state handling upon BUS_ERROR events
  can: kvaser_usb: Retry the first bulk transfer on -ETIMEDOUT
  can: kvaser_usb: Send correct context to URB completion
  can: kvaser_usb: Do not sleep in atomic context
  ASoC: wm8960: Fix capture sample rate from 11250 to 11025
  spi: dw-mid: fix FIFO size

Signed-off-by: Ian Maund <imaund@codeaurora.org>
2015-04-24 18:14:57 -07:00
Vinayak Menon 1802f7bfc3 mm: compaction: fix the page state calculation in too_many_isolated
Commit "mm: vmscan: fix the page state calculation in too_many_isolated"
fixed an issue where a number of tasks were blocked in reclaim path
for seconds, because of vmstat_diff not being synced in time.
A similar problem can happen in isolate_migratepages_block, where
similar calculation is performed. This patch fixes that.

Change-Id: Ie74f108ef770da688017b515fe37faea6f384589
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2015-03-26 11:57:40 +05:30
Joonsoo Kim 2295074e44 mm/compaction: fix wrong order check in compact_finished()
commit 372549c2a3778fd3df445819811c944ad54609ca upstream.

What we want to check here is whether there is highorder freepage in buddy
list of other migratetype in order to steal it without fragmentation.
But, current code just checks cc->order which means allocation request
order.  So, this is wrong.

Without this fix, non-movable synchronous compaction below pageblock order
would not stopped until compaction is complete, because migratetype of
most pageblocks are movable and high order freepage made by compaction is
usually on movable type buddy list.

There is some report related to this bug. See below link.

  http://www.spinics.net/lists/linux-mm/msg81666.html

Although the issued system still has load spike comes from compaction,
this makes that system completely stable and responsive according to his
report.

stress-highalloc test in mmtests with non movable order 7 allocation
doesn't show any notable difference in allocation success rate, but, it
shows more compaction success rate.

Compaction success rate (Compaction success * 100 / Compaction stalls, %)
18.47 : 28.94

Fixes: 1fb3f8ca0e ("mm: compaction: capture a suitable high-order page immediately when it is made available")
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-18 13:22:28 +01:00
Vlastimil Babka ee7f7615f9 mm/compaction: make isolate_freepages start at pageblock boundary
commit 49e068f0b73dd042c186ffa9b420a9943e90389a upstream.

The compaction freepage scanner implementation in isolate_freepages()
starts by taking the current cc->free_pfn value as the first pfn.  In a
for loop, it scans from this first pfn to the end of the pageblock, and
then subtracts pageblock_nr_pages from the first pfn to obtain the first
pfn for the next for loop iteration.

This means that when cc->free_pfn starts at offset X rather than being
aligned on pageblock boundary, the scanner will start at offset X in all
scanned pageblock, ignoring potentially many free pages.  Currently this
can happen when

 a) zone's end pfn is not pageblock aligned, or

 b) through zone->compact_cached_free_pfn with CONFIG_HOLES_IN_ZONE
    enabled and a hole spanning the beginning of a pageblock

This patch fixes the problem by aligning the initial pfn in
isolate_freepages() to pageblock boundary.  This also permits replacing
the end-of-pageblock alignment within the for loop with a simple
pageblock_nr_pages increment.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Heesub Shin <heesub.shin@samsung.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Dongjun Shin <d.j.shin@samsung.com>
Cc: Sunghwan Yun <sunghwan.yun@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-16 13:42:53 -07:00
Vlastimil Babka 87c4475c90 mm: compaction: detect when scanners meet in isolate_freepages
commit 7ed695e069c3cbea5e1fd08f84a04536da91f584 upstream.

Compaction of a zone is finished when the migrate scanner (which begins
at the zone's lowest pfn) meets the free page scanner (which begins at
the zone's highest pfn).  This is detected in compact_zone() and in the
case of direct compaction, the compact_blockskip_flush flag is set so
that kswapd later resets the cached scanner pfn's, and a new compaction
may again start at the zone's borders.

The meeting of the scanners can happen during either scanner's activity.
However, it may currently fail to be detected when it occurs in the free
page scanner, due to two problems.  First, isolate_freepages() keeps
free_pfn at the highest block where it isolated pages from, for the
purposes of not missing the pages that are returned back to allocator
when migration fails.  Second, failing to isolate enough free pages due
to scanners meeting results in -ENOMEM being returned by
migrate_pages(), which makes compact_zone() bail out immediately without
calling compact_finished() that would detect scanners meeting.

This failure to detect scanners meeting might result in repeated
attempts at compaction of a zone that keep starting from the cached
pfn's close to the meeting point, and quickly failing through the
-ENOMEM path, without the cached pfns being reset, over and over.  This
has been observed (through additional tracepoints) in the third phase of
the mmtests stress-highalloc benchmark, where the allocator runs on an
otherwise idle system.  The problem was observed in the DMA32 zone,
which was used as a fallback to the preferred Normal zone, but on the
4GB system it was actually the largest zone.  The problem is even
amplified for such fallback zone - the deferred compaction logic, which
could (after being fixed by a previous patch) reset the cached scanner
pfn's, is only applied to the preferred zone and not for the fallbacks.

The problem in the third phase of the benchmark was further amplified by
commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") which
resulted in a non-deterministic regression of the allocation success
rate from ~85% to ~65%.  This occurs in about half of benchmark runs,
making bisection problematic.  It is unlikely that the commit itself is
buggy, but it should put more pressure on the DMA32 zone during phases 1
and 2, which may leave it more fragmented in phase 3 and expose the bugs
that this patch fixes.

The fix is to make scanners meeting in isolate_freepage() stay that way,
and to check in compact_zone() for scanners meeting when migrate_pages()
returns -ENOMEM.  The result is that compact_finished() also detects
scanners meeting and sets the compact_blockskip_flush flag to make
kswapd reset the scanner pfn's.

The results in stress-highalloc benchmark show that the "regression" by
commit 81c0a2bb515f in phase 3 no longer occurs, and phase 1 and 2
allocation success rates are also significantly improved.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-16 13:42:53 -07:00
Vlastimil Babka e6798424a7 mm: compaction: reset cached scanner pfn's before reading them
commit d3132e4b83e6bd383c74d716f7281d7c3136089c upstream.

Compaction caches pfn's for its migrate and free scanners to avoid
scanning the whole zone each time.  In compact_zone(), the cached values
are read to set up initial values for the scanners.  There are several
situations when these cached pfn's are reset to the first and last pfn
of the zone, respectively.  One of these situations is when a compaction
has been deferred for a zone and is now being restarted during a direct
compaction, which is also done in compact_zone().

However, compact_zone() currently reads the cached pfn's *before*
resetting them.  This means the reset doesn't affect the compaction that
performs it, and with good chance also subsequent compactions, as
update_pageblock_skip() is likely to be called and update the cached
pfn's to those being processed.  Another chance for a successful reset
is when a direct compaction detects that migration and free scanners
meet (which has its own problems addressed by another patch) and sets
update_pageblock_skip flag which kswapd uses to do the reset because it
goes to sleep.

This is clearly a bug that results in non-deterministic behavior, so
this patch moves the cached pfn reset to be performed *before* the
values are read.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-16 13:42:53 -07:00
Laura Abbott 8c54de8fef mm/compaction: break out of loop on !PageBuddy in isolate_freepages_block
commit 2af120bc040c5ebcda156df6be6a66610ab6957f upstream.

We received several reports of bad page state when freeing CMA pages
previously allocated with alloc_contig_range:

    BUG: Bad page state in process Binder_A  pfn:63202
    page:d21130b0 count:0 mapcount:1 mapping:  (null) index:0x7dfbf
    page flags: 0x40080068(uptodate|lru|active|swapbacked)

Based on the page state, it looks like the page was still in use.  The
page flags do not make sense for the use case though.  Further debugging
showed that despite alloc_contig_range returning success, at least one
page in the range still remained in the buddy allocator.

There is an issue with isolate_freepages_block.  In strict mode (which
CMA uses), if any pages in the range cannot be isolated,
isolate_freepages_block should return failure 0.  The current check
keeps track of the total number of isolated pages and compares against
the size of the range:

        if (strict && nr_strict_required > total_isolated)
                total_isolated = 0;

After taking the zone lock, if one of the pages in the range is not in
the buddy allocator, we continue through the loop and do not increment
total_isolated.  If in the last iteration of the loop we isolate more
than one page (e.g.  last page needed is a higher order page), the check
for total_isolated may pass and we fail to detect that a page was
skipped.  The fix is to bail out if the loop immediately if we are in
strict mode.  There's no benfit to continuing anyway since we need all
pages to be isolated.  Additionally, drop the error checking based on
nr_strict_required and just check the pfn ranges.  This matches with
what isolate_freepages_range does.

Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-23 21:38:18 -07:00
Joonsoo Kim 31eb5f24b9 mm/compaction: respect ignore_skip_hint in update_pageblock_skip
commit 6815bf3f233e0b10c99a758497d5d236063b010b upstream.

update_pageblock_skip() only fits to compaction which tries to isolate
by pageblock unit.  If isolate_migratepages_range() is called by CMA, it
try to isolate regardless of pageblock unit and it don't reference
get_pageblock_skip() by ignore_skip_hint.  We should also respect it on
update_pageblock_skip() to prevent from setting the wrong information.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09 12:24:23 -08:00
Cody P Schafer 108bcc96ef mm: add & use zone_end_pfn() and zone_spans_pfn()
Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code
duplication.

This also switches to using them in compaction (where an additional
variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and
kmemleak.

Note that in compaction.c I avoid calling zone_end_pfn() repeatedly
because I expect at some point the sycronization issues with start_pfn &
spanned_pages will need fixing, either by actually using the seqlock or
clever memory barrier usage.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Hansen <dave@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:20 -08:00
Hugh Dickins 9c620e2bc5 mm: remove offlining arg to migrate_pages
No functional change, but the only purpose of the offlining argument to
migrate_pages() etc, was to ensure that __unmap_and_move() could migrate a
KSM page for memory hotremove (which took ksm_thread_mutex) but not for
other callers.  Now all cases are safe, remove the arg.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:19 -08:00
Minchan Kim 194159fbcc mm: remove MIGRATE_ISOLATE check in hotpath
Several functions test MIGRATE_ISOLATE and some of those are hotpath but
MIGRATE_ISOLATE is used only if we enable CONFIG_MEMORY_ISOLATION(ie,
CMA, memory-hotplug and memory-failure) which are not common config
option.  So let's not add unnecessary overhead and code when we don't
enable CONFIG_MEMORY_ISOLATION.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:15 -08:00
Andrew Morton 7103f16dbf mm: compaction: make __compact_pgdat() and compact_pgdat() return void
These functions always return 0.  Formalise this.

Cc: Jason Liu <r64343@freescale.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:10 -08:00
Mel Gorman a9aacbccf3 mm: compaction: do not accidentally skip pageblocks in the migrate scanner
Compaction uses the ALIGN macro incorrectly with the migrate scanner by
adding pageblock_nr_pages to a PFN.  It happened to work when initially
implemented as the starting PFN was also aligned but with caching
restarts and isolating in smaller chunks this is no longer always true.

The impact is that the migrate scanner scans outside its current
pageblock.  As pfn_valid() is still checked properly it does not cause
any failure and the impact of the bug is that in some cases it will scan
more than necessary when it crosses a page boundary but by no more than
COMPACT_CLUSTER_MAX.  It is highly unlikely this is even measurable but
it's still wrong so this patch addresses the problem.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:10 -08:00
Mel Gorman 8fb74b9fb2 mm: compaction: partially revert capture of suitable high-order page
Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
waiting for POLLIN on a local TCP socket.  It was easier to trigger if
there was disk IO and dirty pages at the same time and he bisected it to
commit 1fb3f8ca0e ("mm: compaction: capture a suitable high-order page
immediately when it is made available").

The intention of that patch was to improve high-order allocations under
memory pressure after changes made to reclaim in 3.6 drastically hurt
THP allocations but the approach was flawed.  For Eric, the problem was
that page->pfmemalloc was not being cleared for captured pages leading
to a poor interaction with swap-over-NFS support causing the packets to
be dropped.  However, I identified a few more problems with the patch
including the fact that it can increase contention on zone->lock in some
cases which could result in async direct compaction being aborted early.

In retrospect the capture patch took the wrong approach.  What it should
have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
was allocating for THP and avoided races that way.  While the patch was
showing to improve allocation success rates at the time, the benefit is
marginal given the relative complexity and it should be revisited from
scratch in the context of the other reclaim-related changes that have
taken place since the patch was first written and tested.  This patch
partially reverts commit 1fb3f8ca0e ("mm: compaction: capture a
suitable high-order page immediately when it is made available").

Reported-and-tested-by: Eric Wong <normalperson@yhbt.net>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-11 14:54:56 -08:00
Jason Liu 7964c06d66 mm: compaction: fix echo 1 > compact_memory return error issue
when run the folloing command under shell, it will return error

  sh/$ echo 1 > /proc/sys/vm/compact_memory
  sh/$ sh: write error: Bad address

After strace, I found the following log:

  ...
  write(1, "1\n", 2)               = 3
  write(1, "", 4294967295)         = -1 EFAULT (Bad address)
  write(2, "echo: write error: Bad address\n", 31echo: write error: Bad address
  ) = 31

This tells system return 3(COMPACT_COMPLETE) after write data to
compact_memory.

The fix is to make the system just return 0 instead 3(COMPACT_COMPLETE)
from sysctl_compaction_handler after compaction_nodes finished.

Signed-off-by: Jason Liu <r64343@freescale.com>
Suggested-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-11 14:54:54 -08:00
Minchan Kim 010fc29a45 compaction: fix build error in CMA && !COMPACTION
isolate_freepages_block() and isolate_migratepages_range() are used for
CMA as well as compaction so it breaks build for CONFIG_CMA &&
!CONFIG_COMPACTION.

This patch fixes it.

[akpm@linux-foundation.org: add "do { } while (0)", per Mel]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 17:40:18 -08:00
Linus Torvalds 3d59eebc5e Automatic NUMA Balancing V11
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIcBAABAgAGBQJQx0kQAAoJEHzG/DNEskfi4fQP/R5PRovayroZALBMLnVJDaLD
 Ttr9p40VNXbiJ+MfRgatJjSSJZ4Jl+fC3NEqBhcwVZhckZZb9R2s0WtrSQo5+ZbB
 vdRfiuKoCaKM4cSZ08C12uTvsF6xjhjd27CTUlMkyOcDoKxMEFKelv0hocSxe4Wo
 xqlv3eF+VsY7kE1BNbgBP06SX4tDpIHRxXfqJPMHaSKQmre+cU0xG2GcEu3QGbHT
 DEDTI788YSaWLmBfMC+kWoaQl1+bV/FYvavIAS8/o4K9IKvgR42VzrXmaFaqrbgb
 72ksa6xfAi57yTmZHqyGmts06qYeBbPpKI+yIhCMInxA9CY3lPbvHppRf0RQOyzj
 YOi4hovGEMJKE+BCILukhJcZ9jCTtS3zut6v1rdvR88f4y7uhR9RfmRfsxuW7PNj
 3Rmh191+n0lVWDmhOs2psXuCLJr3LEiA0dFffN1z8REUTtTAZMsj8Rz+SvBNAZDR
 hsJhERVeXB6X5uQ5rkLDzbn1Zic60LjVw7LIp6SF2OYf/YKaF8vhyWOA8dyCEu8W
 CGo7AoG0BO8tIIr8+LvFe8CweypysZImx4AjCfIs4u9pu/v11zmBvO9NO5yfuObF
 BreEERYgTes/UITxn1qdIW4/q+Nr0iKO3CTqsmu6L1GfCz3/XzPGs3U26fUhllqi
 Ka0JKgnWvsa6ez6FSzKI
 =ivQa
 -----END PGP SIGNATURE-----

Merge tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma

Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
 "There are three implementations for NUMA balancing, this tree
  (balancenuma), numacore which has been developed in tip/master and
  autonuma which is in aa.git.

  In almost all respects balancenuma is the dumbest of the three because
  its main impact is on the VM side with no attempt to be smart about
  scheduling.  In the interest of getting the ball rolling, it would be
  desirable to see this much merged for 3.8 with the view to building
  scheduler smarts on top and adapting the VM where required for 3.9.

  The most recent set of comparisons available from different people are

    mel:    https://lkml.org/lkml/2012/12/9/108
    mingo:  https://lkml.org/lkml/2012/12/7/331
    tglx:   https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

  The results are a mixed bag.  In my own tests, balancenuma does
  reasonably well.  It's dumb as rocks and does not regress against
  mainline.  On the other hand, Ingo's tests shows that balancenuma is
  incapable of converging for this workloads driven by perf which is bad
  but is potentially explained by the lack of scheduler smarts.  Thomas'
  results show balancenuma improves on mainline but falls far short of
  numacore or autonuma.  Srikar's results indicate we all suffer on a
  large machine with imbalanced node sizes.

  My own testing showed that recent numacore results have improved
  dramatically, particularly in the last week but not universally.
  We've butted heads heavily on system CPU usage and high levels of
  migration even when it shows that overall performance is better.
  There are also cases where it regresses.  Of interest is that for
  specjbb in some configurations it will regress for lower numbers of
  warehouses and show gains for higher numbers which is not reported by
  the tool by default and sometimes missed in treports.  Recently I
  reported for numacore that the JVM was crashing with
  NullPointerExceptions but currently it's unclear what the source of
  this problem is.  Initially I thought it was in how numacore batch
  handles PTEs but I'm no longer think this is the case.  It's possible
  numacore is just able to trigger it due to higher rates of migration.

  These reports were quite late in the cycle so I/we would like to start
  with this tree as it contains much of the code we can agree on and has
  not changed significantly over the last 2-3 weeks."

* tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
  mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
  mm/rmap: Convert the struct anon_vma::mutex to an rwsem
  mm: migrate: Account a transhuge page properly when rate limiting
  mm: numa: Account for failed allocations and isolations as migration failures
  mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
  mm: numa: Add THP migration for the NUMA working set scanning fault case.
  mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
  mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
  mm: sched: numa: Control enabling and disabling of NUMA balancing
  mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
  mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
  mm: numa: migrate: Set last_nid on newly allocated page
  mm: numa: split_huge_page: Transfer last_nid on tail page
  mm: numa: Introduce last_nid to the page frame
  sched: numa: Slowly increase the scanning period as NUMA faults are handled
  mm: numa: Rate limit setting of pte_numa if node is saturated
  mm: numa: Rate limit the amount of memory that is migrated between nodes
  mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
  mm: numa: Migrate pages handled during a pmd_numa hinting fault
  mm: numa: Migrate on reference policy
  ...
2012-12-16 15:18:08 -08:00
Thierry Reding c8bf2d8ba4 mm: compaction: Fix compiler warning
compact_capture_page() is only used if compaction is enabled so it should
be moved into the corresponding #ifdef.

Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 17:38:32 -08:00
Rafael Aquini 5733c7d11d mm: introduce putback_movable_pages()
The PATCH "mm: introduce compaction and migration for virtio ballooned pages"
hacks around putback_lru_pages() in order to allow ballooned pages to be
re-inserted on balloon page list as if a ballooned page was like a LRU page.

As ballooned pages are not legitimate LRU pages, this patch introduces
putback_movable_pages() to properly cope with cases where the isolated
pageset contains ballooned pages and LRU pages, thus fixing the mentioned
inelegant hack around putback_lru_pages().

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11 17:22:27 -08:00
Rafael Aquini bf6bddf192 mm: introduce compaction and migration for ballooned pages
Memory fragmentation introduced by ballooning might reduce significantly
the number of 2MB contiguous memory blocks that can be used within a guest,
thus imposing performance penalties associated with the reduced number of
transparent huge pages that could be used by the guest workload.

This patch introduces the helper functions as well as the necessary changes
to teach compaction and migration bits how to cope with pages which are
part of a guest memory balloon, in order to make them movable by memory
compaction procedures.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11 17:22:27 -08:00
Mel Gorman 397487db69 mm: compaction: Add scanned and isolated counters for compaction
Compaction already has tracepoints to count scanned and isolated pages
but it requires that ftrace be enabled and if that information has to be
written to disk then it can be disruptive. This patch adds vmstat counters
for compaction called compact_migrate_scanned, compact_free_scanned and
compact_isolated.

With these counters, it is possible to define a basic cost model for
compaction. This approximates of how much work compaction is doing and can
be compared that with an oprofile showing TLB misses and see if the cost of
compaction is being offset by THP for example. Minimally a compaction patch
can be evaluated in terms of whether it increases or decreases cost. The
basic cost model looks like this

Fundamental unit u:	a word	sizeof(void *)

Ca  = cost of struct page access = sizeof(struct page) / u

Cmc = Cost migrate page copy = (Ca + PAGE_SIZE/u) * 2
Cmf = Cost migrate failure   = Ca * 2
Ci  = Cost page isolation    = (Ca + Wi)
	where Wi is a constant that should reflect the approximate
	cost of the locking operation.

Csm = Cost migrate scanning = Ca
Csf = Cost free    scanning = Ca

Overall cost =	(Csm * compact_migrate_scanned) +
	      	(Csf * compact_free_scanned)    +
	      	(Ci  * compact_isolated)	+
		(Cmc * pgmigrate_success)	+
		(Cmf * pgmigrate_failed)

Where the values are read from /proc/vmstat.

This is very basic and ignores certain costs such as the allocation cost
to do a migrate page copy but any improvement to the model would still
use the same vmstat counters.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11 14:28:35 +00:00
Mel Gorman 7b2a2d4a18 mm: migrate: Add a tracepoint for migrate_pages
The pgmigrate_success and pgmigrate_fail vmstat counters tells the user
about migration activity but not the type or the reason. This patch adds
a tracepoint to identify the type of page migration and why the page is
being migrated.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11 14:28:35 +00:00
Mel Gorman 5647bc293a mm: compaction: Move migration fail/success stats to migrate.c
The compact_pages_moved and compact_pagemigrate_failed events are
convenient for determining if compaction is active and to what
degree migration is succeeding but it's at the wrong level. Other
users of migration may also want to know if migration is working
properly and this will be particularly true for any automated
NUMA migration. This patch moves the counters down to migration
with the new events called pgmigrate_success and pgmigrate_fail.
The compact_blocks_moved counter is removed because while it was
useful for debugging initially, it's worthless now as no meaningful
conclusions can be drawn from its value.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
2012-12-11 14:28:35 +00:00
Mel Gorman 60177d31d2 mm: compaction: validate pfn range passed to isolate_freepages_block
Commit 0bf380bc70 ("mm: compaction: check pfn_valid when entering a
new MAX_ORDER_NR_PAGES block during isolation for migration") added a
check for pfn_valid() when isolating pages for migration as the scanner
does not necessarily start pageblock-aligned.

Since commit c89511ab2f ("mm: compaction: Restart compaction from near
where it left off"), the free scanner has the same problem.  This patch
makes sure that the pfn range passed to isolate_freepages_block() is
within the same block so that pfn_valid() checks are unnecessary.

In answer to Henrik's wondering why others have not reported this:
reproducing this requires a large enough hole with the right aligment to
have compaction walk into a PFN range with no memmap.  Size and
alignment depends in the memory model - 4M for FLATMEM and 128M for
SPARSEMEM on x86.  It needs a "lucky" machine.

Reported-by: Henrik Rydberg <rydberg@euromail.se>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-06 11:17:33 -08:00
Mel Gorman 0db63d7e25 mm: compaction: correct the nr_strict va isolated check for CMA
Thierry reported that the "iron out" patch for isolate_freepages_block()
had problems due to the strict check being too strict with "mm:
compaction: Iron out isolate_freepages_block() and
isolate_freepages_range() -fix1".  It's possible that more pages than
necessary are isolated but the check still fails and I missed that this
fix was not picked up before RC1.  This same problem has been identified
in 3.7-RC1 by Tony Prisk and should be addressed by the following patch.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Tony Prisk <linux@prisktech.co.nz>
Reported-by: Thierry Reding <thierry.reding@avionic-design.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Richard Davies <richard@arachsys.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-19 14:07:47 -07:00
Minchan Kim e46a28790e CMA: migrate mlocked pages
Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
contiguous memory space.

This patch makes mlocked pages be migrated out.  Of course, it can affect
realtime processes but in CMA usecase, contiguous memory allocation failing
is far worse than access latency to an mlocked page being variable while
CMA is running.  If someone wants to make the system realtime, he shouldn't
enable CMA because stalls can still happen at random times.

[akpm@linux-foundation.org: tweak comment text, per Mel]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:23:00 +09:00
Mel Gorman 62997027ca mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity
Compaction caches if a pageblock was scanned and no pages were isolated so
that the pageblocks can be skipped in the future to reduce scanning.  This
information is not cleared by the page allocator based on activity due to
the impact it would have to the page allocator fast paths.  Hence there is
a requirement that something clear the cache or pageblocks will be skipped
forever.  Currently the cache is cleared if there were a number of recent
allocation failures and it has not been cleared within the last 5 seconds.
Time-based decisions like this are terrible as they have no relationship
to VM activity and is basically a big hammer.

Unfortunately, accurate heuristics would add cost to some hot paths so
this patch implements a rough heuristic.  There are two cases where the
cache is cleared.

1. If a !kswapd process completes a compaction cycle (migrate and free
   scanner meet), the zone is marked compact_blockskip_flush. When kswapd
   goes to sleep, it will clear the cache. This is expected to be the
   common case where the cache is cleared. It does not really matter if
   kswapd happens to be asleep or going to sleep when the flag is set as
   it will be woken on the next allocation request.

2. If there have been multiple failures recently and compaction just
   finished being deferred then a process will clear the cache and start a
   full scan.  This situation happens if there are multiple high-order
   allocation requests under heavy memory pressure.

The clearing of the PG_migrate_skip bits and other scans is inherently
racy but the race is harmless.  For allocations that can fail such as THP,
they will simply fail.  For requests that cannot fail, they will retry the
allocation.  Tests indicated that scanning rates were roughly similar to
when the time-based heuristic was used and the allocation success rates
were similar.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Richard Davies <richard@arachsys.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:51 +09:00
Mel Gorman c89511ab2f mm: compaction: Restart compaction from near where it left off
This is almost entirely based on Rik's previous patches and discussions
with him about how this might be implemented.

Order > 0 compaction stops when enough free pages of the correct page
order have been coalesced.  When doing subsequent higher order
allocations, it is possible for compaction to be invoked many times.

However, the compaction code always starts out looking for things to
compact at the start of the zone, and for free pages to compact things to
at the end of the zone.

This can cause quadratic behaviour, with isolate_freepages starting at the
end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.
 This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.

This patch caches where the migration and free scanner should start from
on subsequent compaction invocations using the pageblock-skip information.
 When compaction starts it begins from the cached restart points and will
update the cached restart points until a page is isolated or a pageblock
is skipped that would have been scanned by synchronous compaction.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Richard Davies <richard@arachsys.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:50 +09:00
Mel Gorman bb13ffeb9f mm: compaction: cache if a pageblock was scanned and no pages were isolated
When compaction was implemented it was known that scanning could
potentially be excessive.  The ideal was that a counter be maintained for
each pageblock but maintaining this information would incur a severe
penalty due to a shared writable cache line.  It has reached the point
where the scanning costs are a serious problem, particularly on
long-lived systems where a large process starts and allocates a large
number of THPs at the same time.

Instead of using a shared counter, this patch adds another bit to the
pageblock flags called PG_migrate_skip.  If a pageblock is scanned by
either migrate or free scanner and 0 pages were isolated, the pageblock is
marked to be skipped in the future.  When scanning, this bit is checked
before any scanning takes place and the block skipped if set.

The main difficulty with a patch like this is "when to ignore the cached
information?" If it's ignored too often, the scanning rates will still be
excessive.  If the information is too stale then allocations will fail
that might have otherwise succeeded.  In this patch

o CMA always ignores the information
o If the migrate and free scanner meet then the cached information will
  be discarded if it's at least 5 seconds since the last time the cache
  was discarded
o If there are a large number of allocation failures, discard the cache.

The time-based heuristic is very clumsy but there are few choices for a
better event.  Depending solely on multiple allocation failures still
allows excessive scanning when THP allocations are failing in quick
succession due to memory pressure.  Waiting until memory pressure is
relieved would cause compaction to continually fail instead of using
reclaim/compaction to try allocate the page.  The time-based mechanism is
clumsy but a better option is not obvious.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Richard Davies <richard@arachsys.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:50 +09:00
Mel Gorman 753341a4b8 revert "mm: have order > 0 compaction start off where it left"
This reverts commit 7db8889ab0 ("mm: have order > 0 compaction start
off where it left") and commit de74f1cc ("mm: have order > 0 compaction
start near a pageblock with free pages").  These patches were a good
idea and tests confirmed that they massively reduced the amount of
scanning but the implementation is complex and tricky to understand.  A
later patch will cache what pageblocks should be skipped and
reimplements the concept of compact_cached_free_pfn on top for both
migration and free scanners.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Richard Davies <richard@arachsys.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:50 +09:00
Mel Gorman f40d1e42bb mm: compaction: acquire the zone->lock as late as possible
Compaction's free scanner acquires the zone->lock when checking for
PageBuddy pages and isolating them.  It does this even if there are no
PageBuddy pages in the range.

This patch defers acquiring the zone lock for as long as possible.  In the
event there are no free pages in the pageblock then the lock will not be
acquired at all which reduces contention on zone->lock.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Richard Davies <richard@arachsys.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Tested-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:49 +09:00
Mel Gorman 2a1402aa04 mm: compaction: acquire the zone->lru_lock as late as possible
Richard Davies and Shaohua Li have both reported lock contention problems
in compaction on the zone and LRU locks as well as significant amounts of
time being spent in compaction.  This series aims to reduce lock
contention and scanning rates to reduce that CPU usage.  Richard reported
at https://lkml.org/lkml/2012/9/21/91 that this series made a big
different to a problem he reported in August:

   http://marc.info/?l=kvm&m=134511507015614&w=2

Patch 1 defers acquiring the zone->lru_lock as long as possible.

Patch 2 defers acquiring the zone->lock as lock as possible.

Patch 3 reverts Rik's "skip-free" patches as the core concept gets
	reimplemented later and the remaining patches are easier to
	understand if this is reverted first.

Patch 4 adds a pageblock-skip bit to the pageblock flags to cache what
	pageblocks should be skipped by the migrate and free scanners.
	This drastically reduces the amount of scanning compaction has
	to do.

Patch 5 reimplements something similar to Rik's idea except it uses the
	pageblock-skip information to decide where the scanners should
	restart from and does not need to wrap around.

I tested this on 3.6-rc6 + linux-next/akpm. Kernels tested were

akpm-20120920	3.6-rc6 + linux-next/akpm as of Septeber 20th, 2012
lesslock	Patches 1-6
revert		Patches 1-7
cachefail	Patches 1-8
skipuseless	Patches 1-9

Stress high-order allocation tests looked ok.  Success rates are more or
less the same with the full series applied but there is an expectation
that there is less opportunity to race with other allocation requests if
there is less scanning.  The time to complete the tests did not vary that
much and are uninteresting as were the vmstat statistics so I will not
present them here.

Using ftrace I recorded how much scanning was done by compaction and got this

                            3.6.0-rc6     3.6.0-rc6   3.6.0-rc6  3.6.0-rc6 3.6.0-rc6
                            akpm-20120920 lockless  revert-v2r2  cachefail skipuseless

Total   free    scanned         360753976  515414028  565479007   17103281   18916589
Total   free    isolated          2852429    3597369    4048601     670493     727840
Total   free    efficiency        0.0079%    0.0070%    0.0072%    0.0392%    0.0385%
Total   migrate scanned         247728664  822729112 1004645830   17946827   14118903
Total   migrate isolated          2555324    3245937    3437501     616359     658616
Total   migrate efficiency        0.0103%    0.0039%    0.0034%    0.0343%    0.0466%

The efficiency is worthless because of the nature of the test and the
number of failures.  The really interesting point as far as this patch
series is concerned is the number of pages scanned.  Note that reverting
Rik's patches massively increases the number of pages scanned indicating
that those patches really did make a difference to CPU usage.

However, caching what pageblocks should be skipped has a much higher
impact.  With patches 1-8 applied, free page and migrate page scanning are
both reduced by 95% in comparison to the akpm kernel.  If the basic
concept of Rik's patches are implemened on top then scanning then the free
scanner barely changed but migrate scanning was further reduced.  That
said, tests on 3.6-rc5 indicated that the last patch had greater impact
than what was measured here so it is a bit variable.

One way or the other, this series has a large impact on the amount of
scanning compaction does when there is a storm of THP allocations.

This patch:

Compaction's migrate scanner acquires the zone->lru_lock when scanning a
range of pages looking for LRU pages to acquire.  It does this even if
there are no LRU pages in the range.  If multiple processes are compacting
then this can cause severe locking contention.  To make matters worse
commit b2eef8c0 ("mm: compaction: minimise the time IRQs are disabled
while isolating pages for migration") releases the lru_lock every
SWAP_CLUSTER_MAX pages that are scanned.

This patch makes two changes to how the migrate scanner acquires the LRU
lock.  First, it only releases the LRU lock every SWAP_CLUSTER_MAX pages
if the lock is contended.  This reduces the number of times it
unnecessarily disables and re-enables IRQs.  The second is that it defers
acquiring the LRU lock for as long as possible.  If there are no LRU pages
or the only LRU pages are transhuge then the LRU lock will not be acquired
at all which reduces contention on zone->lru_lock.

[minchan@kernel.org: augment comment]
[akpm@linux-foundation.org: tweak comment text]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Richard Davies <richard@arachsys.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:49 +09:00
Mel Gorman 661c4cb9b8 mm: compaction: Update try_to_compact_pages()kerneldoc comment
Parameters were added without documentation, tut tut.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:49 +09:00
Mel Gorman 3cc668f4e3 mm: compaction: move fatal signal check out of compact_checklock_irqsave
Commit c67fe3752a ("mm: compaction: Abort async compaction if locks
are contended or taking too long") addressed a lock contention problem
in compaction by introducing compact_checklock_irqsave() that effecively
aborting async compaction in the event of compaction.

To preserve existing behaviour it also moved a fatal_signal_pending()
check into compact_checklock_irqsave() but that is very misleading.  It
"hides" the check within a locking function but has nothing to do with
locking as such.  It just happens to work in a desirable fashion.

This patch moves the fatal_signal_pending() check to
isolate_migratepages_range() where it belongs.  Arguably the same check
should also happen when isolating pages for freeing but it's overkill.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:48 +09:00
Shaohua Li e64c5237cf mm: compaction: abort compaction loop if lock is contended or run too long
isolate_migratepages_range() might isolate no pages if for example when
zone->lru_lock is contended and running asynchronous compaction. In this
case, we should abort compaction, otherwise, compact_zone will run a
useless loop and make zone->lru_lock is even contended.

An additional check is added to ensure that cc.migratepages and
cc.freepages get properly drained whan compaction is aborted.

[minchan@kernel.org: Putback pages isolated for migration if aborting]
[akpm@linux-foundation.org: compact_zone_order requires non-NULL arg contended]
[akpm@linux-foundation.org: make compact_zone_order() require non-NULL arg `contended']
[minchan@kernel.org: Putback pages isolated for migration if aborting]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:48 +09:00
Bartlomiej Zolnierkiewicz d95ea5d18e cma: fix watermark checking
* Add ALLOC_CMA alloc flag and pass it to [__]zone_watermark_ok()
  (from Minchan Kim).

* During watermark check decrease available free pages number by
  free CMA pages number if necessary (unmovable allocations cannot
  use pages from CMA areas).

Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:45 +09:00
Mel Gorman 1fb3f8ca0e mm: compaction: capture a suitable high-order page immediately when it is made available
While compaction is migrating pages to free up large contiguous blocks
for allocation it races with other allocation requests that may steal
these blocks or break them up.  This patch alters direct compaction to
capture a suitable free page as soon as it becomes available to reduce
this race.  It uses similar logic to split_free_page() to ensure that
watermarks are still obeyed.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:21 +09:00
Mel Gorman 4ffb6335da mm: compaction: update comment in try_to_compact_pages
Allocation success rates have been far lower since 3.4 due to commit
fe2c2a1066 ("vmscan: reclaim at order 0 when compaction is enabled").
This commit was introduced for good reasons and it was known in advance
that the success rates would suffer but it was justified on the grounds
that the high allocation success rates were achieved by aggressive
reclaim.  Success rates are expected to suffer even more in 3.6 due to
commit 7db8889ab0 ("mm: have order > 0 compaction start off where it
left") which testing has shown to severely reduce allocation success
rates under load - to 0% in one case.

This series aims to improve the allocation success rates without
regressing the benefits of commit fe2c2a1066.  The series is based on
latest mmotm and takes into account the __GFP_NO_KSWAPD flag is going
away.

Patch 1 updates a stale comment seeing as I was in the general area.

Patch 2 updates reclaim/compaction to reclaim pages scaled on the number
	of recent failures.

Patch 3 captures suitable high-order pages freed by compaction to reduce
	races with parallel allocation requests.

Patch 4 fixes the upstream commit [7db8889a: mm: have order > 0 compaction
	start off where it left] to enable compaction again

Patch 5 identifies when compacion is taking too long due to contention
	and aborts.

STRESS-HIGHALLOC
		 3.6-rc1-akpm	  full-series
Pass 1          36.00 ( 0.00%)    51.00 (15.00%)
Pass 2          42.00 ( 0.00%)    63.00 (21.00%)
while Rested    86.00 ( 0.00%)    86.00 ( 0.00%)

From

  http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/hydra/comparison.html

I know that the allocation success rates in 3.3.6 was 78% in comparison
to 36% in in the current akpm tree.  With the full series applied, the
success rates are up to around 51% with some variability in the results.
This is not as high a success rate but it does not reclaim excessively
which is a key point.

MMTests Statistics: vmstat
Page Ins                                     3050912     3078892
Page Outs                                    8033528     8039096
Swap Ins                                           0           0
Swap Outs                                          0           0

Note that swap in/out rates remain at 0. In 3.3.6 with 78% success rates
there were 71881 pages swapped out.

Direct pages scanned                           70942      122976
Kswapd pages scanned                         1366300     1520122
Kswapd pages reclaimed                       1366214     1484629
Direct pages reclaimed                         70936      105716
Kswapd efficiency                                99%         97%
Kswapd velocity                             1072.550    1182.615
Direct efficiency                                99%         85%
Direct velocity                               55.690      95.672

The kswapd velocity changes very little as expected.  kswapd velocity is
around the 1000 pages/sec mark where as in kernel 3.3.6 with the high
allocation success rates it was 8140 pages/second.  Direct velocity is
higher as a result of patch 2 of the series but this is expected and is
acceptable.  The direct reclaim and kswapd velocities change very little.

If these get accepted for merging then there is a difficulty in how they
should be handled.  7db8889a ("mm: have order > 0 compaction start off
where it left") is broken but it is already in 3.6-rc1 and needs to be
fixed.  However, if just patch 4 from this series is applied then Jim
Schutt's workload is known to break again as his workload also requires
patch 5.  While it would be preferred to have all these patches in 3.6 to
improve compaction in general, it would at least be acceptable if just
patches 4 and 5 were merged to 3.6 to fix a known problem without breaking
compaction completely.  On the face of it, that would force
__GFP_NO_KSWAPD patches to be merged at the same time but I can do a
version of this series with __GFP_NO_KSWAPD change reverted and then
rebase it on top of this series.  That might be best overall because I
note that the __GFP_NO_KSWAPD patch should have removed
deferred_compaction from page_alloc.c but it didn't but fixing that causes
collisions with this series.

This patch:

The comment about order applied when the check was order >
PAGE_ALLOC_COSTLY_ORDER which has not been the case since c5a73c3d ("thp:
use compaction for all allocation orders").  Fixing the comment while I'm
in the general area.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:20 +09:00
Mel Gorman c67fe3752a mm: compaction: Abort async compaction if locks are contended or taking too long
Jim Schutt reported a problem that pointed at compaction contending
heavily on locks.  The workload is straight-forward and in his own words;

	The systems in question have 24 SAS drives spread across 3 HBAs,
	running 24 Ceph OSD instances, one per drive.  FWIW these servers
	are dual-socket Intel 5675 Xeons w/48 GB memory.  I've got ~160
	Ceph Linux clients doing dd simultaneously to a Ceph file system
	backed by 12 of these servers.

Early in the test everything looks fine

  procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
   r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
  31 15          0     287216        576   38606628    0    0     2  1158    2   14   1  3  95  0  0
  27 15          0     225288        576   38583384    0    0    18 2222016 203357 134876  11 56  17 15  0
  28 17          0     219256        576   38544736    0    0    11 2305932 203141 146296  11 49  23 17  0
   6 18          0     215596        576   38552872    0    0     7 2363207 215264 166502  12 45  22 20  0
  22 18          0     226984        576   38596404    0    0     3 2445741 223114 179527  12 43  23 22  0

and then it goes to pot

  procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
   r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
  163  8          0     464308        576   36791368    0    0    11 22210  866  536   3 13  79  4  0
  207 14          0     917752        576   36181928    0    0   712 1345376 134598 47367   7 90   1  2  0
  123 12          0     685516        576   36296148    0    0   429 1386615 158494 60077   8 84   5  3  0
  123 12          0     598572        576   36333728    0    0  1107 1233281 147542 62351   7 84   5  4  0
  622  7          0     660768        576   36118264    0    0   557 1345548 151394 59353   7 85   4  3  0
  223 11          0     283960        576   36463868    0    0    46 1107160 121846 33006   6 93   1  1  0

Note that system CPU usage is very high blocks being written out has
dropped by 42%. He analysed this with perf and found

  perf record -g -a sleep 10
  perf report --sort symbol --call-graph fractal,5
    34.63%  [k] _raw_spin_lock_irqsave
            |
            |--97.30%-- isolate_freepages
            |          compaction_alloc
            |          unmap_and_move
            |          migrate_pages
            |          compact_zone
            |          compact_zone_order
            |          try_to_compact_pages
            |          __alloc_pages_direct_compact
            |          __alloc_pages_slowpath
            |          __alloc_pages_nodemask
            |          alloc_pages_vma
            |          do_huge_pmd_anonymous_page
            |          handle_mm_fault
            |          do_page_fault
            |          page_fault
            |          |
            |          |--87.39%-- skb_copy_datagram_iovec
            |          |          tcp_recvmsg
            |          |          inet_recvmsg
            |          |          sock_recvmsg
            |          |          sys_recvfrom
            |          |          system_call
            |          |          __recv
            |          |          |
            |          |           --100.00%-- (nil)
            |          |
            |           --12.61%-- memcpy
             --2.70%-- [...]

There was other data but primarily it is all showing that compaction is
contended heavily on the zone->lock and zone->lru_lock.

commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
while isolating pages for migration] noted that it was possible for
migration to hold the lru_lock for an excessive amount of time. Very
broadly speaking this patch expands the concept.

This patch introduces compact_checklock_irqsave() to check if a lock
is contended or the process needs to be scheduled. If either condition
is true then async compaction is aborted and the caller is informed.
The page allocator will fail a THP allocation if compaction failed due
to contention. This patch also introduces compact_trylock_irqsave()
which will acquire the lock only if it is not contended and the process
does not need to schedule.

Reported-by: Jim Schutt <jaschut@sandia.gov>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21 16:45:03 -07:00
Mel Gorman de74f1cc3b mm: have order > 0 compaction start near a pageblock with free pages
Commit 7db8889ab0 ("mm: have order > 0 compaction start off where it
left") introduced a caching mechanism to reduce the amount work the free
page scanner does in compaction.  However, it has a problem.  Consider
two process simultaneously scanning free pages

					    			C
	Process A		M     S     			F
			|---------------------------------------|
	Process B		M 	FS

	C is zone->compact_cached_free_pfn
	S is cc->start_pfree_pfn
	M is cc->migrate_pfn
	F is cc->free_pfn

In this diagram, Process A has just reached its migrate scanner, wrapped
around and updated compact_cached_free_pfn accordingly.

Simultaneously, Process B finishes isolating in a block and updates
compact_cached_free_pfn again to the location of its free scanner.

Process A moves to "end_of_zone - one_pageblock" and runs this check

                if (cc->order > 0 && (!cc->wrapped ||
                                      zone->compact_cached_free_pfn >
                                      cc->start_free_pfn))
                        pfn = min(pfn, zone->compact_cached_free_pfn);

compact_cached_free_pfn is above where it started so the free scanner
skips almost the entire space it should have scanned.  When there are
multiple processes compacting it can end in a situation where the entire
zone is not being scanned at all.  Further, it is possible for two
processes to ping-pong update to compact_cached_free_pfn which is just
random.

Overall, the end result wrecks allocation success rates.

There is not an obvious way around this problem without introducing new
locking and state so this patch takes a different approach.

First, it gets rid of the skip logic because it's not clear that it
matters if two free scanners happen to be in the same block but with
racing updates it's too easy for it to skip over blocks it should not.

Second, it updates compact_cached_free_pfn in a more limited set of
circumstances.

If a scanner has wrapped, it updates compact_cached_free_pfn to the end
	of the zone. When a wrapped scanner isolates a page, it updates
	compact_cached_free_pfn to point to the highest pageblock it
	can isolate pages from.

If a scanner has not wrapped when it has finished isolated pages it
	checks if compact_cached_free_pfn is pointing to the end of the
	zone. If so, the value is updated to point to the highest
	pageblock that pages were isolated from. This value will not
	be updated again until a free page scanner wraps and resets
	compact_cached_free_pfn.

This is not optimal and it can still race but the compact_cached_free_pfn
will be pointing to or very near a pageblock with free pages.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21 16:45:03 -07:00
Minchan Kim c81758fbe0 mm/compaction.c: fix deferring compaction mistake
Commit aff622495c ("vmscan: only defer compaction for failed order and
higher") fixed bad deferring policy but made mistake about checking
compact_order_failed in __compact_pgdat().  So it can't update
compact_order_failed with the new order.  This ends up preventing
correct operation of policy deferral.  This patch fixes it.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21 16:45:03 -07:00
Rik van Riel 7db8889ab0 mm: have order > 0 compaction start off where it left
Order > 0 compaction stops when enough free pages of the correct page
order have been coalesced.  When doing subsequent higher order
allocations, it is possible for compaction to be invoked many times.

However, the compaction code always starts out looking for things to
compact at the start of the zone, and for free pages to compact things to
at the end of the zone.

This can cause quadratic behaviour, with isolate_freepages starting at the
end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.

This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.

The obvious solution is to have isolate_freepages remember where it left
off last time, and continue at that point the next time it gets invoked
for an order > 0 compaction.  This could cause compaction to fail if
cc->free_pfn and cc->migrate_pfn are close together initially, in that
case we restart from the end of the zone and try once more.

Forced full (order == -1) compactions are left alone.

[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: s/laste/last/, use 80 cols]
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Jim Schutt <jaschut@sandia.gov>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Cc: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:43 -07:00
David Rientjes 4bf2bba375 mm, thp: abort compaction if migration page cannot be charged to memcg
If page migration cannot charge the temporary page to the memcg,
migrate_pages() will return -ENOMEM.  This isn't considered in memory
compaction however, and the loop continues to iterate over all
pageblocks trying to isolate and migrate pages.  If a small number of
very large memcgs happen to be oom, however, these attempts will mostly
be futile leading to an enormous amout of cpu consumption due to the
page migration failures.

This patch will short circuit and fail memory compaction if
migrate_pages() returns -ENOMEM.  COMPACT_PARTIAL is returned in case
some migrations were successful so that the page allocator will retry.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-11 16:04:43 -07:00
Linus Torvalds 68e3e92620 Revert "mm: compaction: handle incorrect MIGRATE_UNMOVABLE type pageblocks"
This reverts commit 5ceb9ce6fe.

That commit seems to be the cause of the mm compation list corruption
issues that Dave Jones reported.  The locking (or rather, absense
there-of) is dubious, as is the use of the 'page' variable once it has
been found to be outside the pageblock range.

So revert it for now, we can re-visit this for 3.6.  If we even need to:
as Minchan Kim says, "The patch wasn't a bug fix and even test workload
was very theoretical".

Reported-and-tested-by: Dave Jones <davej@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-03 20:05:57 -07:00
Hugh Dickins fa9add641b mm/memcg: apply add/del_page to lruvec
Take lruvec further: pass it instead of zone to add_page_to_lru_list() and
del_page_from_lru_list(); and pagevec_lru_move_fn() pass lruvec down to
its target functions.

This cleanup eliminates a swathe of cruft in memcontrol.c, including
mem_cgroup_lru_add_list(), mem_cgroup_lru_del_list() and
mem_cgroup_lru_move_lists() - which never actually touched the lists.

In their place, mem_cgroup_page_lruvec() to decide the lruvec, previously
a side-effect of add, and mem_cgroup_update_lru_size() to maintain the
lru_size stats.

Whilst these are simplifications in their own right, the goal is to bring
the evaluation of lruvec next to the spin_locking of the lrus, in
preparation for a future patch.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29 16:22:28 -07:00
Konstantin Khlebnikov f3fd4a6192 mm: remove lru type checks from __isolate_lru_page()
After patch "mm: forbid lumpy-reclaim in shrink_active_list()" we can
completely remove anon/file and active/inactive lru type filters from
__isolate_lru_page(), because isolation for 0-order reclaim always
isolates pages from right lru list.  And pages-isolation for lumpy
shrink_inactive_list() or memory-compaction anyway allowed to isolate
pages from all evictable lru lists.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29 16:22:25 -07:00
Bartlomiej Zolnierkiewicz 5ceb9ce6fe mm: compaction: handle incorrect MIGRATE_UNMOVABLE type pageblocks
When MIGRATE_UNMOVABLE pages are freed from MIGRATE_UNMOVABLE type
pageblock (and some MIGRATE_MOVABLE pages are left in it) waiting until an
allocation takes ownership of the block may take too long.  The type of
the pageblock remains unchanged so the pageblock cannot be used as a
migration target during compaction.

Fix it by:

* Adding enum compact_mode (COMPACT_ASYNC_[MOVABLE,UNMOVABLE], and
  COMPACT_SYNC) and then converting sync field in struct compact_control
  to use it.

* Adding nr_pageblocks_skipped field to struct compact_control and
  tracking how many destination pageblocks were of MIGRATE_UNMOVABLE type.
   If COMPACT_ASYNC_MOVABLE mode compaction ran fully in
  try_to_compact_pages() (COMPACT_COMPLETE) it implies that there is not a
  suitable page for allocation.  In this case then check how if there were
  enough MIGRATE_UNMOVABLE pageblocks to try a second pass in
  COMPACT_ASYNC_UNMOVABLE mode.

* Scanning the MIGRATE_UNMOVABLE pageblocks (during COMPACT_SYNC and
  COMPACT_ASYNC_UNMOVABLE compaction modes) and building a count based on
  finding PageBuddy pages, page_count(page) == 0 or PageLRU pages.  If all
  pages within the MIGRATE_UNMOVABLE pageblock are in one of those three
  sets change the whole pageblock type to MIGRATE_MOVABLE.

My particular test case (on a ARM EXYNOS4 device with 512 MiB, which means
131072 standard 4KiB pages in 'Normal' zone) is to:

- allocate 120000 pages for kernel's usage
- free every second page (60000 pages) of memory just allocated
- allocate and use 60000 pages from user space
- free remaining 60000 pages of kernel memory
  (now we have fragmented memory occupied mostly by user space pages)
- try to allocate 100 order-9 (2048 KiB) pages for kernel's usage

The results:
- with compaction disabled I get 11 successful allocations
- with compaction enabled - 14 successful allocations
- with this patch I'm able to get all 100 successful allocations

NOTE: If we can make kswapd aware of order-0 request during compaction, we
can enhance kswapd with changing mode to COMPACT_ASYNC_FULL
(COMPACT_ASYNC_MOVABLE + COMPACT_ASYNC_UNMOVABLE).  Please see the
following thread:

	http://marc.info/?l=linux-mm&m=133552069417068&w=2

[minchan@kernel.org: minor cleanups]
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29 16:22:22 -07:00
Michal Nazarewicz 47118af076 mm: mmzone: MIGRATE_CMA migration type added
The MIGRATE_CMA migration type has two main characteristics:
(i) only movable pages can be allocated from MIGRATE_CMA
pageblocks and (ii) page allocator will never change migration
type of MIGRATE_CMA pageblocks.

This guarantees (to some degree) that page in a MIGRATE_CMA page
block can always be migrated somewhere else (unless there's no
memory left in the system).

It is designed to be used for allocating big chunks (eg. 10MiB)
of physically contiguous memory.  Once driver requests
contiguous memory, pages from MIGRATE_CMA pageblocks may be
migrated away to create a contiguous block.

To minimise number of migrations, MIGRATE_CMA migration type
is the last type tried when page allocator falls back to other
migration types when requested.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Rob Clark <rob.clark@linaro.org>
Tested-by: Ohad Ben-Cohen <ohad@wizery.com>
Tested-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
Tested-by: Robert Nelson <robertcnelson@gmail.com>
Tested-by: Barry Song <Baohua.Song@csr.com>
2012-05-21 15:09:32 +02:00