android_kernel_samsung_msm8976/mm
Gu Zheng dfb06c8557 mm/memory hotplug: postpone the reset of obsolete pgdat
commit b0dc3a342af36f95a68fe229b8f0f73552c5ca08 upstream.

Qiu Xishi reported the following BUG when testing hot-add/hot-remove node under
stress condition:

  BUG: unable to handle kernel paging request at 0000000000025f60
  IP: next_online_pgdat+0x1/0x50
  PGD 0
  Oops: 0000 [#1] SMP
  ACPI: Device does not support D3cold
  Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf]
  CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G           O 3.10.15-5885-euler0302 #1
  Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
  Workqueue: events vmstat_update
  task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000
  RIP: 0010: next_online_pgdat+0x1/0x50
  RSP: 0018:ffffa800d32afce8  EFLAGS: 00010286
  RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082
  RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000
  RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96
  R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440
  R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800
  FS:  0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
    refresh_cpu_vm_stats+0xd0/0x140
    vmstat_update+0x11/0x50
    process_one_work+0x194/0x3d0
    worker_thread+0x12b/0x410
    kthread+0xc6/0xd0
    ret_from_fork+0x7c/0xb0

The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
try_offline_node, which will reset all the content of pgdat to 0, as the
pgdat is accessed lock-free, so that the users still using the pgdat
will panic, such as the vmstat_update routine.

process A:				offline node XX:

vmstat_updat()
   refresh_cpu_vm_stats()
     for_each_populated_zone()
       find online node XX
     cond_resched()
					offline cpu and memory, then try_offline_node()
					node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat))
       zone = next_zone(zone)
         pg_data_t *pgdat = zone->zone_pgdat;  // here pgdat is NULL now
           next_online_pgdat(pgdat)
             next_online_node(pgdat->node_id);  // NULL pointer access

So the solution here is postponing the reset of obsolete pgdat from
try_offline_node() to hotadd_new_pgdat(), and just resetting
pgdat->nr_zones and pgdat->classzone_idx to be 0 rather than the memset
0 to avoid breaking pointer information in pgdat.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reported-by: Xishi Qiu <qiuxishi@huawei.com>
Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-19 10:10:47 +02:00
..
backing-dev.c bdi: avoid oops on device removal 2014-04-26 17:15:35 -07:00
balloon_compaction.c mm: introduce a common interface for balloon pages mobility 2012-12-11 17:22:26 -08:00
bootmem.c mm: Add alloc_bootmem_low_pages_nopanic() 2013-01-29 19:32:59 -08:00
bounce.c mm/bounce.c: fix a regression where MS_SNAP_STABLE (stable pages snapshotting) was ignored 2013-10-13 16:08:33 -07:00
cleancache.c mm: cleancache: clean up cleancache_enabled 2013-04-30 17:04:01 -07:00
compaction.c mm/compaction: fix wrong order check in compact_finished() 2015-03-18 13:22:28 +01:00
debug-pagealloc.c
dmapool.c dmapool: make DMAPOOL_DEBUG detect corruption of free marker 2012-12-11 17:22:24 -08:00
fadvise.c teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long 2013-03-03 22:46:22 -05:00
failslab.c
filemap.c mm: memcg: handle non-error OOM situations more gracefully 2014-11-21 09:22:56 -08:00
filemap_xip.c lift sb_start_write() out of ->write() 2013-04-09 14:12:56 -04:00
fremap.c mm: fix use-after-free in sys_remap_file_pages 2014-01-09 12:24:24 -08:00
frontswap.c mm: frontswap: invalidate expired data on a dup-store failure 2014-12-16 09:09:41 -08:00
highmem.c Some nice cleanups, and even a patch my wife did as a "live" demo for 2012-12-20 08:37:05 -08:00
huge_memory.c mm: numa: Do not mark PTEs pte_numa when splitting huge pages 2014-10-09 12:18:42 -07:00
hugetlb.c mm/hugetlb: add migration entry check in __unmap_hugepage_range 2015-03-18 13:22:27 +01:00
hugetlb_cgroup.c mm/hugetlb: create hugetlb cgroup file in hugetlb_init 2012-12-18 15:02:15 -08:00
hwpoison-inject.c
init-mm.c
internal.h mm: accelerate munlock() treatment of THP pages 2013-02-27 19:10:09 -08:00
interval_tree.c mm: add CONFIG_DEBUG_VM_RB build option 2012-10-09 16:22:42 +09:00
Kconfig mmKconfig: add an option to disable bounce 2013-04-29 15:54:40 -07:00
Kconfig.debug
kmemcheck.c
kmemleak-test.c
kmemleak.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
ksm.c mm: close PageTail race 2014-04-03 12:01:05 -07:00
maccess.c
madvise.c mm: madvise: complete input validation before taking lock 2013-04-29 15:54:37 -07:00
Makefile memcg: add memory.pressure_level events 2013-04-29 15:54:38 -07:00
memblock.c memblock: fix missing comment of memblock_insert_region() 2013-04-29 15:54:38 -07:00
memcontrol.c mm: memcg: handle non-error OOM situations more gracefully 2014-11-21 09:22:56 -08:00
memory-failure.c mm/memory-failure.c: don't let collect_procs() skip over processes for MF_ACTION_REQUIRED 2014-06-30 20:09:42 -07:00
memory.c mm/memory.c: actually remap enough memory 2015-03-18 13:22:28 +01:00
memory_hotplug.c mm/memory hotplug: postpone the reset of obsolete pgdat 2015-04-19 10:10:47 +02:00
mempolicy.c cpuset,mempolicy: fix sleeping function called from invalid context 2014-07-17 15:58:00 -07:00
mempool.c
migrate.c mm: numa: avoid unnecessary work on the failure path 2014-01-09 12:24:23 -08:00
mincore.c swap: make each swap partition have one address_space 2013-02-23 17:50:17 -08:00
mlock.c mm: try_to_unmap_cluster() should lock_page() before mlocking 2014-05-06 07:55:32 -07:00
mm_init.c mm: init: report on last-nid information stored in page->flags 2013-02-23 17:50:18 -08:00
mmap.c mm/mmap.c: fix arithmetic overflow in __vm_enough_memory() 2015-03-18 13:22:27 +01:00
mmu_context.c mm: remove old aio use_mm() comment 2013-05-07 18:38:27 -07:00
mmu_notifier.c mm: mmu_notifier: re-fix freed page still mapped in secondary MMU 2013-05-24 16:22:51 -07:00
mmzone.c mm: rename page struct field helpers 2013-02-23 17:50:18 -08:00
mprotect.c mm: fix TLB flush race between migration, and change_protection_range 2014-01-09 12:24:23 -08:00
mremap.c mm, thp: close race between mremap() and split_huge_page() 2014-06-07 13:25:31 -07:00
msync.c
nobootmem.c mm, nobootmem: do memset() after memblock_reserve() 2013-04-29 15:54:39 -07:00
nommu.c mm/nommu.c: fix arithmetic overflow in __vm_enough_memory() 2015-03-18 13:22:28 +01:00
oom_kill.c mm: memcg: handle non-error OOM situations more gracefully 2014-11-21 09:22:56 -08:00
page-writeback.c mm: __set_page_dirty_nobuffers() uses spin_lock_irqsave() instead of spin_lock_irq() 2014-02-20 11:06:11 -08:00
page_alloc.c OOM, PM: OOM killed task shouldn't escape PM suspend 2014-11-14 08:47:58 -08:00
page_cgroup.c cgroup/kmemleak: add kmemleak_free() for cgroup deallocations. 2014-11-14 08:47:59 -08:00
page_io.c Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block 2013-05-08 10:13:35 -07:00
page_isolation.c mm: fix zone_watermark_ok_safe() accounting of isolated pages 2013-01-04 16:11:46 -08:00
pagewalk.c mm: pagewalk: call pte_hole() for VM_PFNMAP during walk_page_range 2015-02-11 14:48:16 +08:00
percpu-km.c
percpu-vm.c percpu: perform tlb flush after pcpu_map_pages() failure 2014-10-05 14:54:13 -07:00
percpu.c Revert "percpu: free percpu allocation info for uniprocessor system" 2014-11-14 08:47:53 -08:00
pgtable-generic.c mm: fix TLB flush race between migration, and change_protection_range 2014-01-09 12:24:23 -08:00
process_vm_access.c Fix: compat_rw_copy_check_uvector() misuse in aio, readv, writev, and security keys 2013-03-12 11:05:45 -07:00
quicklist.c
readahead.c teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long 2013-03-03 22:46:22 -05:00
rmap.c mm: fix sleeping function warning from __put_anon_vma 2014-06-30 20:09:42 -07:00
shmem.c shmem: fix nlink for rename overwrite directory 2014-10-05 14:54:11 -07:00
slab.c slab: fix init_lock_keys 2013-07-21 18:21:26 -07:00
slab.h memcg: check that kmem_cache has memcg_params before accessing it 2013-09-07 22:09:58 -07:00
slab_common.c slab_common: fix the check for duplicate slab names 2014-07-31 12:53:50 -07:00
slob.c mm: rename page struct field helpers 2013-02-23 17:50:18 -08:00
slub.c slub: Fix calculation of cpu slabs 2014-02-13 13:48:00 -08:00
sparse-vmemmap.c sparse-vmemmap: specify vmemmap population range in bytes 2013-04-29 15:54:35 -07:00
sparse.c mm, hotplug: avoid compiling memory hotremove functions when disabled 2013-04-29 15:54:37 -07:00
swap.c mm: close PageTail race 2014-04-03 12:01:05 -07:00
swap_state.c swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O completion 2013-06-12 16:29:45 -07:00
swapfile.c frontswap: fix incorrect zeroing and allocation size for frontswap_map 2013-06-12 16:29:46 -07:00
truncate.c mm: Remove false WARN_ON from pagecache_isize_extended() 2014-11-14 08:48:00 -08:00
util.c vm_is_stack: use for_each_thread() rather then buggy while_each_thread() 2014-10-05 14:54:16 -07:00
vmalloc.c mm/vmalloc.c: fix an overflow bug in alloc_vmap_area() 2013-11-13 12:05:34 +09:00
vmpressure.c memcg: add memory.pressure_level events 2013-04-29 15:54:38 -07:00
vmscan.c mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed 2015-01-16 06:59:03 -08:00
vmstat.c mm: numa: return the number of base pages altered by protection changes 2013-12-08 07:29:27 -08:00