android_kernel_samsung_msm8976/kernel
Roman Pen 8a872b18df workqueue: fix ghost PENDING flag while doing MQ IO
commit 346c09f80459a3ad97df1816d6d606169a51001a upstream.

The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
with the following backtrace:

[  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
[  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
[  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[  601.350965] Call Trace:
[  601.351203]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[  601.351444]  [<ffffffff815b01d5>] schedule+0x35/0x80
[  601.351709]  [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
[  601.351958]  [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
[  601.352208]  [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
[  601.352446]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[  601.352688]  [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
[  601.352951]  [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[  601.353196]  [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
[  601.353440]  [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
[  601.353689]  [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
[  601.353958]  [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
[  601.354200]  [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
[  601.354441]  [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
[  601.354688]  [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
[  601.354932]  [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
[  601.355193]  [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
[  601.355432]  [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
[  601.355679]  [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
[  601.355925]  [<ffffffff81198379>] vfs_write+0xa9/0x1a0
[  601.356164]  [<ffffffff811c59d8>] kernel_write+0x38/0x50

The underlying device is a null_blk, with default parameters:

  queue_mode    = MQ
  submit_queues = 1

Verification that nullb0 has something inflight:

root@pserver8:~# cat /sys/block/nullb0/inflight
       0        1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
        ffff8838038e2400
...

During debug it became clear that stalled request is always inserted in
the rq_list from the following path:

   save_stack_trace_tsk + 34
   blk_mq_insert_requests + 231
   blk_mq_flush_plug_list + 281
   blk_flush_plug_list + 199
   wait_on_page_bit + 192
   __filemap_fdatawait_range + 228
   filemap_fdatawait_range + 20
   filemap_write_and_wait_range + 63
   blkdev_fsync + 27
   vfs_fsync_range + 73
   blkdev_write_iter + 202
   __vfs_write + 170
   vfs_write + 169
   kernel_write + 56

So blk_flush_plug_list() was called with from_schedule == true.

If from_schedule is true, that means that finally blk_mq_insert_requests()
offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
i.e. it calls kblockd_schedule_delayed_work_on().

That means, that we race with another CPU, which is about to execute
__blk_mq_run_hw_queue() work.

Further debugging shows the following traces from different CPUs:

  CPU#0                                  CPU#1
  ----------------------------------     -------------------------------
  reqeust A inserted
  STORE hctx->ctx_map[0] bit marked
  kblockd_schedule...() returns 1
  <schedule to kblockd workqueue>
                                         request B inserted
                                         STORE hctx->ctx_map[1] bit marked
                                         kblockd_schedule...() returns 0
  *** WORK PENDING bit is cleared ***
  flush_busy_ctxs() is executed, but
  bit 1, set by CPU#1, is not observed

As a result request B pended forever.

This behaviour can be explained by speculative LOAD of hctx->ctx_map on
CPU#0, which is reordered with clear of PENDING bit and executed _before_
actual STORE of bit 1 on CPU#1.

The proper fix is an explicit full barrier <mfence>, which guarantees
that clear of PENDING bit is to be executed before all possible
speculative LOADS or STORES inside actual work function.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Michael Wang <yun.wang@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2016-06-07 10:42:51 +02:00
..
cpu sched, idle: Fix the idle polling state logic 2013-11-29 11:11:42 -08:00
debug kdb: fix incorrect counts in KDB summary command output 2015-03-06 14:40:52 -08:00
events ptrace: use fsuid, fsgid, effective creds for fs access checks 2016-02-25 11:57:47 -08:00
gcov
irq genirq: Prevent chip buslock deadlock 2016-03-03 15:06:20 -08:00
power PM / QoS: remove duplicate call to pm_qos_update_target 2015-03-18 13:22:28 +01:00
sched sched/cputime: Fix steal time accounting vs. CPU hotplug 2016-06-07 10:42:48 +02:00
time posix-clock: Fix return code on the poll method's error path 2016-03-03 15:06:23 -08:00
trace tracing: Fix trace_printk() to print when not using bprintk() 2016-06-07 10:42:47 +02:00
.gitignore kernel/hz.bc: ignore. 2013-04-22 07:09:06 -07:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks locking/mutex: Disable optimistic spinning on some architectures 2014-07-28 08:00:07 -07:00
Kconfig.preempt
Makefile We get rid of the general module prefix confusion with a binary config option, 2013-05-05 10:58:06 -07:00
acct.c fs: Fix hang with BSD accounting on frozen filesystem 2013-05-04 14:57:58 -04:00
async.c
audit.c CAPABILITIES: remove undefined caps from all processes 2014-09-17 09:03:57 -07:00
audit.h audit: fix mq_open and mq_unlink to add the MQ root as a hidden parent audit_names record 2013-12-04 10:57:03 -08:00
audit_tree.c audit: keep inode pinned 2014-11-21 09:22:52 -08:00
audit_watch.c
auditfilter.c auditfilter.c: fix kernel-doc warnings 2013-05-24 16:22:52 -07:00
auditsc.c auditsc: audit_krule mask accesses need bounds checking 2014-06-16 13:42:53 -07:00
backtracetest.c
bounds.c
capability.c CAPABILITIES: remove undefined caps from all processes 2014-09-17 09:03:57 -07:00
cgroup.c move d_rcu from overlapping d_child to overlapping d_alias 2015-04-29 10:34:00 +02:00
cgroup_freezer.c
compat.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal 2013-05-01 07:21:43 -07:00
configs.c proc: Supply PDE attribute setting accessor functions 2013-05-01 17:29:18 -04:00
context_tracking.c Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-06-20 08:18:35 -10:00
cpu.c sched: Fix hotplug vs. set_cpus_allowed_ptr() 2014-06-11 12:03:24 -07:00
cpu_pm.c
cpuset.c cpuset,mempolicy: fix sleeping function called from invalid context 2014-07-17 15:58:00 -07:00
crash_dump.c
cred.c
delayacct.c
dma.c
elfcore.c
exec_domain.c
exit.c introduce for_each_thread() to replace the buggy while_each_thread() 2014-10-05 14:54:15 -07:00
extable.c extable: Flip the sorting message 2013-04-15 13:25:16 +02:00
fork.c unshare: Unsharing a thread does not require unsharing a vm 2015-10-01 12:07:28 +02:00
freezer.c freezer: Do not freeze tasks killed by OOM killer 2014-11-14 08:47:58 -08:00
futex.c futex: Drop refcount if requeue_pi() acquired the rtmutex 2016-02-25 11:57:49 -08:00
futex_compat.c ptrace: use fsuid, fsgid, effective creds for fs access checks 2016-02-25 11:57:47 -08:00
groups.c userns: Don't allow setgroups until a gid mapping has been setablished 2015-01-08 09:58:16 -08:00
hrtimer.c hrtimer: Set expiry time before switch_hrtimer_base() 2014-06-07 13:25:31 -07:00
hung_task.c
irq_work.c
itimer.c
jump_label.c
kallsyms.c kernel: kallsyms: memory override issue, need check destination buffer length 2013-04-15 15:17:26 +09:30
kcmp.c ptrace: use fsuid, fsgid, effective creds for fs access checks 2016-02-25 11:57:47 -08:00
kexec.c PCI: Disable Bus Master only on kexec reboot 2013-12-20 07:45:08 -08:00
kmod.c usermodehelper: check subprocess_info->path != NULL 2013-05-16 12:01:11 -07:00
kprobes.c kprobes: Fix to free gone and unused optprobes 2013-05-28 10:37:59 +02:00
ksysfs.c
kthread.c kthread: implement probe_kthread_data() 2013-04-30 17:04:02 -07:00
latencytop.c
lglock.c
lockdep.c Merge branch 'for-3.10/drivers' of git://git.kernel.dk/linux-block 2013-05-08 11:51:05 -07:00
lockdep_internals.h
lockdep_proc.c
lockdep_states.h
modsign_certificate.S CONFIG_SYMBOL_PREFIX: cleanup. 2013-03-15 15:09:43 +10:30
modsign_pubkey.c
module-internal.h
module.c modules: fix longstanding /proc/kallsyms vs module insertion race. 2016-03-16 08:41:37 -07:00
module_signing.c
mutex-debug.c
mutex-debug.h
mutex.c mutex: Back out architecture specific check for negative mutex count 2013-04-19 09:33:36 +02:00
mutex.h
notifier.c
nsproxy.c proc: Split the namespace stuff out into linux/proc_ns.h 2013-05-01 17:29:39 -04:00
padata.c
panic.c dump_stack: implement arch-specific hardware description in task dumps 2013-04-30 17:04:02 -07:00
params.c params: Fix potential memory leak in add_sysfs_param() 2013-03-18 11:40:21 +00:00
pid.c exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting 2015-01-08 09:58:17 -08:00
pid_namespace.c pid_namespace: pidns_get() should check task_active_pid_ns() != NULL 2014-04-26 17:15:34 -07:00
posix-cpu-timers.c posix_timers: Fix pre-condition to stop the tick on full dynticks 2013-04-22 19:59:25 +02:00
posix-timers.c posix-timers: Fix stack info leak in timer_create() 2014-11-14 08:48:00 -08:00
printk.c console: Fix console name size mismatch 2015-04-19 10:10:51 +02:00
profile.c proc: Supply PDE attribute setting accessor functions 2013-05-01 17:29:18 -04:00
ptrace.c ptrace: use fsuid, fsgid, effective creds for fs access checks 2016-02-25 11:57:47 -08:00
range.c range: Do not add new blank slot with add_range_with_merge 2013-06-18 11:32:10 -05:00
rcu.h
rcupdate.c
rcutiny.c
rcutiny_plugin.h
rcutorture.c
rcutree.c rcu: Fix deadlock with CPU hotplug, RCU GP init, and timer migration 2013-06-10 13:37:12 -07:00
rcutree.h rcu: Don't call wakeup() with rcu_node structure ->lock held 2013-06-10 13:37:11 -07:00
rcutree_plugin.h rcu: Don't allocate bootmem from rcu_init() 2013-05-15 10:41:12 -07:00
rcutree_trace.c rcutrace: single_open() leaks 2013-05-05 00:16:35 -04:00
relay.c Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block 2013-05-08 10:13:35 -07:00
res_counter.c
resource.c kernel/resource.c: fix muxed resource handling in __request_region() 2016-03-03 15:06:24 -08:00
rtmutex-debug.c
rtmutex-debug.h rtmutex: Handle deadlock detection smarter 2014-07-17 15:58:04 -07:00
rtmutex-tester.c locking/rtmutex/tester: Set correct permissions on sysfs files 2013-04-10 14:48:37 +02:00
rtmutex.c rtmutex: Plug slow unlock race 2014-07-17 15:58:04 -07:00
rtmutex.h rtmutex: Handle deadlock detection smarter 2014-07-17 15:58:04 -07:00
rtmutex_common.h
rwsem.c Revert "rw_semaphore: remove up/down_read_non_owner" 2013-03-23 15:53:52 -07:00
seccomp.c seccomp: allow BPF_XOR based ALU instructions. 2013-03-26 11:07:19 +11:00
semaphore.c semaphore: use `bool' type for semaphore_waiter's up 2013-04-30 17:04:08 -07:00
signal.c kernel/signal.c: unexport sigsuspend() 2016-02-19 14:22:37 -08:00
smp.c kernel/smp.c:on_each_cpu_cond(): fix warning in fallback path 2014-09-17 09:03:57 -07:00
smpboot.c smpboot: Add missing get_online_cpus() in smpboot_register_percpu_thread() 2015-02-11 14:48:17 +08:00
smpboot.h
softirq.c revert "softirq: Add support for triggering softirq work on softirqs" 2015-05-17 09:51:33 -07:00
spinlock.c
srcu.c
stacktrace.c
stop_machine.c
sys.c reboot: rigrate shutdown/reboot to boot cpu 2013-06-12 16:29:44 -07:00
sys_ni.c unify compat fanotify_mark(2), switch to COMPAT_SYSCALL_DEFINE 2013-05-09 13:46:38 -04:00
sysctl.c perf: Enforce 1 as lower limit for perf_event_max_sample_rate 2014-06-11 12:03:27 -07:00
sysctl_binary.c switch compat_sys_sysctl to COMPAT_SYSCALL_DEFINE 2013-05-09 14:53:20 -04:00
task_work.c
taskstats.c
test_kprobes.c kernel/: rename random32() to prandom_u32() 2013-04-29 18:28:42 -07:00
time.c time: settimeofday: Validate the values of tv from user 2015-01-29 17:40:56 -08:00
timeconst.bc
timer.c timer: Prevent overflow in apply_slack 2014-06-07 13:25:30 -07:00
tracepoint.c tracepoint: Do not waste memory on mods with no tracepoints 2014-05-30 21:52:11 -07:00
tsacct.c
uid16.c groups: Consolidate the setgroups permission checks 2015-01-08 09:58:16 -08:00
up.c
user-return-notifier.c
user.c userns: Add a knob to disable setgroups on a per user namespace basis 2015-01-08 09:58:16 -08:00
user_namespace.c userns: Allow setting gid_maps without privilege when setgroups is disabled 2015-01-08 09:58:17 -08:00
utsname.c proc: Split the namespace stuff out into linux/proc_ns.h 2013-05-01 17:29:39 -04:00
utsname_sysctl.c
wait.c
watchdog.c
workqueue.c workqueue: fix ghost PENDING flag while doing MQ IO 2016-06-07 10:42:51 +02:00
workqueue_internal.h workqueue: include workqueue info when printing debug dump of a worker task 2013-04-30 17:04:02 -07:00