I got a KASAN report of use-after-free:
==================================================================
BUG: KASAN: use-after-free in klist_iter_exit+0x61/0x70 at addr ffff8800b6581508
Read of size 8 by task trinity-c1/315
=============================================================================
BUG kmalloc-32 (Not tainted): kasan: bad access detected
-----------------------------------------------------------------------------
Disabling lock debugging due to kernel taint
INFO: Allocated in disk_seqf_start+0x66/0x110 age=144 cpu=1 pid=315
___slab_alloc+0x4f1/0x520
__slab_alloc.isra.58+0x56/0x80
kmem_cache_alloc_trace+0x260/0x2a0
disk_seqf_start+0x66/0x110
traverse+0x176/0x860
seq_read+0x7e3/0x11a0
proc_reg_read+0xbc/0x180
do_loop_readv_writev+0x134/0x210
do_readv_writev+0x565/0x660
vfs_readv+0x67/0xa0
do_preadv+0x126/0x170
SyS_preadv+0xc/0x10
do_syscall_64+0x1a1/0x460
return_from_SYSCALL_64+0x0/0x6a
INFO: Freed in disk_seqf_stop+0x42/0x50 age=160 cpu=1 pid=315
__slab_free+0x17a/0x2c0
kfree+0x20a/0x220
disk_seqf_stop+0x42/0x50
traverse+0x3b5/0x860
seq_read+0x7e3/0x11a0
proc_reg_read+0xbc/0x180
do_loop_readv_writev+0x134/0x210
do_readv_writev+0x565/0x660
vfs_readv+0x67/0xa0
do_preadv+0x126/0x170
SyS_preadv+0xc/0x10
do_syscall_64+0x1a1/0x460
return_from_SYSCALL_64+0x0/0x6a
CPU: 1 PID: 315 Comm: trinity-c1 Tainted: G B 4.7.0+ #62
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
ffffea0002d96000 ffff880119b9f918 ffffffff81d6ce81 ffff88011a804480
ffff8800b6581500 ffff880119b9f948 ffffffff8146c7bd ffff88011a804480
ffffea0002d96000 ffff8800b6581500 fffffffffffffff4 ffff880119b9f970
Call Trace:
[<ffffffff81d6ce81>] dump_stack+0x65/0x84
[<ffffffff8146c7bd>] print_trailer+0x10d/0x1a0
[<ffffffff814704ff>] object_err+0x2f/0x40
[<ffffffff814754d1>] kasan_report_error+0x221/0x520
[<ffffffff8147590e>] __asan_report_load8_noabort+0x3e/0x40
[<ffffffff83888161>] klist_iter_exit+0x61/0x70
[<ffffffff82404389>] class_dev_iter_exit+0x9/0x10
[<ffffffff81d2e8ea>] disk_seqf_stop+0x3a/0x50
[<ffffffff8151f812>] seq_read+0x4b2/0x11a0
[<ffffffff815f8fdc>] proc_reg_read+0xbc/0x180
[<ffffffff814b24e4>] do_loop_readv_writev+0x134/0x210
[<ffffffff814b4c45>] do_readv_writev+0x565/0x660
[<ffffffff814b8a17>] vfs_readv+0x67/0xa0
[<ffffffff814b8de6>] do_preadv+0x126/0x170
[<ffffffff814b92ec>] SyS_preadv+0xc/0x10
This problem can occur in the following situation:
open()
- pread()
- .seq_start()
- iter = kmalloc() // succeeds
- seqf->private = iter
- .seq_stop()
- kfree(seqf->private)
- pread()
- .seq_start()
- iter = kmalloc() // fails
- .seq_stop()
- class_dev_iter_exit(seqf->private) // boom! old pointer
As the comment in disk_seqf_stop() says, stop is called even if start
failed, so we need to reinitialise the private pointer to NULL when seq
iteration stops.
An alternative would be to set the private pointer to NULL when the
kmalloc() in disk_seqf_start() fails.
Change-Id: I41ee55505a213f99a92ce630885e6c31b4b60232
Cc: stable@vger.kernel.org
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
NULL pointer dereference can happen in cfq_choose_cfqg()
when there are no cfq groups to select other than the
current serving group. Prevent this by adding a NULL
check before dereferencing.
Unable to handle kernel NULL pointer dereference at virtual address
[<c02502cc>] (cfq_dispatch_requests+0x368/0x8c0) from
[<c0243f30>] (blk_peek_request+0x220/0x25c)
[<c0243f30>] (blk_peek_request+0x220/0x25c) from
[<c0243f74>] (blk_fetch_request+0x8/0x1c)
[<c0243f74>] (blk_fetch_request+0x8/0x1c) from
[<c041cedc>] (mmc_queue_thread+0x58/0x120)
[<c041cedc>] (mmc_queue_thread+0x58/0x120) from
[<c00ad310>] (kthread+0x84/0x90)
[<c00ad310>] (kthread+0x84/0x90) from
[<c000eeac>] (kernel_thread_exit+0x0/0x8)
CRs-Fixed: 416466
Change-Id: I1fab93a4334b53e1d7c5dcc8f93cff174bae0d5e
Signed-off-by: Sujit Reddy Thumma <sthumma@codeaurora.org>
On failure of requests the console is flooded with prints,
which result in a watchdog bark.
This patch limits the number of prints to console thus, avoiding
the bark.
CRs-Fixed: 458071
Change-Id: I614002cc1305ab7b5c6a64d278139291c1baa3c4
Signed-off-by: Asutosh Das <asutoshd@codeaurora.org>
A set of processes may happen to perform interleaved reads, i.e.,requests
whose union would give rise to a sequential read pattern. There are two
typical cases: in the first case, processes read fixed-size chunks of
data at a fixed distance from each other, while in the second case processes
may read variable-size chunks at variable distances. The latter case occurs
for example with QEMU, which splits the I/O generated by the guest into
multiple chunks, and lets these chunks be served by a pool of cooperating
processes, iteratively assigning the next chunk of I/O to the first
available process. CFQ uses actual queue merging for the first type of
rocesses, whereas it uses preemption to get a sequential read pattern out
of the read requests performed by the second type of processes. In the end
it uses two different mechanisms to achieve the same goal: boosting the
throughput with interleaved I/O.
This patch introduces Early Queue Merge (EQM), a unified mechanism to get a
sequential read pattern with both types of processes. The main idea is
checking newly arrived requests against the next request of the active queue
both in case of actual request insert and in case of request merge. By doing
so, both the types of processes can be handled by just merging their queues.
EQM is then simpler and more compact than the pair of mechanisms used in
CFQ.
Finally, EQM also preserves the typical low-latency properties of BFQ, by
properly restoring the weight-raising state of a queue when it gets back to
a non-merged state.
Change-Id: I6e8e59d479c13669126ccaa7f8c2f9d54dab876f
Signed-off-by: Mauro Andreolini <mauro.andreolini@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Add the BFQ-v7r8 I/O scheduler to 3.4.
The general structure is borrowed from CFQ, as much of the code for
handling I/O contexts. Over time, several useful features have been
ported from CFQ as well (details in the changelog in README.BFQ). A
(bfq_)queue is associated to each task doing I/O on a device, and each
time a scheduling decision has to be made a queue is selected and served
until it expires.
- Slices are given in the service domain: tasks are assigned
budgets, measured in number of sectors. Once got the disk, a task
must however consume its assigned budget within a configurable
maximum time (by default, the maximum possible value of the
budgets is automatically computed to comply with this timeout).
This allows the desired latency vs "throughput boosting" tradeoff
to be set.
- Budgets are scheduled according to a variant of WF2Q+, implemented
using an augmented rb-tree to take eligibility into account while
preserving an O(log N) overall complexity.
- A low-latency tunable is provided; if enabled, both interactive
and soft real-time applications are guaranteed a very low latency.
- Latency guarantees are preserved also in the presence of NCQ.
- Also with flash-based devices, a high throughput is achieved
while still preserving latency guarantees.
- BFQ features Early Queue Merge (EQM), a sort of fusion of the
cooperating-queue-merging and the preemption mechanisms present
in CFQ. EQM is in fact a unified mechanism that tries to get a
sequential read pattern, and hence a high throughput, with any
set of processes performing interleaved I/O over a contiguous
sequence of sectors.
- BFQ supports full hierarchical scheduling, exporting a cgroups
interface. Since each node has a full scheduler, each group can
be assigned its own weight.
- If the cgroups interface is not used, only I/O priorities can be
assigned to processes, with ioprio values mapped to weights
with the relation weight = IOPRIO_BE_NR - ioprio.
- ioprio classes are served in strict priority order, i.e., lower
priority queues are not served as long as there are higher
priority queues. Among queues in the same class the bandwidth is
distributed in proportion to the weight of each queue. A very
thin extra bandwidth is however guaranteed to the Idle class, to
prevent it from starving.
Change-Id: I62eb1769f7d6b4e542a10a9c7751a454d31c04de
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Update Kconfig.iosched and do the related Makefile changes to include
kernel configuration options for BFQ. Also add the bfqio controller
to the cgroups subsystem.
Change-Id: I540ea28658b44c16b998f36eb9c97205f4f288f3
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
commit 02e2a5bfebe99edcf9d694575a75032d53fe1b73 upstream.
If md->signature == MAC_DRIVER_MAGIC and md->block_size == 1023, a single
512 byte sector would be read (secsize / 512). However the partition
structure would be located past the end of the buffer (secsize % 512).
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
commit 4d66e5e9b6d720d8463e11d027bd4ad91c8b1318 upstream.
=================================
[ INFO: inconsistent lock state ]
4.1.0-rc7+ #217 Tainted: G O
---------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/6/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
(ext_devt_lock){+.?...}, at: [<ffffffff8143a60c>] blk_free_devt+0x3c/0x70
{SOFTIRQ-ON-W} state was registered at:
[<ffffffff810bf6b1>] __lock_acquire+0x461/0x1e70
[<ffffffff810c1947>] lock_acquire+0xb7/0x290
[<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50
[<ffffffff8143a07d>] blk_alloc_devt+0x6d/0xd0 <-- take the lock in process context
[..]
[<ffffffff810bf64e>] __lock_acquire+0x3fe/0x1e70
[<ffffffff810c00ad>] ? __lock_acquire+0xe5d/0x1e70
[<ffffffff810c1947>] lock_acquire+0xb7/0x290
[<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70
[<ffffffff818ac3a8>] _raw_spin_lock+0x38/0x50
[<ffffffff8143a60c>] ? blk_free_devt+0x3c/0x70
[<ffffffff8143a60c>] blk_free_devt+0x3c/0x70 <-- take the lock in softirq
[<ffffffff8143bfec>] part_release+0x1c/0x50
[<ffffffff8158edf6>] device_release+0x36/0xb0
[<ffffffff8145ac2b>] kobject_cleanup+0x7b/0x1a0
[<ffffffff8145aad0>] kobject_put+0x30/0x70
[<ffffffff8158f147>] put_device+0x17/0x20
[<ffffffff8143c29c>] delete_partition_rcu_cb+0x16c/0x180
[<ffffffff8143c130>] ? read_dev_sector+0xa0/0xa0
[<ffffffff810e0e0f>] rcu_process_callbacks+0x2ff/0xa90
[<ffffffff810e0dcf>] ? rcu_process_callbacks+0x2bf/0xa90
[<ffffffff81067e2e>] __do_softirq+0xde/0x600
Neil sees this in his tests and it also triggers on pmem driver unbind
for the libnvdimm tests. This fix is on top of an initial fix by Keith
for incorrect usage of mutex_lock() in this path: 2da78092dd "block:
Fix dev_t minor allocation lifetime". Both this and 2da78092dd are
candidates for -stable.
Fixes: 2da78092dd ("block: Fix dev_t minor allocation lifetime")
Cc: Keith Busch <keith.busch@intel.com>
Reported-by: NeilBrown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@huawei.com>
commit 5fabcb4c33fe11c7e3afdf805fde26c1a54d0953 upstream.
We can get here from blkdev_ioctl() -> blkpg_ioctl() -> add_partition()
with a user passed in partno value. If we pass in 0x7fffffff, the
new target in disk_expand_part_tbl() overflows the 'int' and we
access beyond the end of ptbl->part[] and even write to it when we
do the rcu_assign_pointer() to assign the new partition.
Reported-by: David Ramos <daramos@stanford.edu>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
commit 84ce0f0e94 upstream.
When sg_scsi_ioctl() fails to prepare request to submit in
blk_rq_map_kern() we jump to a label where we just end up copying
(luckily zeroed-out) kernel buffer to userspace instead of reporting
error. Fix the problem by jumping to the right label.
CC: Jens Axboe <axboe@kernel.dk>
CC: linux-scsi@vger.kernel.org
Coverity-id: 1226871
Signed-off-by: Jan Kara <jack@suse.cz>
Fixed up the, now unused, out label.
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
commit b8839b8c55 upstream.
The math in both blk_stack_limits() and queue_limit_alignment_offset()
assume that a block device's io_min (aka minimum_io_size) is always a
power-of-2. Fix the math such that it works for non-power-of-2 io_min.
This issue (of alignment_offset != 0) became apparent when testing
dm-thinp with a thinp blocksize that matches a RAID6 stripesize of
1280K. Commit fdfb4c8c1 ("dm thin: set minimum_io_size to pool's data
block size") unlocked the potential for alignment_offset != 0 due to
the dm-thin-pool's io_min possibly being a non-power-of-2.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
commit 46f341ffcf upstream.
Commit 2da78092 changed the locking from a mutex to a spinlock,
so we now longer sleep in this context. But there was a leftover
might_sleep() in there, which now triggers since we do the final
free from an RCU callback. Get rid of it.
Reported-by: Pontus Fuchs <pontus.fuchs@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
commit 2da78092dd upstream.
Releases the dev_t minor when all references are closed to prevent
another device from acquiring the same major/minor.
Since the partition's release may be invoked from call_rcu's soft-irq
context, the ext_dev_idr's mutex had to be replaced with a spinlock so
as not so sleep.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
[lizf: Backported to 3.4:
- adjust context
- remove idr_preload() and idr_preload_end()]
Signed-off-by: Zefan Li <lizefan@huawei.com>
commit d45b3279a5 upstream.
There is no inherent reason why the last put of a tag structure must be
the one for the Scsi_Host, as device model objects can be held for
arbitrary periods. Merge blk_free_tags and __blk_free_tags into a single
funtion that just release a references and get rid of the BUG() when the
host reference wasn't the last.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit af5040da01 upstream.
trace_block_rq_complete does not take into account that request can
be partially completed, so we can get the following incorrect output
of blkparser:
C R 232 + 240 [0]
C R 240 + 232 [0]
C R 248 + 224 [0]
C R 256 + 216 [0]
but should be:
C R 232 + 8 [0]
C R 240 + 8 [0]
C R 248 + 8 [0]
C R 256 + 8 [0]
Also, the whole output summary statistics of completed requests and
final throughput will be incorrect.
This patch takes into account real completion size of the request and
fixes wrong completion accounting.
Signed-off-by: Roman Pen <r.peniaev@gmail.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@redhat.com>
CC: linux-kernel@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 893d290f1d upstream.
After we've done __elv_add_request() and __blk_run_queue() in
blk_execute_rq_nowait(), the request might finish and be freed
immediately. Therefore checking if the type is REQ_TYPE_PM_RESUME
isn't safe afterwards, because if it isn't, rq might be gone.
Instead, check beforehand and stash the result in a temporary.
This fixes crashes in blk_execute_rq_nowait() I get occasionally when
running with lots of memory debugging options enabled -- I think this
race is usually harmless because the window for rq to be reallocated
is so small.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[xr: Backported to 3.4: adjust context]
Signed-off-by: Rui Xiang <rui.xiang@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c8123f8c9c upstream.
When mkfs issues a full device discard and the device only
supports discards of a smallish size, we can loop in
blkdev_issue_discard() for a long time. If preempt isn't enabled,
this can turn into a softlock situation and the kernel will
start complaining.
Add an explicit cond_resched() at the end of the loop to avoid
that.
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7c8a3679e3 upstream.
Add locking of q->sysfs_lock into elevator_change() (an exported function)
to ensure it is held to protect q->elevator from elevator_init(), even if
elevator_change() is called from non-sysfs paths.
sysfs path (elv_iosched_store) uses __elevator_change(), non-locking
version, as the lock is already taken by elv_iosched_store().
Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama@hds.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Josh Boyer <jwboyer@fedoraproject.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d82ae52e68 upstream.
Without this patch all DM devices will default to BLK_MAX_SEGMENT_SIZE
(65536) even if the underlying device(s) have a larger value -- this is
due to blk_stack_limits() using min_not_zero() when stacking the
max_segment_size limit.
1073741824
before patch:
65536
after patch:
1073741824
Reported-by: Lukasz Flis <l.flis@cyfronet.pl>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4912aa6c11 upstream.
crocode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 491, comm: scsi_eh_0 Tainted: G W ---------------- 2.6.32-220.13.1.el6.x86_64 #1 IBM -[8722PAX]-/00D1461
RIP: 0010:[<ffffffff8124e424>] [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0
RSP: 0018:ffff881057eefd60 EFLAGS: 00010012
RAX: ffff881d99e3e8a8 RBX: ffff881d99e3e780 RCX: ffff881d99e3e8a8
RDX: ffff881d99e3e8a8 RSI: ffff881d99e3e780 RDI: ffff881d99e3e780
RBP: ffff881057eefd80 R08: ffff881057eefe90 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff881057f92338
R13: 0000000000000000 R14: ffff881057f92338 R15: ffff883058188000
FS: 0000000000000000(0000) GS:ffff880040200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000006d3ec0 CR3: 000000302cd7d000 CR4: 00000000000406b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process scsi_eh_0 (pid: 491, threadinfo ffff881057eee000, task ffff881057e29540)
Stack:
0000000000001057 0000000000000286 ffff8810275efdc0 ffff881057f16000
<0> ffff881057eefdd0 ffffffff81362323 ffff881057eefe20 ffffffff8135f393
<0> ffff881057e29af8 ffff8810275efdc0 ffff881057eefe78 ffff881057eefe90
Call Trace:
[<ffffffff81362323>] __scsi_queue_insert+0xa3/0x150
[<ffffffff8135f393>] ? scsi_eh_ready_devs+0x5e3/0x850
[<ffffffff81362a23>] scsi_queue_insert+0x13/0x20
[<ffffffff8135e4d4>] scsi_eh_flush_done_q+0x104/0x160
[<ffffffff8135fb6b>] scsi_error_handler+0x35b/0x660
[<ffffffff8135f810>] ? scsi_error_handler+0x0/0x660
[<ffffffff810908c6>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff81090830>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20
Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 <0f> 0b eb fe 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00
RIP [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0
RSP <ffff881057eefd60>
The RIP is this line:
BUG_ON(blk_queued_rq(rq));
After digging through the code, I think there may be a race between the
request completion and the timer handler running.
A timer is started for each request put on the device's queue (see
blk_start_request->blk_add_timer). If the request does not complete
before the timer expires, the timer handler (blk_rq_timed_out_timer)
will mark the request complete atomically:
static inline int blk_mark_rq_complete(struct request *rq)
{
return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
}
and then call blk_rq_timed_out. The latter function will call
scsi_times_out, which will return one of BLK_EH_HANDLED,
BLK_EH_RESET_TIMER or BLK_EH_NOT_HANDLED. If BLK_EH_RESET_TIMER is
returned, blk_clear_rq_complete is called, and blk_add_timer is again
called to simply wait longer for the request to complete.
Now, if the request happens to complete while this is going on, what
happens? Given that we know the completion handler will bail if it
finds the REQ_ATOM_COMPLETE bit set, we need to focus on the completion
handler running after that bit is cleared. So, from the above
paragraph, after the call to blk_clear_rq_complete. If the completion
sets REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom
there (I haven't seen this in the cores). Next, if we get the
completion before the call to list_add_tail, then the timer will
eventually fire for an old req, which may either be freed or reallocated
(there is evidence that this might be the case). Finally, if the
completion comes in *after* the addition to the timeout list, I think
it's harmless. The request will be removed from the timeout list,
req_atom_complete will be set, and all will be well.
This will only actually explain the coredumps *IF* the request
structure was freed, reallocated *and* queued before the error handler
thread had a chance to process it. That is possible, but it may make
sense to keep digging for another race. I think that if this is what
was happening, we would see other instances of this problem showing up
as null pointer or garbage pointer dereferences, for example when the
request structure was not re-used. It looks like we actually do run
into that situation in other reports.
This patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE,
&req->atomic_flags)); from blk_add_timer to the only caller that could
trip over it (blk_start_request). It then inverts the calls to
blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address
the race. I've boot tested this patch, but nothing more.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ffc8b30866 upstream.
Disk names may contain arbitrary strings, so they must not be
interpreted as format strings. It seems that only md allows arbitrary
strings to be used for disk names, but this could allow for a local
memory corruption from uid 0 into ring 0.
CVE-2013-2851
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c678ef5286 upstream.
As found by gcc-4.8, the QUEUE_SYSFS_BIT_FNS macro creates functions
that use a value generated by queue_var_store independent of whether
that value was set or not.
block/blk-sysfs.c: In function 'queue_store_nonrot':
block/blk-sysfs.c:244:385: warning: 'val' may be used uninitialized in this function [-Wmaybe-uninitialized]
Unlike most other such warnings, this one is not a false positive,
writing any non-number string into the sysfs files indeed has
an undefined result, rather than returning an error.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
blk_rq_map_sg() function merges the physically contiguous pages to use same
scatter-gather node without checking if their page descriptors are
contiguous or not.
Now when dma_map_sg() is called on the scatter gather list, it would
take the base page pointer from each node (one by one) and iterates
through all of the pages in same sg node by keep incrementing the base
page pointer with the assumption that physically contiguous pages will
have their page descriptor address contiguous which may not be true
if SPARSEMEM config is enabled. So here we may end referring to invalid
page descriptor.
Following table shows the example of physically contiguous pages but
their page descriptor addresses non-contiguous.
-------------------------------------------
| Page Descriptor | Physical Address |
------------------------------------------
| 0xc1e43fdc | 0xdffff000 |
| 0xc2052000 | 0xe0000000 |
-------------------------------------------
With this patch, relevant blk-merge functions will also check if the
physically contiguous pages are having page descriptors address contiguous
or not? If not then, these pages are separated to be in different
scatter-gather nodes.
(cherry picked from commit 6e25ce37a4f5750467f7c741b549687ebbc10667)
CRs-Fixed: 392141
Change-Id: I3601565e5569a69f06fb3af99061c4d4c23af241
Signed-off-by: Subhash Jadavani <subhashj@codeaurora.org>
commit ce23bba842 upstream.
idr allocation in blk_alloc_devt() wasn't synchronized against lookup
and removal, and its limit check was off by one - 1 << MINORBITS is
the number of minors allowed, not the maximum allowed minor.
Add locking and rename MAX_EXT_DEVT to NR_EXT_DEVT and fix limit
checking.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 7b74e91278 upstream.
While adding and removing a lot of disks disks and partitions this
sometimes shows up:
WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xc9/0x130() (Not tainted)
Hardware name:
sysfs: cannot create duplicate filename '/dev/block/259:751'
Modules linked in: raid1 autofs4 bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc scsi_tgt garp stp llc sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 dm_mirror dm_region_hash dm_log power_meter microcode dcdbas serio_raw amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core k10temp bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 dm_round_robin sr_mod cdrom sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas dm_multipath dm_mod [last unloaded: scsi_wait_scan]
Pid: 44103, comm: async/16 Not tainted 2.6.32-195.el6.x86_64 #1
Call Trace:
warn_slowpath_common+0x87/0xc0
warn_slowpath_fmt+0x46/0x50
sysfs_add_one+0xc9/0x130
sysfs_do_create_link+0x12b/0x170
sysfs_create_link+0x13/0x20
device_add+0x317/0x650
idr_get_new+0x13/0x50
add_partition+0x21c/0x390
rescan_partitions+0x32b/0x470
sd_open+0x81/0x1f0 [sd_mod]
__blkdev_get+0x1b6/0x3c0
blkdev_get+0x10/0x20
register_disk+0x155/0x170
add_disk+0xa6/0x160
sd_probe_async+0x13b/0x210 [sd_mod]
add_wait_queue+0x46/0x60
async_thread+0x102/0x250
default_wake_function+0x0/0x20
async_thread+0x0/0x250
kthread+0x96/0xa0
child_rip+0xa/0x20
kthread+0x0/0xa0
child_rip+0x0/0x20
This most likely happens because dev_t is freed while the number is
still used and idr_get_new() is not protected on every use. The fix
adds a mutex where it wasn't before and moves the dev_t free function so
it is called after device del.
Signed-off-by: Tomas Henzl <thenzl@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This test write data to the card, then send DISCARD on random
addresses on the card, send SANITIZE to the card to erase all
the unmapped areas.
Change-Id: I2904023ccd258e64b99b004bacfbe576b0ead59a
Signed-off-by: Maya Erez <merez@codeaurora.org>
- Fix test-iosched crash when running multiple tests
- Free the BIOs memory when a request is not completed
Change-Id: I1baa916c04ae73c809dee7e67ec63f4546dc71aa
Signed-off-by: Maya Erez <merez@codeaurora.org>
The test scheduler allows testing a block device by dispatching
specific requests according to the test case and declare PASS/FAIL
according to the requests completion error code
Change-Id: Ief91f9fed6e3c3c75627d27264d5252ea14f10ad
Signed-off-by: Maya Erez <merez@codeaurora.org>
For disk devices, a new uevent parameter 'NPARTS' specifies the number
of partitions detected by the kernel. Partition devices get 'PARTN' which
specifies the partitions index in the table, and 'PARTNAME', which
specifies PARTNAME specifices the partition name of a partition device
Signed-off-by: Dima Zavin <dima@android.com>
(cherry picked from commit 38c6302d9f)
Adding a new ioctl to support sanitize operation in eMMC
cards version 4.5.
The sanitize ioctl support helps performing this operation
via user application.
Change-Id: I79aa4163e7753a75bed5a26a9a92de902b4b9c21
Signed-off-by: Yaniv Gardi <ygardi@codeaurora.org>
Signed-off-by: Maya Erez <merez@codeaurora.org>
(cherry picked from commit 73937f5face75e05ec2a72966d04f4e20aa18379)
Conflicts:
block/blk-core.c
block/elevator.c
include/linux/blk_types.h
include/linux/blkdev.h
include/linux/fs.h
commit 6d93592807 upstream.
Sometimes, warnings about ioctls to partition happen often enough that they
form majority of the warnings in the kernel log and users complain. In some
cases warnings are about ioctls such as SG_IO so it's not good to get rid of
the warnings completely as they can ease debugging of userspace problems
when ioctl is refused.
Since I have seen warnings from lots of commands, including some proprietary
userspace applications, I don't think disallowing the ioctls for processes
with CAP_SYS_RAWIO will happen in the near future if ever. So lets just
stop warning for processes with CAP_SYS_RAWIO for which ioctl is allowed.
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: James Bottomley <JBottomley@parallels.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: satoru takeuchi <satoru.takeuchi@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 60ea8226cb upstream.
A queue newly allocated with blk_alloc_queue_node() has only
QUEUE_FLAG_BYPASS set. For request-based drivers,
blk_init_allocated_queue() is called and q->queue_flags is overwritten
with QUEUE_FLAG_DEFAULT which doesn't include BYPASS even though the
initial bypass is still in effect.
In blk_init_allocated_queue(), or QUEUE_FLAG_DEFAULT to q->queue_flags
instead of overwriting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6d1d8050b4 "block, partition: add partition_meta_info to hd_struct"
added part_unpack_uuid() which assumes that the passed in buffer has
enough space for sprintfing "%pU" - 37 characters including '\0'.
Unfortunately, b5af921ec0 "init: add support for root devices
specified by partition UUID" supplied 33 bytes buffer to the function
leading to the following panic with stackprotector enabled.
Kernel panic - not syncing: stack-protector: Kernel stack corrupted in: ffffffff81b14c7e
[<ffffffff815e226b>] panic+0xba/0x1c6
[<ffffffff81b14c7e>] ? printk_all_partitions+0x259/0x26xb
[<ffffffff810566bb>] __stack_chk_fail+0x1b/0x20
[<ffffffff81b15c7e>] printk_all_paritions+0x259/0x26xb
[<ffffffff81aedfe0>] mount_block_root+0x1bc/0x27f
[<ffffffff81aee0fa>] mount_root+0x57/0x5b
[<ffffffff81aee23b>] prepare_namespace+0x13d/0x176
[<ffffffff8107eec0>] ? release_tgcred.isra.4+0x330/0x30
[<ffffffff81aedd60>] kernel_init+0x155/0x15a
[<ffffffff81087b97>] ? schedule_tail+0x27/0xb0
[<ffffffff815f4d24>] kernel_thread_helper+0x5/0x10
[<ffffffff81aedc0b>] ? start_kernel+0x3c5/0x3c5
[<ffffffff815f4d20>] ? gs_change+0x13/0x13
Increase the buffer size, remove the dangerous part_unpack_uuid() and
use snprintf() directly from printk_all_partitions().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Szymon Gruszczynski <sz.gruszczynski@googlemail.com>
Cc: Will Drewry <wad@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block core bits from Jens Axboe:
"It's a nice and quiet round this time, since most of the tricky stuff
has been pushed to 3.5 to give it more time to mature. After a few
hectic block IO core changes for 3.3 and 3.2, I'm quite happy with a
slow round.
Really minor stuff in here, the only real functional change is making
the auto-unplug threshold a per-queue entity. The threshold is set so
that it's low enough that we don't hold off IO for too long, but still
big enough to get a nice benefit from the batched insert (and hence
queue lock cost reduction). For raid configurations, this currently
breaks down."
* 'for-3.4/core' of git://git.kernel.dk/linux-block:
block: make auto block plug flush threshold per-disk based
Documentation: Add sysfs ABI change for cfq's target latency.
block: Make cfq_target_latency tunable through sysfs.
block: use lockdep_assert_held for queue locking
block: blk_alloc_queue_node(): use caller's GFP flags instead of GFP_KERNEL
We do auto block plug flush to reduce latency, the threshold is 16
requests. This works well if the task is accessing one or two drives.
The problem is if the task is accessing a raid 0 device and the raid
disk number is big, say 8 or 16, 16/8 = 2 or 16/16=1, we will have
heavy lock contention.
This patch makes the threshold per-disk based. The latency should be
still ok accessing one or two drives. The setup with application
accessing a lot of drives in the meantime uaually is big machine,
avoiding lock contention is more important, because any contention
will actually increase latency.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In cfq, when we calculate a time slice for a process(or a cfqq to
be precise), we have to consider the cfq_target_latency so that all the
sync request have an estimated latency(300ms) and it is controlled by
cfq_target_latency. But in some hadoop test, we have found that if
there are many processes doing sequential read(24 for example), the
throughput is bad because every process can only work for about 25ms
and the cfqq is switched. That leads to a higher disk seek. We can
achive the good throughput by setting low_latency=0, but then some
read's latency is too much for the application.
So this patch makes cfq_target_latency tunable through sysfs so that
we can tune it and find some magic number which is not bad for both
the throughput and the read latency.
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We should use the GFP flags that the caller specified instead of picking
our own. All the callers specify GFP_KERNEL so this doesn't make a
difference to how the kernel runs, it's just a cleanup.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull cgroup changes from Tejun Heo:
"Out of the 8 commits, one fixes a long-standing locking issue around
tasklist walking and others are cleanups."
* 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list
cgroup: Remove wrong comment on cgroup_enable_task_cg_list()
cgroup: remove cgroup_subsys argument from callbacks
cgroup: remove extra calls to find_existing_css_set
cgroup: replace tasklist_lock with rcu_read_lock
cgroup: simplify double-check locking in cgroup_attach_proc
cgroup: move struct cgroup_pidlist out from the header file
cgroup: remove cgroup_attach_task_current_cg()
Pull scheduler changes for v3.4 from Ingo Molnar
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
printk: Make it compile with !CONFIG_PRINTK
sched/x86: Fix overflow in cyc2ns_offset
sched: Fix nohz load accounting -- again!
sched: Update yield() docs
printk/sched: Introduce special printk_sched() for those awkward moments
sched/nohz: Correctly initialize 'next_balance' in 'nohz' idle balancer
sched: Cleanup cpu_active madness
sched: Fix load-balance wreckage
sched: Clean up parameter passing of proc_sched_autogroup_set_nice()
sched: Ditch per cgroup task lists for load-balancing
sched: Rename load-balancing fields
sched: Move load-balancing arguments into helper struct
sched/rt: Do not submit new work when PI-blocked
sched/rt: Prevent idle task boosting
sched/wait: Add __wake_up_all_locked() API
sched/rt: Document scheduler related skip-resched-check sites
sched/rt: Use schedule_preempt_disabled()
sched/rt: Add schedule_preempt_disabled()
sched/rt: Do not throttle when PI boosting
sched/rt: Keep period timer ticking when rt throttling is active
...
Pull block fixes from Jens Axboe:
"Been sitting on this for a while, but lets get this out the door.
This fixes various important bugs for 3.3 final, along with a few more
trivial ones. Please pull!"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix ioc leak in put_io_context
block, sx8: fix pointer math issue getting fw version
Block: use a freezable workqueue for disk-event polling
drivers/block/DAC960: fix -Wuninitialized warning
drivers/block/DAC960: fix DAC960_V2_IOCTL_Opcode_T -Wenum-compare warning
block: fix __blkdev_get and add_disk race condition
block: Fix setting bio flags in drivers (sd_dif/floppy)
block: Fix NULL pointer dereference in sd_revalidate_disk
block: exit_io_context() should call elevator_exit_icq_fn()
block: simplify ioc_release_fn()
block: replace icq->changed with icq->flags
This patch (as1519) fixes a bug in the block layer's disk-events
polling. The polling is done by a work routine queued on the
system_nrt_wq workqueue. Since that workqueue isn't freezable, the
polling continues even in the middle of a system sleep transition.
Obviously, polling a suspended drive for media changes and such isn't
a good thing to do; in the case of USB mass-storage devices it can
lead to real problems requiring device resets and even re-enumeration.
The patch fixes things by creating a new system-wide, non-reentrant,
freezable workqueue and using it for disk-events polling.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
CC: <stable@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The following situation might occur:
__blkdev_get: add_disk:
register_disk()
get_gendisk()
disk_block_events()
disk->ev == NULL
disk_add_events()
__disk_unblock_events()
disk->ev != NULL
--ev->block
Then we unblock events, when they are suppose to be blocked. This can
trigger events related block/genhd.c warnings, but also can crash in
sd_check_events() or other places.
I'm able to reproduce crashes with the following scripts (with
connected usb dongle as sdb disk).
<snip>
DEV=/dev/sdb
ENABLE=/sys/bus/usb/devices/1-2/bConfigurationValue
function stop_me()
{
for i in `jobs -p` ; do kill $i 2> /dev/null ; done
exit
}
trap stop_me SIGHUP SIGINT SIGTERM
for ((i = 0; i < 10; i++)) ; do
while true; do fdisk -l $DEV 2>&1 > /dev/null ; done &
done
while true ; do
echo 1 > $ENABLE
sleep 1
echo 0 > $ENABLE
done
</snip>
I use the script to verify patch fixing oops in sd_revalidate_disk
http://marc.info/?l=linux-scsi&m=132935572512352&w=2
Without Jun'ichi Nomura patch titled "Fix NULL pointer dereference in
sd_revalidate_disk" or this one, script easily crash kernel within
a few seconds. With both patches applied I do not observe crash.
Unfortunately after some time (dozen of minutes), script will hung in:
[ 1563.906432] [<c08354f5>] schedule_timeout_uninterruptible+0x15/0x20
[ 1563.906437] [<c04532d5>] msleep+0x15/0x20
[ 1563.906443] [<c05d60b2>] blk_drain_queue+0x32/0xd0
[ 1563.906447] [<c05d6e00>] blk_cleanup_queue+0xd0/0x170
[ 1563.906454] [<c06d278f>] scsi_free_queue+0x3f/0x60
[ 1563.906459] [<c06d7e6e>] __scsi_remove_device+0x6e/0xb0
[ 1563.906463] [<c06d4aff>] scsi_forget_host+0x4f/0x60
[ 1563.906468] [<c06cd84a>] scsi_remove_host+0x5a/0xf0
[ 1563.906482] [<f7f030fb>] quiesce_and_remove_host+0x5b/0xa0 [usb_storage]
[ 1563.906490] [<f7f03203>] usb_stor_disconnect+0x13/0x20 [usb_storage]
Anyway I think this patch is some step forward.
As drawback, I do not teardown on sysfs file create error, because I do
not know how to nullify disk->ev (since it can be used). However add_disk
error handling practically does not exist too, and things will work
without this sysfs file, except events will not be exported to user
space.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since 2.6.39 (1196f8b), when a driver returns -ENOMEDIUM for open(),
__blkdev_get() calls rescan_partitions() to remove
in-kernel partition structures and raise KOBJ_CHANGE uevent.
However it ends up calling driver's revalidate_disk without open
and could cause oops.
In the case of SCSI:
process A process B
----------------------------------------------
sys_open
__blkdev_get
sd_open
returns -ENOMEDIUM
scsi_remove_device
<scsi_device torn down>
rescan_partitions
sd_revalidate_disk
<oops>
Oopses are reported here:
http://marc.info/?l=linux-scsi&m=132388619710052
This patch separates the partition invalidation from rescan_partitions()
and use it for -ENOMEDIUM case.
Reported-by: Huajun Li <huajun.li.lee@gmail.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
From: Ben Hutchings <ben@decadent.org.uk>
Extended VBLKs (those larger than the preset VBLK size) are divided
into fragments, each with its own VBLK header. Our LDM implementation
generally assumes that each VBLK is contiguous in memory, so these
fragments must be assembled before further processing.
Currently the reassembly seems to be done quite wrongly - no VBLK
header is copied into the contiguous buffer, and the length of the
header is subtracted twice from each fragment. Also the total
length of the reassembled VBLK is calculated incorrectly.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Anton Altaparmakov <anton@tuxera.com>