[ Upstream commit f02b2320b27c16b644691267ee3b5c110846f49e ]
The mutex_destroy only makes sense when enable DEBUG_MUTEX. For the
good readbility, it's better to invoke it in exit func when the init
func invokes mutex_init.
Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Acked-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 0e74aa1d79a5bbc663e03a2804399cae418a0321 ]
The syzbot found an ancient bug in the IPsec code. When we cloned
a socket policy (for example, for a child TCP socket derived from a
listening socket), we did not copy the family field. This results
in a live policy with a zero family field. This triggers a BUG_ON
check in the af_key code when the cloned policy is retrieved.
This patch fixes it by copying the family field over.
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e39d5246111399dbc6e11cd39fd8580191b86c47 ]
Now when creating fnhe for redirect, it sets fnhe_expires for this
new route cache. But when updating the exist one, it doesn't do it.
It will cause this fnhe never to be expired.
Paolo already noticed it before, in Jianlin's test case, it became
even worse:
When ip route flush cache, the old fnhe is not to be removed, but
only clean it's members. When redirect comes again, this fnhe will
be found and updated, but never be expired due to fnhe_expires not
being set.
So fix it by simply updating fnhe_expires even it's for redirect.
Fixes: aee06da672 ("ipv4: use seqlock for nh_exceptions")
Reported-by: Jianlin Shi <jishi@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit cebe84c6190d741045a322f5343f717139993c08 ]
Now when ip route flush cache and it turn out all fnhe_genid != genid.
If a redirect/pmtu icmp packet comes and the old fnhe is found and all
it's members but fnhe_genid will be updated.
Then next time when it looks up route and tries to rebind this fnhe to
the new dst, the fnhe will be flushed due to fnhe_genid != genid. It
causes this redirect/pmtu icmp packet acutally not to be applied.
This patch is to also reset fnhe_genid when updating a route cache.
Fixes: 5aad1de5ea2c ("ipv4: use separate genid for next hop exceptions")
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 05ab86c5 (xfrm4: Invalidate all ipv4 routes on
IPsec pmtu events). Flushing all cached entries is not needed.
Instead, invalidate only the related next hop dsts to recheck for
the added next hop exception where needed. This also fixes a subtle
race due to bumping generation id's before updating the pmtu.
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 13d82bf5 (ipv4: Fix flushing of cached routing informations)
added the support to flush learned pmtu information.
However, using rt_genid is quite heavy as it is bumped on route
add/change and multicast events amongst other places. These can
happen quite often, especially if using dynamic routing protocols.
While this is ok with routes (as they are just recreated locally),
the pmtu information is learned from remote systems and the icmp
notification can come with long delays. It is worthy to have separate
genid to avoid excessive pmtu resets.
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
[ Upstream commit 8d63bee643f1fb53e472f0e135cae4eb99d62d19 ]
skb_warn_bad_offload triggers a warning when an skb enters the GSO
stack at __skb_gso_segment that does not have CHECKSUM_PARTIAL
checksum offload set.
Commit b2504a5dbef3 ("net: reduce skb_warn_bad_offload() noise")
observed that SKB_GSO_DODGY producers can trigger the check and
that passing those packets through the GSO handlers will fix it
up. But, the software UFO handler will set ip_summed to
CHECKSUM_NONE.
When __skb_gso_segment is called from the receive path, this
triggers the warning again.
Make UFO set CHECKSUM_UNNECESSARY instead of CHECKSUM_NONE. On
Tx these two are equivalent. On Rx, this better matches the
skb state (checksum computed), as CHECKSUM_NONE here means no
checksum computed.
See also this thread for context:
http://patchwork.ozlabs.org/patch/799015/
Fixes: b2504a5dbef3 ("net: reduce skb_warn_bad_offload() noise")
Change-Id: I80567d5c09be96b7c329aeb46b81985dd92b1761
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 637fdbae60d6cb9f6e963c1079d7e0445c86ff7d ]
If queue_delayed_work() gets called with NULL @wq, the kernel will
oops asynchronuosly on timer expiration which isn't too helpful in
tracking down the offender. This actually happened with smc.
__queue_delayed_work() already does several input sanity checks
synchronously. Add NULL @wq check.
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Link: http://lkml.kernel.org/r/20170227171439.jshx3qplflyrgcv7@codemonkey.org.uk
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 4dca6ea1d9432052afb06baf2e3ae78188a4410b upstream.
When the request_key() syscall is not passed a destination keyring, it
links the requested key (if constructed) into the "default" request-key
keyring. This should require Write permission to the keyring. However,
there is actually no permission check.
This can be abused to add keys to any keyring to which only Search
permission is granted. This is because Search permission allows joining
the keyring. keyctl_set_reqkey_keyring(KEY_REQKEY_DEFL_SESSION_KEYRING)
then will set the default request-key keyring to the session keyring.
Then, request_key() can be used to add keys to the keyring.
Both negatively and positively instantiated keys can be added using this
method. Adding negative keys is trivial. Adding a positive key is a
bit trickier. It requires that either /sbin/request-key positively
instantiates the key, or that another thread adds the key to the process
keyring at just the right time, such that request_key() misses it
initially but then finds it in construct_alloc_key().
Fix this bug by checking for Write permission to the keyring in
construct_get_dest_keyring() when the default keyring is being used.
We don't do the permission check for non-default keyrings because that
was already done by the earlier call to lookup_user_key(). Also,
request_key_and_link() is currently passed a 'struct key *' rather than
a key_ref_t, so the "possessed" bit is unavailable.
We also don't do the permission check for the "requestor keyring", to
continue to support the use case described by commit 8bbf4976b5
("KEYS: Alter use of key instantiation link-to-keyring argument") where
/sbin/request-key recursively calls request_key() to add keys to the
original requestor's destination keyring. (I don't know of any users
who actually do that, though...)
Fixes: 3e30148c3d ("[PATCH] Keys: Make request-key create an authorisation key")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 6ea8d958a2c95a1d514015d4e29ba21a8c0a1a91 upstream.
MADVISE_WILLNEED has always been a noop for DAX (formerly XIP) mappings.
Unfortunately madvise_willneed() doesn't communicate this information
properly to the generic madvise syscall implementation. The calling
convention is quite subtle there. madvise_vma() is supposed to either
return an error or update &prev otherwise the main loop will never
advance to the next vma and it will keep looping for ever without a way
to get out of the kernel.
It seems this has been broken since introduction. Nobody has noticed
because nobody seems to be using MADVISE_WILLNEED on these DAX mappings.
[mhocko@suse.com: rewrite changelog]
Link: http://lkml.kernel.org/r/20171127115318.911-1-guoxuenan@huawei.com
Fixes: fe77ba6f4f ("[PATCH] xip: madvice/fadvice: execute in place")
Signed-off-by: chenjie <chenjie6@huawei.com>
Signed-off-by: guoxuenan <guoxuenan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: zhangyi (F) <yi.zhang@huawei.com>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Although the arm64 vDSO is cleanly separated by code/data with the
code being read-only in userspace mappings, the code page is still
writable from the kernel. There have been exploits (such as
http://itszn.com/blog/?p=21) that take advantage of this on x86 to go
from a bad kernel write to full root.
Prevent this specific exploit on arm64 by putting the vDSO code page
in read-only memory as well.
Before the change:
[ 3.138366] vdso: 2 pages (1 code @ ffffffc000a71000, 1 data @ ffffffc000a70000)
---[ Kernel Mapping ]---
0xffffffc000000000-0xffffffc000082000 520K RW NX SHD AF UXN MEM/NORMAL
0xffffffc000082000-0xffffffc000200000 1528K ro x SHD AF UXN MEM/NORMAL
0xffffffc000200000-0xffffffc000800000 6M ro x SHD AF BLK UXN MEM/NORMAL
0xffffffc000800000-0xffffffc0009b6000 1752K ro x SHD AF UXN MEM/NORMAL
0xffffffc0009b6000-0xffffffc000c00000 2344K RW NX SHD AF UXN MEM/NORMAL
0xffffffc000c00000-0xffffffc008000000 116M RW NX SHD AF BLK UXN MEM/NORMAL
0xffffffc00c000000-0xffffffc07f000000 1840M RW NX SHD AF BLK UXN MEM/NORMAL
0xffffffc800000000-0xffffffc840000000 1G RW NX SHD AF BLK UXN MEM/NORMAL
0xffffffc840000000-0xffffffc87ae00000 942M RW NX SHD AF BLK UXN MEM/NORMAL
0xffffffc87ae00000-0xffffffc87ae70000 448K RW NX SHD AF UXN MEM/NORMAL
0xffffffc87af80000-0xffffffc87af8a000 40K RW NX SHD AF UXN MEM/NORMAL
0xffffffc87af8b000-0xffffffc87b000000 468K RW NX SHD AF UXN MEM/NORMAL
0xffffffc87b000000-0xffffffc87fe00000 78M RW NX SHD AF BLK UXN MEM/NORMAL
0xffffffc87fe00000-0xffffffc87ff50000 1344K RW NX SHD AF UXN MEM/NORMAL
0xffffffc87ff90000-0xffffffc87ffa0000 64K RW NX SHD AF UXN MEM/NORMAL
0xffffffc87fff0000-0xffffffc880000000 64K RW NX SHD AF UXN MEM/NORMAL
After:
[ 3.138368] vdso: 2 pages (1 code @ ffffffc0006de000, 1 data @ ffffffc000a74000)
---[ Kernel Mapping ]---
0xffffffc000000000-0xffffffc000082000 520K RW NX SHD AF UXN MEM/NORMAL
0xffffffc000082000-0xffffffc000200000 1528K ro x SHD AF UXN MEM/NORMAL
0xffffffc000200000-0xffffffc000800000 6M ro x SHD AF BLK UXN MEM/NORMAL
0xffffffc000800000-0xffffffc0009b8000 1760K ro x SHD AF UXN MEM/NORMAL
0xffffffc0009b8000-0xffffffc000c00000 2336K RW NX SHD AF UXN MEM/NORMAL
0xffffffc000c00000-0xffffffc008000000 116M RW NX SHD AF BLK UXN MEM/NORMAL
0xffffffc00c000000-0xffffffc07f000000 1840M RW NX SHD AF BLK UXN MEM/NORMAL
0xffffffc800000000-0xffffffc840000000 1G RW NX SHD AF BLK UXN MEM/NORMAL
0xffffffc840000000-0xffffffc87ae00000 942M RW NX SHD AF BLK UXN MEM/NORMAL
0xffffffc87ae00000-0xffffffc87ae70000 448K RW NX SHD AF UXN MEM/NORMAL
0xffffffc87af80000-0xffffffc87af8a000 40K RW NX SHD AF UXN MEM/NORMAL
0xffffffc87af8b000-0xffffffc87b000000 468K RW NX SHD AF UXN MEM/NORMAL
0xffffffc87b000000-0xffffffc87fe00000 78M RW NX SHD AF BLK UXN MEM/NORMAL
0xffffffc87fe00000-0xffffffc87ff50000 1344K RW NX SHD AF UXN MEM/NORMAL
0xffffffc87ff90000-0xffffffc87ffa0000 64K RW NX SHD AF UXN MEM/NORMAL
0xffffffc87fff0000-0xffffffc880000000 64K RW NX SHD AF UXN MEM/NORMAL
Inspired by https://lkml.org/lkml/2016/1/19/494 based on work by the
PaX Team, Brad Spengler, and Kees Cook.
Signed-off-by: David Brown <david.brown@linaro.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
[catalin.marinas@arm.com: removed superfluous __PAGE_ALIGNED_DATA]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Change-Id: Ib15f3b359e3a943d7a5e8793aa590bb2d0a589ba
Signed-off-by: anupritaisno1 <www.anuprita804@gmail.com>
commit 16af97dc5a8975371a83d9e30a64038b48f40a2d upstream.
Patch series "fixes of TLB batching races", v6.
It turns out that Linux TLB batching mechanism suffers from various
races. Races that are caused due to batching during reclamation were
recently handled by Mel and this patch-set deals with others. The more
fundamental issue is that concurrent updates of the page-tables allow
for TLB flushes to be batched on one core, while another core changes
the page-tables. This other core may assume a PTE change does not
require a flush based on the updated PTE value, while it is unaware that
TLB flushes are still pending.
This behavior affects KSM (which may result in memory corruption) and
MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior). A
proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
Memory corruption in KSM is harder to produce in practice, but was
observed by hacking the kernel and adding a delay before flushing and
replacing the KSM page.
Finally, there is also one memory barrier missing, which may affect
architectures with weak memory model.
This patch (of 7):
Setting and clearing mm->tlb_flush_pending can be performed by multiple
threads, since mmap_sem may only be acquired for read in
task_numa_work(). If this happens, tlb_flush_pending might be cleared
while one of the threads still changes PTEs and batches TLB flushes.
This can lead to the same race between migration and
change_protection_range() that led to the introduction of
tlb_flush_pending. The result of this race was data corruption, which
means that this patch also addresses a theoretically possible data
corruption.
An actual data corruption was not observed, yet the race was was
confirmed by adding assertion to check tlb_flush_pending is not set by
two threads, adding artificial latency in change_protection_range() and
using sysctl to reduce kernel.numa_balancing_scan_delay_ms.
Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
Fixes: 20841405940e ("mm: fix TLB flush race between migration, and
change_protection_range")
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.16:
- Drop change to dump_mm()
- Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 73bb059f9b8a00c5e1bf2f7ca83138c05d05e600 upstream.
The point of sched_group_mask is to select those CPUs from
sched_group_cpus that can actually arrive at this balance domain.
The current code gets it wrong, as can be readily demonstrated with a
topology like:
node 0 1 2 3
0: 10 20 30 20
1: 20 10 20 30
2: 30 20 10 20
3: 20 30 20 10
Where (for example) domain 1 on CPU1 ends up with a mask that includes
CPU0:
[] CPU1 attaching sched-domain:
[] domain 0: span 0-2 level NUMA
[] groups: 1 (mask: 1), 2, 0
[] domain 1: span 0-3 level NUMA
[] groups: 0-2 (mask: 0-2) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)
This causes sched_balance_cpu() to compute the wrong CPU and
consequently should_we_balance() will terminate early resulting in
missed load-balance opportunities.
The fixed topology looks like:
[] CPU1 attaching sched-domain:
[] domain 0: span 0-2 level NUMA
[] groups: 1 (mask: 1), 2, 0
[] domain 1: span 0-3 level NUMA
[] groups: 0-2 (mask: 1) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)
(note: this relies on OVERLAP domains to always have children, this is
true because the regular topology domains are still here -- this is
before degenerate trimming)
Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Fixes: e3589f6c81 ("sched: Allow for overlapping sched_domain spans")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.16:
- Use span, not sg_span
- Adjust filename context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
[ Upstream commit 2b7cda9c35d3b940eb9ce74b30bbd5eb30db493d ]
Based on SNMP values provided by Roman, Yuchung made the observation
that some crashes in tcp_sacktag_walk() might be caused by MTU probing.
Looking at tcp_mtu_probe(), I found that when a new skb was placed
in front of the write queue, we were not updating tcp highest sack.
If one skb is freed because all its content was copied to the new skb
(for MTU probing), then tp->highest_sack could point to a now freed skb.
Bad things would then happen, including infinite loops.
This patch renames tcp_highest_sack_combine() and uses it
from tcp_mtu_probe() to fix the bug.
Note that I also removed one test against tp->sacked_out,
since we want to replace tp->highest_sack regardless of whatever
condition, since keeping a stale pointer to freed skb is a recipe
for disaster.
Fixes: a47e5a988a ("[TCP]: Convert highest_sack to sk_buff to allow direct access")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Roman Gushchin <guro@fb.com>
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9b3dc0a17d7388c4fb83736ca45253a93e994ce4 upstream.
This fixes a counter problem on 32bit systems:
When the rx_bytes counter reached 2 GiB, it jumpd to (2^64 Bytes - 2GiB) Bytes.
rtnl_link_stats64 has __u64 type and atomic_long_read returns
atomic_long_t which is signed. Due to the conversation
we get an incorrect value on 32bit systems if the MSB of
the atomic_long_t value is set.
CC: Tom Parkin <tparkin@katalix.com>
Fixes: 7b7c0719cd ("l2tp: avoid deadlock in l2tp stats update")
Signed-off-by: Dominik Heidler <dheidler@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 321a52a39189d5e4af542f7dcdc07bba4545cf5d upstream.
pppol2tp_getsockopt() doesn't take into account the error code returned
by pppol2tp_tunnel_getsockopt() or pppol2tp_session_getsockopt(). If
error occurs there, pppol2tp_getsockopt() continues unconditionally and
reports erroneous values.
Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 364700cf8fd54f54ad08313464105a414e3bccb7 upstream.
pppol2tp_setsockopt() unconditionally overwrites the error value
returned by pppol2tp_tunnel_setsockopt() or
pppol2tp_session_setsockopt(), thus hiding errors from userspace.
Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 57377d63547861919ee634b845c7caa38de4a452 upstream.
Holding a reference on session is required before calling
pppol2tp_session_ioctl(). The session could get freed while processing the
ioctl otherwise. Since pppol2tp_session_ioctl() uses the session's socket,
we also need to take a reference on it in l2tp_session_get().
Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 4ac36a4adaf80013a60013d6f829f5863d5d0e05 upstream.
If 'tunnel' is NULL we should return -EBADF but the 'end_put_sess' path
unconditionally sets 'error' back to zero. Rework the error path so it
more closely matches pppol2tp_sendmsg.
Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 8009d506a1dd00cf436b0c4cca0dcec130580a21 upstream.
The 'use' locking macros are no-ops if neither SMP or SND_DEBUG is
enabled. This might once have been OK in non-preemptible
configurations, but even in that case snd_seq_read() may sleep while
relying on a 'use' lock. So always use the proper implementations.
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 7c80f9e4a588f1925b07134bb2e3689335f6c6d8 upstream.
If the usbtest driver encounters a device with an IN bulk endpoint but
no OUT bulk endpoint, it will try to dereference a NULL pointer
(out->desc.bEndpointAddress). The problem can be solved by adding a
missing test.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 4971613c1639d8e5f102c4e797c3bf8f83a5a69e upstream.
Once a socket has po->fanout set, it remains a member of the group
until it is destroyed. The prot_hook must be constant and identical
across sockets in the group.
If fanout_add races with packet_do_bind between the test of po->fanout
and taking the lock, the bind call may make type or dev inconsistent
with that of the fanout group.
Hold po->bind_lock when testing po->fanout to avoid this race.
I had to introduce artificial delay (local_bh_enable) to actually
observe the race.
Fixes: dc99f60069 ("packet: Add fanout support.")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 30f7ea1c2b5f5fb7462c5ae44fe2e40cb2d6a474 upstream.
There is a race conditions between packet_notifier and packet_bind{_spkt}.
It happens if packet_notifier(NETDEV_UNREGISTER) executes between the
time packet_bind{_spkt} takes a reference on the new netdevice and the
time packet_do_bind sets po->ifindex.
In this case the notification can be missed.
If this happens during a dev_change_net_namespace this can result in the
netdevice to be moved to the new namespace while the packet_sock in the
old namespace still holds a reference on it. When the netdevice is later
deleted in the new namespace the deletion hangs since the packet_sock
is not found in the new namespace' &net->packet.sklist.
It can be reproduced with the script below.
This patch makes packet_do_bind check again for the presence of the
netdevice in the packet_sock's namespace after the synchronize_net
in unregister_prot_hook.
More in general it also uses the rcu lock for the duration of the bind
to stop dev_change_net_namespace/rollback_registered_many from
going past the synchronize_net following unlist_netdevice, so that
no NETDEV_UNREGISTER notifications can happen on the new netdevice
while the bind is executing. In order to do this some code from
packet_bind{_spkt} is consolidated into packet_do_dev.
import socket, os, time, sys
proto=7
realDev='em1'
vlanId=400
if len(sys.argv) > 1:
vlanId=int(sys.argv[1])
dev='vlan%d' % vlanId
os.system('taskset -p 0x10 %d' % os.getpid())
s = socket.socket(socket.PF_PACKET, socket.SOCK_RAW, proto)
os.system('ip link add link %s name %s type vlan id %d' %
(realDev, dev, vlanId))
os.system('ip netns add dummy')
pid=os.fork()
if pid == 0:
# dev should be moved while packet_do_bind is in synchronize net
os.system('taskset -p 0x20000 %d' % os.getpid())
os.system('ip link set %s netns dummy' % dev)
os.system('ip netns exec dummy ip link del %s' % dev)
s.close()
sys.exit(0)
time.sleep(.004)
try:
s.bind(('%s' % dev, proto+1))
except:
print 'Could not bind socket'
s.close()
os.system('ip netns del dummy')
sys.exit(0)
os.waitpid(pid, 0)
s.close()
os.system('ip netns del dummy')
sys.exit(0)
Change-Id: Ib42387dc7fee29b4424fb6421831ebb46a2b2178
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2:
- Add the 'dev_curr' variable
- Drop the packet_cached_dev changes
- Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 60ff5b2f547af3828aebafd54daded44cfb0807a upstream.
Currently, when passed a key that already exists, add_key() will call the
key's ->update() method if such exists. But this is heavily broken in the
case where the key is uninstantiated because it doesn't call
__key_instantiate_and_link(). Consequently, it doesn't do most of the
things that are supposed to happen when the key is instantiated, such as
setting the instantiation state, clearing KEY_FLAG_USER_CONSTRUCT and
awakening tasks waiting on it, and incrementing key->user->nikeys.
It also never takes key_construction_mutex, which means that
->instantiate() can run concurrently with ->update() on the same key. In
the case of the "user" and "logon" key types this causes a memory leak, at
best. Maybe even worse, the ->update() methods of the "encrypted" and
"trusted" key types actually just dereference a NULL pointer when passed an
uninstantiated key.
Change key_create_or_update() to wait interruptibly for the key to finish
construction before continuing.
This patch only affects *uninstantiated* keys. For now we still allow a
negatively instantiated key to be updated (thereby positively
instantiating it), although that's broken too (the next patch fixes it)
and I'm not sure that anyone actually uses that functionality either.
Here is a simple reproducer for the bug using the "encrypted" key type
(requires CONFIG_ENCRYPTED_KEYS=y), though as noted above the bug
pertained to more than just the "encrypted" key type:
#include <stdlib.h>
#include <unistd.h>
#include <keyutils.h>
int main(void)
{
int ringid = keyctl_join_session_keyring(NULL);
if (fork()) {
for (;;) {
const char payload[] = "update user:foo 32";
usleep(rand() % 10000);
add_key("encrypted", "desc", payload, sizeof(payload), ringid);
keyctl_clear(ringid);
}
} else {
for (;;)
request_key("encrypted", "desc", "callout_info", ringid);
}
}
It causes:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: encrypted_update+0xb0/0x170
PGD 7a178067 P4D 7a178067 PUD 77269067 PMD 0
PREEMPT SMP
CPU: 0 PID: 340 Comm: reproduce Tainted: G D 4.14.0-rc1-00025-g428490e38b2e #796
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff8a467a39a340 task.stack: ffffb15c40770000
RIP: 0010:encrypted_update+0xb0/0x170
RSP: 0018:ffffb15c40773de8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8a467a275b00 RCX: 0000000000000000
RDX: 0000000000000005 RSI: ffff8a467a275b14 RDI: ffffffffb742f303
RBP: ffffb15c40773e20 R08: 0000000000000000 R09: ffff8a467a275b17
R10: 0000000000000020 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff8a4677057180 R15: ffff8a467a275b0f
FS: 00007f5d7fb08700(0000) GS:ffff8a467f200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000018 CR3: 0000000077262005 CR4: 00000000001606f0
Call Trace:
key_create_or_update+0x2bc/0x460
SyS_add_key+0x10c/0x1d0
entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x7f5d7f211259
RSP: 002b:00007ffed03904c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000f8
RAX: ffffffffffffffda RBX: 000000003b2a7955 RCX: 00007f5d7f211259
RDX: 00000000004009e4 RSI: 00000000004009ff RDI: 0000000000400a04
RBP: 0000000068db8bad R08: 000000003b2a7955 R09: 0000000000000004
R10: 000000000000001a R11: 0000000000000246 R12: 0000000000400868
R13: 00007ffed03905d0 R14: 0000000000000000 R15: 0000000000000000
Code: 77 28 e8 64 34 1f 00 45 31 c0 31 c9 48 8d 55 c8 48 89 df 48 8d 75 d0 e8 ff f9 ff ff 85 c0 41 89 c4 0f 88 84 00 00 00 4c 8b 7d c8 <49> 8b 75 18 4c 89 ff e8 24 f8 ff ff 85 c0 41 89 c4 78 6d 49 8b
RIP: encrypted_update+0xb0/0x170 RSP: ffffb15c40773de8
CR2: 0000000000000018
Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 95d78c28b5a85bacbc29b8dba7c04babb9b0d467 upstream.
bio_map_user_iov and bio_unmap_user do unbalanced pages refcounting if
IO vector has small consecutive buffers belonging to the same page.
bio_add_pc_page merges them into one, but the page reference is never
dropped.
Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[bwh: Backported to 3.2: adjust filename]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 138e4ad67afd5c6c318b056b4d17c17f2c0ca5c0 upstream.
The race was introduced by me in commit 971316f050 ("epoll:
ep_unregister_pollwait() can use the freed pwq->whead"). I did not
realize that nothing can protect eventpoll after ep_poll_callback() sets
->whead = NULL, only whead->lock can save us from the race with
ep_free() or ep_remove().
Move ->whead = NULL to the end of ep_poll_callback() and add the
necessary barriers.
TODO: cleanup the ewake/EPOLLEXCLUSIVE logic, it was confusing even
before this patch.
Hopefully this explains use-after-free reported by syzcaller:
BUG: KASAN: use-after-free in debug_spin_lock_before
...
_raw_spin_lock_irqsave+0x4a/0x60 kernel/locking/spinlock.c:159
ep_poll_callback+0x29f/0xff0 fs/eventpoll.c:1148
this is spin_lock(eventpoll->lock),
...
Freed by task 17774:
...
kfree+0xe8/0x2c0 mm/slub.c:3883
ep_free+0x22c/0x2a0 fs/eventpoll.c:865
Fixes: 971316f050 ("epoll: ep_unregister_pollwait() can use the freed pwq->whead")
Reported-by: 范龙飞 <long7573@126.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2:
- Use smp_mb() and ACCESS_ONCE() instead of smp_{load_acquire,store_release}()
- EPOLLEXCLUSIVE is not supported]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 30c31d746d0eb458ae327f522bc8e4c44cbea0f0 upstream.
It is very unlikely to happen but the backlogs memory allocation
could fail and will free q->flows, but then ->destroy() will free
q->flows too. For correctness remove the first free and let ->destroy
clean up.
Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.16: fq_codel used different alloc/free functions]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 3bdac362a2f89ed3e148fa6f38c5f5d858f50b1a upstream.
Depending on where ->init fails we can get a null pointer deref due to
uninitialized hires timer (watchdog) or a double free of the qdisc hash
because it is already freed by ->destroy().
Fixes: 8d5537387505 ("net/sched/hfsc: allocate tcf block for hfsc root class")
Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2: sch_hfsc doesn't use a tcf block]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit b339752d054fb32863418452dff350a1086885b1 upstream.
When !NUMA, cpumask_of_node(@node) equals cpu_online_mask regardless of
@node. The assumption seems that if !NUMA, there shouldn't be more than
one node and thus reporting cpu_online_mask regardless of @node is
correct. However, that assumption was broken years ago to support
DISCONTIGMEM and whether a system has multiple nodes or not is
separately controlled by NEED_MULTIPLE_NODES.
This means that, on a system with !NUMA && NEED_MULTIPLE_NODES,
cpumask_of_node() will report cpu_online_mask for all possible nodes,
indicating that the CPUs are associated with multiple nodes which is an
impossible configuration.
This bug has been around forever but doesn't look like it has caused any
noticeable symptoms. However, it triggers a WARN recently added to
workqueue to verify NUMA affinity configuration.
Fix it by reporting empty cpumask on non-zero nodes if !NUMA.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 4e587ea71bf924f7dac621f1351653bd41e446cb upstream.
Commit c5cff8561d2d adds rcu grace period before freeing fib6_node. This
generates a new sparse warning on rt->rt6i_node related code:
net/ipv6/route.c:1394:30: error: incompatible types in comparison
expression (different address spaces)
./include/net/ip6_fib.h:187:14: error: incompatible types in comparison
expression (different address spaces)
This commit adds "__rcu" tag for rt6i_node and makes sure corresponding
rcu API is used for it.
After this fix, sparse no longer generates the above warning.
Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node")
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.16: drop changes in rt6_cache_allowed_for_pmtu()]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit e702c1204eb57788ef189c839c8c779368267d70 upstream.
Use l2tp_tunnel_get() to retrieve tunnel, so that it can't go away on
us. Otherwise l2tp_tunnel_destruct() might release the last reference
count concurrently, thus freeing the tunnel while we're using it.
Fixes: 309795f4be ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit af87ae465abdc070de0dc35d6c6a9e7a8cd82987 upstream.
There's no point in checking for duplicate sessions at the beginning of
l2tp_nl_cmd_session_create(); the ->session_create() callbacks already
return -EEXIST when the session already exists.
Furthermore, even if l2tp_session_find() returns NULL, a new session
might be created right after the test. So relying on ->session_create()
to avoid duplicate session is the only sane behaviour.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2: also delete the now-unused local variable]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 4e4b21da3acc68a7ea55f850cacc13706b7480e9 upstream.
Use l2tp_tunnel_get() instead of l2tp_tunnel_find() so that we get
a reference on the tunnel, preventing l2tp_tunnel_destruct() from
freeing it from under us.
Also move l2tp_tunnel_get() below nlmsg_new() so that we only take
the reference when needed.
Fixes: 309795f4be ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 8c0e421525c9eb50d68e8f633f703ca31680b746 upstream.
We need to make sure the tunnel is not going to be destroyed by
l2tp_tunnel_destruct() concurrently.
Fixes: 309795f4be ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit bb0a32ce4389e17e47e198d2cddaf141561581ad upstream.
l2tp_nl_cmd_tunnel_delete() needs to take a reference on the tunnel, to
prevent it from being concurrently freed by l2tp_tunnel_destruct().
Fixes: 309795f4be ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 54652eb12c1b72e9602d09cb2821d5760939190f upstream.
l2tp_tunnel_find() doesn't take a reference on the returned tunnel.
Therefore, it's unsafe to use it because the returned tunnel can go
away on us anytime.
Fix this by defining l2tp_tunnel_get(), which works like
l2tp_tunnel_find(), but takes a reference on the returned tunnel.
Caller then has to drop this reference using l2tp_tunnel_dec_refcount().
As l2tp_tunnel_dec_refcount() needs to be moved to l2tp_core.h, let's
simplify the patch and not move the L2TP_REFCNT_DEBUG part. This code
has been broken (not even compiling) in May 2012 by
commit a4ca44fa57 ("net: l2tp: Standardize logging styles")
and fixed more than two years later by
commit 29abe2fda54f ("l2tp: fix missing line continuation"). So it
doesn't appear to be used by anyone.
Same thing for l2tp_tunnel_free(); instead of moving it to l2tp_core.h,
let's just simplify things and call kfree_rcu() directly in
l2tp_tunnel_dec_refcount(). Extra assertions and debugging code
provided by l2tp_tunnel_free() didn't help catching any of the
reference counting and socket handling issues found while working on
this series.
Fixes: 309795f4be ("l2tp: Add netlink control API for L2TP")
Change-Id: I28072c45e84473121a40fc47325e131939002d2d
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.16: keep using atomic_t functions]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 9aaef50c44f132e040dcd7686c8e78a3390037c5 upstream.
Make l2tp_pernet()'s parameter constant, so that l2tp_session_get*() can
declare their "net" variable as "const".
Also constify "ifname" in l2tp_session_get_by_ifname().
Change-Id: Ic587ccd7eab92a1b419f1b4e82a9d30456b87a68
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 2777e2ab5a9cf2b4524486c6db1517a6ded25261 upstream.
Callers of l2tp_nl_session_find() need to hold a reference on the
returned session since there's no guarantee that it isn't going to
disappear from under them.
Relying on the fact that no l2tp netlink message may be processed
concurrently isn't enough: sessions can be deleted by other means
(e.g. by closing the PPPOL2TP socket of a ppp pseudowire).
l2tp_nl_cmd_session_delete() is a bit special: it runs a callback
function that may require a previous call to session->ref(). In
particular, for ppp pseudowires, the callback is l2tp_session_delete(),
which then calls pppol2tp_session_close() and dereferences the PPPOL2TP
socket. The socket might already be gone at the moment
l2tp_session_delete() calls session->ref(), so we need to take a
reference during the session lookup. So we need to pass the do_ref
variable down to l2tp_session_get() and l2tp_session_get_by_ifname().
Since all callers have to be updated, l2tp_session_find_by_ifname() and
l2tp_nl_session_find() are renamed to reflect their new behaviour.
Fixes: 309795f4be ("l2tp: Add netlink control API for L2TP")
Change-Id: I0a02a9eba9b948e62f976f3c100cb4f688991828
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 61b9a047729bb230978178bca6729689d0c50ca2 upstream.
Taking a reference on sessions in l2tp_recv_common() is racy; this
has to be done by the callers.
To this end, a new function is required (l2tp_session_get()) to
atomically lookup a session and take a reference on it. Callers then
have to manually drop this reference.
Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Change-Id: I8f954b193338246ec46fb85ead2b34fcf98f5098
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 9ee369a405c57613d7c83a3967780c3e30c52ecc upstream.
Sessions must be fully initialised before calling
l2tp_session_add_to_tunnel(). Otherwise, there's a short time frame
where partially initialised sessions can be accessed by external users.
Fixes: dbdbc73b4478 ("l2tp: fix duplicate session creation")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.16: keep using l2tp_session_inc_refcount()]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>