Commit Graph

446661 Commits

Author SHA1 Message Date
Davidlohr Bueso b6d5307265 mm,vmacache: add debug data
Introduce a CONFIG_DEBUG_VM_VMACACHE option to enable counting the cache
hit rate -- exported in /proc/vmstat.

Any update to the caching scheme needs this kind of data, so having it in
place saves the work of re-implementing the counting each time.
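A minimal sketch of the config-gated counting this adds; the wrapper and
event names below are illustrative rather than the exact symbols in the
patch:

    /* Only pay for the bookkeeping when CONFIG_DEBUG_VM_VMACACHE is set. */
    #ifdef CONFIG_DEBUG_VM_VMACACHE
    #define count_vm_vmacache_event(x) count_vm_event(x)  /* shows up in /proc/vmstat */
    #else
    #define count_vm_vmacache_event(x) do {} while (0)    /* compiled out */
    #endif

The find path then counts one event per lookup and one per hit, so the hit
rate can be read straight out of /proc/vmstat.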

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-27 22:08:07 +02:00
Linus Torvalds 3dbd0d29e6 mm: don't pointlessly use BUG_ON() for sanity check
BUG_ON() is a big hammer, and should be used _only_ if there is some
major corruption that you cannot possibly recover from, making it
imperative that the current process (and possibly the whole machine) be
terminated with extreme prejudice.

The trivial sanity check in the vmacache code is *not* such a fatal
error.  Recovering from it is absolutely trivial, and using BUG_ON()
just makes it harder to debug for no actual advantage.

To make matters worse, the placement of the BUG_ON() (only if the range
check matched) actually makes it harder to hit the sanity check to begin
with, so _if_ there is a bug (and we just got a report from Srivatsa
Bhat that this can indeed trigger), it is harder to debug not just
because the machine is possibly dead, but because we don't have better
coverage.

BUG_ON() must *die*.  Maybe we should add a checkpatch warning for it,
because it is simply just about the worst thing you can ever do if you
hit some "this cannot happen" situation.
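For illustration only (this is the pattern being argued for, not the exact
patch): a recoverable sanity check treats the impossible state as a cache
miss and lets the caller fall back to the slow path, e.g.

    /* Hypothetical lookup: warn once and recover instead of killing the box. */
    if (WARN_ON_ONCE(vma->vm_mm != mm))
        return NULL;    /* ignore the bogus cache entry, walk the rbtree */

whereas a BUG_ON() in the same spot takes down the task (or machine) for a
condition that is trivially survivable.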

Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-27 22:08:06 +02:00
Davidlohr Bueso 7c1a95e0ae mm: per-thread vma caching
This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random,
so further comparison with other approaches was needed.  There are
two things to consider when dealing with this, the cache hit rate and
the latency of find_vma().  Improving the hit rate does not necessarily
translate into finding the vma any faster, as the overhead of a fancy
caching scheme can be too high to be worthwhile.

We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality.  On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.

The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number.  The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed.  Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question (a minimal
sketch of the scheme follows the results below).  Concretely, the
following results are seen on an 80-core, 8-socket x86-64 box:

1) System bootup: Most programs are single threaded, so the per-thread
   scheme improves on the ~50% baseline hit rate simply by adding a few
   more slots to the cache.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 50.61%   | 19.90            |
| patched        | 73.45%   | 13.58            |
+----------------+----------+------------------+

2) Kernel build: This one is already pretty good with the current
   approach as we're dealing with good locality.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 75.28%   | 11.03            |
| patched        | 88.09%   | 9.31             |
+----------------+----------+------------------+

3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 70.66%   | 17.14            |
| patched        | 91.15%   | 12.57            |
+----------------+----------+------------------+

4) Ebizzy: There's a fair amount of variation from run to run, but this
   approach always shows nearly perfect hit rates, while the baseline's are
   just about non-existent.  The cycle counts fluctuate anywhere from ~60 to
   ~116 billion for the baseline scheme, but this approach reduces them
   considerably.  For instance, with 80 threads:

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 1.06%    | 91.54            |
| patched        | 99.97%   | 14.18            |
+----------------+----------+------------------+
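A minimal user-space sketch of the scheme described above, purely for
illustration; the slot count, type names and helpers are made up here and
are not the kernel's actual vmacache API:

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SHIFT      12
    #define VMACACHE_BITS   2
    #define VMACACHE_SIZE   (1u << VMACACHE_BITS)  /* a handful of slots per thread */

    struct vma { unsigned long start, end; };

    struct mm {                      /* shared address space */
        uint32_t seqnum;             /* bumped on every vma change */
    };

    struct thread_vmacache {         /* lives in each task, no locking needed */
        uint32_t seqnum;             /* snapshot of mm->seqnum at last fill */
        struct vma *slot[VMACACHE_SIZE];
    };

    /* Replacement/lookup policy: index by the page number of the address. */
    static unsigned int vmacache_hash(unsigned long addr)
    {
        return (addr >> PAGE_SHIFT) & (VMACACHE_SIZE - 1);
    }

    /* Invalidation is just a sequence bump; stale per-thread copies stop
     * matching.  On the rare 32-bit overflow, every cache sharing the mm
     * would be flushed. */
    static void vmacache_invalidate(struct mm *mm)
    {
        mm->seqnum++;
    }

    static struct vma *vmacache_find(struct mm *mm, struct thread_vmacache *tc,
                                     unsigned long addr)
    {
        struct vma *vma;

        if (tc->seqnum != mm->seqnum)    /* cache is stale: treat as a miss */
            return NULL;

        vma = tc->slot[vmacache_hash(addr)];
        if (vma && vma->start <= addr && addr < vma->end)
            return vma;
        return NULL;                     /* caller falls back to the rbtree walk */
    }

    static void vmacache_update(struct mm *mm, struct thread_vmacache *tc,
                                unsigned long addr, struct vma *vma)
    {
        if (tc->seqnum != mm->seqnum) {  /* resync before refilling */
            for (size_t i = 0; i < VMACACHE_SIZE; i++)
                tc->slot[i] = NULL;
            tc->seqnum = mm->seqnum;
        }
        tc->slot[vmacache_hash(addr)] = vma;
    }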

[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-27 22:08:06 +02:00
Sean Tranchetti 0d2f1604f9 af_key: unconditionally clone on broadcast
Attempting to avoid cloning the skb when broadcasting by inflating
the refcount with sock_hold/sock_put while under RCU lock is dangerous
and violates RCU principles. It leads to subtle race conditions when
attempting to free the SKB, as we may reference sockets that have
already been freed by the stack.

Unable to handle kernel paging request at virtual address 6b6b6b6b6b6c4b
[006b6b6b6b6b6c4b] address between user and kernel address ranges
Internal error: Oops: 96000004 [#1] PREEMPT SMP
task: fffffff78f65b380 task.stack: ffffff8049a88000
pc : sock_rfree+0x38/0x6c
lr : skb_release_head_state+0x6c/0xcc
Process repro (pid: 7117, stack limit = 0xffffff8049a88000)
Call trace:
	sock_rfree+0x38/0x6c
	skb_release_head_state+0x6c/0xcc
	skb_release_all+0x1c/0x38
	__kfree_skb+0x1c/0x30
	kfree_skb+0xd0/0xf4
	pfkey_broadcast+0x14c/0x18c
	pfkey_sendmsg+0x1d8/0x408
	sock_sendmsg+0x44/0x60
	___sys_sendmsg+0x1d0/0x2a8
	__sys_sendmsg+0x64/0xb4
	SyS_sendmsg+0x34/0x4c
	el0_svc_naked+0x34/0x38
Kernel panic - not syncing: Fatal exception
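A rough sketch of the safer pattern the message describes: clone the skb for
each listener while walking the pfkey socket list under RCU, instead of
taking extra references tied to the original skb (illustrative only, not the
actual af_key code):

    /* Deliver an independent copy to one listener; the original skb and its
     * owning socket are never shared across the RCU walk. */
    static int pfkey_broadcast_one(struct sk_buff *skb, gfp_t allocation,
                                   struct sock *sk)
    {
        struct sk_buff *copy = skb_clone(skb, allocation);

        if (!copy)
            return -ENOMEM;
        if (sock_queue_rcv_skb(sk, copy)) {
            kfree_skb(copy);    /* listener's queue is full, drop the copy */
            return -ENOBUFS;
        }
        return 0;
    }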

CRs-Fixed: 2251019
Change-Id: Ib3b01f941a34a7df61fe9445f746b7df33f4656a
Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
2019-07-27 22:08:06 +02:00
syphyr f5cf52ee67 qcacld-2.0: Zero out context buffer in HDD
This is a backport of:
"qcacld-2.0: Dump Snapshot of the driver for LL"
2019-07-27 22:08:05 +02:00
syphyr cc2c36e5a0 qcacld-2.0: core: Replace remaining instances of unadorned %p
Replace instances of unadorned %p in CORE.
2019-07-27 22:08:05 +02:00
Jeff Johnson 5ce25b0d78 qcacld-2.0: hdd: Replace instances of unadorned %p
Replace instances of unadorned %p in CORE/HDD.

Change-Id: I32b89aaf6a8b1ca3177e0c1cb5cec5fbc5f5294a
CRs-Fixed: 2111273
2019-07-27 22:08:04 +02:00
Nirav Shah 52b90c2dd2 qcacld-2.0: Add debug logs
Add debug logs in WLANTL_RegisterSTAClient and
WLANTL_ClearSTAClient.

CRs-Fixed: 1036774
Change-Id: I70f19731e576c65432919588348c19ccbf7bca61
2019-07-27 22:08:04 +02:00
Govind Singh 74ddae80eb qcacld-2.0: Restore 802.11 header pointer for PMF case
In the PMF case, the CCMP header and trailer of rx management frames
are stripped out.  After stripping the security headers we were still
using the old 802.11 header pointer, which resulted in an invalid
dereference of the 802.11 header fields.
Restore the 802.11 header pointer after the security headers are
stripped out in the PMF case.

Change-Id: I6a26dbb0707b7981ea091526d1e49dc5bf8c9e91
CRs-Fixed: 1024097
2019-07-27 22:08:04 +02:00
Padma, Santhosh Kumar 2841d21310 qcacld-2.0: Send ESE beacon report if request is valid
prima to qcacld-2.0 propagation

Currently, if the connection is ESE and an RRM beacon request is received,
eseProcessBeaconReportXmit is invoked as part of sending the report, which
results in an error because there is no ESE request.  Add a check to invoke
eseProcessBeaconReportXmit only if the measurement request is valid.

Change-Id: I3fe6101b888c70670a371a1eb45b47d756511b1d
CRs-Fixed: 1002305
2019-07-27 22:08:03 +02:00
Himanshu Agarwal b410c951a4 qcacld-2.0: Fix static code analysis error
Fix static code analysis error in TLSHIM layer.

Change-Id: I81e5b7d5910919573b69faf7cfa3210eace9d6d4
CRs-Fixed: 1008197
2019-07-27 22:08:03 +02:00
Krishna Kumaar Natarajan 5d6242a3b5 qcacld-2.0: Fix layering violation while handling management frames
qcacld-3.0 to qcacld-2.0 propagation

Fix a layering violation while handling management frames.  Currently,
LIM data structures are accessed before dropping Assoc, Disassoc and
Deauth packets to avoid DoS attacks.  Since the LIM data structures
are accessed in a different thread context, the data present in them is
out of sync, resulting in a crash.

Fix the layering violation by doing the appropriate check in WMA instead
of in LIM.

Change-Id: I8876a4d4b99948cd9ab3ccec403cf5e4050b1cff
CRs-Fixed: 977773
2019-07-27 22:08:03 +02:00
zhangq cd3ae2df46 qcacld-2.0: Resolve memory leakage in OCB
The sme_utc buffer is not freed if posting the message to WDA/WMA fails.

Change-Id: Id91003198c2c06e45ec970cb9a23f4e8279220d4
CRs-Fixed: 1002063
2019-07-27 22:08:02 +02:00
Gao Wu cb7fb582c1 qcacld-2.0: update payload length of MGMT frame
There are some access points that do not include the capability field
in the RSN IE even though the RSN IE length indicates that the field is
present.  A workaround for this issue adds two default bytes as the RSN
capability, but it does not update the payload length.  This causes the
supplicant to get the wrong RSN capability and then a security mismatch
in the host driver when connecting to the AP.

Change-Id: I03ea3e293df8cbe545a70af03b1038b6fad5a261
CRs-Fixed: 993795
2019-07-27 22:08:02 +02:00
Himanshu Agarwal 3a6de60eca qcacld-2.0: Refactor intra bss forwarded packets count
Initially, when a packet is forwarded from the txrx layer, it is counted
only once, although the count should increase by 2: there is one rx
packet and one tx packet, and the tx packet is not being considered in
the HDD packet count.

Add code to ensure that when a packet is forwarded from the lower layers,
it is accounted for accurately in the packet count.

Change-Id: I47bc1e0ecfa2e831438534cf34d37086a306b4e9
CRs-Fixed: 996735
2019-07-27 22:08:01 +02:00
Himanshu Agarwal d9f3dc97b1 qcacld-2.0: Remove error print from kmsg
Remove error print from kmsg as this print is unnecessary and
may flood the kmsg.

Change-Id: I0978f88af6677cb0c1e1db5eae7e5d6a69bd4b70
CRs-Fixed: 997243
2019-07-27 22:08:01 +02:00
Himanshu Agarwal 6b9518faf6 qcacld-2.0: Add intra bss forwarded packets count
In lpm qos voting, the number of packets or bytes sent or received in a
given amount of time is recorded, and the decision to disable or enable
lpm is based on that.  These packets are recorded in the HDD layer.  When
packets are forwarded at the tx level only, they do not come up to the
HDD layer, so in the intra-bss forwarding case lpm qos voting is not
being performed appropriately.

Add code to count the intra-bss forwarded packets in the txrx layer and
include them when calculating the lpm qos vote.

Change-Id: I805663688cb300c8735b3e2f9680818a7b50bc9f
CRs-Fixed: 990868
2019-07-27 22:08:01 +02:00
c_zding b74a1497d5 qcacld-2.0: Fix other variable type used as boolean and initialized variable
In the current logic, the member "cbMode" of structure "tSirSmeStartBssReq"
and the member "secondarySubBand" of structure "tLimChannelSwitchInfo" are
used as booleans, although "cbMode" is defined as "tANI_U8" and
"secondarySubBand" is defined as an "enum".
Initialize the variable "getAssocSTAsReq" before it is used.
Move the "ATH_DFSEVENTQ_UNLOCK" call to optimize efficiency.

Change-Id: Ic5fec6c00b4bbfed53ebb9b5f965930f26171a11
CRs-Fixed: 969139
2019-07-27 22:08:00 +02:00
Selvaraj, Sridhar 7217c4e77e qcacld-2.0: Trigger Auth req(OPEN) when SHARED times out
When OPEN/SHARED WEP is configured, the current implementation starts
the Auth request with SHARED and, if that fails, triggers an Auth
request with OPEN.  This change triggers the OPEN Auth request when the
timeout happens (no Auth response received for the previous SHARED Auth
attempt).  Some APs do not respond to Shared Auth if they support only
Open.  To interoperate with these kinds of APs, try Open Auth if the
Auth timeout happens with Shared Auth.

Change-Id: I28b9186b9dc238640fd7655c9ac73e8aa89aec54
CRs-Fixed: 984341
2019-07-27 22:08:00 +02:00
Samuel Ahn 29164f263f qcacld-2.0: Add support for default TX params in OCB mode
When OCB mode is configured, default TX parameters can be provided.
These default TX parameters are used if a packet is sent without
a TX control header.

Change-Id: I72b3799cb0a9e00a60548facf25e57be241d82d7
CRs-Fixed: 964279
2019-07-27 22:07:59 +02:00
Eric Dumazet a6cf2de288 tcp: tcp_v4_err() should be more careful
[ Upstream commit 2c4cc9712364c051b1de2d175d5fbea6be948ebf ]

ICMP handlers are not stressed very often, so we should
make them more resilient to bugs that might surface in
the future.

If there is no packet in the retransmit queue, we should
avoid a NULL deref.
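Roughly, the defensive check described here looks like this inside the ICMP
handler (sketch; the queue-head accessor name varies by kernel version):

    /* Bail out if the retransmit queue is empty rather than dereferencing
     * a NULL skb further down. */
    skb = tcp_write_queue_head(sk);
    if (WARN_ON_ONCE(!skb))
        break;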

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: soukjin bae <soukjin.bae@samsung.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-27 22:07:59 +02:00
Yingying Tang 460b6f3aeb qcacld-2.0: Add null pointer check while processing GID action frame
In __limProcessGidManagementActionFrame(), a pointer is used
without a NULL check.  Add a fix to avoid the risk.

Change-Id: I9ed2ee6d85726c53ebfc036d320b28775d4e5d32
CRs-Fixed: 979671
2019-07-27 22:07:59 +02:00
Padma, Santhosh Kumar 2ebfca5060 qcacld-2.0: Validate pHashTable
prima to qcacld-2.0 propagation

When a deauth/disassoc is received from the peer at the same time that
cleanup is in progress because of a disconnect from the supplicant,
there is a chance that pHashTable can be NULL.  The memory pointed to by
pHashTable is freed during peDeleteSession, which is called during
cleanup.  In dphLookupHashEntry, pHashTable is referenced without
any NULL check, which can lead to a crash.  Fix this by validating
pHashTable against NULL.

Add a NULL check in _limProcessOperatingModeActionFrame before
referencing the sta context to resolve a potential KW issue.

Change-Id: I74d5c739cade19941320ee02eddc09e4fc74b105
CRs-Fixed: 898375
2019-07-27 22:07:58 +02:00
Yingying Tang a9bfe7022d qcacld-2.0: Add null pointer check while processing Operation Mode action frame
In __limProcessOperatingModeActionFrame(), a pointer is used
without a NULL check.  Add a fix to avoid the risk.

Change-Id: I5d5a26b53781272406a0f1d46a90b5ef138ce552
CRs-Fixed: 979671
2019-07-27 22:07:58 +02:00
Jingxiang Ge 0c3702f9bf qcacld-2.0: Fix buffer overwrite problem in GETIBSSPEERINFO
If (length + 1) is greater than priv_data.total_len, then copy_to_user
results in writing more data than the buffer can hold.

Fix this by writing the minimum of (length + 1) and priv_data.total_len.
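A sketch of the bounded copy described above; priv_data and length mirror
the commit message, while the .buf member and out_buf are illustrative
names:

    /* Never write past the user buffer: cap the copy at total_len. */
    size_t copy_len = min_t(size_t, length + 1, priv_data.total_len);

    if (copy_to_user(priv_data.buf, out_buf, copy_len))
        return -EFAULT;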

Change-Id: If0c74b3c6c76ee3ca296fd8e0e844b9c53c30498
CRs-Fixed: 2344325
2019-07-27 22:07:58 +02:00
Guisen Yang 8259595fc6 qcacld-2.0: Fix kw issues: check NULL, initialize data and LOCRET
The abnormal NULL-check form cannot be detected by KW.  Data used
before a NULL check, and a local address returned by a function,
should be fixed.

Change the NULL-check form.  Fix the use before the NULL check and the
use of data before initialization.  Fix the LOCRET issue.

Change-Id: Ic1756f0e45de0f407ec9e4193fbbaec885f05f67
CRs-Fixed: 2209931
2019-07-27 22:07:57 +02:00
Rongjing Liao 3cfcc97265 qcacld-2.0: add NULL pointer condition check for fixing KW issues
Add a NULL pointer condition check to fix KW issues.

Change-Id: I38b3b087fa67909c59f3d01e0b3051e4f8f56464
Signed-off-by: Rongjing Liao <liaor@codeaurora.org>
2019-07-27 22:07:57 +02:00
tinlin 440a9abf2d qcacld-2.0: Check for minimum frameLen for action frames
Propagation from cld3.0 to cld2.0.

In limProcessActionFrame and limProcessActionFrameNoSession,
the Rx frame pointer is directly cast to the action frame header
to find the action frame category and action ID, without validating
the minimum length of the frame.  If the frame length is less than the
action frame header length, an OOB read would occur.

Check whether frame_len is less than the size of the action frame header
and return if so.
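The check amounts to something like this (sketch; action_hdr stands in for
the driver's action frame header type):

    /* Too short to even contain the category and action ID fields: drop it
     * before casting the frame body to the action header. */
    if (frame_len < sizeof(*action_hdr))
        return;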

Change-ID: Idf8ca7eeacdf57171d2850fe6317784911830aac
CRs-Fixed: 2333070
2019-07-27 22:07:56 +02:00
Abhishek Singh 6a79179539 qcacld-2.0: Add check for robust action frame while sending action frames
prima to qcacld-2.0 propagation

Currently, if PMF is enabled, only SA query action frames
received from the supplicant are sent protected.  No other
action frame category is sent protected.

Add a check for robust action frames to decide whether protection is
needed for the action frame category received from the supplicant.

Change-Id: Ib1eb589c530ef99b7e2fedfcd106e0f646d78d93
CRs-Fixed: 960298
2019-07-27 22:07:56 +02:00
Sriram, Madhvapathi 7e1bd968da qcacld-2.0: Fix check for adapter device_mode while changing back from IBSS
Presently, if the current device_mode is STA/P2P/AP, the interface mode
change takes effect.  If the current device_mode is IBSS, the interface
mode change is rejected in __wlan_hdd_cfg80211_change_iface.  This causes
user applications like wpa_supplicant, which start the interface in STA
mode, to fail to reconfigure the interface after termination.
CRs-Fixed: 1005808
Change-Id: I67bcdb7453e8232dc711499ee66793877697582b
2019-07-27 22:07:56 +02:00
Sachin Ahuja f3366fc9f6 qcacld-2.0: Pass the correct userData in wpalTimerCback
prima to qcacld-2.0 propagation

During reinit, the driver sends the FW download request, and if the
request times out, the timer callback is executed in the WD thread.
Currently, userdata is passed as NULL when the timer callback is
executed in the WD thread.

Update the code to pass the correct userdata to the timer callback
when it is called in WD thread context.

Change-Id: I10a9cf8c53ded7d9db4bff0761f7b86a9021011a
CRs-Fixed: 1020713
2019-07-27 22:07:55 +02:00
Agrawal Ashish f5da811e12 qcacld-2.0: Register Callback for fullPower before posting message
prima to qcacld-2.0 propagation

In pmcRequestFullPower, the driver posts the message to enter
full power from the wpa_supplicant thread.  After posting the message,
the context switches to the MC thread.  The MC thread starts processing
the IMPS RESPONSE even before the supplicant thread can add the callback
entry to requestFullPowerList, so in effect the IMPS response handler
does not invoke any callbacks, and the command sitting in the roam
pending list does not get processed.
Fix this by registering the callback before posting the message to enter
full power.  If the request to enter full power fails, remove the entry.

Change-Id: If3d32d6998bf7f65171a8d501db69e72a6ee2865
CRs-Fixed: 903963
2019-07-27 22:07:55 +02:00
Jason A. Donenfeld e62a2e3be5 net_dbg_ratelimited: turn into no-op when !DEBUG
commit d92cff89a0c80e7e49796366e441d97f07b5d321 upstream.

The pr_debug family of functions turns into a no-op when -DDEBUG is not
specified, opting instead to call "no_printk", which gets compiled to a
no-op (but retains gcc's nice warnings about printf-style arguments).

The problem with net_dbg_ratelimited is that it is defined to be a
variant of net_ratelimited_function, which expands to essentially:

    if (net_ratelimit())
        pr_debug(fmt, ...);

When DEBUG is not defined, then this becomes,

    if (net_ratelimit())
        ;

This seems benign, except it isn't.  Firstly, there's the obvious
overhead of calling net_ratelimit needlessly, which does quite a bit of
bookkeeping for the rate limiting.  Given that the pr_debug and
net_dbg_ratelimited family of functions are sprinkled liberally through
performance-critical code, with developers assuming they'll be compiled
out to a no-op most of the time, we certainly do not want this needless
bookkeeping.  Secondly, and most visibly, even though no debug message
is printed when DEBUG is not defined, if there is a flood of
invocations, dmesg winds up peppered with messages such as
"net_ratelimit: 320 callbacks suppressed". This is because our
aforementioned net_ratelimit() function actually prints this text in
some circumstances. It's especially odd to see this when there isn't any
other accompanying debug message.

So, in sum, it doesn't make sense to have this function's current
behavior, and instead it should match what every other debug family of
functions in the kernel does with !DEBUG -- nothing.

This patch replaces calls to net_dbg_ratelimited when !DEBUG with
no_printk, keeping with the idiom of all the other debug print helpers.

Also, though not strictly necessary, it guards the call with an if (0)
so that all evaluation of any arguments is sure to be compiled out.
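Based on that description, the resulting definition looks roughly like this:

    #if defined(DEBUG)
    #define net_dbg_ratelimited(fmt, ...) \
        net_ratelimited_function(pr_debug, fmt, ##__VA_ARGS__)
    #else
    /* No rate-limit bookkeeping, no "callbacks suppressed" noise; the
     * arguments are still type-checked but never evaluated. */
    #define net_dbg_ratelimited(fmt, ...) \
        do { if (0) no_printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__); } while (0)
    #endif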

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 22:07:54 +02:00
Sebastian Andrzej Siewior a1e14bfdfc locking/rtmutex: Avoid a NULL pointer dereference on deadlock
commit 8d1e5a1a1ccf5ae9d8a5a0ee7960202ccb0c5429 upstream.

With task_blocks_on_rt_mutex() returning early -EDEADLK we never
add the waiter to the waitqueue. Later, we try to remove it via
remove_waiter() and go boom in rt_mutex_top_waiter() because
rb_entry() gives a NULL pointer.

( Tested on v3.18-RT where rtmutex is used for regular mutex and I
  tried to get one twice in a row. )

Not sure when this started but I guess 397335f004f4 ("rtmutex: Fix
deadlock detector for real") or commit 3d5c9340d194 ("rtmutex:
Handle deadlock detection smarter").

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1424187823-19600-1-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[ luis: backported to 3.16: adjusted context ]
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2019-07-27 22:07:54 +02:00
Thomas Gleixner 7c4a7bdb1f locking/rtmutex: Prevent dequeue vs. unlock race
commit dbb26055defd03d59f678cb5f2c992abe05b064a upstream.

David reported a futex/rtmutex state corruption. It's caused by the
following problem:

CPU0		CPU1		CPU2

l->owner=T1
		rt_mutex_lock(l)
		lock(l->wait_lock)
		l->owner = T1 | HAS_WAITERS;
		enqueue(T2)
		boost()
		  unlock(l->wait_lock)
		schedule()

				rt_mutex_lock(l)
				lock(l->wait_lock)
				l->owner = T1 | HAS_WAITERS;
				enqueue(T3)
				boost()
				  unlock(l->wait_lock)
				schedule()
		signal(->T2)	signal(->T3)
		lock(l->wait_lock)
		dequeue(T2)
		deboost()
		  unlock(l->wait_lock)
				lock(l->wait_lock)
				dequeue(T3)
				  ===> wait list is now empty
				deboost()
				 unlock(l->wait_lock)
		lock(l->wait_lock)
		fixup_rt_mutex_waiters()
		  if (wait_list_empty(l)) {
		    owner = l->owner & ~HAS_WAITERS;
		    l->owner = owner
		     ==> l->owner = T1
		  }

				lock(l->wait_lock)
rt_mutex_unlock(l)		fixup_rt_mutex_waiters()
				  if (wait_list_empty(l)) {
				    owner = l->owner & ~HAS_WAITERS;
cmpxchg(l->owner, T1, NULL)
 ===> Success (l->owner = NULL)
				    l->owner = owner
				     ==> l->owner = T1
				  }

That means the problem is caused by fixup_rt_mutex_waiters(), which does the
RMW to clear the waiters bit unconditionally when there are no waiters in
the rtmutex's rbtree.

This can be fatal: a concurrent unlock can release the rtmutex in the
fastpath because the waiters bit is not set.  If the cmpxchg() gets in the
middle of the RMW operation, then the previous owner, which just unlocked
the rtmutex, is set as the owner again when the write takes place after the
successful cmpxchg().

The solution is rather trivial: verify that the owner member of the rtmutex
has the waiters bit set before clearing it. This does not require a
cmpxchg() or other atomic operations because the waiters bit can only be
set and cleared with the rtmutex wait_lock held. It's also safe against the
fast path unlock attempt. The unlock attempt via cmpxchg() will either see
the bit set and take the slowpath or see the bit cleared and release it
atomically in the fastpath.
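The fix boils down to something like this (sketch in the spirit of the
backport, which uses ACCESS_ONCE(); the upstream version carries more
commentary):

    static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
    {
        unsigned long owner, *p = (unsigned long *) &lock->owner;

        if (rt_mutex_has_waiters(lock))
            return;

        /* Only clear the waiters bit if it is actually set; wait_lock is
         * held here, so plain loads and stores are sufficient. */
        owner = ACCESS_ONCE(*p);
        if (owner & RT_MUTEX_HAS_WAITERS)
            ACCESS_ONCE(*p) = owner & ~RT_MUTEX_HAS_WAITERS;
    }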

It's remarkable that the test program provided by David triggers on ARM64
and MIPS64 really quickly, but it refuses to reproduce on x86-64, even
though the problem exists there as well.  That refusal might explain why
this was not discovered earlier despite the bug existing since day one of
the rtmutex implementation more than 10 years ago.

Thanks to David for meticulously instrumenting the code and providing the
information that allowed this subtle problem to be decoded.

Reported-by: David Daney <ddaney@caviumnetworks.com>
Tested-by: David Daney <david.daney@cavium.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Fixes: 23f78d4a03 ("[PATCH] pi-futex: rt mutex core")
Link: http://lkml.kernel.org/r/20161130210030.351136722@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.16: use ACCESS_ONCE() instead of {READ,WRITE}_ONCE()]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 22:07:54 +02:00
Guillaume Nault 59a177e5b6 pppoe: fix reference counting in PPPoE proxy
commit 29e73269aa4d36f92b35610c25f8b01c789b0dc8 upstream.

Drop reference on the relay_po socket when __pppoe_xmit() succeeds.
This is already handled correctly in the error path.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2019-07-27 22:07:53 +02:00
Manuel Schölling f99add3d5b dns_resolver: Do not accept domain names longer than 255 chars
According to RFC1035 "[...] the total length of a domain name (i.e.,
label octets and label length octets) is restricted to 255 octets or
less."

Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-27 22:07:53 +02:00
Lorenzo Bianconi d86fb058ce net: ipv4: use a dedicated counter for icmp_v4 redirect packets
[ Upstream commit c09551c6ff7fe16a79a42133bcecba5fc2fc3291 ]

According to the algorithm described in the comment block at the
beginning of ip_rt_send_redirect, the host should try to send
'ip_rt_redirect_number' ICMP redirect packets with an exponential
backoff and then stop sending them altogether, assuming that the destination
ignores redirects.
If the device has previously sent some ICMP error packets that are
rate-limited (e.g TTL expired) and continues to receive traffic,
the redirect packets will never be transmitted. This happens since
peer->rate_tokens will be typically greater than 'ip_rt_redirect_number'
and so it will never be reset even if the redirect silence timeout
(ip_rt_redirect_silence) has elapsed without receiving any packet
requiring redirects.

Fix it by using a dedicated counter for the number of ICMP redirect
packets that have been sent by the host.
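Conceptually, ip_rt_send_redirect then keys its rate limiting off the
dedicated counter instead of the shared rate_tokens (sketch only; the
n_redirects field name is illustrative):

    /* Stop once ip_rt_redirect_number redirects have been sent to this peer,
     * independent of how many other ICMP errors were rate-limited. */
    if (peer->n_redirects >= ip_rt_redirect_number) {
        peer->rate_last = jiffies;
        goto out_put_peer;
    }

    if (time_after(jiffies, peer->rate_last +
                   (ip_rt_redirect_load << peer->n_redirects))) {
        icmp_send(skb, ICMP_REDIRECT, ICMP_REDIR_HOST, gw);
        peer->rate_last = jiffies;
        ++peer->n_redirects;
    }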

I have not been able to identify a given commit that introduced the
issue, since ip_rt_send_redirect implements the same rate-limiting
algorithm as commit 1da177e4c3 ("Linux-2.6.12-rc2").

Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-27 22:07:53 +02:00
Eric Dumazet 27760dcdf1 tcp: clear icsk_backoff in tcp_write_queue_purge()
[ Upstream commit 04c03114be82194d4a4858d41dba8e286ad1787c ]

soukjin bae reported a crash in tcp_v4_err() handling
ICMP_DEST_UNREACH after tcp_write_queue_head(sk)
returned a NULL pointer.

Current logic should have prevented this:

  if (seq != tp->snd_una  || !icsk->icsk_retransmits ||
      !icsk->icsk_backoff || fastopen)
      break;

The problem is that the write queue might have been purged
while icsk_backoff has not been cleared.
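The fix is to reset the backoff whenever the write queue is purged, so the
check above stays meaningful (sketch; the surrounding purge logic is
elided):

    static inline void tcp_write_queue_purge(struct sock *sk)
    {
        /* ... existing code that dequeues and frees every queued skb ... */
        tcp_clear_all_retrans_hints(tcp_sk(sk));
        inet_csk(sk)->icsk_backoff = 0;   /* keep tcp_v4_err()'s test consistent */
    }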

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: soukjin bae <soukjin.bae@samsung.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-07-27 22:07:52 +02:00
Paolo Abeni 08da77e269 udp: perform source validation for mcast early demux
commit bc044e8db7962e727a75b591b9851ff2ac5cf846 upstream.

The UDP early demux can leverage the rx dst cache even for
unconnected multicast sockets.

In such a scenario the ipv4 source address is validated only on
the first packet in the given flow.  After that, when we fetch
the dst entry from the socket rx cache, we stop enforcing
the rp_filter and we even start accepting any kind of martian
addresses.

Disabling the dst cache for unconnected multicast sockets would
cause a large performance regression, nearly halving the
max ingress throughput.

Instead we factor out a route helper to completely validate an
skb's source address for multicast packets, and we call it from
the UDP early demux for mcast packets landing on unconnected
sockets, after successfully fetching the related cached dst entry.
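In the early-demux path this looks roughly like the following (sketch;
ip_mc_validate_source() is the factored-out helper referred to above,
iph/in_dev come from earlier in the function, and the exact arguments may
differ):

    if (dst)
        dst = dst_check(dst, 0);
    if (dst) {
        u32 itag = 0;

        skb_dst_set_noref(skb, dst);

        /* Unconnected multicast socket: re-validate the source address
         * (rp_filter / martian checks) on every packet, not just the first. */
        return ip_mc_validate_source(skb, iph->daddr, iph->saddr, iph->tos,
                                     skb->dev, in_dev, &itag);
    }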

This still gives a measurable, but limited performance
regression:

		rp_filter = 0		rp_filter = 1
edmux disabled:	1182 Kpps		1127 Kpps
edmux before:	2238 Kpps		2238 Kpps
edmux after:	2037 Kpps		2019 Kpps

The above figures are on top of current net tree.
Applying the net-next commit 6e617de84e87 ("net: avoid a full
fib lookup when rp_filter is disabled.") the delta with
rp_filter == 0 will decrease even more.

Fixes: 421b3885bf6d ("udp: ipv4: Add udp early demux")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 22:07:52 +02:00
Paolo Abeni b60998c141 IPv4: early demux can return an error code
commit 7487449c86c65202b3b725c4524cb48dd65e4e6f upstream.

Currently no error is emitted, but this infrastructure will be
used by the next patch to allow source address validation
for mcast sockets.
Since early demux can do a route lookup, and an ipv4 route
lookup can return an error code, this is consistent with the
current ipv4 route infrastructure.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.16:
 - Drop change to net_protocol::early_demux_handler
 - Keep using NET_INC_STATS_BH() in ip_rcv_finish()
 - Fix up additional return statement in udp_v4_early_demux()
 - Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 22:07:51 +02:00
Paolo Abeni a3f3d99974 ipv4: fix broadcast packets reception
commit ad0ea1989cc4d5905941d0a9e62c63ad6d859cef upstream.

Currently, ingress ipv4 broadcast datagrams are dropped since,
in udp_v4_early_demux(), ip_check_mc_rcu() is invoked even on
bcast packets.

This patch addresses the issue, invoking ip_check_mc_rcu()
only for mcast packets.

Fixes: 6e5403093261 ("ipv4/udp: Verify multicast group is ours in upd_v4_early_demux()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 22:07:51 +02:00
Shawn Bohrer 755cb66d68 ipv4/udp: Verify multicast group is ours in udp_v4_early_demux()
commit 6e540309326188f769e03bb4c6dd8ff6752930c2 upstream.

421b3885bf6d56391297844f43fb7154a6396e12 "udp: ipv4: Add udp early
demux" introduced a regression that allowed sockets bound to INADDR_ANY
to receive packets from multicast groups that the socket had not joined.
For example a socket that had joined 224.168.2.9 could also receive
packets from 225.168.2.9 despite not having joined that group if
ip_early_demux is enabled.

Fix this by calling ip_check_mc_rcu() in udp_v4_early_demux() to verify
that the multicast packet is indeed ours.

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Reported-by: Yurij M. Plotnikov <Yurij.Plotnikov@oktetlabs.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2019-07-27 22:07:51 +02:00
Eric Dumazet 5023264f3f udp: fix dst races with multicast early demux
commit 10e2eb878f3ca07ac2f05fa5ca5e6c4c9174a27a upstream.

Multicast dsts are not cached.  They carry DST_NOCACHE.

As mentioned in commit f8864972126899 ("ipv4: fix dst race in
sk_dst_get()"), these dsts need special care before caching them
into a socket.

Caching them is allowed only if their refcnt was not 0, i.e. we
must use atomic_inc_not_zero().

Also, we must use READ_ONCE() to fetch sk->sk_rx_dst, as mentioned
in commit d0c294c53a771 ("tcp: prevent fetching dst twice in early demux
code").

Fixes: 421b3885bf6d ("udp: ipv4: Add udp early demux")
Tested-by: Gregory Hoggarth <Gregory.Hoggarth@alliedtelesis.co.nz>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Gregory Hoggarth <Gregory.Hoggarth@alliedtelesis.co.nz>
Reported-by: Alex Gartrell <agartrell@fb.com>
Cc: Michal Kubeček <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
[ luis: backported to 3.16: used davem's backport to 3.14 ]
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
2019-07-27 22:07:50 +02:00
Eric Dumazet 2ad7c93946 udp: ipv4: do not waste time in __udp4_lib_mcast_demux_lookup
It's too easy to add thousands of UDP sockets to a particular bucket
and slow down an innocent multicast receiver.

Early demux is supposed to be an optimization, we should avoid spending
too much time in it.

It is interesting to note that __udp4_lib_demux_lookup() only tries to
match the first socket in the chain.

10 is the threshold we already have in __udp4_lib_lookup() to switch
to secondary hash.
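So the early-demux lookup simply gives up when the bucket is crowded,
roughly:

    /* Do not bother scanning a long chain; early demux is only an
     * optimization, the regular lookup will handle the packet anyway. */
    if (hslot->count > 10)
        return NULL;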

Fixes: 421b3885bf6d5 ("udp: ipv4: Add udp early demux")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: David Held <drheld@google.com>
Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-27 22:07:50 +02:00
Eric Dumazet ba085e5d79 udp: ipv4: must add synchronization in udp_sk_rx_dst_set()
Unlike TCP, the UDP input path does not hold the socket lock.

Before messing with sk->sk_rx_dst, we must use a spinlock, otherwise
multiple cpus could leak a refcount.

This patch also takes care of renewing a stale dst entry.
(When the sk->sk_rx_dst would not be used by IP early demux)
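In outline, the serialized update described above looks something like this
(sketch only; rx_dst_lock is a stand-in for whatever lock the actual patch
uses to protect sk->sk_rx_dst):

    struct dst_entry *old;

    spin_lock(&rx_dst_lock);        /* hypothetical lock protecting sk_rx_dst */
    old = sk->sk_rx_dst;
    if (old != dst) {               /* also renews a stale entry */
        dst_hold(dst);
        sk->sk_rx_dst = dst;
        dst_release(old);
    }
    spin_unlock(&rx_dst_lock);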

Fixes: 421b3885bf6d ("udp: ipv4: Add udp early demux")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-27 22:07:50 +02:00
Eric Dumazet a59c54761f udp: ipv4: fix potential use after free in udp_v4_early_demux()
pskb_may_pull() can reallocate skb->head, so we need to move the
initialization of the iph and uh pointers after its call.
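In other words, the pointers are only taken once the headers are known to be
in the linear area (sketch):

    /* pskb_may_pull() may move skb->head, invalidating earlier pointers. */
    if (!pskb_may_pull(skb, skb_transport_offset(skb) + sizeof(struct udphdr)))
        return;

    iph = ip_hdr(skb);   /* safe to dereference only after the pull */
    uh  = udp_hdr(skb);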

Fixes: 421b3885bf6d ("udp: ipv4: Add udp early demux")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-27 22:07:49 +02:00
Eric Dumazet c88bbddf16 udp: ipv4: fix a use after free in __udp4_lib_rcv()
Dave Jones reported a use after free in the UDP stack:

[ 5059.434216] =========================
[ 5059.434314] [ BUG: held lock freed! ]
[ 5059.434420] 3.13.0-rc3+ #9 Not tainted
[ 5059.434520] -------------------------
[ 5059.434620] named/863 is freeing memory ffff88005e960000-ffff88005e96061f, with a lock still held there!
[ 5059.434815]  (slock-AF_INET){+.-...}, at: [<ffffffff8149bd21>] udp_queue_rcv_skb+0xd1/0x4b0
[ 5059.435012] 3 locks held by named/863:
[ 5059.435086]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff8143054d>] __netif_receive_skb_core+0x11d/0x940
[ 5059.435295]  #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff81467a5e>] ip_local_deliver_finish+0x3e/0x410
[ 5059.435500]  #2:  (slock-AF_INET){+.-...}, at: [<ffffffff8149bd21>] udp_queue_rcv_skb+0xd1/0x4b0
[ 5059.435734]
stack backtrace:
[ 5059.435858] CPU: 0 PID: 863 Comm: named Not tainted 3.13.0-rc3+ #9 [loadavg: 0.21 0.06 0.06 1/115 1365]
[ 5059.436052] Hardware name:                  /D510MO, BIOS MOPNV10J.86A.0175.2010.0308.0620 03/08/2010
[ 5059.436223]  0000000000000002 ffff88007e203ad8 ffffffff8153a372 ffff8800677130e0
[ 5059.436390]  ffff88007e203b10 ffffffff8108cafa ffff88005e960000 ffff88007b00cfc0
[ 5059.436554]  ffffea00017a5800 ffffffff8141c490 0000000000000246 ffff88007e203b48
[ 5059.436718] Call Trace:
[ 5059.436769]  <IRQ>  [<ffffffff8153a372>] dump_stack+0x4d/0x66
[ 5059.436904]  [<ffffffff8108cafa>] debug_check_no_locks_freed+0x15a/0x160
[ 5059.437037]  [<ffffffff8141c490>] ? __sk_free+0x110/0x230
[ 5059.437147]  [<ffffffff8112da2a>] kmem_cache_free+0x6a/0x150
[ 5059.437260]  [<ffffffff8141c490>] __sk_free+0x110/0x230
[ 5059.437364]  [<ffffffff8141c5c9>] sk_free+0x19/0x20
[ 5059.437463]  [<ffffffff8141cb25>] sock_edemux+0x25/0x40
[ 5059.437567]  [<ffffffff8141c181>] sock_queue_rcv_skb+0x81/0x280
[ 5059.437685]  [<ffffffff8149bd21>] ? udp_queue_rcv_skb+0xd1/0x4b0
[ 5059.437805]  [<ffffffff81499c82>] __udp_queue_rcv_skb+0x42/0x240
[ 5059.437925]  [<ffffffff81541d25>] ? _raw_spin_lock+0x65/0x70
[ 5059.438038]  [<ffffffff8149bebb>] udp_queue_rcv_skb+0x26b/0x4b0
[ 5059.438155]  [<ffffffff8149c712>] __udp4_lib_rcv+0x152/0xb00
[ 5059.438269]  [<ffffffff8149d7f5>] udp_rcv+0x15/0x20
[ 5059.438367]  [<ffffffff81467b2f>] ip_local_deliver_finish+0x10f/0x410
[ 5059.438492]  [<ffffffff81467a5e>] ? ip_local_deliver_finish+0x3e/0x410
[ 5059.438621]  [<ffffffff81468653>] ip_local_deliver+0x43/0x80
[ 5059.438733]  [<ffffffff81467f70>] ip_rcv_finish+0x140/0x5a0
[ 5059.438843]  [<ffffffff81468926>] ip_rcv+0x296/0x3f0
[ 5059.438945]  [<ffffffff81430b72>] __netif_receive_skb_core+0x742/0x940
[ 5059.439074]  [<ffffffff8143054d>] ? __netif_receive_skb_core+0x11d/0x940
[ 5059.442231]  [<ffffffff8108c81d>] ? trace_hardirqs_on+0xd/0x10
[ 5059.442231]  [<ffffffff81430d83>] __netif_receive_skb+0x13/0x60
[ 5059.442231]  [<ffffffff81431c1e>] netif_receive_skb+0x1e/0x1f0
[ 5059.442231]  [<ffffffff814334e0>] napi_gro_receive+0x70/0xa0
[ 5059.442231]  [<ffffffffa01de426>] rtl8169_poll+0x166/0x700 [r8169]
[ 5059.442231]  [<ffffffff81432bc9>] net_rx_action+0x129/0x1e0
[ 5059.442231]  [<ffffffff810478cd>] __do_softirq+0xed/0x240
[ 5059.442231]  [<ffffffff81047e25>] irq_exit+0x125/0x140
[ 5059.442231]  [<ffffffff81004241>] do_IRQ+0x51/0xc0
[ 5059.442231]  [<ffffffff81542bef>] common_interrupt+0x6f/0x6f

We need to keep a reference on the socket by using skb_steal_sock()
in the right place.
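The receive path then takes ownership of the early-demux reference and drops
it once the packet has been queued, roughly:

    sk = skb_steal_sock(skb);        /* take over the early-demux reference */
    if (sk) {
        int ret = udp_queue_rcv_skb(sk, skb);

        sock_put(sk);                /* we own the reference, so drop it here */
        /* > 0 means "resubmit as -protocol" in the caller's convention */
        if (ret > 0)
            return -ret;
        return 0;
    }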

Note that another patch is needed to fix a race in
udp_sk_rx_dst_set(), as we hold no lock protecting the dst.

Fixes: 421b3885bf6d ("udp: ipv4: Add udp early demux")
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-27 22:07:49 +02:00
Florian Westphal 406121ee71 bridge: netfilter: orphan skb before invoking ip netfilter hooks
Pekka Pietikäinen reports xt_socket behavioural change after commit
00028aa37098o (netfilter: xt_socket: use IP early demux).

The reason is that xt_socket no longer does an unconditional sk lookup -
it re-uses the existing skb->sk if possible, assuming ->sk was set by
ip early demux.

However, when netfilter is invoked via bridge, this can cause 'bogus'
sockets to be examined by the match, e.g. a 'tun' device socket.

bridge netfilter should orphan the skb just like the routing path
before invoking ipv4/ipv6 netfilter hooks to avoid this.

Reported-and-tested-by: Pekka Pietikäinen <pp@ee.oulu.fi>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-27 22:07:48 +02:00
Shawn Bohrer 3542138c1c udp: ipv4: Add udp early demux
The removal of the routing cache introduced a performance regression for
some UDP workloads since a dst lookup must be done for each packet.
This change caches the dst per socket in a similar manner to what we do
for TCP by implementing early_demux.

For UDP multicast we can only cache the dst if there is only one
receiving socket on the host.  Since caching only works when there is
one receiving socket we do the multicast socket lookup using RCU.

For UDP unicast we only demux sockets with an exact match in order to
not break forwarding setups.  Additionally, since the hash chains may be
long, we only check the first socket to see if it is a match rather than
waste extra time searching the whole chain when we might not find an
exact match.

Benchmark results from a netperf UDP_RR test:
Before 87961.22 transactions/s
After  89789.68 transactions/s

Benchmark results from a fio 1 byte UDP multicast pingpong test
(Multicast one way unicast response):
Before 12.97us RTT
After  12.63us RTT

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-27 22:07:48 +02:00