android_kernel_google_msm

mirror of https://github.com/followmsi/android_kernel_google_msm.git synced 2024-11-06 23:17:41 +00:00

Author	SHA1	Message	Date
followmsi	3a6cf41324	Merge branch 'lineage-18.1' of https://github.com/LineageOS/android_kernel_google_msm into followmsi-12	2023-03-24 15:14:21 +01:00
Xin Long	103a20d70a	ipv6: check sk sk_type and protocol early in ip_mroute_set/getsockopt [ Upstream commit 99253eb750fda6a644d5188fb26c43bad8d5a745 ] Commit `5e1859fbcc` ("ipv4: ipmr: various fixes and cleanups") fixed the issue for ipv4 ipmr: ip_mroute_setsockopt() & ip_mroute_getsockopt() should not access/set raw_sk(sk)->ipmr_table before making sure the socket is a raw socket, and protocol is IGMP The same fix should be done for ipv6 ipmr as well. This patch can fix the panic caused by overwriting the same offset as ipmr_table as in raw_sk(sk) when accessing other type's socket by ip_mroute_setsockopt(). Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Change-Id: I8d48f4611a2f2d0cb7ad5146036f571f12ecb1fc CVE-2017-18509 Signed-off-by: Kevin F. Haggerty <haggertk@lineageos.org>	2023-02-18 18:38:56 +01:00
Eric Dumazet	8dc08ba699	ipv4: ipmr: various fixes and cleanups 1) ip_mroute_setsockopt() & ip_mroute_getsockopt() should not access/set raw_sk(sk)->ipmr_table before making sure the socket is a raw socket, and protocol is IGMP 2) MRT_INIT should return -EINVAL if optlen != sizeof(int), not -ENOPROTOOPT 3) MRT_ASSERT & MRT_PIM should validate optlen 4) " (v) ? 1 : 0 " can be written as " !!v " Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Kevin F. Haggerty <haggertk@lineageos.org> Change-Id: I606493fe572b29f2c6a709833e799e20f2212882	2023-02-18 18:38:56 +01:00
Lorenzo Colitti	afd1d2b38a	net: inet: Support UID-based routing in IP protocols. - Use the UID in routing lookups made by protocol connect() and sendmsg() functions. - Make sure that routing lookups triggered by incoming packets (e.g., Path MTU discovery) take the UID of the socket into account. - For packets not associated with a userspace socket, (e.g., ping replies) use UID 0 inside the user namespace corresponding to the network namespace the socket belongs to. This allows all namespaces to apply routing and iptables rules to kernel-originated traffic in that namespaces by matching UID 0. This is better than using the UID of the kernel socket that is sending the traffic, because the UID of kernel sockets created at namespace creation time (e.g., the per-processor ICMP and TCP sockets) is the UID of the user that created the socket, which might not be mapped in the namespace. Bug: 16355602 Change-Id: I910504b508948057912bc188fd1e8aca28294de3 Tested: compiles allnoconfig, allyesconfig, allmodconfig Tested: https://android-review.googlesource.com/253302 Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Kevin F. Haggerty <haggertk@lineageos.org>	2023-02-18 18:38:56 +01:00
Eric Biggers	1396cfea05	net: socket: don't set sk_uid to garbage value in ->setattr() ->setattr() was recently implemented for socket files to sync the socket inode's uid to the new 'sk_uid' member of struct sock. It does this by copying over the ia_uid member of struct iattr. However, ia_uid is actually only valid when ATTR_UID is set in ia_valid, indicating that the uid is being changed, e.g. by chown. Other metadata operations such as chmod or utimes leave ia_uid uninitialized. Therefore, sk_uid could be set to a "garbage" value from the stack. Fix this by only copying the uid over when ATTR_UID is set. [backport of net e1a3a60a2ebe991605acb14cd58e39c0545e174e] Bug: 16355602 Change-Id: I20e53848e54282b72a388ce12bfa88da5e3e9efe Fixes: 86741ec25462 ("net: core: Add a UID field to struct sock.") Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Lorenzo Colitti <lorenzo@google.com> Acked-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Kevin F. Haggerty <haggertk@lineageos.org>	2023-02-18 18:37:20 +01:00
Lorenzo Colitti	d3cd043531	net: core: Add a UID field to struct sock. Protocol sockets (struct sock) don't have UIDs, but most of the time, they map 1:1 to userspace sockets (struct socket) which do. Various operations such as the iptables xt_owner match need access to the "UID of a socket", and do so by following the backpointer to the struct socket. This involves taking sk_callback_lock and doesn't work when there is no socket because userspace has already called close(). Simplify this by adding a sk_uid field to struct sock whose value matches the UID of the corresponding struct socket. The semantics are as follows: 1. Whenever sk_socket is non-null: sk_uid is the same as the UID in sk_socket, i.e., matches the return value of sock_i_uid. Specifically, the UID is set when userspace calls socket(), fchown(), or accept(). 2. When sk_socket is NULL, sk_uid is defined as follows: - For a socket that no longer has a sk_socket because userspace has called close(): the previous UID. - For a cloned socket (e.g., an incoming connection that is established but on which userspace has not yet called accept): the UID of the socket it was cloned from. - For a socket that has never had an sk_socket: UID 0 inside the user namespace corresponding to the network namespace the socket belongs to. Kernel sockets created by sock_create_kern are a special case of #1 and sk_uid is the user that created them. For kernel sockets created at network namespace creation time, such as the per-processor ICMP and TCP sockets, this is the user that created the network namespace. [Backport of net-next 86741ec25462e4c8cdce6df2f41ead05568c7d5e] Bug: 16355602 Change-Id: I73e1a57dfeedf672f4c2dfc9ce6867838b55974b Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Kevin F. Haggerty <haggertk@lineageos.org>	2023-02-18 18:37:04 +01:00
Masatake YAMATO	941eda2515	net: Providing protocol type via system.sockprotoname xattr of /proc/PID/fd entries lsof reports some of socket descriptors as "can't identify protocol" like: [yamato@localhost]/tmp% sudo lsof \| grep dbus \| grep iden dbus-daem 652 dbus 6u sock ... 17812 can't identify protocol dbus-daem 652 dbus 34u sock ... 24689 can't identify protocol dbus-daem 652 dbus 42u sock ... 24739 can't identify protocol dbus-daem 652 dbus 48u sock ... 22329 can't identify protocol ... lsof cannot resolve the protocol used in a socket because procfs doesn't provide the map between inode number on sockfs and protocol type of the socket. For improving the situation this patch adds an extended attribute named 'system.sockprotoname' in which the protocol name for /proc/PID/fd/SOCKET is stored. So lsof can know the protocol for a given /proc/PID/fd/SOCKET with getxattr system call. A few weeks ago I submitted a patch for the same purpose. The patch was introduced /proc/net/sockfs which enumerates inodes and protocols of all sockets alive on a system. However, it was rejected because (1) a global lock was needed, and (2) the layout of struct socket was changed with the patch. This patch doesn't use any global lock; and doesn't change the layout of any structs. In this patch, a protocol name is stored to dentry->d_name of sockfs when new socket is associated with a file descriptor. Before this patch dentry->d_name was not used; it was just filled with empty string. lsof may use an extended attribute named 'system.sockprotoname' to retrieve the value of dentry->d_name. It is nice if we can see the protocol name with ls -l /proc/PID/fd. However, "socket:[#INODE]", the name format returned from sockfs_dname() was already defined. To keep the compatibility between kernel and user land, the extended attribute is used to prepare the value of dentry->d_name. Change-Id: I04143ee6da5c236835a897086fc0de819abb0cdc Signed-off-by: Masatake YAMATO <yamato@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Kevin F. Haggerty <haggertk@lineageos.org>	2023-02-18 18:37:00 +01:00
Sabrina Dubroca	82794d1071	BACKPORT: tcp: fix recv with flags MSG_WAITALL \| MSG_PEEK Currently, tcp_recvmsg enters a busy loop in sk_wait_data if called with flags = MSG_WAITALL \| MSG_PEEK. sk_wait_data waits for sk_receive_queue not empty, but in this case, the receive queue is not empty, but does not contain any skb that we can use. Add a "last skb seen on receive queue" argument to sk_wait_data, so that it sleeps until the receive queue has new skbs. Change-Id: If58492ae474effe058541f7e9a0c03dc24155393 Link: https://bugzilla.kernel.org/show_bug.cgi?id=99461 Link: https://sourceware.org/bugzilla/show_bug.cgi?id=18493 Link: https://bugzilla.redhat.com/show_bug.cgi?id=1205258 Reported-by: Enrico Scholz <rh-bugzilla@ensc.de> Reported-by: Dan Searle <dan@censornet.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Kevin F. Haggerty <haggertk@lineageos.org>	2023-02-18 18:36:52 +01:00
Daniel Borkmann	e9ff904465	BACKPORT: net: sock: make sock_tx_timestamp void Currently, sock_tx_timestamp() always returns 0. The comment that describes the sock_tx_timestamp() function wrongly says that it returns an error when an invalid argument is passed (from commit `20d4947353`, ``net: socket infrastructure for SO_TIMESTAMPING''). Make the function void, so that we can also remove all the unneeded if conditions that check for such a _non-existant_ error case in the output path. Change-Id: Ibdfd5071737190371d4abec5ae76046b5aa8de23 Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Kevin F. Haggerty <haggertk@lineageos.org>	2023-02-18 18:32:19 +01:00
Eric Dumazet	cab65020e8	BACKPORT: net: include/net/sock.h cleanup bool/const conversions where possible __inline__ -> inline space cleanups Change-Id: I0ee0135e737edd702f753fac182b293ec5cc652a Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Kevin F. Haggerty <haggertk@lineageos.org>	2023-02-18 18:32:14 +01:00
Roger Hu	7c56badc9a	kernel: Revert "tcp: do not lock listener to process SYN packets" This commit belongs to the patch set (https://lwn.net/Articles/659199/) that attempts to remove the use of locks on the socket table by relocating the SYN table to a separate hash table and adding a spin lock to protect the SYN request queue. Adding only this commit introduces a race condition for LineageOS kernels for TCP listens, since the TCP SYN data structures can be corrupted. A TCP curl bomb on a TCP listen port will corrupt the SYN accept backlog: for i in $(seq 1 400); do curl -x localhost:443 https://myhost.com -L --connect-timeout 30 -o /dev/null -sS & done Run `ss -nltp` and usually the RecVQ column does not drain to 0. This reverts commit 7d9f104f9cabe1d72a50c4816a48f64fc1da7a64. This really needs to be reverted across all LineageOS forks: https://gitlab.com/LineageOS/issues/android/-/issues/3916#note_669493796 Change-Id: Ia7969aeedae411677b307a8e094f9a4cc02b801d	2022-07-05 01:10:45 -04:00
followmsi	51290e763f	Merge branch 'lineage-18.1' into followmsi-11	2021-08-23 14:34:45 +02:00
Eric Dumazet	e336f6ab76	tcp: fix more NULL deref after prequeue changes When I cooked commit `c3658e8d0f` ("tcp: fix possible NULL dereference in tcp_vX_send_reset()") I missed other spots we could deref a NULL skb_dst(skb) Again, if a socket is provided, we do not need skb_dst() to get a pointer to network namespace : sock_net(sk) is good enough. [Backport of net-next 0f85feae6b710ced3abad5b2b47d31dfcb956b62] Bug: 16355602 Change-Id: Ibe1def7979625ee7902bff2f33ec8945b9945948 Reported-by: Dann Frazier <dann.frazier@canonical.com> Bisected-by: Dann Frazier <dann.frazier@canonical.com> Tested-by: Dann Frazier <dann.frazier@canonical.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: `ca777eff51` ("tcp: remove dst refcount false sharing for prequeue mode") Signed-off-by: David S. Miller <davem@davemloft.net>	2021-01-23 17:17:18 +03:00
David S. Miller	e80824b3dd	ipv6: Fix types of ip6_update_pmtu(). The mtu should be a __be32, not the mark. Reported-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: Ie321dcc3652921f8f28491d39c8262268aeb22bc	2021-01-23 17:17:04 +03:00
David S. Miller	9ce6cd5eab	ipv6: Handle PMTU in ICMP error handlers. One tricky issue on the ipv6 side vs. ipv4 is that the ICMP callouts to handle the error pass the 32-bit info cookie in network byte order whereas ipv4 passes it around in host byte order. Like the ipv4 side, we have two helper functions. One for when we have a socket context and one for when we do not. ip6ip6 tunnels are not handled here, because they handle PMTU events by essentially relaying another ICMP packet-too-big message back to the original sender. This patch allows us to get rid of rt6_do_pmtu_disc(). It handles all kinds of situations that simply cannot happen when we do the PMTU update directly using a fully resolved route. In fact, the "plen == 128" check in ip6_rt_update_pmtu() can very likely be removed or changed into a BUG_ON() check. We should never have a prefixed ipv6 route when we get there. Another piece of strange history here is that TCP and DCCP, unlike in ipv4, never invoke the update_pmtu() method from their ICMP error handlers. This is incredibly astonishing since this is the context where we have the most accurate context in which to make a PMTU update, namely we have a fully connected socket and associated cached socket route. Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: Ibb5cae4316256108c5130459b07288c9fc7380c3	2021-01-23 17:16:44 +03:00
followmsi	5b7eb403ff	Merge branch 'lineage-18.0' into followmsi-11	2020-12-06 20:31:44 +01:00
Fan Du	c9f76626c3	xfrm: remove redundant parameter "int dir" in struct xfrm_mgr.acquire Sematically speaking, xfrm_mgr.acquire is called when kernel intends to ask user space IKE daemon to negotiate SAs with peers. IOW the direction will always be XFRM_POLICY_OUT, so remove int dir for clarity. Change-Id: Id4df779e818c5b6f5eccb9bf8250c17eb729f101 Signed-off-by: Fan Du <fan.du@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2020-12-06 13:59:24 +03:00
Fan Du	adff1877a7	Fix unexpected SA hard expiration after changing date After SA is setup, one timer is armed to detect soft/hard expiration, however the timer handler uses xtime to do the math. This makes hard expiration occurs first before soft expiration after setting new date with big interval. As a result new child SA is deleted before rekeying the new one. Signed-off-by: Fan Du <fdu@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I44623eedcfb44088e86845b4726198a5cca02c80	2020-12-06 13:58:54 +03:00
David S. Miller	d91c40e298	xfrm: Convert several xfrm policy match functions to bool. xfrm_selector_match xfrm_sec_ctx_match __xfrm4_selector_match __xfrm6_selector_match Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I0cb9bc9e9f5689edc4ee73b6a9d838a89ea5a5de	2020-12-06 13:58:45 +03:00
Paul Moore	a664edd900	xfrm: force a garbage collection after deleting a policy In some cases after deleting a policy from the SPD the policy would remain in the dst/flow/route cache for an extended period of time which caused problems for SELinux as its dynamic network access controls key off of the number of XFRM policy and state entries. This patch corrects this problem by forcing a XFRM garbage collection whenever a policy is sucessfully removed. Reported-by: Ondrej Moris <omoris@redhat.com> Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I1aae89f15a467a7661caec89edfef21aa639b852	2020-11-30 19:39:51 +03:00
Timo Teräs	50bb785569	xfrm: properly handle invalid states as an error The error exit path needs err explicitly set. Otherwise it returns success and the only caller, xfrm_output_resume(), would oops in skb_dst(skb)->ops derefence as skb_dst(skb) is NULL. Bug introduced in commit `bb65a9cb` (xfrm: removes a superfluous check and add a statistic). Signed-off-by: Timo Teräs <timo.teras@iki.fi> Cc: Li RongQing <roy.qing.li@gmail.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I67e8f934400c9dcb8533ab778dc2555267540109	2020-11-30 19:39:48 +03:00
Mathias Krause	8fd6173111	xfrm: Fix esn sequence number diff calculation in xfrm_replay_notify_esn() Commit `0017c0b` "xfrm: Fix replay notification for esn." is off by one for the sequence number wrapped case as UINT_MAX is 0xffffffff, not 0x100000000. ;) Just calculate the diff like done everywhere else in the file. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Change-Id: I56b84e105364ce7102a243413c41f2491e6528bd	2020-11-30 19:39:45 +03:00
Steffen Klassert	11e5e29431	xfrm: Fix replay notification for esn. We may miscalculate the sequence number difference from the last time we send a notification if a sequence number wrap occured in the meantime. We fix this by adding a separate replay notify function for esn. Here we take the high bits of the sequence number into account to calculate the difference. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Change-Id: I479b8e5a402828d226b8bac67d07fb489f169fa4	2020-11-30 19:39:42 +03:00
Baker Zhang	59ae3987ae	xfrm: use xfrm direction when lookup policy because xfrm policy direction has same value with corresponding flow direction, so this problem is covered. In xfrm_lookup and __xfrm_policy_check, flow_cache_lookup is used to accelerate the lookup. Flow direction is given to flow_cache_lookup by policy_to_flow_dir. When the flow cache is mismatched, callback 'resolver' is called. 'resolver' requires xfrm direction, so convert direction back to xfrm direction. Signed-off-by: Baker Zhang <baker.zhang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: Ie626dedd5affcb1f480e37b071c546e25ff0d83d	2020-11-30 19:39:39 +03:00
Mathias Krause	667db6169c	xfrm_user: constify netlink dispatch table There is no need to modify the netlink dispatch table at runtime. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Change-Id: If33577532b9970f8c07b8e577275ddb6aa48cc20	2020-11-30 19:39:36 +03:00
Nicolas Dichtel	1b5c151075	xfrm: allow to avoid copying DSCP during encapsulation By default, DSCP is copying during encapsulation. Copying the DSCP in IPsec tunneling may be a bit dangerous because packets with different DSCP may get reordered relative to each other in the network and then dropped by the remote IPsec GW if the reordering becomes too big compared to the replay window. It is possible to avoid this copy with netfilter rules, but it's very convenient to be able to configure it for each SA directly. This patch adds a toogle for this purpose. By default, it's not set to maintain backward compatibility. Field flags in struct xfrm_usersa_info is full, hence I add a new attribute. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Change-Id: I885117f02790536e2c5002232b3b33be651a568d	2020-11-30 19:39:33 +03:00
Steffen Klassert	b0d18ee580	xfrm: Allow inserting policies with matching mark and different priorities We currently can not insert policies with mark and mask such that some flows would be matched from both policies. We make this possible when the priority of these policies are different. If both policies match a flow, the one with the higher priority is used. Reported-by: Emmanuel Thierry <emmanuel.thierry@telecom-bretagne.eu> Reported-by: Romain Kuntz <r.kuntz@ipflavors.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Change-Id: Ib9d456b956784fbee982ed9191f22f9ed097cd47	2020-11-30 19:39:30 +03:00
Steffen Klassert	bdcd386baf	xfrm: Add a state resolution packet queue As the default, we blackhole packets until the key manager resolves the states. This patch implements a packet queue where IPsec packets are queued until the states are resolved. We generate a dummy xfrm bundle, the output routine of the returned route enqueues the packet to a per policy queue and arms a timer that checks for state resolution when dst_output() is called. Once the states are resolved, the packets are sent out of the queue. If the states are not resolved after some time, the queue is flushed. This patch keeps the defaut behaviour to blackhole packets as long as we have no states. To enable the packet queue the sysctl xfrm_larval_drop must be switched off. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Change-Id: I5ff2ae58ba580505914683e3571ee96974ee4966	2020-11-30 19:39:27 +03:00
David S. Miller	c275431b28	net: Document dst->obsolete better. Add a big comment explaining how the field works, and use defines instead of magic constants for the values assigned to it. Suggested by Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I2c8ca84d38cd49ccc1207db588108d78e6f9403a	2020-11-30 19:39:24 +03:00
Li RongQing	84482896c3	xfrm: fix a unbalanced lock Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Change-Id: Ib4951dae08cd33c9bae696225e5e870e069303f1	2020-11-30 19:39:21 +03:00
Jussi Kivilinna	fe2daa40e9	pf_key/xfrm_algo: prepare pf_key and xfrm_algo for new algorithms without pfkey support Mark existing algorithms as pfkey supported and make pfkey only use algorithms that have pfkey_supported set. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Change-Id: I8a51662015715f6444011ce2298fe7d8c9651a81	2020-11-30 19:39:18 +03:00
YOSHIFUJI Hideaki / 吉藤英明	850f524248	xfrm: Convert xfrm_addr_cmp() to boolean xfrm_addr_equal(). All users of xfrm_addr_cmp() use its result as boolean. Introduce xfrm_addr_equal() (which is equal to !xfrm_addr_cmp()) and convert all users. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I992ceb67c76beea2332c494d92f4f3571a31f0c2	2020-11-30 19:39:14 +03:00
Michal Kubecek	0bc9330949	xfrm: fix freed block size calculation in xfrm_policy_fini() Missing multiplication of block size by sizeof(struct hlist_head) can cause xfrm_hash_free() to be called with wrong second argument so that kfree() is called on a block allocated with vzalloc() or __get_free_pages() or free_pages() is called with wrong order when a namespace with enough policies is removed. Bug introduced by commit `a35f6c5d`, i.e. versions >= 2.6.29 are affected. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Change-Id: I3e431a20bebf2bb1beeb450b9e07f30bfd263f1f	2020-11-30 19:39:11 +03:00
Nickolai Zeldovich	0992b79577	net/xfrm/xfrm_replay: avoid division by zero All of the xfrm_replay->advance functions in xfrm_replay.c check if x->replay_esn->replay_window is zero (and return if so). However, one of them, xfrm_replay_advance_bmp(), divides by that value (in the '%' operator) before doing the check, which can potentially trigger a divide-by-zero exception. Some compilers will also assume that the earlier division means the value cannot be zero later, and thus will eliminate the subsequent zero check as dead code. This patch moves the division to after the check. Signed-off-by: Nickolai Zeldovich <nickolai@csail.mit.edu> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Change-Id: I0a8ed8f8a3ccc89e3081c5c94ac67a0c70670529	2020-11-30 19:39:08 +03:00
Cong Wang	56efbb3ddf	xfrm: use separated locks to protect pointers of struct xfrm_state_afinfo afinfo->type_map and afinfo->mode_map deserve separated locks, they are different things. We should just take RCU read lock to protect afinfo itself, but not for the inner pointers. Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Change-Id: Ifaa8fbd443616d2a6b46efc09d6f17a5dc9906c1	2020-11-30 19:39:05 +03:00
Cong Wang	dc08fd9591	xfrm: replace rwlock on xfrm_km_list with rcu Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Change-Id: I207eef84335bf53a262f7836044d9c6d35223f95	2020-11-30 19:38:52 +03:00
Cong Wang	efb0f84146	xfrm: replace rwlock on xfrm_state_afinfo with rcu Similar to commit `418a99ac6a` (Replace rwlock on xfrm_policy_afinfo with rcu), the rwlock on xfrm_state_afinfo can be replaced by RCU too. Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Change-Id: I4cbf6c7a1f51d04651032df3de271f953d913190	2020-11-30 19:36:04 +03:00
Jussi Kivilinna	1a35fc8aaf	xfrm_algo: probe asynchronous block ciphers instead of synchronous IPSEC uses block ciphers asynchronous, but probes only for synchronous block ciphers and makes ealg entries only available if synchronous block cipher is found. So with setup, where hardware crypto driver registers asynchronous block ciphers and software crypto module is not build, ealg is not marked as being available. Use crypto_has_ablkcipher instead and remove ASYNC mask. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Change-Id: I54c7eacc0b466ac1531d203d188a8cc9d83d2358	2020-11-30 19:36:01 +03:00
Li RongQing	a529c015b3	xfrm: removes a superfluous check and add a statistic Remove the check if x->km.state equal to XFRM_STATE_VALID in xfrm_state_check_expire(), which will be done before call xfrm_state_check_expire(). add a LINUX_MIB_XFRMOUTSTATEINVALID statistic to record the outbound error due to invalid xfrm state. Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Change-Id: Iadce369611e5596daeac8b030ac316b7da8ca8d6	2020-11-30 19:35:58 +03:00
Shan Wei	fcec03d1a6	net: xfrm: use __this_cpu_read per-cpu helper this_cpu_ptr/this_cpu_read is faster than per_cpu_ptr(p, smp_processor_id()) and can reduce memory accesses. The latter helper needs to find the offset for current cpu, and needs more assembler instructions which objdump shows in following. this_cpu_ptr relocates and address. this_cpu_read() relocates the address and performs the fetch. this_cpu_read() saves you more instructions since it can do the relocation and the fetch in one instruction. per_cpu_ptr(p, smp_processor_id())： 1e: 65 8b 04 25 00 00 00 00 mov %gs:0x0,%eax 26: 48 98 cltq 28: 31 f6 xor %esi,%esi 2a: 48 c7 c7 00 00 00 00 mov $0x0,%rdi 31: 48 8b 04 c5 00 00 00 00 mov 0x0(,%rax,8),%rax 39: c7 44 10 04 14 00 00 00 movl $0x14,0x4(%rax,%rdx,1) this_cpu_ptr(p) 1e: 65 48 03 14 25 00 00 00 00 add %gs:0x0,%rdx 27: 31 f6 xor %esi,%esi 29: c7 42 04 14 00 00 00 movl $0x14,0x4(%rdx) 30: 48 c7 c7 00 00 00 00 mov $0x0,%rdi Signed-off-by: Shan Wei <davidshan@tencent.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Change-Id: If84706eaec5149a2c6cd8eff6804ca67f7a6e1bf	2020-11-30 19:35:55 +03:00
Ulrich Weber	a839075e33	xfrm: remove redundant replay_esn check x->replay_esn is already checked in if clause, so remove check and ident properly Signed-off-by: Ulrich Weber <ulrich.weber@sophos.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Change-Id: Ie49d59a73961651cb804085007919c834315f174	2020-11-30 19:35:52 +03:00
David S. Miller	390a9f5add	xfrm_user: Propagate netlink error codes properly. Instead of using a fixed value of "-1" or "-EMSGSIZE", propagate what the nla_*() interfaces actually return. Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I1e2b27cf758949e85967fbdd30266e19fdbbc14b	2020-11-30 19:35:49 +03:00
David S. Miller	303d8267f7	xfrm_user: Stop using NLA_PUT*(). These macros contain a hidden goto, and are thus extremely error prone and make code hard to audit. Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I902e64bada32556cac1091cccc77fe3b076ff056	2020-11-30 19:35:43 +03:00
Eric Dumazet	aedce03d5d	tcp: gso: fix truesize tracking [ Upstream commit `0d08c42cf9` ] commit `6ff50cd555` ("tcp: gso: do not generate out of order packets") had an heuristic that can trigger a warning in skb_try_coalesce(), because skb->truesize of the gso segments were exactly set to mss. This breaks the requirement that skb->truesize >= skb->len + truesizeof(struct sk_buff); It can trivially be reproduced by : ifconfig lo mtu 1500 ethtool -K lo tso off netperf As the skbs are looped into the TCP networking stack, skb_try_coalesce() warns us of these skb under-estimating their truesize. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: Iced7cd04a2860fcc14815c4f5564e01840ac41e3	2020-11-30 19:35:22 +03:00
Eric Dumazet	33e834ef26	tcp: tsq: restore minimal amount of queueing [ Upstream commit `98e09386c0` ] After commit `c9eeec26e3` ("tcp: TSQ can use a dynamic limit"), several users reported throughput regressions, notably on mvneta and wifi adapters. 802.11 AMPDU requires a fair amount of queueing to be effective. This patch partially reverts the change done in tcp_write_xmit() so that the minimal amount is sysctl_tcp_limit_output_bytes. It also remove the use of this sysctl while building skb stored in write queue, as TSO autosizing does the right thing anyway. Users with well behaving NICS and correct qdisc (like sch_fq), can then lower the default sysctl_tcp_limit_output_bytes value from 128KB to 8KB. This new usage of sysctl_tcp_limit_output_bytes permits each driver authors to check how their driver performs when/if the value is set to a minimum of 4KB. Normally, line rate for a single TCP flow should be possible, but some drivers rely on timers to perform TX completion and too long TX completion delays prevent reaching full throughput. Fixes: `c9eeec26e3` ("tcp: TSQ can use a dynamic limit") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Sujith Manoharan <sujith@msujith.org> Reported-by: Arnaud Ebalard <arno@natisbad.org> Tested-by: Sujith Manoharan <sujith@msujith.org> Cc: Felix Fietkau <nbd@openwrt.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: I1fab5c795148dff3da34b42c7808476f51d99f89	2020-11-30 19:35:19 +03:00
Eric Dumazet	31089dfedf	tcp: TSQ can use a dynamic limit [ Upstream commit `c9eeec26e3` ] When TCP Small Queues was added, we used a sysctl to limit amount of packets queues on Qdisc/device queues for a given TCP flow. Problem is this limit is either too big for low rates, or too small for high rates. Now TCP stack has rate estimation in sk->sk_pacing_rate, and TSO auto sizing, it can better control number of packets in Qdisc/device queues. New limit is two packets or at least 1 to 2 ms worth of packets. Low rates flows benefit from this patch by having even smaller number of packets in queues, allowing for faster recovery, better RTT estimations. High rates flows benefit from this patch by allowing more than 2 packets in flight as we had reports this was a limiting factor to reach line rate. [ In particular if TX completion is delayed because of coalescing parameters ] Example for a single flow on 10Gbp link controlled by FQ/pacing 14 packets in flight instead of 2 $ tc -s -d qd qdisc fq 8001: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 Sent 1168459366606 bytes 771822841 pkt (dropped 0, overlimits 0 requeues 6822476) rate 9346Mbit 771713pps backlog 953820b 14p requeues 6822476 2047 flow, 2046 inactive, 1 throttled, delay 15673 ns 2372 gc, 0 highprio, 0 retrans, 9739249 throttled, 0 flows_plimit Note that sk_pacing_rate is currently set to twice the actual rate, but this might be refined in the future when a flow is in congestion avoidance. Additional change : skb->destructor should be set to tcp_wfree(). A future patch (for linux 3.13+) might remove tcp_limit_output_bytes Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Wei Liu <wei.liu2@citrix.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Change-Id: I481dfb7b15ae4fc699e03e2ef846b0631c5ebb3f	2020-11-30 19:35:16 +03:00
Eric Dumazet	0cf083915c	tcp: gso: do not generate out of order packets GSO TCP handler has following issues : 1) ooo_okay from original GSO packet is duplicated to all segments 2) segments (but the last one) are orphaned, so transmit path can not get transmit queue number from the socket. This happens if GSO segmentation is done before stacked device for example. Result is we can send packets from a given TCP flow to different TX queues (if using multiqueue NICS). This generates OOO problems and spurious SACK & retransmits. Fix this by keeping socket pointer set for all segments. This means that every segment must also have a destructor, and the original gso skb truesize must be split on all segments, to keep precise sk->sk_wmem_alloc accounting. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I9d9d4216ec24ad4091535a9b9e811760cd949825	2020-11-30 19:35:13 +03:00
Eric Dumazet	1315c0d65e	tcp: tcp_tso_segment() small optimization We can move th->check computation out of the loop, as compiler doesn't know each skb initially share same tcp headers after skb_segment() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I7f81480604c4bd58f5cdd0be289233b9e6142a7f	2020-11-30 19:35:10 +03:00
Eric Dumazet	2687196048	tcp: GSO should be TSQ friendly I noticed that TSQ (TCP Small queues) was less effective when TSO is turned off, and GSO is on. If BQL is not enabled, TSQ has then no effect. It turns out the GSO engine frees the original gso_skb at the time the fragments are generated and queued to the NIC. We should instead call the tcp_wfree() destructor for the last fragment, to keep the flow control as intended in TSQ. This effectively limits the number of queued packets on qdisc + NIC layers. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: Idbe0206447bf979e012b6c4d5fe9e4fec10d5eb8	2020-11-30 19:35:07 +03:00
Eric Dumazet	e7e3467ab1	tcp: TCP Small Queues This introduce TSQ (TCP Small Queues) TSQ goal is to reduce number of TCP packets in xmit queues (qdisc & device queues), to reduce RTT and cwnd bias, part of the bufferbloat problem. sk->sk_wmem_alloc not allowed to grow above a given limit, allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a given time. TSO packets are sized/capped to half the limit, so that we have two TSO packets in flight, allowing better bandwidth use. As a side effect, setting the limit to 40000 automatically reduces the standard gso max limit (65536) to 40000/2 : It can help to reduce latencies of high prio packets, having smaller TSO packets. This means we divert sock_wfree() to a tcp_wfree() handler, to queue/send following frames when skb_orphan() [2] is called for the already queued skbs. Results on my dev machines (tg3/ixgbe nics) are really impressive, using standard pfifo_fast, and with or without TSO/GSO. Without reduction of nominal bandwidth, we have reduction of buffering per bulk sender : < 1ms on Gbit (instead of 50ms with TSO) < 8ms on 100Mbit (instead of 132 ms) I no longer have 4 MBytes backlogged in qdisc by a single netperf session, and both side socket autotuning no longer use 4 Mbytes. As skb destructor cannot restart xmit itself ( as qdisc lock might be taken at this point ), we delegate the work to a tasklet. We use one tasklest per cpu for performance reasons. If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag. This flag is tested in a new protocol method called from release_sock(), to eventually send new segments. [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable [2] skb_orphan() is usually called at TX completion time, but some drivers call it in their start_xmit() handler. These drivers should at least use BQL, or else a single TCP session can still fill the whole NIC TX ring, since TSQ will have no effect. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dave Taht <dave.taht@bufferbloat.net> Cc: Tom Herbert <therbert@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Change-Id: I37d5e4d7c9ced1846385b6a04ae3ad134763a949	2020-11-30 19:35:00 +03:00

1 2 3 4 5 ...

23974 commits