Commit graph

388 commits

Author SHA1 Message Date
David S. Miller
c275431b28 net: Document dst->obsolete better.
Add a big comment explaining how the field works, and use defines
instead of magic constants for the values assigned to it.

Suggested by Joe Perches.

Signed-off-by: David S. Miller <davem@davemloft.net>
Change-Id: I2c8ca84d38cd49ccc1207db588108d78e6f9403a
2020-11-30 19:39:24 +03:00
Luca Weiss
e4cede11f4 ipv4: Pass struct flowi4 directly to rt_fill_info
This is partly a backport of d6c0a4f609
  (ipv4: Kill 'rt_src' from 'struct rtable').

skb->sk can be null, and in fact it is when creating the buffer
in inet_rtm_getroute. There is no other way of accessing the flow,
so pass it directly.

Fixes invalid memory address when running 'ip route get $IPADDR'

Change-Id: I7b9e5499614b96360c9c8420907e82e145bb97f3
2020-10-25 02:37:54 -04:00
Pavel Emelyanov
36640b6239 net: Loopback ifindex is constant now
As pointed out, there are places, that access net->loopback_dev->ifindex
and after ifindex generation is made per-net this value becomes constant
equals 1. So go ahead and introduce the LOOPBACK_IFINDEX constant and use
it where appropriate.

Change-Id: I29fd08fa01a9522240ab654d436b02a577bb610c
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-27 14:52:48 +00:00
Artem Borisov
d7992e6feb Merge remote-tracking branch 'stable/linux-3.4.y' into lineage-15.1
All bluetooth-related changes were omitted because of our ancient incompatible bt stack.

Change-Id: I96440b7be9342a9c1adc9476066272b827776e64
2017-12-27 17:13:15 +03:00
Lorenzo Colitti
966db7310d net: core: add UID to flows, rules, and routes
- Define a new FIB rule attributes, FRA_UID_RANGE, to describe a
  range of UIDs.
- Define a RTA_UID attribute for per-UID route lookups and dumps.
- Support passing these attributes to and from userspace via
  rtnetlink. The value INVALID_UID indicates no UID was
  specified.
- Add a UID field to the flow structures.

[Backport of net-next 622ec2c9d52405973c9f1ca5116eb1c393adfc7d]

Bug: 16355602
Change-Id: Iea98e6fedd0fd4435a1f4efa3deb3629505619ab
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01 13:38:07 +03:00
Simon Shields
f33f0fc3ed Revert "net: core: Support UID-based routing."
This reverts commit dbadd302523011ab746840885b93e8bd0b2ff499.

Change-Id: I8d996d5a84a3203443987c844d10534cd0dfc2c1
2017-08-27 19:09:20 +03:00
Hannes Frederic Sowa
ef06d64826 ipv4: try to cache dst_entries which would cause a redirect
Not caching dst_entries which cause redirects could be exploited by hosts
on the same subnet, causing a severe DoS attack. This effect aggravated
since commit f886497212 ("ipv4: fix dst race in sk_dst_get()").

Lookups causing redirects will be allocated with DST_NOCACHE set which
will force dst_release to free them via RCU.  Unfortunately waiting for
RCU grace period just takes too long, we can end up with >1M dst_entries
waiting to be released and the system will run OOM. rcuos threads cannot
catch up under high softirq load.

Attaching the flag to emit a redirect later on to the specific skb allows
us to cache those dst_entries thus reducing the pressure on allocation
and deallocation.

This issue was discovered by Marcelo Leitner.

Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Marcelo Leitner <mleitner@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>

Change-Id: I53e4b500a4db2f5fece937a42a3bd810b2640c44
2016-10-29 23:12:10 +08:00
Marcelo Ricardo Leitner
50f1b3d5a4 ipv4: disable bh while doing route gc
Further tests revealed that after moving the garbage collector to a work
queue and protecting it with a spinlock may leave the system prone to
soft lockups if bottom half gets very busy.

It was reproced with a set of firewall rules that REJECTed packets. If
the NIC bottom half handler ends up running on the same CPU that is
running the garbage collector on a very large cache, the garbage
collector will not be able to do its job due to the amount of work
needed for handling the REJECTs and also won't reschedule.

The fix is to disable bottom half during the garbage collecting, as it
already was in the first place (most calls to it came from softirqs).

Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Zefan Li <lizefan@huawei.com>
2014-12-01 18:02:44 +08:00
Marcelo Ricardo Leitner
b54ca60899 ipv4: avoid parallel route cache gc executions
When rt_intern_hash() has to deal with neighbour cache overflowing,
it triggers the route cache garbage collector in an attempt to free
some references on neighbour entries.

Such call cannot be done async but should also not run in parallel with
an already-running one, so that they don't collapse fighting over the
hash lock entries.

This patch thus blocks parallel executions with spinlocks:
- A call from worker and from rt_intern_hash() are not the same, and
cannot be merged, thus they will wait each other on rt_gc_lock.
- Calls to gc from rt_intern_hash() may happen in parallel but we must
wait for it to finish in order to try again. This dedup and
synchrinozation is then performed by the locking just before calling
__do_rt_garbage_collect().

Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Zefan Li <lizefan@huawei.com>
2014-12-01 18:02:44 +08:00
Marcelo Ricardo Leitner
b6153ead0d ipv4: move route garbage collector to work queue
Currently the route garbage collector gets called by dst_alloc() if it
have more entries than the threshold. But it's an expensive call, that
don't really need to be done by then.

Another issue with current way is that it allows running the garbage
collector with the same start parameters on multiple CPUs at once, which
is not optimal. A system may even soft lockup if the cache is big enough
as the garbage collectors will be fighting over the hash lock entries.

This patch thus moves the garbage collector to run asynchronously on a
work queue, much similar to how rt_expire_check runs.

There is one condition left that allows multiple executions, which is
handled by the next patch.

Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Zefan Li <lizefan@huawei.com>
2014-12-01 18:02:43 +08:00
Eric Dumazet
509a15a5d6 ip: make IP identifiers less predictable
[ Upstream commit 04ca6973f7 ]

In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and
Jedidiah describe ways exploiting linux IP identifier generation to
infer whether two machines are exchanging packets.

With commit 73f156a6e8 ("inetpeer: get rid of ip_id_count"), we
changed IP id generation, but this does not really prevent this
side-channel technique.

This patch adds a random amount of perturbation so that IP identifiers
for a given destination [1] are no longer monotonically increasing after
an idle period.

Note that prandom_u32_max(1) returns 0, so if generator is used at most
once per jiffy, this patch inserts no hole in the ID suite and do not
increase collision probability.

This is jiffies based, so in the worst case (HZ=1000), the id can
rollover after ~65 seconds of idle time, which should be fine.

We also change the hash used in __ip_select_ident() to not only hash
on daddr, but also saddr and protocol, so that ICMP probes can not be
used to infer information for other protocols.

For IPv6, adds saddr into the hash as well, but not nexthdr.

If I ping the patched target, we can see ID are now hard to predict.

21:57:11.008086 IP (...)
    A > target: ICMP echo request, seq 1, length 64
21:57:11.010752 IP (... id 2081 ...)
    target > A: ICMP echo reply, seq 1, length 64

21:57:12.013133 IP (...)
    A > target: ICMP echo request, seq 2, length 64
21:57:12.015737 IP (... id 3039 ...)
    target > A: ICMP echo reply, seq 2, length 64

21:57:13.016580 IP (...)
    A > target: ICMP echo request, seq 3, length 64
21:57:13.019251 IP (... id 3437 ...)
    target > A: ICMP echo reply, seq 3, length 64

[1] TCP sessions uses a per flow ID generator not changed by this patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jeffrey Knockel <jeffk@cs.unm.edu>
Reported-by: Jedidiah R. Crandall <crandall@cs.unm.edu>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Hannes Frederic Sowa <hannes@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-08-14 08:42:35 +08:00
Eric Dumazet
ad52eef552 inetpeer: get rid of ip_id_count
[ Upstream commit 73f156a6e8 ]

Ideally, we would need to generate IP ID using a per destination IP
generator.

linux kernels used inet_peer cache for this purpose, but this had a huge
cost on servers disabling MTU discovery.

1) each inet_peer struct consumes 192 bytes

2) inetpeer cache uses a binary tree of inet_peer structs,
   with a nominal size of ~66000 elements under load.

3) lookups in this tree are hitting a lot of cache lines, as tree depth
   is about 20.

4) If server deals with many tcp flows, we have a high probability of
   not finding the inet_peer, allocating a fresh one, inserting it in
   the tree with same initial ip_id_count, (cf secure_ip_id())

5) We garbage collect inet_peer aggressively.

IP ID generation do not have to be 'perfect'

Goal is trying to avoid duplicates in a short period of time,
so that reassembly units have a chance to complete reassembly of
fragments belonging to one message before receiving other fragments
with a recycled ID.

We simply use an array of generators, and a Jenkin hash using the dst IP
as a key.

ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
belongs (it is only used from this file)

secure_ip_id() and secure_ipv6_id() no longer are needed.

Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
unnecessary decrement/increment of the number of segments.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-08-14 08:42:35 +08:00
Lorenzo Colitti
553313899d net: core: Support UID-based routing.
This contains the following commits:

1. 0149763 net: core: Add a UID range to fib rules.
2. 1650474 net: core: Use the socket UID in routing lookups.
3. 0b16771 net: ipv4: Add the UID to the route cache.
4. ee058f1 net: core: Add a RTA_UID attribute to routes.
    This is so that userspace can do per-UID route lookups.

Bug: 15413527
Change-Id: I1285474c6734614d3bda6f61d88dfe89a4af7892
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
2014-07-08 16:55:52 -07:00
Li RongQing
62e1a647e7 ipv4: initialise the itag variable in __mkroute_input
[ Upstream commit fbdc0ad095 ]

the value of itag is a random value from stack, and may not be initiated by
fib_validate_source, which called fib_combine_itag if CONFIG_IP_ROUTE_CLASSID
is not set

This will make the cached dst uncertainty

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07 16:02:00 -07:00
Jiri Benc
ad61d4c760 ipv4: fix ineffective source address selection
[ Upstream commit 0a7e226090 ]

When sending out multicast messages, the source address in inet->mc_addr is
ignored and rewritten by an autoselected one. This is caused by a typo in
commit 813b3b5db8 ("ipv4: Use caller's on-stack flowi as-is in output
route lookups").

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-04 04:23:40 -08:00
Linus Torvalds
ed359a3b7b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Provide device string properly for USB i2400m wimax devices, also
    don't OOPS when providing firmware string.  From Phil Sutter.

 2) Add support for sh_eth SH7734 chips, from Nobuhiro Iwamatsu.

 3) Add another device ID to USB zaurus driver, from Guan Xin.

 4) Loop index start in pool vector iterator is wrong causing MAC to not
    get configured in bnx2x driver, fix from Dmitry Kravkov.

 5) EQL driver assumes HZ=100, fix from Eric Dumazet.

 6) Now that skb_add_rx_frag() can specify the truesize increment
    separately, do so in f_phonet and cdc_phonet, also from Eric
    Dumazet.

 7) virtio_net accidently uses net_ratelimit() not only on the kernel
    warning but also the statistic bump, fix from Rick Jones.

 8) ip_route_input_mc() uses fixed init_net namespace, oops, use
    dev_net(dev) instead.  Fix from Benjamin LaHaise.

 9) dev_forward_skb() needs to clear the incoming interface index of the
    SKB so that it looks like a new incoming packet, also from Benjamin
    LaHaise.

10) iwlwifi mistakenly initializes a channel entry as 2GHZ instead of
    5GHZ, fix from Stanislav Yakovlev.

11) Missing kmalloc() return value checks in orinoco, from Santosh
    Nayak.

12) ath9k doesn't check for HT capabilities in the right way, it is
    checking ht_supported instead of the ATH9K_HW_CAP_HT flag.  Fix from
    Sujith Manoharan.

13) Fix x86 BPF JIT emission of 16-bit immediate field of AND
    instructions, from Feiran Zhuang.

14) Avoid infinite loop in GARP code when registering sysfs entries.
    From David Ward.

15) rose protocol uses memcpy instead of memcmp in a device address
    comparison, oops.  Fix from Daniel Borkmann.

16) Fix build of lpc_eth due to dev_hw_addr_rancom() interface being
    renamed to eth_hw_addr_random().  From Roland Stigge.

17) Make ipv6 RTM_GETROUTE interpret RTA_IIF attribute the same way
    that ipv4 does.  Fix from Shmulik Ladkani.

18) via-rhine has an inverted bit test, causing suspend/resume
    regressions.  Fix from Andreas Mohr.

19) RIONET assumes 4K page size, fix from Akinobu Mita.

20) Initialization of imask register in sky2 is buggy, because bits are
    "or'd" into an uninitialized local variable.  Fix from Lino
    Sanfilippo.

21) Fix FCOE checksum offload handling, from Yi Zou.

22) Fix VLAN processing regression in e1000, from Jiri Pirko.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
  sky2: dont overwrite settings for PHY Quick link
  tg3: Fix 5717 serdes powerdown problem
  net: usb: cdc_eem: fix mtu
  net: sh_eth: fix endian check for architecture independent
  usb/rtl8150 : Remove duplicated definitions
  rionet: fix page allocation order of rionet_active
  via-rhine: fix wait-bit inversion.
  ipv6: Fix RTM_GETROUTE's interpretation of RTA_IIF to be consistent with ipv4
  net: lpc_eth: Fix rename of dev_hw_addr_random
  net/netfilter/nfnetlink_acct.c: use linux/atomic.h
  rose_dev: fix memcpy-bug in rose_set_mac_address
  Fix non TBI PHY access; a bad merge undid bug fix in a previous commit.
  net/garp: avoid infinite loop if attribute already exists
  x86 bpf_jit: fix a bug in emitting the 16-bit immediate operand of AND
  bonding: emit event when bonding changes MAC
  mac80211: fix oper channel timestamp updation
  ath9k: Use HW HT capabilites properly
  MAINTAINERS: adding maintainer for ipw2x00
  net: orinoco: add error handling for failed kmalloc().
  net/wireless: ipw2x00: fix a typo in wiphy struct initilization
  ...
2012-04-02 17:53:39 -07:00
David Howells
9ffc93f203 Remove all #inclusions of asm/system.h
Remove all #inclusions of asm/system.h preparatory to splitting and killing
it.  Performed with the following command:

perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`

Signed-off-by: David Howells <dhowells@redhat.com>
2012-03-28 18:30:03 +01:00
Benjamin LaHaise
4e7b2f1454 net/ipv4: fix IPv4 multicast over network namespaces
When using multicast over a local bridge feeding a number of LXC guests
using veth, the LXC guests are unable to get a response from other guests
when pinging 224.0.0.1.  Multicast packets did not appear to be getting
delivered to the network namespaces of the guest hosts, and further
inspection showed that the incoming route was pointing to the loopback
device of the host, not the guest.  This lead to the wrong network namespace
being picked up by sockets (like ICMP).  Fix this by using the correct
network namespace when creating the inbound route entry.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-28 04:45:37 -04:00
Joe Perches
afd465030a net: ipv4: Standardize prefixes for message logging
Add #define pr_fmt(fmt) as appropriate.

Add "IPv4: ", "TCP: ", and "IPsec: " to appropriate files.
Standardize on "UDPLite: " for appropriate uses.
Some prefixes were previously "UDPLITE: " and "UDP-Lite: ".

Add KBUILD_MODNAME ": " to icmp and gre.
Remove embedded prefixes as appropriate.

Add missing "\n" to pr_info in gre.c.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-12 17:05:21 -07:00
Joe Perches
058bd4d2a4 net: Convert printks to pr_<level>
Use a more current kernel messaging style.

Convert a printk block to print_hex_dump.
Coalesce formats, align arguments.
Use %s, __func__ instead of embedding function names.

Some messages that were prefixed with <foo>_close are
now prefixed with <foo>_fini.  Some ah4 and esp messages
are now not prefixed with "ip ".

The intent of this patch is to later add something like
  #define pr_fmt(fmt) "IPv4: " fmt.
to standardize the output messages.

Text size is trivially reduced. (x86-32 allyesconfig)

$ size net/ipv4/built-in.o*
   text	   data	    bss	    dec	    hex	filename
 887888	  31558	 249696	1169142	 11d6f6	net/ipv4/built-in.o.new
 887934	  31558	 249800	1169292	 11d78c	net/ipv4/built-in.o.old

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-11 23:42:51 -07:00
David S. Miller
b2d3298e09 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-03-09 14:34:20 -08:00
Steffen Klassert
ac3f48de09 route: Remove redirect_genid
As we invalidate the inetpeer tree along with the routing cache now,
we don't need a genid to reset the redirect handling when the routing
cache is flushed.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-08 00:30:32 -08:00
Steffen Klassert
5faa5df1fa inetpeer: Invalidate the inetpeer tree along with the routing cache
We initialize the routing metrics with the values cached on the
inetpeer in rt_init_metrics(). So if we have the metrics cached on the
inetpeer, we ignore the user configured fib_metrics.

To fix this issue, we replace the old tree with a fresh initialized
inet_peer_base. The old tree is removed later with a delayed work queue.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-08 00:30:24 -08:00
David S. Miller
80703d265b ipv4: Eliminate spurious argument to __ipv4_neigh_lookup
'tbl' is always arp_tbl, so specifying it is pointless.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-15 17:48:35 -05:00
David S. Miller
39232973b7 ipv4/ipv6: Prepare for new route gateway semantics.
In the future the ipv4/ipv6 route gateway will take on two types
of values:

1) INADDR_ANY/IN6ADDR_ANY, for local network routes, and in this case
   the neighbour must be obtained using the destination address in
   ipv4/ipv6 header as the lookup key.

2) Everything else, the actual nexthop route address.

So if the gateway is not inaddr-any we use it, otherwise we must use
the packet's destination address.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-26 15:22:32 -05:00
David S. Miller
abb434cb05 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/bluetooth/l2cap_core.c

Just two overlapping changes, one added an initialization of
a local variable, and another change added a new local variable.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-23 17:13:56 -05:00
Eric Dumazet
e688a60480 net: introduce DST_NOPEER dst flag
Chris Boot reported crashes occurring in ipv6_select_ident().

[  461.457562] RIP: 0010:[<ffffffff812dde61>]  [<ffffffff812dde61>]
ipv6_select_ident+0x31/0xa7

[  461.578229] Call Trace:
[  461.580742] <IRQ>
[  461.582870]  [<ffffffff812efa7f>] ? udp6_ufo_fragment+0x124/0x1a2
[  461.589054]  [<ffffffff812dbfe0>] ? ipv6_gso_segment+0xc0/0x155
[  461.595140]  [<ffffffff812700c6>] ? skb_gso_segment+0x208/0x28b
[  461.601198]  [<ffffffffa03f236b>] ? ipv6_confirm+0x146/0x15e
[nf_conntrack_ipv6]
[  461.608786]  [<ffffffff81291c4d>] ? nf_iterate+0x41/0x77
[  461.614227]  [<ffffffff81271d64>] ? dev_hard_start_xmit+0x357/0x543
[  461.620659]  [<ffffffff81291cf6>] ? nf_hook_slow+0x73/0x111
[  461.626440]  [<ffffffffa0379745>] ? br_parse_ip_options+0x19a/0x19a
[bridge]
[  461.633581]  [<ffffffff812722ff>] ? dev_queue_xmit+0x3af/0x459
[  461.639577]  [<ffffffffa03747d2>] ? br_dev_queue_push_xmit+0x72/0x76
[bridge]
[  461.646887]  [<ffffffffa03791e3>] ? br_nf_post_routing+0x17d/0x18f
[bridge]
[  461.653997]  [<ffffffff81291c4d>] ? nf_iterate+0x41/0x77
[  461.659473]  [<ffffffffa0374760>] ? br_flood+0xfa/0xfa [bridge]
[  461.665485]  [<ffffffff81291cf6>] ? nf_hook_slow+0x73/0x111
[  461.671234]  [<ffffffffa0374760>] ? br_flood+0xfa/0xfa [bridge]
[  461.677299]  [<ffffffffa0379215>] ?
nf_bridge_update_protocol+0x20/0x20 [bridge]
[  461.684891]  [<ffffffffa03bb0e5>] ? nf_ct_zone+0xa/0x17 [nf_conntrack]
[  461.691520]  [<ffffffffa0374760>] ? br_flood+0xfa/0xfa [bridge]
[  461.697572]  [<ffffffffa0374812>] ? NF_HOOK.constprop.8+0x3c/0x56
[bridge]
[  461.704616]  [<ffffffffa0379031>] ?
nf_bridge_push_encap_header+0x1c/0x26 [bridge]
[  461.712329]  [<ffffffffa037929f>] ? br_nf_forward_finish+0x8a/0x95
[bridge]
[  461.719490]  [<ffffffffa037900a>] ?
nf_bridge_pull_encap_header+0x1c/0x27 [bridge]
[  461.727223]  [<ffffffffa0379974>] ? br_nf_forward_ip+0x1c0/0x1d4 [bridge]
[  461.734292]  [<ffffffff81291c4d>] ? nf_iterate+0x41/0x77
[  461.739758]  [<ffffffffa03748cc>] ? __br_deliver+0xa0/0xa0 [bridge]
[  461.746203]  [<ffffffff81291cf6>] ? nf_hook_slow+0x73/0x111
[  461.751950]  [<ffffffffa03748cc>] ? __br_deliver+0xa0/0xa0 [bridge]
[  461.758378]  [<ffffffffa037533a>] ? NF_HOOK.constprop.4+0x56/0x56
[bridge]

This is caused by bridge netfilter special dst_entry (fake_rtable), a
special shared entry, where attaching an inetpeer makes no sense.

Problem is present since commit 87c48fa3b4 (ipv6: make fragment
identifications less predictable)

Introduce DST_NOPEER dst flag and make sure ipv6_select_ident() and
__ip_select_ident() fallback to the 'no peer attached' handling.

Reported-by: Chris Boot <bootc@bootc.net>
Tested-by: Chris Boot <bootc@bootc.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-22 22:34:56 -05:00
Stephen Rothwell
b9eda06f80 ipv4: using prefetch requires including prefetch.h
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-22 09:16:09 -08:00
Eric Dumazet
9f28a2fc0b ipv4: reintroduce route cache garbage collector
Commit 2c8cec5c10 (ipv4: Cache learned PMTU information in inetpeer)
removed IP route cache garbage collector a bit too soon, as this gc was
responsible for expired routes cleanup, releasing their neighbour
reference.

As pointed out by Robert Gladewitz, recent kernels can fill and exhaust
their neighbour cache.

Reintroduce the garbage collection, since we'll have to wait our
neighbour lookups become refcount-less to not depend on this stuff.

Reported-by: Robert Gladewitz <gladewitz@gmx.de>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-21 15:47:16 -05:00
David S. Miller
959327c784 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2011-12-06 21:10:05 -05:00
David Miller
2721745501 net: Rename dst_get_neighbour{, _raw} to dst_get_neighbour_noref{, _raw}.
To reflect the fact that a refrence is not obtained to the
resulting neighbour entry.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Roland Dreier <roland@purestorage.com>
2011-12-05 15:20:19 -05:00
David S. Miller
de398fb8b9 ipv4: Fix peer validation on cached lookup.
If ipv4_valdiate_peer() fails during a cached entry lookup,
we'll NULL derer since the loop iterator assumes rth is not
NULL.

Letting this be handled as a failure is just bogus, so just make it
not fail.  If we have trouble getting a non-NULL neighbour for the
redirected gateway, just restore the original gateway and continue.

The very next use of this cached route will try again.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-05 13:21:42 -05:00
Julian Anastasov
f61759e6b8 ipv4: make sure RTO_ONLINK is saved in routing cache
__mkroute_output fails to work with the original tos
and uses value with stripped RTO_ONLINK bit. Make sure we put
the original TOS bits into rt_key_tos because it used to match
cached route.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-03 01:32:23 -05:00
David S. Miller
b3613118eb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2011-12-02 13:49:21 -05:00
David S. Miller
efbc368dcc ipv4: Perform peer validation on cached route lookup.
Otherwise we won't notice the peer GENID change.

Reported-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-01 13:38:59 -05:00
David Miller
32092ecf06 atm: clip: Use device neigh support on top of "arp_tbl".
Instead of instantiating an entire new neigh_table instance
just for ATM handling, use the neigh device private facility.

Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-30 18:51:03 -05:00
Eric Dumazet
218fa90f07 ipv4: fix lockdep splat in rt_cache_seq_show
After commit f2c31e32b3 (fix NULL dereferences in check_peer_redir()),
dst_get_neighbour() should be guarded by rcu_read_lock() /
rcu_read_unlock() section.

Reported-by: Miles Lane <miles.lane@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-30 17:24:14 -05:00
Eric Dumazet
de68dca181 inet: add a redirect generation id in inetpeer
Now inetpeer is the place where we cache redirect information for ipv4
destinations, we must be able to invalidate informations when a route is
added/removed on host.

As inetpeer is not yet namespace aware, this patch adds a shared
redirect_genid, and a per inetpeer redirect_genid. This might be changed
later if inetpeer becomes ns aware.

Cache information for one inerpeer is valid as long as its
redirect_genid has the same value than global redirect_genid.

Reported-by: Arkadiusz Miśkiewicz <a.miskiewicz@gmail.com>
Tested-by: Arkadiusz Miśkiewicz <a.miskiewicz@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-26 19:16:37 -05:00
Steffen Klassert
261663b0ee ipv4: Don't use the cached pmtu informations for input routes
The pmtu informations on the inetpeer are visible for output and
input routes. On packet forwarding, we might propagate a learned
pmtu to the sender. As we update the pmtu informations of the
inetpeer on demand, the original sender of the forwarded packets
might never notice when the pmtu to that inetpeer increases.
So use the mtu of the outgoing device on packet forwarding instead
of the pmtu to the final destination.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-26 14:29:52 -05:00
Steffen Klassert
618f9bc74a net: Move mtu handling down to the protocol depended handlers
We move all mtu handling from dst_mtu() down to the protocol
layer. So each protocol can implement the mtu handling in
a different manner.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-26 14:29:51 -05:00
Steffen Klassert
ebb762f27f net: Rename the dst_opt default_mtu method to mtu
We plan to invoke the dst_opt->default_mtu() method unconditioally
from dst_mtu(). So rename the method to dst_opt->mtu() to match
the name with the new meaning.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-26 14:29:50 -05:00
Steffen Klassert
6b600b26c0 route: Use the device mtu as the default for blackhole routes
As it is, we return null as the default mtu of blackhole routes.
This may lead to a propagation of a bogus pmtu if the default_mtu
method of a blackhole route is invoked. So return dst->dev->mtu
as the default mtu instead.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-26 14:29:50 -05:00
Eric Dumazet
9cc20b268a ipv4: fix redirect handling
commit f39925dbde (ipv4: Cache learned redirect information in
inetpeer.) introduced a regression in ICMP redirect handling.

It assumed ipv4_dst_check() would be called because all possible routes
were attached to the inetpeer we modify in ip_rt_redirect(), but thats
not true.

commit 7cc9150ebe (route: fix ICMP redirect validation) tried to fix
this but solution was not complete. (It fixed only one route)

So we must lookup existing routes (including different TOS values) and
call check_peer_redir() on them.

Reported-by: Ivan Zahariev <famzah@icdsoft.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-18 15:24:32 -05:00
Steffen Klassert
2bc8ca40f9 ipv4: Fix inetpeer expire time information
As we update the learned pmtu informations on demand, we might
report a nagative expiration time value to userspace if the
pmtu informations are already expired and we have not send a
packet to that inetpeer after expiration. With this patch we
send a expire time of null to userspace after expiration
until the next packet is send to that inetpeer.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-08 14:40:40 -05:00
Gao feng
59445b6b1f ipv4: avoid useless call of the function check_peer_pmtu
In func ipv4_dst_check,check_peer_pmtu should be called only when peer is updated.
So,if the peer is not updated in ip_rt_frag_needed,we can not inc __rt_peer_genid.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-24 18:30:07 -04:00
David S. Miller
1805b2f048 Merge branch 'master' of ra.kernel.org:/pub/scm/linux/kernel/git/davem/net 2011-10-24 18:18:09 -04:00
Flavio Leitner
7cc9150ebe route: fix ICMP redirect validation
The commit f39925dbde
(ipv4: Cache learned redirect information in inetpeer.)
removed some ICMP packet validations which are required by
RFC 1122, section 3.2.2.2:
...
  A Redirect message SHOULD be silently discarded if the new
  gateway address it specifies is not on the same connected
  (sub-) net through which the Redirect arrived [INTRO:2,
  Appendix A], or if the source of the Redirect is not the
  current first-hop gateway for the specified destination (see
  Section 3.3.1).

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-24 02:56:38 -04:00
Vasily Averin
349d2895cc ipv4: NET_IPV4_ROUTE_GC_INTERVAL removal
removing obsoleted sysctl,
ip_rt_gc_interval variable no longer used since 2.6.38

Signed-off-by: Vasily Averin <vvs@sw.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-03 14:13:01 -04:00
David S. Miller
823dcd2506 Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net 2011-08-20 10:39:12 -07:00
Eric Dumazet
33d480ce6d net: cleanup some rcu_dereference_raw
RCU api had been completed and rcu_access_pointer() or
rcu_dereference_protected() are better than generic
rcu_dereference_raw()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-08-12 02:55:28 -07:00