Commit graph

21800 commits

Author SHA1 Message Date
Linus Torvalds
0d9cabdcce Merge branch 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup changes from Tejun Heo:
 "Out of the 8 commits, one fixes a long-standing locking issue around
  tasklist walking and others are cleanups."

* 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list
  cgroup: Remove wrong comment on cgroup_enable_task_cg_list()
  cgroup: remove cgroup_subsys argument from callbacks
  cgroup: remove extra calls to find_existing_css_set
  cgroup: replace tasklist_lock with rcu_read_lock
  cgroup: simplify double-check locking in cgroup_attach_proc
  cgroup: move struct cgroup_pidlist out from the header file
  cgroup: remove cgroup_attach_task_current_cg()
2012-03-20 18:11:21 -07:00
Linus Torvalds
843ec558f9 tty and serial merge for 3.4-rc1
Here's the big serial and tty merge for the 3.4-rc1 tree.
 
 There's loads of fixes and reworks in here from Jiri for the tty layer,
 and a number of patches from Alan to help try to wrestle the vt layer
 into a sane model.
 
 Other than that, lots of driver updates and fixes, and other minor
 stuff, all detailed in the shortlog.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iEYEABECAAYFAk9nihQACgkQMUfUDdst+ylXTQCdFuwVuZgjCts+xDVa1jX2ac84
 UogAn3Wr+P7NYFN6gvaGm52KbGbZs405
 =2b/l
 -----END PGP SIGNATURE-----

Merge tag 'tty-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull TTY/serial patches from Greg KH:
 "tty and serial merge for 3.4-rc1

  Here's the big serial and tty merge for the 3.4-rc1 tree.

  There's loads of fixes and reworks in here from Jiri for the tty
  layer, and a number of patches from Alan to help try to wrestle the vt
  layer into a sane model.

  Other than that, lots of driver updates and fixes, and other minor
  stuff, all detailed in the shortlog."

* tag 'tty-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (132 commits)
  serial: pxa: add clk_prepare/clk_unprepare calls
  TTY: Wrong unicode value copied in con_set_unimap()
  serial: PL011: clear pending interrupts
  serial: bfin-uart: Don't access tty circular buffer in TX DMA interrupt after it is reset.
  vt: NULL dereference in vt_do_kdsk_ioctl()
  tty: serial: vt8500: fix annotations for probe/remove
  serial: remove back and forth conversions in serial_out_sync
  serial: use serial_port_in/out vs serial_in/out in 8250
  serial: introduce generic port in/out helpers
  serial: reduce number of indirections in 8250 code
  serial: delete useless void casts in 8250.c
  serial: make 8250's serial_in shareable to other drivers.
  serial: delete last unused traces of pausing I/O in 8250
  pch_uart: Add module parameter descriptions
  pch_uart: Use existing default_baud in setup_console
  pch_uart: Add user_uartclk parameter
  pch_uart: Add Fish River Island II uart clock quirks
  pch_uart: Use uartclk instead of base_baud
  mpc5200b/uart: select more tolerant uart prescaler on low baudrates
  tty: moxa: fix bit test in moxa_start()
  ...
2012-03-20 11:24:39 -07:00
Linus Torvalds
9c2b957db1 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events changes for v3.4 from Ingo Molnar:

 - New "hardware based branch profiling" feature both on the kernel and
   the tooling side, on CPUs that support it.  (modern x86 Intel CPUs
   with the 'LBR' hardware feature currently.)

   This new feature is basically a sophisticated 'magnifying glass' for
   branch execution - something that is pretty difficult to extract from
   regular, function histogram centric profiles.

   The simplest mode is activated via 'perf record -b', and the result
   looks like this in perf report:

	$ perf record -b any_call,u -e cycles:u branchy

	$ perf report -b --sort=symbol
	    52.34%  [.] main                   [.] f1
	    24.04%  [.] f1                     [.] f3
	    23.60%  [.] f1                     [.] f2
	     0.01%  [k] _IO_new_file_xsputn    [k] _IO_file_overflow
	     0.01%  [k] _IO_vfprintf_internal  [k] _IO_new_file_xsputn
	     0.01%  [k] _IO_vfprintf_internal  [k] strchrnul
	     0.01%  [k] __printf               [k] _IO_vfprintf_internal
	     0.01%  [k] main                   [k] __printf

   This output shows from/to branch columns and shows the highest
   percentage (from,to) jump combinations - i.e.  the most likely taken
   branches in the system.  "branches" can also include function calls
   and any other synchronous and asynchronous transitions of the
   instruction pointer that are not 'next instruction' - such as system
   calls, traps, interrupts, etc.

   This feature comes with (hopefully intuitive) flat ascii and TUI
   support in perf report.

 - Various 'perf annotate' visual improvements for us assembly junkies.
   It will now recognize function calls in the TUI and by hitting enter
   you can follow the call (recursively) and back, amongst other
   improvements.

 - Multiple threads/processes recording support in perf record, perf
   stat, perf top - which is activated via a comma-list of PIDs:

	perf top -p 21483,21485
	perf stat -p 21483,21485 -ddd
	perf record -p 21483,21485

 - Support for per UID views, via the --uid paramter to perf top, perf
   report, etc.  For example 'perf top --uid mingo' will only show the
   tasks that I am running, excluding other users, root, etc.

 - Jump label restructurings and improvements - this includes the
   factoring out of the (hopefully much clearer) include/linux/static_key.h
   generic facility:

	struct static_key key = STATIC_KEY_INIT_FALSE;

	...

	if (static_key_false(&key))
	        do unlikely code
	else
	        do likely code

	...
	static_key_slow_inc();
	...
	static_key_slow_inc();
	...

   The static_key_false() branch will be generated into the code with as
   little impact to the likely code path as possible.  the
   static_key_slow_*() APIs flip the branch via live kernel code patching.

   This facility can now be used more widely within the kernel to
   micro-optimize hot branches whose likelihood matches the static-key
   usage and fast/slow cost patterns.

 - SW function tracer improvements: perf support and filtering support.

 - Various hardenings of the perf.data ABI, to make older perf.data's
   smoother on newer tool versions, to make new features integrate more
   smoothly, to support cross-endian recording/analyzing workflows
   better, etc.

 - Restructuring of the kprobes code, the splitting out of 'optprobes',
   and a corner case bugfix.

 - Allow the tracing of kernel console output (printk).

 - Improvements/fixes to user-space RDPMC support, allowing user-space
   self-profiling code to extract PMU counts without performing any
   system calls, while playing nice with the kernel side.

 - 'perf bench' improvements

 - ... and lots of internal restructurings, cleanups and fixes that made
   these features possible.  And, as usual this list is incomplete as
   there were also lots of other improvements

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (120 commits)
  perf report: Fix annotate double quit issue in branch view mode
  perf report: Remove duplicate annotate choice in branch view mode
  perf/x86: Prettify pmu config literals
  perf report: Enable TUI in branch view mode
  perf report: Auto-detect branch stack sampling mode
  perf record: Add HEADER_BRANCH_STACK tag
  perf record: Provide default branch stack sampling mode option
  perf tools: Make perf able to read files from older ABIs
  perf tools: Fix ABI compatibility bug in print_event_desc()
  perf tools: Enable reading of perf.data files from different ABI rev
  perf: Add ABI reference sizes
  perf report: Add support for taken branch sampling
  perf record: Add support for sampling taken branch
  perf tools: Add code to support PERF_SAMPLE_BRANCH_STACK
  x86/kprobes: Split out optprobe related code to kprobes-opt.c
  x86/kprobes: Fix a bug which can modify kernel code permanently
  x86/kprobes: Fix instruction recovery on optimized path
  perf: Add callback to flush branch_stack on context switch
  perf: Disable PERF_SAMPLE_BRANCH_* when not supported
  perf/x86: Add LBR software filter support for Intel CPUs
  ...
2012-03-20 10:29:15 -07:00
Linus Torvalds
5928a2b60c Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU changes for v3.4 from Ingo Molnar.  The major features of this
series are:

 - making RCU more aggressive about entering dyntick-idle mode in order
   to improve energy efficiency

 - converting a few more call_rcu()s to kfree_rcu()s

 - applying a number of rcutree fixes and cleanups to rcutiny

 - removing CONFIG_SMP #ifdefs from treercu

 - allowing RCU CPU stall times to be set via sysfs

 - adding CPU-stall capability to rcutorture

 - adding more RCU-abuse diagnostics

 - updating documentation

 - fixing yet more issues located by the still-ongoing top-to-bottom
   inspection of RCU, this time with a special focus on the CPU-hotplug
   code path.

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
  rcu: Stop spurious warnings from synchronize_sched_expedited
  rcu: Hold off RCU_FAST_NO_HZ after timer posted
  rcu: Eliminate softirq-mediated RCU_FAST_NO_HZ idle-entry loop
  rcu: Add RCU_NONIDLE() for idle-loop RCU read-side critical sections
  rcu: Allow nesting of rcu_idle_enter() and rcu_idle_exit()
  rcu: Remove redundant check for rcu_head misalignment
  PTR_ERR should be called before its argument is cleared.
  rcu: Convert WARN_ON_ONCE() in rcu_lock_acquire() to lockdep
  rcu: Trace only after NULL-pointer check
  rcu: Call out dangers of expedited RCU primitives
  rcu: Rework detection of use of RCU by offline CPUs
  lockdep: Add CPU-idle/offline warning to lockdep-RCU splat
  rcu: No interrupt disabling for rcu_prepare_for_idle()
  rcu: Move synchronize_sched_expedited() to rcutree.c
  rcu: Check for illegal use of RCU from offlined CPUs
  rcu: Update stall-warning documentation
  rcu: Add CPU-stall capability to rcutorture
  rcu: Make documentation give more realistic rcutorture duration
  rcutorture: Permit holding off CPU-hotplug operations during boot
  rcu: Print scheduling-clock information on RCU CPU stall-warning messages
  ...
2012-03-20 10:10:18 -07:00
Pablo Neira Ayuso
a16a1647fa netfilter: ctnetlink: fix race between delete and timeout expiration
Kerin Millar reported hardlockups while running `conntrackd -c'
in a busy firewall. That system (with several processors) was
acting as backup in a primary-backup setup.

After several tries, I found a race condition between the deletion
operation of ctnetlink and timeout expiration. This patch fixes
this problem.

Tested-by: Kerin Millar <kerframil@gmail.com>
Reported-by: Kerin Millar <kerframil@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-17 01:47:08 -07:00
RongQing.Li
c577923756 ipv6: Don't dev_hold(dev) in ip6_mc_find_dev_rcu.
ip6_mc_find_dev_rcu() is called with rcu_read_lock(), so don't
need to dev_hold().
With dev_hold(), not corresponding dev_put(), will lead to leak.

[ bug introduced in 96b52e61be (ipv6: mcast: RCU conversions) ]

Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-16 21:56:42 -07:00
Eric Dumazet
cc34eb672e sch_sfq: revert dont put new flow at the end of flows
This reverts commit d47a0ac7b6 (sch_sfq: dont put new flow at the end of
flows)

As Jesper found out, patch sounded great but has bad side effects.

In stress situation, pushing new flows in front of the queue can prevent
old flows doing any progress. Packets can stay in SFQ queue for
unlimited amount of time.

It's possible to add heuristics to limit this problem, but this would
add complexity outside of SFQ scope.

A more sensible answer to Dave Taht concerns (who reported the issued I
tried to solve in original commit) is probably to use a qdisc hierarchy
so that high prio packets dont enter a potentially crowded SFQ qdisc.

Reported-by: Jesper Dangaard Brouer <jdb@comx.dk>
Cc: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-16 01:55:25 -07:00
Eric Dumazet
122bdf67f1 ipv6: fix icmp6_dst_alloc()
commit 87a115783 ( ipv6: Move xfrm_lookup() call down into
icmp6_dst_alloc().) forgot to convert one error path, leading
to crashes in mld_sendpack()

Many thanks to Dave Jones for providing a very complete bug report.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-16 01:53:42 -07:00
Ingo Molnar
35239e23c6 Merge branch 'perf/urgent' into perf/core
Merge reason: We are going to queue up a dependent patch.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-12 20:44:11 +01:00
Eric Dumazet
dfd25ffffc tcp: fix syncookie regression
commit ea4fc0d619 (ipv4: Don't use rt->rt_{src,dst} in ip_queue_xmit())
added a serious regression on synflood handling.

Simon Kirby discovered a successful connection was delayed by 20 seconds
before being responsive.

In my tests, I discovered that xmit frames were lost, and needed ~4
retransmits and a socket dst rebuild before being really sent.

In case of syncookie initiated connection, we use a different path to
initialize the socket dst, and inet->cork.fl.u.ip4 is left cleared.

As ip_queue_xmit() now depends on inet flow being setup, fix this by
copying the temp flowi4 we use in cookie_v4_check().

Reported-by: Simon Kirby <sim@netnation.com>
Bisected-by: Simon Kirby <sim@netnation.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-11 15:52:12 -07:00
Jiri Slaby
410235fd4d TTY: remove unneeded tty->index checks
Checking if tty->index is in bounds is not needed. The tty has the
index set in the initial open. This is done in get_tty_driver. And it
can be only in interval <0,driver->num).

So remove the tests which check exactly this interval. Some are
left untouched as they check against the current backing device count.
(Leaving apart that the check is racy in most of the cases.)

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-08 11:42:21 -08:00
Jiri Slaby
2f16669d32 TTY: remove re-assignments to tty_driver members
All num, magic and owner are set by alloc_tty_driver. No need to
re-set them on each allocation site.

pti driver sets something different to what it passes to
alloc_tty_driver. It is not a bug, since we don't use the lines
parameter in any way. Anyway this is fixed, and now we do the right
thing.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Acked-by: Tilman Schmidt <tilman@imap.cc>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-08 11:37:58 -08:00
Steffen Klassert
ac3f48de09 route: Remove redirect_genid
As we invalidate the inetpeer tree along with the routing cache now,
we don't need a genid to reset the redirect handling when the routing
cache is flushed.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-08 00:30:32 -08:00
Steffen Klassert
5faa5df1fa inetpeer: Invalidate the inetpeer tree along with the routing cache
We initialize the routing metrics with the values cached on the
inetpeer in rt_init_metrics(). So if we have the metrics cached on the
inetpeer, we ignore the user configured fib_metrics.

To fix this issue, we replace the old tree with a fresh initialized
inet_peer_base. The old tree is removed later with a delayed work queue.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-08 00:30:24 -08:00
Paulius Zaleckas
5200959b83 bridge: fix state reporting when port is disabled
Now we have:
eth0: link *down*
br0: port 1(eth0) entered *forwarding* state

br_log_state(p) should be called *after* p->state is set
to BR_STATE_DISABLED.

Reported-by: Zilvinas Valinskas <zilvinas@wilibox.com>
Signed-off-by: Paulius Zaleckas <paulius.zaleckas@gmail.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-08 00:25:25 -08:00
Paulius Zaleckas
d9e179ecec bridge: br_log_state() s/entering/entered/
When br_log_state() is reporting state it should say "entered"
istead of "entering" since state at this point is already
changed.

Signed-off-by: Paulius Zaleckas <paulius.zaleckas@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-08 00:25:25 -08:00
David S. Miller
9259c483a3 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch 2012-03-07 22:49:01 -08:00
Jesse Gross
81e5d41d7e openvswitch: Fix checksum update for actions on UDP packets.
When modifying IP addresses or ports on a UDP packet we don't
correctly follow the rules for unchecksummed packets.  This meant
that packets without a checksum can be given a incorrect new checksum
and packets with a checksum can become marked as being unchecksummed.
This fixes it to handle those requirements.

Signed-off-by: Jesse Gross <jesse@nicira.com>
2012-03-07 14:36:57 -08:00
Ben Pfaff
651a68ea2c openvswitch: Honor dp_ifindex, when specified, for vport lookup by name.
When OVS_VPORT_ATTR_NAME is specified and dp_ifindex is nonzero, the
logical behavior would be for the vport name lookup scope to be limited
to the specified datapath, but in fact the dp_ifindex value was ignored.
This commit causes the search scope to be honored.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2012-03-06 15:04:04 -08:00
Li Wei
d6ddef9e64 IPv6: Fix not join all-router mcast group when forwarding set.
When forwarding was set and a new net device is register,
we need add this device to the all-router mcast group.

Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-06 16:58:47 -05:00
Pablo Neira Ayuso
7413851197 netfilter: nf_conntrack: fix early_drop with reliable event delivery
If reliable event delivery is enabled and ctnetlink fails to deliver
the destroy event in early_drop, the conntrack subsystem cannot
drop any the candidate flow that was planned to be evicted.

Reported-by: Kerin Millar <kerframil@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-06 14:43:50 -05:00
Florian Westphal
739e4505a0 bridge: netfilter: don't call iptables on vlan packets if sysctl is off
When net.bridge.bridge-nf-filter-vlan-tagged is 0 (default), vlan packets
arriving should not be sent to ip(6)tables by bridge netfilter.

However, it turns out that we currently always send VLAN packets to
netfilter, if ..
a), CONFIG_VLAN_8021Q is enabled ; or
b), CONFIG_VLAN_8021Q is not set but rx vlan offload is enabled
   on the bridge port.

This is because bridge netfilter treats skb with
skb->protocol == ETH_P_IP{V6} as "non-vlan packet".

With rx vlan offload on or CONFIG_VLAN_8021Q=y, the vlan header has
already been removed here, and we cannot rely on skb->protocol alone.

Fix this by only using skb->protocol if the skb has no vlan tag,
or if a vlan tag is present and filter-vlan-tagged bridge netfilter
sysctl is enabled.

We cannot remove the skb->protocol == htons(ETH_P_8021Q) test
because the vlan tag is still around in the CONFIG_VLAN_8021Q=n &&
"ethtool -K $itf rxvlan off" case.

reproducer:
iptables -t raw -I PREROUTING -i br0
iptables -t raw -I PREROUTING -i br0.1

Then send packets to an ip address configured on br0.1 interface.
Even with net.bridge.bridge-nf-filter-vlan-tagged=0, the 1st rule
will match instead of the 2nd one.

With this patch applied, the 2nd rule will match instead.
In the non-local address case, netfilter won't be consulted after
this patch unless the sysctl is switched on.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-06 14:43:49 -05:00
Pablo Neira Ayuso
a157b9d5b5 netfilter: bridge: fix wrong pointer dereference
In adf7ff8, a invalid dereference was added in ebt_make_names.

CC [M]  net/bridge/netfilter/ebtables.o
net/bridge/netfilter/ebtables.c: In function `ebt_make_names':
net/bridge/netfilter/ebtables.c:1371:20: warning: `t' may be used uninitialized in this function [-Wuninitialized]

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-06 14:43:49 -05:00
Pablo Neira Ayuso
8be619d1e4 netfilter: ctnetlink: remove incorrect spin_[un]lock_bh on NAT module autoload
Since 7d367e0, ctnetlink_new_conntrack is called without holding
the nf_conntrack_lock spinlock. Thus, ctnetlink_parse_nat_setup
does not require to release that spinlock anymore in the NAT module
autoload case.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-06 14:43:49 -05:00
Santosh Nayak
848edc6919 netfilter: ebtables: fix wrong name length while copying to user-space
user-space ebtables expects 32 bytes-long names, but xt_match names
use 29 bytes. We have to copy less 29 bytes and then, make sure we
fill the remaining bytes with zeroes.

Signed-off-by: Santosh Nayak <santoshprasadnayak@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-06 14:43:49 -05:00
Neal Cardwell
4648dc97af tcp: fix tcp_shift_skb_data() to not shift SACKed data below snd_una
This commit fixes tcp_shift_skb_data() so that it does not shift
SACKed data below snd_una.

This fixes an issue whose symptoms exactly match reports showing
tp->sacked_out going negative since 3.3.0-rc4 (see "WARNING: at
net/ipv4/tcp_input.c:3418" thread on netdev).

Since 2008 (832d11c5cd)
tcp_shift_skb_data() had been shifting SACKed ranges that were below
snd_una. It checked that the *end* of the skb it was about to shift
from was above snd_una, but did not check that the end of the actual
shifted range was above snd_una; this commit adds that check.

Shifting SACKed ranges below snd_una is problematic because for such
ranges tcp_sacktag_one() short-circuits: it does not declare anything
as SACKed and does not increase sacked_out.

Before the fixes in commits cc9a672ee5
and daef52bab1, shifting SACKed ranges
below snd_una happened to work because tcp_shifted_skb() was always
(incorrectly) passing in to tcp_sacktag_one() an skb whose end_seq
tcp_shift_skb_data() had already guaranteed was beyond snd_una. Hence
tcp_sacktag_one() never short-circuited and always increased
tp->sacked_out in this case.

After those two fixes, my testing has verified that shifting SACKed
ranges below snd_una could cause tp->sacked_out to go negative with
the following sequence of events:

(1) tcp_shift_skb_data() sees an skb whose end_seq is beyond snd_una,
    then shifts a prefix of that skb that is below snd_una

(2) tcp_shifted_skb() increments the packet count of the
    already-SACKed prev sk_buff

(3) tcp_sacktag_one() sees the end of the new SACKed range is below
    snd_una, so it short-circuits and doesn't increase tp->sacked_out

(5) tcp_clean_rtx_queue() sees the SACKed skb has been ACKed,
    decrements tp->sacked_out by this "inflated" pcount that was
    missing a matching increase in tp->sacked_out, and hence
    tp->sacked_out underflows to a u32 like 0xFFFFFFFF, which casted
    to s32 is negative.

(6) this leads to the warnings seen in the recent "WARNING: at
    net/ipv4/tcp_input.c:3418" thread on the netdev list; e.g.:
    tcp_input.c:3418  WARN_ON((int)tp->sacked_out < 0);

More generally, I think this bug can be tickled in some cases where
two or more ACKs from the receiver are lost and then a DSACK arrives
that is immediately above an existing SACKed skb in the write queue.

This fix changes tcp_shift_skb_data() to abort this sequence at step
(1) in the scenario above by noticing that the bytes are below snd_una
and not shifting them.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-06 14:43:49 -05:00
Ulrich Weber
d1d81d4c3d bridge: check return value of ipv6_dev_get_saddr()
otherwise source IPv6 address of ICMPV6_MGM_QUERY packet
might be random junk if IPv6 is disabled on interface or
link-local address is not yet ready (DAD).

Signed-off-by: Ulrich Weber <ulrich.weber@sophos.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-05 16:45:34 -05:00
Ingo Molnar
737f24bda7 Merge branch 'perf/urgent' into perf/core
Conflicts:
	tools/perf/builtin-record.c
	tools/perf/builtin-top.c
	tools/perf/perf.h
	tools/perf/util/top.h

Merge reason: resolve these cherry-picking conflicts.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05 09:20:08 +01:00
Eric Dumazet
a4b64fbe48 rtnetlink: fix rtnl_calcit() and rtnl_dump_ifinfo()
nlmsg_parse() might return an error, so test its return value before
potential random memory accesses.

Errors introduced in commit 115c9b8192 (rtnetlink: Fix problem with
buffer allocation)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-04 22:02:55 -05:00
Joakim Tjernlund
709e1b5cd9 bridge: message age needs to increase, not decrease.
commit bridge: send proper message_age in config BPDU
added this gem:
  bpdu.message_age = (jiffies - root->designated_age)
  p->designated_age = jiffies + bpdu->message_age;
Notice how bpdu->message_age is negated when reassigned to
bpdu.message_age. This causes message age to decrease breaking the
STP protocol.

Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-04 21:57:40 -05:00
Joakim Tjernlund
aaca735f4f bridge: Adjust min age inc for HZ > 256
min age increment needs to round up its min age tick for all
HZ values to guarantee message age is increasing.

Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-04 21:57:39 -05:00
Neal Cardwell
c0638c247f tcp: don't fragment SACKed skbs in tcp_mark_head_lost()
In tcp_mark_head_lost() we should not attempt to fragment a SACKed skb
to mark the first portion as lost. This is for two primary reasons:

(1) tcp_shifted_skb() coalesces adjacent regions of SACKed skbs. When
doing this, it preserves the sum of their packet counts in order to
reflect the real-world dynamics on the wire. But given that skbs can
have remainders that do not align to MSS boundaries, this packet count
preservation means that for SACKed skbs there is not necessarily a
direct linear relationship between tcp_skb_pcount(skb) and
skb->len. Thus tcp_mark_head_lost()'s previous attempts to fragment
off and mark as lost a prefix of length (packets - oldcnt)*mss from
SACKed skbs were leading to occasional failures of the WARN_ON(len >
skb->len) in tcp_fragment() (which used to be a BUG_ON(); see the
recent "crash in tcp_fragment" thread on netdev).

(2) there is no real point in fragmenting off part of a SACKed skb and
calling tcp_skb_mark_lost() on it, since tcp_skb_mark_lost() is a NOP
for SACKed skbs.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-03 14:57:59 -05:00
David S. Miller
1e804aecd7 Merge branch 'master' of git://1984.lsi.us.es/net 2012-03-01 15:40:36 -05:00
Neal Cardwell
4c90d3b303 tcp: fix false reordering signal in tcp_shifted_skb
When tcp_shifted_skb() shifts bytes from the skb that is currently
pointed to by 'highest_sack' then the increment of
TCP_SKB_CB(skb)->seq implicitly advances tcp_highest_sack_seq(). This
implicit advancement, combined with the recent fix to pass the correct
SACKed range into tcp_sacktag_one(), caused tcp_sacktag_one() to think
that the newly SACKed range was before the tcp_highest_sack_seq(),
leading to a call to tcp_update_reordering() with a degree of
reordering matching the size of the newly SACKed range (typically just
1 packet, which is a NOP, but potentially larger).

This commit fixes this by simply calling tcp_sacktag_one() before the
TCP_SKB_CB(skb)->seq advancement that can advance our notion of the
highest SACKed sequence.

Correspondingly, we can simplify the code a little now that
tcp_shifted_skb() should update the lost_cnt_hint in all cases where
skb == tp->lost_skb_hint.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-28 15:06:46 -05:00
Ingo Molnar
bdd4431c8d Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
The major features of this series are:

 - making RCU more aggressive about entering dyntick-idle mode in order to
   improve energy efficiency

 - converting a few more call_rcu()s to kfree_rcu()s

 - applying a number of rcutree fixes and cleanups to rcutiny

 - removing CONFIG_SMP #ifdefs from treercu

 - allowing RCU CPU stall times to be set via sysfs

 - adding CPU-stall capability to rcutorture

 - adding more RCU-abuse diagnostics

 - updating documentation

 - fixing yet more issues located by the still-ongoing top-to-bottom
   inspection of RCU, this time with a special focus on the
   CPU-hotplug code path.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-28 10:16:10 +01:00
John W. Linville
eea79e0713 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem 2012-02-27 13:14:47 -05:00
Linus Torvalds
203738e548 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
1) ICMP sockets leave err uninitialized but we try to return it for the
   unsupported MSG_OOB case, reported by Dave Jones.

2) Add new Zaurus device ID entries, from Dave Jones.

3) Pointer calculation in hso driver memset is wrong, from Dan
   Carpenter.

4) ks8851_probe() checks unsigned value as negative, fix also from Dan
   Carpenter.

5) Fix crashes in atl1c driver due to TX queue handling, from Eric
   Dumazet.  I anticipate some TX side locking fixes coming in the near
   future for this driver as well.

6) The inline directive fix in Bluetooth which was breaking the build
   only with very new versions of GCC, from Johan Hedberg.

7) Fix crashes in the ATP CLIP code due to ARP cleanups this merge
   window, reported by Meelis Roos and fixed by Eric Dumazet.

8) JME driver doesn't flush RX FIFO correctly, from Guo-Fu Tseng.

9) Some ip6_route_output() callers test the return value for NULL, but
   this never happens as the convention is to return a dst entry with
   dst->error set.  Fixes from RonQing Li.

10) Logitech Harmony 900 should be handled by zaurus driver not
   cdc_ether, update white lists and black lists accordingly.  From
   Scott Talbert.

11) Receiving from certain kinds of devices there won't be a MAC header,
   so there is no MAC header to fixup in the IPSEC code, and if we try
   to do it we'll crash.  Fix from Eric Dumazet.

12) Port type array indexing off-by-one in mlx4 driver, fix from Yevgeny
   Petrilin.

13) Fix regression in link-down handling in davinci_emac which causes
   all RX descriptors to be freed up and therefore RX to wedge
   completely, from Christian Riesch.

14) It took two attempts, but ctnetlink soft lockups seem to be
   cured now, from Pablo Neira Ayuso.

15) Endianness bug fix in ENIC driver, from Santosh Nayak.

16) The long ago conversion of the PPP fragmentation code over to
   abstracted SKB list handling wasn't perfect, once we get an
   out of sequence SKB we don't flush the rest of them like we
   should.  From Ben McKeegan.

17) Fix regression of ->ip_summed initialization in sfc driver.
   From Ben Hutchings.

18) Bluetooth timeout mistakenly using msecs instead of jiffies,
   from Andrzej Kaczmarek.

19) Using _sync variant of work cancellation results in deadlocks,
   use the non _sync variants instead.  From Andre Guedes.

20) Bluetooth rfcomm code had reference counting problems leading
   to crashes, fix from Octavian Purdila.

21) The conversion of netem over to classful qdisc handling added
   two bugs to netem_dequeue(), fixes from Eric Dumazet.

22) Missing pci_iounmap() in ATM Solos driver.  Fix from Julia Lawall.

23) b44_pci_exit() should not have __exit tag since it's invoked from
   non-__exit code.  From Nikola Pajkovsky.

24) The conversion of the neighbour hash tables over to RCU added a
   race, fixed here by adding the necessary reread of tbl->nht, fix
   from Michel Machado.

25) When we added VF (virtual function) attributes for network device
   dumps, this potentially bloats up the size of the dump of one
   network device such that the dump size is too large for the buffer
   allocated by properly written netlink applications.

   In particular, if you add 255 VFs to a network device, parts of
   GLIBC stop working.

   To fix this, we add an attribute that is used to turn on these
   extended portions of the network device dump.  Sophisticaed
   applications like 'ip' that want to see this stuff  will be changed
   to set the attribute, whereas things like GLIBC that don't care
   about VFs simply will not, and therefore won't be busted by the
   mere presence of VFs on a network device.

   Thanks to the tireless work of Greg Rose on this fix.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (53 commits)
  sfc: Fix assignment of ip_summed for pre-allocated skbs
  ppp: fix 'ppp_mp_reconstruct bad seq' errors
  enic: Fix endianness bug.
  gre: fix spelling in comments
  netfilter: ctnetlink: fix soft lockup when netlink adds new entries (v2)
  Revert "netfilter: ctnetlink: fix soft lockup when netlink adds new entries"
  davinci_emac: Do not free all rx dma descriptors during init
  mlx4_core: Fixing array indexes when setting port types
  phy: IC+101G and PHY_HAS_INTERRUPT flag
  netdev/phy/icplus: Correct broken phy_init code
  ipsec: be careful of non existing mac headers
  Move Logitech Harmony 900 from cdc_ether to zaurus
  hso: memsetting wrong data in hso_get_count()
  netfilter: ip6_route_output() never returns NULL.
  ethernet/broadcom: ip6_route_output() never returns NULL.
  ipv6: ip6_route_output() never returns NULL.
  jme: Fix FIFO flush issue
  atm: clip: remove clip_tbl
  ipv4: ping: Fix recvmsg MSG_OOB error handling.
  rtnetlink: Fix problem with buffer allocation
  ...
2012-02-26 12:47:17 -08:00
Florian Westphal
e899b1119f netfilter: bridge: fix module autoload in compat case
We expected 0 if module doesn't exist, which is no longer the case
(42046e2e45,
netfilter: x_tables: return -ENOENT for non-existant matches/targets).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-02-25 14:29:09 +01:00
David S. Miller
e807e566e9 Merge branch 'master' of git://1984.lsi.us.es/net 2012-02-24 17:41:57 -05:00
stephen hemminger
bff528578f gre: fix spelling in comments
The original spelling and bad word choice makes these comments hard to read.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-24 17:41:11 -05:00
Jozsef Kadlecsik
7d367e0668 netfilter: ctnetlink: fix soft lockup when netlink adds new entries (v2)
Marcell Zambo and Janos Farago noticed and reported that when
new conntrack entries are added via netlink and the conntrack table
gets full, soft lockup happens. This is because the nf_conntrack_lock
is held while nf_conntrack_alloc is called, which is in turn wants
to lock nf_conntrack_lock while evicting entries from the full table.

The patch fixes the soft lockup with limiting the holding of the
nf_conntrack_lock to the minimum, where it's absolutely required.
It required to extend (and thus change) nf_conntrack_hash_insert
so that it makes sure conntrack and ctnetlink do not add the same entry
twice to the conntrack table.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-02-24 12:24:15 +01:00
Pablo Neira Ayuso
279072882d Revert "netfilter: ctnetlink: fix soft lockup when netlink adds new entries"
This reverts commit af14cca162.

This patch contains a race condition between packets and ctnetlink
in the conntrack addition. A new patch to fix this issue follows up.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-02-24 12:19:57 +01:00
Ingo Molnar
c5905afb0e static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc|dec]()
So here's a boot tested patch on top of Jason's series that does
all the cleanups I talked about and turns jump labels into a
more intuitive to use facility. It should also address the
various misconceptions and confusions that surround jump labels.

Typical usage scenarios:

        #include <linux/static_key.h>

        struct static_key key = STATIC_KEY_INIT_TRUE;

        if (static_key_false(&key))
                do unlikely code
        else
                do likely code

Or:

        if (static_key_true(&key))
                do likely code
        else
                do unlikely code

The static key is modified via:

        static_key_slow_inc(&key);
        ...
        static_key_slow_dec(&key);

The 'slow' prefix makes it abundantly clear that this is an
expensive operation.

I've updated all in-kernel code to use this everywhere. Note
that I (intentionally) have not pushed through the rename
blindly through to the lowest levels: the actual jump-label
patching arch facility should be named like that, so we want to
decouple jump labels from the static-key facility a bit.

On non-jump-label enabled architectures static keys default to
likely()/unlikely() branches.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jason Baron <jbaron@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: a.p.zijlstra@chello.nl
Cc: mathieu.desnoyers@efficios.com
Cc: davem@davemloft.net
Cc: ddaney.cavm@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-24 10:05:59 +01:00
Eric Dumazet
03606895cd ipsec: be careful of non existing mac headers
Niccolo Belli reported ipsec crashes in case we handle a frame without
mac header (atm in his case)

Before copying mac header, better make sure it is present.

Bugzilla reference:  https://bugzilla.kernel.org/show_bug.cgi?id=42809

Reported-by: Niccolò Belli <darkbasic@linuxsystems.it>
Tested-by: Niccolò Belli <darkbasic@linuxsystems.it>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-23 16:50:45 -05:00
David S. Miller
4a2258dddd Merge branch 'nf' of git://1984.lsi.us.es/net 2012-02-23 00:20:14 -05:00
RongQing.Li
5d38b1f8cf netfilter: ip6_route_output() never returns NULL.
ip6_route_output() never returns NULL, so it is wrong to
check if the return value is NULL.

Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-22 15:30:15 -05:00
RongQing.Li
5095d64db1 ipv6: ip6_route_output() never returns NULL.
ip6_route_output() never returns NULL, so it is wrong to
check if the return value is NULL.

Signed-off-by: RongQing.Li <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-22 15:30:14 -05:00
Eric Dumazet
597cdbc223 atm: clip: remove clip_tbl
Commit 32092ecf06 (atm: clip: Use device neigh support on top of
"arp_tbl".) introduced a bug since clip_tbl is zeroed : Crash occurs in
__neigh_for_each_release()

idle_timer_check() must use instead arp_tbl and neigh_check_cb() should
ignore non clip neighbours.

Idea from David Miller.

Reported-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-22 02:23:25 -05:00
David S. Miller
a5e7424d42 ipv4: ping: Fix recvmsg MSG_OOB error handling.
Don't return an uninitialized variable as the error, return
-EOPNOTSUPP instead.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-21 17:59:19 -05:00
Greg Rose
115c9b8192 rtnetlink: Fix problem with buffer allocation
Implement a new netlink attribute type IFLA_EXT_MASK.  The mask
is a 32 bit value that can be used to indicate to the kernel that
certain extended ifinfo values are requested by the user application.
At this time the only mask value defined is RTEXT_FILTER_VF to
indicate that the user wants the ifinfo dump to send information
about the VFs belonging to the interface.

This patch fixes a bug in which certain applications do not have
large enough buffers to accommodate the extra information returned
by the kernel with large numbers of SR-IOV virtual functions.
Those applications will not send the new netlink attribute with
the interface info dump request netlink messages so they will
not get unexpectedly large request buffers returned by the kernel.

Modifies the rtnl_calcit function to traverse the list of net
devices and compute the minimum buffer size that can hold the
info dumps of all matching devices based upon the filter passed
in via the new netlink attribute filter mask.  If no filter
mask is sent then the buffer allocation defaults to NLMSG_GOODSIZE.

With this change it is possible to add yet to be defined netlink
attributes to the dump request which should make it fairly extensible
in the future.

Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-21 16:56:45 -05:00