It was scheduled to be removed for a long time.
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netfilter@vger.kernel.org
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Change-Id: Iae2ab6490a36ecde8fb9ee5b65e4448969f74b07
commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream.
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.
Change-Id: I611023b0bfe1cab7b3e5da13e331a7baaaaf6eb0
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
[wt: backport to 4.11: adjust context]
[wt: backport to 4.9: adjust context ; kernel doc was not in admin-guide]
[wt: backport to 4.4: adjust context ; drop ppc hugetlb_radix changes]
[wt: backport to 3.18: adjust context ; no FOLL_POPULATE ;
s390 uses generic arch_get_unmapped_area()]
[wt: backport to 3.16: adjust context]
[wt: backport to 3.10: adjust context ; code logic in PARISC's
arch_get_unmapped_area() wasn't found ; code inserted into
expand_upwards() and expand_downwards() runs under anon_vma lock;
changes for gup.c:faultin_page go to memory.c:__get_user_pages();
included Hugh Dickins' fixes]
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Flex1911 <dedsa2002@gmail.com>
Preparatory patch to make the idle thread allocation for secondary
cpus generic.
Change-Id: I93b918d42eceb5bd3c8281fb48504f34352f382d
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/20120420124556.964170564@linutronix.de
commit 3f81d2447b37ac697b3c600039f2c6b628c06e21 upstream.
We were previously using free_bootmem() and just getting lucky
that nothing too bad happened.
Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
commit 33692f27597fcab536d7cbbcc8f52905133e4aa7 upstream.
The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
"you should SIGSEGV" error, because the SIGSEGV case was generally
handled by the caller - usually the architecture fault handler.
That results in lots of duplication - all the architecture fault
handlers end up doing very similar "look up vma, check permissions, do
retries etc" - but it generally works. However, there are cases where
the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.
In particular, when accessing the stack guard page, libsigsegv expects a
SIGSEGV. And it usually got one, because the stack growth is handled by
that duplicated architecture fault handler.
However, when the generic VM layer started propagating the error return
from the stack expansion in commit fee7e49d4514 ("mm: propagate error
from stack expansion even for guard page"), that now exposed the
existing VM_FAULT_SIGBUS result to user space. And user space really
expected SIGSEGV, not SIGBUS.
To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
duplicate architecture fault handlers about it. They all already have
the code to handle SIGSEGV, so it's about just tying that new return
value to the existing code, but it's all a bit annoying.
This is the mindless minimal patch to do this. A more extensive patch
would be to try to gather up the mostly shared fault handling logic into
one generic helper routine, and long-term we really should do that
cleanup.
Just from this patch, you can generally see that most architectures just
copied (directly or indirectly) the old x86 way of doing things, but in
the meantime that original x86 model has been improved to hold the VM
semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
"newer" things, so it would be a good idea to bring all those
improvements to the generic case and teach other architectures about
them too.
Reported-and-tested-by: Takashi Iwai <tiwai@suse.de>
Tested-by: Jan Engelhardt <jengelh@inai.de>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
Cc: linux-arch@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2:
- Adjust filenames, context
- Drop arc, metag, nios2 and lustre changes
- For sh, patch both 32-bit and 64-bit implementations to use goto bad_area
- For s390, pass int_code and trans_exc_code as arguments to do_no_context()
and do_sigsegv()]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
[lizf: Backported to 3.4:
- adjust context in arch/power/mm/fault.c
- apply the original change in upstream commit for s390]
Signed-off-by: Zefan Li <lizefan@huawei.com>
commit f862eefec0 upstream.
It turns out the kernel relies on barrier() to force a reload of the
percpu offset value. Since we can't easily modify the definition of
barrier() to include "tp" as an output register, we instead provide a
definition of __my_cpu_offset as extended assembly that includes a fake
stack read to hazard against barrier(), forcing gcc to know that it
must reread "tp" and recompute anything based on "tp" after a barrier.
This fixes observed hangs in the slub allocator when we are looping
on a percpu cmpxchg_double.
A similar fix for ARMv7 was made in June in change 509eb76ebf.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3cb3f839d3 upstream.
gcc 4.7.x is emitting calls to __ffsdi2 where previously
it used to inline the appropriate ctz instructions.
While this needs to be fixed in gcc, it's also easy to avoid
having it cause build failures when building with those
compilers by exporting __ffsdi2 to modules.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit ff7f3efb9a upstream.
The current Tilera boot infrastructure now provides the initramfs
to Linux as a Tilera-hypervisor file named "initramfs", rather than
"initramfs.cpio.gz", as before. (This makes it reasonable to use
other compression techniques than gzip on the file without having to
worry about the name causing confusion.) Adapt to use the new name,
but also fall back to checking for the old name.
Cc'ing to stable so that older kernels will remain compatible with
newer Tilera boot infrastructure.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 627072b06c upstream.
The tile tool chain uses the .eh_frame information for backtracing.
The vmlinux build drops any .eh_frame sections at link time, but when
present in kernel modules, it causes a module load failure due to the
presence of unsupported pc-relative relocations. When compiling to
use compiler feedback support, the compiler by default omits .eh_frame
information, so we don't see this problem. But when not using feedback,
we need to explicitly suppress the .eh_frame.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9f1d62bed7 upstream.
This is because __builtin_clz(0) returns 64 for the "undefined" case
of 0, since the builtin just does a right-shift 32 and "clz" instruction.
So, use the alpha approach of casting to u32 and using __builtin_clzll().
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some discussion with the glibc mailing lists revealed that this was
necessary for 64-bit platforms with MIPS-like sign-extension rules
for 32-bit values. The original symptom was that passing (uid_t)-1 to
setreuid() was failing in programs linked -pthread because of the "setxid"
mechanism for passing setxid-type function arguments to the syscall code.
SYSCALL_WRAPPERS handles ensuring that all syscall arguments end up with
proper sign-extension and is thus the appropriate fix for this problem.
On other platforms (s390, powerpc, sparc64, and mips) this was fixed
in 2.6.28.6. The general issue is tracked as CVE-2009-0029.
Cc: <stable@vger.kernel.org>
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
This passes siginfo and mcontext to tilegx32 signal handlers that
don't have SA_SIGINFO set just as we have been doing for tilegx64.
Cc: stable@vger.kernel.org
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
First, we were at risk of handling thread-info flags, in particular
do_signal(), when returning from kernel space. This could happen
after a failed kernel_execve(), or when forking a kernel thread.
The fix is to test in do_work_pending() for user_mode() and return
immediately if so; we already had this test for one of the flags,
so I just hoisted it to the top of the function.
Second, if a ptraced process updated the callee-saved registers
in the ptregs struct and then processed another thread-info flag, we
would overwrite the modifications with the original callee-saved
registers. To fix this, we add a register to note if we've already
saved the registers once, and skip doing it on additional passes
through the loop. To avoid a performance hit from the couple of
extra instructions involved, I modified the GET_THREAD_INFO() macro
to be guaranteed to be one instruction, then bundled it with adjacent
instructions, yielding an overall net savings.
Reported-By: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
This continues the theme started with vm_brk() and vm_munmap():
vm_mmap() does the same thing as do_mmap(), but additionally does the
required VM locking.
This uninlines (and rewrites it to be clearer) do_mmap(), which sadly
duplicates it in mm/mmap.c and mm/nommu.c. But that way we don't have
to export our internal do_mmap_pgoff() function.
Some day we hopefully don't have to export do_mmap() either, if all
modular users can become the simpler vm_mmap() instead. We're actually
very close to that already, with the notable exception of the (broken)
use in i810, and a couple of stragglers in binfmt_elf.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Until we push the unaligned access support for tilegx, it's silly
to have arch/tile/kernel/proc.c generate a warning about an unused
variable. Extend the #ifdef to cover all the code and data for now.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
The scheduler depends on receiving the CPU_STARTING notification, without
which we end up into a lot of trouble. So add the missing call to
notify_cpu_starting() in the bringup code.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Pull arch/tile bug fixes from Chris Metcalf:
"This includes Paul Gortmaker's change to fix the <asm/system.h>
disintegration issues on tile, a fix to unbreak the tilepro ethernet
driver, and a backlog of bugfix-only changes from internal Tilera
development over the last few months.
They have all been to LKML and on linux-next for the last few days.
The EDAC change to MAINTAINERS is an oddity but discussion on the
linux-edac list suggested I ask you to pull that change through my
tree since they don't have a tree to pull edac changes from at the
moment."
* 'stable' of git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile: (39 commits)
drivers/net/ethernet/tile: fix netdev_alloc_skb() bombing
MAINTAINERS: update EDAC information
tilepro ethernet driver: fix a few minor issues
tile-srom.c driver: minor code cleanup
edac: say "TILEGx" not "TILEPro" for the tilegx edac driver
arch/tile: avoid accidentally unmasking NMI-type interrupt accidentally
arch/tile: remove bogus performance optimization
arch/tile: return SIGBUS for addresses that are unaligned AND invalid
arch/tile: fix finv_buffer_remote() for tilegx
arch/tile: use atomic exchange in arch_write_unlock()
arch/tile: stop mentioning the "kvm" subdirectory
arch/tile: export the page_home() function.
arch/tile: fix pointer cast in cacheflush.c
arch/tile: fix single-stepping over swint1 instructions on tilegx
arch/tile: implement panic_smp_self_stop()
arch/tile: add "nop" after "nap" to help GX idle power draw
arch/tile: use proper memparse() for "maxmem" options
arch/tile: fix up locking in pgtable.c slightly
arch/tile: don't leak kernel memory when we unload modules
arch/tile: fix bug in delay_backoff()
...
The return path as we reload registers and core state requires that r30
hold a boolean indicating whether we are returning from an NMI, but in a
couple of cases we weren't setting this properly, with the result that we
could accidentally unmask the NMI interrupt(s), which could cause confusion.
Now we set r30 in every place where we jump into the interrupt return path.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
We were re-homing the initial task's kernel stack on the boot cpu,
but in fact it's better to let it stay globally homed, since that
task isn't bound to the boot cpu anyway. This is more of a general
cleanup than an actual performance optimization, but it removes
code, which is a good thing. :-)
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Previously we were returning SIGSEGV in this case. It seems cleaner
to return SIGBUS since the hardware figures out alignment traps
before TLB violations, so SIGBUS is the "more correct" signal.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
There were some correctness issues with this code that are now fixed
with this change. The change is likely less performant than it could
be, but it should no longer be vulnerable to any races with memory
operations on the memory network while invalidating a range of memory.
This code is run infrequently so performance isn't critical, but
correctness definitely is.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
This idiom is used elsewhere when we do an unlock by writing a zero,
but I missed it here. Using an atomic operation avoids waiting
on the write buffer for the unlocking write to be sent to the home cache.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
It causes "make clean" to fail, for example. Once we have KVM support
complete, we'll reinstate the subdir reference.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Pragmatically it couldn't be wrong to cast pointers to long to compare
them (since all kernel addresses are in the top half of VA space),
but it's more correct to cast to unsigned long.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
If we are single-stepping and make a syscall, we call ptrace_notify()
explicitly on the return path back to user space, since we are returning
to a pc value set artificially to the next instruction, and otherwise
we won't register that we stepped over the syscall instruction (swint1).
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
This allows the later-panicking tiles to wait in a lower power state
until they get interrupted with an smp_send_stop().
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
We should be holding the init_mm.page_table_lock in shatter_huge_page()
since we are modifying the kernel page tables. Then, only if we are
walking the other root page tables to update them, do we want to take
the pgd_lock.
Add a comment about taking the pgd_lock that we always do it with
interrupts disabled and therefore are not at risk from the tlbflush
IPI deadlock as is seen on x86.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
We were carefully computing a value to use for the number of loops
to spin for, and then ignoring it.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Previously we only handled kernels up to a single huge page in size.
Now we create additional PTEs appropriately.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
If we took a page fault while we had interrupts disabled, we
shouldn't enable them in the page fault handler.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
We make sure not to try to set the home for an MMIO PTE (on tilegx)
or a PTE that isn't referencing memory managed by Linux.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Doing so raises the possibility of self-deadlock if we are waiting
for a backtrace for an oprofile or perf interrupt while we are
in the middle of migrating our own stack page.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Not associated with any code changes, so I'm just lumping these
comment changes into a commit by themselves.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
We now respond to MEM_ERROR traps (e.g. an atomic instruction to
non-cacheable memory) with a SIGBUS.
We also no longer generate a console crash message if a user
process die due to a SIGTRAP.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
In certain circumstances we need to do a bunch of jump-and-link
instructions to fill the hardware return-address stack with nonzero values.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Fix a long-standing bug in the stack backtracer where we would print
garbage to the console instead of kernel function names, if the kernel
wasn't built with symbol support (e.g. mboot).
Make sure to tag every line of userspace backtrace output if we actually
have the mmap_sem, since that way if there's no tag, we know that it's
because we couldn't trylock the semaphore.
Stop doing a TLB flush and examining page tables during backtrace.
Instead, just trust that __copy_from_user_inatomic() will properly fault
and return a failure, which it should do in all cases.
Fix a latent bug where the backtracer would directly examine a signal
context in user space, rather than copying it safely to kernel memory
first. This meant that a race with another thread could potentially
have caused a kernel panic.
Guard against unaligned sp when trying to restart backtrace at an
interrupt or signal handler point in the kernel backtracer.
Report kernel symbolic information for the call instruction rather
than for the following instruction. We still report the actual numeric
address corresponding to the instruction after the call, for the sake
of consistency with the normal expectations for stack backtracers.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Add a comment explaining why this is important, and add a CFLAGS_REMOVE
clause to the Makefile to make sure it happens.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
With lockstat we can end up trying to get a backtrace before
"high_memory" is initialized, so don't worry about range testing
if it is zero.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
It still returns whether @v was not @u, not the old value,
unlike __atomic_add_unless().
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: Arun Sharma <asharma@fb.com>
We aren't yet using this definition in the kernel, but fix it up
before someone goes looking for it.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>