Commit Graph

146 Commits

Author SHA1 Message Date
Jann Horn f716247d29 splice: don't merge into linked buffers
commit a0ce2f0aa6ad97c3d4927bf2ca54bcebdf062d55 upstream.

Before this patch, it was possible for two pipes to affect each other after
data had been transferred between them with tee():

============
$ cat tee_test.c

int main(void) {
  int pipe_a[2];
  if (pipe(pipe_a)) err(1, "pipe");
  int pipe_b[2];
  if (pipe(pipe_b)) err(1, "pipe");
  if (write(pipe_a[1], "abcd", 4) != 4) err(1, "write");
  if (tee(pipe_a[0], pipe_b[1], 2, 0) != 2) err(1, "tee");
  if (write(pipe_b[1], "xx", 2) != 2) err(1, "write");

  char buf[5];
  if (read(pipe_a[0], buf, 4) != 4) err(1, "read");
  buf[4] = 0;
  printf("got back: '%s'\n", buf);
}
$ gcc -o tee_test tee_test.c
$ ./tee_test
got back: 'abxx'
$
============

As suggested by Al Viro, fix it by creating a separate type for
non-mergeable pipe buffers, then changing the types of buffers in
splice_pipe_to_pipe() and link_pipe().

Fixes: 7c77f0b3f9 ("splice: implement pipe to pipe splicing")
Fixes: 70524490ee ("[PATCH] splice: add support for sys_tee()")
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[bwh: Backported to 3.16: Use generic_pipe_buf_steal(), as for other pipe
 types, since anon_pipe_buf_steal() does not exist here]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 22:11:20 +02:00
Eric Biggers 46918de204 pipe: read buffer limits atomically
commit f7340761812fc10313e6fcc115e0bc4f7a799112 upstream.

The pipe buffer limits are accessed without any locking, and may be
changed at any time by the sysctl handlers.  In theory this could cause
problems for expressions like the following:

    pipe_user_pages_hard && user_bufs > pipe_user_pages_hard

...  since the assembly code might reference the 'pipe_user_pages_hard'
memory location multiple times, and if the admin removes the limit by
setting it to 0, there is a very brief window where processes could
incorrectly observe the limit to be exceeded.

Fix this by loading the limits with READ_ONCE() prior to use.

Link: http://lkml.kernel.org/r/20180111052902.14409-8-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.16:
 - Use ACCESS_ONCE() instead of READ_ONCE()
 - Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:46 +02:00
Eric Biggers ec9fdf2dbd pipe: simplify round_pipe_size()
commit c4fed5a91fadc8a277b1eda474317b501651dd3e upstream.

round_pipe_size() calculates the number of pages the requested size
corresponds to, then rounds the page count up to the next power of 2.

However, it also rounds everything < PAGE_SIZE up to PAGE_SIZE.
Therefore, there's no need to actually translate the size into a page
count; we just need to round the size up to the next power of 2.

We do need to verify the size isn't greater than (1 << 31), since on
32-bit systems roundup_pow_of_two() would be undefined in that case.  But
that can just be combined with the UINT_MAX check which we need anyway
now.

Finally, update pipe_set_size() to not redundantly check the return value
of round_pipe_size() for the "invalid size" case twice.

Link: http://lkml.kernel.org/r/20180111052902.14409-7-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:46 +02:00
Eric Biggers a02b779d53 pipe: reject F_SETPIPE_SZ with size over UINT_MAX
commit 96e99be40e4cff870a83233731121ec0f7f95075 upstream.

A pipe's size is represented as an 'unsigned int'.  As expected, writing a
value greater than UINT_MAX to /proc/sys/fs/pipe-max-size fails with
EINVAL.  However, the F_SETPIPE_SZ fcntl silently truncates such values to
32 bits, rather than failing with EINVAL as expected.  (It *does* fail
with EINVAL for values above (1 << 31) but <= UINT_MAX.)

Fix this by moving the check against UINT_MAX into round_pipe_size() which
is called in both cases.

Link: http://lkml.kernel.org/r/20180111052902.14409-6-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:46 +02:00
Eric Biggers 580ec5076e pipe: fix off-by-one error when checking buffer limits
commit 9903a91c763ecdae333a04a9d89d79d2b8966503 upstream.

With pipe-user-pages-hard set to 'N', users were actually only allowed up
to 'N - 1' buffers; and likewise for pipe-user-pages-soft.

Fix this to allow up to 'N' buffers, as would be expected.

Link: http://lkml.kernel.org/r/20180111052902.14409-5-ebiggers3@gmail.com
Fixes: b0b91d18e2e9 ("pipe: fix limit checking in pipe_set_size()")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Willy Tarreau <w@1wt.eu>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:45 +02:00
Eric Biggers b037483c89 pipe: actually allow root to exceed the pipe buffer limits
commit 85c2dd5473b2718b4b63e74bfeb1ca876868e11f upstream.

pipe-user-pages-hard and pipe-user-pages-soft are only supposed to apply
to unprivileged users, as documented in both Documentation/sysctl/fs.txt
and the pipe(7) man page.

However, the capabilities are actually only checked when increasing a
pipe's size using F_SETPIPE_SZ, not when creating a new pipe.  Therefore,
if pipe-user-pages-hard has been set, the root user can run into it and be
unable to create pipes.  Similarly, if pipe-user-pages-soft has been set,
the root user can run into it and have their pipes limited to 1 page each.

Fix this by allowing the privileged override in both cases.

Link: http://lkml.kernel.org/r/20180111052902.14409-4-ebiggers3@gmail.com
Fixes: 759c01142a5d ("pipe: limit the per-user amount of pages allocated in pipes")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:45 +02:00
Eric Biggers f32040ce1c pipe, sysctl: remove pipe_proc_fn()
commit 319e0a21bb7823abbb4818fe2724e572bbac77a2 upstream.

pipe_proc_fn() is no longer needed, as it only calls through to
proc_dopipe_max_size().  Just put proc_dopipe_max_size() in the ctl_table
entry directly, and remove the unneeded EXPORT_SYMBOL() and the ENOSYS
stub for it.

(The reason the ENOSYS stub isn't needed is that the pipe-max-size
ctl_table entry is located directly in 'kern_table' rather than being
registered separately.  Therefore, the entry is already only defined when
the kernel is built with sysctl support.)

Link: http://lkml.kernel.org/r/20180111052902.14409-3-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:45 +02:00
Eric Biggers bae0262aef pipe, sysctl: drop 'min' parameter from pipe-max-size converter
commit 4c2e4befb3cc9ce42d506aa537c9ab504723e98c upstream.

Patch series "pipe: buffer limits fixes and cleanups", v2.

This series simplifies the sysctl handler for pipe-max-size and fixes
another set of bugs related to the pipe buffer limits:

- The root user wasn't allowed to exceed the limits when creating new
  pipes.

- There was an off-by-one error when checking the limits, so a limit of
  N was actually treated as N - 1.

- F_SETPIPE_SZ accepted values over UINT_MAX.

- Reading the pipe buffer limits could be racy.

This patch (of 7):

Before validating the given value against pipe_min_size,
do_proc_dopipe_max_size_conv() calls round_pipe_size(), which rounds the
value up to pipe_min_size.  Therefore, the second check against
pipe_min_size is redundant.  Remove it.

Link: http://lkml.kernel.org/r/20180111052902.14409-2-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:44 +02:00
Joe Lawrence 7eecc8b01c pipe: add proc_dopipe_max_size() to safely assign pipe_max_size
commit 7a8d181949fb2c16be00f8cdb354794a30e46b39 upstream.

pipe_max_size is assigned directly via procfs sysctl:

  static struct ctl_table fs_table[] = {
          ...
          {
                  .procname       = "pipe-max-size",
                  .data           = &pipe_max_size,
                  .maxlen         = sizeof(int),
                  .mode           = 0644,
                  .proc_handler   = &pipe_proc_fn,
                  .extra1         = &pipe_min_size,
          },
          ...

  int pipe_proc_fn(struct ctl_table *table, int write, void __user *buf,
                   size_t *lenp, loff_t *ppos)
  {
          ...
          ret = proc_dointvec_minmax(table, write, buf, lenp, ppos)
          ...

and then later rounded in-place a few statements later:

          ...
          pipe_max_size = round_pipe_size(pipe_max_size);
          ...

This leaves a window of time between initial assignment and rounding
that may be visible to other threads.  (For example, one thread sets a
non-rounded value to pipe_max_size while another reads its value.)

Similar reads of pipe_max_size are potentially racy:

  pipe.c :: alloc_pipe_info()
  pipe.c :: pipe_set_size()

Add a new proc_dopipe_max_size() that consolidates reading the new value
from the user buffer, verifying bounds, and calling round_pipe_size()
with a single assignment to pipe_max_size.

Change-Id: I635ecef3cdbbc0edf5158bf92bd4c17eab390080
Link: http://lkml.kernel.org/r/1507658689-11669-4-git-send-email-joe.lawrence@redhat.com
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.16: Continue using int sysctl functions because we don't
 have proper unsigned int support]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:44 +02:00
Joe Lawrence 68c934d27e pipe: avoid round_pipe_size() nr_pages overflow on 32-bit
commit d3f14c485867cfb2e0c48aa88c41d0ef4bf5209c upstream.

round_pipe_size() contains a right-bit-shift expression which may
overflow, which would cause undefined results in a subsequent
roundup_pow_of_two() call.

  static inline unsigned int round_pipe_size(unsigned int size)
  {
          unsigned long nr_pages;

          nr_pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
          return roundup_pow_of_two(nr_pages) << PAGE_SHIFT;
  }

PAGE_SIZE is defined as (1UL << PAGE_SHIFT), so:
  - 4 bytes wide on 32-bit (0 to 0xffffffff)
  - 8 bytes wide on 64-bit (0 to 0xffffffffffffffff)

That means that 32-bit round_pipe_size(), nr_pages may overflow to 0:

  size=0x00000000    nr_pages=0x0
  size=0x00000001    nr_pages=0x1
  size=0xfffff000    nr_pages=0xfffff
  size=0xfffff001    nr_pages=0x0         << !
  size=0xffffffff    nr_pages=0x0         << !

This is bad because roundup_pow_of_two(n) is undefined when n == 0!

64-bit is not a problem as the unsigned int size is 4 bytes wide
(similar to 32-bit) and the larger, 8 byte wide unsigned long, is
sufficient to handle the largest value of the bit shift expression:

  size=0xffffffff    nr_pages=100000

Modify round_pipe_size() to return 0 if n == 0 and updates its callers to
handle accordingly.

Link: http://lkml.kernel.org/r/1507658689-11669-3-git-send-email-joe.lawrence@redhat.com
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:43 +02:00
Michael Kerrisk (man-pages) d04cada422 pipe: cap initial pipe capacity according to pipe-max-size limit
commit 086e774a57fba4695f14383c0818994c0b31da7c upstream.

This is a patch that provides behavior that is more consistent, and
probably less surprising to users. I consider the change optional, and
welcome opinions about whether it should be applied.

By default, pipes are created with a capacity of 64 kiB.  However,
/proc/sys/fs/pipe-max-size may be set smaller than this value.  In this
scenario, an unprivileged user could thus create a pipe whose initial
capacity exceeds the limit. Therefore, it seems logical to cap the
initial pipe capacity according to the value of pipe-max-size.

The test program shown earlier in this patch series can be used to
demonstrate the effect of the change brought about with this patch:

    # cat /proc/sys/fs/pipe-max-size
    1048576
    # sudo -u mtk ./test_F_SETPIPE_SZ 1
    Initial pipe capacity: 65536
    # echo 10000 > /proc/sys/fs/pipe-max-size
    # cat /proc/sys/fs/pipe-max-size
    16384
    # sudo -u mtk ./test_F_SETPIPE_SZ 1
    Initial pipe capacity: 16384
    # ./test_F_SETPIPE_SZ 1
    Initial pipe capacity: 65536

The last two executions of 'test_F_SETPIPE_SZ' show that pipe-max-size
caps the initial allocation for a new pipe for unprivileged users, but
not for privileged users.

Link: http://lkml.kernel.org/r/31dc7064-2a17-9c5b-1df1-4e3012ee992c@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:43 +02:00
Michael Kerrisk (man-pages) a9f857695c pipe: make account_pipe_buffers() return a value, and use it
commit 9c87bcf0a31b338dc8a69a5d251a037565a94e13 upstream.

This is an optional patch, to provide a small performance
improvement.  Alter account_pipe_buffers() so that it returns the
new value in user->pipe_bufs. This means that we can refactor
too_many_pipe_buffers_soft() and too_many_pipe_buffers_hard() to
avoid the costs of repeated use of atomic_long_read() to get the
value user->pipe_bufs.

Link: http://lkml.kernel.org/r/93e5f193-1e5e-3e1f-3a20-eae79b7e1310@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:43 +02:00
Michael Kerrisk (man-pages) 32c6a41540 pipe: fix limit checking in alloc_pipe_info()
commit a005ca0e6813e1d796a7422a7e31d8b8d6555df1 upstream.

The limit checking in alloc_pipe_info() (used by pipe(2) and when
opening a FIFO) has the following problems:

(1) When checking capacity required for the new pipe, the checks against
    the limit in /proc/sys/fs/pipe-user-pages-{soft,hard} are made
    against existing consumption, and exclude the memory required for
    the new pipe capacity. As a consequence: (1) the memory allocation
    throttling provided by the soft limit does not kick in quite as
    early as it should, and (2) the user can overrun the hard limit.

(2) As currently implemented, accounting and checking against the limits
    is done as follows:

    (a) Test whether the user has exceeded the limit.
    (b) Make new pipe buffer allocation.
    (c) Account new allocation against the limits.

    This is racey. Multiple processes may pass point (a) simultaneously,
    and then allocate pipe buffers that are accounted for only in step
    (c).  The race means that the user's pipe buffer allocation could be
    pushed over the limit (by an arbitrary amount, depending on how
    unlucky we were in the race). [Thanks to Vegard Nossum for spotting
    this point, which I had missed.]

This patch addresses the above problems as follows:

* Alter the checks against limits to include the memory required for the
  new pipe.
* Re-order the accounting step so that it precedes the buffer allocation.
  If the accounting step determines that a limit has been reached, revert
  the accounting and cause the operation to fail.

Link: http://lkml.kernel.org/r/8ff3e9f9-23f6-510c-644f-8e70cd1c0bd9@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.16: Don't use GFP_KERNEL_ACCOUNT]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:42 +02:00
Michael Kerrisk (man-pages) de499a3442 pipe: simplify logic in alloc_pipe_info()
commit 09b4d1990094dd22c27fb0163534db419458569c upstream.

Replace an 'if' block that covers most of the code in this function
with a 'goto'. This makes the code a little simpler to read, and also
simplifies the next patch (fix limit checking in alloc_pipe_info())

Link: http://lkml.kernel.org/r/aef030c1-0257-98a9-4988-186efa48530c@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.16:
 - Don't use GFP_KERNEL_ACCOUNT
 - Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:42 +02:00
Michael Kerrisk (man-pages) 5c86aa5b8d pipe: fix limit checking in pipe_set_size()
commit b0b91d18e2e97b741b294af9333824ecc3fadfd8 upstream.

The limit checking in pipe_set_size() (used by fcntl(F_SETPIPE_SZ))
has the following problems:

(1) When increasing the pipe capacity, the checks against the limits in
    /proc/sys/fs/pipe-user-pages-{soft,hard} are made against existing
    consumption, and exclude the memory required for the increased pipe
    capacity. The new increase in pipe capacity can then push the total
    memory used by the user for pipes (possibly far) over a limit. This
    can also trigger the problem described next.

(2) The limit checks are performed even when the new pipe capacity is
    less than the existing pipe capacity. This can lead to problems if a
    user sets a large pipe capacity, and then the limits are lowered,
    with the result that the user will no longer be able to decrease the
    pipe capacity.

(3) As currently implemented, accounting and checking against the
    limits is done as follows:

    (a) Test whether the user has exceeded the limit.
    (b) Make new pipe buffer allocation.
    (c) Account new allocation against the limits.

    This is racey. Multiple processes may pass point (a)
    simultaneously, and then allocate pipe buffers that are accounted
    for only in step (c).  The race means that the user's pipe buffer
    allocation could be pushed over the limit (by an arbitrary amount,
    depending on how unlucky we were in the race). [Thanks to Vegard
    Nossum for spotting this point, which I had missed.]

This patch addresses the above problems as follows:

* Perform checks against the limits only when increasing a pipe's
  capacity; an unprivileged user can always decrease a pipe's capacity.
* Alter the checks against limits to include the memory required for
  the new pipe capacity.
* Re-order the accounting step so that it precedes the buffer
  allocation. If the accounting step determines that a limit has
  been reached, revert the accounting and cause the operation to fail.

The program below can be used to demonstrate problems 1 and 2, and the
effect of the fix. The program takes one or more command-line arguments.
The first argument specifies the number of pipes that the program should
create. The remaining arguments are, alternately, pipe capacities that
should be set using fcntl(F_SETPIPE_SZ), and sleep intervals (in
seconds) between the fcntl() operations. (The sleep intervals allow the
possibility to change the limits between fcntl() operations.)

Problem 1
=========

Using the test program on an unpatched kernel, we first set some
limits:

    # echo 0 > /proc/sys/fs/pipe-user-pages-soft
    # echo 1000000000 > /proc/sys/fs/pipe-max-size
    # echo 10000 > /proc/sys/fs/pipe-user-pages-hard    # 40.96 MB

Then show that we can set a pipe with capacity (100MB) that is
over the hard limit

    # sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000
    Initial pipe capacity: 65536
        Loop 1: set pipe capacity to 100000000 bytes
            F_SETPIPE_SZ returned 134217728

Now set the capacity to 100MB twice. The second call fails (which is
probably surprising to most users, since it seems like a no-op):

    # sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000 0 100000000
    Initial pipe capacity: 65536
        Loop 1: set pipe capacity to 100000000 bytes
            F_SETPIPE_SZ returned 134217728
        Loop 2: set pipe capacity to 100000000 bytes
            Loop 2, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted

With a patched kernel, setting a capacity over the limit fails at the
first attempt:

    # echo 0 > /proc/sys/fs/pipe-user-pages-soft
    # echo 1000000000 > /proc/sys/fs/pipe-max-size
    # echo 10000 > /proc/sys/fs/pipe-user-pages-hard
    # sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000
    Initial pipe capacity: 65536
        Loop 1: set pipe capacity to 100000000 bytes
            Loop 1, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted

There is a small chance that the change to fix this problem could
break user-space, since there are cases where fcntl(F_SETPIPE_SZ)
calls that previously succeeded might fail. However, the chances are
small, since (a) the pipe-user-pages-{soft,hard} limits are new (in
4.5), and the default soft/hard limits are high/unlimited.  Therefore,
it seems warranted to make these limits operate more precisely (and
behave more like what users probably expect).

Problem 2
=========

Running the test program on an unpatched kernel, we first set some limits:

    # getconf PAGESIZE
    4096
    # echo 0 > /proc/sys/fs/pipe-user-pages-soft
    # echo 1000000000 > /proc/sys/fs/pipe-max-size
    # echo 10000 > /proc/sys/fs/pipe-user-pages-hard    # 40.96 MB

Now perform two fcntl(F_SETPIPE_SZ) operations on a single pipe,
first setting a pipe capacity (10MB), sleeping for a few seconds,
during which time the hard limit is lowered, and then set pipe
capacity to a smaller amount (5MB):

    # sudo -u mtk ./test_F_SETPIPE_SZ 1 10000000 15 5000000 &
    [1] 748
    # Initial pipe capacity: 65536
        Loop 1: set pipe capacity to 10000000 bytes
            F_SETPIPE_SZ returned 16777216
            Sleeping 15 seconds

    # echo 1000 > /proc/sys/fs/pipe-user-pages-hard      # 4.096 MB
    #     Loop 2: set pipe capacity to 5000000 bytes
            Loop 2, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted

In this case, the user should be able to lower the limit.

With a kernel that has the patch below, the second fcntl()
succeeds:

    # echo 0 > /proc/sys/fs/pipe-user-pages-soft
    # echo 1000000000 > /proc/sys/fs/pipe-max-size
    # echo 10000 > /proc/sys/fs/pipe-user-pages-hard
    # sudo -u mtk ./test_F_SETPIPE_SZ 1 10000000 15 5000000 &
    [1] 3215
    # Initial pipe capacity: 65536
    #     Loop 1: set pipe capacity to 10000000 bytes
            F_SETPIPE_SZ returned 16777216
            Sleeping 15 seconds

    # echo 1000 > /proc/sys/fs/pipe-user-pages-hard

    #     Loop 2: set pipe capacity to 5000000 bytes
            F_SETPIPE_SZ returned 8388608

8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---

/* test_F_SETPIPE_SZ.c

   (C) 2016, Michael Kerrisk; licensed under GNU GPL version 2 or later

   Test operation of fcntl(F_SETPIPE_SZ) for setting pipe capacity
   and interactions with limits defined by /proc/sys/fs/pipe-* files.
*/

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    int (*pfd)[2];
    int npipes;
    int pcap, rcap;
    int j, p, s, stime, loop;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s num-pipes "
                "[pipe-capacity sleep-time]...\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    npipes = atoi(argv[1]);

    pfd = calloc(npipes, sizeof (int [2]));
    if (pfd == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }

    for (j = 0; j < npipes; j++) {
        if (pipe(pfd[j]) == -1) {
            fprintf(stderr, "Loop %d: pipe() failed: ", j);
            perror("pipe");
            exit(EXIT_FAILURE);
        }
    }

    printf("Initial pipe capacity: %d\n", fcntl(pfd[0][0], F_GETPIPE_SZ));

    for (j = 2; j < argc; j += 2 ) {
        loop = j / 2;
        pcap = atoi(argv[j]);
        printf("    Loop %d: set pipe capacity to %d bytes\n", loop, pcap);

        for (p = 0; p < npipes; p++) {
            s = fcntl(pfd[p][0], F_SETPIPE_SZ, pcap);
            if (s == -1) {
                fprintf(stderr, "        Loop %d, pipe %d: F_SETPIPE_SZ "
                        "failed: ", loop, p);
                perror("fcntl");
                exit(EXIT_FAILURE);
            }

            if (p == 0) {
                printf("        F_SETPIPE_SZ returned %d\n", s);
                rcap = s;
            } else {
                if (s != rcap) {
                    fprintf(stderr, "        Loop %d, pipe %d: F_SETPIPE_SZ "
                            "unexpected return: %d\n", loop, p, s);
                    exit(EXIT_FAILURE);
                }
            }

            stime = (j + 1 < argc) ? atoi(argv[j + 1]) : 0;
            if (stime > 0) {
                printf("        Sleeping %d seconds\n", stime);
                sleep(stime);
            }
        }
    }

    exit(EXIT_SUCCESS);
}

8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---

Patch history:

v2
   * Switch order of test in 'if' statement to avoid function call
      (to capability()) in normal path. [This is a fix to a preexisting
      wart in the code. Thanks to Willy Tarreau]
    * Perform (size > pipe_max_size) check before calling
      account_pipe_buffers().  [Thanks to Vegard Nossum]
      Quoting Vegard:

        The potential problem happens if the user passes a very large number
        which will overflow pipe->user->pipe_bufs.

        On 32-bit, sizeof(int) == sizeof(long), so if they pass arg = INT_MAX
        then round_pipe_size() returns INT_MAX. Although it's true that the
        accounting is done in terms of pages and not bytes, so you'd need on
        the order of (1 << 13) = 8192 processes hitting the limit at the same
        time in order to make it overflow, which seems a bit unlikely.

        (See https://lkml.org/lkml/2016/8/12/215 for another discussion on the
        limit checking)

Link: http://lkml.kernel.org/r/1e464945-536b-2420-798b-e77b9c7e8593@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:41 +02:00
Michael Kerrisk (man-pages) 0dda4baf2a pipe: refactor argument for account_pipe_buffers()
commit 3734a13b96ebf039b293d8d37a934fd1bd9e03ab upstream.

This is a preparatory patch for following work. account_pipe_buffers()
performs accounting in the 'user_struct'. There is no need to pass a
pointer to a 'pipe_inode_info' struct (which is then dereferenced to
obtain a pointer to the 'user' field). Instead, pass a pointer directly
to the 'user_struct'. This change is needed in preparation for a
subsequent patch that the fixes the limit checking in alloc_pipe_info()
(and the resulting code is a little more logical).

Link: http://lkml.kernel.org/r/7277bf8c-a6fc-4a7d-659c-f5b145c981ab@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:41 +02:00
Michael Kerrisk (man-pages) 45a24d1e64 pipe: move limit checking logic into pipe_set_size()
commit d37d41666408102bf0ac8e48d8efdce7b809e5f6 upstream.

This is a preparatory patch for following work. Move the F_SETPIPE_SZ
limit-checking logic from pipe_fcntl() into pipe_set_size().  This
simplifies the code a little, and allows for reworking required in
a later patch that fixes the limit checking in pipe_set_size()

Link: http://lkml.kernel.org/r/3701b2c5-2c52-2c3e-226d-29b9deb29b50@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:41 +02:00
Michael Kerrisk (man-pages) 62a3f8ce98 pipe: relocate round_pipe_size() above pipe_set_size()
commit f491bd71118beba608d39ac2d5f1530e1160cd2e upstream.

Patch series "pipe: fix limit handling", v2.

When changing a pipe's capacity with fcntl(F_SETPIPE_SZ), various limits
defined by /proc/sys/fs/pipe-* files are checked to see if unprivileged
users are exceeding limits on memory consumption.

While documenting and testing the operation of these limits I noticed
that, as currently implemented, these checks have a number of problems:

(1) When increasing the pipe capacity, the checks against the limits
    in /proc/sys/fs/pipe-user-pages-{soft,hard} are made against
    existing consumption, and exclude the memory required for the
    increased pipe capacity. The new increase in pipe capacity can then
    push the total memory used by the user for pipes (possibly far) over
    a limit. This can also trigger the problem described next.

(2) The limit checks are performed even when the new pipe capacity
    is less than the existing pipe capacity. This can lead to problems
    if a user sets a large pipe capacity, and then the limits are
    lowered, with the result that the user will no longer be able to
    decrease the pipe capacity.

(3) As currently implemented, accounting and checking against the
    limits is done as follows:

    (a) Test whether the user has exceeded the limit.
    (b) Make new pipe buffer allocation.
    (c) Account new allocation against the limits.

    This is racey. Multiple processes may pass point (a) simultaneously,
    and then allocate pipe buffers that are accounted for only in step
    (c).  The race means that the user's pipe buffer allocation could be
    pushed over the limit (by an arbitrary amount, depending on how
    unlucky we were in the race). [Thanks to Vegard Nossum for spotting
    this point, which I had missed.]

This patch series addresses these three problems.

This patch (of 8):

This is a minor preparatory patch.  After subsequent patches,
round_pipe_size() will be called from pipe_set_size(), so place
round_pipe_size() above pipe_set_size().

Link: http://lkml.kernel.org/r/91a91fdb-a959-ba7f-b551-b62477cc98a1@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2019-07-27 21:49:40 +02:00
LuK1337 fc9499e55a Import latest Samsung release
* Package version: T713XXU2BQCO

Change-Id: I293d9e7f2df458c512d59b7a06f8ca6add610c99
2017-04-18 03:43:52 +02:00
Ben Hutchings a5c1b71b32 pipe: Fix buffer offset after partially failed read
Quoting the RHEL advisory:

> It was found that the fix for CVE-2015-1805 incorrectly kept buffer
> offset and buffer length in sync on a failed atomic read, potentially
> resulting in a pipe buffer state corruption. A local, unprivileged user
> could use this flaw to crash the system or leak kernel memory to user
> space. (CVE-2016-0774, Moderate)

The same flawed fix was applied to stable branches from 2.6.32.y to
3.14.y inclusive, and I was able to reproduce the issue on 3.2.y.
We need to give pipe_iov_copy_to_user() a separate offset variable
and only update the buffer offset if it succeeds.

Change-Id: Ibc4d13da6c463d02bd6ef7addc0fc871b6f63760
References: https://rhn.redhat.com/errata/RHSA-2016-0103.html
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Git-commit: feae3ca2e5e1a8f44aa6290255d3d9709985d0b2
Git-repo: https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git
Signed-off-by: Srinivasarao P <spathi@codeaurora.org>
2016-08-02 03:33:19 -07:00
Ben Hutchings 14f81062f3 pipe: iovec: Fix memory corruption when retrying atomic copy as non-atomic
pipe_iov_copy_{from,to}_user() may be tried twice with the same iovec,
the first time atomically and the second time not.  The second attempt
needs to continue from the iovec position, pipe buffer offset and
remaining length where the first attempt failed, but currently the
pipe buffer offset and remaining length are reset.  This will corrupt
the piped data (possibly also leading to an information leak between
processes) and may also corrupt kernel memory.

This was fixed upstream by commits f0d1bec9d58d ("new helper:
copy_page_from_iter()") and 637b58c2887e ("switch pipe_read() to
copy_page_to_iter()"), but those aren't suitable for stable.  This fix
for older kernel versions was made by Seth Jennings for RHEL and I
have extracted it from their update.

CVE-2015-1805

References: https://bugzilla.redhat.com/show_bug.cgi?id=1202855
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-29 12:08:34 -07:00
Linus Torvalds 120d92a872 vfs: fix subtle use-after-free of pipe_inode_info
commit b0d8d2292160bb63de1972361ebed100c64b5b37 upstream.

The pipe code was trying (and failing) to be very careful about freeing
the pipe info only after the last access, with a pattern like:

        spin_lock(&inode->i_lock);
        if (!--pipe->files) {
                inode->i_pipe = NULL;
                kill = 1;
        }
        spin_unlock(&inode->i_lock);
        __pipe_unlock(pipe);
        if (kill)
                free_pipe_info(pipe);

where the final freeing is done last.

HOWEVER.  The above is actually broken, because while the freeing is
done at the end, if we have two racing processes releasing the pipe
inode info, the one that *doesn't* free it will decrement the ->files
count, and unlock the inode i_lock, but then still use the
"pipe_inode_info" afterwards when it does the "__pipe_unlock(pipe)".

This is *very* hard to trigger in practice, since the race window is
very small, and adding debug options seems to just hide it by slowing
things down.

Simon originally reported this way back in July as an Oops in
kmem_cache_allocate due to a single bit corruption (due to the final
"spin_unlock(pipe->mutex.wait_lock)" incrementing a field in a different
allocation that had re-used the free'd pipe-info), it's taken this long
to figure out.

Since the 'pipe->files' accesses aren't even protected by the pipe lock
(we very much use the inode lock for that), the simple solution is to
just drop the pipe lock early.  And since there were two users of this
pattern, create a helper function for it.

Introduced commit ba5bb14733 ("pipe: take allocation and freeing of
pipe_inode_info out of ->i_mutex").

Reported-by: Simon Kirby <sim@hostway.ca>
Reported-by: Ian Applegate <ia@cloudflare.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-11 22:36:26 -08:00
Kent Overstreet a27bb332c0 aio: don't include aio.h in sched.h
Faster kernel compiles by way of fewer unnecessary includes.

[akpm@linux-foundation.org: fix fallout]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-07 20:16:25 -07:00
Al Viro 4b8a8f1e4f get rid of the last free_pipe_info() callers
and rename __free_pipe_info() to free_pipe_info()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:13:02 -04:00
Al Viro 7bee130e22 get rid of alloc_pipe_info() argument
not used anymore

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:13:01 -04:00
Al Viro 6447a3cf19 get rid of pipe->inode
it's used only as a flag to distinguish normal pipes/FIFOs from the
internal per-task one used by file-to-file splice.  And pipe->files
would work just as well for that purpose...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:13:01 -04:00
Al Viro ebec73f475 introduce variants of pipe_lock/pipe_unlock for real pipes/FIFOs
fs/pipe.c file_operations methods *know* that pipe is not an internal one;
no need to check pipe->inode for those callers.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:13:01 -04:00
Al Viro de32ec4cfe pipe: set file->private_data to ->i_pipe
simplify get_pipe_info(), while we are at it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:13:00 -04:00
Al Viro 72b0d9aacb pipe: don't use ->i_mutex
now it can be done - put mutex into pipe_inode_info, use it instead
of ->i_mutex

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:13:00 -04:00
Al Viro ba5bb14733 pipe: take allocation and freeing of pipe_inode_info out of ->i_mutex
* new field - pipe->files; number of struct file over that pipe (all
  sharing the same inode, of course); protected by inode->i_lock.
* pipe_release() decrements pipe->files, clears inode->i_pipe when
  if the counter has reached 0 (all under ->i_lock) and, in that case,
  frees pipe after having done pipe_unlock()
* fifo_open() starts with grabbing ->i_lock, and either bumps pipe->files
  if ->i_pipe was non-NULL or allocates a new pipe (dropping and regaining
  ->i_lock) and rechecks ->i_pipe; if it's still NULL, inserts new pipe
  there, otherwise bumps ->i_pipe->files and frees the one we'd allocated.
  At that point we know that ->i_pipe is non-NULL and won't go away, so
  we can do pipe_lock() on it and proceed as we used to.  If we end up
  failing, decrement pipe->files and if it reaches 0 clear ->i_pipe and
  free the sucker after pipe_unlock().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:12:59 -04:00
Al Viro 18c03cfd40 pipe: preparation to new locking rules
* use the fact that file_inode(file)->i_pipe doesn't change
  while the file is opened - no locks needed to access that.
* switch to pipe_lock/pipe_unlock where it's easy to do

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:12:59 -04:00
Al Viro fc7478a2bf pipe: switch wait_for_partner() and wake_up_partner() to pipe_inode_info
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:12:59 -04:00
Al Viro 599a0ac14e pipe: fold file_operations instances in one
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:12:58 -04:00
Al Viro f776c73888 fold fifo.c into pipe.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:12:58 -04:00
Al Viro a930d87905 vfs: fix pipe counter breakage
If you open a pipe for neither read nor write, the pipe code will not
add any usage counters to the pipe, causing the 'struct pipe_inode_info"
to be potentially released early.

That doesn't normally matter, since you cannot actually use the pipe,
but the pipe release code - particularly fasync handling - still expects
the actual pipe infrastructure to all be there.  And rather than adding
NULL pointer checks, let's just disallow this case, the same way we
already do for the named pipe ("fifo") case.

This is ancient going back to pre-2.4 days, and until trinity, nobody
naver noticed.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-12 08:29:17 -07:00
Anatol Pomozov 39b6525274 fs: Preserve error code in get_empty_filp(), part 2
Allocating a file structure in function get_empty_filp() might fail because
of several reasons:
 - not enough memory for file structures
 - operation is not allowed
 - user is over its limit

Currently the function returns NULL in all cases and we loose the exact
reason of the error. All callers of get_empty_filp() assume that the function
can fail with ENFILE only.

Return error through pointer. Change all callers to preserve this error code.

[AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit
(things remaining here deal with alloc_file()), removed pipe(2) behaviour change]

Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22 23:31:32 -05:00
Al Viro 496ad9aa8e new helper: file_inode(file)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22 23:31:31 -05:00
Al Viro 5b249b1b07 pipe(2) - race-free error recovery
don't mess with sys_close() if copy_to_user() fails; just postpone
fd_install() until we know it hasn't.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:08:52 -04:00
Linus Torvalds a0e881b7c1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second vfs pile from Al Viro:
 "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
  deadlock reproduced by xfstests 068), symlink and hardlink restriction
  patches, plus assorted cleanups and fixes.

  Note that another fsfreeze deadlock (emergency thaw one) is *not*
  dealt with - the series by Fernando conflicts a lot with Jan's, breaks
  userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
  for massive vfsmount leak; this is going to be handled next cycle.
  There probably will be another pull request, but that stuff won't be
  in it."

Fix up trivial conflicts due to unrelated changes next to each other in
drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
  delousing target_core_file a bit
  Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
  fs: Remove old freezing mechanism
  ext2: Implement freezing
  btrfs: Convert to new freezing mechanism
  nilfs2: Convert to new freezing mechanism
  ntfs: Convert to new freezing mechanism
  fuse: Convert to new freezing mechanism
  gfs2: Convert to new freezing mechanism
  ocfs2: Convert to new freezing mechanism
  xfs: Convert to new freezing code
  ext4: Convert to new freezing mechanism
  fs: Protect write paths by sb_start_write - sb_end_write
  fs: Skip atime update on frozen filesystem
  fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
  fs: Improve filesystem freezing handling
  switch the protection of percpu_counter list to spinlock
  nfsd: Push mnt_want_write() outside of i_mutex
  btrfs: Push mnt_want_write() outside of i_mutex
  fat: Push mnt_want_write() outside of i_mutex
  ...
2012-08-01 10:26:23 -07:00
Al Viro e4fad8e5d2 consolidate pipe file creation
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:19 +04:00
Cong Wang 2164d33446 pipe: remove KM_USER0 from comments
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-07-24 15:27:34 +08:00
Josef Bacik c3b2da3148 fs: introduce inode operation ->update_time
Btrfs has to make sure we have space to allocate new blocks in order to modify
the inode, so updating time can fail.  We've gotten around this by having our
own file_update_time but this is kind of a pain, and Christoph has indicated he
would like to make xfs do something different with atime updates.  So introduce
->update_time, where we will deal with i_version an a/m/c time updates and
indicate which changes need to be made.  The normal version just does what it
has always done, updates the time and marks the inode dirty, and then
filesystems can choose to do something different.

I've gone through all of the users of file_update_time and made them check for
errors with the exception of the fault code since it's complicated and I wasn't
quite sure what to do there, also Jan is going to be pushing the file time
updates into page_mkwrite for those who have it so that should satisfy btrfs and
make it not a big deal to check the file_update_time() return code in the
generic fault path. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-06-01 12:07:25 -04:00
Will Deacon 46ce341b2f pipe: return -ENOIOCTLCMD instead of -EINVAL on unknown ioctl command
As described in commit 07d106d0a ("vfs: fix up ENOIOCTLCMD error
handling"), drivers should return -ENOIOCTLCMD if they receive an ioctl
command which they don't understand. Doing so will result in -ENOTTY
being returned to userspace, which matches the behaviour of the compat
layer if it fails to translate an ioctl command.

This patch fixes the pipe ioctl to return -ENOIOCTLCMD instead of
-EINVAL when passed an unknown ioctl command.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-05-30 21:04:55 -04:00
Linus Torvalds 9883035ae7 pipes: add a "packetized pipe" mode for writing
The actual internal pipe implementation is already really about
individual packets (called "pipe buffers"), and this simply exposes that
as a special packetized mode.

When we are in the packetized mode (marked by O_DIRECT as suggested by
Alan Cox), a write() on a pipe will not merge the new data with previous
writes, so each write will get a pipe buffer of its own.  The pipe
buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn
will tell the reader side to break the read at that boundary (and throw
away any partial packet contents that do not fit in the read buffer).

End result: as long as you do writes less than PIPE_BUF in size (so that
the pipe doesn't have to split them up), you can now treat the pipe as a
packet interface, where each read() system call will read one packet at
a time.  You can just use a sufficiently big read buffer (PIPE_BUF is
sufficient, since bigger than that doesn't guarantee atomicity anyway),
and the return value of the read() will naturally give you the size of
the packet.

NOTE! We do not support zero-sized packets, and zero-sized reads and
writes to a pipe continue to be no-ops.  Also note that big packets will
currently be split at write time, but that the size at which that
happens is not really specified (except that it's bigger than PIPE_BUF).
Currently that limit is the system page size, but we might want to
explicitly support bigger packets some day.

The main user for this is going to be the autofs packet interface,
allowing us to stop having to care so deeply about exact packet sizes
(which have had bugs with 32/64-bit compatibility modes).  But user
space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will
fail with an EINVAL on kernels that do not support this interface.

Tested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Miller <davem@davemloft.net>
Cc: Ian Kent <raven@themaw.net>
Cc: Thomas Meyer <thomas@m3y3r.de>
Cc: stable@kernel.org  # needed for systemd/autofs interaction fix
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-29 13:12:42 -07:00
Muthu Kumar b502bd1152 magic.h: move some FS magic numbers into magic.h
- Move open-coded filesystem magic numbers into magic.h

- Rearrange magic.h so that the filesystem-related constants are grouped
  together.

Signed-off-by: Muthukumar R <muthur@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-23 16:58:31 -07:00
Cong Wang e8e3c3d66f fs: remove the second argument of k[un]map_atomic()
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-03-20 21:48:21 +08:00
Sasha Levin 2ccd4f4d47 pipe: fail cleanly when root tries F_SETPIPE_SZ with big size
When a user with the CAP_SYS_RESOURCE cap tries to F_SETPIPE_SZ a pipe
with size bigger than kmalloc() can alloc it spits out an ugly warning:

  ------------[ cut here ]------------
  WARNING: at mm/page_alloc.c:2095 __alloc_pages_nodemask+0x5d3/0x7a0()
  Pid: 733, comm: a.out Not tainted 3.2.0-rc1+ #4
  Call Trace:
     warn_slowpath_common+0x75/0xb0
     warn_slowpath_null+0x15/0x20
     __alloc_pages_nodemask+0x5d3/0x7a0
     __get_free_pages+0x12/0x50
     __kmalloc+0x12b/0x150
     pipe_set_size+0x75/0x120
     pipe_fcntl+0xf8/0x140
     do_fcntl+0x2d4/0x410
     sys_fcntl+0x66/0xa0
     system_call_fastpath+0x16/0x1b
  ---[ end trace 432f702e6db7b5ee ]---

Instead, make kcalloc() handle the overflow case and fail quietly.

[akpm@linux-foundation.org: switch to sizeof(*bufs) for 80-column niceness]
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Acked-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12 20:13:04 -08:00
Al Viro 84b92d39f9 vfs: pipe.c is really non-modular
... so no exitcalls there.  Not much would work if pipe(2) would stop
working, after all...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:52:41 -05:00
Pavel Emelyanov d70ef97baf fs/pipe.c: add ->statfs callback for pipefs
Currently a statfs on a pipe's /proc/<pid>/fd/ link returns -ENOSYS.  Wire
pipfs up so that the statfs succeeds.

This is required by checkpoint-restart in the userspace to make it
possible to distinguish pipes from fifos.

When we dump information about task's open files we use the /proc/pid/fd
directoy's symlinks and the fact that opening any of them gives us exactly
the same dentry->inode pair as the original process has.  Now if a task
we're dumping has opened pipe and fifo we need to detect this and act
accordingly.  Knowing that an fd with type S_ISFIFO resides on a pipefs is
the most precise way.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31 17:30:51 -07:00
Eric Dumazet a209dfc7b0 vfs: dont chain pipe/anon/socket on superblock s_inodes list
Workloads using pipes and sockets hit inode_sb_list_lock contention.

superblock s_inodes list is needed for quota, dirty, pagecache and
fsnotify management. pipe/anon/socket fs are clearly not candidates for
these.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-26 12:57:09 -04:00