android_kernel_google_msm/fs/proc
Andrea Arcangeli 7766755a2f Fix /proc dcache deadlock in do_exit
This patch fixes a sles9 system hang in start_this_handle from a customer
with some heavy workload where all tasks are waiting on kjournald to commit
the transaction, but kjournald waits on t_updates to go down to zero (it
never does).

This was reported as a lowmem shortage deadlock but when checking the debug
data I noticed the VM wasn't under pressure at all (well it was really
under vm pressure, because lots of tasks hanged in the VM prune_dcache
methods trying to flush dirty inodes, but no task was hanging in GFP_NOFS
mode, the holder of the journal handle should have if this was a vm issue
in the first place).

No task was apparently holding the leftover handle in the committing
transaction, so I deduced t_updates was stuck to 1 because a journal_stop
was never run by some path (this turned out to be correct).  With a debug
patch adding proper reverse links and stack trace logging in ext3 deployed
in production, I found journal_stop is never run because
mark_inode_dirty_sync is called inside release_task called by do_exit.
(that was quite fun because I would have never thought about this
subtleness, I thought a regular path in ext3 had a bug and it forgot to
call journal_stop)

do_exit->release_task->mark_inode_dirty_sync->schedule() (will never
come back to run journal_stop)

The reason is that shrink_dcache_parent is racy by design (feature not
a bug) and it can do blocking I/O in some case, but the point is that
calling shrink_dcache_parent at the last stage of do_exit isn't safe
for self-reaping tasks.

I guess the memory pressure of the unbalanced highmem system allowed
to trigger this more easily.

Now mainline doesn't have this line in iput (like sles9 has):

    	     if (inode->i_state & I_DIRTY_DELAYED)
	     			mark_inode_dirty_sync(inode);

so it will probably not crash with ext3, but for example ext2 implements an
I/O-blocking ext2_put_inode that will lead to similar screwups with
ext2_free_blocks never coming back and it's definitely wrong to call
blocking-IO paths inside do_exit.  So this should fix a subtle bug in
mainline too (not verified in practice though).  The equivalent fix for
ext3 is also not verified yet to fix the problem in sles9 but I don't have
doubt it will (it usually takes days to crash, so it'll take weeks to be
sure).

An alternate fix would be to offload that work to a kernel thread, but I
don't think a reschedule for this is worth it, the vm should be able to
collect those entries for the synchronous release_task.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>
Cc: Jan Kara <jack@ucw.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:18 -08:00
..
array.c Merge branch 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc 2008-02-01 11:45:47 +11:00
base.c Fix /proc dcache deadlock in do_exit 2008-02-05 09:44:18 -08:00
generic.c proc: remove/Fix proc generic d_revalidate 2007-12-10 19:43:55 -08:00
inode-alloc.txt
inode.c proc: fix proc_dir_entry refcounting 2007-12-05 09:21:20 -08:00
internal.h maps4: add /proc/pid/pagemap interface 2008-02-05 09:44:16 -08:00
kcore.c is_vmalloc_addr(): Check if an address is within the vmalloc boundaries 2008-02-05 09:44:14 -08:00
kmsg.c
Makefile [NET]: Make /proc/net per network namespace 2007-10-10 16:49:06 -07:00
mmu.c fs/proc/mmu.c: headers butchery 2007-10-17 08:42:48 -07:00
nommu.c
proc_devtree.c
proc_misc.c maps4: make page monitoring /proc file optional 2008-02-05 09:44:17 -08:00
proc_net.c [ATM]: Oops reading net/atm/arp 2008-01-28 15:01:36 -08:00
proc_sysctl.c Fix pointer mismatches in proc_sysctl.c 2007-10-25 15:16:49 -07:00
proc_tty.c
root.c proc: fix proc_dir_entry refcounting 2007-12-05 09:21:20 -08:00
task_mmu.c maps4: make page monitoring /proc file optional 2008-02-05 09:44:17 -08:00
task_nommu.c restrict reading from /proc/<pid>/maps to those who share ->mm or can ptrace pid 2008-01-02 13:13:27 -08:00
vmcore.c