Discussion:
jffs2 deadlock introduced in linux 2.6.22.5
Jason Lunz
2007-08-30 18:23:55 UTC
commit 1d8715b388c978b0f1b1bf4812fcee0e73b023d7 was added between
2.6.22.4 and 2.6.22.5 to cure a locking problem, but it seems to have
introduced another (worse?) one.

With a jffs2 filesystem (on block2mtd) on a 2.6.22.5 kernel, if I do
anything that appends to a file with many small writes, I get what looks
like a deadlock between the writer and the jffs2 gc thread. For example:

# while true; do echo >> /some/file/on/jffs2; done

will result in the bash hanging in D state, with these kernel stacks in
dmesg after "echo t > /proc/sysrq-trigger":

jffs2_gcd_mtd S DFD1EEA8 0 1086 2 (L-TLB)
dfd1eebc 00000046 00000002 dfd1eea8 dfd1eea4 00000000 00000000 c0334a00
c0334a00 00000000 0000000a dfcb8550 2ee3df10 0000001a 00002280 dfcb8670
c1407a00 00000000 00000286 df9fa600 dfe20900 ffff414a c1407ec4 0000ffff
Call Trace:
[<c026b84c>] __down_interruptible+0xb2/0x10b
[<c0269e4b>] __sched_text_start+0x14b/0x8a4
[<c0115380>] default_wake_function+0x0/0xc
[<c026b727>] __down_failed_interruptible+0x7/0xc
[<e09425bd>] jffs2_garbage_collect_pass+0x20/0x597 [jffs2]
[<c0120cd0>] __dequeue_signal+0xd7/0x11c
[<c01209ed>] recalc_sigpending+0xb/0x1d
[<c01221e5>] dequeue_signal+0x9d/0x117
[<e09439e7>] jffs2_garbage_collect_thread+0x11b/0x15a [jffs2]
[<c0103bf6>] ret_from_fork+0x6/0x1c
[<e09438cc>] jffs2_garbage_collect_thread+0x0/0x15a [jffs2]
[<e09438cc>] jffs2_garbage_collect_thread+0x0/0x15a [jffs2]
[<c01048fb>] kernel_thread_helper+0x7/0x10

bash D CE2C85E0 0 2223 2219 (NOTLB)
d834bb2c 00000086 00000000 ce2c85e0 ce2c85e0 ce8004c0 00000003 c0334a00
c0334a00 00000000 00000009 df93ca70 2ee3bc90 0000001a 0002ff74 df93cb90
c1407a00 00000000 00000000 00000000 dfe20900 00000000 c1407ec4 00000000
Call Trace:
[<c026aa96>] io_schedule+0x1e/0x28
[<c0139f4f>] sync_page+0x38/0x3b
[<c026ad57>] __wait_on_bit+0x33/0x58
[<c0139f17>] sync_page+0x0/0x3b
[<c013a09d>] wait_on_page_bit+0x63/0x69
[<c01286d4>] wake_bit_function+0x0/0x3c
[<c013bce8>] read_cache_page+0x28/0x3f
[<e0943b18>] jffs2_gc_fetch_page+0x26/0x3b [jffs2]
[<e0941fa8>] jffs2_garbage_collect_live+0x992/0xf87 [jffs2]
[<e08ee60e>] block2mtd_write+0x18f/0x1a6 [block2mtd]
[<e08cd000>] default_mtd_writev+0x0/0x9e [mtdcore]
[<e09449e2>] jffs2_flash_direct_writev+0x62/0xd0 [jffs2]
[<e0942a9c>] jffs2_garbage_collect_pass+0x4ff/0x597 [jffs2]
[<e0a02c82>] aufs_read_unlock+0x17/0x5f [aufs]
[<e0a1d546>] ibend+0x39/0x3f [aufs]
[<e093d3ba>] jffs2_reserve_space+0xb5/0x15b [jffs2]
[<c0121e28>] send_sig_info+0x55/0x65
[<e093f845>] jffs2_write_inode_range+0x5a/0x278 [jffs2]
[<e093b696>] jffs2_commit_write+0xec/0x1be [jffs2]
[<c013b494>] generic_file_buffered_write+0x3f1/0x5af
[<c016135e>] dput+0x15/0xda
[<c015b4d4>] __link_path_walk+0xb2d/0xc0e
[<c011d750>] current_fs_time+0x41/0x46
[<c013bacb>] __generic_file_aio_write_nolock+0x479/0x4c8
[<c015b65e>] link_path_walk+0xa9/0xb3
[<c013bb7b>] generic_file_aio_write+0x61/0xc2
[<c0159936>] permission+0xc8/0xdb
[<c0153384>] do_sync_write+0x0/0x10a
[<c015344b>] do_sync_write+0xc7/0x10a
[<c015c56e>] open_namei+0x254/0x571
[<c01286a1>] autoremove_wake_function+0x0/0x33
[<c0153384>] do_sync_write+0x0/0x10a
[<c0153c2a>] vfs_write+0xa8/0x130
[<c015419f>] sys_write+0x41/0x67
[<c0103d12>] syscall_call+0x7/0xb

Given that I never saw any jffs2 deadlocks in other 2.6.22 kernels,
maybe that commit should be reverted until a better solution is found?

Jason
Jason Lunz
2007-08-31 21:26:36 UTC
Post by Jason Lunz
commit 1d8715b388c978b0f1b1bf4812fcee0e73b023d7 was added between
2.6.22.4 and 2.6.22.5 to cure a locking problem, but it seems to have
introduced another (worse?) one.
I spoke too soon. I checked more carefully, and this problem was
introduced somewhere between 2.6.21 and 2.6.22. The jffs2 fix in
2.6.22.5 isn't the culprit.

Jason
Jesper Juhl
2007-08-31 21:32:26 UTC
Post by Jason Lunz
Post by Jason Lunz
commit 1d8715b388c978b0f1b1bf4812fcee0e73b023d7 was added between
2.6.22.4 and 2.6.22.5 to cure a locking problem, but it seems to have
introduced another (worse?) one.
I spoke too soon. I checked more carefully, and this problem was
introduced somewhere between 2.6.21 and 2.6.22. The jffs2 fix in
2.6.22.5 isn't the culprit.
Sounds like this belongs on the regression tracking page at
http://kernelnewbies.org/known_regressions and/or in a bugzilla
(http://bugzilla.kernel.org/) bugreport so it doesn't get lost.
--
Jesper Juhl <***@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
Jason Lunz
2007-09-01 19:06:03 UTC
I've bisected the deadlock that occurs when many small appends are done on
jffs2 down to this commit:

commit 6fe6900e1e5b6fa9e5c59aa5061f244fe3f467e2
Author: Nick Piggin <***@suse.de>
Date: Sun May 6 14:49:04 2007 -0700

mm: make read_cache_page synchronous

Ensure pages are uptodate after returning from read_cache_page, which allows
us to cut out most of the filesystem-internal PageUptodate calls.

I didn't have a great look down the call chains, but this appears to fix 7
possible use-before-uptodate bugs in hfs, 2 in hfsplus, 1 in jfs, a few in
ecryptfs, 1 in jffs2, and a possible case of cleared data being overwritten
by readpage in block2mtd. All depend on whether the filler is async and/or
can return with a !uptodate page.

It added a wait to read_cache_page, and introduced a
read_cache_page_async function equivalent to the old read_cache_page,
currently without any callers.

Switching jffs2_gc_fetch_page to read_cache_page_async for the old
behavior makes the deadlocks go away, but maybe reintroduces the
use-before-uptodate problem? I don't understand the mm/fs interaction
well enough to say.

Someone more knowledgeable should see if similar deadlock issues may have
been introduced for other read_cache_page callers, including the other
two in jffs2.

Signed-off-by: Jason Lunz <***@falooley.org>

---
fs/jffs2/fs.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/jffs2/fs.c b/fs/jffs2/fs.c
index 1d3b7a9..8bc727b 100644
--- a/fs/jffs2/fs.c
+++ b/fs/jffs2/fs.c
@@ -627,7 +627,7 @@ unsigned char *jffs2_gc_fetch_page(struct jffs2_sb_info *c,
struct inode *inode = OFNI_EDONI_2SFFJ(f);
struct page *pg;

- pg = read_cache_page(inode->i_mapping, offset >> PAGE_CACHE_SHIFT,
+ pg = read_cache_page_async(inode->i_mapping, offset >> PAGE_CACHE_SHIFT,
(void *)jffs2_do_readpage_unlock, inode);
if (IS_ERR(pg))
return (void *)pg;
Nick Piggin
2007-09-02 04:20:12 UTC
Post by Jason Lunz
It introduced a wait to read_cache_page, as well as a
read_cache_page_async function equivalent to the old read_cache_page
without any callers.
Switching jffs2_gc_fetch_page to read_cache_page_async for the old
behavior makes the deadlocks go away, but maybe reintroduces the
use-before-uptodate problem? I don't understand the mm/fs interaction
well enough to say.
Someone more knowledgeable should see if similar deadlock issues may have
been introduced for other read_cache_page callers, including the other
two in jffs2.
Hmm, thanks for that. It does sound like it is deadlocking via
commit_write(). OTOH, it seems like it could be using the page
before it is uptodate -- it _may_ only be dealing with uptodate
data at that point... but if so, why even read_cache_page at
all?

However, it is a regression. So unless David can come up with a
more satisfactory approach, I guess we'd have to go with your
patch.
Post by Jason Lunz
---
fs/jffs2/fs.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/fs/jffs2/fs.c b/fs/jffs2/fs.c
index 1d3b7a9..8bc727b 100644
--- a/fs/jffs2/fs.c
+++ b/fs/jffs2/fs.c
@@ -627,7 +627,7 @@ unsigned char *jffs2_gc_fetch_page(struct jffs2_sb_info *c,
struct inode *inode = OFNI_EDONI_2SFFJ(f);
struct page *pg;
- pg = read_cache_page(inode->i_mapping, offset >> PAGE_CACHE_SHIFT,
+ pg = read_cache_page_async(inode->i_mapping, offset >> PAGE_CACHE_SHIFT,
(void *)jffs2_do_readpage_unlock, inode);
if (IS_ERR(pg))
return (void *)pg;
David Woodhouse
2007-09-02 12:13:23 UTC
Jason, thank you _so_ much for finding the underlying cause of this.
Post by Nick Piggin
Hmm, thanks for that. It does sound like it is deadlocking via
commit_write(). OTOH, it seems like it could be using the page
before it is uptodate -- it _may_ only be dealing with uptodate
data at that point... but if so, why even read_cache_page at
all?
jffs2_readpage() is synchronous -- there's no chance that the page won't
be up to date. We're doing this for garbage collection -- if there are
many log entries covering a single page of data, we want to write out a
single replacement which covers the whole page, obsoleting the previous
suboptimal representation of the same data.
Post by Nick Piggin
However, it is a regression. So unless David can come up with a
more satisfactory approach, I guess we'd have to go with your
patch.
I think Jason's patch is the best answer for the moment. At some point
in the very near future I want to improve the RAM usage and compression
ratio by dropping the rule that data nodes may not cross page boundaries
-- in which case garbage collection will need to do something other than
reading the page using read_cache_page() and then writing it out again;
it'll probably need to end up using its own internal buffer. But for
now, Jason's patch looks good.

Thanks.
--
dwmw2
Nick Piggin
2007-09-02 13:20:34 UTC
Post by David Woodhouse
Jason, thank you _so_ much for finding the underlying cause of this.
Post by Nick Piggin
Hmm, thanks for that. It does sound like it is deadlocking via
commit_write(). OTOH, it seems like it could be using the page
before it is uptodate -- it _may_ only be dealing with uptodate
data at that point... but if so, why even read_cache_page at
all?
jffs2_readpage() is synchronous -- there's no chance that the page won't
be up to date. We're doing this for garbage collection -- if there are
many log entries covering a single page of data, we want to write out a
single replacement which covers the whole page, obsoleting the previous
suboptimal representation of the same data.
OK, but then hasn't the patch just made the deadlock harder to hit,
or is there some invariant that says that readpage() will never be
invoked if gc was invoked on the same page as we're commit_write()ing?

The Q/A comments aren't very sure about this. I guess from the look
of it, prepare_write/commit_write make sure the page will be uptodate
by the start of commit_write, and you avoid GCing the page in
prepare_write because your new page won't have any nodes allocated
yet that can possibly be GCed?

BTW. with write_begin/write_end, you get to control the page lock,
so for example if the readpage in prepare_write for partial writes
is *only* for the purpose of avoiding this deadlock later, you
could possibly avoid the RMW with the new aops. Maybe it would
help you with data nodes crossing page boundaries too...
Post by David Woodhouse
Post by Nick Piggin
However, it is a regression. So unless David can come up with a
more satisfactory approach, I guess we'd have to go with your
patch.
I think Jason's patch is the best answer for the moment. At some point
in the very near future I want to improve the RAM usage and compression
ratio by dropping the rule that data nodes may not cross page boundaries
-- in which case garbage collection will need to do something other than
reading the page using read_cache_page() and then writing it out again;
it'll probably need to end up using its own internal buffer. But for
now, Jason's patch looks good.
OK, thanks for looking at it. If you'd care to pass it on to Linus
before he releases 2.6.23 in random() % X days time... ;)
David Woodhouse
2007-09-02 13:48:04 UTC
Post by Nick Piggin
OK, but then hasn't the patch just made the deadlock harder to hit,
or is there some invariant that says that readpage() will never be
invoked if gc was invoked on the same page as we're commit_write()ing?
The Q/A comments aren't very sure about this. I guess from the look
of it, prepare_write/commit_write make sure the page will be uptodate
by the start of commit_write,
That's the intention, yes.
Post by Nick Piggin
and you avoid GCing the page in
prepare_write because your new page won't have any nodes allocated
yet that can possibly be GCed?
We _might_ GC the page -- it might not be a new page; we might be
overwriting it. But it's fine if we do. Actually it's slightly
suboptimal because we'll write out the same data twice -- once in GC and
then immediately afterward in the write which we were making space for.
But that's not the end of the world, and it's not very common.
Post by Nick Piggin
BTW. with write_begin/write_end, you get to control the page lock,
so for example if the readpage in prepare_write for partial writes
is *only* for the purpose of avoiding this deadlock later, you
could possibly avoid the RMW with the new aops. Maybe it would
help you with data nodes crossing page boundaries too...
I'll look at that; thanks.
Post by Nick Piggin
OK, thanks for looking at it. If you'd care to pass it on to Linus
before he releases 2.6.23 in random() % X days time... ;)
Not before the Kernel Summit now, I suspect. But yes, I'll do that later
today or in the morning (the linuxconf.eu conference has already
started).
--
dwmw2
Nick Piggin
2007-09-02 14:17:56 UTC
Post by David Woodhouse
Post by Nick Piggin
OK, but then hasn't the patch just made the deadlock harder to hit,
or is there some invariant that says that readpage() will never be
invoked if gc was invoked on the same page as we're commit_write()ing?
The Q/A comments aren't very sure about this. I guess from the look
of it, prepare_write/commit_write make sure the page will be uptodate
by the start of commit_write,
That's the intention, yes.
Post by Nick Piggin
and you avoid GCing the page in
prepare_write because your new page won't have any nodes allocated
yet that can possibly be GCed?
We _might_ GC the page -- it might not be a new page; we might be
overwriting it. But it's fine if we do. Actually it's slightly
suboptimal because we'll write out the same data twice -- once in GC and
then immediately afterward in the write which we were making space for.
But doesn't GC only happen in prepare_write in the case that the
i_size is being extended into a new page?

If you GC the page in prepare_write (when it may be !uptodate), then
I'm sure you would get a deadlock when read_cache_page finds it
non-uptodate and locks it for readpage().
Post by David Woodhouse
But that's not the end of the world, and it's not very common.
Post by Nick Piggin
BTW. with write_begin/write_end, you get to control the page lock,
so for example if the readpage in prepare_write for partial writes
is *only* for the purpose of avoiding this deadlock later, you
could possibly avoid the RMW with the new aops. Maybe it would
help you with data nodes crossing page boundaries too...
I'll look at that; thanks.
OK. The patches are in -mm now, but could get in as early as 2.6.24.
If you have any suggestions about the form of the APIs, it would be
good to hear them.
Post by David Woodhouse
Post by Nick Piggin
OK, thanks for looking at it. If you'd care to pass it on to Linus
before he releases 2.6.23 in random() % X days time... ;)
Not before the Kernel Summit now, I suspect. But yes, I'll do that later
today or in the morning (the linuxconf.eu conference has already
started).
Thanks,
David Woodhouse
2007-09-02 16:15:26 UTC
Post by Nick Piggin
Post by David Woodhouse
Post by Nick Piggin
OK, but then hasn't the patch just made the deadlock harder to hit,
or is there some invariant that says that readpage() will never be
invoked if gc was invoked on the same page as we're commit_write()ing?
The Q/A comments aren't very sure about this. I guess from the look
of it, prepare_write/commit_write make sure the page will be uptodate
by the start of commit_write,
That's the intention, yes.
Post by Nick Piggin
and you avoid GCing the page in
prepare_write because your new page won't have any nodes allocated
yet that can possibly be GCed?
We _might_ GC the page -- it might not be a new page; we might be
overwriting it. But it's fine if we do. Actually it's slightly
suboptimal because we'll write out the same data twice -- once in GC and
then immediately afterward in the write which we were making space for.
But doesn't GC only happen in prepare_write in the case that the
i_size is being extended into a new page?
Ah, yes. I was thinking of commit_write, and it had temporarily escaped
my notice that we also write in prepare_write, to extend the file.

So yes, you're probably right that it doesn't matter; in any GC
triggered from _prepare_write_ we avoid GCing the page in question
because it by definition doesn't exist.
--
dwmw2