JFFS2 on dataflash problem
Creech, Matthew
2005-02-02 15:00:56 UTC
Hi,

Please forgive the cross-posting, but I'm not sure where exactly to go
with this problem.

I have an embedded device based on Atmel's AT91RM9200DK board, which is
using serial dataflash (AT45DB642). I've allocated a JFFS2 partition to
store non-volatile data. In testing I stumbled across a particular
problem that only occurs after heavy hammering on our device, but is
fairly consistent in how and when it occurs. The pattern has been
narrowed down so that a script doing something like this:

while [ 1 ]; do
    cp /mnt/jffs2/$RANDOM_FILE /mnt/jffs2/$BLAH
    # File size has been tested between 8K and 64K
done

makes the problem occur within 24 to 36 hours. So something about
copying one file over another one breaks things. The "problem" here is
that every I/O operation having to do with the JFFS2 partition blocks
indefinitely. For example, after running the test for 2 days, you can
log into the device and try to "ls" the contents of /mnt/jffs2, and your
shell will hang. You can then login on another terminal, but you'll get
another hang if you try to have any interaction with the JFFS2
partition. So everything else seems to function normally, but JFFS2
just dies. Also note that rebooting the device sometimes fixes things
right up (JFFS2 mounts fine and works properly as if nothing happened at
all), but sometimes the filesystem image is corrupt and refuses to
mount.
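
Because any command that touches the partition blocks once the failure has occurred, a probe that runs the command in the background and gives up after a timeout can record when the hang starts without tying up a shell. The following is only a minimal sketch: the function name `probe`, the timeout values, and the log path in the usage note are illustrative assumptions, not part of the setup described above.

```shell
#!/bin/sh
# probe TIMEOUT COMMAND [ARGS...]
# Run COMMAND in the background and poll once per second.
# Prints "OK" if it finishes within TIMEOUT seconds, otherwise "HUNG".
# (Hypothetical helper; a process stuck in uninterruptible I/O cannot
# be killed, so a hung command is simply left behind.)
probe() {
    timeout=$1
    shift
    # Detach the command's output so the caller never blocks on it.
    "$@" >/dev/null 2>&1 &
    pid=$!
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "HUNG"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "OK"
}
```

Something like `probe 30 ls /mnt/jffs2 || date >> /tmp/jffs2-hang.log`, run periodically alongside the copy loop, would then timestamp the first hang.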

The system's specs are as follows:
AT91RM9200
AT45DB642 8MB serial dataflash device on SPI channel
2.4.25 kernel with VRS2 patchset, plus...
Andrew Victor's AT91 patchset (http://maxim.org.za/AT91RM9200/), plus...
Various snapshots of MTD taken over the past several months (no change)
Snapgear-3.2.0 userland (doubt this makes a difference)

As noted above, I've tried this with various MTD snapshots (using the
default MTD in 2.4.25 makes JFFS2 die almost immediately when doing
anything). I've also recently compiled 2.6.10 with the AT91 patchset
(http://maxim.org.za/AT91RM9200/2.6/), but the *exact* same thing
happens.

The only kernel output I get is a few repetitions of this message, just
before the problem begins:

Node totlen on flash (0xffffffff) != totlen from node ref ([some
close-to-zero number])

My testing using raw dataflash access seems to rule out dataflash
issues, and _suggest_ that MTD isn't directly to blame, since no errors
occur when copying images to /dev/mtd/X then reading them back. But I
can't rule anything out for sure. It seems more likely that this is
some strange interaction between dataflash, MTD, and JFFS2, possibly
related to this device's 1056-byte blocksize; IIRC this was a problem in
the past that required some patching, since most code assumes a
power-of-two blocksize.
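
For reference, the raw read-back test mentioned above amounts to something like the sketch below. It is not the exact procedure used; the function name `readback_check` is made up for illustration, the target path is a placeholder, and any erase step the device needs before writing is left out.

```shell
#!/bin/sh
# readback_check IMAGE TARGET
# Write IMAGE to TARGET in 1056-byte blocks, read the same number of
# bytes back, and compare them with the original image.
# TARGET is a placeholder: an MTD character device on the board, or a
# plain file when trying the script on a workstation.
readback_check() {
    image=$1
    target=$2
    size=$(wc -c < "$image") || return 2
    size=$((size))    # strip padding that some wc implementations emit

    # conv=notrunc keeps a regular-file target from being truncated,
    # so it behaves like a fixed-size device.
    dd if="$image" of="$target" bs=1056 conv=notrunc 2>/dev/null || return 2

    # Compare only the bytes that were written.
    if head -c "$size" "$target" | cmp -s - "$image"; then
        echo "OK"
    else
        echo "MISMATCH"
        return 1
    fi
}
```

Running this in a loop with random images exercises the dataflash and MTD layers without involving JFFS2, which is what makes a clean result point away from those layers.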

These are all just guesses, though, which is why I'm posting this
message. I'm wondering whether there are any other ideas I can try to
narrow this problem down; since testing it requires 1-2 days, it's
fruitless to just make random guesses and see if they "fix" things. Any
suggestions you have are greatly appreciated!

Thanks for the help
--
Matthew L. Creech
Ulf Samuelsson
2005-02-02 21:32:27 UTC
Post by Creech, Matthew
I have an embedded device based on Atmel's AT91RM9200DK board, which is
using serial dataflash (AT45DB642). I've allocated a JFFS2 partition to
store non-volatile data. In testing I stumbled across a particular
problem that only occurs after heavy hammering on our device, but is
fairly consistent in how and when it occurs. The pattern has been
while [ 1 ]; do
    cp /mnt/jffs2/$RANDOM_FILE /mnt/jffs2/$BLAH
    # File size has been tested between 8K and 64K
done
If I wrote this script I would call it:

wear_out_dataflash_quickly.sh

There are limits on the number of erase cycles in the dataflash
(in any flash, to be correct).
You can expect to reprogram it 50,000-100,000 times before the first errors
occur.
The second thing is that you need to do a block erase after the sum
of erases inside an 8-page block exceeds 10,000.

I am not at all sure that the MTD drivers/JFFS2 handle this (I did not look
at the code).
I assume that JFFS2 may be able to detect a bad write and map
the block out from time to time, which could explain why a reboot
sometimes recovers the filesystem.

If you really want to test the dataflash, write a CRC in the extra bytes
available (the page is 1024 + 32 bytes), then read the page back and
check the CRC.
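
A rough sketch of that CRC test follows. The layout is an assumption made for illustration: 1024 data bytes followed by a 32-byte area holding the decimal `cksum` CRC as space-padded ASCII. The function names `make_page` and `check_page` are hypothetical, and how a page gets written to and read from the device is left out, since that depends on the driver.

```shell
#!/bin/sh
# Assumed page layout: 1024 data bytes + 32 extra bytes for the CRC.
DATA_SIZE=1024
SPARE_SIZE=32

# make_page DATA_FILE PAGE_FILE
# Build one 1056-byte page image: the first 1024 bytes of DATA_FILE
# (NUL-padded if shorter), then the CRC of those bytes.
make_page() {
    dd if="$1" of="$2" bs="$DATA_SIZE" count=1 conv=sync 2>/dev/null || return 2
    crc=$(cksum < "$2" | cut -d' ' -f1)
    # Left-justified, space-padded to fill the 32 extra bytes.
    printf '%-32s' "$crc" >> "$2"
}

# check_page PAGE_FILE
# Recompute the CRC over the data area and compare it with the stored one.
check_page() {
    stored=$(tail -c "$SPARE_SIZE" "$1" | tr -d ' ')
    actual=$(head -c "$DATA_SIZE" "$1" | cksum | cut -d' ' -f1)
    if [ "$stored" = "$actual" ]; then
        echo "OK"
    else
        echo "CRC MISMATCH"
        return 1
    fi
}
```

Each page then carries its own check value, so a page that reads back wrong can be identified on its own, without keeping a copy of the original image.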

Best Regards,
Ulf Samuelsson
***@a-t-m-e-l.com
