unable to boot Linux smoothly while there is physical address 0xc0000000.

Tue Mar 7 01:40:19 PST 2006

On Tuesday 07 March 2006 10:49, stanliao at globalunichip.com wrote:
> I can't classify what the problem is. Only message could be shown.

At least we can see that it is an ARM based system.
Were we supposed to telepatically learn that from your previous message?

> The following PART1 is message while the problem appears (and the physical
> address 0xc0000000~0xc0100000 exists in the system).
> PART2 shows the message while there is no such problem (and there is no
> physical address 0xc0000000~0xc0100000 in the system).
> 
> 
> PART1:
> Based upon Swansea University Computer Society NET3.039
> Initializing RT netlink socket
> Starting kswapd
> devfs: v1.12c (20020818) Richard Gooch (rgooch at atnf.csiro.au)
> devfs: boot_options: 0x1
> i2c-mmp2.o:
> ttyAM%d0 at MMIO 0x320b0000 (irq = 28) is a GUC-MVP2000 UART
> RAMDISK driver initialized: 16 RAM disks of 8192K size 1024 blocksize
> dmfe: Globalunichip Mercury net driver, version 1.36.4 (2002-01-17)
> dmfe: mode =  8
>  12:23:44:49:05:87, irq 23.
> , mercury IO addr = f3012000.
> i2c_register_driver : I2C_BUS_MAX = 4
> uda = 0xc014a7f4<6>uda->bus = 0xc0125864<6>uda = 0xc014a7f4<6>uda->bus =
> 0xc0125
> 864<6>uda = 0xc014a7f4<6>uda->bus = 0xc0125864<6>uda =
> 0xc014a7f4<6>uda->bus = 0
> xc0125864<6>uda = 0xc014a7f4<6>uda->bus = 0xc0125864<6>uda =
> 0xc014a7f4<6>uda->b
> us = 0xc0125864<6>uda = 0xc014a7f4<6>uda->bus = 0xc0125864<6>uda =
> 0xc014a7f4<6>
> uda->bus = 0xc0125864<6>uda = 0xc014a7f4<6>uda->bus = 0xc0125864<6>uda =
> 0xc014a
> 7f4<6>uda->bus = 0xc0125864<6>uda = 0xc014a7f4<6>uda->bus =
> 0xc0125864<6>uda = 0
> xc014a7f4<6>uda->bus = 0xc0125864<6>uda = 0xc014a7f4<6>uda->bus =
> 0xc0125864<6>N
> ET4: Linux TCP/IP 1.0 for NET4.0
> IP Protocols: ICMP, UDP, TCP
> IP: routing cache hash table of 512 buckets, 4Kbytes
> TCP: Hash tables configured (established 4096 bind 8192)
> NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
> NetWinder Floating Point Emulator V0.97 (extended precision)
> RAMDISK: Compressed image found at block 0
> Unable to handle kernel paging request at virtual address 1a00007d
> pgd = c0004000
> [1a00007d] *pgd=00000000, *pmd = 00000000
> Internal error: Oops: 1
> CPU: 0
> pc : [<c001497c>]    lr : [<c0014d0c>]    Not tainted
> sp : c020d9f0  ip : 00000002  fp : 00000009
> r10: 00000000  r9 : 00000006  r8 : 00002996
> r7 : 00000002  r6 : 00000003  r5 : 00001d8e  r4 : 00000005
> r3 : c012a6d4  r2 : 00000003  r1 : c02f9d89  r0 : 1a00007d
> Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  Segment kernel
> Control: 5317F  Table: 3C004000  DAC: 0000001D
> Process swapper (pid: 1, stack limit = 0xc020c368)
> Stack: (0xc020d9f0 to 0xc020e000)
> d9e0:                                     0000003f 000001ff c0250c08
> c02f6008
> da00: 00000006 00000004 00000004 00000000 c020e410 00000004 00000134
> 00000116
> da20: c0015798 c0103908 c026ccc4 c020da40 c020da48 c020da4c 0000001e
> 0000007f
> da40: 00000006 c0250c08 00000009 c02f6008 00000006 00000006 00000007
> 00000007
> da60: 00000006 00000007 00000007 00000008 00000007 00000008 00000008
> 00000008
> da80: 00000008 0000000a 00000008 0000000a 00000006 00000009 00000009
> 00000008
> daa0: 00000007 00000008 0000000a 0000000a 00000008 0000000c 00000009
> 00000009
> dac0: 00000008 0000000c 0000000a 0000000b 00000006 0000000c 0000000b
> 0000000a
> dae0: 00000008 0000000a 00000000 0000000b 00000008 0000000c 00000009
> 0000000a
> db00: 00000008 0000000a 0000000a 0000000a 00000006 00000009 0000000b
> 0000000a
> db20: 00000008 0000000a 0000000c 0000000b 00000008 0000000c 0000000a
> 0000000a
> db40: 00000008 00000000 0000000b 0000000b 00000007 0000000a 00000009
> 0000000a
> db60: 00000009 0000000a 0000000a 00000009 00000009 0000000c 0000000c
> 0000000c
> db80: 00000009 0000000c 00000000 0000000a 00000007 0000000b 00000009
> 00000009
> dba0: 0000000a 0000000a 00000009 0000000a 0000000a 0000000b 0000000a
> 0000000b
> dbc0: 00000008 0000000c 0000000b 0000000a 00000008 0000000a 00000009
> 0000000a
> dbe0: 00000009 0000000a 0000000a 0000000b 0000000a 0000000b 0000000c
> 0000000c
> dc00: 00000009 0000000a 0000000a 0000000c 00000008 0000000b 0000000b
> 0000000c
> dc20: 0000000a 0000000a 0000000b 0000000a 0000000a 0000000c 0000000b
> 00000000
> dc40: 0000000a 0000000a 0000000c 0000000a 00000007 00000008 00000008
> 00000009
> dc60: 00000007 00000008 00000008 00000009 00000009 0000000a 00000009
> 0000000c
> dc80: 0000000a 00000009 0000000a 0000000c 00000008 0000000a 00000009
> 00000009
> dca0: 00000008 00000009 00000009 0000000c 00000009 0000000c 00000009
> 0000000c
> dcc0: 0000000a 00000009 0000000c 00000009 00000007 0000000a 0000000b
> 0000000a
> dce0: 0000000a 0000000a 0000000a 0000000c 0000000a 0000000a 00000009
> 0000000a
> dd00: 0000000a 0000000a 0000000a 0000000b 00000009 0000000a 0000000a
> 0000000a
> dd20: 0000000a 0000000a 00000000 00000000 0000000b 00000000 0000000b
> 00000000
> dd40: 0000000b 0000000a 0000000c 0000000c 00000008 00000009 0000000a
> 00000009
> dd60: 00000009 0000000a 0000000b 0000000a 0000000a 00000000 00000009
> 0000000b
> dd80: 00000009 0000000c 00000000 00000000 00000009 0000000c 0000000a
> 0000000a
> dda0: 0000000a 0000000a 0000000a 0000000c 0000000c 00000000 00000009
> 0000000a
> ddc0: 00000009 0000000c 0000000a 0000000c 00000006 00000008 00000007
> 00000007
> dde0: 00000009 00000007 0000000c 00000009 00000008 0000000b 00000009
> 00000008
> de00: 0000000a 0000000b 0000000c 0000000c 00000009 0000000b 0000000a
> 00000000
> de20: 0000000b 0000000a 0000000b 0000000c 0000000a 0000000a 0000000c
> 00000009
> de40: 0000000a 0000000b 00000008 00000007 0000000c 00000004 00000003
> 00000004
> de60: 00000005 00000004 00000005 00000006 00000007 00000005 00000008
> 00000007
> de80: 00000008 00000007 00000008 00000008 00000009 00000009 00000000
> 0000000a
> dea0: 0000000b 0000000a 00000000 00000000 00000000 00000007 00000000
> 00000006
> dec0: 00000007 00000006 00000007 00000007 00000005 00000006 00000005
> 00000006
> dee0: 00000005 00000005 00000005 00000005 00000005 00000005 00000004
> 00000005
> df00: 00000004 00000004 00000004 00000004 00000004 00000004 00000003
> 00000004
> df20: 00000004 00000003 00000003 00000007 00000004 00000004 00000006
> 00000004
> df40: 00000000 c020df68 c012a6f4 c012a6d8 00000001 00000000 3c012594
> 00000000
> df60: c0015970 00000000 c012a6d0 ffffffff c012a6d4 00000000 c012a6d0
> c0015d74
> df80: 00008b1f ffffffff c012a6c8 c012a6cc 00000000 c0009a48 00000000
> c012a678
> dfa0: 00000000 c0008fc4 00000000 c0104b44 00000000 c0056664 5c2d2f7c
> 00000000
> dfc0: c012a678 00000000 c012b338 c0120d10 41069265 3c012594 00000000
> c0014298
> dfe0: 00000000 00000000 c011e000 c014909c c0014040 c0017b1c c020dff8
> c020dff8
> Backtrace: invalid frame pointer
> Code: e156000b 2a000012 e59f0414 e59f3414 (e5902000)
> Kernel panic: Attempted to kill init!

See attached HOWTO.

You need to send bug reports to linux-kernel list:

	Linux Kernel Mailing List <linux-kernel at vger.kernel.org>

not busybox one.

> PART 2:
> Linux version 2.4.21-guc1 (stanliao at gucpc237) (gcc version 3.3.2) #335 Thu
> Feb
> 23 17:34:15 2006
> CPU: ARM926EJ-Sid(wb) revision 5
> Machine: GUC-MVP2000
> On node 0 totalpages: 16384
> zone(0): 16384 pages.
> zone(1): 0 pages.
> zone(2): 0 pages.
> Kernel command line: root=0100 initrd=0x3c800000,0x400000 mem=64M
> Calibrating delay loop... 23.91 BogoMIPS
> Memory: 64MB = 64MB total
> Memory: 59392KB available (1033K code, 219K data, 48K init)
> Dentry cache hash table entries: 8192 (order: 4, 65536 bytes)
> Inode cache hash table entries: 4096 (order: 3, 32768 bytes)
> Mount cache hash table entries: 512 (order: 0, 4096 bytes)
> Buffer-cache hash table entries: 4096 (order: 2, 16384 bytes)
> Page-cache hash table entries: 16384 (order: 4, 65536 bytes)
> CPU: Testing write buffer: pass
> POSIX conformance testing by UNIFIX
> Linux NET4.0 for Linux 2.4
> Based upon Swansea University Computer Society NET3.039
> Initializing RT netlink socket
> Starting kswapd
> devfs: v1.12c (20020818) Richard Gooch (rgooch at atnf.csiro.au)
> devfs: boot_options: 0x1
> i2c-mmp2.o:
> ttyAM%d0 at MMIO 0x320b0000 (irq = 28) is a GUC-MVP2000 UART
> RAMDISK driver initialized: 16 RAM disks of 8192K size 1024 blocksize
> dmfe: Globalunichip Mercury net driver, version 1.36.4 (2002-01-17)
> dmfe: mode =  8
>  12:23:44:49:05:87, irq 23.
> , mercury IO addr = f3012000.
> i2c_register_driver : I2C_BUS_MAX = 4
> uda = 0xc014a7f4<6>uda->bus = 0xc0125864<6>uda = 0xc014a7f4<6>uda->bus =
> 0xc0125
> 864<6>uda = 0xc014a7f4<6>uda->bus = 0xc0125864<6>uda =
> 0xc014a7f4<6>uda->bus = 0
> xc0125864<6>uda = 0xc014a7f4<6>uda->bus = 0xc0125864<6>uda =
> 0xc014a7f4<6>uda->b
> us = 0xc0125864<6>uda = 0xc014a7f4<6>uda->bus = 0xc0125864<6>uda =
> 0xc014a7f4<6>
> uda->bus = 0xc0125864<6>uda = 0xc014a7f4<6>uda->bus = 0xc0125864<6>uda =
> 0xc014a
> 7f4<6>uda->bus = 0xc0125864<6>uda = 0xc014a7f4<6>uda->bus =
> 0xc0125864<6>uda = 0
> xc014a7f4<6>uda->bus = 0xc0125864<6>uda = 0xc014a7f4<6>uda->bus =
> 0xc0125864<6>N
> ET4: Linux TCP/IP 1.0 for NET4.0
> IP Protocols: ICMP, UDP, TCP
> IP: routing cache hash table of 512 buckets, 4Kbytes
> TCP: Hash tables configured (established 4096 bind 8192)
> NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
> NetWinder Floating Point Emulator V0.97 (extended precision)
> RAMDISK: Compressed image found at block 0
> Freeing initrd memory: 4096K
> VFS: Mounted root (ext2 filesystem) readonly.
> Mounted devfs on /dev
> Freeing init memory: 48K
> kmod: failed to exec /sbin/modprobe -s -k jffs2, errno = 2
> mount: Mounting /dev/mtdblock/2 on /mnt/mtd failed: No such device
> EXT2-fs warning: maximal mount count reached, running e2fsck is recommended
> Start Embedded Linux...
> Bummer, could not run '/bin/stty': No such file or directory
> 
> 
> BusyBox v1.01 (2005.11.23-12:13+0000) Built-in shell (ash)
> Enter 'help' for a list of built-in commands.
> 
> / #

--
vda
-------------- next part --------------
Okay, so you've got an oops and want to find out what happened?

In this HOWTO, I presume you did not delete and did not
tamper with your kernel build tree. Also, I recommend you
to enable these options in the .config:

CONFIG_DEBUG_SLAB=y
CONFIG_FRAME_POINTER=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y

First one makes use-after-free bug hunt easy, second gives
you much more reliable stacktraces. Next two give you decoded
oops (2.6 only).

If you are running 2.4 kernel, please make sure you run
your klogd with -x switch.

	1. An OOPS decoding walkthrough

Ok, let's take a look at example OOPS. ^^^^ marks are mine.

Unable to handle kernel NULL pointer dereference at virtual address 00000e14
 printing eip:
c0162887
*pde = 00000000
Oops: 0000 [#1]
PREEMPT
Modules linked in: eeprom snd_seq_oss snd_seq_midi_event..........
CPU:    0
EIP:    0060:[<c0162887>]    Not tainted
EFLAGS: 00010206   (2.6.7-nf2)
EIP is at prune_dcache+0x147/0x1c0
          ^^^^^^^^^^^^^^^^^^^^^^^^
eax: 00000e00   ebx: d1bde050   ecx: f1b3c050   edx: f1b3ac50
esi: f1b3ac40   edi: c1973000   ebp: 00000036   esp: c1973ef8
ds: 007b   es: 007b   ss: 0068
Process kswapd0 (pid: 65, threadinfo=c1973000 task=c1986050)
Stack: d7721178 c1973ef8 0000007a 00000000 c1973000 f7ffea48 c0162d1f 0000007a
       c0139a2b 0000007a 000000d0 00025528 049dbb00 00000000 000001fa 00000000
       c0364564 00000001 0000000a c0364440 c013add1 00000080 000000d0 00000000
Call Trace:
 [<c0162d1f>] shrink_dcache_memory+0x1f/0x30
 [<c0139a2b>] shrink_slab+0x14b/0x190
 [<c013add1>] balance_pgdat+0x1b1/0x200
 [<c013aee7>] kswapd+0xc7/0xe0
 [<c0114270>] autoremove_wake_function+0x0/0x60
 [<c0103e9e>] ret_from_fork+0x6/0x14
 [<c0114270>] autoremove_wake_function+0x0/0x60
 [<c013ae20>] kswapd+0x0/0xe0
 [<c01021d1>] kernel_thread_helper+0x5/0x14
Code: 8b 50 14 85 d2 75 27 89 34 24 e8 4a 2b 00 00 8b 73 0c 89 1c

Let's try to find out where did that exactly happen.
Grep in your kernel tree for prune_dcache. Aha, it is defined in
fs/dcache.c!

(BTW, if you have _only_ EIP: 0060:[<c0162887>] line (probably because
you're using 2.4), just look into System.map. You will discover that
EIP is in prune_dcache() because you see "c0162740 t prune_dcache"
in System.map and EIP falls between this and next entry.)

Ok, execute these two commands:

# objdump -d fs/dcache.o > fs/dcache.disasm
# make fs/dcache.s

Now in fs/ you should have:

dcache.c - source code
dcache.o - compiled object file
dcache.s - assembler output of C compiler ('half-compiled' code)
dcache.disasm - disassembled object file

Open dcache.disasm and find "prune_dcache":

00000540 <prune_dcache>:
     540:       55                      push   %ebp

We need to find prune_dcache+0x147. Using shell,

# printf "0x%x\n" $((0x540+0x147))
0x687

and in dcache.disasm:

     683:       85 c0                   test   %eax,%eax
     685:       74 07                   je     68e <prune_dcache+0x14e>
===> 687:       8b 50 14                mov    0x14(%eax),%edx    <======== OOPS
     68a:       85 d2                   test   %edx,%edx
     68c:       75 27                   jne    6b5 <prune_dcache+0x175>
     68e:       89 34 24                mov    %esi,(%esp)
     691:       e8 fc ff ff ff          call   692 <prune_dcache+0x152>
     696:       8b 73 0c                mov    0xc(%ebx),%esi
     699:       89 1c 24                mov    %ebx,(%esp)
     69c:       e8 9f f9 ff ff          call   40 <d_free>

Comparing with "Code: 8b 50 14 85 d2 75 27" - match!

We need to find matching line in dcache.s and, eventually, in dcache.c.
It's easy to find prune_dcache in dcache.s:

prune_dcache:
        pushl   %ebp

but even though it is not too hard to find matching instruction
(you just have to look for the same instruction sequence):

        movl    8(%edi), %eax
        decl    20(%edi)
        testb   $8, %al
        jne     .L593
.L517:
        movl    68(%ebx), %eax
        testl   %eax, %eax
        je      .L532
        movl    20(%eax), %edx  <========= OOPS
        testl   %edx, %edx
        jne     .L594
.L532:
        movl    %esi, (%esp)
        call    iput
.L565:
        movl    12(%ebx), %esi
        movl    %ebx, (%esp)
        call    d_free

it is unclear to which part of .c code it belongs:

static void prune_dcache(int count)
{
        spin_lock(&dcache_lock);
        for (; count ; count--) {
                struct dentry *dentry;
                struct list_head *tmp;
                tmp = dentry_unused.prev;
                if (tmp == &dentry_unused)
                        break;
                list_del_init(tmp);
                prefetch(dentry_unused.prev);
                dentry_stat.nr_unused--;
                dentry = list_entry(tmp, struct dentry, d_lru);
                spin_lock(&dentry->d_lock);
                /*
                 * We found an inuse dentry which was not removed from
                 * dentry_unused because of laziness during lookup.  Do not free
                 * it - just keep it off the dentry_unused list.
                 */
                if (atomic_read(&dentry->d_count)) {
                        spin_unlock(&dentry->d_lock);
                        continue;
                }
                /* If the dentry was recently referenced, don't free it. */
                if (dentry->d_flags & DCACHE_REFERENCED) {
                        dentry->d_flags &= ~DCACHE_REFERENCED;
                        list_add(&dentry->d_lru, &dentry_unused);
                        dentry_stat.nr_unused++;
                        spin_unlock(&dentry->d_lock);
                        continue;
                }
                prune_one_dentry(dentry);
        }
        spin_unlock(&dcache_lock);
}

What now?! Well, I have a silly method which helps to find
C code line corresponding to that asm one. Edit your
prune_dcache in dcache.c like this:

static void prune_dcache(int count)
{
        spin_lock(&dcache_lock);
        for (; count ; count--) {
                struct dentry *dentry;
                struct list_head *tmp;
asm("#1");
                tmp = dentry_unused.prev;
asm("#2");
                if (tmp == &dentry_unused)
                        break;
asm("#3");
                list_del_init(tmp);
asm("#4");
                prefetch(dentry_unused.prev);
asm("#5");
                dentry_stat.nr_unused--;
asm("#6");
...
...
asm("#e");
                prune_one_dentry(dentry);
        }
asm("#f");
        spin_unlock(&dcache_lock);
}

and do "make fs/dcache.s" again. Look into new dcache.s.
Nasty surprize:

APP
        #e
#NO_APP
        testb   $16, %al
        jne     .L495
        orl     $16, %eax
        leal    72(%ecx), %esi
        movl    %eax, 4(%ebx)
        movl    4(%esi), %edx
        movl    72(%ecx), %eax
        testl   %eax, %eax
        movl    %eax, (%edx)
        je      .L493
        movl    %edx, 4(%eax)
.L493:
        movl    $2097664, 4(%esi)
.L495:
        leal    40(%ebx), %ecx
        movl    40(%ebx), %eax
        movl    4(%ecx), %edx
        movl    %edx, 4(%eax)
        movl    %eax, (%edx)
        movl    $2097664, 4(%ecx)
        movl    $1048832, 40(%ebx)
        decl    dentry_stat
        movl    8(%ebx), %esi
        testl   %esi, %esi
        je      .L536
        leal    56(%ebx), %eax
        movl    $0, 8(%ebx)
        movl    56(%ebx), %edx
        movl    4(%eax), %ecx
        movl    %ecx, 4(%edx)
        movl    %edx, (%ecx)
        movl    %eax, 4(%eax)
        movl    %eax, 56(%ebx)
        movl    8(%edi), %eax
        decl    20(%edi)
        testb   $8, %al
        jne     .L592
.L518:
        movl    8(%edi), %eax
        decl    20(%edi)
        testb   $8, %al
        jne     .L593
.L517:
        movl    68(%ebx), %eax
        testl   %eax, %eax
        je      .L532
        movl    20(%eax), %edx    <======== OOPS
        testl   %edx, %edx
        jne     .L594
.L532:
        movl    %esi, (%esp)
        call    iput

How come one line of C code expanded in so much asm?!
Hmm... asm("#e") was directly before prune_one_dentry(dentry),
what's that?

static inline void prune_one_dentry(struct dentry * dentry)
{
        struct dentry * parent;
        __d_drop(dentry);
        list_del(&dentry->d_child);
        dentry_stat.nr_dentry--;        /* For d_free, below */
        dentry_iput(dentry);
        parent = dentry->d_parent;
        d_free(dentry);
        if (parent != dentry)
                dput(parent);
        spin_lock(&dcache_lock);
}

Argh! An inline function. Do asm trick to it too:

static inline void prune_one_dentry(struct dentry * dentry)
{
        struct dentry * parent;
asm("#A");
        __d_drop(dentry);
asm("#B");
        list_del(&dentry->d_child);
asm("#C");
        dentry_stat.nr_dentry--;        /* For d_free, below */
asm("#D");
        dentry_iput(dentry);
asm("#E");
...
...
}

"make fs/dcache.s", rinse, repeat. You will discover that OOPS
happened after #D mark, inside dentry_iput which is an inline too.
Will this ever end? Luckily, yes. After yet another round of asm
insertion, we arrive at:

static inline void dentry_iput(struct dentry * dentry)
{
        struct inode *inode = dentry->d_inode;
        if (inode) {
asm("#K");
                dentry->d_inode = NULL;
asm("#L");
                list_del_init(&dentry->d_alias);
asm("#M");
                spin_unlock(&dentry->d_lock);
asm("#N");
                spin_unlock(&dcache_lock);
asm("#O");
                if (dentry->d_op && dentry->d_op->d_iput)
{
asm("#P");
                        dentry->d_op->d_iput(dentry, inode);
}
                else
...

Which corresponds to this part of new dcache.s:

.L517:
#APP
        #O
#NO_APP
        movl    68(%ebx), %eax
        testl   %eax, %eax
        je      .L532
        movl    20(%eax), %edx   <=== OOPS
        testl   %edx, %edx
        jne     .L594
.L532:
#APP
        #Q
#NO_APP

This is "if (dentry->d_op && dentry->d_op->d_iput)" condition
check, and it is oopsing trying to do second check. dentry->d_op
contains bogus pointer value 0x00000e00 (see %eax value in OOPS
message we started with).

	1.1. Accesses of structure members via NULL pointer

If kernel says "Unable to handle kernel _NULL_ pointer dereference",
it does not really 100% guaranteed that NULL pointer was to blame.
You can check that by examining the code and register values. In our
example, dentry->d_op pointer wasn't NULL. Real NULL pointer would have
caused "Unable to handle kernel NULL pointer dereference at virtual
address 00000014" message and EAX register would be 00000000.

	1.2. Bit flips

If you see that only one bit in the referenced address is set
(example: "Unable to handle kernel paging request at virtual address
00040000"), you probably have faulty memory which sometimes can
flip bit from 0 to 1. It converted legitimate NULL pointer into
non-NULL one. Code which expected to stop on NULL pointer saw
a non-NULL value and happily used it.

Test your box with memtest86, cpuburn and/or memburn.

	1.3. OOPSes scrolling off the screen

If your OOPS is so long that it scrolls off the screen, and box freeze
after the OOPS with nothing in logs, you can try the following:

1. Boot with video=N such that you get large console size.
   Most x86 machines have 50- and 60-line mode.
2. Connect your box with null modem cable to another one,
   start terminal emulator there and boot with console=ttyS0
   kernel parameter.
3. Use netconsole (I never used that yet).

	2. Other decoding tricks

Marcelo Tosatti <marcelo.tosatti at cyclades.com> and
Johannes Stezenbach <js at convergence.de> suggest using

make EXTRA_CFLAGS="-g -Wa,-a,-ad" fs/dcache.o >dcache.asm

instead of asm trick. This works at least on 2.6.
Note that GCC may produce slightly different assembler code
with these options. This is normal.

Herbert Poetzl <herbert at 13thfloor.at> suggested simply using

addr2line -e vmlinux c012609d

(this might require CONFIG_DEBUG_KERNEL and recompile).

Zwane Mwaikambo <zwane at linuxpower.ca> has another tips
for you:

==============================================
Date: Sat, 14 Aug 2004 09:41:10 -0400 (EDT)
From: Zwane Mwaikambo <zwane at linuxpower.ca>
Subject: Re: [RFC] HOWTO find oops location
==============================================

There are a few very simple methods i use all the time:

compile with CONFIG_DEBUG_INFO (it's safe to select the option and
recompile after the oops even) and then;

Unable to handle kernel NULL pointer dereference at virtual address 0000000c
 printing eip:
c046a188
*pde = 00000000
Oops: 0000 [#1]
PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU:    0
EIP:    0060:[<c046a188>]    Not tainted VLI
EFLAGS: 00010246   (2.6.6-mm3)
EIP is at serial_open+0x38/0x170
[...]

# gdb vmlinux
GNU gdb 5.2
Copyright 2002 Free Software Foundation, Inc.
...
(gdb) list *serial_open+0x38
0xc046a188 is in serial_open (drivers/usb/serial/usb-serial.c:465).
460
461             /* get the serial object associated with this tty pointer */
462             serial = usb_serial_get_by_index(tty->index);
463
464             /* set up our port structure making the tty driver remember our port object, and us it */
465             portNumber = tty->index - serial->minor;
466             port = serial->port[portNumber];
467             tty->driver_data = port;
468
469             port->tty = tty;

And then for cases where you deadlock and the NMI watchdog triggers with
%eip in a lock section:

NMI Watchdog detected LOCKUP on CPU0,
eip c0119e5e, registers:
Modules linked in:
CPU:    0
EIP:    0060:[<c0119e5e>]    Tainted:
EFLAGS: 00000086   (2.6.7)
EIP is at .text.lock.sched+0x89/0x12b
[...]

(gdb) disassemble 0xc0119e5e
Dump of assembler code for function Letext:
[...]
0xc0119e59 <Letext+132>:        repz nop
0xc0119e5b <Letext+134>:        cmpb   $0x0,(%edi)
0xc0119e5e <Letext+137>:        jle    0xc0119e59 <Letext+132>
0xc0119e60 <Letext+139>:        jmp    0xc0118183 <scheduler_tick+487>

(gdb) list *scheduler_tick+487
0xc0118183 is in scheduler_tick (include/asm/spinlock.h:124).
119             if (unlikely(lock->magic != SPINLOCK_MAGIC)) {
120                     printk("eip: %p\n", &&here);
121                     BUG();
122             }
123     #endif
124             __asm__ __volatile__(
125                     spin_lock_string
126                     :"=m" (lock->lock) : : "memory");
127     }

But that's not much help since it's pointing to an inline function and not
the real lock location, so just subtract a few bytes:

(gdb) list *scheduler_tick+450
0xc011815e is in scheduler_tick (kernel/sched.c:2021).
2016            cpustat->system += sys_ticks;
2017
2018            /* Task might have expired already, but not scheduled off yet */
2019            if (p->array != rq->active) {
2020                    set_tsk_need_resched(p);
2021                    goto out;
2022            }
2023            spin_lock(&rq->lock);

So we have our lock location. Then there are cases where there is a "Bad
EIP" most common ones are when a bad function pointer is followed or if
some of the kernel text or a module got unloaded/unmapped (e.g. via
__init). You can normally determine which is which by noting that bad eip
for unloaded text normally looks like a valid virtual address.

Unable to handle kernel NULL pointer dereference at virtual address 00000000
00000000
*pde = 00000000
Oops: 0000 [#1]
CPU: 0
EIP: 0060:[<00000000>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00210246
[...]
Call Trace:
 [<c01dbbfb>] smb_readdir+0x4fb/0x6e0
 [<c0165560>] filldir64+0x0/0x130
 [<c016524a>] vfs_readdir+0x8a/0x90
 [<c0165560>] filldir64+0x0/0x130
 [<c01656fd>] sys_getdents64+0x6d/0xa6
 [<c0165560>] filldir64+0x0/0x130
 [<c010adff>] syscall_call+0x7/0xb
Code: Bad EIP value.

From there you're best off examining the call trace to find the culprit.
======================================================

	3. Looking around in C code

Not written. I think live example of chasing bug further by following call
chain, placing printks etc will be most interesting to have here.

	4. More obscure oopses

Not written.
--
vda