Linux kernel oops

From Wikipedia, the free encyclopedia

Linux kernel oops on SPARC.
Linux kernel oops on PA-RISC.

An oops occurs when a programming defect or some other unexpected event interferes with the normal operation of the Linux kernel. It is named for the error message displayed on the system console (or written to the system log files) when such a fault occurs. When this happens, the currently active task (or process) is terminated and the kernel makes a best-effort attempt to continue operating. Linux kernel engineers use the oops message to track down (debug) the fault condition that caused the oops and to patch the kernel with a fix.

Once a system has experienced an oops, some internal resources may no longer be accounted for. Memory may have been leaked, and killing the active task may have had other undesirable side effects. A kernel oops often leads to a kernel panic once the system attempts to use resources that have been lost.

Asked on the Linux kernel mailing list what exactly an oops is, participant John Bradford replied with the following message[1]:

> I've been reading LKML for a few weeks now to understand Linux
> development better, and there's one thing I just can't understand:
> what's an OOPS? What does it stand for, what is it?

It's a report of a bug in the kernel, for example, if the kernel tried
to access an invalid memory location. It doesn't necessarily indicate
a programming error - faulty hardware can cause an OOPS as well.

The following explaination may not be 100% accurate, hopefully
somebody else will post a better one, but here goes:

As far as I know it doesn't stand for anything, and the name is a
kind-of joke, (as in, "oops, we've found a bug in the kernel").

On X86, an OOPS contains information such as:

Text description - something like "Unable to handle NULL pointer
dereference". This tells you what sort of error it is.

The number of the oops, (I.E. whether it was the first, second, third,
etc, starting with 0000).

The CPU it occured on, (0 on a single processor machine). Note, I
think that on a multi processor machine, there isn't a physical
relationship between CPU and number, I.E. CPUs are assigned numbers on
boot, in a semi-random fashion.

The contents of the CPU's registers.

A stack backtrace.

The code the CPU was executing.

A call trace, which is, basically, a list of functions that the
process was in at the moment of the OOPS. The actual numeric values
are almost completely useless[1], because they depend on your
particular kernel. Only somebody who has access to the corresponding
symbol map for that kernel can identify the actual names of the
functions, and this is why there are often posts by developers on this
list asking people to decode an OOPS they have posted.

[1] Without it being decoded, you can still check, for example,
whether the CPU was executing data, but it's mostly speculation.

John.
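
The decoding step described in the message, turning a raw address from a call trace into a function name with the help of a symbol map, can be sketched roughly as follows. This is only an illustration: it assumes a symbol map in the three-column "address type name" layout used by System.map, and the addresses, the symbol names and the helper functions load_symbols and resolve are all invented for the example.

```python
import bisect

# Hypothetical excerpt in the "address type name" layout used by System.map.
# The addresses and symbol names are invented for illustration only.
SAMPLE_MAP = """\
c0100000 T startup_32
c0105000 T do_example_a
c0105200 T do_example_b
c0106000 T do_example_c
"""


def load_symbols(text):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        address, _kind, name = parts
        symbols.append((int(address, 16), name))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return 'name+0xoffset' for the last symbol at or below address, else None."""
    starts = [start for start, _name in symbols]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None  # address lies below the first known symbol
    start, name = symbols[index]
    return f"{name}+{address - start:#x}"


symbols = load_symbols(SAMPLE_MAP)
print(resolve(symbols, 0xC0105234))  # prints do_example_b+0x34
```

The lookup only works against the symbol map of the exact kernel that produced the oops, which is the point the message makes about the raw numeric values.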


On 2.6 kernels, it is no longer necessary to decode an oops.[2]

A slight correction[3] was posted by Szakacsits Szabolcs:

On Sat, 8 Mar 2003, John Bradford wrote:

> The number of the oops, (I.E. whether it was the first, second, third,
> etc, starting with 0000).

Urban myth (at least on i386). The "Oops:" part can be decoded on i386 as,

 * bit 0 == 0 means no page found, 1 means protection fault
 * bit 1 == 0 means read, 1 means write
 * bit 2 == 0 means kernel, 1 means user-mode

      Szaka 
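
The bit layout given in this correction can be expressed as a small decoder. The following is a minimal sketch, assuming the value after "Oops:" is printed in hexadecimal and that only the three low bits listed above are of interest; the function name decode_oops_code and the wording of the returned labels are chosen for this example.

```python
def decode_oops_code(code):
    """Decode the three low bits of an i386 'Oops:' value.

    Bit meanings follow the list in the quoted message; the dictionary
    keys and label strings are illustrative choices, not kernel output.
    """
    return {
        "cause": "protection fault" if code & 0x1 else "no page found",
        "access": "write" if code & 0x2 else "read",
        "mode": "user-mode" if code & 0x4 else "kernel",
    }


# "0002" as it might appear after "Oops:" (assumed to be hexadecimal).
print(decode_oops_code(int("0002", 16)))
```

Under these assumptions, a value of 0002 would describe a write, performed in kernel mode, to an address for which no page was found.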

Notes

  1. ^ http://www.ussg.iu.edu/hypermail/linux/kernel/0303.1/0009.html
  2. ^ http://sosdg.org/~coywolf/lxr/source/Documentation/oops-tracing.txt?v=2.6.16
  3. ^ http://www.ussg.iu.edu/hypermail/linux/kernel/0303.1/0027.html