<<Up     Contents

Naive Bayesian classification

Here is a worked example of naive Bayesian classification which is an application of Bayesian inference to the document classification[?] problem.

Consider the problem of classifying documents by their content, for example into spam and non-spam E-mails. Imagine that documents are drawn from a number of classes of documents which can be modelled as sets of words where the (independent) probability that the i-th word of a given document occurs in a document from class C can be written as

<math>p(w_i \vert C)</math>

(For this treatment, we simplify things further by assuming that the probability of a word in a document is independent of the length of a document, or that all documents are of the same length).

Then the probability of a given document D, given a class C, is

<math>p(D\vert C)=\prod_i p(w_i \vert C)</math>

The question that we desire to answer is: "what is the probability that a given document D belongs to a given class C?"

Now, by their definition, (see Probability axiom)

<math>p(D\vert C)={p(D\cap C)\over p(C)}</math>

and

<math>p(C\vert D)={p(D\cap C)\over p(D)}</math>

Bayes' theorem manipulates these into a statement of probability in terms of likelihood.

<math>p(C\vert D)={p(C)\over p(D)}\,p(D\vert C)</math>

Assume for the moment that there are only two classes, S and ¬S.

<math>p(D\vert S)=\prod_i p(w_i \vert S)</math>

and

<math>p(D\vert\neg S)=\prod_i p(w_i\vert\neg S)</math>

Using the Bayesian result above, we can write:

<math>p(S\vert D)={p(S)\over p(D)}\,\prod_i p(w_i \vert S)</math>

<math>p(\neg S\vert D)={p(\neg S)\over p(D)}\,\prod_i p(w_i \vert\neg S)</math>

Dividing one by the other gives:

<math>{p(S\vert D)\over p(\neg S\vert D)}={p(S)\,\prod_i p(w_i \vert S)\over p(\neg S)\,\prod_i p(w_i \vert\neg S)}</math>

Which can be re-factored as:

<math>{p(S)\over p(\neg S)}\,\prod_i {p(w_i \vert S)\over p(w_i \vert\neg S)}</math>

Thus, the probability ratio p(S | D) / p(¬S | D) can be expressed in terms of a series of likelihood ratios[?]. The actual probability p(S | D) can be easily computed from ln(p(S | D) / p(¬S | D)) based on the observation that p(S | D) + p(¬S | D) = 1.

Taking the logarithm of all these ratios, we have:

<math>\ln{p(S\vert D)\over p(\neg S\vert D)}=\ln{p(S)\over p(\neg S)}+\sum_i \ln{p(w_i\vert S)\over p(w_i\vert\neg S)}</math>

This technique of "log-likelihood ratios" is a common technique in statistics. In the case of two mutually exclusive alternatives (such as this example), the conversion of a log-likelihood ratio to a probability takes the form of a sigmoid curve: see logit for details.

In real life, the naive Bayes approach is more powerful than might be expected from the extreme simplicity of its model; in particular, it is fairly robust in the presence of non-independent attributes wi. Recent theoretical analysis has shown why the naive Bayes classifier is so robust.

See also:

External links:

wikipedia.org dumped 2003-03-17 with terodump