Normal distribution

The normal, or Gaussian, distribution is one of the most important probability distributions in statistics. It is actually a family of distributions of the same general form, differing only in their location and scale parameters: the mean and the standard deviation. The standard normal distribution is the normal distribution with a mean of zero and a standard deviation of one.

Table of contents

1 Probability density function

2 Standardizing Gaussian random variables

3 Occurrence

4 Further properties

5 Characteristic function

6 Generating Gaussian random variables

7 History

8 Cumulative distribution function of the normal distribution

9 External links and references

Probability density function

The probability density function of the normal distribution with mean μ and standard deviation σ (or variance σ²), also known as the Gaussian function, is

<math>f(x) = {1 \over \sigma\sqrt{2\pi} }\,e^{-{(x-\mu )^2 \over 2\sigma^2}}</math>

(see exponential function and pi). If a random variable X follows this distribution, we write X ~ N(μ, σ²). If μ = 0 and σ = 1, we speak of the standard normal distribution, whose density is

<math>f(x) = {1 \over \sqrt{2\pi} }\,e^{-{x^2 \over 2}}</math>

The graph of the probability density function of the standard normal distribution is symmetric about its mean value and its shape resembles a bell, which has led to it being called the bell curve. About 68% of the area under the curve is within one standard deviation of the mean, 95.4% within two standard deviations, and 99.7% within three standard deviations (the "68 - 95 - 99.7 rule"). The inflection points of the curve occur at one standard deviation away from the mean.

These statements are also true for non-standard normal distributions.
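
As a rough numerical check of these figures, the following Python sketch evaluates the density formula and integrates it with a simple midpoint rule. The names normal_pdf and area_within, the step count, and the example parameters μ = 100, σ = 15 are illustrative choices, not part of any library.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x, following the formula above."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def area_within(k, mu=0.0, sigma=1.0, steps=10_000):
    """Midpoint-rule estimate of the area within k standard deviations of mu."""
    left = mu - k * sigma
    width = 2.0 * k * sigma / steps
    return width * sum(
        normal_pdf(left + (i + 0.5) * width, mu, sigma) for i in range(steps)
    )

# Compare the standard normal with a non-standard one (assumed mu=100, sigma=15).
for k in (1, 2, 3):
    print(k, round(area_within(k), 4), round(area_within(k, mu=100.0, sigma=15.0), 4))
```

Both columns should agree for each k, which illustrates that the areas depend only on the number of standard deviations and not on μ or σ.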

Standardizing Gaussian random variables

If X is a Gaussian random variable with mean μ and variance σ², then

<math> Z = \frac{X - \mu}{\sigma} </math>

is a standard normal random variable: Z~N(0,1). Conversely, if Z is a standard normal random variable,

<math>X=\sigma Z+\mu</math>

is a Gaussian random variable with mean μ and variance σ².

The standard normal distribution has been tabulated, and the other normal distributions are simple transformations of the standard one. Therefore, if one knows the mean and the standard deviation of a normal distribution, one can use this table to answer all questions about the distribution.
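
The table lookup described above can be sketched in Python, with the standard library's statistics.NormalDist standing in for a printed table of the standard normal distribution. The helper names standardize and prob_below and the example values (mean 100, standard deviation 15) are assumptions made for illustration.

```python
from statistics import NormalDist

# Stand-in for a printed table of the standard normal distribution.
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)

def standardize(x, mu, sigma):
    """Return z = (x - mu) / sigma."""
    return (x - mu) / sigma

def prob_below(x, mu, sigma):
    """P(X < x) for X ~ N(mu, sigma^2), looked up via the standard normal."""
    return STANDARD_NORMAL.cdf(standardize(x, mu, sigma))

# Hypothetical example: a quantity with mean 100 and standard deviation 15.
z = standardize(130.0, 100.0, 15.0)   # 2.0
p = prob_below(130.0, 100.0, 15.0)
print(f"z = {z:.2f}, P(X < 130) = {p:.4f}")
```

Only the standard normal is ever consulted; the mean and standard deviation enter solely through the transformation to z.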

Occurrence

Approximately normal distributions occur in many situations as a result of the central limit theorem. Simply stated, this theorem says that the sum of a large number of small independent random variables is approximately normally distributed. Therefore, whenever there is reason to suspect the presence of a large number of small effects acting additively, it is reasonable to assume that observations will be approximately normal. The IQ score of an individual, for example, can be seen as the result of many small additive influences: many genes and many environmental factors all play a role.

IQ scores and other ability scores are approximately normally distributed. For most IQ tests, the mean is 100 and the standard deviation is 15.
The heights of adult specimens of an animal or plant species are approximately normally distributed, unless there is sexual dimorphism. The size of newborn children is approximately normal (but not their weight; see below).
Repeated measurements of the same quantity usually yield results which are approximately normally distributed (many little effects contribute additively to the measurement error). This is, in any case, the central assumption of the mathematical theory of errors.
A binomial distribution with parameters n and p is approximately normal if n is large enough (a common rule of thumb is that both np and n(1 - p) should be at least 5). The approximating normal distribution has mean μ = np and standard deviation σ = √(np(1 - p)).
- For example, suppose you randomly sample n people out of a large population and ask them whether they agree with a certain statement. The proportion of people who agree will of course depend on the sample. If you sampled groups of n people repeatedly and truly randomly, the proportions would approximately follow a normal distribution with mean equal to the true proportion p of agreement in the population and with standard deviation σ = √(p(1 - p)/n). Large sample sizes n are good because the standard deviation gets smaller, which allows a more precise estimate of the unknown parameter p.
A Poisson distribution with parameter λ is approximately normal if λ is large enough (a common rule of thumb is λ > 10). The approximating normal distribution has mean μ = λ and standard deviation σ = √λ.
- For instance, the number of edits per hour recorded on Wikipedia's Recent Changes page is approximately normal.
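
The binomial case from the list above can be checked numerically. The Python sketch below compares an exact binomial probability with its normal approximation; the function names and the values n = 100, p = 0.5, k = 55 are illustrative assumptions, and the +0.5 continuity correction is a common refinement that the text above does not mention.

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))

def normal_approx_cdf(k, n, p):
    """Normal approximation to P(X <= k), mean np and sd sqrt(np(1-p)).

    The +0.5 is the usual continuity correction (an addition to the text).
    """
    mu = n * p
    sigma = math.sqrt(n * p * (1.0 - p))
    return phi((k + 0.5 - mu) / sigma)

def proportion_sd(p, n):
    """Standard deviation of a sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1.0 - p) / n)

n, p, k = 100, 0.5, 55   # illustrative values; np = n(1-p) = 50, well above 5
print(binomial_cdf(k, n, p), normal_approx_cdf(k, n, p))

# Quadrupling the sample size halves the standard deviation of the proportion.
print(proportion_sd(0.5, 100), proportion_sd(0.5, 400))
```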

It is important to realize, however, that small effects often act as multiplicative (rather than additive) increases. In that case, the assumption of normality is not justified, and it is the logarithm of the variable of interest that is normally distributed. The distribution of the directly observed variable is then called log-normal. Good examples of this behaviour are financial indicators such as interest rates or stock values. Also, in biology it has been observed that organism growth sometimes proceeds by multiplicative rather than additive increments, implying that the distribution of body sizes should be log-normal.
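
A small simulation can illustrate the multiplicative case. In the Python sketch below, each outcome is the product of many random growth factors; the factor range (0.9 to 1.1), the counts, and the seed are arbitrary assumptions chosen only for illustration. Sample skewness is used as a crude symmetry check: it should be clearly positive for the raw values and close to zero for their logarithms.

```python
import math
import random

def skewness(values):
    """Sample skewness: the third standardized central moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def multiplicative_outcomes(count, factors=200, seed=0):
    """Each outcome is a product of many small random growth factors."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(count):
        value = 1.0
        for _step in range(factors):
            value *= rng.uniform(0.9, 1.1)   # hypothetical effect of up to 10%
        outcomes.append(value)
    return outcomes

raw = multiplicative_outcomes(5000)
logs = [math.log(v) for v in raw]
print("skewness of raw values:", skewness(raw))
print("skewness of logarithms:", skewness(logs))
```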

Other examples of variables that are not normally distributed:

Weight of humans: weight is approximately proportional to the third power of height, and the third power of a normally distributed variable is not normal.
Blood pressure or height of adult humans: these are interesting cases which at first appear to yield normal distributions. In reality, each is a mixture of two normal variables: the values for males (which are normally distributed) and the values for females (which are also normally distributed, but with a different mean). Generally, if there is a single characteristic (like sex) which has a large influence on the measured variable, the assumption of normality is not justified.
Lifetimes of humans or technical devices.

Further properties

If X ~ N(μ, σ²) and a and b are real numbers, then aX + b ~ N(aμ + b, (aσ)²).

If X₁ ~ N(μ₁, σ₁²) and X₂ ~ N(μ₂, σ₂²), and X₁ and X₂ are independent, then X₁ + X₂ ~ N(μ₁ + μ₂, σ₁² + σ₂²).

If X₁, ..., X_n are independent standard normal variables, then X₁² + ... + X_n² follows a chi-squared distribution with n degrees of freedom.
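
These three properties can be checked empirically. The following Python sketch draws samples with random.gauss and compares sample means and variances with the stated values; the parameter choices, sample size, and seed are arbitrary assumptions. The chi-squared check relies on the standard fact, not stated above, that a chi-squared distribution with n degrees of freedom has mean n.

```python
import random
from statistics import fmean, pvariance

rng = random.Random(42)   # fixed seed so the run is repeatable
N = 50_000

x1 = [rng.gauss(1.0, 2.0) for _ in range(N)]    # X1 ~ N(1, 2^2)
x2 = [rng.gauss(-3.0, 1.5) for _ in range(N)]   # X2 ~ N(-3, 1.5^2), independent

# Linear map: aX + b should have mean a*mu + b and variance (a*sigma)^2.
a, b = 3.0, 5.0
linear = [a * x + b for x in x1]

# Sum of independent normals: the means add and the variances add.
total = [u + v for u, v in zip(x1, x2)]

# Sum of n squared standard normals: chi-squared with n degrees of freedom.
n = 4
chi2 = [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) for _ in range(N)]

print("aX + b :", fmean(linear), pvariance(linear), "expected", a * 1.0 + b, (a * 2.0) ** 2)
print("X1 + X2:", fmean(total), pvariance(total), "expected", 1.0 - 3.0, 2.0**2 + 1.5**2)
print("chi2   :", fmean(chi2), "expected mean", n)
```

A simulation of this kind only checks moments; it does not by itself show that the resulting distributions are normal or chi-squared.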

Characteristic function

The characteristic function of a Gaussian random variable X ~ N(μ, σ²) is defined as the expected value of e^(itX) and can be written as

<math>\phi_X(t)=E\left[e^{itX}\right]=\int_{-\infty}^{\infty} {1 \over \sigma\sqrt{2\pi} }\,e^{-{(x-\mu )^2 \over 2\sigma^2}}\,e^{itx}\,dx = e^{i\mu t-\sigma^2 t^2/2}</math>

as can be seen by completing the square in the exponent.
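
The completion of the square can be written out as follows. This is a sketch: the final step, which evaluates a Gaussian integral whose center has been shifted by the complex constant iσ²t, is assumed here rather than proved. The exponent of the integrand is

<math>-{(x-\mu)^2 \over 2\sigma^2} + itx = -{(x-\mu-i\sigma^2 t)^2 \over 2\sigma^2} + i\mu t - {\sigma^2 t^2 \over 2}</math>

so the factor that does not depend on x can be taken outside the integral:

<math>\phi_X(t) = e^{i\mu t-\sigma^2 t^2/2} \int_{-\infty}^{\infty} {1 \over \sigma\sqrt{2\pi} }\,e^{-{(x-\mu-i\sigma^2 t)^2 \over 2\sigma^2}}\,dx = e^{i\mu t-\sigma^2 t^2/2}</math>

where the remaining integral equals 1, just as for an ordinary normal density.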

Generating Gaussian random variables

For computer simulations, it is often necessary to generate values that follow a Gaussian distribution. A common method is the Box-Muller transform, which converts two independent uniformly distributed values, easily generated by the computer's pseudorandom number generator, into two independent standard normal values.
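
A minimal Python sketch of the basic (trigonometric) form of the Box-Muller transform is given below. The function names box_muller and gaussian_samples are illustrative, and Python's random module is assumed as the source of uniform values. In practice, the built-in random.gauss may be preferable.

```python
import math
import random

def box_muller(rng=random):
    """Turn two independent uniform values into two independent N(0, 1) values."""
    u1 = 1.0 - rng.random()   # lies in (0, 1], so log(u1) is always defined
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)

def gaussian_samples(n, mu=0.0, sigma=1.0, seed=None):
    """Generate n values from N(mu, sigma^2) via X = sigma * Z + mu."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        z0, z1 = box_muller(rng)
        out.append(mu + sigma * z0)
        out.append(mu + sigma * z1)
    return out[:n]

samples = gaussian_samples(20_000, mu=5.0, sigma=2.0, seed=1)
mean = sum(samples) / len(samples)
sd = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(mean, sd)   # should be close to 5 and 2
```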

History

The normal distribution was first introduced by de Moivre in an article in 1733 (reprinted in the second edition of his Doctrine of Chances, 1738) in the context of approximating certain binomial distributions for large n. His result was extended by Laplace in his book Analytical Theory of Probabilities (1812), and is now called the Theorem of de Moivre-Laplace.

Laplace used the normal distribution in the analysis of errors of experiments. The important method of least squares was introduced by Legendre in 1805. Gauss, who claimed to have used the method since 1794, justified it in 1809 by assuming a normal distribution of the errors.

The name "bell curve" goes back to Jouffret, who used the term "bell surface" in 1872 for a bivariate normal with independent components. The name "normal distribution" was coined independently by Charles S. Peirce, Francis Galton and Wilhelm Lexis around 1875 [Stigler]. This terminology is unfortunate, since it reflects and encourages the fallacy that "everything is Gaussian".

Cumulative distribution function of the normal distribution

The cumulative distribution function of the standard normal distribution gives the probability that a standard normal variable has a value less than z. It has the formula

<math>\Phi(z) = \int_{-\infty}^z {1 \over \sqrt{2\pi} }\,e^{-{x^2 \over 2}}\,dx</math>

So for instance, the probability that a standard normal variable has a value less than 0.12 is equal to 0.54776. The cumulative distribution function of the normal distribution cannot be expressed in closed form in terms of elementary functions, and has to be calculated using numerical techniques. It is closely related to the error function erf: Φ(z) = (1 + erf(z/√2))/2.
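
One way to evaluate Φ numerically is through the standard identity Φ(z) = (1 + erf(z/√2))/2, where erf is the error function. The Python sketch below assumes the standard library's math.erf; the function names are illustrative.

```python
import math

def standard_normal_cdf(z):
    """Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """P(X < x) for X ~ N(mu, sigma^2), via standardization."""
    return standard_normal_cdf((x - mu) / sigma)

print(round(standard_normal_cdf(0.12), 5))   # the example value from the text, 0.54776

# A few values across the range z = -4 to +4.
for z in (-4.0, -2.0, 0.0, 2.0, 4.0):
    print(z, standard_normal_cdf(z))
```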

See also multivariate normal distribution.

External links and references

A. Kropinski's normal distribution tutorial (http://ce597n.www.ecn.purdue.edu/CE597N/1997F/students/michael.a.kropinski.1/project/tutorialMichael)
Stigler: Statistics on the Table, Harvard University Press 1999, chapter 22. History of the term "normal distribution".