In statistics, one often considers a family of probability distributions for a random variable X (where X is often a vector whose components are scalar-valued random variables, frequently independent), parameterized by a scalar- or vector-valued parameter, which we shall call θ. A quantity T(X) that depends on the (observable) random variable X but not on the (unobservable) parameter θ is called a statistic.
Sir Ronald Fisher tried to make precise the intuitive idea that a statistic may capture all of the information in X that is relevant to the estimation of θ. A statistic that does so is called a sufficient statistic. The precise definition is this: a statistic T(X) is sufficient for θ precisely if the conditional probability distribution of the data X given the statistic T(X) does not depend on θ.
For example:
- If X1, ..., Xn are independent Bernoulli-distributed random variables with expected value p, then the sum X1 + ... + Xn is a sufficient statistic for p (a numerical check of this appears in the sketch after this list).
- If X1, ..., Xn are independent and uniformly distributed on the interval [0, θ], then max(X1, ..., Xn) is sufficient for θ.
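As a quick numerical check of the first example, here is a minimal Python sketch (the sample size n = 5, the conditioning value t = 2, and the two values of p are arbitrary illustrative choices). If the sum is sufficient for p, the empirical conditional distribution of the data given the sum should come out the same for every p, namely uniform over the arrangements with that sum.

```python
import random

def conditional_dist_given_sum(p, n=5, t=2, trials=200_000, seed=0):
    """Empirical conditional distribution of (X1, ..., Xn) given sum == t,
    for i.i.d. Bernoulli(p) draws. If the sum is sufficient for p, this
    distribution should not depend on p."""
    rng = random.Random(seed)
    counts = {}
    kept = 0
    for _ in range(trials):
        x = tuple(1 if rng.random() < p else 0 for _ in range(n))
        if sum(x) == t:
            counts[x] = counts.get(x, 0) + 1
            kept += 1
    return {x: c / kept for x, c in sorted(counts.items())}

# Two very different values of p give (up to sampling noise) the same
# conditional distribution: uniform over the C(5, 2) = 10 arrangements,
# each with probability 0.1.
for p in (0.2, 0.8):
    dist = conditional_dist_given_sum(p)
    print(f"p = {p}: conditional probabilities range over "
          f"[{min(dist.values()):.3f}, {max(dist.values()):.3f}]")
```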
Since the conditional distribution of X given T(X) does not depend on θ, neither does the conditional expected value of g(X) given T(X), where g is any (sufficiently well-behaved) function. Consequently that conditional expected value is itself a statistic, and so is available for use in estimation. If g(X) is any kind of estimator of θ, then typically the conditional expectation of g(X) given T(X) is a better estimator of θ; one way of making that statement precise is the Rao–Blackwell theorem. Sometimes one can very easily construct a very crude estimator g(X), and then evaluate that conditional expected value to get an estimator that is in various senses optimal.
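To illustrate the Rao–Blackwell idea in the Bernoulli setting above, here is a minimal Python sketch (the values p = 0.3 and n = 10 are arbitrary illustrative choices). The crude estimator g(X) = X1 is unbiased for p, and by exchangeability its conditional expectation given the sum T is T/n, the sample mean, which is also unbiased but has far smaller variance.

```python
import random

def rao_blackwell_demo(p=0.3, n=10, trials=100_000, seed=1):
    """Compare a crude unbiased estimator of p (the first observation X1)
    with its Rao-Blackwellized version E[X1 | T] = T/n (the sample mean),
    where T = X1 + ... + Xn is the sufficient statistic."""
    rng = random.Random(seed)
    crude, improved = [], []
    for _ in range(trials):
        x = [1 if rng.random() < p else 0 for _ in range(n)]
        crude.append(x[0])           # g(X) = X1: unbiased but very noisy
        improved.append(sum(x) / n)  # E[g(X) | T(X)] = T/n: the sample mean

    def mean_var(vals):
        m = sum(vals) / len(vals)
        v = sum((u - m) ** 2 for u in vals) / len(vals)
        return m, v

    for name, vals in (("crude X1", crude),
                       ("Rao-Blackwellized T/n", improved)):
        m, v = mean_var(vals)
        print(f"{name}: mean ~ {m:.4f}, variance ~ {v:.4f}")

rao_blackwell_demo()
# Both estimators are unbiased (mean near p = 0.3), but conditioning on the
# sufficient statistic shrinks the variance from p(1-p) to p(1-p)/n.
```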