A Simple Derivation of the Bias-Variance Decomposition

We consider a statistical model parameterised by $\theta$, which, of course, cannot be observed directly and must be estimated from a set of data. We denote our estimated parameter by $\hat\theta$.

It is useful to have a sense of how close our estimated parameter $\hat\theta$ is to the true value, $\theta$. One straightforward measure is the so-called Mean Squared Error (MSE), which is the average squared difference between the two values:

$$\text{MSE}(\hat\theta) := \mathbb{E}\left[(\hat\theta - \theta)^2\right]$$

The expectation here is subtle: note that (under a frequentist view of statistics) the true parameter $\theta$ is fixed and non-random, while the estimated parameter varies with each new dataset. The MSE reports the squared difference between the two, averaged over all possible datasets.
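To make this averaging concrete, here is a minimal numerical sketch (assuming Python with NumPy, and a toy setup where $\theta$ is the mean of a normal distribution estimated by the sample mean, neither of which is specified in the text above): each simulated dataset produces its own $\hat\theta$, and the MSE is approximated by averaging the squared error over many such datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the true parameter is the mean of a normal
# distribution, and we estimate it with the sample mean.
theta = 2.0           # true parameter (fixed, non-random)
n = 20                # size of each dataset
n_datasets = 100_000  # number of simulated datasets

# Each row is one dataset; each dataset yields one estimate theta_hat.
data = rng.normal(loc=theta, scale=1.0, size=(n_datasets, n))
theta_hat = data.mean(axis=1)

# Monte Carlo approximation of MSE(theta_hat) = E[(theta_hat - theta)^2].
mse = np.mean((theta_hat - theta) ** 2)
print(mse)  # roughly 1/n = 0.05 for the sample mean of N(theta, 1) data
```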

Our goal is to show, with as little effort as possible, the following decomposition

$$\text{MSE}(\hat\theta) = \text{Var}(\hat\theta) + \text{Bias}(\hat\theta)^2 \tag{1}$$

Proof. For any random variable $X$, recall that $\text{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2$. Rearranging, we have

$$\mathbb{E}[X^2] = \text{Var}(X) + \mathbb{E}[X]^2.$$

Since this holds for any random variable, replace $X$ with $\hat\theta - \theta$, so that

$$\mathbb{E}[(\hat\theta - \theta)^2] = \text{Var}(\hat\theta - \theta) + \mathbb{E}[\hat\theta - \theta]^2.$$

The term on the left hand side is the definition of $\text{MSE}(\hat\theta)$. Similarly, the second term on the right hand side is the (squared) bias, since $\text{Bias}(\hat\theta) := \mathbb{E}[\hat\theta - \theta] = \mathbb{E}[\hat\theta] - \theta$. Finally, recall that $\theta$ is non-random, and shifting a random variable by a constant does not change its variance, so $\text{Var}(\hat\theta - \theta) = \text{Var}(\hat\theta)$. Combining each of these, we arrive at (1). $\square$
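As a sanity check on (1), here is a small simulation sketch (again assuming NumPy and the same toy normal-mean problem, this time with a deliberately biased shrinkage estimator so that both terms are non-zero): the Monte Carlo MSE should match the sum of the empirical variance and squared bias up to simulation noise.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: estimate the mean of N(theta, 1), but shrink the
# sample mean towards zero so the estimator is biased.
theta = 2.0
n = 20
n_datasets = 100_000
shrink = 0.8  # theta_hat = 0.8 * sample mean, so E[theta_hat] != theta

data = rng.normal(loc=theta, scale=1.0, size=(n_datasets, n))
theta_hat = shrink * data.mean(axis=1)

mse = np.mean((theta_hat - theta) ** 2)      # E[(theta_hat - theta)^2]
var = np.var(theta_hat)                      # Var(theta_hat)
bias_sq = (np.mean(theta_hat) - theta) ** 2  # Bias(theta_hat)^2

print(mse, var + bias_sq)  # the two numbers should agree closely
```

With these assumed values the variance term is about $0.8^2 \cdot 1/20 = 0.032$ and the squared bias about $(0.8 \cdot 2 - 2)^2 = 0.16$, so both sides of (1) come out near $0.192$.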