Statistics/Point Estimation

Introduction
Usually, a random variable $$X$$ resulting from a random experiment is assumed to follow a certain distribution with an unknown (but fixed) parameter (vector) $$\theta\in\mathbb R^k$$ ($$k$$ is a positive integer whose value depends on the distribution), taking values in a set $$\Theta$$, called the parameter space.

For example, suppose the random variable $$X$$ is assumed to follow a normal distribution $$\mathcal N(\mu,\sigma^2)$$. Then, in this case, the parameter vector $$\theta=(\mu,\sigma)\in\Theta$$ is unknown, and the parameter space is $$\Theta=\{(\mu,\sigma):\mu\in\mathbb R,\sigma>0\}$$. It is often useful to estimate those unknown parameters in some way so as to "understand" the random variable $$X$$ better. We would also like the estimation to be "good" enough, so that our understanding is more accurate.

Intuitively, the (realization of the) random sample $$X_1,\dotsc,X_n$$ should be useful. Indeed, the estimators introduced in this chapter are all based on the random sample in some sense, and this is what we mean by point estimation. To be more precise, let us define the terms estimator and estimate.

In the following, we will introduce two well-known point estimators, which are actually quite "good", namely the maximum likelihood estimator and the method of moments estimator.

Maximum likelihood estimator (MLE)
As suggested by the name of this estimator, it is the estimator that maximizes some kind of "likelihood". Now, we would like to know what "likelihood" we should maximize to estimate the unknown parameter(s) (in a "good" way). Also, as mentioned in the introduction section, the estimator is based on the random sample in some sense. Hence, this "likelihood" should also be based on the random sample in some sense.

To motivate the definition of maximum likelihood estimator, consider the following example.
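
For concreteness, suppose a coin with unknown head probability $$p\in(0,1)$$ is flipped $$n$$ times, and we record $$X_i=1$$ if the $$i$$th flip is a head and $$X_i=0$$ otherwise (a sketch of the coin-flipping setup referenced below; the details may differ from the original example). For fixed realizations $$x_1,\dotsc,x_n$$, the probability of observing exactly these values is
$$\mathbb P(X_1=x_1,\dotsc,X_n=x_n)=\prod_{i=1}^{n}p^{x_i}(1-p)^{1-x_i}=p^{\sum_{i=1}^{n}x_i}(1-p)^{n-\sum_{i=1}^{n}x_i}.$$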

Intuitively, with these particular realizations (fixed), we would like to find a value of $$p$$ that maximizes this probability, i.e., makes the realizations obtained the ones that are "most probable" or "with maximum likelihood". Now, let us formally define the terms related to the MLE.
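
A standard formulation of these definitions (stated here for a random sample with joint pmf or pdf $$f(x_1,\dotsc,x_n;\theta)$$): the likelihood function is
$$L(\theta)=L(\theta;x_1,\dotsc,x_n)=f(x_1,\dotsc,x_n;\theta),$$
regarded as a function of $$\theta$$ with the realizations $$x_1,\dotsc,x_n$$ held fixed, and a maximum likelihood estimator is any $$\hat\theta=\hat\theta(X_1,\dotsc,X_n)$$ satisfying
$$\hat\theta=\arg\max_{\theta\in\Theta}L(\theta).$$
In practice, one often maximizes the log-likelihood function $$\ell(\theta)=\ln L(\theta)$$ instead; since $$\ln$$ is strictly increasing, both functions attain their maxima at the same points.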

Now, let us find the MLE of the unknown parameter $$p$$ in the previous coin flipping example.
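
A sketch of the standard derivation, assuming as above that $$X_1,\dotsc,X_n$$ are independent $$\operatorname{Bernoulli}(p)$$ random variables: the likelihood and log-likelihood functions are
$$L(p)=p^{\sum_{i=1}^{n}x_i}(1-p)^{n-\sum_{i=1}^{n}x_i},\qquad \ell(p)=\Big(\sum_{i=1}^{n}x_i\Big)\ln p+\Big(n-\sum_{i=1}^{n}x_i\Big)\ln(1-p).$$
Setting
$$\ell'(p)=\frac{\sum_{i=1}^{n}x_i}{p}-\frac{n-\sum_{i=1}^{n}x_i}{1-p}=0$$
yields $$p=\frac{1}{n}\sum_{i=1}^{n}x_i=\overline x$$, and since $$\ell''(p)<0$$ there, this is indeed a maximum. Hence the MLE is $$\hat p=\overline X$$, the sample proportion of heads.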

Sometimes, there is a constraint imposed on the parameter when we are finding its MLE. The MLE of the parameter in this case is called a restricted MLE. We will illustrate this in the following example.

To find the MLE, we sometimes use methods other than the derivative test, in which case we do not need to find the log-likelihood function. Let us illustrate this in the following example.
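
As a sketch of such an argument (a standard example, assuming a $$\mathrm{Uniform}(0,\theta)$$ distribution; the example in the text may differ): if $$X_1,\dotsc,X_n$$ is a random sample from $$\mathrm{Uniform}(0,\theta)$$, the likelihood function is
$$L(\theta)=\prod_{i=1}^{n}\frac{1}{\theta}\mathbf 1\{0\le x_i\le\theta\}=\theta^{-n}\,\mathbf 1\{\theta\ge\max_i x_i\},$$
which is zero for $$\theta<\max_i x_i$$ and strictly decreasing for $$\theta\ge\max_i x_i$$. By monotonicity alone, the maximum is attained at the left endpoint, so $$\hat\theta=\max_i X_i=X_{(n)}$$; the derivative test does not apply, since $$L$$ is not differentiable at that point.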

In the following example, we will find the MLE of a parameter vector.

Method of moments estimator (MME)
For maximum likelihood estimation, we need to utilize the likelihood function, which is found from the joint pmf or pdf of the random sample from a distribution. However, in practice we may not know the pmf or pdf of the distribution exactly. Instead, we may just know some information about the distribution, e.g. its mean, variance, and other moments (the $$r$$th moment of a random variable $$X$$ is $$\mathbb E[X^r]$$, which we denote by $$\mu_r$$ for simplicity). Such moments often contain information about the unknown parameter. For example, for a normal distribution $$\mathcal N(\mu,\sigma^2)$$, we know that $$\mu=\mu_1$$ and $$\sigma^2=\mu_2-(\mu_1)^2$$. Because of this, when we want to estimate the parameters, we can do so through estimating the moments.

Now, we would like to know how to estimate the moments. We let $$m_r=\frac{\sum_{i=1}^{n}X_i^r}{n}$$ be the $$r$$th sample moment, where the $$X_i$$'s are independent and identically distributed. By the weak law of large numbers (assuming its conditions are satisfied), we have
 * $$\overline X=m_1\;\overset{p}\to\; \mathbb E[X]=\mu_1$$
 * $$m_2\;\overset{p}\to\; \mathbb E[X^2]=\mu_2$$ (this can be seen by replacing "$$X$$" with "$$X^2$$" in the weak law of large numbers; the conditions are still satisfied, so we can still apply the law)

In general, we have $$m_r\;\overset{p}\to\; \mu_r$$, since the conditions remain satisfied after replacing "$$X$$" with "$$X^r$$" in the weak law of large numbers.
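
This convergence can also be checked numerically. Below is a minimal sketch (not from the original text; it assumes NumPy is available), using an $$\operatorname{Exponential}(1)$$ distribution, whose $$r$$th moment is $$\mu_r=r!$$:

```python
import numpy as np

# Sample moments m_1 and m_2 of an Exponential(1) sample should
# approach the population moments mu_1 = 1 and mu_2 = 2 as n grows.
rng = np.random.default_rng(42)

for n in (100, 10_000, 1_000_000):
    x = rng.exponential(scale=1.0, size=n)
    m1 = x.mean()          # 1st sample moment
    m2 = (x ** 2).mean()   # 2nd sample moment
    print(f"n={n:>9}: m1={m1:.4f}, m2={m2:.4f}")
```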

Because of these results, we can estimate the $$r$$th moment $$\mu_r$$ using the $$r$$th sample moment $$m_r$$, and the estimation is "better" when $$n$$ is large. For example, in the above normal distribution example, we can estimate $$\mu$$ by $$m_1$$ and $$\sigma^2$$ by $$m_2-(m_1)^2$$, and these estimators are actually called the method of moments estimators.
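
As a minimal sketch of this normal-distribution example (again assuming NumPy; the true values $$\mu=2$$ and $$\sigma^2=9$$ are chosen only for illustration):

```python
import numpy as np

# Method of moments estimates for N(mu, sigma^2), using the
# identities mu = mu_1 and sigma^2 = mu_2 - (mu_1)^2.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)  # true mu = 2, sigma^2 = 9

m1 = x.mean()              # 1st sample moment
m2 = (x ** 2).mean()       # 2nd sample moment

mu_mme = m1                # MME of mu
sigma2_mme = m2 - m1 ** 2  # MME of sigma^2

print(mu_mme, sigma2_mme)  # close to 2 and 9 for large n
```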

To be more precise, we have the following definition of the method of moments estimator (MME):
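
A standard way to state this definition: if the distribution has $$k$$ unknown parameters $$\theta=(\theta_1,\dotsc,\theta_k)$$, write the first $$k$$ moments as functions of the parameters, $$\mu_r=\mu_r(\theta)$$ for $$r=1,\dotsc,k$$, and equate them with the corresponding sample moments. The method of moments estimator $$\tilde\theta$$ is then the solution of the system
$$\mu_r(\tilde\theta)=m_r,\qquad r=1,\dotsc,k.$$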

Properties of estimator
In this section, we will introduce some criteria for evaluating how "good" a point estimator is, namely unbiasedness, efficiency, and consistency.

Unbiasedness
For $$\hat\theta$$ to be a "good" estimator of a parameter $$\theta$$, a desirable property of $$\hat\theta$$ is that its expected value equals the value of the parameter $$\theta$$, or is at least close to that value. Because of this, we introduce a quantity, namely the bias, to measure how close the mean of $$\hat\theta$$ is to $$\theta$$.
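
The standard definition of this quantity is
$$\operatorname{Bias}(\hat\theta)=\mathbb E[\hat\theta]-\theta,$$
so that $$\hat\theta$$ is unbiased exactly when $$\operatorname{Bias}(\hat\theta)=0$$ for every $$\theta\in\Theta$$.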

We will also define some terms related to bias.

Efficiency
We have discussed how to evaluate the unbiasedness of estimators. Now, if we are given two unbiased estimators, $$\hat\theta$$ and $$\tilde\theta$$, how should we compare their goodness? Their goodness is the same if we compare them only in terms of unbiasedness, so we need another criterion in this case. One possible way is to compare their variances: the one with the smaller variance is better, since on average it deviates less from its mean, which equals the value of the unknown parameter by the definition of an unbiased estimator; thus the one with the smaller variance is more accurate in this deviation sense. Indeed, an unbiased estimator can still have a large variance, and thus deviate a lot from its mean; such an estimator is unbiased because the positive and negative deviations cancel each other out. This is the idea of efficiency.

Actually, for the variance of an unbiased estimator, since the mean of the unbiased estimator is the unknown parameter $$\theta$$, the variance measures the mean of the squared deviation from $$\theta$$, and we have a specific term for this quantity, namely the mean squared error (MSE).

Notice that in the definition of the MSE, we do not require $$\hat\theta$$ to be an unbiased estimator. Thus, the $$\hat\theta$$ in the definition may be biased. We have mentioned that when $$\hat\theta$$ is unbiased, its variance is actually its MSE. In the following, we give a more general relationship between $$\operatorname{MSE}(\hat\theta)$$ and $$\operatorname{Var}(\hat\theta)$$, valid not just for unbiased estimators.
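
The relationship in question is the standard bias-variance decomposition. Writing $$\hat\theta-\theta=(\hat\theta-\mathbb E[\hat\theta])+(\mathbb E[\hat\theta]-\theta)$$ and noting that the cross term has expectation zero (since $$\mathbb E[\hat\theta-\mathbb E[\hat\theta]]=0$$), we get
$$\operatorname{MSE}(\hat\theta)=\mathbb E\big[(\hat\theta-\theta)^2\big]=\operatorname{Var}(\hat\theta)+\big(\operatorname{Bias}(\hat\theta)\big)^2,$$
which reduces to $$\operatorname{MSE}(\hat\theta)=\operatorname{Var}(\hat\theta)$$ when $$\hat\theta$$ is unbiased.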

Uniformly minimum-variance unbiased estimator
Now, we know that the smaller the variance of an unbiased estimator, the more efficient (and "better") it is. Thus, it is natural to ask what the most efficient (i.e., the "best") unbiased estimator is, i.e., the unbiased estimator with the smallest variance. We have a specific name for such an unbiased estimator, namely the uniformly minimum-variance unbiased estimator (UMVUE). To be more precise, we have the following definition for the UMVUE:
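
A standard statement of the definition: an estimator $$\hat\theta^*$$ is a uniformly minimum-variance unbiased estimator of $$\theta$$ if (i) $$\mathbb E[\hat\theta^*]=\theta$$ for every $$\theta\in\Theta$$, and (ii) $$\operatorname{Var}(\hat\theta^*)\le\operatorname{Var}(\hat\theta)$$ for every unbiased estimator $$\hat\theta$$ of $$\theta$$ and every $$\theta\in\Theta$$. The word "uniformly" refers to the requirement that (ii) holds for all values of $$\theta$$ in the parameter space.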

Indeed, the UMVUE is unique, i.e., there is exactly one unbiased estimator with the smallest variance among all unbiased estimators, and we will prove this in the following.

Cramer-Rao lower bound
Without using some results, it is quite difficult to determine the UMVUE, since there are many (perhaps even infinitely many) possible unbiased estimators, making it quite hard to ensure that one particular unbiased estimator is more efficient than every other possible unbiased estimator.

Therefore, we will introduce some approaches that help us find the UMVUE. For the first approach, we find a lower bound on the variances of all possible unbiased estimators. After obtaining such a lower bound, if we can find an unbiased estimator whose variance is exactly equal to the lower bound, then the lower bound is the minimum value of the variances, and hence that unbiased estimator is a UMVUE by definition.

A common way to find such a lower bound is to use the Cramer-Rao lower bound (CRLB), which we obtain through the Cramer-Rao inequality. Before stating the inequality, let us define some related terms.
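
The related terms are commonly defined as follows (a standard formulation for a scalar parameter and pmf or pdf $$f(x;\theta)$$): the score is
$$s(\theta;X)=\frac{\partial}{\partial\theta}\ln f(X;\theta),$$
and the Fisher information is
$$I(\theta)=\mathbb E\!\left[\left(\frac{\partial}{\partial\theta}\ln f(X;\theta)\right)^{2}\right].$$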

The regularity conditions which allow the interchange of derivative and integral include:
 * 1) the partial derivatives involved should exist, i.e., the natural log of the functions involved should be differentiable;
 * 2) the integrals involved should be differentiable;
 * 3) the support should not depend on the parameter(s) involved.

We have some results that assist us in computing the Fisher information.
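
Two standard results of this kind (valid under the regularity conditions above): the Fisher information can be computed via the second derivative,
$$I(\theta)=-\mathbb E\!\left[\frac{\partial^{2}}{\partial\theta^{2}}\ln f(X;\theta)\right],$$
and the information carried by an independent and identically distributed random sample of size $$n$$ is $$nI(\theta)$$. With these, the Cramer-Rao inequality states that every unbiased estimator $$\hat\theta$$ of $$\theta$$ based on the random sample satisfies
$$\operatorname{Var}(\hat\theta)\ \ge\ \frac{1}{nI(\theta)},$$
and the right-hand side is the CRLB.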

Sometimes, we cannot use the CRLB method for finding the UMVUE, because
 * the regularity conditions may not be satisfied, in which case we cannot use the Cramer-Rao inequality; or
 * the variance of the unbiased estimator may not be equal to the CRLB, yet we cannot conclude that it is not a UMVUE, since it may be that the CRLB is not attainable at all, and the smallest variance among all unbiased estimators is actually the variance of that estimator, which is larger than the CRLB.

We will illustrate examples of these two cases in the following.

Since the CRLB is sometimes attainable and sometimes not, it is natural to ask when the CRLB can be attained. In other words, we would like to know the attainment conditions for the CRLB, which are stated in the following corollary.

We have discussed the MLE previously, and the MLE is actually a "best choice" asymptotically (i.e., as the sample size $$n\to\infty$$) according to the following theorem.
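
The theorem alluded to is commonly stated as follows (under regularity conditions): as $$n\to\infty$$,
$$\sqrt n\,\big(\hat\theta_{\mathrm{MLE}}-\theta\big)\ \overset{d}\to\ \mathcal N\!\left(0,\ \frac{1}{I(\theta)}\right),$$
so for large $$n$$ the MLE is approximately unbiased with variance approximately $$\frac{1}{nI(\theta)}$$, the CRLB; in this sense the MLE is asymptotically efficient.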

Since we are not able to use the CRLB to find the UMVUE in some situations, we will introduce another method for finding the UMVUE in the following, which uses the concepts of sufficiency and completeness.

Sufficiency
Intuitively, a sufficient statistic $$T(X_1,\dotsc,X_n)$$, which is a function of a given random sample $$X_1,\dotsc,X_n$$, contains all the information needed for estimating the unknown parameter (vector) $$\theta$$. Thus, the statistic $$T(X_1,\dotsc,X_n)$$ itself is "sufficient" for estimating the unknown parameter (vector) $$\theta$$.

Formally, we can define and describe sufficient statistics as follows:
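
A standard way to define sufficiency: a statistic $$T=T(X_1,\dotsc,X_n)$$ is sufficient for $$\theta$$ if the conditional distribution of $$X_1,\dotsc,X_n$$ given $$T=t$$ does not depend on $$\theta$$ (for every value $$t$$). Intuitively, once $$T$$ is known, the remaining variation in the sample carries no further information about $$\theta$$.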

Let us state the above remark about transformations of a sufficient statistic formally below.
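
A standard statement of this result: if $$T$$ is a sufficient statistic for $$\theta$$ and $$g$$ is a one-to-one function, then $$g(T)$$ is also a sufficient statistic for $$\theta$$.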

Now, we discuss a theorem that helps us check the sufficiency of a statistic, namely the factorization theorem (Fisher-Neyman).
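
The theorem is standardly stated as: a statistic $$T$$ is sufficient for $$\theta$$ if and only if the joint pmf or pdf of the random sample factors as
$$f(x_1,\dotsc,x_n;\theta)=g\big(T(x_1,\dotsc,x_n);\theta\big)\,h(x_1,\dotsc,x_n),$$
where $$h$$ does not depend on $$\theta$$. For instance, in the Bernoulli case, $$f(x_1,\dotsc,x_n;p)=p^{\sum_i x_i}(1-p)^{n-\sum_i x_i}$$ factors with $$T=\sum_{i=1}^{n}x_i$$, $$g(t;p)=p^{t}(1-p)^{n-t}$$ and $$h\equiv 1$$, so $$\sum_{i=1}^{n}X_i$$ is sufficient for $$p$$.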

For some "nice" distributions, which belong to, sufficient statistics can be found using another alternative method easily and more conveniently. This method works because of the "nice" form of the pdf or pmf of those distributions, which can be characterized as follows:

Now, we will start discussing how sufficient statistics are related to the UMVUE. We begin our discussion with the Rao-Blackwell theorem.
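
The Rao-Blackwell theorem, in its standard form, states: if $$\hat\theta$$ is an unbiased estimator of $$\theta$$ and $$T$$ is a sufficient statistic for $$\theta$$, then $$\phi(T)=\mathbb E[\hat\theta\mid T]$$ is a genuine statistic (it does not depend on $$\theta$$, by sufficiency), is unbiased for $$\theta$$, and satisfies
$$\operatorname{Var}\big(\phi(T)\big)\le\operatorname{Var}\big(\hat\theta\big)$$
for every $$\theta\in\Theta$$. In other words, conditioning an unbiased estimator on a sufficient statistic never makes it worse.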

To actually determine the UMVUE, we need another theorem, called the Lehmann-Scheffé theorem, which is based on the Rao-Blackwell theorem and requires the concept of completeness.

Completeness
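A standard definition (different texts state the details slightly differently): a statistic $$T$$ is complete for $$\theta$$ if, for every function $$g$$,
$$\mathbb E[g(T)]=0\ \text{for all}\ \theta\in\Theta\quad\Longrightarrow\quad \mathbb P\big(g(T)=0\big)=1\ \text{for all}\ \theta\in\Theta.$$
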
When a random sample $$X_1,\dotsc,X_n$$ is from a distribution in the exponential family, a complete statistic can also be found easily, similar to the case for sufficient statistics.
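
With completeness available, the Lehmann-Scheffé theorem mentioned above is standardly stated as: if $$T$$ is a complete sufficient statistic for $$\theta$$ and $$\phi(T)$$ is an unbiased estimator of $$\theta$$, then $$\phi(T)$$ is the unique UMVUE of $$\theta$$.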

Consistency
In the previous sections, we have discussed unbiasedness and efficiency. In this section, we will discuss another property, called consistency.
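
The standard definition: a sequence of estimators $$\hat\theta_n=\hat\theta_n(X_1,\dotsc,X_n)$$ is consistent for $$\theta$$ if $$\hat\theta_n\;\overset{p}\to\;\theta$$ for every $$\theta\in\Theta$$, i.e., for every $$\varepsilon>0$$,
$$\lim_{n\to\infty}\mathbb P\big(|\hat\theta_n-\theta|>\varepsilon\big)=0.$$
For example, by the weak law of large numbers, the sample moments $$m_r$$ discussed earlier are consistent estimators of the moments $$\mu_r$$.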