Probability/Conditional Distributions

Motivation
Suppose there is an earthquake. Let $$X$$ be the number of casualties and $$Y$$ be the magnitude of the earthquake on the Richter scale.

(a) Without being given any information, what is the distribution of $$X$$?

(b) Given that $$Y=1$$, what is the distribution of $$X$$?

(c) Given that $$Y=9$$, what is the distribution of $$X$$?

Are your answers to (a), (b) and (c) different?

In (b) and (c), we have the distribution of $$X$$ given $$Y=1$$ and the distribution of $$X$$ given $$Y=9$$, respectively.

In general, we have the conditional distribution of $$X$$ given $$Y$$ (before observing the value of $$Y$$), or of $$X$$ given $$Y=y$$ (after observing the value of $$Y$$).

Conditional distributions
Recall the definition of conditional probability: $$ \mathbb P(A|B)=\frac{\mathbb P(A\cap B)}{\mathbb P(B)}, $$ in which $$A,B$$ are events with $$\mathbb P(B)>0$$. Applying this definition to $$X,Y$$, we have $$\mathbb P(X=x|Y=y)=\frac{\mathbb P(X=x\cap Y=y)}{\mathbb P(Y=y)}=\frac{f(x,y)}{f_Y(y)},$$ where $$f(x,y)$$ is the joint pmf of $$X$$ and $$Y$$, and $$f_Y(y)$$ is the marginal pmf of $$Y$$. It is natural to call such a conditional probability the conditional pmf, right? We will denote it by $$f_{X|Y}(x|y)$$. Then, this is basically the definition of the conditional pmf: the conditional pmf of $$X$$ given $$Y=y$$ is the conditional probability $$\mathbb P(X=x|Y=y)$$. Naturally, we will expect that the conditional pdf is defined similarly. This is indeed the case:
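As a quick numerical illustration (my own, with a made-up joint pmf, not taken from the text), the following Python sketch tabulates $$f_{X|Y}(x|y)$$ by dividing the joint pmf by the marginal pmf of $$Y$$:

```python
# A minimal sketch: conditional pmf from a (made-up) joint pmf table.
# Keys are (x, y) pairs; entries are f(x, y).
joint_pmf = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.40,
}

def marginal_Y(y):
    # f_Y(y) = sum over x of f(x, y)
    return sum(p for (x, yy), p in joint_pmf.items() if yy == y)

def conditional_pmf_X_given_Y(x, y):
    # f_{X|Y}(x|y) = f(x, y) / f_Y(y), defined only when f_Y(y) > 0
    return joint_pmf[(x, y)] / marginal_Y(y)

# Given Y = 1: f_Y(1) = 0.6, so f_{X|Y}(0|1) = 1/3 and f_{X|Y}(1|1) = 2/3.
print(conditional_pmf_X_given_Y(0, 1), conditional_pmf_X_given_Y(1, 1))
```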

To understand the definition more intuitively for the continuous case, consider the following diagram.

(Figure: top, side, and front views of the joint pdf $$f(x,y)$$ sliced at a fixed $$y$$; the front view shows the cross section over the corresponding interval of the $$x$$-axis, whose area is $$f_Y(y)$$.)

We can see that when we condition on $$Y=y$$, we take a "slice" out of the region under the joint pdf. The area of the whole "slice" is the area between the graph of the joint pdf $$f(x,y)$$, with $$y$$ fixed and $$x$$ variable, and the $$x$$-axis. This area is given by $$\int_{-\infty}^{\infty}f(x,y)\,dx=f_Y(y)$$, while according to the probability axioms the area under a pdf should equal 1. Hence, we scale down the area of the "slice" by a factor of $$f_Y(y)$$, by dividing the univariate function $$x\mapsto f(x,y)$$ by $$f_Y(y)$$. After that, the curve at the top of the scaled "slice" is the graph of the conditional pdf $$\frac{f(x,y)}{f_Y(y)}$$.
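As a small worked example of this rescaling (my own illustration): take the joint pdf $$f(x,y)=x+y$$ on the unit square. Then $$f_Y(y)=\int_0^1(x+y)\,dx=y+\tfrac12$$, so $$f_{X|Y}(x|y)=\frac{x+y}{y+1/2}$$ for $$0\le x\le 1$$. A numerical check:

```python
from scipy import integrate

def joint_pdf(x, y):
    # joint pdf f(x, y) = x + y on the unit square (a made-up example)
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

def conditional_pdf_X_given_Y(x, y):
    # f_{X|Y}(x|y) = f(x, y) / f_Y(y), with f_Y(y) obtained by
    # numerically integrating the slice over x
    f_Y, _ = integrate.quad(lambda t: joint_pdf(t, y), 0, 1)
    return joint_pdf(x, y) / f_Y

# The rescaled slice integrates to 1, as a conditional pdf should:
area, _ = integrate.quad(lambda t: conditional_pdf_X_given_Y(t, 0.3), 0, 1)
print(area)  # ~1.0
print(conditional_pdf_X_given_Y(0.5, 0.3))  # (0.5+0.3)/(0.3+0.5) = 1.0
```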

Now, we have discussed the cases where both random variables are discrete or both are continuous. How about the case where one of them is discrete and the other is continuous? In this case, there is no "joint probability function" of the two random variables, since one is discrete and the other is continuous! But we can still define the conditional probability functions in some other way. To motivate the following definition, suppose $$X$$ is continuous and $$Y$$ is discrete, and let $$F_{X|Y}(x|y)$$ be the conditional probability $$\mathbb P(X\le x|Y=y)$$. Then, differentiating $$F_{X|Y}(x|y)$$ with respect to $$x$$ should yield the conditional pdf $$f_{X|Y}(x|y)$$. So, we have $$ \begin{align} f_{X|Y}(x|y)=\frac{d}{dx}F_{X|Y}(x|y) &=\lim_{h\to 0}\frac{\mathbb P(X\le x+h|Y=y)-\mathbb P(X\le x|Y=y)}{h}\\ &=\lim_{h\to 0}\frac{\mathbb P(x< X\le x+h|Y=y)}{h}\\ &=\lim_{h\to 0}\frac{\mathbb P(Y=y|x< X\le x+h)\mathbb P(x< X\le x+h)}{h\,\mathbb P(Y=y)}\\ &=\lim_{h\to 0}\frac{\mathbb P(Y=y|x< X\le x+h)}{\mathbb P(Y=y)}\lim_{h\to 0}\frac{\mathbb P(x< X\le x+h)}{h}\\ &=\frac{\mathbb P(Y=y|X=x)\frac{d}{dx}F_X(x)}{\mathbb P(Y=y)}\\ &=\frac{\mathbb P(Y=y|X=x)f_X(x)}{\mathbb P(Y=y)}, \end{align} $$ where the third equality follows from Bayes' theorem. Thus, it is natural to have the following definition.
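To see the formula in action, here is a hedged sketch with a made-up mixed model: $$X\sim\mathrm{Uniform}(0,1)$$ is continuous and $$Y$$ is discrete with $$\mathbb P(Y=1|X=x)=x$$, so that $$\mathbb P(Y=1)=\int_0^1 x\,dx=\tfrac12$$ and the formula gives $$f_{X|Y}(x|1)=\frac{x\cdot 1}{1/2}=2x$$:

```python
import numpy as np

rng = np.random.default_rng(0)

# X is continuous, Y is discrete: X ~ Uniform(0,1), P(Y=1 | X=x) = x
n = 1_000_000
x = rng.uniform(0, 1, size=n)
y = rng.uniform(0, 1, size=n) < x  # Bernoulli(x) draws

# Histogram of X restricted to the event {Y = 1} estimates f_{X|Y}(x|1) = 2x
hist, edges = np.histogram(x[y], bins=10, range=(0, 1), density=True)
midpoints = (edges[:-1] + edges[1:]) / 2
print(np.round(hist, 2))        # close to 2 * midpoints
print(np.round(2 * midpoints, 2))
```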

Now, how about the case where $$X$$ is discrete and $$Y$$ is continuous? In this case, let us use the definition above to motivate the definition, interchanging the roles of $$X$$ and $$Y$$ so that its assumptions are still satisfied. Then, we get $$f_{Y|X}(y|x)=\frac{\mathbb P(X=x|Y=y)f_{Y}(y)}{\mathbb P(X=x)}.$$ Here $$X$$ is discrete, so it is natural to define the conditional pmf of $$X$$ given $$Y=y$$ as the $$\mathbb P(X=x|Y=y)$$ appearing in this expression. After rearranging the terms, we get $$\mathbb P(X=x|Y=y)=\frac{f_{Y|X}(y|x)\mathbb P(X=x)}{f_Y(y)}.$$ Thus, we have the following definition.
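A minimal sketch of this case (my own, hypothetical model): let $$X\sim\mathrm{Bernoulli}(1/2)$$ and $$Y|X=x\sim\mathcal N(x,1)$$; the rearranged formula then computes the conditional pmf $$\mathbb P(X=x|Y=y)$$:

```python
from math import exp, sqrt, pi

def normal_pdf(t, mu, sigma=1.0):
    return exp(-((t - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def pmf_X_given_Y(x, y, p=0.5):
    # P(X=x | Y=y) = f_{Y|X}(y|x) P(X=x) / f_Y(y), where
    # f_Y(y) = sum over x' of f_{Y|X}(y|x') P(X=x')
    prior = {0: 1 - p, 1: p}
    f_Y = sum(normal_pdf(y, mu) * prior[mu] for mu in (0, 1))
    return normal_pdf(y, x) * prior[x] / f_Y

print(pmf_X_given_Y(1, y=0.5))  # exactly 0.5 by symmetry
print(pmf_X_given_Y(1, y=2.0))  # well above 0.5: y=2 favours X=1
```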

Based on the definitions of the conditional probability functions, it is natural to define the conditional cdf as follows.

Graphical illustration of the definition (continuous random variables):

(Figure: top, side, and front views of the joint pdf sliced at the fixed $$y$$, with the point $$x$$ marked on the corresponding interval; the front view shades the part of the cross section up to $$x$$, whose area, once the slice is rescaled by $$f_Y(y)$$, equals the conditional cdf $$F_{X|Y}(x|y)$$.)

If $$Y=\mathbf 1\{A\}$$ for some event $$A$$, we have some special notations for simplicity: $$ f_{X|Y}(x|y)= \begin{cases} f(x|A),& y=1;\\ f(x|A^c),& y=0. \end{cases} $$ $$ F_{X|Y}(x|y)=\mathbb P(X\le x|Y=y) =\begin{cases} F(x|A),& y=1;\\ F(x|A^c),& y=0. \end{cases} $$
 * the conditional probability function of $$X$$ given $$Y=y$$ becomes $$f(x|A)$$ when $$y=1$$, and $$f(x|A^c)$$ when $$y=0$$;
 * the conditional cdf of $$X$$ given $$Y=y$$ becomes $$F(x|A)$$ when $$y=1$$, and $$F(x|A^c)$$ when $$y=0$$.
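For instance (a made-up check, not from the text): if $$X\sim\mathcal N(0,1)$$ and $$A=\{X>0\}$$, then $$f(x|A)=\frac{f_X(x)}{\mathbb P(A)}=2f_X(x)$$ for $$x>0$$, which the following simulation sketch verifies:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)
x_given_A = x[x > 0]  # condition on the event A = {X > 0}, i.e. Y = 1{A} = 1

# Estimate f(x|A) by a histogram normalized over ALL conditioned samples;
# the theory predicts f(x|A) = f_X(x) / P(A) = 2 f_X(x) for x > 0.
counts, edges = np.histogram(x_given_A, bins=5, range=(0, 2))
mid = (edges[:-1] + edges[1:]) / 2
est = counts / (len(x_given_A) * (edges[1] - edges[0]))
print(np.round(est, 2))
print(np.round(2 * np.exp(-mid**2 / 2) / np.sqrt(2 * np.pi), 2))
```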

We can extend the definitions of the conditional probability function and the conditional cdf to groups of random variables, i.e. to joint cdf's and joint probability functions, as follows:

Then, we also have a similar proposition for determining independence of two random vectors.

Conditional distributions of bivariate normal distribution
Recall from the Important Distributions chapter that the joint pdf of $$\mathcal N_2(\boldsymbol\mu,\boldsymbol\Sigma)$$ is $$f(x,y)=\frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}}\exp\left(-\frac{1}{2(1-\rho^2)}\left(\left(\frac{x-\mu_X}{\sigma_X}\right)^2-2\rho\left(\frac{x-\mu_X}{\sigma_X}\right)\left(\frac{y-\mu_Y}{\sigma_Y}\right)+\left(\frac{y-\mu_Y}{\sigma_Y}\right)^2\right)\right),\quad (x,y)\in\mathbb R^2,$$ in which $$\rho=\rho(X,Y)$$ and $$\sigma_X,\sigma_Y$$ are positive; in this case, $$X\sim\mathcal N(\mu_X,\sigma_X^2)$$ and $$Y\sim\mathcal N(\mu_Y,\sigma_Y^2)$$.
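Completing the square in $$x$$ yields the standard result $$X|Y=y\sim\mathcal N\!\left(\mu_X+\rho\frac{\sigma_X}{\sigma_Y}(y-\mu_Y),\,(1-\rho^2)\sigma_X^2\right)$$. A simulation sketch (with arbitrarily chosen parameters) that checks the conditional mean and standard deviation by keeping samples whose $$Y$$ falls near $$y$$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Arbitrary parameters for the check
mu_X, mu_Y, s_X, s_Y, rho = 1.0, -2.0, 2.0, 0.5, 0.7
cov = [[s_X**2, rho * s_X * s_Y], [rho * s_X * s_Y, s_Y**2]]
xy = rng.multivariate_normal([mu_X, mu_Y], cov, size=2_000_000)

# Condition (approximately) on Y = y by keeping samples with Y near y
y = -1.5
x_cond = xy[np.abs(xy[:, 1] - y) < 0.01, 0]

print(x_cond.mean())  # ~ mu_X + rho*(s_X/s_Y)*(y - mu_Y) = 1 + 2.8*0.5 = 2.4
print(x_cond.std())   # ~ s_X * sqrt(1 - rho**2) ≈ 1.43
```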

Conditional version of concepts
We can obtain conditional versions of concepts previously established for 'unconditional' distributions by substituting the 'unconditional' cdf, pdf or pmf, i.e. $$F(\cdot)$$ or $$f(\cdot)$$, with their conditional counterparts, i.e. $$F(\cdot|\cdot)$$ or $$f(\cdot|\cdot)$$.

Conditional expectation
Similarly, we have a conditional version of the law of the unconscious statistician.
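In the discrete case, for example, the conditional LOTUS reads $$\mathbb E[g(X)|Y=y]=\sum_x g(x)\,f_{X|Y}(x|y)$$. A tiny self-contained sketch (the conditional pmf values are made up):

```python
# Conditional LOTUS sketch: E[g(X) | Y=y] = sum over x of g(x) f_{X|Y}(x|y)
cond_pmf = {0: 1/3, 1: 2/3}   # a made-up conditional pmf f_{X|Y}(x|y)

def cond_expectation(g):
    return sum(g(x) * p for x, p in cond_pmf.items())

print(cond_expectation(lambda x: x))      # E[X | Y=y]   = 2/3
print(cond_expectation(lambda x: x**2))   # E[X^2 | Y=y] = 2/3
```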

The properties of $$\mathbb E[\cdot]$$ still hold for conditional expectations $$\mathbb E[\cdot|Y]$$, with 'unconditional' expectations replaced by conditional expectations and some suitable modifications, as follows:

The following theorem about conditional expectation is quite important.
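Judging from the later reference to it in this section, the theorem in question is presumably the law of total expectation, $$\mathbb E[\mathbb E[X|Y]]=\mathbb E[X]$$. A simulation sketch with a made-up hierarchical model:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up hierarchical model: Y ~ Uniform(0,1), X | Y=y ~ Poisson(5y)
n = 1_000_000
y = rng.uniform(0, 1, size=n)
x = rng.poisson(5 * y)

# E[X | Y] = 5Y, so E[E[X|Y]] = 5 * E[Y] = 2.5 = E[X].
print(x.mean())        # ~ 2.5
print((5 * y).mean())  # ~ 2.5
```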

After defining conditional expectation, we can also define conditional variance, conditional covariance, and the conditional correlation coefficient, since variance, covariance, and the correlation coefficient are built upon expectation.

Conditional variance
Similarly, we have properties of conditional variance which are similar to those of 'unconditional' variance.

Besides the law of total expectation, we also have the law of total variance, as follows:
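The law of total variance states that $$\operatorname{Var}(X)=\mathbb E[\operatorname{Var}(X|Y)]+\operatorname{Var}(\mathbb E[X|Y])$$. A simulation sketch (same made-up model as above, where $$\mathbb E[X|Y]=\operatorname{Var}(X|Y)=5Y$$):

```python
import numpy as np

rng = np.random.default_rng(4)

# Same made-up model: Y ~ Uniform(0,1), X | Y=y ~ Poisson(5y), so
# E[X|Y] = Var(X|Y) = 5Y.
n = 1_000_000
y = rng.uniform(0, 1, size=n)
x = rng.poisson(5 * y)

lhs = x.var()
rhs = (5 * y).mean() + (5 * y).var()  # E[Var(X|Y)] + Var(E[X|Y])
print(lhs, rhs)  # both ~ 2.5 + 25/12 ≈ 4.58
```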