Probability/Random Variables

Motivation
In many experiments, there may be so many possible outcomes in the sample space that we want to work with a "summary variable" for those outcomes instead. For example, suppose a poll is conducted on 100 different people, asking each whether they agree with a certain proposal. To record the answers from those 100 people completely, we may first use a number to indicate each response (for simplicity, we assume that only these two responses are available):
 * number "1" for "agree";
 * number "0" for "disagree".
After that, to record which person gave which response, we use a vector of 100 numbers, e.g. $$(1,0,1,0,0,\dotsc,1,0,0)$$. Since each coordinate of the vector has two choices, "0" or "1", there are in total $$2^{100}\approx 1.268\times 10^{30}$$ different vectors in the sample space (denoted by $$\Omega$$)! Hence, it is very tedious and complicated to work with that many outcomes in the sample space $$\Omega$$. Instead, we are often only interested in how many "agree"s and "disagree"s there are, rather than which person gave which response, since the numbers of "agree"s and "disagree"s determine whether the proposal is supported by a majority, and thus capture the essence of the poll.

Hence, it is more convenient to define a variable $$X$$ which gives the number of "1"s in the 100 coordinates in every outcome in the sample space $$\Omega$$. Then, $$X$$ can only take 101 possible values: 0,1,2,...,100, which is much fewer than the number of outcomes in the original sample space.

Through this, we can change the original experiment to a new experiment, where the variable $$X$$ takes one of the 101 possible values according to certain probabilities. For this new experiment, the sample space becomes $$\{0,1,\dotsc,100\}$$.

During the above process of defining the variable $$X$$ (called a random variable), we have actually (implicitly) defined a function whose domain is the original sample space and whose range is $$\{0,1,\dotsc,100\}$$. Usually, we take the codomain of the random variable to be the set of all real numbers $$\mathbb R$$. That is, we define the random variable $$X:\Omega\to\mathbb R$$ by $$X(\omega)=\text{number of 1s in the coordinates of }\omega$$ for every $$\omega\in\Omega$$.
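To make this concrete, here is a short Python sketch of the poll example. It assumes, purely for illustration (the text does not specify any probabilities), that all $$2^{100}$$ response vectors are equally likely; under that assumption $$\mathbb P(X=k)$$ can be computed by counting, without ever enumerating $$\Omega$$:

```python
from math import comb

N = 100  # number of respondents

# Hypothetical assumption: every response vector in Omega is equally
# likely. Then P(X = k) counts the vectors with exactly k ones:
# C(100, k) / 2^100.
def p_X(k):
    return comb(N, k) / 2**N

# The 101 values of X carry the whole distribution of interest,
# even though Omega has about 1.27e30 outcomes:
total = sum(p_X(k) for k in range(N + 1))
print(total)              # sums to 1
print(p_X(50) > p_X(30))  # True: 50 "agree"s is far likelier than 30
```

The point is that working with the 101 values of $$X$$ is enough; the astronomically large $$\Omega$$ never has to be listed.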

Definition
To define random variables formally, we need the concept of a measurable function:

By defining a random variable $$X:\Omega\to\mathbb R$$ on a probability space $$(\Omega,\mathcal F,\mathbb P)$$, we actually induce a new probability space $$(\mathcal X,\mathcal F_X,\mathbb P_X)$$:
 * The induced sample space $$\mathcal X$$ is the range of the random variable $$X$$: $$\mathcal X=\{X(\omega):\omega\in\Omega\}\subseteq\mathbb R$$.
 * The induced event space $$\mathcal F_X$$ is a $$\sigma$$-algebra of $$\mathcal X$$. (Here, we follow our previous convention: $$\mathcal F_X=\mathcal P(\mathcal X)$$ when $$\mathcal X$$ is countable.)
 * The induced probability measure $$\mathbb P_X:\mathcal F_X\to[0,1]$$ is defined by $$\mathbb P_X(E)=\mathbb P(\{X\in E\})$$ for every $$E\in\mathcal F_X$$.

It turns out that the induced probability measure satisfies all the probability axioms:

After proving this result, it follows that all properties of probability measure discussed previously also apply to the induced probability measure $$\mathbb P_X$$. Hence, we can use the properties of probability measure to calculate the probability $$\mathbb P_X(E)$$, and hence $$\mathbb P(X\in E)$$, for every set $$E\in\mathcal F_X$$. More generally, to calculate the probability $$\mathbb P(X\in B)$$ for every $$B\in\mathcal B$$ ($$B$$ does not necessarily belong to $$\mathcal F_X$$), we notice that $$\{X\in B\}=\{X\in B\cap\mathcal X\}$$, and it turns out that $$B\cap\mathcal X\in\mathcal F_X$$. Hence, we can calculate $$\mathbb P(X\in B)$$ by considering $$\mathbb P_X(B\cap\mathcal X)$$.
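The identity $$\mathbb P(X\in B)=\mathbb P_X(B\cap\mathcal X)$$ can be checked on a toy probability space. The two-coin-flip space below is a hypothetical example (not from the text above), chosen small enough to enumerate:

```python
from fractions import Fraction

# Toy probability space: two fair coin flips, all outcomes equally
# likely (a hypothetical example for illustration).
Omega = ["HH", "HT", "TH", "TT"]
P = {w: Fraction(1, 4) for w in Omega}

# Random variable X = number of heads
X = {w: w.count("H") for w in Omega}
range_X = set(X.values())   # induced sample space: {0, 1, 2}

def P_X(E):
    """Induced measure: P_X(E) = P({omega : X(omega) in E})."""
    return sum(P[w] for w in Omega if X[w] in E)

# For a set B not contained in the range, P(X in B) = P_X(B ∩ range_X):
B = {1, 7, -3}                  # 7 and -3 lie outside the range of X
print(P_X(B & range_X))         # Fraction(1, 2): the outcomes HT, TH
```

Intersecting $$B$$ with $$\mathcal X$$ discards the values $$X$$ can never take, exactly as in the argument above.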

Sometimes, even when it is infeasible to list out all sample points in the sample space, we can still determine the probabilities related to the random variable.

A special kind of random variable that is quite useful is the indicator random variable, which is a special case of a discrete random variable:
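As a minimal sketch, assuming the usual definition $$\mathbf 1_A(\omega)=1$$ if $$\omega\in A$$ and $$0$$ otherwise, an indicator random variable looks like this (the die-roll space is a hypothetical example):

```python
# Indicator random variable of an event A (hypothetical example):
# I_A(omega) = 1 if omega is in A, else 0.
Omega = {1, 2, 3, 4, 5, 6}      # one roll of a die
A = {2, 4, 6}                   # event "the roll is even"

def I_A(omega):
    return 1 if omega in A else 0

# I_A takes only the values 0 and 1, and the event {I_A = 1} is
# exactly A itself:
print([I_A(w) for w in sorted(Omega)])          # [0, 1, 0, 1, 0, 1]
print({w for w in Omega if I_A(w) == 1} == A)   # True
```

Because $$\{\mathbf 1_A=1\}=A$$, probabilities of events can be rephrased as probabilities about indicator random variables.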

Cumulative distribution function
For every random variable $$X$$, there is a function associated with it, called the cumulative distribution function (cdf) of $$X$$:

We can see from the cdf in the example above that a cdf is not necessarily continuous: there are discontinuities at the jump points. But notice that at each jump point the cdf takes the value at the top of the jump, by the definition of cdf (the inequality involved also includes equality). Loosely speaking, this suggests that the cdf is right-continuous. However, the cdf is not left-continuous, and hence not continuous, in general.
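The behaviour at a jump can be seen numerically. The fair six-sided die below is a hypothetical example; its cdf $$F(x)=\mathbb P(X\le x)$$ jumps at each integer from 1 to 6, and at the jump it takes the upper value:

```python
from fractions import Fraction

# cdf of a fair six-sided die (hypothetical example): F(x) = P(X <= x)
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def F(x):
    return sum(p for k, p in pmf.items() if k <= x)

# At the jump point x = 3, F takes the value at the TOP of the jump,
# because the defining inequality "X <= x" includes equality:
print(F(2.999), F(3), F(3.001))   # 1/3, then 1/2, 1/2
```

Just left of 3 the cdf is still $$1/3$$, while at 3 it has already jumped to $$1/2$$: right-continuous, but not left-continuous.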

In the following, we will discuss three properties of cdf.

Sometimes, we are only interested in the values $$x$$ such that $$\mathbb P(X=x)\ne 0$$, which are more 'important'. Roughly speaking, these values are the elements of the support of $$X$$, which is defined in the following.

Discrete random variables
Often, for a discrete random variable, we are interested in the probability that the random variable takes a specific value. So, we have a function that gives the corresponding probability for each specific value taken, namely the probability mass function (pmf).
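A pmf can be computed directly by enumerating a (small) sample space. The sketch below uses a hypothetical example, $$X$$ = number of heads in three fair coin tosses, and checks that the pmf sums to 1 over the support:

```python
from fractions import Fraction
from itertools import product
from collections import Counter

# pmf of X = number of heads in three fair coin tosses (hypothetical
# example), obtained by enumerating the 8 equally likely outcomes.
Omega = list(product("HT", repeat=3))
counts = Counter(w.count("H") for w in Omega)
pmf = {x: Fraction(c, len(Omega)) for x, c in counts.items()}

print(pmf[2])             # Fraction(3, 8): HHT, HTH, THH
print(sum(pmf.values()))  # 1: the pmf sums to 1 over the support
```

Here the support is $$\{0,1,2,3\}$$; every other real number has probability zero and is omitted from the dictionary.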

Continuous random variables
Suppose $$X$$ is a random variable and $$S\subseteq\mathbb R$$ is a set (e.g. an interval). Partitioning $$S$$ into small disjoint intervals $$[x_1,x_1+\Delta x_1],\dotsc$$ gives $$\mathbb P(X\in S)=\mathbb P\left(X\in\bigcup_i[x_i,x_i+\Delta x_i]\right)=\sum_i\mathbb P\big(X\in[x_i,x_i+\Delta x_i]\big) =\sum_i\underbrace{\frac{\mathbb P\big(X\in[x_i,x_i+\Delta x_i]\big)}{\Delta x_i}}_{\text{probability per unit}}\cdot\Delta x_i .$$ In particular, the probability per unit can be interpreted as the density of the probability of $$X$$ over the interval. (The higher the density, the more probability is distributed (or allocated) to that interval.)

Taking limits, $$\lim_{\Delta x_i\to 0}\sum_i\underbrace{\frac{\mathbb P\big(X\in[x_i,x_i+\Delta x_i]\big)}{\Delta x_i}}_{\text{density}}\cdot\Delta x_i=\int_S \underbrace{f(x)}_{\text{density}}\,dx, $$ in which, intuitively and non-rigorously, $$f(x)\,dx$$ can be interpreted as the probability over the 'infinitesimal' interval $$[x,x+dx]$$, i.e. $$\mathbb P(X\in[x,x+dx])$$, and $$f(x)$$ can be interpreted as the density of the probability over the 'infinitesimal' interval, i.e. $$\frac{\mathbb P(X\in[x,x+dx])}{dx}$$.
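As a numerical sanity check of this limiting argument, the sketch below uses a hypothetical density $$f(x)=2x$$ on $$[0,1]$$, for which $$\mathbb P(X\in[a,b])=b^2-a^2$$ exactly, and compares the "probability per unit times width" sum with the Riemann sum of the density:

```python
# Hypothetical continuous random variable with density f(x) = 2x on
# [0, 1], so P(X in [a, b]) = b^2 - a^2 exactly.
f = lambda x: 2 * x

def prob(a, b):
    """Exact probability P(X in [a, b]) for this density."""
    return b**2 - a**2

a, b, n = 0.2, 0.5, 100_000
dx = (b - a) / n

# Sum of (probability per unit length) * (interval width):
riemann = sum(prob(a + i*dx, a + (i+1)*dx) / dx * dx for i in range(n))
# Midpoint Riemann sum of the density itself:
midpoint = sum(f(a + (i + 0.5)*dx) * dx for i in range(n))

print(round(riemann, 6), round(midpoint, 6))   # both ~ 0.21
```

Both sums agree with $$\mathbb P(X\in[0.2,0.5])=0.5^2-0.2^2=0.21$$, matching the identity between the partition sum and the integral of the density.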

These motivate us to have the following definition.

The name continuous r.v. comes from the result that the cdf of this kind of r.v. is continuous.

Without further assumptions, the pdf is not unique, i.e. a random variable may have multiple pdfs: for example, we may change the value of the pdf at a single point outside its support to any nonnegative real number without affecting the probabilities (since changing the value of the pdf at a single point does not change any integral of the pdf), and this yields another valid pdf for the random variable. To tackle this, we conventionally set $$f(x)=0$$ for each $$x\notin\operatorname{supp}(X)$$, which makes the pdf unique and the calculations more convenient.

Mixed random variables
You may think, after reading the previous two sections, that a random variable must be either discrete or continuous. Actually, this is wrong: a random variable can be neither discrete nor continuous. An example of such a random variable is the mixed random variable, which is discussed in this section.

An example of a singular random variable is one following the Cantor distribution, whose cdf (sometimes known as the Devil's Staircase) is illustrated by the following graph. The graph pattern keeps repeating when you enlarge the graph.
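The self-repeating pattern can be computed with a short recursion. This is a sketch assuming the standard self-similar construction of the Cantor function: on the left third it is a half-scale copy of itself, on the middle third it is constant at $$1/2$$, and on the right third it is the other half-scale copy:

```python
# Recursive approximation of the Cantor function ("Devil's Staircase"),
# the cdf of the Cantor distribution.
def cantor(x, depth=40):
    """Approximate the Cantor function on [0, 1]."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    if depth == 0:
        return 0.5                    # cutoff for the recursion
    if x < 1/3:                       # left third: half-scale copy
        return cantor(3 * x, depth - 1) / 2
    if x <= 2/3:                      # middle third: flat at 1/2
        return 0.5
    return 0.5 + cantor(3 * x - 2, depth - 1) / 2   # right third

# The cdf climbs from 0 to 1 yet is constant on every removed
# middle-third interval:
print(cantor(0.4), cantor(0.6))       # both 0.5
print(cantor(0.25))                   # ~ 1/3
```

Despite being continuous and increasing from 0 to 1, this cdf is flat outside the Cantor set, which is why the corresponding random variable is neither discrete nor continuous in the sense of the previous sections.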