Probability/Probability Spaces

Terminologies
The name of this chapter, probability space, refers to a mathematical construct that models a random experiment. To be more precise:

Let us first give the definitions related to sample space. The definitions of event space and probability will be discussed in later sections.

Probability interpretations
In this chapter, we will discuss probability mathematically, and we will give an axiomatic and abstract definition of probability (as a function). By an axiomatic definition, we mean defining probability to be a function that satisfies some axioms, called the probability axioms. But such an axiomatic definition does not tell us how we should interpret the term "probability", so the definition is said to be independent of the interpretation of probability. Such independence makes the formal definition always applicable, no matter how you interpret probability.

However, the axiomatic definition does not suggest a way to construct a probability measure (i.e., to assign probabilities to events): it just states that probability is a function satisfying certain axioms, but how can we construct such a function in the first place? In this section, we will discuss two main types of probability interpretation: subjectivism and frequentism, each of which suggests a method of assigning probabilities to events.

Subjectivism
Intuitively and naturally, the probability of an event is often regarded as a numerical measure of the "chance" of the occurrence of the event (that is, how likely the event is to occur). So, it is natural for us to assign a probability to an event based on our own assessment of that "chance". (In order for the probability to be valid according to the axiomatic definition, the assignment needs to satisfy the probability axioms.) But different people may have different assessments of the "chance", depending on their personal opinions. So, we can see that such an interpretation of probability is somewhat subjective, since different people may assign different probabilities to the same event. Hence, we call this interpretation of probability subjectivism.

The main issue with subjectivism is the lack of objectivity, since different probabilities can be assigned to the same event based on personal opinion. We may then have difficulty choosing which of these probabilities should be used for that event. To mitigate this lack of objectivity, we may adjust our degree of belief in an event from time to time as more data are observed, using Bayes' theorem (which will be discussed in a later chapter), so that the value is assigned in a more objective way. However, even after the adjustment, the assignment is still not fully objective, since the adjusted value (known as the posterior probability) still depends on the initial value (known as the prior probability), which is assigned subjectively.
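The kind of adjustment described above can be sketched numerically. The following is a minimal illustration (not this chapter's material, which comes later), assuming a hypothetical coin that is either fair or biased towards heads; the hypothesis names, the 4/5 likelihood, and the 1/2 prior are all invented for illustration:

```python
from fractions import Fraction

# Hypothetical prior degrees of belief about a coin (assigned subjectively).
prior = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}
# Probability of observing heads under each hypothesis (assumed values).
likelihood_heads = {"fair": Fraction(1, 2), "biased": Fraction(4, 5)}

# Bayes' theorem: posterior is proportional to prior times likelihood,
# renormalized so the posterior values sum to one.
unnormalized = {h: prior[h] * likelihood_heads[h] for h in prior}
total = sum(unnormalized.values())
posterior = {h: unnormalized[h] / total for h in unnormalized}

print(posterior)  # {'fair': Fraction(5, 13), 'biased': Fraction(8, 13)}
```

Observing heads shifts belief towards the "biased" hypothesis, but the posterior still depends on the subjectively chosen prior.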

Frequentism
Another probability interpretation, which is objective, is called frequentism. We denote by $$n(E)$$ the number of occurrences of an event $$E$$ in $$n$$ repetitions of an experiment. (An experiment is any action or process with an outcome that is subject to uncertainty or randomness.) Then, we call $$\frac{n(E)}{n}$$ the relative frequency of the event $$E$$. Intuitively, we expect that the relative frequency fluctuates less and less as $$n$$ gets larger and larger, and approaches a constant limiting value (we call this the limiting relative frequency) as $$n$$ tends to infinity, i.e., the limiting relative frequency is $$\lim_{n\to \infty}\frac{n(E)}{n}$$. It is thus natural to take the limiting relative frequency as the probability of the event $$E$$, and this is exactly the definition of probability in frequentism. In particular, the existence of such a limiting relative frequency is an assumption, or axiom, in frequentism. (As a side result, when $$n$$ is large enough, the relative frequency of the event $$E$$ may be used to approximate the probability of the event $$E$$.)
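The stabilization of the relative frequency can be illustrated with a small simulation. This is only a sketch, assuming a fair coin simulated with Python's `random` module (the function name `relative_frequency` is ours):

```python
import random

def relative_frequency(event, n, seed=0):
    """Toss a simulated fair coin n times and return the relative
    frequency n(E)/n of the event E (a predicate on the outcome)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return sum(1 for _ in range(n) if event(rng.choice("HT"))) / n

# The fluctuation shrinks as n grows, suggesting a limiting value of 1/2
# for the event "the toss lands heads".
for n in (10, 100, 10_000):
    print(n, relative_frequency(lambda outcome: outcome == "H", n))
```

With 10 tosses the relative frequency can easily be far from 1/2, while with 10,000 tosses it is typically within a few hundredths of it.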

However, an issue of frequentism is that it may be infeasible to conduct experiments many times for some events. Hence, for those events, no probability can be assigned to them, and this is clearly a limitation for frequentism.

Because of these issues, we will instead use a modern axiomatic and abstract approach to define probability, which was suggested by the Russian mathematician Andrey Nikolaevich Kolmogorov in 1933. By an axiomatic definition, we mean defining probability quite broadly and abstractly as something that satisfies certain axioms (called probability axioms). These probability axioms are the mathematical foundation and basis of modern probability theory.

Probability axioms
Since we want to use the probability measure $$\mathbb P$$ to assign a probability $$\mathbb P(E)$$ to every event $$E$$ in the sample space, it seems natural for us to set the domain of the probability measure $$\mathbb P$$ to be the set containing all subsets of $$\Omega$$, i.e., the power set of $$\Omega$$, $$\mathcal P(\Omega)$$. Unfortunately, the situation is not that simple, and there are some technical difficulties if we set the domain like this when the sample space $$\Omega$$ is uncountable.

This is because the power set of such an uncountable sample space includes some "badly behaved" sets, which cause problems when assigning probabilities to them. (Here, we will not discuss those sets and these technical difficulties in detail.) Thus, instead of setting the domain of the probability measure to be $$\mathcal P(\Omega)$$, we set the domain to be a $$\sigma$$-algebra (sigma-algebra) containing some "sufficiently well-behaved" events:

We have seen two examples of $$\sigma$$-algebras in the example above. Often, the "smallest" $$\sigma$$-algebra $$\{\varnothing,\Omega\}$$ is not chosen to be the domain of the probability measure, since we are usually interested in events other than $$\varnothing$$ and $$\Omega$$.

The "largest" $$\sigma$$-algebra, on the other hand, contains every event, but we may not be interested in some of them. In particular, we are usually interested in events that are "well-behaved", rather than "badly behaved" events (indeed, it may even be impossible to assign probabilities to the latter properly; such events are called non-measurable sets).

Fortunately, when the sample space $$\Omega$$ is countable, every set in $$\mathcal P(\Omega)$$ is "well-behaved", so we can take this power set to be the $$\sigma$$-algebra for the domain of the probability measure.

However, when the sample space $$\Omega$$ is uncountable, even though the power set $$\mathcal P(\Omega)$$ is a $$\sigma$$-algebra, it contains "too many" events; in particular, it even includes some "badly behaved" events. Therefore, we will not choose the power set to be the domain of the probability measure. Instead, we choose a $$\sigma$$-algebra that includes only "well-behaved" events to be the domain, so that we are able to assign a probability properly to every event in that $$\sigma$$-algebra. In particular, the "well-behaved" events are often the events of interest, so all events of interest are contained in that $$\sigma$$-algebra, that is, the domain of the probability measure.
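On a finite sample space, the defining conditions of a $$\sigma$$-algebra (it contains $$\Omega$$, is closed under complement, and is closed under countable union, which reduces to finite union here) can be checked by brute force. The following is a sketch with hypothetical collections of subsets; the function name is ours:

```python
def is_sigma_algebra(omega, collection):
    """Check the sigma-algebra conditions on a finite sample space:
    contains Omega, closed under complement, closed under union
    (countable union reduces to finite union in the finite case)."""
    F = {frozenset(A) for A in collection}
    omega = frozenset(omega)
    if omega not in F:
        return False
    if any(omega - A not in F for A in F):  # closure under complement
        return False
    return all(A | B in F for A in F for B in F)  # closure under union

omega = {1, 2, 3, 4}
smallest = [set(), omega]                 # the "smallest" sigma-algebra
coarse = [set(), {1, 2}, {3, 4}, omega]   # a sigma-algebra in between
not_closed = [set(), {1}, omega]          # missing the complement {2, 3, 4}

print(is_sigma_algebra(omega, smallest))    # True
print(is_sigma_algebra(omega, coarse))      # True
print(is_sigma_algebra(omega, not_closed))  # False
```

The power set of a finite $$\Omega$$ always passes this check; the point of the chapter is that for uncountable $$\Omega$$ the power set is deliberately not used.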

To motivate the probability axioms, we consider some properties that the "probability" in frequentism (as a limiting relative frequency) possesses:
 * 1) The limiting relative frequency must be nonnegative. (We call this property nonnegativity.)
 * 2) The limiting relative frequency of the whole sample space $$\Omega$$ ($$\Omega$$ is also an event) must be 1 (since by definition $$\Omega$$ contains all sample points, this event must occur in every repetition). (We call this property normalization.)
 * 3) If the events $$E_1,E_2,\dotsc$$ are pairwise disjoint (i.e., $$E_i\cap E_j=\varnothing$$ for every $$i,j$$ with $$i\ne j$$), then the limiting relative frequency of the event $$\bigcup_{i=1}^{\infty}E_i\overset{\text{ def }}=E_1\cup E_2\cup\dotsb$$ (a union of subsets of $$\Omega$$ is a subset of $$\Omega$$, so it can be called an event) is $$ \begin{align} \lim_{n\to \infty}\frac{n(\bigcup_{i=1}^{\infty}E_i)}{n} &=\lim_{n\to \infty}\frac{n(E_1)+n(E_2)+\dotsb}{n}&(\text{the events are pairwise disjoint})\\ &=\lim_{n\to \infty}\frac{n(E_1)}{n}+\lim_{n\to \infty}\frac{n(E_2)}{n}+\dotsb&(\text{every limit exists by the axiom in frequentism})\\ &=\sum_{i=1}^{\infty}\lim_{n\to \infty}\frac{n(E_i)}{n}, \end{align} $$ which is the sum of the limiting relative frequencies of the events $$E_1,E_2,\dotsc$$. (We call this property countable additivity.)

It is thus very natural to set the probability axioms to be the three properties mentioned above:

Using the probability axioms alone, we can prove many well-known properties of probability.

Basic properties of probability
Let us start the discussion with some simple properties of probability.

Using this result, we can obtain the finite additivity of probability from the countable additivity of probability:

Finite additivity makes the proofs of some of the following results simpler.
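These basic properties can be verified numerically for a concrete measure. A sketch, assuming a fair six-sided die with equally likely outcomes, so that $$\mathbb P(E)=|E|/6$$:

```python
from fractions import Fraction

# A fair six-sided die: every outcome equally likely, so P(E) = |E| / 6.
omega = frozenset(range(1, 7))

def P(E):
    return Fraction(len(set(E)), 6)

A = {1, 2, 3}
B = {3, 4}

# Complement rule: P(complement of A) = 1 - P(A)
assert P(omega - A) == 1 - P(A)
# Monotonicity: {1, 2} is a subset of A, so P({1, 2}) <= P(A)
assert P({1, 2}) <= P(A)
# Finite additivity for the disjoint events {1, 2} and {4, 5}
assert P({1, 2} | {4, 5}) == P({1, 2}) + P({4, 5})
# P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
assert P(A | B) == P(A) + P(B) - P(A & B)

print("all basic properties hold for this measure")
```

Of course, a numerical check for one measure is not a proof; the point of this section is that these properties follow from the axioms alone, for every probability measure.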

Constructing a probability measure
As we have said, the axiomatic definition does not suggest a way to construct a probability measure. In fact, even for the same experiment, there can be many ways to construct a probability measure that satisfies the above probability axioms if insufficient information is provided:

However, we have previously mentioned that we may assign probabilities to events subjectively (as in subjectivism), or according to their limiting relative frequencies (as in frequentism). Through these two probability interpretations, we may provide some background information for a random experiment, by assigning probabilities to some of the events before constructing the probability measure, to the extent that there is exactly one way to construct the probability measure. Consider the coin tossing example again:

In general, it is not necessary to assign a probability to every event in the event space as background information for us to be able to construct the probability measure in exactly one way. Consider the following example.

We can see from this example that to provide sufficient background information, to the extent that the probability measure can be constructed in exactly one way, we just need the probability of each of the singleton events (these probabilities should be nonnegative and sum to one to satisfy the probability axioms). After that, we can calculate the probability of every other event in the event space, and hence construct the only possible probability measure.
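This recipe is easy to carry out in code. A sketch, assuming a hypothetical biased die; the function name `make_probability_measure` is ours:

```python
def make_probability_measure(singleton_probs):
    """Given the probabilities of the singleton events of a finite sample
    space (nonnegative, summing to one), return the unique probability
    measure: P(E) is the sum of the singleton probabilities over E."""
    assert all(v >= 0 for v in singleton_probs.values())
    assert abs(sum(singleton_probs.values()) - 1) < 1e-12

    def P(event):
        return sum(singleton_probs[outcome] for outcome in event)

    return P

# Background information for a hypothetical biased die: outcome 6 is
# five times as likely as each of the other outcomes.
p = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}
P = make_probability_measure(p)

print(P({2, 4, 6}))  # probability of an even outcome, 0.7
```

Here the event space is the power set of $$\{1,\dotsc,6\}$$, and every event's probability is determined by the six singleton probabilities.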

This is true in general when the sample space is countable:
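The countably infinite case works the same way, except that probabilities of events become (possibly infinite) series. As an illustration under assumed singleton probabilities, take $$\Omega=\{1,2,3,\dotsc\}$$ with $$\mathbb P(\{n\})=(1/2)^n$$, which are nonnegative and sum to one; the probability of an even outcome is $$\sum_{k\ge 1}(1/2)^{2k}=1/3$$. A numerical sketch, truncating the series:

```python
# Singleton probabilities P({n}) = (1/2)^n on Omega = {1, 2, 3, ...}.
# Powers of 1/2 are exact in binary floating point, and the truncation
# error beyond n = 199 is negligible.
total = sum(0.5 ** n for n in range(1, 200))
print(total)  # close to 1: the singleton probabilities sum to one

# Probability of an even outcome: sum of (1/2)^n over even n, which is
# the geometric series 1/4 + 1/16 + ... = 1/3.
even = sum(0.5 ** n for n in range(2, 200, 2))
print(even)  # close to 1/3
```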

The following is an important special case for the above theorem.

More advanced properties of probability
Recall the inclusion-exclusion principle in combinatorics. We have similar results for probability:
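The inclusion-exclusion formula for probability can be checked numerically on equally likely outcomes. A sketch with a hypothetical twelve-sided die, comparing the formula for three events against the probability of the union computed directly:

```python
from fractions import Fraction

def prob(event, omega):
    """Probability under equally likely outcomes: |E| / |Omega|."""
    return Fraction(len(event), len(omega))

omega = set(range(1, 13))  # a roll of a twelve-sided die
A = {n for n in omega if n % 2 == 0}  # even outcomes
B = {n for n in omega if n % 3 == 0}  # multiples of 3
C = {n for n in omega if n % 4 == 0}  # multiples of 4

# Direct computation of P(A ∪ B ∪ C):
lhs = prob(A | B | C, omega)
# Inclusion-exclusion for three events:
rhs = (prob(A, omega) + prob(B, omega) + prob(C, omega)
       - prob(A & B, omega) - prob(A & C, omega) - prob(B & C, omega)
       + prob(A & B & C, omega))

print(lhs, rhs)  # both are 2/3
```

Both sides equal $$8/12=2/3$$, since $$A\cup B\cup C=\{2,3,4,6,8,9,10,12\}$$.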

The following is a classical example for demonstrating the application of inclusion-exclusion principle.