A First Course to Network Coding/Information Measure and Handy Inequations

Abstract
This chapter introduces the concepts and basic properties of Shannon's information measures - entropy, mutual information, and relative entropy. Inequalities involving these measures are also introduced, to reinforce the reader's understanding and enlarge their toolbox for further study.

Introduction
To understand network coding, one needs background knowledge in information theory. Among the essential concepts of information theory, the fundamental ones are Shannon's information measures - entropy, mutual information, and relative entropy. Rather than considering the semantic aspects of information, entropy measures the uncertainty of the outcome of a discrete random variable. Mutual information, as its name suggests, is how much information one d.r.v. contains about another, or how much of one d.r.v.'s uncertainty the other resolves. Relative entropy can be seen as a generalization of mutual information: it measures the 'distance' between two probability distributions. After these three basic concepts are presented, we introduce inequalities that further characterize the properties of the information measures.

To characterize uncertainty, consider a coin flip. Suppose the coin is not fair: heads has a 0.8 chance of turning up, while tails has only a 0.2 chance. Then your uncertainty is low, because you are quite certain that heads will turn up. If instead it is a fair coin, you are more uncertain, because heads and tails are equally likely. So if you were to use a mathematical expression to characterize the uncertainty of an outcome, it should take the form of an expectation and reach its maximum when all outcomes are equally likely. Moreover, if we use a die instead of a coin, or a roulette wheel instead of a die, the uncertainty should grow as the number of possible outcomes grows. That is, the expression should increase with the number of choices, i.e. the size of the alphabet. Lastly, if you roll two dice at once, the uncertainty of a '3' and a '4' turning up should be the same as that of rolling one die and then the other. So if sequential choices yield the same outcome as a single combined choice, the expressions for the sequential choices should combine to equal that of the single choice. These three intuitive properties serve as criteria for choosing such an expression.

Definitions
Discrete Random Variable Let $$X$$ be a discrete random variable with alphabet $$\mathcal{X}$$ and PMF $$P_X(x) = \Pr(X = x), x \in \mathcal{X}.$$ Hereon, $$P_X(x)$$ is often written as $$P(x)$$ for convenience, and 'discrete random variable' is abbreviated as d.r.v.

Entropy The entropy of a d.r.v. is defined as:

$$H(X) = -\sum_{x} P(x)\log_b P(x)$$, in other words, the expectation of $$\log_b \frac{1}{P(X)}$$.

The base b specifies the unit of entropy. If b = 2, the unit is the bit (binary unit; note the difference between this and the other 'bit' -- binary digit); if b = e, it is the nat (natural unit). Unless specified otherwise, we use bits. This form, an expectation of a negative logarithm, was tailor-made by Shannon to satisfy the three criteria described in the last section.
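The definition can be checked against the intuition from the coin and die examples; a minimal sketch in Python (the function name `entropy` is ours, not from any library):

```python
from math import log2

def entropy(pmf):
    """Entropy in bits: H(X) = -sum P(x) log2 P(x); zero-probability terms contribute 0."""
    return -sum(p * log2(p) for p in pmf if p > 0)

entropy([0.5, 0.5])   # fair coin: 1.0 bit, the maximum for two outcomes
entropy([0.8, 0.2])   # biased coin: ~0.722 bits, less uncertain
entropy([1/6] * 6)    # fair die: log2(6) bits, more outcomes -> more uncertainty
```

The biased coin indeed yields lower entropy than the fair coin, and the fair die yields more than either, matching the first two criteria.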

Joint Entropy Having defined the entropy of a single d.r.v., the definition for a pair of d.r.v.s is immediate:

$$H(X,Y) = -\sum_{(x, y)} P(x,y)\log P(x,y)$$, where the sum runs over all pairs $$(x, y)$$ with $$P(x,y) > 0$$.

Conditional Entropy The entropy of a d.r.v. conditioned on a particular value of another d.r.v. is defined as:

$$H(X|Y = y) = -\sum_{x} P(x|y)\log P(x|y)$$.

Averaging over $$y$$ gives the conditional entropy of $$X$$ given $$Y$$: $$H(X|Y) = \sum_{y} P(y)H(X|Y=y) = -\sum_{x,y} P(x,y)\log P(x|y)$$.
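As a sketch, the conditional entropy $$H(X|Y)$$ - the average of $$H(X|Y=y)$$ weighted by $$P(y)$$ - can be computed directly from a joint PMF. The distribution below is a hypothetical example chosen for illustration:

```python
from math import log2

# Hypothetical joint PMF of (X, Y); keys are (x, y) pairs, values sum to 1.
P_XY = {('a', 0): 0.25, ('a', 1): 0.25, ('b', 0): 0.5, ('b', 1): 0.0}

def conditional_entropy(p_xy):
    """H(X|Y) = -sum_{x,y} P(x,y) log2 P(x|y), skipping zero-probability pairs."""
    p_y = {}
    for (x, y), p in p_xy.items():          # marginalize out x to get P(y)
        p_y[y] = p_y.get(y, 0.0) + p
    return -sum(p * log2(p / p_y[y])        # P(x|y) = P(x,y) / P(y)
                for (x, y), p in p_xy.items() if p > 0)
```

Here observing Y = 1 removes all uncertainty about X (only 'a' remains possible), so only the Y = 0 branch contributes to the average.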

Now we can express the third criterion with conditional entropy:

$$H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$$. Rolling two dice at once or rolling one after the other should carry the same uncertainty. This can be extended to more than two d.r.v.s:

$$H(X_1,X_2,\dots,X_N) = \sum_{n=1}^N H(X_n|X_1,X_2,\dots,X_{n-1})$$. This is called the chain rule of entropy.
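The two-variable chain rule can be verified numerically; a minimal sketch, assuming an illustrative joint PMF of our own choosing:

```python
from math import log2

def H(pmf):
    """Entropy (bits) of a PMF given as a dict mapping outcomes to probabilities."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

# Hypothetical joint PMF P(x, y); values are illustrative and sum to 1.
P = {(0, 0): 0.125, (0, 1): 0.375, (1, 0): 0.375, (1, 1): 0.125}

P_X = {}                                    # marginal P(x)
for (x, y), p in P.items():
    P_X[x] = P_X.get(x, 0.0) + p

# H(Y|X) = -sum_{x,y} P(x,y) log2 P(y|x), with P(y|x) = P(x,y) / P(x)
H_Y_given_X = -sum(p * log2(p / P_X[x]) for (x, y), p in P.items() if p > 0)

# Chain rule: H(X,Y) = H(X) + H(Y|X)
assert abs(H(P) - (H(P_X) + H_Y_given_X)) < 1e-9
```

The same check with the roles of X and Y swapped confirms the symmetric form $$H(X,Y) = H(Y) + H(X|Y)$$.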

Convex Function

Properties of Entropy
Non-negativity

Inequalities
Jensen's inequality