Calculus/Calculus on matrices

This section gives an overview of how calculus can be applied to matrices. A general understanding of linear algebra is assumed - you're expected to be familiar with the common ways of manipulating matrices.

The problem
Consider an n by n matrix $$\textbf{A}$$ and an n by 1 vector $$\textbf{x}$$. How can we, for instance, find $$\frac{d}{d\textbf{x}} \textbf{A} \textbf{x}$$? If you were to naively apply single-variable calculus rules, a plausible answer would be

$$\frac{d}{d\textbf{x}} \textbf{A} \textbf{x} = \textbf{A}$$

After all, the corresponding scalar form of the problem $$\frac{d}{dx} (ax)$$ would indeed be a. And indeed the answer to the vector form is $$\textbf{A}$$. But now consider the following problem: $$\frac{d}{d\textbf{A}} \textbf{A} \textbf{x}$$. If you were to take the scalar form, you'd probably think that the answer would be $$\textbf{x}$$. But that isn't right - the answer is actually $$\textbf{x}^T$$, where T refers to the transpose of the vector x.

The purpose of this section is to scratch the surface of this beautiful field - because it's not something that the average Calculus 3 or linear algebra course at university will teach - yet it has its own quirks. And what is it used for? Matrix calculus is widely used in machine learning and also in other fields, such as computational finance. It can also help us avoid having to take (potentially nasty) Lagrangians and effectively reduce the problem to a single-variable scenario!

Derivative with respect to a vector
In this section, we consider problems that involve differentiating with respect to a vector x. As above, we assume that x is a column vector.

One way to think about this problem is to reduce it to a problem of scalars. Notice that we can consider x to be a collection of scalars $$\textbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}$$. Now take the individual partial derivatives $$\frac{\partial}{\partial x_i}$$ for $$1 \leq i \leq n$$. Finally, put them together. We're essentially finding $$\nabla f(\textbf{x})$$ after all - the steps are the same (only that before, x had 2 or 3 components, corresponding to the i, j and k coordinate directions).

So let's try this on the example above. Write $$\textbf{f} = \textbf{A} \textbf{x}$$, so that the k-th component is $$f_k = \sum_{j = 1}^{n} a_{kj} x_j$$. We want to find, for all $$1 \leq i \leq n$$,

$$\frac{\partial}{\partial x_i} (\textbf{A} \textbf{x})$$

... which is the i-th column of A, because $$\frac{\partial f_k}{\partial x_i} = a_{ki}$$ for every k.

Now what would you get if you combined all the partial derivatives? Just as the components of $$\nabla f(\textbf{x})$$ are collected into one object, place the n column vectors side by side: $$\begin{pmatrix} \frac{\partial \textbf{f}}{\partial x_1} & \frac{\partial \textbf{f}}{\partial x_2} & \cdots & \frac{\partial \textbf{f}}{\partial x_n} \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}$$. This is just A! Indeed, that's why $$\frac{d}{d\textbf{x}} \textbf{A} \textbf{x} = \textbf{A}$$.
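The claim that $$\frac{d}{d\textbf{x}} \textbf{A} \textbf{x} = \textbf{A}$$ can be checked numerically. The following is a minimal sketch in plain Python, not a definitive implementation: the helper names `matvec` and `jacobian_wrt_x`, the step size `h` and the sample values of A and x are all assumptions made for illustration. It approximates each partial derivative with a central difference and compares the result to A.

```python
def matvec(A, x):
    """Return the product A x for a matrix A (list of rows) and a vector x (list)."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def jacobian_wrt_x(f, x, h=1e-6):
    """Approximate J[k][i] = d f_k / d x_i using central differences."""
    m = len(f(x))
    J = [[0.0] * len(x) for _ in range(m)]
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        f_plus, f_minus = f(x_plus), f(x_minus)
        for k in range(m):
            # Column i of J holds the partial derivative with respect to x_i.
            J[k][i] = (f_plus[k] - f_minus[k]) / (2 * h)
    return J


# Illustrative values only (assumed, not taken from the text).
A = [[2.0, -1.0, 0.5],
     [0.0, 3.0, 1.0],
     [4.0, 1.0, -2.0]]
x = [1.0, 2.0, 3.0]

J = jacobian_wrt_x(lambda v: matvec(A, v), x)
max_error = max(abs(J[k][i] - A[k][i]) for k in range(3) for i in range(3))
print("largest difference between the numerical Jacobian and A:", max_error)
```

Because Ax is linear in x, the central difference is exact up to rounding, so the reported difference should be tiny regardless of the choice of x.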

A first step towards matrices
Now we return to the other problem: $$\frac{d}{d\textbf{A}} \textbf{A} \textbf{x}$$.

Let's assume that A is a 2 by 2 matrix, and represent A as $$\textbf{A} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$$. Using the same notation for x, perform matrix multiplication:

$$\textbf{A} \textbf{x} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} a_{11} \times x_1 + a_{12} \times x_2 \\ a_{21} \times x_1 + a_{22} \times x_2 \end{pmatrix}$$

Take the partial derivatives of this product with respect to each of the elements of A (this is equivalent to finding the Jacobian). For $$1 \leq i, j \leq 2$$, find $$\frac{\partial (\textbf{A} \textbf{x})}{\partial a_{ij}}$$:

$$\frac{\partial (\textbf{A} \textbf{x})}{\partial a_{11}} = \begin{pmatrix} x_1 \\ 0 \end{pmatrix} $$, $$\frac{\partial (\textbf{A} \textbf{x})}{\partial a_{12}} = \begin{pmatrix} x_2  \\ 0 \end{pmatrix} $$, $$\frac{\partial (\textbf{A} \textbf{x})}{\partial a_{21}} = \begin{pmatrix} 0  \\ x_1 \end{pmatrix} $$ and $$\frac{\partial (\textbf{A} \textbf{x})}{\partial a_{22}} = \begin{pmatrix} 0  \\ x_2 \end{pmatrix} $$
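These four partial derivatives can be reproduced by nudging one entry of A at a time and watching how Ax changes. The sketch below does this in plain Python. The helper name `partial_wrt_entry`, the step size and the sample numbers are assumptions for illustration, and indices in the code are 0-based whereas the text is 1-based.

```python
def matvec(A, x):
    """Return the product A x for a matrix A (list of rows) and a vector x (list)."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def partial_wrt_entry(A, x, i, j, h=1e-6):
    """Approximate d(Ax)/d a_ij (0-based i, j) with a central difference."""
    A_plus = [list(row) for row in A]
    A_minus = [list(row) for row in A]
    A_plus[i][j] += h
    A_minus[i][j] -= h
    f_plus, f_minus = matvec(A_plus, x), matvec(A_minus, x)
    return [(p - m) / (2 * h) for p, m in zip(f_plus, f_minus)]


# Illustrative values only (assumed): here x_1 = 5 and x_2 = 7.
A = [[1.0, 2.0], [3.0, 4.0]]
x = [5.0, 7.0]

partials = {(i, j): partial_wrt_entry(A, x, i, j)
            for i in range(2) for j in range(2)}
for (i, j), vec in partials.items():
    # Print with 1-based indices to match the notation in the text.
    print(f"d(Ax)/d a_{i + 1}{j + 1} is approximately", [round(v, 4) for v in vec])
```

Each result is a 2 by 1 vector with a single non-zero entry, matching the pattern in the formulas: changing $$a_{ij}$$ only affects the i-th component of the product, and does so at rate $$x_j$$.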

But then what do you do with that? How do you "combine" the result? Clearly, we are missing something.

Dimensions of ∇f
Let's take a step back and ask the question: given a vector $$\textbf{f} = \textbf{A} \textbf{x} $$, what should be the dimension of $$\nabla \textbf{f} $$?

Consider the example above. We have two variables: A with 4 elements (2x2) and x with 2 elements, and we want to find $$\frac{d \textbf{f}}{d\textbf{A}}$$. It is straightforward to show that f is a column vector with $$\textbf{f} = \binom{f_1}{f_2}$$, where $$\binom{f_1}{f_2} = \begin{pmatrix} a_{11} \times x_1 + a_{12} \times x_2 \\ a_{21} \times x_1 + a_{22} \times x_2 \end{pmatrix}$$. So we need to consider the derivatives of $$f_1$$ and $$f_2$$ separately. Each of $$\frac{df_1}{d\textbf{A}}$$ and $$\frac{df_2}{d\textbf{A}}$$ is a 2x2 matrix, whose entries are the partial derivatives with respect to each element of the matrix A.

How many elements would $$\frac{d \textbf{f}}{d\textbf{A}}$$ have in total? For each of the two scalars that comprise f, there are four partial derivatives. This results in (2 * (2 * 2)) = 8 elements in total - actually a tensor (which can be thought of as a higher-order matrix). This is where things start to get messy, but fortunately this is a simple enough example.

Getting a solution
So let's use the above observation to solve the problem.

First consider $$\frac{df_1}{d\textbf{A}}$$, where $$f_1 = a_{11} \times x_1 + a_{12} \times x_2 $$. Compute the individual partial derivatives: $$\frac{\partial f_1}{\partial a_{11}} = x_1, \frac{\partial f_1}{\partial a_{12}} = x_2, \frac{\partial f_1}{\partial a_{21}} = 0$$ and $$\frac{\partial f_1}{\partial a_{22}} = 0$$. Similarly, $$\frac{\partial f_2}{\partial a_{11}} = 0, \frac{\partial f_2}{\partial a_{12}} = 0, \frac{\partial f_2}{\partial a_{21}} = x_1$$ and $$\frac{\partial f_2}{\partial a_{22}} = x_2$$.

Now, how do we combine these? The issue is that $$\nabla \textbf{f} $$ is a tensor, but we can display only a 2D representation using matrices. So let's take the "face" that corresponds to $$\nabla f_1$$. We can arrange the partial derivatives we found above in a matrix laid out like A: $$\nabla f_1 = \begin{pmatrix} x_1 & x_2 \\ 0 & 0 \end{pmatrix}$$. Similarly, $$\nabla f_2 = \begin{pmatrix} 0 & 0 \\ x_1 & x_2 \end{pmatrix}$$. What do we observe? In each face, the only non-zero row is $$\begin{pmatrix} x_1 & x_2 \end{pmatrix}$$, which is simply $$\textbf{x}^T$$ (notice the change from column to row form), sitting in the row that matches the component of f. That is the sense in which $$\frac{d}{d\textbf{A}} \textbf{A} \textbf{x} = \textbf{x}^T$$.
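The whole tensor can be assembled directly from the pattern in these partial derivatives: $$\frac{\partial f_k}{\partial a_{ij}}$$ equals $$x_j$$ when i = k and 0 otherwise. The following sketch builds it as a nested list in plain Python. The function name `derivative_tensor` and the sample vector are assumptions for illustration, and the indices in the code are 0-based.

```python
def derivative_tensor(x, m):
    """Return T with T[k][i][j] = d f_k / d a_ij for f = A x, A being m by len(x).

    Since f_k = sum_j a_kj * x_j, the entry is x_j when i == k and 0 otherwise.
    """
    n = len(x)
    return [[[x[j] if i == k else 0.0 for j in range(n)]
             for i in range(m)]
            for k in range(m)]


# Illustrative values only (assumed): here x_1 = 5 and x_2 = 7.
x = [5.0, 7.0]
T = derivative_tensor(x, m=2)

# Count the entries: 2 components of f, each with a 2 by 2 face.
num_elements = sum(len(row) for face in T for row in face)
print("total elements:", num_elements)

for k, face in enumerate(T, start=1):
    print(f"face for f_{k}:", face)
```

With a 2 by 2 matrix the tensor has 2 faces of 2 by 2 entries each, and face k contains the entries of x in row k with zeros everywhere else.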

In practice
In practice, you won't have to do all this work every time you want to find a derivative involving a matrix. Instead, there are many matrix cookbooks (sometimes also called cheat sheets), which give tables of common derivatives with respect to vectors and matrices, and that's what you're likely to need in practice. Here's one.

An example
Consider the Markowitz problem. Assume that we have n stocks, and we want to assign a weight $$w_i$$ to each of them. The covariance between stocks i and j is $$\sigma_{ij}$$, for all $$1 \leq i, j \leq n$$. Suppose we want to solve this problem in the traditional way. Then the optimisation problem is to minimise $$\sum_{i = 1}^{n} {\sum_{j = 1}^n {w_i \sigma_{ij} w_j}}$$ subject to constraints which, while important, we won't discuss here.

Solving this problem is likely to be messy given the double summation. Let's try matrix calculus. Let w be the n by 1 vector of weights and $$\Sigma$$ be the n by n covariance matrix with entries $$\sigma_{ij}$$. The above problem can be rewritten as minimising $$\textbf{w}^T \Sigma \textbf{w}$$, and all you need to do is to take derivatives with respect to w! A matrix calculus cookbook gives, for a symmetric matrix $$\Sigma$$ (which a covariance matrix always is), $$\frac{d}{d \textbf{w}} \textbf{w}^T \Sigma \textbf{w} = 2 \textbf{w}^T \Sigma$$, which is much more elegant than trying to compute the individual partial derivatives.
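The cookbook identity can be sanity-checked on a small example. The sketch below, in plain Python, evaluates the objective as the double sum, differentiates it numerically, and compares the result with $$2 \textbf{w}^T \Sigma$$ written out entry by entry. The covariance matrix `S`, the weights `w` and the helper names are hypothetical values chosen only for illustration, and the constraints of the Markowitz problem are ignored here.

```python
def quadratic_form(w, S):
    """Return w^T S w, computed as the double sum over i and j."""
    n = len(w)
    return sum(w[i] * S[i][j] * w[j] for i in range(n) for j in range(n))


def numerical_gradient(f, w, h=1e-6):
    """Approximate the partial derivatives of f at w with central differences."""
    grad = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += h
        w_minus[i] -= h
        grad.append((f(w_plus) - f(w_minus)) / (2 * h))
    return grad


# Hypothetical symmetric covariance matrix and weights, for illustration only.
S = [[0.04, 0.01, 0.00],
     [0.01, 0.09, 0.02],
     [0.00, 0.02, 0.16]]
w = [0.5, 0.3, 0.2]

numeric = numerical_gradient(lambda v: quadratic_form(v, S), w)
# Closed form 2 w^T S: entry j is 2 * sum_i w_i * S_ij.
closed_form = [2 * sum(w[i] * S[i][j] for i in range(3)) for j in range(3)]
max_error = max(abs(a - b) for a, b in zip(numeric, closed_form))
print("largest difference between numerical and closed-form gradient:", max_error)
```

The agreement relies on `S` being symmetric. For a general square matrix the derivative would be $$\textbf{w}^T (\Sigma + \Sigma^T)$$ instead.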