Statistics/Numerical Methods/Quantile Regression

Quantile Regression, as introduced by Koenker and Bassett (1978), seeks to complement classical linear regression analysis. Central to it is the extension of "ordinary quantiles from a location model to a more general class of linear models in which the conditional quantiles have a linear form" (Buchinsky (1998), p. 89). In Ordinary Least Squares (OLS) the primary goal is to determine the conditional mean of a random variable $$Y$$, given some explanatory variable $$x_i$$, i.e. the expected value $$E[Y|x_i]$$. Quantile Regression goes beyond this and enables one to pose such a question at any quantile of the conditional distribution function. The following seeks to introduce the reader to the ideas behind Quantile Regression. First, the issue of quantiles is addressed, followed by a brief outline of least squares estimators, focusing on Ordinary Least Squares. Finally, Quantile Regression is presented, along with an example utilizing the Boston Housing data set.

What are Quantiles?
Gilchrist (2001, p. 1) describes a quantile as "simply the value that corresponds to a specified proportion of an (ordered) sample of a population". For instance, a very commonly used quantile is the median $$M$$, which corresponds to a proportion of 0.5 of the ordered data, i.e. to a quantile with a probability of 0.5 of occurrence. Quantiles thereby mark the boundaries of equally sized, consecutive subsets. (Gilchrist, 2001)

More formally stated, let $$Y$$ be a continuous random variable with a distribution function $$F_Y(y)$$ such that

$$ (1) F_Y(y) = P (Y\leq {y}) = \tau$$

which states that for the distribution function $$F_Y(y)$$ one can determine, for a given value $$y$$, the probability $$\tau$$ of occurrence. When dealing with quantiles, one wants to do the opposite: determine, for a given probability $$\tau$$, the corresponding value $$y$$ of the sample data set. The $$\tau^{th}$$-quantile is thus the value $$y_\tau$$ that accumulates probability $$\tau$$:

$$ (2) F_Y(y_\tau) = \tau$$

Another way of expressing the $$\tau^{th}$$-quantile mathematically is the following:

$$ (3) y_\tau = F_{Y}^{-1}(\tau)$$

That is, $$y_\tau$$ is the value obtained by applying the inverse of the distribution function $$F_Y$$ to the probability $$\tau$$.

Note that there are two possible scenarios. On the one hand, if the distribution function $$F_Y(y)$$ is strictly monotonically increasing, quantiles are well defined for every $$\tau \in (0,1)$$. However, if the distribution function $$F_Y(y)$$ is not strictly monotonically increasing, there are some values of $$\tau$$ for which a unique quantile cannot be defined. In this case one uses the smallest value that $$y$$ can take on for the given probability $$\tau$$.

Both cases, with and without a strictly monotonically increasing function, can be described as follows:

$$ (4) y_\tau = F_Y^{-1}(\tau) = \inf \left\{ y \mid F_Y(y) \geq \tau \right\} $$

That is, $$y_\tau$$, the generalized inverse of $$F_Y$$ at $$\tau$$, is the infimum of all $$y$$ such that the distribution function $$F_Y(y)$$ is greater than or equal to the given probability $$\tau$$, i.e. the $$\tau^{th}$$-quantile. (Handl (2000))

However, a problem that frequently occurs is that an empirical distribution function is a step function. Handl (2000) describes a solution to this problem. As a first step, one reformulates equation 4 by replacing the distribution function $$F_Y(y)$$ of the continuous random variable $$Y$$ with the empirical distribution function $$F_n(y)$$ based on the $$n$$ observations. This gives the following equation:

$$ (5) \hat{y}_\tau = \inf \left\{ y \mid F_n(y) \geq \tau \right\}$$

The empirical distribution function can be separated into equally sized, consecutive subsets via the number of observations $$n$$, which leads to the following step:

$$ (6) \hat{y}_\tau = y_{(i)}, \qquad \frac{i-1}{n} < \tau \leq \frac{i}{n} $$

with $$i=1,...,n$$ and $$y_{(1)},...,y_{(n)}$$ denoting the sorted observations; $$i$$ is the smallest index such that $$i/n \geq \tau$$, i.e. $$i = \lceil n\tau \rceil$$. The range of values that $$\hat{y}_\tau$$ can take on is hereby limited to the observations $$y_{(i)}$$ themselves. However, what if one wants to implement different quantiles than those that can be derived from the number of observations $$n$$?
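The order-statistic definition in equations 5 and 6 can be sketched in a few lines of code (a hypothetical helper on made-up data, not part of the original text):

```python
import math

def empirical_quantile(data, tau):
    """tau-th quantile of a sample, following equation 5:
    the smallest order statistic y_(i) with i/n >= tau."""
    if not 0 < tau <= 1:
        raise ValueError("tau must lie in (0, 1]")
    ordered = sorted(data)          # y_(1), ..., y_(n)
    n = len(ordered)
    i = math.ceil(n * tau)          # smallest i with i/n >= tau
    return ordered[i - 1]           # 1-based index into the sorted sample

sample = [3.1, 1.2, 5.7, 2.4, 4.9]
print(empirical_quantile(sample, 0.5))   # -> 3.1, the sample median
```

Note that the result is always one of the observed values, which is exactly the limitation discussed above.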

A further step to solve the problem of a step function is therefore to smooth the empirical distribution function by replacing it with a continuous, piecewise linear function $$\tilde{F}(y)$$. Several algorithms are available for this; they are well described in Handl (2000) and, in more detail and with an evaluation of their efficiency in statistical computer packages, in Hyndman and Fan (1996). Only then can one apply any division of the data set into quantiles suitable for the purpose of the analysis. (Handl (2000))
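As an illustration, one of the simplest smoothing schemes linearly interpolates between adjacent order statistics placed at the plotting positions $$(k-0.5)/n$$; this corresponds to one of the interpolation types catalogued by Hyndman and Fan (1996). The helper below is a sketch on made-up data, not code from the text:

```python
def interpolated_quantile(data, tau):
    """Quantile from a piecewise linear version of the empirical
    distribution function: order statistic y_(k) is placed at
    probability (k - 0.5) / n and neighbours are joined by lines."""
    ordered = sorted(data)
    n = len(ordered)
    h = tau * n + 0.5               # fractional index on the 1..n scale
    if h <= 1:
        return ordered[0]           # below the first plotting position
    if h >= n:
        return ordered[-1]          # above the last plotting position
    k = int(h)                      # integer part, 1 <= k < n
    frac = h - k
    return ordered[k - 1] + frac * (ordered[k] - ordered[k - 1])

print(interpolated_quantile([1.0, 2.0, 3.0, 4.0], 0.5))   # -> 2.5
```

Unlike the step-function quantile, this version can return values between the observations, so any quantile in (0,1) is meaningful.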

Ordinary Least Squares
In regression analysis the researcher is interested in analyzing the behavior of a dependent variable $$y_i$$ given the information contained in a set of explanatory variables $$x_i$$. Ordinary Least Squares is a standard approach: one specifies a linear regression model and estimates its unknown parameters by minimizing the sum of squared errors. This leads to an approximation of the mean function of the conditional distribution of the dependent variable. OLS is BLUE, the best linear unbiased estimator, if the following four assumptions hold:

1. The explanatory variable $$x_i$$ is non-stochastic

2. The expectation of the error term $$\epsilon_i$$ is zero, i.e. $$E[\epsilon_i]=0$$

3. Homoscedasticity - the variance of the error terms $$\epsilon_i$$ is constant, i.e. $$var(\epsilon_i)=\sigma^{2}$$

4. No autocorrelation, i.e. $$cov(\epsilon_i, \epsilon_j )=0$$ , $$i\neq j$$

However, frequently one or more of these assumptions are violated, so that OLS is no longer the best linear unbiased estimator. Quantile Regression can then address the following issues: (i) frequently the error terms are not constant across the distribution, violating the assumption of homoscedasticity; (ii) by focusing on the mean as a measure of location, information about the tails of the distribution is lost; (iii) OLS is sensitive to extreme outliers, which can distort the results significantly. (Montenegro (2001))
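For later comparison, the OLS estimator can be sketched in a few lines via the closed-form solution $$\hat{\beta}=(X^{T}X)^{-1}X^{T}y$$. The toy data below are made up purely to show the outlier sensitivity mentioned in point (iii):

```python
import numpy as np

# toy data: y = 1 + 2x, except for a gross outlier in the last observation
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 30.0])

X = np.column_stack([np.ones_like(x), x])         # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes the sum of squared errors

print(beta_hat)   # slope pulled up by the single outlier (about 6.2 instead of 2)
```

A single extreme observation is enough to triple the estimated slope, which motivates the robustness properties of quantile-based methods discussed below.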

The Method
Quantile Regression essentially transforms a conditional distribution function into a conditional quantile function by slicing it into segments. These segments describe the cumulative distribution of a conditional dependent variable $$Y$$ given the explanatory variable $$x_i$$ with the use of quantiles as defined in equation 4.

For a dependent variable $$Y$$, given the explanatory variable $$X=x$$ and fixed $$\tau$$, $$0<\tau<1$$, the conditional quantile function is defined as the $$\tau^{th}$$ quantile $$Q_{Y|X}(\tau|x)$$ of the conditional distribution function $$F_{Y|X}(y|x)$$. For estimating the location of the conditional distribution, the conditional median $$Q_{Y|X}(0.5|x)$$ can be used as an alternative to the conditional mean. (Lee (2005))

One can nicely illustrate Quantile Regression by comparing it with OLS. In OLS, modeling the conditional mean of a random sample ($$y_1,...,y_n$$) with a parametric function $$\mu(x_i,\beta)$$, where $$x_i$$ represents the independent variables and $$\beta$$ the corresponding parameters, one gets the following minimization problem:

$$ (7) \min_{\beta\in\Re}\sum_{i=1}^{n}(y_i-\mu(x_i,\beta))^{2} $$

One thereby obtains the conditional expectation function $$E[Y|x_i]$$. In Quantile Regression one proceeds in a similar fashion; the central feature becomes the check function $$\rho_\tau$$:

$$ (8) \rho_{\tau}(x)=\begin{cases}\tau \cdot x & \mbox{if } x \ge 0 \\ (\tau-1)\cdot x & \mbox{if } x < 0 \end{cases} $$

This check function ensures that

1. all values of $$\rho_\tau$$ are non-negative

2. the scale is set according to the probability $$\tau$$

Such a function with two branches is necessary when dealing with L1 distances, which can become negative.
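The check function, and the fact that the $$\tau^{th}$$ sample quantile minimizes the sum $$\sum_i \rho_\tau(y_i-\xi)$$ over a pure location parameter $$\xi$$, can be verified numerically (a sketch with made-up data):

```python
def rho(tau, u):
    """Check function of equation 8: asymmetric absolute loss."""
    return tau * u if u >= 0 else (tau - 1) * u

def loss(tau, data, xi):
    """Objective of equation 9 for a pure location model xi."""
    return sum(rho(tau, y - xi) for y in data)

data = [1.0, 2.0, 3.0, 4.0, 5.0]

# scan candidate locations; the minimizer sits at the tau-th sample quantile
print(min(data, key=lambda xi: loss(0.5, data, xi)))   # -> 3.0, the median
print(min(data, key=lambda xi: loss(0.9, data, xi)))   # -> 5.0, the 0.9 quantile
```

The asymmetric weighting is what shifts the minimizer away from the median: for $$\tau=0.9$$, positive residuals are penalized nine times as heavily as negative ones.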

In Quantile Regression one now minimizes the following function:

$$ (9) \min_{\beta\in\Re}\sum_{i=1}^{n}\rho_{\tau}(y_i-\xi(x_i,\beta)) $$

Here, as opposed to OLS, the residuals are weighted asymmetrically by the check function $$\rho_\tau$$; the estimate of the $$\tau^{th}$$ quantile function is obtained with the parametric function $$\xi(x_i, \beta)$$. (Koenker and Hallock (2001))
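Equation 9 can be rewritten as a linear program by splitting each residual into positive and negative parts $$u_i, v_i \geq 0$$, minimizing $$\tau\sum_i u_i + (1-\tau)\sum_i v_i$$ subject to $$y_i = x_i\beta + u_i - v_i$$. A sketch using scipy's generic LP solver follows; the data are made up, and production work would use a dedicated implementation such as statsmodels' QuantReg or R's quantreg:

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression(X, y, tau):
    """Solve equation 9 as a linear program:
    min tau*sum(u) + (1-tau)*sum(v)  s.t.  X beta + u - v = y,  u, v >= 0."""
    n, p = X.shape
    c = np.concatenate([np.zeros(p), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])    # [X  I  -I] [beta u v]' = y
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]

# intercept-only model: the tau = 0.5 fit recovers the sample median
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
X = np.ones((5, 1))
print(quantile_regression(X, y, 0.5))   # close to [3.], unaffected by the outlier
```

This LP formulation is exactly why the minimization in equation 9 is tractable by linear programming, as noted in point 4 below.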

Features that characterize Quantile Regression and differentiate it from other regression methods are the following:

1. The entire conditional distribution of the dependent variable $$Y$$ can be characterized through different values of $$\tau$$

2. Heteroscedasticity can be detected

3. If the data is heteroscedastic, median regression estimators can be more efficient than mean regression estimators

4. The minimization problem as illustrated in equation 9 can be solved efficiently by linear programming methods, making estimation easy

5. Quantile functions are equivariant to monotone transformations, that is $$Q_{h(Y)|X}(\tau|x)=h(Q_{Y|X}(\tau|x))$$ for any monotone function $$h$$

6. Quantiles are robust with regard to outliers (Lee (2005))
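Property 5 can be checked directly with the order-statistic quantile of equation 6 and a monotone transformation such as $$h(y)=e^{y}$$ (illustrative data only):

```python
import math

def sample_quantile(data, tau):
    """Order-statistic quantile: smallest y_(i) with i/n >= tau (equation 6)."""
    ordered = sorted(data)
    return ordered[math.ceil(len(ordered) * tau) - 1]

y = [0.5, 1.3, 2.2, 0.9, 1.7]
tau = 0.7

lhs = sample_quantile([math.exp(v) for v in y], tau)   # quantile of h(Y)
rhs = math.exp(sample_quantile(y, tau))                # h of the quantile of Y
print(lhs == rhs)   # True: taking exp and taking the quantile commute
```

The equality is exact here because a monotone increasing $$h$$ preserves the ordering of the sample, so the same observation is selected on both sides. The conditional mean, by contrast, does not have this property.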

A graphical illustration of Quantile Regression
Before proceeding to a numerical example, the following subsection seeks to illustrate the concept of Quantile Regression graphically. As a starting point, consider figure 1. For a given explanatory value $$x_i$$, the density of the conditional dependent variable $$Y$$ is indicated by the size of the balloon: the bigger the balloon, the higher the density, with the mode, i.e. the point where the density is highest for a given $$x_i$$, being the biggest balloon. Quantile Regression essentially connects the equally sized balloons, i.e. equal probabilities, across the different values of $$x_i$$, thereby allowing one to focus on the interrelationship between the explanatory variable $$x_i$$ and the dependent variable $$Y$$ at the different quantiles, as can be seen in figure 2. These subsets, marked by the quantile lines, reflect the probability density of the dependent variable $$Y$$ given $$x_i$$.



The example used in figure 2 is originally from Koenker and Hallock (2000) and illustrates a classical empirical application: Ernst Engel's (1857) investigation into the relationship between household food expenditure, the dependent variable, and household income, the explanatory variable. In Quantile Regression the conditional function $$Q_{Y|X}(\tau|x)$$ is segmented by the $$\tau^{th}$$-quantile. In the analysis, the quantiles $$\tau \in\left\{0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95\right\}$$, indicated by the thin blue lines that separate the different color sections, are superimposed on the data points. The conditional median ($$\tau=0.5$$) is indicated by a thick dark blue line, the conditional mean by a light yellow line. The color sections thereby represent the subsections of the data as generated by the quantiles.



Figure 2 can be understood as a contour plot representing a 3-D graph, with food expenditure and income on the y and x axes, respectively. The third dimension arises from the probability density of the respective values, indicated by the darkness of the shade of blue: the darker the color, the higher the probability of occurrence. For instance, on the outer bounds, where the blue is very light, the probability density for the given data set is relatively low, as these regions are marked by the quantiles 0.05 to 0.1 and 0.9 to 0.95. It is important to note that figure 2 represents for each subsection the individual probability of occurrence, whereas quantiles utilize the cumulative probability of a conditional function. For example, a $$\tau$$ of 0.05 means that 5% of the observations are expected to fall below the 0.05 line, and a $$\tau$$ of 0.25 means that 25% of the observations are expected to fall below the 0.25 line, including those below the 0.05 and 0.1 lines.

The graph in figure 2 suggests that the error variance is not constant across the distribution: the dispersion of food expenditure increases as household income goes up. The data is also skewed to the left, indicated by the spacing of the quantile lines, which decreases above the median, and by the relative position of the median, which lies above the mean. This suggests that the assumption of homoscedasticity, on which OLS relies, is violated. The statistician is therefore well advised to use an alternative method of analysis such as Quantile Regression, which is able to deal with heteroscedasticity.

A Quantile Regression Analysis
In order to give a numerical example of the analytical power of Quantile Regression, and to compare it with OLS within the boundaries of a statistical application, the following section analyzes selected variables of the Boston Housing data set, which is available at the md-base website. The data was first analyzed by Belsley, Kuh, and Welsch (1980). The original data comprise 506 observations on 14 variables stemming from the census of the Boston metropolitan area.

This analysis utilizes as the dependent variable the median value of owner-occupied homes (a metric variable, abbreviated with H) and investigates the effects of 4 independent variables as shown in table 1. These variables were selected as they best illustrate the difference between OLS and Quantile Regression. For the sake of simplicity, potential difficulties related to finding the correct specification of a parametric model are set aside for now, and a simple linear regression model is assumed. For the estimation of asymptotic standard errors see for example Buchinsky (1998), who illustrates the design-matrix bootstrap estimator, or alternatively Powell (1986) for kernel-based estimation of asymptotic standard errors.

In the following, first an OLS model was estimated. Three digits after the decimal point are reported in the tables, as some of the estimates turned out to be very small.

$$ (10) E [H_i | T_i, O_i, A_i, P_i] = \alpha + \beta T_i + \delta O_i + \gamma A_i + \lambda P_i $$

Computing this via XploRe one obtains the results as shown in the table below.

Analyzing this data set via Quantile Regression, utilizing the quantiles $$\tau \in \left\{0.1, 0.3, 0.5, 0.7, 0.9\right\}$$, the model is characterized as follows:

$$ (11) Q_H [\tau| T_i, O_i, A_i, P_i] = \alpha_\tau + \beta_\tau T_i + \delta_\tau O_i  + \gamma_\tau A_i  + \lambda_\tau P_i   $$

Just for illustrative purposes, and to further foster the reader's understanding of Quantile Regression, the minimization problem for the 0.1 quantile is briefly illustrated; all others follow analogously:

$$ (12) \min_{\beta}\left[\rho_{0.1}(y_1-x_1\beta)+\rho_{0.1}(y_2-x_2\beta)+ ...+\rho_{0.1}(y_n-x_n\beta)\right] $$

$$ \mbox{with } \rho_{0.1}(y_i-x_i\beta)=\begin{cases}0.1(y_i-x_i\beta) & \mbox{if }(y_i-x_i\beta) \ge 0 \\ -0.9(y_i-x_i\beta) & \mbox{if }(y_i-x_i\beta)<0 \end{cases} $$

Comparing the OLS estimates from table 2 with the Quantile Regression estimates in table 3, one finds that the latter method allows much more subtle inferences about the effect of the explanatory variables on the dependent variable. Of particular interest are quantile estimates that differ considerably from the estimates at other quantiles for the same variable.

Probably the most interesting result, and the most illustrative with regard to the functioning of Quantile Regression and its differences from OLS, concerns the independent variable proportion of non-retail business acres $$(T_i)$$. OLS indicates that this variable has a positive influence on the dependent variable, the value of homes, with an estimate of $$\hat{\beta}=0.021$$; i.e. the value of houses increases as the proportion of non-retail business acres $$(T_i)$$ increases.

Looking at the output that Quantile Regression provides, one finds a more differentiated picture. For the 0.1 quantile, we find an estimate of $$\hat{\beta}_{0.1}=0.087$$, which suggests that for this low quantile the effect is even stronger than suggested by OLS: house prices go up when the proportion of non-retail businesses $$(T_i)$$ goes up. However, at the other quantiles this effect is not as strong anymore; for the 0.7th and 0.9th quantiles it even appears to be reversed, as indicated by the estimates $$\hat{\beta}_{0.7}=-0.021$$ and $$\hat{\beta}_{0.9}=-0.062$$. These values indicate that in these quantiles the house price is negatively influenced by an increase in non-retail business acres $$(T_i)$$. The influence of non-retail business acres $$(T_i)$$ on the house price is thus ambiguous, depending on which quantile one looks at. The general conclusion from OLS that house prices increase with the proportion of non-retail business acres obviously cannot be generalized; a policy recommendation based on the OLS estimate alone could be grossly misleading.

One would intuitively expect the statement that the average number of rooms of a property $$(O_i)$$ positively influences the value of a house to be true. This is also suggested by OLS, with an estimate of $$\hat{\delta}=38.099$$. Quantile Regression confirms this statement, but it also allows for much subtler conclusions. There is a marked difference between the 0.1 quantile and the rest of the quantiles, in particular the 0.9th. For the lowest quantile the estimate is $$\hat{\delta}_{0.1}=29.606$$, whereas for the 0.9th quantile it is $$\hat{\delta}_{0.9}=51.353$$. The other quantiles show values similar to the 0.9th, with estimates of $$\hat{\delta}_{0.3}=45.281$$, $$\hat{\delta}_{0.5}=53.252$$, and $$\hat{\delta}_{0.7}=50.999$$, respectively. So for the lowest quantile the influence of an additional room $$(O_i)$$ on the house price is considerably smaller than for all other quantiles.

Another illustrative example is the proportion of owner-occupied units built prior to 1940 $$(A_i)$$ and its effect on the value of homes. Whereas OLS indicates that this variable has hardly any influence, with an estimate of $$\hat{\gamma}=0.001$$, Quantile Regression gives a different impression. For the 0.1th quantile, age has a negative influence on the value of the home, with $$\hat{\gamma}_{0.1}=-0.022$$. Comparing this with the highest quantile, where the estimate is $$\hat{\gamma}_{0.9}=0.004$$, one finds that there the value of the house is positively influenced by its age. The negative influence is confirmed by all other quantiles besides the highest, the 0.9th.

Last but not least, looking at the pupil-teacher ratio $$(P_i)$$ and its influence on the value of houses, one finds the tendency that OLS indicates, with a value of $$\hat{\lambda}=-0.953$$, to be reflected in the Quantile Regression analysis as well. However, in Quantile Regression one can see that the negative influence of the pupil-teacher ratio $$(P_i)$$ on the housing price gradually increases in magnitude over the different quantiles, from an estimate of $$\hat{\lambda}_{0.1}=-0.443$$ at the 0.1th quantile to $$\hat{\lambda}_{0.9}=-1.257$$ at the 0.9th quantile.

This analysis makes clear that Quantile Regression allows much more differentiated statements than OLS. Sometimes OLS estimates can even be misleading about the true relationship between an explanatory and a dependent variable, as the effects can be very different for different subsections of the sample.

Conclusion
For a distribution function $$F_Y(y)$$ one can determine, for a given value of $$y$$, the probability $$\tau$$ of occurrence. Quantiles do exactly the opposite: one determines, for a given probability $$\tau$$ of the sample data set, the corresponding value $$y$$. In OLS, the primary goal is to determine the conditional mean of a random variable $$Y$$, given some explanatory variable $$x_i$$, $$E[Y|x_i]$$. Quantile Regression goes beyond this and enables us to pose such a question at any quantile of the conditional distribution function. It focuses on the interrelationship between a dependent variable and its explanatory variables for a given quantile. Quantile Regression thereby overcomes various problems that OLS is confronted with. Frequently, error terms are not constant across the distribution, violating the assumption of homoscedasticity. Also, by focusing on the mean as a measure of location, information about the tails of the distribution is lost. And last but not least, OLS is sensitive to extreme outliers, which can distort the results significantly. As indicated in the small example of the Boston Housing data, a policy based on an OLS analysis might sometimes not yield the desired result, as a certain subsection of the population does not react as strongly to the policy or, even worse, responds in a negative way, which was not indicated by OLS.