Regression Analysis

Regression analysis is a supervised machine learning technique with which we try to find linear relationships between an outcome variable Y and a set of regressors X. Y is also called the target variable or dependent variable, while the regressors are also called features or independent variables. Y is a real-valued random variable, and X is a vector of random variables whose components are denoted $(x_1, x_2, \ldots, x_p)^T$, where $x_1, x_2, \ldots$ represent individual features and the superscript T denotes the transpose of the vector. For example, suppose you want to build a regression model to predict the hourly wage of a worker using variables like education level, gender, experience, skillset, etc.

Here, the hourly wage is the outcome variable Y that needs to be predicted, and education level, gender, experience, skillset, etc., form the set of regressors, or features, X that we use to predict it.
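To make the setup concrete, here is a minimal sketch of how such data might be arranged, with an outcome vector y and a feature matrix X whose rows are workers and whose columns are features. All the numbers and the feature encoding are made up for illustration:

```python
import numpy as np

# Hypothetical toy dataset for the wage example (all numbers are invented).
# Columns of X: [years_of_education, gender (0/1), years_of_experience, skill_score]
X = np.array([
    [12, 0,  2, 3.1],
    [16, 1,  5, 4.0],
    [14, 0, 10, 3.6],
    [18, 1,  3, 4.5],
])

# y: the observed hourly wage (in dollars) for each worker
y = np.array([15.0, 28.5, 22.0, 31.0])

print(X.shape)  # (4, 4) -- 4 workers, 4 features each
print(y.shape)  # (4,)
```

Each row of X pairs with one entry of y, which is the standard layout expected by most regression libraries.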

Using regression analysis, we can answer two questions:


 1. The Prediction Question: How can we use the set of regressors X to predict Y well?
 2. The Inference Question: How does the predicted value of Y change if we change only one component (one feature) of X, holding all other components constant?

In our example above, regarding hourly wages:

The prediction question will be: how do we use the set of regressors X, namely education level, gender, experience, skillset, etc., to predict hourly wages Y? The inference question will be, for example: how does gender affect the wage of a worker when all other features are held fixed?

For the inference question, we divide the set of regressors X into two parts:

$$ X = (D,W) $$

where D is the target regressor; in the above example, gender is the target regressor, since we want to measure its effect while keeping all other job-relevant features constant. W represents the controls, or confounders: the remaining job-relevant characteristics, namely education level, experience, skillset, etc.
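The split X = (D, W) can be sketched directly on a feature matrix by slicing out the target regressor's column. The data and the assumption that gender sits in column 1 are both hypothetical:

```python
import numpy as np

# Hypothetical feature matrix; columns are
# [years_of_education, gender (0/1), years_of_experience, skill_score]
X = np.array([
    [12, 0,  2, 3.1],
    [16, 1,  5, 4.0],
    [14, 0, 10, 3.6],
    [18, 1,  3, 4.5],
])

# D: the target regressor (gender, assumed to be column 1)
D = X[:, 1]

# W: the controls/confounders (all remaining columns)
W = np.delete(X, 1, axis=1)

print(D.shape)  # (4,)
print(W.shape)  # (4, 3)
```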

Sample and Population Linear Regression
We can define regression problems for both population and sample data.

We have population data if we have access to the whole dataset of interest. In our "predicting hourly wage" example, the population would be all the people living in the United States. Using this data, we can directly compute the expected value of Y, i.e., the population average.

We denote the expected value of a population by E(Y).

Now our goal is to build a model that estimates E(Y) given a set of regressors X = (x1, x2, ..., xp).
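As a quick illustration of the expected value, a sample average estimates E(Y) and gets close to the true mean as the sample grows. The simulated wage distribution below (true mean 25) is purely synthetic:

```python
import numpy as np

# Simulate a large "population" of hourly wages with a known true mean of 25
# (synthetic numbers, used only to illustrate E(Y) as a population average).
rng = np.random.default_rng(0)
wages = rng.normal(loc=25.0, scale=5.0, size=100_000)

# The sample average estimates the expected value E(Y)
print(wages.mean())  # close to 25
```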

We will now construct the best linear predictor of Y using X; that is, among all linear functions of X, the one that best approximates Y on the given data.



Picture a scatter plot in which blue points are the data points and a red line is drawn through them. Our goal is to find the line that best fits the data, i.e., the one that gives the least error.

Mathematically, we define the predicted value of Y using the set of regressors X as:

$$ y_i = \beta_{0} + \beta_{1} x_{i1} + \cdots + \beta_{p} x_{ip} + \varepsilon_i = \mathbf{x}^\mathsf{T}_i\boldsymbol\beta + \varepsilon_i, \qquad i = 1, \ldots, n, $$
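This model can be fitted with ordinary least squares. The sketch below generates synthetic data from known coefficients (all values invented), prepends a column of ones for the intercept $\beta_0$, and recovers the coefficients with `numpy.linalg.lstsq`:

```python
import numpy as np

# Synthetic data from a known linear model, so we can check that least
# squares recovers the coefficients (all numbers are made up).
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=(n, 3))              # three regressors per observation
X = np.column_stack([np.ones(n), x])     # column of 1s carries the intercept beta_0
beta_true = np.array([2.0, 1.0, -0.5, 3.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)   # epsilon_i ~ N(0, 0.1^2)

# Least-squares estimate of the regression parameters
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))  # close to beta_true
```

With low noise and 200 observations, the estimates land very close to the true coefficients.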

We call the parameters $\beta_{j}$ the regression parameters (or regression coefficients), and $\varepsilon_i$ is the error term.

For example, in our wage-prediction problem, we can represent the predicted hourly wage as β′X, where X holds the job-relevant characteristics (gender, education level, skillset, experience, etc.) and β is the vector of regression parameters.

Now let’s define the error for this problem. In supervised learning, we know the actual value of the target variable Y from the data, and the best linear predictor gives us predicted values β′X. Therefore, the error is defined as the expected squared difference:

$$ \text{Error} = E\left[(Y - \beta'X)^2\right] $$

Here E(·) denotes the population expectation, just as E(Y) at the start denoted the population average of Y.

Naturally, our job is to choose β so as to minimize this error.
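On a finite sample, the expectation above is estimated by the mean squared error, and the least-squares β is exactly the vector that minimizes it. A small sketch on synthetic data (all values invented) makes this concrete:

```python
import numpy as np

# E[(Y - beta'X)^2] is estimated on data by the mean squared error;
# the least-squares beta minimizes it (synthetic data for illustration).
rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 regressors
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(size=n)

def mse(beta):
    """Sample estimate of E[(Y - beta'X)^2]."""
    return np.mean((y - X @ beta) ** 2)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# No other beta gives a smaller error on this sample -- not even the
# true coefficients, since OLS minimizes the *in-sample* squared error.
assert mse(beta_ols) <= mse(beta_true)
assert mse(beta_ols) <= mse(beta_ols + np.array([0.1, 0.0, 0.0]))
```

The second assertion shows that perturbing any coordinate of the least-squares solution can only increase the sample error.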