The Magic Behind The Linear Methods for Regression – Part 1

It is time to expose the oracle! In this journey, everything started with an amazingly simple method: linear regression. Let’s dive into the nitty-gritty of it.

What we will see:
-Univariate linear regression (with OLS)
-Orthogonalization
-Multivariate linear regression(with OLS)
-Gauss-Markov Theorem
-Gram-Schmidt Procedure

Revision

In the previous post,  we learned that regression is simply
a conditional expectation of the outcome Y given by input variables X=x (E(Y|X)) and least squares fitting can be used to find the coefficients of the linear regression. The linear model either assumes that the regression function E(Y|X) is linear or that the linear model is a reasonable approximation.

As you may remember, what we are trying to do is to find such betas so as to minimize the RSS:

From the previous post, we know that the betas that minimize the RSS can be found by:

For those who are familiar with linear algebra, does it remind you of something which is widely used in linear algebra? Yes, it is orthogonal projection which is something extremely important.

Here is a brief overview of what orthogonal projection is and how it works:
As it can be clearly seen from the picture above, the least squares estimate of the betas is the coefficients requred for the orthogonal projection of response vector y onto the columns space of the input(feature) matrix. Then all we have to do is to multiply input matrix with our vector of estimated betas to find the least squares estimation of the responses:

X*(X^t*X)^-1*X^t part of the equation is called the projection matrix because it projects y onto the columns space of X and produces the orthogonal projections. The least squares estimates are essentially the orthogonal projections because the orthogonal distance between actual value and predicted value is the minimum distance between them. Think: what happens to projection matrix when the input matrix X is singular?

The Gauss-Markov Theorem

This theorem states that the LSE(least squares estimates) of the betas have the smallest variance among all linear unbiased estimates. But do the unbiased estimates always the best? As we will see later, there may be some biased estimates of betas that give a smaller mean squared error overall. The book gives the following equations:

This estimate is unbiased which means its expectation equals to original betas:

The proof is as below:

As we always keep in mind our assumptions, the proofs are easy as it can be seen. Then the book states that if we have any other linear estimator c’y  (‘ means transpose) that is unbiased for  then the variance of this estimates cannot be smaller than the variance of  , the estimates of ordinay least squares.

Let’s try to prove it.

First: setting the equation:

The expectation of the linear estimator can be calculated as below:

Then the variance of it can be calculated as following:

For the ones who wonder the variance of OLS beta, here is how we calculate it:

As it can be seen, the variance of another linear estimator for the  is greater than or equal to the variance of .

Then let’s try to find how well we estimate the betas by calculating the MSE of it:Correction: variance decomposition should be expectation of the second moment minus the squared expectation, not plus.

As it is clear from the picture above, the MSE of our estimates of betas are linearly related to its variance and squared bias. Variance of the estimate is minimum when OLS is used. The Gauss-Markov theorem implies that the least squares estimator has the smallest mean squared error of all linear estimators with no bias: squared bias term is the same for all unbiased estimators but the variance part is the minimum for OLS thus cause OLS to have minimum MSE among all unbiased estimators. But sometimes biased estimates can cause a greater reduction in variance than they cause an increase in squared bias. Thus, it is possible to have a smaller MSE with biased estimates. This is bias-variance tradeof. This (along with preventing overfitting) is actually the main motivation behind biased estimates such as Ridge Regression and the LASSO.

Multiple Regression and Gram-Schmidt Procedure

So far we have seen univariate regression (regression with one input variable). Now we will see what happens when we have many of them which we call multivariate regression. In this case it is very likely to have input variables that are correlated with each other. But correlated variables are bad for regression. Why?

When we have input variables that are not correlated with each other, then it means input variables are orthogonal to each other. Thus, each input variable does not have any effect on the coefficients of other input variables. That means since they are orthogonal to each other, the effect of an input variable cannot be captured by the any other input variables. But when they are not orthogonal (some correlation), then the coefficients of correlated variables become shaky and unstable.

In the picture above, the left plot shows two centered (the mean of the vector is substracted from  each element of it) orthogonal  vectors. Any change along the blue vector axis cannot be captured by the orange vector axis and vice versa: they might change independently. Orthogonality does not necessarily imply uncorrelatedness or independence. It implies uncorrelatedness when the centered vectors have zero inner product because when they are centered their inner product gives the covariance of them. Independence is something related to their joint probability distribution.  For example you can fix a point on the orange line and go freely along with the blue line. The right plot, however, shows two correlated vectors (no orthogonality). In this case the orange vector captures the some of the information provided by the blue line because they go in the similar direction. That means a change along the direction of orange line is somewhat captured by the blue line. So the two vectors say the same thing to some extend.

So we have to orthogonalize input variables with respect to each other so that they do not provide the same information. The book gives the following algorithm for this:

Here, z(i) is the residual vector of x(i) after x(i) is orthogonalized with respect to z(0), z(1), …, z(i-1). We have seen how orthogonalization works in the beginning. Suppose we have the input matrix:

This algorithm says that orthogonalize x1 with respect to x0 which is the first columns of 1s. To do it, we have to find orthogonal projection of x1 onto x0, then substract it from x1. To visualize it:

Here x1 is orthogonally projected onto x2, thereby production green vector x1′. Then when we substract x1′ from the blue vector x1, here is the residual vector z1. In other words: z1 is the orthogonalized vector of x1 with respect to x2.  Then betas are simply calculated as follows:

In this example, the beta of x1 is the orthogonal projection of y(responses) onto the z1 (orthogonalized(or “adjusted”) version of x1 with respect to x2). When the residual vector zp is small, then it means variable p is correlated with some of the other variables. This causes its beta to be large and any small change in the residual would cause large changes in the beta, making the beta unstable and shaky. That is why we try to avoid correlation.

This entire procedure is called Gram-Schmidt procedure for multiple regression. The book says that we can represent the step 2 of the algorithm in matrix form:

where X is the original input matrix, Z has as columns z(j) and the other part is the upper triangular matrix with the coefficients of the residual vectors in the step 2 of the algorithm. Lets see how we can represent this algorithm in this way:

First: matrix Z is constructed like this:

Second: when we left x(s) alone, the equations and the resulting matrix X look like

So we can represent these equations in the matrix form as below:

A simple matrix calculation yields the same equations in the second part. This decomposition looks like QR decomposition, where Z is orthogonal input vectors and the other matrix is an upper triangular matrix. But in order for it to be a real QR decomposition, Z must be orthonormal: vectors must be orthogonal(ortho) and their length must equalt to 1 (normal). So in order to do this, we must divide each column of Z by its corresponding length (a.k.a norm). Here is what normal vector is:

So dividing each column of Z by its corresponding length means multiplying Z with some diagonal matrix:

Let D denote the diagonal matrix on the picture above. Its inverse is simply the same matrix with the diagonal elements 1/a, 1/b, and so on. If the jth diagonal corresponds to the length of z(j) then the following equation gives the QR decomposition:

where Q is orthonormal matrix and R is an upper triangular matrix. If we replace X with QR in our calculation of beta, the equations look like:

This is how they are derived:

Beta estimates are much easier to calculate in this form compared to original algorithm. That is why QR decomposition is used very commonly.

From Least Squares to KNN, Statistical Models, Supervised Learning and Function Approximation

Infinity returns!

As you know, I follow the book The Elements of Statistical Learning and my main motivation is to help people like me who are suffering from not understanding anything on their first attemp to read the book since the book is not quite appropriate for beginners. This is a little bit tough subject but I will try to simplfy it as much as I can or at least to give some intuition that lays the foundation of the theories we will encounter.

Starting with Two Simple Methods: Least Squares and Nearest Neighbors

These are the good starting points in this journey because they are the easiest, most basic, and most intuitive approaches that have their origins in the pre-computer era, and the large portions of today’s much fancier methods are the intelligent derivations of these two approaches.

The linear regression with least squares makes a huge assumptions: the model is linear. On the other hand, k-nearest neighbors makes very mild assumption about the structure of the model. The assumption made in linear regression makes it relatively stable but its predictions tend to be inaccurate because the real world is almost never linear. The KNN, on the other hand, tends to give accurate predictions but it can be easily unstable due to its mild assumptions. Let’s dive into them.

Linear Models and Least Squares

From this point on, the book starts to express every calculation and computation in matrix and vector forms so the understanding of linear algebra would be a huge time and life saver in the remaining of the book. The residual sum of squares is given by:

Here we are trying to find such betas(coefficients) to minimize the RSS. In this example, there is only one input variable X and one output variable Y.  So the alternate representation of this is in vector form:

Consider Y is a vector of outputs and X is a matrix of inputs (1 for every row of the first column, and the value of the random variable X for the second column).  Taking the derivative of it and equating it to zero, we obtain betas as:

This is a basic linear algebra and differentiation job. But why did we include ones for the first columns? Let’s leave it for further subjects.

The result is a two-element vector, the first element is intercept and the other is the slope. Why? Because we include the intercept in X and made it two columns matrix.

This formula requires X transpose * X to be nonsingular so that it has an inverse. What does it mean? In linear algebra, a matrix has an inverse if it is nonsingular or in other words has full-rank. This means that the columns of the matrix have to be linearly independent of each other so that its determinant will be non-zero quantity. Since we use determinant in the denominator to find the inverse of a matrix,  a number / 0 is undefined. That is why it has to be full rank. Fortunately, there are more intelligent ways to calculate betas utilizing matrix decomposition methods such as QR decomposition that does not suffer from ill-conditioned zero determinant case.

Nearest-Neighbor Methods

This is a more intuitive one than least squares linear regression is. In order to form a prediction for an observation, we find the nearest neighbors of it in terms of some distance measures such as euclidean distance.Then we average the outcomes of the neighbors to predict the outcome of the observation.

This simple method has only one parameter which is k: the number of neighbors. As k goes to infinity the prediction will be less accurate since we average over more observations but this decreases the variance of the predictions. On the other hand, as k goes to 1, the prediction tends to be more accurate because an observation is likely to be very similar to its nearest neighbor but the variance of prediction will be higher. In this case, we cannot use least squares methods to find optimal k because it would always pick 1 for k.

One big disadvantage of this method is that in order to find an observation’s nearest neighbors it needs n*n distance matrix to store each observation’s distance to all other observations. Here, n is the number of observations in the data and as n increases, n*n increases even faster which makes the computation and storage very costly.

From Least Squares to Neares Neighbors

Least squares linear regression in which linear decision boundary is assumed is the most rigid one and stable to fit. Nearest neighbors, on the other hand, does not assume any stringent assumptions about the decision boundary and can adapt any situation. However, this makes it unstable relative to least squares (hight variance, low bias).

As it is always said, there is no free lunch in this science, so each method has its own advantages and disadvantages depending on data at hand. In some ways, these two methods can be enhanced to develop more advanced methods:

1- In KNN, all k-nearest neighbors of an observation has equal weights in averaging but it doesn’t have to be in this way. We might use some weights that decrease smoothly to zero with distance from the target point. We do this by a function called Kernel.

2- If the space is high dimensional then distance kernels can be modified to emphasize some variables more than others.

3- Rather than fitting a linear regression to entire data globally, we can fit linear models locally by locally weighted least squares.

4- We can expand the original inputs to a basis in which we can develop any complex models. Truncated power basis is an important methods for this.

5- We can non-linearly transform linear models by projection pursuit or neural network.

Statistical Decision Theory

This section of the book is little bit difficult if you do not know basic probability. The book says than the statistical decision theory requires a loss function L(Y, f(X)) for penalizing errors in prediction. This is intuitive because we can quantify or measure how well our model is only when we have some penalty or loss function and all what we do is to minimize this loss. Squared error loss by far is the most convenient one:

E means expected value of its argument which is also known as mean. The first line might be obvious to you, but in the second line the things become messy. Well, it is the basic properties of probability theory actually. The quantity that we are trying to find its expected value is a continuous jointly distributed random variable. It consists of Y as a one random variable and f(X) as the other random variable and our quantity is the function of the two random variables: Y-f(X). In order to find the expected value of it, we need to integrate it with its joint probability density function. For more information, you can visit my notes of probability course from MIT.

If you revisited the probability then you may remember that a joint pdf of two random variable can be splitted into two parts as:

So when you replace Pr(dx, dy) in the original EPE formula above with this split and rearrange the terms, you can obtain:

here g(x,y) denotes the functions of two random variables which is [y-f(x)]^2. Then it is simply conditioned on X, and EPE can be written as:

This is again from probability theory. This is specifically called total expectation theorem (or more specifically law of iterated expectation, which is an abstract version of total expectation theorem). I cannot go over details of these theorems but you can learn by your self and use my notes of probability. My aim is just to give you some guidance to make you understand what is going on here. Going back to our discussion, EPE can be minimized poinwise. The solution is the conditional expectation which is known as regression function. Then when least squares is used in fitting, the best prediction of Y at any point X=x is the conditional mean of Y.

The nearest-neighbor methods attempt to directly implement this recipe using the training data. The difference is that expectation is approximated by averaging over sample data, and conditioning at a point is relaxed to conditioning on some region close to the target point.

The two methods end up approximating conditional expectations by averages. But where do they differ? The least squares assumes that f(x) is well approximated by a globally linear function whereas knn assumes f(x) is well approximated by a locally constant function. KNN fails in high dimensional space because the neares neighbors need not be close to the target point and can result in large errors. That is why KNN is avoided when the input space is high dimensional.

Function Approximation

All we try to do is to approximate to the true function (relationship) between inputs and output. What ever you call it machine learning, statistical learning, mathematical modelling, and so on, they all have this aim. The aim is to find a useful approximation to f(x). We can treat supervised learning as a function estimation problem in this sense.

The majority of approximations have associated a set of parameters θ so that we can modify them to suit the data at hand. In linear regression those parameters are betas, β, (the coefficients). Linear basis expansions could be another class of useful approximation and sigmoid transformation common to neural networks could be an example of nonlinear transformation. Least squares methods can be used to estimates such parameters by minimizing the residual sum of squares. Linear models has a closed form solution to the minimization problems but in some models the solution requires either iterative methods or numerical optimization.

In some situations least squares does not make much sense. A more general principle for estimation is maximum likelihood estimation. If we have observations from a density Prθ(y) indexed by some parameters θ, the log probability of the observed sample is

Maximum likelihood estimation says that the probability of observed sample is maximum. The intuition behind it is that we are likely to observe high probability observations more often than we observe low probability observations. So we assume that what we observe must be the most likely one.

Why is maximum likelihood estimation is more general methods than least squares is? How do they related to each other? Well actually under the assumption of normally distributed error terms, least squares methods are derived from maximum likelihood estimation method and hence they are identical:

As you can see the picture above, the least squares and maximum likelihood are closely related.

My Self-Taught Data Science Journey

Yes! It is a huge world. A really, really huge world in which you have infinitely many ways to achive a goal unless there are some restrictions which always exist fortunately. You may find it little bit murky, little bit daunting, little bit overwhelming, and a little bit scary in the beginning. You may not understand anything about it, you will probabily stumble over and over again, and you definitely get lost in the way. If, however, you have a passion for it, you will see that all those things will be the underlying force behind your progress, and ultimately your success.

Yes, I am talking about Data Science. From now on, I will write my topics in English and they will cover my journey of data science. I want to write about it because there are ever incresing demand for data science and there are also lots of people like me(in the beginning of my journey) who don’t know where to start the journey! Well, let’s start.

Everything started with the book Elements of Statistical Learning, Data Mining, Inference, and Prediction. During Analytics Edge course on edX from MIT, one of the successful students said that we should read and understand at least the first eight chapters of this book in order to be competent at data science. Then I decided to read the book. At the first glance, the book looked very attractive and fun to read, but all those good thoughts immediately vanished after a copule of pages. That didn’t happen due to the book, but happened due to my lack of basic knowledge about certain subjects. I want to go over them chapter by chapter. Since the first chapter is an introduction, I skip it. All the subjects from chapter 2 to chatper 8 are the building blocks of every statistical and machine learning method.

The second chapter of the book is Overview of Supervised Learning. It starts with basic
terminologies which we will encounter frequently in this journey. It assumes that you have basic knowledge of calculus, statistics, probability theory, and linear algebra. If you don’t, then you are in trouble! Fortunately, we are living in a world where we are just a couple of clicks away from anything we want to learn. There are lots of invaluable sources on the Internet.

– If you want to learn calculus, or refresh your knowledge of calculus, is a perfect place.

– If you want to learn statistics and linear algebra, there are lots of good courses on Coursera and edX. For linear algebra Linear Algebra from MIT given by Prof. Gilbert Strang is an amazing course. It covers all the subjects of linear algebra you need to know for data science. Especially spaces and matrix decompositions are the fundamentals.

– If you learn probability theory, there is a very good course called Introduction to Probability – The Science of Uncertainty. It follows the book Introduction to Probability, 2nd Edition, which I love much.

In order to understand this chapter and the followings, you will need at least basic understanding of those subjects. You will frequently encounter “differentiating w.r.t, integrating, probability distribution, probability density function, least squares, bias, variance”. If you are not familiar with those, then you can get familiar with them using the sources I wrote above. Furthermore, this chapter puts all the statistical and machine learning methods on a spectrum on which traditional least squares regression represents one end and K-Nearest Neighbor represents the other end, and any other algorithms are put in-between. The least squares regression is the most rigid one and has the lowest variance but potentially has higher bias, whereas KNN has the lowest bias but potentially higher variance. All those terms may make no sense for now, but they make sense as we progress.

The third chaper is Linear Methods for Regression. We will encounter regression
problems frequently in which we want to estimate a continuous outcome or response from various inputs. To understand the building blocks of regression, you must have an understanding of linear algebra, or vector algebra, or matrix algebra, all of which are used interchangeably. All of the computation and algorithms are created using linear algebra. It also covers the best subset selection and shrinkage methods, again frequently used in machine learning. For example, it is always said that correlated features are bad for regression. But if I say that “it is bad because the algorithm that minimizes the squared errors need to computer the inverse of feature matrix, and for this feature matrix to be invertable it has to be full rank. In other words, the columns of it have to be linearly independent. If the columns are not linearly independent (two or more features are highly correlated), then it probabily will have no unique inverse.”, doesn’t it make much sense? This sentence includes lots of linear algebra and differentiation. If you do not understand it, you need to go over the source I wrote, if you want to be a real data scientist I guess. Yeah, it comes with a price. Understanding of shrinkage is also very important because it helps prevent overfitting. Ridge Regression, for example, adds a positive constant to the diagonal entries of your input matrix before inversion. This makes the problem nonsingular even if original input matrix is not of full rank. The LASSO(Least Absolute Shrinkage and Subset Selection) is another regularization method that performs both subset selection and shrinkage. If you want to know where and why to use them, you need to get your hands dirty.

The fourt chapter is Linear Methods for Classification. This time, linear models are developed for classification tasks. This chapter includes Linear Discriminant Analysis(LDA), Quadratic Discriminant Analysis(QDA), Logistic Regression, and Separating Hyperplanes(Rosenblatt’s Perceptron and Optimal Separating Hyper Plane). These subjects require at least elementary knowledge of statistics and probability theory. Logistic Regression and LDA have much things in common, but the main difference lies in the way they are fitted. If you want to understand it, you need to know math. And If you understand it, you will know when to use which, which makes you faster and wiser. This covers Bayes Theorem form probability.

Chapters 5, 6, 7, and 8 are build upon these four chapters. I will go over them in the next post. Until then, challenge yourself. Try to understand what those chapters say in essence.