From Least Squares to KNN, Statistical Models, Supervised Learning and Function Approximation

Infinity returns!

As you know, I follow the book The Elements of Statistical Learning, and my main motivation is to help people like me who suffered from not understanding anything on their first attempt to read it, since the book is not quite appropriate for beginners. This is a somewhat tough subject, but I will try to simplify it as much as I can, or at least give some intuition that lays the foundation for the theories we will encounter.

Starting with Two Simple Methods: Least Squares and Nearest Neighbors

These are good starting points for this journey because they are the easiest, most basic, and most intuitive approaches, with origins in the pre-computer era, and large portions of today's much fancier methods are intelligent derivations of these two.

Linear regression with least squares makes a huge assumption: the model is linear. On the other hand, k-nearest neighbors makes only very mild assumptions about the structure of the model. The assumption made in linear regression makes it relatively stable, but its predictions tend to be inaccurate because the real world is almost never linear. KNN, on the other hand, tends to give accurate predictions, but it can easily be unstable because of its mild assumptions. Let's dive into them.

Linear Models and Least Squares

From this point on, the book expresses every calculation and computation in matrix and vector form, so an understanding of linear algebra will be a huge time (and life) saver for the remainder of the book. The residual sum of squares is given by:

$$RSS(\beta) = \sum_{i=1}^{N} \left(y_i - x_i^T \beta\right)^2$$

Here we are trying to find the betas (coefficients) that minimize the RSS. In this example there is only one input variable X and one output variable Y. The alternative representation in vector form is:

$$RSS(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta)$$

Here y is a vector of outputs and X is a matrix of inputs (with a 1 in the first column of every row, and the value of the input variable X in the second column). Taking the derivative with respect to β and setting it equal to zero, we obtain the betas as:

$$\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

This is basic linear algebra and differentiation. But why did we include ones in the first column? Let's leave that for later.

The result is a two-element vector: the first element is the intercept and the second is the slope. Why? Because we included the column of ones in X, making it a two-column matrix, so the intercept is estimated along with the slope.

This formula requires X^T X to be nonsingular so that it has an inverse. What does that mean? In linear algebra, a matrix has an inverse if it is nonsingular, in other words if it has full rank. This means the columns of the matrix must be linearly independent of each other, so that its determinant is a non-zero quantity. Since the determinant appears in the denominator when computing the inverse, and dividing by zero is undefined, the matrix has to be full rank. Fortunately, there are smarter ways to calculate the betas using matrix decompositions, such as the QR decomposition, which do not suffer from the ill-conditioned, near-zero-determinant case.
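
To make this concrete, here is a minimal NumPy sketch (my own illustration, not code from the book) that fits a one-input linear model both through the normal equations and through a QR decomposition; the synthetic data and the variable names are assumptions made for the example.

import numpy as np

# Toy data: y = 2 + 0.5 x + noise (an arbitrary choice for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=50)

X = np.column_stack([np.ones_like(x), x])        # column of ones -> intercept

# Normal equations: beta = (X^T X)^{-1} X^T y (fails if X^T X is singular)
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# QR decomposition: X = QR, then solve R beta = Q^T y (better conditioned)
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

print(beta_normal)   # roughly [2.0, 0.5]: [intercept, slope]
print(beta_qr)       # the same result, computed more stably

Both routes give the same [intercept, slope] vector; the QR route avoids forming X^T X explicitly, which is why it behaves better when X^T X is close to singular.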

Nearest-Neighbor Methods

This one is more intuitive than least squares linear regression. To form a prediction for an observation, we find its nearest neighbors in terms of some distance measure, such as Euclidean distance. Then we average the outcomes of those neighbors to predict the outcome of the observation.

This simple method has only one parameter, k: the number of neighbors. As k goes to infinity, the prediction becomes less accurate because we average over more and more observations, but this decreases the variance of the predictions. On the other hand, as k goes to 1, the prediction tends to be more accurate because an observation is likely to be very similar to its nearest neighbor, but the variance of the prediction will be higher. Note that we cannot use the least squares criterion on the training data to pick the optimal k, because it would always choose k = 1.

One big disadvantage of this method is that, in order to find an observation's nearest neighbors, it needs an n × n distance matrix storing each observation's distance to all other observations. Here n is the number of observations in the data, and as n increases, n² grows even faster, which makes the computation and storage very costly.
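
Here is a minimal sketch of KNN regression (again my own illustration; the toy sine data and the name knn_predict are assumptions for the example).

import numpy as np

def knn_predict(x0, X_train, y_train, k):
    """Predict the outcome at x0 by averaging the outcomes of its k nearest neighbors."""
    distances = np.linalg.norm(X_train - x0, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]               # indices of the k closest observations
    return y_train[nearest].mean()                    # average of their outcomes

# Toy data: one input, outcome = sin(x) + noise
rng = np.random.default_rng(1)
X_train = rng.uniform(0, 6, size=(100, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.2, size=100)

print(knn_predict(np.array([3.0]), X_train, y_train, k=1))   # flexible: low bias, high variance
print(knn_predict(np.array([3.0]), X_train, y_train, k=30))  # smoother: lower variance, more bias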

From Least Squares to Nearest Neighbors

Least squares linear regression, in which a linear decision boundary is assumed, is the more rigid of the two and is stable to fit (low variance, high bias). Nearest neighbors, on the other hand, does not make any stringent assumptions about the decision boundary and can adapt to any situation. However, this makes it unstable relative to least squares (high variance, low bias).

As is always said, there is no free lunch in this science: each method has its own advantages and disadvantages depending on the data at hand. These two methods can be enhanced in several ways to develop more advanced methods:

1- In KNN, all k nearest neighbors of an observation have equal weight in the average, but it does not have to be this way. We might instead use weights that decrease smoothly to zero with distance from the target point. We do this with a function called a kernel (see the sketch after this list).

2- If the input space is high-dimensional, distance kernels can be modified to emphasize some variables more than others.

3- Rather than fitting a linear regression to the entire data globally, we can fit linear models locally using locally weighted least squares.

4- We can expand the original inputs into a basis, within which arbitrarily complex models can be developed. The truncated power basis is an important method for this.

5- We can non-linearly transform linear models, as projection pursuit and neural network models do.
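
As a rough illustration of point 1, here is a sketch of a kernel-weighted average (a Nadaraya-Watson type estimate); the Gaussian kernel and the bandwidth value are my own choices, not something prescribed by the book.

import numpy as np

def gaussian_kernel(distances, bandwidth):
    """Weights that decrease smoothly to zero as distance from the target point grows."""
    return np.exp(-0.5 * (distances / bandwidth) ** 2)

def kernel_predict(x0, X_train, y_train, bandwidth=0.5):
    """Weighted average of all training outcomes, with nearby points counting more."""
    distances = np.linalg.norm(X_train - x0, axis=1)
    weights = gaussian_kernel(distances, bandwidth)
    return np.sum(weights * y_train) / np.sum(weights)

rng = np.random.default_rng(2)
X_train = rng.uniform(0, 6, size=(100, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.2, size=100)
print(kernel_predict(np.array([3.0]), X_train, y_train, bandwidth=0.3))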

Statistical Decision Theory

This section of the book is a little bit difficult if you do not know basic probability. The book says that statistical decision theory requires a loss function L(Y, f(X)) for penalizing errors in prediction. This is intuitive: we can quantify how good our model is only when we have some penalty or loss function, and all we do is minimize that loss. Squared error loss is by far the most convenient one:

$$EPE(f) = \mathrm{E}\,\left(Y - f(X)\right)^2 = \int \left[y - f(x)\right]^2 \Pr(dx, dy)$$

E denotes the expected value of its argument, which is also known as the mean. The first equality might be obvious to you, but things get messier in the second. It is actually just a basic property of probability theory. The quantity whose expected value we want is a function of two jointly (continuously) distributed random variables: Y is one random variable, f(X) is the other, and our quantity is a function of the two, Y − f(X). To find its expected value, we integrate it against the joint probability density. For more information, you can visit my notes from the MIT probability course.

If you have revisited probability, you may remember that the joint distribution of two random variables can be split into two parts:

$$\Pr(dx, dy) = \Pr(dy \mid x)\,\Pr(dx)$$

So when you replace Pr(dx, dy) in the original EPE formula above with this split and rearrange the terms, you can obtain:

$$EPE(f) = \int_x \left[\int_y g(x, y)\,\Pr(dy \mid x)\right] \Pr(dx)$$

Here g(x, y) denotes the function of the two random variables, which is [y − f(x)]². Conditioning on X in this way, EPE can be written as:

$$EPE(f) = \mathrm{E}_X\, \mathrm{E}_{Y|X}\!\left([Y - f(X)]^2 \mid X\right)$$

This is again from probability theory; it is specifically the total expectation theorem (or, more precisely, the law of iterated expectations, an abstract version of it). I cannot go over the details of these theorems here, but you can study them yourself using my probability notes. My aim is just to give you enough guidance to understand what is going on. Going back to our discussion: EPE can be minimized pointwise. The solution is the conditional expectation, which is known as the regression function. So, when squared error loss is used, the best prediction of Y at any point X = x is the conditional mean of Y.
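
Spelling out the pointwise minimization in symbols (the standard argument, written in my own notation): for each point x we choose the constant c that minimizes the inner expectation, and a quadratic in c is minimized at the conditional mean, so

$$f(x) = \operatorname*{argmin}_{c}\; \mathrm{E}_{Y|X}\!\left([Y - c]^2 \mid X = x\right) = \mathrm{E}(Y \mid X = x).$$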

The nearest-neighbor methods attempt to directly implement this recipe using the training data. The difference is that expectation is approximated by averaging over sample data, and conditioning at a point is relaxed to conditioning on some region close to the target point. 
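
Written out, the nearest-neighbor approximation is

$$\hat{f}(x) = \mathrm{Ave}\left(y_i \mid x_i \in N_k(x)\right),$$

where N_k(x) is the neighborhood containing the k training points closest to x.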

The two methods end up approximating conditional expectations by averages. But where do they differ? Least squares assumes that f(x) is well approximated by a globally linear function, whereas KNN assumes f(x) is well approximated by a locally constant function. KNN fails in high-dimensional spaces because the nearest neighbors need not be close to the target point, which can result in large errors. That is why KNN is avoided when the input space is high-dimensional.

Function Approximation

All we try to do is approximate the true function (the relationship) between inputs and outputs. Whatever you call it (machine learning, statistical learning, mathematical modelling, and so on), the aim is the same: to find a useful approximation to f(x). We can treat supervised learning as a function estimation problem in this sense.

The majority of approximations have an associated set of parameters θ that we can modify to suit the data at hand. In linear regression those parameters are the betas, β (the coefficients). Linear basis expansions are another class of useful approximations, and the sigmoid transformation common to neural networks is an example of a nonlinear transformation. Least squares can be used to estimate such parameters by minimizing the residual sum of squares. Linear models have a closed-form solution to this minimization problem, but for some models the solution requires iterative methods or numerical optimization.
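
Written out (following the book's notation, with h_k denoting transformations of the input x), a linear basis expansion and the sigmoid transformation look like

$$f_\theta(x) = \sum_{k=1}^{K} h_k(x)\,\theta_k, \qquad h_k(x) = \frac{1}{1 + e^{-x^T \beta_k}} \;\;\text{(sigmoid)},$$

and least squares estimates θ by minimizing the residual sum of squares $RSS(\theta) = \sum_{i=1}^{N} \left(y_i - f_\theta(x_i)\right)^2$ as a function of θ.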

In some situations least squares does not make much sense. A more general principle for estimation is maximum likelihood estimation. If we have observations from a density Pr_θ(y) indexed by some parameters θ, the log-probability of the observed sample is

$$L(\theta) = \sum_{i=1}^{N} \log \Pr_\theta(y_i)$$

Maximum likelihood estimation says we should pick the θ under which the probability of the observed sample is largest. The intuition behind it is that we are more likely to observe high-probability outcomes than low-probability ones, so we assume that what we actually observed must be among the most likely.

Why is maximum likelihood estimation a more general method than least squares? How are they related to each other? Under the assumption of normally distributed error terms, least squares can be derived from maximum likelihood estimation, and hence the two are identical:

Under the additive error model $Y = f_\theta(X) + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$, the log-likelihood of the data is

$$L(\theta) = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - f_\theta(x_i)\right)^2.$$

The only term here that depends on θ is the residual sum of squares, so maximizing this log-likelihood over θ is the same as minimizing the RSS: least squares and maximum likelihood are closely related.
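
As a small numerical check (my own illustration with made-up data; scipy's generic optimizer stands in for a dedicated MLE routine), maximizing the Gaussian log-likelihood over beta gives essentially the same coefficients as least squares:

import numpy as np
from scipy.optimize import minimize

# Toy data: y = 1 + 2x + normal noise
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=80)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=80)

# Least squares solution
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]

# Maximum likelihood under y_i ~ N(x_i^T beta, sigma^2), with sigma held fixed for simplicity
def neg_log_likelihood(beta, sigma=1.5):
    resid = y - X @ beta
    return 0.5 * len(y) * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

beta_ml = minimize(neg_log_likelihood, x0=np.zeros(2)).x

print(beta_ls)   # roughly [1.0, 2.0]
print(beta_ml)   # essentially identical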

MITx: 6.041x – Introduction to Probability – The Science of Uncertainty

I am currently taking a course from MIT called Introduction to Probability – The Science of Uncertainty, and it is one of the best courses I have taken so far. Professor John Tsitsiklis and the rest of the course staff do a great job. The way the subjects are taught is quite different from how they are taught in the Turkish education system, which made me realize what our system lacks: questioning, multiple ways (rather than one absolute way) to find an answer to a question, developing intuition, and application.

Here is some information about the content and aim of the course:

About this course 

The world is full of uncertainty: accidents, storms, unruly financial markets, noisy communications. The world is also full of data. Probabilistic modeling and the related field of statistical inference are the keys to analyzing data and making scientifically sound predictions.

Probabilistic models use the language of mathematics. But instead of relying on the traditional “theorem – proof” format, we develop the material in an intuitive — but still rigorous and mathematically precise — manner. Furthermore, while the applications are multiple and evident, we emphasize the basic concepts and methodologies that are universally applicable.

The course covers all of the basic probability concepts, including:

  • multiple discrete or continuous random variables, expectations, and conditional distributions
  • laws of large numbers
  • the main tools of Bayesian inference methods
  • an introduction to random processes (Poisson processes and Markov chains)

The contents of this course are essentially the same as those of the corresponding MIT class (Probabilistic Systems Analysis and Applied Probability) — a course that has been offered and continuously refined over more than 50 years. It is a challenging class, but it will enable you to apply the tools of probability theory to real-world applications or your research.

The course covers:

  • The basic structure and elements of probabilistic models
  • Random variables, their distributions, means, and variances
  • Probabilistic calculations
  • Inference methods
  • Laws of large numbers and their applications
  • Random processes

https://www.edx.org/course/introduction-probability-science-mitx-6-041x-1#!

Here are my notes:

1-2 Conditioning and Bayes’ rule
-Independence
-Counting: Permutation, Combination, Partitioning

3-Discrete Random Variables:
-Probability mass functions and expectations
-Variance; Conditioning on an event; Multiple r.v.’s
-Conditioning on a random variable; Independence of r.v.’s

4-Continuous Random Variables:
-Probability density functions
-Conditioning on an event; Multiple r.v.’s
-Conditioning on a random variable; Independence; Bayes’ rule

5- Further topics on random variables
-Derived distributions
-Sums of r.v.’s; Covariance and correlation
-Conditional expectation and variance revisited; Sum of a random number of r.v.’s

6- Bayesian inference
-Introduction to Bayesian inference
-Linear models with normal noise
-Least mean squares (LMS) estimation
-Linear least mean squares (LLMS) estimation

7- Limit theorems and classical statistics
-Inequalities, convergence, and the Weak Law of Large Numbers
-The Central Limit Theorem (CLT)
-An introduction to classical statistics

8- Bernoulli and Poisson processes
-The Bernoulli process
-The Poisson process
-More on the Poisson process

9- Markov chains
-Finite-state Markov chains
-Steady-state behavior of Markov chains
-Absorption probabilities and expected time to absorption