The Least Squares Regression Line (and friends)


The idea behind the technique of least squares is as follows. Suppose you were interested in two quantities, call them X and Y, which were supposed to be related via a linear function Y = α + βX. Suppose that α and β are unknown, so you decide to make some measurements (X_1, Y_1), (X_2, Y_2), ... of X and Y. Suppose, however, that as you measure the quantities, some errors are introduced, so the measured values Y_i are related to the measured values X_i via Y_i = α + βX_i + E_i, where E_i is the (unknown) error in measurement i. How can you find α and β?

This is, in fact, a remarkably common situation. The problem occurs in biology, psychology, physics, chemistry, and indeed all the pure and applied sciences, as well as in economics, business, medicine, nutrition... the list is endless.

The most common technique used to find approximations for α and β is as follows. Suppose we guessed two values a and b for α and β. Suppose then we worked out a + bX_i for each of the X_i, and then found the difference between the measured Y_i and the calculated a + bX_i, say r_i = Y_i - (a + bX_i); these differences are called the residuals. It stands to reason that if the a and b we guessed were correct, then most of the r_i should be reasonably small. So to find a good approximation to α and β, we should find the a and b that make the r_i collectively the smallest in some sense.

How can this be done? The most common method is to calculate the sum of the squares of the residuals,

S = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (Y_i - a - bX_i)^2,

and then find the a and b which make S as small as possible. This technique is called least squares regression, and the line obtained, say Y = A + BX, is called the least squares line, the line of best fit, or the line of regression.
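To make S concrete, here is a minimal sketch in Java (the class and method names are my own, not from any standard library) that computes the sum of squared residuals for a candidate line y = a + bx:

    public class ResidualDemo {
        /** Sum of squared residuals S for the candidate line y = a + b*x. */
        static double sumOfSquares(double[] x, double[] y, double a, double b) {
            double s = 0;
            for (int i = 0; i < x.length; i++) {
                double r = y[i] - (a + b * x[i]); // residual r_i
                s += r * r;                       // accumulate r_i^2
            }
            return s;
        }

        public static void main(String[] args) {
            double[] x = { 0, 1, 2, 3 };
            double[] y = { 1.1, 2.9, 5.2, 6.8 }; // roughly y = 1 + 2x, plus noise
            System.out.println(sumOfSquares(x, y, 1, 2)); // near the data: small S
            System.out.println(sumOfSquares(x, y, 0, 0)); // poor guess: much larger S
        }
    }

Least squares regression picks out the a and b for which this quantity is smallest.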

Why choose least squares? Why not least fourth powers, or least absolute values, or something else? It's a secret, but the main reason is that... it makes the mathematics a lot simpler. There's no deep philosophical reason to prefer least squares regression over any other kind of regression one might invent, unless "it's easy to use" is a deep philosophical reason.

The fact is, minimizing S is an easy calculus problem: set the partial derivatives ∂S/∂a and ∂S/∂b to zero, and solve the resulting pair of linear equations for a and b. A bright pre-calculus student could perhaps even do it without using any calculus at all. The values of A and B obtained are

B = \frac{n\sum(X_i Y_i) - (\sum Y_i)(\sum X_i)}{n\sum(X_i^2) - (\sum X_i)^2} \qquad \text{and} \qquad A = \frac{\sum Y_i}{n} - B\,\frac{\sum X_i}{n}.
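These formulas translate directly into code. Here is a minimal, self-contained sketch in Java (again, the names are mine, not from any library) that computes A and B from paired samples:

    public class LeastSquares {
        /** Returns {A, B} such that Y ≈ A + B*X in the least squares sense. */
        static double[] fitLine(double[] x, double[] y) {
            int n = x.length;
            double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
            for (int i = 0; i < n; i++) {
                sumX  += x[i];
                sumY  += y[i];
                sumXY += x[i] * y[i];
                sumXX += x[i] * x[i];
            }
            double b = (n * sumXY - sumY * sumX) / (n * sumXX - sumX * sumX);
            double a = sumY / n - b * sumX / n;
            return new double[] { a, b };
        }

        public static void main(String[] args) {
            double[] x = { 0, 1, 2, 3, 4 };
            double[] y = { 1.1, 2.9, 5.2, 6.8, 9.1 }; // scattered around y = 1 + 2x
            double[] ab = fitLine(x, y);
            System.out.printf("A = %.3f, B = %.3f%n", ab[0], ab[1]); // near 1 and 2
        }
    }

Run on points scattered around Y = 1 + 2X, as above, it prints values of A and B close to 1 and 2.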

The DotPlacer Applet (on this site) allows you to place a set of points on the screen, and have it draw the least squares line calculated from the points. You can even move the points around, and watch the line change. Try it!

If you suspect that X and Y do not follow a linear relationship, but instead (for example) Y = α + βX + γX^2, you can apply very similar techniques to the above to obtain the least squares quadratic. Or, if you suspect the relationship is that of a cubic polynomial, or an exponential curve, or whatever, there are least squares techniques available; a sketch of the polynomial case appears below. The DotPlacer Applet will also draw least squares polynomials for you, up to degree 4.
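As a sketch of how the polynomial case can be handled (this is the standard normal-equations approach written out by hand, and the names are again my own): fitting Y ≈ c_0 + c_1 X + ... + c_d X^d by least squares leads to a system of d + 1 linear equations in the coefficients, which the code below builds and solves by Gaussian elimination.

    public class PolyFit {
        /**
         * Least squares polynomial of degree d: returns c such that y is
         * approximated by c[0] + c[1]*x + ... + c[d]*x^d. Builds the normal
         * equations and solves them by Gaussian elimination with pivoting.
         */
        static double[] fitPolynomial(double[] x, double[] y, int d) {
            int m = d + 1;
            double[][] M = new double[m][m + 1]; // augmented matrix [X^T X | X^T y]
            for (int r = 0; r < m; r++) {
                for (int c = 0; c < m; c++)
                    for (int i = 0; i < x.length; i++)
                        M[r][c] += Math.pow(x[i], r + c); // sum of x^(r+c)
                for (int i = 0; i < x.length; i++)
                    M[r][m] += y[i] * Math.pow(x[i], r);  // sum of y * x^r
            }
            for (int p = 0; p < m; p++) {                 // forward elimination
                int best = p;
                for (int r = p + 1; r < m; r++)
                    if (Math.abs(M[r][p]) > Math.abs(M[best][p])) best = r;
                double[] t = M[p]; M[p] = M[best]; M[best] = t;
                for (int r = p + 1; r < m; r++) {
                    double f = M[r][p] / M[p][p];
                    for (int c = p; c <= m; c++) M[r][c] -= f * M[p][c];
                }
            }
            double[] coef = new double[m];                // back substitution
            for (int r = m - 1; r >= 0; r--) {
                double s = M[r][m];
                for (int c = r + 1; c < m; c++) s -= M[r][c] * coef[c];
                coef[r] = s / M[r][r];
            }
            return coef;
        }

        public static void main(String[] args) {
            double[] x = { -2, -1, 0, 1, 2 };
            double[] y = { 5.1, 1.9, 1.0, 2.1, 4.9 }; // near y = 1 + x^2
            double[] c = fitPolynomial(x, y, 2);
            System.out.printf("c0 = %.3f, c1 = %.3f, c2 = %.3f%n", c[0], c[1], c[2]);
        }
    }

For modest degrees and well-spread data this works fine; for high degrees the normal equations become numerically delicate, and production code typically uses more careful methods such as QR factorization.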

For more information on least squares regression, see this World of Mathematics article. Or else, you may like to consider the book displayed on the left. It is a well written book on statistics, including of course regression. It shows not just the how of the techniques, but explains also the why. It would be good for a person seeking to seriously apply statistics, for example regression, to a problem at hand. The book on the right, with code samples and good explanations, is regarded by many as one of the best books around for applying Java (and Smalltalk) to numerical methods.
