Let us say you have 10 observations of 2 different variables. How do you determine which of the observations to use? Should you throw out the outliers? Should you only include the most similar values? Do more observations increase or decrease the amount of measurement error?
These problems can be answered by the discipline of Statistics. An interesting book by Stigler recounts The History of Statistics. Astronomers led many of the statistical advances in the seventeenth and eighteenth centuries, since accurate measurement is essential to their work. Further, observations of the circumference and oblateness of the earth were made at different times and places throughout history, which left a conundrum of how best to combine these observations.
Mayer, Boscovich, and others contributed to the development of the idea of least squares, but Stigler credits Legendre with its invention. Legendre came up with the idea in his attempt to measure the length of the meridian quadrant (the distance from the equator to the North Pole) through Paris.
To demonstrate some of his ideas, I will use a simpler example. Let us assume that a drug can have a dosage level between 0 and 5 and we want to find its impact on health (measured on a 0-10 scale). Let us look at the following data. The goal is to find the parameters m (slope) and b (intercept) that accurately measure the relationship between drug dosage and health (ignore any questions of endogeneity). Should we include all 10 observations?
Although Euler recognized that including more observations increases the maximum possible error, Legendre realized that adding more observations also greatly increased the probability of getting close to the true value of the parameters of interest.
In my example, we need to fit a line to measure the parameters m and b. How do we set up the errors so that we have the most accurate calculations? Laplace believed that the following two conditions would need to hold:
- Σ_i Dosage_i · e_i = 0
- Σ_i |Dosage_i · e_i| = minimum
The first condition essentially says that the errors are uncorrelated with the independent variables on average. The second condition aims to minimize the overall size of the errors. Legendre extended Laplace’s second condition to minimize the sum of the squared errors rather than the sum of the absolute errors.
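As a minimal sketch of what Legendre’s criterion implies in practice (the function and variable names here are mine, not anything from the original example), minimizing the sum of squared errors for a line health ≈ m·dosage + b has a simple closed-form solution:

```python
# Minimal least-squares sketch: choose m and b to minimize
# sum_i (health_i - (m * dosage_i + b))^2.  Setting the partial derivatives
# with respect to m and b to zero gives the closed form below.
# Names are illustrative placeholders, not from the book or the example data.

def least_squares_fit(dosage, health):
    n = len(dosage)
    x_bar = sum(dosage) / n
    y_bar = sum(health) / n
    # Slope: covariance of dosage and health divided by the variance of dosage.
    m = sum((x - x_bar) * (y - y_bar) for x, y in zip(dosage, health)) \
        / sum((x - x_bar) ** 2 for x in dosage)
    # Intercept: chosen so the fitted line passes through (x_bar, y_bar).
    b = y_bar - m * x_bar
    return m, b
```

Note that the intercept formula, b = y_bar - m·x_bar, is precisely the “center of gravity” property discussed next.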
Another key point is that this regression line must go through the “center of gravity.” In my example, the average dosage for the ten observations is 2.2 and the average health level is 5.9. This means the center of gravity is at the coordinates (2.2, 5.9). The solution in my example is m=1.1456 and b=3.3797. We see that if we plug 2.2 into the equation, the output is 5.9; thus, the regression line does indeed go through the center of gravity.
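Plugging in the numbers quoted above makes the check explicit:

```python
m, b = 1.1456, 3.3797    # fitted slope and intercept from the example
x_bar, y_bar = 2.2, 5.9  # average dosage and average health

# Evaluate the fitted line at the average dosage.
prediction_at_mean = m * x_bar + b   # 1.1456 * 2.2 + 3.3797 = 5.90002

# Up to rounding in the reported coefficients, this recovers the average health.
print(round(prediction_at_mean, 2))  # 5.9
```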
Understanding the historical development of modern statistical techniques is an interesting task, and Stigler’s book enlightens the reader with much detail.
- SM Stigler (1986) “The History of Statistics: The Measurement of Uncertainty before 1900” Belknap, Harvard, 410 pages.
It’s also worth noting one important (and historical) reason that least-squares regression is so much more commonly employed than least-absolute-errors regression (or any other regression): the calculus is easier. Computation of the LS regression is straightforward, easy to program and quick to compute. In contrast, LAE regression is an iterative process with a few possible pitfalls along the way (non-unique solutions, for instance).
However, from a practical standpoint, LS regression not only pays more attention to points that lie farther from the regression line, it actually emphasizes them! Whether squared errors are even an appropriate measure of model fit often goes unquestioned. While there are good theoretical reasons to prefer LS regression in some circumstances, there are also historical reasons for its popularity.
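To make the computational contrast concrete, here is a hedged sketch that fits a line under both criteria: the least-squares fit comes straight from NumPy’s linear solver, while the least-absolute-errors fit is handed to SciPy’s general-purpose Nelder-Mead minimizer (a stand-in for a dedicated quantile-regression routine). The data arrays are hypothetical placeholders, not the observations from the example above.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical placeholder data -- substitute real observations here.
dosage = np.array([0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 5.0])
health = np.array([3.5, 4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 9.0])

# Least squares: solved in one shot from the normal equations.
X = np.column_stack([dosage, np.ones_like(dosage)])
(m_ls, b_ls), *_ = np.linalg.lstsq(X, health, rcond=None)

# Least absolute errors: no closed form, so we iterate from a starting guess.
# The objective is non-smooth, and the minimizer need not be unique.
def sum_abs_errors(params):
    m, b = params
    return np.abs(health - (m * dosage + b)).sum()

result = minimize(sum_abs_errors, x0=[m_ls, b_ls], method="Nelder-Mead")
m_lae, b_lae = result.x

print(f"LS : m={m_ls:.3f}, b={b_ls:.3f}")
print(f"LAE: m={m_lae:.3f}, b={b_lae:.3f}")
```

Because the squared-error objective grows quadratically with the residual, the LS line is pulled more strongly toward outlying observations than the LAE line is.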