Mount Everest is not cold (controlling for altitude and latitude)
The real Everest regression fallacy
Most researchers don't seem to think carefully about what they put in their regression models as statistical controls. By controlling for a variable, you attempt to hold that variable at a constant value while letting another variable vary to see what effect that has. If you include multiple controls, these are all held constant insofar as the dataset allows. The usual interpretation of the model results assumes a simple causal model, namely that the dependent variable is caused directly by everything else, and that these other variables are neither caused by the dependent variable nor by each other. If some of the predictors have causal relationships among each other, then the causally upstream variables will have their effect sizes reduced to whatever part of their effect isn't mediated by the other variables. For instance, if you predict the wealth of a person as a function of their income and sex, you might find that men aren't wealthier than women (controlling for income). In other words, men are wealthier, but not if you compare men and women with the same income, or alternatively, income differences explain all of the sex difference in wealth (I don't know whether this is true; it's just an example).
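The wealth example can be sketched with simulated data. Everything here is made up for illustration: the effect sizes, the sample size, and the small `ols` helper (a plain normal-equations solver, written out only so the sketch needs nothing beyond the standard library). Sex affects wealth only through income, so the sex coefficient should be large when income is left out and close to zero when income is controlled for.

```python
import random

def ols(X, y):
    """OLS coefficients [intercept, b1, ...] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(A[i][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for i in range(k):
            if i != c:
                f = A[i][c]
                A[i] = [vi - f * vc for vi, vc in zip(A[i], A[c])]
    return [A[i][k] for i in range(k)]

random.seed(1)
n = 5000

# ASSUMPTION: invented causal model, sex -> income -> wealth, no direct path.
male = [random.randint(0, 1) for _ in range(n)]
income = [30 + 10 * m + random.gauss(0, 5) for m in male]
wealth = [5 * inc + random.gauss(0, 20) for inc in income]

# Wealth on sex alone: picks up the total effect (10 * 5 = 50 by construction).
b_total = ols([[m] for m in male], wealth)

# Wealth on sex and income: sex keeps only its direct effect (0 by construction).
b_direct = ols([[m, inc] for m, inc in zip(male, income)], wealth)

print("sex coefficient, no controls:       ", round(b_total[1], 2))
print("sex coefficient, income controlled: ", round(b_direct[1], 2))
print("income coefficient:                 ", round(b_direct[2], 2))
```

Whether the second model is the "right" one depends on the question being asked: it answers "is there a sex difference among people with the same income?", not "are men wealthier?".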
In many cases, researchers add just about every other variable they have to their regression models as controls, the so-called kitchen sink approach. They then interpret the resulting coefficients as causal effects in some important sense. We saw an example of this yesterday with the German migrant crime data. Areas with more migrants aren't more criminal than average (controlling for the unemployment rate, the German crime rate, the male %, and the mean age). This controls for too much for the results to tell us anything about the causality of interest: obviously, migrants in areas with low unemployment and low-crime Germans aren't typical migrants, so it's improper to control for these factors. This error of controlling for problematic variables is sometimes called the Mount Everest regression fallacy (a few others have named other Everest fallacies unrelated to this one): Mount Everest is not cold (controlling for altitude). The idea has been mentioned a few times (it seems to be from Garett Jones), but no one seems to have shown empirically that it is true. So let's try!
The data:
There is a dataset of 10k towns, each with a population of at least 10k, here. It is based on Wikipedia and includes somewhat dirty climate data, but not altitude.
I downloaded worldwide elevation (altitude) and annual temperature data from here. Then I looked up the location of each town, plus Mount Everest. This is not entirely precise because the values are based on a grid of the Earth with cells of 1 km². In this grid, Mount Everest has an altitude of 8424 m rather than the 8848 m of its peak.
First, the correlations among the variables:
Temperature is predicted mainly by latitude, not altitude. This is because towns at high altitude tend to be at warmer latitudes (altitude correlates r = -0.24 with absolute latitude), presumably reflecting where humans choose to settle.
The pairwise relationships look like this:
Mount Everest is a negative outlier for each of the simple relationships. Let's try their combination:
Altitude by itself explains just about nothing (r = -0.04), absolute latitude explains quite a lot (r = -0.88, 77% of the variance), their combination even more (83.6% of the variance), and the interaction adds a bit more (84.1%). So, is Mount Everest actually cold (controlling for latitude and altitude)?
No, it turns out that Mount Everest is actually warm (controlling for altitude and latitude and their interaction). It is a positive outlier, so it is a few degrees warmer than expected!
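The mechanics behind "cold, controlling for altitude and latitude" can be sketched as follows. This uses synthetic towns, not the real dataset, and every number in the generating model is invented; the `ols` and `r_squared` helpers are illustrative stdlib-only stand-ins for a regression package. The sketch fits the same four nested models and then checks the residual (observed minus predicted temperature) of a summit-like point. In this made-up world the summit is extremely cold in raw terms but has a residual near zero, which is the fallacy in a nutshell. It does not reproduce the positive residual found in the real data.

```python
import random

def ols(X, y):
    """OLS coefficients [intercept, b1, ...] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda i: abs(A[i][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for i in range(k):
            if i != c:
                f = A[i][c]
                A[i] = [vi - f * vc for vi, vc in zip(A[i], A[c])]
    return [A[i][k] for i in range(k)]

def predict(b, row):
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], row))

def r_squared(X, y):
    b = ols(X, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - predict(b, r)) ** 2 for r, yi in zip(X, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

random.seed(2)
n = 5000

# ASSUMPTION: invented data. Towns sit lower the further they are from the
# equator, and temperature falls with both absolute latitude and altitude.
lat = [random.uniform(0, 65) for _ in range(n)]             # absolute latitude
alt = [random.expovariate(1 / (0.6 - 0.006 * l)) for l in lat]  # altitude, km
temp = [30 - 0.5 * l - 6.5 * a + random.gauss(0, 2) for l, a in zip(lat, alt)]

r2_alt = r_squared([[a] for a in alt], temp)
r2_lat = r_squared([[l] for l in lat], temp)
r2_both = r_squared([[l, a] for l, a in zip(lat, alt)], temp)
r2_inter = r_squared([[l, a, l * a] for l, a in zip(lat, alt)], temp)
print("R2 altitude only:   ", round(r2_alt, 3))
print("R2 latitude only:   ", round(r2_lat, 3))
print("R2 both:            ", round(r2_both, 3))
print("R2 both+interaction:", round(r2_inter, 3))

# A summit-like point: grid altitude from the post, temperature taken from
# the invented model above (NOT a measured value).
summit_lat, summit_alt = 27.99, 8.424
summit_temp = 30 - 0.5 * summit_lat - 6.5 * summit_alt

b = ols([[l, a] for l, a in zip(lat, alt)], temp)
residual = summit_temp - predict(b, [summit_lat, summit_alt])
mean_town_temp = sum(temp) / n
print("raw gap to mean town temperature:", round(summit_temp - mean_town_temp, 1))
print("residual after controls:         ", round(residual, 2))
```

Note that the summit lies far outside the altitude range of the towns, so the prediction is an extrapolation of the fitted line, and the sign of the residual in real data depends heavily on how well that extrapolation holds.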
Garett Jones tells us that he got the idea from Joel Schneider, who made a similar statement in 2007:
"I Am as Tall as the Rocky Mountains! (After controlling for barometric pressure)
Just because I can use regression to show that the houses in my neighborhood are just as tall as the Rocky Mountains after statistically controlling for barometric pressure at the summits of each house and mountain does not mean that I have proved that there is no true difference in the heights of houses and mountains. It also doesn't prove that houses and mountains would be the same height if we were to equalize the barometric pressure differences. We know that such an analysis is stupid because we know that changing altitude causes air pressure to change and that changing air pressure has no effect on altitude."
http://www.iqscorner.com/2007/05/temp.html?m=1
Great post. I prefer the version, "Mt. Everest has the same elevation as London after controlling for air pressure."
The fallacy seems to be pretending that endogenous variables are exogenous. The reason why such researchers would control for unemployment rate etc. of migrants is that: "If we get them employed then they won't be so criminal." But of course both unemployment and criminality may follow largely from genetic variation.