How exactly does one properly remove the effect of one variable? Residuals, partial correlation, multiple regression, semi-partial?
I ran into trouble trying to remove the effect of one variable on other variables while writing my reanalysis of the Noble et al. (2015) paper. It was not obvious exactly which method to use.
The authors used OLS, i.e. standard linear regression, in their own paper. However, since I wanted to plot the results at the case level, this was not useful to me. Before doing this I did not really understand how partial correlations worked, or what semi-partial correlations were, or how multiple regression works exactly. I still don't understand the last, but the others I will explain with example results.
Generated data
Since we are cool, we use R.
We begin by loading some stuff I need and generating some data:
#load libs
library(pacman) #install this first if you don't have it
p_load(Hmisc, psych, ppcor, MASS, QuantPsyc, devtools)
source_url("https://github.com/Deleetdk/psych2/raw/master/psych2.R")
#Generate data
df = as.data.frame(mvrnorm(200000, mu = c(0,0), Sigma = matrix(c(1,0.50,0.50,1), ncol = 2), empirical=TRUE))
cor(df)[1,2]
[1] 0.5
The correlation is exactly .500, as specified above (empirical = TRUE makes the sample covariance match Sigma exactly).
Then we add a gender-specific effect:
df$gender = as.factor(c(rep("M",100000),rep("F",100000)))
df[1:100000,"V2"] = df[1:100000,"V2"]+1 #add 1 to males
cor(df[1:2])[1,2] #now there's error due to the effect of maleness!
[1] 0.4463193
The correlation has been reduced a bit, as expected. It is time to try to undo this and recover the real correlation of .5.
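A quick back-of-the-envelope check (my addition, not part of the original analysis) of why the correlation drops: adding 1 to the males raises the variance of V2 by roughly the variance of a 50/50 dummy, 0.25, so the attenuated correlation should be about .5/sqrt(1.25) ≈ .447. The small remaining gap is the chance correlation between V1 and gender.
#rough attenuation check
var(df$V2) #≈ 1.25 = 1 + 0.25 from the group effect
0.5 / sqrt(var(df$V2)) #≈ .447, close to the attenuated correlation above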
#multiple regression
model = lm(V1 ~ V2+gender, df) #standard linear model
coef(model)[-1] #unstd. betas
lm.beta(model) #std. betas
V2 genderM
0.4999994 -0.5039354
V2 genderM
0.5589474 -0.2519683
The raw beta coefficient is spot on, but the standardized one is not at all. This certainly makes interpretation more difficult when the std. beta does not correspond in size to the correlation.
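The reason (a check I added, not from the original analysis) is that the standardized beta multiplies the raw beta by sd(V2)/sd(V1), and sd(V2) is inflated by the gender effect to about sqrt(1.25) ≈ 1.118, so .5 × 1.118 ≈ .559:
#why the std. beta is inflated
coef(model)["V2"] * sd(df$V2) / sd(df$V1) #≈ .559, matches the lm.beta() output above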
#split samples
df.genders = split(df, df$gender) #split by gender
cor(df.genders$M[1:2])[1,2] #males
cor(df.genders$F[1:2])[1,2] #females
[1] 0.4987116
[1] 0.5012883
These are both approximately correct. Note, however, that these are not values for the total sample as before, but for each subsample. If the subsamples differ in their correlations, it will show up here clearly, as in the sketch below.
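As a hypothetical illustration (not part of the reanalysis), if males and females had different true correlations, say .5 and .3, splitting would show it directly:
#hypothetical: subsamples with different true correlations
df.m = as.data.frame(mvrnorm(1000, mu = c(0,0), Sigma = matrix(c(1,.5,.5,1), ncol = 2), empirical = TRUE))
df.f = as.data.frame(mvrnorm(1000, mu = c(0,0), Sigma = matrix(c(1,.3,.3,1), ncol = 2), empirical = TRUE))
sapply(list(M = df.m, F = df.f), function(x) cor(x)[1,2]) #.5 and .3 show up clearly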
#partial correlation
df$gender = as.numeric(df$gender) #convert to numeric
partial.r(df, c(1:2),3)[1,2]
[1] 0.5000005
This uses a ready-made function for partial correlations. Partial correlations are calculated from the residuals of both variables: they correlate whatever remains of each variable after everything that can be predicted from the controlling variable has been removed.
#residualize both
df.r2 = residuals.DF(df, "gender")
cor(df.r2[1:2])[1,2]
[1] 0.5000005
This should be the exact same as the above, but done manually. And it is.
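For readers without the psych2 helper, I assume residuals.DF simply regresses each variable on the control and keeps the residuals; a base R sketch of the same idea:
#base R equivalent: residualize both variables on gender, then correlate
V1.r = residuals(lm(V1 ~ gender, df))
V2.r = residuals(lm(V2 ~ gender, df))
cor(V1.r, V2.r) #≈ .5, same as partial.r() above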
#semi-partials
spcor(df)$estimate #hard to interpret output
spcor.test(df$V1, df$V2, df$gender)[1] #partial out gender from V2
spcor.test(df$V2, df$V1, df$gender)[1] #partial out gender from V1
V1 V2 gender
V1 1.0000000 0.4999994 -0.2253951
V2 0.4472691 1.0000000 0.4479416
gender -0.2253103 0.5005629 1.0000000
estimate
1 0.4999994
estimate
1 0.4472691
Semi-partial correlations, a.k.a. part correlations, are the same as above, except that only one of the two variables is residualized by the controlled variable. The code above uses two different ready-made functions for calculating these. The first works on an entire data.frame and outputs semi-partials for all variables, controlling for all other variables. The output is a bit tricky to read, however, and there is no explanation in the help: each row holds the original variable correlated with the residualized variable in each column.

The two spcor.test() calls use the other function, where one has to specify which two variables to correlate and which variables to control for. The second variable passed is the one that gets residualized by the control variables. We see that the results are as they should be. Controlling V1 for gender has about zero effect, because gender has no effect on this variable aside from a very small chance effect (r = -0.00212261). Controlling V2 for gender has the desired effect of returning a value very close to .5, as it should.
#residualize only V2
df.r1.V1 = df.r2 #copy above
df.r1.V1$V1 = df$V1 #fetch orig V1
cor(df.r1.V1[1:2])[1,2]
[1] 0.4999994
#residualize only V1
df.r1.V2 = df.r2 #copy above
df.r1.V2$V2 = df$V2 #fetch orig V2
cor(df.r1.V2[1:2])[1,2]
[1] 0.4472691
These are the two manual ways of doing the same as above. We get the exact same results, so that is good.
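The same can be done in base R without the psych2 helper, residualizing only one variable at a time:
#base R equivalents of the semi-partials
cor(df$V1, residuals(lm(V2 ~ gender, df))) #gender partialed out of V2 only, ≈ .5
cor(residuals(lm(V1 ~ gender, df)), df$V2) #gender partialed out of V1 only, ≈ .447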
So where does this lead us? Apparently, using multiple regression to control for variables is a bad idea, since its standardized betas do not correspond in size to the correlations and are therefore difficult to interpret.