Discover more from Just Emil Kirkegaard Things
How exactly does one properly remove the effect of one variable? Residuals, partial correlation, multiple regression, semi-partial?
I ran into trouble trying to remove the effects of one variable on other variables doing the writing my reanalysis of the Noble et al 2015 paper. It was not completely obvious which exact method to use.
The authors in their own paper used OLS aka. standard linear regression. However, since I wanted to plot the results at the case-level, this was not useful to me. Before doing this I did not really understand how partial correlations worked, or what semi-partial correlations were, or how multiple regression works exactly. I still don't understand the last, but the others I will explain with example results.
Since we are cool, we use R.
We begin by loading some stuff I need and generating some data:
#load libs library(pacman) #install this first if u dont have it p_load(Hmisc, psych, ppcor ,MASS, QuantPsyc, devtools) source_url("https://github.com/Deleetdk/psych2/raw/master/psych2.R") #Generate data df = as.data.frame(mvrnorm(200000, mu = c(0,0), Sigma = matrix(c(1,0.50,0.50,1), ncol = 2), empirical=TRUE)) cor(df)[1,2]
The correlation from this is exactly .500000, as chosen above.
Then we add a gender specific effect:
df$gender = as.factor(c(rep("M",100000),rep("F",100000))) df[1:100000,"V2"] = df[1:100000,"V2"]+1 #add 1 to males cor(df[1:2])[1,2] #now theres error due to effect of maleness!
The correlation has been reduced a bit as expected. It is time to try and undo this and detect the real correlation of .5.
#multiple regression model = lm(V1 ~ V2+gender, df) #standard linear model coef(model)[-1] #unstd. betas lm.beta(model) #std. betas
V2 genderM 0.4999994 -0.5039354
V2 genderM 0.5589474 -0.2519683
The raw beta coefficient is on spot, but the standardized is not at all. This certainly makes interpretation more difficult if the std. beta does not correspond in size with correlations.
#split samples df.genders = split(df, df$gender) #split by gender cor(df.genders$M[1:2])[1,2] #males cor(df.genders$F[1:2])[1,2] #females
 0.4987116  0.5012883
These are both approximately correct. Note however that these are the values not for the total sample as before, but for each subsample. If the sumsamples differ in their correlation with, it will show up here clearly.
#partial correlation df$gender = as.numeric(df$gender) #convert to numeric partial.r(df, c(1:2),3)[1,2]
This is using an already made function for doing partial correlations. Partial correlations are calculated from the residuals of both variables. This means that they correlate whatever remains of the variables after everything that could be predicted from the controlling variable is removed.
#residualize both df.r2 = residuals.DF(df, "gender") cor(df.r2[1:2])[1,2]
This should be the exact same as the above, but done manually. And it is.
#semi-partials spcor(df)$estimate #hard to interpet output spcor.test(df$V1, df$V2, df$gender) #partial out gender from V2 spcor.test(df$V2, df$V1, df$gender) #partial out gender from V1
V1 V2 gender V1 1.0000000 0.4999994 -0.2253951 V2 0.4472691 1.0000000 0.4479416 gender -0.2253103 0.5005629 1.0000000
estimate 1 0.4999994
estimate 1 0.4472691
Semi-partial correlations aka part correlations are the same as above, except that only one of the two variables is residualized by the controlled variable. The above has two different already made functions for calculating these. The first works on an entire data.frame, and outputs semi-partials for all variables controlling for all other variables. The output is a bit tricky to read however, and there is no explanation in the help. One has to read it by looking at each row, this is the original variable correlated with the residualized variables in each col. The two calls below are using the other function one where has to specify which two variables to correlate and which variables to control for. The second variable called is the one that gets residualized by the control variables. We see that the results are as they should be. Controlling V1 for gender has about zero effect. This is because gender has no effect on this variable aside from a very small chance effect (r=-0.00212261). Controlling V2 for gender has the desired effect of returning a value very close to .5, as it should.
#residualize only V2 df.r1.V1= df.r2 #copy above df.r1.V1$V1 = df$V1 #fetch orig V1 cor(df.r1.V1[1:2])[1,2]
#residualize only V1 df.r1.V2= df.r2 #copy above df.r1.V2$V2 = df$V2 #fetch orig V2 cor(df.r1.V2[1:2])[1,2]
These are the two manual ways of doing the same as above. We get the exact same results, so that is good.
So where does this lead us? Well, apparently using multiple regression to control variables is a bad idea since it results in difficult to interpret results.