Ethnic heterogeneity and tail effects

Dec 10, 2015

Chisala has his 3rd installment up: http://www.unz.com/article/closing-the-black-white-iq-gap-debate-part-3/

One idea I had while reading it was that tail effects interact with population ethnic/racial heterogeneity. To show this, I did a simulation experiment. Population 1 is a regular population with a mean of 0 and sd of 1. Population 2 is a composite population of three sub-populations, each with an sd of 1: one with a mean of 0 (80%; "normals"), one with a mean of -1 (10%; "dullards") and one with a mean of 1 (10%; "brights"). Population 3 is a normal population with a mean of 0 but with a slightly increased sd, chosen so that it equals the sd of population 2.
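As a quick check on how much the sd should increase, the sd of the composite population can be worked out directly from the mixture weights and means. This is a minimal sketch, assuming every sub-population has an sd of 1; the names `w`, `mu`, `mix_mean`, `mix_var` and `mix_sd` are made up for illustration.

```r
# Mixture weights and sub-population means as described in the text.
# Assumption: every sub-population has sd = 1.
w  = c(.8, .1, .1)
mu = c(0, 1, -1)

# Mean of the composite population
mix_mean = sum(w * mu)

# Var(mixture) = sum_i w_i * (sd_i^2 + mu_i^2) - mix_mean^2
mix_var = sum(w * (1 + mu^2)) - mix_mean^2
mix_sd  = sqrt(mix_var)

mix_sd  # sqrt(1.2), about 1.095
```

So the composite population should have an sd of about 1.1, a modest increase over the homogeneous population.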

Descriptive stats:

> describe(df, skew = F, ranges = T)
     vars     n mean  sd median trimmed  mad   min  max range se
pop1    1 1e+06    0 1.0      0       0 1.00 -4.88 4.65  9.53  0
pop2    2 1e+06    0 1.1      0       0 1.09 -5.43 5.37 10.80  0
pop3    3 1e+06    0 1.1      0       0 1.09 -5.30 5.13 10.44  0

We see that the sd is increased a bit in the composite population (2), as expected. We also see that the range is somewhat increased, even compared to population 3, which has the same sd.

What do the tails look like?

> sapply(df, percent_cutoff, cutoff = 1:4)
      pop1     pop2     pop3
1 0.158830 0.179495 0.180856
2 0.022903 0.034342 0.034074
3 0.001314 0.003326 0.003126
4 0.000036 0.000160 0.000150

We are looking at the proportions of persons with scores above 1-4 (rows) in each population (cols). What do we see? Populations 2 and 3 have clear advantages over population 1 at every cutoff. Population 2 also has a slight advantage over population 3 from +2 upwards, though not at +1, where population 3 is marginally ahead.
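Since the simulated proportions carry some sampling error, especially at +4 where only a few dozen cases are involved, it may help to compute the same tail proportions exactly from the normal distribution function. The following is a sketch under the assumption that the three populations are exactly as described (sub-population sd of 1, weights 80/10/10); the object names are hypothetical.

```r
cutoff = 1:4

# Composite population: weights and means of the sub-populations (sd = 1 each)
w  = c(.8, .1, .1)
mu = c(0, 1, -1)
mix_sd = sqrt(sum(w * (1 + mu^2)) - sum(w * mu)^2)

# Exact P(X > cutoff) for each population
pop1 = pnorm(cutoff, lower.tail = FALSE)
pop2 = sapply(cutoff, function(k) {
  # weighted sum of the upper-tail probabilities of the sub-populations
  sum(w * pnorm(k, mean = mu, sd = 1, lower.tail = FALSE))
})
pop3 = pnorm(cutoff, sd = mix_sd, lower.tail = FALSE)

exact = data.frame(pop1, pop2, pop3, row.names = cutoff)
round(exact, 6)
```

The exact values can then be compared row by row with the simulated ones to see which differences are real and which are noise.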

Simulation 2

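In the above, the composite population is made out of 3 sub-populations. But what if it were instead made out of 5? Here the composite is 70% with a mean of 0, 10% each with means of 1 and -1, and 5% each with means of 2 and -2.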

Descriptives:

> describe(df, skew = F)
     vars     n mean   sd median trimmed  mad   min  max range se
pop1    1 1e+06    0 1.00      0       0 1.00 -4.88 4.65  9.53  0
pop2    2 1e+06    0 1.27      0       0 1.21 -5.91 6.03 11.94  0
pop3    3 1e+06    0 1.27      0       0 1.26 -6.12 5.92 12.04  0
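The range in the table above depends only on the two most extreme draws in each column, so it can move around a fair amount from one sample to the next. A rough way to see this is to redraw the same population several times and look at how the range varies. This is only a sketch: the sample size (`n_small`) and number of repetitions are arbitrary choices, smaller than the 1e6 used in the post so that it runs quickly.

```r
set.seed(1)

n_small = 1e5   # assumption: smaller than the post's n to keep this fast
reps    = 20    # assumption: arbitrary number of repeated samples

# Range (max - min) of a standard normal sample, recomputed for each redraw
ranges = replicate(reps, diff(range(rnorm(n_small))))

summary(ranges)
sd(ranges)
```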

The sd is clearly increased. There is not much difference in the range between populations 2 and 3, but the range depends only on the most extreme observations and is therefore very susceptible to sampling error. What do the tails look like?

> sapply(df, percent_cutoff, cutoff = 1:4)
      pop1     pop2     pop3
1 0.158830 0.205814 0.214353
2 0.022903 0.057077 0.056874
3 0.001314 0.011057 0.008872
4 0.000036 0.001246 0.000804

We see strong effects. At the +3 level, there are roughly 8x as many persons in the composite population as in population 1 (0.011057 vs. 0.001314), and at the +4 level roughly 35x as many. Population 3 also has more than population 1, but clearly fewer than the composite population at these levels.
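The ratios behind these comparisons can be read off the table directly. The snippet below just copies the simulation 2 proportions from the output above and divides the columns; the names `tails` and `ratios` are made up for illustration.

```r
# Proportions above each cutoff, copied from the simulation 2 output
tails = data.frame(
  pop1 = c(0.158830, 0.022903, 0.001314, 0.000036),
  pop2 = c(0.205814, 0.057077, 0.011057, 0.001246),
  pop3 = c(0.214353, 0.056874, 0.008872, 0.000804),
  row.names = 1:4
)

# How many times more common is each tail, population by population?
ratios = data.frame(
  pop2_vs_pop1 = tails$pop2 / tails$pop1,
  pop3_vs_pop1 = tails$pop3 / tails$pop1,
  pop2_vs_pop3 = tails$pop2 / tails$pop3,
  row.names = 1:4
)
round(ratios, 2)
```

Keep in mind that the +4 row for population 1 rests on only 36 cases out of a million, so the ratios in that row are noisy.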

We can conclude that one must take heterogeneity of populations into account when thinking about the tails.

R code

You can re-do the experiment yourself with this code, or try out some other numbers.

library(pacman)
# psych provides describe(); percent_cutoff() is taken to come from kirkegaard
p_load(reshape, kirkegaard, psych)

# sample size per population
n = 1e6

# first simulation --------------------------------------------------------
set.seed(1)
{
# pop1: homogeneous, mean 0, sd 1
pop1 = rnorm(n)
# pop2: composite of 80% mean 0, 10% mean 1, 10% mean -1 (sd 1 each)
pop2 = c(rnorm(n*.8), rnorm(n*.1, 1), rnorm(n*.1, -1))
# pop3: homogeneous, mean 0, same sd as pop2
pop3 = rnorm(n, sd = sd(pop2))
}

#df
df = data.frame(pop1, pop2, pop3)

#stats
describe(df, skew = FALSE)
sapply(df, percent_cutoff, cutoff = 1:4)

# second simulation -------------------------------------------------------
set.seed(1)
{
pop1 = rnorm(n)
# pop2: composite of 70% mean 0, 10% each mean 1 and -1, 5% each mean 2 and -2
pop2 = c(rnorm(n*.70), rnorm(n*.10, 1), rnorm(n*.10, -1), rnorm(n*.05, 2), rnorm(n*.05, -2))
pop3 = rnorm(n, sd = sd(pop2))
}

#df
df = data.frame(pop1, pop2, pop3)

#stats
describe(df, skew = FALSE)
sapply(df, percent_cutoff, cutoff = 1:4)
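If you don't have the kirkegaard package installed, a small base-R helper can stand in for percent_cutoff. Note that `tail_prop` is a hypothetical replacement written for this purpose, not the package's own function; it assumes, based on the output shown above, that the quantity wanted is the proportion of values strictly above each cutoff.

```r
# Hypothetical base-R stand-in for percent_cutoff():
# proportion of values in x strictly above each cutoff.
tail_prop = function(x, cutoff) {
  out = sapply(cutoff, function(k) mean(x > k))
  names(out) = cutoff
  out
}

# Small usage example on a standard normal sample
set.seed(1)
x = rnorm(1e5)
p = tail_prop(x, cutoff = 1:4)
p
```

With the data frame from the simulations, `sapply(df, tail_prop, cutoff = 1:4)` should then give a table with the same layout as the ones shown earlier.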

Just Emil Kirkegaard Things
