Direct estimate of between group heritability with estimated polygenic scores
A while ago, I had an idea for a direct estimate of the between group heritability from polygenic scores. The idea is based on the old DeFries (1972) paper, which had the following equation:
where h2B is the between group heritability, h2W is the within group heritability, t is the intraclass phenotypic correlation, and r is the intraclass genotypic correlation. I have shown thru detailed simulations that this equation works for polygenic group differences, i.e. produces correct results: http://rpubs.com/EmilOWK/how_to_genetic_group_differences I undertook this simulation work mainly because Eric Turkheimer was (again) talking ideologically:
For all the hereditarians’ idle intuitions about differences being part genetic and part environmental, where is the empirical or quantitative theory that describes how this apportioning is supposed to work? There is no such thing as a “group heritability coefficient,” no way to put any meat on the speculative bones about partial genetic determination.
However, a limitation of the above function is that one has access to true polygenic scores, not the noisy estimates we have. In fact, it is better to be very clear on the question. We can think of polygenic scores and error according to the classical psychometric measurement model: true polygenic score = estimated polygenic score + error (that is, random error). In this equation, true polygenic score is the genetic potential for the trait in question. Since we in practice only have estimates, and our estimates are quite noisy, one will have to adjust for this to estimate the r value for the equation. Not doing the correction just deflates the BGH estimate because the r is in the numerator. This is the same situation which was encountered in the recent studies of dysgenic selection based on our current PGSs, and which one correct for the 'missing heritability' i.e. the measurement error in the PGS estimate. See:
Kong, A., Frigge, M. L., Thorleifsson, G., Stefansson, H., Young, A. I., Zink, F., ... & Gudbjartsson, D. F. (2017). Selection against variants in the genome associated with educational attainment. Proceedings of the National Academy of Sciences, 114(5), E727-E732.
Most importantly, POLYEDU is just a fraction of the full genetic component of educational attainment, which we denote by POLYFULL. It is the rate of change of POLYFULL that is of ultimate interest. Under an assumption that the part of POLYFULL that is not captured by POLYEDU behaves in a similar fashion in its impact on reproduction, the rate of change is proportional to the square root of the variance explained (SI Text). Thus, if POLYFULL is assumed to account for 30% of the variance of EDU, then its estimated rate of change, by extrapolation, is −0.010 × (30/3.74)1/2 = −0.028 SUs per decade. To test the validity of this method of extrapolation we computed a separate polygenic score for educational attainment, denoted by POLY-U.K.B, which was based on the same GWAS results used to construct POLYEDU, except that the contribution from 111,349 UK Biobank samples was removed (Materials and Methods). When we applied POLY-U.K.B to the Icelandic data, it explained 2.52% of the variance of EDU, and the rate of decline estimated based on its effects on reproduction is −0.0085 SU per decade (Materials and Methods). Hence, with the polygenic score strengthening from POLY-U.K.B to POLYEDU, the estimated rate of decline increased by a factor of (0.0104/0.0085) = 1.22, nearly identical to (3.74/2.52)1/2 = 1.22, the square root of the variance explained ratio.
In our case above, we need to adjust an intraclass correlation for measurement error. I haven't found anyone who knew about to do this, but since intraclass correlations are merely ANOVAs in disguise, and ANOVAs are just linear regression models in disguise, perhaps one can find a method from the econometrics literature. Solution of this issue requires someone with better math stats abilities than I possess (suggestions very welcome!).
Supposing one solves the measurement error problem above, the second issue is that current PGSs are not equally valid across the groups we are interested in. Though I haven't simulated this situation (yet?), I'm sure this will result in estimation bias of the BGH value. Specifically, lower validity of our estimated PGS in one group will bias the intraclass correlation towards zero. I think the approach here is to utility an estimated PGS that does not suffer from group predictive bias in the classical test theory sense of equal slopes and intercepts. Currently, one could try Lee et al 2018's putatively causal PGS estimate (see Section 5 in their supplementary materials) and see if this produces unbiased predictions within groups of interest. Testing this requires a lot of sample size, which is why we have not yet done it. The TCP dataset, which we now have access to, will probably allow a test of this idea since it has ~3,000 blacks and ~4,500 whites. Assuming one can find an estimated PGS that shows no predictive bias by group, then one can adjust it for (random) measurement error as discussed above, and then finally get a first direct estimate of the BGH for the trait and groups in question. We are working on a series of ideas using the TCP dataset, hopefully multiple of them to be finished in 2019, so stay tuned! :)
Reversely, if one is willing to grant that the genetic architecture of human polygenic traits is the same across race groups (supported by simulations done using 1000 genomes; Zanetti and Weale 2018), one can use this fact to identify the causal variants by giving more probability of causality to variants that have no predictive bias across ancestry groups. This is the general reasoning behind multi-ethnic GWASs, which have been receiving increasing attention recently, e.g.: Klarin et al 2018, Roselli et al 2018, Li et al 2018 (and many others). In fact, Africans are the best population to use because they have the weakest LD between variants (because they didn't experience out of Africa population bottlenecks that decrease variation/increase LD), making it easier to spot the true causal variant among the correlated variants.