Regression Modeling Strategies (2nd ed.) - Frank Harrell (review)

Jun 22, 2017

https://www.goodreads.com/book/show/10753824-regression-modeling-strategies

I heard some good things about this book, and some of it is good. The general approach outlined in the introduction is certainly sound. Harrell sets up the following principles:

Satisfaction of model assumptions improves precision and increases statistical power.
It is more productive to make a model fit step by step (e.g., transformation estimation) than to postulate a simple model and find out what went wrong.
Graphical methods should be married to formal inference.
Overfitting occurs frequently, so data reduction and model validation are important.
In most research projects, the cost of data collection far outweighs the cost of data analysis, so it is important to use the most efficient and accurate modeling techniques, to avoid categorizing continuous variables, and to not remove data from the estimation sample just to be able to validate the model.
The bootstrap is a breakthrough for statistical modeling, and the analyst should use it for many steps of the modeling strategy, including derivation of distribution-free confidence intervals and estimation of optimism in model fit that takes into account variations caused by the modeling strategy.
Imputation of missing data is better than discarding incomplete observations.
Variance often dominates bias, so biased methods such as penalized maximum likelihood estimation yield models that have a greater chance of accurately predicting future observations.
Software without multiple facilities for assessing and fixing model fit may only seem to be user-friendly.
Carefully fitting an improper model is better than badly fitting (and overfitting) a well-chosen one.
Methods that work for all types of regression models are the most valuable.
Using the data to guide the data analysis is almost as dangerous as not doing so.
There are benefits to modeling by deciding how many degrees of freedom (i.e., number of regression parameters) can be “spent,” deciding where they should be spent, and then spending them.

Readers will recognize many of these from my writings. Not mentioned in the principles is that the book takes a somewhat anti-P-value stance (roughly 'they have some uses but are widely misused, so beware!') and a pro-effect-size-estimation stance. Some of the chapters do follow these principles, but IMO the majority of the book does not. Mostly it is about endless variations on testing for non-linear effects of predictors, whereas in real life a lot of predictors will be boringly linear. There's some decent stuff about overfitting, bootstrapping and penalized regression, but these topics have been done better elsewhere (read Introduction to Statistical Learning). I did learn some new things, including on the applied side (e.g. the ease of applying cubic splines, something that would have been useful for this study), and the book comes with a complimentary R package (rms), so one can apply the ideas to one's own research immediately. On the other hand, most of the graphics in the book are terrible base R plots, and only some are ggplot2.
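For readers who have not met the bootstrap "estimation of optimism in model fit" from the principles list, here is a minimal sketch of the idea in plain Python. It is my own toy illustration, not code from the book: the function names (`fit_line`, `r_squared`, `optimism_corrected_r2`), the simulated data and the choice of 200 resamples are all assumptions made for the example, and it handles only a single predictor. The rms package has a `validate` function for doing this kind of correction on real models in R.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def r_squared(xs, ys, a, b):
    """R^2 of the line (a, b) evaluated on the data (xs, ys)."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def optimism_corrected_r2(xs, ys, n_boot=200, seed=0):
    """Return (apparent R^2, optimism-corrected R^2)."""
    rng = random.Random(seed)
    n = len(xs)
    a, b = fit_line(xs, ys)
    apparent = r_squared(xs, ys, a, b)
    total_optimism = 0.0
    for _ in range(n_boot):
        # Resample rows with replacement and refit the model.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        a_b, b_b = fit_line(bx, by)
        # Optimism: fit on the resample minus fit of the same line on the original data.
        total_optimism += r_squared(bx, by, a_b, b_b) - r_squared(xs, ys, a_b, b_b)
    return apparent, apparent - total_optimism / n_boot

# Simulated toy data (assumed for illustration): weak linear signal plus noise.
data_rng = random.Random(1)
xs = [data_rng.gauss(0, 1) for _ in range(40)]
ys = [0.5 * x + data_rng.gauss(0, 1) for x in xs]

apparent, corrected = optimism_corrected_r2(xs, ys)
print(f"apparent R^2 = {apparent:.3f}, optimism-corrected R^2 = {corrected:.3f}")
```

The point of the design is that no data are held out: every observation is used for fitting, and the bootstrap estimates how much the apparent fit flatters the model. With one predictor the correction is typically small; it grows as more degrees of freedom are spent relative to the sample size.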

This edition (published in 2015; the first edition came out in 2001) needed more work before publication, but it is still worth reading for people with an interest in post-replication-crisis statistics. Frank Harrell should team up with Andy Field, who's a much better writer, and with someone with good ggplot2 skills (throw in Shiny too for extra quality). Then they could write a really good stats book.

Just Emil Kirkegaard Things
