Self-teaching stats with R in 2020
See the 2019 post on introductions to psychology etc. for broader coverage. Anon asks me today:
What would be the best way to quickly self-teach enough statistics to be able to evaluate academic papers? I have the basic concepts down but I've never taken an actual stats class.
The glib reply is: l2code.
For reals though. But first, read some very basics:
Spiegelhalter, D. (2019). The art of statistics: learning from data. Penguin UK.
Chambers, C. (2019). The seven deadly sins of psychology: A manifesto for reforming the culture of scientific practice. Princeton University Press.
Once you get that down, you will want to start playing around with data. To properly learn that, you need to learn to code. You can do this with Python or Julia too, but R is better. Why do you need coding? Can't you just use point-and-click software? You can, but you can't get good that way. Not enough control, too inflexible, and bad analytic patterns to get locked into. So you want to learn R, and you want to use this book:
Wickham, H., & Grolemund, G. (2016). R for data science: import, tidy, transform, visualize, and model data.
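To give a taste of the workflow that book teaches, here is a minimal dplyr pipeline on the built-in mtcars data (a sketch, assuming the tidyverse, or at least dplyr, is installed):

```r
# A first taste of tidyverse-style data manipulation.
# mtcars ships with R; dplyr is part of the tidyverse.
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%                # split the cars by cylinder count
  summarise(
    n        = n(),                # cars per group
    mean_mpg = mean(mpg),          # average fuel economy
    mean_hp  = mean(hp)            # average horsepower
  )

by_cyl
```

Pipelines like this (filter, group, summarise, plot) are most of everyday applied data work.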
Once you know the basics of R, you can start doing stuff like interactive web apps. If you want to learn how to make these, you need to read e.g. this tutorial; the package is called shiny.
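For the flavor of it, a shiny app is just a UI definition plus a server function. A minimal sketch (assumes the shiny package is installed; the app itself is a made-up toy):

```r
# Minimal shiny app: a slider controls the sample size of a histogram.
library(shiny)

ui <- fluidPage(
  sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    # re-draws automatically whenever the slider moves
    hist(rnorm(input$n), main = "Draws from a standard normal")
  })
}

app <- shinyApp(ui, server)   # launch interactively with runApp(app)
```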
Alright, so after learning basics of tidyverse R stuff, you are ready to learn more stats:
Cumming, G. (2013). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. Routledge.
You won't be using the author's Excel code stuff, you will re-do the parts in R you care about.
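As a small example of re-doing such things in R: the two-sample comparison below reports the mean difference (the effect size) and shows that the classic t-test is the same thing as a regression model (base R only; simulated data):

```r
# Simulated two-group data with a true mean difference of 0.5
set.seed(1)
group <- rep(c("a", "b"), each = 50)
y <- ifelse(group == "b", 0.5, 0) + rnorm(100)

t_res  <- t.test(y ~ group, var.equal = TRUE)  # classic two-sample t-test
lm_fit <- summary(lm(y ~ group))               # the same model as a regression

# The regression slope for groupb is the mean difference (the effect size),
# and its p value matches the t-test's p value exactly.
lm_fit$coefficients["groupb", "Estimate"]
c(t = t_res$p.value, lm = lm_fit$coefficients["groupb", "Pr(>|t|)"])
```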
Most statistics books at this point are going to be very unapplied and equation-heavy. You don't care much about these. Many of them waste time doing things like t-tests and chi square tests by hand. You don't care too much about these either. (In point of fact, these legacy tests can all be done as regression models.) The only thing one needs to know is that these are NHST tests, and, based on some assumptions, produce a p value. The p value is just the probability of data at least as extreme as yours given that nothing is going on (noise only, no pattern, the so-called null model). If you have enough data, the p value of any pattern in your data will always be very small, and it is of no other particular interest. What you really care most about are the effect sizes. This point is hammered home in the Cumming book above.

Going further from here really depends on which way you want to go. If you want meta-analysis skills, you can read any of the excellent introductions to meta-analysis in R. If you fancy Schmidt and Hunter style, there's:
Dahlke, J. A., & Wiernik, B. M. (2019). psychmeta: An R package for psychometric meta-analysis. Applied Psychological Measurement, 43(5), 415-416.
If you favor regular style:
Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of statistical software, 36(3), 1-48.
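Either way, the core of these packages is ordinary inverse-variance weighting, which is easy to sketch in base R (the effect sizes below are made-up toy numbers; metafor's rma() does this and much more, e.g. random-effects models):

```r
# Fixed-effect inverse-variance pooling, the core of psychometric meta-analysis.
yi  <- c(0.30, 0.10, 0.45, 0.20)   # study effect sizes (e.g. Cohen's d), toy values
sei <- c(0.15, 0.10, 0.20, 0.12)   # their standard errors, toy values

wi <- 1 / sei^2                    # weight = inverse of the sampling variance
pooled    <- sum(wi * yi) / sum(wi)
pooled_se <- sqrt(1 / sum(wi))     # pooling shrinks the standard error
ci <- pooled + c(-1.96, 1.96) * pooled_se

round(c(estimate = pooled, se = pooled_se, lo = ci[1], hi = ci[2]), 3)
```

Note that the pooled standard error is smaller than any single study's: that is the whole point of meta-analysis.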
[Screenshot: sf package CRAN page]

In general, reading these papers is the wrong way to learn the code in some area. What you want is to find a good R 'vignette' (code and natural language mixed, so it's easy to see what happens). These can generally be found on the package website, and they are sometimes listed on the CRAN page for each package. You can also start playing around with stuff others have done, something you care about. One excellent option is browsing Rpubs.com, which has 1000s of public R analysis notebooks, including 100s of my own:
Many of these contain public data, so you can simply download the same data and rerun their code (expect bugs!). Eventually, when you have done some of the simpler stuff, you will need to read this book. It may take you a month because it isn't that easy, but it is great and well worth the time.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning. New York: Springer.
They provide boomer-tier R code you can copy and run (mostly base R, plus glmnet). You don't really want to stick with their way, though; you want to migrate to the tidymodels framework for applied machine learning. For spatial statistics, you want to learn tidy spatial statistics with the sf package. For regression modeling, you will want to read through this odd but informative book:
Harrell Jr, F. E. (2015). Regression modeling strategies: with applications to linear models, logistic and ordinal regression, and survival analysis. Springer.
It has its own companion package too, rms. The psych package provides a lot of nice functions for psychology-related stuff. For latent variable modeling, you want the lavaan package. For item response theory, you can begin with psych and migrate to mirt afterwards. Learning how to do stuff in R is mainly a question of finding the right package. Ask someone who has worked on the kind of problem you have which package is good for that.
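To illustrate the kind of thing psych wraps up for you: Cronbach's alpha (the reliability coefficient that psych::alpha reports) is simple enough to compute by hand. Below I abuse five mtcars columns as if they were scale 'items' (a toy example, not a real psychological scale):

```r
# Cronbach's alpha from scratch: alpha = k/(k-1) * (1 - sum(item variances) / variance of sum)
items <- mtcars[, c("mpg", "disp", "hp", "drat", "wt")]
items <- scale(items)                                    # standardize the "items"
items[, c("mpg", "drat")] <- -items[, c("mpg", "drat")]  # reverse-score so all correlate positively

k <- ncol(items)
item_vars <- apply(items, 2, var)    # each is 1 after standardizing
total_var <- var(rowSums(items))     # variance of the sum score

alpha <- (k / (k - 1)) * (1 - sum(item_vars) / total_var)
round(alpha, 2)
```

These columns all track how "big" a car is, so they hang together and alpha comes out high; the same formula is what you would apply to questionnaire items.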