Scraping: temperature and precipitation data for all NOAA weather stations
Some people claim that climate has direct causal influences on income, cognitive ability and so on. Usually, these academics just regress IQ on climate variables at the country or US state level. However, it is possible to do it at the US county level too. Unfortunately, it is difficult to find climate data by county. I looked at NOAA. They have a nice dataset, but there is no option to download it all at once. They say there are over 9800 stations (by my count there are 8377 stations in their dataset), so copying the data manually is out of the question. The stations have total precipitation and mean, min and max temperature, averaged over the last 30 years. Some have missing data, but most have it all. So I wrote to NOAA to ask for the data, but that was not to be. So I have to resort to scraping.
The job initially seemed easy. If we click on a state and look in the network tab (F12 in Firefox), we see that we receive a json file with the list of all stations in that state. It looks like this (for the smallest 'state'):
{"metadata":{"resultset":{"offset":1,"count":2,"limit":1000}},"results":[{"elevation":45.7,"mindate":"2010-01-01","maxdate":"2010-12-01","latitude":38.9385,"name":"DALECARLIA RESERVOIR, DC US","datacoverage":1,"id":"GHCND:USC00182325","elevationUnit":"METERS","longitude":-77.1134},{"elevation":15.2,"mindate":"2010-01-01","maxdate":"2010-12-01","latitude":38.9133,"name":"NATIONAL ARBORETUM DC, DC US","datacoverage":1,"id":"GHCND:USC00186350","elevationUnit":"METERS","longitude":-76.97}]}
So, this is easy to parse in R with jsonlite. But suppose we want to download the json for each state automatically (too lazy to click 51 times). We grab the URL and try it directly:
http://www.ncdc.noaa.gov/cdo-web/api/v2/stations?limit=1000&offset=1&datasetid=normal_mly&locationid=FIPS%3A11&sortfield=name
But no, it gives us another json file essentially saying "no, no":
{"status" : "400", "message" : "Token parameter is required."}
But at least it tells us what it wants: a token. Actually, it wants some headers on the request. If we use the network tab's 'Copy as cURL' option, we get the necessary call:
curl 'http://www.ncdc.noaa.gov/cdo-web/api/v2/stations?limit=1000&offset=1&datasetid=normal_mly&locationid=FIPS%3A11&sortfield=name' -H 'Host: www.ncdc.noaa.gov' -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:43.0) Gecko/20100101 Firefox/43.0' -H 'Accept: */*' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'token: 0x2a' -H 'X-Requested-With: XMLHttpRequest' -H 'Referer: http://www.ncdc.noaa.gov/cdo-web/datatools/normals' -H 'Cookie: JSESSIONID=A94A1B7C0A8FF43E0E43D20F6E4D2F0E; _ga=GA1.2.1535824352.1459267703; _ga=GA1.3.1535824352.1459267703; __utma=198104496.1535824352.1459267703.1459267703.1459527590.2; __utmz=198104496.1459527590.2.2.utmcsr=t.co|utmccn=(referral)|utmcmd=referral|utmcct=/CJ7yyAO3rn; has_js=1; __utmc=198104496; __utmb=198104496.10.9.1459528698517; _gat=1; _gat_GSA_ENOR0=1; __utmt_GSA_CP=1' -H 'Connection: keep-alive'
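The same call can be reproduced from R with the httr package. A minimal sketch — only the headers that plausibly matter are included, the token value is simply copied from the curl call (substitute the one from your own session), and the filename scheme is made up; the author's actual scrapers are in the repository (scrape_states.R, scrape_stations.R):

library(httr)

get_state_json = function(fips) {
  # fips is a two-digit state code, e.g. "11" for DC
  resp = GET("http://www.ncdc.noaa.gov/cdo-web/api/v2/stations",
             query = list(limit = 1000, offset = 1,
                          datasetid = "normal_mly",
                          locationid = paste0("FIPS:", fips),
                          sortfield = "name"),
             add_headers(token = "0x2a",   # placeholder; copy from your own browser session
                         `X-Requested-With` = "XMLHttpRequest",
                         Referer = "http://www.ncdc.noaa.gov/cdo-web/datatools/normals"))
  # save the raw json to disk for later parsing
  writeLines(content(resp, as = "text", encoding = "UTF-8"),
             paste0("FIPS_", fips, ".json"))
}

get_state_json("11")   # DC; loop over all 51 FIPS codes to get every state

If the server still complains, copy the remaining headers (User-Agent, Cookie) from the curl call as well.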
In the curl call, the first part is the URL we already saw; the others are the headers that my browser sent. So we simply set up a script, like the sketch above, that fetches each state's json with these headers. That leaves us with a folder of state json files, each containing the IDs of that state's stations, e.g. GHCND:USC00182325. If we go back to the website and click a single station, we can see that another json file is received in the background:
{"metadata":{"resultset":{"offset":1,"count":20,"limit":100}},"results":[{"date":"2010-01-01T00:00:00","datatype":"ANN-PRCP-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":4566},{"date":"2010-01-01T00:00:00","datatype":"ANN-TAVG-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":568},{"date":"2010-01-01T00:00:00","datatype":"ANN-TMAX-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":671},{"date":"2010-01-01T00:00:00","datatype":"ANN-TMIN-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":466},{"date":"2010-01-01T00:00:00","datatype":"DJF-PRCP-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":927},{"date":"2010-01-01T00:00:00","datatype":"DJF-TAVG-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":365},{"date":"2010-01-01T00:00:00","datatype":"DJF-TMAX-NORMAL","station":"GHCND:USC00182325","attributes":"C","value":455},{"date":"2010-01-01T00:00:00","datatype":"DJF-TMIN-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":274},{"date":"2010-01-01T00:00:00","datatype":"JJA-PRCP-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":1266},{"date":"2010-01-01T00:00:00","datatype":"JJA-TAVG-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":764},{"date":"2010-01-01T00:00:00","datatype":"JJA-TMAX-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":867},{"date":"2010-01-01T00:00:00","datatype":"JJA-TMIN-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":662},{"date":"2010-01-01T00:00:00","datatype":"MAM-PRCP-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":1210},{"date":"2010-01-01T00:00:00","datatype":"MAM-TAVG-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":556},{"date":"2010-01-01T00:00:00","datatype":"MAM-TMAX-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":670},{"date":"2010-01-01T00:00:00","datatype":"MAM-TMIN-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":442},{"date":"2010-01-01T00:00:00","datatype":"SON-PRCP-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":1163},{"date":"2010-01-01T00:00:00","datatype":"SON-TAVG-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":584},{"date":"2010-01-01T00:00:00","datatype":"SON-TMAX-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":686},{"date":"2010-01-01T00:00:00","datatype":"SON-TMIN-NORMAL","station":"GHCND:USC00182325","attributes":"S","value":481}]}
This file has all the data for that station. So we use the header-based curl approach from before and simply request these json files for every station ID found in the state files. This gives us over 8000 json files. Then it is just a matter of parsing them to get what we want (a sketch of this step is given after the list below). Skipping some parsing headaches, the end result is a datafile for each station with the following variables:
longitude
latitude
elevation
name
and, for each of annual, winter (DJF), spring (MAM), summer (JJA) and autumn (SON):
    precipitation
    mean temperature
    max temperature
    min temperature
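A minimal sketch of that parsing step for a single station json (the filename is hypothetical, and it assumes the json's datatype codes — ANN/DJF/MAM/JJA/SON crossed with PRCP/TAVG/TMAX/TMIN — are kept as column names):

library(jsonlite)

# turn one station json into a one-row data.frame of datatype -> value
parse_station = function(path) {
  d = fromJSON(path)$results
  data.frame(as.list(setNames(d$value, d$datatype)), check.names = FALSE)
}

station = parse_station("GHCND_USC00182325.json")   # hypothetical filename
station[["ANN-PRCP-NORMAL"]]                        # NOAA stores these as scaled integers

Longitude, latitude, elevation and name come from the state jsons and can be merged in by station ID.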
To get county-level data, one can assign each station to the nearest county and average the values within each county. There are ~3100 counties but ~8400 stations, so with some luck all counties, or nearly all, will be covered.
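A minimal sketch of that assignment in base R, approximating "nearest county" by the nearest county centroid; stations is the per-station data assembled above, and county_centroids is a hypothetical data.frame with columns fips, latitude and longitude (e.g. built from the Census gazetteer files):

# great-circle distance (km) from one point to a vector of points (haversine formula)
haversine = function(lon1, lat1, lon2, lat2) {
  rad = pi / 180
  dlon = (lon2 - lon1) * rad
  dlat = (lat2 - lat1) * rad
  a = sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  6371 * 2 * asin(sqrt(a))
}

# assign each station to the county with the nearest centroid
nearest = sapply(seq_len(nrow(stations)), function(i) {
  which.min(haversine(stations$longitude[i], stations$latitude[i],
                      county_centroids$longitude, county_centroids$latitude))
})
stations$fips = county_centroids$fips[nearest]

# average the climate variables within each county
climate_vars = grep("NORMAL", names(stations), value = TRUE)  # assumes the json datatype names were kept
county_means = aggregate(stations[climate_vars],
                         by = list(fips = stations$fips), FUN = mean, na.rm = TRUE)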
The datafile is available on OSF (stations_data.csv). Hopefully, it will be useful to climatologists, meteorologists and sociologists alike.
The R code used for the scraping is also in the same repository (scrape_states.R, scrape_stations.R).