Sometimes doing elementary things in R is a pain

Jan 19, 2015

Getting a percentage table from a dataframe

A reviewer asked for the following:

1) As I said earlier, there should be some data on the countries of origin of the immigrant population. Most readers have no idea who actually moves to Denmark. At the very least, there should be basic information like "x% of the immigrant population is of non-European origin and y% of European origin as of 2014." Generally, non-European immigration would be expected to increase inequality more, given that IQ levels are relatively uniform across Europe.

I have population counts for each year from 1980 through 2014 in a dataframe, and I'd like to express them as a percentage of each year's total so as to get the relative sizes of the countries. There is a premade function for this, prop.table, but it behaves quite strangely. If one gives it a dataframe and no margin, it divides by the total sum of the dataframe instead of by column. This is sometimes useful, but not in this case. However, if one gives it a dataframe and margin=2, it complains that:

Error in margin.table(x, margin) : 'x' is not an array

Which is odd, given that it just accepted the same dataframe without a margin. The relative lack of documentation made it hard to figure out how to make it work. It turns out that one just has to convert the dataframe to a matrix before passing it in:

census.percent = prop.table(as.matrix(census), margin=2)

and then one can convert it back and also multiply by 100 to get percentages instead of fractions:

census.percent = as.data.frame(prop.table(as.matrix(census), margin=2)*100)
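To see the difference between the two behaviours, here is a minimal sketch on a made-up table. The data frame toy and its numbers are invented for illustration and have nothing to do with the real census data.

```r
# Toy stand-in for the census data: rows are countries, columns are years.
# All names and counts here are invented for illustration.
toy = data.frame(X1980 = c(50, 30, 20),
                 X2014 = c(100, 60, 40),
                 row.names = c("A", "B", "C"))

# No margin: every cell is divided by the grand total (300 here).
share.of.total = prop.table(as.matrix(toy))

# margin=2: every cell is divided by its column total, then scaled to percent.
toy.percent = as.data.frame(prop.table(as.matrix(toy), margin = 2) * 100)

print(round(toy.percent, 1))
print(colSums(toy.percent)) # each column sums to 100
```

Because as.matrix keeps the rownames and as.data.frame restores them, the country labels survive the round trip.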

Getting the top 10 countries with names for selected years

This one was harder. Here's the code I ended up with:

selected.years = c("X1980","X1990","X2000","X2010","X2014") #years of interest
for (year in selected.years){ #loop over each year of interest
  column = census.percent[,year,drop=FALSE] #one-column dataframe, DONT DROP or the rownames are lost!
  ranking = order(column[[1]], decreasing = TRUE) #row positions, largest share first
  View(round(column[ranking[1:10],,drop=FALSE],1)) #top 10 rows, DONT DROP!, rounded to 1 decimal
}

First we choose the years we want (note the X in front: R's dataframe functions prepend it by default to column names that begin with a digit, since such names are not syntactically valid). Then we loop over each year of interest and pick out its column once, to avoid having to select the same column over and over. Normally, when one picks out a single column from a dataframe, R drops it to a plain numeric vector, which is very bad here because it throws away the rownames. That means that even though we can find the top 10 values, we don't know which countries they belong to. The solution is to set drop=FALSE. The next part sorts the rows by that column in decreasing order and keeps the first 10, again with drop=FALSE so that the rownames survive. I open the result in View (in RStudio) because this makes it easier to copy the values for further use (e.g. in a table for a paper).
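The effect of drop is easiest to see on a small example. The sketch below uses an invented dataframe toy.percent and a hypothetical helper top.n (neither comes from the analysis itself), and it prints instead of calling View so that it also runs outside RStudio.

```r
# Invented percentages; rows are countries, columns are years.
toy.percent = data.frame(X1980 = c(20, 50, 30),
                         X2014 = c(45, 15, 40),
                         row.names = c("A", "B", "C"))

# Default behaviour: a single column collapses to a bare numeric vector,
# so the country names are gone.
dropped = toy.percent[, "X2014"]
print(names(dropped)) # NULL

# With drop=FALSE the result is still a dataframe with its rownames.
kept = toy.percent[, "X2014", drop = FALSE]
print(rownames(kept))

# Hypothetical helper: the n largest values of one column, names intact.
top.n = function(df, column, n = 10) {
  ranking = order(df[[column]], decreasing = TRUE) # row positions, largest first
  df[head(ranking, n), column, drop = FALSE]       # head() copes with fewer than n rows
}

print(top.n(toy.percent, "X2014", n = 2))
```

Using head() rather than 1:10 is a small design choice: it avoids rows of NA when a table happens to have fewer rows than requested.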

So, drop=FALSE is another one of those pesky small things to remember. It is just like stringsAsFactors=FALSE when using read.table (or read.csv).
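For comparison, a small sketch of what that argument changes. The CSV text is made up, and the argument is set explicitly in both calls so that the result does not depend on the default in your version of R.

```r
# Made-up two-row CSV, read from a string so no file is needed.
csv.text = "country,count\nA,10\nB,20"

as.factors = read.csv(text = csv.text, stringsAsFactors = TRUE)
as.strings = read.csv(text = csv.text, stringsAsFactors = FALSE)

print(class(as.factors$country)) # "factor"
print(class(as.strings$country)) # "character"
```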

Just Emil Kirkegaard Things
