Data sharing and Add Health

Mar 29, 2016

I am doing an S factor study of US counties in the usual way. For that reason, I need some kind of county-level cognitive ability estimate. I know that this is possible to create using the Add Health database, but that the data are not sharable. However, it may be possible to do some tricks, so I wrote to their support to get things clarified.

Emil:

Dear Add Health, I could not find an answer to this question anywhere. Suppose I or someone else has calculated an aggregate value (usually a mean) for each county in the US. Would it be against the Add Health data rules to release this non-personal data to other researchers?

Add Health:

Emil,
Thank you for contacting Add Health about this issue. The following guidelines listed in the contract must be followed when publishing information about Add Health.
To avoid inadvertent disclosure of persons, families, or households by using the following guidelines in the release of statistics derived from the Data Files.
In no table should all cases in any row or column be found in a single cell.
In no case should the total for a row or column of a cross-tabulation be fewer than three (3).
In no case should a cell frequency of a cross-tabulation be fewer than three (3) cases.
In no case should a quantity figure be based on fewer than three (3) cases.
Data released should never permit disclosure when used in combination with other known data.
Since the mean for a county is likely to be unique you would not be able to satisfy the above conditions. Also, the data may allow users to identify the county by comparing it with the source data. Therefore, you would not be able to release the aggregate information.
I hope this is helpful.
Best,
Joyce Tabor
Add Health Data Manager

Emil:

Hi Joyce,
How about adding a random number to the county means and excluding the counties with few cases? In this way, one cannot derive the scores of the individuals in the counties. Alternatively, how about binning the data by rounding the mean score to the nearest e.g. 5 (on a scale with a mean of 100 and SD of 15). This would make the means non-unique.

Add Health:

Emil,
Please provide more details about what you would like to do to help me better understand your project. For example, who do you want to make these data available to? What measures do you want to create at the county level? Would your data be linkable to Add Health data? If so, how would they link?
Best,
Joyce

Emil:

Joyce,
Thank you for your reply. Let me clarify matters. I did not do a study using Add Health data. However, some other researchers published a few studies using Add Health data. The studies in question are:
http://www.sciencedirect.com/science/article/pii/S0160289610001340
http://www.sciencedirect.com/science/article/pii/S0160289612001201
http://www.sciencedirect.com/science/article/pii/S019188691300189X
They used the Add Health data to calculate an average IQ score for each county with data and then correlated this with other variables. I am unable to gain access to the Add Health data myself, but I work in the same field and would like the IQ scores for my own analyses. I asked the authors to send me the numbers, but they refused on account of the Add Health data sharing rules. However, it is my thinking that the privacy rules of Add Health are there to protect individuals from being deanonymized, and so it should be possible to release aggregate-level county data without risking deanonymization. Hence, I wrote to you to investigate whether it would be possible to release the aggregate-level data so that other researchers can use it (open data).
As far as I understand, you had some concerns about this because this would produce unique values by county. I'm not sure I understand why that is a problem, but my proposals to remedy this problem were either to 1) add a little random noise to the variable, thus obscuring the true values, or 2) bin the datapoints to the nearest e.g. 2- or 5-point score (so e.g. 113.1434 gets turned into 115). Both of these obscure the original data but do not cause serious statistical degradation of the data. The methods can also be combined.
I am not interested in the individual-level data for this analysis, so I'm not sure what you mean by making the data linkable to the Add Health data. To make it perfectly clear, I would like for it to be possible to release the mean IQ scores by county with the county names, so that they can be merged with other datasets (e.g. Measure of America's) for analytic purposes. If it is not possible to release the real means, then it is my hope that we can release the means with some added noise or binning as described above.
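
For concreteness, the two masking proposals in the email above (random noise and binning, plus dropping counties with few cases) could be sketched roughly as follows. This is only an illustration under stated assumptions, not something Add Health approved or the cited authors used: the function name `mask_county_means`, the `min_cases` threshold of 3 (borrowed from the guidelines quoted earlier), the noise SD of 1 point, and the example scores and county labels are all hypothetical.

```python
import random
from statistics import mean


def mask_county_means(scores_by_county, min_cases=3, noise_sd=1.0,
                      bin_width=5, seed=None):
    """Return masked county means: small counties dropped, noise added, binned.

    scores_by_county: dict mapping a county label to a list of IQ-type scores.
    min_cases, noise_sd and bin_width are illustrative assumptions, not
    values specified by Add Health or by the emails.
    """
    rng = random.Random(seed)
    masked = {}
    for county, scores in scores_by_county.items():
        # Exclude counties with too few cases to report a mean.
        if len(scores) < min_cases:
            continue
        county_mean = mean(scores)
        # Proposal 1: add a little Gaussian noise to obscure the true mean.
        if noise_sd > 0:
            county_mean += rng.gauss(0, noise_sd)
        # Proposal 2: round to the nearest multiple of bin_width.
        masked[county] = bin_width * round(county_mean / bin_width)
    return masked


if __name__ == "__main__":
    # Hypothetical scores; real Add Health data are not used here.
    example = {
        "County A": [98.0, 104.5, 110.2, 101.3],
        "County B": [113.1434, 113.1434, 113.1434],
        "County C": [95.0, 120.0],  # fewer than 3 cases, so it is dropped
    }
    print(mask_county_means(example, noise_sd=1.0, bin_width=5, seed=42))
```

With `noise_sd=0` only the binning applies, so a county mean of 113.1434 becomes 115 with `bin_width=5`, matching the example in the email. Combining both steps hides the exact mean more thoroughly but adds a little measurement error to any county-level correlation.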

Add Health:

Emil,
The researchers are correct that they cannot provide you with data that they constructed using Add Health. Only Add Health can release and share data based on Add Health. Add Health never releases geographic identifiers smaller than Census region, therefore, county level data are not available. The researchers you cite only have access to pseudo county level codes. They do not have the data you want and Add Health will not release any data by location.
Best,
Joyce

So, the Add Health data sharing rules are extremely strict, which makes the dataset much less useful. Some other way to estimate county-level IQ must be found.

Just Emil Kirkegaard Things
