SQL server for population frequencies from 1000 genomes
Note: 2018 June 26
Server down right now, investigating.
Note: August 16, 2017
server IP changed to 67.207.92.10.
Original post
We need dplyr for this:
library(dplyr)
First, use the anon user to log into the SQL server (user = "anon", pass = "", ip = "67.207.92.10", port = 3306):
sql = src_mysql("population_freqs", host = "67.207.92.10", user = "anon", port = 3306)
Select the 1000 genomes phase 3 table:
sql_1kg = tbl(sql, "1000genomes_phase3")
#look at the first 10 rows
sql_1kg
## Source: query [?? x 35]
## Database: mysql 10.0.27-MariaDB-0ubuntu0.16.04.1 [anon@67.207.92.10:/population_freqs]
##
## CHR SNP A1 A2 ACB ASW BEB CDX CEU CHB
## <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 rs1000033 G T 0.4062 0.3279 0.2151 0.3172 0.2071 0.3107
## 2 1 rs1000050 C T 0.6302 0.5246 0.2267 0.4247 0.1364 0.3932
## 3 1 rs1000070 T C 0.6146 0.5902 0.5116 0.4946 0.2828 0.3204
## 4 1 rs1000073 G A 0.4219 0.3689 0.2093 0.1989 0.6717 0.1456
## 5 1 rs1000075 T C 0.3750 0.4180 0.2733 0.3172 0.3586 0.2573
## 6 1 rs1000085 C G 0.0781 0.1393 0.0349 0.0645 0.2121 0.0437
## 7 1 rs1000127 C T 0.2500 0.2787 0.4012 0.4677 0.3081 0.5728
## 8 1 rs1000184 C G 0.0521 0.0574 0.2442 0.8333 0.2879 0.7767
## 9 1 rs1000211 T C 0.0260 0.0246 0.0000 0.0000 0.0000 0.0000
## 10 1 rs1000212 A G 0.0365 0.0246 0.0000 0.0000 0.0000 0.0000
## # ... with more rows, and 25 more variables: CHS <dbl>, CLM <dbl>,
## # ESN <dbl>, FIN <dbl>, GBR <dbl>, GIH <dbl>, GWD <dbl>, IBS <dbl>,
## # ITU <dbl>, JPT <dbl>, KHV <dbl>, LWK <dbl>, MSL <dbl>, MXL <dbl>,
## # PEL <dbl>, PJL <dbl>, PUR <dbl>, STU <dbl>, TSI <dbl>, YRI <dbl>,
## # EAS <dbl>, EUR <dbl>, AFR <dbl>, AMR <dbl>, SAS <dbl>
The entire file is really large, about 3.6 GB in memory. You often only need a few (1-1000) SNPs, so let’s try downloading only a few:
#first 10 of the hits from the latest height GWAS
#http://www.nature.com/ng/journal/v46/n11/full/ng.3097.html
some_snps = c("rs425277", "rs9434723", "rs10779751", "rs2284746", "rs12137162",
"rs212524", "rs1014987", "rs2806561", "rs4601530", "rs926438")
#fetch from SQL server
(sql_height_freqs = sql_1kg %>% filter(SNP %in% some_snps) %>% collect())
## # A tibble: 10 x 35
## CHR SNP A1 A2 ACB ASW BEB CDX CEU CHB
## <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 rs1014987 G C 0.1458 0.1475 0.1802 0.5000 0.2525 0.4612
## 2 1 rs10779751 A G 0.6406 0.5984 0.1279 0.0591 0.2677 0.0631
## 3 1 rs12137162 A C 0.1198 0.1557 0.2384 0.1935 0.2677 0.2621
## 4 1 rs212524 T C 0.1823 0.1967 0.3605 0.2473 0.4293 0.1796
## 5 1 rs2284746 G C 0.1927 0.1475 0.4012 0.1882 0.5657 0.2573
## 6 1 rs2806561 A G 0.3073 0.3934 0.3488 0.5591 0.5909 0.5388
## 7 1 rs425277 T C 0.0990 0.0984 0.3023 0.1452 0.2576 0.1990
## 8 1 rs4601530 T C 0.4271 0.3852 0.4535 0.5538 0.2475 0.4515
## 9 1 rs926438 T C 0.8177 0.7541 0.2093 0.1774 0.5455 0.3010
## 10 1 rs9434723 A G 0.2240 0.2541 0.1744 0.0538 0.1414 0.0922
## # ... with 25 more variables: CHS <dbl>, CLM <dbl>, ESN <dbl>, FIN <dbl>,
## # GBR <dbl>, GIH <dbl>, GWD <dbl>, IBS <dbl>, ITU <dbl>, JPT <dbl>,
## # KHV <dbl>, LWK <dbl>, MSL <dbl>, MXL <dbl>, PEL <dbl>, PJL <dbl>,
## # PUR <dbl>, STU <dbl>, TSI <dbl>, YRI <dbl>, EAS <dbl>, EUR <dbl>,
## # AFR <dbl>, AMR <dbl>, SAS <dbl>
All the atomic populations are there as well as the 5 super populations (‘macro races’). The numbers for the super populations differ slightly from those that can be seen on ensembl because they used weighted means and I used unweighted means.