Just Emil Kirkegaard Things

Share this post

SQL server for population frequencies from 1000 genomes

www.emilkirkegaard.com

SQL server for population frequencies from 1000 genomes

Emil O. W. Kirkegaard
Oct 15, 2016
Share

Note: 2018 June 26

Server down right now, investigating.

Note: August 16, 2017

server IP changed to 67.207.92.10.

Original post

We need dplyr for this:

library(dplyr)

First, use the anon user to log into the SQL server (user = "anon", pass = "", ip = "67.207.92.10", port = 3306):

sql = src_mysql("population_freqs", host = "67.207.92.10", user = "anon", port = 3306)

Select the 1000 genomes phase 3 table:

sql_1kg = tbl(sql, "1000genomes_phase3")
#look at the first 10 rows
sql_1kg
## Source:   query [?? x 35]
## Database: mysql 10.0.27-MariaDB-0ubuntu0.16.04.1 [anon@67.207.92.10:/population_freqs]
## 
##      CHR       SNP    A1    A2    ACB    ASW    BEB    CDX    CEU    CHB
##    <int>     <chr> <chr> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1      1 rs1000033     G     T 0.4062 0.3279 0.2151 0.3172 0.2071 0.3107
## 2      1 rs1000050     C     T 0.6302 0.5246 0.2267 0.4247 0.1364 0.3932
## 3      1 rs1000070     T     C 0.6146 0.5902 0.5116 0.4946 0.2828 0.3204
## 4      1 rs1000073     G     A 0.4219 0.3689 0.2093 0.1989 0.6717 0.1456
## 5      1 rs1000075     T     C 0.3750 0.4180 0.2733 0.3172 0.3586 0.2573
## 6      1 rs1000085     C     G 0.0781 0.1393 0.0349 0.0645 0.2121 0.0437
## 7      1 rs1000127     C     T 0.2500 0.2787 0.4012 0.4677 0.3081 0.5728
## 8      1 rs1000184     C     G 0.0521 0.0574 0.2442 0.8333 0.2879 0.7767
## 9      1 rs1000211     T     C 0.0260 0.0246 0.0000 0.0000 0.0000 0.0000
## 10     1 rs1000212     A     G 0.0365 0.0246 0.0000 0.0000 0.0000 0.0000
## # ... with more rows, and 25 more variables: CHS <dbl>, CLM <dbl>,
## #   ESN <dbl>, FIN <dbl>, GBR <dbl>, GIH <dbl>, GWD <dbl>, IBS <dbl>,
## #   ITU <dbl>, JPT <dbl>, KHV <dbl>, LWK <dbl>, MSL <dbl>, MXL <dbl>,
## #   PEL <dbl>, PJL <dbl>, PUR <dbl>, STU <dbl>, TSI <dbl>, YRI <dbl>,
## #   EAS <dbl>, EUR <dbl>, AFR <dbl>, AMR <dbl>, SAS <dbl>

The entire file is really large, about 3.6 GB in memory. You often only need a few (1-1000) SNPs, so let’s try downloading only a few:

#first 10 of the hits from the latest height GWAS
#http://www.nature.com/ng/journal/v46/n11/full/ng.3097.html
some_snps = c("rs425277", "rs9434723", "rs10779751", "rs2284746", "rs12137162", 
"rs212524", "rs1014987", "rs2806561", "rs4601530", "rs926438")
#fetch from SQL server
(sql_height_freqs = sql_1kg %>% filter(SNP %in% some_snps) %>% collect())
## # A tibble: 10 x 35
##      CHR        SNP    A1    A2    ACB    ASW    BEB    CDX    CEU    CHB
##    <int>      <chr> <chr> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1      1  rs1014987     G     C 0.1458 0.1475 0.1802 0.5000 0.2525 0.4612
## 2      1 rs10779751     A     G 0.6406 0.5984 0.1279 0.0591 0.2677 0.0631
## 3      1 rs12137162     A     C 0.1198 0.1557 0.2384 0.1935 0.2677 0.2621
## 4      1   rs212524     T     C 0.1823 0.1967 0.3605 0.2473 0.4293 0.1796
## 5      1  rs2284746     G     C 0.1927 0.1475 0.4012 0.1882 0.5657 0.2573
## 6      1  rs2806561     A     G 0.3073 0.3934 0.3488 0.5591 0.5909 0.5388
## 7      1   rs425277     T     C 0.0990 0.0984 0.3023 0.1452 0.2576 0.1990
## 8      1  rs4601530     T     C 0.4271 0.3852 0.4535 0.5538 0.2475 0.4515
## 9      1   rs926438     T     C 0.8177 0.7541 0.2093 0.1774 0.5455 0.3010
## 10     1  rs9434723     A     G 0.2240 0.2541 0.1744 0.0538 0.1414 0.0922
## # ... with 25 more variables: CHS <dbl>, CLM <dbl>, ESN <dbl>, FIN <dbl>,
## #   GBR <dbl>, GIH <dbl>, GWD <dbl>, IBS <dbl>, ITU <dbl>, JPT <dbl>,
## #   KHV <dbl>, LWK <dbl>, MSL <dbl>, MXL <dbl>, PEL <dbl>, PJL <dbl>,
## #   PUR <dbl>, STU <dbl>, TSI <dbl>, YRI <dbl>, EAS <dbl>, EUR <dbl>,
## #   AFR <dbl>, AMR <dbl>, SAS <dbl>

All the atomic populations are there as well as the 5 super populations (‘macro races’). The numbers for the super populations differ slightly from those that can be seen on ensembl because they used weighted means and I used unweighted means.

Share
Comments
Top
New
Community

No posts

Ready for more?

© 2023 Emil O. W. Kirkegaard
Privacy ∙ Terms ∙ Collection notice
Start WritingGet the app
Substack is the home for great writing