GetGene.jl

GetGene.jl is a Julia package that queries the NIH NCBI's dbSNP API to retrieve gene information based on an input RefSNP ID (rsid). The package can currently get loci names for an array of rsids and retrieve other gene information for a single inputted rsid.

Installation

This package requires Julia v0.7.0 or later. Start Julia, press the ] key to switch to the package manager REPL, and proceed as follows:

(v1.0) pkg> add GetGene

# machine information for this tutorial
versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, haswell)

In the tutorial we will use the following packages:

using DataFrames, GetGene

Basic usage

The following commands can be used to create the test dataset. We will use the following rsids.

rsids = ["rs113980419", "rs17367504", 
"rs13107325", "rs2392929", "rs11824864",
"78067132", "rs4909945", "Affx-4150211",
"rs2270993"]
df = DataFrame(rsids = rsids)
9×1 DataFrame
│ Row │ rsids        │
│     │ String       │
├─────┼──────────────┤
│ 1   │ rs113980419  │
│ 2   │ rs17367504   │
│ 3   │ rs13107325   │
│ 4   │ rs2392929    │
│ 5   │ rs11824864   │
│ 6   │ 78067132     │
│ 7   │ rs4909945    │
│ 8   │ Affx-4150211 │
│ 9   │ rs2270993    │

The basic commands for GetGene.jl are

getgenes()
getgeneinfo()

getgenes()

getgenes() gets the gene loci information from NCBI's dbSNP API.

Inputs

getgenes() takes either an array of RefSNP IDs or a DataFrame in which one column holds the rsids. By default that column is assumed to be named snpid; use the idvar keyword argument to specify a different column name.
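
The tutorial DataFrame created above stores its IDs in a column named rsids rather than snpid, so the idvar keyword would be needed when passing it to getgenes(). A minimal sketch, assuming the keyword takes the column name as a string (as the docstring signature suggests) and that dbSNP is reachable over the network:

```julia
using DataFrames, GetGene

# A subset of the tutorial IDs, stored in a column named "rsids".
rsids = ["rs113980419", "rs17367504", "rs13107325"]
df = DataFrame(rsids = rsids)

# The column is not called "snpid", so name it explicitly via idvar.
genes = getgenes(df, idvar = "rsids")
```

The result should hold one entry per row of df, in row order.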

Output

getgenes() returns an array with the gene locus name for each inputted rsid. If an rsid is not in dbSNP, its entry reads "snpid not in database". If no gene is associated with an rsid, its entry reads "No gene listed".
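
Because missing results come back as placeholder strings rather than as missing values, it can help to separate them from real locus names before further analysis. The sketch below works on the nine tutorial rsids and the output printed for them in this section, so it runs without network access. The helper has_gene and the constant PLACEHOLDERS are hypothetical names, and the two placeholder strings are assumed to be the only ones the package emits:

```julia
# Input IDs and the corresponding getgenes() output from this tutorial.
rsids = ["rs113980419", "rs17367504", "rs13107325", "rs2392929", "rs11824864",
         "78067132", "rs4909945", "Affx-4150211", "rs2270993"]
genes = ["C1orf167", "MTHFR", "SLC39A8", "No gene listed", "SPI1",
         "No gene listed", "MRVI1", "snpid not in database", "PTPRJ"]

# Placeholder strings seen in the tutorial output (assumed to be the only two).
const PLACEHOLDERS = ("No gene listed", "snpid not in database")

# Hypothetical helper: true when the entry is a real locus name.
has_gene(entry) = !(entry in PLACEHOLDERS)

# Map each resolved rsid to its gene locus, skipping placeholder entries.
resolved = Dict(id => g for (id, g) in zip(rsids, genes) if has_gene(g))

# IDs that need follow-up, in input order.
unresolved = [id for (id, g) in zip(rsids, genes) if !has_gene(g)]
```

Keeping the two groups apart avoids treating a placeholder string as if it were a gene name in later joins or counts.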

For documentation of the getgenes function, type ?getgenes in the Julia REPL.

GetGene.getgenes (Function)
getgenes(data::DataFrame; idvar::AbstractString)

Positional arguments

  • data::DataFrame: A DataFrame containing a column with the Ref SNP IDs. By default, assumes that the variable name is "snpid". The variable name can be specified using the idvar keyword.

Keyword arguments

  • idvar::AbstractString: the variable name in the dataframe that specifies the Ref SNP ID (rsid).

getgenes(snps::AbstractArray)

Positional arguments

  • snps::AbstractArray: Ref SNP IDs (rsid) to get loci names for.

Output

Returns an array of gene loci associated with the Ref SNP IDs.

getgenes(rsids)
9-element Array{String,1}:
 "C1orf167"             
 "MTHFR"                
 "SLC39A8"              
 "No gene listed"       
 "SPI1"                 
 "No gene listed"       
 "MRVI1"                
 "snpid not in database"
 "PTPRJ"

An rsid should start with the prefix rs, but the lookup also works if the prefix is omitted.

getgenes("13107325")
"SLC39A8"

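Although bare numeric IDs are accepted, a dataset that mixes prefixed and unprefixed IDs can be awkward to join or deduplicate later. The following sketch shows one way to normalize IDs before the lookup. The function normalize_rsid is a hypothetical helper, not part of GetGene.jl, and it deliberately leaves non-numeric IDs (such as array probe names) untouched:

```julia
# Hypothetical helper: add the "rs" prefix to bare numeric IDs.
# IDs that already carry the prefix, or that are not purely numeric,
# are returned unchanged (apart from trimming surrounding whitespace).
function normalize_rsid(id::AbstractString)
    s = strip(id)
    return !isempty(s) && all(isdigit, s) ? "rs" * s : String(s)
end

# Broadcast over a vector of mixed IDs.
normalized = normalize_rsid.(["13107325", "rs13107325", "Affx-4150211"])
```
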
getgeneinfo()

getgeneinfo() gets the gene information of the inputted rsid from NCBI's dbSNP API.

Inputs

getgeneinfo() takes a single RefSNP ID (rsid), passed as a string as in the example below.

Output

getgeneinfo() returns a dictionary of gene information associated with the Ref SNP ID. If the rsid is not in dbSNP, it will return an error. If no gene is associated with the rsid, it will say that no gene information is listed for that entry. The dictionary keys are seq_id, annotation_release, gene_name, gene_id, gene_locus, gene_is_pseudo, and gene_orientation.
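
Every value in the dictionary is a string, including the numeric-looking fields, so a small conversion step may be convenient before using them in code. The sketch below uses the values from the rs13107325 example in this section, so it runs without network access. The helper typed_geneinfo is hypothetical, and reading gene_is_pseudo as "1" for pseudogenes and "0" otherwise is an assumption rather than something this tutorial states:

```julia
# Dictionary with the values shown for rs13107325 in this section.
info = Dict(
    "annotation_release" => "Homo sapiens Annotation Release 109",
    "gene_name"          => "solute carrier family 39 member 8",
    "gene_locus"         => "SLC39A8",
    "gene_id"            => "64116",
    "seq_id"             => "NC_000004.12",
    "gene_is_pseudo"     => "0",
    "gene_orientation"   => "1",
)

# Hypothetical helper: convert selected string fields into typed values.
# Assumption: gene_is_pseudo uses "1" for pseudogenes and "0" otherwise.
function typed_geneinfo(info::AbstractDict)
    return (
        locus     = info["gene_locus"],
        name      = info["gene_name"],
        gene_id   = parse(Int, info["gene_id"]),
        is_pseudo = info["gene_is_pseudo"] == "1",
    )
end

gene = typed_geneinfo(info)
```

The named tuple keeps field access short (gene.locus, gene.gene_id) while leaving the original dictionary unchanged.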

getgeneinfo("rs13107325")
Dict{String,AbstractString} with 7 entries:
  "annotation_release" => "Homo sapiens Annotation Release 109"
  "gene_name"          => "solute carrier family 39 member 8"
  "gene_locus"         => "SLC39A8"
  "gene_id"            => "64116"
  "seq_id"             => "NC_000004.12"
  "gene_is_pseudo"     => "0"
  "gene_orientation"   => "1"