CGDSNPDB v1.5 Mouse SNP Database - About data

Build 37 SNPs Summary
Total: 66,028,809
Transition: 42,744,395 (65%)
Transversion: 23,284,436 (35%)
Intergenic: 38,919,578 (59%)
On Transcript: 27,109,231 (41%)
Intronic: 26,102,091; Exonic: 1,102,004
UTR: 699,677; Synonymous: 290,014
Non-Synonymous: 161,260
Announcements
  • August 2012

    • SNP database v1.5 maintenance release

About the Database (NCBI Mouse Build 37)

Single nucleotide polymorphisms (SNPs) are an important tool for studying genetic variation. CGDSNPdb is a high-quality SNP database with more than 66 million SNPs from 136 strains of laboratory mice, drawn from several sources. These SNPs have been quality-checked with an automated pipeline that highlights inconsistent or ambiguous SNPs. The SNP data have also been integrated with nearby gene annotations, including cross-links, using Ensembl and MGI annotations. Our annotations also highlight various functional characteristics and implications of each SNP, such as amino acid changes, base-pair substitution types, codon location within the protein, location within CpG dinucleotides, location within the CDS, codon usage frequency, and the Codon Adaptation Index (CAI).
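As an illustration of the "base-pair substitution type" annotation, the sketch below classifies a SNP as a transition (purine to purine, or pyrimidine to pyrimidine) or a transversion (purine to pyrimidine, or the reverse). This is the standard textbook definition, not the CGDSNPdb pipeline itself; the function name `substitution_type` is hypothetical.

```python
# Illustrative only: standard transition/transversion definition,
# not code taken from the CGDSNPdb pipeline.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref_allele: str, snp_allele: str) -> str:
    """Return "transition" or "transversion" for a pair of unambiguous bases."""
    ref, alt = ref_allele.upper(), snp_allele.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("alleles must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("reference and SNP allele are identical")
    # Both purines or both pyrimidines: transition; otherwise: transversion.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


if __name__ == "__main__":
    for ref, alt in [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]:
        print(ref, alt, substitution_type(ref, alt))
```

Ambiguous calls such as "N" are rejected here rather than classified; how the database itself treats them is not stated in this page.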

CGDSNPdb also provides an interface to two new resources developed within the scope of the Center for Genome Dynamics. The first is the recently developed "imputed SNP resource" (Imputed - Jeremy R. Wang et al. 2012). The imputed SNP calls may be searched, retrieved, and analyzed identically to the experimentally verified SNPs, with additional information, such as the HMM likelihood score, provided in the query return.

The second new resource is the output from the Mouse Genome Diversity Array (MusDiv Diversity Array - Yang et al. 2011), a high-density genotyping microarray that includes over 500,000 SNP probes and over 900,000 invariant probes that target exons, copy number variants, and other features of interest. As of July 2012, CGDSNPdb includes:

  • MusDiv SNP calls: 549,645 SNPs for 100 inbred laboratory mouse strains;
  • Imputed SNP calls (Wang et al. 2012): 65,243,635 SNPs for 105 inbred laboratory mouse strains;
  • Imputed SNP calls (Szatkiewicz et al. 2008): 7,867,995 SNPs for 74 inbred laboratory mouse strains;
  • and NIEHS SNP calls: 8,228,050 SNPs for 16 inbred laboratory mouse strains.

The automated quality control (QC) analysis includes tests for a number of possible inconsistencies. All QC data are archived in the database, so a search by SNP ID can retrieve these descriptions even when the SNP has been pruned from the final database.

CGD flags the following cases in the core database tables:

  • SNPs for which the source-provided position maps to an ambiguous ("N") base call in the reference strain
  • SNPs for which the source-provided reference allele does not match the reference strain allele call
  • SNPs for which the SNP other-allele calls from the same or different SNP sources do not agree
  • SNPs for which the strain genotype allele calls from the same source or different SNP sources do not agree
  • SNPs with multiple allele calls for the SNP other allele
  • SNPs for which the reference allele call matches the reverse strand of the genome
  • SNPs for which the reference allele call matches the other allele instead of the reference genome
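The reference-allele checks in this list can be sketched as a small decision function. This is a minimal illustration, not the actual CGD pipeline: the function name `classify_reference_allele`, the return value `None` for "no conflict", and the order in which the checks are applied are all assumptions. The numeric codes follow the error_flag values listed under Data Conflict, and the "allele swap" case is interpreted here as the genome base equalling the source's other (SNP) allele, which is also an assumption.

```python
from typing import Optional

# Watson-Crick complements, used for the reverse-strand check.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_reference_allele(
    genome_base: str, provided_ref: str, provided_other: str
) -> Optional[int]:
    """Return an error_flag (2, 3, 4, 5) or None when no conflict is found.

    Hypothetical helper: check order and the None convention are assumptions.
    """
    genome_base = genome_base.upper()
    provided_ref = provided_ref.upper()
    provided_other = provided_other.upper()

    if genome_base == "N":
        return 4  # Unmapped/Ambiguous: reference genome base is "N"
    if provided_ref == genome_base:
        return None  # provided reference allele agrees with the genome
    if COMPLEMENT.get(provided_ref) == genome_base:
        return 2  # Reverse Strand: matches the complementary base
    if provided_other == genome_base:
        return 5  # AlleleSwap (assumed reading): genome carries the SNP allele
    return 3  # Ambiguous: matches neither strand nor the other allele


if __name__ == "__main__":
    print(classify_reference_allele("A", "T", "C"))  # reverse-strand case
    print(classify_reference_allele("G", "A", "G"))  # allele-swap case
```

Note that some inputs satisfy more than one rule (for example, an A/T SNP at a genome "T" fits both the reverse-strand and the allele-swap description); the source does not say which flag takes precedence, so the ordering above is only one possible choice.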

Current Sources
Source Name | Strains | Initial SNP count | NCBI Build | Unique SNPs loaded into CGD
NIEHS | 16 | 8,239,374 | 37 | 8,228,050 (about 99.9%)
Imputed - Jeremy R. Wang et al. 2012 | 105 | 65,243,635 | 37 | 65,027,153 (about 99.7%)
Imputed - Szatkiewicz et al. 2008 | 74 | (not listed) | 37 | 7,867,995 (about 99.7%)
MusDiv Diversity Array - Yang et al. 2011 | 100 | 549,682 | 37 | 549,645 (about 99.9%)
Data Conflict
  • SNP conflict: We have 8 types of allele conflicts
    1. error_flag = 2 ==> Conflict (Reverse Strand): when checked against the reference genome of the same build, the provided reference allele does not match our reference genome on the same strand, but instead matches the corresponding base pair on the reverse strand
    2. error_flag = 3 ==> Conflict (Ambiguous SNPs): the provided reference allele does not match the base pair at the SNP location of our reference genome of the same build, on either the same strand or the reverse strand, nor does it match the other allele
    3. error_flag = 4 ==> Conflict (Unmapped/Ambiguous SNPs): the base pair at the provided genome location is "N"
    4. error_flag = 5 ==> Conflict (AlleleSwap SNPs): the provided B6 allele does not match the base pair at the SNP location of our B6 genome of the same build, on either the same strand or the reverse strand, but matches the other allele (SNP allele)
    5. error_flag = 6 ==> SNP allele conflict between sources
    6. error_flag = 7 ==> Genotype allele conflict for a given strain
    7. error_flag = 16 ==> Multiple SNP alleles
    8. error_flag = 17 ==> Field count mismatch between header and data
    So the following cases are not conflicts: N/A, A/N, T/N, N/T, N/C, C/N, G/N, N/G
  • snp_allele conflict between sources:
    We say that two SNP sources or two SNP calls from the same source have a snp allele conflict if and only if:
    1. the snp_allele from call1 != "N"
    2. the snp_allele from call2 != "N"
    3. call1_snp_allele != call2_snp_allele
    So the following cases are not conflicts: N/A, A/N, T/N, N/T, N/C, C/N, G/N, N/G
  • genotype allele conflict:
    We say that two genotype calls have a genotype allele conflict for a given strain if and only if:
    1. the genotype allele call1 != "N"
    2. the genotype allele call2 != "N"
    3. call1_genotype_allele != call2_genotype_allele
    So the following cases are not conflicts: N/A, A/N, T/N, N/T, N/C, C/N, G/N, N/G
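
The two conflict definitions above share the same three-part rule, which can be sketched as a single predicate. The code below is illustrative only: the function names `alleles_conflict` and `strains_with_genotype_conflict`, the `(strain, allele)` input layout, and the example strain names are assumptions, not part of the CGDSNPdb schema.

```python
from itertools import combinations
from typing import Iterable, List, Tuple


def alleles_conflict(call1: str, call2: str) -> bool:
    """True only if both calls are non-"N" and they differ."""
    return call1 != "N" and call2 != "N" and call1 != call2


def strains_with_genotype_conflict(
    calls: Iterable[Tuple[str, str]]
) -> List[str]:
    """Given (strain, genotype_allele) calls for one SNP, list conflicting strains.

    Hypothetical helper: a strain is reported when any two of its calls conflict.
    """
    by_strain = {}
    for strain, allele in calls:
        by_strain.setdefault(strain, []).append(allele)
    return sorted(
        strain
        for strain, alleles in by_strain.items()
        if any(alleles_conflict(a, b) for a, b in combinations(alleles, 2))
    )


if __name__ == "__main__":
    example_calls = [
        ("C57BL/6J", "A"), ("C57BL/6J", "N"),  # N/A pair: not a conflict
        ("DBA/2J", "C"), ("DBA/2J", "T"),      # C/T pair: conflict
    ]
    print(strains_with_genotype_conflict(example_calls))
```

The same `alleles_conflict` predicate applies to the snp_allele comparison between sources, since both definitions treat "N" as compatible with any call.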