ANNOUNCEMENTS

  • February 2009

    • SNP database 1.2 maintenance release
  • March 2009

    • 74 Strains Imputed SNP release
  • May 2009

    • Mouse Assembly Build 37 SNP release
    • SNP integration with codon frequency for synonymous and non-synonymous mutations

About the Database (Mouse NCBI Build 37)

Single nucleotide polymorphisms (SNPs) are an important tool for studying genetic variation. CGDSNPdb is a high-quality SNP database with more than 8 million SNPs from 97 strains of laboratory mice, drawn from several sources. These SNPs have been quality checked with an automated pipeline that flags inconsistent or ambiguous SNPs. The SNP data have also been integrated with nearby gene annotations, including crosslinks, using Ensembl and MGI annotations. Our annotations also highlight functional characteristics and implications of each SNP, such as amino acid changes, base-pair substitution types, location within CpG dinucleotides, codon usage frequency, and Codon Adaptation Index (CAI).

CGDSNPdb also provides an interface to two new resources developed within the scope of the Center for Genome Dynamics. The first is the recently developed "imputed SNP resource", in which a hidden Markov model (HMM) was used to assess local haplotypes and the most probable base assignment at several million genomic loci in tens of strains of mice. The imputed SNP calls may be searched, retrieved, and analyzed identically to the experimentally verified SNPs, with additional information, such as the HMM likelihood score, provided in the query return.

SNPs detected in previous genome assembly builds are converted to the current build. The second new resource is the output from the Mouse Genome Diversity Array (MusDiv), a high-density genotyping microarray that includes over 623,000 SNP probes and over 900,000 invariant probes that target exons, copy number variants, and other features of interest. As of July 2009, CGDSNPdb includes MusDiv SNP calls for 72 inbred laboratory mouse strains.

The automated quality control (QC) analysis includes tests for a number of possible inconsistencies. All QC data are archived in the database, so a search by SNP ID can retrieve these descriptions even when the SNP has been pruned from the final database.

CGD does not load the following cases into the core database tables:
SNPs that are inconsistent with the reference C57BL/6J (B6) genome, including:

  • SNPs for which the source-provided position maps to ambiguous ("N") base calls in B6
  • SNPs that do not match the B6 genome at the source-provided position and where the source-provided flanking sequences also do not match to >=70%
  • SNPs that match the B6 genome at the source-provided position, but where the source-provided flanking sequences do not match to >=60%
  • SNPs where multiple accession IDs map to the same location with conflicting strain allele assignments, and no flanking sequence for resolution
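
The first three exclusion rules above can be sketched as a small predicate. This is only an illustration under stated assumptions, not the CGD pipeline: the function name, its parameters, and the representation of the flanking-sequence match as a fraction between 0 and 1 are hypothetical. Only the "N" rule and the 70% and 60% thresholds come from the list. The fourth rule (conflicting accession IDs at one location) needs cross-record information and is left out.

```python
from typing import Optional


def exclusion_reason(ref_base: str,
                     allele_matches_b6: bool,
                     flank_match: float) -> Optional[str]:
    """Return why a SNP would be kept out of the core tables, or None.

    ref_base          -- B6 base at the source-provided position
    allele_matches_b6 -- whether the source's B6 allele agrees with ref_base
    flank_match       -- fraction (0..1) of the source flanking sequence
                         that matches B6 (hypothetical input)
    """
    if ref_base.upper() == "N":
        return "position maps to an ambiguous (N) base call in B6"
    if not allele_matches_b6 and flank_match < 0.70:
        return "allele mismatch and flanking match below 70%"
    if allele_matches_b6 and flank_match < 0.60:
        return "allele match but flanking match below 60%"
    return None


if __name__ == "__main__":
    # A mismatching SNP with 65% flanking identity is excluded,
    # while a matching SNP with the same flanking identity is kept.
    print(exclusion_reason("A", False, 0.65))
    print(exclusion_reason("A", True, 0.65))
```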

CGDSNP Overview
A summary of the total SNP data included in CGDSNPdb

Total                    9,686,537
Transition               6,607,155
Transversion             3,079,382
Intergenic               5,617,609
Genic                    4,068,928
Intronic                 3,850,229
Exonic                     247,920
UTR                        112,067
CDS                        126,078
CDS:Synonymous              85,090
CDS:Non-Synonymous          43,698
Noncoding gene exon         17,032
Noncoding gene intron        5,910

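
For readers unfamiliar with the transition/transversion split in the summary, here is a minimal sketch of the standard definition: a transition swaps two purines (A, G) or two pyrimidines (C, T), and any other substitution is a transversion. The function name is hypothetical; the counts are copied from the summary above and are only used to check that the two classes add up to the total.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref: str, alt: str) -> str:
    """Classify a single-base substitution as transition or transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not ({ref, alt} <= (PURINES | PYRIMIDINES)):
        raise ValueError(f"not a valid substitution: {ref}>{alt}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Counts taken from the CGDSNP Overview summary.
TOTAL = 9_686_537
TRANSITIONS = 6_607_155
TRANSVERSIONS = 3_079_382

# The two substitution classes partition the full SNP set.
assert TRANSITIONS + TRANSVERSIONS == TOTAL
ts_tv_ratio = TRANSITIONS / TRANSVERSIONS

if __name__ == "__main__":
    print(substitution_type("A", "G"))
    print(substitution_type("A", "C"))
    print(f"Ts/Tv ratio: {ts_tv_ratio:.2f}")
```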
Current Sources

Source Name            Strains  Initial SNP count               NCBI Build  Total Loaded into CGD       Data Report
NIEHS                  16       8,262,793                       37          8,229,903 (about 99.6%)
Broad                  49       138,602                         37          138,594 (about 99.99%)
  • All Duplicate SNPs (9)
  • All Ambiguous SNPs (233)
  • All Allele Swap SNPs (209)
  • Reverse Strand (277)
GNF                    76       156,513                         37          155,677 (about 99.5%)
Mouse Diversity Array  72       581,672 (genotype data files)   37          548,363 (about 94.3%)       (Download)
SNPs Imputed           74       7,868,024                       37          7,867,856 (about 99.99%)
  • All Ambiguous SNP allele call (341)
  • All Allele Swap SNPs (221)
  • Reverse Strand (8800)
Celera SNPs            5        2,122,060                       37          2,122,059 (about 100%)
  • All Ambiguous SNP allele call ()
  • All Allele Swap SNPs ()
  • Reverse Strand ()
Paigen SNPs            50       24,607                          37          23,975 (about 97.4%)
  • All Ambiguous SNP allele call ()
  • All Allele Swap SNPs ()
  • Reverse Strand ()
Wild Derived SNPs      12       666                             37          666 (about 100%)
  • All Ambiguous SNP allele call ()
  • All Allele Swap SNPs ()
  • Reverse Strand ()

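
The "about N%" figures in the sources table are the share of each source's initial SNPs that were loaded. The sketch below recomputes a few of them from the counts in the table; the dictionary and function names are hypothetical and serve only as a worked arithmetic check.

```python
# (initial SNP count, total loaded into CGD), copied from the sources table.
SOURCES = {
    "NIEHS": (8_262_793, 8_229_903),
    "GNF": (156_513, 155_677),
    "Mouse Diversity Array": (581_672, 548_363),
    "Paigen SNPs": (24_607, 23_975),
}


def percent_loaded(initial: int, loaded: int) -> float:
    """Percentage of a source's initial SNPs that reached the database."""
    return 100.0 * loaded / initial


if __name__ == "__main__":
    for name, (initial, loaded) in SOURCES.items():
        print(f"{name}: {percent_loaded(initial, loaded):.1f}% loaded")
```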
Data Conflict
  • C57BL/6J allele conflict: the Allele_conflict field distinguishes the no-conflict case from 4 types of allele conflict
    1. Allele_conflict = 0/1 ==> No conflict: the provided B6 allele matches the base pair at the SNP location of our B6 genome of the same build and strand
    2. Allele_conflict = 2 ==> Conflict (Bad Strand): the provided B6 allele does not match the base pair at the SNP location of our B6 genome of the same build and strand, but instead matches the corresponding base pair on the reverse strand
    3. Allele_conflict = 3 ==> Conflict (Ambiguous SNPs): the provided B6 allele does not match the base pair at the SNP location of our B6 genome of the same build, on either the same or the reverse strand, and the other allele does not match either
    4. Allele_conflict = 4 ==> Conflict (Unmapped/Ambiguous SNPs): the base pair at the provided genome location is "N"
    5. Allele_conflict = 5 ==> Conflict (AlleleSwap SNPs): the provided B6 allele does not match the base pair at the SNP location of our B6 genome of the same build, on either the same or the reverse strand, but the other allele (SNP allele) does
    So the following cases are not conflicts: N/A, A/N, T/N, N/T, N/C, C/N, G/N, N/G
  • snp_allele conflict between sources:
    Two SNP sources have a snp_allele conflict if and only if:
    1. the snp_allele from source1 != "N"
    2. the snp_allele from source2 != "N"
    3. source1_snp_allele != source2_snp_allele
    So the following cases are not conflicts: N/A, A/N, T/N, N/T, N/C, C/N, G/N, N/G
  • genotype allele conflict:
    Two SNP sources have a genotype allele conflict for a given strain if and only if:
    1. the genotype allele from source1 != "N"
    2. the genotype allele from source2 != "N"
    3. source1_genotype_allele != source2_genotype_allele
    So the following cases are not conflicts: N/A, A/N, T/N, N/T, N/C, C/N, G/N, N/G
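
The conflict definitions above can be sketched in a few lines of Python. This is an illustrative reading, not CGD's code: the function names and signatures are hypothetical, the order in which the checks are applied is an assumption, and because the text does not say how 0 and 1 differ, the sketch returns 0 for every no-conflict case. It also does not special-case an "N" in the provided B6 allele.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def b6_allele_conflict_code(provided_b6: str, other_allele: str,
                            ref_base: str) -> int:
    """Assign an Allele_conflict code following the descriptions above.

    provided_b6  -- B6 allele reported by the source
    other_allele -- the non-B6 (SNP) allele reported by the source
    ref_base     -- base in the B6 genome at the SNP location
    """
    provided_b6 = provided_b6.upper()
    other_allele = other_allele.upper()
    ref_base = ref_base.upper()
    if ref_base == "N":
        return 4  # unmapped/ambiguous: reference base is "N"
    if provided_b6 == ref_base:
        return 0  # no conflict (0 vs 1 is not distinguished in the text)
    if COMPLEMENT.get(provided_b6) == ref_base:
        return 2  # bad strand: allele matches the reverse strand
    if other_allele == ref_base:
        return 5  # allele swap: the SNP allele is the reference base
    return 3      # ambiguous: nothing matches


def alleles_conflict(allele1: str, allele2: str) -> bool:
    """Between-source rule for snp_allele and genotype alleles alike:
    a conflict needs two non-"N" calls that differ."""
    a, b = allele1.upper(), allele2.upper()
    return a != "N" and b != "N" and a != b


if __name__ == "__main__":
    print(b6_allele_conflict_code("T", "G", "A"))  # reverse-strand case
    print(alleles_conflict("N", "A"))              # an "N" never conflicts
```

The same `alleles_conflict` predicate serves both the snp_allele and the genotype allele rule because the two definitions differ only in which pair of calls is compared.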