About the Database (Mouse NCBI Build 37)
Single nucleotide polymorphisms (SNPs)
are an important tool for studying genetic variation.
CGDSNPdb is a high-quality SNP database with more than 8 million SNPs from
97 strains of laboratory mice,
drawn from several sources.
These SNPs have been quality-checked with an automated pipeline
that flags inconsistent or ambiguous SNPs.
The SNP data has also been integrated with nearby gene annotations,
including crosslinks, using Ensembl and MGI annotations.
Our annotations also highlight various functional characteristics and
implications of each SNP, such as amino acid changes, base-pair substitution types,
location within CpG dinucleotides,
codon usage frequency, and Codon Adaptation Index (CAI).
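Two of the annotations listed above, the base-pair substitution type and the location within a CpG dinucleotide, can be illustrated with a minimal Python sketch. The function names `substitution_type` and `in_cpg` are hypothetical and are not part of CGDSNPdb; the sketch assumes unambiguous A/C/G/T alleles and checks CpG context on the given strand only.

```python
# Illustrative helpers only; the names are hypothetical, not CGDSNPdb functions.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def substitution_type(allele1: str, allele2: str) -> str:
    """Return 'transition' or 'transversion' for a pair of distinct bases."""
    pair = {allele1.upper(), allele2.upper()}
    if len(pair) != 2 or not pair <= BASES:
        raise ValueError("expected two distinct bases from A, C, G, T")
    # Transitions stay within one chemical class (A<->G or C<->T).
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def in_cpg(sequence: str, pos: int) -> bool:
    """True if the base at 0-based `pos` is the C or the G of a CpG pair."""
    seq = sequence.upper()
    base = seq[pos]
    if base == "C":
        return pos + 1 < len(seq) and seq[pos + 1] == "G"
    if base == "G":
        return pos > 0 and seq[pos - 1] == "C"
    return False
```

For example, `substitution_type("A", "G")` is a transition, while `substitution_type("A", "C")` is a transversion.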
CGDSNPdb also provides an interface to two new resources developed
within the scope of the Center for Genome Dynamics. The first is the recently developed
"imputed SNP resource", in which a hidden Markov model (HMM) was used
to assess local haplotypes and the most probable base assignment at several
million genomic loci in tens of strains of mice. The imputed SNP calls
may be searched, retrieved, and analyzed in the same way as the experimentally verified SNPs,
with additional information, such as the HMM likelihood score, provided in the query results.
SNPs detected in previous genome assembly builds are converted to the current build.
The second new resource is the output from the Mouse Genome Diversity Array (MusDiv),
a high-density genotyping microarray that includes over 623,000 SNP probes and over 900,000
invariant probes that target exons, copy number variants, and other features of
interest. As of July 2009, CGDSNPdb includes MusDiv SNP calls for 72 inbred
laboratory mouse strains.
The automated Quality Control (QC) analysis includes tests for a number
of possible inconsistencies. All QC data is archived in the database,
enabling searches by SNP ID to retrieve these descriptions even when the SNP
has been pruned from the final database.
CGD does not load the following cases into the core database tables:
SNPs that are inconsistent with the reference C57BL/6J (B6) genome, including:
- SNPs for which the source-provided position maps to an ambiguous ("N") base call in B6
- SNPs that do not match the B6 genome at the source-provided position and whose
source-provided flanking sequences also match at less than 70%
- SNPs that match the B6 genome at the source-provided position, but whose
source-provided flanking sequences match at less than 60%
- SNPs where multiple accession IDs map to the same location with
conflicting strain allele assignments and no flanking sequence is available to resolve the conflict
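The reference-consistency rules above can be sketched in Python as follows. This is a sketch under stated assumptions, not the CGD pipeline: the source gives only the 70% and 60% thresholds, so the ungapped, position-by-position identity measure and the names `percent_identity` and `exclusion_reason` are hypothetical. The fourth rule (conflicting accession IDs) is not covered.

```python
from typing import Optional


def percent_identity(a: str, b: str) -> float:
    """Percent of identical positions over the shorter sequence.

    Assumption: an ungapped comparison; the source does not say how
    flanking sequences are actually aligned.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / n


def exclusion_reason(genome_base: str, source_b6_allele: str,
                     source_flank: str, genome_flank: str) -> Optional[str]:
    """Return why a SNP would be excluded, or None if it passes these checks."""
    genome_base = genome_base.upper()
    if genome_base == "N":
        return "ambiguous reference base"
    identity = percent_identity(source_flank, genome_flank)
    if genome_base != source_b6_allele.upper():
        # Allele mismatch: flanks must match at >= 70% to keep the SNP.
        return "allele and flank mismatch" if identity < 70.0 else None
    # Allele match: flanks must still match at >= 60%.
    return "flank mismatch" if identity < 60.0 else None
```

Note that the flank threshold is stricter (70%) when the allele itself disagrees with the reference, so a flanking identity of 65% passes only when the B6 allele matches.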
CGDSNP Overview
A summary of the total SNP data included in CGDSNPdb
| Category | SNP count |
| Total | 9,686,537 |
| Transition | 6,607,155 |
| Transversion | 3,079,382 |
| Intergenic | 5,617,609 |
| Genic | 4,068,928 |
| Intronic | 3,850,229 |
| Exonic | 247,920 |
| UTR | 112,067 |
| CDS | 126,078 |
| CDS:Synonymous | 85,090 |
| CDS:Non-Synonymous | 43,698 |
| Noncoding gene exon | 17,032 |
| Noncoding gene intron | 5,910 |
Current Sources
| Source Name | Strains | Initial SNP count | NCBI Build | Total Loaded into CGD | Data Report |
| NIEHS | 16 | 8,262,793 | 37 | 8,229,903 (about 99.6%) | |
| Broad | 49 | 138,602 | 37 | 138,594 (about 99.99%) | All Duplicates SNPs (9); All Ambiguous SNPs (233); All Allele Swap SNPs (209); Reverse Strand (277) |
| GNF | 76 | 156,513 | 37 | 155,677 (about 99.5%) | |
| Mouse Diversity Array | 72 | 581,672 (genotype data files) | 37 | 548,363 (about 94.3%) | |
| SNPs Imputed | 74 | 7,868,024 | 37 | 7,867,856 (about 99.99%) | All Ambiguous SNP allele call (341); All Allele Swap SNPs (221); Reverse Strand (8800) |
| Celera SNPs | 5 | 2,122,060 | 37 | 2,122,059 (about 100%) | All Ambiguous SNP allele call (); All Allele Swap SNPs (); Reverse Strand () |
| Paigen SNPs | 50 | 24,607 | 37 | 23,975 (about 97.4%) | All Ambiguous SNP allele call (); All Allele Swap SNPs (); Reverse Strand () |
| Wild Derived SNPs | 12 | 666 | 37 | 666 (about 100%) | All Ambiguous SNP allele call (); All Allele Swap SNPs (); Reverse Strand () |
Data Conflict
- C57BL/6J allele conflict: there are 4 types of allele conflict, plus the no-conflict case
  - Allele_conflict = 0/1 ==> No conflict: the provided B6 allele matches
  the base at the SNP location of our B6 genome of the same build and strand
  - Allele_conflict = 2 ==> Conflict (Bad Strand): the provided B6 allele does not match
  the base at the SNP location of our B6 genome of the same build and strand,
  but instead matches the corresponding base on the reverse strand
  - Allele_conflict = 3 ==> Conflict (Ambiguous SNPs): the provided B6 allele does not match
  the base at the SNP location of our B6 genome of the same build on either the same or the reverse strand,
  and the genome base does not match the other allele either
  - Allele_conflict = 4 ==> Conflict (Unmapped/Ambiguous SNPs): the base at the
  provided genome location is "N"
  - Allele_conflict = 5 ==> Conflict (Allele Swap SNPs): the provided B6 allele does not match
  the base at the SNP location of our B6 genome of the same build on either the same or the reverse strand,
  but the genome base matches the other allele (the SNP allele)
  So the following cases are not conflicts: N/A, A/N, T/N, N/T, N/C, C/N, G/N, N/G
- snp_allele conflict between sources:
  Two SNP sources have a snp_allele conflict if and only if:
  - the snp_allele from source1 != "N"
  - the snp_allele from source2 != "N"
  - source1_snp_allele != source2_snp_allele
  So the following cases are not conflicts: N/A, A/N, T/N, N/T, N/C, C/N, G/N, N/G
- genotype allele conflict:
  Two SNP sources have a genotype allele conflict for a given strain if and only if:
  - the genotype allele from source1 != "N"
  - the genotype allele from source2 != "N"
  - source1_genotype_allele != source2_genotype_allele
  So the following cases are not conflicts: N/A, A/N, T/N, N/T, N/C, C/N, G/N, N/G
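The conflict definitions above can be sketched in Python. This is an illustration under stated assumptions, not the CGD implementation: the function names are hypothetical, and the order of the checks is inferred from the wording of the codes. It returns 0 for the no-conflict case that the text labels "0/1", treats a provided allele of "N" as no conflict, and reads code 5 as the genome base matching the SNP allele.

```python
# Hypothetical helper names; the check order is inferred from the text.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def b6_allele_conflict(provided_b6: str, snp_allele: str, genome_base: str) -> int:
    """Return an Allele_conflict code for one SNP against the B6 genome base."""
    provided_b6 = provided_b6.upper()
    snp_allele = snp_allele.upper()
    genome_base = genome_base.upper()
    if genome_base == "N":
        return 4  # unmapped/ambiguous: reference base is "N"
    if provided_b6 == "N" or provided_b6 == genome_base:
        return 0  # no conflict (assumption: an "N" call is never a conflict)
    if COMPLEMENT.get(provided_b6) == genome_base:
        return 2  # bad strand: matches on the reverse strand
    if snp_allele == genome_base:
        return 5  # allele swap: genome base matches the SNP allele
    return 3  # ambiguous: no match on either strand or with the other allele


def alleles_conflict(allele1: str, allele2: str) -> bool:
    """Conflict between two sources: both calls are non-N and they differ.

    The same test applies to snp_allele and to per-strain genotype alleles.
    """
    a, b = allele1.upper(), allele2.upper()
    return a != "N" and b != "N" and a != b
```

Because the reverse-strand check runs before the allele-swap check, a complementary pair such as A/T with genome base T is reported as code 2 rather than code 5, which follows the wording of code 5 ("does not match ... same build and strand or reverse strand").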