To address the demand for data science skills and knowledge in biomedical research, we are developing and disseminating a cutting-edge curriculum in genomic data analysis and computational modeling. To build collaborations between those from quantitative and non-quantitative backgrounds, curriculum modules will provide cross-training in statistical methods, computing, and genomics. Our approach will be to provide advanced hands-on training in statistical inference, modeling, and high-throughput sequence analysis on the cloud to an audience of practicing scientists, postdoctoral fellows, and doctoral students. We will refine the curriculum at the Jackson Laboratory and will extend our reach through Software Carpentry, a volunteer-powered nonprofit that teaches lab skills for scientific computing and provides open-access teaching materials. Our goals are to produce biomedical researchers who are also skilled data science practitioners and to nurture collaborations between disciplines.
We propose to accomplish our goals with the following specific aims:
Aim 1. We will develop a curriculum for advanced training in genomic data analysis. A modular set of lessons will provide practical hands-on training in statistical inference, modeling, and cloud computing. The full curriculum will be available on the web under an open license for self-study and for teaching.
Aim 2. We will perfect the curriculum by testing and evaluating each lesson on site at the Jackson Laboratory with an audience of practicing scientists, postdoctoral fellows, and graduate students. We will prepare this audience for our advanced curriculum by providing baseline training in scientific computing for genomics through Software Carpentry and Data Carpentry workshops.
Aim 3. We will disseminate the curriculum through Software Carpentry’s global instructor network. Software Carpentry’s network of more than 400 instructors in more than 25 countries includes many genomics experts. We will train instructors to teach the curriculum and will engage the instructor community in refining and updating the curriculum.
Module 1: Concepts of Genetic Mapping. This module lays the groundwork for understanding statistical design and data analysis methods for genetic mapping of complex, quantitative traits in model organisms and in human populations. The module will be taught as three self-contained units designed for one day of instruction each, for a total of 24 hours of instruction. Each unit will introduce new data and software tools, but they will all be tied together by a common set of concepts. Learners will progress from working with simple inbred crosses (Unit 1) through human populations (Unit 3) with advanced cross populations (Unit 2) serving as a conceptual bridge.
Unifying concepts of genetic mapping. The goal of genetic mapping is to discover genetic variants that affect phenotypes. This is accomplished by fitting models that regress the phenotype on genotype. We will describe different types of mapping populations and will present approaches to mapping in genetic crosses and natural populations. Learners will develop an understanding of the differences and commonalities between experimental and natural population mapping studies. Themes that recur across all units include:
Data formats and quality control for genetic mapping data: Prior to performing analysis, data must be correctly formatted and cleaned. Standard formats for genetic mapping data will be presented and steps for data quality assessment will be carried out.
Evaluating statistical significance: Investigators must determine the statistical significance of genetic mapping results. We will present a variety of methods that are used to account for multiple testing across the genome. Learners will understand the multiple testing problem and will evaluate the significance of genetic mapping results using Bonferroni correction, false discovery rates, and permutation analysis.
Population structure: Whether experimental or natural, all genetic mapping populations embody a structure of relationships among individuals. Methods for estimating kinship between individuals and heritability of traits can provide insights into the genetic architecture of quantitative traits. In each unit, learners will analyze the population structure. They will learn when and how to apply linear mixed model regression to account for population structure in genetic mapping analysis.
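As an illustration of the kind of exercise these themes lead to, the following minimal sketch regresses a simulated phenotype on genotype at each marker and sets a genome-wide significance threshold by permutation. It is written in Python using only the standard library and is not the curriculum's lesson code. The simulated backcross-style data (independent markers coded 0/1, one causal marker), the function names, and the default parameters are all assumptions made for illustration.

```python
import random


def marker_r2(genotypes, phenotypes):
    """Fraction of phenotypic variance explained by one marker.

    For a single-marker linear regression this is the squared
    correlation between genotype and phenotype.
    """
    n = len(phenotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic marker or constant trait: nothing to explain
    return sxy * sxy / (sxx * syy)


def genome_scan(genotype_matrix, phenotypes):
    """R^2 per marker; genotype_matrix[m] holds marker m for all individuals."""
    return [marker_r2(marker, phenotypes) for marker in genotype_matrix]


def permutation_threshold(genotype_matrix, phenotypes, n_perm=200, alpha=0.05, seed=1):
    """Empirical (1 - alpha) quantile of the genome-wide maximum under the null.

    Shuffling phenotypes breaks any genotype-phenotype association while
    keeping the marker data intact, so the maximum statistic over all
    markers accounts for multiple testing across the genome.
    """
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    null_maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null_maxima.append(max(genome_scan(genotype_matrix, shuffled)))
    null_maxima.sort()
    index = min(n_perm - 1, int((1 - alpha) * n_perm))
    return null_maxima[index]


def simulate(n_ind=100, n_markers=50, causal=10, effect=2.0, seed=42):
    """Hypothetical data: independent 0/1 markers, one of which shifts the trait."""
    rng = random.Random(seed)
    genotype_matrix = [
        [rng.randint(0, 1) for _ in range(n_ind)] for _ in range(n_markers)
    ]
    phenotypes = [
        effect * genotype_matrix[causal][i] + rng.gauss(0, 1) for i in range(n_ind)
    ]
    return genotype_matrix, phenotypes


genotype_matrix, phenotypes = simulate()
scan = genome_scan(genotype_matrix, phenotypes)
threshold = permutation_threshold(genotype_matrix, phenotypes)
peak = max(range(len(scan)), key=scan.__getitem__)
print(f"peak marker {peak}: R^2 = {scan[peak]:.3f}, threshold = {threshold:.3f}")
```

The sketch ignores linkage between markers, covariates, and population structure. A linear mixed model with a kinship matrix would replace the simple regression when relatedness among individuals matters.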
Learning goals. After completing this module, learners will:
Understand the concepts of genetic mapping in experimental crosses and natural populations.
Be able to work with common data formats.
Know how to run quality control in preparation for genetic mapping analysis.
Carry out genome-wide mapping analysis with inbred crosses, advanced crosses and human data.
Understand and apply methods to evaluate statistical significance of genome-wide mapping results.
Understand the concepts of heritability and kinship.
Be able to apply the linear mixed model in genetic mapping analysis.
Understand the role of covariates in genetic mapping analysis.
Be able to use linear models to evaluate environment- or sex-specific genetic effects and epistasis.
Module 2: High Throughput Sequence Analysis. Module 2 introduces sequencing technologies and their applications with an emphasis on quantitative gene expression analysis by RNA sequencing (RNA-Seq). The module is organized into units that can be taught over a two-day period for a total of 16 hours of instruction. Learners will analyze raw RNA-Seq data (FASTQ files) and will implement a reproducible analysis pipeline. The software tools to be used in this module are primarily run from the command line, so familiarity with the Unix computing environment is assumed. Software Carpentry and Data Carpentry workshops provide essential training in the Unix command line and therefore meet this prerequisite for Module 2.
Learning goals. After completing this module, learners will:
Be familiar with modern sequencing technologies and their applications.
Be able to work with standard data formats.
Be able to run quality control analysis.
Understand principles of experimental design as they pertain to RNA-Seq experiments.
Be able to align and visualize sequence reads using standard alignment software.
Be able to use and interpret alignment-free methods for quantification of RNA-Seq data.
Be able to compute and normalize quantitative measures of gene expression.
Be able to compute and evaluate statistical measures of differential expression.
Understand the concepts of empirical Bayes estimation and multiple testing corrections.
Be able to assemble and run their own RNA-Seq analysis pipeline on the cloud.
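To make the goal of computing and normalizing expression measures concrete, the sketch below converts per-gene read counts to transcripts per million (TPM), one common normalized measure. This is an illustrative Python example under stated assumptions, not the module's pipeline: the function name, the counts, and the gene lengths are hypothetical, and real analyses typically use effective rather than annotated gene lengths.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million from read counts and gene lengths in base pairs."""
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    # Step 1: divide each count by gene length in kilobases (reads per kilobase).
    rates = [count / (length / 1000) for count, length in zip(counts, lengths_bp)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(counts)
    # Step 2: rescale so the values for one sample sum to one million.
    return [rate / total * 1e6 for rate in rates]


# Hypothetical sample: longer genes collect proportionally more reads,
# so all three genes end up with the same TPM value.
counts = [100, 200, 300]
lengths_bp = [1000, 2000, 3000]
values = tpm(counts, lengths_bp)
print(values)
```

Because length is divided out before rescaling, TPM values sum to the same total in every sample, which makes within-sample proportions comparable across samples.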
Module 3: Systems Genetics Analysis of Gene Expression and Proteomics Data. Systems genetics applies large-scale molecular phenotyping in the context of a genetic mapping population to investigate networks of regulatory relationships. Genetic variation provides a set of perturbations that are assumed to be causal for variation in molecular and clinical phenotypes. The extensive data used in systems genetics analysis present many challenges and potential pitfalls, but also opportunities for unbiased discovery of new mechanisms that govern the molecular substrate underlying health and disease.
This module is composed of five units that build on one another sequentially and are designed to be taught as two one-day sessions, for a total of 16 hours of instruction. Computation will be carried out in R using tools drawn from multiple software packages and custom scripts (to be provided). Learners are assumed to have completed Modules 1 and 2 of this series and to be comfortable working with R in a cloud-computing environment.
Learning goals. After completing this module, learners will:
Be familiar with a variety of “omics” technologies that generate large-scale measurement data from individual biological samples.
Understand the strengths and weaknesses of different genetic strategies for systems genetics studies.
Be able to carry out quality control analyses including diagnosing and correcting batch effects.
Be able to use data reduction and visualization techniques to explore large omics data sets.
Be able to carry out expression QTL (eQTL) and protein QTL (pQTL) mapping using appropriate statistical methods including multiple-testing corrections.
Be able to compute and evaluate tests of causality and understand the pitfalls of causal inference.
Be able to combine transcriptome and proteome profiling data to infer regulatory networks.
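The multiple-testing corrections named in the goals above can be sketched briefly. The example below compares Bonferroni adjustment with the Benjamini-Hochberg false discovery rate procedure on a small set of p-values, as might arise from testing several transcripts in an eQTL analysis. It is a minimal Python illustration, not the module's code (the module itself uses R); the function names and p-values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each adjusted value is the
    # minimum of p * m / rank over the current rank and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from four tests.
p_values = [0.01, 0.04, 0.03, 0.005]
print("Bonferroni:", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

Bonferroni controls the chance of any false positive and is the more conservative of the two. Benjamini-Hochberg controls the expected proportion of false positives among the reported findings, which is often preferred when thousands of molecular traits are tested.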