Key Considerations for Measuring Allelic Expression on a Genomic Scale Using High-throughput Sequencing
Landry, Christian R.
Wittkopp, Patricia J.
Gruber, Jonathan D.
Citation: Fontanillas, Pierre, Christian R. Landry, Patricia J. Wittkopp, Carsten Russ, Jonathan D. Gruber, Chad Nusbaum, and Daniel L. Hartl. 2010. Key considerations for measuring allelic expression on a genomic scale using high-throughput sequencing. Next Generation Molecular Ecology. Special Issue. Molecular Ecology 19:212-227.
Abstract: Differences in gene expression are thought to be an important source of phenotypic diversity, so dissecting the genetic components of natural variation in gene expression is important for understanding the evolutionary mechanisms that lead to adaptation. Gene expression is a complex trait that, in diploid organisms, results from transcription of both maternal and paternal alleles. Directly measuring allelic expression rather than total gene expression offers greater insight into regulatory variation. The recent emergence of high-throughput sequencing offers an unprecedented opportunity to study allelic transcription at a genomic scale for virtually any species. By sequencing transcript pools derived from heterozygous individuals, estimates of allelic expression can be directly obtained. The statistical power of this approach is influenced by the number of transcripts sequenced and the ability to unambiguously assign individual sequence fragments to specific alleles on the basis of transcribed nucleotide polymorphisms. Here, using mathematical modelling and computer simulations, we determine the minimum sequencing depth required to accurately measure relative allelic expression and detect allelic imbalance via high-throughput sequencing under a variety of conditions. We conclude that, within a species, a minimum of 500–1000 sequencing reads per gene are needed to test for allelic imbalance, and consequently, at least five to 10 million reads are required for studying a genome expressing 10 000 genes. Finally, using 454 sequencing, we illustrate an application of allelic expression by testing for cis-regulatory divergence between closely related Drosophila species.
A major challenge in evolutionary biology today is understanding the genetic and molecular mechanisms that give rise to phenotypic differences within and between species. Such differences can arise from mutations affecting the function of gene products (i.e. proteins or RNAs) or mutations that affect expression of these genes. Historically, researchers have looked almost exclusively for (and often found) changes in protein coding regions that appeared to contribute to phenotypic evolution; however, during the last decade, there has been a dramatic increase in the number of studies showing that changes affecting gene regulation can also bring about diversity in ecologically relevant traits that affect behaviour, physiology and morphology (e.g. Duda & Remigio 2008; Giger et al. 2008; Voelckel et al. 2008; see also for reviews Wray 2007; Hoekstra & Coyne 2007; Stern & Orgogozo 2008; Pennisi 2008; Wolf et al. 2010).
Studies of gene expression have become routine with the development of techniques that quantify transcript abundance in a high-throughput way. Microarray studies, in particular, have produced valuable catalogues of differences in transcript levels between individuals (Oleksiak et al. 2002; Whitehead & Crawford 2006), between species in diverse taxa (Rifkin et al. 2003) and between ecological conditions (Reymond et al. 2000; Carsten et al. 2005; Derome et al. 2006). Such studies also show that inter-individual differences in gene expression are often highly heritable (Wayne et al. 2004; Gibson & Weir 2005; Hughes et al. 2006; Lemos et al. 2008; Ayroles et al. 2009).
Because of this heritability, quantitative trait locus (QTL) mapping can be combined with microarray analysis to investigate the genetic basis of variable gene expression (Vasemagi & Primmer 2005). When a QTL affecting a gene’s transcription maps close to the affected gene it can be classified as cis-acting, while a QTL that maps further away on the same chromosome, or to another chromosome, can be classified as trans-acting (Brem et al. 2002). However, strictly speaking, ‘cis’ describes mutations that affect expression of only the allele on the same chromosome as the mutation, whereas ‘trans’ describes mutations that affect allelic expression on both homologous chromosomes. Examples of cis-acting sequences include promoters and enhancers, which are typically located close to the gene that they regulate, while examples of trans-acting regulators include genes that encode transcription factors, which may be located anywhere in the genome. Classifications of expression QTLs as cis- or trans-acting based solely on their proximity to the affected gene are therefore only an approximation – and one that comes with many caveats (Rockman & Kruglyak 2006).
Nevertheless, studies mapping expression QTLs suggest that both cis- and trans-regulatory mutations contribute to transcriptional variation, with a preponderance of expression QTLs appearing to be cis-acting (Wayne et al. 2004; Hughes et al. 2006; Osada et al. 2006; Bergen et al. 2007; Genissel et al. 2008; Gilad et al. 2008; Price et al. 2008; Lemos et al. 2008; but see Morley et al. 2004), although this methodology generally has less statistical power to detect trans-acting than cis-acting variants (Cookson et al. 2009). In addition, QTL mapping studies of variable gene expression require microarrays suitable for studying the species of interest, molecular markers that cover its complete genome, and resources for genotyping these markers in a segregating population. The lack of any one of these things can be a significant impediment for mapping expression QTLs outside well-established genetic model systems.
An alternative strategy for studying regulatory variation uses allelic transcript abundance and the fact that cis-regulatory mutations have allele-specific effects on gene expression while trans-regulatory mutations affect expression of both alleles in a diploid cell (Cowles et al. 2002; Wittkopp et al. 2004). One or more transcribed differences in nucleotide sequence are used to discriminate between transcripts produced by each allele. Asymmetric expression of two alleles, also known as allelic imbalance (AI), observed between alleles present in the same cell (i.e. exposed to the same trans-regulatory environment) provides direct evidence of cis-regulatory differences. Expression differences observed between individuals homozygous for two different alleles that are not also observed between these same alleles in heterozygotes are attributed to trans-regulatory differences (Wittkopp et al. 2004).
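As a rough illustration of what testing for AI can look like in practice, the sketch below applies an exact two-sided binomial test to allele-specific read counts from a single gene in a heterozygote. The choice of a binomial test, the null expectation of equal expression (0.5), and the function names `binom_pmf` and `allelic_imbalance_p` are assumptions made for this example; they are not taken from the study itself.

```python
from math import exp, lgamma, log


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability of k successes in n trials, computed in log space."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1 - p))


def allelic_imbalance_p(reads_a: int, reads_b: int, expected: float = 0.5) -> float:
    """Exact two-sided binomial p-value for allele-specific read counts.

    reads_a, reads_b: informative reads assigned to each allele (hypothetical inputs).
    expected: proportion of reads from allele A under the null of no imbalance
              (assumed to be 0.5 here; must lie strictly between 0 and 1).
    """
    n = reads_a + reads_b
    pmf = [binom_pmf(i, n, expected) for i in range(n + 1)]
    # Sum the probabilities of all outcomes no more likely than the observed one;
    # the small tolerance absorbs floating-point noise between symmetric outcomes.
    cutoff = pmf[reads_a] * (1 + 1e-9)
    return min(1.0, sum(x for x in pmf if x <= cutoff))


# Made-up counts for three genes, each with 100 informative reads.
for a, b in [(50, 50), (55, 45), (70, 30)]:
    print(f"allele A={a}, allele B={b}, p={allelic_imbalance_p(a, b):.5f}")
```

With only 100 informative reads, a 55:45 split is not distinguishable from equal expression under this test, whereas a 70:30 split is, which gives a sense of why sequencing depth limits the size of imbalance that can be detected.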
This allele-specific approach has now been used to decompose variable gene expression into its cis- and trans-regulatory component parts for flies (e.g. Wittkopp et al. 2008a,b), humans (e.g. Pant et al. 2006; Serre et al. 2008), plants (e.g. de Meaux et al. 2005; Guo et al. 2008) and yeast (Tirosh et al. 2009). With the exception of Tirosh et al. (2009), who developed custom microarrays, the methods used to measure allelic expression in these studies are not readily scalable to an entire genome. Furthermore, methods used in these studies, including Tirosh et al. (2009), require polymorphic sites that differentiate alleles to be known a priori. For these reasons, studying allelic expression genome wide has been impractical for nonmodel (as well as most model) species.
Next generation sequencing technologies have the potential to revolutionize studies of allelic expression. Because they obviate the need for a priori sequence information, molecular markers, and locus-specific genotyping assays, next generation sequencing methods can measure allelic abundance at a genomic level in virtually any species. Only transcribed nucleotide differences between alleles and sufficient sequencing depth for detecting AI are required. For these reasons, we expect measurements of allelic expression based on next generation sequencing will soon be acquired by many researchers, not only to disentangle cis- and trans-regulatory variation, but also to quantify the heritability of gene expression, examine dominance among regulatory alleles, evaluate their contribution to morphological, physiological, or behavioural changes, and reveal patterns of allelic variation within and between species.
Not surprisingly, the benefits of next generation sequencing come with a price – and often a high one. A single ‘run’ of high-throughput sequencing can provide up to hundreds of millions of sequences, but currently costs thousands of dollars. The precise cost per base differs among technologies, as does the length of each sequenced fragment and the total number of sequences collected. Because of this cost, careful experimental design that maximizes the data per dollar for allelic expression studies using next generation sequencing is critical. Optimal experimental design is particularly important for studies in molecular ecology that seek to examine allelic expression in multiple individuals, species or environmental conditions.
In this study, we use mathematical modelling and computer simulations to identify critical parameters affecting measurements of allelic expression and the detection of AI with high-throughput sequencing. We show that the statistical power of this method depends upon four crucial parameters (Fig. 1): sequence divergence between alleles, the relative transcript abundance, the average read length (i.e. amount of transcript sequenced) and sequencing depth (i.e. average number of reads per gene). The latter two parameters determine the number of sequencing reads expected to map to each gene. The former two parameters determine the proportion of sequence reads per gene that are informative for allelic expression [i.e. contain one or more single nucleotide polymorphisms (SNPs) that allow reads to be unambiguously assigned to an allele]. We show that this probability is strongly affected by the location of SNPs within an mRNA as well as by the way in which the cDNA library is prepared for sequencing. Here, we derive a mathematical model that determines the minimum number of reads required to test for significant AI given various levels of sequence divergence, read lengths, and distributions of relative transcript abundance, and we compare these results with simulations. Finally, to illustrate the potential of this approach, we describe an empirical study using measurements of allelic expression in F1 hybrids between Drosophila melanogaster and Drosophila simulans obtained using 454 sequencing (Roche 454 Life Sciences).
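To make the interplay between sequence divergence and read length concrete, the following sketch estimates the fraction of reads expected to be informative. It uses a deliberately simplified model that is not the one derived in the study: SNPs are assumed to occur independently and uniformly along the transcript, so a read of length L is informative with probability 1 - (1 - d)^L for a per-site divergence d. SNP position and library preparation, which the study identifies as important, are ignored. The function names and the example parameter values are hypothetical. The second function simply reproduces the arithmetic behind the abstract's estimate of total reads.

```python
def informative_fraction(divergence: float, read_length: int) -> float:
    """Probability that a read covers at least one SNP.

    Assumes (for illustration only) that SNPs are independent and uniformly
    distributed, with per-site divergence `divergence` between the two alleles.
    """
    return 1.0 - (1.0 - divergence) ** read_length


def total_reads_needed(reads_per_gene: int, n_genes: int) -> int:
    """Total reads required if every expressed gene receives the same depth."""
    return reads_per_gene * n_genes


# Example divergence levels and read lengths (illustrative values only).
for d in (0.001, 0.01, 0.05):
    for length in (36, 100, 250):
        frac = informative_fraction(d, length)
        print(f"divergence={d:.3f}, read length={length}: "
              f"{frac:.1%} of reads informative")

# Arithmetic from the abstract: 500-1000 reads per gene across 10 000 genes.
for per_gene in (500, 1000):
    print(f"{per_gene} reads/gene x 10000 genes = "
          f"{total_reads_needed(per_gene, 10_000):,} reads")
```

Under this simplification, low divergence combined with short reads leaves most reads uninformative, so the number of reads that actually contribute to an AI test can be far smaller than the number that map to a gene.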
Citable link to this page: http://nrs.harvard.edu/urn-3:HUL.InstRepos:10405235