Leveraging genetic association data to investigate the polygenic architecture of human traits and diseases

A dissertation presented by Ying Leong Chan to The Division of Medical Sciences in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the subject of Genetics and Genomics

Harvard University
Cambridge, Massachusetts
April 2014

© 2014 Ying Leong Chan
All rights reserved.

Dissertation Advisor: Professor Joel N. Hirschhorn

ABSTRACT

Many human traits and diseases have a polygenic architecture, where phenotype is partially determined by variation in many genes. These complex traits or diseases can be highly heritable, and genome-wide association studies (GWAS) have been relatively successful in the identification of associated variants. However, these variants typically do not account for most of the heritability and thus the genetic architecture remains uncertain. This dissertation describes analytical approaches to look for evidence of models of genetic architecture that could explain the remaining heritability. We develop methods to make predictions under various models, and compare the expected results from these predictions against the observed data for several traits and diseases. First, in studies of height (a classical polygenic trait), we modeled the expected cumulative effect of common variants identified from GWAS and compared the model with empirical data in individuals from the tails of the height distribution. We found that these common variants are predictive of stature, but have smaller than expected effects specifically at the short end of the height distribution. This result is consistent with models where rare variants with moderate effect influence stature only in the shortest individuals.
Second, we showed that under genetic models where low frequency variants make polygenic contributions to disease, there will be an excess of low frequency risk-increasing variants detected in GWAS. As such, by comparing the number of detected risk-increasing to risk-decreasing variants, one can detect a signal of the contribution to polygenic inheritance from low frequency variants. Finally, we examine the genetic architecture of sitting height ratio (SHR), a measure of body proportion that varies dramatically between individuals of African and European ancestry. We find that the SHR difference between populations is largely due to polygenic architecture; there is no evidence for any major locus accounting for most of this difference. These results show that, with the appropriate computational and genetic models, one can use empirical results of genetics studies to make inferences regarding genetic architecture of human traits and diseases. Doing so can help investigators prioritize strategies for uncovering the remaining unexplained heritability.

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgements
Attributions
Chapter 1: Introduction
  A Preamble
  Heritability of complex traits
  Methods for studying the genetics of complex traits
  Complex phenotypes
  Heritability of human traits
  Accumulating evidence from multiple studies
  Summary
Chapter 2: Common variants show predicted polygenic effects on height in the tails of the distribution, except in extremely short individuals
Chapter 3: An excess of risk-increasing low frequency variants can be a signal of polygenic inheritance in complex diseases
Chapter 4: Genome wide association in European and African Americans discover novel loci associated with sitting height ratio
Chapter 5: Concluding remarks
  Overview
  Major findings and implications
  Future Directions
  A Postscript

DEDICATION

I dedicate this thesis to my loving wife, Teng Ting (Elaine) Lim.
You are the person who has always been there during difficult and trying times. We share and do almost everything together. I would not be the person I am today without your love and support. Elaine, I dedicate this thesis to you.

ACKNOWLEDGEMENTS

When I first arrived on the Harvard Medical School campus, I was captivated by the breadth and majesty of the place. Besides the Victorian design of the Quad buildings, the place was surrounded by many hospitals and people, and the sound of sirens wailing from ambulances made the whole area a really busy place. It was recruitment weekend for new students and part of the program was to have a couple of faculty members give talks to the potential incoming students. That was where I met my eventual dissertation advisor, Joel Hirschhorn. It was March of 2009. I rotated in the Hirschhorn lab for the summer of that year and eventually joined the lab as a student the next year. Joel was an extremely helpful and encouraging mentor throughout graduate school. We (members of the lab) meet regularly with him, at least once every two weeks (30 to 60 minutes) despite his extremely busy schedule, and we always get his full undivided attention during each session. Also, during lab meetings, he is always there to give critical feedback and suggestions when you present your work and ideas, whether it is about possible experiments to perform or just feedback on giving the presentation itself. Furthermore, he sometimes performs his own analysis, contrary to the long held belief that “PIs don’t do experiments themselves”. To me, it is a privilege to have the opportunity to be his student as well as a member of his lab. I remember that when I first joined the lab, I was given the task of “coming up with 10 ideas” by Joel. I only came up with 8 and, after long discussions about each of them, one eventually became one of my thesis aims with the help of another member of the lab, Andrew Dauber.
My time in graduate school would not have been the same without Andrew. Andrew started as a fellow in the lab at the same time that I started my rotation, so he had been in the lab for about a year when I eventually joined. It was during one of the lab meetings that Andrew presented some genotyping data on height extremes (very short and tall individuals) that could answer one of the 8 ideas that I had initially. Andrew was very kind and helpful and we worked together to answer the question. I will always be grateful to Andrew for getting me started as well as for his mentorship throughout my time in the lab. Rany Salem, a postdoctoral fellow in the lab, is someone I would also like to specially mention. Rany started as a postdoctoral fellow the same day I started my rotation in the lab. Although he is based at the Broad Institute, he comes to Children’s (Boston Children’s Hospital) frequently and we would always get coffee and exchange ideas. His work on diabetic nephropathy allowed me to develop my next idea, using the summary statistics generated from that project. Besides that, his efforts to obtain genotype data from many different cohorts allowed me and others to explore other ideas with regard to complex traits. In general, members of the Hirschhorn lab are a very sociable bunch, which is odd considering that most of us are ‘computational’ people, who do not have a reputation for being sociable. To illustrate this, a former student of the lab who is now a postdoctoral fellow, Charleston Chiang, frequently organizes “games-night” at his place (about once a month), where he invites fellow members of the lab to hang out and play board games. One of our favorite games is called “Betrayal”, which is a game about a group of adventurers exploring a haunted house in which one of the members becomes the “traitor” midway through the game. That was fun and I will always remember those good times. Another example worth mentioning is our regular sashimi buffets.
Somehow, there is a sizeable number of people in our lab who just love gorging on raw fish, me included. We started out at a reasonably priced restaurant called Yamato somewhere in Allston, where they have an all-you-can-eat buffet for just thirty dollars. However, when Tonu Esko joined us midway during my time in the lab, he “discovered” a new place called Takusan, which is cheaper and serves oysters as well. Takusan is now our regular hangout until we find a new place. Therefore, I would like to take the opportunity to thank other members of the lab, past and present, for creating the wonderful lab environment that is conducive to sharing ideas and establishing collaborations. Thanks to Tune Pers for providing the opportunity to work together on DEPICT. Thanks to Sophie Wang for the help in performing the forward simulations for our 2nd project as well as for free ice-cream. Thanks to Sailaja Vedamtam for pointing me to the necessary files when I needed them as well as for the wonderful vegetarian dinners that you make. Thanks to Tonu Esko for the opportunity to work with the Estonian data as well as for introducing Takusan Sushi. Thanks to Michael Guo for coffee in exchange for performing LD calculations. Thanks to Yan Meng for discussions about finance and investments. Thanks to Jon Swartz for the wonderful story about seaweed soup. Thanks to Vidhu Thaker for sharing her knowledge of obesity and New York. Thanks to Jennifer Moon for lunchtime discussions about biological techniques as well as all things Korean. Thanks to Yu-Han Hsu for accompanying us on our regular coffee breaks and thanks to Frances Lopez for suggesting Waterville Valley for a summer getaway. Finally, I would also like to say a big thank you to Meghan Foster for your patience in scheduling the weekly meetings with Joel despite his very busy schedule.
I would also like to thank the members of my dissertation advisory committee, Jonathon Seidman, Matthew Warman and Mark Daly, for the thoughtful advice and generous comments during graduate school. I would also like to thank my dissertation examiners, Souyma Raychaudhuri, David Page and Shaun Purcell, for agreeing to take time off from your busy schedules to be my examiners as well as for going through this dissertation. A huge thank you to fellow graduate students for the enjoyable time and company during graduate school. I really enjoyed the regular games nights, playing “Singaporean bridge” and Mario Party. Donkey Kong (Rigel) would like to thank Mario (Elaine), Luigi (Laura) and Yoshi (Palak) for the company. I would also like to thank my life-long mentors, Guna Rajagopal and Arnold Levine, for getting me started in my scientific career and providing me with adequate opportunities and training prior to graduate school. Also, I would like to thank my wife, Elaine Lim, for the support and guidance. Life in America can be lonely as our families are back in Singapore. We could not have what we have now without you being here with me. Finally, I would like to thank my family for their support in my graduate career, especially for supporting us in traveling more than 10,000 miles away from home for graduate school. To my parents, Ngai Kong and Hwee Koon, my older brother, Ying Soon, and my sister-in-law, Grace, thank you so much.

ATTRIBUTIONS

Chapter 2

Yingleong Chan: Together with J.N.H. and A.D., conceived and performed the single SNP ORs comparisons for FINRISK. Conceived and performed WAS analysis for FINRISK, WAS simulations for HUNT, FINRISK and modeling of additional variants for HUNT, FINRISK. Wrote the initial draft of the manuscript and, together with O.L.H., A.D., T.M.F., J.N.H. and M.N.W., edited the later drafts.
Oddgeir L Holmen: Conceived and performed the single SNP ORs comparisons for HUNT. Edited parts of the manuscript and provided helpful comments.
Andrew Dauber: Conceived and performed the single SNP ORs comparisons for FINRISK. Directed the genotyping of the height associated SNPs for the FINRISK samples. Substantially edited parts of the manuscript and provided many helpful comments.
Lars Vatten: Provided data for the HUNT cohort.
Aki S Havulinna: Provided data for the FINRISK cohort.
Frank Skorpen: Provided data for the HUNT cohort.
Kirsti Kvaløy: Provided data for the HUNT cohort.
Kaisa Silander: Provided data for the FINRISK cohort.
Thutrang T Nguyen: Performed the genotyping of the height associated SNPs for the FINRISK samples.
Cristen Willer: Provided data for the HUNT cohort.
Michael Boehnke: Provided data for the HUNT cohort. Suggested edits to manuscript.
Markus Perola: Provided data for the FINRISK cohort.
Aarno Palotie: Provided data for the FINRISK cohort.
Veikko Salomaa: Provided data for the FINRISK cohort.
Kristian Hveem: Provided data for the HUNT cohort.
Timothy M Frayling: Conceived of single SNP ORs comparisons for HUNT. Edited parts of the manuscript and provided helpful comments and direction.
Joel N Hirschhorn: Conceived of single SNP ORs comparisons for FINRISK. Calculated the expected single SNP ORs. Edited most of the manuscript and provided a lot of helpful comments and direction.
Michael N Weedon: Conceived of single SNP ORs comparisons for HUNT. Calculated the expected single SNP ORs. Calculated the combined HUNT and FINRISK single SNP ORs. Calculated the WAS for HUNT. Performed sibling analysis for HUNT. Wrote the initial parts of the manuscript, edited most of the manuscript and provided helpful comments and direction.

Chapter 3

Yingleong Chan: Conceived and performed R/P ratio analysis. Performed power calculations with varying parameters. Performed the calculations and simulations to obtain phenotypes for the selection model. Performed the R/P ratio simulations for uneven sample sizes and population stratification. Assessed the R/P ratio on published GWAS results.
Wrote the manuscript and, together with E.T.L. and J.N.H., edited the later drafts with revisions.
Elaine T Lim: Conceived and directed the acquisition of GWAS summary statistics of schizophrenia, bipolar disorder and major depressive disorder, Crohn’s disease and ulcerative colitis for R/P ratio analysis. Edited the manuscript and provided helpful suggestions.
Niina Sandholm: Performed GWAS for cohorts relevant to diabetic nephropathy.
Sophie R Wang: Performed the forward simulation to obtain allele frequencies and effect sizes.
Amy Jayne McKnight: Performed GWAS for cohorts relevant to diabetic nephropathy.
Stephan Ripke: Performed GWAS for cohorts relevant to inflammatory bowel disease.
DIAGRAM Consortium: Provided GWAS summary statistics for type 2 diabetes.
GENIE Consortium: Provided GWAS summary statistics for diabetic nephropathy.
GIANT Consortium: Provided GWAS summary statistics for obesity.
IIBDGC Consortium: Provided GWAS summary statistics for inflammatory bowel disease.
PGC Consortium: Provided GWAS summary statistics for schizophrenia, bipolar disorder and major depressive disorder.
Mark J Daly: Provided suggestions on power calculations.
Benjamin M Neale: Provided suggestions on power calculations.
Rany M Salem: Performed GWAS for cohorts relevant to diabetic nephropathy.
Joel N Hirschhorn: Provided guidance and suggestions on R/P ratio analysis. Conceived the negative selection analysis. Conceived the population stratification analysis. Provided much of the NCP ratio proof. Heavily edited the manuscript. Provided extensive feedback for revisions.

Chapter 4

Yingleong Chan: Conceived and performed sitting height ratio (SHR) comparisons between European and African Americans. Performed principal component analysis for determining admixture percentages. Performed association of admixture percentages with SHR. Performed GWAS on African and European cohorts with SHR. Performed comparisons with height associated variants. Wrote the manuscript.
Rany M Salem: Downloaded the data from dbGAP and set up the pipeline for quality control on the data.
Joel N Hirschhorn: Conceived the idea of studying sitting height as a phenotype. Suggested various analyses for studying sitting height. Provided extensive feedback on manuscript.

Chapter 1
Introduction

A PREAMBLE

Mendelian inheritance

It has long been recognized that physical traits are more likely to be shared by parents and their offspring, between siblings or close relatives, as well as between individuals of similar ethnic ancestry [1]. Such a phenomenon is known as heritability, and the modern explanation of heritability was first broadly described by Gregor Mendel more than a century ago, when he showed that some traits of pea plants follow a specific pattern of inheritance [2]. Mendel theorized that each individual possesses a pair of alleles for each trait and will randomly pass on one of the alleles to its offspring. The offspring would then inherit two alleles, one from its father and one from its mother, and the pair of alleles would determine the trait of the offspring. Such a pattern is now popularly known as Mendelian inheritance.

Polygenic inheritance

At the beginning of the 20th century, there were anthropologists and biologists who argued that since Mendelian inheritance predicts that traits would be discrete in nature, it cannot account for the many continuous or quantitative traits (e.g. height) observed in humans, and thus the theory cannot be applied to humans. However, in 1918, R. A. Fisher demonstrated that if there were multiple allele pairs, with each pair responsible for only a fraction of the trait and each following the same pattern of Mendelian inheritance, this could account for most of the continuous or quantitative traits observed in humans [3]. This model proposed by Fisher is what we now call polygenic inheritance.
Disease mapping

Today, we know that the source of heritability lies largely in the variants contained in the DNA of the genes on our diploid chromosomes, although there is some indication that epigenetic factors, molecular factors that attach to DNA, can have a role as well [4]. With the invention of methods like molecular cloning and the subsequent typing of genetic markers, it became possible to map heritable diseases to their respective genetic loci. Doing so allowed researchers to pinpoint the exact genetic variants that are responsible for causing the disease. Studying the genes underlying these variants can potentially inform us about the disease etiology and thereby be informative for developing therapeutics. Therefore, to map a disease to a genetic locus, one must be able to determine if a genetic marker is associated with disease status.

Linkage analysis

There are many types of genetic markers that can be used for this purpose. Among the earliest markers used were microsatellite markers or short tandem repeats (STRs) [5]. These STR alleles can be genotyped in a variety of ways, from performing gel electrophoresis to parallel sequencing [6]. However, more recently, since the completion of the Human Genome Project [7] and the International HapMap Project [8], single nucleotide polymorphisms (SNPs) have become the dominant marker of choice as they are more abundant and cover more of the human genome than any of the other known markers [9]. Having chosen the marker, the next problem is determining if the genetic marker is associated with disease status. One of the first methodologies used for determining this is linkage analysis [10,11]. Linkage analysis is a process by which researchers use genetic markers to determine if disease status co-segregates with any of these markers more than expected by random chance, by studying the inheritance pattern of these markers in families that have the trait or disease.
The degree of co-segregation is measured by the LOD (logarithm of odds) score, and a LOD score of 3.0 or greater is usually taken as evidence that the genetic locus represented by the marker harbors the variant that causes the disease. This approach has been very successful at identifying Mendelian disease genes [12] but falls short when trying to identify genes for complex diseases [13].

Linkage analysis not amenable for complex diseases

There are a number of reasons why linkage analysis is not amenable to identifying genes associated with complex diseases. First, complex diseases are thought to be genetically influenced by multiple genes rather than a single gene. This could mean that an affected individual is genetically predisposed to having the disease because of variants in many genes, each of which causes a small increase in the risk of developing the disease (polygenic inheritance). It could also mean that while, for each family, only mutations in a single gene are responsible, that gene is different for different families (locus heterogeneity). For example, an autosomal recessive disease like Fanconi Anemia has about 16 different genes [14]. If the number of genes were much larger, for example 160 instead of 16, then there would be a good chance that every family analyzed for the disease would have a different causal gene and thus that no genes would overlap between families. Whichever the case may be, polygenic inheritance or locus heterogeneity, linkage analysis will have less power for complex diseases as the genetic basis for each affected child within each family or across families is different.

Genetic Association Studies

Polygenic inheritance is a defining feature of most complex traits and one of the major reasons why linkage analysis in family pedigrees is not amenable to identifying genes responsible for complex traits. The problem is further compounded by the fact that many complex traits are influenced by non-genetic (environmental) factors as well.
To solve this problem, researchers suggested that genetic association studies, rather than linkage analysis, would be more effective in identifying the responsible genetic loci under the assumption of polygenic inheritance. Instead of examining chromosome markers that co-segregate with disease status in family pedigrees, genetic association studies examine the frequency of an allele in a large population cohort to determine if the allele frequency is correlated with the trait or disease status. Indeed, researchers have shown that for studies with the same sample sizes, genetic association studies significantly outperform linkage analysis under the assumption of polygenic inheritance. Genetic association studies have now evolved into genome-wide association studies (GWAS), where markers across the entire genome are systematically tested at an appropriate threshold of significance such that the significant results are robust and reproducible [15]. To date, many successful GWAS have been published, highlighting the overall success of GWAS as a methodology for identifying genetic loci associated with complex traits or diseases.

Missing heritability

Although genome-wide association studies (GWAS) have been largely successful, the variants identified typically do not explain most of the trait’s heritability. This result is known as the missing heritability problem, and several hypotheses have been suggested to explain it [16]. One such hypothesis is that a substantial fraction of the heritability of the disease or trait is due to rare genetic variants [17]. As these variants are rare in the population, they are not well assayed by many of the genotyping arrays available, nor are they amenable to imputation [18]. Another hypothesis is that there are more common variants with even smaller effect sizes and that existing studies are not well powered to detect these variants.
One way to answer this question would be to perform whole-genome sequencing, instead of using genotyping arrays, on an even larger number of samples, although such an experiment can be costly as whole-genome sequencing is still significantly more expensive than genotyping arrays. It would therefore be useful to determine, from the results of existing GWAS, whether performing sequencing and/or studying more samples to answer this question is likely to be fruitful. In this dissertation, I present various methods to infer from GWAS results the genetic landscape that could explain the remaining trait heritability. Apart from performing GWAS, I describe two different and independent approaches for making this inference without the need for performing additional whole-genome or whole-exome sequencing.

Approaches to examine genetic architecture

The first approach explores the possibility of rare genetic variants contributing to the trait by examining the effect of the variants identified through GWAS on individuals at the tails of the distribution (Chapter 2). Using human height as our model phenotype, we showed that common variants identified through GWAS are less predictive than expected at the short end of the distribution. This result can be explained by the presence of rare genetic variants contributing to short stature. The second approach explores the summary statistics obtained from GWAS (Chapter 3). By examining the direction of effect (odds-ratios or effect sizes), an excess of risk-increasing variants compared to risk-decreasing ones can be indicative of polygenic inheritance from low-frequency or rare genetic variants, especially for dichotomous traits or diseases. In the subsequent chapter (Chapter 4), I will describe our study to determine genetic variants that can explain the heritable complex trait of body proportion using sitting height ratio (SHR) as the phenotype.
SHR is thought to be heritable and the SHR of European Americans is known to be significantly larger than that of African Americans. I will provide evidence that this difference in SHR is largely genetically driven as well as polygenic. Finally, I will conclude with a summary of the findings presented and discuss the potential implications and possible future research stemming from the discoveries described in this dissertation.

THE HERITABILITY OF COMPLEX TRAITS

A description of heritability

Complex traits or diseases are broadly defined as phenotypes that do not follow a Mendelian pattern of inheritance. Such traits are usually relatively common, i.e. at least 1% of the population have the trait or disease, in contrast to Mendelian disorders, which are usually much rarer [19]. A main question in human biology is whether the expression of a trait of interest is due to genetic factors, environmental factors or just a product of stochasticity. To measure the contribution of genetic factors to the trait, one can measure the heritability. Heritability is a measure of how large a role genetics plays in explaining the differences in a trait between individuals of a population [20]. It can be loosely described as how much of the trait that you have is due to you inheriting it from your parents. It is also a technical term, defined as a ratio of variances, specifically the proportion of total variance in a population for a trait that is attributable to genetic variation [20]. This distinction between its uses in the literature is sometimes not made, which can be a source of confusion [21]. Heritability can also be divided into two categories, the first being broad-sense heritability and the second being narrow-sense heritability. Broad-sense heritability (H²) describes the proportion of the trait’s variability attributable to total genetic variation, while narrow-sense heritability (h²) describes the proportion attributable to additive genetic variation only.
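The two definitions can be written compactly in variance-component notation. This is a sketch under the standard quantitative-genetics decomposition, not a formula taken from this dissertation: the symbols V_P, V_G, V_A, V_D, V_I and V_E (phenotypic, total genetic, additive, dominance, epistatic and environmental variance) are introduced here for illustration, and the decomposition assumes no gene-environment covariance or interaction.

```latex
\begin{align}
V_P &= V_G + V_E, \qquad V_G = V_A + V_D + V_I \\
H^2 &= \frac{V_G}{V_P}, \qquad h^2 = \frac{V_A}{V_P} \le H^2
\end{align}
```

Because the additive variance is one component of the total genetic variance, narrow-sense heritability can never exceed broad-sense heritability under this decomposition.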
Methods for estimating heritability

As heritability is not a physical trait that can be directly measured, one can only use various methods to provide an estimate. One of the first methods is to determine if the average phenotypic value of the parents (mid-parental phenotype) is correlated with the offspring’s phenotypic value. This method was first used by Francis Galton over 100 years ago to show that human height is heritable [22]. The correlation can be measured by linear regression, and studies have put the heritability of height as high as 80% (h² ~ 0.8) [23]. This method can also be adapted to use correlation estimates of full-siblings instead of parent-offspring, although some other adjustments are required. Another popular way of measuring heritability is the use of Falconer's formula in twin studies [24]. Given that dizygotic (DZ) twins on average are only 50% identical by descent (IBD) while monozygotic (MZ) twins are 100% IBD, MZ twins are expected to be twice as genetically similar as DZ twins. As DZ twins are approximately 50% IBD, heritability can be estimated by taking twice the difference between the phenotypic correlation of MZ twins and that of DZ twins. More recently, with the introduction of whole-genome genotyping arrays, heritability can now be estimated by taking the correlation of phenotypic values with IBD estimates from full siblings [25] as well as by using the correlation of all common SNPs in predicting the phenotype [26]. Heritability is not necessarily constant over time. Heritability can decrease with increased environmental variability. It has been suggested that heritability for morphological traits will decrease in poorer environmental conditions [27], e.g. a nutrient-poor environment. This fits the theory that in a poor environment, competition for resources will cause increased environmental variability that will influence the outcome of the trait.
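As an illustration of the twin-based estimate described above, the following minimal sketch applies Falconer's formula, h² = 2(r_MZ − r_DZ). The function name `falconer_h2` and the example correlations are hypothetical and chosen only for illustration; they are not values from any study cited here.

```python
def falconer_h2(r_mz: float, r_dz: float) -> float:
    """Falconer's estimate: twice the MZ-DZ difference in phenotypic correlation."""
    for r in (r_mz, r_dz):
        # A correlation coefficient must lie between -1 and 1.
        if not -1.0 <= r <= 1.0:
            raise ValueError("correlations must lie in [-1, 1]")
    return 2.0 * (r_mz - r_dz)


# Hypothetical twin correlations, chosen only for illustration.
example_h2 = falconer_h2(r_mz=0.9, r_dz=0.5)
print(f"estimated h2 = {example_h2:.2f}")
```

The estimate is only as good as the assumptions behind the formula, for example that MZ and DZ twin pairs share their environments to the same degree.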
Nonetheless, heritability estimates provide us with a way to determine which traits are mainly genetically influenced and which traits are mainly environmentally influenced.

Heritability and genetic architecture

It is known that it is not a single gene but a multitude of genes that are responsible for complex traits or diseases. Most of these complex traits or diseases are also not as rare as most Mendelian diseases, with prevalence rates much greater than 1 in 1000 individuals. For example, a study of the incidence of schizophrenia reported an average lifetime morbid risk for schizophrenia of 7.2 per 1000 persons [28]. A study of the prevalence of type 2 diabetes in adolescents put the prevalence as high as 110 per 1000 persons (11%) [29]. Given that the disease can be common in the population, we can ask if the genetic variants that are responsible for the disease are common or rare in the population. This question leads to two concepts. The first is what is known as the “common disease, common variant hypothesis”. In this scenario, it is thought that the genetic variants that give rise to risk of disease are relatively common but that each variant’s contribution to disease risk is small. This means that the effect size per allele is small, usually less than 0.1 standard deviations or an odds-ratio of less than 1.1. In such a mode of inheritance, also known as polygenic inheritance, the genetic cause of the disease per individual or family is due to all the risk variants collectively. The next concept is what is known as the “common disease, rare variant hypothesis”. In this case, the genetic variants that give rise to the risk of disease are very rare and each variant’s contribution to disease risk is large. The effect size per allele can be large, perhaps more than 0.5 standard deviations or an odds-ratio greater than 1.6.
For this mode of inheritance, also called locus heterogeneity, the genetic cause of the disease per individual or family is due largely to only one gene, and other individuals or families with the disease have other genes responsible for their disease. Although these “hypotheses” are seemingly different, they do not have to be mutually exclusive. Effectively, these “hypotheses” can be unified by addressing the effect sizes and variant frequencies for the spectrum of genetic variants that give rise to the disease. For such traits, a variant cannot be both common and have a large effect. If it were, the trait or disease would be monogenic and would be classified as a Mendelian disorder. As such, it is not inconceivable that a disease could have both rare large effect alleles as well as common small effect alleles. For example, even though GWAS show that most variants that are associated with height have small effects [30], there are very rare alleles that can give rise to short stature, e.g. Achondroplasia [31], as well as rare alleles that give rise to tall stature, e.g. Marfan syndrome [32]. The same can be said for many other complex traits or diseases, and it is important to be aware of the genetic architecture giving rise to the trait or disease.

Heritability and polygenic inheritance

To explain the heritability of non-Mendelian complex traits and diseases, the pattern of inheritance is usually assumed to be polygenic, i.e. many variants across multiple genes each contribute a small fraction of the heritability. Examples of complex traits include asthma, schizophrenia, type 2 diabetes, inflammatory bowel disease and coronary heart disease. These traits are highly heritable [33–37] even though they do not follow a Mendelian pattern of inheritance. In type 2 diabetes, the first notable gene with variation conferring risk to the disease was TCF7L2 [38].
While the risk allele is not completely penetrant, individuals having a single copy of it are 1.45 times more likely to get type 2 diabetes than individuals without it. Since then, studies with much larger sample sizes have yielded about 30 distinct loci that are associated with the risk of getting type 2 diabetes [39]. A similar situation exists for schizophrenia, where prior to having sufficiently large sample sizes, no single locus or gene was determined to be significantly associated with schizophrenia [40]. However, in one of the earlier studies of schizophrenia with just over 3,000 cases and 3,000 controls, the authors reported a significant signal of polygenic inheritance from common variants [41]. In that study, the authors used the common variants that were marginally associated in their samples to model a “polygenic score”, a score that represents the overall cumulative predictability of these common variants for schizophrenia risk. They found that the polygenic score is significantly predictive of schizophrenia in an independent cohort of individuals. This suggests that there are many, perhaps thousands of, variants that modulate the risk of acquiring schizophrenia, each of which has only a very small effect on the overall risk; they are not discovered to be significant because the study is simply underpowered, and further studies with many more samples would be necessary. Indeed, when larger sample sizes became available for association, significant loci began to emerge [42]. Other complex traits have similar stories where multiple loci have been discovered, each of which confers only a fraction of the total risk.

Quantitative traits and polygenic inheritance

Quantitative traits that are approximately normally distributed in a population are usually complex traits as well. If such a trait is heritable, then it is unlikely for variation only within a single gene or locus to influence the trait.
Traits like height, body mass index (BMI), lipid levels, fasting glucose levels and blood pressure are just some notable examples. There are now well over a hundred loci that are associated with human height [30], each of which has only a very small effect on overall height. For example, a variant in the HMGA2 locus (rs1351394), one of the first loci discovered to be associated with height, has an allele frequency of 49% and an effect size of 0.054 standard deviations or approximately 0.3 centimeters. That means every height-increasing allele of this variant predicts on average an increase of only 0.3 centimeters in overall height, which is a small effect. For other quantitative traits like BMI [43], LDL cholesterol [44] and blood pressure [45], association studies reported similar results: many common variants have been found to be associated with these traits, with each variant explaining only a small fraction of the overall trait variation. Because of the highly polygenic nature of such complex traits, methods like linkage analysis that were successful in identifying loci for Mendelian disorders can be suboptimal when applied to complex traits. Therefore, new methodologies and paradigms were developed to map the variants in genes that influence complex traits. We will discuss this in more detail in the next section.

METHODS FOR STUDYING GENETICS OF COMPLEX TRAITS

Single nucleotide polymorphisms (SNPs)

The principle for determining if variants in a gene cause or modulate risk for a disease or a trait is to determine whether they are associated with the disease or trait in a non-random way. As whole genome deep coverage sequencing in a large number of individuals is not feasible at this point in time (too costly), we rely on genetic markers for mapping a trait or disease to a genetic locus. The genetic marker that is currently very widely used for such a purpose is the single-nucleotide polymorphism (SNP).
SNPs are single base pair differences within the genome that are polymorphic in a population. As we are diploid for most of our chromosomes (males are largely hemizygous for the X-chromosome), some individuals in the population might have a different pair of alleles than other individuals for any particular SNP. These usually bi-allelic markers are found in abundance throughout the genome, much more frequently than STRs [46]. Because SNPs are usually bi-allelic and we are diploid, each individual will generally be in 1 of only 3 genotypic states for each SNP. For example, if the alleles for the SNP are “A” and “C”, then the 3 possible genotypic states would be homozygous “AA”, heterozygous “AC” or homozygous “CC”. SNPs were discovered and made publicly available in a major way through the efforts of the International HapMap Project [47]. In the phase 2 release of the project, they reported more than 3 million SNPs from 4 geographically diverse populations [9]. To find and characterize even more SNPs, the 1000 Genomes Project, a project that aims to characterize genomic variation from whole genome sequencing, reported their findings of about 15 million SNPs [48].

SNP genotyping strategies

While it might not be difficult or tedious to determine the genotypic state of any single SNP in an individual, genotyping many SNPs for many individuals can be challenging, both from a technical as well as a cost perspective. Methods such as Sanger sequencing [49] and PCR-RFLP [50] can be used for SNP genotyping but are too tedious and expensive to perform in high-throughput (many SNPs) over many individuals. As such, there was a need for a relatively cheap and fast technology that could genotype thousands of SNPs efficiently in many individuals. The success of efforts to characterize SNPs within human populations made it possible to design high-density SNP genotyping arrays.
SNP genotyping arrays work in principle by probing for sequence variation at many targets in parallel: probe sequences are immobilized on a surface and the genotype is determined by reading out the strength with which these probe sequences are bound to their targets. These arrays can easily genotype many SNPs across the genome in a cost efficient manner [51]. There are now many companies that sell high-density SNP genotyping arrays that can genotype over a million SNPs per sample. However, high-density SNP genotyping arrays might become less and less utilized with the growth and availability of whole-genome sequencing. Whole-genome sequencing costs have gone down significantly, and in the near future whole-genome sequencing may become the major strategy used by researchers to genotype genetic variants on a large scale.

Genotype imputation

While it is possible to genotype many SNPs in parallel, it is still not possible to genotype all or most of the known SNPs in the human genome with SNP genotyping arrays. This is because there are just too many SNPs to fit all or most of them onto a single genotyping array. As such, these SNP genotyping arrays carry only a subset of the total possible SNPs from the human genome. Another potential problem is that the different companies that design and sell these arrays do not use the same subset of SNPs. These problems can be addressed by performing genotype imputation. Genotype imputation is the process of determining the genotypes of untyped markers with some level of certainty using the genotype information of neighboring markers. This is possible because of linkage disequilibrium, that is, variants within the genome are not independent of one another [52]. Linkage disequilibrium exists because the human population is relatively young and variants that were introduced into the population tend to travel together.
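
The non-independence of neighboring variants that imputation relies on is commonly quantified by the coefficient D and the squared correlation r² between two bi-allelic markers. The following is a minimal sketch under the assumption that phased haplotype counts are available for the two SNPs; the function name and the example counts are hypothetical and are not taken from any of the cited studies.

```python
def ld_r_squared(n_ab, n_aB, n_Ab, n_AB):
    """Return (D, r^2) for two bi-allelic SNPs from four haplotype counts.

    The arguments are the numbers of haplotypes carrying the allele pairs
    (a, b), (a, B), (A, b) and (A, B) at the first and second SNP.
    """
    total = n_ab + n_aB + n_Ab + n_AB
    p_AB = n_AB / total                # frequency of the A-B haplotype
    p_A = (n_Ab + n_AB) / total        # frequency of allele A at SNP 1
    p_B = (n_aB + n_AB) / total        # frequency of allele B at SNP 2
    d = p_AB - p_A * p_B               # deviation from independence
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d, d * d / denom


# Hypothetical counts: alleles A and B mostly travel together.
d, r2 = ld_r_squared(n_ab=450, n_aB=50, n_Ab=50, n_AB=450)
print(f"D = {d:.3f}, r^2 = {r2:.3f}")
```

An r² close to 1 means that the genotype at one marker is highly informative about the other, which is the situation in which an untyped marker can be imputed with confidence; an r² close to 0 corresponds to linkage equilibrium.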
With enough time, recombination events between the variants will break down their correlation and bring about linkage equilibrium, which would make imputation impossible. Genotype imputation can be performed computationally with the use of a reference panel. The reference panel is typically a more complete catalog of SNP genotypes obtained from a large cohort of individuals. Some examples of these panels are those provided by the International HapMap Project [47] and the 1000 Genomes Project [48], although it is not uncommon to use panels from other sources as well. With these panels together with the genotypes of one’s samples, one can computationally impute the variants that are present in the panels but not genotyped in the samples. Some of the more widely used software for this purpose includes BEAGLE [53], MACH [54] and IMPUTE2 [55]. With imputation, SNPs that were directly genotyped in one set of samples but not in other sets of samples can now be used for association studies.

Performing genome wide association

Linkage analysis has been shown to be less successful at identifying loci associated with complex traits than with Mendelian traits [13]. An arguably more effective approach would be to perform a genome wide association study (GWAS). Instead of tracking genetic markers in affected familial pedigrees, one can design a study to determine whether the frequencies of the genetic markers are significantly different between case individuals and control individuals. In such study designs, case individuals (cases) are usually randomly selected unrelated individuals that are affected and control individuals (controls) are randomly selected unrelated individuals that are unaffected.
Assuming a scenario where 2 SNPs are genotyped in 1000 cases and 1000 controls (Figure 1.1), one can measure the allele frequencies of both SNPs and determine whether they are significantly different between cases and controls by performing a chi-squared test. In this example, SNP1 is associated at genome wide significance (P = 2.82 × 10⁻¹³). The genome wide significance threshold is taken to be P < 5 × 10⁻⁸, although it has been suggested that it could be relaxed just a little [56]. The genome wide significance threshold has to be stringent to correct for multiple hypothesis testing, given that GWAS test many markers at the same time [57]. SNP2 on the other hand is only marginally associated (P = 0.001) and does not reach genome wide significance. This process can be systematically pursued for all the SNPs that were genotyped via the high-density SNP arrays and subsequently imputed from a reference panel. The first successful GWAS was performed on age-related macular degeneration in 2005 [58]. In that study of 96 cases and 50 controls, the authors reported 2 strongly associated SNPs (P < 10⁻⁷) in the complement factor H gene (CFH). Since then, many more GWAS have been performed, with more than 10,000 SNPs identified as genome wide significant for various traits and diseases in more than 1000 publications [59]. The large growth of GWAS can be attributed to the affordability of high density SNP arrays as well as freely available bioinformatics tools like PLINK [60] for data analysis.

Figure 1.1: An example of GWAS on cases versus controls. SNP1 and SNP2 are genotyped in 1000 cases and 1000 controls (1 stickman = 100 individuals). SNP1 is significantly associated with disease status while SNP2 is only marginally associated and does not reach genome wide significance. Genome wide significance is assumed to be P < 5 × 10⁻⁸.

Besides performing case-control analyses, GWAS can also be performed on quantitative traits like height, BMI, blood pressure and lipid levels. Since there are no cases or controls, GWAS on quantitative traits seek to determine if the allele dosages for each SNP are significantly trending, either increasing or decreasing, with the trait. This is usually done by linear regression of the allele dosages against the quantitative trait via a simple linear model [61]. For example, we simulated a scenario where a SNP with minor allele frequency of 30% has a 0.5 standard deviation effect (β) on the phenotype. After performing a linear regression of the allele dosages against the phenotypic score, we find a strong correlation between the SNP and the phenotype (Figure 1.2A), resulting in an estimated β of 0.47 and a very strong association signal (P = 2.97 × 10⁻²²). On the other hand, when we simulated a scenario where the SNP has no effect, there is no strong correlation (Figure 1.2B). This example shows that GWAS can be used not only for dichotomous traits, but also for quantitative traits.

GWAS ineffective if causal variants not linked to SNPs

Linkage disequilibrium (LD) is a major factor in the success of GWAS. This is because, the vast majority of the time, the SNP markers tested for association with diseases are not the actual genetic variants that have an effect but simply markers that are in linkage disequilibrium with the disease variant. The disease variants could be SNPs, copy number variants (CNVs), short tandem repeats, insertion or deletion polymorphisms (indels) and perhaps even inversion polymorphisms. In most cases, there should be a SNP that is in LD with (tagging) the causal variant. For example, many SNPs have been shown to strongly tag the common inversion polymorphism on human chromosome 17 [62]. It has also been shown that some SNPs from GWAS hits strongly tag CNVs and that these CNVs are suggested to be the causal variants [63].
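
The two kinds of association test described in this section, the allelic chi-squared test for cases versus controls and the regression of a quantitative trait on allele dosage, can be sketched in a few lines. This is a minimal illustration using only the Python standard library and is not the analysis behind Figures 1.1 or 1.2. The function names and all counts are hypothetical, and the regression P-value uses a large-sample normal approximation in place of the exact t-distribution.

```python
import math


def allelic_chi_squared(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-squared test (1 d.f.) on a 2x2 table of allele counts."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


def dosage_regression(dosages, phenotypes):
    """Least-squares slope of phenotype on allele dosage (0, 1 or 2).

    Returns (beta, standard error, two-sided P). The P-value is a normal
    approximation, an assumption that is only reasonable for large samples.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(dosages, phenotypes))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    rss = sum((y - intercept - beta * x) ** 2
              for x, y in zip(dosages, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)
    p_value = math.erfc(abs(beta / se) / math.sqrt(2.0))
    return beta, se, p_value


# Hypothetical allele counts: 1000 cases and 1000 controls carry 2000
# alleles each; the tested allele is at 30% in cases and 20% in controls.
chi2, p = allelic_chi_squared(600, 1400, 400, 1600)
print(f"chi2 = {chi2:.2f}, genome wide significant: {p < 5e-8}")
```

Running either function over every genotyped or imputed SNP and comparing each P-value with the 5 × 10⁻⁸ threshold is the systematic scan described above; dedicated tools such as PLINK [60] do this far more efficiently and with proper covariate handling.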
However, we cannot discount the possibility that the causal variant is not well-tagged by SNPs. For example, a recent study showed that CNVs in two amylase genes (AMY1 and AMY2) are associated with obesity and that these regions are hard to map with SNPs [64]. Only by genotyping the copy number did the authors observe the association. Therefore, besides performing GWAS using only SNPs as markers, it may in some cases be useful to also genotype other potential markers, especially in genomic regions not well covered by SNPs.

Figure 1.2: An example of GWAS on a quantitative trait. Phenotypic score represents the quantitative trait. SNP genotype dosage is the number of effect alleles (0, 1 or 2) that each individual has. The association between genotype and phenotype is shown by the least-squares regression line. (A) The least-squares regression line (red) shows a positive correlation of genotype dosage with phenotype (β = 0.47, P = 2.97 × 10⁻²²). (B) The least-squares regression line (grey) shows no correlation of genotype dosage with phenotype (β = 0.06, P = 0.21).

Not straightforward to implicate causal gene from GWAS locus

While linkage disequilibrium allows one to find loci associated with disease, it is not clear which gene within the identified locus is causal. Because of linkage disequilibrium, the region implicated in GWAS can span many genes, and in that respect linkage disequilibrium is more of a problem than a solution. To overcome this, systems approaches that examine all the loci associated with the disease to determine its molecular architecture may be the way forward [65]. If certain genes within the various loci identified through GWAS are found to be more biologically connected to one another, those genes are more likely to be the causal genes within their respective loci. For example, in one study, the authors used a variety of biological functional databases to determine the degree of connectivity between genes [66].
In another study, the authors described an approach to form relationships between genes by analyzing PubMed abstracts [67]. These approaches have been successfully applied to results from GWAS and can prioritize the genes within each locus according to which of them are more likely to be causal.

Population stratification

Genome wide association studies (GWAS) may also be confounded by population stratification. Unlike linkage analysis, where studies are performed on familial pedigrees, GWAS compare genetic markers between unrelated cases and controls. As such, markers that reflect differences in the underlying population structure between cases and controls may show significant associations when performing GWAS. For example, a SNP in the LCT gene locus had a significant association with height, but the association was largely driven by population stratification [68]. Many methods have been developed to try to correct for population stratification. One of the more popular methods is to include principal components as covariates when performing the statistical association with linear or logistic regression [69]. A study performed to determine the efficacy of the available methods showed that most of them work comparably well in addressing the problem of population stratification [70]. Therefore, population stratification is no longer a major problem and can be adequately corrected for.

Admixture mapping

Another possible method besides GWAS is admixture mapping. Admixture mapping, also known as “mapping by admixture linkage disequilibrium” (MALD), is a method that uses genetically mixed populations to determine whether local ancestry from different ancestral populations is correlated with a trait or disease [71]. For example, African Americans have genetic ancestry of largely African descent with a proportion being of European origin [72].
If one could determine the genomic regions of European ancestry, one could test whether having European ancestry in these regions is associated with trait differences between individuals. Following this idea, methods were developed to accurately determine which regions in an individual’s genome are of any particular ancestry. One of the first approaches used extensively for this purpose performs the prediction using a hidden Markov model (HMM) [73]. By systematically walking through each marker consecutively, the HMM can be used to predict the most likely ancestral state of the genetic marker given its frequency in each ancestral population. The more divergent the frequencies are between populations, the more likely the prediction will be accurate. The accuracy can be further improved by incorporating linked markers as well as an explicit population genetic model [74]. Admixture mapping has been performed on a multitude of phenotypes, including prostate cancer [75], body mass index [76,77] and blood lipids [78], just to name a few. One of the reasons why admixture mapping might perform better than GWAS is admixture linkage disequilibrium. One of the reasons why GWAS on African populations might initially yield fewer results than GWAS performed on European or other non-African populations is that the average linkage disequilibrium (LD) block in Africans is much smaller, as they are a relatively older population [79,80]. As such, a relatively sparse set of genotyped SNPs might be sufficient for GWAS in non-African populations but inadequate in populations of African ancestry. However, since admixture LD (LD between genomic regions due to admixture from a different population) is much stronger, association signals can be discovered even with relatively low marker density.
However, this also means that if an admixture signal were to be discovered, it would be much harder to pinpoint the gene responsible for the association. GWAS on the other hand would be more sensitive and better powered if there is high density coverage of the genome, either from high density SNP arrays or from whole genome sequencing. Given that high density SNP arrays are now widely used, GWAS might now be a better strategy to uncover genetic loci associated with disease.

COMPLEX PHENOTYPES

Human height is a classical complex trait

Human height is probably the best example of a heritable trait that has a polygenic architecture [81]. It is the example that Fisher used to reconcile how quantitative traits could also adhere to Mendelian inheritance [3]. Instead of a single gene influencing one’s height, having many genes do so can explain the distribution of height in the population, which is in most cases normally distributed [82]. However, we do know of diseases that are caused by rare mutations that have large effects on one’s stature. These diseases, in most cases, have other obvious phenotypes besides the change in stature. For example, Achondroplasia, the most common cause of dwarfism, is caused by a rare mutation in the FGFR3 gene. Individuals carrying the mutant allele have on average about a 6 standard deviation decrease in height. Achondroplasia is extremely rare, affecting only about 1 in 25,000 individuals [83]. Another example is Marfan syndrome, a genetic disorder caused by mutations in the FBN1 gene. Individuals with Marfan syndrome are unusually tall, on average about 2 standard deviations taller. Marfan syndrome is rare, affecting only about 1 in 9802 individuals [84]. In both of these examples, affected individuals have other consequential phenotypes besides their short or tall stature.
Individuals with Achondroplasia usually present with other phenotypes like short fingers and toes [85]. Individuals with Marfan syndrome normally present with cardiovascular or vision problems too [86]. Nonetheless, rare Mendelian diseases like these do not explain most of the variation of height in the population.

The alleles of height

Most of the variation of height is probably due to common variants that have small effect sizes. Indeed, the first such gene implicated in height is HMGA2 [87]. Identified from an initial GWAS of just under 5000 individuals, it harbors a common variant that has an estimated effect size of only 0.4 cm per allele. Since then, many more common variants with small effects robustly associated with height have been discovered [30]. Among these variants, there are some in genes that are associated with human syndromes characterized by abnormal skeletal growth. For example, the gene ACAN, at which there is a signal of common variant association with height, has been shown to be responsible for syndromes like Osteochondritis dissecans [88] and Spondyloepimetaphyseal dysplasia [89]. This suggests that while the common variants might alter gene activity in a minor way, resulting in a small change in overall height, deleterious variants in these genes can cause severe reduction in stature. Thus the question remains as to what the genetic architecture is for non-syndromic individuals with short or tall stature. Is there a contribution of such large effect variants that can explain a person’s tall or short stature in the general population? Or is a person’s tall or short stature driven mainly by small effect common variants? In chapter 2, we shall discuss a method to infer the genetic architecture of individuals at the tails of the height distribution by examining the recently discovered common variants associated with height.
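
To put such small per-allele effects in perspective, one can ask how much of the trait variance a single common variant explains. The worked example below is an illustration under assumptions introduced here: an additive model, Hardy-Weinberg proportions and a phenotype standardized to unit variance, so that a bi-allelic variant with allele frequency p and per-allele effect β (in standard deviations) explains a fraction 2p(1 − p)β² of the variance. The inputs are the HMGA2 values quoted earlier in this chapter (p = 0.49, β = 0.054).

```latex
\[
\sigma^2_{\mathrm{SNP}} \;=\; 2p(1-p)\,\beta^2
  \;=\; 2 \times 0.49 \times 0.51 \times 0.054^2
  \;\approx\; 0.0015
\]
```

That is, roughly 0.15% of the variance in height, which is in line with the view that a very large number of such variants would be needed to account for a substantial share of the heritability.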
Body proportion is more constrained than height

While height is a commonly measured anthropometric trait that varies within a population, it is not tightly constrained, and individuals can be relatively short or tall without any adverse effect on their health. Most of the problems associated with extreme tall or short stature are usually due to other adverse phenotypes that accompany it. For example, individuals with Turner syndrome, a disease caused by monosomy X, have short stature but commonly have other problems like lymphedema or cardiovascular problems. Also, the fact that women are about 2 standard deviations shorter than men shows that short stature itself does not necessarily have any health consequences and can vary within the population. On the other hand, our body proportions are more well-defined. Humans have expected ratios of limb lengths that are vastly different from those of other species. For example, unlike humans, chimpanzees have arms longer than their legs [90].

Sitting height ratio as a measurement of body proportion

There are certain measurements other than our full body height that can be used to judge our body proportions. Iliac length, subischial leg length, thigh length, knee height and sitting height are just some such measurements [91]. Another such measurement is arm span, which is a good proxy for overall height [92]. These measurements can be taken in a clinic but require either precise instruments or trained practitioners, so they are usually not taken when patients visit their doctors, even though they may be as informative as overall height and weight. However, one measurement that exists in some publicly available data-sets is sitting height. Sitting height is the portion of total stature that is comprised of the head and trunk.
It is usually measured by first having the person sit on a table, then measuring the distance from the surface of the table to the top of the person’s head. Dividing the sitting height by the person’s height gives the sitting height ratio (SHR), which can then be used as a measure of body proportion. While short and tall stature are characteristic of many skeletal dysplasias and overgrowth syndromes respectively, many of these syndromes can also cause severe deviations of SHR. For example, adult individuals with Achondroplasia have average SHR values of 0.66, very much higher than the population average, which is around 0.53 [93]. Another type of dysplasia, Spondyloepiphyseal dysplasia, is a syndrome characterized by a severely shortened spine and neck. These patients’ hands and feet are of normal length, suggesting that their SHR values will be lower than average [94]. Next, individuals with Marfan syndrome have above average heights and may have lower than average SHR values [95]. However, some individuals with mutations causing severe short stature might have SHR within the normal range. For example, a patient with premature pubarche and severe short stature has normal SHR [96]. SHR has also been used as a rudimentary predictor of phenotypes like body mass index, age of menarche and risk of diabetes [97]. SHR is a measurement that changes with age. More of our stature is due to our head and trunk as children than as adults, as evidenced by the gradual decrease of SHR until we reach adulthood [95].

Sitting height ratio and ancestry

SHR also differs significantly between individuals of different ancestries. Individuals of Asian ancestry have higher SHRs than individuals of European ancestry, and individuals of European ancestry have higher SHRs than individuals of African ancestry [91].
The question remains as to whether genetics is the primary driving force for the difference in SHR between populations, and whether these SHR differences are a polygenic phenomenon or driven by only a single or a few genes. In chapter 4, I shall present some recent findings that reveal more about the genetic architecture of SHR.

ACCUMULATING EVIDENCE FROM MULTIPLE STUDIES

Being underpowered

While genome wide association studies (GWAS) have been very successful at elucidating loci that are associated with complex traits and diseases [98], this has not always been the case. Studies performed with limited samples are simply underpowered to discover any genome wide significant associations. The power to detect a SNP as associated with the trait increases with the variance of the phenotype explained by the SNP. This means that the larger the effect size or the more frequent the SNP is, the more power there is for the SNP to be detected as genome wide significant. However, given that for complex traits the effect size of any given variant is very small, larger numbers of samples are required for any loci to be discovered.

Combining results by meta-analysis

While most studies may be underpowered due to small sample sizes, different studies performed on different samples with similar phenotypes can be combined or pooled together to increase power. Ideally, the genotypes and phenotypes would be shared among different research groups such that every group would have access to the other groups’ data to perform the joint study. However, this is usually not feasible due to data sharing constraints such as the lack of storage space, privacy issues as well as the unwillingness of research groups to share their data prior to publication of their results. As such, for a typical GWAS, the association is performed on individual cohorts.
Each of these cohorts has whole genome SNP data, usually produced by genotyping arrays, as well as the corresponding phenotypes. The phenotype can be either a quantitative one, e.g. height, body mass index, blood pressure, etc., where there is a numerical value attached to each individual, or a dichotomous one, e.g. type 2 diabetes, schizophrenia, etc., where each individual is either affected with the disease (cases) or unaffected (controls). A dichotomous phenotype can be modeled as a phenotype with an underlying quantitative trait distribution, in which individuals whose trait value exceeds a threshold are affected and individuals whose trait value does not are unaffected [99]. For example, in the case of obesity, the underlying phenotype can be body mass index (BMI), and individuals whose BMI exceeds 30 can be classified as obese while those whose BMI is below 30 are not [100]. However, for most dichotomous traits or diseases, this underlying trait is unobservable or unknown. Testing for genetic factors associated with the trait or disease is usually done by performing linear regression (quantitative traits), logistic regression (dichotomous traits) or some other test of correlation of the SNP dosages with the phenotype. Performing the test produces statistics for each SNP, and by combining the statistics produced across cohorts through a process known as meta-analysis [101], one obtains the summary statistics for the GWAS.

GWAS summary statistics

The resulting summary statistics contain the necessary information to determine which SNPs are significantly associated with the trait or disease in question. Typically, the summary statistics are reported in the following manner. Each row represents the result of the test for a SNP and each column reports a specific result for that SNP.
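
As a concrete illustration of how per-cohort results might be combined into one summary row per SNP, the sketch below uses inverse-variance weighted fixed-effect meta-analysis. This particular weighting scheme is an assumption made here for illustration; the text above does not specify which meta-analysis method is used. The function name and the per-cohort numbers are hypothetical.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted fixed-effect meta-analysis for one SNP.

    betas: per-cohort effect estimates (for a dichotomous trait these
           would be log odds-ratios), all for the same effect allele.
    ses:   the corresponding standard errors.
    Returns (combined beta, combined standard error, two-sided P).
    """
    weights = [1.0 / se ** 2 for se in ses]   # precise cohorts count more
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal P
    return beta, se, p_value


# Three hypothetical cohorts, none of which is significant on its own.
beta, se, p = fixed_effect_meta([0.05, 0.06, 0.04], [0.02, 0.03, 0.025])
print(f"beta = {beta:.4f}, se = {se:.4f}, P = {p:.2e}")
```

Because the combined standard error shrinks as cohorts are added, a consistent small effect that is not significant in any single cohort can become significant in the combined result, which is the motivation for pooling given above.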
There would be a SNP identifier, usually the dbSNP rs-number [102], the allele frequency, the odds-ratio or effect size, as well as the significance of the result, reported as the 2-tailed P-value. For a dichotomous trait, the odds-ratio (OR) tells us the direction of effect of the allele, that is, whether it is associated with increased or decreased risk of being affected by the trait or disease. An OR > 1 indicates increased risk while an OR < 1 indicates decreased risk. For a quantitative trait, the effect size (β) is the equivalent, with a positive β indicating that the allele is associated with increased trait values and vice-versa. In either case, the P-value gives us the strength of the association, and P < 5 × 10⁻⁸ is the suggested genome-wide significance threshold [57]. SNPs that have P-values below this threshold are said to have reached genome-wide significance and are usually reported to be significantly associated with the trait or disease in question. Genes in the vicinity of such SNPs are then suspected to be involved in the trait or disease etiology and are usually reported as well.

Independent loci

The SNPs that achieve genome-wide significance in a GWAS are not necessarily independent, and one must consider the effect of linkage disequilibrium (LD). As discussed earlier, there are many variants, SNPs included, in the genome that are correlated with one another due to LD. The SNP with the lowest P-value is called the “lead SNP”, and SNPs in strong LD with it (usually taken as r² > 0.5) are labeled as tagging SNPs. Together these SNPs represent only a single locus of association, as the signal may well come from only a single source of variation within this region of the genome. To determine the total number of independent loci that are significantly associated with the trait, one can perform a process known as LD-pruning.
This process orders the SNPs from most to least significant and systematically removes significant SNPs that are in LD with any of the preceding SNPs. In addition to LD-pruning, one can perform conditional analysis, in which SNPs in LD with the lead SNP are tested again with the dosage of the lead SNP as a covariate. If the association is solely due to LD with the lead SNP, the resulting P-value will no longer be significant. However, if the resulting P-value is still significant, then that SNP’s association cannot be explained by LD alone and it can therefore be counted as a separate locus associated with the trait or disease. This process can also be done in a high-throughput manner using existing summary statistics and LD information [103].

Not quite genome wide significant

While genome-wide significant signals are regarded as robust associations with the trait or disease, SNPs that do not reach genome-wide significance can still be informative. Although these marginally associated SNPs cannot individually be considered robust associations, they may inform us about the genetic architecture of the trait or disease as a whole. For example, in the GWAS of human height, the QQ-plots show a significant deviation of the marginally associated SNPs from the null model. This indicates that there are more marginally associated SNPs than expected under the null model and thus that there are more associations to be discovered. In other cases, the QQ-plot might not show much of a deviation. Nonetheless, the height example suggests that marginally associated SNPs can be informative even though most GWAS publications choose to report only the genome-wide significant ones. In chapter 3, I will discuss in depth a new method that exploits the marginally associated SNPs to determine whether there is evidence of polygenic inheritance.
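The LD-pruning procedure described earlier in this section can be sketched as a greedy pass over the significant SNPs. This is an illustrative sketch under stated assumptions, not the implementation used in any cited study: the function name ld_prune, the SNP labels and the r² values are hypothetical, pairs missing from the r² table are assumed to be unlinked, and each SNP is compared only against the lead SNPs already retained. The default thresholds (r² > 0.5 and P < 5 × 10⁻⁸) are the ones quoted in the text.

```python
def ld_prune(pvalues, r2, r2_threshold=0.5, alpha=5e-8):
    """Greedy LD-pruning of genome-wide significant SNPs.

    pvalues: dict mapping SNP id -> association P-value.
    r2: dict mapping frozenset({snp_a, snp_b}) -> pairwise r^2;
        pairs that are absent are treated as r^2 = 0 (an assumption).
    Returns the retained lead SNPs, most significant first.
    """
    # Keep only SNPs that pass the genome-wide significance threshold.
    significant = {snp: p for snp, p in pvalues.items() if p < alpha}
    lead_snps = []
    # Visit SNPs from most to least significant.
    for snp in sorted(significant, key=significant.get):
        in_ld = any(
            r2.get(frozenset((snp, lead)), 0.0) > r2_threshold
            for lead in lead_snps
        )
        if not in_ld:
            lead_snps.append(snp)
    return lead_snps


# Hypothetical example: snpB tags snpA, snpC is independent, snpD is not significant.
pvalues = {"snpA": 1e-12, "snpB": 3e-10, "snpC": 2e-9, "snpD": 1e-5}
r2 = {
    frozenset(("snpA", "snpB")): 0.8,
    frozenset(("snpA", "snpC")): 0.1,
    frozenset(("snpB", "snpC")): 0.2,
}
independent_loci = ld_prune(pvalues, r2)
print(independent_loci)
```

In this toy input, snpB is removed because it is in strong LD with the more significant snpA, so two independent loci remain. Conditional analysis, as described above, would be the natural follow-up for deciding whether a pruned SNP nevertheless carries a separate signal.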
SUMMARY

It is well known that many human traits and diseases do not follow a Mendelian pattern of inheritance; most of these traits and diseases are influenced by variation in many genes, each of which contributes a small effect to the total heritability of the trait. These traits and diseases are highly heritable and therefore mapping them to their genetic loci can be useful for understanding disease etiology and can thereby inform the development of potential therapeutics. Performing genome-wide association studies (GWAS), in which one genotypes many genetic markers to determine whether they are associated with the trait or disease, has become a common technique for identifying such genetic loci. However, the loci discovered in most GWAS do not account for most of the heritability and thus our genetic understanding of these diseases is far from complete. This dissertation aims to leverage results from GWAS to infer the genetic architecture of various complex human traits and diseases, which could increase our understanding of disease etiology.

REFERENCES

1. Wills C (2007) Principles of Population Genetics, 4th edition. J Hered 98: 382–382. doi:10.1093/jhered/esm035.
2. Miko I (2008) Gregor Mendel and the principles of inheritance. Nat Educ 1: 134.
3. Fisher RA (1918) The correlation between relatives on the supposition of Mendelian inheritance. Trans R Soc Edinb 52: 399–433.
4. Kaminsky ZA, Tang T, Wang S-C, Ptak C, Oh GHT, et al. (2009) DNA methylation profiles in monozygotic and dizygotic twins. Nat Genet 41: 240–245. doi:10.1038/ng.286.
5. Hearne CM, Ghosh S, Todd JA (1992) Microsatellites for linkage analysis of genetic traits. Trends Genet TIG 8: 288–294.
6. Bornman DM, Hester ME, Schuetter JM, Kasoji MD, Minard-Smith A, et al. (2012) Short-read, high-throughput sequencing technology for STR genotyping. BioTechniques 0: 1–6. doi:10.2144/000113857.
7. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, et al. (2001) Initial sequencing and analysis of the human genome. Nature 409: 860–921. doi:10.1038/35057062.
8. The International HapMap Consortium (2005) A haplotype map of the human genome. Nature 437: 1299–1320. doi:10.1038/nature04226.
9. Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, et al. (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851–861. doi:10.1038/nature06258.
10. Elston RC (1998) Methods of linkage analysis--and the assumptions underlying them [see comment]. Am J Hum Genet 63: 931–934.
11. Morton NE (1955) Sequential tests for the detection of linkage. Am J Hum Genet 7: 277–318.
12. Jorde LB (2000) Linkage Disequilibrium and the Search for Complex Disease Genes. Genome Res 10: 1435–1444. doi:10.1101/gr.144500.
13. Altmüller J, Palmer LJ, Fischer G, Scherb H, Wjst M (2001) Genomewide scans of complex human diseases: true linkage is hard to find. Am J Hum Genet 69: 936–950. doi:10.1086/324069.
14. D’Andrea AD (2010) Susceptibility Pathways in Fanconi’s Anemia and Breast Cancer. N Engl J Med 362: 1909–1919. doi:10.1056/NEJMra0809889.
15. Cantor RM, Lange K, Sinsheimer JS (2010) Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application. Am J Hum Genet 86: 6–22. doi:10.1016/j.ajhg.2009.11.017.
16. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, et al. (2009) Finding the missing heritability of complex diseases. Nature 461: 747–753. doi:10.1038/nature08494.
17. Eichler EE, Flint J, Gibson G, Kong A, Leal SM, et al. (2010) Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet 11: 446–450. doi:10.1038/nrg2809.
18. Li Y, Willer C, Sanna S, Abecasis G (2009) Genotype Imputation. Annu Rev Genomics Hum Genet 10: 387–406. doi:10.1146/annurev.genom.9.081307.164242.
19. McKusick VA (2007) Mendelian Inheritance in Man and Its Online Version, OMIM.
Am J Hum Genet 80: 588–604. doi:10.1086/514346.
20. Visscher PM, Hill WG, Wray NR (2008) Heritability in the genomics era -- concepts and misconceptions. Nat Rev Genet 9: 255–266. doi:10.1038/nrg2322.
21. Jacquard A (1983) Heritability: one word, three concepts. Biometrics 39: 465–477.
22. Aulchenko YS, Struchalin MV, Belonogova NM, Axenovich TI, Weedon MN, et al. (2009) Predicting human height by Victorian and genomic methods. Eur J Hum Genet 17: 1070–1075. doi:10.1038/ejhg.2009.5.
23. Silventoinen K (2003) Determinants of variation in adult body height. J Biosoc Sci 35: 263–285.
24. Teikari JM, Kaprio J, Koskenvuo MK, Vannas A (1988) Heritability estimate for refractive errors--a population-based sample of adult twins. Genet Epidemiol 5: 171–181. doi:10.1002/gepi.1370050304.
25. Visscher PM, Medland SE, Ferreira MAR, Morley KI, Zhu G, et al. (2006) Assumption-Free Estimation of Heritability from Genome-Wide Identity-by-Descent Sharing between Full Siblings. PLoS Genet 2: e41. doi:10.1371/journal.pgen.0020041.
26. Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, et al. (2010) Common SNPs explain a large proportion of the heritability for human height. Nat Genet 42: 565–569. doi:10.1038/ng.608.
27. Charmantier A, Garant D (2005) Environmental quality and evolutionary potential: lessons from wild populations. Proc Biol Sci 272: 1415–1425. doi:10.1098/rspb.2005.3117.
28. McGrath J, Saha S, Chant D, Welham J (2008) Schizophrenia: A Concise Overview of Incidence, Prevalence, and Mortality. Epidemiol Rev 30: 67–76. doi:10.1093/epirev/mxn001.
29. Hotu S, Carter B, Watson P, Cutfield W, Cundy T (2004) Increasing prevalence of type 2 diabetes in adolescents. J Paediatr Child Health 40: 201–204. doi:10.1111/j.1440-1754.2004.00337.x.
30. Lango Allen H, Estrada K, Lettre G, Berndt SI, Weedon MN, et al. (2010) Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature.
Available: http://www.ncbi.nlm.nih.gov.ezp-prod1.hul.harvard.edu/pubmed/20881960. Accessed 4 October 2010.
31. Richette P, Bardin T, Stheneur C (2008) Achondroplasia: from genotype to phenotype. Jt Bone Spine Rev Rhum 75: 125–130. doi:10.1016/j.jbspin.2007.06.007.
32. Gillis E, Kempers M, Salemink S, Timmermans J, Cheriex EC, et al. (2014) An FBN1 Deep Intronic Mutation in a Familial Case of Marfan Syndrome: An Explanation for Genetically Unsolved Cases? Hum Mutat. doi:10.1002/humu.22540.
33. Fagnani C, Annesi-Maesano I, Brescianini S, D’Ippolito C, Medda E, et al. (2008) Heritability and shared genetic effects of asthma and hay fever: an Italian study of young twins. Twin Res Hum Genet Off J Int Soc Twin Stud 11: 121–131. doi:10.1375/twin.11.2.121.
34. O’Donovan MC, Williams NM, Owen MJ (2003) Recent advances in the genetics of schizophrenia. Hum Mol Genet 12: R125–R133. doi:10.1093/hmg/ddg302.
35. Almgren P, Lehtovirta M, Isomaa B, Sarelin L, Taskinen MR, et al. (2011) Heritability and familiality of type 2 diabetes and related quantitative traits in the Botnia Study. Diabetologia 54: 2811–2819. doi:10.1007/s00125-011-2267-5.
36. Brant SR (2011) Update on the heritability of inflammatory bowel disease: The importance of twin studies. Inflamm Bowel Dis 17: 1–5. doi:10.1002/ibd.21385.
37. Mayer B, Erdmann J, Schunkert H (2007) Genetics and heritability of coronary artery disease and myocardial infarction. Clin Res Cardiol Off J Ger Card Soc 96: 1–7. doi:10.1007/s00392-006-0447-y.
38. Grant SFA, Thorleifsson G, Reynisdottir I, Benediktsson R, Manolescu A, et al. (2006) Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nat Genet 38: 320–323. doi:10.1038/ng1732.
39. Morris AP, Voight BF, Teslovich TM, Ferreira T, Segrè AV, et al. (2012) Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat Genet 44: 981–990. doi:10.1038/ng.2383.
40.
Bergen SE, Petryshen TL (2012) Genome-wide association studies (GWAS) of schizophrenia: does bigger lead to better results? Curr Opin Psychiatry 25: 76–82. doi:10.1097/YCO.0b013e32835035dd.
41. Purcell SM, Wray NR, Stone JL, Visscher PM, O’Donovan MC, et al. (2009) Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460: 748–752. doi:10.1038/nature08185.
42. Schizophrenia Psychiatric Genome-Wide Association Study (GWAS) Consortium (2011) Genome-wide association study identifies five new schizophrenia loci. Nat Genet 43: 969–976. doi:10.1038/ng.940.
43. Speliotes EK, Willer CJ, Berndt SI, Monda KL, Thorleifsson G, et al. (2010) Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat Genet 42: 937–948. doi:10.1038/ng.686.
44. Teslovich TM, Musunuru K, Smith AV, Edmondson AC, Stylianou IM, et al. (2010) Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466: 707–713. doi:10.1038/nature09270.
45. Newton-Cheh C, Johnson T, Gateva V, Tobin MD, Bochud M, et al. (2009) Eight blood pressure loci identified by genome-wide association study of 34,433 people of European ancestry. Nat Genet 41: 666–676. doi:10.1038/ng.361.
46. Evans DM, Cardon LR (2004) Guidelines for Genotyping in Genomewide Linkage Studies: Single-Nucleotide–Polymorphism Maps Versus Microsatellite Maps. Am J Hum Genet 75: 687–692. doi:10.1086/424696.
47. International HapMap Consortium (2003) The International HapMap Project. Nature 426: 789–796. doi:10.1038/nature02168.
48. Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, et al. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65. doi:10.1038/nature11632.
49. Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci 74: 5463–5467.
50. Novel detection assay by PCR–RFLP and frequency of the CYP3A... : Pharmacogenetics and Genomics (n.d.).
Available: http://journals.lww.com/jpharmacogenetics/Fulltext/2002/06000/Novel_detection_assay_by_PCR_RFLP_and_frequency_of.9.aspx. Accessed 2 April 2014.
51. Mei R, Galipeau PC, Prass C, Berno A, Ghandour G, et al. (2000) Genome-wide Detection of Allelic Imbalance Using Human SNPs and High-density DNA Arrays. Genome Res 10: 1126–1137. doi:10.1101/gr.10.8.1126.
52. Reich DE, Cargill M, Bolk S, Ireland J, Sabeti PC, et al. (2001) Linkage disequilibrium in the human genome. Nature 411: 199–204. doi:10.1038/35075590.
53. Browning BL, Browning SR (2009) A Unified Approach to Genotype Imputation and Haplotype-Phase Inference for Large Data Sets of Trios and Unrelated Individuals. Am J Hum Genet 84: 210–223. doi:10.1016/j.ajhg.2009.01.005.
54. Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR (2010) MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol 34: 816–834. doi:10.1002/gepi.20533.
55. Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR (2012) Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet 44: 955–959. doi:10.1038/ng.2354.
56. Panagiotou OA, Ioannidis JPA, Genome-Wide Significance Project (2012) What should the genome-wide significance threshold be? Empirical replication of borderline genetic associations. Int J Epidemiol 41: 273–286. doi:10.1093/ije/dyr178.
57. Johnson RC, Nelson GW, Troyer JL, Lautenberger JA, Kessing BD, et al. (2010) Accounting for multiple comparisons in a genome-wide association study (GWAS). BMC Genomics 11: 724. doi:10.1186/1471-2164-11-724.
58. Klein RJ, Zeiss C, Chew EY, Tsai J-Y, Sackler RS, et al. (2005) Complement Factor H Polymorphism in Age-Related Macular Degeneration. Science 308: 385–389. doi:10.1126/science.1109557.
59. Welter D, MacArthur J, Morales J, Burdett T, Hall P, et al. (2014) The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 42: D1001–1006. doi:10.1093/nar/gkt1229.
60. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81: 559–575. doi:10.1086/519795.
61. Anderson CA, McRae AF, Visscher PM (2006) A Simple Linear Regression Method for Quantitative Trait Loci Linkage Analysis With Censored Observations. Genetics 173: 1735–1745. doi:10.1534/genetics.106.055921.
62. Stefansson H, Helgason A, Thorleifsson G, Steinthorsdottir V, Masson G, et al. (2005) A common inversion under selection in Europeans. Nat Genet 37: 129–137. doi:10.1038/ng1508.
63. Gamazon ER, Nicolae DL, Cox NJ (2011) A Study of CNVs As Trait-Associated Polymorphisms and As Expression Quantitative Trait Loci. PLoS Genet 7. Available: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3033384/. Accessed 3 April 2014.
64. Falchi M, El-Sayed Moustafa JS, Takousis P, Pesce F, Bonnefond A, et al. (2014) Low copy number of the salivary amylase gene predisposes to obesity. Nat Genet advance online publication. Available: http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.2939.html. Accessed 3 April 2014.
65. Civelek M, Lusis AJ (2014) Systems genetics approaches to understand complex traits. Nat Rev Genet 15: 34–48. doi:10.1038/nrg3575.
66. Franke L, Bakel H van, Fokkens L, de Jong ED, Egmont-Petersen M, et al. (2006) Reconstruction of a Functional Human Gene Network, with an Application for Prioritizing Positional Candidate Genes. Am J Hum Genet 78: 1011–1025. doi:10.1086/504300.
67. Raychaudhuri S, Plenge RM, Rossin EJ, Ng ACY, Purcell SM, et al. (2009) Identifying Relationships among Genomic Disease Regions: Predicting Genes at Pathogenic SNP Associations and Rare Deletions. PLoS Genet 5: e1000534. doi:10.1371/journal.pgen.1000534.
68. Campbell CD, Ogburn EL, Lunetta KL, Lyon HN, Freedman ML, et al. (2005) Demonstrating stratification in a European American population. Nat Genet 37: 868–872. doi:10.1038/ng1607.
69.
Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38: 904–909. doi:10.1038/ng1847.
70. Wu C, DeWan A, Hoh J, Wang Z (2011) A comparison of association methods correcting for population stratification in case-control studies. Ann Hum Genet 75: 418–427. doi:10.1111/j.1469-1809.2010.00639.x.
71. McKeigue PM (2005) Prospects for admixture mapping of complex traits. Am J Hum Genet 76: 1–7. doi:10.1086/426949.
72. Tang H, Jorgenson E, Gadde M, Kardia SLR, Rao DC, et al. (2006) Racial admixture and its impact on BMI and blood pressure in African and Mexican Americans. Hum Genet 119: 624–633. doi:10.1007/s00439-006-0175-4.
73. Patterson N, Hattangadi N, Lane B, Lohmueller KE, Hafler DA, et al. (2004) Methods for high-density admixture mapping of disease genes. Am J Hum Genet 74: 979–1000. doi:10.1086/420871.
74. Price AL, Tandon A, Patterson N, Barnes KC, Rafaels N, et al. (2009) Sensitive Detection of Chromosomal Segments of Distinct Ancestry in Admixed Populations. PLoS Genet 5: e1000519. doi:10.1371/journal.pgen.1000519.
75. Freedman ML, Haiman CA, Patterson N, McDonald GJ, Tandon A, et al. (2006) Admixture mapping identifies 8q24 as a prostate cancer risk locus in African-American men. Proc Natl Acad Sci U S A 103: 14068–14073. doi:10.1073/pnas.0605832103.
76. Cheng C-Y, Kao WHL, Patterson N, Tandon A, Haiman CA, et al. (2009) Admixture mapping of 15,280 African Americans identifies obesity susceptibility loci on chromosomes 5 and X. PLoS Genet 5: e1000490. doi:10.1371/journal.pgen.1000490.
77. Basu A, Tang H, Arnett D, Gu CC, Mosley T, et al. (2009) Admixture mapping of quantitative trait loci for BMI in African Americans: evidence for loci on chromosomes 3q, 5q, and 15q. Obes Silver Spring Md 17: 1226–1231. doi:10.1038/oby.2009.24.
78. Basu A, Tang H, Lewis CE, North K, Curb JD, et al.
(2009) Admixture mapping of quantitative trait loci for blood lipids in African-Americans. Hum Mol Genet 18: 2091–2098. doi:10.1093/hmg/ddp122.
79. Sawyer SL, Mukherjee N, Pakstis AJ, Feuk L, Kidd JR, et al. (2005) Linkage disequilibrium patterns vary substantially among populations. Eur J Hum Genet EJHG 13: 677–686. doi:10.1038/sj.ejhg.5201368.
80. Kang SJ, Chiang CWK, Palmer CD, Tayo BO, Lettre G, et al. (2010) Genome-wide association of anthropometric traits in African- and African-derived populations. Hum Mol Genet 19: 2725–2738. doi:10.1093/hmg/ddq154.
81. Visscher PM (2008) Sizing up human height variation. Nat Genet 40: 489–490. doi:10.1038/ng0508-489.
82. Schilling MF, Watkins AE, Watkins W (2002) Is Human Height Bimodal? Am Stat 56: 223–229. doi:10.1198/00031300265.
83. Wynn J, King TM, Gambello MJ, Waller DK, Hecht JT (2007) Mortality in achondroplasia study: a 42-year follow-up. Am J Med Genet A 143A: 2502–2511. doi:10.1002/ajmg.a.31919.
84. Gray JR, Bridges AB, Faed MJ, Pringle T, Baines P, et al. (1994) Ascertainment and severity of Marfan syndrome in a Scottish population. J Med Genet 31: 51–54. doi:10.1136/jmg.31.1.51.
85. Scott CI Jr (1976) Achondroplastic and hypochondroplastic dwarfism. Clin Orthop: 18–30.
86. Roberts WC, Honig HS (1982) The spectrum of cardiovascular disease in the Marfan syndrome: A clinico-morphologic study of 18 necropsy patients and comparison to 151 previously reported necropsy patients. Am Heart J 104: 115–135. doi:10.1016/0002-8703(82)90650-0.
87. Weedon MN, Lettre G, Freathy RM, Lindgren CM, Voight BF, et al. (2007) A common variant of HMGA2 is associated with adult and childhood height in the general population. Nat Genet 39: 1245–1250. doi:10.1038/ng2121.
88. Stattin E-L, Wiklund F, Lindblom K, Onnerfjord P, Jonsson B-A, et al. (2010) A missense mutation in the aggrecan C-type lectin domain disrupts extracellular matrix interactions and causes dominant familial osteochondritis dissecans. Am J Hum Genet 86: 126–137.
doi:10.1016/j.ajhg.2009.12.018.
89. Tompson SW, Merriman B, Funari VA, Fresquet M, Lachman RS, et al. (2009) A recessive skeletal dysplasia, SEMD aggrecan type, results from a missense mutation affecting the C-type lectin domain of aggrecan. Am J Hum Genet 84: 72–79. doi:10.1016/j.ajhg.2008.12.001.
90. Zihlman AL, Stahl D, Boesch C (2008) Morphological variation in adult chimpanzees (Pan troglodytes verus) of the Taï National Park, Côte d’Ivoire. Am J Phys Anthropol 135: 34–41. doi:10.1002/ajpa.20702.
91. Bogin B, Varela-Silva MI (2010) Leg length, body proportion, and health: a review with a note on beauty. Int J Environ Res Public Health 7: 1047–1075. doi:10.3390/ijerph7031047.
92. Chhabra SK (2008) Using arm span to derive height: Impact of three estimates of height on interpretation of spirometry. Ann Thorac Med 3: 94–99. doi:10.4103/1817-1737.39574.
93. Stokes DC, Pyeritz RE, Wise RA, Fairclough D, Murphy EA (1988) Spirometry and chest wall dimensions in achondroplasia. Chest 93: 364–369.
94. Spranger JW, Densler J (1970) Spondyloepiphyseal Dysplasia Congenita. Radiology 94: 313–322. doi:10.1148/94.2.313.
95. Fredriks A, van Buuren S, van Heel WJM, Dijkman-Neerincx R, Verloove-Vanhoric... S, et al. (2005) Nationwide age references for sitting height, leg length, and sitting height/height ratio, and their diagnostic value for disproportionate growth disorders. Arch Dis Child 90: 807–812. doi:10.1136/adc.2004.050799.
96. Noordam C, Dhir V, McNelis JC, Schlereth F, Hanley NA, et al. (2009) Inactivating PAPSS2 Mutations in a Patient with Premature Pubarche. N Engl J Med 360: 2310–2318. doi:10.1056/NEJMoa0810489.
97. Conway BN, Shu X-O, Zhang X, Xiang Y-B, Cai H, et al. (2012) Age at Menarche, the Leg Length to Sitting Height Ratio, and Risk of Diabetes in Middle-Aged and Elderly Chinese Men and Women. PLoS ONE 7. Available: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3309033/. Accessed 16 March 2014.
98.
Clarke AJ, Cooper DN (2010) GWAS: heritability missing in action? Eur J Hum Genet 18: 859–861. doi:10.1038/ejhg.2010.35.
99. Dempster ER, Lerner IM (1950) Heritability of Threshold Characters. Genetics 35: 212–236.
100. Sharma AM, Kushner RF (2009) A proposed clinical staging system for obesity. Int J Obes 33: 289–295. doi:10.1038/ijo.2009.2.
101. Willer CJ, Li Y, Abecasis GR (2010) METAL: fast and efficient meta-analysis of genomewide association scans. Bioinforma Oxf Engl 26: 2190–2191. doi:10.1093/bioinformatics/btq340.
102. Sherry ST, Ward M-H, Kholodov M, Baker J, Phan L, et al. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29: 308–311. doi:10.1093/nar/29.1.308.
103. Yang J, Ferreira T, Morris AP, Medland SE, Consortium GI of AnT (GIANT), et al. (2012) Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat Genet 44: 369–375. doi:10.1038/ng.2213.

Chapter 2

Common variants show predicted polygenic effects on height in the tails of the distribution, except in extremely short individuals

Yingleong Chan1,2,3*, Oddgeir L Holmen4,5*, Andrew Dauber2,3*, Lars Vatten6, Aki S Havulinna7, Frank Skorpen8, Kirsti Kvaløy4, Kaisa Silander7,9, Thutrang T Nguyen3, Cristen Willer10, Michael Boehnke10, Markus Perola7,9,11, Aarno Palotie2,9,12,13, Veikko Salomaa7, Kristian Hveem4, Timothy M Frayling14*, Joel N Hirschhorn1,2,3*, Michael N Weedon14*

1 Harvard Medical School, Department of Genetics, Boston, Massachusetts, USA.
2 Broad Institute, Cambridge, Massachusetts, USA.
3 Children's Hospital Boston, Boston, Massachusetts, USA.
4 HUNT Research Centre, Department of Public Health and General Practice, Norwegian University of Science and Technology, Levanger, Norway.
5 St. Olav Hospital, Trondheim University Hospital, Trondheim, Norway.
6 Department of Public Health and General Practice, Norwegian University of Science and Technology, Trondheim, Norway.
7 National Institute for Health and Welfare, Helsinki, Finland.
8 Department of Laboratory Medicine, Children’s and Women’s Health, Norwegian University of Science and Technology, Trondheim, Norway.
9 Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland.
10 Department of Internal Medicine, Division of Cardiovascular Medicine, University of Michigan, Ann Arbor, Michigan, USA.
11 Estonian Genome Project, University of Tartu, Tartu, Estonia.
12 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
13 Department of Medical Genetics, University of Helsinki and University Central Hospital, Helsinki, Finland.
14 Genetics of Complex Traits, Peninsula Medical School, University of Exeter, Exeter, UK.
* These authors contributed equally to this work

Originally published as: Chan Y, Holmen OL, Dauber A, et al., PLOS Genetics (2011). DOI: 10.1371/e1002439

ABSTRACT

Common genetic variants have been shown to explain a fraction of the inherited variation for many common diseases and quantitative traits, including height, a classic polygenic trait. The extent to which common variation determines the phenotype of highly heritable traits such as height is uncertain, as is the extent to which common variation is relevant to individuals with more extreme phenotypes. To address these questions, we studied 1,214 individuals from the top and bottom extremes of the height distribution (tallest and shortest ~1.5%), drawn from ~78,000 individuals from the HUNT and FINRISK cohorts. We found that common variants still influence height at the extremes of the distribution: common variants (49/141) were nominally associated with height in the expected direction more often than is expected by chance (p < 5 × 10⁻²⁸) and the odds ratios in the extreme samples were consistent with the effects estimated previously in population-based data.
To examine more closely whether the common variants have the expected effects, we calculated a weighted allele score (WAS), which is a weighted prediction of height for each individual based on the previously estimated effect sizes of the common variants in the overall population. The average WAS was consistent with expectation in the tall individuals, but was not as extreme as expected in the shortest individuals (p < 0.006), indicating that some of the short stature is explained by factors other than common genetic variation. The discrepancy was more pronounced (p < 10⁻⁶) in the most extreme individuals (height < 0.25 percentile). The results at the extreme short tails are consistent with a large number of models incorporating either rare genetic, non-additive or rare non-genetic factors that decrease height. We conclude that common genetic variants are associated with height at the extremes as well as across the population, but that additional factors become more prominent at the shorter extreme.

AUTHOR SUMMARY

Although many loci in the human genome have been found to be significantly associated with height, it is unclear whether these loci have similar effects in extremely tall and short individuals. Here, we examine hundreds of extremely tall and short individuals in two population-based cohorts to see whether these known height-determining loci are as predictive as expected in these individuals. We found that these loci are generally as predictive of height as expected in these individuals but that they begin to be less predictive in the most extremely short individuals. We showed that this result is consistent with models that include not only the common variants but also multiple low-frequency genetic variants that substantially decrease height. However, this result is also consistent with non-additive genetic effects or rare non-genetic factors that substantially decrease height.
This finding suggests a possible major role for low-frequency variants, particularly in individuals with extreme phenotypes, and has implications for whole-genome or whole-exome sequencing efforts to discover rare genetic variation associated with complex traits.

INTRODUCTION

Height is a highly heritable trait, with estimates of heritability as high as 90% [1]. Recent genome-wide association studies of height have discovered over 180 common variants associated with height [2]. These variants have small effect sizes and collectively explain approximately 10% of the heritability. While these 180 common variants are robustly associated with height when studied as a quantitative trait in the general population, it is not known whether these variants have similar associations with stature in individuals at the extreme tails of the height distribution. If these common variants do not show the expected association with stature at the extremes (based on their effect sizes estimated from height as a continuous trait), then other factors beyond common variants must contribute to extreme stature. Although there are multiple possible scenarios, one possible explanation is the existence of rare or low-frequency variants with larger effect sizes, which have been proposed to explain a portion of the heritability not accounted for by the known common variants [3–5] and which may provide novel biological insights into mechanisms that affect height. Understanding the role of common variants in the tails of the height distribution will also provide methodological insight into the utility of extreme tails analysis for future genetic studies of quantitative traits. In this chapter, we describe our approach to determining whether common alleles known to be associated with height in the general population have the expected distribution in individuals from the extremes of the height distribution.
We used DNA samples from individuals with extreme heights from two population-based cohorts of Finnish (FINRISK) and Norwegian (HUNT) ancestry and genotyped them for common variants known to be associated with height. Under a polygenic model in which there are many variants and each variant additively contributes a small effect to the phenotype, we found that for individuals within ~2.81 standard deviations of the mean, the common variants have the predicted associations with height, consistent with their effect sizes estimated from the previous population study [2]. However, in individuals with more extreme short stature (the shortest 0.25% of the distribution), common variants play a less prominent role in explaining phenotype, and the data are consistent with various models in which rare variants, non-additive effects or rare non-genetic factors contribute to short stature in these individuals.

RESULTS

Individual common variants are associated with height in the extremes

We attempted to genotype SNPs at the 180 loci previously associated with height in individuals from the short and tall extremes of the FINRISK and HUNT cohorts and then performed association analyses for each SNP with height, using the Cochran-Mantel-Haenszel test in FINRISK and logistic regression in HUNT. In FINRISK, SNPs at 158 of the height loci were successfully genotyped in 181 short and 192 tall individuals from the 1% tails of the height distribution. In the HUNT study, SNPs at 160 of the height loci were successfully genotyped in 385 short and 456 tall individuals from the ~1.5% tails of height. Here we focus on the 279 short and 309 tall individuals from the 1% tails of the HUNT study, so as to provide consistency with the FINRISK study.
In both cohorts, the majority of SNPs had effect directions consistent with the published results [2] (HUNT 137/160, p<0.0001; FINRISK 122/155, p<0.0001) and there was a significant enrichment in SNPs reaching nominal significance for association with height (Table 2.1; Table 2.2). We then combined the data from both cohorts in a meta-analysis of 141 overlapping loci (Table 2.3). Ninety-one percent of SNPs (128/141, p<0.0001) had directions of effect consistent with previously published results [2] and 49 SNPs had p-values <0.05, as opposed to 7 expected by chance (p < 5 × 10⁻²⁸). This result confirms that, as a group, SNPs found to be associated with height in the general population are also associated with height at the extremes of the phenotypic spectrum.

The effect sizes of individual common variants on height are similar in the extremes and the general population

We next tested whether the observed odds ratios (OR) are consistent with the expected odds ratios, based on the previously estimated effect sizes from the GIANT study [2] and study

Table 2.1: Individual SNP analysis for HUNT cohort

Rsid        Chr  Pos        Closest gene  Effect allele  Effect size  Freq  Observed OR  Expected OR
rs425277    1    2059032    PRKCZ         t              0.0240       0.28  1.19         1.14
rs6657613   1    17200787   MFAP2         t              0.0328       0.52  1.19         1.19
rs2903545   1    23413695   HTR1D         t              0.0215       0.59  1.16         1.12
rs4601530   1    24916698   CLIC4         c              0.0238       0.74  1.02         1.14
rs7532866   1    26614131   LIN28         a              0.0222       0.68  1.16         1.13
rs11209376  1    41270935   SCMH1         g              0.0319       0.22  1.33         1.19
rs17391694  1    78396214   GIPC2         t              0.0399       0.13  1.25         1.24
rs6699417   1    88896031   PKN2          t              0.0217       0.63  1.12         1.12
rs10874746  1    93096559   RPL5          c              0.0217       0.64  1.12         1.12
rs12731372  1    118654498  SPAG17        c              0.0379       0.75  1.25         1.22
rs11205277  1    148159496  SF3B4         g              0.0452       0.44  1.46         1.27
rs17346473  1    170349716  DNM3          g              0.0365       0.30  0.95         1.21
rs1014719   1    175069389  PAPPA2        t              0.0253       0.55  1.21         1.14
rs1046934   1    182290152  TSEN15        c              0.0459       0.35  1.21         1.28
rs10863936  1    210304421  DTL           g              0.0220       0.47  1.13         1.12
rs6684205   1    216676325  TGFB2         g              0.0328       0.26  1.15         1.19
rs11118346  1    217810342  LYPLAL1       c              0.0264       0.58  1.12         1.15
rs1172294   2    25022704   DNAJC27       a              0.0334       0.50  1.42         1.19
rs1545552   2    33213842   LTBP1         g              0.0246       0.72  1.27         1.14
rs2341459   2    44621706   C2orf34       t              0.0276       0.28  1.18         1.16
rs3791675   2    55964813   EFEMP1        c              0.0496       0.75  1.65         1.30
rs1913671   2    88680998   EIF2AK3       c              0.0268       0.37  1.31         1.15
rs7567288   2    134151294  NCKAP5        c              0.0309       0.22  1.11         1.18
rs3770047   2    178393780  PDE11A        g              0.0402       0.06  0.99         1.24
rs12470505  2    219616613  CCDC108/IHH   t              0.0483       0.91  0.85         1.29
rs6756793   2    224737163  SERPINE2      t              0.0248       0.55  1.30         1.14
rs12694997  2    241911659  SEPT2         g              0.0274       0.77  1.07         1.16
rs2597513   3    13530836   HDAC11        c              0.0392       0.10  1.73         1.23
rs13088462  3    51046753   DOCK3         c              0.0543       0.07  1.49         1.34
rs2336725   3    53093779   RTF1          c              0.0263       0.46  0.96         1.15
rs9833926   3    56625218   C3orf63       a              0.0216       0.50  1.24         1.12
rs17806888  3    67499012   SUCLG2        t              0.0399       0.90  1.40         1.24
rs11128265  3    72538487   RYBP          a              0.0304       0.80  1.27         1.18
rs6765930   3    130503468  C3orf47       g              0.0352       0.79  1.24         1.21
rs9844666   3    137456906  PCCB          g              0.0284       0.76  1.36         1.16
rs724016    3    142588260  ZBTB38        g              0.0670       0.46  1.52         1.43

Table 2.1 (Continued)
Rsid rs572169 rs720390 rs2247341 rs6449353 rs17081935 rs7697556 rs1975474 rs10010325 rs7689420 rs955748 rs13154066 rs6897117 rs6894139 rs13177718 rs274546 rs526896 rs4282339 rs12153391 rs889014 rs422421 rs6879260 rs12198986 rs806794 rs3129109 rs2596530 rs6457617 rs2780226 rs6457821 rs12530016 rs310405 rs7759938 rs3757235 rs6915129 rs1490384 rs6569648 rs7763064 rs543650
Chr 3 3 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
Pos 173648421 187031377 1671115 17642586 57518233 73734177 82397961 106325802 145787802 184452669 32867427 55022532 88363538 108141243 131727766 134384604 168188818 171136043 172916720 176449932 179663620 7665058 26308656 29192211 31495352 32771829 34307070 35510783 44974300 81857081 105485647 109818534 117629512 126892853 130390812 142838982 152152636
Closest gene GHSR IGF2BP2 SLBP/FGFR3 LCORL POLR2B ADAMTS3 PRKG2/BMP3 TET2 HHIP WWC2 NPR3
SLC38A9 MEF2C FER SLC22A5 PITX1 SLIT3 FBXW11 BOD1 FGFR4/NSD1 GFPT2 BMP6 Histone cluster OR2J3 MICA HLA locus HMGA1 PPARD/FANCE SUPT3H/RUNX2 FAM46A LIN28B ZBTB24 VGLL2 C6orf173 L3MBTL3 GPR126 ESR1 Effect allele t a a t t t g a c g t t t c g t g c c c c a a c g c c c g a c c c t c g g Effect size 0.0355 0.0305 0.0251 0.0714 0.0306 0.0219 0.0376 0.0214 0.0687 0.0243 0.0350 0.0278 0.0266 0.0412 0.0278 0.0315 0.0352 0.0329 0.0290 0.0332 0.0281 0.0359 0.0528 0.0257 0.0341 0.0238 0.0790 0.1210 0.0305 0.0300 0.0420 0.0216 0.0216 0.0370 0.0358 0.0445 0.0318 Freq 0.33 0.36 0.38 0.86 0.19 0.48 0.30 0.47 0.84 0.78 0.39 0.27 0.56 0.90 0.59 0.72 0.81 0.76 0.67 0.79 0.60 0.46 0.71 0.64 0.53 0.52 0.08 0.98 0.80 0.52 0.35 0.58 0.60 0.54 0.24 0.72 0.59 Observed OR 1.32 1.24 1.29 1.52 1.01 1.21 1.30 1.21 1.28 1.23 1.12 1.28 1.18 1.48 1.27 1.14 1.08 1.21 1.15 1.13 1.24 1.34 0.98 0.90 1.38 1.38 1.39 0.98 1.55 1.08 1.00 0.91 1.13 1.17 1.26 1.31 1.25 Expected OR 1.21 1.18 1.14 1.46 1.18 1.12 1.22 1.12 1.44 1.14 1.20 1.16 1.15 1.25 1.16 1.18 1.21 1.19 1.17 1.19 1.16 1.21 1.32 1.15 1.20 1.14 1.52 1.90 1.18 1.17 1.25 1.12 1.12 1.22 1.21 1.27 1.18 44 Table 2.1 (Continued) Rsid rs12206717 rs798489 rs1708299 rs6959212 rs42235 rs822552 rs17088190 rs6473015 rs6470764 rs894345 rs7864648 rs11144688 rs296886 rs181338 rs2814828 rs9969804 rs1257763 rs473902 rs7027110 rs1468758 rs751543 rs7466269 rs12338076 rs7909670 rs7332 rs11599750 rs2237886 rs7937898 rs1330 rs2904315 rs1814175 rs3782089 rs7112925 rs606452 rs494459 rs654723 rs2954980 Chr 6 7 7 7 7 7 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 10 10 10 11 11 11 11 11 11 11 11 11 11 12 Pos 158830686 2768329 28156471 38094851 92086012 148281567 24167275 78341040 130794847 135682763 16358732 77732106 85781846 88297981 90001002 94468941 95933766 97296056 108638867 112846903 118162163 132453905 138261561 12958770 80784066 101795432 2767307 12660137 17272605 48066524 49515748 65093395 66582736 74953826 118079885 128091365 11750815 Closest gene TULP4 GNA12 JAZF1 
STARD3NL CDK6 PDIA4 ADAM28 PEX2 GSDMC ZFAT BNC2 PCSK5 C9orf64 ZCCHC6 SPIN1 IPPK PTPDC1 PTCH1/FANCC ZNF462 LPAR1 PAPPA FUBP3 QSOX2 CCDC3 PPIF CPN1 KCNQ1 TEAD1 NUCB2 PTPRJ/SLC39A13 FOLH1 SSSCA1 RHOD SERPINH1 TREH FLI1 ETV6 Effect allele g c a c t g c c c c t g g t t a a t a c t a c c g c t g t a t c c a t a t Effect size 0.0487 0.0515 0.0417 0.0229 0.0548 0.0302 0.0278 0.0320 0.0469 0.0297 0.0246 0.0548 0.0250 0.0234 0.0268 0.0281 0.0685 0.0741 0.0337 0.0258 0.0287 0.0359 0.0304 0.0219 0.0252 0.0230 0.0429 0.0239 0.0241 0.0311 0.0230 0.0583 0.0229 0.0397 0.0207 0.0237 0.0295 Freq 0.95 0.75 0.34 0.68 0.30 0.27 0.75 0.32 0.83 0.59 0.35 0.87 0.19 0.53 0.24 0.45 0.06 0.93 0.22 0.76 0.69 0.66 0.29 0.55 0.50 0.66 0.11 0.48 0.38 0.30 0.44 0.93 0.64 0.16 0.43 0.61 0.36 Observed OR 0.91 1.37 1.43 1.24 1.25 1.20 0.91 1.25 0.92 1.17 0.96 1.14 1.04 1.20 1.31 0.94 1.31 1.01 0.81 1.03 1.13 1.39 0.95 1.11 1.30 1.12 1.16 0.90 1.16 1.04 1.15 1.10 1.20 1.12 0.99 1.19 1.11 Expected OR 1.30 1.32 1.25 1.13 1.34 1.17 1.16 1.19 1.28 1.17 1.14 1.34 1.14 1.13 1.15 1.16 1.44 1.48 1.20 1.15 1.17 1.21 1.18 1.12 1.14 1.13 1.26 1.14 1.14 1.18 1.13 1.36 1.13 1.24 1.12 1.13 1.17 45 Table 2.1 (Continued) Rsid rs10770705 rs2638953 rs2066807 rs1351394 rs10748128 rs11107116 rs12298826 rs7332115 rs3118906 rs4773624 rs1950500 rs10483727 rs6573834 rs862031 rs10150088 rs16964211 rs7178424 rs10152591 rs3759901 rs5742915 rs11259936 rs16942341 rs4965598 rs1659127 rs4640244 rs3110496 rs3764419 rs17780080 rs1043515 rs4986172 rs11652146 rs227723 rs2079795 rs12325866 rs11867479 rs4800452 rs2078286 Chr 12 12 12 12 12 12 12 13 13 13 14 14 14 14 14 15 15 15 15 15 15 15 15 16 17 17 17 17 17 17 17 17 17 17 17 18 18 Pos 20748734 28425682 55026949 64638093 68113925 92502635 122394981 32045548 50004789 90817730 23900690 60142628 67878151 74061608 91573329 49317787 60167551 67835211 70298469 72123686 82371586 87189909 98577137 14295806 21224816 24941897 26188149 27367259 34175722 40571807 44777362 52133903 56851431 
59109706 65601802 18981609 45132860 Closest gene SLCO1C1 CCDC91 STAT2 HMGA2 FRS2 SOCS2 SBNO1 PDS5B/BRCA2 DLEU7 GPC5 NFATC4 SIX6 RAD51L1 LTBP2 TRIP11 CYP19A1 C2CD4A TLE3 MYO9A PML ADAMTSL3 ACAN ADAMTS17 MKL2 KCNJ12 ANKRD13B ATAD5/RNF135 LRRC37B PIP4K2B ACBD4 ZNF652 NOG TBX2 CSH1/GH1 KCNJ16/KCNJ2 CABLES1 DYM Effect allele a c g t t t g g g g t t c g t g c a a c c c c a a g c a g c g t t a t t a Effect size 0.0314 0.0356 0.0520 0.0535 0.0347 0.0524 0.0350 0.0250 0.0518 0.0286 0.0323 0.0322 0.0253 0.0224 0.0270 0.0511 0.0235 0.0447 0.0555 0.0308 0.0419 0.1335 0.0353 0.0240 0.0279 0.0229 0.0374 0.0344 0.0219 0.0283 0.0255 0.0272 0.0395 0.0343 0.0240 0.0475 0.0372 Freq 0.34 0.69 0.08 0.55 0.37 0.21 0.21 0.39 0.75 0.40 0.24 0.38 0.80 0.64 0.60 0.95 0.51 0.88 0.02 0.47 0.48 0.98 0.30 0.29 0.56 0.67 0.62 0.17 0.54 0.68 0.30 0.28 0.33 0.28 0.36 0.80 0.41 Observed OR 1.14 1.53 1.68 1.31 1.36 1.68 0.85 1.10 1.24 1.11 1.55 1.25 1.01 1.42 1.05 1.32 1.12 1.15 0.94 0.91 1.30 1.26 1.20 1.05 1.01 0.81 1.12 1.01 1.38 1.02 0.96 1.13 1.23 1.36 1.24 1.33 1.20 Expected OR 1.18 1.21 1.32 1.33 1.20 1.32 1.20 1.14 1.32 1.16 1.19 1.19 1.14 1.13 1.15 1.31 1.13 1.27 1.34 1.18 1.25 2.04 1.21 1.14 1.16 1.13 1.22 1.20 1.12 1.16 1.15 1.16 1.23 1.20 1.14 1.29 1.22 46 Table 2.1 (Continued) Rsid rs6567160 rs12980348 rs891088 rs4542783 rs2279008 rs17318596 rs1741344 rs2145272 rs7274811 rs143384 rs1567865 rs2834440 rs4821083 Chr 18 19 19 19 19 19 20 20 20 20 20 21 22 Pos 55980115 2132607 7135762 8548160 17144303 46628935 4049800 6574218 31796842 33489170 47315374 34612369 31386341 Closest gene MC4R DOT1L INSR ADAMTS10 MYO9B ATP5SL SMOX BMP2 ZNF341 GDF5 ZNFX1 KCNE2 SYN3 Effect allele c g g t t a c g g g t a t Effect size 0.0245 0.0323 0.0251 0.0313 0.0308 0.0290 0.0263 0.0386 0.0402 0.0639 0.0337 0.0247 0.0332 Freq 0.27 0.36 0.23 0.55 0.75 0.38 0.39 0.34 0.76 0.41 0.22 0.62 0.84 Observed OR 1.15 1.30 0.91 1.07 1.04 1.31 1.14 1.25 1.47 1.42 1.15 1.23 1.15 Expected OR 1.14 1.19 1.14 1.18 1.18 1.17 1.15 
1.23 1.24 1.41 1.20 1.14 1.19 The table shows the results for the SNPs used in the individual association analysis in the HUNT cohort. 47 Table 2.2: Individual SNP analysis for FINRISK cohort Rsid rs425277 rs2284746 rs1738475 rs4601530 rs2154319 rs17391694 rs6699417 rs10874746 rs9428104 rs11205277 rs17346452 rs1325598 rs1046934 rs10863936 rs6684205 rs11118346 rs10799445 rs4665736 rs6714546 rs17511102 rs2341459 rs3791675 rs11684404 rs7567288 rs1351164 rs12470505 rs2629046 rs2580816 rs12694997 rs2597513 rs13088462 rs2336725 rs9835332 rs9863706 rs9844666 rs724016 rs572169 rs720390 rs2247341 rs6449353 rs17081935 rs7697556 rs10010325 rs7689420 rs955748 Chr 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 Pos 2069172 17306675 23536891 24916698 41745770 78623626 89123443 93323971 118855587 149892872 172053287 175058872 184023529 212237798 218609702 219743719 227911883 25187599 33361425 37960613 44768202 56111309 88924622 134151294 217980143 219908369 225047744 232797966 241911659 13555836 51071713 53093779 56642722 72437413 135974216 142588510 172165727 185548683 1671115 18033488 57823476 73515313 106106353 145568352 184215675 Closest gene PTCH1/FANCC FAM46A NPR3 FUBP3 OR2J3 SLC38A9 SBNO1 DNM3 ADAMTS10 TGFB2 WWC2 RTF1 SCMH1 SF3B4 CCDC53/GNPTAB TSEN15 SPAG17 PPIF LTBP1 CEP120 ZBTB24 BNC2 DNAJC27 IGF1R RYBP NCKAP5 GPR126 C6orf173 NPPC L3MBTL3 DOCK3 LIN28B GDF5 KCNE2 ZNFX1 C2CD4A SERPINH1 CYP19A1 HMGA1 ETV6 TET2 CTU2/GALNS PRKCZ MKL2 BMP2 Effect allele T A C A T T T T C G A T C G A C A T C A T G C G T T T C C C C C A A G C T A A T T T A G A Effect size 0.024 0.0354 0.0216 0.0238 0.0335 0.0399 0.0217 0.0217 0.0375 0.0452 0.038 0.0256 0.0459 0.022 0.0328 0.0264 0.0306 0.0335 0.0254 0.0601 0.0276 0.0496 0.027 0.0309 0.0279 0.0483 0.0247 0.0412 0.0274 0.0392 0.0543 0.0263 0.0217 0.0304 0.0284 0.067 0.0355 0.0305 0.0251 0.0714 0.0306 0.0219 0.0214 0.0687 0.0243 Freq 0.30 0.52 0.61 0.27 0.74 0.12 0.66 0.36 0.22 0.62 0.77 0.45 0.63 0.53 0.67 
0.50 0.72 0.59 0.28 0.90 0.31 0.25 0.65 0.75 0.75 0.88 0.55 0.20 0.24 0.88 0.92 0.55 0.48 0.21 0.23 0.58 0.33 0.43 0.40 0.87 0.22 0.49 0.45 0.19 0.24 Observed OR 1.18 1.52 1.14 1.25 1.18 2.31 0.75 0.98 1.09 1.76 1.37 1.12 1.47 1.10 1.16 1.30 1.73 1.22 1.53 2.05 1.59 1.25 0.89 1.05 1.62 1.37 1.11 1.65 0.88 0.77 1.02 1.48 1.03 0.98 0.99 1.58 1.06 1.17 0.96 1.40 1.00 1.49 1.12 1.81 0.87 Expected OR 1.14 1.21 1.12 1.14 1.20 1.24 1.12 1.12 1.22 1.28 1.23 1.15 1.28 1.13 1.19 1.15 1.18 1.20 1.15 1.38 1.16 1.31 1.16 1.18 1.16 1.30 1.14 1.25 1.16 1.24 1.34 1.15 1.12 1.18 1.17 1.44 1.21 1.18 1.14 1.47 1.18 1.13 1.12 1.45 1.14 48 Table 2.2 (Continued) Rsid rs1173727 rs11958779 rs10037512 rs1582931 rs274546 rs526896 rs4282339 rs12153391 rs889014 rs422421 rs6879260 rs3812163 rs1047014 rs806794 rs3129109 rs2256183 rs2780226 rs9472414 rs9360921 rs310405 rs7759938 rs1046943 rs961764 rs1490384 rs6569648 rs7763064 rs543650 rs9456307 rs798489 rs4470914 rs12534093 rs1708299 rs6959212 rs42235 rs822552 rs7460090 rs6473015 rs6470764 rs12680655 rs7864648 rs11144688 rs7853377 rs2778031 rs1257763 Chr 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 8 8 8 8 9 9 9 9 9 Pos 32830521 55001899 88354675 122657199 131727766 134384604 168256240 171203438 172984114 176517326 179731014 7670759 19949472 26200677 29084232 31380529 34199092 44946506 76265642 81800362 105378954 109783941 117522156 126851160 130349119 142797289 152110943 158929442 2801803 19616522 23502974 28189946 38128326 92248076 148650634 57194163 78178485 130725665 135637337 16358732 78542286 86552205 90835726 96893945 Closest gene LTBP1 EFEMP1 MFAP2 SLBP/FGFR3 TULP4 SSSCA1 ZNF462 EIF2AK3 MC4R PTPDC1 PDS5B/BRCA2 PCSK5 PKN2 KCNJ16/KCNJ2 PEX2 PPARD/FANCE TWISTNB SMOX INSR CDK6 KCNJ12 GIPC2 ZNF341 GHSR SOCS2 ANKRD13B RHOD MYO9B NOG PAPPA TNS1 HHIP DLEU7 SPIN1 DYM ADAMTSL3 STAT2 CCDC91 SERPINE2 PIP4K2B DTL LRRC37B GNA12 CCDC108/IHH Effect allele T T T G A T G T T G T A C A A A T T G A G A G T T G C T A T T A G T G T G 
C C T A A T A Effect size 0.0356 0.0282 0.0267 0.0254 0.0278 0.0315 0.0352 0.0329 0.029 0.0332 0.0281 0.0366 0.0291 0.0528 0.0257 0.0345 0.079 0.0306 0.0479 0.03 0.042 0.0223 0.0228 0.037 0.0358 0.0445 0.0318 0.0499 0.0515 0.0328 0.0298 0.0417 0.0229 0.0548 0.0302 0.0546 0.032 0.0469 0.0298 0.0246 0.0548 0.0256 0.0273 0.0685 Freq 0.38 0.72 0.58 0.49 0.45 0.72 0.22 0.28 0.40 0.22 0.38 0.57 0.75 0.64 0.37 0.37 0.92 0.22 0.85 0.48 0.71 0.53 0.41 0.49 0.76 0.29 0.47 0.07 0.33 0.17 0.20 0.32 0.30 0.32 0.74 0.86 0.74 0.19 0.55 0.35 0.12 0.76 0.23 0.06 Observed OR 1.47 0.89 1.45 0.93 1.58 1.27 0.90 1.31 1.17 1.19 1.38 1.08 1.19 1.21 0.88 1.29 1.19 0.92 1.83 1.40 1.48 0.96 1.47 1.11 1.26 1.92 1.34 1.48 1.19 1.22 1.24 0.95 0.93 1.33 1.22 1.23 1.41 1.21 1.21 1.18 0.91 1.88 1.25 1.17 Expected OR 1.21 1.16 1.15 1.15 1.16 1.19 1.21 1.19 1.17 1.20 1.16 1.22 1.17 1.33 1.15 1.20 1.53 1.18 1.29 1.18 1.25 1.13 1.13 1.22 1.21 1.27 1.19 1.31 1.32 1.19 1.17 1.25 1.13 1.34 1.18 1.34 1.19 1.29 1.17 1.14 1.34 1.15 1.16 1.45 49 Table 2.2 (Continued) Rsid rs473902 rs7027110 rs751543 rs7466269 rs7849585 rs7909670 rs2145998 rs11599750 rs2237886 rs7926971 rs1330 rs1814175 rs3782089 rs7112925 rs634552 rs494459 rs654723 rs2856321 rs10770705 rs2638953 rs2066807 rs1351394 rs11107116 rs7971536 rs11830103 rs1809889 rs7332115 rs3118905 rs7319045 rs1950500 rs1570106 rs7155279 rs16964211 rs7178424 rs12902421 rs5742915 rs11259936 rs16942341 rs4965598 rs2871865 rs1659127 rs8052560 rs4640244 rs3110496 rs3764419 Chr 9 9 9 9 9 10 10 10 11 11 11 11 11 11 11 11 11 12 12 12 12 12 12 12 12 12 13 13 13 14 14 14 15 15 15 15 15 15 15 15 16 16 17 17 17 Pos 98256235 108638867 119122342 133464084 138251691 12918764 81121696 101805442 2810731 12698040 17316029 49559172 65336819 66826160 75282052 118574675 128586155 11855773 20857467 28534415 56740682 66351826 93978504 102373788 122389499 124801226 33147548 51105334 92024574 24830850 68813115 91555634 49317787 62380259 72161403 74336633 84580582 89388905 98577137 
99194896 14388305 87304743 21284223 24941897 29164023 Closest gene CPN1 GPC5 ADAMTS17 ACAN ATAD5/RNF135 ACBD4 Histone JMJD4 MICA NME2 C3orf63 FBXW11 ZFAT NFATC4 FLI1 TEAD1 HMGA2 JAZF1 RPL5 ESR1 FGFR4/NSD1 PCCB PAPPA2 ZNF652 CDC42EP3 SLIT3 PML SDR16C5 MYO9A BOD1 IGF2BP2 RAD51L1 ADAMTS3 TRIP11 SEPT2 TREH LYPLAL1 POLR2B NUCB2 STARD3NL LCORL TBX2 CCDC3 PDIA4 GSDMC Effect allele T A T A T C G A T C T T C C T T A A A C T T T G T T C T A T G C C T A T T C T C A A A C C Effect size 0.0741 0.0337 0.0287 0.0359 0.0324 0.0219 0.0252 0.023 0.0429 0.0244 0.0241 0.023 0.0583 0.0229 0.0412 0.0207 0.0237 0.0298 0.0314 0.0356 0.052 0.0535 0.0524 0.0247 0.0351 0.0315 0.025 0.052 0.029 0.0323 0.0256 0.0285 0.0511 0.0235 0.0691 0.0308 0.0419 0.1335 0.0353 0.0535 0.024 0.0392 0.0279 0.0229 0.0374 Freq 0.90 0.23 0.73 0.60 0.32 0.46 0.56 0.34 0.11 0.58 0.33 0.30 0.07 0.32 0.17 0.41 0.64 0.65 0.30 0.70 0.93 0.49 0.24 0.41 0.78 0.31 0.58 0.33 0.45 0.30 0.23 0.35 0.09 0.54 0.95 0.56 0.47 0.05 0.68 0.88 0.35 0.79 0.59 0.36 0.39 Observed OR 1.08 0.98 1.23 1.03 1.06 1.02 0.98 1.17 1.58 1.16 1.07 1.31 0.84 1.05 1.43 1.13 1.13 1.22 1.29 0.96 1.19 1.45 1.04 0.78 1.12 1.33 0.92 1.31 1.49 1.06 1.38 0.95 1.39 1.34 0.96 0.98 1.82 1.90 1.17 1.45 1.09 1.07 1.24 1.01 1.18 Expected OR 1.49 1.20 1.17 1.21 1.19 1.13 1.15 1.13 1.26 1.14 1.14 1.13 1.37 1.13 1.25 1.12 1.14 1.17 1.18 1.21 1.32 1.33 1.33 1.14 1.21 1.19 1.14 1.32 1.17 1.19 1.15 1.17 1.32 1.14 1.45 1.18 1.25 2.06 1.21 1.33 1.14 1.24 1.16 1.13 1.22 50 Table 2.2 (Continued) Rsid rs17780086 rs1043515 rs4986172 rs4605213 rs2072153 rs227724 rs2079795 rs11867479 rs9967417 rs17782313 rs12982744 rs891088 rs4072910 rs2279008 rs1741344 rs2145272 rs7274811 rs143384 rs237743 rs2834442 rs4821083 Chr 17 17 17 17 17 17 17 17 18 18 19 19 19 19 20 20 20 20 20 21 22 Pos 30343282 36922196 43216281 46599746 47390014 52133816 59496649 68090207 46959500 57851097 2177193 7184762 8644031 17283303 4101800 6626218 32333181 33489170 47903019 35690786 31386341 Closest gene 
SLC22A5 CLIC4 FOLH1 QSOX2 GFPT2 SUPT3H/RUNX2 BMP6 C2orf34 SYN3 PITX1 HDAC11 DOT1L C9orf64 SENP6 MEF2C ID4 TLE3 ZBTB38 VGLL2 IGF2BP3 KCNQ1 Effect allele A C C C C A T T C G C G G T C C A G A A T Effect size 0.0346 0.0219 0.0283 0.0234 0.0264 0.0272 0.0395 0.024 0.0381 0.0249 0.0325 0.0251 0.0289 0.0308 0.0263 0.0386 0.0402 0.0639 0.0338 0.0269 0.0332 Freq 0.16 0.46 0.39 0.32 0.30 0.61 0.30 0.34 0.62 0.80 0.61 0.70 0.48 0.68 0.64 0.69 0.24 0.55 0.21 0.67 0.84 Observed OR 1.16 1.14 0.97 1.21 0.99 1.07 1.58 0.97 0.98 1.15 1.36 0.96 1.18 1.45 0.94 1.42 1.29 1.68 1.30 1.16 1.09 Expected OR 1.21 1.13 1.16 1.13 1.15 1.16 1.24 1.14 1.23 1.14 1.19 1.14 1.17 1.18 1.15 1.23 1.24 1.41 1.20 1.16 1.20 The table shows the results for the SNPs used in the individual association analysis in the FINRISK cohort. 51 Table 2.3: Meta-analysis of individual SNPs for HUNT and FINRISK cohort Rsid rs425277 rs2284746 rs1738475 rs4601530 rs2154319 rs17391694 rs6699417 rs10874746 rs9428104 rs11205277 rs17346452 rs1325598 rs1046934 rs10863936 rs6684205 rs11118346 rs4665736 rs6714546 rs2341459 rs3791675 rs11684404 rs7567288 rs12470505 rs2629046 rs12694997 rs2597513 rs13088462 rs2336725 rs9835332 rs9863706 rs9844666 rs724016 rs572169 rs720390 rs2247341 rs6449353 rs17081935 rs7697556 rs10010325 rs7689420 rs955748 rs1173727 rs11958779 rs10037512 Chr 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 Pos 2069172 17306675 23536891 24916698 41745770 78623626 89123443 93323971 118855587 149892872 172053287 175058872 184023529 212237798 218609702 219743719 25187599 33361425 44768202 56111309 88924622 134151294 219908369 225047744 241911659 13555836 51071713 53093779 56642722 72437413 135974216 142588510 172165727 185548683 1671115 18033488 57823476 73515313 106106353 145568352 184215675 32830521 55001899 88354675 Closest gene PRKCZ MFAP2 HTR1D CLIC4 SCMH1 GIPC2 PKN2 RPL5 SPAG17 SF3B4 DNM3 PAPPA2 TSEN15 DTL TGFB2 LYPLAL1 DNAJC27 LTBP1 C2orf34 EFEMP1 EIF2AK3 NCKAP5 
CCDC108/IHH SERPINE2 SEPT2 HDAC11 DOCK3 RTF1 C3orf63 RYBP PCCB ZBTB38 GHSR IGF2BP2 SLBP/FGFR3 LCORL POLR2B ADAMTS3 TET2 HHIP WWC2 NPR3 SLC38A9 MEF2C Effect allele t t t c g t t c c g g a c g g c a g t c c c t t g c c c a a g g t a a t t t a c g t t t Freq 0.28 0.50 0.59 0.74 0.23 0.11 0.62 0.63 0.76 0.42 0.27 0.57 0.36 0.46 0.29 0.54 0.53 0.72 0.27 0.77 0.33 0.20 0.90 0.55 0.76 0.11 0.06 0.46 0.54 0.79 0.74 0.43 0.31 0.39 0.36 0.85 0.19 0.48 0.49 0.84 0.75 0.39 0.30 0.56 Effect Size 0.02 0.03 0.02 0.02 0.03 0.04 0.02 0.02 0.04 0.05 0.04 0.03 0.05 0.02 0.03 0.03 0.03 0.02 0.03 0.05 0.03 0.03 0.05 0.02 0.03 0.04 0.05 0.03 0.02 0.03 0.03 0.07 0.04 0.03 0.03 0.07 0.03 0.02 0.02 0.07 0.02 0.04 0.03 0.03 Observed OR 1.19 1.30 1.15 1.10 1.27 1.56 0.97 1.06 1.19 1.56 1.07 1.17 1.32 1.12 1.15 1.19 1.35 1.36 1.34 1.49 1.11 1.08 1.05 1.21 1.00 1.28 1.32 1.14 1.16 1.15 1.21 1.55 1.21 1.21 1.13 1.48 1.01 1.31 1.17 1.45 1.08 1.26 1.10 1.29 P-value 0.0932 0.0045 0.1364 0.3495 0.0355 0.0008 0.7864 0.5108 0.1106 0.0000 0.5121 0.0992 0.0050 0.2176 0.1579 0.0702 0.0017 0.0033 0.0040 0.0002 0.2966 0.4521 0.7572 0.0364 0.9930 0.1184 0.1658 0.1695 0.1039 0.2250 0.0929 0.0000 0.0543 0.0528 0.2082 0.0111 0.9645 0.0036 0.0885 0.0043 0.4925 0.0176 0.3480 0.0086 Expected OR 1.14 1.19 1.12 1.14 1.19 1.24 1.12 1.12 1.22 1.27 1.21 1.14 1.28 1.12 1.19 1.15 1.19 1.14 1.16 1.30 1.15 1.18 1.29 1.14 1.16 1.23 1.34 1.15 1.12 1.18 1.16 1.43 1.21 1.18 1.14 1.46 1.18 1.12 1.12 1.44 1.14 1.21 1.16 1.15 52 Table 2.3 (Continued) Rsid rs274546 rs526896 rs4282339 rs12153391 rs889014 rs422421 rs6879260 rs3812163 rs806794 rs3129109 rs2256183 rs2780226 rs9472414 rs310405 rs7759938 rs1046943 rs961764 rs1490384 rs6569648 rs7763064 rs543650 rs9456307 rs798489 rs1708299 rs6959212 rs42235 rs822552 rs6473015 rs6470764 rs12680655 rs7864648 rs11144688 rs7853377 rs2778031 rs1257763 rs473902 rs7027110 rs751543 rs7466269 rs7849585 rs7909670 rs2145998 rs11599750 rs2237886 rs7926971 Chr 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 
6 6 6 7 7 7 7 7 8 8 8 9 9 9 9 9 9 9 9 9 9 10 10 10 11 11 Pos 131727766 134384604 168256240 171203438 172984114 176517326 179731014 7670759 26200677 29084232 31380529 34199092 44946506 81800362 105378954 109783941 117522156 126851160 130349119 142797289 152110943 158929442 2801803 28189946 38128326 92248076 148650634 78178485 130725665 135637337 16358732 78542286 86552205 90835726 96893945 98256235 108638867 119122342 133464084 138251691 12918764 81121696 101805442 2810731 12698040 Closest gene SLC22A5 PITX1 SLIT3 FBXW11 BOD1 FGFR4/NSD1 GFPT2 BMP6 Histone OR2J3 MICA HMGA1 SUPT3H/RUNX2 FAM46A LIN28B ZBTB24 VGLL2 C6orf173 L3MBTL3 GPR126 ESR1 TULP4 GNA12 JAZF1 STARD3NL CDK6 PDIA4 PEX2 GSDMC ZFAT BNC2 PCSK5 C9orf64 SPIN1 PTPDC1 PTCH1/FANCC ZNF462 PAPPA FUBP3 QSOX2 CCDC3 PPIF CPN1 KCNQ1 TEAD1 Effect allele g t g c c c c a a c g c g a c c c t c g g g c a c t g c c a t g g t a t a t a c c g c t g Freq 0.61 0.73 0.80 0.75 0.64 0.78 0.61 0.47 0.71 0.60 0.45 0.08 0.79 0.53 0.32 0.58 0.59 0.50 0.24 0.71 0.60 0.94 0.71 0.31 0.68 0.31 0.25 0.29 0.79 0.60 0.32 0.89 0.23 0.24 0.04 0.92 0.23 0.71 0.64 0.33 0.57 0.52 0.61 0.11 0.46 Effect Size 0.03 0.03 0.04 0.03 0.03 0.03 0.03 0.04 0.05 0.03 0.03 0.08 0.03 0.03 0.04 0.02 0.02 0.04 0.04 0.04 0.03 0.05 0.05 0.04 0.02 0.05 0.03 0.03 0.05 0.03 0.02 0.05 0.03 0.03 0.07 0.07 0.03 0.03 0.04 0.03 0.02 0.03 0.02 0.04 0.02 Observed OR 1.39 1.18 1.01 1.25 1.16 1.15 1.29 1.22 1.08 0.89 1.35 1.32 1.26 1.20 1.17 0.93 1.26 1.15 1.26 1.50 1.28 1.01 1.29 1.22 1.12 1.28 1.21 1.29 1.01 1.18 1.04 1.05 1.30 1.29 1.27 1.04 0.87 1.17 1.24 0.99 1.07 1.16 1.14 1.28 0.99 P-value 0.0005 0.1189 0.9540 0.0350 0.1339 0.2751 0.0067 0.0320 0.4548 0.2318 0.0016 0.1486 0.0438 0.0566 0.1087 0.4634 0.0156 0.1290 0.0383 0.0002 0.0075 0.9579 0.0106 0.0498 0.2541 0.0188 0.0714 0.0147 0.9297 0.0854 0.6868 0.7566 0.0289 0.0167 0.2484 0.8135 0.2405 0.1351 0.0246 0.9361 0.4500 0.1120 0.1950 0.1129 0.9516 Expected OR 1.16 1.18 1.21 1.19 1.17 1.19 1.16 1.21 1.33 1.15 1.20 
1.52 1.18 1.17 1.25 1.12 1.12 1.22 1.21 1.27 1.18 1.30 1.32 1.25 1.13 1.34 1.17 1.19 1.28 1.17 1.14 1.34 1.14 1.15 1.44 1.48 1.20 1.17 1.21 1.18 1.12 1.14 1.13 1.26 1.14 53 Table 2.3 (Continued) Rsid rs1330 rs1814175 rs3782089 rs7112925 rs634552 rs494459 rs654723 rs2856321 rs10770705 rs2638953 rs2066807 rs1351394 rs11107116 rs11830103 rs1809889 rs7332115 rs3118905 rs7319045 rs1950500 rs1570106 rs7155279 rs16964211 rs7178424 rs12902421 rs5742915 rs11259936 rs16942341 rs4965598 rs1659127 rs4640244 rs3110496 rs3764419 rs17780086 rs1043515 rs4986172 rs2072153 rs227724 rs2079795 rs11867479 rs9967417 rs17782313 rs12982744 rs891088 rs4072910 rs2279008 Chr 11 11 11 11 11 11 11 12 12 12 12 12 12 12 12 13 13 13 14 14 14 15 15 15 15 15 15 15 16 17 17 17 17 17 17 17 17 17 17 18 18 19 19 19 19 Pos 17316029 49559172 65336819 66826160 75282052 118574675 128586155 11855773 20857467 28534415 56740682 66351826 93978504 122389499 124801226 33147548 51105334 92024574 24830850 68813115 91555634 49317787 62380259 72161403 74336633 84580582 89388905 98577137 14388305 21284223 24941897 29164023 30343282 36922196 43216281 47390014 52133816 59496649 68090207 46959500 57851097 2177193 7184762 8644031 17283303 Closest gene NUCB2 FOLH1 SSSCA1 RHOD SERPINH1 TREH FLI1 ETV6 SLCO1C1 CCDC91 STAT2 HMGA2 SOCS2 SBNO1 FAM101A PDS5B/BRCA2 DLEU7 GPC5 NFATC4 RAD51L1 TRIP11 CYP19A1 C2CD4A MYO9A PML ADAMTSL3 ACAN ADAMTS17 MKL2 KCNJ12 ANKRD13B ATAD5/RNF135 LRRC37B PIP4K2B ACBD4 ZNF652 NOG TBX2 KCNJ16/KCNJ2 DYM MC4R DOT1L INSR ADAMTS10 MYO9B Effect allele t t c c a t a t a c g t t g t g g g t c t g c a c c c c a a g c a g c g t t t a c g g t t Freq 0.35 0.34 0.94 0.64 0.14 0.40 0.61 0.36 0.33 0.68 0.08 0.49 0.22 0.22 0.29 0.38 0.71 0.39 0.30 0.79 0.62 0.95 0.54 0.03 0.47 0.52 0.97 0.32 0.34 0.61 0.67 0.61 0.15 0.54 0.65 0.31 0.32 0.33 0.35 0.42 0.24 0.41 0.26 0.56 0.75 Effect Size 0.02 0.02 0.06 0.02 0.04 0.02 0.02 0.03 0.03 0.04 0.05 0.05 0.05 0.04 0.03 0.03 0.05 0.03 0.03 0.03 0.03 0.05 0.02 0.06 0.03 0.04 
0.13 0.04 0.02 0.03 0.02 0.04 0.03 0.02 0.03 0.03 0.03 0.04 0.02 0.04 0.02 0.03 0.03 0.03 0.03 Observed OR 1.13 1.21 1.01 1.14 1.26 1.04 1.17 1.15 1.19 1.28 1.49 1.36 1.34 0.95 1.28 1.03 1.27 1.24 1.33 1.15 1.01 1.35 1.19 0.94 0.94 1.49 1.43 1.19 1.07 1.09 0.89 1.14 1.06 1.28 1.00 0.98 1.10 1.34 1.14 1.12 1.15 1.32 0.93 1.12 1.20 P-value 0.2302 0.0438 0.9532 0.1905 0.0734 0.6646 0.1041 0.1499 0.0721 0.0133 0.0334 0.0009 0.0106 0.6298 0.0150 0.7892 0.0230 0.0188 0.0075 0.2214 0.9025 0.1532 0.0699 0.8741 0.4832 0.0000 0.2533 0.0952 0.5156 0.3451 0.2488 0.1612 0.6451 0.0067 0.9611 0.8110 0.2980 0.0032 0.1713 0.2481 0.2121 0.0043 0.5104 0.2580 0.0657 Expected OR 1.14 1.13 1.36 1.13 1.24 1.12 1.13 1.17 1.18 1.21 1.32 1.33 1.32 1.21 1.18 1.14 1.32 1.16 1.19 1.14 1.15 1.31 1.13 1.34 1.18 1.25 2.04 1.21 1.14 1.16 1.13 1.22 1.20 1.12 1.16 1.15 1.16 1.23 1.14 1.22 1.14 1.19 1.14 1.18 1.18

Table 2.3 (Continued)
Rsid       Chr  Pos       Closest gene  Effect allele  Freq  Effect Size  Observed OR  P-value  Expected OR
rs1741344  20   4101800   SMOX          c              0.37  0.03         1.07         0.4895   1.15
rs2145272  20   6626218   BMP2          g              0.35  0.04         1.31         0.0099   1.23
rs7274811  20   32333181  ZNF341        g              0.77  0.04         1.40         0.0018   1.24
rs143384   20   33489170  GDF5          g              0.42  0.06         1.52         0.0000   1.41
rs237743   20   47903019  ZNFX1         t              0.21  0.03         1.20         0.0900   1.20
rs2834442  21   35690786  KCNE2         a              0.62  0.02         1.20         0.0613   1.14
rs4821083  22   31386341  SYN3          t              0.83  0.03         1.13         0.3619   1.19

The table shows the results for the SNPs used in the meta-analysis of the HUNT and FINRISK cohorts.

specific allele frequencies (see Materials and Methods). Overall, the number of SNPs with observed odds ratios greater than expected odds ratios was no different than expected under the model of equal effect sizes in the extremes and the general population (HUNT 79/160 SNPs, p=0.94; FINRISK 75/155 SNPs, p=0.48; combined 75/141, p=0.45) (Table 2.1; Table 2.2; Table 2.3).
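The expected odds ratios are derived as described in Materials and Methods. Purely as an illustration of how a per-allele effect size in SD units translates into an odds ratio between the two tails, the sketch below uses a back-of-the-envelope approximation that is an assumption of this example, not the study's procedure. For a small additive effect β, the log-odds of carrying the effect allele shift by about β times the mean Z-score of the selected tail, in opposite directions in the tall and short tails, so log(OR) ≈ 2·β·(mean tail Z). The approximation ignores allele frequency and study-specific details, and the function name `approx_expected_or` is hypothetical.

```python
from math import exp
from statistics import NormalDist


def approx_expected_or(beta_sd, tail_fraction):
    """Approximate tall-versus-short odds ratio for one effect allele.

    Assumptions (illustrative only): standard-normal phenotype, small additive
    effect `beta_sd` per allele in SD units, and symmetric tails that each
    contain `tail_fraction` of the population.
    """
    nd = NormalDist()
    cutoff = nd.inv_cdf(1.0 - tail_fraction)
    # Mean Z-score of individuals above the cutoff (truncated normal mean).
    mean_tail_z = nd.pdf(cutoff) / tail_fraction
    return exp(2.0 * beta_sd * mean_tail_z)


if __name__ == "__main__":
    for beta in (0.0240, 0.0670):
        print(beta, round(approx_expected_or(beta, 0.01), 2))
```

For the 1% tails, effect sizes of 0.0240 and 0.0670 SD give values close to the tabulated expected odds ratios of 1.14 and 1.43, which suggests that the tabulated values scale mainly with effect size and the depth of the tails.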
Next, for each SNP we tested for a difference between the expected and observed odds ratios in the individual studies and in the meta-analysis. Overall, there were no more or fewer significant associations than would be expected under the equal effect size model (Figure 2.1). This result demonstrates that the individual SNPs have similar effects at the extremes as in the general population.

Figure 2.1: QQ Plot of p-values for individual SNPs based on the meta-analysis of HUNT and FINRISK. The figure shows a Q-Q plot of the p-values of the difference between the observed odds ratios and the expected odds ratios.

Weighted Allele Score (WAS) analysis: The additive effect of the common variants differs significantly from expected in the short extremes

After determining that the individual SNPs have similar effects at the extremes of the height distribution as in the general population, we then performed additional analyses on the combined set of height-associated variants. We asked whether extremely short and extremely tall individuals show overall enrichment of height-decreasing and height-increasing alleles, respectively, to the extent expected under a purely polygenic additive model. If the enrichment is less than expected, this result would suggest that the common variants are not explaining as much of the phenotypic variation in the extremes as in the general population. To test this possibility, we first calculated the weighted allele score (WAS) for each individual using the height-associated SNPs previously described. The WAS is the cumulative effect of all of the SNPs on height weighted by each SNP's estimated effect size (β). In Figure 2.2, we show a plot of each individual's WAS based on the 143 loci genotyped in both cohorts versus the individual height Z-scores. As expected, the WAS are significantly different between the tall extremes and the short extremes (p<3x10^-86), with individuals in the tall extreme having higher WAS on average than individuals in the short extremes.

Figure 2.2: Plot of weighted allele scores (WAS) against Height Z-scores for HUNT and FINRISK Cohorts. The plot shows the WAS, a measure of the genetic prediction of height by known common variants, against the height Z-scores. The tall individuals (Z-score > 2.14) have generally larger WAS than the short individuals (Z-score < -2.14). Individuals from the HUNT study are labeled blue and individuals from the FINRISK study are labeled red.

We then tested whether the WAS in the short and tall groups are within expectations based on the population-specific allele frequencies and previously estimated effect sizes of these SNPs, assuming a purely polygenic model. To generate the distribution of WAS under these expectations, we simulated populations that mimicked our ascertainment of extreme samples from the HUNT and FINRISK populations (see Materials and Methods). For each cohort, we compared the observed mean WAS with the distribution of mean WAS under the simulated model (Figure 2.3 and Figure 2.4). In the HUNT study, the sample of 1224 individuals from the middle of the distribution suggests that our modeling is behaving as expected (Figure 2.3). Finally, we analyzed the data by combining both studies using the 143 SNPs present in both data sets (Figure 2.5). In each study separately and in the combined analysis, the mean observed WAS for the tall individuals was within expectation, but we observed a significant upward deviation of the mean observed WAS in the short extremes (p=0.006 for the combined analysis). These results suggest that the collective effect of the common variants in the short extremes does not account for as much of the phenotypic variation in height as predicted from the effects seen in the general population.
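The WAS and the tail comparison can be illustrated with a small simulation. The following is a minimal sketch only, under assumptions that are not taken from Materials and Methods: SNPs are independent and in Hardy-Weinberg proportions, the residual is normal, and the total phenotypic variance is fixed at 1. The function names `weighted_allele_score` and `simulate_tail_was`, the population size, and the example allele frequencies and effect sizes are all hypothetical.

```python
import random
import statistics


def weighted_allele_score(dosages, betas):
    """WAS for one individual: sum of effect-allele counts (0, 1 or 2) times beta."""
    return sum(b * g for g, b in zip(dosages, betas))


def simulate_tail_was(freqs, betas, n_people=20000, tail=0.015, seed=1):
    """Mean WAS in the short and tall tails of one simulated population.

    Assumptions (illustrative only): purely additive polygenic model,
    independent SNPs, normal residual, phenotypic variance of 1.
    """
    rng = random.Random(seed)
    # Variance explained by the modeled SNPs: sum of 2p(1-p) * beta^2.
    var_g = sum(2 * p * (1 - p) * b * b for p, b in zip(freqs, betas))
    noise_sd = (1.0 - var_g) ** 0.5
    people = []
    for _ in range(n_people):
        # Draw two alleles per SNP; True counts as 1 when summed.
        dosages = [(rng.random() < p) + (rng.random() < p) for p in freqs]
        was = weighted_allele_score(dosages, betas)
        height = was + rng.gauss(0.0, noise_sd)
        people.append((height, was))
    people.sort()  # ascending height, so the shortest individuals come first
    k = int(n_people * tail)
    short_mean = statistics.fmean([w for _, w in people[:k]])
    tall_mean = statistics.fmean([w for _, w in people[-k:]])
    return short_mean, tall_mean


if __name__ == "__main__":
    short_mean, tall_mean = simulate_tail_was([0.3] * 50, [0.03] * 50)
    print(f"mean WAS, short tail: {short_mean:.3f}; tall tail: {tall_mean:.3f}")
```

Repeating such a simulation many times with different seeds yields a distribution of mean WAS for each tail against which an observed mean can be compared. A shortfall of height-decreasing alleles in the short tail would then appear as an observed mean WAS above the simulated distribution.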
Figure 2.3: Comparison of the observed versus simulated mean weighted allele score (WAS) in the HUNT study. The plot shows the result of comparing the mean WAS of the short and tall individuals observed in the HUNT cohort against that obtained from simulation. Each row represents a different stratification of the extremes identical to those defined in Figure 2.5. The plot also shows the mean WAS of 1224 non-extreme individuals taken from the middle of the height distribution. There is no difference between the mean WAS of the non-extreme individuals and that obtained from simulation (p=0.56).

Figure 2.4: Comparison of the observed versus simulated mean weighted allele score (WAS) in the FINRISK study. The plot shows the result of comparing the mean WAS of the short and tall individuals observed in the FINRISK cohort against that obtained from simulation. Each row represents a different stratification of the extremes identical to those defined in Figure 2.5.

Figure 2.5: Comparison of the observed versus simulated mean weighted allele score (WAS) in the combined cohort. The plot shows the result of comparing the mean WAS of the short and tall individuals observed from both the HUNT and FINRISK cohorts against that obtained from simulation. Each row represents a different stratification of the extremes. The percentiles and numbers of individuals in the short and tall extreme respectively are listed for each stratum. The p-values represent the comparison between the observed and simulated mean WAS. The observed mean WAS for the tall individuals was not different from the simulation in any of the strata. The observed mean WAS for the short individuals was not different from the simulation in the first stratum. As a progressively more extreme sample is used, the short individuals' mean WAS becomes progressively more significantly different from the simulation.

The reduced effect of common variants is limited to the most extreme short individuals

Having established that the common variants do not explain as much phenotypic variation in the short extremes, we then sought to determine if this finding was accentuated in individuals with the most extreme short stature. We stratified our analysis in several ways (Figure 2.5; Figure 2.3; Figure 2.4). First, we removed the most extreme individuals: those below the 0.25 percentile and above the 99.75 percentile. In the combined cohorts, the mean observed WAS in the short extremes was no longer significantly different from that expected (p=0.526), indicating that the shift in WAS is driven by the most extremely short individuals. To further explore this hypothesis, we then selected more extreme individuals at two thresholds, including only the top and bottom 0.5% or 0.25% of the population (see Materials and Methods). For both strata, there was a more pronounced deviation of the mean observed WAS in the short extremes (p=7.12 x 10^-6 and p=9.88 x 10^-7 for the 0.5% and 0.25% extremes respectively), but again no deviation in the tall extremes. Similar observations occurred when we analyzed the cohorts separately using the same stratification procedure (Figure 2.3; Figure 2.4). We repeated the analysis using Z-scores based on inverse normal transformation, and with the three -6 SD outliers removed, and the results were essentially unchanged. The difference observed in the WAS analysis is also supported by the individual SNP analysis: when we performed the combined analysis described above for the 0.25% extremes rather than the entire cohort, 60% (84/139) of the SNPs had an observed effect size smaller than expected (p=0.02) (data not shown). These analyses suggest that the initial marginally significant shift of the mean observed WAS in the short extremes is primarily driven by the most extreme short individuals.
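Counts such as 84 of 139 SNPs with a smaller-than-expected effect can be assessed with a sign test. The text does not state which test produced the reported p-values, so the sketch below is an assumption: an exact two-sided binomial test against a null probability of 0.5, written with the standard library only. The function name `two_sided_sign_test` is hypothetical.

```python
from math import comb


def two_sided_sign_test(k, n):
    """Exact two-sided binomial p-value for k of n outcomes in one direction.

    Assumption: under the null, each SNP is equally likely to fall on either
    side of its expected value (probability 0.5).
    """
    smaller = min(k, n - k)
    # Probability of a split at least this lopsided in one direction.
    one_tail = sum(comb(n, i) for i in range(smaller + 1)) / 2 ** n
    return min(1.0, 2.0 * one_tail)


if __name__ == "__main__":
    print(f"84/139: p = {two_sided_sign_test(84, 139):.3f}")
```

Under this assumption, 84 of 139 gives a p-value near 0.02, in line with the value reported above. Values for other counts in this chapter need not match exactly if a different null expectation was used.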
Therefore, in general, as one selects individuals with more extreme short stature, in particular those with heights below the 0.25 percentile, the common variants play a much smaller role in explaining stature, indicating that there must be other factors contributing to the phenotypic variation in these extremely short individuals. 63 Low frequency or rare variants with larger effect sizes could explain the phenotypic variation in the short extremes We hypothesized that lower frequency and rare genetic variants with larger effect sizes than the common variants may explain the phenotypic variation in the short extremes. To test this hypothesis, we performed population simulations with rare-variants of various allele frequencies and effect sizes, and asked if our observed data were consistent with these simulated scenarios (Figure 2.6). As a negative control, we first modeled an additional 180 SNPs, each with allele frequency of 0.3 and average effect sizes of -0.05 SD, which is similar to the allele frequency and effect size for previously discovered common variants associated with height. In this simulation, the mean WAS distribution did not change, indicating that adding additional common variants of similar effect sizes cannot explain the phenotypic variation in the short extremes. We then modeled a single rare variant of very large effect: frequency 0.005 and effect size of -4 SD. In this model, the mean WAS distribution in the extremely short individuals shifts more than we observed in our population. This simulation essentially excludes the possibility of a 0.5% variant of very large effect within our cohort. Such a variant would also be likely to be discovered in linkage studies of several thousand sib-pairs [6]. However, there are several rare variant models that would likely not have been detected in previous linkage analyses of height and generate a shift in the mean WAS consistent with our observed data (Figure 2.6). 
One such possibility is a single low frequency variant (allele frequency = 0.005) with an effect size of -2 SD; another model consistent with our data includes 10 variants each with an allele frequency of 0.005 and a moderate effect size of -1 SD. These simulations suggest that individuals with very short stature may harbor small numbers of low frequency variants of moderately large effect or a greater number of low frequency variants of moderate effects contributing to their short stature. This result stands in contrast to the remainder of the height distribution in which a polygenic effect of common and rare variants with small effects could explain the majority of the heritability of height, even though only a small percentage of height-associated common variants have been identified.

Figure 2.6: Comparison of the observed versus simulated mean WAS with models incorporating additional variants. The plot shows the result of comparing the mean WAS of the short and tall individuals observed from both the HUNT and FINRISK cohorts against that obtained from simulation with different scenarios of additional variants. All rows use the approximate 1.5% tails of the height distribution as extremes, resulting in 566 short and 648 tall individuals. The 1st row shows the result where the model has no additional variants affecting height and thus is identical to that from the 2nd row of Figure 2.5. The 2nd row shows a model where there are 180 additional common variants that slightly decrease height (allele frequency = 0.3 and effect size (β) = -0.05). This model does not result in any significant change to the simulated WAS of the short individuals and the observed WAS is still significantly different (p=0.00756). The 3rd row shows a model where there is 1 additional low frequency variant with a large height-decreasing effect (allele frequency = 0.005 and effect size (β) = -4). This model results in a large shift in the simulated WAS of the short individuals to the right. The observed WAS is still significantly different (p=4.54 x 10^-8) from the simulation but in the opposite direction, and thus this model is not consistent with our data. The 4th row shows a model where there is 1 additional low frequency variant that decreases height substantially (allele frequency = 0.005 and effect size (β) = -2). This model results in a shift in the simulated WAS of the short individuals to the right such that the observed WAS is no longer different from the simulation (p=0.544). The 5th row shows a model where there are 10 additional low frequency variants that moderately decrease height (allele frequency = 0.005 and effect size (β) = -1). This model also results in a shift in the simulated WAS of the short individuals to the right such that the observed WAS is no longer different from the simulation (p=0.39). The final two models are consistent with our observed data.

Sibling analysis provides support for a different genetic architecture in extreme short individuals

To provide further support for a different genetic architecture in individuals in the extreme short tails, we performed an analysis in siblings from the HUNT study. We queried the entire HUNT database (N=106,455) and identified 21,365 sibling pairs. The correlation of age and gender adjusted height between siblings was high (r = 0.466). We then identified 98 individuals (aged between 20-70yrs) with a Z-score < -2.81 (~0.25% tails) and 80 with a Z-score > 2.81 who also had at least one sibling in the database (the results are similar if we use inverse normal transformation). The average height Z-score for the siblings of the extreme short group was -0.97 (95% CI: -0.80, -1.15); the average Z-score for the full siblings of the extreme tall group was 1.29 (95% CI: 1.14, 1.45). These averages are significantly different (t-test, p=0.007 after reversing signs for the short group).
We then performed this same analysis for the 0.25% to 1.5% tails individuals and there was no significant difference in Z-scores of siblings between the short (-1.05; 95% CI: -1.13, -0.97) and tall (1.11; 95% CI: 1.03, 1.18) groups (t-test, p=0.28). The differential regression to the mean therefore appears to be limited to the shortest ~0.25% of individuals, with the siblings of this group regressing further toward the mean than those of the tall extreme group. This is consistent with the results we observe with the weighted allele score (WAS) approach. We do not have the twin data that would allow us to separate out the environmental and genetic effects in this group, and our data are consistent with both. If the effect were due to genetics, then a model with de novo mutations and/or multiple recessive rare variants could cause an increased regression to the mean in extremely short individuals, although there are other plausible explanations.

DISCUSSION

We have assessed whether common variants robustly associated with height in the general population also associate with height at the extreme tails of the height distribution. We further tested whether this association is to the extent expected under a purely polygenic model. By genotyping ~160 height SNPs identified from the GIANT study [2] (that explain ~10% of the population variation in height) in individuals from the ~1% tails of height from two large population based cohorts, we have shown that the polygenic model can explain the associations in the ~1% tails of height. However, our data indicate that the polygenic model starts to break down in extreme short individuals near the 0.25 percentile cut-off. This conclusion is supported by our sibling analysis, which demonstrated that siblings in the 0.25% short tail regress to the mean more than those in the 0.25% tall group.
Interestingly, the overall height distribution also shows a slight asymmetric deviation from normality, with an excess of individuals with extremely short stature but not of individuals with extremely tall stature. While in general the individuals in the ~1% tails carry as many height-increasing alleles as would be predicted based on their height, there was a clear deviation for individuals in the shortest 0.25% tail. On average, these individuals carry significantly more "tall" alleles at the 160 SNPs than would be expected if common alleles were explaining their short stature. This suggests that the heights of these individuals are explained by factors other than common variants. Our simulations suggest that rare variants could explain this difference in the 0.25% shortest tail. For example, 10 rare variants with modest effects on height (1 SD) are consistent with our observed data, as is a single variant with a 2 SD effect. The sibling analysis also suggests a role for de novo or multiple recessive variants in the extreme short individuals. While rare height-decreasing variants of large effect are a plausible explanation, there are many other genetic models consistent with our data, including a mixture of height-decreasing with a smaller number of height-increasing rare variants, or variants having non-additive effects. While non-additive genetic effects could explain the data, no evidence was found for dominance or gene-gene interaction effects for the SNPs used in this study in the original GIANT publication [2]. It is also possible that these individuals are short for non-genetic reasons. One could suggest that these individuals are short because of differences in ancestry, but we have taken steps to remove any possible ethnic outliers from our extremes (see Materials and Methods).
Measurement or recording error is another possibility, although the fact that the tall group does not show this effect (and is presumably as likely as the short group to contain measurement error) suggests this is an unlikely explanation. Other non-genetic factors are also possible: for example, poor early-life nutrition, severe infection, or other chronic childhood diseases could have prevented these individuals from reaching their genetic height potential. This result also suggests that these families would be good candidates to investigate in sequencing studies, as they may be enriched for rare or de novo, higher penetrance alleles. More generally, the weighted allele score (WAS) method developed here could be used to select individuals to sequence in the search for these types of rarer variants, not only for height but also for other polygenic traits and diseases. Specifically, individuals in the extreme tails of a trait distribution who have an unexpectedly high or low weighted allele score may be particularly useful to sequence, especially if multiple relatives with these characteristics are present in the extreme tails. Our study also demonstrates empirically that selecting individuals from the extreme tails of a complex trait distribution is an efficient approach for genetic studies, as was proposed both for linkage studies [7,8] and association studies [9,10]. Despite a quite modest sample size (N<1000), we replicated a large fraction of the individual SNPs identified in the GIANT study in our extreme height analysis. Ninety-one percent of the SNPs had odds ratios that were directionally consistent with the direction in the published GIANT study (p<0.0001), and 35% (49/141) of SNPs had p<0.05 in the consistent direction.
Our analyses also demonstrate that, outside of the 0.25% tails, this level of association is entirely consistent with that expected given the extreme tail ascertainment of our samples and the individual SNP continuous distribution effect sizes. Given this result, the ascertainment of our 923 samples from the ~1% to 0.25% tails provides equivalent power to approximately 6000 samples randomly selected from the general population for a variant explaining approximately 0.1% of the variation in height. Indeed, the ability to detect associations in samples ascertained for extreme phenotypes has been recently demonstrated in studies of bone mineral density [11], body mass index [12], triglyceride levels [13], and type 2 diabetes (using a liability threshold model [14]). Also, our results suggest that the statistical power of detecting these small effect variants would be reduced if we were to include the most extreme tails of the phenotypic distribution (in our case, the shortest 0.25% of individuals), consistent with predictions made based on simulation studies of mixtures of common and rare variants [15]. Nonetheless, our findings suggest that the use of individuals with the most extreme phenotypes could be particularly valuable to detect rarer variants with larger effect sizes more efficiently.

In conclusion, we have shown that common genetic variants associated with height in the general population are also associated with height at the ~1% tails of the height distribution. Our data suggest that common variants play less of a role, and that the effects of rarer larger-effect alleles and/or strong environmental factors start to predominate, around the 0.25% extreme. This finding may also have broader implications for studies of disease, in that the polygenic model may apply well to those diseases that represent the tails of an underlying normal distribution, but perhaps less well to diseases that correspond to more extreme phenotypes.
MATERIALS AND METHODS

Ethics statement

Both studies were conducted according to the principles expressed in the Declaration of Helsinki. Attendance was voluntary, and each participant signed a written informed consent including information on genetic analyses. Local institutional review boards approved study protocols.

Subjects

The HUNT study

The Nord-Trøndelag Health Study (HUNT) is a comprehensive population based health study (www.ntnu.edu/hunt) with personal and family medical histories on approximately 120,000 people from Nord-Trøndelag County, Norway, collected during three intensive studies (HUNT 1, 2, and 3). All citizens aged 20 and over were invited, and information was collected from self-reported questionnaires consisting of >200 health-related questions, standardized clinical examinations, and urine and non-fasting venous blood samples. The population in Nord-Trøndelag County is ethnically homogeneous, with <3% of non-Caucasian ethnicity, making it especially suitable for epidemiological genetic research. Height was measured by trained personnel to the nearest 1.0 cm with the participants wearing light clothes without shoes according to standardized methods [16]. For this study we sourced data from HUNT 2 (1995-97) in which 65,258 individuals participated (71.2% of invited). We generated age and gender standardized height for the whole population, and selected the shortest 1000 individuals and the tallest 1000 individuals from the 54,909 participants aged between 18 and 70yrs. We removed known 1st degree relatives based on information from the Medical Birth Registry of Norway, those reporting to be living outside of Norway their first year of life, and those with low DNA concentrations. We then genotyped the remaining shortest 471 individuals (<-2.14 SDs) and the tallest 479 individuals (>2.14 SDs) from the cohort. We also genotyped 1,458 individuals of all ages with a Z-score between +/- 2 SDs as our middle group.
The FINRISK Study

FINRISK is a Finnish national survey on risk factors of chronic and non-communicable diseases. It has been carried out every five years since 1972 using independent, random and representative population samples from different parts of Finland [17]. For this study, we selected individuals from 4 different sub-populations divided by geography (East vs. West Finland) and gender (Table 2.4). Individuals aged 25 to 74 years were included. We then took approximately the 50 tallest and 50 shortest individuals from each sub-population (the extremes; Table 2.4) and performed genotyping.

Table 2.4: The FINRISK cohort divided into 4 sub-populations

Cohort                         No. of individuals   No. of short extremes   No. of tall extremes
men/west                       4271                 53                      51
men/east                       6582                 52                      52
women/west                     5025                 52                      52
women/east                     7610                 52                      52
Total                          23488                209                     207
Total successfully genotyped                        186                     192
Total with genotypes used                           181                     192

The table shows the number of individuals used for each of the FINRISK sub-populations. The FINRISK cohort is sub-divided between male and female as well as individuals from east and west Finland.

Genotyping and Quality Control

HUNT study

Blood sampling was done whenever subjects attended HUNT 2. DNA was extracted from peripheral blood leukocytes from whole blood or blood clots stored in the HUNT Biobank, using the Puregene kit (Gentra Systems, Minneapolis, MN) manually or with an Autopure LS (Gentra Systems). Laboratory technicians were blinded to the results of the height measurements. Details on the DNA extraction and the HUNT Biobank are described elsewhere [16]. Genotyping of short and tall individuals was done at the Norwegian University of Science and Technology, Norway using the iSelect Metabochip (Illumina, San Diego, CA) and the Infinium HD ultra protocol. Each 96-well plate included both tall and short individuals and one sample of identical reference DNA.
Genotype calling was done using GenTrain version 2.0 in GenomeStudio V2010.3 (Illumina, San Diego, CA). Genotyping of the middle group was done on the Metabochip at the Center for Inherited Disease Research (CIDR, MD) and called with BeadStudio 3.3.7 with GenTrain version 1.0 (Illumina, San Diego, CA). Samples that did not meet a 99% completion threshold were excluded from further analysis (N=19; 0.7%). Additional post-genotyping exclusions based on gender discrepancy (N=11) and first-degree relatedness (pi-hat >0.2; N=152, 6.3%) were done using PLINK [18]. Ethnic outliers (N=174, 7.2%) were excluded using the EIGENSTRAT software package [19]. After quality assessment 2,063 individuals (85.7%) remained for further analysis: 385 (81.2%) short, 456 (95.2%) tall and 1,224 (83.9%) individuals in the middle group. Of the 180 GIANT height hits, 106 SNPs were directly typed on the Metabochip. In addition, we used the SNP Annotation and Proxy Search to map 54 of the remaining 74 SNPs to a linkage disequilibrium proxy with HapMap r² > 0.8 [20]. These 160 SNPs (i.e. 106 directly typed and 54 proxies) were used in subsequent analyses. All SNPs showed a genotyping success rate >98% and were in Hardy-Weinberg equilibrium.

FINRISK study

We directly genotyped the samples for the 180 previously identified height SNPs. The genotyping was done at Children's Hospital Boston using Sequenom iPLEX genotyping (Sequenom, Inc, San Diego, CA, USA). In total, 186 short individuals and 192 tall individuals were successfully genotyped for 158 SNPs. All 158 SNPs had a genotyping success rate ≥ 90% and the overall genotyping rate was 97.85%. One of these SNPs (rs1809889) is not part of the 180 GIANT SNPs, but data were available for this SNP from the GIANT meta-analysis so it was included in our analysis. We genotyped an additional 49 ancestry informative markers (AIMs) to identify ethnic outliers [21].
We input genotype data from our subjects as well as the reference HapMap samples (CEU, YRI, CHB+JPT) for the 49 AIMs together with 130 height SNPs into Structure 2.3.3 [22]. We detected 5 ethnic outliers with >10% Asian ancestry, who were excluded from further analysis, leaving a total of 181 short and 192 tall individuals as our FINRISK study group.

Statistical Analysis

Individual SNP analysis

For FINRISK, we calculated the observed odds ratio for each of our 158 SNPs using the Cochran-Mantel-Haenszel test, which is a stratified chi-square test. We stratified the individuals into 4 sub-cohorts based on geography and gender (Table 2.4) and performed the test using PLINK [18]. The observed odds ratio for each SNP was recorded, along with the 95% confidence interval. For HUNT, the single-SNP association analysis was performed using logistic regression in PLINK to obtain the observed odds ratio and 95% confidence interval. For both cohorts, we calculated the expected odds ratio for each SNP by estimating the odds of the height-increasing versus the height-decreasing allele in both the tall extremes (cases) and the short extremes (controls) assuming a standard normal distribution for standardized height, i.e. height ~ Normal(0,1). For a given SNP, we defined the height-increasing effect size as β and the height-increasing allele frequency as p. The mean height for the height-increasing allele would be Mi = β (1 - p) and the mean height for the height-decreasing allele would be Md = - β p. The variance of height for both alleles would be V = 1 – β² p (1 - p). We then calculated the odds of observing the height-increasing allele versus the height-decreasing allele for both the tall extremes (cases) and the short extremes (controls) by taking the ratio of the probabilities of each allele being seen in the cases and the controls respectively.
These are calculated as:

\[ \mathrm{Odds}_{\mathrm{cases}} = \frac{\int_{2.326}^{\infty} N(x \mid M_i, V)\,dx}{\int_{2.326}^{\infty} N(x \mid M_d, V)\,dx} \]

\[ \mathrm{Odds}_{\mathrm{controls}} = \frac{\int_{-\infty}^{-2.326} N(x \mid M_i, V)\,dx}{\int_{-\infty}^{-2.326} N(x \mid M_d, V)\,dx} \]

where N(x|M,V) denotes the density function at x of a Normal distribution with mean M and variance V. We use a cut-off of +/-2.326 to denote the approximate 1% tails. We then calculated the expected odds ratio by taking the ratio of Odds_cases to Odds_controls, i.e.

\[ \mathrm{OR}_{\mathrm{expected}} = \frac{\mathrm{Odds}_{\mathrm{cases}}}{\mathrm{Odds}_{\mathrm{controls}}} \]

To assess whether individual SNPs had odds ratios significantly different from expectation, we generated upper and lower 95% confidence limits for the expected distribution based on the GIANT beta and standard error estimates as above, and used the natural log of these confidence limits to estimate an approximate standard error for the expected odds ratio. We then assessed significance by a Z-test of the difference between the observed odds ratio and the expected odds ratio to obtain the Z-score.

Meta-analysis

The HUNT and FINRISK studies genotyped different sets of SNPs, with only 98 of the SNPs matching exactly across the studies. We therefore also used forty-three of the HUNT SNPs that were r² > 0.8 HapMap proxies for a genotyped FINRISK SNP. We used the inverse variance method to meta-analyze the odds ratios for these 141 SNPs from the two studies. As opposed to the individual studies, where study-specific allele frequencies were used, we used the GIANT allele frequency information to generate the expected odds ratios for the meta-analysis. This did not appreciably affect the results for individual SNP analysis within the individual studies, and the meta-analyzed results were consistent with those in the two individual studies.
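The expected odds ratio defined above, together with the inverse variance pooling used in the meta-analysis, can be sketched in Python as follows. This is a minimal illustration and not the code used in the study (the analyses were run in PLINK and R). The function names are hypothetical. The helpers `se_from_ci` and `z_score` use the conventional log-scale forms for a standard error derived from 95% confidence limits and for a two-sample Z-test; those exact forms are an assumption here.

```python
import math
from statistics import NormalDist

TAIL_CUTOFF = 2.326  # approximate 1% tail of a standard normal, as in the text


def expected_odds_ratio(beta, p, cutoff=TAIL_CUTOFF):
    """Expected allelic odds ratio (tall vs. short extremes) for one SNP.

    beta: effect of the height-increasing allele in SD units
    p:    frequency of the height-increasing allele
    """
    mean_inc = beta * (1.0 - p)             # M_i
    mean_dec = -beta * p                    # M_d
    var = 1.0 - beta ** 2 * p * (1.0 - p)   # V
    inc = NormalDist(mean_inc, math.sqrt(var))
    dec = NormalDist(mean_dec, math.sqrt(var))
    # Upper-tail probabilities give the odds in the tall extremes (cases).
    odds_cases = (1.0 - inc.cdf(cutoff)) / (1.0 - dec.cdf(cutoff))
    # Lower-tail probabilities give the odds in the short extremes (controls).
    odds_controls = inc.cdf(-cutoff) / dec.cdf(-cutoff)
    return odds_cases / odds_controls


def se_from_ci(lower, upper, z=1.959964):
    """ASSUMPTION: approximate SE of ln(OR) from 95% confidence limits."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z)


def z_score(obs_or, obs_se, exp_or, exp_se):
    """ASSUMPTION: Z-test on the log odds ratio scale, independent errors."""
    diff = math.log(obs_or) - math.log(exp_or)
    return diff / math.sqrt(obs_se ** 2 + exp_se ** 2)


def inverse_variance_pool(log_ors, ses):
    """Fixed-effect inverse variance pooling of per-study log odds ratios."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se
```

With β = 0 the two allele-specific distributions coincide and the expected odds ratio is 1; a positive β gives a value above 1, because the height-increasing allele is enriched in the tall tail and depleted in the short tail.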
Modeling the Weighted Allele Score (WAS)

To calculate the Weighted Allele Score (WAS) for each individual, we took the sum of the effective allele dosages of the height SNPs multiplied by their respective estimated effect sizes (βs) using the Stage 1 betas from the GIANT study, as shown in the formula below.

\[ \mathrm{WAS} = \sum_{i=1}^{N} \beta_i \, \mathrm{SNP}_i - \alpha \]

β and SNP are the effect size and effective allele dosage (0, 1 or 2) of the height SNPs and WAS is the weighted allele score. N is the total number of SNPs available to calculate the weighted allele score. α is the mean of the sum such that the expected WAS is 0, as shown by the formula below.

\[ \alpha = \sum_{i=1}^{N} 2 \, \beta_i \, \mathrm{Frequency}_i \]

Frequency is the allele frequency of the effect allele obtained from the Finnish or HUNT estimates. We calculated the statistical difference between the WAS of the short versus the tall individuals by performing a 2-tailed 2-sample t-test to obtain the respective p-value. All the calculations were done using the R statistical software package.

Obtaining Finnish allele frequency estimates

The allele frequency estimate for each SNP was obtained by taking only the Finnish individuals from the GIANT height study and calculating the expected allele frequency. The cohorts used were the FUSION NIDDM Case control study from Finland, the GenMets Case control study from Finland and the FINRISK component of the MIGen cohort. The total number of individuals used for obtaining the estimates was 3618.

Simulating the distribution of WAS under the null model

The null model assumes that the only factors determining height (Z-score) are the cumulative additive effects of the GIANT height SNPs and noise. We modeled the Z-score with the formula below.

\[ Z_{\mathrm{score}} = \mathrm{WAS} + N(0, \sigma^2_{\mathrm{remaining}}) \]

Zscore is the height Z-score and N(0, σ²remaining) is a normally distributed random variable with mean 0 and variance σ²remaining. σ²remaining is calculated such that the variance of Zscore is 1, i.e. σ²remaining is 1 – var(WAS).
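The WAS and null-model definitions above can be illustrated with a short Python sketch. It is not the authors' R code: the function names are hypothetical, and it assumes that SNPs are independent and that each dosage is a Binomial(2, frequency) draw, so that var(WAS) is the sum of 2 β² f (1 − f) over SNPs.

```python
import math
import random
import statistics


def weighted_allele_score(dosages, betas, freqs):
    """WAS = sum(beta_i * dosage_i) - alpha, where alpha = sum(2 * beta_i * f_i)."""
    alpha = sum(2.0 * b * f for b, f in zip(betas, freqs))
    return sum(b * d for b, d in zip(betas, dosages)) - alpha


def was_variance(betas, freqs):
    """ASSUMPTION: independent SNPs with Binomial(2, f) dosages."""
    return sum(b * b * 2.0 * f * (1.0 - f) for b, f in zip(betas, freqs))


def simulate_individual(betas, freqs, rng):
    """Draw one individual under the null model; returns (WAS, Z-score)."""
    # Each dosage is the sum of two independent Bernoulli(f) alleles.
    dosages = [(rng.random() < f) + (rng.random() < f) for f in freqs]
    was = weighted_allele_score(dosages, betas, freqs)
    # Residual noise is scaled so that the Z-score has unit variance.
    sd_remaining = math.sqrt(1.0 - was_variance(betas, freqs))
    return was, was + rng.gauss(0.0, sd_remaining)


if __name__ == "__main__":
    rng = random.Random(1)
    betas, freqs = [0.05, 0.03, 0.04], [0.3, 0.5, 0.2]  # illustrative values only
    z_scores = [simulate_individual(betas, freqs, rng)[1] for _ in range(10000)]
    print(statistics.fmean(z_scores), statistics.pvariance(z_scores))
```

Under these assumptions the simulated Z-scores should have a mean near 0 and a variance near 1, which is the property the σ²remaining term is chosen to guarantee.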
The variance of WAS can be calculated with the formula below.

\[ \mathrm{var}(\mathrm{WAS}) = \sum_{i=1}^{N} 2 \, \beta_i^2 \, \mathrm{Frequency}_i \, (1 - \mathrm{Frequency}_i) \]

In the simulations, each individual's effective allele dosage is obtained by sampling from a set of binomial distributions with N=2 and p being the allele frequency of each SNP. The simulated effective allele dosages can then be used to calculate each individual's WAS. The simulation approach for each cohort was modeled to mirror the methods of subject selection.

Simulating FINRISK

For the FINRISK study, the simulations were performed using the following steps. We first generated the effective allele dosages for each SNP for 200,000 individuals by random sampling. We then randomly sampled 4271, 6582, 5025 and 7610 individuals to represent the 4 sub-populations and obtained their Z-scores using the previously described modeling. For each subgroup, we picked the appropriate number of the most extreme individuals to mimic the actual sample selection. We then pooled the short and tall extremes together and randomly dropped individuals to obtain exactly 181 short extremes and 192 tall extremes. We then randomly dropped SNPs from the simulated individuals to mimic the missing genotype rate in FINRISK and calculated the Weighted Allele Score (WAS) for each simulated individual. This simulation process was repeated 10,000 times. For the stratified analyses of various height cut-offs, we adjusted the number of selected individuals in each stratum by taking the floor of the expected number of individuals in that stratum. In our cohort, the top 0.5% extremes included 21, 32, 25 and 38 individuals from each tail of the 4 sub-populations respectively, and the top 0.25% extremes included 10, 16, 12 and 19 individuals from each tail of the 4 sub-populations. For the top ~1% to 0.25% extremes, we included all our extremes but excluded the top 10, 16, 12 and 19 individuals from each tail of the 4 sub-populations.

Simulating HUNT

The simulations for HUNT were performed as follows.
We generated the effective allele dosages for each SNP for 400,000 individuals by random sampling. We then randomly selected 50,000 individuals and obtained their Z-scores. We then selected all short and tall extremes with a Z-score cut-off of -2.14 and +2.14 respectively. Next, we randomly selected 385 short extremes and 456 tall extremes and calculated the WAS. This process was repeated 10,000 times. As in the FINRISK simulation, the number of individuals varies for each stratified analysis. For the stratified analyses at varying height cut-offs, we defined the top 0.5% extremes as Z-scores below -2.57 or above +2.57, and the top 0.25% extremes as Z-scores below -2.81 or above +2.81. For the top ~1.5% to 0.25% extremes, we used only extremes that had Z-scores between -2.14 and -2.81 for the short extremes and between 2.14 and 2.81 for the tall extremes.

Determining if the mean observed WAS is significantly different from the simulated expectation

We evaluated the significance of the mean observed WAS by determining the p-value of the mean observed WAS from the null distribution of the mean WAS obtained from the simulations. The two-tailed p-value is calculated by evaluating the mean observed WAS against Normal(μsimulation, σ²simulation), where μsimulation is the mean of the mean WAS and σ²simulation is the variance of the mean WAS from the simulations.

Modeling rare variants with moderate to large effect sizes

The rare-variant effect is modeled in the simulation by adding a rare-variant term to the calculation of the height Z-score without changing the definition of WAS, as shown in the equation below.

\[ Z_{\mathrm{score}} = \mathrm{WAS} + \left( \sum_{j=1}^{n} B_j V_j - \alpha_{rv} \right) + N(0, \sigma^2_{\mathrm{remaining}}) \]

where n is the number of independent rare variants, B represents the effect size of the rare variants, and V is the allele dosage of the rare variant. αrv is the mean of the rare-variant score such that the rare variants do not change the expected Z-score, i.e. the expected Z-score is still 0. As with α, αrv can be calculated by the following formula.

\[ \alpha_{rv} = \sum_{j=1}^{n} 2 \, B_j F_j \]

F is the allele frequency of the rare variants. σ²remaining in this case has to be adjusted for the rare variants such that the variance of the Z-score remains at 1, i.e. σ²remaining is 1 – var(WAS) – var(Σ B V). Simulations with rare variants are identical to the prior simulations of FINRISK or HUNT except that the new terms are used for calculating the Z-score.

ACKNOWLEDGEMENTS

We would like to thank Sailaja Vedantam for the calculation of the Finnish allele frequencies, and Minttu Sauramo and Elina Mäkinen for aliquotting the FINRISK DNA samples.

REFERENCES

1. Visscher PM, Macgregor S, Benyamin B, Zhu G, Gordon S, et al. (2007) Genome Partitioning of Genetic Variation for Height from 11,214 Sibling Pairs. Am J Hum Genet 81: 1104–1110. doi:10.1086/522934.
2. Lango Allen H, Estrada K, Lettre G, Berndt SI, Weedon MN, et al. (2010) Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature. Available: http://www.ncbi.nlm.nih.gov.ezp-prod1.hul.harvard.edu/pubmed/20881960. Accessed 4 October 2010.
3. Eichler EE, Flint J, Gibson G, Kong A, Leal SM, et al. (2010) Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet 11: 446–450. doi:10.1038/nrg2809.
4. Cirulli ET, Goldstein DB (2010) Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat Rev Genet 11: 415–425. doi:10.1038/nrg2779.
5. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, et al. (2009) Finding the missing heritability of complex diseases. Nature 461: 747–753. doi:10.1038/nature08494.
6. Sham PC, Cherny SS, Purcell S, Hewitt JK (2000) Power of linkage versus association analysis of quantitative traits, by use of variance-components models, for sibship data. Am J Hum Genet 66: 1616–1630. doi:10.1086/302891.
7. Lander ES, Botstein D (1989) Mapping mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121: 185–199.
8. Risch N, Zhang H (1995) Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science 268: 1584–1589. doi:10.1126/science.7777857.
9. Van Gestel S, Houwing-Duistermaat JJ, Adolfsson R, van Duijn CM, Van Broeckhoven C (2000) Power of selective genotyping in genetic association analyses of quantitative traits. Behav Genet 30: 141–146.
10. Abecasis GR, Cookson WOC, Cardon LR (2001) The Power to Detect Linkage Disequilibrium with Quantitative Traits in Selected Samples. Am J Hum Genet 68: 1463–1474. doi:10.1086/320590.
11. Duncan EL, Danoy P, Kemp JP, Leo PJ, McCloskey E, et al. (2011) Genome-Wide Association Study Using Extreme Truncate Selection Identifies Novel Genes Affecting Bone Mineral Density and Fracture Risk. PLoS Genet 7: e1001372. doi:10.1371/journal.pgen.1001372.
12. Cotsapas C, Speliotes EK, Hatoum IJ, Greenawalt DM, Dobrin R, et al. (2009) Common body mass index-associated variants confer risk of extreme obesity. Hum Mol Genet 18: 3502–3507. doi:10.1093/hmg/ddp292.
13. Hegele RA, Ban MR, Hsueh N, Kennedy BA, Cao H, et al. (2009) A polygenic basis for four classical Fredrickson hyperlipoproteinemia phenotypes that are characterized by hypertriglyceridemia. Hum Mol Genet 18: 4189–4194. doi:10.1093/hmg/ddp361.
14. Guey LT, Kravic J, Melander O, Burtt NP, Laramie JM, et al. (2011) Power in the phenotypic extremes: a simulation study of power in discovery and replication of rare variants. Genet Epidemiol 35: 236–246. doi:10.1002/gepi.20572.
15. Allison DB, Heo M, Schork NJ, Wong S-L, Elston RC (1998) Extreme Selection Strategies in Gene Mapping Studies of Oligogenic Quantitative Traits Do Not Always Increase Power. Hum Hered 48: 97–107. doi:10.1159/000022788.
16. Holmen J, Midthjell K, Krüger Ø, Langhammer A, Holmen TL, et al. (2003) The Nord-Trøndelag Health Study 1995-97 (HUNT 2): Objectives, contents, methods and participation. Nor Epidemiol 13: 19–32.
17. Vartiainen E, Laatikainen T, Peltonen M, Juolevi A, Männistö S, et al. (2010) Thirty-five-year trends in cardiovascular risk factors in Finland. Int J Epidemiol 39: 504–518. doi:10.1093/ije/dyp330.
18. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81: 559–575. doi:10.1086/519795.
19. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38: 904–909. doi:10.1038/ng1847.
20. Johnson AD, Handsaker RE, Pulit SL, Nizzari MM, O'Donnell CJ, et al. (2008) SNAP: a web-based tool for identification and annotation of proxy SNPs using HapMap. Bioinformatics 24: 2938–2939. doi:10.1093/bioinformatics/btn564.
21. Egyud MRL, Gajdos ZKZ, Butler JL, Tischfield S, Marchand L, et al. (2009) Use of weighted reference panels based on empirical estimates of ancestry for capturing untyped variation. Hum Genet 125: 295–303. doi:10.1007/s00439-009-0627-8.
22. Pritchard JK, Stephens M, Donnelly P (2000) Inference of Population Structure Using Multilocus Genotype Data. Genetics 155: 945–959.

Chapter 3

An excess of risk-increasing low frequency variants can be a signal of polygenic inheritance in complex diseases

Yingleong Chan1,2,3, Elaine T Lim1,2,4, Niina Sandholm5,6,7, Sophie R Wang1,2,3, Amy Jayne McKnight8, Stephan Ripke2,4, DIAGRAM Consortium, GENIE Consortium, GIANT Consortium, IIBDGC Consortium, PGC Consortium, Mark J Daly1,2,4, Benjamin M Neale2,4, Rany M Salem1,2,3, Joel N Hirschhorn1,2,3

1 Harvard Medical School, Department of Genetics, Boston, Massachusetts, USA.
2 Program in Medical and Population Genetics, Broad Institute, Cambridge, Massachusetts, USA.
3 Department of Endocrinology, Boston Children's Hospital, Boston, Massachusetts, USA.
4 Analytic and Translational Genetics Unit, Massachusetts General Hospital, Massachusetts, USA.
5 Folkhälsan Institute of Genetics, Folkhälsan Research Center, Biomedicum Helsinki, Helsinki, Finland.
6 Division of Nephrology, Department of Medicine, Helsinki University Central Hospital, Helsinki, Finland.
7 Department of Biomedical Engineering and Computational Science, Aalto University School of Science, Helsinki, Finland.
8 Nephrology Research, Centre for Public Health, Queen's University of Belfast, Belfast, Northern Ireland, UK.

Originally published as: Y Chan, et al., American Journal of Human Genetics (2014), Volume 94, Issue 3, Pages 437–452

ABSTRACT

In most complex diseases, much of the heritability remains unaccounted for by common variants. It has been postulated that lower frequency variants contribute to the remaining heritability. Here, we describe a method to test for polygenic inheritance from lower frequency variants using GWAS summary association statistics. We explored scenarios with many causal low frequency variants and showed that there is more power to detect risk variants than protective variants, resulting in an increase in the ratio of detected risk to protective variants (R/P ratio). Such an excess can also occur if risk variants are present and kept at lower frequencies because of negative selection. The R/P ratio can be falsely elevated because of reasons unrelated to polygenic inheritance, such as uneven sample sizes or asymmetric population stratification, so precautions to correct for these confounders are essential. We tested our method on published GWAS results and observed a strong signal in some diseases (schizophrenia and type 2 diabetes) but not others.
We also explored the shared genetic component in overlapping phenotypes related to inflammatory bowel disease (Crohn’s disease [CD] and ulcerative colitis [UC]) and diabetic nephropathy (macroalbuminuria and end stage renal disease [ESRD]). While the signal was still present when both CD and UC were jointly analyzed, the signal was lost when macroalbuminuria and ESRD were jointly analyzed, suggesting that these phenotypes are best studied separately. Thus, our method may also help guide the design of future genetic studies of various traits and diseases.

INTRODUCTION

Most common diseases involve a mix of both genetic and environmental factors and do not follow simple patterns of Mendelian inheritance. In such diseases, the genetic component is usually polygenic: genetic variation in many genes individually contributes a small or moderate component of disease risk [1]. Genome-wide association studies (GWAS) have identified numerous genomic loci in which common variants (≥5% frequency) are associated with complex diseases [2]. Even in some of the largest and most successful GWAS to date, much of the genetic contribution to phenotype remains unexplained (sometimes called “missing heritability”) [3,4], suggesting that lower frequency variants, not well surveyed by GWAS, may also contribute to the missing heritability. Indeed, in some diseases such as autism spectrum disorders (ASD [MIM 209850]), inherited rare (<1% frequency) and low frequency (<5% frequency) variants have recently been shown to play an important role in the genetic architecture of the disorder [5,6], suggesting that more loci with low frequency variants could be identified if appropriate additional studies were performed. In other diseases, there is as yet little evidence of a substantial role for low frequency variation, leaving open the question of whether studies of low frequency variation will be fruitful for those diseases.
The relative success of different approaches in identifying more contributing loci will depend on what type of variation accounts for the missing heritability. Low frequency variants may remain undetected because they may not be well-represented or well-tagged by markers on genotyping arrays and therefore would not be well-imputed [7]. Along these lines, the statistical power to detect low frequency variants in GWAS is much lower than the power to detect common variants if their underlying effect sizes are similar [8]. Knowing whether low frequency variants contribute to the missing heritability of a disease is important because approaches better-suited to identify additional common variants differ from those aimed at identifying rarer variants (genotyping arrays with common variants compared to arrays with lower frequency variants or sequencing).

Methods for detecting a contribution from common variants to the missing heritability have been described previously. In a GWAS of schizophrenia (SCZ [MIM 181500]) [9], Purcell and colleagues developed the concept of a polygenic score by combining the effects of multiple common variants that are modestly associated with schizophrenia. They showed that the score is predictive of schizophrenia in an independent cohort, thus indicating that there is a polygenic signal from many yet-to-be-detected common variants in schizophrenia. Yang and colleagues adopted a different approach by assessing the narrow-sense heritability of human height with a linear-model analysis using hundreds of thousands of common variants [10]. They found that at least 45% of the variance of height can be accounted for by common variants, indicating that there are many common variants associated with height that have yet to be discovered.
Although both methods can be used to detect a signal of polygenic inheritance from common variants in complex diseases, these tests were not designed to specifically test for low frequency variants, and also require individual-level genotype data.

In this chapter, we describe an approach that can be applied directly to GWAS summary statistics to ascertain the presence of polygenic inheritance from low frequency variants. We observed that, if low frequency variants contribute to disease susceptibility, there can be an excess of associated risk variants compared to protective variants at a given significance level. Here, risk variants are defined as variants for which the minor allele is associated with increased risk of disease and protective variants are defined as variants for which the minor allele is associated with decreased risk of disease. Under the null model, there should be no excess of associated risk variants compared to protective variants. To test for such an excess of risk variants, we calculated the risk to protective ratio (R/P ratio): the ratio of the number of detected risk variants over the number of detected protective variants. We explored various scenarios that could give rise to an increase in the R/P ratio. First, we showed empirically and analytically that when low allele frequency variants contribute to polygenic inheritance of a disease with low prevalence, there is an elevated R/P ratio because of greater power to detect risk variants than protective variants. Next, we showed through simulations that under a scenario of polygenic inheritance that includes negative selection, risk variants can have lower average frequencies than protective variants, leading to an elevated R/P ratio within the lower frequency range. However, we also showed that such an elevated R/P ratio can occur because of reasons unrelated to polygenic inheritance.
First, we showed that an uneven sample size, with substantially more controls than cases, can produce an apparent increase in the R/P ratio. Therefore, where the sample size is not balanced between cases and controls, one should compare the observed R/P ratio against that obtained through simulations with the same number of cases and controls. Next, we showed that particular scenarios of asymmetric population stratification can produce a similar excess of low frequency risk variants, and we recommend that precautions for detecting and correcting for such stratification be taken before one can confidently interpret an excess of risk variants as being a signal of polygenic inheritance.

We then applied our method to results from published GWAS for several diseases, including schizophrenia [11], bipolar disorder (BIP [MIM 125480]) [12], major depressive disorder (MDD [MIM 608516]) [13], type 2 diabetes (T2D [MIM 125853]) [14] and various classes of obesity (OB [MIM 601665]) [15]. We observed strong signals of increased risk variants in several of the diseases but little or no signal in others, suggesting that efforts to discover low frequency and rare variants will be more fruitful for the diseases with such a signal. We further used our method to test whether apparently related phenotypes share low frequency or rare genetic contributors and hence should be analyzed together or separately. By applying the method to phenotypes related to diabetic nephropathy (DN [MIM 603933]) [16] and inflammatory bowel disease (IBD [MIM 266600]) [17], we found that the polygenic signal was eliminated when individuals with macroalbuminuria and individuals with end stage renal disease were analyzed together, whereas we still observed a significant signal when individuals with Crohn’s disease and ulcerative colitis were analyzed together.
Thus, our method has the potential to guide the strategy in searching for additional genetic loci as well as in prioritizing the choice of phenotype for future studies of rare genetic variation in polygenic traits and diseases.

MATERIALS AND METHODS

Testing for an excess of risk variants from GWAS summary statistics

Calculating the R/P ratio statistic from observed GWAS summary statistics

The four input fields we used for R/P ratio calculations for each SNP are: an identifier (rsID), the minor allele frequency, the association P-value, and a field to determine the direction of effect, i.e. either an odds-ratio (OR) or an effect size (β). The ORs or βs were adjusted to reflect the effect of the minor allele by inverting the ORs or changing the sign of the βs if they were reported for the major allele. Each variant was classified as risk if the OR > 1 or β > 0 and as protective if the OR < 1 or β < 0. Neutral variants, i.e. OR = 1 or β = 0, were discarded from the analysis. We removed SNPs not present in the HapMap CEU population (phase 2 release 28) [18,19], SNPs not present in the 1,000 Genomes EUR population [20], and SNPs with minor allele frequency less than 1%. We sorted the remaining variants in order from most significant to least and performed LD-pruning by systematically going through the variants and removing variants that have an r2 > 0.1 with any of the more significantly associated variants. We used PLINK [21] to calculate r2 correlations of variant-pairs within a 1-megabase window from 379 EUR individuals of the 1,000 Genomes. To measure the excess of risk variants in the lower frequency range, we separated the low frequency variants into 3 distinct bins, i.e. 1%-5%, 5%-10% and 10%-15%. We also included the 30%-50% bin as a negative control, where we should not observe any excess of risk variants. For each bin, we counted the number of detected risk variants and the number of detected protective variants that meet significance cutoffs of P < 0.001 and P < 0.01.
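The classification, binning and counting steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the study: the function names, the tuple layout of the input, and the handling of bin boundaries (half-open intervals, with 50% included in the last bin) are assumptions made for the example, and LD-pruning and the reference-panel filters are assumed to have been applied already.

```python
import math
from collections import Counter

# Frequency bins from the text: three low frequency bins plus the
# 30%-50% negative-control bin.
BINS = [(0.01, 0.05), (0.05, 0.10), (0.10, 0.15), (0.30, 0.50)]


def to_minor_allele(effect, reported_for_minor, is_odds_ratio):
    """Orient an OR or beta so that it describes the minor allele."""
    if reported_for_minor:
        return effect
    return 1.0 / effect if is_odds_ratio else -effect


def direction(effect, is_odds_ratio):
    """Return 'risk', 'protective', or None for a neutral variant."""
    null_value = 1.0 if is_odds_ratio else 0.0
    if effect > null_value:
        return "risk"
    if effect < null_value:
        return "protective"
    return None


def find_bin(maf):
    """Return the bin containing maf, or None (boundary rule is an assumption)."""
    for lo, hi in BINS:
        if lo <= maf < hi or (hi == 0.50 and maf == 0.50):
            return (lo, hi)
    return None


def rp_counts(variants, p_cutoff, is_odds_ratio=True):
    """Count detected risk and protective variants per frequency bin.

    variants: iterable of (maf, p_value, effect) tuples, with the effect
    already oriented to the minor allele.
    """
    counts = {b: Counter() for b in BINS}
    for maf, p_value, effect in variants:
        if p_value >= p_cutoff:
            continue  # not detected at this significance cutoff
        label = direction(effect, is_odds_ratio)
        bin_key = find_bin(maf)
        if label is None or bin_key is None:
            continue  # neutral variant, or frequency outside all bins
        counts[bin_key][label] += 1
    return counts


def log2_rp(counter):
    """log2 of the R/P ratio; None when either count is zero."""
    risk, protective = counter["risk"], counter["protective"]
    if risk == 0 or protective == 0:
        return None
    return math.log2(risk / protective)
```

Calling `rp_counts(variants, p_cutoff=0.001)` and then `log2_rp` on each bin gives one statistic per frequency bin; repeating the call with `p_cutoff=0.01` covers the second cutoff.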
We calculated the R/P ratio as

$$\text{R/P ratio} = \frac{\text{number of detected risk variants}}{\text{number of detected protective variants}}$$

Assessing the significance of the observed Risk/Protective (R/P) ratio

To assess the significance of an elevation in R/P ratio, we simulated individuals using HAPGEN [22] with parameters from the HapMap CEU population (phase 3, r2) to obtain the null distribution of the log2 R/P ratio statistic. We first simulated 100,000 individuals to form a pool of individuals that we could subsequently sample from. Next, we randomly sampled the same number of individuals in cases and controls as were used in the actual GWAS and performed the association test using PLINK, with LD-pruning and R/P ratio calculations identical to the procedure described above. We repeated this process 1,000 times to obtain accurate estimates of the sample mean ($\bar{x}$) and standard deviation ($s$) of the log2 R/P ratio under the null for each of our frequency bins and P-value cutoffs. We calculated the significance of the observed log2 R/P ratio by performing a one-tailed Z-test to obtain the Z-score and P-value (P), i.e.

$$Z = \frac{\log_2(\text{R/P})_{\text{observed}} - \bar{x}}{s}, \qquad P = 1 - \Phi(Z)$$

where $\Phi$ is the standard normal cumulative distribution function. We defined P < 0.01 as our significance threshold for calling a significant excess of risk variants.

Calculating non-centrality parameter (NCP) for comparing power between risk and protective variants

Power calculation

The power to detect a variant is expressed by calculating the expected non-centrality parameter (NCP) of the $\chi^2$ distribution under the alternative hypothesis. The greater the NCP, the more power there is to detect the effective variant. The algorithm for calculating NCP is identical to that of the Genetic Power Calculator [8] for case-control threshold-selected quantitative traits, assuming an additive model of the QTL effect, i.e. the dominance to additive QTL effect parameter is set to 0. The variance explained by a SNP with allele frequency p and effect size β is $2p(1-p)\beta^2$. For risk variants, we calculated the NCP ($\mathrm{NCP}_{\text{risk}}$) for multiple values of effect sizes (β), ranging from 0 to 0.5 with intervals of 0.01.
Similarly, for protective variants, we calculated the NCP ($\mathrm{NCP}_{\text{protective}}$) for multiple values of β, ranging from 0 to -0.5 with intervals of 0.01. The relative difference in power between risk and protective variants is measured by the NCP ratio, which for a risk and a protective variant with the same absolute effect size is calculated as

$$\text{NCP ratio} = \frac{\mathrm{NCP}_{\text{risk}}}{\mathrm{NCP}_{\text{protective}}}$$

Base Model

We define the base model as the following set of parameters used for calculating NCP: 10,000 cases, 10,000 controls, and effective and marker variant frequencies set to 1%. The prevalence is set at 1%, i.e. the trait threshold’s lower and upper limits are 2.33 and 9 respectively for cases and -9 and 2.33 for controls. We have used 9 and -9 as surrogates for infinity (+∞ and -∞ respectively), but any sufficiently large number will not change the conclusions of the downstream analyses. Complete linkage disequilibrium (LD) between the causal variant and marker variant is assumed, i.e. D’ = 1.

Simulating R/P ratios for negative selection

Obtaining frequencies and effect sizes

If the variants that have an effect on the phenotype are under negative selection, it can lead to scenarios where there are more risk variants than protective variants to begin with, especially for low frequency variants. To illustrate this, we simulated neutral variants and causal variants under negative selection using previously published models and parameters that result in an allele spectrum similar to that observed in the European population [23,24]. We used the forward simulation package ForSim [25] to simulate coding sequence variation in the European population in 1,000 genes. The average gene coding length was set at 1,500 bp. We used a mutation rate per site of $2 \times 10^{-8}$ and a uniform locus-wide recombination rate of 2Mb/cM. We modeled the distribution of selection coefficients (s) for de novo missense mutations by a gamma distribution [26].
We used the conventional 4-parameter model of the history of the European population with long-term constant size (N=8100 for 45,000 generations) followed by a bottleneck (N=2000) and then by exponential growth (1.5% increase per generation for 370 generations) to achieve a final population size of approximately 500,000 individuals [23,24]. We obtained 823 non-neutral variants that have minor allele frequencies ≥ 1% and designated them as effective variants, assuming that the allele under negative selection confers risk, i.e. has a positive effect (Figure 3.1). By considering only additive genetic effects, we assigned effect sizes as $\beta = s^{\tau}(1+\varepsilon)$, as suggested in Eyre-Walker [27]. Here, β is the variant’s additive effect on the quantitative trait; s is the absolute value of the variant’s selection coefficient; and ε is a normally distributed random noise parameter, which was set to have mean 0 and standard deviation 0.05. τ is the degree of coupling between β and s and was set at 0.5 for our analyses. The effect sizes are scaled so that these 823 variants explain 60% of the phenotypic variance.

Figure 3.1: The frequency and effect sizes for the 823 SNPs under selection. The plot shows the minor allele frequency (x-axis) and effect size in standard deviation units (y-axis) for the 823 SNPs that were obtained through simulating a trait under negative selection.

Obtaining phenotypes and calculating R/P ratio for the selection model

We used the 100,000 HAPGEN simulated individuals and selected 823 matched SNPs such that their frequencies match those of the variants generated by ForSim. We then assigned these matched SNPs the effect sizes determined earlier. We calculated the phenotypic Z-score for each of our 100,000 individuals in the same way that we did in a previous study [28], i.e.
by calculating the weighted allele score (WAS) and adding it to a randomly generated variable sampled from a normal distribution of mean 0 and variance 0.4 such that the total variance is 1. We then sampled 2,000 individuals with phenotypic Z-scores > 1.645 (5% prevalence) as cases and another 2,000 individuals with phenotypic Z-scores ≤ 1.645 as controls. We used PLINK to perform the association test on all the variants and calculated the R/P ratio within the same frequency bins as well as P-value cutoffs as described above. This process was repeated 1,000 times to obtain the distribution of the R/P ratio. For the control model, we randomly sampled 2,000 individuals as cases and 2,000 individuals as controls and calculated the R/P ratio as described above.

Simulating R/P ratios for population stratification

We used HAPGEN to simulate 4,000 distinct individuals from the HapMap CEU population (phase 3, r2) as well as another 4,000 distinct individuals from the HapMap TSI population (phase 3, r2). For complete stratification, we randomly sampled 1,000 individuals from the CEU pool as controls and 1,000 individuals from the TSI pool as cases. We simulated asymmetric mixtures of 1, 5 and 10 percent by randomly sampling 1,000 individuals from the CEU pool as controls and sampling 10, 50 and 100 individuals from the TSI pool as cases, respectively, and made up the remainder of the cases from the CEU pool. We used PLINK to perform the association test on all the variants and calculated the R/P ratio within the same frequency bins as well as P-value cutoffs as described above. Each process was repeated 1,000 times to obtain the distribution of the R/P ratio. All PCA analyses were performed using smartpca from the EIGENSOFT 3.0 package [29]. All meta-analyses of GWAS summary statistics were performed using METAL [30]. Inflation of the GWAS test statistic due to population stratification was assessed by the genomic control inflation factor (λGC) [31].
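For readers unfamiliar with the genomic control inflation factor mentioned above, the following Python sketch shows the usual way it is computed from association P-values: the median observed 1-degree-of-freedom χ2 statistic divided by the median of the χ2 distribution under the null (approximately 0.455). The function names are hypothetical, the sketch assumes the P-values come from 1-df tests, and it is not taken from the software used in the study.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()


def chi2_1df_from_p(p_value):
    """Convert a two-sided P-value to the matching 1-df chi-square statistic.

    A 1-df chi-square variable is the square of a standard normal, so the
    statistic is the squared normal quantile at p/2. Requires 0 < p <= 1.
    """
    z = _STD_NORMAL.inv_cdf(p_value / 2.0)
    return z * z


def genomic_control_lambda(p_values):
    """Median observed chi-square divided by the null median."""
    observed_median = median(chi2_1df_from_p(p) for p in p_values)
    expected_median = chi2_1df_from_p(0.5)  # null median, about 0.455
    return observed_median / expected_median
```

A value close to 1 indicates no inflation of the test statistic, while values well above 1 (such as those produced by strong population stratification) indicate an excess of small P-values across the genome.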
Calculating R/P ratio from published GWAS summary statistics

Schizophrenia, major depressive disorder and bipolar disorder

GWAS summary statistics were obtained from published results of schizophrenia [11], bipolar disorder [12] and major depressive disorder [13]. SNPs that failed imputation (INFO < 0.6) were discarded. The number of cases and controls used for simulating the null distribution are as follows: schizophrenia (SCZ), 9,394 cases and 12,462 controls; major depressive disorder (MDD), 9,240 cases and 9,519 controls; bipolar disorder (BIP), 7,481 cases and 9,250 controls.

Type 2 diabetes

GWAS summary statistics were obtained from published results of type 2 diabetes [14]. SNPs that passed imputation for fewer than 15,000 individuals (Ncases < 15,000) were discarded. The number of cases and controls used for simulating the null distribution are 15,000 cases and 50,337 controls.

Obesity

GWAS summary statistics were obtained from published results of various classes of obesity [15]. SNPs that passed imputation for fewer than 50,000 individuals (Ncases < 50,000), 10,000 individuals (Ncases < 10,000), 2,000 individuals (Ncases < 2,000) and 1,000 individuals (Ncases < 1,000) were discarded for the overweight (BMI > 25), class1 (BMI > 30), class2 (BMI > 35) and class3 (BMI > 40) datasets respectively. The number of cases and controls used for simulating the null distribution are as follows: overweight, 50,000 cases and 35,715 controls; class1, 10,000 cases and 20,325 controls; class2, 2,000 cases and 12,466 controls; class3, 1,000 cases and 18,346 controls.

Inflammatory bowel disease

GWAS summary statistics were obtained from published results of Crohn’s disease (CD) [32], ulcerative colitis (UC) [33] and the combined case cohort of both Crohn’s disease and ulcerative colitis (CD+UC) [17]. SNPs that failed imputation (INFO < 0.6) were discarded.
The number of cases and controls used for simulating the null distribution are as follows: CD, 5,956 cases and 14,927 controls; UC, 6,968 cases and 20,464 controls; CD+UC, 12,882 cases and 21,770 controls.

Diabetic nephropathy

GWAS summary statistics were obtained from published results of phenotypes related to diabetic nephropathy [16], namely macroalbuminuria (MACRO) and end stage renal disease (ESRD). SNPs that failed imputation in at least 1 cohort were discarded. The number of cases and controls used for simulating the null distribution are as follows: macroalbuminuria versus control (MACROctrl), 1,478 cases and 3,315 controls; end stage renal disease versus control (ESRDctrl), 1,399 cases and 3,315 controls; ESRD versus controls that include MACRO (ESRDctrl+macro), 1,399 cases and 5,253 controls; combined MACRO and ESRD versus control ([MACRO + ESRD]ctrl), 2,916 cases and 3,315 controls.

RESULTS

We developed a method to detect and assess the significance of an excess of risk variants, measured by the ratio of risk variants to protective variants (R/P ratio) within a series of frequency bins and P-value cutoffs (see Materials and Methods). We proceeded to show that under an assumption of polygenic inheritance from low frequency variants, there is more statistical power to detect risk variants than protective variants, which can result in an increased R/P ratio. We also showed that such an excess can occur if risk variants are kept at lower frequencies because of negative selection. However, such an excess can also occur because of reasons unrelated to a contribution of rare variants to disease risk: uneven sample sizes or asymmetric population stratification. Therefore, steps have to be taken to account for these latter possibilities before one can confidently interpret the excess of risk variants as a true signal of polygenic inheritance. Finally, we applied the method to GWAS summary statistics from several published studies.
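The significance assessment summarized above (a one-tailed Z-test of the observed log2 R/P ratio against a simulated null distribution, as described in Materials and Methods) can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, the null values would in practice come from the repeated null simulations, and the use of the sample (n - 1) standard deviation is an assumption.

```python
from statistics import NormalDist, mean, stdev


def rp_ratio_z_test(observed_log2_rp, null_log2_rp):
    """One-tailed Z-test for an elevated log2 R/P ratio.

    observed_log2_rp: log2 R/P ratio from the real GWAS for one frequency
        bin and one P-value cutoff.
    null_log2_rp: log2 R/P ratios from null simulations with the same
        numbers of cases and controls, same bin and cutoff.
    Returns (z_score, p_value).
    """
    null_mean = mean(null_log2_rp)
    null_sd = stdev(null_log2_rp)  # sample SD (n - 1 denominator), an assumption
    z_score = (observed_log2_rp - null_mean) / null_sd
    # One-tailed: only an excess of risk variants counts as evidence.
    p_value = 1.0 - NormalDist().cdf(z_score)
    return z_score, p_value
```

Because the null mean is estimated from simulations that match the study's case and control counts, an R/P ratio that is elevated purely because of an excess of controls shifts the null mean as well and does not by itself produce a large Z-score.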
Significantly higher power to detect low frequency risk variants of moderate to large effect

The liability threshold model for disease [34] has been shown to be consistent with results from GWAS for multiple diseases [35]. This model assumes that there is an underlying unmeasured trait related to disease risk, and that individuals are affected with disease only when the value of the trait exceeds a particular threshold. Under such a model, we discovered that the statistical power to detect risk variants is higher than the power to detect protective variants, even when they have the same effect size with respect to the underlying unmeasured trait. For example, we calculated power using a pre-defined set of parameters defined as the ‘base model’ (see Materials and Methods). From our calculations, we observed that, as effect size increases, there is significantly more power to detect risk than protective variants, as indicated by the increase in the NCP ratio (Figure 3.2). This result shows that for this scenario, where the numbers of risk and protective variants are equal and the variants have similar absolute effect sizes, the difference in power can create an excess of detected risk variants over protective variants, which can result in an increased R/P ratio.

Figure 3.2: Comparing the power to detect risk and protective variants with the same underlying effect size. The plot shows the power as the non-centrality parameter (NCP) for detecting minor alleles that confer risk (risk variants) and minor alleles that confer protection (protective variants) with varying absolute effect sizes (0 < β < 0.5 in standard deviation units) using parameters from the base model (see Materials and Methods). It also shows the NCP Ratio, which is the NCP of risk variants divided by the NCP of protective variants with the same absolute effect size (right vertical axis). The equivalent odds-ratio (OR) for the risk variants is also shown on the horizontal axis.

Figure 3.3: Effects of varying various parameters on the NCP Ratio. The plots show the difference in power for detecting risk versus protective variants through the NCP Ratio under varying parameters. Unless otherwise specified, the parameters used for calculating NCP are from the base model (see Materials and Methods). (A) Minor allele frequency of the associated variant varying from 1% to 15%. (B) Disease prevalence (threshold of liability) varying from 1% to 15%. (C) Linkage disequilibrium (LD) between the causal variant and the marker variant as a function of D’ (varying from 0.5 to 0.8). (D) The marker variant frequency is set at 5% with the causal variant frequency ranging from 1% to 4%.

The difference in power is larger under certain scenarios

We explored how the difference in power to detect risk and protective variants would be affected when we varied the parameters in the model under which we calculated power. First, we calculated power using the base model but varied the minor allele frequency from 1% to 15%. The difference in power for risk and protective variants decreases as the variant frequency increases (Figure 3.3A). Second, we varied the disease prevalence from 1% (trait Z-score > 2.33) to 15% (trait Z-score > 1.03). Here, the difference in power decreases with increasing disease prevalence (Figure 3.3B), and there is no difference in power at any effect size when the disease prevalence is exactly 50%. Third, we varied the linkage disequilibrium (LD) between the associated variant and the causal variant from moderate LD (D’ = 0.5) to strong LD (D’ = 0.8). While there is a general loss of power with decreasing LD, the difference in power between risk and protective variants increases with decreasing LD (Figure 3.3C).
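The asymmetry in power between risk and protective variants described above can be illustrated numerically with a simplified calculation. The Python sketch below is not the Genetic Power Calculator algorithm used for the figures: it assumes Hardy-Weinberg proportions, an additive effect on a liability with total variance 1, a directly typed causal variant, and a simple allelic test whose NCP is approximated from the expected case and control allele frequencies. All function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def case_control_freqs(beta, maf, prevalence):
    """Expected minor allele frequency in cases and in controls."""
    threshold = _STD_NORMAL.inv_cdf(1.0 - prevalence)
    # Residual SD chosen so that the total liability variance is 1.
    residual_sd = (1.0 - 2.0 * maf * (1.0 - maf) * beta ** 2) ** 0.5
    # Genotype probabilities for 0, 1, 2 minor alleles (Hardy-Weinberg).
    genotype_probs = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]
    case_mass = control_mass = case_alleles = control_alleles = 0.0
    for minor_count, prob in enumerate(genotype_probs):
        mean_liability = beta * (minor_count - 2.0 * maf)  # centred at 0
        p_case = 1.0 - _STD_NORMAL.cdf((threshold - mean_liability) / residual_sd)
        case_mass += prob * p_case
        control_mass += prob * (1.0 - p_case)
        case_alleles += prob * p_case * minor_count / 2.0
        control_alleles += prob * (1.0 - p_case) * minor_count / 2.0
    return case_alleles / case_mass, control_alleles / control_mass


def allelic_ncp(beta, maf, prevalence, n_cases=10_000, n_controls=10_000):
    """Approximate NCP of an allelic chi-square test (simplified)."""
    p_case, p_control = case_control_freqs(beta, maf, prevalence)
    pooled = (n_cases * p_case + n_controls * p_control) / (n_cases + n_controls)
    variance = pooled * (1.0 - pooled) * (1.0 / (2 * n_cases) + 1.0 / (2 * n_controls))
    return (p_case - p_control) ** 2 / variance


def ncp_ratio(beta, maf, prevalence):
    """NCP of a risk variant over NCP of an equally strong protective variant."""
    size = abs(beta)
    return allelic_ncp(size, maf, prevalence) / allelic_ncp(-size, maf, prevalence)
```

Under these assumptions, `ncp_ratio(0.5, 0.01, 0.01)` (parameters resembling the base model) is greater than 1, the ratio shrinks when the minor allele frequency is raised to 15%, and it equals 1 when the prevalence is set to 50%, which agrees qualitatively with the trends reported in the text.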
Along similar lines, when we assumed that low frequency causal variants are being tagged by variants of higher frequencies (fixing the frequency of the tagged variant at 5% and varying the frequency of the causal variant from 4% to 1%), we also observed a greater difference in power as the causal variant frequency decreased (Figure 3.3D). These results show that the difference in power between risk and protective variants should be more obvious when testing variants within the low frequency range (< 5% frequency), in polygenic diseases with lower prevalence, and when the markers being tested are proxies for lower frequency causal variants. The driving force behind this result is that cases are ascertained from individuals with an extreme distribution of liability scores whereas controls have a much broader distribution of liability scores. Consequently, given an equal number of cases and controls, the increase in minor allele count of a risk variant in the cases is greater than the increase in minor allele count of an equally strong protective variant in the controls, leading to higher power for detecting the risk variant (see Appendix for derived formulae that confirm the increase in power). Thus, if rare or low frequency variants play a substantial role in certain diseases with polygenic architecture, these results predict that we could observe an increased R/P ratio for low frequency variants in the GWAS summary statistics for these diseases.

Excess of risk variants can be caused by negative selection

Beyond the differences in power, an excess of risk compared to protective variants can also occur if there is negative selection against the disease, leading risk variants to be kept at lower frequencies than protective variants.
To illustrate this scenario, we simulated negative selection by coupling effects on evolutionary fitness and on a quantitative trait for a set of variants (frequency ≥ 1%), and then assigning case-control status based on the trait values (see Materials and Methods). We observed an increase in the R/P ratio for the frequency bins within 1% to 15% but not for the 30-50% frequency bin (Figure 3.4). These results show that under a model where rare variants contribute to disease and are under negative selection, we could also observe an increase in the R/P ratio for low frequency variants in the GWAS summary statistics for these diseases.

Figure 3.4: The distribution of the R/P ratio from simulating variants under negative selection. The figure shows the distribution of the log2 R/P ratio for various frequency bins and P-value cutoffs from simulating variants under negative selection. The selection model (red) uses the 823 effective variants while the control (black) model assumes no variants affect the phenotype.

Excess of risk variants can arise from having more controls than cases

The previous results show that polygenic inheritance from lower frequency variants can lead to an increase in the R/P ratio, but such an increase can also occur in other settings. Under the null hypothesis, one would expect that, on average, the number of detected risk variants equals the number of detected protective variants, resulting in an expected R/P ratio of 1. However, in our simulations, we observed that the expected R/P ratio can deviate from 1 because of an imbalance between the number of cases and controls. Specifically, if there are substantially more controls than cases, a feature present in some GWAS of dichotomous traits, the expected R/P ratio increases (R/P ratio > 1). To illustrate this, we randomly simulated 1,000 cases and 3,000 controls (1k/3k) and measured the distribution of the R/P ratio under a null model of no association (see Materials and Methods). We observed that there is an increase in the R/P ratio distribution for 1k/3k for the low frequency bins (Figure 3.5). This increase is not seen with common variants (30-50% frequency bin), nor if the numbers of cases and controls are equal (Figure 3.5). Of note, with larger sample sizes (10,000 cases and 30,000 controls; 10k/30k), we observed that the increase in R/P ratio is substantially attenuated (Figure 3.5). These results show that an excess of controls can increase the expected R/P ratio, and should be accounted for by comparing the observed R/P ratio against those obtained through simulations under a null model. These results also show that with a sufficiently large number of cases (e.g. > 10,000 cases), the increase in the expected R/P ratio due to this imbalance will be minimal.

Figure 3.5: The null distribution of the R/P ratio with a larger number of controls than cases. The figure shows the distribution of the log2 R/P ratio for various frequency bins and P-value cutoffs from simulating a larger number of controls than cases. The 1k/3k (red) model simulates the null distribution of the log2 R/P ratio for 1,000 cases and 3,000 controls. The 10k/30k (orange) model simulates the null distribution of the log2 R/P ratio for 10,000 cases and 30,000 controls. The control (black) model simulates the null distribution of the log2 R/P ratio for 1,000 cases and 1,000 controls.

Excess of risk variants can be due to asymmetric population stratification

We also considered whether an excess of risk variants could be seen in GWAS that are confounded by population stratification. As a first test, we randomly simulated 1,000 individuals of either northern European ancestry (CEU, based on allele frequencies in the CEU HapMap sample) or southern European ancestry (TSI, based on allele frequencies in the TSI HapMap sample).
In one experiment, we simulated 1,000 CEU individuals as controls and 1,000 TSI individuals as cases (see Materials and Methods), and as a stratification-free experiment, we simulated 1,000 CEU controls and 1,000 CEU cases. We found that in the stratified experiment, while there was a large excess of apparent associations for both risk and protective variants, leading to enormous inflation of the genomic control test statistic (λGC ~ 22.9), the resulting R/P ratio did not deviate substantially from expectations under the null (Figure 3.6). Therefore, even extreme scenarios with the usual forms of population stratification should not cause substantial deviations of the R/P ratio.

Figure 3.6: The distribution of the R/P ratio from simulating population stratification. The figure shows the distribution of the log2 R/P ratio for various frequency bins and P-value cutoffs from simulating population stratification. The stratification model (red) simulates the association performed with cases only from the TSI population and controls only from the CEU population. The control model (black) simulates both cases and controls from the CEU population.

Figure 3.7: The distribution of the R/P ratio from simulating asymmetric population stratification. The figure shows the distribution of the log2 R/P ratio for various frequency bins and P-value cutoffs from simulating asymmetric population stratification. The models for asymmetric population stratification are as follows. Mixed 10%, 5% and 1% indicate that 10%, 5% and 1% of the cases, respectively, are TSI individuals, while the rest of the individuals used are of CEU ancestry. The control model comprises only CEU individuals without any population stratification.

However, we reasoned that a special case of asymmetric population stratification could potentially cause the R/P ratio to depart from expectations under the null.
Specifically, if there were a mixture of different populations only in cases but not in controls, or vice versa, it could lead to an increase or decrease of the R/P ratio. To test this, we randomly simulated a series of models where controls are homogeneous (CEU), while cases are a mixture of CEU and TSI (see Materials and Methods). At a 1% mixture in cases (λGC ~ 1.01), we did not observe any significant excess of risk variants, but at a 5% mixture (λGC ~ 1.06), we observed an excess of risk variants within the low frequency ranges (Figure 3.7). This excess is even larger with a 10% mixture (λGC ~ 1.24) (Figure 3.7). Variants within the common frequency range do not show an excess of risk variants (Figure 3.7). These results show that such asymmetric population stratification can increase the R/P ratio, with only moderate increases in the genomic control statistics. As a corollary, if the mixture were to exist in controls but not cases, we would expect the R/P ratio to decrease.

Finally, we meta-analyzed the results from the asymmetrically stratified GWAS with results from non-stratified GWAS (see Materials and Methods) to determine the effect on the R/P ratio if only a subset of the studies had asymmetric population stratification. We found that the increase in the R/P ratio is attenuated after meta-analysis (Figure 3.8). These results indicate that while asymmetric population stratification can give rise to an excess of risk variants, combining such results with non-stratified results can reduce the magnitude of the signal. Because this particular type of stratification is unlikely to be present in most of the cohorts prior to meta-analysis, it may be useful to examine the summary statistics of each study individually to determine if the increased R/P ratio is derived from a subset of studies in the GWAS meta-analysis. Ideally, if an increased R/P ratio is observed, principal component analysis or other methods should also be applied to the primary data to search for outliers present exclusively in cases, to further rule out asymmetric population stratification as a cause of an increased R/P ratio.

Figure 3.8: The distribution of the R/P ratio from simulating asymmetric population stratification after meta-analysis. The figure shows the distribution of the log2 R/P ratio for various frequency bins and P-value cutoffs from simulating asymmetric population stratification after meta-analysis with non-stratified data. The models “mixed 10%” and “meta-analyzed” refer to asymmetric population stratification with a 10% mixture of TSI individuals in the cases, before and after being meta-analyzed with 4 other datasets without such stratification, respectively. The control model indicates no asymmetric population stratification.

Using the R/P ratio in actual GWAS results to search for signals of low frequency variants contributing to disease risk

Schizophrenia, major depressive disorder and bipolar disorder

We applied our method to data from several psychiatric disorders: schizophrenia [11], bipolar disorder [12] and major depressive disorder [13]. We observed a significant increase in the R/P ratio only for schizophrenia in the 1-5% frequency bin, at a cutoff of P < 0.01 (P = 2.42 × 10^-7) (Table 3.1). We did not observe any significant differences in the other frequency bins, nor for either of the other psychiatric disorders (Table 3.1). These results are indicative of polygenic inheritance from low frequency variants in schizophrenia but do not provide similar support for a role of low frequency variants in major depressive disorder or bipolar disorder.

Type 2 diabetes

Next, we applied our method to GWAS results of type 2 diabetes [14]. The R/P ratio for type 2 diabetes was significantly increased in the low frequency bins (Table 3.2). The most significant difference was observed in the 1-5% bin with a cutoff of P < 0.01 (P = 3.08 × 10^-15).
Table 3.1: Schizophrenia, major depressive disorder and bipolar disorder

Freq (%)  P-value cutoff  SCZ O(R/P)  SCZ E(R/P)  SCZ P     MDD O(R/P)  MDD E(R/P)  MDD P  BIP O(R/P)  BIP E(R/P)  BIP P
1-5       0.001           1.864       1.127       0.0298    1.210       1.058       0.269  0.884       1.110       0.748
1-5       0.01            1.623       1.032       2.42e-7*  1.169       1.006       0.048  0.953       1.028       0.778
5-10      0.001           1.348       1.057       0.1279    0.933       1.039       0.623  1.038       1.077       0.509
5-10      0.01            1.230       1.019       0.0111    0.914       1.005       0.865  0.973       1.013       0.678
10-15     0.001           1.050       1.082       0.4926    1.348       1.035       0.126  1.038       1.055       0.473
10-15     0.01            1.054       1.019       0.3335    1.193       1.005       0.027  1.046       1.015       0.349
30-50     0.001           1.063       1.022       0.3736    1.098       1.003       0.264  1.122       1.039       0.291
30-50     0.01            1.001       1.003       0.5010    0.944       1.001       0.836  1.070       1.009       0.165

The observed and expected R/P ratios and P-values obtained from analyzing GWAS summary statistics of psychiatric disorders: schizophrenia (SCZ), major depressive disorder (MDD) and bipolar disorder (BIP). O(R/P) refers to the observed R/P ratio while E(R/P) refers to the expected R/P ratio obtained through simulations. P refers to the p-value obtained from a 1-tailed Z-test (* marks P < 0.01).

Table 3.2: Type 2 diabetes

Freq (%)  P-value cutoff  T2D O(R/P)  T2D E(R/P)  T2D P
1-5       0.001           3.833       1.205       5.89e-6*
1-5       0.01            2.009       1.069       3.08e-15*
5-10      0.001           1.636       1.131       0.043
5-10      0.01            1.439       1.051       2.28e-5*
10-15     0.001           1.660       1.081       0.031
10-15     0.01            1.400       1.033       8.36e-4*
30-50     0.001           1.041       1.038       0.459
30-50     0.01            1.035       1.008       0.308

The observed and expected R/P ratios and P-values obtained from analyzing GWAS summary statistics of type 2 diabetes (T2D). O(R/P) refers to the observed R/P ratio while E(R/P) refers to the expected R/P ratio obtained through simulations. P refers to the p-value obtained from a 1-tailed Z-test (* marks P < 0.01).

We also observed a significant excess of risk variants in the 10-15% bin (cutoff of P < 0.01; P = 2.28 × 10^-5).
As the difference in power between risk and protective variants becomes minimal as the variant frequency increases, this observed excess of risk variants is more likely due to negative selection on diabetes risk alleles, tagging of low frequency variants by the more common SNPs in this frequency range, and/or possibly asymmetric population stratification. Nonetheless, these results are indicative of polygenic inheritance from low frequency variants in type 2 diabetes.

Obesity

We also applied our method to GWAS results for various classes of obesity [15]: overweight (BMI > 25), class 1 (BMI > 30), class 2 (BMI > 35) and class 3 (BMI > 40). The controls used for each class of obesity were individuals with BMI < 25. We observed a significant increase in the 1-5% frequency bin with a cutoff of P < 0.01 for only the class 1 dataset (P = 8.8 × 10^-6) (Table 3.3). Also, while we generally observed a gradual increase in the R/P ratio with increasing BMI definitions of obesity, which could be consistent with a role of lower frequency variants, the increase in R/P ratio could also be explained by having more controls than cases. We did not observe any significant excess of risk variants for the low frequency bins in the class 2 or class 3 datasets, likely because of the severely reduced sample sizes for the more extreme BMI definitions of obesity.
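The comparison of observed and expected R/P ratios reported in the tables of this chapter can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis code used for the results: the function names `rp_ratio` and `one_tailed_z`, the tuple layout of the input, and the half-open frequency bins are hypothetical choices, and performing the Z-test on the log2 scale is an assumption suggested by the figures, which report the distribution of the log2 R/P ratio.

```python
import math
from statistics import NormalDist, mean, stdev


def rp_ratio(variants, freq_lo, freq_hi, p_cutoff):
    """Ratio of risk to protective minor alleles among variants passing a cutoff.

    variants: iterable of (minor_allele_frequency, log_odds_ratio, p_value),
    where the log odds ratio refers to the minor allele (assumed layout).
    """
    risk = 0
    protective = 0
    for maf, log_or, p_value in variants:
        # Keep variants in the frequency bin [freq_lo, freq_hi) that pass the cutoff.
        if freq_lo <= maf < freq_hi and p_value < p_cutoff:
            if log_or > 0:
                risk += 1
            elif log_or < 0:
                protective += 1
    # Raises ZeroDivisionError if no protective variant passes the filters.
    return risk / protective


def one_tailed_z(observed_ratio, null_ratios):
    """One-tailed P-value for an excess of risk variants.

    null_ratios: R/P ratios from simulations under a null model with the
    same numbers of cases and controls as the study being tested.
    """
    null_log2 = [math.log2(r) for r in null_ratios]
    z = (math.log2(observed_ratio) - mean(null_log2)) / stdev(null_log2)
    return 1.0 - NormalDist().cdf(z)
```

Because the null ratios come from simulations that match the study's case and control counts, the expected excess of risk variants caused by having more controls than cases is absorbed into the null distribution instead of being counted as signal.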
Table 3.3: Obesity

Freq (%)  P-value cutoff  Overweight O(R/P)  Overweight E(R/P)  Overweight P  Class1 O(R/P)  Class1 E(R/P)  Class1 P
1-5       0.001           1.188              0.997              0.228         0.917          1.164          0.758
1-5       0.01            1.120              0.986              0.078         1.536          1.050          8.8e-6*
5-10      0.001           1.026              0.998              0.408         1.139          1.098          0.393
5-10      0.01            1.023              0.991              0.328         0.937          1.023          0.838
10-15     0.001           0.784              0.999              0.826         0.971          1.087          0.610
10-15     0.01            1.109              1.003              0.113         1.013          1.028          0.544
30-50     0.001           1.121              0.991              0.194         1.059          1.020          0.380
30-50     0.01            1.022              0.999              0.340         1.045          1.004          0.225

Freq (%)  P-value cutoff  Class2 O(R/P)  Class2 E(R/P)  Class2 P  Class3 O(R/P)  Class3 E(R/P)  Class3 P
1-5       0.001           2.462          2.410          0.410     3.700          3.454          0.354
1-5       0.01            1.533          1.376          0.114     1.814          1.617          0.111
5-10      0.001           0.697          1.640          0.999     1.857          2.067          0.607
5-10      0.01            1.108          1.222          0.871     1.227          1.346          0.845
10-15     0.001           1.276          1.567          0.713     1.385          1.766          0.779
10-15     0.01            1.066          1.208          0.883     1.269          1.267          0.479
30-50     0.001           0.949          1.094          0.763     1.019          1.112          0.696
30-50     0.01            0.985          1.035          0.816     0.955          1.044          0.946

The observed and expected R/P ratios and P-values obtained from analyzing GWAS summary statistics of clinical classes of obesity: Overweight (BMI > 25), Class1 (BMI > 30), Class2 (BMI > 35) and Class3 (BMI > 40). O(R/P) refers to the observed R/P ratio while E(R/P) refers to the expected R/P ratio obtained through simulations. P refers to the p-value obtained from a 1-tailed Z-test (* marks P < 0.01).

Testing whether related phenotypes are likely to share low frequency causal variants

To increase the power of GWAS, some studies have pooled apparently related phenotypes into a single case group [16,17]. We applied our method to measure the R/P ratio on published GWAS results of these related phenotypes. We reasoned that our method could also be used to test if pooling related phenotypes would increase power to detect low frequency variants, using only the GWAS summary statistics. We applied our method to GWAS results from two different pairs of related phenotypes, one pair for inflammatory bowel disease and one pair for diabetic nephropathy.
Inflammatory bowel disease

The two major types of inflammatory bowel disease are Crohn’s disease (CD) and ulcerative colitis (UC) [36]. We examined the R/P ratio in GWAS results for Crohn’s disease [32], ulcerative colitis [33] and the combined case cohort of both Crohn’s disease and ulcerative colitis [17]. We observed significant increases in the R/P ratio for both Crohn’s disease and ulcerative colitis within the low frequency bins (Table 3.4). The most significant increases were found in the 1-5% bin with a cutoff of P < 0.01 (CD: P = 1.55 × 10^-10, UC: P = 2.25 × 10^-9), consistent with a polygenic role of low frequency variants in both diseases. However, when Crohn’s disease and ulcerative colitis were combined as a single case group (CD + UC), the increase in R/P ratio was less significant than in the individual GWAS results (Table 3.4). These results suggest that there are some low frequency genetic contributors to Crohn’s disease and ulcerative colitis that are not shared by both diseases.
However, because the signal is still present (albeit attenuated) when both diseases were studied together, it also suggests that the two diseases do share some overlapping low frequency genetic contributors, although the attenuated signal could reflect persistence of two separate individual signals that are diluted after combination of the two sets of cases.

Table 3.4: Inflammatory bowel disease: Crohn’s disease and ulcerative colitis

Freq (%)  P-value cutoff  CD O(R/P)  CD E(R/P)  CD P       UC O(R/P)  UC E(R/P)  UC P      CD+UC O(R/P)  CD+UC E(R/P)  CD+UC P
1-5       0.001           2.545      1.347      0.017      1.958      1.358      0.075     1.385         1.159         0.222
1-5       0.01            1.994      1.111      1.55e-10*  1.866      1.106      2.25e-9*  1.457         1.048         1.6e-4*
5-10      0.001           1.148      1.162      0.477      1.490      1.192      0.153     1.099         1.107         0.463
5-10      0.01            1.314      1.069      1.4e-3*    1.460      1.066      8.59e-5*  1.239         1.027         0.012
10-15     0.001           1.200      1.181      0.424      1.279      1.186      0.337     1.583         1.076         0.059
10-15     0.01            1.043      1.059      0.551      1.213      1.066      0.075     1.104         1.026         0.205
30-50     0.001           0.925      1.035      0.743      1.163      1.037      0.217     1.036         1.026         0.445
30-50     0.01            1.052      1.018      0.266      1.004      1.009      0.524     1.043         1.005         0.251

The observed and expected R/P ratios and P-values obtained from analyzing GWAS summary statistics of inflammatory bowel diseases: Crohn’s disease (CD), ulcerative colitis (UC) and the combined CD and UC as a single case group (CD+UC). O(R/P) refers to the observed R/P ratio while E(R/P) refers to the expected R/P ratio obtained through simulations. P refers to the p-value obtained from a 1-tailed Z-test (* marks P < 0.01).

Table 3.5: Diabetic nephropathy: macroalbuminuria and end stage renal disease

Freq (%)  P-value cutoff  MACROctrl O(R/P)  MACROctrl E(R/P)  MACROctrl P  ESRDctrl O(R/P)  ESRDctrl E(R/P)  ESRDctrl P
1-5       0.001           2.000             1.655             0.205        1.944            1.706            0.283
1-5       0.01            1.560             1.198             1.4e-3*      1.705            1.207            6.4e-5*
5-10      0.001           1.563             1.359             0.253        1.278            1.404            0.585
5-10      0.01            1.200             1.116             0.175        1.240            1.143            0.147
10-15     0.001           0.893             1.275             0.892        1.343            1.304            0.403
10-15     0.01            1.208             1.104             0.150        1.190            1.128            0.258
30-50     0.001           1.122             1.066             0.343        1.198            1.051            0.197
30-50     0.01            0.990             1.023             0.690        1.152            1.014            0.017

Freq (%)  P-value cutoff  ESRDctrl+macro O(R/P)  ESRDctrl+macro E(R/P)  ESRDctrl+macro P  [MACRO + ESRD]ctrl O(R/P)  [MACRO + ESRD]ctrl E(R/P)  [MACRO + ESRD]ctrl P
1-5       0.001           2.667                  2.008                  0.146             1.087                      1.133                      0.504
1-5       0.01            2.270                  1.285                  9e-11*            1.026                      1.042                      0.550
5-10      0.001           1.533                  1.584                  0.496             0.875                      1.071                      0.754
5-10      0.01            1.552                  1.187                  2.9e-4*           1.045                      1.017                      0.352
10-15     0.001           1.462                  1.397                  0.380             0.912                      1.038                      0.640
10-15     0.01            1.310                  1.160                  0.078             1.053                      1.009                      0.290
30-50     0.001           0.968                  1.076                  0.719             1.037                      1.001                      0.382
30-50     0.01            1.038                  1.032                  0.449             0.981                      1.003                      0.652

The observed and expected R/P ratios and P-values obtained from analyzing GWAS summary statistics of diabetic nephropathy: macroalbuminuria (MACROctrl), end stage renal disease (ESRDctrl), ESRD versus controls that include MACRO (ESRDctrl+macro) and the combined MACRO and ESRD as a single case group ([MACRO + ESRD]ctrl). O(R/P) refers to the observed R/P ratio while E(R/P) refers to the expected R/P ratio obtained through simulations. P refers to the p-value obtained from a 1-tailed Z-test (* marks P < 0.01).

Diabetic nephropathy

We performed a similar analysis on two phenotypes used to characterize diabetic nephropathy [16]: macroalbuminuria (MACRO) and end stage renal disease (ESRD). Unlike inflammatory bowel disease, MACRO and ESRD are not necessarily distinct; MACRO is a milder form of diabetic nephropathy and some of those individuals progress to develop ESRD.
The controls used for that study were diabetic individuals who did not develop nephropathy. We analyzed the results of GWAS performed for individuals with macroalbuminuria versus controls (MACROctrl), individuals with end stage renal disease versus controls (ESRDctrl), individuals with end stage renal disease versus controls that also include individuals with macroalbuminuria (ESRDctrl+macro) and a combined case cohort that includes both individuals with macroalbuminuria and end stage renal disease versus controls ([MACRO + ESRD]ctrl). For the analyses of MACROctrl and of ESRDctrl, we observed significant increases in the R/P ratio in the 1-5% bin with a cutoff of P < 0.01 (MACROctrl: P = 0.001, ESRDctrl: P = 6.4 × 10^-5) (Table 3.5). For the ESRDctrl+macro analysis, where individuals with macroalbuminuria are included within the controls, there is an even larger increase of the R/P ratio (ESRDctrl+macro: P = 9 × 10^-11) (Table 3.5). However, when MACRO and ESRD cases were combined into a single case group ([MACRO + ESRD]ctrl), none of the frequency bins showed significant increases in the R/P ratio (Table 3.5). These results suggest that while there are low frequency contributors to both macroalbuminuria and end stage renal disease, these contributors do not substantially overlap. There is no detectable increase in the R/P ratio when both phenotypes are combined, unlike our observations for inflammatory bowel disease. Thus, these results indicate that studies of low frequency variation for diabetic nephropathy would be more fruitful if MACRO and ESRD are tested separately.

DISCUSSION

We have shown that our method for measuring the R/P ratio can be used as a test for the presence of multiple low frequency or rare genetic contributors to disease risk. This method can be applied to GWAS summary statistics, even if there are few or no genome-wide significant associations.
We analyzed results from multiple published GWAS, and found significant signals in some but not all diseases. These results support the hypothesis that diseases with an increased R/P ratio have a polygenic contribution from as-yet undetected low frequency or rare variants.

Some existing methods for detecting polygenic inheritance [9,10,37] use variants that achieve nominal significance in GWAS to determine if they are informative as predictors of phenotype. Because our method assesses the direction of effect of these variants against the null model, it represents a rather different, independent approach for assessing polygenic inheritance of low frequency variants. Furthermore, our method does not require having identified associated loci or the availability of individual level data. For example, in schizophrenia, it has been shown that a substantial proportion of disease risk is the result of variants with frequency > 1% [38]. Our finding suggests that some of the disease risk is accounted for by variants within the low frequency range (frequency < 5%). In a recent exome sequencing study of 2,536 schizophrenia cases and 2,543 controls [39], Purcell and colleagues showed a polygenic burden of rare disruptive mutations, which is consistent with our observation. Similarly, for type 2 diabetes, our results suggest the presence of low frequency or rare variants contributing to disease risk, even though most of the variants known to be associated with disease risk are common (frequency ≥ 5%) [14].

We also showed that negative selection under polygenic inheritance can increase the R/P ratio for low frequency variants, because risk variants would be kept at lower frequencies while the protective variants could drift to higher frequencies.
Indeed, in a previous study [40], Park and colleagues showed that across most qualitative traits, minor alleles conferred risk more often than protection, which they concluded to be evidence for purifying selection. While this can be the case for some diseases, we also showed that this increase in the R/P ratio can arise because there is more power to detect risk variants than protective variants. Furthermore, we have established that if there are substantially more controls than cases, a feature present in many GWAS, this imbalance can distort the null distribution such that there would appear to be more risk than protective variants. However, this imbalance can be accounted for through simulations, as we have demonstrated.

Our method also provides a simple and early way of assessing the utility of different phenotype definitions for genetic studies of low frequency variation, using only GWAS summary statistics. Our results for inflammatory bowel disease are consistent with the idea that Crohn’s disease and ulcerative colitis have some overlapping genetic contributors. Indeed, a previous study exploring the effect of common Crohn’s disease variants on ulcerative colitis identified significant overlaps between the two diseases, but also loci specific to Crohn’s disease [41]. For diabetic nephropathy, where there are few established loci from which to draw conclusions, we observed signals for both macroalbuminuria and particularly for end stage renal disease when analyzed separately, but no significant signal when both phenotypes were combined as a single case group. This suggests that macroalbuminuria and end stage renal disease are distinct in their genetic architecture and that studying them separately would be more productive. Interestingly, the same GWAS on diabetic nephropathy discovered a single genome-wide significant locus only when end stage renal disease was treated separately from macroalbuminuria [16], consistent with our observation.
Finally, asymmetric population stratification between cases and controls can lead to both false positive associations (as evidenced by an increased genomic control inflation factor) [42], and also an increase in the R/P ratio. Thus, while our observations of higher than expected R/P ratios in some of the published GWAS datasets are suggestive of a role of low frequency variants, we cannot completely rule out that some of these signals could be in part explained by asymmetric population stratification. Of note, none of the R/P ratios showed a deficit of risk variants (which would be expected under some models of asymmetric population stratification), suggesting that asymmetric population stratification is not widespread. Furthermore, these GWAS have used methods to detect and correct for population stratification.

In conclusion, our method can be used to screen for polygenic inheritance from low frequency or rare variants in diseases where GWAS have been performed. Our method can also be extended to other summary statistics, e.g. studies from sequencing or exome-chip genotyping, to assess low frequency variants that were directly genotyped rather than imputed. This method can serve as a simple approach to guide researchers in prioritizing strategies in searching for as yet unexplained heritability for specific diseases. For example, in a study of epilepsy [43], Heinzen and colleagues failed to identify any rare variants of large effect through exome sequencing; analysis of GWAS data for epilepsy can in theory help guide decisions about embarking on additional studies of low frequency or rare variants with larger sample sizes.
Although a lack of a signal from our method does not rule out a role for low frequency variants, and may reflect a combination of small sample sizes, and a set of effect sizes and frequencies that do not significantly alter the R/P ratio, a positive signal can provide greater confidence about the likelihood that low frequency or rare variants contribute to disease risk.

APPENDIX

Calculating NCP from various given parameters

We calculate the non-centrality parameter (NCP) as a function of the following parameters: effect size of the minor allele (β), minor allele frequency (p), liability threshold (t), number of case individuals (Nd) and number of control individuals (Nc). We denote the minor allele (effect allele) as a1 and the major allele (non-effect allele) as a2. As such, the liability distribution of a1 is N(x, μ1, σ²) and the liability distribution of a2 is N(x, μ2, σ²), where N(x, μ, σ²) is the probability density function of a normal distribution with mean μ and variance σ². The mean liabilities for a1 and a2 are as follows:

Mean liability for a1 = μ1 = β − β p = β q
Mean liability for a2 = μ2 = −β p

where q is the major allele frequency such that p + q = 1. The variance remaining σ² is:

Variance remaining = σ² = 1 − β² p q

Next, we calculate a series of conditional probabilities as follows:

With these conditional probabilities, we proceed to calculate the expected allele frequencies of both the minor allele and major allele in both cases and controls using Bayes’ theorem. These are calculated as:

We then calculate the NCP by the χ² statistic from a 2 by 2 contingency table for the expectation of the observed number of a1 and a2 in both cases and controls.

         Case             Control          Total
a1       2 Nd Pd1         2 Nc Pc1         2A
a2       2 Nd (1 − Pd1)   2 Nc (1 − Pc1)   2B
Total    2 Nd             2 Nc             2T

where,

A = Nd Pd1 + Nc Pc1
B = Nd (1 − Pd1) + Nc (1 − Pc1)
T = A + B = Nd + Nc

The expected number for each cell is the row total times the column total divided by the grand total.
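The construction described so far (mean liabilities, conditional probabilities, Bayes’ theorem and the 2 by 2 table) can be sketched numerically. The following Python function is an illustrative reconstruction from the stated definitions, not the derivation's own formulae: the name `allelic_ncp` is hypothetical, and the conditional probabilities are assumed to be the upper-tail probabilities of the allele-specific liability distributions above the threshold t.

```python
from statistics import NormalDist


def allelic_ncp(beta, p, t, n_cases, n_controls):
    """Chi-square NCP of the 2 by 2 allelic table under the liability model."""
    q = 1.0 - p
    mu1 = beta * q                    # mean liability for a1
    mu2 = -beta * p                   # mean liability for a2
    sigma = (1.0 - beta ** 2 * p * q) ** 0.5

    # Assumed conditional probabilities: P(liability > t | allele).
    case_given_a1 = 1.0 - NormalDist(mu1, sigma).cdf(t)
    case_given_a2 = 1.0 - NormalDist(mu2, sigma).cdf(t)

    # Bayes' theorem: frequency of a1 among cases (pd1) and controls (pc1).
    prevalence = p * case_given_a1 + q * case_given_a2
    pd1 = p * case_given_a1 / prevalence
    pc1 = p * (1.0 - case_given_a1) / (1.0 - prevalence)

    # Expected allele counts, laid out as rows a1, a2 and columns case, control.
    observed = [
        [2 * n_cases * pd1, 2 * n_controls * pc1],
        [2 * n_cases * (1.0 - pd1), 2 * n_controls * (1.0 - pc1)],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    grand_total = sum(row_totals)

    # Chi-square statistic: sum over cells of (observed - expected)^2 / expected.
    ncp = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            ncp += (observed[i][j] - expected) ** 2 / expected
    return ncp
```

Calling the function with β and −β at the same frequency and sample sizes gives one way to compare a risk variant with an equally strong protective variant, for example `allelic_ncp(0.2, 0.02, 2.326, 1000, 1000)` against `allelic_ncp(-0.2, 0.02, 2.326, 1000, 1000)`, where t = 2.326 corresponds to a prevalence of roughly 1%.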
Thus, the NCP is calculated as:

After some algebra and simplification,

Therefore,

We verified that these formulae were correct by comparing to simulated results.

Determining NCP ratio between risk and protective variants with the same magnitude of effect

We formulated the various probabilities between risk and protective variants. Assuming β to be positive, the risk variant would have the following probabilities,

and the protective variant with the same magnitude of effect would have the following probabilities,

Assuming that there are equal numbers of cases and controls (Nd = Nc), then

The ratio between risk and protective variants with the same magnitude of β is therefore

We can transform the distributions such that,

where

and

Then,

When prevalence is 50% (t = 0),

and therefore

This shows that when prevalence is 50% (t = 0) and there are equal sample numbers in cases and controls (Nd = Nc), the NCP between risk and protective variants with identical magnitudes of effect (β) would be the same regardless of any other parameters.

For the case where t > 0, if

then the NCP for risk variants will be greater than the NCP for protective variants and the NCP ratio will be greater than 1. When t > βq, this will be true because the normal distribution is monotonically decreasing above z = 0 (y = 0). To extend this to the more general case of t > 0, we first examine the individual components,

where erf is the error function. Similarly,

Therefore,

Taking the first 2 terms of the Taylor-series expansion of the error function and approximating σ to 1 (σ ≈ 1),

As such, if t > 0,

Therefore, if t > 0,

Therefore, for diseases with low prevalence (t > 0), there is more power to detect risk variants compared with protective variants.

REFERENCES

1. Hirschhorn JN, Gajdos ZKZ (2011) Genome-Wide Association Studies: Results from the First Few Years and Potential Implications for Clinical Medicine. Annual Review of Medicine 62: 11–24. doi:10.1146/annurev.med.091708.162036.
2. Cantor RM, Lange K, Sinsheimer JS (2010) Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application. The American Journal of Human Genetics 86: 6–22. doi:10.1016/j.ajhg.2009.11.017.
3. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, et al. (2009) Finding the missing heritability of complex diseases. Nature 461: 747–753. doi:10.1038/nature08494.
4. Eichler EE, Flint J, Gibson G, Kong A, Leal SM, et al. (2010) Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet 11: 446–450. doi:10.1038/nrg2809.
5. Lim ET, Raychaudhuri S, Sanders SJ, Stevens C, Sabo A, et al. (2013) Rare complete knockouts in humans: population distribution and significant role in autism spectrum disorders. Neuron 77: 235–242. doi:10.1016/j.neuron.2012.12.029.
6. Yu TW, Chahrour MH, Coulter ME, Jiralerspong S, Okamura-Ikeda K, et al. (2013) Using whole-exome sequencing to identify inherited causes of autism. Neuron 77: 259–273. doi:10.1016/j.neuron.2012.11.002.
7. Iyengar SK, Elston RC (2007) The genetic basis of complex traits: rare variants or “common gene, common disease”? Methods Mol Biol 376: 71–84. doi:10.1007/978-1-59745-389-9_6.
8. Purcell S, Cherny SS, Sham PC (2003) Genetic Power Calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics 19: 149–150.
9. Purcell SM, Wray NR, Stone JL, Visscher PM, O’Donovan MC, et al. (2009) Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460: 748–752. doi:10.1038/nature08185.
10. Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, et al. (2010) Common SNPs explain a large proportion of the heritability for human height. Nat Genet 42: 565–569. doi:10.1038/ng.608.
11. Schizophrenia Psychiatric Genome-Wide Association Study (GWAS) Consortium (2011) Genome-wide association study identifies five new schizophrenia loci. Nat Genet 43: 969–976. doi:10.1038/ng.940.
12. Psychiatric GWAS Consortium Bipolar Disorder Working Group (2011) Large-scale genome-wide association analysis of bipolar disorder identifies a new susceptibility locus near ODZ4. Nat Genet 43: 977–983. doi:10.1038/ng.943.
13. Major Depressive Disorder Working Group of the Psychiatric GWAS Consortium, Ripke S, Wray NR, Lewis CM, Hamilton SP, et al. (2013) A mega-analysis of genome-wide association studies for major depressive disorder. Mol Psychiatry 18: 497–511. doi:10.1038/mp.2012.21.
14. Morris AP, Voight BF, Teslovich TM, Ferreira T, Segrè AV, et al. (2012) Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat Genet 44: 981–990. doi:10.1038/ng.2383.
15. Berndt SI, Gustafsson S, Mägi R, Ganna A, Wheeler E, et al. (2013) Genome-wide meta-analysis identifies 11 new loci for anthropometric traits and provides insights into genetic architecture. Nat Genet. doi:10.1038/ng.2606.
16. Sandholm N, Salem RM, McKnight AJ, Brennan EP, Forsblom C, et al. (2012) New Susceptibility Loci Associated with Kidney Disease in Type 1 Diabetes. PLoS Genet 8: e1002921. doi:10.1371/journal.pgen.1002921.
17. Jostins L, Ripke S, Weersma RK, Duerr RH, McGovern DP, et al. (2012) Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease. Nature 491: 119–124. doi:10.1038/nature11582.
18. Altshuler DM, Gibbs RA, Peltonen L, Altshuler DM, Gibbs RA, et al. (2010) Integrating common and rare genetic variation in diverse human populations. Nature 467: 52–58. doi:10.1038/nature09298.
19. Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, et al. (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851–861. doi:10.1038/nature06258.
20. Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, et al. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65. doi:10.1038/nature11632.
21. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81: 559–575. doi:10.1086/519795. 22. Su Z, Marchini J, Donnelly P (2011) HAPGEN2: simulation of multiple disease SNPs. Bioinformatics 27: 2304–2305. doi:10.1093/bioinformatics/btr341. 23. Adams AM, Hudson RR (2004) Maximum-Likelihood Estimation of Demographic Parameters Using the Frequency Spectrum of Unlinked Single-Nucleotide Polymorphisms. 133 Genetics 168: 1699–1712. doi:10.1534/genetics.104.030171. 24. Agarwala V, Flannick J, Sunyaev S, GoT2D Consortium, Altshuler D (2013) Evaluating empirical bounds on complex disease genetic architecture. Nat Genet. doi:10.1038/ng.2804. 25. Lambert BW, Terwilliger JD, Weiss KM (2008) ForSim: a tool for exploring the genetic architecture of complex traits with controlled truth. Bioinformatics 24: 1821–1822. doi:10.1093/bioinformatics/btn317. 26. Kryukov GV, Shpunt A, Stamatoyannopoulos JA, Sunyaev SR (2009) Power of deep, allexon resequencing for discovery of human trait genes. Proc Natl Acad Sci USA 106: 3871– 3876. doi:10.1073/pnas.0812824106. 27. Eyre-Walker A (2010) Genetic architecture of a complex trait and its implications for fitness and genome-wide association studies. Proc Natl Acad Sci U S A 107: 1752–1756. doi:10.1073/pnas.0906182107. 28. Chan Y, Holmen OL, Dauber A, Vatten L, Havulinna AS, et al. (2011) Common variants show predicted polygenic effects on height in the tails of the distribution, except in extremely short individuals. PLoS Genet 7: e1002439. doi:10.1371/journal.pgen.1002439. 29. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38: 904–909. doi:10.1038/ng1847. 30. Willer CJ, Li Y, Abecasis GR (2010) METAL: fast and efficient meta-analysis of genomewide association scans. 
Bioinformatics 26: 2190–2191. doi:10.1093/bioinformatics/btq340. 31. Devlin B, Roeder K (1999) Genomic control for association studies. Biometrics 55: 997– 1004. 32. Franke A, McGovern DPB, Barrett JC, Wang K, Radford-Smith GL, et al. (2010) Genomewide meta-analysis increases to 71 the number of confirmed Crohn’s disease susceptibility loci. Nat Genet 42: 1118–1125. doi:10.1038/ng.717. 33. Anderson CA, Boucher G, Lees CW, Franke A, D’Amato M, et al. (2011) Meta-analysis identifies 29 additional ulcerative colitis risk loci, increasing the number of confirmed associations to 47. Nature Genetics 43: 246–252. doi:10.1038/ng.764. 34. Falconer DS (1965) The inheritance of liability to certain diseases, estimated from the incidence among relatives. Annals of Human Genetics 29: 51–76. doi:10.1111/j.14691809.1965.tb00500.x. 35. Slatkin M (2008) Exchangeable models of complex inherited diseases. Genetics 179: 2253– 2261. doi:10.1534/genetics.107.077719. 36. Baumgart DC, Carding SR (2007) Inflammatory bowel disease: cause and immunobiology. 134 Lancet 369: 1627–1640. doi:10.1016/S0140-6736(07)60750-8. 37. Yang J, Lee SH, Goddard ME, Visscher PM (2013) Genome-wide complex trait analysis (GCTA): methods, data analyses, and interpretations. Methods Mol Biol 1019: 215–236. doi:10.1007/978-1-62703-447-0_9. 38. Lee SH, DeCandia TR, Ripke S, Yang J, Sullivan PF, et al. (2012) Estimating the proportion of variation in susceptibility to schizophrenia captured by common SNPs. Nat Genet 44: 247–250. doi:10.1038/ng.1108. 39. Purcell SM, Moran JL, Fromer M, Ruderfer D, Solovieff N, et al. (2014) A polygenic burden of rare disruptive mutations in schizophrenia. Nature. doi:10.1038/nature12975. 40. Park J-H, Gail MH, Weinberg CR, Carroll RJ, Chung CC, et al. (2011) Distribution of allele frequencies and effect sizes and their interrelationships for common genetic susceptibility variants. Proc Natl Acad Sci U S A 108: 18026–18031. doi:10.1073/pnas.1114759108. 41. 
Anderson CA, Massey DCO, Barrett JC, Prescott NJ, Tremelling M, et al. (2009) Investigation of Crohn’s disease risk loci in ulcerative colitis further defines their molecular relationship. Gastroenterology 136: 523–529.e3. doi:10.1053/j.gastro.2008.10.032.
42. Price AL, Zaitlen NA, Reich D, Patterson N (2010) New approaches to population stratification in genome-wide association studies. Nat Rev Genet 11: 459–463. doi:10.1038/nrg2813.
43. Heinzen EL, Depondt C, Cavalleri GL, Ruzzo EK, Walley NM, et al. (2012) Exome Sequencing Followed by Large-Scale Genotyping Fails to Identify Single Rare Variants of Large Effect in Idiopathic Generalized Epilepsy. Am J Hum Genet 91: 293–302. doi:10.1016/j.ajhg.2012.06.016.

Chapter 4

Genome wide association in European and African Americans discover novel loci associated with sitting height ratio

Yingleong Chan1,2,3, Rany M Salem1,2,3, Joel N Hirschhorn1,2,3

1 Department of Genetics, Harvard Medical School, Boston, MA, USA.
2 Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA.
3 Department of Endocrinology, Boston Children’s Hospital, Boston, MA, USA.

ABSTRACT

Body proportion is a phenotype that is determined by the ratio of different components of the human anatomy. While many genetic studies have been performed for height, little is known about the genetics underlying body proportion, and the genes regulating proportion might play a more important role in growth and development. Here we report our analysis of sitting height ratio (SHR), the ratio of sitting height to overall height. We show that genetics contributes in a major way to the difference in SHR between African Americans and European Americans. After adjusting for height, age, sex, body mass index and the relevant principal components, genome-wide association studies (GWAS) in African Americans and European Americans uncover 3 loci associated with SHR.
One of the loci (rs5959358) resides on the X-chromosome and was previously reported to be associated with height. Comparing the known height-associated loci with the SHR results reveals that many of these loci are also associated with SHR. While these results confirm that SHR is largely genetically determined, more samples are required to reveal the full genetic architecture of SHR.

INTRODUCTION

Human height is commonly used to illustrate a highly heritable trait that is polygenic. Our height, however, is in reality a summation of many different components, e.g. head length, trunk length and leg length. One of the first reports on how these individual lengths should correlate with each other was given by Leonardo da Vinci in his illustration of the Vitruvian Man circa 1490. In it, he recorded the expected proportions of these measurements in relation to each other for the human body. Further research has postulated that some of these measurements may be predictors of disease [1]. For example, there is evidence that leg length can be a predictor of metabolic disorders underlying type 2 diabetes [2]. One such measurement is sitting height. Sitting height is defined as the portion of total stature made up by the head and trunk, and is usually measured by having the individual sit on a table and measuring the length from the table surface to the top of the person’s head. Since sitting height is a component of an individual’s total height, the sitting height ratio (SHR), defined as sitting height divided by total height, is an indicator of an individual’s body proportion. The SHR of an individual changes with growth. Unlike height, which increases as we age, SHR decreases rapidly from infancy to adolescence and increases slightly as we become adults [3,4]. In the extreme case, individuals affected with skeletal dysplasias not only have short stature, but also have disproportionate SHR [5].
Depending on the type of skeletal dysplasia, the SHR can be severely increased. For example, individuals with achondroplasia have an average SHR of 0.66 (normal range: 0.52-0.53) [6]. On the other hand, individuals with spondyloepiphyseal and spondylometaepiphyseal dysplasias may have normal SHR values [7]. The SHR also differs slightly between people of different ancestries. Individuals of Asian ancestry have higher SHR than individuals of European ancestry, and individuals of European ancestry have higher SHR than individuals of African ancestry [8]. This difference is assumed to be due to genetic factors, although it remains unclear whether it is due to many variants with small effect sizes or a few variants with large effect sizes.

In this chapter, we describe our approach to determine whether there is a strong genetic influence on the SHR difference between individuals of different ancestry. We found that SHR is highly correlated with the degree of European admixture in African Americans. The more European ancestry an African American has, the higher his or her SHR, consistent with the reported observations. We performed genome-wide association studies of SHR in both European Americans and African Americans and report 3 loci associated with SHR. We then examined several variants known to be associated with height and observed that many of these variants were also marginally associated with SHR. These results suggest that variants associated with both height and SHR might be in genes that regulate development of the growth plate.

RESULTS

European Americans have higher sitting height ratios (SHR) than African Americans

We used the ARIC [9] and CARDIA [10] cohorts as they include both European and African Americans with both sitting height and height measurements.
After removing individuals that failed our quality control (see Materials and Methods), we have 7,257 European American individuals and 2,354 African American individuals from ARIC. For CARDIA, we have 1,047 European American individuals and 715 African American individuals. Comparing the sitting height ratio (SHR) between European and African American individuals, we find that European Americans have higher SHR values than African Americans (Figure 4.1). In both ARIC and CARDIA, the mean SHR for European Americans is 0.53 while the mean SHR for African Americans is 0.51. After correcting SHR for covariates such as height, age, sex and BMI and expressing SHR as a Z-score (see Materials and Methods), we observed a difference of more than 1 standard deviation (ARIC = 1.16, CARDIA = 1.06) between European Americans and African Americans. This result is consistent with earlier findings that European Americans have higher SHR than African Americans [8].

Figure 4.1: Sitting height, height and sitting height ratio (SHR) distribution. We examined the sitting heights, heights and sitting height ratios (SHRs) of individuals in both the ARIC and CARDIA cohorts. European Americans (EA) are colored in blue while African Americans (AA) are colored in red. (A) The ARIC cohort (N=9,611). The top panel plots sitting height versus total height. The bottom panel shows the histogram of SHR for EA and AA, where there is about a 1.18 standard deviation difference between the 2 populations. (B) The CARDIA cohort (N=1,762). The bottom panel shows the histogram of SHR for EA and AA, where there is about a 1.06 standard deviation difference between the 2 populations.
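The covariate adjustment and Z-score comparison described above can be sketched as follows. This is only a minimal illustration and not the pipeline used in this chapter (the actual procedure is given in Materials and Methods): it assumes that SHR is regressed on the covariates by ordinary least squares in the pooled sample and that the residuals are then standardized. The function names and all simulated values are hypothetical.

```python
import random
import statistics

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def covariate_adjusted_zscores(y, covariates):
    """Regress y on the covariates (plus intercept) and standardize the residuals."""
    k = len(covariates[0])
    # Centering the covariates improves conditioning and leaves residuals unchanged.
    means = [statistics.fmean([row[j] for row in covariates]) for j in range(k)]
    design = [[1.0] + [v - mu for v, mu in zip(row, means)] for row in covariates]
    p = k + 1
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(design, y)]
    mu, sd = statistics.fmean(resid), statistics.stdev(resid)
    return [(e - mu) / sd for e in resid]

# Simulated data (all values hypothetical): two groups whose raw SHR differs by 0.02.
rng = random.Random(0)
n = 1000
is_group_a = [rng.random() < 0.5 for _ in range(n)]
covariates = [
    (rng.gauss(170, 8), rng.uniform(45, 65), float(rng.randint(0, 1)), rng.gauss(27, 4))
    for _ in range(n)
]  # height (cm), age (years), sex (0/1), BMI
shr = [0.51 + (0.02 if a else 0.0) + rng.gauss(0, 0.012) for a in is_group_a]

z = covariate_adjusted_zscores(shr, covariates)
mean_a = statistics.fmean([zi for zi, a in zip(z, is_group_a) if a])
mean_b = statistics.fmean([zi for zi, a in zip(z, is_group_a) if not a])
sd_difference = mean_a - mean_b  # group difference in standard deviation units
print(f"Difference between groups: {sd_difference:.2f} SD")
```

Because ancestry is deliberately left out of the covariates in this sketch, the between-group difference is retained in the residuals and can be read directly in standard deviation units.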
Degree of European admixture is predictive of sitting height ratio (SHR) in African Americans

To determine if the SHR difference between European and African Americans has a genetic component, we reasoned that we could test this by exploring the genetic landscape of African Americans. As it is common for African Americans to have high levels (>10%) of European ancestry [11], the level of European ancestry in any given African American should be correlated with SHR if there is a genetic component to this difference. Given that European Americans have higher SHR than African Americans, we expect this correlation to be positive. To test this, we used principal component analysis to determine the degree of European admixture for the African Americans in both the ARIC and CARDIA cohorts (see Materials and Methods). We observed a gradient of percentage European admixture in the African Americans (Figure 4.2A and C), with some African Americans having as much as 60% European ancestry. There are significant positive correlations between the percentage of European admixture and normalized sitting height ratio (SHR) (Figure 4.2B and D). This result shows that the SHR difference between European and African Americans has a significant genetic component.

Analysis of African American individuals identifies a variant associated with sitting height ratio (SHR)

Given evidence for a genetic component, we proceeded to test for genetic markers that are associated with sitting height ratio (SHR). We performed genome-wide association analyses using the genotypes of the African American individuals from both the ARIC and CARDIA cohorts and combined the results from both cohorts by meta-analysis (see Materials and Methods). We observed a genome-wide significant signal at the chromosome 3p21.33 locus, where the lead variant (rs201786365) has an association statistic of P = 1.252 × 10⁻⁸, with the minor allele (MAF = 0.14) associated with increased SHR (β = 0.21) (Figure 4.3). This variant is present only in African Americans and is fixed as the major allele in European Americans. As such, this variant does not explain the SHR difference between African and European Americans. The gene closest to the lead SNP in the locus is ABHD5, mutations in which have been associated with Chanarin-Dorfman syndrome [12].

Figure 4.2: Association of global European ancestry with sitting height ratio (SHR). The plots show the degree of European admixture for each African American individual and how it correlates with SHR. A and C show the degree of European admixture in the 2 cohorts by principal component analysis. Individuals closer to CEU (blue) have more European ancestry than individuals close to YRI (red). B and D show the association of European ancestry with SHR using linear regression. (A) Global European ancestry for ARIC. (B) Correlating global European ancestry with SHR for ARIC. (C) Global European ancestry for CARDIA. (D) Correlating global European ancestry with SHR for CARDIA.

Figure 4.3: Genome wide association study (GWAS) of African American individuals. The Manhattan plot of the GWAS performed for the African American individuals from the ARIC and CARDIA cohorts. Only 1 locus (rs201786365) reached genome-wide significance.

Analysis of European American individuals identifies 2 loci associated with sitting height ratio (SHR)

We continued to explore genetic associations for SHR by performing the test on the European American individuals in the ARIC, CARDIA, CHS and FHS cohorts (see Materials and Methods).
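One common way to combine per-cohort association results for a single variant is fixed-effects inverse-variance weighting. The sketch below illustrates that general technique only; whether this weighting scheme matches the meta-analysis actually performed in this chapter is an assumption (see Materials and Methods), and the function name and the per-cohort estimates are hypothetical.

```python
import math

def inverse_variance_meta(betas, standard_errors):
    """Fixed-effects meta-analysis of per-cohort effect estimates for one variant."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal P-value
    return beta, se, z, p_value

# Hypothetical per-cohort effect sizes and standard errors for one variant.
cohort_betas = [-0.16, -0.12, -0.15, -0.14]
cohort_ses = [0.04, 0.09, 0.06, 0.05]
meta_beta, meta_se, meta_z, meta_p = inverse_variance_meta(cohort_betas, cohort_ses)
print(f"beta = {meta_beta:.3f}, SE = {meta_se:.3f}, P = {meta_p:.2e}")
```

Under this scheme, cohorts with smaller standard errors (typically the larger cohorts) contribute more to the combined estimate.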
We observed a genome-wide significant signal at the chromosome 18p11.23 locus, where the lead variant (rs140449984) has an association statistic of P = 3.70 × 10⁻⁹, with the minor allele (MAF = 0.07) associated with decreased SHR (β = -0.149) (Figure 4.4). This variant lies within an intron of the PTPRM gene, which encodes a member of the protein tyrosine phosphatase (PTP) family. Additionally, we observed a significant signal on the X-chromosome (rs5959358), where the minor allele (MAF = 0.37) is associated with decreased SHR (β = -0.097, P = 9.71 × 10⁻⁸) only in women (Figure 4.5). Interestingly, the locus, which is in the vicinity of ITM2A, has been shown to be associated with height and has also been reported to escape dosage compensation [13]. The reported variant (rs1751138) is also associated with decreased SHR (β = -0.0945, P = 3.18 × 10⁻⁷) and is in strong linkage disequilibrium (LD) with rs5959358.

Figure 4.4: Genome wide association study (GWAS) of European American individuals. The Manhattan plot of the GWAS performed for the European American individuals from the ARIC, CARDIA, CHS and FHS cohorts. Only 1 locus reached genome-wide significance. The lead variant (rs140449984) is in the PTPRM gene.

Figure 4.5: Genome wide association study (GWAS) of the X-chromosome in European American women. The plot of the X-chromosome association performed for the European American women from the ARIC, CARDIA, CHS and FHS cohorts. The gene closest to the strongest association signal (rs5959358) is ITM2A, which was previously reported to harbor variants that escape dosage compensation.

Variants associated with height are also associated with sitting height ratio (SHR)

As sitting height is one of the components of height, we reasoned that variants that alter height may be enriched for variants that also alter SHR. To test this, we obtained a set of 421 LD-independent variants that have been shown to be robustly associated with height (Wood et
al., unpublished) and determined whether they are also associated with SHR. Although none of these 421 variants reached genome-wide significance, we observed that, as a whole, the 421 height-associated variants are also significantly associated with SHR (Figure 4.6). We observed 49 of the 421 variants to have SHR P-values less than 0.05, significantly more than expected by chance (expected = 21.05 of 421; P = 2 × 10⁻⁸). The most strongly associated variant (rs2079795) has an SHR association P-value of approximately 3 × 10⁻⁶. Also, the height-associated variant in GDF5, which was previously suggested to have some association with sitting height [14], had a marginal association with SHR (P = 0.01) (Table 4.1). These results indicate that SHR is polygenic and that a substantial number of height-associated alleles alter SHR as well.

DISCUSSION

We have shown that body proportion, as measured by sitting height ratio (SHR), is mainly genetically driven. SHR also appears to be more constrained than height: while a standard deviation (SD) of height is 6.08 cm (ARIC), an SD of sitting height adjusted for height is just 1.95 cm (ARIC). Also, while men and women can differ in height by as much as 12 cm [15], the difference between men and women in sitting height adjusted for height is only approximately 0.47 cm (ARIC). This suggests that the genes underlying the variability of SHR might be more relevant to development than those underlying height, as there is more selective pressure to keep SHR within an acceptable range. However, we and others have shown that there is a

Figure 4.6: QQ-plot of the 421 LD-independent SNPs known to be associated with height. The plot shows that the 421 height SNPs are, as a group, also associated with sitting height ratio (SHR) even though none of them reached genome-wide significance.
The x-axis is the expected -log10 of the P-values while the y-axis is the observed -log10 of the P-values obtained from the association with SHR in the European American individuals. The gray points represent 5 different random samplings of 421 different variants from the GWAS of SHR in the European American individuals.

Table 4.1: The effect sizes and P-values of the 421 height-associated SNPs for sitting height ratio (SHR).

Rsid  Chr  Position  Ref  Alt  Reffreq  Height Effect Size  Height P-value  SHR Effect Size  SHR P-value  Closest Gene
rs2079795  17  56851431  t  c  0.33  0.045  8.60E-48  -0.067  2.99E-06  C17orf82
rs310421  6  81848782  t  g  0.54  0.028  1.40E-21  0.0508  0.0001647  FAM46A
rs4369779  18  18989406  c  t  0.79  0.056  1.40E-54  -0.0557  0.0006526  CABLES1
rs1614303  10  123386796  t  g  0.83  0.023  1.50E-09  -0.0579  0.0009193  FGFR2
rs42039  7  92082358  t  c  0.27  0.051  4.10E-39  -0.0518  0.0009542  CDK6
rs217181  16  70671503  t  c  0.2  0.024  2.20E-10  -0.055  0.0009664  HPR
rs3790086  16  68445208  c  g  0.56  0.023  9.80E-16  0.0417  0.001443  WWP2
rs4803468  19  46614192  a  g  0.42  0.029  1.20E-20  0.0421  0.001541  BCKDHA
rs3807931  7  20348199  a  g  0.45  0.027  4.00E-21  -0.0425  0.001648  ITGB8
rs3825199  12  92501085  g  a  0.23  0.054  1.90E-53  0.0504  0.001875  SOCS2
rs3791679  2  55950396  a  g  0.77  0.084  1.30E-96  -0.0467  0.002166  EFEMP1
rs2224538  20  37985492  t  c  0.65  0.018  4.90E-09  0.0404  0.002936  MAFB
rs3760318  17  26271841  g  a  0.63  0.054  2.30E-59  -0.0399  0.00306  CENTA2
rs8006657  14  54314899  g  a  0.59  0.024  3.70E-15  0.0396  0.003301  SAMD4A
rs1966913  16  65941727  a  t  0.96  0.042  1.00E-08  -0.092  0.004018  LRRC36
rs7733195  5  172927230  g  a  0.64  0.028  3.40E-20  0.0404  0.004234  FAM44B
rs6485978  11  12634991  c  t  0.46  0.022  2.10E-14  -0.036  0.006102  TEAD1
rs12323101  13  32041406  a  g  0.37  0.021  2.40E-12  0.0374  0.007877  PDS5B
rs11642612  16  29937696  c  a  0.4  0.017  3.20E-08  -0.0351  0.008347  FLJ25404
rs17081935  4  57518233  t  c  0.2  0.03  5.00E-16  -0.0444  0.009117  C4orf14
rs9428104  1  118657110  g  a  0.75  0.044  1.10E-37  -0.0386  0.01043  SPAG17
rs16968242  15  74527274  g  c  0.07  0.034  6.20E-09  -0.0634  0.01065  SCAPER
rs143384  20  33489170  g  a  0.42  0.063  1.30E-71  0.0343  0.01146  GDF5
rs314263  6  105499438  c  t  0.32  0.043  3.10E-43  -0.0365  0.01148  LIN28B
rs2888893  12  105862761  c  t  0.51  0.017  7.30E-09  -0.0334  0.01289  C12orf23
rs11659752  18  75323850  t  g  0.7  0.025  2.10E-13  -0.0361  0.0133  NFATC1
rs212524  1  21455898  c  t  0.6  0.021  4.70E-12  0.0326  0.01502  ECE1
rs10877030  12  56542981  t  g  0.68  0.023  2.80E-13  0.0347  0.01597  CTDSP2
rs2871865  15  97012419  c  g  0.88  0.059  8.10E-32  0.0499  0.01789  IGF1R
rs3116168  2  232698075  c  t  0.73  0.022  2.40E-09  -0.0337  0.01866  DIS3L2
rs10770705  12  20748734  a  c  0.34  0.03  4.80E-22  -0.0338  0.01879  SLCO1C1
rs1797625  3  114309105  t  a  0.36  0.018  1.00E-09  -0.0318  0.01912  C3orf17
rs1884897  20  6560832  a  g  0.36  0.038  4.70E-33  -0.0313  0.02021  BMP2
rs1658351  3  57988613  c  t  0.35  0.023  3.00E-13  0.0318  0.021  FLNB
rs2597513  3  13530836  c  t  0.11  0.042  1.10E-18  -0.0476  0.02557  HDAC11
rs953199  9  99522797  c  a  0.76  0.02  6.10E-09  0.0331  0.0278  XPA
rs7517682  1  103292177  g  a  0.44  0.022  9.20E-14  0.0282  0.03245  COL11A1
rs7544462  1  37735343  a  c  0.91  0.032  1.10E-09  -0.0517  0.03251  C1orf149
rs1326023  20  54275785  a  g  0.3  0.024  9.80E-14  0.0299  0.03458  MC3R
rs17349981  15  80018975  a  t  0.85  0.028  2.40E-11  0.0399  0.03573  MEX3B
rs9825951  3  100752611  t  a  0.35  0.02  4.20E-10  -0.0287  0.03733  COL8A1
rs7701414  5  131613857  g  a  0.44  0.041  4.90E-42  -0.0276  0.0396  PDLIM4
rs12519505  5  77541632  c  t  0.78  0.023  3.20E-10  0.0327  0.04161  AP3B1
rs3814333  1  182273742  t  c  0.32  0.049  1.90E-53  -0.0288  0.04288  GLT25D2
rs2763273  6  168577472  c  t  0.76  0.022  1.80E-10  -0.0325  0.04334  SMOC2
rs1171615  10  61139096  c  t  0.22  0.022  4.50E-09  -0.0371  0.04364  SLC16A9
rs11640018  16  73885809  c  t  0.37  0.019  2.10E-09  -0.0275  0.04409  CFDP1
rs12779328  10  12983979  c  t  0.72  0.028  1.50E-17  -0.0298  0.0446  CCDC3
rs4953951  2  135903815  c  t  0.9  0.035  4.30E-11  -0.0433  0.04661  ZRANB3
rs12120956  1  113004094  g  a  0.77  0.025  9.90E-13  -0.0313  0.05032  CAPZA1
rs7033487  9  118169078  t  c  0.79  0.041  3.50E-29  0.0316  0.05056  PAPPA
rs10948222  6  45352393  c  t  0.58  0.032  8.70E-22  0.0266  0.05165  SUPT3H
rs1055144  7  25837634  t  c  0.19  0.022  1.80E-09  -0.0327  0.05777  NFE2L3
rs2272566  11  234552  a  g  0.48  0.016  2.40E-08  -0.0246  0.05786  PSMD13
rs606452  11  74953826  a  c  0.14  0.043  6.40E-23  -0.0352  0.05948  SERPINH1
rs936339  3  144018195  t  c  0.19  0.022  2.00E-08  -0.031  0.06448  PCOLCE2
rs4802134  19  43038525  a  g  0.21  0.027  2.90E-11  0.0279  0.06557  SIPA1L3
rs1546391  3  116180147  g  c  0.07  0.042  2.50E-12  -0.0446  0.06734  ZBTB20
rs7534365  1  148142748  c  t  0.19  0.045  3.50E-20  -0.0336  0.06785  SV2A
rs7985356  13  114045564  t  a  0.77  0.023  2.50E-11  0.0285  0.07016  CDC16
rs26868  16  2189377  a  t  0.47  0.025  2.70E-13  0.0264  0.07022  CASKIN1
rs12186664  5  95655981  t  a  0.32  0.021  6.00E-12  -0.0255  0.07265  PCSK1
rs9650315  8  57318152  g  t  0.87  0.057  2.50E-34  0.0352  0.07642  CHCHD7
rs798497  7  2762483  a  g  0.7  0.057  2.70E-71  0.0262  0.07826  GNA12
rs1405212  6  117597357  c  t  0.59  0.023  4.60E-14  0.0242  0.08067  VGLL2
rs2166898  2  121329129  g  a  0.84  0.027  8.70E-11  -0.0305  0.08378  GLI2
rs6446315  4  5086488  g  a  0.17  0.025  3.60E-09  -0.0309  0.08545  CYTL1
rs2034172  3  55386803  g  a  0.68  0.018  2.50E-08  0.0247  0.08616  WNT5A
rs4686904  3  188921216  c  t  0.35  0.022  1.00E-12  -0.0232  0.08732  BCL6
rs8103992  19  19526643  a  c  0.2  0.024  3.60E-10  0.0274  0.08779  PBX4
rs2338115  17  34183104  t  c  0.54  0.024  1.10E-16  0.0222  0.08954  PIP4K2B
rs1461503  11  122350285  c  a  0.57  0.018  3.70E-10  -0.0222  0.09167  BSX
rs2093210  14  60027032  c  t  0.42  0.039  7.50E-36  0.022  0.09649  C14orf39
rs9967417  18  45213498  g  c  0.43  0.037  1.20E-32  0.0218  0.101  DYM
rs2275325  1  202067358  c  g  0.28  0.019  5.00E-09  -0.0241  0.1031  ZC3H11A
rs3885668  2  10095930  c  t  0.43  0.022  6.90E-14  0.022  0.1049  KLF11
rs6955948  7  150139653  t  c  0.28  0.031  8.80E-20  0.0242  0.1052  TMEM176A
rs4868126  5  171216074  g  t  0.6  0.025  2.70E-11  -0.025  0.108  FBXW11
rs2633761  3  4703104  a  g  0.5  0.017  3.70E-09  0.021  0.1086  ITPR1
rs820848  5  74000416  g  a  0.29  0.021  3.60E-09  0.0239  0.1169  HEXB
rs1681630  11  47925728  t  c  0.34  0.031  1.10E-23  -0.0214  0.1218  PTPRJ
rs422421  5  176449932  c  t  0.78  0.034  1.70E-20  -0.0243  0.1277  FGFR4
rs17574650  5  42472673  c  a  0.11  0.036  1.50E-11  -0.0353  0.1278  GHR
rs3802758  11  45892611  a  g  0.94  0.041  5.10E-10  -0.0349  0.1346  PEX16
rs1562975  4  109628057  a  g  0.3  0.025  4.00E-15  -0.0217  0.1376  RPL34
rs7659107  4  114961698  g  a  0.23  0.024  6.60E-12  -0.024  0.1385  CAMK2D
rs10997979  10  69607198  g  a  0.5  0.018  3.50E-10  0.0196  0.1405  MYPN
rs862034  14  74060499  g  a  0.64  0.03  2.60E-23  -0.0199  0.1408  LTBP2
rs6441170  3  159289654  c  t  0.38  0.022  8.60E-14  -0.0198  0.1409  SHOX2
rs6694089  1  170350504  a  g  0.28  0.027  2.00E-13  0.0211  0.141  DNM3
rs6061231  20  60390312  c  a  0.72  0.02  1.70E-10  -0.0212  0.1436  RPS21
rs11144688  9  77732106  g  a  0.89  0.064  5.90E-24  -0.03  0.1437  PCSK5
rs10767838  11  30304503  a  g  0.72  0.025  1.80E-14  0.0212  0.1457  C11orf46
rs17792664  14  20960523  g  c  0.14  0.033  2.70E-14  -0.0275  0.1459  CHD8
rs12209223  6  76221309  a  c  0.12  0.046  1.90E-20  -0.0321  0.1466  FILIP1
rs2326458  16  83545180  c  a  0.25  0.022  4.50E-10  -0.0218  0.1478  ZDHHC7
rs6584575  10  105567399  a  g  0.1  0.032  1.20E-09  -0.0323  0.1492  SH3PXD2A
rs291979  10  121119787  a  g  0.23  0.03  5.50E-18  -0.023  0.1492  GRK5
rs17410035  5  31576899  t  g  0.33  0.017  1.70E-08  -0.0205  0.1524  C5orf22
rs1935157  1  219383881  g  c  0.3  0.024  3.10E-14  -0.0206  0.1525  HLX
rs9816693  3  38022958  c  g  0.17  0.031  3.60E-15  0.0248  0.1548  VILL
rs8052560  16  87304743  a  c  0.79  0.037  8.40E-17  -0.0235  0.1634  C16orf84
rs17511102  2  37814117  t  a  0.09  0.049  2.80E-17  -0.0334  0.1665  CDC42EP3
rs17264185  15  64784141  g  a  0.27  0.021  1.30E-10  -0.0203  0.1679  SMAD6
rs7971536  12  100897919  t  a  0.54  0.028  5.00E-18  0.0185  0.1685  CCDC53
rs181338  9  88297981  t  c  0.51  0.029  5.70E-24  -0.0178  0.1697  ZCCHC6
rs4735677  8  78310746  t  a  0.28  0.036  1.20E-29  -0.0203  0.1706  PXMP3
rs552707  7  28171828  t  c  0.31  0.047  7.40E-49  -0.02  0.1752  JAZF1
rs2956605  8  76045609  a  c  0.38  0.027  1.70E-17  -0.0186  0.1891  CRISPLD1
rs13177718  5  108141243  c  t  0.92  0.054  5.40E-19  -0.0337  0.1902  FER
rs2834442  21  34612656  a  t  0.64  0.024  5.70E-15  -0.0179  0.1907  KCNE2
rs3809790  17  24979666  c  t  0.53  0.022  1.50E-13  0.0173  0.1914  SSH2
rs6894139  5  88363538  t  g  0.56  0.031  4.50E-25  -0.0175  0.1918  MEF2C
rs11612228  12  447245  t  c  0.38  0.02  3.70E-10  -0.0192  0.1925  B4GALNT3
rs2164747  12  102868966  g  a  0.1  0.028  9.10E-09  -0.0278  0.1969  HSP90B1
rs301901  5  37082383  a  g  0.57  0.026  4.50E-19  0.0172  0.1997  NIPBL
rs12228415  12  14411968  g  a  0.45  0.017  2.70E-08  -0.0173  0.2032  ATF7IP
rs7727731  5  64710202  t  c  0.11  0.033  2.10E-11  -0.0259  0.2171  ADAMTS6
rs9858528  3  184838099  a  g  0.74  0.021  8.50E-11  -0.0179  0.2196  KLHL24
rs9766  17  38106367  a  g  0.54  0.021  2.40E-13  -0.0161  0.2203  EZH1
rs12214804  6  34296844  c  t  0.08  0.087  1.60E-52  0.0302  0.2221  HMGA1
rs8756  12  64646019  c  a  0.49  0.054  1.30E-71  -0.0163  0.2235  HMGA2
rs17450430  20  47205671  t  a  0.24  0.034  6.20E-24  -0.0188  0.2251  STAU1
rs8180991  8  126569532  c  g  0.77  0.029  2.80E-16  -0.0193  0.2274  TRIB1
rs891088  19  7135762  g  a  0.26  0.027  1.30E-15  -0.0177  0.229  INSR
rs273945  7  137262106  c  a  0.58  0.018  2.90E-09  0.0164  0.2322  CREB3L2
rs17806888  3  67499012  t  c  0.88  0.033  4.30E-12  0.0236  0.2343  SUCLG2
rs2211866  21  38609977  a  g  0.41  0.022  1.90E-13  -0.0157  0.2349  KCNJ15
rs9977276  21  46260755  g  t  0.78  0.023  1.30E-10  0.0185  0.2354  COL6A1
rs12855  1  51212681  t  c  0.09  0.036  1.00E-12  0.0266  0.2363  CDKN2C
rs584828  17  35852756  c  t  0.6  0.028  3.30E-20  0.0159  0.2396  IGFBP4
rs7043114  9  94427804  c  t  0.44  0.028  1.30E-22  -0.0154  0.2401  IPPK
rs1812175  4  145794294  g  a  0.84  0.052  8.40E-30  -0.0211  0.2404  HHIP
rs11867479  17  65601802  t  c  0.35  0.025  2.00E-15  -0.0164  0.243  KCNJ16
rs11616380  13  79603316  t  g  0.28  0.02  1.20E-09  -0.0176  0.2431  SPRY2
rs955748  4  184452669  g  a  0.76  0.028  4.80E-16  0.0184  0.2436  WWC2
rs2117563  17  70880580  g  a  0.83  0.025  2.10E-10  -0.0196  0.2467  GRB2
rs4548838  15  98578713  t  c  0.46  0.034  9.10E-30  -0.0152  0.2487  ADAMTS17
rs12190423  6  72259432  g  c  0.62  0.016  4.30E-08  0.016  0.2497  OGFRL1
rs11152213  18  56003928  c  a  0.25  0.025  9.20E-13  -0.0176  0.2518  MC4R
rs9880211  3  137590239  g  a  0.75  0.032  1.30E-20  -0.0173  0.2531  STAG1
rs4974480  3  135661252  t  a  0.68  0.037  5.70E-23  -0.0158  0.2552  ANAPC13
rs12470505  2  219616613  t  g  0.9  0.046  4.00E-20  -0.0246  0.256  CCDC108
rs4875421  8  4814740  t  a  0.46  0.019  1.10E-10  0.0153  0.2567  CSMD1
rs4725061  7  8053164  g  a  0.44  0.02  1.50E-10  0.016  0.2569  GLCCI1
rs7181724  15  92352611  g  a  0.45  0.02  2.40E-10  0.0155  0.2573  MCTP2
rs7259684  19  12047611  g  a  0.07  0.039  1.70E-09  0.0271  0.2587  LOC729747
rs7567851  2  178392966  c  g  0.08  0.039  2.20E-12  0.0268  0.2604  PDE11A
rs10083886  17  67434950  t  c  0.26  0.02  3.40E-09  -0.0163  0.2621  SOX9
rs11047239  12  24099047  g  c  0.3  0.022  4.10E-12  0.0166  0.2622  SOX5
rs7319045  13  90822575  a  g  0.39  0.024  3.70E-15  0.0156  0.2668  GPC5
rs1113765  7  55856828  g  a  0.82  0.025  1.00E-10  0.019  0.2709  SEPT14
rs6949739  7  46383928  t  a  0.91  0.037  4.80E-12  0.0285  0.2725  IGFBP3
rs10131337  14  36214267  t  c  0.24  0.027  5.60E-13  0.0173  0.2743  PAX9
rs16834765  1  32144029  t  c  0.06  0.045  1.40E-12  -0.0305  0.2747  PTP4A2
rs11750568  5  178468319  a  g  0.33  0.019  4.20E-10  -0.0153  0.2805  ADAMTS2
rs12513181  4  124055106  c  a  0.26  0.019  2.10E-08  0.0161  0.285  NUDT6
rs6879260  5  179663620  c  t  0.61  0.027  1.10E-17  -0.0146  0.286  GFPT2
rs13088462  3  51046753  c  t  0.06  0.053  1.10E-14  0.0322  0.2916  DOCK3
rs4239020  17  77769930  c  t  0.33  0.021  1.50E-11  0.0146  0.2968  CCDC57
rs17113369  1  95559811  t  c  0.97  0.07  2.40E-08  0.0374  0.2995  RWDD3
rs4656220  1  168915901  t  c  0.39  0.022  7.50E-12  0.0146  0.2998  PRRX1
rs6794009  3  61488535  g  a  0.44  0.016  2.80E-08  -0.0134  0.3011  PTPRG
rs2306694  12  54966903  g  a  0.07  0.047  1.20E-16  0.0277  0.3015  CS
rs6920372  6  109830632  g  a  0.59  0.026  5.70E-19  0.0138  0.3081  PPIL6
rs2662027  5  56290242  g  t  0.9  0.032  1.40E-11  0.022  0.3082  MIER3
rs10880969  12  45113290  c  t  0.7  0.023  1.10E-12  0.0146  0.3117  SLC38A2
rs14062  18  17704301  g  a  0.67  0.018  1.60E-08  0.014  0.3175  MIB1
rs7834383  8  13317848  t  g  0.36  0.021  1.90E-11  -0.0144  0.3186  DLC1
rs2581830  3  53109138  t  c  0.4  0.025  7.60E-16  0.013  0.328  RFT1
rs2748483  6  146377253  a  t  0.55  0.018  2.40E-09  -0.0132  0.3281  GRM1
rs3782089  11  65093395  c  t  0.94  0.053  1.00E-15  -0.0258  0.3288  SSSCA1
rs199515  17  42211804  c  g  0.8  0.023  1.60E-09  0.0163  0.3297  WNT3
rs6696239  1  225816691  g  a  0.81  0.038  2.80E-25  -0.0164  0.3299  ZNF678
rs6420435  16  80741702  a  c  0.21  0.025  1.80E-11  -0.0151  0.3308  MPHOSPH6
rs2306596  4  39020335  a  c  0.52  0.02  1.80E-11  0.0132  0.3354  RFC1
rs316618  15  39583790  t  a  0.78  0.026  9.80E-13  -0.0155  0.3393  LTK
rs724016  3  142588260  g  a  0.44  0.078  1.10E-156  0.0123  0.3484  ZBTB38
rs12621643  2  223626227  g  t  0.7  0.019  1.70E-08  -0.0133  0.3489  KCNE4
rs7692995  4  17545732  t  c  0.85  0.101  5.20E-100  -0.0169  0.3516  LCORL
rs1265097  6  31214438  c  a  0.89  0.04  6.50E-15  0.02  0.3538  PSORS1C1
rs7033940  9  6430419  g  c  0.87  0.024  3.80E-08  0.0182  0.354  UHRF2
rs692964  18  13084132  g  a  0.4  0.019  2.30E-10  -0.0123  0.3556  CEP192
rs1036821  8  135719665  g  a  0.7  0.047  2.80E-38  -0.0137  0.3563  ZFAT
rs780094  2  27594741  c  t  0.61  0.021  7.50E-12  0.0122  0.3564  GCKR
rs7466269  9  132453905  a  g  0.64  0.033  1.70E-26  -0.0126  0.3606  FUBP3
rs989393  9  100783157  t  c  0.71  0.023  5.80E-13  -0.013  0.3611  COL15A1
rs26024  5  127723921  c  a  0.35  0.023  3.90E-13  -0.0128  0.3654  FBN2
rs4640244  17  21224816  a  g  0.61  0.025  6.60E-14  -0.0136  0.3668  KCNJ12
rs11221442  11  128082834  g  c  0.75  0.027  3.00E-14  -0.0137  0.3672  FLI1
rs11683207  2  97699722  t  c  0.8  0.024  5.70E-09  -0.0183  0.3723  ZAP70
rs12987566  2  171860892  t  c  0.27  0.023  1.30E-12  0.013  0.377  METTL8
rs1832871  6  158642022  a  g  0.34  0.021  9.20E-12  -0.0124  0.3786  TULP4
rs3739707  9  112832527  c  a  0.75  0.024  1.60E-11  -0.0133  0.3793  LPAR1
rs870183  17  546561  g  a  0.53  0.017  3.80E-09  0.0114  0.381  VPS53
rs4812586  20  34978087  a  g  0.84  0.033  1.70E-16  -0.0163  0.3847  SAMHD1
rs2961830  5  50490489  a  t  0.35  0.019  1.10E-09  0.0121  0.3886  ISL1
rs1036477  15  46702218  a  g  0.9  0.032  2.80E-11  -0.0179  0.39  FBN1
rs354196  2  54819911  g  a  0.53  0.021  1.90E-12  -0.0112  0.3937  SPTBN1
rs17038954  2  1624680  t  c  0.06  0.04  1.10E-10  -0.0241  0.3959  PXDN
rs2510396  11  68174228  c  g  0.86  0.029  2.60E-12  0.0158  0.3969  GAL
rs5742915  15  72123686  c  t  0.47  0.038  1.20E-34  0.0115  0.4033  PML
rs34651  5  72179761  c  t  0.08  0.042  4.20E-13  -0.0207  0.4079  TNPO1
rs13416119  2  42316434  a  g  0.9  0.029  4.90E-09  0.0197  0.409  EML4
rs7273787  20  4046567  g  a  0.35  0.022  3.00E-12  -0.0113  0.411  SMOX
rs6974574  7  38076598  t  a  0.69  0.031  2.30E-19  -0.0116  0.4115  STARD3NL
rs10779751  1  11206923  a  g  0.28  0.02  5.80E-10  0.0118  0.4139  FRAP1
rs6952113  7  120564855  g  a  0.62  0.018  1.10E-09  -0.0112  0.4145  C7orf58
rs738288  22  38237607  g  a  0.47  0.019  1.50E-10  0.0106  0.4151  SMCR7L
rs1047014  6  19949472  c  t  0.25  0.033  7.50E-20  0.0135  0.4209  ID4
rs17807185  7  77146231  g  a  0.38  0.022  3.30E-13  0.0107  0.4358  RSBN1L
rs12904334  15  70629759  a  g  0.02  0.094  1.50E-13  0.0417  0.4384  ARIH1
rs3132297  9  136441687  g  a  0.83  0.024  6.40E-09  0.0136  0.4409  RXRA
rs2815379  1  67283062  g  a  0.71  0.018  2.50E-08  -0.011  0.4445  SLC35D1
rs1923367  10  80802835  g  c  0.52  0.029  3.20E-22  0.0104  0.4457  ZCCHC24
rs7177711  15  60167263  a  g  0.54  0.021  1.60E-13  0.0101  0.4471  FAM148A
rs6988484  8  49576333  c  t  0.25  0.023  4.20E-12  -0.0116  0.4482  EFCAB1
rs2057291  20  56905438  a  g  0.34  0.02  4.80E-10  0.0104  0.4508  GNAS
rs12137162  1  19635983  a  c  0.28  0.019  4.90E-09  0.0109  0.4535  CAPZB
rs1550162  8  117632713  g  a  0.29  0.024  2.90E-14  0.011  0.4641  EIF3H
rs6813055  4  88849055  a  t  0.49  0.017  5.50E-09  0.0098  0.4663  DMP1

Table 4.1 (Continued) Height Rsid rs4072910 rs7551732 rs1401795 rs4425077 rs9841435 rs4350272 rs7980687 rs888403 rs564914 rs17122659 rs749234 rs4605213 rs567401 rs7743622 rs9292468 rs9434723 rs3812040 rs757081 rs9309101 rs11687941 rs7112925 rs11835818 rs11090631 rs17783015 rs318095 rs7849585 rs16964211 rs817300 rs7567288 rs486359 rs17250196 rs2023693 rs2682587 rs429433 rs1659127 rs2145357 rs763318 Chr 19 1 17 2 3 10 12 18 1 12 2 17 1 6 5 1 5 11 2 2 11 12 22 12 17 9 15 9 2 6 7 16 19 8 16 6 4 Position 8550031 88911629 52194651 216118761 192593854 25096124 122388664 2756938 47687820 58243190 144947819 46599746 85760746 132772065 32854830 9214869 39461777 17308259 43483116 241840083 66582736 120979192 44225035 88755517 44329733 138251691 49317787 97420043 134151294 160694431 99655132 20787541 48774269 8785304 14295806 116558135 12572672 Ref g a a g g a a g t g a c t g t a t g g c c c t c t t g g c c t g a a a g g Alt c t g c a g g a a a g g c c c g c c a g t t c t c g a a t g g a c g g a a Reffreq 0.56 0.61 0.51 0.41 0.32 0.28 0.2 0.36 0.39 0.12 0.32 0.34 0.17 0.58 0.4 0.16 0.72 0.34 0.33 0.75 0.64 0.49 0.2 0.84 0.46 0.33 0.95 0.93 0.2 0.49 0.07 0.6 0.2 0.05 0.34 0.27 0.53 Effect Size 0.031 0.027 0.03 0.02 0.019 0.02 0.036 0.019 0.025 0.032 0.018 0.019 0.028 0.018 0.053 0.028 0.024 0.024 0.02 0.025 0.023 0.017 0.022 0.025 0.023 0.036 0.044 0.07 0.028 0.017 0.044 0.017 0.024 0.046 0.03 0.022 0.025 P-value 9.90E-18 2.50E-19 5.00E-25 1.40E-11 6.10E-10 2.00E-09 1.30E-21 1.00E-08 4.10E-17 6.80E-11 1.30E-08 2.00E-09 3.10E-11 4.60E-08 4.80E-46 8.60E-12 3.50E-13 3.20E-14 3.20E-10 4.00E-13 2.60E-14 4.30E-09 1.50E-08 5.20E-10 3.30E-15 9.80E-29 4.80E-09 2.20E-23 3.80E-13 1.60E-08 8.70E-10 1.40E-08 4.30E-10 6.70E-11 1.20E-19 1.70E-11 4.40E-17 Sitting height ratio Effect Size -0.0101 -0.0096 -0.0093 -0.0095 0.0099 0.0108 0.0118 -0.0104 0.0096 -0.0154 0.0099 -0.0098 -0.0132 -0.0094 -0.0092 -0.0122 0.0102 -0.0092 0.0092 -0.01 -0.0089 0.0088 -0.011 0.0122 0.0083
0.0091 0.0188 -0.0177 -0.0107 0.0085 0.0205 0.0081 -0.0103 -0.0228 -0.0087 0.0091 -0.008 P-value 0.4685 0.471 0.4711 0.4726 0.4745 0.4747 0.4763 0.4772 0.4773 0.4782 0.4809 0.482 0.4906 0.4917 0.4987 0.5023 0.5048 0.5057 0.5068 0.5077 0.5116 0.5121 0.5143 0.5169 0.5228 0.5248 0.5248 0.5274 0.529 0.5299 0.5335 0.5388 0.5442 0.5449 0.546 0.5461 0.5476 Closest Gene ADAMTS10 PKN2 C17orf67 FN1 CCDC50 ARHGAP21 SBNO1 SMCHD1 FOXD2 SLC16A7 ZEB2 NME1NME2 DDAH1 MOXD1 C5orf23 H6PD DAB2 NUCB2 THADA HDLBP RHOD BCL7A RIBC2 ATP2B1 ATP5G1 QSOX2 CYP19A1 PTCH1 NAP5 SLC22A3 GATS/ DCUN1D3 XRCC1 MFHAS1 MKL2 NT5DC1 RAB28 155 Table 4.1 (Continued) Height Rsid rs32855 rs2631676 rs2974438 rs4883972 rs2806561 rs2856321 rs12871822 rs7162825 rs6435143 rs7027110 rs4896582 rs497273 rs12882130 rs6540834 rs6561319 rs11855014 rs2013265 rs165189 rs6962887 rs10748128 rs39623 rs2715094 rs2289195 rs4986172 rs7652177 rs540652 rs6688100 rs3818416 rs7899004 rs9993613 rs6761041 rs16895130 rs12474201 rs6971575 rs1996422 rs10790381 rs2123731 Chr 5 10 5 13 1 12 13 15 2 9 6 12 14 1 13 15 8 5 7 12 5 7 2 17 3 2 1 13 10 4 2 6 2 7 4 11 19 Position 79871948 93027389 168183481 73956482 23377382 11747040 48099041 61226239 202902501 108638867 142745570 119689065 102948527 212694042 46010121 83529838 24148445 139125931 134696326 68113925 129082520 50697946 25316987 40571807 173451771 169415674 158666210 77372469 104331425 73694878 224738373 42032909 46774789 95877584 48382108 119762705 4880473 Ref a g g c a g g t a a g c c c a g c g t t a g a c g t t c t t t g a c g a a Alt g a a g g a t c c g a g g t c a t a g g t a g t c c c a c g c a g g a g g Reffreq 0.78 0.19 0.8 0.55 0.57 0.36 0.34 0.5 0.44 0.23 0.7 0.38 0.63 0.66 0.64 0.71 0.75 0.15 0.68 0.35 0.08 0.25 0.43 0.65 0.51 0.46 0.48 0.78 0.56 0.47 0.55 0.28 0.36 0.29 0.28 0.82 0.73 Effect Size 0.024 0.028 0.038 0.019 0.027 0.031 0.018 0.016 0.019 0.032 0.051 0.019 0.024 0.028 0.021 0.022 0.027 0.031 0.022 0.038 0.045 0.021 0.042 0.038 0.037 0.021 0.016 0.021 0.024 
0.03 0.024 0.025 0.029 0.022 0.022 0.027 0.025 P-value 4.50E-11 1.50E-12 3.90E-26 1.70E-10 2.70E-21 1.00E-25 2.70E-09 2.80E-08 2.40E-10 2.30E-20 6.90E-58 2.80E-10 2.90E-14 1.70E-17 1.50E-11 1.40E-10 9.20E-17 2.70E-11 9.60E-11 4.60E-29 7.20E-17 1.20E-09 3.00E-34 1.60E-31 1.00E-36 6.20E-13 2.20E-08 4.60E-09 4.30E-17 7.80E-25 1.70E-16 2.00E-14 1.70E-20 3.60E-10 1.30E-11 1.20E-12 6.80E-13 Sitting height ratio Effect Size 0.0098 0.0103 -0.0098 0.008 0.0076 -0.0081 -0.0082 0.0073 0.0074 -0.0087 -0.0082 -0.0079 -0.0077 0.0073 -0.0076 -0.0078 0.008 0.0104 0.0078 0.0078 0.0129 -0.008 -0.0068 0.0069 0.0065 -0.0065 -0.0064 0.0078 -0.0066 -0.0065 -0.0062 -0.0072 -0.0065 -0.007 -0.0073 -0.0081 -0.0067 P-value 0.5518 0.5539 0.5541 0.5565 0.5588 0.5603 0.5645 0.5716 0.5737 0.5738 0.5769 0.5774 0.5819 0.5822 0.5827 0.5926 0.6017 0.6018 0.6019 0.6033 0.6075 0.6112 0.613 0.6196 0.6207 0.6215 0.6225 0.6246 0.6268 0.6312 0.6318 0.6319 0.6333 0.6408 0.6428 0.643 0.6462 Closest Gene FAM151B PCGF5 SLIT3 KLF12 LUZP1 ETV6 CYSLTR2 LACTB NOP5 ZNF462 GPR126 SPPL3 MARK3 PTPN14 LRCH1 PDE8A ADAM28 PSD2 CNOT4 FRS2 ADAMTS19 GRB10 DNMT3A ACBD4 FNDC3B NOSTRIN VANGL2 EDNRB SUFU ADAMTS3 SERPINE2 CCND3 SOCS5 SLC25A13 FRYL ARHGEF12 UHRF1 156 Table 4.1 (Continued) Height Rsid rs6902771 rs6658763 rs6439168 rs2120335 rs6080830 rs2149163 rs4843367 rs10152739 rs13113518 rs17556750 rs10972628 rs11156098 rs7007200 rs3812423 rs11648796 rs11799609 rs10794175 rs11880992 rs932445 rs4601530 rs2072268 rs11616067 rs7716219 rs4624820 rs12693589 rs2302580 rs992157 rs761391 rs2811594 rs2058092 rs6714546 rs17330192 rs10883563 rs2175513 rs7853235 rs3923086 rs1980850 Chr 6 1 3 2 20 9 16 15 4 4 9 6 8 8 16 1 10 19 6 1 17 12 5 5 2 4 2 6 1 14 2 6 10 3 9 17 14 Position 152199574 145158997 130533633 68348506 17719113 16445833 84975391 36271158 56094405 82374592 35927611 156629523 109854114 25354627 732191 241684940 126348063 2127403 2112224 24916698 63814947 114877557 54990828 141661972 191540907 8659534 218863025 85504822 
93115870 73002719 33214929 17697354 102674370 68705056 85850602 60979950 67716941 Ref t c g g a c c t c a g t g g g t t a t c g a t a c c a c g t g c a g t c g Alt c t a a g g t a t c a c c c a g g g c t a g c g t t g t a c a t c a c a a Reffreq 0.46 0.92 0.79 0.59 0.56 0.4 0.66 0.25 0.36 0.31 0.74 0.12 0.69 0.64 0.25 0.16 0.43 0.4 0.59 0.74 0.52 0.76 0.31 0.52 0.25 0.58 0.57 0.46 0.63 0.56 0.72 0.28 0.55 0.43 0.2 0.6 0.83 Effect Size 0.029 0.034 0.038 0.018 0.016 0.02 0.02 0.023 0.017 0.044 0.02 0.029 0.017 0.021 0.034 0.026 0.021 0.032 0.021 0.026 0.021 0.02 0.029 0.018 0.022 0.029 0.018 0.021 0.023 0.017 0.035 0.019 0.023 0.017 0.029 0.025 0.029 P-value 1.50E-21 5.10E-10 5.20E-26 8.40E-10 1.50E-08 2.90E-11 7.10E-10 3.00E-11 1.60E-08 2.80E-43 1.20E-08 5.40E-10 4.70E-08 1.00E-12 7.70E-19 7.00E-10 4.00E-12 1.10E-26 1.10E-11 5.90E-15 1.70E-11 1.00E-08 2.50E-21 1.80E-10 5.50E-11 1.20E-15 1.60E-09 6.10E-10 3.10E-13 8.40E-09 2.40E-24 1.20E-08 3.20E-15 2.00E-08 8.80E-15 2.80E-14 2.40E-13 Sitting height ratio Effect Size 0.0062 0.011 -0.0071 0.0059 -0.0058 0.0059 -0.006 -0.0064 0.0059 -0.0061 -0.006 -0.0088 0.0059 0.0056 -0.0068 0.0073 0.0052 -0.0051 0.0053 0.0056 -0.0053 0.0059 0.0055 -0.005 0.0055 -0.0052 -0.0048 -0.0049 0.0048 0.0047 0.0051 -0.0052 0.0045 0.0044 0.0053 -0.0046 -0.0053 P-value 0.6485 0.6504 0.6545 0.6546 0.6559 0.656 0.6634 0.6699 0.6711 0.6781 0.6846 0.6867 0.6896 0.6905 0.6928 0.6986 0.6991 0.7003 0.7006 0.704 0.7058 0.709 0.7092 0.7095 0.7114 0.7155 0.7166 0.7183 0.7187 0.7243 0.7327 0.7348 0.7387 0.7396 0.7503 0.7511 0.7554 Closest Gene ESR1 FMO5 H1FX PPP3R1 BANF2 BNC2 FOXF1 SPRED1 CLOCK PRKG2 OR2S2 ARID1B TMEM74 KCTD9 NARFL SDCCAG8 FAM53B DOT1L GMDS CLIC4 ARSG MED13L SLC38A9 SPRY4 STAT1 CPZ PNKD TBX18 FAM69A NUMB LTBP1 FAM8A1 FAM178A FAM19A1 RMI1 AXIN2 RAD51L1 157 Table 4.1 (Continued) Height Rsid rs2298265 rs1155939 rs7253628 rs1325596 rs6746356 rs11618507 rs975210 rs929637 rs1571892 rs12669267 rs6838153 rs9835332 rs8058684 rs7261425 rs999599 
rs8097893 rs12639764 rs1074683 rs1420023 rs2284746 rs7154721 rs2345835 rs12435366 rs1552173 rs6600365 rs833152 rs12538407 rs11624136 rs2829941 rs632124 rs13006748 rs568610 rs2280470 rs9217 rs1950500 rs4785393 rs806794 Chr 1 6 19 1 2 13 15 7 9 7 4 3 16 20 9 18 4 20 12 1 14 2 14 17 1 2 7 14 21 11 2 8 15 17 14 16 6 Position 149525667 126907826 35739109 175060689 174524144 29070751 68151406 12243047 93298657 72942572 122940449 56642722 52072619 20016635 116051416 73112043 106435654 31768314 12767378 17179262 91497101 18438433 34908140 74230437 41328840 182927346 23487841 58758573 26130806 118118445 20015300 27583914 87196630 7303812 23900690 48816984 26308656 Ref c a g a a t a g c c g g a c t a t c c g t c c c c c a a t a c t a c t g a Alt t c a g c g g t a t a c g g c g c g g c c t t t t a g g g t g c g t c a g Reffreq 0.88 0.5 0.16 0.57 0.75 0.25 0.18 0.78 0.29 0.87 0.34 0.54 0.3 0.71 0.37 0.95 0.62 0.76 0.88 0.52 0.57 0.54 0.73 0.46 0.43 0.42 0.6 0.5 0.61 0.42 0.3 0.24 0.33 0.37 0.3 0.16 0.71 Effect Size 0.03 0.042 0.024 0.025 0.019 0.023 0.034 0.022 0.017 0.029 0.021 0.028 0.021 0.021 0.017 0.044 0.027 0.047 0.028 0.04 0.027 0.019 0.023 0.018 0.027 0.016 0.043 0.017 0.017 0.022 0.023 0.023 0.031 0.03 0.031 0.023 0.055 P-value 6.90E-11 1.30E-47 6.20E-10 2.10E-18 1.80E-08 1.80E-10 7.90E-17 1.90E-10 5.00E-08 2.60E-08 7.90E-12 3.00E-22 6.40E-11 5.10E-10 9.70E-09 1.30E-10 5.00E-20 2.40E-42 1.60E-08 1.20E-40 1.30E-20 3.40E-10 3.60E-11 2.00E-10 9.90E-21 4.00E-08 1.00E-35 3.30E-09 3.20E-08 2.20E-14 1.00E-11 1.20E-11 5.50E-21 4.40E-23 2.70E-22 1.80E-08 7.80E-59 Sitting height ratio Effect Size -0.0061 0.0041 0.0054 0.0039 0.0045 0.0049 0.0053 0.0046 0.0041 -0.0065 -0.0039 -0.0035 0.0038 0.0039 -0.0036 0.0088 0.0036 0.004 0.0056 0.0033 0.0032 -0.0033 0.0038 -0.0031 -0.0031 0.003 -0.0031 -0.0029 -0.003 0.0029 0.003 -0.0031 0.0028 -0.0026 0.0028 0.0032 -0.0028 P-value 0.7574 0.7606 0.7612 0.7621 0.7647 0.7678 0.7698 0.7731 0.7733 0.7742 0.7831 0.7845 0.787 0.7874 0.7884 0.7893 
0.7913 0.7939 0.7965 0.8027 0.8072 0.809 0.8099 0.8137 0.8151 0.8195 0.8212 0.8215 0.8222 0.8267 0.8387 0.8416 0.8419 0.846 0.8481 0.8497 0.8512 Closest Gene ZNF687 C6orf173 ZNF536 PAPPA2 SP3 SLC7A1 TLE3 TMEM106B NFIL3 WBSCR28 EXOSC9 C3orf63 RBL2 C20orf26 COL27A1 GALR1 TET2 PXMP4 CDKN1B MFAP2 TRIP11 RDH14 NFKBIA PSCD1 SCMH1 PDE1A IGF2BP3 DAAM1 APP DDX6 WDR35 SCARA3 ACAN ZBTB4 NFATC4 PAPD5 HIST1H2BF 158 Table 4.1 (Continued) Height Rsid rs2059877 rs3958122 rs7284476 rs9291926 rs4733724 rs7740107 rs11049611 rs991946 rs1582931 rs8017130 rs13388725 rs10863936 rs6911389 rs1599473 rs9395264 rs10995319 rs4332428 rs3915129 rs11783655 rs11684404 rs1544196 rs13150868 rs9392918 rs10780910 rs3118905 rs8103068 rs2247870 rs915506 rs7069985 rs1945237 rs1576900 rs2781373 rs425277 rs17391694 rs8102380 rs11779459 rs822531 Chr 19 4 22 5 8 6 12 6 5 14 2 1 6 8 6 10 10 3 8 2 1 4 6 9 13 19 5 10 10 11 9 14 1 1 19 8 7 Position 52880621 1663729 36459278 67635412 130792910 130416154 28491511 166249852 122685098 22828996 108413622 210304421 144121322 120544539 47582981 52432893 4955434 41218746 145109561 88705737 222699405 152400121 7653630 90039075 50003335 17383869 90187345 97795064 27930837 55986645 18619792 64637968 2059032 78396214 10662185 124049732 148260692 Ref t t a t a t c c g g g g t g g t a g t c g t c t g t a g g c g g t t g t t Alt g c g g g a t t a a a a g t t c g t a t a g t a a c g a a t a a c c a c c Reffreq 0.26 0.35 0.43 0.49 0.8 0.26 0.7 0.52 0.52 0.69 0.41 0.47 0.35 0.75 0.68 0.76 0.88 0.47 0.61 0.34 0.77 0.44 0.47 0.43 0.72 0.86 0.55 0.65 0.25 0.09 0.7 0.62 0.28 0.12 0.31 0.35 0.78 Effect Size 0.019 0.027 0.016 0.019 0.05 0.034 0.037 0.022 0.028 0.023 0.018 0.021 0.018 0.026 0.02 0.019 0.036 0.016 0.019 0.032 0.019 0.018 0.041 0.028 0.044 0.032 0.017 0.019 0.023 0.03 0.019 0.021 0.028 0.04 0.021 0.018 0.035 P-value 2.30E-08 1.20E-17 4.60E-08 9.30E-10 2.60E-42 5.10E-22 6.30E-30 6.80E-14 2.70E-20 6.50E-12 2.40E-09 9.00E-13 1.40E-08 4.10E-14 1.10E-10 2.20E-08 1.10E-15 
3.80E-08 5.40E-10 2.30E-25 2.80E-08 1.20E-09 2.40E-43 6.20E-21 1.60E-33 5.00E-12 1.60E-08 1.50E-10 1.30E-11 8.70E-09 6.50E-09 2.90E-12 4.80E-17 4.00E-14 5.90E-12 3.40E-08 1.10E-18 Sitting height ratio Effect Size -0.0027 -0.0025 -0.0024 0.0024 -0.003 0.0027 -0.0025 -0.0023 0.0023 -0.0025 0.0022 -0.0022 -0.0024 -0.0024 -0.0022 0.0022 -0.0028 0.0017 0.0018 -0.0018 -0.002 -0.0016 0.0015 0.0014 0.0016 0.0021 -0.0013 0.0013 -0.0014 -0.0021 -0.0012 0.0011 -0.0011 -0.0015 0.0008 -0.0008 0.001 P-value 0.8556 0.8564 0.8576 0.8582 0.8588 0.8606 0.8613 0.8628 0.8635 0.8655 0.8672 0.8677 0.8686 0.8772 0.8805 0.8871 0.8916 0.8946 0.8956 0.8967 0.8991 0.9087 0.9126 0.9148 0.9167 0.9176 0.9211 0.9275 0.9301 0.9312 0.9342 0.9376 0.9407 0.9437 0.9538 0.9573 0.9574 Closest Gene GLTSCR1 SLBP TRIOBP PIK3R1 MLZE L3MBTL3 CCDC91 T CCDC100 HOMEZ GCC2 DTL PHACTR2 NOV CD2AP PRKG1 AKR1C1 CTNNB1 PLEC1 EIF2AK3 WDR26 ESSPL BMP6 SPIN1 DLEU7 BST2 GPR98 CCNJ RAB18 OR5M9 ADAMTSL1 MAX PRKCZ GIPC2 ILF3 ZHX2 EZH2

Table 4.1 (Continued)
Rsid        Chr  Position   Ref  Alt  Reffreq  Height Effect Size  Height P-value  SHR Effect Size  SHR P-value  Closest Gene
rs720390    3    187031377  a    g    0.38     0.068               8.70E-58        -0.0007          0.9626       IGF2BP2
rs12330322  3    72538045   c    t    0.78     0.034               3.50E-22        0.0007           0.9668       RYBP
rs991967    1    216682074  c    a    0.28     0.038               4.10E-32        0.0006           0.9687       TGFB2
rs3763631   9    35798334   c    g    0.69     0.021               3.40E-11        0.0005           0.9711       NPR2
rs897080    2    44627706   c    t    0.26     0.033               2.60E-21        -0.0004          0.9778       C2orf34
rs7162542   15   82305294   g    c    0.55     0.03                7.70E-16        -0.0003          0.9806       ADAMTSL3
rs7568069   2    71437993   g    a    0.42     0.021               1.40E-13        0.0003           0.9811       ZNF638
rs6462432   7    32902049   a    g    0.39     0.017               1.80E-08        -0.0003          0.9842       KBTBD2
rs6691924   1    54726833   t    c    0.9      0.031               4.70E-10        -0.0004          0.9853       ACOT11
rs526896    5    134384604  t    g    0.73     0.037               2.60E-27        -0.0003          0.9855       PITX1
rs2854207   17   59300839   g    c    0.27     0.04                4.20E-28        0.0001           0.9921       CSH2
rs2074977   19   3385028    c    a    0.36     0.028               4.60E-20        -0.0001          0.9927       NFIC
rs1199734   13   20468246   g    t    0.81     0.021               4.00E-08        0.0001           0.9942       LATS2
rs2237886   11   2767307    t    c    0.11     0.042               1.60E-17        -0.0001          0.9969       KCNQ1
rs8069300   17   11924957   g    c    0.47     0.016               1.70E-08        0                0.9985       MAP2K4

The 421 height-associated SNPs, with their effect sizes and P-values for height and for sitting height ratio (SHR). The reference allele has been aligned such that the effect size for height is always positive. The variants are ordered by decreasing significance of association with SHR.

large difference of SHR between people of different ancestral backgrounds, and there is more than 1 standard deviation difference between the SHR of European Americans and African Americans. While uncovering the underlying genetic reason for such a difference could improve our understanding of developmental biology during growth, differences in other phenotypes between European and African Americans could also be studied to determine whether genetics is the primary cause of the difference. For example, studies of cancer rates have shown that African Americans have significantly higher incidence rates of cancer [16], and subsequent genetic studies have uncovered common variants associated with prostate cancer that could explain the greater incidence in African Americans [17,18].

Interestingly, we observed some loci that reach genome-wide significance even with our relatively small sample size. The lead variant (rs201786365) discovered in our African American samples does not lie in any gene. The closest gene (120 kb upstream) is ABHD5 (also known as CGI-58); mutations in ABHD5 have been associated with Chanarin-Dorfman syndrome, a syndrome characterized by an inability to process triglycerides, which can lead to short stature [19]. The variants discovered in our studies of European Americans, rs140449984 (PTPRM) and rs5959358 (ITM2A), are also interesting. PTPRM, while not known to be associated with height, is associated with deletion 18p syndrome, which can lead to mental and growth retardation and craniofacial dysmorphism [20].
While the variant (rs5959358) does not lie in any gene, the closest gene, ITM2A (70 kb upstream), is found on the X chromosome, and the variant is associated with SHR in women but not in men. The locus has also been reported to be strongly associated with height [13]. This result suggests that the variant responsible plays a role in escape from dosage compensation in women, which in turn alters SHR [13,21,22].

Finally, we also show that most of the SNPs associated with height do show effects that alter SHR. While variants that increase height are slightly more likely to be associated with decreased SHR (e.g. FGFR2, CDK6), some variants that increase height are associated with increased SHR (e.g. FAM46A, WWP2). Other variants, while having a strong effect on overall height, do not seem to be associated with SHR (e.g. HMGA2, ZBTB38). These three classes of genes might cluster into distinct biological pathways that alter overall height through very different mechanisms.

In conclusion, this study is a large-scale whole-genome experiment to discover the underlying genetic basis for differences in body proportion, using the sitting height ratio (SHR) as a readout. We uncovered a few loci that are significantly associated with SHR and found that a significant number of loci associated with height also alter SHR. These results suggest that SHR is also polygenic and that further studies with larger sample sizes are required to explain the full genetic spectrum of SHR.

MATERIALS AND METHODS

Quality control (QC)
The data were downloaded from dbGaP and passed through our quality control pipeline. The QC was largely done using the PLINK [23] software. Samples that had ambiguous or incorrect gender were filtered out (using the --check-sex option in PLINK). SNPs that had a missing rate > 5% were filtered out. Samples that had > 2% missing SNPs were removed. SNPs that had minor allele frequencies < 1% were dropped.
We then examined sample heterozygosity and removed samples that were more than 4 standard deviations from the mean (using the --het option in PLINK). The SNP annotations for chromosome and base-pair positions were set to hg19 (GRCh37) coordinates using liftOver. We then calculated pairwise IBD/IBS (using the --genome option in PLINK) and removed individuals that showed excessive sharing with other individuals (PI_HAT > 0.05). The samples were then superimposed on the HapMap [24] version 3 samples by comparing principal components using SMARTPCA [25]. Samples that did not belong to the expected PCA cluster were removed. SNPs that had excessive plate effects (P < 1 x 10^-7) were dropped. For samples of European ancestry, SNPs that showed excessive deviation from Hardy-Weinberg equilibrium (P < 1 x 10^-7) were dropped.

Determining global European ancestry in African American individuals
For the African American individuals (ARIC and CARDIA cohorts), global European ancestry was estimated with SMARTPCA using the CEU and YRI samples of HapMap version 3. The CEU individuals are proxies for European ancestry, while the YRI individuals are proxies for African ancestry. The principal components were calculated using only the CEU and YRI individuals, and the ARIC and CARDIA African Americans were then projected onto them. The first principal component was taken to be the axis that represents the degree of global European admixture for each individual.

Genotype Imputation
The genotypes were phased using SHAPEIT2 [26] and imputed using IMPUTE2 [27]. The imputation panel was from the 1000 Genomes Project [28] and contained 379 Europeans, 246 Africans and African Americans, 286 Asians and 181 Latin Americans. The panel consists of approximately 22 million variants (SNPs and indels). For the X chromosome, only the non-pseudoautosomal region was imputed. Phasing and imputation were done separately for males and females.
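The threshold-based filters in the Quality control (QC) section above can be illustrated with a minimal Python sketch. This is not the PLINK implementation used in the study; it only mirrors the stated cut-offs (SNP missing rate > 5%, sample missing rate > 2%, minor allele frequency < 1%, heterozygosity more than 4 standard deviations from the mean) on small in-memory lists. The function names, the 0/1/2 genotype coding with None for a missing call, and the use of the sample standard deviation are all assumptions made for illustration.

```python
from statistics import mean, stdev

# Cut-offs as stated in the QC section; the names are hypothetical.
SNP_MISSING_MAX = 0.05     # drop SNPs with > 5% missing calls
SAMPLE_MISSING_MAX = 0.02  # drop samples with > 2% missing SNPs
MAF_MIN = 0.01             # drop SNPs with minor allele frequency < 1%
HET_SD_MAX = 4.0           # drop samples > 4 SD from mean heterozygosity


def missing_rate(calls):
    """Fraction of genotype calls that are missing (None)."""
    return sum(c is None for c in calls) / len(calls)


def minor_allele_frequency(calls):
    """MAF from calls coded as 0/1/2 copies of one allele (None = missing)."""
    observed = [c for c in calls if c is not None]
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1 - freq)


def snp_passes(calls_across_samples):
    """Apply the SNP-level missingness and MAF filters."""
    if missing_rate(calls_across_samples) > SNP_MISSING_MAX:
        return False
    return minor_allele_frequency(calls_across_samples) >= MAF_MIN


def sample_passes(calls_across_snps):
    """Apply the sample-level missingness filter."""
    return missing_rate(calls_across_snps) <= SAMPLE_MISSING_MAX


def heterozygosity_outliers(het_rate_by_sample, n_sd=HET_SD_MAX):
    """Return sample IDs whose heterozygosity is > n_sd SDs from the mean."""
    rates = list(het_rate_by_sample.values())
    mu, sd = mean(rates), stdev(rates)
    return {s for s, r in het_rate_by_sample.items() if abs(r - mu) > n_sd * sd}


# Toy usage: 30 typical samples and one sample with inflated heterozygosity.
het = {f"sample{i}": 0.30 for i in range(30)}
het["suspect"] = 0.60
print(heterozygosity_outliers(het))
```

In practice these filters would be run on genome-wide genotype files by PLINK itself; the sketch is only meant to make the order and direction of each threshold explicit.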
Genome wide association
The associations were performed using sitting height ratio (SHR) adjusted for sex, height, age and body mass index (BMI). SHR was calculated as sitting height divided by total height. Individuals missing SHR or any of the above covariates were discarded. Only unrelated individuals were used, i.e. no pair of individuals had PI_HAT > 0.05. SHR values were inverse-normalized per cohort. The top 10 principal components (PCs) were calculated using SMARTPCA, and any PCs that were associated with SHR (P < 0.05) were also used as covariates. For African American individuals, the global percentage of European admixture was included as an additional covariate. The association of the imputed variants with SHR was tested by linear regression (--linear option in PLINK). The association results for each cohort were then meta-analyzed using METAL [29] with genomic control (GC) correction turned on. Variants on the X chromosome were analyzed separately in males and females.

Atherosclerosis Risk In Communities (ARIC) cohort
We obtained genotypic and phenotypic data from dbGaP. There were initially 13,113 samples (European + African Americans), and after performing the quality control (QC) procedure, there were 7,257 European Americans (3,551 males and 3,706 females) and 2,354 African Americans (894 males and 1,460 females). The genotypes were typed using the Affymetrix Genome-Wide Human SNP Array 6.0 platform.

Coronary Artery Risk Development in Young Adults (CARDIA) cohort
We obtained genotypic and phenotypic data from dbGaP. There were initially 1,675 European American samples, and after performing the quality control (QC) procedure, there were 1,047 (494 males and 553 females) samples remaining. For African Americans, there were initially 1,393 samples, and after performing the QC procedure, there were 715 (275 males and 440 females) samples remaining.
The genotypes were typed using the Affymetrix Genome-Wide Human SNP Array 6.0 platform.

Cardiovascular Health Study (CHS) cohort
We obtained genotypic and phenotypic data from dbGaP. There were initially 3,980 European American samples, and after performing the quality control (QC) procedure, there were 2,926 (1,163 males and 1,763 females) samples remaining. The genotypes were typed using the Illumina HumanCNV370v1-Duo platform.

Framingham Heart Study (FHS) cohort
We obtained genotypic and phenotypic data from dbGaP. The FHS cohort consists largely of family pedigrees. Sitting height measurements were only available for the original cohort. After removing samples without sitting height measurements and keeping only unrelated individuals, there were 713 (269 males and 444 females) samples remaining. The genotypes were typed using the Affymetrix 500K platform.

REFERENCES
1. Wadsworth MEJ, Hardy RJ, Paul AA, Marshall SF, Cole TJ (2002) Leg and trunk length at 43 years in relation to childhood health, diet and family circumstances; evidence from the 1946 national birth cohort. Int J Epidemiol 31: 383–390. doi:10.1093/ije/31.2.383.
2. Johnston LW, Harris SB, Retnakaran R, Gerstein HC, Zinman B, et al. (2013) Short leg length, a marker of early childhood deprivation, is associated with metabolic disorders underlying type 2 diabetes: the PROMISE cohort study. Diabetes Care 36: 3599–3606. doi:10.2337/dc13-0254.
3. De Arriba Muñoz A, Domínguez Cajal M, Rueda Caballero C, Labarta Aizpún JI, Mayayo Dehesa E, et al. (2013) Sitting height/standing height ratio in a Spanish population from birth to adulthood. Arch Argent Pediatría 111: 309–314. doi:10.1590/S0325-00752013000400009.
4. Fredriks A, van Buuren S, van Heel WJM, Dijkman-Neerincx R, Verloove-Vanhoric... S, et al. (2005) Nationwide age references for sitting height, leg length, and sitting height/height ratio, and their diagnostic value for disproportionate growth disorders. Arch Dis Child 90: 807–812. doi:10.1136/adc.2004.050799.
5. Krakow D, Rimoin DL (2010) The skeletal dysplasias. Genet Med 12: 327–341. doi:10.1097/GIM.0b013e3181daae9b.
6. Stokes DC, Pyeritz RE, Wise RA, Fairclough D, Murphy EA (1988) Spirometry and chest wall dimensions in achondroplasia. Chest 93: 364–369.
7. Hertel NT, Müller J (1994) Anthropometry in skeletal dysplasia. J Pediatr Endocrinol 7: 155–161.
8. Bogin B, Varela-Silva MI (2010) Leg length, body proportion, and health: a review with a note on beauty. Int J Environ Res Public Health 7: 1047–1075. doi:10.3390/ijerph7031047.
9. The ARIC investigators (1989) The Atherosclerosis Risk in Communities (ARIC) Study: design and objectives. Am J Epidemiol 129: 687–702.
10. Gardin JM, Wagenknecht LE, Anton-Culver H, Flack J, Gidding S, et al. (1995) Relationship of Cardiovascular Risk Factors to Echocardiographic Left Ventricular Mass in Healthy Young Black and White Adult Men and Women: The CARDIA Study. Circulation 92: 380–387. doi:10.1161/01.CIR.92.3.380.
11. Tishkoff SA, Reed FA, Friedlaender FR, Ehret C, Ranciaro A, et al. (2009) The genetic structure and history of Africans and African Americans. Science 324: 1035–1044. doi:10.1126/science.1172257.
12. Redaelli C, Coleman RA, Moro L, Dacou-Voutetakis C, Elsayed SM, et al. (2010) Clinical and genetic characterization of Chanarin-Dorfman syndrome patients: first report of large deletions in the ABHD5 gene. Orphanet J Rare Dis 5: 33. doi:10.1186/1750-1172-5-33.
13. Tukiainen T, Pirinen M, Sarin A-P, Ladenvall C, Kettunen J, et al. (2014) Chromosome X-Wide Association Study Identifies Loci for Fasting Insulin and Height and Evidence for Incomplete Dosage Compensation. PLoS Genet 10: e1004127. doi:10.1371/journal.pgen.1004127.
14. Sanna S, Jackson AU, Nagaraja R, Willer CJ, Chen W-M, et al. (2008) Common variants in the GDF5-BFZB region are associated with variation in human height. Nat Genet 40: 198–203. doi:10.1038/ng.74.
15. Kuczmarski MF, Kuczmarski RJ, Najjar M (2001) Effects of Age on Validity of Self-Reported Height, Weight, and Body Mass Index: Findings from the Third National Health and Nutrition Examination Survey, 1988–1994. J Am Diet Assoc 101: 28–34. doi:10.1016/S0002-8223(01)00008-6.
16. Landis SH, Murray T, Bolden S, Wingo PA (1999) Cancer statistics, 1999. CA Cancer J Clin 49: 8–31. doi:10.3322/canjclin.49.1.8.
17. Amundadottir LT, Sulem P, Gudmundsson J, Helgason A, Baker A, et al. (2006) A common variant associated with prostate cancer in European and African populations. Nat Genet 38: 652–658. doi:10.1038/ng1808.
18. Haiman CA, Patterson N, Freedman ML, Myers SR, Pike MC, et al. (2007) Multiple regions within 8q24 independently affect risk for prostate cancer. Nat Genet 39: 638–644. doi:10.1038/ng2015.
19. Bruno C, Bertini E, Di Rocco M, Cassandrini D, Ruffa G, et al. (2008) Clinical and genetic characterization of Chanarin–Dorfman syndrome. Biochem Biophys Res Commun 369: 1125–1128. doi:10.1016/j.bbrc.2008.03.010.
20. Portnoï M-F, Gruchy N, Marlin S, Finkel L, Denoyelle F, et al. (2007) Midline defects in deletion 18p syndrome: clinical and molecular characterization of three patients. Clin Dysmorphol 16: 247–252. doi:10.1097/MCD.0b013e328235a572.
21. Bondy CA, Cheng C (2009) Monosomy for the X chromosome. Chromosome Res 17: 649–658. doi:10.1007/s10577-009-9052-z.
22. Castagné R, Zeller T, Rotival M, Szymczak S, Truong V, et al. (2011) Influence of sex and genetic variability on expression of X-linked genes in human monocytes. Genomics 98: 320–326. doi:10.1016/j.ygeno.2011.06.009.
23. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81: 559–575. doi:10.1086/519795.
24. International HapMap Consortium (2003) The International HapMap Project. Nature 426: 789–796. doi:10.1038/nature02168.
25. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38: 904–909. doi:10.1038/ng1847.
26. Delaneau O, Zagury J-F, Marchini J (2013) Improved whole-chromosome phasing for disease and population genetic studies. Nat Methods 10: 5–6. doi:10.1038/nmeth.2307.
27. Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR (2012) Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nat Genet 44: 955–959. doi:10.1038/ng.2354.
28. Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, et al. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65. doi:10.1038/nature11632.
29. Willer CJ, Li Y, Abecasis GR (2010) METAL: fast and efficient meta-analysis of genomewide association scans. Bioinforma Oxf Engl 26: 2190–2191. doi:10.1093/bioinformatics/btq340.

Chapter 5
Concluding Remarks

OVERVIEW
The development of whole genome genotyping technologies (DNA microarrays, whole genome sequencing, etc.), coupled with computational capabilities for performing genotype-phenotype associations, has allowed genome wide association studies (GWAS) to be successful at identifying genetic variants associated with many complex traits and diseases [1]. This is because GWASs are best suited for identifying variants for traits with polygenic architecture, where many loci each have only a small effect on the resulting phenotype [2]. There is now compelling evidence that many of these variants result in changes in RNA expression levels, which could be the reason behind their association with the phenotype [3,4]. While GWASs have been largely successful, much of the heritability has not been explained by the currently discovered variants, although as sample sizes increase, GWASs will be better powered to detect variants with smaller effect sizes [5].
Nonetheless, even if GWASs yield no new variants associated with phenotypes, the landscape of the genetic association statistics from GWASs might still be informative about the genetic architecture of the phenotype. In this dissertation, we demonstrated how one can leverage the results from GWASs to infer the contribution of rare and common variants to polygenic architecture.

MAJOR FINDINGS AND IMPLICATIONS
In chapter 2, we discussed experiments that analyze the effects of common variants on height at the tails of the height distribution. The findings are:
• Single SNP analysis shows that common variants have the expected effects at the tails.
• The short individuals have fewer common short alleles (alleles that are shown to reduce stature) than expected.
• This effect is driven by the shortest individuals.
• This result is consistent with rare variants having moderate effects on short stature.

Given that the short individuals have fewer common short alleles than expected, there is a fair chance that there are many such rare variants with moderate effects on short stature in the population. Studies have been performed to determine what some of these variants might be [6], but identifying them might still prove difficult given the lack of power, as the allele frequencies of such variants are very low. There is also evidence that rare copy number variants (CNVs) in genomic regions can explain the short stature in some patients [7,8]. Therefore, one of the implications of our results is that if one wants a strategy for identifying rare variants that cause short stature in the population, the recruitment of individuals with short stature is critical. It would be better to first genotype individuals with short stature for their height-associated common variants and determine whether these individuals have a deficit of height-decreasing alleles.
Because individuals with short stature could be short because of rare variants, common variants, or both, selecting those with a deficit of common height-decreasing alleles would enrich for individuals harboring rare variants. Our results also support the use of ‘extreme’ individuals in genetic studies; such studies can complement our knowledge about the genetic architecture of the trait in question.

In chapter 3, we described a method to detect polygenic inheritance from low frequency variants by testing whether there is an excess of risk-conferring variants in the summary statistics of association studies. The findings are:
• An excess of low frequency risk-increasing variants can be a signal of polygenic inheritance, as measured by an increase in the risk to protective (R/P) ratio.
• This excess can be due to risk-increasing variants being more statistically powered than risk-decreasing variants with the same magnitude of effect.
• This excess can also be due to having more risk variants to begin with, because negative selection keeps risk variants at low frequencies.
• There is a higher probability for false positive associations to be risk variants if there are substantially more controls than cases.
• This excess can also be due to asymmetric population stratification arising from poorly designed GWAS.
• An analysis of some published GWAS summary statistics reveals significantly increased R/P ratios for schizophrenia, type 2 diabetes and obesity.
• Significantly increased R/P ratios were observed for macroalbuminuria and end stage renal disease, but not when these subtypes of diabetic nephropathy were combined into a single case group.

These findings suggest that one could simply test for an excess of risk-conferring variants to determine whether low frequency variants contribute as a whole to disease risk.
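As a minimal sketch of what such a test could look like, the Python below counts risk-increasing and risk-decreasing low frequency variants in a table of summary statistics and compares the counts with an exact one-sided binomial test. The frequency cutoff (0.05), the p-value cutoff (1e-3), the `Variant` record layout and the null proportion of 0.5 are all illustrative assumptions, not the procedure or thresholds used in chapter 3. In particular, the findings above note that power differences, case-control imbalance and stratification can shift the expected ratio away from 1, so a null of 0.5 is a simplification.

```python
from math import comb
from typing import Iterable, NamedTuple, Tuple


class Variant(NamedTuple):
    """One row of summary statistics (hypothetical layout)."""
    effect_allele_freq: float  # frequency of the allele the odds ratio refers to
    odds_ratio: float          # odds ratio for that allele
    p_value: float             # association p-value


def risk_protective_counts(variants: Iterable[Variant],
                           maf_max: float = 0.05,
                           p_max: float = 1e-3) -> Tuple[int, int]:
    """Count low frequency minor alleles that raise or lower disease risk."""
    risk = protective = 0
    for v in variants:
        freq, odds = v.effect_allele_freq, v.odds_ratio
        if freq > 0.5:
            # Re-orient the effect to the minor allele.
            freq, odds = 1.0 - freq, 1.0 / odds
        if freq >= maf_max or v.p_value > p_max or odds == 1.0:
            continue  # not low frequency, not nominally associated, or no effect
        if odds > 1.0:
            risk += 1
        else:
            protective += 1
    return risk, protective


def binomial_upper_tail(k: int, n: int, p: float = 0.5) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k, n + 1))


def rp_ratio_test(variants: Iterable[Variant], **cutoffs) -> Tuple[float, float]:
    """Return the R/P ratio and a one-sided p-value for an excess of risk variants."""
    risk, protective = risk_protective_counts(variants, **cutoffs)
    n = risk + protective
    ratio = risk / protective if protective else float("inf")
    p_excess = binomial_upper_tail(risk, n) if n else 1.0
    return ratio, p_excess
```

Only summary statistics are needed as input, which matches the practical appeal of the approach: no individual-level genotype data enter the calculation.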
Methods to detect a contribution of low frequency or rare genetic variants to disease risk are crucial, as they can inform researchers whether pursuing the hypothesis would be a fruitful endeavor. While methods like GCTA [9] and polygene score [10] can be adapted to perform such analyses, testing for an excess of risk-conferring variants provides independent support for low frequency polygenic contributors to disease risk and requires only summary statistics, without the need for primary genotype data.

Besides providing such a method, the findings suggest that most GWAS are designed to better discover low frequency variants that confer risk to disease. While this is useful for explaining disease etiology, it may be suboptimal for discovering genes that might be useful as drug targets for treatment. This is because genes that have low frequency variants that confer protection against disease are best suited as drug targets, assuming that the variants confer some loss of function effect on the gene. For example, low frequency loss of function variants in PCSK9 have been found to have a protective effect against coronary heart disease [11], and PCSK9 has now become a drug target for lowering LDL cholesterol [12]. Our results suggest that if GWAS were designed such that cases are individuals strongly protected against disease and controls are everyone else, that design would be better optimized to discover rare protective variants.

In chapter 4, we examined the extent of the genetic contribution to sitting height ratio (SHR) by performing genome wide association studies in African and European Americans. The findings are:
• The degree of European admixed ancestry in African Americans is strongly associated with sitting height ratio (SHR), suggesting a strong genetic contribution.
• GWAS in African Americans discovered a locus associated with SHR.
• GWAS in European Americans discovered 2 loci associated with SHR.
• More height-associated variants than expected show association with SHR as well.
• Some of these height-increasing alleles decrease SHR while others increase SHR.

These results show that the difference in sitting height ratios (SHRs) between European and African Americans is genetic and that GWAS can reveal variants that are associated with SHR. However, the few variants discovered through GWAS do not explain the difference between European and African Americans, suggesting that this difference is polygenic. As such, to uncover the full extent of such a difference, many more samples are required. The excess of known height-associated variants associated with SHR is also interesting. While sitting height is a component of total height, the sitting height ratio is not. Given that we corrected for total height when performing the linear regression, the association statistic reflects a change in the ratio of upper-body to lower-body length. As such, these height-associated SNPs can be grouped into 3 categories: the height-increasing allele does not alter SHR, the height-increasing allele decreases SHR, or the height-increasing allele increases SHR. While we perhaps do not have enough height loci to investigate this, we hypothesize that the height-increasing alleles that increase SHR are probably in genes that function to increase spine length, or that the alternate allele decreases femur or tibia length. On the other hand, the height-increasing alleles that decrease SHR may act to increase the length of the femur or tibia. The variants that have no effect on SHR may be regulating hormonal output. Examining these 3 classes of variants may shed more light on the biology of growth and the relevant developmental pathways involved.

Genome wide association studies (GWAS) can inform us about the genetic architecture of traits and diseases.
We argue that one should not look only at the genome wide significant results from GWASs and ignore variants that do not reach significance. By performing computational modeling on the full range of results, one would be able to infer the genetic architecture of the trait or disease and perhaps shed light on the biological mechanism responsible for producing the change in the phenotype.

FUTURE DIRECTIONS

In this section, we discuss how the results from this dissertation bear on the understanding of disease etiology, their broader implications, and potential future research directions toward the aims of improving our understanding of genetic diseases and discovering therapeutic strategies. It has been suggested that while there is a plethora of effort devoted to disease mapping through the use of GWAS, little has been discovered about the mechanisms by which these variants influence disease pathology, and even less in terms of therapeutics. This trend might change in the future, but there are several issues that might hinder efforts to understand the mechanisms linking common variants to disease. First, as the effect sizes of these common variants are small, studying a variant’s effect either in in-vitro systems or in animal models may not be feasible, as the magnitude of effect may be too small to be observed from the readout. Next, given the large number of such variants, it may be impractical to simultaneously study most of them.

As such, rare variants with large effects might be better suited for such follow up studies. From our study of individuals at the extreme ends of the height distribution, we observed that individuals with short stature could be short because of rare variants of moderate effect. These effect sizes could be large enough to register a readout in animal models.
In fact, it has been shown that human alleles can be introduced into zebrafish, causing these zebrafish to have similar phenotypes [13]. Therefore, given that rare variants with moderate effects are not likely to be discovered from GWAS, as the SNP markers from GWAS are mainly common, new approaches for rare variant discovery are needed. Some have suggested and performed whole-genome, whole-exome or exome-chip experiments in an effort to discover rare variants associated with diseases. Results from our work suggest that analyzing the GWAS results may be informative as to how likely such efforts are to be fruitful.

From our studies of individuals with short and tall stature, we found that short individuals carry fewer short alleles than expected, suggesting that they may have rare variants that cause a moderate decrease in height. If one were to sample short individuals whose common variant profile predicts tall or above-average stature, these individuals would be more likely to harbor such rare variants. Such rare variants could lie in genes that are known to act in pathways that regulate growth, or in genes without much known biological function or mechanism. There could be several approaches to studying these genes to elucidate their unknown biological function or mechanism. One approach would be to introduce these variants via genome-engineering methods into a model organism. If the organism has a homologous gene such that the human version of the gene is still functional, it is possible to replace the organism’s endogenous gene with a human version harboring these variants. If the human allele of the gene causes a similar phenotype in the organism (in this case, short stature), it would be evidence that the allele is the causal variant responsible for the human phenotype, and the mechanism of action can subsequently be studied in the model organism.
This strategy could be extended to other quantitative traits like body-mass-index (BMI), lipid levels and blood pressure. Although not widely done, modeling disease outcome using human alleles has been demonstrated to be successful in zebrafish [14]. The key would be to identify the rare variants with large effects, and we have shown that studying the phenotypic extremes can be a better way of doing so.

While identifying rare variants with relatively larger effect sizes may be useful for understanding disease etiology, it may not be as useful for the development of therapeutics; in particular, the genes underlying these rare variants do not make good candidate drug targets. This is because these variants are usually deleterious, and therefore targeting these genes would be predicted to increase disease risk. Also, even if the variants are gain of function variants, targeting these genes would only work for individuals that have the risk allele, which would still be rare in the population. The truth is that most individuals are affected by complex diseases not because of rare variants but because of the cumulative effect of common variants in many genes together with environmental stimuli. Even when certain traits, like sitting height ratio, are highly differentiated between populations, the reason behind that differentiation is usually polygenic. Unless it becomes feasible to have drugs that target many genes concurrently, where each target is only modestly affected, one would perhaps need a better solution for treating a polygenic disease. One possibility would be to target genes where there are rare deleterious variants that have moderate protective effects. However, we have shown that for case-control association studies, there is more power to detect risk variants than protective variants.
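The power asymmetry can be illustrated with a rough calculation. The sketch below approximates the power of a two-sided allelic test (a normal approximation to a two-proportion comparison) for a low frequency variant with a given odds ratio. The control allele frequency of 0.01, the odds ratios of 2 and 0.5, the sample sizes and the genome-wide threshold of 5e-8 are hypothetical values chosen for illustration. The approximation is an assumption made here and is not the simulation framework used in chapter 3.

```python
from statistics import NormalDist


def case_allele_freq(control_freq: float, odds_ratio: float) -> float:
    """Allele frequency in cases implied by the control frequency and allelic odds ratio."""
    odds = control_freq / (1.0 - control_freq) * odds_ratio
    return odds / (1.0 + odds)


def approx_power(control_freq: float, odds_ratio: float,
                 n_cases: int, n_controls: int, alpha: float = 5e-8) -> float:
    """Approximate power of a two-sided allelic test (normal approximation).

    Each individual contributes two alleles; the negligible opposite tail is ignored.
    """
    p_case = case_allele_freq(control_freq, odds_ratio)
    se = (p_case * (1.0 - p_case) / (2 * n_cases)
          + control_freq * (1.0 - control_freq) / (2 * n_controls)) ** 0.5
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_alt = abs(p_case - control_freq) / se
    return NormalDist().cdf(z_alt - z_crit)


if __name__ == "__main__":
    # Same magnitude of effect on the log-odds scale, opposite directions.
    for label, odds_ratio in (("risk", 2.0), ("protective", 0.5)):
        power = approx_power(0.01, odds_ratio, n_cases=2000, n_controls=2000)
        print(f"{label:10s} OR={odds_ratio}: approximate power {power:.4f}")
```

Under these assumed values the risk allele doubles in frequency among cases (from about 0.01 to about 0.02), while the protective allele can only fall from 0.01 to about 0.005, so the absolute frequency difference, and with it the power, is larger for the risk variant.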
Therefore, in order to optimize power to detect protective variants, the “case” individuals used in a case-control association study should be individuals that are protected against the disease. Finding such individuals, however, is a challenge on its own, as individuals who are protected against disease do not show up at a clinic. One possibility would be to use a quantitative trait measurement that is a proxy for the disease. For example, one criterion for having type 2 diabetes is having fasting glucose levels above 125 mg/dL. If one were able to recruit individuals that have lower than normal fasting glucose levels as cases, with everyone else as controls, then that case-control study design would be better optimized for detecting such protective variants. Another approach would be to use unaffected individuals that have a strong environmental exposure predisposing them to the disease. For example, healthy middle-aged adults who are obese but do not have type 2 diabetes could be used as cases. Since there is a high probability of getting type 2 diabetes if one is obese, non-diabetic obese individuals might harbor protective variants against type 2 diabetes. Perhaps such a new paradigm for performing GWAS might be the way forward for optimizing the power to detect rare protective variants.

A POSTSCRIPT

We are at a critical moment in human genetics research aimed at understanding complex diseases. It was not too long ago that we did not have even a single gene or locus associated with any complex disease, but now we have many, perhaps too many to even comprehend how it is possible to move forward. As genomic techniques improve and sequencing costs fall, having a whole genome sequence for any single individual may become easily achievable. In the near future, having a genomic profile for any patient would be like measuring blood pressure today. It would be quick, easy and inexpensive.
Therefore, the challenge of the future would be to determine how one could harness the genome sequence of every patient to improve our understanding of disease mechanisms as well as to aid in the development of new therapeutics. It is incumbent on us scientists to make that a reality and I strongly believe that we will succeed. It is only a matter of time.

REFERENCES

1. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, et al. (2008) Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 9: 356–369. doi:10.1038/nrg2344.
2. Stranger BE, Stahl EA, Raj T (2011) Progress and Promise of Genome-Wide Association Studies for Human Complex Trait. Genetics 187: 367–383. doi:10.1534/genetics.110.120907.
3. Nicolae DL, Gamazon E, Zhang W, Duan S, Dolan ME, et al. (2010) Trait-Associated SNPs Are More Likely to Be eQTLs: Annotation to Enhance Discovery from GWAS. PLoS Genet 6: e1000888. doi:10.1371/journal.pgen.1000888.
4. Nica AC, Montgomery SB, Dimas AS, Stranger BE, Beazley C, et al. (2010) Candidate Causal Regulatory Effects by Integration of Expression QTLs with Complex Trait Genetic Associations. PLoS Genet 6: e1000895. doi:10.1371/journal.pgen.1000895.
5. Visscher PM, Brown MA, McCarthy MI, Yang J (2012) Five Years of GWAS Discovery. Am J Hum Genet 90: 7–24. doi:10.1016/j.ajhg.2011.11.029.
6. Wang SR, Carmichael H, Andrew SF, Miller TC, Moon JE, et al. (2013) Large-Scale Pooled Next-Generation Sequencing of 1077 Genes to Identify Genetic Causes of Short Stature. J Clin Endocrinol Metab 98: E1428–E1437. doi:10.1210/jc.2013-1534.
7. Dauber A, Yu Y, Turchin MC, Chiang CW, Meng YA, et al. (2011) Genome-wide Association of Copy-Number Variation Reveals an Association between Short Stature and the Presence of Low-Frequency Genomic Deletions. Am J Hum Genet 89: 751–759. doi:10.1016/j.ajhg.2011.10.014.
8. Zahnleiter D, Uebe S, Ekici AB, Hoyer J, Wiesener A, et al. (2013) Rare Copy Number Variants Are a Common Cause of Short Stature. PLoS Genet 9: e1003365. doi:10.1371/journal.pgen.1003365.
9. Yang J, Lee SH, Goddard ME, Visscher PM (2013) Genome-wide complex trait analysis (GCTA): methods, data analyses, and interpretations. Methods Mol Biol Clifton NJ 1019: 215–236. doi:10.1007/978-1-62703-447-0_9.
10. Purcell SM, Wray NR, Stone JL, Visscher PM, O’Donovan MC, et al. (2009) Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460: 748–752. doi:10.1038/nature08185.
11. Cohen JC, Boerwinkle E, Mosley TH, Hobbs HH (2006) Sequence Variations in PCSK9, Low LDL, and Protection against Coronary Heart Disease. N Engl J Med 354: 1264–1272. doi:10.1056/NEJMoa054013.
12. Stein EA, Mellis S, Yancopoulos GD, Stahl N, Logan D, et al. (2012) Effect of a Monoclonal Antibody to PCSK9 on LDL Cholesterol. N Engl J Med 366: 1108–1118. doi:10.1056/NEJMoa1105803.
13. Khanna H, Davis EE, Murga-Zamalloa CA, Estrada-Cuzcano A, Lopez I, et al. (2009) A common allele in RPGRIP1L is a modifier of retinal degeneration in ciliopathies. Nat Genet 41: 739–745. doi:10.1038/ng.366.
14. Davis EE, Zhang Q, Liu Q, Diplas BH, Davey LM, et al. (2011) TTC21B contributes both causal and modifying alleles across the ciliopathy spectrum. Nat Genet 43: 189–196. doi:10.1038/ng.756.