Computational Questions in Evolution

A dissertation presented by Varun Nandkumar Kanade to The School of Engineering and Applied Sciences in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the subject of Computer Science.

Harvard University
Cambridge, Massachusetts
August 2012

© 2012 - Varun Nandkumar Kanade. All rights reserved.

Thesis advisor: Leslie G. Valiant
Author: Varun Nandkumar Kanade

Computational Questions in Evolution

Abstract

Darwin's theory (1859) proposes that evolution progresses by the survival of those individuals in the population that have greater fitness. Modern understanding of Darwinian evolution is that variation in phenotype, or functional behavior, is caused by variation in genotype, or the DNA sequence. However, a quantitative understanding of what functional behaviors may emerge through Darwinian mechanisms, within reasonable computational and information-theoretic resources, has not been established. Valiant (2006) proposed a computational model to address the question of the complexity of functions that may be evolved through Darwinian mechanisms. In Valiant's model, the goal is to evolve a representation that computes a function that is close to some ideal function under the target distribution. While this evolution model can be simulated in the statistical query learning framework of Kearns (1993), Feldman has shown that under some constraints the reverse also holds, in the sense that learning algorithms in this framework may be cast as evolutionary mechanisms in Valiant's model.

In this thesis, we present three results in Valiant's computational model of evolution. The first shows that evolutionary mechanisms in this model can be made robust to gradual drift in the ideal function, and that such drift resistance is universal, in the sense that, if some concept class is evolvable when the ideal function is stationary, it is also evolvable in the setting when the ideal function drifts at some low rate. The second result shows that under certain definitions of recombination and for certain selection mechanisms, evolution with recombination may be substantially faster. We show that in many cases polylogarithmic, rather than polynomial, generations are sufficient to evolve a concept class, whenever a suitable parallel learning algorithm exists. The third result shows that computation, and not just information, is a limiting resource for evolution. We show that when computational resources in Valiant's model are allowed to be unbounded, while requiring that the information-theoretic resources be polynomially bounded, more concept classes are evolvable. This result is based on widely believed conjectures from complexity theory.

Contents

Title Page
Abstract
Table of Contents
Bibliographic Note
Acknowledgments
Dedication

1 Introduction and summary

2 Learning Models
2.1 Setting
2.2 PAC Learning
2.3 SQ Learning
2.4 Correlational Statistical Query Learning
3 Computational Model
3.1 Environment and Ideal Function
3.2 Representation
3.3 Performance (or Expected Loss)
3.4 Mutation
3.5 Selection Rules
3.5.1 Selection Based on Neutral and Beneficial Mutations
3.5.2 Selection based on Optimization
3.5.3 Weak Selection
3.6 Evolvability
3.7 Some Evolutionary Algorithms
3.7.1 Monotone Conjunctions
3.7.2 Homogeneous Linear Separators
3.8 Relations to Learning Theory
3.8.1 From Learning to Evolvability
3.8.2 Making the representation class independent of ε
3.8.3 Reduction from Learning Frameworks to Evolvability

4 Drifting Targets
4.1 Notions of Monotonicity
4.1.1 Notation
4.2 Resistance to Drift
4.2.1 Universality of Drift Resistance
4.3 Quasi-Monotonic Evolution
4.3.1 Removing the Need to Know Δ
4.4 Some Specific Evolutionary Algorithms
4.4.1 Monotone Conjunctions
4.4.2 Homogeneous Linear Separators

5 The Role of Recombination
5.1 Related Work
5.2 Evolution Model with Recombination
5.2.1 Recombination
5.2.2 Selection Rules
5.2.3 Evolution with Recombination
5.3 Model of Parallel Learning
5.4 Reduction to Evolution with Recombination
5.4.1 Recombination and Weak Selection
5.5 Some Evolutionary Algorithms
5.5.1 Evolving Conjunctions
5.5.2 Evolving Homogeneous Linear Separators
5.6 Lower Bounds
5.7 Discussion

6 Limitations on Evolvability
6.1 Statistical Query Learning Models
6.1.1 Distribution-Specific SQ Learning
6.1.2 Distribution-Independent SQ Learning
6.1.3 Correlational Statistical Query Learning
6.1.4 Notation
6.2 Computational Limitations on Evolvability
6.3 Statistical Lower Bounds
6.4 Computational Lower Bounds
6.4.1 Weak Distribution-Specific SQ/CSQ Learning
6.4.2 Strong Distribution-Specific SQ/CSQ Learning
6.4.3 Strong Distribution-Independent SQ Learning
6.4.4 Strong Distribution-Independent CSQ Learning
6.5 Upper Bounds

A Equivalence between CSQ and ZCSQ learning

Bibliography

Bibliographic Note

Chapter 1 is an introduction to the contents and structure of this thesis. Chapter 2 reviews some frameworks studied in computational learning theory – the probably approximately correct (PAC) learning framework [38], the statistical query (SQ) learning framework [22] and Feldman's more recent correlational SQ (or CSQ) framework [12]. Chapter 3 describes Valiant's model of evolvability [39], some extensions to the model and some existing results. The subject matter described in Chapter 3 appears in [39, 12, 14, 40, 21] and, with the exception of Section 3.7.2, is not based on research performed by the author. The author's contribution is an attempt to unify and simplify some of the existing results.

Most of the contents of Chapters 4, 5 and 6 have been previously published in some form. Chapter 4 is based on the paper "Evolution with Drifting Targets" (COLT 2010 [21]), co-authored with Leslie G. Valiant and Jennifer Wortman Vaughan. Chapter 5 is based on the paper "Evolution with Recombination" (FOCS 2011 [20]). Chapter 6 is based on the paper "Computational Bounds on Statistical Query Learning" (COLT 2012 [16]), co-authored with Vitaly Feldman.

Acknowledgments

I am extremely fortunate to have had Leslie Valiant and Adam Kalai as advisers in two different stages of my graduate studies. Les has been a great source of inspiration and a vast resource of knowledge. Through interactions with him, I hope that I have begun to understand how to separate the fundamental from the incidental, and hope that I will follow his ideal in trying to find answers to questions that really matter. Adam introduced me to computational learning theory. I thank him for letting me move with him to Boston and for all the dinners, beers, poker games and, most importantly, research problems and ideas. To both Les and Adam, I am grateful for patiently listening to my often silly ideas and for having confidence in my ability to conduct research, even at times when I didn't. I hope I have learned from them, at least to some degree, how to be a good adviser and teacher.

I thank Michael Mitzenmacher and Salil Vadhan for serving on my thesis committee and for all their help during my stay at Harvard. I thank Yiling Chen for serving on my qualifying examination committee. Michael Rabin has been a great source of inspiration; I learned a lot from his courses, both as a student and as a teaching fellow. To Margo Seltzer I owe my (very limited) knowledge of systems, and I thank her for making the experience at Harvard more enjoyable, inside and outside of her CS265 class. Sham Kakade and Santosh Vempala have played a very helpful role in my research career. To Dana Randall, I remain indebted; she was the first person with whom I started research and she was most supportive in my first (and at times trying) year in graduate school.
At Harvard, Carol Harlow (in SEAS) and Darryl Ziegler (in HIO) made paperwork seem painless, and for that I am extremely grateful to them. My time at Georgia Tech was enjoyable mostly due to the wonderful (and too large to list here) theory group. I have been fortunate to have had a chance to collaborate with and talk to a number of outstanding researchers – Pranjal Awasthi, David Bader, Nina Balcan, Vitaly Feldman, Daniel Hsu, Sham Kakade, Kamesh Madduri, Yishay Mansour, Brendan McMahan, Ohad Shamir, Jennifer Wortman Vaughan, and Santosh Vempala. I must also thank all my math teachers from high school and IIT who stimulated my interest in theoretical computer science one way or another.

My move to and stay in the Boston area would not have been nearly as pleasant without the friendship of Karthik, Veena and Dipti. Purushottam has been a great friend whose support, in matters academic and otherwise, I have always been able to count on. The years spent during graduate school would not have been as fun without times with friends from high school and college – Anjor, Gayatri, Gireeja, Jay, Mandar G., Mandar H., Niyati, Onkar, Rohit, Samyogita, Shubhangi, Shweta, Sukhada and Sumedh. My housemates Harriotte, Véra, Cynthia, Tom and Laini made it enjoyable to occasionally go home early. Maxwell Dworkin 138 has been the most wonderful office – Kai-Min, Zhenming, Anna, Colin, Jon, Justin, Thomas, Scott, James, SueYeon and Jiayang. From them I have learned a lot (theory and otherwise) and with them I have had a lot of fun (in theory and otherwise). Other friends in MD – Alice, Brendan, Elaine, Elena, Guille, Heather – have made the time spent here much more enjoyable. Anna (the only non-theorist in MD 138) gets particular credit for making MD 138 the wonderful place it is; Thomas for his love for the Queen; Justin for his jokes and 1-minute discussions that last a few hours; Elena for being the constant provider of chocolates; Elaine for being the provider of food, drink and French movies, a teacher of Californian accents, and much more.

My family, in India and in the States, has been very supportive in innumerable ways during my graduate study – but in particular I would like to thank my parents for believing in me and (almost) always letting me do what I want. This work would not have been possible without a fellowship from the Harvard School of Engineering and Applied Sciences and support from NSF grants CCF-0427129 and CCF-09-64401.

to Jeeves and Wooster

Chapter 1

Introduction and summary

Darwin's theory of natural selection proposes that species evolved from a single or a few common ancestors [8]. Darwin's crucial observation was that more creatures are born than could possibly be supported by the environment, and so those that are more "fit" are more likely to survive. Central to the theory of evolution is the notion of inheritance, by which genetic material is passed down to progeny. The understanding from modern genetics is that the functional behavior of an organism is encoded in its genome (DNA sequence). (It is known that there are epigenetic factors, i.e. factors not directly encoded as a sequence in the DNA, that affect the expression of inherited functional traits; however, the extent to which these factors affect the rate of evolution is not known.) For selection to play any role in the adaptation of organisms to their environment, variation within organisms is essential; this variability is caused by mutations (changes) in the genetic code (DNA sequence).
Although we know that the functional behavior or phenotype of an organism is controlled by the genotype (DNA sequence), the relationship between genotype and phenotype, in particular how changes in genotype affect phenotype, is not yet well understood. The DNA of an organism encodes sequences of amino acids that form proteins, and also encodes regulatory networks that govern interactions between these proteins. A regulatory network can be thought of as a circuit computing a function, where the inputs are concentration levels of a certain set of proteins and other molecules, and the outputs are expression levels of the various proteins. An important question that is largely unanswered in our current understanding of evolution is that of the nature of the complexity that can emerge in these genetic circuits. Valiant [39] proposed a computational model of evolution to study this question in the framework of computational learning theory and to understand the complexity of genetic circuits through the mathematical functions they represent.

An organism in Valiant's model is defined by a representation (its genotype), r, encoded as a string that is also interpreted as a function over some domain X. A point x in the domain can be thought of as the input (e.g. concentration levels of proteins) that results from a certain environmental condition. For example, environmental conditions such as the presence of sugar molecules, oxygen levels, moisture and temperature can cause concentration levels of proteins to change. There is presumed to be an ideal function that indicates the optimal behavior (output of the circuit) in any possible environmental condition. The complexity of the ideal function is captured by a class of functions that contains it, e.g. the ideal function being a low-weight threshold function. The performance of an organism is defined as a measure of fitness and expresses how well the function represented by the organism compares to the ideal. Valiant introduces a notion of mutation of the representation and models natural selection as favoring those mutations that increase performance. An important constraint on evolution is population and representation size, which limits the number of feasible mutations that can be explored. Valiant asks which classes of functions allow for evolution of a representation that is close to the ideal function (in that class) within realistic time frames and with reasonable population sizes. It is also of interest to identify classes that are provably not evolvable within feasible resources.

Evolution in this model is viewed as a restricted form of learning, where exploration is limited to mutations and feedback is obtained only through (natural) selection. A key difference between this form of learning and the classical setting of learning from examples [38] is that a candidate hypothesis is judged only by its aggregate performance (how well it does on some range of environmental conditions) and not by its performance on specific examples.
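This distinction can be made concrete with a small sketch (all names below are hypothetical toy choices, not part of the model): a learner from examples sees individual labeled instances, whereas an evolution-style process sees only an aggregate score over the distribution of conditions, here a correlation in [−1, 1] matching the performance measure used in Chapter 3.

```python
import random

random.seed(0)
n = 10

# Toy ideal function and candidate representation (hypothetical choices):
# the ideal behavior is the conjunction x1 AND x2; the candidate tests only x1.
ideal = lambda x: 1 if x[0] == 1 and x[1] == 1 else -1
candidate = lambda x: 1 if x[0] == 1 else -1

draw = lambda: [random.choice([-1, 1]) for _ in range(n)]

# Learning from examples (PAC-style): feedback is a labeled instance (x, f(x)).
x = draw()
labeled_example = (x, ideal(x))

# Evolution-style feedback: only the aggregate performance over many
# environmental conditions is visible, never the outcome on a specific x.
sample = [draw() for _ in range(10000)]
aggregate_performance = sum(candidate(z) * ideal(z) for z in sample) / len(sample)

print(labeled_example, aggregate_performance)
```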
Indeed, this model of learning is related to the statistical query learning model [22]; Valiant observed that an evolutionary process could be simulated in the statistical query learning framework, and it was shown by Feldman [12] that a restricted form of statistical query learning algorithms could be cast as evolutionary mechanisms in Valiant's model. This thesis addresses three questions in Valiant's computational model of evolution.

Structure of the thesis

• Chapter 2 reviews a few models from computational learning theory. The main model of interest is Kearns' statistical query (SQ) learning framework [22] and its restriction, the correlational statistical query (CSQ) learning framework [12].

• Chapter 3 describes in detail Valiant's model of evolution [39] and variants introduced by Feldman [14] and P. Valiant [40]. A few simple evolutionary mechanisms for some concept classes are described. This chapter also contains Feldman's proof showing the equivalence between CSQ learning and evolution.

• Chapter 4 is concerned with the question of understanding how drift in the ideal function affects evolvability in Valiant's model. The main result is that for any concept class that is evolvable at all, there exists an (adaptive) evolutionary mechanism that tolerates a non-negligible (though possibly small) drift rate in the ideal function. The main result exploits the connection of evolvability to statistical query learning. For some specific concept classes, evolutionary mechanisms that tolerate drift are presented. Most of Chapter 4 appeared in [21].

• Chapter 5 explores the role of recombination and sexual reproduction in evolution. The question is a difficult one on which no consensus has so far been reached. On the one hand, there seems to be an obvious cost to sex – recombination may break down genetic combinations with high fitness. On the other hand, the fact that almost all higher organisms reproduce sexually suggests that genes that enable recombination are indeed selected for. Fisher [18] and Muller [30] proposed the idea that sexual reproduction increases the speed of evolution. Their explanation is the following: Suppose that a → A and b → B are two beneficial mutations that could occur in an organism and, furthermore, that the effect of these mutations is additive. Suppose that the population initially had only a and b alleles. Then, in the absence of recombination, in order for both A and B to become fixed in the population, the second mutation, say b → B, must occur in a descendant of an organism that has already undergone the mutation a → A. In a population that reproduces sexually, on the other hand, recombination allows the appearance of a member containing both the A and B alleles in possibly far fewer generations. Chapter 5 reviews a few other explanations from the biology literature on the possible advantages of recombination; the one most relevant to the work in this thesis is the explanation of Fisher and Muller outlined above. It is shown that under certain selection mechanisms, evolution of a certain functionality may be significantly accelerated if the corresponding statistical query learning algorithm is parallelizable. Most of the material in this chapter appeared in [20].

• Chapter 6 investigates the limitations on evolvability. The fact that evolvability of a concept class implies learnability in the statistical query model immediately imposes limitations on evolvability.
For example, it was shown by Kearns [22] that the class of parity functions is not learnable in the SQ model. Later, other concept classes such as polynomial-size decision trees and polynomial-size disjunctive normal form (DNF) formulae were shown not to be learnable in the statistical query model [6]. Even shallow monotone circuits have been shown not to be SQ learnable [17]. In all of these results, it is shown that the impediment to learning is that polynomially many queries do not give sufficient information about the target function. Thus, these results hold even if the learning algorithm is allowed unbounded computation. The question of interest is whether computation is a barrier to evolution (or SQ learning) in cases when information is not, i.e. does there exist a concept class that can be learned with few (polynomially many) queries if the learner is allowed infinite computational power, but not if the learner is computationally (polynomially) bounded? The question is answered in the affirmative by explicit construction of such a concept class. Most parts of this chapter appeared in [16].

Chapter 2

Learning Models

This chapter reviews some standard frameworks from computational learning theory. Computational learning theory was introduced by Valiant in a seminal paper [38] and initiated a formal study of mechanistic learning. We describe the following three frameworks of learning relevant to the work in this thesis:

1. The probably approximately correct (PAC) learning framework [38] – learning from random examples.

2. The statistical query (SQ) learning framework [22] – where the learner does not have access to random examples, but only to aggregate statistics of the function being learned.

3. The correlational statistical query (CSQ) learning framework [12], which is a restriction of the SQ framework and is closely related to Valiant's notion of evolvability [39, 12].

2.1 Setting

In all these learning frameworks, the common goal is to learn an (unknown) target function, f : X → Y, where X is the instance space (typically X = {−1, 1}^n or X = R^n) and Y is the range of the function (typically Y = {−1, 1}, corresponding to a boolean function, or Y ⊆ R). The target function is assumed to come from a class of functions, C (e.g. linear threshold functions, polynomial-size DNF expressions, polynomials). A parameter, n, corresponds to the size of the examples in the instance space X; in most cases this size parameter is natural, such as the dimension of the boolean cube or real space. (Typically, in computational learning theory, the instance space, concept class and distribution are defined as families over all possible size parameters n ≥ 0. This allows one to meaningfully talk about asymptotics in terms of the size parameter n. To simplify presentation, we will implicitly assume that this is the case, rather than define these terms as families.) There is a distribution D over the examples in X and it is supposed that D comes from a class of distributions D.

Let Y′ be a set such that Y ⊆ Y′ and let ℓ : Y′ × Y → R be a loss function. We require that the loss function satisfy the following properties: for every y ∈ Y, min_{y′∈Y′} ℓ(y′, y) = ℓ(y, y) = 0, and for every y′ ∈ Y′ and y ∈ Y, ℓ(y′, y) ≤ B for some B that depends only on Y′. Thus, the loss function is consistent and bounded. The goal of the learning algorithm (learner) is to output a hypothesis h : X → Y′ such that the expected loss of h with respect to the target function f : X → Y and the target distribution, D, is

L_{f,D}(h) = E_{x∼D}[ℓ(h(x), f(x))] ≤ ε.

We refer to ε as the accuracy parameter. The frameworks of learning differ in the manner in which the learning algorithm is allowed access to the target function. In the PAC setting, the learner may request
In the PAC setting, the learner may request Typically, in computational learning theory, the instance space, concept class and distribution are defined as families over all possible size parameters n ≥ 0. This allows one to meaningfully talk about asymptotics in terms of the size parameter n. To simplify presentation, we will implicitly assume that this is the case, rather than define these terms as families. 1 as the Chapter 2: Learning Models 9 random labeled examples, (x, f (x)), where x is drawn from the distribution D; in the SQ setting, the learner may request the (approximate) statistic Ex∼D [ψ(x, f (x))] where ψ(x, y) is some function defined by the learner; finally the CSQ setting is similar to the SQ setting, but the learner may only request (approximate) correlational statistics Ex∼D [ψ(x)f (x)]. Remark 2.1. The standard definitions of these models typically assume that the target functions and the output hypotheses are boolean. The standard loss function is the classification error c, where c (y , y) = 1 if y = y and 0 otherwise. We define a more general setting because some of the evolutionary algorithms (e.g. [29, 14, 13, 15, 40]) do consider the range of the output hypothesis to be different from {−1, 1} and the loss to be different from classification error. 2.2 PAC Learning In the probably approximately correct (PAC) learning setting, the learner is allowed access to an example oracle, EX(f, D), where f is the target function and D the target distribution. On being queried, the example oracle, EX(f, D), returns a random example (x, f (x)) where x ∼ D. Formally, PAC learning is defined as: Definition 2.1 (PAC Learning [38]). A concept class, C, is said to be probably approximately correctly (PAC) learnable with respect to a class of distributions, D, and loss function, , if there exists a (possibly randomized) learning algorithm LA, that for every , δ > 0, every target function f ∈ C and every distribution D ∈ D with access to oracle EX(f, D) outputs a hypothesis, h, that satisfies with probability Chapter 2: Learning Models 10 at least 1 − δ, Lf,D (h) = Ex∼D [ (h(x), f (x)] ≤ . Furthermore, the running time of the algorithm LA must be bounded by a polynomial in n, 1/ , and 1/δ and the output hypothesis, h, is required to be polynomially evaluable. 2.3 SQ Learning The statistical query (SQ) model of learning was introduced by Kearns [22] as a technique to design noise-tolerant learning algorithms. Let f : X → Y be the target function and D be the target distribution. In this model, the learning algorithm only has access to a statistics oracle STAT(f, D). The query to the statistics oracle, STAT(f, D), is of the form (ψ, τ ), where ψ : X × Y → [−1, 1] is the query function2 and τ is a tolerance parameter. The oracle responds with a value v that satisfies |v − Ex∼D [ψ(x, f (x))]| ≤ τ . Thus, the learning algorithm has access to approximate statistics about the target function and distribution (e.g. what is the fraction of examples for which x1 = 1 and the label is −1?), but not labeled examples themselves. Definition 2.2 (SQ Learning [22]). A concept class, C, is said to be learnable with statistical queries (or SQ learnable) with respect to a class of distributions, D, and loss function, , if there exists a (possibly randomized) learning algorithm, LA, that for every > 0, every target function f ∈ C and every distribution D ∈ D with access to oracle, STAT(f, D) outputs a hypothesis, h, that satisfies with probability at least Kearns requires this function to be boolean. 
1/2,

L_{f,D}(h) = E_{x∼D}[ℓ(h(x), f(x))] ≤ ε.

The running time of the algorithm must be bounded by a polynomial in n and 1/ε. Furthermore, each query (ψ, τ) made to the oracle STAT(f, D) must be such that ψ is polynomially evaluable and 1/τ is bounded by a polynomial in n and 1/ε, and the final output hypothesis, h, must be polynomially evaluable.

We discuss some variants of the statistical query learning framework in Chapter 6.

2.4 Correlational Statistical Query Learning

In this section, we restrict attention to the case where the target function, f, is boolean and the loss function is the classification error, ℓ_c, i.e. ℓ_c(y′, y) = 1 if y′ ≠ y and 0 otherwise. In this case, we refer to the expected loss as err_{f,D} rather than L_{f,D}. A statistical query, ψ : X × {−1, 1} → [−1, 1], is said to be target-independent if ψ(x, b) ≡ ψ^{ti}(x) for some function ψ^{ti} : X → [−1, 1]. A statistical query is said to be correlational if ψ(x, b) = b·ψ^{cor}(x) for some function ψ^{cor} : X → [−1, 1]. Bshouty and Feldman [7] showed that if a learning algorithm is only allowed to make target-independent or correlational queries to the statistics oracle, STAT(f, D), then to get a valid response for an arbitrary query, (ψ, τ), it is sufficient to make two queries, one target-independent and one correlational, each with tolerance τ/2.

Lemma 2.1 ([7]). Let STAT(f, D) be the statistics oracle. It is possible to simulate a valid response of STAT(f, D) to the query (ψ, τ), where ψ : X × {−1, 1} → [−1, 1], by making two queries (ψ^{ti}, τ/2) and (ψ^{cor}, τ/2), where ψ^{ti} is target-independent and ψ^{cor} is correlational.

Proof. The key observation is the following (since y ∈ {−1, 1}):

ψ(x, y) = ψ(x, 1)·(1 + y)/2 + ψ(x, −1)·(1 − y)/2
        = (ψ(x, 1) + ψ(x, −1))/2 + y·(ψ(x, 1) − ψ(x, −1))/2.

Define ψ^{ti} : X → [−1, 1] by ψ^{ti}(x) = (ψ(x, 1) + ψ(x, −1))/2 and ψ^{cor} : X → [−1, 1] by ψ^{cor}(x) = (ψ(x, 1) − ψ(x, −1))/2. If v1 and v2 satisfy |E_{x∼D}[ψ^{ti}(x)] − v1| ≤ τ/2 and |E_{x∼D}[f(x)ψ^{cor}(x)] − v2| ≤ τ/2, then:

|E_{x∼D}[ψ(x, f(x))] − (v1 + v2)| = |E_{x∼D}[ψ^{ti}(x)] + E_{x∼D}[f(x)ψ^{cor}(x)] − v1 − v2| ≤ τ.

The correlational statistical query (CSQ) learning model was introduced by Feldman [12]. In this model, the learning algorithm is only allowed to make correlational queries to the statistics oracle. Denote by CSQ-O(f, D) the oracle that, on receiving a query (ψ^{cor}, τ), where ψ^{cor} : X → [−1, 1] is the query function and τ is the tolerance, outputs a value, v, such that |E_{x∼D}[ψ^{cor}(x)f(x)] − v| ≤ τ. Formally, CSQ learning is defined as:

Definition 2.3 (CSQ Learning [12]). A concept class, C, is said to be learnable with correlational statistical queries (or CSQ learnable) with respect to a class of distributions, D, if there exists a (possibly randomized) learning algorithm, LA, that for every ε > 0, every target function f ∈ C and every distribution D ∈ D, with access to oracle CSQ-O(f, D), outputs a hypothesis, h, that satisfies, with probability at least 1/2,

err_{f,D}(h) = Pr_{x∼D}[h(x) ≠ f(x)] ≤ ε.

The running time of the algorithm must be bounded by a polynomial in n and 1/ε.
Furthermore, each query (ψ^{cor}, τ) made to the oracle CSQ-O(f, D) must be such that ψ^{cor} is polynomially evaluable and 1/τ is bounded by a polynomial in n and 1/ε, and the final output hypothesis, h, is polynomially evaluable.

The CSQ learning framework is discussed in greater detail in Chapter 6.

Chapter 3

Computational Model

This chapter presents Valiant's computational model of evolution [39] and some extensions to the model by Feldman [14] and P. Valiant [40]. This model and its extensions form the basis of most of the work in this thesis. For further details the reader is referred to Valiant's paper [39].

In Valiant's study of evolution, the objective is to understand quantitatively which mechanisms can develop in living organisms within a reasonable time frame and with realistic population sizes, and which are too complex. For simplicity, evolution of a single functionality is discussed. The functional behavior of an organism is encoded in the genome as a gene regulatory network or a genetic circuit. The output of a genetic circuit is typically a function of the concentration levels of several proteins. The output could be, for example, whether or not an enzyme is produced, or the expression level of a protein. The question of interest is to understand how complex these circuits could be, given that they have evolved through the processes of mutation and natural selection.

Valiant proposes understanding this question from the viewpoint of computational complexity, where the complexity of genetic circuits is measured by the mathematical functions they realize. Evolution is treated as a form of computational learning; however, only the "aggregate fitness" of the hypothesis may affect the course of evolution, not the outcome on specific instances. This restriction is placed to model natural selection, which acts only on the cumulative fitness of an individual and is blind to the outcome of an individual action in specific circumstances.

3.1 Environment and Ideal Function

An organism is treated as an entity that computes a many-argument function on some domain X; typically, the settings of interest are X = {−1, 1}^n and X ⊆ R^n. It is assumed that n is a parameter that represents the size of instances in the input space, and it is required that the resources used be bounded by a polynomial in n. (This avoids the cumbersome notational overhead where instance spaces, concept classes and distribution classes are expressed as families for all values of n ≥ 0.) Each instance, x ∈ X, is considered as a "setting" of an environmental condition – an environmental condition could be represented by whether or not certain proteins are expressed, or by the concentration levels of various proteins.

An ideal function defines what the optimal output should be for each possible input setting or environmental condition. We allow the output space of the ideal function to be an arbitrary set Y. Since the goal is to study the feasibility of evolution depending on the complexity of the ideal function, as in learning theory, the ideal function is assumed to come from some concept class. A concept class, C, of functions from X → Y is simply defined to be a subset of functions from X → Y, e.g. linear threshold functions, polynomial functions, polynomial-size decision trees. In Valiant's work [39] the output space, Y, is restricted to be {−1, 1} – modeling a genetic circuit as a boolean function.
Thus, the ideal function could be in the class of conjunctions, threshold functions, etc. Later, P. Valiant introduced the setting where the output itself could be real-valued; he was particularly interested in the setting where Y ⊆ R^m. In this case the classes of interest could be sets of linear functions, polynomial functions, etc. In this thesis, we retain generality whenever possible and allow mechanisms to be functions from an arbitrary input space X to an arbitrary output space Y. When further constraints are necessary for our results, we will mention those explicitly.

3.2 Representation

An organism is treated as a string that represents a function, the requirement being that there exist a polynomial time Turing machine that, given the string representation, r, of a function and a point, x ∈ X, outputs the value r(x). Here, by a slight abuse of notation, the letter r is used to denote both the representation of the function as a string and the function itself. The representation, r, computes a function r : X → Y′, where Y ⊆ Y′. The output of a representation (organism) is deliberately allowed to be outside of the range of the ideal function, as this could potentially allow for richer feedback. A representation class, R, is a set of string representations, where each (string) representation corresponds to a polynomially evaluable function over the instance space X.

3.3 Performance (or Expected Loss)

The performance of an organism (or its representation) is a measure of how "close" it is to the ideal function. The goal of evolution is to achieve high performance, or alternatively low expected loss. Suppose ℓ : Y′ × Y → R+ is a loss function that satisfies, for every y ∈ Y, min_{y′∈Y′} ℓ(y′, y) = ℓ(y, y) = 0, i.e. the loss ℓ(y′, y) is minimized when y′ = y, and also that for every y′ ∈ Y′ and y ∈ Y, ℓ(y′, y) ≤ B for some B that depends only on Y′. In other words, the loss function, ℓ, is consistent and bounded. Suppose D is a distribution over the instance space X; in the case of evolution, D indicates how likely it is that an organism might face particular kinds of environmental conditions. For any representation, r : X → Y′, we can define the expected loss with respect to an ideal function f : X → Y and distribution D as

L_{f,D}(r) = E_{x∼D}[ℓ(r(x), f(x))].

Alternatively, we may define performance as ℓ-Perf_{f,D}(r) = 1 − 2L_{f,D}(r)/B, so that ℓ-Perf_{f,D}(r) ∈ [−1, 1]. (Valiant's original paper [39], and indeed most of the follow-up work, uses the notion of performance rather than expected loss; however, in the setting where we allow arbitrary input and output spaces, it is more convenient to describe the model and results in terms of expected loss.) The goal of evolution is to reach a representation that is ε-close to the ideal function, i.e. L_{f,D}(r) ≤ ε, and for evolution to be considered feasible, it is required that this be achieved using resources that are bounded by a polynomial in n and 1/ε.

3.4 Mutation

Mutations are caused by changes in the DNA sequence, but the exact nature of mutations is as yet inadequately understood. Several different kinds of mutations are observed in nature – base substitution, deletion of a few base pairs, insertion of a few base pairs, copying of a few base pairs, inversions, and possibly other more involved ones. It is not known whether these mutations occur uniformly at random across the genome (see for example [33]).
A more serious difficulty in modeling mutations of representations directly as changes in the genome sequence is that the relationship between genotype and phenotype is not well understood. Predicting the functional behavior of a section of the genome, given its sequence of base pairs, seems currently out of reach. However, it may still be possible to understand the limitations and potential of evolution in a more abstract sense. The extended Church-Turing thesis asserts that the processes of interpreting the genome as a function and of mutating the genome should be computable by an efficient Turing machine. Valiant's model allows the mutations to be expressed as the output of a polynomial time Turing machine that takes as input the existing representation. In some sense, this can be seen as allowing mutations to be caused by the most powerful (computationally feasible) process. A mutator is defined as a randomized polynomial time Turing machine that takes as input a representation, r, and a target accuracy parameter, ε, and outputs a new (possibly random) representation r′. Formally,

Definition 3.1 (Mutator). A mutator for a representation class, R, is a randomized Turing machine, Mut, that takes as inputs a representation, r ∈ R, and ε, and outputs a representation r′. The running time of Mut is bounded by a polynomial in |r| and 1/ε. We think of Mut(r, ε) as a random variable that takes values in R.

3.5 Selection Rules

A key idea in the theory of evolution is natural selection: the fact that "more fit" members of a population are more likely to survive than "less fit" ones. Unfortunately, how to define fitness is often far from clear. In population genetics, the fitness of a genotype is defined as the (expected) number of offspring that an individual with that genotype produces that live to maturity. However, this side-steps the question of what makes certain genotypes "more fit" than others from the point of view of functional behavior. In Valiant's model of evolution, a natural notion of fitness is the performance (or expected loss) of individual genomes (representations). However, there still remains the difficulty of understanding how better performance with respect to a functionality translates into increased fitness in the sense of leaving greater numbers of progeny (which is how natural selection acts). This depends on how critical the functionality is for survival, which in turn decides the strength of natural selection. What can be said with certainty is that the fitness, or the chance of survival, of a genome should be an increasing function of its performance (a decreasing function of the expected loss). A few different selection rules are discussed in this section; it has been shown by Feldman [14] and Valiant [39] that there is a certain degree of robustness in the model, in the sense that which concept classes are evolvable remains more or less unchanged for several choices of selection rule.

The size of the population places a constraint on the number of mutations that can be explored for natural selection to act on. Given two representations, the one with lesser expected loss (greater performance) is more likely to survive; however, it is unrealistic to expect that differences in expected loss that are negligible (super-polynomially small) will be "detected" by natural selection. Thus, selection is governed not by the exact expected loss of a particular representation, but by an empirical estimate.
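As a concrete illustration (a minimal sketch with hypothetical function names, not the formal oracle defined next), such an empirical estimate can be obtained by averaging the loss over a sample: by Hoeffding's inequality, on the order of (B/τ)² log(1/δ) samples give an estimate within τ of the true expected loss with probability at least 1 − δ, which is exactly the kind of guarantee the oracle below is assumed to provide.

```python
import math
import random

def approx_loss(r, f, draw, loss, tau, delta=0.01, B=1.0):
    """Empirical estimate of L_{f,D}(r) = E_{x~D}[loss(r(x), f(x))].

    r, f : representation and ideal function (callables)
    draw : samples one environmental condition x from the distribution D
    loss : loss function with values in [0, B]
    By Hoeffding's inequality, m samples give an estimate within tau of
    the true expected loss with probability at least 1 - delta.
    """
    m = math.ceil((B * B) / (2 * tau * tau) * math.log(2 / delta))
    return sum(loss(r(x), f(x)) for x in (draw() for _ in range(m))) / m

# Toy usage (hypothetical): classification loss under the uniform distribution.
n = 5
f = lambda x: 1 if x[0] == 1 and x[1] == 1 else -1
r = lambda x: 1 if x[0] == 1 else -1
draw = lambda: [random.choice([-1, 1]) for _ in range(n)]
loss_c = lambda yp, y: 0.0 if yp == y else 1.0
print(approx_loss(r, f, draw, loss_c, tau=0.05))
```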
To simplify presentation, we assume that there is an oracle L_{f,D}(·, ·) that, for ideal function f and distribution D, when queried with a representation, r, and an approximation parameter, τ, returns a value v = L_{f,D}(r, τ) such that |v − L_{f,D}(r)| ≤ τ, i.e. L_{f,D}(r, τ) is a τ-(additive)-approximation to the expected loss L_{f,D}(r). (Valiant instead assumes that the expected loss (performance) is estimated by taking a sample from the distribution D.) Furthermore, it is supposed that there is a tolerance parameter, t, that determines whether a difference in (approximate) expected loss is large enough to be detected by natural selection.

The process of (natural) selection starting from a representation, r, first involves the action of a mutator, Mut, which defines the neighborhood that may be explored. The number of candidate mutations explored is given by a (population) size parameter, s. The s-neighborhood, Neigh(r, ε), of a representation, r, given accuracy parameter ε, is a (random) multi-set obtained by running the mutator, Mut(r, ε), s times. The empirical performance of each of these mutations is obtained by querying the oracle L_{f,D}(·, τ) for some approximation parameter τ. Finally, depending on the selection rule and the tolerance parameter, t, one of the mutations from the multi-set Neigh(r, ε) is selected. Each selection rule can be thought of as a random variable that outputs a representation, r′, for the next generation, and we express this selection process as:

r′ ← Sel(r, ε, Mut, L_{f,D}(·, ·), τ, s, t)

Three specific selection rules are defined in this section. The first two appear in Valiant's paper [39] and the last implicitly in Feldman's work [14]. In all three of the selection rules described below, we use the following notation: let r be the current representation, ε the accuracy parameter, s the size parameter (which determines the number of mutations explored), t the tolerance parameter, and τ the approximation parameter (for estimates of the expected loss). Then Neigh(r, ε) is a multi-set obtained by running the Turing machine Mut(r, ε) s times. Let Neigh(r, ε) = {r_1, . . . , r_s} and denote by v_i the quantity L_{f,D}(r_i, τ), so that v_i satisfies |v_i − L_{f,D}(r_i)| ≤ τ. Also, let v = L_{f,D}(r, τ), so that |v − L_{f,D}(r)| ≤ τ.

3.5.1 Selection Based on Neutral and Beneficial Mutations

Beneficial mutations, Bene, are those whose (approximate) expected loss is noticeably lower than that of r. Neutral mutations, Neut, are those whose (approximate) expected loss is not noticeably different from that of r. The remaining mutations are considered deleterious. Whether or not the difference is noticeable is determined by the tolerance parameter, t. Formally,

Bene = {r_i ∈ Neigh(r, ε) | v_i < v − t}
Neut = {r_i ∈ Neigh(r, ε) | |v_i − v| ≤ t}

Selection proceeds as follows: if the multi-set Bene is not empty, then a representation from Bene is chosen uniformly at random; if Bene is empty and Neut is non-empty, a representation from Neut is selected uniformly at random; if both Bene and Neut are empty, the special representation ⊥ is selected, which indicates the end of the evolutionary process (or extinction). Although the selection from Bene or Neut is done uniformly at random, the fact that these are multi-sets means that some beneficial (or neutral) mutations may be more likely to survive than others, depending on the probability with which the mutator generated them.
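The selection step just described can be summarized in a short sketch (a schematic rendering with hypothetical helper names, not part of the model's formal definition; the loss oracle is the τ-approximation discussed above):

```python
import random

BOTTOM = None  # stands for the special representation ⊥ (extinction)

def select_bene_neut(r, eps, mutate, loss_oracle, tau, s, t):
    """One selection step: prefer beneficial mutations, else neutral ones.

    mutate(r, eps)       -> one random candidate mutation of r
    loss_oracle(r, tau)  -> a value within tau of the true expected loss
    """
    neigh = [mutate(r, eps) for _ in range(s)]        # multi-set Neigh(r, eps)
    v = loss_oracle(r, tau)
    scored = [(ri, loss_oracle(ri, tau)) for ri in neigh]
    bene = [ri for ri, vi in scored if vi < v - t]    # noticeably better than r
    neut = [ri for ri, vi in scored if abs(vi - v) <= t]
    if bene:
        return random.choice(bene)  # uniform over multi-sets, so mutations the
    if neut:                        # mutator generates often are favored
        return random.choice(neut)
    return BOTTOM
```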
We will denote this selection rule by BN-Sel. In particular, if r′ is the representation selected as an outcome of this process, we express this as:

r′ ← BN-Sel(r, ε, Mut, L_{f,D}(·, ·), τ, s, t).

3.5.2 Selection based on Optimization

In this selection rule, one among the "best" (having least expected loss) mutations from the neighborhood is selected. Let v* = min_{i∈[s]} v_i. If v* > v + t, i.e. even the mutation with the least (approximate) expected loss is noticeably worse than the starting representation, r, the selection rule chooses the special representation ⊥, indicating an end to the evolutionary process. Otherwise, the multi-set Opt consists of the mutations from Neigh(r, ε) that have the least (approximate) expected loss (and are not noticeably worse than r). Formally,

Opt = {r_i ∈ Neigh(r, ε) | |v_i − v*| ≤ t and v_i ≤ v + t}.

In this case, selection chooses a mutation from the multi-set Opt uniformly at random to be the representation for the next stage (generation). We denote this selection rule by Opt-Sel. In particular, if r′ is the representation selected as the outcome of this process, we express this as:

r′ ← Opt-Sel(r, ε, Mut, L_{f,D}(·, ·), τ, s, t).

3.5.3 Weak Selection

In this selection rule, neutral and beneficial mutations are not distinguished explicitly; all neutral and beneficial mutations are considered feasible. Formally, denote the multi-set of feasible mutations, Feas, as:

Feas = {r_i ∈ Neigh(r, ε) | v_i ≤ v + t}.

Selection proceeds as follows: if Feas is not empty, a representation from Feas is selected uniformly at random; otherwise the representation for the next generation is set to ⊥, indicating an end to the evolutionary process. We introduce this selection rule to model scenarios in which selection pressures are weak, e.g. because the trait being evolved is not highly critical for the survival of the organism. We discuss this in greater detail in the context of evolution with recombination (see Chapter 5). Feldman considers a more general (and arguably more natural) notion of weak selection [14]; however, as far as evolvability of concept classes is concerned, the weak selection rules seem to be identical. The interested reader is referred to Section 6 of [14].

3.6 Evolvability

In this section, we formally define the notion of evolvability of a concept class. Let X be the instance space (set of environments) and let C be a concept class of functions from X → Y. Let D be a class of distributions over X. Let n denote the parameter that characterizes the size of instances in X (e.g. if X = {−1, 1}^n or X ⊆ R^n). (Formally, we require that the length of the string representation of every x ∈ X, denoted by |x|, be bounded by some polynomial in n.) Let ε be the target accuracy parameter. We consider the following components of an evolutionary mechanism:

• Representation Class: We say that a representation class, R, of (representations of) functions from X → Y′ is polynomially bounded (with respect to n and ε) if for every r ∈ R, the length of the string encoding r, denoted by |r|, is bounded by a polynomial in n and 1/ε, and for every x ∈ X, there is a polynomial time (in |r| and |x|) Turing machine that, with r and x as inputs, outputs r(x).

• Mutator: Given a polynomially bounded representation class, R, an evolutionary mechanism uses a polynomially bounded mutator as defined in Definition 3.1.

• Approximation Parameter: An evolutionary mechanism uses an approximation parameter, τ.
This is used to obtain (additive) approximate estimates of the expected loss of representations using the oracle L_{f,D}(·, τ).

• Size Function: Let R be a polynomially bounded representation class. Let s : R × R+ → N denote the size function, which is said to be polynomially bounded if for every r ∈ R and ε > 0, s(r, ε) is bounded by some polynomial in n and 1/ε, and s(r, ε) can be computed in polynomial time. The size function determines the size of the neighborhood that is explored using the mutator.

• Tolerance Function: Let R be a polynomially bounded representation class. Let t : R × R+ → R+ denote the tolerance function, which is said to be polynomially sandwiched if there exist polynomials t_ℓ(·, ·) and t_u(·, ·) such that for every r ∈ R and every ε > 0,

t_ℓ(n, 1/ε) ≤ 1/t(r, ε) ≤ t_u(n, 1/ε),

and there exists a constant c such that t_u(n, 1/ε) ≤ t_ℓ(n, 1/ε)^c. It is also required that t(r, ε) be computable in polynomial time.

Thus, we define an evolutionary mechanism as a 5-tuple, EM = (R, Mut, τ, s, t), consisting of a representation class, mutator, approximation parameter, size function and tolerance function. We say that an evolutionary mechanism, EM = (R, Mut, τ, s, t), is polynomially bounded if R is a polynomially bounded representation class, Mut is a polynomially bounded mutator, 1/τ is bounded by a polynomial in n and 1/ε, s is polynomially bounded, and t is polynomially sandwiched.

Let f ∈ C be the (target) ideal function and let D ∈ D be a distribution over X. We define the notion of an evolutionary step involving an evolutionary mechanism, EM, a starting representation, r_0 ∈ R, and a selection rule, Sel. Formally,

Definition 3.2 (Evolutionary Step). For ideal function f and distribution D, an evolutionary step starting from representation r_0, using evolutionary mechanism EM and selection rule Sel, produces a representation r′ ∈ R, where:

r′ ← Sel(r_0, ε, Mut, L_{f,D}(·, ·), τ, s(r_0, ε), t(r_0, ε)).

We refer to this in short as r′ ← ES(r_0, EM, Sel).

Thus, an evolutionary step is simply one stage of evolution involving mutation and (natural) selection. An evolutionary sequence, r_0, r_1, r_2, . . ., is obtained by a series of evolutionary steps starting from r_0, where r_i ← ES(r_{i−1}, EM, Sel) for all i ≥ 1. We can now define evolvability of a concept class, C, with respect to a distribution class, D, with loss function ℓ and selection rule Sel.

Definition 3.3 (Evolvability [39]). A concept class, C, is said to be evolvable with respect to a class of distributions, D, using a selection rule Sel and loss function ℓ, if there exists a polynomially bounded evolutionary mechanism, EM = (R, Mut, τ, s, t), such that for every f ∈ C and every distribution D ∈ D, where the expected loss of any r : X → Y′ is defined as L_{f,D}(r) = E_{x∼D}[ℓ(r(x), f(x))], and for any r_0 ∈ R, the evolutionary sequence r_0, r_1, . . . , r_g obtained using EM and Sel is such that L_{f,D}(r_g) ≤ ε, with g bounded by a polynomial in n and 1/ε.

3.7 Some Evolutionary Algorithms

In order to better explain the notion of evolvability, we describe two evolutionary mechanisms. The first evolves the class of monotone conjunctions under the uniform distribution over the boolean cube. The second evolves the class of homogeneous linear separators in R^n with respect to radially symmetric distributions. In each case, we need to specify the representation class, mutator, τ, s and t. We also need to specify the selection rule and the loss function.
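Before turning to the two mechanisms, the following sketch shows schematically how the components of Definition 3.3 fit together (hypothetical names; select_bene_neut is the Bene/Neut sketch from Section 3.5.1; in the formal model there is no explicit stopping test, only the requirement that the loss fall below ε within polynomially many generations):

```python
def evolve(r0, eps, mechanism, select, loss_oracle, max_generations):
    """Run an evolutionary sequence r0, r1, ... (Definition 3.3, schematically).

    mechanism = (mutate, tau, size_fn, tol_fn), mirroring EM = (R, Mut, tau, s, t);
    select is a selection rule such as select_bene_neut from Section 3.5.1.
    """
    mutate, tau, size_fn, tol_fn = mechanism
    r = r0
    for generation in range(max_generations):
        if loss_oracle(r, tau) <= eps:   # loss is eps-close to the ideal function
            return r, generation
        r = select(r, eps, mutate, loss_oracle, tau,
                   size_fn(r, eps), tol_fn(r, eps))
        if r is None:                    # ⊥: no neutral or beneficial mutation
            break
    return r, max_generations
```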
3.7.1 Monotone Conjunctions

The first evolutionary algorithm we describe is for the class of monotone conjunctions over the boolean cube, X = {−1, 1}^n (we assume that n ≥ 2). Let U denote the uniform distribution over {−1, 1}^n; in this section we assume that the loss function is the classification loss, ℓ_c (where ℓ_c(y′, y) = 1 if y′ ≠ y and 0 otherwise). (The result actually holds for every loss function, since both the ideal function and the representations are boolean; in this case, as long as ℓ(1, −1) = ℓ(−1, 1) ≠ 0, all loss functions are equivalent.) Formally, if X = {−1, 1}^n, the class of monotone conjunctions is defined as

C = {∧_{i∈S} x_i | S ⊆ [n]}.

We use a representation class that is the same as the concept class, R = C. Thus, each representation will simply be a monotone conjunction. For a monotone conjunction r ∈ R = C, let lit(r) denote the set of indices corresponding to the literals in r, so that r ≡ ∧_{i∈lit(r)} x_i. For example, if r = x_2 ∧ x_7 ∧ x_12, then lit(r) = {2, 7, 12}. Let ε be the accuracy parameter. Next, we define the mutator Mut(r, ε) as follows:

• If |lit(r)| > log₂(3/ε), then the mutator, Mut(r, ε), picks an i ∈ lit(r) uniformly at random and outputs a monotone conjunction, r′, such that lit(r′) = lit(r) \ {i}, i.e. r′ is the monotone conjunction obtained by dropping the literal x_i from r.

• If |lit(r)| ≤ log₂(3/ε), the mutator, Mut(r, ε), does the following:

1. Deleting a literal: With probability 1/(2n), it picks an i ∈ lit(r) uniformly at random and outputs r′ such that lit(r′) = lit(r) \ {i}.

2. Adding a literal: If |lit(r)| + 1 ≤ log₂(3/ε), with probability 1/(2|lit(r)| + 2) (otherwise with probability 0), it picks an i ∉ lit(r) uniformly at random and outputs r′ such that lit(r′) = lit(r) ∪ {i} (or equivalently r′ ≡ r ∧ x_i).

3. Swapping literals: With the remaining probability, it picks an i ∈ lit(r) uniformly at random and a j ∉ lit(r) uniformly at random and outputs r′ such that lit(r′) = (lit(r) \ {i}) ∪ {j}.

Notice that when n ≥ 2, the mutator outputs a mutation obtained by swapping literals with probability at least 1/4.

We use τ = ε²/36 as the approximation parameter. The size function, s : R × R+ → N, is defined as s(r, ε) = ⌈4n(|lit(r)| + 1) ln(36/ε³)⌉. Finally, the tolerance function, t : R × R+ → R+, is defined as t(r, ε) = ε²/12 if |lit(r)| ≤ log₂(3/ε), and t(r, ε) = ε/2 if |lit(r)| > log₂(3/ε). The proof of Theorem 3.1 first appeared in Valiant's work [39] and was later simplified by Diochnos and Turán [10]. The proof presented here is a minor modification of the latter.

Theorem 3.1 ([39, 10]). Let C denote the class of monotone conjunctions on the instance space X = {−1, 1}^n and let U denote the uniform distribution over X. Then C is evolvable to accuracy ε by the evolutionary mechanism, EM = (C, Mut, τ, s, t), described above in this section, using selection rule BN-Sel and classification error, ℓ_c, as the loss function, in g generations, where g ≤ n − log₂(3/ε) + 36/ε².

Proof. When the loss function is the classification error, for any two monotone conjunctions r_1 and r_2, L_{r_1,U}(r_2) = L_{r_2,U}(r_1) = Pr_{x∼U}[r_1(x) ≠ r_2(x)]. A useful
A useful Chapter 3: Computational Model 29 observation is the following: for monotone conjunctions r1 and r2 : Pr [r1 (x) = r2 (x)] = 2−|lit(r1 )| (1 − 2−|lit(r2 )\lit(r1 )| ) + 2−|lit(r2 )| (1 − 2−|lit(r1 )\lit(r2 )| ) = 2−|lit(r1 )| + 2−|lit(r2 )| − 2−|lit(r1 )∪lit(r2 )|+1 (3.1) x∼U Let f denote the ideal function and note that lit(f ) denotes the indexes of the literals in f . Let r0 be the starting representation. First, suppose that |lit(r0 )| > log2 (3/ ). Let r = Mut(r0 , ) be a random candidate mutation. Note that when |lit(r0 )| > log2 (3/ ), r is simply obtained by deleting a literal from r0 . Thus, |lit(r )| = |lit(r0 )| − 1. Then using (3.1) we have, Lf, U (r0 ) − Lf, U (r ) = 2−|lit(r0 )| − 2−|lit(r0 )∪lit(f )|+1 − 2−|lit(r )| + 2−|lit(r )∪lit(f )|+1 = 2−|lit(r0 )| − 2−|lit(r0 )|+1 + 2−|lit(r )∪lit(f )|+1 − 2−|lit(r0 )∪lit(f )|+1 ≥ −2−|lit(r0 )| ≥ − /3 where the last line follows from the fact that 2−|lit(r )∪lit(f )|+1 − 2−|lit(r0 )∪lit(f )|+1 ≥ 0 (since lit(r ) ⊂ lit(r0 )) and the fact that |lit(r0 )| > log2 (3/ ). Note that when |lit(r0 )| > log2 (3/ ), the tolerance function is given by: t(r0 , ) = /2. Thus, when τ = 2 /36, let Lf, U (r0 , τ ) and Lf, U (r , τ ) denote the (approximate) values of the expected loss. It holds that Lf, U (r , τ ) ≤ Lf, U (r ) + τ and Lf, U (r0 , τ ) ≥ Lf, U (r0 ) − τ . Then, we have the following: Lf, U (r , τ ) − Lf, U (r0 , τ ) ≤ Lf, U (r ) − Lf, U (r0 ) + 2τ 2 ≤ 3 +2 36 ≤ 2 = t(r0 , ) Thus every representation, r , that could have been output by the mutator Mut(r0 , ) is either neutral or beneficial. Also, since every such mutation, r , has one fewer literal Chapter 3: Computational Model 30 compared to r0 , a representation that has at most log2 (3/ ) literals will be reached in at most n − log2 (3/ ) evolutionary steps (generations). Thus, we may in fact assume that the starting representation, r0 , is such that |lit(r0 )| ≤ log2 (3/ ). We show that unless a representation, r0 , for which |lit(r0 )| ≤ log2 (3/ ), already satisfies Lf, U (r0 ) ≤ , there exists a beneficial mutation, r , that would be output by the mutator, Mut(r0 , ) (with significant probability). Case 1: Suppose there exists i, such that i ∈ lit(f ) and i ∈ lit(r0 ) and furthermore suppose it is also the case that |lit(r0 )| + 1 ≤ log2 (3/ ), then let r be such that lit(r ) = lit(r0 ) ∪ {i}, i.e. r = r0 ∧ xi . Consider Lf, U (r0 ) − Lf, U (r ) = 2−|lit(r0 )| − 2−|lit(r0 )∪lit(f )|+1 − 2−|lit(r )| + 2−|lit(r )∪lit(f )|+1 Notice that |lit(r )| = |lit(r0 )| + 1 and |lit(r ) ∪ lit(f )| = |lit(r0 ) ∪ lit(f )|. Thus, Lf, U (r0 ) − Lf, U (r ) = Note that since t(r0 , ) = 1 −|lit(r0 )| ·2 ≥ /6 2 2 /12, when τ = 2 /36 and for < 1, we can easily show that Lf, U (r , τ ) − Lf, U (r0 , τ ) ≤ − /6 + 2( 2 /36) < − 2 /12 = −t(r0 , ). Thus, the mutation, r , is beneficial. The last thing that remains to be checked is that the specific mutation, r , is in the neighborhood, Neigh(r0 , ), obtained by running the mutator, Mut(r0 , ), s(r0 , ) times. Note that for a fixed i, the probability that mutator, Mut(r0 , ), outputs r = r0 ∧ xi , is at least 1/((2|lit(r0 )| + 2)n), thus when s(r, ) = 4n(|lit(r0 )| + 1) ln(36/ 3 ) , except with probability Neigh(r0 , ). Thus, except with probability lected. 
Case 2: Suppose that adding a literal to $r_0$ is not possible because $|\mathrm{lit}(r_0)| = \lfloor\log_2(3/\epsilon)\rfloor$, but it is the case that there is some $i \in \mathrm{lit}(f)$ such that $i \notin \mathrm{lit}(r_0)$, and also some $j \in \mathrm{lit}(r_0)$ such that $j \notin \mathrm{lit}(f)$. Let $r'$ be the conjunction satisfying $\mathrm{lit}(r') = (\mathrm{lit}(r_0)\cup\{i\})\setminus\{j\}$. Consider,

$$L_{f,U}(r_0) - L_{f,U}(r') = 2^{-|\mathrm{lit}(r_0)|} - 2^{-|\mathrm{lit}(r_0)\cup\mathrm{lit}(f)|+1} - 2^{-|\mathrm{lit}(r')|} + 2^{-|\mathrm{lit}(r')\cup\mathrm{lit}(f)|+1} \quad (3.2)$$

Notice that $|\mathrm{lit}(r')| = |\mathrm{lit}(r_0)|$ and $|\mathrm{lit}(r')\cup\mathrm{lit}(f)| = |\mathrm{lit}(r_0)\cup\mathrm{lit}(f)| - 1$. Also, we have assumed that $|\mathrm{lit}(r_0)| = \lfloor\log_2(3/\epsilon)\rfloor$. If $|\mathrm{lit}(f)| > \lfloor\log_2(3/\epsilon)\rfloor$, then using (3.1) it is easy to see that $L_{f,U}(r_0) \le \epsilon$, in which case the evolutionary mechanism has succeeded. Thus, we may assume that $|\mathrm{lit}(f)| \le \lfloor\log_2(3/\epsilon)\rfloor$, and hence $|\mathrm{lit}(r_0)\cup\mathrm{lit}(f)| \le 2\lfloor\log_2(3/\epsilon)\rfloor$. Hence, (3.2) implies that:

$$L_{f,U}(r_0) - L_{f,U}(r') \ge 2\cdot 2^{-|\mathrm{lit}(r_0)\cup\mathrm{lit}(f)|} \ge \frac{2\epsilon^2}{9}$$

Now, when $t(r_0,\epsilon) = \epsilon^2/12$ and $\tau = \epsilon^2/36$, it is easy to show that $L_{f,U}(r',\tau) - L_{f,U}(r_0,\tau) \le L_{f,U}(r') - L_{f,U}(r_0) + 2\tau \le -2(\epsilon^2/9) + 2(\epsilon^2/36) < -\epsilon^2/12 = -t(r_0,\epsilon)$. Thus, the mutation $r'$ is beneficial. Notice that for any fixed $i$ and $j$, the mutator outputs $r'$ (as defined in this case) with probability at least $1/(4n(|\mathrm{lit}(r_0)|+1))$. Thus, it is easy to see that when $s(r_0,\epsilon) = \lceil 4n(|\mathrm{lit}(r_0)|+1)\ln(36/\epsilon^3)\rceil$, except with probability $\epsilon^3/36$, the specific (beneficial) mutation $r'$ will be in $\mathrm{Neigh}(r_0,\epsilon)$ (obtained by running $\mathrm{Mut}(r_0,\epsilon)$ $s(r_0,\epsilon)$ times). Thus, except with probability $\epsilon^3/36$, a beneficial mutation will be selected.

Case 3: Finally, if it is the case that no literal may be added to $r_0$ as in Case 1, and no literal may be swapped as in Case 2, then $\mathrm{lit}(f) \subseteq \mathrm{lit}(r_0)$ and $|\mathrm{lit}(r_0)| \le \log_2(3/\epsilon)$. If $\mathrm{lit}(f) = \mathrm{lit}(r_0)$, then obviously evolution has succeeded. Otherwise, let $i \in \mathrm{lit}(r_0)$ be such that $i \notin \mathrm{lit}(f)$, and let $r'$ be the conjunction such that $\mathrm{lit}(r') = \mathrm{lit}(r_0)\setminus\{i\}$, i.e. $r_0 = r' \wedge x_i$. Observe that $\mathrm{lit}(r_0)\cup\mathrm{lit}(f) = \mathrm{lit}(r_0)$ and $\mathrm{lit}(r')\cup\mathrm{lit}(f) = \mathrm{lit}(r')$. Then using (3.1) we get,

$$L_{f,U}(r_0) - L_{f,U}(r') = 2^{-|\mathrm{lit}(r_0)|} - 2\cdot 2^{-|\mathrm{lit}(r_0)|} - 2^{-|\mathrm{lit}(r')|} + 2\cdot 2^{-|\mathrm{lit}(r')|} = 2^{-|\mathrm{lit}(r')|} - 2^{-|\mathrm{lit}(r_0)|} \ge 2^{-|\mathrm{lit}(r_0)|} \ge \epsilon/3$$

Thus, it is easily seen that for $t(r_0,\epsilon) = \epsilon^2/12$ and $\tau = \epsilon^2/36$, $L_{f,U}(r',\tau) - L_{f,U}(r_0,\tau) \le -\epsilon/3 + 2\epsilon^2/36 < -\epsilon^2/12 = -t(r_0,\epsilon)$. Thus, $r'$ is a beneficial mutation. Notice that for any $i \in \mathrm{lit}(r_0)$, the probability that $r'$ (as defined in this case) is output by the mutator is at least $1/(2n(|\mathrm{lit}(r_0)|+1))$. When $s(r_0,\epsilon) = \lceil 4n(|\mathrm{lit}(r_0)|+1)\ln(36/\epsilon^3)\rceil$, except with probability $\epsilon^3/36$, the particular mutation $r'$ will be in $\mathrm{Neigh}(r_0,\epsilon)$. Thus, except with probability $\epsilon^3/36$, a beneficial mutation will be selected.
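All three cases feed into the same selection step. The following simplified sketch of one generation under BN-Sel shows how the tolerance $t$ separates beneficial from neutral mutations; `approx_loss` stands in for the oracle $L_{f,D}(\cdot,\tau)$, and the exact tie-breaking details of BN-Sel are abbreviated.

```python
import random

def one_generation(r, mutate, approx_loss, s, t):
    """One step of selection under BN-Sel (simplified sketch)."""
    base = approx_loss(r)
    neigh = [mutate(r) for _ in range(s)]              # the multiset Neigh(r, eps)
    scored = [(m, approx_loss(m)) for m in neigh]
    beneficial = [m for m, v in scored if v <= base - t]
    neutral = [m for m, v in scored if abs(v - base) <= t]
    if beneficial:
        # Choosing uniformly from the multiset is frequency-proportional selection.
        return random.choice(beneficial)
    if neutral:
        return random.choice(neutral)
    return None                                        # the failure symbol, bottom
```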
To finish the argument, consider the following. Suppose the starting representation $r_0$ was such that $|\mathrm{lit}(r_0)| > \log_2(3/\epsilon)$; then in at most $n - \log_2(3/\epsilon)$ generations (evolutionary steps), a representation $r_0$ such that $|\mathrm{lit}(r_0)| \le \log_2(3/\epsilon)$ is reached. Consider the next $g' = 36/\epsilon^2$ generations (evolutionary steps) after the first such representation, satisfying $|\mathrm{lit}(r_0)| \le \log_2(3/\epsilon)$, is reached, and call these representations $r_0, r_1, \ldots, r_{g'}$. As proved above, for each $i$, except with probability $\epsilon^3/36$, $r_{i+1}$ must have been a beneficial mutation with respect to $r_i$. By a union bound, except with probability $\epsilon$, all of these steps were such that a beneficial mutation was selected.

But this implies that $L_{f,U}(r_{i+1},\tau) \le L_{f,U}(r_i,\tau) - t(r_i,\epsilon)$. Using the fact that $L_{f,U}(r) - \tau \le L_{f,U}(r,\tau) \le L_{f,U}(r) + \tau$ for every $r \in R$, we get $L_{f,U}(r_{i+1}) \le L_{f,U}(r_i) - t(r_i,\epsilon) + 2\tau$. Since $|\mathrm{lit}(r_i)| \le \log_2(3/\epsilon)$ for all $0 \le i \le g'$, we have $t(r_i,\epsilon) = \epsilon^2/12$; and because $\tau = \epsilon^2/36$, we get $L_{f,U}(r_{i+1}) \le L_{f,U}(r_i) - \epsilon^2/36$. Using the fact that $L_{f,U}(r_0) \le 1$, it must be the case that for some $i \le g'$ ($= 36/\epsilon^2$), $L_{f,U}(r_i) \le \epsilon$. □

Remark 3.1. The reader can easily check that the same evolutionary mechanism also evolves the class of monotone conjunctions under the uniform distribution using the selection rule Opt-Sel. This follows from the fact that when $r_0$ is such that $|\mathrm{lit}(r_0)| > \log_2(3/\epsilon)$, every mutation output by $\mathrm{Mut}(r_0,\epsilon)$ is either beneficial or neutral, and the number of literals in the resulting mutation is one fewer than $|\mathrm{lit}(r_0)|$. When $|\mathrm{lit}(r_0)| \le \log_2(3/\epsilon)$, we have shown that with high probability a beneficial mutation exists in the neighborhood $\mathrm{Neigh}(r_0,\epsilon)$. The selection rule Opt-Sel chooses one among the "best" mutations in the neighborhood (those with least (approximate) expected loss), and hence will always choose a beneficial mutation.

3.7.2 Homogeneous Linear Separators

In this section, we describe an evolutionary mechanism that evolves homogeneous linear separators in $\mathbb{R}^n$ under radially symmetric distributions, i.e. the class of distributions where the probability density at any point in $\mathbb{R}^n$ depends only on its distance from the origin. Let $w \in \mathbb{R}^n$ be a vector such that $\|w\|_2 = 1$. Consider the function $h_w : \mathbb{R}^n \to \{-1,1\}$ such that for any point $x \in \mathbb{R}^n$, $h_w(x) = \mathrm{sign}(w \cdot x)$ (we use the convention that $\mathrm{sign}(0) = 1$). The class of homogeneous linear separators in $\mathbb{R}^n$ is defined as:

$$H_n = \{h_w \mid w \in \mathbb{R}^n, \|w\|_2 = 1\}$$

Let $\epsilon$ be the target accuracy parameter. As in the case of monotone conjunctions, we will use the classification error $\ell_c$ as the loss function. (Recall that when both the representations and the ideal functions are boolean, all loss functions are essentially equivalent to the classification error.)

We now describe the evolutionary mechanism. The class of representations $R$ is the same as $H_n$; each representation is a unit vector in $\mathbb{R}^n$. Next, we define the mutator for every representation. Let $w$ be the vector corresponding to representation $h_w$, and let $w, u_2, \ldots, u_n$ constitute an arbitrary orthonormal basis of $\mathbb{R}^n$. Then the mutator $\mathrm{Mut}(h_w,\epsilon)$ does the following: pick $i \in \{2,\ldots,n\}$ uniformly at random; pick $\xi \in \{-1,1\}$ uniformly at random; set $w' = \cos(\epsilon/(\pi\sqrt{n}))\,w + \xi\sin(\epsilon/(\pi\sqrt{n}))\,u_i$ and output the representation $h_{w'}$. Note that when the orthonormal basis is fixed, there are exactly $2n-2$ possible mutations, and each is output with equal probability.

Define the approximation parameter $\tau = \epsilon/(3\pi^3 n)$. Define the size function $s : H_n \times \mathbb{R}^+ \to \mathbb{R}^+$ as $s(h_w,\epsilon) = \lceil 2n\log(3n\pi^3/\epsilon^2)\rceil$ for every $h_w \in H_n$. Finally, define the tolerance function $t : H_n \times \mathbb{R}^+ \to \mathbb{R}^+$ as $t(h_w,\epsilon) = \epsilon/(\pi^3 n)$ for every $h_w \in H_n$. The following theorem appeared in [21].

Theorem 3.2 ([21]). Let $H_n$ be the class of homogeneous linear separators in $\mathbb{R}^n$ and let $D_R$ be the class of radially symmetric distributions over $\mathbb{R}^n$. Then $H_n$ is evolvable to accuracy $\epsilon$, with respect to any distribution $D \in D_R$, by the evolutionary mechanism $EM = (H_n, \mathrm{Mut}, \tau, s, t)$ described above in this section, using selection rule BN-Sel and classification error $\ell_c$ as the loss function, in $g$ generations, where $g = 3n\pi^3/\epsilon$.
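Before the proof, a hedged sketch of this mutator in Python; completing $w$ to an orthonormal basis via a QR decomposition is one concrete (and arbitrary) choice, and the final renormalization only guards against floating-point drift.

```python
import math
import numpy as np

def mutate_halfspace(w, eps, rng=None):
    """One draw from Mut(h_w, eps): rotate the unit vector w by angle
    eps/(pi*sqrt(n)) within the plane spanned by w and a random basis vector."""
    rng = rng or np.random.default_rng()
    n = len(w)
    # Complete w to an orthonormal basis w, u_2, ..., u_n (one concrete choice).
    q, _ = np.linalg.qr(np.column_stack([w, rng.standard_normal((n, n - 1))]))
    u = q[:, rng.integers(1, n)]              # u_i with i uniform in {2, ..., n}
    xi = rng.choice([-1.0, 1.0])              # the random sign
    angle = eps / (math.pi * math.sqrt(n))
    w_new = math.cos(angle) * w + xi * math.sin(angle) * u
    return w_new / np.linalg.norm(w_new)
```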
Proof. Let $f = h_{w^*}$ denote the ideal function and let $D \in D_R$ be the target radially symmetric distribution. Then observe that for any other representation $h_w$, the following holds (the observation is simple to prove, but for completeness see Dasgupta [9]):

$$L_{f,D}(h_w) = \Pr_{x\sim D}[f(x) \neq h_w(x)] = \Pr_{x\sim D}[\mathrm{sign}(w^*\cdot x) \neq \mathrm{sign}(w\cdot x)] = \frac{\arccos(w^*\cdot w)}{\pi}$$

The proof makes frequent use of the following two trigonometric facts, valid for any $\theta \in [0,\pi/2]$:

$$\frac{2\theta}{\pi} \le \sin(\theta) \le \theta \quad (3.3)$$

$$\frac{4\theta^2}{\pi^2} \le 1 - \cos(\theta) \le \frac{\theta^2}{2} \quad (3.4)$$

Now suppose $h_w$ is a representation such that $L_{f,D}(h_w) \ge \epsilon/2$. We consider the action of the mutator $\mathrm{Mut}(h_w,\epsilon)$. Let $w, u_2, \ldots, u_n$ be the orthonormal basis used by the mutator. We show that there exist $\xi \in \{-1,1\}$ and $i \in \{2,\ldots,n\}$ such that $L_{f,D}(h_w) - L_{f,D}(h_{w'}) \ge \epsilon/(\pi^3 n)$, where $w' = \cos(\epsilon/(\pi\sqrt n))w + \xi\sin(\epsilon/(\pi\sqrt n))u_i$. Note that this particular representation $h_{w'}$ will be the output of the mutator $\mathrm{Mut}(h_w,\epsilon)$ with probability at least $1/(2n-2)$.

We use the fact that $L_{f,D}(h_w) = \arccos(w\cdot w^*)/\pi \ge \epsilon/2$. We express the vector $w^*$ in terms of the orthonormal basis $w, u_2, \ldots, u_n$: suppose that

$$w^* = \lambda_1 w + \sum_{j=2}^{n}\lambda_j u_j$$

Let $i = \mathrm{argmax}_{j\in\{2,\ldots,n\}}|\lambda_j|$ and $\xi = \mathrm{sign}(\lambda_i)$. Then notice that $\lambda_i^2 \ge (1-\lambda_1^2)/n$ (since $w^*$ has unit norm). Let $w' = \cos(\epsilon/(\pi\sqrt n))w + \xi\sin(\epsilon/(\pi\sqrt n))u_i$. Then, consider:

$$w'\cdot w^* = \lambda_1\cos\frac{\epsilon}{\pi\sqrt n} + |\lambda_i|\sin\frac{\epsilon}{\pi\sqrt n} \ge \lambda_1\cos\frac{\epsilon}{\pi\sqrt n} + \sqrt{\frac{1-\lambda_1^2}{n}}\sin\frac{\epsilon}{\pi\sqrt n} \quad (3.5)$$

We use the following trigonometric identity: for $\alpha,\beta \in [-1,1]$ with $\alpha < \beta$,

$$\arccos(\alpha) - \arccos(\beta) = \arccos\left(\alpha\beta + \sqrt{(1-\alpha^2)(1-\beta^2)}\right)$$

Then consider,

$$\arccos(w\cdot w^*) - \frac{\epsilon}{\pi^2 n} = \arccos(\lambda_1) - \arccos\left(\cos\frac{\epsilon}{\pi^2 n}\right) = \arccos\left(\lambda_1\cos\frac{\epsilon}{\pi^2 n} + \sqrt{1-\lambda_1^2}\,\sin\frac{\epsilon}{\pi^2 n}\right) \quad (3.6)$$

Observe that the function $\arccos : [-1,1]\to[0,\pi]$ is strictly decreasing. Thus, if we want to show that the right-hand side of (3.6) is at least $\arccos(w'\cdot w^*)$, it suffices to show that (using (3.5)):

$$\lambda_1\cos\frac{\epsilon}{\pi\sqrt n} + \sqrt{\frac{1-\lambda_1^2}{n}}\sin\frac{\epsilon}{\pi\sqrt n} \ge \lambda_1\cos\frac{\epsilon}{\pi^2 n} + \sqrt{1-\lambda_1^2}\,\sin\frac{\epsilon}{\pi^2 n}$$

After some rearranging of terms, it suffices to show that,

$$\lambda_1\left(\cos\frac{\epsilon}{\pi^2 n} - \cos\frac{\epsilon}{\pi\sqrt n}\right) \le \sqrt{1-\lambda_1^2}\left(\frac{1}{\sqrt n}\sin\frac{\epsilon}{\pi\sqrt n} - \sin\frac{\epsilon}{\pi^2 n}\right)$$

First, we make the observation that when $\lambda_1 < 0$: since $\epsilon/(\pi^2 n) \le \epsilon/(\pi\sqrt n)$, the left-hand side above is less than $0$, while the right-hand side is positive using (3.3). Thus, we only concern ourselves with the case $\lambda_1 > 0$. Notice that $L_{f,D}(h_w) = \arccos(\lambda_1)/\pi \ge \epsilon/2$, hence $\lambda_1 \le \cos(\pi\epsilon/2)$ and $\sqrt{1-\lambda_1^2} \ge \sin(\pi\epsilon/2)$. Thus, it is sufficient to show that

$$\cos\frac{\pi\epsilon}{2}\left(\cos\frac{\epsilon}{\pi^2 n} - \cos\frac{\epsilon}{\pi\sqrt n}\right) \le \sin\frac{\pi\epsilon}{2}\left(\frac{1}{\sqrt n}\sin\frac{\epsilon}{\pi\sqrt n} - \sin\frac{\epsilon}{\pi^2 n}\right)$$

Using (3.3) and (3.4) we get:

$$\cos\frac{\pi\epsilon}{2}\left(\cos\frac{\epsilon}{\pi^2 n} - \cos\frac{\epsilon}{\pi\sqrt n}\right) \le 1 - \cos\frac{\epsilon}{\pi\sqrt n} \le \frac{\epsilon^2}{2\pi^2 n},$$

and

$$\sin\frac{\pi\epsilon}{2}\left(\frac{1}{\sqrt n}\sin\frac{\epsilon}{\pi\sqrt n} - \sin\frac{\epsilon}{\pi^2 n}\right) \ge \epsilon\left(\frac{2\epsilon}{\pi^2 n} - \frac{\epsilon}{\pi^2 n}\right) = \frac{\epsilon^2}{\pi^2 n}$$

Thus, putting everything together and using (3.6), we get $\arccos(w\cdot w^*) - \epsilon/(\pi^2 n) \ge \arccos(w'\cdot w^*)$, and hence

$$\arccos(w\cdot w^*) - \arccos(w'\cdot w^*) \ge \frac{\epsilon}{\pi^2 n}, \qquad L_{f,D}(h_w) - L_{f,D}(h_{w'}) \ge \frac{\epsilon}{\pi^3 n}$$

To complete the proof, observe the following: $L_{f,D}(h_{w'},\tau) - L_{f,D}(h_w,\tau) \le L_{f,D}(h_{w'}) - L_{f,D}(h_w) + 2\tau \le -\epsilon/(\pi^3 n) + 2(\epsilon/(3\pi^3 n)) = -\epsilon/(3\pi^3 n)$. Thus, the mutation $h_{w'}$ is beneficial. Since the size function is $s(h_w,\epsilon) = \lceil 2n\log(3n\pi^3/\epsilon^2)\rceil$, except with probability $\epsilon^2/(3n\pi^3)$, the particular (beneficial) mutation $h_{w'}$ will be in $\mathrm{Neigh}(h_w,\epsilon)$ (obtained by running $\mathrm{Mut}(h_w,\epsilon)$ $s(h_w,\epsilon)$ times). Note that when the tolerance function is $t(h_w,\epsilon) = \epsilon/(\pi^3 n)$ for any $h_w \in R$ and the approximation parameter is $\tau = \epsilon/(3\pi^3 n)$, any beneficial mutation must decrease the (expected) loss by $\epsilon/(3\pi^3 n)$.
Thus, in at most $3\pi^3 n/\epsilon$ generations, the evolutionary mechanism will produce a representation $h_{\bar w}$ such that $L_{f,D}(h_{\bar w}) \le \epsilon$. Note that the probability that there is an evolutionary step (among the $3n\pi^3/\epsilon$) for which a beneficial mutation does not exist in the neighborhood is at most $\epsilon$ (by a simple union bound). □

Remark 3.2. As in the case of monotone conjunctions, the evolutionary mechanism described above also succeeds when the selection rule is Opt-Sel (selection based on optimization). The proof is a simple modification of the proof given above.

3.8 Relations to Learning Theory

Valiant already observed that evolvability is more restrictive than learnability, in the sense that any concept class that is evolvable is also learnable in the PAC and SQ frameworks [39]. (Valiant's original model only used boolean functions and classification error.) Feldman observed that, in fact, any concept class that is evolvable is also learnable in the restricted CSQ (correlational statistical query) framework when the loss function is the classification error [12].

Theorem 3.3 ([39, 14]). Let $C$ be a concept class of boolean functions, i.e. each $c \in C$ is such that $c : X \to Y = \{-1,1\}$, and let $Y' = [-1,1]$. Suppose $\ell : Y' \times Y \to \mathbb{R}^+$ is a loss function that is evaluable in polynomial time and is bounded by some constant $B$. If $C$ is evolvable with respect to a class of distributions $D$, any of the three selection rules BN-Sel, Opt-Sel, or Weak-Sel, and loss function $\ell$, then $C$ is learnable in the SQ (and hence also PAC) framework.

Proof. The proof is the simple observation that an evolutionary mechanism can be efficiently simulated by a polynomial-time algorithm with access to the statistics oracle, STAT (see Section 2.3). The only access the evolutionary process has to the target function and distribution is through obtaining (approximate) expected loss values of candidate mutations using the oracle $L_{f,D}(\cdot,\cdot)$.

For any representation $r : X \to Y'$, define $\psi_r : X \times \{-1,1\} \to [-1,1]$ as $\psi_r(x,y) = \ell(r(x),y)/B$. Note that the function $\psi_r$ is efficiently computable, as $r$ and $\ell$ are both efficiently computable functions. Thus, the response of the statistics oracle STAT to the query $(\psi_r,\tau)$ can be used to obtain the value $L_{f,D}(r,\tau)$ (after rescaling by $B$). The rest of the steps required to simulate the evolutionary mechanism are straightforward, since the representations are efficiently evaluable functions, the size and tolerance functions are efficiently evaluable, and the mutator is a polynomial-time Turing machine. The selection rules described in Section 3.5 can easily be implemented efficiently. □
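A small sketch of the simulation in the proof above; here the STAT oracle is itself approximated from labeled samples, so the sample size and helper names are illustrative assumptions rather than part of the formal argument.

```python
import random

def stat_query(psi, tau, labeled_sample):
    """Stand-in STAT oracle: estimates E[psi(x, f(x))] to tolerance tau from
    labeled examples (a heuristic sample size; the formal oracle is exact)."""
    m = min(len(labeled_sample), max(1, int(4.0 / tau ** 2)))
    batch = random.sample(labeled_sample, m)
    return sum(psi(x, y) for x, y in batch) / m

def approx_loss(r, loss, B, tau, labeled_sample):
    """Simulates L_{f,D}(r, tau) with one STAT query on psi_r(x,y) = loss(r(x),y)/B."""
    psi = lambda x, y: loss(r(x), y) / B      # psi_r maps into [0, 1]
    return B * stat_query(psi, tau / B, labeled_sample)
```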
When representations are also required to be (possibly randomized) boolean functions, i.e. $Y' = Y = \{-1,1\}$, and the loss function is the classification error $\ell_c$ (recall that $\ell_c(y',y) = 1$ if $y' \neq y$ and $0$ otherwise), any concept class $C$ that is evolvable is also learnable in the more restricted CSQ framework [12].

Theorem 3.4 ([12]). If $C$ is evolvable with respect to a class of distributions $D$, any of the three selection rules BN-Sel, Opt-Sel, or Weak-Sel, and the classification error $\ell_c$ as the loss function, then $C$ is learnable with respect to the class of distributions $D$ in the CSQ framework.

Proof. The only extra observation here is that, when the loss function is $\ell_c$ and the representation $r$ represents a boolean function, $L_{f,D}(r) = \Pr_{x\sim D}[r(x) \neq f(x)] = (1 - \mathbb{E}_{x\sim D}[r(x)f(x)])/2$ (in the case that $r$ is a randomized boolean function, the probability is also taken over the randomness of $r$). Thus, in this case the (approximate) expected loss oracle $L_{f,D}(r,\tau)$ can be simulated with access only to a CSQ oracle, CSQ-O (see Section 2.4). The rest of the proof is exactly the same as above. □

3.8.1 From Learning to Evolvability

In this section, we present some results showing that in several cases it is also possible to convert learning algorithms into evolutionary mechanisms. The first such result was shown by Feldman in the context of CSQ learning and evolvability with boolean representations and classification error $\ell_c$ as the loss function [12]. Subsequently, Feldman showed that SQ learning algorithms could be simulated by evolutionary mechanisms using (a large class of) non-linear loss functions [14]. P. Valiant showed that Feldman's ideas could also be generalized to the case when the ideal functions are real-valued: he showed that when the optimization problem of minimizing the expected loss over the representation class admits a (weak) optimization procedure using only (approximate) oracle access to the objective function (in this case the expected loss), such an optimization procedure can be transformed into an evolutionary mechanism [40].

For the rest of this section, we use the following notation. The ideal function $f : X \to Y$ comes from some concept class $C$, and $D$ is the target distribution over $X$. We make no assumption about the sets $X$ and $Y$ (except that $Y$ is bounded). Any representation class we use must encode functions from $X$ to $Y'$, where $Y \subseteq Y'$. $\ell : Y' \times Y \to \mathbb{R}^+$ is a loss function that is efficiently evaluable and bounded: we assume that for every $y \in Y$, $\min_{y'\in Y'}\ell(y',y) = \ell(y,y)$, and that for every $y' \in Y'$, $y \in Y$, $\ell(y',y) \le B$ (where $B$ depends only on $Y'$). For any candidate hypothesis $h : X \to Y'$, the expected loss is defined as $L_{f,D}(h) = \mathbb{E}_{x\sim D}[\ell(h(x),f(x))]$; for any $h$, $L_{f,D}(h) \in [0,B]$.

Suppose $H$ is a hypothesis space with $C \subseteq H$. We can view learning as an optimization problem, where the goal is to find some $h \in H$ for which $L_{f,D}(h)$ is minimized. We assume that there is a special point $0 \in Y'$, and that the function $Z : X \to Y'$ defined as $Z(x) = 0$ for every $x \in X$ is contained in $H$. (We may think of $0$ as a randomized point in $Y'$, where effectively $0$ is a random variable over $Y'$. For example, when $Y = \{-1,1\}$, we think of $0$ as the point that takes value $-1$ or $+1$ with equal probability; the function $Z$ thus defined is a randomized function.)

For reasons that will become clear in the proof of the main result of this section, it is useful to consider the objective function $G_{f,D}(\cdot)$, where $G_{f,D}(h) = L_{f,D}(h) - L_{f,D}(Z)$. Minimizing $G_{f,D}(\cdot)$ is equivalent to minimizing $L_{f,D}(\cdot)$, since $L_{f,D}(Z)$ is a constant. Note that for any $h$, $G_{f,D}(h) \in [-B,B]$ (since $L_{f,D}(h) \in [0,B]$).

Let $G_{f,D}(\cdot,\cdot)$ denote an oracle that, on a query $(h,\tau)$, returns a value $G_{f,D}(h,\tau) \in [G_{f,D}(h)-\tau,\, G_{f,D}(h)+\tau]$. We define yet another oracle, $G^{\le}_{f,D}(\cdot,\cdot,\cdot)$, that takes a query of the form $(\phi,\tau,\theta)$, where $\phi : X \to Y'$, $\tau \ge 0$, and $\theta \in [-B,B]$ with $|\theta| \ge \tau$. The output $G^{\le}_{f,D}(\phi,\tau,\theta)$ is defined as follows:

$$G^{\le}_{f,D}(\phi,\tau,\theta) = \begin{cases} 1 & \text{if } G_{f,D}(\phi) \le \theta - \tau \\ 0 & \text{if } G_{f,D}(\phi) \ge \theta + \tau \\ 1 \text{ or } 0 & \text{otherwise} \end{cases}$$

Feldman [12] proved the following simple result; the proof is a simple noisy binary search argument.
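A sketch of that binary search: maintain an interval known to contain $G_{f,D}(\phi)$ and halve it with threshold queries. The loop bound and the nudge of $\theta$ away from zero are our own details; `G_le` stands for the oracle $G^{\le}_{f,D}$.

```python
def simulate_G(phi, tau, B, G_le):
    """Simulates G_{f,D}(phi, tau) with O(log(B/tau)) queries to G_le,
    each of the form (phi, tau/2, theta), in the spirit of the lemma below."""
    lo, hi = -B, B
    while hi - lo > 2 * tau:
        theta = (lo + hi) / 2.0
        if abs(theta) < tau / 2.0:
            theta = tau / 2.0                 # the oracle requires |theta| >= tau/2
        if G_le(phi, tau / 2.0, theta) == 1:
            hi = theta + tau / 2.0            # response 1 rules out G >= theta + tau/2
        else:
            lo = theta - tau / 2.0            # response 0 rules out G <= theta - tau/2
    return (lo + hi) / 2.0                    # within tau of G_{f,D}(phi)
```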
Lemma 3.1. For any $\phi : X \to Y'$ and $\tau > 0$, the oracle $G_{f,D}(\phi,\tau)$ can be simulated by $O(\log(B/\tau))$ queries to the oracle $G^{\le}_{f,D}(\cdot,\cdot,\cdot)$, where each query is of the form $(\phi,\tau/2,\theta)$ with $\theta \in [-B,B]$ and $|\theta| \ge \tau/2$.

The above lemma implies that any algorithm that uses a $G_{f,D}(\cdot,\cdot)$ oracle can be modified to instead use a $G^{\le}_{f,D}(\cdot,\cdot,\cdot)$ oracle with very little extra overhead. We consider learning algorithms that use such an oracle $G^{\le}_{f,D}(\cdot,\cdot,\cdot)$. Formally,

Definition 3.4. Assume the notation defined in this section. We say that a concept class $C$ can be learned with respect to distribution $D$ and loss function $\ell$, using a $G^{\le}_{f,D}(\cdot,\cdot,\cdot)$ oracle, if there exists an algorithm Alg that for every $\epsilon > 0$ and every $f \in C$, using access to a $G^{\le}_{f,D}(\cdot,\cdot,\cdot)$ oracle, outputs a hypothesis $h$ such that $L_{f,D}(h) \le \epsilon$. The running time of the algorithm is polynomial in $n$ and $1/\epsilon$; every query $(\phi,\tau,\theta)$ made to the oracle is such that $\phi$ is efficiently evaluable, $1/\tau$ is bounded by a polynomial in $n$ and $1/\epsilon$, $\theta \in [-B,B]$, and $|\theta| \ge \tau$. Also, the output hypothesis $h$ is efficiently evaluable.

The main result we present unifies the various reductions from learning algorithms to evolutionary mechanisms. The first of these was proved by Feldman [12], who showed that CSQ algorithms can be simulated by evolutionary mechanisms that use boolean representations and classification error as the loss function. Feldman generalized this to the case where the representations are real-valued and the loss functions are non-linear [14]. P. Valiant showed a reduction in the case when both the ideal function and the representations are real-valued [40].

Theorem 3.5 ([12, 14, 40]). Any concept class $C$ that is learnable with respect to distribution $D$ and loss function $\ell$, with access to a $G^{\le}_{f,D}(\cdot,\cdot,\cdot)$ oracle, is evolvable with respect to distribution $D$, using loss function $\ell$ and selection rule BN-Sel.

The rest of this section is devoted to proving the above theorem. Let $\epsilon$ be the accuracy parameter. Let Alg be an algorithm that, with access to the oracle $G^{\le}_{f,D}(\cdot,\cdot,\cdot)$, learns $C$ with respect to distribution $D$. Furthermore, assume that Alg makes exactly $q$ queries to the oracle (if Alg makes fewer than $q$ queries, it can be forced to make frivolous queries). Let $\tau_{\mathrm{Alg}}$ be such that every query made by Alg is of the form $(\phi,\tau_{\mathrm{Alg}},\theta)$, where $\phi : X \to Y'$, $1/\tau_{\mathrm{Alg}}$ is bounded by a polynomial in $n$ and $1/\epsilon$, and $\tau_{\mathrm{Alg}} \le |\theta| \le B$. Also, let $H$ be the class of hypotheses from which Alg outputs the final hypothesis $h$ (which satisfies $L_{f,D}(h) \le \epsilon$).

We now design an evolutionary mechanism using such an algorithm Alg. First, we construct the representation class. We will not define the entire class at once, but add representations as they are required in the proof. All representations will be randomized functions, and we express a randomized function as a weighted combination of deterministic functions. Suppose $r = \sum_{i=1}^{m} w_i\psi_i$, where each $\psi_i : X \to Y'$ is a deterministic function, $w_i \ge 0$, and $\sum_{i=1}^{m} w_i \le 1$. Notice that we have allowed the sum of weights (probabilities) to be less than $1$. The randomized function $r$ is interpreted as follows: on input $x \in X$, for each $1 \le i \le m$, with probability $w_i$, $r(x) = \psi_i(x)$, and with the remaining probability $1 - \sum_{i=1}^{m} w_i$, $r(x) = Z(x) = 0$. Thus, if the weights of the functions do not add up to $1$, we assume that the default remaining weight is assigned to the zero function $Z$. For such a representation $r$, $L_{f,D}(r) = \sum_{i=1}^{m} w_i L_{f,D}(\psi_i) + (1 - \sum_{i=1}^{m} w_i)L_{f,D}(Z)$ and $G_{f,D}(r) = \sum_{i=1}^{m} w_i G_{f,D}(\psi_i)$ (since $G_{f,D}(Z) = 0$).
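Concretely, such a weighted combination can be realized as follows (a minimal sketch; the constructor name is ours):

```python
import random

def make_mixture(weighted_parts, zero):
    """Realizes r = sum_i w_i * psi_i as a randomized function; leftover
    probability mass 1 - sum_i w_i falls on the zero function Z."""
    assert 0.0 <= sum(w for w, _ in weighted_parts) <= 1.0
    def r(x):
        u, acc = random.random(), 0.0
        for w, psi in weighted_parts:
            acc += w
            if u < acc:
                return psi(x)        # with probability w_i, behave like psi_i
        return zero(x)               # with the remaining probability, behave like Z
    return r
```

By linearity of expectation, the expected loss of the returned function is exactly the weighted average given above.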
Let $S_q = \{z \in \{0,1\}^* \mid |z| \le q\}$ be the set of binary strings of length at most $q$. For a string $z \in S_q$, we interpret the representation $r[z]$ as follows:

• If $|z| = 0$, $r[z] = Z$, the zero function.

• Otherwise, let $z^i$ denote the prefix of $z$ of length $i-1$. For any string $z'$, let $(\phi_{z'},\tau_{\mathrm{Alg}},\theta_{z'})$ be the $(|z'|+1)$-th query that the algorithm Alg makes if the first $|z'|$ query responses are consistent with the string $z'$, i.e. the $j$-th query response is the bit $z'_j$. Then we interpret the function for representation $r[z]$ as follows:

$$r[z] = \sum_{i\,:\,z_i = 1} \frac{\tau_{\mathrm{Alg}}}{q|\theta_{z^i}|}\,\phi_{z^i} \quad (3.7)$$

Note that the sum of the weights of the functions in $r[z]$ adds up to at most $1$, because $|z| \le q$ and $|\theta_{z^i}| \ge \tau_{\mathrm{Alg}}$. (Recall that if the weights add up to less than $1$, with the remaining probability $r[z]$ is the function $Z$.)

In order to show that the representation class is polynomially bounded, first observe that the representation size of any function is at most $q$ (which is polynomially bounded since Alg is efficient). We also need to show that, given $r[z]$ and $x$ as input, there exists a polynomial-time Turing machine that outputs $r[z](x)$. It is easy to see how to do this: given $r[z]$ and $x$, starting with $i = 1$, run the algorithm Alg until it makes its $i$-th query $(\phi_{z^i},\tau_{\mathrm{Alg}},\theta_{z^i})$; if $z_i = 1$, then with probability $\tau_{\mathrm{Alg}}/(|\theta_{z^i}|q)$ output $\phi_{z^i}(x)$. Otherwise, assume that the query response is $z_i$, increment $i$, and continue the simulation of Alg. If the value $i = |z|+1$ is reached, output $Z(x) = 0$.

Let $R_1 = \{r[z] \mid z \in S_q\}$; this is part of the representation class. The final representation class will also contain the set of hypotheses $H$ as a subset. If $z$ is a string of length $q$, let $h_z$ be the hypothesis in $H$ that the algorithm Alg would output if the query responses it received were as encoded in the string $z$, i.e. the $i$-th query response it received was $z_i$.

We define the mutator $\mathrm{Mut}(r[z],\epsilon)$ for representations $r[z] \in R_1$. The parameter $\eta$ will be defined later (it will be the case that $1/\eta$ is bounded by a polynomial in $n$ and $1/\epsilon$):

• When $|z| = q$, $\mathrm{Mut}(r[z],\epsilon) = h_z$ with probability $1-\eta$, and $\mathrm{Mut}(r[z],\epsilon) = r[z]$ with the remaining probability $\eta$.

• When $|z| < q$, let $(\phi_z,\tau_{\mathrm{Alg}},\theta_z)$ be the query Alg would make if the responses it received to the first $|z|$ queries are consistent with the string $z$. Then,

1. If $\theta_z > 0$, $\mathrm{Mut}(r[z],\epsilon) = r[z1]$ with probability $1-\eta$ and $\mathrm{Mut}(r[z],\epsilon) = r[z0]$ with probability $\eta$.

2. If $\theta_z < 0$, $\mathrm{Mut}(r[z],\epsilon) = r[z0]$ with probability $1-\eta$ and $\mathrm{Mut}(r[z],\epsilon) = r[z1]$ with probability $\eta$.

Let $\tau = \tau_{\mathrm{Alg}}^2/(2Bq)$ be the approximation parameter, $s(r[z],\epsilon) = \lceil(1/\eta)\log(1/\eta)\rceil$ the size function, and $t(r[z],\epsilon) = \tau_{\mathrm{Alg}}/q$ the tolerance function. So let $EM = (R,\mathrm{Mut},\tau,s,t)$ be the evolutionary mechanism. For now, we assume that the evolutionary process begins with the starting representation $r[\sigma]$, where $\sigma$ denotes the empty string. We consider the first $q$ evolutionary steps and prove the following claim by induction.
Claim 3.1. After $j$ evolutionary steps, except with probability $2\eta j$, the representation $r[z]$ is such that $|z| = j$, and for all $1 \le i \le j$, $z_i$ is a valid answer to the query $(\phi_{z^i},\tau_{\mathrm{Alg}},\theta_{z^i})$ made by algorithm Alg (recall that $z^i$ is the prefix of $z$ of length $i-1$).

The claim is obviously true for $j = 0$. Assume that the claim holds for some value $j$. Let $z$ be a string of length $j$, and $r[z]$ the representation after $j$ evolutionary steps. We consider the representation at the $(j+1)$-th evolutionary step: let $(\phi_z,\tau_{\mathrm{Alg}},\theta_z)$ be the corresponding query that Alg would make (given previous responses as encoded in $z$). We consider two cases:

Case 1 ($\theta_z > 0$): The mutator $\mathrm{Mut}(r[z],\epsilon)$ outputs $r[z1]$ with probability $1-\eta$ and $r[z0]$ with probability $\eta$. Using (3.7), we can see that $L_{f,D}(r[z0]) = L_{f,D}(r[z])$ (this is because when the extra bit is $0$, the function represented is unchanged). When the approximation parameter $\tau$ satisfies $2\tau \le t(r[z],\epsilon)$, $r[z0]$ is always a neutral mutation, since $|L_{f,D}(r[z0],\tau) - L_{f,D}(r[z],\tau)| \le |L_{f,D}(r[z0]) - L_{f,D}(r[z])| + 2\tau \le t(r[z],\epsilon)$. Now, if $r[z1]$ is not a deleterious mutation, the probability that $r[z1]$ will be chosen by the selection rule BN-Sel is at least $1-\eta$. To complete the induction step, we show the following: if $1$ is an invalid answer to the query $(\phi_z,\tau_{\mathrm{Alg}},\theta_z)$, then $r[z1]$ is a deleterious mutation. Note that the only case in which $1$ is an invalid answer to the query is when $G_{f,D}(\phi_z) \ge \theta_z + \tau_{\mathrm{Alg}}$. Then, since $G_{f,D}(Z) = 0$, we get

$$L_{f,D}(r[z1]) - L_{f,D}(r[z]) = G_{f,D}(r[z1]) - G_{f,D}(r[z]) = \frac{\tau_{\mathrm{Alg}}}{q|\theta_z|}G_{f,D}(\phi_z) \ge \frac{\tau_{\mathrm{Alg}}}{q\theta_z}(\theta_z + \tau_{\mathrm{Alg}}) \ge \frac{\tau_{\mathrm{Alg}}}{q} + \frac{\tau_{\mathrm{Alg}}^2}{Bq}$$

where the last step holds since $\theta_z \le B$. But this implies that $L_{f,D}(r[z1]) - L_{f,D}(r[z]) \ge t(r[z],\epsilon) + 2\tau$, i.e. $r[z1]$ is a deleterious mutation.

One last thing we need to show is that at least one copy of the mutation $r[z0]$ is in the neighborhood $\mathrm{Neigh}(r[z],\epsilon)$. This is because if $r[z1]$ is deleterious, then without at least one copy of $r[z0]$ in $\mathrm{Neigh}(r[z],\epsilon)$, the representation $\perp$ would be selected and the evolutionary process would come to an end. The probability of there being no copy of $r[z0]$ in $\mathrm{Neigh}(r[z],\epsilon)$ is at most $(1-\eta)^{s(r[z],\epsilon)} \le \eta$.

Let $r[zb]$ be the mutation selected at this evolutionary step. Then, if $zb$ does not encode a valid response to the query $(\phi_z,\tau_{\mathrm{Alg}},\theta_z)$, it must be because $r[z0]$ was chosen even though $r[z1]$ is neutral (this happens with probability at most $\eta$), or because $r[z1]$ was deleterious and $r[z0]$ did not exist in $\mathrm{Neigh}(r[z],\epsilon)$ (also with probability at most $\eta$). Thus, the claim holds for $j+1$.

Case 2 ($\theta_z < 0$): The calculations in this case are very similar to the previous one, so we only briefly describe the basic idea. The probability that $\mathrm{Mut}(r[z],\epsilon) = r[z1]$ is $\eta$, and the probability that $\mathrm{Mut}(r[z],\epsilon) = r[z0]$ is $1-\eta$. Again, $r[z0]$ is certainly a neutral mutation; thus, unless $r[z1]$ is beneficial, $r[z0]$ will be selected with probability at least $1-\eta$. We show that if $0$ is an invalid answer to the query $(\phi_z,\tau_{\mathrm{Alg}},\theta_z)$, then $r[z1]$ must be beneficial. When $0$ is an invalid answer, it must be the case that $G_{f,D}(\phi_z) \le \theta_z - \tau_{\mathrm{Alg}}$. Thus, we can show that $L_{f,D}(r[z1]) - L_{f,D}(r[z]) \le -(\tau_{\mathrm{Alg}}/q) - (\tau_{\mathrm{Alg}}^2/(Bq))$, so $r[z1]$ must be a beneficial mutation. As in the previous case, the probability that there is no copy of the mutation $r[z1]$ in $\mathrm{Neigh}(r[z],\epsilon)$ is at most $\eta$. So, except with probability $\eta$, if $r[z1]$ is beneficial, it is the mutation that will be selected. Thus, the claim holds for $j+1$.
The above argument shows that after $q$ evolutionary steps (generations), it must be the case that the string $z$, $|z| = q$, encodes the correct query responses corresponding to the simulation of algorithm Alg. For such a representation $r[z]$, $\mathrm{Mut}(r[z],\epsilon) = h_z$ with probability $1-\eta$; thus, unless $h_z$ is a deleterious mutation, $h_z$ will be selected with probability at least $1-\eta$. Note that $L_{f,D}(h_z) \le \epsilon$ (since Alg correctly learns $C$). So if $h_z$ is deleterious, it must be the case that $L_{f,D}(r[z]) \le L_{f,D}(h_z) + 2\tau - t(r[z],\epsilon) \le \epsilon$. Thus, in either case, a representation with expected loss at most $\epsilon$ is produced. The total probability of failure is at most $(2q+1)\eta$; if we set $\eta = \epsilon/(2q+1)$, the total failure probability is at most $\epsilon$. This completes the proof for the case when the evolutionary process starts from the representation $r[\sigma]$. □

Remark 3.3. In the proof above, we have assumed that the algorithm Alg is deterministic. If Alg is randomized, the first evolutionary step can be made to consist of a (neutral) mutation which encodes a random string of the length required by Alg into the representation. Subsequently, the simulation of Alg uses the random bits stored in the representation. For further details, see [12].

Remark 3.4. It is clear from the reduction that the selection rule Opt-Sel would also result in successful simulation of the algorithm Alg.
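Abstracting away the failure probabilities, the reduction traces a deterministic path through response strings. The sketch below replays that idealized trajectory; `alg.next_query`, `alg.hypothesis`, and the exact oracle `G` are hypothetical interfaces used purely for illustration.

```python
def simulate_reduction(alg, G, q):
    """Idealized trace of Theorem 3.5: the mutator proposes the 1-extension when
    theta > 0 (else the 0-extension), and selection overturns the proposal exactly
    when it would encode an invalid oracle answer; noise terms are ignored."""
    z = ""
    for _ in range(q):
        phi, tau_alg, theta = alg.next_query(z)
        bit = 1 if theta > 0 else 0
        if bit == 1 and G(phi) >= theta + tau_alg:
            bit = 0                  # r[z1] would be deleterious, so r[z0] survives
        elif bit == 0 and G(phi) <= theta - tau_alg:
            bit = 1                  # r[z1] is beneficial, so it wins the selection
        z += str(bit)
    return alg.hypothesis(z)         # the final mutation r[z] -> h_z
```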
Removing the initialization requirement

In the proof above, we assumed that the evolutionary mechanism is allowed to start from a fixed representation, $r[\sigma]$. Feldman [12] also showed that this is not necessary; however, it requires adding a few extra representations. Our presentation here differs slightly from Feldman's, because we have described evolution in terms of losses rather than performances. Also, for Feldman's trick of re-initialization to work, it is important to know the value of $L_{f,D}(Z)$ (see Remark 3.5), which we assume is at least $2\epsilon$, so that $L_{f,D}(Z) - \epsilon \ge \epsilon$. (If this is not the case, we can always modify the mutator to output $Z$ as a mutation with non-trivial probability; note that, by a slight re-scaling of $\epsilon$, $Z$ is then essentially close to the ideal function, and the evolutionary mechanism succeeds in essentially one generation.) Our presentation is brief, and the reader interested in greater detail is referred to Feldman's paper [12].

Let $r \in R$ be a representation of the form $r[z]$ with $|z| = q$, or some $h \in H$. Let $t(r,\epsilon)$ be the tolerance function and $\tau$ the approximation parameter. Let $r'$ be the representation

$$r' = \left(1 - \frac{t(r,\epsilon) - 2\tau}{L_{f,D}(Z) - \epsilon}\right) r$$

(we assume that $(t(r,\epsilon)-2\tau)/(L_{f,D}(Z)-\epsilon) \le 1$; if necessary, $t(r,\epsilon)$ and $\tau$ can be scaled down to make this happen, while keeping them polynomially bounded). Then we have,

$$L_{f,D}(r') = \left(1 - \frac{t(r,\epsilon)-2\tau}{L_{f,D}(Z)-\epsilon}\right)L_{f,D}(r) + \frac{t(r,\epsilon)-2\tau}{L_{f,D}(Z)-\epsilon}\,L_{f,D}(Z)$$

and hence,

$$L_{f,D}(r') - L_{f,D}(r) = \frac{t(r,\epsilon)-2\tau}{L_{f,D}(Z)-\epsilon}\left(L_{f,D}(Z) - L_{f,D}(r)\right)$$

Note that unless $L_{f,D}(r) \le \epsilon$, i.e. evolution has already succeeded, $r'$ must be a neutral mutation. Let the mutator be such that $\mathrm{Mut}(r,\epsilon) = r$ with probability $1-\eta$, and $\mathrm{Mut}(r,\epsilon) = r'$ with probability $\eta$, for some small enough value of $\eta$. If $r'$ is neutral, the evolutionary mechanism will move from $r$ to $r'$ with probability at least $\eta$.

Now, we define several more representations. Let $\alpha = (t(r,\epsilon)-2\tau)/(L_{f,D}(Z)-\epsilon)$, and for notational convenience assume that $1/\alpha = K$ is an integer. Let $r^{(0)} = r'$ and, for $1 \le k \le K$, let $r^{(k)} = (1-k\alpha)r$. Note that $L_{f,D}(r^{(k+1)}) - L_{f,D}(r^{(k)}) = \alpha(L_{f,D}(Z) - L_{f,D}(r)) \le t(r,\epsilon) - 2\tau$ (again, unless evolution has already succeeded), so $r^{(k+1)}$ is certainly a neutral mutation with respect to $r^{(k)}$. Define the mutator to act as $\mathrm{Mut}(r^{(k)},\epsilon) = r^{(k+1)}$ with probability $1$. Thus, starting from $r^{(0)}$, in exactly $K$ steps, the evolutionary sequence reaches the representation $r^{(K)} \equiv Z \equiv r[\sigma]$. Recall that the proof above shows that, starting from $r[\sigma]$, in an additional $q$ steps the evolutionary mechanism reaches a representation that is highly accurate. We add all such representations $r^{(k)}$ to the representation class, and note that from any such representation the evolutionary mechanism is guaranteed to succeed in at most $K + q$ steps.

Finally, suppose that the evolutionary mechanism is started from some arbitrary starting representation $r$ of the form $r[z]$ for $z \in S_q$, or $h \in H$. If $r \equiv r[z]$ for $|z| < q$, the evolutionary mechanism continues the simulation of Alg as if $z$ encoded the correct answers to the queries, until a representation of the form $r[z]$ with $|z| = q$ is reached; let $h_z$ be the hypothesis in $H$ that would be output if Alg received query responses as encoded in $z$. (Of course, this simulation may be wrong if the starting representation was not $r[\sigma]$.) Note that if $L_{f,D}(r[z]) \le \epsilon$ or $L_{f,D}(h_z) \le \epsilon$, the evolutionary mechanism has already succeeded. If not, it can slide back to the starting representation $r[\sigma]$ as described above, and now the evolutionary mechanism starts from $r[\sigma]$ and is guaranteed to succeed by Theorem 3.5. Thus, no matter what the starting state, in at most $2q + K + 1$ steps, the evolutionary mechanism reaches a representation that has expected loss at most $\epsilon$.

Remark 3.5. Here we assumed that the value $L_{f,D}(Z)$ is known, and typically this also means that it should be independent of the target function $f$. Below, we show that this is the case when the concept class $C$ encodes boolean functions. However, even when the value $L_{f,D}(Z)$ is not known, similar tricks can be used to allow back-sliding to $r[\sigma]$ to start the simulation. One possible approach is presented in Chapter 4, and another appears in P. Valiant's paper [40].

3.8.2 Making the representation class independent of $\epsilon$

In the definition of evolvability, we have allowed the representation class to depend on the target accuracy $\epsilon$. This is not strictly necessary. Note that the representations constructed in the simulation above depend on $\epsilon$ (because the queries made by Alg depend on $\epsilon$). We may assume that $\epsilon$ is always a power of $2$, because otherwise evolution may be run by setting $\epsilon' = 2^{\lfloor\log_2\epsilon\rfloor}$; the resulting accuracy is only better, since $\epsilon' < \epsilon$, and the polynomial bounds are not compromised, since $\epsilon' \ge \epsilon/2$.

Consider $\epsilon$ to be from the set $\{1/2, 1/4, \ldots, 2^{-n}\}$, and suppose that the representation class includes representations (as constructed above) corresponding to each of these values of $\epsilon$. (Note that $\epsilon < 2^{-n}$ is not required, since at such low values of $\epsilon$ the evolutionary algorithm is already allowed to use resources exponential in $n$.) We observe the difficulty in this approach: suppose $\epsilon$ is the true target parameter, and suppose $\epsilon'$ is such that the representations designed for $\epsilon'$ cannot be described by a string whose length is at most polynomial in $n$ and $1/\epsilon$. This may be because the number of queries made by Alg, $q(n,1/\epsilon')$, may be asymptotically larger than $q(n,1/\epsilon)$. Thus, we add the requirement that if the target accuracy of evolution is $\epsilon$, then the starting representation must have size bounded by some polynomial in $n$ and $1/\epsilon$. (See also [12].)
3.8.3 Reduction from Learning Frameworks to Evolvability

We discuss how the general reduction described above can be applied to specific learning frameworks. In particular, we describe three situations:

1. The ideal function is boolean and the representations are also (possibly randomized) boolean functions. In this situation, the only meaningful loss is the classification error, $\ell_c$ ($\ell_c(y',y) = 1$ if $y' \neq y$ and $\ell_c(y',y) = 0$ if $y' = y$).

2. The ideal function is boolean, but the representations may express real-valued functions in the range $[-1,1]$.

3. Both the ideal function and the representations may be real-valued functions.

Boolean Ideal Function and Representation

In this case, $Y' = Y = \{-1,1\}$. Let $Z : X \to \{-1,1\}$ be the function that for every $x \in X$ outputs $+1$ or $-1$ uniformly at random. Note that in this case $L_{f,D}(Z) = 1/2$ for every boolean function $f$, where the expected loss is taken for the classification error. Note that for any boolean function $\phi : X \to \{-1,1\}$, $L_{f,D}(\phi) = \mathbb{E}_{x\sim D}[\ell_c(\phi(x),f(x))] = \mathbb{E}_{x\sim D}[(1-\phi(x)f(x))/2] = 1/2 - (1/2)\mathbb{E}_{x\sim D}[\phi(x)f(x)]$. Note that $G_{f,D}(\phi) = L_{f,D}(\phi) - L_{f,D}(Z) = -(1/2)\mathbb{E}_{x\sim D}[\phi(x)f(x)]$. Thus, the oracle $G_{f,D}(\cdot)$ is equivalent to the CSQ oracle, and Theorem 3.5 implies the following result:

Theorem 3.6 ([12]). Suppose a concept class $C$ is learnable in the CSQ framework with respect to a class of distributions $D$; then $C$ is evolvable with respect to the class of distributions $D$, using the classification error $\ell_c$ as the loss function and with selection rule BN-Sel.

Boolean Ideal Function and Real-Valued Representations

Michael [29] first considered the generalization of evolvability to the setting where the representations are allowed to express real-valued functions. Feldman [14] showed that when the loss function satisfies the constraints listed below, any concept class that is learnable in the SQ framework is also evolvable.

Assume that $Y = \{-1,1\}$ and $Y' = [-1,1]$. Suppose $\ell : Y' \times Y \to \mathbb{R}^+$ is a loss function such that:

1. $\ell(1,-1) = \ell(-1,1) = 2$ and $\ell(-1,-1) = \ell(1,1) = 0$.

2. (Symmetric) $\ell(y,1) = \ell(-y,-1)$.

3. (Monotone) The function $\ell(y,1)$ is continuous and monotonically decreasing for $y \in [-1,1]$ (and hence the function $\ell(y,-1)$ is continuous and monotonically increasing for $y \in [-1,1]$).

4. (Non-degenerate) For every $y \in [-1,1]$, $\ell(y,1) + \ell(y,-1) > 0$.

5. (Non-quasilinear) It is not the case that for every $y \in [-1,1]$, $\ell(y,1) + \ell(y,-1) = \ell(1,-1)$.

An obvious example of a loss that satisfies the above conditions is the squared loss $\ell(y',y) = (1/2)(y'-y)^2$. However, it is clear that a large class of loss functions satisfies the above constraints.
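For concreteness, the squared loss can be checked against conditions 1–5 numerically; this spot-check on a grid is only a sanity test, not a proof (monotonicity, for instance, is verified pointwise on the grid).

```python
import numpy as np

def satisfies_conditions(loss, grid=np.linspace(-1.0, 1.0, 2001)):
    """Numerically spot-checks conditions 1-5 for a loss on [-1, 1] x {-1, 1}."""
    ok = abs(loss(1, -1) - 2) < 1e-9 and abs(loss(-1, 1) - 2) < 1e-9
    ok &= abs(loss(1, 1)) < 1e-9 and abs(loss(-1, -1)) < 1e-9            # condition 1
    ok &= all(abs(loss(y, 1) - loss(-y, -1)) < 1e-9 for y in grid)       # condition 2
    vals = [loss(y, 1) for y in grid]
    ok &= all(a >= b - 1e-12 for a, b in zip(vals, vals[1:]))            # condition 3
    ok &= all(loss(y, 1) + loss(y, -1) > 0 for y in grid)                # condition 4
    ok &= any(abs(loss(y, 1) + loss(y, -1) - 2) > 1e-6 for y in grid)    # condition 5
    return ok

squared = lambda yp, y: 0.5 * (yp - y) ** 2
assert satisfies_conditions(squared)
```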
As in the previous case, let $Z : X \to [-1,1]$ denote the randomized boolean function that on any input $x \in X$ outputs $1$ or $-1$ with equal probability. Then note that for any loss function $\ell$ satisfying the conditions above, the corresponding expected loss is $L_{f,D}(Z) = 1$. Thus, for any function $\phi : X \to [-1,1]$, $G_{f,D}(\phi) = L_{f,D}(\phi) - 1$. Feldman [14] showed that a concept class is learnable in the SQ framework if and only if it is learnable only with access to a $G_{f,D}(\cdot)$ oracle (for a loss function satisfying the conditions above). Thus, Theorem 3.5 implies the following theorem:

Theorem 3.7 ([14]). Suppose a concept class $C$ is learnable in the SQ framework with respect to a class of distributions $D$; then $C$ is evolvable with respect to the class of distributions $D$, using any loss function $\ell$ that satisfies the conditions listed above and with selection rule BN-Sel.

Real-Valued Ideal Function and Representations

P. Valiant [40] introduced the notion of evolvability of real-valued functions, i.e. the case when the ideal function is real-valued. Suppose the class of representations can be parametrized by some convex set $K \subseteq \mathbb{R}^d$ (this is the case, for example, if each representation is a polynomial with bounded coefficients or a linear function with bounded coefficients). Suppose $r \in K$, and let $\ell$ be a loss function such that for any ideal function $f \in C$, the expected loss $L_{f,D}(r)$ is a convex function of $r \in K$. Suppose that $Z$ is also a function in $K$. Then $G_{f,D}(r) = L_{f,D}(r) - L_{f,D}(Z)$ is also a convex function of $r$. The goal of evolution is weak optimization, i.e. to find an $r \in K$ that is a near-minimizer of $G_{f,D}$: an $r$ such that $G_{f,D}(r) \le \min_{r'\in K} G_{f,D}(r') + \epsilon$ (or equivalently $L_{f,D}(r) \le \epsilon$, since by assumption $\min_{r'\in K} L_{f,D}(r') = 0$).

P. Valiant [40] observes that when $G_{f,D}(\cdot)$ is convex and bounded, the weak optimization problem can often be solved using only access to the oracle $G^{\le}_{f,D}(\cdot,\cdot,\cdot)$, via the ellipsoid method (when certain boundedness conditions are met). In particular, he uses this to show that the class of constant-degree polynomials (with bounded coefficients) can be evolved with respect to any convex loss function (including linear loss).

Chapter 4

Drifting Targets

In this chapter, we consider the issue of stability of an evolutionary mechanism under gradual change, or drift, in the ideal (or target) function. Such stability is a desirable property for evolutionary mechanisms that is not explicitly captured in the definition of evolvability discussed in Chapter 3. Another property desirable in an evolutionary mechanism is monotonicity, i.e. the performance should not degrade during the process of evolution. In this chapter, two main results are presented. The first shows that all evolutionary mechanisms can be adapted so as to be robust with respect to a (gradually) drifting ideal function. The second shows that any evolutionary mechanism may be made quasi-monotonic, i.e. the performance does not degrade substantially during the process of evolution. For specific learning algorithms, we also provide rates of drift in the ideal function that may be tolerated by certain evolutionary mechanisms.

In this chapter, we describe three different notions of monotonicity: (i) quasi-monotonicity, where for any $\epsilon$, the expected loss of any intermediate representation must not be greater than the loss of the starting representation $r_0$ by more than $\epsilon$; (ii) monotonicity, where the expected loss of each intermediate representation should be non-increasing (and, in particular, is always at most the expected loss of the starting representation $r_0$); and (iii) strict monotonicity, where the expected loss of each intermediate representation should decrease noticeably (at least by an inverse polynomial amount) at each generation. The notion of monotonic evolution appears in Feldman's work [13], and the notion of strict monotonicity is implicit in the work of Michael [29].
We define the notion of an evolutionary mechanism being stable to drift in the following sense: for some inverse polynomial amount of drift in the ideal function, using only polynomial resources, the evolutionary mechanism converges to a representation with expected loss at most $\epsilon$, and it stays at representations with such low expected loss in perpetuity, in the sense that at every subsequent time step, except with probability $\epsilon$, the expected loss of the representation at that time is at most $\epsilon$. We prove a general result showing that any strictly monotonic evolutionary mechanism is automatically robust to drift in the ideal function. For two specific evolutionary mechanisms discussed in the previous chapter – those for monotone conjunctions and homogeneous linear separators – we quantify explicitly the amount of drift that the evolutionary mechanisms can tolerate.

4.1 Notions of Monotonicity

In this section, we describe formally the three notions of monotonicity mentioned above. Feldman [13, 14] introduced the notion of monotonic evolution as stated in Definition 4.1 below. This notion of monotonicity requires that, with high probability, the expected loss of the current representation $r_i$ is not more than the expected loss of the initial representation $r_0$.

4.1.1 Notation

Let $C$ be a concept class and $\mathcal{D}$ a class of distributions defined over $X$, where each $c \in C$ is some function $c : X \to Y$. Let $n$ be the size parameter associated with instances $x \in X$, e.g. $X = \{-1,1\}^n$ or $X \subseteq \mathbb{R}^n$. Let $EM = (R,\mathrm{Mut},\tau,s,t)$ be an evolutionary mechanism consisting of a representation class $R$, mutator Mut, approximation parameter $\tau$, size function $s$, and tolerance function $t$. Let $\ell$ be some loss function that is consistent and bounded. Suppose that $EM$ evolves $C$ with respect to any distribution $D \in \mathcal{D}$ and loss function $\ell$, using some selection rule Sel. Let $g$ be the maximum number of generations required by $EM$ to evolve $C$, and let $r_0, r_1, \ldots, r_g$ denote the evolutionary sequence. We define the following notions of monotonicity:

Definition 4.1 (Monotonic Evolution). An evolutionary mechanism $EM$ monotonically evolves a concept class $C$ with respect to a class of distributions $\mathcal{D}$, using loss function $\ell$ and selection rule Sel, if $EM$ evolves $C$ over the class of distributions $\mathcal{D}$, using loss function $\ell$ and selection rule Sel, in $g$ generations, and furthermore, with probability at least $1-\epsilon$, for every $i \le g$, $L_{f,D}(r_i) \le L_{f,D}(r_0)$, where $r_0,\ldots,r_g$ is the evolutionary sequence.

When explicit initialization of the starting representation $r_0$ is not allowed, the definition above is equivalent to requiring that $L_{f,D}(r_i) \le L_{f,D}(r_{i-1})$ for every $i \le g$, with high probability. In other words, it is equivalent to requiring that, with high probability, the expected loss never increases during the evolutionary process.

Feldman showed that if representations are allowed to express real-valued functions and the loss function $\ell$ is the squared loss, i.e. $\ell(y',y) = (1/2)(y'-y)^2$, then any class that is efficiently learnable in the statistical query (SQ) framework with respect to a fixed samplable distribution $D$ is monotonically evolvable over $D$ [13]. Feldman also showed that, under a large class of non-linear loss functions (which includes the squared loss), the class of large-margin halfspaces is monotonically evolvable distribution-independently [15].
A stronger notion of monotonicity was used by Michael [29], in the context of real-valued representations and quadratic loss functions. He proposed an evolutionary mechanism for the class of 1-decision lists (readers not familiar with the notion of 1-decision lists are referred to Kearns and Vazirani [23]) which always allowed for strictly beneficial mutations at every time step. In this spirit, we define the notion of strict monotonic evolution, which requires a significant (inverse polynomial) decrease in the expected loss at each stage of the evolutionary process. The definition below also assumes the notation defined at the beginning of this section.

Definition 4.2 (Strict Monotonic Evolution). An evolutionary mechanism $EM$ strictly monotonically evolves a class $C$ with respect to a class of distributions $\mathcal{D}$, using loss function $\ell$ and selection rule Sel, if $EM$ evolves $C$ with respect to the class of distributions $\mathcal{D}$, using loss function $\ell$ and selection rule Sel, and for some polynomial $m(n,1/\epsilon)$, with probability at least $1-\epsilon$, for every $i \le g$, either $L_{f,D}(r_i) \le \epsilon$ or $L_{f,D}(r_i) \le L_{f,D}(r_{i-1}) - 1/m(n,1/\epsilon)$, where $g$ and $r_0,\ldots,r_g$ are as defined at the beginning of this section.

A notion related to that of strict monotonic evolution is that of a strictly beneficial neighborhood mutator. Informally, a mutator is a strictly beneficial neighborhood mutator if, with significant probability, it outputs a mutation that is guaranteed to have expected loss noticeably lower than that of the starting representation. Formally, we define the notion of a $(b,\rho)$-strictly beneficial neighborhood mutator:

Definition 4.3 (Strictly Beneficial Neighborhood Mutator). For a concept class $C$, class of distributions $\mathcal{D}$, loss function $\ell$, and representation class $R$, we say that a mutator Mut is a $(b,\rho)$-strictly beneficial neighborhood mutator if the following is true for every $f \in C$ and every $D \in \mathcal{D}$: for any starting representation $r$, let $\mathrm{Bene}(r) = \{r' \in R \mid L_{f,D}(r') < L_{f,D}(r) - b\}$; then either $L_{f,D}(r) \le b$ or $\Pr[\mathrm{Mut}(r,\epsilon) \in \mathrm{Bene}(r)] \ge \rho$. We require that $1/b$ and $1/\rho$ are bounded by some polynomial in $n$ and $1/\epsilon$.

In Section 4.4, we show that the evolutionary mechanisms described in Chapter 3 – for evolving monotone conjunctions and homogeneous linear separators – have strictly beneficial neighborhood mutators. It easily follows that evolutionary mechanisms that have strictly beneficial neighborhood mutators evolve strictly monotonically. In Section 4.2, we show that such evolutionary mechanisms are also robust to drift in the ideal (target) function.

Finally, we define quasi-monotonic evolution. This is similar to monotonic evolution, except that the performance is allowed to go slightly below that of the starting representation $r_0$. Section 4.3 shows that this notion is essentially universal, in the sense that every evolvable class is also evolvable quasi-monotonically.

Definition 4.4 (Quasi-Monotonic Evolution). An evolutionary mechanism $EM$ quasi-monotonically evolves $C$ with respect to a class of distributions $\mathcal{D}$ using selection rule Sel if $EM$ evolves $C$ with respect to the class of distributions $\mathcal{D}$ using selection rule Sel and, with probability at least $1-\epsilon$, for every $i \le g(n,1/\epsilon)$, $L_{f,D}(r_i) \le L_{f,D}(r_0) + \epsilon$, where $g(n,1/\epsilon)$ and $r_0, r_1, \ldots$ are as defined at the beginning of this section.
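Returning to Definition 4.3, the strictly-beneficial property is directly testable by Monte Carlo sampling when the exact expected loss is computable; a hedged sketch (all names are ours, and the frequency estimate is only approximate):

```python
def check_strictly_beneficial(mutator, loss, r, b, rho, eps, trials=10000):
    """Empirically checks the (b, rho)-strictly-beneficial property at a single
    representation r; `loss` is assumed to compute the exact expected loss L_{f,D}."""
    base = loss(r)
    if base <= b:
        return True                        # the first branch of Definition 4.3
    hits = sum(loss(mutator(r, eps)) < base - b for _ in range(trials))
    return hits / trials >= rho            # estimate of Pr[Mut(r, eps) in Bene(r)]
```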
4.2 Resistance to Drift

There are several ways one could choose to formalize the notion of resistance to drift. Our formalization is closely related to ideas from the work on tracking drifting concepts in the computational learning literature. The first models of concept drift were proposed around the same time by Helmbold and Long [19] and Kuh et al. [24]. In both these models, at each time $t$, an input point $x_t$ is drawn from a fixed but unknown distribution $D$ and labeled according to some target function $f_t \in C$. It is assumed that the error of $f_t$ with respect to $f_{t-1}$ on the distribution $D$ is less than a fixed value $\Delta$. Helmbold and Long [19] showed that a simple algorithm that chooses a concept to (approximately) minimize error over recent time steps achieves an average error rate of $\tilde O(\sqrt{\Delta d})$, where $d$ is the VC dimension of $C$ (here the notation $\tilde O(\cdot)$ suppresses logarithmic factors). More general models of drift have also been proposed [2, 4].

Let $f_t \in C$ denote the ideal function at time step $t$ of the evolutionary process. Following Helmbold and Long [19], we make the assumption that for every $t$, $\Pr_{x\sim D}[f_{t-1}(x) \neq f_t(x)] \le \Delta$ for some value of $\Delta$. We say that a sequence $f_1, f_2, \ldots, f_t, \ldots$ is a $\Delta$-drifting sequence with respect to a distribution $D$ if for every $t$, $\Pr_{x\sim D}[f_{t-1}(x) \neq f_t(x)] \le \Delta$. (When the $f_t$ are all boolean functions, this is a very natural notion of drifting target (ideal) functions. In the case that the $f_t$ are real-valued functions, it may be possible to consider a more general notion of distance between $f_{t-1}$ and $f_t$; with further assumptions on the loss function, such as some Lipschitz constraint on $\ell(\cdot,\cdot)$, it would then be possible to obtain results similar to those described here. However, in this thesis we stick to the definition of drift as described here, even for real-valued ideal functions.) We may then define the notion of evolvability with respect to a drifting target (ideal function).

Definition 4.5 (Evolvability with Drifting Targets). For a concept class $C$ and distribution $D$, we say that $C$ is evolvable with drifting targets under distribution $D$, using loss function $\ell$ and selection rule Sel, by an evolutionary mechanism $EM = (R,\mathrm{Mut},\tau,s,t)$, if there exist some polynomially bounded $g$ and a polynomial $d(n,1/\epsilon)$ such that for every $r_0 \in R$, every $\epsilon > 0$, any $\Delta \le 1/d(n,1/\epsilon)$, and every $\Delta$-drifting sequence $f_1, f_2, \ldots, f_t, \ldots$ (with each $f_t \in C$), if $r_0, r_1, \ldots$ is the evolutionary sequence (resulting from $EM$ and Sel), then for every $l \ge g$, with probability at least $1-\epsilon$, $L_{f_l,D}(r_l) \le \epsilon$. We refer to $d(n,1/\epsilon)$ as the drift polynomial.

The drift polynomial $d(n,1/\epsilon)$ defined above characterizes the amount of drift that can be tolerated by a specific evolutionary mechanism. Theorem 4.1 shows that whenever an evolutionary mechanism has a strictly beneficial neighborhood mutator for some concept class $C$ under distribution $D$, such a mechanism is robust to drift in the ideal function. Roughly speaking, the amount of drift that may be tolerated is the same as the advantage (in terms of expected loss) that the beneficial mutations in the neighborhood enjoy over the starting representation.
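To experiment with Definition 4.5, one needs a supply of $\Delta$-drifting targets. Below is a sketch for monotone conjunctions under the uniform distribution, using the fact that toggling one variable changes the function on an exactly computable fraction of the cube; the construction and all names are illustrative.

```python
import random

def toggle_disagreement(lits, i):
    """Pr_{x~U}[f != f'] when f' toggles variable i in the conjunction f."""
    shared = len(lits - {i}) if i in lits else len(lits)
    return 2.0 ** -(shared + 1)

def drifting_conjunctions(n, delta, steps, rng=random.Random(0)):
    """A Delta-drifting sequence of monotone conjunctions: a toggle is applied
    only when the induced disagreement stays at most delta, so that
    Pr[f_{t-1} != f_t] <= delta holds at every step."""
    lits = set(range(n // 2))              # an arbitrary starting conjunction
    seq = [frozenset(lits)]
    for _ in range(steps):
        i = rng.randrange(n)
        if toggle_disagreement(lits, i) <= delta:
            lits ^= {i}                    # add or remove one literal
        seq.append(frozenset(lits))
    return seq
```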
Theorem 4.1. For a concept class $C$ and distribution $D$, if $EM = (R,\mathrm{Mut},\tau,s,t)$ is an evolutionary mechanism such that Mut is a $(b,\rho)$-strictly beneficial neighborhood mutator (with respect to loss function $\ell$), then, if the selection rule is BN-Sel, $EM$ evolves $C$ with drifting targets under the following conditions:

1. $b < \epsilon/2$.
2. The drift rate $\Delta \le b/(24B)$.
3. The approximation parameter $\tau = b/6$.
4. The size function $s(r,\epsilon) = \lceil(1/\rho)\log(8B/(b\epsilon))\rceil$, for every $r \in R$.
5. The tolerance function $t(r,\epsilon) = b/2$, for every $r \in R$.

Proof. Define $g = 8B/b$. We will show that for any $\Delta$-drifting sequence of ideal functions $f_0, f_1, f_2, \ldots, f_g, f_{g+1}, \ldots$, if the corresponding evolutionary sequence of representations is $r_0, r_1, \ldots, r_g, r_{g+1}, \ldots$, then for any $l \ge g$, with probability at least $1-\epsilon$, $L_{f_l,D}(r_l) \le \epsilon$. Note that Mut is a $(b,\rho)$-strictly beneficial neighborhood mutator.

Let $l \ge g$; we consider the evolutionary sequence starting from $r_{l-g}$. When $s(r,\epsilon) = \lceil(1/\rho)\log(8B/(b\epsilon))\rceil$, for any representation $r$, except with probability $b\epsilon/(8B)$, there exists $r' \in \mathrm{Neigh}(r,\epsilon)$ (obtained by running $\mathrm{Mut}(r,\epsilon)$ $s(r,\epsilon)$ times) such that $L_{f,D}(r') \le L_{f,D}(r) - b$ (unless $L_{f,D}(r) \le b$). This holds for every possible ideal function $f$.

Now suppose $f$ and $f'$ are such that $\Pr_{x\sim D}[f(x) \neq f'(x)] \le \Delta$. Then for any representation $r$ we have,

$$|L_{f,D}(r) - L_{f',D}(r)| = \left|\mathbb{E}_{x\sim D}[\ell(r(x),f(x)) - \ell(r(x),f'(x))]\right| \le \Pr_{x\sim D}[f(x) \neq f'(x)]\cdot B \le \Delta B \le b/24$$

The last assertion holds because the quantity inside the expectation makes a contribution only when $f(x) \neq f'(x)$, and $|\ell(r(x),f(x)) - \ell(r(x),f'(x))| \le B$, since $0 \le \ell(y',y) \le B$ for every $y', y$.

Observe that, except with probability $\epsilon$, for the $8B/b$ evolutionary steps starting from representation $r_{l-g}$, for every $l-g \le i < l$, there is some $r'_i$ in $\mathrm{Neigh}(r_i,\epsilon)$ such that $L_{f_i,D}(r'_i) \le L_{f_i,D}(r_i) - b$ (unless $L_{f_i,D}(r_i) \le b$). This follows from a simple union bound over the failure probabilities at each time step. We assume that this is the case, allowing a failure probability of $\epsilon$.

To complete the proof, we show that for any $l-g \le i < l$, either $L_{f_i,D}(r_i) \le b$, or $L_{f_{i+1},D}(r_{i+1}) - L_{f_i,D}(r_i) \le -b/8$. Fix some time step $i+1$, and let $r'_i \in \mathrm{Neigh}(r_i,\epsilon)$ be such that $L_{f_i,D}(r'_i) \le L_{f_i,D}(r_i) - b$. Then it must be the case that $L_{f_i,D}(r'_i,\tau) \le L_{f_i,D}(r_i,\tau) - b + 2\tau \le L_{f_i,D}(r_i,\tau) - 2b/3 \le L_{f_i,D}(r_i,\tau) - t(r_i,\epsilon)$. Thus, the selection rule BN-Sel certainly picks a beneficial mutation. Let $r_{i+1}$ be the chosen beneficial mutation; then $L_{f_i,D}(r_{i+1}) \le L_{f_i,D}(r_i) - t(r_i,\epsilon) + 2\tau \le L_{f_i,D}(r_i) - b/6$. Note that this implies

$$L_{f_{i+1},D}(r_{i+1}) \le L_{f_i,D}(r_{i+1}) + \Delta B \le L_{f_i,D}(r_i) - b/6 + \Delta B \le L_{f_i,D}(r_i) - b/8$$

Thus, in at most $8B/b$ generations starting from $l-g$ (since, obviously, $L_{f_{l-g},D}(r_{l-g}) \le B$), there must be a representation with expected loss at most $b$. Also, note that if $r_i$ was such that $L_{f_i,D}(r_i) \le b$, then $r_{i+1}$ must be at least a neutral mutation. It must then be the case that $L_{f_i,D}(r_{i+1}) \le L_{f_i,D}(r_i) + t(r_i,\epsilon) + 2\tau \le b + 5b/6$. Thus, $L_{f_{i+1},D}(r_{i+1}) \le b + 5b/6 + \Delta B \le 2b$. Thus, we can see that, except with probability $\epsilon$, once there is some $i$ in the sequence $r_{l-g},\ldots,r_l$ such that $L_{f_i,D}(r_i) \le b$, then for every subsequent $r_j$, $j > i$, it must be the case that $L_{f_j,D}(r_j) \le 2b$. As long as $b < \epsilon/2$, this completes the proof. □

Section 4.4 appeals to the above theorem to show that certain specific evolutionary mechanisms, namely those for evolving monotone conjunctions and homogeneous linear separators, are robust to drift in the ideal function.

4.2.1 Universality of Drift Resistance

In this section, we prove our main result: resistance to drift is in some sense universal, i.e. any class $C$ that is evolvable is also evolvable with some (inverse polynomial) amount of drift.
We show that the reduction from learning algorithms to evolutionary mechanisms described in Section 3.8.1 can be modified so as to be robust to drift in the ideal function. We recall some notation from Section 3.8.1. Let $X$ be the instance space and $n$ the representation size parameter associated with $X$ (e.g. $X = \{-1,1\}^n$). Let $C$ be the target concept class, where each $c \in C$ is defined as $c : X \to Y$. Let $f \in C$ be the ideal (target) function, and let $\ell : Y' \times Y \to \mathbb{R}^+$ be a consistent and bounded loss function, i.e. for every $y \in Y$, $\min_{y'\in Y'}\ell(y',y) = \ell(y,y) = 0$, and for every $y' \in Y'$, $y \in Y$, $\ell(y',y) \le B$. Let a candidate hypothesis be a function $h : X \to Y'$; recall that the expected loss of $h$ is $L_{f,D}(h) = \mathbb{E}_{x\sim D}[\ell(h(x),f(x))]$. As in the reduction, we assume that $0 \in Y'$ is the zero (possibly randomized) point in $Y'$, and let $Z : X \to Y'$ be the function $Z(x) = 0$ for every $x \in X$. For any $\phi : X \to Y'$, recall the function $G_{f,D}(\phi) = L_{f,D}(\phi) - L_{f,D}(Z)$. Recall that the oracle $G^{\le}_{f,D}(\cdot,\cdot,\cdot)$, on receiving a query $(\phi,\tau_{\mathrm{Alg}},\theta)$, where $\phi : X \to Y'$, $1/\tau_{\mathrm{Alg}}$ is bounded by a polynomial in $n$ and $1/\epsilon$, and $\tau_{\mathrm{Alg}} \le |\theta| \le B$, responds as follows:

$$G^{\le}_{f,D}(\phi,\tau_{\mathrm{Alg}},\theta) = \begin{cases} 1 & \text{if } G_{f,D}(\phi) \le \theta - \tau_{\mathrm{Alg}} \\ 0 & \text{if } G_{f,D}(\phi) \ge \theta + \tau_{\mathrm{Alg}} \\ 0 \text{ or } 1 & \text{otherwise} \end{cases}$$

Suppose Alg is an algorithm that learns the concept class $C$ under distribution $D$, with access to a $G^{\le}_{f,D}(\cdot,\cdot,\cdot)$ oracle, in polynomial time, such that any oracle query it makes is of the form $(\phi,\tau_{\mathrm{Alg}},\theta)$, where $\phi : X \to Y'$ is an efficiently evaluable function, $1/\tau_{\mathrm{Alg}}$ is polynomially bounded (in $n$ and $1/\epsilon$), and $\tau_{\mathrm{Alg}} \le |\theta| \le B$. Let $q$ be the number of queries made by Alg, and if $h$ is the hypothesis output by Alg, then $L_{f,D}(h) \le \epsilon/2$ (note the stronger requirement on accuracy). Our goal is to show that there exists some drift rate $\Delta$, with $1/\Delta$ bounded by a polynomial in $n$ and $1/\epsilon$, such that there exists an evolutionary mechanism (obtained using algorithm Alg) that is robust to the ideal function drifting at rate $\Delta$. The proof we present assumes that the selection rule is BN-Sel. (The proof is equally applicable when the selection rule is Opt-Sel, but not directly for Weak-Sel.)

For the rest of the section, we assume that the target accuracy $\epsilon$ is fixed. We define a representation class $R$ (which depends on $\epsilon$). The construction of the evolutionary mechanism is along the same lines as the one described in Section 3.8.1. The high-level idea of the reduction is as follows. Let $\Phi$ denote a representation of the form we used in Section 3.8.1 to simulate the learning algorithm Alg, and let $H$ denote the class of hypotheses from which Alg outputs its hypotheses. We will use representations that are a combination of these two kinds of representations. In particular, define:

$$r' = \left(1 - \frac{\epsilon}{2B}\right)h + \frac{\epsilon}{2B}\Phi \quad (4.1)$$

where $h \in H$ is a possible hypothesis and $\Phi$ is a representation of the form used in Section 3.8.1. (Note that $B$ is the bound on the loss function.) The representation $r'$ is interpreted as a randomized function that behaves exactly like $h$ with probability $1-\epsilon/(2B)$ and like $\Phi$ with probability $\epsilon/(2B)$.

We show that when $\Delta$ (the drift rate) is small enough, in the short range (the approximately $q$ generations required to simulate algorithm Alg), the ideal function remains essentially fixed. This is because the algorithm Alg only uses queries that require approximate access to some property of the (target) ideal function, in our case through the oracle $G^{\le}_{f,D}(\cdot,\cdot,\cdot)$.
We show that when ∆ (the drift rate) is small enough, over the short range (the approximately q generations required to simulate algorithm Alg) the ideal function remains essentially fixed. This is because the algorithm Alg only uses queries that require approximate access to some property of the (target) ideal function, in our case through the oracle G≤f,D(·, ·, ·). This simulation produces a hypothesis, h, that is accurate for a few generations (until the ideal function drifts substantially). This hypothesis is encoded in the (1 − ε/(2B))h part of the representation. However, the (ε/(2B))Φ part of the representation restarts the simulation with respect to the possibly drifted ideal function. Thus, the evolutionary mechanism always catches up with the drifting ideal function.

Construction of Evolutionary Mechanism

We now give a formal construction of the outline described above, using the notation introduced above. Note that the algorithm Alg makes q queries to the G≤f,D(·, ·, ·) oracle, and each query is of the form (φ, τAlg, θ), where 1/τAlg is bounded by a polynomial in n and 1/ε, and τAlg ≤ |θ| ≤ B. Also, recall that the final hypothesis output by Alg comes from some class H.

Let Sq = {z ∈ {0, 1}* | |z| ≤ q} be the set of binary strings of length at most q. For h ∈ H and z ∈ Sq, we interpret the representation⁴ r[h, z] = (1 − ε/(2B))h + (ε/(2B))r[z] as follows: when x ∈ X is received as input, r[h, z](x) is evaluated as:

• With probability 1 − ε/(2B), r[h, z](x) = h(x).
• With the remaining probability, i.e. probability ε/(2B), r[h, z](x) = r[z](x), where r[z] is as defined in Section 3.8.1, i.e.

r[z] = Σ_{i : zi = 1} (τAlg/(q · |θ_{z^i}|)) φ_{z^i},

where (φ_{z^i}, τAlg, θ_{z^i}) is the i-th query made by Alg if the responses to the first i − 1 queries were as encoded in the string z^i. (Recall that z^i is the prefix of z of length i − 1.)

The complete evolutionary mechanism also needs a few more representations, which we introduce later to maintain clarity of presentation. We define the mutator, Mut, on the representations defined so far. Let r = r[h, z] ∈ R and ε be given; then Mut(r[h, z], ε) behaves as follows:

• When |z| < q, let (φz, τAlg, θz) be the query that Alg makes to the G≤f,D(·, ·, ·) oracle if the answers to the first |z| queries are as encoded in z. The parameter η used below is defined later. Then:
1. If θz > 0, Mut(r[h, z], ε) = r[h, z1] with probability 1 − η and Mut(r[h, z], ε) = r[h, z0] with probability η.
2. If θz < 0, Mut(r[h, z], ε) = r[h, z0] with probability 1 − η and Mut(r[h, z], ε) = r[h, z1] with probability η.

Let τ = ετAlg/(6q) be the approximation parameter, s(r[h, z], ε) = (1/η) log(1/η) the size function (as stated earlier, η will be defined later, but in any case 1/η is bounded by a polynomial in n and 1/ε, so that the size function is polynomially bounded), and t(r[h, z], ε) = ετAlg/(2q) the tolerance function. The size and tolerance functions are defined for representations of the form r[h, z] with |z| < q.

⁴We abuse notation slightly to simplify presentation. Here r[z] indicates the representation used in Section 3.8.1 in the reduction from learning algorithms to evolutionary mechanisms, and r[h, z] is the representation of the form defined in (4.1): a randomized function combining some hypothesis h ∈ H with weight 1 − ε/(2B) and a representation of the form r[z] (which is used to simulate the learning algorithm) with weight ε/(2B).
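A minimal sketch of this mutator (our own illustration; next_query is a hypothetical helper standing in for Alg's query schedule) shows the intended behavior on the query-response string z:

```python
import random

def mutate(h, z: str, next_query, q: int, eta: float):
    """Mutator on r[h, z]: while |z| < q, append the answer bit favoured
    by the sign of theta with probability 1 - eta, else the other bit.
    next_query(z) -> (phi, tau_alg, theta) is a hypothetical helper giving
    Alg's next query when the previous responses are encoded in z."""
    if len(z) >= q:
        return (h, z)   # simulation complete; the restart step handles this
    _phi, _tau_alg, theta = next_query(z)
    favoured = '1' if theta > 0 else '0'
    other = '0' if favoured == '1' else '1'
    bit = favoured if random.random() < 1 - eta else other
    return (h, z + bit)
```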
So far we have used G≤f,D(·, ·, ·) to denote the oracle accessed by the algorithm Alg. Note, however, that this notation depends on the target (ideal) function. Suppose f0, f1, . . . is a ∆-drifting sequence of ideal functions. We show the simple fact that, depending on the drift rate ∆, the query response G≤f,D(φ, τAlg, θ) does not change significantly at each evolutionary step. In particular, we show that if i and j are such that |i − j| ≤ τAlg/(4B∆), then the query response G≤fi,D(φ, τAlg/2, θ) is a valid response to the query (φ, τAlg, θ) made to the G≤fj,D(·, ·, ·) oracle. (Note the subscripts on the ideal functions in the two oracles.)

Lemma 4.1. Let f0, f1, f2, . . . be a ∆-drifting sequence of ideal functions with respect to distribution D. For any indices i and j such that |i − j| ≤ τAlg/(4B∆), for any function φ : X → Y′ and any θ such that τAlg/2 ≤ |θ| ≤ B, the following hold:
1. If Gfj,D(φ) ≤ θ − τAlg, then Gfi,D(φ) ≤ θ − (τAlg/2).
2. If Gfj,D(φ) ≥ θ + τAlg, then Gfi,D(φ) ≥ θ + (τAlg/2).
In particular, this shows that the query response G≤fi,D(φ, τAlg/2, θ) is a valid response to the query (φ, τAlg, θ) made to the oracle G≤fj,D(·, ·, ·).

Proof. Assume that i < j; the proof in the case i > j is nearly identical, and the result is trivial when i = j. For any function φ : X → Y′,

|Gfi,D(φ) − Gfj,D(φ)| ≤ |Lfi,D(φ) − Lfj,D(φ)| + |Lfi,D(Z) − Lfj,D(Z)|.

We use the fact that for any function φ : X → Y′, |Lfi,D(φ) − Lfj,D(φ)| ≤ Prx∼D[fi(x) ≠ fj(x)] · B. Thus, we have:

|Gfi,D(φ) − Gfj,D(φ)| ≤ 2B · Prx∼D[fi(x) ≠ fj(x)].

Since fi and fj come from a ∆-drifting sequence, Prx∼D[fi(x) ≠ fj(x)] ≤ |j − i|∆. Thus, we have:

|Gfi,D(φ) − Gfj,D(φ)| ≤ 2B∆|j − i| ≤ τAlg/2.

Thus, if Gfj,D(φ) ≥ θ + τAlg, then Gfi,D(φ) ≥ θ + (τAlg/2), and if Gfj,D(φ) ≤ θ − τAlg, then Gfi,D(φ) ≤ θ − (τAlg/2).

For a string z of length at most q, we say that z is consistent with target function f if, for all 1 ≤ i ≤ |z|, zi is a valid response to the query (φ_{z^i}, τAlg, θ_{z^i}) made to the oracle G≤f,D(·, ·, ·). Consider the evolutionary mechanism EM = (R, Mut, τ, s, t) and the evolutionary sequence of representations r0, r1, . . . , rq for the first q time steps. If r0 = r[h, σ], then using the proof of Theorem 3.5 we can claim that, except with probability 2ηq, rq = r[h, z], where zi is a valid response to the query (φ_{z^i}, τAlg/2, θ_{z^i}) made to the oracle G≤fi,D(·, ·, ·). Lemma 4.1 implies that zi is also a valid response to the query (φ_{z^i}, τAlg, θ_{z^i}) made to the oracle G≤fq,D(·, ·, ·), as long as ∆ ≤ τAlg/(4Bq). We state this as Lemma 4.2.

Lemma 4.2. Let ∆ ≤ τAlg/(4Bq). Then for any ∆-drifting sequence f0, f1, . . . , fq, if r0 = r[h, σ] is the starting representation, the evolutionary mechanism EM with selection rule BN-Sel reaches, with probability at least 1 − 2ηq, after q evolutionary steps, a representation rq = r[h, z], where z is a string of length q that is consistent with fq.
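The arithmetic of Lemma 4.1 amounts to a simple window computation; the following sketch (our own, with assumed parameter names) makes it explicit:

```python
def safe_window(tau_alg: float, B: float, drift: float) -> int:
    """Largest index gap |i - j| for which, by Lemma 4.1, an answer that is
    valid at tolerance tau_alg/2 for f_i remains valid at tolerance tau_alg
    for f_j."""
    return int(tau_alg / (4 * B * drift))

def g_shift_bound(B: float, drift: float, gap: int) -> float:
    """Bound |G_{f_i,D}(phi) - G_{f_j,D}(phi)| <= 2*B*drift*|i - j|."""
    return 2 * B * drift * gap

# e.g. tau_alg = 0.1, B = 1, drift = 1e-4: answers stay valid across 250
# steps, over which the G-values move by at most 0.05 = tau_alg / 2
assert g_shift_bound(1.0, 1e-4, safe_window(0.1, 1.0, 1e-4)) <= 0.05
```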
Suppose that z is a string that is consistent with a function fq, and let hz denote the hypothesis that Alg would output if it had received query responses as encoded in z; by assumption, Lfq,D(hz) ≤ ε/2. At this point, we want the evolutionary sequence to mutate from the representation r[h, z] to r[hz, σ], where σ is the empty string. This will allow the evolutionary mechanism to essentially restart the simulation of Alg (using the r[σ] part of the representation). Note that Lfq,D(r[hz, σ]) = (1 − ε/(2B))Lfq,D(hz) + (ε/(2B))Lfq,D(r[σ]) ≤ ε. However, it may be the case that Lfq,D(r[h, z]) ≤ Lfq,D(r[hz, σ]). Thus, we cannot simply define the mutator, Mut(r[h, z], ε), to output r[hz, σ], as it may not be a beneficial mutation (or, for that matter, even a neutral one). We discuss below how this transition can be achieved in a small number of steps using Feldman's backsliding trick [12].

Restarting the Simulation

Suppose the evolutionary sequence, at time-step q, has some representation r[h, z], where z is consistent with fq. Ideally, we would simply define the mutator, Mut(r[h, z], ε), to output r[hz, σ]. But, as discussed above, this may be a deleterious mutation. The transition can instead be achieved by introducing a series of intermediate representations, along the lines of those defined in Section 3.8.1 (see the heading Removing the initialization requirement).

Define r(0) = r[h, z], where h ∈ H and |z| = q. Let t = ετAlg/(2q) be the tolerance, and let K = 2B/t. For simplicity of notation, we assume that K is an integer. For k = 1, . . . , K, define r(k) = (1 − kt/(2B))r(0), i.e. for any x ∈ X, with probability 1 − kt/(2B), r(k)(x) = r(0)(x), and with the remaining probability r(k)(x) = Z(x). Let R(∗) = {r(k) | r(0) = r[h, z], h ∈ H, |z| = q, k ∈ {0, . . . , K}}. Note that R(∗) implicitly depends on the representation r[h, z]; thus, for every such representation we add the additional representations defined in R(∗) to the total set of representations. We define the mutator, Mut, on representations in R(∗) as follows:

• Mut(r(K), ε) outputs r[Z, σ] with probability 1. (We assume without loss of generality that Z ∈ H.)
• For 0 ≤ k < K, Mut(r(k), ε) outputs r(k+1) with probability η and r[hz, σ] with probability 1 − η.

We now show that within at most K + 1 generations, starting from representation rq ≡ r[h, z], we reach a representation that is either r[Z, σ] or r[hz, σ]. Note that the following hold for any ideal function f:

For any h ∈ H:  Lf,D(r[h, σ]) = (1 − ε/(2B))Lf,D(h) + (ε/(2B))Lf,D(Z).
For any 1 ≤ k ≤ K:  Lf,D(r(k)) = (1 − kt/(2B))Lf,D(r(0)) + (kt/(2B))Lf,D(Z).

In particular, Lf,D(r(K)) = Lf,D(Z) and Lf,D(r[Z, σ]) = Lf,D(Z). Also, observe that

Lf,D(r(k)) − Lf,D(r(k−1)) = (t/(2B))(Lf,D(Z) − Lf,D(r(0))) ≤ t/2.

Since 2τ ≤ t, r(k) is a neutral mutation with respect to r(k−1), for every ideal function f. Now, starting at r(0), at any subsequent time step k < K, either the representation is of the form r[hz, σ], in which case our claim is true; otherwise, the representation after K steps will be r(K), and at the (K + 1)-th step the representation will be r[Z, σ] (which is always a neutral mutation with respect to r(K)). Note that, except with probability η, the representation r(k+1) will be in Neigh(r(k), ε) when s ≥ (1/η) log(1/η), and recall that r(k+1) is always neutral with respect to r(k). Suppose r(k+1) is indeed in Neigh(r(k), ε) (allowing a failure probability of η); then, if r[hz, σ] is deleterious, it is chosen with probability 0; if it is neutral, it is chosen with probability at least 1 − η; and if it is beneficial, it is chosen with probability 1. Thus, if at some time step with current representation r(k), the mutation r[hz, σ] is not chosen as the representation of the next generation, it must be the case that Lf′,D(r(k+1)) ≤ Lf′,D(r[hz, σ]) (r[hz, σ] must have been deleterious), where f′ is the ideal function at that time step. Thus, we can claim the following lemma:

Lemma 4.3. For any ∆-drifting sequence f0, f1, . . ., if r0, r1, . . . is the sequence of representations resulting from the evolutionary mechanism EM described above, starting at r0 = r(k), then except with probability 2(K − k + 1)η, there exists a j ≤ K − k + 1 such that rj = r[hz, σ] or rj = r[Z, σ]. Furthermore, for all 1 ≤ i < j, Lfi,D(ri) ≤ Lfi,D(r[hz, σ]).
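The staircase of intermediate representations can be pictured numerically; the sketch below (our own toy, with arbitrary loss values) lists the expected losses along r(0), r(1), . . . , r(K) and checks that consecutive steps differ by at most t/2, which is what makes each step neutral:

```python
def staircase_losses(loss_r0: float, loss_Z: float, t: float, B: float):
    """Expected losses of r(k) = (1 - k*t/(2B)) r(0) for k = 0..K, K = 2B/t."""
    K = int(2 * B / t)
    return [(1 - k * t / (2 * B)) * loss_r0 + (k * t / (2 * B)) * loss_Z
            for k in range(K + 1)]

losses = staircase_losses(loss_r0=0.9, loss_Z=0.5, t=0.1, B=1.0)
# consecutive gaps equal (t/(2B))*|loss_Z - loss_r0| <= t/2, hence neutral
assert all(abs(u - v) <= 0.1 / 2 + 1e-12 for u, v in zip(losses, losses[1:]))
assert abs(losses[-1] - 0.5) < 1e-12   # L(r(K)) = L(Z)
```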
Equivalence to Evolvability with Drifting Targets

Combining these results, we prove the equivalence between evolvability and evolvability with drifting targets, starting from any representation in R′ = R ∪ R(∗), where R(∗) includes the representations defined in the above section. In the proof below we assume that the value of ε is known; for the generalization to the case when ε is unknown, the ideas described in Section 3.8.2 can be used (also see Section 4.3.1). Theorem 4.2 shows that every concept class that is evolvable (as defined in Chapter 3) is also evolvable with drifting targets. We do this by showing that any concept class that can be learned using access to a G≤f,D(·, ·, ·) oracle can also be evolved with drifting targets.

Theorem 4.2. If C is evolvable with respect to a distribution D, then C is evolvable with respect to drifting targets over distribution D.

Proof. We use the constructions described in this section. Note that Alg is an algorithm that only makes queries to the G≤f,D(·, ·, ·) oracle, each of the form (φ, τAlg, θ), and makes exactly q queries. Also, we assume that the algorithm outputs a hypothesis h with Lf,D(h) ≤ ε/2. Let EM be the evolutionary mechanism described above, with representation class R′ = R ∪ R(∗). Each representation r ∈ R is of the form r[h, z], where h ∈ H and z ∈ Sq; the representations in R(∗) are the intermediate representations defined above, used to transition from r[h, z] to r[hz, σ].

Let g = 2q + K, and let ∆ = min{ε/(2(q + K)), τAlg/(4Bq)}, where K = 2B/t = 4Bq/(ετAlg). Note that the assertion of Lemma 4.2 holds for this value of ∆. We show that for any ∆-drifting sequence f0, f1, . . . , fl, . . ., if r0, r1, . . . , rl, . . . is the evolutionary sequence obtained using evolutionary mechanism EM and selection rule BN-Sel (with respect to loss function ℓ), then for any l ≥ g, with probability at least 1 − ε, Lfl,D(rl) ≤ ε.

Lemmas 4.2 and 4.3 imply that there must be some j ≤ 2q + K such that rl−j is of the form r[h, σ]; consider the smallest such j that is at least q. Consider the j steps after that, and assume that the low-probability events in the statements of Lemmas 4.2 and 4.3 do not occur (this happens with probability at least 1 − 2jη). Then it must be the case that either (i) rl = r(k) for some 0 ≤ k ≤ K (recall that r(0) = r[h, z], where z is consistent with fl−j+q), or (ii) rl = r[hz, z1], where z1 is some string with |z1| < q (note also that Lfl−j+q,D(hz) ≤ ε/2). (The case rl = r[Z, z1] is also possible, but this happens only if Lfi,D(Z) ≈ Lfi,D(hz) for some l − j + q ≤ i ≤ l. Since Lfl−j+q,D(hz) ≤ ε/2 and ∆ ≤ ε/(2(q + K)), that would mean that Z is itself a representation with low expected loss, so we may assume this does not happen.)

If rl is as in (i) above, then using Lemma 4.3 it must be the case that Lfl,D(rl) ≤ Lfl,D(r[hz, σ]). On the other hand, if rl is as in (ii) above, then Lfl,D(rl) ≤ (1 − ε/(2B))Lfl,D(hz) + (ε/(2B))Lfl,D(r[z1]). Note that Lfl,D(hz) ≤ Lfl−j+q,D(hz) + ∆|j − q| ≤ ε/2 + ∆(q + K). Thus, as long as ∆ ≤ ε/(2(q + K)), we get Lfl,D(rl) ≤ ε + ε/(2B). (The required result can be obtained by rescaling ε.) Note that setting η = ε/(4(q + K)) ensures that the failure probability is at most ε.
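The parameter choices in the proof can be collected into one place; the following sketch (our own bookkeeping of the quantities defined above) computes them from the learner's parameters and shows that each remains polynomially bounded:

```python
def drift_parameters(q: int, tau_alg: float, B: float, eps: float) -> dict:
    """Quantities used in the proof of Theorem 4.2 (our bookkeeping)."""
    t = eps * tau_alg / (2 * q)     # tolerance function
    K = int(2 * B / t)              # length of the restart staircase
    g = 2 * q + K                   # generations before the loss is <= eps
    delta = min(eps / (2 * (q + K)), tau_alg / (4 * B * q))  # drift rate
    eta = eps / (4 * (q + K))       # probability of a "wrong" mutation
    return {"t": t, "K": K, "g": g, "delta": delta, "eta": eta}

# e.g. q = 100 queries, tau_alg = 0.01, B = 1, eps = 0.1: 1/delta and g are
# polynomial in q, 1/tau_alg and 1/eps, as the theorem requires
print(drift_parameters(q=100, tau_alg=0.01, B=1.0, eps=0.1))
```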
4.3 Quasi-Monotonic Evolution

In this section, we show that any concept class that is evolvable is also evolvable quasi-monotonically. In Theorem 4.2, we showed that after g = 2q + K time steps, for any time step l ≥ g, with high probability Lfl,D(rl) ≤ ε. Thus, it is only necessary to show quasi-monotonicity in the initial time horizon. For the low drift rates considered in Theorem 4.2, we can essentially assume that the ideal (target) function is fixed over this time horizon; thus, we discuss the notion of quasi-monotonic evolution only with respect to a fixed ideal function. We use the same construction as in Section 4.2.1, with minor modifications. We still assume that the representation class depends on the (fixed) accuracy parameter ε; we discuss in Section 4.3.1 how this assumption may be removed using a more involved construction.

Theorem 4.3. If C is evolvable over distribution D, then C is quasi-monotonically evolvable over D.

Proof. We use the set-up from the proof of Theorem 4.2 and omit the parts of the argument that are identical. Let r0 be the starting representation. There are two possible cases:

1. r0 = r[h̃, z̃] for some h̃ ∈ H and some string z̃ with 0 ≤ |z̃| < q.
2. r0 = r(k̄) for some 0 ≤ k̄ < K, where r(0) = r[h̄, z̄], h̄ ∈ H, z̄ is a string of length q, and r(k̄) = (1 − k̄t/(2B))r(0).

First, we show that starting from r0 as in Case 1, we can move to a representation as in Case 2 quasi-monotonically. This follows easily (along the lines of Theorems 3.5 and 4.2), since any two representations r[h, z1] and r[h, z2] satisfy:

|Lf,D(r[h, z1]) − Lf,D(r[h, z2])| = (ε/(2B))|Lf,D(r[z1]) − Lf,D(r[z2])| ≤ ε/2.

Thus, starting from representation r0 = r[h̃, z̃], in exactly q − |z̃| steps the representation will be of the form r(0) = r[h̃, z], where z̃ is a prefix of the string z (which has length q). Also, each intermediate representation r satisfies Lf,D(r) ≤ Lf,D(r0) + ε. Thus, we may assume instead that the starting representation is of the form r(k̄), for 0 ≤ k̄ < K. We show that in at most K − k̄ + 1 steps, a representation of the form r[h′, σ] is reached, where h′ ∈ H and σ is the empty string. This is the same as Lemma 4.3; but we show that this transition is, in fact, quasi-monotonic.

Let r(0) = r[h̄, z̄], |z̄| = q, and let r0 = r(k̄), for some 0 ≤ k̄ < K, be the starting representation. We change the behavior of the mutator on these representations as follows (see the sketch after this list):

• For 0 ≤ k ≤ K − 1, Mut(r(k), ε) outputs the representation r[h̄, σ] with probability 1 − η; with probability η/2 it outputs r(k+1), and with the remaining probability η/2 it outputs r[hz̄, σ] (where hz̄ is the hypothesis that Alg would output when the query responses it received were consistent with those encoded in z̄).
• For r(K), Mut(r(K), ε) outputs the representation r[h̄, σ] with probability 1 − η and r[Z, σ] with the remaining probability.

Note that the mutator is essentially the same as the one used in Lemma 4.3, except that now, with probability 1 − η, the representation r[h̄, σ] is output. This ensures quasi-monotonicity.
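A minimal sketch of this modified mutator (our own illustration, with string labels standing in for the actual representations):

```python
import random

def quasi_monotone_mutate(k: int, K: int, eta: float) -> str:
    """Modified mutator of Theorem 4.3: the safe fallback r[h_bar, sigma]
    dominates; the staircase step and the restart each get weight eta/2."""
    u = random.random()
    if k == K:
        return "r[h_bar, sigma]" if u < 1 - eta else "r[Z, sigma]"
    if u < 1 - eta:
        return "r[h_bar, sigma]"   # never worse than the starting loss
    elif u < 1 - eta / 2:
        return "r(k+1)"            # advance the neutral staircase
    else:
        return "r[h_z, sigma]"     # attempt the restart
```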
The first thing we note is that, except with probability 2√η, if Neigh(r(k), ε) is the multi-set obtained by running the mutator Mut(r(k), ε) s times, then Neigh(r(k), ε) will contain at least one copy of r[hz̄, σ] and at least one copy of r(k+1). (Note that r(k+1) is always neutral with respect to r(k).) We consider two cases:

Case 1: r[hz̄, σ] is beneficial with respect to r(k) and r[h̄, σ] is not beneficial. In this case, the selection rule, BN-Sel, will pick r[hz̄, σ]. But note that in this case Lf,D(r[hz̄, σ]) ≤ Lf,D(r[h̄, σ]).

Case 2: r[hz̄, σ] and r[h̄, σ] are either both beneficial or both neutral. In this case, except with probability η, r[h̄, σ] will be selected.

What we can say is the following: except with probability 2√η + η, if r ← BN-Sel(r(k), ε, Lf,D, τ, s, t), then Lf,D(r) ≤ Lf,D(r[h̄, σ]). Thus, evolution reaches a representation of the form r[h′, σ]. Now, consider the next q + K + 1 steps after that. Evolution is quasi-monotonic over these q + K + 1 steps, in the sense that the expected loss of any intermediate representation does not exceed that of r[h′, σ] by more than ε. (Note that this means it does not exceed that of the actual starting representation, r0, by more than 2ε.) But then, using the analysis of Theorems 3.5 and 4.2, the representation after at most q + K + 1 steps (starting from r[h′, σ]) reaches some r∗ such that Lf,D(r∗) ≤ ε. Thus, by rescaling the accuracy requirements of the algorithm Alg, and by setting η = ε²/(64(q + K + 1)) (this ensures that the total failure probability is at most ε), we can claim that EM evolves C quasi-monotonically in at most 2q + 2K + 1 generations, with respect to selection rule BN-Sel.

4.3.1 Removing the Need to Know ε

We described above how, if some concept class C is evolvable with respect to a distribution D, it is also evolvable quasi-monotonically, provided ε is fixed and the representation class R is allowed to depend on ε. Here, we describe how a representation class that simultaneously encodes all values of ε can be constructed. Note that the definition of evolvability allows the mutator and other operations to depend on the accuracy parameter ε, but the starting representation may be arbitrarily selected from the set of representations.

We assume that the accuracy parameter ε is always a power of 2. Were this not the case, we could simply assume that the evolutionary mechanism uses ε′ = 2^⌊log ε⌋ instead. The performance guarantees of such a mechanism would only be better, since ε′ ≤ ε, but the polynomial bounds would not be (significantly) affected, since ε′ ≥ ε/2. We consider values of ε in the range S = {1/2, 1/4, 1/8, . . . , 2^−n}, where n is the size parameter associated with the instance space. It is unnecessary to consider values of ε lower than 2^−n, since this effectively allows the evolutionary process resources that are exponential in n, and then evolvability of (almost) any concept class is trivial.

We use notation defined earlier in this chapter. In particular, Alg is a learning algorithm that makes queries to a G≤f,D(·, ·, ·) oracle; Alg makes exactly q such queries and outputs a hypothesis h such that Lf,D(h) ≤ ε. Let the representation r[z] be as defined in Section 3.8.1. Recall that H is the class from which Alg outputs its final hypothesis. Define a term as follows:

• Every h ∈ H is a term, and h is said to encode no value of ε.
• For any ε1 ∈ S, let T1 be a term that either encodes no value of ε, or encodes only values ε > ε1. Then T = (1 − ε1/(2B))T1 + (ε1/(2B))r_{ε1}[z] is a term if |z| ≤ q(n, 1/ε1). Furthermore, T is said to encode all the values of ε that T1 encodes, and also the value ε1.

Thus, any term T may encode up to n values of ε, and the encoded values of ε increase as we go deeper into the term T. This ensures that all terms have representation size that is polynomial in n and q(n, 1/ε∗), where ε∗ is the smallest value of ε that T encodes. Note also that the total number of terms is finite. Let list(T) denote the list of all ε ∈ S that are encoded in T. Observe that the definition of a term implies that the smallest ε ∈ list(T) is encoded at the outermost level, and the values increase as we move to the interior. In particular, if ε1 is the smallest value in list(T), then T = (1 − ε1/(2B))T1 + (ε1/(2B))r_{ε1}[z] for some term T1 and string z, and it is the case that list(T1) = list(T) \ {ε1}. Denote by out(T) the smallest value of ε in list(T), and let next(T) be the term T1 such that T = (1 − out(T)/(2B))T1 + (out(T)/(2B))r_{out(T)}[z] for some z.

We consider all terms, except those of the form h ∈ H, to be valid representations. The representation class also contains additional representations, which are defined shortly. Consider the following three cases:

(a) The evolutionary process is at some representation T such that out(T) = ε, where ε is the true accuracy parameter required. Then (treating T1 as if it were in H) the proof of Theorem 4.3 applies directly. In particular, let T = r_ε[T1, z] = (1 − ε/(2B))T1 + (ε/(2B))r_ε[z]. Then Mut(T, ε) outputs either r_ε[T1, z0] or r_ε[T1, z1], with probabilities as defined in Section 4.2.1, if |z| < q(n, 1/ε). When |z| = q(n, 1/ε), additional representations of the form R(∗) (see the heading Restarting the Simulation in Section 4.2.1) may be defined that allow the evolutionary process, in at most K steps, to transition to a representation of the form r_ε[h′, σ], where h′ ∈ H ∪ {Z, T1}. Notice that, in the event that h′ = T1, r_ε[h′, σ] is a term that encodes only one value of ε. Also notice that, unless Lf,D(T1) ≤ ε, the evolutionary process will allow h′ = T1 only once (as a result of a partially incorrect simulation), except with a very small probability.

(b) The case out(T) < ε: Let T0 = T, and define Ti = next(Ti−1) for all i. Let k be the smallest index such that out(Tk) ≥ ε (it may be the case that Tk = h for some h ∈ H). Note that

T0 = (1 − ε1/(2B))((1 − ε2/(2B))(· · · (1 − εk−1/(2B))Tk + (εk−1/(2B))r_{εk−1}[zk−1] · · · ) + (ε2/(2B))r_{ε2}[z2]) + (ε1/(2B))r_{ε1}[z1].

Then, since ε1 < ε2 < · · · < εk−1 ≤ ε/2, and since every εi is a power of 2,

Lf,D(Tk) ≤ Lf,D(T0) + Σ_{i=1}^{k−1} (εi/(2B)) Lf,D(r_{εi}[zi])
        ≤ Lf,D(T0) + (1/2) Σ_{i=1}^{k−1} εi      (since Lf,D(r) ≤ B for all r)
        ≤ Lf,D(T0) + ε/2.

If out(Tk) = ε, let rb = Tk. Otherwise, let rb = (1 − ε/(2B))Tk + (ε/(2B))r_ε[σ]. Note that rb is always a term and that Lf,D(rb) ≤ Lf,D(T) + ε.

(c) When out(T) > ε: Let rc = (1 − ε/(2B))T + (ε/(2B))r_ε[σ]. Again, rc is a valid term, and it is easy to see that Lf,D(rc) ≤ Lf,D(T) + ε/2.

In cases (b) and (c) above, if a transition to rb or rc respectively can be made, the evolutionary process can then proceed as in case (a). Also note that transitioning to rb (respectively rc) does not compromise quasi-monotonicity. Nonetheless, rb (respectively rc) may be deleterious with respect to the starting representation T. However, we can use the by now familiar trick of adding new representations of the form R(∗) that allow the evolutionary mechanism to slide from T to rb (or rc). Note that, as discussed in Section 3.8.2, it is necessary to assume that the starting representation has size bounded by some polynomial in n and 1/ε.
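The nested structure of terms can be captured by a small recursive data type; the sketch below (our own illustration with hypothetical names; it ignores the r_{ε}[z] payloads) implements list(T) and out(T):

```python
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class Term:
    eps: Optional[float]       # eps encoded at this level; None for h in H
    inner: Optional["Term"]    # next(T); None when this level is in H

def term_list(T: Optional[Term]) -> List[float]:
    """list(T): encoded eps values, outermost (smallest) first."""
    vals = []
    while T is not None and T.eps is not None:
        vals.append(T.eps)
        T = T.inner
    return vals

def out_eps(T: Term) -> Optional[float]:
    """out(T): the smallest encoded eps, at the outermost level."""
    vals = term_list(T)
    return vals[0] if vals else None

# a term encoding eps = 1/8 wrapped around one encoding eps = 1/4:
h = Term(eps=None, inner=None)                  # a hypothesis in H
T = Term(eps=0.125, inner=Term(eps=0.25, inner=h))
assert term_list(T) == [0.125, 0.25] and out_eps(T) == 0.125
```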
4.4 Some Specific Evolutionary Algorithms

Finally, in this section we show that the two evolutionary mechanisms, for monotone conjunctions and homogeneous linear separators, described in Section 3.7, are robust to drift. We do this by showing that the mutators used in these evolutionary mechanisms are indeed strictly beneficial neighborhood mutators, and appeal directly to Theorem 4.1.

4.4.1 Monotone Conjunctions

Consider the evolutionary mechanism, described in Section 3.7.1, for evolving monotone conjunctions under the uniform distribution over the boolean cube X = {−1, 1}^n. Theorem 3.1 shows that for any representation r such that |lit(r)| ≤ log₂(3/ε), the mutator, Mut, is an (ε²/36, 1/(4n²)) strictly beneficial neighborhood mutator. Notice that if the starting representation r0 does not satisfy |lit(r0)| ≤ log₂(3/ε), then after at most n − log₂(3/ε) generations such a representation is reached. Also, after that stage, the evolutionary mechanism never produces a representation r′ for which |lit(r′)| > log₂(3/ε). Thus, appealing to Theorem 4.1, we may prove the following result:

Theorem 4.4. Let C denote the class of monotone conjunctions on the boolean cube X = {−1, 1}^n, and let U denote the uniform distribution over X. Then C is evolvable using the evolutionary mechanism EM = (R, Mut, τ, s, t) (as described in Section 3.7.1), using selection rule BN-Sel, with respect to distribution U, with drifting targets, for any drift rate ∆ ≤ ε²/864.

4.4.2 Homogeneous Linear Separators

Consider the evolutionary mechanism described in Section 3.7.2 for evolving homogeneous linear separators in R^n under radially symmetric distributions. The proof of Theorem 3.2 essentially shows that the mutator defined there is an (ε/(π³n), 1/(2n − 2)) strictly beneficial neighborhood mutator. Thus, appealing to Theorem 4.1, we can prove the following result:

Theorem 4.5. Let Hn be the class of homogeneous linear separators in R^n, and let D be any radially symmetric distribution over R^n. Then Hn is evolvable under drifting target (ideal) functions with respect to distribution D, using the evolutionary mechanism EM = (R, Mut, τ, s, t) (as described in Section 3.7.2), for any drift rate ∆ ≤ ε/(24π³n).
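The two drift rates above are instances of the single requirement ∆B ≤ b/24 from Theorem 4.1; the following check (our own arithmetic, assuming the boolean loss bound B = 1 used for these classes) recovers both constants:

```python
import math

def drift_rate(b: float, B: float = 1.0) -> float:
    """Tolerable drift for a (b, rho)-strictly beneficial neighborhood
    mutator, from the requirement Delta * B <= b / 24 in Theorem 4.1."""
    return b / (24 * B)

n, eps = 20, 0.1
conjunctions = drift_rate(eps ** 2 / 36)            # Theorem 4.4
separators = drift_rate(eps / (math.pi ** 3 * n))   # Theorem 4.5
assert abs(conjunctions - eps ** 2 / 864) < 1e-12
assert abs(separators - eps / (24 * math.pi ** 3 * n)) < 1e-12
```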
Chapter 5

The Role of Recombination

One of the most important aspects of the biological world not modeled explicitly by Valiant is the existence of two sexes and the process of recombination. Sexual reproduction is nearly universal in higher organisms and is thus thought to be an important factor in evolution. There are several proposed explanations for the role of sex and recombination in evolution. Some of these are discussed in Section 5.1, but the argument most relevant to our work is that sexual reproduction can accelerate evolution through parallelism.

Fisher [18] first proposed that sexual reproduction can speed up evolution (see also [30, 34]):

A consequence of sexual reproduction which seems to be of fundamental importance to evolutionary theory is that advantageous changes in different structural elements of the germ plasm can be taken advantage of independently; whereas with asexual organisms either genetic uniformity of the whole group must be such that evolutionary progress is greatly retarded, or if there is considerable genetic diversity, many beneficial changes will be lost through occurring in individuals destined to leave no ultimate descendants in the species. (from The Genetical Theory of Natural Selection, R. A. Fisher, 1930)

A simple explanation for this is the following: suppose that there are two allelic mutations, a → A and b → B, that are both favorable and also additive in their effect; thus, an individual having both the A and B alleles will be the fittest. However, beneficial mutations are extremely rare in nature. Under asexual reproduction, an individual would possess both alleles A and B only if a mutation, say b → B, occurs in an individual already possessing allele A. For this to occur with high probability, the A allele must already have spread through the population, and for a large fraction of the population to acquire both A and B, several generations may be required. Under sexual reproduction, even if the two mutations a → A and b → B occur in different members of the population, they are able to spread quickly through the population, and via recombination there will be a member with both mutations within far fewer generations. Roughly speaking, if there are n loci on which selection acts additively, asexual reproduction may require O(n) generations to produce the fittest variant, while sexual reproduction is able to achieve the same in only O(log(n)) generations.

Our main result shows that recombination allows for parallelism; this result is discussed in Section 5.4. We show that the reduction from learning algorithms to evolutionary mechanisms described in Chapter 3 can be modified so that, if there exists an efficient parallel algorithm for learning a concept class C, then there is an evolutionary mechanism that exploits the parallelism to achieve faster evolution in a model of evolution with recombination. It can be shown that these evolutionary mechanisms are provably faster than those possible without recombination in Valiant's original model; this is shown in Section 5.6.
5.1 Related Work

Several explanations have been proposed for the factors responsible for maintaining sex and recombination; however, there seems to be no single answer. We have discussed above Fisher's argument that sexual reproduction may accelerate evolution. Another advantage, proposed by Muller [30], is that recombination is useful as a means to weed out deleterious mutations. In a finite population, mildly deleterious mutations are likely to be fixed in the population one at a time merely by chance, and eventually a large number of these will accumulate. In the absence of recombination, a deleterious mutation cannot be removed except by a back-mutation, which is extremely rare. This effect is known as Muller's ratchet. Epistasis refers to the non-independence between loci with respect to their effect on fitness. Hitchhiking is the phenomenon whereby certain (possibly deleterious) alleles are maintained because they are coupled to other beneficial mutations. Epistasis, hitchhiking and other factors are thought to play a role in the maintenance of recombination. Livnat et al. [26, 27] have also suggested that sexual reproduction gives rise to mixability, or modularity. Maynard Smith [35, 36] and Barton and Charlesworth [3] survey several proposed explanations of sex and recombination.

5.2 Evolution Model with Recombination

In this work, the question of interest is whether evolution can be sped up in Valiant's computational model. Thus, Fisher's ideas are most relevant to this work, and indeed this work is inspired by them. In order to model the process of recombination, it is necessary to consider a finite population with variation, whereas in Valiant's model all members of the population were considered identical. The model considered here is inspired by those in population genetics, particularly the Wright-Fisher model [18, 41]. As in the Wright-Fisher model, we have discrete generations and a fixed population size in each generation. The members of each generation choose their parents randomly from the previous generation¹. However, unlike the models in population genetics, the individuals in our population are representations of functions (rather than abstract entities). The goal, as in Valiant's model, is to evolve a representation that predicts an unknown ideal function (almost) accurately, using feedback obtained through (natural) selection.

¹However, the term generation in our model does not necessarily refer to one biological generation. This is discussed further in Section 5.7.

We briefly recall some notions from Chapter 3. The input space X is thought of as modeling all possible environments that may be faced by an organism; typical settings of interest are X = {−1, 1}^n and X ⊆ R^n. The parameter n represents the size of instances x ∈ X. The ideal function is defined as a function f : X → Y, and indicates the best behavior in every possible environmental condition. The ideal function comes from some concept class C of functions from X → Y. Recall that an organism is treated as a string that represents a function from X → Y′, where Y ⊆ Y′. The requirement is that, given the string representation r and some input x ∈ X, there is a polynomial-time Turing machine that outputs the value r(x) ∈ Y′. Thus, by slight abuse of notation, the representation r is treated both as a string and as a function from X → Y′. Let R be a class of representations of functions X → Y′.

The performance of an organism is measured with respect to a loss function ℓ : Y′ × Y → R+ such that for every y ∈ Y, min_{y′∈Y′} ℓ(y′, y) = ℓ(y, y) = 0, and for every y′ ∈ Y′, y ∈ Y, ℓ(y′, y) ≤ B, i.e. ℓ is consistent and bounded. Suppose that D is the target distribution over the instance space X; then for any representation r : X → Y′, the expected loss is defined as Lf,D(r) = Ex∼D[ℓ(r(x), f(x))]. Alternatively, one may view the performance of r as ℓ-Perf_{f,D}(r) = 1 − 2Lf,D(r)/B.

5.2.1 Recombination

Recombination is a process that takes two genotypes and produces a new one combining the two. The exact process of recombination is not completely understood, and modeling such a process seems inherently problematic. Following Valiant's idea of defining mutations as the output of a randomized polynomial-time Turing machine, we define recombination in a similar fashion.
A recombinator is defined as a randomized polynomial-time Turing machine that takes as input two representations, r′ and r″, and the target loss parameter ε, and outputs a representation r. Formally:

Definition 5.1 (Recombinator). A recombinator for a representation class, R, is a randomized Turing machine, Rec, that takes as inputs r′, r″ ∈ R and ε, and outputs a representation r ∈ R ∪ {⊥}. The running time of Rec is bounded by a polynomial in |r′| + |r″| and 1/ε.

We think of Rec(r′, r″, ε) as a random variable (a random descendant, or recombinant, of r′ and r″) that takes values in R. If the recombinator outputs ⊥, this is interpreted as r′ and r″ being unable to leave a descendant in the next generation.
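As a concrete toy instance of Definition 5.1 (our own illustration; the thesis does not prescribe any particular recombinator), a uniform crossover on equal-length bit strings, with None standing in for ⊥:

```python
import random
from typing import Optional

def uniform_crossover(r1: str, r2: str, eps: float) -> Optional[str]:
    """A toy recombinator: each position of the descendant is inherited
    from a uniformly chosen parent; mismatched lengths yield ⊥ (None)."""
    if len(r1) != len(r2):
        return None
    return "".join(random.choice(pair) for pair in zip(r1, r2))

child = uniform_crossover("110100", "011110", eps=0.1)
```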
5.2.2 Selection Rules

We define selection rules similarly to those defined in Section 3.5. Let r′, r″ ∈ R be two representations and let ε be the accuracy parameter. Recall from Section 3.5 that Lf,D(·, ·) is an oracle that, when queried with (r, τ), returns an additive approximation of the expected loss of representation r, i.e. |Lf,D(r, τ) − Lf,D(r)| ≤ τ. Let t be the tolerance parameter and s ∈ N the size parameter; then we define the selection of one descendant of r′ and r″ as follows:

• Let Desc(r′, r″, ε) = {r1, . . . , rs} be the multi-set obtained by running Rec(r′, r″, ε) s times. This is the candidate pool of possible descendants (recombinants) of r′ and r″. Let vi = Lf,D(ri, τ), v′ = Lf,D(r′, τ) and v″ = Lf,D(r″, τ).

• We consider three possible selection rules, namely selection based on beneficial and neutral recombinants (BN-Sel), selection based on optimization (Opt-Sel), and weak selection (Weak-Sel), to select one descendant that will survive into the next generation.

1. Selection based on beneficial and neutral recombinants: Beneficial recombinants (descendants) are defined as those whose expected loss is noticeably lower than that of at least one of the two parents, r′ and r″. Neutral recombinants (descendants) are those whose expected loss is close to that of the worse of the two parents. Thus:

Bene = {ri ∈ Desc(r′, r″, ε) | vi ≤ max(v′, v″) − t}
Neut = {ri ∈ Desc(r′, r″, ε) | |vi − max(v′, v″)| ≤ t}

Selection proceeds as follows: if the multi-set Bene is not empty, a representation from Bene is chosen uniformly at random to be the descendant; if Bene is empty and Neut is non-empty, a representation from Neut is selected uniformly at random; if both Bene and Neut are empty, then ⊥ is selected, and this is understood as r′, r″ leaving no descendant in the next generation. Letting r denote the representation selected by such a process, we express this as:

r ← BN-Sel(r′, r″, ε, Rec, Lf,D, τ, s, t)

2. Selection based on optimization: This selection rule chooses one among the best possible candidate recombinants. Let v∗ = min_{1≤i≤s} vi. If v∗ ≥ max(v′, v″) + t, i.e. even the best descendant is noticeably worse than the worse of the two parents, then ⊥ is selected, and this is understood as r′ and r″ leaving no descendants in the next generation. Otherwise, define:

Opt = {ri ∈ Desc(r′, r″, ε) | |vi − v∗| ≤ t and vi ≤ max(v′, v″) + t}

Note that when v∗ ≤ max(v′, v″) + t, Opt is guaranteed to be non-empty. The selection rule chooses a random descendant, r, from Opt. We express this selection rule as: r ← Opt-Sel(r′, r″, ε, Rec, Lf,D, τ, s, t).

3. Weak selection: In this selection rule, a distinction is not made between beneficial and neutral recombinants (descendants). Any recombinant that is not noticeably worse than the worse of its two parents is considered feasible. Define the multi-set:

Feas = {ri ∈ Desc(r′, r″, ε) | vi ≤ max(v′, v″) + t}

Selection proceeds as follows: if Feas is not empty, a representation, r, from the multi-set Feas is selected uniformly at random; otherwise it is assumed that r′ and r″ are unable to leave a descendant in the next generation. We express this selection process as: r ← Weak-Sel(r′, r″, ε, Rec, Lf,D, τ, s, t).

In this chapter, we consider all three selection rules described above. The case for weak selection is particularly compelling since, as we discuss in Section 5.7, recombination is most likely to play a role in accelerating evolution when selection pressures are weak.
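A minimal sketch of BN-Sel for a pair of parents (our own illustration; approx_loss stands in for the oracle Lf,D(·, ·)):

```python
import random
from typing import Callable, Optional

def bn_sel(r1, r2, eps: float, rec: Callable, approx_loss: Callable,
           tau: float, s: int, t: float) -> Optional[object]:
    """BN-Sel on parents r1, r2: prefer beneficial recombinants, fall back
    to neutral ones, return None (⊥) if neither kind exists."""
    pool = [rec(r1, r2, eps) for _ in range(s)]       # Desc(r1, r2, eps)
    pool = [r for r in pool if r is not None]
    worse = max(approx_loss(r1, tau), approx_loss(r2, tau))
    bene = [r for r in pool if approx_loss(r, tau) <= worse - t]
    neut = [r for r in pool if abs(approx_loss(r, tau) - worse) <= t]
    if bene:
        return random.choice(bene)
    if neut:
        return random.choice(neut)
    return None
```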
5.2.3 Evolution with Recombination

An evolutionary mechanism (with recombination) may be defined as the 5-tuple EM = (R, Rec, τ, s, t). Here R is the representation class, Rec the recombinator, τ the approximation parameter (used in queries to the oracle Lf,D(·, ·) for estimates of the expected loss), s : R × R × R+ → N the size function (here the size function depends on two representations, r′ and r″, to define the size of the multi-set Desc(r′, r″, ε)), and similarly t : R × R × R+ → R+ the tolerance function. As defined in Section 3.6, we say that the evolutionary mechanism, EM, is polynomially bounded if the representation class R is polynomially bounded, the recombinator Rec is polynomially bounded, the approximation parameter τ is such that 1/τ is bounded by a polynomial in n and 1/ε, the size function s is polynomially bounded, and the tolerance function t is polynomially sandwiched. It is also required that the size function s and the tolerance function t are polynomially evaluable.

Valiant's original model did not require a diverse population; indeed, in an evolutionary process in the sense of Valiant [39], at each generation only a single genotype is preserved, and changes across generations are caused solely by mutation. For recombination to influence evolution in any way, variation in the population at each generation is a must. In this chapter, we assume the existence of a finite population, and if its size is bounded by a polynomial (in n and 1/ε), we consider the population size to be reasonable. Define a population to be a subset of the set of representations R.

We define an evolutionary step as taking a population P0 ⊆ R to a population P1 ⊆ R at the next generation, using some evolutionary mechanism, EM, and selection rule, Sel. This transition involves the action of the recombinator on existing representations (genotypes) in P0, followed by selection using rule Sel. The population P1 is produced as follows: each member of P1 is picked by choosing parents r′ and r″ uniformly at random from P0 (with replacement), followed by selection according to rule Sel, as described above, to possibly produce a descendant r. Note that it is possible for r′ and r″ to produce no viable descendant². This process is repeated until |P1| = |P0|. Formally:

Definition 5.2 (Evolutionary Step). An evolutionary step using evolutionary mechanism EM = (R, Rec, τ, s, t) and selection rule Sel (with respect to loss function ℓ, ideal function f, and target distribution D), beginning with a starting population P0 ⊆ R, produces a population P1 ⊆ R for the next generation as follows: Let P1 = ∅. While |P1| < |P0|:
1. Select uniformly at random (with replacement) parents r′, r″ ∈ P0.
2. Let r ← Sel(r′, r″, ε, Rec, Lf,D, τ, s, t).
3. If r ≠ ⊥, then P1 = P1 ∪ {r}.

²One can imagine imposing several kinds of restrictions on mating, such as being of the same sex or being geographically distant. We believe that using string representations is general enough that any such information can be effectively encoded in the recombinator Turing machine, forcing it to output ⊥ when the mating is infeasible.

Observe that, as long as the population size |P0| is polynomially bounded, an evolutionary step can be simulated in polynomial time (with high probability). Starting from some population P0 ⊆ R, we can consider a sequence of evolutionary steps resulting from evolutionary mechanism EM and selection rule Sel. Denote the resulting sequence of populations by P0, P1, P2, . . ., where Pi+1 results from Pi by an evolutionary step. We define the notion of evolvability as requiring that, for some (polynomially bounded) time step g, there exist a representation r ∈ Pg such that Lf,D(r) ≤ ε. Formally:

Definition 5.3 (Evolvability with Recombination). We say that a concept class, C, is evolvable with respect to a distribution D over X, using loss function ℓ and selection rule Sel, in g generations, if there exists a polynomially bounded evolutionary mechanism, EM = (R, Rec, τ, s, t), that for every ε > 0, starting from any (polynomially bounded) population P0 ⊆ R, and for every ideal function f ∈ C, with selection rule Sel and oracle Lf,D(·, ·), produces a sequence of populations P0, P1, . . . , Pg′ such that, with probability at least 1 − ε, for some g′ ≤ g, there exists a representation r∗ ∈ Pg′ such that Lf,D(r∗) ≤ ε.

We also consider the case where the evolutionary process succeeds only if it starts from a specific initial population P0. We refer to this as evolution with recombination with initialization.

Readers familiar with the Wright-Fisher model from population genetics will notice the similarity between that model and the one presented here. However, while population genetics is concerned mainly with allele frequency distributions, the focus of our work is on understanding the computational limits of evolution. A natural question that arises is: does augmenting Valiant's model of evolution with recombination increase the power of evolvability? In the sense of polynomially bounded evolution, the answer is negative. This follows from the reduction from learning algorithms to evolvability described in Section 3.8.1 (see also the original result due to Feldman [12]): a polynomially bounded evolutionary mechanism (with recombination) can be simulated by a polynomial-time algorithm that only makes queries to the oracle G≤f,D(·, ·, ·), and Theorem 3.5 shows that every such algorithm can be simulated by a polynomially bounded evolutionary mechanism in Valiant's model (without recombination). However, we show that evolvability with recombination may be exponentially faster for some concept classes under certain distributions; this follows via a reduction from parallel learning algorithms to evolution with recombination. The reduction described here requires initialization.
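A sketch of one evolutionary step in the sense of Definition 5.2 (our own illustration, reusing the bn_sel sketch above as the selection rule):

```python
import random

def evolutionary_step(P0: list, eps: float, rec, approx_loss,
                      tau: float, s: int, t: float) -> list:
    """One evolutionary step: draw parent pairs uniformly with replacement
    and keep the selected descendants until |P1| = |P0|. (As in the text,
    this terminates quickly only when viable descendants are common.)"""
    P1 = []
    while len(P1) < len(P0):
        r1, r2 = random.choice(P0), random.choice(P0)
        r = bn_sel(r1, r2, eps, rec, approx_loss, tau, s, t)
        if r is not None:          # None plays the role of ⊥
            P1.append(r)
    return P1
```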
In Section 5.3, we first formally define a model of parallel learning that is convenient for proving the main result. In Section 5.4, we present the reduction from parallel learning algorithms to evolution with recombination. In Section 5.5, we consider specific concept classes that can be evolved substantially faster with recombination. Finally, in Section 5.6, we show a lower bound establishing that evolution in Valiant's model (without recombination) may indeed be exponentially slower in certain cases. The algorithms presented here may appear somewhat unnatural; however, the goal is not to model exact physical processes, but to understand their computational limitations. This model can be understood as specifying the outer limits of the evolutionary process, as we do not expect nature to perform computations not captured by efficient Turing machines.

5.3 Model of Parallel Learning

Recall some notation defined in Section 3.8.1. Let C be the concept class, f ∈ C the ideal function, f : X → Y, and let D be a distribution over X. Let h : X → Y′ be a hypothesis, where Y ⊆ Y′. We consider a loss function ℓ : Y′ × Y → R+ that is consistent and bounded, i.e. for every y ∈ Y, min_{y′∈Y′} ℓ(y′, y) = ℓ(y, y) = 0, and for every y′ ∈ Y′, y ∈ Y, ℓ(y′, y) ≤ B. The expected loss of hypothesis h, with respect to ideal function f and distribution D, is Lf,D(h) = Ex∼D[ℓ(h(x), f(x))]. We have supposed that 0 ∈ Y′ and that the function Z : X → Y′, defined as Z(x) = 0 for every x ∈ X, is the zero function³. Recall the (translated loss) function Gf,D(h) = Lf,D(h) − Lf,D(Z), and the oracle G≤f,D(·, ·, ·) that, on query (φ, τ, θ), where φ : X → Y′, τ ∈ R+, and θ ∈ R with τ ≤ |θ| ≤ B, outputs a value as defined below:

G≤f,D(φ, τ, θ) =
  1        if Gf,D(φ) ≤ θ − τ
  0        if Gf,D(φ) ≥ θ + τ
  0 or 1   otherwise

³We may think of 0 as a randomized point in Y′, where effectively 0 is a random variable over Y′. For example, when Y = {−1, 1}, we think of 0 as the point that takes value −1 or +1 with equal probability. The function Z thus defined becomes a randomized function.

We considered algorithms that learn a concept class, C, with respect to a distribution, D, using only access to the G≤f,D(·, ·, ·) oracle. We now extend this to parallel learning algorithms that only access the G≤f,D(·, ·, ·) oracle. We consider a parallel algorithm that uses p (polynomially bounded) processors, and we assume that there is a common clock that defines parallel time-steps. During each parallel time-step a processor can do the following:

1. Make a query to the oracle G≤f,D(·, ·, ·).
2. Perform some polynomially bounded computation.
3. Write a message that can be read by every other processor at the end of the current time-step. Effectively, we view this as a message broadcast.

The oracle G≤f,D(·, ·, ·) responds to the queries made by all p processors in parallel. We are not concerned with the exact mechanism for message passing; for example, it could be performed using shared memory. Also, the main resource that we want to exploit through parallelism is oracle queries. Thus, we allow the processors to perform arbitrary (polynomially bounded) computation and to write arbitrarily long (polynomially bounded) messages during each parallel time-step.
Definition 5.4 (Parallel Learning with G≤f,D oracle). We say that a concept class, C, over instance space, X, is (τAlg, T)-parallel learnable by algorithm Alg, using the G≤f,D(·, ·, ·) oracle and p processors, under distribution D and with respect to loss function ℓ, if for every ε > 0 and target (ideal) function f ∈ C, in at most T parallel time-steps, Alg outputs a hypothesis h that satisfies Lf,D(h) ≤ ε. Every query made by Alg must be of the form (φ, τAlg, θ), where φ : X → Y′ is efficiently evaluable, 1/τAlg is bounded by a polynomial in n and 1/ε, and τAlg ≤ |θ| ≤ B. The final hypothesis h must be polynomially evaluable, and during each parallel time-step each of the p processors must complete its computation and message writing in polynomial time.

5.4 Reduction to Evolution with Recombination

In this section, we prove the result that fast parallel learning algorithms lead to faster (in terms of number of generations) evolution when recombination is allowed. The ideas used in the reduction are similar to those in Section 3.8.1. We first show the result using selection rule BN-Sel (and also Opt-Sel); the weak selection rule is considered separately in Section 5.4.1. We show the following result:

Theorem 5.1. Suppose a concept class, C, is (τAlg, T)-parallel learnable by algorithm Alg, using the G≤f,D(·, ·, ·) oracle and p processors, with respect to distribution D and loss function ℓ. Then C is evolvable with recombination by a polynomially bounded evolutionary mechanism, using selection rule BN-Sel, under distribution D and with respect to loss function ℓ, in T(log(p) + 2) + 1 generations, if evolution is started with an initialized population P0.

Remark 5.1. Although we prove this theorem for selection rule BN-Sel, it will be clear from the proof that the reduction is also valid for selection rule Opt-Sel. The situation for Weak-Sel is more involved and is discussed in greater detail in Section 5.4.1.

We first describe a high-level outline of the proof strategy. Let Alg be a (τAlg, T)-parallel G≤f,D(·, ·, ·) algorithm for learning concept class C. We define (as in Section 3.8.1) a representation class, R, that contains representations encoding the state of a simulation of Alg. We assume that at the end of each parallel step, all processors used by algorithm Alg have the same state information, denoted by some string z. This is without loss of generality, since the processors are allowed arbitrary polynomially bounded communication. However, each processor has a unique identity, i, that differentiates its actions from those of the other processors.

We start with a population, P0, in which every member is identical; we assume that at the beginning each member of the population is r0 ≡ Z. We show that the behavior of the algorithm Alg during each parallel step can be simulated using exactly log(p) + 2 evolutionary steps. We refer to the simulation of the k-th parallel step as phase k. The outline of the simulation is as follows:
1. At the start of phase k, the population is identical, with every member equal to some representation r^{k−1}. Each phase begins with a differentiation step, at the end of which each member of the population carries the identity of the processor that it will simulate; furthermore, there are roughly equal numbers of members simulating each processor. This is achieved by defining Rec(r^{k−1}, r^{k−1}, ε) to output r^{i,k−1} with probability 1/p for 1 ≤ i ≤ p. The representation r^{i,k−1} encodes exactly the same function as r^{k−1}, but also encodes the identity, i, of the processor it will simulate.

2. After differentiation, each individual simulates the processor whose identity is encoded in its representation. This step is similar to the reduction in Section 3.8.1, but is now carried out using a recombinator, Rec, rather than a mutator. The descendants that survive encode a valid response to the query that the processor makes during the k-th parallel time-step (given the state), and also encode the message that the processor broadcasts at the end of the k-th parallel time-step.

3. The communication (message passing) in each parallel step is simulated using log(p) evolutionary steps. At the end of these, the population once again becomes identical, with every member equal to some representation r^k, and phase k is over.

At the end of phase T, each representation r^T encodes the state of a valid simulation of algorithm Alg, and hence can produce a hypothesis h with low (at most ε) expected loss. Thus, in one additional evolutionary step, it is possible to have a representation in the population with expected loss at most ε.

We now describe the construction of the evolutionary mechanism in detail. To ease the reading of the proof, we describe the representations contained in the representation class, R, as they are required in the proof, and define the recombinator, Rec, on pairs of representations as we need them. If the recombinator is left undefined for a pair of representations, we assume that it always outputs ⊥, i.e. mating between that pair is considered impossible.

Recall that each individual (member) of the population has a (string) representation in R. Each representation is thought of as a randomized function (see Section 3.8.1 for more details): we express r ∈ R as r = Σ_{i=1}^m wi ψi, where each ψi : X → Y′ is a deterministic function, wi ≥ 0 and Σ_{i=1}^m wi ≤ 1. For any x ∈ X, r(x) = ψi(x) with probability wi, for 1 ≤ i ≤ m, and r(x) = Z(x) = 0 with probability 1 − Σ_{i=1}^m wi. In addition to encoding the randomized function, the representation r may also encode state information.

We describe the simulation of the k-th parallel step of the algorithm Alg, for some k. We postpone the exact description of the representation r^{k−1}; for now, assume it encodes the state of a valid simulation of Alg up to the end of the first k − 1 parallel steps. At the end of any parallel step, we may assume that the states of all processors used by the algorithm are identical, because of the messages broadcast. The recombinator Rec(r^{k−1}, r^{k−1}, ε) outputs r^{i,k−1} with probability 1/p for 1 ≤ i ≤ p. The representation r^{i,k−1} encodes exactly the same (randomized) function (from X → Y′) as r^{k−1}, but the string representation of r^{i,k−1} additionally encodes i, the identity of the i-th processor. Thus, if P^{k−1} is a population consisting only of the representation r^{k−1}, then after one evolutionary step we obtain a population P1^{k−1} such that roughly a 1/p fraction of P1^{k−1} is r^{i,k−1}, for every 1 ≤ i ≤ p.
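The differentiation step admits a one-line simulation (our own illustration): recombining two copies of the common representation just draws a processor identity uniformly, so the identities received by a population form a balls-into-bins experiment:

```python
import random
from collections import Counter

def differentiate(population_size: int, p: int) -> Counter:
    """One differentiation step: each member independently receives the
    identity of one of the p processors, uniformly at random."""
    return Counter(random.randrange(p) for _ in range(population_size))

groups = differentiate(population_size=10_000, p=8)
# each identity receives roughly a 1/p fraction of the population, with
# deviations controlled by the Chernoff-Hoeffding bound used below
```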
In the next evolutionary step, the individuals simulate the G≤f,D(·, ·, ·) query made by the processor whose identity is encoded in their representation. Let (φ^{i,k−1}_{z[k−1]}, τAlg, θ^{i,k−1}_{z[k−1]}) be the query made by the i-th processor during parallel time-step k, where z[k − 1] is the string that captures the state information of Alg up to the end of k − 1 parallel time-steps. Note that the query response will be either 0 or 1, and depending on the query response the processor may perform some computation and broadcast a message. Let σ^{i,k−1}_b be the message that the i-th processor broadcasts when the query response is b, for b ∈ {0, 1}. Define the representation:

r^{i,k−1}_{b,σb} = r^{i,k−1} + (I(b = 1)/(|θ^{i,k−1}_{z[k−1]}| · N)) · φ^{i,k−1}_{z[k−1]},

where N is a number that will make all the representations valid (i.e. make the weights sum to at most 1) and will be defined later, and I(·) is the indicator function, which is 1 if its argument is true and 0 otherwise. Thus, the representation r^{i,k−1}_{0,σ0} is functionally identical to r^{i,k−1} and r^{k−1}, but encodes additional information, namely the candidate query response, 0, and the message σ^{i,k−1}_0. The representation r^{i,k−1}_{1,σ1}, in addition to encoding the candidate query response, 1, and the message σ^{i,k−1}_1, also encodes an additional function, φ^{i,k−1}_{z[k−1]}, with weight 1/(|θ^{i,k−1}_{z[k−1]}| · N).

We now define the recombinator, Rec, on the pair (r^{i,k−1}, r^{i,k−1}) as follows. Let (φ^{i,k−1}_{z[k−1]}, τAlg, θ^{i,k−1}_{z[k−1]}) be the query made by processor i during the k-th parallel step. Then we have the following (the parameter η serves to bound probabilities appropriately and will be determined later):

• If θ^{i,k−1}_{z[k−1]} > 0, then with probability 1 − η, Rec(r^{i,k−1}, r^{i,k−1}, ε) = r^{i,k−1}_{1,σ1}, and with probability η, Rec(r^{i,k−1}, r^{i,k−1}, ε) = r^{i,k−1}_{0,σ0}.
• If θ^{i,k−1}_{z[k−1]} < 0, then with probability 1 − η, Rec(r^{i,k−1}, r^{i,k−1}, ε) = r^{i,k−1}_{0,σ0}, and with probability η, Rec(r^{i,k−1}, r^{i,k−1}, ε) = r^{i,k−1}_{1,σ1}.

We show in Claim 5.1 that, with high probability, the chosen descendant encodes a valid answer (and hence also the corresponding broadcast message) to the query (φ^{i,k−1}_{z[k−1]}, τAlg, θ^{i,k−1}_{z[k−1]}). Also, in the case that i ≠ i′, we define the recombinator so that Rec(r^{i,k−1}, r^{i′,k−1}, ε) = ⊥ with probability 1. We will make sure that N and 1/η are polynomially bounded (they are defined in the proof of Theorem 5.1). We then define the approximation parameter τ = τAlg/(2BN), the size function s(r′, r″, ε) = (1/η) log(1/η), and the tolerance function t(r′, r″, ε) = 1/N (where the size function and tolerance function are defined for any pair of representations of the form r^{k−1} or r^{i,k−1}). The evolutionary mechanism EM = (R, Rec, τ, s, t) evolves C with recombination, using selection rule BN-Sel.

In the proofs in this section, we use the following version of the Chernoff-Hoeffding bound: if X1, . . . , Xm are independent and identically distributed random variables with mean µ, such that Xi ∈ [0, 1], then

Pr[ ¬( µ/(1 + δ) ≤ (1/m) Σ_{i=1}^m Xi ≤ µ(1 + δ) ) ] ≤ 2e^{−δ²µm/12}.

We now prove the following claim:

Claim 5.1. Suppose that we begin with a population P^{k−1} in which each member is identical, equal to r^{k−1}. Let α = 1/p. Then for every 0 ≤ δ ≤ 2^{1/4} − 1, after 2 evolutionary steps we obtain a population in which the following hold with probability at least 1 − 4pe^{−δ²α|P|/24} − 2η|P^{k−1}|:

1. Each member of the population is of the form r^{i,k−1}_{b,σb}, where b is a valid query response to the query (φ^{i,k−1}_{z[k−1]}, τAlg, θ^{i,k−1}_{z[k−1]}) made by processor i during the k-th parallel step, and σ^{i,k−1}_b is the message that processor i would broadcast.

2. If fi is the fraction of the population of type r^{i,k−1}_{b,σb}, then α/(1 + δ)^5 ≤ fi ≤ (1 + δ)^5 α.
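Before the proof, the arithmetic behind the deleterious case is worth isolating; the sketch below (our own check of the inequality used in the proof) verifies that attaching φ with weight 1/(|θ|N) shifts the loss past the selection threshold t + 2τ whenever 0 is the only valid answer and θ > 0:

```python
def deleterious_check(theta: float, tau_alg: float, N: int, B: float):
    """If G(phi) >= theta + tau_alg with theta > 0, the loss shift of the
    '1'-recombinant is G(phi)/(|theta|*N) >= 1/N + tau_alg/(B*N), which
    meets the threshold t + 2*tau with t = 1/N, tau = tau_alg/(2*B*N)."""
    assert 0 < theta <= B
    t, tau = 1 / N, tau_alg / (2 * B * N)
    shift = (theta + tau_alg) / (abs(theta) * N)   # worst-case lower bound
    return shift, t + 2 * tau, shift >= t + 2 * tau

print(deleterious_check(theta=0.3, tau_alg=0.05, N=100, B=1.0))
```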
1. Each member of the population is of the form $r^{i,k-1}_{b,\sigma_b}$, where $b$ is a valid query response to the query $(\phi^{i,k-1}_{z[k-1]}, \tau_{Alg}, \theta^{i,k-1}_{z[k-1]})$ made by processor $i$ during the $k$-th parallel step, and $\sigma^{i,k-1}_b$ is the message that processor $i$ would broadcast.

2. If $f_i$ is the fraction of the population of type $r^{i,k-1}_{b,\sigma_b}$, then

$$\frac{\alpha}{(1+\delta)^5} \le f_i \le (1+\delta)^5\,\alpha.$$

Proof. Let $P^{k-1}$ be the starting population; each member of $P^{k-1}$ is $r_{k-1}$. Note that the recombinator $\mathrm{Rec}(r_{k-1}, r_{k-1}, \epsilon)$ outputs a representation $r_{i,k-1}$ with probability $1/p$ for every $1 \le i \le p$. Thus, when the population for the next generation (after one evolutionary step) is picked, the probability that any fixed individual has representation $r_{i,k-1}$ is exactly $1/p$, for any $i$. Thus, after one generation, the population will be differentiated, and the expected number of individuals $r_{i,k-1}$ is $\alpha|P| = |P|/p$. Thus, except with probability $2e^{-\delta^2\alpha|P|/12}$, the fraction of $r_{i,k-1}$ in the population, say $\tilde f_i$, satisfies $\alpha/(1+\delta) \le \tilde f_i \le (1+\delta)\alpha$. Taking a union bound, this holds for all $i$ with probability at least $1 - 2pe^{-\delta^2\alpha|P|/12}$.

Thus, after one evolutionary step, each member of the population is of the form $r_{i,k-1}$; assume that $\tilde f_i$ satisfies $\alpha/(1+\delta) \le \tilde f_i \le \alpha(1+\delta)$ (allowing the evolutionary process to fail with probability $2pe^{-\delta^2\alpha|P|/12}$). Next, we show the following: suppose $\theta^{i,k-1}_{z[k-1]} > 0$; then for tolerance function $t = 1/N$ and approximation parameter $\tau = \tau_{Alg}/(2N)$, if 0 is the only valid answer to the query $(\phi^{i,k-1}_{z[k-1]}, \tau_{Alg}, \theta^{i,k-1}_{z[k-1]})$, then $r^{i,k-1}_{1,\sigma_1}$ is a deleterious recombinant. This follows using exactly the same argument as in the proof of Theorem 3.5. If 0 is the only valid answer to the query, it must be the case that $G_{f,D}(\phi^{i,k-1}_{z[k-1]}) \ge \theta^{i,k-1}_{z[k-1]} + \tau_{Alg}$. Then,

$$L_{f,D}(r^{i,k-1}_{1,\sigma_1}) - L_{f,D}(r_{i,k-1}) = G_{f,D}(r^{i,k-1}_{1,\sigma_1}) - G_{f,D}(r_{i,k-1}) = \frac{G_{f,D}(\phi^{i,k-1}_{z[k-1]})}{|\theta^{i,k-1}_{z[k-1]}|\,N} \ge \frac{1}{N} + \frac{\tau_{Alg}}{N} \ge t + 2\tau.$$

Thus, $r^{i,k-1}_{1,\sigma_1}$ is a deleterious recombinant.

The following two statements are true:

(i) If 0 is not the only valid answer, i.e., 1 is also a valid answer to the query $(\phi^{i,k-1}_{z[k-1]}, \tau_{Alg}, \theta^{i,k-1}_{z[k-1]})$, then a descendant chosen as $r \leftarrow \text{BN-Sel}(r_{i,k-1}, r_{i,k-1}, \mathrm{Rec}, L_{f,D}, \tau, s, t)$ will satisfy $r = r^{i,k-1}_{1,\sigma_1}$ with probability at least $1-\eta$.

(ii) Except with probability $\eta$, when the size function satisfies $s \ge (1/\eta)\log(1/\eta)$, the candidate pool $\mathrm{Desc}(r_{i,k-1}, r_{i,k-1}, \epsilon)$ will contain at least one copy of the representation $r^{i,k-1}_{0,\sigma_0}$. Thus, when 0 is the only valid answer to the query $(\phi^{i,k-1}_{z[k-1]}, \tau_{Alg}, \theta^{i,k-1}_{z[k-1]})$, $r^{i,k-1}_{0,\sigma_0}$ will be the chosen descendant, and not $\bot$.

The case when $\theta^{i,k-1}_{z[k-1]} < 0$ is (almost exactly) symmetric: in that case, we show that when 1 is the only valid answer to the query $(\phi^{i,k-1}_{z[k-1]}, \tau_{Alg}, \theta^{i,k-1}_{z[k-1]})$, the representation $r^{i,k-1}_{1,\sigma_1}$ is beneficial.

Thus, the following is true: except with probability $2|P^{k-1}|\eta$, after two generations each member of the population is of the form $r^{i,k-1}_{b,\sigma_b}$, where $b$ is a valid answer to the query $(\phi^{i,k-1}_{z[k-1]}, \tau_{Alg}, \theta^{i,k-1}_{z[k-1]})$.

The statement about the fraction of representations of each type now follows immediately from the Chernoff-Hoeffding bound, since the expected number of representations of the form $r^{i,k-1}_{b,\sigma_b}$ (given $\tilde f_{i'}$ for all $i'$) is $(\tilde f_i^2 / \sum_{i'} \tilde f_{i'}^2)|P|$. Note that $\alpha/(1+\delta)^4 \le \tilde f_i^2/\sum_{i'}\tilde f_{i'}^2 \le \alpha(1+\delta)^4$.
Since $\tilde f_i^2/\sum_{i'}\tilde f_{i'}^2 \ge \alpha/2$ for the values of $\delta$ in the statement of the claim, except with probability $2pe^{-\delta^2\alpha|P|/24}$, for all $i$ it holds that $\alpha/(1+\delta)^5 \le f_i \le \alpha(1+\delta)^5$. The assertion of the claim follows by taking a union bound over the failure events in the two evolutionary steps.

Thus, at the end of 2 evolutionary steps in phase $k$, we reach a population $P_1^{k-1}$ such that every member of the population is of the form $r^{i,k-1}_{b,\sigma_b}$, where $b$ represents a valid response to the query $(\phi^{i,k-1}_{z[k-1]}, \tau_{Alg}, \theta^{i,k-1}_{z[k-1]})$ made by the $i$-th processor during the $k$-th parallel time-step. We now define the action of the recombinator on these new representations. We define a few more intermediate representations:

• Let $\rho^{i,k-1}_0 = \{r^{i,k-1}_{0,\sigma_0}, r^{i,k-1}_{1,\sigma_1}\}$ be the set of representations of the type $r^{i,k-1}_{b,\sigma_b}$. For notational convenience, we will just say representation $\rho^{i,k-1}_0$ when, in fact, we mean some representation from the set $\rho^{i,k-1}_0$. (This is because the action of the recombinator on any representation from this set is identical.)

• Assume for simplicity that $p$ is a power of 2. For $j = 1, \dots, \log(p)$ and for $1 \le i \le p/2^j$, define the set $\rho^{i,k-1}_j = \rho^{2i-1,k-1}_{j-1} + \rho^{2i,k-1}_{j-1} - r_{k-1}$.

Note: we have abused notation considerably in the above expression to keep the number of symbols manageable. It is possible to show (by induction) that the set $\rho^{i,k-1}_j$ has $2^{j+1}$ elements. This is immediately true for the case $j = 0$, where the set has the two elements $r^{i,k-1}_{0,\sigma_0}$ and $r^{i,k-1}_{1,\sigma_1}$.

To see this when $j > 0$, observe the following. Each representation from the set $\rho^{2i-1,k-1}_{j-1}$ is of the form $r_{k-1} + \psi^{2i-1,k-1}_{j-1}$, where $r_{k-1}$ is the representation that encodes the state of the simulation of the algorithm, Alg, for the first $k-1$ parallel time-steps. The part $\psi^{2i-1,k-1}_{j-1}$ encodes the query responses obtained by the processors $((2i-1)-1)2^{j-1}+1, ((2i-1)-1)2^{j-1}+2, \dots, ((2i-1)-1)2^{j-1}+2^{j-1}$ during the $k$-th parallel time-step. Similarly, a representation in the set $\rho^{2i,k-1}_{j-1}$ is of the form $r_{k-1} + \psi^{2i,k-1}_{j-1}$, where $\psi^{2i,k-1}_{j-1}$ encodes the query responses obtained by the processors $(2i-1)2^{j-1}+1, \dots, (2i-1)2^{j-1}+2^{j-1}$ during the $k$-th parallel time-step. Thus, the resulting representation, $r_{k-1} + \psi^{2i-1,k-1}_{j-1} + \psi^{2i,k-1}_{j-1}$, is in the set $\rho^{i,k-1}_j$ and encodes the query responses obtained by the processors $(i-1)2^j+1, (i-1)2^j+2, \dots, (i-1)2^j+2^j$ during the $k$-th parallel time-step.

We now define the action of the recombinator on these representations. We overload notation and use representation $\rho^{i,k-1}_j$ to mean some representation from the set $\rho^{i,k-1}_j$.

1. For representations $\rho^{2i-1,k-1}_{j-1}$ and $\rho^{2i,k-1}_{j-1}$, for $j = 1, \dots, \log(p)$, $\mathrm{Rec}(\rho^{2i-1,k-1}_{j-1}, \rho^{2i,k-1}_{j-1}, \epsilon)$ outputs $\rho^{i,k-1}_j$ with probability 1. We assume that the representation $\rho^{i,k-1}_j$ also encodes all messages (broadcast by the various processors) that were contained in $\rho^{2i-1,k-1}_{j-1}$ and $\rho^{2i,k-1}_{j-1}$. (Thus, $\rho^{i,k-1}_j$ encodes the complete state information (query responses and message broadcasts) of processors $(i-1)2^j+1, \dots, (i-1)2^j+2^j$ in the $k$-th parallel time-step. Note that, in particular, $\rho^{1,k-1}_{\log(p)}$ includes the state information of all $p$ processors; we denote this representation by $r_k$, the representation at the end of $k$ parallel time-steps.)

2. For any other pair of representations $r', r''$, $\mathrm{Rec}(r', r'', \epsilon) = \bot$ with probability 1.
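Before analyzing the probabilities, it may help to see the tree-like merging in code. The following Python sketch (with illustrative names; it tracks only which processors' responses a representation has accumulated, not the weights) shows how pairwise recombination halves the number of distinct representation types each round, so that after $\log(p)$ rounds a single type encoding all $p$ responses remains.

```python
# A minimal sketch of the tree-like merge: in round j, representations
# holding the responses of two adjacent processor blocks are recombined,
# so after log2(p) rounds one block covers all p processors.
# Names (blocks, merged) are illustrative, not the thesis's notation.

p = 8  # number of processors; assumed to be a power of 2

# Initially each "representation" holds the query response of one processor.
blocks = [{i: f"response_{i}"} for i in range(p)]

round_no = 0
while len(blocks) > 1:
    round_no += 1
    merged = []
    for i in range(0, len(blocks), 2):
        child = dict(blocks[i])
        child.update(blocks[i + 1])  # Rec: combine the state of both parents
        merged.append(child)
    blocks = merged
    print(f"after round {round_no}: {len(blocks)} representation type(s)")

# After log2(p) = 3 rounds, the single surviving representation type encodes
# the responses (and broadcast messages) of all p processors, i.e. r_k.
assert len(blocks) == 1 and len(blocks[0]) == p
```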
Define the tolerance function $t(r', r'', \epsilon) = (p+2)/N + \tau_{Alg}/(BN)$, the approximation parameter $\tau = \tau_{Alg}/(2N)$, and the size function $s(r', r'', \epsilon) = (1/\eta)\log(1/\eta)$, where $r'$ and $r''$ are representations of the form $\rho^{i,k-1}_j$ for some $i, j, k$. We now prove the following claim:

Claim 5.2. Suppose we start with a population that satisfies the assertion of Claim 5.1, and let $\alpha = 1/p$. Then for every $0 \le \delta \le \ln(2)/(5p^3)$, after $\log(p)$ evolutionary steps, with probability at least $1 - (2p(\log(p)+2)e^{-\delta^2\alpha|P|/24} + 2|P|\eta)$, the population becomes identical, represented by $r_k$.

Proof. The main idea is that we combine the state information of the processors two at a time, in a tree-like manner; thus, after $\log(p)$ steps, there are individuals that encode the state of the simulation of all $p$ processors (and hence of the parallel algorithm, Alg, itself). For simplicity, we assume that $p$ is a power of 2, so that $\log(p)$ is an integer.

We claim that after $j$ evolutionary steps, with probability at least $1 - (2(j+2)pe^{-\delta^2\alpha|P|/12} + 2|P|\eta)$, the following hold:

1. Each member of the population is of the form $\rho^{i,k-1}_j$, for $i = 1, \dots, 2^{\log(p)-j}$.

2. If $f_i^j$ is the fraction of the population of the form $\rho^{i,k-1}_j$, then

$$\frac{2^j\alpha}{(1+\delta)^{a_j}} \le f_i^j \le 2^j\alpha(1+\delta)^{a_j},$$

where $a_j$ is the sequence defined by the recurrence $a_0 = 5$, $a_t = 4a_{t-1}+1$.

We show this assertion by induction. It clearly holds when $j = 0$ (because the statement is then just the assertion of Claim 5.1). Suppose the statement holds for $j$; consider the corresponding statement for $j+1$. Consider an individual $\rho^{i,k-1}_{j+1}$ that is a possible descendant of $\rho^{2i-1,k-1}_j$ and $\rho^{2i,k-1}_j$. The number of $(\rho^{2i-1,k-1}_j, \rho^{2i,k-1}_j)$ pairs is $f^j_{2i-1}f^j_{2i}|P|^2$, and we know that

$$\frac{(2^j\alpha|P|)^2}{(1+\delta)^{2a_j}} \le f^j_{2i-1}f^j_{2i}|P|^2 \le (2^j\alpha|P|)^2(1+\delta)^{2a_j}.$$

On the other hand, the total number of feasible pairs that may produce any descendants is $\sum_{l=1}^{2^{\log(p)-(j+1)}} f^j_{2l-1}f^j_{2l}|P|^2$, which satisfies

$$\frac{p}{2^{j+1}}\cdot\frac{(2^j\alpha|P|)^2}{(1+\delta)^{2a_j}} \le \sum_{l=1}^{2^{\log(p)-(j+1)}} f^j_{2l-1}f^j_{2l}|P|^2 \le \frac{p}{2^{j+1}}\,(2^j\alpha|P|)^2(1+\delta)^{2a_j}.$$

Thus, the probability, for an individual in the $(j+1)$-th generation, that its parents from the $j$-th generation are $\rho^{2i-1,k-1}_j$ and $\rho^{2i,k-1}_j$ is at least $2^{j+1}\alpha/(1+\delta)^{4a_j}$ and at most $2^{j+1}\alpha(1+\delta)^{4a_j}$. Thus, by the Chernoff-Hoeffding bound, except with a further probability loss of $2pe^{-\delta^2 f_i^{j+1}|P|/12}$, we have that $2^{j+1}\alpha/(1+\delta)^{4a_j+1} \le f_i^{j+1} \le 2^{j+1}\alpha(1+\delta)^{4a_j+1}$.

Notice that $a_t \le 5^{t+1}$ for all $t$, and hence, for the values of $\delta$ in the statement of the claim, the values $f_i^j$ for all $i, j$ are at least $\alpha/2$. Thus, the statement holds by induction. Moreover, after $\log(p)$ steps the population is homogeneous and every individual is the representation $\rho^{1,k-1}_{\log(p)} \equiv r_k$.

The only part remaining to be proved is that $\rho^{i,k-1}_{j+1}$ is a feasible descendant of $\rho^{2i-1,k-1}_j$ and $\rho^{2i,k-1}_j$. From the proof of Claim 5.1, it is clear that $L_{f,D}(r^{i,k-1}_{b,\sigma_b}) - L_{f,D}(r_{k-1}) \le 1/N + 2\tau$ (if $r^{i,k-1}_{b,\sigma_b}$ is selected into the next generation, it must be beneficial or neutral, and $t(r_{i,k-1}, r_{i,k-1}, \epsilon) = 1/N$). Thus, for any representation of the form $\rho^{i,k-1}_j$, at most $2^j$ functions corresponding to some query may be included, each with weight $1/N$. Thus, when the tolerance function is $t = ((p+2)/N) + \tau_{Alg}/(BN)$, it must be the case that $\rho^{i,k-1}_{j+1}$ is neutral (with respect to $\rho^{2i-1,k-1}_j$ and $\rho^{2i,k-1}_j$), i.e., it is a feasible descendant, for any value of $j = 1, \dots, \log(p)$.

We can now prove Theorem 5.1.
Proof of Theorem 5.1. We start with a population where each individual has the representation $r_0 \equiv Z$. Each parallel step of the algorithm, Alg, is simulated by $\log(p)+2$ evolutionary steps. Thus, a $(\tau_{Alg}, T)$-parallel $G^{\le}_{f,D}(\cdot,\cdot,\cdot)$ algorithm can be simulated by an evolutionary mechanism in $T(\log(p)+2)$ evolutionary steps (generations). At the end of this simulation, the population is again identical and encodes the representation $r_T$, which contains the final state of the algorithm, Alg. Let $h_T$ be the hypothesis that algorithm Alg would output; then we define the recombination operator on $r_T$ and $r_T$ as follows: with probability $\eta$, $\mathrm{Rec}(r_T, r_T, \epsilon)$ outputs $r_T$, and with probability $1-\eta$, $\mathrm{Rec}(r_T, r_T, \epsilon)$ outputs $h_T$. Since the correctness of the simulation guarantees that $L_{f,D}(h_T) \le \epsilon$, the only reason why $r_T$ may be selected is that it actually has lower expected loss than $h_T$ (except with a very low probability).

Finally, define $N = pT/\tau_{Alg}$; this makes sure that no (randomized) representation has the weights of its constituent deterministic functions adding up to more than 1. (This is because the total number of queries is at most $pT$ for $p$ processors and $T$ time-steps, and the value $\theta$ for any query must satisfy $|\theta| \ge \tau_{Alg}$.) The total failure probability of the algorithm is $2(p(\log(p)+1))e^{-\delta^2\alpha|P|/24} + (2|P|+1)\eta$. Thus, when $\delta = 1/(10p^3)$ (the value used in Claims 5.1 and 5.2), $|P| = 9600p^7\log(p)/\epsilon$, and $\eta = \epsilon/(4|P|+2)$, except with probability $\epsilon$, evolution (with recombination) succeeds in producing a representation that has expected loss at most $\epsilon$.

Remark 5.2. The reader can verify that the above proof is equally valid when the selection rule used is Opt-Sel.

We state a simple corollary of Theorem 5.1, which shows that, in fact, a parallel algorithm that only uses the $G_{f,D}(\cdot,\cdot)$ oracle can be simulated by evolution with recombination. This follows directly from Lemma 3.1, which shows that any algorithm that takes $T$ parallel time-steps with access to a $G_{f,D}(\cdot,\cdot)$ oracle can be modified to take $T\cdot O(\log(B/\tau_{pa}))$ parallel time-steps and make queries to the $G^{\le}_{f,D}(\cdot,\cdot,\cdot)$ oracle (where $\tau_{pa}$ is the smallest value of $\tau$ used by the parallel algorithm in any query to the $G_{f,D}(\cdot,\cdot)$ oracle). We also indicate the consequences for the special cases of CSQ and SQ learning. Note that an efficient parallel algorithm in the CSQ or SQ framework uses polynomially many (in $n$ and $1/\epsilon$) processors and only uses values of $\tau$ in queries such that $1/\tau$ is polynomially (in $n$ and $1/\epsilon$) bounded.

Corollary 5.1. If a concept class, $C$, is learnable by an efficient parallel algorithm that only makes queries to a $G_{f,D}(\cdot,\cdot)$ oracle (and for each query $\tau \ge \tau_{pa}$), under distribution $D$ and with respect to loss function $\ell$, in $T$ parallel time-steps and using $p$ processors, then $C$ is evolvable with recombination, with respect to loss function $\ell$, selection rule BN-Sel (or Opt-Sel), under distribution $D$, in $T\log(B/\tau_{pa})(\log(p)+2)+1$ generations (evolutionary time-steps). In particular:

1. When $C$ is learnable by an efficient parallel CSQ algorithm (with access to a CSQ-O oracle) in $T$ time-steps, then $C$ is evolvable with recombination, when the loss function is the classification error, $\ell_c$, in $O(T(\log(n/\epsilon))^2)$ generations.
2. When $C$ is learnable by an efficient parallel SQ algorithm (with access to the STAT oracle) in $T$ time-steps, then $C$ is evolvable with recombination, with respect to the class of loss functions defined in Section 3.8.1 (under the heading Boolean Ideal Function and Real-Valued Representations), in $O(T(\log(n/\epsilon))^2)$ generations.

5.4.1 Recombination and Weak Selection

In this section, we show how the above reduction may be modified when considering evolution with weak selection, i.e., using the selection rule Weak-Sel. The reduction does not go through directly, because we can no longer rely on selection distinguishing between beneficial and neutral descendants in the proof of Claim 5.1. This was required when $\theta^{i,k-1}_{z[k-1]} < 0$, to ensure that, when 1 was the only valid answer to the query $(\phi^{i,k-1}_{z[k-1]}, \tau_{Alg}, \theta^{i,k-1}_{z[k-1]})$, $r^{i,k-1}_{1,\sigma_1}$ would be selected, even though the (neutral) descendant $r^{i,k-1}_{0,\sigma_0}$ was produced with overwhelmingly higher probability by the recombinator, Rec.

We can only prove an equivalent version of Theorem 5.1 when considering ideal functions that are boolean, representations that are (randomized) boolean functions, and the classification error, $\ell_c$, as the loss function. Thus, effectively, we can show that parallel CSQ algorithms can be simulated by evolution with recombination with respect to the classification error as the loss function. Note that in this case $Y = Y' = \{-1,1\}$. It is conceptually easier to allow $Y' = [-1,1]$ and use the $\ell_1$ loss, i.e., $\ell_1(y', y) = (1/2)|y'-y|$. Suppose $\phi : X \to \{-1,1\}$ is a randomized boolean function, and let $\phi_1 : X \to [-1,1]$ be the function where $\phi_1(x) = E[\phi(x) \mid x]$, the expectation being taken only over the internal randomness of $\phi$. Note that for any ideal function, $f$, and distribution, $D$,

$$E_{x\sim D}[\ell_c(\phi(x), f(x))] = E_{x\sim D}[\ell_1(\phi_1(x), f(x))] = \frac{1 - E_{x\sim D}[\phi(x)f(x)]}{2} = \frac{1 - E_{x\sim D}[\phi_1(x)f(x)]}{2}.$$

Thus, we abuse notation and denote both the randomized function, $\phi$, and the corresponding $\phi_1$ (with range $[-1,1]$) just by the letter $\phi$. Also, the above argument shows that access to a loss oracle with respect to the $\ell_1$ loss is equivalent to the CSQ oracle, CSQ-O.

The reduction requires a new kind of oracle, $\mathrm{ZCSQ}(\cdot,\cdot)$; this is essentially an oracle that gives a (noisy) response to the question of whether $E_{x\sim D}[\phi(x)f(x)] \le 0$. On the query $(\phi, \tau)$, where $\phi : X \to [-1,1]$, the $\mathrm{ZCSQ}(\cdot,\cdot)$ oracle responds as follows:

$$\mathrm{ZCSQ}(\phi, \tau) = \begin{cases} 1 & \text{if } E_{x\sim D}[\phi(x)f(x)] \le -\tau \\ 0 & \text{if } E_{x\sim D}[\phi(x)f(x)] \ge \tau \\ \text{1 or 0} & \text{otherwise} \end{cases}$$

Feldman implicitly showed that learning with ZCSQ queries is equivalent to CSQ learning [14]. It follows from his proof that a parallel CSQ algorithm that uses $T$ parallel time-steps can be modified into a parallel ZCSQ algorithm that takes $T\log(1/\tau)$ parallel time-steps, where $\tau$ is the smallest value of the approximation parameter used in queries to the oracle CSQ-O. A proof is provided in Appendix A.

In this section, we assume that $\phi : X \to [-1,1]$, that the loss function is the $\ell_1$ loss, and that $L_{f,D}(\phi) = E_{x\sim D}[\ell_1(\phi(x), f(x))]$ denotes the expected loss with respect to the $\ell_1$ loss function. Define the zero function, $Z : X \to [-1,1]$, as $Z(x) = 0$ for all $x$. Then $L_{f,D}(Z) = 1/2$ for every ideal function, $f$, and target distribution, $D$. Recall that $G_{f,D}(\phi) = L_{f,D}(\phi) - L_{f,D}(Z) = -(1/2)E_{x\sim D}[\phi(x)f(x)]$.
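To make the oracle's semantics concrete, here is a minimal Python sketch of a ZCSQ oracle (an empirical stand-in: we assume sampling access to $D$ to estimate the correlation, which the abstract oracle does not require; all names are ours).

```python
import random

def zcsq(phi, f, sample_x, tau, m=4000):
    """Sketch of the ZCSQ oracle: answers whether E[phi(x) f(x)] <= 0.

    Returns 1 if the correlation is at most -tau, and 0 if it is at
    least tau; when the correlation lies in (-tau, tau), either answer
    is valid. The expectation is estimated here from m samples, so the
    behaviour in the middle band models the oracle's allowed noise.
    """
    corr = sum(phi(x) * f(x) for x in (sample_x() for _ in range(m))) / m
    if corr <= -tau:
        return 1
    if corr >= tau:
        return 0
    return random.choice([0, 1])  # either response is a valid one

# Toy usage: f is a dictator function and phi agrees with it, so the
# correlation is 1 and the oracle answers 0.
n = 10
sample_x = lambda: tuple(random.choice([-1, 1]) for _ in range(n))
f = lambda x: x[0]
phi = lambda x: x[0]
print(zcsq(phi, f, sample_x, tau=0.1))  # expected output: 0
```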
We now suppose that Alg is a parallel algorithm that uses $p$ processors and makes only ZCSQ queries. Assume that the queries always use the approximation parameter $\tau_{Alg}$, where $1/\tau_{Alg}$ is bounded by a polynomial in $n$ and $1/\epsilon$, and suppose that Alg successfully learns the concept class $C$ in $T$ parallel time-steps. We borrow notation from the proof of Theorem 5.1 above: the representation $r_{i,k-1}$ encodes the state of the simulation up to the end of $k-1$ parallel time-steps together with the identity of the $i$-th processor, and $(\phi^{i,k-1}_{z[k-1]}, \tau_{Alg})$ is the query made by the $i$-th processor in the $k$-th parallel time-step. We modify the representations $r^{i,k-1}_{b,\sigma_b}$ as follows:

$$r^{i,k-1}_{b,\sigma_b} = r_{i,k-1} + \frac{(-1)^b}{N}\,\phi^{i,k-1}_{z[k-1]}.$$

Then observe that

$$L_{f,D}(r^{i,k-1}_{b,\sigma_b}) - L_{f,D}(r_{i,k-1}) = G_{f,D}(r^{i,k-1}_{b,\sigma_b}) - G_{f,D}(r_{i,k-1}) = \frac{(-1)^b}{N}\,G_{f,D}(\phi^{i,k-1}_{z[k-1]}) = \frac{(-1)^{b+1}}{2N}\,E_{x\sim D}[\phi^{i,k-1}_{z[k-1]}(x)f(x)].$$

Suppose that $\mathrm{Rec}(r_{i,k-1}, r_{i,k-1}, \epsilon)$ outputs $r^{i,k-1}_{b,\sigma_b}$ with probability $1/3$ for each of $b = 0, 1$, and $r^{i,k-1}_{*,\sigma_*}$ with probability $1/3$. Here, $r^{i,k-1}_{*,\sigma_*}$ is a representation that is functionally identical to $r_{i,k-1}$, but the $*$ encodes the fact that both 0 and 1 may be valid query responses, and $\sigma^{i,k-1}_*$ is the broadcast message in that case.

Suppose that the query $(\phi^{i,k-1}_{z[k-1]}, \tau_{Alg})$ has a unique correct answer, i.e., either $E_{x\sim D}[\phi^{i,k-1}_{z[k-1]}(x)f(x)] \ge \tau_{Alg}$ or $E_{x\sim D}[\phi^{i,k-1}_{z[k-1]}(x)f(x)] \le -\tau_{Alg}$. Let $b_R$ be the right answer and $b_W = 1-b_R$ be the wrong answer. Then we observe that $L_{f,D}(r^{i,k-1}_{b_R,\sigma_{b_R}}) - L_{f,D}(r_{i,k-1}) \le -\tau_{Alg}/(2N)$ and $L_{f,D}(r^{i,k-1}_{b_W,\sigma_{b_W}}) - L_{f,D}(r_{i,k-1}) \ge \tau_{Alg}/(2N)$. Thus, if the tolerance function is $t(r', r'', \epsilon) = \tau_{Alg}/(2N)$ and the approximation parameter is $\tau = \tau_{Alg}/(6N)$, it will be the case that $r^{i,k-1}_{b_R,\sigma_{b_R}}$ is in the multi-set Feas and $r^{i,k-1}_{b_W,\sigma_{b_W}}$ is not, when the weak selection rule, Weak-Sel, is used. When the query $(\phi^{i,k-1}_{z[k-1]}, \tau_{Alg})$ does not have a unique correct answer, all three representations, $r^{i,k-1}_{0,\sigma_0}$, $r^{i,k-1}_{1,\sigma_1}$, and $r^{i,k-1}_{*,\sigma_*}$, may be in the multi-set Feas. This is not a problem, since the query response encoded in the representation is still a valid one.

The rest of the proof (the part in Claim 5.2 and also the probability calculations) is essentially identical to the proof of Theorem 5.1. We note that one minor difference arises from the fact that, when the query response is not unique, the representations $r_k$ at the end of phase $k$ may not be exactly identical in the function they represent (because they may contain parts from $r^{i,k-1}_{0,\sigma_0}$, $r^{i,k-1}_{1,\sigma_1}$, or $r^{i,k-1}_{*,\sigma_*}$), but they are identical in the sense that each representation encodes a valid simulation of the parallel algorithm, Alg, for $k$ time-steps (which is sufficient for our purposes). Thus, we get the following theorem.

Theorem 5.2. Suppose $C$ can be learned by an efficient parallel algorithm with access to a CSQ oracle, with respect to distribution $D$, in $T$ parallel time-steps. Then $C$ is evolvable with recombination, with selection rule Weak-Sel, using either the $\ell_c$ (classification error) or the $\ell_1$ loss function, under distribution $D$, in $O(T(\log(n/\epsilon))^2)$ generations (evolutionary steps).

Remark 5.3. Note that when the representations are randomized boolean functions, all loss functions, $\ell$, such that $\ell(1,-1) = \ell(-1,1) > 0$ and $\ell(1,1) = \ell(-1,-1) = 0$ are equivalent to the classification error.
Thus, when the representations are randomized boolean functions, the above theorem holds essentially for every loss function.

5.5 Some Evolutionary Algorithms

In this section, we apply the reductions described in the previous section to obtain faster evolutionary mechanisms with recombination for some concept classes (under certain distributions). All these mechanisms rely on being initialized with a particular starting population. The results in this section are reductions from parallel CSQ learning algorithms; thus, the evolutionary mechanisms work for any loss function.

5.5.1 Evolving Conjunctions

Valiant showed that the class of monotone conjunctions is evolvable in $O(n\log(n/\epsilon))$ generations under the uniform distribution [39]. Diochnos and Turán showed that Valiant's algorithm can be improved, and in fact showed that monotone conjunctions can be evolved in $O(\log(1/\epsilon))$ generations under the uniform distribution (with initialization) [10]. However, the logarithmic (in $1/\epsilon$) number of generations depends heavily on the assumption of the uniform distribution, and it is unclear whether it can be generalized easily to other distributions. In fact, our results in Section 5.6 show that there are distributions under which monotone conjunctions may not be evolved in $o(n/\log(n))$ generations in Valiant's model (without recombination).

The reduction from learning algorithms to evolvability shows that conjunctions may be evolved in Valiant's model in $\tilde O(n)$ generations with respect to any fixed distribution. Our reduction from parallel learning algorithms shows that, in a model of evolution with recombination, evolution of conjunctions is possible in $O((\log(n/\epsilon))^2)$ generations. This follows from a simple constant-time parallel learning algorithm in the CSQ model.

Proposition 5.1. There is a (non-uniform) parallel CSQ algorithm that learns the class of conjunctions in constant time with respect to any fixed distribution.

Proof. Let $f$ be a conjunction of literals; we use the convention that $f$ evaluates to $-1$ if every literal of $f$ is $-1$. We show that, for any $i \in [n]$, we can determine with high probability whether $x_i$, $\bar x_i$, or neither appears as a literal in $f$.

Note that for any $i$, we can estimate $\Pr[f = b \wedge x_i = b']$ very accurately (up to additive accuracy $\epsilon/(2n)$, for example) for all $b, b' \in \{-1,1\}$, using the queries $E[f(x)x_i]$, $E[f(x)]$, and $E[x_i]$. The last query is not correlational, but for any fixed distribution it may be provided as advice. Start with an empty conjunction, $g$. Perform the following for each $i$: if $\Pr[f = -1 \wedge x_i = 1] \le \epsilon/(2n)$, then put $x_i$ in $g$; else, if $\Pr[f = -1 \wedge x_i = -1] \le \epsilon/(2n)$, then put $\bar x_i$ in $g$; else, put neither $x_i$ nor $\bar x_i$ in $g$.

Note that if $x_i$ (respectively $\bar x_i$) is in the true conjunction, $f$, then $\Pr[f = -1 \wedge x_i = 1] = 0$ (respectively $\Pr[f = -1 \wedge x_i = -1] = 0$). Therefore, all true literals of $f$ will certainly be inserted in $g$ (because of the accuracy $\epsilon/(2n)$). Any extra literal we may have inserted in $g$ cannot cause error more than $\epsilon/(2n)$. Thus, the total error of the conjunction $g$ with respect to $f$ cannot be more than $\epsilon$.
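A minimal Python rendering of the learner in Proposition 5.1 may be helpful; here the per-coordinate CSQ estimates are replaced by exact probabilities computed over a small explicit distribution, and all helper names are ours.

```python
from itertools import product

def learn_conjunction(n, dist, f, eps):
    """Sketch of Proposition 5.1's parallel CSQ learner (run sequentially).

    dist: dict mapping x in {-1,1}^n to its probability.
    f: target conjunction, with f(x) = -1 when the conjunction is "true".
    In the real algorithm each Pr[f = -1 and x_i = b] is estimated from the
    queries E[f x_i], E[f], E[x_i] up to eps/(2n); here we use exact values.
    """
    def prob(i, b):
        return sum(p for x, p in dist.items() if f(x) == -1 and x[i] == b)

    g = []  # learned literals, as (index, sign) pairs
    for i in range(n):
        if prob(i, 1) <= eps / (2 * n):
            g.append((i, 1))    # x_i appears in g
        elif prob(i, -1) <= eps / (2 * n):
            g.append((i, -1))   # the negation of x_i appears in g
    return g

# Toy usage: the target is the conjunction of x_0 and x_1 (true = -1),
# under the uniform distribution on {-1,1}^3.
n = 3
dist = {x: 1 / 2**n for x in product([-1, 1], repeat=n)}
f = lambda x: -1 if (x[0] == -1 and x[1] == -1) else 1
print(learn_conjunction(n, dist, f, eps=0.1))  # recovers literals 0 and 1
```

In the parallel version, the $n$ iterations of the loop run on $n$ separate processors, which is why the algorithm takes only constant time.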
Thus, using Theorems 5.1 and 5.2, we get the following corollary:

Corollary 5.2. The class of conjunctions is evolvable with recombination under any fixed distribution in at most $O((\log(n/\epsilon))^2)$ generations, with respect to any of the selection rules Weak-Sel, BN-Sel, or Opt-Sel, and with respect to any consistent and bounded loss function, $\ell$, for which $\ell(1,-1) = \ell(-1,1) > 0$ and $\ell(1,1) = \ell(-1,-1) = 0$.

5.5.2 Evolving Homogeneous Linear Separators

In Section 3.7.2, we presented an evolutionary mechanism for evolving homogeneous linear separators with respect to radially symmetric distributions in $O(n/\epsilon)$ generations. Here, we give a simple parallel algorithm for learning homogeneous linear separators under radially symmetric distributions in the CSQ model in constant time. This implies an evolutionary mechanism with recombination for evolving homogeneous linear separators in $O((\log(n/\epsilon))^2)$ generations under radially symmetric distributions.

Proposition 5.2. There is a parallel CSQ algorithm for learning the class of homogeneous linear separators in $\mathbb{R}^n$ in constant time with respect to radially symmetric distributions.

Proof. We use notation from Section 3.7.2. Note that a homogeneous linear separator is represented as $h_w$, where $w \in \mathbb{R}^n$ is such that $\|w\|_2 = 1$. Let $(e_i)_{i=1}^n$ be any orthonormal basis for $\mathbb{R}^n$.

Let $D$ be a radially symmetric distribution. Then, as we saw in the proof of Theorem 3.2, $\Pr_{x\sim D}[h_w(x) \ne h_{e_i}(x)] = \arccos(w\cdot e_i)/\pi$. Let $p_i = \Pr_{x\sim D}[h_w(x) \ne h_{e_i}(x)]$ and observe that $p_i = (1/2) - (1/2)E_{x\sim D}[h_w(x)h_{e_i}(x)]$. Let $\hat p_i$ be an estimate of $p_i$ such that $|\hat p_i - p_i| \le \eta$ (this can be obtained by making a query to the CSQ-O oracle with $2\eta$ as the approximation parameter).

Let $w_i = w\cdot e_i = \cos(\pi p_i)$ and let $\hat w_i = \cos(\pi\hat p_i)$. Then $|w_i - \hat w_i| = |\cos(\pi p_i) - \cos(\pi\hat p_i)| = 2|\sin(\pi(p_i+\hat p_i)/2)\sin(\pi(p_i-\hat p_i)/2)| \le \pi|p_i-\hat p_i| \le \pi\eta$. Let $\tilde w = \sum_{i=1}^n \hat w_i e_i$ and $\hat w = \tilde w/\|\tilde w\|_2$. For a unit vector $w \in \mathbb{R}^n$, note that $\|w\|_1 \le \sqrt n\,\|w\|_2$; in particular, when $\|w\|_2 = 1$, $\|w\|_1 \le \sqrt n$. Then we have

$$w\cdot\hat w = \frac{\sum_{i=1}^n w_i\hat w_i}{\sqrt{\sum_{i=1}^n \hat w_i^2}} \ge \frac{\sum_{i=1}^n w_i^2 - \sqrt n\,\eta\pi}{\sqrt{\left(\sum_{i=1}^n w_i^2\right) + n(\eta\pi)^2 + 2\sqrt n\,\eta\pi}} \ge \frac{1-\sqrt n\,\eta\pi}{1+\sqrt n\,\eta\pi} \ge 1 - 2\sqrt n\,\eta\pi.$$

Finally, using the fact that $1-\cos(\theta) \ge 4\theta^2/\pi^2$ (for $\theta \in [0, \pi/2]$), we get that $L_{h_w,D}(h_{\hat w}) = \Pr_{x\sim D}[h_w(x) \ne h_{\hat w}(x)] \le n^{1/4}\sqrt{\pi\eta/2}$. Choosing $\eta$ appropriately completes the proof.
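The estimation step of Proposition 5.2 is likewise easy to sketch. In the Python fragment below (names are ours), the exact disagreement probabilities $p_i$ stand in for the CSQ estimates, perturbed by noise of magnitude up to $\eta$ to model the oracle's tolerance; the standard basis plays the role of $(e_i)_{i=1}^n$.

```python
import math
import random

def estimate_separator(w, n, eta):
    """Sketch of Proposition 5.2: recover a homogeneous separator h_w.

    For each basis vector e_i, the CSQ oracle gives an estimate of
    p_i = Pr[h_w(x) != h_{e_i}(x)] = arccos(w . e_i) / pi; here we compute
    p_i exactly and add adversarial noise of magnitude up to eta.
    """
    w_hat = []
    for i in range(n):
        p_i = math.acos(max(-1.0, min(1.0, w[i]))) / math.pi
        p_hat = p_i + random.uniform(-eta, eta)   # CSQ-style estimate
        w_hat.append(math.cos(math.pi * p_hat))   # w_i = cos(pi p_i)
    norm = math.sqrt(sum(v * v for v in w_hat))
    return [v / norm for v in w_hat]

# Toy usage in R^5 with a random unit vector.
n = 5
w = [random.gauss(0, 1) for _ in range(n)]
norm = math.sqrt(sum(v * v for v in w))
w = [v / norm for v in w]
w_hat = estimate_separator(w, n, eta=1e-4)
print(sum(a * b for a, b in zip(w, w_hat)))  # close to 1, so h_w ~ h_w_hat
```

In the parallel algorithm, the $n$ coordinate estimates are made by $n$ processors simultaneously, giving the claimed constant parallel time.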
Corollary 5.3. The class of homogeneous linear separators is evolvable with recombination under the class of radially symmetric distributions in $O((\log(n/\epsilon))^2)$ generations, with respect to any of the selection rules Weak-Sel, BN-Sel, or Opt-Sel, and with respect to any consistent and bounded loss function, $\ell$, for which $\ell(-1,1) = \ell(1,-1) > 0$ and $\ell(1,1) = \ell(-1,-1) = 0$.

5.6 Lower Bounds

In this section, we present lower bounds on the number of generations required for evolution in Valiant's original model, i.e., without recombination. The details of the model are in Chapter 3. We consider evolutionary mechanisms for some concept class, $C$, under distribution, $D$, with respect to loss function, $\ell$. We need the notion of an $\epsilon$-net.

Definition 5.5. We say that a set of functions, $N_\epsilon$, from $X$ to $Y'$, forms an $\epsilon$-net for concept class $C$, with respect to distribution $D$ and loss function $\ell$, if for every $f \in C$ there exists some $\nu \in N_\epsilon$, $\nu : X \to Y'$, such that $L_{f,D}(\nu) \le \epsilon$.

We view the evolutionary process as a communication protocol between two parties, which we call A and B for lack of better terminology. For simplicity, we will assume that the evolutionary mechanism is deterministic and that it always succeeds, i.e., not just with probability $1-\epsilon$. (See Remark 5.4 for how these assumptions may be removed.) Let $R$ be the representation class, Mut the mutator, $\tau$ the approximation parameter, $s : R\times\mathbb{R}^+\to\mathbb{N}$ the size function, and $t : R\times\mathbb{R}^+\to\mathbb{R}^+$ the tolerance function. The communication protocol may take several rounds, and in fact each communication round corresponds to one generation (or evolutionary step). A and B begin with the same starting representation, $r_0 \in R$. We assume that the mutator, and the Turing machines that compute the size and tolerance functions, are known to both A and B (note that the description length of these Turing machines is a constant, independent of $n$ or $\epsilon$). We assume that B has access to the oracle $L_{f,D}(\cdot,\cdot)$, and hence can simulate the entire evolutionary process. We are interested in counting the minimum number of bits B must send to A, so that A can simulate an identical evolutionary process.

Suppose that at the beginning of round $i$, both A and B have some representation, $r_{i-1} \in R$. (This is true when $i = 1$, and it will remain true for $i > 1$ as a result of the communication protocol.) Then the protocol proceeds as follows:

1. A runs the mutator, $\mathrm{Mut}(r_{i-1}, \epsilon)$, $s(r_{i-1}, \epsilon)$ times. These mutations are kept in an ordered list of size $s(r_{i-1}, \epsilon)$.

2. B does exactly the same thing as A; in particular, the order of the mutations in B's list is exactly the same as in A's. B uses the oracle $L_{f,D}(\cdot,\cdot)$ and some selection rule, Sel, to determine the mutation that will be the new representation, $r_i$, at the next generation. If $r_i = \bot$, B sends the number 0 to A; otherwise, B sends the index $j$ of the first mutation in the ordered list that has representation $r_i$.

3. Thus, B sends A at most $\lceil\log(s(r_{i-1},\epsilon))\rceil + 1$ bits, and then A simulates an identical evolutionary process.

Note that for a polynomially bounded evolutionary process, $\lceil\log(s(r_{i-1},\epsilon))\rceil + 1 = O(\log(n/\epsilon))$ for any $r_{i-1} \in R$. Suppose the evolutionary process is guaranteed to succeed in $g$ generations; then the total number of bits received by A is $O(g\log(n/\epsilon))$. If $f$ is the ideal function, let $r_f$ be the representation produced by B (and hence also A) at the end of the evolutionary process. Note that $L_{f,D}(r_f) \le \epsilon$, and hence the set $N_\epsilon = \{r_f \mid f \in C\}$ forms an $\epsilon$-net of $C$ with respect to $D$ and $\ell$. But this implies that $|N_\epsilon| = 2^{O(g\log(n/\epsilon))}$, since A required only $O(g\log(n/\epsilon))$ bits to produce the correct representation in the $\epsilon$-net. In fact, this argument gives an information-theoretic lower bound on the number of generations, $g$, required for evolution in Valiant's model (without recombination). We state this formally as:

Theorem 5.3. For a concept class, $C$, distribution, $D$, and loss function, $\ell$, if $S_\epsilon$ is such that the size of any $\epsilon$-net of $C$ with respect to $D$ and $\ell$ must be at least $S_\epsilon$, then any evolutionary mechanism in the sense of Valiant (without recombination) requires $\Omega(\log(S_\epsilon)/\log(n/\epsilon))$ generations.

Remark 5.4. If the evolutionary process uses randomness, we use a standard probabilistic-method argument to show that there exists a polynomial-length string which, when interpreted as random bits, is guaranteed to result in successful evolution. The failure probability, $\epsilon$, may be reduced to below $1/2^n$ by repeating the evolutionary process several times and taking the best outcome.
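The counting in the protocol above is easy to make concrete. A small sketch (with illustrative parameter values): per generation, B sends one index from a size-$s$ neighborhood, i.e., at most $\lceil\log_2(s)\rceil+1$ bits, so $g$ generations determine at most $2^{g(\lceil\log_2 s\rceil+1)}$ distinct final representations; inverting this count gives Theorem 5.3.

```python
import math

def bits_sent(generations, neighborhood_size):
    """Total bits B sends to A: one index (or 0 for 'no change') per round."""
    per_round = math.ceil(math.log2(neighborhood_size)) + 1
    return generations * per_round

# Illustration: a polynomially bounded mechanism exploring s = (n/eps)^3
# mutations per generation can reach at most 2^(bits) distinct final
# representations, and hence can cover an eps-net of at most that size.
n, eps = 100, 0.01
s = int((n / eps) ** 3)
for g in [10, 100, 1000]:
    b = bits_sent(g, s)
    print(f"g = {g:5d} generations -> {b} bits -> at most 2^{b} outcomes")

# Inverting: covering an eps-net of size S requires
# g = Omega(log2(S) / log2(n/eps)) generations, which is Theorem 5.3.
```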
On the other hand, evolution with recombination allows many more bits of information to be received per generation (because of the polynomial-sized population). We now show some explicit lower bounds for the classes of conjunctions and homogeneous linear separators.

Lower Bound for Monotone Conjunctions

Let $X = \{-1,1\}^n$ and let $e_i = (-1,\dots,-1,1,-1,\dots,-1)$ be the vector that has $-1$ in all positions except the $i$-th. Consider the simple distribution, $D_1$, over $\{-1,1\}^n$, where $\Pr_{D_1}(e_i) = 1/n$ for $1 \le i \le n$. Then note that any two different monotone conjunctions differ on at least one of the $e_i$'s. Suppose that the loss function is the classification error, $\ell_c$, and that the functions in the $\epsilon$-net are required to be boolean. Then, if $C$ is the class of monotone conjunctions over $\{-1,1\}^n$, for $\epsilon < 1/(2n)$, any $\epsilon$-net of $C$ under distribution $D_1$ and loss function $\ell_c$ must have size $2^n$. This is because no two monotone conjunctions can share the same function in the $\epsilon$-net (if $f \ne g$ are two monotone conjunctions, $L_{f,D_1}(g) \ge 1/n$). Thus, we get Corollary 5.4 below; compare with Corollary 5.2.

Corollary 5.4. The class of monotone conjunctions, under the distribution $D_1$ (defined above) and with classification error, $\ell_c$, as the loss function, requires $\Omega(n/\log(n))$ generations for evolution in Valiant's model (without recombination).

Lower Bound for Homogeneous Linear Separators

Let $U_n$ be the uniform distribution over the unit sphere in $n$ dimensions. It is well known that any $\epsilon$-net for homogeneous linear separators, with respect to classification error, $\ell_c$, and distribution, $U_n$, has size at least $(1/\epsilon)^n$. Thus, we can prove Corollary 5.5; compare with Corollary 5.3 (note that the uniform distribution over the unit sphere is a radially symmetric distribution).

Corollary 5.5. The class of homogeneous linear separators in $\mathbb{R}^n$, $H_n$, under the uniform distribution over the unit sphere, $U_n$, and with classification error, $\ell_c$, as the loss function, requires $\Omega(n/\log(n/\epsilon))$ generations for evolution in Valiant's model (without recombination).

5.7 Discussion

The results in this chapter suggest that, in principle, recombination can accelerate evolution through parallelism. The results should be understood as demonstrating that, for certain definitions of selection rules and recombination, polylogarithmic generations are sufficient for evolution whenever a suitable parallel learning algorithm exists. In this section, we discuss in some detail the choices made in the model of recombination.

The recombinator operator defined here simultaneously captures both mutation and recombination. In a sexually reproducing species, only mutations that occur on the germ line matter, and natural selection never acts on mutations alone, but simultaneously also on recombination.

We have defined selection of a descendant based on a performance comparison with its parents. We take the view that if the parents were able to survive, descendants having performance no worse than their parents should also be able to survive. This definition is different from those common in population genetics, where survival probability is determined by relative fitness, thus capturing competition among individuals in a species. However, the mapping from performance (or loss) with respect to a functionality to fitness, defined as the average number of progeny of a particular genotype that survive, may be an extremely complicated one.
Indeed, mapping fitness to functional traits of organisms in a quantifiable manner seems to be a very difficult problem. Also, the sharp thresholds we use to decide beneficial vs. neutral, optimal vs. suboptimal, and feasible vs. infeasible do seem unnatural. However, the only thing that can be said with certainty is that the mapping between performance with respect to a certain functionality and fitness (as in survival of a genotype) should be monotone. Sharp thresholds offer the advantage that they are robust to any such monotone mapping.

The population model (with respect to selection) we proposed in this chapter is most realistic when the population is large and distributed over a large area, and the selective advantage is weak and roughly similar for all beneficial mutations. If the selective forces for some mutations are strong, then these mutations are likely to spread in the population quickly, with or without recombination. Maynard Smith [34, 35] gives a detailed discussion of the conditions under which sexual reproduction and recombination may accelerate evolution.

Chapter 6

Limitations on Evolvability

In this chapter, we study some of the limitations that must exist on classes of functions that are evolvable within reasonable (polynomially bounded) resources. Some of the limitations arise for purely statistical reasons, i.e., not enough information may be obtained through the processes of mutation (or recombination) and selection to evolve a function that is (almost) accurate. These results use the fact that any concept class that is evolvable is also learnable in the statistical query framework (and in some cases the CSQ framework). These results are part of the existing literature and are summarized in Section 6.3.

The question we mainly focus on is: are there strictly computational limitations on evolvability? Alternatively, if the evolutionary process is allowed to use unbounded computational resources, but only polynomially bounded statistical (or information-theoretic) resources, are more concept classes evolvable? We show that the answer to this question is indeed positive. These results are presented in the framework of statistical query (SQ) and correlational statistical query (CSQ) learning.

In Section 6.1, we review the SQ and CSQ models and delineate the computational and statistical aspects of these models. In Section 6.2, we precisely define the statistical and computational constraints imposed on the evolutionary process in Valiant's model. We show that there is a direct correspondence between computational limitations on evolvability and computational limitations on SQ/CSQ learning. This follows from the reduction from learning algorithms to evolutionary mechanisms described in Section 3.8.1. Section 6.4 contains the proofs of the main results. In Section 6.5, we quantify the computational resources that are sufficient for learning in the SQ framework whenever statistically efficient learning is possible.

6.1 Statistical Query Learning Models

In this section, we describe several variants of the SQ model considered in this chapter. We recall notation described in Chapter 2. In this chapter, we focus only on boolean concept classes, so we repeat the definitions of statistical query learning (only for boolean functions) in the interest of clarity. Let $X$ be the instance space, and suppose that $n$ characterizes the representation size of each element $x \in X$ (e.g., $X = \{-1,1\}^n$ or $X = \mathbb{R}^n$).
A concept class, $C$, over $X$ is a subset of the boolean functions defined over $X$, where each concept, $c \in C$, is represented as a binary string; it is required that there exist an efficient Turing machine that outputs $c(x)$, given $c \in C$ and $x \in X$ as inputs (cf. [23]). Let $D$ be a distribution over $X$. In the SQ model [22], the learning algorithm has access to a statistical query oracle, $\mathrm{STAT}(f, D)$, to which it can make a query of the form $(\psi, \tau)$, where $\psi : X\times\{-1,1\}\to[-1,1]$ is the query function and $\tau$ is the (inverse polynomial) tolerance. (We have referred to $\tau$ as the approximation parameter in the rest of this thesis, to avoid confusion with the tolerance function in the context of evolvability; in this chapter, we use the more conventional notation in learning theory and refer to $\tau$ as the tolerance parameter.) The oracle responds with a value $v$ such that $|E_D[\psi(x, f(x))] - v| \le \tau$, where $f \in C$ is the target concept. The goal of the learning algorithm is to output a hypothesis, $h : X\to\{-1,1\}$, such that $\mathrm{err}_D(h, f) = \Pr_{x\sim D}[h(x) \ne f(x)] \le \epsilon$.

We use the following characterization of the SQ model due to [7] (see also [12]). A statistical query $\psi : X\times\{-1,1\}\to[-1,1]$ is said to be target-independent if $\psi(x, b) \equiv \psi^{ti}(x)$ for some function $\psi^{ti} : X\to[-1,1]$. A statistical query is said to be correlational if $\psi(x, y) \equiv y\,\psi^{cor}(x)$ for some function $\psi^{cor} : X\to[-1,1]$. [7] showed that any statistical query $(\psi, \tau)$ (in Kearns' model) can be replaced by two queries, one of which is target-independent and the other correlational, each with tolerance $\tau/2$. We denote an oracle that accepts only target-independent or correlational queries by $\mathrm{SQ\text{-}O}(f, D)$. Formally, let $\mathrm{SQ\text{-}O}(f, D)$ denote the statistical query oracle which, on receiving a target-independent query $(\psi^{ti}, \tau)$ (where $\psi^{ti} : X\to[-1,1]$), responds with a value $v$ such that $|E_D[\psi^{ti}(x)] - v| \le \tau$, and, on receiving a correlational statistical query $(\psi^{cor}, \tau)$ (where $\psi^{cor} : X\to[-1,1]$), responds with a value $v$ such that $|E_{x\sim D}[\psi^{cor}(x)f(x)] - v| \le \tau$, where $f$ is the target concept.

6.1.1 Distribution-Specific SQ Learning

The first framework we consider is distribution-specific SQ learning. It is required that the running time of the algorithm be polynomial in the parameters $n$ and $1/\epsilon$, that the queries made by the algorithm be efficiently evaluable (by this we mean that, given a description of a query function $\psi$ and its input $x$, there exists an efficient Turing machine that outputs $\psi(x)$), and that the queries use a tolerance parameter, $\tau$, that is lower-bounded by some inverse polynomial in $n$ and $1/\epsilon$.

Definition 6.1 (Distribution-Specific SQ Learning). Let $X$ be the instance space (with representation size $n$), $D$ a distribution over $X$, and $C$ a concept class over $X$. We say that $C$ is distribution-specific SQ learnable with respect to distribution $D$ if there exists a randomized algorithm, Alg, that for every $\epsilon, \delta > 0$ and every target concept $f \in C$, with access to the oracle $\mathrm{SQ\text{-}O}(f, D)$, outputs, with probability at least $1-\delta$, a hypothesis, $h$, such that $\mathrm{err}_D(h, f) \le \epsilon$. Furthermore, the running time of the algorithm must be polynomial in $n$, $1/\epsilon$, and $1/\delta$; the queries made to the oracle and the output hypothesis must be polynomially evaluable; and the queries must use a tolerance $\tau$ that is lower-bounded by an inverse polynomial in $n$ and $1/\epsilon$.
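For concreteness, the following Python sketch renders the SQ-O abstraction empirically (the sampling is only a stand-in: the real oracle may return any value within tolerance $\tau$, and it is this adversarial slack, modeled by the added noise below, that the lower bounds exploit; the names are ours).

```python
import random

def sq_oracle(query, kind, f, sample_x, tau, m=20000):
    """Sketch of SQ-O(f, D): answer target-independent or correlational
    queries within additive tolerance tau (estimated here by sampling)."""
    xs = [sample_x() for _ in range(m)]
    if kind == "target-independent":      # estimate E_D[psi(x)]
        v = sum(query(x) for x in xs) / m
    elif kind == "correlational":         # estimate E_D[psi(x) f(x)]
        v = sum(query(x) * f(x) for x in xs) / m
    else:
        raise ValueError("unknown query kind")
    # Any answer within tau of the truth is legal; model that slack too.
    return v + random.uniform(-tau / 2, tau / 2)

# Toy usage: target is a parity of the first two bits, D uniform on {-1,1}^4.
sample_x = lambda: tuple(random.choice([-1, 1]) for _ in range(4))
f = lambda x: x[0] * x[1]
print(sq_oracle(lambda x: x[0] * x[1], "correlational", f, sample_x, 0.05))
# prints a value near 1: the query function is perfectly correlated with f
```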
To distinguish the computational complexity from the information-theoretic (or statistical) complexity of SQ learning, we use the notion of query complexity. The query complexity of learning a concept class $C$ is the minimum number of queries required by any (possibly computationally unbounded) algorithm to learn $C$ in the SQ model. It is worth pointing out that, when defining query complexity, the queries are not required to be efficiently evaluable and need not even have a small representation.

Definition 6.2 (Distribution-Specific SQ Query Complexity). Let $X$ be the instance space (with representation size $n$), $D$ a distribution over $X$, and $C$ a concept class over $X$. We say that the query complexity of learning $C$ under distribution $D$ to accuracy $\epsilon$ is bounded by $q$ if there exists a (possibly computationally unbounded) algorithm that, for every concept $f \in C$, makes at most $q$ queries to the oracle $\mathrm{SQ\text{-}O}(f, D)$ and outputs a hypothesis, $h$, such that $\mathrm{err}_D(h, f) \le \epsilon$. The tolerance, $\tau$, for the queries must be lower-bounded by an inverse polynomial in $n$ and $1/\epsilon$.

Distribution-Specific Weak SQ Learning

In the case of weak learning, the learning algorithm is required only to output a hypothesis whose error is at most $1/2-\gamma$, where $1/\gamma$ is bounded by a polynomial in $n$. The definitions of distribution-specific weak SQ learning and distribution-specific weak SQ query complexity are identical to the definitions above (except for the requirement on the error of the output hypothesis).

6.1.2 Distribution-Independent SQ Learning

The second framework we consider is distribution-independent SQ learning, where the same learning algorithm is required to output an accurate hypothesis for all distributions. As in the case of distribution-specific learning, the requirement is that the running time of the algorithm be polynomial in $n$ and $1/\epsilon$, that the queries made by the algorithm be polynomially evaluable, and that they use a tolerance parameter, $\tau$, that is lower-bounded by some inverse polynomial in $n$ and $1/\epsilon$.

Definition 6.3 (Distribution-Independent SQ Learning). Let $X$ be the instance space (with representation size $n$) and $C$ a concept class over $X$. We say that $C$ is distribution-independently SQ learnable if there exists a randomized algorithm, Alg, that for every $\epsilon, \delta > 0$, every target concept $f \in C$, and every distribution, $D$, over $X$, with access to the oracle $\mathrm{SQ\text{-}O}(f, D)$, outputs, with probability at least $1-\delta$, a hypothesis, $h$, such that $\mathrm{err}_D(h, f) \le \epsilon$. Furthermore, the running time of the algorithm must be polynomial in $n$, $1/\epsilon$, and $1/\delta$; the queries made to the oracle and the output hypothesis must be polynomially evaluable; and the queries must use a tolerance, $\tau$, that is lower-bounded by an inverse polynomial in $n$ and $1/\epsilon$.

As in the case of distribution-specific SQ learning, to distinguish between the computational complexity and the information-theoretic (or statistical) complexity of learning, we use the notion of distribution-independent SQ query complexity. We reiterate that the queries themselves are not required to be polynomially evaluable and need not even have a polynomial-size representation. However, the number of queries made and the inverse of the tolerance parameter, $\tau$, must be bounded by some polynomial in $n$ and $1/\epsilon$.

Definition 6.4 (Distribution-Independent SQ Query Complexity). Let $X$ be the instance space (with representation size $n$) and $C$ a concept class over $X$.
We say that the distribution-independent query complexity of learning $C$ to accuracy $\epsilon$ is bounded by $q$ if there exists a (possibly computationally unbounded) algorithm that, for every concept $f \in C$ and every distribution $D$ over $X$, makes at most $q$ queries to the oracle $\mathrm{SQ\text{-}O}(f, D)$ and outputs a hypothesis, $h$, such that $\mathrm{err}_D(h, f) \le \epsilon$. The tolerance, $\tau$, for the queries must be lower-bounded by an inverse polynomial in $n$ and $1/\epsilon$.

In the case of distribution-independent SQ learning, weak and strong learning are equivalent (cf. [1]); hence, we do not consider the distribution-independent weak learning model.

Access to Random Examples

In the distribution-independent SQ learning setting, we also consider the variant where the learning algorithm has access to random unlabeled examples from the distribution, $D$. In the case of distribution-specific SQ learning, this model is not particularly interesting, since we may assume instead that the learning algorithm has a reasonably large (polynomial-size) sample provided as (non-uniform) advice. The lower bounds shown in Section 6.4 hold even in the model where the learning algorithms have access to random unlabeled examples from the distribution. The upper bound proved in Section 6.5 holds only if the learning algorithm has access to such examples from the distribution.

6.1.3 Correlational Statistical Query Learning

The correlational statistical query (CSQ) learning model was introduced by Feldman [12], who showed that this model is equivalent to Valiant's evolution model (with boolean ideal functions and boolean representations). In the CSQ model, the learner is only allowed to make statistical queries that are correlational. Let $(\psi, \tau)$ be a query, where $\psi : X\to[-1,1]$ is the query function and $\tau$ is the (inverse polynomial) tolerance parameter. A CSQ oracle, $\mathrm{CSQ\text{-}O}(f, D)$, responds with a value $v$ such that $|E_{x\sim D}[\psi(x)f(x)] - v| \le \tau$, where $f \in C$ is the target concept.

Distribution-specific CSQ learning and distribution-specific SQ learning are essentially equivalent, as long as the learning algorithm has a random sample from the distribution as non-uniform advice (see [12] for more details). Thus, we do not consider the case of distribution-specific CSQ learning separately. Also, in the case of CSQ learning, it does not make sense to consider models which have access to random unlabeled examples, since this would make it the same as SQ learning.

Definition 6.5 (Distribution-Independent CSQ Learning). Let $X$ be the instance space (with representation size $n$) and $C$ a concept class over $X$. We say that $C$ is distribution-independently CSQ learnable if there exists a randomized algorithm, Alg, that for every $\epsilon, \delta > 0$, every target concept $f \in C$, and every distribution $D$ over $X$, with access to the oracle $\mathrm{CSQ\text{-}O}(f, D)$, outputs, with probability at least $1-\delta$, a hypothesis, $h$, such that $\mathrm{err}_D(h, f) \le \epsilon$. Furthermore, the running time of the algorithm must be polynomial in $n$, $1/\epsilon$, and $1/\delta$; the queries made to the oracle and the output hypothesis must be polynomially evaluable; and the queries must use a tolerance, $\tau$, that is lower-bounded by an inverse polynomial in $n$ and $1/\epsilon$.

As in the previous cases, one can define the distribution-independent query complexity of CSQ learning. This captures the information-theoretic complexity of CSQ learning.

Definition 6.6 (Distribution-Independent CSQ Query Complexity). Let $X$ be the instance space (with representation size $n$) and $C$ a concept class over $X$.
We say that the distribution-independent query complexity of learning $C$ to accuracy $\epsilon$ is bounded by $q$ if there exists a (possibly computationally unbounded) algorithm that, for every concept $f \in C$ and every distribution $D$ over $X$, makes at most $q$ queries to the oracle $\mathrm{CSQ\text{-}O}(f, D)$ and outputs a hypothesis, $h$, such that $\mathrm{err}_D(h, f) \le \epsilon$. The tolerance, $\tau$, for the queries must be lower-bounded by an inverse polynomial in $n$ and $1/\epsilon$.

6.1.4 Notation

Finally, we introduce some additional notation used in the rest of this chapter. Note that all binary strings are strings over $\{-1,1\}$ and boolean functions have range $\{-1,1\}$. Let $[k]$ denote the set $\{1, 2, \dots, k\}$ and, for any $k$-bit string, $z$, let $S(z) = \{i \mid z_i = -1\} \subseteq [k]$.

Define $\mathrm{MAJ}_z : \{-1,1\}^k\to\{-1,1\}$ to be the majority function over $S(z)$. Formally, for $x \in \{-1,1\}^k$, $\mathrm{MAJ}_z(x) = -1$ if $|S(x)\cap S(z)|/|S(z)| \ge 1/2$, and $\mathrm{MAJ}_z(x) = 1$ otherwise.

Define $\mathrm{PAR}_z : \{-1,1\}^k\to\{-1,1\}$ to be the parity function over the bits in $S(z)$. Formally, for $x \in \{-1,1\}^k$, $\mathrm{PAR}_z(x) = -1$ if $|S(x)\cap S(z)|$ is odd and $\mathrm{PAR}_z(x) = 1$ if $|S(x)\cap S(z)|$ is even, i.e., $\mathrm{PAR}_z(x) = \prod_{i\in S(z)} x_i$.

Define $\mathrm{OR}_z : \{-1,1\}^k\to\{-1,1\}$ to be the OR function over the bits in $S(z)$. Formally, for $x \in \{-1,1\}^k$, $\mathrm{OR}_z(x) = -1$ if $|S(x)\cap S(z)| \ge 1$, and $\mathrm{OR}_z(x) = 1$ otherwise.

Fourier Analysis

Under the uniform distribution over $\{-1,1\}^n$, the set of parity functions, $(\mathrm{PAR}_z)_{z\in\{-1,1\}^n}$, forms an orthonormal basis (the Fourier basis) for real-valued functions defined over $\{-1,1\}^n$. For a function $f : \{-1,1\}^n\to\mathbb{R}$, $\hat f(S(z)) = E_{x\sim U_n}[f(x)\mathrm{PAR}_z(x)]$ is the Fourier coefficient of $f$ corresponding to the subset $S(z)$; here $U_n$ denotes the uniform distribution over $\{-1,1\}^n$. Fourier analysis has been used extensively for learning with respect to the uniform distribution (for a survey, see [28]). In particular, Kushilevitz and Mansour showed that, with black-box access to a function $f$, all Fourier coefficients of $f$ with magnitude at least $\theta$ can be obtained in time $\mathrm{poly}(n, 1/\theta)$ [25]. We refer to this as the KM algorithm and use it frequently in our proofs.
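The notation translates directly into code. The sketch below (our names) implements $S(z)$, $\mathrm{MAJ}_z$, $\mathrm{PAR}_z$, and $\mathrm{OR}_z$, and computes a Fourier coefficient by brute-force enumeration of $\{-1,1\}^n$; the KM algorithm recovers the same heavy coefficients, but in time polynomial in $n$ and $1/\theta$.

```python
from itertools import product

def S(z):
    """Indices where the pattern string z is -1."""
    return [i for i, zi in enumerate(z) if zi == -1]

def MAJ(z, x):
    s = S(z)
    return -1 if sum(1 for i in s if x[i] == -1) / len(s) >= 0.5 else 1

def PAR(z, x):
    prod = 1
    for i in S(z):
        prod *= x[i]
    return prod  # -1 iff |S(x) ∩ S(z)| is odd

def OR(z, x):
    return -1 if any(x[i] == -1 for i in S(z)) else 1

def fourier_coefficient(f, z, n):
    """hat f(S(z)) = E_{x ~ U_n}[f(x) PAR_z(x)], by brute force."""
    return sum(f(x) * PAR(z, x) for x in product([-1, 1], repeat=n)) / 2**n

# Toy usage: f = PAR_z has Fourier coefficient 1 on S(z) and 0 elsewhere.
n = 4
z = (-1, -1, 1, 1)
f = lambda x: PAR(z, x)
print(fourier_coefficient(f, z, n))               # 1.0
print(fourier_coefficient(f, (1, -1, -1, 1), n))  # 0.0
```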
6.2 Computational Limitations on Evolvability

Recall from Chapter 3 that an evolutionary mechanism is defined as a 5-tuple, $EM = (R, \mathrm{Mut}, \tau, s, t)$. We required that all components of an evolutionary mechanism be polynomially bounded. Of these, the statistical (polynomial) bounds are the following:

• The approximation parameter, $\tau$, indicates how accurate an estimate of the true expected loss of a representation, $r$, may be obtained by accessing the $L_{f,D}(\cdot,\cdot)$ oracle. We require that $1/\tau$ be bounded by a polynomial in $n$ and $1/\epsilon$. We required that the tolerance function, $t$, be polynomially sandwiched; however, allowing the tolerance function to be superpolynomially small is of no advantage if $1/\tau$ is polynomially bounded.

• The size function, $s$, determines the size of the neighborhood of a representation, $r$, that is explored at each evolutionary step.

• The number of generations (or evolutionary steps) must be at most some polynomial in $n$ and $1/\epsilon$.

The statistical bounds are also computational ones. On the other hand, there are also purely computational polynomial bounds imposed on the model:

• The representations in $R$ must have size bounded by a polynomial in $n$ and $1/\epsilon$, and these representations must be polynomially evaluable.

• The mutator must be a polynomial-time randomized Turing machine.

• The size function, $s$, and the tolerance function, $t$, must be efficiently evaluable.

The question we consider is: if the statistical bounds on the model of evolvability are retained, but the model is allowed unbounded computational power, are more concept classes evolvable? We relate this question to that of computational complexity vs. query complexity in the statistical query models. The reduction from learning algorithms to evolutionary mechanisms in Section 3.8.1 allows us to observe the following:

1. A computationally efficient statistical query algorithm leads to a computationally and statistically polynomially bounded evolutionary mechanism.

2. A statistical query algorithm with polynomial query complexity can be reduced to an evolutionary mechanism that is polynomially bounded in its statistical properties, but may otherwise be computationally unbounded. This is because the number of queries made by the algorithm translates into the size of the neighborhood and the number of generations. On the other hand, evaluating the representations requires simulating the statistical query algorithm (which may be computationally unbounded); thus, the evolutionary mechanism may use possibly unbounded computation.

Note that the converse is also true, i.e., if an evolutionary mechanism is statistically bounded, then it may be simulated in the SQ framework with polynomial query complexity. Thus, in the rest of the chapter, we use the language of the statistical query frameworks and separate concept classes that are efficiently learnable from those that merely have polynomial query complexity.

6.3 Statistical Lower Bounds

We summarize some statistical lower bounds that are known in the statistical query framework. Kearns, in his original paper on statistical query learning, had already demonstrated that parity functions require an exponential number of queries for learning [22]. Blum et al. proved that the number of statistical queries required for weakly learning a concept class, $C$, over some distribution, $D$, is characterized by a relatively simple combinatorial parameter of $C$, called the statistical query dimension [6]. The statistical query dimension measures the maximum number of nearly uncorrelated (relative to distribution $D$) functions in $C$. These bounds for weak learning were strengthened and extended to other variants by Bshouty and Feldman [7], Blum, Kalai and Wasserman [5], Yang [42], and Feldman [12].

More recently, Simon [32] described a characterization of strong SQ learning with respect to a fixed distribution, $D$. Simpler and stronger characterizations were subsequently derived by Feldman [13] and Szörényi [37]. All of these characterizations are in the fixed-distribution setting, and the corresponding question in the distribution-independent case is still open. Using these characterizations, Feldman, Lee, and Servedio [17] have shown statistical lower bounds for learning shallow (in particular, depth-3) monotone formulas. All of these results are based on information-theoretic arguments and hence are unconditional.

6.4 Computational Lower Bounds

For the sake of completeness, we first state the result that for weak distribution-specific SQ/CSQ learning, the query complexity and the computational complexity are essentially the same. This fact was observed by Blum et al. [6] and follows easily from their characterization of SQ learning.
Next, we show that for distribution-specific (strong) SQ learning, distribution-independent SQ learning, and distribution-independent (strong) CSQ learning, the query complexity is significantly different from the computational complexity. In the case of distribution-specific (strong) SQ learning and distribution-independent SQ learning, we show that there exists a concept class that has polynomial query complexity but cannot be efficiently learned in the respective SQ model unless RP = NP. In the case of distribution-independent CSQ learning, the separation is based on a stronger assumption: we show that there is a concept class $C$ with polynomial query complexity (of CSQ learning) that cannot be learned efficiently unless every problem in W[P] has a randomized fixed-parameter tractable algorithm.

6.4.1 Weak Distribution-Specific SQ/CSQ Learning

Blum et al. [6] showed that weak distribution-specific SQ learnability of a concept class is characterized by a combinatorial parameter, the statistical query dimension, SQ-DIM(C, D), which measures the number of nearly uncorrelated concepts in $C$, and observed that this implies the equivalence of query complexity and computational complexity in this model of learning. A short proof sketch of the following theorem is provided for completeness.

Theorem 6.1. If a concept class $C$ is weakly SQ learnable over a distribution $D$, then there exists a polynomial-size circuit that weakly SQ learns $C$ over $D$.

Proof. $\mathrm{SQ\text{-}DIM}(C, D, \gamma)$ is the size of the largest subset $S \subseteq C$ such that, for every $c_1, c_2 \in S$, $\mathrm{err}_D(c_1, c_2) = \Pr_{x\sim D}[c_1(x) \ne c_2(x)] \ge 1/2-\gamma$. A simple weak-learning algorithm just tries every concept from $S$ (note that this is a non-uniform algorithm, since the set $S$ needs to be given as advice to the algorithm), and at least one of them will have error less than $1/2-\gamma$ with the target function $f$: since $S$ is the largest such subset, if this were not the case, adding the target function $f$ to $S$ would give a larger subset with the same property. Blum et al. [6] showed that $\mathrm{SQ\text{-}DIM}(C, D, \gamma)$ is polynomially related to the query complexity of weakly SQ learning $C$ under $D$ to accuracy $1/2-\gamma$.

6.4.2 Strong Distribution-Specific SQ/CSQ Learning

Let $\phi \in \{-1,1\}^m$ denote a 3-CNF formula over $n$ variables (encoded as a string). Suppose $\phi$ is satisfiable, and let $\zeta(\phi)$ denote the lexicographically first satisfying assignment of $\phi$. Throughout this section, $b \in \{-1,1\}$, $x \in \{-1,1\}^m$, and $x' \in \{-1,1\}^n$, and $bxx'$ denotes the $(m+n+1)$-bit string obtained by concatenating $b$, $x$, and $x'$. For $\phi \in \{-1,1\}^m$, where $\phi$ is a satisfiable 3-CNF formula, define the function $f_{\phi,y} : \{-1,1\}^{m+n+1}\to\{-1,1\}$ as follows:

$$f_{\phi,y}(bxx') = \begin{cases} \mathrm{MAJ}_\phi(x) & \text{if } b = 1 \\ \mathrm{PAR}_y(x') & \text{if } b = -1 \end{cases}$$

In other words, $f_{\phi,y}$ is a function that, over one half of the domain, is the majority function, $\mathrm{MAJ}_\phi$, and, over the other half of the domain, is the parity function, $\mathrm{PAR}_y$. Note that the function $f_{\phi,y}$ is efficiently computable given the representation $(\phi, y)$. Define $C_1$ to be the following concept class: $C_1 = \{f_{\phi,\zeta(\phi)} \mid \phi \text{ is satisfiable}\}$.
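The hybrid construction is straightforward to express in code; a self-contained Python sketch (our names), with small inline versions of $\mathrm{MAJ}$ and $\mathrm{PAR}$ from Section 6.1.4:

```python
def S(z):
    """Indices where the pattern string z is -1."""
    return [i for i, zi in enumerate(z) if zi == -1]

def f_phi_y(phi, y, bxx):
    """The hybrid concept f_{phi,y} on an (m+n+1)-bit input b x x'.

    Half the domain (b = 1) computes MAJ_phi on the first m bits; the
    other half (b = -1) computes PAR_y on the last n bits. phi encodes a
    3-CNF formula as an m-bit string; y is its hidden satisfying assignment.
    """
    m = len(phi)
    b, x, xp = bxx[0], bxx[1:1 + m], bxx[1 + m:]
    if b == 1:  # majority over the variables selected by phi
        s = S(phi)
        return -1 if sum(1 for i in s if x[i] == -1) / len(s) >= 0.5 else 1
    prod = 1    # parity over the bits selected by y
    for i in S(y):
        prod *= xp[i]
    return prod

# Toy usage with a 4-bit "formula" string and a 3-bit assignment.
phi = (-1, 1, -1, 1)
y = (-1, -1, 1)
print(f_phi_y(phi, y, (1, -1, -1, 1, 1, 1, 1, 1)))    # MAJ branch
print(f_phi_y(phi, y, (-1, -1, -1, 1, 1, -1, 1, 1)))  # PAR branch
```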
Having learned the majority part, the algorithm can recover φ and solve for ζ(φ) (possibly using unbounded computation). Thus, f_{φ,ζ(φ)} can be exactly SQ learned using only polynomially many queries. On the other hand, we show that an efficient SQ learning algorithm for C₁ can be used to recover a satisfying assignment of the 3-CNF formula φ. The key point is that parities are essentially invisible to statistical queries, and hence the only way to learn C₁ is to obtain ζ(φ) from φ, which is not possible unless RP = NP.

Theorem 6.2. The query complexity of SQ learning C₁ with respect to the uniform distribution U is at most m.

Proof. Let f_{φ,ζ(φ)} ∈ C₁ be the target function. We show how to obtain φ using statistical queries. For i ∈ {1, ..., m}, define the function ψ_i : {−1, 1}^{m+n+1} × {−1, 1} → [−1, 1] as follows:

    ψ_i(bxx′, y) = 0      if b = −1
    ψ_i(bxx′, y) = x_i·y  if b = 1

Then observe that E_{U_{m+n+1}}[ψ_i(bxx′, f_{φ,ζ(φ)}(bxx′))] = (1/2) E_{U_m}[x_i MAJ_φ(x)]. It is well known (see for example [31]) that if φ_i = −1 (i.e., the ith bit is part of the majority function) then E_{U_m}[x_i MAJ_φ(x)] = Ω(1/√m), and that this expectation is 0 otherwise. Hence, by setting τ = Θ(1/√m), the query (ψ_i, τ) reveals the bit φ_i. Thus, using at most m = O(n³) statistical queries, we obtain φ. Now it is easy to obtain (possibly using unbounded computation) the value ζ(φ), and thus the function f_{φ,ζ(φ)}.

Theorem 6.3. C₁ is not efficiently SQ learnable under the uniform distribution unless RP = NP.

Proof. Suppose to the contrary that Alg is a (possibly randomized) algorithm that learns C₁ to error at most 0.1 (in fact, any value noticeably lower than 1/4 suffices) in polynomial time. We show that, using Alg, it is possible (with high probability) to find a satisfying assignment to any 3-CNF formula φ, if one exists; thus, failure to find a satisfying assignment implies that φ is unsatisfiable.

Let φ be a 3-CNF instance. Suppose φ is satisfiable, so that f_{φ,ζ(φ)} ∈ C₁; we show that in this case a solution to φ can be obtained with high probability. Suppose Alg makes q statistical queries, each with tolerance τ, to learn C₁. We show that we can simulate any statistical query (ψ, τ) with respect to f_{φ,ζ(φ)} efficiently. The queries made by Alg to the oracle SQ-O may be target-independent or correlational. We consider the two cases:

1. Let (ψ^{ti}, τ) be a target-independent query; we need to return an (additive) τ-approximation to the value E_{U_{m+n+1}}[ψ^{ti}(bxx′)]. This is easily achieved by drawing a sample of size Õ(1/τ²) from the uniform distribution and returning the empirical estimate.

2. Let (ψ^{cor}, τ) be a correlational query. In this case, we need to return an (additive) τ-approximation to the value E_{U_{m+n+1}}[ψ^{cor}(bxx′) f_{φ,ζ(φ)}(bxx′)]. For b ∈ {−1, 1}, define ψ^{cor}_b(xx′) ≡ ψ^{cor}(bxx′). Then,

    E_{U_{m+n+1}}[ψ^{cor}(bxx′) f_{φ,ζ(φ)}(bxx′)] = (1/2) E_{U_{m+n}}[ψ^{cor}_1(xx′) MAJ_φ(x)] + (1/2) E_{U_{m+n}}[ψ^{cor}_{−1}(xx′) PAR_{ζ(φ)}(x′)].

It suffices to find τ-approximations to both terms in the above expression. To obtain a τ-approximate estimate of E_{U_{m+n}}[ψ^{cor}_1(xx′) MAJ_φ(x)], as in the earlier case, we can draw a sample of size Õ(1/τ²) from U_{m+n} and return the empirical estimate (since we can efficiently compute the functions ψ^{cor}_1 and MAJ_φ). We show that either 0 is a τ-approximation to E_{U_{m+n}}[ψ^{cor}_{−1}(xx′) PAR_{ζ(φ)}(x′)], or we find a satisfying assignment to φ using Fourier analysis.
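The majority-side estimate in case 2 is a plain empirical average, since the simulator can evaluate both the query function and MAJ_φ. A sketch under uniform sampling (the sample-size constant is an illustrative choice):

```python
import random

def estimate_majority_term(psi_1, phi, m, n, tau, samples=None):
    """Empirically estimate E_{U_{m+n}}[psi_1(x, x') * MAJ_phi(x)] to within
    roughly tau; psi_1 is the b = 1 restriction of the correlational query.
    O(1/tau^2) samples suffice up to log factors (constant chosen loosely)."""
    if samples is None:
        samples = int(10 / tau ** 2)
    total = 0.0
    for _ in range(samples):
        x = [random.choice((-1, 1)) for _ in range(m)]
        xp = [random.choice((-1, 1)) for _ in range(n)]
        s = sum(xi for pi, xi in zip(phi, x) if pi == -1)  # majority vote
        total += psi_1(x, xp) * (1 if s >= 0 else -1)
    return total / samples
```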
Observe that E_{U_{m+n}}[ψ^{cor}_{−1}(xx′) PAR_{ζ(φ)}(x′)] is simply the Fourier coefficient of ψ^{cor}_{−1} corresponding to ζ(φ) (more precisely, to the set S(ζ(φ)) = {m + i | ζ(φ)_i = −1} ⊆ [m + n]). We know that all Fourier coefficients of ψ^{cor}_{−1} of magnitude larger than τ/2 can be estimated to accuracy τ/4 using the KM algorithm in time poly(n, 1/τ) (see Section 6.1.4 or [25]); furthermore, the number of such coefficients is polynomial in n and 1/τ. We check whether any such coefficient (interpreted as a string of length n) is a satisfying assignment of φ. If we find a satisfying assignment, we are done; if not, we know that |ψ̂^{cor}_{−1}(S(ζ(φ)))| ≤ τ, since ζ(φ) is a solution to φ and we would have identified it as such had it been in the list of heavy coefficients. Thus, 0 is an (additive) τ-approximate estimate of the term E_{U_{m+n}}[ψ^{cor}_{−1}(xx′) PAR_{ζ(φ)}(x′)].

Thus, we have shown that we can either find a satisfying assignment to φ or simulate the SQ-O oracle response satisfactorily for all queries made by the algorithm Alg. In the latter case, the algorithm outputs h such that err_{U_{m+n+1}}(h, f_{φ,ζ(φ)}) ≤ 0.1, i.e., E_{U_{m+n+1}}[h(bxx′) f_{φ,ζ(φ)}(bxx′)] ≥ 4/5. Let h_b(xx′) ≡ h(bxx′); then

    E_{U_{m+n+1}}[h(bxx′) f_{φ,ζ(φ)}(bxx′)] = (1/2) E_{U_{m+n}}[h_1(xx′) MAJ_φ(x)] + (1/2) E_{U_{m+n}}[h_{−1}(xx′) PAR_{ζ(φ)}(x′)].

The above equation implies that E_{U_{m+n}}[h_{−1}(xx′) PAR_{ζ(φ)}(x′)] = ĥ_{−1}(S(ζ(φ))) ≥ 3/5, where ĥ_{−1}(S(ζ(φ))) is the Fourier coefficient of h_{−1} corresponding to the set S(ζ(φ)). Thus, by identifying all large Fourier coefficients of h_{−1} using the KM algorithm and checking whether any of the coefficients (interpreted as strings of length n) satisfies φ, a satisfying assignment of φ is obtained (since ζ(φ) has a large Fourier coefficient).

Thus, if φ is satisfiable, then using Alg it is possible to find, with high probability, a satisfying assignment to φ; if we fail to find a satisfying assignment, then φ is unsatisfiable. Hence, an algorithm to efficiently SQ learn C₁ does not exist unless RP = NP.

6.4.3 Strong Distribution-Independent SQ Learning

In this section, we consider the distribution-independent SQ learning model. As in the case of distribution-specific SQ/CSQ learning, we construct a concept class C₂ such that C₂ is distribution-independently SQ learnable with polynomial query complexity, but not efficiently distribution-independently SQ learnable unless RP = NP. Using the notation from Section 6.4.2, define g_{φ,y} as:

    g_{φ,y}(xx′) = PAR_y(x′)  if x = φ
    g_{φ,y}(xx′) = 1          otherwise

Thus, g_{φ,y} is the function that equals PAR_y(x′) on the part of the domain that has φ as the prefix, and is the constant function 1 otherwise. Define the concept class C₂ as follows:

    C₂ = {g_{φ,ζ(φ)} | φ is satisfiable}.

First, we show that the distribution-independent query complexity of SQ learning C₂ is bounded by a polynomial in n. The key idea is that either the constant function 1 is an accurate predictor (if the distribution has almost no mass on points that have φ as a prefix), or else it is possible to recover the 3-CNF formula φ using statistical queries, after which (using possibly unbounded computation) the assignment ζ(φ) can be obtained to learn g_{φ,ζ(φ)} exactly. Conversely, we show that C₂ cannot be efficiently learned in the distribution-independent SQ model unless RP = NP. As in the previous case, we show that an efficient SQ algorithm for learning C₂ can be used to find a satisfying assignment to any 3-CNF formula φ, if one exists. These results are proved formally as Theorems 6.4 and 6.5.
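As with f_{φ,y}, the concept g_{φ,y} is efficiently evaluable given its representation (φ, y); a minimal sketch under the same illustrative conventions as before:

```python
def g(phi, y, x, xp):
    """The concept g_{phi,y}: the parity PAR_y(x') on points whose first m
    bits equal phi, and the constant 1 elsewhere."""
    if list(x) != list(phi):
        return 1
    p = 1
    for yi, xi in zip(y, xp):
        if yi == -1:
            p *= xi
    return p
```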
Theorem 6.4. The distribution-independent query complexity of SQ learning C₂ is at most 2m + 1.

Proof. Let g_{φ,ζ(φ)} be the target function, D the target distribution, and ε > 0 the target error rate. We first test whether the constant 1 function is an accurate hypothesis. This can be tested using a single correlational statistical query (1, ε/4). If the value returned is at least 1 − 3ε/4, then E_D[g_{φ,ζ(φ)}(xx′)] ≥ 1 − ε, i.e., the constant 1 hypothesis is ε-accurate. If not, we know that g_{φ,ζ(φ)} is −1 on at least an ε/4 fraction of the domain (under the target distribution D).

Now, suppose that Pr_D[g_{φ,ζ(φ)}(xx′) = −1] ≥ ε/4. For i = 1, ..., m, define ψ_i : {−1, 1}^{m+n} → [−1, 1] as the following function: ψ_i(xx′) = 1 if x_i = 1, and ψ_i(xx′) = 0 otherwise. Consider the following expectation:

    E_D[ψ_i(xx′) − g_{φ,ζ(φ)}(xx′) ψ_i(xx′)].

If φ_i = −1 (i.e., the ith bit of the representation of the 3-CNF formula φ is −1), then g_{φ,ζ(φ)}(xx′) = 1 at all points where ψ_i(xx′) ≠ 0. This is because g_{φ,ζ(φ)} is the constant 1 function on points that do not have φ as a prefix, and if ψ_i(xx′) ≠ 0, then x_i = 1 ≠ φ_i. Thus, ψ_i(xx′) = g_{φ,ζ(φ)}(xx′) ψ_i(xx′) at every point, and hence the value of the above expectation is exactly 0. On the other hand, if φ_i = 1, then whenever g_{φ,ζ(φ)}(xx′) = −1 we have ψ_i(xx′) = 1, and whenever g_{φ,ζ(φ)}(xx′) = 1 we have ψ_i(xx′) − g_{φ,ζ(φ)}(xx′) ψ_i(xx′) = 0. Recall that Pr_D[g_{φ,ζ(φ)}(xx′) = −1] ≥ ε/4; thus the above expectation is at least ε/2.

Since E_D[ψ_i(xx′) − g_{φ,ζ(φ)}(xx′) ψ_i(xx′)] = E_D[ψ_i(xx′)] − E_D[g_{φ,ζ(φ)}(xx′) ψ_i(xx′)], an ε/8-accurate estimate of the above expectation can be obtained by making a target-independent query (ψ_i, ε/16) and a correlational query (ψ_i, ε/16). Thus, from the query responses it is possible to determine whether φ_i = 1 or φ_i = −1. Using 2m such queries, each bit of φ can be determined, and then ζ(φ) can be obtained, if necessary by brute force, to output g_{φ,ζ(φ)}.

Theorem 6.5. C₂ is not efficiently distribution-independently SQ learnable unless RP = NP.

Proof. Suppose to the contrary, and let Alg be a (possibly randomized) algorithm that efficiently learns C₂ in the distribution-independent SQ model. We show that if φ is a satisfiable 3-CNF formula then, using Alg, a satisfying assignment can be constructed with high probability.

Let φ be a satisfiable 3-CNF formula, so that g_{φ,ζ(φ)} ∈ C₂. Let D₂ be the distribution defined as follows: D₂(xx′) = 2^{−n} if x = φ, and D₂(xx′) = 0 otherwise; thus, D₂ is the uniform distribution on strings of the form φx′. Let g_{φ,ζ(φ)} be the target concept and D₂ the target distribution, and suppose ε ≤ 1/4. We run Alg to learn g_{φ,ζ(φ)}. We need to show that we can simulate the queries made by Alg to the oracle SQ-O(g_{φ,ζ(φ)}, D₂).

As in the proof of Theorem 6.3, the response to a target-independent query can be simulated by drawing a sample from D₂ of size Õ(1/τ²) and returning the empirical estimate. In the case of correlational queries too, the main idea is similar to that used in the proof of Theorem 6.3. Let (ψ^{cor}, τ) be a correlational query, and define ψ^{cor}_φ : {−1, 1}^n → [−1, 1] to be the function ψ^{cor}_φ(x′) = ψ^{cor}(φx′). Thus, E_{D₂}[ψ^{cor}(xx′) g_{φ,ζ(φ)}(xx′)] = E_{U_n}[ψ^{cor}_φ(x′) PAR_{ζ(φ)}(x′)]. This is just the Fourier coefficient of ψ^{cor}_φ on the subset S(ζ(φ)). Thus, we obtain all large (of magnitude greater than τ/2) Fourier coefficients of ψ^{cor}_φ and check whether any of them (i.e., their string representations of length n) is a satisfying assignment to φ.
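The check-the-heavy-coefficients step, used in the proofs of both Theorem 6.3 and Theorem 6.5, admits a schematic sketch. Here heavy_fourier_coefficients is a stand-in for the Kushilevitz-Mansour procedure [25] (assumed to return the index sets of all Fourier coefficients of magnitude at least the given threshold), and satisfies is an assumed 3-CNF evaluator; both are hypothetical helpers of this sketch.

```python
def answer_parity_side(psi_phi, phi_formula, n, tau,
                       heavy_fourier_coefficients, satisfies):
    """Either find a satisfying assignment hidden in a heavy Fourier
    coefficient of psi_phi, or certify that 0 is a tau-approximate answer
    to the correlational query (schematic sketch of Theorems 6.3/6.5)."""
    for T in heavy_fourier_coefficients(psi_phi, tau / 2):
        # Interpret the index set T as a ±1 string of length n.
        candidate = [-1 if i in T else 1 for i in range(n)]
        if satisfies(candidate, phi_formula):
            return ('sat', candidate)
    # zeta(phi) did not appear among the heavy coefficients, so its
    # coefficient has small magnitude and 0 is a valid answer.
    return ('answer', 0.0)
```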
If none of them is, then 0 is a valid (τ-approximate) answer to the query (ψ^{cor}, τ). Thus, we can either simulate access to the oracle SQ-O(g_{φ,ζ(φ)}, D₂) for Alg, or we find a satisfying assignment to φ. Suppose we do not find a satisfying assignment and Alg runs to completion; then, for the output hypothesis h, err_{D₂}(h, g_{φ,ζ(φ)}) ≤ 1/4, or equivalently, E_{D₂}[h(xx′) g_{φ,ζ(φ)}(xx′)] ≥ 1/2. Again define h_φ(x′) = h(φx′), so that E_{U_n}[h_φ(x′) PAR_{ζ(φ)}(x′)] ≥ 1/2. Thus, examining the heavy Fourier coefficients of h_φ reveals a satisfying assignment to φ. The above algorithm works correctly with high probability. If we are unable to find a satisfying assignment of φ, then we report φ as unsatisfiable. Thus, we get a randomized polynomial-time algorithm for 3-CNF satisfiability, implying RP = NP.

6.4.4 Strong Distribution-Independent CSQ Learning

Showing a separation between the computational complexity and the query complexity of distribution-independent CSQ learning is significantly more involved. The separation in this case is based on a stronger assumption: that W[P] does not have randomized fixed-parameter tractable algorithms. A fixed-parameter tractable algorithm for a parameterized decision problem (x, k) is allowed running time f(k) p(|x|), where p is a polynomial and f is an arbitrary function. A complete problem for W[P] is weighted circuit satisfiability: given a circuit φ and a parameter k, does there exist a satisfying assignment of Hamming weight k? It is widely believed that W[P] does not have randomized fixed-parameter tractable algorithms; such an algorithm would also imply a subexponential-time algorithm for circuit satisfiability (see [11]).

The construction relies on Feldman's recent result [15], where he shows that the class of disjunctions cannot be learned (for information-theoretic reasons) in the distribution-independent CSQ model. The class of disjunctions is, on the other hand, weakly learnable in the distribution-independent CSQ model [12]. Unlike in the case of the distribution-independent SQ model, this fact is required (note that the class of parities is not even weakly learnable in the SQ model), because any algorithm that uses only correlational statistical queries can only get information about the distribution by first finding some function that is (at least weakly) correlated with the target function under that distribution.

Let φ ∈ {−1, 1}^m denote a circuit (represented as a string) with n input variables. For a parameter ℓ, let ζ(φ) denote the lexicographically first satisfying assignment of φ of Hamming weight ℓ. Let n′ = 3ℓn and let Enc : {−1, 1}^n → {−1, 1}^{n′} be an encoding such that for any string s ∈ {−1, 1}^n with at most ℓ "−1" bits, Enc(s) ∈ {−1, 1}^{n′} has 3ℓ "−1" bits; furthermore, recovering any ℓ of these 3ℓ "−1" bits of Enc(s) allows us to reconstruct s. Such encodings can be constructed using Reed-Solomon codes and are defined below; we explain shortly why our construction requires them. Let ξ(φ) = Enc(ζ(φ)). For y ∈ {−1, 1}^{n′} (recall that n′ = 3ℓn), x ∈ {−1, 1}^m and x′ ∈ {−1, 1}^{n′}, define c_{φ,y} : {−1, 1}^{m+n′} → {−1, 1} as follows:

    c_{φ,y}(xx′) = OR_y(x′)  if x = φ
    c_{φ,y}(xx′) = 1         otherwise

Define the concept class

    C₃ = {c_{φ,ξ(φ)} | φ has a satisfying assignment of Hamming weight at most ℓ}.

Theorem 6.6 shows that the query complexity of CSQ learning C₃ is polynomial. This can be proved using the fact that OR is weakly learnable, together with a modification of Feldman's algorithm for learning singletons [14].
This enables us to recover φ, after which the lexicographically first satisfying assignment of φ can be easily constructed (using unbounded computation). On the other hand, Theorem 6.7 shows that an efficient CSQ algorithm for learning C₃ implies a poly(2^ℓ, n)-time algorithm for the weighted-circuit-SAT problem (given (φ, ℓ), does there exist a satisfying assignment for φ of Hamming weight ℓ?). The reduction requires us to set the accuracy of the learning algorithm to O(2^{−ℓ}), and it only allows us to recover one-third of the bits of the hidden OR. For this reason we use an OR over ξ(φ) = Enc(ζ(φ)) rather than over ζ(φ): recovering a third of the bits of ξ(φ) is enough to reconstruct ζ(φ). We now provide the details required for the proof. We first show that the class of disjunctions is weakly learnable in the CSQ model (distribution-independently).

Weak Distribution-Independent CSQ Learning of Disjunctions

Let DISJ_n = {OR_z | z ∈ {−1, 1}^n} be the class of disjunctions over n variables. Let x ∈ {−1, 1}^n and let x₁, x₂, ..., x_n be the input bits. Let W = {−1, x₁, x₂, ..., x_n} be a set of n + 1 functions, where −1 denotes the constant function that is −1 everywhere and x_i denotes the function w(x) = x_i. The following simple lemma shows that for every z ∈ {−1, 1}^n and every distribution D over {−1, 1}^n, there exists w ∈ W such that E_D[OR_z(x) w(x)] ≥ 1/(2n); this implies that the class DISJ_n is efficiently weakly distribution-independently CSQ learnable. Feldman [12] gives a proof of this lemma, but we include a proof for completeness.

Lemma 6.1. For every OR_z ∈ DISJ_n and every distribution D over {−1, 1}^n, there exists w ∈ W such that E_D[OR_z(x) w(x)] ≥ 1/(2n).

Proof. For a string z ∈ {−1, 1}^n, recall that S(z) = {i | z_i = −1}. Let β_z(x) = Σ_{i∈S(z)} x_i − |S(z)| + 1. Observe that OR_z(x) = sign(β_z(x)): if all x_i with i ∈ S(z) are 1, then OR_z(x) = 1 and β_z(x) = Σ_{i∈S(z)} x_i − |S(z)| + 1 = 1; otherwise, β_z(x) = Σ_{i∈S(z)} x_i − |S(z)| + 1 is at most −1. Note that β_z(x) is always an odd integer, and hence |β_z(x)| ≥ 1 for all x. Thus, for all x, β_z(x) sign(β_z(x)) ≥ 1. Then, for any distribution D over {−1, 1}^n, we have

    E_{x∼D}[β_z(x) OR_z(x)] = E_{x∼D}[β_z(x) sign(β_z(x))] ≥ 1.

Hence, either E_D[(−1)·OR_z(x)] ≥ 1/(2(|S(z)| − 1)), or there exists i ∈ S(z) such that E_D[x_i OR_z(x)] ≥ 1/(2|S(z)|).

Encoding Sparse Strings

We give here a simple implementation of the encoding of sparse strings described above. Let s be a string of length n that contains at most ℓ "−1" bits. We want to encode s as a string Enc(s) of length 3ℓn that has at most 3ℓ "−1" bits, such that identifying any ℓ of the 3ℓ positions that have "−1" suffices to recover the string s. For simplicity, let n be a power of 2, say n = 2^k. Given s ∈ {−1, 1}^n with at most ℓ "−1" bits, do the following. Identify the set S = {i | s_i = −1}; notice that |S| ≤ ℓ. We use the Reed-Solomon code to encode the elements of S using a set T, with |T| = 3ℓ, such that identifying any subset of T of size ℓ allows us to recover S. This is done by interpreting the elements of S as elements of the field F_{2^k} and constructing a polynomial of degree ℓ − 1 using the elements of S as its coefficients. The set T contains the evaluations of this polynomial at 3ℓ different points in F_{2^k}. Clearly, identifying any ℓ elements of T is enough to perform interpolation and hence obtain S. Now we can encode T as a string of length 3ℓn with at most 3ℓ "−1" bits as follows: let T = {t₁, ..., t_{3ℓ}} and view the string s′ = Enc(s) as 3ℓ blocks of length n; in the ith block, only the t_i-th bit is −1 and the rest are all 1. Notice that although the t_i are technically elements of F_{2^k}, they can be interpreted as integers less than n. Thus, identifying the positions of any ℓ "−1" bits of s′ allows for decoding and recovering s. Denote by Dec(s′) the string s whenever s′ is a string that has at least ℓ "−1" bits and must have been a (corrupted) version of Enc(s), and the null string if no such s exists.
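A compact sketch of this encode/decode scheme follows, written over a prime field GF(p) purely for readability (the construction above works over F_{2^k}; the structure is identical). The prime and the 1-based evaluation points are illustrative choices of this sketch.

```python
P = 1021  # a prime at least n; an illustrative stand-in for F_{2^k}

def poly_eval(coeffs, a, p=P):
    """Evaluate the polynomial with coefficient list coeffs (low degree
    first) at the point a, modulo p (Horner's rule)."""
    v = 0
    for c in reversed(coeffs):
        v = (v * a + c) % p
    return v

def encode(S, ell, p=P):
    """S: the (at most ell) '-1' positions of s, taken as field elements.
    Returns T = [f(1), ..., f(3*ell)], where f has the elements of S as
    its coefficients; block t_i of Enc(s) then carries one '-1' bit."""
    coeffs = sorted(S) + [0] * (ell - len(S))  # pad up to degree ell - 1
    return [poly_eval(coeffs, a, p) for a in range(1, 3 * ell + 1)]

def decode(points, ell, p=P):
    """Recover the coefficient list (hence S) from any ell pairs
    (evaluation point, value) by Lagrange interpolation mod p."""
    coeffs = [0] * ell
    for j, (aj, tj) in enumerate(points):
        basis = [1]   # numerator of the j-th Lagrange basis polynomial
        denom = 1
        for k2, (ak, _) in enumerate(points):
            if k2 == j:
                continue
            new = [0] * (len(basis) + 1)
            for i, b in enumerate(basis):  # multiply basis by (X - ak)
                new[i + 1] = (new[i + 1] + b) % p
                new[i] = (new[i] - ak * b) % p
            basis = new
            denom = (denom * (aj - ak)) % p
        scale = (tj * pow(denom, p - 2, p)) % p  # Fermat inverse mod p
        for i, b in enumerate(basis):
            coeffs[i] = (coeffs[i] + scale * b) % p
    return coeffs

# Example: S = {2, 7}, ell = 2; any 2 of the 6 evaluations recover S.
T = encode({2, 7}, 2)
assert decode([(1, T[0]), (4, T[3])], 2) == [2, 7]
```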
We can now prove the main result, stated as Theorems 6.6 and 6.7.

Theorem 6.6. The distribution-independent CSQ query complexity of C₃ is at most poly(n, 1/ε).

Proof. Let c_{φ,ξ(φ)} be the target concept from C₃ and let ε > 0 be the target error rate. We first test the hypothesis that is constant 1 everywhere. This can be tested using the correlational query (1, ε/4), where 1 is the constant 1 function. If the query response is greater than 1 − 3ε/4, then E_D[c_{φ,ξ(φ)}(xx′)] ≥ 1 − ε, and hence the constant 1 function is an ε-accurate hypothesis and we are done. Otherwise, Pr_D[c_{φ,ξ(φ)}(xx′) = −1] ≥ ε/4.

Let φ be the encoding of the circuit corresponding to the target function c_{φ,ξ(φ)}. Let D₁ be the marginal distribution over the first m bits of the target distribution D. Suppose h : {−1, 1}^m → {−1, 1} is a function satisfying the two properties: (i) h(φ) = 1, and (ii) Pr_{D₁}[h(x) = 1 ∧ x ≠ φ] ≤ ε/(100n). Let D₂ be the distribution D conditioned on the first m bits being φ, i.e., D₂(x′) = D(φx′)/D₁(φ). Let W be as in Lemma 6.1 and let w ∈ W be such that E_{D₂}[OR_{ξ(φ)}(x′) w(x′)] ≥ 1/(2n). Then define the function h_w(xx′) = w(x′) if h(x) = 1, and h_w(xx′) = 0 otherwise. Note that

    E_D[h_w(xx′) c_{φ,ξ(φ)}(xx′)] ≥ Pr_{D₁}[x = φ] · E_{D₂}[OR_{ξ(φ)}(x′) w(x′)] − Pr_{D₁}[h_w(xx′) ≠ 0 ∧ x ≠ φ] ≥ (ε/4)·(1/(2n)) − ε/(100n).

Now define h_w^i : {−1, 1}^{m+n′} → [−1, 1] to be the function where h_w^i(xx′) = w(x′) if h(x) = 1 and x_i = 1, and h_w^i(xx′) = 0 otherwise. Note that if φ_i = 1, then E_D[h_w^i(xx′) c_{φ,ξ(φ)}(xx′)] ≥ ε/(8n) − ε/(100n), as in the previous case. On the other hand, if φ_i = −1, then E_D[h_w^i(xx′) c_{φ,ξ(φ)}(xx′)] ≤ ε/(100n). This gap between the expectations in the two cases is large enough that the response to the correlational statistical query (h_w^i, ε/(100n)) distinguishes the case φ_i = 1 from the case φ_i = −1. Thus, m such correlational queries can be used to exactly determine φ, and then (possibly using unbounded computation) ξ(φ) may be determined, identifying c_{φ,ξ(φ)}.

Now suppose we did not know whether h satisfied properties (i) and (ii). We could still carry out the operations described above to come up with a candidate φ̃ and guess c_{φ̃,ξ(φ̃)} to be the target concept. We can then simply make the correlational query (c_{φ̃,ξ(φ̃)}, ε/4) to check whether c_{φ̃,ξ(φ̃)} is an ε-accurate hypothesis. Note that if h did indeed satisfy properties (i) and (ii), then φ̃ = φ.

The last part required to complete the proof is to show that it is easy to construct a random hypothesis h that satisfies properties (i) and (ii) with non-negligible (inverse-polynomial) probability. Then several such hypotheses may be generated and each tested until the right one (or one that is good enough) is found.
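The query functions used in this proof are straightforward to realize; a small sketch (h and w are callables, with conventions as in our earlier sketches):

```python
def make_hw(h, w):
    """h_w(x, x') = w(x') when the prefix test h accepts x, else 0;
    correlating h_w with the target isolates the OR behind the prefix."""
    return lambda x, xp: w(xp) if h(x) == 1 else 0.0

def make_hw_i(h, w, i):
    """h_w^i additionally requires x_i = 1; the gap between the responses
    to (h_w^i, eps/(100n)) in the two cases reveals the bit phi_i."""
    return lambda x, xp: w(xp) if (h(x) == 1 and x[i] == 1) else 0.0
```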
Constructing such a random hypothesis h is exactly what Feldman's algorithm for CSQ learning singletons does [14].

Theorem 6.7. C₃ is not efficiently distribution-independently CSQ learnable unless there exists a randomized algorithm that determines whether or not a given circuit φ has a satisfying assignment of Hamming weight at most ℓ in time poly(2^ℓ, n).

Proof. Suppose that there exists an efficient algorithm Alg that distribution-independently CSQ learns C₃. Let φ be a circuit. We show that if φ has a satisfying assignment of Hamming weight at most ℓ, then using Alg we can find such a solution with high probability.

Let z = ξ(φ) and S(z) = {i | z_i = −1}, and suppose that |S(z)| = k, where k ≤ 3ℓ. Note that ζ(φ) has Hamming weight at most ℓ. The function OR_z can be expressed as the following polynomial:

    OR_z(x′) = −1 + 2 Π_{i∈S(z)} (1 + x_i)/2 = −1 + 2^{−k+1} Σ_{T⊆S(z)} χ_T(x′),

where χ_T(x′) is the parity function over T. Let t_z be the polynomial

    t_z(x′) = −1 + 2^{−k+1} + 2^{−k+1} Σ_{T⊆S(z), |T|>k/3} χ_T(x′).

Define D_z to be the distribution where D_z(x′) = |t_z(x′)| / (Σ_{x′} |t_z(x′)|). It was shown in [?] that sign(t_z(x′)) = OR_z(x′), and hence for all x′, D_z(x′) OR_z(x′) = U_{n′}(x′) t_z(x′), where U_{n′} is the uniform distribution over n′ bits. Define D to be the distribution over {−1, 1}^{m+n′} where D(xx′) = D_z(x′) if x = φ, and D(xx′) = 0 if x ≠ φ.

Now we run the algorithm Alg to learn C₃ to accuracy ε = 2^{−k−2}, where D is the target distribution and c_{φ,ξ(φ)} is the target concept. We need to show that we can simulate the oracle CSQ-O for any query (ψ, τ). Let ψ_φ : {−1, 1}^{n′} → [−1, 1] be the function ψ_φ(x′) = ψ(φx′). Note that

    E_D[ψ(xx′) c_{φ,ξ(φ)}(xx′)] = E_{D_z}[ψ_φ(x′) OR_z(x′)] = E_{U_{n′}}[ψ_φ(x′) t_z(x′)].

Then observe that

    E_{U_{n′}}[ψ_φ(x′) t_z(x′)] = (−1 + 2^{−k+1}) ψ̂_φ(∅) + 2^{−k+1} Σ_{T⊆S(z), |T|>k/3} ψ̂_φ(T).

Note that the only Fourier coefficients of ψ_φ that matter are those corresponding to the empty set and to sets T ⊆ S(z) with |T| > k/3. There are at most 2^k subsets of S(z). Using the KM algorithm, we can identify, in time polynomial in 2^k, n and 1/τ, all Fourier coefficients of ψ_φ whose magnitude is at least τ/2^k. Now, if there exists a subset T ⊆ S(z) such that |ψ̂_φ(T)| ≥ τ/2^k and |T| > k/3, then it will be in the list of coefficients obtained above. But such a T can be converted into a string of length n′, say σ(T), such that Dec(σ(T)) = ζ(φ), which is a satisfying assignment of φ. Thus, for each heavy (magnitude ≥ τ/2^k) Fourier coefficient of ψ_φ, we check whether we obtain a satisfying assignment to φ; if not, then 0 is a valid (τ-approximate) answer to the query (ψ, τ).

The algorithm Alg outputs a hypothesis h. Let h_φ(x′) = h(φx′). Note that

    E_D[h(xx′) c_{φ,ξ(φ)}(xx′)] = E_{U_{n′}}[h_φ(x′) t_z(x′)] = (−1 + 2^{−k+1}) ĥ_φ(∅) + 2^{−k+1} Σ_{T⊆S(z), |T|>k/3} ĥ_φ(T) ≥ 1 − 2ε = 1 − 2^{−k−1}.

This means that Σ_{T⊆S(z), |T|>k/3} ĥ_φ(T) ≥ 1/2. Thus, as for the queries, identifying and decoding all large (magnitude ≥ 0.1/2^k) Fourier coefficients of h_φ reveals a satisfying assignment of φ of Hamming weight at most ℓ.
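The Fourier expansion of OR_z used at the start of this proof can be checked numerically; the following small sketch compares the direct definition with the expansion (illustrative only):

```python
import itertools
import random

def or_z(S, x):
    """OR_z(x) = 1 if every selected bit is 1, and -1 otherwise."""
    return 1 if all(x[i] == 1 for i in S) else -1

def or_z_fourier(S, x):
    """The expansion -1 + 2^{1-k} * sum over T subset of S of chi_T(x)."""
    k = len(S)
    total = 0.0
    for r in range(k + 1):
        for T in itertools.combinations(sorted(S), r):
            chi = 1
            for i in T:
                chi *= x[i]
            total += chi
    return -1 + 2 ** (1 - k) * total

S = {0, 2, 3}
for _ in range(20):
    x = [random.choice((-1, 1)) for _ in range(5)]
    assert or_z(S, x) == round(or_z_fourier(S, x))
```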
6.5 Upper Bounds

In this section, we consider the following question: how much computational power is sufficient for learning in these models, given that the query complexity is polynomial? In the setting where the learning algorithm has access to i.i.d. unlabeled examples from the underlying distribution, we show that an NP-oracle suffices: if the query complexity of a class C is polynomial, then there exists a polynomial-time algorithm that learns C with access to random unlabeled examples from the distribution and an NP-oracle. We use Szörényi's characterization of SQ learning, in which he shows that any algorithm that makes consistent queries from the class C learns C. We require an additional natural condition, C ∈ P: given c as a bit string, there is a polynomial-time algorithm that determines whether or not c is a valid representation of a concept in C.

Definition 6.7 (Consistent Learner [37]). Let (φ_i, τ_i)_{i≥1} be the queries made by an SQ learning algorithm Alg and let (v_i)_{i≥1} be the responses of the SQ oracle. The algorithm Alg is said to be consistent if, for every j < i, |E_D[φ_j(x) φ_i(x)] − v_j| ≤ τ_j.

Szörényi [37] proved the following result.

Theorem 6.8 ([37]). Let q be the query complexity of SQ learning a concept class C with respect to a distribution D to accuracy ε. Then there exists τ, with 1/τ bounded by poly(q, n, 1/ε), such that any consistent algorithm that makes queries of the form (c, τ), where c ∈ C, eventually makes a query of the form (c′, τ) where err(c′) ≤ ε/2. The total number of queries made by the algorithm is at most poly(q, n, 1/ε).

As a corollary of this result, we can show that an NP-oracle suffices for statistical query learning when the learning algorithm also has access to unlabeled examples from the underlying distribution. The key idea is that, using a large enough sample and an NP-oracle, it is possible to find queries from C that are consistent with the previous query responses. The following theorem follows easily from Theorem 6.8.

Theorem 6.9. Let q(C, D, ε) be the query complexity of SQ learning a concept class C ∈ P with respect to a distribution D to accuracy ε. Then there exists an algorithm that, for every target function f ∈ C and every distribution D, with access to random examples from D, the oracle CSQ-O(f, D) (since we assume access to i.i.d. random examples from the distribution, an oracle that responds only to correlational statistical queries is sufficient), and an NP-oracle, outputs c′ ∈ C such that err_D(c′, f) ≤ ε. The running time of the algorithm is poly(q(C, D, ε), n, 1/ε).

Proof. Let C be a concept class and assume that every c ∈ C has a representation of at most s(n, 1/ε) bits for some polynomial s. Let m = (16 s(n, 1/ε)/τ(n, 1/ε))² log(1/δ). Then a random sample x₁, ..., x_m of size m from the distribution D satisfies the following, with probability at least 1 − δ:

    ∀ c₁, c₂ ∈ C:  |E_D[c₁(x) c₂(x)] − (1/m) Σ_{k=1}^m c₁(x_k) c₂(x_k)| ≤ τ/4.

Now consider the following algorithm. First make any query (c₁, τ/4) and receive the response v₁; note that v₁ is also a valid query response for the query (c₁, τ). Given queries (c₁, τ/4), ..., (c_{i−1}, τ/4) with responses v₁, ..., v_{i−1}, find c_i ∈ C such that for every j < i it holds simultaneously that

    |(1/m) Σ_{k=1}^m c_i(x_k) c_j(x_k) − v_j| ≤ τ/2.    (6.1)

Such a c_i exists, because the true target concept f satisfies these relations. Moreover, such a c_i can be identified using an NP-oracle: c_i has a polynomial-size representation and can be obtained one bit at a time, since the fact that c_i ∈ C and the relations (6.1) can be verified in polynomial time. An algorithm that makes the queries (c₁, τ), (c₂, τ), ... and receives the responses v₁, v₂, ... is consistent.
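The one-bit-at-a-time search for a consistent c_i is where the NP-oracle is used; a very schematic sketch follows. Here np_oracle, decode_concept and in_class are assumed stand-ins of this sketch: np_oracle decides whether some bit string of the stated length satisfies a polynomial-time predicate, decode_concept turns a representation into an evaluable concept, and in_class tests membership in C.

```python
def next_consistent_concept(prev, sample, tau, rep_bits,
                            np_oracle, decode_concept, in_class):
    """Find, bit by bit, the representation of some c in C whose empirical
    correlations on the sample agree with all previous responses, as in
    relation (6.1). prev is a list of (c_j, v_j) pairs, c_j evaluable."""
    def consistent(bits):
        if not in_class(bits):
            return False
        c = decode_concept(bits)
        m = len(sample)
        return all(abs(sum(c(x) * cj(x) for x in sample) / m - vj) <= tau / 2
                   for cj, vj in prev)
    prefix = []
    for _ in range(rep_bits):
        # NP question: does a consistent representation extend prefix + [1]?
        extend = prefix + [1]
        if np_oracle(lambda bits: bits[:len(extend)] == extend
                     and consistent(bits), rep_bits):
            prefix.append(1)
        else:
            prefix.append(0)
    return decode_concept(prefix)
```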
By Theorem 6.8, there will therefore be some t = poly(q(C, D, ε), n, 1/ε) such that err_D(c_t, f) ≤ ε/2.

Remark 6.1. We note that the concept classes C₁, C₂, and C₃, defined in Sections 6.4.2, 6.4.3, and 6.4.4 respectively, are actually not recognized by a polynomial-time Turing machine. This is because, given a string of the form (φ, ζ(φ)), it is not possible to verify that ζ(φ) is indeed the lexicographically first satisfying assignment of φ unless P = NP. We note, however, that even then these classes can be learned with access to an NP-oracle, because C₁, C₂, C₃ ∈ P^NP, i.e., with access to an NP-oracle the lexicographically first satisfying assignments can be constructed (one bit at a time).

Remark 6.2. Under stronger cryptographic assumptions, we can construct classes C₁, C₂, C₃ ∈ P that are also not efficiently learnable in the respective statistical query models. The functions constructed can be of the form (s(z), z), where s(z) plays the role of the easy-to-find information and s is a one-way permutation (that is, one that cannot be inverted efficiently). An additional implication of such constructions is average-case computational hardness: learning is hard for most functions in C₁/C₂/C₃.

Appendix A

Equivalence between CSQ and ZCSQ learning

We prove the following theorem.

Theorem A.1. For some concept class C and distribution D, suppose that there is a parallel CSQ algorithm for learning C that takes T parallel time-steps, uses p processors, and makes queries to the oracle CSQ-O with approximation parameter 8τ. Then there is a parallel algorithm that queries only a ZCSQ oracle, uses T log(1/(8τ)) + 1 parallel time-steps, and uses polynomially many processors.

Proof. Fix some ideal function f ∈ C and a distribution D. We consider a CSQ≤(·,·,·) oracle that takes a query (φ, τ, θ) and responds as follows:

    CSQ≤(φ, τ, θ) = 1       if E_{x∼D}[φ(x)f(x)] ≤ θ − τ
    CSQ≤(φ, τ, θ) = 0       if E_{x∼D}[φ(x)f(x)] ≥ θ + τ
    CSQ≤(φ, τ, θ) = 1 or 0  otherwise

First, we use Feldman's result showing that a query (φ, 8τ) to the oracle CSQ-O can be simulated by log(1/(8τ)) + 1 queries to the oracle CSQ≤(·,·,·); furthermore, each query to the CSQ≤(·,·,·) oracle is of the form (φ, 4τ, θ), where |θ| ≥ 4τ. (See also Lemma 3.1.)

Next, suppose that there exists a function g such that E_{x∼D}[g(x)f(x)] = α. Assume that the function g is efficiently evaluable, that 1/α is polynomially bounded, and that g and α are known. Then consider the query ((αφ − θg)/2, ατ) made to a ZCSQ oracle. We claim that the response ZCSQ((αφ − θg)/2, ατ) is a valid response to the query (φ, 4τ, θ) made to a CSQ≤(·,·,·) oracle. This is because E_{x∼D}[(αφ(x) − θg(x))f(x)/2] = (α/2) E_{x∼D}[φ(x)f(x)] − (θ/2) E_{x∼D}[g(x)f(x)] = (α/2)(E_{x∼D}[φ(x)f(x)] − θ).

Feldman showed that if some concept class C is efficiently CSQ learnable, then there exist some polynomially bounded d and efficiently constructible (and evaluable) functions g₁, g₂, ..., g_d, such that for every f ∈ C there exists a g_i with E_{x∼D}[g_i(x)f(x)] ≥ 1/d (see Theorem 5.2 of [12]). However, we are still left with the problem of identifying such a g_i and the value E_{x∼D}[g_i(x)f(x)]. We simply guess an index i ∈ {1, ..., d} and a value α′ ∈ {jτ/d | j = 1, ..., d/τ}. Let i′ be such that, with g′ = g_{i′}, E_{x∼D}[g′(x)f(x)] = α ≥ 1/d, and let α′ = jτ/d be the smallest number of this form that is at least α.
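For concreteness, the single combined query used in this simulation can be sketched as follows (φ and g are callables; only the query construction is shown, with the oracle call itself left abstract):

```python
def threshold_to_zcsq_query(phi, g, alpha, theta):
    """The query ((alpha*phi - theta*g)/2, alpha*tau) sent to the ZCSQ
    oracle: its correlation with the target f equals
    (alpha/2) * (E[phi*f] - theta) whenever E[g*f] = alpha, so its sign
    answers the threshold query (phi, 4*tau, theta)."""
    return lambda x: (alpha * phi(x) - theta * g(x)) / 2
```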
Then we note that, in fact, the query response ZCSQ((α′φ − θg′)/2, α′τ) is a valid response to the query (φ, 4τ, θ) made to the oracle CSQ≤(·,·,·). To see this, suppose that E_{x∼D}[φ(x)f(x)] ≥ θ + 4τ. Then, since |α′ − α| ≤ α′τ and |θ| ≤ 1,

    E_{x∼D}[(α′φ(x) − θg′(x))f(x)/2] = (α′/2) E_{x∼D}[φ(x)f(x)] − (θα)/2
        ≥ (α′/2) E_{x∼D}[φ(x)f(x)] − (α′θ + α′τ)/2
        = (α′/2)(E_{x∼D}[φ(x)f(x)] − θ) − α′τ/2
        ≥ (α′/2)·4τ − α′τ/2 ≥ α′τ.

A similar argument shows that when E_{x∼D}[φ(x)f(x)] ≤ θ − 4τ, we have E_{x∼D}[(α′φ(x) − θg′(x))f(x)/2] ≤ −α′τ. Thus, a CSQ≤(·,·,·) oracle may be simulated by a ZCSQ oracle if such g′ and α′ are known.

Note that there are only d²/τ possible combinations (g_i, jτ/d) for i ∈ {1, ..., d} and j ∈ {1, ..., d/τ}. We run d²/τ copies of the parallel CSQ algorithm, and at least one of them, the one simulated using the correct (g′, α′), is guaranteed to produce a hypothesis h such that Pr_{x∼D}[f(x) ≠ h(x)] ≤ ε, or alternatively E_{x∼D}[f(x)h(x)] ≥ 1 − 2ε. The hypotheses produced by the other copies of the parallel algorithm may be incorrect. Suppose that the d²/τ hypotheses are h₁, ..., h_{d²/τ}, and that at least one of them, say h*, is guaranteed to satisfy E_{x∼D}[f(x)h*(x)] ≥ 1 − 2ε.

We say that h_i beats h_j if E[h_i(x)f(x)] ≥ E[h_j(x)f(x)], i.e., E[(h_i(x) − h_j(x))f(x)] ≥ 0. Note that if the ZCSQ query ((h_j − h_i)/2, τ) receives response 1, we can be sure that E[h_i(x)f(x)] ≥ E[h_j(x)f(x)] − 2τ. We compare all d⁴/τ² possible pairs, and define the rank of h_i to be the number of j for which the ZCSQ oracle returned 1 on the query ((h_j − h_i)/2, τ). Let h be a hypothesis of highest rank. Recall that h* is the hypothesis that satisfies E_{x∼D}[h*(x)f(x)] ≥ 1 − 2ε. Now, either h = h*, or one of the following two must hold: (i) the ZCSQ query ((h* − h)/2, τ) returned 1, or (ii) there exists some h′ such that the two ZCSQ queries ((h′ − h)/2, τ) and ((h* − h′)/2, τ) both returned 1. In the first case, it is immediate that E_{x∼D}[h(x)f(x)] ≥ E_{x∼D}[h*(x)f(x)] − 2τ ≥ 1 − 2(ε + τ). In the second case, E_{x∼D}[h(x)f(x)] ≥ E_{x∼D}[h′(x)f(x)] − 2τ ≥ E_{x∼D}[h*(x)f(x)] − 4τ ≥ 1 − 2(ε + 2τ). The result follows by rescaling appropriately and outputting h.

Note that the simulation of the parallel CSQ algorithm for each of the d²/τ guesses can be done in parallel (on p·d²/τ processors; each copy of the parallel CSQ algorithm requires p processors). This takes T log(1/(8τ)) parallel time-steps. The d⁴/τ² comparison queries can then be made in parallel as well, on d⁴/τ² processors, in one parallel time-step. Since, in our model of parallel computation, each processor can broadcast at every parallel time-step, the highest-ranked hypothesis h can be output at this stage. Thus, the total number of parallel time-steps required is T log(1/(8τ)) + 1, and the total number of processors required is max{p·d²/τ, d⁴/τ²}.
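The final rank-based selection admits a compact sketch (zcsq is an assumed oracle interface returning 0 or 1; the pairwise comparisons would run in parallel in the model above):

```python
def select_highest_rank(hypotheses, zcsq, tau):
    """Pick a hypothesis of highest rank, where h_i gains a point for each
    h_j such that the ZCSQ query ((h_j - h_i)/2, tau) returns 1 (certifying
    E[h_i f] >= E[h_j f] - 2*tau), as at the end of the proof of
    Theorem A.1."""
    def beats(hi, hj):
        return zcsq(lambda x: (hj(x) - hi(x)) / 2, tau) == 1
    ranks = [sum(1 for hj in hypotheses if hj is not hi and beats(hi, hj))
             for hi in hypotheses]
    return hypotheses[max(range(len(hypotheses)), key=lambda i: ranks[i])]
```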
Bibliography

[1] J. Aslam and S. Decatur. General bounds on statistical query learning and PAC learning with noise via hypothesis boosting. Information and Computation, 141(2):85–118, 1998.

[2] P. L. Bartlett. Learning with a slowly changing distribution. In Proceedings of the Conference on Learning Theory (COLT), 1992.

[3] N. H. Barton and B. Charlesworth. Why sex and recombination? Science, 281:1986–1990, 1998.

[4] R. D. Barve and P. M. Long. On the complexity of learning from drifting distributions. Information and Computation, 138(2):101–123, 1997.

[5] A. Blum, A. Kalai, and H. Wasserman. Noise-tolerant learning, the parity problem, and the statistical query model. Journal of the ACM, 50(4):506–519, 2003.

[6] Avrim Blum, Merrick Furst, Jeffery Jackson, Michael J. Kearns, Yishay Mansour, and Steven Rudich. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In Proceedings of the Symposium on the Theory of Computing (STOC), 1994.

[7] Nader Bshouty and Vitaly Feldman. On using extended statistical queries to avoid membership queries. Journal of Machine Learning Research (JMLR), 2002.

[8] Charles Darwin. On the Origin of Species by Means of Natural Selection. John Murray, 1859.

[9] Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. In Neural Information Processing Systems (NIPS), 2005.

[10] D. Diochnos and G. Turán. On evolvability: The swapping algorithm, product distributions, and covariance. In Symposium on Stochastic Algorithms, Foundations and Applications, 2009.

[11] R. G. Downey and M. R. Fellows. Fixed-parameter tractability and completeness I: Basic results. SIAM Journal on Computing, 24(4):873–921, 1995.

[12] Vitaly Feldman. Evolvability from learning algorithms. In Proceedings of the Symposium on the Theory of Computing (STOC), 2008.

[13] Vitaly Feldman. A characterization of statistical query learning. In Proceedings of the Symposium on the Theory of Computing (STOC), 2009.

[14] Vitaly Feldman. Robustness of evolvability. In Proceedings of the Conference on Learning Theory (COLT), 2009.

[15] Vitaly Feldman. Distribution-independent evolution of linear threshold functions. In Proceedings of the Conference on Learning Theory (COLT), 2011.

[16] Vitaly Feldman and Varun Kanade. Computational bounds on statistical query learning. In Proceedings of the Conference on Learning Theory (COLT), 2012.

[17] Vitaly Feldman, Homin Lee, and Rocco Servedio. Lower bounds and hardness amplification for learning shallow monotone formulas. In Proceedings of the Conference on Learning Theory (COLT), 2011.

[18] R. A. Fisher. The Genetical Theory of Natural Selection. Clarendon Press, 1930.

[19] D. P. Helmbold and P. M. Long. Tracking drifting concepts by minimizing disagreements. Machine Learning, 14(1):27–46, 1994.

[20] Varun Kanade. Evolution with recombination. In Proceedings of the IEEE Symposium on Foundations of Computer Science (FOCS), 2011.

[21] Varun Kanade, Jennifer Wortman Vaughan, and Leslie G. Valiant. Evolution with drifting targets. In Proceedings of the Conference on Learning Theory (COLT), 2010.

[22] Michael J. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983–1006, 1998.

[23] Michael J. Kearns and Umesh Vazirani. An Introduction to Computational Learning Theory. The MIT Press, 1994.

[24] A. Kuh, T. Petsche, and R. Rivest. Incrementally learning time-varying half-planes. In Neural Information Processing Systems (NIPS), 1991.

[25] E. Kushilevitz and Y. Mansour. Learning decision trees using the Fourier spectrum. SIAM Journal on Computing, 22(6):1331–1348, 1993.

[26] A. Livnat, C. Papadimitriou, J. Dushoff, and M. W. Feldman. A mixability theory for the role of sex in evolution. Proceedings of the National Academy of Sciences, 105(50):19803–19808, 2008.

[27] A. Livnat, C. Papadimitriou, N. Pippenger, and M. W. Feldman. Sex, mixability, and modularity. Proceedings of the National Academy of Sciences, 107(4):1452–1457, 2010.

[28] Y. Mansour. Learning Boolean functions via the Fourier transform. In Theoretical Advances in Neural Computation and Learning, pages 391–424. Kluwer, 1994.

[29] L. Michael. Evolvability via the Fourier transform, 2010. To appear in Theoretical Computer Science.
[30] H. J. Muller. Some genetic aspects of sex. The American Naturalist, 66(703):118–138, 1932.

[31] R. O'Donnell. Computational Applications of Noise Sensitivity. PhD thesis, MIT, 2003.

[32] H. Simon. A characterization of strong learnability in the statistical query model. In Proceedings of the Symposium on Theoretical Aspects of Computer Science (STACS), pages 393–404, 2007.

[33] J. Maynard Smith. The Theory of Evolution. Penguin Books, 1958.

[34] J. Maynard Smith. What use is sex? Journal of Theoretical Biology, 30(2):319–335, 1971.

[35] J. Maynard Smith. The Evolution of Sex. Cambridge University Press, Cambridge, 1978.

[36] J. Maynard Smith. The evolution of recombination. In The Evolution of Sex: An Examination of Current Ideas (edited by R. E. Michod and B. R. Levin), pages 106–125, 1988.

[37] B. Szörényi. Characterizing statistical query learning: Simplified notions and proofs. In Proceedings of Algorithmic Learning Theory (ALT), pages 186–200, 2009.

[38] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

[39] Leslie G. Valiant. Evolvability. Journal of the ACM, 2009. An earlier version appears as ECCC Technical Report TR06-120.

[40] Paul Valiant. Evolvability of real-valued functions. In Proceedings of Innovations in Theoretical Computer Science (ITCS), 2012.

[41] S. Wright. Evolution in Mendelian populations. Genetics, 16:97–159, 1931.

[42] Ke Yang. New lower bounds for statistical query learning. Journal of Computer and System Sciences, 70(4):485–509, 2005.