Bayesian Grammar Induction for Language Modeling

Stanley F. Chen
sfc@das.harvard.edu

TR-01-95, January 1995
Center for Research in Computing Technology
Aiken Computation Laboratory, Division of Applied Sciences
Harvard University, Cambridge, MA 02138

Abstract

We describe a corpus-based induction algorithm for probabilistic context-free grammars. The algorithm employs a greedy heuristic search within a Bayesian framework, and a post-pass using the Inside-Outside algorithm. We compare the performance of our algorithm to n-gram models and the Inside-Outside algorithm in three language modeling tasks. In two of these domains, our algorithm outperforms these other techniques, marking the first time a grammar-based language model has surpassed n-gram modeling in a task of at least moderate size.

Introduction

In applications such as speech recognition, handwriting recognition, and spelling correction, performance is limited by the quality of the language model utilized (Bahl et al., 1978; Baker, 1975; Kernighan et al., 1990; Srihari and Baltus, 1992). However, static language modeling performance has remained basically unchanged since the advent of n-gram language models forty years ago (Shannon, 1951). Yet, n-gram language models can only capture dependencies within an n-word window, where currently the largest practical n for natural language is three, and many dependencies in natural language occur beyond a three-word window. In addition, n-gram models are extremely large, thus making them difficult to implement efficiently in memory-constrained applications.

An appealing alternative is grammar-based language models. Language models expressed as a probabilistic grammar tend to be more compact than n-gram language models, and have the ability to model long-distance dependencies (Lari and Young, 1990; Resnik, 1992; Schabes, 1992). However, to date there has been little success in constructing grammar-based language models competitive with n-gram models in problems of any magnitude.

In this paper, we describe a corpus-based induction algorithm for probabilistic context-free grammars that outperforms n-gram models and the Inside-Outside algorithm (Baker, 1979) in medium-sized domains. This result marks the first time a grammar-based language model has surpassed n-gram modeling in a task of at least moderate size. The algorithm employs a greedy heuristic search within a Bayesian framework, and a post-pass using the Inside-Outside algorithm.

Grammar Induction as Search

Grammar induction can be framed as a search problem, and has been framed as such almost without exception in past research (Angluin and Smith, 1983). The search space is taken to be some class of grammars; for example, in our work we search within the space of probabilistic context-free grammars. The objective function is taken to be some measure dependent on the training data; one generally wants to find a grammar that in some sense accurately models the training data.

Most work in language modeling, including n-gram models and the Inside-Outside algorithm, falls under the maximum-likelihood paradigm, where one takes the objective function to be the likelihood of the training data given the grammar. However, the optimal grammar under this objective function is one which generates only strings in the training data and no other strings. Such grammars are poor language models, as they overfit the training data and do not model the language at large. In n-gram models and the Inside-Outside algorithm, this issue is evaded by bounding the size and form of the grammars considered, so that the "optimal" grammar cannot be expressed. However, in our work we do not wish to limit the size of the grammars considered.
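To see the degeneracy concretely, the small sketch below (ours, not from the paper) compares a "memorizing" model, which assigns each training sentence its empirical frequency, against a model that reserves a little probability mass for unseen strings. The corpus, the smoothing constant, and the assumed size of the unseen universe are arbitrary illustrative values.

```python
import math
from collections import Counter

# Toy corpus; any list of sentences would do.
train = ["bob talks slowly", "mary talks slowly", "bob talks slowly"]
test = ["mary talks quickly"]

counts = Counter(train)
total = len(train)

def memorizer(s):
    # Maximum-likelihood "grammar": each training sentence gets its empirical
    # frequency, every other string gets probability zero.
    return counts[s] / total

def smoothed(s, alpha=0.1, unseen_universe=1000):
    # Toy alternative that reserves a little mass for unseen strings
    # (the universe size is an arbitrary assumption for illustration).
    return (1 - alpha) * counts[s] / total + alpha / unseen_universe

def log_likelihood(model, data):
    return sum(math.log(model(s)) if model(s) > 0 else float("-inf") for s in data)

print(log_likelihood(memorizer, train))  # highest achievable training log-likelihood
print(log_likelihood(smoothed, train))   # slightly worse on the training data
print(log_likelihood(memorizer, test))   # -inf: an unseen sentence gets zero probability
print(log_likelihood(smoothed, test))    # finite: some generalization to new strings
```

The memorizer wins on training likelihood but is useless on any sentence it has not seen, which is exactly the failure mode described above.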
The basic problem behind the maximum-likelihood objective function is that it does not encompass the compelling intuition behind Occam's Razor, that simpler (or smaller) grammars are preferable over complex (or larger) grammars. A factor in the objective function that favors smaller grammars over large ones can prevent the objective function from preferring grammars that overfit the training data.

Solomonoff (1964) presents a Bayesian grammar induction framework that includes such a factor in a motivated manner. The goal of grammar induction is taken to be finding the grammar with the largest a posteriori probability given the training data, that is, finding the grammar G' where

    G' = arg max_G p(G | O)

and where we denote the training data as O, for observations. As it is unclear how to estimate p(G | O) directly, we apply Bayes' Rule and get

    G' = arg max_G p(O | G) p(G) / p(O) = arg max_G p(O | G) p(G)

Hence, we can frame the search for G' as a search with the objective function p(O | G) p(G), the likelihood of the training data multiplied by the prior probability of the grammar. We satisfy the goal of favoring smaller grammars by choosing a prior that assigns higher probabilities to such grammars. In particular, Solomonoff proposes the use of the universal a priori probability (Solomonoff, 1960), which is closely related to the minimum description length principle later proposed by Rissanen (1978). In the case of grammatical language modeling, this corresponds to taking

    p(G) = 2^{-l(G)}

where l(G) is the length of the description of the grammar in bits. The universal a priori probability has many elegant properties, the most salient of which is that it dominates all other enumerable probability distributions multiplicatively.[1]

[1] A very thorough discussion of the universal a priori probability is given by Li and Vitanyi (1993).

Search Algorithm

As described above, we take grammar induction to be the search for the grammar G' that optimizes the objective function p(O | G) p(G). While this framework does not restrict us to a particular grammar formalism, in our work we consider only probabilistic context-free grammars.

We assume a simple greedy search strategy. We maintain a single hypothesis grammar which is initialized to a small, trivial grammar. We then try to find a modification to the hypothesis grammar, such as the addition of a grammar rule, that results in a grammar with a higher score on the objective function. When we find a superior grammar, we make this the new hypothesis grammar. We repeat this process until we can no longer find a modification that improves the current hypothesis grammar.

For our initial grammar, we choose a grammar that can generate any string, to assure that the grammar can cover the training data. The initial grammar is listed in Table 1. The sentential symbol S expands to a sequence of X's, where X expands to every other nonterminal symbol in the grammar. Initially, the set of nonterminal symbols consists of a different nonterminal symbol expanding to each terminal symbol. Notice that this grammar models a sentence as a sequence of independently generated nonterminal symbols. We maintain this property throughout the search process, that is, for every symbol A' that we add to the grammar, we also add a rule X → A'. This assures that the sentential symbol can expand to every symbol; otherwise, adding a symbol will not affect the probabilities that the grammar assigns to strings.

    S → S X    (1 - ε)
    S → X      (ε)
    X → A      (p(A))    for all A ∈ N - {S, X}
    A → a      (1)       for all a ∈ T

    N = the set of all nonterminal symbols
    T = the set of all terminal symbols
    Probabilities for each rule are in parentheses.

Table 1: Initial hypothesis grammar
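The sketch below shows how the initial grammar of Table 1 and the greedy loop might be realized in code. It is a simplified illustration under our own assumptions, not the paper's implementation: the rule representation, the symbol names, the value of the constant, and the placeholder functions propose_moves and score_delta (which stand in for the move set and scoring described in the following sections) are ours.

```python
EPS = 0.01  # the small constant for S -> X in Table 1; the exact value is arbitrary

def initial_grammar(terminals):
    """The trivial covering grammar of Table 1: S -> S X | X, X -> A for every
    preterminal A, and A -> a for every terminal a."""
    preterms = ["A_" + t for t in terminals]
    rules = [("S", ("S", "X"), 1 - EPS), ("S", ("X",), EPS)]
    rules += [("X", (a,), 1.0 / len(preterms)) for a in preterms]  # p(A); uniform before any data
    rules += [(a, (t,), 1.0) for a, t in zip(preterms, terminals)]
    return rules

def greedy_induce(grammar, sentences, propose_moves, score_delta):
    """Greedy hill climbing: apply any proposed rule addition that increases the
    objective p(O|G) p(G); stop when no addition helps."""
    improved = True
    while improved:
        improved = False
        for lhs, rhs in propose_moves(grammar, sentences):
            if score_delta(grammar, (lhs, rhs), sentences) > 0:
                # Add the new rule plus the concomitant X -> A' rule; the
                # probabilities here are placeholders, reset deterministically
                # as described under Parameter Training below.
                grammar += [(lhs, rhs, 1.0), ("X", (lhs,), 0.0)]
                improved = True
    return grammar
```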
We use the term move set to describe the set of modifications we consider to the current hypothesis grammar to hopefully produce a superior grammar. Our move set includes the following moves:

Move 1: Create a rule of the form A → BC
Move 2: Create a rule of the form A → B|C

For any context-free grammar, it is possible to express an equivalent grammar using only rules of these forms. As mentioned before, with each new symbol A we also create a rule X → A.

Evaluating the Objective Function

Consider the task of calculating the objective function p(O | G) p(G) for some grammar G. Calculating p(G) = 2^{-l(G)} is inexpensive; however, calculating p(O | G) requires a parsing of the entire training data. We cannot afford to parse the training data for each grammar considered; indeed, to ever be practical for data sets of millions of words, it seems likely that we can only afford to parse the data once. To achieve this goal, we employ several approximations.

First, notice that we do not ever need to calculate the absolute value of the objective function; we need only to be able to distinguish when a move applied to the current hypothesis grammar produces a grammar that has a higher score on the objective function, that is, we need only to be able to calculate the difference in the objective function resulting from a move. This can be done efficiently if we can quickly approximate how the probability of the training data changes when a move is applied.

To make this possible, we approximate the probability of the training data p(O | G) by the probability of the single most probable parse, or Viterbi parse, of the training data. Furthermore, instead of recalculating the Viterbi parse of the training data from scratch when a move is applied, we use heuristics to predict how a move will change the Viterbi parse. For example, consider the case where the training data consists of the two sentences

    O = {Bob talks slowly, Mary talks slowly}

In Figure 1, we display the Viterbi parse of this data under the initial hypothesis grammar used in our algorithm. Now, let us consider the move of adding the rule B → A_talks A_slowly to the initial grammar (as well as the concomitant rule X → B). A reasonable heuristic for predicting how the Viterbi parse will change is to replace adjacent X's that expand to A_talks and A_slowly respectively with a single X that expands to B, as displayed in Figure 2. This is the actual heuristic we use for moves of the form A → BC, and we have analogous heuristics for each move in our move set. By predicting the differences in the Viterbi parse resulting from a move, we can quickly estimate the change in the probability of the training data.

[Figure 1: Initial Viterbi parse of the two sentences under the initial hypothesis grammar.]

[Figure 2: Predicted Viterbi parse after adding the rule B → A_talks A_slowly.]

Notice that our predicted Viterbi parse can stray a great deal from the actual Viterbi parse, as errors can accumulate as move after move is applied. To minimize these effects, we process the training data incrementally. In brief, we delay the parsing of a sentence until after we find the optimal grammar over the previous sentences in the training data. This should yield more accurate Viterbi parses than if we simply parse the whole corpus with the initial hypothesis grammar, and we still parse each sentence but once.
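The following sketch illustrates the prediction heuristic for concatenation moves on a flat parse like that of Figure 1, together with a rough estimate of the resulting change in log probability. The flat-sequence representation of a parse, the function names, and the example probabilities are our own simplifications; in particular, the estimate ignores the renormalization of the p(A) parameters after a new symbol is added.

```python
import math

def apply_concat_move(parse, b, c, new_symbol):
    """Predict the new Viterbi parse after adding the rule new_symbol -> b c:
    greedily replace each adjacent pair (b, c) in the flat parse by new_symbol."""
    out, i = [], 0
    while i < len(parse):
        if i + 1 < len(parse) and parse[i] == b and parse[i + 1] == c:
            out.append(new_symbol)
            i += 2
        else:
            out.append(parse[i])
            i += 1
    return out

def estimated_log_prob(parse, p, eps=0.01):
    """Rough log probability of a flat parse under the Table 1 grammar: one
    S -> S X factor per extra symbol, one S -> X factor, and one X -> A factor
    per symbol.  Renormalization of p after a move is ignored."""
    return ((len(parse) - 1) * math.log(1 - eps) + math.log(eps)
            + sum(math.log(p[a]) for a in parse))

# Hypothetical numbers, mirroring the Figure 1 / Figure 2 example:
p = {"A_bob": 0.25, "A_mary": 0.25, "A_talks": 0.25, "A_slowly": 0.2, "B": 0.05}
old = ["A_bob", "A_talks", "A_slowly"]
new = apply_concat_move(old, "A_talks", "A_slowly", "B")   # ['A_bob', 'B']
delta = estimated_log_prob(new, p) - estimated_log_prob(old, p)
print(new, delta)
```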
Parameter Training

In this section, we describe how the parameters of our grammar, the probability associated with each grammar rule, are set. Ideally, in evaluating the objective function for a grammar we should use its optimal parameter settings given the training data, as this is the full score that the given grammar can achieve. However, searching for optimal parameter values is extremely expensive computationally. Instead, we grossly approximate the optimal values by deterministically setting parameters based on the Viterbi parse of the training data parsed so far. We rely on the post-pass, described later, to refine parameter values.

Referring to the rules in Table 1, the parameter ε is set to an arbitrary small constant. The values of the parameters p(A) are set to the (smoothed) frequency of the X → A reduction in the Viterbi parse of the data seen so far. The remaining symbols are set to expand uniformly among their possible expansions.

Constraining Moves

Consider the move of creating a rule of the form A → BC. This corresponds to k³ different specific rules that might be created, where k is the current number of symbols in the grammar. As it is too computationally expensive to consider each of these rules at every point in the search, we use heuristics to constrain which moves are appraised.

For the left-hand side of a rule, we always create a new symbol. This heuristic selects the optimal choice the vast majority of the time; however, under this constraint the moves described earlier in this section cannot yield arbitrary context-free languages. To partially address this, we add the move

Move 3: Create a rule of the form A → AB|B

With this iteration move, we can construct grammars that generate arbitrary regular languages. As yet, we have not implemented moves that enable the construction of arbitrary context-free grammars; this belongs to future work.

To constrain the symbols we consider on the right-hand side of a new rule, we use what we call triggers.[2] A trigger is a phenomenon in the Viterbi parse of a sentence that is indicative that a particular move might lead to a better grammar. For example, in Figure 1 the fact that the symbols A_talks and A_slowly occur adjacently is indicative that it could be profitable to create a rule B → A_talks A_slowly. We have developed a set of triggers for each move in our move set, and only consider a specific move if it is triggered in the sentence currently being parsed in the incremental processing.

[2] This is not to be confused with the use of the term trigger in dynamic language modeling.
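The paper does not spell out its trigger definitions beyond the example above, so the following is only a plausible sketch of a trigger for the concatenation move: propose one new-symbol rule for each pair of symbols that occurs adjacently in the Viterbi parse of the current sentence. The names and the frequency ordering are our own choices.

```python
from collections import Counter
from itertools import count

_fresh = count(1)

def concat_triggers(viterbi_parse):
    """Trigger for Move 1 (A -> BC): every pair of symbols that occurs
    adjacently in the flat Viterbi parse of the current sentence."""
    return Counter(zip(viterbi_parse, viterbi_parse[1:]))

def propose_concat_moves(viterbi_parse):
    """Propose one rule per triggered pair, most frequent pair first,
    always with a freshly created left-hand-side symbol."""
    for (b, c), _ in concat_triggers(viterbi_parse).most_common():
        yield (f"N{next(_fresh)}", (b, c))

# Example on the Figure 1 parse of "Bob talks slowly":
for rule in propose_concat_moves(["A_bob", "A_talks", "A_slowly"]):
    print(rule)   # e.g. ('N1', ('A_bob', 'A_talks')), ('N2', ('A_talks', 'A_slowly'))
```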
Post-Pass

A conspicuous shortcoming in our search framework is that the grammars in our search space are fairly unexpressive. Firstly, recall that our grammars model a sentence as a sequence of independently generated symbols; however, in language there is a large dependence between adjacent constituents. Furthermore, the only free parameters in our search are the parameters p(A); all other symbols (except S) are fixed to expand uniformly. These choices were necessary to make the search tractable.

To address this issue, we use an Inside-Outside algorithm post-pass. Our methodology is derived from that described by Lari and Young (1990). We create n new nonterminal symbols {X_1, ..., X_n}, and create all rules of the form:

    X_i → X_j X_k    i, j, k ∈ {1, ..., n}
    X_i → A          i ∈ {1, ..., n}, A ∈ N_old - {S, X}

N_old denotes the set of nonterminal symbols acquired in the initial grammar induction phase, and X_1 is taken to be the new sentential symbol. These new rules replace the first three rules listed in Table 1. The parameters of these rules are initialized randomly. Using this grammar as the starting point, we run the Inside-Outside algorithm on the training data until convergence. In other words, instead of using the naive S → SX|X rule to attach symbols together in parsing data, we now use the X_i rules and depend on the Inside-Outside algorithm to train these randomly initialized rules intelligently.

This post-pass allows us to express dependencies between adjacent symbols. In addition, it allows us to train parameters that were fixed during the initial grammar induction phase.
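As a minimal sketch, the post-pass grammar just described could be assembled as follows before handing it to an Inside-Outside reestimator (not shown). The random initialization, the per-symbol normalization, and all names are our own assumptions.

```python
import random

def postpass_grammar(n, n_old, seed=0):
    """Create the post-pass rules: X_i -> X_j X_k for all i, j, k, and
    X_i -> A for every nonterminal A acquired in the induction phase
    (n_old excludes S and X).  Parameters are random, normalized per X_i."""
    rng = random.Random(seed)
    rules = []
    for i in range(1, n + 1):
        lhs = f"X{i}"
        rhs_list = [(f"X{j}", f"X{k}") for j in range(1, n + 1)
                                       for k in range(1, n + 1)]
        rhs_list += [(a,) for a in n_old]
        weights = [rng.random() for _ in rhs_list]
        total = sum(weights)
        rules += [(lhs, rhs, w / total) for rhs, w in zip(rhs_list, weights)]
    return rules   # X1 is taken to be the new sentential symbol

rules = postpass_grammar(n=2, n_old=["A_bob", "A_talks", "A_slowly", "B"])
print(len(rules))   # 2 * (2*2 + 4) = 16 rules
```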
Previous Work

As mentioned, this work employs the Bayesian grammar induction framework described by Solomonoff (1960; 1964). However, Solomonoff does not specify a concrete search algorithm and only makes suggestions as to its nature. Similar research includes work by Cook et al. (1976) and Stolcke and Omohundro (1994). This work also employs a heuristic search within a Bayesian framework. However, a different prior probability on grammars is used, and the algorithms are only efficient enough to be applied to small data sets.

The grammar induction algorithms most successful in language modeling include the Inside-Outside algorithm (Lari and Young, 1990; Lari and Young, 1991; Pereira and Schabes, 1992), a special case of the Expectation-Maximization algorithm (Dempster et al., 1977), and work by McCandless and Glass (1993). In the latter work, McCandless uses a heuristic search procedure similar to ours, but a very different search criterion. To our knowledge, neither algorithm has surpassed the performance of n-gram models in a language modeling task of substantial scale.

Results

To evaluate our algorithm, we compare the performance of our algorithm to that of n-gram models and the Inside-Outside algorithm. For n-gram models, we tried n = 1, ..., 10 for each domain. For smoothing a particular n-gram model, we took a linear combination of all lower-order n-gram models. In particular, we follow standard practice (Jelinek and Mercer, 1980; Bahl et al., 1983; Brown et al., 1992) and take the smoothed i-gram probability to be a linear combination of the i-gram frequency in the training data and the smoothed (i - 1)-gram probability, that is,

    p_s(w' | W = w_1 ... w_{i-1}) = λ_{i,c(W)} c(Ww') / c(W) + (1 - λ_{i,c(W)}) p_s(w' | w_2 ... w_{i-1})

where c(W) denotes the count of the word sequence W in the training data. The smoothing parameters λ_{i,c(W)} are trained through the Forward-Backward algorithm (Baum and Eagon, 1967) on held-out data. Parameters are tied together for similar c(W) to prevent data sparsity.
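A sketch of the interpolation recursion just described is given below, with one simplification: the weights are keyed only by the history count rather than by order and count jointly, and the recursion bottoms out in a uniform distribution. All names and the example numbers are illustrative, not the paper's implementation.

```python
def smoothed_prob(w, history, counts, lam, vocab_size, total_tokens):
    """Interpolated estimate p_s(w | history): combine the raw frequency
    c(history + w) / c(history) with the smoothed estimate for the history
    shortened by one word, bottoming out at a uniform distribution."""
    h_count = total_tokens if not history else counts.get(tuple(history), 0)
    ml = counts.get(tuple(history) + (w,), 0) / h_count if h_count else 0.0
    weight = lam.get(h_count, 0.0)        # weights tied by history count
    backoff = (1.0 / vocab_size if not history
               else smoothed_prob(w, history[1:], counts, lam, vocab_size, total_tokens))
    return weight * ml + (1 - weight) * backoff

# Hypothetical counts from a tiny corpus; lam would be trained on held-out data.
counts = {("bob",): 2, ("talks",): 2, ("slowly",): 2, ("bob", "talks"): 2}
lam = {2: 0.7, 6: 0.4}
print(smoothed_prob("talks", ["bob"], counts, lam, vocab_size=3, total_tokens=6))
```

Because each weight depends only on the history and not on the predicted word, the smoothed distribution remains normalized over the vocabulary at every level of the recursion.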
For the Inside-Outside algorithm, we follow the methodology described by Lari and Young. For a given n, we create a probabilistic context-free grammar consisting of all Chomsky normal form rules over the n nonterminal symbols {X_1, ..., X_n} and the given terminal symbols, that is, all rules

    X_i → X_j X_k    i, j, k ∈ {1, ..., n}
    X_i → a          i ∈ {1, ..., n}, a ∈ T

where T denotes the set of terminal symbols in the domain. All parameters are initialized randomly. From this starting point, the Inside-Outside algorithm is run until convergence. For smoothing, we combine the expansion distribution of each symbol with a uniform distribution, that is, we take the smoothed parameter p_s(A → α) to be

    p_s(A → α) = (1 - μ) p_u(A → α) + μ / (n³ + n|T|)

where p_u(A → α) denotes the unsmoothed parameter. The value n³ + n|T| is the number of different ways a symbol expands under the Lari and Young methodology. The parameter μ is trained through the Inside-Outside algorithm on held-out data. This smoothing is also performed on the Inside-Outside post-pass of our algorithm. For each domain, we tried n = 3, ..., 10.

Because of the computational demands of our algorithm, it is currently impractical to apply it to large vocabulary or large training set problems. However, we present the results of our algorithm in three medium-sized domains. In each case, we use 4500 sentences for training, with 500 of these sentences held out for smoothing. We test on 500 sentences, and measure performance by the entropy of the test data.

In the first two domains, we created the training and test data artificially so as to have an ideal grammar in hand to benchmark results. In particular, we used a probabilistic grammar to generate the data. In the first domain, we created this grammar by hand; the grammar was a small English-like probabilistic context-free grammar consisting of roughly 10 nonterminal symbols, 20 terminal symbols, and 30 rules. In the second domain, we derived the grammar from manually parsed text. From a million words of parsed Wall Street Journal data from the Penn treebank, we extracted the 20 most frequently occurring symbols, and the 10 most frequently occurring rules expanding each of these symbols. For each symbol that occurs on the right-hand side of a rule but which was not one of the most frequent 20 symbols, we create a rule that expands that symbol to a unique terminal symbol. After removing unreachable rules, this yields a grammar of roughly 30 nonterminals, 120 terminals, and 160 rules. Parameters are set to reflect the frequency of the corresponding rule in the parsed corpus. For the third domain, we took English text and reduced the size of the vocabulary by mapping each word to its part-of-speech tag. We used tagged Wall Street Journal text from the Penn treebank, which has a tag set size of about fifty.

In Tables 2-4, we summarize our results. The ideal grammar denotes the grammar used to generate the training and test data. For each algorithm, we list the best performance achieved over all n tried, and the best n column states which value realized this performance. We achieve a moderate but significant improvement in performance over n-gram models and the Inside-Outside algorithm in the first two domains, while in the part-of-speech domain we are outperformed by n-gram models but we vastly outperform the Inside-Outside algorithm.

                    best n   entropy (bits/word)   entropy relative to n-gram
    ideal grammar     -            2.30                    -6.5%
    our algorithm     7            2.37                    -3.7%
    n-gram model      4            2.46                      -
    Inside-Outside    9            2.60                    +5.7%

Table 2: English-like artificial grammar

                    best n   entropy (bits/word)   entropy relative to n-gram
    ideal grammar     -            4.13                   -10.4%
    our algorithm     9            4.44                    -3.7%
    n-gram model      4            4.61                      -
    Inside-Outside    9            4.64                    +0.7%

Table 3: Wall Street Journal-like artificial grammar

                    best n   entropy (bits/word)   entropy relative to n-gram
    n-gram model      6            3.01                      -
    our algorithm     7            3.15                    +4.7%
    Inside-Outside    7            3.93                   +30.6%

Table 4: English sentence part-of-speech sequences

In Table 5, we display a sample of the number of parameters and execution time (on a DECstation 5000/33) associated with each algorithm. We choose n to yield approximately equivalent performance for each algorithm. The first pass row refers to the main grammar induction phase of our algorithm, and the post-pass row refers to the Inside-Outside post-pass. Notice that our algorithm produces a significantly more compact model than the n-gram model, while running significantly faster than the Inside-Outside algorithm even though we use an Inside-Outside post-pass. Part of this discrepancy is due to the fact that we require a smaller number of new nonterminal symbols to achieve equivalent performance, but we have also found that our post-pass converges more quickly even given the same number of nonterminal symbols.

    WSJ artif.        n    entropy (bits/word)   no. params   time (sec)
    n-gram            3          4.61              15000          50
    Inside-Outside    9          4.64               2000       30000
    first pass        -            -                 800        1000
    post-pass         5          4.60               4000        5000

Table 5: Parameters and training time

Discussion

This research represents a step forward in the quest for developing grammar-based language models for natural language. We induce models that, while being substantially more compact, outperform n-gram language models in medium-sized domains. The algorithm runs essentially in time and space linear in the size of the training data, so larger domains are within our reach.

However, we feel the largest contribution of this work does not lie in the actual algorithm specified, but rather in its indication of the potential of the induction framework described by Solomonoff in 1964. We have implemented only a subset of the moves that we have developed, and inspection of our results gives reason to believe that these additional moves may significantly improve the performance of our algorithm. Solomonoff's induction framework is not restricted to probabilistic context-free grammars. After completing the implementation of our move set, we plan to explore the modeling of context-sensitive phenomena. This work demonstrates that Solomonoff's elegant framework deserves much further consideration.

Acknowledgements

We are indebted to Stuart Shieber for his suggestions and guidance, as well as his invaluable comments on earlier drafts of this paper. This material is based on work supported by the National Science Foundation under Grant Number IRI-9350192 to Stuart M. Shieber.
References

D. Angluin and C.H. Smith. 1983. Inductive inference: theory and methods. ACM Computing Surveys, 15:237-269.

L.R. Bahl, J.K. Baker, P.S. Cohen, F. Jelinek, B.L. Lewis, and R.L. Mercer. 1978. Recognition of a continuously read natural corpus. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 422-424, Tulsa, Oklahoma, April.

Lalit R. Bahl, Frederick Jelinek, and Robert L. Mercer. 1983. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2):179-190, March.

J.K. Baker. 1975. The DRAGON system - an overview. IEEE Transactions on Acoustics, Speech and Signal Processing, 23:24-29, February.

J.K. Baker. 1979. Trainable grammars for speech recognition. In Proceedings of the Spring Conference of the Acoustical Society of America, pages 547-550, Boston, MA, June.

L.E. Baum and J.A. Eagon. 1967. An inequality with application to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bulletin of the American Mathematical Society, 73:360-363.

Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jennifer C. Lai, and Robert L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479, December.

Craig M. Cook, Azriel Rosenfeld, and Alan R. Aronson. 1976. Grammatical inference by hill climbing. Information Sciences, 10:59-80.

A.P. Dempster, N.M. Laird, and D.B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(B):1-38.

Frederick Jelinek and Robert L. Mercer. 1980. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, Amsterdam, The Netherlands: North-Holland, May.

M.D. Kernighan, K.W. Church, and W.A. Gale. 1990. A spelling correction program based on a noisy channel model. In Proceedings of the Thirteenth International Conference on Computational Linguistics, pages 205-210.

K. Lari and S.J. Young. 1990. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4:35-56.

K. Lari and S.J. Young. 1991. Applications of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 5:237-257.

Ming Li and Paul Vitanyi. 1993. An Introduction to Kolmogorov Complexity and its Applications. Springer-Verlag.

Michael K. McCandless and James R. Glass. 1993. Empirical acquisition of word and phrase classes in the ATIS domain. In Third European Conference on Speech Communication and Technology, Berlin, Germany, September.

Fernando Pereira and Yves Schabes. 1992. Inside-outside reestimation from partially bracketed corpora. In Proceedings of the 30th Annual Meeting of the ACL, pages 128-135, Newark, Delaware.

P. Resnik. 1992. Probabilistic tree-adjoining grammar as a framework for statistical natural language processing. In Proceedings of the 14th International Conference on Computational Linguistics.

J. Rissanen. 1978. Modeling by the shortest data description. Automatica, 14:465-471.

Y. Schabes. 1992. Stochastic lexicalized tree-adjoining grammars. In Proceedings of the 14th International Conference on Computational Linguistics.

C.E. Shannon. 1951. Prediction and entropy of printed English. Bell Systems Technical Journal, 30:50-64, January.

R.J. Solomonoff. 1960. A preliminary report on a general theory of inductive inference. Technical Report ZTB-138, Zator Company, Cambridge, MA, November.

R.J. Solomonoff. 1964. A formal theory of inductive inference. Information and Control, 7:1-22, 224-254, March, June.

Rohini Srihari and Charlotte Baltus. 1992. Combining statistical and syntactic methods in recognizing handwritten sentences. In AAAI Symposium: Probabilistic Approaches to Natural Language, pages 121-127.

Andreas Stolcke and Stephen Omohundro. 1994. Best-first model merging for hidden Markov model induction. Technical Report TR-94-003, International Computer Science Institute, Berkeley, CA.