Noise-skipping Earley parsing and in-order tree extraction from shared packed parse forests
CitationDohmann, Jeremy. 2020. Noise-skipping Earley parsing and in-order tree extraction from shared packed parse forests. Bachelor's thesis, Harvard College.
AbstractIn this thesis, I identify 3 shortcomings of state of the art Earley parsing and offer a unified, end-to-end Earley parsing algorithm which remedies these shortcomings. In particular I address the following issues:
1. The Earley algorithm traditionally returns parses without any rank-ordering associated with them. In addition to not ranking the parses during parse time, the data structure used to represent parse results, the shared-packed parse forest (SPPF), doesn’t have any intrinsic way to extract trees in a particular order. This is a major shortcoming because, for all but the most trivial grammars and inputs, the number of potential parses for an input is intractably large, making the parser effectively useless barring some external mechanism to filter results.
2. The Earley algorithm requires that parses explain contiguous spans of tokens without any unexplained out-of-vocabulary items or tokens interceding. In other words, traditional Earley parsing parses only whole strings of tokens, as opposed to permitting parsing only discontinuous subsequences of the input tokens. This is a shortcoming in applications where the input contains spurious tokens or where the grammar is only intended to describe a subset of the input.
3. The Earley algorithm’s run time is proportional to grammar size. Current state of the art methods substantially restrict the size of grammars which can be practically parsed with reasonable memory and time constraints. This is a major shortcoming because it can be difficult to represent sufficiently expressive languages for many applications using anything short of a massive grammar.
The narrative arc of this thesis can be understood as an attempt to redress the three issues above by extending the Earley algorithm to enable parse ranking and quick extraction, make it robust to noise skipping, and make parsing with massive grammars tractable.
In chapter two I address issue 1 by introducing a framework for ranking parses as a function of their intrinsic attributes. I present this approach as an extension of the work done in  and .  introduces parsing as a form of deductive reasoning on a logical system defined by a context-free grammar, while  unifies a number of concepts from deductive parsing (e.g. recognition, assigning probabilities to parses, etc.) as performing calculations on a suitably defined semiring.
Thus chapter two introduces a semiring I refer to as the ‘attribute semiring’ and show how Earley parsing can be used to derive the attributes of the best derivations for Earley states - where “best” is evaluated by applying a utility function f (·) to a state’s attributes. I use the semiring formalism to show which types of attributes and utility functions can be used
for this purpose.
In chapters three and four, I address the first two issues above by introducing extensions to the standard Earley parser and standard SPPF-construction algorithm to handle skipping tokens in noisy inputs as well innovations to associate each Earley state and SPPF node with the attributes of its best derivation. In chapter 3, I prove that the modified noise-skipping parsing algorithm runs in O(w · n 2 ) time where w is the number of tokens allowed to be skipped between any explained tokens, also referred to as the ‘skip width’. Chapter four details the construction of a novel variant of the SPPF called an ordered-SPPF which, by design, permits us to associate the best derivation attributes with SPPF nodes and makes extracting the best tree from a forest trivially easy. In that chapter, I show that construction of the ordered-SPPF can be done in O(w 2 n 3 · log(wn)) time
In chapter five, I address issue two, by showing that the ordered SPPF from chapter four can be used to not only efficiently extract the best tree in the forest but also the top k parses ranked by applying the utility function to their attributes. I show my efficient top-k algorithm can extract the top k parses in O(kn 2 ·log(kn) time.
In appendix A, I address the third issue, by showing an algorithm to filter grammars at run time in order to make large-grammar parsing more efficient.
Finally, in appendix B, I introduce an extension that allows the Earley parser to run continuously and return ranked parses in a streamed manner. This extension requires nontrivial modifications to the core logic in order to maintain memory efficiency while not losing any valid parses.
Thus, this thesis details a unified approach permitting the Earley algorithm to parse valid subsequences of noisy inputs while using massive grammars. The system detailed here is also capable of ranking the parses by customized utility functions over user-defined attributes with the constraints governing such functions detailed in chapter 2.
Citable link to this pagehttps://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37367424
- FAS Theses and Dissertations