Similarity-Based Approaches to Natural Language Processing
CitationLee, Lillian Jane. 1997. Similarity-Based Approaches to Natural Language Processing. Harvard Computer Science Group Technical Report TR-11-97.
AbstractStatistical methods for automatically extracting information about associations between words or documents from large collections of text have the potential to have considerable impact in a number of areas, such as information retrieval and natural-language-based user interfaces. However, even huge bodies of text yield highly unreliable estimates of the probability of relatively common events, and, in fact, perfectly reasonable events may not occur in the training data at all. This is known as the sparse data problem. Traditional approaches to the sparse data problem use crude approximations. We propose a different solution: if we are able to organize the data into classes of similar events, then, if information about an event is lacking, we can estimate its behavior from information about similar events. This thesis presents two such similarity-based approaches, where, in general, we measure similarity by the Kullback-Leibler divergence, an information-theoretic quantity. Our first approach is to build soft, hierarchical clusters: soft, because each event belongs to each cluster with some probability; hierarchical, because cluster centroids are iteratively split to model finer distinctions. Our clustering method, which uses the technique of deterministic annealing, represents (to our knowledge) the first application of soft clustering to problems in natural language processing. We use this method to cluster words drawn from 44 million words of Associated Press Newswire and 10 million words from Grolier's encyclopedia, and find that language models built from the clusters have substantial predictive power. Our algorithm also extends with no modification to other domains, such as document clustering. Our second approach is a nearest-neighbor approach: instead of calculating a centroid for each class, we in essence build a cluster around each word. We compare several such nearest-neighbor approaches on a word sense disambiguation task and find that as a whole, their performance is far superior to that of standard methods. In another set of experiments, we show that using estimation techniques based on the nearest-neighbor model enables us to achieve perplexity reductions of more than 20 percent over standard techniques in the prediction of low-frequency events, and statistically significant speech recognition error-rate reduction.
Citable link to this pagehttp://nrs.harvard.edu/urn-3:HUL.InstRepos:25104914
- FAS Scholarly Articles