Advance Access publication February 4, 2015
Political Analysis (2015) 23:254–277
doi:10.1093/pan/mpu019

Computer-Assisted Text Analysis for Comparative Politics

Christopher Lucas
Department of Government and Institute for Quantitative Social Science, Harvard University, 1737 Cambridge St., Cambridge, MA 02138, USA
e-mail: clucas@fas.harvard.edu

Richard A. Nielsen
Department of Political Science, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
e-mail: rnielsen@mit.edu

Margaret E. Roberts
Department of Political Science, University of California, San Diego, 9500 Gilman Drive, #0521, La Jolla, CA 92093, USA
e-mail: meroberts@ucsd.edu

Brandon M. Stewart
Department of Government and Institute for Quantitative Social Science, Harvard University, 1737 Cambridge Street, Cambridge, MA 02138, USA
e-mail: bstewart@fas.harvard.edu

Alex Storer
Graduate School of Business, Stanford University, 655 Knight Way, Stanford, CA 94305, USA
e-mail: astorer@stanford.edu

Dustin Tingley
Department of Government and Institute for Quantitative Social Science, Harvard University, 1737 Cambridge St., Cambridge, MA 02138, USA
e-mail: dtingley@gov.harvard.edu (corresponding author)

Edited by Betsy Sinclair

Recent advances in research tools for the systematic analysis of textual data are enabling exciting new research throughout the social sciences. For scholars of comparative politics, who are often interested in non-English and possibly multilingual textual datasets, these advances may be difficult to access. This article discusses practical issues that arise in the processing, management, translation, and analysis of textual data, with a particular focus on how procedures differ across languages. These procedures are combined in two applied examples of automated text analysis using the recently introduced Structural Topic Model.
We also show how the model can be used to analyze data that have been translated into a single language via machine translation tools. All the methods we describe here are implemented in open-source software packages available from the authors.

Authors’ note: Our thanks to Sam Brotherton and Jetson Leder-Luis for research assistance and Amy Catalinac for discussion about text analyses in comparative politics. We also thank Christopher Blattman, Dan Corstange, Macartan Humphreys, Amaney Jamal, Gary King, Helen Milner, Tamar Mitts, Brendan O’Connor, Arthur Spirling, and the Columbia University Comparative Politics Workshop for comments. Our software discussed in this article is open source and available.

© The Author 2015. Published by Oxford University Press on behalf of the Society for Political Methodology. All rights reserved. For Permissions, please email: journals.permissions@oup.com

Downloaded from http://pan.oxfordjournals.org/ at MIT Libraries on April 13, 2015

1 Introduction

In this article, we focus on new tools for comparativists to utilize textual data that can come in many different languages. Massive amounts of textual data are now available to comparativists, from debates in legislative bodies to newspapers to online social media. But using automated content analysis for comparative politics presents important challenges and opportunities, including processing and analyzing text in multiple languages and incorporating data we have about our texts directly into our analyses.

Comparativists are not unfamiliar with tools for textual analysis. Many of the automated text analysis innovations within political science were developed by comparativists (e.g., Schrodt and Gerner 1994; Laver, Benoit, and Garry 2003; Slapin and Proksch 2008). After briefly orienting the reader to the range of text analysis methods available, we highlight a particular approach, unsupervised topic modeling.
For interested readers, an online appendix provides an extensive discussion of supervised, scaling, and unsupervised methods to help readers understand the differences between existing approaches and identify methods that will be helpful for their own projects.

To showcase the potential of topic modeling for comparative politics, we use the Structural Topic Model (STM) (Roberts et al. 2013; Roberts, Stewart, Tingley et al. 2014; Roberts et al. 2015) to analyze Arabic fatwas and a novel multilanguage analysis of social media responses in Arabic and Chinese to the Edward Snowden event in June 2013. We argue in this article that the STM should be an important part of the text analysis tool kit for comparativists. The STM provides a flexible way to incorporate “metadata” associated with the text, such as when the text was written, where (e.g., which country) it was written, who wrote it, and characteristics of the author, into the analysis using document-level covariates. In turn, it allows comparativists to understand relationships between metadata and topics in their text corpus.

An additional contribution of this article is to discuss a range of tools that are necessary to analyze text from different languages. This includes a discussion of how text processing can differ across languages, along with discussion of robust software tools that properly account for differences across languages. We also consider how to simultaneously analyze text in different languages. In doing so, we discuss multilingual approaches to text analysis, briefly introduce a new R package, translateR, to access the Google and Microsoft machine translation APIs, and present a novel way to use the STM in a multilingual setting.

The structure of the article is as follows. Section 2 discusses research questions in comparative politics that have benefited from text analysis tools, a multilanguage view of text processing, and new tools for machine translation.
Section 3 presents a brief review of text analysis tools with a particular focus on multilanguage text modeling, and introduces the basics of the STM. Section 4 provides two example analyses using the STM: the first looks at Islamic fatwas and the second illustrates a novel way to use the model on machine-translated data, with an application to social media responses in Arabic and Chinese to the Edward Snowden event.

2 Text and Language Basics

2.1 Research Questions and Data Analysis

Automated content analysis and comparative politics are well suited for each other. Countries around the world are producing textual data at unprecedented rates. Traditional government statistics are often missing, mismeasured, or manipulated, creating a strong incentive for scholars to turn to other forms of data. Meanwhile, governments in almost all countries produce and store large amounts of text data that can be used for descriptive and causal inference. As internet connectivity rises, documents produced by individual citizens are becoming available from an increasingly diverse set of countries. E-mail and advances in survey technologies allow researchers to more easily collect interviews from politicians and government officials, expanding researchers’ collections of qualitative data. The digitization of archives, historical records, and public documents has exposed the inner workings of governments across the globe to the public eye.

While other disciplines are only recently catching on to text as a data source, scholars in comparative politics have been using text as data for years, and have built up intuitions for how text should be used for scholarly inference.
Scholars of comparative politics have drawn information from archives and interviews and therefore know how to ask political questions with these data, select important text or interview questions, and find meaningful patterns within the data (George and Bennett 2005; Brady and Collier 2010).

Scholars in comparative politics have already begun using automated methods for analyzing text to ask important political questions. Records of the speeches politicians make, and of deliberations among politicians, are perhaps the most readily available form of text on politicians, and scholars have been using them to better understand the internal political workings of governments. Stewart and Zhukov (2009) use public statements by Russian leaders to understand how military versus political elites influence Russia’s decision to intervene in neighboring countries. Baturo and Mikhaylov (2013) use federal and sub-national legislative addresses in Russia to identify leadership patterns within the Russian government. Schonhardt-Bailey (2006) applies a text-clustering method to thousands of pages of parliamentary debates in the United Kingdom to analyze the discussion about the repeal of the Corn Laws in Britain. Eggers and Spirling (2011) use parliamentary debates to model exchanges among politicians in the British House of Commons. Miller (2013) analyzes speeches in the United Nations to show that speeches by delegations from countries that were previously colonized devote more words to themes of victimization than speeches by states that were never colonized.

Others have tried to infer the policy positions of political parties or political leaders based on documents describing their positions on policies. The Comparative Manifestos project has collected electoral manifestos from all over the world, allowing scholars to use these text data to answer comparative questions about political systems (Budge 2001).
Early versions used human coding, but more recently the Comparative Manifestos project and related projects have been assisted by computer techniques. Catalinac (2014) uses thousands of Japanese election manifestos from 1986 to 2009 to determine how electoral strategies shifted after Japan’s electoral reform in 1994. Nielsen (2013) uses fatwas from websites of Muslim clerics to measure the level of Jihadist thought in these clerics’ writings and understand the drivers of Jihadism.

Political scientists have studied newspapers in various languages to ask questions about media freedom and infer relationships between politicians and groups within a country. Van Atteveldt, Kleinnijenhuis, and Ruigrok (2008) analyze Dutch newspapers and extract relationships among political leaders and groups. Coscia and Rios (2012) use news to measure criminal activity in Mexico. Stockmann (2012) analyzes Chinese newspapers to study how media marketization influences anti-American sentiment in the Chinese media.

Finally, scholars in comparative politics have used blogs and social media sources. King, Pan, and Roberts (2013) studied the focus of censorship in social media in China; Jamal et al. (n.d.) studied anti-Americanism in Arabic-language Twitter posts; and Barberá (2012) used Twitter posts to scale citizen liberal-conservative ideal points across the United States and several European countries.

These papers demonstrate an emerging trove of data being generated around the world. With more and more political discourse happening in these forums, comparative politics will require tools that can handle large volumes of data and systematic frameworks to analyze the data.

2.2 Text Processing Basics: A Multilanguage View

In order to use automated methods to analyze text, the analyst must first ensure the text is machine-readable. Statistical methods for text analysis are often language agnostic, but the tools for preprocessing the texts are not.
This can be challenging for newcomers in comparative politics, as introductions to text analysis often focus exclusively on methods and software for English texts. We discuss three challenges that are particularly important when working with multiple languages within or across research projects: dealing with encodings, preprocessing for dimensionality reduction, and handling large corpora. Along the way, we point out language-specific variations that comparativists studying particular countries should consider. In order to focus our discussion on less well-known issues that come up when working outside English, we leave to Online Appendix A a more general discussion of topics that are more basic, such as the use of Optical Character Recognition. We describe how we follow these procedures in the sections where we give examples.

2.2.1 Dealing with encodings

The encoding of text is the way in which the computer translates individual, unique characters into bytes. Each language can have multiple encodings,[1] and different computers and different software will recognize different encodings by default. If the analyst is pulling data from multiple different sources, such as different webpages, it is likely that the text will be in different encodings. In this case, the first step is to convert each document so that all of the encodings match.[2] The second step is to make sure the software reads the encoding correctly. This can often be done by changing the preference of the software, or encoding the text so that it matches the software’s default encoding.

2.2.2 Preprocessing to extract the most information

Automated text analysis methods usually treat documents as a vector containing the count of each word type within the document, disregarding the order in which the words appear.
This “bag-of-words” assumption reduces the dimension of natural language text, representing each document as a single vector with length equal to the number of unique words in the text. Unfortunately, even these vocabularies can be too large to be practical, ranging from thousands to millions of unique words. Fortunately, because most words appear only a few times in the corpus, removing infrequently occurring words can dramatically reduce the number of unique word types while having only a small impact on the number of tokens. Bounding the size of the vocabulary can play an important role in helping methods perform well in practice.

In this section, we describe the most common tools for preprocessing textual data, including stop word removal, stemming, lemmatization, compounding, decompounding, and segmentation. In each case, the goal is to reduce the scale of the problem by treating words with very similar properties identically and removing words that are unnecessary to our interpretation and our model. Along with disregarding word order, the so-called “bag-of-words” assumption,[3] these procedures are common preprocessing steps but can differ across languages.

Stop word removal. To aid in interpretation and model performance, analysts often remove words that are extremely common but unrelated to the quantity of interest. These “stop words” are dropped before the analysis. In most settings, this involves removing frequently occurring function words such as “and” and “the,” but it often also removes other types of stop words such as contractions.[4] Most languages have lists of common “stop words” that can be provided to the preprocessing programs we discuss below.

We note that for every language, choosing which stop words should be removed is a substantive decision that in some cases can have important effects on the results of the analysis.
For example, Campbell and Pennebaker (2003) studied the importance of pronouns, which could be considered stop words in some schemes. Fokkens et al. (2013) found that removing different sets of stop words can produce different results in some cases. In other words, a stop word list should be chosen carefully, based on words that the analyst thinks will not be important in informing the analysis. We discuss how we use stop words in more detail in the specific applications (both multilingual and single language) below.

[1] For example, Chinese has several dozen encodings, the largest of which are Guobiao (GB), which has a 2- or 4-byte encoding; Big5, which has a 1- or 2-byte encoding; and ISO-2022, which has a 7-bit encoding.
[2] Most programming languages have packages to transfer between encodings. For example, to convert encodings, we use Python’s package chardet.
[3] See Online Appendix A for additional discussion.
[4] In other settings, such as the analysis of style or authorship detection, function words may be the sole quantity of interest (Mosteller and Wallace 1963).

Stemming and lemmatization. Stemming removes the endings of conjugated verbs or plural nouns, leaving just the “stem,” which in many languages is common to all forms of the word. Stemming is useful in any language that changes the end of the word in order to convey a tense or number, which includes English, Spanish, Slovenian, French, modern Greek, and Swedish. Since tense and number are generally not indicative of the topic of the text, combining these terms can be useful for reducing the dimension of the input. However, not all languages require stemming. For example, Chinese verbs are not conjugated and nouns in Chinese are usually not pluralized by adding an ending.
A host of studies have shown stemming to be an effective form of preprocessing in English; however, the benefits are both application- and language-specific (Salton 1989; Harman 1991; Krovetz 1995; Hull 1996; Hollink et al. 2004; Manning, Raghavan, and Schutze 2008).[5]

Stemming is an approximation to a more general goal called lemmatization: identifying the base form of a word and grouping these words together. However, instead of chopping off the end of a word, lemmatization is a more complicated algorithm that identifies the origin of the word, only returning the lemma, or common form of the word. Lemmatization can also determine the context of the word; for example, it will leave saw the noun as is, but will turn saw the verb into see (Manning, Raghavan, and Schutze 2008). While stemming often works almost as well as lemmatization in languages like English, lemmatization works better for languages where conjugations are not indicated by changing the end of the word, and for agglutinative languages[6] where there is a greater variety of forms for each individual word, such as Korean, Turkish, and Hungarian.

Compound words. Some languages will frequently concatenate two words that describe two different concepts, or split one word that describes one concept. These instances, called compound words or decompounded words, can decrease the efficacy of text analysis techniques because one concept can be hidden in many unique words, or one concept may be split across two words. For example, the German word “Kirche,” or church, can be appended to “rat,” forming “Kirchenrat,” who is a member of the church council, or “pfleger” to form “Kirchenpfleger,” or church warden. If it is appended, the computer will not see “Kirch” as an individual concept. Decompounding this case would separate “Kirch” from its endings. “Compounding languages” include German, Finnish, Danish, Dutch, Norwegian, Swedish, and Greek (Alfonseca, Bilac, and Pharies 2008).
On the other hand, the analyst may want to compound words. For example, in English, “national security” and “social security” each contain two separate terms even though they express one concept. Even though they share the word “security,” these concepts are very different from each other, so the analyst might wish to compound these into “nationalsecurity” and “socialsecurity.” All of these decisions should be guided by substantive knowledge.

Segmentation. Some languages, like Chinese, Japanese, and Lao, do not have spaces between words, and therefore text analysis techniques that rely on the word as the unit of analysis cannot naturally parse the words into individual units. Automatic segmentation must be used before the documents can be processed by a statistical program (see Lunde [2009] for an overview). Segmentation can be done using dictionary methods (Cheng, Young, and Wong 1999) or using statistical methods that learn where spaces are likely to occur between words (Tseng et al. 2005).

2.2.3 Building the document-term matrix

Once all preprocessing has been completed, for many automated content techniques (including those detailed in this article), the remaining words are used to construct a document-term matrix (DTM). A DTM is a matrix where each row represents a document and each column represents a unique word. Each cell in the matrix denotes the number of times the word indicated by the column appears in the document indicated by the row. For example, if a document was just the sentence “I support the Tories,” “I” and “the” would likely have already been removed as stop words, so that the document would be represented with a 1 for “Tories” and “support” and a 0 for all other words.

[5] Several computer programs are available to implement stemming, including txtorg (discussed in Section 2.3), which can implement stemming in multiple different languages.
These programs automatically detect common variations in word endings, removing these endings and converting plural words into their singular form.
[6] Languages where most words are formed by combining smaller meaningful language units called morphemes.

Following the “bag-of-words” assumption, the DTM format preserves information about how many times each word appears in a document while discarding information about the word order. The resulting matrix is extremely sparse, meaning a large proportion of the cells are zeros, because most documents will contain only a small fraction of the words in the vocabulary. For even moderately sized corpora, this matrix will be too large to store in its rectangular form; however, we can exploit the sparsity of the DTM to store only the non-zero entries. The DTM, or its sparse representation, is the primary input to most automated text analysis methods, including the ones in this article.

2.3 Multilanguage Preprocessing Tools

2.3.1 Language-specific processing

None of the previous steps is trivial from a workflow perspective, especially for comparativists working in a variety of languages, each of which may require specialized tools. Here we discuss existing methods to deal with preprocessing text within a language. There are two flexible open-source software tools for doing stemming, stop word removal, etc., that cover many languages. First is the R package tm (Feinerer, Hornik, and Meyer 2008), which can stem 11 languages[7] and can do stop word removal on 13 languages.[8] Another tool is the Python/Lucene-based application txtorg,[9] which currently includes support for 32 languages.[10] In txtorg, all supported languages go through a suite of best practice preprocessing steps, which includes the appropriate combination of stemming, segmentation, and stop word removal for that particular language.
Both of these tools facilitate text preprocessing, though txtorg is dramatically more efficient in handling larger corpora and when searching and subsetting large amounts of text.[11]

2.3.2 Translation

As we discuss in Section 3.2 and illustrate in Section 4.2, there are important instances where modeling textual data from multilingual corpora becomes more efficient and accessible for applied users if the text is first translated into a single language. Of course, though human translation remains the gold standard, the scale of textual data generally far exceeds that which might be feasibly translated by humans. In subsequent sections, we discuss the relevant technical considerations of multilingual analysis in greater detail. In this section, we briefly discuss machine translation and introduce an R-based utility for accessing machine translation software developed by Google and Microsoft.

Central to comparative politics is, of course, a commitment to cross-national comparison. And while comparativists have developed many techniques for automated text analysis, there presently exists little or no support for cross-lingual comparison. While this limitation does not preclude all potentially interesting comparisons, it prevents a great many. In Section 3.3, we discuss a principled way by which such comparisons can be made with the STM after first translating the corpus into a common language. However, this requires first overcoming the potentially formidable task of translating the data into a common language.

The job of a translator is to “render in one language the meaning expressed by a passage of text in another language” (Brown et al. 1990, p. 81), and though there exist many approaches, the basic task of machine translation is to accomplish this conversion with a computer.
Because of its many uses and because early barriers to machine translation, which included hardware limitations and a dearth of machine-readable text (Brown et al. 1993), have largely been overcome, there is now heavy investment in machine translation. There exist a number of academic and commercial labs committed to the development of machine translation systems, some of which have led to the founding of new companies. Simultaneously, large, mature software companies like IBM, Microsoft, and Google have also developed their own machine translation systems (Koehn 2009). Because these groups can leverage financial and academic resources beyond those generally accessible to political scientists, we argue that a desirable solution to the problem of machine-translating text for political science is one that leverages the effort made by dedicated research groups in a simple, straightforward way.

[7] Danish, Dutch, English, Finnish, French, German, Norwegian, Portuguese, Russian, Spanish, and Swedish.
[8] Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish.
[9] Note that txtorg includes a graphical user interface built with Tkinter, so users do not need to know Python in order to use txtorg. Nearly all txtorg functionality is accessible without writing any code.
[10] Arabic, Armenian, Basque, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Norwegian, Persian, Portuguese (separate tools for Brazil and Portugal), Romanian, Russian, Spanish, Swedish, Thai, and Turkish.
[11] See Online Appendix G for some basic benchmarking information between tm and txtorg.

When translating text for eventual consumption by human readers, there can be no substitute for human translation.
Within the literature on machine translation evaluation, it is said that “The closer a machine translation is to a professional human translation, the better it is” (Papineni et al. 2002, p. 1). But compared to translating text for eventual consumption by human readers, translation for multilingual text analysis is a slightly easier problem. As discussed in Section 2.2, most approaches to automated text analysis make a bag-of-words assumption, which implies that the ordering of terms in a document does not matter. The translation software needs only to correctly translate the significant terms in the original document, as any error in word order will be discarded by the bag-of-words assumption.

If users want to use machine translation, what should they use? Our answer is to provide an R package, translateR, that permits easy access to two very mature translation systems, namely those produced by Google and Microsoft. The package supports a variety of input and output formats and can be easily used with other text analysis software. Crucially, for our purposes, the package preserves information about individual texts (such as the original language or date of authorship). This is important for using models like the STM that incorporate these data.

Moreover, translateR preserves the scalability of machine translation by parallelizing the translation process via multiple API calls. Users provide as input the data to be translated, either as a dataframe with metadata or as a vector of documents or terms, and translateR makes calls in parallel to the translation API specified by the user (either Bing or Google). As a result, researchers spend minimal time reformatting their data and similarly little time waiting for the translation process to finish along with other aspects necessary for standard textual analysis. Additional discussion and syntax are given in Online Appendix C.
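
The representation described in Section 2 can be made concrete with a small sketch. The Python below is illustrative only and is not the implementation used by tm, txtorg, or translateR: the tiny stop word list, the whitespace tokenizer, and the function names `tokenize` and `build_sparse_dtm` are assumptions made for this example. It builds a sparse DTM that stores only non-zero counts, and it shows that two documents containing the same words in a different order receive identical rows, which is why word-order errors in a machine translation are discarded under the bag-of-words assumption.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word list; real lists are language-specific.
STOP_WORDS = {"i", "the", "and", "a", "of"}

def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation, drop stop words."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")
        if word and word not in STOP_WORDS:
            tokens.append(word)
    return tokens

def build_sparse_dtm(documents):
    """Return (vocabulary, rows); each row maps a column index to a non-zero count."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    column = {word: j for j, word in enumerate(vocabulary)}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Only words that occur are stored, so zero cells cost no memory.
        rows.append({column[word]: n for word, n in counts.items()})
    return vocabulary, rows

docs = [
    "I support the Tories.",
    "The Tories I support.",  # same words, different order
    "The opposition and the Tories debate the budget.",
]
vocabulary, dtm = build_sparse_dtm(docs)

# The first two rows are identical: word order is not represented.
print(vocabulary)
print(dtm[0] == dtm[1])
```

In this toy corpus the first document keeps only “support” and “tories,” matching the “I support the Tories” example in Section 2.2.3. Stemming, segmentation, and encoding conversion are omitted here and would need language-specific tools.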
3 Computer-Assisted Text Analysis

In the previous section, we discussed in detail how to prepare a multilingual corpus for automated approaches to text analysis by creating a DTM. A complete overview of methods for quantitatively analyzing the text is beyond the scope of this article. Unlike the issues involved in multilingual text processing, these methods have been well developed elsewhere (e.g., Grimmer and Stewart 2013). In Section 3.1, we provide a brief, selective overview and direct interested readers to our online appendix, which provides an accessible introduction to a broader range of methods. We then discuss the challenges that arise in moving from single to multilingual corpora (Section 3.2). Finally, in Section 3.3, we describe the STM before providing two applications of its use (Section 4).

3.1 A Brief Overview of Approaches

There are essentially two approaches to automated text analysis: supervised and unsupervised methods, each of which amplifies human effort in a different way. In supervised methods, we specify what is conceptually interesting about documents in advance, and then the model seeks to extend our insights to a larger population of unseen documents. Thus, for example, we might manually classify 100 documents into two categories, with the model classifying the remaining 9900 documents in the corpus. In unsupervised methods, such as topic modeling, we do not specify the conceptual structure of the texts beforehand. Instead, we use the model to find a low-dimensional summary that best explains observed documents given some set of assumptions. Consequently, human effort shifts from construction of a training set in supervised learning to interpretation of the model results in unsupervised settings.
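
The supervised workflow described above, in which a small hand-coded set is extended to the rest of the corpus, can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the article does not prescribe a classifier, so a multinomial naive Bayes model with add-one smoothing is swapped in as a stand-in, and the four hand-coded documents, their labels, and the function names `train_naive_bayes` and `classify` are made up for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (tokens, label) pairs coded by hand."""
    class_doc_counts = Counter()        # number of training documents per label
    word_counts = defaultdict(Counter)  # word frequencies within each label
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_doc_counts, word_counts, vocabulary

def classify(tokens, model):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_doc_counts):
        score = math.log(class_doc_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in tokens:
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy hand-coded training set (hypothetical labels and documents).
hand_coded = [
    (["tax", "budget", "deficit"], "economy"),
    (["budget", "spending", "tax"], "economy"),
    (["troops", "border", "security"], "defense"),
    (["security", "military", "troops"], "defense"),
]
model = train_naive_bayes(hand_coded)

# The "unseen" documents that the model labels on our behalf.
uncoded = [["tax", "spending"], ["border", "military"]]
predicted = [classify(tokens, model) for tokens in uncoded]
print(predicted)
```

The human effort here goes into building `hand_coded`. In an unsupervised method there is no such set, and the effort moves to interpreting what the fitted model returns.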
In our applications, we leverage a particular type of unsupervised topic modeling built on the popular Latent Dirichlet Allocation (LDA) model (Blei 2012). LDA is a mixed-membership model, which means that each document is represented as a mixture over a set of topics and each observed word is conditionally independent given its topic.[12] Each topic is a distribution over the words in the vocabulary, which crucially are learned rather than assumed by the model. LDA has seen widespread use in computer science and the humanities due to its simple and extensible structure.

The full range of text analysis methods, including supervised and unsupervised methods, is discussed in greater detail in Grimmer and Stewart (2013). We have also included an online appendix for this article containing an abbreviated introduction using a consistent set of heuristic examples using data from a corpus of comparative politics papers published in the American Political Science Review (Online Appendix B).

3.2 Multilingual Text Modeling

A considerable advantage to the quantitative approach to text analysis is that the methods are language agnostic. However, a rarely discussed limitation is that the documents are assumed to be drawn from only one language. This can be a frustrating situation for practitioners in comparative politics who are interested in studying a multilingual corpus. Here we discuss the attendant methodological issues that apply to both supervised and unsupervised models.

In some respects, the most natural approach for handling a multilingual corpus is to perform analysis within the native language but referencing a commonly shared objective. This is the approach taken in manual coding efforts, such as the Comparative Manifestos Project (Volkens et al. 2013), where it is relatively straightforward to define the coding criteria in a language-independent way but analyze each document in its own native language.
For keyword and supervised approaches, it is plausible to develop a separate but statistically comparable dictionary or training set for each observed language. Unlike the manual case where a single codebook can be developed in a shared language, the automated approaches require a duplication of effort for each language. While feasible in supervised settings, there is not a clear analogue for unsupervised methods.

A second approach is to translate text into a common language. Manual translation by an experienced translator would be extremely costly and so we turn to machine translation tools introduced above. How well this works will depend on the quality of the machine translation and the goal of the analysis. We return to this approach in Section 4.2.

The third approach is to develop a model which maintains an explicitly multilingual representation. The central challenge is to develop an alignment between the conceptual representations of the model across languages so that we know a particular scaling, topic, or class in one language is comparable with the representation in another language. We focus here on the challenging case of unsupervised topic models in the style of LDA, where the conceptual representation is being learned from the data, although the general ideas apply straightforwardly to supervised methods as well.

Existing approaches to multilingual topic models are differentiated in how they leverage external information to implicitly or explicitly align comparable topics across languages. The Polylingual Topic Model of Mimno et al. (2009) leverages a set of aligned documents, for example Wikipedia articles on the same topic in different languages. By constraining aligned documents to share a distribution over topics, the model is able to align the words associated with a given topic across languages.
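The constraint that aligned documents share one distribution over topics can be illustrated by simulating the generative story: a single topic mixture is drawn for an aligned tuple, and each language then draws its words from its own version of each topic. This is only a sketch of the data-generating idea, not an estimation routine; the two-language toy vocabulary, the symmetric Dirichlet parameter, and all function names are assumptions made for illustration.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw from a symmetric Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


# Hypothetical language-specific topics: topic 0 is "elections", topic 1 is
# "trade". Each language has its own word distribution for the same topic.
TOPICS = {
    "en": [{"election": 0.5, "vote": 0.5}, {"trade": 0.5, "tariff": 0.5}],
    "es": [{"elección": 0.5, "voto": 0.5}, {"comercio": 0.5, "arancel": 0.5}],
}


def generate_aligned_tuple(topics, doc_length, alpha, rng):
    """Generate one document per language from a single shared topic mixture."""
    n_topics = len(next(iter(topics.values())))
    theta = sample_dirichlet(alpha, n_topics, rng)  # shared across languages
    docs = {}
    for lang, lang_topics in topics.items():
        words = []
        for _ in range(doc_length):
            # Pick a topic from the shared mixture, then a word from the
            # language-specific version of that topic.
            z = rng.choices(range(n_topics), weights=theta, k=1)[0]
            vocab = list(lang_topics[z].keys())
            probs = list(lang_topics[z].values())
            words.append(rng.choices(vocab, weights=probs, k=1)[0])
        docs[lang] = words
    return theta, docs


rng = random.Random(0)
theta, docs = generate_aligned_tuple(TOPICS, doc_length=8, alpha=0.5, rng=rng)
print(theta)
print(docs)
```

Because `theta` is drawn once per aligned tuple, a tuple that is mostly about the first topic is mostly about it in every language, which is what ties the language-specific word distributions together.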
The Bilingual Topical Admixture model (Zhao and Xing 2006) works with texts which are aligned at the token level (for example, as the result of machine translation). Exact translations which are aligned at the token level are more difficult to obtain, but they provide a more direct source of information about topic alignment. Finally, the Multilingual Supervised LDA model (Boyd-Graber and Resnik 2010) uses a combination of sentiment information and aligned dictionaries to develop multilingual topics. Recent work combines these approaches to leverage both dictionary- and document-level alignments simultaneously, resulting in a model which is more robust than either independently (Hu et al. 2014).13

12 By “mixture” in this context we mean a set of positive values that sum to one.

From a technical perspective, fitting most of these models involves a relatively straightforward adaptation of the collapsed Gibbs sampling algorithm for LDA (Griffiths and Steyvers 2004).14 The result is a set of topics for each language along with the document-topic loadings. Multilingual models have primarily been used for either document exploration or machine-translation tasks.

The existing models for multilingual analysis do, of course, have limitations. The correspondence between the multilingual topics relies on the particular alignment information provided by the user and needs to be validated. This can be particularly challenging for indirect strategies such as the document alignment in the Polylingual Topic Model. For each topic, the user needs to verify that the topic word distributions are comparable across languages.
Given that the size of the vocabulary may be in the thousands, assessing model failure can be a substantial challenge even for only two languages.15 While the articles described above provide diagnostic tools for the model results, they are primarily focused on the machine-translation applications that motivate that literature.

As a practical matter, multilingual topic models generally lack the ability to include additional document metadata, which we argue below is an important part of applied social science research. In addition, there are limited software tools available for the estimation of these models.16 These critiques are not problematic for the models as presented in their original context, but do suggest challenges for their use in applied comparative research. Below, we suggest a way that machine translations and the STM can be fruitfully combined.

3.3 The STM

In our applications (Section 4), we leverage a recently introduced framework, the STM (Roberts et al. 2013; Roberts, Stewart, Tingley et al. 2014; Roberts, Stewart, and Airoldi n.d.). The STM is a mixed-membership topic model (like LDA) with extensions that facilitate the inclusion of document-level metadata.17 The inclusion of this information within the model can both improve the quality of the learned topics and facilitate hypothesis testing. Software for estimating the model is freely available in the R package stm.

Before moving on to our applications of the STM, we first briefly review several aspects of our use of the STM which are specific to this context. A brief statement of the model is available in Online Appendix D. For additional technical details on estimation and implementation of the model, we refer to existing work (e.g., Roberts et al. 2013; Roberts, Stewart, Tingley et al. 2014; Roberts, Stewart, and Airoldi n.d.).

The role of covariates.
The STM differs from other topic-modeling techniques like LDA in allowing document-level covariates to be included in the model as a method for pooling information. A covariate can be allowed to affect either topical prevalence or topical content. Covariates on topical prevalence allow documents to share information about which topics are expressed within the document (e.g., women are more likely to talk about topic 1 than men). Users can plot the relationship between their topic prevalence covariates and the expected proportion of a document that belongs to each topic. Covariates on topical content allow for the rates of word use, for each topic, to differ by covariate values (e.g., women are more likely to use a particular word when talking about a particular topic than men). Users can include both prevalence and content covariates, only one type, or neither.

13 Boyd-Graber and Blei (2009) introduce a topic model for completely unaligned texts, but they note that the model is highly sensitive to starting values and when run to divergence can result in the nominally equivalent topics between languages diverging. This is evidence for the central role of observed alignment information in pinning down the correspondence between topics.

14 For example, for the Polylingual Topic Model, we iteratively sample each token in the document, adjusting the topic-word distribution for the language-specific version of the topic but sharing the document-topic counts across all languages within the document. This algorithm has comparable speed to LDA but with slightly higher memory requirements.

15 As the number of languages grows, this problem is compounded by the need to have a single scholar who reads all languages. For example, among our team, no author speaks both Arabic and Chinese, which would make direct validation of a Polylingual Topic Model quite difficult.

16 Of the models discussed here, only the Polylingual Topic Model of Mimno et al. (2009) has a publicly available software implementation. A Java implementation is available in the software package Mallet (McCallum 2002).

17 The inclusion of document metadata follows and extends two developments within political science. The Dynamic Topic Model (Quinn et al. 2010) is a single-membership model in which the probability of observing a topic moves smoothly through time. The Expressed Agenda Model (Grimmer 2010) is a single-membership model which includes information about document authors. However, no such model exists to include author and time simultaneously. Drawing on these works, our approach generalizes to arbitrary covariate information and extends these setups for the mixed-membership case.

Content covariates are a particularly powerful tool which can be used to capture both quantities of interest and condition away systematic differences within the corpus that are not of primary interest. Imagine, for example, we were attempting to compare topical coverage within a large corpus of news reports about China from Agence France Presse (AFP) and Xinhua, China’s state news agency. In order to facilitate a direct comparison, we want the model to discover (for example) a single topic on Tibet; however, systematic differences in the way that AFP and Xinhua cover Tibet may produce separate AFP-Tibet and Xinhua-Tibet topics. Instead, by allowing the model to maintain an AFP version of the topic and a Xinhua version of the topic (which are constrained to be close), we can estimate the differences in word use and still retain a straightforward comparison.
If the differences are themselves of interest, the analyst can compare words distinctive of the Xinhua version of the topic (“oil,” “gas,” “resources”) with words distinctive of the AFP version (“culture,” “religion,” “independence”).18 If the differences are simply a nuisance, we can marginalize over source-specific versions of the topics weighting by the document frequency within the corpus as a whole. We will return to this idea in our multilingual analysis, where we use content covariates to condition out systematic differences that result from translation to English from different languages.

Topic correlations. In addition to the inclusion of covariates, the second distinctive feature of the STM is the explicit estimation of correlation between topics.19 Graphical depictions of the correlation between topics provide insight into the organizational structure at the corpus level. In essence, the model identifies when two topics are likely to co-occur within a document (here we focus on positive correlations although negative correlations are also estimated). The software we provide allows the user to produce a network graph of topics where each topic is a node and two nodes are connected when they are highly likely to co-occur. This can help the user identify larger themes that transcend topics. Drawing on recent literature in undirected graphical model estimation, we extend the approach developed in Blei and Lafferty (2007) for estimating the edges of the graph. In Online Appendix E, we describe the two graph estimation procedures we provide along with parameters set by the user. We give a specific example of this approach in the next section.

4 Applications

In this section, we introduce two applications of the STM. The first application, the analysis of Islamic fatwas, is conducted entirely within the single native language.
For the second application, our corpus includes both Chinese and Arabic texts, which we translate into a common language prior to analysis. All the analysis tools used below are built into the R package stm (Roberts, Stewart, and Tingley 2014).

18 The example here is drawn from the data and model described in Roberts et al. (n.d.).

19 Correlations are estimated by replacing the Dirichlet distribution in the standard LDA framework with a logistic normal distribution as in the Correlated Topic Model (Blei and Lafferty 2007). When no covariates are specified, the STM reduces to an instance of the Correlated Topic Model.

4.1 Jihadi Fatwas

In this example, we combine data on Muslim clerics from Nielsen (2013) with expert coding of whether clerics are Jihadist or not to see how the topical content of contemporary Jihadist religious texts differs from that of non-Jihadists. Nielsen collects data on the lives and writings of 101 prominent Jihadist and non-Jihadist Muslim clerics, including the 27,248 texts available from these authors from online sources. A majority of these texts are fatwas, Islamic legal rulings on virtually any aspect of human behavior, ranging from sex and dietary restrictions to violent Jihad. For many clerics, Nielsen also collects books, articles, and sermons on the same types of topics. Collectively, these texts are representative of how clerics choose to interact with religious constituencies; in fact, many of these collections are curated by the clerics themselves.

We combine these texts with an independent coding of whether these clerics are Jihadist or not based on two scholarly sources. First, the Militant Ideology Atlas: Executive Report (McCants 2006), Appendix 2, lists 56 individuals that are frequently cited by Jihadists. The authors of the Atlas code whether these are “Jihadi authors” according to substantive knowledge.
Second, Jarret Brachman (2009, pp. 26–41) lists the names of prominent clerics in eight ideological categories: establishment Salafists, Madkhali Salafists, Albani Salafists, scientific Salafists, Salafist Ikhwan, Sururis, Qutubis, and Global Jihadists. The latter two categories are Jihadist, whereas the rest are not. These two sources largely overlap; together, they provide expert assessments of 33 of the clerics (20 Jihadists and 13 non-Jihadists) for whom Nielsen collects 11,045 texts.

We then estimate an STM with the binary indicator for Jihadi status as a predictor. The results are shown in Fig. 1, with topics presented as collections of words (in this figure, we leave the words in Arabic), along with the topic coefficients and standard errors. We estimate 15 topics after experimenting with 5- and 10-topic models that produced less readily interpretable topics.20

The first inferential task is to infer topic labels from the words that are most representative of each topic. We do this by examining the most frequently occurring words in each topic and the words that have the highest levels of joint frequency and exclusivity (meaning they are common in one topic and rare in others). In several cases, we also examine exemplar documents for a topic, that is, those documents that have the highest proportion of words drawn from the topic. This also serves as a validation step because we check whether words in the topic have the meanings in context that they appear to have in the topic frequency lists.

The results in Fig. 1 indicate that topics 1 (Fighting) and 11 (Excommunication) are most correlated with the indicator for Jihadist clerics, matching our a priori predictions based on the content of the topics. Excommunication (takfīr in Arabic) is commonly used by Jihadists to condemn fellow Muslims who disagree with Jihadist aims or tactics.
The exemplar documents for this topic are fatwas on the rules and justifications for excommunication and other writings that make heavy use of the concept of excommunication. In contrast, topic 1 is a broader Jihadist topic focused primarily on fighting the West; the exemplar documents are fatwas about fighting abroad. Topics on social theory, Islam and modernity, and Shari’a and law are also correlated with Jihadism, though to a lesser degree.

A number of other topics are also clearly identifiable, including topic 5 on prayer, topic 6 on Ramadan, and topic 8 on money, pilgrimage, and marriage. As we expected from their content, these topics receive relatively little attention from Jihadists, who are more focused on their violent struggle than on fine distinctions in Islamic legal doctrine and religious ritual.

We can use the estimated correlation of topics with other topics to learn more about the structure of the corpus.21 In Fig. 2, we plot the network of topics such that topics that are correlated are linked. Many of the correlations between topics are intuitive and revealing about the nature of Islamic legal discourse. The topic on hadith (the sayings of the Prophet Muhammad) is highly correlated with language about the chain of narration by which each hadith is verified as trustworthy. Authors who write about social theory are likely to also write about Islam and modernity, politics, the role of women, and Shari’a and law.

20 This is not to say that 15 is the “right” number of topics in this corpus; rather, we find a 15-topic model useful for uncovering insights about the structure of the texts in relation to the Jihadist ideology of their authors.

21 We introduce our approach to calculating and graphically representing the correlation structure in Online Appendix E.
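The idea of linking topics that tend to co-occur can be illustrated with a deliberately simplified sketch: compute the correlation between topic proportions across documents and draw an edge wherever the correlation is positive and exceeds a threshold. This thresholding rule is a stand-in and is not the graph estimation procedure of Online Appendix E, which is not reproduced here. The topic proportions below are invented, and only the topic labels are borrowed from the text.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def topic_edges(doc_topic, labels, threshold):
    """Link two topics when their proportions correlate above the threshold."""
    columns = list(zip(*doc_topic))  # one tuple of proportions per topic
    edges = []
    for i, j in combinations(range(len(labels)), 2):
        r = pearson(columns[i], columns[j])
        if r > threshold:
            edges.append((labels[i], labels[j], r))
    return edges


# Hypothetical document-topic proportions (rows are documents, each sums to 1).
labels = ["Fighting", "Politics", "Prayer"]
doc_topic = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.1, 0.8],
    [0.0, 0.1, 0.9],
]
edges = topic_edges(doc_topic, labels, threshold=0.5)
for a, b, r in edges:
    print(f"{a} -- {b}: r = {r:.2f}")
```

In this toy data, documents that discuss Fighting also tend to discuss Politics, so those two nodes are linked, while Prayer remains isolated because documents about it are about little else.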
Note: The x-axis was incorrect in the published version, ranging from -1.25 to 1.25. The correct x-axis is shown here, ranging from -0.2 to 0.2.

Fig. 1 Coefficients and standard errors for a 15-topic Structural Topic Model with Jihadi/not-Jihadi as the predictor of topics in Arab Muslim cleric writings. The words used to label each topic are shown on the left. “F:” indicates words that occur most frequently in each topic. “FREX:” indicates words that are frequent and exclusive to each topic. The Arabic words are in their stemmed form.

Figure 2 shows correlations between topics preferred by Jihadists. Documents that include language about excommunication tend to also include text about creed (what Muslims believe), Shari’a and law, the Prophet, and fighting. Documents about fighting are likely to also include politics, discussions of Salafism, Islam and modernity, and Shari’a and law. In contrast, texts about non-Jihadi legal issues (prayer, Ramadan, money, pilgrimage, and marriage) are unlikely to be about more than one topic. This aligns with our qualitative assessment of the corpus: the modal Jihadist fatwa is article-length and ranges across multiple topics, whereas the modal non-Jihadist fatwa is paragraph-length and gives a precise ruling on only one topic.

[Figure 2 node labels: Money, Pilgrimage, and Marriage; Ramadan; Hadith; Hadith Narration; Creed; Excommunication; The Prophet; Prayer; Shari’a and Law; Salafism; Fighting; Islam and Modernity; Family and Women; Social Theory; Politics]

Fig. 2 The network of correlated topics for a 15-topic Structural Topic Model with Jihadi/not-Jihadi as the predictor of topics in Arab Muslim cleric writings. Node size is proportional to the number of words in the corpus devoted to each topic.
Node color indicates the magnitude of the coefficient, with redder nodes having more positive coefficients for the Jihadi indicator and blue nodes having more negative coefficients. Edge width is proportional to the strength of the correlation between topics.

The presence of at least two clearly Jihadist topics invites further inquiry. Figure 2 shows that these topics are correlated in general, but do all Jihadists write on both topics? Do some write more on one? Does this split indicate an intellectual divide within the Jihadist subgroup? To take a first cut at these questions, we simply plot the proportion of the Excommunication topic against the Fighting topic, as shown in Fig. 3.

The results teach us several new things about how Jihadists and non-Jihadists write. First, for many Jihadists, document space spent on Fighting is a substitute for space spent on Excommunication.22 Usama bin Laden has the highest proportion of words devoted to Fighting (about 38%), but he spends only 2% of his words discussing the excommunication topic. This accords with Bin Laden’s long-time focus on the goal of targeting and provoking the West through both writings and deeds. At the other extreme, Ahmad al-Khalidi and Ali Khudayr, respectively, devote 46% and 32% of their writing to excommunication and almost none to fighting. This is not surprising when we consider the life trajectories of these clerics. Both have issued fatwas excommunicating prominent Muslims for alleged heresies and both have spent time in Saudi prisons for doing so. This adds further face validity to our results: the clerics most interested in writing about excommunication of fellow Muslims are those that have also carried it out repeatedly. Between these endpoints, most other Jihadists spread out on a continuum where more discussion of excommunication means less of fighting and vice versa. It is likely that these two topics are virtually all that some of these authors write about.
Given that filler words and others must still be assigned to topics, it may simply be the case that no more than 50% of a document can be allocated across these Jihadi topics.

22 This is not inconsistent with the finding that these two topics are correlated within texts. The presence of one topic increases the likelihood of the presence of the other topic in a text, but some authors focus on one topic more than the other.

[Figure 3: scatterplot of clerics. x-axis: Excommunication topic proportion; y-axis: Fighting topic proportion. Legend: Jihadists, Non-Jihadists. Labeled points: Usama bin Laden; Umar Abd-al-Rahman (the blind sheikh); Abdullah Azzam; Ahmed al-Khalidi; Sayyid Qutb; Abu al-Ala al-Mawdudi; Ali al-Khudayr.]

Fig. 3 Estimated topic proportions by fighting the West and excommunication topics, separated out by Jihadist versus non-Jihadist coding.

[Figure 4: scatterplot with best-fit line. x-axis: Year of cleric birth (1880 to 1980); y-axis: Proportion of writing devoted to excommunication and fighting.]

Fig. 4 The proportion of words by each Jihadi author devoted to excommunication or fighting, plotted against the year of their birth with a best-fit line.

Several Jihadist authors have low enough proportions of both Jihadist topics that they could be mistaken for non-Jihadist clerics. Sayyid Qutb is often considered one of the founders of the modern worldwide Jihadist movement, but only 3% of his writing is devoted to the topics that tend to occupy other Jihadists. Similarly, Abu al-Ala’ al-Mawdudi and Abdullah Azzam are considered canonical authors by Jihadists, but only about 10% of their writings are devoted to the topics of fighting and excommunication. To see what is unique about the writing of these authors, we look at the topics to which they devote the most attention and find that their profiles are very similar. Each devotes the bulk of their writing to social theory, politics, and Islam and modernity.

We find that the current Jihadist focus on fighting the West and excommunication is relatively new. We show this in Fig. 4 by summing the proportion of writing that each Jihadist author devotes to either excommunication or fighting and plotting it against the year of each cleric’s birth. Among the set of individuals identified in the secondary literature as Jihadists, only those of relatively recent vintage are writing on the two topics that are now core to Jihadist ideology.

To summarize, we find that a 15-topic model provides insight into the structure of an Islamic legal corpus that includes work by Jihadists and non-Jihadists. Although one might expect Jihadism to be monolithic, there are in fact multiple ways that Jihadists write about their subject. In particular, there is suggestive evidence of a trade-off for many Jihadists between focusing on fighting the West and focusing on excommunicating fellow Muslims they feel are inadequately supporting the Jihadist cause. We also find that an older generation of Jihadist writers does not write about either of these topics, suggesting that Jihadist writing was more eclectic in the past but has become homogenized over time.

4.2 Reactions to Snowden in China and the Middle East

In this section, we provide an illustrative example of how machine translation can be used in conjunction with the STM to make comparisons across countries and languages. An important theoretical and empirical agenda is understanding how other countries view the United States (Katzenstein and Keohane 2007; Chiozza 2009; Lynch 2007; Telhami 2002; Rubin 2002). One way to understand views of the United States is to compare responses to specific events (e.g., Jamal et al. n.d.).
Here we look at responses to a single event across different language communities. We collected thousands of social media posts in Arabic and Chinese during June 2013, the month when former U.S. government employee Edward Snowden disclosed thousands of classified documents that detailed the U.S. government’s clandestine surveillance program. Because the documents leaked by Snowden contained many revelations about surveillance of and cooperation with other countries, some scholars worried that the leaks would undermine U.S. legitimacy abroad (Farrell and Finnemore 2013). We focus on the reaction of citizens in China and the Middle East, arguably two of the most important U.S. strategic areas in the world.

We generate our corpus by collecting the universe of unique posts from Twitter in Arabic containing the word for “Snowden” and the universe of unique posts from Sina Weibo containing the word for “Snowden” in Chinese from June 1 to June 30, 2013.23 Twitter is banned in China, so a collection of Twitter posts in Chinese would contain those of foreign Chinese speakers, or of those who are sophisticated enough to jump the Great Firewall, and therefore would be a potentially biased sample. Sina Weibo is the closest comparable platform to Twitter in China.24

4.2.1 Two approaches to machine translation

Ideally, we want to analyze both Arabic and Chinese within the same topic model. Leaving the two corpora in their respective languages would lead to essentially no overlap in vocabulary between the Arabic and Chinese posts. As a result, each corpus would have its own individual topics, since the model cannot recognize that Snowden in Arabic is the same word as Snowden in Chinese, rendering direct comparison of topical content essentially impossible. As described in Section 3.2, we need to use some type of external alignment between languages to analyze the two corpora within the same model. Translation provides alignment by creating overlap between the two corpora.
Here we explore solutions based on machine translation as software implementations are widely available and continuously improving. We use two approaches to machine translation: translating the entire corpus and translating only terms that appear in the DTM. Both approaches easily extend to document sets containing more than two languages.

23 We point readers interested in the preprocessing that we conducted on the Snowden corpus to Online Appendix F.3.

24 Both Weibo and Twitter restrict the number of words within posts. All data were obtained from Crimson Hexagon.

In the first approach, we use machine translation to translate both corpora of text completely into a common language, English. There are compelling reasons to translate to a “third-party” language, particularly when that language is English. Perhaps the most basic reason for choosing English is the ability to communicate research findings to an English-speaking audience. We also wanted both sets of text to undergo the same amount of translation. If we had translated the Arabic into Chinese, for example, and left the Chinese text untranslated, the Chinese corpus might dominate the topic model as it would have no words that were “untranslatable.” Translating both corpora at least makes it more plausible that the inevitable error introduced in translation is roughly comparable between the two language groups, resulting in a type of symmetry. Of course, this may not be the case when two languages are more closely related or where the translation accuracy is substantially higher for one kind of transformation.

Beyond the appeal of symmetry, English is a particularly useful common language due to its role in machine-translation systems. Most modern machine translation systems use parallel corpora to learn the parameters of a statistical model.
However, many language pairs do not have large parallel corpora easily available and so instead a “pivot,” or bridge, language is used as an intermediate point in creating the translation. English is a common pivot language due to the widespread availability of texts. Thus, not only would we expect the Chinese-to-English and Arabic-to-English translations to have particularly high accuracy, but for many machine translation systems a translation between Chinese and Arabic will involve a translation through English.25 For more on pivot languages in statistical machine translation systems, we refer readers to Utiyama and Isahara (2007) and Paul et al. (2009). Habash and Hu (2009) discuss the specific case of using English as a pivot language for Chinese and Arabic.26

We use Google Translate to perform translation, passing each post through translateR and recording the translation. Online Appendix F discusses our preprocessing steps. The complete corpus strategy is ideal because it introduces no additional sources of information loss beyond the machine-translation process. Because each original text is translated, words are always considered within the context that they appear. Context not only improves accuracy in most machine-translation systems, but may, in some cases, be necessary for an appropriate translation. The downside is that the process of machine-translating a corpus of even a few thousand documents can be expensive and time-consuming because all the text is passed to the machine-translation service.

Given these considerations, we also investigate a second approach which relies on only the minimal number of translation queries. We first create a DTM for each language’s corpus separately and translate only the terms it contains. We take the intersection of the two translated vocabularies and merge the document-term matrices together.
While this approach discards word context within translation, it is considerably cheaper.27 The cost of translation for the complete corpus grows linearly with the size of the corpus because every occurrence of every unique term is translated. By contrast, in the term-by-term translation, the marginal cost of translating an additional document decreases as the corpus grows, because there are fewer and fewer unique terms in each additional document as more documents are added to the corpus.

25 Because Google Translate is a proprietary system, we do not know for sure whether it uses English as a pivot language for Chinese and Arabic. However, even if it does not, we can expect that it would provide reasonable results based on the widespread availability of English parallel corpora (e.g., Linguistic Data Consortium catalog).

26 For researchers looking to apply these methods to their own texts, we recommend English as a useful default choice for a common language, even if some of the documents are already in English. In particular circumstances, such as with language groups which are closely related or where excellent parallel text corpora are available, a different common language may be more appropriate. The applied researcher can always investigate different options by informally evaluating translation quality by using Google Translate to process a small sample of documents.

27 For our corpus, the full document translation costs approximately US$450, whereas the term translation was approximately US$10 (both with Google Translate accessed through translateR). In general, as of summer 2014, translation with the Google API costs US$20 per 1 million characters of text, so 500,000 characters costs US$10, 2 million characters costs US$40, etc. (more information at https://cloud.google.com/translate/v2/pricing). The Microsoft Translator API operates with a very different cost structure.
Users sign up for a monthly plan, which caps the total number of characters that can be translated in a single month. It is free to translate up to 2,000,000 characters per month, US$40 for 4,000,000 characters per month, US$160 for 16,000,000 characters per month, etc. (more information at https://datamarket.azure.com/dataset/bing/microsofttranslator). Note that for both Google and Microsoft, a “character” means an escaped, URL-safe character, so documents written in a language like Chinese often become three to four times longer. However, translateR automatically converts the characters to their URL-escaped versions, so users do not need to do so manually. Downloaded from http://pan.oxfordjournals.org/ at MIT Libraries on April 13, 2015 270 Christopher Lucas et al. In our case, the two approaches give somewhat comparable results. We strongly caution, though, that this may not be true in general. Fortunately, validation of the translation strategy is relatively straightforward. The natural first check is to verify that topics are not exclusively related to a particular language.28 If this is not a problem, reading documents highly associated with a particular topic in the native language provides a validation of the translation process. If the documents are largely in agreement with the semantic meaning of the concept as represented in the new language, then the loss of information from the approximate translation procedure is likely acceptable. Analysts should of course be attentive to the way that systematic errors in translation will affect the particular argument that they wish to make and adjust accordingly. While the complete corpus translation approach is to be preferred in general, the term translation strategy can provide a cost-effective alternative in particular cases. We imagine that this might be particularly useful for early exploratory analyses, which can be used to justify the greater expenditure of complete corpus translation approaches. 
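The cost argument for the two strategies can be made concrete with a small Python sketch. It uses the Google price quoted in the text (US$20 per 1 million characters, as of summer 2014); the function names and the toy corpus are hypothetical. For simplicity it counts raw characters of already-tokenized text, whereas the services bill on escaped characters, so real costs for non-Latin scripts would be higher than this arithmetic suggests.

```python
USD_PER_MILLION_CHARS = 20.0  # Google API price quoted in the text (summer 2014)

def cost_usd(n_chars, usd_per_million=USD_PER_MILLION_CHARS):
    """Translation cost for a given number of billed characters."""
    return n_chars * usd_per_million / 1_000_000

def full_corpus_chars(docs):
    """Characters sent when every occurrence of every term is translated."""
    return sum(len(term) for doc in docs for term in doc)

def unique_term_chars(docs):
    """Characters sent when each unique term is translated exactly once."""
    return sum(len(term) for term in {t for doc in docs for t in doc})

def new_terms_per_doc(docs):
    """Number of previously unseen terms contributed by each document in turn."""
    seen, counts = set(), []
    for doc in docs:
        new = set(doc) - seen
        counts.append(len(new))
        seen |= new
    return counts

# Toy tokenized corpus (hypothetical); a real corpus has thousands of documents.
docs = [
    ["snowden", "asylum", "ecuador"],
    ["snowden", "asylum", "request"],
    ["snowden", "asylum", "ecuador"],
]

print(cost_usd(500_000), cost_usd(2_000_000))
print(full_corpus_chars(docs), unique_term_chars(docs))
print(new_terms_per_doc(docs))
```

The last line shows the pattern behind the decreasing marginal cost of term-by-term translation: later documents add fewer unseen terms, while the full-corpus character count keeps growing with every document.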
4.2.2 Correcting for systematic differences between languages

As discussed in Section 2.3.2, machine translation is not an error-free process. In either of the approaches discussed above, there will be untranslated words, mistranslated words, or words with multiple meanings. As such, words that mean the same thing in the Chinese and Arabic corpora could sometimes map onto different words in English that are synonyms of each other. Just as a native Arabic speaker would speak English differently than a native Chinese speaker, using a vocabulary and sentence structure that most closely maps onto their respective native languages, the “way” in which machine translation interprets each language will be different for the two different corpora.

These linguistic differences pose a challenge for topic models. We want to ensure that the topics are uncovering differences in semantic content rather than linguistic idiosyncrasies in describing that content. As discussed in Section 3.3, the STM allows for this facet of a corpus. Within the STM, we can use a content covariate to capture variations in word use attributable to observed covariates. Here we include the document’s original language as a content covariate in order to capture linguistic differences in describing a topic. This allows us to effectively marginalize over differences in rates of word use that arise due to linguistic differences or errors in translation. For example, the Chinese word for liquor translates into “wine” in Google Translate. The Arabic word for liquor, however, translates into “spirits.” If there were a “party” topic within our corpus, the content covariate would allow both the Chinese and Arabic documents to talk about the party, but the Chinese version of the translation would use “wine” slightly more and the Arabic translation would use “spirits” slightly more.
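The “wine” versus “spirits” example can be illustrated with a toy calculation. The sketch below is not the STM estimation procedure; it only assumes a log-additive form in which a word's score within a topic is the sum of a topic-specific term and a language-specific deviation, with scores then normalized to probabilities. All numbers, the four-word vocabulary, and the names `topic_dev`, `lang_dev`, and `topic_words` are made up for illustration.

```python
import math

def softmax(scores):
    """Normalize a dict of log-scale scores into probabilities."""
    top = max(scores.values())
    exps = {word: math.exp(s - top) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

# Hypothetical log-scale weights for a single "party" topic.
topic_dev = {"party": 2.0, "music": 1.5, "wine": 1.0, "spirits": 1.0}

# Hypothetical language-specific deviations (the content covariate):
# Chinese-origin text over-uses "wine", Arabic-origin text over-uses "spirits".
lang_dev = {
    "chinese": {"wine": 0.5, "spirits": -0.5},
    "arabic": {"wine": -0.5, "spirits": 0.5},
}

def topic_words(language):
    """Word distribution of the same topic as expressed in one source language."""
    scores = {
        word: topic_dev[word] + lang_dev[language].get(word, 0.0)
        for word in topic_dev
    }
    return softmax(scores)

zh = topic_words("chinese")
ar = topic_words("arabic")
print({w: round(p, 3) for w, p in zh.items()})
print({w: round(p, 3) for w, p in ar.items()})
```

Both distributions describe one shared topic: words without a language deviation keep the same weight in each, and only the translation-sensitive words shift. That is the sense in which a content covariate lets documents from both languages load on the same topic.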
Crucially, there is a set of common words that do overlap between the two languages, which allows us to learn that these systematic differences between the languages are related words and not completely separate concepts.[29]

[28] Note that this need not signal a problem, as a topic could actually be specific to a particular country or language. We merely include this to emphasize that such findings should be checked to ensure that they did not arise from a failure in the translation process.

[29] Note the similarity here to the multilingual models discussed in Section 3.2. While those models explicitly maintain models in two or more languages using external alignment information, here we are maintaining a model in only one language but allowing for limited residual variations from the original language. This provides a more parsimonious model structure and facilitates interpretation of the model results. Situations that call for an explicit representation of the topics within multiple languages would be better served by some of the alternatives discussed in Section 3.2.

4.2.3 Results

Next, we discuss the results of our illustrative analysis. For all of our analyses, we used a 15-topic model, using an indicator variable for the language community that generated the social media post as both a topic prevalence and a content covariate, as well as a smooth function of time (date of the post) as a topic prevalence covariate.

For simplicity, we focus on three different substantive topics in this analysis. The first, which we label “attack,” deals with concerns about the United States attacking one’s own country or society. The second, labeled “human rights,” deals with posts about the implications of the Snowden episode for American credibility on issues related to freedom and human rights. The third, labeled “asylum,” concerns news updates about Snowden’s movements and whether or not he will be granted asylum and in which country.

[Figure 5: four histogram panels. Panel titles: Snowdon Topic, No Content Covariate, Chinese; Snowdon Topic, Content Covariate, Chinese; Snowdon Topic, No Content Covariate, Arabic; Snowdon Topic, Content Covariate, Arabic. x-axis: Expected Topic Proportion (0.0 to 1.0); y-axis: Frequency.]

Fig. 5 Histogram of topic proportions for the topic where the word “Snowdon” is most important. Without a content covariate, this topic is dominated by Chinese tweets and has very few Arabic tweets. With a content covariate, this topic mixes between Chinese and Arabic tweets.

First, we emphasize the role of the content covariates in handling the multiple source languages. For some reason, the Chinese version of Snowden’s last name translated to “Snowdon,” instead of “Snowden,” whereas the Chinese version of Snowden’s full name translated to “Edward Snowden.” This was not the case in Arabic. Therefore, the Chinese examples were likely to use the word “Snowdon,” in addition to “Snowden.” “Snowdon” did not appear in the Arabic texts at all. Similarly, the Chinese encoding in Google Translate creates the word “quote” when a quotation mark appears. Therefore, many of the words in the Chinese corpus have “quote” attached to them, for example “quotsnowden” or “quotprism.” Of course, the analyst could go through and identify each of these mistakes and correct them, but this would be time-consuming or impossible for larger tasks.
By modeling the fact that machine translation will make different mistakes in Chinese than in Arabic using a content covariate, we allow Chinese and Arabic tweets to talk about the same topic, while allowing the tweets from each language to use slightly modified versions of the vocabulary. Consider the “Snowdon” mistake. For purposes of comparison, we ran a topic model that did not include a content covariate. Within this topic model, the word “Snowdon” pinned down its own topic. Because “Snowdon” was one of the words defining the topic, the topic was completely dominated by Chinese tweets; no Arabic tweets were estimated to have more than 0.1 of this topic (see Fig. 5). However, this is a mistake. Chinese tweets whose translations contain “Snowdon” are often discussing the same topic as Arabic tweets that use “Snowden.” When we include the content covariate, “Snowdon” appears in a topic with “Snowden,” and this topic is similarly distributed in Chinese and Arabic tweets. Had we failed to include a content covariate, we would have created a topic falsely associated with the Chinese tweets.

We now explore the results of the model and compare the full machine translation to the translation of the document-term matrix. To illustrate, we focus first on the two topics related to the image of the United States in the eyes of Chinese- and Arabic-language tweeters, namely the “attack” and “human rights” topics. The “attack” topic, which discusses the U.S. “attacking” other countries, particularly focused on Snowden’s allegations that the U.S. government hacked into Chinese government agencies and businesses. This topic contains words such as “China,” “company,” “attack,” and “relationship.” Many of the tweets question the United States–China bilateral relationship going forward. Fig. 6 shows an example of a tweet that is largely devoted to this topic, displaying both the original text and its translation. We generated the translated text by calling the Google Translate API with translateR. Note also that the translation captures the essence of the post.

[Figure 6, translated text of the post; original-language text not shown:]
[Snowden broke the news that the American government invasion of China Network] Snowden published evidence for many years, said that the American government intrusion Chinese network has at least four years, the goal of the US government to reach hundreds of hackers, which also includes schools. Hackers typically by a massive invasion through the router, then the invasion of thousands of computers in one fell swoop, no one individual computer intrusion. http://t.cn/zHRpotF

Fig. 6 Example post for the attack topic.

The “human rights” topic discusses the U.S. record on human rights and whether the Snowden disclosures undermine this record. This topic contains words such as “violate,” “freedom,” “human,” “right,” and “traitor.” Some of the posts also discuss whether the United States is a hypocrite, violating U.S. citizens’ human rights while also advocating for greater human rights protection abroad. Fig. 7 displays a tweet on this topic, where again we use translateR to access the Google Translate API for the translated text.

[Figure 7, translated text of the post; original-language text not shown:]
Logically, especially from the Western approach to human rights warriors from other countries show a similar point of view, the US-led West should pay tribute to Snowden, gratitude, and ultimately awarded the nomination Hasa Rove European Human Rights Award, or simply America's own Herman - Hammett Human Rights Award, Lantos Human Rights Award, it is not the British Parliament as well as the first Westminster Human Rights Award. http://t.cn/zH3iwJ6

Fig. 7 Example post for the human rights topic.

Given the dramatic cost difference between the full-text and DTM translations, it is useful to investigate the similarity of the resulting topic models. If the DTM translation produces comparable results, it will clearly be preferable on cost alone. We investigate the similarity of the models for our two topics of interest, cautioning that congruence between the models for this case does not establish a general result.

[Figure 8: two panels, each plotting the difference in topic proportions (Chinese-Arabic) on the x-axis for three topics, labeled with their most frequent stemmed terms.
Topics, Full Text Translation
Human Rights: world, freedom, want, right, snowden, peopl, support
Asylum: snowden, ecuador, asylum, polit, foreign, iceland, request
Attack: govern, snowden, internet, communic, network, china, britain
Topics, Term-by-Term Translation
Human Rights: snowden, spi, edward, also, peopl, countri, arrest
Asylum: ecuador, snowden, america, state, asylum, head, next
Attack: china, usa, network, govern, snowden, attack, global]

Fig. 8 Topics related to U.S. reputation. The top plot is the estimation with full-text translation, and the bottom plot is the estimation of these topics with DTM-translation. Both plots show the relationship between the topics and the Chinese and Arabic corpora.

We examine the alignment between models by comparing the topic-word distributions of all topics in both models. Because the two models use different vocabularies, we identify the common terms and calculate the correlation between every pair of topics using the overlapping words.[30] Some of the topics align quite clearly, including the two we have highlighted above.
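The alignment check, correlating topic-word distributions over the vocabulary the two models share, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the probabilities are invented, each topic's word distribution is assumed to have already been marginalized over the two source languages, and the names `pearson` and `align_topics` are hypothetical rather than part of any package discussed here.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def align_topics(topics_a, topics_b):
    """Correlate every topic pair using only words present in both vocabularies."""
    vocab_a = set().union(*topics_a.values())
    vocab_b = set().union(*topics_b.values())
    shared = sorted(vocab_a & vocab_b)
    return {
        (ka, kb): pearson(
            [ta.get(w, 0.0) for w in shared],
            [tb.get(w, 0.0) for w in shared],
        )
        for ka, ta in topics_a.items()
        for kb, tb in topics_b.items()
    }

# Invented topic-word probabilities for two models with different vocabularies.
full_text = {
    "attack": {"china": 0.4, "network": 0.3, "snowden": 0.2, "asylum": 0.05, "ecuador": 0.05},
    "asylum": {"china": 0.05, "network": 0.05, "snowden": 0.3, "asylum": 0.3, "ecuador": 0.3},
}
term_by_term = {
    "attack": {"china": 0.5, "network": 0.2, "snowden": 0.2, "asylum": 0.05, "usa": 0.05},
    "asylum": {"china": 0.05, "snowden": 0.25, "asylum": 0.3, "ecuador": 0.3, "usa": 0.1},
}

corr = align_topics(full_text, term_by_term)
best_match = {
    ka: max(term_by_term, key=lambda kb: corr[(ka, kb)]) for ka in full_text
}
print(best_match)
```

A high correlation for a pair suggests that the two translation strategies recovered a similar topic. A topic in one model with no strongly correlated partner in the other would be a candidate for closer reading of its documents.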
In Online Appendix F.4, we provide a visualization of the correlations between all topic pairs. We explore the question of model alignment further by investigating how our aggregate inferences about the relative rates of topical prevalence would change under the different translation strategies. In Fig. 8, we plot the three topics and their estimates under each of the two translation methods. Note that the three displayed topics (“Attack,” “Human Rights,” and “Asylum”) all have similar frequent terms and substantively similar estimates. For both the full-text translation and the document-term matrix translation, the “Attack” and “Human Rights” topics are more associated with Chinese posts than with Arabic posts. At least in this case, two investigators using different translation methods might have reached similar substantive conclusions, namely that microbloggers in China seem very ready to condemn the U.S. government for hacking Chinese companies and the government and for “trampling” human rights. This analysis is further explored in Online Appendix F, along with an overview of additional topics in the model and the technical details of the estimation.

[30] Specifically, we construct a marginal estimate of the topic-word distribution by weighting the Chinese- and Arabic-specific version of the topics by their relative frequency in the corpus. We then take the intersection of the vocabulary between the two models and calculate the correlation between the distributions over those words for each topic pair.

Which topics are more associated with Arabic tweets? Arabic tweeters are more likely to be sharing news about the Snowden disclosures. The “Asylum” topic is associated with Arabic tweets and is related to speculation about where Edward Snowden will end up seeking asylum.
This topic contains words such as “Ecuador,” “Iceland,” “shelter,” “request,” and “asylum.” Arabic tweeters are much more likely to be sharing news, rather than opinions about the U.S. government’s reputation.

These results begin to speak to our original interest in the ways that the reputation of the United States was damaged in the eyes of Chinese and Middle Eastern social media users during the Snowden incident. However, we also find that these topics were more prevalent within the Chinese corpus than the Arabic corpus. This is unlike other events where there are strong reactions to U.S. intervention in the Middle East by Arabic Twitter users (Jamal et al. n.d.). The Snowden disclosures seemed to affect Chinese perceptions of the United States more strongly; not only was there considerable outcry about U.S. cyber-intervention in China, but the Snowden event generated discussion of how the United States in fact opposes human rights and is less democratic than it attempts to seem on the world stage. Perhaps, given the perception that U.S. cyber activities targeted China, the Chinese response is consistent with previous work focusing on the Middle East. Using this type of workflow, scholars could examine reactions by many countries to other world events that the United States is involved in.

5 Conclusion

The volume of textual data is growing rapidly throughout the world. These data no longer come simply in the form of newspapers, books, etc., but also in social media and other internet-based content that puts even fewer restrictions on the generation of textual data (e.g., Barberá 2012). There is no sign that this trend will change. Even if only a tiny fraction of these data is ultimately of interest to comparativists, they will need to understand a range of issues relevant to different languages that are actively being studied by scholars. This article introduces comparativists to a range of important topics in textual analysis.
We walked through a variety of research questions that comparative politics scholars have been asking and answering with textual data, and introduced the basics of textual processing with a focus on non-English texts. Next, we discussed the managing and preprocessing of text from a multilanguage perspective, including a brief discussion of new software such as txtorg, as well as a discussion of machine-based translation where we introduce a new R package, translateR, that provides easy access to the Google Translate API. Next, we briefly discussed techniques for text analysis, emphasizing the existing tools for multilingual text analysis. Finally, we used the STM to provide two examples of how comparativists can use metadata to incorporate their knowledge of corpus structure into unsupervised learning, including a novel way to use the STM when text has first been translated to a single language.

Future developments designed to address remaining challenges could proceed in a number of different directions. We are particularly interested in combining the ever-increasing advances in automated translation with existing text analysis techniques. No doubt existing translation methods are imperfect; however, translation is an active research area in academia and industry, which suggests that these systems will continue to improve over time. An open question for social scientists is how to best leverage these developments for applied research. A critical part of this process is developing diagnostic tools for assessing the sensitivity of our analysis tools to translation error. Finally, we plan to continue developing open-source software which brings the necessary tools for automated text analysis to the end-user.
The three software packages described here cover different portions of the text analysis workflow, from processing of texts to estimating the model. We plan to continue refining these tools with comparative politics scholars in mind, while developing new software, including a browser-based system for interactive topic model exploration.

Funding

Dustin Tingley gratefully acknowledges his Dean’s support for this project.

References

Alfonseca, E., S. Bilac, and S. Pharies. 2008. Decompounding query keywords from compounding languages. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, 253–256. Association for Computational Linguistics.
Barberá, P. 2012. Birds of the same feather tweet together: Bayesian ideal point estimation using Twitter data. In APSA 2012 Annual Meeting Paper.
Baturo, A., and S. Mikhaylov. 2013. Life of Brian revisited: Assessing informational and non-informational leadership tools. Political Science Research and Methods 1(01):139–57.
Blei, D. M. 2012. Probabilistic topic models. Communications of the ACM 55(4):77–84.
Blei, D. M., and J. D. Lafferty. 2007. A correlated topic model of science. Annals of Applied Statistics 1(1):17–35.
Boyd-Graber, J., and D. M. Blei. 2009. Multilingual topic models for unaligned text. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 75–82. AUAI Press.
Boyd-Graber, J., and P. Resnik. 2010. Holistic sentiment analysis across languages: Multilingual supervised latent Dirichlet allocation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 45–55. Association for Computational Linguistics.
Brachman, J. 2009. Global Jihadism. New York: Routledge.
Brady, H. E., and D. Collier. 2010. Rethinking social inquiry: Diverse tools, shared standards. Lanham, MD: Rowman & Littlefield.
Brown, P. F., J. Cocke, S. A. D. Pietra, V. J. D. Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics 16(2):79–85.
Brown, P. F., V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19(2):263–311.
Budge, I., H.-D. Klingemann, A. Volkens, J. Bara, and E. Tanenbaum. 2001. Mapping policy preferences: Estimates for parties, electors, and governments 1945–1998. Oxford: Oxford University Press.
Campbell, R. S., and J. W. Pennebaker. 2003. The secret life of pronouns: Flexibility in writing style and physical health. Psychological Science 14(1):60–65.
Catalinac, A. 2014. Pork to policy: The rise of national security in elections in Japan. Unpublished manuscript.
Cheng, K.-S., G. H. Young, and K.-F. Wong. 1999. A study on word-based and integral-bit Chinese text compression algorithms. Journal of the American Society for Information Science 50(3):218–28.
Chiozza, G. 2009. Anti-Americanism and the American world order. Baltimore: Johns Hopkins University Press.
Coscia, M., and V. Rios. 2012. Knowing where and how criminal organizations operate using web content. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, 1412–1421. ACM.
Eggers, A., and A. Spirling. 2011. Partisan convergence in executive-legislative interactions: Modeling debates in the House of Commons, 1832–1915. Unpublished manuscript.
Farrell, H., and M. Finnemore. 2013. The end of hypocrisy: American foreign policy in the age of leaks. Foreign Affairs 92:22.
Feinerer, I., K. Hornik, and D. Meyer. 2008. Text mining infrastructure in R. Journal of Statistical Software 25(5):1–54.
Fokkens, A., M. Van Erp, M. Postma, T. Pedersen, P. Vossen, and N. Freire. 2013. Offspring from reproduction problems: What replication failure teaches us. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1691–1701, Sofia, Bulgaria, August. Association for Computational Linguistics.
George, A., and A. Bennett. 2005. Case studies and theory development in the social sciences. Cambridge, MA: MIT Press.
Griffiths, T. L., and M. Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America 101(Suppl 1):5228–35.
Grimmer, J. 2010. A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate press releases. Political Analysis 18(1):1.
Grimmer, J., and B. M. Stewart. 2013. Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis 21(3):267–97.
Habash, N., and J. Hu. 2009. Improving Arabic-Chinese statistical machine translation using English as pivot language. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pp. 173–81. Association for Computational Linguistics.
Harman, D. 1991. How effective is suffixing? JASIS 42(1):7–15.
Hollink, V., J. Kamps, C. Monz, and M. De Rijke. 2004. Monolingual document retrieval for European languages. Information Retrieval 7(1–2):33–52.
Hu, Y., K. Zhai, V. Eidelman, and J. Boyd-Graber. 2014. Polylingual tree-based topic models for translation domain adaptation. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers): 1166–1176.
Hull, D. A. 1996. Stemming algorithms: A case study for detailed evaluation. JASIS 47(1):70–84.
Jamal, A., R. O. Keohane, D. Romney, and D. Tingley. n.d. Anti-Americanism or anti-interventionism? Evidence from the Arabic Twitter universe. Perspectives on Politics. Forthcoming.
Katzenstein, P. J., and R. O. Keohane. 2007. Varieties of anti-Americanism: A framework for analysis. In Anti-Americanisms in world politics, eds. P. J. Katzenstein and R. O. Keohane, 9–38. Ithaca: Cornell University Press.
King, G., J. Pan, and M. E. Roberts. 2013. How censorship in China allows government criticism but silences collective expression. American Political Science Review 107:1–18.
Koehn, P. 2009. Statistical machine translation. Cambridge, UK: Cambridge University Press.
Krovetz, R. J. 1995. Word-sense disambiguation for large text databases. PhD thesis, University of Massachusetts, Amherst.
Laver, M., K. Benoit, and J. Garry. 2003. Extracting policy positions from political texts using words as data. American Political Science Review 97(02):311–31.
Lunde, K. 2009. CJKV information processing. New York, NY: O’Reilly Media, Inc.
Lynch, M. 2007. Anti-Americanism in the Arab world. In Anti-Americanisms in world politics, eds. P. J. Katzenstein and R. O. Keohane, 196–224. Ithaca: Cornell University Press.
Manning, C. D., P. Raghavan, and H. Schütze. 2008. Introduction to information retrieval, Vol. 1. Cambridge: Cambridge University Press.
McCallum, A. K. 2002. Mallet: A machine learning for language toolkit. Available at http://mallet.cs.umass.edu.
McCants, W. 2006. Militant ideology atlas. Technical report, Combating Terrorism Center, U.S. Military Academy.
Miller, M. C. 2013. Wronged by empire: Post-imperial ideology and foreign policy in India and China. Stanford, CA: Stanford University Press.
Mimno, D., H. M. Wallach, J. Naradowsky, D. A. Smith, and A. McCallum. 2009. Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, 880–889. Association for Computational Linguistics.
Mosteller, F., and D. L. Wallace. 1963. Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed Federalist Papers. Journal of the American Statistical Association 58(302):275–309.
Nielsen, R. 2013. The lonely Jihadist: Weak networks and the radicalization of Muslim clerics. PhD Thesis, Harvard University. Ann Arbor: ProQuest/UMI (Publication No. 3567018).
Papineni, K., S. Roukos, T. Ward, and W.-J. Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of Association for Computational Linguistics, 311–318. Association for Computational Linguistics.
Paul, M., H. Yamamoto, E. Sumita, and S. Nakamura. 2009. On the importance of pivot language selection for statistical machine translation. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pp. 221–224. Association for Computational Linguistics.
Quinn, K., B. Monroe, M. Colaresi, M. Crespin, and D. Radev. 2010. How to analyze political attention with minimal assumptions and costs. American Journal of Political Science 54(1):209–228.
Roberts, M. E., B. M. Stewart, and E. Airoldi. 2015. A model of text for experimentation in the social sciences. Unpublished manuscript.
Roberts, M. E., B. M. Stewart, and D. Tingley. 2014. stm: R package for structural topic models. R package version 0.6.21. software package http://structuraltopicmodel.com/.
Roberts, M. E., B. M. Stewart, D. Tingley, and E. M. Airoldi. 2013. The structural topic model and applied social science. Advances in Neural Information Processing Systems Workshop on Topic Models: Computation, Application, and Evaluation.
Roberts, M. E., B. M. Stewart, D. Tingley, C. Lucas, J. Leder-Luis, S. Gadarian, B. Albertson, and D. Rand. 2014. Structural topic models for open-ended survey responses. American Journal of Political Science 58(4):1064–1082.
Rubin, B. 2002. The real roots of Arab anti-Americanism. Foreign Affairs 81(6):73–85.
Salton, G. 1989. Automatic text processing: The transformation, analysis, and retrieval of information by computer. Boston, MA: Addison-Wesley.
Schonhardt-Bailey, C. 2006. From the Corn Laws to free trade [electronic resource]: Interests, ideas, and institutions in historical perspective. Cambridge, MA: MIT Press.
Schrodt, P. A., and D. J. Gerner. 1994. Validity assessment of a machine-coded event data set for the Middle East, 1982–92. American Journal of Political Science 38(3):825–854.
Slapin, J. B., and S.-O. Proksch. 2008. A scaling model for estimating time-series party positions from texts. American Journal of Political Science 52(3):705–722.
Stewart, B. M., and Y. M. Zhukov. 2009. Use of force and civil–military relations in Russia: An automated content analysis. Small Wars & Insurgencies 20(2):319–343.
Stockmann, D. 2012. Media commercialization and authoritarian rule in China. New York, NY: Cambridge University Press.
Telhami, S. 2002. The stakes: America and the Middle East. Boulder, CO: Westview Press.
Tseng, H., P. Chang, G. Andrew, D. Jurafsky, and C. Manning. 2005. A conditional random field word segmenter for Sighan Bakeoff 2005. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, Vol. 171. Jeju Island, Korea.
Utiyama, M., and H. Isahara. 2007. A comparison of pivot methods for phrase-based statistical machine translation. In 2007 Proceedings of NAACL/HLT, pp. 484–491.
Van Atteveldt, W., J. Kleinnijenhuis, and N. Ruigrok. 2008. Parsing, semantic networks, and political authority: Using syntactic analysis to extract semantic relations from Dutch newspaper articles. Political Analysis 16(4):428–446.
Volkens, A., P. Lehmann, N. Merz, S. Regel, A. Werner, O. Lacewell, and H. Schultze. 2013. The manifesto data collection. In Manifesto Project (MRG/CMP/MARPOR). Berlin: Wissenschaftszentrum Berlin für Sozialforschung (WZB).
Zhao, B., and E. P. Xing. 2006. Bitam: Bilingual topic admixture models for word alignment. In Proceedings of the COLING/ACL on Main Conference Poster Sessions, pp. 969–76. Association for Computational Linguistics.