Political Methodology Tools for Data Linkage, Effect Estimation, and Statistical Inference
Access Status: Full text of the requested work is not available in DASH at this time ("dark deposit"). For more information on dark deposits, see our FAQ.
Jerzak, Connor Thomas
Citation: Jerzak, Connor Thomas. 2021. Political Methodology Tools for Data Linkage, Effect Estimation, and Statistical Inference. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences.
Abstract: When performing social science research, scholars must assemble data, estimate effects, and perform inference. This dissertation tackles challenges associated with these three tasks.
In the first chapter, we address the record linkage problem. Scholars of organizations often face unique challenges connecting datasets because disparate sources rarely share identifiers or covariates. Therefore, researchers usually resort to exact or fuzzy string matching between different lists of organization names. Nevertheless, these techniques may struggle to find correct pairs, as widely used names for the same entity often have few characters in common (e.g., "Chase" and "JPM"). This paper offers an alternative that uses the wealth of human experiences documented on the business networking website LinkedIn. It illustrates two distinct ways in which researchers can use the half-billion records from this network to build an organizational name directory of high-probability matches between organizational names from publicly traded firms, NGOs, small businesses, and government agencies from across the world. We highlight the directory's value through three evaluation exercises, where we show an increase of up to 60% in the true positive rate at a fixed false-positive rate budget compared to fuzzy matching between English names, and a dramatic improvement in the successful linkage of Western and Mandarin Chinese company names.
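The "Chase" and "JPM" example can be made concrete with a small sketch. It assumes a generic fuzzy score from Python's standard-library `difflib` and a hypothetical three-entry alias directory (`ALIAS_DIRECTORY` and the helper names are illustrative only); it is not the dissertation's directory or its construction method, which the abstract does not specify.

```python
from difflib import SequenceMatcher


def fuzzy_score(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] based on shared character runs."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical alias directory (an assumption for illustration):
# maps a normalized organization name to a made-up canonical entity ID.
ALIAS_DIRECTORY = {
    "chase": "jpmorgan-chase",
    "jpm": "jpmorgan-chase",
    "jpmorgan chase": "jpmorgan-chase",
}


def directory_match(a: str, b: str) -> bool:
    """True when both names are in the directory and map to the same entity."""
    id_a = ALIAS_DIRECTORY.get(a.strip().lower())
    id_b = ALIAS_DIRECTORY.get(b.strip().lower())
    return id_a is not None and id_a == id_b


if __name__ == "__main__":
    # The two names share no characters, so the fuzzy score is 0.0.
    print(fuzzy_score("Chase", "JPM"))
    # A directory lookup can still link them.
    print(directory_match("Chase", "JPM"))
```

Any character-overlap score will rate this pair poorly however the threshold is tuned, which is why a lookup built from external evidence about which names co-refer can recover pairs that string similarity cannot.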
In the second chapter, we examine the theoretical behavior of nearest neighbor matching, an important observational data analysis technique. Previous theory shows that the performance of matching improves slowly with more data, but has not fully disentangled the sensitivity of matching's bias to covariate imbalance or differences in intervention group variability as the sample size changes. Here, we characterize the first-order behavior of this bias in nearest neighbor matching associated with covariate imbalance and observation variance under an independent Normal covariate model. We also derive the asymptotic nearest neighbor distribution for the general Multivariate Normal case (where variables may be correlated). The results suggest new algorithms that can, with additional assumptions, control the bias, which we illustrate by examining the effect of a job training program on income.
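For readers unfamiliar with the technique, the following is a minimal sketch of one-to-one nearest neighbor matching with replacement, estimating the average treatment effect on the treated. It assumes squared Euclidean distance on raw covariates and uses made-up toy data; the function name `nn_match_att` is illustrative, and the sketch does not implement the bias-controlling algorithms the chapter proposes.

```python
def nn_match_att(X, treated, y):
    """Estimate the ATT by matching each treated unit to its nearest control.

    X: list of covariate vectors, treated: list of 0/1 flags, y: outcomes.
    Matching is with replacement, using squared Euclidean distance.
    """
    treated_idx = [i for i, t in enumerate(treated) if t]
    control_idx = [i for i, t in enumerate(treated) if not t]
    if not treated_idx or not control_idx:
        raise ValueError("need at least one treated and one control unit")

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    diffs = []
    for i in treated_idx:
        # Closest control unit in covariate space.
        j = min(control_idx, key=lambda c: sq_dist(X[i], X[c]))
        diffs.append(y[i] - y[j])
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Toy data (invented for illustration): one covariate, five units.
    X = [[0.0], [1.0], [0.1], [0.9], [5.0]]
    treated = [1, 1, 0, 0, 0]
    y = [3.0, 4.0, 1.0, 2.0, 10.0]
    print(nn_match_att(X, treated, y))
```

In the toy data each treated unit has a control only 0.1 away, so the matched differences are clean. With imbalanced covariates or unequal group variances the nearest control sits systematically farther away, which is the source of the bias the chapter characterizes.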
In the third chapter, we propose methods for inferring the presence of unobserved interactions between people and groups, which play a fundamental role in domestic and international politics. The chapter develops a machine learning approach for detecting and characterizing unobserved interference dynamics using all available information: outcome, covariate, and independent variable data. Given minimal assumptions, this approach guarantees an analyst-set cap on the rate of false influence detection while exploiting the power of modern machine learning. It is able to reconstruct important aspects of the influence structure of a network that was approximately measured by investigators in a school bullying experiment. I apply the method to 11 social science experiments and focus on one of these, a voter turnout intervention in the UK, as a case study.
Citable link to this page: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37368413
- FAS Theses and Dissertations