FAS Theses and Dissertations

Permanent URI for this collection: https://dash.harvard.edu/handle/1/4927603

Search Results

Now showing 1 - 10 of 8596
  • Publication
    Mosaic nucleic acids that bind purine nucleotides
    Trevino, Simon Gonzalez; Szostak, Jack
    Models for the origin of life have maintained that the first cells relied upon a single biopolymer for both genotype and phenotype. RNA may have provided these activities through its ability to transfer information via base-pairing and its ability to fold into functional structures. It follows that a comprehensive account of abiogenesis would include an understanding of prebiotic ribonucleotide synthesis. However, studies along these lines have shown that, depending on conditions, prebiotic chemistry may yield diverse nucleotides, some of which are based on sugars other than ribose. This monomer pool would likely support the polymerization of nucleic acid molecules characterized by a heterogeneous sugar-phosphate backbone. Copies of such mosaic nucleic acid (MNA) would conserve sequence information, but not the order and content of sugars in the sugar-phosphate backbone. Might MNA represent a possible source of early biological activity? The answer to this question largely depends on whether the structural heterogeneity of the sugar-phosphate backbone would allow for the emergence of selectable function. To test this possibility, we used in vitro selection to isolate purine nucleotide-binding MNA aptamers from a large library of random MNA sequences (containing an ∼1:1 mixed assignment of deoxy- and ribonucleotides). We report two MNA aptamers that bind either ATP or GTP with weak affinity (apparent KDs = ∼350 µM each) and moderate to high specificity. We conclude that variations in nucleic acid backbone content, perhaps introduced by imprecise synthesis, may not have posed an insurmountable barrier to the emergence of simple biological function.
  • Publication
    Leveraging Latent Spaces for Fair Results in Vector Database Image Retrieval
    (2024-11-26) Glynn, Alexander P; Calmon, Flavio; Pehlevan, Cengiz
    This work attempts to debias image retrieval from a vector database across a potentially broad class of sensitive attributes. Vector representations of images and text are increasingly common as components in downstream models, and additionally display strong capabilities without further training (zero-shot) for classification and retrieval tasks. However, undesired bias within these representations, especially with regard to race and gender, is well studied and impacts downstream tasks. We present the Conceptually Diverse Images (CDI) algorithm to confront these biases in image retrieval. CDI debiases image retrieval over a set of flexibly chosen group attributes that serve as protected classes. CDI leverages latent information from a foundational vector embedding model to work in a zero-shot fashion: no training is required to intervene for a particular set of protected attributes. A concept layer is produced by projecting a set of images similar to the query into a space defined by their position with regard to a zero-shot classification problem across each attribute. By then taking a maximally diverse set across their positions on this “concept bottleneck,” CDI increases the fairness of the returned images across several measurable metrics, including subgroup fairness notions. We present provable results on the relationship between a maximally diverse set on an idealized concept space and fairness notions for retrieval problems, and show the in-practice performance of CDI against recent competing methods of debiasing image search from a vector database. CDI displays competitive performance with prior (trained and zero-shot) methods of debiasing, several of which we extend for the first time to subgroup fairness. It shows best-in-class performance on certain metrics and, on most, it extends the Pareto front of the precision-bias curve to allow for more aggressive fairness trade-offs. We additionally examine the use of more attributes than we can measure, which promisingly comes at low cost to those we can, and conduct multiple ablation tests to justify components of CDI.
  • Publication
    Echoes of an Empire: Mortality in the Former Soviet Union Since the Mid-1990s
    (2024-11-26) Gavel, Julia; Cutler, David; Boycko, Maxim
    The collapse of the Soviet Union in 1991 initiated a period of significant socio-economic transformation across its former republics, fundamentally altering the landscape of public health and mortality. This thesis explores the complex interplay of economic, healthcare, behavioral, and psychosocial factors influencing mortality rates in the former Soviet Union from the mid-1990s through 2019. Utilizing linear regression models, it assesses the impact of these factors across fifteen former Soviet countries, with a detailed case study on Kyrgyzstan, highlighting the nuanced nature of mortality trends in the region. The study reveals that, contrary to expectations, economic indicators alone do not account for the observed mortality trends within the former Soviet Union from the mid-1990s through the 2010s. Rather, it is the shifts in risk behaviors, with a particular emphasis on alcohol consumption, that correlate most significantly with mortality rates across the region (with an R-squared value of 0.32 and strong graphical correlation for mortality by alcohol consumption). This finding highlights the limited impact of economic growth, healthcare system reforms, and psychosocial outlook on mortality trends, all of which failed to show graphical and in the absence of targeted interventions addressing risk behaviors. By examining the variance in mortality trends across the region, this thesis provides insight into the broader socio-political dynamics influencing public health outcomes post-Soviet dissolution.
  • Publication
    Pricing in the Polls: How Expected Election Outcomes Drive Asset Price Reactions in Advanced and Emerging Market Economies
    (2024-11-26) O'Connor, Taryn Marie; Pons, Vincent
    How do prior expectations influence the reaction in equity markets to an election result? In this thesis, I explore this question utilizing a dataset of 92 elections from 1988 to 2020 and election polls that were released ahead of their final rounds. Through the use of event studies, I find that, in elections expected to be close before the vote takes place, left-wing victories generate abnormal returns of -1.753 percentage points in response to election outcomes, compared to a -1.073 percentage point drop when only elections with close outcomes are considered. Additionally, there is no significant market reaction to right-wing victories in the same indices, potentially signaling that market actors are placing too much weight on a market-friendly outcome in their priors. Lastly, this analysis shows that emerging market nations react more strongly to election outcomes than advanced economies do.
  • Publication
    Geometry Optimization and Validation for a Front Suspension Assembly of a Formula Hybrid Car
    (2024-11-26) Serafin, Tommaso; Bertoldi, Katia
    Formula Student, through its various variants such as Formula Hybrid, is a staple of the undergraduate engineering experience at all major universities, and it has been shown to greatly improve the quality of work of those who take part in it. Because of this, the goal of this project is to create a strong foundation, in the form of a fully designed, optimized, and validated front suspension assembly, from which a Harvard SEAS Formula Hybrid car will be developed. The overarching objective is to develop a subsystem capable of meeting the performance and reliability requirements of the competition, but attention will also be given to the level of complexity required to integrate it in a future car, ensuring that it is attainable for a first-time team with little to no prior knowledge of Formula-style vehicle development, so that a full build can be taken to competition by 2026.
  • Publication
    Understanding Transcription Factor Activation and Repression Strength with Protein Language Models
    (2024-11-26) Petersen, Lillian Kay; Buenrostro, Jason
    Transcription factors are proteins that regulate gene expression by binding to specific sites in the genome and recruiting cofactors to either activate or repress nearby genes. Transcription factors are unique due to their enrichment of intrinsically disordered regions: regions that do not spontaneously fold into stable three-dimensional structures, but instead rapidly fluctuate between a range of unstable conformations. Because disordered regions mutate rapidly and are not subject to the same evolutionary constraints as structured proteins, they cannot be aligned with other sequences and have remained difficult to study. Recently, protein language models have emerged as promising predictors of protein structure and function. Because they take in one sequence at a time, it has been hoped that they will develop an understanding of disordered proteins, but no thorough benchmark has investigated this to date. In this thesis I systematically benchmark protein language models on their ability to both identify the location of effector domains within transcription factors and predict the effect of mutations and deletions on activation and repression strength, using large-scale activation and repression data from Delrosso et al. (2023). We find that activation domains, which are highly disordered, can easily be identified and characterized by amino acid composition, and recommend simpler, mechanistic models for activation prediction. Analysis of model weights leads us to notice that lysine is highly enriched in activation domains, but deletion of lysines further increases activation. Based on this finding, we hypothesize that post-translational modifications on lysines may act as built-in regulators of activation. Repression strength, which involves more structured interactions, is better predicted by protein language models, and even exhibits improved performance as model size increases. Protein language models may learn characteristics related to repression strength during pretraining, suggesting that complex models are appropriate for engineering goals in this context. This thesis demonstrates promising results for activation and repression prediction, and suggests that mapping the regulatory logic of effector domains is within reach with additional data.
  • Publication
    Investigating Ionic Liquid Chemical Structures for Applications in Whole Tumor Cell Vaccine Immunotherapy
    (2024-11-26) Kapsalis, Litsa Magdalini; Mitragotri, Samir; Rodrigues, Danika; Ba, Demba; Weitz, David
    Whole tumor cell vaccines are an emerging immunotherapy strategy whereby tumor cells are inactivated and modulated ex vivo prior to injection back into a patient to deliver a library of antigens for the body’s immune system to recognize and target. Currently, irradiation is the most common method of ex vivo induction of cell death in tumor cell vaccines. However, irradiation has limitations, especially insufficient tumor immunogenicity. In this study, we explored ionic liquids (ILs) as alternatives in the preparation of whole tumor cell vaccines. We aimed to create a tunable library of ILs for whole tumor cell vaccines by determining the impact of IL anion length, branching, and unsaturation on 4T1 murine mammary cancer cells. We examined the impact of IL anion structure on cell cytotoxicity, cell death pathways, immunogenicity of cell death, and autophagy activation. We determined that IL-induced cytotoxicity increases with anion carbon chain length and with concentration, and decreases when branched methyl or ethyl groups or double bonds are added to the anion structure (at constant carbon chain length). Among all ILs, apoptosis appeared as the primary mechanism of cell death, though the stage of apoptotic activity was tunable. Furthermore, anion branching and unsaturation decrease exposure of calreticulin. ILs were shown to induce ATP release in a concentration-dependent manner, though autophagy was not observed to be induced significantly. This work has provided further characterization of the biological interface of ILs in activating whole tumor cell vaccines and has demonstrated their tunable ability to induce immunogenic activity in cancer.
  • Publication
    Analysis of the Harvard Computer Society Email Archives: An Exploration of Differential Privacy in Practice
    (2024-11-26) Cooper, William Chen; Dwork, Cynthia
    This thesis provides a rudimentary introduction to differential privacy as a framework for modern data privacy, using the Harvard Computer Society email list archives as an investigative medium. The differentially private analysis of this dataset includes but is not limited to: time series of list usage, email topic modeling, and sentiment analysis. OpenDP’s Python package for differential privacy is used extensively to execute computations, and the API is evaluated as a standalone programming framework in itself. Novel differentially private graph algorithms are both implemented and empirically assessed. Lastly, this thesis discusses a significant inherent challenge in balancing contrasting aspects of differential privacy and exploratory data analysis.
  • Publication
    A Physics-Oriented Approach to the Classification of Extreme Weather Events
    (2024-11-26) Hartvigsen, Benjamin Russell; Linz, Marianna
    Common approaches to the classification of extreme weather events often only consider the intensity and/or rarity of the event without considering the physical processes driving these events. This approach could negatively impact the quality of data used for the study of these events by falsely including data from events driven by different physical processes. Alternatively, these approaches could also artificially limit the quantity of available data to a subset of a larger group of similar events. To address this issue, I suggest an alternative approach that groups events based on similarities in the physical processes driving them. More specifically, throughout this thesis, I outline a method of defining thresholds for the classification of extreme events based on qualitative shifts in the relative contribution of these processes. This method is demonstrated on a simple synthetic example in which it is able to pick up on changes to the standard deviation of the distribution from which values are being sampled. After this, the method is applied to temperature data from two model runs (historical and warming scenarios) using the Geophysical Fluid Dynamics Laboratory's AM4 and CM4 models. These data are used to demonstrate the physical differences on either side of the threshold selections and the variability in how these events may be affected by anthropogenic climate change.
  • Publication
    Information Theory for Vector Databases
    (2024-11-26) Rakhamimov, Joel; Calmon, Flavio; Lu, Yue; Ba, Demba
    Vector embeddings are a relatively novel concept that has been rapidly increasing in popularity for uses in data science. They are the output of a deep learning model, which usually takes an image or a string of text as an input. An embedding is an element of $\mathbb{R}^n$, in which it is much easier to perform actions like similarity search than in the original image space, which is likely high dimensional. As of yet, there is no rigorous information theoretic treatment of vector databases or vector embeddings, leaving the theoretical foundations weak. In this work, as part of research done in the Flavio Calmon group, we take the first step to apply information theory to vector databases, using CLIP as our embedding model and CIFAR-10 as our image dataset. First, we cover the formulation of the problem of vector embedding, including the theory underlying the embeddings, considering it as a transformation, and viewing the problem of embedding as a Gromov-Wasserstein alignment. We then discuss curious statistical properties of the embeddings, including their distribution and mutual distances. Part of our findings is that calculating distances between vector embeddings results in significantly faster and more accurate sorting than doing the same for their original images. However, the absolute cosine similarity between vectors from similar images is still very low. Throughout, we describe dimensionality reduction techniques, product quantization codes, the cone effect, and Gaussianity of embeddings, for the last two of which we observe only weak results.
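
    Illustrative note: the similarity search on embeddings mentioned in the abstract above can be sketched in a few lines. This is a minimal sketch and not the thesis's code. It assumes plain Python lists stand in for embeddings, and the three-dimensional toy vectors and the names `cosine_similarity`, `rank_by_similarity`, and `toy_database` are hypothetical (they are not CLIP outputs or CIFAR-10 data).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, database):
    """Return (name, score) pairs sorted from most to least similar to query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; a real system would store model outputs instead.
toy_database = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.1],
    "truck": [0.0, 0.2, 0.9],
}

ranking = rank_by_similarity([1.0, 0.0, 0.0], toy_database)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

    Sorting by cosine similarity compares directions rather than magnitudes, which is why it is a common choice for embedding retrieval; a production vector database would typically replace the linear scan with an approximate index.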