Publication:

Large Language Model Embeddings for Single-Cell Transcriptomics: A Framework for Robust Classification of Motor Neuron Vulnerability

Date

2026-02-10

Citation

Nguyen, Laura. 2026. Large Language Model Embeddings for Single-Cell Transcriptomics: A Framework for Robust Classification of Motor Neuron Vulnerability. Bachelors Thesis, Harvard University Engineering and Applied Sciences.

Abstract

Selective vulnerability is a defining feature of neurodegenerative disease, yet conventional single-cell analyses often blur subtype boundaries and obscure the molecular programs that distinguish resilience from decline. To address these limitations, this study introduces an integrative framework that combines large language model (LLM)-derived gene embeddings with contrastive learning to generate biologically contextualized representations of single-cell RNA sequencing data. The framework was applied to retinal ganglion cells (RGCs) and validated in motor neurons (MNs) to assess generalizability across neuronal contexts. For each cell, the most highly expressed genes were linked to curated textual summaries and embedded using pretrained transformer models. Expression-weighted aggregation produced cell-level representations, while contrastive learning further refined these embeddings by isolating subtype-specific transcriptional features from background expression. Across embedding architectures, LLM-based representations consistently outperformed graph-based baselines in MN subtype classification and preserved rare RGC subtypes under data-limited conditions. The same embedding and contrastive framework transferred reliably across datasets through k-nearest-neighbor (KNN) mapping, reproducing subtype topology without additional training. Visualization and cross-referencing analyses revealed transcriptional gradients consistent with known vulnerability hierarchies and highlighted pathways associated with neuronal differentiation, metabolic resilience, and degeneration. Collectively, these findings demonstrate that integrating literature-derived biological context with self-supervised learning provides a scalable framework for investigating the molecular logic of selective neuronal vulnerability in neurodegenerative disease.
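
The two steps the abstract describes most concretely, expression-weighted aggregation and KNN-based mapping, can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the gene identifiers, the toy two-dimensional embeddings, the default top_k and k values, and the use of cosine similarity are all assumptions made for illustration, and the contrastive refinement step is omitted entirely.

```python
import math
from collections import Counter


def cell_embedding(expression, gene_embeddings, top_k=3):
    """Expression-weighted mean of the embeddings of a cell's top_k genes.

    expression: dict mapping gene name -> expression value for one cell.
    gene_embeddings: dict mapping gene name -> embedding vector (list of floats).
    Genes without an embedding are skipped (an assumption of this sketch).
    """
    ranked = sorted(expression, key=expression.get, reverse=True)
    top = [g for g in ranked if g in gene_embeddings][:top_k]
    dim = len(next(iter(gene_embeddings.values())))
    total = sum(expression[g] for g in top)
    vec = [0.0] * dim
    if total == 0:
        return vec  # no usable signal: return the zero vector
    for g in top:
        weight = expression[g] / total  # weights sum to 1 over the top genes
        for i, x in enumerate(gene_embeddings[g]):
            vec[i] += weight * x
    return vec


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_label(query, reference, labels, k=3):
    """Majority label among the k reference cells most similar to the query."""
    ranked = sorted(range(len(reference)),
                    key=lambda i: cosine(query, reference[i]),
                    reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy usage with hypothetical genes "g1".."g3" and made-up 2-D embeddings.
embeddings = {"g1": [1.0, 0.0], "g2": [0.0, 1.0], "g3": [1.0, 1.0]}
cell = {"g1": 6.0, "g2": 2.0, "g3": 0.5}
query_vec = cell_embedding(cell, embeddings, top_k=2)

reference = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["A", "A", "B", "B"]
predicted = knn_label(query_vec, reference, labels, k=3)
```

Because the mapping step only compares embeddings against labeled reference cells, it needs no further training, which matches the abstract's description of transfer across datasets.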

Keywords

Computer science, Neurosciences

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
