Publication: Large Language Model Embeddings for Single-Cell Transcriptomics: A Framework for Robust Classification of Motor Neuron Vulnerability
Abstract
Selective vulnerability is a defining feature of neurodegenerative disease, yet conventional single-cell analyses often blur subtype boundaries and obscure the molecular programs that distinguish resilience from decline. To address these limitations, this study introduces an integrative framework that combines large language model (LLM)-derived gene embeddings with contrastive learning to generate biologically contextualized representations of single-cell RNA sequencing data. The framework was applied to retinal ganglion cells (RGCs) and validated in motor neurons (MNs) to assess generalizability across neuronal contexts. For each cell, the most highly expressed genes were linked to curated textual summaries and embedded using pretrained transformer models. Expression-weighted aggregation produced cell-level representations, while contrastive learning further refined these embeddings by isolating subtype-specific transcriptional features from background expression. Across embedding architectures, LLM-based representations consistently outperformed graph-based baselines in MN subtype classification and preserved rare RGC subtypes under data-limited conditions. The same embedding and contrastive framework transferred reliably across datasets through k-nearest-neighbor (KNN) mapping, reproducing subtype topology without additional training. Visualization and cross-referencing analyses revealed transcriptional gradients consistent with known vulnerability hierarchies and highlighted pathways associated with neuronal differentiation, metabolic resilience, and degeneration. Collectively, these findings demonstrate that integrating literature-derived biological context with self-supervised learning provides a scalable framework for investigating the molecular logic of selective neuronal vulnerability in neurodegenerative disease.
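
As an illustration only, the expression-weighted aggregation and KNN-based mapping described in the abstract might be sketched as follows. This is a minimal sketch under stated assumptions, not the study's implementation: the function names (`cell_embedding`, `knn_transfer`), the `top_k` and `k` parameters, the use of cosine similarity with majority voting, and the toy vectors standing in for pretrained transformer embeddings are all hypothetical, and the contrastive refinement step is omitted.

```python
import math
from collections import Counter


def cell_embedding(expr, gene_emb, top_k=3):
    """Expression-weighted mean of the embeddings of a cell's top_k genes.

    expr     : list of expression values, one per gene (hypothetical input)
    gene_emb : list of per-gene embedding vectors; in practice these would
               come from a pretrained text model, here they are toy vectors
    """
    # Pick the top_k most highly expressed genes for this cell.
    top = sorted(range(len(expr)), key=lambda g: expr[g], reverse=True)[:top_k]
    total = sum(expr[g] for g in top)
    dim = len(gene_emb[0])
    if total == 0:
        return [0.0] * dim  # no expression signal to weight by
    # Weight each selected gene's embedding by its share of expression.
    return [sum(expr[g] / total * gene_emb[g][d] for g in top) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def knn_transfer(ref_emb, ref_labels, query_emb, k=3):
    """Label each query cell by majority vote among its k most similar
    reference cells; no model is trained in this step."""
    predicted = []
    for q in query_emb:
        nearest = sorted(
            range(len(ref_emb)),
            key=lambda i: cosine(q, ref_emb[i]),
            reverse=True,
        )[:k]
        votes = Counter(ref_labels[i] for i in nearest)
        predicted.append(votes.most_common(1)[0][0])
    return predicted


# Toy usage: four genes with one-hot "embeddings", one cell.
toy_gene_emb = [[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0],
                [0.0, 0.0, 0.0, 1.0]]
toy_cell = cell_embedding([0.0, 3.0, 1.0, 0.0], toy_gene_emb, top_k=2)

# Toy label transfer from a four-cell reference to two query cells.
toy_ref = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_ref_labels = ["A", "A", "B", "B"]
toy_pred = knn_transfer(toy_ref, toy_ref_labels, [[1.0, 0.05], [0.05, 1.0]], k=2)
```

Because the mapping step only compares embeddings against a labeled reference, a sketch like this needs no additional training on the query dataset, which is the property the abstract attributes to KNN-based transfer.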