Person: Charlon, Thomas
Publication: Word Embeddings in Mental Health
(2024-06-12) Charlon, Thomas; Cai, Tianxi
Word Embeddings In Mental Health
Mental health diagnoses have been rising in recent years, especially since the pandemic. In the CELEHS laboratory, we analyze electronic health records to help clinicians identify at-risk patients requiring follow-up. In this talk I will present the results of GloVe word embeddings on 8,000 open-access suicide-related publications, using the text2vec, opticskxi and sgraph R packages. I developed a novel methodology based on random projections to efficiently find diverse clusters of related concepts in unstructured text data, and I evaluate the results by predicting pairs of related concepts and comparing them to known relationships curated by clinicians. While many biomedical natural language processing approaches focus on the analysis of specific known concepts, such as those indexed by the Unified Medical Language System (UMLS), analyzing the complete text makes it possible to find novel relationships and borderline concepts.
The Diagnostic and Statistical Manual of Mental Disorders (DSM) is the main reference for clinicians diagnosing mental health disorders, and describes the sets of symptoms that form the required diagnostic criteria for each disorder. The DSM emphasizes that many patients are diagnosed with multiple conditions, and that current diagnoses could benefit from multidimensional assessments that take into account the severity, intensity, duration, and combinations of symptoms, in order to form more precise diagnoses and guide treatment. The DSM also underlines the need for diagnoses that account for the spectrum and gradients of disorders observed, as in schizoaffective and autism spectrum disorders. To this end, the analysis of unstructured text data can help identify clusters of conditions and enable new multidimensional classifications of mental health disorders.
My talk will first present an overview of the problem as underlined in the DSM and introduce GloVe word embeddings using the text2vec package on a set of 8,000 open-access suicide-related publications. I then demonstrate how to explore the embeddings using vector operations to manually find clusters of related concepts, and in a second step automate the discovery of such clusters using the density-based clustering package opticskxi, and visualize the clusters as graphs using the sgraph network visualization package. Further clusters are then discovered by applying semi-directed vector operations, a novel method inspired by random projections. Finally, I introduce ways to evaluate such clusters using a database of 17,000 known concept pairs curated by clinicians with expert knowledge: pairs of related concepts are predicted by applying a cut-off on cosine similarities, chosen to meet a false positive threshold.
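The evaluation step described above can be illustrated with a minimal sketch. It is written in plain Python rather than with the R packages named in the talk, so it does not reflect their APIs. The function names, the toy embeddings, and the choice of estimating the cut-off from a sample of presumably unrelated ("null") pairs are all assumptions made for illustration, not the author's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fpr_threshold(null_similarities, max_fpr):
    """Cut-off such that at most a max_fpr fraction of the null
    (assumed unrelated) pairs score strictly above it."""
    ranked = sorted(null_similarities, reverse=True)
    # Number of null pairs we tolerate above the cut-off.
    n_allowed = min(int(max_fpr * len(ranked)), len(ranked) - 1)
    return ranked[n_allowed]


def predict_related(embeddings, pairs, threshold):
    """Keep the pairs whose cosine similarity is strictly above the threshold."""
    return [(a, b) for a, b in pairs
            if cosine(embeddings[a], embeddings[b]) > threshold]


if __name__ == "__main__":
    # Hypothetical 2-dimensional embeddings, for illustration only.
    toy_embeddings = {
        "insomnia": [0.9, 0.1],
        "sleep": [1.0, 0.0],
        "fracture": [0.0, 1.0],
    }
    # Hypothetical similarities observed on randomly drawn (null) pairs.
    null_sims = [i / 10 for i in range(1, 11)]
    cutoff = fpr_threshold(null_sims, max_fpr=0.2)
    candidates = [("insomnia", "sleep"), ("insomnia", "fracture")]
    print(cutoff, predict_related(toy_embeddings, candidates, cutoff))
```

Predicted pairs could then be compared against a curated list of known pairs to estimate how many true relationships are recovered at the chosen false positive rate.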
Novel methodologies in natural language processing will enable us to further understand mental health disorders and their interactions. Specific disorders have been associated with personality traits, such as schizophrenia with neuroticism and autism with obsessive-compulsive traits. Modeling such interactions and discovering new ones will help improve the treatment of mental health disorders and identify clinically actionable features.
Main Sections
00:00 Introduction, Center for Suicide Research and Prevention project
05:42 Text processing and embeddings computation
18:11 Exploration of embeddings, vector operations
29:51 Density-based clustering with OPTICS k-Xi
36:20 Evaluation and knowledge graph generation
More Resources
Center for Suicide Research and Prevention: https://csrp.mgh.harvard.edu/
Git repository of demo: https://gitlab.com/thomaschln/psychclust_rmed24
OPTICS k-Xi density-based clustering R package: https://cran.r-project.org/package=opticskxi
Knowledge graphs R package: https://cran.r-project.org/package=kgraph
Publication: Knowledge graphs for drug discovery workshop
(2024-10) Charlon, Thomas; Cai, Tianxi
To help clinical research organizations explore their genetic association results, we developed the kgraph and sgraph packages to build and visualize knowledge graphs in Shiny with Sigma.js.
Imagine you have a data frame of p-value associations between phenotypes and genes: using these packages, you can easily identify which genes are associated with several phenotypes. When you select phenotypes, all associated genes are displayed, and the genes associated with several phenotypes are highlighted and spatially grouped. Nodes and edges are scaled so that the most important information is visible at a glance, while you can still zoom in on specific points you want to explore. Additionally, we have added the ability to overlay supplementary nodes such as drug information, showing for example the maximum clinical phase reached, to identify which drugs target which genes and how far each drug's development has progressed.
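The core idea above, finding genes associated with several selected phenotypes, can be sketched in a few lines. This is an illustrative plain-Python sketch and not the kgraph or sgraph API: the function name `shared_genes`, the row layout `(phenotype, gene, p_value)`, the significance cut-off `p_max`, and the toy identifiers are all assumptions.

```python
from collections import defaultdict


def shared_genes(associations, selected_phenotypes, p_max=0.05):
    """Return genes significantly associated with at least two of the
    selected phenotypes, mapped to the sorted list of those phenotypes.

    associations: iterable of (phenotype, gene, p_value) rows.
    p_max: hypothetical significance cut-off, chosen for illustration.
    """
    selected = set(selected_phenotypes)
    by_gene = defaultdict(set)
    for phenotype, gene, p_value in associations:
        if phenotype in selected and p_value <= p_max:
            by_gene[gene].add(phenotype)
    return {gene: sorted(phens)
            for gene, phens in by_gene.items() if len(phens) >= 2}


if __name__ == "__main__":
    # Hypothetical association table, for illustration only.
    rows = [
        ("PHENO_1", "GENE_A", 0.001),
        ("PHENO_2", "GENE_A", 0.010),
        ("PHENO_1", "GENE_B", 0.020),
        ("PHENO_2", "GENE_B", 0.500),  # not significant, so GENE_B is not shared
    ]
    print(shared_genes(rows, ["PHENO_1", "PHENO_2"]))
```

In a graph view, the genes returned here would be the nodes connected to more than one selected phenotype, which is what makes them natural candidates for highlighting and grouping.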
The sgraph package provides an htmlwidget interface to the Sigma.js graph visualization library for use in Shiny and has been available on CRAN since May 2024. The kgraph package focuses on building knowledge graphs and then calls the sgraph package. It is currently being packaged for CRAN submission, and the development version is available at https://gitlab.com/thomaschln/kgraph. It includes a Shiny app, a live version of which can be found at https://celehs.connect.hms.harvard.edu/kgraph/
Resources mentioned in the workshop:
{opticskxi} OPTICS K-Xi Density-Based Clustering https://cran.r-project.org/package=opticskxi
{kgraph} Knowledge Graphs R package https://cran.r-project.org/package=kgraph
{kgraph} vignette https://cran.r-project.org/web/packages/kgraph/vignettes/kgraph.html
Word embeddings in mental health (R/Medicine 2024 presentation recording) https://www.youtube.com/watch?v=LCK0UqQ1oK4
Workshop recorded as part of the 2024 R/Pharma Workshop Series
Publication: R / Python Pipelines for Biomedical LLM Semantic Search Apps
(2025-03-08) Hoche, Joseph; Charlon, Thomas; Cai, Tianxi
Leveraging PyTorch's GPU indexing and R's data management, evaluation and visualization capabilities
At the CELEHS laboratory we are particularly interested in LLM-based embeddings such as BGE and BERT. As the number of models increases, we need methods to compare their clinical usefulness. While some R packages exist to leverage GPU capabilities, PyTorch is far more widely used for GPU computation. In contrast, R is efficient for data management and visualization. How should one build robust and reproducible pipelines incorporating both? My answer is well-designed pipelines with Docker, Makefile, and Elasticsearch. In this talk I will showcase my design approaches to such challenges.