Identifying Antibiotic Resistance in Mycobacterium Tuberculosis With Machine Learning: A Quick and Accurate Alternative to Conventional Diagnostics

Chen, Michael L.

Citation

Chen, Michael L. 2020. Identifying Antibiotic Resistance in Mycobacterium Tuberculosis With Machine Learning: A Quick and Accurate Alternative to Conventional Diagnostics. Bachelor's thesis, Harvard College.

Abstract

Diagnosing multidrug-resistant and extensively drug-resistant tuberculosis is a global health priority. Whole genome sequencing of clinical Mycobacterium tuberculosis isolates promises to circumvent the long wait times and limited scope of conventional phenotypic antimicrobial susceptibility testing, but gaps remain in accurately predicting phenotype from genotypic data, especially for certain drugs. My primary aim was to implement and explore statistical methods and deep learning algorithms on a rich dataset in order to build a high-performing, fast-predicting model for detecting anti-tuberculosis drug resistance.
I collected targeted or whole genome sequencing data and conventional drug resistance phenotyping data from 3,601 Mycobacterium tuberculosis strains enriched for resistance to first- and second-line drugs. I investigated the utility of (1) rare variants and variants known to be determinants of resistance to at least one drug and (2) statistical methods and deep learning architectures in predicting phenotypic resistance to 10 anti-tuberculosis drugs. Performance was validated on an independent validation set and also compared with that of a convolutional neural network approach on an expanded set of 10,198 Mycobacterium tuberculosis strains.
The highest-performing machine and statistical learning methods included both rare variants and variants known to cause resistance to at least one drug. Both the simpler L2-penalized regression and a multidrug wide and deep neural network (MD-WDNN) had high predictive performance. The average AUCs for the highest-performing model, the MD-WDNN, were 0.979 for first-line drugs and 0.936 for second-line drugs during repeated cross-validation. On an independent validation set, the highest-performing model showed average AUCs, sensitivities, and specificities, respectively, of 0.937, 87.9%, and 92.7% for first-line drugs and 0.891, 82.0%, and 90.1% for second-line drugs. The method outperformed previously reported machine learning models during cross-validation, with higher AUCs for 8 of 10 drugs. Performance remained high on the expanded set of 10,198 strains, and the extension to a convolutional neural network approach showed promising results with interpretable saliency map visualizations.
Overall, the machine learning models described in this work significantly improve the accuracy of antibiotic resistance prediction and hold promise for bringing sequencing technologies closer to the bedside.
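The AUC, sensitivity, and specificity figures reported above can be illustrated with a minimal Python sketch. This is not the thesis code: the toy labels and scores, the 0.5 decision threshold, and the function names roc_auc and sensitivity_specificity are all assumptions made for illustration. Labels use 1 for a resistant isolate and 0 for a susceptible one, and each score stands in for a model's predicted probability of resistance to a single drug.

```python
from typing import List, Tuple


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the fraction of (resistant, susceptible) pairs in which the
    resistant isolate gets the higher score; ties count as half a pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one resistant and one susceptible isolate")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(
    labels: List[int], scores: List[float], threshold: float = 0.5
) -> Tuple[float, float]:
    """Call an isolate resistant when its score is >= threshold, then return
    (sensitivity, specificity) for those calls."""
    tp = fn = tn = fp = 0
    for y, s in zip(labels, scores):
        predicted_resistant = s >= threshold
        if y == 1 and predicted_resistant:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_resistant:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one resistant and one susceptible isolate")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data for one drug: four resistant, four susceptible isolates.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]

toy_auc = roc_auc(toy_labels, toy_scores)
toy_sens, toy_spec = sensitivity_specificity(toy_labels, toy_scores)
```

AUC does not depend on a threshold, whereas sensitivity and specificity do. On the toy data, 14 of the 16 resistant/susceptible pairs are ranked correctly, giving an AUC of 0.875, and the 0.5 threshold yields a sensitivity and a specificity of 0.75 each. An average over drugs, like the first-line and second-line averages quoted above, would be a mean of such per-drug values.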

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

Citable link to this page

https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37364665

Collections

FAS Theses and Dissertations
