Publication: Evaluation of a Large-Scale Biomedical Data Annotation Initiative
Date
2009
Publisher
BioMed Central
Citation
Lacson, Ronilda, Erik Pitzer, Christian Hinske, Pedro Galante, and Lucila Ohno-Machado. 2009. Evaluation of a large-scale biomedical data annotation initiative. BMC Bioinformatics 10(Suppl 9): S10.
Abstract
Background: This study describes a large-scale manual re-annotation of data samples in the Gene Expression Omnibus (GEO), using variables and values derived from the National Cancer Institute thesaurus. A framework is described for creating an annotation scheme for various diseases that is flexible, comprehensive, and scalable. The annotation structure is evaluated by measuring coverage and agreement between annotators.
Results: There were 12,500 samples annotated with approximately 30 variables, in each of six disease categories: breast cancer, colon cancer, inflammatory bowel disease (IBD), rheumatoid arthritis (RA), systemic lupus erythematosus (SLE), and Type 1 diabetes mellitus (DM). The annotators provided excellent variable coverage, with known values for over 98% of three critical variables: disease state, tissue, and sample type. There was 89% strict inter-annotator agreement and 92% agreement when using semantic and partial similarity measures.
Conclusion: We show that it is possible to perform manual re-annotation of a large repository in a reliable manner.
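The two evaluation measures named in the abstract, variable coverage and strict inter-annotator agreement, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the dictionary layout, the treatment of `None`, empty strings, and "unknown" as missing, and the toy records are all assumptions made for illustration. The semantic and partial similarity measures mentioned in the abstract are not reproduced here.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical representation: one dict per sample, mapping variable -> value.
Annotation = Mapping[str, Optional[str]]

# Assumption: these values count as "not known" when computing coverage.
MISSING = (None, "", "unknown")


def coverage(samples: Sequence[Annotation], variable: str) -> float:
    """Fraction of samples with a known value for `variable`."""
    if not samples:
        return 0.0
    known = sum(1 for s in samples if s.get(variable) not in MISSING)
    return known / len(samples)


def strict_agreement(
    a: Sequence[Annotation],
    b: Sequence[Annotation],
    variables: Sequence[str],
) -> float:
    """Fraction of (sample, variable) pairs where both annotators
    recorded exactly the same value (exact string match)."""
    total = 0
    matches = 0
    for ann_a, ann_b in zip(a, b):
        for v in variables:
            total += 1
            if ann_a.get(v) == ann_b.get(v):
                matches += 1
    return matches / total if total else 0.0


# Toy data, invented for illustration only (not taken from the study).
VARIABLES = ["disease_state", "tissue", "sample_type"]
annotator_a = [
    {"disease_state": "breast cancer", "tissue": "breast", "sample_type": "biopsy"},
    {"disease_state": "control", "tissue": "colon", "sample_type": None},
]
annotator_b = [
    {"disease_state": "breast cancer", "tissue": "breast", "sample_type": "cell line"},
    {"disease_state": "control", "tissue": "colon", "sample_type": None},
]

sample_type_coverage = coverage(annotator_a, "sample_type")
agreement = strict_agreement(annotator_a, annotator_b, VARIABLES)
```

Exact string matching is the simplest reading of "strict" agreement. A partial or semantic measure would instead give credit to near-matches, for example a term and its parent in the thesaurus, which is consistent with the higher agreement figure the abstract reports for those measures.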
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service