dc.contributor.author: Fang, Ruihua
dc.contributor.author: Schindelman, Gary
dc.contributor.author: Auken, Kimberly Van
dc.contributor.author: Fernandes, Jolene
dc.contributor.author: Chen, Wen
dc.contributor.author: Wang, Xiaodong
dc.contributor.author: Davis, Paul
dc.contributor.author: Tuli, Mary Ann
dc.contributor.author: Marygold, Steven J
dc.contributor.author: Millburn, Gillian
dc.contributor.author: Matthews, Beverley
dc.contributor.author: Zhang, Haiyan
dc.contributor.author: Brown, Nick
dc.contributor.author: Gelbart, William Martin
dc.contributor.author: Sternberg, Paul W
dc.date.accessioned: 2013-11-01T17:08:14Z
dc.date.issued: 2012
dc.identifier.citation: Fang, Ruihua, Gary Schindelman, Kimberly Van Auken, Jolene Fernandes, Wen Chen, Xiaodong Wang, Paul Davis, et al. 2012. Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics 13:16. [en_US]
dc.identifier.issn: 1471-2105 [en_US]
dc.identifier.uri: http://nrs.harvard.edu/urn-3:HUL.InstRepos:11248784
dc.description.abstract: Background: Curation of information from bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify, from all published literature, the papers that contain results for a specific data type the curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest, and is thus usually time-consuming. We developed an automatic method, based on the Support Vector Machine (SVM) machine learning method, for identifying papers containing these curation data types among a large pool of published scientific papers. This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been in production use for automatic categorization of ten different experimental data types in the biocuration process at WormBase for the past two years, and it is in the process of being adopted in the biocuration process at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community and thereby greatly reduce the time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure to utilize training papers of similar data types from different bodies of literature, such as C. elegans and D. melanogaster, to identify papers with any of these data types for a single database. This approach has great significance because for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance. Results: We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase and three data types from Mouse Genome Informatics (MGI). It is being used in the curation workflow at WormBase for automatic association of newly published papers with ten data types: RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction. Conclusions: Our methods are applicable to a variety of data types with training sets containing several hundred to a few thousand documents. They are completely automatic and thus can be readily incorporated into the workflows of different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort. [en_US]
dc.description.sponsorship: Molecular and Cellular Biology [en_US]
dc.language.iso: en_US [en_US]
dc.publisher: BioMed Central [en_US]
dc.relation.isversionof: doi:10.1186/1471-2105-13-16 [en_US]
dc.relation.hasversion: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3305665/pdf/ [en_US]
dash.license: LAA
dc.subject: animals [en_US]
dc.subject: artificial intelligence [en_US]
dc.subject: automaton [en_US]
dc.subject: automation [en_US]
dc.subject: Caenorhabditis elegans [en_US]
dc.subject: factual databases [en_US]
dc.subject: genetic databases [en_US]
dc.subject: Drosophila melanogaster [en_US]
dc.subject: genetics [en_US]
dc.subject: genomics [en_US]
dc.subject: mice [en_US]
dc.subject: publications [en_US]
dc.subject: support vector machines [en_US]
dc.title: Automatic Categorization of Diverse Experimental Information in the Bioscience Literature [en_US]
dc.type: Journal Article [en_US]
dc.description.version: Version of Record [en_US]
dc.relation.journal: BMC Bioinformatics [en_US]
dash.depositing.author: Gelbart, William Martin
dc.date.available: 2013-11-01T17:08:14Z
dc.identifier.doi: 10.1186/1471-2105-13-16 [*]
dash.authorsordered: false
dash.contributor.affiliated: Zhang, Haiyan
dash.contributor.affiliated: Wang, Xiaodong
dash.contributor.affiliated: Gelbart, William Martin
dash.contributor.affiliated: Matthews, Beverley

