SWAN: A Distributed Knowledge Infrastructure for Alzheimer Disease Research

SWAN – a Semantic Web Application in Neuromedicine – is a project to develop an effective, integrated scientific knowledge infrastructure for the Alzheimer disease (AD) research community, using the energy and self-organization of that community, enabled by Semantic Web technology. This infrastructure may later be deployed for research communities in other neuromedical disorders. SWAN incorporates the full biomedical research knowledge lifecycle in its ontological model, including support for personal data organization, hypothesis generation, experimentation, laboratory data organization, and digital pre-publication collaboration. Community, laboratory, and personal digital resources may all be organized and interconnected using SWAN’s common semantic framework.


Introduction
Neurodegenerative diseases are highly complex disorders.
Researchers over the past 20 years have made significant progress in understanding Alzheimer disease and related neurological disorders. They have produced an abundance of data implicating diverse biological mechanisms in the etiology of such diseases. These include genes, environmental risk factors, changes in cell functions, DNA damage, accumulation of misfolded proteins, cell death, immune responses, changes related to aging, reduced regenerative capacity, and others. Yet there is still no clear agreement on the etiology of AD. Citation analysis from the Alzheimer Research Forum estimates that there are more than 40,000 citations in the PubMed database of relevance to neurodegenerative diseases, and 150-200 new studies are published each week.
The challenge of integrating so much data into testable hypotheses and unified concepts is clearly formidable. Researchers must strive to formulate testable hypotheses built on a corpus of research derived from multiple experimental modalities within many subfields of biomedicine and related areas, in all of which it is impossible to be expert simultaneously. The situations for Parkinson's, Huntington's, and ALS researchers are similar.

SWAN is an attempt to develop a practical, common, semantically-structured, web-compatible framework for scientific discourse using Semantic Web technology [1][2][3] applied to the problems of integrating multimodal scientific discourse, in the search for a cure for Alzheimer disease. The initial concept for SWAN was proposed in a talk at the W3C Semantic Web in Life Sciences workshop, October 2004 [4].

SWAN is intended to operate at the individual and community levels, enabling a system of interoperable personal and community knowledge bases. Individuals will use SWAN software as a personal tool to find, filter, and organize information. At the community level, the same software and the same ontological framework can be used to organize and curate the research of [...] way, knowledge and discourse can be organized on a community website, a laboratory website, or a personal computer in mutually interoperable schemas.

Footnote 2: Available at http://purl.org/swan/0.1. The trailing slash is significant. Also, depending upon how they deal with content types, some browsers may require a "view source" operation to see the RDF.

Three Supporting System Use Cases further specify the primary use case:
• Organize and annotate digital scientific resources as integrated KBs across content types, using multiple ontologies.
• Securely share digital scientific resources including the ontologies and annotation generated in Use Case 1, from individuals to diverse communities and back again.
• Provide integrated access to digital scientific resources for a single scientist, a single community, or multiple communities, as a distributed knowledgebase, organized by the structures specified in Use Case 1.

Discussion
Biomedical researchers engage in certain typical patterns of activity in keeping up with the literature, developing hypotheses, planning research, applying for grants, analyzing data, and preparing for publication. These activities are common to the vast majority of researchers. They include:
• Searching, reading, and thinking critically about the professional literature in their field.
• Finding possible connections amongst disparate data, creating a plausible explanatory "story" or model which can bridge gaps or open challenges in the existing body of knowledge.
• Formulating testable hypotheses consistent with the "story" or explanatory model.
• Designing experiments to test their hypotheses.
• Running the experiments.
• Collecting and analyzing experimental data.
• Interpreting data, e.g. by modifying the hypothesis or connecting it to other findings or hypotheses.
• Organizing personal collections of publications and related documents according to a relevant conceptual system to enable retrieval at a later date.
• Applying for grants to support their work (which typically involves presenting the model, hypotheses, and preliminary data).
• Communicating with other researchers, funding agencies, publishers, conference organizers, and local institutional management.
• Writing scientific articles for publication, preparing conference presentations, informal talks, and poster sessions.
Many of these activities are currently supported by public or private information systems, ranging from Google® to personal Excel® spreadsheets and personal bibliographic managers such as EndNote®. However, these tools all have their shortcomings from the knowledge ecosystem view, because they lack semantic constructs connecting the personal, community, and science-wide realms of discourse. Because digital resources in these spaces are largely organized using incompatible knowledge schemas, contextual information in the knowledge ecosystem is continually lost as it passes through human beings navigating point-and-click interfaces.
A public ontology is required for scientific communication: it establishes the terms of discourse. Biologists have been developing ontologies since at least the time of Aristotle. Private ontologies, inherently modifiable without discussion, [...] Clearly it is essential to incorporate shared public concepts and relationships into the organizational scheme, while also providing for personal differences or discoveries to be modeled and declared. What we are after here, from the viewpoint of the philosophy of science, is a formal way to represent potentially incompatible scientific models, which does not also force them to become incommensurable. To do this we require some public bridging ontology. In SWAN this is an ontology of reasoning and discourse.
Visser et al. discuss the problem of heterogeneous ontologies as barriers to system interoperability of varying severity [6] and describe approaches to allowing heterogeneous ontologies to communicate within a distributed system. This is essentially our problem, and we adopt an approach largely consistent with two of their proposed solutions: (1) domain partitioning and (2) alternative domain views [7]. We will limit ontology mismatches to what Visser and Cui call content heterogeneity across a core set of structures.

Formally, SWAN adopts what Hausser calls the "+constructive" response in ontological model theory: in our ontological model, "the model-structure is part of the speaker-hearer" [8].
We recognize the act of cognition as seated in individuals practicing a scientific discipline in the material world... and make it part of our semantics. A significant part of this discipline is represented by scientific discourse. Hausser associates the [+constructive] interpretation particularly with the goal of analyzing language meaning, as opposed to the [−constructive] response, whose goal is "to characterize truth" and which he associates (exclusively) with science and mathematics. However, we do not make such a dichotomy. At least in biomedicine, discourse is not restricted to absolute propositions in which the author and context are either absent from the scene, or irrelevant to validation.

The [+constructive] model is in many ways implicit in bibliographic databases. GenBank [9] long ago moved from a data model in which a consensus sequence was maintained, as "absolute truth", to a model accepting and publishing the varying experimental results of each researcher. This model therefore recognizes the speaker... but the hearer remains implicit. An explicit treatment of the hearer allows a collaboration network to be established.

Publication is a prominent part of the scientific discourse. Our notion is to join it with the supporting reasoning and evidentiary data in a knowledge schema. A conceptual model of knowledge acquisition and publication by an individual scientist is shown in Fig. 1. Documents (or evidence), and assertions upon documents, are fundamental objects in our system. Document assertions connect the discourse to its foundations, and concern the document characteristics, provenance, content, statements about the documents, categorization of the documents, and relationships to other documents and assertions.

We are not attempting to construct a formal computational language of biology. What we are attempting in our ontology is to increase the interoperability across various models specified in text, through establishing improved connections among documents and assertions about them.

[...] (e.g. Alzheimer's Research Forum), collaborator information, previously published and non-published data (this may be a problem due to copyright issues), and detailed methods, including specifics on reagents (which can be a non-trivial issue). This additional information would give the paper multiple dimensions by embedding this associated information within the paper (when opened electronically) and/or providing links to other information that is too large to embed. This concept is an expansion of the orange to green transition seen in the right-hand portion of Fig. 1. Clearly, not all the information under the "Private knowledge" space is transmitted in the publication process, for many reasons, including the motivation and the ability to collect this information in a standardized way. If a researcher is collecting this additional information in a software program during the building of a "Private hypothesis" (Fig. 1, top half), knowing that it will be used for their publication (bottom half), then it will provide strong motivation for its use. Additionally, if the data structure becomes a standard way to relay information to other researchers, investigators will support its use (e.g. Word or Excel documents).
Publishing is one of the major factors motivating researchers, because it is closely tied to securing funding and promotion. Publications are a snapshot of an individual's thoughts and experiments, and of the evolution of scientific thought as a whole. As indicated in the bottom of Fig. 1, time is the X-axis. The process depicted here represents a unit of time (although variable) which repeats itself many times over a scientist's life. Often what is lost in this process is how these units became connected and any information that never made it to publication. This could be due to lack of time or funding, technical problems, an incorrect hypothesis, or lack of acceptance by the scientific community for a certain line of reasoning. Much of this information is kept as "Private knowledge" cloistered in notebooks or the archives of the brain.
Providing a platform to document ideas that succeeded (i.e. were published), failed, or were never evaluated has significant scientific value, allowing current or future generations to extend, avoid, or develop these ideas. Such a model could either have a historical perspective built on years of accumulated knowledge or be a de novo idea based on a new observation.
An immediate example of this program's value could be seen in a student-teacher relationship, in transmitting the teacher's view of a particular subject to a naive student. If the student wants to understand this view, it would be useful if he or she were able to see a model of this hypothesis containing all the information gathered together to support this idea. This project has the potential to build a program that would allow the collection of thoughts, data, and experiences over a lifetime, creating a scientific life history. Most of this data will be collected in the "Private knowledge" space, but is built on the Publication Model described above.
A significant question is, when will one allow their private world to become public? At a minimum, scientists would be inclined to release this "Private knowledge" at the end of their scientific careers. Nonetheless, without the effort to collect this highly valuable knowledge it is doomed to be lost forever. Additionally, some of the payoff of the collection of this "Private knowledge" would not always be immediate, but would be the beginning of a knowledge base that would grow, benefiting future generations. These two models are not mutually exclusive, but in fact are intertwined because the "Publication Model" is an element repeated over time giving a "Scientific Life History." The value of collecting this information should not be underestimated and, to our knowledge, this has not been done in a systematic manner that would be searchable.

The SWAN pilot
The SWAN pilot project has three major components, which are intended to work together as an integrated whole.

The SWAN ontology permits knowledge content from multiple stages of the scientific discovery life-cycle to be represented in the W3C Resource Description Framework (RDF), in a way that can support electronic pre-publication group sharing and collaboration, as well as personal and community knowledge base construction. The current version of this early schema (Clark, Gao et al. [10]) can be persistently referenced on the web [...]

Evidentiary Citations are used in asserting that some Assertion is evidence supporting a Hypothesis, Claim, or other Assertion. Inclusive Citations are used to specify the Assertions which belong to a Collection. Referencing Citations are used wherever a reference to something is made for a purpose other than those previously described.
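The three citation roles described above can be illustrated with a small sketch. This is not the SWAN schema itself: the class names, the `subject` and `target` fields, and the string identifiers are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class CitationKind(Enum):
    """The three citation roles named in the text."""
    EVIDENTIARY = "evidentiary"   # subject is evidence supporting the target
    INCLUSIVE = "inclusive"       # subject belongs to the target Collection
    REFERENCING = "referencing"   # any other reference


@dataclass(frozen=True)
class Citation:
    kind: CitationKind
    subject: str  # hypothetical identifier of the cited Assertion
    target: str   # hypothetical identifier of what it supports, joins, or refers to


def evidence_for(target, citations):
    """Return the Assertions cited as evidence for a Hypothesis, Claim, or Assertion."""
    return [c.subject for c in citations
            if c.kind is CitationKind.EVIDENTIARY and c.target == target]


def members_of(collection, citations):
    """Return the Assertions placed in a Collection via Inclusive Citations."""
    return [c.subject for c in citations
            if c.kind is CitationKind.INCLUSIVE and c.target == collection]


# Illustrative data only; the identifiers are placeholders.
citations = [
    Citation(CitationKind.EVIDENTIARY, "claim:1", "hypothesis:1"),
    Citation(CitationKind.EVIDENTIARY, "claim:2", "hypothesis:1"),
    Citation(CitationKind.INCLUSIVE, "claim:1", "collection:1"),
    Citation(CitationKind.REFERENCING, "claim:2", "document:7"),
]

print(evidence_for("hypothesis:1", citations))  # ['claim:1', 'claim:2']
print(members_of("collection:1", citations))    # ['claim:1']
```

Treating each citation as a typed link, rather than as a field on the Assertion, is one way to let the same Assertion serve as evidence in one place and as a Collection member in another.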
Annotation may be structured or unstructured. Structured annotation means attaching a Concept (tag or term) to an Assertion. Unstructured annotation means attaching free text. Assertions may be imported from Alzforum, PubMed, EndNote bibliographies previously exported in XML, RDF N3 serialization, and from other SWAN-RDF stores, using SwIM. Assertions may also be exported in RDF or in EndNote-compatible XML. SWAN Assertions may be organized by placing them in a Collection. SWAN uses a speaker-hearer core ontological model. Therefore, Persons and Groups need to be defined as sources and targets of discourse for each Assertion. Groups are named collections of Persons. Persons are a subclass of Group containing only a single Person.
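A minimal sketch of the speaker-hearer model and the two kinds of annotation follows, under stated assumptions: the Python class layout, the `annotate` method, and the field names are hypothetical and do not reproduce the SWAN RDF schema.

```python
from dataclasses import dataclass, field


class Group:
    """A named collection of Persons."""

    def __init__(self, name, members=()):
        self.name = name
        self.members = set(members)


class Person(Group):
    """A Group whose only member is the Person itself."""

    def __init__(self, name):
        super().__init__(name)
        self.members = {self}


@dataclass(frozen=True)
class Concept:
    """A node in a controlled vocabulary (a tag or term)."""
    vocabulary: str
    term: str


@dataclass
class Assertion:
    text: str
    source: Group   # the speaker
    target: Group   # the hearer
    concepts: list = field(default_factory=list)  # structured annotation
    notes: list = field(default_factory=list)     # unstructured annotation

    def annotate(self, annotation):
        """Attach a Concept (structured) or free text (unstructured)."""
        if isinstance(annotation, Concept):
            self.concepts.append(annotation)
        else:
            self.notes.append(str(annotation))


# Illustrative usage with placeholder names and text.
researcher = Person("Researcher A")
lab = Group("Example Lab", [researcher])
assertion = Assertion("Placeholder assertion text", source=researcher, target=lab)
assertion.annotate(Concept("MeSH", "Alzheimer Disease"))
assertion.annotate("Free-text note about this assertion")
```

Making `Person` a subclass of `Group` means any Assertion can name either an individual or a community as its source or target without a special case.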
Concepts are nodes in controlled vocabularies, which may also be hierarchical (taxonomies). Concepts natively supported include special Alzforum categories, MeSH terms, and Gene Ontology (Harris et al. [12]) categories. Genes and Pro-

Fig. 2 is a conceptual sketch of the relationship of scientific hypotheses, public and private ontologies, and documents. We believe that a successful knowledge infrastructure needs to support these relationships with special emphasis on public, private, [...]

Fig. 3. SWAN semantic relationships.

Y. Gao et al. / Web Semantics: Science, Services and Agents on the World Wide Web xxx (2006) xxx-xxx

[...] for re-use by other applications. Fig. 3 gives an example of how the schema instantiates a Hypothesis with supporting Claims and evidence, combining public (community) and private information.
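To make the idea of a Hypothesis with supporting Claims expressed in RDF more concrete, the sketch below builds a few triples and serializes them as N-Triples using only the Python standard library. The namespace `http://example.org/swan-sketch#` and the property names `supports` and `visibility` are hypothetical placeholders, not terms from the published SWAN schema.

```python
# Hypothetical namespace for illustration; not the SWAN schema URI.
EX = "http://example.org/swan-sketch#"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def iri(local_name):
    """Wrap a local name in the sketch namespace as an N-Triples IRI."""
    return f"<{EX}{local_name}>"


def literal(text):
    """Encode a plain string literal, escaping backslash, quote, and newlines."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n")
                   .replace("\r", "\\r"))
    return f'"{escaped}"'


# One Hypothesis supported by a public Claim and a private Claim.
triples = [
    (iri("hypothesis1"), RDF_TYPE, iri("Hypothesis")),
    (iri("claim1"), RDF_TYPE, iri("Claim")),
    (iri("claim1"), iri("supports"), iri("hypothesis1")),
    (iri("claim1"), iri("visibility"), literal("public")),
    (iri("claim2"), RDF_TYPE, iri("Claim")),
    (iri("claim2"), iri("supports"), iri("hypothesis1")),
    (iri("claim2"), iri("visibility"), literal("private")),
]


def to_ntriples(rows):
    """Serialize (subject, predicate, object) rows, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in rows)


print(to_ntriples(triples))
```

Because public and private statements share one graph and differ only in a property value, a personal knowledge base could in principle export the public subset by filtering on that property.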